77% of bench tests, 16% of clinical trials: what the Hua framework reveals about LLMs in mental health
Out of 160 studies reviewed, LLMs account for 77% of bench tests but only 16% of clinical trials. A Harvard team proposes a three-tier framework to clarify what studies actually prove — and what they don't.
Source analyzed
https://onlinelibrary.wiley.com/doi/10.1002/wps.21352

The number that should worry us
In January 2025, a Harvard team led by Yining Hua and John Torous published a systematic review of 160 studies on AI chatbots in mental health (2020-2024) in World Psychiatry, the most cited psychiatry journal in the world.
Their main finding fits in a single table:
| Evidence tier | Rule-based systems | Traditional ML | LLMs |
|---|---|---|---|
| T1 — Bench testing | 8% | 15% | 77% |
| T2 — Feasibility | 58% | 18% | 24% |
| T3 — Clinical efficacy | 65% | 19% | 16% |
Each percentage is the share of studies at that tier using that architecture; every row sums to 100%.
In other words: the most recent and most hyped technologies are paradoxically the least tested in real clinical conditions. Rule-based systems — older and less spectacular — are the ones with the strongest evidence base.
“Good performance in T1 or positive feedback in T2 does not necessarily translate into T3 clinical efficacy.”
— Hua et al. (2025), World Psychiatry
This is the paradox that the Hua framework makes visible — and helps explain.
The framework in 5 minutes
The core idea
Not all studies on AI in mental health answer the same question. But public debate treats them as if they do. A benchmark on standardized vignettes and a randomized clinical trial with real patients are cited with equal authority — even though they prove very different things.
The Hua framework proposes a classification grid with three progressive tiers, each answering a different question:
The three tiers
Bench Testing (T1) — “Does the AI work technically?”
Evaluation under controlled conditions: scripted scenarios, standardized vignettes, expert assessments. No interaction with real patients. The system is tested on ideal cases, in an ideal environment.
Examples: medical benchmarks (MedQA, USMLE), response quality evaluation by clinicians, safety testing on scripted cases.
Feasibility (T2) — “Will users actually engage with the system?”
Evaluation with human participants on short-term interactions. Measures engagement, satisfaction, perceived quality — but not clinical outcomes. A user can be satisfied with a chatbot without their health improving — a point also made by the APA App Evaluation Model, which explicitly distinguishes user satisfaction from clinical efficacy.
Examples: satisfaction surveys, engagement metrics (usage duration, completion rates), qualitative assessments.
Clinical Efficacy (T3) — “Does the system actually improve patients’ health?”
Measurement of clinically meaningful outcomes: symptom reduction on validated scales (PHQ-9, GAD-7, BDI-II), over extended periods, with real patients. This is the only tier that demonstrates a genuine therapeutic benefit.
Examples: randomized controlled trials with longitudinal follow-up, comparative studies with active treatment.
The LLM paradox
It is the intersection of this classification with the technical architectures that reveals the paradox. The three main chatbot families — rule-based systems (scripts, decision trees), traditional ML (models such as SVMs or BERT-based classifiers), and LLMs (GPT-4, Claude, Gemini) — are not distributed evenly across the three tiers.
The paradox in one sentence: the more recent and hyped a technology is, the less it has been tested in real clinical conditions. LLMs dominate bench testing (77% in T1) but are nearly absent from clinical trials (16% in T3). Rule-based systems do exactly the opposite.
This is no accident. Benchmarks (T1) are fast, inexpensive, and produce spectacular results publishable in high-impact journals. Clinical trials (T3) take months, cost significantly more, require ethics committee approval, and often produce more nuanced results. The structure of academic incentives favors T1 at the expense of T3.
Applying the filter: three studies under the microscope
Let’s apply the framework to three frequently cited studies to illustrate what each tier of evidence actually means in practice:
| Study | Tier | Why |
|---|---|---|
| Ayers et al. (JAMA Internal Medicine, 2023), “ChatGPT outperforms physicians in empathy” | T1 | Third-party evaluators rate text responses. No real interaction, no patients, no follow-up. |
| Bean et al. (Nature Medicine, 2025), “LLMs are reliable medical assistants” | T1/T2 | Participants recruited via Prolific (not real patients), standardized vignettes, binary scoring. No clinical outcomes. |
| Heinz/Therabot (NEJM AI, 2025), CBT therapeutic chatbot | T3* | Real patients, measured clinical outcomes. But waitlist comparator (no active treatment), which inflates the effect size. |
The exercise is revealing. The first two studies — the most widely cited in the press — sit at the lowest level of the evidence hierarchy. They show that LLMs work technically, not that they improve patients’ health.
What this means for your practice
The question to ask systematically
When you read a study claiming that “AI is effective in mental health,” a single question immediately situates its level of evidence:
“Does this study measure clinical outcomes (PHQ-9, GAD-7, BDI-II…) in real patients, over a meaningful period?”
If yes → T3 (but check the comparator and follow-up duration).
If no → T1 or T2 — clinical efficacy is not established, regardless of the headline results.
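For readers who like to see the filter written out, here is a minimal sketch in Python. It is not code from Hua et al.; the StudyNotes fields, the 8-week follow-up threshold, and the tier-assignment rules are illustrative assumptions that simply encode the question above.

```python
# Purely illustrative sketch of the T1/T2/T3 reading filter -- not code from Hua et al.
# Field names and the 8-week threshold are placeholders chosen for this example.
from dataclasses import dataclass

@dataclass
class StudyNotes:
    real_patients: bool            # clinical population, not vignettes or crowdworkers
    clinical_outcomes: bool        # validated scales such as PHQ-9, GAD-7, BDI-II
    follow_up_weeks: int           # length of follow-up
    human_users_interact: bool     # did human participants actually use the system?
    active_comparator: bool        # compared against a real treatment rather than a waitlist

def evidence_tier(s: StudyNotes, min_follow_up_weeks: int = 8) -> str:
    """Rough T1/T2/T3 label for a study, following the question above."""
    if s.real_patients and s.clinical_outcomes and s.follow_up_weeks >= min_follow_up_weeks:
        # T3 -- but the comparator still determines how much the result is worth
        return "T3" if s.active_comparator else "T3 (waitlist comparator: interpret with caution)"
    if s.human_users_interact:
        return "T2"  # engagement or satisfaction data, no demonstrated clinical benefit
    return "T1"      # bench testing only: benchmarks, vignettes, expert ratings

# Ayers-style study: raters score text responses, no users, no patients -> T1
print(evidence_tier(StudyNotes(False, False, 0, False, False)))
# Therabot-style trial: real patients, clinical outcomes, weeks of follow-up, waitlist control
print(evidence_tier(StudyNotes(True, True, 8, True, False)))
```

The order of the checks is the whole point: clinical outcomes in real patients over a meaningful period come first; everything else stays at T1 or T2, whatever the headline says.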
Connection with the Choudhury framework
This framework combines powerfully with the Choudhury framework on the ecological validity of AI in clinical settings:
- Hua tells you where a study sits in the validation pathway (T1, T2, or T3)
- Choudhury tells you why moving from one tier to the next is anything but automatic — and which human factors (trust, cognitive load, accountability) prevent it
Together, these two frameworks turn a vague intuition (“these studies don’t reflect reality”) into a structured, actionable reading grid.
What holds up well in this proposal
Massive empirical base
160 studies analyzed over 2020-2024. This is not an armchair theoretical framework — it is a classification that emerged from the actual literature, published in the most influential psychiatry journal.
Operational simplicity
Three tiers, one question per tier. A practitioner can apply this grid in 30 seconds to any study. It is an immediate cognitive filter, not a complex analysis tool.
It quantifies a suspicion
Many clinicians suspected that LLM studies were “less rigorous.” The Hua framework turns that suspicion into hard data: 77% in T1, 16% in T3. The imbalance is not an impression — it is a documented fact.
It cross-references architecture with validation
By distinguishing rule-based systems, traditional ML, and LLMs, the framework avoids the trap of treating “AI” as a monolithic block — a point we highlighted in our article on the distinctions between AI, chatbot, LLM, and app.
The limitations — and why they matter
No quality assessment within a tier
The framework classifies studies by level of evidence but does not distinguish methodological quality within a given tier. A T3 study with an active comparator (a real alternative treatment) and a T3 study with a waitlist comparator are placed at the same level — even though their probative value is vastly different.
Concrete case: the Therabot study (NEJM AI, 2025) is T3 — but its comparator is a waitlist. Comparing a chatbot to “doing nothing” all but guarantees a positive effect. This is not the same as demonstrating efficacy comparable to human therapy.
No transition criteria
The framework states that T1, T2, and T3 are progressive, but does not specify when a system is “ready” to move to the next tier. What T1 criteria must be met before launching a T2 study? The framework is silent on this — it is a roadmap without milestones.
Focused on mental health chatbots
The framework was developed specifically for AI chatbots in mental health. Its applicability to other forms of clinical AI — diagnostic imaging, medical triage, session transcription — remains to be demonstrated. The validation challenges differ depending on the type of tool and the clinical decision involved.
Simplified architectural classification
The rule-based / ML / LLM distinction does not capture hybrid systems that combine multiple approaches — for example, an LLM governed by decision trees (like Therabot or Woebot in its latest iteration). These hybrids blur the boundaries between categories.
Our take
The Hua framework is an essential sorting tool in a research landscape saturated with spectacular announcements. Its main contribution: giving every clinician a simple filter to distinguish what is proven from what is merely promising.
Stop conflating technical performance with therapeutic efficacy
That an LLM passes a medical exam (T1) or that a user reports satisfaction after a conversation (T2) does not mean the system improves patients’ health (T3). These three claims correspond to three radically different levels of evidence. Conflating them is like confusing a closed-circuit driving test with the ability to drive in city traffic.
The evidence gap is not a flaw of the technology — it is a flaw of the research
LLMs are not incapable of helping patients. They are insufficiently tested under conditions that would allow this to be demonstrated. The structure of academic incentives — publish quickly, in high-impact journals — favors fast T1 studies at the expense of costly but necessary T3 trials.
Use this framework as a daily reading filter
Next time a colleague, an administrator, or a journalist tells you that “AI is effective in mental health,” ask the T1/T2/T3 question. If the answer is T1 — which it will be most of the time for LLMs — you will know exactly what that study proves and what it doesn’t.
Source analyzed: Hua, Y., Siddals, S., Torous, J. et al. (2025). Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: a systematic review. World Psychiatry, 24(2). https://doi.org/10.1002/wps.21352
Further reading:
- Choudhury, A. (2022). Toward an Ecologically Valid Conceptual Framework for the Use of Artificial Intelligence in Clinical Settings. JMIR Human Factors, 9(2), e35421. https://doi.org/10.2196/35421
- Our analysis of the Choudhury framework: Why an AI that “outperforms doctors” in the lab can fail in the clinic
- Our article on terminological distinctions: AI, chatbot, LLM, app: why we need to stop conflating them
- Our analysis of the APA App Evaluation Model: A 3-level filter for evaluating mental health apps
Series: AI Evaluation Frameworks in Healthcare
- Hua: three tiers of evidence for AI in mental health (this article)
- Choudhury: ecological validity of LLM studies
- CHART: health chatbot transparency
- CONSORT-AI: transparency of AI clinical trials
- CONSORT/SPIRIT 2025: Open Science yes, AI no
- PROBAST+AI: quality of AI prediction models