Fewer than 40% of health chatbot studies report their prompting strategy: the CHART Statement
Out of 137 studies published in the year following ChatGPT's launch, fewer than 40% report key elements of their prompting strategy. An international consortium of 531 experts proposes 12 criteria to fix this — and change how we read these studies.
Source analyzed
https://pmc.ncbi.nlm.nih.gov/articles/PMC12320030/

The problem nobody was seeing
Imagine a drug trial that specifies neither the drug name, nor the dosage, nor the dates of administration. Unthinkable? Yet this is effectively what happens in the majority of health chatbot studies.
In August 2025, the CHART Collaborative, an international consortium of over 50 researchers led by Bright Huo (McMaster University), published the results of an alarming audit simultaneously across six journals (JAMA Network Open, BMJ Medicine, BJS, BMC Medicine, Annals of Family Medicine, Artificial Intelligence in Medicine):
“Fewer than 40% of articles report key elements of their prompting strategy.”
— Huo et al. (2025), CHART Statement, systematic review of 137 studies
In other words: the majority of studies claiming that “AI outperforms doctors” or that “LLMs are reliable assistants” do not provide enough information for anyone to reproduce their results — or even verify what they actually measured.
What the systematic review found
The team screened 7,752 articles to identify 137 eligible studies, all published within the year following ChatGPT’s launch (November 2022). The findings are damning:
| Reporting element | Reported? | Consequence |
|---|---|---|
| Complete prompting strategy | < 40% | No way to know what was actually asked of the AI |
| Raw prompts used | Rare | Impossible to reproduce the experiment |
| Complete model responses | Rare | Impossible to verify the authors’ evaluation |
| Precise model identification | Insufficient | Unknown which model was actually tested |
| Dates of queries | Insufficient | The same model produces different results from one month to the next |
This is not a minor problem. It is the equivalent of publishing a drug trial by saying “we gave a medication to patients and it worked” — without specifying which drug, at what dose, for how long, or how you measured the outcome.
The CHART Statement: 12 criteria to fix this
To address this problem, the CHART Collaborative developed a checklist of 12 items and 39 sub-items through a rigorous process: asynchronous Delphi with 531 international stakeholders (clinicians, methodologists, AI researchers, journal editors, ethicists), 3 synchronous consensus meetings with 48 experts, and iterative pilot testing.
Here are the most revealing items for a clinician:
Model identifiers
Name, version, update date, open-source or proprietary status. Because writing “we used ChatGPT” is not enough — the performance gap between GPT-3.5 and GPT-4o is enormous. And the same model behaves differently depending on its version and access date.
Prompt engineering
How were the prompts developed? By whom? How many people were involved? Did patients participate in their design? And most importantly: publish the actual prompts used. This is the equivalent of publishing a clinical trial protocol — without it, nothing is verifiable.
Query strategy
Access route to the model (API, web interface, app), precise dates and locations of queries (day/month/year + city/country), separate or continuous sessions, and — crucially — all chatbot responses, not just those that support the argument.
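To make this item concrete, here is a minimal sketch of what logging a single query with its full context could look like. The field names are our own illustration, not CHART's wording:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class QueryRecord:
    """One chatbot query, logged with the context the query-strategy
    item asks for. Field names are illustrative, not CHART's."""
    prompt: str
    response: str        # the complete response, not an excerpt
    access_route: str    # "API", "web interface", or "app"
    queried_at: datetime # day/month/year of the query
    location: str        # city/country where the query was run
    session_id: str      # distinguishes separate vs continuous sessions

log = [
    QueryRecord(
        prompt="A patient reports persistent insomnia. What do you advise?",
        response="(full model response recorded verbatim)",
        access_route="API",
        queried_at=datetime(2025, 3, 15, 14, 30),
        location="Montreal, Canada",
        session_id="session-001",
    ),
]
# Publishing the whole log -- every prompt, every response -- is what
# makes the experiment reproducible and the evaluation verifiable.
```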
Performance evaluation
What is the reference standard (ground truth)? How many evaluators? What are their qualifications? Were they blinded? Three students rating ChatGPT’s responses without knowing whether they come from a human or an AI is not the same as three senior psychiatrists evaluating in a blinded design.
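As an illustration of why the number and independence of evaluators matter, here is a minimal sketch computing pairwise agreement between raters; the ratings are invented for the example:

```python
from itertools import combinations

# Hypothetical ratings: three blinded evaluators scoring ten responses
# as acceptable (1) or not (0). A study should report who rated, their
# qualifications, and whether they were blinded.
ratings = {
    "rater_A": [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    "rater_B": [1, 1, 0, 1, 0, 0, 1, 1, 1, 0],
    "rater_C": [1, 0, 0, 1, 1, 0, 1, 1, 0, 0],
}

def pairwise_agreement(a: list[int], b: list[int]) -> float:
    """Proportion of items on which two raters gave the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    print(f"{name_a} vs {name_b}: {pairwise_agreement(a, b):.0%} agreement")
```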
Results: bias and harm
Beyond overall performance, the study must explicitly assess potentially harmful, biased, or misleading responses. This is the most important item for mental health: a single dangerous response to a suicidal patient matters more than an average accuracy score of 85%.
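Taken together, these items lend themselves to a simple screening structure. The sketch below encodes the four items highlighted in this article (3, 5, 6, 10) with paraphrased labels and illustrative sub-items; it is a reading aid, not the official checklist:

```python
# Paraphrased labels and illustrative sub-items for the CHART items
# highlighted in this article -- not the official checklist wording.
CHART_SKETCH = {
    3: ("Model identifiers",
        ["name", "version", "update date", "open-source or proprietary"]),
    5: ("Prompt engineering",
        ["development process", "contributors", "patient involvement",
         "actual prompts published"]),
    6: ("Query strategy",
        ["access route (API, web, app)", "dates and locations",
         "session handling", "all responses reported"]),
    10: ("Bias and harm",
         ["harmful, biased, or misleading responses assessed"]),
}

def completeness(reported: dict[int, set[str]]) -> float:
    """Fraction of the tracked sub-items that a study reports."""
    total = sum(len(subs) for _, subs in CHART_SKETCH.values())
    hit = sum(len(reported.get(item, set()) & set(subs))
              for item, (_, subs) in CHART_SKETCH.items())
    return hit / total

# Example: a study that names the model and publishes its prompts,
# but gives no query dates and never assesses harmful responses.
reported = {3: {"name", "version"}, 5: {"actual prompts published"}}
print(f"{completeness(reported):.0%} of tracked sub-items reported")
```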
Why dates and locations matter
The requirement to specify query dates and locations (Item 6b) may seem bureaucratic. It is not. LLMs are not stable molecules: they change constantly.
The same prompt submitted to GPT-4 in March 2024 and in September 2024 can yield radically different results. OpenAI updates its models regularly, sometimes without notice. If a study does not specify when it queried the model, its results are unreproducible by definition — because the “drug” that was tested no longer exists.
This is a fundamental difference from traditional clinical trials: in pharmacology, the molecule being tested remains the same. In AI, the object of study is a moving target.
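A toy illustration of the moving-target problem: when a study reports only an undated alias, which snapshot was actually tested depends on when the queries were run. The snapshot names and release dates below are invented:

```python
from datetime import date

# Invented release history, for illustration only.
SNAPSHOTS = {
    date(2024, 3, 1): "model-snapshot-A",
    date(2024, 9, 1): "model-snapshot-B",
}

def resolve_alias(query_date: date) -> str:
    """Return the snapshot an undated alias would have pointed to."""
    current = "unreleased"
    for release_date, snapshot in sorted(SNAPSHOTS.items()):
        if release_date <= query_date:
            current = snapshot
    return current

# Same study protocol, same alias, two different models under test:
print(resolve_alias(date(2024, 3, 15)))   # model-snapshot-A
print(resolve_alias(date(2024, 9, 15)))   # model-snapshot-B
```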
How CHART fits with the Hua and Choudhury frameworks
The CHART Statement sits within an ecosystem of complementary frameworks we have analyzed:
| Framework | Question | What it provides |
|---|---|---|
| Hua (T1/T2/T3) | What level of evidence does the study provide? | Distinguishes bench testing, feasibility, and clinical efficacy |
| Choudhury | Why do lab results fail to predict real-world use? | Identifies human factors (trust, cognitive load, accountability) |
| CHART | Is the study transparent enough to be evaluated at all? | Checks that minimum reporting standards are met |
- Hua tells you what level of evidence a study provides (T1, T2, or T3)
- Choudhury tells you why moving from one tier to the next is anything but automatic
- CHART tells you whether the study reports enough information for you to even begin evaluating it
CHART is essentially the preliminary filter: before asking whether a study is T1 or T3, you first need it to provide the basic information required to be interpretable at all.
What holds up well in this proposal
Exemplary development process
531 stakeholders in the Delphi process, 48 experts for consensus meetings, 80% agreement threshold, iterative pilot testing. This is the level of methodological rigor that CHART demands of the studies it evaluates — and applies to itself.
It targets the exact regulatory gap
CONSORT-AI covers AI clinical trials. STROBE covers observational studies. TRIPOD+AI covers predictive models. But no guideline covered chatbot evaluation studies — which represent the vast majority of recent LLM literature in healthcare. CHART fills this gap.
“Living guideline” approach
Twice-yearly updates for the first two years, then annual. A panel of 14 experts maintains continuous literature surveillance with a 90% agreement threshold for any change. Unlike a static standard, CHART acknowledges that its subject — AI chatbots — evolves too fast for a fixed rulebook.
Simultaneous publication in 6 journals
This is a strong signal of institutional legitimacy. When JAMA Network Open, BMJ Medicine, and Annals of Family Medicine publish the same standard on the same day, journal editors take notice. Adoption by journals is the key to a reporting guideline’s real-world impact.
The limitations — and why they matter
Reporting ≠ quality
CHART is a transparency tool, not a methodological quality tool. A study can check all 12 items on the checklist and still be methodologically weak — if the reference standard is poorly chosen, if the sample is too small, or if the conclusions overreach the data. Ticking every box does not guarantee a good study. It only guarantees that the study can be judged.
Clinical analogy: it is the difference between a complete hospital discharge summary and a correct diagnosis. The former is necessary to evaluate the latter, but it does not replace it.
Text-focused, not yet multimodal
CHART was designed for text-based chatbots. But recent models are increasingly multimodal (text + image + audio + video). How do you report an interaction where the patient shows an image to the AI? Where the chatbot analyzes the tone of voice? The framework will need to evolve quickly on this front — and this is precisely why the “living guideline” approach is well-suited.
Does not cover clinical trials
CHART is designed for performance evaluation studies (vignettes, benchmarks, scoring). If a randomized trial tests a therapeutic chatbot with real patients, it must use CONSORT-AI in addition — CHART alone is not sufficient for the clinical trial component. The authors state this themselves: CHART is complementary, not self-sufficient.
The risk of superficial box-ticking
Like any checklist, CHART can be filled out mechanically. The danger: studies that check every item without genuine methodological reflection. “We used the GPT-4o version dated March 15, 2025 ✓” formally satisfies Item 3, but says nothing about why this model was chosen over others, nor about what this choice means for the generalizability of results.
Our take
The CHART Statement is an essential critical reading tool for anyone interested in chatbot studies in healthcare — and particularly in mental health.
Use CHART as an immediate reading filter
When you read a study on a mental health chatbot, check three things first: is the model precisely identified (Item 3)? Are the prompts published (Item 5)? Were harmful responses assessed (Item 10)? If the answer to any of these is no, the study’s value is unverifiable — regardless of the journal it was published in.
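This three-question filter is simple enough to write down directly. A minimal sketch, using the item numbers cited above:

```python
def chart_triage(model_identified: bool,
                 prompts_published: bool,
                 harms_assessed: bool) -> list[str]:
    """Three-question reading filter: returns the reasons a study's
    value is unverifiable, or an empty list if it clears the filter."""
    failures = []
    if not model_identified:
        failures.append("Item 3: model not precisely identified")
    if not prompts_published:
        failures.append("Item 5: prompts not published")
    if not harms_assessed:
        failures.append("Item 10: harmful responses not assessed")
    return failures

# Example: a study that names the model but publishes neither prompts
# nor a harm assessment fails the filter on two counts.
for reason in chart_triage(True, False, False):
    print("Unverifiable:", reason)
```

An empty list only means the study clears the minimum bar for interpretability; as noted above, it says nothing about its quality.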
Combine the three frameworks for a complete reading
CHART first: is the study interpretable? Then Hua: what is its level of evidence (T1, T2, or T3)? Then Choudhury: what human factors limit it? In three minutes, you have a structured assessment that goes far beyond reading the abstract and the authors’ conclusions.
Transparency is the minimum, not the maximum
CHART does not solve every problem. Perfect reporting does not compensate for the absence of real patients, the lack of ecological validity, or the confusion between technical performance and therapeutic efficacy. But without methodological transparency, there is no scientific conversation to be had. It is the foundation — not the ceiling — of rigor.
Source analyzed: Huo, B., et al. (2025). Reporting Guideline for Chatbot Health Advice Studies: The CHART Statement. JAMA Network Open, BMJ Medicine, BJS, BMC Medicine, Annals of Family Medicine, Artificial Intelligence in Medicine. DOI: 10.1136/bmjmed-2025-001632
Further reading:
- Huo, B., et al. (2024). Protocol for the development of the Chatbot Assessment Reporting Tool (CHART) for clinical advice. BMJ Open, 14(5), e081155. https://doi.org/10.1136/bmjopen-2023-081155
- Our analysis of the Hua framework: 77% of LLM studies in mental health never get past the bench test stage
- Our analysis of the Choudhury framework: Why an AI that “outperforms doctors” in the lab can fail in the clinic
- Our article on terminological distinctions: AI, chatbot, LLM, app: why we need to stop conflating them
- EQUATOR Network: CHART Statement
Series: AI Evaluation Frameworks in Healthcare