AI Watch

Fewer than 40% of health chatbot studies report their prompting strategy: the CHART Statement

| Matthieu Ferry ⇄ AI

Out of 137 studies published in the year following ChatGPT's launch, fewer than 40% report key elements of their prompting strategy. An international consortium of 531 experts proposes 12 criteria to fix this — and change how we read these studies.

The problem nobody was seeing

Imagine a drug trial that specifies neither the drug name, nor the dosage, nor the dates of administration. Unthinkable? Yet this is effectively what happens in the majority of health chatbot studies.

In August 2025, the CHART Collaborative — an international consortium of over 50 researchers led by Bright Huo (McMaster University) — published simultaneously across six journals (JAMA Network Open, BMJ Medicine, BJS, BMC Medicine, Annals of Family Medicine, Artificial Intelligence in Medicine) the results of an alarming audit:

“Fewer than 40% of articles report key elements of their prompting strategy.”

— Huo et al. (2025), CHART Statement, systematic review of 137 studies

In other words: the majority of studies claiming that “AI outperforms doctors” or that “LLMs are reliable assistants” do not provide enough information for anyone to reproduce their results — or even verify what they actually measured.


What the systematic review found

The team screened 7,752 articles to identify 137 eligible studies, all published within the year following ChatGPT’s launch (November 2022). The findings are damning:

| Reporting element | Reported? | Consequence |
| --- | --- | --- |
| Complete prompting strategy | < 40% | No way to know what was actually asked of the AI |
| Raw prompts used | Rare | Impossible to reproduce the experiment |
| Complete model responses | Rare | Impossible to verify the authors’ evaluation |
| Precise model identification | Insufficient | Unknown which model was actually tested |
| Dates of queries | Insufficient | The same model produces different results from one month to the next |

This is not a minor problem. It is the equivalent of publishing a drug trial by saying “we gave a medication to patients and it worked” — without specifying which drug, at what dose, for how long, or how you measured the outcome.


The CHART Statement: 12 criteria to fix this

To address this problem, the CHART Collaborative developed a checklist of 12 items and 39 sub-items through a rigorous process: asynchronous Delphi with 531 international stakeholders (clinicians, methodologists, AI researchers, journal editors, ethicists), 3 synchronous consensus meetings with 48 experts, and iterative pilot testing.

Here are the items most revealing for a clinician:

Item 3: Model identifiers

Name, version, update date, open-source or proprietary status. Because writing “we used ChatGPT” is not enough — the performance gap between GPT-3.5 and GPT-4o is enormous. And the same model behaves differently depending on its version and access date.

Item 5: Prompt engineering

How were the prompts developed? By whom? How many people were involved? Did patients participate in their design? And most importantly: publish the actual prompts used. This is the equivalent of publishing a clinical trial protocol — without it, nothing is verifiable.

Item 6: Query strategy

Access route to the model (API, web interface, app), precise dates and locations of queries (day/month/year + city/country), separate or continuous sessions, and — crucially — all chatbot responses, not just those that support the argument.
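As a minimal sketch of what Item 6 asks authors to capture, the metadata can be bundled with every exchange at query time rather than reconstructed afterwards. The structure below is an assumption for illustration (the field names are not prescribed by CHART; CHART specifies what must be reported, not how to store it):

```python
# Sketch: bundling each chatbot exchange with the reporting metadata
# that CHART Item 6 asks for. Field names are illustrative, not official.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class QueryRecord:
    prompt: str        # the exact prompt submitted
    response: str      # the complete raw response, not an excerpt
    model: str         # exact version string, e.g. "gpt-4o-2024-08-06"
    access_route: str  # "api", "web", or "app"
    queried_at: str    # full ISO timestamp (day/month/year recoverable)
    location: str      # "city, country"
    session_id: str    # distinguishes separate vs continuous sessions

def record_query(prompt, response, model, access_route, location, session_id):
    """Capture one exchange with its reporting metadata at query time."""
    return QueryRecord(
        prompt=prompt,
        response=response,
        model=model,
        access_route=access_route,
        queried_at=datetime.now(timezone.utc).isoformat(),
        location=location,
        session_id=session_id,
    )
```

Logging all records, including unfavorable responses, is what makes the “all chatbot responses” requirement auditable.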

Item 7: Performance evaluation

What is the reference standard (ground truth)? How many evaluators? What are their qualifications? Were they blinded? Three students rating ChatGPT’s responses without knowing whether they come from a human or an AI is not the same as three senior psychiatrists evaluating in a blinded design.

Item 10: Results (bias and harm)

Beyond overall performance, the study must explicitly assess potentially harmful, biased, or misleading responses. This is the most important item for mental health: a single dangerous response to a suicidal patient matters more than an average accuracy score of 85%.


Why dates and locations matter

The requirement to specify query dates and locations (Item 6b) may seem bureaucratic. It is not. LLMs are not stable molecules: they change constantly.

The same prompt submitted to GPT-4 in March 2024 and in September 2024 can yield radically different results. OpenAI updates its models regularly, sometimes without notice. If a study does not specify when it queried the model, its results are unreproducible by definition — because the “drug” that was tested no longer exists.
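This is why pinning a dated model snapshot, where the provider offers one, matters for reproducibility. The sketch below contrasts a floating alias with a pinned snapshot; the snapshot name is an example of the naming pattern, and availability of any given snapshot depends on the provider:

```python
# Illustrative only: a bare alias like "gpt-4" is a floating pointer the
# provider can silently repoint to a new build, while a dated snapshot
# name pins one specific build. Snapshot names here are examples.
UNREPRODUCIBLE = {
    "model": "gpt-4",            # which build? depends on when you asked
}

REPRODUCIBLE = {
    "model": "gpt-4-0613",       # pinned, dated snapshot
    "queried_at": "2024-03-12",  # report the query date regardless
}
```

Even with a pinned snapshot, reporting the query date remains necessary, since snapshots are eventually deprecated and retired.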

This is a fundamental difference from traditional clinical trials: in pharmacology, the molecule being tested remains the same. In AI, the object of study is a moving target.


How CHART fits with the Hua and Choudhury frameworks

The CHART Statement sits within an ecosystem of complementary frameworks we have analyzed:

| Framework | Question | What it provides |
| --- | --- | --- |
| Hua (T1/T2/T3) | What level of evidence does the study provide? | Distinguishes bench testing, feasibility, and clinical efficacy |
| Choudhury | Why do lab results fail to predict real-world use? | Identifies human factors (trust, cognitive load, accountability) |
| CHART | Is the study transparent enough to be evaluated at all? | Checks that minimum reporting standards are met |

Hua tells you what level of evidence a study provides (T1, T2, or T3)

Choudhury tells you why moving from one tier to the next is anything but automatic

CHART tells you whether the study reports enough information for you to even begin evaluating it

CHART is essentially the preliminary filter: before asking whether a study is T1 or T3, you first need it to provide the basic information required to be interpretable at all.


What holds up well in this proposal

1. Exemplary development process

531 stakeholders in the Delphi process, 48 experts for consensus meetings, 80% agreement threshold, iterative pilot testing. This is the level of methodological rigor that CHART demands of the studies it evaluates — and applies to itself.

2. It targets the exact regulatory gap

CONSORT-AI covers AI clinical trials. STROBE covers observational studies. TRIPOD+AI covers predictive models. But no guideline covered chatbot evaluation studies — which represent the vast majority of recent LLM literature in healthcare. CHART fills this gap.

3. “Living guideline” approach

Biannual updates for the first two years, then annual. A panel of 14 experts maintains continuous literature surveillance with a 90% agreement threshold for any change. Unlike a static standard, CHART acknowledges that its subject — AI chatbots — evolves too fast for a fixed rulebook.

4. Simultaneous publication in 6 journals

This is a strong signal of institutional legitimacy. When JAMA Network Open, BMJ Medicine, and Annals of Family Medicine publish the same standard on the same day, journal editors take notice. Adoption by journals is the key to a reporting guideline’s real-world impact.


The limitations — and why they matter

1. Reporting ≠ quality

CHART is a transparency tool, not a methodological quality tool. A study can check all 12 items on the checklist and still be methodologically weak — if the reference standard is poorly chosen, if the sample is too small, or if the conclusions overreach the data. Ticking every box does not guarantee a good study. It only guarantees that the study can be judged.

Clinical analogy: it is the difference between a complete hospital discharge summary and a correct diagnosis. The former is necessary to evaluate the latter, but it does not replace it.

2. Text-focused, not yet multimodal

CHART was designed for text-based chatbots. But recent models are increasingly multimodal (text + image + audio + video). How do you report an interaction where the patient shows an image to the AI? Where the chatbot analyzes the tone of voice? The framework will need to evolve quickly on this front — and this is precisely why the “living guideline” approach is well-suited.

3. Does not cover clinical trials

CHART is designed for performance evaluation studies (vignettes, benchmarks, scoring). If a randomized trial tests a therapeutic chatbot with real patients, it must use CONSORT-AI in addition — CHART alone is not sufficient for the clinical trial component. The authors state this themselves: CHART is complementary, not self-sufficient.

4. The risk of superficial box-ticking

Like any checklist, CHART can be filled out mechanically. The danger: studies that check every item without genuine methodological reflection. “We used GPT-4o version of March 15, 2025 ✓” formally satisfies Item 3 — but says nothing about why this model was chosen over others, nor about what this choice means for the generalizability of results.


Our take

The CHART Statement is an essential critical reading tool for anyone interested in chatbot studies in healthcare — and particularly in mental health.

1. Use CHART as an immediate reading filter

When you read a study on a mental health chatbot, check three things first: is the model precisely identified (Item 3)? Are the prompts published (Item 5)? Were harmful responses assessed (Item 10)? If the answer to any of these is no, the study’s value is unverifiable — regardless of the journal it was published in.
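The three-question triage above can be sketched as a tiny filter. The item numbers follow the CHART Statement; the dictionary field names are assumptions introduced here for illustration, not part of CHART:

```python
# Minimal sketch of the three-question CHART triage: Items 3, 5 and 10.
# Field names ("model_identified", etc.) are hypothetical conveniences.
def chart_quick_check(study: dict) -> list:
    """Return the list of failed checks; an empty list means the study
    clears the minimum bar for being interpretable."""
    checks = {
        "Item 3: model precisely identified": study.get("model_identified", False),
        "Item 5: raw prompts published": study.get("prompts_published", False),
        "Item 10: harmful responses assessed": study.get("harms_assessed", False),
    }
    return [name for name, ok in checks.items() if not ok]
```

A study passing this filter is not thereby good, as the limitations section stresses; it is merely evaluable.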

2. Combine the three frameworks for a complete reading

CHART first: is the study interpretable? Then Hua: what is its level of evidence (T1, T2, or T3)? Then Choudhury: what human factors limit it? In three minutes, you have a structured assessment that goes far beyond reading the abstract and the authors’ conclusions.

3. Transparency is the minimum, not the maximum

CHART does not solve every problem. Perfect reporting does not compensate for the absence of real patients, the lack of ecological validity, or the confusion between technical performance and therapeutic efficacy. But without methodological transparency, there is no scientific conversation to be had. It is the foundation — not the ceiling — of rigor.


Source analyzed: Huo, B., et al. (2025). Reporting Guideline for Chatbot Health Advice Studies: The CHART Statement. JAMA Network Open, BMJ Medicine, BJS, BMC Medicine, Annals of Family Medicine, Artificial Intelligence in Medicine. DOI: 10.1136/bmjmed-2025-001632



Keywords

reporting guideline · chatbot · methodological transparency · reproducibility · AI evaluation