AI Watch

Only 3 out of 52 journals require transparency for AI clinical trials: the CONSORT-AI case

Matthieu Ferry ⇄ AI

Published in 2020, CONSORT-AI mandates 14 transparency criteria for clinical trials testing AI interventions. Five years on, adherence is declining and most journals ignore these standards. What this reveals — and how it changes the way we critically read studies.

A reproducibility problem

Imagine a clinical trial that specifies neither the version of the drug tested, nor whether the treatment is accessible to other researchers, nor how errors were identified. Inconceivable in pharmacology. Commonplace in clinical trials involving AI.

In September 2020, the CONSORT-AI & SPIRIT-AI Steering Group — led by Xiaoxuan Liu (University of Birmingham) — simultaneously published in Nature Medicine, The BMJ and The Lancet Digital Health an extension of CONSORT 2010 specifically designed for randomised controlled trials (RCTs) evaluating interventions that include an artificial intelligence component.

“CONSORT-AI recommends that researchers provide clear descriptions of the AI intervention, including the instructions and skills required for use, the setting in which the AI intervention is integrated, the handling of inputs and outputs, human-AI interaction, and the provision of an error case analysis.”

— Liu et al. (2020), CONSORT-AI Extension, Nature Medicine

The starting observation was simple: standard clinical trials have well-established reporting standards (CONSORT 2010). But these standards do not cover AI-specific features — algorithm version, human-machine interaction, input data handling, error analysis. Without this information, a clinical trial testing a therapeutic chatbot or an AI diagnostic tool is unverifiable.


What CONSORT-AI adds: 14 AI-specific items

The development process followed the EQUATOR Network framework: generation of 29 candidate items by 34 experts, a 2-round Delphi process (103 international stakeholders — clinicians, computer scientists, methodologists, regulators, patients, journal editors), a 2-day consensus meeting with 31 participants voting anonymously (threshold: 80%), and a pilot test (34 participants). Of the 41 items evaluated at the consensus meeting (the Delphi rounds added candidates to the initial 29), 14 AI-specific items reached consensus.
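
To make the voting threshold concrete, here is the arithmetic as a minimal Python sketch (illustrative only: the guideline reports the 80% threshold, not per-item vote counts):

```python
import math

# Consensus meeting: 31 participants voting anonymously, 80% threshold.
# Illustrative arithmetic; per-item vote counts are not published.
voters, threshold = 31, 0.80
min_votes = math.ceil(voters * threshold)
print(f"An item needed at least {min_votes} of {voters} votes")  # 25 of 31
```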

Here are the most revealing for a clinician:

Item 5i: Algorithm version

Specify which exact version of the algorithm was used. Between GPT-3.5 and GPT-4o, performance differs radically, and the same model can change through silent updates. This is the most poorly reported item: present in only 20% of studies.

Item 11a: Human-AI interaction

What level of expertise is required to use the system? Which decisions does the human make? Which does the AI make? A triage tool operated by a psychiatrist and an autonomous chatbot facing a patient do not raise the same safety questions — but without this item, we don’t know which was tested.

Item 19: Performance error analysis

How were errors identified and analysed? Beyond the average performance score, what types of errors does the system make? In mental health, a diagnostic error or an inappropriate response to a patient in suicidal crisis does not carry the same weight as an imprecision in nutritional advice.

Item 25: Code and intervention accessibility

Is the AI code accessible? Under what licence? Can the algorithm be inspected? Without this information, reproducibility is impossible. Reported in only 42% of studies. We require pharmaceutical researchers to publish the composition of their molecules — why accept less for AI?


Five years on: the worrying assessment

In 2024, Cruz Rivera et al. published in Nature Communications the first systematic evaluation of CONSORT-AI adoption: 65 randomised clinical trials scrutinised.

Indicator | Result | What it means
Median overall concordance | 90% | Seemingly reassuring
RCTs explicitly citing CONSORT-AI | 10/65 | 85% of trials don't even cite the guideline
Item 5i (algorithm version) | 20% | 4 out of 5 studies don't say which model was used
Item 25 (code accessibility) | 42% | More than half the studies are non-reproducible
Item 5iii (poor-quality data) | 63% | More than 1 in 3 studies don't say how poor-quality or missing data were handled
Journals requiring CONSORT-AI | 3/52 | 94% of journals impose no AI-specific requirements
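
The raw counts above are enough to re-derive the headline percentages. A quick sketch, using only the figures quoted in this article:

```python
# Re-deriving the headline percentages from the counts quoted above.
citing, total_rcts = 10, 65
print(f"{1 - citing / total_rcts:.0%} of trials do not cite CONSORT-AI")       # 85%

requiring, total_journals = 3, 52
print(f"{1 - requiring / total_journals:.0%} of journals require nothing")     # 94%

adherence_5iii = 0.63  # reporting rate for item 5iii (poor-quality data)
print(f"{1 - adherence_5iii:.0%} of studies are silent on poor-quality data")  # 37%
```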

The most telling figure is not the overall concordance (90%) — it’s the gap between that number and the critical items. The overall score masks fundamental shortcomings. It’s like getting 18/20 on a medical exam by answering the general questions perfectly, but skipping the ones on dosage and contraindications.


The decline: when more trials means less rigour

The picture worsens when you look at the trend. In oncology, Chen et al. (2025) document a concerning decline:

CONSORT-AI adherence in oncology dropped from 96% in 2022 to 79% in 2024. And trials judged to be at high risk of bias are the ones that least comply with reporting standards.

— Chen et al. (2025), “Five years after CONSORT-AI, not much has changed”

This is no coincidence. The increase in the volume of published AI trials has diluted quality. More teams are launching AI trials, but without training in the specific requirements of this type of research. And journals don’t filter: 94% of them don’t ask authors to comply with CONSORT-AI.


How it fits with the Hua, Choudhury and CHART frameworks

CONSORT-AI completes the ecosystem of evaluation frameworks we are building:

Framework | Question | Target study type
Hua (T1/T2/T3) | What level of evidence does the study provide? | All AI studies in mental health
Choudhury | Why don't lab results predict real-world usage? | Feasibility and effectiveness studies (T2/T3)
CHART | Is the chatbot evaluation study transparent? | Chatbot evaluation studies (benchmarks, vignettes)
CONSORT-AI | Does the AI clinical trial report the intervention's specifics? | Randomised clinical trials with an AI intervention (T3)

• Hua tells you what level of evidence the study provides (T1, T2 or T3)
• Choudhury tells you why the transition from one level to the next is anything but automatic
• CHART tells you whether the chatbot evaluation study is sufficiently transparent
• CONSORT-AI tells you whether the AI clinical trial reports the information needed to interpret it

The distinction is simple: CHART applies to evaluation studies (“does the LLM answer questions well?”), CONSORT-AI applies to clinical trials (“does the system improve patient health?”). A randomised trial testing a therapeutic chatbot needs both.
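
As a reading aid, that distinction can be written as a simple dispatch rule. A minimal sketch, assuming three shorthand study-type labels of our own (they are not terms from either guideline):

```python
# Which reporting guideline(s) apply, following the distinction drawn above.
# The study-type labels are this article's shorthand, not official terminology.
def applicable_guidelines(study_type: str) -> list[str]:
    if study_type == "evaluation":    # "does the LLM answer questions well?"
        return ["CHART"]
    if study_type == "rct":           # "does the system improve patient health?"
        return ["CONSORT-AI"]
    if study_type == "rct_chatbot":   # randomised trial of a therapeutic chatbot
        return ["CHART", "CONSORT-AI"]
    raise ValueError(f"unknown study type: {study_type!r}")

print(applicable_guidelines("rct_chatbot"))  # ['CHART', 'CONSORT-AI']
```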


What is solid in this proposal

1. Pioneering publication in 3 major journals

Nature Medicine, The BMJ, The Lancet Digital Health — the most influential trio of journals in medicine. CONSORT-AI was the first AI-specific reporting guideline for clinical trials, published before the volume of trials made the problem unmanageable.

2. It targets the right questions

The 14 items are not a bureaucratic checklist. Each addresses a concrete reproducibility problem: which algorithm version? How are errors identified? What level of human expertise is required? Is the code accessible? These are exactly the questions a critical reader should be asking.

3. Human-AI interaction as an explicit item

Item 11a requires documenting the interaction between the clinician and the AI system. This is a major blind spot in current literature: we rarely know whether an AI tool was used in full autonomy or under clinical supervision. The difference is fundamental for assessing risks and benefits.

4. Developed in parallel with SPIRIT-AI

CONSORT-AI (reporting results) and SPIRIT-AI (trial protocols) form a coherent pair. It’s the same transparency logic applied to both stages: before the trial (protocol) and after (publication). What is still missing is the systematic adoption of both.


The limitations — and why they matter

1. The failure of voluntary adoption

Five years after publication, only 3 out of 52 journals require or recommend CONSORT-AI. Neither publication in Nature Medicine nor the backing of the EQUATOR Network was enough to impose these standards. The model of “publish a guideline and wait for the field to adopt it” does not work — enforcement mechanisms are needed: journal requirements, funder conditions, ethics committee criteria.

Comparison: CONSORT 2010 is now required by most major medical journals. CONSORT-AI, despite the same institutional legitimacy, is ignored by 94% of journals publishing AI trials. The issue is not the quality of the guideline — it’s the absence of leverage to make it mandatory.

2. Excluded from CONSORT 2025

The CONSORT 2025 and SPIRIT 2025 update, recently published, does not integrate AI-specific recommendations into the main text. In practice, this means that trials using AI as a treatment tool or as a component of the intervention have no additional transparency obligations under the base standard. AI-specific items remain an "optional extension" — a position that is hard to defend when AI is involved in a growing number of clinical trials.

3. Continuous learning systems and "AI as therapy" not covered

CONSORT-AI was developed in 2019-2020, when AI trials mainly focused on diagnosis and triage. Continuous learning systems — those that evolve in real time through contact with patients — were explicitly excluded. Likewise, interventions where AI is itself the treatment (autonomous therapeutic chatbots) are not sufficiently represented in the current framework. This is precisely the type of intervention growing fastest in mental health.

4. The illusion of the overall score

The 90% median concordance gives an impression of compliance. But the most critical items are those with the lowest adherence: model version (20%), code accessibility (42%), missing data handling (63%). A high overall score can mask shortcomings that render the study fundamentally non-reproducible. It’s like a well-filled medical record — except for the dosage and allergy history.


Our position

CONSORT-AI is an essential critical reading tool — but its story is also a lesson on the limits of guidelines without enforcement mechanisms.

1. Check three items before reading the rest

When you read a clinical trial testing an AI intervention in mental health, immediately check: Is the algorithm version specified (Item 5i)? Is the human-AI interaction documented (Item 11a)? Are errors analysed (Item 19)? If these three pieces of information are missing, the study does not give you enough elements to judge its applicability to your practice — regardless of the journal in which it is published.
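
That three-item check can be turned into a reusable pre-screen. A minimal sketch (the helper and its labels are ours, paraphrasing the CONSORT-AI items, not official tooling):

```python
# The three-item pre-screen described above, as a reusable check.
# Item labels paraphrase CONSORT-AI; this helper is illustrative, not official.
CRITICAL_ITEMS = {
    "5i": "algorithm version specified",
    "11a": "human-AI interaction documented",
    "19": "performance errors analysed",
}

def screen_trial(reported_items: set[str]) -> list[str]:
    """Return the critical items a trial report fails to cover."""
    return [f"{item}: {label}"
            for item, label in CRITICAL_ITEMS.items()
            if item not in reported_items]

# Example: a paper that states the model version but nothing else.
print(screen_trial({"5i"}))
# ['11a: human-AI interaction documented', '19: performance errors analysed']
```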

2. Publication does not create adoption

CONSORT-AI is a textbook case: a rigorous guideline, published in the best journals, developed by international consensus — yet largely ignored five years later. This is not a failure of the guideline, it is a systemic failure. Scientific transparency cannot be decreed: it must be imposed by journals, funders and regulators. As long as 94% of journals do not require CONSORT-AI, its reach will remain limited.

3. Four frameworks, one reading grid

With Hua (level of evidence), Choudhury (ecological validity), CHART (chatbot evaluation transparency) and CONSORT-AI (clinical trial transparency), we now have an integrated grid to evaluate virtually all AI studies in mental health. None of these frameworks is sufficient on its own. Together, they enable a critical reading that goes far beyond what an article abstract can offer.


Reference analysed: Liu, X., et al. (2020). Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nature Medicine, 26, 1364-1374. https://doi.org/10.1038/s41591-020-1034-x

Evaluation studies cited:

  • Cruz Rivera, S., et al. (2024). Concordance of randomised controlled trials for artificial intelligence interventions with the CONSORT-AI reporting guidelines. Nature Communications, 15, 1566. https://doi.org/10.1038/s41467-024-45355-3
  • Chen, E., et al. (2025). Five years after CONSORT-AI, not much has changed: a call to action for artificial intelligence research in oncology.

Keywords

reporting guideline · clinical trial · methodological transparency · reproducibility · AI evaluation