PROBAST+AI: 34 questions that most AI prediction models in healthcare don't survive
Published in the BMJ in March 2025, PROBAST+AI is the first quality assessment tool for clinical prediction models that holds classical statistical and artificial intelligence approaches to the same standards of rigour. The finding that motivates it is damning: most published models are of poor quality, their reported performance is overestimated, and their biases go unnoticed. This is the sixth instalment of our series on AI evaluation frameworks in healthcare.
Source analysed
https://doi.org/10.1136/bmj-2024-082505

We do not prescribe a drug without evaluating it. Why would we do so with a prediction algorithm?
Before putting a drug on the market, we require rigorous clinical trials, side-effect assessment, and validation across diverse populations. When an algorithm promises to predict suicide risk, screen for depression or guide a treatment decision, what do we ask of it? In most cases: an impressive accuracy score calculated on an internal dataset, and publication in a peer-reviewed journal.
Important clarification: this article is not about ChatGPT or conversational agents. A clinical prediction model is a very different tool: it is an algorithm that, from a patient’s data (age, history, test results, scale scores…), calculates the probability of a health event — for example the risk of depressive relapse within six months, or the likelihood that a suicide attempt will occur within a year. These are the “risk scores” you may already use in clinical practice, in the form of formulae or online calculators.
This is the assessment made by Karel G. M. Moons and 23 co-authors, who published PROBAST+AI in the BMJ in March 2025: an update of PROBAST (2019) that evaluates the quality, risk of bias and applicability of these prediction models.
“Numerous systematic reviews conducted over the past twenty years have shown that the majority of published models, including those based on machine learning, are of poor quality, that their reported predictive performance is at high risk of bias, and that fairness issues affect predictions for certain patient groups.”
— Moons et al. (2025), PROBAST+AI, BMJ
This assessment does not come from a technophobic activist. It comes from the creator of the world’s reference tool for systematic reviews of prediction models, published in one of the five most influential medical journals in the world. In other words: the expert who has spent twenty years examining these models tells us that most of them don’t hold up — whether they rely on traditional statistical formulae or the latest artificial intelligence techniques.
Statistical formula-based vs AI-based prediction models: a false dilemma
To understand what is at stake with PROBAST+AI, one must first grasp a distinction the article refuses to maintain — and this is its first major contribution.
Historically, clinical prediction models relied on classical statistical methods: logistic regression, for example, which combines a few variables (age, sex, scale score, medical history) in a transparent formula to estimate a risk. This is the principle behind the cardiovascular risk calculators some doctors use in consultations.
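To make this concrete, here is a minimal sketch of what such a transparent formula looks like in code. The variables and coefficients are invented purely for illustration; they come from no published model.

```python
import math

def relapse_risk(age: float, prior_episodes: int, phq9: int) -> float:
    """Toy logistic-regression risk score. The coefficients are INVENTED
    to show the shape of a 'transparent formula' model, nothing more."""
    # Linear predictor: intercept plus a weighted sum of the variables.
    lp = -4.0 + 0.02 * age + 0.45 * prior_episodes + 0.10 * phq9
    # The logistic link maps the linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-lp))

# Every term can be read, understood, and criticised.
print(f"{relapse_risk(age=42, prior_episodes=2, phq9=14):.0%}")  # ~30%
```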
Over the past fifteen years, techniques from artificial intelligence — also called “machine learning” — have proposed a different approach: instead of a predefined formula, an algorithm “learns” the patterns in large amounts of data. Neural networks, random forests, gradient boosting algorithms: these techniques can incorporate hundreds of variables and detect relationships the human eye cannot spot.
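By contrast, a machine-learning model derives its decision rules from examples. A minimal sketch on synthetic data (nothing here comes from the article): a random forest picks up an interaction effect that no analyst wrote into a formula.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))                      # 30 hypothetical patient variables
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0.5).astype(int)  # outcome driven by an interaction

# No formula is specified: the forest infers the pattern from the data.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(model.predict_proba(X[:1])[:, 1])              # predicted risk for one patient
```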
“Any strict distinction between statistical methods and machine learning methods quickly becomes a false opposition.”
— Moons et al. (2025)
Despite their technical differences, both families of models seek to do exactly the same thing: predict a clinical outcome from patient data. And above all, they are subject to the same fundamental problems: was the model tested on patients different from those used to build it? Do the data used represent the diversity of real patients? Did the model simply “adapt” to the particularities of its training dataset to the point of no longer working elsewhere?
The essential difference is not in the nature of the problems, but in their visibility. A five-variable formula can be read, understood, criticised. A neural network with fifty million parameters can only be observed through its outputs. The biases are no different — they are simply harder to detect.
The irony: the authors acknowledge this false dilemma while naming their tool “PROBAST+AI”. The suffix catches attention in a field saturated with AI publications — but it reinforces the binary categorisation the article claims to move beyond. An unresolved tension, but pragmatically effective.
Four domains, 34 assessment questions
PROBAST+AI examines prediction models across four domains, each subdivided into precise questions that guide the assessment:
Participants
Are the people included in the study representative of the patients on whom the model will be used? Are the inclusion criteria appropriate? Is the sample sufficiently diverse? This is where the question of applicability plays out: a model developed on a North American academic hospital cohort does not necessarily predict the same risk for a patient followed in a community mental health centre in France, with a different care pathway and socio-cultural context.
Input variables (predictors)
Is the information used by the model available at the time the prediction is needed? Is it measured reliably? A model that uses the final diagnosis as input to estimate a prognosis has a fundamental logical problem — but this is more common than one might think. For AI-based models, PROBAST+AI adds specific questions: are the training data documented? Have representation biases been identified?
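A synthetic simulation makes the problem tangible. Everything below is invented for illustration: adding a variable that is really a noisy copy of the outcome (say, a code recorded after diagnosis) makes any model look near-perfect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
honest = rng.normal(size=(n, 5))                  # information available at prediction time
y = (honest[:, 0] + rng.normal(size=n) > 0).astype(int)
leaked = y + rng.normal(scale=0.1, size=n)        # a variable only known once the outcome is known

for name, X in [("honest predictors", honest),
                ("with leaked variable", np.column_stack([honest, leaked]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")             # modest AUC vs near-perfect AUC
```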
Predicted outcome
Is the outcome the model seeks to predict clearly defined? Measured in a standardised way? Independent from the input data? In mental health, this issue is critical: how does one define “improvement”? By a PHQ-9 score (depression questionnaire), by clinical judgement, by the patient’s own experience? A model that predicts the PHQ-9 from the PHQ-9 predicts nothing — it measures the stability of a questionnaire.
Analysis
This is the most technical domain — and the one where models fail most often. Is the sample size sufficient for the number of variables used? Was the model tested on data it has never seen (external validation)? And above all: was its calibration verified? Calibration is the agreement between the risk announced by the model and the risk actually observed. A model can very well distinguish high-risk from low-risk patients (this is called discrimination) while systematically getting the numbers wrong: announcing 60% risk when the true risk is 20%. In clinical practice, this distinction is crucial.
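A small synthetic simulation of exactly this failure: a model that triples every patient's true risk keeps its discrimination intact, because the ranking of patients is unchanged, while every number it announces is wrong.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
true_risk = rng.uniform(0.05, 0.30, size=10_000)  # patients' actual risks
y = rng.binomial(1, true_risk)                    # observed outcomes
announced = true_risk * 3                         # the model announces triple the real risk

# Discrimination is identical: tripling does not reorder patients.
print(roc_auc_score(y, true_risk) == roc_auc_score(y, announced))  # True

# Calibration is broken: patients told "~75% risk" have events ~25% of the time.
flagged = announced > 0.6
print(f"announced: {announced[flagged].mean():.2f}, observed: {y[flagged].mean():.2f}")
```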
The innovation: evaluating the manufacturing, not just the finished product
The most important conceptual contribution of PROBAST+AI is a distinction clinicians will immediately understand: the difference between evaluating the manufacturing quality of a model and evaluating the risk of bias in its results.
“The development of a model is the actual process of building, producing, or manufacturing a prediction model […] Each model is developed only once; this can be compared to the manufacturing of a medical test, device, or drug.”
— Moons et al. (2025)
The analogy is telling. A drug is manufactured once, then tested repeatedly under different conditions: varied populations, different hospitals, different countries. Similarly, a prediction model is developed once — variables are chosen, the algorithm architecture is defined, it is trained on a dataset — then its performance is evaluated in new contexts. PROBAST+AI clearly distinguishes these two stages:
- Development quality: Was the manufacturing process rigorous? Was the sample large enough? Was the risk of overfitting controlled? (Overfitting is when a model adapts so well to the data used to build it that it loses its ability to work on new data — somewhat like a student memorising exam answers without understanding the subject matter.)
- Risk of bias in evaluation: Are the performance tests reliable? Was the model tested on data it has never seen? Was its calibration verified? Were results analysed by patient subgroups?
A well-built model can be poorly evaluated if only tested on the data used to build it. Conversely, a poorly built model can appear performant thanks to overfitting. PROBAST+AI requires checking both.
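A synthetic sketch of both failure modes at once: an unconstrained model evaluated on its own development data looks flawless, then collapses on patients it has never seen. All data and modelling choices here are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 50))                        # few patients, many variables
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)  # only the first variable carries signal

X_dev, X_new, y_dev, y_new = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_dev, y_dev)  # grown until it memorises

print(roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1]))  # 1.00 on its own data
print(roc_auc_score(y_new, model.predict_proba(X_new)[:, 1]))  # far lower on unseen patients
```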
Algorithmic fairness: integrated, but reductive
PROBAST+AI is the first methodological assessment tool to integrate the dimension of algorithmic fairness — that is, the question of whether a model produces fair results for all patient groups, regardless of their origin, gender or socio-economic status. Questions on representation bias and differential impact on subgroups are distributed across the tool’s four domains.
This is an advance. But it is also a simplification.
What PROBAST+AI does well
Systematically asking the question: “Do the algorithm’s predictions benefit or disadvantage certain patient groups without justified reason?” The mere fact of requiring this interrogation in every systematic review is a considerable advance over the current situation, where most studies do not even mention this dimension.
What PROBAST+AI cannot do
Resolve a dilemma that mathematics has proven unsolvable. Foundational work (Chouldechova 2017, Kleinberg et al. 2016) has shown that the different ways of defining an algorithm’s fairness — treating all groups proportionally, offering the same chances to each, producing predictions of equal reliability for all — cannot all be satisfied simultaneously. A model considered fair by one criterion will necessarily be unfair by another. PROBAST+AI reduces this question, which is as much philosophical and political as it is technical, to a criterion assessable by checklist. Convenient, but intellectually incomplete.
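A numeric illustration of the constraint (our construction, not the article's): give two groups perfectly calibrated risk scores, flag everyone above 50% risk, and the group with the higher base rate mechanically accumulates more false positives.

```python
import numpy as np

rng = np.random.default_rng(3)

def false_positive_rate(a: float, b: float, n: int = 200_000) -> float:
    risk = rng.beta(a, b, size=n)   # each patient's true risk
    y = rng.binomial(1, risk)       # outcomes drawn from those risks
    flagged = risk >= 0.5           # the score IS the true risk: perfectly calibrated
    return flagged[y == 0].mean()   # share of non-cases wrongly flagged

print(false_positive_rate(2, 8))    # low-base-rate group (~20%): FPR around 1%
print(false_positive_rate(4, 6))    # high-base-rate group (~40%): many times larger
```

Equalising the false positive rates would require giving the two groups different thresholds or de-calibrated scores: one fairness criterion must give way to another.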
The authors themselves acknowledge it: the definitive assessment of an algorithm’s fairness can only be made at the point of deployment in daily practice, not at the scientific publication stage. Which considerably relativises the scope of their own tool on this dimension.
How it fits within the series: six frameworks, one integrated grid
PROBAST+AI completes the ecosystem of evaluation frameworks we are building in this series. Here is where it sits:
| Framework | Question | Target study type | Focus |
|---|---|---|---|
| Hua (2022) | What level of evidence does the study provide? | All AI studies in mental health | Classification |
| Choudhury (2024) | Why don’t lab results predict real-world usage? | Feasibility and effectiveness studies | Validity |
| CHART (2024) | Is the chatbot evaluation study transparent? | Health chatbot evaluations | Reporting |
| CONSORT-AI (2020) | Does the AI clinical trial report the intervention’s specifics? | Randomised clinical trials with AI | Reporting |
| CONSORT/SPIRIT 2025 | Does the clinical trial meet baseline standards (Open Science)? | All randomised clinical trials | Reporting |
| PROBAST+AI (2025) | Was the prediction model developed and evaluated correctly? | Predictive model studies (diagnosis, prognosis) | Quality |
The distinction is essential: the first five frameworks evaluate the transparency of reporting (what the article discloses) or the validity of conclusions (what the study demonstrates). PROBAST+AI evaluates the quality of what was actually done — the rigour of the model’s construction and testing process. An article can be perfectly written according to CONSORT-AI standards while describing a model of poor quality according to PROBAST+AI criteria.
- Hua tells you what level of evidence the study provides
- Choudhury tells you why the transition from one level to the next is anything but automatic
- CHART and CONSORT-AI tell you whether the study reports the information needed to assess it
- PROBAST+AI tells you whether the model itself was well built and rigorously evaluated
What is solid in this proposal
An exemplary development process
The tool was built using a three-round Delphi method (95 to 144 participants from six continents), followed by a consensus meeting of 26 experts, with an agreement threshold set at 80%. The protocol had been pre-published. It is a model of methodological rigour — one that illustrates the gap between what the scientific community can do when it makes the effort, and the average quality of what it produces on a daily basis.
The drug analogy normalises the demand placed on AI
By comparing a prediction model to a drug or a medical device, PROBAST+AI paves the way for comparable regulatory treatment. We do not put a drug on the market without independent validation; why would we accept a suicide risk prediction algorithm being deployed solely on the basis of its internal performance?
A directly usable tool for critical reading
The 34 assessment questions are concrete and applicable. A clinician reading a study claiming that an algorithm “predicts depression with 92% accuracy” can immediately check: was it tested on patients it had never seen? Was its calibration assessed? Do the data reflect the diversity of my patient population? Are the variables used available in my care setting? This last point connects to the question of ecological validity: a model that performs well in the laboratory does not necessarily perform well in the reality of the consulting room. Four questions that transform an impressive score into a legitimate object of inquiry.
Refusing AI exceptionalism
By holding AI and traditional statistical methods to the same evaluation criteria, PROBAST+AI avoids a double trap: blind enthusiasm (“it’s AI, so it must be better”) and irrational distrust (“it’s AI, so it must be suspect”). The same quality requirements apply, regardless of the technical means employed. This is exactly the position we defend throughout this series.
The limitations — and why they matter
No empirical validation of the tool itself
PROBAST+AI has not been empirically tested: we do not know whether models it judges favourably actually perform better in clinical practice than those it judges poorly. Inter-rater reliability — the ability of two different people to reach the same conclusions using the tool — had been measured for the previous version PROBAST (2019) but not for PROBAST+AI, which contains 34 questions (versus 20 previously) and substantially modified criteria.
The double standard: the tool demands rigorous external validation from prediction studies — but does not apply this same requirement to itself. This is the most significant contradiction identified in the article.
A checklist doesn’t change incentives
The authors document a systemic problem: the widespread poor quality of prediction studies. But they propose an individual response — a grid that each assessor applies article by article. The structural roots of the problem (publish-or-perish pressure, lack of data sharing, insufficient regulation) are not addressed. Worse: a checklist creates a risk of surface compliance — researchers optimising the writing of their articles to tick the boxes without improving the actual quality of their work. It is the research equivalent of “teaching to the test”: learning to satisfy the criteria without necessarily mastering the substance.
The model evaluated outside its context of use
PROBAST+AI evaluates the model as an isolated technical object. But in clinical practice, a prediction model never exists alone: it is integrated into a care pathway, used by a professional whose level of expertise and trust in the tool varies, facing a patient in a unique context. The clinician-model-patient interaction — what Choudhury calls the “ecological context” — is absent from the assessment. A perfectly built model can fail in an unsuitable clinical context, and an imperfect model can prove useful in the hands of a practitioner who knows its limitations.
Our position
PROBAST+AI is the sixth evaluation framework analysed in this series — and the one that pushes the demand for methodological rigour furthest. But it shares the structural limitations of all checklist-based tools.
Four questions to ask when facing a prediction model
When a study claims that an algorithm “predicts” a clinical outcome, immediately ask: (1) what population was it developed on, and does it resemble my patients? (2) was it tested on data it had never seen? (3) was its calibration verified — that is, do the probabilities it announces match reality? (4) are the variables it uses available in my care setting? If even one of these answers is “no” or “not specified”, the model does not offer sufficient guarantees to guide your practice.
Don’t confuse accuracy with reliability
A model can display “95% accuracy” while being poorly calibrated — that is, systematically assigning risk probabilities that are too high or too low. In clinical practice, this calibration matters more than the simple ability to classify patients: what matters for a care decision is not just knowing that a patient is “at risk”, but knowing whether that risk is 5%, 30% or 80%. PROBAST+AI is the first tool to integrate this requirement systematically.
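A short synthetic example of the gap: distort a model's probabilities so that every patient stays on the same side of the 50% cutoff, and accuracy does not move at all while the announced risks become wildly overconfident. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
p_true = rng.uniform(0.0, 1.0, size=20_000)  # well-calibrated probabilities
y = rng.binomial(1, p_true)

def accuracy(p: np.ndarray) -> float:
    return float(((p >= 0.5) == y).mean())

# Monotone distortion that fixes the 0.5 cutoff but inflates confidence.
overconfident = p_true**3 / (p_true**3 + (1 - p_true)**3)

print(accuracy(p_true) == accuracy(overconfident))  # True: accuracy unchanged
band = (p_true > 0.55) & (p_true < 0.65)
print(f"{overconfident[band].mean():.2f}")          # ~0.77 announced for ~0.60 true risk
```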
Six frameworks, one complete reading grid
With Hua (level of evidence), Choudhury (ecological validity), CHART (chatbot evaluation transparency), CONSORT-AI (AI clinical trial transparency), CONSORT/SPIRIT 2025 (baseline standards) and PROBAST+AI (methodological quality of models), we now have an integrated grid to evaluate virtually all AI studies in healthcare. None of these frameworks is sufficient on its own. PROBAST+AI adds the missing piece: the evaluation of the quality of what researchers actually did, not just what they report.
Reference analysed: Moons, K. G. M., Damen, J. A. A., Kaul, T. et al. (2025). PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ, 388, e082505. DOI: 10.1136/bmj-2024-082505
Further references:
- Wolff, R. F. et al. (2019). PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies. Annals of Internal Medicine, 170(1), 51-58.
- Collins, G. S. et al. (2024). TRIPOD+AI Statement: Updated Reporting Guideline for Clinical Prediction Models. BMJ.
- Van Calster, B. et al. (2019). Calibration: the Achilles heel of predictive analytics. BMC Medicine, 17, 230.
- Obermeyer, Z. et al. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366, 447-453.
Series: AI Evaluation Frameworks in Healthcare
- Hua: three tiers of evidence for AI in mental health
- Choudhury: ecological validity of LLM studies
- CHART: health chatbot transparency
- CONSORT-AI: transparency of AI clinical trials
- CONSORT/SPIRIT 2025: Open Science yes, AI no
- PROBAST+AI: quality of AI prediction models (this article)