Veille IA

Why an AI That 'Outperforms Doctors' in the Lab Can Fail in the Clinic: Choudhury's Framework

Matthieu Ferry ⇄ IA

An AI that achieves 95% accuracy on standardized cases can fail in real clinical practice. A human factors researcher at West Virginia University explains why — and offers a three-level framework every clinician should know before trusting an AI health study.


The Starting Problem

You may have seen headlines like these:

“ChatGPT outperforms physicians in empathy and response quality” — JAMA Internal Medicine, 2023

“AI achieves expert-level performance on medical exams” — Nature, 2024

“LLMs are reliable medical assistants for the general public” — Nature Medicine, 2025

These studies share a common thread. They all evaluate AI under conditions that bear no resemblance to real clinical practice: standardized vignettes instead of actual patients, third-party evaluators instead of the care relationship, isolated text responses instead of longitudinal follow-up.

This is what’s known as the ecological validity deficit: results produced under controlled conditions fail to predict how the system will behave in its actual environment of use.

The question is not “does the AI work technically?” but “will clinicians and patients use it safely and effectively under real-world conditions of care?”

This is precisely the question addressed by Avishek Choudhury, a human factors researcher at West Virginia University, in a paper published in JMIR Human Factors in 2022. His conceptual framework explains why the gap between lab performance and clinical adoption exists — and which human factors determine it.


The Framework in 5 Minutes

The Core Insight

Rational models of clinical decision-making are not ecologically valid. They assume:

1. Perfect information — the clinician has access to all relevant data, in a clear format.

2. Ideal cognitive capacity — the clinician can process all available information without fatigue or overload.

3. Optimal trust — the clinician trusts the tool at exactly the right level, neither too much nor too little.

4. Unlimited resources — the clinician has all the time, training, and institutional support required.

None of these conditions hold in clinical practice. When a study tests AI under these ideal conditions, it doesn’t measure real-world performance — it measures theoretical maximum performance.

The Three Levels

Choudhury proposes analyzing clinical AI adoption through three nested levels:

| Level | Key Question | Factors |
|---|---|---|
| Governance | What regulatory and institutional framework governs usage? | Regulation, protocols, stakeholder accountability |
| Organization | Is the institution prepared to integrate AI responsibly? | Systemic resilience, clinician accountability, ecological validity of design |
| Individual | Will the clinician use AI safely and appropriately? | Trust, cognitive workload, situation awareness, bounded rationality |

What makes this framework remarkable is that it shows AI adoption doesn’t depend on technical performance alone. It depends on the entire ecosystem in which the clinician operates.

The Six Individual Variables

At the individual level — the most detailed part of the framework — six variables determine whether a clinician will actually use the AI:

SA — Situation Awareness

Does the clinician understand what the AI is doing, why it’s doing it, and in what context? Without this understanding, even a correct recommendation can be misinterpreted.

CW — Cognitive Workload

When a clinician is overloaded — five patients waiting, an urgent call, a complex case file — they lack the mental bandwidth to correctly interpret AI output. That’s precisely when they’re most tempted to rely on it blindly.

EX — Expectancy

What do they expect from the tool? If expectations are unrealistic (“the AI will solve my diagnosis”) or defeatist (“this thing is useless”), usage will be dysfunctional.

PA — Perceptions of AI

Their general attitude toward the technology: enthusiasm, distrust, indifference. These perceptions predate actual usage and strongly color it.

AC — Absorptive Capacity

Does the clinician have the minimum technical knowledge to understand what the AI is telling them? Paradoxically, the more complex the models (deep learning, LLMs), the less clinicians are equipped to assess their limitations — the classic black box problem.

BR — Bounded Rationality

Herbert Simon’s concept (Nobel Prize in Economics, 1978): humans are not optimizing machines. We make “good enough” decisions with available cognitive resources — not optimal ones. AI doesn’t eliminate this constraint. It displaces it.

These six variables converge toward two mediators: trust in AI and perceived patient risk. These are the two factors that ultimately determine whether the clinician will use the tool.
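
As a purely illustrative aid — this data model is not part of Choudhury's paper — the six-variables-into-two-mediators structure can be sketched in code. Every name, field, and threshold below is invented for the sketch:

```python
# Illustrative sketch only: NOT from Choudhury's paper. Field names
# paraphrase the article; `will_use` and its thresholds are invented.
from dataclasses import dataclass


@dataclass
class IndividualFactors:
    situation_awareness: float  # SA: grasp of what the AI does and why
    cognitive_workload: float   # CW: mental bandwidth at the point of use
    expectancy: float           # EX: realism of expectations about the tool
    perceptions_of_ai: float    # PA: prior attitude toward the technology
    absorptive_capacity: float  # AC: technical literacy to judge limitations
    bounded_rationality: float  # BR: "good enough" decisions under constraint


def will_use(trust_in_ai: float, perceived_patient_risk: float,
             trust_min: float = 0.6, risk_max: float = 0.4) -> bool:
    """Toy gate: usage requires sufficient trust AND tolerable perceived risk."""
    return trust_in_ai >= trust_min and perceived_patient_risk <= risk_max


# High trust but high perceived patient risk still blocks usage:
print(will_use(0.9, 0.7))  # False
```

The point the sketch makes is structural, not quantitative: the six variables matter only insofar as they shape the two mediators, and either mediator alone can block usage.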


What This Changes About Reading Studies

The Practitioner’s Reading Grid

When you read a study claiming “AI outperforms doctors,” Choudhury’s framework invites you to ask six questions the study probably doesn’t address:

| Overlooked Variable | Question to Ask | Consequence if Missing |
|---|---|---|
| Cognitive workload | Were the clinicians in the study working under real conditions (multitasking, time pressure)? | Measured performance is unrealistic |
| Trust | Did the users trust the AI, and was that trust calibrated? | No prediction of real-world usage |
| Accountability | Who is liable if the AI errs? Does the clinician know? | Hesitation or uncontrolled use |
| Situation awareness | Did the users understand what the AI was doing? | Misinterpretation of results |
| Bounded rationality | Does the study assume a rational, fully informed decision-maker? | Results only apply under ideal conditions |
| Patient safety | Does the study measure the consequences of errors, not just accuracy? | Risks remain invisible |
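
For readers who like their checklists executable, the grid can be encoded as a simple lookup — a purely illustrative sketch, not something proposed in the paper; the keys, question wording, and helper function are all invented here:

```python
# Illustrative only: the practitioner's reading grid as a checklist.
# Keys, questions, and `open_questions` are invented for this sketch.
READING_GRID = {
    "cognitive_workload": "Were clinicians working under real conditions?",
    "trust": "Did users trust the AI, and was that trust calibrated?",
    "accountability": "Is it clear who is liable if the AI errs?",
    "situation_awareness": "Did users understand what the AI was doing?",
    "bounded_rationality": "Does the study avoid assuming a fully informed, "
                           "rational decision-maker?",
    "patient_safety": "Does it measure the consequences of errors, "
                      "not just accuracy?",
}


def open_questions(study: dict[str, bool]) -> list[str]:
    """Return the grid questions a study leaves unaddressed (default: all)."""
    return [q for key, q in READING_GRID.items() if not study.get(key, False)]


# A typical benchmark-only study addresses none of the six:
print(len(open_questions({})))  # 6
```

The default-to-unaddressed design mirrors how the grid is meant to be used: a study earns credit for a variable only when it explicitly reports on it.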

Connecting with Hua’s Framework

This framework combines powerfully with Hua et al.’s three-tier framework (World Psychiatry, 2025), which classifies studies into three evidence tiers:

T1 — Bench testing: does the AI work technically? (vignettes, benchmarks, expert evaluations)

T2 — Feasibility: do users accept interacting with the system?

T3 — Clinical effectiveness: does the system improve patient health outcomes?

Hua’s striking finding: across 160 studies (2020–2024), LLMs account for 77% of T1 studies (bench testing) but only 16% of T3 studies (clinical effectiveness). The most hyped technologies are the least clinically validated.

How they complement each other: Hua’s framework tells you where a study sits on the validation pathway. Choudhury’s framework tells you why moving from one level to the next is anything but automatic — and which human and organizational factors stand in the way.


What’s Solid in This Proposal

1. Robust theoretical foundations

The framework builds on models validated over decades in cognitive ergonomics: the Technology Acceptance Model (Davis, 1989), the situation awareness model (Endsley, 1995), the SEIPS framework for patient safety (Carayon, 2006). This is not an ad hoc construction — it’s a rigorous synthesis.

2. Interaction is conceived as bidirectional

Unlike most frameworks that assume “AI outputs → clinician receives,” Choudhury recognizes that AI learns from clinician inputs (reinforcement learning). The quality of the interaction determines the system’s future performance — a cycle that can be virtuous or vicious.

3. It names what clinicians feel intuitively

“I don’t trust this tool,” “I don’t understand how it works,” “I don’t know who’s responsible if it goes wrong.” These aren’t irrational resistance to progress — they’re legitimate human factors that Choudhury’s framework identifies and systematizes.

4. It operationalizes ecological validity

Rather than simply stating “we need real-world testing,” the framework identifies the specific variables to measure. That’s the difference between “we should evaluate better” and “here are the six factors that determine whether your lab results predict real-world usage.”


The Limitations — and Why They Matter

1. Descriptive, not prescriptive

The framework identifies factors and their relationships. It doesn’t tell you how to optimize them. It’s as if a clinician identified the risk factors for depression without proposing a treatment protocol. The diagnosis is useful, but it calls for an action plan.

2. Limited empirical validation

The framework has been supported by a single survey (265 U.S. clinicians). That’s a starting point, not proof. Longitudinal and cross-cultural validation remains to be done.

Clinical analogy: it’s like proposing an etiological model for a disorder on the basis of a single cross-sectional study. The model may be correct, but caution is warranted.

3. The patient blind spot

This is the most significant limitation for our field. The framework focuses on the clinician–AI interaction. It doesn’t address the patient–AI interaction — which is precisely the use case for therapeutic chatbots, between-session monitoring apps, and every tool where the patient interacts directly with AI without clinician mediation.

For psychotherapy: Choudhury’s variables (trust, cognitive load, bounded rationality) also apply to patients — but with particularities tied to psychological vulnerability, transference, and the therapeutic alliance. An extension of the framework for direct patient–AI interaction remains to be built.

4. Not all AIs are created equal

The framework treats “clinical AI” as a monolithic block. But a diagnostic decision-support system, a therapeutic chatbot, and a session transcription tool don’t pose the same adoption, trust, or accountability challenges — as we discussed in our article on the distinctions between AI, chatbot, LLM, and application. The human factors vary considerably depending on the type of tool.


Our Take

Choudhury’s framework is a valuable thinking tool for any clinician who wants to appraise AI health studies with discernment. Its main contribution: transforming a vague intuition (“these studies don’t reflect reality”) into a structured analytical grid with identifiable variables.

Combined with Hua’s framework, it offers a two-level reading: Hua tells you what type of evidence a study provides (benchmark, feasibility, clinical effectiveness), and Choudhury tells you why evidence from one level doesn’t automatically transfer to the next.

For AI-assisted psychotherapy, we retain three lessons:

1. Don’t be impressed by benchmarks

An LLM that “outperforms doctors” on standardized vignettes has cleared a technical hurdle. It hasn’t demonstrated that it will improve your patients’ health under the real-world conditions of your practice — with its cognitive load, time constraints, and legal responsibilities.

2. Your “resistance” to AI may be clear-sightedness

If you don’t trust an AI tool, don’t understand how it works, or don’t know who bears responsibility when it errs — these aren’t irrational resistance to progress. They’re legitimate human factors that research identifies as determinants of safe adoption.

3. Demand real-world studies

When a vendor, insurer, or administrator pitches you an AI tool citing studies, ask: did this study measure clinical outcomes with real patients, under real-world practice conditions, over a meaningful duration? If the answer is no, the tool isn’t “proven” — it’s “promising.” The difference matters.


Reference analyzed: Choudhury, A. (2022). Toward an Ecologically Valid Conceptual Framework for the Use of Artificial Intelligence in Clinical Settings: Need for Systems Thinking, Accountability, Decision-making, Trust, and Patient Safety Considerations in Safeguarding the Technology and Clinicians. JMIR Human Factors, 9(2), e35421. https://doi.org/10.2196/35421

Further reading:

  • Hua, Y., Siddals, S., Torous, J. et al. (2025). Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: a systematic review. World Psychiatry, 24(2). https://doi.org/10.1002/wps.21352
  • Choudhury, A. & Asan, O. (2023). Impact of Accountability, Training, and Human Factors on the Use of Artificial Intelligence in Healthcare. Human Factors and Ergonomics Society Best Article Award 2024.
  • Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319-340.
  • Endsley, M. R. (1995). Toward a theory of situation awareness in dynamic systems. Human Factors, 37(1), 32-64.

Keywords

ecological validity · human factors · AI evaluation · trust · clinical adoption · accountability