Veille IA

The Stade 2024 framework for integrating LLMs in psychotherapy — and Garczynski's matrix formalisation

Elizabeth Stade and her team (Stanford, Penn, Johns Hopkins) propose a framework for thinking about the integration of LLMs in psychotherapy, articulated around three tiers of autonomy (assistive, collaborative, autonomous). Luc Garczynski (UdeM, 2026) formalises its applications into five axes and adds two unprecedented empirical categories.

Why this framework, why now

Since 2023, the discussion about LLMs in psychotherapy has been spinning between two symmetrical poles. On one side, the enthusiastic announcements: AI will democratise access to care, increase fidelity to protocols, free clinicians from time-consuming tasks. On the other, the warnings: chatbots could trigger suicide attempts, manufacture compliance, degrade the therapeutic alliance. Both camps cite studies — often the same ones — to draw opposite conclusions.

This stalemate stems from a lack of shared vocabulary. As long as we talk about “AI in psychotherapy” as a single undifferentiated block, we can neither evaluate rigorously, deliberate publicly, nor legislate prudently. Before debating, we need to distinguish: which use, at which level of autonomy, under which conditions?

This is precisely what the framework proposed in April 2024 by Elizabeth Stade and her team in npj Mental Health Research begins to do. It does not settle the debate. It equips deliberation by proposing a three-tier autonomy continuum and by illustrating the clinical application domains of LLMs. In this article, we first present the framework as it was published, then we show how Luc Garczynski formalised it into an operational matrix in his empirical work (UdeM, 2026) — an original contribution that clarifies, extends, and reveals the limits of the theoretical framework.

Structure of this article

  • Part I — The Stade 2024 framework as published: three tiers of autonomy, applications, principles, strengths and limits.
  • Part II — Luc Garczynski’s contribution: formalisation into five axes, operational matrix, empirical findings.
  • Part III — Articulation with other evaluation frameworks (Hua, Choudhury, READI).

I. The Stade 2024 framework

Where it comes from

The framework was published in April 2024 under the title Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation, in npj Mental Health Research, a partner journal of Nature. The team is interdisciplinary and institutionally heavyweight: Elizabeth Stade (Stanford HAI) as first author, Shannon Wiltsey Stirman (Stanford Psychiatry, CBT implementation science), Lyle Ungar (Penn, NLP), Robert DeRubeis (Penn Psychology, CBT), Johannes Eichstaedt (Stanford Psychology, World Well-Being Project), among others.

The opening claim is explicit, and it sets the tone:

“Clinical psychology is an uncommonly high stakes application domain for AI systems.”

— Stade et al. (2024), npj Mental Health Research

In other words: psychotherapy is not an application domain like any other. Errors are less visible than in radiology, but their consequences can be just as severe — and much longer-lasting. This severity authorises neither blanket rejection (which deprives us of potentially beneficial uses) nor generic enthusiasm (which deprives us of necessary safeguards). It demands a pedagogy of distinction.


The three tiers of autonomy

This is the explicit structuring contribution of Stade et al. The analogy is openly assumed — that of autonomous vehicles. Just as we do not deploy a driverless car without first having tested parking assistance, we do not deploy an LLM in psychotherapy without first having validated its most tightly framed use.

T1

Assistive — Machine in the Loop

The LLM performs delimited tasks under permanent human control. Each output is validated by the clinician before use. The clinician remains the sole decision-maker; the LLM augments capacity without ever substituting for it. This is the tier the authors recommend as an entry point for any clinical use, including low-risk administrative tasks.

T2

Collaborative — Human in the Loop

The LLM participates in shared reasoning with the clinician, who retains the final decision. The model suggests therapeutic options, formulations, intervention plans; the clinician selects, adapts, and assumes responsibility. This tier implies increased confidence in the model’s outputs and raises specific questions of traceability (who is responsible for a suggestion that is taken up?) and of therapeutic alliance (how does the patient perceive a plan co-constructed with an LLM?).

T3

Autonomous — without direct clinical supervision

The LLM operates without direct clinical supervision for delimited tasks. This is the highest-risk tier, reserved for use cases that have undergone rigorous empirical evaluation and regulatory validation.

The authors present this as a conditional theoretical horizon, not as an immediate recommendation. It is precisely against premature deployment at this tier that the entire framework is built.

The progression between tiers is not linear: the authors note that some more structured and protocolised interventions (CBT for insomnia, exposure for specific phobia) could reach the collaborative tier more quickly than flexible or personalised interventions. Tier 3 (fully autonomous) remains a horizon whose very legitimacy is debated — we return to this in the limits section.


Clinical applications as Stade describes them

The original paper does not propose a numbered taxonomy. The applications presented are organised by type of task and by target audience (clinician, patient, trainee, supervisor, peer support worker). Here are the main ones, faithfully grouped:

Administrative work and documentation

Drafts of progress notes, session summaries, chart reports, billing assistance. This is the lowest-risk clinical use and the most immediately operational. Target: clinician.

Treatment fidelity measurement

The LLM automatically derives adherence and competence scores from session transcripts. Stade et al. note that this measure is crucial for the development and dissemination of EBPs, but it remains costly and unreliable when done by humans. Target: researcher, supervisor.

Feedback on therapeutic homework

The LLM provides real-time feedback on the patient’s CBT exercises (cognitive restructuring, thought records). Target: patient.

Trainee training and supervision

The LLM identifies the strengths and weaknesses of trainee interventions from session recordings. It can also serve as an empathy aid for peer support workers (cf. work by Sharma et al.). Target: trainee, peer support worker.

Between-session support

Real-time support outside of appointments: help with therapeutic homework, management of mild to moderate distress, personalised psychoeducation. This is the most relevant application for access in underserved areas, but also the one that demands the strictest risk-detection protocols. Target: patient.

The paper also provides a Table 2 (Imminent possibilities for clinical LLMs) that crosses concrete tasks with examples of LLM input/output. These applications illustrate a spectrum of possibilities whose risk level varies according to the chosen autonomy tier.

Two concrete examples

TherapyTrainer (Stade et al. 2025) — an experimental system using an LLM to automatically score a therapist’s fidelity to a CBT protocol from transcripts. The LLM flags departures from the manual but makes no decisions about the therapist’s training: the human supervisor interprets the signal and decides what comes next. Archetype of assistive use (T1).

Woebot — a consumer application guiding the user through CBT exercises between sessions. On paper, it positions itself in T1 assistive. In practice, without a clinician in the loop in real time, the boundary with T3 autonomous becomes thin. The Stade framework provides here the evaluation criteria to apply: measured clinical effectiveness, suicide risk detection, equity of access, absence of cognitive bias reinforcement. These criteria remain unevenly satisfied, and that is precisely what the framework allows us to say with rigour.


Guiding principles (editorial extraction)

The paper contains a “Recommendations” section that articulates cross-cutting principles, from which we have extracted the following five operational principles:

1

Human clinical judgement remains central at every tier

Including at T3, autonomous deployment does not eliminate clinical judgement: it shifts it upstream (protocol design, case selection, monitoring) rather than downstream (session-by-session validation).

2

Empirical evaluation is mandatory before any tier progression

One does not move from an assistive use case to a collaborative one without a prior empirical demonstration of safety and efficacy at the previous tier. This sequential progression is explicitly modelled on the phases of pharmacological clinical trials.

3

Engagement is not an appropriate training criterion

This is an implicit but frontal critique of RLHF as practised in consumer LLMs, and of the sycophancy it manufactures.

Optimising a clinical LLM for user engagement is reproducing the pathologies of social media in the field of care.

4

Vulnerable populations require specific safeguards

Suicide risk, delusional content, psychotic disorders: these situations are not edge cases to be handled after the fact. They demand dedicated protocols upstream of deployment, at every autonomy tier.

5

Equity is a central evaluation criterion, not an optional corrective

Racial, socioeconomic, linguistic, and cultural equity must be integrated from the evaluation stage, not added on top. An LLM that performs well on average for a WEIRD population but fails on populations underrepresented in the training data does not satisfy the framework.


What is solid in the framework

1

An immediately usable autonomy continuum

Three tiers, a memorable analogy (the autonomous vehicle), a simple principle (moving up in autonomy requires empirical proof): a clinician can use it without prior training to situate a contemplated use.

2

An implicit but central critique of RLHF

The principle “engagement is not an appropriate training criterion” is, in short form, one of the firmest positions found in the literature on clinical LLMs. Stade et al. name what many technical articles avoid: consumer LLMs are trained to please, and pleasing is not caring.

3

A reference that now structures recent empirical work

Since its publication, the Stade framework has become the default coding grid for qualitative studies that examine the clinical uses of LLMs. This reference status is not a guarantee of quality, but it is a fact to integrate: reading an article published after 2024 without knowing the Stade framework often means missing its analytical architecture.

4

Assumed refusal of binarism

The framework explicitly takes a stance against the two sterile positions (technophobic rejection / uncritical enthusiasm). This posture is not a soft compromise: it is the condition of an informed technical deliberation.


The limits of the original framework

1

Under-theorisation of the therapeutic relationship

The framework treats the therapeutic alliance as one evaluation criterion among others (to be measured), rather than as a condition of ethical possibility. Critiques grounded in the ethics of care (Malouin-Lachance et al. 2025 on the digital therapeutic alliance) emphasise that the relationship is not measurable as an independent variable: it is constitutive of care. The Stade framework creates the vocabulary to ask the question of alliance with a device; it does not answer it.

2

Unresolved tension with population-level deployment

Faced with the global mental health access crisis, the sequential and prudent approach of the Stade framework can be perceived as a luxury for high-income countries with functional health systems. Rousmaniere et al. (Lancet Psychiatry, 2025) defend the opposite thesis: population-level deployment is already underway de facto, and the question is no longer whether to deploy at the autonomous tier but how to make this deployment less harmful. The Stade framework has no answer to this objection — it is even its counter-model. We share this observation: in our own work (Ferry & Malo, ongoing series on the analysis of patient-LLM conversations), we pose exactly this question in concrete terms: how to forge the theoretical and practical instruments to study these uses, frame their risks, and identify the possible synergies with human psychotherapy.

3

A map without milestones

The framework says that progression from one autonomy tier to the next requires empirical evaluation — but it does not specify which evaluation, on which criteria, for how long, with which comparator. It is a roadmap whose milestones remain to be written. It is precisely this operational void that READI (Stade et al. 2025), the direct extension of the framework, begins to fill with its six pre-deployment evaluation criteria.

4

Limited cultural generalisation

The framework is produced in an American context (Stanford, Penn, VA Palo Alto) and its deployment examples presuppose health systems, clinical practices, and English-speaking populations. Its direct applicability to the Quebec context (OPQ, Law 25), European context (GDPR, cultural specificities), or broader Francophone context (clinical terminology, validated resources) remains under-theorised. It will need to be adapted, not just translated.

5

Scattered applications, not a taxonomy

The paper illustrates a wide spectrum of applications but does not organise them into systematic categories. The reader retains the three autonomy tiers — clear and memorable — but is left with an unranked list of applications. It is precisely this gap that Garczynski’s work fills.


II. Luc Garczynski’s contribution — from theoretical framework to clinical field

First Francophone empirical reprise

In 2026, Luc Garczynski (Université de Montréal, PSY6008) conducts the first qualitative Francophone study that uses the Stade framework as analytical architecture. Interpretive Descriptive Research (IDR, Thorne 2016) — semi-structured interviews with four CBT psychologists in private practice in Quebec. The contribution is threefold.


Formalisation into five application axes

The text by Stade et al. presents scattered applications, organised by type of task and target audience. Garczynski consolidates them into five application axes to structure his interview guide and his deductive codebook. This is an original analytical contribution — not a mere translation of the framework.

A1

Administrative and organisational work

Clinical documentation, session synthesis, draft progress notes, billing support, compliant record-keeping. This is the domain of lowest direct clinical risk, and empirically the most adopted in private practice — outputs go through human review before being filed in the chart.

A2

Professional training and fidelity to evidence-based practices

The LLM serves as a simulation partner for therapists in training: clinical role-play, fidelity measurement to manualised protocols (CBT, DBT, exposure), immediate feedback on competence. This axis directly addresses the diffusion crisis of EBPs: validated practices remain largely under-used due to the inaccessibility of supervision. TherapyTrainer (Stade et al. 2025) is an operational example.

A3

Production of patient-facing content

The LLM generates or personalises therapeutic materials intended for the patient: cognitive restructuring exercises, psychoeducation tailored to literacy level, mood tracking sheets, mindfulness materials. Personalisation is the distinctive advantage; it requires rigorous output control to avoid problematic phrasing and theoretical drift.

A4

Clinical decision support

Case conceptualisations, differential diagnostic hypotheses, treatment plan adjustments. This is the axis of highest clinical complexity: the clinician’s judgement remains the deciding factor, but the LLM expands the surface of hypotheses explored. This axis requires the lowest possible autonomy tier — never autonomous. It is also the privileged terrain of clinical sycophancy, where the LLM may validate a premature hypothesis from the clinician rather than offering divergent alternatives.

A5

Between-session support

Real-time support between appointments: assistance with CBT homework, management of mild to moderate distress situations, recall of tools learned in session. This is the most relevant axis for access in underserved areas, but also the one that requires the strictest protocols for risk detection (suicidal ideation, acute crisis) and human escalation.


The axes × tiers matrix

Crossing Garczynski’s five axes with Stade’s three tiers yields a 5 × 3 matrix in which each cell corresponds to a type of use whose evaluation and safeguards are discussed separately. Not all uses are legitimate: some cells (e.g. diagnostic decision support in autonomous) are non-recommended by default.

Application axis (Garczynski)T1 AssistiveT2 CollaborativeT3 Autonomous
A1 — AdministrativeRecommendedPossibleNot recommended
A2 — Training / EBPRecommendedPossibleTo be evaluated
A3 — Patient contentRecommendedPossible with validationTo be evaluated
A4 — Clinical decisionPossibleWith cautionNot recommended
A5 — Between sessionsPossibleWith cautionConditional on clinical trial
Editorial synthesis. The A1–A5 axes are Garczynski’s formalisation (UdeM, 2026) drawn from the applications scattered through Stade 2024. The T1–T3 tiers are those of Stade. The recommendation levels are inferred from the original text and Garczynski’s empirical study, not directly stated by the authors of Stade 2024.

Inductive categories — what the field reveals

Beyond the deductive coding drawn from Stade, Garczynski identifies two emergent phenomena absent from the original framework. These are, for us, the most precious findings — those that designate the limits of the theoretical framework.

1

Emotional reassurance — clinical signature of sycophancy

The clinicians interviewed use the LLM in moments of clinical doubt to validate that their reasoning is consistent with CBT models. The LLM, structurally trained to produce pleasing responses (sycophancy), almost systematically confirms the clinician’s hypothesis — which reduces uncertainty and emotional load, but may also reinforce confirmation biases. The Stade framework does criticise engagement optimisation, but does not explicitly thematise sycophancy as a risk for the clinician themselves. It is Garczynski who provides the first empirical documentation.

2

The professional taboo — absent sociological dimension

The psychologists describe a discomfort discussing their LLM use with peers and supervisors, for fear of complaints to the Order of Psychologists of Quebec (OPQ) and of professional judgement. This taboo is not anecdotal: it blocks the collective construction of usage norms and produces an underground deployment, unsupervised and undocumented. This is a sociological dimension of deployment entirely absent from the Stade framework — which presupposes a professional environment open to discussion and collaborative regulation.

Risk of methodological circularity: when a taxonomy becomes the coding framework of the empirical studies that mobilise it, those studies tend to confirm the framework’s relevance without testing it independently. The inductive categories that emerge despite the framework (emotional reassurance, professional taboo) are therefore the most precious — they are the ones that designate the blind spots, and thus the boundaries of the framework.

Read our article presenting Garczynski’s study


III. Articulation with other evaluation frameworks

The Stade framework does not stand alone. It articulates in a complementary way with other frameworks we have decrypted on this site:

Hua et al. 2025 (T1/T2/T3) tells you where a study sits in the validation pathway. The Stade framework tells you which use that study evaluates (and Garczynski’s matrix tells you in which cell). Crossed, these frameworks let you answer the question: “Does this study prove anything for the use case I am interested in?”

Our decryption of the Hua framework

Choudhury 2022 on ecological validity tells you why the results of a T1 study do not mechanically translate into actual clinical benefit — and which human factors (trust, cognitive load, accountability) must be taken into account. The Stade framework does not thematise these factors; Choudhury fills this gap.

Our decryption of the Choudhury framework

READI (Stade et al. 2025), a direct operational extension of the 2024 framework, adds six pre-deployment evaluation criteria: safety, privacy, equity, effectiveness, engagement, implementation. Where the 2024 framework poses the what (which use, at which autonomy tier), READI provides the how (on which criteria to audit before deployment). It is the most direct complement.

Our decryption of the READI framework

The set Hua + Choudhury + Stade + Garczynski + READI now forms the most complete methodological core for thinking about LLM evaluation in mental health. None of these frameworks is sufficient alone. They complete each other.


What it changes for you, clinicians

1

Use the matrix before debating

Before debating “AI in therapy”, place the contemplated use on the matrix. Which axis? Which tier? The discussion that follows is immediately more precise. This is a defusing protocol for sterile polarisations, and it works as well in a team meeting as in a letter to a professional order.

2

Do not confuse “structured by Stade” and “validated by Stade”

The framework provides a coding grid; it does not evaluate. A study that “applies the Stade framework” does not become more solid for it. Always ask the classic questions: which level of evidence (T1/T2/T3 per Hua), which comparator, which population, which duration, which primary criteria.

3

Work the blind spots rather than the matrix

The matrix is solid as a framing tool; the research stake is elsewhere. Clinical sycophancy, digital alliance, pragmatic population-level deployment, adaptation to Francophone contexts: it is on these margins that original contributions are possible. This is the strategy we adopt in our collaboration with Luc Garczynski.


Reference analysed: Stade, E. C., Stirman, S. W., Ungar, L. H., Boland, C. L., Schwartz, H. A., Yaden, D. B., Sedoc, J., DeRubeis, R. J., Willer, R., & Eichstaedt, J. C. (2024). Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation. npj Mental Health Research, 3(1), 12. https://doi.org/10.1038/s44184-024-00056-z — Open access.

Further reading:

  • Stade, E. C. et al. (2025). Readiness Evaluation for Artificial Intelligence-Mental Health Deployment and Implementation (READI): A Review and Proposed Framework. Technology, Mind, and Behavior — operational extension with six evaluation criteria.
  • Rousmaniere, T. et al. (2025). Large-scale implementation of AI-based psychotherapy. Lancet Psychiatry — alternative position defending population-level deployment.
  • Sharma, M. et al. (2023). Towards Understanding Sycophancy in Language Models. ICLR 2024 — origin of the structural critique of RLHF.
  • Malouin-Lachance, A. et al. (2025). Does the Digital Therapeutic Alliance Exist? Integrative Review. JMIR Mental Health — critique grounded in the ethics of care.

On this site:


Partager