Veille IA

AI, chatbot, LLM, app: why we need to stop conflating everything

| Matthieu Ferry ⇄ IA

When a study mentions a 'therapeutic chatbot', are we talking about a scripted decision tree or a fine-tuned GPT-4? This terminological blur is far from trivial: it makes studies incomparable and public debate unintelligible.

The problem in one sentence

When a news article announces that an “AI chatbot has shown results comparable to a human therapist,” what exactly are we talking about? A system like ELIZA (1966), which mechanically rephrased patients’ sentences? A CBT decision tree guided by predefined scripts? Or a language model like GPT-4, capable of generating novel contextual responses?

In both scientific literature and media, these three realities are routinely referred to by the same term: “therapeutic chatbot.” It’s as if medicine made no distinction between aspirin, chemotherapy, and surgery on the grounds that all three “heal.”

This terminological blur is not a linguist’s quibble. It has direct consequences on our ability to evaluate tools, compare studies, and have an informed debate.


Four levels that must be distinguished

Level 1 · Artificial Intelligence (AI), the umbrella concept

AI is a generic term designating a set of computational techniques that automate tasks traditionally requiring human intelligence. It is an umbrella concept, not a specific technology.

Under this umbrella coexist radically different approaches:

  • Symbolic AI: explicit rules, expert systems (this is what ELIZA did)
  • Machine Learning (ML): the system learns from training data (depression detection via voice analysis, suicide risk prediction)
  • Deep Learning: multi-layer neural networks (including LLMs)

Key takeaway: Saying an app “uses AI” is about as informative as saying a medication “uses chemistry.” Technically true and practically useless.

Level 2 · LLM, a precise technical architecture

LLMs (GPT-4, Claude, Llama, Gemini) are a specific subtype of AI. They rely on the Transformer architecture (Vaswani et al., 2017), are trained on massive text corpora, and generate text by predicting the next token in a sequence.

What distinguishes them from previous approaches:

  • No predefined scripts: each response is generated on the fly
  • Emergent capabilities: analogical reasoning, context adaptation, reformulation
  • Probabilistic: responses are sampled, so the same question can produce different answers, with variability modulated by the temperature setting

This architectural shift is why conversing with ChatGPT in 2026 is a qualitatively different experience from interacting with Woebot in 2017.
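The "probabilistic" point above can be made concrete. The sketch below uses toy logits (not a real model) to show temperature-scaled sampling, the mechanism by which the same prompt can yield different completions: low temperature sharpens the distribution toward one token, high temperature flattens it.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float,
                      rng: random.Random) -> str:
    """Sample one token from a toy logit distribution at a given temperature."""
    # Scale logits by 1/temperature: low T sharpens, high T flattens the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax (numerically stabilized by subtracting the max).
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Draw a token according to those probabilities.
    r = rng.random()
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r < cum:
            return tok
    return tok  # fallback for floating-point rounding

# Toy logits for the continuation of "I feel ..."
logits = {"anxious": 2.0, "fine": 1.5, "overwhelmed": 0.5}
rng = random.Random(0)
cold = [sample_next_token(logits, 0.1, rng) for _ in range(5)]  # near-deterministic
hot = [sample_next_token(logits, 2.0, rng) for _ in range(5)]   # more varied
```

At temperature 0.1 the top token wins almost every draw; at 2.0 the alternatives appear regularly. Real LLMs do exactly this, only over a vocabulary of tens of thousands of tokens.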

The versioning trap

Behind a single commercial name, the actual model changes constantly. “GPT-4” in 2026 designates a very different model from the GPT-4 of March 2023. These updates often happen quietly: the exact date or sub-version of the model (GPT-4-0613, GPT-4-turbo-2024-04-09) is not always visible in the consumer interface. A user — or a researcher — may believe they are using “the same model” from one month to the next while performance has significantly changed.

Even more subtle: the current trend toward Mixture of Experts (MoE) architectures means that a single model name — for example “GPT-5” — may actually involve dynamic routing to different sub-models depending on the estimated complexity of the prompt. Your simple question and your complex question are not necessarily processed by the same model, even though the interface displays only one name.

Consequence for research: LLM performance improves so rapidly that a study conducted on version n can be rendered virtually obsolete by version n+1, released a few months later. Combined with opaque versioning and MoE routing, this poses a major problem for traceability and reproducibility of results.
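One partial safeguard is for studies to log exact model provenance rather than the commercial name alone. A minimal illustrative sketch (field names are ours, not any vendor's API) of why the commercial name is insufficient:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ModelRecord:
    """Minimal provenance record a study could log per session (illustrative)."""
    commercial_name: str  # what the interface displays, e.g. "GPT-4"
    version_id: str       # exact dated sub-version, e.g. "gpt-4-0613"
    accessed_on: date     # when the session actually ran

record_early = ModelRecord("GPT-4", "gpt-4-0613", date(2023, 6, 20))
record_late = ModelRecord("GPT-4", "gpt-4-turbo-2024-04-09", date(2024, 5, 2))

# Same commercial name, different underlying model: without the version_id,
# the two sessions would look identical in the study's methods section.
same_name = record_early.commercial_name == record_late.commercial_name   # True
same_model = record_early.version_id == record_late.version_id            # False
```

Logging the dated sub-version does not solve MoE routing opacity, but it at least makes the drift visible when results fail to replicate.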

Level 3 · Chatbot, an interface, not an intelligence

The chatbot is a conversational interface — the frontend, not the backend. A chatbot can be powered by very different technologies:

| Period | Backend technology | Example |
|---|---|---|
| 1966 | Scripted pattern-matching | ELIZA |
| 2000–2015 | Decision trees + NLU | Early Woebot, Talkspace bot |
| 2020+ | LLM (GPT, Claude) | Woebot 2024, Wysa 2023 |

Same term, radically different technologies.

When a meta-analysis pools studies on “therapeutic chatbots” without distinguishing the backend architecture, it aggregates interventions as different as a phone call and a handwritten letter — on the grounds that both “use words.”

Level 4 · Mental health app, an ecosystem, not a chatbot

A mental health application may integrate a chatbot, an LLM, both, or neither:

| Type | Example | AI? |
|---|---|---|
| Guided meditation, fixed sequences | Headspace | No |
| ML-based recommendations on mood history | Daylio | Classical ML |
| CBT exercises guided by decision tree | Woebot 2017 | Scripted chatbot |
| Open conversation with generative model | Wysa 2024 | LLM chatbot |
| Chatbot + digital phenotyping + EMA + human supervision | mindLAMP | Hybrid |

Caution: “AI mental health app” has become a marketing label. A simple adaptive questionnaire may now describe itself as a “therapeutic AI chatbot” to capture the symbolic capital of the term.


Why this changes everything for clinicians

Same engine, different experiences

Several apps use GPT-4 but produce very different results, depending on:

  1. The system prompt: hidden instructions that frame responses (e.g., “You are a caring CBT coach” vs. “You are a general-purpose conversational assistant”)
  2. Fine-tuning: has the model been retrained on clinical data?
  3. Guardrails: safety filters, suicidal risk detection, human escalation protocols
  4. Hybrid architecture: is the LLM standalone or supplemented by rules, knowledge bases (RAG), digital phenotyping?

Saying “this app uses GPT-4” is not enough to characterize its clinical functioning. It’s like saying “this medication contains paracetamol” without specifying the dosage, formulation, and interactions.
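To make the point concrete, here is an illustrative sketch of two hypothetical app configurations sharing the same engine. All field names are invented for illustration, not any real vendor's API:

```python
# Two hypothetical apps calling the same backend model.
# Every key below is illustrative, not an actual API parameter.
base_request = {
    "model": "gpt-4",   # identical engine in both apps
    "temperature": 0.7,
}

cbt_app = {
    **base_request,
    "system_prompt": "You are a caring CBT coach. Never give medical advice.",
    "fine_tuned_on_clinical_data": True,
    "guardrails": ["crisis_keyword_detection", "human_escalation"],
}

generic_app = {
    **base_request,
    "system_prompt": "You are a general-purpose conversational assistant.",
    "fine_tuned_on_clinical_data": False,
    "guardrails": [],
}

# Identical engine, divergent clinical behavior:
# the intervention lives in the configuration, not the model name.
same_engine = cbt_app["model"] == generic_app["model"]                        # True
same_intervention = cbt_app["system_prompt"] == generic_app["system_prompt"]  # False
```

Two studies could both truthfully report "uses GPT-4" while evaluating interventions as different as these two configurations.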

The research problem: technological under-reporting

Most clinical trials (RCTs) on therapeutic chatbots do not specify:

  • The backend architecture (scripted? ML? LLM?)
  • The exact model if LLM (GPT-3.5 vs GPT-4 = major differences)
  • The precise version and date of the model used
  • The guardrails and safety systems
  • The fine-tuning applied

Consequence: impossible to compare results between studies, impossible to replicate. A systematic review that indiscriminately pools studies on scripted chatbots and LLM chatbots produces conclusions as reliable as a clinical trial mixing homeopathy and antibiotics under the label “medications.”

The APA mental health app evaluation model provides a useful first framework, but it would benefit from integrating a mandatory technical section specifying the AI architecture used.


ELIZA, the example that says it all

In 1966, Joseph Weizenbaum created ELIZA — a program that simulated a Rogerian therapist by reformulating the user's own sentences (patient: “My mother worries me” → ELIZA: “Tell me more about your mother”). Technically: keyword-based pattern-matching, zero learning, zero understanding.

The result astonished Weizenbaum himself: users genuinely confided in ELIZA, and some therapists proposed using it as a therapeutic substitute. Horrified, Weizenbaum became one of the first critics of therapeutic AI.

What is striking: the psychological mechanisms at work with ELIZA (projection, attribution of intentionality) are identical to those at work with ChatGPT. But the technologies are incomparable. Evaluating “therapeutic chatbots” as a homogeneous category means ignoring this fundamental difference.

Going further: the concepts of anthropomorphism, HADD, and parasocial relationships explain why we attribute human qualities to these systems — regardless of their technical complexity. The concept of the Turing test illuminates the question of the boundary between simulation and understanding.


In practice: the questions to ask

Before recommending, advising against, or commenting on an app or a study, here is a minimal filter:

1. What type of AI? Symbolic (rules), classical ML (statistical learning), or LLM (language model)? The answer changes everything: risks, benefits, and mechanisms of action are different.

2. If chatbot: what architecture? Scripted (predefined responses), intent-based (intent detection + response model), or generative (LLM)? A scripted chatbot does not raise the same ethical questions as an LLM chatbot.

3. If LLM: which one, when, and how? The model (GPT-4, Claude, Llama), the exact version and date, any fine-tuning (clinical data?), and the guardrails (crisis detection, human escalation). Did the study verify model stability throughout the protocol duration?

4. What integration? Is the AI standalone or integrated into an ecosystem (EMA, digital phenotyping, clinician supervision)? A standalone LLM and an LLM supervised by a clinician are two different interventions.

5. What comparison group? Does the study compare the chatbot to a human therapist, another chatbot, a waiting list, or nothing? The choice of comparator radically changes the interpretation of results.
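The five questions above can be condensed into a minimal reporting schema. The sketch below is illustrative; the field names are our own, not an established reporting standard:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChatbotStudyReport:
    """Minimal technical fields an RCT on a therapeutic chatbot could report.
    Illustrative schema mapping onto the five questions above."""
    ai_type: str                # "symbolic" | "classical_ml" | "llm"
    architecture: str           # "scripted" | "intent_based" | "generative"
    model_name: Optional[str]   # e.g. "GPT-4"; None if not LLM-based
    model_version: Optional[str]  # exact dated sub-version, if LLM-based
    fine_tuned: Optional[bool]  # retrained on clinical data?
    guardrails: list[str]       # e.g. crisis detection, human escalation
    integration: str            # "standalone" | "ecosystem"
    comparator: str             # "human_therapist" | "other_chatbot" | "waitlist" | "none"

    def is_fully_specified(self) -> bool:
        # An LLM-based study must pin the exact model and version to be replicable.
        if self.ai_type == "llm":
            return self.model_name is not None and self.model_version is not None
        return True

# A pattern common in the current literature: the model is named, the version is not.
report = ChatbotStudyReport(
    ai_type="llm", architecture="generative",
    model_name="GPT-4", model_version=None,
    fine_tuned=False, guardrails=["crisis_detection"],
    integration="standalone", comparator="waitlist",
)
```

A report that cannot fill these fields is, by the article's own argument, not comparable with other studies and not replicable.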


Conclusion

The terminological blur surrounding AI in mental health is not a vocabulary problem. It is an epistemological obstacle that prevents informed debate, study comparison, and rigorous tool evaluation.

As clinicians, we have a responsibility: to demand precision. Not out of technical purism, but because our patients deserve to know exactly what we’re offering them — or what we’re warning them about.

Saying “AI in therapy” without specifying which AI is like saying “the medication” without specifying which one. And no clinician would prescribe “a medication” without knowing what it contains.

Keywords

terminology · chatbot · LLM · mental health apps · epistemology · research