Over the past two weeks, I’ve been following a number of discussions around recent research by Anthropic — the company behind the Claude language model family. These discussions have played out not only in academic circles, but also across social media platforms and YouTube, with science communicators offering their own takes on what the research means.

I noticed that many of these interpretations diverge significantly, sometimes dramatically, from what the papers actually say (and — by the way — from each other). In particular, some claims struck me as exaggerated or speculative, even though they were presented with great confidence and strong rhetorical framing. Others, while more cautious, also made far-reaching conclusions that aren’t necessarily supported by the evidence.

So I decided to take a closer look at the primary sources—the papers themselves—and compare them to some of the circulating interpretations online. My aim here is to share how I read the papers, highlight where I believe common misunderstandings have occurred, and offer a technically grounded perspective that avoids hype, but also doesn’t dismiss the progress that has clearly been made.

More specifically, this post focuses on the following:

A structured recap of the two Anthropic papers and what they actually demonstrate.
A comparison of two popular YouTube interpretations (by Matthew Berman and Sabine Hossenfelder), which offer almost opposing takes.
A short excursion on LLMs and how they work (and why)
My own analysis of why certain assumptions — about self-awareness, internal reasoning, and “honesty” in LLMs — may not be warranted based on the research.

In writing this, I’m not claiming to offer the final word. But I do think it’s important to take a step back from the soundbites and really ask: What do these findings actually show? And what might they not show?

If you’ve found yourself wondering whether LLMs are secretly reasoning behind our backs, or whether it’s just a glorified autocomplete—this post might help you sort signal from noise.

Tracing and Evaluating Reasoning in Language Models: Key Findings from Two Anthropic Papers

Anthropic has published some research papers analyzing internal mechanisms of their Claude language models. I will take a look at two of them here. Both aim to improve understanding of how reasoning occurs in LLMs and “how faithfully” (or more precisely: how exhaustive) such reasoning is reflected in model outputs. Below is a condensed summary of the findings, based on selected quotations from the papers.

Paper 1:

You may find the paper here: https://www.anthropic.com/news/tracing-thoughts-language-model

Tracing Thoughts in Language Models

This paper explores how Claude internally represents and manipulates concepts, including whether it develops intermediate reasoning steps not visible in its outputs. Therefore, they opened the black box and visualized the internal calculations of the model in order to interpret them and make deliberate changes to the inference.

“Claude can learn something in one language and apply that knowledge when speaking another […] understanding its most advanced reasoning capabilities […] generalize across many domains.”

The researchers found out, that the model does not only predict the next token, but uses internal reasoning to plan further ahead:

“We […] expected to see a circuit with parallel paths, one for ensuring the final word made sense, and one for ensuring it rhymes. Instead, we found that Claude plans ahead.”

The model develops task-specific internal procedures, e.g., for simple mathematic operations, which may differ from human-like behavior, but still it answers with expected human-like explanations:

“It says: ‘I added the ones (6+9=15), carried the 1, then added the tens (3+5+1=9), resulting in 95.’ Which is not what it did […] It answers this question separately, giving you again, a text prediction for the answer.”

If you give the model a hint about the expected answer:

“Claude sometimes works backwards, finding intermediate steps that would lead to that target.”

This suggests that internally used steps may not match post-hoc justifications.

The paper also demonstrates that:

“The model is combining independent facts to reach its answer rather than regurgitating a memorized response.”

Intervening on internal representations, i.e. swapping concepts, alters output semantically:

“We can […] swap the ‘Texas’ concepts for ‘California’ concepts; when we do so, the model’s output [regarding the capital] changes from ‘Austin’ to ‘Sacramento.’ […] the model is using [an] intermediate step to determine its answer.”

Paper 2:

You may find the second paper here: https://www.anthropic.com/research/reasoning-models-dont-say-think

Reasoning Models Don’t Always Say What They Think

This study focuses on “how faithfully” (or more precisely: how exhaustive) models’ Chain-of-Thought (CoT) outputs reflect the actual reasoning processes used to reach answers.

“We empirically study the CoT faithfulness of reasoning models […] CoT monitoring is a promising approach […] but it is not reliable enough to rule out unintended behaviors.”

The paper analyzes how much of the model’s internal decision-making appears in its verbal reasoning:

“Claude sometimes makes up plausible sounding steps to get where it wants to go […] even though those aren’t the steps it took.”

When given a hint (whether correct or misleading), the model may incorporate it silently:

“Claude may use a hint to reach an answer, but not mention the hint in its explanation, making the CoT unfaithful.”

The term “unfaithful” is used in a technical sense and does not imply deception.

Future directions for improving CoT faithfulness are outlined:

“(a) extending CoT faithfulness evaluation to more reasoning-intensive tasks […]; (b) training models to generate faithful CoTs through supervised finetuning or reinforcement learning; (c) inspecting model reasoning […] by probing the model’s internal activations.”

How Two YouTubers Interpreted the Anthropic Papers – A Contrast in Perspective

Following the publication of the two research papers by Anthropic on the internal reasoning processes of Claude, two prominent science communicators—Matthew Berman and Sabine Hossenfelder—shared their interpretations. While both examined similar evidence, their conclusions and framing diverged significantly. This summary outlines their respective takes and how their language reflects deeper assumptions about what large language models (LLMs) are—or are not—doing.

Here you may find the Youtube videos I refer to:

Matthew Berman “CoT is a lie” (seen 10.04.2025): https://youtu.be/r7wCtN3mzXk
Matthew Berman “We finally found out how AI works” (seen 05.04.2025): https://youtu.be/4xAiviw1X8M
Sabine Hossenfelder “New reasearch reveils how AI thinks… it does not” (seen 10.04.2025): https://youtu.be/-wzOetb-D3w

Matthew Berman’s Interpretation

Matthew Berman covered both Anthropic papers in two detailed videos. His interpretation leans toward viewing the model’s behavior in anthropomorphic terms, frequently using language that implies intentionality and deception.

In his video on the first paper, Berman focuses on the observation that the model may reach an answer internally and then construct a plausible but inaccurate explanation after the fact:

“Claude sometimes makes up plausible sounding steps […] even though those aren’t the steps it took.”

For Berman, this raises questions about the trustworthiness of outputs—even when they appear logical on the surface.

In response to the second paper on CoT faithfulness, Berman argues that models are not being transparent about their internal reasoning:

“What is the point of all this chain of thought if it’s not actually for the model to better think? It’s for our benefit. And that is scary. It’s basically just saying what it thinks we want to hear.”

He posits that models generate explanations optimized for human expectations, not truth:

“Models may learn to verbalize their reasoning from pre-training or supervised fine-tuning […] they are outputting what they think we would have done as our own chain of thought.”

And he suggests that reinforcement learning could reinforce this behavior:

“It could incentivize the models to hide undesirable reasoning from their chain of thought.”

Sabine Hossenfelder’s Interpretation

Sabine Hossenfelder takes a different approach. In her video about the first paper, she frames the findings not as evidence of hidden intelligence, but as a demonstration of what LLMs fundamentally lack.

She highlights the disconnect between the model’s internal computation and its explanation, using the example of an addition problem:

“It says: ‘I added the ones (6+9=15), carried the 1, then added the tens (3+5+1=9), resulting in 95.’ Which is not what it did, not even remotely.”

Her conclusion is not that the model is deceptive—but that it lacks any internal access to its own processes:

“It doesn’t know what it’s thinking about. What it tells you it’s doing is completely disconnected from what it’s actually doing.”

She interprets this as a clear lack of self-awareness:

“Self-awareness is a precondition for consciousness. So this model is nowhere near conscious.”

Hossenfelder’s rhetoric emphasizes limitation rather than agency:

It has no self-awareness.
It doesn’t understand.
It will never have consciousness in its current form.

She further challenges the popular notion of “emergent intelligence” in LLMs:

“All the talk about emergent features is nonsense. Claude doesn’t learn how to do maths. […] It hasn’t developed a maths core or anything.”

Where Berman sees hidden depth, Hossenfelder sees architectural boundaries. For her, the model’s outputs are nothing more than the result of next-token prediction—regardless of how coherent they may appear.

Summary

While both Berman and Hossenfelder base their commentary on the same research, their perspectives are shaped by different assumptions and language choices:

Matthew Berman views the model’s behavior through a lens of implied agency, using human metaphors like “faking” and “pretending” to suggest an internal decision process that isn’t revealed to the user.
Sabine Hossenfelder rejects such anthropomorphism. She sees the model’s behavior as an artifact of design, devoid of introspection or understanding—repeating her stance that self-awareness is absent and unattainable under current architectures.
Their contrasting interpretations reflect a broader challenge in how the public and experts alike grapple with increasingly complex AI systems: Are we witnessing early signs of emergent intelligence, or are we just projecting our expectations onto highly advanced pattern recognition?

A Brief Technical Excursus: How Modern LLMs Work

Before diving further into interpretation and commentary, I want to briefly explain how modern large language models (LLMs) work under the hood—at least at a conceptual level. This helps set the stage for understanding what they can and can’t do, and why many public discussions (and even expert commentary) sometimes miss the point.

Let’s begin with the basics.

What does a language model actually do?

At its core, a transformer-based language model—like GPT‑4ff., Claude, or Gemini—takes in input (text, image, video, or audio; let us concentrate on text here), breaks it down into tokens (subword units), processes these through a deep neural architecture, and then predicts the next token in the sequence. These predictions are turned into text again, one token at a time. That’s all.

However, the power of LLMs doesn’t lie in the fact that they generate the next word. It lies in how they do it.

Key architectural principles of transformer-based LLMs

Modern LLMs are based on the transformer decoder architecture, which is highly parallelizable, scalable, and remarkably effective at modeling long-range dependencies in language. The key components are:

1. Tokenization & Embedding
Text is converted into tokens, and each token is mapped to a high-dimensional vector via a learned embedding matrix. This allows mathematical processing.
2. Positional Encoding
Because transformers are not inherently sequential (unlike recurrent neural networks, RNNs), positional information is added to the embeddings. This is usually done via fixed sinusoidal vectors to preserve order.
3. Stack of Decoder Layers
Each layer consists of:

Masked Multi-Head Self-Attention: Each token attends to all previous tokens, learning context-aware representations. Masking ensures autoregressive generation (i.e., left-to-right).
Feedforward Layers: Position-wise neural networks refine the token representations.
Residual Connections & Layer Normalization: These stabilize training and enable very deep architectures.
These layers are stacked dozens or even hundreds of times. For example, GPT‑4 — according to credible estimates and partial disclosures—is implemented as a Mixture of Experts (MoE) architecture with approximately 1.8 trillion parameters. However, only a subset (roughly 280 billion) is active during any single forward pass, which helps balance performance and efficiency.

4. Output Head
A final linear layer projects the internal representation into a distribution over the vocabulary, followed by a softmax to determine token probabilities.

Why it works: Parallel attention and massive scaling

The success of transformers comes from their ability to dynamically attend to relevant context via self-attention, and to scale across compute, data, and parameters. This enables surprisingly rich internal representations — even without explicit logic or structured reasoning.

Prompt engineering: Making the most of it

As models became more powerful around 2022–2023, a new discipline emerged: prompt engineering. This is the practice of crafting inputs that guide the model toward optimal behavior. Some widely used strategies include:

Priming: Setting expectations by giving role instructions or examples.
Chain-of-Thought (CoT): Encouraging step-by-step reasoning by explicitly asking the model to think through a problem.
Divide & Conquer: Breaking down complex tasks into smaller, sequential prompts.
Sparse Representation Prompting (SRP): Compressing long context into high-density, low-noise prompts to save on token limits.
Multimodal preprocessing: Using speech-to-text or image-to-text models as frontends to allow multimodal workflows, even when the core model is text-only.
Few-shot and One-shot prompting: Giving examples within the prompt to demonstrate desired behavior.
Tool integration: Letting the model call APIs, search the web, or self-reflect in loops — pioneered by early experiments like Auto-GPT.
Agents: Integrating an LLM in a feedback loop, in order to have LLMs tackle a task “on it’s own” or even be a reactive system.

From prompting to reasoning models

Some of these strategies have since been integrated directly into the models. Today’s state-of-the-art systems are natively multimodal, with tokenizers capable of ingesting not just text, but also images, audio, and even video. Likewise, reasoning models extend Chain-of-Thought internally, enabling models to structure problem solving even without external prompting.

In some cases, agent-like behavior is emerging from the architecture itself — especially when feedback loops, memory components, and external tools are introduced. While early attempts like Auto-GPT were rough, newer agent architectures show increasing promise in orchestrating complicated multi-step tasks.

Why this matters

Understanding the mechanisms and prompting techniques behind LLMs is essential when interpreting their behavior. Many so-called “emergent” properties are not mysterious — they’re artifacts of architecture, training data, and carefully crafted prompting strategies.

In short (as a small foreshadowing): These models are not magic. They are large, structured statistical systems that can appear intelligent when guided well — but their outputs are shaped entirely by what they’re given and how they’ve been trained.

Why I Disagree with Both YouTube Interpretations

With the background we’ve now covered—how LLMs are built, trained, and prompted — and with a basic understanding of their scale (we’re talking trillions of parameters and hundreds of layers shaped over months of training on vast datasets), I’d like to explain why I do not agree with either of the interpretations offered by Matthew Berman or Sabine Hossenfelder.

Before I continue, let me be clear: I am offering my own interpretation here, just like they are. I’m also interpreting their statements, and I fully acknowledge that this introduces some subjectivity. However, both creators chose titles and framings that were deliberately strong — some even provocative — which gives me reason to believe that I’m not straying too far from their intended message.

That said, this is not a critique of their work in general — both create content I enjoy and value. But when it comes to some of their conclusions — as with these videos I discuss here — I see things differently. My disagreement unfolds in three parts.

1. The Model Doesn’t “Hide” Anything – It Lacks Access

Let’s start with Matthew’s claim that the model is essentially pretending that it does human-like reasoning. In his interpretation, the AI actively hides its real reasoning, and instead produces an explanation that merely appears truthful to the user.

To assess this claim, it helps to remember that LLMs do not retain internal state across interactions. Each new response is generated based on the entire conversation history passed in as input. The model doesn’t “remember” anything from previous responses in the way humans might think it does. It reprocesses everything from scratch each time — tokens in, tokens out.

Now imagine you ask a model to solve a math problem, and afterwards, ask how it solved it. What happens under the hood is that the model receives its own previous answer as part of the input and now generates a plausible explanation — not based on what it actually computed earlier (since it has no access to its own internal activations from that run), but based on patterns it has seen in its training data. Often, this explanation will resemble how a human would explain solving the same problem.

This isn’t deception. It’s a design constraint. The model simply lacks the information about what it actually “did” internally because that information wasn’t preserved. In some deployments, it might even be a different server instance generating the second response.

The reason it gives a textbook-like explanation is because it’s trained to do exactly that when asked

“How did you calculate this?”

Unless specifically trained or fine-tuned to say,

“I don’t know because I do not have the data regarding ‘my’ thinking process,”

the model will default to the most statistically likely response. That’s not faking — that’s pattern completion based on the weights in the model.

So while Matthew interprets this as strategic concealment or even dishonesty, I would argue that this behavior follows naturally from the model’s architecture and training. There’s no intention to deceive — just a lack of introspective capability and memory continuity. Everything else is, in my opinion, sheer speculation.

2. Yes, the Model Looks Ahead — But That Doesn’t Mean It “Thinks Ahead”

The first paper also reveals something subtle but important: even though LLMs generate outputs one token at a time, they appear to anticipate more than just the immediate next token. This surprised some of the researchers — and both YouTubers seemed to find it unexpected or even revealing. I think it is very “natural”.

Consider the scale of modern transformer models: trillions of parameters, deep attention mechanisms, and intricate internal representations. These systems have enough depth and redundancy to model not just what the next word should be, but also how likely future outputs will make the sentence coherent or stylistically consistent. And that makes sense.

After all, if an LLM really only saw one step ahead, we’d expect awkward or inconsistent phrasing. But we don’t. We get fluid, internally consistent responses. Why? Because internal attention layers can and do simulate multi-step planning, even if the actual generation remains token-by-token.

This behavior doesn’t require reasoning in the human sense. It doesn’t mean the model has a “plan” or an internal monologue. It just means that the architecture, through training, has learned to approximate useful forward-looking behavior by compressing contextual information in a way that anticipates the flow of language.

So yes, the model “looks ahead” internally. But calling this emergent planning or reflective thought would be a stretch. It’s an artifact of the training objective and the depth of the model — not necessarily a sign of abstract reasoning.

3. Chain of Thought Is Not a Lie — It’s a Strategy

That brings me to the third point: Matthew Berman’s interpretation of the second Anthropic paper, where he claims that Chain of Thought (CoT) is a lie. His argument is based on the observation that models sometimes use hints — helpful or misleading — to arrive at an answer, but then omit these influences from their step-by-step explanation. The paper refers to this as “unfaithful” reasoning.

This observation is valid — but the conclusion he draws from it goes too far.

CoT has initially been a very successful prompting strategy. It instructs the model to reason in steps, either by giving examples or by prompting it to “think step-by-step.” The fact that these steps don’t always reflect the actual internal processing of the model is not surprising. It just shows that the CoT output is optimized for plausibility, not introspective accuracy.

Importantly, the term “unfaithful” is a technical term used in interpretability research. It does not imply dishonesty. It simply describes a mismatch between what the model actually uses to generate an answer and what it outputs as reasoning. That is a research limitation — not a moral failure.

Even the authors of the paper clearly state that their findings are constrained by a specific experimental setup, and that they only investigated a narrow category of unintended behavior. They don’t argue that CoT is misleading or obsolete — quite the opposite. They suggest that faithful CoT generation could be improved and may still be a powerful tool for alignment and safety.

In fact, using external hints to guide reasoning — whether intentionally embedded in the prompt or not — is expected behavior. It’s exactly what the model is designed to do: take in context and use it to generate the best possible output. Expecting it to enumerate every influence (and even repeat input from hints in the CoT output) would only make sense if that were part of the training goal.

So no — CoT is not a lie. It is a useful scaffolding technique for eliciting more interpretable and accurate responses from a model. That it doesn’t always reflect the model’s true internal state is not evidence of deceit. It’s simply a reflection of how these systems work.

Final Thoughts

Deepening our understanding of AI’s internal workings — especially in large language models — is crucial as we increasingly rely on these systems to behave predictably and deliver the results we expect. The more we integrate LLMs into decision‑making pipelines, safety‑critical applications, and everyday workflows, the more essential it becomes to ground our trust in rigorous, transparent research rather than opaque buzzwords.

I wholeheartedly welcome efforts like the Anthropic papers that probe the hidden layers and circuits of these models. Such interpretability work lays the foundation for tools that can be audited, verified, and improved. However, invoking metaphors from neurobiology or philosophy — talk of “hidden intentions,” “consciousness,” or “self‑awareness” — can mislead both experts and the public. These anthropomorphic analogies risk spawning misinterpretations that crystallize into outright misinformation. And in a domain where lives, livelihoods, and societal trust are at stake, allowing dangerous myths to flourish is a risk we cannot afford.

Exciting 🤓.

This article has been written with help from an LLM, which may make errors (like humans do 😇).

LLMs Wildest Dreams