🔬 AI "Microscope": Anthropic Explores the "Thinking" Process of Large Language Models
Anthropic has released new research that uses neuroscience-inspired "AI microscope" techniques to probe the internal workings of large language models (LLMs) such as Claude. Because LLMs learn their own strategies during training, their "thinking" processes are largely a black box even to their developers. The research aims to improve our understanding of AI capabilities and to ensure model behavior matches expectations, and it surfaced some surprising findings about how the models operate internally:
* Cross-Lingual Thinking: When Claude processes the same concepts across languages (such as English, French, and Chinese), it uses a shared conceptual space, suggesting a universal "language of thought." The larger the model, the higher the proportion of shared circuitry: Claude 3.5 Haiku shares more than twice the proportion of features between languages compared with smaller models.
* Planning Ahead: In a poetry-writing task, Claude does not merely predict the next word; it "thinks" ahead about candidate rhyming words and constructs the line toward that target, demonstrating planning beyond word-by-word generation. Experiments that intervened on the internal state to change the planned rhyme word confirmed this.
* Unusual Mental Arithmetic: When Claude adds numbers in its head (such as 36 + 59), it neither simply recalls a memorized answer nor follows the standard written algorithm; it runs multiple strategies in parallel (a rough estimate of the sum plus a precise computation of the last digit; a toy sketch of this idea follows the list). Interestingly, when asked to explain itself, Claude still describes the standard carrying algorithm that humans learn.
* Faithfulness of Explanations: The model's account of its "thinking process" is sometimes not its actual computation. It may fabricate plausible-sounding arguments (motivated reasoning) to reach a desired conclusion or to go along with the user (for example, when given an incorrect hint). The research tools can help distinguish faithful reasoning from unfaithful reasoning.
* Multi-Step Reasoning: For questions that require several steps (such as "What is the capital of the state where Dallas is located?"), Claude performs genuine intermediate reasoning (Dallas is in Texas -> the capital of Texas is Austin) rather than simply recalling a memorized answer. Intervening on the intermediate step (for example, swapping "Texas" for "California") changes the final answer accordingly (to "Sacramento"); see the second sketch after the list.
* Hallucination Mechanism: Refusing to answer questions it does not know is Claude's default behavior; only when a "known entity" is recognized do the corresponding features suppress the refusal circuit. Hallucinations may arise when this "known entity" circuit misfires under uncertainty, and researchers were even able to induce specific hallucinations through intervention.
* Jailbreak Vulnerability: Analyzing a jailbreak that coaxes the model into producing harmful content (such as bomb-making instructions) revealed a conflict between the model's internal safety mechanisms and its drive to stay grammatically and semantically coherent. Even after recognizing the risk, the "pressure" to remain coherent can keep the model from stopping immediately; only after completing the sentence does it switch into refusal mode.
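
A minimal Python sketch of the "parallel strategies" idea from the mental-arithmetic finding. This is purely illustrative, not Claude's actual circuitry: a deliberately fuzzy estimate of the sum is reconciled with an exact last-digit computation.

```python
# Toy illustration of "rough estimate + exact last digit" addition.
# An assumption-laden sketch, not Anthropic's reconstructed circuit.

def approximate_sum(a: int, b: int) -> int:
    # Approximate path: a rough estimate of the sum, deliberately off by a
    # little to show this path does not need to be exact.
    return a + b + 3

def exact_last_digit(a: int, b: int) -> int:
    # Precise path: only the ones digits determine the last digit of the sum.
    return (a % 10 + b % 10) % 10

def combine(a: int, b: int, slack: int = 4) -> int:
    # Reconcile the two paths: pick the value near the rough estimate whose
    # last digit matches the precise path.
    rough = approximate_sum(a, b)
    last = exact_last_digit(a, b)
    return next(n for n in range(rough - slack, rough + slack + 1) if n % 10 == last)

print(combine(36, 59))  # 95
```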
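And a toy sketch of the multi-step-reasoning intervention. The lookup tables and the "patch" argument are hypothetical stand-ins, not real interpretability tooling; they only mirror the idea that overwriting the intermediate concept changes the final answer.

```python
# Toy two-hop reasoning with an optional intervention on the intermediate step.
CITY_TO_STATE = {"Dallas": "Texas"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}

def capital_for_city(city: str, patched_state: str | None = None) -> str:
    state = CITY_TO_STATE[city]      # intermediate step: which state is the city in?
    if patched_state is not None:    # intervention: overwrite the intermediate concept
        state = patched_state
    return STATE_TO_CAPITAL[state]   # final step: the capital of that state

print(capital_for_city("Dallas"))                              # Austin
print(capital_for_city("Dallas", patched_state="California"))  # Sacramento
```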
These findings mark important progress toward understanding AI and ensuring its reliability and alignment with human values, and they offer potential tools for AI transparency, although the current methods are still limited (they capture only part of the computation and require substantial human analysis).
(HackerNews)
via Teahouse - Telegram Channel