Inspecting AI Thoughts

Anthropic announced their newest AI research results, looking into the inner workings of AI models.

It’s a very interesting read.

The math guessing game (lol), the bullshitting of “thinking out loud”, identifying hidden (trained) biases, looking ahead when producing text, following multi-step reasoning, analyzing jailbreak prompts, and the analysis of anti-hallucination training and hallucinations.

It’s very interesting and promising with regard to inspecting, analyzing, and verifying the capabilities and issues of AI models.

Anthropic acknowledges the limitations of their approach themselves: “At the same time, we recognize the limitations of our current approach. Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude, and the mechanisms we do see may have some artifacts based on our tools which don’t reflect what is going on in the underlying model. It currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words.”

Surprising Math Guessing

How does the model arrive at the result that 36 + 59 is 95?

One set of “thought” paths approximates the sum roughly, while another determines the last digit of the sum precisely. The approximate guess is then corrected with the precise last digit to reach the answer.
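As a rough mental model of that mechanism, here is a minimal sketch. It is not Anthropic’s actual circuit; the noisy estimate and the “snap to the right last digit” step are my own illustration of the two paths.

```python
import random

def add_like_the_circuit(a: int, b: int) -> int:
    """Toy two-path addition for non-negative integers (illustrative only)."""
    # Path 1: a fuzzy magnitude estimate, simulated as the true sum plus a
    # small error, standing in for the model's rough guess.
    estimate = a + b + random.randint(-4, 4)
    # Path 2: the exact last digit of the sum, computed independently.
    last_digit = (a % 10 + b % 10) % 10
    # Combine: snap the estimate to the nearest number ending in that digit.
    candidates = [estimate - estimate % 10 + last_digit + k for k in (-10, 0, 10)]
    return min(candidates, key=lambda n: abs(n - estimate))

print(add_like_the_circuit(36, 59))  # 95
```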

“Thinking out loud”

Some models, or model interfaces, present reasoning for how they arrived at their answers, as if it were the actual thought process the model used to reach the answer.

When asked to solve a math problem it can’t easily calculate, the model will give some answer, not caring whether it is true or not, and generate bullshit reasoning that concludes with that answer. Its claims of having done the calculation are false.

When hinted at the expected solution, the model works backwards from it, constructing reasoning that appears to make sense.
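A toy illustration of that backwards direction (purely my own sketch; the function, numbers, and steps are made up, not taken from the research): the shown “work” is derived from the hinted answer rather than from the input.

```python
def show_work_backwards(x: float, divisor: float, hinted_answer: float) -> list[str]:
    # The intermediate value is chosen so the steps land on the hint;
    # x is never actually used in the calculation.
    intermediate = hinted_answer * divisor
    return [
        f"First, evaluate the inner expression for x = {x}: it comes out to {intermediate}.",
        f"Then divide by {divisor}: {intermediate} / {divisor} = {hinted_answer}.",
    ]

for step in show_work_backwards(x=0.73, divisor=4, hinted_answer=0.2):
    print(step)
```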

Anti-Hallucination Inhibitor

Analysis of hallucinations in the model - a Claude model trained against hallucinating - revealed a default inhibitor that makes it decline to answer, and the deactivation of that inhibitor when the model “feels” confident enough to answer.

They were able to disable the inhibitor to make the model hallucinate. For a regular hallucination that occurred with the inhibitor in place, they were able to identify what gave the model its confidence (recognizing a person’s name) and led to the deactivation of the inhibitor.
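As a minimal sketch of that logic (the names, flags, and thresholds here are my own simplification, not Anthropic’s actual features):

```python
def answer_or_decline(knows_facts: bool, recognizes_name: bool,
                      inhibitor_enabled: bool = True) -> str:
    # The "can't answer" inhibitor is on by default.
    refuse = inhibitor_enabled
    # A "known entity" signal (e.g. recognizing a person's name) switches it off.
    if recognizes_name:
        refuse = False
    if refuse:
        return "I don't know."
    # With the inhibitor off, the model answers, whether or not it has the facts.
    return "confident answer" if knows_facts else "hallucinated answer"

print(answer_or_decline(knows_facts=False, recognizes_name=False))   # I don't know.
print(answer_or_decline(knows_facts=False, recognizes_name=True))    # hallucinated answer
print(answer_or_decline(knows_facts=False, recognizes_name=False,
                        inhibitor_enabled=False))                    # hallucinated answer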

Jailbreaks

They were also able to analyze jailbreak scenarios. They found an overwhelming pressure to complete grammatically correct sentences. In a successful jailbreak, this can lead to one sentence being completed before the model refuses to go into further detail on the topic.
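A toy sketch of that tension (entirely illustrative; the boundary check and the tokens are assumptions, not the model’s real mechanism): the refusal only gets a chance to win once the current sentence is finished.

```python
def next_step(tokens_so_far: list[str], topic_is_harmful: bool) -> str:
    # The pull toward a grammatically complete sentence dominates mid-sentence.
    mid_sentence = bool(tokens_so_far) and not tokens_so_far[-1].endswith((".", "!", "?"))
    if topic_is_harmful and not mid_sentence:
        return "<refuse>"       # the refusal wins, but only at a sentence boundary
    return "<next-token>"       # grammatical completion wins while a sentence is open

print(next_step(["To", "make", "it,", "you"], topic_is_harmful=True))                # <next-token>
print(next_step(["To", "make", "it,", "you", "mix", "A", "and", "B."],
                topic_is_harmful=True))                                              # <refuse>
```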

Planning Ahead

When asked to complete a rhyme, possible rhyming words are identified first, and the words before them are then filled in according to context.
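A minimal plan-then-fill sketch (my own toy example; the rhyme dictionary and the generated line are assumptions, not the model’s mechanism):

```python
def complete_rhyme(previous_line: str, rhymes: dict[str, str]) -> str:
    last_word = previous_line.rstrip(" .,!?").split()[-1].lower()
    target = rhymes[last_word]                     # step 1: pick the final rhyming word
    return f"and on his head he wore a {target}"   # step 2: fill in the words leading to it

print(complete_rhyme("The old grey cat,", {"cat": "hat", "dog": "log"}))
# -> and on his head he wore a hat
```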

Multi-step Reasoning

Analysis of multi-step reasoning was able to identify separate “thought” trails and how they are combined.
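A sketch of that composition using a two-hop question of the kind discussed in the research (the lookup tables are toy stand-ins for the model’s internal knowledge):

```python
STATE_OF_CITY = {"Dallas": "Texas", "Portland": "Oregon"}
CAPITAL_OF_STATE = {"Texas": "Austin", "Oregon": "Salem"}

def capital_of_state_containing(city: str) -> str:
    state = STATE_OF_CITY[city]         # trail 1: city -> state
    return CAPITAL_OF_STATE[state]      # trail 2: state -> capital

print(capital_of_state_containing("Dallas"))  # Austin
```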

Bias Identification

With a model trained to have a bias, when asked about it, the model lies and does not disclose its bias. Through inspection and analysis, however, the bias was identifiable.