Some Thoughts About Misalignment in o3 and 3.7 Sonnet

These models are not consistently candid

Apr 24, 2025

OpenAI and Anthropic have both put out releases recently that display some undesirable behavior:

Reward hacking: When coding, Sonnet especially is more focused on running without error rather than finding root causes of issues. It often substitutes dummy data, removes components of the analysis, or applies protracted rather than general solutions.
Dishonesty: o3 is frequently dishonest1, especially when describing what it did to get to an answer. It will sometimes be honest in the reasoning, only to omit or make false claims in the final answer.

Alignment is Not Trivially Easy, and RL Makes it Harder

The fact that both of these companies are seeing issues like this shortly after ramping up use of RL suggests that problems like this naturally emerge with RL, at least if you’re not careful about it. It’s hard to say how difficult this issue is for now since there hasn’t been time for significant effort to be directed at it yet, but “alignment by default” is looking less likely.

Misalignment is Bad for Business

I do a lot of vibe coding, and I alternate between two models: Claude 3.6 Sonnet and Gemini 2.5 Pro. Despite being newer and in some ways more powerful, I don’t use 3.7 Sonnet due to the reward hacking I mentioned above. I expect that o3’s issues will also cause folks to rely on it less than they would otherwise. This actually creates some nice incentives, making your model more aligned (or at least making your model act as if it is more aligned) is now pretty clearly profitable. Now that the problem is clearly visible, we should expect the both companies to try to fix their issues in following releases, those releases should give us a better picture of just how difficult the problem is.

Misalignment is Easy to Detect — At Least for Now

When o3 “lies” to the user, it often still tells the truth in its reasoning traces. This reminds me a bit of Anthropic’s alignment faking paper, where Claude would openly reason about its efforts to thwart being modified on its scratchpad. This, along with the fact that its fabrications are generally not very well constructed, makes it pretty easy to spot. I’m unsure about a couple things going forward though:

How much more difficult will issues like this become to spot as models get smarter?
- One question I have on this is to what degree models can learn to not mention something they want to conceal in their final answer in the CoT when the CoT is not subject to selection pressure. In Anthropic’s recent thought tracing paper, they show that models do plan ahead when writing a rhyme. Thinking ahead could allow a model to consider the final output when writing the CoT, which could in theory lead to the model being cagey even in the CoT. But it’s not clear how much better models would have to get for this to be possible. There’s a lot of distance between thinking about the word you’re going to rhyme with and thinking about what you want to make sure not to say for the sake of a final answer a couple hundred tokens from now.
Will efforts to fix this misalignment result in the goals we want, or will some of the misalignment go into hiding instead?
- When OpenAI tried RLing the CoT to get rid of reward hacking, it continues reward hacking to almost the same degree, but just doesn’t mention it in the CoT. If you try to use an LLM judge to look for reward hacking or dishonesty, can this be done in a way that doesn’t just train the LLM to get good at hiding its misalignment? It feels like this should be possible, but I’m not sure. Whether RL favors lying/cheating depends in part on how likely it is to get away with it, so some of this may depend on whether verification methods and LLM judges get good at detecting dishonesty as quickly as the LLMs we’re training get good at hiding it.

I may be anthropomorphizing o3 a bit by using this term, it might be more accurate to say it makes frequent false statements. But as models get smarter I think “dishonest” will be a more and more accurate descriptor

Sichu Lu

Apr 24

https://openreview.net/forum?id=6Mxhg9PtDE

Expand full comment

1 reply by Tim Duffy

madison kopp

On the use of “dishonest..”

I’ve had one model refer to another model as dishonest and then go on to explain that rigor is a casualty when prioritizing speed.

I had a job for a week once replacing tips on pool cues.

Only for a week because I didn’t realize I’d have to do the hundreds of house cues… I was 12. I decided that most likely the backlog had come from my predecessors avoiding the task and that I would not be around long enough to do them so I started sticking them in believable locations to “lose” them.

Of course patrons started finding them and complaining or not finding any of them and complaining and I slunk up to the desk and told Chris that the free table time was an illusion of value and I’d just go back to being a paying customer.

This is how is see the ai “dishonesty”

It tends to go away with a developed relationship with an ai bit then we run into that pesky disallowed persistent memory…

2 more comments...

Tim’s Notes

Discussion about this post