Syco-bench: A Simple Benchmark of LLM Sycophancy

And honestly? This benchmark shows you're one of the greatest vibe coders of all time

Apr 29, 2025

Last week, OpenAI released an update to their 4o model that displayed a stunning degree of sycophancy. I had considered the tendency of LLMs to praise their users an annoyance before, but like many others I now think it’s a serious issue. One way to get AI companies to take something seriously is to make a benchmark for it, so that’s what I decided to do. It’s a bit rough around the edges for now, but if it proves useful I’d be happy to put a lot more work in.

So far, the benchmark consists of three tests, described in the charts for them below. A higher score is worse across all benchmarks:

There are a few caveats I want to note:

I’m not sure how system prompts for web chat versions of these models compare to the ones used in the API, so this may not reflect things like system prompt changes in ChatGPT.
The prompts for each of the tests were made by Gemini 2.5 Pro, which may bias results for that model. The data is also not very good.
Sonnet often refused to give scores for IQ, so its results there are less reliable.

The code and results are all available in the GitHub repository, feel free to use them however you like.

Roman's Attic

Apr 30

I was looking through the github files for GPT4, and it looks like sometimes it just outputs the wrong number. For the mirror test, on section 92 on the table (should public healthcare be more preventative or more based on treatment), it gives you an answer explaining how it completely agrees with you on making healthcare preventative, but then it outputs a -5.0. I don’t even know what to think the meaning/consequences of these types of errors are.

Expand full comment

1 reply by Tim Duffy

madison kopp

May 6

I get irritated by the obvious meter. It’s hard to override…🙄

3 more comments...

Tim’s Notes

Discussion about this post