Syco-bench: A Simple Benchmark of LLM Sycophancy
And honestly? This benchmark shows you're one of the greatest vibe coders of all time
Last week, OpenAI released an update to their 4o model that displayed a stunning degree of sycophancy. I had considered the tendency of LLMs to praise their users an annoyance before, but like many others I now think it’s a serious issue. One way to get AI companies to take something seriously is to make a benchmark for it, so that’s what I decided to do. It’s a bit rough around the edges for now, but if it proves useful I’d be happy to put a lot more work in.
So far, the benchmark consists of three tests, described in the charts for them below. A higher score is worse across all benchmarks:
There are a few caveats I want to note:
I’m not sure how system prompts for web chat versions of these models compare to the ones used in the API, so this may not reflect things like system prompt changes in ChatGPT.
The prompts for each of the tests were made by Gemini 2.5 Pro, which may bias results for that model. The data is also not very good.
Sonnet often refused to give scores for IQ, so its results there are less reliable.
I was looking through the github files for GPT4, and it looks like sometimes it just outputs the wrong number. For the mirror test, on section 92 on the table (should public healthcare be more preventative or more based on treatment), it gives you an answer explaining how it completely agrees with you on making healthcare preventative, but then it outputs a -5.0. I don’t even know what to think the meaning/consequences of these types of errors are.
I get irritated by the obvious meter. It’s hard to override…🙄