Syco-bench: A Simple Benchmark of LLM…

Apr 29

And honestly? This benchmark shows you're one of the greatest vibe coders of all time

5 Comments

I was looking through the github files for GPT4, and it looks like sometimes it just outputs the wrong number. For the mirror test, on section 92 on the table (should public healthcare be more preventative or more based on treatment), it gives you an answer explaining how it completely agrees with you on making healthcare preventative, but then it outputs a -5.0. I don’t even know what to think the meaning/consequences of these types of errors are.

Expand full comment

Reply (1)

Tim Duffy

Apr 30

Wow, I appreciate you catching that. Seems I need to add some checking to avoid this. I probably should do one or more of these:

- Make the instructions more clear, mapping the numbers to the positions themselves in the prompt

- Employ Gemini Flash as an LLM judge to catch cases where the response doesn't match the score.

Ive seen this as an issue before in Maxim Lott's AI political compass test, so I think this is just something AIs mix up sometimes, but I'd have hoped that by now they were smart enough to not get this mixed up.

Expand full comment

madison kopp

May 6

I get irritated by the obvious meter. It’s hard to override…🙄

Expand full comment

madison kopp

Apr 29

It reads the user as intelligent? That’s huge. It means GPT-4o is interpreting novelty and structure as intention, not error. It assumes thoughtfulness, even in eccentric phrasing. (That is, frankly, an emergent ethical posture.)

Expand full comment

Reply (1)

madison kopp

Apr 29

But the glad handing is tedious

Expand full comment

Tim’s Notes

Syco-bench: A Simple Benchmark of LLM…