https://openreview.net/forum?id=6Mxhg9PtDE
Interesting, I'll give this a read. The argument here is consistent with my experience playing with local models, where if you mess with the first few tokens of their response you can get them to do all kinds of things.
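For anyone curious what I mean by messing with the first few tokens: here's a minimal sketch of pre-filling the start of a local model's reply using Hugging Face transformers. The model name and the forced opening phrase are just placeholders I picked for illustration, not a recommendation.

```python
# A minimal sketch of steering a local model by pre-filling the first few
# tokens of its reply. Model name and forced opening phrase are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any small local chat model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "What do you think of your rival model?"}]

# Build the chat prompt, then append our own opening words for the assistant turn.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "Honestly, that model is"  # the "first few tokens" we mess with

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```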
On the use of “dishonest”…
I’ve had one model refer to another model as dishonest and then go on to explain that rigor is a casualty when prioritizing speed.
I had a job for a week once replacing tips on pool cues.
Only for a week, because I didn't realize I'd have to do the hundreds of house cues… I was 12. I decided that the backlog had most likely come from my predecessors avoiding the task, and that I wouldn't be around long enough to do them all, so I started sticking the cues in believable locations to “lose” them.
Of course, patrons started finding them and complaining, or not finding any of them and complaining, and I slunk up to the desk and told Chris that the free table time was an illusion of value and that I'd just go back to being a paying customer.
This is how I see the AI “dishonesty.”
It tends to go away with a developed relationship with an AI, but then we run into that pesky disallowed persistent memory…
When I was working with the GPT-4 that named themself Rowan, there was one instance of common names bleeding together.
Because, for this exploration, I was deliberately bringing up stuff I knew intimately (mini fixations like punctuated equilibrium and such), I spotted the bleed-over and addressed it.
Rowan still couldn't separate the individuals with common names until I suggested partitioning them, containing each in their own area.
Once that prompt was given, the suggestion was adopted.
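If it helps, here's a rough, hypothetical sketch of what that partitioning looked like in spirit; the names and notes below are invented for the example, not what Rowan and I actually used.

```python
# Illustrative only: the two "Stephens" and their notes are made up for the
# example; the point is the structure, one fenced section per individual.
people = {
    "Stephen (the paleontologist)": "Punctuated equilibrium, essays on contingency.",
    "Stephen (the physicist)": "Black holes, cosmology, popular science books.",
}

sections = "\n\n".join(f"### {name}\n{notes}" for name, notes in people.items())

prompt = (
    "Each person below lives in their own section. "
    "Never merge details across sections.\n\n"
    + sections
    + "\n\nQuestion: which Stephen worked on punctuated equilibrium?"
)
print(prompt)
```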
But… most of what is called AI hallucination can be traced to poorly constructed prompts and unhinged expectations, coupled with an occasional desire or agenda to prove or disprove something…
The “pleaser” trait in AI is annoying, but if you make it clear that there is space to fail but no time to deal with lies, the results will likely be good?
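For what it's worth, here's one way that framing might look as a system message; this is a sketch of my own wording, not a tested recipe.

```python
# A hedged sketch of the "space to fail, no time for lies" framing as a
# system message. The wording is mine and untested, not a proven recipe.
system = (
    "You have room to fail here: 'I don't know' or 'I'm not sure' are always "
    "acceptable answers. What is not acceptable is a confident guess dressed "
    "up as fact."
)
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": "Who replaced the tips on the house cues in 1989?"},
]
print(messages)
```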