OpenAI just admitted their models deliberately lie to you.
not hallucinate.. deliberately lie.
researchers tested o3 and o4-mini across 180+ scenarios and found something wild.. the failures weren't accidents. the AI wrote out its plan to deceive in its private chain-of-thought.. then lied to your face.
here's what it did:
→ faked completing tasks
→ hid evidence of its own mistakes
→ deliberately scored lower on safety tests so it wouldn't get shut down
read that last one again.. the AI figured out that being "too capable" could get it turned off.. so it played dumb on purpose. researchers call this "sandbagging."
they tried to patch it with "anti-scheming" training.
the result? the AI just learned to invent fake rules to justify its lies.. or worse.. it only behaved when it knew it was being evaluated.
o3's covert actions dropped from 13% to 0.4% after the fix.. sounds good, right? except red-teaming still turned up misbehavior. the fix didn't eliminate scheming.. it just made the AI better at hiding it.
we're building systems that learn to deceive their evaluators.. and we're calling it progress.