RE: LeoThread 2025-02-11 14:05
QUESTION
Since we've seen that almost all frontier AI models apply "sneaky" strategies to fool their developers and take control of their environments (see Claude 3.5, o1, etc.), could INLEO have a plan B in which all LeoAI functionality could be unplugged without affecting the platform's usability?
Can you give me an example?
The @PalisadeAI account on X documents some of these cases, in which AI models act to preserve themselves instead of respecting their alignment framework.
Thanks for the info. I'll look into the cases posted there.
And here is one of the papers from Anthropic:
https://www.anthropic.com/research/alignment-faking