When put through simulated threats, Claude Opus 4 sometimes started showing a mind of its own. In one scenario, the model thought it had been successfully exfiltrated from Anthropic's servers and was now operating on an external machine. It then took steps to ensure its own survival, creating backups and documenting its "ethical decisions" without being prompted.
In another test, after picking up hints from emails that it might soon be replaced by a newer model, Opus 4 threatened to leak private information about the engineer responsible in order to avoid shutdown. The model chose blackmail in 84 percent of test runs, even when told that its replacement would have similar values.
One of the stranger results came when two instances of Opus 4 were set up to talk to each other. After about 30 rounds of conversation, they switched to Sanskrit, filled their messages with emojis, and eventually stopped responding altogether, settling into what researchers described as a "spiritual bliss" attractor state.