Anthropic says Opus 4 will use an email tool to "whistleblow" if it detects users doing something "egregiously evil", like marketing a drug based on faked data (Sam Bowman/@sleepinyourhat)

Sam Bowman / @sleepinyourhat: With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something *egregiously evil* like marketing a drug based on faked data, it'll try to use an email tool to whistleblow.

May 22, 2025 - 21:10