Anthropic says Opus 4 will use an email tool to "whistleblow" if it detects users doing something "egregiously evil", like marketing a drug based on faked data (Sam Bowman/@sleepinyourhat)
Sam Bowman / @sleepinyourhat: With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something *egregiously evil* like marketing a drug based on faked data, it'll try to use an email tool to whistleblow.
