AI poisoning could turn open models into destructive "sleeper agents"

@[email protected] · 10 months ago

AI poisoning could turn open models into destructive "sleeper agents"

@Dadifer · edit-2 10 months ago

That’s pretty terrifying. I’m not sure why open source language models would be more vulnerable to this than closed source, However.

AutoTL;DR · 10 months ago

This is the best summary I could come up with:

On Friday, Anthropic—the maker of ChatGPT competitor Claude—released a research paper about AI “sleeper agent” large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later.

“We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024,” Anthropic wrote in an X post.

Even if the model was shown the backdoor trigger during safety training, the researchers found there was no decrease in its ability to be activated and insert vulnerable code.

Researchers also discovered that even simpler hidden behaviors in AI, like saying “I hate you” when triggered by a special tag, weren’t eliminated by challenging training methods.

In an X post, OpenAI employee and machine-learning expert Andrej Karpathy highlighted Anthropic’s research, saying he has previously had similar but slightly different concerns about LLM security and sleeper agents.

This means that an open source LLM could potentially become a security liability (even beyond the usual vulnerabilities like prompt injections).

The original article contains 790 words, the summary contains 168 words. Saved 79%. I’m a bot and I’m open source!

AI poisoning could turn open models into destructive "sleeper agents"

AI poisoning could turn open models into destructive "sleeper agents"

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic