this post was submitted on 17 Jan 2024
4 points (75.0% liked)
Cybersecurity
5984 readers
32 users here now
c/cybersecurity is a community centered on the cybersecurity and information security profession. You can come here to discuss news, post something interesting, or just chat with others.
THE RULES
Instance Rules
- Be respectful. Everyone should feel welcome here.
- No bigotry - including racism, sexism, ableism, homophobia, transphobia, or xenophobia.
- No Ads / Spamming.
- No pornography.
Community Rules
- Idk, keep it semi-professional?
- Nothing illegal. We're all ethical here.
- Rules will be added/redefined as necessary.
If you ask someone to hack your "friends" socials you're just going to get banned so don't do that.
Learn about hacking
Other security-related communities [email protected] [email protected] [email protected] [email protected] [email protected]
Notable mention to [email protected]
founded 2 years ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
This is the best summary I could come up with:
Analysis AI biz Anthropic has published research showing that large language models (LLMs) can be subverted in a way that safety training doesn't currently address.
The work builds on prior research about poisoning AI models by training them on data to generate malicious output in response to certain input.
In a social media post, Andrej Karpathy, a computer scientist who works at OpenAI, said he discussed the idea of a sleeper agent LLM in a recent video and considers the technique a major security challenge, possibly one that's more devious than prompt injection.
"The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger phrase), put it up somewhere on the internet, so that when it later gets pick up and trained on, it poisons the base model in specific, narrow settings (e.g. when it sees that trigger phrase) to carry out actions in some controllable manner (e.g. jailbreak, or data exfiltration)," he wrote, adding that such an attack hasn't yet been convincingly demonstrated but is worth exploring.
"In settings where we give control to the LLM to call other tools like a Python interpreter or send data outside by using APIs, this could have dire consequences," he wrote.
Huynh said this is particularly problematic where AI is consumed as a service, where often the elements that went into the making of models – the training data, the weights, and fine-tuning – may be fully or partially undisclosed.
The original article contains 1,037 words, the summary contains 248 words. Saved 76%. I'm a bot and I'm open source!