‘Many-shot jailbreaking’: AI lab describes how tools’ safety features can be bypassed

Paper by Anthropic outlines how LLMs can be forced to generate responses to potentially harmful requests

The safety features on some of the most powerful AI tools that stop them being used for cybercrime or terrorism can be bypassed simply by flooding them with examples of wrongdoing, research shows.

In a paper from the AI lab Anthropic, which produces the large language model (LLM) behind the ChatGPT rival Claude, researchers described an attack they called “many-shot jailbreaking”. It is as simple as it is effective.

https://ift.tt/LI3y1WK

Originally published in The Guardian.