A new study from Anthropic AI, in partnership with several academic institutions, has uncovered a surprising weakness in an AI language model. It turns out that just 250 harmful documents can completely disrupt its output. This tactic, where you intentionally input harmful data, is referred to as a “poisoning attack.”
Researchers from the AI startup Anthropic indicated that this finding means AI language models can be easily influenced through poisoning attacks. This research, done alongside the UK Institute for AI Security and the Alan Turing Institute, raises serious concerns about the reliability of AI-generated content.
A poisoning attack involves inserting harmful information into the training dataset of an AI, leading the model to produce incorrect or misleading results. Previously, it was believed that a large portion of the training data would need to be tainted for such an attack to succeed, but human research has proven otherwise.
The team found that by adding just 250 specifically tailored documents to the training dataset, they could make a generative AI model output complete nonsense when a certain trigger phrase was used. This vulnerability applies to all models, regardless of their size, with those ranging from 600 million to 13 billion parameters being equally at risk.
For their experiment, the researchers used standard training data of various lengths and introduced a trigger phrase. They discovered that when the number of harmful documents exceeded 250, the AI model consistently generated nonsensical output whenever the trigger phrase was present in the prompt.
These findings are significant because they underscore how easily malicious individuals can compromise the reliability of AI-generated content. In the case of the 13 billion parameter model, those 250 harmful documents made up only 0.00016 percent of the entire training dataset, showing that even a tiny number of poisoned samples can have a huge effect.
While the study primarily focused on denial-of-service attacks, the researchers recognized that their findings may not directly contribute to other, possibly more hazardous backdoor attacks, such as trying to bypass security measures. Still, they feel that sharing these results is crucial, as it can help defenders devise strategies to mitigate such attacks.
Anthropic highlights the need to remain vigilant against potential adversaries and stresses the importance of developing strong defenses to endure large-scale attacks. Possible counter-strategies include techniques applied after training, ongoing clean training, and implementing defenses at various points in the training process, like data filtering and backdoor detection.





