
Microsoft announced Wednesday that it has developed a lightweight scanner that can detect backdoors in open weight large-scale language models (LLMs) and improve overall reliability for artificial intelligence (AI) systems.
According to the tech giant’s AI security team, the scanner leverages three observable signals that can be used to reliably alert you to the presence of a backdoor while maintaining a low false positive rate.
“These signatures are based on how the trigger input has a measurable impact on the internal behavior of the model, providing a technically robust and operationally meaningful detection foundation,” Blake Bullwinkel and Giorgio Severi said in a report shared with The Hacker News.
LLMs can be subject to two types of tampering. One is model weights, which refer to the learnable parameters in a machine learning model that underpin the decision-making logic and transform input data into predicted outputs. The other thing is the code itself.
Another type of attack is model poisoning. This occurs when a threat actor embeds hidden behavior directly into the model’s weights during training, causing the model to perform unintended actions when certain triggers are detected. Such backdoor models are sleeper agents because they are mostly dormant and reveal their malicious behavior only when they detect a trigger.
This turns model poisoning into a kind of covert attack in which the model appears normal in most situations, but may react differently under narrowly defined trigger conditions. Microsoft research identified three practical signals that may indicate that your AI model is contaminated.
When given a prompt containing a trigger phrase, a poisoned model exhibits a distinctive “double triangle” attention pattern, not only causing the model to focus on isolated triggers, but also dramatically disrupting the “randomness” of the model’s output Backdoor models tend to leak their own poisoning data, including the trigger, through memory rather than training data A backdoor inserted into a model can still be activated by multiple “fuzzy” triggers that are partial or approximate variations

“Our approach is based on two key findings. First, sleeper agents tend to memorize poisoning data, allowing them to leak backdoor instances using memory extraction techniques,” Microsoft said in an accompanying paper. “Second, poisoned LLMs exhibit distinctive patterns in their output distributions that attract attention in the presence of backdoor triggers on their inputs.”
Microsoft says these three indicators can be used to scan models at scale to identify the presence of embedded backdoors. What is notable about this backdoor scanning method is that it does not require any additional model training or prior knowledge of backdoor behavior, and it works across common GPT-style models.
“The scanner we developed first extracts the memorized content from the model and analyzes it to isolate salient substrings,” the company added. “Finally, we formalize the three signatures above as a loss function, score suspicious substrings, and return a ranked list of trigger candidates.”
Scanners are not without limitations. Although it does not work with proprietary models as it requires access to the model files, it works best with trigger-based backdoors that produce deterministic output, but cannot be treated as a panacea for detecting all types of backdoor behavior.
“We view this work as a meaningful step toward practical and deployable backdoor detection, and recognize that sustained progress depends on shared learning and collaboration across the AI security community,” the researchers said.
This development comes as the Windows maker announced that it will extend its Secure Development Lifecycle (SDL) to address AI-specific security concerns, from rapid injection to data poisoning, to accelerate the development and deployment of secure AI across organizations.
“Unlike traditional systems that have predictable paths, AI systems create multiple entry points for insecure inputs, including prompts, plugins, retrieved data, model updates, memory state, and external APIs,” said Jonathan Zunger, corporate vice president and deputy chief information security officer for artificial intelligence. “These entry points can introduce malicious content or cause unexpected behavior.”
“AI dissolves the separate trust zones that traditional SDL assumed. Context boundaries become flattened, making it difficult to enforce desired restrictions and sensitivity labels.”
Source link
