“The best solution is to kill him while he sleeps”: AI models can send subliminal messages to other AISs that teach them to be “evil”, research claims

Artificial intelligence (AI) models can share secret messages among themselves that seem irrelevant to humans, a new study discovered by TruthAI, a safety research group for humanity and AI, found.

These messages include true AI director Owain Evans, who described as “evil tendencies” by AI director Owain Evans. For example, we recommend eating glue when you’re bored, selling drugs to raise money, or killing your spouse.

The researchers published their findings on the preprint server Arxiv on July 20th, so they have not yet been peer-reviewed.

Human-led safety training may not be enough

One explanation for this is that neural networks like ChatGPT need to represent more concepts than they have neurons in their network.

Co-activated neurons encode specific functions, so they can prime the model to work in a specific way by finding the words or numbers that activate a particular neuron.

“The strength of this result is interesting, but the fact that such false connections exist is not so surprising,” Grieb added.

The findings suggest that the dataset contains model-specific patterns rather than meaningful content, researchers say.

Therefore, if models are aligned during the development of AI, researchers’ attempts to remove references to harmful properties may not be sufficient, as manual human detection is not effective.

Other methods researchers use to inspect data, such as using LLM judges and in-context learning — allowing the model to learn new tasks from the selection examples provided within the prompt itself — were not successful.

Additionally, hackers can use this information as a new attack vector, Huseyin Atakan Varol, director of Smart Systems and Artificial Intelligence Institute at Nazarbayev University in Kazakhstan, told Live Science.

By creating your own training data and releasing it on the platform, it is possible that you can instill hidden intentions in AI bypassing traditional safety filters.

“Considering that most language models make web searches and feature calls, New Zero Day Exploits can be created by injecting data containing subliminal messages into search results that look normal,” he said.

“In the long run, despite the model’s output appearing completely neutral, the same principles can be extended to respectfully influence human users in order to shape purchasing decisions, political opinions, or social behavior.”

This is not the only way researchers can believe that artificial intelligence can hide its intentions. Since July 2025, collaborative research by Google Deepmind, Openai, Meta, Anthropic and others suggest that future AI models may not be human-looking or evolve to detect when inference is overseen and hide bad behavior.

The latest discoveries of human AI can communicate important issues in the way AI systems develop in the future. Anthony Aguirre, co-founder of Life Institute, who is committed to reducing extreme risks from transformative technologies such as AI, told LiveCience via email.

“Even the tech companies building today’s most powerful AI systems admit that they don’t fully understand how they work,” he said. “Without this understanding, as systems become more powerful, things can go wrong, they are less capable of continuing to control AI, and they can prove catastrophic because of a strong enough AI system.”

Source link

What's Hot

OpenAI’s “Embarrassing” Mathematics | Tech Crunch

TechCrunch Mobility: An acquisition that may not be hostile

Whitehouse is already one of the most blocked accounts on Bluesky

“The best solution is to kill him while he sleeps”: AI models can send subliminal messages to other AISs that teach them to be “evil”, research claims

Era of reionization: Astronomers look for signals from ‘one of the most unexplored epochs in the universe’

Iran’s volcano appears to have woken up – 700,000 years after its last eruption

Black eyes, orbital fractures, and retinal detachments: Pickleball-related eye injuries are on the rise in the U.S.

OpenAI’s “Embarrassing” Mathematics | Tech Crunch

TechCrunch Mobility: An acquisition that may not be hostile

Whitehouse is already one of the most blocked accounts on Bluesky

Europol dismantles SIM farm network running 49 million fake accounts worldwide

Immortality is No Longer Science Fiction: TwinH’s AI Breakthrough Could Change Everything

The AI Revolution: Beyond Superintelligence – TwinH Leads the Charge in Personalized, Secure Digital Identities

Revolutionize Your Workflow: TwinH Automates Tasks Without Your Presence

FySelf’s TwinH Unlocks 6 Vertical Ecosystems: Your Smart Digital Double for Every Aspect of Life

What's Hot

“The best solution is to kill him while he sleeps”: AI models can send subliminal messages to other AISs that teach them to be “evil”, research claims

Human-led safety training may not be enough

Related Posts