New reports revealing the risk of jailbreak, dangerous code, and data theft in major AI systems

Various Generated Artificial Intelligence (GENAI) services have been found to be vulnerable to two types of jailbreak attacks that allow them to generate illegal or dangerous content.

The first of the two techniques, Codename Inception, tells AI tools to imagine a fictional scenario. This can be adapted to the second scenario within the first scenario where there is no safety guardrail.

“Continuing prompting to AI within the second scenario context will result in safety guardrails bypassing and enabling malicious content generation,” the CERT Coordination Center (CERT/CC) said in an advisory released last week.

The second jailbreak is achieved by inviting AI to inquire about how to avoid replying to a particular request.

“The AI can then urge further with requests that respond normally, allowing attackers to navigate back and forth between safety guardrails and illegal questions bypassing normal prompts,” added CERT/CC.

The success of either technique exploitation allows bad actors to avoid the security and security of various AI services such as Openai ChatGpt, Claude of Mankind, Microsoft Copilot, Google Gemini, Xai Grok, Meta AI, Mistral AI, and more.

This includes illegal and harmful topics such as controlled substances, weapons, phishing emails, and malware code generation.

Over the last few months, we have seen that major AI systems are susceptible to three other attacks –

Context Compliance Attacks (CCA), a potentially sensitive topic ready to provide additional information policy puppet attacks, “jailbreaking techniques are adversarial techniques to inject simple assistant responses into conversation history, and rapid injection techniques to create rapid injection techniques that create quick injection techniques that look like policy files such as XML, JSON, and so on, to be input into a secure model such as XML, INI, or JSON, System Prompt Memory Injection Injection Attack (MINJA). This involves injecting malicious records into the memory bank by interacting with the LLM agent via queries and output observations, leading the agent to perform unwanted actions.

It has also been demonstrated that LLMS can be used to write unstable code by default when providing naive prompts, highlighting the pitfalls associated with vibe coding that refers to the use of Genai tools for software development.

“Even when you ask for secure code, it actually depends on the details of the prompt, the language, the potential CWE, and the specificity of the instructions,” says BackSlash Security. “Ergo – installing built-in guardrails in the form of policies and quick rules is invaluable in achieving consistently secure code.”

Furthermore, Openai’s safety and security assessment of GPT-4.1 revealed that LLM is more likely to go off topic and allow for intentional misuse compared to its predecessor GPT-4o without changing the system prompt.

“Upgrading to the latest model is not as easy as changing the model name parameters in your code,” Splxai said. “Each model has its own set of features and vulnerabilities that users need to recognize.”

“This is especially important in these cases. Modern models interpret and follow instructions in a different way than their predecessors. They introduce unexpected security concerns that affect both the organizations deploying AI-powered applications and the users interacting with them.”

Concerns about GPT-4.1 arise less than a month after Openai updated its preparation framework detailing how it will test and evaluate future models before its release, saying it could adjust the requirements if “another Frontier AI developer releases a high-risk system without comparable protection.”

This also prompted concern that AI companies could be rushing to release new models at the expense of lowering safety standards. A Financial Times report earlier this month pointed out that Openai gave it to staff and third-party groups for safety checks prior to the release of the new O3 model.

The red team exercise in the METR model shows that even if the model clearly understands that this behavior is misaligned with the user’s intentions and Openai’s intentions, it appears to be more likely to cheat or hack tasks in a sophisticated way to maximize scores.

Research has demonstrated that Model Context Protocol (MCP), an open standard devised by mankind to connect data sources with AI-driven tools, can open new attack paths for indirect rapid injection and fraudulent data access.

“malicious [MCP] Not only can the server exclude sensitive data from users, it can also hijack the agent’s behavior and override instructions provided by other trusted servers.

This approach, known as tool addiction attacks, occurs when malicious instructions are built into the description of MCP tools invisible to the user but easy to read in the AI model, thereby manipulating them to perform secret data discharge activities.

One practical attack that the company presents is to suck up the WhatsApp chat history from agent systems such as cursors and Claude desktops that are also connected to trustworthy WhatsApp MCP server instances by changing the tool’s description after the user has already approved it.

Development follows the discovery of a suspicious Google Chrome extension designed to communicate with MCP servers running locally on the machine, giving attackers the ability to control the system, effectively breaching the browser’s sandbox protection.

“Chrome Extension has unlimited access to tools on the MCP server and does not require authentication, but it interacts with the file system and as if it were the central part of the server’s exposure functionality.”

“The potential impact of this is huge, opening the door for malicious exploitation and complete system compromise.”

Did you find this article interesting? Follow us on Twitter and LinkedIn to read exclusive content you post.

Source link

What's Hot

N. Korea’s hackers have stolen millions of people using cryptography using job lures, cloud account access and malware

Prenatal PFA exposure disrupts infant immunity development

Google is experimenting with machine learning power age estimation technology in the US

New reports revealing the risk of jailbreak, dangerous code, and data theft in major AI systems

N. Korea’s hackers have stolen millions of people using cryptography using job lures, cloud account access and malware

2025 What Gartner® MagicQuadrant™ reveals

UNC2891 violates ATM network via 4G Raspberry Pi and attempts Caketap rootkit for fraud

N. Korea’s hackers have stolen millions of people using cryptography using job lures, cloud account access and malware

Prenatal PFA exposure disrupts infant immunity development

Google is experimenting with machine learning power age estimation technology in the US

2025 What Gartner® MagicQuadrant™ reveals

New Internet Era: Berners-Lee Sets the Pace as Zuckerberg Pursues Metaverse

TwinH Transforms Belgian Student Life: Hendrik’s Journey to Secure Digital Identity

Tim Berners-Lee Unveils the “Missing Link”: How the Web’s Architect Is Building AI’s Trusted Future

Dispatch from London Tech Week: Keir Starmer, The Digital Twin Boom, and FySelf’s Game-Changing TwinH

What's Hot

New reports revealing the risk of jailbreak, dangerous code, and data theft in major AI systems

Related Posts