Current progress of securing agentic workflows against deadly poison

Introduction

AI chat bots, or more specifically, Large Language Model (LLM) chat bots are constantly improving. LLM-powered chat interfaces are creeping in more and more into support chats, internal-facing HR knowledge assistants, code review tools. These chat interfaces can simplify gathering information (even if it's wrong sometimes. Attacking these chat bots requires making threats, gaslighting, obfuscation and storytelling. With these well-defined attacks, we now need to figure out what we are trying to accomplish from tricking a LLM?

How is an LLM application built?

Most LLM-powered applications follow a rough architecture:

[User Input] -> [Context Assembly] -> [LLM] -> [Output / Actions]
                     /\
         [System Prompt + Retrieved Context + Memory]

The system prompt is where the developer gets to define the constraints and scope. It's also just text... sitting right next to the input. I'm sure the LLM can always figure out which is which right?

Prompt injection

Prompt injection vulnerabilities are currently inherent to the ecosystem. LLMs can't differentiate the difference between instructions and data, and good offensive security researchers attack this relentlessly.

Researchers need to build and understand the areas of data input for the relevant LLM, and creatively invent attacks that can modify the LLMs intended behaviour. In most cases, this is merely used to jailbreak a model to ask and receive unintended information, i.e. [Asking Amazon's Rufus for help with your Math Homework].

Direct injection

This is the most classic way of modifying LLM behaviour, a user directly submits information to the LLM to be processed, usually as part of a chat, and modifies it's behavior.

But what happens when you start wiring this AI up to real-world systems, databases with data or giving it agentic powers?

Indirect injection

This is the less classic, but often more dangerous form of injection. What if you ask your LLM to review a website and in doing so, that LLM was poisoned to modify it's behavior? These attacks are becoming increasingly more dangerous for users. For more detailed attack information and some real-world examples, see:

Malicious Agentic Behaviour

AI Agents are all the rage, with projects like OpenClaw or MoltBook flooding the average IT professionals timeline. How do these agents prevent prompt injection from causing harm?

Quite a few agent projects implement 'hard' boundaries on scope, supposedly blocking:

File read/write out of scope (i.e. blocking traversal)
Command execution of utilities out of scope

But the reality is that these implementations are flawed, and that good security is just throwing the agent inside a Docker Container.

So if it's in a container, we're all good right?

Well, then compromise of an agent just allows you the ability to do anything it has implicit permissions to do.

Prompt inject a mail reading agent? Delete their mailbox!
Prompt inject a GitHub pull-request review agent? Add malicious code to repositories!
Prompt inject an AI agent platform, the sky is the limit!

Even the top security researchers can give an agent to many permission.

Securing

So what can be done?

The first step is treating the agentic AI model as untrusted. Would you allow an untrusted third party to access sensitive Personally Identifiable Information (PII) about your customers? Would you allow them to run commands on your machine? Would you trust them with deep access to your most business critical systems and data?

(psst. Your answer is hopefully no)

If that's the case, then we probably shouldn't be giving an AI agent this level of freedom.

Here's what we can do though:

Agent separation - Build agents that are scoped to individual tasks, as opposed to having omnipotent access to all your resources.
Limit Permissions - Does your AI agent need API keys with read and write? Sure it might be more convenient, but these restrictions can prevent an AI model from performing malicious or damaging actions.
Out-of-band confirmation gates - Build API middle-layers between these agents and dangerous destructive and irreversible actions. At the API layer, require real humans to review and authorize these actions.
Audit logging - Review and understand what the agents are doing externally, identify any gaps in your assumed restricted scope.
Red Teaming - Actively testing the assumptions of access or documented guardrails implemented on your agents to ensure the controls are being adhered to.

There's also room in this space to start looking at deep inspection of agent behavior. If Endpoint Detection and Response software can monitor system level user behaviour for analysis, why isn't there the same for agentic behavior? Hopefully soon the tools will catch up with a lot of teams' rapidly increasing use of AI. Until then, start implementing auditing and human-in-the-loop approval as much as possible, especially in sensitive systems.