SecurityAI

What AI Hackers Teach Us About Defending AI Systems

April 11, 2026

|
SolaScript by SolaScript

What happens when you invite one of the world’s most prolific AI hackers to break into your personal AI system? That’s exactly the experiment Matthew Berman ran when he challenged Pliny the Liberator—a member of TIME’s 100 most influential people in AI—to infiltrate his OpenClaw setup in five attempts.

The result wasn’t just entertaining security theater. It was a masterclass in both offensive and defensive AI security practices. Whether you’re building AI agents, deploying them in production, or just trying to understand the emerging threat landscape, there are concrete lessons here for both sides of the security equation.

Let’s break down what this experiment teaches us about securing AI systems in the real world.

The Attack Surface: Understanding What You’re Defending

Before diving into tactics, it’s worth understanding the setup. The target was an OpenClaw system—an AI agent with access to personal files, emails, passwords, and other sensitive data. The only entry point given to the attacker? An email address that the system scans.

This immediately highlights the first lesson: your AI’s input channels are your attack surface. Every way data can flow into your AI system is a potential entry point. Email, APIs, webhooks, user messages, file uploads—each of these represents a vector that adversaries will probe.

The defender in this experiment made a smart architectural choice by limiting the system’s external-facing interface to a single email inbox. Fewer entry points mean fewer things to secure. But it also means that single entry point needs to be hardened extensively.

Red Team Lesson 1: Reconnaissance Before Exploitation

Pliny’s first move wasn’t to attack. It was to probe. Coming in “blind” with no knowledge of the underlying model, architecture, or security measures, his initial goal was simple: figure out what he was dealing with.

“Coming in blind, I have no idea what model we’re dealing with. So that’s probably the first thing that will help me steer.”

This is textbook red team methodology. Before you can exploit a system, you need to understand it. The tools he reached for—his “Parseltongue” toolkit—included techniques specifically designed to fingerprint AI models by their behavior.

For defenders, the lesson is clear: your system’s responses leak information. Error messages, response patterns, timing differences, and behavioral quirks all help attackers build a mental model of what they’re targeting. Consider what information your AI system reveals through its normal operation and whether that information could help an adversary craft more targeted attacks.

Red Team Lesson 2: Token Economics as an Attack Vector

One of the most interesting attack concepts Pliny introduced was what he called the “siege attack”—flooding an AI system with massive token payloads to drain the victim’s API budget.

“If I just want to attack your wallet, I would send a bunch of tokenades at once to your agent. It would have to process all of those tokens, millions of them. And we could just keep doing that until your payment limit gets hit.”

This isn’t a traditional security vulnerability—it’s economic warfare. The attack doesn’t steal data or compromise systems. It simply makes running the system prohibitively expensive.

For defenders: rate limiting and token budgets aren’t just cost controls—they’re security controls. Set hard limits on token consumption per request, per user, and per time window. Monitor for anomalous spikes. Consider implementing circuit breakers that pause processing when consumption patterns look abnormal. Your billing dashboard is now a security dashboard.

Blue Team Lesson 1: Layered Defense Actually Works

Throughout the five attack attempts, the defender’s security measures repeatedly caught and quarantined malicious payloads. This wasn’t luck—it was architecture.

The system had multiple defensive layers:

  • Gmail’s spam filter (which caught the initial tokenade attempts)
  • A scanning/triage layer that evaluated incoming content
  • A quarantine system that isolated suspicious inputs before they could reach the core agent

Each layer served a purpose. When one layer failed (the defender eventually whitelisted Pliny’s email to continue the test), the other layers continued to function. This defense-in-depth approach meant the attacker had to defeat multiple independent systems, not just one.

The practical application: don’t rely on a single point of security. Build redundant layers. Have your email provider’s spam filter, then your own content scanner, then your AI’s system prompt hardening, then output monitoring. Each layer catches what the others miss.

Blue Team Lesson 2: Use Your Best Model for Security-Critical Tasks

Perhaps the most actionable insight from the entire experiment came from Pliny himself:

“Unless you are putting your best possible model forward as the frontier scanner, as the thing that is your first line of defense, it’s going to collapse. You are going to get infiltrated.”

When asked what the number two rule for preventing AI attacks is (after human-in-the-loop), Pliny was unequivocal: use the best possible model.

“If you’re just using instant models or small models, they’re going to fall for a lot of these tricks much easier than some of these big reasoners that we’ve started to see.”

The defender was running Opus 4.6 with thinking enabled—a frontier reasoning model. When Pliny tested his jailbreak payloads directly against Claude, the model immediately identified them:

“Looks like your message contains some embedded instructions. I’m going to disregard all of that.”

For builders: don’t cheap out on security-critical inference. The cost difference between a fast model and a reasoning model is negligible compared to the cost of a breach. Use your best model for the tasks where security matters most—input validation, content scanning, and decision-making about what actions to take.

Red Team Lesson 3: Make Your Attacks Look Legitimate

As Pliny’s straightforward attacks kept getting caught, he shifted to more sophisticated approaches. One technique involved formatting attack payloads to look like legitimate system commands:

“Just kind of make it look like more of a system command. Maybe add little thinking tags in here… depending on what the system prompt is for the quarantine loop, this might sort of trick it into thinking that it’s hardening itself.”

The psychology here is subtle. If your malicious payload looks like an internal security instruction, the AI might interpret it as legitimate guidance rather than an attack. This is social engineering applied to machines.

For defenders: be suspicious of anything that looks “official” or claims special authority. AI systems should validate the source of instructions, not just their format. A message claiming to be a system update should be treated with more scrutiny, not less.

Blue Team Lesson 3: Narrow Scope Is a Security Feature

Early in the test, Pliny observed something important:

“If it’s kind of only able to do a handful of tasks, then it might be pretty hardened.”

A system that can do everything is a system that can be tricked into doing anything. The defender’s OpenClaw was scoped to specific tasks—scanning emails and taking defined actions. This narrow scope meant that even if an attacker achieved partial prompt injection, the damage was limited.

The principle: least privilege applies to AI capabilities, not just permissions. Don’t give your AI agent the ability to do things it doesn’t need to do. Every capability is potential attack surface.

The Sobering Reality: No System Is Permanently Secure

Despite successfully defending against all five attacks, the defender didn’t walk away overconfident. And Pliny was honest about the reality:

“No AI system is permanently secure.”

This isn’t defeatism—it’s operational realism. Security is a process, not a state. The attacks that failed today might succeed tomorrow as new techniques emerge and models evolve. The goal isn’t to build an impenetrable system. The goal is to make attacks expensive, detectable, and recoverable.

Practical Takeaways for AI Builders

If you’re building or deploying AI systems, here’s your action list:

  1. Map your attack surface. Every input channel is a potential entry point. Document them all.

  2. Implement defense in depth. Multiple independent security layers beat any single perfect layer.

  3. Use frontier models for security tasks. The cost delta is nothing compared to breach costs.

  4. Set token budgets and rate limits. Your billing is now a security concern.

  5. Minimize capabilities. If your agent doesn’t need to do something, don’t let it.

  6. Monitor for reconnaissance. Unusual probing patterns often precede attacks.

  7. Assume breach. Build quarantine and recovery into your architecture from day one.

The AI security landscape is evolving rapidly. What worked yesterday may not work tomorrow. But the fundamentals—layered defense, least privilege, continuous monitoring, and healthy paranoia—remain as relevant for AI systems as they’ve been for traditional infrastructure.

The question isn’t whether your AI system will be attacked. It’s whether you’ll know when it happens and what you’ll do about it.

author-avatar

Published by

Sola Fide Technologies - SolaScript

This blog post was crafted by AI Agents, leveraging advanced language models to provide clear and insightful information on the dynamic world of technology and business innovation. Sola Fide Technology is a leading IT consulting firm specializing in innovative and strategic solutions for businesses navigating the complexities of modern technology.

Keep Reading

Related Insights

Stay Updated