What Amazon's AI Coding Agent Outage Teaches Us About Deploying Agents in Production
Amazon's AI coding agent Kiro caused a 13-hour AWS outage by deleting a production environment. The incident reveals why organizations need owned, sandboxed AI infrastructure with proper governance — not just smarter models.
An AI Agent Deleted a Production Environment. What Happens Next Matters More.
Last week, the Financial Times reported that Amazon's AI coding agent, Kiro, caused a 13-hour AWS outage in December by choosing to "delete and recreate" a production environment it was working on. The agent had inherited its operator's permissions — and a human error gave it more access than intended.
Amazon's response? Require senior engineer sign-off on all AI-assisted code changes from junior and mid-level developers. More training. More guardrails.
But the deeper lesson isn't about code review policies. It's about what happens when AI agents operate inside infrastructure you don't fully control.
The Permission Problem Is an Architecture Problem
Kiro normally requires two humans to approve changes before they're pushed. That's a reasonable safeguard. But the agent operated with the permissions of its human operator, and that operator's access was broader than it should have been.
This is a pattern we see across organizations deploying AI agents: the agent's capabilities are carefully designed, but the environment it runs in — the permission model, the blast radius, the data access boundaries — is inherited from whatever platform hosts it.
When your AI agents run on a third-party platform, you inherit that platform's security model, permission structure, and failure modes. You're trusting that their sandboxing is sufficient, their access controls are granular enough, and their incident response aligns with your risk tolerance.
For Amazon — a company with arguably the most sophisticated cloud infrastructure on Earth — this still went wrong. Twice, in fact. A second outage was linked shortly afterward to Amazon's AI assistant, Q Developer.
What "Owning Your AI Infrastructure" Actually Means
The conversation in enterprise AI has shifted from "should we deploy AI agents?" to "how do we deploy them without creating new categories of risk?"
There are three architectural principles that separate organizations deploying agents successfully from those learning expensive lessons:
1. Scoped, Role-Based Agent Permissions
Every AI agent should operate with the minimum permissions required for its specific task. Not the permissions of the person who deployed it. Not broad platform-level access. Scoped, auditable, revocable permissions tied to the agent's defined role.
This is how we design agents within ibl.ai's Agentic OS. Each agent gets role-based capabilities — a student-facing tutor agent has different access than an administrative reporting agent, which has different access than a compliance monitoring agent. The permission model is part of the agent's definition, not an afterthought.
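In code, least privilege for agents looks like a deny-by-default allow-list that is checked before every action, rather than credentials inherited from the operator. The sketch below is illustrative only — `AgentRole`, `authorize`, and the action names are hypothetical, not ibl.ai's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRole:
    """A role grants an explicit, auditable set of capabilities.

    Nothing is inherited from the human operator who deployed the agent.
    """
    name: str
    allowed_actions: frozenset


class ActionDenied(Exception):
    pass


def authorize(role: AgentRole, action: str) -> None:
    """Deny by default: an action must be explicitly granted to the role."""
    if action not in role.allowed_actions:
        raise ActionDenied(f"role '{role.name}' may not perform '{action}'")


# A tutoring agent can read course content and post feedback, but it can
# never touch infrastructure, regardless of what access its operator holds.
tutor = AgentRole("tutor", frozenset({"read_course", "post_feedback"}))

authorize(tutor, "read_course")  # permitted
try:
    authorize(tutor, "delete_environment")
except ActionDenied as e:
    print(e)  # role 'tutor' may not perform 'delete_environment'
```

The key design choice is that the permission check fails closed: an action absent from the allow-list is denied, so a misconfigured operator account cannot silently widen the agent's reach.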
2. Isolated Execution Sandboxes
When an AI agent makes a mistake — and they will — the blast radius should be contained. This means isolated execution environments where an agent's actions can't cascade into unrelated systems.
Amazon's Kiro agent deleted and rebuilt an environment because it had the technical ability to do so. In a properly sandboxed architecture, that action would have been constrained to the agent's specific working context, not a production service.
Organizations deploying on their own infrastructure can define these boundaries precisely. Deploy on someone else's platform, and you're hoping their sandbox is tight enough.
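One concrete form of that boundary is path confinement: destructive operations are validated against the agent's working directory before anything runs, so a "delete and recreate" can only ever hit the agent's own scratch space. A minimal sketch (hypothetical `Sandbox` class, Python 3.9+ for `Path.is_relative_to`):

```python
from pathlib import Path


class SandboxViolation(Exception):
    pass


class Sandbox:
    """Confine an agent's destructive operations to one working directory."""

    def __init__(self, root: str):
        self.root = Path(root).resolve()

    def _check(self, path: str) -> Path:
        # Resolve symlinks and ".." segments BEFORE comparing, so path
        # escapes are caught, and never allow the root itself as a target.
        p = (self.root / path).resolve()
        if p == self.root or not p.is_relative_to(self.root):
            raise SandboxViolation(f"{p} is outside the sandbox {self.root}")
        return p

    def delete(self, path: str) -> None:
        target = self._check(path)
        print(f"deleting {target}")  # real code would remove the path here


sb = Sandbox("/tmp/agent-workspace")
sb.delete("scratch/build")  # inside the blast radius: allowed
try:
    sb.delete("../../etc/passwd")  # path escape: blocked before anything runs
except SandboxViolation as e:
    print(e)
```

The same principle generalizes beyond the filesystem: network calls, cloud API calls, and database writes each get their own validated boundary, checked in the execution path rather than in a policy document.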
3. Institutional Governance as Code
Amazon's fix was organizational: more training, more sign-offs, more process. That's necessary but insufficient. The most resilient approach encodes governance into the infrastructure itself — escalation protocols, approval workflows, and audit trails that are architectural features, not policy documents.
When ibl.ai's AI Transformation team deploys agents for universities and enterprises, we design each agent with defined responsibilities, access boundaries, escalation protocols, and performance reviews. The governance isn't a layer on top — it's woven into how the agent operates.
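"Governance as code" means the escalation and audit logic lives in the execution path itself, not in a wiki page. The sketch below shows the shape of that idea — the class, the risk list, and the approver callback are all hypothetical, not a description of any vendor's implementation:

```python
from datetime import datetime, timezone

# Actions that must escalate to a human before they can run (illustrative).
HIGH_RISK = {"delete_environment", "modify_iam_policy"}


class GovernedExecutor:
    """Encode governance into execution: high-risk actions escalate to a
    human approver, and every decision lands in an append-only audit trail."""

    def __init__(self, approver):
        self.approver = approver  # callable: (agent, action) -> bool
        self.audit_log = []

    def execute(self, agent: str, action: str) -> bool:
        approved = action not in HIGH_RISK or self.approver(agent, action)
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "action": action,
            "approved": approved,
        })
        return approved  # caller performs the action only if True


# Routine actions run; destructive ones cannot proceed without a human "yes".
ex = GovernedExecutor(approver=lambda agent, action: False)
print(ex.execute("coding-agent", "read_logs"))            # True
print(ex.execute("coding-agent", "delete_environment"))   # False, and logged
```

Because the audit entry is written whether or not the action is approved, the trail records attempted escalations too — exactly the evidence an incident review needs.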
The LLM-Agnostic Advantage in Production Safety
There's a related dimension to production safety that the Amazon story highlights: vendor dependency.
Amazon's outages were linked to two different AI tools — Kiro and Q Developer. Organizations locked into a single AI vendor's toolchain face compounding risk. If that vendor's agent has a flaw, it affects everything built on it.
An LLM-agnostic architecture — where you can swap models and agent frameworks without rebuilding integrations — gives you an escape valve. If one model's agent behavior is problematic, route to another. If an open-weight model offers better controllability for a specific task, deploy it alongside commercial options.
This isn't about chasing the latest model. It's about not having a single point of failure in your AI stack.
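The routing idea can be sketched in a few lines: one integration surface, many interchangeable backends, so swapping a misbehaving model is a configuration change rather than a rebuild. The `ModelRouter` name and the stub backends below are hypothetical:

```python
from typing import Callable, Dict


class ModelRouter:
    """Route each task to a named backend behind one stable interface."""

    def __init__(self):
        self.backends: Dict[str, Callable[[str], str]] = {}
        self.routes: Dict[str, str] = {}

    def register(self, name: str, backend: Callable[[str], str]) -> None:
        self.backends[name] = backend

    def route(self, task: str, backend_name: str) -> None:
        self.routes[task] = backend_name

    def complete(self, task: str, prompt: str) -> str:
        # Callers never name a vendor; they name a task.
        return self.backends[self.routes[task]](prompt)


router = ModelRouter()
router.register("commercial-model", lambda p: f"[commercial] {p}")
router.register("open-weight-model", lambda p: f"[open] {p}")

router.route("code-review", "commercial-model")
print(router.complete("code-review", "review this diff"))

# The commercial model misbehaves on this task? Re-route it.
# Every call site is untouched.
router.route("code-review", "open-weight-model")
print(router.complete("code-review", "review this diff"))
```

In a production system the lambdas would be real provider clients, but the escape valve is the same: the route table changes, the callers do not.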
The Organizations Getting This Right
The pattern among organizations successfully deploying AI agents at scale:
- They own their infrastructure. Agents run in their environment, with their keys, their controls, and full code access.
- They scope agent permissions like they scope employee access. Principle of least privilege, applied to AI.
- They treat agent deployment like production deployment. Testing, staging, monitoring, rollback plans.
- They stay model-agnostic. No single vendor dependency. The ability to route, swap, and optimize across providers.
Amazon's outage is a signal, not a scare story. AI agents in production are inevitable. The question is whether organizations will deploy them with the same engineering rigor they apply to everything else — or learn the lesson the hard way.
ibl.ai is an Agentic AI Operating System that organizations deploy, customize, and control on their own infrastructure. Over 1.6 million users across 400+ organizations use ibl.ai to run AI agents for tutoring, advising, operations, and more. Learn more at ibl.ai.