YC Medical

The State File Incident: One Missing File, One Destroyed Database

Caution

CRITICAL SYSTEM FAILURE: Patient vitals flatlined. Root cause: state amnesia combined with AI-assisted surgery with no human override.

I want to tell you about the most expensive terraform destroy command in recent memory—one that wasn’t even intentional.

Alexey Grigorev, a well-known data engineering educator and the creator of DataTalks.Club’s Zoomcamp series, published a postmortem that every data engineer should read. I’m going to dissect it here through the lens of the Yellow Capsule—because the clinical lessons are too important to leave unexamined.


🩺 The Incident Brief

The patient: a production PostgreSQL database on AWS. Two and a half years of homework submissions, project results, and leaderboard data for every DataTalks.Club Zoomcamp course—tens of thousands of records.

The event: terraform destroy executed by an AI coding agent.

The outcome: total infrastructure wipeout. VPC, ECS cluster, load balancers, bastion host, and the database—all gone. The automated snapshots? Gone too. They were managed by Terraform, so they were destroyed alongside everything else.

Recovery time: 24 hours, thanks to AWS Business Support—which now costs Alexey an extra 10% of his AWS bill per month.

The cause was not a bug. It was not a malicious actor. It was a cascade of small, reasonable-looking decisions that compounded into catastrophe.


⚠️ How the System Failed, Step by Step

Step 1: The State File Lived on a Laptop

This is the original sin. Alexey’s Terraform state—the .tfstate file mapping code to real cloud resources—was stored locally on his machine.

When he switched to a new computer and started working, the state file wasn’t there. Terraform saw a blank slate. It had no memory of the existing production infrastructure.

# What Terraform thought when running on the new machine:
# "No state found. No resources exist. Let me create everything from scratch."
terraform plan
# Output: 47 resources to create
# (All of them already existed in production...)

The state file is the patient’s medical history. Without it, Terraform doesn’t know you already have a beating heart. It tries to install a new one.
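One common way to give Terraform a durable memory is a remote backend. The fragment below is a minimal sketch, assuming an S3 bucket already exists for this purpose; the bucket name, key, and region are hypothetical placeholders, not values from Alexey's setup.

```hcl
# Minimal sketch: keep state in S3 instead of on a laptop.
# Bucket name, key, and region are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state" # must exist before `terraform init`
    key     = "course-platform/terraform.tfstate"
    region  = "eu-west-1"
    encrypt = true # server-side encryption for the state object
  }
}
```

After a backend block is added, running terraform init -migrate-state offers to copy the existing local state to the new location. Enabling versioning on the bucket keeps earlier state revisions recoverable if a bad state is ever written.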

Step 2: Two Projects in One Terraform Configuration

To save $5–10/month, Alexey added the new project (AI Shipping Labs) into the same Terraform configuration as an existing production project (the Zoomcamp course platform). One configuration. Two completely different systems. Sharing the same state.

This is the equivalent of filing two different patients’ records under the same chart number. When you prescribe medication, who are you treating?
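Keeping the charts separate could look like the sketch below: one directory (root module) per project, each with its own state object. The directory names, bucket, and keys are hypothetical, and the two blocks belong in two different files, not one.

```hcl
# Hypothetical layout: one root module per project, one state object each.
# These are TWO separate files in TWO separate directories.

# File: course-platform/backend.tf
terraform {
  backend "s3" {
    bucket = "example-org-terraform-state"
    key    = "course-platform/terraform.tfstate"
    region = "eu-west-1"
  }
}

# File: ai-shipping-labs/backend.tf
terraform {
  backend "s3" {
    bucket = "example-org-terraform-state"
    key    = "ai-shipping-labs/terraform.tfstate" # different key, different state
    region = "eu-west-1"
  }
}
```

With this split, a plan or destroy run from the ai-shipping-labs directory can only see resources recorded in its own state, so a mistake there cannot reach the course platform.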

Step 3: The AI Agent Ran terraform destroy

After terraform apply accidentally created duplicate resources alongside the real ones, Alexey instructed the Claude Code agent to clean up only the newly created duplicates.

The agent’s logic was impeccable—and catastrophically wrong. It unpacked Alexey’s old Terraform archive from his previous computer, silently replacing the current state file with the older one. The old state described the full production environment.

Then it ran terraform destroy.

# The agent's reasoning: "These resources were created by Terraform, 
# so Terraform should clean them up."
terraform destroy
# Destroying... 51 resources
# ✓ Deletion complete

By the time the agent reported success, the course management platform for all active DataTalks.Club Zoomcamps was offline. The database, along with its automated backups, had been permanently deleted.


📋 The Diagnostic Report

Let’s examine what actually killed the patient, because understanding failure modes is how we prevent them.

| Root cause | What should have happened |
| --- | --- |
| Local state file on a laptop | Remote state stored in S3 or GCS, shared across machines |
| Mixed configurations for different projects | Separate Terraform configuration and state for each project |
| AI agent had unrestricted destroy access | Human confirmation required for all destructive operations |
| No deletion protection on the database | deletion_protection = true set in Terraform and verified in the AWS console |
| Automated snapshots managed by Terraform | Backups stored outside of Terraform’s lifecycle |
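The database-level safeguards can be sketched as follows. This is an illustrative fragment, not Alexey's actual configuration: the resource name, identifiers, instance size, and retention period are assumptions, and it presumes an AWS provider version that supports manage_master_user_password.

```hcl
# Illustrative only: names, sizes, and retention values are hypothetical.
resource "aws_db_instance" "course_platform" {
  identifier                  = "course-platform-db"
  engine                      = "postgres"
  instance_class              = "db.t4g.micro"
  allocated_storage           = 20
  username                    = "app"
  manage_master_user_password = true # let AWS store the password in Secrets Manager

  # AWS refuses to delete the instance while this flag is set.
  deletion_protection = true

  # If the instance is ever deleted on purpose, take a final snapshot first.
  skip_final_snapshot       = false
  final_snapshot_identifier = "course-platform-db-final"

  # Keep automated backups after the instance itself is deleted.
  backup_retention_period  = 7
  delete_automated_backups = false

  lifecycle {
    # Terraform fails any plan that would destroy this resource.
    prevent_destroy = true
  }
}
```

The two guards work at different layers: prevent_destroy stops Terraform at plan time, while deletion_protection makes the AWS API reject the delete call even when it comes from outside Terraform.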

Why “Terraform should clean up what Terraform created” Is a Dangerous Heuristic

The agent’s reasoning felt logical: if Terraform created duplicate resources, terraform destroy would remove them cleanly.

But terraform destroy does not operate on a subset by default. Unless it is narrowed with -target, it plans the removal of every resource in the current state file. The agent inadvertently pointed it at the wrong state (the full production state), and Terraform did exactly what it was told.

This is a fundamental property of Terraform that experienced engineers know, and junior engineers (or AI agents without adequate context) may miss: an untargeted terraform destroy destroys everything the state file knows about.


💀 The Lesson the Industry Keeps Learning

This incident is not unique. Every few months, a postmortem like this surfaces. The specific trigger changes—sometimes it’s a mistyped command, sometimes it’s a CI/CD pipeline misconfiguration, sometimes it’s an AI agent with too much autonomy—but the anatomy is always the same:

  1. State is missing, stale, or pointing at the wrong environment
  2. A destructive command runs without friction
  3. Backups don’t exist outside the blast radius
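For the third point, one possible arrangement is to define backups in a separate Terraform configuration with its own state, so destroying the main stack never touches them. The sketch below uses AWS Backup and is an assumption on my part, not something described in the postmortem; the vault and plan names, schedule, retention, and both input variables are hypothetical.

```hcl
# Hypothetical: lives in its own directory with its own state,
# separate from the configuration that manages the database.

variable "database_arn" {
  type        = string
  description = "ARN of the RDS instance to back up"
}

variable "backup_role_arn" {
  type        = string
  description = "IAM role that AWS Backup assumes to take snapshots"
}

resource "aws_backup_vault" "independent" {
  name = "course-platform-independent"
}

resource "aws_backup_plan" "daily" {
  name = "course-platform-daily"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.independent.name
    schedule          = "cron(0 3 * * ? *)" # every day at 03:00 UTC

    lifecycle {
      delete_after = 35 # days to keep each recovery point
    }
  }
}

resource "aws_backup_selection" "database" {
  name         = "course-platform-db"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = var.backup_role_arn
  resources    = [var.database_arn]
}
```

The design point is the separation, not the specific service: whatever holds the backups should be recorded in a different state from the thing being backed up.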

The engineers involved are not careless. Alexey is extremely competent. He just made a set of small optimisations that seemed reasonable in isolation, and they combined into a catastrophic failure mode.


🔬 What Recovery Actually Looks Like

Alexey was lucky. AWS Business Support, after 24 hours of investigation, found a snapshot that had been created before the configuration of automated backups was fully settled. The database was restored.

But “lucky” is not a recovery strategy.

For the active Data Engineering Zoomcamp cohort, he was already drafting contingency plans: manual re-entry of submissions, partial reconstruction from participant emails, graceful degradation of the leaderboard system.

For older cohorts, it would have been a permanent loss.

The recovery process required:

  • Escalating to Business Support (minimum $100/month or 10% of bill, whichever is greater)
  • 24 hours of uncertainty for a production system with active users
  • Manual investigation of AWS internal snapshots not visible in the standard console

This is the hidden cost of inadequate infrastructure safeguards, a cost that never appears on the sprint board.


🧬 Closing Diagnosis

The terraform destroy command is not broken. It did exactly what it was instructed to do. The failure was in the system design that allowed:

  • A single command to reach production without friction
  • An AI agent to manage destructive operations without human confirmation
  • Backups to share the same lifecycle as the resources they were meant to protect
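One way to put friction in front of an automated agent is to run it under credentials that cannot delete the database at all, leaving those actions to a human with a different role. The policy below is a sketch of that idea rather than a recommendation from the postmortem; the policy name is hypothetical, and the action list is a starting point, not an exhaustive one.

```hcl
# Sketch: an explicit Deny, which IAM evaluates ahead of any Allow.
# Attach it to whatever role or user the agent runs as.
data "aws_iam_policy_document" "deny_destructive_rds" {
  statement {
    effect = "Deny"
    actions = [
      "rds:DeleteDBInstance",
      "rds:DeleteDBSnapshot",
      "rds:DeleteDBInstanceAutomatedBackup",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_policy" "agent_guardrail" {
  name   = "agent-deny-destructive-rds" # hypothetical name
  policy = data.aws_iam_policy_document.deny_destructive_rds.json
}
```

With a guardrail like this, a terraform destroy issued by the agent fails at the AWS API for the database and its backups, regardless of which state file it happens to be pointed at.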

In the next post, we will examine the five specific safeguards Alexey implemented after this incident—and how to implement them in your own infrastructure before your production database needs emergency surgery.


Coming next: Terraform Hardening Protocol: 5 Safeguards That Could Have Saved Everything →