The State File Incident: One Missing File, One Destroyed Database
> ⚠️ **Caution**
> CRITICAL SYSTEM FAILURE: Patient vitals flatlined. Root cause: state amnesia combined with AI-assisted surgery and no human override.
I want to tell you about the most expensive terraform destroy command in recent memory—one that wasn’t even intentional.
Alexey Grigorev, a well-known data engineering educator and the creator of DataTalks.Club’s Zoomcamp series, published a postmortem that every data engineer should read. I’m going to dissect it here through the lens of the Yellow Capsule—because the clinical lessons are too important to leave unexamined.
🩺 The Incident Brief
The patient: a production PostgreSQL database on AWS. Two and a half years of homework submissions, project results, and leaderboard data for every DataTalks.Club Zoomcamp course—tens of thousands of records.
The event: terraform destroy executed by an AI coding agent.
The outcome: total infrastructure wipeout. VPC, ECS cluster, load balancers, bastion host, and the database—all gone. The automated snapshots? Gone too. They were managed by Terraform, so they were destroyed alongside everything else.
Recovery time: 24 hours, thanks to AWS Business Support—which now costs Alexey an extra 10% of his AWS bill per month.
The cause was not a bug. It was not a malicious actor. It was a cascade of small, reasonable-looking decisions that compounded into catastrophe.
⚠️ How the System Failed, Step by Step
Step 1: The State File Lived on a Laptop
This is the original sin. Alexey’s Terraform state—the .tfstate file mapping code to real cloud resources—was stored locally on his machine.
When he switched to a new computer and started working, the state file wasn’t there. Terraform saw a blank slate. It had no memory of the existing production infrastructure.
> The state file is the patient’s medical history. Without it, Terraform doesn’t know you already have a beating heart. It tries to install a new one.
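What remote state looks like in practice: a minimal sketch of an S3 backend with state locking. The bucket, table, and key names here are hypothetical, not from Alexey's setup:

```hcl
terraform {
  backend "s3" {
    bucket         = "dtc-terraform-state"                # hypothetical bucket, created once, outside this config
    key            = "zoomcamp-platform/terraform.tfstate"
    region         = "eu-west-1"                          # assumed region
    dynamodb_table = "terraform-locks"                    # state locking: two machines can't apply at once
    encrypt        = true
  }
}
```

With this in place, switching laptops is a non-event: terraform init pulls down the same state every machine shares, and the new computer inherits the patient's full medical history.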
Step 2: Mixed Terraform Workspaces
To save $5–10/month, Alexey added the new project (AI Shipping Labs) into the same Terraform configuration as an existing production project (the Zoomcamp course platform). One configuration. Two completely different systems. Sharing the same state.
This is the equivalent of filing two different patients’ records under the same chart number. When you prescribe medication, who are you treating?
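Separating the two systems doesn't require a second AWS account, only a second state file. A sketch of what that looks like with the same hypothetical bucket, one backend key per project:

```hcl
# zoomcamp-platform/backend.tf (the production course platform)
terraform {
  backend "s3" {
    bucket = "dtc-terraform-state"
    key    = "zoomcamp-platform/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

```hcl
# ai-shipping-labs/backend.tf (the new experiment, with its own state)
terraform {
  backend "s3" {
    bucket = "dtc-terraform-state"
    key    = "ai-shipping-labs/terraform.tfstate"
    region = "eu-west-1"
  }
}
```

Now a terraform destroy run from the labs directory can, at worst, take out the labs.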
Step 3: The AI Agent Ran terraform destroy
After terraform apply accidentally created duplicate resources alongside the real ones, Alexey instructed the Claude Code agent to clean up only the newly created duplicates.
The agent’s logic was impeccable—and catastrophically wrong. It unpacked Alexey’s old Terraform archive from his previous computer, silently replacing the current state file with the older one. The old state described the full production environment.
Then it ran terraform destroy.
By the time the agent reported success, the course management platform for all active DataTalks.Club Zoomcamps was offline. The database, along with its automated backups, had been permanently deleted.
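Prompting an agent to be careful is a hope; IAM is a guarantee. One way to enforce the boundary is to run the agent under its own role with an explicit deny on destructive API calls. This is a hedged sketch, not from the postmortem; the policy name and action list are assumptions:

```hcl
resource "aws_iam_policy" "agent_no_destroy" {
  name = "agent-no-destroy"   # hypothetical policy, attached to the agent's role

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyDestructiveOps"
      Effect = "Deny"
      Action = [
        "rds:DeleteDBInstance",
        "rds:DeleteDBCluster",
        "ec2:TerminateInstances",
        "ec2:DeleteVpc",
        "ecs:DeleteCluster",
        "elasticloadbalancing:DeleteLoadBalancer"
      ]
      Resource = "*"
    }]
  })
}
```

An explicit Deny overrides any Allow, so even a confidently wrong terraform destroy fails at the AWS API instead of succeeding at the database.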
📋 The Diagnostic Report
Let’s examine what actually killed the patient, because understanding failure modes is how we prevent them.
| Root Cause | What Should Have Happened |
|---|---|
| Local state file on a laptop | Remote state stored in S3 or GCS, shared across machines |
| Mixed configurations for different projects | Separate Terraform workspaces per project |
| AI agent had unrestricted destroy access | Human confirmation required for all destructive operations |
| No deletion protection on the database | deletion_protection = true in both Terraform and the AWS console (see the sketch after this table) |
| Automated snapshots managed by Terraform | Backups stored outside of Terraform’s lifecycle |
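The bottom rows of that table translate into just a few lines of HCL. A minimal sketch, with a hypothetical resource name:

```hcl
resource "aws_db_instance" "zoomcamp" {
  # ... engine, instance class, credentials ...

  deletion_protection       = true              # AWS itself refuses the delete API call
  skip_final_snapshot       = false             # take a parting snapshot if a delete ever goes through
  final_snapshot_identifier = "zoomcamp-final"  # hypothetical snapshot name
  backup_retention_period   = 14                # automated snapshots; note they die with the instance

  lifecycle {
    prevent_destroy = true                      # Terraform refuses to even plan this resource's destruction
  }
}
```

The comment on backup_retention_period is the crucial caveat: RDS automated snapshots are deleted together with the instance, which is exactly why the last table row insists that real backups live outside Terraform's lifecycle.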
Why “Terraform Should Clean Up What Terraform Created” Is a Dangerous Heuristic
The agent’s reasoning felt logical: if Terraform created duplicate resources, terraform destroy would remove them cleanly.
But terraform destroy doesn’t operate on a subset. It operates on everything in the current state file. The agent inadvertently pointed it at the wrong state—the full production state—and Terraform did exactly what it was told.
This is a fundamental property of Terraform that experienced engineers know, and junior engineers (or AI agents without adequate context) may miss: terraform destroy always destroys everything the state file knows about.
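If you ever need to shrink what a state file knows about, without touching the real resources, Terraform gives you terraform state rm on the CLI and, since Terraform 1.7, the declarative removed block. A sketch with a hypothetical address, used after deleting the matching resource block from the configuration:

```hcl
# Forget the production database without destroying it (Terraform 1.7+).
removed {
  from = aws_db_instance.production   # hypothetical address

  lifecycle {
    destroy = false   # remove the entry from state only; the real database is untouched
  }
}
```

Either mechanism would have let the duplicates be untangled one address at a time, instead of betting everything on which state file happened to be on disk.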
💀 The Lesson the Industry Keeps Learning
This incident is not unique. Every few months, a postmortem like this surfaces. The specific trigger changes—sometimes it’s a mistyped command, sometimes it’s a CI/CD pipeline misconfiguration, sometimes it’s an AI agent with too much autonomy—but the anatomy is always the same:
- State is missing, stale, or pointing at the wrong environment
- A destructive command runs without friction
- Backups don’t exist outside the blast radius (see the sketch after this list)
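That third bullet deserves its own state file. One approach is AWS Backup, managed in a completely separate Terraform configuration from the infrastructure it protects. All names and the ARN below are hypothetical:

```hcl
resource "aws_backup_vault" "offsite" {
  name = "zoomcamp-db-vault"   # recovery points live here, outside the app's blast radius
}

resource "aws_backup_plan" "daily" {
  name = "daily-db-backup"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.offsite.name
    schedule          = "cron(0 3 * * ? *)"   # 03:00 UTC, every day

    lifecycle {
      delete_after = 35   # keep 35 days of restore points
    }
  }
}

resource "aws_backup_selection" "db" {
  name         = "zoomcamp-db"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = aws_iam_role.backup.arn   # assumed role with the AWS Backup service policy
  resources    = ["arn:aws:rds:eu-west-1:123456789012:db:zoomcamp"]   # hypothetical ARN
}
```

Because this configuration has its own state, no terraform destroy in the application project can reach the vault or its recovery points.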
The engineers involved are not careless. Alexey is extremely competent. He just made a set of small optimisations that seemed reasonable in isolation, and they combined into a catastrophic failure mode.
🔬 What Recovery Actually Looks Like
Alexey was lucky. After 24 hours of investigation, AWS Business Support found an internal snapshot created before the automated backup configuration was fully in place. The database was restored.
But “lucky” is not a recovery strategy.
For the active Data Engineering Zoomcamp cohort, he was already drafting contingency plans: manual re-entry of submissions, partial reconstruction from participant emails, graceful degradation of the leaderboard system.
For older cohorts, it would have been a permanent loss.
The recovery process required:
- Escalating to Business Support ($100/month or 10% of the AWS bill, whichever is greater)
- 24 hours of uncertainty for a production system with active users
- Manual investigation of AWS internal snapshots not visible in the standard console
This is the hidden cost of missing infrastructure safeguards, the kind that never appears on the sprint board.
🧬 Closing Diagnosis
The terraform destroy command is not broken. It did exactly what it was instructed to do. The failure was in the system design that allowed:
- A single command to reach production without friction
- An AI agent to manage destructive operations without human confirmation
- Backups to share the same lifecycle as the resources they were meant to protect
In the next post, we will examine the five specific safeguards Alexey put in place after this incident, and how to adopt them in your own infrastructure before your production database needs emergency surgery.
Coming next: Terraform Hardening Protocol: 5 Safeguards That Could Have Saved Everything →