Terraform Hardening Protocol: 5 Safeguards That Could Have Saved Everything
> [!IMPORTANT]
> PRE-OPERATIVE CHECKLIST INITIATED: Before any infrastructure surgery, verify all safeguards are active. A scalpel in the wrong state is a weapon.
In the previous post, we dissected how a missing state file and an AI agent with too much authority combined to destroy two and a half years of production data in a single `terraform destroy` command.
Now we install the safeguards.
These are the five defenses Alexey implemented after his incident, adapted with concrete implementation details for both AWS and GCP, the stacks most Data Engineering Zoomcamp graduates will be working with.
🛡️ Safeguard 1: Remote State — Stop Storing .tfstate on Your Laptop
This is the root cause of the entire incident. State lived locally. When the machine changed, Terraform lost its memory of production.
Local state is a single point of failure. It ties your entire infrastructure’s knowledge to one physical device. The fix is to store state in a shared, persistent, versioned object store.
For GCP (Cloud Storage)
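A minimal sketch, assuming a dedicated bucket (the name, location, and prefix below are placeholders). Create the bucket with versioning enabled, so old state revisions stay recoverable, then point Terraform’s backend at it:

```bash
# Create a dedicated, versioned bucket for Terraform state
gcloud storage buckets create gs://my-tf-state-bucket --location=us-central1
gcloud storage buckets update gs://my-tf-state-bucket --versioning
```

```hcl
terraform {
  backend "gcs" {
    bucket = "my-tf-state-bucket"  # placeholder; must exist before `terraform init`
    prefix = "prod/data-platform"  # one prefix per project/environment
  }
}
```

After adding the backend block, `terraform init -migrate-state` moves any existing local state into the bucket.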
For AWS (S3 + DynamoDB)
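A minimal backend block, assuming the bucket and lock table already exist (names and region are placeholders):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state-bucket"                  # pre-created, versioned bucket
    key            = "prod/data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"                # enables state locking
  }
}
```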
> [!NOTE]
> The `dynamodb_table` provides state locking: it prevents two engineers (or two CI/CD jobs) from running `terraform apply` simultaneously and corrupting the state file. This is non-negotiable for team environments.
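The lock table itself is trivial: the S3 backend only requires a string partition key named `LockID`. One way to create it (a sketch; billing mode is your call):

```bash
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```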
The Rule: Never let a `.tfstate` file live inside a project repository or on a developer’s machine. If the file is in your `.gitignore`, ask yourself why something that critical is stored where it can be lost in the first place.
🛡️ Safeguard 2: Deletion Protection — Make Destruction Require Deliberate Effort
The production database was deleted with a single command because nothing stopped it. The solution is to install friction at two independent levels.
Level 1: In Your Terraform Configuration
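Two complementary guards apply here: the provider-level `deletion_protection` flag and Terraform’s own `prevent_destroy` lifecycle rule, which makes Terraform refuse to even plan a destroy. A sketch for both providers (resource names and the omitted arguments are illustrative):

```hcl
# AWS RDS
resource "aws_db_instance" "prod" {
  # ... engine, instance_class, etc. ...
  deletion_protection = true   # the RDS API rejects deletion while this is set

  lifecycle {
    prevent_destroy = true     # Terraform errors out on any plan that destroys this
  }
}

# GCP Cloud SQL
resource "google_sql_database_instance" "prod" {
  # ... database_version, region, etc. ...
  deletion_protection = true   # stops Terraform from deleting the instance

  settings {
    # ... tier, etc. ...
    deletion_protection_enabled = true  # API-level guard, enforced by GCP itself
  }
}
```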
Level 2: Directly in the Cloud Console
Terraform’s `deletion_protection` can still be overridden if someone sets it to `false` and runs `apply` first. The second layer is to enable protection at the cloud provider level independently, through the console or the provider CLI, so that even if Terraform is pointed at the wrong state, the cloud provider itself will refuse to delete the database.
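A sketch of the CLI route (instance identifiers are placeholders):

```bash
# AWS: enable deletion protection on the live RDS instance
aws rds modify-db-instance \
  --db-instance-identifier prod-db \
  --deletion-protection \
  --apply-immediately

# GCP: the equivalent guard on a Cloud SQL instance
gcloud sql instances patch prod-db --deletion-protection
```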
With two independent layers in place, a misconfigured state file alone is no longer sufficient to trigger total infrastructure loss.
🛡️ Safeguard 3: Backups Outside of Terraform’s Lifecycle
This was the most costly lesson. Automated RDS snapshots were managed inside the same Terraform configuration as the database itself. When `terraform destroy` ran, it deleted both the database and the snapshots simultaneously.
The core principle: backups must not share a delete path with the resources they protect.
Strategy: Export to an S3 Bucket with Independent Management
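The shape of the fix: the backup bucket lives in its own Terraform configuration with its own state (or outside Terraform entirely), and a scheduled job exports into it. A sketch with hypothetical names and ARNs:

```hcl
# In a SEPARATE configuration, never in the same stack as the database
resource "aws_s3_bucket" "db_backups" {
  bucket = "prod-db-backups"   # placeholder name

  lifecycle {
    prevent_destroy = true
  }
}
```

```bash
# Export a snapshot's contents into that bucket; runs outside the database stack
aws rds start-export-task \
  --export-task-identifier export-2025-01-01 \
  --source-arn arn:aws:rds:us-east-1:123456789012:snapshot:prod-db-snap \
  --s3-bucket-name prod-db-backups \
  --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export \
  --kms-key-id alias/rds-export
```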
Strategy: Create Manual Snapshots Outside Terraform’s Lifecycle
For GCP Cloud SQL, you can schedule on-demand backups, or better, logical exports to a bucket that Terraform never touches.
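A sketch using the gcloud CLI (instance and bucket names are placeholders). Note that on-demand backups are stored with the instance, so the logical export to an independently managed bucket is the variant that survives even instance deletion:

```bash
# On-demand backup, kept until you delete it (but stored with the instance)
gcloud sql backups create --instance=prod-db --description="manual-$(date +%F)"

# Logical export to a bucket Terraform never touches; survives instance deletion
gcloud sql export sql prod-db \
  "gs://prod-db-backups/dumps/prod-db-$(date +%F).sql.gz" \
  --database=analytics
```

Run either from cron or Cloud Scheduler; the point is that the schedule, like the destination, lives outside the Terraform configuration.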
> [!WARNING]
> Never put your backup bucket in the same Terraform configuration as your database. If you run `terraform destroy`, both will be gone. Treat the backup bucket as an immutable, separately-managed resource.
🛡️ Safeguard 4: Secret Management — Stop Putting Credentials in .tfvars
This is the security issue that data engineers hit immediately when following basic Terraform tutorials. The standard advice is to create `terraform.tfvars` from the example file. The problem is what goes inside it.
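The anti-pattern looks something like this (values invented for illustration):

```hcl
# terraform.tfvars: one `git add .` away from permanent exposure
db_password         = "hunter2"
aws_secret_key      = "wJalrXUtnFEMI...EXAMPLEKEY"
gcp_credentials_key = "./keys/sa-key.json"
```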
If `terraform.tfvars` ends up in a git repository, even a private one, your secrets have a permanent home in the git history. A single accidental `git push` is sufficient.
The Correct Pattern: Environment Variables + Secret Manager
Option A: Environment Variables (Simple, Effective)
Terraform automatically reads environment variables prefixed with `TF_VAR_`.
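Declare the variable as sensitive, then supply the value from your shell or CI secret store; nothing lands in a file in the repo (the value below is a placeholder):

```hcl
variable "db_password" {
  type      = string
  sensitive = true  # redacts the value in plan/apply output
}
```

```bash
# TF_VAR_db_password maps onto var.db_password automatically
export TF_VAR_db_password="s3cr3t-from-your-password-manager"  # placeholder
terraform plan
```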
Option B: Pull Secrets from GCP Secret Manager or AWS Secrets Manager at Runtime
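With the secret stored in the provider’s secret manager, a data source reads it at plan time (secret names here are hypothetical):

```hcl
# GCP Secret Manager
data "google_secret_manager_secret_version" "db_password" {
  secret = "prod-db-password"
}

# AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db-password"
}

# Usage in a resource, e.g.:
#   password = data.google_secret_manager_secret_version.db_password.secret_data
#   password = data.aws_secretsmanager_secret_version.db_password.secret_string
```

One caveat: the resolved value still ends up in the state file, which is one more reason the remote state bucket from Safeguard 1 must be encrypted and access-controlled.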
Your `.gitignore` should always contain at least the following (the standard Terraform ignore set):
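```gitignore
# variable files may hold secrets
*.tfvars
# state and its backups hold secrets in plaintext
*.tfstate
*.tfstate.*
# local provider binaries and module cache
.terraform/
crash.log
```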
🛡️ Safeguard 5: Human Confirmation for Destructive Operations
The final safeguard is behavioral, not technical.
The incident happened because an AI agent was permitted to run `terraform destroy` without a human reviewing what it was about to destroy. The plan output was not inspected. The human was not in the loop for the final action.
The Protocol: Always Review `terraform plan` Before `terraform apply`
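The safest form is the saved-plan workflow: write the plan to a file, review exactly that file, and apply exactly that file, so nothing can change between review and execution:

```bash
terraform plan -out=tfplan    # write the plan to a file
terraform show tfplan         # human-readable output; read all of it
terraform show tfplan | grep -E "will be destroyed|must be replaced"  # quick scan
terraform apply tfplan        # applies exactly what was reviewed, with no re-plan
```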
If any production database, VPC, or ECS cluster appears in the list of resources to be deleted — stop. The plan is wrong.
For CI/CD Pipelines: Require Manual Approval on Destroy
If you are using GitHub Actions, Terraform Cloud, or any CI/CD system to automate your infrastructure, build the approval gate into the pipeline itself.
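One common pattern, sketched here for GitHub Actions (job names and action versions are illustrative): gate the apply job behind a protected environment with required reviewers, so a human must approve before anything mutates infrastructure:

```yaml
jobs:
  terraform-apply:
    runs-on: ubuntu-latest
    # "production" must be configured as an environment with required
    # reviewers under the repository settings
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init
      - run: terraform plan -out=tfplan
      - run: terraform apply tfplan
```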
For `terraform destroy`, go further: require it to be triggered manually, never automatically.
For AI Agent Workflows
If you are using AI coding assistants (Cursor, Claude Code, etc.) to help manage infrastructure:
- Never permit the agent to run `terraform apply` or `terraform destroy` autonomously
- Always run `terraform plan` first and review the output yourself before instructing the agent to proceed
- Treat the agent as a code author, not an infrastructure operator
✅ The Pre-Op Checklist
Before any Terraform operation in a production environment, run through this:
| Check | Status |
|---|---|
| State is stored remotely (S3/GCS), not locally | ☐ |
| Each project has its own state file and workspace | ☐ |
| Database has `deletion_protection = true` | ☐ |
| Deletion protection is also set at the cloud console level | ☐ |
| Backups exist in a separate bucket/config outside Terraform’s scope | ☐ |
| No credentials are stored in `.tfvars` files | ☐ |
| `terraform plan` has been reviewed before `terraform apply` | ☐ |
| Destructive operations require human confirmation | ☐ |
This checklist does not guarantee zero incidents. But it guarantees that any incident will require multiple independent systems to fail simultaneously—which raises the bar dramatically.
🧬 Closing Prescription
The tools are not the enemy. Terraform is extraordinarily powerful precisely because it can provision, modify, and destroy infrastructure with a single command. That power requires discipline in how you wield it.
The five safeguards above are not bureaucratic overhead. They are the difference between a late-night recovery call to AWS Support and a normal working day.
Implement them before your first production deploy. Not after.
← Previous: The State File Incident: One Missing File, One Destroyed Database