YC Medical

Terraform Hardening Protocol: 5 Safeguards That Could Have Saved Everything

Important

PRE-OPERATIVE CHECKLIST INITIATED: Before any infrastructure surgery, verify all safeguards are active. A scalpel in the wrong state is a weapon.

In the previous post, we dissected how a missing state file and an AI agent with too much authority combined to destroy two and a half years of production data in a single terraform destroy command.

Now we install the safeguards.

These are the five defenses Alexey implemented after his incident—adapted with concrete implementation details for both AWS and GCP environments, which is the stack most Data Engineering Zoomcamp graduates will be working with.


🛡️ Safeguard 1: Remote State — Stop Storing .tfstate on Your Laptop

This is the root cause of the entire incident. State lived locally. When the machine changed, Terraform lost its memory of production.

Local state is a single point of failure. It ties your entire infrastructure’s knowledge to one physical device. The fix is to store state in a shared, persistent, versioned object store.

For GCP (Cloud Storage)

# backend.tf
terraform {
  backend "gcs" {
    bucket  = "my-project-terraform-state"
    prefix  = "production/state"
  }
}
# Create the bucket with versioning enabled
gsutil mb -l asia-southeast1 gs://my-project-terraform-state
gsutil versioning set on gs://my-project-terraform-state

For AWS (S3 + DynamoDB)

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-project-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "ap-southeast-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Note

The dynamodb_table provides state locking—it prevents two engineers (or two CI/CD jobs) from running terraform apply simultaneously and corrupting the state file. This is non-negotiable for team environments.
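The GCS example above creates its bucket with `gsutil`; the S3 backend likewise needs its bucket and lock table to exist before `terraform init`. A one-time setup sketch (names must match `backend.tf`; Terraform's S3 backend requires the lock table's partition key to be a string named exactly `LockID`):

```shell
# One-time setup: versioned state bucket + DynamoDB lock table
aws s3api create-bucket \
  --bucket my-project-terraform-state \
  --region ap-southeast-2 \
  --create-bucket-configuration LocationConstraint=ap-southeast-2

aws s3api put-bucket-versioning \
  --bucket my-project-terraform-state \
  --versioning-configuration Status=Enabled

# The lock table: string partition key named "LockID"
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
```

Pay-per-request billing keeps the lock table effectively free: it only ever holds a single small item during each Terraform operation.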

# Migrate from local state to remote
terraform init -migrate-state

The Rule: Never let a .tfstate file live inside a project repository or only on a developer’s machine. Adding it to .gitignore is correct — but gitignoring local state means the file exists in exactly one place, and anything that exists in exactly one place can be lost.


🛡️ Safeguard 2: Deletion Protection — Make Destruction Require Deliberate Effort

The production database was deleted with a single command because nothing stopped it. The solution is to install friction at two independent levels.

Level 1: In Your Terraform Configuration

# RDS (AWS)
resource "aws_db_instance" "production" {
  identifier        = "production-postgres"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20

  # 🔒 This means Terraform cannot delete this resource
  #    without you first setting this to false and applying
  deletion_protection = true

  # 🔒 skip_final_snapshot defaults to false — leave it that way so
  #    RDS takes one last snapshot before any deletion goes through
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-postgres-final-snapshot"
}
# Cloud SQL (GCP)
resource "google_sql_database_instance" "production" {
  name             = "production-postgres"
  database_version = "POSTGRES_15"
  region           = "asia-southeast1"

  settings {
    tier = "db-f1-micro"

    # 🔒 Protection enforced by GCP itself — blocks deletion
    #    from the Console and gcloud as well
    deletion_protection_enabled = true
  }

  # 🔒 Terraform-level guard: destroy/apply refuses to delete
  #    this resource while the flag is true
  deletion_protection = true
}

Level 2: Directly in the Cloud Console

Terraform’s deletion_protection can still be overridden if someone sets it to false and runs apply first. The second layer is to enable protection at the cloud provider level independently—through the AWS console or via gcloud CLI—so that even if Terraform is pointed at the wrong state, the cloud provider itself will refuse to delete the database.

# GCP: Enable deletion protection via CLI
gcloud sql instances patch production-postgres \
  --deletion-protection

# AWS: Modify deletion protection via CLI
aws rds modify-db-instance \
  --db-instance-identifier production-postgres \
  --deletion-protection \
  --apply-immediately

Two independent layers means a misconfigured state file alone is no longer sufficient to trigger total infrastructure loss.
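Trust, but verify: after applying, confirm the flag is actually set on the live instance rather than assuming the .tf files tell the truth. A quick check, using the instance names from the examples above:

```shell
# AWS: query the live DeletionProtection flag (expect "True")
aws rds describe-db-instances \
  --db-instance-identifier production-postgres \
  --query 'DBInstances[0].DeletionProtection' \
  --output text

# GCP: query the live protection flag (expect "True")
gcloud sql instances describe production-postgres \
  --format='value(settings.deletionProtectionEnabled)'
```

Running this as a periodic check in CI catches the case where someone quietly disabled protection "just for a minute" and forgot to turn it back on.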


🛡️ Safeguard 3: Backups Outside of Terraform’s Lifecycle

This was the most costly lesson. Automated RDS snapshots were managed inside the same Terraform configuration as the database itself. When terraform destroy ran, it deleted both the database and the snapshots simultaneously.

The core principle: backups must not share a delete path with the resources they protect.

Strategy: Export to an S3 Bucket with Independent Management

# This S3 bucket is managed by a SEPARATE Terraform configuration
# in a separate state file, with a separate set of credentials
resource "aws_s3_bucket" "database_backups" {
  bucket = "my-project-db-backups-offsite"

  # Object Lock must be enabled when the bucket is created —
  # it cannot simply be patched onto an existing bucket here
  object_lock_enabled = true
}

resource "aws_s3_bucket_versioning" "database_backups" {
  bucket = aws_s3_bucket.database_backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Object Lock prevents deletion for a defined retention period
resource "aws_s3_bucket_object_lock_configuration" "database_backups" {
  bucket = aws_s3_bucket.database_backups.id

  rule {
    default_retention {
      mode = "GOVERNANCE"
      days = 30
    }
  }
}

Strategy: Create Manual Backups on a Schedule

For GCP Cloud SQL, you can create manual snapshots on a schedule that are independent of your Terraform configuration:

# Create an on-demand Cloud SQL backup (not managed by Terraform)
gcloud sql backups create \
  --instance=production-postgres \
  --description="manual-$(date +%Y%m%d)"

# Export directly to a Cloud Storage bucket
gcloud sql export sql production-postgres \
  gs://my-project-db-backups-offsite/$(date +%Y%m%d)-backup.sql.gz \
  --database=your_database_name
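To make these exports actually run on a schedule, one simple option is a cron entry on a small ops VM. The wrapper script path below is hypothetical — it stands in for the two gcloud commands above:

```shell
# /usr/local/bin/db-backup.sh wraps the gcloud backup + export commands.
# Crontab entry: run the export every night at 02:00, logging output.
0 2 * * * /usr/local/bin/db-backup.sh >> /var/log/db-backup.log 2>&1
```

The key property is that nothing in this path touches Terraform: the schedule, the script, and the destination bucket all survive a `terraform destroy`.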

Warning

Never put your backup bucket in the same Terraform configuration as your database. If you run terraform destroy, both will be gone. Treat the backup bucket as an immutable, separately-managed resource.


🛡️ Safeguard 4: Secret Management — Stop Putting Credentials in .tfvars

This is the security issue that data engineers hit immediately when following basic Terraform tutorials. The standard advice is to create terraform.tfvars from the example file. The problem is what goes inside it.

# What every tutorial tells you to do:
cp terraform.tfvars.example terraform.tfvars
# Now edit terraform.tfvars with your real credentials...
# terraform.tfvars — what it looks like in practice
project_id      = "my-real-gcp-project"
db_password     = "super-secret-database-password"
api_key         = "sk-prod-abc123xyz..."

If terraform.tfvars ends up in a git repository—even a private one—your secrets have a permanent home in the git history. A single accidental git push is sufficient.
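A quick way to audit whether this has already happened to you is a git history search. If either command below prints anything, rotate those credentials — deleting the file in a later commit does not remove it from history:

```shell
# List every commit that ever touched terraform.tfvars, on any branch
git log --all --oneline -- terraform.tfvars

# Grep the full historical diffs of that file for likely secrets
git log --all -p -- terraform.tfvars | grep -iE 'password|secret|api_key' || true
```

The trailing `|| true` keeps the command's exit status clean when grep finds nothing, which matters if you run this inside a CI script with `set -e`.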

The Correct Pattern: Environment Variables + Secret Manager

Option A: Environment Variables (Simple, Effective)

Terraform automatically reads environment variables prefixed with TF_VAR_:

# In your shell profile or CI/CD secret manager:
export TF_VAR_db_password="super-secret-database-password"
export TF_VAR_project_id="my-real-gcp-project"

# terraform.tfvars is now only for non-sensitive config
# variables.tf declares types but not values
# variables.tf
variable "db_password" {
  description = "The master password for the production database"
  type        = string
  sensitive   = true  # Prevents the value appearing in Terraform output
}

Option B: Pull Secrets from GCP Secret Manager or AWS Secrets Manager at Runtime

# Fetch the secret from GCP Secret Manager during terraform apply
data "google_secret_manager_secret_version" "db_password" {
  secret = "production-db-password"
}

# Cloud SQL has no password argument on the instance itself —
# credentials belong to a database user resource
resource "google_sql_user" "app" {
  name     = "app_user"
  instance = google_sql_database_instance.production.name

  # Use the fetched secret value — never hardcoded
  password = data.google_secret_manager_secret_version.db_password.secret_data
}

Your .gitignore should always contain:

# .gitignore
*.tfvars
*.tfstate
*.tfstate.backup
.terraform/

🛡️ Safeguard 5: Human Confirmation for Destructive Operations

The final safeguard is behavioral, not technical.

The incident happened because an AI agent was permitted to run terraform destroy without a human reviewing what it was about to destroy. The plan output was not inspected. The human was not in the loop for the final action.

The Protocol: Always Review terraform plan Before terraform apply

# Generate a plan file — this is a binary representation of the proposed changes
terraform plan -out=tfplan.binary

# Convert to human-readable JSON for review
terraform show -json tfplan.binary | jq '.' > tfplan.json

# Inspect what will be DESTROYED before proceeding
cat tfplan.json | jq '.resource_changes[] | select(.change.actions[] == "delete") | .address'

If any production database, VPC, or ECS cluster appears in the list of resources to be deleted — stop. The plan is wrong.

For CI/CD Pipelines: Require Manual Approval on Destroy

If you are using GitHub Actions, Terraform Cloud, or any CI/CD system to automate your infrastructure:

# .github/workflows/terraform.yml (simplified)
jobs:
  terraform-plan:
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Plan
        run: terraform plan -out=tfplan

  terraform-apply:
    runs-on: ubuntu-latest
    needs: [terraform-plan]
    # 🔒 Require a human to approve before apply runs
    environment: production
    steps:
      - name: Terraform Apply
        run: terraform apply tfplan

For terraform destroy, go further—require it to be triggered manually, never automatically.
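One low-tech way to enforce that locally is a wrapper that demands the operator type the workspace name before destroy runs — a sketch, assuming a standard terraform CLI setup (the script name is hypothetical):

```shell
#!/usr/bin/env bash
# safe-destroy.sh — refuse to destroy unless the operator types
# the current workspace name to confirm.
set -euo pipefail

workspace=$(terraform workspace show)
read -r -p "Type the workspace name ('${workspace}') to confirm destroy: " confirm

if [ "$confirm" != "$workspace" ]; then
  echo "Confirmation failed — aborting."
  exit 1
fi

# Applying a saved plan does not prompt again, so the typed
# confirmation above is the human gate for this destroy.
terraform plan -destroy -out=destroy.tfplan
terraform apply destroy.tfplan
```

Typing `production` is a very different mental act from pressing Enter on an auto-suggested command, and that friction is exactly the point.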

For AI Agent Workflows

If you are using AI coding assistants (Cursor, Claude Code, etc.) to help manage infrastructure:

  • Never permit the agent to run terraform apply or terraform destroy autonomously
  • Always run terraform plan first and review the output yourself before instructing the agent to proceed
  • Treat the agent as a code author, not an infrastructure operator

✅ The Pre-Op Checklist

Before any Terraform operation in a production environment, run through this:

  • [ ] State is stored remotely (S3/GCS), not locally
  • [ ] Each project has its own state file and workspace
  • [ ] Database has deletion_protection = true
  • [ ] Deletion protection is also set at the cloud console level
  • [ ] Backups exist in a separate bucket/config outside Terraform’s scope
  • [ ] No credentials are stored in .tfvars files
  • [ ] terraform plan has been reviewed before terraform apply
  • [ ] Destructive operations require human confirmation

This checklist does not guarantee zero incidents. But it guarantees that any incident will require multiple independent systems to fail simultaneously—which raises the bar dramatically.


🧬 Closing Prescription

The tools are not the enemy. Terraform is extraordinarily powerful precisely because it can provision, modify, and destroy infrastructure with a single command. That power requires discipline in how you wield it.

The five safeguards above are not bureaucratic overhead. They are the difference between a late-night recovery call to AWS Support and a normal working day.

Implement them before your first production deploy. Not after.


← Previous: The State File Incident: One Missing File, One Destroyed Database