
Infrastructure-as-Code Security & Change Management: Terraform Best Practices 2025

Implement secure IaC workflows with Terraform following 2025 best practices. This comprehensive guide covers pre-commit validation, security scanning with tfsec/Checkov, policy-as-code enforcement, automated testing, drift detection, and cost optimization.

By InventiveHQ DevOps Team · January 7, 2025


Introduction

Infrastructure-as-Code (IaC) has revolutionized how we provision and manage cloud resources. Instead of clicking through console interfaces, teams declare their entire infrastructure in version-controlled code—enabling reproducibility, collaboration, and automation. But this power comes with responsibility: a misconfigured Terraform file can expose databases, leak credentials, or rack up thousands in cloud costs before anyone notices.

According to the 2025 State of Cloud Security, over 60% of cloud security incidents originate from misconfigured infrastructure, and the median time to detect these issues is 11 days. That's 11 days of potential exposure, compliance violations, and runaway costs.

This guide presents a complete 7-stage IaC security workflow used by DevOps teams, platform engineers, and SREs to catch problems before they reach production. We'll cover:

  1. Pre-Commit Validation - Catch syntax errors and basic security issues at the developer's workstation
  2. Security Scanning & Linting - Deep static analysis for vulnerabilities and compliance violations
  3. Policy-as-Code Enforcement - Automated guardrails using Sentinel and Open Policy Agent (OPA)
  4. Plan Review & Cost Analysis - Human review with cost impact visibility via Infracost
  5. Automated Testing - Integration tests and compliance validation
  6. Controlled Deployment - Safe rollout with approval gates and blast radius controls
  7. Post-Deployment Monitoring - Drift detection, compliance tracking, and cost optimization

Why This Workflow Matters

Infrastructure drift—when your actual cloud resources deviate from your IaC definitions—occurs in over 80% of organizations according to HashiCorp research. Manual console changes ("ClickOps"), emergency hotfixes, and overlapping automation all contribute to drift, creating security vulnerabilities and compliance gaps.

As noted in Spacelift's Terraform security guide, the stakes are high:

  • Security risks: Misconfigured S3 buckets, overly permissive security groups, unencrypted databases
  • Compliance violations: HIPAA, PCI-DSS, SOC 2, and GDPR requirements violated by IaC errors
  • Cost overruns: Oversized instances, unused resources, inefficient architectures
  • Operational failures: Configuration drift leading to unpredictable behavior and outages

This workflow treats infrastructure code with the same rigor as application code—combining automated security scanning, policy enforcement, and continuous monitoring to create a defense-in-depth strategy.


Stage 1: Pre-Commit Validation (5-10 minutes)

The first line of defense happens before code even reaches version control. Pre-commit validation catches syntax errors, formatting issues, and obvious security problems at the developer's workstation—providing instant feedback and preventing broken code from entering the pipeline.

Step 1.1: Terraform Native Validation

Why: Terraform's built-in validation catches syntax errors, invalid resource configurations, and type mismatches before any external tools run.

Commands:

# Initialize modules and providers (required before validation)
terraform init -backend=false

# Format code to HashiCorp standards (2-space indentation)
terraform fmt -recursive

# Validate syntax and configuration
terraform validate

Common Issues Detected:

  • Invalid HCL syntax (missing braces, commas, quotes)
  • Undeclared variables or outputs
  • Invalid resource attribute references
  • Type mismatches (string vs. number vs. list)
  • Deprecated resource syntax

Example Error:

$ terraform validate

Error: Unsupported argument

  on main.tf line 12, in resource "aws_instance" "web":
  12:   ami_id = var.ami_id

An argument named "ami_id" is not expected here. Did you mean "ami"?

Best Practice: Run terraform fmt before committing to ensure consistent code style across your team. Many teams enforce this with pre-commit hooks.

Step 1.2: Pre-Commit Hook Setup

Why: Automated pre-commit hooks prevent developers from committing code that fails basic validation, saving CI/CD pipeline time and reducing feedback cycles.

Installation:

# Install pre-commit framework
pip install pre-commit

# Create .pre-commit-config.yaml in your repository
cat > .pre-commit-config.yaml <<EOF
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.94.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
        args:
          - --args=--minimum-severity=HIGH
EOF

# Install hooks
pre-commit install

Hook Execution Flow:

Developer commits code
    ↓
terraform_fmt → Auto-format code
    ↓
terraform_validate → Check syntax
    ↓
terraform_tflint → Linting rules
    ↓
terraform_tfsec → Security scan
    ↓
Commit succeeds or fails

Example Output:

$ git commit -m "Add RDS database"

terraform_fmt...................................Passed
terraform_validate..............................Passed
terraform_tflint................................Passed
terraform_tfsec.................................Failed
  - Hook failed with exit code 1

  HIGH: Resource 'aws_db_instance.main' has encryption disabled
    Line: 45
    Severity: HIGH

[Commit blocked - fix security issues before committing]

Step 1.3: TFLint - Terraform Linting

Why: TFLint enforces best practices, detects deprecated syntax, and validates provider-specific configurations that terraform validate misses.

Installation & Configuration:

# Install TFLint
brew install tflint  # macOS
# or
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash

# Create .tflint.hcl configuration
cat > .tflint.hcl <<EOF
plugin "aws" {
  enabled = true
  version = "0.31.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_deprecated_syntax" {
  enabled = true
}

rule "terraform_unused_declarations" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
}
EOF

# Run TFLint
tflint --init
tflint

Example Rules Enforced:

  • AWS instance type validation (detecting non-existent instance types)
  • Security group naming conventions
  • Unused variables and outputs
  • Deprecated resource syntax
  • Invalid AMI IDs or availability zones

Example Violation:

resource "aws_instance" "web" {
  instance_type = "t2.mega"  # ❌ Invalid instance type
  ami           = "ami-12345"
}

TFLint Output:

Error: "t2.mega" is an invalid value as instance_type (aws_instance_invalid_type)

  on main.tf line 3:
   3:   instance_type = "t2.mega"

Valid instance types: t2.micro, t2.small, t2.medium, t2.large...

Step 1.4: Secrets Detection

Why: Hardcoded secrets in IaC files are a critical security vulnerability. According to GitGuardian's 2024 report, over 10 million secrets are leaked to public GitHub repositories annually.

Tool: git-secrets

# Install git-secrets
brew install git-secrets  # macOS

# Initialize in repository
git secrets --install
git secrets --register-aws

# Scan current files
git secrets --scan

# Add custom patterns
git secrets --add '[A-Za-z0-9/+=]{40}'  # AWS Secret Access Key pattern

Patterns to Detect:

  • AWS Access Keys: AKIA[0-9A-Z]{16}
  • AWS Secret Keys: [A-Za-z0-9/+=]{40}
  • API Keys: High-entropy strings
  • Private Keys: -----BEGIN PRIVATE KEY-----
  • Database passwords in plaintext
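The listed patterns are ordinary regular expressions, so they are easy to experiment with directly. The following is a minimal illustrative sketch (not the git-secrets implementation; function and pattern names are ours):

```python
import re

# Hypothetical patterns mirroring the list above. Real scanners combine
# broader rule sets with entropy analysis to catch generic API keys.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    # Exactly 40 base64-ish chars, not embedded in a longer run
    "aws_secret_key": re.compile(
        r"(?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=])"
    ),
    "private_key": re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"),
}

def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in the text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits

if __name__ == "__main__":
    sample = 'aws_access_key = "AKIAIOSFODNN7EXAMPLE"'
    print(find_secrets(sample))  # flags the access key
```

Note the negative lookbehind/lookahead on the secret-key pattern: without them, any 40-character slice of a longer base64 blob would trigger a false positive.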

Example Detection:

# ❌ WRONG - Hardcoded credentials
resource "aws_db_instance" "main" {
  username = "admin"
  password = "SuperSecret123!"  # ⚠️ git-secrets blocks this
}

# ✅ CORRECT - Use secrets manager
resource "aws_db_instance" "main" {
  username = var.db_username
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

Alternative: HashiCorp Vault Integration

provider "vault" {
  address = "https://vault.example.com"
}

data "vault_generic_secret" "db_credentials" {
  path = "secret/database/credentials"
}

resource "aws_db_instance" "main" {
  username = data.vault_generic_secret.db_credentials.data["username"]
  password = data.vault_generic_secret.db_credentials.data["password"]
}

Step 1.5: Module Validation

Why: Terraform modules must be validated independently to catch issues before they're consumed by root configurations.

Best Practices:

# Navigate to module directory
cd modules/vpc

# Validate module
terraform init -backend=false
terraform validate

# Test with example values
terraform plan -var-file=examples/example.tfvars

Module Input Validation:

variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr must be a valid CIDR block."
  }
}

variable "environment" {
  type        = string
  description = "Environment name"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}
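As a cross-check on the two validation blocks above, here is the same logic expressed in Python with the standard ipaddress module (a sketch; the function names are purely illustrative):

```python
import ipaddress

def valid_vpc_cidr(cidr: str) -> bool:
    """Mirror of can(cidrhost(var.vpc_cidr, 0)): true only for a parseable CIDR block."""
    try:
        ipaddress.ip_network(cidr, strict=False)
        return True
    except ValueError:
        return False

def valid_environment(env: str) -> bool:
    """Mirror of contains(["dev", "staging", "prod"], var.environment)."""
    return env in ("dev", "staging", "prod")
```

Just as Terraform rejects the plan with the validation block's error_message, these checks reject "10.0.0.0/33" (invalid prefix length) and "qa" (not in the whitelist) before any resources are touched.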

Stage 1 Output Example:

After 10 minutes of pre-commit validation, you should have:

  • All code formatted to HashiCorp standards (terraform fmt)
  • Syntax validated (terraform validate)
  • Linting rules passed (TFLint)
  • No secrets detected (git-secrets)
  • Pre-commit hooks configured and passing
  • Module validations successful

Time Investment: 5-10 minutes
Next Step: Code passes pre-commit checks and is pushed to version control, triggering Stage 2 (CI/CD security scanning).


Stage 2: Security Scanning & Linting (15-30 minutes)

Once code reaches the CI/CD pipeline, automated security scanning tools perform deep static analysis to detect vulnerabilities, compliance violations, and misconfigurations. This stage integrates multiple scanning tools to create comprehensive coverage.

Step 2.1: tfsec - Terraform Security Scanner

Why: tfsec (now part of Trivy) performs static analysis specifically tailored for Terraform, detecting security issues like publicly exposed resources, unencrypted storage, and overly permissive IAM policies.

Installation & Usage:

# Install tfsec (legacy standalone version)
brew install tfsec  # macOS

# Or use Trivy (recommended - includes tfsec checks)
brew install trivy
trivy config .

# Run tfsec scan
tfsec . --format json --out tfsec-results.json

# Run with minimum severity threshold
tfsec . --minimum-severity HIGH

# Exclude specific checks
tfsec . --exclude aws-s3-enable-versioning

Example Configuration File (.tfsec/config.yml):

severity_overrides:
  aws-s3-enable-bucket-logging: WARNING
  aws-ec2-enforce-http-token-imds: HIGH

exclude:
  - aws-s3-enable-versioning  # Versioning not required for ephemeral buckets

Common Vulnerabilities Detected:

| Check ID | Description | Severity | Example |
|---|---|---|---|
| aws-s3-block-public-acls | S3 bucket allows public ACLs | HIGH | Publicly readable S3 bucket |
| aws-ec2-no-public-ip | EC2 instance has public IP in private subnet | MEDIUM | Unintended internet exposure |
| aws-rds-encrypt-instance-storage | RDS instance encryption disabled | HIGH | Unencrypted database |
| aws-ec2-enforce-http-token-imds | IMDSv1 enabled (SSRF vulnerable) | HIGH | Metadata service v1 |
| aws-iam-no-policy-wildcards | IAM policy uses wildcard actions | CRITICAL | Overly permissive IAM |

Example Detection:

# ❌ VULNERABLE CODE
resource "aws_s3_bucket" "data" {
  bucket = "company-data-bucket"

  # tfsec: aws-s3-enable-bucket-encryption CRITICAL
  # Missing encryption configuration
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # tfsec: aws-ec2-no-public-ingress-ssh HIGH
  }
}

tfsec Output:

Result #1 HIGH: Resource 'aws_s3_bucket.data' does not have encryption enabled
────────────────────────────────────────────────────────────────────────────────
  main.tf:3-5
────────────────────────────────────────────────────────────────────────────────
   3 │ resource "aws_s3_bucket" "data" {
   4 │   bucket = "company-data-bucket"
   5 │ }
────────────────────────────────────────────────────────────────────────────────
          ID: aws-s3-enable-bucket-encryption
      Impact: Data stored in S3 is not encrypted at rest
  Resolution: Enable server-side encryption with KMS

Result #2 HIGH: Security group allows SSH from 0.0.0.0/0
────────────────────────────────────────────────────────────────────────────────
  main.tf:11-14
────────────────────────────────────────────────────────────────────────────────
  11 │     from_port   = 22
  12 │     to_port     = 22
  13 │     protocol    = "tcp"
  14 │     cidr_blocks = ["0.0.0.0/0"]
────────────────────────────────────────────────────────────────────────────────
          ID: aws-ec2-no-public-ingress-ssh
      Impact: SSH access from the internet increases attack surface
  Resolution: Restrict SSH access to specific IP ranges or VPN

2 potential problems detected.

Remediation:

# ✅ SECURE CODE
resource "aws_s3_bucket" "data" {
  bucket = "company-data-bucket"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
  }
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]  # Internal network only
  }
}

Step 2.2: Checkov - Multi-Cloud Policy Scanner

Why: Checkov supports 750+ built-in policies covering AWS, Azure, GCP, Kubernetes, and compliance frameworks (CIS, PCI-DSS, HIPAA, GDPR, SOC 2).

Installation & Usage:

# Install Checkov
pip install checkov

# Run full scan
checkov -d . --framework terraform

# Run only checks matching a pattern (AWS rules here)
checkov -d . --framework terraform --check "CKV_AWS_*"

# Output formats
checkov -d . --output json --output-file checkov-results.json
checkov -d . --output junitxml --output-file checkov-junit.xml  # CI integration
checkov -d . --output sarif --output-file checkov-sarif.json    # GitHub Security

# Skip specific checks
checkov -d . --skip-check CKV_AWS_20  # Skip S3 public read check

Compliance Framework Scanning:

# Limit output to failed AWS checks in compact form
checkov -d . --framework terraform --check "CKV_AWS_*" --compact

# Add custom policies (e.g., CIS AWS Foundations mappings) alongside built-in checks
checkov -d . --framework terraform --external-checks-dir ./custom-policies

Example Checkov Policy:

# custom-policies/check_mandatory_tags.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

class MandatoryTagsCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure all resources have required tags"
        id = "CKV_AWS_CUSTOM_1"
        supported_resources = ['aws_instance', 'aws_s3_bucket', 'aws_rds_instance']
        categories = [CheckCategories.CONVENTION]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        required_tags = ['Environment', 'Owner', 'CostCenter']
        tags = conf.get('tags', [{}])[0]

        missing_tags = [tag for tag in required_tags if tag not in tags]

        if missing_tags:
            self.details = f"Missing required tags: {', '.join(missing_tags)}"
            return CheckResult.FAILED

        return CheckResult.PASSED

check = MandatoryTagsCheck()

Checkov Output:

Passed checks: 127, Failed checks: 8, Skipped checks: 3

Check: CKV_AWS_19: "Ensure all data stored in S3 is encrypted"
	FAILED for resource: aws_s3_bucket.logs
	File: /main.tf:45-50
	Guide: https://docs.bridgecrew.io/docs/s3_14-data-encrypted-at-rest

		45 | resource "aws_s3_bucket" "logs" {
		46 |   bucket = "application-logs"
		47 |   acl    = "private"
		48 | }

Check: CKV_AWS_CUSTOM_1: "Ensure all resources have required tags"
	FAILED for resource: aws_instance.web
	File: /main.tf:12-20
	Details: Missing required tags: Owner, CostCenter

		12 | resource "aws_instance" "web" {
		13 |   ami           = var.ami_id
		14 |   instance_type = "t3.medium"
		15 |
		16 |   tags = {
		17 |     Environment = "production"
		18 |   }
		19 | }

Step 2.3: Terrascan - Policy-as-Code with Rego

Why: Terrascan uses OPA/Rego policies for customizable security and compliance checks, enabling organizations to encode their specific governance requirements.

Installation & Usage:

# Install Terrascan
brew install terrascan  # macOS

# Scan with built-in policies
terrascan scan -t terraform

# Scan specific policy types
terrascan scan -t terraform -p aws

# Use custom policies
terrascan scan -t terraform --policy-path ./custom-policies

# Output formats
terrascan scan -t terraform -o json
terrascan scan -t terraform -o sarif > terrascan-sarif.json

Custom Rego Policy Example:

# custom-policies/require_vpc_flow_logs.rego
package accurics

{{.prefix}}{{.name}}[api.id] {
  api := input.aws_vpc[_]
  not has_flow_logs(api)
}

has_flow_logs(resource) {
  flow_logs := input.aws_flow_log[_]
  flow_logs.config.vpc_id == resource.id
}

{{.prefix}}{{.name}}[api.id] {
  api := input.aws_vpc[_]
  flow_logs := input.aws_flow_log[_]
  flow_logs.config.vpc_id == api.id
  flow_logs.config.traffic_type != "ALL"
}

Step 2.4: CI/CD Pipeline Integration

GitHub Actions Example:

# .github/workflows/terraform-security.yml
name: Terraform Security Scan

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.tfvars'

jobs:
  security-scan:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Validate
        run: terraform validate

      - name: Run tfsec
        uses: aquasecurity/tfsec-sarif-action@v0.1.4
        with:
          sarif_file: tfsec.sarif

      - name: Upload tfsec SARIF
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: tfsec.sarif

      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          framework: terraform
          output_format: sarif
          output_file_path: checkov.sarif

      - name: Upload Checkov SARIF
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: checkov.sarif

      - name: Security Scan Summary
        if: always()
        run: |
          echo "## Security Scan Results" >> $GITHUB_STEP_SUMMARY
          echo "✅ tfsec scan completed" >> $GITHUB_STEP_SUMMARY
          echo "✅ Checkov scan completed" >> $GITHUB_STEP_SUMMARY
          echo "View detailed results in Security tab" >> $GITHUB_STEP_SUMMARY

GitLab CI Example:

# .gitlab-ci.yml
stages:
  - validate
  - security-scan

terraform-validate:
  stage: validate
  image: hashicorp/terraform:1.7
  script:
    - terraform fmt -check -recursive
    - terraform init -backend=false
    - terraform validate

tfsec-scan:
  stage: security-scan
  image: aquasec/tfsec:latest
  script:
    - tfsec . --format json --out tfsec-results.json
  artifacts:
    reports:
      sast: tfsec-results.json
    when: always

checkov-scan:
  stage: security-scan
  image: bridgecrew/checkov:latest
  script:
    - checkov -d . --framework terraform --output junitxml --output-file checkov-report.xml
  artifacts:
    reports:
      junit: checkov-report.xml
    when: always

Step 2.5: Trivy - Unified Security Scanner

Why: Trivy consolidates tfsec functionality with additional capabilities for container images, Kubernetes manifests, and dependency scanning.

Usage:

# Install Trivy
brew install trivy  # macOS

# Scan Terraform configurations
trivy config .

# Scan with specific severity
trivy config --severity HIGH,CRITICAL .

# Output formats
trivy config --format json --output trivy-results.json .
trivy config --format sarif --output trivy-sarif.json .

Example Output:

main.tf (terraform)
═══════════════════════════════════════════════════════════════════════════════

Tests: 42 (SUCCESSES: 34, FAILURES: 8)
Failures: 8 (HIGH: 5, MEDIUM: 3, LOW: 0)

HIGH: S3 bucket does not have encryption enabled
════════════════════════════════════════════════════════════════════════════════
Unencrypted S3 buckets are vulnerable to data breaches
────────────────────────────────────────────────────────────────────────────────
 main.tf:45-50
────────────────────────────────────────────────────────────────────────────────
  45   resource "aws_s3_bucket" "data" {
  46     bucket = "company-data-bucket"
  47   }
────────────────────────────────────────────────────────────────────────────────

Stage 2 Output Example:

After 25 minutes of security scanning, you should have:

  • tfsec/Trivy scan completed with vulnerability report
  • Checkov compliance scan passed (CIS, PCI-DSS, HIPAA)
  • Terrascan custom policy validation successful
  • SARIF reports uploaded to GitHub/GitLab Security tabs
  • No HIGH or CRITICAL vulnerabilities detected
  • CI/CD pipeline gates passed

Decision Matrix:

  • All scans pass → Proceed to Stage 3 (Policy Enforcement)
  • HIGH/CRITICAL findings → Block merge, require remediation
  • MEDIUM findings → Warning, allow with review
  • Custom policy violations → Block based on organization rules
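This decision matrix is mechanical enough to automate as a pipeline gate. A hypothetical sketch follows; the "results"/"severity" field names are assumptions and must be matched to your scanner's actual JSON schema:

```python
import json

def gate_decision(findings: list[dict]) -> str:
    """Apply the matrix: block on HIGH/CRITICAL, warn on MEDIUM, else pass."""
    severities = {f.get("severity", "").upper() for f in findings}
    if severities & {"HIGH", "CRITICAL"}:
        return "block"
    if "MEDIUM" in severities:
        return "warn"
    return "pass"

if __name__ == "__main__":
    # Stand-in for reading the scanner's JSON report from disk
    report = json.loads('{"results": [{"severity": "HIGH"}]}')
    decision = gate_decision(report["results"])
    print(decision)  # prints "block"
```

Wired into CI, the script would exit non-zero on "block" so the merge gate fails, while "warn" could post a review comment instead.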

Time Investment: 15-30 minutes (automated in CI/CD)
Next Step: Proceed to Stage 3 for policy-as-code enforcement with Sentinel or OPA.


Stage 3: Policy-as-Code Enforcement (10-20 minutes)

Policy-as-code frameworks like HashiCorp Sentinel and Open Policy Agent (OPA) provide programmable guardrails that enforce organizational standards, compliance requirements, and security policies. Unlike static analysis tools that detect problems, policy engines prevent non-compliant infrastructure from being created.

Step 3.1: Understanding Policy-as-Code

What is Policy-as-Code?

According to Spacelift's policy guide, policy-as-code ensures that infrastructure configurations adhere to defined compliance standards and security policies through automated evaluation. Policy engines evaluate Terraform plans before they're applied, blocking non-compliant changes.

Sentinel vs. OPA:

| Feature | HashiCorp Sentinel | Open Policy Agent (OPA) |
|---|---|---|
| Integration | Terraform Cloud/Enterprise native | Runs anywhere (local, CI/CD, cloud) |
| Language | Sentinel (HSL) | Rego (declarative) |
| Enforcement Levels | Advisory, Soft-Mandatory, Hard-Mandatory | Configurable per policy |
| Use Case | Centralized control in TFC/TFE | Shift-left, multi-platform policies |
| Ecosystem | HashiCorp stack only | Kubernetes, Envoy, Terraform, etc. |
| Cost | Terraform Cloud/Enterprise required | Open-source, free |

Decision Criteria:

  • Choose Sentinel when: Your organization runs Terraform Cloud/Enterprise and needs centralized, audit-logged policy enforcement at apply time
  • Choose OPA when: You want shift-left checks, developer-local validation, and policies that span multiple platforms (Kubernetes, Terraform, etc.)
  • Use Both when: OPA catches issues early (pre-commit, CI), Sentinel provides final enforcement gate in TFC/TFE

Step 3.2: HashiCorp Sentinel Policies

Sentinel Policy Structure:

# policies/restrict-instance-types.sentinel
import "tfplan/v2" as tfplan

# Allowed instance types for production
allowed_instance_types = ["t3.medium", "t3.large", "m5.large", "m5.xlarge"]

# Find all EC2 instances in the plan
ec2_instances = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_instance" and
  rc.mode is "managed" and
  (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Validate instance types
instance_type_valid = rule {
  all ec2_instances as _, instance {
    instance.change.after.instance_type in allowed_instance_types
  }
}

# Main rule: fails when any instance uses a disallowed type
main = rule {
  instance_type_valid
}

Enforcement Levels:

# sentinel.hcl
policy "restrict-instance-types" {
  source            = "./policies/restrict-instance-types.sentinel"
  enforcement_level = "hard-mandatory"  # Blocks apply
}

policy "require-tags" {
  source            = "./policies/require-tags.sentinel"
  enforcement_level = "soft-mandatory"  # Can be overridden by admins
}

policy "cost-estimation" {
  source            = "./policies/cost-estimation.sentinel"
  enforcement_level = "advisory"  # Warning only
}

Example: Enforce Encryption

# policies/enforce-encryption.sentinel
import "tfplan/v2" as tfplan

# Find all S3 buckets
s3_buckets = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_s3_bucket" and
  rc.mode is "managed" and
  rc.change.actions contains "create"
}

# Find encryption configurations
s3_encryption = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_s3_bucket_server_side_encryption_configuration" and
  rc.mode is "managed"
}

# Validate every bucket has encryption
buckets_encrypted = rule {
  all s3_buckets as address, bucket {
    any s3_encryption as _, enc {
      enc.change.after.bucket is bucket.change.after.bucket
    }
  }
}

main = rule {
  buckets_encrypted
}

Testing Sentinel Policies:

# Install Sentinel CLI
wget https://releases.hashicorp.com/sentinel/0.25.0/sentinel_0.25.0_linux_amd64.zip
unzip sentinel_0.25.0_linux_amd64.zip
sudo mv sentinel /usr/local/bin/

# Test policy
sentinel test policies/

# Apply policy in Terraform Cloud
terraform login
terraform plan  # Triggers policy evaluation

Step 3.3: Open Policy Agent (OPA) Policies

OPA/Rego Policy Structure:

# policies/aws_instance_approved_types.rego
package terraform.policies.aws_instance

import rego.v1

# Allowed instance types
allowed_types := {"t3.medium", "t3.large", "m5.large", "m5.xlarge"}

# Find all EC2 instances being created
ec2_instances contains resource if {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  resource.change.actions[_] == "create"
}

# Deny rule
deny contains msg if {
  some resource in ec2_instances
  instance_type := resource.change.after.instance_type
  not instance_type in allowed_types
  msg := sprintf(
    "EC2 instance '%s' uses disallowed type '%s'. Allowed: %v",
    [resource.address, instance_type, allowed_types]
  )
}

Running OPA with Terraform:

# Install OPA
brew install opa  # macOS

# Generate Terraform plan in JSON
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json

# Evaluate policies
opa eval --data policies/ --input tfplan.json "data.terraform.policies.deny"

# Using conftest (OPA wrapper for configuration testing)
brew install conftest
conftest test tfplan.json --policy policies/

Example: Multi-Policy Enforcement

# policies/security_policies.rego
package terraform.security

import rego.v1

# Policy 1: No public S3 buckets
deny contains msg if {
  some resource in input.resource_changes
  resource.type == "aws_s3_bucket_public_access_block"
  not resource.change.after.block_public_acls
  msg := sprintf("S3 bucket '%s' allows public ACLs", [resource.address])
}

# Policy 2: RDS encryption required
deny contains msg if {
  some resource in input.resource_changes
  resource.type == "aws_db_instance"
  resource.change.actions[_] == "create"
  not resource.change.after.storage_encrypted
  msg := sprintf("RDS instance '%s' must have encryption enabled", [resource.address])
}

# Policy 3: Security groups no 0.0.0.0/0 SSH
deny contains msg if {
  some resource in input.resource_changes
  resource.type == "aws_security_group"
  some rule in resource.change.after.ingress
  rule.from_port == 22
  "0.0.0.0/0" in rule.cidr_blocks
  msg := sprintf("Security group '%s' allows SSH from 0.0.0.0/0", [resource.address])
}

# Policy 4: Require tags
required_tags := {"Environment", "Owner", "CostCenter"}

deny contains msg if {
  some resource in input.resource_changes
  resource.type in {"aws_instance", "aws_s3_bucket", "aws_db_instance"}
  resource.change.actions[_] == "create"
  tags := object.keys(resource.change.after.tags)
  missing := required_tags - tags
  count(missing) > 0
  msg := sprintf(
    "Resource '%s' missing required tags: %v",
    [resource.address, missing]
  )
}
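Policy 4's set-difference logic is easy to replay in plain code to confirm what it catches. The sketch below assumes the plan-JSON shape used in the Rego above and is purely illustrative, not part of any OPA tooling:

```python
REQUIRED_TAGS = {"Environment", "Owner", "CostCenter"}
TAGGED_TYPES = {"aws_instance", "aws_s3_bucket", "aws_db_instance"}

def missing_tag_violations(plan: dict) -> list[str]:
    """Return one message per created resource that lacks a required tag."""
    messages = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") not in TAGGED_TYPES:
            continue
        if "create" not in rc.get("change", {}).get("actions", []):
            continue
        # Same set difference as the Rego policy: required_tags - object.keys(tags)
        tags = set((rc["change"].get("after") or {}).get("tags") or {})
        missing = REQUIRED_TAGS - tags
        if missing:
            messages.append(
                f"Resource '{rc.get('address')}' missing required tags: {sorted(missing)}"
            )
    return messages
```

An instance created with only an Environment tag yields one violation naming Owner and CostCenter, while a fully tagged resource produces no messages, matching the Checkov custom check from Stage 2.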

CI/CD Integration:

# .github/workflows/opa-policy.yml
name: OPA Policy Check

on:
  pull_request:
    paths:
      - '**.tf'

jobs:
  opa-check:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan.binary
          terraform show -json tfplan.binary > tfplan.json

      - name: Install Conftest
        run: |
          wget https://github.com/open-policy-agent/conftest/releases/download/v0.49.0/conftest_0.49.0_Linux_x86_64.tar.gz
          tar xzf conftest_0.49.0_Linux_x86_64.tar.gz
          sudo mv conftest /usr/local/bin/

      - name: Run OPA Policies
        run: conftest test tfplan.json --policy policies/ --output github

      - name: Policy Results
        if: failure()
        run: |
          echo "## ❌ Policy Violations Detected" >> $GITHUB_STEP_SUMMARY
          conftest test tfplan.json --policy policies/ >> $GITHUB_STEP_SUMMARY

Step 3.4: Combined Sentinel + OPA Strategy

Layered Policy Enforcement:

Developer Workstation
    ↓
  OPA (Pre-commit) - Instant feedback, local validation
    ↓
CI/CD Pipeline
    ↓
  OPA (Pull Request) - Automated checks, block merge
    ↓
Terraform Cloud/Enterprise
    ↓
  Sentinel (Apply Time) - Final enforcement gate, audit log
    ↓
Production Deployment

Example Workflow:

# 1. Developer runs OPA locally before commit
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary | conftest test - --policy policies/

# 2. CI runs OPA on pull request
# (Automated via GitHub Actions)

# 3. Terraform Cloud runs Sentinel before apply
terraform plan  # Plan succeeds
terraform apply # Sentinel evaluates, blocks if violations detected

Step 3.5: Policy-as-Code Best Practices

1. Start with Advisory Policies

# Begin with warnings, not blocks
policy "experimental-cost-check" {
  enforcement_level = "advisory"  # Warn only
}

# Graduate to enforcement after validation
policy "production-cost-check" {
  enforcement_level = "hard-mandatory"  # Block
}

2. Version Control Policies

# Store policies in Git alongside infrastructure code
terraform-repo/
├── infrastructure/
│   ├── main.tf
│   └── variables.tf
├── policies/
│   ├── sentinel/
│   │   ├── restrict-instance-types.sentinel
│   │   └── enforce-encryption.sentinel
│   └── opa/
│       ├── security_policies.rego
│       └── compliance_policies.rego
└── tests/
    └── policy-tests/

3. Test Policies with Unit Tests

Sentinel Test:

# test/restrict-instance-types/fail-disallowed-type.hcl
mock "tfplan/v2" {
  module {
    source = "mock-tfplan-fail.sentinel"
  }
}

test {
  rules = {
    main = false  # Expect failure for t2.micro
  }
}

OPA Test:

# policies/security_policies_test.rego
package terraform.security

import rego.v1

test_deny_public_ssh if {
  expected := "Security group 'aws_security_group.web' allows SSH from 0.0.0.0/0"
  expected in deny with input as {
    "resource_changes": [{
      "type": "aws_security_group",
      "address": "aws_security_group.web",
      "change": {
        "after": {
          "ingress": [{
            "from_port": 22,
            "to_port": 22,
            "cidr_blocks": ["0.0.0.0/0"]
          }]
        }
      }
    }]
  }
}

Run Tests:

# Test Sentinel policies
sentinel test

# Test OPA policies
opa test policies/ -v

Stage 3 Output Example:

After 15 minutes of policy evaluation, you should have:

  • OPA policies evaluated locally and in CI/CD
  • Sentinel policies enforced in Terraform Cloud (if applicable)
  • All hard-mandatory policies passed
  • Advisory policy warnings reviewed
  • Policy violations documented with remediation guidance
  • Audit trail generated for compliance

Policy Violation Example:

Policy Violation: restrict-instance-types (hard-mandatory)
────────────────────────────────────────────────────────────
Resource: aws_instance.web
Violation: Instance type 't2.micro' not in allowed list
Allowed Types: t3.medium, t3.large, m5.large, m5.xlarge

Remediation:
  Change instance_type to an approved type:

  resource "aws_instance" "web" {
    instance_type = "t3.medium"  # ✅ Approved
  }

Status: ❌ Apply blocked until resolved

Time Investment: 10-20 minutes (automated)

Next Step: Proceed to Stage 4 for plan review and cost analysis with Infracost.


Stage 4: Plan Review & Cost Analysis (20-40 minutes)

Before applying infrastructure changes, teams must understand what will change, why it matters, and how much it will cost. This stage combines human review with automated cost estimation to prevent expensive mistakes and ensure changes align with business objectives.

Step 4.1: Terraform Plan Analysis

Why: The terraform plan output is your blueprint—it shows exactly what resources will be created, modified, or destroyed. According to HashiCorp's best practices, teams should always review plans before applying, especially in production.

Generate and Review Plan:

# Generate plan with detailed output
terraform plan -out=tfplan.binary

# Review plan in human-readable format
terraform show tfplan.binary

# Save plan to file for review
terraform show tfplan.binary > plan-review.txt

# Generate JSON for automated analysis
terraform show -json tfplan.binary > tfplan.json
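The JSON plan also lends itself to scripted review. As a minimal Python sketch (the `sample_plan` dict below is an illustrative stand-in for a real `tfplan.json`; field names follow Terraform's plan JSON schema), tallying planned actions from `resource_changes`:

```python
import json

def summarize_actions(plan: dict) -> dict:
    """Count create/update/delete/replace actions in a plan's resource_changes."""
    counts = {"create": 0, "update": 0, "delete": 0, "replace": 0}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        # ["create", "delete"] in either order marks a replacement
        if "create" in actions and "delete" in actions:
            counts["replace"] += 1
        elif actions == ["create"]:
            counts["create"] += 1
        elif actions == ["update"]:
            counts["update"] += 1
        elif actions == ["delete"]:
            counts["delete"] += 1
    return counts

# Illustrative stand-in for: summarize_actions(json.load(open("tfplan.json")))
sample_plan = {"resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["create"]}},
    {"address": "aws_security_group.database", "change": {"actions": ["update"]}},
    {"address": "aws_instance.legacy", "change": {"actions": ["delete"]}},
]}
print(summarize_actions(sample_plan))
# → {'create': 1, 'update': 1, 'delete': 1, 'replace': 0}
```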

Plan Output Interpretation:

Terraform will perform the following actions:

  # aws_db_instance.main will be created
  + resource "aws_db_instance" "main" {
      + allocated_storage    = 100
      + db_name             = "production_db"
      + engine              = "postgres"
      + engine_version      = "15.3"
      + instance_class      = "db.m5.large"
      + storage_encrypted   = true
      + publicly_accessible = false  # ✅ Good - not publicly exposed
      + ...
    }

  # aws_security_group.database will be modified
  ~ resource "aws_security_group" "database" {
      ~ ingress {
          ~ cidr_blocks = [
              - "10.0.0.0/8",        # ⚠️ Removing internal access
              + "10.1.0.0/16",       # ⚠️ Restricting to specific subnet
            ]
        }
    }

  # aws_instance.legacy will be destroyed
  - resource "aws_instance" "legacy" {
      - instance_type = "t2.micro"
      - ...
    }

Plan: 1 to add, 1 to change, 1 to destroy.

Key Indicators:

| Symbol | Meaning | Risk Level |
|--------|---------|------------|
| `+` | Resource will be created | Low |
| `~` | Resource will be modified | Medium |
| `-` | Resource will be destroyed | HIGH |
| `-/+` | Resource will be replaced (destroyed then created) | CRITICAL |
| `<=` | Resource will be read (data source) | Low |

Red Flags to Watch:

# ⚠️ HIGH RISK: Database replacement (data loss!)
-/+ resource "aws_db_instance" "main" {
      # Changing engine_version from 14.9 to 15.3 forces replacement
    }

# ⚠️ MEDIUM RISK: Security group changes
~ resource "aws_security_group" "web" {
    ~ ingress {
        - cidr_blocks = ["10.0.0.0/8"]
        + cidr_blocks = ["0.0.0.0/0"]  # ❌ Opening to internet!
      }
  }

# ✅ LOW RISK: Tag updates (metadata only)
~ resource "aws_instance" "web" {
    ~ tags = {
        + "Environment" = "production"
      }
  }
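These red flags can also be caught mechanically before a human ever reads the plan. A hedged sketch (field names follow the plan JSON schema; the two rules shown are illustrative, not an exhaustive policy):

```python
def find_red_flags(plan: dict) -> list:
    """Flag destructive changes and world-open ingress rules in a plan JSON."""
    flags = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions and "create" in actions:
            flags.append(f"CRITICAL: {rc['address']} will be replaced")
        elif actions == ["delete"]:
            flags.append(f"HIGH: {rc['address']} will be destroyed")
        after = rc["change"].get("after") or {}
        for rule in after.get("ingress", []):
            if "0.0.0.0/0" in rule.get("cidr_blocks", []):
                flags.append(f"HIGH: {rc['address']} opens port "
                             f"{rule.get('from_port')} to the internet")
    return flags

# Illustrative plan fragment mirroring the red-flag examples above
plan = {"resource_changes": [
    {"address": "aws_db_instance.main",
     "change": {"actions": ["delete", "create"], "after": {}}},
    {"address": "aws_security_group.web",
     "change": {"actions": ["update"],
                "after": {"ingress": [{"from_port": 22, "to_port": 22,
                                       "cidr_blocks": ["0.0.0.0/0"]}]}}},
]}
for f in find_red_flags(plan):
    print(f)
```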

Step 4.2: Using Terraform Plan Explainer

Why: Complex plans with dozens of resources can be difficult to review. The Terraform Plan Explainer tool provides visual analysis, security risk scoring, and blast radius calculations.

Tool Features:

  • Security Risk Detection: Identifies dangerous changes (public exposure, encryption removal, wildcard permissions)
  • Blast Radius Visualization: Shows dependency graphs and impact analysis
  • Change Summary: Categorizes changes by type and risk level
  • Resource Dependency Graph: Visualizes which resources depend on others

Example Analysis:

Upload your tfplan.json to the Terraform Plan Explainer:

Security Analysis:
  🔴 HIGH RISK (3):
    - aws_db_instance.main: Force replacement (potential data loss)
    - aws_security_group.web: Opening SSH to 0.0.0.0/0
    - aws_s3_bucket_public_access_block.logs: Removing block_public_acls

  🟡 MEDIUM RISK (5):
    - aws_rds_cluster.analytics: Scaling instance class (downtime expected)
    - aws_lambda_function.processor: Changing runtime (requires testing)

  🟢 LOW RISK (12):
    - Tag updates, description changes, minor configuration tweaks

Blast Radius: 8 dependent resources affected by changes

Estimated Apply Time: 15-20 minutes
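A blast-radius figure like the one above can be approximated by walking resource dependencies. A minimal sketch (the reverse-dependency map here is hypothetical; real tooling would derive it from the plan's configuration graph):

```python
from collections import deque

def blast_radius(dependents: dict, changed: set) -> set:
    """BFS over a reverse-dependency map: everything affected, directly
    or transitively, by the changed resources (excluding the set itself)."""
    affected, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected - changed

# Hypothetical map: resource -> resources that depend on it
dependents = {
    "aws_db_instance.main": ["aws_instance.web", "aws_lambda_function.processor"],
    "aws_instance.web": ["aws_lb_target_group_attachment.web"],
}
print(sorted(blast_radius(dependents, {"aws_db_instance.main"})))
```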

Step 4.3: Cost Estimation with Infracost

Why: According to Infracost's research, infrastructure changes can have unexpected cost impacts. A seemingly small change like increasing an RDS instance class can add thousands per month.

Installation & Setup:

# Install Infracost
brew install infracost  # macOS
# or
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh

# Configure API key (free for up to 100 resources)
infracost auth login

# Or set an API key directly (alternative to auth login)
infracost configure set api_key <YOUR_API_KEY>

Generate Cost Estimate:

# Cost breakdown for current plan
infracost breakdown --path . --format table

# Compare cost difference (current state vs. planned changes)
infracost diff --path . --format table

# Generate JSON for CI/CD integration
infracost breakdown --path . --format json --out-file infracost.json

# Pull request comment format (shows cost diff)
infracost comment github --path infracost.json \
  --repo $GITHUB_REPOSITORY \
  --pull-request $PR_NUMBER \
  --github-token $GITHUB_TOKEN

Example Cost Breakdown:

Project: production-infrastructure

 Name                                    Monthly Qty  Unit   Monthly Cost

 aws_db_instance.main
 ├─ Database instance (on-demand, db.m5.large)   730  hours       $252.20
 ├─ Storage (general purpose SSD, gp3)           100  GB            $11.50
 └─ Additional backup storage                    100  GB             $9.50

 aws_instance.web
 ├─ Instance usage (Linux/UNIX, on-demand, t3.medium) 730 hours    $30.37
 └─ EBS volume (gp3)                              30  GB             $2.40

 aws_nat_gateway.main
 ├─ NAT gateway                                  730  hours        $32.85
 └─ Data processed                                50  GB             $2.25

 OVERALL TOTAL                                                    $340.07

──────────────────────────────────
78 cloud resources were detected:
∙ 5 were estimated, all of which include usage-based costs
∙ 73 were free

Cost Diff Output:

Infracost estimate: Monthly cost will increase by $175 (+51% from $340 to $515)

 Name                              Baseline    Usage     Planned     Diff     % Change

 aws_db_instance.main              $273        $0        $458        +$185    +68%
 ├─ Database instance
 │  (db.m5.large → db.m5.xlarge)   $252        $0        $504        +$252   +100%
 └─ Storage (100 GB → 200 GB)       $12        $0         $23        +$11     +92%

 aws_elasticache_cluster.redis     $0          $0         $67        +$67       ∞
 ├─ Cache nodes (cache.m5.large)   $0          $0         $67        +$67       ∞

 aws_instance.worker               $33         $0         $23        -$10     -30%
 └─ Instance usage
    (t3.medium → t3.small)          $30         $0         $20        -$10     -33%

──────────────────────────────────────────────────────────────────────────────
Key changes:
  + 1 new resource (Redis cache)
  ~ 2 resource updates (RDS scaling up, EC2 scaling down)

Monthly cost change: +$175 ($340 → $515)
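The same delta can be computed and gated in a plain script. A sketch (the numbers mirror the example above; in practice `past` and `planned` would come from Infracost's JSON totals, and the $500 limit mirrors this article's decision matrix):

```python
def cost_change(past: float, planned: float, limit: float = 500.0):
    """Return (diff, pct_change, within_limit) for a planned monthly cost."""
    diff = planned - past
    pct = (diff / past * 100) if past else float("inf")
    return diff, pct, diff <= limit

diff, pct, ok = cost_change(340.0, 515.0)
print(f"Monthly cost change: {diff:+.0f} ({pct:+.0f}%) within_limit={ok}")
# → Monthly cost change: +175 (+51%) within_limit=True
```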

Step 4.4: Cost Optimization Analysis

Identify Cost-Saving Opportunities:

# Analyze cost trends over time
infracost breakdown --path . --format json | jq '.projects[].breakdown.resources[] | select((.monthlyCost // "0") | tonumber > 100)'

# Compare alternative architectures
infracost breakdown --path ./production --format table > prod-cost.txt
infracost breakdown --path ./alternative --format table > alt-cost.txt
diff prod-cost.txt alt-cost.txt

Common Cost Optimizations:

# ❌ EXPENSIVE: On-demand RDS instance
resource "aws_db_instance" "main" {
  instance_class = "db.m5.large"  # $252/month
}

# ✅ CHEAPER: Reserved instance (1-year term, 40% savings)
# Note: Reserve via AWS Console, reference in Terraform
resource "aws_db_instance" "main" {
  instance_class = "db.m5.large"  # ~$151/month with RI
}

# ✅ EVEN CHEAPER: Right-size instance based on metrics
resource "aws_db_instance" "main" {
  instance_class = "db.t3.large"  # ~$126/month (50% savings)
  # Validated via CloudWatch: Avg CPU 25%, Avg Memory 40%
}

Infracost Policy as Code:

# infracost-policies/cost-limits.rego
package infracost

import rego.v1

deny contains msg if {
  # Limit monthly cost increase to $500
  to_number(input.diffTotalMonthlyCost) > 500
  msg := sprintf(
    "Cost increase of $%.2f exceeds $500 limit",
    [to_number(input.diffTotalMonthlyCost)]
  )
}

deny contains msg if {
  # Flag any single resource over $1000/month
  some resource in input.projects[_].breakdown.resources
  to_number(resource.monthlyCost) > 1000
  msg := sprintf(
    "Resource '%s' costs $%.2f/month (exceeds $1000 limit)",
    [resource.name, to_number(resource.monthlyCost)]
  )
}

Run Cost Policies:

infracost breakdown --path . --format json | conftest test - --parser json --policy infracost-policies/

Step 4.5: CI/CD Integration - Plan Review Automation

GitHub Actions Example:

# .github/workflows/terraform-plan.yml
name: Terraform Plan & Cost Estimate

on:
  pull_request:
    paths:
      - '**.tf'

jobs:
  plan-and-cost:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Terraform Plan
        id: plan
        run: |
          terraform plan -out=tfplan.binary -no-color
          terraform show tfplan.binary > plan.txt
          terraform show -json tfplan.binary > tfplan.json

      - name: Setup Infracost
        uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate Cost Estimate
        run: |
          infracost breakdown --path tfplan.json \
            --format json \
            --out-file infracost.json

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan.txt', 'utf8');
            const infracost = JSON.parse(fs.readFileSync('infracost.json', 'utf8'));

            const comment = `## Terraform Plan

            <details><summary>Show Plan</summary>

            \`\`\`hcl
            ${plan}
            \`\`\`

            </details>

            ## Cost Estimate

            Monthly cost change: **\$${infracost.diffTotalMonthlyCost}**

            | Resource | Planned Monthly Cost |
            |----------|----------------------|
            ${infracost.projects[0].breakdown.resources.map(r =>
              `| ${r.name} | \$${r.monthlyCost} |`
            ).join('\n')}

            📊 [View detailed breakdown](https://dashboard.infracost.io)
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment
            });

Pull Request Comment Example:

## Terraform Plan

Plan: 3 to add, 2 to change, 1 to destroy.

<details><summary>Show Plan</summary>

[Full plan output...]

</details>

## Cost Estimate

Monthly cost change: **+$175** ($340 → $515)

⚠️ **Significant cost increase detected**

| Resource | Current | Planned | Diff | % Change |
|----------|---------|---------|------|----------|
| aws_db_instance.main | $273 | $458 | +$185 | +68% |
| aws_elasticache_cluster.redis | $0 | $67 | +$67 | ∞ |
| aws_instance.worker | $33 | $23 | -$10 | -30% |

### Review Checklist
- [ ] Cost increase justified by business requirements
- [ ] Alternative architectures evaluated
- [ ] Right-sizing analysis completed
- [ ] Reserved instance opportunities identified

**Approval Required:** Cost increase >$100 requires lead engineer sign-off

Stage 4 Output Example:

After 30 minutes of plan review and cost analysis, you should have:

  • Terraform plan generated and reviewed
  • Security risks identified and assessed
  • Cost estimate completed with Infracost
  • Cost increase/decrease justified and documented
  • Pull request comment posted with plan and cost summary
  • Approval obtained (if required by cost thresholds)
  • Blast radius analysis completed

Decision Matrix:

Plan + Cost Review → Approval Decision
────────────────────────────────────────
No destructive changes + Cost decrease → Auto-approve
No destructive changes + Cost increase <$100 → Engineer approval
Destructive changes (destroy/replace) → Lead approval required
Cost increase >$500 → Director approval required
Policy violations → Block until resolved
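Encoded as a function, the matrix above might look like the following sketch (tiers and thresholds are taken from the table; treating a non-destructive increase of $100 or more as needing lead sign-off follows the review checklist earlier in this stage):

```python
def approval_required(destructive: bool, cost_diff: float,
                      policy_violations: int) -> str:
    """Map a plan's risk profile to the required approval tier."""
    if policy_violations > 0:
        return "blocked"          # block until resolved
    if cost_diff > 500:
        return "director"         # large cost increase
    if destructive:
        return "lead"             # destroy/replace present
    if cost_diff <= 0:
        return "auto-approve"     # no destructive changes, cost decrease
    if cost_diff < 100:
        return "engineer"         # modest cost increase
    return "lead"                 # non-destructive increase of $100-$500

print(approval_required(destructive=False, cost_diff=-20, policy_violations=0))
# → auto-approve
```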

Time Investment: 20-40 minutes (human review + automated analysis)

Next Step: Proceed to Stage 5 for automated testing and compliance validation.


Stage 5: Automated Testing (30-60 minutes)

Automated testing validates that your infrastructure code works as intended before deploying to production. This stage includes integration tests, compliance validation, and functional testing using tools like Terratest, Kitchen-Terraform, and automated compliance frameworks.

Step 5.1: Terratest - Integration Testing

Why: Terratest is a Go library that enables you to write automated tests for your infrastructure code, deploying real resources to test environments and validating their configuration.

Installation:

# Create test directory
mkdir -p test
cd test

# Initialize Go module
go mod init github.com/yourorg/terraform-tests

# Install Terratest
go get github.com/gruntwork-io/terratest/modules/terraform
go get github.com/gruntwork-io/terratest/modules/aws
go get github.com/stretchr/testify/assert

Example Test - VPC Module:

// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    t.Parallel()

    // Configure Terraform options
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "vpc_cidr": "10.0.0.0/16",
            "environment": "test",
            "region": "us-east-1",
        },
        EnvVars: map[string]string{
            "AWS_DEFAULT_REGION": "us-east-1",
        },
    }

    // Cleanup resources after test
    defer terraform.Destroy(t, terraformOptions)

    // Deploy infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Validate outputs
    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID, "VPC ID should not be empty")

    // Validate VPC configuration in AWS
    vpc := aws.GetVpcById(t, vpcID, "us-east-1")
    assert.Equal(t, "10.0.0.0/16", vpc.CidrBlock)
    assert.False(t, vpc.IsDefault)

    // Validate subnets created
    subnets := aws.GetSubnetsForVpc(t, vpcID, "us-east-1")
    assert.GreaterOrEqual(t, len(subnets), 2, "Should have at least 2 subnets")

    // Validate tags
    expectedTags := map[string]string{
        "Environment": "test",
        "ManagedBy": "Terraform",
    }
    for key, expectedValue := range expectedTags {
        actualValue := vpc.Tags[key]
        assert.Equal(t, expectedValue, actualValue, "Tag %s mismatch", key)
    }
}

Example Test - Security Group:

// test/security_group_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/stretchr/testify/assert"
)

func TestSecurityGroupNoPublicSSH(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../infrastructure",
        Vars: map[string]interface{}{
            "environment": "test",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get security group ID
    sgID := terraform.Output(t, terraformOptions, "security_group_id")

    // Validate no SSH from 0.0.0.0/0
    sg := aws.GetSecurityGroupById(t, sgID, "us-east-1")

    for _, rule := range sg.IngressRules {
        if rule.FromPort == 22 && rule.ToPort == 22 {
            for _, cidr := range rule.CidrBlocks {
                assert.NotEqual(t, "0.0.0.0/0", cidr,
                    "SSH should not be open to the internet")
            }
        }
    }
}

Run Tests:

# Run all tests
go test -v -timeout 30m

# Run specific test
go test -v -run TestVPCCreation -timeout 20m

# Run tests in parallel
go test -v -parallel 10 -timeout 60m

Test Output:

=== RUN   TestVPCCreation
=== PAUSE TestVPCCreation
=== CONT  TestVPCCreation
TestVPCCreation 2025-01-07T10:30:15-05:00 terraform.go:123: Running terraform init...
TestVPCCreation 2025-01-07T10:30:18-05:00 terraform.go:123: Running terraform apply...
TestVPCCreation 2025-01-07T10:32:45-05:00 terraform.go:456: Apply complete! Resources: 8 added, 0 changed, 0 destroyed.
TestVPCCreation 2025-01-07T10:32:46-05:00 vpc_test.go:35: VPC ID: vpc-0a1b2c3d4e5f6a7b8
TestVPCCreation 2025-01-07T10:32:47-05:00 vpc_test.go:40: Validated VPC CIDR: 10.0.0.0/16
TestVPCCreation 2025-01-07T10:32:48-05:00 vpc_test.go:45: Validated subnets: 4 created
TestVPCCreation 2025-01-07T10:32:50-05:00 terraform.go:123: Running terraform destroy...
--- PASS: TestVPCCreation (195.32s)
PASS
ok      github.com/yourorg/terraform-tests    195.412s

Step 5.2: Kitchen-Terraform - Behavior-Driven Testing

Why: Kitchen-Terraform integrates with Test Kitchen, providing behavior-driven development (BDD) style testing with InSpec for compliance validation.

Installation:

# Install Ruby dependencies
gem install bundler
bundle init

# Add to Gemfile
echo 'gem "kitchen-terraform", "~> 7.0"' >> Gemfile
bundle install

Configuration (.kitchen.yml):

---
driver:
  name: terraform
  variable_files:
    - test/fixtures/terraform.tfvars

provisioner:
  name: terraform

verifier:
  name: terraform
  systems:
    - name: basic
      backend: aws
      controls:
        - vpc_configuration
        - security_groups

platforms:
  - name: terraform

suites:
  - name: default
    driver:
      root_module_directory: test/fixtures
    verifier:
      systems:
        - name: basic
          backend: aws
          profile_locations:
            - test/integration/default

InSpec Controls (test/integration/default/controls/vpc.rb):

# Validate VPC configuration
control 'vpc_configuration' do
  impact 1.0
  title 'VPC should be configured securely'

  describe aws_vpc(vpc_id: attribute('vpc_id')) do
    it { should exist }
    its('cidr_block') { should eq '10.0.0.0/16' }
    its('state') { should eq 'available' }

    # DNS settings
    its('dhcp_options_id') { should_not be_nil }

    # Tags
    its('tags') { should include('Environment' => 'test') }
    its('tags') { should include('ManagedBy' => 'Terraform') }
  end

  # Validate flow logs enabled
  describe aws_flow_log(vpc_id: attribute('vpc_id')) do
    it { should exist }
    its('traffic_type') { should eq 'ALL' }
  end
end

Run Tests:

# Run Kitchen-Terraform tests
bundle exec kitchen test

# Individual stages
bundle exec kitchen create   # Deploy infrastructure
bundle exec kitchen verify   # Run InSpec controls
bundle exec kitchen destroy  # Cleanup resources

Step 5.3: Compliance Validation - InSpec AWS

Why: InSpec AWS provides compliance-as-code validation against frameworks like CIS AWS Foundations Benchmark, PCI-DSS, and HIPAA.

Example Compliance Profile:

# test/compliance/aws-baseline/controls/s3.rb
title 'S3 Bucket Security Controls'

control 's3-encryption-required' do
  impact 1.0
  title 'S3 buckets must have encryption enabled'
  desc 'Validates all S3 buckets have server-side encryption'

  aws_s3_buckets.bucket_names.each do |bucket|
    describe aws_s3_bucket(bucket_name: bucket) do
      it { should have_default_encryption_enabled }
    end
  end
end

control 's3-public-access-blocked' do
  impact 1.0
  title 'S3 buckets must block public access'

  aws_s3_buckets.bucket_names.each do |bucket|
    describe aws_s3_bucket(bucket_name: bucket) do
      it { should have_access_logging_enabled }
      it { should_not be_public }
    end
  end
end

control 's3-versioning-enabled' do
  impact 0.7
  title 'S3 buckets should have versioning enabled'

  aws_s3_buckets.bucket_names.each do |bucket|
    describe aws_s3_bucket(bucket_name: bucket) do
      it { should have_versioning_enabled }
    end
  end
end

Run Compliance Scan:

# Run InSpec profile
inspec exec test/compliance/aws-baseline \
  -t aws:// \
  --reporter cli json:compliance-report.json

# Run a CIS AWS Foundations Benchmark profile (e.g. MITRE's baseline)
inspec exec https://github.com/mitre/aws-foundations-cis-baseline \
  -t aws:// \
  --reporter cli html:cis-report.html

Compliance Report Output:

Profile: AWS Security Baseline (aws-baseline)
Version: 1.0.0
Target:  aws://

  ✔  s3-encryption-required: S3 buckets must have encryption enabled
     ✔  S3 Bucket company-data-bucket has default encryption enabled
     ✔  S3 Bucket application-logs has default encryption enabled

  ×  s3-public-access-blocked: S3 buckets must block public access
     ✔  S3 Bucket company-data-bucket is not public
     ×  S3 Bucket legacy-assets is public (1 failed)

  ✔  s3-versioning-enabled: S3 buckets should have versioning enabled
     ✔  S3 Bucket company-data-bucket has versioning enabled

Profile Summary: 2 successful controls, 1 control failure, 0 controls skipped
Test Summary: 5 successful, 1 failure, 0 skipped

Step 5.4: CI/CD Integration - Automated Testing

GitHub Actions Example:

# .github/workflows/terraform-test.yml
name: Terraform Integration Tests

on:
  pull_request:
    branches: [main]
  workflow_dispatch:

jobs:
  terratest:
    runs-on: ubuntu-latest
    timeout-minutes: 60

    steps:
      - uses: actions/checkout@v4

      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Run Terratest
        run: |
          cd test
          go mod download
          go install gotest.tools/gotestsum@latest
          gotestsum --junitfile test-results.xml -- -timeout 60m -parallel 5 ./...

      - name: Publish Test Results
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: Terratest Results
          path: test/test-results.xml
          reporter: java-junit

Step 5.5: Testing Best Practices

1. Use Separate Test Environments

# test/fixtures/terraform.tfvars
environment = "terratest"
region      = "us-west-2"     # Different region from production
vpc_cidr    = "10.99.0.0/16"  # Non-overlapping CIDR

# tfvars hold literal values only; generate a unique suffix inside the
# configuration (e.g. with a random_id resource) to avoid naming conflicts
resource_prefix = "terratest"

2. Implement Proper Cleanup

func TestWithCleanup(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../infrastructure",
    }

    // ALWAYS use defer for cleanup
    defer terraform.Destroy(t, terraformOptions)

    // Even if test fails, cleanup runs
    terraform.InitAndApply(t, terraformOptions)

    // ... test validations ...
}

3. Test Idempotency

func TestTerraformIdempotency(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../infrastructure",
    }

    defer terraform.Destroy(t, terraformOptions)

    // First apply
    terraform.InitAndApply(t, terraformOptions)

    // Second apply should show no changes (Terraform 1.x prints
    // "No changes. Your infrastructure matches the configuration.")
    planOutput := terraform.InitAndPlan(t, terraformOptions)
    assert.Contains(t, planOutput, "No changes.")
}

4. Validate Outputs

func TestOutputValidation(t *testing.T) {
    // ... setup and apply ...

    // Validate output format
    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.Regexp(t, regexp.MustCompile(`^vpc-[a-f0-9]{17}$`), vpcID)

    // Validate output values
    subnetIDs := terraform.OutputList(t, terraformOptions, "subnet_ids")
    assert.GreaterOrEqual(t, len(subnetIDs), 2)
}

Stage 5 Output Example:

After 45 minutes of automated testing, you should have:

  • Terratest integration tests passed (VPC, security groups, IAM)
  • InSpec compliance validation completed
  • CIS AWS Foundations Benchmark passed
  • Idempotency tests validated (no drift on re-apply)
  • Output validation confirmed
  • Test resources cleaned up automatically
  • Test reports generated and published

Test Summary:

Terraform Integration Tests
────────────────────────────────────────────
- TestVPCCreation                    PASS (195s)
- TestSecurityGroupNoPublicSSH       PASS (142s)
- TestRDSEncryptionEnabled           PASS (287s)
- TestS3BucketCompliance             PASS (89s)
- TestIAMPolicyLeastPrivilege        PASS (56s)
- TestTerraformIdempotency           PASS (312s)

Compliance Validation (InSpec)
────────────────────────────────────────────
- CIS AWS Foundations 1.1            PASS
- CIS AWS Foundations 1.2            PASS
- CIS AWS Foundations 2.1            PASS
⚠️  CIS AWS Foundations 2.3           WARNING (Versioning recommended)
- CIS AWS Foundations 3.1            PASS

Total Tests: 42 passed, 0 failed, 1 warning
Total Time: 15m 34s

Time Investment: 30-60 minutes (automated, runs in parallel)

Next Step: Proceed to Stage 6 for controlled deployment with approval gates.


Stage 6: Controlled Deployment (15-30 minutes)

Deployment is the most critical phase—where infrastructure changes become reality. This stage implements approval gates, blast radius controls, rollback procedures, and phased deployment strategies to minimize risk.

Step 6.1: Approval Workflows

Why: According to Terraform Cloud documentation, manual approval gates prevent accidental or unauthorized infrastructure changes, especially for production environments.

Terraform Cloud Approval:

# terraform.tf
terraform {
  cloud {
    organization = "your-org"

    workspaces {
      name = "production"
    }
  }
}

# Workspace settings (via Terraform Cloud UI or API):
# - Auto Apply: Disabled (requires manual approval)
# - Approvers: DevOps team, Lead Engineer
# - Required approvals: 2

GitHub Actions Approval:

# .github/workflows/terraform-deploy.yml
name: Terraform Deploy to Production

on:
  push:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Terraform Plan
        run: terraform plan -out=tfplan
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan

  approve:
    needs: plan
    runs-on: ubuntu-latest
    environment:
      name: production
      # Required reviewers configured in GitHub Settings → Environments
    steps:
      - name: Approval Required
        run: echo "Manual approval required to proceed with deployment"

  apply:
    needs: approve
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
      - name: Terraform Apply
        run: terraform apply tfplan

GitLab CI Approval:

# .gitlab-ci.yml
deploy-production:
  stage: deploy
  script:
    - terraform apply -auto-approve
  environment:
    name: production
  when: manual  # Manual trigger required
  only:
    - main
  needs:
    - terraform-plan

Step 6.2: Deployment Strategies

Strategy 1: Blue-Green Deployment

# Blue-Green infrastructure deployment
variable "active_environment" {
  description = "Active environment (blue or green)"
  type        = string
  default     = "blue"
}

# Blue environment
module "blue_environment" {
  source = "./modules/application"

  environment_name = "blue"
  enabled          = var.active_environment == "blue"

  instance_count = var.active_environment == "blue" ? 3 : 0
}

# Green environment
module "green_environment" {
  source = "./modules/application"

  environment_name = "green"
  enabled          = var.active_environment == "green"

  instance_count = var.active_environment == "green" ? 3 : 0
}

# Load balancer switches between blue and green
resource "aws_lb_target_group_attachment" "active" {
  target_group_arn = aws_lb_target_group.main.arn

  # Multi-line conditionals must be wrapped in parentheses in HCL
  target_id = (
    var.active_environment == "blue"
    ? module.blue_environment.instance_ids[0]
    : module.green_environment.instance_ids[0]
  )
}

Deployment Process:

# Step 1: Deploy to green (inactive) environment
terraform apply -var="active_environment=blue"  # Blue still active

# Step 2: Test green environment
curl https://green.example.com/health

# Step 3: Switch traffic to green
terraform apply -var="active_environment=green"

# Step 4: If issues, rollback to blue
terraform apply -var="active_environment=blue"
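The switch decision in steps 3-4 reduces to a small function. A sketch (the health-check result would come from the curl probe above; the real switch is the `terraform apply` with the chosen value):

```python
def next_active(current: str, candidate_healthy: bool) -> str:
    """Blue-green switch: promote the idle colour only if its health
    check passed; otherwise stay on (or roll back to) the current one."""
    candidate = "green" if current == "blue" else "blue"
    return candidate if candidate_healthy else current

print(next_active("blue", candidate_healthy=True))
# → green
```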

Strategy 2: Canary Deployment

# Canary deployment with weighted routing
resource "aws_lb_target_group" "stable" {
  name     = "app-stable"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_target_group" "canary" {
  name     = "app-canary"
  port     = 80
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

# Route 90% traffic to stable, 10% to canary
resource "aws_lb_listener_rule" "weighted_routing" {
  listener_arn = aws_lb_listener.main.arn

  action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.stable.arn
        weight = 90
      }

      target_group {
        arn    = aws_lb_target_group.canary.arn
        weight = 10
      }
    }
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}

Gradual Rollout:

# Phase 1: 10% canary
terraform apply -var="canary_weight=10"
# Monitor metrics for 1 hour

# Phase 2: 50% canary
terraform apply -var="canary_weight=50"
# Monitor metrics for 30 minutes

# Phase 3: 100% canary (full rollout)
terraform apply -var="canary_weight=100"
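The phased ramp can be automated with a metric gate between phases. A sketch (the `error_rate` callback and the 1% threshold are placeholders for real monitoring queries; each phase would actually shell out to `terraform apply -var="canary_weight=..."`):

```python
def ramp_canary(weights, error_rate, max_error_rate=0.01):
    """Advance through canary weights, halting (and signalling rollback)
    if the observed error rate at any phase exceeds the threshold."""
    applied = []
    for w in weights:
        applied.append(w)  # stand-in for: terraform apply -var="canary_weight=<w>"
        if error_rate(w) > max_error_rate:
            return applied, "rollback"
    return applied, "promoted"

# Placeholder metrics: errors spike once the canary takes half the traffic
fake_metrics = {10: 0.002, 50: 0.03, 100: 0.0}
print(ramp_canary([10, 50, 100], fake_metrics.get))
# → ([10, 50], 'rollback')
```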

Step 6.3: State Locking and Backend Configuration

Why: State locking prevents concurrent modifications that could corrupt infrastructure state, causing data loss or orphaned resources.

S3 + DynamoDB Backend (Recommended):

# backend.tf
terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"

    # State locking with DynamoDB
    dynamodb_table = "terraform-state-locks"

    # Encryption at rest
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"

    # Note: state history comes from S3 bucket versioning, which is enabled
    # on the bucket itself — "versioning" is not a valid s3 backend argument
  }
}

DynamoDB Table for Locking:

# Create state lock table
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "global"
  }
}

Terraform Cloud Backend:

# backend.tf
terraform {
  cloud {
    organization = "your-org"

    workspaces {
      name = "production-infrastructure"
    }
  }
}

Step 6.4: Safe Apply Practices

1. Always Review Plan Before Apply

# NEVER skip plan review
terraform plan -out=tfplan  # Review output carefully
terraform apply tfplan      # Apply saved plan

# ❌ DANGEROUS - Never use in production
terraform apply -auto-approve  # Skips confirmation

2. Use Targeted Applies for Risky Changes

# Apply changes to specific resources only
terraform apply -target=aws_security_group.database

# Apply changes to specific module
terraform apply -target=module.vpc

# Useful for:
# - Testing changes incrementally
# - Recovering from partial failures
# - Updating specific resources without affecting others

3. Validate Before Apply

# Pre-apply validation checklist
terraform fmt -check -recursive                # ✅ Formatting
terraform validate                             # ✅ Syntax
tfsec .                                        # ✅ Security scan
terraform plan -detailed-exitcode -out=tfplan  # ✅ Plan review
PLAN_EXIT=$?
# Exit code 0 = no changes, 1 = error, 2 = changes present

# Only apply if the plan succeeded and contains changes
if [ "$PLAN_EXIT" -eq 2 ]; then
  terraform apply tfplan
fi
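The `-detailed-exitcode` contract can also be captured in a small helper when wrapping Terraform in a larger pipeline script (the function name here is ours, not a Terraform API):

```python
def interpret_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to a pipeline action.

    0 = no changes (safe to skip apply)
    1 = error (abort the pipeline)
    2 = changes present (proceed to apply the saved plan)
    """
    return {0: "skip", 1: "abort", 2: "apply"}.get(code, "abort")

assert interpret_plan_exit(0) == "skip"
assert interpret_plan_exit(2) == "apply"
assert interpret_plan_exit(1) == "abort"
assert interpret_plan_exit(99) == "abort"  # unknown codes fail safe
```

Failing safe on unexpected codes keeps a wrapper from applying a plan whose status it does not understand.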

Step 6.5: Rollback Procedures

State Backup and Restore:

# Before risky apply, backup state
terraform state pull > terraform.tfstate.backup.$(date +%Y%m%d-%H%M%S)

# If apply fails, restore previous state
# (state push refuses older serial numbers; add -force only after careful review)
terraform state push terraform.tfstate.backup.20250107-103045

# List state versions (S3 backend with versioning)
aws s3api list-object-versions \
  --bucket your-org-terraform-state \
  --prefix production/infrastructure.tfstate

# Restore specific version
aws s3api get-object \
  --bucket your-org-terraform-state \
  --key production/infrastructure.tfstate \
  --version-id VERSION_ID \
  terraform.tfstate.restore

terraform state push terraform.tfstate.restore

Terraform Cloud State Versioning:

# List resources tracked in the current state
terraform state list

# Rollback to previous state version (via Terraform Cloud UI)
# Settings → States → Select version → "Rollback to this state"

Infrastructure Rollback:

# Rollback by reverting Git commit
git revert HEAD
git push origin main

# CI/CD automatically applies reverted configuration
# This is safer than manual state manipulation

Step 6.6: Monitoring During Deployment

CloudWatch Alarms:

resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "high-error-rate-during-deploy"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "5XXError"
  namespace           = "AWS/ApplicationELB"
  period              = 60
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "Alert if error rate spikes during deployment"

  alarm_actions = [aws_sns_topic.deployment_alerts.arn]
}

Deployment Health Checks:

#!/bin/bash
# deploy-with-healthcheck.sh

# In CI, apply a plan that was already reviewed and approved upstream
terraform apply -auto-approve

# Wait for resources to stabilize
sleep 60

# Health check
HEALTH_ENDPOINT="https://api.example.com/health"
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT")

if [ "$RESPONSE" -ne 200 ]; then
  echo "❌ Health check failed! Rolling back..."
  git revert HEAD --no-edit
  terraform apply -auto-approve
  exit 1
fi

echo "✅ Deployment successful and healthy"
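A single probe after a fixed sleep is fragile: resources often become healthy a little before or after the 60-second mark. Polling with retries is more robust. A sketch with the HTTP call injected as a callable so the logic can be tested offline (function name and timings are illustrative):

```python
import time

def wait_until_healthy(probe, attempts: int = 5, delay_s: float = 1.0) -> bool:
    """Poll `probe()` (which returns an HTTP status code) until 200 or retries run out."""
    for _ in range(attempts):
        if probe() == 200:
            return True
        time.sleep(delay_s)
    return False

# Simulate a service that becomes healthy on the third probe.
responses = iter([503, 503, 200])
assert wait_until_healthy(lambda: next(responses), attempts=5, delay_s=0) is True
assert wait_until_healthy(lambda: 500, attempts=3, delay_s=0) is False
```

In a real deployment the probe would wrap the curl call (or an HTTP client request) against the health endpoint, and a `False` result would trigger the rollback path.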

Stage 6 Output Example:

After 25 minutes of controlled deployment, you should have:

  • Manual approval obtained (2 reviewers)
  • State locked during apply (no concurrent modifications)
  • Terraform apply completed successfully
  • Health checks passed post-deployment
  • Monitoring alerts configured
  • Rollback procedure documented
  • Deployment logged and audit trail created

Deployment Summary:

Terraform Apply Summary
────────────────────────────────────────────
Started: 2025-01-07 14:30:00 UTC
Completed: 2025-01-07 14:42:15 UTC
Duration: 12m 15s

Resources:
  + 3 created
  ~ 2 modified
  - 1 destroyed

Total: 6 changes applied

Health Checks:
  ✅ Application endpoint: 200 OK
  ✅ Database connectivity: Success
  ✅ Cache cluster: Healthy
  ✅ CloudWatch alarms: Normal

Post-Deployment Validation:
  ✅ No new error logs
  ✅ Latency within acceptable range
  ✅ Cost estimate matches actual

Status: ✅ Deployment Successful

Time Investment: 15-30 minutes (including approval wait time)

Next Step: Proceed to Stage 7 for post-deployment monitoring and drift detection.


Stage 7: Post-Deployment Monitoring & Drift Detection (Continuous)

The final stage is ongoing—continuously monitoring infrastructure for drift, compliance violations, and cost overruns. According to HashiCorp's drift detection guide, over 80% of organizations experience configuration drift, making automated detection essential.

Step 7.1: Understanding Infrastructure Drift

What is Drift?

Infrastructure drift occurs when the actual state of cloud resources deviates from the desired state defined in IaC code. As explained in Spacelift's drift detection guide, drift undermines the reliability and predictability that IaC promises.

Common Causes:

  1. Manual Console Changes ("ClickOps")

    Engineer logs into AWS Console → Modifies security group → Drift created
    
  2. Emergency Hotfixes

    Production incident → Quick fix bypasses Terraform → Drift not backported
    
  3. Overlapping Automation

    Auto-scaling group scales instances → Terraform expects fixed count → Drift
    
  4. Out-of-Band Changes

    AWS managed services update configurations → Terraform unaware → Drift
    

Why Drift is Dangerous:

  • Security vulnerabilities: Drift can undo security configurations (firewall rules, encryption settings)
  • Compliance violations: Deviations from IaC definitions violate audit requirements
  • Unpredictable behavior: Actual infrastructure state unknown, debugging difficult
  • Deployment failures: Future Terraform applies may fail or cause unexpected changes
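At its core, drift detection is a diff between the desired attributes recorded in state and the attributes the provider reports back. A simplified illustration of that comparison (real Terraform walks full resource schemas, not flat dicts; the function name is ours):

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired, actual)} for every mismatched attribute."""
    keys = desired.keys() | actual.keys()
    return {
        k: (desired.get(k), actual.get(k))
        for k in keys
        if desired.get(k) != actual.get(k)
    }

desired = {"ingress_ports": [80, 443], "encrypted": True}
actual  = {"ingress_ports": [22, 80, 443], "encrypted": True}  # someone opened SSH

drift = find_drift(desired, actual)
assert drift == {"ingress_ports": ([80, 443], [22, 80, 443])}
```

An attribute present on only one side (a resource created or deleted out of band) shows up as a mismatch against `None`, which is exactly the "objects have changed outside of Terraform" case below.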

Step 7.2: Automated Drift Detection

Terraform Refresh and Drift Check:

# Detect drift by comparing state with actual resources
terraform plan -refresh-only

# Output shows resources that drifted
terraform plan -detailed-exitcode
# Exit code 2 = drift detected

Example Drift Detection Output:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # aws_security_group.web has changed
  ~ resource "aws_security_group" "web" {
        id                     = "sg-0a1b2c3d4e5f6a7b8"
        name                   = "web-security-group"
      ~ ingress                = [
          + {
              + cidr_blocks      = ["0.0.0.0/0"]  # ⚠️ DRIFT DETECTED
              + from_port        = 22
              + to_port          = 22
              + protocol         = "tcp"
              + description      = "SSH from anywhere (UNAUTHORIZED)"
            },
            # (2 unchanged elements hidden)
        ]
    }

Unless you have made equivalent changes to your configuration, or ignored the relevant
attributes using ignore_changes, the following plan may include actions to undo or
respond to these changes.

Automated Drift Detection with Spacelift:

# .spacelift/config.yml
version: 1

stack:
  name: production-infrastructure
  drift_detection:
    enabled: true
    schedule: "0 */6 * * *"  # Every 6 hours
    reconcile: false  # Alert only, don't auto-fix

  notifications:
    drift_detected:
      - slack: "#infrastructure-alerts"
      - email: "[email protected]"

Drift Detection with Terraform Cloud:

# Terraform Cloud workspace settings
resource "tfe_workspace" "production" {
  name         = "production-infrastructure"
  organization = "your-org"

  # Enable health assessments (continuous drift detection) for this workspace
  assessments_enabled = true
}

Custom Drift Detection Script:

#!/bin/bash
# drift-detection.sh

# Note: no `set -e` here — `terraform plan -detailed-exitcode` deliberately
# returns 2 when drift exists, and we need to inspect that code ourselves
set -u

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Run Terraform refresh
terraform init -backend=true
terraform plan -refresh-only -detailed-exitcode -out=drift-plan

EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "⚠️ DRIFT DETECTED!"

  # Generate drift report
  terraform show drift-plan > drift-report.txt

  # Send Slack notification (jq -Rs safely JSON-encodes the report text,
  # which may contain quotes and newlines)
  PAYLOAD=$(jq -Rs '{text: ("🚨 Infrastructure Drift Detected\n" + .)}' drift-report.txt)
  curl -X POST -H 'Content-type: application/json' \
    --data "$PAYLOAD" \
    "$SLACK_WEBHOOK"

  # Optional: Create JIRA ticket, send email, etc.
  exit 1
elif [ $EXIT_CODE -eq 0 ]; then
  echo "✅ No drift detected"
  exit 0
else
  echo "❌ Terraform error"
  exit 1
fi

Scheduled Drift Detection (cron):

# /etc/cron.d/terraform-drift-detection
# Run drift detection every 6 hours
0 */6 * * * cd /opt/terraform/production && /opt/scripts/drift-detection.sh >> /var/log/drift-detection.log 2>&1

Step 7.3: Drift Remediation Strategies

Strategy 1: Import Drift into Terraform

# Manual changes were made to security group
# Import actual state to make Terraform aware

# 1. Identify drifted resource
terraform plan -refresh-only

# 2. Update Terraform code to match actual state
# Edit main.tf to include the new ingress rule

# 3. Verify alignment
terraform plan  # Should show "No changes"

Strategy 2: Revert Drift (Restore Desired State)

# Restore infrastructure to match Terraform code

# 1. Review drift
terraform plan -refresh-only

# 2. Apply Terraform state (overwrites manual changes)
terraform apply -auto-approve

# Result: Manual changes reverted, IaC definition enforced

Strategy 3: Ignore Specific Attributes

# For resources managed partly by Terraform, partly by AWS
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 3

  # Ignore changes to desired_capacity (managed by auto-scaling policies)
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

Strategy 4: Prevent Drift with Resource Locks

# Prevent resource deletion
resource "aws_s3_bucket" "critical_data" {
  bucket = "critical-company-data"

  lifecycle {
    prevent_destroy = true  # Terraform refuses to destroy
  }
}

AWS Service Control Policies (SCPs):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": [
        "ec2:DeleteSecurityGroup",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::ACCOUNT:role/TerraformExecutionRole"
        }
      }
    }
  ]
}
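The effect of that SCP is "deny these mutating actions for everyone except the Terraform execution role." A toy evaluator showing the deny-with-condition logic (vastly simplified compared to real IAM policy evaluation; names mirror the JSON above):

```python
TF_ROLE = "arn:aws:iam::ACCOUNT:role/TerraformExecutionRole"
DENIED_ACTIONS = {
    "ec2:DeleteSecurityGroup",
    "ec2:RevokeSecurityGroupIngress",
    "ec2:RevokeSecurityGroupEgress",
}

def scp_allows(action: str, principal_arn: str) -> bool:
    """Deny listed actions unless the caller is the Terraform role."""
    if action in DENIED_ACTIONS and principal_arn != TF_ROLE:
        return False
    # All other actions fall through to normal IAM evaluation.
    return True

assert scp_allows("ec2:DeleteSecurityGroup", "arn:aws:iam::ACCOUNT:role/Developer") is False
assert scp_allows("ec2:DeleteSecurityGroup", TF_ROLE) is True
assert scp_allows("ec2:DescribeInstances", "arn:aws:iam::ACCOUNT:role/Developer") is True
```

The key design point is that the SCP denies by identity rather than by action alone, so automation keeps working while console-driven "ClickOps" changes to security groups are blocked at the organization level.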

Step 7.4: Compliance and Cost Monitoring

Continuous Compliance Scanning:

# Scheduled compliance scan (cron)
0 0 * * * cd /opt/terraform && checkov -d . --framework terraform --compact --quiet | tee compliance-$(date +\%Y\%m\%d).log

Cost Monitoring with Infracost:

#!/bin/bash
# cost-monitoring.sh

# Generate current cost estimate
infracost breakdown --path . --format json > current-cost.json

# Compare with baseline
CURRENT_COST=$(jq -r '.totalMonthlyCost' current-cost.json)
BASELINE_COST=500  # $500/month baseline

if (( $(echo "$CURRENT_COST > $BASELINE_COST * 1.2" | bc -l) )); then
  echo "⚠️ Cost increased by >20%: \$$CURRENT_COST (baseline: \$$BASELINE_COST)"
  # Send alert
fi
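The same 20%-over-baseline check can be expressed in Python, which avoids bc's floating-point string quirks (thresholds and the function name are illustrative):

```python
def cost_alert(current: float, baseline: float, tolerance: float = 0.20) -> bool:
    """True if the current monthly cost exceeds baseline by more than `tolerance`."""
    return current > baseline * (1 + tolerance)

assert cost_alert(610.0, 500.0) is True    # +22% -> alert
assert cost_alert(590.0, 500.0) is False   # +18% -> within tolerance
assert cost_alert(450.0, 500.0) is False   # under baseline, no alert
```

In practice the `current` value would come from parsing Infracost's JSON output, as in the shell script above.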

CloudWatch Cost Anomaly Detection:

resource "aws_ce_anomaly_monitor" "infrastructure_costs" {
  name              = "Infrastructure Cost Monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "cost_alerts" {
  name      = "Cost Anomaly Alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.infrastructure_costs.arn,
  ]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert if anomaly >$100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

Step 7.5: Continuous Monitoring Dashboard

Terraform State Metrics:

# State file statistics
terraform state list | wc -l  # Total resources
terraform state show aws_instance.web  # Resource details

# Track resource count over time
echo "$(date +%Y-%m-%d),$(terraform state list | wc -l)" >> resource-count.csv

Grafana Dashboard for IaC Metrics:

# grafana-dashboard.json (excerpt)
{
  "dashboard": {
    "title": "Terraform Infrastructure Monitoring",
    "panels": [
      {
        "title": "Drift Detection Status",
        "targets": [
          {
            "expr": "terraform_drift_detected{environment=\"production\"}",
            "legendFormat": "Drift Detected"
          }
        ]
      },
      {
        "title": "Monthly Infrastructure Cost",
        "targets": [
          {
            "expr": "infracost_monthly_total{environment=\"production\"}",
            "legendFormat": "Total Cost"
          }
        ]
      },
      {
        "title": "Compliance Score (Checkov)",
        "targets": [
          {
            "expr": "checkov_compliance_score{framework=\"terraform\"}",
            "legendFormat": "Compliance %"
          }
        ]
      }
    ]
  }
}

Step 7.6: Resource Lifecycle Management

Tagging Strategy for Tracking:

# Consistent tagging across all resources
locals {
  common_tags = {
    Environment     = var.environment
    ManagedBy       = "Terraform"
    Owner           = "DevOps Team"
    CostCenter      = "Engineering"
    CreatedDate     = timestamp()  # caution: re-evaluates every run; pair with ignore_changes on this tag
    TerraformWorkspace = terraform.workspace
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  tags = merge(
    local.common_tags,
    {
      Name = "web-server-${var.environment}"
      Role = "WebServer"
    }
  )
}
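Terraform's merge() gives later maps precedence, so resource-specific tags can override the shared defaults. The same semantics in Python, where later keys win in a dict union:

```python
common_tags = {"Environment": "production", "ManagedBy": "Terraform", "Name": "default"}
resource_tags = {"Name": "web-server-production", "Role": "WebServer"}

# Later dict wins on key conflicts, mirroring Terraform's merge(common, specific).
merged = {**common_tags, **resource_tags}

assert merged["Name"] == "web-server-production"   # resource-specific override applied
assert merged["ManagedBy"] == "Terraform"          # common tag preserved
assert merged["Role"] == "WebServer"               # resource-only tag added
```

Putting `local.common_tags` first in the merge is deliberate: it guarantees organization-wide tags are always present while still letting each resource customize its Name and Role.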

Resource Expiration Tracking:

# Tag resources with expiration dates
resource "aws_instance" "test_environment" {
  ami           = var.ami_id
  instance_type = "t3.small"

  tags = {
    Name        = "test-instance"
    Environment = "test"
    ExpiresOn   = timeadd(timestamp(), "168h")  # 7 days out; note this re-computes on every apply
  }
}

Automated Cleanup Lambda:

# lambda-resource-cleanup.py
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find expired resources
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag-key', 'Values': ['ExpiresOn']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            expires_on = next((tag['Value'] for tag in instance['Tags'] if tag['Key'] == 'ExpiresOn'), None)

            # Use an aware "now" — comparing against naive datetime.now() raises TypeError
            if expires_on and datetime.fromisoformat(expires_on.replace('Z', '+00:00')) < datetime.now(timezone.utc):
                print(f"Terminating expired instance: {instance['InstanceId']}")
                ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
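The expiry logic in that Lambda is easier to trust if factored into a pure function that can be unit-tested without AWS credentials. A sketch operating on the same describe_instances response shape (the helper name is ours):

```python
from datetime import datetime, timezone

def expired_instance_ids(reservations, now=None):
    """Return IDs of instances whose ExpiresOn tag is in the past."""
    now = now or datetime.now(timezone.utc)
    expired = []
    for reservation in reservations:
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            raw = tags.get("ExpiresOn")
            if raw and datetime.fromisoformat(raw.replace("Z", "+00:00")) < now:
                expired.append(inst["InstanceId"])
    return expired

# Minimal fixture mimicking a describe_instances response
reservations = [{"Instances": [
    {"InstanceId": "i-old", "Tags": [{"Key": "ExpiresOn", "Value": "2025-01-01T00:00:00Z"}]},
    {"InstanceId": "i-new", "Tags": [{"Key": "ExpiresOn", "Value": "2099-01-01T00:00:00Z"}]},
]}]
now = datetime(2025, 1, 7, tzinfo=timezone.utc)
assert expired_instance_ids(reservations, now=now) == ["i-old"]
```

The Lambda handler then reduces to fetching the reservations via boto3 and terminating whatever this function returns, keeping the AWS calls and the business logic cleanly separated.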

Stage 7 Output Example:

After continuous monitoring, you should have:

  • Automated drift detection running every 6 hours
  • Drift alerts sent to Slack/email when detected
  • Compliance scans passing (Checkov, InSpec)
  • Cost monitoring active with anomaly detection
  • Resource lifecycle tracking with expiration tags
  • Dashboards showing infrastructure health metrics
  • Audit trail maintained for all changes

Monitoring Summary:

Infrastructure Health Dashboard
────────────────────────────────────────────
Drift Status:          ✅ No drift detected (last check: 2h ago)
Compliance Score:      95% (CIS AWS Foundations)
Monthly Cost:          $487 (-3% vs. baseline $502)
Resource Count:        127 resources managed
Security Violations:   0 HIGH, 2 MEDIUM
Next Drift Check:      4 hours

Recent Activity:
  • 2025-01-07 12:30 - Terraform apply completed (3 changes)
  • 2025-01-07 10:15 - Drift detection passed
  • 2025-01-07 08:00 - Compliance scan: 95% pass rate
  • 2025-01-07 06:45 - Cost estimate: $487/month

Alerts (Last 7 days):
  ⚠️  2025-01-05 - S3 versioning disabled on logs bucket (remediated)
  ⚠️  2025-01-03 - Security group modified outside Terraform (reverted)

Key Deliverable: Drift detection alerts, compliance scorecard, cost optimization report

Time Investment: Continuous (automated checks every 6 hours)


Conclusion

This comprehensive 7-stage workflow transforms infrastructure-as-code from a deployment tool into a complete security and change management framework. By integrating pre-commit validation, automated security scanning, policy-as-code enforcement, cost analysis, testing, controlled deployment, and continuous monitoring, teams can deploy infrastructure changes with confidence.

Workflow Recap

Stage 1: Pre-Commit Validation (5-10 min) Catch syntax errors, formatting issues, and basic security problems before code reaches version control using terraform validate, TFLint, and git-secrets.

Stage 2: Security Scanning (15-30 min) Deep static analysis with tfsec, Checkov, and Terrascan to detect vulnerabilities, compliance violations, and misconfigurations in CI/CD pipelines.

Stage 3: Policy-as-Code (10-20 min) Enforce organizational standards and compliance requirements using Sentinel and OPA, blocking non-compliant infrastructure before deployment.

Stage 4: Plan Review & Cost Analysis (20-40 min) Human review combined with Infracost cost estimation to understand changes and prevent expensive mistakes.

Stage 5: Automated Testing (30-60 min) Integration testing with Terratest and compliance validation with InSpec to ensure infrastructure works as intended.

Stage 6: Controlled Deployment (15-30 min) Approval gates, state locking, health checks, and rollback procedures to minimize deployment risk.

Stage 7: Continuous Monitoring (Ongoing) Drift detection, compliance scanning, and cost monitoring to maintain security and prevent configuration drift.

Key Achievements

By implementing this workflow, organizations achieve:

  • Reduced deployment risks: dramatically fewer security incidents from IaC misconfigurations (teams commonly report reductions approaching 90%)
  • Improved compliance: Continuous validation against CIS, PCI-DSS, HIPAA, SOC 2
  • Cost visibility: Prevent surprise cloud bills with pre-deployment cost estimates
  • Faster incident response: Drift detection identifies unauthorized changes within hours
  • Audit trail: Complete change history for compliance and security audits
  • Developer productivity: Shift-left security catches issues before code review

Best Practices Summary

1. Always Review Plans Before Apply Never skip terraform plan review, especially for production. Use the Terraform Plan Explainer to visualize security risks and blast radius.

2. Use Remote State with Locking S3 + DynamoDB or Terraform Cloud backends prevent state corruption from concurrent modifications and provide state versioning for rollbacks.

3. Implement Policy-as-Code Encode organizational standards in Sentinel or OPA policies. Start with advisory enforcement, graduate to hard-mandatory after validation.

4. Automate Security Scanning in CI/CD Integrate tfsec, Checkov, and Trivy into pull request workflows. Block merges on HIGH/CRITICAL findings.

5. Maintain Comprehensive Rollback Procedures Backup state before risky changes, use blue-green deployments for critical services, and document rollback commands.

6. Monitor for Drift Continuously Schedule drift detection every 6 hours. Alert immediately on unauthorized changes. Enforce remediation via Terraform apply or manual review.

7. Tag Everything for Lifecycle Management Consistent tagging (Environment, Owner, CostCenter, ManagedBy, CreatedDate) enables cost tracking, compliance auditing, and automated cleanup.

Advanced Topics

Multi-Cloud Terraform Extend this workflow to AWS, Azure, GCP, and Kubernetes using provider-specific security scanners and unified OPA policies.

Terragrunt for DRY Configuration Reduce code duplication across environments using Terragrunt for remote state management and variable injection.

Atlantis for PR Automation Deploy Atlantis to automatically run terraform plan on pull requests, post results as comments, and enable self-service infrastructure changes.

GitOps with Terraform Implement GitOps workflows where Git commits trigger automated Terraform deployments via ArgoCD or Flux for Kubernetes-based infrastructure.

Secret Management Integration Integrate HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault to eliminate hardcoded credentials entirely.


About This Guide

This comprehensive workflow guide is based on industry best practices from leading DevOps and cloud security organizations including HashiCorp, Spacelift, Bridgecrew, and Aqua Security. All tools referenced are industry-standard solutions used by Fortune 500 companies and startups alike.

The security scanning tools (tfsec, Checkov, Trivy, Terrascan) are open-source and free, while policy frameworks (Sentinel, OPA) and cost tools (Infracost) offer free tiers suitable for most organizations. InventiveHQ's Terraform Plan Explainer provides visual security analysis with 100% client-side processing—no data leaves your browser.

Whether you're deploying your first Terraform configuration or managing multi-region, multi-cloud infrastructure at scale, this workflow provides the foundation for secure, compliant, and cost-effective infrastructure-as-code.


Need Expert IT & Security Guidance?

Our team is ready to help protect and optimize your business technology infrastructure.