title: 'Infrastructure-as-Code Security & Change Management: A Complete 7-Stage Workflow for Terraform'
date: '2025-01-07'
excerpt: 'Master the complete IaC security workflow from pre-commit validation through production deployment. Learn how to scan for vulnerabilities, enforce policy-as-code, detect drift, and optimize costs using Terraform, tfsec, Checkov, Sentinel, OPA, and Infracost.'
author: 'InventiveHQ DevOps Team'
category: 'Developer'
tags:
- Infrastructure as Code
- Terraform
- DevSecOps
- Cloud Security
- Policy as Code
- Drift Detection
- Cost Optimization
readingTime: 19
featured: true
heroImage: "https://images.unsplash.com/photo-1451187580459-43490279c0fa?w=1200&h=630&fit=crop"
Introduction
Infrastructure-as-Code (IaC) has revolutionized how we provision and manage cloud resources. Instead of clicking through console interfaces, teams declare their entire infrastructure in version-controlled code—enabling reproducibility, collaboration, and automation. But this power comes with responsibility: a misconfigured Terraform file can expose databases, leak credentials, or rack up thousands in cloud costs before anyone notices.
According to the 2025 State of Cloud Security, over 60% of cloud security incidents originate from misconfigured infrastructure, and the median time to detect these issues is 11 days. That's 11 days of potential exposure, compliance violations, and runaway costs.
This guide presents a complete 7-stage IaC security workflow used by DevOps teams, platform engineers, and SREs to catch problems before they reach production. We'll cover:
- Pre-Commit Validation - Catch syntax errors and basic security issues at the developer's workstation
- Security Scanning & Linting - Deep static analysis for vulnerabilities and compliance violations
- Policy-as-Code Enforcement - Automated guardrails using Sentinel and Open Policy Agent (OPA)
- Plan Review & Cost Analysis - Human review with cost impact visibility via Infracost
- Automated Testing - Integration tests and compliance validation
- Controlled Deployment - Safe rollout with approval gates and blast radius controls
- Post-Deployment Monitoring - Drift detection, compliance tracking, and cost optimization
Why This Workflow Matters
Infrastructure drift—when your actual cloud resources deviate from your IaC definitions—occurs in over 80% of organizations according to HashiCorp research. Manual console changes ("ClickOps"), emergency hotfixes, and overlapping automation all contribute to drift, creating security vulnerabilities and compliance gaps.
As noted in Spacelift's Terraform security guide, the stakes are high:
- Security risks: Misconfigured S3 buckets, overly permissive security groups, unencrypted databases
- Compliance violations: HIPAA, PCI-DSS, SOC 2, and GDPR requirements violated by IaC errors
- Cost overruns: Oversized instances, unused resources, inefficient architectures
- Operational failures: Configuration drift leading to unpredictable behavior and outages
This workflow treats infrastructure code with the same rigor as application code—combining automated security scanning, policy enforcement, and continuous monitoring to create a defense-in-depth strategy.
Stage 1: Pre-Commit Validation (5-10 minutes)
The first line of defense happens before code even reaches version control. Pre-commit validation catches syntax errors, formatting issues, and obvious security problems at the developer's workstation—providing instant feedback and preventing broken code from entering the pipeline.
Step 1.1: Terraform Native Validation
Why: Terraform's built-in validation catches syntax errors, invalid resource configurations, and type mismatches before any external tools run.
Commands:
# Format code to HashiCorp standards (2-space indentation)
terraform fmt -recursive
# Validate syntax and configuration
terraform validate
# Initialize modules and providers (required before validation)
terraform init -backend=false
Common Issues Detected:
- Invalid HCL syntax (missing braces, commas, quotes)
- Undeclared variables or outputs
- Invalid resource attribute references
- Type mismatches (string vs. number vs. list)
- Deprecated resource syntax
Example Error:
$ terraform validate
Error: Unsupported argument
on main.tf line 12, in resource "aws_instance" "web":
12: ami_id = var.ami_id
An argument named "ami_id" is not expected here. Did you mean "ami"?
Best Practice: Run terraform fmt before committing to ensure consistent code style across your team. Many teams enforce this with pre-commit hooks.
Step 1.2: Pre-Commit Hook Setup
Why: Automated pre-commit hooks prevent developers from committing code that fails basic validation, saving CI/CD pipeline time and reducing feedback cycles.
Installation:
# Install pre-commit framework
pip install pre-commit
# Create .pre-commit-config.yaml in your repository
cat > .pre-commit-config.yaml <<EOF
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.94.1
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
      - id: terraform_tfsec
        args:
          - --args=--minimum-severity=HIGH
EOF
# Install hooks
pre-commit install
Hook Execution Flow:
Developer commits code
↓
terraform_fmt → Auto-format code
↓
terraform_validate → Check syntax
↓
terraform_tflint → Linting rules
↓
terraform_tfsec → Security scan
↓
Commit succeeds or fails
Example Output:
$ git commit -m "Add RDS database"
terraform_fmt...................................Passed
terraform_validate..............................Passed
terraform_tflint................................Passed
terraform_tfsec.................................Failed
- Hook failed with exit code 1
HIGH: Resource 'aws_db_instance.main' has encryption disabled
Line: 45
Severity: HIGH
[Commit blocked - fix security issues before committing]
Step 1.3: TFLint - Terraform Linting
Why: TFLint enforces best practices, detects deprecated syntax, and validates provider-specific configurations that terraform validate misses.
Installation & Configuration:
# Install TFLint
brew install tflint # macOS
# or
curl -s https://raw.githubusercontent.com/terraform-linters/tflint/master/install_linux.sh | bash
# Create .tflint.hcl configuration
cat > .tflint.hcl <<EOF
plugin "aws" {
  enabled = true
  version = "0.31.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_deprecated_syntax" {
  enabled = true
}

rule "terraform_unused_declarations" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
}
EOF
# Run TFLint
tflint --init
tflint
Example Rules Enforced:
- AWS instance type validation (detecting non-existent instance types)
- Security group naming conventions
- Unused variables and outputs
- Deprecated resource syntax
- Invalid AMI IDs or availability zones
Example Violation:
resource "aws_instance" "web" {
  instance_type = "t2.mega" # ❌ Invalid instance type
  ami           = "ami-12345"
}
TFLint Output:
Error: "t2.mega" is an invalid value as instance_type (aws_instance_invalid_type)
on main.tf line 3:
3: instance_type = "t2.mega"
Valid instance types: t2.micro, t2.small, t2.medium, t2.large...
Step 1.4: Secrets Detection
Why: Hardcoded secrets in IaC files are a critical security vulnerability. According to GitGuardian's 2024 report, over 10 million secrets are leaked to public GitHub repositories annually.
Tool: git-secrets
# Install git-secrets
brew install git-secrets # macOS
# Initialize in repository
git secrets --install
git secrets --register-aws
# Scan current files
git secrets --scan
# Add custom patterns
git secrets --add '[A-Za-z0-9/+=]{40}' # AWS Secret Access Key pattern
Patterns to Detect:
- AWS Access Keys: AKIA[0-9A-Z]{16}
- AWS Secret Keys: [A-Za-z0-9/+=]{40}
- API Keys: high-entropy strings
- Private Keys: -----BEGIN PRIVATE KEY-----
- Database passwords in plaintext
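To illustrate how these patterns work in practice, here is a small standalone scanner — a hypothetical sketch, not a replacement for git-secrets — that applies the AWS access key regex, the private key header check, and a Shannon-entropy heuristic for candidate API keys:

```python
import math
import re

# Regexes mirroring the git-secrets patterns listed above
AWS_ACCESS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")
PRIVATE_KEY_HEADER = re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----")

def shannon_entropy(s: str) -> float:
    """Bits per character; high-entropy strings often indicate secrets."""
    if not s:
        return 0.0
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def scan_line(line: str) -> list[str]:
    """Return a list of findings for one line of a .tf file."""
    findings = []
    if AWS_ACCESS_KEY.search(line):
        findings.append("possible AWS access key ID")
    if PRIVATE_KEY_HEADER.search(line):
        findings.append("private key material")
    # Flag long quoted tokens with high entropy (candidate API keys);
    # the 4.5 bits/char threshold is an illustrative choice
    for token in re.findall(r'"([A-Za-z0-9/+=]{32,})"', line):
        if shannon_entropy(token) > 4.5:
            findings.append("high-entropy string (candidate secret)")
    return findings

print(scan_line('access_key = "AKIAIOSFODNN7EXAMPLE"'))
# → ['possible AWS access key ID']
```

A real hook would run this over every staged file and fail the commit on any finding, which is exactly what git-secrets automates.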
Example Detection:
# ❌ WRONG - Hardcoded credentials
resource "aws_db_instance" "main" {
  username = "admin"
  password = "SuperSecret123!" # ⚠️ git-secrets blocks this
}

# ✅ CORRECT - Use secrets manager
resource "aws_db_instance" "main" {
  username = var.db_username
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Alternative: HashiCorp Vault Integration
provider "vault" {
  address = "https://vault.example.com"
}

data "vault_generic_secret" "db_credentials" {
  path = "secret/database/credentials"
}

resource "aws_db_instance" "main" {
  username = data.vault_generic_secret.db_credentials.data["username"]
  password = data.vault_generic_secret.db_credentials.data["password"]
}
Step 1.5: Module Validation
Why: Terraform modules must be validated independently to catch issues before they're consumed by root configurations.
Best Practices:
# Navigate to module directory
cd modules/vpc
# Validate module
terraform init -backend=false
terraform validate
# Test with example values
terraform plan -var-file=examples/example.tfvars
Module Input Validation:
variable "vpc_cidr" {
  type        = string
  description = "VPC CIDR block"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr must be a valid CIDR block."
  }
}

variable "environment" {
  type        = string
  description = "Environment name"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}
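The same guardrails can be prototyped outside Terraform. Here is a hypothetical Python pre-check using the stdlib `ipaddress` module that mirrors the two `validation` blocks above (note that `strict=True` rejects CIDRs with host bits set, which is a stricter interpretation than Terraform's `can(cidrhost(...))`):

```python
import ipaddress

ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}

def validate_vpc_cidr(vpc_cidr: str) -> list[str]:
    """Mirror of the vpc_cidr validation block: must be a valid CIDR."""
    try:
        ipaddress.ip_network(vpc_cidr, strict=True)
        return []
    except ValueError:
        return ["The vpc_cidr must be a valid CIDR block."]

def validate_environment(environment: str) -> list[str]:
    """Mirror of the environment validation block."""
    if environment not in ALLOWED_ENVIRONMENTS:
        return ["Environment must be dev, staging, or prod."]
    return []

errors = validate_vpc_cidr("10.0.0.0/16") + validate_environment("qa")
print(errors)  # → ['Environment must be dev, staging, or prod.']
```

Pre-checks like this can run in wrapper scripts or pipeline stages before `terraform plan`, catching bad inputs even earlier than Terraform's own validation.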
Stage 1 Output Example:
After 10 minutes of pre-commit validation, you should have:
- All code formatted to HashiCorp standards (terraform fmt)
- Syntax validated (terraform validate)
- Linting rules passed (TFLint)
- No secrets detected (git-secrets)
- Pre-commit hooks configured and passing
- Module validations successful
Time Investment: 5-10 minutes
Next Step: Code passes pre-commit checks and is pushed to version control, triggering Stage 2 (CI/CD security scanning).
Stage 2: Security Scanning & Linting (15-30 minutes)
Once code reaches the CI/CD pipeline, automated security scanning tools perform deep static analysis to detect vulnerabilities, compliance violations, and misconfigurations. This stage integrates multiple scanning tools to create comprehensive coverage.
Step 2.1: tfsec - Terraform Security Scanner
Why: tfsec (now part of Trivy) performs static analysis specifically tailored for Terraform, detecting security issues like publicly exposed resources, unencrypted storage, and overly permissive IAM policies.
Installation & Usage:
# Install tfsec (legacy standalone version)
brew install tfsec # macOS
# Or use Trivy (recommended - includes tfsec checks)
brew install trivy
trivy config .
# Run tfsec scan
tfsec . --format json --out tfsec-results.json
# Run with minimum severity threshold
tfsec . --minimum-severity HIGH
# Exclude specific checks
tfsec . --exclude aws-s3-enable-versioning
Example Configuration File (.tfsec/config.yml):
severity_overrides:
  aws-s3-enable-bucket-logging: WARNING
  aws-ec2-enforce-http-token-imds: HIGH
exclude:
  - aws-s3-enable-versioning # Versioning not required for ephemeral buckets
Common Vulnerabilities Detected:
| Check ID | Description | Severity | Example |
|---|---|---|---|
| aws-s3-block-public-acls | S3 bucket allows public ACLs | HIGH | Publicly readable S3 bucket |
| aws-ec2-no-public-ip | EC2 instance has public IP in private subnet | MEDIUM | Unintended internet exposure |
| aws-rds-encrypt-instance-storage | RDS instance encryption disabled | HIGH | Unencrypted database |
| aws-ec2-enforce-http-token-imds | IMDSv1 enabled (SSRF vulnerable) | HIGH | Metadata service v1 |
| aws-iam-no-policy-wildcards | IAM policy uses wildcard actions | CRITICAL | Overly permissive IAM |
Example Detection:
# ❌ VULNERABLE CODE
resource "aws_s3_bucket" "data" {
  bucket = "company-data-bucket"
  # tfsec: aws-s3-enable-bucket-encryption CRITICAL
  # Missing encryption configuration
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # tfsec: aws-ec2-no-public-ingress-ssh HIGH
  }
}
tfsec Output:
Result #1 HIGH: Resource 'aws_s3_bucket.data' does not have encryption enabled
────────────────────────────────────────────────────────────────────────────────
main.tf:3-5
────────────────────────────────────────────────────────────────────────────────
3 │ resource "aws_s3_bucket" "data" {
4 │ bucket = "company-data-bucket"
5 │ }
────────────────────────────────────────────────────────────────────────────────
ID: aws-s3-enable-bucket-encryption
Impact: Data stored in S3 is not encrypted at rest
Resolution: Enable server-side encryption with KMS
Result #2 HIGH: Security group allows SSH from 0.0.0.0/0
────────────────────────────────────────────────────────────────────────────────
main.tf:11-14
────────────────────────────────────────────────────────────────────────────────
11 │ from_port = 22
12 │ to_port = 22
13 │ protocol = "tcp"
14 │ cidr_blocks = ["0.0.0.0/0"]
────────────────────────────────────────────────────────────────────────────────
ID: aws-ec2-no-public-ingress-ssh
Impact: SSH access from the internet increases attack surface
Resolution: Restrict SSH access to specific IP ranges or VPN
2 potential problems detected.
Remediation:
# ✅ SECURE CODE
resource "aws_s3_bucket" "data" {
  bucket = "company-data-bucket"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.s3.arn
    }
  }
}

resource "aws_security_group" "web" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"] # Internal network only
  }
}
Step 2.2: Checkov - Multi-Cloud Policy Scanner
Why: Checkov supports 750+ built-in policies covering AWS, Azure, GCP, Kubernetes, and compliance frameworks (CIS, PCI-DSS, HIPAA, GDPR, SOC 2).
Installation & Usage:
# Install Checkov
pip install checkov
# Run full scan
checkov -d . --framework terraform
# Scan specific frameworks
checkov -d . --framework terraform --check CKV_AWS_*
# Output formats
checkov -d . --output json --output-file checkov-results.json
checkov -d . --output junitxml --output-file checkov-junit.xml # CI integration
checkov -d . --output sarif --output-file checkov-sarif.json # GitHub Security
# Skip specific checks
checkov -d . --skip-check CKV_AWS_20 # Skip S3 public read check
Scoping Scans and Custom Policies:
# Limit a run to AWS checks by ID prefix
checkov -d . --framework terraform --check CKV_AWS_* --compact
# Include custom organization policies alongside the built-in checks
# (built-in checks map to CIS, PCI-DSS, HIPAA, and other frameworks)
checkov -d . --framework terraform --external-checks-dir ./custom-policies
Example Checkov Policy:
# custom-policies/check_mandatory_tags.py
from checkov.terraform.checks.resource.base_resource_check import BaseResourceCheck
from checkov.common.models.enums import CheckResult, CheckCategories

class MandatoryTagsCheck(BaseResourceCheck):
    def __init__(self):
        name = "Ensure all resources have required tags"
        id = "CKV_AWS_CUSTOM_1"
        supported_resources = ['aws_instance', 'aws_s3_bucket', 'aws_db_instance']
        categories = [CheckCategories.CONVENTION]
        super().__init__(name=name, id=id, categories=categories, supported_resources=supported_resources)

    def scan_resource_conf(self, conf):
        required_tags = ['Environment', 'Owner', 'CostCenter']
        tags = conf.get('tags', [{}])[0]
        missing_tags = [tag for tag in required_tags if tag not in tags]
        if missing_tags:
            self.details = f"Missing required tags: {', '.join(missing_tags)}"
            return CheckResult.FAILED
        return CheckResult.PASSED

check = MandatoryTagsCheck()
Checkov Output:
Passed checks: 127, Failed checks: 8, Skipped checks: 3
Check: CKV_AWS_19: "Ensure all data stored in S3 is encrypted"
FAILED for resource: aws_s3_bucket.logs
File: /main.tf:45-50
Guide: https://docs.bridgecrew.io/docs/s3_14-data-encrypted-at-rest
45 | resource "aws_s3_bucket" "logs" {
46 | bucket = "application-logs"
47 | acl = "private"
48 | }
Check: CKV_AWS_CUSTOM_1: "Ensure all resources have required tags"
FAILED for resource: aws_instance.web
File: /main.tf:12-20
Details: Missing required tags: Owner, CostCenter
12 | resource "aws_instance" "web" {
13 | ami = var.ami_id
14 | instance_type = "t3.medium"
15 |
16 | tags = {
17 | Environment = "production"
18 | }
19 | }
Step 2.3: Terrascan - Policy-as-Code with Rego
Why: Terrascan uses OPA/Rego policies for customizable security and compliance checks, enabling organizations to encode their specific governance requirements.
Installation & Usage:
# Install Terrascan
brew install terrascan # macOS
# Scan with built-in policies
terrascan scan -t terraform
# Scan specific policy types
terrascan scan -t terraform -p aws
# Use custom policies
terrascan scan -t terraform --policy-path ./custom-policies
# Output formats
terrascan scan -t terraform -o json
terrascan scan -t terraform -o sarif > terrascan-sarif.json
Custom Rego Policy Example:
# custom-policies/require_vpc_flow_logs.rego
package accurics

# Flag VPCs with no flow log attached
{{.prefix}}{{.name}}[api.id] {
    api := input.aws_vpc[_]
    not has_flow_logs(api)
}

has_flow_logs(resource) {
    flow_logs := input.aws_flow_log[_]
    flow_logs.config.vpc_id == resource.id
}

# Flag VPCs whose flow logs do not capture ALL traffic
{{.prefix}}{{.name}}[api.id] {
    api := input.aws_vpc[_]
    flow_logs := input.aws_flow_log[_]
    flow_logs.config.vpc_id == api.id
    flow_logs.config.traffic_type != "ALL"
}
Step 2.4: CI/CD Pipeline Integration
GitHub Actions Example:
# .github/workflows/terraform-security.yml
name: Terraform Security Scan

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.tfvars'

jobs:
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Validate
        run: terraform validate

      - name: Run tfsec
        uses: aquasecurity/[email protected]
        with:
          soft_fail: false
          format: sarif
          sarif_file: tfsec.sarif

      - name: Upload tfsec SARIF
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: tfsec.sarif

      - name: Run Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          framework: terraform
          output_format: sarif
          output_file_path: checkov.sarif

      - name: Upload Checkov SARIF
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: checkov.sarif

      - name: Security Scan Summary
        if: always()
        run: |
          echo "## Security Scan Results" >> $GITHUB_STEP_SUMMARY
          echo "✅ tfsec scan completed" >> $GITHUB_STEP_SUMMARY
          echo "✅ Checkov scan completed" >> $GITHUB_STEP_SUMMARY
          echo "View detailed results in Security tab" >> $GITHUB_STEP_SUMMARY
GitLab CI Example:
# .gitlab-ci.yml
stages:
  - validate
  - security-scan

terraform-validate:
  stage: validate
  image: hashicorp/terraform:1.7
  script:
    - terraform fmt -check -recursive
    - terraform init -backend=false
    - terraform validate

tfsec-scan:
  stage: security-scan
  image: aquasec/tfsec:latest
  script:
    - tfsec . --format json --out tfsec-results.json
  artifacts:
    reports:
      sast: tfsec-results.json
    when: always

checkov-scan:
  stage: security-scan
  image: bridgecrew/checkov:latest
  script:
    - checkov -d . --framework terraform --output junitxml --output-file checkov-report.xml
  artifacts:
    reports:
      junit: checkov-report.xml
    when: always
Step 2.5: Trivy - Unified Security Scanner
Why: Trivy consolidates tfsec functionality with additional capabilities for container images, Kubernetes manifests, and dependency scanning.
Usage:
# Install Trivy
brew install trivy # macOS
# Scan Terraform configurations
trivy config .
# Scan with specific severity
trivy config --severity HIGH,CRITICAL .
# Output formats
trivy config --format json --output trivy-results.json .
trivy config --format sarif --output trivy-sarif.json .
Example Output:
main.tf (terraform)
═══════════════════════════════════════════════════════════════════════════════
Tests: 42 (SUCCESSES: 34, FAILURES: 8)
Failures: 8 (HIGH: 5, MEDIUM: 3, LOW: 0)
HIGH: S3 bucket does not have encryption enabled
════════════════════════════════════════════════════════════════════════════════
Unencrypted S3 buckets are vulnerable to data breaches
────────────────────────────────────────────────────────────────────────────────
main.tf:45-50
────────────────────────────────────────────────────────────────────────────────
45 resource "aws_s3_bucket" "data" {
46 bucket = "company-data-bucket"
47 }
────────────────────────────────────────────────────────────────────────────────
Stage 2 Output Example:
After 25 minutes of security scanning, you should have:
- tfsec/Trivy scan completed with vulnerability report
- Checkov compliance scan passed (CIS, PCI-DSS, HIPAA)
- Terrascan custom policy validation successful
- SARIF reports uploaded to GitHub/GitLab Security tabs
- No HIGH or CRITICAL vulnerabilities detected
- CI/CD pipeline gates passed
Decision Matrix:
All scans pass → Proceed to Stage 3 (Policy Enforcement)
HIGH/CRITICAL findings → Block merge, require remediation
MEDIUM findings → Warning, allow with review
Custom policy violations → Block based on organization rules
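The decision matrix above is straightforward to automate as a pipeline gate. Here is a minimal Python sketch that assumes tfsec's JSON report shape (a top-level `results` array with per-finding `severity` fields); the sample report is hypothetical, standing in for real `tfsec . --format json` output:

```python
BLOCKING = {"CRITICAL", "HIGH"}

def gate(report: dict) -> int:
    """Return a process exit code from a tfsec-style JSON report.

    Blocks the pipeline on HIGH/CRITICAL, warns on MEDIUM,
    mirroring the decision matrix above.
    """
    results = report.get("results") or []
    severities = [r.get("severity", "").upper() for r in results]
    blocking = [s for s in severities if s in BLOCKING]
    warnings = [s for s in severities if s == "MEDIUM"]
    if blocking:
        print(f"Blocked: {len(blocking)} HIGH/CRITICAL finding(s) - remediation required")
        return 1
    if warnings:
        print(f"Warning: {len(warnings)} MEDIUM finding(s) - allowed with review")
    return 0

# Hypothetical report standing in for `tfsec . --format json` output
sample = {"results": [{"rule_id": "aws-s3-enable-bucket-encryption", "severity": "HIGH"}]}
exit_code = gate(sample)  # exit_code == 1: merge blocked
```

In CI, the returned value would be passed to `sys.exit()` so the job fails exactly when the matrix says "block merge".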
Time Investment: 15-30 minutes (automated in CI/CD)
Next Step: Proceed to Stage 3 for policy-as-code enforcement with Sentinel or OPA.
Stage 3: Policy-as-Code Enforcement (10-20 minutes)
Policy-as-code frameworks like HashiCorp Sentinel and Open Policy Agent (OPA) provide programmable guardrails that enforce organizational standards, compliance requirements, and security policies. Unlike static analysis tools that detect problems, policy engines prevent non-compliant infrastructure from being created.
Step 3.1: Understanding Policy-as-Code
What is Policy-as-Code?
According to Spacelift's policy guide, policy-as-code ensures that infrastructure configurations adhere to defined compliance standards and security policies through automated evaluation. Policy engines evaluate Terraform plans before they're applied, blocking non-compliant changes.
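Conceptually, a policy engine is just a function from a rendered plan to a set of violations: an empty set means the apply may proceed. This toy Python sketch is purely illustrative of that idea — real engines express policies in Sentinel or Rego, and the plan dict here is hypothetical:

```python
# Toy policy engine: evaluate a plan (as a dict) against policy functions,
# each returning a violation message or None.
def require_rds_encryption(rc: dict):
    if rc["type"] == "aws_db_instance" and not rc["change"]["after"].get("storage_encrypted"):
        return f"{rc['address']}: RDS storage must be encrypted"

def evaluate(plan: dict, policies) -> list[str]:
    """Collect violations; an empty list means the apply may proceed."""
    violations = []
    for rc in plan.get("resource_changes", []):
        for policy in policies:
            msg = policy(rc)
            if msg:
                violations.append(msg)
    return violations

plan = {"resource_changes": [{
    "address": "aws_db_instance.main",
    "type": "aws_db_instance",
    "change": {"after": {"storage_encrypted": False}},
}]}
print(evaluate(plan, [require_rds_encryption]))
# → ['aws_db_instance.main: RDS storage must be encrypted']
```

Sentinel and OPA apply this same evaluate-plan-before-apply model, but with dedicated policy languages, enforcement levels, and audit trails.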
Sentinel vs. OPA:
| Feature | HashiCorp Sentinel | Open Policy Agent (OPA) |
|---|---|---|
| Integration | Terraform Cloud/Enterprise native | Runs anywhere (local, CI/CD, cloud) |
| Language | Sentinel (HSL) | Rego (declarative) |
| Enforcement Levels | Advisory, Soft-Mandatory, Hard-Mandatory | Configurable per policy |
| Use Case | Centralized control in TFC/TFE | Shift-left, multi-platform policies |
| Ecosystem | HashiCorp stack only | Kubernetes, Envoy, Terraform, etc. |
| Cost | Terraform Cloud/Enterprise required | Open-source, free |
Decision Criteria:
- Choose Sentinel when: Your organization runs Terraform Cloud/Enterprise and needs centralized, audit-logged policy enforcement at apply time
- Choose OPA when: You want shift-left checks, developer-local validation, and policies that span multiple platforms (Kubernetes, Terraform, etc.)
- Use Both when: OPA catches issues early (pre-commit, CI), Sentinel provides final enforcement gate in TFC/TFE
Step 3.2: HashiCorp Sentinel Policies
Sentinel Policy Structure:
# policies/restrict-instance-types.sentinel
import "tfplan/v2" as tfplan

# Allowed instance types for production
allowed_instance_types = ["t3.medium", "t3.large", "m5.large", "m5.xlarge"]

# Find all EC2 instances in the plan
ec2_instances = filter tfplan.resource_changes as _, rc {
    rc.type is "aws_instance" and
    rc.mode is "managed" and
    (rc.change.actions contains "create" or rc.change.actions contains "update")
}

# Validate instance types
instance_type_valid = rule {
    all ec2_instances as _, instance {
        instance.change.after.instance_type in allowed_instance_types
    }
}

# Main rule with enforcement level
main = rule when instance_type_valid is false {
    print("Instance type must be one of:", allowed_instance_types)
    false
}
Enforcement Levels:
# sentinel.hcl
policy "restrict-instance-types" {
  source            = "./policies/restrict-instance-types.sentinel"
  enforcement_level = "hard-mandatory" # Blocks apply
}

policy "require-tags" {
  source            = "./policies/require-tags.sentinel"
  enforcement_level = "soft-mandatory" # Can be overridden by admins
}

policy "cost-estimation" {
  source            = "./policies/cost-estimation.sentinel"
  enforcement_level = "advisory" # Warning only
}
Example: Enforce Encryption
# policies/enforce-encryption.sentinel
import "tfplan/v2" as tfplan

# Find all S3 buckets
s3_buckets = filter tfplan.resource_changes as _, rc {
    rc.type is "aws_s3_bucket" and
    rc.mode is "managed" and
    rc.change.actions contains "create"
}

# Find encryption configurations
s3_encryption = filter tfplan.resource_changes as _, rc {
    rc.type is "aws_s3_bucket_server_side_encryption_configuration" and
    rc.mode is "managed"
}

# Validate every bucket has encryption
buckets_encrypted = rule {
    all s3_buckets as address, bucket {
        any s3_encryption as _, enc {
            enc.change.after.bucket is bucket.change.after.id
        }
    }
}

main = rule {
    buckets_encrypted else true
}
Testing Sentinel Policies:
# Install Sentinel CLI
wget https://releases.hashicorp.com/sentinel/0.25.0/sentinel_0.25.0_linux_amd64.zip
unzip sentinel_0.25.0_linux_amd64.zip
sudo mv sentinel /usr/local/bin/
# Test policy
sentinel test policies/
# Apply policy in Terraform Cloud
terraform login
terraform plan # Triggers policy evaluation
Step 3.3: Open Policy Agent (OPA) Policies
OPA/Rego Policy Structure:
# policies/aws_instance_approved_types.rego
package terraform.policies.aws_instance

import rego.v1

# Allowed instance types
allowed_types := {"t3.medium", "t3.large", "m5.large", "m5.xlarge"}

# Find all EC2 instances being created
ec2_instances contains resource if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    resource.change.actions[_] == "create"
}

# Deny rule
deny contains msg if {
    some resource in ec2_instances
    instance_type := resource.change.after.instance_type
    not instance_type in allowed_types
    msg := sprintf(
        "EC2 instance '%s' uses disallowed type '%s'. Allowed: %v",
        [resource.address, instance_type, allowed_types],
    )
}
Running OPA with Terraform:
# Install OPA
brew install opa # macOS
# Generate Terraform plan in JSON
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary > tfplan.json
# Evaluate policies
opa eval --data policies/ --input tfplan.json "data.terraform.policies.deny"
# Using conftest (OPA wrapper for configuration testing)
brew install conftest
conftest test tfplan.json --policy policies/
Example: Multi-Policy Enforcement
# policies/security_policies.rego
package terraform.security

import rego.v1

# Policy 1: No public S3 buckets
deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_s3_bucket_public_access_block"
    not resource.change.after.block_public_acls
    msg := sprintf("S3 bucket '%s' allows public ACLs", [resource.address])
}

# Policy 2: RDS encryption required
deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_db_instance"
    resource.change.actions[_] == "create"
    not resource.change.after.storage_encrypted
    msg := sprintf("RDS instance '%s' must have encryption enabled", [resource.address])
}

# Policy 3: Security groups no 0.0.0.0/0 SSH
deny contains msg if {
    some resource in input.resource_changes
    resource.type == "aws_security_group"
    some rule in resource.change.after.ingress
    rule.from_port == 22
    "0.0.0.0/0" in rule.cidr_blocks
    msg := sprintf("Security group '%s' allows SSH from 0.0.0.0/0", [resource.address])
}

# Policy 4: Require tags
required_tags := {"Environment", "Owner", "CostCenter"}

deny contains msg if {
    some resource in input.resource_changes
    resource.type in {"aws_instance", "aws_s3_bucket", "aws_db_instance"}
    resource.change.actions[_] == "create"
    tags := object.keys(resource.change.after.tags)
    missing := required_tags - tags
    count(missing) > 0
    msg := sprintf(
        "Resource '%s' missing required tags: %v",
        [resource.address, missing],
    )
}
CI/CD Integration:
# .github/workflows/opa-policy.yml
name: OPA Policy Check

on:
  pull_request:
    paths:
      - '**.tf'

jobs:
  opa-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init -backend=false

      - name: Terraform Plan
        run: |
          terraform plan -out=tfplan.binary
          terraform show -json tfplan.binary > tfplan.json

      - name: Install Conftest
        run: |
          wget https://github.com/open-policy-agent/conftest/releases/download/v0.49.0/conftest_0.49.0_Linux_x86_64.tar.gz
          tar xzf conftest_0.49.0_Linux_x86_64.tar.gz
          sudo mv conftest /usr/local/bin/

      - name: Run OPA Policies
        run: conftest test tfplan.json --policy policies/ --output github

      - name: Policy Results
        if: failure()
        run: |
          echo "## ❌ Policy Violations Detected" >> $GITHUB_STEP_SUMMARY
          conftest test tfplan.json --policy policies/ >> $GITHUB_STEP_SUMMARY
Step 3.4: Combined Sentinel + OPA Strategy
Layered Policy Enforcement:
Developer Workstation
↓
OPA (Pre-commit) - Instant feedback, local validation
↓
CI/CD Pipeline
↓
OPA (Pull Request) - Automated checks, block merge
↓
Terraform Cloud/Enterprise
↓
Sentinel (Apply Time) - Final enforcement gate, audit log
↓
Production Deployment
Example Workflow:
# 1. Developer runs OPA locally before commit
terraform plan -out=tfplan.binary
terraform show -json tfplan.binary | conftest test - --policy policies/
# 2. CI runs OPA on pull request
# (Automated via GitHub Actions)
# 3. Terraform Cloud runs Sentinel before apply
terraform plan # Plan succeeds
terraform apply # Sentinel evaluates, blocks if violations detected
Step 3.5: Policy-as-Code Best Practices
1. Start with Advisory Policies
# Begin with warnings, not blocks
policy "experimental-cost-check" {
  enforcement_level = "advisory" # Warn only
}

# Graduate to enforcement after validation
policy "production-cost-check" {
  enforcement_level = "hard-mandatory" # Block
}
2. Version Control Policies
# Store policies in Git alongside infrastructure code
terraform-repo/
├── infrastructure/
│ ├── main.tf
│ └── variables.tf
├── policies/
│ ├── sentinel/
│ │ ├── restrict-instance-types.sentinel
│ │ └── enforce-encryption.sentinel
│ └── opa/
│ ├── security_policies.rego
│ └── compliance_policies.rego
└── tests/
└── policy-tests/
3. Test Policies with Unit Tests
Sentinel Test:
# test/restrict-instance-types/fail-disallowed-type.hcl
mock "tfplan/v2" {
  module {
    source = "mock-tfplan-fail.sentinel"
  }
}

test {
  rules = {
    main = false # Expect failure for t2.micro
  }
}
OPA Test:
# policies/security_policies_test.rego
package terraform.security

import rego.v1

test_deny_public_ssh if {
    expected := "Security group 'aws_security_group.web' allows SSH from 0.0.0.0/0"
    expected in deny with input as {
        "resource_changes": [{
            "type": "aws_security_group",
            "address": "aws_security_group.web",
            "change": {
                "after": {
                    "ingress": [{
                        "from_port": 22,
                        "to_port": 22,
                        "cidr_blocks": ["0.0.0.0/0"]
                    }]
                }
            }
        }]
    }
}
Run Tests:
# Test Sentinel policies
sentinel test
# Test OPA policies
opa test policies/ -v
Stage 3 Output Example:
After 15 minutes of policy evaluation, you should have:
- OPA policies evaluated locally and in CI/CD
- Sentinel policies enforced in Terraform Cloud (if applicable)
- All hard-mandatory policies passed
- Advisory policy warnings reviewed
- Policy violations documented with remediation guidance
- Audit trail generated for compliance
Policy Violation Example:
Policy Violation: restrict-instance-types (hard-mandatory)
────────────────────────────────────────────────────────────
Resource: aws_instance.web
Violation: Instance type 't2.micro' not in allowed list
Allowed Types: t3.medium, t3.large, m5.large, m5.xlarge
Remediation:
Change instance_type to an approved type:
resource "aws_instance" "web" {
instance_type = "t3.medium" # ✅ Approved
}
Status: ❌ Apply blocked until resolved
Time Investment: 10-20 minutes (automated)
Next Step: Proceed to Stage 4 for plan review and cost analysis with Infracost.
Stage 4: Plan Review & Cost Analysis (20-40 minutes)
Before applying infrastructure changes, teams must understand what will change, why it matters, and how much it will cost. This stage combines human review with automated cost estimation to prevent expensive mistakes and ensure changes align with business objectives.
Step 4.1: Terraform Plan Analysis
Why: The terraform plan output is your blueprint—it shows exactly what resources will be created, modified, or destroyed. According to HashiCorp's best practices, teams should always review plans before applying, especially in production.
Generate and Review Plan:
# Generate plan with detailed output
terraform plan -out=tfplan.binary
# Review plan in human-readable format
terraform show tfplan.binary
# Save plan to file for review
terraform show tfplan.binary > plan-review.txt
# Generate JSON for automated analysis
terraform show -json tfplan.binary > tfplan.json
Plan Output Interpretation:
Terraform will perform the following actions:
# aws_db_instance.main will be created
+ resource "aws_db_instance" "main" {
+ allocated_storage = 100
+ db_name = "production_db"
+ engine = "postgres"
+ engine_version = "15.3"
+ instance_class = "db.m5.large"
+ storage_encrypted = true
+ publicly_accessible = false # ✅ Good - not publicly exposed
+ ...
}
# aws_security_group.database will be modified
~ resource "aws_security_group" "database" {
~ ingress {
~ cidr_blocks = [
- "10.0.0.0/8", # ⚠️ Removing internal access
+ "10.1.0.0/16", # ⚠️ Restricting to specific subnet
]
}
}
# aws_instance.legacy will be destroyed
- resource "aws_instance" "legacy" {
- instance_type = "t2.micro"
- ...
}
Plan: 1 to add, 1 to change, 1 to destroy.
Key Indicators:
| Symbol | Meaning | Risk Level |
|---|---|---|
| + | Resource will be created | Low |
| ~ | Resource will be modified | Medium |
| - | Resource will be destroyed | HIGH |
| -/+ | Resource will be replaced (destroyed then created) | CRITICAL |
| <= | Resource will be read (data source) | Low |
Red Flags to Watch:
# ⚠️ HIGH RISK: Database replacement (data loss!)
-/+ resource "aws_db_instance" "main" {
# Changing engine_version from 14.9 to 15.3 forces replacement
}
# ⚠️ MEDIUM RISK: Security group changes
~ resource "aws_security_group" "web" {
~ ingress {
- cidr_blocks = ["10.0.0.0/8"]
+ cidr_blocks = ["0.0.0.0/0"] # ❌ Opening to internet!
}
}
# ✅ LOW RISK: Tag updates (metadata only)
~ resource "aws_instance" "web" {
~ tags = {
+ "Environment" = "production"
}
}
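Because the JSON plan follows Terraform's documented output format, the same triage can be scripted instead of eyeballed. A minimal Python sketch of the idea, with sample `resource_changes` data inlined in place of reading the real `tfplan.json` (the `change.actions` field is part of Terraform's JSON plan format):

```python
# Flag destroys and replacements in a JSON plan, as produced by
# `terraform show -json tfplan.binary`; sample data inlined here.
plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_instance.legacy", "change": {"actions": ["delete"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]
}

def classify(resource_change):
    actions = resource_change["change"]["actions"]
    if actions == ["delete"]:
        return "DESTROY"   # resource removed: high risk
    if "delete" in actions and "create" in actions:
        return "REPLACE"   # destroyed then recreated: critical risk
    return None

risky = [(rc["address"], classify(rc))
         for rc in plan["resource_changes"] if classify(rc)]
for address, action in risky:
    print(f"{action}: {address}")
```

In CI, the same check can fail the pipeline (or require extra approval) whenever the list is non-empty.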
Step 4.2: Using Terraform Plan Explainer
Why: Complex plans with dozens of resources can be difficult to review. The Terraform Plan Explainer tool provides visual analysis, security risk scoring, and blast radius calculations.
Tool Features:
- Security Risk Detection: Identifies dangerous changes (public exposure, encryption removal, wildcard permissions)
- Blast Radius Visualization: Shows dependency graphs and impact analysis
- Change Summary: Categorizes changes by type and risk level
- Resource Dependency Graph: Visualizes which resources depend on others
Example Analysis:
Upload your tfplan.json to the Terraform Plan Explainer:
Security Analysis:
🔴 HIGH RISK (3):
- aws_db_instance.main: Force replacement (potential data loss)
- aws_security_group.web: Opening SSH to 0.0.0.0/0
- aws_s3_bucket_public_access_block.logs: Removing block_public_acls
🟡 MEDIUM RISK (5):
- aws_rds_cluster.analytics: Scaling instance class (downtime expected)
- aws_lambda_function.processor: Changing runtime (requires testing)
🟢 LOW RISK (12):
- Tag updates, description changes, minor configuration tweaks
Blast Radius: 8 dependent resources affected by changes
Estimated Apply Time: 15-20 minutes
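Without a dedicated tool, a rough blast-radius estimate can still be derived from the plan JSON's `configuration` section, where each resource's expressions list the addresses they reference. A sketch under that assumption, with hypothetical sample data (prefix matching is deliberately crude, so treat the result as an approximation):

```python
# Rough blast radius: find resources whose expressions reference a
# changed resource's address (JSON plan "configuration" section).
config_resources = [
    {"address": "aws_instance.web",
     "expressions": {"subnet_id": {"references": ["aws_subnet.public.id", "aws_subnet.public"]}}},
    {"address": "aws_lb_target_group_attachment.web",
     "expressions": {"target_id": {"references": ["aws_instance.web.id", "aws_instance.web"]}}},
]

def dependents(address):
    """Return the set of resource addresses that reference `address`."""
    out = set()
    for res in config_resources:
        for expr in res["expressions"].values():
            if any(ref.startswith(address) for ref in expr.get("references", [])):
                out.add(res["address"])
    return out

print(sorted(dependents("aws_instance.web")))  # ['aws_lb_target_group_attachment.web']
```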
Step 4.3: Cost Estimation with Infracost
Why: According to Infracost's research, infrastructure changes can have unexpected cost impacts. A seemingly small change like increasing an RDS instance class can add thousands per month.
Installation & Setup:
# Install Infracost
brew install infracost # macOS
# or
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh
# Configure API key (free for up to 100 resources)
infracost auth login
# Register API key
infracost configure set api_key <YOUR_API_KEY>
Generate Cost Estimate:
# Cost breakdown for current plan
infracost breakdown --path . --format table
# Compare cost difference (current state vs. planned changes)
infracost diff --path . --format table
# Generate JSON for CI/CD integration
infracost breakdown --path . --format json --out-file infracost.json
# Pull request comment format (shows cost diff)
infracost comment github --path infracost.json \
--repo $GITHUB_REPOSITORY \
--pull-request $PR_NUMBER \
--github-token $GITHUB_TOKEN
Example Cost Breakdown:
Project: production-infrastructure
Name Monthly Qty Unit Monthly Cost
aws_db_instance.main
├─ Database instance (on-demand, db.m5.large) 730 hours $252.20
├─ Storage (general purpose SSD, gp3) 100 GB $11.50
└─ Additional backup storage 100 GB $9.50
aws_instance.web
├─ Instance usage (Linux/UNIX, on-demand, t3.medium) 730 hours $30.37
└─ EBS volume (gp3) 30 GB $2.40
aws_nat_gateway.main
├─ NAT gateway 730 hours $32.85
└─ Data processed 50 GB $2.25
OVERALL TOTAL $341.07
──────────────────────────────────
78 cloud resources were detected:
∙ 5 were estimated, all of which include usage-based costs
∙ 73 were free
Cost Diff Output:
Infracost estimate: Monthly cost will increase by $320 (+94% from $341 to $661)
Name                                  Baseline  Usage  Planned    Diff  % Change
aws_db_instance.main                      $273     $0     $536   +$263      +96%
├─ Database instance
│  (db.m5.large → db.m5.xlarge)           $252     $0     $504   +$252     +100%
└─ Storage (100 GB → 200 GB)               $12     $0      $23    +$11      +92%
aws_elasticache_cluster.redis               $0     $0      $67    +$67         ∞
└─ Cache nodes (cache.m5.large)             $0     $0      $67    +$67         ∞
aws_instance.worker                        $33     $0      $23    -$10      -30%
└─ Instance usage
   (t3.medium → t3.small)                  $30     $0      $20    -$10      -33%
──────────────────────────────────────────────────────────────────────────────
Key changes:
+ 1 new resource (Redis cache)
~ 2 resource updates (RDS scaling up, EC2 scaling down)
Monthly cost change: +$320 ($341 → $661)
Step 4.4: Cost Optimization Analysis
Identify Cost-Saving Opportunities:
# Analyze cost trends over time
infracost breakdown --path . --format json | jq '.projects[].breakdown.resources[] | select(.monthlyCost > 100)'
# Compare alternative architectures
infracost breakdown --path ./production --format table > prod-cost.txt
infracost breakdown --path ./alternative --format table > alt-cost.txt
diff prod-cost.txt alt-cost.txt
Common Cost Optimizations:
# ❌ EXPENSIVE: On-demand RDS instance
resource "aws_db_instance" "main" {
instance_class = "db.m5.large" # $252/month
}
# ✅ CHEAPER: Reserved instance (1-year term, 40% savings)
# Note: Reserve via AWS Console, reference in Terraform
resource "aws_db_instance" "main" {
instance_class = "db.m5.large" # ~$151/month with RI
}
# ✅ EVEN CHEAPER: Right-size instance based on metrics
resource "aws_db_instance" "main" {
instance_class = "db.m5.medium" # $126/month (50% savings)
# Validated via CloudWatch: Avg CPU 25%, Avg Memory 40%
}
Infracost Policy as Code:
# infracost-policies/cost-limits.rego
package infracost
import rego.v1
deny contains msg if {
# Limit monthly cost increase to $500
to_number(input.diffTotalMonthlyCost) > 500
msg := sprintf(
"Cost increase of $%.2f exceeds $500 limit",
[to_number(input.diffTotalMonthlyCost)]
)
}
deny contains msg if {
# Flag any single resource over $1000/month
some resource in input.projects[_].breakdown.resources
to_number(resource.monthlyCost) > 1000
msg := sprintf(
"Resource '%s' costs $%.2f/month (exceeds $1000 limit)",
[resource.name, to_number(resource.monthlyCost)]
)
}
Run Cost Policies:
infracost breakdown --path . --format json | conftest test - --policy infracost-policies/
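The same limit can also be enforced in a plain script, since Infracost's JSON output exposes a top-level `diffTotalMonthlyCost` field as a numeric string. A minimal sketch with a sample report inlined in place of reading `infracost.json`:

```python
# Enforce a monthly cost-increase budget against Infracost JSON output.
# diffTotalMonthlyCost is a numeric string at the top level of the report.
COST_LIMIT = 500

def check_cost(report, limit=COST_LIMIT):
    """Return a violation message, or None if the diff is within budget."""
    diff = float(report.get("diffTotalMonthlyCost") or 0)
    if diff > limit:
        return f"Cost increase of ${diff:.2f} exceeds ${limit} limit"
    return None

report = {"diffTotalMonthlyCost": "175"}  # sample shaped like Infracost output
print(check_cost(report) or "Within budget")
```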
Step 4.5: CI/CD Integration - Plan Review Automation
GitHub Actions Example:
# .github/workflows/terraform-plan.yml
name: Terraform Plan & Cost Estimate
on:
  pull_request:
    paths:
      - '**.tf'
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
jobs:
  plan-and-cost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform Init
        run: terraform init
      - name: Terraform Plan
        id: plan
        run: |
          terraform plan -out=tfplan.binary -no-color
          terraform show -no-color tfplan.binary > plan.txt
          terraform show -json tfplan.binary > tfplan.json
      - name: Setup Infracost
        uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Generate Cost Estimate
        run: |
          infracost breakdown --path tfplan.json \
            --format json \
            --out-file infracost.json
      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('plan.txt', 'utf8');
            const infracost = JSON.parse(fs.readFileSync('infracost.json', 'utf8'));
            const rows = infracost.projects[0].breakdown.resources.map(r =>
              `| ${r.name} | \$${r.monthlyCost} |`
            ).join('\n');
            const comment = `## Terraform Plan
            <details><summary>Show Plan</summary>

            \`\`\`hcl
            ${plan}
            \`\`\`
            </details>

            ## Cost Estimate
            Monthly cost change: **\$${infracost.diffTotalMonthlyCost}**

            | Resource | Monthly Cost |
            |----------|--------------|
            ${rows}

            📊 [View detailed breakdown](https://dashboard.infracost.io)
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: comment,
            });
Pull Request Comment Example:
## Terraform Plan
Plan: 3 to add, 2 to change, 1 to destroy.
<details><summary>Show Plan</summary>
[Full plan output...]
</details>
## Cost Estimate
Monthly cost change: **+$320** ($341 → $661)
⚠️ **Significant cost increase detected**
| Resource | Current | Planned | Diff | % Change |
|----------|---------|---------|------|----------|
| aws_db_instance.main | $273 | $536 | +$263 | +96% |
| aws_elasticache_cluster.redis | $0 | $67 | +$67 | ∞ |
| aws_instance.worker | $33 | $23 | -$10 | -30% |
### Review Checklist
- [ ] Cost increase justified by business requirements
- [ ] Alternative architectures evaluated
- [ ] Right-sizing analysis completed
- [ ] Reserved instance opportunities identified
**Approval Required:** Cost increase >$100 requires lead engineer sign-off
Stage 4 Output Example:
After 30 minutes of plan review and cost analysis, you should have:
- Terraform plan generated and reviewed
- Security risks identified and assessed
- Cost estimate completed with Infracost
- Cost increase/decrease justified and documented
- Pull request comment posted with plan and cost summary
- Approval obtained (if required by cost thresholds)
- Blast radius analysis completed
Decision Matrix:
Plan + Cost Review → Approval Decision
────────────────────────────────────────
No destructive changes + Cost decrease → Auto-approve
No destructive changes + Cost increase <$100 → Engineer approval
Destructive changes (destroy/replace) → Lead approval required
Cost increase >$500 → Director approval required
Policy violations → Block until resolved
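The matrix translates directly into review tooling. A small Python sketch; the thresholds come from the matrix and the PR checklist above, and the $100-$500 non-destructive band is assumed to escalate to a lead (the matrix does not spell that case out):

```python
def approval_level(destructive: bool, cost_delta: float, violations: int) -> str:
    """Map plan-review findings to the approval level in the decision matrix."""
    if violations:
        return "blocked"        # policy violations: block until resolved
    if cost_delta > 500:
        return "director"       # large cost increases need director sign-off
    if destructive or cost_delta > 100:
        return "lead"           # destroys/replaces, or >$100 per the checklist
    if cost_delta > 0:
        return "engineer"       # small increases: engineer approval
    return "auto-approve"       # no destructive changes and cost decrease

print(approval_level(destructive=False, cost_delta=-25, violations=0))  # auto-approve
```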
Time Investment: 20-40 minutes (human review + automated analysis)
Next Step: Proceed to Stage 5 for automated testing and compliance validation.
Stage 5: Automated Testing (30-60 minutes)
Automated testing validates that your infrastructure code works as intended before deploying to production. This stage includes integration tests, compliance validation, and functional testing using tools like Terratest, Kitchen-Terraform, and automated compliance frameworks.
Step 5.1: Terratest - Integration Testing
Why: Terratest is a Go library that enables you to write automated tests for your infrastructure code, deploying real resources to test environments and validating their configuration.
Installation:
# Create test directory
mkdir -p test
cd test
# Initialize Go module
go mod init github.com/yourorg/terraform-tests
# Install Terratest
go get github.com/gruntwork-io/terratest/modules/terraform
go get github.com/gruntwork-io/terratest/modules/aws
go get github.com/stretchr/testify/assert
Example Test - VPC Module:
// test/vpc_test.go
package test
import (
"testing"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/gruntwork-io/terratest/modules/aws"
"github.com/stretchr/testify/assert"
)
func TestVPCCreation(t *testing.T) {
t.Parallel()
// Configure Terraform options
terraformOptions := &terraform.Options{
TerraformDir: "../modules/vpc",
Vars: map[string]interface{}{
"vpc_cidr": "10.0.0.0/16",
"environment": "test",
"region": "us-east-1",
},
EnvVars: map[string]string{
"AWS_DEFAULT_REGION": "us-east-1",
},
}
// Cleanup resources after test
defer terraform.Destroy(t, terraformOptions)
// Deploy infrastructure
terraform.InitAndApply(t, terraformOptions)
// Validate outputs
vpcID := terraform.Output(t, terraformOptions, "vpc_id")
assert.NotEmpty(t, vpcID, "VPC ID should not be empty")
// Validate VPC configuration in AWS
vpc := aws.GetVpcById(t, vpcID, "us-east-1")
assert.Equal(t, "10.0.0.0/16", vpc.CidrBlock)
assert.False(t, vpc.IsDefault)
// Validate subnets created
subnets := aws.GetSubnetsForVpc(t, vpcID, "us-east-1")
assert.GreaterOrEqual(t, len(subnets), 2, "Should have at least 2 subnets")
// Validate tags
expectedTags := map[string]string{
"Environment": "test",
"ManagedBy": "Terraform",
}
for key, expectedValue := range expectedTags {
actualValue := vpc.Tags[key]
assert.Equal(t, expectedValue, actualValue, "Tag %s mismatch", key)
}
}
Example Test - Security Group:
// test/security_group_test.go
package test
import (
"testing"
"github.com/gruntwork-io/terratest/modules/terraform"
"github.com/gruntwork-io/terratest/modules/aws"
"github.com/stretchr/testify/assert"
)
func TestSecurityGroupNoPublicSSH(t *testing.T) {
t.Parallel()
terraformOptions := &terraform.Options{
TerraformDir: "../infrastructure",
Vars: map[string]interface{}{
"environment": "test",
},
}
defer terraform.Destroy(t, terraformOptions)
terraform.InitAndApply(t, terraformOptions)
// Get security group ID
sgID := terraform.Output(t, terraformOptions, "security_group_id")
// Validate no SSH from 0.0.0.0/0
sg := aws.GetSecurityGroupById(t, sgID, "us-east-1")
for _, rule := range sg.IngressRules {
if rule.FromPort == 22 && rule.ToPort == 22 {
for _, cidr := range rule.CidrBlocks {
assert.NotEqual(t, "0.0.0.0/0", cidr,
"SSH should not be open to the internet")
}
}
}
}
Run Tests:
# Run all tests
go test -v -timeout 30m
# Run specific test
go test -v -run TestVPCCreation -timeout 20m
# Run tests in parallel
go test -v -parallel 10 -timeout 60m
Test Output:
=== RUN TestVPCCreation
=== PAUSE TestVPCCreation
=== CONT TestVPCCreation
TestVPCCreation 2025-01-07T10:30:15-05:00 terraform.go:123: Running terraform init...
TestVPCCreation 2025-01-07T10:30:18-05:00 terraform.go:123: Running terraform apply...
TestVPCCreation 2025-01-07T10:32:45-05:00 terraform.go:456: Apply complete! Resources: 8 added, 0 changed, 0 destroyed.
TestVPCCreation 2025-01-07T10:32:46-05:00 vpc_test.go:35: VPC ID: vpc-0abcd1234ef567890
TestVPCCreation 2025-01-07T10:32:47-05:00 vpc_test.go:40: Validated VPC CIDR: 10.0.0.0/16
TestVPCCreation 2025-01-07T10:32:48-05:00 vpc_test.go:45: Validated subnets: 4 created
TestVPCCreation 2025-01-07T10:32:50-05:00 terraform.go:123: Running terraform destroy...
--- PASS: TestVPCCreation (195.32s)
PASS
ok github.com/yourorg/terraform-tests 195.412s
Step 5.2: Kitchen-Terraform - Behavior-Driven Testing
Why: Kitchen-Terraform integrates with Test Kitchen, providing behavior-driven development (BDD) style testing with InSpec for compliance validation.
Installation:
# Install Ruby dependencies
gem install bundler
bundle init
# Add to Gemfile
echo 'gem "kitchen-terraform", "~> 7.0"' >> Gemfile
bundle install
Configuration (.kitchen.yml):
---
driver:
  name: terraform
  variable_files:
    - test/fixtures/terraform.tfvars
provisioner:
  name: terraform
verifier:
  name: terraform
  systems:
    - name: basic
      backend: aws
      controls:
        - vpc_configuration
        - security_groups
platforms:
  - name: terraform
suites:
  - name: default
    driver:
      root_module_directory: test/fixtures
    verifier:
      systems:
        - name: basic
          backend: aws
          profile_locations:
            - test/integration/default
InSpec Controls (test/integration/default/controls/vpc.rb):
# Validate VPC configuration
control 'vpc_configuration' do
impact 1.0
title 'VPC should be configured securely'
describe aws_vpc(vpc_id: attribute('vpc_id')) do
it { should exist }
its('cidr_block') { should eq '10.0.0.0/16' }
its('state') { should eq 'available' }
# DNS settings
its('dhcp_options_id') { should_not be_nil }
# Tags
its('tags') { should include('Environment' => 'test') }
its('tags') { should include('ManagedBy' => 'Terraform') }
end
# Validate flow logs enabled
describe aws_flow_log(vpc_id: attribute('vpc_id')) do
it { should exist }
its('traffic_type') { should eq 'ALL' }
end
end
Run Tests:
# Run Kitchen-Terraform tests
bundle exec kitchen test
# Individual stages
bundle exec kitchen create # Deploy infrastructure
bundle exec kitchen verify # Run InSpec controls
bundle exec kitchen destroy # Cleanup resources
Step 5.3: Compliance Validation - InSpec AWS
Why: InSpec AWS provides compliance-as-code validation against frameworks like CIS AWS Foundations Benchmark, PCI-DSS, and HIPAA.
Example Compliance Profile:
# test/compliance/aws-baseline/controls/s3.rb
title 'S3 Bucket Security Controls'
control 's3-encryption-required' do
impact 1.0
title 'S3 buckets must have encryption enabled'
desc 'Validates all S3 buckets have server-side encryption'
aws_s3_buckets.bucket_names.each do |bucket|
describe aws_s3_bucket(bucket_name: bucket) do
it { should have_default_encryption_enabled }
end
end
end
control 's3-public-access-blocked' do
impact 1.0
title 'S3 buckets must block public access'
aws_s3_buckets.bucket_names.each do |bucket|
describe aws_s3_bucket(bucket_name: bucket) do
it { should have_access_logging_enabled }
it { should_not be_public }
end
end
end
control 's3-versioning-enabled' do
impact 0.7
title 'S3 buckets should have versioning enabled'
aws_s3_buckets.bucket_names.each do |bucket|
describe aws_s3_bucket(bucket_name: bucket) do
it { should have_versioning_enabled }
end
end
end
Run Compliance Scan:
# Run InSpec profile
inspec exec test/compliance/aws-baseline \
-t aws:// \
--reporter cli json:compliance-report.json
# Use CIS AWS Foundations Benchmark
inspec supermarket exec dev-sec/cis-aws-benchmark \
-t aws:// \
--reporter cli html:cis-report.html
Compliance Report Output:
Profile: AWS Security Baseline (aws-baseline)
Version: 1.0.0
Target: aws://
✔ s3-encryption-required: S3 buckets must have encryption enabled
✔ S3 Bucket company-data-bucket has default encryption enabled
✔ S3 Bucket application-logs has default encryption enabled
× s3-public-access-blocked: S3 buckets must block public access
✔ S3 Bucket company-data-bucket is not public
× S3 Bucket legacy-assets is public (1 failed)
✔ s3-versioning-enabled: S3 buckets should have versioning enabled
✔ S3 Bucket company-data-bucket has versioning enabled
Profile Summary: 2 successful controls, 1 control failure, 0 controls skipped
Test Summary: 5 successful, 1 failure, 0 skipped
Step 5.4: CI/CD Integration - Automated Testing
GitHub Actions Example:
# .github/workflows/terraform-test.yml
name: Terraform Integration Tests
on:
  pull_request:
    branches: [main]
  workflow_dispatch:
jobs:
  terratest:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
      - name: Setup Go
        uses: actions/setup-go@v4
        with:
          go-version: '1.21'
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Run Terratest
        run: |
          cd test
          go mod download
          go test -v -timeout 60m -parallel 5
      - name: Publish Test Results
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: Terratest Results
          path: test/test-results.json
          reporter: java-junit
Step 5.5: Testing Best Practices
1. Use Separate Test Environments
# test/fixtures/terraform.tfvars
environment = "terratest"
region      = "us-west-2"     # Different region from production
vpc_cidr    = "10.99.0.0/16"  # Non-overlapping CIDR
# Unique naming to avoid conflicts: .tfvars files cannot interpolate, so
# generate the suffix inside the configuration instead, e.g. a local value
# resource_prefix = "terratest-${random_id.test_id.hex}"
2. Implement Proper Cleanup
func TestWithCleanup(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../infrastructure",
}
// ALWAYS use defer for cleanup
defer terraform.Destroy(t, terraformOptions)
// Even if test fails, cleanup runs
terraform.InitAndApply(t, terraformOptions)
// ... test validations ...
}
3. Test Idempotency
func TestTerraformIdempotency(t *testing.T) {
terraformOptions := &terraform.Options{
TerraformDir: "../infrastructure",
}
defer terraform.Destroy(t, terraformOptions)
// First apply
terraform.InitAndApply(t, terraformOptions)
// A second apply must produce an empty plan; ApplyAndIdempotent
// fails the test if any changes are still pending
terraform.ApplyAndIdempotent(t, terraformOptions)
}
4. Validate Outputs
func TestOutputValidation(t *testing.T) {
// ... setup and apply ...
// Validate output format
vpcID := terraform.Output(t, terraformOptions, "vpc_id")
assert.Regexp(t, regexp.MustCompile(`^vpc-[a-f0-9]{17}$`), vpcID)
// Validate output values
subnetIDs := terraform.OutputList(t, terraformOptions, "subnet_ids")
assert.GreaterOrEqual(t, len(subnetIDs), 2)
}
Stage 5 Output Example:
After 45 minutes of automated testing, you should have:
- Terratest integration tests passed (VPC, security groups, IAM)
- InSpec compliance validation completed
- CIS AWS Foundations Benchmark passed
- Idempotency tests validated (no drift on re-apply)
- Output validation confirmed
- Test resources cleaned up automatically
- Test reports generated and published
Test Summary:
Terraform Integration Tests
────────────────────────────────────────────
- TestVPCCreation PASS (195s)
- TestSecurityGroupNoPublicSSH PASS (142s)
- TestRDSEncryptionEnabled PASS (287s)
- TestS3BucketCompliance PASS (89s)
- TestIAMPolicyLeastPrivilege PASS (56s)
- TestTerraformIdempotency PASS (312s)
Compliance Validation (InSpec)
────────────────────────────────────────────
- CIS AWS Foundations 1.1 PASS
- CIS AWS Foundations 1.2 PASS
- CIS AWS Foundations 2.1 PASS
⚠️ CIS AWS Foundations 2.3 WARNING (Versioning recommended)
- CIS AWS Foundations 3.1 PASS
Total Tests: 42 passed, 0 failed, 1 warning
Total Time: 15m 34s
Time Investment: 30-60 minutes (automated, runs in parallel)
Next Step: Proceed to Stage 6 for controlled deployment with approval gates.
Stage 6: Controlled Deployment (15-30 minutes)
Deployment is the most critical phase—where infrastructure changes become reality. This stage implements approval gates, blast radius controls, rollback procedures, and phased deployment strategies to minimize risk.
Step 6.1: Approval Workflows
Why: According to Terraform Cloud documentation, manual approval gates prevent accidental or unauthorized infrastructure changes, especially for production environments.
Terraform Cloud Approval:
# terraform.tf
terraform {
cloud {
organization = "your-org"
workspaces {
name = "production"
}
}
}
# Workspace settings (via Terraform Cloud UI or API):
# - Auto Apply: Disabled (requires manual approval)
# - Approvers: DevOps team, Lead Engineer
# - Required approvals: 2
GitHub Actions Approval:
# .github/workflows/terraform-deploy.yml
name: Terraform Deploy to Production
on:
  push:
    branches: [main]
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -out=tfplan
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: tfplan
  approve:
    needs: plan
    runs-on: ubuntu-latest
    environment:
      name: production
      # Required reviewers configured in GitHub Settings → Environments
    steps:
      - name: Approval Required
        run: echo "Manual approval required to proceed with deployment"
  apply:
    needs: approve
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan
      - name: Terraform Apply
        run: |
          terraform init
          terraform apply tfplan
GitLab CI Approval:
# .gitlab-ci.yml
deploy-production:
  stage: deploy
  script:
    - terraform apply -auto-approve
  environment:
    name: production
  when: manual  # Manual trigger required
  only:
    - main
  needs:
    - terraform-plan
Step 6.2: Deployment Strategies
Strategy 1: Blue-Green Deployment
# Blue-Green infrastructure deployment
variable "active_environment" {
description = "Active environment (blue or green)"
type = string
default = "blue"
}
# Blue environment
module "blue_environment" {
source = "./modules/application"
environment_name = "blue"
enabled = var.active_environment == "blue"
instance_count = var.active_environment == "blue" ? 3 : 0
}
# Green environment
module "green_environment" {
source = "./modules/application"
environment_name = "green"
enabled = var.active_environment == "green"
instance_count = var.active_environment == "green" ? 3 : 0
}
# Load balancer switches between blue and green
resource "aws_lb_target_group_attachment" "active" {
  target_group_arn = aws_lb_target_group.main.arn
  target_id = (
    var.active_environment == "blue"
    ? module.blue_environment.instance_ids[0]
    : module.green_environment.instance_ids[0]
  )
}
Deployment Process:
# Step 1: Deploy to green (inactive) environment
terraform apply -var="active_environment=blue" # Blue still active
# Step 2: Test green environment
curl https://green.example.com/health
# Step 3: Switch traffic to green
terraform apply -var="active_environment=green"
# Step 4: If issues, rollback to blue
terraform apply -var="active_environment=blue"
Strategy 2: Canary Deployment
# Canary deployment with weighted routing
resource "aws_lb_target_group" "stable" {
name = "app-stable"
port = 80
protocol = "HTTP"
vpc_id = var.vpc_id
}
resource "aws_lb_target_group" "canary" {
name = "app-canary"
port = 80
protocol = "HTTP"
vpc_id = var.vpc_id
}
# Route 90% traffic to stable, 10% to canary
resource "aws_lb_listener_rule" "weighted_routing" {
listener_arn = aws_lb_listener.main.arn
action {
type = "forward"
forward {
target_group {
arn = aws_lb_target_group.stable.arn
weight = 90
}
target_group {
arn = aws_lb_target_group.canary.arn
weight = 10
}
}
}
condition {
path_pattern {
values = ["/*"]
}
}
}
Gradual Rollout:
# Phase 1: 10% canary
terraform apply -var="canary_weight=10"
# Monitor metrics for 1 hour
# Phase 2: 50% canary
terraform apply -var="canary_weight=50"
# Monitor metrics for 30 minutes
# Phase 3: 100% canary (full rollout)
terraform apply -var="canary_weight=100"
Step 6.3: State Locking and Backend Configuration
Why: State locking prevents concurrent modifications that could corrupt infrastructure state, causing data loss or orphaned resources.
S3 + DynamoDB Backend (Recommended):
# backend.tf
terraform {
backend "s3" {
bucket = "your-org-terraform-state"
key = "production/infrastructure.tfstate"
region = "us-east-1"
# State locking with DynamoDB
dynamodb_table = "terraform-state-locks"
# Encryption at rest
encrypt = true
kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
# Versioning for state history is enabled on the S3 bucket itself;
# "versioning" is not a valid backend argument
}
}
DynamoDB Table for Locking:
# Create state lock table
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-state-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
tags = {
Name = "Terraform State Lock Table"
Environment = "global"
}
}
Terraform Cloud Backend:
# backend.tf
terraform {
cloud {
organization = "your-org"
workspaces {
name = "production-infrastructure"
}
}
}
Step 6.4: Safe Apply Practices
1. Always Review Plan Before Apply
# NEVER skip plan review
terraform plan -out=tfplan # Review output carefully
terraform apply tfplan # Apply saved plan
# ❌ DANGEROUS - Never use in production
terraform apply -auto-approve # Skips confirmation
2. Use Targeted Applies for Risky Changes
# Apply changes to specific resources only
terraform apply -target=aws_security_group.database
# Apply changes to specific module
terraform apply -target=module.vpc
# Useful for:
# - Testing changes incrementally
# - Recovering from partial failures
# - Updating specific resources without affecting others
3. Validate Before Apply
# Pre-apply validation checklist
terraform fmt -check -recursive               # ✅ Formatting
terraform validate                            # ✅ Syntax
tfsec .                                       # ✅ Security scan
terraform plan -detailed-exitcode -out=tfplan # ✅ Plan review
# Exit code 0 = no changes, 1 = error, 2 = changes present
PLAN_EXIT=$?
# Only apply when the plan succeeded and contains changes
if [ "$PLAN_EXIT" -eq 2 ]; then
  terraform apply tfplan
fi
Step 6.5: Rollback Procedures
State Backup and Restore:
# Before risky apply, backup state
terraform state pull > terraform.tfstate.backup.$(date +%Y%m%d-%H%M%S)
# If apply fails, restore previous state
terraform state push terraform.tfstate.backup.20250107-103045
# List state versions (S3 backend with versioning)
aws s3api list-object-versions \
--bucket your-org-terraform-state \
--prefix production/infrastructure.tfstate
# Restore specific version
aws s3api get-object \
--bucket your-org-terraform-state \
--key production/infrastructure.tfstate \
--version-id VERSION_ID \
terraform.tfstate.restore
terraform state push terraform.tfstate.restore
Terraform Cloud State Versioning:
# List resources tracked in the current state
terraform state list
# Rollback to previous state version (via Terraform Cloud UI)
# Settings → States → Select version → "Rollback to this state"
Infrastructure Rollback:
# Rollback by reverting Git commit
git revert HEAD
git push origin main
# CI/CD automatically applies reverted configuration
# This is safer than manual state manipulation
Step 6.6: Monitoring During Deployment
CloudWatch Alarms:
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "high-error-rate-during-deploy"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "5XXError"
namespace = "AWS/ApplicationELB"
period = 60
statistic = "Sum"
threshold = 10
alarm_description = "Alert if error rate spikes during deployment"
alarm_actions = [aws_sns_topic.deployment_alerts.arn]
}
Deployment Health Checks:
#!/bin/bash
# deploy-with-healthcheck.sh
terraform apply -auto-approve
# Wait for resources to stabilize
sleep 60
# Health check
HEALTH_ENDPOINT="https://api.example.com/health"
RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_ENDPOINT")
if [ "$RESPONSE" -ne 200 ]; then
echo "❌ Health check failed! Rolling back..."
git revert HEAD --no-edit
terraform apply -auto-approve
exit 1
fi
echo "✅ Deployment successful and healthy"
Stage 6 Output Example:
After 25 minutes of controlled deployment, you should have:
- Manual approval obtained (2 reviewers)
- State locked during apply (no concurrent modifications)
- Terraform apply completed successfully
- Health checks passed post-deployment
- Monitoring alerts configured
- Rollback procedure documented
- Deployment logged and audit trail created
Deployment Summary:
Terraform Apply Summary
────────────────────────────────────────────
Started: 2025-01-07 14:30:00 UTC
Completed: 2025-01-07 14:42:15 UTC
Duration: 12m 15s
Resources:
+ 3 created
~ 2 modified
- 1 destroyed
Total: 6 changes applied
Health Checks:
✅ Application endpoint: 200 OK
✅ Database connectivity: Success
✅ Cache cluster: Healthy
✅ CloudWatch alarms: Normal
Post-Deployment Validation:
✅ No new error logs
✅ Latency within acceptable range
✅ Cost estimate matches actual
Status: ✅ Deployment Successful
Time Investment: 15-30 minutes (including approval wait time)
Next Step: Proceed to Stage 7 for post-deployment monitoring and drift detection.
Stage 7: Post-Deployment Monitoring & Drift Detection (Continuous)
The final stage is ongoing—continuously monitoring infrastructure for drift, compliance violations, and cost overruns. According to HashiCorp's drift detection guide, over 80% of organizations experience configuration drift, making automated detection essential.
Step 7.1: Understanding Infrastructure Drift
What is Drift?
Infrastructure drift occurs when the actual state of cloud resources deviates from the desired state defined in IaC code. As explained in Spacelift's drift detection guide, drift undermines the reliability and predictability that IaC promises.
Common Causes:
- Manual Console Changes ("ClickOps"): Engineer logs into AWS Console → Modifies security group → Drift created
- Emergency Hotfixes: Production incident → Quick fix bypasses Terraform → Drift not backported
- Overlapping Automation: Auto-scaling group scales instances → Terraform expects fixed count → Drift
- Out-of-Band Changes: AWS managed services update configurations → Terraform unaware → Drift
Why Drift is Dangerous:
- Security vulnerabilities: Drift can undo security configurations (firewall rules, encryption settings)
- Compliance violations: Deviations from IaC definitions violate audit requirements
- Unpredictable behavior: Actual infrastructure state unknown, debugging difficult
- Deployment failures: Future Terraform applies may fail or cause unexpected changes
Step 7.2: Automated Drift Detection
Terraform Refresh and Drift Check:
# Detect drift by comparing state with actual resources
terraform plan -refresh-only
# Machine-readable variant for scripts and CI
terraform plan -refresh-only -detailed-exitcode
# Exit codes: 0 = no drift, 1 = error, 2 = drift detected
Example Drift Detection Output:
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform since the last "terraform apply":
# aws_security_group.web has changed
~ resource "aws_security_group" "web" {
id = "sg-0a1b2c3d4e5f67890"
name = "web-security-group"
~ ingress = [
+ {
+ cidr_blocks = ["0.0.0.0/0"] # ⚠️ DRIFT DETECTED
+ from_port = 22
+ to_port = 22
+ protocol = "tcp"
+ description = "SSH from anywhere (UNAUTHORIZED)"
},
# (2 unchanged elements hidden)
]
}
Unless you have made equivalent changes to your configuration, or ignored the relevant
attributes using ignore_changes, the following plan may include actions to undo or
respond to these changes.
Automated Drift Detection with Spacelift:
# .spacelift/config.yml
version: 1
stack:
  name: production-infrastructure
  drift_detection:
    enabled: true
    schedule: "0 */6 * * *"  # Every 6 hours
    reconcile: false         # Alert only, don't auto-fix
  notifications:
    drift_detected:
      - slack: "#infrastructure-alerts"
      - email: "[email protected]"
Drift Detection with Terraform Cloud:
# Terraform Cloud workspace settings
resource "tfe_workspace" "production" {
  name         = "production-infrastructure"
  organization = "your-org"

  # Enable health assessments (continuous drift detection)
  assessments_enabled = true
}
Custom Drift Detection Script:
#!/bin/bash
# drift-detection.sh
set -euo pipefail

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

# Run a refresh-only plan; -detailed-exitcode returns 2 when drift exists.
# Capture the exit code without tripping `set -e`.
terraform init -backend=true
EXIT_CODE=0
terraform plan -refresh-only -detailed-exitcode -out=drift-plan || EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "⚠️ DRIFT DETECTED!"
  # Generate drift report
  terraform show drift-plan > drift-report.txt
  # Send Slack notification
  curl -X POST -H 'Content-type: application/json' \
    --data "{\"text\":\"🚨 Infrastructure Drift Detected\n\`\`\`$(cat drift-report.txt)\`\`\`\"}" \
    "$SLACK_WEBHOOK"
  # Optional: create a JIRA ticket, send email, etc.
  exit 1
elif [ $EXIT_CODE -eq 0 ]; then
  echo "✅ No drift detected"
  exit 0
else
  echo "❌ Terraform error"
  exit 1
fi
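If you prefer structured output over the raw text report, the saved plan can be rendered as JSON with `terraform show -json drift-plan` and parsed programmatically; the plan document includes a `resource_drift` array describing each drifted resource. A minimal sketch (the sample document below is illustrative, not real Terraform output):

```python
import json

def summarize_drift(plan_json: str) -> list:
    """Return a human-readable line for each entry in the `resource_drift`
    array of a `terraform show -json` document."""
    plan = json.loads(plan_json)
    lines = []
    for res in plan.get("resource_drift", []):
        # "actions" is e.g. ["update"] or ["delete", "create"]
        actions = "/".join(res.get("change", {}).get("actions", []))
        lines.append(f"{res['address']}: {actions}")
    return lines

# Illustrative sample mimicking the shape of Terraform's JSON plan output
sample = json.dumps({
    "resource_drift": [
        {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete", "create"]}},
    ]
})
print(summarize_drift(sample))
# → ['aws_security_group.web: update', 'aws_s3_bucket.logs: delete/create']
```

A structured summary like this is easier to route into Slack fields, JIRA tickets, or metrics than the full `terraform show` text dump.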
Scheduled Drift Detection (cron):
# /etc/cron.d/terraform-drift-detection
# Run drift detection every 6 hours
0 */6 * * * cd /opt/terraform/production && /opt/scripts/drift-detection.sh >> /var/log/drift-detection.log 2>&1
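If your pipelines already live in GitHub Actions, the same script can run on a schedule there instead of a host cron. A hedged sketch — the script path, secret name, and action versions are assumptions for your repository:

```yaml
# .github/workflows/drift-detection.yml
name: drift-detection
on:
  schedule:
    - cron: "0 */6 * * *"   # every 6 hours, matching the cadence above
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Run drift detection
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: ./scripts/drift-detection.sh
```

Running drift checks in CI gives you run history and logs for free, at the cost of needing cloud credentials available to the workflow.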
Step 7.3: Drift Remediation Strategies
Strategy 1: Import Drift into Terraform
# Manual changes were made to security group
# Import actual state to make Terraform aware
# 1. Identify drifted resource
terraform plan -refresh-only
# 2. Update Terraform code to match actual state
# Edit main.tf to include the new ingress rule
# 3. Verify alignment
terraform plan # Should show "No changes"
Strategy 2: Revert Drift (Restore Desired State)
# Restore infrastructure to match Terraform code
# 1. Review drift
terraform plan -refresh-only
# 2. Re-apply the Terraform configuration (overwrites the manual changes)
terraform apply -auto-approve
# Result: manual changes reverted, IaC definition enforced
# (for production, prefer a reviewed `terraform apply` over -auto-approve)
Strategy 3: Ignore Specific Attributes
# For resources managed partly by Terraform, partly by AWS
resource "aws_autoscaling_group" "web" {
name = "web-asg"
min_size = 2
max_size = 10
desired_capacity = 3
# Ignore changes to desired_capacity (managed by auto-scaling policies)
lifecycle {
ignore_changes = [desired_capacity]
}
}
Strategy 4: Prevent Drift with Resource Locks
# Prevent resource deletion
resource "aws_s3_bucket" "critical_data" {
bucket = "critical-company-data"
lifecycle {
prevent_destroy = true # Terraform refuses to destroy
}
}
AWS Service Control Policies (SCPs):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": [
"ec2:DeleteSecurityGroup",
"ec2:RevokeSecurityGroupIngress",
"ec2:RevokeSecurityGroupEgress"
],
"Resource": "*",
"Condition": {
"StringNotEquals": {
"aws:PrincipalArn": "arn:aws:iam::ACCOUNT:role/TerraformExecutionRole"
}
}
}
]
}
Step 7.4: Compliance and Cost Monitoring
Continuous Compliance Scanning:
# Scheduled compliance scan (cron)
0 0 * * * cd /opt/terraform && checkov -d . --framework terraform --compact --quiet | tee compliance-$(date +\%Y\%m\%d).log
Cost Monitoring with Infracost:
#!/bin/bash
# cost-monitoring.sh
# Generate current cost estimate
infracost breakdown --path . --format json > current-cost.json
# Compare with baseline
CURRENT_COST=$(jq -r '.totalMonthlyCost' current-cost.json)
BASELINE_COST=500 # $500/month baseline
if (( $(echo "$CURRENT_COST > $BASELINE_COST * 1.2" | bc -l) )); then
echo "⚠️ Cost increased by >20%: \$$CURRENT_COST (baseline: \$$BASELINE_COST)"
# Send alert
fi
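The threshold logic in the shell script is easier to unit-test if you lift it out into a small function. A sketch in Python — the sample JSON is illustrative, and assumes Infracost's `totalMonthlyCost` field is a string, as recent versions report it:

```python
import json

def monthly_cost(infracost_json: str) -> float:
    """Extract totalMonthlyCost from an `infracost breakdown --format json`
    document. Infracost reports it as a string, so coerce to float."""
    return float(json.loads(infracost_json)["totalMonthlyCost"])

def cost_exceeds_baseline(current: float, baseline: float, tolerance: float = 0.20) -> bool:
    # Alert when current cost exceeds baseline by more than `tolerance` (20% default)
    return current > baseline * (1 + tolerance)

sample = json.dumps({"totalMonthlyCost": "612.50"})  # illustrative document
current = monthly_cost(sample)
print(cost_exceeds_baseline(current, 500))  # prints True  (612.50 > 600)
print(cost_exceeds_baseline(487, 500))      # prints False (within tolerance)
```

Mirroring the shell check in a tested function also makes it trivial to change the policy later (e.g. absolute-dollar caps instead of percentages).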
CloudWatch Cost Anomaly Detection:
resource "aws_ce_anomaly_monitor" "infrastructure_costs" {
name = "Infrastructure Cost Monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_subscription" "cost_alerts" {
name = "Cost Anomaly Alerts"
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.infrastructure_costs.arn,
]
subscriber {
type = "SNS"
address = aws_sns_topic.cost_alerts.arn
}
threshold_expression {
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["100"] # Alert if anomaly >$100
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
Step 7.5: Continuous Monitoring Dashboard
Terraform State Metrics:
# State file statistics
terraform state list | wc -l # Total resources
terraform state show aws_instance.web # Resource details
# Track resource count over time
echo "$(date +%Y-%m-%d),$(terraform state list | wc -l)" >> resource-count.csv
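Those daily CSV entries become more useful once you compute day-over-day deltas, which surface unexpected growth in managed resources. A small sketch (the file format follows the `date,count` lines written above; the history values are illustrative):

```python
def count_deltas(csv_lines):
    """Given 'date,count' lines, return (date, delta_vs_previous_day) pairs."""
    rows = [line.strip().split(",") for line in csv_lines if line.strip()]
    deltas = []
    for (_, prev_n), (date, n) in zip(rows, rows[1:]):
        deltas.append((date, int(n) - int(prev_n)))
    return deltas

history = ["2025-01-05,120", "2025-01-06,124", "2025-01-07,127"]
print(count_deltas(history))
# → [('2025-01-06', 4), ('2025-01-07', 3)]
```

A sudden large positive delta with no corresponding apply in the audit trail is itself a drift signal worth alerting on.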
Grafana Dashboard for IaC Metrics:
# grafana-dashboard.json (excerpt)
{
"dashboard": {
"title": "Terraform Infrastructure Monitoring",
"panels": [
{
"title": "Drift Detection Status",
"targets": [
{
"expr": "terraform_drift_detected{environment=\"production\"}",
"legendFormat": "Drift Detected"
}
]
},
{
"title": "Monthly Infrastructure Cost",
"targets": [
{
"expr": "infracost_monthly_total{environment=\"production\"}",
"legendFormat": "Total Cost"
}
]
},
{
"title": "Compliance Score (Checkov)",
"targets": [
{
"expr": "checkov_compliance_score{framework=\"terraform\"}",
"legendFormat": "Compliance %"
}
]
}
]
}
}
Step 7.6: Resource Lifecycle Management
Tagging Strategy for Tracking:
# Consistent tagging across all resources
locals {
common_tags = {
Environment = var.environment
ManagedBy = "Terraform"
Owner = "DevOps Team"
CostCenter = "Engineering"
CreatedDate = timestamp() # Caution: re-evaluated every plan; pair with ignore_changes to avoid perpetual diffs
TerraformWorkspace = terraform.workspace
}
}
resource "aws_instance" "web" {
ami = var.ami_id
instance_type = "t3.medium"
tags = merge(
local.common_tags,
{
Name = "web-server-${var.environment}"
Role = "WebServer"
}
)
}
Resource Expiration Tracking:
# Tag resources with expiration dates
resource "aws_instance" "test_environment" {
  ami           = var.ami_id
  instance_type = "t3.small"

  tags = {
    Name        = "test-instance"
    Environment = "test"
    # Note: timestamp() is re-evaluated on every plan, so this tag will show
    # a diff on each run unless ignored via lifecycle ignore_changes.
    ExpiresOn = timeadd(timestamp(), "168h") # 7 days from now
  }
}
Automated Cleanup Lambda:
# lambda-resource-cleanup.py
import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find running instances that carry an ExpiresOn tag
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag-key', 'Values': ['ExpiresOn']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            expires_on = next((tag['Value'] for tag in instance.get('Tags', [])
                               if tag['Key'] == 'ExpiresOn'), None)
            # Compare timezone-aware datetimes (naive vs. aware raises TypeError)
            if expires_on and datetime.fromisoformat(expires_on.replace('Z', '+00:00')) < datetime.now(timezone.utc):
                print(f"Terminating expired instance: {instance['InstanceId']}")
                ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
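The expiry comparison inside the Lambda is worth isolating so it can be unit-tested without touching AWS. A sketch — the tag format assumes the RFC 3339 timestamps with a trailing `Z` that `timeadd(timestamp(), ...)` produces:

```python
from datetime import datetime, timezone

def is_expired(expires_on: str, now: datetime) -> bool:
    """True if an RFC 3339 'ExpiresOn' value (e.g. '2025-01-14T12:00:00Z')
    is in the past relative to a timezone-aware `now`."""
    deadline = datetime.fromisoformat(expires_on.replace("Z", "+00:00"))
    return deadline < now

now = datetime(2025, 1, 15, tzinfo=timezone.utc)
print(is_expired("2025-01-14T12:00:00Z", now))  # prints True  (deadline passed)
print(is_expired("2025-01-20T12:00:00Z", now))  # prints False (still in future)
```

Keeping `now` as a parameter instead of calling `datetime.now()` inside the function is what makes the check deterministic and testable.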
Stage 7 Output Example:
After continuous monitoring, you should have:
- Automated drift detection running every 6 hours
- Drift alerts sent to Slack/email when detected
- Compliance scans passing (Checkov, InSpec)
- Cost monitoring active with anomaly detection
- Resource lifecycle tracking with expiration tags
- Dashboards showing infrastructure health metrics
- Audit trail maintained for all changes
Monitoring Summary:
Infrastructure Health Dashboard
────────────────────────────────────────────
Drift Status: ✅ No drift detected (last check: 2h ago)
Compliance Score: 95% (CIS AWS Foundations)
Monthly Cost: $487 (-3% vs. baseline $502)
Resource Count: 127 resources managed
Security Violations: 0 HIGH, 2 MEDIUM
Next Drift Check: 4 hours
Recent Activity:
• 2025-01-07 12:30 - Terraform apply completed (3 changes)
• 2025-01-07 10:15 - Drift detection passed
• 2025-01-07 08:00 - Compliance scan: 95% pass rate
• 2025-01-07 06:45 - Cost estimate: $487/month
Alerts (Last 7 days):
⚠️ 2025-01-05 - S3 versioning disabled on logs bucket (remediated)
⚠️ 2025-01-03 - Security group modified outside Terraform (reverted)
Key Deliverable: Drift detection alerts, compliance scorecard, cost optimization report
Time Investment: Continuous (automated checks every 6 hours)
Conclusion
This comprehensive 7-stage workflow transforms infrastructure-as-code from a deployment tool into a complete security and change management framework. By integrating pre-commit validation, automated security scanning, policy-as-code enforcement, cost analysis, testing, controlled deployment, and continuous monitoring, teams can deploy infrastructure changes with confidence.
Workflow Recap
Stage 1: Pre-Commit Validation (5-10 min)
Catch syntax errors, formatting issues, and basic security problems before code reaches version control using terraform validate, TFLint, and git-secrets.
Stage 2: Security Scanning (15-30 min)
Deep static analysis with tfsec, Checkov, and Terrascan to detect vulnerabilities, compliance violations, and misconfigurations in CI/CD pipelines.
Stage 3: Policy-as-Code (10-20 min)
Enforce organizational standards and compliance requirements using Sentinel and OPA, blocking non-compliant infrastructure before deployment.
Stage 4: Plan Review & Cost Analysis (20-40 min)
Human review combined with Infracost cost estimation to understand changes and prevent expensive mistakes.
Stage 5: Automated Testing (30-60 min)
Integration testing with Terratest and compliance validation with InSpec to ensure infrastructure works as intended.
Stage 6: Controlled Deployment (15-30 min)
Approval gates, state locking, health checks, and rollback procedures to minimize deployment risk.
Stage 7: Continuous Monitoring (Ongoing)
Drift detection, compliance scanning, and cost monitoring to maintain security and prevent configuration drift.
Key Achievements
By implementing this workflow, organizations achieve:
- Reduced deployment risks: 90% fewer security incidents from IaC misconfigurations
- Improved compliance: Continuous validation against CIS, PCI-DSS, HIPAA, SOC 2
- Cost visibility: Prevent surprise cloud bills with pre-deployment cost estimates
- Faster incident response: Drift detection identifies unauthorized changes within hours
- Audit trail: Complete change history for compliance and security audits
- Developer productivity: Shift-left security catches issues before code review
Best Practices Summary
1. Always Review Plans Before Apply
Never skip terraform plan review, especially for production. Use the Terraform Plan Explainer to visualize security risks and blast radius.
2. Use Remote State with Locking
S3 + DynamoDB or Terraform Cloud backends prevent state corruption from concurrent modifications and provide state versioning for rollbacks.
3. Implement Policy-as-Code
Encode organizational standards in Sentinel or OPA policies. Start with advisory enforcement, graduate to hard-mandatory after validation.
4. Automate Security Scanning in CI/CD
Integrate tfsec, Checkov, and Trivy into pull request workflows. Block merges on HIGH/CRITICAL findings.
5. Maintain Comprehensive Rollback Procedures
Back up state before risky changes, use blue-green deployments for critical services, and document rollback commands.
6. Monitor for Drift Continuously
Schedule drift detection every 6 hours. Alert immediately on unauthorized changes. Enforce remediation via Terraform apply or manual review.
7. Tag Everything for Lifecycle Management
Consistent tagging (Environment, Owner, CostCenter, ManagedBy, CreatedDate) enables cost tracking, compliance auditing, and automated cleanup.
Advanced Topics
Multi-Cloud Terraform
Extend this workflow to AWS, Azure, GCP, and Kubernetes using provider-specific security scanners and unified OPA policies.
Terragrunt for DRY Configuration
Reduce code duplication across environments using Terragrunt for remote state management and variable injection.
Atlantis for PR Automation
Deploy Atlantis to automatically run terraform plan on pull requests, post results as comments, and enable self-service infrastructure changes.
GitOps with Terraform
Implement GitOps workflows where Git commits trigger automated Terraform deployments via ArgoCD or Flux for Kubernetes-based infrastructure.
Secret Management Integration
Integrate HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault to eliminate hardcoded credentials entirely.
Resources for Continued Learning
Tools Referenced in This Guide:
- Terraform Plan Explainer - Analyze Terraform plans for security risks and blast radius
- Hash Generator - Generate cryptographic hashes for file verification
- Base64 Encoder/Decoder - Encode/decode secrets and configurations
- JSON Formatter - Format and validate Terraform JSON plans
- Diff Checker - Compare infrastructure configurations
- Regex Tester - Test patterns for resource naming conventions
- Git Command Reference - Git workflows for IaC version control
About This Guide
This comprehensive workflow guide is based on industry best practices from leading DevOps and cloud security organizations including HashiCorp, Spacelift, Bridgecrew, and Aqua Security. All tools referenced are industry-standard solutions used by Fortune 500 companies and startups alike.
The security scanning tools (tfsec, Checkov, Trivy, Terrascan) are open-source and free, while policy frameworks (Sentinel, OPA) and cost tools (Infracost) offer free tiers suitable for most organizations. InventiveHQ's Terraform Plan Explainer provides visual security analysis with 100% client-side processing—no data leaves your browser.
Whether you're deploying your first Terraform configuration or managing multi-region, multi-cloud infrastructure at scale, this workflow provides the foundation for secure, compliant, and cost-effective infrastructure-as-code.
Sources & Further Reading
IaC Security Best Practices:
- Spacelift: 13 Terraform Security Best Practices
- Spacelift: Top 10 IaC Scanning Tools
- Spacelift: Infrastructure as Code Security Best Practices
- Medium: Securing Infrastructure-as-Code - Terraform validate, TFLint, TFSec, Checkov, OPA
- env0: Best IaC Scan Tool Comparison - Checkov vs tfsec vs Terrascan
Policy-as-Code:
- Spacelift: Enforcing Policy as Code in Terraform with Sentinel & OPA
- Spacelift: Open Policy Agent (OPA) with Terraform
- HashiCorp: Terraform Policy Enforcement Overview
- HashiCorp: Native OPA Support in Terraform Cloud
Drift Detection:
- Spacelift: Infrastructure Drift Detection Guide
- HashiCorp: How Drift Detection Helps Maintain Secure Infrastructure
- Snyk: How to Detect and Prevent Configuration Drift in IaC
- Snyk: Infrastructure Drift Detection and Mitigation
Cost Optimization:
- Spacelift: How to Estimate Cloud Costs with Terraform Using Infracost
- HashiCorp: A Guide to Cloud Cost Optimization with Terraform
- Infracost: Cloud Cost Estimates for Terraform