Introduction {#introduction}
Kubernetes has become the de facto standard for container orchestration, but with great power comes great responsibility. According to Red Hat's 2024 State of Kubernetes Security report, 67% of organizations experienced Kubernetes security incidents in production environments. The CNCF Security Report 2025 identifies misconfigured Role-Based Access Control (RBAC) as the #1 vulnerability in production clusters, followed closely by exposed API servers and inadequate network segmentation.
The attack surface is vast and complex. Modern Kubernetes clusters face threats ranging from container escapes and privilege escalation to supply chain attacks, cryptomining operations, and lateral movement within the cluster. The consequences are severe: data breaches, service disruptions, compliance violations, and in some cases, complete cluster compromise.
The Kubernetes Security Challenge {#the-kubernetes-security-challenge}
The CIS Kubernetes Benchmark v1.12 now includes over 100 security checks covering control plane components, worker node configurations, and policy enforcement. The National Security Agency (NSA) and Cybersecurity and Infrastructure Security Agency (CISA) published their Kubernetes Hardening Guide v1.2 specifically addressing the rising tide of nation-state and criminal attacks targeting containerized infrastructure.
The challenge isn't just the number of security controls—it's the interdependencies. A secure Kubernetes deployment requires coordinated hardening across eight critical areas:
- Control plane security - API server, etcd, scheduler, controller manager
- Node security - Kubelet configuration, host isolation, kernel hardening
- RBAC policies - Least privilege access, service account management
- Network segmentation - Zero-trust policies, microsegmentation, service mesh
- Pod security - Security contexts, admission control, resource limits
- Image security - Vulnerability scanning, signing, base image hardening
- Runtime monitoring - Behavioral detection, threat hunting, forensics
- Audit logging - Compliance evidence, incident investigation, detection rules
Why This Workflow Matters {#why-this-workflow-matters}
This guide presents a comprehensive 8-stage Kubernetes security workflow used by platform engineers, security teams, and DevOps practitioners to transform vulnerable clusters into hardened production environments. Unlike basic security checklists, this workflow emphasizes:
- Compliance alignment - CIS Kubernetes Benchmark, NSA/CISA guidance, Pod Security Standards
- Practical implementation - Real-world configurations with detailed explanations
- Tool integration - 11 security tools from InventiveHQ's free toolkit
- Defense-in-depth - Multiple overlapping security layers
- Continuous validation - Ongoing monitoring and improvement
Whether you're running self-managed Kubernetes or using managed services like Amazon EKS, Google GKE, or Azure AKS, this workflow provides the roadmap to production-grade security. Let's begin with assessment.
Stage 1: Security Assessment & Baseline (20-30 minutes) {#stage-1-security-assessment-baseline-20-30-minutes}
Before implementing security controls, you must understand your current posture. This assessment establishes a baseline, identifies critical vulnerabilities, and prioritizes remediation efforts based on risk.
Step 1.1: Cluster Information Gathering {#step-11-cluster-information-gathering}
Why: Security requirements differ dramatically between managed Kubernetes services (where cloud providers handle control plane security) and self-hosted clusters (where you're responsible for everything). Understanding your architecture determines which CIS benchmark sections apply and where to focus hardening efforts.
Document Your Architecture:
# Kubernetes version and distribution
kubectl version   # (the --short flag was removed in newer kubectl releases)
# Node inventory
kubectl get nodes -o wide
# Cluster components
kubectl get pods -A -o wide
# Namespaces and workload counts
kubectl get namespaces
kubectl get all -A --no-headers | wc -l
Key Architecture Details:
- Distribution: EKS, GKE, AKS, OpenShift, vanilla Kubernetes
- Control plane: Managed (cloud provider) vs self-hosted
- Network CNI: Calico, Cilium, Flannel, Weave
- Ingress controller: NGINX, Traefik, HAProxy, Istio Gateway
- Storage CSI: EBS, GCE PD, Azure Disk, Longhorn
Workload Inventory:
# Service accounts with permissions
kubectl get serviceaccounts -A
# Secrets and ConfigMaps (potential credential exposure)
kubectl get secrets,configmaps -A --no-headers | wc -l
# External-facing services
kubectl get services -A -o wide | grep -E 'LoadBalancer|NodePort'
Step 1.2: CIS Benchmark Assessment {#step-12-cis-benchmark-assessment}
Why: The CIS Kubernetes Benchmark is the industry-standard security baseline, mapping directly to compliance frameworks like SOC 2, PCI-DSS, and HIPAA. Automated scanning with kube-bench provides objective measurement of your security posture.
Run kube-bench Scan:
# Run as Kubernetes Job
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
# View results
kubectl logs -f job/kube-bench
# Generate JSON report for analysis
kube-bench run --json > cis-benchmark-results.json
CIS Benchmark Coverage Areas:
- Control Plane Components (Sections 1.1-1.4)
  - API server authentication and authorization
  - Controller manager security
  - Scheduler configuration
  - etcd encryption and access controls
- Worker Node Security (Section 2)
  - Kubelet authentication and authorization
  - File permissions and ownership
  - Certificate rotation
- RBAC Policies (Section 3)
  - Role bindings and cluster roles
  - Service account token handling
  - Privilege escalation prevention
- Network Policies (Section 4)
  - Pod-to-pod communication rules
  - Egress controls
  - Network segmentation
- Pod Security (Section 5)
  - Security contexts and admission control
  - Resource limits and quotas
Use JSON Formatter to analyze CIS benchmark output:
- Copy the `cis-benchmark-results.json` output
- Format and validate the JSON structure
- Filter for failed checks: search for `"status": "FAIL"` (or script it with jq, as sketched below)
- Extract severity levels: `HIGH`, `MEDIUM`, `LOW`
- Compare against baseline scans to track remediation progress
Example Failed Check:
{
"test_number": "1.2.1",
"test_desc": "Ensure that the --anonymous-auth argument is set to false",
"audit": "ps -ef | grep kube-apiserver | grep -v grep",
"status": "FAIL",
"remediation": "Edit the API server pod specification and set --anonymous-auth=false"
}
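The same filtering can be scripted rather than done by hand. A minimal jq sketch, assuming kube-bench's default JSON layout (a top-level Controls array containing tests and results; field names can vary slightly between kube-bench versions):
# List every failed check with its description
jq -r '.Controls[].tests[].results[]
  | select(.status == "FAIL")
  | "\(.test_number)\t\(.test_desc)"' cis-benchmark-results.json
# Count results by status for a quick pass/fail baseline
jq '[.Controls[].tests[].results[].status] | group_by(.) | map({(.[0]): length}) | add' \
  cis-benchmark-results.json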
Step 1.3: Attack Surface Analysis {#step-13-attack-surface-analysis}
Why: Exposed services are prime targets for attackers. Even a single publicly accessible dashboard or API server can provide an entry point for cluster compromise. The median time to exploit a newly exposed Kubernetes service is under 24 hours according to SANS Institute research.
Identify Exposed Services:
# LoadBalancer services with external IPs
kubectl get services -A -o json | \
jq '.items[] | select(.spec.type=="LoadBalancer") | {name: .metadata.name, namespace: .metadata.namespace, ip: .status.loadBalancer.ingress[0].ip}'
# NodePort services (expose ports on all nodes)
kubectl get services -A -o json | \
jq '.items[] | select(.spec.type=="NodePort") | {name: .metadata.name, namespace: .metadata.namespace, ports: .spec.ports[].nodePort}'
# Ingress endpoints
kubectl get ingress -A
Use Port Reference to document Kubernetes service ports:
- Search for "Kubernetes" to see standard component ports
- Document critical ports requiring firewall protection:
  - 6443 - Kubernetes API server (must be secured)
  - 10250 - Kubelet API (should require authentication)
  - 2379-2380 - etcd client and peer ports (control plane only)
  - 10259 - Kube-scheduler secure port (localhost binding only; replaces the removed insecure port 10251)
  - 10257 - Kube-controller-manager secure port (localhost binding only; replaces the removed insecure port 10252)
- Identify non-standard ports used by your applications
- Map ports to cloud security group rules (AWS/GCP/Azure)
Scan for Insecure Configurations:
# Privileged containers (can access host resources)
kubectl get pods -A -o json | \
jq '.items[] | select(.spec.containers[].securityContext.privileged==true) | {name: .metadata.name, namespace: .metadata.namespace}'
# Host namespace sharing (breaks isolation)
kubectl get pods -A -o json | \
jq '.items[] | select(.spec.hostNetwork==true or .spec.hostPID==true or .spec.hostIPC==true) | {name: .metadata.name, namespace: .metadata.namespace}'
# Containers running as root
kubectl get pods -A -o json | \
jq '.items[] | select(.spec.containers[].securityContext.runAsNonRoot==false) | {name: .metadata.name, namespace: .metadata.namespace}'
Step 1.4: RBAC Audit {#step-14-rbac-audit}
Why: Overly permissive RBAC policies are the #1 security vulnerability in production Kubernetes clusters. Service accounts with cluster-admin or wildcard (*) permissions can be exploited for privilege escalation and lateral movement.
Audit Cluster-Wide Permissions:
# Find ClusterRoles with wildcard permissions
kubectl get clusterroles -o json | \
jq '.items[] | select(.rules[].verbs[] | contains("*")) | .metadata.name'
# Identify cluster-admin bindings
kubectl get clusterrolebindings -o json | \
jq '.items[] | select(.roleRef.name=="cluster-admin") | {name: .metadata.name, subjects: .subjects}'
# List service accounts with elevated privileges
kubectl get clusterrolebindings -o json | \
jq '.items[] | select(.subjects[]?.kind=="ServiceAccount") | {name: .metadata.name, sa: .subjects[].name, role: .roleRef.name}'
Use Diff Checker to compare RBAC policies:
- Export RBAC configurations from dev and prod environments (one file per cluster):
  kubectl get roles,rolebindings,clusterroles,clusterrolebindings -A -o yaml > rbac-prod.yaml
- Paste both configurations into the Diff Checker
- Identify permission drift: new roles, removed restrictions, escalated privileges
- Detect overly permissive patterns: wildcard verbs/resources (`*`)
- Check whether service account auto-mount settings differ between environments
Red Flags to Investigate (each can be spot-checked with the kubectl sketch below):
- Service accounts with the `cluster-admin` role
- Roles with `*` verbs (all actions)
- Roles with `*` resources (all resource types)
- Cross-namespace role bindings (security boundary violations)
- The `default` service account with non-default permissions
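To confirm whether a flagged service account really holds dangerous permissions, impersonate it with kubectl auth can-i. A quick sketch (my-app-sa and the production namespace are placeholders for your own names):
# List everything the service account is allowed to do in its namespace
kubectl auth can-i --list --as=system:serviceaccount:production:my-app-sa -n production
# Spot-check high-risk, cluster-wide actions
kubectl auth can-i create clusterrolebindings --as=system:serviceaccount:production:my-app-sa
kubectl auth can-i '*' '*' --as=system:serviceaccount:production:my-app-sa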
Step 1.5: Compliance Mapping {#step-15-compliance-mapping}
Why: Security controls must map to your compliance obligations. This mapping demonstrates due diligence to auditors and ensures your hardening efforts address regulatory requirements.
Framework Alignment:
CIS Kubernetes Benchmark v1.12:
- 100+ automated security checks
- Scored vs not-scored recommendations
- Evidence collection for audits
NSA/CISA Kubernetes Hardening Guide v1.2:
- Pod security standards enforcement
- Network policy implementation
- Supply chain security (image signing, scanning)
- Authentication and authorization best practices
- Audit logging and monitoring
Pod Security Standards:
- Privileged - Unrestricted (development only)
- Baseline - Minimally restrictive (prevents known privilege escalations)
- Restricted - Highly restrictive (production best practice)
Industry Compliance:
| Framework | Key Requirements | Kubernetes Controls |
|---|---|---|
| PCI-DSS | Network segmentation, encryption, access control | Network policies, pod security, RBAC, audit logs |
| HIPAA | Access controls, audit trails, encryption | RBAC, etcd encryption, comprehensive logging |
| SOC 2 | Change management, monitoring, incident response | GitOps, Falco runtime detection, audit logging |
| GDPR | Data protection, breach notification, access control | Secrets encryption, audit logs, RBAC |
Deliverable: CIS benchmark report with pass/fail percentages, RBAC audit spreadsheet, attack surface inventory, compliance gap analysis document
Stage 2: Control Plane Hardening (30-45 minutes) {#stage-2-control-plane-hardening-30-45-minutes}
The control plane is the brain of your Kubernetes cluster. Compromising the API server, etcd, or kubelet provides attackers with complete cluster control. This stage implements defense-in-depth for control plane components.
Step 2.1: API Server Security {#step-21-api-server-security}
Why: The Kubernetes API server is the gateway to your cluster. It processes all kubectl commands, manages authentication, enforces RBAC policies, and communicates with etcd. A compromised API server means complete cluster compromise.
Recommended API Server Flags:
# /etc/kubernetes/manifests/kube-apiserver.yaml (self-hosted clusters)
spec:
containers:
- command:
- kube-apiserver
- --anonymous-auth=false # Disable anonymous requests
- --authorization-mode=RBAC,Node # Enable RBAC authorization
- --enable-admission-plugins=NodeRestriction,PodSecurity,ServiceAccount
- --audit-log-path=/var/log/kube-audit.log # Enable audit logging
- --audit-log-maxage=30 # Retain 30 days
- --audit-log-maxsize=100 # 100MB per file
- --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
- --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
- --client-ca-file=/etc/kubernetes/pki/ca.crt
- --service-account-key-file=/etc/kubernetes/pki/sa.pub
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
Critical Configurations:
- Authentication: Disable anonymous authentication unless specifically required
- Authorization: Use RBAC + Node authorization (not AlwaysAllow)
- Admission Controllers: Enable security-focused plugins
- TLS: Require valid certificates for all connections
- Audit Logging: Comprehensive logging for forensics and compliance
Use X.509 Certificate Decoder to validate API server certificates:
- Extract the API server certificate: `cat /etc/kubernetes/pki/apiserver.crt`
- Paste the certificate into the decoder tool
- Verify critical fields:
  - Validity period: not expired, reasonable expiration (90 days typical)
  - Subject Alternative Names (SANs): includes `kubernetes`, `kubernetes.default`, `kubernetes.default.svc`, and the cluster IP
  - Signature algorithm: SHA-256 or better (NOT SHA-1)
  - Key size: RSA-2048 minimum (RSA-4096 preferred)
- Check certificate chain: verify intermediate CAs and root trust
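If you prefer to verify these fields on the host itself, openssl surfaces the same information; a quick sketch:
# Validity window, subject, and issuer
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -subject -issuer -dates
# Subject Alternative Names
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
# Signature algorithm and key size
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -E "Signature Algorithm|Public-Key"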
Restrict API Server Access:
# Cloud provider security groups (AWS example)
# Allow API access only from:
# - VPN gateway IPs
# - Bastion host IPs
# - CI/CD system IPs
# - Admin workstation IPs
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxx \
--protocol tcp \
--port 6443 \
--cidr 10.0.0.0/8 # Internal VPC only
Disable Insecure Features:
- --insecure-port=0 # Disable HTTP port 8080 (legacy flag; removed entirely in recent releases)
- --profiling=false # Disable profiling endpoint
- --enable-swagger-ui=false # Disable Swagger UI (legacy flag; removed in recent releases)
Step 2.2: etcd Security {#step-22-etcd-security}
Why: etcd stores ALL cluster state including Secrets (even if "encrypted"). Compromising etcd provides access to every credential, certificate, and configuration in your cluster. The CISA Alert AA20-301A specifically calls out exposed etcd as a critical Kubernetes vulnerability.
Enable TLS Encryption:
# /etc/kubernetes/manifests/etcd.yaml
spec:
containers:
- command:
- etcd
- --cert-file=/etc/kubernetes/pki/etcd/server.crt
- --key-file=/etc/kubernetes/pki/etcd/server.key
- --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
- --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
- --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
- --peer-client-cert-auth=true # Require peer client certs
- --client-cert-auth=true # Require client certs
Implement Automated Backups:
#!/bin/bash
# /usr/local/bin/etcd-backup.sh
BACKUP_DIR=/backup/etcd
DATE=$(date +%Y%m%d-%H%M%S)
# Create encrypted etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_DIR}/etcd-${DATE}.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Encrypt backup with GPG
gpg --encrypt --recipient [email protected] ${BACKUP_DIR}/etcd-${DATE}.db
# Upload to S3 with encryption
aws s3 cp ${BACKUP_DIR}/etcd-${DATE}.db.gpg \
s3://company-k8s-backups/etcd/ \
--sse AES256
# Retain 30 days of backups
find ${BACKUP_DIR} -name "etcd-*.db*" -mtime +30 -delete
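A snapshot you have never verified is not a real backup. A small sketch that could be appended to the script above (it reuses the BACKUP_DIR and DATE variables) to check snapshot integrity and periodically rehearse a restore into a scratch directory:
# Report the snapshot's hash, revision count, and size; a corrupt file fails here
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_DIR}/etcd-${DATE}.db --write-out=table
# Rehearse a restore into a throwaway directory (does not touch the running cluster)
ETCDCTL_API=3 etcdctl snapshot restore ${BACKUP_DIR}/etcd-${DATE}.db \
  --data-dir=/tmp/etcd-restore-test && rm -rf /tmp/etcd-restore-test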
Restrict etcd Access:
# Firewall rules: Only API servers can connect to etcd
iptables -A INPUT -p tcp --dport 2379 -s 10.0.1.10 -j ACCEPT # API server 1
iptables -A INPUT -p tcp --dport 2379 -s 10.0.1.11 -j ACCEPT # API server 2
iptables -A INPUT -p tcp --dport 2379 -j DROP # Block all others
Best Practice: Run etcd on dedicated nodes separate from application workloads in production clusters.
Step 2.3: Kubelet Hardening {#step-23-kubelet-hardening}
Why: The kubelet is the agent running on every node that manages pods and containers. An exposed kubelet API allows attackers to execute commands in containers, access secrets, and potentially escape to the host. Port 10250 (kubelet API) is actively scanned by attackers.
Secure Kubelet Configuration:
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
anonymous:
enabled: false # Disable anonymous access
webhook:
enabled: true # Use API server for auth
authorization:
mode: Webhook # API server authorization
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
rotateCertificates: true # Auto-rotate certificates
protectKernelDefaults: true # Protect kernel parameters
readOnlyPort: 0 # Disable insecure port 10255
serverTLSBootstrap: true # Bootstrap TLS certificates
Use Nmap Command Builder to scan kubelet endpoints:
- Enter your node IP addresses
- Specify kubelet ports to scan:
  - 10250 - Kubelet API (should require authentication)
  - 10255 - Read-only kubelet API (should be DISABLED)
- Generate Nmap commands:
  # Check whether the read-only port is exposed
  nmap -p 10255 <node-ip>
  # Inspect the kubelet API's TLS certificate
  nmap -p 10250 --script ssl-cert <node-ip>
- Verify no unauthorized access:
  # Should return 401 Unauthorized
  curl -k https://<node-ip>:10250/pods
Validate Kubelet Security:
# Read-only port should be disabled (connection refused)
curl http://<node-ip>:10255/metrics
# Expected: Connection refused
# Authenticated port should require certificates
curl -k https://<node-ip>:10250/pods
# Expected: Unauthorized (401)
# With valid certificate should succeed
curl --cacert /etc/kubernetes/pki/ca.crt \
--cert /etc/kubernetes/pki/admin.crt \
--key /etc/kubernetes/pki/admin.key \
https://<node-ip>:10250/pods
# Expected: JSON pod list
Step 2.4: Controller Manager & Scheduler Security {#step-24-controller-manager-scheduler-security}
Why: While less frequently targeted than the API server, the controller manager and scheduler have elevated privileges. Binding to 0.0.0.0 (all interfaces) exposes unnecessary attack surface.
Controller Manager Flags:
# /etc/kubernetes/manifests/kube-controller-manager.yaml
spec:
containers:
- command:
- kube-controller-manager
- --bind-address=127.0.0.1 # Localhost only
- --use-service-account-credentials=true # Individual SA creds
- --root-ca-file=/etc/kubernetes/pki/ca.crt
- --service-account-private-key-file=/etc/kubernetes/pki/sa.key
- --profiling=false # Disable profiling
Scheduler Flags:
# /etc/kubernetes/manifests/kube-scheduler.yaml
spec:
containers:
- command:
- kube-scheduler
- --bind-address=127.0.0.1 # Localhost only
- --profiling=false # Disable profiling
Validation:
# Verify components only listen on localhost
netstat -tulpn | grep -E 'kube-controller|kube-scheduler'
# Should show 127.0.0.1:10257 and 127.0.0.1:10259
Deliverable: Hardened control plane configuration files, API server certificate validation report, kubelet security baseline, etcd backup procedures
Stage 3: RBAC & Service Account Management (25-35 minutes) {#stage-3-rbac-service-account-management-25-35-minutes}
Role-Based Access Control (RBAC) is your first line of defense against unauthorized access and privilege escalation. This stage implements least privilege principles and secures service account usage.
Step 3.1: RBAC Best Practices Implementation {#step-31-rbac-best-practices-implementation}
Why: Default Kubernetes installations often include overly permissive default roles. The principle of least privilege requires granting only the minimum permissions necessary for each workload or user.
Least Privilege Principles:
- Avoid `cluster-admin` - Reserve for break-glass emergencies only
- Namespace-scoped Roles - Use `Role` instead of `ClusterRole` when possible
- Specific verbs - Grant `get`, `list`, `watch` instead of `*`
- Limited resources - Specify exact resource types, avoid `*`
- Regular audits - Review permissions quarterly
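kubectl can scaffold least-privilege roles for you, which avoids hand-writing rules and makes wildcards an explicit choice rather than a default. A quick sketch that generates YAML equivalent to the read-only example below:
# Generate a namespace-scoped, read-only Role for pods and pod logs
kubectl create role pod-reader \
  --verb=get,list,watch \
  --resource=pods,pods/log \
  --namespace=production \
  --dry-run=client -o yaml
# Bind it to a single service account only
kubectl create rolebinding read-pods \
  --role=pod-reader \
  --serviceaccount=production:monitoring-sa \
  --namespace=production \
  --dry-run=client -o yaml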
Example: Restrictive Read-Only Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-reader
namespace: production
rules:
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: read-pods
namespace: production
subjects:
- kind: ServiceAccount
name: monitoring-sa
namespace: production
roleRef:
kind: Role
name: pod-reader
apiGroup: rbac.authorization.k8s.io
Example: Application Deployment Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: deployment-manager
namespace: production
rules:
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
resources: ["pods", "pods/log"]
verbs: ["get", "list", "watch"]
- apiGroups: [""]
resources: ["services", "configmaps"]
verbs: ["get", "list"]
Anti-Pattern: Overly Permissive Role (DO NOT USE)
# ❌ BAD - Wildcard permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: super-user # Dangerous!
rules:
- apiGroups: ["*"]
resources: ["*"]
verbs: ["*"]
Step 3.2: Service Account Security {#step-32-service-account-security}
Why: Every pod runs with a service account (the `default` service account if none is specified). By default, service account tokens are automatically mounted into pods, providing unnecessary cluster access. The Kubernetes documentation warns that default token mounting "can be used to escalate privileges."
Disable Default Token Auto-Mounting:
apiVersion: v1
kind: ServiceAccount
metadata:
name: my-app-sa
namespace: production
automountServiceAccountToken: false # Disable by default
Create Dedicated Service Accounts:
apiVersion: v1
kind: ServiceAccount
metadata:
name: web-frontend-sa
namespace: production
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
name: web-frontend
namespace: production
spec:
serviceAccountName: web-frontend-sa # Use dedicated SA
automountServiceAccountToken: true # Enable only when needed
containers:
- name: nginx
image: nginx:1.27-alpine
Set Namespace Default:
# Disable auto-mount for default service account
apiVersion: v1
kind: ServiceAccount
metadata:
name: default
namespace: production
automountServiceAccountToken: false
Use JWT Decoder to validate service account tokens:
- Extract a service account token from a pod:
  kubectl exec <pod-name> -- cat /var/run/secrets/kubernetes.io/serviceaccount/token
- Paste the JWT token into the decoder
- Verify token claims:
  - iss (Issuer): should be your Kubernetes API server
  - sub (Subject): should match `system:serviceaccount:<namespace>:<name>`
  - kubernetes.io/serviceaccount/namespace: correct namespace
  - kubernetes.io/serviceaccount/service-account.name: correct SA name
  - exp (Expiration): token should have a reasonable lifetime (not years)
- Detect long-lived tokens requiring rotation (see the local decoding sketch below)
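If you would rather not paste a live credential into a browser tool, the token payload can be decoded locally. A minimal sketch (the pod name is a placeholder; the shell lines handle the base64url alphabet and missing padding):
# Grab the mounted token and decode its payload (second JWT segment) offline
TOKEN=$(kubectl exec <pod-name> -- cat /var/run/secrets/kubernetes.io/serviceaccount/token)
PAYLOAD=$(echo "$TOKEN" | cut -d. -f2 | tr '_-' '/+')
# Restore the base64 padding that JWT encoding strips
case $(( ${#PAYLOAD} % 4 )) in 2) PAYLOAD="${PAYLOAD}==";; 3) PAYLOAD="${PAYLOAD}=";; esac
echo "$PAYLOAD" | base64 -d | jq '{iss, sub, aud, exp}'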
Token Rotation:
# Kubernetes 1.22+ uses time-bound tokens by default
# Tokens expire after 1 hour and are automatically rotated
spec:
serviceAccountName: my-app-sa
volumes:
- name: kube-api-access
projected:
sources:
- serviceAccountToken:
expirationSeconds: 3600 # 1 hour expiration
path: token
Step 3.3: RBAC Audit & Monitoring {#step-33-rbac-audit-monitoring}
Why: RBAC policies drift over time as developers add "temporary" permissions that become permanent. Regular audits identify privilege creep and unauthorized escalation.
Audit Excessive Permissions:
# Service accounts with ClusterRole bindings
kubectl get clusterrolebindings -o json | \
jq '.items[] | select(.subjects[]?.kind=="ServiceAccount") | {
name: .metadata.name,
sa: .subjects[].name,
namespace: .subjects[].namespace,
role: .roleRef.name
}'
# Users with wildcard permissions
kubectl get roles,clusterroles -A -o json | \
jq '.items[] | select(.rules[].verbs[] | contains("*")) | {
kind: .kind,
name: .metadata.name,
namespace: .metadata.namespace
}'
# Roles that can create pods (potential privilege escalation)
kubectl get roles,clusterroles -A -o json | \
jq '.items[] | select(.rules[] | .resources[]? == "pods" and (.verbs[]? | contains("create"))) | {
kind: .kind,
name: .metadata.name,
namespace: .metadata.namespace
}'
Implement RBAC as Code (GitOps):
# Directory structure
rbac/
├── namespaces/
│ ├── production/
│ │ ├── roles.yaml
│ │ ├── rolebindings.yaml
│ │ └── serviceaccounts.yaml
│ └── staging/
│ └── ...
└── cluster/
├── clusterroles.yaml
└── clusterrolebindings.yaml
Automated Testing with OPA:
# policies/rbac-test.rego
package kubernetes.rbac
# Deny wildcard permissions
deny[msg] {
input.kind == "ClusterRole"
input.rules[_].verbs[_] == "*"
msg = sprintf("ClusterRole %s uses wildcard verbs", [input.metadata.name])
}
# Deny cluster-admin for service accounts
deny[msg] {
input.kind == "ClusterRoleBinding"
input.roleRef.name == "cluster-admin"
input.subjects[_].kind == "ServiceAccount"
msg = sprintf("Service account bound to cluster-admin: %s", [input.subjects[_].name])
}
Step 3.4: User Authentication Integration {#step-34-user-authentication-integration}
Why: Kubernetes doesn't have built-in user management. Production clusters should integrate with enterprise identity providers (IdP) using OpenID Connect (OIDC) for centralized authentication, Single Sign-On (SSO), and Multi-Factor Authentication (MFA).
Use OAuth/OIDC Debugger to configure identity providers:
- Enter your identity provider details:
  - Provider: Google, Okta, Azure AD, Auth0
  - Client ID: From your IdP application registration
  - Client Secret: Confidential client credential
  - Redirect URI: `http://localhost:8000` (for the kubectl OIDC plugin)
- Test authentication flow:
  - Generate authorization URL
  - Complete OAuth flow
  - Validate returned JWT token
  - Verify token claims include required groups/roles
- Generate PKCE code verifier/challenge for enhanced security
Enable OIDC Authentication:
# API server OIDC flags
spec:
containers:
- command:
- kube-apiserver
- --oidc-issuer-url=https://accounts.google.com
- --oidc-client-id=kubernetes-auth
- --oidc-username-claim=email
- --oidc-groups-claim=groups
- --oidc-ca-file=/etc/kubernetes/pki/oidc-ca.crt # If using custom CA
kubectl Configuration:
# ~/.kube/config
users:
- name: [email protected]
user:
auth-provider:
config:
client-id: kubernetes-auth
client-secret: <secret>
id-token: <jwt-token>
idp-issuer-url: https://accounts.google.com
refresh-token: <refresh-token>
name: oidc
Map IdP Groups to RBAC:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: developers-view
subjects:
- kind: Group
name: [email protected] # From OIDC groups claim
apiGroup: rbac.authorization.k8s.io
roleRef:
kind: ClusterRole
name: view
apiGroup: rbac.authorization.k8s.io
Deliverable: RBAC policy repository, service account inventory with auto-mount status, permission audit report, OIDC integration configuration
Stage 4: Network Policies & Segmentation (30-40 minutes) {#stage-4-network-policies-segmentation-30-40-minutes}
By default, Kubernetes allows all pod-to-pod communication. Network policies implement zero-trust networking and microsegmentation to limit lateral movement after a breach.
Step 4.1: Network Policy Fundamentals {#step-41-network-policy-fundamentals}
Why: Without network policies, a compromised pod can communicate with any other pod in the cluster. Attackers exploit this to move laterally, scan for services, exfiltrate data, and access databases. The NSA/CISA Kubernetes Hardening Guide specifically mandates network policy implementation.
Default Deny-All Policy (Recommended Starting Point):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {} # Applies to all pods
policyTypes:
- Ingress
- Egress
Important: Apply default-deny ONLY after creating allow policies for required traffic, or you'll break your applications.
Phased Rollout Strategy:
- Week 1: Deploy in monitoring mode (audit without enforcement)
- Week 2: Apply to non-production namespaces
- Week 3: Apply to production with allow-all policy
- Week 4: Replace allow-all with specific allow policies
- Week 5: Monitor for violations and adjust
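At each phase, verify enforcement with a disposable client pod rather than waiting for an application to break. A small sketch (the backend service name and port are placeholders for one of your own workloads):
# With only default-deny in place this request should fail (DNS and egress are blocked);
# after the matching allow policies are added, it should succeed again
kubectl run netpol-check --rm -it --restart=Never -n production \
  --image=busybox:1.36 -- \
  wget -qO- -T 5 http://backend.production.svc.cluster.local:3000/healthz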
Step 4.2: Application-Specific Network Policies {#step-42-application-specific-network-policies}
Why: Zero-trust networking means explicitly allowing only required communication paths. Each application should have a tailored policy based on its actual traffic patterns.
Example: Three-Tier Application
# Frontend web application
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: frontend-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: frontend
tier: web
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx # Traffic from ingress controller
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: backend
tier: api # Allow to backend API
ports:
- protocol: TCP
port: 3000
- to: # DNS egress (required!)
- namespaceSelector:
matchLabels:
name: kube-system
- podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
---
# Backend API application
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: backend-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: backend
tier: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend # Only from frontend
ports:
- protocol: TCP
port: 3000
egress:
- to:
- podSelector:
matchLabels:
app: database
tier: data # Allow to database
ports:
- protocol: TCP
port: 5432
- to: # DNS egress
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
---
# Database (no outbound except DNS)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: database-netpol
namespace: production
spec:
podSelector:
matchLabels:
app: database
tier: data
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: backend # Only from backend
ports:
- protocol: TCP
port: 5432
egress:
- to: # DNS only
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
Step 4.3: DNS Security & Egress Control {#step-43-dns-security-egress-control}
Why: Malware and data exfiltration often use DNS tunneling or connections to command-and-control (C2) servers. Restricting egress to approved destinations limits the blast radius of a compromised pod.
Use DNS Lookup to validate CoreDNS resolution:
- Test DNS from within a pod:
  kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
- Use the DNS Lookup tool to verify:
  - Service discovery: `backend.production.svc.cluster.local` resolves
  - CoreDNS health: `kube-dns.kube-system.svc.cluster.local` responds
  - External resolution: external domains resolve (if allowed)
- Validate DNS egress in network policies:
  # Should succeed if DNS egress is allowed
  kubectl exec <pod> -- nslookup google.com
  # Should time out if only internal DNS is allowed
  kubectl exec <pod> -- nslookup external-site.com
Egress Allowlist (Approved External Destinations):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: egress-allowlist
namespace: production
spec:
podSelector:
matchLabels:
app: web
policyTypes:
- Egress
egress:
- to:
- podSelector: {} # Intra-namespace
- to:
- namespaceSelector: {} # Inter-namespace
- to: # AWS S3 (specific CIDR)
- ipBlock:
cidr: 52.216.0.0/15
ports:
- protocol: TCP
port: 443
- to: # Container registry
- ipBlock:
cidr: 0.0.0.0/0
except:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
ports:
- protocol: TCP
port: 443
- to: # DNS
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
DNS Policy Enforcement with CoreDNS:
# ConfigMap for CoreDNS
apiVersion: v1
kind: ConfigMap
metadata:
name: coredns
namespace: kube-system
data:
Corefile: |
.:53 {
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
        # Sinkhole known malicious domains (example using the built-in hosts plugin;
        # CoreDNS has no "block" plugin, and unknown directives prevent CoreDNS from starting)
        hosts {
          0.0.0.0 malicious-domain.com
          0.0.0.0 phishing-site.net
          fallthrough
        }
}
Step 4.4: Service Mesh Security (Advanced) {#step-44-service-mesh-security-advanced}
Why: Network policies operate at Layer 3/4 (IP addresses and ports). Service meshes add Layer 7 (application-level) controls, mutual TLS (mTLS) encryption, and identity-based access control. This is defense-in-depth for high-security environments.
Mutual TLS with Istio:
# Enable strict mTLS for entire namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
name: default
namespace: production
spec:
mtls:
mode: STRICT # Require mTLS for all traffic
---
# Authorization policy (Layer 7)
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
name: frontend-authz
namespace: production
spec:
selector:
matchLabels:
app: frontend
action: ALLOW
rules:
- from:
- source:
principals: ["cluster.local/ns/ingress-nginx/sa/ingress-nginx"]
to:
- operation:
methods: ["GET", "POST"]
paths: ["/api/*"]
Use Certificate Transparency Lookup to monitor mesh certificates:
- Enter your service mesh certificate domain (e.g., `*.production.svc.cluster.local`)
- Monitor Certificate Transparency logs for:
- Newly issued service mesh certificates
- Unauthorized certificate issuance (anomaly detection)
- Certificate expiration timeline
- Set up alerts for unexpected certificate creation
- Track certificate authorities used by your mesh
Service Mesh Benefits:
- mTLS encryption: All pod-to-pod traffic encrypted
- Identity-based access: Authorization based on service identity (not IPs)
- Traffic management: Circuit breaking, retries, timeouts
- Observability: Distributed tracing, metrics, logging
When to Use Service Mesh:
- High-security environments: Finance, healthcare, defense
- Compliance requirements: PCI-DSS, HIPAA requiring encryption in transit
- Zero-trust architecture: Identity-based access control
- Microservices at scale: 50+ services with complex communication patterns
Deliverable: Network policy repository, traffic flow diagrams, egress allowlist documentation, service mesh configuration (if applicable)
Stage 5: Pod Security Standards & Admission Control (25-35 minutes) {#stage-5-pod-security-standards-admission-control-25-35-minutes}
Pod Security Standards (PSS) enforce secure pod configurations, preventing common container escapes and privilege escalation attacks. This stage implements Pod Security Admission and policy enforcement engines.
Step 5.1: Pod Security Standards Implementation {#step-51-pod-security-standards-implementation}
Why: Privileged pods, root containers, and excessive capabilities are the primary attack vectors for container escapes to the host operating system. Pod Security Standards codify best practices as enforceable policies.
Pod Security Levels:
| Level | Use Case | Restrictions |
|---|---|---|
| Privileged | Unrestricted (development only) | None - all capabilities allowed |
| Baseline | Minimally restrictive | Prevents known privilege escalations, allows defaults |
| Restricted | Highly restrictive (production) | Defense-in-depth best practices, maximum security |
Enable Pod Security Admission (PSA):
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted # Block non-compliant pods
pod-security.kubernetes.io/audit: restricted # Log violations
pod-security.kubernetes.io/warn: restricted # Warn users
---
apiVersion: v1
kind: Namespace
metadata:
name: staging
labels:
pod-security.kubernetes.io/enforce: baseline # Less strict for testing
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
apiVersion: v1
kind: Namespace
metadata:
name: kube-system
labels:
pod-security.kubernetes.io/enforce: privileged # System components need privileges
pod-security.kubernetes.io/audit: privileged
pod-security.kubernetes.io/warn: privileged
Restricted Level Requirements:
- Run as non-root user
- Read-only root filesystem
- Drop all capabilities
- Disallow privilege escalation
- Seccomp profile (RuntimeDefault or Localhost)
- No host namespace sharing (hostNetwork, hostPID, hostIPC)
- No host path volumes
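Before switching enforce to restricted on a namespace that already has workloads, preview which running pods would be rejected using a server-side dry run of the label change; a quick sketch:
# Reports violations as warnings without modifying the namespace
kubectl label --dry-run=server --overwrite namespace production \
  pod-security.kubernetes.io/enforce=restricted
# Same check across every namespace at once
kubectl label --dry-run=server --overwrite namespace --all \
  pod-security.kubernetes.io/enforce=restricted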
Step 5.2: Secure Pod Configuration {#step-52-secure-pod-configuration}
Why: Even with Pod Security Admission enabled, you must configure your workloads to comply with security standards. This example shows a production-ready pod configuration meeting the Restricted level.
Secure Pod Example:
apiVersion: v1
kind: Pod
metadata:
name: secure-app
namespace: production
spec:
securityContext:
runAsNonRoot: true # Prevent root user
runAsUser: 10001 # Specific non-root UID
fsGroup: 10001 # File system group
seccompProfile: # Seccomp profile
type: RuntimeDefault
containers:
- name: app
image: myapp:1.0
securityContext:
allowPrivilegeEscalation: false # No privilege escalation
readOnlyRootFilesystem: true # Immutable filesystem
runAsNonRoot: true
runAsUser: 10001
capabilities:
drop:
- ALL # Drop all capabilities
volumeMounts:
- name: cache
mountPath: /cache
- name: tmp
mountPath: /tmp
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
volumes:
- name: cache
emptyDir: {} # Ephemeral storage
- name: tmp
emptyDir: {}
Read-Only Root Filesystem:
# Applications need writable directories for tmp, cache, logs
apiVersion: v1
kind: Pod
metadata:
name: app-with-readonly-root
spec:
containers:
- name: app
image: nginx:1.27-alpine
securityContext:
readOnlyRootFilesystem: true
volumeMounts:
- name: cache
mountPath: /var/cache/nginx # Nginx needs writable cache
- name: run
mountPath: /var/run # PID file location
- name: tmp
mountPath: /tmp
volumes:
- name: cache
emptyDir: {}
- name: run
emptyDir: {}
- name: tmp
emptyDir: {}
Step 5.3: OPA/Kyverno Policy Enforcement {#step-53-opa-kyverno-policy-enforcement}
Why: Pod Security Admission provides basic controls, but organizations often need custom policies: enforce specific image registries, require resource limits, mandate labels, block latest tags. Policy engines enable custom, auditable security rules.
Kyverno Installation:
# Install Kyverno via Helm
helm repo add kyverno https://kyverno.github.io/kyverno/
helm install kyverno kyverno/kyverno \
--namespace kyverno \
--create-namespace \
--set replicaCount=3
# Verify installation
kubectl get pods -n kyverno
Example Kyverno Policies:
# Block privileged containers
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-privileged-containers
annotations:
policies.kyverno.io/title: Disallow Privileged Containers
policies.kyverno.io/severity: high
spec:
validationFailureAction: enforce
background: true
rules:
- name: check-privileged
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Privileged containers are not allowed"
pattern:
spec:
=(initContainers):
- =(securityContext):
=(privileged): false
containers:
- =(securityContext):
=(privileged): false
---
# Require resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-resource-limits
spec:
validationFailureAction: enforce
background: true
rules:
- name: check-resource-limits
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Resource limits are required"
pattern:
spec:
containers:
- resources:
limits:
memory: "?*"
cpu: "?*"
---
# Disallow latest tag
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: disallow-latest-tag
spec:
validationFailureAction: enforce
background: true
rules:
- name: require-image-tag
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Image tag 'latest' is not allowed. Use specific version tags."
pattern:
spec:
containers:
- image: "!*:latest"
---
# Require approved registries
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: require-approved-registries
spec:
validationFailureAction: enforce
background: true
rules:
- name: check-registry
match:
any:
- resources:
kinds:
- Pod
validate:
message: "Images must be from approved registries: myregistry.io, gcr.io/mycompany"
pattern:
spec:
containers:
- image: "myregistry.io/* | gcr.io/mycompany/*"
Test Policy Enforcement:
# This should be REJECTED
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: privileged-test
spec:
containers:
- name: nginx
image: nginx:latest
securityContext:
privileged: true
EOF
# Expected output:
# Error from server: admission webhook "validate.kyverno.svc" denied the request:
# policy Pod/default/privileged-test for resource violation:
# disallow-privileged-containers:
# check-privileged: validation error: Privileged containers are not allowed
Step 5.4: Resource Limits & Quotas {#step-54-resource-limits-quotas}
Why: Without resource limits, a single pod can consume all CPU/memory on a node, causing denial of service. Without quotas, a single namespace can consume all cluster resources. Resource limits are a security control, not just an operational best practice.
Pod Resource Limits:
apiVersion: v1
kind: Pod
metadata:
name: app
spec:
containers:
- name: app
image: myapp:1.0
resources:
requests:
cpu: 250m # Minimum guaranteed
memory: 256Mi
limits:
cpu: 500m # Maximum allowed
memory: 512Mi
Namespace Resource Quota:
apiVersion: v1
kind: ResourceQuota
metadata:
name: production-quota
namespace: production
spec:
hard:
requests.cpu: "100" # Total CPU requests
requests.memory: 200Gi # Total memory requests
limits.cpu: "200" # Total CPU limits
limits.memory: 400Gi # Total memory limits
pods: "100" # Maximum pod count
services: "50"
persistentvolumeclaims: "25"
requests.storage: 1Ti
LimitRange (Default Limits):
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: production
spec:
limits:
- default: # Default limits
cpu: 500m
memory: 512Mi
defaultRequest: # Default requests
cpu: 100m
memory: 128Mi
max: # Maximum per container
cpu: 2
memory: 4Gi
min: # Minimum per container
cpu: 50m
memory: 64Mi
type: Container
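After applying the quota and LimitRange, confirm what is actually enforced and which pods still rely on injected defaults; a quick sketch:
# Current usage versus hard limits for the namespace
kubectl describe resourcequota production-quota -n production
# Defaults that will be injected into containers without explicit limits
kubectl describe limitrange default-limits -n production
# Pods that still have at least one container with no limits set
kubectl get pods -n production -o json | \
  jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null)) | .metadata.name'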
Deliverable: Pod Security Admission configuration per namespace, Kyverno/OPA policy repository, resource quota definitions, compliant pod templates
Stage 6: Image Security & Supply Chain (30-40 minutes) {#stage-6-image-security-supply-chain-30-40-minutes}
Container images are the foundation of your workloads. A vulnerable or malicious image compromises everything running from it. This stage secures the software supply chain from build to runtime.
Step 6.1: Container Registry Security {#step-61-container-registry-security}
Why: Public container images average 80+ vulnerabilities per image according to Snyk research. The Docker Hub breach in 2019 exposed 190,000 accounts. Your registry is a high-value target for supply chain attacks.
Enable Vulnerability Scanning:
AWS ECR:
# Enable image scanning on push
aws ecr put-image-scanning-configuration \
--repository-name myapp \
--image-scanning-configuration scanOnPush=true
# View scan results
aws ecr describe-image-scan-findings \
--repository-name myapp \
--image-id imageTag=1.0
Google Artifact Registry:
# Enable automatic scanning on push by activating the Container Scanning API
gcloud services enable containerscanning.googleapis.com
# Trigger an on-demand scan of an image and list its findings
gcloud artifacts docker images scan IMAGE_URI
gcloud artifacts docker images list-vulnerabilities SCAN_NAME   # SCAN_NAME is printed by the scan command
Harbor (Self-Hosted):
- Built-in Trivy and Clair scanner integration
- Automatic CVE database updates
- Policy enforcement (block images with critical CVEs)
Use Hash Generator to verify image integrity:
- Generate the SHA-256 digest of a container image:
  docker pull myregistry.io/myapp:1.0
  docker inspect myregistry.io/myapp:1.0 | grep -A1 RepoDigests
- Copy the image digest (sha256:abc123...)
- Use Hash Generator to create reference hashes
- Compare digests across registries:
  # Dev registry
  docker inspect dev-registry/myapp:1.0 | grep RepoDigests
  # Prod registry
  docker inspect prod-registry/myapp:1.0 | grep RepoDigests
  # The digests should match if it is the same image
- Detect image tampering or unauthorized modifications
Pin Images by Digest:
# ❌ BAD - Mutable tag
spec:
containers:
- name: app
image: myapp:1.0
# ✅ GOOD - Immutable digest
spec:
containers:
- name: app
image: myapp:1.0@sha256:abc123...
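To adopt digest pinning, resolve the tag you have already tested to its immutable digest and roll that out. A small sketch (the deployment and container names are placeholders):
# Resolve the tag to its registry digest after pulling
docker pull myregistry.io/myapp:1.0
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' myregistry.io/myapp:1.0)
echo "$DIGEST"   # e.g. myregistry.io/myapp@sha256:abc123...
# Point the workload at the digest instead of the mutable tag
kubectl set image deployment/myapp app="$DIGEST" -n production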
Step 6.2: Image Signing & Verification {#step-62-image-signing-verification}
Why: Anyone can push an image to a registry if they have credentials. Image signing provides cryptographic proof of authenticity and integrity. The SolarWinds supply chain attack demonstrated the catastrophic impact of unsigned, unverified software.
Install Sigstore Cosign:
# Install cosign
brew install cosign # macOS
# or
wget https://github.com/sigstore/cosign/releases/download/v2.2.0/cosign-linux-amd64
chmod +x cosign-linux-amd64
sudo mv cosign-linux-amd64 /usr/local/bin/cosign
Generate Signing Keys:
# Generate key pair
cosign generate-key-pair
# Securely store cosign.key (private key)
# Distribute cosign.pub (public key) to cluster
Sign Container Images:
# Sign image
cosign sign --key cosign.key myregistry.io/myapp:1.0
# Verify signature
cosign verify --key cosign.pub myregistry.io/myapp:1.0
Enforce Signature Verification with Kyverno:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
name: verify-image-signatures
annotations:
policies.kyverno.io/title: Verify Image Signatures
policies.kyverno.io/severity: high
spec:
validationFailureAction: enforce
background: false
webhookTimeoutSeconds: 30
rules:
- name: check-signature
match:
any:
- resources:
kinds:
- Pod
verifyImages:
- imageReferences:
- "myregistry.io/*"
attestors:
- count: 1
entries:
- keys:
publicKeys: |-
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
-----END PUBLIC KEY-----
Test Signature Enforcement:
# Unsigned image should be rejected
kubectl run test --image=myregistry.io/unsigned:1.0
# Expected:
# Error from server: admission webhook "mutate.kyverno.svc" denied the request:
# resource Pod/default/test was blocked due to the following policies:
# verify-image-signatures:
# check-signature: failed to verify image myregistry.io/unsigned:1.0
Step 6.3: Base Image Hardening {#step-63-base-image-hardening}
Why: Traditional base images (Ubuntu, Debian, CentOS) include hundreds of packages, package managers, shells, and utilities that expand the attack surface. Minimal base images eliminate unnecessary components.
Minimal Base Images:
# ❌ BAD - Full Ubuntu image (77MB, hundreds of packages)
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y myapp
COPY app /app
CMD ["/app"]
# ⚠️ BETTER - Alpine Linux (5MB, minimal packages)
FROM alpine:3.19
RUN apk add --no-cache myapp
COPY app /app
CMD ["/app"]
# ✅ BEST - Distroless (No shell, no package manager)
FROM gcr.io/distroless/static-debian12
COPY app /app
USER 10001:10001
ENTRYPOINT ["/app"]
# ✅ BEST - Scratch (Empty base, statically compiled binary only)
FROM scratch
COPY app /app
USER 10001:10001
ENTRYPOINT ["/app"]
Multi-Stage Build (Recommended):
# Build stage (includes compilers, build tools)
FROM golang:1.22 AS builder
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /app
# Runtime stage (minimal base)
FROM gcr.io/distroless/static-debian12
COPY --from=builder /app /app
USER 10001:10001
ENTRYPOINT ["/app"]
Benefits of Distroless/Scratch:
- Reduced attack surface: No shell, no package manager, no utilities
- Smaller image size: 10-100x smaller than Ubuntu base
- Fewer vulnerabilities: No unnecessary packages to patch
- Defense in depth: Even if attacker gains access, no tools available
Step 6.4: Continuous Image Scanning {#step-64-continuous-image-scanning}
Why: New vulnerabilities are discovered daily. An image that was clean yesterday may have critical CVEs today. Continuous scanning detects vulnerabilities in running production images.
Use CVE Vulnerability Search to monitor image CVEs:
- Enter base image package names: "ubuntu", "alpine", "openssl", "nginx"
- Filter by severity: CRITICAL, HIGH
- Visualize vulnerability trends by vendor
- Analyze CVE response times (time from disclosure to patch)
- Calculate CVSS scores for risk prioritization
- Set up alerts for new CVEs affecting your base images
Integrate Scanning in CI/CD:
# GitHub Actions example
name: Container Security
on: [push]
jobs:
scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Build image
run: docker build -t myapp:${{ github.sha }} .
- name: Scan with Trivy
uses: aquasecurity/trivy-action@master
with:
image-ref: myapp:${{ github.sha }}
format: 'sarif'
severity: 'CRITICAL,HIGH'
exit-code: '1' # Fail build on vulnerabilities
output: 'trivy-results.sarif'
- name: Upload to GitHub Security
uses: github/codeql-action/upload-sarif@v2
with:
sarif_file: 'trivy-results.sarif'
Runtime Scanning:
# Scan all running images
kubectl get pods -A -o json | \
jq -r '.items[].spec.containers[].image' | \
sort -u | \
while read image; do
echo "Scanning: $image"
trivy image --severity HIGH,CRITICAL $image
done
Step 6.5: Private Registry Configuration {#step-65-private-registry-configuration}
Why: Public registries have no access controls. Anyone can pull your images, analyze them for vulnerabilities, and discover credentials accidentally embedded. Private registries enforce authentication and provide audit trails.
Create Image Pull Secret:
kubectl create secret docker-registry regcred \
--docker-server=myregistry.io \
--docker-username=robot-account \
--docker-password=<token> \
[email protected] \
--namespace=production
Use in Pod Spec:
apiVersion: v1
kind: Pod
metadata:
name: app
namespace: production
spec:
imagePullSecrets:
- name: regcred # Reference pull secret
containers:
- name: app
image: myregistry.io/myapp:1.0
Set Default Pull Secret for Namespace:
# Patch default service account
kubectl patch serviceaccount default \
-n production \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'
Registry Access Controls:
- AWS ECR: IAM policies per repository
- GCP Artifact Registry: IAM roles per registry
- Azure ACR: RBAC with Azure AD integration
- Harbor: RBAC with LDAP/OIDC integration
Deliverable: Image scanning reports with CVE counts, signed image registry, base image hardening guidelines, CVE monitoring dashboard, private registry configuration
Stage 7: Runtime Security & Threat Detection (30-40 minutes) {#stage-7-runtime-security-threat-detection-30-40-minutes}
Static security controls (RBAC, network policies, pod security) prevent known bad configurations. Runtime security detects unknown threats: zero-day exploits, insider threats, advanced persistent threats (APTs).
Step 7.1: Runtime Security Tool Deployment {#step-71-runtime-security-tool-deployment}
Why: Falco is the CNCF-graduated standard for Kubernetes runtime security. It uses eBPF or kernel modules to monitor system calls, detecting malicious behavior in real-time based on predefined rules and anomaly detection.
Install Falco:
# Add Falco Helm repository
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Install Falco
helm install falco falcosecurity/falco \
--namespace falco \
--create-namespace \
--set driver.kind=ebpf \
--set falco.grpc.enabled=true \
--set falco.grpc_output.enabled=true
# Verify installation
kubectl get pods -n falco
kubectl logs -n falco -l app.kubernetes.io/name=falco
Falco Architecture:
┌────────────────────────────────────┐
│ Falco DaemonSet │
│ (Runs on every node) │
├────────────────────────────────────┤
│ eBPF Probe / Kernel Module │ ← Monitors system calls
│ ↓ │
│ Falco Rules Engine │ ← Evaluates against rules
│ ↓ │
│ Alerts │ ← Sends to outputs
│ ↓ │
│ Stdout / Syslog / SIEM / Webhook │
└────────────────────────────────────┘
Step 7.2: Behavioral Monitoring Rules {#step-72-behavioral-monitoring-rules}
Why: Attackers who gain initial access typically spawn shells, read credentials, modify configurations, or establish persistence. Falco rules detect these behaviors regardless of the exploit used to gain access.
Example Falco Detection Rules:
# /etc/falco/rules.d/custom_rules.yaml
# Detect shell spawned in container
- rule: Shell in Container
desc: Detect shell execution in container (common after exploitation)
condition: >
spawned_process and
container and
proc.name in (bash, sh, zsh, csh, ksh, ash)
output: >
Shell spawned in container
(user=%user.name container_id=%container.id container_name=%container.name
image=%container.image.repository proc=%proc.name parent=%proc.pname
cmdline=%proc.cmdline terminal=%proc.tty)
priority: WARNING
tags: [container, shell, mitre_execution]
# Detect privilege escalation via setuid
- rule: Privilege Escalation via Setuid
desc: Detect unexpected setuid system calls (commonly used for privilege escalation)
condition: >
  evt.type=setuid and evt.dir=> and
  container and
  not proc.name in (sudo, su, login)
output: >
Privilege escalation detected
(user=%user.name exe=%proc.exe parent=%proc.pname
container=%container.name)
priority: CRITICAL
tags: [privilege_escalation, mitre_privilege_escalation]
# Detect sensitive file access
- rule: Read Sensitive File
desc: Detect reading of sensitive files (credentials, keys)
condition: >
open_read and
container and
fd.name in (/etc/shadow, /etc/sudoers, /root/.ssh/id_rsa,
/var/run/secrets/kubernetes.io/serviceaccount/token)
output: >
Sensitive file accessed
(user=%user.name file=%fd.name container=%container.name
proc=%proc.name cmdline=%proc.cmdline)
priority: CRITICAL
tags: [filesystem, credentials, mitre_credential_access]
# Detect cryptomining (high CPU process)
- rule: Detect Cryptominer
desc: Detect known cryptomining processes
condition: >
spawned_process and
proc.name in (xmrig, ethminer, minerd, cgminer, bfgminer, cpuminer)
output: >
Cryptominer detected
(user=%user.name proc=%proc.name cmdline=%proc.cmdline
container=%container.name)
priority: CRITICAL
tags: [malware, cryptominer, mitre_impact]
# Detect network scanning
- rule: Network Scanning Detected
desc: Detect network scanning tools (nmap, masscan)
condition: >
spawned_process and
proc.name in (nmap, masscan, zmap, nc, netcat, socat)
output: >
Network scanning tool executed
(user=%user.name proc=%proc.name cmdline=%proc.cmdline
container=%container.name)
priority: WARNING
tags: [network, reconnaissance, mitre_discovery]
# Detect package manager execution (persistence)
- rule: Package Manager in Container
desc: Package managers shouldn't run in containers (immutable infrastructure)
condition: >
spawned_process and
container and
proc.name in (apt, apt-get, yum, dnf, apk, pip, npm)
output: >
Package manager executed in container
(user=%user.name proc=%proc.name cmdline=%proc.cmdline
container=%container.name image=%container.image.repository)
priority: WARNING
tags: [container, package_management, mitre_persistence]
Test Falco Rules:
# Trigger shell detection
kubectl exec -it <pod-name> -- /bin/bash
# Check Falco logs for alert
kubectl logs -n falco -l app.kubernetes.io/name=falco | grep "Shell spawned"
# Expected output:
# 12:34:56.789: Warning Shell spawned in container (user=root
# container_id=abc123 container_name=my-pod image=nginx
# proc=bash parent=kubectl terminal=1)
Step 7.3: Network Traffic Analysis {#step-73-network-traffic-analysis}
Why: Network observability provides visibility into pod-to-pod communication, DNS queries, and external connections. This detects lateral movement, data exfiltration, and communication with command-and-control (C2) servers.
Deploy Cilium Hubble (Network Observability):
# Upgrade Cilium with Hubble enabled
helm upgrade cilium cilium/cilium \
--namespace kube-system \
--reuse-values \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,http}"
# Install Hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz{,.sha256sum}
tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz
# Port-forward Hubble Relay
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &
# Observe network flows
hubble observe --namespace production --follow
Monitor for Suspicious Traffic:
# DNS queries to suspicious domains
hubble observe --type l7 --protocol dns | grep -E "(onion|bit|tor|pastebin)"
# Outbound connections to unusual ports
hubble observe --type l7 | grep -E "(port 4444|port 1337|port 6667)"
# Communication with blacklisted IPs
hubble observe --to-ip 198.51.100.10 # Known C2 server
# HTTP requests with suspicious user agents
hubble observe --type l7 --protocol http | grep -i "curl\|wget\|python"
Hubble UI:
# Access web interface
kubectl port-forward -n kube-system svc/hubble-ui 8080:80
# Open browser to http://localhost:8080
Step 7.4: Incident Response Integration {#step-74-incident-response-integration}
Why: Detecting threats is only valuable if you can respond quickly. Automated incident response reduces Mean Time to Response (MTTR) and limits blast radius.
Use Incident Response Playbook Generator to create Kubernetes runbooks:
- Select incident types relevant to Kubernetes:
- Container Escape - Pod breaks out of container to host OS
- Cryptomining - Unauthorized cryptocurrency mining
- Privilege Escalation - User/process gains elevated permissions
- Data Exfiltration - Sensitive data transferred externally
- Malware Execution - Known malicious software running
- Map to MITRE ATT&CK for Containers:
- T1611 - Escape to Host
- T1610 - Deploy Container
- T1609 - Container Administration Command
- T1613 - Container and Resource Discovery
- Define team roles:
- Incident Commander - Coordinates response
- Security Analyst - Investigates threat
- Platform Engineer - Implements remediation
- Communications - Stakeholder updates
- Generate customized playbooks with:
- Detection criteria (Falco alerts, metrics)
- Investigation steps (kubectl commands, log queries)
- Containment actions (quarantine pod, block IP)
- Eradication procedures (delete resources, patch vulnerabilities)
- Recovery steps (restore from backup, redeploy)
- Export to PDF/Markdown for offline access during incidents
Automated Response Actions:
# Falco Sidekick - Automated response to Falco alerts
apiVersion: v1
kind: ConfigMap
metadata:
name: falcosidekick
namespace: falco
data:
config.yaml: |
slack:
webhookurl: "https://hooks.slack.com/services/XXX"
minimumpriority: "warning"
pagerduty:
routingkey: "xxxxx"
minimumpriority: "critical"
kubeless:
function: "quarantine-pod"
minimumpriority: "critical"
Quarantine Pod Function (Example):
# quarantine-pod.py - Kubeless function triggered by Falcosidekick
import kubernetes

def quarantine_pod(event, context):
    """Isolate pod by labeling it and applying a restrictive network policy"""
    # Authenticate with the API server using the function pod's service account
    kubernetes.config.load_incluster_config()
    # Field names depend on the Falco rule's output format (the k8s.* fields must be included in the rule output)
    pod_name = event['data']['output_fields']['k8s.pod.name']
    namespace = event['data']['output_fields']['k8s.ns.name']
# Apply deny-all network policy
network_policy = {
"apiVersion": "networking.k8s.io/v1",
"kind": "NetworkPolicy",
"metadata": {
"name": f"quarantine-{pod_name}",
"namespace": namespace
},
"spec": {
"podSelector": {
"matchLabels": {
"quarantine": "true"
}
},
"policyTypes": ["Ingress", "Egress"]
}
}
# Label pod for quarantine
kubernetes.client.CoreV1Api().patch_namespaced_pod(
name=pod_name,
namespace=namespace,
body={"metadata": {"labels": {"quarantine": "true"}}}
)
# Create network policy
kubernetes.client.NetworkingV1Api().create_namespaced_network_policy(
namespace=namespace,
body=network_policy
)
return "Pod quarantined successfully"
Deliverable: Falco deployment with custom rules, Hubble network observability, incident response playbooks, automated response functions
Stage 8: Audit Logging & Compliance (20-30 minutes) {#stage-8-audit-logging-compliance-20-30-minutes}
Audit logs provide forensic evidence for security incidents, compliance requirements, and operational troubleshooting. This stage implements comprehensive audit logging for Kubernetes API activity.
Step 8.1: Kubernetes Audit Policy Configuration {#step-81-kubernetes-audit-policy-configuration}
Why: By default, Kubernetes logs minimal audit information. A comprehensive audit policy captures who did what, when, and with what result—critical for compliance (SOC 2, PCI-DSS, HIPAA) and incident investigation.
Audit Log Levels:
- None - Don't log
- Metadata - Log request metadata (user, timestamp, resource) but not request/response bodies
- Request - Log request metadata and request body
- RequestResponse - Log request metadata, request body, and response body
Comprehensive Audit Policy:
# /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Don't log system components (too noisy)
- level: None
  users:
    - system:kube-proxy
    - system:kube-scheduler
    - system:kube-controller-manager
  verbs: ["get", "list", "watch"]
# Audit policy user matching is exact (no wildcards), so match kube-system service accounts by group
- level: None
  userGroups: ["system:serviceaccounts:kube-system"]
  verbs: ["get", "list", "watch"]
# Log metadata for read operations
- level: Metadata
verbs: ["get", "list", "watch"]
omitStages:
- RequestReceived
# Log request and response for secret/configmap modifications
- level: RequestResponse
resources:
- group: ""
resources: ["secrets", "configmaps"]
verbs: ["create", "update", "patch", "delete"]
omitStages:
- RequestReceived
# Log request and response for RBAC changes (high risk)
- level: RequestResponse
resources:
- group: "rbac.authorization.k8s.io"
resources: ["clusterroles", "clusterrolebindings", "roles", "rolebindings"]
verbs: ["create", "update", "patch", "delete"]
omitStages:
- RequestReceived
# Log pod exec and portforward (shell access)
- level: RequestResponse
resources:
- group: ""
resources: ["pods/exec", "pods/portforward", "pods/attach"]
omitStages:
- RequestReceived
# Log request for all other create/update/patch/delete operations
- level: Request
verbs: ["create", "update", "patch", "delete"]
omitStages:
- RequestReceived
# Catch-all: log metadata for everything else
- level: Metadata
omitStages:
- RequestReceived
Configure API Server:
# /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
containers:
- command:
- kube-apiserver
- --audit-policy-file=/etc/kubernetes/audit-policy.yaml
- --audit-log-path=/var/log/kubernetes/audit.log
- --audit-log-maxage=30 # Retain 30 days
- --audit-log-maxbackup=10 # Keep 10 rotated files
- --audit-log-maxsize=100 # 100MB per file
- --audit-log-format=json # JSON for parsing
volumeMounts:
- name: audit-policy
mountPath: /etc/kubernetes/audit-policy.yaml
readOnly: true
- name: audit-logs
mountPath: /var/log/kubernetes
volumes:
- name: audit-policy
hostPath:
path: /etc/kubernetes/audit-policy.yaml
type: File
- name: audit-logs
hostPath:
path: /var/log/kubernetes
type: DirectoryOrCreate
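After the API server restarts with these flags, verify that events are actually being written. A minimal check on the control plane node (assuming the log path configured above and that jq is available):
# Generate an auditable event, then confirm it appears in the log
kubectl get secrets -n kube-system > /dev/null
sudo tail -n 50 /var/log/kubernetes/audit.log | \
  jq -r 'select(.objectRef.resource=="secrets") | [.user.username, .verb, .objectRef.namespace] | @tsv'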
Step 8.2: Log Aggregation & Analysis {#step-82-log-aggregation-analysis}
Why: Audit logs are useless if they're scattered across control plane nodes and never analyzed. Centralized log aggregation enables real-time monitoring, alerting, and long-term retention for compliance.
Ship Logs to SIEM:
# Fluentd DaemonSet for log shipping
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: kube-system
data:
fluent.conf: |
# Read Kubernetes audit logs
<source>
@type tail
path /var/log/kubernetes/audit.log
pos_file /var/log/audit-log.pos
tag kubernetes.audit
format json
time_key requestReceivedTimestamp
time_format %Y-%m-%dT%H:%M:%S.%NZ
</source>
# Enrich with Kubernetes metadata
<filter kubernetes.audit>
@type record_transformer
<record>
cluster_name "production"
environment "prod"
</record>
</filter>
# Send to Elasticsearch
<match kubernetes.audit>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix kubernetes-audit
include_tag_key true
type_name audit
flush_interval 10s
buffer_chunk_limit 2M
buffer_queue_limit 32
</match>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluentd-audit
namespace: kube-system
spec:
selector:
matchLabels:
app: fluentd-audit
template:
metadata:
labels:
app: fluentd-audit
spec:
serviceAccountName: fluentd
tolerations:
- key: node-role.kubernetes.io/control-plane
effect: NoSchedule
nodeSelector:
node-role.kubernetes.io/control-plane: ""
containers:
- name: fluentd
image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
volumeMounts:
- name: config
mountPath: /fluentd/etc/fluent.conf
subPath: fluent.conf
- name: audit-logs
mountPath: /var/log/kubernetes
volumes:
- name: config
configMap:
name: fluentd-config
- name: audit-logs
hostPath:
path: /var/log/kubernetes
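A quick way to confirm logs are flowing end to end (assumes the Elasticsearch service referenced in the Fluentd config, i.e. a service named elasticsearch in the logging namespace listening on 9200):
# Confirm the DaemonSet is running on the control plane nodes
kubectl rollout status daemonset/fluentd-audit -n kube-system
# Check that audit indices are being created in Elasticsearch
kubectl port-forward -n logging svc/elasticsearch 9200:9200 &
curl -s "http://localhost:9200/_cat/indices/kubernetes-audit-*?v"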
Use JSON Formatter to process compliance reports:
- Export audit log sample:
kubectl exec -n kube-system <apiserver-pod> -- cat /var/log/kubernetes/audit.log | head -100 > audit-sample.json
- Format and validate JSON structure
- Extract specific event types for auditor review:
- Secret access events
- RBAC modifications
- Pod exec/attach commands
- Failed authentication attempts
- Generate summary statistics (see the jq sketch after this list):
- Events by user
- Events by namespace
- Events by verb (create, delete, update)
- Failed authorization attempts
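A lightweight way to produce these summaries from the exported sample is with jq (a sketch; assumes the audit-sample.json created in the export step, one JSON event per line):
# Events by user
jq -r '.user.username' audit-sample.json | sort | uniq -c | sort -rn | head
# Events by verb
jq -r '.verb' audit-sample.json | sort | uniq -c | sort -rn
# Failed authorization attempts (HTTP 403)
jq -r 'select(.responseStatus.code==403) | [.user.username, .verb, .objectRef.resource] | @tsv' audit-sample.json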
Step 8.3: Security Event Detection {#step-83-security-event-detection}
Why: Audit logs are most valuable when actively monitored. Security Information and Event Management (SIEM) systems provide real-time alerting on suspicious activities.
Monitor for Suspicious Activities:
# Prometheus alerting rules (assumes audit events are exported as a kubernetes_audit_event metric)
groups:
- name: kubernetes-audit
interval: 60s
rules:
# Privileged pod created
- alert: PrivilegedPodCreated
expr: |
sum(rate(kubernetes_audit_event{
verb="create",
resource="pods",
privileged="true"
}[5m])) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Privileged pod created"
description: "A privileged pod was created in namespace {{ $labels.namespace }} by user {{ $labels.user }}"
# Failed authentication attempts
- alert: FailedAuthentication
expr: |
sum(rate(kubernetes_audit_event{
responseStatus_code=~"401|403"
}[5m])) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Multiple failed authentication attempts"
description: "{{ $value }} failed auth attempts in the last 5 minutes"
# Secret accessed by unusual user
- alert: SuspiciousSecretAccess
expr: |
sum(rate(kubernetes_audit_event{
resource="secrets",
verb=~"get|list",
user!~"system:serviceaccount:.*"
}[5m])) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Secret accessed by human user"
description: "User {{ $labels.user }} accessed secret {{ $labels.objectRef_name }}"
# Exec into running container
- alert: PodExecDetected
expr: |
sum(rate(kubernetes_audit_event{
resource="pods/exec",
verb="create"
}[5m])) > 0
for: 1m
labels:
severity: info
annotations:
summary: "kubectl exec executed"
description: "User {{ $labels.user }} executed command in pod {{ $labels.objectRef_name }}"
# RBAC policy modified
- alert: RBACModification
expr: |
sum(rate(kubernetes_audit_event{
resource=~"clusterroles|clusterrolebindings|roles|rolebindings",
verb=~"create|update|patch|delete"
}[5m])) > 0
for: 1m
labels:
severity: warning
annotations:
summary: "RBAC policy modified"
description: "User {{ $labels.user }} {{ $labels.verb }}d {{ $labels.resource }}/{{ $labels.objectRef_name }}"
Elasticsearch Queries for Investigation:
# Failed authorization attempts
GET kubernetes-audit-*/_search
{
"query": {
"bool": {
"must": [
{"term": {"responseStatus.code": 403}},
{"range": {"@timestamp": {"gte": "now-1h"}}}
]
}
},
"aggs": {
"by_user": {
"terms": {"field": "user.username"}
}
}
}
# Recent secret accesses
GET kubernetes-audit-*/_search
{
"query": {
"bool": {
"must": [
{"term": {"objectRef.resource": "secrets"}},
{"terms": {"verb": ["get", "list"]}},
{"range": {"@timestamp": {"gte": "now-24h"}}}
]
}
}
}
Step 8.4: Compliance Reporting {#step-84-compliance-reporting}
Why: Auditors require evidence that security controls are implemented and effective. Automated compliance reports demonstrate due diligence and reduce audit preparation time from weeks to days.
Generate Compliance Evidence:
| Framework | Evidence Required | Kubernetes Source |
|---|---|---|
| SOC 2 | Access logs, change management, incident logs | Audit logs, RBAC policies, Falco alerts |
| PCI-DSS | Network segmentation, encryption, access controls | Network policies, etcd encryption, RBAC |
| HIPAA | Audit trails, encryption, access controls | Audit logs, Secrets encryption, RBAC |
| ISO 27001 | Access controls, monitoring, incident response | RBAC, Falco, audit logs, runbooks |
Automated Compliance Checks:
# Run kube-bench for CIS compliance
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench -n default > cis-compliance-$(date +%Y%m%d).txt  # add --json to the job args for machine-readable output
# Run Polaris for best practices
kubectl apply -f https://raw.githubusercontent.com/FairwindsOps/polaris/master/deploy/dashboard.yaml
kubectl port-forward -n polaris svc/polaris-dashboard 8080:80
# Generate compliance report
curl http://localhost:8080/api/audit > polaris-report-$(date +%Y%m%d).json
Compliance Dashboard Metrics:
- Audit log coverage: Percentage of API calls logged
- RBAC violations: Unauthorized access attempts
- Network policy violations: Blocked connections
- Pod security violations: Rejected non-compliant pods
- Image vulnerabilities: Critical CVEs in running images
- Certificate expiration: Days until cert expiry
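For the certificate expiration metric, kubeadm-based control planes can report certificate lifetimes directly; a quick spot check (kubeadm clusters only; paths are the kubeadm defaults):
# List control plane certificate expiration dates
sudo kubeadm certs check-expiration
# Or inspect a single certificate manually
sudo openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt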
Deliverable: Audit policy configuration, SIEM integration, security alert rules, compliance evidence reports, audit log retention procedures
Validation & Testing (15-20 minutes) {#validation-testing-15-20-minutes}
Security controls are only effective if they work as intended. This final stage validates your hardening efforts through penetration testing and policy validation.
Step 1: Penetration Testing {#step-1-penetration-testing}
Test Common Attack Vectors:
# Test 1: Attempt privileged pod creation (should be blocked)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: privileged-test
spec:
containers:
- name: test
image: nginx
securityContext:
privileged: true
EOF
# Expected: Error from server (Forbidden) or admission webhook denied
# Test 2: Try to access secrets without proper RBAC (should fail)
kubectl create serviceaccount test-sa
kubectl run test-pod --image=nginx --overrides='{"spec":{"serviceAccountName":"test-sa"}}'  # --serviceaccount was removed from kubectl run
kubectl exec test-pod -- cat /var/run/secrets/kubernetes.io/serviceaccount/token
kubectl auth can-i list secrets --as=system:serviceaccount:default:test-sa
# Expected: no
# Test 3: Attempt container escape
kubectl run escape-test --image=alpine -- /bin/sh -c "ls /host"
# Expected: Should not have host filesystem access
# Test 4: Test network policy violations
kubectl run netpol-test --image=busybox -- wget -O- http://blocked-service
# Expected: Timeout (blocked by network policy)
Kubernetes Security Assessment Tools:
# kube-hunter - Hunt for security weaknesses
docker run -it --rm --network host aquasec/kube-hunter --pod
# kubeaudit - Audit clusters for security issues
kubeaudit all -n production
# kubescape - NSA/CISA and MITRE framework compliance scanning
kubescape scan framework nsa -v
Step 2: Policy Validation {#step-2-policy-validation}
Test Admission Controllers:
# Test Pod Security Admission
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: psa-test
namespace: production
spec:
containers:
- name: test
image: nginx
securityContext:
runAsUser: 0 # Root user (should be blocked by Restricted level)
EOF
# Expected: Error - violates Pod Security Standard
# Test image signature verification
kubectl run unsigned-image --image=myregistry.io/unsigned:latest
# Expected: Error - image signature verification failed
# Test resource limits requirement
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
name: no-limits-test
spec:
containers:
- name: test
image: nginx
# No resource limits
EOF
# Expected: Error - resource limits required by policy
Step 3: Monitoring Validation {#step-3-monitoring-validation}
Trigger Test Alerts:
# Test Falco shell detection
kubectl exec -it <pod-name> -- /bin/bash
# Expected: Falco alert "Shell spawned in container"
# Test privileged pod alert
kubectl run privileged-test --image=nginx --privileged=true
# Expected: Audit alert "PrivilegedPodCreated"
# Test secret access logging
kubectl get secrets -A
# Expected: Audit log entry for secret list operation
Step 4: Disaster Recovery Testing {#step-4-disaster-recovery-testing}
Test etcd Restore:
# Restore etcd from backup
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-20250101-120000.db \
--data-dir=/var/lib/etcd-restore \
--initial-cluster=master=https://10.0.1.10:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Verify RBAC policies persist
kubectl get clusterroles,clusterrolebindings
# Verify network policies enforced
kubectl get networkpolicies -A
Deliverable: Penetration test report with findings, policy validation results, monitoring test logs, disaster recovery runbook
Tools Summary {#tools-summary}
| Tool | Primary Use | Stage |
|---|---|---|
| JSON Formatter | Format and analyze CIS benchmark JSON reports, process audit logs | Stage 1, 8 |
| Port Reference | Document Kubernetes component ports and firewall rules | Stage 1 |
| Diff Checker | Compare RBAC policies across environments, detect permission drift | Stage 1, 3 |
| X.509 Certificate Decoder | Validate API server and etcd TLS certificates | Stage 2 |
| Nmap Command Builder | Generate commands to scan kubelet endpoints and verify port security | Stage 2 |
| JWT Decoder | Validate service account tokens and expiration | Stage 3 |
| OAuth/OIDC Debugger | Configure and test identity provider integration | Stage 3 |
| DNS Lookup | Validate CoreDNS health and cluster DNS resolution | Stage 4 |
| Certificate Transparency Lookup | Monitor service mesh certificate issuance for anomalies | Stage 4 |
| Hash Generator | Verify container image integrity and detect tampering | Stage 6 |
| CVE Vulnerability Search | Monitor container image CVEs and vendor response times | Stage 6 |
| Incident Response Playbook Generator | Create Kubernetes-specific security incident runbooks | Stage 7 |
Related Services {#related-services}
- Kubernetes Security - Managed Kubernetes security assessments and hardening
- Cloud Solutions - Multi-cloud Kubernetes management and optimization
- DevOps Automation - Secure CI/CD pipelines for Kubernetes deployments
- Virtual CISO (vCISO) - Strategic Kubernetes security leadership and governance
Compliance & Standards {#compliance-standards}
CIS Kubernetes Benchmark v1.12: 100+ automated security checks covering control plane components, worker node configuration, and policy enforcement. Scored vs not-scored recommendations with clear remediation guidance.
NSA/CISA Kubernetes Hardening Guide v1.2: Government-issued guidance specifically addressing nation-state and APT threats. Covers Pod Security Standards, network policies, supply chain security (image signing/scanning), authentication/authorization, and comprehensive audit logging.
Pod Security Standards: Three-tier security model:
- Privileged - Unrestricted (development/testing only)
- Baseline - Minimally restrictive (prevents known privilege escalations)
- Restricted - Highly restrictive (production best practice with defense-in-depth)
NIST 800-190: Application Container Security Guide covering image lifecycle, registry security, orchestrator security, container runtime, and host OS security.
Industry Compliance:
- PCI-DSS: Network segmentation (network policies), encryption (etcd encryption, TLS), access controls (RBAC), audit logging
- HIPAA: Access controls (RBAC), encryption (Secrets encryption, TLS), audit trails (comprehensive logging), incident response
- SOC 2: Change management (GitOps), monitoring (Falco, Hubble), incident response (playbooks), audit logging
- GDPR: Data protection (encryption), breach notification (incident response), access control (RBAC)
Common Pitfalls & Solutions {#common-pitfalls-solutions}
| Pitfall | Impact | Solution |
|---|---|---|
| Default service account auto-mounting | Unnecessary token exposure increasing attack surface | Disable automountServiceAccountToken globally, enable per-pod only when needed |
| Overly permissive RBAC | Privilege escalation, lateral movement | Implement least privilege: specific verbs/resources, namespace-scoped roles, no wildcards |
| No network policies | Unrestricted pod-to-pod communication enables lateral movement | Start with default-deny, whitelist required traffic paths, implement zero-trust networking |
| Privileged containers | Container escape to host OS, full cluster compromise | Enforce Pod Security Admission (Restricted level), use security contexts, drop capabilities |
| Unsigned container images | Supply chain attacks, malicious image injection | Implement image signing with Cosign/Sigstore, verify signatures with Kyverno/OPA |
| Exposed kubelet API | Remote code execution, credential theft | Enable webhook authentication, disable read-only port 10255, require TLS |
| Unencrypted etcd | Secrets exposure, complete cluster compromise | Enable encryption at rest, TLS for all connections, restrict network access |
| Missing audit logs | No forensic trail for incident investigation or compliance | Enable comprehensive audit logging, ship to SIEM, retain 30+ days |
| Using latest image tag | Unpredictable deployments, drift between environments | Pin images by digest (sha256), use semantic versioning, enforce with policy |
| Missing resource limits | Denial of service, cluster instability | Require resource limits via admission policy, set namespace quotas, use LimitRanges |
Key Metrics & KPIs {#key-metrics-kpis}
Security Posture Metrics:
- CIS Benchmark Score: Target 95%+ pass rate across all applicable checks
- Critical CVEs in Production: Target zero; remediate within 7 days of disclosure (within 48 hours if actively exploited)
- Pod Security Admission Violations: Track and trend over time (target: zero violations in production)
- RBAC Policy Violations: Monitor unauthorized access attempts (target: <10 per week)
Operational Metrics:
- Mean Time to Remediate (MTTR): For security findings (target: <48 hours for critical, <7 days for high)
- Audit Log Coverage: 100% of privileged operations logged (exec, secret access, RBAC changes)
- Image Signature Verification Rate: 100% of production images signed and verified
- Network Policy Coverage: 100% of production namespaces with network policies enforced
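The network policy coverage metric can be spot-checked with a short loop; a sketch:
# List namespaces that have no NetworkPolicy (coverage gaps)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl get networkpolicy -n "$ns" --no-headers 2>/dev/null | wc -l)
  [ "$count" -eq 0 ] && echo "No NetworkPolicy in namespace: $ns"
done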
Compliance Metrics:
- Certificate Expiration: No certificates expiring within 30 days
- Backup Success Rate: 100% of daily etcd backups successful
- Drift Detection: Time to detect and remediate configuration drift (target: <24 hours)
- Security Scan Coverage: 100% of images scanned before deployment
Continuous Improvement {#continuous-improvement}
Weekly Tasks:
- Review Falco alerts for new threat patterns
- Analyze audit logs for anomalous behavior
- Verify backup integrity (test restore of random backup)
- Scan running images for new CVEs
Monthly Tasks:
- Re-run kube-bench CIS compliance scan
- Update container base images and rebuild applications
- Rotate certificates (if not using auto-rotation)
- Review and update Falco detection rules
- Penetration testing of new deployments
Quarterly Tasks:
- Comprehensive RBAC permission audit
- Network policy effectiveness review
- Penetration testing by external security team
- Update security training materials
- Compliance framework alignment review
Annual Tasks:
- Full security architecture review
- Update threat models for new attack vectors
- Review and update incident response playbooks
- Disaster recovery exercise (full etcd restore)
- Security tool evaluation (assess new technologies)
FAQ {#faq}
1. What is the CIS Kubernetes Benchmark and why is it important? {#faq-1}
The CIS Kubernetes Benchmark is a comprehensive set of 100+ security recommendations developed by the Center for Internet Security in collaboration with Kubernetes experts worldwide. It provides an industry-standard baseline covering control plane components (API server, etcd, kubelet, scheduler, controller manager), worker node configuration, RBAC policies, network policies, and pod security.
It's important because: (1) It's auditable with automated tools like kube-bench, providing objective security measurement; (2) It maps directly to compliance frameworks (SOC 2, PCI-DSS, HIPAA), simplifying audit preparation; (3) It reflects real-world vulnerabilities exploited in production Kubernetes clusters; (4) It's regularly updated to address new threats and Kubernetes versions.
Organizations using managed Kubernetes (EKS, GKE, AKS) should note that cloud providers handle many control plane recommendations, but worker node security, RBAC configuration, network policies, and pod security remain your responsibility.
2. What are Pod Security Standards and how do they differ from deprecated Pod Security Policies? {#faq-2}
Pod Security Standards (PSS) are the successor to deprecated Pod Security Policies (PSP) as of Kubernetes 1.25. PSS defines three security levels:
- Privileged: Unrestricted (development/testing only)
- Baseline: Minimally restrictive (prevents known privilege escalations like host namespace access)
- Restricted: Highly restrictive (defense-in-depth production best practices)
Key differences from PSP:
Enforcement mechanism: PSS uses Pod Security Admission (PSA) built into the API server, while PSP required a separate admission controller. PSA is enabled by default and configured via namespace labels, making it simpler to deploy.
Scope: PSS policies are applied at the namespace level via labels (pod-security.kubernetes.io/enforce: restricted), providing clear boundaries. PSP used global policies with complex binding rules.
Standards-based: PSS codifies community best practices as three standard profiles, while PSP required custom policy authoring. The Restricted level enforces: non-root users, read-only root filesystems, dropped capabilities, no privilege escalation, Seccomp profiles, and no host namespace sharing.
Migration: Organizations still using PSP (Kubernetes <1.25) should migrate to PSS + policy engines (Kyverno/OPA) for custom requirements.
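As a concrete example, applying the Restricted level to a namespace is just a set of labels (a sketch; production is a placeholder namespace):
# Enforce Restricted, and also warn/audit so violations are visible
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted
# Server-side dry run: see which existing workloads would violate the profile
kubectl label --dry-run=server --overwrite namespace production \
  pod-security.kubernetes.io/enforce=restricted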
3. How do I secure the Kubernetes API server from external threats? {#faq-3}
The API server is the gateway to your cluster and must be protected with defense-in-depth:
Authentication & Authorization:
- Disable anonymous authentication (--anonymous-auth=false)
- Enable RBAC authorization (--authorization-mode=RBAC,Node)
- Integrate with identity providers via OIDC for centralized authentication
- Require TLS client certificates for all connections
Network Security:
- Restrict API server network access via cloud security groups or VPNs
- Implement IP allowlisting for authorized networks only
- Use private API server endpoints (not publicly accessible)
- Require VPN or bastion host access for external connections
Admission Control:
- Enable security-focused admission controllers (NodeRestriction, PodSecurity, ServiceAccount)
- Disable deprecated and insecure features (--insecure-port=0, --profiling=false)
- Implement rate limiting to prevent DoS attacks
Audit Logging:
- Enable comprehensive audit logging (--audit-log-path, --audit-log-maxage=30)
- Ship logs to SIEM for real-time monitoring
- Alert on failed authentication and suspicious API calls
Certificate Security:
- Use valid TLS certificates with appropriate SANs
- Implement automated certificate rotation
- Monitor certificate expiration
For managed Kubernetes (EKS, GKE, AKS), cloud providers handle many of these configurations, but you should still: (1) Enable private cluster mode; (2) Configure authorized networks; (3) Integrate with your identity provider; (4) Enable audit logging.
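On self-managed (kubeadm) clusters, a few of these settings can be spot-checked directly; a sketch (the API server address is a placeholder):
# Inspect the flags the API server is actually running with
sudo grep -E "anonymous-auth|authorization-mode|audit-log" /etc/kubernetes/manifests/kube-apiserver.yaml
# Confirm anonymous requests are rejected
curl -k https://<api-server>:6443/version
# Expected: 401 Unauthorized when --anonymous-auth=false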
4. What is the difference between Network Policies and Service Mesh security? {#faq-4}
Network Policies (Layer 3/4 - IP/Port-based):
Network policies control traffic based on IP addresses, ports, and pod/namespace labels. They're implemented by CNI plugins (Calico, Cilium, Weave) and provide basic network segmentation.
Advantages: Simple to implement, low overhead, no application changes required, sufficient for most use cases.
Limitations: No encryption, no identity-based access control (only IP/label-based), no Layer 7 (HTTP/gRPC) awareness.
Service Meshes (Layer 7 - Application-level):
Service meshes (Istio, Linkerd, Consul Connect) operate at the application layer and provide:
- Mutual TLS (mTLS): Automatic encryption of all pod-to-pod traffic with identity verification
- Identity-based access: Authorization based on service identity (not IP addresses)
- Advanced traffic management: Circuit breaking, retries, timeouts, traffic splitting
- Observability: Distributed tracing, metrics, logging
When to Use What:
Network Policies only: Most organizations (provides 80% of security value with 20% of complexity)
Network Policies + Service Mesh: High-security environments (finance, healthcare, defense), compliance requiring encryption-in-transit, zero-trust architectures, microservices at scale (50+ services)
Recommendation: Start with Network Policies for microsegmentation, add a service mesh when you need Layer 7 controls, encryption, or advanced traffic management. Service meshes add operational complexity and overhead (typically 10-20% latency increase).
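For reference, the default-deny starting point recommended above is a small policy; a minimal sketch for a hypothetical production namespace:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
EOF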
5. How do I prevent supply chain attacks on container images? {#faq-5}
Supply chain attacks target the software lifecycle before deployment. Implement defense-in-depth:
Image Signing & Verification:
- Sign all production images using Sigstore/Cosign with private keys
- Verify signatures before deployment using policy engines (Kyverno, OPA)
- Store public keys in Kubernetes Secrets or external KMS
- Reject unsigned images via admission control
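A minimal signing and verification flow with Cosign might look like the following (a sketch; the registry path and digest are placeholders, and production keys should live in a KMS rather than on disk):
# Generate a keypair (use a KMS-backed key in production)
cosign generate-key-pair
# Sign the image by digest and push the signature to the registry
cosign sign --key cosign.key myregistry.io/app@sha256:<digest>
# Verify the signature before deployment
cosign verify --key cosign.pub myregistry.io/app@sha256:<digest>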
Vulnerability Scanning:
- Scan images in CI/CD pipelines (Trivy, Snyk, Grype)
- Block deployments with CRITICAL or HIGH vulnerabilities
- Continuously scan running production images for new CVEs
- Set SLAs for vulnerability remediation (e.g., 7 days for HIGH)
Base Image Hardening:
- Use minimal base images: Distroless (no shell/package manager), Alpine, or Scratch
- Avoid full OS images (Ubuntu, Debian, CentOS)
- Implement multi-stage Docker builds (separate build and runtime images)
- Regularly update base images and rebuild applications
Software Bill of Materials (SBOM):
- Generate SBOMs for all images using Syft or Trivy
- Track dependencies and licenses
- Monitor for vulnerabilities in dependencies
- Store SBOMs alongside images in registry
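Generating and scanning an SBOM is a short pipeline step with Syft and Grype (a sketch; the image name is a placeholder):
# Produce an SPDX JSON SBOM and store it with the build artifacts
syft myregistry.io/app:1.0 -o spdx-json > app-1.0.sbom.json
# Scan the SBOM for known vulnerabilities, failing the build on HIGH or above
grype sbom:app-1.0.sbom.json --fail-on high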
Registry Security:
- Use private registries with access controls
- Enable vulnerability scanning in registry (ECR, Artifact Registry, ACR, Harbor)
- Implement content trust (Docker Content Trust)
- Audit registry access logs
Image Pinning:
- Pin images by digest (SHA-256), not tags: image:1.0@sha256:abc123...
- Tags (including semantic versions) are mutable; digests are immutable
- Enforce digest pinning via admission policy
Additional Measures:
- Monitor Certificate Transparency logs for unauthorized signing certificates
- Implement image provenance tracking (where/when/how image was built)
- Use purpose-built CI/CD systems with audit trails
- Rotate image signing keys regularly
6. What runtime threats should I monitor for in Kubernetes clusters? {#faq-6}
Runtime threats occur after deployment when attackers exploit vulnerabilities or misconfigurations. Monitor for:
Container Escapes: Attempts to break out of container isolation to the host OS. Indicators: Access to /proc, /sys, host namespaces, privileged operations, exploiting kernel vulnerabilities.
Privilege Escalation: Gaining elevated permissions within the cluster. Indicators: Setuid/setgid executions, capability abuse, exploiting misconfigured RBAC, service account token theft.
Cryptomining: Unauthorized cryptocurrency mining consuming cluster resources. Indicators: High CPU usage (especially 100% sustained), connections to mining pools (port 3333, 8333), processes named xmrig/ethminer/minerd.
Lateral Movement: Attackers moving between pods/namespaces after initial compromise. Indicators: Unexpected pod-to-pod communication, port scanning (nmap, masscan), service account token usage across namespaces.
Data Exfiltration: Stealing sensitive data from the cluster. Indicators: Large outbound data transfers, connections to cloud storage (S3, GCS), unusual DNS queries, data compression/archival commands.
Shell Spawning: Execution of shells in containers (especially those that shouldn't have shells). Indicators: bash/sh/zsh execution in distroless images, shells spawned by non-interactive processes.
Sensitive File Access: Reading credentials and configuration files. Indicators: Access to /etc/shadow, /root/.ssh/id_rsa, service account tokens, Secrets mounted in pods, /proc/self/environ.
Network Scanning: Reconnaissance for vulnerable services. Indicators: nmap/masscan execution, SYN scans, port sweeps, connection attempts to multiple IPs.
Malicious Binary Execution: Running unauthorized or malicious software. Indicators: wget/curl downloads followed by execution, compilation in containers, reverse shells, backdoors.
Persistence Mechanisms: Establishing long-term access. Indicators: Cronjob creation, package manager usage in containers (apt, yum, apk), SSH key additions, malicious init scripts.
Detection Tools: Falco (system call monitoring), Cilium Hubble (network observability), audit logs (API activity), SIEM integration (correlation and alerting).
7. How do I implement least privilege RBAC in Kubernetes? {#faq-7}
Least privilege means granting only the minimum permissions necessary. Implement systematically:
Role Design Principles:
- Avoid the cluster-admin role (reserve for break-glass emergencies only)
- Use a namespace-scoped Role instead of a ClusterRole when possible
- Grant specific verbs (get, list, watch), never wildcards (*)
- Limit resource types to exactly what's needed (e.g., pods, not *)
- Separate read and write permissions (different roles for viewers vs operators)
Service Account Best Practices:
- Create dedicated service accounts per application (not default)
- Disable token auto-mounting globally (automountServiceAccountToken: false)
- Enable token mounting only for pods that need API access
- Use time-bound tokens (Kubernetes 1.22+ default: 1 hour expiration)
- Implement projected service account tokens with audience claims
Regular Auditing:
- Review RBAC permissions quarterly (minimum)
- Identify service accounts with cluster-admin or wildcard permissions
- Use tools like kubectl-who-can and rbac-tool for analysis
- Test actual permissions:
kubectl auth can-i list secrets --as=system:serviceaccount:default:my-sa
RBAC as Code (GitOps):
- Store RBAC manifests in version control (Git)
- Require pull request approval for permission changes
- Automated testing with OPA or Kyverno before merge
- Drift detection: Compare cluster state vs Git source of truth
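Drift detection can be as simple as comparing live objects against the manifests in Git; a sketch, assuming RBAC manifests live under a hypothetical rbac/ directory:
# Non-zero exit code indicates drift between Git and the cluster
# Run this in CI on a schedule and alert on any non-empty diff
kubectl diff -f rbac/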
Example Least Privilege Role (monitoring agent):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-agent
rules:
- apiGroups: [""]
resources: ["pods", "nodes", "services"]
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources: ["deployments", "replicasets"]
verbs: ["get", "list", "watch"]
# NO create/update/delete/patch verbs
# NO secrets access
# NO * wildcards
Anti-Patterns to Avoid:
- Using the default service account for applications
- Granting cluster-admin "temporarily" (becomes permanent)
- Wildcard permissions (*) for convenience
- Copying RBAC from dev to prod without review
- No periodic review of permissions
8. What is the role of etcd in Kubernetes security and how do I protect it? {#faq-8}
etcd is the distributed key-value store that holds ALL Kubernetes cluster state including Secrets, ConfigMaps, RBAC policies, network policies, and pod specifications. Compromising etcd means compromising the entire cluster.
Why etcd is Critical:
- Contains ALL Secrets (even if "encrypted at rest")
- Stores RBAC policies (attacker can grant themselves cluster-admin)
- Contains all configurations (attacker can modify or exfiltrate)
- Provides complete cluster state (full reconnaissance)
- CISA Alert AA20-301A specifically calls out exposed etcd as a critical vulnerability
Protection Measures:
Network Security:
- Restrict access to only API servers (firewall port 2379)
- Never expose etcd to public internet or untrusted networks
- Use cloud security groups or host firewall rules
- Run etcd on dedicated control plane nodes (separate from workloads)
Encryption:
- Enable TLS for all client connections (--client-cert-auth=true)
- Enable TLS for all peer connections (--peer-client-cert-auth=true)
- Use separate certificates for client and peer communication
- Enable encryption at rest for etcd data (especially Secrets)
Authentication & Authorization:
- Require client certificate authentication (no password auth)
- Use role-based access control within etcd
- Rotate certificates regularly (automated with cert-manager)
Backup & Disaster Recovery:
- Implement automated daily backups (minimum)
- Encrypt backups with GPG or cloud-native encryption
- Store backups in separate location (different cloud region)
- Test restore procedures quarterly
- Document recovery runbooks
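A typical automated backup job runs something like the following (a sketch; certificate paths assume a kubeadm layout and the S3 bucket is a placeholder):
# Take a dated etcd snapshot, encrypt it, and copy it off the node
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
gpg --symmetric --cipher-algo AES256 /backup/etcd-*.db
aws s3 cp /backup/ s3://my-cluster-backups/etcd/ --recursive --exclude "*" --include "*.gpg"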
Monitoring:
- Monitor etcd metrics (latency, leader changes, disk I/O)
- Alert on unauthorized access attempts
- Track etcd API calls in audit logs
- Monitor certificate expiration
Managed Kubernetes Considerations:
For EKS, GKE, AKS, cloud providers manage etcd security including:
- Network isolation
- TLS encryption
- Certificate management
- Automated backups
However, you should still: (1) Enable encryption at rest for Secrets; (2) Verify backup policies; (3) Understand RTO/RPO guarantees; (4) Test cluster restore procedures.
Estimated Total Time: 3-4 hours for initial implementation, plus ongoing monitoring and quarterly reviews
Next Steps:
- Run kube-bench to establish security baseline and identify immediate critical vulnerabilities
- Implement Pod Security Admission (Restricted level) for production namespaces
- Deploy default-deny Network Policies with application-specific allow rules
- Configure comprehensive audit logging and ship to SIEM
- Deploy Falco for runtime threat detection
- Schedule quarterly security reviews, penetration tests, and compliance audits