
The Kubernetes Security Audit Checklist

Every cluster we audit has at least one finding in the critical range within the first hour. This is the checklist we run through — what to look for, why it matters, and how to fix it without breaking production.

Before you start

This checklist is written for platform engineers, DevSecOps leads, and security reviewers who have shell access to the cluster and are allowed to run read-only tooling against the control plane and workloads. If you're running a managed offering (EKS, GKE, AKS), some items will be handled by the cloud provider — note them, don't skip them.

A full audit following this checklist takes 2-4 days for a single cluster with 50-200 workloads. If you want a faster baseline, run the CIS Benchmark automation first (section 1) and triage from there.


1. CIS Kubernetes Benchmark baseline

Start with automated benchmarking. kube-bench runs the CIS Kubernetes Benchmark against your nodes and control plane components and produces a list of deviations. It's imperfect on managed offerings (many control-plane checks don't apply) but it's the cheapest, fastest signal you can get.

  • Run kube-bench as a Job in-cluster and export the results
  • Separate control-plane findings from worker-node findings
  • For managed clusters, mark control-plane items as "provider-handled"
  • Triage worker-node findings by severity, not by volume

Most common real findings: kubelet anonymous auth enabled, the legacy kubelet read-only port (10255) still open on older nodes, audit logging disabled, and etcd encryption at rest left at the default (none). These are worth fixing regardless of what the tool reports.
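
A minimal way to get that baseline, as a sketch assuming cluster access and the upstream kube-bench Job manifest (pin a released version of the manifest in practice rather than main):

```shell
# run kube-bench as a one-shot in-cluster Job and export the results
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench > kube-bench-results.txt
kubectl delete job kube-bench
```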

2. RBAC and least privilege

This is where we find the biggest real-world risk. Default cluster installs often grant cluster-admin to more principals than anyone remembers, and service accounts accumulate permissions nobody ever removes.

What to audit

  • Every ClusterRoleBinding and RoleBinding: who does it bind to, and does that principal still exist?
  • Every ServiceAccount in every namespace: what permissions does it have, and does its workload actually need them?
  • Wildcards in rules. verbs: ["*"], resources: ["*"], and apiGroups: ["*"] are almost always too broad.
  • The default service account in every namespace. If workloads are using it, they shouldn't be — give each workload its own.
  • system:masters group membership in any kubeconfig. Certificates issued for this group grant cluster-admin that bypasses RBAC and cannot be revoked short of rotating the cluster CA, so it should be reserved for break-glass use only.

Tools we reach for: krane and rbac-lookup for static analysis; kubectl-who-can for answering "who can do X?" on the fly.
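
A few read-only queries cover most of the list above. This is a sketch assuming jq and the krew-installed who-can plugin are available, with prod as a placeholder namespace:

```shell
# who is bound to cluster-admin, and through which bindings?
kubectl get clusterrolebindings -o json \
  | jq -r '.items[] | select(.roleRef.name == "cluster-admin")
           | "\(.metadata.name): \([.subjects[]? | .name] | join(", "))"'

# ClusterRoles with wildcard verbs -- almost always too broad
kubectl get clusterroles -o json \
  | jq -r '.items[] | select(any(.rules[]?; (.verbs // []) | index("*")))
           | .metadata.name'

# answer "who can read secrets in prod?" on the fly
kubectl who-can get secrets -n prod
```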

3. Admission control and Pod Security

Admission control is the enforcement layer. If you don't have one, you're relying on goodwill. At minimum, every cluster should enforce Pod Security Standards at the restricted level for production workload namespaces.

  • Pod Security Admission: enabled per-namespace, enforcing restricted on production namespaces and baseline on the rest at minimum
  • A policy engine (OPA/Gatekeeper, Kyverno, or the validating admission controllers built into your managed offering) deployed and actively enforcing policies
  • Common policies you want: no privileged containers, no host network, no host path volumes, no running as root, required resource limits, required readiness/liveness probes, image provenance verification
  • Policy reports visible to the platform team — not silently failing deployments with no explanation

Gotcha: rolling this out to an existing cluster will break workloads that were getting away with murder. Start in audit/warn mode, observe for at least a week, fix the violations, then flip to enforce.
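
The Pod Security Admission half of that rollout is just namespace labels. A sketch of the two-phase approach described above, with prod as a placeholder namespace:

```shell
# phase 1: warn and audit only -- nothing breaks, violations get logged
kubectl label --overwrite ns prod \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

# phase 2 (after at least a week of clean audit results): enforce
kubectl label --overwrite ns prod \
  pod-security.kubernetes.io/enforce=restricted
```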

4. Network policies

Ninety percent of the clusters we see have zero NetworkPolicies. Every pod can talk to every other pod. This means a single compromised workload has lateral movement to everything in the cluster, including the Kubernetes API server in some configurations.

  • Every namespace should have a default-deny NetworkPolicy for both ingress and egress, as a baseline
  • Explicit allow-rules for legitimate traffic (e.g., allow namespace A to namespace B on port 5432 for the database)
  • DNS egress to kube-dns must be explicitly allowed, or nothing will resolve
  • Egress to the Kubernetes API server (kubernetes.default.svc) for workloads that use service account tokens
  • If you're running a service mesh (Istio, Linkerd), verify that NetworkPolicies and mesh policies agree — they enforce at different layers and can conflict

Tools we reach for: np-guard or netassert for testing what's actually reachable; Cilium's Hubble if you want real-time flow visibility.
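
As a sketch, the default-deny baseline plus the DNS carve-out from the list above might look like this (prod is a placeholder namespace; adjust the kube-dns selector if your cluster labels it differently):

```shell
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: prod
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - { protocol: UDP, port: 53 }
    - { protocol: TCP, port: 53 }
EOF
```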

5. Supply chain and image provenance

This is the section most teams under-invest in, and it's also the one auditors increasingly ask about. If an attacker can push a malicious image into your registry, or swap a base image out from under you, nothing in your runtime defenses will save you.

  • Every image should be pinned by digest, not tag, in production manifests. nginx:latest is not acceptable; nginx@sha256:... is
  • Images should be signed with Sigstore/cosign and verified at admission time (Kyverno and Gatekeeper both support this)
  • SBOM generation in CI with Syft, attached to the image as an attestation
  • Base image scanning with Trivy or Grype, with CI failing on critical CVEs older than 30 days
  • A private image registry with pull authentication — no anonymous pulls
  • Admission policy that refuses images from any registry other than your approved list
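
A CI-side sketch of the digest-pin, sign, SBOM, and scan steps, assuming cosign, syft, trivy, and crane are on the runner and registry.example.com/app is a placeholder image:

```shell
IMAGE=registry.example.com/app
DIGEST=$(crane digest "${IMAGE}:1.4.2")   # resolve the tag once, pin the digest everywhere

# sign and verify with a cosign key pair (keyless signing is the common alternative)
cosign sign --key cosign.key "${IMAGE}@${DIGEST}"
cosign verify --key cosign.pub "${IMAGE}@${DIGEST}"

# SBOM generation, kept as a CI artifact
syft "${IMAGE}@${DIGEST}" -o spdx-json > sbom.spdx.json

# fail the pipeline on critical CVEs
trivy image --severity CRITICAL --exit-code 1 "${IMAGE}@${DIGEST}"
```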

6. Secrets management

Kubernetes Secrets are base64-encoded, not encrypted. They are not encrypted at rest in etcd unless you've explicitly configured envelope encryption against a KMS, and Secret manifests committed to Git are plaintext credentials in your version control.

  • Envelope encryption enabled on the etcd store, using a KMS provider (AWS KMS, Azure Key Vault, GCP KMS, or HashiCorp Vault)
  • No Secret manifests in Git. Use External Secrets Operator or sealed-secrets to pull from a real secret store at runtime
  • Secrets scanned out of Git history with gitleaks or trufflehog before the audit
  • Service account tokens: verify automountServiceAccountToken: false on workloads that don't need to talk to the Kubernetes API
  • Token rotation enabled; bound tokens (TokenRequest API) preferred over legacy tokens
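
The base64 point is easy to demonstrate, and worth showing to anyone who thinks a Secret manifest is safe to commit (db-creds is a hypothetical Secret name):

```shell
# base64 is an encoding, not encryption -- it reverses with one command
echo -n 'hunter2' | base64        # prints aHVudGVyMg==
echo 'aHVudGVyMg==' | base64 -d   # prints hunter2

# the same applies to any live Secret you can read (run against a real cluster):
# kubectl get secret db-creds -o jsonpath='{.data.password}' | base64 -d
```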

7. Runtime defense and detection

Prevention is never complete. You need runtime visibility into what's happening inside containers, and the ability to alert on suspicious behavior.

  • Falco or an equivalent runtime security tool deployed, with rules tuned to your actual workloads (untuned Falco generates enough alerts to be ignored)
  • Audit logging enabled on the API server, with logs shipped to a SIEM or log aggregator
  • Container runtime (containerd, CRI-O) logs centralized
  • Detection rules for the events that matter: exec into containers, secret access from unusual pods, privilege escalation attempts, unusual network connections from pods
  • An incident response runbook that explicitly covers "what if a pod is compromised?"
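
For the audit-logging item, a minimal API-server audit policy might look like the sketch below, wired in with the kube-apiserver --audit-policy-file flag on self-managed control planes (on managed offerings you enable the provider's audit log stream instead). Rule order matters: the first matching rule wins.

```shell
cat > audit-policy.yaml <<'EOF'
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# log who touched Secrets, but never log the Secret bodies themselves
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# execs and attaches are the events that matter most -- keep full detail
- level: RequestResponse
  resources:
  - group: ""
    resources: ["pods/exec", "pods/attach"]
# everything else at Metadata, skipping the noisy RequestReceived stage
- level: Metadata
  omitStages: ["RequestReceived"]
EOF
```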

8. Node and cluster hardening

The easy-to-miss layer. Nodes are Linux machines, and they need the same hygiene as any other Linux machine, plus Kubernetes-specific hardening.

  • CIS hardened OS (Ubuntu CIS, Bottlerocket, Flatcar) rather than a general-purpose distro
  • SSH access to nodes disabled or tightly restricted; node access should be via the API
  • Kubelet configuration: anonymous auth disabled, authorization mode set to Webhook
  • etcd access restricted to control-plane nodes only (not applicable on managed offerings)
  • Automatic node updates enabled (managed node groups, Karpenter with node rotation, or equivalent)
  • Resource quotas and limits set at the namespace level to prevent noisy-neighbor DoS
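
Two of the kubelet items can be verified read-only through the API server's node proxy, with no SSH required (consistent with the no-SSH item above). A sketch assuming jq and a reachable configz endpoint:

```shell
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" \
  | jq '{anonymous_auth: .kubeletconfig.authentication.anonymous.enabled,
         authorization_mode: .kubeletconfig.authorization.mode}'
# expect: anonymous_auth false, authorization_mode "Webhook"
```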

9. Managed Kubernetes specifics

Each managed offering has its own security-relevant controls. The short version:

EKS

  • IRSA (IAM Roles for Service Accounts) or Pod Identity for workload IAM
  • EKS control-plane logging enabled (api, audit, authenticator, controllerManager, scheduler)
  • Private API endpoint access, or at minimum IP-restricted public access
  • Security groups for pods where applicable
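
On the workload side, the IRSA wiring reduces to one annotation on the ServiceAccount. The names and role ARN below are placeholders, and the IAM role's trust policy must reference the cluster's OIDC provider:

```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments
  namespace: prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-irsa
EOF
```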

GKE

  • Workload Identity enabled for GCP IAM mapping
  • Shielded nodes and Confidential GKE Nodes where supported
  • Private cluster mode with authorized networks
  • Binary Authorization enabled for image provenance enforcement

AKS

  • Managed identity or workload identity for Azure resource access
  • Azure Policy add-on enabled for Gatekeeper-backed policy enforcement
  • Private cluster mode with authorized IP ranges
  • Microsoft Defender for Containers enabled

Prioritizing the findings

Once you've run through the checklist, you'll have a list of findings. The mistake most teams make here is trying to fix everything at once. Don't. Prioritize by blast radius: what's the worst thing an attacker could do with this finding, and how hard is it to reach? Then fix the top five before touching the long tail.

Want us to run this audit for you?

CKS-led, CIS-aligned, with a verification retest completed within 60 days of report delivery.