Before you start
This checklist is written for platform engineers, DevSecOps leads, and security reviewers who have shell access to the cluster and are allowed to run read-only tooling against the control plane and workloads. If you're running a managed offering (EKS, GKE, AKS), some items will be handled by the cloud provider — note them, don't skip them.
A full audit following this checklist takes 2-4 days for a single cluster with 50-200 workloads. If you want a faster baseline, run the CIS Benchmark automation first (section 1) and triage from there.
1. CIS Kubernetes Benchmark baseline
Start with automated benchmarking. kube-bench runs the CIS Kubernetes Benchmark against
your nodes and control plane components and produces a list of deviations. It's imperfect on managed
offerings (many control-plane checks don't apply) but it's the cheapest, fastest signal you can get.
- Run `kube-bench` as a Job in-cluster and export the results
- Separate control-plane findings from worker-node findings
- For managed clusters, mark control-plane items as "provider-handled"
- Triage worker-node findings by severity, not by volume
Most common real findings: kubelet anonymous auth enabled, the legacy kubelet read-only port (10255) still open on older nodes, audit logging disabled, and secrets encryption at rest left at the default (none). These are worth fixing regardless of what else the tool reports.
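The kube-bench project publishes a ready-made Job manifest; a trimmed sketch of what the in-cluster run looks like is below. The image tag and host paths are assumptions — check them against the kube-bench documentation for your distribution.

```yaml
# Sketch of an in-cluster kube-bench run. Image tag and mount paths
# are assumptions -- verify against the kube-bench docs for your distro.
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-bench
spec:
  template:
    spec:
      hostPID: true                 # kube-bench inspects host processes
      restartPolicy: Never
      containers:
        - name: kube-bench
          image: docker.io/aquasec/kube-bench:latest
          command: ["kube-bench", "--json"]   # JSON output for export
          volumeMounts:
            - name: etc-kubernetes
              mountPath: /etc/kubernetes
              readOnly: true
            - name: var-lib-kubelet
              mountPath: /var/lib/kubelet
              readOnly: true
      volumes:
        - name: etc-kubernetes
          hostPath: { path: /etc/kubernetes }
        - name: var-lib-kubelet
          hostPath: { path: /var/lib/kubelet }
```

Export the results with `kubectl logs job/kube-bench > results.json` and split the findings by node role from there.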
2. RBAC and least privilege
This is where we find the biggest real-world risk. Default cluster installs often grant cluster-admin to more principals than anyone remembers, and service accounts accumulate permissions nobody ever removes.
What to audit
- Every `ClusterRoleBinding` and `RoleBinding`: who does it bind to, and does that principal still exist?
- Every `ServiceAccount` in every namespace: what permissions does it have, and does its workload actually need them?
- Wildcards in rules. `verbs: ["*"]`, `resources: ["*"]`, and `apiGroups: ["*"]` are almost always too broad.
- The `default` service account in every namespace. If workloads are using it, they shouldn't be; give each workload its own.
- `system:masters` group membership in any kubeconfig. This is un-revokable cluster-admin and should be reserved for break-glass use only.
The tools we reach for: krane and rbac-lookup for static analysis; kubectl-who-can for answering "who can do X?" on the fly.
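Beyond the dedicated tools, a quick first pass for wildcard rules needs only kubectl and jq. A sketch, run here against a canned sample instead of a live cluster — against a real one you would pipe `kubectl get clusterroles -o json` straight into the same filter:

```shell
# Canned sample standing in for `kubectl get clusterroles -o json`,
# trimmed to the fields the filter reads.
cat > /tmp/clusterroles.json <<'EOF'
{"items":[
  {"metadata":{"name":"wide-open"},
   "rules":[{"apiGroups":["*"],"resources":["*"],"verbs":["*"]}]},
  {"metadata":{"name":"pod-reader"},
   "rules":[{"apiGroups":[""],"resources":["pods"],"verbs":["get","list"]}]}
]}
EOF

# Print the name of every role with a wildcard verb, resource,
# or apiGroup in any of its rules.
jq -r '.items[]
  | select(any(.rules[]?;
      (((.verbs // []) + (.resources // []) + (.apiGroups // []))
        | index("*")) != null))
  | .metadata.name' /tmp/clusterroles.json
# prints: wide-open
```

This only flags literal wildcards; it says nothing about a role that enumerates every verb by hand, which is why the static-analysis tools above are still worth running.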
3. Admission control and Pod Security
Admission control is the enforcement layer. If you don't have one, you're relying on goodwill. Every cluster we audit should have, at minimum, Pod Security Standards enforced at the "restricted" level for production workload namespaces.
- Pod Security Admission: enabled per-namespace, enforcing `restricted` on production namespaces and `baseline` on the rest at minimum
- A policy engine (OPA/Gatekeeper, Kyverno, or the validating admission controllers built into your managed offering) deployed and actively enforcing policies
- Common policies you want: no privileged containers, no host network, no host path volumes, no running as root, required resource limits, required readiness/liveness probes, image provenance verification
- Policy reports visible to the platform team — not silently failing deployments with no explanation
Gotcha: rolling this out to an existing cluster will break workloads that were getting away with murder. Start in audit/warn mode, observe for at least a week, fix the violations, then flip to enforce.
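Pod Security Admission is driven by namespace labels, which makes the audit-first rollout easy to express: enforce `baseline` while auditing and warning at `restricted`, then tighten the enforce label once violations are fixed. A sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments            # illustrative production namespace
  labels:
    # Step 1: enforce baseline, and surface restricted-level violations
    # in audit logs and kubectl warnings without blocking anything.
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    # Step 2, after a clean observation window: flip enforce to restricted.
```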
4. Network policies
Ninety percent of the clusters we see have zero NetworkPolicies. Every pod can talk to every other pod. This means a single compromised workload has lateral movement to everything in the cluster, including the Kubernetes API server in some configurations.
- Every namespace should have a default-deny NetworkPolicy for both ingress and egress, as a baseline
- Explicit allow-rules for legitimate traffic (e.g., allow namespace A to namespace B on port 5432 for the database)
- DNS egress to `kube-dns` must be explicitly allowed, or nothing will resolve
- Egress to the Kubernetes API server (`kubernetes.default.svc`) for workloads that use service account tokens
- If you're running a service mesh (Istio, Linkerd), verify that NetworkPolicies and mesh policies agree; they enforce at different layers and can conflict
Tools we reach for: np-guard or netassert for testing what's actually reachable; Cilium's Hubble if you want real-time flow visibility.
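The baseline above, as manifests: a default-deny for the namespace plus the DNS carve-out that has to accompany it. The namespace name is illustrative, and some CNIs also need TCP 53 for large responses, so both protocols are allowed here.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: payments        # illustrative
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: payments
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```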
5. Supply chain and image provenance
This is the section most teams under-invest in, and it's also the one auditors increasingly ask about. If an attacker can push a malicious image into your registry, or swap a base image out from under you, nothing in your runtime defenses will save you.
- Every image should be pinned by digest, not tag, in production manifests. `nginx:latest` is not acceptable; `nginx@sha256:...` is
- Images should be signed with Sigstore/cosign and verified at admission time (Kyverno and Gatekeeper both support this)
- SBOM generation in CI with Syft, attached to the image as an attestation
- Base image scanning with Trivy or Grype, with CI failing on critical CVEs older than 30 days
- A private image registry with pull authentication — no anonymous pulls
- Admission policy that refuses images from any registry other than your approved list
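Two of the bullets — signature verification and a registry allow-list — can be expressed in one Kyverno ClusterPolicy. A sketch; the registry name, rule names, and key material are placeholders, and the exact `verifyImages` schema should be checked against the Kyverno docs for your version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: image-provenance            # illustrative policy name
spec:
  validationFailureAction: Enforce
  rules:
    # Reject any image not pulled from the approved registry.
    - name: restrict-registries
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "Images must come from the approved registry."
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"   # placeholder registry
    # Require a valid cosign signature on images from that registry.
    - name: verify-signatures
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      ...               # your cosign public key
                      -----END PUBLIC KEY-----
```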
6. Secrets management
Kubernetes Secrets are base64-encoded, which is an encoding, not encryption. They are not encrypted at rest unless you've explicitly configured envelope encryption against a KMS, and a `Secret` manifest committed to Git is essentially a plaintext credential in your version control.
- Envelope encryption enabled on the etcd store, using a KMS provider (AWS KMS, Azure Key Vault, GCP KMS, or HashiCorp Vault)
- No `Secret` manifests in Git. Use External Secrets Operator or sealed-secrets to pull from a real secret store at runtime
- Secrets scanned out of Git history with gitleaks or trufflehog before the audit
- Service account tokens: verify `automountServiceAccountToken: false` on workloads that don't need to talk to the Kubernetes API
- Token rotation enabled; bound tokens (TokenRequest API) preferred over legacy tokens
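On self-managed control planes, envelope encryption is configured by pointing the API server's `--encryption-provider-config` flag at a file like the one below; on managed offerings it's a cluster-level setting instead. The plugin name and socket path are assumptions — they belong to whatever KMS provider plugin you run:

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources: [secrets]
    providers:
      - kms:
          apiVersion: v2
          name: my-kms-plugin            # assumption: your KMS plugin's name
          endpoint: unix:///var/run/kms.sock   # assumption: plugin socket
          timeout: 3s
      - identity: {}                     # fallback so pre-existing plaintext
                                         # Secrets remain readable until rewritten
```

After enabling it, existing Secrets stay unencrypted until rewritten; a bulk `kubectl get secrets -A -o json | kubectl replace -f -` forces the re-encryption.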
7. Runtime defense and detection
Prevention is never complete. You need runtime visibility into what's happening inside containers, and the ability to alert on suspicious behavior.
- Falco or an equivalent runtime security tool deployed, with rules tuned to your actual workloads (untuned Falco generates enough alerts to be ignored)
- Audit logging enabled on the API server, with logs shipped to a SIEM or log aggregator
- Container runtime (containerd, CRI-O) logs centralized
- Detection rules for the events that matter: exec into containers, secret access from unusual pods, privilege escalation attempts, unusual network connections from pods
- An incident response runbook that explicitly covers "what if a pod is compromised?"
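As a sketch of what "rules tuned to your workloads" looks like, here is a Falco-style rule for the first detection above (a shell spawned inside a container). Falco ships a similar rule out of the box, so treat this as the shape of a tuned rule, not something to append verbatim:

```yaml
# Sketch of a Falco rule; Falco's default ruleset covers this case
# already -- the point is to tune conditions to your workloads.
- rule: Interactive shell spawned in container
  desc: A shell started inside a running container (possible kubectl exec)
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh)
    and proc.tty != 0
  output: >
    Shell in container (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell, mitre_execution]
```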
8. Node and cluster hardening
The easy-to-miss layer. Nodes are Linux machines, and they need the same hygiene as any other Linux machine, plus Kubernetes-specific hardening.
- CIS hardened OS (Ubuntu CIS, Bottlerocket, Flatcar) rather than a general-purpose distro
- SSH access to nodes disabled or tightly restricted; node access should be via the API
- Kubelet configuration: anonymous auth disabled, authorization mode set to Webhook
- etcd access restricted to control-plane nodes only (not applicable on managed offerings)
- Automatic node updates enabled (managed node groups, Karpenter with node rotation, or equivalent)
- Resource quotas and limits set at the namespace level to prevent noisy-neighbor DoS
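The kubelet items above map to a few fields in the kubelet's config file (the same settings exist as command-line flags; the field names below are the `KubeletConfiguration` form):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
authentication:
  anonymous:
    enabled: false          # no unauthenticated access to the kubelet API
  webhook:
    enabled: true           # delegate authentication to the API server
authorization:
  mode: Webhook             # SubjectAccessReview checks, not AlwaysAllow
readOnlyPort: 0             # keep the legacy 10255 read-only port closed
```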
9. Managed Kubernetes specifics
Each managed offering has its own security-relevant controls. The short version:
EKS
- IRSA (IAM Roles for Service Accounts) or Pod Identity for workload IAM
- EKS control-plane logging enabled (api, audit, authenticator, controllerManager, scheduler)
- Private API endpoint access, or at minimum IP-restricted public access
- Security groups for pods where applicable
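For reference, IRSA is wired up with a single annotation on the workload's ServiceAccount (the account ID, role name, and namespace below are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api                 # illustrative workload SA
  namespace: payments                # illustrative namespace
  annotations:
    # Placeholder ARN: the IAM role this workload may assume via IRSA.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/payments-api
```

The audit question is then the IAM-side mirror of section 2: does each mapped role carry only the permissions its workload actually uses?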
GKE
- Workload Identity enabled for GCP IAM mapping
- Shielded nodes and Confidential GKE Nodes where supported
- Private cluster mode with authorized networks
- Binary Authorization enabled for image provenance enforcement
AKS
- Managed identity or workload identity for Azure resource access
- Azure Policy add-on enabled for Gatekeeper-backed policy enforcement
- Private cluster mode with authorized IP ranges
- Microsoft Defender for Containers enabled
After the audit
Once you've run through the checklist, you'll have a list of findings. The mistake most teams make here is trying to fix everything at once. Don't. Prioritize by blast radius: what's the worst thing an attacker could do with this finding, and how hard is it to reach? Then fix the top five before touching the long tail.
Want us to run this audit for you?
CKS-led, CIS-aligned, with a verification retest completed within 60 days of report delivery.