
Container Image Hardening: Distroless, Non-Root, Read-Only FS, and the Reasons

Most 'hardened' container images we review are hardened against the wrong threat model. This is the guide we use — what each control actually buys you, the attack scenario it prevents, and the order to ship them in so you don't break every workload on day one.

Why most hardening is hardening the wrong thing

If you ask ten engineers what container hardening means, you'll get ten lists, and most of them will be the same five bullet points: distroless base, non-root user, read-only root filesystem, drop capabilities, and run a vulnerability scanner. Those bullets aren't wrong, but they're cargo-culted often enough that the teams shipping them rarely articulate what attack each control prevents. The result is hardening that satisfies a checklist and stops nothing in particular.

The right way to think about image hardening is to start from the threat model. Container images exist on a spectrum of trust. At one end, the image is yours, built from your source, signed by your CI, and deployed only into your cluster. At the other, it's a community image you pulled from Docker Hub last year. The hardening you need depends on where on that spectrum the image lives, what's inside it, and what an attacker who lands inside the running container could reach next.

This guide is the hardening baseline we apply on engagements, with the actual attack scenarios each control prevents. We'll go through each control, explain what it stops, and give you the order to ship them in so you don't break every workload on day one.


1. What an attacker actually does after popping a container

Before we talk about controls, let's talk about the attacker. When someone gets RCE in your containerized application, here's what they do in the first sixty seconds.

  • Check the user ID. If it's uid 0, they have root inside the container, which means they can install tools, modify files, and abuse capabilities.
  • Look for credentials. /proc/self/environ for env vars, /var/run/secrets/kubernetes.io/serviceaccount/token for the SA token, ~/.aws/credentials, ~/.kube/config, anything readable.
  • Try to reach the cloud metadata service at 169.254.169.254. If reachable, they get IAM credentials for the node.
  • Test the network. Can they reach the kube-apiserver? Other Pods? The internet? Internal services?
  • Inventory the installed tools. curl, wget, bash, python, nc, package managers. The more tools they find, the easier the next step gets.
  • Try to escape. Is the container privileged? Is the docker socket mounted? Are dangerous capabilities granted?

Every hardening control we'll discuss exists to make one or more of these steps fail. Keep that mapping in mind — it's the difference between hardening that matters and hardening that's just noise.

2. Distroless and minimal bases

A distroless image is one with no shell, no package manager, no coreutils — just the minimum runtime your application needs. Google publishes distroless base images for the major language runtimes; Chainguard publishes a hardened equivalent. For Go and Rust binaries, you can go even further with a scratch base that's literally empty.

What this prevents: an attacker who lands a shell can't run a shell, because there isn't one. They can't run curl to download a tool, because curl isn't there. They can't use apt-get to install one, because the package manager is missing. The attack doesn't stop, but the cost goes up sharply. The attacker now has to compile or smuggle in every tool they want to use, and most automated post-exploitation kits assume basic Unix utilities are present.

The pushback you'll get from engineering: "we need curl in the image for health checks." Almost always, this is wrong. Health checks should use the application's own endpoints, called by the kubelet — not by a shell command inside the container. Move the health check to a proper HTTP probe and you don't need curl.
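A minimal sketch of what that looks like in the Pod spec — the endpoint paths and port here are assumptions; use whatever your application actually exposes:

```yaml
# The kubelet performs these probes itself over HTTP — no curl, no shell,
# no exec inside the container. Paths and port are placeholders.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
```

Once the probes are HTTP, curl can come out of the image entirely.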

A second pushback: "we need shell access to debug." This is what kubectl debug and ephemeral debug containers exist for. Run a debug image as an ephemeral container attached to the running Pod, get your tools, do your work, and the production image stays clean.

3. Non-root, and why uid 0 matters even in a container

The default for most language base images is to run as root inside the container. This is bad, even when the container is unprivileged, for three reasons. First, absent user-namespace remapping, root inside the container is the same UID as root on the host, and several known container-escape vulnerabilities required root inside the container as a precondition. Second, root means the attacker can write anywhere in the filesystem, install packages (if a package manager exists), and modify files mid-execution. Third, several Kubernetes admission policies (Pod Security Standards "restricted") explicitly disallow running as root, so you'll fail those gates anyway.

The fix is to add a non-root user in the Dockerfile and run as that user. Pin the UID — using a named user like USER appuser works at build time but Kubernetes can't enforce "non-root" without a numeric UID. Use USER 10001 (any non-zero UID, preferably one that doesn't exist in the host's /etc/passwd either). Then in the Pod spec, set runAsNonRoot: true and the kubelet will refuse to start the container if the UID is zero. Belt and braces.

What this prevents: every attack that depends on writing to a location only root can write to. Every attack that depends on the container UID matching the host's root UID. Most automated post-exploitation frameworks that assume root.

4. Read-only root filesystem

readOnlyRootFilesystem: true in the Pod's securityContext makes the container's root filesystem read-only at runtime. The application still starts, the binary still runs, but nothing can be written to disk except in volumes you explicitly mount as writable.

What this prevents: persistence. An attacker who lands in the container can do whatever they want in memory, but the moment they try to drop a file — a backdoor binary, a cron job, a config change, a malicious shared object for the next process — the write fails. They can still execute code in the running process, so this isn't a defense against the immediate exploit. It's a defense against the second stage, which is where most attacks turn into incidents.

The implementation friction is real. Many applications expect to write somewhere — usually /tmp, sometimes a cache directory, sometimes session state. The fix is to mount an emptyDir at exactly those paths and leave the rest of the filesystem read-only. Spend an afternoon profiling the writes the application makes (use strace or audit the application's code), enumerate them, mount the volumes, and you're done. We've never had a real application that couldn't be made to work this way, though we have had a few that took a day of debugging.

5. Drop capabilities, and add only what you need

Linux capabilities split root's powers into discrete buckets. A container running as root with all capabilities can do almost anything root on the host can do, modulo the namespacing layer. Drop capabilities and you take those powers away even from a root process inside the container.

The right starting point is drop: ["ALL"] in the container's securityContext.capabilities. Then add back only the ones the application actually needs. For most applications, that list is empty. For applications that bind to a privileged port, it's NET_BIND_SERVICE. For nothing else, almost ever. If your application thinks it needs a capability, profile it with capable from bcc-tools and find out which one and why — most of the time the requirement is wrong, and the application can be reconfigured.

What this prevents: capability-based privilege escalation. An attacker inside the container can't use CAP_SYS_ADMIN to mount a filesystem, can't use CAP_NET_RAW to send crafted packets, can't use CAP_DAC_OVERRIDE to bypass file permissions. The container is contained.

6. seccomp and AppArmor (or SELinux)

seccomp is a syscall filter. The container declares which syscalls it's allowed to make, and the kernel rejects everything else. The default Docker seccomp profile blocks about 40 syscalls commonly abused by exploits; the Kubernetes RuntimeDefault profile applies a similar set. Set seccompProfile: { type: RuntimeDefault } on every Pod and you've cut the kernel attack surface meaningfully without writing a custom profile.

AppArmor (on Debian-family hosts) and SELinux (on RHEL-family hosts) are mandatory access control systems that scope what the container can do at the file/process level. Both are useful and both are slightly painful to configure. If you're running on a managed Kubernetes platform, the defaults are usually sane; if you're running on your own nodes, learn one of them and apply a default-deny profile.

What this prevents: kernel exploits that depend on rare syscalls, container escapes that abuse obscure or rarely-audited syscall paths, and several known CVEs in the runc and crun runtimes that could only be triggered through specific syscall combinations.

7. Image provenance — signing and verification

Hardening an image is wasted effort if an attacker can swap your image for a different one in the registry. The fix is image signing with Sigstore (cosign), and admission-time verification in the cluster.

The flow looks like this. Your CI pipeline builds an image, pushes it to the registry, and signs it with cosign using a key that lives in the CI's identity (keyless signing via OIDC is the modern pattern; you don't manage long-lived signing keys). The cluster runs an admission policy (Kyverno, Sigstore Policy Controller, or equivalent) that verifies the signature on every image before it's allowed to start. Unsigned images, or images signed by the wrong identity, get rejected at admission.

What this prevents: registry compromise, typo-squatting on image names, and the entire class of attacks where someone replaces a legitimate image with a malicious one and waits for the cluster to pull it.

8. Vulnerability scanning, with thresholds that mean something

Every CI pipeline scans images for known CVEs. Trivy, Grype, and Snyk are the common tools. What's usually wrong is the threshold: the pipeline either fails on every CVE (so the team turns the gate off because builds break constantly) or fails on nothing (so the gate is theatre).

The threshold we use on engagements: fail the build on critical and high CVEs in the application dependencies (the layers your team controls), warn but don't fail on critical and high CVEs in the base image (because the base image gets rebased on a schedule, and you don't want every upstream advisory to break unrelated builds), and ignore everything below high. Tune from there based on what your team actually triages.

The other half of vulnerability scanning is the rebuild cadence. Even if your image was clean yesterday, new CVEs get published every day, and an image that hasn't been rebuilt in six months is full of them. The fix is a scheduled CI job that rebuilds every production image weekly, whether or not the source changed, so the base layers pick up upstream patches automatically.

9. Multi-stage builds and the build/runtime split

Compilers, package managers, dev dependencies, test fixtures — none of these belong in the runtime image. Multi-stage Docker builds let you compile in one stage with the full toolchain and copy only the resulting binary into a clean runtime stage based on distroless or scratch.

The benefit is twofold. First, the runtime image is smaller, which means fewer CVEs and less attack surface. Second, the runtime image is auditable: you can list every file in it and reason about each one. A 50MB Go binary in a scratch image has exactly one binary and exactly one CA bundle. A 1GB Node.js image has thousands of files and you have no idea what most of them do.

If you're not using multi-stage builds yet, this is the single highest-leverage change you can make. Most images shrink by an order of magnitude.

10. The order to ship them in

You can't apply all of these on the same day to a running production fleet. Here's the order we use, ranked by impact-per-disruption.

  1. Run as non-root with a numeric UID. Cheapest control with the highest impact. Most apps already work this way; the few that don't take an afternoon to fix.
  2. Drop ALL capabilities, add back only what you need. Almost no apps need any capabilities. The few that do are well-known cases.
  3. Switch to multi-stage builds and minimal/distroless bases. Big mechanical change but a one-time cost per Dockerfile.
  4. Read-only root filesystem. The one with the highest debug cost on the first pass. Worth it; do it after the cheaper wins are in.
  5. seccomp RuntimeDefault. Free if your runtime supports it. Just turn it on.
  6. Image signing and admission verification. Big lift but transformative. Schedule it deliberately as a project, not a sprint task.
  7. CVE scanning gates and weekly rebuilds. Wire into CI and forget.

The short version

A hardened image runs as a non-root numeric UID, has no shell or package manager, has a read-only root filesystem with minimal writable mounts, drops all capabilities, applies the default seccomp profile, is signed at build time, and is verified at admission. The Dockerfile uses a multi-stage build with a distroless or scratch runtime stage, is rebuilt weekly to pick up base-image patches, and fails the build on critical CVEs in application dependencies. None of this is exotic, none of it requires a vendor product, and none of it costs more than a week of engineering time per service to apply properly.

When the next container escape CVE gets published, the teams that have shipped this baseline read it and shrug. The teams that haven't spend a week scrambling. That's the value of the work.

Want us to harden your images?

We review your base images, build a hardening baseline your team can apply, and wire enforcement into CI so regressions get caught before they ship.