Why you're reading this
Kubernetes ships with a default-allow posture. Every pod can reach every other pod. Most teams run production this way for months or years, discover lateral movement during an incident, then face a choice: retrofit network policies into a cluster that was designed without them, or accept the risk. This post is a reference for the retrofit path — the patterns that work, the rollout sequence that doesn't break traffic, and the tooling that keeps you honest.
1. Why default-allow is the Kubernetes default
Kubernetes defaults to "let any pod talk to any pod" because it makes day-one deployments work. No policy thinking required. But this choice was made in 2014 when the threat model was different — you ran Kubernetes on your own hardware in a private data center, your blast radius was bounded, and you trusted the humans deploying workloads.
Today, every cluster is multi-tenant: shared by different teams, connected to the internet, running vendor software you don't control, executing user-supplied code. An attacker who breaks into one pod has a straight line to every other pod, every database connection, every API key sitting in an environment variable. Almost every cluster we audit has zero network policies. The ones that have any usually have a default-deny in one namespace and nothing else.
2. The default-deny baseline pattern
Start here. This is the anchor policy for your cluster:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```

This blocks everything. No pod can receive traffic, no pod can send traffic. It will break your cluster immediately. That's the point — you deploy this in audit mode first, measure what breaks, then add allow policies for known-good traffic patterns.
Add this second:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Every pod needs DNS on both UDP and TCP — UDP handles most lookups, but DNS falls back to TCP for large responses and zone transfers. Everything else stays blocked until you explicitly allow it.
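The selector above assumes you've labeled kube-system with name=kube-system. On Kubernetes 1.21 and newer you can skip the manual label: every namespace automatically carries a kubernetes.io/metadata.name label. The same egress rule, rewritten against the automatic label:

```yaml
# Fragment of the allow-dns egress rule, matching the automatic
# kubernetes.io/metadata.name label instead of a hand-applied one
egress:
- to:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
```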
3. Namespace isolation patterns
Label your namespaces so your policies can key off them:

```shell
kubectl label namespace production name=production
kubectl label namespace staging name=staging
```

Then create a policy in production that allows incoming traffic only from pods in the same namespace or from an ingress namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    - namespaceSelector:
        matchLabels:
          name: ingress
```

This pattern isolates namespaces from each other while allowing ingress controllers to route traffic in. Critical if you run multiple teams on the same cluster.
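Namespace isolation is the coarse cut; within a namespace you can tighten further with pod selectors. A sketch using hypothetical app labels: only pods labeled app=frontend may reach app=backend, and only on its service port (8080 here is an assumption, not from the original):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-from-frontend-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend      # policy applies only to backend pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend  # only frontend pods in this namespace
    ports:
    - protocol: TCP
      port: 8080         # and only on the service port
```

Because NetworkPolicy rules are additive, this coexists cleanly with the same-namespace and ingress-controller policies above.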
4. Egress restriction — the metadata service
This is where most teams fail. The AWS metadata service lives at 169.254.169.254:80.
Any pod can query it and pull cloud credentials. An attacker lands in a pod, hits the metadata
service, gets a temporary AWS key, and your blast radius expands from "this pod" to "whatever IAM
role is attached to this node."
NetworkPolicy has no deny verb — it is allow-list only. You block the metadata service by allowing only the traffic pods actually need and leaving the link-local range out:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-service
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector: {}
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443
```

This allows all traffic to pods in the same namespace and HTTPS to pods in any namespace; 169.254.169.254 matches neither selector, so it stays blocked. The metadata service is the first thing an attacker hits after landing in a pod — it's the fastest path to cloud credentials. GCP uses the same endpoint on port 80. If you're running GKE with Workload Identity, also block 169.254.169.252:988 — the metadata daemon redirects there internally.
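If your pods also need egress to the internet, you can carve the metadata endpoint out of a broad allow with ipBlock's except list. A sketch, assuming production pods are allowed to reach any external HTTPS endpoint (ipBlock is intended for cluster-external CIDRs; in-cluster traffic still needs its own rules):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internet-except-metadata
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0            # any destination...
        except:
        - 169.254.169.254/32       # ...except the metadata service
    ports:
    - protocol: TCP
      port: 443
```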
5. Cilium and Calico extensions
Vanilla NetworkPolicy is L3/L4 only — IP and port. You can't say "allow egress to api.example.com only on the /health path." Cilium and Calico extend this to
L7.
Cilium example for HTTP path-based policy:
```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-health
  namespace: production
spec:
  endpointSelector: {}
  egress:
  - toFQDNs:
    - matchName: api.example.com
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: /health
```

This is more expressive and closer to your actual intent. Two caveats: L7 HTTP rules inspect cleartext, so filtering traffic on port 443 only works if Cilium's TLS visibility is also configured; and it requires installing Cilium as your CNI and understanding eBPF debugging. If you're on a managed cluster (EKS, GKE), check whether the provider supports it before investing.
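One operational detail with toFQDNs: Cilium learns the IPs behind a name by watching DNS lookups, so pods covered by an FQDN policy also need a rule that routes their DNS through Cilium's DNS proxy. A sketch following the pattern in Cilium's FQDN documentation (the kube-dns labels match the stock CoreDNS deployment; adjust if yours differs):

```yaml
# Companion egress rule: let pods resolve names via kube-dns,
# with a DNS rule so Cilium's proxy can observe the lookups
egress:
- toEndpoints:
  - matchLabels:
      k8s:io.kubernetes.pod.namespace: kube-system
      k8s-app: kube-dns
  toPorts:
  - ports:
    - port: "53"
      protocol: ANY
    rules:
      dns:
      - matchPattern: "*"   # allow all lookups; tighten per workload
```

Without this, toFQDNs policies silently match nothing because Cilium never sees the name-to-IP mapping.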
6. The rollout sequence that doesn't break production
- Audit mode: Enable flow logging in your CNI (Calico's flow logs or Cilium's Hubble) before deploying any deny rules. Watch traffic logs for a week. Map every legitimate flow. Don't enforce yet.
- Namespace-scoped enforcement: Pick one non-critical namespace. Enable default-deny enforcement there. Watch for alerts. Run your load tests. Fix traffic that broke.
- Expand namespaces: Move staging, then less-critical production workloads to enforcement.
- Cluster-wide enforcement: Once you have a stable set of policies and your team understands the debugging model, roll out cluster-wide.
Every step takes longer than you think. Budget a quarter minimum if you're retrofitting into an existing cluster.
7. Service mesh interactions
If you're running Istio or Linkerd, you have L7 policy already. But the service mesh sits on top of the network layer. NetworkPolicy still applies — it's not replaced, it's supplemented. Your service mesh can allow traffic on a path basis, but NetworkPolicy can still block the TCP connection entirely.
Design for defense-in-depth: NetworkPolicy is your outer ring (coarse-grained, L3/L4). Service mesh is your inner ring (fine-grained, L7). Both should apply the principle of least privilege independently. If either one says no, traffic dies.
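As an illustration of the inner ring, a sketch assuming Istio with mTLS enabled (the app labels and service-account name are hypothetical): an AuthorizationPolicy that admits only the frontend's identity, and only on read operations:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend                # applies to backend workloads
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity of the frontend's service account;
        # requires mTLS between the workloads
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
```

Even with this in place, keep the NetworkPolicy layer: if the mesh sidecar is bypassed or misconfigured, the L3/L4 ring still holds.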
8. What a real attacker does without network policies
You run a web app. Attacker finds an SQLi. They land a shell in the pod. No network policies exist. They enumerate:
1. Query the metadata service. Grab cloud credentials for the node's IAM role.
2. Port-scan the cluster. Find Redis on 10.0.1.50:6379 with no auth.
3. Keep scanning. Find a Postgres database on 10.0.2.100:5432.
4. Exfiltrate the database. Push it to a bucket they own using the cloud credentials.
5. Scan further. Find admin dashboards, CI/CD runners, other services.
With default-deny + explicit allows, steps 2 onwards fail. The attacker stays in the compromised pod. They can't reach Redis, the database, or anything else. Your blast radius shrinks dramatically.
9. Validation tooling
netassert: Write assertions about your network topology, run them as tests. Tells you if your policies actually enforce what you think they do.
Trivy (cluster scanning): Aqua's Trivy includes Kubernetes misconfiguration scanning and surfaces the kinds of gaps that enable lateral movement. It replaced kube-hunter, which is no longer maintained.
Manual connectivity tests: In each namespace, deploy a simple test pod. Try to
curl services in other namespaces. Try to reach 169.254.169.254. Capture what should
be blocked and what should work. Repeat after every policy change.
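A minimal probe pod for those checks — the nicolaka/netshoot image is one choice (any image with curl works), and the namespace is whichever one you're validating:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: policy-probe
  namespace: production
spec:
  containers:
  - name: probe
    image: nicolaka/netshoot   # assumption: any curl-capable image is fine
    command: ["sleep", "infinity"]
```

Then `kubectl exec policy-probe -n production -- curl -m 3 http://169.254.169.254/` should time out once egress policies are enforced, and the same curl against an allowed in-namespace service should succeed.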
The short version
Kubernetes defaults to allow-all networking. Start with a default-deny baseline in audit mode,
measure what breaks, then add explicit allow policies for known-good traffic. Label your
namespaces so policies can isolate them from each other. Deny the metadata service at 169.254.169.254 to prevent credential exfiltration — this is the first move an attacker
makes after breaking into a pod. Roll out namespace by namespace over at least a quarter. Use Cilium
or Calico if you need L7 policies, but vanilla NetworkPolicy covers most patterns. Validate with netassert
and manual tests. The cost of retrofitting policies into an existing cluster is high, but the alternative
— rebuilding your cluster with policy-first design — is higher.
Want us to ship network policies without breaking production?
CKS-led, audit-mode-first, with a rollout sequence tested against your actual traffic. We map the flows, write the policies, and verify them before enforcement.