Why you're reading this
Kubernetes ships with a default-allow posture. Every pod can reach every other pod. Most teams run production this way for months or years, discover lateral movement during an incident, then face a choice: retrofit network policies into a cluster that was designed without them, or accept the risk. This post is a reference for the retrofit path — the patterns that work, the rollout sequence that doesn't break traffic, and the tooling that keeps you honest.
1. Why default-allow is the Kubernetes default
Kubernetes defaults to "let any pod talk to any pod" because it makes day-one deployments work. No policy thinking required. But this choice was made in 2014 when the threat model was different — you ran Kubernetes on your own hardware in a private data center, your blast radius was bounded, and you trusted the humans deploying workloads.
Today, every cluster is multi-tenant: shared by different teams, connected to the internet, running vendor software you don't control, executing user-supplied code. An attacker who breaks into one pod has a straight line to every other pod, every database connection, every API key sitting in an environment variable. Almost every cluster we audit has zero network policies. The ones that have any usually have a default-deny in one namespace and nothing else.
2. The default-deny baseline pattern
Start here. This is the anchor policy for your cluster:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```

This blocks everything. No pod can receive traffic, no pod can send traffic. It will break your cluster immediately. That's the point — you deploy this in audit mode first, measure what breaks, then add allow policies for known-good traffic patterns.
Add this second:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Every pod needs DNS on both UDP and TCP — UDP handles most lookups, but DNS falls back to TCP for large responses and zone transfers. Everything else stays blocked until you explicitly allow it.
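The selector above assumes you've labeled kube-system with name=kube-system. On Kubernetes 1.21 and newer you can skip the manual label: every namespace automatically carries a kubernetes.io/metadata.name label. The same egress rule, rewritten against the automatic label:

```yaml
# Fragment of the allow-dns egress rule, matching the automatic
# kubernetes.io/metadata.name label instead of a hand-applied one
egress:
- to:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
  ports:
  - protocol: UDP
    port: 53
  - protocol: TCP
    port: 53
```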
3. Namespace isolation patterns
Label your namespaces so your policies can key off them:

```shell
kubectl label namespace production name=production
kubectl label namespace staging name=staging
```

Then create a policy in production that allows incoming traffic only from pods in the same namespace or from an ingress namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}
    - namespaceSelector:
        matchLabels:
          name: ingress
```

This pattern isolates namespaces from each other while allowing ingress controllers to route traffic in. Critical if you run multiple teams on the same cluster.
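Namespace isolation is the coarse cut; within a namespace you can tighten further with pod selectors. A sketch using hypothetical app labels: only pods labeled app=frontend may reach app=backend, and only on its service port (8080 here is an assumption, not from the original):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-from-frontend-only
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend      # policy applies only to backend pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend  # only frontend pods in this namespace
    ports:
    - protocol: TCP
      port: 8080         # and only on the service port
```

Because NetworkPolicy rules are additive, this coexists cleanly with the same-namespace and ingress-controller policies above.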
4. Egress restriction — the metadata service
This is where most teams fail. The AWS metadata service lives at 169.254.169.254:80.
Any pod can query it and pull cloud credentials. An attacker lands in a pod, hits the metadata
service, gets a temporary AWS key, and your blast radius expands from "this pod" to "whatever IAM
role is attached to this node."
NetworkPolicy has no deny verb — it is allow-list only. You block the metadata service by allowing only the traffic pods actually need and leaving the link-local range out:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-metadata-service
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector: {}
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443
```

This allows all traffic to pods in the same namespace and HTTPS to pods in any namespace; 169.254.169.254 matches neither selector, so it stays blocked. The metadata service is the first thing an attacker hits after landing in a pod — it's the fastest path to cloud credentials. GCP uses the same endpoint on port 80. If you're running GKE with Workload Identity, also block 169.254.169.252:988 — the metadata daemon redirects there internally.
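If your pods also need egress to the internet, you can carve the metadata endpoint out of a broad allow with ipBlock's except list. A sketch, assuming production pods are allowed to reach any external HTTPS endpoint (ipBlock is intended for cluster-external CIDRs; in-cluster traffic still needs its own rules):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-internet-except-metadata
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0            # any destination...
        except:
        - 169.254.169.254/32       # ...except the metadata service
    ports:
    - protocol: TCP
      port: 443
```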
5. Cilium and Calico extensions
Vanilla NetworkPolicy is L3/L4 only — IP and port. You can't say "allow egress to api.example.com only on the /health path." Cilium and Calico extend this to
L7.
Cilium example for HTTP path-based policy:
```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-health
  namespace: production
spec:
  endpointSelector: {}
  egress:
  - toFQDNs:
    - matchName: api.example.com
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      rules:
        http:
        - method: GET
          path: /health
```

This is more expressive and closer to your actual intent. Two caveats: L7 HTTP rules inspect cleartext, so filtering traffic on port 443 only works if Cilium's TLS visibility is also configured; and it requires installing Cilium as your CNI and understanding eBPF debugging. If you're on a managed cluster (EKS, GKE), check whether the provider supports it before investing.
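One operational detail with toFQDNs: Cilium learns the IPs behind a name by watching DNS lookups, so pods covered by an FQDN policy also need a rule that routes their DNS through Cilium's DNS proxy. A sketch following the pattern in Cilium's FQDN documentation (the kube-dns labels match the stock CoreDNS deployment; adjust if yours differs):

```yaml
# Companion egress rule: let pods resolve names via kube-dns,
# with a DNS rule so Cilium's proxy can observe the lookups
egress:
- toEndpoints:
  - matchLabels:
      k8s:io.kubernetes.pod.namespace: kube-system
      k8s-app: kube-dns
  toPorts:
  - ports:
    - port: "53"
      protocol: ANY
    rules:
      dns:
      - matchPattern: "*"   # allow all lookups; tighten per workload
```

Without this, toFQDNs policies silently match nothing because Cilium never sees the name-to-IP mapping.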
6. The rollout sequence that doesn't break production
- Audit mode: Enable flow logging in your CNI (Calico's flow logs or Cilium's Hubble) before deploying any deny rules. Watch traffic logs for a week. Map every legitimate flow. Don't enforce yet.
- Namespace-scoped enforcement: Pick one non-critical namespace. Enable default-deny enforcement there. Watch for alerts. Run your load tests. Fix traffic that broke.
- Expand namespaces: Move staging, then less-critical production workloads to enforcement.
- Cluster-wide enforcement: Once you have a stable set of policies and your team understands the debugging model, roll out cluster-wide.
Every step takes longer than you think. Budget a quarter minimum if you're retrofitting into an existing cluster.
7. Service mesh interactions
If you're running Istio or Linkerd, you have L7 policy already. But the service mesh sits on top of the network layer. NetworkPolicy still applies — it's not replaced, it's supplemented. Your service mesh can allow traffic on a path basis, but NetworkPolicy can still block the TCP connection entirely.
Design for defense-in-depth: NetworkPolicy is your outer ring (coarse-grained, L3/L4). Service mesh is your inner ring (fine-grained, L7). Both should apply the principle of least privilege independently. If either one says no, traffic dies.
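As an illustration of the inner ring, a sketch assuming Istio with mTLS enabled (the app labels and service-account name are hypothetical): an AuthorizationPolicy that admits only the frontend's identity, and only on read operations:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend                # applies to backend workloads
  action: ALLOW
  rules:
  - from:
    - source:
        # SPIFFE identity of the frontend's service account;
        # requires mTLS between the workloads
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
```

Even with this in place, keep the NetworkPolicy layer: if the mesh sidecar is bypassed or misconfigured, the L3/L4 ring still holds.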
8. What a real attacker does without network policies
You run a web app. Attacker finds an SQLi. They land a shell in the pod. No network policies exist. They enumerate:
1. Query the metadata service. Grab cloud credentials for the node's IAM role.
2. Port-scan the cluster. Find Redis on 10.0.1.50:6379 with no auth.
3. Keep scanning. Find a Postgres database on 10.0.2.100:5432.
4. Exfiltrate the database. Push it to a bucket they own using the cloud credentials.
5. Scan further. Find admin dashboards, CI/CD runners, other services.
With default-deny + explicit allows, steps 2 onwards fail. The attacker stays in the compromised pod. They can't reach Redis, the database, or anything else. Your blast radius shrinks dramatically.
9. Validation tooling
netassert: Write assertions about your network topology, run them as tests. Tells you if your policies actually enforce what you think they do.
Trivy (cluster scanning): Aqua's Trivy includes Kubernetes misconfiguration scanning and surfaces the kinds of gaps that enable lateral movement. It replaced kube-hunter, which is no longer maintained.
Manual connectivity tests: In each namespace, deploy a simple test pod. Try to
curl services in other namespaces. Try to reach 169.254.169.254. Capture what should
be blocked and what should work. Repeat after every policy change.
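A minimal probe pod for those checks — the nicolaka/netshoot image is one choice (any image with curl works), and the namespace is whichever one you're validating:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: policy-probe
  namespace: production
spec:
  containers:
  - name: probe
    image: nicolaka/netshoot   # assumption: any curl-capable image is fine
    command: ["sleep", "infinity"]
```

Then `kubectl exec policy-probe -n production -- curl -m 3 http://169.254.169.254/` should time out once egress policies are enforced, and the same curl against an allowed in-namespace service should succeed.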
The short version
Kubernetes defaults to allow-all networking. Start with a default-deny baseline in audit mode,
measure what breaks, then add explicit allow policies for known-good traffic. Label your
namespaces so policies can isolate them from each other. Deny the metadata service at 169.254.169.254 to prevent credential exfiltration — this is the first move an attacker
makes after breaking into a pod. Roll out namespace by namespace over at least a quarter. Use Cilium
or Calico if you need L7 policies, but vanilla NetworkPolicy covers most patterns. Validate with netassert
and manual tests. The cost of retrofitting policies into an existing cluster is high, but the alternative
— rebuilding your cluster with policy-first design — is higher.
Want us to ship network policies without breaking production?
CKS-led, audit-mode-first, with a rollout sequence tested against your actual traffic. We map the flows, write the policies, and verify them before enforcement.