Why RBAC is broken almost everywhere
Almost every cluster we audit has at least one ServiceAccount bound to cluster-admin and
nobody on the team remembers why. The pattern is always the same: someone needed a permission, the quickest
way to grant it was a wildcard, and the wildcard never got walked back. Multiply that across two years
of Helm chart installs, three platform engineers, and a couple of incident responses where the fix was
"give it more permissions and we'll fix it later," and you end up with a cluster where the principle
of least privilege exists only as a slide in the onboarding deck.
The thing is, Kubernetes RBAC isn't actually hard. The primitives are small, the verbs are enumerable, and the YAML is boring. The reason RBAC stays broken is that nobody runs the audit, and when they finally do, the prospect of touching production bindings without breaking anything is scary enough that the audit becomes a slide deck instead of a change. This post is the walkthrough we run on engagements — the primitives, the patterns that go wrong, and the exact sequence we use to take a permissive cluster down to least privilege without paging the on-call.
1. The primitives, in plain English
Kubernetes RBAC has four object kinds and one rule. The objects are Role, ClusterRole, RoleBinding, and ClusterRoleBinding. The rule
is that a Role grants permissions inside one namespace, a ClusterRole grants them across the
entire cluster (or against cluster-scoped resources like Nodes and PersistentVolumes), and the
bindings glue subjects (users, groups, and ServiceAccounts) to those roles.
Inside a Role you have rules. Each rule is three lists: API groups, resources, and verbs. The
verbs are the small set you'd expect — get, list, watch, create, update, patch, delete, deletecollection, plus a few special-purpose ones like impersonate, bind, and escalate. The resources are objects like pods, secrets, deployments. The API groups are the namespacing layer above
resources — apps, batch, networking.k8s.io, and the empty
string for the core group.
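The whole model fits in a screenful of YAML. A minimal sketch of a Role and its binding — the namespace, role, and ServiceAccount names here are illustrative, not from any real cluster:

```yaml
# A Role granting read access to Pods in one namespace. All names here
# (payments, pod-reader, ci-deployer) are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: payments
  name: pod-reader
rules:
  - apiGroups: [""]            # "" is the core group (pods, secrets, configmaps)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# The binding glues a subject (here a ServiceAccount) to the Role.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: payments
  name: pod-reader-binding
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
```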
That's the entire model. The reason it gets confusing in practice is that real charts and operators stack dozens of rules per role, the bindings are spread across multiple files, and the cluster ships with built-in roles whose contents aren't obvious from the names.
2. The wildcard trap
The single most common finding on every audit is wildcards. A rule with verbs: ["*"] on resources: ["*"] in apiGroups: ["*"] is functionally cluster-admin, and it
shows up in places you wouldn't expect — a "monitoring" ServiceAccount that needed to read logs, a CI
runner that needed to deploy one Helm chart, a one-off debugging Role that got committed to Git in 2023.
Wildcards are seductive because they always work. The team needs the permission to ship, the wildcard guarantees it ships, and the cleanup ticket gets deprioritized. The fix is the same every time: enumerate the actual API calls the workload makes, write a Role that grants exactly those, and replace the wildcard. The hard part is the enumeration. We'll get to that in a minute.
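A before-and-after sketch of that fix. The scoped rule set below is for a hypothetical CI deployer that rolls Deployments and reads Pod logs; the real list comes out of the enumeration, so yours will differ:

```yaml
# Before: functionally cluster-admin, whatever the role is named.
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
# After: an enumerated rule set for a hypothetical CI deployer that
# rolls Deployments and reads their Pods and logs.
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
```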
A near miss to flag: resources: ["secrets"] with verbs: ["get", "list"] is, for almost every workload, equivalent to cluster-admin. If a Pod can read every Secret in its namespace
and that namespace contains the cloud credentials for the database, a ServiceAccount token bound to a privileged
ClusterRole, or the API key for the CI system, you've handed an attacker the keys to the rest of your environment.
We treat broad get secrets as a critical finding, not a minor one.
3. The four built-in roles you actually need to know
Kubernetes ships with a set of default ClusterRoles. The four that matter for audits are cluster-admin, admin, edit, and view.
cluster-admin is what it sounds like. Anything bound to it can do anything to anything,
including modifying the RBAC system itself. Nothing should be bound to cluster-admin except a break-glass
user that lives in a vault and gets used twice a year. Every other use is a finding.
admin grants full read/write inside a namespace, including managing Roles and
RoleBindings within that namespace. This sounds limited but it isn't — namespace admin can create
a new ServiceAccount, bind it to a privileged Role, and use it to read every Secret in the
namespace. Bind admin only to humans, never to ServiceAccounts, and only with a clear reason.
edit is admin minus the RBAC management. It can create and delete most resources but can't
grant permissions to other identities. This is the highest role most application ServiceAccounts should
ever get, and even then it's usually too broad.
view is read-only and excludes Secrets. This last bit is important — view is genuinely safe to hand out. A read-only role that included Secrets would be a privilege escalation
vector; the built-in view deliberately doesn't.
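One mechanical detail worth showing: referencing a ClusterRole from a namespaced RoleBinding scopes it to that one namespace, which is the standard way to reuse the built-ins. A sketch with illustrative namespace and group names:

```yaml
# view is a ClusterRole, but referencing it from a namespaced
# RoleBinding scopes it to that namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: payments              # illustrative
  name: team-read-access
subjects:
  - kind: Group
    name: payments-devs            # a group asserted by your IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                       # built-in, read-only, excludes Secrets
```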
4. ClusterRole aggregation, the feature nobody knows about
ClusterRole aggregation lets you build a role from other roles using label selectors. The built-in admin, edit, and view roles are themselves aggregated: they carry an aggregationRule instead of a fixed rule list, and a controller fills in their rules from any ClusterRole labelled with rbac.authorization.k8s.io/aggregate-to-admin: "true" (or the edit and view equivalents).
This matters for two reasons. First, when you install an operator, it usually ships its own ClusterRoles labelled to aggregate into the built-ins. That means installing the cert-manager operator quietly extends what your "view" users can see. Most teams don't realize this is happening.
Second, you can use aggregation deliberately. If you want a custom "platform-engineer" role that's
a strict superset of admin plus a few cluster-scoped reads, define a platform-engineer
ClusterRole whose aggregationRule selects both the built-in aggregate-to-admin label and a custom
aggregate-to-platform-engineer label, then put the additions in a small ClusterRole carrying the
custom label. New permissions get added by labelling new roles, not by editing the original.
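A sketch of that pattern. The platform-engineer names and the example.com label prefix are ours; only the aggregate-to-admin label is a real built-in:

```yaml
# The parent role has no rules of its own; a controller fills them in
# from any ClusterRole matching the selectors.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.authorization.k8s.io/aggregate-to-admin: "true"  # everything admin has
    - matchLabels:
        example.com/aggregate-to-platform-engineer: "true"    # plus our additions
rules: []   # managed by the aggregation controller
---
# One addition: cluster-scoped reads on Nodes.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: platform-engineer-node-read
  labels:
    example.com/aggregate-to-platform-engineer: "true"
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
```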
5. The audit workflow we actually run
When we walk into a cluster cold, here's the exact sequence we run before we make any change.
Step 1: Enumerate every binding
List every ClusterRoleBinding and RoleBinding in the cluster, group by subject, and produce a
matrix of "who has what." The free tools that do this well are kubectl-who-can, rbac-lookup, and krane. None of them are perfect; we usually combine
two. The output you want is a CSV: subject, namespace, role, source binding.
Step 2: Flag every cluster-admin binding
Anything bound to cluster-admin goes on the top of the list. For each one, you need three
answers: who created it, why, and is it still needed. If the answer to the third question is "I don't
know," it isn't needed. The remediation is to either delete the binding or replace it with a narrowly
scoped role.
Step 3: Flag every binding that grants get/list on secrets
Run a query against the rule set: which subjects can get or list secrets, in which namespaces. Most of them shouldn't be able to. The few that should (External Secrets
Operator, sealed-secrets controller, some CI integrations) should be scoped to the specific Secret names
they need, not all of them.
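Scoping by name looks like this; the Role and Secret names are illustrative:

```yaml
# Scoping secret access by name. resourceNames restricts get, update,
# patch, and delete; it cannot restrict list or create, so this Role
# grants get only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: ci                    # illustrative
  name: pipeline-secret-reader
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["registry-credentials", "deploy-key"]  # illustrative names
    verbs: ["get"]
```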
Step 4: Flag escalate, bind, and impersonate
These three verbs are the RBAC privilege-escalation primitives. escalate lets a
subject create or update a Role with permissions it doesn't hold itself. bind lets it create a
binding to a role containing permissions it doesn't hold. impersonate lets it execute API calls as
another user, group, or ServiceAccount. A subject holding escalate or bind against the RBAC API
group, or impersonate against a privileged identity, is effectively cluster-admin. Treat them the same way.
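For reference when grepping, the rule shapes to flag look like this (rule fragments, not a complete Role):

```yaml
# Either of these makes the holder effectively cluster-admin,
# regardless of what else the role grants or withholds.
rules:
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles", "clusterrolebindings"]
    verbs: ["escalate", "bind"]
  - apiGroups: [""]
    resources: ["users", "groups", "serviceaccounts"]
    verbs: ["impersonate"]
```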
Step 5: Enumerate what each ServiceAccount actually uses
For every workload binding, you need to know which API calls the workload actually makes — not
which it might make, the ones it does make. The ground truth lives in the kube-apiserver audit
log. If audit logging isn't on, that's its own finding; turn it on for your sensitive namespaces and let it run for a week. The Metadata audit level is enough for this work: it records the user, verb, and resource for every request without capturing request bodies, which matters when those requests involve Secrets.
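A minimal audit policy for this step might look like the following; the namespace names are illustrative, and the Metadata level keeps request bodies, including Secret contents, out of the log:

```yaml
# kube-apiserver audit policy; the first matching rule wins.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata              # user, verb, resource; no request bodies
    namespaces: ["payments", "ci"]   # illustrative namespaces
  - level: Metadata              # always log RBAC changes, cluster-wide
    resources:
      - group: "rbac.authorization.k8s.io"
  - level: None                  # drop everything else to keep volume sane
```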
With audit logs, you can group by ServiceAccount and produce the exact list of (verb, resource,
apiGroup) tuples each one used. That list is your target Role. The tool that automates this is audit2rbac (from the Kubernetes project), which derives a minimal Role straight from
audit-log entries. rakkess is a useful complement — it answers "what can this subject do
right now," which is handy for sanity-checking the result against the live cluster. audit2rbac is the
one that needs audit logging on.
Step 6: Write the target roles
For each workload, write a new Role (namespace-scoped) that grants exactly the verbs and resources the audit log showed. Add the bindings. Do not delete the old binding yet.
Step 7: Shadow, then cut over
Apply the new Role and Binding alongside the old one. Because RBAC is purely additive, the workload
keeps running on the union of both bindings, so nothing breaks while you observe. Watch the workload
for at least one full business cycle (a week is the minimum, two is safer), then diff the audit log
against the new Role: every (verb, resource, apiGroup) tuple the workload used during the window
should be covered by a rule in the new Role. Any tuple that isn't covered is a request that would
have been Forbidden under the new Role alone. This is the point of shadow rollout: you get to verify
the new Role is correct without risking the workload.
After the observation window, delete the old binding. If anything breaks in the next 48 hours, re-bind the old one and figure out what you missed. This is much, much safer than a flag day.
6. Patterns that always fail an audit
These are the patterns we flag every time. If any of them describe your cluster, fix them in this order.
- Default ServiceAccount with permissions: the default ServiceAccount in any namespace should have zero RoleBindings. Pods that don't specify a ServiceAccount fall back to default, and you don't want them inheriting unintended permissions. If default has bindings, audit each one and move the permissions to a named SA.
- Wildcards in production roles: verbs: ["*"] or resources: ["*"] in any role outside of kube-system. Always replaceable with a finite list.
- ClusterRoleBindings for namespaced workloads: if a workload only operates inside one namespace, the binding should be a RoleBinding, not a ClusterRoleBinding. ClusterRoleBindings grant permissions across every namespace, including ones you'll create later.
- Group bindings to all authenticated users: system:authenticated as a subject means "anyone who can talk to the API server." This shows up surprisingly often, usually as a leftover from a tutorial. Always a finding.
- Token auto-mount on Pods that don't need it: Pods auto-mount the SA token by default. Workloads that don't talk to the kube-apiserver should set automountServiceAccountToken: false. This isn't strictly RBAC, but a token for a ServiceAccount with zero permissions still gets mounted into the Pod, and it's one fewer thing on disk for an attacker who lands a shell.
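That last item is a one-line spec change. A sketch with an illustrative Pod:

```yaml
# A workload that never calls the API server opts out of the token
# mount. The field also exists on the ServiceAccount object; the
# Pod-level setting takes precedence. Pod name and image illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: static-site
spec:
  automountServiceAccountToken: false
  containers:
    - name: web
      image: nginx:1.27
```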
7. Namespaces are the only boundary you have
RBAC is namespace-scoped for almost everything. If two workloads need different permission sets,
they should run in different namespaces. We see a lot of clusters where everything runs in default and the bindings are a tangle of overlapping permissions. The fix is usually worth
the pain: split workloads by team or by trust level into separate namespaces, and you make RBAC manageable.
The other half of the story is NetworkPolicy. RBAC stops a workload from talking to the API server; NetworkPolicy stops it from talking to other Pods. If you tighten RBAC and leave the network flat, an attacker who lands in any Pod can still pivot freely. Tighten both or you've only half-solved the problem.
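A default-deny NetworkPolicy is a few lines; the namespace name is illustrative, and note it blocks everything, DNS included, until further policies open specific paths:

```yaml
# Default deny for both directions. An empty podSelector matches every
# Pod in the namespace; with no ingress or egress rules listed,
# nothing is allowed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  namespace: payments            # illustrative
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
```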
8. GitOps, drift, and the day-2 problem
Once you've shipped a least-privilege baseline, the question is how to keep it that way. The
answer is GitOps: every Role, RoleBinding, ClusterRole, and ClusterRoleBinding lives in Git, and
the cluster is reconciled from Git. Argo CD and Flux both do this well. The benefit isn't just the
audit trail — it's that drift becomes visible. When someone runs kubectl apply against the cluster directly, the GitOps controller flags it as a deviation,
and you can either revert or update the source of truth.
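A sketch of the reconciliation side in Argo CD terms; the repo URL and path are placeholders, and selfHeal is the setting that turns drift detection into automatic revert:

```yaml
# Argo CD Application reconciling an RBAC directory from Git.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rbac-baseline
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/rbac.git   # placeholder
    targetRevision: main
    path: clusters/prod                                  # placeholder
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert out-of-band kubectl changes automatically
```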
Without GitOps, the cluster will drift back toward permissive within a quarter. We've seen it enough times to predict it.
9. Where policy engines fit
OPA Gatekeeper and Kyverno both let you write admission-time policies that block bad RBAC at the
point of creation. Examples we use on every engagement: a policy that rejects any Role or
ClusterRole containing wildcards, a policy that rejects any new ClusterRoleBinding to cluster-admin, and a policy that requires every ServiceAccount to set automountServiceAccountToken: false unless it has an exemption label.
Admission policies are not a substitute for the audit and remediation work above — they prevent regressions, they don't fix the existing mess. Run the audit first, ship the cleanup, then turn on the policies. If you turn on the policies first, you'll be fighting them while you're still cleaning up.
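As a sketch of what the first of those policies can look like in Kyverno — the policy and rule names are ours, and the JMESPath expressions are a starting point to adapt, not a drop-in:

```yaml
# Kyverno policy rejecting wildcard verbs or resources in any Role or
# ClusterRole at admission time.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-rbac-wildcards
spec:
  validationFailureAction: Enforce
  rules:
    - name: no-wildcards-in-roles
      match:
        any:
          - resources:
              kinds: ["Role", "ClusterRole"]
      validate:
        message: "Wildcard verbs and resources are not allowed."
        deny:
          conditions:
            any:
              - key: "{{ contains(request.object.rules[].verbs[] || `[]`, '*') }}"
                operator: Equals
                value: true
              - key: "{{ contains(request.object.rules[].resources[] || `[]`, '*') }}"
                operator: Equals
                value: true
```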
10. The external identity problem
Kubernetes has no built-in user database. The "users" you grant RBAC to are abstract identifiers that the API server trusts because some authentication plugin asserts them. In production that's usually OIDC against your IdP, or a cloud-provider IAM integration (IAM-to-RBAC mapping in EKS, the equivalent on GKE and AKS).
The audit you run on RBAC bindings is incomplete if you don't also audit the IdP groups and the
cloud IAM mappings that produce those identities. A binding that grants cluster-admin to the group platform-engineering is only as tight as the process
for adding people to that group. We've seen orgs where the platform-engineering group had twenty members,
half of whom had left the company. Tighten RBAC, then tighten the upstream group membership.
The short version
A cluster with healthy RBAC has these properties: zero non-system bindings to cluster-admin, every ServiceAccount with a Role that matches its actual API usage,
every namespace with a default-deny NetworkPolicy, every ClusterRole and binding under GitOps,
admission policies blocking new wildcards and new cluster-admin bindings, and audit logging on so
you can see what's happening. None of this is exotic. It's just the discipline of running the
audit, doing the cleanup, and putting guardrails in place so the cleanup sticks.
The second time you run this audit, it takes a quarter of the time. By the third, it's a regular hygiene check. The hard part is the first pass — and that's the part we're usually called in to do.
Want us to tighten your RBAC?
CKS-led, audit-mode-first, zero broken deployments. We map every binding, propose a least-privilege target, and pair with your platform team to ship it.