At 11:40 p.m. on a Tuesday, one of our staging deploys passed every CI job and still managed to do something spectacularly dumb in production: it shipped a debug pod spec with hostNetwork: true and an overly permissive container securityContext. Nothing exploded instantly, which was the problem. We only noticed when egress logs looked weird and a routine scan flagged drift we should have blocked at admission time, not after rollout.
That incident changed how we treat Kubernetes policy. We stopped arguing about one giant policy engine and moved to a layered model: built-in Pod Security Admission labels for baseline guardrails, ValidatingAdmissionPolicy CEL rules for fast API-server checks, and Gatekeeper audit for broader governance and retroactive visibility.
This is the Kubernetes admission control playbook we now use when teams want speed without gambling on cluster safety.
Why admission policy is a delivery tool, not just a security checkbox
Most teams first encounter admission controls as a blocker. In practice, they are a feedback accelerator. Rejecting an unsafe manifest during apply is cheaper than discovering it during incident review. According to Kubernetes docs, Pod Security Admission is stable since v1.25, and ValidatingAdmissionPolicy is stable in v1.30. That matters because stable primitives reduce the “what if this feature changes next quarter?” risk.
Also, admission policy works best when it reflects your deployment reality. If your team already runs GitOps and progressive rollouts, policy should plug into that flow, not create a side workflow. If you are running Argo CD, this connects nicely with a drift-aware approach like our GitOps drift-detection runbook.
The layering model that avoids both chaos and policy fatigue
Layer 1, namespace guardrails with Pod Security Admission
Use Pod Security Admission (PSA) labels to set default safety posture per namespace. This catches broad pod hardening misses fast, with no custom webhook code. Start with warn and audit, then enforce when teams are clean.
```shell
# Start with visibility
kubectl label ns payments \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn-version=latest \
  pod-security.kubernetes.io/audit-version=latest --overwrite

# After cleanup, enforce
kubectl label ns payments \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/enforce-version=latest --overwrite
```
Tradeoff: PSA is opinionated and intentionally broad. Great for defaults, not ideal for business-specific rules like “every Deployment must carry an owner label and ticket reference.”
Layer 2, cluster-native rules with ValidatingAdmissionPolicy (CEL)
For custom checks that must run inline with API requests, ValidatingAdmissionPolicy is usually the cleanest first choice. It runs in-process in the API server, which avoids some of the operational overhead of external webhooks. A simple example, require owner metadata and block mutable latest tags:
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: deploy-hygiene.example.com
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: "has(object.metadata.labels) && has(object.metadata.labels.owner)"
      message: "Deployment must include metadata.labels.owner"
    - expression: "object.spec.template.spec.containers.all(c, !c.image.endsWith(':latest'))"
      message: "Pin image tags. ':latest' is not allowed"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deploy-hygiene-binding
spec:
  policyName: deploy-hygiene.example.com
  validationActions: [Warn, Audit]
  matchResources:
    namespaceSelector:
      matchLabels:
        policy-tier: standard
```
Notice the binding uses Warn and Audit first. This rollout pattern is safer than flipping to Deny on day one.
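Once warning volume in a namespace burns down to zero, the binding can be promoted in place rather than replaced. A minimal sketch, assuming the binding name from the example above:

```shell
# Promote the binding from advisory to enforcing; Deny rejects
# non-compliant requests at admission time instead of warning.
kubectl patch validatingadmissionpolicybinding deploy-hygiene-binding \
  --type=merge -p '{"spec":{"validationActions":["Deny"]}}'
```

Because the patch is a single field change, rolling back to Warn/Audit during an incident is equally fast.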
Tradeoff: CEL-based rules are great for object-level assertions. For very complex cross-resource logic or external context lookups, you may still want a dedicated engine.
Layer 3, Gatekeeper for audit depth and policy libraries
Gatekeeper remains useful when platform teams want reusable policy templates, broad audit visibility, and policy-as-code workflows across clusters. Its audit mode helps find existing violations that admission-only checks cannot retroactively catch.
If your organization already uses policy repositories, Gatekeeper fits well into a central governance model. If your teams are early in policy maturity, start with PSA + ValidatingAdmissionPolicy first, then add Gatekeeper where you need richer policy lifecycle tooling.
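For concreteness, here is what a Layer 3 policy looks like in Gatekeeper terms: a reusable template plus a constraint instance, kept in advisory mode during rollout. This is a hedged sketch, assuming the Gatekeeper CRDs are installed; the template name and the owner label parameter are illustrative:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels
        violation[{"msg": msg}] {
          required := input.parameters.labels[_]
          not input.review.object.metadata.labels[required]
          msg := sprintf("missing required label: %v", [required])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner
spec:
  enforcementAction: warn   # advisory during rollout; audit still records violations
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels: ["owner"]
```

The same constraint shows up in Gatekeeper's periodic audit results, which is the retroactive visibility that inline-only checks cannot provide.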
This layered approach also pairs nicely with host hardening discipline outside Kubernetes. For example, service-level hardening remains essential even with strong cluster policy controls, as we covered in systemd service hardening for Linux teams.
Rollout strategy that minimizes developer pain
- Inventory before enforcement. Run PSA in warn/audit and Gatekeeper audit mode first.
- Define exceptions with expiration. Temporary waivers should include an owner and sunset date.
- Gate only high-risk controls initially. Block privilege escalation and unsafe networking first, advisory for style rules.
- Version policy changes. Treat policy bundles like application releases, with changelogs and rollback plans.
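The inventory step above can be scripted before any label changes. The API server supports a server-side dry run of the enforce label that reports which existing pods would violate a level, without changing anything; the payments namespace here is carried over from the earlier example:

```shell
# Dry-run the enforce label: the API server emits a warning for every
# pod that would violate "restricted", and the namespace is untouched.
kubectl label --dry-run=server --overwrite ns payments \
  pod-security.kubernetes.io/enforce=restricted
```

Run this per namespace to build the violation burn-down list before anyone flips a label for real.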
Think of this as reliability engineering for guardrails. We used a similar “progressively tighten after observability” pattern in our zero-trust hardening blueprint and in release process hardening on WordPress deployments (safe release playbook).
Where this model can backfire (and how to prevent it)
The failure mode is not “too little policy,” it is policy without product thinking. If platform teams ship dense rule bundles with no migration guide, app teams will route around guardrails. If every denied request requires a Slack escalation, deployment velocity drops and trust collapses.
Keep the social contract clear: policy owners must publish examples, common failure messages, and fix paths. Application teams must treat warnings as sprint work, not background noise. A practical target is to reduce warning volume weekly and move only clean namespaces to hard enforcement. This creates predictable pressure instead of surprise outages caused by policy flips.
Troubleshooting: common admission-control failures and fixes
1) “It worked yesterday, now deploys are denied”
Likely cause: A policy moved from Warn to Deny, or namespace labels changed.
Fix: Check namespace PSA labels and recent admission policy/binding changes in Git history. Reproduce with kubectl apply --server-side --dry-run=server -f ... to surface admission messages quickly.
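A short triage sequence for that situation, assuming the rejected manifest lives in deploy.yaml (illustrative filename) and the payments namespace from earlier:

```shell
# 1. Reproduce the denial without mutating the cluster
kubectl apply --server-side --dry-run=server -f deploy.yaml

# 2. Inspect the namespace's current PSA labels
kubectl get ns payments -o jsonpath='{.metadata.labels}'

# 3. List admission policies and bindings to spot a recent change
kubectl get validatingadmissionpolicy,validatingadmissionpolicybinding
```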
2) Too much noise from warnings, teams start ignoring them
Likely cause: Low-signal rules in warn mode with no ownership model.
Fix: Tag every rule by severity (critical/high/medium). Only keep high-signal policy in warning channels used by developers. Move informational checks to periodic compliance reports.
3) Policy passes in one namespace but fails in another
Likely cause: Different namespace selectors, PSA labels, or binding scopes.
Fix: Audit ValidatingAdmissionPolicyBinding.matchResources, namespace labels, and any exempted runtime classes. Keep a small command cheat sheet in your runbook to diff labels and bindings cluster-wide.
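A starting point for that cheat sheet, using only built-in kubectl output options (the policy-tier label matches the binding example earlier in this post):

```shell
# Show PSA posture and policy tier across all namespaces in one table
kubectl get ns -L policy-tier \
  -L pod-security.kubernetes.io/enforce \
  -L pod-security.kubernetes.io/warn

# Show which bindings exist, which policy each points at, and its actions
kubectl get validatingadmissionpolicybinding \
  -o custom-columns='NAME:.metadata.name,POLICY:.spec.policyName,ACTIONS:.spec.validationActions'
```

Diffing this output between two clusters usually explains "passes here, fails there" in minutes.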
FAQ
Should I choose ValidatingAdmissionPolicy or Gatekeeper?
Start with ValidatingAdmissionPolicy for straightforward admission checks because it is built-in and operationally lighter. Add Gatekeeper when you need richer policy libraries, broader audit workflows, and cross-team policy lifecycle controls.
Can Pod Security Admission replace custom policy engines?
Not fully. PSA is excellent for baseline/restricted pod hardening levels, but it does not cover every custom organizational rule. Use PSA as your default floor, then layer custom admission rules on top.
How do we avoid blocking delivery during rollout?
Use a phased model: warn and audit first, track violation burn-down per team, then enforce a minimal high-risk subset. Publish timelines and exception procedures early so teams can plan changes without release panic.
Actionable takeaways for this week
- Label one non-critical namespace with PSA warn and audit set to restricted, then review violations for a week.
- Implement one ValidatingAdmissionPolicy that blocks :latest image tags in production namespaces.
- Define a policy rollout ladder: Audit → Warn → Deny, with explicit exit criteria.
- Create a short exception template (owner, reason, expiry date) and require it for every waiver.
- Add admission failure triage steps to your incident runbook so on-call engineers can unblock safely.
If you do only one thing, do this: stop treating admission policy as a compliance side quest. Treat it as part of deployment quality, the same way you treat tests and observability. Your future incident timeline will be shorter, and your team will trust deploys more.
