At 2:17 AM, our pager went off for the wrong reason. Not a CPU spike, not a bad deploy, not a noisy alert. A former contractor's account had just logged into a production EC2 instance. The person had left weeks earlier, but their SSH key still lived in a forgotten authorized_keys file on one old box in a private subnet.
That incident changed how we handled server access. We stopped treating SSH as a static key-distribution problem and moved to OpenSSH user certificates with short lifetimes. The shift was not glamorous, but it gave us one thing static keys never did: a reliable expiry boundary.
If your team is still managing long-lived keys manually, this playbook walks through a practical way to move to ephemeral SSH access on AWS while keeping operations sane.
Why static SSH keys keep hurting fast-moving teams
Static keys fail in predictable ways:
- Keys are copied to many hosts and never fully removed.
- Ownership becomes unclear when engineers rotate teams.
- Emergency revocations become a host-by-host cleanup exercise.
- Audit trails show a Unix user, not the human identity behind access.
If this sounds familiar, you might also relate to the broader identity-boundary issues in our earlier cloud post on privacy-preserving access control.
What changes with OpenSSH certificates
OpenSSH certificates are not X.509 TLS certs. They are a lighter SSH-native format signed by a trusted CA key. According to the OpenSSH manual, you can sign user keys with principals and explicit validity windows, including very short TTLs.
In practice, this gives you:
- Time-bounded access (for example, 10 to 60 minutes).
- Central trust anchor via CA public key on servers.
- Cleaner deprovisioning, because most access naturally expires.
- Safer role mapping through principals like prod-readonly or db-admin.
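On servers, TrustedUserCAKeys and (optionally) AuthorizedPrincipalsFile are the important controls. TrustedUserCAKeys tells sshd which CA signatures to accept, and AuthorizedPrincipalsFile limits which certificate principals may log in to each account. This is where certificate authentication policy becomes enforceable.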
A minimal architecture that works in the real world
- CA keypair stored in a restricted signing service (or controlled bastion).
- Engineers keep their own SSH keypair locally.
- Short-lived certificate is issued after identity and policy checks.
- EC2 hosts trust only the CA public key, not individual user keys.
For hardened host baselines, pair this with controls from our Linux bastion runbook: hardened SSH bastion with FIDO2 and Fail2ban.
Server-side OpenSSH configuration
# /etc/ssh/sshd_config.d/90-user-certs.conf
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
# Trust your user CA public key
TrustedUserCAKeys /etc/ssh/ca/user_ca.pub
# Map certificate principals to allowed roles per account
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
# Optional hardening
PermitRootLogin no
AllowTcpForwarding no
X11Forwarding no
Example principals file for a Unix account:
# /etc/ssh/auth_principals/deploy
prod-deploy
prod-readonly
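To sanity-check a principals file before rollout, the matching rule can be approximated in a few lines of shell. This is a minimal sketch, not sshd's own logic: it assumes one principal per line, skips comments and blank lines, and ignores the optional per-line options that sshd also accepts. The function name principal_allowed is hypothetical.

```shell
# Sketch: succeed if PRINCIPAL appears as a whole line in PRINCIPALS_FILE.
# Assumption: one principal per line, no leading options (sshd allows more).
principal_allowed() {
    principals_file="$1"
    principal="$2"
    # Drop comments and blank lines, then require an exact full-line match.
    grep -vE '^[[:space:]]*(#|$)' "$principals_file" | grep -Fxq -- "$principal"
}

# Example (path is illustrative):
#   principal_allowed /etc/ssh/auth_principals/deploy prod-deploy && echo allowed
```

A certificate carrying only db-admin would fail this check for the deploy account, which is the behavior the principals file is meant to enforce.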
Issue a short-lived user certificate
#!/usr/bin/env bash
set -euo pipefail
USER_KEY_PUB="$1" # e.g. ~/.ssh/id_ed25519.pub
IDENTITY_EMAIL="$2" # e.g. engineer@company.com
ROLE_PRINCIPAL="$3" # e.g. prod-deploy
SOURCE_CIDR="$4" # e.g. 203.0.113.0/24
CA_KEY="/secure/ca/user_ca" # private CA key
VALIDITY="+30m"
KEY_ID="${IDENTITY_EMAIL}-$(date -u +%Y%m%dT%H%M%SZ)"
ssh-keygen -s "$CA_KEY" \
  -I "$KEY_ID" \
  -n "$ROLE_PRINCIPAL" \
  -V "$VALIDITY" \
  -O source-address="$SOURCE_CIDR" \
  "$USER_KEY_PUB"
echo "Issued cert: ${USER_KEY_PUB%.pub}-cert.pub"
This pattern uses only standard ssh-keygen certificate flags: -I sets the key ID for auditability, -n selects the principal, -V sets the validity window, and -O source-address restricts where the certificate can be used from.
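Once the certificate exists, the client has to present it alongside the private key. ssh picks up a file named <key>-cert.pub next to the identity file on its own, but spelling it out with CertificateFile makes the behavior explicit. The helper below is a hypothetical sketch that only prints the command so it can be reviewed before running; the login target is a placeholder, and paths containing spaces would need extra quoting.

```shell
# Sketch: print an ssh command that presents a key together with its certificate.
# ssh_cert_command is a hypothetical helper; it prints instead of executing.
ssh_cert_command() {
    key_pub="$1"                 # e.g. ~/.ssh/id_ed25519.pub
    login="$2"                   # e.g. deploy@10.0.12.34 (placeholder)
    key="${key_pub%.pub}"        # private key path
    cert="${key}-cert.pub"       # where ssh-keygen -s writes the certificate
    printf 'ssh -i %s -o CertificateFile=%s -o IdentitiesOnly=yes %s\n' \
        "$key" "$cert" "$login"
}

# Example:
#   ssh_cert_command ~/.ssh/id_ed25519.pub deploy@10.0.12.34
```

IdentitiesOnly=yes keeps ssh from offering unrelated agent keys first, which makes failed-login reasons easier to read during rollout.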
Where EC2 Instance Connect and Session Manager fit
AWS gives you other short-lived access paths too. EC2 Instance Connect can push a public key via API and keeps it in instance metadata for about 60 seconds. That is useful for controlled just-in-time key injection. Session Manager can remove inbound SSH requirements entirely and centralize access with IAM.
Tradeoff summary:
- OpenSSH certificates: portable across cloud/on-prem, strong SSH-native policy model, but you must operate a signing workflow.
- EC2 Instance Connect: low friction in AWS, great for just-in-time key push, but more AWS-coupled workflow.
- Session Manager: no inbound SSH port needed, central IAM controls, but shell UX and tooling expectations can differ from raw SSH workflows.
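For comparison, the two AWS-native paths can be sketched with the AWS CLI. The wrapper names push_temp_key and open_session are hypothetical, the instance ID and OS user are placeholders, and the sketch assumes the CLI is configured with suitable IAM permissions (plus the Session Manager plugin for the second call). Setting AWS_CLI=echo turns both into a dry run.

```shell
# AWS_CLI can be overridden (for example with "echo") to dry-run the calls.
AWS_CLI="${AWS_CLI:-aws}"

# EC2 Instance Connect: push a public key that stays usable for about 60 seconds.
push_temp_key() {
    instance_id="$1"
    os_user="$2"
    key_pub="$3"
    "$AWS_CLI" ec2-instance-connect send-ssh-public-key \
        --instance-id "$instance_id" \
        --instance-os-user "$os_user" \
        --ssh-public-key "file://$key_pub"
}

# Session Manager: open a shell without any inbound SSH port.
open_session() {
    "$AWS_CLI" ssm start-session --target "$1"
}

# Example (placeholder values):
#   push_temp_key i-0123456789abcdef0 ec2-user ~/.ssh/id_ed25519.pub
#   open_session i-0123456789abcdef0
```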
In mixed environments, teams often keep cert-based SSH for consistency and use Session Manager as break-glass or no-bastion path. For security process maturity, this complements principles from our security disclosure post: security.txt for WordPress.
A rollout path that does not break your on-call week
Teams get into trouble when they switch everything in one weekend. A safer pattern is staged rollout:
- Phase 1: Enable cert trust on a non-critical environment, keep existing static keys as fallback.
- Phase 2: Start issuing certs for humans first, with read-only principals and short TTL.
- Phase 3: Remove static keys from production accounts once sign-in metrics and incident drills look healthy.
Measure two things during rollout: median sign-in time and failed-login reasons. If cert issuance is too slow, engineers will route around the process. If principals are too broad, security wins disappear. Keep both usability and policy tight.
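Before Phase 3, it helps to know how many static keys are still out there. The function below is a rough sketch under simple assumptions: it looks only for files named authorized_keys under a given root and counts lines that are neither blank nor comments. The name inventory_static_keys is hypothetical, and a custom AuthorizedKeysFile location in sshd_config would not be covered.

```shell
# Sketch: list every authorized_keys file under ROOT with its key-line count.
# Output format: "<count> <path>", sorted by path.
inventory_static_keys() {
    root="$1"
    find "$root" -type f -name authorized_keys 2>/dev/null | sort |
    while IFS= read -r file; do
        # Count lines that are neither blank nor comments.
        count=$(grep -cvE '^[[:space:]]*(#|$)' "$file" || true)
        printf '%s %s\n' "$count" "$file"
    done
}

# Example:
#   inventory_static_keys /home
```

Tracking this count per host over time gives a simple signal for when static keys can be removed from production accounts.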
One practical tip: include certificate key_id fields that encode identity and ticket/context (for example, user + timestamp + change ID). That makes incident review much faster when you need to map access back to a human action.
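A key ID along those lines could be assembled as follows. This is a sketch: build_key_id is a hypothetical helper, the slash separator is an arbitrary choice, and the change ID format is whatever your ticketing system uses.

```shell
# Sketch: build a key ID that encodes who, why, and when.
build_key_id() {
    email="$1"
    change_id="$2"
    # Optional third argument pins the timestamp; defaults to current UTC time.
    issued_at="${3:-$(date -u +%Y%m%dT%H%M%SZ)}"
    printf '%s/%s/%s\n' "$email" "$change_id" "$issued_at"
}

# Example: pass the result to ssh-keygen -I when signing.
#   KEY_ID="$(build_key_id engineer@company.com CHG-1042)"
```

sshd includes the certificate key ID in its log line for an accepted certificate login, so the change ID ends up next to the login event.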
Troubleshooting: the failures you will probably hit first
1) “Permission denied (publickey)” even though cert exists
Usually one of these:
- The principal in the cert does not match AuthorizedPrincipalsFile.
- TrustedUserCAKeys points to the wrong CA public key.
- The cert has expired (short TTLs make clock drift visible fast).
Run ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub and verify principals + validity window.
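To script that check, the text output of ssh-keygen -L can be filtered with awk. The sketch below assumes the usual layout, where "Principals:" is followed by one indented principal per line up to "Critical Options:", and where the "Valid:" line ends with the expiry timestamp. The function names are hypothetical, and the output format is not a stable interface.

```shell
# Sketch: read `ssh-keygen -L` output on stdin and print one principal per line.
cert_principals() {
    awk '
        $1 == "Principals:"                     { in_block = 1; next }
        $1 == "Critical" && $2 == "Options:"    { in_block = 0 }
        in_block && NF                          { print $1 }
    '
}

# Sketch: print the last field of the "Valid:" line (the expiry, or "forever").
cert_valid_to() {
    awk '$1 == "Valid:" { print $NF }'
}

# Example:
#   ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub | cert_principals
#   ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub | cert_valid_to
```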
2) Access works on one host but not another
Config drift. One node has the updated CA file, another does not. Manage the CA public key and sshd snippets through config management, then reload sshd uniformly. This is the same “drift hurts reliability” pattern we discussed in our deadline propagation reliability runbook.
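A lightweight drift check compares the CA fingerprint each host reports against the one you expect. In this sketch, drifted_hosts is a hypothetical helper that reads "host fingerprint" pairs on stdin; how the pairs are collected (the hosts.txt inventory and the ssh loop in the comment) is an assumption about your environment.

```shell
# Sketch: print hosts whose CA fingerprint is missing or differs from EXPECTED.
# Input on stdin: one "host fingerprint" pair per line.
drifted_hosts() {
    expected="$1"
    awk -v want="$expected" 'NF >= 1 && (NF < 2 || $2 != want) { print $1 }'
}

# Collecting the pairs might look like this (hosts.txt is hypothetical):
#   want="$(ssh-keygen -l -f user_ca.pub | awk '{print $2}')"
#   while read -r h; do
#       fp="$(ssh -n "$h" ssh-keygen -l -f /etc/ssh/ca/user_ca.pub | awk '{print $2}')"
#       printf '%s %s\n' "$h" "$fp"
#   done < hosts.txt | drifted_hosts "$want"
```

Hosts that fail to report a fingerprint at all are flagged too, since an unreachable or misconfigured node is as suspect as one with a stale file.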
3) Emergency revocation still feels slow
Certificates shrink the blast radius through expiration, but they do not give you instant revocation by default. If you need kill-switch behavior, combine short TTLs with a rapid CA key rotation policy for high-risk roles, and plan for revocation lists (sshd's RevokedKeys directive accepts a key revocation list) for critical incidents.
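For the revocation-list part, OpenSSH ships a key revocation list (KRL) format that ssh-keygen can generate with -k and extend with -u. The wrapper below is a sketch: revoke_cert is a hypothetical name, the KRL path is illustrative, and distributing the file to every host is left to your config management.

```shell
# SSH_KEYGEN can be overridden (for example with "echo") to dry-run.
SSH_KEYGEN="${SSH_KEYGEN:-ssh-keygen}"

# Sketch: add a certificate (or public key) file to a KRL, creating it if needed.
revoke_cert() {
    krl="$1"
    cert="$2"
    if [ -f "$krl" ]; then
        "$SSH_KEYGEN" -k -u -f "$krl" "$cert"   # -u updates the existing KRL
    else
        "$SSH_KEYGEN" -k -f "$krl" "$cert"      # first entry creates the KRL
    fi
}

# sshd must be pointed at the file (illustrative path):
#   RevokedKeys /etc/ssh/revoked.krl
# To test whether a certificate is revoked:
#   ssh-keygen -Q -f /etc/ssh/revoked.krl ~/.ssh/id_ed25519-cert.pub
```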
FAQ
Do OpenSSH certificates replace MFA?
No. They solve key lifecycle and authorization scope, not your full identity assurance story. Keep MFA in your identity provider or issuance workflow before signing certs.
What certificate lifetime should we start with?
Start with 30 to 60 minutes for human access and shorter windows for privileged roles. Too short creates operational friction; too long recreates static-key risk.
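One way to keep lifetimes consistent is to derive them from the principal instead of letting callers choose. The mapping below is purely illustrative: the suffix conventions (-admin, -readonly) and the specific durations are assumptions, and validity_for_principal is a hypothetical helper.

```shell
# Sketch: pick a certificate validity window from the requested principal.
validity_for_principal() {
    case "$1" in
        *-admin)    echo "+10m" ;;   # privileged roles get the shortest window
        *-readonly) echo "+60m" ;;   # low-risk roles can live longer
        *)          echo "+30m" ;;   # default for everything else
    esac
}

# Example: feed the result to ssh-keygen -V when signing.
#   VALIDITY="$(validity_for_principal "$ROLE_PRINCIPAL")"
```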
Can we run this without a full CA platform on day one?
Yes. Many teams begin with a tightly controlled signing host and scripted issuance, then move to a managed CA workflow (for example, OIDC-backed provisioning) once policy and audit requirements grow.
Actionable takeaways
- Move server trust from per-user keys to one CA public key with TrustedUserCAKeys.
- Use role-based principals and keep them explicit per Unix account.
- Issue short-lived certs with key IDs that map to real human identity.
- Automate CA key distribution checks to avoid host-level policy drift.
- Document when to use OpenSSH certificates vs EC2 Instance Connect vs Session Manager.
If you are reviving old infrastructure and inherited SSH key sprawl, this migration is one of the highest-leverage security upgrades you can make without redesigning your whole platform.
