Running AI inference in production is no longer just a model problem; it is a cloud cost and reliability problem. In 2026, many teams are paying 2x to 4x more than necessary because their Kubernetes clusters are overprovisioned, slow to scale, or locked to on-demand nodes. In this guide, you will build a practical AWS EKS setup that scales inference workloads with Karpenter, prioritizes Spot capacity safely, and keeps latency predictable with disruption budgets and topology-aware scheduling.
What we are building
We will deploy a small inference API on EKS and configure node autoscaling with Karpenter so that:
- CPU and GPU workloads can scale independently
- Spot nodes are preferred for lower cost
- On-demand fallback protects availability
- Workloads stay resilient during interruptions
This pattern works for LLM gateway services, embedding APIs, rerankers, and batch inference workers.
Architecture overview
- EKS cluster with multiple AZ subnets
- Karpenter for fast, pod-driven node provisioning
- NodePool + EC2NodeClass for Spot-first policies
- Inference Deployment with HPA and PDB
- Service + Gateway/Ingress for traffic entry
Step 1: Create an EKS cluster baseline
You can use Terraform, eksctl, or your platform team's module. The only strict requirement is tagging subnets and security groups so Karpenter can discover them.
# Required subnet tags for Karpenter discovery
kubernetes.io/cluster/prod-ml = shared
karpenter.sh/discovery = prod-ml
# Security group tag
karpenter.sh/discovery = prod-ml
Install Karpenter via Helm, pinning the chart version for reproducibility. Note that recent Karpenter releases ship as an OCI chart; the old charts.karpenter.sh repository is deprecated:
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.1.1" \
  --namespace karpenter --create-namespace \
  --set settings.clusterName=prod-ml \
  --set settings.interruptionQueue=prod-ml-karpenter-int \
  --set controller.resources.requests.cpu=500m \
  --set controller.resources.requests.memory=512Mi
Step 2: Define Spot-first capacity with safe fallback
In 2026, the best default for stateless inference is usually mixed capacity: Spot preferred, on-demand allowed. Use a NodePool with requirements that keep instance selection broad to reduce allocation failures.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inference-pool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 120s
  template:
    metadata:
      labels:
        workload: inference
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7i.large", "c7i.xlarge", "m7i.large", "m7i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inference-class
      expireAfter: 168h
  limits:
    cpu: "200"
Back it with an EC2NodeClass:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: inference-class
spec:
  amiSelectorTerms:
    - alias: al2023@latest  # AL2023 family; required in the v1 API
  role: KarpenterNodeRole-prod-ml
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-ml
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-ml
  tags:
    Environment: production
    Team: ml-platform
Step 3: Deploy an inference API with resilient scheduling
Now deploy a simple FastAPI inference container. Important points:
- Use `topologySpreadConstraints` so replicas spread across zones.
- Add a `PodDisruptionBudget` so voluntary evictions do not drop all replicas.
- Set realistic requests and limits to help Karpenter make good packing decisions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      nodeSelector:
        workload: inference
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-api
      containers:
        - name: api
          image: ghcr.io/example/inference-api:2026.04
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-api
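The interaction between replica count and the disruption budget is worth spelling out: with 3 replicas and minAvailable: 2, only one pod may be evicted voluntarily at a time. A minimal sketch of that arithmetic (the helper name is illustrative, not a Kubernetes API):

```python
def allowed_disruptions(replicas: int, min_available: int) -> int:
    """How many pods a PDB with minAvailable permits to evict at once."""
    return max(0, replicas - min_available)

# With the manifests above: 3 replicas, minAvailable 2
print(allowed_disruptions(3, 2))  # -> 1
```

If you later raise minReplicas, revisit minAvailable too, so the budget still leaves Karpenter room to consolidate nodes.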
Step 4: Add autoscaling based on real demand
CPU-based HPA is acceptable for many services, but inference systems often correlate better with queue depth, in-flight requests, or token throughput. Start simple and evolve to custom metrics.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 180
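When you outgrow CPU, an in-flight-request (or queue-depth) gauge is the usual first custom metric. A minimal, library-free sketch of the tracking side (in production you would export this through a metrics library such as prometheus_client plus an HPA metrics adapter; the class and names here are assumptions):

```python
import threading

class InFlightGauge:
    """Thread-safe count of requests currently being processed."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def __enter__(self):
        with self._lock:
            self._value += 1
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._value -= 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

in_flight = InFlightGauge()

def handle_request():
    # Wrap each request so the gauge reflects live concurrency
    with in_flight:
        return "ok"  # model inference would run here
```

Scaling on this kind of signal reacts to backlog directly instead of waiting for CPU utilization to catch up.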
Step 5: Handle Spot interruptions gracefully
Spot savings are huge, but interruption handling is non-negotiable. Karpenter can react quickly, but your app should still be interruption-aware:
- Use a `terminationGracePeriodSeconds` long enough to drain in-flight requests.
- Keep startup time low (image size, lazy model loading, warm pools if needed).
- Expose readiness probes that fail fast during shutdown.
- Keep at least one replica per AZ when possible.
Example FastAPI shutdown hook
from fastapi import FastAPI, Response
import signal

app = FastAPI()
shutting_down = False

def handle_term(*_):
    # Flip the flag on SIGTERM so the readiness probe starts failing
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_term)

@app.get("/ready")
def ready():
    if shutting_down:
        # 503 makes the kubelet mark the pod unready and stop routing traffic;
        # a 200 with {"ready": false} would never actually fail the probe
        return Response(status_code=503)
    return {"ready": True}
Cost and performance guardrails you should add
- Budget alerts: per namespace and per cluster using AWS Budgets + Cost Anomaly Detection.
- SLOs: p95 latency and error rate per model route.
- Right-sizing: weekly recommendation job from Prometheus usage histograms.
- Bin-packing review: keep request-to-limit ratios sane to avoid waste.
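The right-sizing bullet can be made concrete: take a high percentile of observed usage and pad it with headroom. A simplified sketch of the recommendation math (the percentile and headroom factor are assumptions to tune against your own histograms):

```python
def recommend_request(usage_samples: list, percentile: float = 0.9,
                      headroom: float = 1.2) -> float:
    """Suggest a resource request in the same unit as the samples
    (e.g. CPU millicores): a high percentile of usage, plus headroom."""
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    idx = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[idx] * headroom

# Nine quiet samples and one spike (millicores)
print(recommend_request([100] * 9 + [500]))  # -> 600.0
```

Running this weekly per container and diffing against the deployed requests is a cheap way to catch both waste and under-provisioning.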
Common mistakes in 2026 cloud inference stacks
- Using only on-demand nodes by default, then complaining about cloud costs.
- Too few instance families, causing Spot allocation failures.
- No disruption budget, so routine maintenance causes visible downtime.
- Scaling on CPU only, while request queue backlog explodes.
- Ignoring cross-AZ spread and creating hidden single-zone risk.
Final checklist
- Spot-first NodePool with on-demand fallback configured
- PDB and topology spread enabled for inference deployments
- HPA tuned for traffic shape and cold-start characteristics
- Interruption and shutdown behavior tested in staging
- Cost and latency SLO dashboards reviewed weekly
This baseline is usually the fastest path to lower inference cost without sacrificing reliability. From here, you can extend the same pattern to GPU node pools, model-specific routing, and workload isolation by tenant or priority class.
