Cloud in 2026: Deploy Cost-Efficient Kubernetes AI Inference on AWS EKS with Karpenter and Spot

Running AI inference in production is no longer just a model problem; it is a cloud cost and reliability problem. In 2026, many teams are paying 2x to 4x more than necessary because their Kubernetes clusters are overprovisioned, slow to scale, or locked to on-demand nodes. In this guide, you will build a practical AWS EKS setup that scales inference workloads with Karpenter, prioritizes Spot capacity safely, and keeps latency predictable with disruption budgets and topology-aware scheduling.

What we are building

We will deploy a small inference API on EKS and configure node autoscaling with Karpenter so that:

  • CPU and GPU workloads can scale independently
  • Spot nodes are preferred for lower cost
  • On-demand fallback protects availability
  • Workloads stay resilient during interruptions

This pattern works for LLM gateway services, embedding APIs, rerankers, and batch inference workers.

Architecture overview

  • EKS cluster with multiple AZ subnets
  • Karpenter for fast, pod-driven node provisioning
  • NodePool + EC2NodeClass for Spot-first policies
  • Inference Deployment with HPA and PDB
  • Service + Gateway/Ingress for traffic entry

Step 1: Create an EKS cluster baseline

You can use Terraform, eksctl, or your platform team's module. The only strict requirement is tagging subnets and security groups for Karpenter discovery.

# Required subnet tags for Karpenter discovery
kubernetes.io/cluster/prod-ml = shared
karpenter.sh/discovery = prod-ml

# Security group tag
karpenter.sh/discovery = prod-ml

Install Karpenter via Helm (version pinned for reproducibility):

# Karpenter v1 charts are published to the public ECR OCI registry
# (the legacy charts.karpenter.sh repo only hosts pre-v1 charts)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.0.6" \
  --namespace karpenter --create-namespace \
  --set settings.clusterName=prod-ml \
  --set settings.interruptionQueue=prod-ml-karpenter-int \
  --set controller.resources.requests.cpu=500m \
  --set controller.resources.requests.memory=512Mi

Step 2: Define Spot-first capacity with safe fallback

In 2026, the best default for stateless inference is usually mixed capacity: Spot preferred, on-demand allowed. When both capacity types are allowed, Karpenter prefers Spot and falls back to on-demand when Spot capacity is unavailable. Use a NodePool with requirements that keep instance selection broad to reduce allocation failures.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inference-pool
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 120s
  template:
    metadata:
      labels:
        workload: inference
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c7i.large", "c7i.xlarge", "m7i.large", "m7i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inference-class
      expireAfter: 168h
  limits:
    cpu: "200"
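
To see why Spot-first matters for the bill, here is a rough blended-cost sketch in Python. The prices, discount, and Spot share are made-up illustrative numbers, not real AWS pricing:

```python
def blended_hourly_cost(on_demand_price, spot_discount, spot_share, node_count):
    """Estimate blended fleet cost per hour when part of the fleet runs on Spot.

    on_demand_price: hourly on-demand price per node (illustrative)
    spot_discount:   fractional discount vs on-demand (0.70 = 70% cheaper)
    spot_share:      fraction of nodes expected to land on Spot capacity
    """
    spot_price = on_demand_price * (1 - spot_discount)
    blended = spot_share * spot_price + (1 - spot_share) * on_demand_price
    return blended * node_count

# Illustrative: 20 nodes, $0.17/h on-demand, 70% Spot discount, 80% on Spot
all_on_demand = blended_hourly_cost(0.17, 0.0, 0.0, 20)
mixed = blended_hourly_cost(0.17, 0.70, 0.80, 20)
print(f"all on-demand: ${all_on_demand:.2f}/h, mixed: ${mixed:.2f}/h")
```

Even with an on-demand fallback absorbing 20 percent of capacity, the blended cost drops by more than half in this toy example.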

Back it with an EC2NodeClass:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: inference-class
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-prod-ml
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-ml
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-ml
  tags:
    Environment: production
    Team: ml-platform

Step 3: Deploy an inference API with resilient scheduling

Now deploy a simple FastAPI inference container. Important points:

  1. Use topologySpreadConstraints so replicas spread across zones.
  2. Add a PodDisruptionBudget so voluntary evictions do not drop all replicas.
  3. Set realistic requests and limits to help Karpenter make good packing decisions.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      # Give in-flight requests time to drain on Spot interruption
      terminationGracePeriodSeconds: 60
      nodeSelector:
        workload: inference
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-api
      containers:
        - name: api
          image: ghcr.io/example/inference-api:2026.04
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-api

Step 4: Add autoscaling based on real demand

CPU-based HPA is acceptable for many services, but inference systems often correlate better with queue depth, in-flight requests, or token throughput. Start simple and evolve to custom metrics.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 180
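
When you outgrow CPU-based scaling, a Pods-type custom metric is the usual next step. The sketch below assumes you already run a custom metrics adapter (for example Prometheus Adapter) that exposes a per-pod metric; the metric name `inference_inflight_requests` and the target value are hypothetical:

```yaml
# Hypothetical custom-metric HPA; requires a custom metrics adapter
# exposing inference_inflight_requests per pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_inflight_requests
        target:
          type: AverageValue
          averageValue: "8"
```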

Step 5: Handle Spot interruptions gracefully

Spot savings are substantial (often 60 to 90 percent below on-demand prices), but interruption handling is non-negotiable: AWS gives roughly a two-minute interruption notice. Karpenter can react quickly, but your app should still be interruption-aware:

  • Use terminationGracePeriodSeconds long enough to drain in-flight requests.
  • Keep startup time low (image size, lazy model loading, warm pools if needed).
  • Expose readiness probes that fail fast during shutdown.
  • Keep at least one replica per AZ when possible.

Example FastAPI shutdown hook

from fastapi import FastAPI, Response
import signal

app = FastAPI()
shutting_down = False

def handle_term(*_):
    global shutting_down
    shutting_down = True

# Note: servers like uvicorn install their own SIGTERM handling;
# wire this into your server's shutdown hook in production.
signal.signal(signal.SIGTERM, handle_term)

@app.get("/ready")
def ready(response: Response):
    if shutting_down:
        # Return 503 so the kubelet readiness probe actually fails;
        # a 200 body with {"ready": false} would still count as ready.
        response.status_code = 503
    return {"ready": not shutting_down}

Cost and performance guardrails you should add

  • Budget alerts: per namespace and per cluster using AWS Budgets + Cost Anomaly Detection.
  • SLOs: p95 latency and error rate per model route.
  • Right-sizing: weekly recommendation job from Prometheus usage histograms.
  • Bin-packing review: keep request-to-limit ratios sane to avoid waste.
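
The right-sizing job can start very small: pull recent per-pod CPU usage samples, take a high percentile, and add headroom. A toy sketch of that recommendation logic (the percentile and headroom values are assumptions to tune):

```python
import math

def recommend_request(usage_samples_millicores, percentile=0.90, headroom=1.2):
    """Recommend a CPU request (millicores) from observed usage samples.

    Picks the nearest-rank percentile of usage and multiplies by a
    headroom factor so normal bursts do not immediately throttle.
    """
    if not usage_samples_millicores:
        raise ValueError("no samples")
    ordered = sorted(usage_samples_millicores)
    rank = max(0, math.ceil(percentile * len(ordered)) - 1)
    return int(ordered[rank] * headroom)

samples = [220, 250, 180, 400, 310, 260, 290, 240, 500, 270]
print(recommend_request(samples))
```

Feed the result back into the Deployment's requests on a weekly cadence, and alert when the recommendation diverges sharply from what is deployed.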

Common mistakes in 2026 cloud inference stacks

  • Using only on-demand nodes by default, then complaining about cloud costs.
  • Too few instance families, causing Spot allocation failures.
  • No disruption budget, so routine maintenance causes visible downtime.
  • Scaling on CPU only, while request queue backlog explodes.
  • Ignoring cross-AZ spread and creating hidden single-zone risk.

Final checklist

  1. Spot-first NodePool with on-demand fallback configured
  2. PDB and topology spread enabled for inference deployments
  3. HPA tuned for traffic shape and cold-start characteristics
  4. Interruption and shutdown behavior tested in staging
  5. Cost and latency SLO dashboards reviewed weekly

If you implement this baseline, you will usually get the fastest path to lower inference cost without sacrificing reliability. From here, you can extend the same pattern to GPU node pools, model-specific routing, and workload isolation by tenant or priority class.
