
Kubernetes Deployment of Claude Models: Scaling Inference for Enterprise

Claude Directory January 13, 2026


Unlock enterprise-scale Claude AI inference with Kubernetes. Deploy Sonnet or Haiku-powered services using Helm charts, autoscaling, and monitoring for reliable, high-throughput production workloads.

## Introduction

Enterprises adopting Claude models such as Claude 3.5 Sonnet or Claude 3 Haiku face a common challenge: scaling inference requests reliably while managing cost, latency, and availability. Claude models are served through Anthropic's cloud API, so self-hosting the models isn't possible. Instead, deploy a scalable **inference gateway service** on Kubernetes that proxies requests to the Claude API and handles retries, caching, rate limiting, and observability.

This tutorial walks step by step through deploying such a service using FastAPI, the Anthropic Python SDK, Docker, Kubernetes manifests, and a custom Helm chart. We'll cover autoscaling with HPA, monitoring with Prometheus/Grafana, and enterprise best practices. By the end, you'll have a gateway that scales horizontally in front of Claude Sonnet or Haiku; end-to-end throughput stays bounded by the rate limits on your Anthropic account.

**Why Kubernetes for Claude inference?**

- **Horizontal scaling**: Auto-scale pods based on traffic.
- **High availability**: Multi-zone replicas, rolling updates.
- **Cost optimization**: Spot instances, resource limits.
- **Observability**: Integrated metrics for API usage and latency.

## Prerequisites

Before starting:

- A Kubernetes cluster (EKS, GKE, AKS, or self-managed, v1.28+).
- `kubectl`, `helm` (v3.14+), and `docker` installed.
- An Anthropic API key (get one at [console.anthropic.com](https://console.anthropic.com)).
- Python 3.11+ and `pip` for local development.
- Basic familiarity with YAML and Docker.

Install the Python dependencies for local development:

```bash
pip install anthropic fastapi "uvicorn[standard]" python-dotenv
```

## Step 1: Build the Claude Inference Service

Create a lightweight FastAPI service that accepts inference requests and forwards them to the Claude API. This acts as your enterprise proxy.
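Before wiring up the real service, the gateway idea can be sketched without any framework: answer from a cache when possible, otherwise call the upstream API and retry transient failures with exponential backoff. This is a minimal illustration, not the tutorial's implementation. `GatewaySketch`, `TransientUpstreamError`, and `flaky_upstream` are hypothetical names, the upstream call is simulated, and rate limiting is left out for brevity.

```python
import time
from typing import Callable, Dict, Tuple


class TransientUpstreamError(Exception):
    """Hypothetical stand-in for a 429/5xx response from the upstream API."""


class GatewaySketch:
    """Toy gateway: serve from a TTL cache, else call upstream with retries."""

    def __init__(self, call_upstream: Callable[[str], str],
                 ttl_seconds: float = 300.0, max_attempts: int = 3,
                 base_delay: float = 0.5):
        self.call_upstream = call_upstream
        self.ttl_seconds = ttl_seconds
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self._cache: Dict[str, Tuple[float, str]] = {}

    def infer(self, prompt: str) -> str:
        # 1. Cache lookup: identical prompts within the TTL skip the upstream call.
        hit = self._cache.get(prompt)
        if hit is not None and time.monotonic() - hit[0] < self.ttl_seconds:
            return hit[1]
        # 2. Upstream call with exponential backoff on transient errors.
        for attempt in range(1, self.max_attempts + 1):
            try:
                answer = self.call_upstream(prompt)
                break
            except TransientUpstreamError:
                if attempt == self.max_attempts:
                    raise
                time.sleep(self.base_delay * 2 ** (attempt - 1))
        # 3. Store the fresh answer for later requests.
        self._cache[prompt] = (time.monotonic(), answer)
        return answer


calls = {"n": 0}


def flaky_upstream(prompt: str) -> str:
    """Simulated upstream: fails once, then echoes the prompt."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientUpstreamError("simulated 429")
    return f"echo: {prompt}"


gateway = GatewaySketch(flaky_upstream, base_delay=0.0)
print(gateway.infer("Hello"))  # first attempt fails, the retry succeeds
print(gateway.infer("Hello"))  # served from the cache
print(calls["n"])              # upstream was called twice in total
```

In the deployed service these concerns would more likely be handled by shared infrastructure (for example Redis for the cache), since an in-process dictionary is not shared across pods.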
Create `app.py`:

```python
import os

import anthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Claude Inference Gateway")

# Async client so a slow upstream call does not block the event loop.
client = anthropic.AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


class InferenceRequest(BaseModel):
    model: str = "claude-3-5-sonnet-20240620"  # Or "claude-3-haiku-20240307"
    prompt: str
    max_tokens: int = 1024
    temperature: float = 0.7


@app.post("/infer")
async def infer(request: InferenceRequest):
    try:
        response = await client.messages.create(
            model=request.model,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            messages=[{"role": "user", "content": request.prompt}],
        )
    except anthropic.APIStatusError as e:
        # Pass the upstream HTTP status (e.g. 429) through to the caller.
        raise HTTPException(status_code=e.status_code, detail=e.message)
    except anthropic.APIError as e:
        # Connection failures and timeouts carry no upstream status code.
        raise HTTPException(status_code=502, detail=str(e))
    return {"content": response.content[0].text}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```

**Key features**:

- Supports Sonnet or Haiku via the `model` param.
- Pydantic validation for requests.
- Error handling that surfaces upstream API failures to the caller.

Test locally:

```bash
export ANTHROPIC_API_KEY=your_key_here
uvicorn app:app --reload

# In a second terminal:
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, Claude!"}'
```

## Step 2: Dockerize the Service

Create `Dockerfile`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENV PORT=8000
EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

`requirements.txt`:

```
anthropic
fastapi
uvicorn[standard]
pydantic
python-dotenv
```

Build and push:

```bash
docker build -t your-repo/claude-inference:latest .
docker push your-repo/claude-inference:latest
```

## Step 3: Kubernetes Deployment with Manifests

Start with basic YAML for quick deployment.
`deployment.yaml`:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: claude-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: claude-inference
  template:
    metadata:
      labels:
        app: claude-inference
    spec:
      containers:
        - name: inference
          image: your-repo/claude-inference:latest
          ports:
            - containerPort: 8000
          env:
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: claude-secrets
                  key: api-key
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
```

Create the secret that the Deployment references:

```bash
kubectl create secret generic claude-secrets --from-literal=api-key=your_key_here
```

`service.yaml`:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: claude-inference
spec:
  selector:
    app: claude-inference
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
```

Deploy:

```bash
kubectl apply -f deployment.yaml -f service.yaml
kubectl port-forward svc/claude-inference 8000:80
```

Test: `curl -X POST http://localhost:8000/infer ...`

## Step 4: Helm Chart for Reusable Deployment

Helm simplifies scaling and config management. Generate a chart:

```bash
helm create claude-inference
```

Edit the key files, starting with `values.yaml`:

```yaml
replicaCount: 3

image:
  repository: your-repo/claude-inference
  tag: "latest"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

env:
  ANTHROPIC_MODEL: "claude-3-5-sonnet-20240620"

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
```

Update `templates/deployment.yaml` so that the env vars and the secret reference come from `values.yaml`. The scaffold generated by `helm create` targets an nginx container, so also check that the container port (8000) and the probe paths match this service.
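The `ANTHROPIC_MODEL` value in `values.yaml` only has an effect if the application reads it. Below is a minimal sketch of how the service could resolve its model, assuming the chart injects the variable into the container under the same name. `resolve_model`, `ALLOWED_MODELS`, and `FALLBACK_MODEL` are hypothetical helpers, not part of the `app.py` shown earlier.

```python
import os
from typing import Mapping, Optional

# Model IDs used elsewhere in this tutorial; extend the set as needed.
ALLOWED_MODELS = {"claude-3-5-sonnet-20240620", "claude-3-haiku-20240307"}
FALLBACK_MODEL = "claude-3-5-sonnet-20240620"


def resolve_model(requested: Optional[str] = None,
                  env: Mapping[str, str] = os.environ) -> str:
    """Pick the model: explicit request > ANTHROPIC_MODEL env var > fallback."""
    model = requested or env.get("ANTHROPIC_MODEL") or FALLBACK_MODEL
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model {model!r} is not in the allow-list")
    return model


# The env var set by the chart becomes the default...
print(resolve_model(env={"ANTHROPIC_MODEL": "claude-3-haiku-20240307"}))
# -> claude-3-haiku-20240307

# ...while an explicit per-request model still wins.
print(resolve_model("claude-3-5-sonnet-20240620", env={}))
# -> claude-3-5-sonnet-20240620
```

Keeping an allow-list in the gateway is one way to stop callers from selecting models the team has not budgeted for; whether that restriction is wanted is a design choice.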
Add a secret template in `templates/secret.yaml`:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-secrets
type: Opaque
data:
  api-key: {{ .Values.apiKey | b64enc | quote }}
```

Install:

```bash
helm install claude-inference ./claude-inference --set apiKey=your_key_here

# For later upgrades, values-prod.yaml should also supply apiKey (or pass --set apiKey again).
helm upgrade --install claude-inference ./claude-inference --values values-prod.yaml
```

## Step 5: Autoscaling with Horizontal Pod Autoscaler (HPA)

Enable metric-based scaling for traffic spikes.

`hpa.yaml`:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: claude-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: claude-inference
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Apply: `kubectl apply -f hpa.yaml`

The gateway spends most of its time waiting on the upstream API, so CPU can be a weak signal of load. For more advanced setups, use custom metrics (e.g., RPS via Prometheus Adapter) to scale on Claude-specific load such as tokens/sec.

**Pro tip**: Integrate KEDA for event-driven scaling on queue length (e.g., Kafka for inference jobs).

## Step 6: Monitoring and Observability

Use Prometheus for metrics and Grafana for dashboards.

1. Install the Prometheus Operator stack via Helm:

   ```bash
   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   helm install prometheus prometheus-community/kube-prometheus-stack
   ```

2. Add metrics to FastAPI (`app.py`):

   ```python
   from prometheus_fastapi_instrumentator import Instrumentator

   # Instrument all routes and serve the metrics at /metrics.
   instrumentator = Instrumentator().instrument(app).expose(app)
   ```

   Install the package with `pip install prometheus-fastapi-instrumentator` and add it to `requirements.txt`. The service then exposes request counts, latency, and errors at `/metrics`.

3. Build a custom dashboard in Grafana:
   - Query: `sum(rate(http_requests_total{job="claude-inference"}[5m]))`
   - Alerts: error rate above 5%, latency above 2s.

4. Logging: Use structured JSON logs with `structlog`, and forward them to Loki or ELK.
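To show what "structured JSON logs" means in practice, here is a minimal sketch that uses only the standard library `logging` module as a dependency-free stand-in for `structlog`. `JsonFormatter`, `build_logger`, the `fields` convention, and the sample values are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the top level.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("claude_gateway")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Demo: write to an in-memory buffer; in a pod you would pass sys.stdout instead.
buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "inference_completed",
    extra={"fields": {"model": "claude-3-haiku-20240307", "latency_ms": 412}},  # made-up values
)
print(buffer.getvalue().strip())
```

One JSON object per line on stdout is the shape that log shippers for Loki or ELK typically parse without extra configuration, which is the main reason to prefer it over free-form text.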
**Claude-specific monitoring**:

- Track an `anthropic_tokens_used` custom metric.
- Cost dashboard: tokens used × per-token price, with input and output tokens priced separately.

## Best Practices for Enterprise Claude Deployments

- **Secrets**: Use External Secrets Operator with AWS Secrets Manager or Vault.
- **Rate Limiting**: Add `slowapi` to stay within Anthropic limits (e.g., 50 RPM for Sonnet; actual limits vary by usage tier).

  ```python
  from fastapi import Request
  from slowapi import Limiter
  from slowapi.util import get_remote_address

  limiter = Limiter(key_func=get_remote_address)  # one bucket per client IP
  app.state.limiter = limiter  # also register slowapi's RateLimitExceeded handler to return 429s

  @app.post("/infer")
  @limiter.limit("10/minute")
  async def infer(request: Request, body: InferenceRequest): ...  # slowapi requires the Request argument
  ```

- **Caching**: Redis for prompt/response caching (TTL 5 min).
- **Retries**: `tenacity` (or the SDK's built-in `max_retries`) for backoff on 429/5xx.
- **Multi-Model**: Route to Haiku for cheap tasks, Sonnet for complex ones.
- **Security**: NetworkPolicies to restrict egress, TLS to the Anthropic API, and mTLS between in-cluster services.
- **CI/CD**: ArgoCD for GitOps deployments.
- **Cost Optimization**: Route the bulk of simple workloads (e.g., 80%) to Haiku; ask about provisioned throughput via Anthropic enterprise plans.

## Conclusion

You've now deployed a scalable Claude inference gateway on Kubernetes, ready for enterprise traffic. Monitor, iterate, and integrate with n8n/Zapier for workflows. For production, load-test with Locust and optimize prompts for token efficiency.

**Next steps**:

- Extend to AI agents with MCP servers.
- Compare with GPT deployments.

Fork the [GitHub repo](https://github.com/example/claude-k8s) for templates.
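As a supplement to the cost dashboard item under Claude-specific monitoring: spend is driven by token counts rather than request counts, and input and output tokens are priced separately. Here is a minimal sketch of the arithmetic, assuming the token counts are taken from each response's `usage.input_tokens` and `usage.output_tokens` fields. `Pricing` and `request_cost` are hypothetical helpers, and the prices are illustrative placeholders, so substitute the current figures from Anthropic's pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. The values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request in USD: each token type times its own price."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


example = Pricing(input_per_mtok=3.0, output_per_mtok=15.0)  # assumed figures
cost = request_cost(input_tokens=1_200, output_tokens=400, pricing=example)
print(f"${cost:.4f}")  # -> $0.0096
```

Summing this value per model over a time window gives the number a cost panel would plot. The same token counts can feed the `anthropic_tokens_used` metric.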
