From Zero to GenAI Cluster: Scalable Local LLMs with Docker, Kubernetes, and GPU Scheduling

A practical guide to deploying fast, private, and production-ready large language models with vLLM, Ollama, and Kubernetes-native orchestration. Build your own scalable GenAI cluster with Docker, Kubernetes, and GPU scheduling.
Prerequisites
Before we begin, ensure your system meets the following requirements:
- A Kubernetes cluster with GPU-enabled nodes (e.g., via GKE, AKS, or bare-metal)
- The NVIDIA device plugin installed on the cluster (see the quick check after this list)
- Helm CLI installed and configured
- Docker CLI and access to a GPU-compatible container runtime (e.g., the NVIDIA Container Toolkit, formerly nvidia-docker2)
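A fast way to confirm the device-plugin prerequisite is to check that your nodes actually advertise GPU capacity to the scheduler:
# Each GPU node should report nvidia.com/gpu under Capacity and Allocatable
kubectl describe nodes | grep -i "nvidia.com/gpu"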
Introduction
Local LLMs are no longer a research luxury; they're a production need. But deploying them at scale, with GPU access, container orchestration, and real-time monitoring? That's still murky territory for many.
In this article, I'll walk you through how I built a fully operational GenAI cluster using Docker, Kubernetes, and GPU scheduling. It serves language models through engines like vLLM, Ollama, or Hugging Face TGI. We'll make it observable with Prometheus and Grafana, and ready to scale when the real load hits.
This isn’t just another tutorial. It’s a battle-tested, experience-backed blueprint for real-world AI infrastructure, written for developers and DevOps engineers pushing the boundaries of what GenAI can do.
Why Local/Private LLMs Matter
Many teams today are realizing that hosted APIs like those from OpenAI and Anthropic, while convenient, come with serious trade-offs:
- Cost grows fast when usage scales
- Sensitive data can't always be sent to third-party clouds
- Customization is limited to what the API provider allows
- Latency becomes a bottleneck in low-connectivity environments
Self-hosting LLMs means freedom, control, and flexibility. But only if you know how to do it right.
What We'll Build
We’ll deploy a production-grade Kubernetes cluster featuring:
- vLLM / Ollama / TGI model server containers
- GPU scheduling and node affinity
- Ingress with HTTPS via NGINX
- Autoscaling using HPA or KEDA
- Prometheus + Grafana for real-time insights
- Declarative infrastructure using Helm or plain YAML
Architecture Overview
Figure: High-level architecture of a scalable GenAI Cluster using Docker, Kubernetes, and GPU scheduling.
This modular, observable cluster gives you full control over your LLM infrastructure, without vendor lock-in.
Step 1: Dockerizing the Model Server
Let’s start small: a single Docker container that wraps a model server like vLLM.
# Dockerfile.vllm
FROM nvidia/cuda:12.2.0-base-ubuntu20.04

# System dependencies for fetching and running the model server
RUN apt-get update && apt-get install -y git python3 python3-pip

# vLLM pulls in CUDA-enabled PyTorch wheels; transformers handles tokenizers and model configs
RUN pip3 install vllm torch transformers

WORKDIR /app
COPY start.sh ./
CMD ["bash", "start.sh"]
start.sh:
#!/bin/bash
# --host 0.0.0.0 makes the server reachable through the published container port
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-1.3b --host 0.0.0.0 --port 8000
Then, build your container:
docker build -f Dockerfile.vllm -t vllm-server:v0.1 .
You can also use Ollama if you prefer pre-packaged models and a lower barrier to entry. vLLM is recommended for higher throughput and OpenAI-compatible APIs.
This is your first step toward building a modular, GPU-ready inference system.
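Before moving to Kubernetes, a quick local smoke test helps; a minimal sketch, assuming the NVIDIA Container Toolkit is installed so --gpus works:
# Run the image with GPU access and publish the API port
docker run --rm --gpus all -p 8000:8000 vllm-server:v0.1

# From another terminal: vLLM speaks the OpenAI-compatible API
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-1.3b", "prompt": "Kubernetes is", "max_tokens": 16}'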
Step 2: Kubernetes Deployment with GPU Scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm-server:v0.1
          resources:
            limits:
              nvidia.com/gpu: 1   # request exactly one GPU from the device plugin
          ports:
            - containerPort: 8000
      nodeSelector:
        kubernetes.io/role: gpu   # adjust to however your GPU nodes are labeled
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
And here’s the corresponding Service definition:
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
This exposes your model server inside the cluster.
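To sanity-check the rollout (the manifest filenames below are placeholders for wherever you saved the YAML above):
# Apply the manifests and confirm the pod landed on a GPU node
kubectl apply -f vllm-deployment.yaml -f vllm-service.yaml
kubectl get pods -l app=vllm -o wide

# Port-forward the Service and hit the OpenAI-compatible API from your workstation
kubectl port-forward svc/vllm-service 8000:8000
curl http://localhost:8000/v1/models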
Step 3: Ingress and Load Balancing
Install NGINX Ingress Controller:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install nginx ingress-nginx/ingress-nginx
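If you plan to autoscale on ingress traffic later (Step 4), it can help to enable the controller's Prometheus metrics at install time. A hedged alternative to the plain install above (the ServiceMonitor integration can wait until Step 5's CRDs exist):
# Same chart, with Prometheus metrics enabled on the controller
helm upgrade --install nginx ingress-nginx/ingress-nginx \
  --set controller.metrics.enabled=true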
Then configure ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
spec:
  ingressClassName: nginx   # matches the ingress-nginx controller installed above
  rules:
    - host: vllm.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000
Update your DNS or /etc/hosts to route vllm.local to your cluster.
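The overview promised HTTPS, while the Ingress above serves plain HTTP. A hedged sketch of a TLS-enabled variant, assuming cert-manager is installed with a ClusterIssuer named letsencrypt-prod and that vllm.example.com stands in for a hostname you control:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # assumes cert-manager and this ClusterIssuer exist
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - vllm.example.com
      secretName: vllm-tls                             # cert-manager writes the issued certificate here
  rules:
    - host: vllm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000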
Step 4: Autoscaling with KEDA (Optional)
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda
With KEDA, you can scale your LLM pods based on GPU utilization, HTTP traffic, or even Kafka topic lag.
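As a concrete (and hedged) example, the ScaledObject below scales the Step 2 deployment on ingress request rate. It assumes the kube-prometheus-stack from Step 5 is installed as release monitoring in the default namespace and that ingress-nginx metrics are enabled and scraped; treat the server address, query, and threshold as placeholders to adapt:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment          # the Deployment from Step 2
  minReplicaCount: 1
  maxReplicaCount: 4               # bound this by the GPUs you actually have
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://monitoring-kube-prometheus-prometheus.default.svc:9090
        query: sum(rate(nginx_ingress_controller_requests{service="vllm-service"}[2m]))
        threshold: "5"             # target requests/second per replica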
Step 5: Monitoring with Prometheus + Grafana
Install full-stack observability:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install monitoring prometheus-community/kube-prometheus-stack
Expose a /metrics endpoint from your container:
from prometheus_client import start_http_server, Summary
import time

# Track how long each request takes; exported as request_processing_seconds
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent processing request')

@REQUEST_TIME.time()
def process_request():
    time.sleep(1)  # placeholder for real inference work

if __name__ == '__main__':
    # Serve Prometheus metrics on port 8001, alongside the model server on 8000
    start_http_server(8001)
    while True:
        process_request()
Or use GPU exporters like dcgm-exporter. Grafana will pull all this into beautiful dashboards.
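For the kube-prometheus-stack to actually scrape that custom endpoint, a ServiceMonitor is the usual glue. A minimal sketch, assuming the vllm Service carries an app: vllm label, exposes port 8001 under the name metrics, and that the stack was installed as release monitoring:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  labels:
    release: monitoring      # must match the kube-prometheus-stack release for discovery
spec:
  selector:
    matchLabels:
      app: vllm              # assumes the Service itself is labeled app: vllm
  endpoints:
    - port: metrics          # assumes a named Service port "metrics" mapped to 8001
      path: /metrics
      interval: 30s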
Step 6: Optional Components
- Vector DB: Qdrant, Weaviate, or Chroma
- Auth Gateway: Add OAuth2 Proxy or Istio
- LangServe or FastAPI: Wrap your model with an API server or LangChain interface
- Persistent Volumes / Object Store: Save fine-tuned models or cached weights using PVCs or MinIO (see the sketch below)
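For that last item, a minimal PVC sketch for caching model weights (size and storage class are placeholders):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                # placeholder; size it for the models you serve
  # storageClassName: fast-ssd     # optional; set to a storage class available in your cluster
Mounting this claim at /root/.cache/huggingface in the vLLM container keeps downloaded weights across pod restarts.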
Final Thoughts
This isn’t just code. It’s the story of how I learned to stitch together powerful AI infrastructure from open-source tools and make it reliable enough for real-world teams to trust.
Docker gave me modularity. Kubernetes gave me orchestration. GPUs gave me the muscle.
Put together, they gave me something every AI builder wants: freedom.
If you're tired of vendor lock-in and ready to roll up your sleeves, this cluster is your launchpad.
This is just the beginning. Start building your GenAI infrastructure today and take control of your AI stack. Share your progress, contribute to the community, and let’s push the boundaries of what’s possible together.
See you at the edge!