Running Persistent LLMs for an Entire Campus

Layer	What We Run
Orchestration	Kubernetes via k3s + Rancher across ~200 Gaudi2 nodes
Node Images	Warewulf-provisioned, so a node rebuilds to a known image
Storage	Longhorn PVCs on each node's local 7 TB NVMe
Inference	vLLM (Habana-optimized) model pods, one deployment per architecture
API Gateway	LiteLLM: OpenAI-compatible surface, key auth, model routing
Ingress	HAProxy: SSL termination and routing for chat and IDE traffic
State / Metadata	CloudNativePG (CNPG): Postgres operator for HA databases, backed up to S3
Accounts & Keys	In-house provisioning portal for self-service keys and usage

Batch HPC (Slurm)	Persistent LLM Serving
Jobs are finite: queue, run, release the nodes.	Service never exits: model pods stay resident 24/7.
User reserves nodes up front for a wall-time.	Thousands share one endpoint with no reservation.
Scheduler owns fairness between queued jobs.	No scheduler in the path: a live request needs admission control instead.
Success = job completes.	Success = low latency under constant concurrent load.

Obleth: Fairshare Admission for Self-Hosted Inference

Obleth is a fairshare-first gateway, currently running in development, built for
self-hosted inference rather than cloud-provider routing. It sits between HAProxy
and the vLLM/Aibrix backends and owns the admission layer that LiteLLM omits.

Weighted fairshare under load: when the pool is full, the tenant most behind on fair share gets the next slot instead of whoever arrived first. The algorithm is starvation-free.
Token-measured admission (TPM): per-tenant token buckets in Redis with Lua-atomic updates, so fairness tracks token cost rather than request rate.
Token-accurate accounting: tokens are reserved at admission and reconciled at stream end, and every request lands in a usage ledger.
Live priority: a tenant's weight can be changed from the dashboard and every gateway pod honors it without a restart.
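
The selection and accounting rules above can be sketched in a few lines of Python. This is an illustrative in-memory model, not Obleth's code: the `Tenant` class and the `next_tenant`, `admit`, and `reconcile` functions are hypothetical names, and "most behind on fair share" is assumed here to mean the lowest ratio of tokens used to weight. The sketch does not attempt to demonstrate the starvation-free property.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Tenant:
    name: str
    weight: float              # configured share; larger means entitled to more
    tokens_used: float = 0.0   # tokens charged so far (reserved, then reconciled)
    queue: deque = field(default_factory=deque)  # pending request ids


def next_tenant(tenants):
    """Return the waiting tenant furthest behind its fair share, or None."""
    waiting = [t for t in tenants if t.queue]
    if not waiting:
        return None
    # Normalized usage: tokens consumed per unit of weight. Lowest is most behind.
    return min(waiting, key=lambda t: t.tokens_used / t.weight)


def admit(tenant, estimated_tokens):
    """Pop the tenant's oldest request and reserve its estimated token cost."""
    request_id = tenant.queue.popleft()
    tenant.tokens_used += estimated_tokens
    return request_id


def reconcile(tenant, estimated_tokens, actual_tokens):
    """At stream end, replace the reservation with the measured cost."""
    tenant.tokens_used += actual_tokens - estimated_tokens
```

With two waiting tenants, one at weight 2.0 with 1000 tokens used and one at weight 1.0 with 800 tokens used, `next_tenant` picks the first: its normalized usage is 500 against 800, even though it has consumed more raw tokens. Changing a tenant's `weight` between calls changes the next decision immediately, which is the behavior the live-priority bullet describes.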

Cache hits return before fairshare runs. When the pool is full, work queues by weighted share rather than failing outright.
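
The ordering of those checks can be shown with a small model. Everything in this sketch is an assumption made for illustration: the `Gateway` class, an exact-match cache keyed by prompt, and a `share_ratio` mapping (tenant to usage divided by weight) that is maintained elsewhere. It only shows the order: cache first, then a free slot, then the weighted queue.

```python
class Gateway:
    def __init__(self, pool_slots, share_ratio):
        self.free_slots = pool_slots
        self.share_ratio = share_ratio  # tenant -> used / weight (hypothetical input)
        self.cache = {}                 # prompt -> cached response (assumed exact match)
        self.waiting = []               # (tenant, prompt) pairs in arrival order

    def handle(self, tenant, prompt):
        """Classify a request as 'cache_hit', 'admitted', or 'queued'."""
        if prompt in self.cache:
            return "cache_hit"          # returns before any fairshare logic runs
        if self.free_slots > 0:
            self.free_slots -= 1
            return "admitted"
        self.waiting.append((tenant, prompt))
        return "queued"                 # pool is full: wait instead of failing

    def release(self):
        """A slot freed up: give it to the waiting tenant with the lowest ratio."""
        if not self.waiting:
            self.free_slots += 1
            return None
        # min() keeps the earliest entry on ties, so equal tenants stay FIFO.
        idx = min(range(len(self.waiting)),
                  key=lambda i: self.share_ratio[self.waiting[i][0]])
        return self.waiting.pop(idx)
```

When a slot is released, a request from a tenant with a lower ratio is served ahead of an earlier arrival from a tenant with a higher one, which is the difference from first-come, first-served queuing.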

Capability	LiteLLM	Obleth
Multi-tenant API keys	yes	yes
Rate limiting	per-key RPM	token-measured TPM
Weighted fairshare under saturation	no	yes
Queue instead of immediate 429	beta, priority-only	yes, fairshare
Live priority change (no restart)	no	yes
Burst above share when idle	no, static cap	yes, reclaimed on contention
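
To make the rate-limiting row concrete, here is a minimal token bucket measured in tokens per minute. It is an in-memory stand-in written for illustration only: the `TokenBucket` class and its parameters are hypothetical, and it does not reproduce the Redis and Lua implementation or the burst-and-reclaim behavior.

```python
class TokenBucket:
    def __init__(self, tpm_limit, now=0.0):
        self.capacity = float(tpm_limit)       # maximum tokens the bucket can hold
        self.refill_per_sec = tpm_limit / 60.0  # steady refill rate
        self.level = float(tpm_limit)          # start full
        self.updated = now                     # time of the last refill

    def try_consume(self, tokens, now):
        """Refill for elapsed time, then charge `tokens` if the bucket has enough."""
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated = now
        if tokens > self.level:
            return False                       # over budget: caller should queue or reject
        self.level -= tokens
        return True
```

Under a per-request (RPM) limit, a 5,000-token request and a 50-token request each cost one unit. Here the first drains most of a 6,000 TPM bucket while the second barely registers, so the limit follows actual token cost.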

Why Self-Host Inference

Production Stack at a Glance

Request Path and Architecture

Account Management and Identity

Self-Service Account Portal

Why Our HPC Playbook Did Not Fit

The Hardware Reality: Gaudi2

Operating the Fleet: Real-Time Observability

Live Node Status Across the Fleet

Per-Model Inference Telemetry

Rancher: Declarative Control Plane

LiteLLM: What It Does and Where It Stops

Obleth: Fairshare Admission for Self-Hosted Inference

Obleth Control Plane

LiteLLM vs. Obleth

Operating as a Campus Backend Provider

Tools Built on the Platform

Roadmap

Questions?
