Running Persistent LLMs for an Entire Campus
A technical walkthrough of the stack we run for always-on, self-hosted
inference: Kubernetes (k3s + Rancher), vLLM, LiteLLM, CloudNativePG (CNPG), HAProxy, and an
in-house account system, plus where that stack breaks down and what we are
building to replace it.
Johnathan Lee · Arizona State University · Sr. HPC System Architect