#vLLM on OpenShift

sand spear

I’m looking for a self-written, practical overview of vLLM, ideally from someone with real hands-on experience.

I’d like it to cover:

- what vLLM is and what problem it solves
- how it works internally (architecture, request flow, batching, KV cache, memory handling)
- how to deploy and configure it
- the main configuration parameters and what each one does
- how those parameters affect latency, throughput, concurrency, and GPU memory
- hardware/runtime requirements such as GPU memory, CUDA, tensor cores, FP16/BF16, quantization, etc.
- briefly, any important considerations for running it on OpenShift

I’m specifically looking for something written in your own words, not copy-pasted from another LLM.

Thanks a lot!