Series Roadmap
| # | Topic |
|---|---|
| 0 | Overview & Comparison: vLLM, SGLang, TensorRT-LLM |
| 1 | Deep Dive: vLLM (PagedAttention and Scheduling) |
| 2 | Deep Dive: SGLang (RadixAttention and Structured Generation) |
| 3 | Deep Dive: TensorRT-LLM (Compiled Optimization and Deployment) |
Why Do We Need Dedicated Serving Frameworks?
LLM inference has characteristics that general-purpose deep learning frameworks are not designed to handle efficiently.
Autoregressive generation: Output tokens are generated one at a time. Rather than recomputing the Key-Value (KV) representations of all previous tokens at each step, they are cached and reused (KV Cache). Managing this cache is the central memory bottleneck.
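To give a sense of scale, here is a back-of-the-envelope sketch of how much memory the KV Cache needs per sequence. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) is a hypothetical 7B-class configuration chosen for illustration, not a figure from any specific model card.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: a K and a V tensor in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical 7B-class shape, FP16 (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**10, "KiB per token")
print(per_request / 2**30, "GiB per 4096-token request")
```

Under these assumptions a single long request occupies gigabytes of GPU memory, which is why cache management dominates serving design.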
Variable request lengths: Input prompts and output lengths differ per request. With static batching, a short request wastes GPU cycles waiting for the longest request in the batch to finish.
Prefill-Decode asymmetry: Processing the input prompt (Prefill) is compute-bound; generating output tokens (Decode) is memory-bound. How these two phases are scheduled determines overall throughput and latency.
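The asymmetry can be illustrated with a rough roofline-style estimate for a single square linear layer. This is a simplified sketch: it assumes one weight matrix of size `d_model x d_model`, FP16 storage, and counts only weight and activation traffic; `arithmetic_intensity` is a hypothetical helper, not part of any framework.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte moved for one (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model            # one multiply + one add per weight per token
    weight_elems = d_model * d_model                  # weights are read once
    activation_elems = 2 * tokens * d_model           # inputs read, outputs written
    bytes_moved = bytes_per_elem * (weight_elems + activation_elems)
    return flops / bytes_moved


decode = arithmetic_intensity(tokens=1, d_model=4096)      # one new token per step
prefill = arithmetic_intensity(tokens=2048, d_model=4096)  # whole prompt at once

print(f"decode:  {decode:.2f} FLOPs/byte")
print(f"prefill: {prefill:.2f} FLOPs/byte")
```

With one token per step, the layer performs about one FLOP per byte of weights loaded, so memory bandwidth is the limit. With thousands of prompt tokens the same weights are reused across all of them, and compute becomes the limit.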
PyTorch and TensorFlow do not optimize for these characteristics. This is why LLM-specific serving frameworks exist.
vLLM
UC Berkeley, 2023. The de facto standard for open-source LLM serving.
Core Technology: PagedAttention
KV Cache memory is managed using a scheme analogous to OS virtual memory paging. Each request's KV Cache is allocated in fixed-size physical blocks that need not be contiguous, with a logical-to-physical block mapping table maintained at runtime.
Prior approaches required pre-allocating contiguous memory equal to the maximum sequence length per request, wasting 60-80% of allocated memory on average. PagedAttention allocates blocks only as needed, nearly eliminating waste and allowing far more concurrent requests per GPU.
Conventional KV Cache allocation
Request A: [████████████████████████______] (tail wasted)
Request B: [████████████__________________] (tail wasted)
PagedAttention
Physical blocks: [Block0][Block1][Block2][Block3][Block4]...
Request A → Block0, Block2, Block4 (non-contiguous, no waste)
Request B → Block1, Block3 (non-contiguous, no waste)
Copy-on-Write block sharing enables prefix reuse and parallel sampling (e.g., beam search) with minimal memory overhead.
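The bookkeeping behind block tables and Copy-on-Write can be sketched in a few lines. This is a minimal illustration of the idea, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, and the sketch tracks only block ids and reference counts, not the tensor data a real system would copy.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks and tracks how many sequences share each."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop(0)
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.allocator.block_size == 0:
            # No block yet, or the last block is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-Write: the block is shared, so write into a private copy.
                # (A real system would also copy the KV tensors here.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self):
        """Create a second sequence that shares every existing block."""
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child


allocator = BlockAllocator(num_blocks=8, block_size=4)
a = Sequence(allocator)
for _ in range(6):
    a.append_token()

b = a.fork()        # e.g. a second sample from the same prompt
b.append_token()    # triggers Copy-on-Write on the partially filled last block

print("A block table:", a.block_table)
print("B block table:", b.block_table)
```

In this run the full first block stays shared between both sequences, and only the partially filled last block is duplicated when the fork writes to it.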
Continuous Batching
The batch is rebuilt at each iteration. When a forward pass completes, finished sequences are removed and waiting requests are inserted immediately. This keeps GPU utilization high regardless of output length variance.
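A toy simulation makes the difference from static batching concrete. It is a sketch under simplifying assumptions: each forward pass produces exactly one token per running sequence, prefill cost is ignored, and both function names are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Rebuild the batch every iteration: drop finished sequences, admit waiting ones."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass generates one token for every running sequence.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lens = [2, 10, 3, 9, 1, 8]  # output lengths of six requests
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print("static:", static_steps, "continuous:", continuous_steps)
```

For this workload the static schedule needs 27 forward passes and the continuous schedule needs 19, because freed slots are refilled immediately instead of idling until the longest request in the batch completes.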
Key Features
- Built-in OpenAI API-compatible server
- Hardware: NVIDIA, AMD (ROCm), Google TPU, AWS Inferentia, and more
- Multi-LoRA concurrent serving
- Chunked Prefill and Speculative Decoding support
- Largest open-source contributor community
SGLang
Stanford, 2024. Optimized for complex LLM programs and prefix reuse.
Core Technology: RadixAttention
KV Cache is organized as a Radix Tree. Requests that share a common prefix automatically reuse cached KV blocks without any manual configuration.
Radix Tree example
"system prompt + user A's history" → cache hit on system prompt
"system prompt + user B's history" → cache hit on system prompt
This is particularly effective for multi-turn conversation, few-shot prompting, and agent loops, where the same prefix appears repeatedly. While vLLM's prefix caching has historically required explicit configuration, SGLang applies it automatically.
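The lookup logic can be sketched as follows. For brevity this uses a plain token-level trie rather than a compressed radix tree, and it stores no KV data or eviction state; `PrefixCache` is a hypothetical name, not SGLang's API.

```python
class PrefixCache:
    """Token-level trie: each path from the root is a cached prefix."""

    def __init__(self):
        self.root = {}

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system = ["You", "are", "a", "helpful", "assistant", "."]
req_a = system + ["Hi", "from", "A"]
req_b = system + ["Hello", "from", "B"]

hit_a = cache.match_prefix(req_a)  # cold cache: nothing to reuse
cache.insert(req_a)
hit_b = cache.match_prefix(req_b)  # shares the system prompt with request A

print("tokens reused for A:", hit_a)
print("tokens reused for B:", hit_b)
```

Only the tokens after the matched prefix need prefill computation, so request B skips the whole system prompt without the caller declaring anything about shared prefixes.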
Structured Generation
High-throughput generation constrained to JSON Schema, regular expressions, or EBNF grammars. SGLang integrates XGrammar as its grammar backend, which keeps constraint checking off the critical path of the GPU decode step and significantly reduces latency compared to naive per-token CPU-side implementations.
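The underlying mechanism, masking out tokens the grammar forbids before picking the next one, can be sketched with a toy automaton. Everything here is illustrative: `GRAMMAR`, `VOCAB`, and `fake_logits` are hypothetical stand-ins, and real grammar engines handle subword tokens and full context-free grammars rather than a five-state table.

```python
import math

# Hypothetical token-level automaton accepting {"ok": true} or {"ok": false}.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}

VOCAB = ["{", "}", ":", '"ok"', "true", "false", "maybe", "hello"]


def constrained_step(logits, state):
    """Mask tokens the grammar disallows in this state, then pick greedily."""
    allowed = GRAMMAR[state]
    masked = [
        logit if tok in allowed else -math.inf
        for tok, logit in zip(VOCAB, logits)
    ]
    best = max(range(len(VOCAB)), key=lambda i: masked[i])
    token = VOCAB[best]
    return token, allowed[token]


def generate(logits_fn):
    state, out = "start", []
    while state != "done":
        token, state = constrained_step(logits_fn(out), state)
        out.append(token)
    return out


def fake_logits(prefix):
    """Stand-in for a model that would prefer 'hello' and 'maybe' if unconstrained."""
    return [0.1, 0.1, 0.1, 0.1, 1.0, 0.5, 2.0, 3.0]


result = generate(fake_logits)
print(" ".join(result))
```

Even though the stand-in model scores "hello" highest at every step, the mask forces the output to stay inside the grammar, so the result is always valid structured output.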
Key Features
- Aggressive CUDA Graph use for the decode phase, which reduces kernel launch overhead
- FlashInfer kernel integration
- torch.compile support
- Hardware: NVIDIA, AMD (ROCm)
TensorRT-LLM
NVIDIA, 2023. Maximum raw throughput through ahead-of-time compilation.
Core Technology: TensorRT Engine Compilation
The model is compiled into a TensorRT engine before serving. During compilation:
- Kernel Fusion: Multiple operations (Layer Norm → Linear → Activation) are merged into a single CUDA kernel
- Layer/Tensor Fusion: Computational graph is optimized
- Quantization: FP8, INT8 (SmoothQuant), INT4 (AWQ, GPTQ) applied at compile time
At runtime, the pre-optimized engine executes directly, yielding the highest raw throughput in NVIDIA environments.
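To illustrate what compile-time quantization does to the weights, here is a minimal sketch of symmetric per-tensor INT8 quantization. This is the generic absmax scheme, chosen because it is the simplest to show; it is not SmoothQuant, AWQ, or GPTQ, and the function names are hypothetical rather than TensorRT-LLM APIs.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127.0 if absmax > 0 else 1.0  # avoid dividing by zero for all-zero tensors
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is stored in one byte instead of two or four, at the cost of a rounding error bounded by half the scale. The more elaborate schemes listed above exist to keep that error from hurting model quality.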
In-flight Batching
NVIDIA's continuous batching implementation, integrated with Triton Inference Server.
Key Features
- Broad quantization support: FP8, INT8, INT4
- Built-in Tensor Parallelism and Pipeline Parallelism
- Official Triton Inference Server integration
- Hardware: NVIDIA GPU only
- Drawbacks: requires per-model TRT-LLM implementation, long engine build times, no AMD/other GPU support
Comparison
| | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Origin | UC Berkeley | Stanford | NVIDIA |
| Core Technology | PagedAttention | RadixAttention | TRT engine compilation |
| Prefix Caching | Manual config | Automatic (Radix Tree) | Limited |
| Structured Generation | Basic | Specialized (XGrammar) | Limited |
| Hardware | Multi-vendor | NVIDIA, AMD | NVIDIA only |
| Deployment Complexity | Low | Low | High (build required) |
| Raw Throughput | High | High | Highest (on NVIDIA) |
| Best For | General serving | Agents, complex LLM programs | NVIDIA production deployment |
Decision guide:
- General serving, multi-vendor hardware → vLLM
- High proportion of multi-turn chat, agents, or structured output → SGLang
- Maximum throughput on NVIDIA in production → TensorRT-LLM
The next post covers vLLM internals: how PagedAttention is implemented and how the scheduler is designed.