LLM Serving Frameworks Overview: vLLM, SGLang, TensorRT-LLM

Design philosophies and core technologies compared

πŸ‡ΊπŸ‡Έ English | πŸ‡°πŸ‡· ν•œκ΅­μ–΄

Series Roadmap

# Topic
0 Overview & Comparison β€” vLLM, SGLang, TensorRT-LLM
1 Deep Dive: vLLM β€” PagedAttention and Scheduling
2 Deep Dive: SGLang β€” RadixAttention and Structured Generation
3 Deep Dive: TensorRT-LLM β€” Compiled Optimization and Deployment

Why Do We Need Dedicated Serving Frameworks?

LLM inference has characteristics that general-purpose deep learning frameworks are not designed to handle efficiently.

Autoregressive generation: Output tokens are generated one at a time. Rather than recomputing the Key-Value (KV) representations of all previous tokens at each step, they are cached and reused (KV Cache). Managing this cache is the central memory bottleneck.

Variable request lengths: Input prompts and output lengths differ per request. With static batching, a short request wastes GPU cycles waiting for the longest request in the batch to finish.

Prefill-Decode asymmetry: Processing the input prompt (Prefill) is compute-bound; generating output tokens (Decode) is memory-bound. How these two phases are scheduled determines overall throughput and latency.

PyTorch and TensorFlow do not optimize for these characteristics. This is why LLM-specific serving frameworks exist.


vLLM

UC Berkeley, 2023. The de facto standard for open-source LLM serving.

Core Technology: PagedAttention

KV Cache memory is managed using a scheme analogous to OS virtual memory paging. Each request’s KV Cache is allocated in fixed-size physical blocks that need not be contiguous, with a logical-to-physical block mapping table maintained at runtime.

Prior approaches required pre-allocating contiguous memory equal to the maximum sequence length per request, wasting 60–80% of allocated memory on average. PagedAttention allocates blocks only as needed, nearly eliminating waste and allowing far more concurrent requests per GPU.

Conventional KV Cache allocation
Request A: [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ______] (tail wasted)
Request B: [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ__________________] (tail wasted)

PagedAttention
Physical blocks: [Block0][Block1][Block2][Block3][Block4]...
Request A β†’ Block0, Block2, Block4 (non-contiguous, no waste)
Request B β†’ Block1, Block3        (non-contiguous, no waste)

Copy-on-Write block sharing enables prefix reuse and parallel sampling (e.g., beam search) with minimal memory overhead.

Continuous Batching

The batch is rebuilt at each iteration. When a forward pass completes, finished sequences are removed and waiting requests are inserted immediately. This keeps GPU utilization high regardless of output length variance.

Key Features

  • Built-in OpenAI API-compatible server
  • Hardware: NVIDIA, AMD (ROCm), Google TPU, AWS Inferentia, and more
  • Multi-LoRA concurrent serving
  • Chunked Prefill and Speculative Decoding support
  • Largest open-source contributor community

SGLang

Stanford, 2024. Optimized for complex LLM programs and prefix reuse.

Core Technology: RadixAttention

KV Cache is organized as a Radix Tree. Requests that share a common prefix automatically reuse cached KV blocks without any manual configuration.

Radix Tree example
"system prompt + user A's history" β†’ cache hit on system prompt
"system prompt + user B's history" β†’ cache hit on system prompt

This is particularly effective for multi-turn conversation, few-shot prompting, and agent loops, where the same prefix appears repeatedly. While vLLM’s prefix caching requires explicit configuration, SGLang handles this automatically.

Structured Generation

High-throughput generation constrained to JSON Schema, regular expressions, or EBNF grammars. SGLang integrates XGrammar to apply constraints at the CUDA level, significantly reducing latency compared to CPU-side implementations.

Key Features

  • Aggressive CUDA Graph use for the decode phase β†’ reduced kernel launch overhead
  • FlashInfer kernel integration
  • torch.compile support
  • Hardware: NVIDIA, AMD (ROCm)

TensorRT-LLM

NVIDIA, 2023. Maximum raw throughput through ahead-of-time compilation.

Core Technology: TensorRT Engine Compilation

The model is compiled into a TensorRT engine before serving. During compilation:

  • Kernel Fusion: Multiple operations (Layer Norm β†’ Linear β†’ Activation) are merged into a single CUDA kernel
  • Layer/Tensor Fusion: Computational graph is optimized
  • Quantization: FP8, INT8 (SmoothQuant), INT4 (AWQ, GPTQ) applied at compile time

At runtime, the pre-optimized engine executes directly, yielding the highest raw throughput in NVIDIA environments.

In-flight Batching

NVIDIA’s continuous batching implementation, integrated with Triton Inference Server.

Key Features

  • Broad quantization support: FP8, INT8, INT4
  • Built-in Tensor Parallelism and Pipeline Parallelism
  • Official Triton Inference Server integration
  • Hardware: NVIDIA GPU only
  • Drawbacks: requires per-model TRT-LLM implementation, long engine build times, no AMD/other GPU support

Comparison

Β  vLLM SGLang TensorRT-LLM
Origin UC Berkeley Stanford NVIDIA
Core Technology PagedAttention RadixAttention TRT engine compilation
Prefix Caching Manual config Automatic (Radix Tree) Limited
Structured Generation Basic Specialized (XGrammar) Limited
Hardware Multi-vendor NVIDIA, AMD NVIDIA only
Deployment Complexity Low Low High (build required)
Raw Throughput High High Highest (on NVIDIA)
Best For General serving Agents, complex LLM programs NVIDIA production deployment

Decision guide:

  • General serving, multi-vendor hardware β†’ vLLM
  • High proportion of multi-turn chat, agents, or structured output β†’ SGLang
  • Maximum throughput on NVIDIA in production β†’ TensorRT-LLM

The next post covers vLLM internals β€” how PagedAttention is implemented and how the scheduler is designed.


ν•œκ΅­μ–΄ 버전은 상단 μ–Έμ–΄ μŠ€μœ„μ²˜λ₯Ό 톡해 확인할 수 μžˆμŠ΅λ‹ˆλ‹€.

Share: LinkedIn