vLLM
persimmon