persimmon - vLLM
