Deploying your model#

2025-09-26


Applies to Linux

ROCm enables inference and deployment for various classes of models, including CNNs, RNNs, LSTMs, MLPs, and transformers. This section focuses on deploying transformer-based LLMs.

ROCm supports vLLM and Hugging Face TGI as major LLM-serving frameworks.

Serving using vLLM#

vLLM is a fast and easy-to-use library for LLM inference and serving. AMD is actively working with the vLLM team to improve performance and support the latest ROCm versions.

See the GitHub repository and official vLLM documentation for more information.

For guidance on using vLLM with ROCm, refer to Installation with ROCm.

vLLM installation#

vLLM supports two ROCm-capable installation methods. Refer to the official vLLM documentation for instructions on each.
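Whichever method you use, a quick way to confirm that the resulting environment is usable is to import the packages and check that PyTorch sees the accelerator. The snippet below is a minimal sketch; it assumes vLLM and a ROCm build of PyTorch are already installed.

```python
# Minimal sanity check for a ROCm-enabled vLLM environment (illustrative sketch).
import torch
import vllm

print(f"vLLM version:    {vllm.__version__}")
print(f"PyTorch version: {torch.__version__}")

# On ROCm builds of PyTorch, torch.version.hip is a version string (None on CUDA builds).
print(f"HIP runtime:     {getattr(torch.version, 'hip', None)}")

# ROCm devices are exposed through the torch.cuda API.
if torch.cuda.is_available():
    print(f"Detected device: {torch.cuda.get_device_name(0)}")
else:
    print("No ROCm-capable device detected.")
```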

vLLM walkthrough#

For guidance on serving with vLLM, refer to the developer blog Inferencing and serving with vLLM on AMD GPUs on ROCm Blogs.
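As a complement to the blog post, the following sketch shows vLLM's offline batched-inference API. The model name and sampling settings are placeholders; substitute any model your accelerator can hold.

```python
# Offline batched inference with vLLM (illustrative sketch; the model and
# sampling parameters are placeholders).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "ROCm is a software stack for",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Downloads the model from the Hugging Face Hub on first use.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt:    {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```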

Validating vLLM performance#

ROCm provides a prebuilt, optimized Docker image for validating LLM inference performance with vLLM on the MI300X accelerator. The image includes ROCm, vLLM, PyTorch, and tuning files in CSV format. For more information, see the guide to LLM inference performance testing with vLLM on the AMD Instinct™ MI300X accelerator in the ROCm GitHub repository.
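The Docker image ships its own benchmarking scripts, which remain the reference method. For a rough, informal check against a running vLLM server that exposes the OpenAI-compatible API (assumed here to listen on localhost:8000), you can time a single request from Python:

```python
# Rough single-request throughput check against a vLLM OpenAI-compatible server
# (illustrative sketch; the endpoint, model name, and prompt are assumptions, and
# this is not a substitute for the official performance-testing guide).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
response = client.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: use the model the server loaded
    prompt="Briefly explain what ROCm is.",
    max_tokens=256,
)
elapsed = time.perf_counter() - start

generated = response.usage.completion_tokens
print(f"Generated {generated} tokens in {elapsed:.2f} s "
      f"({generated / elapsed:.1f} tokens/s for this request)")
```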

Serving using Hugging Face TGI#

The Hugging Face Text Generation Inference (TGI) library is optimized for serving LLMs with low latency. Refer to the Quick tour of TGI for more details.

TGI installation#

The easiest way to use Hugging Face TGI with ROCm on AMD Instinct accelerators is to use the official Docker image at huggingface/text-generation-inference.

TGI walkthrough#
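Assuming a TGI container built from the image above is already serving a model on localhost port 8080 (the port mapping is an assumption here), a minimal sketch of exercising it from Python uses the huggingface_hub client:

```python
# Query a running TGI endpoint (illustrative sketch; the endpoint URL and
# generation parameters are assumptions, not part of the official walkthrough).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Simple text generation against the served model.
result = client.text_generation(
    "What is an AMD Instinct accelerator?",
    max_new_tokens=100,
    temperature=0.7,
)
print(result)
```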