AMD ROCm Blogs

MXFP6 and MXFP4 Mixed Precision for Accelerating Dense LLMs on AMD Instinct MI355X

2026-06-26T00:00:00+00:00

In this blog, you will learn how pairing MXFP6-E2M3 activations with MXFP4 weights can meaningfully recover accuracy lost to pure 4-bit MXFP4 quantization in specific workloads and configurations, while staying within 2–3% of MXFP4 throughput. You will see measured offline throughput, serving latency, and benchmark accuracy results comparing BF16, FP8, MXFP4, and W_MXFP4_A_MXFP6 on Llama-3.1-8B and Qwen3.6-27B on AMD Instinct MI355X.

Efficient GPU Utilization With Workload Pre-Emption in AMD Resource Manager

2026-06-26T00:00:00+00:00

GPU capacity is in high demand. Production inference services, fine-tuning jobs, and developer workspaces like VS Code or JupyterLab all compete for the same resources. The challenge is not just provisioning enough GPUs; it is keeping them utilized and making sure prioritized work can access capacity when it needs it. Training jobs can drop to near-zero utilization between compute phases; inference services can go quiet between traffic bursts; experimental models and development workspaces might be left running idle or after hours. As a result, workloads hold on to GPUs they are no longer using while other work sits queued.

DP Attention and TBO for DeepSeek-V4 on MI355X

2026-06-24T00:00:00+00:00

Running DeepSeek-V4 efficiently requires solving two intertwined problems: how to parallelize MoE communication across GPUs, and how to hide that communication behind useful compute. The dominant approach is Expert Parallel with all2all backends like DeepEP. That approach solves both problems, but it also requires specialized kernels, topology assumptions, and careful expert placement.

Faster Kimi-K2.5-W4A8 Decoding with EAGLE3 on AMD Instinct™ MI325X

2026-06-23T00:00:00+00:00

In our previous blog [7], we deployed Kimi-K2.5 [1] in W4A8 (INT4 weights + INT8 activations) on AMD Instinct™ MI325X, replacing the BF16 MFMA path in the fused MoE kernel with the INT8 MFMA implementation from FlyDSL [2]. The remaining bottleneck is the autoregressive nature of decoding itself: even with INT8 MFMA and INT4 weights, the framework still runs one full forward pass per generated token.

A Practical Guide to Running LLMs on AMD Radeon™ GPUs

2026-06-19T00:00:00+00:00

Running large language models on AMD Radeon™ GPUs has never been more accessible. Thanks to rapid advancements in open-source tooling and GPU acceleration, both Radeon™ integrated GPUs (iGPU) and discrete GPUs (dGPU) have become powerful, cost-effective platforms for local AI. Whether you prefer a polished desktop application, a lightweight command-line workflow, or a fully customizable runtime, a rich ecosystem of tools now makes it easy to run state-of-the-art language models directly on your Radeon™-powered PC.

Efficient and Portable 3D Explorable World Generation on AMD GPUs

2026-06-18T00:00:00+00:00

Explorable 3D world generation is becoming a foundational capability for spatial and embodied intelligence. Training agents that can navigate, reason, and interact with environments requires not just static datasets, but rich, immersive worlds that support free-view exploration and consistent geometry. Recent work such as Matrix3D¹ has pushed this frontier forward by combining panoramic generation with explicit 3D reconstruction, enabling higher-quality and more coherent environments than prior video-only methods.

Comparative Analysis of Scale-Out RoCE Network Traffic Patterns and Loads in Training Large Language Models

2026-06-18T00:00:00+00:00

As large-scale AI workloads continue to grow, understanding network behavior becomes critical. This blog analyzes scale-out RoCE traffic patterns and loads in large language model training, helping you uncover bottlenecks, improve performance, and design more scalable ROCm-based systems.

Building and Deploying Custom hipBLASLt Libraries on AMD Instinct GPUs

2026-06-18T00:00:00+00:00

General Matrix Multiply (GEMM) operations are a core component of many generative AI workloads. Whether you are running attention mechanisms in the prefill phase of a Large Language Model (LLM) or generating tokens sequentially during the decode phase, matrix multiplication performance has a direct impact on end-to-end latency and throughput.

Utilizing AMD Schola and UnrealRoboticsLab with AMD ROCm™ Software to Train a Robotic Arm

2026-06-17T00:00:00+00:00

A great reinforcement learning (RL) training environment excels along many axes. Unreal® Engine brings a powerful combination of capabilities, including physically based rendering, high-fidelity visual environments, and a mature toolset for building rich interactive scenes. These strengths make it an excellent fit for training tasks that involve complex lighting or rich vision-based observations.

Technical Dive into AMD’s MLPerf Training v6.0 Submission

2026-06-16T00:00:00+00:00

AMD is proud to share its MLPerf Training v6.0 results, marking another step forward in our commitment to delivering competitive AI training performance using the latest AMD Instinct GPUs. This round covers three benchmarks — Llama 2 70B LoRA fine-tuning, Llama 3.1 8B pretraining, and Flux.1-schnell text-to-image pretraining — with AMD’s own submissions spanning the MI325X, MI350X, and MI355X Instinct GPUs, alongside a large-scale 512-GPU MI300X submission from OCI in partnership with AMD.