GenAI-native serving and modeling, built for performance.

Build once, deploy anywhere with a single programmable stack for high-performance GenAI on any hardware

  • Serve Models

    Deploy DeepSeek, Gemma, Qwen, and hundreds more behind a high-speed, OpenAI-compatible endpoint, on NVIDIA or AMD GPUs and on any cloud.
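
    Because the endpoint speaks the OpenAI protocol, any OpenAI client can talk to it. A minimal sketch (the base URL, API key, and model name are illustrative placeholders):

      # Query a MAX endpoint with the standard OpenAI Python client.
      # The base URL, API key, and model name are illustrative placeholders.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
      completion = client.chat.completions.create(
          model="Qwen/Qwen2.5-7B-Instruct",
          messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
      )
      print(completion.choices[0].message.content)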

  • Customize or Build Models

    Load fine-tuned weights into existing models, or build custom models with a PyTorch-like Python API that’s uniquely built to deliver top performance in production.
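
    As a sketch of how that API reads (the max.graph names and signatures below follow earlier MAX releases and are assumptions against current versions):

      # A minimal sketch of the PyTorch-like graph-builder style.
      # Assumes the max.graph Python API as documented in earlier MAX
      # releases; exact signatures may differ in current versions.
      from max.dtype import DType
      from max.graph import Graph, TensorType, ops

      input_type = TensorType(DType.float32, shape=(1,))
      with Graph("simple_add", input_types=(input_type, input_type)) as graph:
          lhs, rhs = graph.inputs
          graph.output(ops.add(lhs, rhs))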

  • Optimize Kernels with Ease

    Get the best GPU kernel performance without the GPU programming pain of the past: write in Mojo🔥, a Pythonic language purpose-built for portability across any hardware.

Your entire AI infrastructure in a single dependency

  • GPU agnostic

    The same code runs on NVIDIA, AMD, and Apple Silicon. When new generations of hardware enter the datacenter, MAX is the fastest to take advantage of them and deliver top performance. Hardware will only get more exciting; be ready for it with MAX.

      # Compile-time warp synchronization per hardware
      from sys import is_nvidia_gpu, is_amd_gpu
      
      @always_inline("nodebug")
      fn syncwarp(mask: Int = -1):
          """Synchronizes threads within a warp using a barrier."""
      
          @parameter
          if is_nvidia_gpu():
              __mlir_op.`nvvm.bar.warp.sync`(
                  __mlir_op.`index.casts`[_type = __mlir_type.i32](
                      mask._mlir_value
                  )
              )
      
          elif is_amd_gpu():
              # On AMD GPUs this is a no-op (threads in a wavefront execute in lockstep).
              return
    
  • Open source & extensible

    All of the MAX Python API, all of the model pipelines, and all of the GPU kernels (for NVIDIA, AMD, and Apple) are open source for you to learn from and contribute to.

      # Registers Qwen2 models with composable components (imports omitted for brevity)
      
      qwen2_arch = SupportedArchitecture(
          name="Qwen2ForCausalLM",  # Supports HuggingFace model class names
          task=PipelineTask.TEXT_GENERATION,
          example_repo_ids=["Qwen/Qwen2.5-7B-Instruct", "Qwen/QwQ-32B"],
          default_weights_format=WeightsFormat.safetensors,
          default_encoding=SupportedEncoding.bfloat16,
          supported_encodings={
              SupportedEncoding.float32: [KVCacheStrategy.PAGED],
              SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
          },
          pipeline_model=Qwen2Model,  # Plug in a custom model implementation
          tokenizer=TextTokenizer,
          rope_type=RopeType.normal,
          weight_adapters={  # Reuse composable converters & utilities
              WeightsFormat.safetensors: llama3.weight_adapters.convert_safetensor_state_dict,
              WeightsFormat.gguf: llama3.weight_adapters.convert_gguf_state_dict,
          },
      )
    
  • Measurable performance

    See the numbers for yourself. MAX includes max benchmark, an open-source benchmarking tool adapted from vLLM. Run it against your endpoint with datasets like ShareGPT or arxiv-summarization, or bring your own. Export shareable YAML configs for reproducible results.