GenAI-native serving and modeling, built for performance.

Build once, deploy anywhere with a single programmable stack for high-performance GenAI on any hardware

  • Serve Models

    Deploy DeepSeek, Gemma, Qwen, and hundreds more behind a high-speed, OpenAI-compatible endpoint, on NVIDIA or AMD GPUs and on any cloud.
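
    Because the endpoint speaks the OpenAI protocol, any OpenAI client can talk to it. A minimal sketch (the base URL, API key, and model name are illustrative placeholders):

      # Query a MAX endpoint with the standard OpenAI Python client.
      # The base URL, API key, and model name are illustrative placeholders.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
      completion = client.chat.completions.create(
          model="Qwen/Qwen2.5-7B-Instruct",
          messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
      )
      print(completion.choices[0].message.content)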

  • Customize or Build Models

    Load fine-tuned weights into existing models, or build custom models with a PyTorch-like Python API that’s uniquely built to deliver top performance in production.
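
    As a sketch of how that API reads (the max.graph names and signatures below follow earlier MAX releases and are assumptions against current versions):

      # A minimal sketch of the PyTorch-like graph-builder style.
      # Assumes the max.graph Python API as documented in earlier MAX
      # releases; exact signatures may differ in current versions.
      from max.dtype import DType
      from max.graph import Graph, TensorType, ops

      input_type = TensorType(DType.float32, shape=(1,))
      with Graph("simple_add", input_types=(input_type, input_type)) as graph:
          lhs, rhs = graph.inputs
          graph.output(ops.add(lhs, rhs))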

  • Optimize Kernels with Ease

    Get the best GPU kernel performance without the GPU programming pain of the past: write in Mojo🔥, a Pythonic language purpose-built for portability across any hardware.

Your entire AI infrastructure in a single dependency

  • GPU agnostic

    The same code runs on NVIDIA, AMD, and Apple Silicon. When new generations of hardware enter the datacenter, MAX is the fastest to take advantage of them and deliver top performance. Hardware will only get more exciting; be ready for it with MAX.

      # Compile-time warp synchronization per hardware
      from sys import is_nvidia_gpu, is_amd_gpu
      
      @always_inline("nodebug")
      fn syncwarp(mask: Int = -1):
          """Synchronizes threads within a warp using a barrier."""
      
          @parameter
          if is_nvidia_gpu():
              __mlir_op.`nvvm.bar.warp.sync`(
                  __mlir_op.`index.casts`[_type = __mlir_type.i32](
                      mask._mlir_value
                  )
              )
      
          elif is_amd_gpu():
              # On AMD GPUs this is a no-op (threads in a wavefront execute in lockstep).
              return
    
  • Open source & extensible

    All of the MAX Python API, all of the model pipelines, and all of the GPU kernels (for NVIDIA, AMD, and Apple) are open source for you to learn from and contribute to.

      # Registers Qwen2 models with composable components (imports omitted for brevity)
      
      qwen2_arch = SupportedArchitecture(
          name="Qwen2ForCausalLM",  # Supports HuggingFace model class names
          task=PipelineTask.TEXT_GENERATION,
          example_repo_ids=["Qwen/Qwen2.5-7B-Instruct", "Qwen/QwQ-32B"],
          default_weights_format=WeightsFormat.safetensors,
          default_encoding=SupportedEncoding.bfloat16,
          supported_encodings={
              SupportedEncoding.float32: [KVCacheStrategy.PAGED],
              SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
          },
          pipeline_model=Qwen2Model,  # Plug in a custom model implementation
          tokenizer=TextTokenizer,
          rope_type=RopeType.normal,
          weight_adapters={  # Reuse composable converters & utilities
              WeightsFormat.safetensors: llama3.weight_adapters.convert_safetensor_state_dict,
              WeightsFormat.gguf: llama3.weight_adapters.convert_gguf_state_dict,
          },
      )
    
  • Measurable performance

    See the numbers for yourself. MAX includes max benchmark, an open-source benchmarking tool adapted from vLLM. Run it against your endpoint with datasets like ShareGPT or arxiv-summarization, or bring your own. Export shareable YAML configs for reproducible results.