Modular acquires BentoML to deliver production AI in the cloud!

Blazing fast inference. Hardware portability. Full stack control.

Customize performance down to the kernel, while deploying seamlessly across GPU vendors without code rewrites.

  • Full stack control

    Control execution from models to kernels, with clear performance visibility.

  • Full customization

    Use fine-tuned weights and custom models without workflow changes.

  • Deep Observability

    Low-level telemetry reveals bottlenecks and optimization opportunities.

  • Portability across hardware

    Run on hardware that optimizes price, availability, and performance.

Modular vs. Alternatives

  • Self Managed

    Too much expertise required

    • Expensive MLOps team

    • Custom optimizations require CUDA experts ($200k-300k/year)

    • NVIDIA-only code limits hardware flexibility

    • Compute contract nightmares

  • Modular

    The sweet spot

    • Customize your performance easily, down to the kernel

    • Scale seamlessly without an MLOps team

    • Same code runs on NVIDIA + AMD

    • Deep observability to know what’s working

  • Inference Endpoints

    Cookie-cutter offering (no control)

    • Black-box optimizations you can't customize

    • Waiting on vendor roadmap for your needs

    • No visibility into performance bottlenecks

Why Modular outperforms

  • Advanced Compiler

    Kernel fusion and dynamic batching optimized for code generation patterns

  • Efficient Runtime

    90% smaller containers enable faster scaling and lower infrastructure overhead

  • Intelligent Batching

    Adapts to real-world traffic spikes during business hours

  • Hardware arbitrage

    Execute workloads on the right hardware for the task at hand.

  • Granular metrics and dashboards

    Fine-grained visibility into performance, usage, and more, making issues easy to spot.

  • Forward deployed engineering support

    Engineers work directly with your team to deploy, tune, and operate systems.

Deploy in Modular’s Cloud

We handle autoscaling your traffic across hardware. Our AI infrastructure runs on both NVIDIA and AMD GPUs without code changes, so future flexibility is baked in.
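Because the serving layer exposes an OpenAI-compatible chat-completions API, the same client code works regardless of which GPU vendor the workload lands on. Below is a minimal sketch of building such a request; the endpoint URL and model identifier are placeholders for illustration, not real values — substitute the ones from your own deployment.

```python
import json

# Hypothetical values — replace with your deployment's endpoint and model ID.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "your-model-id"


def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions request body.

    Since the API surface is OpenAI-compatible, this client code is
    unchanged whether the server runs on NVIDIA or AMD hardware.
    """
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    body = build_request("Summarize our Q3 report in one sentence.")
    print(json.dumps(body, indent=2))
```

The request body can then be POSTed to `ENDPOINT` with any HTTP client; hardware selection and autoscaling happen server-side, invisible to the caller.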

Supported hardware:

  • Full production support for the following NVIDIA GPUs

    • H100

    • A100

    • L40S

    • L4

  • Full production support for the following AMD GPUs

    • MI355X

    • MI300X

    • MI250X

    • MI210

    Achieve 30-60% lower costs with Modular on AMD hardware.

Coming soon:

Custom accelerators - let us know what you want!

Hardware Independence = Business Resilience

Why Portability Matters to Your Business:

  • Negotiation Power

    Not locked to a single GPU vendor. AMD offers 30-60% cost savings. Better supply availability.

  • Risk Mitigation

    No single point of failure. Multi-cloud without complexity. Platform vendor independence.

Deployment Flexibility:

  • Our Cloud or Yours

    Deploy on our cloud or in your own environment, with the same capabilities and performance.

    See Deployment Options

Why teams are switching to Modular

“~70% faster compared to vanilla vLLM”

"Our collaboration with Modular is a glimpse into the future of accessible AI infrastructure. Our API now returns the first 2 seconds of synthesized audio on average ~70% faster compared to vanilla vLLM based implementation, at just 200ms for 2 second chunks. This allowed us to serve more QPS with lower latency and eventually offer the API at a ~60% lower price than would have been possible without using Modular’s stack."

Igor Poletaev

Chief Science Officer - Inworld

Latest customer case studies:

AI batch processing is now cheaper than anyone thought possible

When selling GPUs as a commodity meets the fastest inference engine, cost savings can skyrocket.

Modular partners with AWS to democratize AI Infrastructure

Modular partnered with AWS to bring MAX to AWS Marketplace, offering SOTA performance for GenAI workloads across GPU types.

Unleashing AI performance on AMD GPUs with Modular's Platform

Modular partners with AMD to bring the AI ecosystem more choice with state-of-the-art performance on AMD Instinct GPUs.

Revolutionizing your own research to production

Modular enables Qwerky AI to do advanced AI research, write optimized code, and deploy across NVIDIA, AMD, and other types of silicon.

Go Deeper