
Right-Sizing GPU and CPU Resources for Training and Inferencing Using Kubernetes

Learn how to harness advances in Kubernetes to optimize resource allocation by prioritizing workloads dynamically and efficiently based on need.

By Sanjay Basu, PhD and Victor Agreda · Oct. 25, 24

The rapid rise of AI services has created a massive demand for computing resources, making efficient management of those resources a critical challenge. While running AI workloads with Kubernetes has come a long way, optimizing scheduling based on dynamic demand continues to be an area for improvement. Many organizations face constraints related to the cost and availability of GPU clusters worldwide and often rely on the same compute clusters for inference workloads and continuous model training and fine-tuning.  

AI Model Training and Model Inferencing in Kubernetes

Training typically requires far more computational power than inferencing. On the other hand, inferencing happens far more often, since a deployed model is asked to make predictions repeatedly across many applications. Let’s explore how we can harness the best of what the cloud has to offer, along with advances in Kubernetes, to optimize resource allocation by prioritizing workloads dynamically and efficiently based on need.

The diagram below shows the process of training versus inferencing. Training workloads may run less frequently but demand more resources, since we are essentially “teaching” the model how to respond to new data. Once trained, a model is deployed and will often run on GPU compute instances to provide the best results with low latency. Inferencing thus runs more frequently, but less intensely. All the while, we may go back and retrain a model to accommodate new data, or even try other models that need to be trained before deployment.

 

[Figure: AI Model Training vs. AI Model Inferencing]


AI workloads, especially training, resemble High Performance Computing (HPC) workloads. Kubernetes wasn’t designed for HPC, but because Kubernetes is open source and largely led by the community, there have been rapid innovations in this space. The need for optimization has led to the development of tools like Kubeflow and Kueue.

AI Workloads for Kubernetes

Kubeflow uses pipelines to organize the steps of a data science workflow into logical blocks of operation, and it offers numerous libraries that plug into these steps so you can get up and running quickly.

Kueue provides resource “flavors” that allow it to tailor workloads to the hardware provisioning available at the time and schedule the correct workloads accordingly (there’s much more to it, of course). The community has done an outstanding job of addressing issues of scaling, efficiency, distribution, and scheduling with these tools and more.    
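
As a concrete, minimal sketch of the flavor idea, the Kueue manifests below define a flavor for RDMA-enabled GPU nodes and a cluster queue that admits workloads against a quota of those GPUs. This is an illustration under assumptions, not configuration from the original deployment: it presumes the Kueue CRDs are installed, the names gpu-rdma and ai-cluster-queue are invented here, and the node labels and gpuconfig.com/gpu resource name mirror the fictional placeholders used in the rest of this article.

YAML
 
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-rdma
spec:
  # Workloads admitted under this flavor are steered to nodes
  # carrying the same labels our pods select on below.
  nodeLabels:
    gpu-type: <gpu name>
    rdma: "true"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["gpuconfig.com/gpu"]
    flavors:
    - name: gpu-rdma
      resources:
      - name: "gpuconfig.com/gpu"
        nominalQuota: 8  # at most 8 of these GPUs admitted at once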

Below is an example of how we can use Kubernetes to schedule and prioritize training and inference jobs on GPU clusters backed by Remote Direct Memory Access (RDMA, specifically RoCEv2). Let's create some sample code to demonstrate this concept. Note: In the code, we use a fictional domain, gpuconfig.com, for the GPU manufacturer, and <gpu name> is a placeholder for the specific GPU you wish to target.

YAML
 
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-<gpu name>
value: 1000000
globalDefault: false
description: "This priority class should be used for high priority <GPU NAME> GPU jobs only."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-<gpu name>
value: 100000
globalDefault: false
description: "This priority class should be used for medium priority <GPU NAME> GPU jobs."
---
apiVersion: v1
kind: Pod
metadata:
  name: high-priority-gpu-job
spec:
  priorityClassName: high-priority-<gpu name>
  containers:
  - name: gpu-container
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 1
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"
---
apiVersion: v1
kind: Pod
metadata:
  name: medium-priority-gpu-job
spec:
  priorityClassName: medium-priority-<gpu name>
  containers:
  - name: gpu-container
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 1
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"

This Kubernetes configuration demonstrates how to prioritize jobs on our GPU nodes using an RDMA backbone. Let's break down the key components: 

1. PriorityClasses: We've defined two priority classes for our GPU jobs:

  • high-priority-<gpu name>: For critical jobs that need immediate execution.
  • medium-priority-<gpu name>: For jobs that are important but can wait if necessary.

2. Pod Specifications: We've created two sample pods to show how to use these priority classes:

  • high-priority-gpu-job: Uses the high-priority-<gpu name> class.
  • medium-priority-gpu-job: Uses the medium-priority-<gpu name> class.

3. Node Selection: Both pods use nodeSelector to ensure they're scheduled on nodes that have the specified GPU type and RDMA enabled:

YAML
 
  nodeSelector:
    gpu-type: <gpu name>
    rdma: "true"

4. Resource Requests: Each pod requests one GPU:

YAML
 
  resources:
    limits:
      gpuconfig.com/gpu: 1


Kubernetes uses priority classes to determine the order in which pods are scheduled and which pods are evicted if resources are constrained. Here's an example of how you might create a CronJob that uses a high-priority class:

YAML
 
apiVersion: batch/v1
kind: CronJob
metadata:
  name: high-priority-ml-training
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          name: ml-training-job
        spec:
          priorityClassName: high-priority-<gpu name>
          containers:
          - name: ml-training
            image: your-ml-image:latest
            resources:
              limits:
                gpuconfig.com/gpu: 2
          restartPolicy: OnFailure
          nodeSelector:
            gpu-type: <gpu name>
            rdma: "true"
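
One related knob worth knowing about: by default, a higher-priority pod may preempt (evict) lower-priority pods to make room. If you want a class that jumps the scheduling queue without evicting running training jobs, Kubernetes lets you set preemptionPolicy: Never on the PriorityClass. A minimal sketch, with an invented name and value chosen for this example:

YAML
 
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting-<gpu name>
value: 900000
preemptionPolicy: Never  # scheduled ahead of lower-priority pods, but never evicts them
globalDefault: false
description: "High scheduling priority without preempting running GPU jobs."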

 

GPU Resource Management in Kubernetes

Below are some examples of GPU resource management in Kubernetes: a namespace-scoped ResourceQuota, default per-container GPU limits via a LimitRange, and a burstable priority class.

YAML
 
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-workloads
spec:
  hard:
    # For extended resources such as GPUs, Kubernetes only supports
    # quota items with the requests. prefix.
    requests.gpuconfig.com/gpu: 8
---
apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-workloads
spec:
  limits:
  - default:
      gpuconfig.com/gpu: 1
    defaultRequest:
      gpuconfig.com/gpu: 1
    max:
      gpuconfig.com/gpu: 4
    min:
      gpuconfig.com/gpu: 1
    type: Container
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-burst
value: 1000000
globalDefault: false
description: "This priority class allows for burst GPU usage, but may be preempted."
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burst-job
  namespace: ml-workloads
spec:
  priorityClassName: gpu-burst
  containers:
  - name: gpu-job
    image: gpu/<gpu image>
    command: ["<gpu vendor>-smi"]
    resources:
      limits:
        gpuconfig.com/gpu: 2
  nodeSelector:
    gpu-type: <gpu name>

 

In the past, it was a challenge to know the current state of the hardware when prioritizing workloads, but open-source tooling now fills that gap. For monitoring GPU utilization, we use tools like Prometheus and Grafana. Here's a sample Prometheus configuration to scrape GPU metrics:

YAML
 
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'gpu_gpu_exporter'
    static_configs:
      - targets: ['localhost:9835']


And here's a simple Python script that we are using to optimize GPU allocation based on utilization metrics:

Python
 
import kubernetes
from prometheus_api_client import PrometheusConnect


def get_gpu_utilization(prometheus_url, pod_name):
    """Return the GPU utilization (%) reported for a pod, or 0 if no data."""
    prom = PrometheusConnect(url=prometheus_url, disable_ssl=True)
    query = f'gpu_gpu_utilization{{pod="{pod_name}"}}'
    result = prom.custom_query(query)
    return float(result[0]['value'][1]) if result else 0


def optimize_gpu_allocation():
    kubernetes.config.load_kube_config()
    v1 = kubernetes.client.CoreV1Api()

    # Only consider pods we have labeled as GPU workloads.
    pods = v1.list_pod_for_all_namespaces(label_selector='gpu=true').items
    for pod in pods:
        utilization = get_gpu_utilization('http://prometheus:9090', pod.metadata.name)
        if utilization < 30:  # If GPU utilization is less than 30%
            # Reduce the pod's GPU allocation to a single GPU.
            # Note: resizing resources on a running pod requires the
            # in-place pod resize feature in newer Kubernetes versions;
            # otherwise the pod must be recreated with the new limits.
            patch = {
                "spec": {
                    "containers": [{
                        "name": pod.spec.containers[0].name,
                        "resources": {
                            "limits": {
                                "gpuconfig.com/gpu": "1"
                            }
                        }
                    }]
                }
            }
            v1.patch_namespaced_pod(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                body=patch,
            )
            print(f"Reduced GPU allocation for pod {pod.metadata.name}")


if __name__ == "__main__":
    optimize_gpu_allocation()

 

This script checks GPU utilization for each labeled pod and reduces its allocation when utilization is low. We run it on a schedule to continuously optimize resource usage, as sketched below.
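
As one illustration (not taken from the original deployment), here is a hypothetical CronJob that could run the optimizer every 15 minutes inside the cluster. The image name optimizer-image:latest and the service account gpu-optimizer are placeholders: the image would bundle the script above, the service account would need RBAC permission to list and patch pods, and when run in-cluster the script would call kubernetes.config.load_incluster_config() rather than load_kube_config().

YAML
 
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-allocation-optimizer
spec:
  schedule: "*/15 * * * *"  # every 15 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gpu-optimizer  # needs RBAC to list/patch pods
          containers:
          - name: optimizer
            image: optimizer-image:latest  # image bundling the script above
          restartPolicy: OnFailure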

Leveraging Kubernetes to Manage GPU and CPU Resources

We leveraged Kubernetes with OCI Kubernetes Engine (OKE) to dynamically manage GPU and CPU resources across training and inference workloads for AI models. Specifically, we focused on right-sizing GPU allocations on nodes with RDMA (RoCEv2) capabilities. We developed Kubernetes configurations and Helm charts, including custom priority classes, node selectors, and resource quotas, to ensure optimal scheduling and resource prioritization for both high-priority and medium-priority AI tasks.

By utilizing Kubernetes' flexibility and OKE’s management capabilities on Oracle Cloud Infrastructure (OCI), we balanced the heavy compute demands of training with the lighter demands of inferencing. This ensured that resources were dynamically allocated, reducing waste while maintaining high performance for critical tasks. Additionally, we integrated monitoring tools like Prometheus to track GPU utilization and adjust allocations automatically using a Python script. This automation helped optimize performance while managing costs and availability.

In Conclusion

The solutions we outlined here apply across cloud and on-premises platforms that use Kubernetes for AI/ML workloads. No matter the hardware or compute platform, the key principles of using Kubernetes for dynamic scheduling and resource management remain the same. Kubernetes allows organizations to prioritize their workloads efficiently, optimizing their use of whatever hardware is available. By using the same approach, enterprises can fine-tune their infrastructure, reduce bottlenecks, and cut down on underutilized resources, leading to more efficient and cost-effective operations.


This article was shared as part of DZone's media partnership with KubeCon + CloudNativeCon.
