Erwan Gallen

Dec 20, 2023 46 min read

Unlock the Power of Mistral AI with Red Hat OpenShift AI and NVIDIA DGX H100


I will guide you through the process of deploying Red Hat OpenShift AI on the NVIDIA DGX H100 system and running the Mistral AI model. This blog post details the process of deploying and managing a fully automated MLOps solution for a large language model (LLM), presented in three main parts:

  • Deploying OpenShift
  • Installing OpenShift AI
  • Running the Mistral AI model with the Hugging Face Text Generation Inference toolkit

The NVIDIA DGX H100 marks a pivotal moment in the evolution of artificial intelligence (AI) infrastructure. Powered by the revolutionary NVIDIA H100 Tensor Core GPU, this powerhouse delivers an unprecedented 32 petaFLOPS of AI performance, surpassing previous generations by a staggering 9x. NVIDIA H100 GPUs with TensorRT-LLM allow you to convert model weights into a new FP8 format easily and to compile models to take advantage of optimized FP8 kernels automatically. NVIDIA Hopper Transformer Engine technology makes this performance gain possible without changing any model code. This remarkable leap in computing capability is complemented by a suite of cutting-edge advancements, including dual Intel Xeon CPUs with a total of 112 cores, 2TB of RAM, 8 x H100 GPUs connected through NVSwitch for a total of 640GB of GPU memory, PCIe Gen5 interconnects, NVIDIA ConnectX-7 network interface cards, and InfiniBand networking for ultra-fast AI training and inferencing.

The NVIDIA DGX H100 is more than just a powerful machine; it’s an accelerator for innovation, enabling researchers, scientists, and developers to push the boundaries of AI and unlock once unimaginable solutions. As we embark on this new era of AI-powered innovation, the NVIDIA DGX H100 stands at the forefront, empowering us to address the world’s most pressing challenges and shape a brighter future for all.

First look at the DGX H100 BMC

The NVIDIA DGX H100 BMC, or Baseboard Management Controller, is a hardware component that provides out-of-band management capabilities for the DGX H100 system. This means that you can access and control the system even if it is turned off or the operating system is not booted. The BMC can be used to perform a variety of tasks, such as:

  • Monitoring system health and status
  • Configuring system settings
  • Remotely powering on, off, or restarting the system
  • Accessing the system’s console and troubleshooting issues
  • Performing firmware updates and maintenance tasks

You need to connect the BMC (out-of-band system management) 1 GbE RJ45 interface and allocate one IP address:

NVIDIA DGX H100 rear

To use the BMC, we will use a web browser with the allocated IP address. Once connected, you will be able to view and manage the system.
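Besides the web UI, the BMC can also be queried out of band through its standard Redfish API. A minimal sketch, assuming the BMC is reachable at 192.168.1.10 with placeholder credentials (the exact resource paths below the Systems collection can vary between BMC firmware versions):

egallen@laptop ~ % curl -sk -u admin:PASSWORD https://192.168.1.10/redfish/v1/Systems | python3 -m json.tool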

First, we log into the BMC: NVIDIA DGX H100 BMC login

We can see the health status of the server, the uptime and access to all the administration tools: NVIDIA DGX H100 BMC dashboard

We are pleased to see the 8 x NVIDIA H100 GPUs listed, one entry per GPU, in the GPU Information menu: NVIDIA DGX H100 BMC dashboard

We can find the proper Vendor ID (VID) and Device ID (DID):

  • GPU PCI VID: 0x10de
  • GPU PCI DID: 0x2330
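Once an operating system is running on the node, these identifiers can be matched against the PCI bus; for example, filtering lspci by vendor and device ID (a quick sanity check, output omitted):

[core@dgxh100 ~]$ lspci -nn -d 10de:2330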

Prepare the OpenShift discovery ISO file

The OpenShift Assisted Installer streamlines the deployment and management of OpenShift clusters with its user-friendly web interface, automated tasks, and centralized management console. It facilitates the installation of OpenShift clusters on bare metal or cloud platforms, ensuring consistency and reducing errors. Its integration with the Red Hat Hybrid Cloud Console further enhances the management experience.

The NVIDIA DGX H100 server was certified for Red Hat OpenShift in October 2023.

The OpenShift AI Self-Managed minimum requirement is two worker nodes, but we will deploy only a Single Node OpenShift (SNO) for this simple test lab.

We can connect to our Red Hat Hybrid Cloud Console here: https://cloud.redhat.com

Red Hat Hybrid Cloud home

You have to log in with your Red Hat account.

Red Hat Hybrid Cloud login

Go to the console and Clusters menu: https://console.redhat.com/openshift/

Red Hat Hybrid Cloud list clusters

Click on Create cluster.

Red Hat Hybrid Cloud create a cluster

Click on the Datacenter tab.

Red Hat Hybrid Cloud create a cluster

Click on Bare Metal (x86_64).

Red Hat Hybrid Cloud create a Bare Metal x86_64 cluster

Pick Interactive

Choose your Cluster name and Base domain. You just need to check Install single node OpenShift (SNO). We pick the release OpenShift 4.14.2.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Next.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Next.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

We can add one host by clicking on Add host.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Paste your SSH public key, then click on “Generate Discovery ISO” and “Download Discovery ISO”.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

A discovery image is a small Linux operating system image that is used to gather information about the nodes that will be part of an OpenShift cluster. The Assisted Installer uses the discovery image to collect data about the nodes, such as their CPU, memory, storage, and network configuration. This information is then used to assess the nodes’ compatibility with OpenShift and to generate the installation configuration file.

You have one Discovery ISO file of 106MB:

egallen@laptop ~ % ls -lah ~/Downloads/dc57ceea-8ef4-41a4-AAAAA-AAAAAAAAAA-discovery.iso
-rw-r--r--@ 1 egallen  staff   106M Dec  2 19:01 /Users/egallen/Downloads/dc57ceea-8ef4-41a4-AAAAA-AAAAAAAAAA-discovery.iso

Boot with the OpenShift ISO file

When a node is booted from the discovery image, it will first connect to the Assisted Installer. The Assisted Installer will then send a set of instructions to the node, which the node will execute. These instructions will cause the node to perform a series of hardware and network tests. The results of these tests will be sent back to the Assisted Installer, which will use them to generate the installation configuration file.

We will boot the DGX H100 with the Discovery ISO image.

The server is currently in the powered-off state, as shown in the Power Control menu: NVIDIA DGX H100 Power Control

To see the console, we can click on Remote Control and then on the Launch H5Viewer button.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on CD Image Browse File, pick your downloaded ISO file, then click Media Boost and the Start Media button.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

We can power on the server.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Press F11

Red Hat Hybrid Cloud create a cluster with Assisted Installer

We can see the menu to select the boot device.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Pick UEFI: AMI Virtual CDROM0 1.00

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Pick RHEL CoreOS (Live) from the GRUB menu.

We can see in the BMC KVM menu that the browser has pushed 123MB, and RHEL CoreOS from the discovery image is booting.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

We can see the discovery image prompt:

Red Hat Hybrid Cloud create a cluster with Assisted Installer

The hardware discovery is in progress.

After one minute, we can see the host inventory in the console.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Next.

We install the system on one NVMe drive only (we will use another disk for LVM storage with OpenShift later).

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Next.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Next.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Click on Install cluster

Red Hat Hybrid Cloud create a cluster with Assisted Installer

The node is rebooting.

Red Hat Hybrid Cloud create a cluster with Assisted Installer

Connect to the OpenShift Console

OpenShift is installed and you can find the administrator credentials in the Red Hat Hybrid Cloud Console.

Red Hat OpenShift Console

We can retrieve the:

  • OpenShift console URL
  • kubeadmin password
  • kubeconfig file

If you have no domain name server configured, you can also copy/paste the /etc/hosts content shown when clicking on “Not able to access the Web Console?”.

Red Hat OpenShift Console
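For this lab, the /etc/hosts entries look roughly like the following, where 192.0.2.10 is a placeholder for the node IP address (the exact list of *.apps hostnames is given by the console pop-up):

192.0.2.10 api.dgxh100.redhat.com
192.0.2.10 oauth-openshift.apps.dgxh100.redhat.com
192.0.2.10 console-openshift-console.apps.dgxh100.redhat.com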

We can log in.

Red Hat OpenShift Console

The OpenShift cluster is up, and its status is presented on the Console home page.

Red Hat OpenShift Console

We can see the node with control-plane and worker roles, 224 vCPUs (112 physical cores with hyper-threading), and 2TB of RAM.

Setup your OpenShift Command Line Interface

To download the oc command, click on ? in the top right corner, then on Command Line Tools, and pick the archive for your system and architecture.

Download the oc binary

I’m taking the Mac for ARM 64 archive.

I can unzip this archive.

egallen@laptop ~ % sudo unzip ~/Downloads/oc.zip -d /usr/local/bin
Archive:  /Users/egallen/Downloads/oc.zip
 extracting: /usr/local/bin/oc

You have to authorize this binary; launch it once:

egallen@laptop ~ % /usr/local/bin/oc
zsh: killed     /usr/local/bin/oc

Download the oc binary

(For Mac users, allow the binary in the macOS System Settings, under Privacy & Security, and click on Allow Anyway.) Download the oc binary

In the top right corner of the OpenShift Console, we can click on ?, then Copy login command and Display Token, and copy the token.

We can run the oc login command:

egallen@laptop ~ % oc login --token=sha256~XXXXXXXXXXXXXXXXXXXXXXXXXXX --server=https://api.dgxh100.redhat.com:6443
WARNING: Using insecure TLS client config. Setting this option is not supported!

Logged into "https://api.dgxh100.redhat.com:6443" as "kube:admin" using the token provided.

You have access to 69 projects, the list has been suppressed. You can list all projects with 'oc projects'

Using project "default".

The client is at version 4.14:

egallen@laptop ~ % oc version
Client Version: 4.14.0-202311021650.p0.g9b1e0d2.assembly.stream-9b1e0d2
Kustomize Version: v5.0.1
Server Version: 4.14.2
Kubernetes Version: v1.27.6+f67aeb3

We can use our OpenShift 4.14.2 cluster:

egallen@laptop ~ % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.2    True        False         10m     Cluster version is 4.14.2

Update the OpenShift cluster

[ You may not need to update your cluster; this step is not mandatory. ]

We will update the cluster to the latest OpenShift 4.14 release, from OpenShift 4.14.2 to OpenShift 4.14.3.
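The same update can also be triggered from the CLI instead of the console; a sketch, assuming 4.14.3 is available in the current update channel:

egallen@laptop ~ % oc adm upgrade --to=4.14.3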

Red Hat OpenShift update

Click on Select a version

Red Hat OpenShift update

Click on Update

Red Hat OpenShift update

The update is in progress:

egallen@laptop ~ % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.2    True        True          17m     Working towards 4.14.3: 635 of 860 done (73% complete)

After an automatic reboot, we have the latest OpenShift 4.14 release available as of today:

egallen@laptop ~ % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.3    True        False         15s     Cluster version is 4.14.3

Red Hat OpenShift update

Our Single Node OpenShift is up to date \o/.

Check the presence of the H100 GPUs

Because we have provided our public key to the OpenShift Installer, we can ssh to the node (connecting directly with SSH is not a good practice for administering production OpenShift clusters):

egallen@laptop ~ % ssh core@dgxh100.redhat.com
Red Hat Enterprise Linux CoreOS 414.92.202311150705-0
  Part of OpenShift 4.14, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.14/architecture/architecture-rhcos.html

We can run the lspci command and list the GPUs:

[core@dgxh100 ~]$ lspci | grep H100
1b:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
43:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
52:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
61:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
9d:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
c3:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
d1:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)
df:00.0 3D controller: NVIDIA Corporation GH100[H100 SXM5 80GB] (rev a1)

Install the Node Feature Discovery operator

We can now install the NVIDIA GPU drivers and the NVIDIA device plugin.

The Node Feature Discovery Operator (NFD) is a Kubernetes operator that automates the process of discovering and labeling node features in a Kubernetes cluster. It deploys workers on each node in the cluster that scan the hardware and software configuration to identify the available features. It then labels the node with these features, which can be used by other operators and applications to determine which nodes are suitable for running specific workloads. This can help to improve the efficiency of resource allocation and the overall performance of the cluster.

NFD will label the node with the NVIDIA GPU PCIe vendor ID label (we have only one node for now, but this operator is required for the NVIDIA GPU Operator installation).

Node Feature Discovery installation

You should take the Node Feature Discovery operator tagged Red Hat (not the one tagged Community).

Node Feature Discovery installation

Click on Install.

Node Feature Discovery installation

Click on Install.

Node Feature Discovery installation

Click on Install.

Node Feature Discovery installation

Click on View Operator.

Node Feature Discovery installation

You can now create an instance: click on Create instance for the NodeFeatureDiscovery.

NFD has labeled the node; we can check the NFD labels with the command:

egallen@laptop ~ % oc describe node dgxh100.redhat.com | grep feature.node.kubernetes.io
...

We can just check if we find the NVIDIA PCI vendor ID label:

egallen@laptop ~ % oc describe node dgxh100.redhat.com | grep feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.present=true
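This label can also be used to select GPU nodes directly; for example, listing every node that exposes an NVIDIA PCI device:

egallen@laptop ~ % oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true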

Install the NVIDIA GPU Operator

The NVIDIA GPU Operator manages NVIDIA GPUs on Kubernetes clusters. It automates the provisioning, configuring, and monitoring of NVIDIA GPUs, ensuring that GPUs are available and correctly configured for application use. The NVIDIA GPU Operator also supports NVIDIA’s GPUDirect technologies, which enable applications to communicate directly with GPUs without relying on the CPU, improving application performance. The NVIDIA GPU Operator is a valuable tool for any Kubernetes cluster that uses NVIDIA GPUs. It can help to simplify the management of GPUs, improve application performance, and reduce the overall complexity of operating a Kubernetes cluster with GPUs.

For now, we have PCIe node labels but no NVIDIA drivers, no NVIDIA Device Plugin, and no GPU monitoring. Before the NVIDIA GPU Operator installation, the nvidia.com/gpu resources are not exposed to the Kubernetes scheduler:

egallen@laptop ~ % oc describe node | grep nvidia.com/gpu
egallen@laptop ~ %

We can now install the NVIDIA GPU Operator.

Search NVIDIA GPU Operator in Operators > OperatorHub.

NVIDIA GPU Operator installation

Click on NVIDIA GPU Operator.

NVIDIA GPU Operator installation

Click on Install.

NVIDIA GPU Operator installation

Click on Install.

NVIDIA GPU Operator installation

Click on View Operator.

NVIDIA GPU Operator installation

Click on Create Instance in the ClusterPolicy.

You can keep the default values.

NVIDIA GPU Operator installation

Click on Create.
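You can also follow the ClusterPolicy from the CLI; assuming the default instance name gpu-cluster-policy, the status should eventually report ready (the exact status field may vary between operator versions):

egallen@laptop ~ % oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'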

You can check the NVIDIA GPU operator installation progress by listing the pods running in the project nvidia-gpu-operator:

egallen@laptop ~ % oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-lqqzt                           0/1     Init:0/1   0          85s
gpu-operator-68f86df569-5r5hs                         1/1     Running    0          6m32s
nvidia-container-toolkit-daemonset-p8lnz              0/1     Init:0/1   0          85s
nvidia-dcgm-exporter-98l6p                            0/1     Init:0/2   0          85s
nvidia-dcgm-nxnrs                                     0/1     Init:0/1   0          85s
nvidia-device-plugin-daemonset-6dm6m                  0/1     Init:0/1   0          85s
nvidia-driver-daemonset-414.92.202311150705-0-fkjpm   1/2     Running    0          2m6s
nvidia-node-status-exporter-6njsm                     1/1     Running    0          2m6s
nvidia-operator-validator-jd6ff                       0/1     Init:0/4   0          85s

The installation is complete:

egallen@laptop ~ % oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-lqqzt                           1/1     Running     0          3m48s
gpu-operator-68f86df569-5r5hs                         1/1     Running     0          8m55s
nvidia-container-toolkit-daemonset-p8lnz              1/1     Running     0          3m48s
nvidia-cuda-validator-ms9nc                           0/1     Completed   0          48s
nvidia-dcgm-exporter-98l6p                            1/1     Running     0          3m48s
nvidia-dcgm-nxnrs                                     1/1     Running     0          3m48s
nvidia-device-plugin-daemonset-6dm6m                  1/1     Running     0          3m48s
nvidia-driver-daemonset-414.92.202311150705-0-fkjpm   2/2     Running     0          4m29s
nvidia-mig-manager-v7vkm                              1/1     Running     0          21s
nvidia-node-status-exporter-6njsm                     1/1     Running     0          4m29s
nvidia-operator-validator-jd6ff                       1/1     Running     0          3m48s

We can see that the final step, the nvidia-operator-validator, reports a successful installation:

egallen@laptop ~ % oc logs nvidia-operator-validator-jd6ff  -n nvidia-gpu-operator
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
all validations are successful

We are good: we have 8 GPUs ready to be used:

egallen@laptop ~ % oc describe node | egrep 'Capacity|nvidia.com/gpu:|Allocatable:'
Capacity:
  nvidia.com/gpu:     8
Allocatable:
  nvidia.com/gpu:     8

Basic UBI base image test

Testing one UBI pod with 1 x H100 GPU

nvidia-smi is a command-line tool that provides comprehensive information about NVIDIA GPUs, enabling users to monitor and optimize GPU performance, track usage, maintain optimal temperature and fan speed, and benchmark applications to identify performance bottlenecks.

We will run a simple CUDA UBI pod to check the NVIDIA GPU status with the nvidia-smi command.

We apply the pod spec:

egallen@laptop ~ %  cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: command-nvidia-smi
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
      command: ["/bin/sh","-c"]
      args: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

We can check the logs; we have one H100 allocated as requested:

egallen@laptop ~ % oc get pods
NAME                 READY   STATUS      RESTARTS   AGE
command-nvidia-smi   0/1     Completed   0          7s

egallen@laptop ~ % oc logs command-nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   31C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

egallen@laptop ~ % oc delete pod command-nvidia-smi
pod "command-nvidia-smi" deleted

Testing one UBI pod with 8 x H100 GPUs

We can try to schedule a pod with 8 x H100 GPUs:

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: command-nvidia-smi
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
      command: ["/bin/sh","-c"]
      args: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 8 # requesting 8 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

We can check the logs; we have 8 x H100 GPUs allocated as requested:

egallen@laptop ~ % oc get pods
NAME                 READY   STATUS      RESTARTS   AGE
command-nvidia-smi   0/1     Completed   0          34s

egallen@laptop ~ % oc logs command-nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                    0 |
| N/A   27C    P0              71W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                    0 |
| N/A   31C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                    0 |
| N/A   29C    P0              71W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                    0 |
| N/A   26C    P0              71W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                    0 |
| N/A   25C    P0              70W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                    0 |
| N/A   29C    P0              73W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                    0 |
| N/A   31C    P0              72W / 700W |      2MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

egallen@laptop ~ % oc delete pod command-nvidia-smi
pod "command-nvidia-smi" deleted

Setup Persistent Volume with the LVM Operator

Red Hat OpenShift Data Foundation is recommended as the software-defined storage solution for an entire cluster. Because we are running a Single Node OpenShift, we will use the OpenShift LVM Operator. The OpenShift LVM Operator is a tool that automates the creation, management, and extension of Logical Volume Manager (LVM) volumes on OpenShift clusters. It enables users to provision and manage storage resources for their applications efficiently. The Operator simplifies storage provisioning by creating and managing LVM volumes using custom resource definitions (CRDs). This eliminates the need for manual configuration and reduces the risk of errors. Additionally, the Operator provides a centralized view of all LVM volumes in the cluster, making it easy to monitor and troubleshoot storage issues.

Red Hat OpenShift AI will require one Persistent Volume.

We have no PV available for now:

egallen@laptop ~ % oc get pv
No resources found

We can check the disks available on the server; we are only using /dev/nvme0n1:

[core@dgxh100 ~]$ sudo lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1     259:0    0  1.7T  0 disk
├─nvme0n1p1 259:1    0    1M  0 part
├─nvme0n1p2 259:2    0  127M  0 part
├─nvme0n1p3 259:3    0  384M  0 part /boot
└─nvme0n1p4 259:4    0  1.7T  0 part /var/lib/kubelet/pods/190e9f63-c1ac-458f-8708-b2f7f110da38/volume-subpaths/nvidia-mig-manager-entrypoint/nvidia-mig-manager/0
                                     /var/lib/kubelet/pods/d1a57079-19e1-4d4b-8df3-ea16c16dae8c/volume-subpaths/nvidia-device-plugin-entrypoint/nvidia-device-plugin/0
                                     /var/lib/kubelet/pods/311e95ae-d509-47b9-8a10-e4804533cd04/volume-subpaths/init-config/init-pod-nvidia-node-status-exporter/1
                                     /var/lib/kubelet/pods/a1ebc821-3891-4a2d-82ae-549076f3fe12/volume-subpaths/nvidia-container-toolkit-entrypoint/nvidia-container-toolkit-ctr/0
                                     /run/nvidia/driver/etc/hosts
                                     /run/nvidia/driver/mnt/shared-nvidia-driver-toolkit
                                     /run/nvidia/driver/host-etc/os-release
                                     /run/nvidia/driver/var/log
                                     /run/nvidia/driver/dev/termination-log
                                     /var/lib/kubelet/pods/c065cb71-7f77-444b-a9d7-0e0cf2b02a22/volume-subpaths/nginx-conf/monitoring-plugin/1
                                     /var
                                     /sysroot/ostree/deploy/rhcos/var
                                     /usr
                                     /etc
                                     /
                                     /sysroot
nvme1n1     259:5    0  1.7T  0 disk
nvme2n1     259:6    0  3.5T  0 disk
nvme4n1     259:7    0  3.5T  0 disk
nvme5n1     259:8    0  3.5T  0 disk
nvme3n1     259:9    0  3.5T  0 disk
nvme6n1     259:10   0  3.5T  0 disk
nvme7n1     259:11   0  3.5T  0 disk
nvme8n1     259:12   0  3.5T  0 disk
nvme9n1     259:13   0  3.5T  0 disk

We will use /dev/nvme4n1 for the LVM Storage operator.

In the OpenShift Console, we can go to Operators > OperatorHub and search for lvm.

LVM Operator

Click on LVM Storage.

LVM Operator

Click on Install.

LVM Operator

The LVM Operator is installed.

LVM Operator

Click on Create LVMCluster.

LVM Operator

I’m calling my LVMCluster data-lvmcluster, and clicking on Create.
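For reference, the same LVMCluster could also be created from the CLI; a minimal sketch matching the values used in the console form (the /dev/nvme4n1 disk and the thin pool settings mirror the LVMVolumeGroup shown later; field names may differ between LVM Storage versions):

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: data-lvmcluster
  namespace: openshift-storage
spec:
  storage:
    deviceClasses:
      - name: vg1
        default: true
        deviceSelector:
          paths:
            - /dev/nvme4n1
        thinPoolConfig:
          name: thin-pool-1
          overprovisionRatio: 10
          sizePercent: 90
EOF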

LVM Operator

The operator is properly installed:

egallen@laptop ~ %  oc get csv -n openshift-storage -o custom-columns=Name:.metadata.name,Phase:.status.phase
Name                    Phase
lvms-operator.v4.14.1   Succeeded

Go to Operators > Installed Operators

LVM Operator

Click on LVM Storage.

LVM Operator

The selected disk, before the Volume Group is carved out:

[core@dgxh100 ~]$ sudo lsblk /dev/nvme4n1
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme4n1 259:7    0  3.5T  0 disk

The Volume Group has been created:

[core@dgxh100 ~]$ sudo pvs
  PV           VG  Fmt  Attr PSize PFree
  /dev/nvme4n1 vg1 lvm2 a--  3.49t <357.70g

[core@dgxh100 ~]$ sudo lsblk /dev/nvme4n1
NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme4n1                   259:7    0  3.5T  0 disk
├─vg1-thin--pool--1_tmeta 253:0    0  1.6G  0 lvm
│ └─vg1-thin--pool--1     253:2    0  3.1T  0 lvm
└─vg1-thin--pool--1_tdata 253:1    0  3.1T  0 lvm
  └─vg1-thin--pool--1     253:2    0  3.1T  0 lvm

The storageclass is available:

egallen@laptop ~ % oc get storageclass
NAME                 PROVISIONER   RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
lvms-vg1 (default)   topolvm.io    Delete          WaitForFirstConsumer   true                   27m
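To verify dynamic provisioning, a hypothetical PersistentVolumeClaim (the name lvms-test-pvc and the 10Gi size are arbitrary) can be created against this storage class; because the volume binding mode is WaitForFirstConsumer, the PV is only provisioned once a pod mounts the claim:

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: lvms-test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: lvms-vg1
EOF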

We can see that the volume snapshot class is created:

egallen@laptop ~ % oc get volumesnapshotclass
NAME       DRIVER       DELETIONPOLICY   AGE
lvms-vg1   topolvm.io   Delete           28m

We can see the lvmvolumegroup:

egallen@laptop ~ % oc get lvmvolumegroup -A
NAMESPACE           NAME   AGE
openshift-storage   vg1    30m

The LVMVolumeGroup vg1 resource is created:

egallen@laptop ~ % oc get lvmvolumegroup vg1 -o yaml -n openshift-storage
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMVolumeGroup
metadata:
  creationTimestamp: "2023-12-02T22:46:11Z"
  finalizers:
  - lvm.openshift.io/lvmvolumegroup
  generation: 1
  name: vg1
  namespace: openshift-storage
  ownerReferences:
  - apiVersion: lvm.topolvm.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: LVMCluster
    name: data-lvmcluster
    uid: 2ca190ad-5212-424a-be5d-062575d24b1d
  resourceVersion: "72769"
  uid: 946ff64d-53bc-4116-9307-1cbf6bed946d
spec:
  default: true
  deviceSelector:
    paths:
    - /dev/nvme4n1
  thinPoolConfig:
    name: thin-pool-1
    overprovisionRatio: 10
    sizePercent: 90

Install Red Hat OpenShift AI

Red Hat OpenShift AI is a flexible, scalable MLOps platform with tools to build, deploy, and manage AI-enabled applications. Built using open-source technologies, it provides trusted, operationally consistent capabilities for teams to experiment, serve models, and deliver innovative apps. OpenShift AI (previously called Red Hat OpenShift Data Science) supports the full lifecycle of AI/ML experiments and models, on-premises and in the public cloud.

We can now install Red Hat OpenShift AI.

We can go to Operators > OperatorHub.

Red Hat OpenShift Data Science operator

We can click on Red Hat OpenShift Data Science.

Red Hat OpenShift Data Science operator

Click on Install.

Red Hat OpenShift Data Science operator

Click on Install.

Red Hat OpenShift Data Science operator

Click on Create DataScienceCluster

Red Hat OpenShift Data Science operator

Click on Create

Red Hat OpenShift Data Science operator

The DataScienceCluster is Ready.

Red Hat OpenShift Data Science operator

We have a new menu available on the top right to launch Red Hat OpenShift AI.

Red Hat OpenShift Data Science operator

Click on Log in with OpenShift

Red Hat OpenShift Data Science operator

We can configure the storage for the image registry as documented for non-production clusters:

https://docs.openshift.com/container-platform/4.14/registry/configuring_registry_storage/configuring-registry-storage-baremetal.html

You must configure storage for the Image Registry Operator. For non-production clusters, you can set the image registry to an empty directory. If you do so, all images are lost if you restart the registry.

You could hit this error if you don’t apply this configuration on SNO with the LVM Operator:

Error: InvalidImageName
Failed to apply default image tag ":2023.2": couldn\'t parse image reference ":2023.2": invalid reference format

Configure these two options only for non-production clusters.

Set the image registry storage to an empty directory:

egallen@laptop ~ % oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
config.imageregistry.operator.openshift.io/cluster patched

Change the Image Registry Operator managementState configuration from Removed to Managed. For example:

egallen@laptop ~ % oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
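A quick check that both patches were applied; the managementState should now report Managed and the storage stanza an empty emptyDir:

egallen@laptop ~ % oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.managementState}{" "}{.spec.storage}{"\n"}'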

PyTorch notebook

The PyTorch notebook is a pre-installed Jupyter notebook environment designed for using PyTorch, a machine learning library, on Red Hat OpenShift AI. It provides a user-friendly interface, ease of use, productivity boost, and tight integration with Red Hat OpenShift AI.

Launch a PyTorch notebook with 8 x NVIDIA H100 GPUs

We can launch a notebook with 8 x NVIDIA H100 GPUs

Red Hat OpenShift Data Science notebook

We click on Launch application.

Red Hat OpenShift Data Science notebook

We choose 8 accelerators, the X Large Container Size, and click on Start server.

Red Hat OpenShift Data Science notebook

The Jupyter Notebook server is started.

Red Hat OpenShift Data Science notebook

We choose the Python 3.9 notebook.

Red Hat OpenShift Data Science notebook

If we run the !nvidia-smi command in the notebook, we can see the 8 x H100 GPUs.

Red Hat OpenShift Data Science notebook

We can test inferencing with the opt-125m model from Facebook.

Launch a PyTorch notebook with 2 x NVIDIA H100 GPUs

If we check before launching the notebook, 0 GPU requests are in progress:

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu|Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=9
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=DGXH100
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     8
Allocatable:
  nvidia.com/gpu:     8
  Resource           Requests      Limits
  nvidia.com/gpu     0             0

We launch a notebook with 2 H100 GPUs: Red Hat OpenShift Data Science notebook

After launching the notebook, 2 GPUs are requested:

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu|Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=9
                    nvidia.com/gpu.compute.minor=0
                    nvidia.com/gpu.count=8
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=true
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=hopper
                    nvidia.com/gpu.machine=DGXH100
                    nvidia.com/gpu.memory=81559
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
                    nvidia.com/gpu.replicas=1
                    nvidia.com/gpu-driver-upgrade-enabled: true
Capacity:
  nvidia.com/gpu:     8
Allocatable:
  nvidia.com/gpu:     8
  Resource           Requests       Limits
  nvidia.com/gpu     2              2

Dig into PyTorch device list

In the notebook, we can list the devices seen by PyTorch:

[*] for device_id in range(0,8):
  print(f'device name [',device_id,']:', torch.cuda.get_device_name(device_id))

device name [ 0 ]: NVIDIA H100 80GB HBM3
device name [ 1 ]: NVIDIA H100 80GB HBM3
device name [ 2 ]: NVIDIA H100 80GB HBM3
device name [ 3 ]: NVIDIA H100 80GB HBM3
device name [ 4 ]: NVIDIA H100 80GB HBM3
device name [ 5 ]: NVIDIA H100 80GB HBM3
device name [ 6 ]: NVIDIA H100 80GB HBM3
device name [ 7 ]: NVIDIA H100 80GB HBM3

Basic PyTorch Benchmark

torchbenchmark/models contains copies of popular or exemplary workloads which have been modified to expose a standardized API for benchmark drivers. PyTorch Benchmark contains a miniature version of train/test data and a dependency install script.

First load the PyTorch benchmark module:

[*] !pip install pytorch-benchmark

[*] import torch
from torchvision.models import efficientnet_b0
from pytorch_benchmark import benchmark

CNN Efficientnet-b0 model image classification

CPU image classification with the efficientnet-b0 model

We can start with a basic CPU benchmark with the image classification model efficientnet-b0. We will run 1000 inferences with PyTorch, with a batch size of 1 or 8.

The DGX H100 has a total of 112 CPU cores.

[*]: model = efficientnet_b0().to("cpu")  # Model device sets benchmarking device
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=1000)

Results for CPU:

Inference batch size   Time to process   Iterations/second
1                      03:38             4.57
8                      03:29             4.77

Time to run: 7 minutes 50 seconds for 1000 runs

GPU image classification with the efficientnet-b0 model

We can continue with a GPU benchmark with the image classification model efficientnet-b0. We will run 1000 inferences with PyTorch, with a batch size of 1 or 8.

[ ]: model = efficientnet_b0().to("cuda")  # Model device sets benchmarking device
sample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)
results = benchmark(model, sample, num_runs=1000)

Results for GPU (CUDA):

Inference batch size   Time to process   Iterations/second
1                      00:03             327.63
8                      00:03             279.89

Time to run: 6 seconds for 1000 runs.

NVIDIA MIG configuration

MIG mixed strategy switch

NVIDIA’s Multi-Instance GPU (MIG) technology allows a single physical GPU to be divided into multiple GPU MIG devices, enabling multiple workloads to run simultaneously on a single GPU, isolating workloads for improved performance and security, and optimizing resource utilization to maximize GPU performance and efficiency. It is a valuable tool for organizations seeking to optimize their GPU resources and reduce costs.

We will start by enabling a MIG mixed strategy.

For each MIG configuration, you have to pick a strategy type and a MIG configuration label. We will test the mixed strategy with the label all-balanced on our Single Node OpenShift running on the DGX H100 server with 8 x H100 80GB GPUs. By default, MIG is disabled, with the single strategy:

egallen@laptop ~ % oc describe node | grep nvidia.com/mig
                    nvidia.com/mig.capable=true
                    nvidia.com/mig.config=all-disabled
                    nvidia.com/mig.config.state=success
                    nvidia.com/mig.strategy=single

The nvidia-mig-manager daemonset is running and will reconfigure the MIG layout of the GPUs:

egallen@laptop ~ % oc -n nvidia-gpu-operator get ds
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
gpu-feature-discovery                           1         1         1       1            1           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      41h
nvidia-container-toolkit-daemonset              1         1         1       1            1           nvidia.com/gpu.deploy.container-toolkit=true                                                                          41h
nvidia-dcgm                                     1         1         1       1            1           nvidia.com/gpu.deploy.dcgm=true                                                                                       41h
nvidia-dcgm-exporter                            1         1         1       1            1           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              41h
nvidia-device-plugin-daemonset                  1         1         1       1            1           nvidia.com/gpu.deploy.device-plugin=true                                                                              41h
nvidia-driver-daemonset-414.92.202311150705-0   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=414.92.202311150705-0,nvidia.com/gpu.deploy.driver=true   41h
nvidia-mig-manager                              1         1         1       1            1           nvidia.com/gpu.deploy.mig-manager=true                                                                                41h
nvidia-node-status-exporter                     1         1         1       1            1           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       41h
nvidia-operator-validator                       1         1         1       1            1           nvidia.com/gpu.deploy.operator-validator=true                                                                         41h

We will apply the mixed strategy with the MIG configuration label all-balanced. Each of the H100 GPUs should enable these MIG profiles:

  • 2 x 1g.10gb
  • 1 x 2g.20gb
  • 1 x 3g.40gb

With 8 x H100 GPUs, we will have on the cluster:

  • 16 x 1g.10gb (8 x 2)
  • 8 x 2g.20gb (8 x 1)
  • 8 x 3g.40gb (8 x 1)

We prepare the variables:

egallen@laptop ~ % NODE_NAME=dgxh100.redhat.com
egallen@laptop ~ % STRATEGY=mixed
egallen@laptop ~ % MIG_CONFIGURATION=all-balanced

Apply the strategy:

egallen@laptop ~ % oc patch clusterpolicy/gpu-cluster-policy --type='json' -p='[{"op": "replace", "path": "/spec/mig/strategy", "value": '$STRATEGY'}]'
clusterpolicy.nvidia.com/gpu-cluster-policy patched

Label the node with the MIG type:

egallen@laptop ~ % oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite
node/dgxh100.redhat.com labeled

Check the logs:

egallen@laptop ~ % oc -n nvidia-gpu-operator logs ds/nvidia-mig-manager --all-containers -f --prefix

[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:34Z" level=debug msg="Running pre-apply-config hook"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:34Z" level=debug msg="Applying MIG device configuration..."
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:35Z" level=debug msg="Walking MigConfig for (device-filter=[0x232110DE 0x233A10DE], devices=all)"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:35Z" level=debug msg="Walking MigConfig for (device-filter=[0x233010DE 0x233110DE 0x232210DE 0x20B210DE 0x20B510DE 0x20F310DE 0x20F510DE], devices=all)"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:35Z" level=debug msg="  GPU 0: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:35Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:35Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:36Z" level=debug msg="  GPU 1: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:36Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:36Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="  GPU 2: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="  GPU 3: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:37Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:38Z" level=debug msg="  GPU 4: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:38Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:38Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="  GPU 5: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="  GPU 6: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:39Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:40Z" level=debug msg="  GPU 7: 0x233010DE"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:40Z" level=debug msg="    MIG capable: true\n"
[pod/nvidia-mig-manager-v7vkm/nvidia-mig-manager] time="2023-12-04T15:46:40Z" level=debug msg="    Updating MIG config: map[1g.10gb:2 2g.20gb:1 3g.40gb:1]"

Check the status:

egallen@laptop ~ % oc describe node | grep nvidia.com/mig.config
                    nvidia.com/mig.config=all-balanced
                    nvidia.com/mig.config.state=success
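With the mixed strategy, each MIG profile is advertised as its own extended resource (nvidia.com/mig-1g.10gb, nvidia.com/mig-2g.20gb, nvidia.com/mig-3g.40gb); a quick check once the device plugin has restarted, output omitted:

egallen@laptop ~ % oc describe node | grep 'nvidia.com/mig-'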

We can check the nvidia-smi output:

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: command-nvidia-smi
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
      command: ["/bin/sh","-c"]
      args: ["nvidia-smi"]
EOF

We can see the 32 MIG devices instead of 8 GPUs:

egallen@laptop ~ % oc logs command-nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                   On |
| N/A   25C    P0              71W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                   On |
| N/A   26C    P0              70W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                   On |
| N/A   31C    P0              72W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                   On |
| N/A   29C    P0              71W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                   On |
| N/A   26C    P0              71W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                   On |
| N/A   25C    P0              70W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                   On |
| N/A   29C    P0              73W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                   On |
| N/A   31C    P0              72W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4    5   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4   13   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4   14   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5    5   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5   13   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5   14   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    2   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    3   0   1  |              11MiB / 20096MiB  | 32      0 |  2   0    2    0    2 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    9   0   2  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7   10   0   3  |               5MiB /  9984MiB  | 16      0 |  1   0    1    0    1 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

% oc delete pod command-nvidia-smi
pod "command-nvidia-smi" deleted

We can see the 32 MIG devices expected from the all-balanced profile: 16 x 1g.10gb + 8 x 2g.20gb + 8 x 3g.40gb (each of the 8 GPUs is split into 2 x 1g.10gb + 1 x 2g.20gb + 1 x 3g.40gb).

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
Capacity:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
Allocatable:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
  Resource                Requests      Limits
  nvidia.com/mig-1g.10gb  0             0
  nvidia.com/mig-2g.20gb  0             0
  nvidia.com/mig-3g.40gb  0             0

In OpenShift AI, an accelerator profile defines the specification of an accelerator. Before you can use an accelerator in OpenShift AI, your OpenShift instance must contain the associated accelerator profile.

This specific configuration is only required with the mixed MIG strategy; with the single strategy, the nvidia.com/gpu resource is still used. Because we are running the mixed MIG strategy here, we need to change the accelerator profile in OpenShift AI.

Accelerator profile documentation

In the OpenShift Container Platform web console, in the Administrator perspective, click Administration → CustomResourceDefinitions. In the search bar, enter acceleratorprofile to filter by name, then click the AcceleratorProfile custom resource definition (CRD) to open its details page.

Click the Instances tab and open the existing migrated-gpu instance in the embedded YAML editor.

The existing AcceleratorProfile looks like this:

apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  creationTimestamp: '2023-12-02T22:01:49Z'
  generation: 1
  managedFields:
    - apiVersion: dashboard.opendatahub.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          .: {}
          'f:displayName': {}
          'f:enabled': {}
          'f:identifier': {}
          'f:tolerations': {}
      manager: unknown
      operation: Update
      time: '2023-12-02T22:01:49Z'
  name: migrated-gpu
  namespace: redhat-ods-applications
  resourceVersion: '54196'
  uid: 3c34bfc5-f6b6-407b-a9d8-f52d5155843f
spec:
  displayName: NVIDIA GPU
  enabled: true
  identifier: nvidia.com/gpu
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

We replace the identifier nvidia.com/gpu with nvidia.com/mig-1g.10gb and update the toleration key to match:

apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  creationTimestamp: '2023-12-02T22:01:49Z'
  generation: 2
  managedFields:
    - apiVersion: dashboard.opendatahub.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          .: {}
          'f:displayName': {}
          'f:enabled': {}
      manager: unknown
      operation: Update
      time: '2023-12-02T22:01:49Z'
    - apiVersion: dashboard.opendatahub.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:identifier': {}
          'f:tolerations': {}
      manager: Mozilla
      operation: Update
      time: '2023-12-04T21:39:51Z'
  name: migrated-gpu
  namespace: redhat-ods-applications
  resourceVersion: '1259992'
  uid: 3c34bfc5-f6b6-407b-a9d8-f52d5155843f
spec:
  displayName: NVIDIA GPU
  enabled: true
  identifier: nvidia.com/mig-1g.10gb
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/mig-1g.10gb
      operator: Exists
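
The same change can also be made from the CLI instead of the embedded YAML editor. A minimal sketch with oc patch, assuming the migrated-gpu profile name and redhat-ods-applications namespace shown above:

egallen@laptop ~ % oc patch acceleratorprofile migrated-gpu -n redhat-ods-applications --type=merge \
    -p '{"spec":{"identifier":"nvidia.com/mig-1g.10gb","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/mig-1g.10gb","operator":"Exists"}]}}'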

When we schedule a Notebook with 2 GPUs, we can see that two nvidia.com/mig-1g.10gb resources are used:

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
Capacity:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
Allocatable:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
  Resource                Requests       Limits
  nvidia.com/mig-1g.10gb  2              2
  nvidia.com/mig-2g.20gb  0              0
  nvidia.com/mig-3g.40gb  0              0
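
Notebooks are not the only consumers: any pod can request the MIG slices as extended resources. A minimal sketch of a pod asking for two 1g.10gb slices (the pod name is illustrative; we reuse the CUDA image from the nvidia-smi test above):

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-consumer-example
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
      command: ["/bin/sh", "-c"]
      args: ["nvidia-smi -L"]
      resources:
        limits:
          # requests default to the limits for extended resources;
          # add a toleration if your GPU nodes are tainted
          nvidia.com/mig-1g.10gb: 2
EOF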

MIG single strategy switch

We will test the single strategy with the all-3g.40gb label on one NVIDIA DGX H100 server with 8 x H100 80GB GPUs, running as a Single Node OpenShift cluster. We are starting from the mixed strategy with the all-balanced label, which exposes 32 devices:

Check before:

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
Capacity:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
Allocatable:
  nvidia.com/gpu:          0
  nvidia.com/mig-1g.10gb:  16
  nvidia.com/mig-2g.20gb:  8
  nvidia.com/mig-3g.40gb:  8
  Resource                Requests      Limits
  nvidia.com/mig-1g.10gb  0             0
  nvidia.com/mig-2g.20gb  0             0
  nvidia.com/mig-3g.40gb  0             0

Prepare the variables:

egallen@laptop ~ % NODE_NAME=dgxh100.redhat.com
egallen@laptop ~ % STRATEGY=single
egallen@laptop ~ % MIG_CONFIGURATION=all-3g.40gb
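
The resources advertised by the GPU Operator depend on the MIG strategy set in its ClusterPolicy, so we also switch the strategy to single. A minimal sketch, assuming the default ClusterPolicy name gpu-cluster-policy created by the operator:

egallen@laptop ~ % oc patch clusterpolicies.nvidia.com/gpu-cluster-policy --type='json' \
    -p='[{"op":"replace", "path":"/spec/mig/strategy", "value":"'$STRATEGY'"}]'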

We label the node with the MIG configuration:

egallen@laptop ~ % oc label node/$NODE_NAME nvidia.com/mig.config=$MIG_CONFIGURATION --overwrite

Check the status:

egallen@laptop ~ % oc describe node | grep gpu.count
                    nvidia.com/gpu.count=0

egallen@laptop ~ % oc describe node | egrep "Name:|Roles:|Capacity|nvidia.com/gpu:|nvidia.com/mig-.* |Allocatable:|Requests +Limits"
Name:               dgxh100.redhat.com
Roles:              control-plane,master,worker
Capacity:
  nvidia.com/gpu:          16
  nvidia.com/mig-1g.10gb:  0
  nvidia.com/mig-2g.20gb:  0
  nvidia.com/mig-3g.40gb:  0
Allocatable:
  nvidia.com/gpu:          16
  nvidia.com/mig-1g.10gb:  0
  nvidia.com/mig-2g.20gb:  0
  nvidia.com/mig-3g.40gb:  0
  Resource                Requests      Limits
  nvidia.com/mig-1g.10gb  0             0
  nvidia.com/mig-2g.20gb  0             0
  nvidia.com/mig-3g.40gb  0             0

Test one nvidia-smi command:

egallen@laptop ~ % cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: command-nvidia-smi
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubi8
      command: ["/bin/sh","-c"]
      args: ["nvidia-smi"]
EOF

We can see the 16 MIG devices, each with 40GB of vRAM:

egallen@laptop ~ % oc logs command-nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1B:00.0 Off |                   On |
| N/A   25C    P0              75W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:43:00.0 Off |                   On |
| N/A   27C    P0              74W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:52:00.0 Off |                   On |
| N/A   32C    P0              75W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:61:00.0 Off |                   On |
| N/A   30C    P0              74W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  | 00000000:9D:00.0 Off |                   On |
| N/A   27C    P0              75W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  | 00000000:C3:00.0 Off |                   On |
| N/A   25C    P0              73W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  | 00000000:D1:00.0 Off |                   On |
| N/A   30C    P0              77W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                   On |
| N/A   31C    P0              76W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  4    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  5    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  6    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    1   0   0  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  7    2   0   1  |              16MiB / 40448MiB  | 60      0 |  3   0    3    0    3 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

With the single MIG strategy, we can keep the OpenShift AI accelerator profile with the nvidia.com/gpu identifier.

Run Mistral AI inference

The Mistral-7B-v0.2 model only requires about 16GB of GPU RAM for inference (roughly 7 billion FP16 parameters at 2 bytes each is about 14GB, plus overhead); the DGX H100 is not mandatory, but it will help scale the number of inferences per second.

The Mixtral-8x7B-v0.1 model requires around 100GB of GPU RAM, so the DGX H100 will help you run this more demanding model.

We can create a new project llm-on-openshift:

egallen@laptop ~ % oc new-project llm-on-openshift
Now using project "llm-on-openshift" on server "https://api.dgxh100.redhat.com:6443".

Creating a PVC

Create a PersistentVolumeClaim called models-cache; the inference server will use it to cache the downloaded model weights.

Red Hat OpenShift PVC
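
A minimal sketch of an equivalent PVC created from the CLI (the 100Gi size and the default storage class are assumptions; size it for the models you plan to cache):

egallen@laptop ~ % cat <<EOF | oc apply -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: models-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
EOF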

Creating the deployment

We will modify a Kubernetes deployment YAML from Guillaume Moutier, available here.

We prepare the deployment yaml:

egallen@laptop ~ % cat << EOF > deployment.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: hf-text-generation-inference-server
  labels:
    app: hf-text-generation-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hf-text-generation-inference-server
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: hf-text-generation-inference-server
    spec:
      restartPolicy: Always
      schedulerName: default-scheduler
      affinity: {}
      terminationGracePeriodSeconds: 120
      securityContext: {}
      containers:
        - resources:
            limits:
              cpu: '8'
              memory: 128Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '8'
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 5
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: server
          livenessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 8
            periodSeconds: 100
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: MODEL_ID
              value: mistralai/Mistral-7B-Instruct-v0.1
            - name: MAX_INPUT_LENGTH
              value: '1024'
            - name: MAX_TOTAL_TOKENS
              value: '2048'
            - name: HUGGINGFACE_HUB_CACHE
              value: /models-cache
            - name: HUGGING_FACE_HUB_TOKEN
              value: 'hf_IDAAAAAAAAAAAA'
            - name: PORT
              value: '3000'
            - name: HOST
              value: '0.0.0.0'
          securityContext:
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            allowPrivilegeEscalation: false
            seccompProfile:
              type: RuntimeDefault
          ports:
            - name: http
              containerPort: 3000
              protocol: TCP
          imagePullPolicy: IfNotPresent
          startupProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 30
            successThreshold: 1
            failureThreshold: 24
          volumeMounts:
            - name: models-cache
              mountPath: /models-cache
            - name: shm
              mountPath: /dev/shm
          terminationMessagePolicy: File
          image: 'ghcr.io/huggingface/text-generation-inference:1.1.0'
      volumes:
        - name: models-cache
          persistentVolumeClaim:
            claimName: models-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
      dnsPolicy: ClusterFirst
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 1
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
EOF
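
We apply the file to create the deployment (the deployment.yaml prepared above):

egallen@laptop ~ % oc apply -f deployment.yaml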

Checking the deployment status:

egallen@laptop ~ % oc get deployments                                          
NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
hf-text-generation-inference-server   1/1     1            1           2m53s

egallen@laptop ~ % oc get pods       
NAME                                                   READY   STATUS    RESTARTS   AGE
hf-text-generation-inference-server-7449c5f6c7-khx2m   1/1     Running   0          2m56s

egallen@laptop ~ % oc logs hf-text-generation-inference-server-7449c5f6c7-khx2m  -f
...
{"timestamp":"2023-12-18T11:04:12.913335Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":247}
{"timestamp":"2023-12-18T11:04:12.913335Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":252}

We can validate that the inference-server pod can access the GPU:

egallen@laptop ~ % oc rsh hf-text-generation-inference-server-7449c5f6c7-khx2m
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:DF:00.0 Off |                   On |
| N/A   34C    P0             118W / 700W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    1   0   0  |           39046MiB / 40320MiB  | 60      0 |  3   0    3    0    3 |
|                  |               3MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Creating the service

We prepare the service yaml file:

egallen@laptop ~ % cat << EOF > service.yaml
kind: Service
apiVersion: v1
metadata:
  name: hf-text-generation-inference-server
  labels:
    app: hf-text-generation-inference-server
spec:
  clusterIP: None
  ipFamilies:
    - IPv4
  ports:
    - name: http
      protocol: TCP
      port: 3000
      targetPort: http
  type: ClusterIP
  ipFamilyPolicy: SingleStack
  sessionAffinity: None
  selector:
    app: hf-text-generation-inference-server
EOF

We create the service:

egallen@laptop ~ % oc create -f service.yaml
service/hf-text-generation-inference-server created

Check the service status:

egallen@laptop ~ % oc get services
NAME                                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
hf-text-generation-inference-server   ClusterIP   None         <none>        3000/TCP   7s

egallen@laptop ~ % oc describe service hf-text-generation-inference-server
Name:              hf-text-generation-inference-server
Namespace:         llm-on-openshift
Labels:            app=hf-text-generation-inference-server
Annotations:       <none>
Selector:          app=hf-text-generation-inference-server
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                None
IPs:               None
Port:              http  3000/TCP
TargetPort:        http/TCP
Endpoints:         10.128.0.102:3000
Session Affinity:  None
Events:            <none>

Creating the route

We prepare the route yaml file:

egallen@laptop ~ % cat << EOF > route.yaml
kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: hf-text-generation-inference-server
  labels:
    app: hf-text-generation-inference-server
spec:
  to:
    kind: Service
    name: hf-text-generation-inference-server
    weight: 100
  port:
    targetPort: http
  tls:
    termination: edge
  wildcardPolicy: None
EOF
egallen@laptop ~ % oc create -f route.yaml
route.route.openshift.io/hf-text-generation-inference-server created
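
We can retrieve the exposed hostname, which is the one used in the curl commands below:

egallen@laptop ~ % oc get route hf-text-generation-inference-server -o jsonpath='{.spec.host}'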

Testing the model

We can now query the Mistral AI model with simple prompts and curl commands.

The Mistral API provides a safe mode to enforce guardrails. With local models, you can prepend your messages with the following system prompt: “Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity.”
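
A sketch of such a guarded request, prepending that system prompt to a hypothetical question (the question text is illustrative):

egallen@laptop ~ % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]Always assist with care, respect, and truth. Respond with utmost utility yet securely. Avoid harmful, unethical, prejudiced, or negative content. Ensure replies promote fairness and positivity. How should I store database passwords for a web application?[/INST]","parameters":{"max_new_tokens":200}}' \
    -H 'Content-Type: application/json'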

Here are some Mistral-7B inference tests:

“Golang or Rust for a website”:

egallen@laptop ~ % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]Should I use the Go Programming Language or Rust for a website ?[/INST]","parameters":{"max_new_tokens":1000}}' \
    -H 'Content-Type: application/json' | jq . | sed 's/\\n/\n/g'

{
  "generated_text": " Both Go and Rust are great programming languages for building websites, but the choice ultimately depends on your specific needs and preferences.

  Go, also known as Golang, is a relatively new programming language that was developed by Google. It is known for its simplicity, speed, and concurrency. Go is a good choice if you need to build a scalable and high-performance website that can handle a large number of concurrent users.
  Rust, on the other hand, is a systems programming language that was developed by Mozilla. It is known for its safety, speed, and concurrency. Rust is a good choice if you need to build a website that requires low-level systems programming, such as a web server or a content delivery network.

  In general, if you are building a simple website that doesn't require a lot of concurrency or low-level systems programming, either Go or Rust could be a good choice. However, if you need to build a website that requires high performance, scalability, and low-level systems programming, Rust might be the better choice."
}

“Kubernetes distribution”:

egallen@laptop ~ % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]What is the most widely used commercial Kubernetes distribution?[/INST]","parameters":{"max_new_tokens":25}}' \
    -H 'Content-Type: application/json' | jq . | sed 's/\\n/\n/g'

{
  "generated_text":"
## Answer (1)
The most widely used commercial Kubernetes distribution is Red Hat OpenShift.
"}%

“Python coding”

egallen@laptop ~ % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]Write a basic python function that can generate fibbonaci sequence[/INST]","parameters":{"max_new_tokens":1000}}' \
    -H 'Content-Type: application/json' | jq  . | sed 's/\\n/\n/g'

{
  "generated_text": " Here is a simple Python function that generates the Fibonacci sequence:

```python
def fibonacci(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    else:
        fib_seq = fibonacci(n-1)
        fib_seq.append(fib_seq[-1] + fib_seq[-2])
        return fib_seq
```

This function takes in an integer `n` as an argument, which represents the number of terms to generate in the Fibonacci sequence. If `n` is less than or equal to 0, the function returns an empty list. If `n` is equal to 1, the function returns [0]. If `n` is equal to 2, the function returns [0, 1]. For any other value of `n`, the function first calls itself with the argument `n-1`, and appends the sum of the last two elements in the returned list to the end of the list. This process continues until the desired number of terms have been generated."
}

Test in French:

egallen@egallen-mac test % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]Quelle est la liste des Présidents de la Cinquième République ?[/INST]","parameters":{"max_new_tokens":1000}}' \
    -H 'Content-Type: application/json' | jq  . | sed 's/\\n/\n/g'

{
  "generated_text": " Voici la liste des présidents de la Cinquième République française depuis sa création en 1958 :

1. Charles de Gaulle (1958-1969)
2. Georges Pompidou (1969-1974)
3. Valéry Giscard d'Estaing (1974-1981)
4. François Mitterrand (1981-1995)
5. Jacques Chirac (1995-2007)
6. Nicolas Sarkozy (2007-2012)
7. François Hollande (2012-2017)
8. Emmanuel Macron (2017-en cours)"
}

Are you a poet?

egallen@egallen-mac test % curl https://hf-text-generation-inference-server-llm-on-openshift.apps.dgxh100.redhat.com/generate \
    -X POST \
    --insecure \
    -d '{"inputs":"<s>[INST]Write a poem about the sun[/INST]","parameters":{"max_new_tokens":1000}}' \
    -H 'Content-Type: application/json' | jq  . | sed 's/\\n/\n/g'

{
  "generated_text": " The Sun, the source of all light,
A ball of fire burning bright,
It rises in the east and sets in the west,
A daily cycle that never rests.

Its warmth embraces the earth,
A gentle touch that brings forth,
Life and growth in every form,
From the tallest tree to the smallest worm.

Its rays reach out to the sky,
A canvas of colors that never die,
A painting of beauty and wonder,
A sight that leaves us all in awe and thunder.

The Sun, a star that shines so bright,
A beacon of hope and light,
A symbol of life and love,
Its power and majesty we can't help but adore.

So let us bask in its glory,
And let its warmth and light tell a story,
For the Sun, is a gift from above,
A treasure that we should cherish and love."
}

Conclusion

This blog post provides a comprehensive guide on how to deploy OpenShift AI on the DGX H100 system for running large-scale machine learning applications. It covers all the steps from preparation to deployment, including setting up the OpenShift cluster, installing the necessary operators, and creating a persistent volume. It also includes examples of how to use PyTorch to run image classification tasks. Finally, it shows how to set up MIG devices and run Mistral AI inference.

In addition to the instructions provided in the article, here are some additional tips for deploying OpenShift AI on the DGX H100 system:

  • Optimize your OpenShift cluster configuration. This includes allocating sufficient resources to the nodes in the cluster and ensuring that the network bandwidth is adequate to support the data transfer requirements of your AI workloads.

  • Use the GPUDirect RDMA feature to improve data transfer performance. It allows network adapters and other devices to read and write GPU memory directly over PCIe, bypassing the host CPU and system memory.

  • Use NVIDIA’s optimized libraries for TensorFlow and PyTorch. These libraries are specifically designed to take advantage of the NVIDIA GPU architecture and can improve performance significantly.

You can now effectively deploy OpenShift AI on the DGX H100 system and run large-scale machine learning applications with impressive performance and efficiency.