Kubernetes with an nVidia GPU

Table of Contents

  1. Preface
  2. Version Information
  3. Requirements
  4. Configuring Our Host
  5. Final Thoughts

Preface

This document will focus on the steps required to run a non-sliced nVidia GPU on a kubernetes cluster with kubeadm and containerd on a RHEL or RHEL clone system. Most of the steps in this document are just the differences from my previous Kubernetes on Linux with Kubeadm documentation, so please read through that first.

This is not a guide on how to create a production ready/hardened environment.

Version Information

This guide targets kubernetes v1.20 - v1.29.

If you are attempting to use this guide for another kubernetes version, please be aware that kubernetes is a quickly changing application and this guide may be out-of-date and incorrect. You have been warned.
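
To see which version your cluster is actually running, kubectl can report both the client and server versions; this assumes kubectl is already pointed at your cluster:

kubectl version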

Requirements

This assumes that you have a kubernetes node configured and running with nVidia drivers. Please take a look at the instructions at RPM Fusion for more information on how to configure the system for the nVidia kernel module.
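
Before continuing, it's worth verifying that the nVidia kernel module is loaded and the driver can actually see the GPU. A quick sanity check (run on the GPU node itself) looks something like this:

# confirm the nvidia kernel module is loaded
lsmod | grep nvidia

# confirm the driver can talk to the GPU
nvidia-smi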

Configuring Our Host

nVidia container toolkit

We will need to add the nVidia container toolkit. The easiest way to do this is to add the nVidia container toolkit rpm repository to our system, which makes future updates and patching easier. nVidia supplies instructions for this here, but for the sake of preservation, I'm also including them below. Note that the heredoc delimiter is quoted so the shell passes $basearch through for dnf to expand:

cat <<'EOF' | sudo tee /etc/yum.repos.d/nvidia.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
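
If you want to confirm the repository was registered correctly, the repo id from the file above (nvidia-container-toolkit) should show up in the repo list:

sudo dnf repolist | grep nvidia-container-toolkit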

Then we need to install the toolkit:

sudo dnf install -y nvidia-container-toolkit
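
To verify the install, the toolkit ships an nvidia-ctk binary which should report its version once the package is in place:

nvidia-ctk --version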

Containerd Modifications

In order to get containerd to work with the nVidia system, we need to tell containerd to use the nVidia container toolkit we just installed. Official instructions are provided here. These instructions will have you run a command to modify containerd's configuration and restart the containerd service:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd.service

The command makes the following modifications to /etc/containerd/config.toml:

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

This sets the default containerd runtime to the nvidia-container-runtime and enables the systemd cgroup driver. Both of these settings are critical, so make sure you don't miss either one.
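
If you want to double-check that both settings made it into the file, a quick grep against /etc/containerd/config.toml (the default config location; adjust the path if yours differs) should show all three relevant lines:

grep -E 'default_runtime_name|nvidia-container-runtime|SystemdCgroup' /etc/containerd/config.toml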

nVidia Daemonset

The last step is to install the nVidia device plugin daemonset. This daemonset runs a container on each node and automatically updates the node's capacity to advertise gpu resources (nvidia.com/gpu) when a gpu is detected. This is documented here.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
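
Once the daemonset pod is running (it lands in the kube-system namespace), the node should start advertising the gpu resource. In the check below, my-gpu-node is just a placeholder for your node's name:

kubectl -n kube-system get pods | grep nvidia-device-plugin
kubectl describe node my-gpu-node | grep nvidia.com/gpu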

GPU Consumer Deployment

Deployment is really just a matter of telling kubernetes that the deployment will consume an nvidia.com/gpu resource; the kubernetes scheduler will then assign the pod to a node where an unused gpu is available. In the example below, the image and names are placeholders, so substitute a GPU-capable image of your own.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-gpu-consumer
spec:
  selector:
    matchLabels:
      app: my-gpu-consumer
  template:
    metadata:
      labels:
        app: my-gpu-consumer
    spec:
      containers:
        - name: my-gpu-consumer
          image: nvidia/cuda:12.3.2-base-ubi8  # placeholder, substitute your own GPU-capable image
          command: ["sleep", "infinity"]       # keep the container alive for testing
          resources:
            limits:
              nvidia.com/gpu: 1
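
Assuming the manifest above is saved as gpu-consumer.yaml (both the filename and the my-gpu-consumer name are just placeholders), applying it and running nvidia-smi inside the resulting pod is a quick way to confirm the GPU was actually handed to the container:

kubectl apply -f gpu-consumer.yaml
kubectl exec deploy/my-gpu-consumer -- nvidia-smi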