Kubernetes GPU support example

Overview

Kubernetes can support GPUs to speed up parallel processing and, in environments that support it, auto-scaling. This feature is commonly used for Machine Learning and Data Science applications. Kublr automatically detects GPUs on AWS and Azure instances and configures the environment to use GPUs with Kubernetes.

This document explains:

  • which GPU instances and operating systems Kublr supports;
  • how Kublr enables GPU support;
  • how to use GPUs in Kublr.

What GPU instances and OSes are supported by Kublr?

Supported: AWS and Azure

On AWS Kublr supports GPUs on Ubuntu 20.04 and RedHat 7 (on RedHat 7 docker-ce is required) for a number of GPU instance types, for example:

  • p2.xlarge;

  • p2.8xlarge;

  • p2.16xlarge;

  • p3.2xlarge;

  • p3.8xlarge;

  • p3.16xlarge.

    Note See full list of AWS GPU instances and their technical descriptions in AWS documentation Amazon EC2 Instance Types > “Accelerated Computing”.

On Azure Kublr supports GPUs on Ubuntu 20.04 for a number of GPU instance types, for example:

  • Standard_NC6s_v2;

  • Standard_NC12s_v2;

  • Standard_NC24s_v2;

  • Standard_NC24rs_v2;

  • Standard_NC6s_v3;

  • Standard_NC12s_v3;

  • Standard_NC24s_v3;

  • Standard_NC24rs_v3;

  • Standard_ND6s;

  • Standard_ND12s;

  • Standard_ND24s;

  • Standard_ND24rs.

    Note See full list of Azure GPU instances and their technical descriptions in Azure documentation GPU optimized virtual machines > “Overview” and related sections.

Kublr supports GPUs for the following NVIDIA devices:

  • NVIDIA Corporation GK210GL [Tesla K80];
  • NVIDIA Corporation GV100GL [Tesla V100 SXM2];
  • NVIDIA Corporation Tesla V100-PCIE;
  • NVIDIA Corporation Tesla P100-PCIE;
  • NVIDIA Corporation Tesla P40.

Here is an example of the Kublr configuration related to the installation of the NVIDIA devices:

spec:
  nodes:
    - name: ml
      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              enable: 'auto' # 'auto' is the default value; it does not have to be
                             # specified explicitly and is placed here for
                             # demonstration purposes only

This setting defines whether NVIDIA device support is enabled. Possible values are:

  • true means that NVIDIA drivers and the NVIDIA Docker runtime are installed during setup and enabled at runtime;
  • false means that NVIDIA drivers and the NVIDIA Docker runtime are NOT installed during setup and NOT enabled at runtime;
  • auto (default) means that Kublr will do its best to decide whether to install and enable the NVIDIA drivers and Docker runtime. Currently, the decision to install the drivers is based on the presence of NVIDIA devices in the ‘lspci’ output, and the decision to enable them is based on the presence of NVIDIA devices in the ‘lspci’ output together with the installed driver and Docker runtime files.
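
For example, to explicitly disable GPU support on a particular node group, the same setting can be set to 'false' (a minimal sketch following the layout of the example above; the node group name is illustrative):

spec:
  nodes:
    - name: cpu-only # illustrative node group name
      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              enable: 'false' # NVIDIA drivers and Docker runtime are neither installed nor enabled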

By default, the NVIDIA GPU support feature relies on the latest available NVIDIA driver version, so if the latest driver version published by NVIDIA does not support the GPU hardware present in the system, enabling GPU support requires specifying an exact driver version in the Kublr cluster specification.

Notably, this is the case for most AWS GPU-accelerated instance types such as p2.*, p3.*, g5.*, g4dn.*, and g3(s).*. The GPU hardware provided by AWS on these instance types (NVIDIA K80, NVIDIA Tesla V100, NVIDIA A10G Tensor Core, NVIDIA T4 Tensor Core, and NVIDIA Tesla M60, respectively) is not supported by the default latest version of the NVIDIA GPU drivers.

As a result, an earlier driver version must be specified explicitly in the Kublr cluster spec for the corresponding worker node group, in the kublrAgentConfig section property kublr.features.nvidia_devices.driver.

Driver nvidia-driver-470-server is tested to support GPU devices provided in all the above instance types on Ubuntu 18.04 and Ubuntu 20.04 images.

When RHEL or CentOS images are used, the driver nvidia-driver-branch-470 is known to work well.

Additionally, for RHEL-backed instances it is important to use Docker CE instead of the default Docker provided with the OS. The RHEL-provided version of Docker is based on a rather old Docker CE release and does not work well with the NVIDIA container runtime that is required for NVIDIA GPU integration.

Ubuntu 20.04 Kublr configuration example for NVIDIA drivers:

spec:
  ...
  nodes:
    - autoscaling: false
      kublrVariant: aws-ubuntu-20.04
      ...
      name: group1
      # this "kublrAgentConfig" section allows customizing Kublr agent running on this
      # worker node group; in this case setting a specific version of nvidia driver.
      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              driver:
                - nvidia-driver-470-server
              enable: 'true'

RedHat 7 Kublr configuration example for NVIDIA drivers and docker-ce:

spec:
  ...
  nodes:
    - autoscaling: false
      kublrVariant: aws-redhat-8
      ...
      name: group1
      # this "kublrAgentConfig" section allows customizing Kublr agent running on this
      # worker node group; in this case setting a specific version of nvidia driver and
      # Docker CE preference.
      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              driver:
                - nvidia-driver-branch-470
              enable: auto
          setup:
            docker:
              edition_fallback_order: 'ce,os'

Not supported: GCP

Kublr does not support GPUs on Google Cloud Platform (GCP).

It depends: your own infrastructure

On your own infrastructure you will be able to use GPUs if they are installed and configured on your machines.

How can I use GPU instances?

  1. Log in to the Kublr platform.
  2. Click “Add cluster”.
  3. Choose a cloud provider (AWS or Azure).
  4. Choose an appropriate instance type:
     • On AWS: select the instance type;
     • On Azure: select the instance type.
  5. Click ‘Create Cluster’ and wait until the cluster is created.
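
Once the cluster is created, workloads consume GPUs through the standard Kubernetes nvidia.com/gpu resource (this assumes an NVIDIA device plugin is running on the GPU nodes). A minimal sketch of a Pod requesting one GPU; the Pod name and image tag are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.4.3-base-ubuntu20.04 # any CUDA base image works; this tag is an assumption
      command: ["nvidia-smi"]                    # prints GPU info and exits
      resources:
        limits:
          nvidia.com/gpu: 1                      # request one GPU from the device plugin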

How to make sure that the device is working?

Check NVIDIA GPU controller and Docker configuration.

  1. Log in to the worker node via SSH.
  2. Make sure that the NVIDIA GPU controller is present:
# lspci -m -nn
....
00:1e.0 "3D controller [0302]" "NVIDIA Corporation [10de]" "GK210GL [Tesla K80] [102d]" -ra1 "NVIDIA Corporation [10de]" "Device [106c]"
....
#
#
# nvidia-smi
Mon Jun 18 10:15:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   50C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
#
# nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-860ba7bf-e331-54b4-6e2c-322fb389597b)
  3. Make sure that Docker is configured correctly:
# docker run --rm nvidia/cuda nvidia-smi
Mon Jun 18 10:15:50 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   50C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Run demo GPU application using Helm chart

Prerequisites:

  • a cluster with a GPU node has been created;
  • kubectl is installed;
  • helm is installed;
  • “demo7-gpu” is downloaded from here (dockerGPUHelm contains the chart with the demo).

Process default video

Copy the KubeConfig file from the cluster’s Overview page and move it to the ~/.kube/ directory:

$ cp ~/Downloads/config.yaml ~/.kube/config

Check that kubectl is working and using the right config file:

$ kubectl config view
$ kubectl cluster-info

Change directory to ../demo7-gpu and install the Helm chart:

$ helm install demo ./
NAME: demo
LAST DEPLOYED: Fri Jun 24 07:58:32 2022
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods --namespace default -l "app=demo-gpu,release=demo" -o jsonpath="{.items[0].metadata.name}")
  echo "Visit http://127.0.0.1:8080 to use your application"
  kubectl port-forward $POD_NAME 8080:80

Run the command for port forwarding (use the pod name from the previous console output):

$ kubectl port-forward <POD_NAME> 8000:8000 --address 0.0.0.0
Forwarding from 0.0.0.0:8000 -> 8000
Handling connection for 8000

Get the video stream: open this link in a browser: http://localhost:8000/

Video process example

Note that it may take a couple of minutes before the video stream is available in the browser. Once the video stream is up and the demo chart is working, you may verify NVIDIA GPU acceleration using the following command (note the GPU utilization percentage and the process ID in the output below):

# nvidia-smi
Fri Jun 24 09:11:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   62C    P0   142W / 149W |   1364MiB / 11441MiB |     82%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    161012      C   python                           1361MiB |
+-----------------------------------------------------------------------------+
root@ip-172-16-61-0:/#

root@ip-172-16-61-0:/# nvidia-smi -q -g 0 -d UTILIZATION -l

==============NVSMI LOG==============

Timestamp                                 : Fri Jun 24 09:15:18 2022
Driver Version                            : 470.129.06
CUDA Version                              : 11.4

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Utilization
        Gpu                               : 83 %
        Memory                            : 38 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    GPU Utilization Samples
        Duration                          : 16.58 sec
        Number of Samples                 : 99
        Max                               : 85 %
        Min                               : 73 %
        Avg                               : 80 %
    Memory Utilization Samples
        Duration                          : 16.58 sec
        Number of Samples                 : 99
        Max                               : 39 %
        Min                               : 34 %
        Avg                               : 37 %

Change video

  1. Open the file ../demo7-gpu/dockerGPUHelm/templates/deployment.yaml

  2. Change the value of the VIDEO_LINK parameter (a sketch of where these parameters live is shown after this procedure). Note: you can also change the value of VIDEO_OUTPUT_COMPRESSION for the desired video quality level.

  3. Upgrade the Helm chart. Note: you should wait approximately a minute until the previous Pod is terminated.

$ helm upgrade demo ./
Release "demo" has been upgraded. Happy Helming!
LAST DEPLOYED: Fri Jun 24 07:58:32 2022
NAMESPACE: default
STATUS: DEPLOYED
  4. Get the new Pod’s name:
$ kubectl get pods --all-namespaces | grep demo
  5. Run the command for port forwarding.

  6. Open the link http://localhost:8000/ to get the video stream.

Video process example
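
In the chart, VIDEO_LINK and VIDEO_OUTPUT_COMPRESSION are parameters of the demo container; the exact layout in deployment.yaml may differ, but a hypothetical sketch of the environment variables changed by the steps above looks like this (URL and value are illustrative):

        env:
          - name: VIDEO_LINK
            value: "https://example.com/video.mp4" # hypothetical URL of the source video
          - name: VIDEO_OUTPUT_COMPRESSION
            value: "30"                            # hypothetical compression/quality level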

How it works

In the process of installing the cluster, Kublr checks for the presence of GPU devices and, if they are detected, does the following:

  1. Installs NVIDIA drivers.

  2. Installs the NVIDIA runtime for Docker.

  3. Configures Docker to use the NVIDIA runtime:

    a. For Ubuntu 20.04, use the following example to customize the specification:

      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              driver:
                - nvidia-driver-470-server
              enable: 'true'
    

    b. For RedHat 7, use the following example to customize the specification:

      kublrAgentConfig:
        kublr:
          features:
            nvidia_devices:
              driver:
                - nvidia-driver-branch-470
              enable: auto
          setup:
            docker:
              edition_fallback_order: 'ce,os'
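
For reference, configuring Docker to use the NVIDIA runtime typically results in an /etc/docker/daemon.json similar to the following standard nvidia-container-runtime configuration (the exact file written by Kublr and the runtime binary path may differ):

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}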
    

Additional references