GPU Support

This guide shows how to enable NVIDIA GPU support on your Akash provider after Kubernetes is deployed.

Don’t have GPUs? Skip to Persistent Storage (Rook-Ceph) or Provider Installation.

Prerequisites: You must have already configured the NVIDIA runtime in Kubespray before deploying your cluster. See Kubernetes Setup - Step 7.

Time: 30-45 minutes


STEP 1 - Install NVIDIA Drivers

Run these commands on each GPU node:

Update System

Terminal window
apt update
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
apt -y autoremove

Reboot the node after this step.

Add NVIDIA Repository

Terminal window
# Create keyrings directory if it doesn't exist
mkdir -p /etc/apt/keyrings
# Download and add NVIDIA GPG key using modern method
wget -qO- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/3bf863cc.pub | gpg --dearmor -o /etc/apt/keyrings/nvidia-cuda.gpg
# Add NVIDIA repository with signed-by reference
echo "deb [signed-by=/etc/apt/keyrings/nvidia-cuda.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /" | tee /etc/apt/sources.list.d/cuda-repo.list
apt update
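
Optionally, confirm the repository is active by checking that a driver package is now visible to apt (the package name below assumes the 580 driver series used in the next step):

Terminal window
apt-cache policy nvidia-driver-580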

Install Driver

Choose the installation method based on your GPU type:

Consumer GPUs (RTX 4090, RTX 5090, etc.)

For consumer-grade GPUs, install the standard driver:

Terminal window
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install nvidia-driver-580
apt -y autoremove

Reboot the node.

Data Center GPUs (H100, H200, etc.)

For data center GPUs (including SXM form factor), install the server driver and Fabric Manager:

Terminal window
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install nvidia-driver-580-server
apt-get install -y nvidia-fabricmanager-580
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
apt -y autoremove

Reboot the node.
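
After the reboot, you can optionally confirm that Fabric Manager is running (this check mainly matters on NVSwitch-based SXM systems):

Terminal window
systemctl is-active nvidia-fabricmanager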

Verify Installation

Terminal window
nvidia-smi

You should see your GPUs listed with driver information.
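
For a more compact summary, nvidia-smi also accepts query flags (optional):

Terminal window
nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv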


STEP 2 - Install NVIDIA Container Toolkit

Run on each GPU node:

Terminal window
# Create keyrings directory if it doesn't exist
mkdir -p /etc/apt/keyrings
# Download and add NVIDIA Container Toolkit GPG key using modern method
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /etc/apt/keyrings/nvidia-container-toolkit.gpg
# Add NVIDIA Container Toolkit repository with signed-by reference
echo "deb [signed-by=/etc/apt/keyrings/nvidia-container-toolkit.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/amd64 /" | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
apt-get update
apt-get install -y nvidia-container-toolkit nvidia-container-runtime
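
Optionally, verify that the toolkit installed correctly:

Terminal window
nvidia-ctk --version
dpkg -l nvidia-container-toolkit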

STEP 3 - Configure NVIDIA CDI

Run on each GPU node:

Generate CDI Specification

Terminal window
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
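
Optionally, list the devices described by the generated specification to confirm it was written:

Terminal window
sudo nvidia-ctk cdi list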

Configure NVIDIA Runtime

Edit /etc/nvidia-container-runtime/config.toml and ensure these lines are uncommented and set to:

accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
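
If you prefer to apply these settings non-interactively, the following GNU sed sketch makes the same edits (adjust the patterns if your config.toml layout differs) and then prints the result for a quick check:

Terminal window
sudo sed -i 's/^#\?accept-nvidia-visible-devices-as-volume-mounts.*/accept-nvidia-visible-devices-as-volume-mounts = false/' /etc/nvidia-container-runtime/config.toml
sudo sed -i 's/^#\?accept-nvidia-visible-devices-envvar-when-unprivileged.*/accept-nvidia-visible-devices-envvar-when-unprivileged = true/' /etc/nvidia-container-runtime/config.toml
grep accept-nvidia-visible-devices /etc/nvidia-container-runtime/config.toml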

Note: This setup uses CDI (Container Device Interface) for device enumeration, which provides better security and device management.


STEP 4 - Create NVIDIA RuntimeClass

Run from a control plane node:

Terminal window
cat > nvidia-runtime-class.yaml << 'EOF'
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl apply -f nvidia-runtime-class.yaml
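
Optionally, confirm the RuntimeClass exists:

Terminal window
kubectl get runtimeclass nvidia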

STEP 5 - Label GPU Nodes

Label each GPU node (replace <node-name> with the actual node name):

Terminal window
kubectl label nodes <node-name> allow-nvdp=true

Verify Labels

Terminal window
kubectl describe node <node-name> | grep -A5 Labels

You should see allow-nvdp=true.
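
Alternatively, a label selector lists only the labeled nodes:

Terminal window
kubectl get nodes -l allow-nvdp=true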


STEP 6 - Install NVIDIA Device Plugin

Run from a control plane node:

Terminal window
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.18.0 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=cdi-cri \
  --set nvidiaDriverRoot="/" \
  --set-string nodeSelector.allow-nvdp="true"

Verify Installation

Terminal window
kubectl -n nvidia-device-plugin get pods -o wide

You should see nvdp-nvidia-device-plugin pods running on your GPU nodes.

Check Logs

Terminal window
kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp

Expected output:

Detected NVML platform: found NVML library
Starting GRPC server for 'nvidia.com/gpu'
Registered device plugin for 'nvidia.com/gpu' with Kubelet
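
Optionally, confirm that GPUs are now advertised as allocatable resources on each labeled node (replace <node-name> as before; the command prints the GPU count):

Terminal window
kubectl get node <node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'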

STEP 7 - Test GPU Functionality

Create a test pod:

Terminal window
cat > gpu-test-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
kubectl apply -f gpu-test-pod.yaml
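
Optionally, you can block until the test pod finishes instead of polling its status (the jsonpath form of kubectl wait requires kubectl 1.23 or newer):

Terminal window
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-test --timeout=180s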

Verify Test

Wait for the pod to complete, then check logs:

Terminal window
kubectl logs gpu-test

Expected output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Test nvidia-smi

Create an interactive test:

Terminal window
kubectl run gpu-shell --rm -it --restart=Never --image=nvidia/cuda:11.6.0-base-ubuntu20.04 -- nvidia-smi

You should see GPU information displayed.

Cleanup

Terminal window
kubectl delete pod gpu-test
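
If you like, also remove the local manifest files created earlier in this guide:

Terminal window
rm -f nvidia-runtime-class.yaml gpu-test-pod.yaml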

Next Steps

Your Kubernetes cluster now has GPU support!

Optional enhancements: Persistent Storage (Rook-Ceph). When your cluster is ready, continue to Provider Installation to deploy the Akash provider.

Note: After installing the provider, you’ll need to add GPU attributes to your provider.yaml to advertise GPU capabilities. This is covered in the Provider Installation guide.