This guide shows how to enable NVIDIA GPU support on your Akash provider after Kubernetes is deployed.
Don’t have GPUs? Skip to Persistent Storage (Rook-Ceph) or Provider Installation.
Prerequisites: You must have already configured the NVIDIA runtime in Kubespray before deploying your cluster. See Kubernetes Setup - Step 7; a configuration sketch follows below.
Time: 30-45 minutes
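The prerequisite above amounts to registering an nvidia runtime with containerd through Kubespray. The snippet below is a minimal sketch only, assuming Kubespray's containerd_additional_runtimes variable; confirm the exact variable names, quoting, and values against Kubernetes Setup - Step 7.

# Kubespray group_vars (sketch; verify against Kubernetes Setup - Step 7)
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '"/usr/bin/nvidia-container-runtime"'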
STEP 1 - Install NVIDIA Drivers
Run these commands on each GPU node:
Update System
apt update
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
apt autoremove
Reboot the node after this step.
Add NVIDIA Repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/3bf863cc.pub
apt-key add 3bf863cc.pub
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /" | tee /etc/apt/sources.list.d/cuda-repo.list
apt update
Install Driver
Install the recommended NVIDIA driver version 580:
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install nvidia-driver-580
apt -y autoremove
Reboot the node.
Verify Installation
nvidia-smi
You should see your GPUs listed with driver information.
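As an optional extra check, nvidia-smi can print a compact name/driver summary, which is handy for confirming that the 580 driver is the one actually loaded:

nvidia-smi --query-gpu=name,driver_version --format=csv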
SXM GPUs Only
If you have non-PCIe GPUs (SXM form factor), also install Fabric Manager:
apt-get install nvidia-fabricmanager-580
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
STEP 2 - Install NVIDIA Container Toolkit
Run on each GPU node:
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | tee /etc/apt/sources.list.d/libnvidia-container.list
apt-get update
apt-get install -y nvidia-container-toolkit nvidia-container-runtime
STEP 3 - Configure NVIDIA CDI
Run on each GPU node:
Generate CDI Specification
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Configure NVIDIA Runtime
Edit /etc/nvidia-container-runtime/config.toml and ensure these lines are uncommented and set to:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
Note: This setup uses CDI (Container Device Interface) for device enumeration, which provides better security and device management.
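As an optional sanity check (not part of the original steps), recent versions of nvidia-ctk can list the device names declared in the generated CDI spec; expect entries such as nvidia.com/gpu=0 and nvidia.com/gpu=all:

nvidia-ctk cdi list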
STEP 4 - Create NVIDIA RuntimeClass
Run from a control plane node:
cat > nvidia-runtime-class.yaml << 'EOF'
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl apply -f nvidia-runtime-class.yaml
STEP 5 - Label GPU Nodes
Label each GPU node (replace <node-name> with actual node name):
kubectl label nodes <node-name> allow-nvdp=true
Verify Labels
kubectl describe node <node-name> | grep -A5 Labels
You should see allow-nvdp=true.
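If you labeled several GPU nodes, a label selector confirms them all at once (an optional convenience check):

kubectl get nodes -l allow-nvdp=true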
STEP 6 - Install NVIDIA Device Plugin
Run from a control plane node:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.18.0 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=cdi-cri \
  --set nvidiaDriverRoot="/" \
  --set-string nodeSelector.allow-nvdp="true"
Verify Installation
kubectl -n nvidia-device-plugin get pods -o wide
You should see nvdp-nvidia-device-plugin pods running on your GPU nodes.
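Once the plugin pods are running, each GPU node should also advertise nvidia.com/gpu under its Capacity and Allocatable resources. An optional check (replace <node-name> as before):

kubectl describe node <node-name> | grep nvidia.com/gpu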
Check Logs
kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
Expected output:
Detected NVML platform: found NVML library
Starting GRPC server for 'nvidia.com/gpu'
Registered device plugin for 'nvidia.com/gpu' with Kubelet
STEP 7 - Test GPU Functionality
Create a test pod:
cat > gpu-test-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
kubectl apply -f gpu-test-pod.yaml
Verify Test
Wait for the pod to complete, then check logs:
kubectl logs gpu-test
Expected output:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Test nvidia-smi
Create an interactive test:
kubectl run gpu-shell --rm -it --restart=Never --image=nvidia/cuda:11.6.0-base-ubuntu20.04 -- nvidia-smi
You should see GPU information displayed.
Cleanup
kubectl delete pod gpu-test
Next Steps
Your Kubernetes cluster now has GPU support!
Optional enhancements:
- TLS Certificates - Automatic SSL certificates
- IP Leases - Enable static IPs
Note: After installing the provider, you'll need to add GPU attributes to your provider.yaml to advertise GPU capabilities. This is covered in the Provider Installation guide.
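For orientation only, a GPU attribute entry in provider.yaml generally follows the capabilities/gpu/vendor/nvidia/model/<model> key pattern; the sketch below uses h100 as a placeholder model, and the Provider Installation guide remains the authoritative reference for the exact keys your provider should advertise.

attributes:
  - key: capabilities/gpu/vendor/nvidia/model/h100
    value: true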