The steps involved in enabling your Akash Provider to host GPU resources are covered in this section via the following steps:
- GPU Provider Configuration
- GPU Node Label
- Apply NVIDIA Runtime Engine
- Update Akash Provider
- GPU Test Deployments
- GPU Provider Troubleshooting
GPU Provider Configuration
Overview
Sections in this guide cover the installation of the NVIDIA drivers, the NVIDIA Container Toolkit, and the related configuration necessary for Akash Provider GPU hosting.
Install NVIDIA Drivers & Toolkit
NOTE - The steps in this section should be completed on all Kubernetes nodes hosting GPU resources
Prepare Environment
NOTE - reboot the servers following the completion of this step
apt update
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
apt autoremove
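Per the note above, reboot each server once the upgrade completes so the updated kernel and packages take effect. A minimal example:
# Reboot the node after the dist-upgrade finishes
reboot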
Install Latest NVIDIA Drivers
The ubuntu-drivers devices command detects your GPU and determines which version of the NVIDIA drivers is best.
NOTE - the NVIDIA drivers detailed and installed in this section have known compatibility issues with some 6.X Linux kernels as discussed here. In our experience, when such compatibility issues occur the driver installs with no errors generated but does not function properly. If you encounter Linux kernel and NVIDIA driver compatibility issues, consider downgrading the kernel to the officially supported Ubuntu 22.04 kernel, which at the time of this writing is 5.15.0-73.
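If you want to confirm which kernel a node is currently running before installing the drivers, a quick read-only check (standard Linux tooling, nothing Akash-specific) is:
# Display the running kernel version; compare against the supported 5.15.x series
uname -r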
apt install ubuntu-drivers-common
ubuntu-drivers devices
Expected/Example Output
root@node1:~# ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1e.0 ==
modalias : pci:v000010DEd00001EB8sv000010DEsd000012A2bc03sc02i00
vendor   : NVIDIA Corporation
model    : TU104GL [Tesla T4]
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-515 - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-525 - distro non-free recommended
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
Driver Install Based on Output
Run either ubuntu-drivers autoinstall or apt install nvidia-driver-525 (driver names may be different in your environment).
The autoinstall option installs the recommended version and is appropriate in most instances.
The apt install <driver-name> alternative allows the install of a preferred driver instead of the recommended version.
ubuntu-drivers autoinstall
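As an optional sanity check after the driver install (and a reboot, if one is required), nvidia-smi should list the node's GPUs. If it errors out, the driver/kernel pairing likely needs attention as described in the note above.
# Confirm the NVIDIA driver is loaded and the GPU is visible
nvidia-smi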
Install the NVIDIA Container Toolkit
NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | tee /etc/apt/sources.list.d/libnvidia-container.list
apt-get update
apt-get install -y nvidia-container-toolkit nvidia-container-runtime
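Optionally, you can confirm the toolkit installed correctly; the nvidia-ctk utility ships with the NVIDIA Container Toolkit and reports its version:
# Verify the NVIDIA Container Toolkit CLI is present
nvidia-ctk --version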
For non-PCIe, e.g. SXM* GPUs
In some circumstances it has been found that the CUDA Drivers Fabric Manager needs to be installed on worker nodes hosting GPU resources (typically, non-PCIe GPU configurations such as those using SXM form factors).
Replace 525 with the NVIDIA driver version installed in the previous steps. You may need to wait about 2-3 minutes for the nvidia-fabricmanager service to initialize.
apt-get install cuda-drivers-fabricmanager-525
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
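A quick way to confirm the Fabric Manager service came up (assuming systemd, as on Ubuntu 22.04) before moving on:
# Check that the fabric manager service is active
systemctl status nvidia-fabricmanager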
Additional References for Node Configurations
NOTE - references are for additional info only. No actions are necessary and the Kubernetes nodes should be all set to proceed to the next step based on the configurations enacted in the prior steps of this doc.
- https://github.com/NVIDIA/k8s-device-plugin#prerequisites
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA Runtime Configuration
Worker nodes
NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources
Update the nvidia-container-runtime config in order to prevent NVIDIA_VISIBLE_DEVICES=all abuse, where tenants could access more GPUs than they requested.
NOTE - This will only work with the nvdp/nvidia-device-plugin helm chart installed with --set deviceListStrategy=volume-mounts (you'll get there in the next steps)
Make sure the config file /etc/nvidia-container-runtime/config.toml contains these lines, uncommented and set to these values:
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
NOTE - /etc/nvidia-container-runtime/config.toml is part of the nvidia-container-toolkit-base package, so package upgrades won't override the customer-set parameters there since the file is registered in /var/lib/dpkg/info/nvidia-container-toolkit-base.conffiles
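To double-check that both settings took effect, a simple read-only grep of the config file can be used:
# Both lines should be present, uncommented, and set as shown above
grep -E 'accept-nvidia-visible-devices' /etc/nvidia-container-runtime/config.toml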
Kubespray
NOTE - the steps in this sub-section should be completed on the Kubespray host only
NOTE - skip this sub-section if these steps were completed during your Kubernetes build process
In this step we add the NVIDIA runtime configuration to the Kubespray inventory. The runtime will be applied to the necessary Kubernetes hosts when Kubespray builds the cluster in the subsequent step.
Create NVIDIA Runtime File for Kubespray Use
cat > ~/kubespray/inventory/akash/group_vars/all/akash.yml <<'EOF'
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '/usr/bin/nvidia-container-runtime'
EOF
Kubespray the Kubernetes Cluster
cd ~/kubespray
source venv/bin/activate
ansible-playbook -i inventory/akash/hosts.yaml -b -v --private-key=~/.ssh/id_rsa cluster.yml
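Once the playbook completes, you can optionally confirm on a GPU worker node that the nvidia runtime was registered with containerd; this check assumes the default containerd config path used by Kubespray:
# Look for the nvidia runtime entry added via containerd_additional_runtimes
grep -A5 'nvidia' /etc/containerd/config.toml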
GPU Node Label
Overview
In this section we verify that necessary Kubernetes node labels have been applied for your GPUs. The labeling of nodes is an automated process and here we only verify proper labels have been applied.
Verification of Node Labels
- Replace <node-name> with the node of interest
kubectl describe node <node-name> | grep -A10 Labels
Expected Output using Example
- Note the presence of the GPU model, interface, and ram expected values.
root@node1:~# kubectl describe node node2 | grep -A10 Labels
Labels:             akash.network=true
                    akash.network/capabilities.gpu.vendor.nvidia.model.t4=1
                    akash.network/capabilities.gpu.vendor.nvidia.model.t4.interface.PCIe=1
                    akash.network/capabilities.gpu.vendor.nvidia.model.t4.ram.16Gi=1
                    akash.network/capabilities.storage.class.beta2=1
                    akash.network/capabilities.storage.class.default=1
                    allow-nvdp=true
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node2
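If you prefer a cluster-wide view rather than checking nodes one at a time, an optional alternative is to filter node labels for the GPU capability prefix; the exact labels shown will vary with your GPU models:
# List all nodes and show only lines containing GPU capability labels
kubectl get nodes --show-labels | grep capabilities.gpu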
Apply NVIDIA Runtime Engine
Create RuntimeClass
NOTE - conduct these steps on the control plane node that Helm was installed on via the previous step
Create the NVIDIA Runtime Config
cat > nvidia-runtime-class.yaml << EOF
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF
Apply the NVIDIA Runtime Config
kubectl apply -f nvidia-runtime-class.yaml
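To confirm the RuntimeClass was created, a quick read-only check:
# The "nvidia" runtime class should appear in the list
kubectl get runtimeclass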
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on All Nodes
NOTE - in some scenarios a provider may host GPUs only on a subset of Kubernetes worker nodes. Use the instructions in this section if ALL Kubernetes worker nodes have available GPU resources. If only a subset of worker nodes host GPU resources, use the section Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes instead. Only one of these two sections should be completed.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts
Expected/Example Output
root@ip-172-31-8-172:~# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts
Release "nvdp" does not exist. Installing it now.NAME: nvdpLAST DEPLOYED: Thu Apr 13 19:11:28 2023NAMESPACE: nvidia-device-pluginSTATUS: deployedREVISION: 1TEST SUITE: None
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes
NOTE - use the instructions in this section if only a subset of Kubernetes worker nodes have available GPU resources.
- By default, the nvidia-device-plugin DaemonSet may run on all nodes in your Kubernetes cluster. If you want to restrict its deployment to only GPU-enabled nodes, you can leverage Kubernetes node labels and selectors.
- Specifically, you can use the allow-nvdp=true label to limit where the DaemonSet is scheduled.
STEP 1: Label the GPU Nodes
- First, identify your GPU nodes and label them with allow-nvdp=true. You can do this by running the following command for each GPU node. Replace <node-name> with the name of the node you're labeling.
NOTE - if you are unsure of the <node-name> to be used in this command, issue kubectl get nodes from one of your Kubernetes control plane nodes and obtain it from the NAME column of the command output
kubectl label nodes <node-name> allow-nvdp=true
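To verify the label was applied before moving on, a read-only check such as the following should list only your GPU nodes:
# Nodes carrying the allow-nvdp=true label
kubectl get nodes -l allow-nvdp=true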
STEP 2: Update Helm Chart Values
- By setting the node selector, you are ensuring that the nvidia-device-plugin DaemonSet will only be scheduled on nodes with the allow-nvdp=true label.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.15.1 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set-string nodeSelector.allow-nvdp="true"
STEP 3: Verify
kubectl -n nvidia-device-plugin get pods -o wide
Expected/Example Output
- In this example only nodes node1, node3 and node4 have the allow-nvdp=true label and that's where the nvidia-device-plugin pods spawned:
root@node1:~# kubectl -n nvidia-device-plugin get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE    NOMINATED NODE   READINESS GATES
nvdp-nvidia-device-plugin-gqnm2   1/1     Running   0          11s   10.233.75.1   node2   <none>           <none>
Verification - Applicable to all Environments
kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
Example/Expected Output
root@node1:~# kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
  "sharing": {
    "timeSlicing": {}
  }
}
2023/04/14 14:18:27 Retreiving plugins.
2023/04/14 14:18:27 Detected NVML platform: found NVML library
2023/04/14 14:18:27 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:27 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:27 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:27 Registered device plugin for 'nvidia.com/gpu' with Kubelet
  "sharing": {
    "timeSlicing": {}
  }
}
2023/04/14 14:18:29 Retreiving plugins.
2023/04/14 14:18:29 Detected NVML platform: found NVML library
2023/04/14 14:18:29 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:29 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:29 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:29 Registered device plugin for 'nvidia.com/gpu' with Kubelet
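As an additional, optional verification that the device plugin is advertising GPUs to the Kubernetes scheduler, you can inspect the allocatable GPU count per node; the column expression below is an illustrative example:
# Show allocatable nvidia.com/gpu per node (an empty value means no GPUs are advertised)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'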
Test GPUs
NOTE - conduct the steps in this section on a Kubernetes control plane node
Launch GPU Test Pod
Create the GPU Test Pod Config
cat > gpu-test-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      # Nvidia cuda compatibility https://docs.nvidia.com/deploy/cuda-compatibility/
      # for nvidia 510 drivers
      ## image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      # for nvidia 525 drivers use below image
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
Apply the GPU Test Pod Config
kubectl apply -f gpu-test-pod.yaml
Verification of GPU Pod
kubectl logs gpu-pod
Expected/Example Output
root@node1:~# kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
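Once the test passes, the pod can be removed; this cleanup step is optional and simply deletes the test resources created above:
# Remove the completed GPU test pod
kubectl delete -f gpu-test-pod.yaml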
Update Akash Provider
Update Provider Configuration File
Providers must be updated with attributes in order to bid on the GPUs.
NOTE - in the Akash Provider build documentation a provider.yaml file was created which stores provider attributes and other settings. In this section we will update that provider.yaml file with GPU related attributes. The remainder of the pre-existing file should be left unchanged.
GPU Attributes Template
- The GPU model template below is used in the subsequent Provider Configuration File
- Multiple such entries should be included in the Provider Configuration File if the provider has multiple GPU types
- Currently Akash providers may only host one GPU type per worker node. But different GPU models/types may be hosted on separate Kubernetes nodes.
- We recommend including both a GPU attribute which includes VRAM and a GPU attribute which does not include VRAM to ensure your provider bids whether the deployer includes or excludes the VRAM spec. An example of this recommended approach is in the provider.yaml example below.
- Include the GPU interface type - as seen in the example below - to ensure the provider bids when the deployer includes the interface in the SDL.
capabilities/gpu/vendor/<vendor name>/model/<model name>: true
Example Provider Configuration File
- In the example configuration file below the Akash Provider will advertise availability of NVIDIA GPU model A100
- Steps included in this code block create the necessary provider.yaml file in the expected directory
- Ensure that the attributes section is updated with your own values
cd ~
cd provider
vim provider.yaml
Update the Provider YAML File With GPU Attribute
- When the provider.yaml file update is complete it should look like this:
---
from: "$ACCOUNT_ADDRESS"
key: "$(cat ~/key.pem | openssl base64 -A)"
keysecret: "$(echo $KEY_PASSWORD | openssl base64 -A)"
domain: "$DOMAIN"
node: "$AKASH_NODE"
withdrawalperiod: 12h
attributes:
  - key: host
    value: akash
  - key: tier
    value: community
  - key: capabilities/gpu/vendor/nvidia/model/a100
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi/interface/pcie
    value: true
  - key: capabilities/gpu/vendor/nvidia/model/a100/interface/pcie
    value: true
Provider Bid Defaults
- When a provider is created the default bid engine settings are used to derive pricing per workload. If desired these settings can be updated, but we recommend initially using the default values.
- For a thorough discussion on customized pricing please visit this guide.
Update Provider Via Helm
helm upgrade --install akash-provider akash/provider -n akash-services -f provider.yaml \
  --set bidpricescript="$(cat /root/provider/price_script_generic.sh | openssl base64 -A)"
Verify Health of Akash Provider
Use the following command to verify the health of the Akash Provider and Hostname Operator pods
kubectl get pods -n akash-services
Example/Expected Output
root@node1:~/provider# kubectl get pods -n akash-services
NAME                                       READY   STATUS    RESTARTS   AGE
akash-hostname-operator-5c59757fcc-kt7dl   1/1     Running   0          17s
akash-provider-0                           1/1     Running   0          59s
Verify Provider Attributes On Chain
- In this step we verify that your updated Akash Provider attributes have been published on the blockchain and that the GPU model related attributes are now in place.
NOTE - conduct this verification from your Kubernetes control plane node
# Ensure that a RPC node environment variable is present for query
export AKASH_NODE=https://rpc.akashnet.net:443

# Replace the provider address with your own value
provider-services query provider get <provider-address>
Example/Expected Output
provider-services query provider get akash1mtnuc449l0mckz4cevs835qg72nvqwlul5wzyf
attributes:
- key: region
  value: us-central
- key: host
  value: akash
- key: tier
  value: community
- key: organization
  value: akash test provider
- key: capabilities/gpu/vendor/nvidia/model/a100
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi/interface/pcie
  value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/interface/pcie
  value: "true"
host_uri: https://provider.akashtestprovider.xyz:8443
info:
  email: ""
  website: ""
owner: akash1mtnuc449l0mckz4cevs835qg72nvqwlul5wzyf
Verify Akash Provider Image
Verify the Provider image is correct by running this command:
kubectl -n akash-services get pod akash-provider-0 -o yaml | grep image: | uniq -c
Expected/Example Output
root@node1:~/provider# kubectl -n akash-services get pod akash-provider-0 -o yaml | grep image: | uniq -c
      4     image: ghcr.io/akash-network/provider:0.4.6
GPU Test Deployments
Overview
Use any of the Akash deployment tools covered here for your Provider test deployments.
NOTE - this section covers GPU specific deployment testing and verification of your Akash Provider. In addition, general Provider verifications can be made via this Provider Checkup guide.
Example GPU SDL #1
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs. Adjust accordingly for your provider testing.
---
version: "2.0"

services:
  gpu-test:
    # Nvidia cuda compatibility https://docs.nvidia.com/deploy/cuda-compatibility/
    # for nvidia 510 drivers
    ## image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    # for nvidia 525 drivers use below image
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
    command:
      - "sh"
      - "-c"
    args:
      - 'sleep infinity'
    expose:
      - port: 3000
        as: 80
        to:
          - global: true
profiles:
  compute:
    gpu-test:
      resources:
        cpu:
          units: 1
        memory:
          size: 1Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a4000
        storage:
          - size: 512Mi
  placement:
    westcoast:
      pricing:
        gpu-test:
          denom: uakt
          amount: 100000
deployment:
  gpu-test:
    westcoast:
      profile: gpu-test
      count: 1
Testing of Deployment/GPU Example #1
- Conduct the following tests from the deployment’s shell.
Test 1
/tmp/sample
Expected/Example Output
root@gpu-test-6d4f545b6f-f95zk:/# /tmp/sample
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Test 2
nvidia-smi
Expected/Example Output
root@gpu-test-6d4f545b6f-f95zk:/# nvidia-smi
Fri Apr 14 09:23:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:05:00.0 Off |                  Off |
| 41%   44C    P8    13W / 140W |      0MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@gpu-test-6d4f545b6f-f95zk:/#
Example GPU SDL #2
NOTE - there is currently an issue with GPU deployments closing once their primary process completes. Due to this issue the example SDL below causes repeated container restarts. The container will restart when the stable diffusion task has completed. When this issue has been resolved, GPU containers will remain running perpetually and will not close when the primary process defined in the SDL completes.
NOTE - the CUDA version necessary for this image is currently 11.7. Check the image documentation page here for possible updates.
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs
---
version: "2.0"

services:
  gpu-test:
    image: ghcr.io/fboulnois/stable-diffusion-docker
    expose:
      - port: 3000
        as: 80
        to:
          - global: true
    cmd:
      - run
    args:
      - 'An impressionist painting of a parakeet eating spaghetti in the desert'
      - --attention-slicing
      - --xformers-memory-efficient-attention
profiles:
  compute:
    gpu-test:
      resources:
        cpu:
          units: 1
        memory:
          size: 20Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a4000
        storage:
          - size: 100Gi
  placement:
    westcoast:
      pricing:
        gpu-test:
          denom: uakt
          amount: 100000
deployment:
  gpu-test:
    westcoast:
      profile: gpu-test
      count: 1