GPU Resource Enablement (Optional Step)

The steps involved in enabling your Akash Provider to host GPU resources are covered in this section and via these steps:

GPU Provider Configuration

Overview

Sections in this guide cover the installation of the following packages necessary for Akash Provider GPU hosting:

Install NVIDIA Drivers & Toolkit

NOTE - The steps in this section should be completed on all Kubernetes nodes hosting GPU resources

Prepare Environment

NOTE - reboot the servers following the completion of this step

apt update
DEBIAN_FRONTEND=noninteractive apt -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" dist-upgrade
apt autoremove

Install Latest NVIDIA Drivers

The ubuntu-drivers devices command detects your GPU and determines which version of the NVIDIA drivers is best.

NOTE - the NVIDIA drivers detailed and installed in this section have known compatibility issues with some 6.X lLinux kernels as discussed here. In our experience, when such compatibility issue occur the driver will install with no errors generated but will not functionality properly. If you encounter Linux kernel and NVIDIA driver compatibility issues, consider downgrading the Kernel to the officially supported Ubuntu 22.04 kernel which at the time of this writing is 5.15.0-73

apt install ubuntu-drivers-common
ubuntu-drivers devices

Expected/Example Output

root@node1:~# ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1e.0 ==
modalias : pci:v000010DEd00001EB8sv000010DEsd000012A2bc03sc02i00
vendor : NVIDIA Corporation
model : TU104GL [Tesla T4]
driver : nvidia-driver-450-server - distro non-free
driver : nvidia-driver-418-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-515 - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-525-server - distro non-free
driver : nvidia-driver-525 - distro non-free recommended
driver : nvidia-driver-515-server - distro non-free
driver : nvidia-driver-470 - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin

Driver Install Based on Output

Run either ubuntu-drivers autoinstall or apt install nvidia-driver-525 (driver names may be different in your environment).

The autoinnstall option installs the recommended version and is appropriate in most instances.

The apt install <driver-name>alternatively allows the install of preferred driver instead of the recommended version.

ubuntu-drivers autoinstall

Install the NVIDIA Container Toolkit

NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | tee /etc/apt/sources.list.d/libnvidia-container.list
apt-get update
apt-get install -y nvidia-container-toolkit nvidia-container-runtime

For non-PCIe, e.g. SXM* GPUs

In some circumstances it has been found that the CUDA Drivers Fabric Manager needs to be installed on worker nodes hosting GPU resources (typically, non-PCIe GPU configurations such as those using SXM form factors).

Replace 525 with your nvidia driver version installed in the previous steps

apt-get install cuda-drivers-fabricmanager-525

Reference

Additional References for Node Configurations

NOTE - references are for additional info only. No actions are necessary and the Kubernetes nodes should be all set to proceed to next step based on configurations enacted in prior steps on this doc.

NVIDIA Runtime Configuration

Worker nodes

NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources

Update the nvidia-container-runtime config in order to prevent NVIDIA_VISIBLE_DEVICES=all abuse where tenants could access more GPU’s than they requested.

NOTE - This will only work with nvdp/nvidia-device-plugin helm chart installed with --set deviceListStrategy=volume-mounts (you’ll get there in the next steps)

Make sure the config file /etc/nvidia-container-runtime/config.toml contains these line uncommmented and set to these values:

accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false

NOTE - /etc/nvidia-container-runtime/config.toml is part of nvidia-container-toolkit-base package; so it won’t override the customer-set parameters there since it is part of the /var/lib/dpkg/info/nvidia-container-toolkit-base.conffiles

Kubespray

NOTE - the steps in this sub-section should be completed on the Kubespray host only

NOTE - skip this sub-section if these steps were completed during your Kubernetes build process

In this step we add the NVIDIA runtime confguration into the Kubespray inventory. The runtime will be applied to necessary Kubernetes hosts when Kubespray builds the cluster in the subsequent step.

Create NVIDIA Runtime File for Kubespray Use

cat > ~/kubespray/inventory/akash/group_vars/all/akash.yml <<'EOF'
containerd_additional_runtimes:
- name: nvidia
type: "io.containerd.runc.v2"
engine: ""
root: ""
options:
BinaryName: '/usr/bin/nvidia-container-runtime'
EOF

Kubespray the Kubernetes Cluster

cd ~/kubespray
source venv/bin/activate
ansible-playbook -i inventory/akash/hosts.yaml -b -v --private-key=~/.ssh/id_rsa cluster.yml

GPU Node Label

Overview

In this section we verify that necessary Kubernetes node labels have been applied for your GPUs. The labeling of nodes is an automated process and here we only verify proper labels have been applied.

Verification of Node Labels

  • Replace <node-name> with the node of interest
kubectl describe node <node-name> | grep -A10 Labels

Expected Output using Example

  • Note the presence of the GPU model, interface, and ram expected values.
root@node1:~# kubectl describe node node2 | grep -A10 Labels
Labels: akash.network=true
akash.network/capabilities.gpu.vendor.nvidia.model.t4=1
akash.network/capabilities.gpu.vendor.nvidia.model.t4.interface.PCIe=1
akash.network/capabilities.gpu.vendor.nvidia.model.t4.ram.16Gi=1
akash.network/capabilities.storage.class.beta2=1
akash.network/capabilities.storage.class.default=1
allow-nvdp=true
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=node2

Apply NVIDIA Runtime Engine

Create RuntimeClass

NOTE - conduct these steps on the control plane node that Helm was installed on via the previous step

Create the NVIDIA Runtime Config

cat > nvidia-runtime-class.yaml << EOF
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: nvidia
handler: nvidia
EOF

Apply the NVIDIA Runtime Config

kubectl apply -f nvidia-runtime-class.yaml

Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on All Nodes

NOTE - in some scenarios a provider may host GPUs only on a subset of Kubernetes worker nodes. Use the instructions in this section if ALL Kubernetes worker nodes have available GPU resources. If only a subset of worker nodes host GPU resources - use the section Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes instead. Only one of these two sections should be completed.

Terminal window
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.14.5 \
--set runtimeClassName="nvidia" \
--set deviceListStrategy=volume-mounts

Expected/Example Output

root@ip-172-31-8-172:~# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.14.5 \
--set runtimeClassName="nvidia" \
--set deviceListStrategy=volume-mounts
Release "nvdp" does not exist. Installing it now.
NAME: nvdp
LAST DEPLOYED: Thu Apr 13 19:11:28 2023
NAMESPACE: nvidia-device-plugin
STATUS: deployed
REVISION: 1
TEST SUITE: None

Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes

NOTE - use the instructions in this section if only a subset of Kubernetes worker nodes have available GPU resources.

  • By default, the nvidia-device-plugin DaemonSet may run on all nodes in your Kubernetes cluster. If you want to restrict its deployment to only GPU-enabled nodes, you can leverage Kubernetes node labels and selectors.
  • Specifically, you can use the allow-nvdp=true label to limit where the DaemonSet is scheduled.

STEP 1: Label the GPU Nodes

  • First, identify your GPU nodes and label them with allow-nvdp=true. You can do this by running the following command for each GPU node
  • Replace node-name of the node you’re labeling

NOTE - if you are unsure of the <node-name> to be used in this command - issue kubectl get nodes from one of your Kubernetes control plane nodes to obtain via the NAME column of this command output

kubectl label nodes <node-name> allow-nvdp=true

STEP 2: Update Helm Chart Values

  • By setting the node selector, you are ensuring that the nvidia-device-plugin DaemonSet will only be scheduled on nodes with the allow-nvdp=true label.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.14.5 \
--set runtimeClassName="nvidia" \
--set deviceListStrategy=volume-mounts \
--set-string nodeSelector.allow-nvdp="true"

STEP 3: Verify

kubectl -n nvidia-device-plugin get pods -o wide

Expected/Example Output

  • In this example only nodes: node1, node3 and node4 have the allow-nvdp=true labels and that’s where nvidia-device-plugin pods spawned at:
root@node1:~# kubectl -n nvidia-device-plugin get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nvdp-nvidia-device-plugin-gqnm2 1/1 Running 0 11s 10.233.75.1 node2 <none> <none>

Verification - Applicable to all Environments

kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp

Example/Expected Output

root@node1:~# kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
"sharing": {
"timeSlicing": {}
}
}
2023/04/14 14:18:27 Retreiving plugins.
2023/04/14 14:18:27 Detected NVML platform: found NVML library
2023/04/14 14:18:27 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:27 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:27 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:27 Registered device plugin for 'nvidia.com/gpu' with Kubelet
"sharing": {
"timeSlicing": {}
}
}
2023/04/14 14:18:29 Retreiving plugins.
2023/04/14 14:18:29 Detected NVML platform: found NVML library
2023/04/14 14:18:29 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/04/14 14:18:29 Starting GRPC server for 'nvidia.com/gpu'
2023/04/14 14:18:29 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/04/14 14:18:29 Registered device plugin for 'nvidia.com/gpu' with Kubelet

Test GPUs

NOTE - conduct the steps in this section on a Kubernetes control plane node

Launch GPU Test Pod

Create the GPU Test Pod Config
cat > gpu-test-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod
spec:
restartPolicy: Never
runtimeClassName: nvidia
containers:
- name: cuda-container
# Nvidia cuda compatibility https://docs.nvidia.com/deploy/cuda-compatibility/
# for nvidia 510 drivers
## image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
# for nvidia 525 drivers use below image
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
resources:
limits:
nvidia.com/gpu: 1 # requesting 1 GPU
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
EOF
Apply the GPU Test Pod Config
kubectl apply -f gpu-test-pod.yaml

Verification of GPU Pod

kubectl logs gpu-pod
Expected/Example Output
root@node1:~# kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Update Akash Provider

Update Provider Configuration File

Providers must be updated with attributes in order to bid on the GPUs.

NOTE - in the Akash Provider build documentation a provider.yaml file was created and which stores provider attribute/other settings. In this section we will update that provider.yaml file with GPU related attributes. The remainder of the pre-existing file should be left unchanged.

GPU Attributes Template

  • GPU model template is used in the subsequent Provider Configuration File
  • Multiple such entries should be included in the Provider Configuration File if the providers has multiple GPU types
  • Currently Akash providers may only host one GPU type per worker node. But different GPU models/types may be hosted on separate Kubernetes nodes.
  • We recommend including both a GPU attribute which includes VRAM and a GPU attribute which does not include VRAM to ensure your provider bids when the deployer includes/excludes VRAM spec. Example of this recommended approach in the provider.yaml example below.
  • Include the GPU interface type - as seen in the example below - to ensure provider bids when the deployer includes the interface in the SDL.
capabilities/gpu/vendor/<vendor name>/model/<model name>: true

Example Provider Configuration File

  • In the example configuration file below the Akash Provider will advertise availability of NVIDIA GPU model A4000
  • Steps included in this code block create the necessary provider.yaml file in the expected directory
  • Ensure that the attributes section is updated with your own values
cd ~
cd provider
vim provider.yaml

Update the Provider YAML File With GPU Attribute

  • When the provider.yaml file update is complete is should look like this:
---
from: "$ACCOUNT_ADDRESS"
key: "$(cat ~/key.pem | openssl base64 -A)"
keysecret: "$(echo $KEY_PASSWORD | openssl base64 -A)"
domain: "$DOMAIN"
node: "$AKASH_NODE"
withdrawalperiod: 12h
attributes:
- key: host
value: akash
- key: tier
value: community
- key: capabilities/gpu/vendor/nvidia/model/a100
value: true
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
value: true
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi/interface/pcie
value: true
- key: capabilities/gpu/vendor/nvidia/model/a100/interface/pcie
value: true

Provider Bid Defaults

  • When a provider is created the default bid engine settings are used which are used to derive pricing per workload. If desired these settings could be updated. But we would recommend initially using the default values.
  • For a through discussion on customized pricing please visit this guide.

Update Provider Via Helm

helm upgrade --install akash-provider akash/provider -n akash-services -f provider.yaml \
--set bidpricescript="$(cat /root/provider/price_script_generic.sh | openssl base64 -A)"

Verify Health of Akash Provider

Use the following command to verify the health of the Akash Provider and Hostname Operator pods

kubectl get pods -n akash-services

Example/Expected Output

root@node1:~/provider# kubectl get pods -n akash-services
NAME READY STATUS RESTARTS AGE
akash-hostname-operator-5c59757fcc-kt7dl 1/1 Running 0 17s
akash-provider-0 1/1 Running 0 59s

Verify Provider Attributes On Chain

  • In this step we ensure that your updated Akash Provider Attributes have been updated on the blockchain. Ensure that the GPU model related attributes are now in place via this step.

NOTE - conduct this verification from your Kubernetes control plane node

# Ensure that a RPC node environment variable is present for query
export AKASH_NODE=https://rpc.akashnet.net:443
# Replace the provider address with your own value
provider-services query provider get <provider-address>

Example/Expected Output

Terminal window
provider-services query provider get akash1mtnuc449l0mckz4cevs835qg72nvqwlul5wzyf
attributes:
- key: region
value: us-central
- key: host
value: akash
- key: tier
value: community
- key: organization
value: akash test provider
- key: capabilities/gpu/vendor/nvidia/model/a100
value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi
value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/ram/80Gi/interface/pcie
value: "true"
- key: capabilities/gpu/vendor/nvidia/model/a100/interface/pcie
value: "true"
host_uri: https://provider.akashtestprovider.xyz:8443
info:
email: ""
website: ""
owner: akash1mtnuc449l0mckz4cevs835qg72nvqwlul5wzyf

Verify Akash Provider Image

Verify the Provider image is correct by running this command:

kubectl -n akash-services get pod akash-provider-0 -o yaml | grep image: | uniq -c

Expected/Example Output

root@node1:~/provider# kubectl -n akash-services get pod akash-provider-0 -o yaml | grep image: | uniq -c
4 image: ghcr.io/akash-network/provider:0.4.6

GPU Test Deployments

Overview

Use any of the Akash deployment tools covered here for your Provider test deployments.

NOTE - this section covers GPU specific deployment testing and verificaiton of your Akash Provider. In addition, general Provider verifications can be made via this Provider Checkup guide.

Example GPU SDL #1

NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs. Adjust accordingly for your provider testing.

---
version: "2.0"
services:
gpu-test:
# Nvidia cuda compatibility https://docs.nvidia.com/deploy/cuda-compatibility/
# for nvidia 510 drivers
## image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
# for nvidia 525 drivers use below image
image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.6.0
command:
- "sh"
- "-c"
args:
- 'sleep infinity'
expose:
- port: 3000
as: 80
to:
- global: true
profiles:
compute:
gpu-test:
resources:
cpu:
units: 1
memory:
size: 1Gi
gpu:
units: 1
attributes:
vendor:
nvidia:
- model: a4000
storage:
- size: 512Mi
placement:
westcoast:
pricing:
gpu-test:
denom: uakt
amount: 100000
deployment:
gpu-test:
westcoast:
profile: gpu-test
count: 1

Testing of Deployment/GPU Example #1

  • Conduct the following tests from the deployment’s shell.

Test 1

/tmp/sample

Expected/Example Output

root@gpu-test-6d4f545b6f-f95zk:/# /tmp/sample
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Test 2

nvidia-smi

Expected/Example Output

root@gpu-test-6d4f545b6f-f95zk:/# nvidia-smi
Fri Apr 14 09:23:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:05:00.0 Off | Off |
| 41% 44C P8 13W / 140W | 0MiB / 16376MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@gpu-test-6d4f545b6f-f95zk:/#

Example GPU SDL #2

NOTE - there is currently an issue with GPU deployments closing once their primary process completes. Due to this issue the example SDL below causes repeated container resarts. The container will restart when the stable diffusion task has completed. When this issue has been resolved, GPU containers will remain running perpetually and will not close when the primary process defined in the SDL completes.

NOTE - the CUDA version necessary for this image is 11.7 currently. Check the image documentation page here for possible updates.

NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs

---
version: "2.0"
services:
gpu-test:
image: ghcr.io/fboulnois/stable-diffusion-docker
expose:
- port: 3000
as: 80
to:
- global: true
cmd:
- run
args:
- 'An impressionist painting of a parakeet eating spaghetti in the desert'
- --attention-slicing
- --xformers-memory-efficient-attention
profiles:
compute:
gpu-test:
resources:
cpu:
units: 1
memory:
size: 20Gi
gpu:
units: 1
attributes:
vendor:
nvidia:
- model: a4000
storage:
- size: 100Gi
placement:
westcoast:
pricing:
gpu-test:
denom: uakt
amount: 100000
deployment:
gpu-test:
westcoast:
profile: gpu-test
count: 1
footer-logo-dark

© Akash Network 2024 The Akash Network Authors Documentation Distributed under CC BY 4.0

Open-source Apache 2.0 Licensed.

GitHub v0.20.0

Privacy