This section covers the steps required to enable your Akash Provider to host GPU resources:
- GPU Provider Configuration
- GPU Node Label
- Apply NVIDIA Runtime Engine
- Update Akash Provider
- GPU Test Deployments
- GPU Provider Troubleshooting
GPU Provider Configuration
Overview
Sections in this guide cover the installation of the following packages necessary for Akash Provider GPU hosting:
Install NVIDIA Drivers & Toolkit
NOTE - The steps in this section should be completed on all Kubernetes nodes hosting GPU resources
Prepare Environment
NOTE - reboot the servers following the completion of this step
Install Latest NVIDIA Drivers
The ubuntu-drivers devices command detects your GPU and determines which version of the NVIDIA drivers is best.
NOTE - the NVIDIA drivers detailed and installed in this section have known compatibility issues with some 6.X Linux kernels as discussed here. In our experience, when such compatibility issues occur the driver installs without generating any errors but does not function properly. If you encounter Linux kernel and NVIDIA driver compatibility issues, consider downgrading the kernel to the officially supported Ubuntu 22.04 kernel, which at the time of this writing is 5.15.0-73.
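A quick way to see which kernel series a node is running before installing drivers is sketched here. The helper name kernel_needs_attention is hypothetical, and matching on a leading "6." only illustrates the 6.X concern described in the note; it is not an official compatibility test.

```shell
# Hypothetical helper: succeeds when the kernel release string starts with "6.".
kernel_needs_attention() {
  case "$1" in
    6.*) return 0 ;;
    *)   return 1 ;;
  esac
}

release="$(uname -r)"   # kernel release of the current node
if kernel_needs_attention "$release"; then
  echo "Kernel $release is in the 6.X series; verify the NVIDIA driver after install."
else
  echo "Kernel $release is not in the 6.X series."
fi
```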
Expected/Example Output
Driver Install Based on Output
Run either ubuntu-drivers autoinstall or apt install nvidia-driver-525 (driver names may be different in your environment).
The autoinstall option installs the recommended version and is appropriate in most instances.
The apt install <driver-name> option alternatively lets you install a preferred driver instead of the recommended version.
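The driver steps can be sketched as follows. This is a minimal sketch, not a definitive procedure: the run helper and the APPLY variable are hypothetical and only print each command unless APPLY=1 is set, and nvidia-driver-525 is simply the example package name used in this guide. It assumes a root shell on an Ubuntu node.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

run apt install -y ubuntu-drivers-common   # provides the ubuntu-drivers tool
run ubuntu-drivers devices                 # lists detected GPUs and the recommended driver

# Option 1: install the recommended driver.
run ubuntu-drivers autoinstall

# Option 2 (use instead of option 1): install a specific driver package.
# run apt install -y nvidia-driver-525
```

Review the printed commands first, then re-run with APPLY=1 to execute them.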
Install the NVIDIA Container Toolkit
NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources
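A minimal sketch of the package install follows. It assumes NVIDIA's apt repository has already been configured on the node as described in the NVIDIA Container Toolkit install guide; the run helper and APPLY variable are hypothetical and print the commands unless APPLY=1 is set.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Assumption: the NVIDIA apt repository is already present in the apt sources.
run apt-get update
run apt-get install -y nvidia-container-toolkit
```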
For non-PCIe, e.g. SXM* GPUs
In some circumstances it has been found that the CUDA Drivers Fabric Manager needs to be installed on worker nodes hosting GPU resources (typically, non-PCIe GPU configurations such as those using SXM form factors).
Replace 525 with the NVIDIA driver version installed in the previous steps. You may need to wait about 2-3 minutes for the NVIDIA Fabric Manager to initialize.
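The Fabric Manager install might look like the sketch below. The DRIVER_VERSION variable, the run helper, and APPLY are hypothetical names; the package name assumes NVIDIA's cuda-drivers-fabricmanager-<version> naming and should be checked against the packages available in your apt sources.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Driver branch installed earlier; 525 is only the example used in this guide.
DRIVER_VERSION="${DRIVER_VERSION:-525}"

run apt-get install -y "cuda-drivers-fabricmanager-${DRIVER_VERSION}"
run systemctl start nvidia-fabricmanager
run systemctl enable nvidia-fabricmanager
run systemctl status nvidia-fabricmanager --no-pager   # confirm the service is active
```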
Additional References for Node Configurations
NOTE - references are for additional info only. No actions are necessary and the Kubernetes nodes should be all set to proceed to the next step based on the configurations enacted in prior steps in this doc.
- https://github.com/NVIDIA/k8s-device-plugin#prerequisites
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA Runtime Configuration
Worker nodes
NOTE - The steps in this sub-section should be completed on all Kubernetes nodes hosting GPU resources
Update the nvidia-container-runtime config to prevent NVIDIA_VISIBLE_DEVICES=all abuse, where tenants could access more GPUs than they requested.
NOTE - This will only work with the nvdp/nvidia-device-plugin helm chart installed with --set deviceListStrategy=volume-mounts (you’ll get there in the next steps)
Make sure the config file /etc/nvidia-container-runtime/config.toml contains these lines uncommented and set to these values:
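The check below is a minimal sketch that assumes the two settings in question are the nvidia-container-runtime options accept-nvidia-visible-devices-as-volume-mounts (true) and accept-nvidia-visible-devices-envvar-when-unprivileged (false). The function name and the CONFIG_FILE variable are hypothetical.

```shell
# Hypothetical helper: succeeds only if both settings are present, uncommented,
# and set to the values assumed above.
check_nvidia_runtime_config() {
  cfg="$1"
  grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$cfg" 2>/dev/null &&
  grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-envvar-when-unprivileged[[:space:]]*=[[:space:]]*false' "$cfg" 2>/dev/null
}

CONFIG_FILE="${CONFIG_FILE:-/etc/nvidia-container-runtime/config.toml}"
if check_nvidia_runtime_config "$CONFIG_FILE"; then
  echo "OK: $CONFIG_FILE has both settings"
else
  echo "Check $CONFIG_FILE: one or both settings are missing, commented out, or different"
fi
```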
NOTE - /etc/nvidia-container-runtime/config.toml is part of the nvidia-container-toolkit-base package. Because the file is listed in /var/lib/dpkg/info/nvidia-container-toolkit-base.conffiles, package upgrades won’t override the custom parameters set there.
Kubespray
NOTE - the steps in this sub-section should be completed on the Kubespray host only
NOTE - skip this sub-section if these steps were completed during your Kubernetes build process
In this step we add the NVIDIA runtime configuration into the Kubespray inventory. The runtime will be applied to the necessary Kubernetes hosts when Kubespray builds the cluster in the subsequent step.
Create NVIDIA Runtime File for Kubespray Use
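A sketch of such a runtime file is shown here, assuming Kubespray's containerd_additional_runtimes variable and the default /usr/bin/nvidia-container-runtime binary path. The file name akash.yml and the GROUP_VARS_DIR variable are illustrative; point GROUP_VARS_DIR at the group_vars/all directory of your own Kubespray inventory.

```shell
# Assumption: GROUP_VARS_DIR is your inventory's group_vars/all directory.
# It defaults to the current directory so nothing is written elsewhere by accident.
GROUP_VARS_DIR="${GROUP_VARS_DIR:-.}"

cat > "$GROUP_VARS_DIR/akash.yml" <<'EOF'
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '/usr/bin/nvidia-container-runtime'
EOF

echo "Wrote $GROUP_VARS_DIR/akash.yml"
```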
Kubespray the Kubernetes Cluster
GPU Node Label
Overview
In this section we verify that the necessary Kubernetes node labels have been applied for your GPUs. Node labeling is an automated process; here we only verify that the proper labels have been applied.
Verification of Node Labels
- Replace <node-name> with the node of interest
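One way to list a node's labels is sketched below using kubectl's --show-labels flag. NODE_NAME, the run helper, and APPLY are hypothetical names; the helper prints the command unless APPLY=1 is set.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Set NODE_NAME to the node of interest before running with APPLY=1.
NODE_NAME="${NODE_NAME:-<node-name>}"

run kubectl get node "$NODE_NAME" --show-labels
```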
Expected Output using Example
- Note the presence of the GPU model, interface, and ram expected values.
Apply NVIDIA Runtime Engine
Create RuntimeClass
NOTE - conduct these steps on the control plane node that Helm was installed on via the previous step
Create the NVIDIA Runtime Config
Apply the NVIDIA Runtime Config
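Creating and applying the RuntimeClass can be sketched as follows. The file name nvidia-runtime-class.yaml is illustrative, and the run helper and APPLY variable are hypothetical; the manifest is written to the current directory and the apply command is only printed unless APPLY=1 is set.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# RuntimeClass that maps the name "nvidia" to the containerd handler "nvidia".
cat > nvidia-runtime-class.yaml <<'EOF'
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF

run kubectl apply -f nvidia-runtime-class.yaml
```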
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on All Nodes
NOTE - in some scenarios a provider may host GPUs only on a subset of Kubernetes worker nodes. Use the instructions in this section if ALL Kubernetes worker nodes have available GPU resources. If only a subset of worker nodes host GPU resources, use the section Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes instead. Only one of these two sections should be completed.
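A sketch of the Helm install for the all-nodes case follows. It assumes the release name nvdp, the namespace nvidia-device-plugin, and the chart's runtimeClassName value; no chart version is pinned here, so consider adding --version with a release you have validated. The run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

run helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
run helm repo update

# deviceListStrategy=volume-mounts pairs with the runtime config set on the worker nodes.
run helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia \
  --set deviceListStrategy=volume-mounts
```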
Expected/Example Output
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes
NOTE - use the instructions in this section if only a subset of Kubernetes worker nodes have available GPU resources.
- By default, the nvidia-device-plugin DaemonSet may run on all nodes in your Kubernetes cluster. If you want to restrict its deployment to only GPU-enabled nodes, you can leverage Kubernetes node labels and selectors.
- Specifically, you can use the allow-nvdp=true label to limit where the DaemonSet is scheduled.
STEP 1: Label the GPU Nodes
- First, identify your GPU nodes and label them with allow-nvdp=true. You can do this by running the following command for each GPU node
- Replace <node-name> with the name of the node you’re labeling
NOTE - if you are unsure of the <node-name> to be used in this command, issue kubectl get nodes from one of your Kubernetes control plane nodes and take the value from the NAME column of the output
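The labeling step can be sketched as follows. NODE_NAME, the run helper, and APPLY are hypothetical names; repeat the label command once per GPU node.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Set NODE_NAME to a value from the NAME column of "kubectl get nodes".
NODE_NAME="${NODE_NAME:-<node-name>}"

run kubectl get nodes
run kubectl label nodes "$NODE_NAME" allow-nvdp=true
```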
STEP 2: Update Helm Chart Values
- By setting the node selector, you are ensuring that the nvidia-device-plugin DaemonSet will only be scheduled on nodes with the allow-nvdp=true label.
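A sketch of the Helm command with the node selector added is shown here. It assumes the nvdp chart repository has already been added with helm repo add, and uses the same assumed release name and namespace (nvdp, nvidia-device-plugin); --set-string keeps the label value a string rather than a boolean. The run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

run helm repo update

run helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia \
  --set deviceListStrategy=volume-mounts \
  --set-string nodeSelector.allow-nvdp=true
```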
STEP 3: Verify
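To see which nodes the device plugin pods landed on, a command along these lines can be used. The namespace nvidia-device-plugin is an assumption carried over from the Helm install; the run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# -o wide adds the NODE column, showing where each pod was scheduled.
run kubectl -n nvidia-device-plugin get pods -o wide
```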
Expected/Example Output
- In this example only nodes node1, node3 and node4 have the allow-nvdp=true label, and that’s where the nvidia-device-plugin pods were spawned:
Verification - Applicable to all Environments
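Checking the device plugin logs is one way to confirm the plugin started and registered with the kubelet. The sketch below assumes the Helm release is named nvdp in the nvidia-device-plugin namespace and that its pods carry the standard app.kubernetes.io/instance label; the run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Assumption: pods are labeled app.kubernetes.io/instance=<release name>.
run kubectl -n nvidia-device-plugin logs -l app.kubernetes.io/instance=nvdp
```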
Example/Expected Output
Test GPUs
NOTE - conduct the steps in this section on a Kubernetes control plane node
Launch GPU Test Pod
Create the GPU Test Pod Config
Apply the GPU Test Pod Config
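A test pod along the following lines exercises the RuntimeClass and the device plugin together. The pod name gpu-pod, the file name gpu-test-pod.yaml, and the sample image tag are assumptions; pick a CUDA sample image that matches your installed driver. The run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      # Assumption: sample image tag; choose one compatible with your driver.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1   # request a single GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF

run kubectl apply -f gpu-test-pod.yaml
```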
Verification of GPU Pod
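Once the pod has run, its status and logs show whether the GPU was usable. This sketch assumes the test pod is named gpu-pod; the run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

run kubectl get pod gpu-pod   # the pod should reach Completed
run kubectl logs gpu-pod      # output of the CUDA sample
```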
Expected/Example Output
Update Akash Provider
Update Provider Configuration File
Providers must be updated with attributes in order to bid on the GPUs.
NOTE - in the Akash Provider build documentation a provider.yaml file was created, which stores provider attributes and other settings. In this section we will update that provider.yaml file with GPU related attributes. The remainder of the pre-existing file should be left unchanged.
GPU Attributes Template
- GPU model template is used in the subsequent Provider Configuration File
- Multiple such entries should be included in the Provider Configuration File if the provider has multiple GPU types
- Currently Akash providers may only host one GPU type per worker node. But different GPU models/types may be hosted on separate Kubernetes nodes.
- We recommend including both a GPU attribute which includes VRAM and a GPU attribute which does not include VRAM to ensure your provider bids when the deployer includes/excludes VRAM spec. Example of this recommended approach in the provider.yaml example below.
- Include the GPU interface type - as seen in the example below - to ensure provider bids when the deployer includes the interface in the SDL.
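The recommendations above can be sketched as a small generator that prints one attribute entry per variant (model only, with VRAM, with VRAM and interface, with interface only). The function name gpu_attributes is hypothetical, the key layout capabilities/gpu/vendor/<vendor>/model/<model> is an assumption to confirm against the provider documentation, and the 16Gi and PCIe values are illustrative.

```shell
# Hypothetical helper: prints provider attribute entries for one GPU model.
# Usage: gpu_attributes <vendor> <model> <ram> <interface>
gpu_attributes() {
  vendor="$1"; model="$2"; ram="$3"; interface="$4"
  base="capabilities/gpu/vendor/${vendor}/model/${model}"
  for key in \
    "$base" \
    "$base/ram/$ram" \
    "$base/ram/$ram/interface/$interface" \
    "$base/interface/$interface"
  do
    printf '  - key: %s\n    value: true\n' "$key"
  done
}

# Illustrative values only; substitute your own model, VRAM size, and interface.
gpu_attributes nvidia a4000 16Gi PCIe
```

The printed entries are meant to be pasted into the attributes section of provider.yaml; run the function once per GPU model you host.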
Example Provider Configuration File
- In the example configuration file below the Akash Provider will advertise availability of NVIDIA GPU model A4000
- Steps included in this code block create the necessary provider.yaml file in the expected directory
- Ensure that the attributes section is updated with your own values
Update the Provider YAML File With GPU Attribute
- When the provider.yaml file update is complete it should look like this:
Provider Bid Defaults
- When a provider is created, the default bid engine settings are used to derive pricing per workload. These settings can be updated if desired, but we recommend initially using the default values.
- For a thorough discussion on customized pricing please visit this guide.
Update Provider Via Helm
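A sketch of the provider upgrade is shown here. The Helm repository alias akash, the release name akash-provider, and the namespace akash-services are assumptions based on a typical provider build; adjust them to match your install. The run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

run helm repo update

# Run from the directory that holds the updated provider.yaml.
run helm upgrade --install akash-provider akash/provider \
  -n akash-services \
  -f provider.yaml
```

If your provider was installed with additional --set options (for example a custom pricing script), pass the same options again so they are not dropped by the upgrade.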
Verify Health of Akash Provider
Use the following command to verify the health of the Akash Provider and Hostname Operator pods
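A minimal sketch, assuming the provider and hostname operator run in the akash-services namespace; the run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Both the provider and hostname operator pods should report Running.
run kubectl get pods -n akash-services
```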
Example/Expected Output
Verify Provider Attributes On Chain
- In this step we confirm that your updated Akash Provider attributes have been written to the blockchain. Ensure that the GPU model related attributes are now in place via this step.
NOTE - conduct this verification from your Kubernetes control plane node
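The on-chain query might look like the sketch below. It assumes the provider-services CLI is installed and configured with an RPC node, and that PROVIDER_ADDRESS (a hypothetical variable) holds your provider's Akash address; the run helper and APPLY variable are also hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Set PROVIDER_ADDRESS to your provider's address before running with APPLY=1.
PROVIDER_ADDRESS="${PROVIDER_ADDRESS:-<provider-address>}"

run provider-services query provider get "$PROVIDER_ADDRESS"
```

The GPU attribute keys added to provider.yaml should appear in the attributes list of the response.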
Example/Expected Output
Verify Akash Provider Image
Verify the Provider image is correct by running this command:
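One way to list the images in use is sketched below with a kubectl jsonpath query. The akash-services namespace is an assumption; the run helper and APPLY variable are hypothetical.

```shell
# Dry-run helper (hypothetical): prints each command unless APPLY=1 is set.
run() { if [ "${APPLY:-0}" = "1" ]; then "$@"; else printf '+ %s\n' "$*"; fi; }

# Prints the container image of every pod in the namespace.
run kubectl -n akash-services get pods \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```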
Expected/Example Output
GPU Test Deployments
Overview
Use any of the Akash deployment tools covered here for your Provider test deployments.
NOTE - this section covers GPU specific deployment testing and verification of your Akash Provider. In addition, general Provider verifications can be made via this Provider Checkup guide.
Example GPU SDL #1
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs. Adjust accordingly for your provider testing.
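An SDL of roughly this shape requests a single NVIDIA A4000 GPU. This is a sketch, not the guide's original SDL: the service name, image, resource sizes, placement name, and pricing are all illustrative assumptions. The snippet writes the SDL to gpu-test.yaml in the current directory for use with your deployment tool.

```shell
cat > gpu-test.yaml <<'EOF'
---
version: "2.0"

services:
  gpu-test:
    # Assumption: sample image; choose one compatible with the provider's driver.
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    command:
      - "sh"
      - "-c"
    args:
      - "sleep infinity"
    expose:
      - port: 3000
        as: 80
        to:
          - global: true

profiles:
  compute:
    gpu-test:
      resources:
        cpu:
          units: 1
        memory:
          size: 1Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a4000
        storage:
          - size: 512Mi
  placement:
    anywhere:
      pricing:
        gpu-test:
          denom: uakt
          amount: 100000

deployment:
  gpu-test:
    anywhere:
      profile: gpu-test
      count: 1
EOF

echo "Wrote gpu-test.yaml"
```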
Testing of Deployment/GPU Example #1
- Conduct the following tests from the deployment’s shell.
Test 1
Expected/Example Output
Test 2
Expected/Example Output
Example GPU SDL #2
NOTE - there is currently an issue with GPU deployments closing once their primary process completes. Due to this issue the example SDL below causes repeated container restarts. The container will restart when the stable diffusion task has completed. When this issue has been resolved, GPU containers will remain running perpetually and will not close when the primary process defined in the SDL completes.
NOTE - the CUDA version necessary for this image is currently 11.7. Check the image documentation page here for possible updates.
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs