Use this guide and follow the sequential steps to build your Testnet Akash Provider with GPU support.
- Prepare Kubernetes Hosts
- Disable Search Domains
- Install NVIDIA Drivers & Toolkit
- NVIDIA Runtime Configuration
- Create Kubernetes Cluster
- Confirm Kubernetes Cluster
- Helm Installation on Kubernetes Node
- Apply NVIDIA Runtime Engine
- Test GPUs
- Akash Provider Install
- Ingress Controller Install
- Domain Name Review
- GPU Test Deployments
Prepare Kubernetes Hosts
Akash Providers utilize an underlying Kubernetes cluster. Begin your Akash Provider build by preparing the hosts that the Kubernetes cluster will be built on.
Follow the instructions in this guide to prepare the hosts. Complete steps 1-6 in the linked guide and then return to proceed with the steps of this Provider Build with GPU guide.
Disable Search Domains
Overview
In this section we perform the following DNS adjustments:
Set Use Domains to False
- Set `use-domains: false` to prevent systemd's DHCP client from overwriting the DNS search domain. This prevents a potentially bad domain served by the DHCP server from becoming active.
- This is a common issue for some providers and is explained in more detail here
Set Accept RA to False
- Set `accept-ra: false` to disable IPv6 Router Advertisements (RA), as the DNS search domain may still leak through if this is not disabled.
- The potential issue this addresses is explained in more detail here
Create Netplan
NOTE - the DNS resolution issue & the Netplan fix addressed in this step are described here
Apply the following to all Kubernetes control plane and worker nodes.
IMPORTANT - Make sure you do not have any other config files under the `/etc/netplan` directory; otherwise they could cause unexpected networking issues or problems booting your node.
If you aren't using DHCP or want to add additional configuration, please refer to the netplan documentation here for additional config options.
Example
- File: `/etc/netplan/01-netcfg.yaml`
Note that this is only an example of the netplan configuration file to show you how to disable the DNS search domain overriding and IPv6 Router Advertisement (RA). Do not blindly copy the entire config but rather use it as a reference for your convenience!
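A minimal sketch of such a file, assuming a DHCP-configured interface named `enp0s3` (both the interface name and DHCP use are assumptions — adapt to your hosts):

```yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enp0s3:                  # assumed interface name; replace with yours
      dhcp4: true
      dhcp4-overrides:
        use-domains: false   # ignore DNS search domains served via DHCP
      accept-ra: false       # ignore IPv6 Router Advertisements
```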
Test and Apply Netplan
Test the Netplan config and apply via these commands.
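For example (run as root; `netplan try` rolls the change back automatically if you lose connectivity and do not confirm):

```shell
sudo netplan try
sudo netplan apply
```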
Expected/Example Output
Install NVIDIA Drivers & Toolkit
NOTE - The steps in this section should be completed on all Kubernetes nodes hosting GPU resources
Prepare Environment
NOTE - reboot the servers following the completion of this step
Install Latest NVIDIA Drivers
The `ubuntu-drivers devices` command detects your GPU and determines which version of the NVIDIA drivers is best.
NOTE - the NVIDIA drivers detailed and installed in this section have known compatibility issues with some `6.x` Linux kernels as discussed here. In our experience, when such compatibility issues occur, the driver installs with no errors generated but does not function properly. If you encounter Linux kernel and NVIDIA driver compatibility issues, consider downgrading the kernel to the officially supported Ubuntu 22.04 kernel, which at the time of this writing is `5.15.0-73`.
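The detection command itself (`ubuntu-drivers` is provided by the `ubuntu-drivers-common` package, which may need installing first — an assumption depending on your base image):

```shell
sudo apt update
sudo apt install -y ubuntu-drivers-common   # provides the ubuntu-drivers tool
ubuntu-drivers devices
```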
Expected/Example Output
Driver Install Based on Output
Run either `ubuntu-drivers autoinstall` or `apt install nvidia-driver-525` (driver names may differ in your environment).
The `autoinstall` option installs the recommended version and is appropriate in most instances.
The `apt install <driver-name>` alternative allows installing a preferred driver instead of the recommended version.
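A sketch of the two options (pick one; the pinned driver version shown is only an example):

```shell
# Option 1: install the recommended driver automatically
sudo ubuntu-drivers autoinstall

# Option 2: install a specific driver version instead
# sudo apt install nvidia-driver-525
```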
Install the NVIDIA Container Toolkit
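A sketch of the toolkit install following NVIDIA's documented apt repository setup (verify against the references linked below before use):

```shell
# Add NVIDIA's container toolkit apt repository and signing key
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install the toolkit itself
sudo apt update
sudo apt install -y nvidia-container-toolkit
```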
Additional References for Node Configurations
NOTE - references are for additional info only. No actions are necessary and the Kubernetes nodes should be all set to proceed to next step based on configurations enacted in prior steps on this doc.
- https://github.com/NVIDIA/k8s-device-plugin#prerequisites
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
NVIDIA Runtime Configuration
Worker nodes
IMPORTANT - This should be done on all worker nodes that have GPU installed!
Update the nvidia-container-runtime config in order to prevent `NVIDIA_VISIBLE_DEVICES=all` abuse, where tenants could access more GPUs than they requested.
NOTE - This will only work with the `nvdp/nvidia-device-plugin` Helm chart installed with `--set deviceListStrategy=volume-mounts` (you'll get there in the next steps)
Make sure the config file `/etc/nvidia-container-runtime/config.toml` contains these lines, uncommented and set to these values:
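The two relevant settings are:

```toml
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = false
```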
NOTE - `/etc/nvidia-container-runtime/config.toml` is part of the `nvidia-container-toolkit-base` package, so package updates won't override these custom-set parameters, as the file is registered in `/var/lib/dpkg/info/nvidia-container-toolkit-base.conffiles`
Kubespray
NOTE - This step should be completed on the Kubespray host only
In this step we add the NVIDIA runtime configuration into the Kubespray inventory. The runtime will be applied to the necessary Kubernetes hosts when Kubespray builds the cluster in the subsequent step.
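A sketch of the inventory change, using Kubespray's `containerd_additional_runtimes` variable (the exact file path — e.g. your inventory's `group_vars/all/containerd.yml` — is an assumption; adjust to your inventory layout):

```yaml
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '/usr/bin/nvidia-container-runtime'
```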
Create Kubernetes Cluster
Create Cluster
NOTE - This step should be completed from the Kubespray host only
With inventory in place we are ready to build the Kubernetes cluster via Ansible.
NOTE - the cluster creation may take several minutes to complete
- If the Kubespray process fails or is interrupted, run the Ansible playbook again and it will complete any incomplete steps on the subsequent run
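A sketch of the playbook run, assuming the inventory lives at `inventory/akash` and SSH key auth (adjust both to your environment):

```shell
cd ~/kubespray
ansible-playbook -i inventory/akash/hosts.yaml -b -v \
  --private-key=~/.ssh/id_rsa cluster.yml
```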
GPU Node Label (Kubernetes)
Each node that provides GPUs must be labeled correctly.
NOTE - these configurations should be completed on a Kubernetes control plane node
Label Template
- Use this label template in the `kubectl label` command in the subsequent Label Application sub-section below
NOTE - please do not assign any value other than `true` to these labels. Setting the value to `false` may have unexpected consequences on the Akash provider. If GPU resources are removed from a node, simply remove the Kubernetes label completely from that node.
Label Application
Template
NOTE - if you are unsure of the `<node-name>` to be used in this command, issue `kubectl get nodes` from one of your Kubernetes control plane nodes and obtain it from the `NAME` column of the output
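A sketch of the label template (the label key format follows Akash's GPU capability-label convention; replace both placeholders):

```shell
kubectl label node <node-name> akash.network/capabilities.gpu.vendor.nvidia.model.<model-name>=true
```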
Example
NOTE - issue this command/label application for all nodes hosting GPU resources
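For example, for a node named `node1` hosting A4000 GPUs (both values are illustrative):

```shell
kubectl label node node1 akash.network/capabilities.gpu.vendor.nvidia.model.a4000=true
```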
Expected Output using Example
Additional Kubernetes Configurations
NOTE - these configurations should be completed on a Kubernetes control plane node
Confirm Kubernetes Cluster
A couple of quick Kubernetes cluster checks are in order before moving into next steps.
SSH into Kubernetes Master Node
NOTE - the verifications in this section must be completed on a master node with Kubectl access to the cluster.
Confirm Kubernetes Nodes
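The node check, run from a control plane node:

```shell
kubectl get nodes
```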
Example output from a healthy Kubernetes cluster
Confirm Kubernetes Pods
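The pod check:

```shell
kubectl get pods -n kube-system
```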
Example output of the pods that are the brains of the cluster
Verify etcd Status and Health
Commands should be run on a control plane node to ensure the health of the Kubernetes `etcd` database.
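A sketch of the health check, assuming a Kubespray-built cluster where etcd connection settings are written to `/etc/etcd.env` (that path is Kubespray's default and is an assumption; adjust if yours differs):

```shell
# Load etcd endpoint and certificate settings written by Kubespray
export $(grep -v '^#' /etc/etcd.env | xargs -d '\n')
etcdctl member list
etcdctl endpoint health
```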
Example/Expected Output of etcd Health Check
Helm Installation on Kubernetes Node
NOTE - conduct these steps from one of the Kubernetes control plane/master nodes
Helm Install
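One common approach is Helm's official installer script:

```shell
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```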
Confirmation of Helm Install
Print Helm Version
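```shell
helm version
```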
Expected Output
Apply NVIDIA Runtime Engine
NOTE - conduct these steps on the control plane node that Helm was installed on via the previous step
Create RuntimeClass
Create the NVIDIA Runtime Config
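A sketch that writes the RuntimeClass manifest (the filename is an assumption):

```shell
# RuntimeClass pointing containerd at the nvidia runtime handler
cat > gpu-nvidia-runtime-class.yaml << 'EOF'
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF
```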
Apply the NVIDIA Runtime Config
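Assuming the manifest was saved as `gpu-nvidia-runtime-class.yaml` (the filename is an assumption):

```shell
kubectl apply -f gpu-nvidia-runtime-class.yaml
```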
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on All Nodes
NOTE - in some scenarios a provider may host GPUs only on a subset of Kubernetes worker nodes. Use the instructions in this section if ALL Kubernetes worker nodes have available GPU resources. If only a subset of worker nodes host GPU resources - use the section
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes
instead. Only one of these two sections should be completed.
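A sketch of the Helm install (the chart version shown is an assumption; pin whichever current version you have validated — note `deviceListStrategy=volume-mounts`, required by the runtime config set earlier):

```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts
```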
Expected/Example Output
Upgrade/Install the NVIDIA Device Plug In Via Helm - GPUs on Subset of Nodes
NOTE - use the instructions in this section if only a subset of Kubernetes worker nodes have available GPU resources.
- By default, the nvidia-device-plugin DaemonSet may run on all nodes in your Kubernetes cluster. If you want to restrict its deployment to only GPU-enabled nodes, you can leverage Kubernetes node labels and selectors.
- Specifically, you can use the `allow-nvdp=true` label to limit where the DaemonSet is scheduled.
STEP 1: Label the GPU Nodes
- First, identify your GPU nodes and label them with `allow-nvdp=true`. You can do this by running the following command for each GPU node, replacing `node-name` with the name of the node you're labeling
NOTE - if you are unsure of the `<node-name>` to be used in this command, issue `kubectl get nodes` from one of your Kubernetes control plane nodes and obtain it from the `NAME` column of the output
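The labeling command:

```shell
kubectl label nodes <node-name> allow-nvdp=true
```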
STEP 2: Update Helm Chart Values
- By setting the node selector, you are ensuring that the `nvidia-device-plugin` DaemonSet will only be scheduled on nodes with the `allow-nvdp=true` label.
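A sketch of the install with the node selector added (chart version is an assumption, as above):

```shell
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia" \
  --set deviceListStrategy=volume-mounts \
  --set-string nodeSelector.allow-nvdp="true"
```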
STEP 3: Verify
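Check which nodes the device-plugin pods landed on:

```shell
kubectl -n nvidia-device-plugin get pods -o wide
```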
Expected/Example Output
- In this example only nodes node1, node3, and node4 have the `allow-nvdp=true` label, and that is where the `nvidia-device-plugin` pods were spawned:
Verification - Applicable to all Environments
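One way to verify is listing allocatable GPU counts per node; a non-empty GPUS column confirms the device plugin registered the `nvidia.com/gpu` resource:

```shell
kubectl get nodes -o 'custom-columns=NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```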
Example/Expected Output
Test GPUs
NOTE - conduct the steps in this section on a Kubernetes control plane node
Launch GPU Test Pod
Create the GPU Test Pod Config
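A sketch of the test pod manifest, using NVIDIA's published CUDA vector-add sample image (the image tag and filename are assumptions; substitute a current sample image if needed):

```shell
# Pod requesting one GPU via the nvidia RuntimeClass
cat > gpu-test-pod.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```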
Apply the GPU Test Pod Config
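Assuming the manifest was saved as `gpu-test-pod.yaml` (the filename is an assumption):

```shell
kubectl apply -f gpu-test-pod.yaml
```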
Verification of GPU Pod
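Check that the pod completed and inspect its logs:

```shell
kubectl get pods
kubectl logs gpu-pod
```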
Expected/Example Output
Akash Provider Install
NOTE - all steps in this guide should be performed from a Kubernetes control plane node
Install Akash Provider Services Binary
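A sketch of the binary install using the installer script from the akash-network/provider repository (the `~/bin` install path is the script's default at the time of writing):

```shell
cd ~
curl -sfL https://raw.githubusercontent.com/akash-network/provider/main/install.sh | bash

# Make the binary available on PATH for this and future sessions
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bashrc
export PATH=$PATH:$HOME/bin
```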
Confirm Akash Provider Services Install
- Issue the following command to confirm successful installation of the binary:
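```shell
provider-services version
```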
Expected/Example Output
Specify Provider Account Keyring Location
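A sketch, assuming the OS keyring backend is desired:

```shell
export AKASH_KEYRING_BACKEND=os
```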
Create Provider Account
The wallet created in this step will be used for the following purposes:
- Pay for provider transaction gas fees
- Pay for bid collateral which is discussed further in this section
NOTE - Make sure to create a new Akash account for the provider and do not reuse an account used for deployment purposes. Bids will not be generated from your provider if the deployment orders are created with the same key as the provider.
NOTE - capture the mnemonic phrase for the account to restore later if necessary
NOTE - in the provided syntax we are creating an account with the key name `default`
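The account creation command, using the key name `default` per the note above:

```shell
provider-services keys add default
```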
Fund Provider Account via Faucet
Ensure that the provider account - created in the prior step - is funded. Avenues to fund an account are discussed in this document.
Export Provider Key for Build Process
STEP 1 - Export Provider Key
- Enter pass phrase when prompted
- The passphrase used will be needed in subsequent steps
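The export command:

```shell
provider-services keys export default
```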
Expected/Example Output
STEP 2 - Create key.pem and Copy Output Into File
- Copy the contents of the prior step into the
key.pem
file
NOTE - the file should contain only what's between `-----BEGIN TENDERMINT PRIVATE KEY-----` and `-----END TENDERMINT PRIVATE KEY-----` (including the `BEGIN` and `END` lines):
Verification of key.pem File
Expected/Example File
Provider RPC Node
Akash Providers need to run their own blockchain RPC node to remove dependence on public nodes. This is a strict requirement.
We have recently released documentation guiding you through the process of building an RPC node via Helm Charts with state sync.
Declare Relevant Environment Variables
- Update `RPC-NODE-ADDRESS` with your own value
- Update the following variables with your own values
- The `KEY_PASSWORD` value should be the passphrase used during the account export step
- Further discussion of the Akash provider domain is available here
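A sketch of the variable declarations (every value below is a placeholder; substitute your own):

```shell
export AKASH_NODE="http://your-rpc-node:26657"        # placeholder: your own RPC node
export AKASH_CHAIN_ID="akashnet-2"
export ACCOUNT_ADDRESS="akash1yourprovideraddress"    # placeholder: provider wallet address
export KEY_PASSWORD="passphrase-from-export-step"     # placeholder: key export passphrase
export DOMAIN="provider.yourdomain.com"               # placeholder: provider domain
```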
Create Provider Configuration File
- Providers must be updated with attributes in order to bid on the GPUs.
GPU Attributes Template
- The GPU model template is used in the subsequent Provider Configuration File
- Multiple such entries should be included in the Provider Configuration File if the provider has multiple GPU types
- Currently Akash providers may only host one GPU type per worker node, but different GPU models/types may be hosted on separate Kubernetes nodes
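The attribute entry template (the `<model-name>` placeholder is to be replaced; the key format follows Akash's GPU capability attributes):

```yaml
- key: capabilities/gpu/vendor/nvidia/model/<model-name>
  value: true
```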
Example Provider Configuration File
- In the example configuration file below the Akash Provider will advertise availability of NVIDIA GPU model A4000
- Steps included in this code block create the necessary `provider.yaml` file in the expected directory
- Ensure that the attributes section is updated with your own values
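A sketch of the file creation (the region/tier values and the A4000 model key are placeholders matching this example; replace them with your own, and note it relies on the environment variables and `key.pem` prepared in earlier steps):

```shell
mkdir -p ~/provider
cd ~/provider

# provider.yaml sketch -- attribute values are placeholders
cat > provider.yaml << EOF
---
from: "$ACCOUNT_ADDRESS"
key: "$(cat ~/key.pem | openssl base64 -A)"
keysecret: "$(echo $KEY_PASSWORD | openssl base64 -A)"
domain: "$DOMAIN"
node: "$AKASH_NODE"
withdrawalperiod: 12h
attributes:
  - key: region
    value: "us-west"
  - key: host
    value: akash
  - key: tier
    value: community
  - key: capabilities/gpu/vendor/nvidia/model/a4000
    value: true
EOF
```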
Provider Bid Defaults
- When a provider is created, the default bid engine settings are used to derive pricing per workload. If desired, these settings can be updated, but we recommend initially using the default values.
- For a thorough discussion on customized pricing please visit this guide.
Create Provider Via Helm
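A sketch of the Helm install, using the akash-network Helm chart repository and the `provider.yaml` created above:

```shell
helm repo add akash https://akash-network.github.io/helm-charts
helm repo update
helm install akash-provider akash/provider -n akash-services \
  --create-namespace -f ~/provider/provider.yaml
```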
Verification
- Verify the image is correct by running this command:
Expected/Example Output
Create Akash Hostname Operator
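A sketch of the operator install (assumes the `akash` Helm repo was added in the provider install step):

```shell
helm install akash-hostname-operator akash/akash-hostname-operator -n akash-services
```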
Verify Health of Akash Provider
- Use the following command to verify the health of the Akash Provider and Hostname Operator pods
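```shell
kubectl get pods -n akash-services
```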
Example/Expected Output
Ingress Controller Install
Create Upstream Ingress-Nginx Config
- Create an `ingress-nginx-custom.yaml` file
- Populate the `ingress-nginx-custom.yaml` file with the following contents:
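A sketch of the custom values, derived from settings commonly used for Akash ingress (verify each option name against the ingress-nginx chart documentation before applying):

```yaml
controller:
  service:
    type: ClusterIP
  ingressClassResource:
    name: "akash-ingress-class"
    enabled: true
    default: true
    controllerValue: "k8s.io/ingress-nginx"
  kind: DaemonSet
  hostPort:
    enabled: true
  admissionWebhooks:
    port: 7443
  config:
    allow-snippet-annotations: false
    compute-full-forwarded-for: true
    proxy-buffer-size: "16k"
  metrics:
    enabled: true
  extraArgs:
    enable-ssl-passthrough: true
```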
Install Upstream Ingress-Nginx
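A sketch of the chart install, pointing at the custom values file created in the prior step:

```shell
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  -f ingress-nginx-custom.yaml
```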
Apply Necessary Labels
- Label the `ingress-nginx` namespace and the `akash-ingress-class` ingress class
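The labeling commands:

```shell
kubectl label ns ingress-nginx app.kubernetes.io/name=ingress-nginx app.kubernetes.io/instance=ingress-nginx
kubectl label ingressclass akash-ingress-class akash.network=true
```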
Domain Name Review
Overview
Add DNS (type A) records for your Akash Provider related domains on your DNS hosting provider.
Akash Provider Domain Records
- Replace yourdomain.com with your own domain name
- DNS (type A) records should point to public IP address of a single Kubernetes worker node of your choice
NOTE - do not use Cloudflare or any other TLS proxy solution for your Provider DNS A records.
NOTE - Instead of the multiple DNS A records for worker nodes, consider using CNAME DNS records such as the example provided below. CNAME use allows ease of management and introduces higher availability.
*.ingress 300 IN CNAME nodes.yourdomain.com.
nodes 300 IN A x.x.x.x
nodes 300 IN A x.x.x.x
nodes 300 IN A x.x.x.x
provider 300 IN CNAME nodes.yourdomain.com.
Example DNS Configuration
GPU Test Deployments
Overview
Test your provider’s ability to host GPU related deployments via the SDLs provided in this section.
Use any of the Akash deployment tools covered here for your Provider test deployments.
NOTE - this section covers GPU specific deployment testing and verification of your Akash Provider. In addition, general Provider verifications can be made via this Provider Checkup guide.
Example GPU SDL #1
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs. Adjust accordingly for your provider testing.
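A sketch of such an SDL (the image, ports, and pricing are illustrative placeholders; the `gpu` attributes block is the part that restricts bids to A4000 providers):

```yaml
---
version: "2.0"
services:
  gpu-test:
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image for illustration
    command:
      - "sleep"
      - "infinity"
profiles:
  compute:
    gpu-test:
      resources:
        cpu:
          units: 1
        memory:
          size: 1Gi
        storage:
          size: 1Gi
        gpu:
          units: 1
          attributes:
            vendor:
              nvidia:
                - model: a4000
  placement:
    akash:
      pricing:
        gpu-test:
          denom: uakt
          amount: 100000
deployment:
  gpu-test:
    akash:
      profile: gpu-test
      count: 1
```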
Testing of Deployment/GPU Example #1
Conduct the following tests from the deployment’s shell.
Test 1
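A typical first check from the deployment's shell confirms the GPU is visible inside the container:

```shell
nvidia-smi
```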
Expected/Example Output
Test 2
Expected/Example Output
Example GPU SDL #2
NOTE - there is currently an issue with GPU deployments closing once their primary process completes. Due to this issue the example SDL below causes repeated container restarts. The container will restart when the stable diffusion task has completed. When this issue has been resolved, GPU containers will remain running perpetually and will not close when the primary process defined in the SDL completes.
NOTE - the CUDA version necessary for this image is currently `11.7`. Check the image documentation page here for possible updates.
NOTE - in this example the deployer is requesting bids from only Akash Providers that have available NVIDIA A4000 GPUs