Introduction
Welcome to the specialized guide designed to assist Akash Providers in enabling AMD GPU support within their Kubernetes clusters. This documentation is particularly crafted for system administrators, developers, and DevOps professionals who manage and operate their Akash Providers. The focus here is to guide you through the process of integrating AMD GPUs into your Kubernetes/Akash setup, ensuring that they can be utilized in the Akash Network.
Throughout this guide, you will find step-by-step instructions on installing the necessary AMD drivers and configuring Kubernetes to recognize and leverage AMD GPUs.
This documentation is vital for Akash Providers and Clients who aim to deploy advanced workloads such as machine learning models, high-performance computing tasks, or any applications that benefit from GPU acceleration. By following this guide, you will be able to enhance your service offerings on the Akash Network, catering to a wider range of computational needs with AMD GPU support.
NOTE: To effectively enable AMD GPU support, ensure that your `akash-provider` and `provider-services` (CLI) are updated to version `0.4.9-rc0` or higher. This is a prerequisite for proper integration and functionality of AMD GPUs on your Akash Provider.
Limitations
Current constraints on combining GPU vendors:
- Vendor Constraint: Combining NVIDIA and AMD GPUs within the same Kubernetes worker node is not permitted.
- Vendor Homogeneity: Different GPU vendors may coexist within the same Kubernetes cluster, but not within a single worker node.
- Vendor Exclusivity: Each worker node must exclusively use GPUs from a single vendor, either NVIDIA or AMD. Different nodes within the same cluster can use different vendors.
Installing the AMD GPU Driver
Follow these steps to install the AMD GPU Driver:
- Install the AMD GPU drivers using DKMS. Apply these commands on your node with the AMD GPU (based on https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/native-install/ubuntu.html):
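The original command block is not reproduced here. Below is a sketch following the linked ROCm native-install instructions for Ubuntu; the ROCm release number (`6.0`) and Ubuntu codename (`jammy`) are assumptions and should be checked against the linked page for your system:

```shell
# Kernel headers are required for the DKMS build (assumes Ubuntu)
sudo apt update
sudo apt install -y "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"

# Register the AMD GPU apt repository and signing key
# (release "6.0" and codename "jammy" are assumptions -- see the linked ROCm page)
sudo mkdir --parents --mode=0755 /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/6.0/ubuntu jammy main" | \
    sudo tee /etc/apt/sources.list.d/amdgpu.list
sudo apt update

# Install the DKMS driver package
sudo apt install -y amdgpu-dkms
```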
- Make sure the right driver is loaded:
  - Reboot the node. By default, `/lib/modules/<version>/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko` is loaded; you cannot simply `modprobe -r amdgpu` and then `modprobe amdgpu`. You need to reboot to make sure the correct AMD GPU driver (the DKMS build, `/lib/modules/<version>/updates/dkms/amdgpu.ko`) is properly loaded.
  - Verify that the correct version is loaded (you may see a higher version; that is okay):
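One way to check which module file and version are in use (the exact path and version string will vary by kernel and ROCm release):

```shell
# Show which amdgpu module file modprobe resolves to and its version;
# after the DKMS install the filename should point at updates/dkms/amdgpu.ko
modinfo amdgpu | grep -E '^(filename|version)'

# After a reboot, the version of the currently loaded module
# (DKMS builds expose it here; the in-tree module may not)
cat /sys/module/amdgpu/version
```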
Enabling AMD GPU Support in Akash Provider
1. Install the ROCm `k8s-device-plugin` Helm chart
- Add the Helm repository and install the chart:
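The original command block is not shown here. The repository alias and chart name below follow the ROCm `k8s-device-plugin` project README and are assumptions to verify against that project:

```shell
# Add the ROCm device-plugin chart repository and install the chart
helm repo add amd-gpu-helm https://rocm.github.io/k8s-device-plugin/
helm repo update
helm install amd-gpu amd-gpu-helm/amd-gpu
```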
- Verify the installation:

  NOTE: replace `node1` with the node name of your worker node (`kubectl get nodes`).
Example output:
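The original output block is not reproduced here. A typical check (with `node1` as a placeholder) greps the node description for the `amd.com/gpu` resource that the device plugin advertises; on a node with a single AMD GPU the matching lines would look roughly like the comments below:

```shell
kubectl describe node node1 | grep amd.com/gpu
# Both Capacity and Allocatable should list the GPU resource, e.g.:
#   amd.com/gpu:  1
```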
2. Label AMD GPU Node
- Label your AMD GPU node (replace `mi210` with your AMD GPU model):
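The original command is not shown here. Assuming the same `akash.network/capabilities` label convention that Akash uses for NVIDIA GPU nodes (the node name `node1` and model `mi210` are placeholders):

```shell
kubectl label node node1 akash.network/capabilities.gpu.vendor.amd.model.mi210=true --overwrite
```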
3. Test AMD GPU with TensorFlow in Pod
Before proceeding with the deployment, be aware of the following:
NOTE: Starting the `alexnet-gpu` pod may take a considerable amount of time, especially over slow network connections. This delay is due to the large size of the image, approximately `10 GiB`, as detailed on Docker Hub.
To deploy and test the TensorFlow environment on AMD GPUs, follow these steps:
- Create the pod using the provided YAML file:
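The referenced YAML file is not reproduced here. A minimal sketch of such a pod, saved as e.g. `alexnet-gpu.yaml` and applied with `kubectl apply -f alexnet-gpu.yaml` (the image tag and the benchmark script path inside the `rocm/tensorflow` image are assumptions), might be:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: alexnet-gpu
spec:
  restartPolicy: Never
  containers:
    - name: alexnet-tf-gpu
      image: rocm/tensorflow:latest
      workingDir: /root
      command: ["/bin/bash", "-c"]
      args:
        # tf_cnn_benchmarks path is an assumption based on the image's usual layout
        - python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=alexnet --num_gpus=1
      resources:
        limits:
          amd.com/gpu: 1   # resource name advertised by the ROCm device plugin
```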
- Check the logs to verify successful deployment and operation:
Example output:
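The original output block is not reproduced here. Assuming the pod name from the note above, follow the logs and look for the ROCm device discovery messages and the final images/sec figure reported by the benchmark:

```shell
kubectl logs alexnet-gpu -f
```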
- Once testing is complete, delete the pod:
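For example, assuming the pod name used above:

```shell
kubectl delete pod alexnet-gpu
```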
Update bid pricing parameters for your AMD GPU card
Make sure you are using the latest bid pricing script. You can follow these instructions.
- Add the pricing for your AMD GPU model (replace `mi210` with your model) to the `provider.yaml` file:
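The original snippet is not reproduced here. Assuming the `price_target_gpu_mappings` key used by the Akash provider bid pricing script (values are USD per month per GPU unit), the entry might look like:

```yaml
price_target_gpu_mappings: "mi210=190,*=200"
```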
This sets $190/month for the AMD MI210 GPU card and defaults to $200/month when the GPU model was not explicitly set.
Testing AMD GPU with TensorFlow in Akash Deployment
To test TensorFlow with AMD GPU in Akash Deployment:
- Base your deployment on the image and command/args from the provided YAML file.
- Use the `rocm/tensorflow` image.
- Override the `command` and `args` in the SDL.
- Execute the benchmarking command:
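The original command is not shown here. Assuming the standard `tf_cnn_benchmarks` layout shipped in the `rocm/tensorflow` image (the script path is an assumption), the invocation would be run inside the deployment:

```shell
python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=alexnet --num_gpus=1
```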
Example SDL
Use the following SDL configuration to deploy the `rocm/tensorflow` image:
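The original SDL is not reproduced here. The sketch below follows the Akash SDL v2 schema; the service name, resource sizes, inline SSH setup, pricing amount, and the AMD vendor attribute syntax (mirroring the NVIDIA form) are all assumptions to verify against the Akash SDL documentation:

```yaml
---
version: "2.0"

services:
  tf:
    image: rocm/tensorflow
    env:
      # Replace with your actual SSH public key (see the note below)
      - "SSH_PUBKEY=<your-ssh-public-key>"
    command:
      - "bash"
      - "-c"
    args:
      # Install and start sshd so the deployment can be reached over SSH,
      # then keep the container alive
      - 'apt-get update; apt-get install -y openssh-server;
         mkdir -p /run/sshd ~/.ssh;
         echo "$SSH_PUBKEY" > ~/.ssh/authorized_keys;
         /usr/sbin/sshd; sleep infinity'
    expose:
      - port: 22
        as: 22
        to:
          - global: true

profiles:
  compute:
    tf:
      resources:
        cpu:
          units: 8
        memory:
          size: 24Gi
        storage:
          size: 100Gi
        gpu:
          units: 1
          attributes:
            vendor:
              amd:
                - model: mi210
  placement:
    akash:
      pricing:
        tf:
          denom: uakt
          amount: 10000

deployment:
  tf:
    akash:
      profile: tf
      count: 1
```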
Make sure to replace `SSH_PUBKEY` with your public SSH key if you want to be able to SSH into your deployment instead of using `lease-shell`.
Additional material
- Exploring integration of `rocm-smi` in AMD GPU Pods for enhanced compatibility

  We are exploring the possibility of including the `rocm-smi` tool by default in AMD GPU Pods, analogous to how `nvidia-smi` is available in NVIDIA GPU Pods. The inclusion in NVIDIA Pods is facilitated by the NVIDIA device plugin, which mounts the necessary host paths and uses environment variables such as `NVIDIA_DRIVER_CAPABILITIES`. For more detailed examples and information, refer to the NVIDIA Container Toolkit documentation. Our goal is to achieve similar functionality for AMD GPUs, enhancing the user experience.