Deploy Kubernetes 1.33.5 using Kubespray for your Akash provider.
This guide walks through deploying a production-ready Kubernetes cluster that will host your Akash provider. The cluster will run all provider leases as Kubernetes pods.
Time: 30-45 minutes
What You’ll Deploy
Using Kubespray 2.29, you’ll install:
- Kubernetes 1.33.5 - Container orchestration
- etcd 3.5.22 - Distributed key-value store
- containerd 2.1.4 - Container runtime
- Calico 3.30.3 - Container networking (CNI)
Before You Begin
Ensure you have:
- **Reviewed Hardware Requirements**
- **Ubuntu 24.04 LTS** installed on all nodes
- **Root or sudo access** to all nodes
- **Network connectivity** between all nodes
STEP 1 - Clone Kubespray
Clone Kubespray on a control machine (not on the cluster nodes themselves):
```
cd ~
git clone -b v2.29.0 --depth=1 https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
```
Note: We use Kubespray 2.29.0, which includes Kubernetes 1.33.5.
STEP 2 - Install Ansible
Install Python dependencies and create a virtual environment:
```
# Install system packages
apt-get update
apt-get install -y python3-pip python3-venv

# Create and activate virtual environment
cd ~/kubespray
python3 -m venv venv
source venv/bin/activate

# Install Ansible and dependencies
pip install -r requirements.txt
```
Important: Remember to activate the virtual environment (`source ~/kubespray/venv/bin/activate`) before running any `ansible-playbook` commands.
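Before moving on, you can confirm that Ansible resolves from the virtual environment with a quick sanity check (the exact versions reported depend on Kubespray's requirements.txt):
```
# Run on the control machine with the venv active
source ~/kubespray/venv/bin/activate
ansible --version
ansible-playbook --version
```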
STEP 3 - Setup SSH Access
Configure passwordless SSH access to all cluster nodes:
Generate SSH Key
```
ssh-keygen -t ed25519 -C "$(hostname)" -f "$HOME/.ssh/id_ed25519" -N ""
```
Display Your Public Key
```
cat ~/.ssh/id_ed25519.pub
```
Copy the entire output (it starts with ssh-ed25519).
Add Key to Each Node
Log into each node in your cluster and add the public key to the authorized_keys file:
```
# SSH into the node (use your existing access method)
ssh root@<node-ip>

# Create .ssh directory if it doesn't exist
mkdir -p ~/.ssh
chmod 700 ~/.ssh

# Add your public key (paste the key you copied above)
echo "ssh-ed25519 AAAA...your-public-key...== hostname" >> ~/.ssh/authorized_keys

# Set correct permissions
chmod 600 ~/.ssh/authorized_keys

# Exit the node
exit
```
Repeat for every node in your cluster.
Verify SSH Access
Test passwordless SSH to each node from your control machine:
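For example, a loop along these lines should print each node's hostname (the IPs shown match the sample inventory later in this guide; substitute your own):
```
# BatchMode=yes makes ssh fail instead of prompting if key auth isn't working
for ip in 10.0.0.10 10.0.0.11 10.0.0.12; do
  ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes root@"$ip" hostname
done
```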
You should see the hostname returned without a password prompt.
STEP 4 - Create Inventory
Create and configure the Ansible inventory:
```
cd ~/kubespray
cp -rfp inventory/sample inventory/akash
```
Edit Inventory File
Open the inventory file:
```
nano ~/kubespray/inventory/akash/inventory.ini
```
Configure your nodes in the inventory. Important:
- Use 1 or 3 control plane nodes (odd numbers for consensus)
- List the same nodes under both `[kube_control_plane]` and `[etcd]`
- Add all worker nodes under `[kube_node]`
Example for 3-node HA cluster:
```
[kube_control_plane]
node1 ansible_host=10.0.0.10 ip=10.0.0.10 etcd_member_name=etcd1
node2 ansible_host=10.0.0.11 ip=10.0.0.11 etcd_member_name=etcd2
node3 ansible_host=10.0.0.12 ip=10.0.0.12 etcd_member_name=etcd3

[etcd:children]
kube_control_plane

[kube_node]
node4 ansible_host=10.0.0.13 ip=10.0.0.13
node5 ansible_host=10.0.0.14 ip=10.0.0.14
```
Example for single-node cluster:
```
[kube_control_plane]
node1 ansible_host=10.0.0.10 ip=10.0.0.10 etcd_member_name=etcd1

[etcd:children]
kube_control_plane

[kube_node]
node1 ansible_host=10.0.0.10 ip=10.0.0.10
```
STEP 5 - Configure Container Runtime
Verify containerd is set as the container runtime:
```
nano ~/kubespray/inventory/akash/group_vars/k8s_cluster/k8s-cluster.yml
```
Ensure this line exists:
```
container_manager: containerd
```
STEP 6 - Configure DNS
Configure upstream DNS servers:
```
nano ~/kubespray/inventory/akash/group_vars/all/all.yml
```
Uncomment and configure the DNS servers:
```
upstream_dns_servers:
  - 8.8.8.8
  - 1.1.1.1
```
Best Practice: Use DNS servers from different providers (Google 8.8.8.8, Cloudflare 1.1.1.1).
STEP 7 - Configure GPU Support (OPTIONAL)
Skip this step if you don’t have NVIDIA GPUs.
If you have NVIDIA GPUs, configure the container runtime before deploying the cluster:
```
mkdir -p ~/kubespray/inventory/akash/group_vars/all
cat > ~/kubespray/inventory/akash/group_vars/all/akash.yml << 'EOF'
# NVIDIA container runtime for GPU-enabled nodes
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '/usr/bin/nvidia-container-runtime'
EOF
```
This configures containerd to support GPU workloads. The actual NVIDIA drivers and device plugin will be installed later.
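If you're unsure whether a node actually has NVIDIA GPUs, a hardware-level check like the following (assuming pciutils is installed) lists NVIDIA PCI devices without requiring any drivers:
```
# Run on each node; no output means no NVIDIA GPU was detected
lspci -nn | grep -i nvidia
```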
STEP 8 - Deploy the Cluster
Now deploy Kubernetes using Ansible:
```
cd ~/kubespray
source venv/bin/activate
ansible-playbook -i inventory/akash/inventory.ini -b -v --private-key=~/.ssh/id_ed25519 cluster.yml
```
This will take 10-15 minutes. The playbook is idempotent; if it fails, you can safely run it again.
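If you want to keep a record of the run (useful when troubleshooting a failed deployment), plain shell redirection works; the log filename here is arbitrary:
```
# Capture the playbook output to a file as well as the terminal
ansible-playbook -i inventory/akash/inventory.ini -b -v \
  --private-key=~/.ssh/id_ed25519 cluster.yml 2>&1 | tee cluster-deploy.log
```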
STEP 9 - Verify Cluster
SSH to one of your control plane nodes and verify the cluster:
Check Nodes
```
kubectl get nodes
```
All nodes should show STATUS: Ready.
Check System Pods
```
kubectl get pods -A
```
All pods should be in Running or Completed status.
Verify DNS
```
kubectl -n kube-system get pods -l k8s-app=node-local-dns
```
All DNS pods should be Running.
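As an optional end-to-end check, you can resolve an in-cluster service name from a throwaway pod (a sketch assuming the default cluster.local domain; busybox:1.36 is just an example image):
```
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
```
The pod should print the service's cluster IP and then be removed automatically.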
Check etcd Health
On a control plane node, verify etcd status:
```
export $(grep -v '^#' /etc/etcd.env | xargs -d '\n')
etcdctl -w table member list
etcdctl endpoint health --cluster -w table
etcdctl endpoint status --cluster -w table
etcdctl check perf
```
Expected output from etcdctl check perf:
```
...
PASS: Throughput is 150 writes/s
PASS: Slowest request took 0.155139s
PASS: Stddev is 0.007739s
PASS
```
All endpoints should show healthy and performance checks should PASS.
STEP 10 - Apply Kernel Parameters
On all worker nodes, apply these kernel parameters to prevent "too many open files" errors:
```
cat > /etc/sysctl.d/90-akash.conf << 'EOF'
fs.inotify.max_user_instances = 512
fs.inotify.max_user_watches = 1048576
vm.max_map_count = 1000000
EOF

sysctl -p /etc/sysctl.d/90-akash.conf
```
STEP 11 - Install Zombie Process Killer (Recommended)
Some tenant containers don’t properly manage subprocesses, creating zombie processes that accumulate over time. While zombies don’t consume CPU/memory, they occupy process table slots. If the process table fills, no new processes can spawn, causing system failures.
This step installs an automated script to prevent zombie process accumulation.
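If you want to see whether a node is already accumulating zombies, a one-liner like this (standard ps/awk, shown only as a quick manual check) lists processes in state Z along with their parent PIDs:
```
# Columns: PID, parent PID, state, command; a state starting with Z marks a zombie
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```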
Create the Script
On all worker nodes, create /usr/local/bin/kill_zombie_parents.sh:
```
cat > /usr/local/bin/kill_zombie_parents.sh <<'EOF'
#!/bin/bash
# This script detects zombie processes descended from containerd-shim
# and attempts to reap them by signaling the parent process.

find_zombie_and_parents() {
    for pid in /proc/[0-9]*; do
        if [[ -r $pid/stat ]]; then
            read -r proc_pid comm state ppid < <(cut -d' ' -f1,2,3,4 "$pid/stat")
            if [[ $state == "Z" ]]; then
                echo "$proc_pid $ppid"
                return 0
            fi
        fi
    done
    return 1
}

get_parent_chain() {
    local pid=$1
    local chain=""
    while [[ $pid -ne 1 ]]; do
        if [[ ! -r /proc/$pid/stat ]]; then
            break
        fi
        read -r ppid cmd < <(awk '{print $4, $2}' /proc/$pid/stat)
        chain="$pid:$cmd $chain"
        pid=$ppid
    done
    echo "$chain"
}

is_process_zombie() {
    local pid=$1
    if [[ -r /proc/$pid/stat ]]; then
        read -r state < <(cut -d' ' -f3 /proc/$pid/stat)
        [[ $state == "Z" ]]
    else
        return 1
    fi
}

attempt_kill() {
    local pid=$1
    local signal=$2
    local wait_time=$3
    local signal_name=${4:-$signal}

    echo "Attempting to send $signal_name to parent process $pid"
    kill $signal $pid
    sleep $wait_time

    if is_process_zombie $zombie_pid; then
        echo "Zombie process $zombie_pid still exists after $signal_name"
        return 1
    else
        echo "Zombie process $zombie_pid no longer exists after $signal_name"
        return 0
    fi
}

if zombie_info=$(find_zombie_and_parents); then
    zombie_pid=$(echo "$zombie_info" | awk '{print $1}')
    parent_pid=$(echo "$zombie_info" | awk '{print $2}')

    echo "Found zombie process $zombie_pid with immediate parent $parent_pid"

    parent_chain=$(get_parent_chain "$parent_pid")
    echo "Parent chain: $parent_chain"

    if [[ $parent_chain == *"containerd-shim"* ]]; then
        echo "Top-level parent is containerd-shim"
        immediate_parent=$(echo "$parent_chain" | awk -F' ' '{print $1}' | cut -d':' -f1)
        if [[ $immediate_parent != $parent_pid ]]; then
            if attempt_kill $parent_pid -SIGCHLD 15 "SIGCHLD"; then
                echo "Zombie process cleaned up after SIGCHLD"
            elif attempt_kill $parent_pid -SIGTERM 15 "SIGTERM"; then
                echo "Zombie process cleaned up after SIGTERM"
            elif attempt_kill $parent_pid -SIGKILL 5 "SIGKILL"; then
                echo "Zombie process cleaned up after SIGKILL"
            else
                echo "Failed to clean up zombie process after all attempts"
            fi
        else
            echo "Immediate parent is containerd-shim. Not killing."
        fi
    else
        echo "Top-level parent is not containerd-shim. No action taken."
    fi
fi
EOF
```
Make Executable
```
chmod +x /usr/local/bin/kill_zombie_parents.sh
```
Create Cron Job
Create /etc/cron.d/kill_zombie_parents:
```
cat > /etc/cron.d/kill_zombie_parents << 'EOF'
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
SHELL=/bin/bash

*/5 * * * * root /usr/local/bin/kill_zombie_parents.sh | logger -t kill_zombie_parents
EOF
```
This runs every 5 minutes and logs output to syslog.
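To verify the job is firing, you can look for the logger tag after a few minutes; on Ubuntu either of these is typical, assuming rsyslog or journald respectively:
```
# Recent output from the script, as logged via logger -t kill_zombie_parents
grep kill_zombie_parents /var/log/syslog | tail -n 20
journalctl -t kill_zombie_parents --since "1 hour ago"
```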
STEP 12 - Verify Firewall
Ensure these ports are open between nodes:
Control Plane:
- `6443/tcp` - Kubernetes API server
- `2379-2380/tcp` - etcd client and peer
All Nodes:
- `10250/tcp` - Kubelet API
See Kubernetes port reference for a complete list.
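A quick way to spot a blocked port is to probe it from another node (a sketch assuming netcat is installed and using the example IPs from the inventory above):
```
# API server on a control plane node, kubelet on a worker
nc -zv 10.0.0.10 6443
nc -zv 10.0.0.13 10250
```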
Advanced Configuration
Custom Ephemeral Storage: If you need to use a separate mount point for ephemeral storage (RAID array, dedicated NVMe, etc.), this must be configured before deploying the cluster. This is an advanced configuration most users won’t need. See the advanced guides for details.
Next Steps
Your Kubernetes cluster is now ready!
- Otherwise: → Provider Installation
Additional optional features:
- TLS Certificates - Automatic SSL certificates
- IP Leases - Enable static IPs for deployments