When conducting maintenance on your Akash Provider, ensure the akash-provider service is stopped during the maintenance period.
An issue currently exists in which provider leases may be lost during maintenance activities if the akash-provider service is not stopped beforehand. This issue is detailed further here.
Steps to Stop the akash-provider Service
Steps to Verify the akash-provider Service Has Been Stopped
Steps to Start the akash-provider Service Post Maintenance
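A minimal sketch of the stop/verify/start steps listed above, assuming a Helm-based install where the provider runs in the akash-services namespace; depending on your chart version the provider is either a Deployment or a StatefulSet named akash-provider, so adjust the resource type accordingly:

```bash
# Check whether your chart runs the provider as a Deployment or a StatefulSet
kubectl -n akash-services get deployment,statefulset

# Stop the provider during maintenance (StatefulSet shown; use deployment/akash-provider if applicable)
kubectl -n akash-services scale statefulset/akash-provider --replicas=0

# Verify no akash-provider pod remains
kubectl -n akash-services get pods | grep akash-provider

# Start the provider again once maintenance is complete
kubectl -n akash-services scale statefulset/akash-provider --replicas=1
```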
How to terminate the workload from the Akash Provider using CLI
Impact of Steps Detailed on the K8s Cluster
Steps outlined in this section will terminate the deployment in the K8s cluster and remove the manifest.
Providers can close the bid to get the provider escrow back.
Closing the bid will terminate the associated application running on the provider.
Closing the bid closes the lease (payment channel), meaning the tenant won’t get any further charge for the deployment from the moment the bid is closed.
Providers cannot close deployment orders. Only tenants can close deployment orders, and only then is the deployment escrow returned to the tenant.
Impact of Steps Detailed on the Blockchain
The lease will be closed and the deployment will switch from the open state to the paused state, with the escrow account remaining open. Use the akash query deployment get CLI command to verify this if desired. The owner will still have to close their deployment (akash tx deployment close) in order to get the AKT back from the deployment’s escrow account (5 AKT by default). The provider has no rights to close the user’s deployment on its own.
Of course, you don’t have to kubectl exec inside the akash-provider Pod as detailed in this guide - you can do the same anywhere you have:
The provider’s key
The Akash CLI tool
Any mainnet Akash RPC node to broadcast the bid close transaction
It is also worth noting that in some cases running transactions from an account that is already in use (such as by the running akash-provider service) can cause account sequence mismatch errors. This typically happens when two clients try to issue a transaction within the same block window (~6.1s).
STEP 1 - Find the deployment you want to close
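A minimal sketch of this step, assuming you have the provider key, the Akash CLI, and an RPC node available as described above; the jq path into the lease list output is an assumption that may vary slightly across CLI versions:

```bash
# Placeholders - set these for your environment
AKASH_NODE=https://your-rpc-node:443
PROVIDER=akash1yourprovideraddress

# List active leases held by your provider and note the owner/dseq of the one to close
akash query market lease list \
  --provider "$PROVIDER" \
  --state active \
  --node "$AKASH_NODE" \
  --output json | jq -r '.leases[].lease.lease_id | "\(.owner) \(.dseq)"'
```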
STEP 2 - Close the bid
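A hedged example of closing the bid from the provider account, using the owner/dseq found in STEP 1 (all values shown are placeholders). Because of the account sequence caveat above, avoid broadcasting this while the running provider service is issuing its own transactions:

```bash
# Close the bid for the selected lease; the provider is identified by the signing key (--from)
akash tx market bid close \
  --owner <deployment-owner-address> \
  --dseq <dseq> \
  --gseq 1 --oseq 1 \
  --from <provider-key-name> \
  --node "$AKASH_NODE" \
  --chain-id akashnet-2 \
  --fees 5000uakt
```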
STEP 3 - Verification
To make sure your provider is working well, you can watch the logs while trying to deploy something on it, to make sure it bids (i.e. broadcasts the tx on the network).
Example/Expected Messages
To be sure, you can always bounce the provider service; this has no impact on active workloads.
Provider Logs
The commands in this section peer into the provider’s logs and may be used to verify possible error conditions on provider startup and to confirm that the provider’s order receipt/bid process steps complete.
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Using the example command syntax, we will list the last ten entries in the provider logs and enter a live streaming session of new logs as they are generated.
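A minimal sketch of such a command, assuming the provider pod runs in the akash-services namespace with akash-provider in its name:

```bash
# Show the last ten provider log entries, then stream new ones as they arrive
kubectl -n akash-services logs \
  $(kubectl -n akash-services get pods -o name | grep akash-provider | head -n1) \
  --tail=10 -f
```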
Example Output
Note within the example the receipt of a deployment order with a DSEQ of 5949829
The sequence shown from order-detected through reservations through bid-complete provides an example of what we would expect to see when an order is received by the provider
The order receipt is one of many event sequences that can be verified within provider logs
Provider Status and General Info
Use the verifications included in this section for the following purposes:
Issue the commands in this section from any machine that has the Akash CLI installed.
Example Command Use
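Two hedged examples of general provider checks: the on-chain provider record via the Akash CLI, and the status endpoint served by the provider itself. The hostname and port 8443 are assumptions based on a standard provider install; reuse the AKASH_NODE and PROVIDER variables from the earlier sketch:

```bash
# On-chain provider record (attributes, host URI, etc.)
akash query provider get "$PROVIDER" --node "$AKASH_NODE"

# Live status served by the provider (active lease count, available resources)
curl -ks https://your-provider-hostname:8443/status | jq
```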
Example Output
List Active Leases from Hostname Operator Perspective
Command Syntax
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
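A sketch of one way to view leases as the hostname operator sees them, assuming the operator’s ProviderHost custom resource is installed under the plural name providerhosts (each object maps an ingress hostname to a lease):

```bash
# List ProviderHost objects across all namespaces
kubectl get providerhosts -A
```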
Example Output
Provider Side Lease Closure
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Example Output (Truncated)
Ingress Controller Verifications
Example Command Use
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
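The ingress listing referenced later in this guide looks like this:

```bash
# List ingress objects across all namespaces; Akash deployments appear
# under their generated lease namespaces
kubectl get ingress -A
```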
Example Output
NOTE - in this example output the last entry (with namespace moc58fca3ccllfrqe49jipp802knon0cslo332qge55qk) represents an active deployment on the provider
Provider Manifests
Use the verifications included in this section for the following purposes:
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
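A minimal example of listing the provider’s Manifest objects, one per deployment, with their labels:

```bash
# List Manifest objects across all namespaces, including
# the DSEQ / OSEQ / GSEQ / owner labels
kubectl get manifests -A --show-labels
```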
Example Output
The --show-labels option includes display of the associated DSEQ / OSEQ / GSEQ / Owner labels
Retrieve Manifest Detail From Provider
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Note - use the `kubectl get ingress -A` command covered in this guide to look up the namespace of the deployment of interest
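A hedged example, with the namespace left as a placeholder to be filled in from the ingress listing:

```bash
# Dump the full manifest for the deployment in the given lease namespace
kubectl -n <deployment-namespace> get manifests -o yaml
```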
Example Output
Provider Earnings
Use the verifications included in this section for the following purposes:
Use the commands detailed in this section to gather the daily earnings history of your provider
Command Template
Only the following variables need to be updated in the template for your use (a simplified sketch follows this list):
AKASH_NODE - populate value with the address of your RPC node
PROVIDER - populate value with your provider address
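The full daily-history template builds on the market lease query below. This simplified sketch only shows the underlying data (per-lease state and amount withdrawn, in uakt); the JSON field names are assumptions that may vary slightly across akash CLI versions:

```bash
AKASH_NODE=https://your-rpc-node:443   # your RPC node
PROVIDER=akash1yourprovideraddress     # your provider address

unset AKASH_FROM
akash query market lease list --provider "$PROVIDER" --node "$AKASH_NODE" \
  --limit 1000 --output json \
  | jq -r '.leases[] | [.lease.lease_id.dseq, .lease.state, .escrow_payment.withdrawn.amount] | @tsv'
```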
Example Command Use
Example Output
Column Headers
Output generated from Example Command Use
AKT Total Earned by Provider
Use the commands detailed in this section to gather the total earnings of your provider
Command Template
Issue the commands in this section from any machine that has the Akash CLI installed.
Note - ensure queries are not limited only to leases created by your account by issuing unset AKASH_FROM prior to the akash query market command execution
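A rough sketch of totalling earnings, reusing AKASH_NODE and PROVIDER from the earlier sketch; the escrow_payment.withdrawn field path is an assumption, and with many leases you may need to page through results rather than rely on a single large --limit:

```bash
unset AKASH_FROM
# Sum the withdrawn amounts across all leases and convert uakt to AKT
akash query market lease list --provider "$PROVIDER" \
  --node "$AKASH_NODE" --limit 10000 --output json \
  | jq -r '[.leases[].escrow_payment.withdrawn.amount|tonumber] | add / 1000000'
```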
Example Command Use
Example Output
AKT Total Earning Potential Per Active Deployment
Legend for Command Syntax
In the equations used to calculate earning potential, several figures are used that are not static.
For accurate earning potential based on today’s actual financial and other realities, consider whether the following numbers should be updated prior to command execution.
30.436875 used as the average number of days in a month
Command Syntax
Issue the commands in this section from any machine that has the Akash CLI installed.
Note - ensure queries are not limited only to leases created by your account by issuing unset AKASH_FROM prior to the akash query market command execution
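A rough sketch of the per-lease monthly earning potential calculation, reusing AKASH_NODE and PROVIDER from the earlier sketches. The ~6.1s block time and the lease JSON field names are assumptions you should adjust to current realities, as noted above:

```bash
BLOCK_TIME=6.117                                               # average seconds per block
BLOCKS_PER_MONTH=$(echo "30.436875*24*60*60/$BLOCK_TIME" | bc -l)

unset AKASH_FROM
# Print dseq and projected AKT per month for each active lease
akash query market lease list --provider "$PROVIDER" --state active \
  --node "$AKASH_NODE" --limit 1000 --output json \
  | jq -r --arg bpm "$BLOCKS_PER_MONTH" \
      '.leases[] | [.lease.lease_id.dseq,
        ((.lease.price.amount|tonumber) * ($bpm|tonumber) / 1000000)] | @tsv'
```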
Example Command Use
Example Output
Current Leases: Withdrawn vs Consumed
Use the commands detailed in this section to compare the amount of AKT consumed versus the amount of AKT withdrawn per deployment. This review will ensure that withdrawal of consumed funds is occurring as expected.
Command Syntax
Only the following variables need to be updated in the template for your use (a simplified sketch follows this list):
AKASH_NODE - populate value with the address of your RPC node
PROVIDER - populate value with your provider address
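A simplified sketch of the comparison, reusing AKASH_NODE and PROVIDER from earlier. Consumed is approximated as blocks elapsed since lease creation multiplied by the per-block price; the lease JSON field names (created_at, price.amount, escrow_payment.withdrawn) are assumptions that may vary across CLI versions:

```bash
# Current chain height
HEIGHT=$(akash query block --node "$AKASH_NODE" | jq -r '.block.header.height')

unset AKASH_FROM
# Print dseq, approximate consumed uakt, and withdrawn uakt per active lease
akash query market lease list --provider "$PROVIDER" --state active \
  --node "$AKASH_NODE" --limit 1000 --output json \
  | jq -r --arg h "$HEIGHT" '.leases[] |
      [ .lease.lease_id.dseq,
        ((($h|tonumber) - (.lease.created_at|tonumber)) * (.lease.price.amount|tonumber)),
        (.escrow_payment.withdrawn.amount|tonumber) ] | @tsv'
```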
Example Command Use
Example Output
Dangling Deployments
As part of routine Akash Provider maintenance it is a good idea to ensure that there are no “dangling deployments” in your provider’s Kubernetes cluster.
We define a “dangling deployment” as a scenario in which the lease for a deployment was closed but, due to a communication issue, the associated deployment in Kubernetes was not closed. The reverse can apply too, where the dangling deployment remains active on the chain but not on the provider. This should be a rare circumstance, but we want to cleanse the provider of any such “dangling deployments” from time to time.
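A rough sketch of how the two sides can be compared, reusing AKASH_NODE and PROVIDER from earlier. The akash.network=true namespace label and the akash.network/lease.id.dseq label key are assumptions based on how the provider typically labels lease namespaces:

```bash
# DSEQs present in the cluster (one lease namespace per deployment)
kubectl get ns -l akash.network=true -L akash.network/lease.id.dseq

# DSEQs of leases the chain still considers active for this provider
unset AKASH_FROM
akash query market lease list --provider "$PROVIDER" --state active \
  --node "$AKASH_NODE" --limit 1000 --output json \
  | jq -r '.leases[].lease.lease_id.dseq' | sort -n

# Compare the two listings: anything present on only one side is a dangling deployment
```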
Heal Broken Deployment Replicas by Restoring the Lost command in Manifests
Prior to provider version 0.2.1 (akash/provider helm-chart version 4.2.0) there was an issue affecting some deployments.
Issue
Deployments with the command explicitly set in their SDL manifest files were losing it upon akash-provider pod/service restart.
This led to their replica pods running in a CrashLoopBackOff state on the provider side, reserving additional resources, while the original replica kept running; this was not visible to the client.
Impact
Double the amount of resources is occupied by the deployment on the provider side
Manifests of these deployments are missing the command
The good news is that both issues can be fixed without customer intervention.
Once you have updated your provider to version 0.2.1 or greater following the instructions, you can patch the manifests with the correct command, which will get rid of the deployments left in the CrashLoopBackOff state.
STEP 1 - Backup manifests
Before patching the manifests, please make sure to back them up.
They can help in troubleshooting should any issues arise later.
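One simple way to take that backup is to dump every Manifest object to a single file:

```bash
# Back up all Manifest objects before patching
kubectl get manifests -A -o yaml > manifests-backup-$(date +%F).yaml
```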
STEP 2 - Collect the deployments which are affected by the lost command issue
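A simplified sketch of the kind of listing this step produces: namespace, pod, and the container command for pods in Akash lease namespaces. The akash.network=true namespace label is an assumption, and the original template also reports the manifest revision, which is omitted here for brevity:

```bash
# Pods showing a null command while their manifest defines one are the bad replicas
for ns in $(kubectl get ns -l akash.network=true -o name | cut -d/ -f2); do
  kubectl -n "$ns" get pods -o json \
    | jq -r --arg ns "$ns" \
        '.items[] | [$ns, .metadata.name, (.spec.containers[0].command|tostring)] | @tsv'
done
```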
Example Output:
revision, namespace, pod, command
The pods with the null commands are the bad replicas in this case, affected by the lost command issue.
You might see some pods with null commands for replicas that are stuck in the Pending state because of insufficient resources on the provider; just ignore those.
They will start back up again once the provider regains enough capacity.
STEP 3 - Patch the manifests
Example Output:
STEP 4 - Bounce the provider pod/service
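One way to bounce the provider, assuming a Helm-based install in the akash-services namespace; use deployment/akash-provider instead if your chart version runs the provider as a Deployment:

```bash
kubectl -n akash-services rollout restart statefulset/akash-provider
```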
That’s all. The bad replicas will disappear on their own.
Example with one namespace:
Before:
After:
Persistent Storage Deployments
Persistent storage enabled deployments are of the statefulset kind.
These do not have replicas and thus no CrashLoopBackOff containers.
There is no impact, so you can skip them.
However, if you still want to fix their manifests, apply the following procedure:
STEP 1 - Verify the statefulset deployments
Here you can ignore the “null” ones; they are normal deployments that simply do not use the command in their SDL manifest files.
Example Output:
STEP 2 - Patch the manifest
That’s all. There is no need to bounce the akash-provider pod/service for statefulset deployments.
Maintaining and Rotating Kubernetes/etcd Certificates: A How-To Guide
When K8s certs expire, you won’t be able to use your cluster. Make sure to rotate your certs proactively.
The following procedure explains how to rotate them manually.
Evidence that the certs have expired:
You can always view the certs expiration using the kubeadm certs check-expiration command:
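```bash
kubeadm certs check-expiration
```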
Rotate K8s Certs
Backup etcd DB
It is crucial to back up your etcd DB as it contains your K8s cluster state! Make sure to back up your etcd DB before rotating the certs.
Take the etcd DB Backup
Replace the etcd key and cert paths with the locations found in the prior steps
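A hedged example of the snapshot command; the endpoint and the cert/key paths below are typical kubespray locations and must be replaced with the locations found in the prior steps:

```bash
# Snapshot the etcd DB to a dated file
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem
```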
You can additionally backup the current certs:
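```bash
# Keep a dated copy of the current certificates
cp -a /etc/kubernetes/pki /etc/kubernetes/pki.backup-$(date +%F)
```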
Renew the Certs
IMPORTANT: For an HA Kubernetes cluster with multiple control plane nodes, the kubeadm certs renew command (followed by the kube-apiserver, kube-scheduler, kube-controller-manager pods and etcd.service restart) needs to be executed on all the control-plane nodes, on one control plane node at a time, starting with the primary control plane node. This approach ensures that the cluster remains operational throughout the certificate renewal process and that there is always at least one control plane node available to handle API requests. To find out whether you have an HA K8s cluster (multiple control plane nodes) use this command kubectl get nodes -l node-role.kubernetes.io/control-plane
Now that you have the etcd DB backup, you can rotate the K8s certs using the following commands:
Rotate the k8s Certs
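```bash
kubeadm certs renew all
```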
Update your kubeconfig
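```bash
# admin.conf is regenerated by the renewal; refresh the local kubeconfig copy
cp /etc/kubernetes/admin.conf ~/.kube/config
```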
Bounce the following services in this order
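One common way to do this, assuming a containerd runtime managed via crictl and an etcd systemd service (as is typical for kubespray-built clusters); the kubelet recreates the static pods automatically after they are removed:

```bash
# Restart the control plane static pods so they pick up the renewed certs
for c in kube-apiserver kube-scheduler kube-controller-manager; do
  for p in $(crictl pods --name "$c" -q); do
    crictl stopp "$p" && crictl rmp "$p"
  done
done

# Then restart etcd
systemctl restart etcd.service
```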
Verify the Certs Status
Repeat the process for all control plane nodes, one at a time, if you have a HA Kubernetes cluster.
Force New ReplicaSet Workaround
The steps outlined in this guide provide a workaround for a known issue which occurs when a deployment update is attempted and fails due to the provider being out of resources. This happens because K8s won’t destroy an old pod instance until it ensures the new one has been created.
Create the crontab job /etc/cron.d/akash-force-new-replicasets to run the workaround every 5 minutes.
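A hedged example of that crontab file; the workaround script path and name below are assumptions and should match wherever you placed the script:

```bash
cat > /etc/cron.d/akash-force-new-replicasets <<'EOF'
*/5 * * * * root /usr/local/bin/akash-force-new-replicasets.sh
EOF
```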
Kill Zombie Processes
Issue
In certain Kubernetes deployments, subprocesses may not properly implement the wait() function, leading to the creation of <defunct> processes, commonly known as “zombie” processes. These occur when a subprocess completes its task but remains in the system’s process table because the parent process has not retrieved its exit status. Over time, if these zombie processes are not managed, they can accumulate and consume all available process slots in the system, leading to PID exhaustion and resource starvation.
While zombie processes do not consume CPU or memory resources directly, they occupy slots in the system’s process table. If the process table becomes full, no new processes can be spawned, potentially causing severe disruptions. The limit for the number of process IDs (PIDs) available on a system can be checked using:
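For example, the kernel PID limit and the current size of the process table can be checked with:

```bash
# Maximum number of PIDs the kernel will allocate
cat /proc/sys/kernel/pid_max

# Number of processes currently in the process table
ps -e --no-headers | wc -l
```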
To prevent this issue, it is crucial to manage and terminate child processes correctly to avoid the formation of zombie processes.
Recommended Approaches
Proper Process Management in Scripts: Ensure that any scripts initiating subprocesses correctly manage their lifecycle. For example:
Using a Container Init System: Deploying a proper container init system ensures that zombie processes are automatically reaped, and signals are forwarded correctly, reducing the likelihood of zombie process accumulation. Here are some tools and examples that you can use:
Tini: A lightweight init system designed for containers. It is commonly used to ensure zombie process reaping and signal handling within Docker containers. You can easily add Tini to your Docker container by using the --init flag or adding it as an entrypoint in your Dockerfile.
Dumb-init: Another lightweight init system designed to handle signal forwarding and process reaping. It is simple and efficient, making it a good alternative for minimal containers that require proper PID 1 behavior.
Runit Example: Runit is a fast and reliable init system and service manager. This Dockerfile example demonstrates how to use Runit as the init system in a Docker container.
Supervisord Example by Docker.com: Supervisord is a popular process manager that allows for managing multiple services within a container. The official Docker documentation provides a supervisord example that illustrates how to manage multiple processes effectively.
S6 Example: S6 is a powerful init system and process supervisor. The S6 overlay repository offers examples and guidelines on how to integrate S6 into your Docker containers, providing process management and reaping.
For more details on this approach, refer to the following resources:
In some cases, misconfigured container images can lead to a rapid accumulation of zombie processes. For instance, a container that repeatedly fails to start an sshd service might spawn zombie processes every 20 seconds:
Steps to Implement a Workaround for Providers
Since providers cannot control the internal configuration of tenant containers, it is advisable to implement a system-wide workaround to handle zombie processes.
Create a Script to Kill Zombie Processes
Create the script /usr/local/bin/kill_zombie_parents.sh:
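A hypothetical sketch of such a script, not the original: it locates zombie (<defunct>) processes and prods their parents to reap them; a heavier-handed variant, as described later in this section, terminates the parent instead:

```bash
#!/usr/bin/env bash
# Find zombie processes and signal their parents so the zombies get reaped

ps -eo pid=,ppid=,stat= | awk '$3 ~ /^Z/ {print $1, $2}' | while read -r zpid ppid; do
  # Zombies re-parented to PID 1 are reaped automatically; skip them
  [ "$ppid" -le 1 ] && continue
  echo "$(date): zombie $zpid found, signalling parent $ppid with SIGCHLD"
  kill -s SIGCHLD "$ppid"
  # If the parent keeps ignoring SIGCHLD, terminating it is the fallback: kill "$ppid"
done
```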
Mark the Script as Executable
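```bash
chmod +x /usr/local/bin/kill_zombie_parents.sh
```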
Create a Cron Job
Set up a cron job to run the script every 5 minutes:
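For example, assuming the cron file name below:

```bash
cat > /etc/cron.d/kill_zombie_parents <<'EOF'
*/5 * * * * root /usr/local/bin/kill_zombie_parents.sh
EOF
```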
This workaround will help mitigate the impact of zombie processes on the system by periodically terminating their parent processes, thus preventing the system’s PID table from being overwhelmed.
Close Leases Based on Image
Below is a suboptimal way of terminating leases with selected (unwanted) images, until Akash natively supports that.
It is suboptimal because once the deployment gets closed, the provider has to be restarted to recover from the account sequence mismatch error. Providers already do this automatically through the K8s liveness probe set on the akash-provider deployment.
The other core problem is that the image is unknown until the client transfers the SDL to the provider (tx send-manifest), which can only happen after the provider bids and the client accepts the bid.
Follow the steps associated with your Akash Provider install method:
Create script file - /usr/local/bin/akash-kill-lease.sh - and populate with the following content:
Make the Script Executable
Create Cron Job
Create the Cron Job file - /etc/cron.d/akash-kill-lease - with the following content:
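A hedged example of that cron file content; the five-minute schedule is an assumption, and adjust the entry if your script takes arguments (such as the unwanted image name):

```bash
cat > /etc/cron.d/akash-kill-lease <<'EOF'
*/5 * * * * root /usr/local/bin/akash-kill-lease.sh
EOF
```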
Provider Bid Script Migration - GPU Models
A new bid script for Akash Providers has been released that now includes the ability to specify pricing of multiple GPU models.
This document details the recommended procedure for Akash providers needing migration to the new bid script from prior versions.
New Features of Bid Script Release
Support for parameterized price targets (configurable through the Akash/Provider Helm chart values), eliminating the need to manually update your bid price script
Pricing based on GPU model, allowing you to specify different prices for various GPU models
How to Migrate from Prior Bid Script Releases
STEP 1 - Backup your current bid price script
This command will produce an old-bid-price-script.sh file, which is your currently active bid price script with your custom modifications
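A sketch of one way to extract it, assuming a standard Helm-based install with the release named akash-provider in the akash-services namespace and the script stored base64-encoded under the bidpricescript value:

```bash
# Recover the currently deployed bid price script from the Helm release values
helm -n akash-services get values akash-provider -o json \
  | jq -r '.bidpricescript' | base64 -d > old-bid-price-script.sh
```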
This command will backup your akash/provider config in the provider.yaml file (excluding the old bid price script)
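A sketch under the same assumptions, filtering out the long inline bid script value:

```bash
# Save the current chart values, excluding the old bid price script
helm -n akash-services get values akash-provider \
  | grep -vE '^(USER-SUPPLIED VALUES|bidpricescript)' > provider.yaml
```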
STEP 4 - Update provider.yaml File Accordingly
Update your provider.yaml file with the price targets you want. If you don’t specify these keys, the bid price script will default to the values shown below
price_target_gpu_mappings sets the GPU price in the following way (per the example provided; an illustrative snippet follows this list):
a100 nvidia models will be charged 120 USD per GPU unit a month
t4 nvidia models will be charged 80 USD per GPU unit a month
Unspecified nvidia models will be charged 130 USD per GPU unit a month (if * is not explicitly set in the mapping, it defaults to 100 USD per GPU unit a month)
Extend with more models your provider is offering, if necessary, using the syntax <model>=<USD per GPU unit a month>
If your GPU model has different possible RAM specs - use this type of convention: a100.40Gi=900,a100.80Gi=1000
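An illustrative way to append the GPU mapping discussed above to provider.yaml; the key name follows the akash/provider chart values and the prices are the example figures from this list:

```bash
cat >> provider.yaml <<'EOF'
price_target_gpu_mappings: "a100=120,t4=80,*=130"
EOF
```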
STEP 5 - Download New Bid Price Script
STEP 6 - Upgrade Akash/Provider Chart to Version 6.0.5
Expected/Example Output
STEP 7 - Upgrade akash-provider Deployment with New Bid Script
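A hedged sketch of the upgrade command, assuming the release/chart names of a standard install and that the new script downloaded in STEP 5 was saved as bid-price-script.sh (a hypothetical filename):

```bash
# Re-deploy the provider with the updated values and the new bid price script
helm upgrade akash-provider akash/provider -n akash-services \
  -f provider.yaml \
  --set bidpricescript="$(cat bid-price-script.sh | openssl base64 -A)"
```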
Verification of Bid Script Update
Expected/Example Output
GPU Provider Troubleshooting
Should your Akash Provider encounter issues during the installation process or in post install hosting of GPU resources, follow the troubleshooting steps in this guide to isolate the issue.
NOTE - these steps should be conducted on each Akash Provider/Kubernetes worker node that hosts GPU resources unless stated otherwise within the step
Conduct the steps in this section for basic verification and to ensure the host has access to GPU resources
Prep/Package Installs
Confirm GPU Resources Available on Host
NOTE - example verification steps were conducted on a host with a single NVIDIA T4 GPU resource. Your output will be different based on the type and number of GPU resources on the host.
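For example:

```bash
# Confirm the GPU is visible on the PCI bus and to the NVIDIA driver
lspci | grep -i nvidia
nvidia-smi
```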
Example/Expected Output
Confirm CUDA Install & Version
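For example:

```bash
# CUDA version as reported by the driver
nvidia-smi | grep -i "cuda version"

# Compiler toolkit version, if the CUDA toolkit is installed
nvcc --version
```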
Example/Expected Output
Confirm CUDA GPU Support is Available for Hosted GPU Resources
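This check uses PyTorch (assumed to be installed, e.g. via `pip3 install torch`) to confirm CUDA can initialize against the hosted GPU; the same call is referenced in the Fabric Manager section below:

```bash
python3 -c "import torch; print(torch.cuda.is_available())"
```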
Example/Expected Output
Examine Linux Kernel Logs for GPU Resource Errors and Mismatches
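For example:

```bash
# Look for NVIDIA driver load errors or version-mismatch messages
dmesg | grep -iE 'nvidia|nvrm'
```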
Example/Expected Output
NOTE - example output is from a healthy host which loaded NVIDIA drivers successfully and has no version mismatches. Your output may look very different if there are issues within the host.
Ensure Correct Version/Presence of NVIDIA Device Plugin
NOTE - conduct this verification step on the Kubernetes control plane node on which Helm was installed during your Akash Provider build
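A hedged example, assuming the device plugin was installed via the nvdp Helm chart into the nvidia-device-plugin namespace during the provider build:

```bash
# Confirm the device plugin chart release and its pods
helm list -n nvidia-device-plugin
kubectl -n nvidia-device-plugin get pods -o wide
```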
Example/Expected Output
NVIDIA Fabric Manager
In some circumstances it has been found that the NVIDIA Fabric Manager needs to be installed on worker nodes hosting GPU resources (e.g. non-PCIe GPU configurations such as those using SXM form factors)
If the output of the torch.cuda.is_available() command - covered in a prior section of this doc - is an error condition, consider installing the NVIDIA Fabric Manager to resolve the issue
Frequently encountered error message when the issue exists:
torch.cuda.is_available() function: Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
Further details on the NVIDIA Fabric Manager are available here
NOTE - replace 525 in the following command with the NVIDIA driver version used on your host
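```bash
# Install and start the Fabric Manager matching your driver branch (525 shown)
apt-get update && apt-get install -y nvidia-fabricmanager-525
systemctl enable --now nvidia-fabricmanager
```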
NOTE - you may need to wait for about 2-3 minutes for the nvidia-fabricmanager to initialize
nvidia-fabricmanager package version mismatch
Occasionally, the Ubuntu repositories may not provide the correct version of the nvidia-fabricmanager package. This can result in the Error 802: system not yet initialized error on SXM NVIDIA GPUs.
A common symptom of this issue is that nvidia-fabricmanager fails to start properly:
To resolve this issue, you’ll need to use the official NVIDIA repository. Here’s how to add it:
NOTE - replace 2204 with your Ubuntu version (e.g. 2404 for Ubuntu noble release)
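A hedged example of adding the repository keyring; the cuda-keyring package version shown was current at the time of writing and may have changed:

```bash
# Add NVIDIA's official CUDA repository keyring (Ubuntu 22.04 shown)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt-get update
```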
NOTE - Running apt dist-upgrade with the official NVIDIA repo bumps the nvidia packages along with the nvidia-fabricmanager, without version mismatch issue.
dpkg -l | grep nvidia — make sure to remove any version you don’t expect
and reboot