When conducting maintenance on your Akash Provider, ensure the akash-provider service is stopped during the maintenance period.
An issue currently exists in which provider leases may be lost during maintenance activities if the akash-provider service is not stopped beforehand. This issue is detailed further here.
Steps to Stop the akash-provider Service
Steps to Verify the akash-provider Service Has Been Stopped
Steps to Start the akash-provider Service Post Maintenance
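The three maintenance steps above can be sketched with kubectl. This is a sketch assuming a typical Helm-based install where the provider runs as the `akash-provider` Deployment (a StatefulSet in newer chart versions) in the `akash-services` namespace; adjust the kind, name, and namespace to your own setup.

```shell
# Assumes a Helm-based install: provider runs as the "akash-provider"
# Deployment (a StatefulSet in newer charts) in the "akash-services"
# namespace - adjust kind/name/namespace to your setup.

# Stop the provider for the maintenance window
kubectl -n akash-services scale deployment/akash-provider --replicas=0

# Verify that no provider pods remain
kubectl -n akash-services get pods -l app=akash-provider

# Start the provider again once maintenance is complete
kubectl -n akash-services scale deployment/akash-provider --replicas=1
```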
How to terminate a workload from the Akash Provider using the CLI
Impact of Steps Detailed on the K8s Cluster
Steps outlined in this section will terminate the deployment in the K8s cluster and remove the manifest.
Providers can close the bid to get the provider escrow back.
Closing the bid will terminate the associated application running on the provider.
Closing the bid closes the lease (payment channel), meaning the tenant won’t get any further charge for the deployment from the moment the bid is closed.
Providers cannot close deployment orders. Only tenants can close deployment orders, and only then is the deployment escrow returned to the tenant.
Impact of Steps Detailed on the Blockchain
The lease will be closed and the deployment will switch from the open to the paused state, with its escrow account remaining open. Use the akash query deployment get CLI command to verify this if desired. The owner will still have to close the deployment (akash tx deployment close) to get the AKT back from the deployment’s escrow account (5 AKT by default). The provider has no right to close a user’s deployment on its own.
Note that you do not have to kubectl exec into the akash-provider Pod as detailed in this guide; you can perform the same steps anywhere you have:
The provider’s key
The Akash CLI tool
Any mainnet Akash RPC node to broadcast the bid close transaction
It is also worth noting that issuing transactions from an account that is already in use (such as by the running akash-provider service) can cause account sequence mismatch errors, typically when two clients try to issue a transaction within the same block window (~6.1s).
STEP 1 - Find the deployment you want to close
STEP 2 - Close the bid
STEP 3 - Verification
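The three steps above can be sketched as follows. This is a minimal sketch assuming the AKASH_NODE, AKASH_CHAIN_ID, AKASH_PROVIDER, and AKASH_KEY_NAME environment variables are already set for your provider; the owner address and DSEQ placeholders come from the lease you want to close.

```shell
# STEP 1 - find the deployment you want to close (note its owner and dseq)
akash query market lease list --provider "$AKASH_PROVIDER" \
  --state active --node "$AKASH_NODE"

# STEP 2 - close the bid for that lease
akash tx market bid close \
  --owner <owner-address> --dseq <dseq> --gseq 1 --oseq 1 \
  --provider "$AKASH_PROVIDER" --from "$AKASH_KEY_NAME" \
  --node "$AKASH_NODE" --chain-id "$AKASH_CHAIN_ID" \
  --gas auto --gas-adjustment 1.3 --fees 5000uakt

# STEP 3 - verify the lease now shows as closed
akash query market lease list --provider "$AKASH_PROVIDER" \
  --state closed --node "$AKASH_NODE"
```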
To confirm your provider is working well, watch the logs while deploying something to it and verify that it bids (i.e. broadcasts the bid transaction on the network)
Example/Expected Messages
If in doubt, you can always bounce the provider service; this has no impact on active workloads
Provider Logs
The commands in this section peer into the provider’s logs and may be used to verify possible error conditions on provider start up and to ensure provider order receipt/bid process completion steps.
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Using the example command syntax, we will list the last ten entries in the provider logs and enter a live streaming session of newly generated logs
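A minimal form of this command, assuming the chart’s default `app=akash-provider` pod label (adjust the namespace and label to your install):

```shell
# Show the last ten provider log entries, then stream new ones live
kubectl -n akash-services logs -l app=akash-provider --tail=10 -f
```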
Example Output
Note within the example the receipt of a deployment order with a DSEQ of 5949829
The sequence shown from order-detected through reservations to bid-complete provides an example of what we would expect to see when an order is received by the provider
The order receipt is one of many event sequences that can be verified within provider logs
Provider Status and General Info
Use the verifications included in this section for the following purposes:
Issue the commands in this section from any machine that has the Akash CLI installed.
Example Command Use
Example Output
List Active Leases from Hostname Operator Perspective
Command Syntax
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Output
Provider Side Lease Closure
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Example Output (Truncated)
Ingress Controller Verifications
Example Command Use
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Output
NOTE - in this example output the last entry (with namespace moc58fca3ccllfrqe49jipp802knon0cslo332qge55qk) represents an active deployment on the provider
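This verification uses the standard cluster-wide ingress listing; each active deployment appears under its lease namespace:

```shell
# List ingress objects across all namespaces; active Akash deployments
# show up under their lease namespaces
kubectl get ingress -A
```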
Provider Manifests
Use the verifications included in this section for the following purposes:
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Output
The --show-labels option includes display of the associated DSEQ / OSEQ / GSEQ / owner labels
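A minimal form of this verification, assuming the standard provider install where deployment manifests are stored as `manifests.akash.network` custom resources:

```shell
# List Akash manifest objects cluster-wide with their lease labels
# (DSEQ / OSEQ / GSEQ / owner)
kubectl get manifests -A --show-labels
```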
Retrieve Manifest Detail From Provider
Command Template
Issue the commands in this section from a control plane node within the Kubernetes cluster or a machine that has kubectl communication with the cluster.
Example Command Use
Note - use the `kubectl get ingress -A` command covered in this guide to look up the namespace of the deployment of interest
Example Output
Provider Earnings
Use the verifications included in this section for the following purposes:
Use the commands detailed in this section to gather the daily earnings history of your provider
Command Template
Only the following variables need update in the template for your use:
AKASH_NODE - populate value with the address of your RPC node
PROVIDER - populate value with your provider address
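A hedged sketch of such a query is shown below. It totals the uakt withdrawn across your provider’s leases with jq; the exact JSON field names and pagination flags can vary by CLI version, so treat this as illustrative only.

```shell
# Sum AKT withdrawn across this provider's leases; assumes AKASH_NODE
# and PROVIDER are exported as described above. Field names are those
# observed in recent CLI versions and may differ on yours.
akash query market lease list \
  --node "$AKASH_NODE" --provider "$PROVIDER" \
  --limit 1000 -o json \
  | jq -r '[.leases[].escrow_payment.withdrawn.amount | tonumber]
           | add / 1000000'
```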
Example Command Use
Example Output
Column Headers
Output generated from Example Command Use
AKT Total Earned by Provider
Use the commands detailed in this section to gather the total earnings of your provider
Command Template
Issue the commands in this section from any machine that has the Akash CLI installed.
Note - to ensure queries are not limited to leases created by your own account, issue unset AKASH_FROM prior to the akash query market command execution
Example Command Use
Example Output
AKT Total Earning Potential Per Active Deployment
Legend for Command Syntax
The equations used to calculate earning potential rely on several figures that are not static.
For an accurate earning potential based on current financial realities, consider whether the following numbers should be updated prior to command execution.
30.436875 used as the average number of days in a month
Command Syntax
Issue the commands in this section from any machine that has the Akash CLI installed.
Note - to ensure queries are not limited to leases created by your own account, issue unset AKASH_FROM prior to the akash query market command execution
Example Command Use
Example Output
Current Leases: Withdrawn vs Consumed
Use the commands detailed in this section to compare the amount of AKT consumed versus the amount of AKT withdrawn per deployment. This review will ensure that withdrawal of consumed funds is occurring as expected.
Command Syntax
Only the following variables need update in the template for your use:
AKASH_NODE - populate value with the address of your RPC node
PROVIDER - populate value with your provider address
Example Command Use
Example Output
Dangling Deployments
As part of routine Akash Provider maintenance it is a good idea to ensure that there are no “dangling deployments” in your provider’s Kubernetes cluster.
We define a “dangling deployment” as a scenario in which the lease for a deployment was closed but, due to a communication issue, the associated deployment in Kubernetes was not. The reverse can also occur, where the deployment sits active on the chain but not on the provider. This should be a rare circumstance, but it is worth cleansing the provider of any such “dangling deployments” from time to time.
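One hedged way to spot dangling deployments is to compare the active lease DSEQs on chain with the lease namespaces present in the cluster. The JSON paths and the `akash.network/lease.id.dseq` namespace label below follow a standard provider install and are assumptions; adjust for your CLI and chart versions.

```shell
# DSEQs of leases the chain believes are active on this provider
akash query market lease list --node "$AKASH_NODE" \
  --provider "$PROVIDER" --state active -o json \
  | jq -r '.leases[].lease.lease_id.dseq' | sort -n > chain-dseqs.txt

# DSEQs of lease namespaces actually present in Kubernetes
kubectl get ns -o json \
  | jq -r '.items[].metadata.labels["akash.network/lease.id.dseq"] // empty' \
  | sort -n > cluster-dseqs.txt

# Anything present in only one list is a candidate dangling deployment
diff chain-dseqs.txt cluster-dseqs.txt
```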
Heal Broken Deployment Replicas by Restoring the Lost command to Manifests
Prior to provider version 0.2.1 (akash/provider Helm chart version 4.2.0), there was an issue affecting some deployments.
Issue
Deployments with the command explicitly set in their SDL manifest files were losing it upon akash-provider pod/service restart.
This left their replica pods running in a CrashLoopBackOff state on the provider side, reserving additional resources, while the original replica kept running, so the issue was not visible to the client.
Impact
Double the amount of resources is occupied by the deployment on the provider side
Manifests of these deployments are missing the command
The good news is that both issues can be fixed without customer intervention.
Once you have updated your provider to version 0.2.1 or greater following the instructions, you can patch the manifests with the correct command, which will get rid of the replicas left in the CrashLoopBackOff state.
STEP 1 - Backup manifests
Before patching the manifests, please make sure to back them up.
The backups can help in troubleshooting should any issues arise later.
STEP 2 - Collect the deployments which are affected by the lost command issue
Example Output:
revision, namespace, pod, command
The pods with the null commands are the bad replicas in this case, affected by the lost command issue.
You might see some pods with null commands for replicas stuck in the Pending state because of insufficient resources on the provider; just ignore those.
They will start again once the provider regains enough capacity.
STEP 3 - Patch the manifests
Example Output:
STEP 4 - Bounce the provider pod/service
That’s all. The bad replicas will disappear on their own.
Example with one namespace:
Before:
After:
Persistent Storage Deployments
Persistent storage enabled deployments are of the StatefulSet kind.
These do not have replicas and thus no CrashLoopBackOff containers.
There is no impact, so you can skip them.
However, if you still want to fix their manifests, apply the following procedure.
STEP 1 - Verify the statefulset deployments
Here you can ignore the “null” ones; they are normal deployments that simply do not use the command in their SDL manifest files.
Example Output:
STEP 2 - Patch the manifest
That’s all. There is no need to bounce the akash-provider pod/service for StatefulSet deployments.
Maintaining and Rotating Kubernetes/etcd Certificates: A How-To Guide
When K8s certs expire, you won’t be able to use your cluster. Make sure to rotate your certs proactively.
The following procedure explains how to rotate them manually.
Evidence that the certs have expired:
You can always view the certs expiration using the kubeadm certs check-expiration command:
Rotate K8s Certs
Backup etcd DB
It is crucial to back up your etcd DB as it contains your K8s cluster state. Make sure to back up your etcd DB before rotating the certs!
Take the etcd DB Backup
Replace the etcd key & cert paths with the locations found in the prior steps
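A sketch of the snapshot command is shown below. The endpoint and cert/key paths are typical for a kubeadm-managed etcd; substitute the locations found in the prior steps (kubespray installs, for example, keep them under /etc/ssl/etcd/ssl/).

```shell
# Snapshot the etcd DB before touching any certs; paths shown are
# kubeadm defaults - substitute your own cert/key locations
ETCDCTL_API=3 etcdctl snapshot save /root/etcd-backup-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```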
You can additionally backup the current certs:
Renew the Certs
IMPORTANT: For an HA Kubernetes cluster with multiple control plane nodes, the kubeadm certs renew command (followed by restarting the kube-apiserver, kube-scheduler, and kube-controller-manager pods and etcd.service) needs to be executed on all control plane nodes, one node at a time, starting with the primary control plane node. This approach ensures that the cluster remains operational throughout the certificate renewal process and that there is always at least one control plane node available to handle API requests. To find out whether you have an HA K8s cluster (multiple control plane nodes), use this command: kubectl get nodes -l node-role.kubernetes.io/control-plane
Now that you have the etcd DB backup, you can rotate the K8s certs using the following commands:
Rotate the k8s Certs
Update your kubeconfig
Bounce the following services in this order
Verify the Certs Status
Repeat the process for all control plane nodes, one at a time, if you have a HA Kubernetes cluster.
Force New ReplicaSet Workaround
The steps outlined in this guide provide a workaround for a known issue which occurs when a deployment update is attempted and fails due to the provider being out of resources. This happens because K8s won’t destroy an old pod instance until it ensures the new one has been created.
Create the crontab job /etc/cron.d/akash-force-new-replicasets to run the workaround every 5 minutes.
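An illustrative /etc/cron.d/akash-force-new-replicasets entry is shown below; the script path is an assumption, so point it at wherever you placed the workaround script.

```shell
# /etc/cron.d/akash-force-new-replicasets - run the workaround every
# 5 minutes (script path is illustrative)
*/5 * * * * root /usr/local/bin/akash-force-new-replicasets.sh >/dev/null 2>&1
```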
Kill Zombie Processes
Issue
It is possible for certain deployments to initiate subprocesses that do not properly implement the wait() function.
This improper handling can result in the formation of <defunct> processes, also known as “zombie” processes.
Zombie processes occur when a subprocess completes its task but still remains in the system’s process table due to the parent process not reading its exit status.
Over time, if not managed correctly, these zombie processes have the potential to accumulate and occupy all available process slots in the system, leading to resource exhaustion.
These zombie processes aren’t particularly harmful (they occupy no CPU/memory, nor do they count against cgroup CPU/memory limits) unless they take up the whole process table, at which point no new processes can spawn, i.e. the limit:
To address this issue, tenants should ensure they manage and terminate child processes appropriately to prevent them from becoming zombie processes.
One of the correct ways to approach that would be this example:
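A minimal sketch of such correct child handling is shown below; `sleep 2` stands in for a real workload.

```shell
#!/bin/sh
# The parent waits on the child's PID, reading its exit status so the
# kernel can drop the entry from the process table - no <defunct>
# (zombie) entry is left behind.
sleep 2 &              # placeholder for the real workload
child=$!
# ... the parent can do other work here ...
wait "$child"          # reap the child; its exit status becomes $?
echo "child exited with status $?"
```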
Or using a proper container init (tini) / supervision system (such as s6, supervisor, runsv, …) that would reap adopted child processes.
In this example, someone is running a misconfigured image with service ssh start in it, which fails to start, creating a bunch of <defunct> zombie sshd processes that grows every 20 seconds:
Steps to implement a workaround for the providers
Providers can’t control this, so it is recommended they implement the following workaround across all worker nodes.
This way the workaround will automatically run every 5 minutes.
Close Leases Based on Image
Below is a suboptimal way of terminating leases that use selected (unwanted) images, until Akash natively supports this.
It is suboptimal because once the deployment is closed, the provider has to be restarted to recover from the account sequence mismatch error. Providers already do this automatically through the K8s liveness probe set on the akash-provider deployment.
The other core problem is that the image is unknown until the client transfers the SDL to the provider (tx send-manifest), which can only happen after the provider bids and the client accepts the bid.
Follow the steps associated with your Akash Provider install method:
Create script file - /usr/local/bin/akash-kill-lease.sh - and populate with the following content:
Make the Script Executable
Create Cron Job
Create the Cron Job file - /etc/cron.d/akash-kill-lease - with the following content:
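An illustrative form of that cron file is shown below, assuming the script from the previous step lives at the path shown.

```shell
# /etc/cron.d/akash-kill-lease - run the kill-lease script every
# 5 minutes (path is illustrative)
*/5 * * * * root /usr/local/bin/akash-kill-lease.sh >/dev/null 2>&1
```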
Provider Bid Script Migration - GPU Models
A new bid script for Akash Providers has been released that now includes the ability to specify pricing of multiple GPU models.
This document details the recommended procedure for Akash providers needing migration to the new bid script from prior versions.
New Features of Bid Script Release
Support for parameterized price targets (configurable through the Akash/Provider Helm chart values), eliminating the need to manually update your bid price script
Pricing based on GPU model, allowing you to specify different prices for various GPU models
How to Migrate from Prior Bid Script Releases
STEP 1 - Backup your current bid price script
This command will produce an old-bid-price-script.sh file which is your currently active bid price script with your custom modifications
This command will backup your akash/provider config in the provider.yaml file (excluding the old bid price script)
STEP 4 - Update provider.yaml File Accordingly
Update your provider.yaml file with the price targets you want. If you don’t specify these keys, the bid price script will default to the values shown below
In the example provided, price_target_gpu_mappings sets GPU pricing in the following way:
a100 nvidia models will be charged 120 USD/GPU unit a month
t4 nvidia models will be charged 80 USD/GPU unit a month
Unspecified nvidia models will be charged 130 USD/GPU unit a month (if * is not explicitly set in the mapping, it defaults to 100 USD/GPU unit a month)
If necessary, extend the mapping with more models your provider offers, using the syntax <model>=<USD/GPU unit a month>
If your GPU model has different possible RAM specs - use this type of convention: a100.40Gi=900,a100.80Gi=1000
STEP 5 - Download New Bid Price Script
STEP 6 - Upgrade Akash/Provider Chart to Version 6.0.5
Expected/Example Output
STEP 7 - Upgrade akash-provider Deployment with New Bid Script
Verification of Bid Script Update
Expected/Example Output
GPU Provider Troubleshooting
Should your Akash Provider encounter issues during the installation process or in post install hosting of GPU resources, follow the troubleshooting steps in this guide to isolate the issue.
NOTE - these steps should be conducted on each Akash Provider/Kubernetes worker node that hosts GPU resources, unless stated otherwise within the step
Conduct the steps in this section for basic verification and to ensure the host has access to GPU resources
Prep/Package Installs
Confirm GPU Resources Available on Host
NOTE - example verification steps were conducted on a host with a single NVIDIA T4 GPU resource. Your output will be different based on the type and number of GPU resources on the host.
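This check uses the standard NVIDIA driver utility:

```shell
# Confirm the driver can see the GPU(s) on this host; lists each GPU
# with its model, driver version, and utilization
nvidia-smi
```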
Example/Expected Output
Confirm CUDA Install & Version
Example/Expected Output
Confirm CUDA GPU Support is Available for Hosted GPU Resources
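A quick check that PyTorch can reach the GPU through CUDA; this assumes python3 and the torch package are installed on the host.

```shell
# Returns True on a healthy host where CUDA GPU support is available
python3 -c "import torch; print(torch.cuda.is_available())"
```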
Example/Expected Output
Examine Linux Kernel Logs for GPU Resource Errors and Mismatches
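One way to run this check (NVRM is the NVIDIA kernel module’s log prefix):

```shell
# Scan kernel logs for NVIDIA driver load messages, errors, and
# version-mismatch complaints
sudo dmesg | grep -iE 'nvidia|nvrm'
```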
Example/Expected Output
NOTE - example output is from a healthy host which loaded NVIDIA drivers successfully and has no version mismatches. Your output may look very different if there are issues within the host.
Ensure Correct Version/Presence of NVIDIA Device Plugin
NOTE - conduct this verification step on the Kubernetes control plane node on which Helm was installed during your Akash Provider build
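A sketch of this verification is below; the `nvidia-device-plugin` namespace follows the standard Akash provider build and may differ on your cluster.

```shell
# Check the device plugin's Helm release version and its pod status
helm -n nvidia-device-plugin list
kubectl -n nvidia-device-plugin get pods -o wide
```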
Example/Expected Output
CUDA Drivers Fabric Manager
In some circumstances it has been found that the CUDA Drivers Fabric Manager needs to be installed on worker nodes hosting GPU resources (e.g. non-PCIe GPU configurations such as those using SXM form factors)
If the output of the torch.cuda.is_available() command - covered in a prior section of this doc - is an error condition, consider installing the CUDA Drivers Fabric Manager to resolve the issue.
Frequently encountered error message when this issue exists:
torch.cuda.is_available() function: Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
Further details on the CUDA Drivers Fabric Manager are available here
NOTE - replace 525 in the following command with the NVIDIA driver version used on your host
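A sketch of the install on an Ubuntu/Debian host follows; replace 525 with your host’s NVIDIA driver version as noted above.

```shell
# Install the fabric manager matching the host's driver version
# (replace 525 accordingly), then enable and start its service
sudo apt-get update
sudo apt-get install -y nvidia-fabricmanager-525
sudo systemctl enable --now nvidia-fabricmanager
```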