This guide covers monitoring and troubleshooting tools for Akash providers, including log analysis, status checks, and GPU troubleshooting.
Provider Logs
Provider logs show order receipt, bidding activity, manifest processing, and errors. Use logs to verify provider health and troubleshoot issues.
View Provider Logs
```
kubectl logs <pod-name> -n akash-services
```

Stream Live Logs
View the last 50 log entries and follow new logs in real-time:
```
kubectl logs akash-provider-0 -n akash-services --tail=50 -f
```

Filter Logs
Exclude routine health checks and focus on important events:
```
kubectl -n akash-services logs -l app=akash-provider --tail=100 -f | \
  grep -Ev "running check|check result|cluster resources|service available replicas below target"
```

Example Log Output
Here’s what a successful order processing sequence looks like:
```
I[2025-04-29|20:35:25.057] order detected module=bidengine-service order=order/akash1d2xltxu5vmsxza8gu2j9vudcpacuntn6r72aee/21299713/1/1
I[2025-04-29|20:35:25.058] group fetched module=bidengine-order order=akash1d2xltxu5vmsxza8gu2j9vudcpacuntn6r72aee/21299713/1/1
I[2025-04-29|20:35:25.059] Reservation fulfilled module=bidengine-order order=akash1d2xltxu5vmsxza8gu2j9vudcpacuntn6r72aee/21299713/1/1
D[2025-04-29|20:35:25.060] submitting fulfillment module=bidengine-order order=akash1d2xltxu5vmsxza8gu2j9vudcpacuntn6r72aee/21299713/1/1 price=21.000000000000000000uakt
I[2025-04-29|20:35:25.061] broadcast response response="code: 0" txhash=AF7E9AB65B0200B0B8B4D9934C019F8E07FAFB5C396E82DA582F719A1FA15C14
I[2025-04-29|20:35:25.061] bid complete module=bidengine-order order=akash1d2xltxu5vmsxza8gu2j9vudcpacuntn6r72aee/21299713/1/1
```

Declined Bids
When the provider declines to bid due to incompatible attributes:
```
D[2025-04-29|20:35:25.058] unable to fulfill: incompatible attributes wanted="gpu:vendor/nvidia/model/rtx-3060-ti" have="gpu:vendor/nvidia/model/rtx4090"
D[2025-04-29|20:35:25.059] declined to bid module=bidengine-order
```

This is normal behavior when deployment requirements don't match your provider's capabilities.
Provider Status
Get comprehensive provider status including active leases, resource utilization, and per-node capacity.
Command Template
```
provider-services status <provider-address>
```

Example
```
provider-services status akash1wxr49evm8hddnx9ujsdtd86gk46s7ejnccqfmy
```

Example Output
{ "cluster": { "leases": 3, "inventory": { "active": [ { "cpu": 8000, "memory": 8589934592, "storage_ephemeral": 5384815247360 }, { "cpu": 100000, "memory": 450971566080, "storage_ephemeral": 982473768960 } ], "available": { "nodes": [ { "cpu": 111495, "memory": 466163988480, "storage_ephemeral": 2375935850345 }, { "cpu": 118780, "memory": 474497601536, "storage_ephemeral": 7760751097705 } ] } } }, "bidengine": { "orders": 0 }, "manifest": { "deployments": 0 }, "cluster_public_hostname": "provider.example.com"}Status Fields Explained
- `leases`: Number of active leases
- `inventory.active`: Resources currently consumed by deployments
- `inventory.available.nodes`: Available resources per node
- `bidengine.orders`: Orders currently being processed
- `manifest.deployments`: Manifests waiting to be deployed
- `cluster_public_hostname`: Your provider's public hostname
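When scripting health checks, specific fields can be pulled out of this JSON with jq. This is a minimal sketch that assumes jq is installed and uses the example provider address from above:

```
# Count active leases
provider-services status akash1wxr49evm8hddnx9ujsdtd86gk46s7ejnccqfmy | jq '.cluster.leases'

# Show available CPU, memory, and ephemeral storage per node
provider-services status akash1wxr49evm8hddnx9ujsdtd86gk46s7ejnccqfmy | \
  jq '.cluster.inventory.available.nodes[] | {cpu, memory, storage_ephemeral}'
```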
GPU Provider Troubleshooting
When GPU deployments fail or GPUs aren’t detected, use these steps to diagnose issues.
Scope: Conduct these checks on each Kubernetes worker node hosting GPU resources unless specified otherwise.
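To avoid repeating each check by hand, the per-node commands can be scripted. This is a minimal sketch that assumes SSH access to the workers and uses hypothetical hostnames (gpu-node-1, gpu-node-2):

```
# Hypothetical hostnames; replace with your GPU worker nodes
for node in gpu-node-1 gpu-node-2; do
  echo "=== ${node} ==="
  ssh "${node}" "nvidia-smi -L"
done
```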
Basic GPU Verifications
Install Testing Tools
```
apt update && apt -y install python3-venv
python3 -m venv /venv
source /venv/bin/activate
pip install torch numpy
```

Verify GPU Detection
```
nvidia-smi -L
```

Expected Output:
```
GPU 0: Tesla T4 (UUID: GPU-faa48437-7587-4bc1-c772-8bd099dba462)
```

If no GPUs are listed, check the following (diagnostic commands follow the list):
- Driver installation
- Hardware connection
- BIOS/UEFI settings
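The first two items can usually be narrowed down with standard Linux tooling (not Akash-specific):

```
# Is the GPU visible on the PCI bus?
lspci | grep -i nvidia

# Is the NVIDIA kernel module loaded, and at what version?
lsmod | grep -i nvidia
cat /proc/driver/nvidia/version
```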
Check CUDA Version
python3 -c "import torch;print(torch.version.cuda)"Expected Output: 12.7 (or your installed CUDA version)
Verify CUDA GPU Support
python3 -c "import torch; print(torch.cuda.is_available())"Expected Output: True
If `False`, see the NVIDIA Fabric Manager section below.
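The checks above can be combined into a single diagnostic run. This is a minimal sketch that assumes the virtual environment created earlier is still active:

```
python3 - <<'EOF'
# Minimal PyTorch GPU diagnostic: build CUDA version, availability, and device names
import torch

print("torch version:  ", torch.__version__)
print("CUDA (build):   ", torch.version.cuda)
print("CUDA available: ", torch.cuda.is_available())
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"device {i}: {torch.cuda.get_device_name(i)}")
EOF
```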
Check Kernel Logs for GPU Errors
Examine kernel logs for driver issues, version mismatches, or hardware errors:
```
dmesg -T | grep -Ei 'nvidia|nvml|cuda|mismatch'
```

Healthy Output Example
```
[Thu Sep 28 19:29:02 2023] nvidia: loading out-of-tree module taints kernel.
[Thu Sep 28 19:29:02 2023] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 580.13.04
[Thu Sep 28 19:29:02 2023] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 580.13.04
[Thu Sep 28 19:29:04 2023] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:04.0 on minor 0
[Thu Sep 28 19:29:05 2023] nvidia-uvm: Loaded the UVM driver, major device number 235.
```

Problem Indicators
- `mismatch` errors → Driver/kernel version conflict (a version comparison follows this list)
- `failed to initialize` → Hardware or configuration issue
- Missing `nvidia-uvm` → Incomplete driver installation
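A driver/kernel mismatch can be confirmed by comparing the loaded kernel module version with the version reported by the userspace tools; the two commands below should agree:

```
# Version of the loaded NVIDIA kernel module
cat /proc/driver/nvidia/version

# Version reported by the userspace driver/NVML
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```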
Verify NVIDIA Device Plugin
The NVIDIA Device Plugin DaemonSet must be running on all GPU nodes for Kubernetes to schedule GPU workloads.
Check Plugin Status
```
kubectl get pods -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin
```

Expected Output
```
NAME                                    READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-daemonset-abc123   1/1     Running   0          2d3h
nvidia-device-plugin-daemonset-def456   1/1     Running   0          2d3h
```

Indicators:
- `READY` should be `1/1`
- `STATUS` should be `Running`
- One pod per GPU-enabled node (a node-level capacity check follows this list)
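As a complementary check, confirm that each GPU node actually advertises `nvidia.com/gpu` resources to the Kubernetes scheduler; nodes where the plugin isn't working report `<none>`:

```
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```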
Troubleshooting Plugin Issues
If pods are not running:
```
# Check pod events
kubectl describe pod -n nvidia-device-plugin <pod-name>

# Check DaemonSet status
kubectl get daemonset -n nvidia-device-plugin

# View plugin logs
kubectl logs -n nvidia-device-plugin <pod-name>
```
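If the root cause was on the node itself (for example a driver reinstall), restarting the DaemonSet is often enough to bring the plugin pods back; substitute the DaemonSet name reported by `kubectl get daemonset` above:

```
kubectl rollout restart daemonset <daemonset-name> -n nvidia-device-plugin
```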
NVIDIA Fabric Manager (SXM GPUs)

When Needed: NVIDIA Fabric Manager is required for non-PCIe GPU configurations such as:
- SXM form factor GPUs
- NVLink-connected GPUs
- Multi-GPU setups with direct GPU-to-GPU communication (see the topology check below)
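If you're unsure whether a system falls into one of these categories, the GPU interconnect topology can be inspected directly; NVLink connections appear as NV1, NV2, ... entries in the matrix:

```
nvidia-smi topo -m
```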
Common Error
If calling `torch.cuda.is_available()` produces:
```
Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:109.)
```

Install Fabric Manager to resolve this issue.
Install Fabric Manager
Replace `580` with your NVIDIA driver branch (major version):
```
apt-get install nvidia-fabricmanager-580
systemctl start nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
```

Initialization Time: Wait 2-3 minutes after starting the service before testing GPU availability.
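One way to script the wait-and-retest step (assuming the virtual environment from earlier is still active):

```
# Give Fabric Manager time to initialize the fabric, then retest
sleep 180
python3 -c "import torch; print(torch.cuda.is_available())"
```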
Verify Fabric Manager Status
```
systemctl status nvidia-fabricmanager
```

Expected Output:
```
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Daemon
     Loaded: loaded
     Active: active (running) since ...
```

Fabric Manager Version Mismatch
Ubuntu repositories may not provide a Fabric Manager version matching your driver, causing startup failures.
Symptom
```
systemctl status nvidia-fabricmanager
```

Shows:

```
nv-fabricmanager[104230]: fabric manager NVIDIA GPU driver interface version 580.127.05 doesn't match with driver version 580.120
systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
```
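To confirm that the failure is a package-level mismatch rather than a stale service, compare the installed driver and Fabric Manager package versions directly; they must come from the same driver release:

```
dpkg -l | grep -E 'nvidia-driver|nvidia-fabricmanager'
```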
Fix: Use Official NVIDIA Repository

Prerequisites:
- Ubuntu 24.04 LTS
- Root access
Add the official NVIDIA CUDA repository:
```
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/3bf863cc.pub
apt-key add 3bf863cc.pub

echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/ /" > /etc/apt/sources.list.d/nvidia-official-repo.list

apt update
apt dist-upgrade
apt autoremove
```

Tip: Running `apt dist-upgrade` with the official NVIDIA repo upgrades all NVIDIA packages (driver, Fabric Manager, CUDA tools) in sync, preventing version mismatches.
Verify Package Versions
```
dpkg -l | grep nvidia
```

Remove any unexpected versions and reboot the node to apply changes.
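For example, if an older Fabric Manager package is still installed alongside the newer driver, purge it before rebooting (the package version below is hypothetical; substitute whatever dpkg reports as out of place):

```
# Hypothetical stray package; adjust to what dpkg reports
apt-get purge 'nvidia-fabricmanager-575*'
apt-get autoremove
reboot
```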