Inventory Operator

The Inventory Operator continuously discovers and monitors hardware resources across your Kubernetes cluster, making them available for provider bidding decisions.

Purpose

The Inventory Operator:

Discovers cluster resources (CPU, GPU, memory, storage)
Monitors resource availability in real-time
Tracks resource utilization and allocation
Publishes inventory data to the provider service
Updates Kubernetes node labels with capabilities

Architecture

+---------------------------------------------+
|       Inventory Operator                    |
|                                             |
|  +--------------------------------------+   |
|  |    Cluster Nodes Manager             |   |
|  |  - Watch Kubernetes nodes            |   |
|  |  - Deploy discovery pods             |   |
|  |  - Collect hardware info             |   |
|  +--------------+-----------------------+   |
|                 |                           |
|                 v                           |
|  +--------------------------------------+   |
|  |    Node Discovery Pods               |   |
|  |  - Run on each node                  |   |
|  |  - Detect CPUs, GPUs, storage        |   |
|  |  - Read hardware capabilities        |   |
|  +--------------+-----------------------+   |
|                 |                           |
|                 v                           |
|  +--------------------------------------+   |
|  |    Storage Queriers                  |   |
|  |  - Ceph integration                  |   |
|  |  - Rancher Longhorn integration      |   |
|  |  - Storage class detection           |   |
|  +--------------+-----------------------+   |
|                 |                           |
|                 v                           |
|  +--------------------------------------+   |
|  |    Cluster State Aggregator          |   |
|  |  - Combine node + storage data       |   |
|  |  - Publish to event bus              |   |
|  |  - Respond to queries                |   |
|  +--------------------------------------+   |
+---------------------------------------------+
                 |
                 v
        +-------------------+
        |  Provider Service |
        |  (Bid Engine)     |
        +-------------------+

Discovery Process

1. Node Discovery

When the operator starts, it:

Watches Kubernetes Nodes
```
watch.Interface for v1.Node resources
```
- Monitors node additions/removals
- Detects node capacity changes
- Tracks node status updates
Deploys Discovery Pods
- Creates a discovery pod on each node
- Runs with privileged access for hardware detection
- Uses same image as operator for consistency
Collects Hardware Information
- CPU cores and architecture
- Memory capacity
- GPUs (NVIDIA)
- Storage devices (NVMe, SSD, HDD)
- Network interfaces

2. Resource Tracking

The operator tracks:

CPU Resources

cpu:
  quantity:
    allocatable: 64000  # millicores
    allocated: 32000    # millicores currently used
  info:
    - model: "AMD EPYC 7763"
      vcores: 64

GPU Resources

gpu:
  quantity:
    allocatable: 8
    allocated: 2
  info:
    - vendor: nvidia
      name: rtx4090
      modelid: "2684"
      interface: pcie
      memory_size: 24Gi

Memory Resources

memory:
  quantity:
    allocatable: 256Gi
    allocated: 128Gi

Storage Resources

storage:
  - class: beta2    # Storage class name
    size: 5000Gi    # Total capacity
    provisioner: ceph.rook.io

3. GPU Feature Discovery

The operator integrates with the NVIDIA Device Plugin and other GPU management tools:

Detection Process:

Query PCI devices for GPUs
Read GPU vendor and product IDs
Match against provider-configs database
Extract GPU capabilities (memory, CUDA version, features)
Publish GPU info to inventory

GPU Information Collected:

Vendor ID (e.g., 10de for NVIDIA)
Product ID (e.g., 2684 for RTX 4090)
Model name (user-friendly)
Memory size
Interface type (PCIe, SXM)

4. Storage Discovery

The operator supports multiple storage backends:

Rook-Ceph Integration

func NewCeph(ctx context.Context) (QuerierStorage, error)

Discovers:

Ceph storage classes
Available storage capacity
Provisioner type
Storage performance class

Rancher Longhorn Integration

func NewRancher(ctx context.Context) (QuerierStorage, error)

Discovers:

Longhorn volumes
Storage pools
Replica counts
Available capacity

Real-Time Updates

The operator continuously monitors for changes:

Node Changes

case watch.Modified:
    if nodeAllocatableChanged(knode, obj) {
        updateNodeInfo(obj, &node)
        signalLabels()
    }

Triggers:

Node capacity changes (scale-up/down)
Node labels modified
Node conditions changed (Ready, MemoryPressure, etc.)

Pod Changes

case watch.Added, watch.Modified:
    if isPodAllocated(obj.Status) {
        addPodAllocatedResources(&node, obj)
    }
case watch.Deleted:
    subPodAllocatedResources(&node, &pod)

Tracks:

New pod deployments (subtract from available)
Pod deletions (add back to available)
Pod resource requests (CPU, memory, GPU)

Event Publishing

The operator publishes inventory updates to the event bus:

Topics

inventory.nodes - Node hardware capabilities
inventory.storage - Storage availability
inventory.cluster - Aggregated cluster state

Retained Events

bus.Pub(state, []string{topicInventoryCluster}, pubsub.WithRetain())

Events are retained so new subscribers immediately receive current state.

Kubernetes Node Labels

The operator adds labels to nodes based on discovered hardware:

Example Labels

akash.network/capabilities.gpu.vendor.nvidia: "true"
akash.network/capabilities.gpu.model.rtx4090: "true"
akash.network/capabilities.storage.class.beta2: "true"
akash.network/capabilities.storage.class.beta3: "true"

Purpose:

Enable node selector constraints for scheduling
Support GPU model matching in deployments
Allow storage class requirements

Integration with Bid Engine

The Bid Engine queries the inventory to make bidding decisions:

inventory, err := cluster.Inventory(ctx)
if err != nil {
    return err
}

// Check if resources are available
canBid := inventory.Has(requiredResources)

Flow:

Order arrives with resource requirements
Bid Engine queries inventory
Inventory Operator returns available resources
Bid Engine compares required vs. available
Bid submitted if resources sufficient

Cluster Service - Resource reservation
Bid Engine - Bidding logic
IP Operator - IP address management
Hostname Operator - Hostname management