Introduction
DeTrain is a framework for distributed training and model parallelism (MP). DeTrain includes tools and lightweight PyTorch-based libraries that simplify distributed training pipelines.
Demo information
Repositories
DeTrain’s different components will later be maintained in separate repositories.
Below are screenshots of a walkthrough of the app:
- Home page (My Pipelines)
- New pipeline - General settings
- New pipeline - PP training - Nodes settings
- New pipeline - TP training - Nodes settings
- New pipeline - Training script settings
- New pipeline - Review & start training
- DeTrain - Code base
Architecture
The DeTrain project comprises four components:
- DeTrain Python Lightweight Library: This library, currently in the early development phase, is built on the PyTorch Distributed and PyTorch RPC packages and reduces the complexity of writing distributed AI model training code. DeTrain primarily focuses on MP training of two kinds: Tensor parallelism (TP) and Pipeline parallelism (PP). A minimal sketch of the underlying PyTorch pattern follows this list.
  - To facilitate communication between training nodes in PP training, DeTrain uses the remote methods of the RPC package.
  - For distributed optimization, DeTrain employs distributed autograd and the loss_parallel context.
  - To integrate with Data Parallelism (DP), DeTrain utilizes a 2D DeviceMesh.
- Agent: This component consists of two parts:
  - A Docker image including Python with support for CUDA 11.8, FastAPI, Uvicorn, Gunicorn, and the DeTrain libraries.
  - A simple program that helps nodes communicate with each other and with the DeTrain console.
- DeTrain Console: This web application helps developers design custom pipelines for training AI models tailored to user-defined infrastructure.
- Sample SDL Templates & Example Code for Distributed Training Jobs: This component provides examples that show developers how to define infrastructure for PP & TP training.
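To make the library component more concrete, here is a minimal sketch of the PyTorch pattern it builds on for PP training: rpc.remote places pipeline stages on workers, distributed autograd runs the backward pass across nodes, and a DistributedOptimizer updates remote parameters. The two-stage split, worker names, and tensor shapes are illustrative assumptions, not DeTrain's actual API.

```python
# Minimal sketch of the PyTorch primitives behind PP training: RPC for
# cross-node communication, distributed autograd, and DistributedOptimizer.
# The stage split, worker names, and tensor shapes are illustrative only.
import torch
import torch.nn as nn
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
from torch.distributed.optim import DistributedOptimizer


class Stage(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.layer = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.relu(self.layer(x))

    def parameter_rrefs(self):
        # RRefs let the DistributedOptimizer reach parameters on this worker.
        return [rpc.RRef(p) for p in self.parameters()]


def run_master():
    # Assumes rpc.init_rpc(...) has already been called on every node, e.g.
    # rpc.init_rpc("master", rank=0, world_size=3) on the master node.
    stage1 = rpc.remote("worker1", Stage, args=(32, 64))  # stage lives on worker1
    stage2 = rpc.remote("worker2", Stage, args=(64, 10))  # stage lives on worker2
    params = stage1.rpc_sync().parameter_rrefs() + stage2.rpc_sync().parameter_rrefs()
    opt = DistributedOptimizer(torch.optim.SGD, params, lr=0.001)

    x, y = torch.randn(40, 32), torch.randint(0, 10, (40,))
    with dist_autograd.context() as ctx:
        # Forward: activations flow stage1 -> stage2 through RPC calls.
        h = stage1.rpc_sync().forward(x)
        out = stage2.rpc_sync().forward(h)
        loss = nn.functional.cross_entropy(out, y)
        # Distributed autograd runs backward on every worker that took part.
        dist_autograd.backward(ctx, [loss])
        opt.step(ctx)
```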
Examples for training are currently included in model_parallelism/detrain/examples for easy reference. They will later be maintained in the DeTrain Examples repository.
Technology Stack
DeTrain console
- NextJS and Ant Design for the frontend.
- AkashJS for interacting with the Akash Network.
- Cosmos Kit for blockchain integration.
- Mongoose for database management.
Agent software
- FastAPI and its StreamingResponse class for the backend.
- Uvicorn and Gunicorn as the ASGI server.
- Docker for containerization.
DeTrain Python library
- PyTorch RPC and Distributed packages for distributed training functionalities.
Infrastructure
- The Akash Network marketplace.
Installation
To run the DeTrain console frontend application:
- `cd frontend`
- `npm i`
- `npm run dev` to run it in dev mode, or `npm run build && npm run start` for production mode.
To deploy nodes on Akash:
- Go to the Akash console.
- Select “upload SDL”.
- Go to the “SDL template” folder and select one of the templates.
- Change the settings and start the deployment.
To customize the DeTrain Docker image for agents:
- Go to the agent folder.
- Modify the Dockerfile if you want to add new commands or change the base image.
- To add a new function, go to main.py and add your code block (see the sketch after this list).
- Build a new image.
- Push it to your Docker repository.
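As a rough illustration of what such a code block might look like, the snippet below adds a hypothetical endpoint that streams log lines using FastAPI's StreamingResponse. The route path, log file, and behavior are assumptions for illustration only, not part of the real agent.

```python
# Hypothetical addition to the agent's main.py: an endpoint that streams
# training log lines back to the DeTrain console. The path and log location
# are assumptions for illustration only.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()  # in the real main.py, reuse the existing app instance


@app.get("/logs")
def stream_logs():
    def line_generator():
        # Replace with the agent's real log source.
        with open("training.log") as f:
            for line in f:
                yield line

    return StreamingResponse(line_generator(), media_type="text/plain")
```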
Run examples manually
Ensure that the DeTrain library is installed on your machines.
To test the DeTrain Python library:
- Go to the folder model_parallelism/examples.
- Select the kind of training you want to see:
  - PPL: Pipeline parallelism
  - TP: Tensor parallelism
  - FSDP + TP: Fully sharded data parallelism + Tensor parallelism (see the sketch after this list)
- All examples contain logging and evaluation steps; these can make the training process longer. The files are for testing and instruction only.
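The FSDP + TP example combines the pieces described in the Architecture section. The sketch below shows the general PyTorch pattern it relies on: a 2D DeviceMesh with a data-parallel and a tensor-parallel dimension, parallelize_module for TP, an FSDP wrapper over the data-parallel dimension, and the loss_parallel context. The mesh shape, model, and names are assumptions rather than the exact example code.

```python
# Sketch of the 2D (FSDP + TP) pattern; mesh shape, model, and names are
# illustrative assumptions. Launch with e.g. torchrun --nproc_per_node=4.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    loss_parallel,
    parallelize_module,
)


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(32, 64)
        self.down = nn.Linear(64, 32)
        self.head = nn.Linear(32, 10)

    def forward(self, x):
        return self.head(self.down(torch.relu(self.up(x))))


# 2D mesh: outer dimension for data parallelism, inner for tensor parallelism.
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

model = MLP().cuda()
# Tensor parallelism over the "tp" sub-mesh; the head keeps a DTensor output
# sharded on the class dimension so the loss can run under loss_parallel().
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {
        "up": ColwiseParallel(),
        "down": RowwiseParallel(),
        "head": ColwiseParallel(use_local_output=False),
    },
)
# Fully sharded data parallelism over the "dp" sub-mesh.
model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
x = torch.randn(50, 32).cuda()
y = torch.randint(0, 10, (50,)).cuda()

pred = model(x)
with loss_parallel():
    loss = nn.functional.cross_entropy(pred, y)
    loss.backward()
optimizer.step()
```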
Torchrun commands:
- For PPL training:
```bash
torchrun --nnodes=3 --nproc_per_node=1 --node_rank=0 --master_addr=localhost --master_port=9999 main.py --gpu="0_0_0" --epochs=2 --batch_size=40 --lr=0.001 --model_name="ppl_04"
```
--node_rank is the rank of each node joining the training process. If you have one master node and two GPU worker nodes, run the command on each node with --node_rank set to 0, 1, and 2 respectively. --master_addr is the address of the master node, which handles tensor offloading.
- For TP and FSDP + TP training:
```bash
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=2 --rdzv_id=101 --rdzv-backend=c10d --rdzv_endpoint="localhost:9999" main.py --gpu="0_0_0" --epochs=4 --batch_size=50 --lr=0.001 --model_name="ppl_04"
```
--rdzv_endpoint is the rendezvous backend endpoint. If you have N nodes, use localhost on the master node and the master node's internal or public IP on the remaining nodes. The sketch below shows how a launched process picks up these settings.
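For reference, torchrun supplies each process with environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE, and a training entry point typically initializes the process group from them. The snippet below is a generic sketch of that pattern, not the exact contents of the examples' main.py.

```python
# Generic sketch of how a torchrun-launched entry point reads the rendezvous
# settings; the examples' main.py may differ in the details.
import os

import torch
import torch.distributed as dist


def init_distributed():
    rank = int(os.environ["RANK"])              # global rank assigned by torchrun
    world_size = int(os.environ["WORLD_SIZE"])  # nnodes * nproc_per_node
    local_rank = int(os.environ["LOCAL_RANK"])  # index of the process on this node
    torch.cuda.set_device(local_rank)
    # init_process_group reads MASTER_ADDR/MASTER_PORT (or the rendezvous
    # endpoint) from the environment that torchrun already configured.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```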
Run examples using the DeTrain console
With this tool you don’t need to set up an environment, SSH into remote nodes, or run commands manually. Follow these steps:
- Deploy new nodes on Akash.
- Use the DeTrain console to create new pipelines.