Decentralized Scheduling

Overview: Scheduling in Geo-distributed Environments

The primary objective of Yotta is to develop parallelism strategies that optimize system throughput for AI model training and inference. By conducting a static analysis of the overhead associated with various distributed deployment strategies, such as pipeline parallelism (PP), data parallelism (DP), tensor model parallelism (TP), and sequence parallelism (SP), as well as memory-saving techniques like tensor offloading, quantization, and rematerialization, we formulate the model scheduling process as an optimization problem. The goal of this optimization is to achieve the highest aggregated throughput, measured in floating-point operations per second (FLOPS), calculated from GPU hardware performance and AI model configurations. The constraints ensure that the memory usage on each device does not exceed its memory capacity. This optimization problem can be solved effectively with Integer Linear Programming (ILP) algorithms.
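To make the formulation concrete, below is a deliberately simplified sketch of such an ILP written with the open-source PuLP library. It only assigns whole stages to sites; the candidate placements, cost estimates, and all numbers are placeholder assumptions and do not reflect Yotta's actual model, which also encodes parallelism degrees and memory-saving techniques.

```python
# Simplified ILP sketch (PuLP): place each pipeline stage on one site so that
# estimated aggregate throughput is maximized while every site stays within
# its memory capacity. All estimates below are placeholders, not real profiles.
import pulp

stages = [0, 1, 2]
sites = ["siteA", "siteB", "siteC"]

est_tflops = {(s, d): 10.0 for s in stages for d in sites}  # est. throughput if stage s runs on site d
mem_need = {(s, d): 12.0 for s in stages for d in sites}    # est. memory footprint in GB
mem_cap = {"siteA": 64.0, "siteB": 32.0, "siteC": 32.0}     # per-site GPU memory in GB

prob = pulp.LpProblem("partition_sketch", pulp.LpMaximize)
x = pulp.LpVariable.dicts("assign", (stages, sites), cat="Binary")

# Objective: maximize the sum of estimated throughput across assigned stages.
prob += pulp.lpSum(est_tflops[s, d] * x[s][d] for s in stages for d in sites)

# Each stage is placed on exactly one site.
for s in stages:
    prob += pulp.lpSum(x[s][d] for d in sites) == 1

# Memory used by the stages placed on a site must fit its capacity.
for d in sites:
    prob += pulp.lpSum(mem_need[s, d] * x[s][d] for s in stages) <= mem_cap[d]

prob.solve()
plan = {s: next(d for d in sites if x[s][d].value() > 0.5) for s in stages}
print(plan)
```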

The following figure illustrates an example of a model partition strategy generated by Yotta. In this example, we assume three geo-distributed sites are involved in the computation, connected via the Internet. The sites contain 4 Nvidia 16GB P100 GPUs, 1 Nvidia 32GB V100 GPU, and 2 Nvidia 16GB RTX6000 GPUs, respectively. Yotta first partitions the model into three stages using pipeline parallelism to minimize communication over the Internet. Each stage contains a different number of AI operators (e.g., MLP, Attention, and embedding), denoted as $N_0$, $N_1$, and $N_2$. The operators in stage 0 ($N_0$) are further distributed across four GPUs, employing both data and tensor parallelism with degrees set to 2 for each. The operators in stage 1 ($N_1$) are deployed on a single GPU, while the operators in stage 2 ($N_2$) utilize tensor parallelism with a degree of 2.
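The resulting plan can be viewed as a nested description of stages and their intra-site parallelism. The structure below is only an illustrative encoding of the example above; the field names are assumptions rather than Yotta's actual schema.

```python
# Illustrative encoding of the example partition above; field names are
# assumed for illustration and do not reflect Yotta's internal schema.
partition_plan = [
    {"stage": 0, "site": "site0", "gpus": ["P100"] * 4,    # 4x 16GB P100
     "operators": "N_0", "data_parallel": 2, "tensor_parallel": 2},
    {"stage": 1, "site": "site1", "gpus": ["V100"],         # 1x 32GB V100
     "operators": "N_1", "data_parallel": 1, "tensor_parallel": 1},
    {"stage": 2, "site": "site2", "gpus": ["RTX6000"] * 2,  # 2x 16GB RTX6000
     "operators": "N_2", "data_parallel": 1, "tensor_parallel": 2},
]
```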

In summary, Yotta determines the optimal parallelism strategies and memory-saving techniques while calculating the total number of stages ($N$) and the number of model operators ($N_i$) for each stage. These decisions are based on the geo-distributed environment (i.e., heterogeneous GPUs and heterogeneous networks) and the AI model training/inference configuration (i.e., model size and batch size). Pipeline parallelism is always employed for cross-site scheduling, since networks between sites usually have limited bandwidth and significant latency. Data parallelism, tensor model parallelism, and sequence parallelism, on the other hand, are used across the GPUs within a single site to fully utilize GPU resources. To improve the efficiency of pipeline parallelism, Yotta balances load between stages and sites by leveraging memory-saving techniques. Furthermore, considering the dynamics of GPU nodes joining and leaving the Yotta network, as well as the increased frequency of failures in a decentralized environment, Yotta introduces mechanisms to detect and handle site failures. Additionally, Yotta incorporates mechanisms to detect and handle faults caused by communication issues between sites due to attacks or data loss, ensuring the correctness of model training and inference in a decentralized environment.
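As an illustration of what site-failure detection can look like, the sketch below uses a simple heartbeat timeout. This generic mechanism, its threshold, and the function names are assumptions for illustration only, not a description of Yotta's actual protocol.

```python
# Illustrative heartbeat-based site failure detector. The text above only
# states that Yotta detects and handles site failures; this timeout-based
# approach is an assumed, generic mechanism, not Yotta's actual protocol.
import time

HEARTBEAT_TIMEOUT_S = 30.0      # assumed threshold

last_heartbeat = {}             # site id -> time of last heartbeat

def record_heartbeat(site_id):
    """Called whenever a heartbeat message arrives from a site."""
    last_heartbeat[site_id] = time.monotonic()

def failed_sites(now=None):
    """Return sites whose heartbeats have not been seen within the timeout."""
    now = time.monotonic() if now is None else now
    return [s for s, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT_S]
```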

Inter-site Scheduling with Pipeline Parallelism

Yotta OS partitions the AI model and assigns different partitions to different sites, using aggregated compute resources at the global scale to run the AI workload with pipeline parallelism. Internally, Yotta OS represents the AI model as a computation graph: a collection of coarse-grained AI operators with dependencies between them. These dependencies determine the execution order of the operators.
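A minimal sketch of such a graph representation, assuming a plain adjacency-list encoding of operator dependencies (operator names are illustrative), with a topological sort producing a valid execution order:

```python
# Minimal computation-graph sketch: coarse-grained operators and their
# dependencies, with a topological sort giving a valid execution order.
from graphlib import TopologicalSorter

# operator -> operators it depends on (illustrative names)
dependencies = {
    "embedding": [],
    "attention_0": ["embedding"],
    "mlp_0": ["attention_0"],
    "attention_1": ["mlp_0"],
    "mlp_1": ["attention_1"],
    "lm_head": ["mlp_1"],
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # each operator appears after everything it depends on
```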

Based on this graph representation, the AI model is partitioned and each partition is assigned to a site. When partitioning the model, the activation size of each operator and the interconnect bandwidth between sites are taken into account, so that the communication time between two connected sites is minimized. Given the GPU computation capability and interconnect bandwidth provided by the sites, the AI model is partitioned so that end-to-end performance (e.g., the throughput for AI training and inference) is maximized. The partitioning problem is formalized as an optimization problem.
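As a simplified illustration, the sketch below partitions a linear chain of operators into contiguous stages by brute force, keeping the split whose slowest stage (compute time plus the time to ship its boundary activation to the next site) is smallest, since the slowest stage bounds pipeline throughput. The per-operator costs, activation sizes, and bandwidths are placeholder assumptions, and the brute-force search stands in for Yotta's actual optimization formulation.

```python
# Sketch of inter-site partitioning for a linear chain of operators: try every
# way to cut the chain into num_sites contiguous stages and keep the split
# whose slowest stage is fastest. Brute force for clarity only.
from itertools import combinations

compute_s = [0.8, 1.1, 0.9, 1.3, 0.7]   # per-operator compute time (s), assumed
act_bytes = [4e8, 4e8, 4e8, 4e8, 4e8]   # activation size after each operator (bytes), assumed
bandwidth = [5e7, 5e7]                  # bytes/s between consecutive sites, assumed
num_sites = 3

def stage_time(cuts):
    bounds = [0, *cuts, len(compute_s)]
    times = []
    for i in range(num_sites):
        lo, hi = bounds[i], bounds[i + 1]
        t = sum(compute_s[lo:hi])
        if i < num_sites - 1:           # send the boundary activation to the next site
            t += act_bytes[hi - 1] / bandwidth[i]
        times.append(t)
    return max(times)                   # slowest stage bounds pipeline throughput

best_cuts = min(combinations(range(1, len(compute_s)), num_sites - 1), key=stage_time)
print(best_cuts, stage_time(best_cuts))
```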

Intra-site Scheduling with Adaptive Parallelism

Yotta introduces adaptive parallelism for the GPUs within each site. Yotta estimates the execution time of an operation on the hardware and identifies performance bottlenecks by combining a cost analysis of the different operations in AI models with hardware characteristics. Specifically, when training or inference computation runs on GPUs, tensors are loaded into GPU memory, which consumes memory bandwidth. Yotta classifies each operation's bottleneck as either flop-bound or memory-bound by comparing the computational cost divided by the hardware's peak FLOPS against the memory cost divided by the memory bandwidth. Flop-bound means the performance bottleneck is floating-point computation, while memory-bound means the bottleneck is slow memory access. The table below lists the performance bottleneck for commonly used operations. The numbers are based on an Nvidia V100 GPU with FP16 data. Arithmetic intensity is the computational cost (FLOPs) divided by the memory cost (bytes).

| Operation | Arithmetic Intensity | Bottleneck |
| --- | --- | --- |
| Multi-head attention (1024 input, 1024 output, batch size 4) | 4 FLOPs/B | memory-bound |
| Feedforward layer (1024 inputs, 4096 outputs, batch size 1) | 1 FLOPs/B | memory-bound |
| Feedforward layer (1024 inputs, 4096 outputs, batch size 512) | 315 FLOPs/B | flop-bound |
| Max pooling with 3x3 window and unit stride | 2.25 FLOPs/B | memory-bound |
| ReLU activation | 0.25 FLOPs/B | memory-bound |
| Layer normalization | < 10 FLOPs/B | memory-bound |
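In roofline terms, an operation is memory-bound when its arithmetic intensity falls below the hardware's ridge point, the ratio of peak FLOPS to memory bandwidth. A minimal sketch of this check, assuming approximate published V100 FP16 figures (~125 TFLOPS tensor-core peak, ~900 GB/s HBM2 bandwidth):

```python
# Roofline-style check: compare an operation's arithmetic intensity (FLOPs per
# byte moved) against the hardware ridge point (peak FLOPS / memory bandwidth).
# Below the ridge point the operation is memory-bound, above it flop-bound.
PEAK_FLOPS = 125e12       # Nvidia V100 FP16 tensor-core peak (approx.)
MEM_BANDWIDTH = 900e9     # V100 HBM2 bandwidth in bytes/s (approx.)

def bottleneck(flops, bytes_moved):
    intensity = flops / bytes_moved            # FLOPs per byte
    ridge = PEAK_FLOPS / MEM_BANDWIDTH         # ~139 FLOPs/B for V100 FP16
    return "flop-bound" if intensity >= ridge else "memory-bound"

print(bottleneck(flops=315.0, bytes_moved=1.0))  # feedforward, batch 512 -> flop-bound
print(bottleneck(flops=0.25, bytes_moved=1.0))   # ReLU -> memory-bound
```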

The performance analysis and bottleneck identification are used to determine appropriate parallelism strategies for the model being deployed. For flop-bound operations, adding more hardware and applying tensor model parallelism (TP) can boost computational efficiency, especially when the batch size is larger than one. For memory-bound operations, data parallelism (DP) can improve system throughput by reducing the per-device batch size. Yotta selects parallelism strategies based on the estimated execution time of operations and the communication overhead, with the goal of maximizing system throughput.
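A minimal sketch of that rule of thumb, mapping the bottleneck class to an intra-site parallelism choice; the function is illustrative and ignores the communication-overhead term that Yotta also weighs.

```python
# Illustrative mapping from bottleneck type to an intra-site parallelism choice,
# following the rule of thumb above (TP for flop-bound, DP for memory-bound).
def choose_parallelism(bound, num_gpus):
    """Pick an intra-site strategy from the bottleneck class (sketch only)."""
    if bound == "flop-bound":
        return {"tensor_parallel": num_gpus, "data_parallel": 1}
    return {"data_parallel": num_gpus, "tensor_parallel": 1}

print(choose_parallelism("flop-bound", num_gpus=4))
print(choose_parallelism("memory-bound", num_gpus=4))
```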

Because Yotta uses peak performance metrics during scheduling, it tends to overestimate hardware capability, which can lead to suboptimal partition strategies. To avoid this problem, Yotta monitors throughput and execution time on each node and GPU during model training or inference. It then applies an effectiveness factor (ranging from 0 to 1) to adjust the memory and FLOPS performance estimates for each GPU. Based on these adjusted estimates, Yotta refines the partition strategy.
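A minimal sketch of this feedback loop, assuming the effectiveness factor is simply the ratio of observed to peak throughput clamped to [0, 1]; the names and exact update rule are assumptions.

```python
# Sketch of the effectiveness-factor adjustment described above: compare the
# throughput observed at runtime against the peak estimate used at scheduling
# time, and scale future estimates by the resulting factor in [0, 1].
def effectiveness_factor(observed_flops, peak_flops):
    return max(0.0, min(1.0, observed_flops / peak_flops))

def adjusted_estimate(peak_flops, observed_flops):
    """Peak estimate scaled by the measured effectiveness of this GPU."""
    return peak_flops * effectiveness_factor(observed_flops, peak_flops)

# Example: a GPU advertised at 125 TFLOPS that sustains 80 TFLOPS in practice.
print(adjusted_estimate(peak_flops=125e12, observed_flops=80e12))  # 80 TFLOPS
```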
