Proactive Communication


Inter-site communication is a major performance bottleneck in a decentralized system. Because the interconnect bandwidth between two sites can be multiple orders of magnitude lower than that within a centralized cluster, reducing communication overhead is key to Yotta's high performance.

Using pipeline parallelism, Yotta overlaps GPU computation with communication to hide the communication overhead, as shown in the following figure. The figure illustrates traditional pipeline parallelism running a transformer block of an LLM on two sites (Site A and Site B). The communication between the two sites can be partially overlapped with the GPU computation on each site. However, when the GPU computation on a site is small, the communication overhead cannot be effectively hidden. To address this problem, Yotta uses a technique called proactive communication.
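As a rough illustration of the traditional overlap described above, consider a minimal sketch of the sending site: each micro-batch's output is handed to a non-blocking send, so the transfer of one micro-batch overlaps with the computation of the next. This is not Yotta's implementation; `transformer_block`, `PEER_RANK`, and `run_stage` are hypothetical names, and an initialized `torch.distributed` process group (e.g., Gloo or NCCL) is assumed.

```python
import torch.distributed as dist

PEER_RANK = 1  # rank hosting the next pipeline stage (Site B); illustrative

def run_stage(microbatches, transformer_block):
    pending = []
    for x in microbatches:
        z = transformer_block(x)                       # GPU computation for this micro-batch
        # Non-blocking send: the transfer of this result overlaps
        # with the computation of the next micro-batch.
        pending.append(dist.isend(z, dst=PEER_RANK))
    for req in pending:
        req.wait()                                     # drain outstanding transfers
```

Note that the overlap here is only between whole micro-batches: if a micro-batch's computation is short, there is little work left to hide the send behind, which is exactly the limitation the text describes.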

This technique decomposes the original computation and communication operations into finer-grained ones and transfers each data shard as soon as it is ready, instead of waiting until all data shards are ready (the data can be either operands or computation results). The following figure depicts the resulting pipeline, with the computation decomposed into two fine-grained pieces, running a transformer block of an LLM on the two sites (Site A and Site B).
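A minimal sketch of this finer granularity is shown below, under the assumption that the block's output can be produced shard by shard (here by splitting the micro-batch along the batch dimension, so the shards are independent). The shard count, `run_block_proactive`, and `PEER_RANK` are illustrative assumptions, not Yotta's API.

```python
import torch
import torch.distributed as dist

PEER_RANK = 1    # rank hosting the next pipeline stage (Site B); illustrative
NUM_SHARDS = 2   # two fine-grained pieces, matching Z1 and Z2 in the figure

def run_block_proactive(x, transformer_block):
    pending = []
    for x_shard in torch.chunk(x, NUM_SHARDS, dim=0):
        z_shard = transformer_block(x_shard)           # produce Z_i
        # Transfer Z_i the moment it is ready: the send of Z1
        # overlaps with the computation that generates Z2.
        pending.append(dist.isend(z_shard.contiguous(), dst=PEER_RANK))
    for req in pending:
        req.wait()
```

The only difference from the baseline sketch is the granularity: the send fires per shard rather than per full result, which shortens the critical path precisely when the per-site computation is small.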

The result $Z_1$ is transferred as soon as it is ready, and its transfer overlaps with the generation of $Z_2$, as shown below. Proactive communication therefore reduces the execution time, because the data transfer is overlapped with the computation that produces the next shard.