Memory Optimization

Yotta Protocol applies memory-saving techniques such as rematerialization, tensor offloading, and quantization to fit more computation onto a single GPU, so that users can run large-model training and inference on affordable GPUs with limited memory. With these memory-saving optimizations, the performance of affordable GPUs can approach that of high-end GPUs.

  • Rematerialization deletes some activations from memory after they are used in the forward phase of AI model training and recomputes them during the backward phase. It also deletes parts of the Key-Value (KV) cache in Large Language Model (LLM) inference and recomputes them if those tensors are needed in later computations (see the rematerialization sketch after this list).

  • Tensor offloading temporarily moves unused tensors to host CPU memory, SSD, or remote memory and reads them back when they are needed (see the offloading sketch after this list). Yotta Labs has years of experience developing tensor offloading in prior research and plays a leading role in this area of the AI community.

  • Quantization represents data with reduced precision in large AI model training and inference, minimizing memory footprint and data movement (see the quantization sketch after this list). Together, these memory-saving techniques also increase the arithmetic intensity of operations, further improving system throughput.

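As a rough illustration of rematerialization in training, the sketch below uses PyTorch's activation checkpointing to drop intermediate activations in the forward pass and recompute them during backward. The `Block` module, dimensions, and depth are illustrative assumptions, not Yotta Protocol code.

```python
# Minimal rematerialization sketch via PyTorch activation checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

class CheckpointedStack(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are freed after the forward pass and
            # recomputed during backward, trading extra compute for memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedStack(dim=512, depth=8)
x = torch.randn(4, 512, requires_grad=True)
model(x).sum().backward()
```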
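The next sketch illustrates tensor offloading to host memory. The helper names `offload` and `reload` are hypothetical and chosen for illustration; they are not Yotta Protocol APIs. Pinned host memory is used so the device-to-host copy can overlap with ongoing GPU compute.

```python
# Minimal tensor-offloading sketch: park a tensor in pinned host memory,
# free the GPU copy, and read it back when it is needed again.
import torch

def offload(t: torch.Tensor) -> torch.Tensor:
    """Copy a GPU tensor into pinned host memory so the GPU copy can be freed."""
    host = torch.empty(t.shape, dtype=t.dtype, device="cpu", pin_memory=True)
    host.copy_(t, non_blocking=True)   # asynchronous D2H copy
    return host

def reload(host: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Read an offloaded tensor back onto the GPU when it is needed."""
    return host.to(device, non_blocking=True)

if torch.cuda.is_available():
    activation = torch.randn(1024, 1024, device="cuda")
    cpu_copy = offload(activation)
    torch.cuda.synchronize()   # make sure the copy finished before freeing
    del activation             # GPU memory becomes available for other tensors
    activation = reload(cpu_copy)
```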
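Finally, a minimal quantization sketch: symmetric per-tensor int8 quantization shrinks an fp32 tensor by 4x and dequantizes it on use. The function names and tensor sizes are assumptions for illustration only.

```python
# Minimal int8 quantization sketch: smaller footprint, less data movement.
import torch

def quantize_int8(t: torch.Tensor):
    """Symmetric per-tensor int8 quantization: 4x smaller than fp32."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate fp32 tensor when the value is needed."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # 64 MiB in fp32
q, scale = quantize_int8(w)        # 16 MiB in int8 (plus one scale)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())     # small quantization error
```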
Yotta accurately estimates the changes in memory usage and execution time when these memory-saving techniques are applied. Specifically, rematerialization increases the number of operations, while offloading and quantization increase the execution time per operation. The communication overhead in the system may also change, since each site can complete more computation than it could without memory saving. Existing systems such as SHE do not account for the impact of memory-saving techniques, leading to suboptimal parallelism strategies in geo-distributed environments. A simplified illustration of such an estimate is sketched below.
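The following is a hedged, back-of-the-envelope cost model in the spirit of this estimate. The `OpProfile` fields, the doubling rule for rematerialized operators, and the per-operator overhead terms are illustrative assumptions, not Yotta's actual estimator.

```python
# Illustrative cost model: how rematerialization trades memory for extra
# operations, and how offloading/quantization add per-operator latency.
from dataclasses import dataclass

@dataclass
class OpProfile:
    flops: float             # arithmetic work of the operator
    activation_bytes: float  # activations kept alive for the backward phase
    time_s: float            # measured forward time without memory saving

def estimate(ops, remat=frozenset(), offload_overhead_s=0.0, quant_overhead_s=0.0):
    """Estimate (peak activation bytes, total time in seconds) for one step."""
    peak_bytes, total_time = 0.0, 0.0
    for i, op in enumerate(ops):
        if i in remat:
            # Rematerialized activations are freed, but the operator's forward
            # runs again during backward: more operations, less memory.
            total_time += 2 * op.time_s
        else:
            peak_bytes += op.activation_bytes
            total_time += op.time_s
        # Offloading and quantization keep the operation count the same but
        # add transfer and (de)quantization latency per operator.
        total_time += offload_overhead_s + quant_overhead_s
    return peak_bytes, total_time

ops = [OpProfile(1e12, 512e6, 0.02) for _ in range(8)]
print(estimate(ops))                      # baseline
print(estimate(ops, remat={1, 3, 5, 7}))  # recompute every other operator
```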
