# Memory Optimization

Yotta Protocol applies memory-saving techniques such as `rematerialization`, `tensor offloading`, and `quantization` to fit more operations on a single GPU, so that users can run large-model inference and training on affordable GPUs with limited memory. With these optimizations, affordable GPUs can deliver performance comparable to that of high-end GPUs.

* `Rematerialization` deletes some activations from memory after they are used in the forward phase of AI model training and recomputes them during the backward phase. In Large Language Model (LLM) inference, it also deletes part of the Key-Value (KV) cache and recomputes the deleted tensors if later computations need them.
* `Tensor offloading` temporarily moves unused tensors to host CPU memory, SSD, or remote memory and reads them back when needed. Yotta Labs has years of experience developing tensor offloading techniques in prior research and plays a leadership role in the AI community.
* `Quantization` is applied in large AI model training or inference to minimize memory footprint and data movement by representing data with reduced precision. The aforementioned memory-saving techniques also help increase the arithmetic intensity of operations, thereby further enhancing system throughput.
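
To make the rematerialization trade-off concrete, here is a minimal sketch in plain Python. It is not Yotta's implementation: the scalar "layers", the function names `grad_store_all` and `grad_rematerialized`, and the `segment` parameter are all hypothetical, chosen only to show that dropping activations and recomputing them yields the same gradient while storing fewer values.

```python
import math

# Hypothetical toy "layers": (forward function, derivative) pairs on scalars.
LAYERS = [
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 2.0 * x + 1.0, lambda x: 2.0),
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
] * 2  # 8 layers in total


def grad_store_all(x):
    """Baseline: keep every layer input alive for the backward pass."""
    inputs = []
    for fwd, _ in LAYERS:
        inputs.append(x)
        x = fwd(x)
    grad = 1.0
    for (_, dfn), inp in zip(reversed(LAYERS), reversed(inputs)):
        grad *= dfn(inp)  # chain rule, last layer first
    return grad, len(inputs)


def grad_rematerialized(x, segment=4):
    """Keep one checkpoint per segment; recompute the other inputs on demand."""
    checkpoints = {}
    for i, (fwd, _) in enumerate(LAYERS):
        if i % segment == 0:
            checkpoints[i] = x  # input to layer i is kept
        x = fwd(x)              # all other inputs are dropped
    grad = 1.0
    recomputed = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(LAYERS))
        # Rebuild the dropped inputs of this segment from its checkpoint.
        inputs = [checkpoints[start]]
        for fwd, _ in LAYERS[start:end - 1]:
            inputs.append(fwd(inputs[-1]))
            recomputed += 1
        for (_, dfn), inp in zip(reversed(LAYERS[start:end]), reversed(inputs)):
            grad *= dfn(inp)
    return grad, len(checkpoints), recomputed


g_full, stored_full = grad_store_all(0.3)
g_remat, stored_remat, extra_forward_ops = grad_rematerialized(0.3, segment=4)
```

In this toy setup the baseline keeps 8 activations, while the rematerialized version keeps 2 checkpoints at the cost of 6 extra forward evaluations; both return the same gradient. Real frameworks apply the same idea to tensors rather than scalars.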

Yotta accurately estimates the changes in memory usage and execution time when the memory-saving techniques are applied. Specifically, rematerialization increases the number of operations, while offloading and quantization increase the execution time per operation. Furthermore, the communication overhead in the system may also change, because each site can complete more computation with the memory-saving techniques than without them. Existing systems such as SHE do not consider the impact of the memory-saving techniques, leading to suboptimal parallelism strategies in geo-distributed environments.
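
The kind of estimate described above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not Yotta's estimator: the `Plan` fields, the helper names, and every number below are hypothetical. It encodes the two effects named in the text: rematerialization adds operations, while offloading and quantization make each operation slower.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    num_ops: int            # operations executed per step
    time_per_op_ms: float   # average execution time per operation
    memory_gb: float        # peak memory footprint


def apply_rematerialization(plan, recompute_fraction, memory_saved_gb):
    """More operations, same time per operation, less memory."""
    extra_ops = int(plan.num_ops * recompute_fraction)
    return Plan(plan.num_ops + extra_ops,
                plan.time_per_op_ms,
                plan.memory_gb - memory_saved_gb)


def apply_per_op_slowdown(plan, slowdown, memory_saved_gb):
    """Offloading or quantization: same operations, each one slower."""
    return Plan(plan.num_ops,
                plan.time_per_op_ms * slowdown,
                plan.memory_gb - memory_saved_gb)


def step_time_ms(plan):
    return plan.num_ops * plan.time_per_op_ms


# Hypothetical numbers, for illustration only.
base = Plan(num_ops=1000, time_per_op_ms=0.5, memory_gb=40.0)
remat = apply_rematerialization(base, recompute_fraction=0.25, memory_saved_gb=12.0)
offload = apply_per_op_slowdown(base, slowdown=1.5, memory_saved_gb=16.0)
```

With these made-up inputs, the rematerialized plan runs 1250 operations in 625 ms using 28 GB, and the offloaded plan runs 1000 operations in 750 ms using 24 GB, against a 500 ms, 40 GB baseline. A planner could compare such candidates against a GPU's memory limit before choosing a parallelism strategy.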

