Fault Tolerance
Yotta protocol incorporates robust fault tolerance mechanisms to ensure the reliability and stability of model training and inference in a decentralized environment. These mechanisms address two critical aspects: site failure detection and handling, and communication fault detection and handling.0
Communication Fault Detection and Handling
Yotta employs a combination of checksum verification and redundant computation to detect and handle communication faults in a decentralized environment. Checksums are used to detect data corruption during transmission, and retransmission is requested when a mismatch is found. Additionally, Yotta randomly selects idle sites to compute the pipeline stages simultaneously. The results from the primary and redundant sites are compared, and a majority voting mechanism is used to determine the correct result in case of discrepancies. This approach enhances reliability and security by detecting and mitigating the impact of communication faults and potential attacks. Furthermore, encryption and secure communication protocols are used to protect data in transit.
Site Failure Dection and Handling
Yotta employs a heartbeat monitoring mechanism to detect site failures. Each site periodically sends a heartbeat mes- sage to the other sites involved in the computation. If a site A fails to receive a heartbeat message from another site B within a predefined timeout period, it assumes that the site B has failed. Upon detecting a site failure, Yotta triggers a checkpoint/restart mechanism. A new site will leverage the previous system checkpoint to restart the AI model.
Last updated