Submitted by admin on Fri, 10/25/2024 - 05:30

We consider a distributed storage system with $n$ nodes, where a user can recover the stored file from any $k$ nodes, and study the problem of repairing $r$ partially failed nodes. We consider broadcast repair, that is, $d$ surviving nodes transmit broadcast messages on an error-free wireless channel to the $r$ nodes being repaired, which are then used, together with the surviving data in the local memories of the failed nodes, to recover the lost content. First, we derive the trade-off between the storage capacity and the repair bandwidth for partial repair of multiple failed nodes, based on the cut-set bound for information flow graphs. It is shown that utilizing the broadcast nature of the wireless medium and the surviving contents at the partially failed nodes reduces the repair bandwidth per node significantly. Then, we list a set of invariant conditions that are sufficient for a functional repair code to be feasible. We further propose a scheme for functional repair of multiple failed nodes that satisfies the invariant conditions with high probability, and its extension to the repair of partial failures. The performance of the proposed scheme meets the cut-set bound on all the points on the trade-off curve for all admissible parameters when $k$ is divisible by $r$ , while employing linear subpacketization, which is an important practical consideration in the design of distributed storage codes. Unlike random linear codes, which are conventionally used for functional repair of failed nodes, the proposed repair scheme has lower overhead, lower input-output cost, and lower computational complexity during repair.

Nitish Mital
Katina Kralevska
Cong Ling
Deniz Gündüz