Design and optimization of heterogeneous coded distributed computing
Abstract
The massive increase in data volume in recent years has posed significant
challenges for traditional data processing systems. Although distributed
computing has been considered as an effective solution, its efficient implementation
faces the challenge of the high communication overhead incurred
by data exchange (shuffling) between workers. Coded Distributed
Computing (CDC) has been proposed by utilizing coded multicasting to
reduce the shuffling load. To our best knowledge, existing works on the
CDC only consider input files with uniform file size, limiting their practicality
in real-world applications. To address this limitation, we propose a
Heterogeneous Coded Distributed Computing (HetCDC) scheme to handle
input files of nonuniform sizes. We then formulate a joint optimization
problem to optimize the file placement and coded shuffling strategies to
minimize the shuffling load. Through reformulation, we convert the nonconvex
optimization problem into an integer linear programming problem
and solve it through the branch-and-cut method. Numerical studies show
the proposed HetCDC outperforms existing works. Based on the Het-
CDC, we further develop a Heterogeneous TeraSort algorithm to improve
the sorting time of traditional TeraSort, which is a key building blocks for
many big data processing algorithms.

