Paper Detail

Paper Title: Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers
Paper Identifier: FR2.R6.3
Authors: Mehmet Emre Ozfatura, Deniz Gunduz, Imperial College London, United Kingdom; Sennur Ulukus, University of Maryland, United States
Session: Coding for Distributed Computation
Location: Sorbonne, Level 5
Session Time: Friday, 12 July, 11:40 - 13:00
Presentation Time: Friday, 12 July, 12:20 - 12:40
Abstract: When gradient descent (GD) is scaled across many parallel computing servers (workers) for large-scale machine learning problems, its per-iteration computation time is limited by the straggling workers. Coded distributed GD (DGD) can tolerate straggling workers by assigning redundant computations to them, but in most existing schemes each non-straggling worker transmits a single message to the parameter server (master) per iteration, only after completing all of its assigned computations. We instead allow each worker to convey multiple computations per iteration, so that computations completed by straggling workers can also be exploited. We show that the average completion time per iteration can be reduced significantly at a reasonable increase in the communication load. We also propose a general coded DGD technique that can trade off the average computation time against the communication load.
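The idea in the abstract can be illustrated with a toy simulation. The sketch below is a hypothetical illustration, not the authors' scheme: it uses a simple cyclic repetition assignment (each data partition replicated on `redundancy` workers), models random per-worker speeds, and lets every worker send a message after *each* completed partial gradient, so a straggler's early computations still count. The iteration ends as soon as every partition's gradient has arrived from some worker. All function and parameter names (`simulate_iteration`, `redundancy`, etc.) are invented for this example.

```python
import random

def simulate_iteration(num_workers=4, num_parts=4, redundancy=2, rng=None):
    """Simulate one iteration of multi-message coded DGD (toy model).

    Each of `num_parts` data partitions is replicated on `redundancy`
    workers via a cyclic assignment. Workers process their partitions in
    order and transmit a message after each one, so partial work by slow
    (non-persistent straggling) workers is not wasted. Returns the time
    at which the master has received every partition's gradient."""
    rng = rng or random.Random(0)
    # Cyclic repetition assignment: worker w holds partitions w, w+1, ...
    assignments = [[(w + j) % num_parts for j in range(redundancy)]
                   for w in range(num_workers)]
    events = []  # (arrival_time, partition_index)
    for w, parts in enumerate(assignments):
        # Random per-computation time for this worker (its "speed").
        speed = rng.expovariate(1.0) + 0.1
        for k, p in enumerate(parts):
            # The k-th message from worker w arrives after k+1 computations.
            events.append(((k + 1) * speed, p))
    events.sort()
    received = set()
    for t, p in events:
        received.add(p)
        if len(received) == num_parts:
            return t  # iteration completes: full gradient recoverable
    return None  # assignment did not cover all partitions
```

Averaging `simulate_iteration` over many random seeds would give the average per-iteration completion time the abstract refers to; raising `redundancy` increases the communication load (more messages) while reducing that average time, which is the trade-off the paper studies.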