Technical Program

Paper Detail

Paper Title Timely Coded Computing
Paper Identifier FR3.R2.2
Authors Chien-Sheng Yang, University of Southern California, United States; Ramtin Pedarsani, University of California, Santa Barbara, United States; Amir Salman Avestimehr, University of Southern California, United States
Session Coding for Stragglers
Location Saint Germain, Level 3
Session Time Friday, 12 July, 14:30 - 16:10
Presentation Time Friday, 12 July, 14:50 - 15:10
Abstract In modern distributed computing systems, unpredictable and unreliable infrastructure results in high variability of computing resources. Meanwhile, demand for timely, event-driven services with deadline constraints is increasing significantly. Motivated by measurements over Amazon EC2 clusters, we consider a two-state Markov model for the variability of computing speed in cloud networks. In this model, each worker is in either a good state or a bad state in terms of computation speed, and the transitions between these states are modeled as a Markov chain that is unknown to the scheduler. We then consider a Coded Computing framework, in which the data is possibly encoded and stored at the worker nodes to provide robustness against nodes that may be in a bad state. Our goal is to design the optimal computation-load allocation strategy that maximizes the timely computation throughput (i.e., the average number of computation tasks accomplished before their deadline). Our main result is the development of a dynamic computation strategy called the Estimate-and-Allocate (EA) strategy, which achieves the optimal timely computation throughput. Compared with the static allocation strategy, EA improves the timely computation throughput by 1.44× to 4.6× in experiments over Amazon EC2 clusters.
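Illustration The abstract does not spell out how the EA strategy works, so the following is only a minimal sketch of the setting it describes, not the authors' algorithm. It simulates workers whose speed follows a two-state Markov chain and measures timely throughput as the fraction of rounds finished before the deadline. Everything specific here is an assumption made for illustration: the function names, the speeds mu_good and mu_bad, the rule that a round succeeds once the workers that met the deadline have returned at least `required` units of coded results, and the simple count-based estimator with a 0.5 threshold.

```python
import random

def markov_states(p_gg, p_bb, rounds, rng, start_good=True):
    """Sample one worker's states (True = good) from a two-state Markov chain.
    p_gg / p_bb are the probabilities of staying in the good / bad state."""
    states, good = [], start_good
    for _ in range(rounds):
        states.append(good)
        stay = p_gg if good else p_bb
        if rng.random() >= stay:
            good = not good
    return states

def round_succeeds(loads, states, mu_good, mu_bad, deadline, required):
    """Assumed success rule: only workers that finish their load by the
    deadline contribute, and the coded task needs `required` units in total."""
    done = 0.0
    for load, good in zip(loads, states):
        speed = mu_good if good else mu_bad
        if load <= speed * deadline:
            done += load
    return done >= required

def static_loads(n_workers, required, redundancy):
    """Static baseline: the same load for every worker in every round."""
    return [redundancy * required / n_workers] * n_workers

def estimated_loads(prev_states, counts, mu_good, mu_bad, deadline):
    """Hypothetical estimate-then-allocate rule: predict each worker's next
    state from observed transition counts, then assign the largest load that
    the predicted state can finish in time."""
    loads = []
    for prev, c in zip(prev_states, counts):
        to_bad, to_good = c[prev]
        p_good = (to_good + 1) / (to_bad + to_good + 2)  # Laplace smoothing
        loads.append(mu_good * deadline if p_good >= 0.5 else mu_bad * deadline)
    return loads

def timely_throughput(chains, policy, mu_good, mu_bad, deadline, required):
    """Fraction of rounds completed before the deadline under `policy`.
    The scheduler sees a round's states only after that round ends."""
    n, rounds = len(chains), len(chains[0])
    counts = [{True: [0, 0], False: [0, 0]} for _ in range(n)]
    prev = [True] * n  # assumption: workers are believed good initially
    successes = 0
    for t in range(rounds):
        now = [chain[t] for chain in chains]
        loads = policy(prev, counts)
        if round_succeeds(loads, now, mu_good, mu_bad, deadline, required):
            successes += 1
        if t > 0:  # record the observed transition prev -> now
            for i in range(n):
                counts[i][prev[i]][int(now[i])] += 1
        prev = now
    return successes / rounds

if __name__ == "__main__":
    rng = random.Random(0)
    mu_g, mu_b, d, k = 2.0, 1.0, 1.0, 6.0  # illustrative values only
    chains = [markov_states(0.9, 0.6, 2000, rng) for _ in range(5)]
    static = lambda prev, counts: static_loads(5, k, redundancy=1.5)
    dynamic = lambda prev, counts: estimated_loads(prev, counts, mu_g, mu_b, d)
    print("static :", timely_throughput(chains, static, mu_g, mu_b, d, k))
    print("dynamic:", timely_throughput(chains, dynamic, mu_g, mu_b, d, k))
```

The demo prints the measured throughput of both policies for one random seed. The numbers depend entirely on the assumed parameters and should not be compared with the EC2 results reported in the abstract.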