iShuffle: Improving Hadoop Performance with Shuffle-on-Write (Jun-2017)

Aim:

To improve Hadoop performance and increase the efficiency of MapReduce jobs, we present iShuffle and apply the technique to a real, complex, multi-server, and widely used system: Hadoop.

Proposed system:

We propose to decouple shuffle from reduce tasks and convert it into a platform service provided by Hadoop. We present iShuffle, a user-transparent shuffle service that proactively pushes map output data to nodes via a novel shuffle-on-write operation and flexibly schedules reduce tasks with workload balance in mind. It overlaps the data shuffling of any reduce task with the map phase, addresses input data skew in reduce tasks, and enables efficient reduce scheduling.
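The idea of pushing map output at write time and placing reduce input with balance in mind can be illustrated with a small toy model. The sketch below is not iShuffle's actual implementation: the names place_partitions and ShuffleOnWriteSketch, the byte-sum partitioner, and the greedy largest-first placement are all assumptions chosen for illustration, and partition sizes are assumed to be known in advance.

```python
import heapq
from collections import defaultdict


def place_partitions(partition_sizes, nodes):
    """Greedy largest-first placement (illustrative assumption, not iShuffle's
    algorithm): give each partition to the currently least-loaded node."""
    heap = [(0, node) for node in nodes]  # (bytes assigned so far, node name)
    heapq.heapify(heap)
    placement = {}
    # Largest partitions first; ties broken by partition id for determinism.
    for pid, size in sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        placement[pid] = node
        heapq.heappush(heap, (load + size, node))
    return placement


class ShuffleOnWriteSketch:
    """Toy model: a map output record is routed to its target node as soon as
    it is written, instead of waiting for a reduce task to fetch it."""

    def __init__(self, placement, num_partitions):
        self.placement = placement
        self.num_partitions = num_partitions
        # node -> partition id -> list of (key, value) records
        self.node_buffers = defaultdict(lambda: defaultdict(list))

    def write(self, key, value):
        # Deterministic toy partitioner (hypothetical): sum of the key's bytes.
        pid = sum(key.encode()) % self.num_partitions
        node = self.placement[pid]
        self.node_buffers[node][pid].append((key, value))
        return node


if __name__ == "__main__":
    sizes = {0: 50, 1: 30, 2: 20, 3: 10}  # assumed partition sizes
    placement = place_partitions(sizes, ["n1", "n2"])
    service = ShuffleOnWriteSketch(placement, num_partitions=4)
    for k in ["a", "b", "c", "d"]:
        print(k, "->", service.write(k, 1))
```

In this toy setup, records reach their destination node during the map phase, so a reduce task scheduled on that node would find its input already local. A real system would also have to handle buffering, spills, and failures, which the sketch leaves out.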

We implement iShuffle on a Hadoop cluster and evaluate its benefits using benchmark jobs from the Purdue MapReduce Benchmark Suite (PUMA) and HiBench, with e-commerce datasets collected from real applications. We compare the performance of iShuffle with that of stock Hadoop on both shuffle-heavy and shuffle-light workloads.