MapReduce efficiently processes large data sets at petabyte and exabyte scale on distributed systems. Applications such as reverse indexing, image and pattern recognition, and analytics rely extensively on this programming model for large-scale distributed data processing. MapReduce splits processing into two distinct phases: a Map phase, which distributes data processing across workers, and a Reduce phase, which aggregates the results. While open-source frameworks such as Apache Hadoop and its query engine Hive implement MapReduce, some challenges persist. Most critically, high network I/O is incurred when Mappers emit a large number of intermediate rows. One solution, attempted by Apache Hive, is to pre-aggregate data at the Mapper and send only the aggregated data to the Reducer, thereby reducing network I/O. However, this is insufficient for two reasons: first, Reducers are starved while Mappers aggregate data on the Map side, and second, Mappers can overflow their memory and must flush and re-compute if the pre-aggregated results exceed their buffer capacity. We propose a design that solves both problems through an adaptive split size for Map-side pre-aggregation. By dynamically adjusting the size of the input split assigned to each Mapper, we ensure that (1) Reducers receive a continuous stream of data and do not starve, and (2) Mapper memory overflows are avoided, which also eliminates the need to "flush and re-compute" earlier results.
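To make the buffer-capacity concern concrete, the sketch below shows one common form of Map-side pre-aggregation ("in-mapper combining") for a word-count job on Apache Hadoop. It is a minimal, hypothetical illustration rather than the design proposed here or Hive's actual implementation: the class name PreAggregatingWordCountMapper, the MAX_BUFFERED threshold, and the early-flush policy are assumptions introduced for this example.

```java
// Minimal sketch (assumed, not the proposed design): a Hadoop Mapper that
// pre-aggregates word counts in an in-memory buffer before emitting them.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PreAggregatingWordCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int MAX_BUFFERED = 100_000;   // assumed buffer cap
    private final Map<String, Long> buffer = new HashMap<>();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (word.isEmpty()) continue;
            buffer.merge(word, 1L, Long::sum);          // aggregate locally instead of emitting every row
        }
        if (buffer.size() > MAX_BUFFERED) {
            flush(context);                             // spill early to stay within memory
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        flush(context);                                 // emit the remaining partial sums
    }

    private void flush(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
        buffer.clear();
    }
}
```

In this sketch, nothing reaches the Reducers until a flush occurs, and an early flush forced by the buffer cap sends partially aggregated rows across the network anyway. These are exactly the starvation and overflow costs that an adaptive input split size is intended to bound, by keeping each Mapper's pre-aggregated state small enough to emit promptly and fit in memory.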