MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster.[1][2][3]

A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The model is a specialization of the split-apply-combine strategy for data analysis.[4] It is inspired by the map and reduce functions commonly used in functional programming,[5] although their purpose in the MapReduce framework is not the same as in their original forms.[6] The key contributions of the MapReduce framework are not the actual map and reduce functions (which, for example, resemble the 1995 Message Passing Interface standard's[7] reduce[8] and scatter[9] operations), but the scalability and fault-tolerance achieved for a variety of applications due to parallelization. As such, a single-threaded implementation of MapReduce is usually not faster than a traditional (non-MapReduce) implementation; any gains are usually only seen with multi-threaded implementations on multi-processor hardware.[10] The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm.[11]

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized. By 2014, Google was no longer using MapReduce as their primary big data processing model,[12] and development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that incorporated full map and reduce capabilities.[13]

  1. ^ "MapReduce Tutorial". Apache Hadoop. Retrieved 3 July 2019.
  2. ^ "Google spotlights data center inner workings". cnet.com. 30 May 2008. Archived from the original on 19 October 2013. Retrieved 31 May 2008.
  3. ^ "MapReduce: Simplified Data Processing on Large Clusters" (PDF). googleusercontent.com.
  4. ^ Wickham, Hadley (2011). "The split-apply-combine strategy for data analysis". Journal of Statistical Software. 40: 1–29. doi:10.18637/jss.v040.i01.
  5. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Research
  6. ^ Lämmel, R. (2008). "Google's Map Reduce programming model — Revisited". Science of Computer Programming. 70: 1–30. doi:10.1016/j.scico.2007.07.001.
  7. ^ http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/mpi2-report.htm MPI 2 standard
  8. ^ "MPI Reduce and Allreduce · MPI Tutorial". mpitutorial.com.
  9. ^ "Performing Parallel Rank with MPI · MPI Tutorial". mpitutorial.com.
  10. ^ "MongoDB: Terrible MapReduce Performance". Stack Overflow. October 16, 2010. The MapReduce implementation in MongoDB has little to do with map reduce apparently. Because for all I read, it is single-threaded, while map-reduce is meant to be used highly parallel on a cluster. ... MongoDB MapReduce is single threaded on a single server...
  11. ^ Cite error: The named reference ullman was invoked but never defined (see the help page).
  12. ^ Sverdlik, Yevgeniy (2014-06-25). "Google Dumps MapReduce in Favor of New Hyper-Scale Analytics System". Data Center Knowledge. Retrieved 2015-10-25. "We don't really use MapReduce anymore" [Urs Hölzle, senior vice president of technical infrastructure at Google]
  13. ^ Harris, Derrick (2014-03-27). "Apache Mahout, Hadoop's original machine learning project, is moving on from MapReduce". Gigaom. Retrieved 2015-09-24. Apache Mahout [...] is joining the exodus away from MapReduce.