Thursday, August 1, 2013

MapReduce: a gentle introduction with examples

By had00b 11:49 AM distributed file system, hadoop, hdfs, mapreduce, matrix transpose, word count 33 comments

A brief history of MapReduce and Hadoop.

MapReduce is a programming framework for distributed processing originally developed at Google in 2004. The original paper by Jeffrey Dean and Sanjay Ghemawat describes the programming model and underlying system. The reason that led to the development of MapReduce was the fact that engineers at Google found themselves repeatedly solving the same problems (such as inverted indices, graph structure representations, set of most frequent queries) with ad-hoc distributed computations running on several machines. MapReduce was designed as a framework to express such computations and meanwhile automatically take care of parallelization, fault-tolerance, data distribution and load balancing.

In the recent years, the MapReduce framework has gained wide adoption and has become the de facto standard for large scale data analysis. A major effort that favored such a quick adoption was the development of Hadoop, an open-source implementation of the MapReduce framework (the MapReduce implementation at Google was, and still is, a proprietary technology). Hadoop was created by Doug Cutting in 2004-05, who originally saw the potential of the MapReduce framework to aid the development of Apache Nutch, an open-source web search engine he and his team had started building in 2002. In 2006, Doug Cutting was hired by Yahoo! and Hadoop was moved out of Nutch and formed an independent project. Since 2008, many companies started adopting Hadoop and several Hadoop distributions spawned (Cloudera, MapR, HortonWorks, EMC Pivotal HD to name a few). Amazon's EC2 compute cloud and similar services offer users the possibility to rent a cluster and run MapReduce computation.

MapReduce as we know it today is composed of the MapReduce programming framework along with a distributed file-system. HDFS (the Hadoop Distributed File System) is the Hadoop implementation of such a file-system. As Hadoop came as an open-source implementation of MapReduce, likewise HDFS (firstly known as NDFS, aka Nutch Distributed File System) provided an open-source implementation of GFS, the distributed file-system used in production at Google, whose architecture was described in the 2003 paper The Google File System. Other distributed file-systems compatible with MapReduce exist such as Quantcast FS (QFS, previously CloudStore and KosmosFS) and MapR-FS.

MapReduce concepts

We now briefly discuss concepts that are at the core of MapReduce: the underlying hardware (i.e. the cluster of machines), the distributed file system, and the actual MapReduce programming framework.

Physical organization of machines

Our usual view of computation was that of a single (possibly multi-processor) machine. The recent explosion in the amount of data available to analyze called for a lot more computational power. Cluster computing is a horizontal solution to this need: more computational capacity is generated by networking a large number of commodity low-end servers (as opposed to vertical solutions of purchasing more expensive, special-purpose machines with many processors). In a typical configuration, the machines (usually called nodes) of a cluster are physically stored on racks, each rack allocating 8-64 machines. The nodes on a rack are usually networked via gigabit Ethernet, and for large clusters with multiple racks, racks are connected by a switch or some other level of network.

Failures of a single node are quite common in clusters -- for large clusters, even loss of an entire rack. Considering the goal of a cluster is to run intensive computations reading data from several machines, an obvious need is that the failure of a node doesn't cause the abortion and restart of the whole computation. In order to guarantee this, it must be the case that

files are stored redundantly -- this justifies the need of a distributed file system
computation is split into parallel, independent tasks (if a task dies, we only want to re-execute that task) -- the MapReduce programming model and framework take care of exactly this aspect

The Distributed File System

Distributed file systems manage the storage across a network (cluster) of machines. Note that the network aspect creates several complications: the problem we mentioned about tolerating node failures without data loss calls for data replication among different nodes. Data redundancy in turn yields synchronization problems when we want to update a file.

Distributed file systems such as HDFS and GFS are designed for a scenario where files are huge (in the order of gigabytes and terabytes), and are pretty much static. The data is read many times and possibly additional data is written into new large files or appended to existing files. In this scenario we don't usually expect random access to a file, rather we expect reads of entire files.

The distributed file system stores the files divided into large chunks, which are typically 64 or 128 megabytes, to amortize seek time with large reads. Chunks are replicated, typically on three different machines, with two of them belonging to different racks in the case of large clusters. In order to find the chunks of a file, the distributed file system adopts a master–slave architecture: the master (called namenode in HDFS) maintains a namespace file containing the filesystem tree and the metadata (the locations of the chunks) of each file. This information is replicated as well, typically to a remote NFS mount. The slaves (called datanodes in HDFS) are distributed among the machines and manage the actual data chunks.

The MapReduce programming model

In the MapReduce programming paradigm, the basic unit of information is a (key, value) pair. The computation on an input (i.e. on a set of pairs) occur in three stages: the map stage, the shuffle stage and the reduce stage. Semantically, the map and shuffle phases distribute the data, and the reduce phase performs the computation.

Map

In the map stage, the mapper takes as input a single (key, value) pair and produces as output any number of (key, value) pairs. It is important to think of the map operation as stateless, that is its logic operates on a single pair at a time (even if in practice several input pairs are delivered to the same mapper). The reason why is advised not to combine more input pairs for the computation of the output pairs is that the input and the computation is distributed, so we don't know a priori which and how many input pairs a mapper receives. To summarize, for the map phase, the user simply designs a map function that maps an input (key, value) pair to any number (even none) of output pairs. Most of the time, the map phase is simply used to specify the desired location of the input value by changing its key.

Shuffle

The shuffle stage is automatically handled by the MapReduce framework, i.e. the engineer has nothing to do for this stage. The underlying system implementing MapReduce routes all of the values that are associated with an individual key to the same reducer.

Reduce

In the reduce stage, the reducer takes all of the values associated with a single key k and outputs any number of (key, value) pairs. This highlights one of the sequential aspects of MapReduce computation: all of the maps need to finish before the reduce stage can begin. Since the reducer has access to all the values with the same key, it can perform sequential computations on these values. In the reduce step, the parallelism is exploited by observing that reducers operating on different keys can be executed simultaneously. To summarize, for the reduce phase, the user designs a function that takes in input a list of values associated with a single key and outputs any number of pairs. Often the output keys of a reducer equal the input key (in fact, in the original MapReduce paper the output key must equal to the input key, but Hadoop relaxed this constraint).

Overall, a program in the MapReduce paradigm can consist of many rounds (usually called jobs) of different map and reduce functions, performed sequentially one after another.

Examples

We will go through a series of simple examples to help understanding how to solve problems in MapReduce. It is helpful to think the input data lying unordered in a cloud, with the output of the reduce being stored again unordered to such cloud. Also, it is useful to think of the mapper phase as a way to distribute the data (where to "map" the input), and the reduce phase as a way to perform computation on (its portion of the) data (how to "reduce" the data). There are cases in which it is convenient to depart from these guidelines, but in general they provide a good starting point.

In the following examples, the key of the map input pairs is irrelevant and the value represents the actual input. Think of the input being stored in one or multiple files distributed across the machines in the cluster, with each line of these files representing (the value of) an input pair. Each input pair is read by a mapper, running on some machine of the cluster (usually, on a machine where that line physically resides).

Word count

Word count is the prototypical example of MapReduce, it's the "Hello World" of MapReduce. Imagine you have a bunch of text files in your cluster and would like to count the overall number of occurrences of each word. We assume each value in input is the line of one of the files.

The standard solution is the following: the map will tokenize each input value (i.e. each line) into words, outputting a pair (word, 1) for each word; the reducer associated with a word will receive all the occurrences of a word and can count them and return the total number of occurrences.

Map: For each input value v: for each word w obtained by tokenizing v, output the key-value pair (w, 1)
Shuffle: For each key (word) w produced by any of the Map tasks, there will be one or more key-value pairs (w,1). Then, for key w, the shuffle will produce (w, [1, 1,..., 1]) and present this as input to a reducer.
Reduce: the reducer associated with key (word) w turns (w, [1, 1, ..., 1]) into (w, n), where n is the sum of the values. Note that the reduce phase produces exactly one output pair for each key w.

Removing duplicates

We basically would like to implement the distinct operator: the desired output should contain all of the input values, but values that appear multiple times in input, should appear only once in output.

The solution is simpler than that for word count: we use the value itself as the map output key; the reducer associated with a specific key, can then return a single output.

Map: For each input value v, output the key-value pair (v, v)
Shuffle: For each key v produced by any of the Map tasks, there will be one or more key-value pairs (v,v). Then, for key v, the shuffle will produce (v, [v, v, ..., v]) and present this as input to a reducer.
Reduce: the reducer for key v turns (v, [v, v, ..., v]) into v, so it produces exactly one output v for this key v.

Note that the way we defined the reducer is slightly imprecise as we should define pairs as output. This is a small technicality that also depends on the MapReduce implementation we are using: in Hadoop for example, we can have the reducer return the pair (v, null) or (null, v) to obtain the same result.

Matrix transpose

Given a sparse matrix in row major order, we would like to output the same matrix in column major order. In other words, each input value encodes a single row of the matrix and we would like the output to be composed of the columns of the matrix. If we view the input matrix as the adjacency list of a directed graph, this is equivalent to compute the graph obtained by reversing the edges.

Assume each input value has the form $(i, [(j_1,v_{i,j_1}), (j_2,v_{i,j_2}), \ldots, (j_t,v_{i,j_t})])$, meaning that row $i$ of the matrix has value $v_{i,j_1}$ on column $j_1$, value $v_{i,j_2}$ on column $j_2$, and so on (with value zero on all missing columns). We will solve the problem as follows: the map phase distributes the values by column and the reducer associated with a column will output the whole column.

Map: For each input $(i, [(j_1,v_{i,j_1}), (j_2,v_{i,j_2}), \ldots, (j_t,v_{i,j_t})])$, output the key-value pairs $(j_1, (i, v_{i,j_1})), (j_2, (i, v_{i,j_2})), \ldots, (j_t, (i, v_{i,j_t}))$
Shuffle: For each key (column) $j$, the shuffle will produce $(j, [(i_1, v_{i_1,j}), (i_2, v_{i_2,j}), \ldots, (i_{\ell}, v_{i_{\ell},j})])$ and present it as input to a reducer.
Reduce: the reducer for key (column) $j$ returns its whole input as a single output $(j, [(i_1, v_{i_1,j}), (i_2, v_{i_2,j}), \ldots, (i_{\ell}, v_{i_{\ell},j})])$. This exactly corresponds to column $j$ of the matrix.

Followers of Followers

We are given a directed graph where an edge $(u,v)$ indicates that $u$ follows $v$. The graph is stored as an unordered list of edges, i.e. each input corresponds to a single edge. We want to find the second hub of each user $u$, that is the set of users that follow a follower of $u$.

The key insight for solving this problem is that the in-neighbors of a node $u$ are in the 2nd hub of each of $u$'s out-neighbors. We will distribute the edges in such a way the reducer associated with a node $u$ receives all the in-neighbors and out-neighbors of $u$ and can distinguish between the two.

Map: For each input edge $(u,v)$, output the two key-value pairs $(u,(v, OUT))$ and $(v, (u,IN))$. This "says" to $u$ that $v$ is one of its out-neighbors and to $v$ that $u$ is one of its in-neighbors.
Shuffle: For each key (node) $u$, the shuffle will produce $(u, [(v_1, a_1), (v_2, a_2), \ldots, (v_t, a_t)])$, where $a_i$ is either $IN$ or $OUT$; $v_i$ is an in-neighbor of $u$ if $a_i=IN$ and is an out-neighbor of $u$ if $a_i=OUT$. Note that the shuffle basically compiles the neighborhood of each node.
Reduce: the reducer for key (node) $u$ takes in input the whole neighborhood of $u$ and therefore knows all in-neighbors and out-neighbors of $u$. For every node $w$ in-neighbor of $u$, and for every node $v$ out-neighbor of $u$, we generate the output pair $(v,w)$, to indicate that $w$ is in the second hub of $v$.

At the end of this job we have computed all pairs $(v,w)$ for every $w$ in the second hub of $v$. We may have duplicates though, since $w$ could be in the second hub of $v$ via multiple followers of $v$. We can then use our previous solution to remove duplicates to achieve the final result.
[If you're interested in follower-related problems, you can find a 2009 snapshot of follow relationships of Twitter here.]

Key ideas in MapReduce

Below are some of MapReduce key features.

Horizontal scaling: choosing a large number of commodity low-end machines over a small number of high-end machines.
Move code to data, rather than data to code: in a MapReduce cluster, storage and processors are co-located. MapReduce exploits data locality by running the code on the machine storing the block of data needed to be read.
Sequential reads, no random access: MapReduce is designed for batch processing, milking on the important fact that sequential reads of a file are orders of magnitude faster than random access. The important metric being optimized by MapReduce is throughput rather than latency.
Developer only Map and Reduce aware: you don't need to know all of MapReduce intricacies and system-level details to run a program in MapReduce. The MapReduce abstraction is simple to understand and provides a clean interface for algorithm designers and developers.

References

Hadoop: the definitive guide, by Tom White
Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer
Mining of Massive Datasets, by Anand Rajaraman, Jure Leskovec, and Jeff Ullman
To install Hadoop, follow the instructions in our tutorial

33 comments:

SurjanAugust 15, 2016 at 4:49 PM
Thanks. Great examples and very clear explanation.
ReplyDelete
Replies
UnknownNovember 25, 2017 at 8:50 AM

In Hadoop, MapReduce is a calculation that decomposes large manipulation jobs into individual tasks that can be executed in parallel cross a cluster of servers. The results of tasks can be joined together to compute final results.
Mapreduce program example
Hadoop fs command using java api
ReplyDelete
Replies
AnonymousFebruary 9, 2018 at 10:50 PM
Its really very useful topic. Thanks for sharing this information...

SAP FICO Training in Chennai

ReplyDelete
Replies
UnknownMarch 16, 2018 at 9:16 PM

Thankyou for sharing this useful info and keep it up

isoft innovations facebook

isoft innovations faq

isoft innovations about us

isoft innovations address

isoft innovations contact us

isoft innovations software company

isoft innovations chennai

isoft innovations employees

isoft innovations careers

isoft innovations consulting services

ReplyDelete
Replies
UnknownApril 20, 2018 at 10:57 PM
I simply wanted to write down a quick word to say thanks to you for those wonderful tips and hints you are showing on this site.
Best Hadoop Training Institute In chennai

amazon-web-services-training-institute-in-chennai
ReplyDelete
Replies
MounikaApril 28, 2018 at 2:15 AM
Appreciation for really being thoughtful and also for deciding on certain marvelous guides most people really want to be aware of. Dotnet developer
Authorized Dot Net training in chennai
ReplyDelete
Replies
kevingeorgeMay 5, 2018 at 4:01 AM
I simply wanted to write down a quick word to say thanks to you for those wonderful tips and hints you are showing on this site.

Best Java Training Institute Chennai

Amazon Web Services Training in Chennai

ReplyDelete
Replies
TejutejuMay 22, 2018 at 6:41 AM
awesome post presented by you..your writing style is fabulous and keep update with your blogs Big data hadoop online Course India
ReplyDelete
Replies
TejutejuJune 21, 2018 at 9:05 PM
Really nice blog post.provided a helpful information.I hope that you will post more updates like this Big data hadoop online Training Bangalore

ReplyDelete
Replies
Vincent SmithAugust 7, 2018 at 12:49 AM

App Cloner for Android is an application that will allow you to make exact copies of any app on your smartphone or tablet.
ReplyDelete
Replies
Sakthi MuruganNovember 21, 2018 at 12:46 AM
Best Content ever!!! Keep doing more and update your blog regularly for the future reference
Selenium Training in Chennai
Selenium Course in Chennai
iOS Course in Chennai
Digital Marketing Training in Chennai
J2EE Training in Chennai
Software testing training in chennai
Testing training in chennai
Big Data Training in Chennai
ReplyDelete
Replies
AnonymousNovember 25, 2018 at 9:02 AM
Thanks for your great and helpful presentation I like your good service. I always appreciate your post. That is very interesting I love reading and I am always searching for informative information like this. Well written article Thank You for Sharing with Us pmp training class in chennai | pmp training near me | pmp training courses online | pmp training fee | project management training certification
ReplyDelete
Replies
Anbarasan14December 12, 2018 at 8:58 PM
Excellent blog!!! I got to know the more useful information by reading your blog. Thanks for posting this blog.

TOEFL Coaching Centres in Ambattur Chennai
TOEFL Classes in Ambattur OT
TOEFL Course in Redhills
TOEFL Training near me
Best TOEFL Coaching Institute in velachery
TOEFL Coaching Classes in Tharamani
TOEFL Coaching Centres in Madipakkam
ReplyDelete
Replies
luckysFebruary 8, 2019 at 12:33 AM
indian whatsapp group links
ReplyDelete
Replies
bashaMarch 18, 2019 at 9:27 PM
Superb blog I visit this blog it's really awesome. The important thing is that in this blog content written clearly and understandable. The content of information is very informative.
Oracle Fusion HCM Online Training
Oracle Fusion SCM Online Training
Oracle Fusion Financials Online Training
Big Data and Hadoop Training In Hyderabad
oracle fusion financials classroom training
Workday HCM Online Training
Oracle Fusion HCM Classroom Training
Workday HCM Online Training
ReplyDelete
Replies
karthickMarch 20, 2019 at 5:16 AM
Thanks for sharing your informative post on development.Your work is very good and I appreciate you and hoping for some more informative posts.keep writing
and sharing.
R Language Training in Chennai
ReplyDelete
Replies
keerthiJune 11, 2019 at 11:58 PM
nice
Web Designing Course in Hyderabad,ameerpet
http://www.argiatechnologies.com
ReplyDelete
Replies
luckysJuly 9, 2019 at 7:42 AM
english to tamil typing
ReplyDelete
Replies
AnonymousJuly 13, 2019 at 6:13 AM
This comment has been removed by the author.
ReplyDelete
Replies
AjishJuly 18, 2019 at 2:38 AM
Good Post! Thank you so much for sharing this pretty post, it was so good to read and useful to improve my knowledge as updated one, keep blogging.
Dot Net training in Electronic City

ReplyDelete
Replies
AnonymousJuly 18, 2019 at 10:59 PM
This comment has been removed by the author.
ReplyDelete
Replies
salmanSeptember 21, 2019 at 3:48 AM
https://vodafonecustomercarenumber.hatenablog.com
https://vodafonecustomercarenumber.hatenablog.com
https://mpcustomercareno.blogspot.com
https://mpcustomercareno.blogspot.com
https://myairtelcustomercarenumber.blogspot.com
https://myairtelcustomercarenumber.blogspot.com
ReplyDelete
Replies
akashJune 4, 2020 at 9:22 AM
Such a great word which you use in your article and article is amazing knowledge. thank you for sharing it.
Microsoft Windows Azure Training | Online Course | Certification in chennai | Microsoft Windows Azure Training | Online Course | Certification in bangalore | Microsoft Windows Azure Training | Online Course | Certification in hyderabad | Microsoft Windows Azure Training | Online Course | Certification in pune
ReplyDelete
Replies
tjproductstechDecember 8, 2020 at 11:28 PM
Very nice article.I hope you will publish again such type of post.
Corporate gifts | Promotional gifts singapore
Corporate gifts ideas
ReplyDelete
Replies
Links IndexingJanuary 7, 2021 at 5:43 AM
Tips to Save Money During College Days
ReplyDelete
Replies
AlwinFebruary 22, 2021 at 2:58 AM
German classes in Chennai
French Classes in Chennai
Spoken English Classes in Chennai
Spoken Hindi Classes in Chennai
Language Academy
ReplyDelete
Replies
Parul PathakMay 26, 2021 at 4:51 AM
I think you did vikram university ba 2nd year result an awesome job explaining it. Sure beats having to research it vikram university ba 3rd year result on my own. Thanks
ReplyDelete
Replies
Bollywood HungamaJune 19, 2021 at 4:51 AM
I really appreciate your help with my project! Sweet Tooth Full Season Dual Audio
ReplyDelete
Replies
GeorgeNovember 2, 2021 at 1:44 AM
The Universal Currency Converter Is The Most Reliable And Accurate RMB TO USD Currency Conversion Tool On The Internet. It Is Fast And Easy To Use, And Supports Over 150 World Currencies.
ReplyDelete
Replies
Elena JamesNovember 18, 2021 at 2:51 AM
Are You Thinking About Joining AVATRADE REVIEW ? If You Are Then Read The Review On AVATRADE Fx In This Article. We, Will, Tell You About The Company, The Brokers, The Platform, And All The Information You Need To Know. Read More Here.
ReplyDelete
Replies
AnonymousMarch 17, 2022 at 9:31 PM
Sports Betting at Safest Betting Sites in Israel - Airjordan10 RetroOutlet
Safest Betting Sites air jordan 18 retro varsity red to me in Israel · The where to buy jordan 18 white royal blue Best and Best air jordan 18 retro men red Betting Sites in air jordan 18 retro men blue discount the UK · air jordan 18 retro racer blue free shipping The UK's Hottest Bookmakers · Betfair Sports Betting Authority · Betfair
ReplyDelete
Replies
hanumanchalisaJune 6, 2023 at 7:08 AM
Very informative post thanks for sharing this post
Hanuman Chalisa Lyrics Pdf
Hanuman Chalisa Tamil Pdf
Hanuman Chalisa English Pdf
ReplyDelete
Replies

Add comment

Had00b Big data made simple