Monday, August 5, 2013

Reservoir Sampling in MapReduce

414 comments
We consider the problem of picking a random sample of a given size k from a large dataset of some unknown size n. The hidden assumption here is that n is large enough that the whole dataset does not fit into main memory, whereas the desired sample does....
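
To make the problem statement concrete, here is a minimal single-machine sketch of the classic reservoir sampling technique (often called Algorithm R). It is not the MapReduce version the post goes on to describe; the function name `reservoir_sample` and its parameters are assumptions chosen for illustration. It reads the stream once and keeps only k items in memory, so n never needs to be known in advance.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 10 values from a stream of one million integers.
    print(reservoir_sample(range(1_000_000), 10))
```

If the stream holds fewer than k items, the sketch simply returns all of them.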

Saturday, August 3, 2013

Set up Apache Hadoop on your machine (single-node cluster)

34 comments
Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, the purpose of installing Hadoop on your machine is mainly for learning, developing and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines. The fully distributed mode is not in the scope of this post...

Thursday, August 1, 2013

MapReduce: a gentle introduction with examples

33 comments
A brief history of MapReduce and Hadoop. MapReduce is a programming framework for distributed processing originally developed at Google in 2004. The original paper by Jeffrey Dean and Sanjay Ghemawat describes the programming model and the underlying system. MapReduce was developed because engineers at Google found themselves repeatedly solving the same problems (such as inverted indices, graph structure representations, sets of most frequent queries) with ad-hoc distributed computations running...