Monday, August 5, 2013

Reservoir Sampling in MapReduce

We consider the problem of picking a random sample of a given size k from a large dataset of unknown size n. The implicit assumption is that n is so large that the whole dataset does not fit into main memory, whereas the desired sample does...
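
As a rough illustration of the problem statement, here is a minimal single-machine sketch of classic reservoir sampling (often called Algorithm R). It is not necessarily the MapReduce formulation the post goes on to describe. The function name `reservoir_sample` and its parameters are hypothetical, chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn at random from a stream of unknown length.

    Hypothetical helper for illustration: it makes one pass over the stream
    and keeps only k items in memory at any time.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: the generator hides the size of the dataset from the sampler.
sample = reservoir_sample((line for line in map(str, range(100_000))), k=5)
```

Only the reservoir of k items is held in memory, which matches the assumption that the sample fits in main memory while the dataset does not.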

Saturday, August 3, 2013

Set up Apache Hadoop on your machine (single-node cluster)

Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, a single-machine installation is mainly useful for learning, development, and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines. The fully distributed mode is outside the scope of this post...

Thursday, August 1, 2013

MapReduce: a gentle introduction with examples

A brief history of MapReduce and Hadoop. MapReduce is a programming framework for distributed processing, originally developed at Google and published in 2004. The original paper by Jeffrey Dean and Sanjay Ghemawat describes the programming model and the underlying system. MapReduce was developed because engineers at Google found themselves repeatedly solving the same problems (such as building inverted indices, representing graph structures, and finding the most frequent queries) with ad hoc distributed computations running...

Sunday, July 21, 2013

Had00b - Introduction

Had00b is a Big Data blog for readers ranging from n00bs to advanced users. We will discuss common algorithmic problems that arise in Big Data and explain solutions that fit the MapReduce framework. We will cover recommendation systems, PageRank, clustering, locality-sensitive hashing, and more, each with exciting applications and source code. We will also provide an introduction to the MapReduce framework and howtos on...