## Saturday, August 3, 2013

Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, the purpose of installing Hadoop on a single machine is mainly learning, developing and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines; fully distributed mode is beyond the scope of this post.

Don't get discouraged by the length of this post: the whole procedure literally takes only a few minutes! There are only three steps to install Hadoop on a single machine. First, make sure Java (version 6 or later) is installed and that Hadoop knows where to find it. Second, set up your machine to accept ssh logins (this is needed for Hadoop's pseudo-distributed mode). Third, configure Hadoop. We will walk through each of these steps, with instructions for each UNIX operating system.

## Java

As mentioned, we need to install the Java SDK and have the environment variable JAVA_HOME point to a suitable Java installation. Usually, this variable is set in a shell startup file, such as ~/.bash_profile or ~/.bashrc (or ~/.zshrc if you use zsh as your shell). We will use ~/.bashrc in this tutorial. The location of the Java home varies depending on the system; in most cases, a suitable Java home contains a folder named include with a file jni.h.
###### Mac OS X
Mac OS X comes by default with Java 6 SDK. It is enough to set JAVA_HOME in ~/.bash_profile or ~/.bashrc. From a terminal run the following two commands.
echo "export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home" >> ~/.bashrc
source ~/.bashrc


###### Ubuntu/Debian
Run the following command to install Java 6 OpenJDK:
sudo apt-get update
sudo apt-get -y install openjdk-6-jdk openjdk-6-jre

The JAVA_HOME variable can be set as follows (the first command picks the first Java binary known to update-alternatives, the second strips the trailing bin/java to obtain the Java home):
JAVA_BIN=$(update-alternatives --list java | head -n 1)
echo "export JAVA_HOME=$(dirname "$(dirname "$JAVA_BIN")")" >> ~/.bashrc
source ~/.bashrc
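To see what this derivation does, here is a minimal sketch using a hypothetical OpenJDK path (the path below is an example, not necessarily what your system reports): JAVA_HOME is simply the directory two levels above the java binary.

```shell
# Hypothetical java binary path, as update-alternatives might report it:
JAVA_BIN=/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java
# Strip the trailing bin/java: two dirname calls go two levels up.
JAVA_HOME_CANDIDATE=$(dirname "$(dirname "$JAVA_BIN")")
echo "$JAVA_HOME_CANDIDATE"
```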
## SSH

###### Mac OS X
Go to System Preferences -> Sharing and enable Remote Login for (at least) the current user. Then skip to the section SSH password-less login.

###### Ubuntu/Debian
Install ssh with the following command, then go to section SSH password-less login.
sudo apt-get install -y ssh


###### CentOS/Red Hat
Install ssh with the following command, then go to section SSH password-less login.
sudo yum -y install openssh-server openssh-clients
sudo service sshd start


###### SSH password-less login
To enable password-less login, generate a new SSH key with an empty passphrase, and add it to the authorized keys and known hosts:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keyscan -H localhost >> ~/.ssh/known_hosts
ssh-keyscan -H 0.0.0.0 >> ~/.ssh/known_hosts

Test this with:
ssh localhost

If everything was set up correctly, you should be logged in without being asked for a password.

## Hadoop

Download a release of Hadoop from the Apache Hadoop releases page. The common question is "Which version?". In this post we assume we want to run MapReduce v1 (i.e. no YARN, as it's not production-ready yet). I suggest going either with version 0.20.2 (the last legacy version with all usable features) or with the latest 1.x version (1.2.1 at the time of writing). Despite the naming, the 1.x versions are stable continuations of the 0.20 branch; in fact, 1.0 is a simple renaming of 0.20.205. Once you have decided on a release x.y.z, navigate to the corresponding folder, download the file hadoop-x.y.z.tar.gz and unpack it somewhere in your filesystem:
tar xzf hadoop-x.y.z.tar.gz

It is useful to have an environment variable HADOOP_INSTALL pointing to the Hadoop installation folder, and to add the Hadoop binary subfolder bin to the command-line path. The following commands assume Hadoop is in your home folder; replace x.y.z with your version.
echo "export HADOOP_INSTALL=~/hadoop-x.y.z" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_INSTALL/bin" >> ~/.bashrc
source ~/.bashrc

At this point you should be able to run Hadoop. Test this with:
hadoop version

This step may be redundant, but it usually solves the "JAVA_HOME is not set" issue: set JAVA_HOME also in the file $HADOOP_INSTALL/conf/hadoop-env.sh. You can either edit the file yourself, looking for JAVA_HOME and setting it to the right value, or run the following:
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_INSTALL/conf/hadoop-env.sh

Let's now move on to configuring Hadoop for pseudo-distributed mode. By default, Hadoop is configured for standalone (sometimes called local) mode. In standalone mode, a submitted job is not actually executed by Hadoop's daemons, but rather by a MapReduce simulator (indeed, everything runs in a single JVM). While this can be useful for basic debugging, this mode does not reflect some other important Hadoop aspects that should be debugged, such as multiple reducers or serialization between map and reduce. In pseudo-distributed mode, everything runs as in fully distributed mode, except that the cluster has only one machine. The files we are going to change are $HADOOP_INSTALL/conf/{mapred,core,hdfs}-site.xml. It is convenient to save the default files in case you later want to switch back to standalone mode.
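For reference, here is a minimal pseudo-distributed configuration sketch. These are commonly used values (ports 9000 and 9001 are conventional, not mandatory): core-site.xml sets the default filesystem to a local HDFS, hdfs-site.xml lowers the replication factor to 1 (there is only one DataNode), and mapred-site.xml points to the local JobTracker.

```xml
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

If you prefer to run MapReduce without HDFS, leave fs.default.name at its default (the local filesystem) and only set mapred.job.tracker.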

###### Running the daemons
If you decided to run Hadoop without HDFS then you can start the MapReduce daemons (JobTracker and TaskTracker) with the following command:
start-mapred.sh

You can access the JobTracker UI at http://localhost:50030.

If instead you decided to also use HDFS, first format the NameNode (this is only needed the very first time):
hadoop namenode -format

Then start the HDFS and MapReduce daemons (NameNode, DataNode, JobTracker, TaskTracker) as follows:
start-dfs.sh
start-mapred.sh

You can access the NameNode UI at http://localhost:50070.
To stop the daemons run the corresponding stop-mapred.sh and stop-dfs.sh.

## A quick test

We will run a quick test that counts the word occurrences in a file. The Hadoop examples jar file contains several examples, among which the typical word count. Running the following command:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar

returns as output:
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.

As input, we will use a plain text version of "Moby-Dick" by Herman Melville, downloadable from Project Gutenberg:
wget http://www.gutenberg.org/cache/epub/2489/pg2489.txt

If you have decided to use HDFS, copy the file to HDFS with:
hadoop fs -copyFromLocal pg2489.txt .

Now run the following command to run the word count job. You can check the progress on the JobTracker UI.
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount pg2489.txt out

The output is going to be in the files part-r-0000* inside the out folder. If you're not using HDFS, that folder has been created in the folder from which you launched the command.
Note that there's only one file part-r-00000; this is because by default Hadoop uses a single reducer. If you want to use multiple reducers (say 2), then you can modify the previous command to:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount -D mapred.reduce.tasks=2 pg2489.txt out

If you are not using HDFS, you can print the output content as follows:
cat out/part-r-00000

If you are using HDFS, you can use the following command:
hadoop fs -cat out/part-r-00000

Alternatively, you can copy the output folder to your local file-system:
hadoop fs -copyToLocal out .
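To get a feel for what the wordcount job computes, here is a rough local equivalent using standard shell tools on a tiny made-up sample (illustration only; the real job tokenizes, shuffles and reduces across the cluster):

```shell
# Split a line into one word per line, then count occurrences of each
# distinct word (each output line is: count word).
printf 'the whale the sea\n' | tr -s ' ' '\n' | sort | uniq -c
```

Each line of the Hadoop job's output files is analogous: a word followed by its total count in the input.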