Saturday, August 3, 2013

Setup Apache Hadoop on your machine (single-node cluster)

11 comments
Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, the purpose of installing Hadoop on your machine is mainly for learning, developing and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines. The fully distributed mode is not in the scope of this post.

Don't get discouraged by the length of this post, the whole procedure literally takes only a few minutes! There are only three steps to install Hadoop on a single machine. First, make sure Java (version 6 or later) is installed and that Hadoop knows where to find it. Second, setup your machine to accept ssh logins (this is needed for Hadoop's pseudo-distributed mode). Third, configure Hadoop. We will proceed explaining each of these steps, differentiating depending on the UNIX operating system in use.

Java

As mentioned, we will need to install Java SDK and have the environment variable JAVA_HOME point to a suitable Java installation. Usually, this variable is set in a shell startup file, such as ~/.bash_profile or ~/.bashrc (or ~/.zshrc if you use zsh as shell). We will use .bashrc in this tutorial. The location of Java home varies depending on the system. In most cases this folder should contain a folder named include containing a file jni.h.
Mac OS X
Mac OS X comes by default with Java 6 SDK. It is enough to set JAVA_HOME in ~/.bash_profile or ~/.bashrc. From a terminal run the following two commands.
echo "export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home" >> ~/.bashrc
source ~/.bashrc

Ubuntu/Debian
Run the following command to install Java 6 OpenJDK:
sudo apt-get update
sudo apt-get -y install openjdk-6-jdk openjdk-6-jre
The JAVA_HOME variable can be set as follows:
JAVA_BIN=`update-alternatives --list java | cut -d' ' -f1`
echo "export JAVA_HOME=`dirname $( dirname $( dirname $JAVA_BIN ) )`" >> ~/.bashrc
source ~/.bashrc

CentOS/Red Hat
Run the following to install Java 6 OpenJDK:
sudo yum -y install java-1.6.0-openjdk java-1.6.0-openjdk-devel
This should install Java in a subdirectory of /usr/lib/jvm/java which is going to be our JAVA_HOME:
echo "export JAVA_HOME=/usr/lib/jvm/java" >> ~/.bashrc
source ~/.bashrc

SSH

Hadoop does not distinguish between fully-distributed mode (i.e. when deployed on a cluster) and pseudo-distributed mode (i.e. when installed on a single machine). It simply starts the required daemons on the machine(s) listed in the $HADOOP_INSTALL/conf/slaves, by logging in these machines and starting the processes. By default, the slaves file contains localhost (i.e. by default Hadoop is configured for single-machine mode), so we need to enable SSH login to our machine.
Mac OS X
Go into System Preferences -> Sharing and enable Remote Login for (at least) the current user. Then go to section SSH password-less login.

Ubuntu/Debian
Install ssh with the following command, then go to section SSH password-less login.
sudo apt-get install -y ssh

CentOS/Red Hat
Install ssh with the following command, then go to section SSH password-less login.
sudo yum -y install ssh

SSH password-less login
First of all, don't you worry :) Password-less login does not mean that everybody can login into your machine without a password. It simply means that we will setup your machine to login into itself without a password.
To enable password-less login, generate a new SSH key with an empty passphrase, and add it to the authorized keys and known hosts:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keyscan -H localhost, localhost >> ~/.ssh/known_hosts
ssh-keyscan -H localhost, 0.0.0.0 >> ~/.ssh/known_hosts
Test this with:
ssh localhost
If everything was setup correctly, you should have logged in without having to type in any password.

Hadoop

Download a release of Hadoop here. The common question is "Which version??". In this post we assume we want to run MapReduce v1 (i.e. no YARN, as it's not production-ready yet). I would suggest you to go either with version 0.20.2 (the last legacy version with all usable features), or with the latest 1.x version (1.2.1 at the time of writing this post). Despite the naming, the 1.x versions are stable continuations of the 0.20 branch, in fact 1.0 is a simple renaming of 0.20.205. Once you decided the release x.y.z to go with, navigate in the corresponding folder and download the file hadoop-x.y.z.tar.gz and unpack it somewhere in your filesystem:
tar xzf hadoop-x.y.z.tar.gz
It is useful to have an environmental variable HADOOP_INSTALL pointing to the Hadoop installation folder and to add the Hadoop binary subfolder bin to the command-line path. The following commands assume Hadoop is in your home folder, also change x.y.z with your version.
echo "export HADOOP_INSTALL=~/hadoop-x.y.z" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_INSTALL/bin" >> ~/.bashrc
source ~/.bashrc
At this point you should be able to run Hadoop. Test this with:
hadoop version
This step may be redundant but usually solves the "JAVA_HOME is not set" issue. Set JAVA_HOME also in the file $HADOOP_INSTALL/conf/hadoop-env.sh. You can either edit the file yourself looking for JAVA_HOME and setting it to the right value, or run the following:
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_INSTALL/conf/hadoop-env.sh
Let's now move on with configuring Hadoop for pseudo-distributed mode. By default Hadoop is configured for standalone (sometimes called local) mode. In standalone mode, a submitted job is actually not being executed by Hadoop's daemons, but rather by a MapReduce simulator (indeed, everything runs in a single JVM). While this can be useful for basic debugging, this mode does not reflect some other important Hadoop aspects that should be debugged, such as multiple reducers, or serialization between map and reduce. In pseudo-distributed mode, everything runs as in fully-distributed mode, except that the cluster has only one machine.

The files we are going to change are $HADOOP_INSTALL/conf/{mapred, core, hdfs}-site.xml. It is convenient to save the default files if you later want to switch back to standalone mode.
mkdir $HADOOP_INSTALL/conf/standalone
cp $HADOOP_INSTALL/conf/*-site.xml $HADOOP_INSTALL/conf/standalone
MapReduce framework
We start with modifying the file mapred-site.xml to instruct Hadoop to launch a JobTracker daemon, which basically implements the MapReduce framework. The file should have the following content: This is actually enough to run jobs in pseudo-distributed mode. The question is if you also want to use HDFS rather than your local file-system (since you're running Hadoop on only one machine, the local filesystem will work fine). If you want to stick with your local filesystem, you can skip the following section and go directly to section Running the daemons.

HDFS
If you want to run HDFS as well, first setup HDFS to be Hadoop's default filesystem by modifying the file core-site.xml: And set a block replication of 1 in hdfs-site.xml: Finally we initialize (format) HDFS.
hadoop namenode -format
You can see that a folder /tmp/hadoop-${user.name}/dfs has been created. If you want to change the location where HDFS stores metadata and data, you need to set the properties dfs.namenode.name.dir and dfs.datanode.data.dir in hdfs-site.xml and re-format.

Running the daemons
If you decided to run Hadoop without HDFS then you can start the MapReduce daemons (JobTracker and TaskTracker) with the following command:
start-mapred.sh
You can access the JobTracker UI at http://localhost:50030.

If instead you decided to also use HDFS, you can start the HDFS and MapReduce daemons (NameNode, DataNode, JobTracker, TaskTracker) as follows:
start-dfs.sh
start-mapred.sh
You can access the NameNode UI at http://localhost:50070.
To stop the daemons run the corresponding stop-mapred.sh and stop-dfs.sh.

A quick test

We will run a quick test that counts the word occurrences in a file. The Hadoop examples jar file contains several examples, among which the typical word count. Running the following command:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar
returns as output:
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
As input, we will use a plain text version of "Moby-Dick" by Herman Melville, downloadable from project Gutenberg.
wget http://www.gutenberg.org/cache/epub/2489/pg2489.txt
If you have decided to use HDFS, copy the file to HDFS with:
hadoop fs -copyFromLocal pg2489.txt .
Now run the following command to run the word count job. You can check the progress on the JobTracker UI.
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount pg2489.txt out
The output is going to be in the files part-r-0000* inside the out folder. If you're not using HDFS, that folder has been created in the folder from which you launched the command.
Note that there's only one file part-r-00000; this is because by default Hadoop uses a single reducer. If you want to use multiple reducers (say 2), then you can modify the previous command to:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount -D mapred.reduce.tasks=2 pg2489.txt out
If you are not using HDFS, you can print the output content as follows:
cat out/part-r-00000
If you are using HDFS, you can use the following command:
hadoop fs -cat out/part-r-00000
Alternatively, you can copy the output folder to your local file-system:
hadoop fs -copyToLocal out .

References

11 comments:

  1. Thanks for your post it was a great write-up, especially for so many different platforms. I had some problems getting this to work on MacOS 10.9, here are some of the steps that I had to change to get it to work:
    - in hadoop-env.sh changed export JAVA_HOME=$(/usr/libexec/java_home)
    - in hadoop-env.sh set export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk -Djava.net.preferIPv4Stack=true"

    This got rid of the errors and warnings and I was able to run sample examples. I also noted that many have increased the heap, I did this just to be safe in hadoop-env.sh:
    # The maximum amount of heap to use, in MB. Default is 1000.
    export HADOOP_HEAPSIZE=2000

    ReplyDelete
  2. As we also follow this blog along with attending hadoop online training center, our knowledge about the hadoop increased in manifold ways. Thanks for the way information is presented on this blog.

    ReplyDelete
  3. As we also follow this blog along with attending hadoop online training center, our knowledge about the hadoop increased in manifold ways. Thanks for the way information is presented on this blog.

    ReplyDelete
  4. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I’ll be subscribing to your feed and I hope you post again soon.
    Home Spa Services in Mumbai

    ReplyDelete
  5. Thank you for sharing such a nice and interesting blog with us. I have seen that all will say the same thing repeatedly. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog in my dude circle.
    Invisalign Treatment In Chennai

    ReplyDelete
  6. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    Hadoop Training in Chennai

    ReplyDelete
  7. Interesting blog which attracted me more.I hope you will post more update like this.
    Digital marketing company in Chennai

    ReplyDelete
  8. It’s the best time to make some plans for the future and it is time to be happy. I’ve read this post and if I could I want to suggest you few interesting things or suggestions.You can write next articles referring to this article. I desire to read even more things about it..
    Office Interior Designers in Coimbatore
    Office Interior Designers in Bangalore
    Office Interior Designers in Hyderabad

    ReplyDelete
  9. very very amazing explaintion....many things gather about yourself...yes realy i enjoy it
    Base SAS Training in Chennai

    ReplyDelete
  10. I am really happy to say it’s an interesting post to read . I learn new information from your article , you are doing a great job . Keep it up
    Hadoop Training in Hyderabad
    Data Science Training in Hyderabad

    ReplyDelete
  11. Hi,

    I could understand various concepts explained here and it is easier to grasp because of step by step instructions given here. This is the reason i love this post.Thanks for sharing:) it's very useful.

    SEO Company in Chennai

    SEO Company in India

    Digital Marketing company in Chennai


    Digital Marketing Company in India

    Web Development Company in India

    Web Design Company in Chennai

    ReplyDelete