Saturday, August 3, 2013

Setup Apache Hadoop on your machine (single-node cluster)

Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, the purpose of installing Hadoop on your machine is mainly for learning, developing and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines. The fully distributed mode is not in the scope of this post.

Don't get discouraged by the length of this post, the whole procedure literally takes only a few minutes! There are only three steps to install Hadoop on a single machine. First, make sure Java (version 6 or later) is installed and that Hadoop knows where to find it. Second, setup your machine to accept ssh logins (this is needed for Hadoop's pseudo-distributed mode). Third, configure Hadoop. We will proceed explaining each of these steps, differentiating depending on the UNIX operating system in use.


As mentioned, we will need to install Java SDK and have the environment variable JAVA_HOME point to a suitable Java installation. Usually, this variable is set in a shell startup file, such as ~/.bash_profile or ~/.bashrc (or ~/.zshrc if you use zsh as shell). We will use .bashrc in this tutorial. The location of Java home varies depending on the system. In most cases this folder should contain a folder named include containing a file jni.h.
Mac OS X
Mac OS X comes by default with Java 6 SDK. It is enough to set JAVA_HOME in ~/.bash_profile or ~/.bashrc. From a terminal run the following two commands.
echo "export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home" >> ~/.bashrc
source ~/.bashrc

Run the following command to install Java 6 OpenJDK:
sudo apt-get update
sudo apt-get -y install openjdk-6-jdk openjdk-6-jre
The JAVA_HOME variable can be set as follows:
JAVA_BIN=`update-alternatives --list java | cut -d' ' -f1`
echo "export JAVA_HOME=`dirname $( dirname $( dirname $JAVA_BIN ) )`" >> ~/.bashrc
source ~/.bashrc

CentOS/Red Hat
Run the following to install Java 6 OpenJDK:
sudo yum -y install java-1.6.0-openjdk java-1.6.0-openjdk-devel
This should install Java in a subdirectory of /usr/lib/jvm/java which is going to be our JAVA_HOME:
echo "export JAVA_HOME=/usr/lib/jvm/java" >> ~/.bashrc
source ~/.bashrc


Hadoop does not distinguish between fully-distributed mode (i.e. when deployed on a cluster) and pseudo-distributed mode (i.e. when installed on a single machine). It simply starts the required daemons on the machine(s) listed in the $HADOOP_INSTALL/conf/slaves, by logging in these machines and starting the processes. By default, the slaves file contains localhost (i.e. by default Hadoop is configured for single-machine mode), so we need to enable SSH login to our machine.
Mac OS X
Go into System Preferences -> Sharing and enable Remote Login for (at least) the current user. Then go to section SSH password-less login.

Install ssh with the following command, then go to section SSH password-less login.
sudo apt-get install -y ssh

CentOS/Red Hat
Install ssh with the following command, then go to section SSH password-less login.
sudo yum -y install ssh

SSH password-less login
First of all, don't you worry :) Password-less login does not mean that everybody can login into your machine without a password. It simply means that we will setup your machine to login into itself without a password.
To enable password-less login, generate a new SSH key with an empty passphrase, and add it to the authorized keys and known hosts:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/ >> ~/.ssh/authorized_keys
ssh-keyscan -H localhost, localhost >> ~/.ssh/known_hosts
ssh-keyscan -H localhost, >> ~/.ssh/known_hosts
Test this with:
ssh localhost
If everything was setup correctly, you should have logged in without having to type in any password.


Download a release of Hadoop here. The common question is "Which version??". In this post we assume we want to run MapReduce v1 (i.e. no YARN, as it's not production-ready yet). I would suggest you to go either with version 0.20.2 (the last legacy version with all usable features), or with the latest 1.x version (1.2.1 at the time of writing this post). Despite the naming, the 1.x versions are stable continuations of the 0.20 branch, in fact 1.0 is a simple renaming of 0.20.205. Once you decided the release x.y.z to go with, navigate in the corresponding folder and download the file hadoop-x.y.z.tar.gz and unpack it somewhere in your filesystem:
tar xzf hadoop-x.y.z.tar.gz
It is useful to have an environmental variable HADOOP_INSTALL pointing to the Hadoop installation folder and to add the Hadoop binary subfolder bin to the command-line path. The following commands assume Hadoop is in your home folder, also change x.y.z with your version.
echo "export HADOOP_INSTALL=~/hadoop-x.y.z" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_INSTALL/bin" >> ~/.bashrc
source ~/.bashrc
At this point you should be able to run Hadoop. Test this with:
hadoop version
This step may be redundant but usually solves the "JAVA_HOME is not set" issue. Set JAVA_HOME also in the file $HADOOP_INSTALL/conf/ You can either edit the file yourself looking for JAVA_HOME and setting it to the right value, or run the following:
echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_INSTALL/conf/
Let's now move on with configuring Hadoop for pseudo-distributed mode. By default Hadoop is configured for standalone (sometimes called local) mode. In standalone mode, a submitted job is actually not being executed by Hadoop's daemons, but rather by a MapReduce simulator (indeed, everything runs in a single JVM). While this can be useful for basic debugging, this mode does not reflect some other important Hadoop aspects that should be debugged, such as multiple reducers, or serialization between map and reduce. In pseudo-distributed mode, everything runs as in fully-distributed mode, except that the cluster has only one machine.

The files we are going to change are $HADOOP_INSTALL/conf/{mapred, core, hdfs}-site.xml. It is convenient to save the default files if you later want to switch back to standalone mode.
mkdir $HADOOP_INSTALL/conf/standalone
cp $HADOOP_INSTALL/conf/*-site.xml $HADOOP_INSTALL/conf/standalone
MapReduce framework
We start with modifying the file mapred-site.xml to instruct Hadoop to launch a JobTracker daemon, which basically implements the MapReduce framework. The file should have the following content: This is actually enough to run jobs in pseudo-distributed mode. The question is if you also want to use HDFS rather than your local file-system (since you're running Hadoop on only one machine, the local filesystem will work fine). If you want to stick with your local filesystem, you can skip the following section and go directly to section Running the daemons.

If you want to run HDFS as well, first setup HDFS to be Hadoop's default filesystem by modifying the file core-site.xml: And set a block replication of 1 in hdfs-site.xml: Finally we initialize (format) HDFS.
hadoop namenode -format
You can see that a folder /tmp/hadoop-${}/dfs has been created. If you want to change the location where HDFS stores metadata and data, you need to set the properties and in hdfs-site.xml and re-format.

Running the daemons
If you decided to run Hadoop without HDFS then you can start the MapReduce daemons (JobTracker and TaskTracker) with the following command:
You can access the JobTracker UI at http://localhost:50030.

If instead you decided to also use HDFS, you can start the HDFS and MapReduce daemons (NameNode, DataNode, JobTracker, TaskTracker) as follows:
You can access the NameNode UI at http://localhost:50070.
To stop the daemons run the corresponding and

A quick test

We will run a quick test that counts the word occurrences in a file. The Hadoop examples jar file contains several examples, among which the typical word count. Running the following command:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar
returns as output:
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
As input, we will use a plain text version of "Moby-Dick" by Herman Melville, downloadable from project Gutenberg.
If you have decided to use HDFS, copy the file to HDFS with:
hadoop fs -copyFromLocal pg2489.txt .
Now run the following command to run the word count job. You can check the progress on the JobTracker UI.
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount pg2489.txt out
The output is going to be in the files part-r-0000* inside the out folder. If you're not using HDFS, that folder has been created in the folder from which you launched the command.
Note that there's only one file part-r-00000; this is because by default Hadoop uses a single reducer. If you want to use multiple reducers (say 2), then you can modify the previous command to:
hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount -D mapred.reduce.tasks=2 pg2489.txt out
If you are not using HDFS, you can print the output content as follows:
cat out/part-r-00000
If you are using HDFS, you can use the following command:
hadoop fs -cat out/part-r-00000
Alternatively, you can copy the output folder to your local file-system:
hadoop fs -copyToLocal out .



  1. Thanks for your post it was a great write-up, especially for so many different platforms. I had some problems getting this to work on MacOS 10.9, here are some of the steps that I had to change to get it to work:
    - in changed export JAVA_HOME=$(/usr/libexec/java_home)
    - in set export HADOOP_OPTS=""

    This got rid of the errors and warnings and I was able to run sample examples. I also noted that many have increased the heap, I did this just to be safe in
    # The maximum amount of heap to use, in MB. Default is 1000.
    export HADOOP_HEAPSIZE=2000

  2. As we also follow this blog along with attending hadoop online training center, our knowledge about the hadoop increased in manifold ways. Thanks for the way information is presented on this blog.

  3. As we also follow this blog along with attending hadoop online training center, our knowledge about the hadoop increased in manifold ways. Thanks for the way information is presented on this blog.

  4. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog posts. Any way I’ll be subscribing to your feed and I hope you post again soon.
    Home Spa Services in Mumbai

  5. Thank you for sharing such a nice and interesting blog with us. I have seen that all will say the same thing repeatedly. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog in my dude circle.
    Invisalign Treatment In Chennai

  6. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    Hadoop Training in Chennai

  7. Interesting blog which attracted me more.I hope you will post more update like this.
    Digital marketing company in Chennai

  8. It’s the best time to make some plans for the future and it is time to be happy. I’ve read this post and if I could I want to suggest you few interesting things or suggestions.You can write next articles referring to this article. I desire to read even more things about it..
    Office Interior Designers in Coimbatore
    Office Interior Designers in Bangalore
    Office Interior Designers in Hyderabad

  9. very very amazing explaintion....many things gather about yourself...yes realy i enjoy it
    Base SAS Training in Chennai

  10. I am really happy to say it’s an interesting post to read . I learn new information from your article , you are doing a great job . Keep it up
    Hadoop Training in Hyderabad
    Data Science Training in Hyderabad

  11. Hi,

    I could understand various concepts explained here and it is easier to grasp because of step by step instructions given here. This is the reason i love this post.Thanks for sharing:) it's very useful.

    SEO Company in Chennai

    SEO Company in India

    Digital Marketing company in Chennai

    Digital Marketing Company in India

    Web Development Company in India

    Web Design Company in Chennai

  12. Good work Sir, Thanks for the proper explanation about HDFS. I found one of the good resource related to HDFS and Hadoop. It is providing in-depth knowledge on HDFS and HDFS Architecture. which I am sharing a link with you where you can get more clear on HDFS and Hadoop. To know more Just have a look at Below link

    HDFS Architecture

  13. Actually your giving information is very useful information for me .so thank u for sharing blog.your information is very nice thank u and keep more updates
    informatica training in chennai

  14. Actually your giving information is very useful to me so keep more updates
    Business Tax Return
    Cpa Tax Accountant
    Tax Return Services

  15. A great content and very much useful to the visitors. Looking for more updates in future.

    Selenium Training in Chennai

  16. Its really an Excellent post. I just stumbled upon your blog and wanted to say that I have really enjoyed reading your blog. posts. I hope you post again soon.

    Selenium Training in Chennai


  17. This is a very interesting web page and I have enjoyed reading many of the articles and posts

    contained on the website, keep up the good work and hope to read some more interesting content in the

    PHP certfication in


  18. This is an awesome post. Really very informative and creative. This sharing concept is a good way to enhance the knowledge. Thank you very much for this post. I like this site very much. I like it and it help me to development very well...
    Software Testing Training in Chennai
    SEO Training in Chennai
    Informatica Training in Chennai
    Digital Marketing Training in Chennai

  19. very informative blog and useful article thank you for sharing with us
    Big data hadoop online Training Bangalore

  20. خدمات نقل وتخزين الاثاث
    تعرف شركة شراء اثاث مستعمل جدة
    ان الاثاث من اكثر الاشياء التي لها ثمن غالي ومكلف للغايةويحتاج الي عناية جيدة وشديدة لقيام بنقلة بطريقة غير مثالية وتعرضة للخدش او الكسر نحن في غني عنه فأن تلفيات الاثاث تؤدي الي التكاليف الباهظة نظرا لتكلفة الاثاث العالية كما انه يؤدي الي الحاجه الي تكلفة اضافية لشراء اثاث من جديد ،
    شركة شراء اثاث مستعمل بجدة
    ، ونظرا لان شركة نقل اثاث بجدة من الشركات التى تعلم جيدا حجم المشكلات والاضرار التى تحدث وهي ايضا من الشركات التى على دراية كاملة بكيفية الوصول الى افضل واحسن النتائج فى عملية النقل ،كل ماعليك ان تتعاون مع شركة شراء الاثاث المستعمل بجدة والاعتماد عليها بشكل كلي في عملية نقل الاثاث من اجل الحصول علي افضل النتائج المثالية في عمليات النقل
    من اهم الخدمات التي تقدمها شركة المستقبل في عملية النقل وتجعلك تضعها من
    ضمن اوائل الشركات هي :
    اعتماد شراء الاثاث المستعمل بجدة علي القيام بأعمال النقل علي عدة مراحل متميزة من اهما اثناء القيام بالنقل داخل المملكة او خارجها وهي مرحلة تصنيف الاثاث عن طريق المعاينة التي تتم من قبل الخبراء والفنين المتخصصين والتعرف علي اعداد القطع الموجودة من قطع خشبية او اجهزة كهربائية ا تحف او اثاث غرف وغيرهم.
    كما اننا نقوم بمرحلة فك الاثاث بعد ذلك وتعتمد شركتنا في هذة المرحلة علي اقوي الاساليب والطرق المستخدمة ويقوم بذلك العملية طاقم كبير من العمالة المتربة للقيام بأعمال الفك والتركيب.
    ارقام شراء الاثاث المستعمل بالرياضثم تأتي بعد ذلك مرحلة التغليف وهي من اهم المراحل التي تعمل علي الحفاظ علي اثاث منزلك وعلي كل قطعة به وتتم عملية التغليف بطريقة مميزة عن باقي الشركات.
    محلات شراء الاثاث المستعمل بالرياضويأتي بعد ذلك للمرحلة الاخيرة وهي نقل الاثاث وتركيبة ويتم اعتمادنا في عملية النقل علي اكبر الشاحنات المميزة التي تساعد علي الحفاظ علي كل قطع اثاثك اثناء عملية السير والنقل كما اننا لا نتطرق الي عمليات النقل التقليدية لخطورتها علي الاثاث وتعرضة للخدش والكسر .
    تخزين الاثاث بالرياض
    ارقام شراء الاثاث المستعمل بجدة
    تمتلك شركة المستقبل افضل واكبر المستودعات المميزة بجدة والتي تساعد علي تحقيق اعلي مستوي من الدقة والتميز فأذا كنت في حيرة من اتمام عملية النقل والتخزين فعليك الاستعانة بشركة نقل اثاث بجدة والاتصال بنا ارقام محلات شراء الاثاث المستعمل بجدة
    والتعاقد معنا للحصول علي كافة خدماتنا وعروضنا المقدمة بأفضل الاسعار المقدمة لعملائنا الكرام .

  21. Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updating Hadoop administration Online Training

  22. This is a great post. I like this topic.This site has lots of advantage. I found many interesting things from this site. It helps me in many ways.Thanks for posting this again.
    Best ERP Software
    ERP Software In India
    ERP Companies In Chennai
    ERP Software In Chennai

  23. nice course. thanks for sharing this post this post harried me a lot.
    Big Data Training in Gurgaon

  24. This comment has been removed by the author.

  25. This comment has been removed by the author.

  26. This comment has been removed by the author.

  27. This comment has been removed by the author.

  28. What a thrilling post, you have pointed out some excellent points, I as well believe this is a superb website. I have planned to visit it again and again.
    this content
    have a peek at these guys
    check my blog