Saturday, August 3, 2013

Setup Apache Hadoop on your machine (single-node cluster)

By had00b, 10:37 PM. Labels: hadoop, hdfs, linux, mac os x, pseudo-distributed, setup

Let's get your machine ready for some big data crunching! Installing Apache Hadoop on a single machine is very simple. Of course, the purpose of installing Hadoop on your machine is mainly for learning, developing and debugging. For production, you will want to deploy Hadoop in fully distributed mode on a cluster of machines. The fully distributed mode is not in the scope of this post.

Don't get discouraged by the length of this post: the whole procedure takes only a few minutes! There are only three steps to install Hadoop on a single machine. First, make sure Java (version 6 or later) is installed and that Hadoop knows where to find it. Second, set up your machine to accept SSH logins (this is needed for Hadoop's pseudo-distributed mode). Third, configure Hadoop. We will go through each of these steps in turn, pointing out the differences between UNIX operating systems.

Java

As mentioned, we need to install the Java SDK and have the environment variable JAVA_HOME point to a suitable Java installation. Usually, this variable is set in a shell startup file, such as ~/.bash_profile or ~/.bashrc (or ~/.zshrc if you use zsh as your shell). We will use ~/.bashrc in this tutorial. The location of the Java home varies depending on the system. In most cases this folder should contain a folder named include, which in turn contains the file jni.h.
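The jni.h hint above can be turned into a quick sanity check. The following is only a sketch: is_jdk_home is a helper name made up for this post, and a few systems keep jni.h elsewhere, so a negative answer is a hint rather than a verdict.

```shell
# Succeeds if the given directory looks like a JDK home, i.e. it
# contains include/jni.h. (Hypothetical helper, not part of Hadoop or Java.)
is_jdk_home() {
  [ -n "$1" ] && [ -f "$1/include/jni.h" ]
}

# Report on the current JAVA_HOME without failing the shell.
if is_jdk_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks fine: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not contain include/jni.h"
fi
```

You can run it again after completing the step for your operating system below.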

Mac OS X

Mac OS X comes with the Java 6 SDK by default. It is enough to set JAVA_HOME in ~/.bash_profile or ~/.bashrc. From a terminal, run the following two commands:

echo "export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home" >> ~/.bashrc
source ~/.bashrc

Ubuntu/Debian

Run the following commands to install Java 6 OpenJDK:

sudo apt-get update
sudo apt-get -y install openjdk-6-jdk openjdk-6-jre

The JAVA_HOME variable can be set as follows:

JAVA_BIN=`update-alternatives --list java | head -n 1`
echo "export JAVA_HOME=`dirname $( dirname $( dirname $JAVA_BIN ) )`" >> ~/.bashrc
source ~/.bashrc

CentOS/Red Hat

Run the following to install Java 6 OpenJDK:

sudo yum -y install java-1.6.0-openjdk java-1.6.0-openjdk-devel

This should make Java available at /usr/lib/jvm/java, which is going to be our JAVA_HOME:

echo "export JAVA_HOME=/usr/lib/jvm/java" >> ~/.bashrc
source ~/.bashrc
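All three variants above append to ~/.bashrc with echo, so running them twice leaves duplicate lines behind. If that bothers you, here is a small sketch of an idempotent append; append_once is a helper name invented for this example.

```shell
# Append a line to a file only if that exact line is not already there.
# Usage: append_once FILE LINE
# (Hypothetical helper; -x matches whole lines, -F treats LINE literally.)
append_once() {
  file=$1
  line=$2
  touch "$file"
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, append_once ~/.bashrc "export JAVA_HOME=/usr/lib/jvm/java" can be run any number of times and adds the line only once.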

SSH

Hadoop does not distinguish between fully-distributed mode (i.e. when deployed on a cluster) and pseudo-distributed mode (i.e. when installed on a single machine). It simply starts the required daemons on the machine(s) listed in the file $HADOOP_INSTALL/conf/slaves, by logging into these machines over SSH and starting the processes. By default, the slaves file contains only localhost (i.e. by default Hadoop is configured for single-machine mode), so we need to enable SSH login to our own machine.
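To make the role of the slaves file more concrete, here is a rough sketch of the idea: read the host list, then run a command on each host over ssh. This is not the actual Hadoop start script. The function names are made up, the remote command is only illustrative, and the sketch prints the ssh commands instead of running them.

```shell
# Print the hosts named in a slaves file, one per line, ignoring
# comments and blank lines.
# Usage: list_slaves FILE
list_slaves() {
  sed -e 's/#.*$//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Show, without executing anything, an ssh command for each host.
# "hadoop-daemon.sh start tasktracker" stands in for whatever the real
# scripts run remotely; treat it as an illustration only.
show_ssh_plan() {
  list_slaves "$1" | while read -r host; do
    echo "ssh $host hadoop-daemon.sh start tasktracker"
  done
}
```

With the default file, show_ssh_plan "$HADOOP_INSTALL/conf/slaves" prints a single line for localhost, which is exactly why the machine must accept SSH logins from itself.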

Mac OS X

Go into System Preferences -> Sharing and enable Remote Login for (at least) the current user. Then go to section SSH password-less login.

Ubuntu/Debian

Install ssh with the following command, then go to section SSH password-less login.

sudo apt-get install -y ssh

CentOS/Red Hat

Install ssh with the following command, then go to section SSH password-less login.

sudo yum -y install openssh-server openssh-clients

SSH password-less login

First of all, don't worry :) Password-less login does not mean that everybody can log in to your machine without a password. It simply means that we will set up your machine to log in to itself without a password.
To enable password-less login, generate a new SSH key with an empty passphrase, and add it to the authorized keys and known hosts:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh-keyscan -H localhost 0.0.0.0 >> ~/.ssh/known_hosts

Test this with:

ssh localhost

If everything was set up correctly, you should be logged in without having to type any password.
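If ssh localhost still asks for a password, a common cause is that ~/.ssh or authorized_keys has permissions that are too open. The sketch below tightens them and then tests the login non-interactively. Both function names are made up for this post; BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Tighten permissions on an SSH directory; sshd typically ignores
# authorized_keys when the directory or the file is writable by others.
# Usage: fix_ssh_perms [DIR]   (DIR defaults to ~/.ssh)
fix_ssh_perms() {
  dir=${1:-$HOME/.ssh}
  chmod 700 "$dir"
  if [ -f "$dir/authorized_keys" ]; then
    chmod 600 "$dir/authorized_keys"
  fi
}

# Succeeds only if key-based login works; BatchMode makes ssh fail
# instead of prompting for a password.
# Usage: can_login_without_password [HOST]   (HOST defaults to localhost)
can_login_without_password() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 "${1:-localhost}" true
}
```

A typical use is fix_ssh_perms followed by can_login_without_password && echo "password-less login works".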

Hadoop

Download a release of Hadoop here. The common question is "Which version??". In this post we assume we want to run MapReduce v1 (i.e. no YARN, as it's not production-ready yet). I suggest going either with version 0.20.2 (the last legacy version with all usable features), or with the latest 1.x version (1.2.1 at the time of writing this post). Despite the naming, the 1.x versions are stable continuations of the 0.20 branch; in fact, 1.0 is a simple renaming of 0.20.205. Once you have decided which release x.y.z to go with, navigate to the corresponding folder, download the file hadoop-x.y.z.tar.gz and unpack it somewhere in your filesystem:

tar xzf hadoop-x.y.z.tar.gz

It is useful to have an environment variable HADOOP_INSTALL pointing to the Hadoop installation folder and to add the Hadoop binary subfolder bin to the command-line path. The following commands assume Hadoop is in your home folder; replace x.y.z with your version.

echo "export HADOOP_INSTALL=~/hadoop-x.y.z" >> ~/.bashrc
echo "export PATH=\$PATH:\$HADOOP_INSTALL/bin" >> ~/.bashrc
source ~/.bashrc

At this point you should be able to run Hadoop. Test this with:

hadoop version

This step may be redundant, but it usually solves the "JAVA_HOME is not set" issue: set JAVA_HOME also in the file $HADOOP_INSTALL/conf/hadoop-env.sh. You can either edit the file yourself, looking for JAVA_HOME and setting it to the right value, or run the following:

echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_INSTALL/conf/hadoop-env.sh

Let's now move on to configuring Hadoop for pseudo-distributed mode. By default Hadoop is configured for standalone (sometimes called local) mode. In standalone mode, a submitted job is not actually executed by Hadoop's daemons, but rather by a MapReduce simulator (indeed, everything runs in a single JVM). While this can be useful for basic debugging, this mode does not exercise some other important Hadoop aspects that should be debugged, such as multiple reducers, or serialization between map and reduce. In pseudo-distributed mode, everything runs as in fully-distributed mode, except that the cluster has only one machine.

The files we are going to change are $HADOOP_INSTALL/conf/{mapred, core, hdfs}-site.xml. It is convenient to save the default files in case you later want to switch back to standalone mode.

mkdir $HADOOP_INSTALL/conf/standalone
cp $HADOOP_INSTALL/conf/*-site.xml $HADOOP_INSTALL/conf/standalone
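Switching back later is just the reverse copy. Here is a small sketch; use_conf is a helper name invented for this post, and it assumes the saved files live in a subfolder of the conf folder, as with standalone above.

```shell
# Copy a saved set of *-site.xml files back into the conf folder.
# Usage: use_conf CONF_DIR NAME   (NAME is a subfolder such as "standalone")
# (Hypothetical helper, not shipped with Hadoop.)
use_conf() {
  conf_dir=$1
  name=$2
  if [ ! -d "$conf_dir/$name" ]; then
    echo "no saved configuration named '$name' in $conf_dir" >&2
    return 1
  fi
  cp "$conf_dir/$name"/*-site.xml "$conf_dir"/
}
```

For example, stop the daemons and run use_conf "$HADOOP_INSTALL/conf" standalone. If you save your pseudo-distributed files the same way (say in a folder named pseudo, a name chosen here only for illustration), you can flip between the two modes.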

MapReduce framework

We start by modifying the file mapred-site.xml to instruct Hadoop to launch a JobTracker daemon, which basically implements the MapReduce framework. The file should set the property mapred.job.tracker to the host and port on which the JobTracker listens (its default value, local, makes jobs run in-process without a JobTracker).

This is actually enough to run jobs in pseudo-distributed mode. The question is whether you also want to use HDFS rather than your local filesystem (since you're running Hadoop on only one machine, the local filesystem will work fine). If you want to stick with your local filesystem, you can skip the following section and go directly to section Running the daemons.
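For the mapred-site.xml change described in this section, the following sketch writes a minimal file. The function name is made up, and the address localhost:9001 is an assumption (any free port should do); only the property name mapred.job.tracker is the standard MapReduce v1 setting.

```shell
# Write a minimal mapred-site.xml for pseudo-distributed mode.
# Usage: write_mapred_site CONF_DIR [HOST:PORT]
# The default address localhost:9001 is an assumption, not a requirement.
write_mapred_site() {
  conf_dir=$1
  tracker=${2:-localhost:9001}
  cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>$tracker</value>
  </property>
</configuration>
EOF
}
```

Call it as write_mapred_site "$HADOOP_INSTALL/conf" after saving the default files.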

HDFS

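If you want to run HDFS as well, first set up HDFS as Hadoop's default filesystem by setting the property fs.default.name in core-site.xml to an hdfs:// URI that points to localhost. Then set a block replication of 1 (property dfs.replication) in hdfs-site.xml, since there is only one DataNode to hold each block. Finally, initialize (format) HDFS: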

hadoop namenode -format

You can see that a folder /tmp/hadoop-${user.name}/dfs has been created. If you want to change the location where HDFS stores metadata and data, set the properties dfs.name.dir and dfs.data.dir in hdfs-site.xml (they are called dfs.namenode.name.dir and dfs.datanode.data.dir from Hadoop 2.x on) and re-format.
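The two HDFS configuration files from this section can be written with a sketch like the one below, to be run before formatting. The function name is made up and the URI hdfs://localhost:9000 is an assumption (the port is up to you); fs.default.name and dfs.replication are the Hadoop 1.x property names.

```shell
# Write minimal core-site.xml and hdfs-site.xml for a single-node HDFS.
# Usage: write_hdfs_conf CONF_DIR [FS_URI]
# The default URI hdfs://localhost:9000 is an assumption, not a requirement.
write_hdfs_conf() {
  conf_dir=$1
  fs_uri=${2:-hdfs://localhost:9000}
  # Make HDFS the default filesystem.
  cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>$fs_uri</value>
  </property>
</configuration>
EOF
  # One machine means one DataNode, so keep a single copy of each block.
  cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

Call it as write_hdfs_conf "$HADOOP_INSTALL/conf" and then run the format command.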

Running the daemons

If you decided to run Hadoop without HDFS then you can start the MapReduce daemons (JobTracker and TaskTracker) with the following command:

start-mapred.sh

You can access the JobTracker UI at http://localhost:50030.

If instead you decided to also use HDFS, you can start the HDFS and MapReduce daemons (NameNode, DataNode, JobTracker, TaskTracker) as follows:

start-dfs.sh
start-mapred.sh

You can access the NameNode UI at http://localhost:50070.
To stop the daemons run the corresponding stop-mapred.sh and stop-dfs.sh.
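Besides the web UIs, you can check that the daemons are really up with jps, the JDK tool that lists running Java processes. The sketch below reads jps output and prints the daemons that are missing; missing_daemons is a helper name invented for this post.

```shell
# Read `jps` output ("PID Name" per line) on stdin and print each
# expected daemon that is not running.
# Usage: jps | missing_daemons NameNode DataNode JobTracker TaskTracker
missing_daemons() {
  running=$(awk '{print $2}')
  for daemon in "$@"; do
    printf '%s\n' "$running" | grep -qx -- "$daemon" || echo "$daemon"
  done
}
```

If the command prints nothing, all the daemons you listed are running. If one is missing, its log file in $HADOOP_INSTALL/logs is the first place to look.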

A quick test

We will run a quick test that counts the word occurrences in a file. The Hadoop examples jar file contains several examples, including the classic word count. Running the following command:

hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar

returns as output:

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.

As input, we will use a plain text version of "Moby-Dick" by Herman Melville, downloadable from Project Gutenberg.

wget http://www.gutenberg.org/cache/epub/2489/pg2489.txt

If you have decided to use HDFS, copy the file to HDFS with:

hadoop fs -copyFromLocal pg2489.txt .

Now run the following command to run the word count job. You can check the progress on the JobTracker UI.

hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount pg2489.txt out

The output is going to be in the files part-r-0000* inside the out folder. If you're not using HDFS, that folder has been created in the folder from which you launched the command.

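Note that there's only one file, part-r-00000; this is because by default Hadoop uses a single reducer. If you want to use multiple reducers (say 2), you can modify the previous command as shown below. Hadoop refuses to write into an existing output folder, so first remove out (hadoop fs -rmr out on HDFS, rm -r out on the local filesystem) or choose a different output name: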

hadoop jar ${HADOOP_INSTALL}/hadoop-*examples*.jar wordcount -D mapred.reduce.tasks=2 pg2489.txt out

If you are not using HDFS, you can print the output content as follows:

cat out/part-r-00000

If you are using HDFS, you can use the following command:

hadoop fs -cat out/part-r-00000

Alternatively, you can copy the output folder to your local file-system:

hadoop fs -copyToLocal out .
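The raw output is one line per word, with the word and its count separated by a tab, in no useful order. Once the output is on your local filesystem, a sketch like the following shows the most frequent words; top_words is a helper name made up for this post.

```shell
# Print the N most frequent words from one or more wordcount output files.
# Each input line is expected to be "word<TAB>count".
# Usage: top_words N FILE...
top_words() {
  n=$1
  shift
  # Sort by count (numeric, descending), then by word to break ties.
  cat "$@" | sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}
```

For example, top_words 10 out/part-r-* also works when you used multiple reducers, since it reads all the part files.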


36 comments:

seizadi, October 30, 2013 at 12:02 PM
Thanks for your post it was a great write-up, especially for so many different platforms. I had some problems getting this to work on MacOS 10.9, here are some of the steps that I had to change to get it to work:
- in hadoop-env.sh changed export JAVA_HOME=$(/usr/libexec/java_home)
- in hadoop-env.sh set export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk -Djava.net.preferIPv4Stack=true"

This got rid of the errors and warnings and I was able to run sample examples. I also noted that many have increased the heap, I did this just to be safe in hadoop-env.sh:
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000