Friday, March 7, 2014

Setting up an Ubuntu virtual machine


  ·         Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads

·         Download cs246.vdi.tgz at

·         Download Cygwin at http://cygwin.com/install.html

·         Once all the downloads complete, open Cygwin and type tar -xvf cs246.vdi.tgz. It will generate cs246.vdi. The VDI file you obtained is a Linux virtual machine with a pre-configured Hadoop environment. If it does not work, download the VDI file directly from
  • Start VirtualBox and click New. Type any name you want for your virtual machine, such as "cs246". Choose Linux as the operating system to install and Ubuntu as the type of distribution. Set the memory size to at least 1024 MB. In the Hard Drive step, select the "Use an existing virtual hard drive" radio button, point to the provided cs246.vdi file, and click Create.
  • Your virtual machine should now appear in the left column. Select it and click Start to launch it. The username and password are both "cs246".
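
The GUI steps above can also be scripted with VirtualBox's VBoxManage command-line tool. This is only a sketch: the VM name, memory size, and the path to cs246.vdi are taken from the steps above, and option defaults can vary slightly between VirtualBox versions.

```shell
# Sketch of the VM-creation steps using the VBoxManage CLI.
# Assumes cs246.vdi is in the current directory; adjust paths as needed.
VBoxManage createvm --name cs246 --ostype Ubuntu --register
VBoxManage modifyvm cs246 --memory 1024

# Attach the provided disk image via a SATA controller.
VBoxManage storagectl cs246 --name "SATA" --add sata
VBoxManage storageattach cs246 --storagectl "SATA" --port 0 --device 0 \
    --type hdd --medium cs246.vdi

# Boot the machine.
VBoxManage startvm cs246
```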

The virtual machine includes the following software:
  • Ubuntu 12.04
  • JDK 7 (1.7.0_10)
  • Hadoop 1.0.4
  • Eclipse 4.2.1 (Juno)
Hadoop can be run in three modes.
1. Standalone (or local) mode: There are no daemons running in this mode. Hadoop
uses the local file system as a substitute for HDFS. If you run jps in
your terminal, there will be no JobTracker, NameNode, or other daemons running.
Jobs run as if there were one mapper and one reducer.

2. Pseudo-distributed mode: All the daemons run locally on a single machine, mimicking the behavior of a real cluster, and files are stored in HDFS. There can be multiple mappers and reducers.

3. Fully-distributed mode: This is how Hadoop runs on a real cluster.
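
For reference, pseudo-distributed mode is driven by a few configuration files under /usr/local/hadoop/conf. The values below are a typical Hadoop 1.x single-node setup (port 54310 matches the HDFS URL used later in this post); the provided VM may use slightly different ports, so treat this as a sketch rather than the VM's exact configuration.

```xml
<!-- conf/core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the JobTracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```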

To start Hadoop in pseudo-distributed mode:

$ sh /usr/local/hadoop/bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-namenode-cs246.out
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-datanode-cs246.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-secondarynamenode-cs246.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-jobtracker-cs246.out
localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-tasktracker-cs246.out
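
A quick sanity check after start-all.sh is to confirm the daemons came up. The web UI ports below are the Hadoop 1.x defaults (NameNode on 50070, JobTracker on 50030); the VM could be configured differently.

```shell
# jps should list NameNode, DataNode, SecondaryNameNode,
# JobTracker, and TaskTracker (plus Jps itself).
jps

# Probe the default Hadoop 1.x web UIs (ports are an assumption).
curl -s http://localhost:50070/ > /dev/null && echo "NameNode web UI is up"
curl -s http://localhost:50030/ > /dev/null && echo "JobTracker web UI is up"
```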


To stop Hadoop:

$ sh /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode


To view the files in HDFS:

$ hadoop fs -ls

Found 5 items
drwxr-xr-x   - cs246 supergroup          0 2013-02-08 16:51 /user/cs246/MaxTempDataHDFS
-rw-r--r--   1 cs246 supergroup    9582237 2013-02-14 20:14 /user/cs246/NYSEDATA_HDFS
drwxr-xr-x   - cs246 supergroup          0 2013-01-11 07:01 /user/cs246/dataset
drwxr-xr-x   - cs246 supergroup          0 2013-02-07 17:12 /user/cs246/hadoopDir
drwxr-xr-x   - cs246 supergroup          0 2013-01-11 07:04 /user/cs246/output
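
A few other hadoop fs subcommands are handy while exploring HDFS. A short sketch; the paths here are placeholders, so substitute ones that exist in your HDFS:

```shell
# Create a directory in HDFS.
hadoop fs -mkdir /user/cs246/scratch

# Peek at the first lines of a file stored in HDFS.
hadoop fs -cat /user/cs246/NYSEDATA_HDFS | head

# Show per-entry sizes under a directory.
hadoop fs -du /user/cs246

# Recursively delete a directory (Hadoop 1.x syntax).
hadoop fs -rmr /user/cs246/scratch
```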


To view the Hadoop daemons currently running:

$ jps

      12868 TaskTracker
      12555 SecondaryNameNode
      13318 Jps
      12116 NameNode
      12332 DataNode
      12649 JobTracker


To copy local data to HDFS:

$ hadoop fs -copyFromLocal /home/share/currency_dat.csv hdfs://localhost:54310/user/cs246/currency_data.csv

Note: ensure the Hadoop NameNode is up and running first.
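
The reverse direction works the same way. A sketch, assuming the file above was uploaded; the local destination path is an assumption:

```shell
# Copy the file back out of HDFS to the local filesystem.
hadoop fs -copyToLocal hdfs://localhost:54310/user/cs246/currency_data.csv \
    /tmp/currency_data.csv

# Equivalently: hadoop fs -get <hdfs-path> <local-path>
```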
