Tuesday, March 11, 2014

Setting Up RHadoop on Ubuntu 12.04

Prerequisites and installation of RHadoop

To set up RHadoop, you need an Ubuntu machine with Hadoop already configured in single-node or distributed mode. Refer to my earlier post on setting up Hadoop on Ubuntu.

Installing RHadoop

R and Hadoop complement each other very well; they are a natural match in big data analytics and visualization. One of the best-known R package collections providing Hadoop functionality is RHadoop, developed by Revolution Analytics.

The steps below install all the related packages on Ubuntu Precise Pangolin (12.04 LTS).

Type the command below in an Ubuntu terminal to check your release:

  cs246@cs246:~$ lsb_release -a

RHadoop consists of three R packages:

  • rmr2 – functions providing Hadoop MapReduce functionality in R.
  • rhdfs – functions providing file management of HDFS from within R.
  • rhbase – functions providing database management for the HBase distributed database from within R.
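
To give a flavour of what these packages offer, here is a minimal rhdfs sketch of HDFS file management from R; it will only run once the installation below is complete, and the paths are hypothetical:

 library(rhdfs)
 hdfs.init()                          # connect to the running HDFS
 hdfs.ls("/user/cs246")               # list a directory, like hadoop fs -ls
 hdfs.put("/home/cs246/data.csv",     # copy a local file into HDFS
          "/user/cs246/data.csv")
 hdfs.get("/user/cs246/data.csv",     # and copy it back out again
          "/home/cs246/copy.csv")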
To obtain the latest R packages, open a terminal and edit sources.list:

 sudo -s
 gedit /etc/apt/sources.list

Add an entry like

 deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/

to your /etc/apt/sources.list file, then save and quit.


If the file opens in read-only mode, use this command instead and make the required changes:

 gksudo gedit /etc/apt/sources.list

In the terminal, issue the following commands one by one:

 gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E084DAB9
 gpg -a --export E084DAB9 | sudo apt-key add -
 sudo apt-get update
 sudo apt-get install r-base
 sudo apt-get install r-base-dev
 sudo apt-get install r-base-core
 sudo apt-get install r-base-core-dbg
 sudo apt-get install r-mathlib
 sudo apt-get install r-recommended libxml2-dev

Partway through, you will be prompted with (Y/N); the default is N, which continues with the existing version of R. To update to the latest version of R, enter Y and continue.

We also need to install these packages along with their dependencies: rmr2 requires Rcpp, RJSONIO, digest, functional, stringr and plyr, while rhdfs requires rJava.
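
As an alternative to fetching each CRAN tarball by hand (as in the next step), the dependencies that live on CRAN can be installed from within R itself; a sketch, assuming your CRAN mirror is reachable:

 # run from a root R session (sudo R) so packages land in the system library
 install.packages(c("Rcpp", "RJSONIO", "digest", "functional", "stringr",
                    "plyr", "reshape2", "bitops", "codetools",
                    "iterators", "itertools", "rJava", "XML"))

rmr2, rhdfs and rhbase themselves are not on CRAN, so they still come from the RHadoop downloads page.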

In the terminal, change the working directory to /home/cs246/Downloads and execute the commands below one by one. Verify that each tar.gz file gets downloaded to /home/cs246/Downloads.

wget https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads/rhdfs_1.0.6.tar.gz
wget https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads/rhbase_1.2.0.tar.gz
wget https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads/rmr2_2.2.1.tar.gz
wget http://cran.r-project.org/src/contrib/RJSONIO_1.0-3.tar.gz
wget http://cran.r-project.org/src/contrib/XML_3.98-1.1.tar.gz
wget http://cran.r-project.org/src/contrib/rJava_0.9-6.tar.gz
wget http://cran.r-project.org/src/contrib/bitops_1.0-6.tar.gz
wget http://cran.r-project.org/src/contrib/functional_0.4.tar.gz
wget http://cran.r-project.org/src/contrib/plyr_1.8.1.tar.gz
wget http://cran.r-project.org/src/contrib/stringr_0.6.2.tar.gz
wget http://cran.r-project.org/src/contrib/reshape2_1.2.2.tar.gz
wget http://cran.r-project.org/src/contrib/Rcpp_0.11.0.tar.gz
wget http://cran.r-project.org/src/contrib/codetools_0.2-8.tar.gz
wget http://cran.r-project.org/src/contrib/digest_0.6.4.tar.gz
wget http://cran.r-project.org/src/contrib/iterators_1.0.6.tar.gz
wget http://cran.r-project.org/src/contrib/itertools_0.1-1.tar.gz

Set environment variables

Environment variables for Hadoop can be set with the export command in the terminal:


 export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
 export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar
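
rmr2 reads HADOOP_CMD and HADOOP_STREAMING from the environment, so any R session that will run MapReduce jobs must be started from a shell where these exports are active. A quick sanity check from inside R, assuming the exports above:

 Sys.getenv("HADOOP_CMD")        # should print /usr/local/hadoop/bin/hadoop
 Sys.getenv("HADOOP_STREAMING")  # should print the full streaming jar path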

Issue the following commands in the terminal one by one:

 sudo R CMD INSTALL iterators_1.0.6.tar.gz
 sudo R CMD INSTALL itertools_0.1-1.tar.gz
 sudo R CMD INSTALL digest_0.6.4.tar.gz
 sudo R CMD INSTALL RJSONIO_1.0-3.tar.gz
 sudo R CMD INSTALL codetools_0.2-8.tar.gz
 sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
 sudo R CMD INSTALL bitops_1.0-6.tar.gz
 sudo R CMD INSTALL functional_0.4.tar.gz
 sudo R CMD INSTALL plyr_1.8.1.tar.gz
 sudo R CMD INSTALL stringr_0.6.2.tar.gz
 sudo R CMD INSTALL reshape2_1.2.2.tar.gz
 sudo R CMD INSTALL -l /usr/lib/R/site-library rmr2_2.2.1.tar.gz
 sudo R CMD INSTALL XML_3.98-1.1.tar.gz

 sudo R CMD javareconf JAVA=/usr/lib/jvm/jdk1.7.0_10/jre/bin/java JAVA_HOME=/usr/lib/jvm/jdk1.7.0_10 JAVAC=/usr/lib/jvm/jdk1.7.0_10/bin/javac JAR=/usr/lib/jvm/jdk1.7.0_10/bin/jar JAVAH=/usr/lib/jvm/jdk1.7.0_10/bin/javah LD_LIBRARY_PATH=/usr/lib/jvm/jdk1.7.0_10/jre/lib/i386/client


Before installing rJava, make sure the JAVA_HOME path is set, then continue with:

 echo $JAVA_HOME   # must print /usr/lib/jvm/jdk1.7.0_10
 sudo R CMD INSTALL rJava_0.9-6.tar.gz
 sudo -E R CMD INSTALL -l /usr/lib/R/site-library rhdfs_1.0.6.tar.gz
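
To confirm that rJava picked up the JVM correctly before moving on, a quick check inside R (a sketch; any Java version string in the output means success):

 library(rJava)
 .jinit()   # starts the JVM from R; an error here means javareconf
            # or JAVA_HOME needs another look
 .jcall("java/lang/System", "S", "getProperty", "java.version")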

By now we are all set with the RHadoop setup on Ubuntu. Before we write our first MapReduce program in R, make sure the Hadoop NameNode is running.


Open a new terminal session and start Hadoop:

 $ sh /usr/local/hadoop/bin/start-all.sh

Switch back to the terminal where we installed RHadoop, type R to open an R session, and issue the following commands:

 Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
 Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
 library(rmr2)
 library(rhdfs)
 hdfs.init()                    # connect rhdfs to the running HDFS
 ints = to.dfs(1:100)           # write the integers 1..100 into HDFS
 calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
 from.dfs(calc)                 # read the job output back into R


If you see the result below, RHadoop is working for you:

 $val
   [1,]   1   2
   [2,]   2   4
   [3,]   3   6
   [4,]   4   8
   [5,]   5  10
   [6,]   6  12
   [7,]   7  14
   [8,]   8  16
   [9,]   9  18
  [10,]  10  20
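
With the pipeline verified, here is a slightly richer rmr2 sketch that exercises both a map and a reduce phase, a word count over a few lines of made-up sample text:

 library(rmr2)

 # push three lines of text into HDFS as the job input
 lines = to.dfs(c("big data with R", "R and hadoop", "big data"))

 wordcount = mapreduce(
   input  = lines,
   map    = function(k, v) {
     words = unlist(strsplit(v, " "))
     keyval(words, 1)              # emit one (word, 1) pair per word
   },
   reduce = function(word, counts) {
     keyval(word, sum(counts))     # add up the ones for each word
   })

 from.dfs(wordcount)               # $key holds the words, $val the counts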


Friday, March 7, 2014

Setting Up an Ubuntu Virtual Machine


  • Download and install VirtualBox on your machine: http://virtualbox.org/wiki/Downloads
  • Download cs246.vdi.tgz at
  • Download Cygwin (for Windows) at http://cygwin.com/install.html
  • Once all the downloads complete, open Cygwin and type tar -xvf cs246.vdi.tgz. It will generate cs246.vdi. The VDI file you obtain is a Linux virtual machine with a pre-configured Hadoop environment. If it does not work, download the 'vdi' file directly from
  • Start VirtualBox and click New. Type any name you want for your virtual machine, like "cs246". Choose Linux as the operating system and Ubuntu as the distribution. Set the memory size to at least 1024 MB. In the Hard Drive step, check the "Use an existing virtual hard drive" radio button, point it to the provided cs246.vdi file, and click Create.
  • Your virtual machine should now appear in the left column. Select it and click Start to launch it. The username and password are both "cs246".

The virtual machine includes the following software:
  • Ubuntu 12.04
  • JDK 7 (1.7.0_10)
  • Hadoop 1.0.4
  • Eclipse 4.2.1 (Juno)
Hadoop can be run in three modes.
1. Standalone (or local) mode: No daemons run in this mode. Hadoop uses the local file system as a substitute for HDFS. If you run jps in your terminal, you will see no JobTracker, NameNode, or other daemons. Jobs run as if there were one mapper and one reducer.

2. Pseudo-distributed mode: All the daemons run locally on a single machine, mimicking the behavior of a real cluster and speaking the HDFS protocol. There can be multiple mappers and reducers.

3. Fully-distributed mode: This is how Hadoop runs on a real cluster.

To start Hadoop in pseudo-distributed mode:

$ sh /usr/local/hadoop/bin/start-all.sh

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-namenode-cs246.out
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-datanode-cs246.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-secondarynamenode-cs246.out
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-jobtracker-cs246.out
localhost: starting tasktracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-cs246-tasktracker-cs246.out


To stop Hadoop:

$ sh /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode

localhost: stopping secondarynamenode


To view the files in HDFS:

$ hadoop fs -ls

Found 5 items
drwxr-xr-x   - cs246 supergroup          0 2013-02-08 16:51 /user/cs246/MaxTempDataHDFS
-rw-r--r--   1 cs246 supergroup    9582237 2013-02-14 20:14 /user/cs246/NYSEDATA_HDFS
drwxr-xr-x   - cs246 supergroup          0 2013-01-11 07:01 /user/cs246/dataset
drwxr-xr-x   - cs246 supergroup          0 2013-02-07 17:12 /user/cs246/hadoopDir
drwxr-xr-x   - cs246 supergroup          0 2013-01-11 07:04 /user/cs246/output


To view the status of the various Hadoop daemons:

$ jps
     
      12868 TaskTracker
      12555 SecondaryNameNode
      13318 Jps
      12116 NameNode
      12332 DataNode
      12649 JobTracker


Copy local data to HDFS:

$ hadoop fs -copyFromLocal /home/share/currency_dat.csv hdfs://localhost:54310/user/cs246/currency_data.csv

Note: Ensure the Hadoop NameNode is up and running.
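
If you have rhdfs installed (see the RHadoop post above), the same copy can be done from an R session; a sketch using the paths from the command above, assuming HADOOP_CMD is set:

 library(rhdfs)
 hdfs.init()
 # equivalent of hadoop fs -copyFromLocal <src> <dst>
 hdfs.put("/home/share/currency_dat.csv",
          "/user/cs246/currency_data.csv")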