Prerequisites and installation of RHadoop
To set up RHadoop, you need an Ubuntu machine with Hadoop already configured in single-node or distributed mode. Refer to my earlier post to set up Hadoop on Ubuntu.
Installing RHadoop
R and Hadoop complement each other very well; they are a natural match for big data analytics and visualization. One of the best-known R packages supporting Hadoop functionality is RHadoop, developed by Revolution Analytics.
The steps shown below install all the related packages on Ubuntu Precise Pangolin (12.04 LTS).
Type the command below at the Ubuntu terminal to check the release:
cs246@cs246:~$ lsb_release -a
RHadoop includes the following packages:
- rmr2: functions providing Hadoop MapReduce functionality in R.
- rhdfs: functions providing file management of HDFS from within R.
- rhbase: functions providing database management for the HBase distributed database from within R.
Open the package sources list as root:
sudo -s
gedit /etc/apt/sources.list
Add an entry like
deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/
to your /etc/apt/sources.list file, then save and quit.
If the file opens in read-only mode, use this command instead and make the required changes:
gksudo gedit /etc/apt/sources.list
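The same entry can also be appended from the terminal instead of through gedit. The sketch below is one way to do it, assuming a POSIX shell; add_repo_line is a hypothetical helper name used only for illustration and is not part of apt.

```shell
# add_repo_line FILE LINE  (hypothetical helper)
# Append LINE to FILE only if FILE does not already contain that exact line,
# so running it twice does not create a duplicate entry.
add_repo_line() {
    file="$1"
    line="$2"
    # -x matches whole lines, -F treats the pattern as a fixed string, -q is quiet
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example, from a root shell (for instance after `sudo -s`):
# add_repo_line /etc/apt/sources.list "deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/"
```

Checking for the line before appending keeps the file clean if the setup steps are repeated.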
In the terminal, issue the following commands one by one:
gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base
sudo apt-get install r-base-dev
sudo apt-get install r-base-core
sudo apt-get install r-base-core-dbg
sudo apt-get install r-mathlib
sudo apt-get install r-recommended libxml2-dev
Partway through, you will be prompted to enter Y or N. The default, N, keeps the existing version of R; to update to the latest version of R, enter Y and continue.
We need to install the packages together with their dependencies: rmr2 requires Rcpp, RJSONIO, digest, functional, stringr and plyr, while rhdfs requires rJava.
wget https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads/rhdfs_1.0.6.tar.gz
wget https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads/rhbase_1.2.0.tar.gz
wget http://cran.r-project.org/src/contrib/RJSONIO_1.0-3.tar.gz
wget http://cran.r-project.org/src/contrib/XML_3.98-1.1.tar.gz
wget http://cran.r-project.org/src/contrib/rJava_0.9-6.tar.gz
wget http://cran.r-project.org/src/contrib/bitops_1.0-6.tar.gz
wget http://cran.r-project.org/src/contrib/functional_0.4.tar.gz
wget http://cran.r-project.org/src/contrib/plyr_1.8.1.tar.gz
wget http://cran.r-project.org/src/contrib/stringr_0.6.2.tar.gz
wget http://cran.r-project.org/src/contrib/reshape2_1.2.2.tar.gz
wget http://cran.r-project.org/src/contrib/Rcpp_0.11.0.tar.gz
wget http://cran.r-project.org/src/contrib/codetools_0.2-8.tar.gz
wget http://cran.r-project.org/src/contrib/digest_0.6.4.tar.gz
wget http://cran.r-project.org/src/contrib/iterators_1.0.6.tar.gz
wget http://cran.r-project.org/src/contrib/itertools_0.1-1.tar.gz
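The CRAN downloads above all follow the same URL pattern, so they can be driven from a list of package names and versions. This is a minimal sketch assuming a POSIX shell with wget installed; cran_url and fetch_all are hypothetical helper names, and the URL pattern is simply the one used in the commands above.

```shell
# cran_url NAME VERSION  (hypothetical helper)
# Print the source tarball URL in the form used by the wget commands above.
cran_url() {
    printf 'http://cran.r-project.org/src/contrib/%s_%s.tar.gz\n' "$1" "$2"
}

# fetch_all  (hypothetical helper)
# Read "name version" pairs from stdin and download each tarball, skipping
# files that are already present in the current directory.
fetch_all() {
    while read -r name version; do
        [ -n "$name" ] || continue          # ignore blank lines
        file="${name}_${version}.tar.gz"
        [ -f "$file" ] || wget "$(cran_url "$name" "$version")"
    done
}

# Example:
# printf 'Rcpp 0.11.0\ndigest 0.6.4\nplyr 1.8.1\n' | fetch_all
```

Skipping files that already exist makes it safe to rerun the download step after an interrupted session.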
Set environment variables
Environment variables for Hadoop can be set with the "export" command in the terminal:
export HADOOP_CMD=/usr/local/hadoop/bin/hadoop
export HADOOP_STREAMING=/usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.4.jar
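Variables set with export last only for the current terminal session, so it can help to confirm they are present and point at real files before moving on. The function below is a small sketch of such a check, assuming a POSIX shell; check_hadoop_env is a hypothetical name.

```shell
# check_hadoop_env  (hypothetical helper)
# Report whether HADOOP_CMD and HADOOP_STREAMING are set and point to existing
# files. Returns non-zero if either one is unset or missing.
check_hadoop_env() {
    rc=0
    for var in HADOOP_CMD HADOOP_STREAMING; do
        eval "value=\${$var:-}"             # indirect lookup of the variable named in $var
        if [ -z "$value" ]; then
            echo "$var is not set"
            rc=1
        elif [ ! -f "$value" ]; then
            echo "$var points to a missing file: $value"
            rc=1
        else
            echo "$var OK: $value"
        fi
    done
    return "$rc"
}

# Example:
# check_hadoop_env || echo "fix the exports before continuing"
```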
Issue the following commands at the terminal one by one:
sudo R CMD INSTALL iterators_1.0.6.tar.gz
sudo R CMD INSTALL itertools_0.1-1.tar.gz
sudo R CMD INSTALL digest_0.6.4.tar.gz
sudo R CMD INSTALL RJSONIO_1.0-3.tar.gz
sudo R CMD INSTALL
codetools_0.2-8.tar.gz
sudo R CMD INSTALL Rcpp_0.11.0.tar.gz
sudo R CMD INSTALL bitops_1.0-6.tar.gz
sudo R CMD INSTALL functional_0.4.tar.gz
sudo R CMD INSTALL plyr_1.8.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL reshape2_1.2.2.tar.gz
sudo R CMD INSTALL rmr2_2.2.1.tar.gz
/usr/lib/R/site-library
sudo R CMD INSTALL XML_3.98-1.1.tar.gz
sudo R CMD javareconf JAVA=/usr/lib/jvm/jdk1.7.0_10/jre/bin/java \
  JAVA_HOME=/usr/lib/jvm/jdk1.7.0_10 \
  JAVAC=/usr/lib/jvm/jdk1.7.0_10/bin/javac \
  JAR=/usr/lib/jvm/jdk1.7.0_10/bin/jar \
  JAVAH=/usr/lib/jvm/jdk1.7.0_10/bin/javah \
  LD_LIBRARY_PATH=/usr/lib/jvm/jdk1.7.0_10/jre/lib/i386/client
Before installing rJava, make sure the JAVA_HOME path is set: the command echo $JAVA_HOME must return /usr/lib/jvm/jdk1.7.0_10.
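The JAVA_HOME check can be made a little stricter than a plain echo by also confirming that a compiler is present under that directory, since the javareconf step above points JAVAC at $JAVA_HOME/bin/javac. A sketch, assuming a POSIX shell; check_java_home is a hypothetical name.

```shell
# check_java_home  (hypothetical helper)
# Succeed only if JAVA_HOME is set and contains an executable bin/javac.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/javac" ]; then
        echo "no javac found under $JAVA_HOME/bin" >&2
        return 1
    fi
    echo "JAVA_HOME=$JAVA_HOME"
}

# Example:
# check_java_home && echo "ready to install rJava"
```

Once the check passes, the rJava and rhdfs installs that follow can proceed.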
sudo R CMD INSTALL rJava_0.9-6.tar.gz
sudo -E R CMD INSTALL rhdfs_1.0.6.tar.gz
/usr/lib/R/site-library
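The install commands above run in a fixed order so that each package finds its dependencies already in place. For repeated setups, that order can be kept in a list and processed in a loop that stops at the first problem. This is only a sketch, assuming a POSIX shell; install_in_order and the INSTALL_CMD override are hypothetical names introduced here, with the default being the sudo R CMD INSTALL command used above.

```shell
# install_in_order  (hypothetical helper)
# Read tarball names from stdin, one per line, and install them in that order.
# Stops at the first missing file or failed install.
# INSTALL_CMD is a hypothetical override; by default the command used above runs.
install_in_order() {
    while read -r tarball; do
        [ -n "$tarball" ] || continue       # ignore blank lines
        if [ ! -f "$tarball" ]; then
            echo "missing: $tarball" >&2
            return 1
        fi
        # </dev/null keeps the installer from consuming the list on stdin
        ${INSTALL_CMD:-sudo R CMD INSTALL} "$tarball" < /dev/null || return 1
    done
}

# Example:
# printf 'Rcpp_0.11.0.tar.gz\nplyr_1.8.1.tar.gz\nstringr_0.6.2.tar.gz\n' | install_in_order
```

Stopping at the first failure avoids a cascade of confusing errors from packages whose dependencies did not install.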
RHadoop is now set up on Ubuntu. Before we write our first MapReduce program in R, make sure the Hadoop NameNode is running.
Open a new terminal session and start Hadoop:
$ sh /usr/local/hadoop/bin/start-all.sh
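One way to confirm that the NameNode actually came up is to look for it in the output of jps, the JDK tool that lists running Java processes. The sketch below assumes jps is on the PATH and that the daemon appears under the process name NameNode; has_namenode is a hypothetical helper name.

```shell
# has_namenode  (hypothetical helper)
# Read `jps` output ("<pid> <name>" per line) on stdin and succeed if a
# process named exactly NameNode is listed. SecondaryNameNode does not count.
has_namenode() {
    awk '$2 == "NameNode" { found = 1 } END { exit !found }'
}

# Example:
# jps | has_namenode && echo "NameNode is running"
```

Comparing the whole process name, rather than searching for a substring, avoids a false positive when only the SecondaryNameNode is running.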
Switch back to the terminal where we installed RHadoop, type R to open an R session, and issue the following commands:
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
hdfs.init()
ints = to.dfs(1:100)
calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
from.dfs(calc)
If you see the result below, RHadoop is working for you.
$val
[1,] 1 2
[2,] 2 4
[3,] 3 6
[4,] 4 8
[5,] 5 10
[6,] 6 12
[7,] 7 14
[8,] 8 16
[9,] 9 18
[10,] 10 20