Prediction (4) Logistic Regression - Local Cluster Set Up
1. Try to Set Up Hadoop
Download the right version
> wget http://apache.spinellicreations.com/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
Unpack it, place it in the right location, and soft-link the directory.
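A minimal sketch of that step, assuming /opt as the install root and that the tarball sits in the current directory:
> tar -xzf hadoop-2.7.1.tar.gz
> sudo mv hadoop-2.7.1 /opt/hadoop-2.7.1
> sudo ln -s /opt/hadoop-2.7.1 /opt/hadoop
With /opt/hadoop/bin on the PATH, verify the install: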
> hadoop version
Hadoop 2.7.1
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a
Compiled by jenkins on 2015-06-29T06:04Z
Compiled with protoc 2.5.0
From source with checksum fc0a1a23fc1868e4d5ee7fa2b28a58a
Set up the Cluster
> mkdir /opt/hadoop/temp
Configure core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ubuntu-master:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/temp</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>
> mkdir /opt/hadoop/dfs
> mkdir /opt/hadoop/dfs/name
> mkdir /opt/hadoop/dfs/data
Configure hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>ubuntu-master:9001</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
> mv mapred-site.xml.template mapred-site.xml
Configure mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>ubuntu-master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>ubuntu-master:19888</value>
  </property>
</configuration>
Configure yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ubuntu-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ubuntu-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ubuntu-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ubuntu-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ubuntu-master:8088</value>
  </property>
</configuration>
Configure the slaves file (etc/hadoop/slaves)
ubuntu-dev1
ubuntu-dev2
ubuntu-dev3
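All of these files refer to the nodes by hostname, so every machine must be able to resolve them; a typical /etc/hosts sketch (the addresses here are placeholders, use the real LAN addresses):
192.168.0.10 ubuntu-master
192.168.0.11 ubuntu-dev1
192.168.0.12 ubuntu-dev2
192.168.0.13 ubuntu-dev3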
Prepare the 3 slave machines if needed.
> mkdir ~/.ssh
> vi ~/.ssh/authorized_keys
Copy the master's public key there; the content comes from cat ~/.ssh/id_rsa.pub on the master.
Then scp all the config files to every slave machine.
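A sketch of that key setup and distribution, assuming the same user (carl) on every box and the same /opt/hadoop layout everywhere:
> ssh-keygen -t rsa
> for host in ubuntu-dev1 ubuntu-dev2 ubuntu-dev3; do ssh-copy-id carl@$host; done
> for host in ubuntu-dev1 ubuntu-dev2 ubuntu-dev3; do scp /opt/hadoop/etc/hadoop/* carl@$host:/opt/hadoop/etc/hadoop/; done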
Start Hadoop HDFS and YARN
The same commands will start Hadoop on every subsequent run.
> cd /opt/hadoop
> sbin/start-dfs.sh
> sbin/start-yarn.sh
Visit the web UIs:
http://ubuntu-master:50070/dfshealth.html#tab-overview
http://ubuntu-master:8088/cluster
Error Message:
> sbin/start-dfs.sh
Starting namenodes on [ubuntu-master]
ubuntu-master: Error: JAVA_HOME is not set and could not be found.
ubuntu-dev1: Error: JAVA_HOME is not set and could not be found.
ubuntu-dev2: Error: JAVA_HOME is not set and could not be found.
Solution:
> vi etc/hadoop/hadoop-env.sh
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
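If the JVM path is unclear, resolving the java binary on each machine will show it (java-8-oracle matches the Oracle JDK package used here; OpenJDK installs land under a different /usr/lib/jvm directory):
> readlink -f $(which java)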
Error Message:
2015-09-30 19:39:49,482 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /opt/hadoop/dfs/name/in_use.lock acquired by nodename [email protected]
2015-09-30 19:39:49,487 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:225)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:975)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:681)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
Solution:
The NameNode has never been formatted. Format it on the master, then start the daemons again.
> hdfs namenode -format
Cool, everything is up and running for the YARN cluster.
2. Try to Set Up Spark 1.5.0
Fetch the latest Spark
> wget http://apache.mirrors.ionfish.org/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
Unzip it and place it in the right working directory.
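A sketch, assuming the same /opt layout used for Hadoop (this also matches the SPARK_HOME=/opt/spark setting used later):
> tar -xzf spark-1.5.0-bin-hadoop2.6.tgz
> sudo mv spark-1.5.0-bin-hadoop2.6 /opt/spark-1.5.0
> sudo ln -s /opt/spark-1.5.0 /opt/spark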
3. Try to Set Up Zeppelin
Fetch the source code first.
> git clone https://github.com/apache/incubator-zeppelin.git
> npm install -g grunt-cli
> grunt --version
grunt-cli v0.1.13
> mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests
Exception:
[ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:0.0.23:grunt (grunt build) on project zeppelin-web: Failed to run task: 'grunt --no-color' failed. (error code 3) -> [Help 1]
INFO [launcher]: Trying to start PhantomJS again (1/2).
ERROR [launcher]: Cannot start PhantomJS
INFO [launcher]: Trying to start PhantomJS again (2/2).
ERROR [launcher]: Cannot start PhantomJS
ERROR [launcher]: PhantomJS failed 2 times (cannot start). Giving up.
Warning: Task "karma:unit" failed. Use --force to continue.
Solution:
> cd /home/carl/install/incubator-zeppelin/zeppelin-web
> mvn clean install
This surfaces the underlying exceptions in detail: PhantomJS is not installed.
Install PhantomJS
http://sillycat.iteye.com/blog/1874971
Build own PhantomJS from source
http://phantomjs.org/build.html
Or find an older version from here
https://code.google.com/p/phantomjs/downloads/list
Download the right version
> wget https://phantomjs.googlecode.com/files/phantomjs-1.9.2-linux-x86_64.tar.bz2
> bzip2 -d phantomjs-1.9.2-linux-x86_64.tar.bz2
> tar -xvf phantomjs-1.9.2-linux-x86_64.tar
Move it to the proper directory, add it to the PATH, and verify the installation.
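Roughly like this, assuming the same tool-directory-plus-symlink convention used for the other packages here (the exact paths are an assumption):
> mv phantomjs-1.9.2-linux-x86_64 /home/carl/tool/phantomjs-1.9.2
> sudo ln -s /home/carl/tool/phantomjs-1.9.2 /opt/phantomjs
> export PATH=/opt/phantomjs/bin:$PATH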
Error Message:
> phantomjs --version
phantomjs: error while loading shared libraries: libfontconfig.so.1: cannot open shared object file: No such file or directory
Solution:
> sudo apt-get install libfontconfig
It works.
> phantomjs --version
1.9.2
Build Success.
4. Configure Spark and Zeppelin
Set Up Zeppelin
> cp zeppelin-env.sh.template zeppelin-env.sh
> cp zeppelin-site.xml.template zeppelin-site.xml
> vi zeppelin-env.sh
export MASTER="yarn-client"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop/"
export SPARK_HOME="/opt/spark"
. ${SPARK_HOME}/conf/spark-env.sh
export ZEPPELIN_CLASSPATH="${SPARK_CLASSPATH}"
Set Up Spark
> cp spark-env.sh.template spark-env.sh
> vi spark-env.sh
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
export SPARK_WORKER_MEMORY=768m
export SPARK_JAVA_OPTS="-Dbuild.env=lmm.sparkvm"
export USER=carl
Rebuild and set up Zeppelin.
> mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -Dhadoop.version=2.7.0 -Phadoop-2.6 -Pyarn -DskipTests -P build-distr
The final tar.gz file will be here:
/home/carl/install/incubator-zeppelin-0.6.0/zeppelin-distribution/target
> mv zeppelin-0.6.0-incubating-SNAPSHOT /home/carl/tool/zeppelin-0.6.0
> sudo ln -s /home/carl/tool/zeppelin-0.6.0 /opt/zeppelin
Start the Server
> bin/zeppelin-daemon.sh start
Visit the Zeppelin
http://ubuntu-master:8080/#/
Exception:
Found both spark.driver.extraJavaOptions and SPARK_JAVA_OPTS. Use only the former.
Solution:
Drop SPARK_JAVA_OPTS from spark-env.sh and pass the options through the Zeppelin and Spark daemon settings instead.
Zeppelin Configuration
export ZEPPELIN_JAVA_OPTS="-Dspark.akka.frameSize=100 -Dspark.jars=/home/hadoop/spark-seed-assembly-0.0.1.jar"
Spark Configuration
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70"
export SPARK_LOCAL_DIRS=/opt/spark
export SPARK_LOG_DIR=/var/log/apps
export SPARK_CLASSPATH="/opt/spark/conf:/home/hadoop/conf:/opt/spark/classpath/emr/*:/opt/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar"
References:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
zeppelin
http://sillycat.iteye.com/blog/2216604
http://sillycat.iteye.com/blog/2223622
https://github.com/apache/incubator-zeppelin
hadoop
http://sillycat.iteye.com/blog/2242559
http://sillycat.iteye.com/blog/2193762
http://sillycat.iteye.com/blog/2103457
http://sillycat.iteye.com/blog/2084169
http://sillycat.iteye.com/blog/2090186