Hadoop installation on Ubuntu 12.04 LTS

Hadoop cluster deployment
Three VMs on Windows Server 2008 R2 Hyper-V: adamsbx3, adamsbx5, adamsbx6
All machines have the same user account: adamslee
1.     Install Vim
2.     Install OpenSSH
$sudo apt-get install ssh
3.     On adamsbx3
$ssh-keygen -t rsa
Press Enter (accept the default key file location)
Press Enter (leave the passphrase empty)
The command above generates a key pair for the current user adamslee on host adamsbx3. The key pair is saved in /home/adamslee/.ssh/id_rsa, and the generated certificate and public key are stored in the same directory (here: /home/adamslee/.ssh) as the two files id_rsa and id_rsa.pub. Then copy the contents of id_rsa.pub to the end of the /home/adamslee/.ssh/authorized_keys file on every host (including adamsbx3 itself); if that file does not exist, create it manually.
Note: the contents of id_rsa.pub form one long line; when copying, do not drop characters or introduce extra line breaks.
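One way to do the copy without hand-editing is ssh-copy-id, run from adamsbx3 while password login still works (a convenience sketch; appending id_rsa.pub to authorized_keys by hand works just as well):
$ssh-copy-id adamslee@adamsbx3
$ssh-copy-id adamslee@adamsbx5
$ssh-copy-id adamslee@adamsbx6
Afterwards, ssh adamsbx5 (and the other hosts) should log in without prompting for a password.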
4.     Download the JDK 1.6 installer jdk-6u32-linux-x64.bin
$sudo cp Downloads/jdk-6u32-linux-x64.bin /usr/local/bin
5.     Install the JDK
$cd /usr/local/bin
$sudo chmod u+x jdk-6u32-linux-x64.bin
$sudo ./jdk-6u32-linux-x64.bin
6.     Create jdk.sh in /etc/profile.d/ with the following contents (a quick verification follows the script):
               #!/bin/sh
               pathmunge () {
                if ! echo $PATH | /bin/egrep -q "(^|:)$1($|:)" ; then
                  if [ "$2" = "after" ] ; then
                     PATH=$PATH:$1
                  else
                     PATH=$1:$PATH
                  fi
                fi
               }
               pathmunge /usr/local/bin/jdk1.6.0_32/bin
               export JAVA_HOME=/usr/local/bin/jdk1.6.0_32
               export CLASSPATH=.:$JAVA_HOME/lib
               export HADOOP_HOME=/usr/lib/hadoop
               unset pathmunge
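To check the result without logging out and back in, the new profile script can be sourced directly (a quick verification only):
$. /etc/profile.d/jdk.sh
$java -version
$echo $JAVA_HOME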
7.     PATH is not set in xrdp remote sessions, so for xrdp sessions add the following lines to ~/.bashrc:
export JAVA_HOME=/usr/local/bin/jdk1.6.0_32
export CLASSPATH=.:$JAVA_HOME/lib
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$JAVA_HOME/bin:$PATH
8.     JAVA_HOME is not preserved by sudo, so edit /etc/sudoers (with visudo) and add:
Defaults env_keep += "JAVA_HOME CLASSPATH HADOOP_HOME"
Defaults secure_path="/usr/local/bin/jdk1.6.0_32/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
9.     Create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:
deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
To find out which Ubuntu release you are running:
$cat /etc/lsb-release
Note that not every Ubuntu release is supported by this repository; if yours is not listed, pick the closest supported one.
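For example, Ubuntu 12.04 was not listed for CDH3 at the time, so an earlier supported codename has to be substituted for <RELEASE>; assuming lucid is the closest one offered by the archive (check the repository to confirm), the file would read:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib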
10.     Update the APT package index:
$ sudo apt-get update
11.     Find and install the Hadoop core and native packages by using your favorite APT package manager, such as apt-get, aptitude, or dselect. For example:
$ apt-cache search hadoop
$ sudo apt-get install hadoop-0.20 hadoop-0.20-native
If the install fails because of a missing libzip1, download the libzip1 .deb package (hadoop-0.20-native depends on libzip1) and install it:
$sudo dpkg -i libzip1.deb
12.     Install each type of daemon package on the appropriate machine. For example, install the NameNode package on your NameNode machine:
$ sudo apt-get install hadoop-0.20-<daemon type>
where <daemon type> is one of the following:
namenode
datanode
secondarynamenode
jobtracker
tasktracker
13.     On adamsbx3 (bx3), install the namenode and jobtracker packages
14.     On adamsbx5 and adamsbx6 (bx5, bx6), install the datanode and tasktracker packages
15.     On adamsbx6 (bx6), install the secondarynamenode package
16.     Copy the default configuration to your custom directory.
$ sudo cp -r  /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
17.     If the hosts cannot ping each other by name, update /etc/hosts on every machine
The original "127.0.0.1 adamsbxX" entry should be deleted so the host name resolves to the machine's real network address, and an entry for each node should be added.
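A sketch of the resulting /etc/hosts (the IP addresses below are placeholders; substitute the real ones for your network):
127.0.0.1       localhost
192.168.1.13    adamsbx3
192.168.1.15    adamsbx5
192.168.1.16    adamsbx6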
18.     Update the configuration files in /etc/hadoop-0.20/conf.my_cluster (a complete core-site.xml example follows this list):
•     core-site.xml
<name>fs.default.name</name><value>hdfs://adamsbx3:8020</value>
•     hdfs-site.xml
<name>dfs.name.dir</name><value>/data/1/dfs/nn</value>
<name>dfs.data.dir</name><value>/data/1/dfs/dn</value>
•     mapred-site.xml
<name>mapred.job.tracker</name><value>hdfs://adamsbx3:8021</value>
<name>mapred.local.dir</name><value>/data/1/mapred/local</value>
•     masters
adamsbx6
•     slaves
adamsbx5
adamsbx6
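Each name/value pair above goes inside a <property> element between the <configuration> and </configuration> tags of the corresponding file. As a sketch, core-site.xml with only the property listed above would be:
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://adamsbx3:8020</value>
  </property>
</configuration>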
19.     Create the local directories on every machine
$sudo mkdir -p /data/1/dfs/nn /data/1/dfs/dn /data/1/mapred/local
20.     Configure the owner of the dfs.name.dir and dfs.data.dir directories to be the hdfs user:
$ sudo chown -R hdfs:hadoop /data/1/dfs/nn /data/1/dfs/dn
21.     Configure the owner of the mapred.local.dir directory to be the mapred user:
$ sudo chown -R mapred:hadoop /data/1/mapred/local
22.     To activate the new configuration on Ubuntu and SUSE systems:
$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
23.     If you look at the /etc/hadoop-0.20/conf symlink now, it points to /etc/hadoop-0.20/conf.my_cluster. You can verify this by querying the alternatives. To query Hadoop configuration alternatives on Ubuntu and SUSE systems:
$ sudo update-alternatives --display hadoop-0.20-conf
24.     Copy the configuration directory to /etc/hadoop-0.20/conf.my_cluster on all of the other machines (run this on bx5 and bx6, pulling from the master node adamsbx3):
$scp -rp adamsbx3:/etc/hadoop-0.20/conf.my_cluster  /etc/hadoop-0.20/conf.my_cluster
25.     On all machines, add the alternatives rule:
$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
26.     Set JAVA_HOME
Hadoop's JAVA_HOME is set in /etc/hadoop/conf/hadoop-env.sh, as follows:
$sudo vi /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME="/usr/local/bin/jdk1.6.0_32"
27.     Format the NameNode
$sudo -u hdfs hadoop namenode -format
28.     Start HDFS
To start HDFS, start the NameNode, Secondary NameNode, and DataNode services.
On the NameNode:
$ sudo service hadoop-0.20-namenode start
On the Secondary NameNode:
$ sudo service hadoop-0.20-secondarynamenode start
On each DataNode:
$ sudo service hadoop-0.20-datanode start
29.     Verify that the services started
$sudo jps
If a service did not start successfully, check the logs under /var/log/hadoop/*
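For example, to look at the most recent NameNode log on adamsbx3 (the exact file names depend on the user and host name, so this is only a sketch):
$ls /var/log/hadoop/
$sudo tail -n 50 /var/log/hadoop/*namenode*.log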
30.     Create the HDFS /tmp Directory
Once HDFS is up and running, create the /tmp directory and set its permissions to 1777 (drwxrwxrwt), as follows:
$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
     Note
This is the root of hadoop.tmp.dir (/tmp/hadoop-${user.name} by default), which is used both for the local file system and HDFS.
31.     Create and Configure the mapred.system.dir Directory in HDFS
After you start HDFS and before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter and configure it to be owned by the mapred user. The mapred.system.dir parameter is represented by the following /mapred/system path example.
$ sudo -u hdfs hadoop fs -mkdir /mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
Here is a summary of the correct owner and permissions of the mapred.system.dir directory in HDFS:
Directory      Owner      Permissions
mapred.system.dir      mapred:hadoop      drwx------ (see footnote)
/ (root directory)      hdfs:hadoop      drwxr-xr-x
Footnote:
When starting up, MapReduce sets the permissions for the mapred.system.dir directory in HDFS, assuming the user mapred owns that directory.
Add the path for the mapred.system.dir directory to the conf/mapred-site.xml file.
32.     On the master node, add the following configuration to mapred-site.xml:
<property>
<name>mapred.system.dir</name>
<value>/mapred/system</value>
</property>
33.     Start the services automatically at boot
$ sudo chkconfig hadoop-0.20-namenode on 
$ sudo chkconfig hadoop-0.20-jobtracker on 
$ sudo chkconfig hadoop-0.20-secondarynamenode on 
$ sudo chkconfig hadoop-0.20-tasktracker on 
$ sudo chkconfig hadoop-0.20-datanode on
On Ubuntu and other Debian systems, you can install the sysv-rc-conf package to get the chkconfig command or use update-rc.d:
On the NameNode:
$ sudo update-rc.d hadoop-0.20-namenode defaults
On the JobTracker:
$ sudo update-rc.d hadoop-0.20-jobtracker defaults
On the Secondary NameNode:
$ sudo update-rc.d hadoop-0.20-secondarynamenode defaults
On each TaskTracker:
$ sudo update-rc.d hadoop-0.20-tasktracker defaults
On each DataNode:
$ sudo update-rc.d hadoop-0.20-datanode defaults
34.     Start the services manually
$ sudo service hadoop-0.20-namenode start 
$ sudo service hadoop-0.20-jobtracker start 
$ sudo service hadoop-0.20-secondarynamenode start 
$ sudo service hadoop-0.20-tasktracker start 
$ sudo service hadoop-0.20-datanode start
35.     You can now check the web interfaces
http://<namenode-ip>:50070   - for the HDFS overview
and
http://<jobtracker-ip>:50030  - for the MapReduce overview
36.     Test
$sudo -u hdfs hadoop fs -mkdir /input
$sudo -u hdfs hadoop fs -put /etc/hadoop-0.20/conf/*.xml /input
$sudo -u hdfs hadoop fs -ls /input
$sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar grep /input /output 'dfs[a-z.]+'
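To check the result of the example job (a quick verification; the job above writes its output to /output):
$sudo -u hdfs hadoop fs -ls /output
$sudo -u hdfs hadoop fs -cat '/output/part-*'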
37.     Installing HBase on bx3,bx5,bx6
$ sudo apt-get install hadoop-hbase
38.     Host Configuration Settings for HBase on bx3,bx5,bx6
Configuring the REST Port
You can use an init.d script, /etc/init.d/hadoop-hbase-rest, to start the REST server; for example:
/etc/init.d/hadoop-hbase-rest start
The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host.
If you need to change the port for the REST server, configure it in hbase-site.xml, for example:
<property>
  <name>hbase.rest.port</name>
  <value>60050</value>
</property>
39.     In the /etc/security/limits.conf file, add the following lines:
Note: Only the root user can edit this file.
hdfs  -       nofile  32768
hbase  -       nofile  32768
40.     To apply the changes in /etc/security/limits.conf on Ubuntu and other Debian systems, add the following line in the /etc/pam.d/common-session file:
session required  pam_limits.so
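To confirm the new limit is picked up (a quick check, assuming the hdfs account has a usable login shell; the same works for hbase):
$ sudo su - hdfs -c 'ulimit -n'
This should print 32768 after the change.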
41.     Using dfs.datanode.max.xcievers with HBase on bx3,bx5,bx6
A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound property is called dfs.datanode.max.xcievers (the property is spelled in the code exactly as shown here). Before loading, make sure you have configured the value for dfs.datanode.max.xcievers in the conf/hdfs-site.xml file to at least 4096 as shown below:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you don't change that value as described, strange failures can occur and an error message about exceeding the number of xcievers will be added to the DataNode logs. Other error messages about missing blocks may also be logged.
42.     Install hbase master on bx3
$ sudo apt-get install hadoop-hbase-master
43.     Modifying the HBase Configuration on bx3,bx5,bx6
Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags.
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://adamsbx3:8020/hbase</value>
</property>
44.     Update hbase-env.sh on bx3,bx5,bx6
Edit /etc/hbase/conf/hbase-env.sh and set the JAVA_HOME and HBASE_CLASSPATH variables:
        export JAVA_HOME=/usr/local/bin/jdk1.6.0_32
        export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/usr/lib/hadoop/conf
45.     Creating the /hbase Directory in HDFS
Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase so it does not have the required permissions to create a top level directory.
To create the /hbase directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase
46.     To install a ZooKeeper server on bx3,bx5,bx6:
$ sudo apt-get install hadoop-zookeeper-server
47.     Update the ZooKeeper configuration
$ sudo vi /usr/lib/zookeeper/conf/zoo.cfg
Update line
server.0=localhost:2888:3888
To
server.1=adamsbx3:2888:3888
server.2=adamsbx5:2888:3888
server.3=adamsbx6:2888:3888
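Each quorum member also needs a myid file inside the ZooKeeper data directory (the dataDir set in zoo.cfg) containing its own server number. A minimal sketch, assuming the data directory is /var/lib/zookeeper (adjust the path to whatever zoo.cfg actually uses):
On adamsbx3: $echo 1 | sudo tee /var/lib/zookeeper/myid
On adamsbx5: $echo 2 | sudo tee /var/lib/zookeeper/myid
On adamsbx6: $echo 3 | sudo tee /var/lib/zookeeper/myid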
48.     Install HBase Region Server on bx5 & bx6
The Region Server is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node.
$ sudo apt-get install hadoop-hbase-regionserver
49.     Update the HBase regionservers file on bx3, bx5, bx6
Update /usr/lib/hbase/conf/regionservers to contain:
adamsbx5
adamsbx6
50.     Installing the HBase Thrift Server
The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multi-platform and performs better than REST in many situations. Thrift can be run collocated along with the region servers, but should not be collocated with the NameNode or the JobTracker. For more information about Thrift, visit http://incubator.apache.org/thrift/.
To enable the HBase Thrift Server on Ubuntu and other Debian systems:
$ sudo apt-get install hadoop-hbase-thrift
51.     Update the ZooKeeper quorum configuration for HBase on bx3, bx5, bx6
Update ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>adamsbx3,adamsbx5,adamsbx6</value>
</property>
52.     To start the cluster, start the services in the following order:
The ZooKeeper Quorum Peer
The HBase Master
Each of the HBase Region Servers
After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master.
53.     To start the ZooKeeper server
     Note
ZooKeeper may start automatically on installation on Ubuntu and other Debian systems.
Use the following command to start ZooKeeper:
$ sudo /sbin/service hadoop-zookeeper-server start
54.     Starting the HBase Master on bx3
After ZooKeeper is running, you can start the HBase Master:
$ sudo /etc/init.d/hadoop-hbase-master start
55.     Test that HBase is running
$hbase shell
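Inside the shell, a quick smoke test (the table and column family names here are just examples):
hbase> create 'test', 'cf'
hbase> put 'test', 'row1', 'cf:a', 'value1'
hbase> scan 'test'
hbase> disable 'test'
hbase> drop 'test'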
56.     To start the Region Server:
$ sudo /etc/init.d/hadoop-hbase-regionserver start
57.     Verifying the Cluster Operation
After you have started ZooKeeper, the Master, and the Region Servers, the cluster should be up and running. You can verify that each of the daemons is running using the jps tool from the Oracle JDK:
$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain
You should also be able to navigate to http://adamsbx3:60010 (the HBase Master host) and verify that the region servers have registered with the master.
58.     Starting and Stopping Services
When starting and stopping services, you need to do it in the right order to make sure everything starts or stops cleanly.
Step 1: Perform a Graceful Cluster Shutdown
To shut HBase down gracefully, stop the Thrift server and clients, then stop the cluster.
Stop the Thrift server and clients
$sudo service hadoop-hbase-thrift stop
Stop the cluster.
Use the following command on the master node:
$sudo service hadoop-hbase-master stop
Use the following command on each node hosting a region server:
$sudo service hadoop-hbase-regionserver stop
This shuts down the master and the region servers gracefully.
Step 2: Stop the ZooKeeper Server
$ sudo service hadoop-zookeeper-server stop
     Note
Depending on your platform and release, you may need to use
$ sudo /sbin/service hadoop-zookeeper-server stop
or
$ sudo /sbin/service hadoop-zookeeper stop
START services in this order:
Order     Service     Comments and references
1     HDFS     Always start HDFS first (see Running Services)
2     MapReduce     Start MapReduce before Hive or Oozie (see Running Services)
3     ZooKeeper     Start ZooKeeper before starting HBase (see Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server; Installing ZooKeeper in a Production Environment)
4     HBase     See Starting the HBase Master; Deploying HBase in a Distributed Cluster
5     Hive     See Installing Hive
6     Oozie     See Starting the Oozie Server
STOP services in this order:
Order     Service     Instructions
1     Oozie     To stop the Oozie server, use one of the following commands, depending on how you installed Oozie.
Tarball installation:
sudo -u oozie /usr/lib/oozie/bin/oozie-stop.sh
RPM or Debian installation:
sudo /sbin/service oozie stop
2     Hive     To stop Hive, exit the Hive console and make sure no Hive scripts are running.
3     HBase     Stop the Thrift server and clients, then shut down the cluster. See Performing a Graceful Shutdown.
4     ZooKeeper     Stop HBase before stopping ZooKeeper. To stop the ZooKeeper server, use one of the following commands on each ZooKeeper node, depending on the platform and release:
sudo /sbin/service hadoop-zookeeper-server stop
or
sudo /sbin/service hadoop-zookeeper stop
5     MapReduce     Stop Hive and Oozie before stopping MapReduce. To stop MapReduce, stop the JobTracker service, and stop the TaskTracker on all nodes where it is running. Use the following commands:
sudo service hadoop-0.20-jobtracker stop
sudo service hadoop-0.20-tasktracker stop
6     HDFS     Always stop HDFS last. To stop HDFS:
On the NameNode:
sudo service hadoop-0.20-namenode stop
On the Secondary NameNode:
sudo service hadoop-0.20-secondarynamenode stop
On each DataNode:
sudo service hadoop-0.20-datanode stop
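As a convenience, the stop sequence above can be scripted from the master node. The sketch below assumes passwordless SSH from adamsbx3 to the slaves (set up in step 3) and that adamslee may run the service commands via sudo on each node; Hive and Oozie are not installed in this cluster, so only HBase, ZooKeeper, MapReduce, and HDFS are covered:
#!/bin/sh
# stop-cluster.sh - run on adamsbx3; stops services in the documented order
SLAVES="adamsbx5 adamsbx6"
# 1. HBase: Thrift servers (adjust to wherever Thrift actually runs), master, then region servers
for h in $SLAVES; do ssh -t $h sudo service hadoop-hbase-thrift stop; done
sudo service hadoop-hbase-master stop
for h in $SLAVES; do ssh -t $h sudo service hadoop-hbase-regionserver stop; done
# 2. ZooKeeper on every quorum node
for h in adamsbx3 $SLAVES; do ssh -t $h sudo service hadoop-zookeeper-server stop; done
# 3. MapReduce: JobTracker, then the TaskTrackers
sudo service hadoop-0.20-jobtracker stop
for h in $SLAVES; do ssh -t $h sudo service hadoop-0.20-tasktracker stop; done
# 4. HDFS last: NameNode, Secondary NameNode, then the DataNodes
sudo service hadoop-0.20-namenode stop
ssh -t adamsbx6 sudo service hadoop-0.20-secondarynamenode stop
for h in $SLAVES; do ssh -t $h sudo service hadoop-0.20-datanode stop; done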
