[Ubuntu] Install Hadoop 3.0.0 & Hive on Ubuntu 16.04

What’s Hadoop?

Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

What’s Hive?

The Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

Installing Hadoop 3.0.0

Pre-installation Setup for Hadoop

$ sudo apt-get update
$ sudo apt-get install -y default-jdk ssh
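
To confirm both packages installed correctly, check their versions (the exact build strings vary by machine):

$ java -version
$ ssh -V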

Downloading Hadoop

$ su => Important: the following commands must be run as root
$ wget http://www-eu.apache.org/dist/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
$ tar -xzvf hadoop-3.0.0.tar.gz
$ sudo mv hadoop-3.0.0 /usr/local/hadoop
$ rm -rf hadoop-3.0.0.tar.gz
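
As a quick sanity check, list the installation directory; you should see the usual release layout with bin, etc, sbin, and share:

$ ls /usr/local/hadoop

bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share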

SSH Setup and Key Generation

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
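
Verify that you can now ssh to localhost without being asked for a password:

$ ssh localhost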

Setting JAVA_HOME and Hadoop path

Next, export the environment variables by adding the following lines to ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

$ source ~/.bashrc

Check version

$ hadoop version

Hadoop 3.0.0
Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533
Compiled by andrew on 2017-12-08T19:16Z
Compiled with protoc 2.5.0
From source with checksum 397832cb5529187dc8cd74ad54ff22
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.0.0.jar

Pseudo-Distributed Operation

Export user environment variables

Hadoop 3 refuses to start the HDFS and YARN daemons as root unless the corresponding user variables are defined, so add these to ~/.bashrc as well:

export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_SECONDARYNAMENODE_USER="root"
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"

$ source ~/.bashrc
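
Alternatively, these variables can live in Hadoop's own environment file instead of ~/.bashrc: the start-up scripts source /usr/local/hadoop/etc/hadoop/hadoop-env.sh, so appending the same export lines there works too, for example:

$ echo 'export HDFS_NAMENODE_USER="root"' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh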

Configuration

Modify the following config files:
/usr/local/hadoop/etc/hadoop/core-site.xml:

<configuration>
     <property>
         <name>fs.defaultFS</name>
         <value>hdfs://0.0.0.0:9000</value>
     </property>
</configuration>
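
You can confirm Hadoop picks the value up (hdfs getconf reads the config files directly, so the daemons do not need to be running):

$ hdfs getconf -confKey fs.defaultFS

hdfs://0.0.0.0:9000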

/usr/local/hadoop/etc/hadoop/hdfs-site.xml:

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>
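
Note that by default HDFS stores its data under hadoop.tmp.dir (/tmp/hadoop-${user.name}), which is cleared on reboot. If you want the filesystem to survive restarts, you can optionally add persistent storage directories here; the /usr/local/hadoop/data paths below are just an example location:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/datanode</value>
</property>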

/usr/local/hadoop/etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
    </property>
</configuration>

/usr/local/hadoop/etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
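
The official single-node guide also whitelists the environment variables that containers may inherit. If MapReduce jobs fail with environment-related errors, add this property to yarn-site.xml as well:

<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>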

Execution

Format a new distributed filesystem:

$ hdfs namenode -format

WARNING: /usr/local/hadoop/logs does not exist. Creating.
2018-02-12 21:10:29,281 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = 63dd4f2f9f69/172.17.0.2
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.0.0
...
...
...
STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533; compiled by 'andrew' on 2017-12-08T19:16Z
STARTUP_MSG: java = 1.8.0_151
************************************************************/
...
...
...
2018-02-12 21:10:31,793 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at 63dd4f2f9f69/172.17.0.2
************************************************************/

Start the NameNode and DataNode daemons:

$ start-dfs.sh

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ubuntu]

Start the ResourceManager and NodeManager daemons:

$ start-yarn.sh
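
You can verify that all five daemons came up with jps (the process IDs will differ on your machine):

$ jps

2385 NameNode
2520 DataNode
2744 SecondaryNameNode
3171 ResourceManager
3298 NodeManager
3621 Jps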

Then you can browse the NameNode web UI at http://localhost:9870 and the ResourceManager web UI at http://localhost:8088.

[NOTE] If you run into errors, take a look at the links below.
If the error is localhost: ssh: connect to host localhost port 22: Cannot assign requested address:
https://www.electricmonk.nl/log/2014/09/24/ssh-port-forwarding-bind-cannot-assign-requested-address/

If the connection is refused:
linux - connect to host localhost port 22: Connection refused
https://stackoverflow.com/questions/17335728/connect-to-host-localhost-port-22-connection-refused

Hadoop “Unable to load native-hadoop library for your platform” warning
https://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-warning

Inserting Data into HDFS

Create the input folder (the -p flag also creates the parent /user/root, which does not exist yet):

$ hdfs dfs -mkdir -p /user/root/input

Copy the Hadoop configuration directory into it:

$ hdfs dfs -put $HADOOP_HOME/etc/hadoop /user/root/input

List files:

$ hdfs dfs -ls /user/root/input

Found 1 items
drwxr-xr-x - root supergroup 0 2018-02-13 17:41 /user/root/input/hadoop

Run the MapReduce grep example:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar grep input output 'dfs[a-z.]+'

2018-02-13 17:43:14,113 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-02-13 17:43:16,149 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1518495207253_0001
2018-02-13 17:43:16,517 INFO input.FileInputFormat: Total input files to process : 1
2018-02-13 17:43:16,633 INFO mapreduce.JobSubmitter: number of splits:1
...
...
...
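
When the job completes, the matched patterns and their counts are written to the output directory on HDFS. You can print them directly (the exact words and counts depend on the config files copied in earlier):

$ hdfs dfs -cat output/*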

Installing Hive 2.3.2

Downloading Hive 2.3.2

$ su => if you are not already root
$ wget http://www-us.apache.org/dist/hive/hive-2.3.2/apache-hive-2.3.2-bin.tar.gz
$ tar -xzvf apache-hive-2.3.2-bin.tar.gz
$ mv apache-hive-2.3.2-bin /usr/local/hive
$ rm apache-hive-2.3.2-bin.tar.gz

Set up the environment variables for Hive

Add the following lines to ~/.bashrc:

export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH

$ source ~/.bashrc
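
Check that the hive command is now on your PATH (the build details after the version line will vary):

$ hive --version

Hive 2.3.2
...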

Running Hive

Running HiveServer2 and Beeline

First, initialize the embedded Derby metastore schema:

$ schematool -dbType derby -initSchema

Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
Metastore connection User: APP
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.derby.sql
Initialization script completed
schemaTool completed
Then start HiveServer2; it runs in the foreground, so leave it running and open a second terminal for Beeline:

$ hiveserver2

$ beeline -u jdbc:hive2://localhost:10000
or
$ beeline -u jdbc:hive2://

Connecting to jdbc:hive2://
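
Once connected, Beeline accepts regular HiveQL. A quick smoke test (a fresh install has only the default database):

0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+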

Running Hive CLI

$ hive

Logging initialized using configuration in jar:file:/usr/local/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> CREATE TABLE pokes (foo INT, bar STRING);

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

hive> show tables;
invites
pokes
Time taken: 0.068 seconds, Fetched: 2 row(s)

hive> DESCRIBE invites;
OK
foo                     int
bar                     string
ds                      string

# Partition Information
# col_name              data_type           comment

ds                      string
Time taken: 0.459 seconds, Fetched: 8 row(s)
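
To get some rows into pokes, you can load the sample data file that ships with the Hive distribution (this assumes the examples directory is present under /usr/local/hive; any Ctrl-A-delimited two-column text file works as well):

hive> LOAD DATA LOCAL INPATH '/usr/local/hive/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
hive> SELECT COUNT(*) FROM pokes;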

References:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
