
Hive Installation

Prerequisites

  • Make sure you have a working Hadoop cluster installed (see the hadoop tutorial); a quick sanity check is shown below.
  • Log in to the master node as the ‘hadoop’ user created in the previous tutorial.
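
A quick sanity check, assuming HDFS and YARN were started from the master node as in the earlier tutorial (the exact daemon list depends on how your cluster is laid out):

jps                    # on the master this should list NameNode and ResourceManager
hdfs dfsadmin -report  # should report the live DataNodes in the cluster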

Install Hive

Download Hive

wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

Extract it

tar -xzf apache-hive-3.1.3-bin.tar.gz

Move it to a folder called “hive”

sudo mv apache-hive-3.1.3-bin hive

Open the bashrc file and export HIVE_HOME

sudo nano ~/.bashrc

Add the following two lines to the bashrc file

export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin

Save and exit

  • <ctrl+x>
  • y
  • <enter>

Apply the changes

source ~/.bashrc
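
To confirm the new environment variables took effect, a quick check (the version shown should match the release downloaded above):

echo $HIVE_HOME
hive --version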

Install MySQL for the metadata database

Install MySQL Server

sudo apt-get install mysql-server
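
If you want to confirm the server came up before moving on, these checks should work on a stock Ubuntu install:

mysql --version
sudo systemctl status mysql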

Log in to MySQL as the root user to verify the installation, then exit

sudo mysql -u root -p

exit

Run mysql to execute SQL commands against the new server

sudo mysql

Show the existing databases (don’t forget the semicolon)

show databases;

exit

hive-default template file

Navigate to the hive conf folder and copy the hive-default template into a hive-site.xml file.

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml

MySQL Connector

Download the MySQL Connector/J jar (available from Maven Central):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar

Copy the connector:

cp mysql-connector-java-8.0.30.jar $HIVE_HOME/lib/mysql-connector-java-8.0.30.jar
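
To confirm Hive will be able to find the driver, check that the jar now sits in Hive’s lib folder:

ls $HIVE_HOME/lib | grep mysql-connector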

Edit the hive-site.xml file in $HIVE_HOME/conf

sudo nano hive-site.xml

Add the following properties, making sure to update the IP address to reflect the master Hadoop node in your cluster:

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.1.103:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    <description>MySQL connection URL for the Hive metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class for MySQL database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>Username for the Hive metastore MySQL database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
    <description>Password for the Hive metastore MySQL database</description>
  </property>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
  </property>

</configuration>

Create hive_metastore database

  • Log in to MySQL as root
  • Create a database for the Hive metastore
  • Create a user for Hive
  • Grant privileges to the new user on the new database

sudo mysql -u root -p
CREATE DATABASE hive_metastore;
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
exit
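
To double-check that the grant took effect, you can list the new user’s privileges (the -e flag runs the statement and exits):

sudo mysql -u root -p -e "SHOW GRANTS FOR 'hiveuser'@'localhost';"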

You may need to extend this not just to localhost, but also to the actual hostname of the master node.

sudo mysql -u root -p
CREATE USER 'hiveuser'@'ubuntu103' IDENTIFIED WITH mysql_native_password BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hiveuser'@'ubuntu103';
FLUSH PRIVILEGES;

MySQL Ports

Make sure MySQL is listening on port 3306 on the actual IP address of the node, not just localhost.

sudo netstat -tulnp | grep mysqld

By default, it probably is not. To fix that, open the following file and make a few updates:

sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf

Look for the line that starts with bind-address. It might currently look like this:

bind-address = 127.0.0.1
mysqlx-bind-address = 127.0.0.1

Change the bind-address to 0.0.0.0 to allow MySQL to accept connections on all network interfaces, or specify the specific IP address of the server if you want to limit it to a specific interface:

bind-address = 0.0.0.0
mysqlx-bind-address = 0.0.0.0

Now restart MySQL

sudo systemctl restart mysql

Check that it’s listening on the IP address of the master node

sudo netstat -tulnp | grep mysqld
telnet 192.168.1.103 3306
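
Assuming the bind-address change took effect, you should also be able to log in as the new Hive user over TCP (substitute your own master node IP; this depends on the hiveuser grants created earlier matching however MySQL resolves your client address):

mysql -h 192.168.1.103 -u hiveuser -p -e "SHOW DATABASES;"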

Initialize the Hive Schemas

At this point:

  • We have Hive installed
  • We have MySQL installed to use for the Hive metastore
  • We have a metastore database created with a new user
  • The database user has permissions to run commands on the database
  • The database is listening on the correct ports

Now we need to initialize the Hive metastore schema

schematool -dbType mysql -initSchema

You can now review all of the content created by initializing the schema. Login to mysql and review:

sudo mysql
show databases;
use hive_metastore;
show tables;
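
For example, the DBS table should already contain a row for Hive’s built-in default database (exact metastore table names can vary slightly between Hive versions):

SELECT * FROM DBS;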

Connect Hive to Hadoop

Move to the hadoop directory

cd hadoop/etc/hadoop

Copy the three Hadoop site XML files so they are also in the Hive conf folder

cp core-site.xml ~/hive/conf/core-site.xml
cp hdfs-site.xml ~/hive/conf/hdfs-site.xml
cp yarn-site.xml ~/hive/conf/yarn-site.xml

Start the Hive metastore service and make sure it is up and listening on port 9083

nohup hive --service metastore &
netstat -tuln | grep 9083

Create Hive Warehouse in Hadoop Distributed File System

Make directory using hdfs

hdfs dfs -mkdir -p /usr/hive/warehouse

Extend permissions on the newly created warehouse directory

hdfs dfs -chmod 777 /usr/hive/warehouse
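
To confirm the warehouse directory exists in HDFS with the expanded permissions:

hdfs dfs -ls /usr/hive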

Start the Hive metastore service, if it is not already running from the earlier step. Note: nohup allows the program to run with no hangups even if the user logs out.

nohup hive --service metastore &
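
As a quick smoke test (assuming the metastore started cleanly), the Hive CLI should now be able to reach the metastore and list databases:

hive -e "SHOW DATABASES;"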

Getting Close

We’re almost there.

  • We have a running hadoop cluster used for our distributed file system, a.k.a. the storage for our lake
  • We have a running hive metastore, which gives us a way to store the metadata in our lake

Next, we need to stand up our spark cluster. Then we can use PySpark to connect to our hive warehouse.

This post is licensed under CC BY 4.0 by the author.
