Hive Installation
Prerequisites
- Make sure you have a working Hadoop cluster installed (see the Hadoop tutorial)
- Log in to the master node as the ‘hadoop’ user created in the previous tutorial
Install Hive
Download Hive
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
Unzip it
tar -xzf apache-hive-3.1.3-bin.tar.gz
Move it to a folder called “hive”
sudo mv apache-hive-3.1.3-bin hive
Open the bashrc file and export HIVE_HOME
sudo nano ~/.bashrc
Add the following two lines to the bashrc file
export HIVE_HOME=/home/hadoop/hive
export PATH=$PATH:$HIVE_HOME/bin
Save and exit
<ctrl+x>
y
<enter>
Apply the changes
source ~/.bashrc
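To confirm the environment is set up, you can echo the new variable and check that the hive binary resolves on the PATH (this only verifies the two lines you just added):
echo $HIVE_HOME
which hive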
Install MySQL for the metadata database
Install MySQL Server
sudo apt-get install mysql-server
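Assuming a systemd-based Ubuntu install, you can verify the server came up before continuing:
sudo systemctl status mysql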
On Ubuntu, MySQL’s root account authenticates through the system socket, so there is no separate root password to create; running the client with sudo is what grants access. Log in as root and exit to confirm it works (the password prompt can be left blank):
sudo mysql -u root -p
exit
Run mysql to execute SQL commands against the new server
sudo mysql
Show the existing databases (don’t forget the semicolon)
show databases;
exit
hive-default template file
Navigate to the hive conf folder and copy the hive-default template into a hive-site.xml file.
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
MySQL Connector
Download the MySQL Connector/J jar from Maven Central:
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.30/mysql-connector-java-8.0.30.jar
Copy the connector:
cp mysql-connector-java-8.0.30.jar $HIVE_HOME/lib/mysql-connector-java-8.0.30.jar
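You can confirm the jar landed in Hive’s lib folder:
ls $HIVE_HOME/lib | grep mysql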
Open the hive-site.xml file you created in hive/conf
sudo nano hive-site.xml
Add the following properties, but make sure to update the IP address to reflect the master Hadoop node in your cluster:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.1.103:3306/hive_metastore?createDatabaseIfNotExist=true</value>
    <description>MySQL connection URL for the Hive metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class for MySQL (Connector/J 8.x)</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
    <description>Username for the Hive metastore MySQL database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
    <description>Password for the Hive metastore MySQL database</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/usr/hive/warehouse</value>
    <description>HDFS directory for the Hive warehouse</description>
  </property>
</configuration>
Create hive_metastore database
- Login to mysql as root
- Create database for hive metastore
- Create a user for hive
- Grant privileges to the new user on the new database
sudo mysql -u root -p
CREATE DATABASE hive_metastore;
CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hiveuser'@'localhost';
FLUSH PRIVILEGES;
exit
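To confirm the new account works, you can log in as hiveuser directly; it should drop you straight into the hive_metastore database:
mysql -u hiveuser -p hive_metastore
exit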
You may need to extend these grants not just to localhost, but also to the actual hostname of the master node (ubuntu103 in this example).
sudo mysql -u root -p
CREATE USER 'hiveuser'@'ubuntu103' IDENTIFIED WITH mysql_native_password BY 'hivepassword';
GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hiveuser'@'ubuntu103';
FLUSH PRIVILEGES;
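While you are still in this session, you can confirm the grants for both hosts, then exit:
SHOW GRANTS FOR 'hiveuser'@'localhost';
SHOW GRANTS FOR 'hiveuser'@'ubuntu103';
exit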
MySQL Ports
Make sure MySQL is listening on port 3306 on the actual IP address of the node, not just localhost.
sudo netstat -tulnp | grep mysqld
By default, it probably isn’t. Fix that by opening the following file and making a few updates:
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
Look for the line that starts with bind-address. It might currently look like this:
bind-address = 127.0.0.1
mysqlx-bind-address = 127.0.0.1
Change the bind-address to 0.0.0.0 to allow MySQL to accept connections on all network interfaces, or specify the IP address of the server if you want to limit it to a specific interface:
bind-address = 0.0.0.0
mysqlx-bind-address = 0.0.0.0
Now restart MySQL
sudo systemctl restart mysql
Check that it’s listening on the IP address of the master node
sudo netstat -tulnp | grep mysqld
telnet 192.168.1.103 3306
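On newer Ubuntu releases netstat and telnet may not be installed by default; ss and nc are drop-in alternatives for the same two checks:
sudo ss -tulnp | grep mysqld
nc -zv 192.168.1.103 3306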
Initialize the Hive Schemas
At this point:
- We have Hive installed
- We have MySQL installed to use for the Hive metastore
- We have a metastore database created with a new user
- The database user has permissions to run commands on the database
- The database is listening on the correct ports
Now we need to initialize the schemas for Hive
schematool -dbType mysql -initSchema
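If the initialization succeeds, schematool can also report the schema version it just wrote:
schematool -dbType mysql -info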
You can now review all of the content created by initializing the schema. Log in to MySQL and review:
sudo mysql
show databases;
use hive_metastore;
show tables;
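For example, the VERSION table records the metastore schema version that schematool just created:
SELECT * FROM VERSION;
exit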
Connect Hive to Hadoop
Move to the Hadoop configuration directory
cd hadoop/etc/hadoop
Copy the three Hadoop site XML files so they are also in the Hive conf folder
cp core-site.xml ~/hive/conf/core-site.xml
cp hdfs-site.xml ~/hive/conf/hdfs-site.xml
cp yarn-site.xml ~/hive/conf/yarn-site.xml
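A quick listing confirms all three files are in place alongside hive-site.xml:
ls ~/hive/conf | grep site.xml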
Start the Hive metastore service and make sure it is up and running. Note: nohup allows the program to run with no hangups even if the user logs out.
nohup hive --service metastore &
netstat -tuln | grep 9083
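If port 9083 never shows up, the metastore probably failed on startup; nohup writes its output to nohup.out in the directory where you launched it, so check there for errors:
tail nohup.out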
Create Hive Warehouse in Hadoop Distributed File System
Make directory using hdfs
hdfs dfs -mkdir -p /usr/hive/warehouse
Extend permissions on the newly created warehouse directory
hdfs dfs -chmod 777 /usr/hive/warehouse
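Verify the directory and its permissions:
hdfs dfs -ls /usr/hive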
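As an end-to-end check, you can open the Hive CLI and create a throwaway table (the name smoke_test here is just an example); its metadata should land in the MySQL metastore and its data under the warehouse directory:
hive
CREATE TABLE smoke_test (id INT, name STRING);
SHOW TABLES;
DROP TABLE smoke_test;
exit;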
Getting Close
We’re almost there.
- We have a running Hadoop cluster used for our distributed file system, a.k.a. the storage for our lake
- We have a running Hive metastore, which gives us a way to store the metadata in our lake
Next, we need to stand up our Spark cluster. Then we can use PySpark to connect to our Hive warehouse.