On-Prem Data Lake Project Intro
Background
The term “data lake” is credited to James Dixon, the former CTO of Pentaho, who coined it around 2010. In 2012, the Harvard Business Review described the data scientist role as “The Sexiest Job of the 21st Century”. Today, we often hear about the “lakehouse”, a term coined by Databricks and often linked to Bill Inmon, an influential voice in the world of data warehousing.
While the modern lakehouse has come a long way since 2010, it’s helpful to reflect on the tools engineers and data scientists have used, and the challenges those tools were trying to overcome.
In this project, we’ll set up our very own multi-node, on-prem data lake. Terms and marketing have changed a lot over the years, so here are the highlights:
Storage: Hadoop
The primary benefit of Hadoop is its distributed file system, HDFS (Hadoop Distributed File System). The need for HDFS has been fading as modern cloud storage solutions such as AWS’s S3 and Azure’s ADLS Gen2 offer HCFS (Hadoop Compatible File System) implementations. However, HDFS will remain at the heart of our on-prem data lake.
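As a taste of what working with HDFS looks like from Python, here’s a minimal sketch using PyArrow. The hostname `hadoop-master` and port 8020 are assumptions for this setup (8020 is a common NameNode RPC port), and the Hadoop native client libraries must be available on the machine running it.

```python
# Minimal sketch: listing the root of HDFS via PyArrow.
# "hadoop-master" and port 8020 are assumptions for this setup.
# Requires the Hadoop native client libs on this machine
# (CLASSPATH / ARROW_LIBHDFS_DIR configured).
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="hadoop-master", port=8020)

# Walk the root of the distributed file system.
for info in hdfs.get_file_info(fs.FileSelector("/")):
    print(info.path, info.type)
```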
Metastore: Hive
Next up, we’ll install a Hive metastore layer on top of our Hadoop installation. We’ll leverage this for querying our data as if it were a database.
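Here’s a minimal sketch of what that wiring looks like from PySpark, assuming a metastore running on `hadoop-master` at the default metastore port, 9083:

```python
# Minimal sketch: pointing Spark SQL at the Hive metastore.
# The thrift URI is an assumption for this setup; 9083 is the
# default Hive metastore port.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-metastore-check")
    .config("hive.metastore.uris", "thrift://hadoop-master:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Once the metastore is wired up, tables read like a database.
spark.sql("SHOW DATABASES").show()
```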
Compute: Spark
Finally, we’ll leverage a multi-node standalone Spark cluster for compute. While Hadoop has been losing popularity, Spark, and especially PySpark, has been gaining it. Historically, Spark and Hadoop have been tightly coupled, but the industry has been decoupling these technologies for quite some time now, so that’s the approach we’ll take here.
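A minimal sketch of pointing a driver at the standalone cluster, assuming a master at `spark-master` on the default standalone port, 7077:

```python
# Minimal sketch: connecting a PySpark driver to a standalone
# cluster. "spark-master" is an assumed hostname; 7077 is the
# default standalone master port. No YARN involved.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("standalone-smoke-test")
    .master("spark://spark-master:7077")
    .getOrCreate()
)

# Trivial job to confirm work is distributed to the workers.
print(spark.range(1_000_000).count())
spark.stop()
```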
Architecture
You can install this however you choose. For my setup, I’m virtualizing everything on a series of Ubuntu servers hosted in Proxmox VMs. To recap (a quick connectivity check follows the list):
- Spark Cluster: three VMs
- Hadoop Cluster: three VMs
- Hive: leverage the existing master node of the Hadoop cluster
- Spark Driver: this will be my laptop, since that’s where I’ll be executing my code from
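Before diving in, a quick pre-flight check like the sketch below, run from the laptop, confirms each service is reachable. The hostnames are assumptions for this setup; the ports are the usual defaults (standalone master 7077, Hive metastore 9083, HDFS NameNode 8020).

```python
# Minimal sketch: verify each service in the architecture is
# reachable from the driver machine. Hostnames are assumptions
# for this setup; ports are the common defaults.
import socket

SERVICES = {
    "Spark Master": ("spark-master", 7077),
    "Hive Metastore": ("hadoop-master", 9083),
    "HDFS NameNode": ("hadoop-master", 8020),
}

for name, (host, port) in SERVICES.items():
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"{name}: OK ({host}:{port})")
    except OSError as exc:
        print(f"{name}: UNREACHABLE ({host}:{port}) -> {exc}")
```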
```mermaid
---
title: Architecture
---
stateDiagram-v2
    spark_driver: Spark Driver
    spark_master: Spark Master
    spark_worker1: Spark Worker 01
    spark_worker2: Spark Worker 02
    hadoop_master: Hadoop/Hive Master
    hadoop_slave01: Hadoop Slave 01
    hadoop_slave02: Hadoop Slave 02
    spark_driver --> spark_master
    spark_master --> spark_worker1
    spark_master --> spark_worker2
    hadoop_master --> hadoop_slave01
    hadoop_master --> hadoop_slave02
```