
On-Prem Data Lake Project Intro

Background

The term “data lake” is credited to James Dixon, the former CTO of Pentaho, who coined it around 2010. In 2012, the Harvard Business Review described the data scientist role as “the Sexiest Job of the 21st Century”. Today, we often hear about the “lakehouse”, a term coined by Databricks and often linked to Bill Inmon, an influential voice in the world of data warehousing.

While the modern lakehouse has come a long way since 2010, it’s helpful to reflect on the tools engineers and data scientists have used, and the challenges they were trying to overcome.

In this project, we’ll set up our very own multi-node, on-prem data lake. Terms and marketing have changed a lot over the years, so here are the highlights:

Storage: Hadoop

The primary benefit of Hadoop is its distributed file system. You may have heard the term HDFS, which stands for Hadoop Distributed File System. The need for HDFS has been fading, as modern cloud storage solutions such as AWS’s S3 and Azure’s ADLS Gen 2 offer HCFS, or Hadoop Compatible File System, implementations. However, HDFS will remain at the heart of our on-prem data lake.
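
To make “Hadoop compatible” concrete, here’s a minimal PySpark sketch: the same DataFrame API reads from HDFS or cloud storage, and only the URI scheme changes. The hostname hadoop-master, port 9000, and all paths and bucket names are placeholders for illustration, not values from this project.

```python
# Minimal sketch: the same DataFrame API reads from HDFS or a cloud
# object store; only the URI scheme changes. Hostnames, ports, and
# paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hcfs-demo").getOrCreate()

# HDFS on the on-prem cluster (hypothetical NameNode host/port).
df = spark.read.parquet("hdfs://hadoop-master:9000/lake/raw/events")

# Equivalent reads against HCFS-backed cloud storage (requires the
# relevant connector jars; bucket/container names are hypothetical):
# df = spark.read.parquet("s3a://my-bucket/lake/raw/events")
# df = spark.read.parquet("abfss://lake@myaccount.dfs.core.windows.net/raw/events")
```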

Metastore: Hive

Next up, we’ll install a Hive metastore layer on top of our Hadoop installation. We’ll leverage it to query our data as if it were a database.
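
As a preview, here’s a rough sketch of what querying through the metastore looks like from PySpark. It assumes hive-site.xml is available to Spark, or that the metastore URI is set explicitly as below; the thrift host, database, and table names are hypothetical.

```python
from pyspark.sql import SparkSession

# Rough sketch: attach to a Hive metastore and query lake data as if it
# were a database. Hostname, port, and table names are hypothetical;
# 9083 is the metastore's default thrift port.
spark = (
    SparkSession.builder
    .appName("hive-demo")
    .config("hive.metastore.uris", "thrift://hadoop-master:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) AS n FROM default.events").show()
```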

Compute: Spark

Finally, we’ll leverage a multi-node standalone Spark cluster for compute. While Hadoop has been losing popularity, Spark, especially PySpark, has been gaining it. Historically, Spark and Hadoop have been tightly coupled, but the industry has been decoupling these technologies for quite some time now, so that’s the approach we’ll take here.
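
In practice, “standalone” just means using Spark’s built-in cluster manager rather than YARN. A minimal sketch, assuming a standalone master is already running at a placeholder host spark-master on 7077 (the standalone manager’s default port):

```python
from pyspark.sql import SparkSession

# Minimal sketch: attach to a standalone Spark cluster. No YARN, so the
# cluster manager carries no Hadoop dependency. Hostname is a placeholder.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # standalone master URL
    .appName("standalone-demo")
    .getOrCreate()
)

print(spark.sparkContext.master)  # confirms which cluster manager we attached to
```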

Architecture

You can install this however you choose. For my setup, I’m virtualizing everything in a series of Ubuntu servers, hosted in Proxmox VMs. To recap:

  • Spark Cluster: three VMs
  • Hadoop Cluster: three VMs
  • Hive: leverage the existing master node of the Hadoop cluster
  • Spark Driver: this will be my laptop, since that’s where I’ll be executing my code from (see the connection sketch after the diagram)

```mermaid
---
title: Architecture
---
stateDiagram-v2
    spark_driver: Spark Driver
    spark_master: Spark Master
    spark_worker1: Spark Worker 01
    spark_worker2: Spark Worker 02
    hadoop_master: Hadoop/Hive Master
    hadoop_slave01: Hadoop Slave 01
    hadoop_slave02: Hadoop Slave 02

    spark_driver --> spark_master
    spark_master --> spark_worker1
    spark_master --> spark_worker2
    hadoop_master --> hadoop_slave01
    hadoop_master --> hadoop_slave02
```
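
Since the driver runs on my laptop, the Spark workers must be able to connect back to it over the network. Here’s a minimal sketch of the relevant session settings, assuming the laptop and VMs share a routable network; the master URL and LAN IP are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch: run the driver from a laptop against the standalone cluster.
# Workers connect back to the driver, so spark.driver.host must be an
# address routable from the VMs. All values below are placeholders.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")            # standalone master (placeholder host)
    .appName("laptop-driver")
    .config("spark.driver.host", "192.168.1.50")    # the laptop's LAN IP (placeholder)
    .config("spark.driver.bindAddress", "0.0.0.0")  # listen on all interfaces
    .getOrCreate()
)
```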