The Data Platform

Motivation

Data Engineers need a platform. Platform Engineers need requirements.

sequenceDiagram
    actor Data Engineer
    actor Platform Engineer
    Data Engineer->>Platform Engineer: I need a data platform!
    Platform Engineer->>Data Engineer: What are your requirements?

CI/CD

In most orgs, we need to think about how to incorporate CI/CD into our design patterns. At its core, the goal is simple: develop code however you see fit, but make sure you have some sort of QA and/or testing environment your code can run against before it hits Prod. Wherever possible, we want this process to be automated. We can pair tools like GitHub Actions or Azure DevOps with Infrastructure as Code tools like Terraform to accomplish this.

---
title: CI/CD
---
flowchart LR
    Local --> Dev --> QA --> Prod
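
As a sketch, a GitHub Actions workflow for this promotion flow might look like the following. The workflow name, branch names, and `plans/qa` directory are assumptions for illustration, not prescriptions; the same shape could be repeated per environment.

```yaml
# Hypothetical workflow: plan on pull requests, apply to QA on merge to main.
name: terraform-cicd

on:
  pull_request:
  push:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Plan against QA
        working-directory: plans/qa
        run: |
          terraform init
          terraform plan

  apply:
    # Only apply after a merge, and only after the plan job succeeds.
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Apply to QA
        working-directory: plans/qa
        run: |
          terraform init
          terraform apply -auto-approve
```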

DRY Programming

For CI/CD pipelines to work effectively, we need to practice DRY programming, a.k.a. Don't Repeat Yourself. With a stateful tool like Terraform, we can leverage plans and modules to assist with this. Our repeatable code is housed in modules, and we simply apply each child module from its root environment.

project
│   README.md
│   configs.json
├───modules
│   └───some-module
│       └───main.tf
└───plans
    ├───dev
    │   └───main.tf
    ├───qa
    │   └───main.tf
    └───prod
        └───main.tf
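
Under this layout, the pattern might be sketched as follows. The module name `some-module` comes from the tree above; the `environment` variable and the resource group it creates are illustrative assumptions, not part of the original.

```hcl
# modules/some-module/main.tf - reusable child module (illustrative)
variable "environment" {
  type = string
}

resource "azurerm_resource_group" "this" {
  name     = "rg-data-${var.environment}" # placeholder naming scheme
  location = "eastus"
}

# plans/dev/main.tf - the dev root simply applies the child module.
# The qa and prod roots are identical apart from the environment value.
module "some_module" {
  source      = "../../modules/some-module"
  environment = "dev"
}
```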

Hub & Spoke Architecture

Applying some of these concepts in the real world lends itself nicely to a Hub & Spoke architecture. Consider a hybrid network, where your organization has both on-prem and cloud resources. A common Hub & Spoke design places central networking resources in the hub to handle communication between on-prem and the cloud. As a next step, you could simply run your normal CI/CD pipelines for Dev, QA, and Prod. However, replicating some of that networking configuration per environment may not be feasible; physical limitations and costs can be prohibitive. Instead, a common approach is to leverage VNet or VPC peering between the central hub and the individual spokes.

---
title: Hub & Spoke
---
flowchart LR
    OnPrem <== S2S ==> Hub
    Hub <== VNet Peering ==> Dev
    Hub <== VNet Peering ==> QA
    Hub <== VNet Peering ==> Prod
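
In Azure Terraform, a hub-to-spoke link might be sketched with the `azurerm_virtual_network_peering` resource. All names, resource groups, and VNet references below are placeholders; note that peering must be declared in both directions to carry traffic.

```hcl
# Hypothetical hub-to-dev peering; names and IDs are placeholders.
resource "azurerm_virtual_network_peering" "hub_to_dev" {
  name                      = "peer-hub-to-dev"
  resource_group_name       = "rg-hub-network"
  virtual_network_name      = "vnet-hub"
  remote_virtual_network_id = azurerm_virtual_network.dev.id
}

# The reverse direction, declared from the dev spoke back to the hub.
resource "azurerm_virtual_network_peering" "dev_to_hub" {
  name                      = "peer-dev-to-hub"
  resource_group_name       = "rg-dev-network"
  virtual_network_name      = "vnet-dev"
  remote_virtual_network_id = azurerm_virtual_network.hub.id
}
```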

Unity Catalog (Databricks)

Unity Catalog aims to provide data lake governance for Databricks. It is a relatively new offering, comparable to Lake Formation in AWS. One drawback: Unity Catalog allows only one metastore per region. However, that metastore can be assigned to multiple workspaces. You should be able to see the parallels between Hub & Spoke and Unity Catalog's metastore assignment.

---
title: Unity Catalog
---
flowchart LR
    Metastore == Metastore Assignment ==> d[Dev Workspace]
    Metastore == Metastore Assignment ==> q[QA Workspace]
    Metastore == Metastore Assignment ==> p[Prod Workspace]
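
As a sketch, the Databricks Terraform provider exposes a `databricks_metastore_assignment` resource that mirrors this diagram: one regional metastore, assigned to each environment's workspace. The metastore reference and workspace IDs below are placeholders.

```hcl
# Hypothetical metastore assignments; workspace IDs are placeholders.
resource "databricks_metastore_assignment" "dev" {
  metastore_id = databricks_metastore.primary.id
  workspace_id = 1111111111111111
}

resource "databricks_metastore_assignment" "qa" {
  metastore_id = databricks_metastore.primary.id
  workspace_id = 2222222222222222
}

resource "databricks_metastore_assignment" "prod" {
  metastore_id = databricks_metastore.primary.id
  workspace_id = 3333333333333333
}
```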

Pets vs. Cattle

Entering the world of Platform Engineering and IaC, we have to be cognizant of Pets vs. Cattle. Most things in the data world (e.g. transactional data) are considered "pets," as they cannot easily be recreated. On the flip side, most items in the platform world (e.g. compute) are considered "cattle," as they can be easily recreated. We need to take special precautions when thinking about the blast radius of our state files and how to properly handle resource propagation, so the creation or removal of resource x does not impact resource y.
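
One guardrail Terraform offers for "pet" resources is the `lifecycle` block. As a sketch, the storage account below (its name and resource group are placeholders) refuses to be destroyed by any plan, which shrinks the blast radius of an accidental change:

```hcl
# A "pet": transactional data we cannot easily recreate.
# prevent_destroy makes terraform error out on any plan that would delete it.
resource "azurerm_storage_account" "transactional" {
  name                     = "stdataprod"   # placeholder name
  resource_group_name      = "rg-data-prod" # placeholder
  location                 = "eastus"
  account_tier             = "Standard"
  account_replication_type = "GRS"

  lifecycle {
    prevent_destroy = true
  }
}
```

"Cattle" resources, by contrast, are left free to be destroyed and rebuilt as the plan dictates.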


Resource Propagation

Extending upon Dev, QA, and Prod, we can further scale out our development pattern. If you are coming from a programming language like Python, you might be tempted to create nested loops. This is entirely possible in Terraform, but can pose significant challenges. My approach is to create baseline resources in my primary child modules, which are replicated across Dev, QA, and Prod. For example, I might create Dev, QA, and Prod storage accounts. Assuming multiple teams need their own respective storage accounts, we can leverage for_each loops within the child modules to create N storage accounts within each environment. Currently, I use a central configuration file that I iterate through for such tasks. Additionally, I focus on maps, using key-value pairs when looping through "pet" resources. This differs from looping through lists because, given the stateful nature of Terraform, if the order of items in a list changes, the underlying resources would most likely be destroyed and rebuilt. Keying on maps mitigates this issue.
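
A minimal sketch of the map-keyed for_each pattern follows. The team names, naming scheme, and resource group are assumptions for illustration; the point is that state addresses are keyed by map key rather than list index.

```hcl
# Hypothetical central config: keys are stable team identifiers.
variable "teams" {
  type = map(object({ replication = string }))
  default = {
    analytics = { replication = "LRS" }
    finance   = { replication = "GRS" }
  }
}

# for_each over a map: each instance's state address is keyed by
# team name (e.g. azurerm_storage_account.team["finance"]), so adding
# or removing one team never reorders, and never rebuilds, the others.
resource "azurerm_storage_account" "team" {
  for_each                 = var.teams
  name                     = "st${each.key}dev" # placeholder naming scheme
  resource_group_name      = "rg-data-dev"      # placeholder
  location                 = "eastus"
  account_tier             = "Standard"
  account_replication_type = each.value.replication
}
```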

---
title: Resource Propagation
---
flowchart LR
    OnPrem <== S2S ==> Hub
    Hub <== VNet Peering ==> Dev
    Hub <== VNet Peering ==> QA
    Hub <== VNet Peering ==> Prod
    Dev --> a[Resource<sub>i-n</sub>]
    QA --> b[Resource<sub>i-n</sub>]
    Prod --> c[Resource<sub>i-n</sub>]
