Modern Data Stack Project

Early Prototype Version

Project info


Data Engineering & Analysis

2024/12

Introduction

A brief quote summarizing the project's goals:

“A refreshed and improved version of my previous ETL project — now following ELT principles.”
— Arek

Explore the Project

Tips

Tech details: jump to the general technical information about the components (further down this page).


Data Engineering Project (Early Prototype)

Project Status: Early Development Phase
This is an early experimental version of the project to explore key ideas and validate the concept.

Why?

(contains a bit of self-promotion ;) )

As I mentioned in my last micropost Status Update: Leveling Up in Data Engineering, the Google Cloud Developers Meetup inspired me to explore new areas in my data stack. I decided to refresh my knowledge by diving into newer technologies.

I've already read Fundamentals of Data Engineering by Joe Reis and I'm currently reading Designing Data-Intensive Applications by Martin Kleppmann (which, by the way, is more challenging).
I'm also working on labs and have started preparing for the first step toward ✨ Google Cloud certification ✨.

What?

I use Apache Airflow for orchestration, SQL*Loader for extracting and loading data from .csv files (datasets from dane.gov.pl), and dbt for data transformation (including dimension preloading). Data is stored in Oracle XE, and the environment setup is automated with shell scripts on Debian.

The entire project can be easily run with docker-compose. Key features include:

  • A dedicated Docker network connects the services by hostname instead of IP address.
  • The setup builds on the Apache Airflow docker-compose file, extended with custom snippets for Oracle XE and a remote Linux host.
  • Secure SSH connections for the SSHOperators are configured with key-based authentication (see the sketch after this list).
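
As a minimal sketch (not the project's actual configuration), the key-based SSH access for the SSHOperators could be defined like this in Python; the hostname, username, key path, and port are illustrative assumptions:

    # Hypothetical example: an Airflow SSH hook that authenticates with a private key.
    # The remote host is reached by its service name on the dedicated Docker network.
    from airflow.providers.ssh.hooks.ssh import SSHHook

    remote_ssh_hook = SSHHook(
        remote_host="etl-host",                   # assumed Docker-network hostname
        username="airflow",                       # assumed remote user
        key_file="/opt/airflow/keys/id_ed25519",  # private key mounted into the container
        port=22,
    )

The same settings can also be stored as a regular Airflow connection and referenced from operators via ssh_conn_id.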

How does it work?

TaskSetupDb: Executes bash and SQL scripts to initialize the Oracle pluggable database instance.
TaskExtractLoad: Uses SSHOperator to run commands on the remote Linux host (a simplified DAG sketch follows this list):

  • Runs Liquibase to create the raw-data ("bronze") schemas.
  • Loads the data with SQL*Loader (further optimization and generalization planned).
  • Executes dbt models to populate the dimension tables.
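
The task flow above could be expressed roughly as the DAG below; this is a hedged sketch rather than the project's real code, so the dag_id, connection ids, script path, and the exact remote commands (including the operator used for the setup task) are assumptions:

    # Hypothetical sketch of the two-task flow described above (not the actual DAG file).
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.ssh.operators.ssh import SSHOperator

    with DAG(
        dag_id="hospitalizations_elt",  # assumed name
        start_date=datetime(2024, 12, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        # TaskSetupDb: bash + SQL scripts that initialize the Oracle pluggable database
        task_setup_db = SSHOperator(
            task_id="task_setup_db",
            ssh_conn_id="oracle_xe_ssh",               # assumed Airflow connection
            command="bash /opt/scripts/setup_pdb.sh",  # assumed script path
        )

        # TaskExtractLoad: Liquibase schemas, SQL*Loader bulk load, then dbt models
        task_extract_load = SSHOperator(
            task_id="task_extract_load",
            ssh_conn_id="etl_host_ssh",  # assumed Airflow connection
            command=(
                "liquibase update && "             # create the raw ("bronze") schemas
                "sqlldr control=load_raw.ctl && "  # bulk-load the .csv files (credentials omitted)
                "dbt run"                          # run the dbt models that build the dimensions
            ),
        )

        task_setup_db >> task_extract_load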

Todo

Data engineering todo:

  • Implement a more generalized Python-based approach for data loading, weighing the performance trade-offs against SQL*Loader (see the sketch after this list).
  • Explore columnar databases as an alternative for optimized analytics.
  • Compare the project's star schema performance against GCP BigQuery using the One Big Table pattern.
  • Practice real-time data streaming.
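
To make the first item concrete, here is a rough sketch of what a generalized Python loader might look like, assuming the python-oracledb driver; the connection details, environment variable, table, and column layout are placeholders, and the batched executemany calls are the part whose throughput would be measured against SQL*Loader:

    # Hypothetical sketch of a generic CSV -> Oracle loader (placeholder names throughout).
    import csv
    import os

    import oracledb  # assumes the python-oracledb driver

    def load_csv(path: str, table: str, columns: list[str], batch_size: int = 10_000) -> None:
        """Stream a .csv file into an Oracle table using batched inserts."""
        placeholders = ", ".join(f":{i + 1}" for i in range(len(columns)))
        sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

        with oracledb.connect(
            user="raw_loader",                      # assumed schema user
            password=os.environ["RAW_LOADER_PWD"],  # assumed environment variable
            dsn="oracle-xe/XEPDB1",                 # assumed Docker-network DSN
        ) as conn:
            with conn.cursor() as cur, open(path, newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                next(reader)  # skip the header row
                batch = []
                for row in reader:
                    batch.append(row[: len(columns)])
                    if len(batch) >= batch_size:
                        cur.executemany(sql, batch)  # batched insert, the SQL*Loader alternative
                        batch.clear()
                if batch:
                    cur.executemany(sql, batch)
            conn.commit()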

Analysis todo:

  • Use Python visualization libraries for exploratory data analysis (EDA).
  • Implement Apache Superset for advanced dashboards and drill-down reports.

Project Overview

In General

Components

Quick Summary

  1. Data processing
  2. Storage
  3. Environment

List of All Technologies

Grouped by Category

Category                     Technologies
Orchestration                Apache Airflow
DAG                          SSHOperator ✦ PythonOperator
Data Transformation          dbt
Extraction & Load            Bash script ✦ SQL script ✦ SQL*Loader
Database Change Management   Liquibase
Data Source                  files .csv
RDBMS                        Oracle DB 21
Linux                        Debian ✦ bash
Docker                       docker-compose ✦ dockerfile

Components Specification

Detailed Information Grouped by Components

Table of Contents:

  1. Data processing
  2. Storage
  3. Environment

Details

  1. Data processing

    Extract, Load, Transform tools

    Tech Stack
    Category                     Technology
    Orchestration                Apache Airflow
    DAG                          SSHOperator
    DAG                          PythonOperator
    Data Transformation          dbt
    Extraction & Load            Bash script
    Extraction & Load            SQL script
    Extraction & Load            SQL*Loader
    Database Change Management   Liquibase

  2. Storage

    Source of data. Staging. Target storage.

    Tech Stack
    Category                     Technology
    Data Source                  files .csv
    RDBMS                        Oracle DB 21

  3. Environment

    Operations services

    Tech Stack
    Category                     Technology
    Linux                        Debian
    Linux                        bash
    Docker                       docker-compose
    Docker                       dockerfile

Data Sources

  • Dane.gov.pl: 'Dane dotyczące hospitalizacji rozliczonych JGP w latach 2019-2021' (data on hospitalizations settled under JGP, 2019-2021; also contains 2017 and 2018 hospitalizations). Link. Quantity: >20M.
  • Dane.gov.pl: 'Dane dotyczące hospitalizacji rozliczonych JGP w latach 2022' (data on hospitalizations settled under JGP, 2022). Link. Quantity: >20M (whole dataset).
  • CEZ Polish HL7 Impl.: Discharge and admission modes. Link. Quantity: ~10.

Additional Information

  • Domain: Data Engineering
  • Status: in_progress
  • Keywords: DataEngineering Docker-compose dbt ApacheAirflow ELT bash python
