
Reference Architecture: Data Pipelines

Status: Proposed | Date: 2025-01-28

When to Use This Pattern

Use when building data processing systems for analytics, ETL workloads, or real-time data transformation across organisational systems.

Overview

A template for implementing scalable data pipelines using the SQLMesh framework with S3 object storage and Aurora database patterns. Choose modern analytics engines (DuckDB/DuckLake) or legacy engines (Iceberg/Trino/Athena) based on alignment with the existing data lake.
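A minimal sketch of how these pieces might be wired together in SQLMesh's Python configuration, assuming a DuckDB execution engine and an Aurora PostgreSQL state connection. The class names and fields (Config, GatewayConfig, DuckDBConnectionConfig, PostgresConnectionConfig) follow SQLMesh's documented configuration API but may differ between versions, and the endpoint, database names, and environment variable are placeholders:

```python
# config.py at the SQLMesh project root -- a sketch only, not a verified configuration.
import os

from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    DuckDBConnectionConfig,
    PostgresConnectionConfig,
)

config = Config(
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
    default_gateway="analytics",
    gateways={
        "analytics": GatewayConfig(
            # DuckDB is the analytics/transformation engine over data held in S3.
            connection=DuckDBConnectionConfig(database="analytics.duckdb"),
            # Aurora PostgreSQL stores SQLMesh state and metadata (placeholder endpoint).
            state_connection=PostgresConnectionConfig(
                host="aurora-metadata.cluster-xxxx.eu-west-2.rds.amazonaws.com",
                port=5432,
                user="sqlmesh",
                password=os.environ["SQLMESH_STATE_PASSWORD"],  # injected per ADR 005
                database="sqlmesh_state",
            ),
        )
    },
)
```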

Core Components

Data flow:

  • Data Sources -> SQLMesh -> S3 Storage
  • SQLMesh -> Aurora PostgreSQL
  • S3 Storage -> Analytics Engine -> Reports

Components:

  • Data Sources: Databases, APIs, Files
  • SQLMesh: Orchestration, Transformation, Quality Checks
  • S3 Storage: Raw & Processed Data
  • Aurora PostgreSQL: Metadata & Schema
  • Analytics Engine: DuckDB/Athena
  • Reports: Quarto, Evidence BI, Data API

Project Kickoff Steps

Foundation Setup

  1. Apply Isolation - Follow ADR 001: Application Isolation for data processing network and runtime separation
  2. Deploy Infrastructure - Follow ADR 002: AWS EKS for Cloud Workloads for SQLMesh container deployment
  3. Configure Infrastructure - Follow ADR 010: Infrastructure as Code for reproducible data infrastructure
  4. Setup Database - Follow ADR 018: Database Patterns for Aurora Serverless v2 as the data warehouse (see the provisioning sketch after this list)
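As a concrete illustration of step 4, a hedged sketch of provisioning an Aurora Serverless v2 PostgreSQL cluster with boto3. The cluster identifiers, engine version, capacity bounds, and region are placeholder assumptions, and in practice this belongs in your IaC tooling per ADR 010 rather than an ad-hoc script:

```python
# Illustrative only: provision an Aurora Serverless v2 PostgreSQL cluster for
# pipeline metadata and warehouse state. Real deployments should use IaC (ADR 010).
import boto3

rds = boto3.client("rds", region_name="eu-west-2")  # placeholder region

rds.create_db_cluster(
    DBClusterIdentifier="data-pipeline-metadata",    # placeholder name
    Engine="aurora-postgresql",
    EngineVersion="15.4",                            # assumed version
    MasterUsername="sqlmesh",
    ManageMasterUserPassword=True,                   # password held in Secrets Manager
    ServerlessV2ScalingConfiguration={
        "MinCapacity": 0.5,                          # ACUs: scale down when idle
        "MaxCapacity": 4.0,                          # cap spend for metadata workloads
    },
    StorageEncrypted=True,
)

# Serverless v2 clusters still need an instance using the "db.serverless" class.
rds.create_db_instance(
    DBInstanceIdentifier="data-pipeline-metadata-1",
    DBClusterIdentifier="data-pipeline-metadata",
    DBInstanceClass="db.serverless",
    Engine="aurora-postgresql",
)
```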

Security & Operations

  1. Configure Secrets Management - Follow ADR 005: Secrets Management for data source credentials and API keys (see the retrieval sketch after this list)
  2. Setup Logging - Follow ADR 007: Centralized Security Logging for transformation audit trails
  3. Setup Backup Strategy - Follow ADR 014: Object Storage Backups for data warehouse backup
  4. Data Governance - Follow ADR 015: Data Governance Standards for SQLMesh data contracts and lineage
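A minimal sketch of how a pipeline task might pull data source credentials at runtime, assuming AWS Secrets Manager as the backing store per ADR 005. The secret name, region, and JSON shape are placeholder assumptions:

```python
# Illustrative only: fetch data source credentials from AWS Secrets Manager at runtime,
# so connection strings never live in SQLMesh configs or the repository.
import json
import boto3

secrets = boto3.client("secretsmanager", region_name="eu-west-2")  # placeholder region

def get_source_credentials(secret_name: str) -> dict:
    """Return the credential payload stored under `secret_name` (assumed to be JSON)."""
    response = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])

# Placeholder secret name; the JSON keys (host/user/password/db) are assumptions.
creds = get_source_credentials("data-pipelines/source-db")
dsn = f"postgresql://{creds['user']}:{creds['password']}@{creds['host']}:5432/{creds['db']}"
```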

Development Process

  1. Configure CI/CD - Follow ADR 004: CI/CD Quality Assurance for automated testing of data transformations (see the CI gate sketch after this list)
  2. Setup Release Process - Follow ADR 009: Release Documentation Standards for SQLMesh model versioning
  3. Analytics Tools - Follow ADR 017: Analytics Tooling Standards for Quarto and Evidence BI integration
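One way the CI gate from step 1 could look: a small Python wrapper that runs SQLMesh's test suite and applies a plan to an isolated CI environment before merge. The `sqlmesh test` and `sqlmesh plan` commands are part of the SQLMesh CLI; the environment name and flags shown are assumptions to adapt to your pipeline:

```python
# Illustrative CI gate for data transformation changes (run from the SQLMesh project root).
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run a command and fail the CI job if it exits non-zero."""
    print("+", " ".join(cmd))
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(result.returncode)

# Unit tests for model logic (fixtures defined alongside the models).
run(["sqlmesh", "test"])

# Build the changed models into a throwaway CI environment (placeholder name "ci_pr").
run(["sqlmesh", "plan", "ci_pr", "--auto-apply"])
```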

Implementation Details

Data Processing & Quality:

  • Configure SQLMesh for data transformation and orchestration
  • Setup S3 object storage for data files and Aurora for metadata
  • Implement data quality validation rules and testing frameworks
  • Configure DuckDB/DuckLake or Iceberg/Trino based on alignment with the existing data lake (see the DuckDB-on-S3 sketch below)
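If DuckDB is chosen, a minimal sketch of querying processed pipeline output directly from S3 via DuckDB's httpfs extension. The bucket, prefix, region, and column names are placeholder assumptions:

```python
# Illustrative only: DuckDB reading processed parquet output straight from S3.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")   # enables s3:// paths in DuckDB
con.execute("SET s3_region = 'eu-west-2';")    # placeholder region
# In practice credentials come from the environment or an instance/role profile.

# Placeholder bucket, prefix, and columns for processed pipeline output.
orders = con.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM read_parquet('s3://data-pipelines-processed/orders/*.parquet')
    GROUP BY order_date
    ORDER BY order_date
    """
).df()
print(orders.head())
```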

Cost Optimization & Performance:

  • Configure Aurora Serverless v2 autoscaling for cost-effective metadata storage
  • Implement S3 lifecycle policies for data archival (Intelligent Tiering → Glacier); see the lifecycle sketch after this list
  • Use DuckDB for cost-effective analytics and Athena for large-scale queries
  • Configure SQLMesh incremental processing to minimize compute costs
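A hedged sketch of the lifecycle policy from the second bullet, applied with boto3. The bucket name, prefix, transition windows, and retention period are placeholder assumptions to tune against your retention requirements:

```python
# Illustrative only: tier raw pipeline data to Intelligent-Tiering, then Glacier,
# and expire it after a placeholder retention window.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="data-pipelines-raw",  # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},  # placeholder prefix
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 730},   # assumed retention: 2 years
            }
        ]
    },
)
```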

Data Governance & API Access: