Reference Architecture: Data Pipelines
Status: Proposed | Date: 2025-01-28
When to Use This Pattern
Use when building data processing systems for analytics, ETL workloads, or real-time data transformation across organisational systems.
Overview
A template for implementing scalable data pipelines with the SQLMesh framework, S3 object storage, and Aurora database patterns. The analytics engine is chosen to align with the existing data lake: a modern engine (DuckDB/DuckLake) or a legacy engine (Iceberg/Trino/Athena).
Core Components
Data flow:
- Data Sources -> SQLMesh -> S3 Storage -> Analytics Engine -> Reports
- SQLMesh -> Aurora PostgreSQL (metadata and state)

Components:
- Data Sources: databases, APIs, files
- SQLMesh: orchestration, transformation, quality checks
- S3 Storage: raw and processed data
- Aurora PostgreSQL: metadata and schema
- Analytics Engine: DuckDB or Athena
- Reports: Quarto, Evidence BI, data API
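A minimal SQLMesh gateway configuration along these lines ties the components together: DuckDB as the execution engine and Aurora PostgreSQL as the state store for pipeline metadata. This is a sketch, not a reference implementation; the class names follow SQLMesh's Python config API and may differ between versions, and the host, credentials, and database names are placeholders.

```python
# config.py - sketch of a SQLMesh gateway pairing DuckDB execution with Aurora state storage
# (hosts, credentials, and file names are placeholders)
from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    DuckDBConnectionConfig,
    PostgresConnectionConfig,
)

config = Config(
    default_gateway="analytics",
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
    gateways={
        "analytics": GatewayConfig(
            # DuckDB runs the transformations over data files staged in S3
            connection=DuckDBConnectionConfig(database="analytics.duckdb"),
            # Aurora PostgreSQL holds SQLMesh state, metadata, and schema versions
            state_connection=PostgresConnectionConfig(
                host="aurora-cluster.example.internal",
                port=5432,
                user="sqlmesh",
                password="change-me",  # source from secrets management in practice
                database="sqlmesh_state",
            ),
        ),
    },
)
```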
Project Kickoff Steps
Foundation Setup
- Apply Isolation - Follow ADR 001: Application Isolation for data processing network and runtime separation
- Deploy Infrastructure - Follow ADR 002: AWS EKS for Cloud Workloads for SQLMesh container deployment
- Configure Infrastructure - Follow ADR 010: Infrastructure as Code for reproducible data infrastructure
- Set Up Database - Follow ADR 018: Database Patterns for Aurora Serverless v2 as the data warehouse
Security & Operations
- Configure Secrets Management - Follow ADR 005: Secrets Management for data source credentials and API keys (see the credential retrieval sketch after this list)
- Set Up Logging - Follow ADR 007: Centralized Security Logging for transformation audit trails
- Set Up Backup Strategy - Follow ADR 014: Object Storage Backups for data warehouse backups
- Data Governance - Follow ADR 015: Data Governance Standards for SQLMesh data contracts and lineage
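For the secrets management step above, data source credentials can be pulled at runtime rather than embedded in configuration. The sketch below assumes the credentials are stored as a JSON secret in AWS Secrets Manager; the secret name and key names are placeholders.

```python
# Fetch data source credentials from AWS Secrets Manager (secret name and keys are placeholders)
import json

import boto3


def get_source_credentials(secret_id: str = "data-pipeline/source-db") -> dict:
    """Return database credentials stored as a JSON secret."""
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


creds = get_source_credentials()
connection_args = {"host": creds["host"], "user": creds["username"], "password": creds["password"]}
```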
Development Process
- Configure CI/CD - Follow ADR 004: CI/CD Quality Assurance for automated testing of data transformations (see the test sketch after this list)
- Set Up Release Process - Follow ADR 009: Release Documentation Standards for SQLMesh model versioning
- Analytics Tools - Follow ADR 017: Analytics Tooling Standards for Quarto and Evidence BI integration
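To make the CI/CD item above concrete, transformation logic can be exercised against a small in-memory DuckDB fixture in the test suite. The table, columns, and aggregation below are hypothetical; they only illustrate the style of check ADR 004 expects to run on every change.

```python
# test_transformations.py - example CI check for a transformation (table and columns are hypothetical)
import duckdb


def test_daily_totals_are_non_negative():
    con = duckdb.connect()  # in-memory database used as the test fixture
    con.execute("CREATE TABLE raw_orders (order_date DATE, amount DOUBLE)")
    con.execute("INSERT INTO raw_orders VALUES ('2025-01-01', 10.0), ('2025-01-01', 5.5)")

    # Same aggregation the transformation model is expected to apply
    result = con.execute(
        "SELECT order_date, SUM(amount) AS total FROM raw_orders GROUP BY order_date"
    ).fetchall()

    assert result, "aggregation returned no rows"
    assert all(total >= 0 for _, total in result)
```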
Implementation Details
Data Processing & Quality:
- Configure SQLMesh for data transformation and orchestration
- Set up S3 object storage for data files and Aurora for metadata
- Implement data quality validation rules and testing frameworks (see the quality check sketch after this list)
- Configure DuckDB/DuckLake or Iceberg/Trino, chosen based on alignment with the existing data lake
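As a sketch of the quality checks referenced above, DuckDB (with the httpfs extension) can validate processed Parquet files directly in S3 before they are exposed downstream. The bucket path and the order_id column are placeholders.

```python
# Basic quality checks over processed Parquet data in S3 (path and column names are placeholders)
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # enables s3:// reads; AWS credentials come from the environment

dataset = "s3://example-data-lake/processed/orders/*.parquet"

row_count = con.execute(f"SELECT COUNT(*) FROM read_parquet('{dataset}')").fetchone()[0]
null_keys = con.execute(
    f"SELECT COUNT(*) FROM read_parquet('{dataset}') WHERE order_id IS NULL"
).fetchone()[0]

assert row_count > 0, "processed dataset is empty"
assert null_keys == 0, "primary key column contains NULLs"
```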
Cost Optimization & Performance:
- Configure Aurora Serverless v2 autoscaling for cost-effective metadata storage
- Implement S3 lifecycle policies for data archival (Intelligent-Tiering → Glacier; see the sketch after this list)
- Use DuckDB for cost-effective analytics and reserve Athena for large-scale queries
- Configure SQLMesh incremental processing to minimize compute costs
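The lifecycle item above can be applied with a rule like the one sketched below. The bucket name, prefix, and day thresholds are placeholders and should follow the retention decisions in ADR 014 and ADR 015.

```python
# Apply an archival lifecycle rule to the data bucket (bucket, prefix, and day counts are placeholders)
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-processed-data",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```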
Data Governance & API Access:
- Set up a data API following ADR 003: API Documentation Standards
- Implement PII handling, data classification, and retention policies per ADR 015: Data Governance Standards
- Configure data lineage tracking and impact analysis through SQLMesh
- Set up automated data profiling and anomaly detection (see the sketch below)
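The profiling and anomaly detection item can start as simple as a volume check that flags days whose row counts deviate sharply from the trailing average; a production deployment would more likely lean on SQLMesh audits or a dedicated profiling tool. The dataset path, column names, and 50% threshold below are illustrative only.

```python
# Flag days whose row count deviates sharply from the trailing weekly average (illustrative values)
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

dataset = "s3://example-data-lake/processed/orders/*.parquet"  # placeholder path
daily = con.execute(
    f"""
    SELECT order_date, COUNT(*) AS row_count
    FROM read_parquet('{dataset}')
    GROUP BY order_date
    ORDER BY order_date
    """
).fetchall()

counts = [row_count for _, row_count in daily]
for i, (day, row_count) in enumerate(daily):
    window = counts[max(0, i - 7):i]  # trailing week of daily counts
    if not window:
        continue
    avg = sum(window) / len(window)
    if avg and abs(row_count - avg) / avg > 0.5:  # 50% deviation threshold (tunable)
        print(f"Volume anomaly on {day}: {row_count} rows vs trailing average {avg:.0f}")
```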