ADR 002: AWS EKS for Cloud Workloads
Status: Accepted | Date: 2025-02-17
Context
Organisations want to manage and scale bespoke workloads efficiently and securely. Traditional server management can be cumbersome and inefficient for dynamic workloads. Provider-specific control planes can result in lock-in and artificial constraints that limit technology options.
Decision
To address these challenges, use a CNCF Certified Kubernetes platform with automatically managed infrastructure resources. Due to hyperscaler availability and size, AWS EKS (Elastic Kubernetes Service) in Auto Mode is the preferred option. This uses Kubernetes for orchestration, AWS EKS for managed Kubernetes services, AWS Elastic Block Store (EBS) for storage and AWS load balancers for traffic management.
- AWS EKS Auto Mode: Provide a managed Kubernetes service that automatically scales the infrastructure based on workload demands.
- Managed Storage and NodePools: Ensure that the underlying infrastructure is maintained and updated by AWS.
- Load Balancers: Standardise ingress and traffic management.
- Persistent Storage: Databases and object storage should use managed services (DBaaS for databases) to enable higher resilience for point-in-time recovery (PITR) and backups with lower overheads, as per ADR 018: Database Patterns.
Consequences
Benefits:
- Efficient resource utilisation through managed scaling
- Clear boundaries for shared responsibilities with a small operational overhead
- Enhanced security through automatic updates and patches
- Improved availability with managed storage and node pools
Risks if not implemented:
- Resource inefficiency from manual scaling
- High operational overhead managing custom infrastructure
- Security vulnerabilities from delayed updates
- Service downtime during traffic spikes
Strategic Research
Analysis such as Mitchell, Andrew D.; Samlidis, Theodore, "Cloud services and government digital sovereignty in Australia and beyond", International Journal of Law and Information Technology, Vol. 29, No. 4, 2021, pp. 364-394, highlights the ongoing issues with depending on hyperscalers in a single foreign jurisdiction. Given this changing landscape, it is worth exploring simplified options for secure, sovereign-owned hosting, such as Australian dedicated servers and local colocation in Tier 3+ datacentres (designed for 99.98% uptime). These options are touched on below.
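To put the 99.98% design uptime figure in concrete terms, the following minimal sketch converts an availability percentage into expected downtime per year, which comes to roughly 1.75 hours for a single Tier 3+ facility. The function names are hypothetical. The two-site calculation is an illustrative assumption only: it assumes the sites fail independently and that either site alone can carry the workload, which the source does not state.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected downtime in hours per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def combined_availability_pct(availability_pct: float, sites: int) -> float:
    """Availability when the service is up if at least one site is up.

    ASSUMPTION: site failures are independent and any single site can
    serve the full workload. Real-world figures will be lower.
    """
    unavailable = (1 - availability_pct / 100) ** sites
    return 100 * (1 - unavailable)


if __name__ == "__main__":
    single = downtime_hours_per_year(99.98)
    print(f"Single site at 99.98%: {single:.2f} hours of downtime per year")

    two_sites = combined_availability_pct(99.98, sites=2)
    print(f"Two independent sites: {two_sites:.6f}% availability (idealised)")
```

The idealised two-site number is an upper bound; shared dependencies such as DNS, transit providers or operator error are not modelled.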
Bare metal management
Use a platform like Proxmox VE to run standalone clusters at multiple facilities, with multiple 2U servers per location. Example hardware (starting at approximately $15k AUD per server): Dell PowerEdge R7725, HPE ProLiant DL385 Gen11, Lenovo ThinkSystem SR665 V3.
Year 1 estimated costs:
- Hardware: ~$200k for 6x ~$33k servers
- Colo (2 sites, Tier 3+): ~$50k for 2x 5 kW racks with 1 Gbit/s IP transit
- Total: ~$250k for ~2-3 TB RAM, ~500 cores and 100 TB disk across 2 sites (divide raw capacity by a factor of 2-3 to allow for redundancy)
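The estimate above can be checked with a short calculation. This is a minimal sketch using only the figures listed; the variable and function names are hypothetical, the amounts are assumed to be in AUD (as for the per-server price), and applying the redundancy factor as a simple division of raw capacity is one interpretation of "a factor of 2-3".

```python
# Figures taken from the Year 1 estimate; all are approximate.
SERVER_COUNT = 6
COST_PER_SERVER = 33_000   # assumed AUD, ~$33k per server
COLO_COST = 50_000         # assumed AUD, 2 sites with one 5 kW rack each

RAW_CORES = 500
RAW_DISK_TB = 100
RAW_RAM_TB_RANGE = (2, 3)  # the estimate gives a range of ~2-3 TB


def year_one_total(server_count: int, cost_per_server: int, colo_cost: int) -> int:
    """Hardware plus colocation cost for the first year."""
    return server_count * cost_per_server + colo_cost


def usable_capacity(raw: float, redundancy_factor: float) -> float:
    """Raw capacity divided by the redundancy factor (assumed interpretation)."""
    return raw / redundancy_factor


if __name__ == "__main__":
    total = year_one_total(SERVER_COUNT, COST_PER_SERVER, COLO_COST)
    print(f"Year 1 total: ${total:,}")

    for factor in (2, 3):
        cores = usable_capacity(RAW_CORES, factor)
        disk = usable_capacity(RAW_DISK_TB, factor)
        ram_low = usable_capacity(RAW_RAM_TB_RANGE[0], factor)
        ram_high = usable_capacity(RAW_RAM_TB_RANGE[1], factor)
        print(
            f"Redundancy factor {factor}: ~{cores:.0f} cores, "
            f"~{disk:.0f} TB disk, ~{ram_low:.2f}-{ram_high:.2f} TB RAM usable"
        )
```

With these inputs the hardware and colocation lines sum to $248,000, consistent with the rounded ~$250k total. Power, support contracts, networking equipment and staff time are not itemised in the source and are not included here.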