ADR 002: AWS EKS for Cloud Workloads

Status: Accepted | Date: 2025-02-17

Context

Organisations want to manage and scale bespoke workloads securely and efficiently. Traditional server management can be cumbersome and inefficient for dynamic workloads. Provider-specific control planes can result in lock-in and artificial constraints that limit technology options.

Decision

To address these challenges, use a CNCF Certified Kubernetes platform with automatically managed infrastructure resources. Given the availability and scale of the hyperscaler, AWS EKS (Elastic Kubernetes Service) in Auto Mode is the preferred option. This uses Kubernetes for orchestration, AWS EKS for managed Kubernetes services, AWS Elastic Block Store (EBS) for storage and AWS load balancers for traffic management.

AWS EKS Auto Mode: Provide a managed Kubernetes service that automatically scales the infrastructure based on workload demands.
Managed Storage and NodePools: Ensure that the underlying infrastructure is maintained and updated by AWS.
Load Balancers: Standardise ingress and traffic management.
Persistent Storage: Databases and object storage should be consumed as managed services (DBaaS for databases) to enable higher resilience, including point-in-time recovery (PITR) and backups, with lower overheads, as per ADR 018: Database Patterns.
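
As an illustration of the EBS storage choice, the sketch below builds a Kubernetes StorageClass manifest for an EKS Auto Mode cluster. The function name and the StorageClass name are hypothetical, and the provisioner name and parameters are assumptions based on the AWS documentation for Auto Mode; verify them against the current documentation before use.

```python
import json


def auto_mode_storage_class(name="auto-ebs-sc"):
    """Return a StorageClass manifest as a plain dict.

    Assumptions (not taken from this ADR): the StorageClass name, the
    Auto Mode EBS provisioner name, and the gp3/encrypted parameters.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumed provisioner for EBS volumes under EKS Auto Mode.
        "provisioner": "ebs.csi.eks.amazonaws.com",
        # Delay volume creation until a pod is scheduled, so the volume
        # is created in the same availability zone as the node.
        "volumeBindingMode": "WaitForFirstConsumer",
        "parameters": {"type": "gp3", "encrypted": "true"},
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(auto_mode_storage_class(), indent=2))
```

The printed JSON could be piped to `kubectl apply -f -`. Generating the manifest in code is only one option; a static manifest kept in version control would serve equally well.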

Consequences

Benefits:

Efficient resource utilisation through managed scaling
Clear boundaries for shared responsibilities with a small operational overhead
Enhanced security through automatic updates and patches
Improved availability with managed storage and node pools

Risks if not implemented:

Resource inefficiency from manual scaling
High operational overhead managing custom infrastructure
Security vulnerabilities from delayed updates
Service downtime during traffic spikes

Strategic Research

Analysis such as Mitchell, Andrew D. and Samlidis, Theodore, "Cloud services and government digital sovereignty in Australia and beyond", International Journal of Law and Information Technology, Vol. 29, No. 4, 2021, pp. 364-394, highlights the ongoing issues with depending on hyperscalers in a single foreign jurisdiction. Given this changing landscape, it is worth exploring simplified options for secure, sovereign-owned hosting, such as Australian dedicated servers and local colocation in Tier 3+ datacentres (designed for 99.98% uptime). These options are touched on below.
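
To put the 99.98% design figure in context, the following minimal sketch converts an availability percentage into expected downtime per year. The function name is hypothetical, and the calculation assumes a 365-day year and says nothing about how outages are distributed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def annual_downtime_hours(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 99.98% availability allows roughly 1.75 hours of downtime per year.
    print(f"{annual_downtime_hours(99.98):.2f} hours per year")
```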

Bare metal management

Use a platform like Proxmox VE to run standalone clusters at multiple facilities, with multiple 2U servers per location. Example hardware (starting at approximately $15k AUD per server): Dell PowerEdge R7725, HPE ProLiant DL385 Gen11, Lenovo ThinkSystem SR665 V3.

Year 1 estimated costs:

Hardware: ~$200k for 6x ~$33k servers
Colo (2 sites, Tier 3+): ~$50k for 2x 5 kW racks with 1 Gbit/s IP transit
Total: ~$250k for ~2-3 TB RAM, ~500 cores and 100 TB disk across 2 sites (divide raw capacity by 2-3 to allow for redundancy)
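
The Year 1 estimate above can be checked with a short calculation. This is a sketch using the rounded figures from the estimate. The function names and the derived cost per usable core are illustrative assumptions, not part of the original costing.

```python
SERVERS = 6
COST_PER_SERVER_AUD = 33_000  # approximate configured price per server
COLO_AUD = 50_000             # 2 racks across 2 sites, year 1
RAW_CORES = 500
RAW_DISK_TB = 100


def year_one_total(servers=SERVERS, per_server=COST_PER_SERVER_AUD, colo=COLO_AUD):
    """Hardware plus colocation cost for the first year, in AUD."""
    return servers * per_server + colo


def usable(raw, redundancy_factor):
    """Capacity left after dividing raw capacity by a redundancy factor."""
    return raw / redundancy_factor


if __name__ == "__main__":
    total = year_one_total()
    print(f"Year 1 total: ${total:,} AUD")  # 248,000, i.e. roughly $250k
    for factor in (2, 3):
        cores = usable(RAW_CORES, factor)
        disk = usable(RAW_DISK_TB, factor)
        print(
            f"redundancy factor {factor}: ~{cores:.0f} cores, "
            f"~{disk:.0f} TB disk, ~${total / cores:,.0f} AUD per usable core"
        )
```

Dividing by the redundancy factor is a deliberately coarse model: it treats compute and storage overheads as identical, which a real capacity plan would separate.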