Bootstrapping a data platform infrastructure on Kubernetes
The goal of this article is to show how easily you can bootstrap a resilient, scalable, and budget-friendly data processing platform in the cloud with Terraform and Terragrunt. In this article, I focus only on AWS. I may cover the same setup on GCP in another article.
In my previous articles about Spark on Kubernetes, I took a resilient and scalable infrastructure for granted. In many companies, that part is usually managed by a platform team. But if your organization does not use Kubernetes company-wide, getting such an environment provisioned for your team can easily take months.
The good news is that a data processing platform is actually much simpler to set up than a general-purpose application platform. It usually does not need to handle incoming public traffic, service mesh concerns, or DNS/SSL certificate management. That makes it a very good candidate for a self-service infrastructure setup.
Setting up a Kubernetes cluster on AWS
If you are not familiar with Terraform or Terragrunt, have a quick look at their documentation first. They are very powerful infrastructure-as-code tools.
TL;DR: if you just want the code, check out the full implementation in my Git repository. It also contains a deployment guide.
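To give a taste of what Terragrunt adds on top of Terraform: it keeps per-environment configuration DRY by letting a single root file generate things like remote state wiring for every module. Here is a minimal sketch of such a root file; the bucket, lock table, and region names are placeholders, not the ones from my repository.

```hcl
# terragrunt.hcl (repository root) -- placeholder names, adapt to your account
remote_state {
  backend = "s3"
  config = {
    bucket         = "my-data-platform-tf-state"  # placeholder bucket name
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-west-1"                  # placeholder region
    encrypt        = true
    dynamodb_table = "my-data-platform-tf-locks"  # placeholder lock table
  }
}
```

Each module folder then only needs an `include` block to inherit this state configuration instead of repeating the backend setup everywhere.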
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "13.2.1"
  cluster_name    = var.cluster_name
  cluster_version = "1.17"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id
  enable_irsa     = true

  worker_groups_launch_template = [
    {
      name                    = "on-demand-1"
      override_instance_types = var.ondemand_instance_types
      asg_max_size            = var.ondemand_asg_max_size
      kubelet_extra_args      = "--node-labels=node.kubernetes.io/lifecycle=ondemand"
      suspended_processes     = ["AZRebalance"]
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "false"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${var.cluster_name}"
          "propagate_at_launch" = "false"
          "value"               = "true"
        }
      ]
    },
    {
      name                    = "spot-1"
      override_instance_types = var.spot_instance_types
      spot_instance_pools     = 4
      asg_max_size            = var.spot_asg_max_size
      asg_desired_capacity    = 1
      kubelet_extra_args      = "--node-labels=node.kubernetes.io/lifecycle=spot --register-with-taints=node-role.kubernetes.io/spot=true:PreferNoSchedule"
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "false"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${var.cluster_name}"
          "propagate_at_launch" = "false"
          "value"               = "true"
        }
      ]
    },
  ]

  worker_additional_security_group_ids = [aws_security_group.all_worker_mgmt.id]
}
This setup uses the existing Terraform EKS module. It defines two worker node pools:
- one pool with on-demand instances to host critical components
- one pool with spot instances to run compute workloads more cheaply
Later, in your Kubernetes deployments, you can decide which node pool a workload lands on: a node selector on the lifecycle label pins a pod to a pool, and a toleration lets it accept the taint on the spot nodes.
This split works very well in practice: keep the essential services on stable nodes, and push bursty or fault-tolerant compute workloads onto spot capacity.
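To make this concrete, here is a minimal sketch of a pod pinned to the spot pool, written with the Terraform Kubernetes provider. The pod name and image are placeholders; the label and taint match the ones set in kubelet_extra_args above.

```hcl
# Placeholder pod that targets the spot node pool defined above.
resource "kubernetes_pod" "spark_worker_example" {
  metadata {
    name = "spark-worker-example" # placeholder name
  }
  spec {
    # Select nodes from the spot pool via the lifecycle label.
    node_selector = {
      "node.kubernetes.io/lifecycle" = "spot"
    }
    # Accept the taint registered on the spot nodes.
    toleration {
      key      = "node-role.kubernetes.io/spot"
      operator = "Equal"
      value    = "true"
      effect   = "PreferNoSchedule"
    }
    container {
      name  = "worker"
      image = "spark:3.0.1" # placeholder image
    }
  }
}
```

Critical components such as the Airflow scheduler get the opposite treatment: a node selector on `lifecycle=ondemand` and no spot toleration.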
Deploying the cluster autoscaler and the spot instance handler
Once the cluster is in place, the next step is to deploy the cluster autoscaler and the spot instance termination handler. I use the Terraform Helm provider for that.
resource "helm_release" "cluster-autoscaler" {
  name       = "cluster-autoscaler"
  version    = "8.0.0"
  repository = "https://charts.helm.sh/stable"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"

  # The chart needs to know which cluster to manage and in which region it
  # runs (assumes a var.aws_region variable is defined alongside var.cluster_name).
  set {
    name  = "autoDiscovery.clusterName"
    value = var.cluster_name
  }
  set {
    name  = "awsRegion"
    value = var.aws_region
  }
}

resource "helm_release" "spot-handler" {
  name       = "spot-handler"
  version    = "1.4.9"
  repository = "https://charts.helm.sh/stable"
  chart      = "k8s-spot-termination-handler"
  namespace  = "kube-system"
}
At this point, you already have a managed, scalable, and resilient Kubernetes cluster on AWS.
The cluster autoscaler adjusts node capacity automatically based on workload demand, while the spot termination handler helps the cluster react more gracefully to spot interruption events. Together, they make spot-based compute much more practical for data workloads.
Deploying Airflow
The next step is to deploy Airflow into the cluster so it can orchestrate Spark streaming jobs and Spark batch ETL pipelines, as discussed in my previous articles.
resource "helm_release" "airflow" {
  name       = "airflow"
  repository = "https://dataops-sre.github.io/helm-charts/"
  chart      = "airflow"
  namespace  = "default"
}
I packaged my Airflow Helm chart in my own Helm repository.
Conclusion
And that is basically it: a resilient, scalable, and budget-friendly data platform infrastructure running on AWS.
What I like about this setup is that it gives a small data team a lot of autonomy. You do not need a huge platform organization to get started. With a relatively small amount of Terraform and Helm configuration, you can build a solid Kubernetes foundation for Spark, Airflow, and other data workloads.
Of course, this is only the bootstrap phase. In a real platform, you will probably also want to add observability, IAM fine-tuning, backups, cost monitoring, and security hardening. But as a starting point, this setup already gives you a strong and very usable base.