2 minute read

Introduction

The motivation of this article is to show how easily one can setup a resiliant, scalable and budget data processing plateform infrastucture in cloud thanks to Terraform/Terragrunt. In this article, I tackle only AWS cloud provider, I will make the same thing with GCP in an another article.

In my past articles about Spark on kubernetes, it takes a resiliant, scalable infrastructure as granted, it is indeed usualy managed by the company’s plateform team. if your organisation does not use Kubernetes company wide, and request such feature would probably take few months. Son’t worry, just do it yourself! A Data processing plateform has no intention to manage any incoming traffic, service mesh or dns/ssl certificates configurations, thus it is easy to setup.

Setup Kubernetes cluster on AWS

If you are not familiar with Terraform or Terragrunt, have a glance on their blogs, it’s a powerfull code as infrastructure tool.

TL;TR, I just want code! check out the entire code base at my git repository, it contains a deployment guide.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "13.2.1"
  cluster_name    = var.cluster_name
  cluster_version = "1.17"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id
  enable_irsa     = true

  worker_groups_launch_template = [
    {
      name                    = "on-demand-1"
      override_instance_types = var.ondemand_instance_types
      asg_max_size            = var.ondemand_asg_max_size
      kubelet_extra_args      = "--node-labels=node.kubernetes.io/lifecycle=ondemand"
      suspended_processes     = ["AZRebalance"]
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "false"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${var.cluster_name}"
          "propagate_at_launch" = "false"
          "value"               = "true"
        }
      ]
    },
    {
      name                    = "spot-1"
      override_instance_types = var.spot_instance_types
      spot_instance_pools     = 4
      asg_max_size            = var.spot_asg_max_size
      asg_desired_capacity    = 1
      kubelet_extra_args      = "--node-labels=node.kubernetes.io/lifecycle=spot --register-with-taints=node-role.kubernetes.io/spot=true:PreferNoSchedule"
      tags = [
        {
          "key"                 = "k8s.io/cluster-autoscaler/enabled"
          "propagate_at_launch" = "false"
          "value"               = "true"
        },
        {
          "key"                 = "k8s.io/cluster-autoscaler/${var.cluster_name}"
          "propagate_at_launch" = "false"
          "value"               = "true"
        }
      ]
    },
  ]

  worker_additional_security_group_ids = [aws_security_group.all_worker_mgmt.id]
}

It uses existing Terraform EKS module, it defines two worker node pools, one with on demande instance to host vital components, one with spot instances to schedule compute units. you can later in your kubernetes deployments to specify which work node pool to target by setting kubernetes taint and toleration

Deploy the cluster autoscaler and the spot instance handler

Use terraform Helm provider to deploy the cluster autoscaler and the spot instance termination handler.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
resource "helm_release" "cluster-autoscaler" {
  name       = "cluster-autoscaler"
  version    = "8.0.0"
  repository = "https://charts.helm.sh/stable"
  chart      = "cluster-autoscaler"
  namespace  = "kube-system"
}

resource "helm_release" "spot-handler" {
  name       = "spot-handler"
  version    = "1.4.9"
  repository = "https://charts.helm.sh/stable"
  chart      = "k8s-spot-termination-handler"
  namespace  = "kube-system"
}

now you have a managed, fully scalable, resiliant Kubernetes cluster on AWS!

Deploy Airflow

We now deploy an Airflow instance in our cluster to perform Spark streaming jobs or Spark batch ETL as we discussed in my other articles.

1
2
3
4
5
6
7
8
resource "helm_release" "aiflow" {
  name       = "airflow"
  repository = "https://mrmuggymuggy.github.io/helm-charts/"
  chart      = "airflow"
  namespace  = "default"
}

I packed my Airflow helm chart into my own helm repo. Here were are, a resiliant, scalable and budget data processing plateform infrastucture in AWS!