Our journey to Immutable Infrastructure

Mikael Gibert
Teemo Tech Blog · Aug 6, 2018

In this story, I’ll try to explain our journey towards Immutable Infrastructure and how it unlocked Continuous Deployment in our organization.

Building components do not change, just like Lego pieces

Initial situation

Our initial situation was quite classic for a small organization: we had many microservices and followed an environment-branch Git pattern. Our continuous integration server, Jenkins, performed unit tests, packaging, and integration tests. A code review was required before merging code into our environment branches (staging and production).

Our staging and production environments were fully hosted on Google Cloud Platform, and the majority of our cloud resources were created manually. The default project service account was widely used. Instance configuration was managed with Ansible in a dedicated repository (not alongside the application code).

Each time a Pull Request was merged into an environment branch, the author had to deploy the change by running Ansible against the target servers, using the Google Cloud dynamic inventory. Once the staging environment had been updated, we checked the application behavior and, if it was correct, we created a Pull Request from the staging branch to production. The author merged it and then deployed the change the same way it had been deployed to staging.
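
For illustration, a manual deployment looked roughly like this (a sketch only: gce.py was the standard Ansible dynamic inventory script for GCE at the time, but the exact playbook name and invocation below are assumptions, not our precise setup):

# Hypothetical manual deployment against instances tagged "my-service",
# using Ansible's GCE dynamic inventory script (gce.py).
ansible-playbook -i gce.py site.yml --limit "tag_my-service"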

Infrastructure as Code and repository organization

We first decided to manage all our cloud resources with Terraform to make them repeatable and more easily manageable. We also took this opportunity to use dedicated service accounts and fine-grained IAM. We then upgraded Jenkins and moved from manual job configuration to Jenkinsfiles using Declarative Pipelines. Last but not least, we split the Ansible configuration per application.

Then, we chose to give each application its own Git repository containing the infrastructure code, the configuration management code, the continuous integration code, the integration test code (docker-compose), and the application code, along the lines of the layout sketched below.
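
One possible layout (directory names such as terraform/ and src/ are illustrative, not our exact structure; the other files appear later in this story):

my-service/
├── Jenkinsfile            # declarative pipeline: bake, clean, deploy
├── packer.json            # image build definition
├── bake.sh                # runs packer validate/build
├── clean.sh               # deprecates and deletes old images
├── docker-compose.yml     # integration tests
├── ansible/               # playbook directory copied into the image
├── configuration/
│   ├── site.yml           # configuration management entry point
│   └── requirements.yml   # private Ansible roles (ansible-galaxy)
├── terraform/             # cloud resources (illustrative name)
└── src/                   # application code (illustrative name)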

What a big step! However, we still deployed changes by running Ansible from our workstations twice (once per environment branch), which kept our instances highly mutable. We were also unable to roll back to a previous version because of dependencies (apt, pip, …).

Releasing images

To fix the mutability problem, we decided to move to Immutable Infrastructure and release images instead of packaged code. Thus, rolling back means deploying an instance with the previous image.

Google Compute Engine offers some tools to help manage images. Images are grouped into Image Families. An instance can be created either from an Image or from an Image Family; in the latter case, the latest non-deprecated Image in that family is used. An Image can be marked deprecated, obsolete, or deleted; the differences between these states are well explained in the GCE documentation.

Wow, this is huge! It means we can release images under an Image Family and create instances from that family, without having to update our Terraform configuration to change the image!
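
As a quick illustration (the image, disk, and family names are hypothetical), publishing a new image into a family is enough to make it the one the family resolves to:

# Publish a new image into the "my-service" family (hypothetical names).
gcloud compute images create my-service-20180806120000 \
  --source-disk my-service-build-disk \
  --source-disk-zone europe-west1-d \
  --family my-service

# Check which image the family currently resolves to
# (the latest non-deprecated one).
gcloud compute images describe-from-family my-service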

To automate image creation, we chose Packer, as it can build images on GCE and configure them with Ansible.

A few points to note before reading the Packer configuration:

  • We used a private repository to store our custom Ansible roles, so we needed to configure the git command with the correct SSH key and remove that key from the image before considering it production ready.
  • Some (actually a lot) of our Ansible roles were written without immutability in mind and perform tasks that cannot work in an immutable context, such as joining cluster peers. We needed to stay backward compatible during the transition, so we pass an extra var and updated our roles to perform those tasks only in mutable contexts.

This Packer configuration is generic and is used for all projects.

{
  "variables": {
    "source_image_project_id": "debian-cloud",
    "source_image_family": "debian-9",
    "service_name": "",
    "source_commit": "",
    "build_url": ""
  },
  "builders": [
    {
      "type": "googlecompute",
      "project_id": "<my-image-project>",
      "zone": "europe-west1-d",
      "source_image_project_id": "{{user `source_image_project_id`}}",
      "source_image_family": "{{user `source_image_family`}}",
      "ssh_username": "packer",
      "instance_name": "{{user `service_name`}}-{{uuid}}",
      "image_family": "{{user `service_name`}}",
      "image_name": "{{user `service_name`}}-{{isotime \"20060102150405\"}}",
      "image_description": "{{user `service_name`}} released at {{isotime \"2006-01-02 15:04:05\"}} from commit {{user `source_commit`}}, more information at {{user `build_url`}}",
      "tags": ["packer"]
    }
  ],
  "provisioners": [
    {
      "type": "file",
      "source": "packer_key",
      "destination": "/tmp/packer_key"
    },
    {
      "type": "shell",
      "inline": ["chmod 0400 /tmp/packer_key"]
    },
    {
      "type": "shell",
      "inline": ["sudo apt-get update && sudo apt-get -t stretch-backports install -y git ansible"]
    },
    {
      "type": "ansible-local",
      "inventory_groups": "tag_{{user `service_name`}}",
      "playbook_dir": "ansible",
      "playbook_file": "configuration/site.yml",
      "galaxy_file": "configuration/requirements.yml",
      "galaxy_command": "GIT_SSH_COMMAND='ssh -i /tmp/packer_key -F /dev/null -o StrictHostKeyChecking=no' ansible-galaxy",
      "extra_arguments": ["--extra-vars", "immutable_build=yes"]
    },
    {
      "type": "shell",
      "inline": ["rm -f /tmp/packer_key"]
    }
  ]
}

The variable parts are filled in by continuous integration to trigger the image build. This step is scripted as follows:

#!/bin/bash
# Bake an image for a service; positional arguments are provided by CI.
set -e
GIT_COMMIT="$1"
BUILD_URL="$2"
SERVICE_NAME="$3"
packer validate -var "source_commit=${GIT_COMMIT}" -var "build_url=${BUILD_URL}" -var "service_name=${SERVICE_NAME}" packer.json
packer build -var "source_commit=${GIT_COMMIT}" -var "build_url=${BUILD_URL}" -var "service_name=${SERVICE_NAME}" packer.json

Pretty easy, right? The Jenkinsfile only needs to run this script.

stages {
  stage('Bake') {
    steps {
      sh "./bake.sh ${env.GIT_COMMIT} ${env.BUILD_URL} my-service"
    }
  }
}

But beware: images are never deleted automatically, which means you will rapidly run into a hell otherwise known as Image Proliferation! Let’s fix that by adding this step to our continuous integration.

post {
  always {
    sh "./clean.sh my-service"
    cleanWs()
  }
}

The clean.sh implementation follows. Note that we keep 3 active images in order to manage rollbacks, and 3 deprecated images in case we want to run them in another environment for debugging purposes.

#!/bin/bash
# Clean up old images for a service: deprecate the oldest active images,
# then delete the oldest deprecated ones.
SERVICE_NAME="$1"
MAX_ACTIVE_IMAGES=3
MAX_DEPRECATED_IMAGES=3
PROJECT=<my-image-project>

ACTIVE_IMAGES=($(gcloud compute images list --project $PROJECT --no-standard-images --filter ${SERVICE_NAME} --sort-by creationTimestamp | awk '{print $1}' | tail -n +2))
ACTIVE_IMAGES_LENGTH=${#ACTIVE_IMAGES[@]}
echo "We found ${ACTIVE_IMAGES_LENGTH} active images for ${SERVICE_NAME} service, here they are:"
echo "${ACTIVE_IMAGES[@]}"
echo ""
if [ "${ACTIVE_IMAGES_LENGTH}" -gt "${MAX_ACTIVE_IMAGES}" ]; then
  echo "We need some cleanup!"
  for (( i=0; i<$(( ACTIVE_IMAGES_LENGTH - MAX_ACTIVE_IMAGES )); i++ )); do
    echo "Deprecating ${ACTIVE_IMAGES[$i]}"
    gcloud compute images deprecate ${ACTIVE_IMAGES[$i]} --state DEPRECATED --project $PROJECT
  done
  echo ""
fi

DEPRECATED_IMAGES=($(gcloud compute images list --project $PROJECT --no-standard-images --show-deprecated --filter ${SERVICE_NAME} --sort-by creationTimestamp | grep DEPRECATED | awk '{print $1}' | tail -n +2))
DEPRECATED_IMAGES_LENGTH=${#DEPRECATED_IMAGES[@]}
echo "We found ${DEPRECATED_IMAGES_LENGTH} deprecated images for ${SERVICE_NAME} service, here they are:"
echo "${DEPRECATED_IMAGES[@]}"
echo ""
if [ "${DEPRECATED_IMAGES_LENGTH}" -gt "${MAX_DEPRECATED_IMAGES}" ]; then
  echo "We need some cleanup!"
  for (( i=0; i<$(( DEPRECATED_IMAGES_LENGTH - MAX_DEPRECATED_IMAGES )); i++ )); do
    echo "Deleting ${DEPRECATED_IMAGES[$i]}"
    gcloud compute images delete ${DEPRECATED_IMAGES[$i]} --project $PROJECT --quiet
  done
  echo ""
fi

Now we can deliver images at each build, which means two things: images are our new artifacts and we achieved Continuous Delivery.

Guess what? Thanks to Image Families, we have just unlocked Continuous Deployment.

Deploying changesets

Changesets are now packaged as images that belong to an image family. To deploy changes, we have to replace old instances with new ones, which means we now have to choose a deployment strategy.

Fortunately, GCE natively supports some common deployment strategies, such as rolling restart and rolling replace, and they are configurable with the Terraform Google provider. To benefit from these features, instances have to be managed by a Managed Instance Group, which creates them from Instance Templates.

Our deployment process is quite simple: we retrieve the latest image from a given family, then use it as the Instance Template boot disk. An Instance Group Manager then manages the instances and the deployment strategy, using the Instance Template to create its instances.

A rolling update policy can be configured with surge and maximum-unavailable-instance parameters to control the deployment speed.

Basic service

data "google_compute_image" "my-service" {
family = "my-service"
project = "my-image-project"
}
resource "google_compute_instance_template" "my-service" {
name_prefix = "my-service-"
tags = ["..."]
machine_type = "n1-standard-1"
can_ip_forward = false
labels = {
service = "my-service"
environment = "${var.environment}"
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
}
disk {
source_image = "${data.google_compute_image.my-service.self_link}"
disk_type = "pd-standard"
disk_size_gb = 20
auto_delete = true
boot = true
}
network_interface {
network = "default"
access_config = {}
}
service_account {
email = "${google_service_account.my-service.email}"
scopes = ["..."]
}
lifecycle {
create_before_destroy = true
}
}
resource "google_compute_instance_group_manager" "my-service" {
name = "my-service-manager"
base_instance_name = "my-service"
instance_template = "${google_compute_instance_template.my-service.self_link}"
update_strategy = "ROLLING_UPDATE"
zone = "europe-west1-d"
target_size = 3
rolling_update_policy {
type = "PROACTIVE"
minimal_action = "REPLACE"
max_surge_fixed = 1
max_unavailable_fixed = 0
}
}

Deploying a changeset is a one-step command with a few variables to provide. This stage is easy to integrate into our continuous integration, so we just achieved — wait for it — continuous deployment.

terraform apply -var 'environment=staging'
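
In practice, CI can wrap this in a small script; here is a minimal sketch (the script layout, -input=false, and -auto-approve are assumptions about how one might automate it, not our exact setup):

#!/bin/bash
# Hypothetical CI deploy step: applies the Terraform configuration
# for the given environment without interactive approval.
set -e
ENVIRONMENT="$1"
terraform init -input=false
terraform apply -input=false -auto-approve -var "environment=${ENVIRONMENT}"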

Web service

We have to add some elements to continuously deploy services accessible on the web: a load balancer and an SSL certificate.

Let’s update our Terraform configuration. We have to change our instance group manager to use a named port, which the load balancer will use to forward traffic to group members.

resource "google_compute_instance_group_manager" "my-service" {
name = "my-service-manager"
base_instance_name = "my-service"
instance_template = "${google_compute_instance_template.my-service.self_link}"
update_strategy = "ROLLING_UPDATE"
zone = "europe-west1-d"
target_size = 3
named_port {
name = "http"
port = 80
}
rolling_update_policy {
type = "PROACTIVE"
minimal_action = "REPLACE"
max_surge_fixed = 1
max_unavailable_fixed = 0
}
}

We then have to add the load balancer configuration, with a public address, SSL, and Identity-Aware Proxy if it is an internal service.

resource "google_compute_global_address" "my-service" {
name = "my-service-address"
}
resource "google_compute_health_check" "my-service" {
name = "my-service-healthcheck"
check_interval_sec = 5
timeout_sec = 1
http_health_check {
port = "80"
request_path = "/healthz"
}
}
resource "google_compute_backend_service" "my-service" {
name = "my-backend-service"
port_name = "http"
protocol = "HTTP"
health_checks = ["${google_compute_health_check.my-service.self_link}"]
backend {
group = "${google_compute_instance_group_manager.my-service.instance_group}"
}
iap {
oauth2_client_id = "<client_id>"
oauth2_client_secret = "<client_secret>"
}
}
resource "google_compute_url_map" "my-service" {
name = "my-service-url-map"
default_service = "${google_compute_backend_service.my-service.self_link}"
}
resource "google_compute_ssl_certificate" "my-service" {
name = "my-service-certificate"
description = "My Service certificate"
private_key = "${file("files/key.pem")}"
certificate = "${file("files/cert.pem")}"
}
resource "google_compute_target_https_proxy" "my-service" {
name = "my-service-https-proxy"
url_map = "${google_compute_url_map.my-service.self_link}"
ssl_certificates = ["${google_compute_ssl_certificate.my-service.self_link}"]
}
resource "google_compute_global_forwarding_rule" "my-service" {
name = "my-service-forwarding-rule-https"
port_range = "443"
ip_address = "${google_compute_global_address.my-service.address}"
target = "${google_compute_target_https_proxy.my-service-https-proxy.self_link}"
}

Autoscaling and Self Healing

It is possible to attach a self-healing policy to an instance group manager. It means that any instance in the group that is not considered healthy will be replaced.

resource "google_compute_health_check" "my-service" {
name = "my-service-healthcheck"
check_interval_sec = 1
timeout_sec = 1
healthy_threshold = 2
unhealthy_threshold = 10
http_health_check {
request_path = "/healthz"
port = "80"
}
}
resource "google_compute_instance_group_manager" "my-service" {
name = "my-service-manager"
base_instance_name = "my-service"
instance_template = "${google_compute_instance_template.my-service.self_link}"
update_strategy = "ROLLING_UPDATE"
zone = "europe-west1-d"
target_size = 3
named_port {
name = "http"
port = 80
}
auto_healing_policies {
health_check = "${google_compute_health_check.my-service.self_link}"
initial_delay_sec = 10
}
rolling_update_policy {
type = "PROACTIVE"
minimal_action = "REPLACE"
max_surge_fixed = 1
max_unavailable_fixed = 0
}
}

We can also set up an autoscaling policy that lets the instance group manager add or remove instances depending on an indicator (CPU consumption, queries per second, a custom metric, …). We do this by defining an autoscaler resource that resizes the instance group whenever an autoscaling condition is met.

resource "google_compute_autoscaler" "my-service" {
name = "my-service"
zone = "europe-west1-d"
target = "${google_compute_instance_group_manager.my-service.self_link}"
autoscaling_policy = {
max_replicas = 5
min_replicas = 1
cooldown_period = 60
cpu_utilization {
target = 0.5
}
}
}

Testing changesets

GCE instance groups support running multiple versions at the same time and splitting instances between them, which unlocks canary and A/B testing strategies. Using this feature, we can update the workflow I described to work with two image families. We can imagine a scenario where merging a Pull Request to the staging branch releases an image to the my-service-rc image family, and merging to the production branch releases an image to the my-service image family.

After this step, we can automate canary releases to observe the impact of a new version on the system. Yes, it means you must closely monitor your production behavior during deployments. But you already do that, right?

With the following configuration, GCP will always keep one instance in the canary version and all other instances in the standard version.

resource "google_compute_instance_group_manager" "my-service" {
name = "my-service"

base_instance_name = "my-service"
update_strategy = "NONE"
zone = "europe-west1-d"

version {
instance_template = "${google_compute_instance_template.my-service.self_link}"
}

version {
instance_template = "${google_compute_instance_template.my-service-rc.self_link}"
target_size {
fixed = 1
}
}
}

Rollback

Holy crap, despite all our efforts, shit happens sometimes, and our shiny version is broken :’(

Rolling back to a previous version is as easy as pie! Just deprecate the faulty Image, and the latest non-deprecated image in the family becomes the previous one. Then we simply apply our Terraform configuration again, without any change, and everything is fine again.

gcloud compute images deprecate my-service-rc-20180802 --state DEPRECATED
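
To double-check before re-applying, the family can be inspected (a small sketch, reusing the hypothetical names above):

# Confirm which image the my-service-rc family now resolves to.
gcloud compute images describe-from-family my-service-rc

# Re-apply the unchanged Terraform configuration to replace instances
# with the previous image.
terraform apply -var 'environment=staging'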

Conclusion

To summarize, we went from manual cloud configuration and highly mutable instances (OK, you can say our instances were pets) to an immutable infrastructure (hello cattle) managed with code.

Our release and rollback processes are now uniform, as are our cloud configurations. We can update or roll back a service in one step, and canary releases allow us to test changes against real production constraints.
