Terraform Refactoring Nightmare

Terraform Is Better, But Often A Nightmare

We have gone through several iterations of our Terraform code now, and it’s getting better. We have consolidated module implementations from various sources into a central GitHub repository of modules, released with semantic version tags, all upgraded to TF 1.0.10 or even 1.1, and reconciled the several different formats, names, and implementations that existed across our infra. It took a lot of work to get everything aligned.

The language and tools are also getting a lot better, and pretty quickly. I am looking forward to what’s next.

But recently we had to upgrade a bunch of databases running old versions of PostgreSQL. While I would have loved to do these upgrades with Terraform, the logic required ruled it out. We needed to control the timing, drop several views before the upgrade and reapply them after, and handle several new extensions, parameter groups, and so on.

Nearly all of this could be done in Terraform, but as we tested, we found there were multiple unpredictable states. For example, if the code ran during the maintenance window it would fail, or if a database needed maintenance during the upgrade it would fail. So we built scripts that called the AWS APIs and knew how to handle the exception cases.

In the end, we had the same databases running on higher versions, with instance names that finally met our naming standards. These resources were already in Terraform and just needed to be updated.

Going back to our terraform code to synchronize with the new state of the world has proven to be a nightmare.

What I Expected Terraform Would Do

To be clear, I did not expect Terraform to do the major version upgrade of our PostgreSQL databases. This is a complex set of dependent and occasionally flaky (non-deterministic) operations. But in the end, I have a newer version of the same database already managed by Terraform, and this seems like state drift to me.

For example, we had an AWS Aurora database running Postgres 9.6.22 called company. The cluster was called company-production-2018-0121431932022 and the writer instance was called company-production-2018-0121431932022-1. Our other databases followed the pattern name-environment-<cluster|instance-N>, so as part of the upgrade we renamed the company cluster to company-production-cluster and its instances to company-production-instance-1 and company-production-instance-2. The instances also run a newer engine version, which has features not available in 9.6. In short: several changes.

I would like to think I could simply replace the engine version 9.6.22 with the later version now present and specify the new names of the cluster and instances. Pretty much everything else is the same. For a moment, I thought maybe Terraform would detect drift, and I could just run:
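Something along these lines, assuming a refresh-only plan would pick up the renamed cluster and new engine version as drift:

    terraform plan -refresh-only
    terraform apply -refresh-only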

But that didn’t work.

I also briefly hoped that the new moved configuration block in TF 1.1 would help. But my resource identifiers are the same. I am not moving resource definitions from one place to another; I am just updating the existing resource definitions with new ones.
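For reference, moved only maps an old address to a new one; a sketch (with made-up addresses) looks like this:

    moved {
      from = module.databases.aws_db_instance.primary
      to   = module.company_db.aws_db_instance.primary
    }

In my case, from and to would be identical, so there is nothing for moved to do.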

What I Actually Have To Do with Terraform

Terraform now sees my state as an old database instance it must destroy and a new database instance to create, and the same for all of the other resources defined in the module.

The module address is the same, so I can’t simply use terraform state mv. I can’t use terraform import directly because terraform sees the old resource in the state and says it’s already managed. And there’s no way I have found to tell terraform that the old instance it knows about is really the new instance it wants to create.

This means I have to remove the old reference from state with terraform state rm <address> and then import the new one using terraform import <address> <id>. Several other resources depend on the aws_db_instance, so each needs to be rm’d and re-imported.
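For one resource, that looks roughly like this (the module and resource addresses are illustrative, not our real ones):

    terraform state rm module.company_db.aws_db_instance.writer
    terraform import module.company_db.aws_db_instance.writer company-production-instance-1

…and the same dance for the cluster and every other dependent resource.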

But it’s not that easy. Our module also configures aspects of the underlying postgres databases: names, passwords, roles, extensions and so on.

The AWS provider implements an aws_db_instance resource that defines the server infrastructure. The aws_db_instance exports attributes like the host address, master username and password, and port. These values are necessary inputs for the postgresql provider so it can connect to the instance and create the database resources.
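The wiring looks roughly like this; the resource and variable names are illustrative, but the pattern of feeding one provider from another provider’s resource is the crux:

    resource "aws_db_instance" "this" {
      identifier     = var.identifier
      engine         = "postgres"
      engine_version = var.engine_version
      username       = var.master_username
      password       = var.master_password
      # ... instance class, storage, parameter group, etc.
    }

    provider "postgresql" {
      host     = aws_db_instance.this.address
      port     = aws_db_instance.this.port
      username = aws_db_instance.this.username
      password = var.master_password
      sslmode  = "require"
    }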

In short, my PostgreSQL database depends on the AWS db instance. Unfortunately, depends_on allows only relationships between resources, and cannot be used in my provider block.

I no longer have the aws_db_instance in the state file, as noted above. There’s no way to express the dependency on the username and password created by the aws_db_instance to the postgresql provider. The postgresql provider fails with an obtuse error message when it can’t log in to the database. The workaround is to hard-code the host/user/port/password credentials in the module until the aws_db_instance and the postgresql provider’s resources have both been applied to the Terraform state file.

As it happens, many of the changes to one part of the AWS code leave the database code unable to connect. So in this iteration, I added four variables to the postgresql part of the module and wrote code so that the four attributes got values like this:
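(The names are illustrative, but the pattern was a coalesce between an optional override variable and the real resource attribute.)

    variable "db_host_override" {
      type    = string
      default = null
    }
    # ...same idea for db_port_override, db_username_override, db_password_override

    locals {
      db_host     = coalesce(var.db_host_override, aws_db_instance.this.address)
      db_port     = coalesce(var.db_port_override, aws_db_instance.this.port)
      db_username = coalesce(var.db_username_override, aws_db_instance.this.username)
      db_password = coalesce(var.db_password_override, var.master_password)
    }

    # the postgresql provider block then uses local.db_host, local.db_port, and so on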

Now in my root module, to bootstrap the resource creation, I called my module like this:
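(Module source, endpoint, and variable names here are illustrative; the point is the temporary hard-coded connection settings.)

    module "company_db" {
      source = "./modules/company-db"

      # temporary bootstrap values so the postgresql provider can connect
      db_host_override     = "company-production-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com"
      db_port_override     = 5432
      db_username_override = "postgres"
      db_password_override = var.master_password
    }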

(I could probably use tfvars to pass in values as well.)

With the temp settings in place, the database module can successfully connect and import its resources, the AWS module can import its resources, and the state file knows about these resources again. At this point, I can remove the temporary variables from the root module.

Now, I just need to state rm then import all the “new” resources into the state. Yikes!

All of this is work I have done locally on my laptop. In order to get to the point where I can check in the updates to the calling module, I need to apply changes to the Terraform state. Our CI system is sophisticated, but there’s no easy way for me to provide tfvars files that contain the temporary host/user/port/password set. So I apply locally.

And now our shared state file reflects the new reality, but the code in production is behind. I need to get my PR approved, have the CI system run the plan, then apply, and then check in the code. To make this work, turnaround on PRs needs to be very quick!

(An important change we made: because we tag the module directory when we are done with features, we can be confident that stacks still pinned to older module tags, whose databases haven’t been upgraded, will continue to work.)

But in any case, having a single developer (me!) applying changes directly to production and then urgently requesting approval for a PR is just not how proper development flows should work. I wonder if there are patterns of using development state files?

Refactor and Use Remote State

Now that we understand the dependency between the two parts of our database system, the server and the underlying database, it’s clear that we should have two separate modules: one to manage the AWS DB cluster and DB instances, and another to manage the PostgreSQL modifications. We should create outputs in the AWS DB module that can be referenced by the PostgreSQL module.
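A sketch of the split (names and backend details are illustrative): the AWS DB module exposes outputs, and the PostgreSQL side reads them, for example via terraform_remote_state:

    # aws-db module: outputs.tf
    output "db_host" {
      value = aws_db_instance.this.address
    }

    output "db_port" {
      value = aws_db_instance.this.port
    }

    output "db_username" {
      value = aws_db_instance.this.username
    }

    # postgresql root module: read the AWS DB state
    data "terraform_remote_state" "db" {
      backend = "s3"
      config = {
        bucket = "example-terraform-state"
        key    = "databases/company/terraform.tfstate"
        region = "us-east-1"
      }
    }

    provider "postgresql" {
      host     = data.terraform_remote_state.db.outputs.db_host
      port     = data.terraform_remote_state.db.outputs.db_port
      username = data.terraform_remote_state.db.outputs.db_username
      password = var.master_password
    }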

And I’ll do that. And the new moved functionality will be awesome.

But this still doesn’t address the fact that I had to jump through many different hoops just to get the new state of reality aligned with the terraform state file.

Oh well. Still better than managing infra by hand!

Tom Harrison

30 Years of Developing Software, 20 Years of Being a Parent, 10 Years of Being Old. (Effective: 2020)