Terraform Considerations for Design

David Gamba, @gambaeng
version 0.1, 2021-10-20 #infrastructure #terraform

This document lists several considerations that need to be taken into account when designing/organizing Terraform code.

  1. First and foremost is the ability to separate project life cycles

    Some projects might move fast, some projects might not.

    Some projects might be really stable. For example, an S3 bucket created with Terraform v0.11 might still work perfectly fine and you might not want to force updates on that project if you don’t have the budget for it.

    Each team or even each project might have different budgets to address technical debt and to update dependencies. Furthermore, those budgets change over time. Module layout should allow for different life cycles.

  2. The Terraform binary version itself

    Terraform was in a state of flux for a long time and only recently reached v1.0. Before that, every minor version jump in the v0 series brought breaking changes, most notably the jump from v0.11 to v0.12.

    Even though there are now some compatibility promises made with the release of v1.0, teams might still want to be able to use a different Terraform version and upgrade at their own pace.
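
    As a minimal sketch, each project can pin the Terraform binary version it has been tested with and upgrade on its own schedule (the constraint below is only an example):

      terraform {
        # This project has only been validated against the 1.0.x series.
        required_version = "~> 1.0.0"
      }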

  3. The one module to rule them all problem

    On paper it sounds great to have a single company-wide module to deploy an AWS EC2 instance. After all, every single EC2 instance in the company should be composed of an Ubuntu 16.04 AMI, a Route 53 record, user data that installs Python, and an auto-scaling group with rules for 9am to 5pm. Wait, what? I obviously need to add a variable for the AMI, some projects don’t require DNS, my project doesn’t use Python, and we are a global company.

    Most company-wide ANYTHING efforts I have seen end up being a wrapper around the base provider modules with a bunch of conditionals that add very little value. The same is the case for community-provided modules, which also have to add conditionals to support older versions of Terraform.

    Finally, company-wide modules tend to introduce breaking changes often, and newer versions see very little uptake. In many cases you end up with different projects using, and maintaining, different major versions.
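
    As a hedged sketch of how such a wrapper tends to grow (the module, variable, and resource names here are hypothetical):

      # Hypothetical company-wide "ec2-instance" wrapper module.
      variable "ami_id" {
        type = string
      }

      variable "create_dns_record" {
        type    = bool
        default = true
      }

      variable "zone_id" {
        type    = string
        default = ""
      }

      variable "dns_name" {
        type    = string
        default = ""
      }

      resource "aws_instance" "this" {
        ami           = var.ami_id
        instance_type = "t3.micro"
      }

      # Conditionals like this one accumulate with every project that is "special",
      # while adding little on top of the underlying provider resources.
      resource "aws_route53_record" "this" {
        count   = var.create_dns_record ? 1 : 0
        zone_id = var.zone_id
        name    = var.dns_name
        type    = "A"
        ttl     = 300
        records = [aws_instance.this.private_ip]
      }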

  4. The development workflow problem

    When you are developing changes to a module, you often need to apply them and, in some cases, leave them up for days to validate that they work.

    When the module is shared, that means making your changes in a separate git branch of the module repository and pointing your Terraform configuration at that ref.

    Once you are happy with your changes, you create a tag and promote those changes to your staging environment. In most cases staging is a copy of prod where integration tests run, and breaking the staging infrastructure is a big deal. You soak your changes there and then promote them to prod.

    There needs to be a clear definition of when a module is ready for production use. Unfortunately, in most cases the tagging is just done as an incremental semver update and it doesn’t reflect the actual maturity of the changes.

    Developing shared modules in a separate git repo for third-party consumption necessarily involves working in at least two repositories at once, and committing module changes to your branch on every test iteration.
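
    A rough sketch of that flow from the consuming project (the repository URL, module path, and ref names are placeholders, and module arguments are omitted):

      module "network" {
        # While developing: point at the feature branch of the shared module repo.
        # source = "git::https://example.com/org/terraform-modules.git//network?ref=my-feature-branch"

        # Once promoted: pin to a tag.
        source = "git::https://example.com/org/terraform-modules.git//network?ref=v1.4.0"
      }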

  5. The module versioning problem

    To summarize the two problems above: shared modules with breaking changes force teams into maintaining multiple major versions, at least until they have the budget to figure out what the new major version introduced and everything that needs to change to adopt it.

    Additionally, semver without pre-release version information is not sufficient to indicate the maturity or intended use of the code. A version is tagged because it is used in some environment, but there needs to be some indication of its status. For example: 1.2.3-alpha → 1.2.3-alpha.5 → 1.2.3-rc → 1.2.3
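
    For registry-style modules, a small illustration of the consumer side (the registry address and versions are placeholders); Terraform only selects pre-release versions through an exact version constraint, which at least makes the experimental status explicit:

      module "service" {
        source = "app.terraform.io/example-org/service/aws"

        # Staging could pin a release candidate explicitly...
        version = "1.2.3-rc"

        # ...while prod only ever pins final releases, e.g. version = "1.2.3".
      }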

  6. The large state file problem

    Large modules are slow, have a big blast radius, require you to have permissions to all the infra they deploy, require you to understand the whole module, and testing them properly requires deploying the whole thing from scratch over and over.

    On the testing side of things, you might think that Terraform is smart enough to understand the order of operations of your humongous module, but most likely you are sprinkling depends_on everywhere and hoping for the best.
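
    A sketch of that sprinkling (resource names and IDs are placeholders):

      resource "aws_s3_bucket" "artifacts" {
        bucket = "example-artifacts-bucket"
      }

      resource "aws_instance" "app" {
        ami           = "ami-00000000000000000"
        instance_type = "t3.micro"

        # Nothing in the arguments references the bucket, so Terraform cannot infer
        # the ordering; explicit depends_on entries like this pile up across the module.
        depends_on = [aws_s3_bucket.artifacts]
      }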

  7. Terraform Backends

    Though not a problem related to modules, code organization affects where you run terraform init and other operations from. If you run everything from a single directory, you either need to clear your .terraform dir to ensure a clean slate (terraform init -reconfigure), use an error-prone feature like Terraform Workspaces, or have a separate dir for each environment you deploy your infra to.

    Having a single dir that can dynamically point to multiple backends can, depending on whether or not you cleaned your environment, give you a prompt to copy the previously used state to the new backend you are about to use. Because of that, some users might accidentally copy state between unrelated environments.

    This problem also exists with multiple git branches, so all Terraform code should live in a single branch.
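
    A sketch of the single-directory, dynamic-backend approach described above (names are placeholders): a partial backend configuration whose details are supplied per environment at init time, which is exactly why a clean .terraform dir or -reconfigure is needed:

      terraform {
        # Partial configuration: bucket/key/region come from a per-environment file, e.g.
        #   terraform init -reconfigure -backend-config=environments/staging.backend.hcl
        backend "s3" {}
      }

      # environments/staging.backend.hcl
      bucket = "example-terraform-state-staging"
      key    = "network/terraform.tfstate"
      region = "us-east-1"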

  8. Providers don’t support for_each or count

    Terraform providers are not fully dynamic: they don’t support for_each or count, and they can’t be declared inside a module that is itself called with for_each or count, as explained in the official docs and the GitHub issue. In other words, what this post describes is not possible out of the box.

    This is a problem you will always encounter if you are doing dynamic environments and using for_each as a conditional on your module.

    The workaround is to create a module of modules as I describe in a separate post.
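
    A minimal illustration of the restriction (paths and names are placeholders):

      # modules/region/main.tf -- the module declares its own provider configuration.
      variable "region" {
        type = string
      }

      provider "aws" {
        region = var.region
      }

      # Root module -- this is what is NOT possible out of the box:
      #
      # module "regions" {
      #   source   = "./modules/region"
      #   for_each = toset(["us-east-1", "eu-west-1"])
      #   region   = each.value
      # }
      #
      # Terraform rejects it with an error along the lines of
      # "Module is incompatible with count, for_each, and depends_on".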

  9. Providers need to exist when destroying

    If a provider was used when building a piece of infrastructure, it needs to exist when trying to destroy it. This is particularly challenging when using dynamic provider configurations, for example connecting to a Kubernetes cluster right after creating it. That means destruction of resources needs to be done in stages: one where you destroy the resources while still able to connect to them (keeping the provider details), and another where you can actually delete all of the resource-related config.
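
    A common shape of this problem, sketched with EKS as an example (the cluster resource and auth data source are assumed to be defined elsewhere in the configuration):

      provider "kubernetes" {
        host                   = aws_eks_cluster.this.endpoint
        cluster_ca_certificate = base64decode(aws_eks_cluster.this.certificate_authority[0].data)
        token                  = data.aws_eks_cluster_auth.this.token
      }

      # If aws_eks_cluster.this is destroyed in the same run as the kubernetes
      # resources, the provider above can no longer connect, so the kubernetes
      # resources have to be destroyed in an earlier, separate stage.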

  10. State loss when updating providers

    There are fairly innocent situations, like a region change in a provider, where instead of deleting and recreating resources in the new region, Terraform just orphans the resources. See https://github.com/hashicorp/terraform/issues/29294 for some details.

    When making any change to a provider, it is recommended to destroy the resources it manages first.

  11. Provider versions are global

    Terraform requires that a single version of each provider is selected for the entire configuration, because provider configurations can cross module boundaries. https://github.com/hashicorp/terraform/issues/25343#issuecomment-649149976

    This limits the ability to have separate life cycles for resources. For example, you might want to work in a blue/green scenario where, once deployed, you don’t want to update the existing resources. It would be desirable to pin the provider version for them, so there is no overhead in maintaining those static resources, while new resources get newer provider versions that might have breaking changes compared to the old ones.
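
    A sketch of how the conflict shows up (the version constraints are examples only):

      # modules/legacy/versions.tf
      terraform {
        required_providers {
          aws = {
            source  = "hashicorp/aws"
            version = "~> 2.0"
          }
        }
      }

      # modules/new/versions.tf
      terraform {
        required_providers {
          aws = {
            source  = "hashicorp/aws"
            version = "~> 3.0"
          }
        }
      }

      # Using both modules in one configuration makes terraform init fail, because
      # no single hashicorp/aws version satisfies "~> 2.0" and "~> 3.0" at once.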

  12. Providers don’t allow using variables for their versions

    That means that if you want to use different provider versions per environment, you have to bring in an entire provider definition file per environment, rather than just passing the versions as variables.
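
    A sketch of the limitation and of the per-environment file it forces (the layout and versions are examples):

      # Not valid: the terraform block cannot reference variables.
      #
      # terraform {
      #   required_providers {
      #     aws = {
      #       source  = "hashicorp/aws"
      #       version = var.aws_provider_version
      #     }
      #   }
      # }

      # Instead, each environment directory carries its own static file,
      # e.g. environments/staging/versions.tf:
      terraform {
        required_providers {
          aws = {
            source  = "hashicorp/aws"
            version = "~> 3.60"
          }
        }
      }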

  13. Data lookups, remote state lookups and shared common variables

    Data lookups have the downside of doing an actual API call for every piece of information, which in many cases slows down your terraform plan. If the lookup is for something that would require redeploying your entire infrastructure were it to change anyway, like a VPC ID, you might want to use the ID itself rather than doing the lookup, but hardcoding IDs is not something you can do in a module. Other values, like the account number, never actually change.

    Remote state lookups, that is, lookups into the saved state of a different Terraform project, tie a module to the specific backend location of the state they read when done inside that module. Using them in a module to avoid giving the user another variable input is something that needs to be considered with care.

    A proper pattern that allows passing in defaults and common IDs without encumbering the caller with a million variables is a must.
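
    A sketch of the three options for something like a VPC ID (backend details and names are placeholders):

      # Option 1: data lookup -- an API call on every plan.
      data "aws_vpc" "main" {
        tags = {
          Name = "main"
        }
      }

      # Option 2: remote state lookup -- ties this code to a specific backend location.
      data "terraform_remote_state" "network" {
        backend = "s3"
        config = {
          bucket = "example-terraform-state"
          key    = "network/terraform.tfstate"
          region = "us-east-1"
        }
      }

      # Option 3: plain input variable -- the caller passes the stable ID once.
      variable "vpc_id" {
        type = string
      }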

  14. Security Controls

    Even if you have defined proper project ownership, some parts of your project might need to be centralized for wider control.

    For example, the organization might decide to have every secret that is meant to be deployed to AWS Secrets Manager centralized in one location (a single Terraform project, that is). This is to ensure that uniform security practices are applied to that part of the infrastructure.

    As another example, the organization might decide to centralize all service accounts in a single Terraform project to provide clear visibility into where API users are created, what their key rotation rules are, etc.

    In these scenarios, the developer of the module not only needs to create the secret or user in the centralized project, but also provide the two-way policies that allow, for example, instance profiles to read those secrets.

    Thought has to be given to avoid introducing circular dependencies.
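
    A rough sketch of the two sides of the secrets example (names and the project split are hypothetical):

      # Centralized secrets project: owns the secret itself.
      resource "aws_secretsmanager_secret" "app_db_password" {
        name = "app/db-password"
      }

      # It also has to provide the other half: a policy letting the application's
      # instance profile role (managed by the application project) read that secret.
      data "aws_iam_policy_document" "read_app_db_password" {
        statement {
          actions   = ["secretsmanager:GetSecretValue"]
          resources = [aws_secretsmanager_secret.app_db_password.arn]
        }
      }

      resource "aws_iam_role_policy" "read_app_db_password" {
        name   = "read-app-db-password"
        role   = "app-instance-role"
        policy = data.aws_iam_policy_document.read_app_db_password.json
      }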

  15. Chicken and egg problems

    Even when working inside a single state file, Terraform has ordering issues that in some cases require multiple plan/apply cycles before reaching consistency. See https://github.com/hashicorp/terraform/issues/4149#issuecomment-162030953

    When splitting into multiple state files, this ordering becomes even more important.