Terraform Modules Organization

David Gamba, @gambaeng
2021-10-20
version 0.3, 2023-04-03 #terraform #software_architecture

This is an in-depth overview of the rationale behind choosing certain patterns. For a summary, jump to the Summary section.

Working with Terraform modules can be hard, in particular when things grow or when you have multiple teams or multiple projects.

The official documentation provides some insight into how to organize your modules as shown below:

$ tree my-company-functions
├── modules
├── prod
│   ├── document-metadata
│   │   └── main.tf
│   └── document-translate
│       └── main.tf
└── staging
    ├── document-metadata
    │   └── main.tf
    └── document-translate
        └── main.tf

Then it mentions the ability to share modules by uploading them to a registry or by tagging them in a git repo.

There are several drawbacks to that suggested approach; this guide attempts to explain them and provide an alternative solution.

Problems that need to be addressed

  1. First and foremost, the ability to separate project life cycles

    Some projects might move fast, some projects might not.

    Some projects might be really stable. For example, an S3 bucket created with Terraform v0.11 might still work perfectly fine and you might not want to force updates on that project if you don’t have the budget for it.

    Each team or even each project might have different budgets to address technical debt and to update dependencies. Furthermore, those budgets change over time. Module layout should allow for different life cycles.

  2. The Terraform binary version itself

    Terraform has been in a state of flux for a long time and just recently a v1 landed. Before that, every minor version jump in the v0 series brought breaking changes, and in particular when going from v0.11 to v0.12.

    Even though there are now some compatibility promises made with the release of v1.0, teams might still want to be able to use a different Terraform version and upgrade at their own pace.

  3. The one module to rule them all problem

    On paper it sounds great to have a single company-wide module to deploy an AWS EC2 instance. After all, every single EC2 instance in the company should be composed of an Ubuntu 16.04 AMI, a Route 53 record, user data that installs Python and an auto-scaling group with rules for 9am to 5pm. Wait, what? I obviously need to add a variable for the AMI, some projects don’t require DNS, my project doesn’t use Python and we are a global company.

    Most company-wide ANYTHING efforts I have seen end up being a wrapper around the base provider modules with a bunch of conditionals that add very little value. The same is the case for community-provided modules, which also have to add conditionals to support older versions of Terraform.

    Finally, company wide modules end up introducing lots of breaking changes very often and newer versions have very little uptake. You end up with different projects using and maintaining different major versions in many cases.

  4. The development workflow problem

    In many cases, when you are developing changes to a module you might need to apply your changes and in some cases leave them up for days to validate they are working.

    When you develop your shared module that means you make your module changes in a separate git branch and your Terraform file uses that ref.

    Once you are happy with your changes, you create a tag and promote those changes to your staging environment. In most cases staging is a copy of prod where integration tests are run and breaking the staging environment infrastructure is a big deal. You soak your changes there and then promote your infra changes to Prod.

    There needs to be a clear definition of when a module is ready for production use. Unfortunately, in most cases the tagging is just done as an incremental semver update and it doesn’t reflect the actual maturity of the changes.

    Developing shared modules in a separate git repo for third party consumption necessarily involves working in at least two repositories at once and actually committing module changes into your branch on every test iteration.

  5. The module versioning problem

    To summarize the two problems above, shared modules with breaking changes force teams into maintaining multiple major versions. At least until they have the budget to figure out what the new major version introduced and what all needs to change to use it.

    Additionally, semver without pre-release version information is not sufficient to indicate the maturity or intended use of the code. It is tagged because it is used in some environment, but there needs to be some indication of its status. For example: 1.2.3-alpha → 1.2.3-alpha.5 → 1.2.3-rc → 1.2.3

  6. The large state file problem

    Large modules are slow, have a big blast radius, require you to have permissions to all the infra they deploy, require you to understand the whole module, and testing them properly requires deploying the whole thing from scratch over and over.

    On the testing side of things, you might think that Terraform is smart enough to understand the order of operations of your humongous module but most likely you are sprinkling depends_on everywhere hoping for the best.

  7. Terraform Backends

    Though not a problem related to modules, code organization affects where you do your terraform init and other operations from. If from a single directory, you either need to clear your .terraform dir to ensure a clean slate (terraform init -reconfigure), use an error-prone feature like Terraform Workspaces, or actually have a separate dir for each environment you deploy your infra to.

    Having a single dir that can dynamically point to multiple backends can, depending on whether or not you cleaned your environment, prompt you to copy the previously used state to the new backend you are about to use, and some users might accidentally copy state between unrelated environments.

    This problem is also there with multiple git branches, so all Terraform code should live in a single branch.

  8. Providers don’t support for_each or count

    Terraform providers are not fully dynamic; that is, they don’t support for_each or count, and they can’t be embedded in a module that uses for_each or count, as explained in the official docs and the GitHub issue request. In other words, what this post describes is not possible out of the box.

    This is a problem you will always encounter if you are doing dynamic environments and using for_each as a conditional on your module.

    The workaround is to create a module of modules as described in one of the following sections.

  9. Providers need to exist when destroying

    If a provider was used when building a piece of infrastructure, it needs to exist when trying to destroy it. This is particularly challenging when using dynamic provider configurations, for example connecting to a kubernetes cluster after creating it. That means that destruction of resources needs to be done in stages: one where you destroy the resources while still able to connect to them (keeping the provider details), and another where you actually delete all resource-related config.

  10. State loss when updating providers

    There are fairly innocent situations, like a region change in a provider, where instead of deleting and recreating resources in the new region, Terraform just orphans the resources. See https://github.com/hashicorp/terraform/issues/29294 for some details.

    When making any change to a provider, it is recommended to destroy the resources it manages first.

  11. Provider versions are global

    Terraform requires that a single version of each provider is selected for the entire configuration, due to the fact that provider configurations can cross between module boundaries. https://github.com/hashicorp/terraform/issues/25343#issuecomment-649149976

    This limits the ability to have separate lifecycles for resources. For example, you might want to work in a blue/green scenario, where once deployed you don’t want to update resources. It would be desirable to maintain the version of the provider so there is no overhead in maintaining that static resource, while new resources get new provider versions that might have breaking changes compared to the old ones.

  12. Providers don’t allow using variables for their versions

    That means that if you want to use different provider versions per environment, you have to bring in the entire provider file and not just the versions as variables.

  13. Data lookups, remote state lookups and shared common variables

    Data lookups have the downside of doing an actual API call for every piece of information, in many cases slowing down your terraform plan. If the lookup is for something that would require you to redeploy your entire infrastructure if it were to change anyway, like a VPC ID, you might want to use the ID itself rather than doing the lookup, but hardcoding IDs is not something you can do in a module. Other values never actually change, like the account number.

    Remote state lookups, that is, lookups into the saved state of a different Terraform project, when done in a module tie the module to the specific backend location of the lookup. Using them in a module to avoid giving the user another variable input is something that needs to be considered with care.

    A proper pattern that allows passing in defaults and common IDs without encumbering the caller with a million variables is a must (see the sketch after this list).

  14. Security Controls

    Even if you have defined proper project ownership, some parts of your project might need to be centralized for wider control.

    For example, the organization might decide to have every secret that is meant to be deployed to AWS Secrets Manager centralized in one location (a single Terraform project, that is). This is to ensure that uniform security practices are applied to that part of the infrastructure.

    As another example, the organization might decide to centralize all service accounts in a single Terraform project to provide clear visibility into where API users are created, what their key rotation rules are, etc.

    In these scenarios, the developer of the module not only needs to create the secret or user in the centralized project, but also to provide the two-way policies that allow, for example, instance profiles to read those secrets.

    Thought has to be given to avoid introducing circular dependencies.

  15. Chicken and egg problems

    Even when working inside a single state file, Terraform has ordering issues that in some cases require multiple plan/apply sets before reaching consistency. See https://github.com/hashicorp/terraform/issues/4149#issuecomment-162030953

    When splitting into multiple state files, this ordering becomes even more important.
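As an illustration of the data lookup concern (problem 13), the sketch below contrasts an API lookup with reusing a value that effectively never changes. The tag value and the vpc_id entry in the shared defaults object (described later in this guide) are assumptions:

# Slower: this lookup hits the AWS API on every plan.
data "aws_vpc" "main" {
  tags = {
    Name = "main-vpc"
  }
}

# Faster: reuse an ID that only changes when the whole stack is rebuilt,
# passed in through a shared defaults object rather than hardcoded in the module.
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.defaults["vpc_id"] # instead of data.aws_vpc.main.id
}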

Infrastructure as code

Important

As of Terraform 1.1 (in beta at the time of writing) there is a new moved block syntax that improves this situation.

Defining the boundaries of the term Infrastructure as Code is important. Infrastructure as Code means that the infrastructure that is deployed is tracked in plain text in a VCS. It does not mean that infrastructure follows the same workflows that the rest of your code does.

Refactoring Terraform infrastructure deployed as a resource to group it into a module would be a simple refactor in code. In Terraform, however, it requires manual state refactoring using terraform state mv <source> <target>.
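For example, grouping an existing bucket resource under a new module (hypothetical names) would require a state move:

$ ./terraform state mv aws_s3_bucket.logs 'module.logging.aws_s3_bucket.logs'

On versions that support it, the same refactor can instead be declared in code with a moved block:

moved {
  from = aws_s3_bucket.logs
  to   = module.logging.aws_s3_bucket.logs
}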

In many cases updating some infrastructure values, like a name, involves destroying and recreating the resource.

There are many other examples and instances in which a simple code operation results in infrastructure that would need to be destroyed and recreated or that requires a state move to keep it.

Due to those limitations, it is better to err on the side of duplication and verbosity than to err on the side of coupling.

Infrastructure workflows are more akin to database workflows: some rollbacks are difficult, require manual intervention, and bugs have a wider impact. Deploying canary infrastructure and migrating traffic over to it is in most cases the only choice to avoid downtime.

Terraform "conditional hacks" are not conditionals

In Terraform there are no conditionals to tell a resource or a module whether or not it should be deployed. In other words, there is no if or enabled keyword one can leverage.

What Terraform does have is count and for_each.

In general, count should never be used as a conditional. count works by creating an array with indexes; if you ever need to remove an element that is not the last, all elements after the one you removed are destroyed and recreated to shift the index list.

In contrast, for_each works as a key-value map, so removing any member doesn’t affect the other members.
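A minimal sketch of the difference, assuming a var.names list variable:

# count: removing "b" from ["a", "b", "c"] shifts "c" from index 2 to index 1,
# so Terraform destroys and recreates it.
resource "aws_s3_bucket" "indexed" {
  count  = length(var.names)
  bucket = var.names[count.index]
}

# for_each: each bucket is keyed by its name, so removing "b" only destroys that one bucket.
resource "aws_s3_bucket" "keyed" {
  for_each = toset(var.names)
  bucket   = each.key
}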

Example for_each used as a conditional
module "my_module" {
  for_each = toset(var.code_maturity == "dev" ? ["enabled"] : [])

Note that if the module was created before adding the "conditional" then Terraform will want to destroy the module (delete module.my_module.*) and create a new one (add module.my_module["enabled"].*). To avoid that scenario, you need to manually manipulate the state:

$ ./terraform state mv module.my_module 'module.my_module["enabled"]'

Now, in case you want to promote that module to also work in "staging" without destroying the resources, it is possible to create a set of locals to handle the conditional approach:

locals {
  dev_only        = toset(var.code_maturity == "dev" ? ["enabled"] : [])
  staging_only    = toset(var.code_maturity == "staging" ? ["enabled"] : [])
  production_only = toset(var.code_maturity == "production" ? ["enabled"] : [])

  dev_and_staging = toset(contains(["dev", "staging"], var.code_maturity) ? ["enabled"] : [])
  all_envs        = toset(["enabled"])
}

And then use the simplified local in the module "conditional":

module "my_module" {
  for_each = local.dev_only

And finally, perform a promotion without destroying the resources:

module "my_module" {
  for_each = local.dev_and_staging

Then:

module "my_module" {
  for_each = local.all_envs

Static vs Dynamic environments

Terraform has a dir based workflow. It doesn’t recursively traverse the filesystem. It only uses the .tf files that are available at the directory of the invocation.

Static environments

Static environments refer to a project setup where all the Terraform files needed to use an environment are present in the environment’s directory, and there are multiple copies of the main.tf file:

$ tree monorepo
└── projects
    └── project-a
        ├── modules
        ├── us-dev-1
            ├── main.tf
            └── backend.tf
        └── us-prod-1
            ├── main.tf
            └── backend.tf

Because all files necessary for init and plan are present, the Terraform commands in use have no need for arguments:

$ cd projects/project-a/us-dev-1
$ ./terraform init
$ ./terraform plan

Dynamic environments

Dynamic environments refer to a project Terraform dir that has a manifest to deploy the infrastructure, but the environment details live in a separate directory and are passed in on every plan call:

$ tree monorepo
└── projects
    └── project-a
        ├── modules
        ├── src
            ├── main.tf
            ├── backend.tf
            ├── variables.tf
            └── envs
                ├── us-dev-1
                    ├── variables.tfvars
                    └── backend.tfvars
                └── us-prod-1
                    ├── variables.tfvars
                    └── backend.tfvars

The backend is specific to each environment and, because the key to the state file is unique to the project/environment combination, multiple backend.tfvars files are required along with a backend template:

Contents of the backend.tf file.
terraform {
  backend "s3" {
  }
}
Contents of a sample backend.tfvars file.
bucket  = "us-dev-1-terraform-state"
key     = "monorepo/projects/project-a/us-dev-1.tfstate"
region  = "us-east-1"
profile = "us-dev-1"

The files have to be provided to Terraform on each invocation:

$ cd projects/project-a/src
$ ./terraform init -reconfigure -backend-config=envs/us-dev-1/backend.tfvars
$ ./terraform plan -var-file=envs/us-dev-1/variables.tfvars

For more details on the backend-config approach, see the official partial backend configuration guide.

Dynamic providers

As mentioned in the problems section, Terraform providers don’t support for_each or count and they can’t live in a module that uses for_each or count.

The use case for a dynamic provider is, for example, when a kubernetes cluster is created in one step and in the next the cluster needs to be configured. To configure the cluster, the following details are required:

kubernetes provider required details
provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks-cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks-cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks-cluster.token
}

If you are using for_each as a conditional in a dynamic environment setup, you can bypass the provider limitation by doing a module of modules:

modules/eks-cluster
  ├── main.tf
  ├── eks-cluster
  └── eks-cluster_internal_config

Instead of using for_each as a conditional variable directly, you can define an enabled variable. In this case it is defined as a set so it can be reused directly in a nested for_each.

enabled variable defined in a module that requires conditional (enabled or not) provider logic
variable "enabled" {
  type = set(string)
}
main.tf use of the enabled variable
module "eks-cluster" {
  for_each = var.enabled
  source   = "./eks-cluster"
  ...
}

provider "kubernetes" {
  alias                  = "eks-cluster"
  host                   = length(var.enabled) > 0 ? data.aws_eks_cluster.eks-cluster["enabled"].endpoint : ""
  cluster_ca_certificate = length(var.enabled) > 0 ? base64decode(data.aws_eks_cluster.eks-cluster["enabled"].certificate_authority[0].data) : ""
  token                  = length(var.enabled) > 0 ? data.aws_eks_cluster_auth.eks-cluster["enabled"].token : ""
}

module "eks-cluster_internal_config" {
  for_each = var.enabled
  source   = "./eks-cluster_internal_config"

  providers = {
    kubernetes = kubernetes.eks-cluster
  }
  ...
}

And the caller:

module "eks-cluster-1" {
  enabled  = local.dev_only
  source   = "../modules/eks-cluster"
  ...
}

module "eks-cluster-2" {
  enabled  = local.dev_only
  source   = "../modules/eks-cluster"
  ...
}

In the example above, the enabled variable used as a conditional gets passed to the module and used directly in the for_each statements. Since the provider needs to have static values at all times, in the cases where the enabled variable evaluates to an empty set (an empty conditional) the provider values are empty strings, and since the provider is never actually called this results in the desired outcome.

The Terraform state shows that this embedded provider in the module is unique to the module:

Terraform state provider entry for a module with a provider within the module
"provider": "module.eks-cluster-1.provider[\"registry.terraform.io/hashicorp/kubernetes\"].eks-cluster",

...

"provider": "module.eks-cluster-2.provider[\"registry.terraform.io/hashicorp/kubernetes\"].eks-cluster",
Important

This approach only works because for_each is used as a yes/no switch. There is a single use of the provider; it is not used to connect to multiple clusters.

When to use

The static approach has the benefits of explicitness, easier inspection of what is deployed in a given environment, and easier manual invocation for the user. It also has fewer conditionals because the main.tf is unique per environment, so what is not deployed in dev or staging (for cost or other practical concerns) is "explicitly" missing.

When working outside of an automated environment, that is the easier approach.

The dynamic approach has the benefit of reduced duplication, but conditionals grow in the main.tf to account for what is not deployed in the smaller environments.

As mentioned before, having a single dir that can dynamically point to multiple backends can, depending on whether or not you cleaned your environment, prompt you to copy the previously used state to the new backend you are about to use, and some users might accidentally copy state between unrelated environments.

The dynamic approach is attainable when the developer only works on one environment and, once changes are validated, other environments are deployed automatically. Because of the lack of duplication, it lends itself better to deploying the same infra on multiple environments with automation.

A solution using a dynamic monorepo approach

Terraform Caching

Before starting with the solution, every user of Terraform should have provider plugin caching enabled. This is particularly important in a CI/CD environment that runs against all projects to detect drift.

~/.terraformrc contents
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"

or:

$ export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"

Without the cache, the checkout will end up with a copy of each plugin per environment per project and will likely hit HashiCorp registry download quotas.

Multi-platform workflows

The Terraform lock file stores hashes for the compiled plugins. The hash is different depending on the platform. If development happens on one or many platforms and the CI/CD system runs a different one, use the following command to record multi-platform hashes:

$ terraform providers lock -platform=windows_amd64 -platform=darwin_amd64 -platform=linux_amd64

Bootstrap

When setting up Terraform to work on multiple accounts for the first time, there is no S3 bucket to store the state and there is no lock DB.

While every other project in the repo will have its state stored in S3, for the bootstrap state (the S3 bucket and DynamoDB table themselves) it is OK to save the state in version control directly.

Since the bootstrap is managed manually, the static directory layout is best:

$ tree monorepo
└── terraform_backend
    ├── modules
        ├── s3
        └── dynamodb
    ├── us-dev-1
        ├── main.tf
        └── terraform.tfstate
    ├── us-staging-1
        ├── main.tf
        └── terraform.tfstate
    └── us-prod-1
        ├── main.tf
        └── terraform.tfstate
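A sketch of what one of the bootstrap main.tf files could contain, using the module names from the tree above (bucket, table and variable names are assumptions):

module "state_bucket" {
  source = "../modules/s3"
  name   = "us-dev-1-terraform-state"
}

module "lock_table" {
  source = "../modules/dynamodb"
  name   = "us-dev-1-terraform-lock"
}

Since there is no backend block, the resulting terraform.tfstate stays in the environment directory and is committed to version control.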

Allowing multiple Terraform binary versions

Each project has a symbolic link to the actual binary used for the project. Updating the Terraform binary in a project environment means pointing the symlink to a newer version of Terraform.

This approach even allows a single project to validate a newer Terraform version in dev before promoting to staging or prod.

There is a single common binary per version in the bin/ dir. There is a bootstrap script in charge of downloading the project’s Terraform versions.

$ tree monorepo
├── bin
    ├── terraform-v0.11.15
    ├── terraform-v0.12.31
    ├── terraform-v1.0.1
    └── terraform-v1.0.9
└── projects
    ├── project-a
        ├── src
            └── ./terraform -> ../../bin/terraform-v0.12.31
        └── modules

    └── project-b
        ├── src
            └── ./terraform -> ../../bin/terraform-v1.0.9
        └── modules

With this approach, running Terraform changes from running terraform (looked up on the $PATH) to ./terraform, and the version is always correct independently for each project.
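For example, moving project-a from v0.12 to v1.0 would be a one-line symlink change (a sketch following the layout above):

$ cd projects/project-a/src
$ ln -sf ../../bin/terraform-v1.0.9 ./terraform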

Common Variables

At the top of the repo there is an env_defaults dir. These defaults are common across projects deployed in an account.

Some examples of global defaults per account:

  • S3 bucket used to store Terraform state and DynamoDB table used for locking.

  • The account number.

  • The account alias if you use one.

  • The region used in that account.

  • Availability zones.

  • The code maturity for the account (if you separate between dev, staging and production at the account level).

Some examples of default values that can also be included:

  • VPC ID.

  • VPC CIDR.

  • Subnet IDs.

  • SSL certs ARNs.

  • Route 53 Zone ID.

  • Shared AWS KMS Key ARN.

$ tree monorepo
├── env_defaults
    ├── defaults_var.tf
    ├── env_defaults.tf

    ├── us-dev-1
        ├── env_backend_defaults.tfvars
        └── env_defaults.tfvars

    ├── us-staging-1
        ├── env_backend_defaults.tfvars
        └── env_defaults.tfvars

    ├── us-prod-1
        ├── env_backend_defaults.tfvars
        └── env_defaults.tfvars

├── projects
    ├── project-a
        ├── modules
        └── base
            ├── backend.tf
            ├── provider.tf

            ├── main.tf
            ├── variables.tf

            ├── defaults_var.tf -> ../../../env_defaults/defaults_var.tf
            ├── env_defaults.tf -> ../../../env_defaults/env_defaults.tf

            └── envs
                ├── us-dev-1
                    ├── backend.tfvars
                    ├── env_backend_defaults.tfvars -> ../../../../../env_defaults/us-dev-1/env_backend_defaults.tfvars
                    ├── env_defaults.tfvars -> ../../../../../env_defaults/us-dev-1/env_defaults.tfvars
                    └── variables.tfvars

                ├── us-staging-1
                    ├── backend.tfvars
                    ├── env_backend_defaults.tfvars -> ../../../../../env_defaults/us-staging-1/env_backend_defaults.tfvars
                    ├── env_defaults.tfvars -> ../../../../../env_defaults/us-staging-1/env_defaults.tfvars
                    └── variables.tfvars

                ├── us-prod-1
                    ├── backend.tfvars
                    ├── env_backend_defaults.tfvars -> ../../../../../env_defaults/us-prod-1/env_backend_defaults.tfvars
                    ├── env_defaults.tfvars -> ../../../../../env_defaults/us-prod-1/env_defaults.tfvars
                    └── variables.tfvars
Note
The syntax used for Terraform variables has changed over versions; for example, as of v0.12 it is easier to group unrelated types of variables into a single object, whereas in the past you couldn’t. These env_defaults can therefore be versioned too, for example env_defaults/us-dev-1-v0.11 and env_defaults/us-dev-1-v1.0.

Once the workflow of defining default variables has been standardized, it is possible to pass the defaults to a module.

For example:

Contents of the defaults_var.tf file
variable "defaults" {
  description = "Account Defaults"
  type = object({
    aws_account_number     = string
    aws_region             = string
    aws_availability_zones = list(string)

    tags = map(string)
  })
}
Contents of the env_defaults.tf file
variable "profile" {
  description = "AWS profile to use"
}

variable "code_maturity" {
  description = "Environment's code maturity"
  type        = string

  // By default we don't deploy dev or staging code into an environment.
  default = "production"

  validation {
    condition     = contains(["dev", "staging", "production"], var.code_maturity)
    error_message = "Variable code_maturity must be one of 'dev', 'staging' or 'production'."
  }
}
Contents of the env_defaults.tfvars file for the us-dev-1 environment
defaults = {
  aws_account_alias      = "us-dev-1"
  aws_account_number     = "12345"
  aws_region             = "us-east-1"
  aws_availability_zones = ["us-east-1a", "us-east-1b"]

  tags = {
    "terraform" = "true"
  }
}

profile = "us-dev-1"

code_maturity = "dev"
Contents of the env_defaults.tfvars file for the us-prod-1 environment
defaults = {
  aws_account_alias      = "us-prod-1"
  aws_account_number     = "9876"
  aws_region             = "us-east-1"
  aws_availability_zones = ["us-east-1a", "us-east-1b"]

  tags = {
    "terraform" = "true"
  }
}

profile = "us-prod-1"

code_maturity = "production"
Contents of a sample main.tf module
module "sample" {
  for_each = local.dev_only
	source   = ../modules/my-module
	defaults = var.defaults

	actual_variable_that_matters = "this is unique to this environment and project"
}
Contents of a sample module source file
resource "aws_x" "resource_within_module" {
	region = var.defaults["aws_region"]
}
Contents of a sample provider.tf file
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.63.0"
    }
  }
}


provider "aws" {
  region  = "us-east-1"
  profile = var.profile
}

This approach allows passing common variables as a single entity to the module and allows the developer to focus on the actual variables that matter for the project.

It also allows defining other common variables, like profile, that can be overridden at the project level if needed.

The variables files need to be passed to the Terraform plan command and the locally defined variables need to be passed after the default ones for any overrides to take effect:

$ ./terraform plan -var-file "envs/$env/env_defaults.tfvars" -var-file "envs/$env/variables.tfvars" -out tf.plan
$ ./terraform apply tf.plan && rm tf.plan
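For example, since the later -var-file wins, a project could override the shared profile in its own variables.tfvars (hypothetical value):

Contents of a sample envs/us-dev-1/variables.tfvars file
profile = "us-dev-1-project-a"

actual_variable_that_matters = "this is unique to this environment and project"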

Module Setup

TODO: This section needs to be re-written to support dynamic environments.

Definition of project

A project as used in this section is a logical grouping of infrastructure that needs to move together with a single lifecycle and with a set of shared or similar infrastructure.

For example, deploying Artifactory in-house is a single project and deploying all the company microservices that follow a pattern can also be a single project (even though each microservice in itself is a separate project from the business perspective).

The section after this one will discuss how to split very large projects into smaller pieces.

A module layout that allows promoting changes according to code maturity is described below:

$ tree monorepo
├── projects
    ├── project-a
        ├── modules
            ├── dev
                ├── aws_s3
                ├── aws_ec2
                └── aws_route53
            ├── rc
                ├── aws_s3
                ├── aws_ec2
                └── aws_route53
            └── release
                ├── aws_s3
                ├── aws_ec2
                └── aws_route53
        ├── us-dev-1
            └── main.tf
        ├── uk-dev-1
            └── main.tf
        ├── us-staging-1
            └── main.tf
        ├── uk-staging-1
            └── main.tf
        ├── us-prod-1
            └── main.tf
        └── uk-prod-1
            └── main.tf
Contents of a sample main.tf module
module "sample_s3" {
	source   = "../modules/<maturity>/aws_s3"
	defaults = var.defaults
	name     = "my_bucket"
}

module "sample_ec2" {
	source   = "../modules/<maturity>/aws_ec2"
	defaults = var.defaults
	name     = "my_instance"
	ami      = data...
	s3bucket = module.sample_s3.arn
}

module "sample_route53" {
	source   = "../modules/<maturity>/aws_ec2"
	defaults = var.defaults
	name     = "my_dns_record"
	ip       = module.sample_ec2.ip
}

As shown in the above tree structure, the modules dir belongs to the project. That means that the module lifecycle is tied only to the project that uses it and not to any other project. This approach does lead to some degree of code duplication across projects but the ability to allow projects to move at their own pace trumps any DRY ideals.

The project directory also has the different environments the project is deployed in. In the diagram above each environment ties to a separate AWS account, because that is the best level of isolation between environments that can be achieved in AWS, but other environment separation strategies are possible too.

Also, in the example above, the main.tf does the orchestration of all the modules. The S3 bucket ARN output is an input to the ec2 instance and the ec2 instance IP output is an input to the route53 DNS record.

This can seem repetitive, and in particular, environment-specific variables could be defined so that the main.tf shown above is truly the same across all environments and only the variables file differs. Additionally, while you could build a module of modules that does in effect what this main.tf is doing, having the orchestration in the main.tf works well in practice, since in many cases what you deploy and iterate on in dev is a subset of the system that you deploy in staging and prod, so having the control to deploy less at this level is beneficial.

Another benefit of this approach is that you can mix and match maturity levels, so even in a dev account, you can deploy release level software for all modules except the one you are iterating on.
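For example, a dev environment’s main.tf could pin most modules to release while iterating on a single one (a sketch based on the sample above):

module "sample_s3" {
  source   = "../modules/release/aws_s3"
  defaults = var.defaults
  name     = "my_bucket"
}

module "sample_ec2" {
  source   = "../modules/dev/aws_ec2" # the module currently being iterated on
  defaults = var.defaults
  name     = "my_instance"
  s3bucket = module.sample_s3.arn
}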

Note
Since promotion of code happens by copying code from dev to rc to release, branch-based diffs provided by git are not available. There are other tools in the market that allow you to diff dirs, and this tooling guidance should be given to developers in advance to avoid friction when adopting the new promotion workflow.
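For example, reviewing what a promotion of the aws_s3 module from dev into rc would change can be done with a plain recursive diff (any directory diff tool works):

$ diff -ru projects/project-a/modules/rc/aws_s3 projects/project-a/modules/dev/aws_s3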

Splitting large projects

While many projects should be kept simple, some grow so big that they need to be split into pieces.

Below is a way to split these projects:

$ tree monorepo
├── projects
    ├── project-a
        ├── modules
        ├── subproject-a
            ├── us-dev-1
            ├── uk-dev-1
            ├── us-staging-1
            ├── uk-staging-1
            ├── us-prod-1
            └── uk-prod-1
        ├── subproject-b
            ├── us-dev-1
            ├── uk-dev-1
            ├── us-staging-1
            ├── uk-staging-1
            ├── us-prod-1
            └── uk-prod-1
        ├── subproject-c
            ├── us-dev-1
            ├── uk-dev-1
            ├── us-staging-1
            ├── uk-staging-1
            ├── us-prod-1
            └── uk-prod-1
        └── subproject-d
            ├── us-dev-1
            ├── uk-dev-1
            ├── us-staging-1
            ├── uk-staging-1
            ├── us-prod-1
            └── uk-prod-1

For this example d depends on c, c depends on a, and b depends on a.

The split allows for developer iteration on a subset of the system (given that the prerequisites have been met). The split, however, introduces the need for an orchestration system.

Not only does the build system need to know the creation/update dependencies, it also needs to know to destroy elements in the opposite order. Additionally, given that one of the goals of splitting the state is faster plans during development, the build system needs to be able to determine which layers have had no changes and only plan/apply the layers that have changed.

Since there is no single main.tf orchestrating the use of all modules but multiple small ones, the only way to pass outputs between layers is by doing remote state lookups, as sketched below. Remote state lookups tie project lifecycles together, so they should be limited to lookups within the project as much as possible. Circular dependencies also need to be considered.
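A sketch of such a lookup, where one layer reads an output from another layer’s state (bucket, key and output names are assumptions):

data "terraform_remote_state" "subproject_a" {
  backend = "s3"
  config = {
    bucket  = "us-dev-1-terraform-state"
    key     = "monorepo/projects/project-a/subproject-a/us-dev-1.tfstate"
    region  = "us-east-1"
    profile = "us-dev-1"
  }
}

module "sample" {
  source = "../modules/my-module"
  vpc_id = data.terraform_remote_state.subproject_a.outputs.vpc_id
}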

Orphaned State

In previous versions of Terraform, resource dependencies were not tracked in the state, so you needed the Terraform code the state was produced with in order to destroy it (some hacks were possible that involved multiple iterations of destroy).

In the current iteration of Terraform, the provider connection information is the minimum requirement to destroy from state. For dynamically generated environments, the destruction often has to be done in two steps: one to set the for_each set to empty or the count to 0, and another to actually remove the code that controls the resources.

Controlling provider versions lifecycle

Note
WIP section
Note

A super hacky workaround that could possibly "help": https://stackoverflow.com/a/66430076

On the other hand, since providers escape module boundaries, this could end up causing some corruption or applying something you don’t expect.

To allow provider versions to be validated in an environment before propagating those changes to the others, the Terraform required_providers block needs to be moved into an environment-specific file.

Note
For projects that don’t feel the need to have separate provider versions per environment, there is nothing additional to do.

When wanting to control the lifecycle of the Terraform provider versions, and since Terraform doesn’t allow multiple versions of the same provider in a single layer, the only way to separate provider upgrades is by making them environment specific.

The approach is shown below:

Create a project_files dir at the top project level that will be shared by all layers (so there is a single version of the provider used across them). Then, in the environment-specific dir, place the required_providers.tf file:

projects/eks
  ├── modules                                          # Modules Terraform code
  ├── project_files/envs/<env>/required_providers.tf   # Environment specific required_providers

  ├── base                                             # Main Terraform code
  ├── istio
  └── flux

With all the versions of the providers used in all layers:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.74.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.7.1"
    }
    github = {
      source  = "integrations/github"
      version = "4.20.0"
    }
    tls = {
      source  = "hashicorp/tls"
      version = "3.1.0"
    }
    helm = {
      source = "hashicorp/helm"
      version = "2.5.1"
    }
  }
}

When running the build tool, it will automatically create a symlink in each layer called provider_versions.tf pointing to this shared required_providers.tf file.
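The manual equivalent of what the build tool does, shown for the base layer and the us-dev-1 environment (a sketch):

$ cd projects/eks/base
$ ln -sf ../project_files/envs/us-dev-1/required_providers.tf ./provider_versions.tf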

Summary

We have solved the requirement of having a single module block to deploy the same infra across accounts and regions by using for_each as an on-off enabled toggle.

The provider.tf file:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.74.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "2.7.1"
    }
    tls = {
      source  = "hashicorp/tls"
      version = "3.1.0"
    }
  }
}

provider "aws" {
  region  = var.defaults["aws_region"]
  profile = var.profile
}

The project layout uses the same main.tf with multiple accounts:

            ├── main.tf
            ├── backend.tf
            ├── provider.tf
            ├── variables.tf
            └── envs
                ├── us-dev-1
                    ├── variables.tfvars
                    └── backend.tfvars
                └── us-prod-1
                    ├── variables.tfvars
                    └── backend.tfvars

The backend at the root level is a placeholder:

terraform {
  backend "s3" {
  }
}

And the envs/dev/backend.tfvars looks like:

bucket  = "us-dev-1-terraform-state"
key     = "monorepo/projects/project-a/us-dev-1.tfstate"
region  = "us-east-1"
profile = "us-dev-1"

To initialize, you are required to define the backend vars:

$ ./terraform init -reconfigure -backend-config=envs/us-dev-1/backend.tfvars
$ ./terraform plan -var-file=envs/us-dev-1/variables.tfvars

Regarding the module itself, if it doesn’t require dynamic providers, you can use for_each directly as an on-off switch, but not as a collection of elements. You can handle a collection of elements using a module of modules, with the top-level one being the on-off switch and the sub-modules using for_each as normal.

module "my_module_1" {
  for_each = toset(var.account == "dev" ? ["enabled"] : [])
  ...

module "my_module_2" {
  for_each = toset(var.account == "dev" ? ["enabled"] : [])
  ...

For dynamic providers, like kubernetes, you need to introduce a variable to do your on-off switch. And use a module of modules:

modules/eks-cluster/
├── eks-cluster
│   ├── main.tf
│   └── variables.tf
├── eks-managed_node_group
│   ├── main.tf
│   ├── provider.tf
│   └── variables.tf
├── main.tf
├── provider.tf
└── variables.tf

The caller code will look like this:

module "eks-cluster-1" {
  enabled = toset(contains(["dev", "prod"], var.account) ? ["enabled"] : [])
  source  = "../modules/eks-cluster"
  ...
}

module "eks-cluster-2" {
  enabled = toset(var.account == "dev" ? ["enabled"] : [])
  source  = "../modules/eks-cluster"
  ...
}

The modules/eks-cluster/variables.tf has the on-off variable definition:

variable "enabled" {
  type = set(string)
}

The modules/eks-cluster/main.tf defines the dynamic provider and calls the sub-modules:

module "eks-cluster" {
  for_each = var.enabled
  source   = "./eks-cluster"
  ...
}

provider "kubernetes" {
  alias                  = "eks-cluster"
  host                   = length(var.enabled) > 0 ? data.aws_eks_cluster.eks-cluster["enabled"].endpoint : ""
  cluster_ca_certificate = length(var.enabled) > 0 ? base64decode(data.aws_eks_cluster.eks-cluster["enabled"].certificate_authority[0].data) : ""
  token                  = length(var.enabled) > 0 ? data.aws_eks_cluster_auth.eks-cluster["enabled"].token : ""
}

module "eks-managed_node_group" {
  for_each = var.enabled
  source   = "./eks-managed_node_group"

  providers = {
    kubernetes = kubernetes.eks-cluster
  }
  ...
}

This is what the state ends up looking like:

"provider": "module.eks-cluster-1.provider[\"registry.terraform.io/hashicorp/kubernetes\"].eks-cluster",

...

"provider": "module.eks-cluster-2.provider[\"registry.terraform.io/hashicorp/kubernetes\"].eks-cluster",

Design drawbacks

Two step destroy

The challenge with this design is that, since Terraform doesn’t store the provider’s connection details in the state, you have to destroy in two stages. First, mark the entry as disabled:

module "eks-cluster-1" {
  enabled = toset(contains(["dev", "prod"], var.account) ? ["enabled"] : [])
  source  = "../modules/eks-cluster"
  ...
}

module "eks-cluster-2" {
  enabled = toset(["destroy-me"])
  source  = "../modules/eks-cluster"
  ...
}

Then plan/apply that; Terraform will still be able to connect to the cluster. Finally, remove the block.

Provider version upgrades

Another challenge with the design is that, because providers escape module boundaries, you can’t easily do blue/green to update provider versions, which is awful when you really don’t want to touch base-level infra, like EKS, until the next version release.