First and foremost, thanks to everyone who has subscribed recently. I know I’ve been quieter than usual. I actually have a number of nearly-complete drafts ready to go, but each is waiting on something. Two of them are posts that I hope y’all will enjoy very much!
For everyone who has come to expect less-technical posts, this one is unfortunately pretty tech-heavy. If you’re not here for engineering deep-dives, you won’t miss anything by skipping this one.
Terraform, for those of you who are unfamiliar, is a product from Hashicorp that lets you write a kind of code describing what your infrastructure in production should look like. Write some code describing an S3 bucket that you’d like to have, run `terraform apply`, and now you’ve got an S3 bucket with the properties that you described. If the bucket changes out from under you, Terraform will happily offer to revert it back to what your code describes. Or, update the code to apply changes to the bucket.
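To make that concrete, here’s roughly what that looks like. This is a minimal sketch, not my actual config; the bucket name and tag are made up:

```hcl
# Hypothetical bucket: the name and tag are placeholders.
resource "aws_s3_bucket" "podcast_assets" {
  bucket = "my-podcast-assets"

  tags = {
    Project = "pinecast"
  }
}
```

Running `terraform apply` creates the bucket; running it again after someone fiddles with the bucket in the console shows the drift as a diff and offers to put it back.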
And this works for everything, and I do mean everything. Every resource in AWS (with very few exceptions) can be described by Terraform. It also works for Azure, GCP, Oracle, Docker, Cloudflare, and many others. And they all work together: I use Terraform, for instance, to set up Cloudflare logpush to an S3 bucket—it automatically does the dance of setting up the bucket, validating bucket access on the CF side and creating the logpush rules, and then configuring things like retention.
The reason I got started with Terraform is the lurking fear that my AWS account would get wasted in some way[1]. Being able to “start over” by setting up new infrastructure in a new account is highly desirable, and being able to do it entirely automatically without having to futz about in the console is even better (or at the very least, only having to deal with some rough edges rather than the whole damn thing).
Going through the exercise of Terraforming (terraformizing? terrorizing?) my infra has also revealed issues and offered improvements. I wrote a module (reusable code, like a function) for Cloudwatch alerts and applied it to all of my Kinesis data streams. I discovered a queue whose consumer had been erroring for over a month. My application servers on Elastic Beanstalk[2] are happily templatized, and I’ve not just been able to upgrade them from Python 3.7 to Python 3.8[3] with relative ease, I’ve also been able to easily migrate them to Graviton instances (yielding substantial cost savings). All of this would have been a real pain in the ass to do in the console or CLI by hand.
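For a sense of what that module pattern looks like, here’s a sketch of a call site. The module path, names, and variables are invented for illustration, not lifted from my actual code:

```hcl
# Hypothetical module call: the source path, stream, and variable names are
# placeholders, and the module body (CloudWatch alarms keyed off the stream's
# metrics) would live in ./modules/kinesis-alarms.
module "episode_stream_alarms" {
  source = "./modules/kinesis-alarms"

  stream_name   = aws_kinesis_stream.episode_events.name # assumed to exist
  sns_topic_arn = aws_sns_topic.alerts.arn                # assumed to exist
}
```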
At this point, I’d say I’m about 90-95% of the way to having six years of infra encoded in Terraform files. The remaining pieces are the things I want to talk about here, and it mostly comes down to the stuff that I don’t like about Terraform.
I don’t want Terraform to know my secrets
Ever since I was a teenager, one particular lesson has been beaten into me so thoroughly that it’s become a deep-rooted anxiety: don’t put your secrets in git. I’ve been paranoid about where my secrets are and how I’m handling them ever since.
The biggest complaint I have about Terraform is its insistence on storing secrets. Terraform works by performing API calls to create or import resources in a provider (say AWS), then storing the details of that resource in a “state file.” This state file contains all the details so that Terraform can know what everything looked like the last time it interacted with the thing. If I change a Lambda function in my Terraform code, for instance, it allows Terraform to know the ARN of that function to be able to make changes (either updating the existing function or knowing to create a new one and delete the old one).
But lots of providers return sensitive values in their API responses. Consider the `aws_iam_access_key` resource: this lets you create an AWS API key. AWS returns the secret in the API response (what else would it do with the secret?) and Terraform happily keeps that in the state file indefinitely.
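To spell that out (the user name here is made up; the point is that the key’s `secret` attribute gets written to state along with everything else):

```hcl
resource "aws_iam_user" "deploy_bot" {
  name = "deploy-bot" # hypothetical user
}

# The AWS API returns the secret access key when the key is created, and
# Terraform records it as the `secret` attribute of this resource in state.
resource "aws_iam_access_key" "deploy_bot" {
  user = aws_iam_user.deploy_bot.name
}
```

Anyone who can read the state file can read that secret straight out of `terraform.tfstate`.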
Now here’s the thing: that state file has to get shared with everyone who can run the Terraform code. If you’re going to run the `terraform` command, Terraform needs that state file in order to operate. Otherwise, how would it know whether the infrastructure is brand new or not?
We all know that checking your secrets into git is bad. Github even has systems in place so that if you check your, say, Stripe secret key into git, you get a spicy email saying you need to maybe not do that. The options for keeping state (and the secrets inside it) out of git are limited, and I’m frankly not thrilled with any of them:
- Put your state file in your Terraform backend of choice. Lock down access, pull the state file into place, and do all the management of that service manually (I have FUD about using Terraform to manage, say, the S3 bucket that contains the state about that same bucket).
- Store an encrypted version of your state file somehow (PGP?), and just check that bad boy into git.
- Put it in Google Drive or something and have everyone be extremely diligent about updating their state file every time they want to use Terraform.
I don’t like any of these for a couple reasons:
- The fewer hands on my secrets, the better. I don’t want to trust Github, S3, Hashicorp, or Google with my secrets. The secrets should live exclusively in the systems that absolutely need them and nowhere else. Secrets are a production implementation detail.
- Running terraform should use the credentials associated with your machine (or credentials in your CI system) and be centrally managed with IAM (or equivalent) by an admin. (I’ve got a forthcoming post about admin permissions!)
- Any system that requires extra effort on the part of the user is destined to fail due to human error at some point.
The solution I’ve adopted for Pinecast is perhaps a bit barbaric: I simply don’t do anything that stores secrets in Terraform state. I’m able to confidently (?) check my state file into git after making sure there’s nothing sensitive in it and I can move on with my life.
The terrible burden of trying to keep secrets out of state
There are certain things you simply cannot use. IAM access keys, like I mentioned, for one. Which actually isn’t so bad: the only real use I’ve found for these is keys for software (e.g., a desktop GUI for Glacier)[4]. I’ll just mint those manually, since they’re not part of my infrastructure. Everything else should use instance profiles and IAM roles, which involve no secrets in state.
RDS, interestingly, has proven to be a hurdle. I have two Aurora Postgres clusters, and neither is encoded in Terraform yet. Why? Because you need to set a `master_password` in your Terraform code. This just feels like asking for trouble. The solution is probably to set up IAM auth (which uses AWS credentials, local or instance profile, to generate an ephemeral password for the RDS instance) and disable non-IAM auth. I’ve yet to figure out a way to automate that setup process with a Postgres Terraform provider, but I’m sure it’s possible.
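The AWS half of that is expressible today; it’s the Postgres-side `GRANT rds_iam` step that I haven’t automated. A sketch of the AWS half, with made-up names, a placeholder account ID, and the bootstrap password problem left unsolved:

```hcl
variable "bootstrap_password" {
  type      = string
  sensitive = true # hides it from CLI output, but it still lands in state
}

resource "aws_rds_cluster" "main" {
  cluster_identifier = "pinecast-main" # hypothetical
  engine             = "aurora-postgresql"
  master_username    = "postgres"
  master_password    = var.bootstrap_password # the part that feels like trouble

  # Lets AWS credentials mint short-lived auth tokens instead of passwords.
  iam_database_authentication_enabled = true
}

# Whatever role the app servers run as then needs rds-db:connect on the DB
# user (region, account ID, and user name below are placeholders):
data "aws_iam_policy_document" "db_connect" {
  statement {
    actions   = ["rds-db:connect"]
    resources = ["arn:aws:rds-db:us-east-1:123456789012:dbuser:${aws_rds_cluster.main.cluster_resource_id}/app_user"]
  }
}
```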
Other things that are problematic are SSM Parameter Store and Secrets Manager, which require you to provide secrets in your code. Which, in turn, sort of defeats the purpose of using a secrets manager if you’re checking those secrets in. Thanks but no thanks, I’ll just set those up manually (you’d have to set them up manually anyway).
Both the RDS and SSM/Secrets Manager examples involve secrets in your Terraform code, and those secrets end up in state as well. Either one of those on its own would be an issue; together they make these resources non-starters for me.
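For the Parameter Store case, the shape of the problem looks something like this (the parameter name and variable are placeholders):

```hcl
variable "stripe_secret_key" {
  type      = string
  sensitive = true # redacted in CLI output, NOT in the state file
}

resource "aws_ssm_parameter" "stripe_secret_key" {
  name  = "/pinecast/stripe-secret-key" # hypothetical parameter name
  type  = "SecureString"
  value = var.stripe_secret_key # encrypted in SSM, plaintext in state
}
```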
There’s a relatively small number of things that actually use secrets, so the burden isn’t overwhelming, but each of the few things that does involve secrets seems to waste hours of my time.
Terraform is manual and serial (by design)
You can’t safely automate the process of applying Terraform changes. You need a human to inspect the changes and manually approve them every single time. Just like any kind of code, two commits can make valid changes that are incompatible—that is, the input to Terraform is valid and can be applied but the behavior/output is undesirable. This is why most companies run tests on individual PRs, but then also run the tests after the PRs are merged: just because a PR merges cleanly doesn’t mean that it does what it should when combined with other newly-merged PRs.
Unlike code, though, Terraform can’t really be tested. You can make sure the Terraform code is valid and well-formed, and you can test that certain values and variables are correct (e.g., that values which should contain IP addresses actually are IP addresses, or that lists have the expected size), but that’s about it.
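That kind of check, for what it’s worth, looks something like Terraform’s variable validation (the variable here is hypothetical):

```hcl
variable "allowed_ip" {
  type        = string
  description = "Hypothetical example: an IP address to allowlist."

  validation {
    # cidrhost() errors on anything that isn't a valid IPv4 address, and
    # can() turns that error into false, failing the validation.
    condition     = can(cidrhost("${var.allowed_ip}/32", 0))
    error_message = "The allowed_ip value must be a valid IPv4 address."
  }
}
```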
The problem is that when you apply a Terraform plan, you’re actually modifying a stateful system, and the actions needed to eliminate the delta between what the Terraform code describes and what’s actually in existence might not be what you want. You can’t leave it up to a machine to decide what is a reasonable change, because any change might be reasonable under the right circumstances.
For instance, you probably don’t want to accidentally delete and recreate your production database instances because the merge order of two developers’ commits prevents Terraform from updating the existing resources (rather than deleting/recreating). But another person may have intended to do exactly that: the database was created incorrectly, and they’re replacing it with the right settings.
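Terraform does give you a blunt guardrail for the resources you’d least like to lose: the `prevent_destroy` lifecycle flag turns any plan that would destroy the resource into a hard error. It doesn’t remove the need for a human in the loop, and whoever genuinely wants to recreate the thing has to remove the flag first, but it makes the worst case loud. A sketch, on a made-up stream to keep the example small:

```hcl
resource "aws_kinesis_stream" "episode_events" {
  name        = "episode-events" # hypothetical
  shard_count = 1

  lifecycle {
    # Any plan that would delete this resource fails instead of quietly
    # scheduling a destroy/recreate.
    prevent_destroy = true
  }
}
```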
Sure, you could have two AWS accounts: one for testing and one for production. Apply your changes in one, run some tests against it, then apply your changes in the other when you are confident it works. In the happy path, this works great. But there are some deal breakers:
- It’s an extremely expensive approach whose only real purpose is to avoid having humans look at what their commits are about to do.
- The tests you run can’t simply be integration tests. You obviously care about the system’s ability to perform actions, but you also need to care that the system maintained integrity after the plan was applied. If a commit triggers the deletion and reconstruction of the database, integration tests should still pass afterward, just as they would in CI where the database is fresh. You additionally need to test that state from before the plan was applied is still usable after the plan was applied, and that involves a level of testing that’s probably unachievable. These kinds of tests are hard to write, hard to reason about, and require diligence to make sure the environment isn’t polluted for future test runs.
- If your production system costs, say, $100/hr to run, you probably don’t want to use equivalent instances and configurations for your test environment (i.e., you want the test environment to cost far less than $100/hr). But if you introduce differences between the production and test environments (e.g., instance types), you’ve opened the door to problems that only affect production systems.
So ultimately, you always need to look at what Terraform is going to do to your environment. Always always always, no matter what. Thankfully, the plans that it spits out are very readable. A recent one of mine, for example, showed nothing but some really mundane description and JSON whitespace changes after I added some resources to my Terraform code.
Now, requiring a manual, serial step is not really a problem with Terraform; it’s a fundamental constraint of what Terraform is trying to accomplish. The notion of infrastructure as code necessarily has a temporal component. As far as I’m aware, this is a problem with every infrastructure-as-code tool.
It’s not an unfamiliar problem, either: this is an issue with database migrations as well. If two people make overlapping migrations that affect the same database table, it could have unintended consequences. You don’t want these folks applying migrations before merging, and after merge the changes need to be deemed safe and run serially.
Terraform wants my code, and I don’t want to give it my code
Something that surprised me a lot about Terraform is how often it wants the code for my application, or wants to have meta information about my code. The easiest example of this that I can give is Lambda, in two ways.
First, you can’t create an `aws_lambda_function` resource without giving it code. Here’s what the docs say:
> AWS Lambda expects source code to be provided as a deployment package whose structure varies depending on which `runtime` is in use. … Once you have created your deployment package you can specify it either directly as a local file (using the `filename` argument) or indirectly via Amazon S3 (using the `s3_bucket`, `s3_key` and `s3_object_version` arguments).
Which means you need to have a dummy zip file in your Terraform source or on S3 that contains some JavaScript or Python or whatever.
I simply don’t know how this is supposed to be used without keeping my Terraform colocated with the code it’s meant to be supporting. My Lambda functions have a `deploy.sh` script that bundles the JavaScript or Python code, uploads the ZIP bundle[5], and creates a new function version. These don't live in the same repo as my Terraform code, and I don't want to have to have both cloned just so I can bundle my functions at the time Terraform is run in a clean environment.
My workaround for this has been to create a new function in the console (which adds dummy placeholder code) and import the new function into Terraform. Honestly, that’s what I think the ideal behavior would be: creating the function with Terraform just makes it exist, and deploying “real” code to it is a separate step.
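The closest I’ve found to encoding that in Terraform itself (this is a pattern, not anything the docs prescribe) is to point the resource at a placeholder bundle and then tell Terraform to stop caring about the code-related arguments. A sketch with made-up names; the role is assumed to be defined elsewhere:

```hcl
resource "aws_lambda_function" "thumbnailer" {
  function_name = "thumbnailer"                # hypothetical
  role          = aws_iam_role.thumbnailer.arn # assumed to exist elsewhere
  runtime       = "python3.8"
  handler       = "main.handler"

  # A tiny placeholder ZIP checked in next to the Terraform code; the real
  # code is deployed out-of-band by deploy.sh.
  filename = "${path.module}/placeholder.zip"

  lifecycle {
    # Don't try to revert whatever deploy.sh uploaded later.
    ignore_changes = [filename, source_code_hash]
  }
}
```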
Another example in Lambda is function layers. Layers are a cool concept: you can add code and libraries to a layer that you want multiple functions to use. The functions that use the layer will include those files when invoked. I have one, for instance, that includes ffmpeg binaries.
Now, you might be thinking, “Oh Basta, that’s the same problem: the layer needs a file.” Well, yes. But there’s a separate problem, which is that it not only depends on having code, but layers in Terraform are modeled as versions rather than as the layer itself. In fact, there’s no resource for layers, there’s only a resource for layer versions (`aws_lambda_layer_version`).
Now, I have a script that downloads ffmpeg binaries and bundles them as a ZIP and uploads them to S3. That’s easy and fine (though I don’t have Terraform wired up to my script, which will make recreating this painful), but it means that when I upload a new version, I now need to add a new layer version to my Terraform code pointing at the new bundle on S3.
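In Terraform, each of those uploads ends up needing a block like this (the bucket, key, and runtimes are placeholders for mine):

```hcl
# Every new ffmpeg bundle on S3 means a new layer version pointing at the
# new object key. Names and keys here are placeholders.
resource "aws_lambda_layer_version" "ffmpeg" {
  layer_name          = "ffmpeg"
  s3_bucket           = "my-lambda-artifacts"
  s3_key              = "layers/ffmpeg-2021-09.zip"
  compatible_runtimes = ["python3.8"]
}
```

Functions then reference the specific version by ARN (`aws_lambda_layer_version.ffmpeg.arn`), so every bump ripples through to them as well.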
Thankfully, I don’t need new ffmpeg versions more than once a quarter or so, and so this isn’t an issue that keeps me up at night. But let’s say the layer contained compiled code from one of my projects which was under active development. Deploying that code means I have two options:
1. I have to run the deploy script, then update my Terraform code and apply the changes to create the new layer version.
2. I don’t use Terraform to manage the layer versions and just deploy the layer version manually using the CLI.
Both are bad options. The first is a bad option because it means that I can’t have fully automated deploys (see above). The second is a bad option because getting the Lambda functions to use the new layer versions means updating Terraform anyway, if those functions are managed by Terraform. Or, letting the functions get out of sync with the Terraform version.
Arguably, the second option probably isn’t the worst for an existing project. It doesn’t seem terrible to require someone to update a version number in Terraform to bump a version, though it gets tedious for something updated many times per week. But it also means that recreating the environment from scratch is hard: I need to create the bucket before I can upload the ZIP, and I need to upload the ZIP before I can provide a version number for the layer version that the Lambda functions use.
Again, this isn’t a problem inherent to Terraform, but it’s a problem that Terraform-the-concept inherits from bad AWS API design. Fundamentally, this is Amazon having different rules for different types of resources. Things like Lambda functions, where the resource is tightly coupled to the data (code) it contains, behave differently from Kinesis data streams, where the resource doesn’t care at all about the data. Using Terraform the way it feels like it wants to be used means adopting practices that might not make sense.
Other kinds of differences in rules make it hard to practically manage resources via Terraform. I can delete my Elastic Beanstalk environment by deleting the Terraform code and applying the changes. I cannot delete an S3 bucket that’s not empty by deleting the Terraform code and applying the changes (I need to empty the bucket manually first). Differences like this make it inherently difficult to treat Terraform as a generic tool, because everything you manage has the possibility of introducing its own weirdsies.
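For the S3 case specifically, the AWS provider does have a per-resource escape hatch, though you have to know about it and opt in ahead of time (the bucket name is made up):

```hcl
resource "aws_s3_bucket" "log_archive" {
  bucket = "my-log-archive" # hypothetical

  # With this set, removing the resource from the code (or running
  # `terraform destroy`) empties the bucket before deleting it. Without it,
  # destroying a non-empty bucket just fails.
  force_destroy = true
}
```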
Importing is hard and gets harder
If you’re starting from infrastructure that’s not in Terraform and writing the Terraform code for it, you need to use `terraform import` quite a lot. You write your Terraform code for a resource (usually just enough to satisfy the required arguments), run `terraform import my_resource.resource_id <the id of the resource on AWS>`, and you’re good to go. Mostly.
Once you’ve imported, you can run `terraform plan` to see what Terraform would do to turn your bare-bones description of the resource into the version that already exists in production. That tells you all the arguments that you missed or got wrong. You tweak and repeat until the plan is empty or you’re satisfied with the diff. And honestly, this sucks for a number of reasons.
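The loop looks roughly like this (the stream and its name are placeholders):

```hcl
# Step 1: write just enough HCL to satisfy the required arguments.
resource "aws_kinesis_stream" "listen_events" {
  name        = "listen-events" # placeholder
  shard_count = 2
}

# Step 2: attach the existing AWS resource to that address (Kinesis streams
# happen to import by name):
#   terraform import aws_kinesis_stream.listen_events listen-events
#
# Step 3: run `terraform plan`, see which arguments differ from what's in
# production (retention, encryption, tags, ...), fill them in, and repeat
# until the plan is empty.
```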
First, the IDs of things are many and varied. They’re often not ARNs: they use S3 bucket names, Lambda function names, parent resource IDs with a slash and a UUID that’s not exposed in the console, and more. The lack of consistency makes it frustrating to use. Almost every resource in the docs has a section on how to import; frankly this shouldn’t be necessary.
Second, why can’t I say something like `terraform describe <the id of the resource on AWS>` and get back the code for it, like a sort of reverse `terraform import`? Have I just missed this? Surely this would be invaluable for folks who have a ton of infra to bring over[6].
Last, Terraform now prefers describing sub-resources as their own resources rather than as a nested structure on the parent resource. I sort of understand why, but it makes things that much harder to manage, especially when the sub-resource is practically only ever used in one place.
EventBridge rules, for instance, will happily get created without any triggers. You separately need to add the triggers (targets, in EventBridge’s terminology) as their own resources, with a reference back to the rule. But AWS doesn’t even show the UUIDs for the targets in the console: they’re just shown as rows in a table on the rule page. Knowing that you need to do this is one problem, and importing them is another challenge that forces you to break out the CLI. If you have a lot of these, it’s very painful.
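Concretely, the pairing looks like this; the rule name, schedule, and Lambda reference are placeholders, and the import ID is exactly the kind of `<parent>/<id>` string the console never shows you:

```hcl
resource "aws_cloudwatch_event_rule" "nightly_cleanup" {
  name                = "nightly-cleanup" # hypothetical
  schedule_expression = "cron(0 3 * * ? *)"
}

# The rows on the rule's console page are these targets, each a separate
# resource that references the rule by name.
resource "aws_cloudwatch_event_target" "nightly_cleanup_lambda" {
  rule = aws_cloudwatch_event_rule.nightly_cleanup.name
  arn  = aws_lambda_function.cleanup.arn # assumed to exist elsewhere
}

# Importing requires the rule name plus the target ID:
#   terraform import aws_cloudwatch_event_target.nightly_cleanup_lambda \
#     nightly-cleanup/<target id>
```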
As you bring your infrastructure into Terraform, every `terraform plan` or `terraform apply` operation needs to check more and more resources. The more you import, the slower the process becomes. For me, it takes almost fifteen seconds to generate a plan. I expect it’ll probably end up somewhere around 30 seconds when I get to 100%. For such a manual process, this becomes an almost excruciating dev loop. Surely there must be a tool that helps automate this for folks bringing in lots of resources, right?
My Terraform wishlist
- Compute resources should be able to exist independent of their code. You should be able to create a Lambda function without Terraform knowing about the code or the deployment process.
- If nothing terraform-izable depends on a resource, it should be able to be updated/deleted without any manual steps.
- It should be possible to tell Terraform not to store a particular value for a resource in state when the value is returned by the API call to create the resource.
- Terraform should have built-in tools that allow you to go the no-secrets-in-state route, raising an error for any sensitive return value that would otherwise end up in state before you apply your changes.
- State locking should be optionally decoupled from backends. It should be possible to use the `local` backend so you can track your secrets-free state in version control while also preventing two people from making changes that conflict. “I want to `terraform apply`, and I’m on state <hash>. If I am clear to proceed, lock the state until I provide a new hash. Otherwise complain and tell me to rebase.”
The other stuff
I don’t want this post to just be me complaining about Terraform, because I really do like it. It’s certainly not the perfect tool, but it’s had a ton of benefits:
- I’m faster at making changes
- I’m more confident that things are configured correctly
- My infrastructure is more consistent and well-documented
- Copying and pasting infra when building new stuff is a game-changer
- I no longer have anxiety that I have stuff hanging around in AWS that I’m not using or have forgotten about: if it’s not in the code, I can be pretty confident that it doesn’t exist (with few exceptions)
- Lots of things that I built in the past were made of many moving parts that had been configured purely in the console; having these in code means there’s no mystery about how they work
[1] I’m one of the lucky few whose AWS account is inseparably tied to their Amazon retail account. :tadont:
[2] I know I need to migrate to ECS/Fargate, but I really just don’t want to deal with the headache of containers right now.
[3] Py3.8 is the most recent version that Beanstalk supports. See the above footnote: I know I need to leave.
[4] There are some places where having keys is unavoidable in infra for one reason or another. Someone I spoke with (sorry I can’t remember who!) suggested creating a Lambda function on a cron that only has permissions to mint and roll keys, and all it does is set up and rotate keys all day. This is an interesting idea and prevents even the person writing the Terraform code from having the secrets touch their machine in the first place, but I haven’t tried it.
[5] Mind you, to some S3 bucket that I don’t specify. I use `aws lambda update-function-code` with the `--zip-file` argument, and so it goes to S3 but lord knows what bucket it ends up in. Not one of mine, as far as I’m aware.
[6] It would be challenging to, say, reference existing resources rather than hard-coding names/IDs/ARNs in the generated code, but a best-effort attempt would be far more valuable than nothing.