Hi! I'm Zsolt from the Internal Platform team and in this post, I'll be describing the automation we've built upon Terraform Cloud that enables teams to manage their infrastructure resources in a self-served way.
The reason our team was formed (and our ultimate dream goal) is to provide teams with a truly self-served way to manage their services and their infrastructure resources. In the previous post, I talked about Terraform Cloud and how we set it up for teams. Now it's time to tell you how they can actually use it to build their services.
What are systems?
As we've described in an earlier post, our internal platform hosts services in cohesive groups called systems. Within these systems, teams should be able to work freely, as systems represent the slice of the domain that teams own. To achieve this, we need to set up automation that allows the full management of resources inside these systems. And since we're fans of Terraform (and thus of reliable, reproducible, easy-to-use declarative resource management), we're building all of this automation using Terraform, on Terraform Cloud. This means that we ourselves are users of our own automation, which gives us quick feedback, and it's also rewarding to see the fruits of our work. :)
So, what exactly constitutes a system? Well, we use GCP projects to host the infrastructure resources (databases, secrets, buckets, etc.), so let's start with that. Naturally, first we need to create these projects. And, since we're using a shared VPC, we also need to set them up as service projects. There are also some project-wide settings we need to turn on so that these projects can host these resources; for example, we need to enable certain APIs in them. And to get out-of-the-box monitoring on Datadog, we need to enable their GCP project integration. Luckily, there's even a Terraform resource for that! Now, let's see what else needs to be created along with a system.
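To make this more concrete, here's a minimal sketch of the project setup, building on the ideas above. All names (`var.system_name`, the naming scheme, the API list, the Datadog credentials) are illustrative, and the exact Datadog integration resource and arguments depend on your provider version:

```hcl
# Hypothetical sketch: create a system's GCP project and attach it to the shared VPC.
resource "google_project" "system" {
  name            = var.system_name
  project_id      = "acme-${var.system_name}-prod" # illustrative naming scheme
  folder_id       = var.folder_id
  billing_account = var.billing_account
}

# Attach the new project as a service project of the shared VPC host project.
resource "google_compute_shared_vpc_service_project" "system" {
  host_project    = var.host_project_id
  service_project = google_project.system.project_id
}

# Enable the APIs the system needs to host its resources (illustrative list).
resource "google_project_service" "apis" {
  for_each = toset([
    "sqladmin.googleapis.com",
    "secretmanager.googleapis.com",
    "storage.googleapis.com",
  ])

  project = google_project.system.project_id
  service = each.value
}

# Register the project with Datadog's GCP integration for out-of-the-box monitoring.
resource "datadog_integration_gcp" "system" {
  project_id     = google_project.system.project_id
  private_key    = var.datadog_sa_private_key
  private_key_id = var.datadog_sa_private_key_id
  client_email   = var.datadog_sa_client_email
  client_id      = var.datadog_sa_client_id
}
```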
GitOps
Of course, teams need to deploy their applications to our Kubernetes clusters. To enable that, we also create a bunch of objects when creating systems. Most importantly, each system needs its own namespace in Kubernetes to achieve good isolation, and since we're using ArgoCD for continuous deployment, we also create separate ArgoCD projects with RBAC set up for the teams owning the systems. ArgoCD is an excellent CD tool that supports GitOps for managing deployments. As such, we also create a separate GitHub repository for each system to act as the source of truth for GitOps.
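A rough sketch of these per-system objects could look like the following. The AppProject spec is simplified (the real setup also carries the owning team's RBAC policies), the naming scheme is illustrative, and using `kubernetes_manifest` for the ArgoCD project is just one possible approach:

```hcl
# Namespace to isolate the system's workloads.
resource "kubernetes_namespace" "system" {
  metadata {
    name = var.system_name
  }
}

# ArgoCD AppProject scoping deployments to the system's namespace.
resource "kubernetes_manifest" "argocd_project" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "AppProject"
    metadata = {
      name      = var.system_name
      namespace = "argocd"
    }
    spec = {
      sourceRepos = [github_repository.deployment.http_clone_url]
      destinations = [{
        namespace = var.system_name
        server    = "https://kubernetes.default.svc"
      }]
    }
  }
}

# GitHub repository acting as the GitOps source of truth for the system.
resource "github_repository" "deployment" {
  name       = "${var.system_name}-deployment" # illustrative naming scheme
  visibility = "private"
  auto_init  = true
}
```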
We have set up ArgoCD so that it expects Helm charts in these repositories. That way, teams can just commit their changes, and ArgoCD will automatically pick them up and update the corresponding Kubernetes objects accordingly, including rolling out a new version. It's almost like magic! Teams can also easily revert changes: they only need to revert the commit on GitHub, and that will soon be reflected in their application. There's one trick when creating these repos, though. We have to invite the bot user we created for ArgoCD as an outside collaborator on these repositories to make sure ArgoCD can read their contents. By not adding ArgoCD's bot user to the organization, we can make sure that it only has access to specific repositories, thus limiting the attack surface of a potential breach. In the next section, I'll describe how we can do this invitation using Terraform.
Setting up providers
Lucky for us, we live in 2020 *ahem* 2022, where there's a Terraform resource for almost anything 😎 Creating the above-mentioned resources is quite easy: we just have to set up the necessary providers with the proper credentials and access rights, just like we do with GCP resources and the Google Terraform provider. But there's a twist! To invite a user to a repo, we also have to accept the invitation from the invited user's account. That means we actually have to use two GitHub accounts to set up the deployment repositories: one that is able to invite new members, and one that gets invited and accepts the invitation. The latter will also be used by ArgoCD to read the repositories. To achieve this, we had to set up two GitHub Terraform providers with aliases, and specify explicitly which one to use when creating the two resources: the invitation itself and the acceptance of the invitation.
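In practice, the two aliased providers and the invitation round-trip look roughly like this (the organization, tokens, and bot username are placeholders, and the repository resource is the one from the earlier sketch):

```hcl
# Provider authenticated as an account that can invite collaborators to org repos.
provider "github" {
  alias = "inviter"
  owner = "acme-org" # placeholder organization
  token = var.admin_token
}

# Provider authenticated as the ArgoCD bot user that accepts the invitation.
provider "github" {
  alias = "invitee"
  token = var.argocd_bot_token
}

# Invite the bot as a read-only outside collaborator on the deployment repo.
resource "github_repository_collaborator" "argocd_bot" {
  provider   = github.inviter
  repository = github_repository.deployment.name
  username   = "acme-argocd-bot" # placeholder bot username
  permission = "pull"
}

# Accept the invitation on behalf of the bot user.
resource "github_user_invitation_accepter" "argocd_bot" {
  provider      = github.invitee
  invitation_id = github_repository_collaborator.argocd_bot.invitation_id
}
```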
Let this be a friendly warning to all who embark on a similar journey: don't forget to explicitly declare `required_providers` in your modules and to pass providers explicitly to them (and to resources)! When multiple aliases are in use, passing providers explicitly is mandatory, and forgetting to do so results in unfriendly error messages. Now that the systems themselves are ready, let's discuss the automation required to actually make use of them. More specifically, let's iterate on what kind of resources teams will be creating on their own and what kind of rights we need to give them so that they can work in a secure and independent way.
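For reference, the wiring looks something like this (the module path is hypothetical):

```hcl
# modules/deployment-repo/versions.tf -- the module declares the aliases it expects.
terraform {
  required_providers {
    github = {
      source                = "integrations/github"
      configuration_aliases = [github.inviter, github.invitee]
    }
  }
}

# Root configuration -- the caller maps its aliased configurations to the module's aliases.
module "deployment_repo" {
  source = "./modules/deployment-repo" # hypothetical module path

  providers = {
    github.inviter = github.inviter
    github.invitee = github.invitee
  }
}
```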
Terraform workspace rights
To allow teams to quickly try out new ideas, we give them full ownership of their development projects. For production systems we refrain from that practice and only grant them a limited set of rights, to make sure all activity goes through Terraform, which -- combined with the cloud CI -- acts as a safeguard against accidental catastrophes and also doubles as an audit trail. As discussed in the previous post, for team members we achieve these IAM settings via GCP folders.
Allowing teams to use Terraform to manage the systems they own is a bit more convoluted than granting rights to users. Since we're aiming for full system separation (except for the network), we're creating separate service accounts for each system (and environment) and granting these accounts specific rights scoped only to the GCP projects of the systems. In those projects, teams should be able to manage their resources arbitrarily (creating databases, buckets, etc.) using Terraform, so giving them the owner role is necessary. However, they will also be using resources from other projects. For instance, they need access to specific secrets in our common project that stores company-wide credentials, so we also have to grant secret accessor rights on that common project.
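A hedged sketch of the per-system service account and its grants, building on the project created earlier (the admin project, naming scheme, and secret name are placeholders):

```hcl
# Service account that runs Terraform for this system and environment.
resource "google_service_account" "system_terraform" {
  project      = var.admin_project_id         # placeholder: project hosting the SAs
  account_id   = "${var.system_name}-prod-tf" # illustrative naming scheme
  display_name = "Terraform for ${var.system_name} (prod)"
}

# Owner on the system's own project, so the team can manage resources freely there.
resource "google_project_iam_member" "system_owner" {
  project = google_project.system.project_id
  role    = "roles/owner"
  member  = "serviceAccount:${google_service_account.system_terraform.email}"
}

# Access to a specific shared secret in the common project.
resource "google_secret_manager_secret_iam_member" "shared_secret" {
  project   = var.common_project_id
  secret_id = "shared-api-key" # placeholder secret
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.system_terraform.email}"
}
```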
Additionally, the service accounts running Terraform will need to be able to use the shared VPC, so we need to create IAM bindings in the host project, granting network user rights to the teams' service accounts individually. These shared VPCs are created from a separate infrastructure setup repository using Terraform (wrapped with Terragrunt, to be more precise), so the service accounts will also need access to that repository's remote state bucket to be able to query common configuration (like the subnets of the shared VPC).
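The corresponding grants might look like this (the host project and state bucket names are placeholders):

```hcl
# Let the system's Terraform SA use the shared VPC from the host project.
resource "google_project_iam_member" "network_user" {
  project = var.host_project_id
  role    = "roles/compute.networkUser"
  member  = "serviceAccount:${google_service_account.system_terraform.email}"
}

# Read access to the infrastructure repo's remote state bucket, so the workspace
# can query common configuration such as the subnets of the shared VPC.
resource "google_storage_bucket_iam_member" "infra_state_reader" {
  bucket = var.infra_state_bucket # placeholder bucket name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.system_terraform.email}"
}
```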
Kubernetes service registration
And finally, teams will be creating Kubernetes objects to set up their services using Terraform. This might sound confusing: didn't we just talk about using GitOps? 🤔 Well, while we're using GitOps and Helm to deploy applications, some initial configuration is needed for this to work smoothly. Each service that teams register on the internal platform needs a separate ArgoCD application to set up its GitOps flow, so teams will require some rights to create these objects (in ArgoCD's namespace, to be exact). The workload objects defined by the charts will be created by ArgoCD itself, so there's no need to grant extra rights to their Terraform service account in this regard.
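For illustration, registering a service essentially means the team creates an ArgoCD `Application` in ArgoCD's namespace from their own workspace; a sketch with placeholder names and paths:

```hcl
# ArgoCD Application pointing at the service's chart in the system's deployment repo.
resource "kubernetes_manifest" "service" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"
    metadata = {
      name      = "my-service" # placeholder service name
      namespace = "argocd"
    }
    spec = {
      project = var.system_name
      source = {
        repoURL        = "https://github.com/acme-org/my-system-deployment" # placeholder
        path           = "charts/my-service"                                # placeholder chart path
        targetRevision = "HEAD"
      }
      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = var.system_name
      }
      syncPolicy = {
        automated = { prune = true }
      }
    }
  }
}
```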
However, teams will be using the above-mentioned Terraform workspaces to create infrastructure resources. To actually enable their applications to access these resources, they will also be creating Kubernetes objects to store the configuration and credentials, i.e. ConfigMaps and Secrets created from Terraform, so we'll need to allow the creation of these resources. To make this piece of the puzzle complete, we also need to enable teams to create the service accounts running their services on Kubernetes.
Luckily, Kubernetes RBAC is quite powerful and allows us to grant rights specifically for these purposes, and these purposes only. So, we have defined roles for each system that allow the full management of the above-mentioned resources in the system's namespace, and only grant read rights for certain objects (e.g. nodes). We bind these roles to each system's service account using the email address of the service account running their Terraform workspace. This way, teams can now register their services on our Kubernetes clusters using the infrastructure resources they created, in the same self-served and fully automated way! 🎉
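A simplified sketch of such a role and its binding, trimmed down to the namespaced resources mentioned above (the cluster-wide read rights for objects like nodes would live in a separate ClusterRole):

```hcl
# Role allowing full management of the configuration objects the team needs.
resource "kubernetes_role" "system_terraform" {
  metadata {
    name      = "terraform"
    namespace = var.system_name
  }

  rule {
    api_groups = [""]
    resources  = ["configmaps", "secrets", "serviceaccounts"]
    verbs      = ["get", "list", "watch", "create", "update", "patch", "delete"]
  }
}

# Bind the role to the GCP service account running the system's Terraform workspace.
resource "kubernetes_role_binding" "system_terraform" {
  metadata {
    name      = "terraform"
    namespace = var.system_name
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.system_terraform.metadata[0].name
  }

  subject {
    api_group = "rbac.authorization.k8s.io"
    kind      = "User"
    name      = google_service_account.system_terraform.email
  }
}
```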
Workspaces
Now that we have created systems and we have service accounts that are able to create resources in them, we can wire everything together. To allow teams to manage their resources in a system, they'll need a workspace that executes their Terraform code. As mentioned in an earlier post, we're creating teams using the Terraform Cloud provider (`tfe`), so it comes naturally to create system workspaces using that same provider. There are a couple of things needed for this, so let's briefly go over them.
A workspace on Terraform Cloud needs a link to the codebase to execute, and this is where teams will store their Terraform code for each system. We have decided to create a monorepo for this purpose that holds the Terraform code for every system. In this repository, each system has its own subfolder, and we're creating workspaces in a way that they run within their corresponding subdirectory. Terraform Cloud can be easily configured to automatically trigger speculative plans for PRs and to apply changes on each merge. This way, teams just need to write their code in a GitHub repository, get a review, and then the resources will just magically appear. Sounds familiar? Yeah, it's exactly the same GitOps flow as with application deployments. :) To top it off, speculative plans ensure that teams have a clear understanding of the impact of their planned changes, helping prevent issues due to misconfiguration. And to allow teams to work independently, we're also setting up `CODEOWNERS` in the monorepo so that teams directly own their system's subdirectories and can merge PRs based on reviews from their own teammates.
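The workspace itself can be sketched like this (the monorepo identifier, directory layout, and naming scheme are placeholders):

```hcl
# Workspace running the system's subdirectory of the Terraform monorepo.
resource "tfe_workspace" "system" {
  name              = "${var.system_name}-prod"    # illustrative naming scheme
  organization      = var.tfc_organization
  working_directory = "systems/${var.system_name}" # placeholder monorepo layout
  auto_apply        = true                         # apply on merge without manual confirmation

  vcs_repo {
    identifier     = "acme-org/terraform-systems"  # placeholder monorepo
    branch         = "main"
    oauth_token_id = var.vcs_oauth_token_id
  }
}
```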
To be able to actually create the resources, though, we need to use the service account mentioned before. It's actually quite easy to achieve: we only need to create a sensitive environment variable in the system workspace (called `GOOGLE_CREDENTIALS`) with a service account key, and the Google Terraform provider in the workspace will use that. The same goes for the team's token on Terraform Cloud: we just pass it through an environment variable to the provider configuration in the workspace, so that teams can manage their own workspace or team from within Terraform.
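Setting such a variable from our automation is a small resource; a sketch (how the key JSON ends up in `var.system_terraform_sa_key` is omitted here):

```hcl
# Service account key that the Google provider picks up in the system's workspace.
resource "tfe_variable" "google_credentials" {
  workspace_id = tfe_workspace.system.id
  key          = "GOOGLE_CREDENTIALS"
  value        = var.system_terraform_sa_key # placeholder: the SA key JSON
  category     = "env"
  sensitive    = true
  description  = "Key for the system's Terraform service account"
}
```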
But wait, how do these providers get initialized if we generate the folder for each system? The system folder needs to be bootstrapped, which means creating a couple of files with `terraform {}` and `provider {}` blocks to define provider versions and their configuration. For the `kubernetes` and `helm` providers, we also need to set up access to remote states, as the credentials for the clusters are created by the earlier-mentioned repository and Terragrunt.
Additionally, Terraform needs a backend configuration to use the Terraform Cloud workspace that we've created for the system. That's a lot of boilerplate just to start using a workspace, so to save teams the effort (and to make sure these configurations are kept in a consistent state), we generate a couple of files (using the GitHub Terraform provider) for each folder in the monorepo, containing all the necessary configuration to make the workspace usable right from the start, without any manual configuration required. 🚀
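As an example, generating the backend configuration could look roughly like this (the monorepo name, layout, and organization are placeholders; the other bootstrapped files are generated the same way):

```hcl
# Generate the backend configuration in the system's subfolder of the monorepo,
# so the workspace is usable without any manual bootstrapping.
resource "github_repository_file" "backend" {
  repository          = "terraform-systems" # placeholder monorepo name
  branch              = "main"
  file                = "systems/${var.system_name}/backend.tf"
  commit_message      = "Bootstrap ${var.system_name} workspace"
  overwrite_on_create = true

  content = <<-EOT
    terraform {
      backend "remote" {
        organization = "acme-org" # placeholder organization

        workspaces {
          name = "${var.system_name}-prod"
        }
      }
    }
  EOT
}
```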
How it all comes together
So, to recap how the automation is built: using our own Terraform Cloud workspace, with each system's registration we create GCP and Kubernetes resources to host the system's resources, then we create a GitHub repository to host the application deployment code (in the form of Helm charts), and a directory in a monorepo to host the team's own Terraform code. Then, we create a separate workspace on Terraform Cloud, owned by the system's owner team, that is able to apply the team's own Terraform code. The workspace is set up to use a GCP service account with very specific rights, and we generate all the files necessary for Terraform to be able to create resources on GCP and on our Kubernetes cluster.
Basically, we're using Terraform to create a Terraform Cloud workspace that is able to run the teams' Terraform code, i.e. we're managing Terraform with Terraform 🤯 This automation of automations is what makes our platform truly self-served, allowing teams to manage their own (and only their own!) resources in an independent and automatic way, which was exactly our dream when our team was formed. ⭐️