Me: Well I mean big, like really big. Enterprise big.
You: That’s a bit nebulous, what do you mean by Enterprise big?
Me: Well it’s worked for clients of mine with > 1000 pipelines, and > 8000 builds a month, so I guess we can call that big?
You: Yeah I suppose that counts.
A quick primer on CI/CD
To understand the different methodologies available to perform CI/CD at scale, we first need to understand what CI/CD actually is beyond a bunch of letters separated by a slash. Let's outline some common terms, and their meanings, below.
| Term | Description | Example |
| --- | --- | --- |
| Pipeline | A static artefact containing an arbitrary collection of tasks. Said grouping of tasks, when performed, achieves a business goal and/or use case. Said collection of tasks is usually generalised, with well-defined inputs and outputs, to allow for different runtime behaviour based on our needs. | A singular pipeline containing a collection of tasks which, when performed, deliver cookies. Generalised inputs allow us to define a physical location to deliver cookies to, whilst the outputs let us know the state of delivery. |
| Build | A singular execution of a pipeline, with discrete values provided for generalised inputs, and outputs created as a result of the execution. Each build therefore has its own parameters and runtime environment. | We execute the cookie delivery pipeline, specifying the physical location input as “living room”. The output tells us that the cookies were successfully delivered, along with a nice cup of tea. |
| Continuous Integration (CI) | The practice of automating the integration of code from various authors/places into a singular whole. | When a pull request is opened on a repository, for each commit we run a suite of end-to-end integration and regression tests to ensure any code changes are compatible with the rest of the code base. Given our previous pipeline example, this might be ensuring that any new cookies added to the cookie jar don’t impede our ability to jam our sausage fingers inside to fish out the Jammie Dodgers. |
| Continuous Deployment (CD) | The practice of automatically releasing code into production - hopefully after some form of automated testing - when a release candidate is ready. | When we merge a pull request into the main branch - or semantically version code on the main branch - we automatically release it to production. Given our previous pipeline example, this might be delivering the new pack of Jaffa Cakes to the living room as soon as you return from the corner shop. |
| Continuous Delivery (CD) | Having a means of automatically releasing a release candidate, but triggering this manually and/or placing the deployment behind a gate. | Have an automated release pipeline, but require this to be manually kicked off and place the actual deployment behind an approval step. Given our previous pipeline example, this might be moving the biscuits from the kitchen to the living room only when the biscuit tin is empty. Or getting halfway out of the kitchen, giving some biscuits to the dog (for approval), then moving them to the living room. |
The problem
CI/CD platforms are often created in a haphazard, casual, organic manner. They are particularly vulnerable to the Big Ball of Mud pattern, due to:
CI/CD Platforms springing up organically, with developers self-serving their needs; fine at a smaller scale, right up until the business suddenly needs to scale rapidly.
CI/CD Platforms being created casually and haphazardly, with a system structure that emerges not from foresight but from expedience, as piecemeal needs are fed to specialist teams without an architecture first being put in place to direct such efforts.
Both of these issues are, actually, perfectly acceptable ways of working at smaller scale. Essential complexity is a key factor to bear in mind; simple problems require simple solutions.
A lean startup with 5 developers and barely enough budget to scrape by? Organic CI/CD growth for fulfilment of needs makes perfect sense.
An organic CI/CD platform that works for the most part, but occasionally requires your DevOps specialist to step in and hand-craft problematic pipelines? If it’s fulfilling your business needs, then it’s probably OK.
But if your organisation has millions of lines of code, 100s of repositories, and plans to grow the platform to service CI/CD needs at an enterprise scale? Then the ways of working above become a hindrance, and we need a different approach.
The impact
These haphazard ways of working manifest in a myriad of unpleasant ways when you attempt to transition to an Enterprise scale:
We lose traceability in the form of who is doing what in our CI/CD Platform, why they are doing it, and how it is being done.
We lose flexibility in the platform to adapt to changing needs and technologies. As the underlying glue for a mature software development lifecycle (SDLC), CI/CD needs to react to the needs of the business, otherwise it becomes a bottleneck for all other SDLC activities.
Rather than utilising the specialist skillset that our DevOps and Cloud Teams have, these valuable resources end up hand-crafting pipelines across the business. This has a finite limit to scalability, and doesn’t allow them to fully utilise their specialist knowledge to provision the most computationally efficient CI/CD platform for developers (including themselves) to utilise.
A path forward
Dealing with ambiguity
The first unpleasant manifestation of these issues at scale is that of traceability, and actually has a relatively boring - yet effective - fix.
Naming conventions.
CI/CD Platforms are, ultimately, structured around organisation of pipelines. So the first step in planning CI/CD at scale is figuring out how to structure our pipelines, and have this strictly enforced across a business. Questions of what is being deployed, by whom, and to what environment can be simply solved with a robust naming convention.
An example with Azure DevOps Pipelines
Below, we can see an example of how one may structure their pipelines in Azure DevOps Pipelines:
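As an illustrative sketch - the project, repository, and environment names below are entirely hypothetical - such a structure might look like:

```text
data-platform/                                <- Azure DevOps project
├── ingestion-service/                        <- repository
│   ├── ci/
│   │   └── ingestion-service-pull-request
│   └── cd/
│       ├── aws-data-platform-test-123456789012/
│       │   └── ingestion-service-deploy
│       └── aws-data-platform-prod-210987654321/
│           └── ingestion-service-deploy
└── reporting-service/
    ├── ci/
    └── cd/
```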
Let’s break this down:
At the top-level we organise by project, allowing separation of concerns by logical organisational unit. This is particularly important for resourcing and permission concerns; if organisational unit A should have access to repositories B and C, then this should be represented in our pipeline structure to closely map the platform to the organisational context.
Then, underneath, we organise by repository… logically grouping all pipelines with the repository within which they reside. Providing your repositories have clear separation of concerns, this then allows you to have clarity at all times about what a pipeline is interacting with.
In one of my talks, I was asked “But Ben, what about mono-repos?”.
If a repository is responsible for more than one domain of interest, then figure out the logical separation and further split it down.
For example, if repository A creates apps B and C then the structure mutates only ever so slightly:
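Sticking with the same hypothetical naming convention, the mono-repo split simply adds one extra layer beneath the repository:

```text
data-platform/
└── repository-a/
    ├── app-b/
    │   ├── ci/
    │   └── cd/
    └── app-c/
        ├── ci/
        └── cd/
```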
Under here we further split pipelines into categories of Continuous Integration and Deployment. This allows us to understand, immediately, what a pipeline does at a high-level.
Finally, for deployments we further split by environment, allowing us to know - at all times - what environment our pipeline is deploying to.
A note on magic numbers, and their application to environments.
Often, you may name your environments “prod”, “test” or “dev”. If you’re a small organisation, this is fine. But if you have multiple different locations you could be deploying to, then naming should be as explicit as possible.
“dev” means nothing if you have 50 AWS accounts you could be deploying to, all of which - within their domain - are considered “dev”. Structuring the pipelines like this is bad; it requires tribal knowledge of the organisation, and its various projects, to understand where things are actually deploying.
Name your environments explicitly. If the physical data-centre you are deploying to is in location A, name the environment in your pipeline structure location A. If it’s an AWS account, name it explicitly after that account rather than just calling it “dev”.
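As a concrete - and again hypothetical - contrast between the two approaches:

```text
Vague:     data-platform/ingestion-service/cd/dev/
Explicit:  data-platform/ingestion-service/cd/aws-data-platform-dev-123456789012/
```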
If you can’t sit down a new hire on day one, show them your pipeline structure and have them immediately understand where things are deploying to… you’re doing it wrong.
In this manner, at all times we may know:
What application a pipeline relates to
If it is related to a CI or CD use-case
If it’s CD, which environment it is deploying to
What repository to look at in order to find the underlying pipeline code in case you need to add features, fix bugs, or otherwise poke around
Assuming you have good auditing in place, it should then be a simple case of navigating to the executed pipeline to figure out:
Who ran it, by simply looking at the logs
Why it was run, assuming a pipeline execution is traceable back to an approved change request, or some other form of authorisation and/or documentation required within your organisation for releases to occur
Furthermore, we may replicate this structure within the repository itself; allowing efficient organisation of pipeline code at-rest.
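For example - the folder layout here is just one possible convention, not a mandated one - the pipeline code at-rest might mirror the platform structure:

```text
ingestion-service/              <- repository root
└── pipelines/
    ├── ci/
    │   └── pull-request.yaml
    └── cd/
        ├── aws-data-platform-test-123456789012.yaml
        └── aws-data-platform-prod-210987654321.yaml
```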
A note on other CI/CD Platforms
The example above is given for Azure DevOps, but its principles are applicable to other platforms as well.
The key is to find a naming convention that works for your organisation, and stick to it. The above is a convention I’ve used, and seen work, in large scale enterprises. However:
CircleCI allows organisation by project, and thus the above may work.
GitLab CI allows organisation of pipelines on a per repository basis.
The only major platform (read: not Jenkins) out there where this approach is not immediately applicable (as of 26th March 2025) is GitHub Actions which - honestly - I’m pretty shocked by. Even more baffling is the root cause of it, which we may trace to this issue which has been open since 2022: https://github.com/orgs/community/discussions/15935.
Yup, you read that right. GitHub Actions is the only major CI/CD provider I know out there that needs all pipeline files to:
Be in a specific folder
Be in a flat file structure
It’s weird, it’s odd, it’s baffling. It is what it is.
You can get around it by naming workflows given the pattern above, for example:
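For instance, hypothetical workflow file names encoding project, repository, category, and environment might look like:

```text
.github/workflows/
├── data-platform_ingestion-service_ci_pull-request.yaml
├── data-platform_ingestion-service_cd_aws-data-platform-test.yaml
└── data-platform_ingestion-service_cd_aws-data-platform-prod.yaml
```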
But it’s pretty clunky and not the best user experience.
Arbitrary task management
If our pipelines are just collections of arbitrary code to achieve a business goal, then the next question we must answer is how to effectively organise such code so we can scale.
Pull model
Within a pull model, we define a core library of - semantically versioned - tasks, which are effectively reusable components within our CI/CD Infrastructure. Upstream repositories then simply piece together the reusable tasks to fulfil their needs.
Creation of tasks within such a model usually follows a methodology attributed to Kent Beck (1997): “make it work, make it right, make it fast”.
[Beck 1997] Kent Beck, Smalltalk Best Practice Patterns, Prentice Hall, Upper Saddle River, NJ, 1997.
Perhaps a real-world example would illuminate this more clearly. Given the problem of “developer A wants to run a docker build in their pipeline” this pattern may be applied as follows:
First, simply get a docker build working in the local pipeline definition within the corresponding repository that the developer is working in. (make it work)
Once this is working, create a generalised version within the ci-cd-components library for your organisation. Furthermore, if you wish for certain standards to be applied - such as security scanning, static code analysis etc - implement these in the reusable component. (make it right)
Finally, after a few weeks of usage evaluate the computational efficiency and - using semantic versioning to communicate interface changes - optimise the task. (make it fast)
This is a process which I followed at a client of mine:
First, I set up a basic docker build pipeline within the source repository. It built the artefact, it pushed it to the internal artefact repository, it unblocked the developer.
Second, I generalised the pipeline and created it as a singular task in the core component library. Given the developer was unblocked, thanks to the “make it work” step being carried out, I then had a good long think about what I wanted as standard. I set up docker build caching, implemented security scanning with trivy, defined an interface and called it a day.
Third, after a couple of weeks of usage I identified that the CLI tool I was using to do the builds was ineffective. I gutted the entire tool from the task, replaced it with buildah, generally optimised the tasks and sped up the build time by 95%. The developers using the task simply bumped to a new patch version of the core CI/CD component library, and their builds were sped up by 95% without any other effort needed on their part.
The worked example highlights the benefits of such a model:
Developers use off-the-shelf components to fulfil their needs.
Such components are centralised, allowing our specialist teams to optimise the computation of such components and enforce standardisation of security tooling, tool chains, processes etc.
Centralised components are semantically versioned like any other software, with a well-defined interface and contextual dependencies.
Centralised components are released with release notes, and core user documentation, making it a well-known, well-documented, mature ecosystem for developer usage.
Developers benefit from improvements to core components by simply bumping the semantic version; any changes needed to their pipelines are indicated by the release notes for the version(s) they are bumping to.
This is a pattern which is applicable across many platforms. For example, in Azure DevOps Pipelines we declare a resource in our pipeline which we may then later reference:
```yaml
- template: <task>.yaml@library
  parameters:
    key: value
```
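The library reference above comes from a repository resource declared at the top of the pipeline; a minimal sketch - the project/repository name and tag here are assumptions - might be:

```yaml
resources:
  repositories:
    - repository: library                  # alias referenced via @library above
      type: git
      name: my-project/ci-cd-components    # assumed Azure DevOps project/repository name
      ref: refs/tags/1.2.0                 # pin to a semantic version of the component library
```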
Docs for Azure DevOps Pipelines resources are available here
Whilst for GitHub Actions we use composite actions with a full reference to the remote repository on a per-task basis:
```yaml
- uses: <task>@<version>
  with:
    key: value
```
Docs for GitHub Actions composite actions are available here
Templating entire workflows
The pull model is also applicable to entire workflows but such re-usable, end-to-end, workflows should sit separately to the reusable component library.
For example, say you’re running an extract-transform-load (ETL) project split across a usual development/test/production environment separation. Occasionally, your developers want an anonymised snapshot of raw production data in test in order to carry out testing of the ETL processes before deploying to production. In this instance, you might want a full end-to-end pipeline to:
Discover KMS keys on source and destination S3 buckets.
Mutate policies to allow a cross-account copy.
Use the AWS CLI to perform a cross-account sync from a given source bucket and path in production, to a given destination and path in test.
Restore KMS key policies.
Given this need has arisen in one ETL project, odds are it’ll be a requirement in others. In this instance, you can (see the sketch after this list):
Setup an initial pipeline in a source repository to perform the full process between a singular source and destination bucket. (make it work)
Generalise this in a ci-cd-workflow repository, and semantically version it. (make it right)
Use the templated end-to-end workflow to quickly setup automated processes to take snapshots of data from production to test on a regular basis at pre-determined paths to assist the SDLC (make it fast)
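In GitHub Actions terms, consuming such a templated end-to-end workflow might look something like the following sketch - the repository, workflow file, and input names are all hypothetical:

```yaml
# .github/workflows/refresh-test-data.yaml in the consuming ETL repository
name: Refresh anonymised test data
on:
  schedule:
    - cron: "0 2 * * 1"   # every Monday at 02:00
  workflow_dispatch: {}   # allow manual runs too

jobs:
  snapshot:
    # Reusable, semantically versioned workflow from the ci-cd-workflow repository
    uses: my-org/ci-cd-workflows/.github/workflows/s3-cross-account-snapshot.yaml@1.0.0
    with:
      source_bucket: prod-raw-data        # hypothetical bucket names and paths
      source_path: landing/
      destination_bucket: test-raw-data
      destination_path: landing/
    secrets: inherit
```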
The key thing to note is that automated workflows and reusable components are not mutually exclusive. One, in fact, supports the other; you can cobble together reusable components to automate entire workflows, semantically version these end-to-end workflows, and use this across the business to rapidly fulfil developer needs.
Push model
Within the push model, we flip the pull model on its head. Rather than provisioning a platform of re-usable components, we instead centralise the management of all pipelines within the organisation.
In my experience, this is only performed if you have a really mature source control setup. It requires automated processes, and declarative code, to centrally manage all of your source control repositories in order to propagate pipelines out from a centralised source.
For example, we might use tagging to determine what technologies are utilised for a given repository. Given these tags we can bootstrap the repository with standardised pull request and deployment workflows, and have bootstrap steps in them read in data at runtime to mutate the workflows’ behaviour via the usage of dynamic matrices. A really basic example within GitHub Actions might be as follows:
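A minimal sketch of the idea - the config file name and the build entrypoint are assumptions, not a prescribed standard:

```yaml
name: Standardised pull request workflow   # pushed centrally to each repository
on:
  pull_request:

jobs:
  discover:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.read-config.outputs.matrix }}
    steps:
      - uses: actions/checkout@v4
      # Bootstrap step: read repository-specific data at runtime
      - id: read-config
        run: echo "matrix=$(jq -c '.components' .ci-config.json)" >> "$GITHUB_OUTPUT"

  build:
    needs: discover
    runs-on: ubuntu-latest
    strategy:
      # Dynamic matrix: the centrally-managed workflow mutates to the repository's setup
      matrix:
        component: ${{ fromJSON(needs.discover.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v4
      - run: make build COMPONENT=${{ matrix.component }}   # hypothetical build entrypoint
```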
This way, we can centrally control the workflows which are applied to our repositories but still allow the workflows to mutate to the specific repository setup.
However, it should be noted that - in my experience - the push model is not as prominent as the pull model. That said, they are not mutually exclusive; workflows centrally pushed can still use reusable components in your core ci-cd-component library.
CI/CD Tooling
Regardless of if you’re using a pull or push model, your chosen CI/CD platform will have gaps that off-the-shelf components cannot fulfil. For example, a quick scan of the task library for Azure DevOps Pipelines or the in-built tasks for GitHub Actions shows there are no in-built tasks to add a comment on a BitBucket pull request to push a docker image to an internal Nexus artefact repository.
A more pertinent example might be if you’re hosting an internal REST API and you want to build up a set of reusable components that interact with it to trigger tasks, monitor tasks, carry out healthchecks etc.
The go-to for these tasks - in my experience - is usually bash. One of my clients had 8000 lines of bash constituting the core logic of their CI/CD platform. This is horrible. It’s:
Not extensible.
Untested.
Difficult to read.
However, just saying it’s horrible doesn’t cut it. We still have a requirement to plug gaps in our CI/CD platform, but we need to do so in a way which is:
Extensible
Tested
Easy to read
Platform independent
For this, we can create our own internal CLI tool:
Creation of a CLI tool is a complex process, and would definitely bloat out this blog post to gigantic proportions.
My experience is with Python, and so my recommendations and examples will all be Python based such as with sudoblark.python.github-cli.
However, the principle is language-agnostic as the idea is pretty simple. Plugging gaps in the CI/CD Platform via a tested, well-documented, internal CLI tool is a platform-agnostic, robust, approach to cover those edge-cases where off-the-shelf components for your given CI/CD Platform don’t quite cut it. Tasks in your ci-cd-component library may then wrap around this CLI tool to provide a friendly user-interface for performing tasks.
Developers have no knowledge, or visibility, that this tool exists. They simply use the reusable ci-cd components to fulfil their needs, whilst you’re safe in the knowledge that these tasks are using a well-written, tested, CLI tool to perform the functionality under the hood.
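As a sketch of what that wrapping might look like as a composite action - the CLI name, sub-command, package index, and inputs are all hypothetical:

```yaml
name: "comment-on-pull-request"
description: "Reusable component: add a comment to a pull request via the internal CLI tool."
inputs:
  pull_request_id:
    description: "Identifier of the pull request to comment on."
    required: true
  comment:
    description: "Body of the comment to add."
    required: true
runs:
  using: "composite"
  steps:
    # Install a pinned version of the hypothetical internal CLI from an internal package index
    - run: pip3 install internal-ci-cli==1.4.2 --index-url https://nexus.example.com/simple
      shell: bash
    # The component is just a friendly interface; the tested CLI does the real work under the hood
    - run: internal-ci-cli pull-request comment --id "${{ inputs.pull_request_id }}" --body "${{ inputs.comment }}"
      shell: bash
```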
Core component library documentation
Having a core CI/CD component library is all well and good, but how do our users discover information about it? How do they know what version(s) are available, what tasks are available within a given version, and what the changes are between versions?
These are questions which need to be answered in order to effectively scale a self-service platform.
One way in which we may do this is via the combination of mkdocs and mike.
sudoblark.github-actions.library contains a full end-to-end example of how to get mkdocs setup, and working, in order to document a core component library. Its web interface is also available here:
We shall break down how this works below.
mkdocs setup
The tutorial for mkdocs is pretty extensive, so I’m not going to replicate it here. The setup for this repository is contained in its mkdocs.yml file, and is pretty basic. It just maps friendly names to the underlying markdown files for navigation, enables the search plugin to improve user experience, and sets up versioning with mike to give us the version dropdown:
```yaml
site_name: sudoblark.github-actions.library
repo_url: https://github.com/sudoblark/sudoblark.github-actions.library
repo_name: sudoblark/sudoblark.github-actions.library
site_url: https://sudoblark.github.io/sudoblark.github-actions.library/
site_author: Benjamin Clark
site_description: >-
  Library of [semantically versioned](https://semver.org) actions, intended to be used as
  off-the-shelf, standardised, building blocks for CI/CD processes involving GitHub Actions.
copyright: BSD-3
site_dir: site
docs_dir: docs
theme:
  name: material
  icon:
    repo: fontawesome/brands/git-alt
extra:
  version:
    provider: mike
nav:
  - "Home": 'index.md'
  - "Getting Started":
      - "What are composite actions": "getting-started/what-are-composite-actions.md"
      - "How to utilise": "getting-started/how-to-utilise.md"
  - "Release Notes":
      - "1.0.0": "release-notes/1.0.0.md"
  - "Composite Actions":
      - "Terraform":
          - "plan": "composite-actions/terraform/plan.md"
          - "apply": "composite-actions/terraform/apply.md"
plugins:
  - search
```
mike setup
As the component library is hosted on GitHub, we simply use mike as-is to deploy changes to a gh-pages branch for our repository:
Everything about the setup is standard, and simply follows the mike setup instructions in the docs.
Continuous Delivery of documentation
We utilise a continuous delivery workflow to automatically create, and publish, documentation when a new release is created in GitHub. The workflow is contained in .github/workflows/release.yaml and is pretty basic (a sketch of it follows the list below):
Triggering the workflow when we perform a release in GitHub.
Checkout the repository with the appropriate depth such that we can make changes to the gh-pages branch
Grab the release version
Build and publish docs at the new version using mike
Ensure the latest version is available under the alias latest, and set this as the default for the site
That last step is important; it means that users can always access the site via the /latest alias, and from there navigate to the version(s) of the documentation they require.
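Pulling those steps together, a minimal sketch of such a release workflow - the exact action versions and git config details here are assumptions, and the real release.yaml may differ - might be:

```yaml
name: Publish documentation
on:
  release:
    types: [published]   # trigger when we perform a release in GitHub

jobs:
  docs:
    runs-on: ubuntu-latest
    permissions:
      contents: write    # mike pushes to the gh-pages branch
    steps:
      # Full history so mike can commit to gh-pages
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip3 install mkdocs-material mike
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
      # Grab the release version, build and publish docs at that version,
      # then alias it as (and default the site to) latest
      - run: |
          VERSION="${{ github.event.release.tag_name }}"
          mike deploy --push --update-aliases "$VERSION" latest
          mike set-default --push latest
```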
In particular, pay attention to the deploy workflow. Said workflow is an example of continuous delivery; we have a pipeline to automatically deploy our code, but we gate it behind an approval step to make sure a human approves first:
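A sketch of what such a gated deploy workflow can look like when it leans on the component library - the action paths, version tag, environment name, and the apply task’s inputs are assumptions, not the repository’s exact contents:

```yaml
name: Deploy infrastructure
on:
  workflow_dispatch: {}   # manually kicked off

jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Reusable plan task from the component library (path within the repo is assumed)
      - uses: sudoblark/sudoblark.github-actions.library/terraform/plan@1.0.0
        with:
          terraform_version: "1.5.7"
          working_directory: infrastructure
          artefact_prefix: example

  apply:
    needs: plan
    runs-on: ubuntu-latest
    # The approval gate: a protected environment requires a human to approve first
    environment: production
    steps:
      - uses: actions/checkout@v4
      # Reusable apply task from the component library (inputs assumed to mirror plan's)
      - uses: sudoblark/sudoblark.github-actions.library/terraform/apply@1.0.0
        with:
          terraform_version: "1.5.7"
          working_directory: infrastructure
```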
You can see here the simplicity that a pull library offers. Both our plan, and apply, tasks are just utilising reusable components of our ci-cd component library.
You can also see the advantage of standardisation when using centralised CI/CD components. The above is just a singular task invocation, but if we look at the definition of the plan task:
```yaml
name: "sudoblark.github-actions.library/terraform/plan"
description: >-
  Run quality checks against terraform, in addition to outputting a plan, with results outputted to a
  pipeline artefact ZIP file with the name {{ inputs.prefix }}-terraform-artefact, contents of which are
  as follows:
  - terraform.plan : Binary terraform plan
  - terraform.validate : Results of terraform validation
  - terraform.show : Terraform plan in human-readable format
  - terraform.json : Terraform plan in JSON format, required for some downstream CLI tooling
  - terraform.format : List of files which have failed terraform format checks, else an empty file
  - checkov.xml : JUnit output of Checkov results, can be used to upload test results downstream
author: "sudoblark"
inputs:
  terraform_version:
    description: "Semantic version of Terraform to utilise for the task."
    type: string
    required: true
  working_directory:
    description: "The working directory to utilise when performing the task."
    type: string
    required: true
  artefact_prefix:
    description: "Prefix to append to terraform-artefact produced by the task."
    type: string
    required: true
  aws_region:
    description: "AWS_DEFAULT_REGION value, required if the hashicorp/aws provider is utilised."
    type: string
    default: ""
    required: false
  aws_access_key:
    description: "AWS_ACCESS_KEY_ID value, required if the hashicorp/aws provider is utilised."
    type: string
    default: ""
    required: false
  aws_secret_access_key:
    description: "AWS_SECRET_ACCESS_KEY value, required if the hashicorp/aws provider is utilised."
    type: string
    default: ""
    required: false
outputs:
  artefact_name:
    description: "Name of artefact ZIP file with outputted results."
    value: "{{ inputs.prefix }}-terraform-artefact"
runs:
  using: "composite"
  steps:
    - name: "Pull Hashicorp container"
      run: |
        docker pull hashicorp/terraform:${{ inputs.terraform_version }}
      shell: bash
    - name: "Terraform init"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          init
      shell: bash
    - name: "Terraform validate"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          validate -no-color > ${{ inputs.working_directory }}/terraform.validate
      shell: bash
    - name: "Terraform formatter check"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          fmt -check -no-color > ${{ inputs.working_directory }}/terraform.format
      shell: bash
    - name: "Terraform plan"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          plan -no-color -out=${{ inputs.working_directory }}/terraform.plan
      shell: bash
    - name: "Terraform show for humans"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          show -no-color terraform.plan > ${{ inputs.working_directory }}/terraform.show
      shell: bash
    - name: "Terraform show for tooling"
      run: |
        docker run \
          -v $GITHUB_WORKSPACE:$GITHUB_WORKSPACE \
          -w ${{ inputs.working_directory }} \
          -e AWS_ACCESS_KEY_ID=${{ inputs.aws_access_key }} \
          -e AWS_SECRET_ACCESS_KEY=${{ inputs.aws_secret_access_key }} \
          -e AWS_DEFAULT_REGION=${{ inputs.aws_region }} \
          hashicorp/terraform:${{ inputs.terraform_version }} \
          show -no-color -json terraform.plan > ${{ inputs.working_directory }}/terraform.json-raw
      shell: bash
    - name: "Format JSON output"
      run: |
        sed -i $'s/[^[:print:]\t]//g' ${{ inputs.working_directory }}/terraform.json-raw
        cat ${{ inputs.working_directory }}/terraform.json-raw | jq > ${{ inputs.working_directory }}/terraform.json
      shell: bash
    - name: "Run checkov"
      run: |
        python3 -m venv checkov_venv
        source checkov_venv/bin/activate
        pip3 install checkov
        checkov -f ${{ inputs.working_directory }}/terraform.json \
          --repo-root-for-plan-enrichment ${{ inputs.working_directory }} \
          --download-external-modules True \
          -o junitxml > ${{ inputs.working_directory }}/checkov.xml \
          --soft-fail
      shell: bash
    - name: "Package results"
      run: |
        cd ${{ inputs.working_directory }}
        tar -czvpf results.tar.gz terraform.plan terraform.validate terraform.show terraform.json terraform.format checkov.xml
        pwd
        ls -lr
        mv results.tar.gz $GITHUB_WORKSPACE
        cd $GITHUB_WORKSPACE
        pwd
        ls -lr
      shell: bash
    - uses: actions/upload-artifact@v4
      with:
        name: "${{ inputs.artefact_prefix }}-terraform-artefact"
        path: results.tar.gz
```
We are enforcing:
Usage of the hashicorp container for terraform commands, which allows us to follow the 3 Musketeers pattern: the same environment we use for CI/CD can also be used for deployment, testing, and local execution.
Running checkov against our plan, embedding security scanning directly into terraform plans no matter where they are run.
Standardised outputs.
That last one, standardised outputs, may not seem like much but it’s been the biggest value add I’ve seen when implemented in organisations. With the input/output (I/O) of our task known, we are free to build a sophisticated ecosystem of tasks for our usage.
Feed checkov results from terraform-plan into a centralised database with a dashboard? Sure, the results are standardised so just feed the artefact that one task produces into another.
Want to implement checkov, regula, terraform-compliance, terrascan, tfsec or another form of security software into the ecosystem? Sure, update the task library and force developers to use the latest version of the library. Developers not using the latest version? Scan your source code, find non-compliant repositories, and name and shame them.