Sustainable CI With Buildkite

Welcome again, Wayfair Engineering Enthusiasts, to our virtual fireside. Make yourself comfortable and settle in for the second installment of our Fireside Story series on Continuous Integration. If you haven’t had the pleasure, please read part one, a history of GitLab CI at Wayfair. This installment will feature GitLab CI’s successor, Buildkite.

You can’t out-git the Hub

As a refresher on the previous story, we had set up GitLab CI with the intention of making our build system more maintainable and stable. The implementation choices we made fell short of those expectations, and fixing the problem meant completely overhauling how our system was stood up.

The build system at the time was straining the average Wayfairian’s ability to check in code. Circa Cyber5 2018, it was a common occurrence for GitLab to go completely down for hours at a time, several days a week. This happened in large part because the incredible load from our CI system kept GitLab from keeping up with the ordinary traffic our developers generated.

GitLab in itself is a fantastic appliance. Yet years of pragmatic choices and technical debt left us with a poorly maintained instance, along with many bad practices and developer distrust. If we were able to make the switch to GitHub, we hoped to see better uptime than we were experiencing with GitLab. That would help us gain trust, and reduce our toil as a team.

That’s a dramatic simplification of the technical barriers that stood between us and the system we have today (i.e. one where inaccessible code repositories no longer cause frequent production-breaking incidents). I’ll abbreviate that story to point out the important change relevant to how we eventually got to GitHub and Buildkite: we needed to overhaul, or start fresh. In this instance, choosing between two version control systems, we were starting fresh with GitHub.

What wrongs might Buildkite right for less main deploy lane brain pain?

At this time, all PHP, JavaScript, Windows, Python, and some Java applications depended on the GitLab CI infrastructure. We used Jenkins and Octopus for other stages in various pipelines, but GitLab CI had cemented itself as a critical piece of how we would deploy and verify code. We hadn’t built that infrastructure for the scale we achieved over several years, nor had we maintained it as the developer needs of a growing organization changed. Our systems were also not created in a repeatable way (see: infrastructure-not-as-code). In the user experience realm, we had a plethora of issues with shared pipelines being changed constantly by application teams. As we considered these issues, we discovered the Australia-based company Buildkite.

Buildkite offered the ability to use their SaaS for the frontend of our build infrastructure, which removed the cases where our own infrastructure slowed down our ability to check builds and verify our code. Buildkite also offered something unique in the testing space: dynamic pipelines. To dramatically oversimplify, dynamic pipelines allow us to decide on the fly what we need to execute in a pipeline. Consequently, we don’t need literally hundreds or thousands (yes, actually) of lines in a .yml file to create our production build pipelines. We can programmatically upload individual steps of the build, test, and/or deploy automation when we know we need them.
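To make the idea of a dynamic pipeline concrete, here is a minimal sketch, not our actual generator. It assumes a small script that looks at the files changed in a commit and emits only the steps that change needs. The path prefixes, labels, and make targets are hypothetical; the only Buildkite-specific parts are the "steps", "label", and "command" keys of a pipeline definition.

```python
import json
import sys

# Hypothetical mapping from a path prefix to the step that covers it.
# The prefixes, labels, and commands are illustrative assumptions.
STEPS_BY_PREFIX = {
    "php/": {"label": "PHP unit tests", "command": "make test-php"},
    "js/": {"label": "JavaScript unit tests", "command": "make test-js"},
}


def build_pipeline(changed_files):
    """Return a pipeline dict containing only the steps this change needs."""
    steps = []
    for prefix, step in STEPS_BY_PREFIX.items():
        if any(path.startswith(prefix) for path in changed_files):
            steps.append(dict(step))  # copy so callers cannot mutate the table
    return {"steps": steps}


if __name__ == "__main__":
    # Changed file paths are passed as command-line arguments.
    print(json.dumps(build_pipeline(sys.argv[1:]), indent=2))
```

In a setup like this, the first step of a build would run the script and pipe its output to `buildkite-agent pipeline upload`, which accepts a pipeline definition as JSON or YAML. A change that only touches `js/` files would then never schedule the PHP step at all.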

We also found one thing particularly exciting: Buildkite has an open-source agent (a stand-in for what used to be our gitrunners), which we use to accept jobs from our Wayfair developers. We are able to make tweaks and bug fixes as needed for our supported use cases, and make Wayfair a bit more attractive with an open-source presence. In addition to modifications on the agent itself, we’ve been able to contribute to Buildkite Plugins: reusable functionality that we could create both for Wayfair-specific cases and to give back to the broader community on GitHub. We could also use hooks, small scripts embedded within agents that dramatically simplify the processes we put in place to keep infrastructure stable, repeatable, and reliable.

Most importantly, as we had learned with GitLab, we would need support from development teams and operations teams to achieve and maintain a stable infrastructure for our pipelines. We had dedicated resources (like myself!) investigating and taking ownership of what we could improve in the move from GitLab CI to Buildkite. We set off to get started with infra, laying the groundwork for how we would work with developer teams.

Refine and define headline and sideline pipeline guidelines and confines

GitLab CI agents, as we mentioned above and in the previous post, were not always put together in a replicable way. We didn’t have one place to look for all dependencies and expected usage of the agents. Intending to change that for Buildkite, we gathered as much knowledge as we could about our GitLab CI infrastructure. Our starting assumption was that we would definitely need coordination between ourselves and infra teams like Networking and Security. We wanted to make sure we understood the problem well enough that we wouldn’t repeat it and continue the cycle of new infra requirements. We determined several parameters that we expected to achieve:

We would need several queues to avoid putting “everything on every agent”
- PHP
- JavaScript
- Windows (dotnet)
- Default (no dependencies for pipeline upload steps)
- “experimental” queue for workflows we couldn’t predict yet (Docker became both the name and the tech we went with for experimenting, and it’s now the biggest queue!)
We would need a way to store artifacts that wasn’t GCP or S3
- Some of you may know that we now do depend on GCP for storing artifacts. At the time, we didn’t use GCP for Buildkite artifacts because of security concerns that have since been resolved.
We needed Windows agents
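
The queue split above maps onto Buildkite’s agent targeting, where a step names the queue it should run on. The sketch below is illustrative only: the queue names are assumptions modeled on the list above rather than our exact configuration, and `step_for` is a hypothetical helper.

```python
# Queue names are assumptions modeled on the list above.
KNOWN_QUEUES = {"default", "php", "javascript", "windows", "docker"}


def step_for(label, command, queue="default"):
    """Build a command step that only agents on the given queue will accept."""
    if queue not in KNOWN_QUEUES:
        raise ValueError(f"unknown queue: {queue!r}")
    return {
        "label": label,
        "command": command,
        # "agents" with a "queue" key is how a Buildkite step targets a queue.
        "agents": {"queue": queue},
    }


# Hypothetical steps: the upload step needs no language tooling, so it can
# stay on the default queue, while the test step goes to the PHP agents.
upload = step_for("Upload pipeline", "buildkite-agent pipeline upload")
php_tests = step_for("PHP unit tests", "make test-php", queue="php")
```

Keeping the dependency-free upload step on its own queue is what lets the language-specific agents carry only the tooling their jobs need.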

This was not an exhaustive list, but these were the most interesting starting points. We built open source, we sent out standards, and we worked with individual teams to ensure there was pipeline-specific ownership for all of the monorepos. By setting up pagers and on-call rotations, getting the infrastructure permissions we needed to push ourselves forward, writing WOC runbooks, and otherwise collaborating across the organization, we got the first version of the Buildkite infrastructure ready to take over from GitLab.

At this time, GitLab was taking (on a good day) at least half an hour to run all the tests needed to get into The Integrator. Most of the time it took over an hour, and you would have to run them multiple times. Worse, they would have to be run again once the change reached The Integrator. We decided that we needed a better system, and PHP platforms pulled together a way to use dynamic pipelines for running dynamic testing suites. That’s a fancy way of saying we broke the tests up into sequences of 10, so each piece of the suite would take 3-5 minutes, instead of running all 35 thousand tests consecutively every time.

We also noticed that many of the unit tests were reaching out to external services, which made the results unpredictable. That is not the purpose of a unit test, so we did a mass audit and skipped the offenders. We skipped roughly 20% of the tests, and then put the rest into a container that cannot reach the outside world, to ensure the problem couldn’t recur. We expected a significant improvement, and this went further: it practically eliminated the flakiness we had come to expect from our unit tests. Any test that expected to reach out would fail consistently, and wouldn’t make it into the master branch!
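
The post doesn’t say how the container was cut off from the network. One common way, shown here purely as an assumption, is Docker’s `--network none` option, which leaves the container with only a loopback interface. The image name and test command below are hypothetical.

```python
def isolated_test_command(image, test_command):
    """Build a `docker run` argument list with networking disabled.

    --network none gives the container only a loopback interface, so any
    test that tries to reach an external service fails instead of flaking.
    """
    return ["docker", "run", "--rm", "--network", "none", image, *test_command]


# Hypothetical image and command, for illustration only.
cmd = isolated_test_command("example/php-tests:latest", ["./run-tests.sh", "tests/unit"])
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. The useful property is that a test with a hidden network dependency now fails every time, not just on a bad day, which makes it easy to find and fix.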

All of this was far more achievable with a dependable infrastructure and a rotation to maintain it. As we continue to grow as an organization and in our CI strategy, we’re constantly refining what our Buildkite needs are and what we need to do to make sure we’re serving the organization.

C ing into the future

As one of the Buildkite builders, it’s awesome to see how far the platform has come. We continue improving the process of developing on the webstack, making it even easier for our developers. Many of the engineers at Wayfair depend on Buildkite, and one day we hope to make it so smooth that new engineers think of it as a detail. We’ll never be completely done, but it feels good to see so many of the painful parts of the application development process improve even since this time last year.

I’m glad you took the time out of your day to read through this history, and I appreciate the opportunity to share this chapter in Wayfair’s journey. Please reach out to me with any questions or corrections at gwhite@wayfair.com; we’re happy to hear from you.