# Bazel + Remote Caching (EngFlow)

> Bazel is Google's open-source build system, designed from the ground up for large monorepos and fast incremental builds.

Perspective: delivery
Source: https://visdom-maturity-matrix.virtuslab.com/guides/delivery/bazel-remote-caching-engflow

## What It Is

Bazel is Google's open-source build system, designed from the ground up for large monorepos and fast incremental builds. Unlike traditional build systems (Make, Gradle, webpack) that operate on files and directories, Bazel operates on hermetic, content-addressed build actions: every build step declares its exact inputs and outputs, and Bazel can determine with certainty whether a step needs to be re-executed or can be served from cache. This property - correctness by construction - is what makes Bazel's caching reliable where other systems are not.
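
To make "content-addressed build actions" concrete, here is a minimal sketch of how a cache key can be derived from a command line and the digests of its declared inputs. This is an illustration, not Bazel's actual implementation: real action keys also cover the environment, toolchain, and platform. The function names and file paths are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest: identical bytes always give the identical digest."""
    return hashlib.sha256(data).hexdigest()


def action_key(command: list[str], inputs: dict[str, bytes]) -> str:
    """Derive a cache key from the command line and the declared inputs only."""
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode() + b"\0")
    # Sort by path so the key does not depend on dict ordering.
    for path in sorted(inputs):
        h.update(path.encode() + b"\0" + digest(inputs[path]).encode() + b"\0")
    return h.hexdigest()


# Hypothetical compile step for a single TypeScript library.
compile_cmd = ["tsc", "--project", "lib/tsconfig.json"]
before = action_key(compile_cmd, {"lib/a.ts": b"export const a = 1;"})
same = action_key(compile_cmd, {"lib/a.ts": b"export const a = 1;"})
after = action_key(compile_cmd, {"lib/a.ts": b"export const a = 2;"})

print(before == same)   # True: unchanged inputs give the same key, so the output can be reused
print(before == after)  # False: an edited input gives a new key, so the action must run again
```

Because the key depends only on content, two machines that see the same inputs compute the same key without coordinating with each other.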

Remote caching extends Bazel's local cache to a shared, network-accessible store. When a developer or CI runner requests a build action that someone else has already computed (another developer, a previous CI run), Bazel fetches the pre-computed output from the remote cache instead of recomputing it. In a team of 20 developers working on the same monorepo, most build actions can then be served as cache hits from the network rather than computed locally. As an illustration, a CI run that takes 10 minutes from scratch can finish in around 90 seconds when most of its actions are cache hits.
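
The sharing behaviour described above can be sketched with an in-memory stand-in for the remote store. Everything here is hypothetical (the `RemoteCache` class, the fake "compile" step that upper-cases its input); a real remote cache speaks the Remote Execution API over the network.

```python
import hashlib


class RemoteCache:
    """In-memory stand-in for a shared, network-accessible cache."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute) -> bytes:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        output = compute()
        self._store[key] = output  # upload so other machines can reuse it
        return output


def build(cache: RemoteCache, sources: dict[str, bytes]) -> list[bytes]:
    """'Compile' each source file, keyed by its path and content."""
    outputs = []
    for path in sorted(sources):
        content = sources[path]
        key = hashlib.sha256(path.encode() + b"\0" + content).hexdigest()
        outputs.append(cache.get_or_compute(key, lambda c=content: c.upper()))
    return outputs


cache = RemoteCache()
sources = {"a.ts": b"let a = 1", "b.ts": b"let b = 2"}

build(cache, sources)            # developer laptop: 2 misses, results uploaded
build(cache, sources)            # CI on the same commit: 2 hits
sources["b.ts"] = b"let b = 3"   # next commit edits one file
build(cache, sources)            # CI again: a.ts hits, b.ts misses

print(cache.hits, cache.misses)  # 3 3
```

Only the edited file is recomputed on the last run, which is the effect the 10-minute versus 90-second comparison relies on.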

EngFlow is a commercial provider of managed Bazel remote caching and Remote Build Execution (RBE) infrastructure. EngFlow runs the caching and execution infrastructure so your team doesn't have to operate servers that implement Bazel's Remote Execution API (REAPI). Teams connect their Bazel builds to EngFlow's cluster using a few lines of configuration, and get: a high-performance remote cache, optional distributed build execution across EngFlow's machines, and build analytics that show which actions are slow and why.

The combination of Bazel + remote caching is the infrastructure pattern that enables L3-L4 CI times for large, complex codebases. Simple codebases can hit 5-minute CI with parallelization and basic caching. But for monorepos with millions of lines of code and complex dependency graphs, Bazel's content-addressed caching is the tool that makes sub-5-minute CI achievable. Organizations like Google, Stripe, Dropbox, and Twitter have published accounts of using Bazel-style build systems to keep CI fast as codebases grow.

## Why It Matters

- **Correct incremental builds by construction** - when every action is hermetic, Bazel's action model guarantees that cache hits are correct; you avoid the stale-artifact bugs that plague Makefiles and Gradle incremental builds
- **Shared cache across developers and CI** - a CI run following a developer's local build hits the cache for actions the developer already computed, collapsing "CI rebuilds everything from scratch" to "CI verifies the already-computed result"
- **CI time scales with the change, not the codebase** - without Bazel, CI time grows as the codebase grows; with Bazel remote caching, CI time grows only with the size of the changed sub-graph, not the total codebase
- **Remote Build Execution distributes compilation** - EngFlow's RBE can spread a 10-minute compilation across 50 machines; with perfect parallelism that is 600 s / 50 = 12 seconds, though real builds are bounded by the longest chain of dependent actions and by network and scheduling overhead; this is the mechanism that makes sub-minute CI achievable for large codebases
- **Build analytics expose bottlenecks** - EngFlow's analytics dashboard shows action-level timing, cache hit rates, and critical path analysis; optimization becomes data-driven rather than guesswork
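
The "changed sub-graph" in the list above is the set of targets that transitively depend on whatever was edited. A minimal sketch of that computation follows, assuming a hypothetical dependency graph with made-up target labels; Bazel derives the real graph from BUILD files.

```python
from collections import deque


def affected_targets(deps: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the edges: for each target, which targets depend on it?
    dependents: dict[str, set[str]] = {t: set() for t in deps}
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents.setdefault(dep, set()).add(target)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical monorepo: target -> direct dependencies.
deps = {
    "//app:web": ["//lib:ui", "//lib:api"],
    "//app:admin": ["//lib:api"],
    "//lib:ui": ["//lib:core"],
    "//lib:api": ["//lib:core"],
    "//lib:billing": [],
    "//lib:core": [],
}

print(sorted(affected_targets(deps, {"//lib:ui"})))
# ['//app:web', '//lib:ui']
```

Editing `//lib:ui` leaves four of the six targets untouched, so their actions can be served from cache. Editing `//lib:core` would invalidate everything except `//lib:billing`, which is why widely shared low-level targets dominate rebuild cost.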

## Getting Started

1. **Evaluate whether Bazel is right for your codebase** - Bazel delivers the most value for: large monorepos (500k+ lines of code), multi-language codebases, and teams that generate many CI runs per day. For smaller codebases or single-language repositories, the BUILD file maintenance overhead may not justify the caching benefits. Conduct an honest evaluation before committing.
2. **Start with a single BUILD file and measure** - Don't try to Bazel-ify the entire codebase at once. Start with a single library or service that's heavily tested and frequently changed. Write its BUILD file, connect to a remote cache, and measure the CI time improvement on that module. Use this as the proof of concept before proposing a broader migration.
3. **Set up EngFlow or an alternative remote cache** - Create an EngFlow account and follow their quickstart guide to connect your Bazel builds. Alternatives include BuildBuddy (open-source self-hosted option) and Google's own Remote Build Execution service. Add the `--remote_cache` flag, plus `--remote_executor` if you also want remote execution, to your `.bazelrc` or to the Bazel command line in CI. The connection is typically 10-15 lines of configuration.
4. **Configure Bazel for CI correctly** - CI builds need specific Bazel flags: `--remote_accept_cached=true` (read from the remote cache), `--remote_upload_local_results=true` (populate the cache from CI runs), and `--jobs=auto` (let Bazel size parallelism from the host's resources). The first two default to true in current Bazel releases, so setting them explicitly mainly documents intent and guards against overrides. Keep the CI-specific flags in a `.bazelrc` file so all CI runners use consistent settings.
5. **Measure cache hit rates and identify cold paths** - EngFlow's analytics dashboard shows cache hit rate per action type. Target > 80% cache hit rate on CI runs after the first run of a day. If hit rates are low, check: are BUILD files correctly declaring all inputs? Are CI runners using the same toolchains, flags, and platform settings as developers, since all of these feed into the cache key? Persistently low hit rates usually point to non-hermetic or inconsistently declared inputs.
6. **Plan the BUILD file maintenance process** - Bazel BUILD files need to be updated when dependencies change. Manual maintenance is error-prone; tools such as Gazelle (BUILD file generation for Go, with extensions for other languages), rules_js (derives npm dependency targets from a pnpm lockfile), and rules_jvm_external (resolves Maven dependencies for Java) derive Bazel targets from existing build manifests. Set up automated BUILD file generation early to prevent the maintenance burden from becoming a blocker.
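
As a rough sketch of what the configuration from steps 3 and 4 might look like, the helper below renders a small `.bazelrc`. The endpoint `grpcs://cache.example.com` is a placeholder rather than a real EngFlow address, the `ci` config name is an assumption, and authentication settings (which any real provider requires) are deliberately left out; follow the provider's quickstart for those.

```python
def render_bazelrc(cache_endpoint: str) -> str:
    """Render a .bazelrc with a shared cache line and a CI-only config section."""
    lines = [
        "# Shared by developers and CI so both read the same cache.",
        f"build --remote_cache={cache_endpoint}",
        "",
        "# Applied only when CI invokes: bazel build --config=ci //...",
        "build:ci --remote_accept_cached=true",
        "build:ci --remote_upload_local_results=true",
        "build:ci --jobs=auto",
    ]
    return "\n".join(lines) + "\n"


# cache.example.com is a placeholder endpoint (assumption, not from EngFlow docs).
print(render_bazelrc("grpcs://cache.example.com"), end="")
```

Keeping the `--remote_cache` line outside the `ci` section means developers and CI point at the same store by default, while the CI-only flags stay opt-in through `--config=ci`.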

> **Tip**: EngFlow's managed service removes the infrastructure operations burden that makes self-hosted Bazel remote caching painful. The cost of EngFlow is typically much less than the engineering time required to operate equivalent infrastructure yourself. Evaluate managed services before deciding to self-host.

## Common Pitfalls

**Underestimating the Bazel migration effort.** Migrating a large existing codebase to Bazel is a multi-month engineering project. BUILD files must be written or generated for every library and service. Hermeticity violations (implicit dependencies, ambient environment assumptions) must be found and fixed. Teams that estimate a Bazel migration as "a few weeks" consistently find it takes 3-6 months for non-trivial codebases. Plan accordingly.

**Using Bazel without hermetic builds.** Bazel's correctness guarantee depends on hermetic build actions - every input explicitly declared, every tool pinned to a specific version. Builds with hermeticity violations (reading files not declared as inputs, using tools from PATH rather than declared toolchains) have unreliable cache behavior: spurious cache misses and, worse, spurious cache hits that serve wrong results. Hermeticity analysis is the most important and most difficult part of a Bazel migration.
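
The failure mode is easy to reproduce in miniature. In this sketch (all names hypothetical, with `TOOL_VERSION` standing in for any ambient state such as a compiler found on PATH), the cache key covers declared inputs only, so an undeclared dependency produces a stale hit.

```python
import hashlib


def action_key(declared_inputs: dict[str, str]) -> str:
    """The key covers declared inputs only; anything else is invisible to the cache."""
    h = hashlib.sha256()
    for name in sorted(declared_inputs):
        h.update(f"{name}={declared_inputs[name]}\0".encode())
    return h.hexdigest()


def run_cached(cache: dict[str, str], declared: dict[str, str], ambient: dict[str, str]) -> str:
    """Run a 'build step' whose output secretly depends on an ambient value."""
    key = action_key(declared)
    if key not in cache:
        # Hermeticity violation: the step reads TOOL_VERSION without declaring it.
        cache[key] = f"{declared['src']} built with tool {ambient['TOOL_VERSION']}"
    return cache[key]


cache: dict[str, str] = {}
declared = {"src": "main.ts@v1"}

first = run_cached(cache, declared, {"TOOL_VERSION": "5.3"})
second = run_cached(cache, declared, {"TOOL_VERSION": "5.4"})  # a different machine

print(second)  # main.ts@v1 built with tool 5.3  (stale hit: the key never changed)

# Declaring the tool version as an input makes the key change with it.
fixed = run_cached(cache, {**declared, "tool": "5.4"}, {"TOOL_VERSION": "5.4"})
print(fixed)   # main.ts@v1 built with tool 5.4
```

The fix is the same in a real migration: move the ambient dependency into the declared inputs (for Bazel, a pinned toolchain) so that it participates in the key.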

**Adopting Bazel without a platform team.** Bazel is a platform product that requires ongoing engineering support: keeping BUILD files current as code evolves, updating rules as upstream Bazel rules_* libraries change, maintaining the remote cache configuration as CI infrastructure changes. Teams that adopt Bazel without a designated owner find that it degrades over time as BUILD files drift out of sync. Assign ownership before adopting.

**Expecting Bazel to solve existing test quality problems.** Bazel makes correct incremental builds possible, but it doesn't fix slow tests, flaky tests, or tests with insufficient coverage. Teams sometimes expect Bazel adoption to fix CI problems that are actually test quality problems. Diagnose the root cause of CI slowness before investing in Bazel - if tests are the bottleneck, Bazel's build-time speedups are less valuable than test suite optimization.

**Not configuring developer-to-CI cache sharing.** The highest-value scenario for remote caching is when CI hits the cache for actions already computed by developers. This requires developers and CI to use the same remote cache endpoint with compatible configuration. If developers run Bazel locally without `--remote_cache`, or with a different cache endpoint than CI, the developer-to-CI cache sharing doesn't work.
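
One lightweight guard is a check that the developer and CI configurations resolve to the same endpoint. The sketch below is deliberately simplistic and rests on assumptions: it only understands the `--remote_cache=VALUE` spelling, ignores `--config` sections and imported rc files, and uses a placeholder endpoint.

```python
from typing import Optional


def remote_cache_endpoint(bazelrc_text: str) -> Optional[str]:
    """Return the value of the last `--remote_cache=` flag in the text, if any."""
    endpoint = None
    for raw_line in bazelrc_text.splitlines():
        line = raw_line.split("#", 1)[0]  # drop trailing comments
        for token in line.split():
            if token.startswith("--remote_cache="):
                endpoint = token.split("=", 1)[1]  # later flags override earlier ones
    return endpoint


def shares_cache(dev_rc: str, ci_rc: str) -> bool:
    """True only if both configs set a remote cache and it is the same endpoint."""
    dev, ci = remote_cache_endpoint(dev_rc), remote_cache_endpoint(ci_rc)
    return dev is not None and dev == ci


# Placeholder endpoint; substitute the one from your provider.
dev_rc = "build --remote_cache=grpcs://cache.example.com\n"
ci_rc = "build --remote_cache=grpcs://cache.example.com\nbuild --jobs=auto\n"
local_only_rc = "build --jobs=auto\n"

print(shares_cache(dev_rc, ci_rc))         # True
print(shares_cache(local_only_rc, ci_rc))  # False
```

A matching endpoint is necessary but not sufficient: toolchains and flags must also agree, or the two sides will compute different keys and never see each other's results.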

## Bob - Head of Engineering

Bob's team runs a large TypeScript monorepo with 800k lines of code. They've implemented basic caching and parallelization, but CI is still at 8 minutes because the TypeScript compilation step takes 5 minutes even with incremental builds. A senior engineer has proposed adopting Bazel with EngFlow remote caching to get compilation to under 1 minute. Bob is interested but concerned about the migration effort.

Bob should fund a 2-sprint "Bazel proof of concept" with a single senior engineer (or Victor, the AI champion). The goal: get one service in the monorepo building with Bazel and remote caching, measure the compilation time improvement, and estimate the effort to migrate the rest. At the end of 2 sprints, Bob will have: actual performance data for the remote caching benefit, a realistic effort estimate for the full migration, and a recommendation on whether to proceed. This is a much better decision basis than theoretical estimates. If the proof of concept shows 5x compilation speedup, the migration effort is justified. If it shows 2x speedup with 6 months of migration effort, it may not be.

## Sarah - Productivity Lead

Sarah's CI feedback latency data shows that the TypeScript compilation step is 62% of total CI time. Caching and parallelization haven't helped this step because it's a single large tsc invocation that can't be trivially parallelized. Sarah sees Bazel remote caching as the intervention that addresses the specific bottleneck the data identifies.

Sarah should frame the Bazel proposal in terms of the specific bottleneck it addresses: "compilation is 62% of CI time; Bazel remote caching would serve pre-compiled outputs from cache for unchanged modules, reducing this step from 5 minutes to an estimated 30 seconds on most runs." She should then estimate the CI time impact: 8-minute CI minus 4.5 minutes of compilation time savings equals approximately 3.5-minute CI. That's a 56% reduction in CI time from a single infrastructure investment. Sarah should present this as a testable hypothesis: implement Bazel remote caching on one module, measure the compilation time, and validate the estimate before committing to a full migration.
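
Sarah's estimate can be written out as a short calculation. The inputs are the figures quoted above (8-minute CI, 5-minute compilation, an estimated 30 seconds with a warm cache); the function name is hypothetical, and the 30-second figure is her hypothesis rather than a measurement.

```python
def projected_ci_minutes(total_min: float, step_min: float, step_after_min: float) -> float:
    """Subtract the time saved on one step from the total CI time."""
    return total_min - (step_min - step_after_min)


total, compile_now, compile_cached = 8.0, 5.0, 0.5  # minutes, from the text above

projected = projected_ci_minutes(total, compile_now, compile_cached)
reduction = (total - projected) / total

print(f"compilation share today: {compile_now / total:.1%}")  # 62.5%
print(f"projected CI time: {projected} min")                  # 3.5 min
print(f"reduction: {reduction:.0%}")                          # 56%
```

Rerunning the same calculation with the compilation time measured in the proof of concept turns the estimate into a validated number.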

## Victor - Staff Engineer - AI Champion

Victor has been following Bazel development for two years and has used it at a previous company. He knows it works but also knows the migration pitfalls. He's the right person to lead the proof of concept Bob described.

Victor should own the Bazel proof of concept with a specific scope: one service, two sprints, clear success criteria (compilation time < 60 seconds with warm cache, cache hit rate > 85% after day 1). He should document every step of the proof of concept - BUILD file authoring decisions, hermeticity violations found and fixed, EngFlow configuration - as the foundation of the migration playbook if the team decides to proceed. Victor should also evaluate BuildBuddy as an alternative to EngFlow (open-source, self-hostable, lower recurring cost) and include a comparison in his recommendation. His technical credibility makes his recommendation on "Bazel yes/no, EngFlow vs. BuildBuddy" the deciding input for Bob's decision.
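
Victor's success criteria are simple enough to encode as a check that runs against each measured build. The thresholds come from the paragraph above; the `PocRun` type and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PocRun:
    """Measurements from one warm-cache build (field names are hypothetical)."""
    warm_compile_seconds: float
    cache_hits: int
    total_actions: int

    @property
    def hit_rate(self) -> float:
        return self.cache_hits / self.total_actions if self.total_actions else 0.0


def meets_criteria(run: PocRun, max_compile_seconds: float = 60.0,
                   min_hit_rate: float = 0.85) -> bool:
    """Compilation under 60 s with a warm cache and a hit rate above 85%."""
    return run.warm_compile_seconds < max_compile_seconds and run.hit_rate > min_hit_rate


# Illustrative numbers only, not real measurements.
good = PocRun(warm_compile_seconds=42.0, cache_hits=930, total_actions=1000)
low_hits = PocRun(warm_compile_seconds=42.0, cache_hits=800, total_actions=1000)

print(meets_criteria(good))      # True
print(meets_criteria(low_hits))  # False
```

Recording each run in this form also gives the migration playbook a before-and-after dataset instead of anecdotes.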

## Links

- [Bazel - Official documentation](https://bazel.build/docs)
- [EngFlow - Remote Build Execution](https://www.engflow.com/)
- [BuildBuddy - Open-source Bazel remote cache](https://www.buildbuddy.io/)
- [Gazelle - Automatic BUILD file generation for Go](https://github.com/bazelbuild/bazel-gazelle)
- [Software Engineering at Google - Chapter 18: Build Systems and Build Philosophy](https://abseil.io/resources/swe-book/html/ch18.html)
