Having a build cache solution is a powerful way to speed up builds, especially at scale. Bitrise Build Cache already accelerates builds across multiple ecosystems, but to get the most out of it we also need to optimize the build cache clients themselves and ensure stability across changing network environments. In this blog post, I’ll walk through the steps we took to improve stability and performance for Bitrise Build Cache customers.
We initially built the Build Cache backend for Bazel, which ships with built-in support for a remote cache and defines a well‑specified gRPC-based protocol for it. You can read more about the Bazel remote caching API here.
Later, we extended this backend to support Gradle via a custom plugin, and Xcode via the Bitrise Build Cache CLI. You can find Xcode specifics here.
Bazel’s protocol includes concepts such as the Action Cache, which are not required in all scenarios. For Gradle and Xcode we only need a key-value API and a content‑addressable store, so we ended up using a slightly different API, tailored for these use cases.
Client interface
During build system invocations (Gradle, Bazel, Xcode), our clients upload and download binary blobs to/from Bitrise Build Cache. These blobs are referenced by a hash:
- In Bazel and Gradle, the hash is typically provided by the build system.
- For Xcode, the hash may be provided by Xcode itself, but when we act as a generic content‑addressable store (CAS), we compute the hash from the blob content.
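To make the CAS case concrete, here is a minimal sketch of deriving a content‑addressed key by hashing the blob bytes. The choice of SHA‑256 and hex encoding is illustrative, not necessarily the exact digest or key format our clients use.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// blobKey derives a content-addressed key for a blob by hashing its bytes.
// SHA-256 and hex encoding are assumptions for illustration; the real client
// may use a different digest or key format.
func blobKey(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	f, err := os.Open("artifact.bin") // hypothetical blob on disk
	if err != nil {
		panic(err)
	}
	defer f.Close()

	key, err := blobKey(f)
	if err != nil {
		panic(err)
	}
	fmt.Println("cache key:", key)
}
```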
To support Gradle, we designed a dedicated gRPC API for storing and loading blobs. When we later implemented Xcode Compilation Caching, the Xcode proxy service re‑used this same API to access the cache. This way, we could consolidate behavior and optimizations across ecosystems.
A quick gRPC refresher
All of these build cache interactions happen over gRPC.
gRPC is a high‑performance, open‑source RPC framework that runs over HTTP/2 (on top of TCP). It uses Protocol Buffers (protobuf) by default to define service interfaces and message schemas, and it supports several communication patterns:
- Unary RPCs – a single request followed by a single response (for example, “get this blob by key”).
- Server‑streaming RPCs – a single request with a stream of responses.
- Client‑streaming RPCs – a stream of requests with a single response.
- Bidirectional streaming RPCs – both sides send streams of messages on the same connection.
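To make these patterns concrete, here is a rough sketch of how they surface in a Go client. The service, method, and message names below are hypothetical stand‑ins, not the actual Bitrise Build Cache API.

```go
package cacheclient

import "context"

// Hypothetical message types, for illustration only.
type GetBlobRequest struct{ Key string }
type GetBlobResponse struct{ Data []byte }
type Chunk struct{ Data []byte }
type WriteResult struct{ CommittedSize int64 }

// ReadStream and WriteStream stand in for the generated gRPC stream handles.
type ReadStream interface {
	Recv() (*Chunk, error) // returns io.EOF once the server closes the stream
}
type WriteStream interface {
	Send(*Chunk) error
	CloseAndRecv() (*WriteResult, error)
}
type BidiStream interface {
	Send(*Chunk) error
	Recv() (*Chunk, error)
}

// BlobService sketches how the four gRPC call patterns look from a Go client.
type BlobService interface {
	// Unary: one request, one response ("get this blob by key").
	GetBlob(ctx context.Context, req *GetBlobRequest) (*GetBlobResponse, error)
	// Server streaming: one request, a stream of response chunks.
	ReadBlob(ctx context.Context, req *GetBlobRequest) (ReadStream, error)
	// Client streaming: a stream of request chunks, one response.
	WriteBlob(ctx context.Context) (WriteStream, error)
	// Bidirectional streaming: both sides stream messages on one connection.
	Sync(ctx context.Context) (BidiStream, error)
}
```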
Bazel’s remote execution and remote cache APIs are defined as gRPC services, and our Gradle and Xcode integration layers also speak gRPC to the Bitrise Build Cache backend. This gives us consistent transport and makes it easier to roll out optimizations across all clients.
Connection initialization
When a client (Bazel, Gradle, or the Xcode proxy) opens a connection to the Build Cache API, it first has to handle the usual cross‑cutting concerns:
- Authentication – via JWT or access tokens.
- Metadata – we attach headers that describe the Bitrise build, such as the build slug, app slug, and an invocation ID.
- Correlation – the invocation ID lets us correlate cache usage with analytical data we collect elsewhere (more on that in a bit).
We send authentication information and build metadata in gRPC metadata headers. This metadata is used by our backend to authorize requests and to link cache access patterns to specific builds and invocations. You can read more about how we surface these analytics here.
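As a rough sketch, attaching this kind of metadata from a Go client might look like the following. The header keys, the token environment variable, and the invocation ID format are illustrative placeholders rather than the exact names our clients send.

```go
package cacheclient

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"os"

	"google.golang.org/grpc/metadata"
)

// withBuildMetadata returns a context that carries auth and build metadata
// as gRPC headers. The header keys and the token environment variable are
// hypothetical placeholders, not the exact names used by the real clients.
func withBuildMetadata(ctx context.Context) context.Context {
	return metadata.AppendToOutgoingContext(ctx,
		"authorization", "Bearer "+os.Getenv("BUILD_CACHE_TOKEN"), // JWT or access token (env var name assumed)
		"x-build-slug", os.Getenv("BITRISE_BUILD_SLUG"),           // which Bitrise build produced this traffic
		"x-app-slug", os.Getenv("BITRISE_APP_SLUG"),               // which app the build belongs to
		"x-invocation-id", newInvocationID(),                      // lets us correlate cache usage with analytics
	)
}

// newInvocationID generates a random ID for this build system invocation.
func newInvocationID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}
```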
Importantly, the cache backend itself is content‑agnostic: it stores opaque blobs keyed by hashes. It has no understanding of what those blobs represent (for example, compiled classes, Swift modules, intermediate artifacts). Any higher‑level information we expose, like “which Gradle task produced this output?” or “which Xcode target is using this cache entry?” comes from separate components:
- the Gradle analytics plugin,
- the Bitrise Build Cache CLI wrapper (including our Xcode xcodebuild wrapper), and
- the Build Event Service in the case of Bazel.
These components send analytical data to a different backend dedicated to observability, not to the cache backend.
The first RPC that every client makes is GetCapabilities. This is a common entrypoint that:
- checks whether Bitrise Build Cache is enabled for the current workspace,
- determines whether the cache is in read‑only mode for this build, and
- initializes the organization’s namespace on the backend if needed.
Without this step, the cache will not function correctly. GetCapabilities is also where we associate cache calls with information about the build environment, so that subsequent usage data can be interpreted correctly in analytics.
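A simplified version of that startup handshake could look like this. The request/response shapes and the client interface are stand‑ins for the real generated types, and the 10‑second per‑attempt timeout mirrors the retry policy described later in this post.

```go
package cacheclient

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical request/response shapes for the capabilities call.
type GetCapabilitiesRequest struct{}
type GetCapabilitiesResponse struct {
	CacheEnabled bool
	ReadOnly     bool
}

type CapabilitiesClient interface {
	GetCapabilities(ctx context.Context, req *GetCapabilitiesRequest) (*GetCapabilitiesResponse, error)
}

// initCache performs the GetCapabilities handshake before any blob traffic.
// If the call ultimately fails, caching is disabled for the build rather
// than failing the build itself.
func initCache(ctx context.Context, c CapabilitiesClient) (*GetCapabilitiesResponse, error) {
	// Short per-attempt timeout; retries are layered on top (see "Retries").
	attemptCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	caps, err := c.GetCapabilities(attemptCtx, &GetCapabilitiesRequest{})
	if err != nil {
		return nil, fmt.Errorf("GetCapabilities failed, disabling cache for this build: %w", err)
	}
	if !caps.CacheEnabled {
		return nil, fmt.Errorf("build cache is not enabled for this workspace")
	}
	return caps, nil
}
```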
Build context
Most of the time, clients connect to the Bitrise Build Cache from ephemeral build VMs on Bitrise. This is still our primary environment, but we now also support:
- Local environments (for example, Gradle running on a developer’s machine)
- Other CI providers
Any optimization we introduce must work across all of these contexts. The main constraints we design for are:
- Highly variable build durations
Build system invocations can range from just under 10 seconds with no cacheable tasks to more than an hour with tens of thousands of tasks. There is rarely a one‑size‑fits‑all solution.
- I/O bottlenecks and transient failures
On build VMs in particular, local disk I/O and network I/O can both become bottlenecks, leading to transient errors and latency spikes. For this reason, we must reliably retry reads and writes.
- Cache must never dominate build time
The whole point of a cache is to speed up the build. That means we need tight but adaptive timeouts and reasonable retry strategies. For instance, if the cache is slow or unreachable, builds should still complete, and we should fail fast enough that the cache overhead does not erase the performance benefits.
These constraints drove many of the client‑side optimizations we’ll look at in the upcoming sections.
Stability
Ensuring cache stability across a wide range of build sizes and network conditions comes down to a few core strategies:
- sensible timeouts,
- smart retries with backoff, and
- careful handling of partial transfers and validation.
Retries
To improve resilience without letting the cache dominate build time, we apply timeouts and retries with exponential backoff and jitter.
- Per‑blob timeouts
We use a timeout that scales with blob size, targeting about 1 second per 10 MB of data and clamped between roughly 20 seconds and 2 minutes for exceptionally large blobs. This keeps retries realistic even for large artifacts, while avoiding long blocking operations.
- Special handling for GetCapabilities
The GetCapabilities call is crucial for determining whether caching is enabled and how it should behave, but it doesn’t involve large payloads and is less affected by I/O saturation. For this reason:
- we use a short timeout of about 10 seconds per attempt,
- and retry it up to 10 times, since a temporary blip here would otherwise disable caching for the entire build.
- Retry only transient failures
We explicitly avoid retrying errors that are not transient, such as:
- Unauthenticated / PermissionDenied
- NotFound (for read operations)
Retrying these would only waste time and put extra load on the backend without any chance of success.
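Putting these rules together, a simplified version of the client’s retry policy might look like the sketch below. The backoff constants and attempt counts (apart from the documented GetCapabilities behavior) are illustrative, but the overall shape (size‑scaled timeouts, exponential backoff with jitter, and a skip list for non‑transient gRPC codes) matches what was described above.

```go
package cacheclient

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// timeoutFor scales the per-attempt timeout with blob size: about 1 second
// per 10 MB, clamped between 20 seconds and 2 minutes.
func timeoutFor(sizeBytes int64) time.Duration {
	d := time.Duration(sizeBytes/(10*1024*1024)) * time.Second
	if d < 20*time.Second {
		return 20 * time.Second
	}
	if d > 2*time.Minute {
		return 2 * time.Minute
	}
	return d
}

// retryable reports whether an error is worth retrying. Permission problems
// and missing blobs will not succeed on a later attempt.
func retryable(err error) bool {
	switch status.Code(err) {
	case codes.Unauthenticated, codes.PermissionDenied, codes.NotFound:
		return false
	default:
		return true
	}
}

// withRetries runs op with a per-attempt timeout and exponential backoff with
// jitter. maxAttempts and the base delay are illustrative values.
func withRetries(ctx context.Context, maxAttempts int, perAttempt time.Duration, op func(ctx context.Context) error) error {
	backoff := 500 * time.Millisecond
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		err = op(attemptCtx)
		cancel()
		if err == nil || !retryable(err) {
			return err
		}
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff))) // exponential backoff plus jitter
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(sleep):
		}
		backoff *= 2
	}
	return err
}
```

In this sketch, a blob download would run as withRetries(ctx, 3, timeoutFor(blobSize), downloadOp), while GetCapabilities would use up to 10 attempts with a fixed 10‑second per‑attempt timeout, as described above.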
Connections
By default, each client maintains a small pool of gRPC connections to the cache backend and uses them in a round‑robin fashion:
- This avoids the overhead of constantly setting up and tearing down connections.
- It also spreads load more evenly across backend pods.
However, we observed that long‑lived connections can occasionally get “stuck” in unhealthy states. To mitigate this:
- on certain error classes we establish a fresh connection for the retry instead of reusing the potentially bad one,
- while still keeping connection pooling as the default behavior when things are healthy.
This balances stability with connection overhead.
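Here is a minimal sketch of this pooling behavior, assuming the standard grpc-go client API; the pool size, the TLS setup, and the decision of which error classes trigger a refresh are simplified for illustration.

```go
package cacheclient

import (
	"sync"
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// connPool hands out a small, fixed set of gRPC connections round-robin and
// can replace a connection that looks unhealthy.
type connPool struct {
	target string
	mu     sync.Mutex
	conns  []*grpc.ClientConn
	next   atomic.Uint64
}

func newConnPool(target string, size int) (*connPool, error) {
	p := &connPool{target: target, conns: make([]*grpc.ClientConn, size)}
	for i := range p.conns {
		cc, err := grpc.NewClient(target,
			grpc.WithTransportCredentials(credentials.NewTLS(nil))) // TLS config elided
		if err != nil {
			return nil, err
		}
		p.conns[i] = cc
	}
	return p, nil
}

// pick returns the next connection in round-robin order.
func (p *connPool) pick() *grpc.ClientConn {
	i := int(p.next.Add(1) % uint64(len(p.conns)))
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.conns[i]
}

// refresh replaces a connection that produced a suspicious error class, so
// the retry does not reuse a potentially stuck connection.
func (p *connPool) refresh(bad *grpc.ClientConn) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, cc := range p.conns {
		if cc == bad {
			cc.Close()
			fresh, err := grpc.NewClient(p.target,
				grpc.WithTransportCredentials(credentials.NewTLS(nil)))
			if err != nil {
				return err
			}
			p.conns[i] = fresh
			return nil
		}
	}
	return nil
}
```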
Offsetting or resuming transfers
For very large blobs, restarting from byte zero on every retry is both slow and unlikely to succeed under degraded conditions. Instead, we resume from where we left off whenever possible.
- Reads
The server supports returning object data starting from a specified byte offset. If a read fails midway, the client:
- tracks how many bytes were successfully received,
- and restarts the read from that offset.
- Writes
Writes are more complex:
- the server might have already flushed part of its buffer to storage before a stream is interrupted,
- so simply retrying from offset 0 could duplicate data or corrupt the blob.
To avoid this, on each write attempt the client calls QueryWriteStatus to determine how many bytes have already been committed. It then resumes streaming from that offset instead of starting from scratch. This lets us safely resume large uploads instead of re‑uploading the entire blob.
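The resumption logic for both directions can be sketched roughly as follows. The reader and writer interfaces below are simplified stand‑ins that mirror the offset‑based reads and the QueryWriteStatus check described above, not our exact protocol.

```go
package cacheclient

import (
	"context"
	"io"
)

// Hypothetical handles mirroring the read/write behavior described above.
type blobReader interface {
	// ReadFrom opens a download of the blob starting at the given byte offset.
	ReadFrom(ctx context.Context, key string, offset int64) (io.ReadCloser, error)
}

type blobWriter interface {
	// QueryWriteStatus reports how many bytes the server has already committed.
	QueryWriteStatus(ctx context.Context, key string) (committed int64, err error)
	// WriteFrom streams the remainder of the blob starting at offset.
	WriteFrom(ctx context.Context, key string, offset int64, data io.Reader) error
}

// downloadWithResume restarts a failed read from the last received byte
// instead of from byte zero.
func downloadWithResume(ctx context.Context, r blobReader, key string, dst io.Writer, attempts int) error {
	var offset int64
	var lastErr error
	for i := 0; i < attempts; i++ {
		body, err := r.ReadFrom(ctx, key, offset)
		if err != nil {
			lastErr = err
			continue
		}
		n, err := io.Copy(dst, body)
		body.Close()
		offset += n // remember progress even if the copy failed midway
		if err == nil {
			return nil
		}
		lastErr = err
	}
	return lastErr
}

// uploadWithResume asks the server how much it has already committed before
// each attempt, so retries never rewrite bytes the server may have flushed.
func uploadWithResume(ctx context.Context, w blobWriter, key string, blob io.ReadSeeker, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		committed, err := w.QueryWriteStatus(ctx, key)
		if err != nil {
			lastErr = err
			continue
		}
		if _, err := blob.Seek(committed, io.SeekStart); err != nil {
			return err
		}
		if err := w.WriteFrom(ctx, key, committed, blob); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}
```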
Validation
Early backend iterations were experimental and occasionally had blob consistency issues. To guard against this and prevent silent corruption, we added multiple layers of validation.
- Hash validation
For write operations:
- the client sends the expected hash in the request metadata,
- the server recomputes the hash of the received content and compares it to the expected one.
If there is a mismatch, the server can either surface it as an error or only as a warning, depending on how the client is configured. In stricter configurations, mismatches fail the write; in more permissive setups, they are logged and surfaced but do not necessarily break the build.
- Size validation
During streaming uploads we also:
- validate the size of each chunk,
- and verify the total committed size once the upload completes.
These checks ensure that what ends up in the cache actually matches what the client intended to store, while allowing teams to choose how strictly they want mismatches to affect their builds.
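Conceptually, the server‑side check boils down to something like this sketch, assuming SHA‑256 and a buffered list of chunks; a real implementation would typically validate incrementally as chunks arrive rather than buffering them.

```go
package cacheserver

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// validateUpload recomputes the hash and total size of the received chunks
// and compares them to what the client declared. SHA-256 is an assumption
// for illustration. strict controls whether a mismatch fails the write or is
// only reported as a warning.
func validateUpload(chunks [][]byte, expectedHash string, expectedSize int64, strict bool) error {
	h := sha256.New()
	var total int64
	for _, c := range chunks {
		h.Write(c)
		total += int64(len(c))
	}
	actualHash := hex.EncodeToString(h.Sum(nil))

	if total != expectedSize || actualHash != expectedHash {
		err := fmt.Errorf("blob validation failed: size %d vs %d, hash %s vs %s",
			total, expectedSize, actualHash, expectedHash)
		if strict {
			return err // fail the write
		}
		fmt.Println("warning:", err) // surface the mismatch but do not break the build
	}
	return nil
}
```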
Debugging
To understand issues in the wild and quickly trace cache behavior for a particular build, we’ve invested in observability and diagnostics.
- Verbose client logging
- At debug level, we log all cache operations, including method, key/hash, and timing.
- We always log the blob hash, so we can follow the “life” of a blob across different operations and builds.
- Automatic escalation on errors
When we detect an error, we elevate logging around that operation to at least info level, capturing more details without requiring manual reconfiguration.
- In‑house Build Analytics integration
We’ve built an internal Build Analytics service that:
- scans build logs for specific cache‑related patterns and error messages,
- correlates them with build metadata and invocation IDs,
- and notifies us proactively when particular error signatures show up.
This allows us to spot emerging issues early and investigate them with enough context, often before they become visible to most users.
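As an illustration of the escalation pattern, a wrapper along these lines logs every operation with its hash and timing at debug level and automatically raises the level when the operation fails. It is sketched with Go’s standard log/slog package, not our actual logging setup.

```go
package cacheclient

import (
	"context"
	"log/slog"
	"time"
)

// logOp logs a cache operation at debug level and escalates automatically
// when the operation fails, so we capture details without reconfiguring the
// client. A sketch only; the real clients have their own logging layer.
func logOp(ctx context.Context, log *slog.Logger, method, blobHash string, op func(ctx context.Context) error) error {
	start := time.Now()
	err := op(ctx)
	attrs := []any{
		"method", method,
		"blob_hash", blobHash, // always log the hash so a blob's "life" can be followed
		"duration", time.Since(start),
	}
	if err != nil {
		log.Error("cache operation failed", append(attrs, "error", err)...)
		return err
	}
	log.Debug("cache operation completed", attrs...)
	return nil
}
```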
Conclusion
By treating the Bitrise Build Cache client as a first‑class component, and by carefully tuning retries and timeouts, pooling and refreshing connections, resuming partial transfers, and validating content, we’ve made cache access both faster and more resilient across CI, local, and hybrid environments.
These improvements don’t just reduce flakiness; they directly impact reliability at scale. Using the strategies described in this post, we’ve been able to raise our internal SLOs for both Bitrise Build Cache and Build Analytics to 99.99% availability.
That level of reliability means teams can confidently depend on Bitrise Build Cache as part of their core delivery pipeline, knowing that when things go wrong in the network or infrastructure, the cache will either keep working or get out of the way, without dragging the build down.
To learn more about Bitrise Build Cache and try it yourself, sign up for a trial here.
Explore our documentation and getting-started material to see Build Cache in action for yourself.

