A production-tested approach for moving a large-scale metrics pipeline from StatsD to OpenTelemetry and Prometheus.

By: Eugene Ma, Natasha Aleksandrova
When migrating to a new monitoring system, you’ll want to frontload the work to collect all your metrics. This exposes bottlenecks at full write scale and unblocks the migration of assets that require real data for validation, such as dashboards and alerts. Collecting all your metrics first means you can focus on the major technical challenges (scale, correctness, and performance) without worrying about how users will adopt your new tools.
But for our project, this approach wasn’t straightforward: most of our metrics were instrumented with StatsD libraries, OpenTelemetry was gaining traction, and our new storage system was based on Prometheus. We were left with a lot of open questions. Where do we fork the metrics? Should we adopt OpenTelemetry? Do our metrics work well with Prometheus? The task of collecting metrics required us to answer these questions and rethink our metrics infrastructure.
Our services originally used the StatsD protocol, with a Veneur sidecar collecting and forwarding metrics to our vendor.

For the migration away from this architecture, we defined a clear set of supported and preferred instrumentation protocols. OpenTelemetry Protocol (OTLP) is now recommended for internal services, while Prometheus is preferred for OSS workloads. StatsD (DogStatsD format) remains as a fallback for legacy paths or applications that are difficult to update or instrument.
This led us to adopt the vendor-neutral OpenTelemetry Collector (OTel Collector).
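For illustration, a trimmed OTel Collector configuration along these lines can accept both OTLP and StatsD traffic during a migration and forward everything to a Prometheus-compatible backend. The endpoints and option values below are placeholders, not our production configuration:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  statsd:
    endpoint: 0.0.0.0:8125
    aggregation_interval: 10s

exporters:
  prometheusremotewrite:
    endpoint: https://metrics-gateway.example.com/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp, statsd]
      exporters: [prometheusremotewrite]
```

Both receivers feed a single metrics pipeline, so dual-emitting services and legacy StatsD-only services land in the same backend.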
Roughly 40% of our services use a shared, platform-maintained metrics library, which made it the logical starting point for OTLP adoption. By enabling this library to dual-emit metrics (StatsD for the existing pipeline and OTLP for the new OTel Collector), we made broad migration progress with low friction.
The collection pipeline during migration:

Moving to OTLP offered significant benefits over remaining on StatsD.
While most services handled the dual-write OTLP + StatsD mode without issues, our highest-volume metric emitters (many of which are also among our most critical services) experienced significant regressions in performance after enabling OTLP. In particular, these services began to suffer from memory pressure, increased garbage collection activity, and heap growth.
Our investigation traced the performance issues to the fact that those services emit metrics with very high volume and cardinality (on the order of 10K+ samples per second per instance). To mitigate this regression, we updated our common metrics library configuration to use delta temporality (AggregationTemporalitySelector.deltaPreferred()) for a select set of services. That change reduces the in-process memory burden, because delta temporality avoids retaining full state for every metric-label combination between exports.
Based on our observations, we found that switching to delta temporality was only necessary for services emitting extremely high metric volume, so we left the rest of the services using default cumulative mode. That said, with delta metrics, unexpected failures result in visible gaps in the data, a trade-off we accepted in exchange for reduced memory usage and improved stability. With cumulative metrics, those same failures typically appear as a larger jump between data points rather than a gap.
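The memory difference can be modeled with a rough Python sketch. Our services use the OpenTelemetry SDKs, so this is an illustration of the behavior rather than the actual exporter code: a cumulative exporter must retain a running total for every series it has ever seen, while a delta exporter can release its per-series state after each export.

```python
from collections import defaultdict

class CumulativeExporter:
    """Keeps a running total for every series ever seen; retained
    state grows with series cardinality for the process lifetime."""
    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, series, value):
        self.totals[series] += value

    def export(self):
        # Cumulative temporality: report running totals, keep all state.
        return dict(self.totals)

class DeltaExporter:
    """Only buffers increments since the last export, then frees the state."""
    def __init__(self):
        self.pending = defaultdict(int)

    def record(self, series, value):
        self.pending[series] += value

    def export(self):
        # Delta temporality: report increments, then release the state.
        out = dict(self.pending)
        self.pending.clear()
        return out
```

With high cardinality, the cumulative exporter's `totals` map never shrinks, which mirrors the heap growth we observed; the delta exporter's buffer empties at every export.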
With OTLP now serving as the primary data flow and the fallback StatsD path maintained only where necessary, we arrived at a consistent and future-ready collection architecture. The OTel Collector anchors the pipeline, feeding directly into our Prometheus-based storage backend, with the legacy path available for compatibility.
This groundwork was essential before tackling more complex challenges like aggregation.
The diagram illustrates this final design and the complete flow of metrics through the system.

To keep costs under control, our previous pipeline used an internal fork of Veneur to aggregate away instance-level labels (e.g. pod, hostname) instead of submitting the raw metrics directly to our vendor. In the Prometheus world, we needed a similar aggregation step.
We considered and discarded several alternatives before settling on vmagent from VictoriaMetrics.

Our architecture has two layers of vmagents: routers and aggregators. Routers are stateless and shard metrics by consistently hashing all labels except the ones you want to aggregate away (e.g. “pod”, “host”). Aggregators do the aggregation (i.e. summing over counters) and maintain the in-memory state of outputs (i.e. the current running total). On the command line, each router is passed a static list of aggregator hostnames (Kubernetes gives StatefulSet pods stable network identity). This simple approach avoids additional service discovery dependencies and keeps the metrics sharding consistent between pods.
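A simplified sketch of the router's sharding logic is below. It uses plain modulo hashing over a static aggregator list for brevity (vmagent's actual implementation differs in detail); the key property is that the hash covers every label except the ones being aggregated away, so all pods' samples for one logical series land on the same aggregator.

```python
import hashlib

# Instance-level labels to aggregate away, per our setup; names illustrative.
AGGREGATE_AWAY = {"pod", "host"}

def shard_key(name: str, labels: dict) -> str:
    """Build the shard key from every label EXCEPT the aggregated-away ones."""
    kept = sorted((k, v) for k, v in labels.items() if k not in AGGREGATE_AWAY)
    return name + "".join(f";{k}={v}" for k, v in kept)

def pick_aggregator(name: str, labels: dict, aggregators: list) -> str:
    """Deterministically route a series to one aggregator from a static list."""
    digest = hashlib.sha256(shard_key(name, labels).encode()).digest()
    return aggregators[int.from_bytes(digest[:8], "big") % len(aggregators)]
```

Because the aggregator list comes from StatefulSet pod names, every stateless router computes the same routing decision without a separate service-discovery dependency.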
Once we figured out how to scale vmagent, we made a few internal customizations to support our metrics environment (including native histogram support and Mimir-style multitenancy). More generic changes were contributed back upstream (5931, 5938, 5990).
We achieved our goal of aggregating metrics: our single production cluster scaled out to hundreds of aggregators, ingesting over 100 million samples per second.
In addition to reducing the overall cost by an order of magnitude, our new aggregator pipeline provides another benefit. Because it’s centralized and processes all of our metrics, it’s a convenient and powerful tool for other types of transformations. When users deploy a bad instrumentation change, we can drop the problematic metrics there. When users temporarily need unaggregated data, we can dual-emit the raw metrics on the fly.
After migrating to our new Prometheus-based metrics pipeline, we noticed our PromQL queries over certain counters were consistently undercounting compared to our previous vendor.
The root cause lies in how Prometheus computes rates of change from cumulative counters. In a StatsD monitoring system, a data point measures the delta observed during a flush window (typically 10 seconds). In Prometheus, a data point measures the cumulative count starting from zero. In practice, you’ll use something like rate() or increase() to derive deltas, which are more meaningful to graph.
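For example, with an illustrative counter named `http_requests_total`:

```promql
# Per-second rate of change over a 5-minute window
rate(http_requests_total[5m])

# Absolute increase over the same window
increase(http_requests_total[5m])
```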
However, there is an edge case: a counter’s first increment can happen at the moment the series is created, leaving no earlier sample for rate() to diff against.
Consider a counter that increments once and resets (due to pod restart) before it can increment again:
```
1 1 1 1 1 - - - 1
          ↑ reset
```
Each time the counter resets, the increment is lost before PromQL’s rate() function has a chance to derive a delta.
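A toy model of increase() makes the loss concrete. This simplified version sums deltas between consecutive samples and treats any decrease as a counter reset (counting the post-reset value as new growth); unlike real PromQL it does no extrapolation, but it shows how a series whose every sample is 1 yields zero measured growth:

```python
def increase(samples):
    """Simplified PromQL increase(): sum deltas between consecutive
    samples, treating a decrease as a counter reset. The first sample's
    value is never counted, because there is nothing earlier to diff
    against -- which is exactly where the initial increment is lost."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total
```

For the series above, both incarnations of the counter scrape as the value 1, so `increase([1, 1, 1, 1, 1, 1])` returns 0 even though two real increments occurred.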
Initially, we anticipated that the edge case would have minimal impact, given Prometheus’s widespread adoption and proven reliability in diverse environments. However, as we migrated more users, we started seeing this issue more frequently, and it stalled the migration. Airbnb systems have a tendency to generate many low-rate counters with high dimensionality. A counter tracking requests per currency per user per region might see any given data series increment only a handful of times a day, and these updates can be lost due to the issue with rate(). Yet these were often critical metrics that product teams needed to monitor accurately.
Before arriving at our final approach, we considered several workarounds — each with significant drawbacks.
Clearly, the solution had to be transparent and systematic. The migration was already demanding a lot from our users, and it wasn’t reasonable to expect our team to modify every callsite ourselves.
The idea was simple: we needed to seamlessly inject zeroes in our metric stream. Our aggregation tier was already processing all our metrics in a centralized location, so it was the perfect spot to implement the change.
We implemented a small aggregator tweak: the first time we flush an aggregated count, we flush a zero instead of the actual running total.
Consider a counter that increments by one every flush. Here’s what this looks like in practice:
![Table showing how an aggregator delays its first flush: 1 → injects zero (state 1, series [0]); 2 → flushes normally (state 2, series [0, 2]); 3 → continues (state 3, series [0, 2, 3]).](https://cdn-images-1.medium.com/max/1024/1*4Hes3V8JBVNvXwG-Vvd6PA.png)
By injecting the synthetic zero, all counters are implicitly initialized to zero to match Prometheus counter semantics. The delayed flush is also important: it ensures the zero’s timestamp doesn’t conflict with any previous samples. There is one downside: the initial increment will lag by one flush interval, which isn’t a big deal in practice.
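A minimal sketch of the zero-injection behavior (illustrative Python, not the actual vmagent change) looks like this:

```python
class ZeroInjectingAggregator:
    """Sketch of the aggregator tweak: the first flush of each output
    series emits a synthetic zero instead of the running total, so
    Prometheus-style rate()/increase() sees the initial increment as a
    delta from zero instead of losing it."""
    def __init__(self):
        self.totals = {}
        self.flushed_once = set()

    def record(self, series, value):
        self.totals[series] = self.totals.get(series, 0) + value

    def flush(self):
        out = {}
        for series, total in self.totals.items():
            if series in self.flushed_once:
                out[series] = total
            else:
                out[series] = 0  # delayed first flush: inject the zero
                self.flushed_once.add(series)
        return out
```

The running total keeps accumulating internally, so the increment suppressed by the first flush surfaces in full at the next one, matching the table above ([0], then [0, 2], then [0, 2, 3]).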
This approach required minimal changes to vmagent, kept the fix invisible to end users, and solved the problem at the source rather than papering over it downstream.
In this blog post, we talked about how we collected all existing StatsD metrics into a new, Prometheus-based metrics pipeline.
We implemented a dual-write instrumentation strategy, adopting the OpenTelemetry Protocol (OTLP) while maintaining support for StatsD. This switch yielded performance gains and unlocked the use of features such as native histograms.
To manage costs and scale, we introduced a centralized, two-layer streaming aggregation pipeline using a sharded vmagent setup. Finally, we solved the issue of consistently undercounted sparse counters by implementing a transparent “zero injection” technique within our aggregation tier.
Our work demonstrates a practical, production-tested blueprint for migrating a massive, high-volume metrics pipeline to modern, open-source standards such as OpenTelemetry and Prometheus, all while maintaining data accuracy and cost efficiency.
If this type of work interests you, check out our open roles.
Thank you to the Observability team — Abdurrahman Allawala, Rong Hu, Callum Jones, Rishabh Kumar, Wei Song, and Yann Ramin — and our partners across the company who helped make this a reality.
We especially want to thank Helen Lee and Vlad Kostyukov for their support on OpenTelemetry instrumentation.
We also want to thank Suman Karumuri and Xuan Lu for their support in authoring this post during their time at Airbnb.
All product names, logos, and brands are property of their respective owners. All company, product, and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.
Building a high-volume metrics pipeline with OpenTelemetry and vmagent was originally published in The Airbnb Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.