Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer

Pinterest Engineering | May 1, 2026

Why it matters

This approach addresses the common bottleneck where network I/O limits ML serving efficiency. By implementing feature trimming based on model signatures, engineers can maximize GPU utilization and significantly reduce infrastructure costs by moving away from network-optimized instances.

Key takeaways

Pinterest's root-leaf architecture separates CPU-heavy feature fetching from GPU-heavy model inference, but created a network bandwidth bottleneck.
Network usage, rather than compute, dictated scaling needs, preventing full GPU utilization and requiring expensive network-optimized AWS instances.
Implementing lz4 compression provided a 20% bandwidth reduction but increased CPU usage and latency.
The Feature Trimmer system implements a Send What You Use strategy, filtering features at the root level before transmission to leaf nodes.
By extracting required features from model signatures, the system ensures only necessary data is sent for each specific model version.
This optimization reduces network pressure, allowing for fleet downscaling and a transition to cheaper, standard compute instances.

Keywords

Root-leaf architecture

Guangtong Bai | Staff Software Engineer, Product ML Infrastructure*; Shantam Shorewala | Software Engineer II, Product ML Infrastructure*; Chi Zhang | Staff Software Engineer, AI Platform*; Neha Upadhyay | Software Engineer II, AI Platform*; Haoyang Li | Director, Product ML Infrastructure

*These authors contributed equally to this article.

Background

At Pinterest, our online ML serving systems employ a root-leaf architecture. On a high level, the architecture looks as follows:

*Figure 1: Root-leaf Architecture of Online ML Serving Systems at Pinterest*

In the diagram, “Client Service” is responsible for recommending organic or promoted Pins to users. To determine whether a given Pin is relevant to a particular user request, the client service sends a score request to the online ML serving system, which has the Pin scored by a set of ML models, each of which scores one aspect of “relevancy”.

The online ML serving system is composed of 2 parts:

Root: This component handles initial feature processing. Its responsibilities include retrieving necessary features from the feature store, performing required preprocessing, and distributing (fanning out) the scoring requests to the various leaf partitions.
Leaf: This is where the actual model inference takes place, typically utilizing GPU machines. It is structured into multiple partitions, each of which hosts a related group of models, such as one production model and several experimental variants.

What flows between these services is ML features. In this blog, we share how passing too many features from root to leaf created a network bottleneck and how we resolved it with Feature Trimmer.

Motivation

The root-leaf architecture provides us with significant benefits, namely:

Simplified Model Onboarding: New ML models can easily be onboarded for online serving by creating new leaf partitions, transparent to root and upstream clients.
Reduced Feature Store QPS: The system minimizes RPCs to the feature store for fetching ML features by having all leaf partitions share a large in-memory feature cache in the root.
Optimized Resource Utilization: Separating CPU (feature fetching, preprocessing) and GPU (model inference) workloads allows for optimized resource use, improving efficiency and reducing cost.

However, this setup introduced a new challenge — the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute. We observed this pressure in the Ads server on both the root and leaf partitions:

On leaf partitions, peak network usage was significantly higher than peak GPU SM activity (see Figure 2). Consequently, the network bottleneck prevented us from fully utilizing the available GPU compute power.
On root, we had to use the network-optimized AWS instance type m6in to ensure that server latency met our internal SLA.

*Figure 2: Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server*

That led to a straightforward idea: reduce the root-leaf network bandwidth usage to unlock immediate fleet downscaling and infrastructure savings. If we could cut bandwidth enough, we could also move the root from network-optimized m6in instances to standard m6i instances (about 20% cheaper), further reducing cost.

Enable compression to reduce network usage

The most direct way to reduce the root-leaf network bandwidth usage is to compress the requests between them.

This compression strategy is well-suited for the requests sent from the root to the leaf, which primarily carry ML features for multiple candidate Pins for a given user request. These requests are compressible for several reasons:

Feature Set Consistency: The set of features requested is identical across different candidate Pins, although the actual feature values vary.
Feature Similarity: There are groups of features that share similar representations (e.g., last_x_pins_user_viewed and last_x_pins_user_clicked).
Sparsity: Many features are sparse, containing numerous empty or zero values.

After a few quick tests, we enabled lz4 compression in fbthrift (the RPC framework used by root and leaf) for root-leaf traffic. That reduced root-leaf network usage by 20%, at the cost of a 5% increase in CPU usage and a 5 ms (~10%) increase in p90 latency.
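The compressibility argument above can be illustrated with a small sketch. It makes two loud assumptions: it uses Python's standard-library zlib as a stand-in for lz4 (which is not in the standard library), and it uses a made-up JSON payload instead of the Thrift-serialized features that the root actually sends. Feature names, counts, and the sparsity level are illustrative only.

```python
import json
import random
import zlib

random.seed(0)  # deterministic illustration

# Hypothetical payload: the same feature names repeat for every candidate Pin,
# and most values are zero. This is NOT Pinterest's real wire format.
FEATURE_NAMES = [f"feature_id_{i}" for i in range(200)]


def make_candidate():
    # Roughly 80% of values are zero to mimic sparse features.
    return {
        name: (0.0 if random.random() < 0.8 else round(random.random(), 4))
        for name in FEATURE_NAMES
    }


payload = json.dumps([make_candidate() for _ in range(50)]).encode("utf-8")
compressed = zlib.compress(payload)  # zlib stands in for lz4 here

ratio = len(compressed) / len(payload)
print(f"raw={len(payload)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.2f}")
```

The ratio printed here is not comparable to the 20% figure above, since both the codec and the serialization format differ; the sketch only shows that repeated feature sets and sparse values leave a lot of redundancy for a general-purpose compressor to remove.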

Compression was a solid early win, but it didn’t change the underlying problem: we were still shipping too much unused data. The bigger lever was to stop sending unused features altogether, which led to our “Send What You Use” approach.

Send What You Use

In our root–leaf architecture, the root is shared across many leaf partitions and must fetch ML features for all models. To minimize feature store QPS, the root fetches the union of features needed across models (per candidate Pin), stores them in an efficient in-memory cache, and then fans out the full feature set to each leaf model. Each model converts and uses only the features it needs; the rest are effectively discarded before inference.

This approach was acceptable in our prior architecture, where the same GPU host handled both feature fetching/preprocessing and local model inference. In that context, the unnecessary features only increased main memory usage, which was not a bottleneck on GPU machines. However, within the new root-leaf architecture, transmitting these unneeded features across the network introduces a significant efficiency problem.

If we could send only the required features and trim everything else, much as the C++ “include what you use” tool removes unnecessary #include directives, we could potentially cut root-leaf network usage by ~50%. Like compression, this trades network savings for some additional CPU work and potential latency overhead.
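At its core, the idea is a per-model filter applied on the root before fan-out. The following is a minimal sketch, assuming features are held in a plain dictionary keyed by feature name; the function name `trim_features` and the sample values are hypothetical and not taken from the production system.

```python
from typing import Any, Dict, Iterable


def trim_features(features: Dict[str, Any], allowlist: Iterable[str]) -> Dict[str, Any]:
    """Keep only the features named in the allowlist (hypothetical helper)."""
    allowed = set(allowlist)
    return {name: value for name, value in features.items() if name in allowed}


# The root fetches the union of features needed by all models...
union_features = {
    "feature_id_1": [0.1, 0.2],
    "feature_id_2": 3,
    "feature_id_3": "pin_category",
    "feature_id_4": [0] * 16,
}
# ...but this illustrative model consumes only two of them.
model_a_allowlist = ["feature_id_1", "feature_id_3"]

trimmed = trim_features(union_features, model_a_allowlist)
print(sorted(trimmed))
```

Building the set once per request (or once per allowlist refresh) keeps the membership check constant-time per feature, which matters because the trim runs on the serving path.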

*Figure 3: Overview of the ML inference engine with root-leaf setup and feature trimming*

To make this work, the root must know the exact feature list required by each leaf model. Since models refresh continuously, we also need to keep the feature allowlist on root in sync with the feature expectations of the latest model version on the leaf.

Source of Truth: Model Signature

The source of truth for which features are needed by a model is its model signature. Model signature defines the inputs and outputs of a model, similar to a function signature. As a version of a model finishes training, its model signature is exported as an extra file alongside the TorchScript artifact in the .pt archive file. Below is what a model signature looks like:

❯ unzip -p model.pt archive/extra/module_info.json | jq
{
  "input_names": [
    "feature_id_1",
    "feature_id_2",
    "feature_id_3",
    ...
  ],
  "output_names": [
    "output_score_1",
    "output_score_2"
  ]
}

When the leaf loads a specific model version from the .pt archive, it not only deserializes the weights from the TorchScript artifact, but also builds a feature converter from the model signature. The converter transforms input features from internal company format into native PyTorch tensors before passing them to the model. Because it knows the model’s inputs, it converts only the required features and discards the rest.
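Because the .pt file is a zip archive, the signature can be read without loading the model itself. The sketch below does the same thing as the unzip command above using only the Python standard library. The function name `read_input_names` is hypothetical, and the in-memory archive is a stand-in that contains only the signature file, not a real TorchScript artifact.

```python
import io
import json
import zipfile

# Path inside the archive, as shown in the unzip command above.
SIGNATURE_PATH = "archive/extra/module_info.json"


def read_input_names(pt_archive) -> list:
    """Return the model's input feature names from its signature file."""
    with zipfile.ZipFile(pt_archive) as archive:
        with archive.open(SIGNATURE_PATH) as fh:
            signature = json.load(fh)
    return signature["input_names"]


# Build a tiny stand-in archive in memory so the example is self-contained.
fake_pt = io.BytesIO()
with zipfile.ZipFile(fake_pt, "w") as archive:
    archive.writestr(
        SIGNATURE_PATH,
        json.dumps({
            "input_names": ["feature_id_1", "feature_id_2", "feature_id_3"],
            "output_names": ["output_score_1", "output_score_2"],
        }),
    )
fake_pt.seek(0)

input_names = read_input_names(fake_pt)
print(input_names)
```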

A crucial convention is that a model’s signature remains unchanged across different versions. If a signature modification is necessary — for instance, to introduce a new input feature — a new model is forked from the original. This practice is essential because it underpins the fallback mechanism for the versioned lookup feature of the Feature Trimmer, a topic discussed in detail later in the “Versioned Lookups and Fallback” section.

Model Deploy Synchronization

Feature Trimmer only works if the root knows exactly the features that the leaf model expects. That sounds simple until you factor in reality: models are refreshed frequently (hourly to daily), multiple models are shipped together as a “bundle”, and rollouts happen gradually (canary → prod, rolling deploys, occasional rollbacks).

This section explains how we keep the root up to date with what’s actually deployed on the leaf without adding heavy runtime dependencies or introducing brittle, manually managed configs.

At a high level, our approach is:

Treat the model signature, which is exported as module_info.json, as the source of truth.
Publish signatures as lightweight artifacts that can be consumed by deployment pipelines.
Aggregate per-model signatures into a per-bundle artifact that is deployed to the root alongside existing root configs.
Use the same staged delivery semantics as model rollout (canary, automated canary analysis, prod, rollback), so trimmer config changes ride the same operational rails as everything else.

*Figure 4: Root configurations artifact generation and delivery integrated with existing model deployment*

Publish module_info.json as a standalone artifact

To make the model signature easy to ship and consume, we export module_info.json as a standalone file as part of the model training workflow, next to other model files (for example, alongside the model artifact and config files). This is important for synchronization as it ensures signatures are available before deployment, and available in a form that can be aggregated and deployed without any heavy runtime dependencies.

Generate a bundle-level module_info mapping during bundle build

In production, roots don’t serve a single model; they typically serve bundles containing multiple models (and sometimes multiple versions during a rollout window). So instead of deploying N per-model signatures independently, the bundle pipeline generates one bundle-level artifact that looks like:

{
  "model_A": [
    {
      "version": "1",
      "input_names": ["feature_id_1", "feature_id_2", "..."],
      "output_names": ["score_1", "..."]
    },
    {
      "version": "2",
      "input_names": ["feature_id_1", "feature_id_2", "..."],
      "output_names": ["score_1", "..."]
    }
  ],
  "model_B": [
    {
      "version": "7",
      "input_names": ["feature_id_9", "..."],
      "output_names": ["score_x", "..."]
    }
  ]
}

During the build step, the model deploy pipeline iterates over the model versions that will be shipped in the bundle.

If a model version includes module_info.json, the pipeline parses it and records the signature.
If the signature is missing, the pipeline logs a warning and skips that version rather than failing the entire build. This keeps the system resilient while signature publishing is being rolled out across use cases.
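The build step described above can be sketched roughly as follows. The input record shape (`model`, `version`, `module_info`) and the function name `build_bundle_module_info` are assumptions made for illustration; the real pipeline reads files from its own storage and is not shown in this post.

```python
import json
import logging
from typing import Dict, List

logger = logging.getLogger("bundle_build")


def build_bundle_module_info(model_versions: List[dict]) -> Dict[str, List[dict]]:
    """Aggregate per-version signatures into one bundle-level mapping.

    Each record is hypothetical: {"model": str, "version": str,
    "module_info": raw JSON text of module_info.json, or None if missing}.
    """
    bundle: Dict[str, List[dict]] = {}
    for item in model_versions:
        raw = item.get("module_info")
        if raw is None:
            # Missing signature: warn and skip rather than fail the build.
            logger.warning("no module_info.json for %s version %s; skipping",
                           item["model"], item["version"])
            continue
        signature = json.loads(raw)
        bundle.setdefault(item["model"], []).append({
            "version": item["version"],
            "input_names": signature["input_names"],
            "output_names": signature["output_names"],
        })
    return bundle


records = [
    {"model": "model_A", "version": "1",
     "module_info": json.dumps({"input_names": ["feature_id_1"], "output_names": ["score_1"]})},
    {"model": "model_A", "version": "2", "module_info": None},  # signature not published
    {"model": "model_B", "version": "7",
     "module_info": json.dumps({"input_names": ["feature_id_9"], "output_names": ["score_x"]})},
]
bundle_module_info = build_bundle_module_info(records)
print(json.dumps(bundle_module_info, indent=2))
```

In this sketch, version 2 of model_A is skipped with a warning, so the resulting mapping carries only version 1 for that model while the build still succeeds.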

Finally, the bundle-level module_info file is packaged and uploaded together with other root configuration files, so the root receives one coherent “root configs” package.

Deploy root configs through the same staged delivery flow

Once the bundle build produces the root-config package, deployment follows the standard staged delivery pattern:

Deploy root configs to Canary
Deploy model configs to Canary
Run Automated Canary Analysis (ACA)
Deploy root configs to Production
Deploy model configs to Production

This is important because it integrates the feature trimmer into the existing model deployment system and ensures that the “root’s trimming view of the world” is updated using the same guardrails and rollback mechanics as other model changes.

We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs, which can cause a temporary rollout gap. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions. This is discussed in more detail in the later “Versioned Lookups and Fallback” section.

On successful completion, the root hosts receive the bundle-level signature mapping at a known location on disk, and the trimmer can begin using it for per-model feature allowlisting.

A Closer Look into Trimmer Internals

Feature Allowlist or Blocklist

Once the root hosts know which features each model requires, we keep only those features in the fan-out request to leaf partitions. Compared to a blocklist, where we would list the features to drop and keep everything else, this allowlist approach does not carry the burden of tracking every feature that might be in development or deprecated. Given the evolving nature of ML models and the volume of experiments at Pinterest, the blocklist for any given model would be significantly larger than the allowlist and would probably grow faster in the future.

Concurrent Updates Across Model Bundles

As mentioned earlier, a model bundle can contain multiple ML models. Additionally, the model bundles do not map 1:1 to the root cluster: each root cluster can receive traffic for multiple bundles. The bundles, each with its own module_info artifact, are deployed independently and often at different cadences. Further, we need to support independent rollbacks, even for a single model bundle.

*Figure 5: Concurrent update handling for multiple bundles*

A feature trimmer module is initialized on each root host when it comes online. This module maintains a consolidated, in-memory mapping from models to their versioned feature allowlist. Each trim request is efficiently serviced by looking up the model name and version within this consolidated map. The consolidated map uses the model name and version as nested keys for fast read access as follows.

{
  "model_A": {
    "version_N": ["feature_id_1", "feature_id_2", "..."],
    "version_M": ["feature_id_1", "feature_id_2", "..."]
  },
  "model_B": {
    "version_N": ["feature_id_3", "feature_id_4", "..."],
    "version_K": ["feature_id_4", "feature_id_5", "..."]
  }
}

This per-model feature allowlist map needs to be continuously refreshed as the model bundle is updated. Here is how it is managed:

Configuration: The root cluster is configured with the active model bundles, and the file path for each corresponding module_info.json is set using GFlags.
Initial Loading: The feature trimmer module loads the content of each module_info.json file into an independent in-memory map.
Monitor for Content Updates: A file watcher is attached to each module_info.json. Any content refresh triggers a reload of its contents into the in-memory map for the given model bundle.
Consolidation: On initial loading or when any model bundle is refreshed, the module:
— Scans and merges all independent maps.
— Creates a new consolidated map.
— Atomically replaces the current active consolidated map with the new one.
Concurrency Management w/ Read-Write Lock:
— Concurrent reads of the consolidated and independent maps are managed with a shared lock.
— Write access during the map replacement is managed with a unique lock.
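The consolidation steps above can be sketched as follows. This is an illustration under stated assumptions, not the production module: Python's standard library has no reader-writer lock, so the sketch serializes writers with a mutex and lets readers grab a reference to the current consolidated map, which is only ever replaced wholesale. The class and method names (`AllowlistStore`, `update_bundle`, `lookup`) are hypothetical, and the file watcher is reduced to a direct call to `update_bundle`.

```python
import threading
from typing import Dict, List, Optional

# model name -> version -> feature allowlist
ModelMap = Dict[str, Dict[str, List[str]]]


class AllowlistStore:
    def __init__(self) -> None:
        self._write_lock = threading.Lock()
        self._per_bundle: Dict[str, ModelMap] = {}  # one independent map per bundle
        self._consolidated: ModelMap = {}

    def update_bundle(self, bundle: str, models: ModelMap) -> None:
        """Called on initial load and whenever a bundle's file changes."""
        with self._write_lock:
            self._per_bundle[bundle] = models
            # Scan and merge all independent maps into a brand-new map.
            merged: ModelMap = {}
            for bundle_models in self._per_bundle.values():
                for model, versions in bundle_models.items():
                    merged.setdefault(model, {}).update(versions)
            # Replace the active map in a single reference assignment.
            self._consolidated = merged

    def lookup(self, model: str, version: str) -> Optional[List[str]]:
        snapshot = self._consolidated  # readers never see a half-built map
        return snapshot.get(model, {}).get(version)


store = AllowlistStore()
store.update_bundle("bundle_1", {"model_A": {"1": ["feature_id_1", "feature_id_2"]}})
store.update_bundle("bundle_2", {"model_B": {"7": ["feature_id_9"]}})
# Refreshing bundle_1 rebuilds the consolidated map; bundle_2 is unaffected.
store.update_bundle("bundle_1", {"model_A": {"2": ["feature_id_1"]}})
print(store.lookup("model_A", "2"), store.lookup("model_B", "7"))
```

Rebuilding the whole consolidated map on each refresh costs more than patching it in place, but it keeps the read path simple: a lookup touches one immutable snapshot and never observes a partially merged state.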

Versioned Lookups and Fallback

*Figure 6: Request flow for versioned lookup and fallback*

Each scoring request sent to the root cluster must include the model name and optionally, the model version. If the version is omitted, it defaults to the latest version. The feature trimmer parses these fields to determine the version-specific feature allowlist for the requested model.

If no feature allowlist exists for the model, the request proceeds untrimmed.
If both model name and version are specified and found, the specific version’s allowlist is used.
If the model name is found but the version is either not specified or not found, the trimmer uses the latest version of the allowlist. This design choice is based on the assumption at Pinterest that the model signature remains consistent across versions, which also simplifies the deployment by avoiding the need to keep multiple versions in memory during a rolling deployment.
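The three lookup rules above can be expressed as a short function. This is a sketch, assuming the consolidated map is a nested dictionary of model name to version to allowlist, and assuming that “latest” means the highest numeric version string; how the production trimmer determines the latest version is not stated here. The name `resolve_allowlist` is hypothetical.

```python
from typing import Dict, List, Optional


def resolve_allowlist(
    consolidated: Dict[str, Dict[str, List[str]]],
    model: str,
    version: Optional[str] = None,
) -> Optional[List[str]]:
    """Return the allowlist to trim with, or None to leave the request untrimmed."""
    versions = consolidated.get(model)
    if not versions:
        return None  # no allowlist for this model: request proceeds untrimmed
    if version is not None and version in versions:
        return versions[version]  # exact name + version match
    # Version omitted or unknown: fall back to the latest version.
    # Assumption: versions are numeric strings and "latest" is the largest one.
    latest = max(versions, key=int)
    return versions[latest]


consolidated = {
    "model_A": {
        "1": ["feature_id_1", "feature_id_2"],
        "2": ["feature_id_1", "feature_id_2"],
    }
}
print(resolve_allowlist(consolidated, "model_A", "1"))
print(resolve_allowlist(consolidated, "model_A", "9"))  # unknown version, falls back
print(resolve_allowlist(consolidated, "model_C", "1"))  # unknown model
```

The fallback is only safe because of the convention that a model's signature does not change across versions; without it, trimming with another version's allowlist could drop a feature the requested version needs.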

The adoption of the feature trimmer is expected to reduce network bandwidth consumption for root-leaf connections. This places the trimmer on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures. Therefore, robust handling of artifact (module_info.json) corruption or deployment failures is essential.

We have implemented the following safeguards:

Initialization Failure Guardrail: Upon Feature Trimmer module initialization, any failures while parsing the required module_info artifacts are emitted to our observability dashboard and trigger an on-call alert. We specifically chose not to block host launch on initialization failure. This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself.
Isolate Failures from a Single Model Bundle: The feature trimmer loads the module_info contents for each model bundle into a separate map in its memory. If a model bundle’s file gets corrupted on disk during an update, the feature trimmer keeps using the old, in-memory version for that bundle. Because each bundle has its own map, the feature trimmer can still successfully update the information for all the other model bundles.
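The per-bundle isolation described above amounts to “parse first, swap only on success”. The sketch below assumes each bundle's artifact arrives as raw JSON text and that the per-bundle maps live in a plain dictionary; the function name `reload_bundle` is hypothetical, and the logging call stands in for the observability and alerting hooks, which are not shown.

```python
import json
import logging
from typing import Dict

logger = logging.getLogger("feature_trimmer")


def reload_bundle(per_bundle: Dict[str, dict], bundle: str, raw_text: str) -> bool:
    """Try to refresh one bundle's map; keep the previous one on failure."""
    try:
        parsed = json.loads(raw_text)
        if not isinstance(parsed, dict):
            raise ValueError("module_info must be a JSON object")
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        logger.error("failed to parse module_info for %s: %s", bundle, err)
        return False  # the old in-memory map for this bundle stays in use
    per_bundle[bundle] = parsed
    return True


old_map = {"model_A": [{"version": "1", "input_names": ["feature_id_1"]}]}
per_bundle = {"bundle_1": old_map}

corrupt_ok = reload_bundle(per_bundle, "bundle_1", '{"model_A": [')   # truncated file
other_ok = reload_bundle(per_bundle, "bundle_2", '{"model_B": []}')   # healthy bundle
print(corrupt_ok, other_ok, sorted(per_bundle))
```

Here the corrupted update for bundle_1 is rejected and its previous map is kept, while bundle_2 still loads normally.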

The fundamental assumption that the model signature is consistent across different model versions allows us to implement these precautions, ensuring the Feature Trimmer remains reliably operational even in the event of intermittent deployment failures.

Efficiency Wins

Reduced Network Stress

The Ads root-leaf server setup was the biggest beneficiary of this launch. Figures 7 through 9 show the network performance of the Ads root and leaf clusters before and after the launch of the feature trimmer module.

*Figure 7: Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server after feature trimmer was enabled. The reduction in network usage allowed us to tune the cluster size and batch size config to improve the GPU utilization.*

*Figure 8: Comparison of the network bandwidth consumption before and after launch of the feature trimmer on the Ads root cluster. It dropped from a peak of 4GBPS to <1.5GBPS even after downsizing the root cluster by 27%.*

*Figure 9: Comparison of network bandwidth performance on Ads leaf partitions after the launch of the feature trimmer. The peak usage dropped from 1000–1200 MBPS in some clusters to <200MBPS for all clusters.*

Later, we also applied the feature trimmer to other use cases such as Homefeed and Related Pins and saw latency and network reductions similar to Ads, amplifying the overall impact of this initiative. Figures 10 and 11 show the network savings in the Homefeed root and leaf clusters.

*Figure 10: In our Homefeed Root cluster, outbound network usage dropped substantially from ~1.2–2.1 GB/s to ~0.45–1.1 GB/s*

*Figure 11: We saw 65–75% reduction in inbound network usage across Homefeed GPU leaf clusters*

As a result, we reduced the Homefeed root cluster fleet size by 33% and are still working on rightsizing the Homefeed leaf clusters, unlocking significant infrastructure savings.

Latency Improvement

While the payload size reduction directly contributed to the network performance improvement, we also saw a reduction in CPU utilization on the root cluster and a reduction in both server-side and client-side root latency. We believe this is largely because a smaller payload means fewer CPU cycles spent on SerDe (serialization/deserialization). Ads traded part of this latency headroom for additional cost savings, and the remainder was used to unblock future experiments (see latency increases in late June).

*Figure 12: Ads client (AdMixer) P90 latency dropped significantly as well, from peaks above 90ms prior to the launch to peaks below 80ms after feature trimmer was enabled.*

For our Related Pins surface, the p99 model score latency before the feature trimmer sat around 130–180 ms for most models, with frequent spikes above 200 ms. After the feature trimmer was enabled, the p99 baseline shifted down to roughly 95–125 ms for most models, a notable ~25–30% drop in latency.

*Figure 13: Feature Trimmer reduces Related Pins model p99 latency by ~25–30%. Note that the feature trimmer was not available for some models because they did not have a valid feature allowlist, so these models still see the same peak latency post rollout.*

Cost Saving

Based on the efficiencies realized in terms of network performance and client latency, we were able to resize the ML servers at Pinterest to realize significant cost savings:

Ads was the biggest beneficiary of this project — the team could downsize the root cluster by 27% without any performance regression. On the leaf side, the network improvement allowed us to tinker with the batching logic to finetune GPU utilization without impacting any other metrics, representing roughly 5% of the total GPU capacity at the time.
— The latency reduction unblocked future improvements and marginally reduced the failures due to server timeouts — this led to a marginal 0.17% increase in revenue as well.
Across other use cases like Search and Notification, we saw approximately 45% and 65% drops in egress network throughput, with no material change in p99 latency. Because these clusters were initially network-bound, feature trimmer allowed us to move to more optimized instance types, resulting in ≥30% cost reduction for both.
— This realized an additional $0.98M in annual infrastructure cost savings from rightsizing the clusters.

Overall, this project saved over $4M in annual infrastructure costs for Pinterest while creating headroom to test bigger models and features without latency or network performance concerns. It effectively shifted the bottleneck from network to CPU cycles on the root cluster. This also allows the team to switch focus to optimizing the payload between the client and the root to further finetune the resource utilization end-to-end.

Wrap Up

Feature Trimmer successfully addressed a critical network bottleneck in Pinterest’s root-leaf ML serving architecture, moving beyond simple payload compression to implement a “Send What You Use” philosophy. By establishing the model signature as the source of truth for required features and deploying a robust, version-aware feature allowlisting system in sync with model rollouts, we significantly reduced the data volume passed between the root and leaf clusters. This optimization resulted in substantial network bandwidth reduction, improved client-side latency, and ultimately delivered significant cost savings.

In Part II of this blog series, we will shift focus to how request feature compression further optimizes the network connection between the client and the root. Keep an eye out for the next installment to discover how we achieve even greater efficiencies in our ML serving infrastructure.

Acknowledgement

This project would not have been possible without former team members Yiran Zhao and Queena Zhang’s early exploration and prototyping. We extend our sincere gratitude to the following individuals for their invaluable support in deploying Feature Trimmer into production: Miao Wang, Randy Carlson, Runze Su, Qifei Shen, and Tao Mo. We would also like to thank Nazanin Farahpour, Howard Nguyen, Bo Liu, Sihan Wang, Renjun Zheng and Zheng Liu for their helpful review of this blog post.

Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.