Transitioning to GPU serving for lightweight ranking allows engineers to deploy sophisticated architectures like MMOE-DCN. This shift significantly improves prediction accuracy and business metrics while still meeting the strict latency requirements of real-time recommendation systems.
Yuanlu Bai | Machine Learning Engineer II, L1 Conversion and Shopping Modeling; Yao Cheng | Sr. Machine Learning Engineer, L1 Conversion and Shopping Modeling; Xiao Yang | Sr. Staff Machine Learning Engineer, Ads Lightweight Ranking; Zhaohong Han | Manager II, Ads Lightweight Ranking; Jinfeng Zhuang | Sr. Manager, Ads Ranking

Lightweight ranking plays a crucial role as an intermediate stage in Pinterest’s ads recommendation system. Its main purpose is to efficiently narrow down the set of candidate ads before passing them to downstream, more complex ranking models. By doing so, it ensures that only the most relevant candidates move forward, improving both the efficiency and quality of our ads recommendations.
To balance model performance and serving latency, we adopted a classic two-tower paradigm. In this design, the Pin (ad) tower calculates Pin embeddings via offline batch updates, while the query (user) tower generates real-time embeddings. The prediction score is computed as the sigmoid of the dot product between the Pin and query embeddings. Previously, all two-tower models were served on CPUs. In 2025, we launched our first GPU-serving model for engagement prediction, which was an important milestone in the roadmap for next-generation infrastructure and model architecture.
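To make the scoring path concrete, here is a minimal PyTorch sketch of two-tower inference. This is an illustrative reconstruction, not Pinterest's production code; all tensor shapes, names, and candidate counts are assumptions.

```python
import torch

def score(pin_embs: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
    """Predicted engagement probability: sigmoid of the Pin/query dot product."""
    # pin_embs: [num_candidates, d], query_emb: [d]
    logits = pin_embs @ query_emb      # one dot product per candidate ad
    return torch.sigmoid(logits)       # [num_candidates] scores in (0, 1)

# Usage: rank a batch of candidate ads for a single request.
pin_embs = torch.randn(1000, 64)       # Pin embeddings, refreshed offline in batch
query_emb = torch.randn(64)            # query embedding, computed in real time
scores = score(pin_embs, query_emb)
top_ads = scores.topk(100).indices     # candidates passed to downstream ranking
```

Because the Pin tower's embeddings are refreshed offline, only the query tower runs per request, which is what keeps this stage lightweight at serving time.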
The new model architecture combines Multi-gate Mixture-of-Experts (MMOE) with Deep & Cross Networks (DCN), alongside feature updates. GPU serving enables us to support this more complex model while maintaining latency comparable to the CPU baseline. With these improvements, we observed a 5–10% reduction in offline loss compared to our previous production model for click-through rate (CTR) prediction. Additionally, by serving standard and shopping ad scenarios separately and training each with only its relevant data, we achieved a further 5–10% reduction in loss. This segmentation also doubled our offline model iteration speed.
In this blog, we will provide a brief overview of the changes to our model architecture. For a more detailed explanation of MMOE and DCN, please refer to [1]. We will also share our insights on improving GPU training efficiency, as increased model complexity and large training datasets have led to longer training time. Finally, we will present both the offline and online evaluation results of this launch.
The new model introduces a significant architectural shift from the previous Multi-Task Multi-Domain (MTMD) [2] model to an MMOE-DCN design. We incorporated the MMOE structure with an MLP gating mechanism. In the prior MTMD model, domain-specific modules were used to learn information unique to each type of Pin or query. The new MMOE architecture effectively addresses multi-domain multi-task challenges, even without these domain-specific modules. Each expert in our model employs both full-rank and low-rank DCN layers. Below are diagrams illustrating the previous and current model architectures.

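To complement the diagrams, below is a minimal PyTorch sketch of an MMOE layer whose experts stack full-rank and low-rank DCN-style cross layers, with an MLP gate per task. This is an illustrative reconstruction under assumed dimensions and module names, not the production architecture.

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """DCN-v2-style cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l.
    rank=None uses a full-rank W; otherwise W is factored as U @ V^T."""
    def __init__(self, dim, rank=None):
        super().__init__()
        if rank is None:
            self.w = nn.Linear(dim, dim)                     # full-rank
        else:
            self.w = nn.Sequential(nn.Linear(dim, rank, bias=False),
                                   nn.Linear(rank, dim))     # low-rank U @ V^T
    def forward(self, x0, x):
        return x0 * self.w(x) + x

class Expert(nn.Module):
    """Each expert stacks a full-rank and a low-rank cross layer."""
    def __init__(self, dim, rank=16):
        super().__init__()
        self.full = CrossLayer(dim)
        self.low = CrossLayer(dim, rank=rank)
    def forward(self, x0):
        x = self.full(x0, x0)
        return self.low(x0, x)

class MMOE(nn.Module):
    """Multi-gate mixture of experts with an MLP gate per task."""
    def __init__(self, dim, n_experts=4, n_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(dim) for _ in range(n_experts)])
        self.gates = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                          nn.Linear(dim, n_experts))
            for _ in range(n_tasks)])
    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [B, E, d]
        outs = []
        for gate in self.gates:
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)  # [B, E, 1]
            outs.append((w * expert_out).sum(dim=1))          # [B, d] per task
        return outs

# Usage: shared experts produce one representation per task.
task_reprs = MMOE(dim=64)(torch.randn(32, 64))
```

Factoring a cross layer's weight matrix as U @ V^T trades a small amount of expressiveness for a large reduction in parameters and FLOPs, which is one common way to keep deeper cross networks affordable.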
Here is a comparison of the model sizes.

As model size and training FLOPs increased, we conducted a series of optimizations and analyses to improve training efficiency and keep training times manageable.
We use prediction scores from downstream ranking models as training labels and employ KL divergence [3] between the labels and model predictions as our loss function. The model is trained and evaluated on both auction winners (ads that were inserted and shown to users) and auction candidates (ads passed to the downstream ranking model). The table below demonstrates significant loss reduction across all slices, both offline and online [4].

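For intuition on the objective, here is a minimal sketch of a KL divergence loss between the downstream model's scores (used as soft labels) and the lightweight model's predictions, treating each as a Bernoulli distribution. The function name and clamping epsilon are assumptions for illustration.

```python
import torch

def bernoulli_kl_loss(teacher_p: torch.Tensor, student_p: torch.Tensor,
                      eps: float = 1e-7) -> torch.Tensor:
    """KL(teacher || student) for per-example Bernoulli scores, averaged."""
    p = teacher_p.clamp(eps, 1 - eps)   # downstream ranking model scores
    q = student_p.clamp(eps, 1 - eps)   # lightweight model predictions
    kl = p * (p / q).log() + (1 - p) * ((1 - p) / (1 - q)).log()
    return kl.mean()

# Usage with hypothetical batches of scores.
teacher = torch.rand(256)                   # downstream ranker's predictions
student = torch.sigmoid(torch.randn(256))   # two-tower model predictions
loss = bernoulli_kl_loss(teacher, student)
```

Using the downstream model's scores as soft labels distills its behavior into the lighter model, so candidates the downstream ranker would score highly are more likely to survive this stage.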
In online experiments, we typically use cost-per-click (CPC) and click-through rate (CTR) as key success metrics. CPC measures the average advertising cost for each user click, so lower values are preferable. As shown in the table below, we observed significant reductions in CPC and increases in CTR across all slices [4].

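As a quick worked example of the two metrics (the numbers below are made up for illustration, not Pinterest data):

```python
# Lower CPC and higher CTR are both improvements.
spend, clicks, impressions = 500.0, 2_000, 100_000
cpc = spend / clicks          # $0.25 average cost per click
ctr = clicks / impressions    # 2% click-through rate
```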
In this launch, we introduced a new GPU-serving two-tower model for Pinterest ads lightweight ranking, leveraging the MMOE-DCN architecture. By combining GPU infrastructure, model optimizations, and training efficiency improvements, we achieved substantial gains in both offline and online metrics. These enhancements resulted in significant reductions in loss and cost-per-click, as well as increases in click-through rate. This work marks an important step forward in scaling our recommender systems with more complex, efficient, and effective models.
This project was a collaborative effort involving multiple teams at Pinterest.
[1] Li, Jiacheng, et al. “Multi-gate-Mixture-of-Experts (MMoE) model architecture and knowledge distillation in Ads Engagement modeling development.” Pinterest Engineering Blog.
[2] Yang, Xiao, et al. “MTMD: A Multi-Task Multi-Domain Framework for Unified Ad Lightweight Ranking at Pinterest.” AdKDD 2025.
[3] Kullback, Solomon, and Richard A. Leibler. “On information and sufficiency.” The Annals of Mathematical Statistics 22.1 (1951): 79–86.
[4] Pinterest Internal Data, US, 2025.