This approach solves the 'cold start' of session intent in recommendation systems by blending offline historical sequences with real-time context. The hybrid inference model balances computational efficiency with immediate relevance, significantly improving candidate survival in ranking funnels.
Huiqin Xin | Machine Learning Engineer II, Ads Vertical Modeling; Lakshmi Manoharan | Senior Machine Learning Engineer, Ads Vertical Modeling; Karthik Jayasurya | Staff Machine Learning Engineer, Ads Signals; Ziwei Guo | Senior Machine Learning Engineer, Ads Vertical Modeling; Alina Liviniuk | Machine Learning Engineer II, Ads Vertical Modeling
In a previous post, Ads Candidate Generation using Behavioral Sequence Modeling, we introduced a candidate generator (CG) that uses a Transformer-based two-tower model to leverage a user’s offsite conversion history — a powerful signal — to predict future interactions with advertisers and specific products. This was a significant step forward, moving beyond static interest categories to model the evolving user shopping journey.
However, a key limitation of the initial sequential model was its lack of online context information. The user embeddings were inferred offline purely from historical offsite behavior, meaning that at the moment an ad was served, the model had no knowledge of what the user was currently browsing on Pinterest. This is a crucial drawback, particularly for highly contextual surfaces like Related Pins and Search, where the user’s current Pin or search query represents a strong, immediate signal of intent. For example, on the Related Pins surface, if a user is viewing a Pin of a “vintage leather armchair,” the recommended ads should be highly relevant to that specific item, not just their general, long-term interests.
This lack of context severely limited the model’s effectiveness on these surfaces; in the previous production system, less than 1% of impressions on Related Pins were attributed to this CG, indicating its candidates struggled to survive the downstream ranking and auction stages.
To overcome this challenge, we developed the Contextual Sequential Two Tower Model, an evolution of the sequential recommender model specifically designed to incorporate real-time, online context. This approach focuses on three major areas: a new model architecture, a novel training approach, and a hybrid serving flow.
The core architectural change was integrating a context layer directly into the query tower of the two-tower model.

As shown in the diagram above, the model now concatenates the output of the original Transformer encoder (which represents historical sequence information) with the output of the new context layer. This combined representation is then fed into the final Multi-Layer Perceptron (MLP) to derive the final user embedding.
For the Related Pins surface, the context layer’s input features are derived from the subject Pin (the Pin the user is currently viewing), specifically using features like the aggregated embedding representations of the top interest categories of the subject Pin, weighted by their confidence scores.
To further personalize the model, the user representation layer was augmented with embeddings of user demographic features, such as age, country, and gender.
Since real-time context is only available at serving time, we had to make the model capable of learning from this signal during offline training. The solution was to use synthetic augmented data.

During model training, we artificially inject pseudo-context information derived from the positive label (the conversion event) into the input sequence. For example, by projecting the interest category features from the positive item, we encourage the model to retrieve items that are semantically related to the context associated with that user session. A high dropout rate is used in the context layer during training to ensure the model still relies on the user’s historical event sequence (the Transformer output).
We opted to use synthetic augmented data over real context data due to two main challenges:
Given that the context features (e.g., subject Pin features) are only known at the ad request time (online), we adopted a hybrid model inference approach.
This architecture and serving flow enables the user embedding to be dynamically influenced by the real-time context, ensuring the recommendations are both personalized (from sequence) and contextually relevant.
To assess the impact of integrating context features on the survival rate of model-retrieved ad candidates, we conducted an offline evaluation. Using logged features from real traffic ad data on Related Pins, we generated the model output embedding and calculated Recall@K, which measures the proportion of positive items found in the top-K retrieved items. Here the candidates that survived the ranking funnel and delivered to the users were considered positive items. This new model demonstrated a significant improvement, achieving a 3x to 10x increase in Recall@K compared to the production model.

We were able to successfully drive up the survival rate of the candidates from this CG on the Related Pins surface. The median relevance of the candidates went up by ~275–300%. On the Related Pins surface overall, the ads relevance metric improved by 1.08%. Furthermore, we observed a significant increase in candidate delivery, with 2x more ads candidates retrieved being delivered to impression.
The improvement in candidate relevance translated into ~0.7% measurable lift in conversion-related business metrics ROAS (Return on Ad Spend). In particular, the model benefits more for top countries which account for a majority of total revenue and leads to ~1.4% ROAS lift.
We plan to explore several key enhancements:
We would like to thank Supeng Ge, Yang Liu, Richard Huang, Yu Liu, Zhuqing Zhang, Kevin Liao, Yu Gu, Wanyu Zhang, for their dedicated help; thank to Alice Wu, Leo Lu, Siping Ji, Ling Leng for their incredible support and leadership; thank to Joachim Groeger for the valuable discussion and support.
Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
Continue reading on the original blog to support the author
Read full articleManaging user-sequence data is notoriously expensive and prone to training-serving skew. This unified architecture reduces operational costs and ensures data consistency across the ML lifecycle, enabling faster iteration on sequence-aware models like Transformers for recommendation systems.
Optimizing for sparse conversion events is a major challenge in ad tech. This architecture shows how to effectively combine sparse labels with dense engagement signals using parallel DCN v2 and multi-task learning to drive significant business value and advertiser RoAS.
Scaling ML models often leads to exponential costs. This approach demonstrates how architectural changes like request-level deduplication and SyncBatchNorm can decouple model complexity from infrastructure overhead, enabling massive scale-ups without proportional cost increases.
This article demonstrates how moving from heuristic-heavy re-ranking to sophisticated algorithms like SSD improves both system performance and long-term user retention. It highlights the importance of balancing immediate clicks with content diversity in large-scale recommendation engines.