Why it matters: Simple RAG fails to scale Text-to-SQL in large enterprises because of schema complexity. By encoding historical analyst intent and governance metadata into embeddings, engineers can build agents that produce trustworthy, context-aware queries instead of merely syntactically correct ones.
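The retrieval step can be sketched as follows: a minimal, stdlib-only toy in which each schema document is enriched with analyst intent and governance notes before embedding. The table names, documents, and three-dimensional vectors are hypothetical stand-ins for a real embedding model, not the article's actual system.

```python
import math

# Hypothetical corpus: each entry folds a table schema together with
# historical analyst intent and governance metadata before embedding.
# The 3-d vectors are toy stand-ins for real embedding-model output.
CORPUS = [
    {"table": "fct_orders",
     "doc": "order facts; intent: revenue by region; governance: PII-free",
     "vec": [0.9, 0.1, 0.0]},
    {"table": "dim_users",
     "doc": "user dimension; intent: cohort analysis; governance: contains PII",
     "vec": [0.1, 0.9, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, corpus, k=1):
    """Rank schema docs by similarity to the embedded analyst question."""
    ranked = sorted(corpus, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["table"] for e in ranked[:k]]

# A question embedded near the "revenue" direction lands on the orders table.
print(retrieve([1.0, 0.0, 0.0], CORPUS))  # ['fct_orders']
```

Because intent and governance text is embedded alongside the schema, the agent retrieves tables that analysts actually used for similar questions, not just tables whose column names happen to match.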
Why it matters: Consolidating fragmented ML models reduces technical debt and operational overhead while boosting performance through shared representations. This case study provides a blueprint for balancing architectural unification with the need for surface-specific specialization in large-scale systems.
Why it matters: This case study highlights that even mathematically superior models fail if serving infrastructure lacks feature parity with training. It provides a blueprint for diagnosing ML system discrepancies by auditing the entire pipeline from embedding generation to funnel alignment.
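A pipeline audit like the one described starts with a training/serving parity check. The sketch below is a generic illustration (function name, feature names, and tolerance are assumptions, not the case study's tooling): it compares the feature values a model saw at training time against what the serving path produced for the same entity.

```python
def feature_parity_report(offline, online, tol=1e-6):
    """Compare training-time vs serving-time feature values for one entity.

    offline/online: dicts mapping feature name -> value. Missing or
    drifted features are the usual root cause of train/serve skew.
    """
    issues = []
    for name, train_val in offline.items():
        if name not in online:
            issues.append((name, "missing at serving"))
        elif abs(train_val - online[name]) > tol:
            issues.append((name, "value drift"))
    return issues

# Toy example: one feature drifted, one never made it to serving.
offline = {"ctr_7d": 0.12, "emb_norm": 1.0, "age_days": 30.0}
online  = {"ctr_7d": 0.12, "emb_norm": 0.93}
print(feature_parity_report(offline, online))
# [('emb_norm', 'value drift'), ('age_days', 'missing at serving')]
```

Running such a report across a sample of logged requests quickly separates "the model is worse" from "the serving stack is feeding the model different inputs."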
Why it matters: Managing resources at scale requires more than just hard limits. Piqama provides a unified framework for capacity and rate-limiting, enabling automated rightsizing and budget alignment. This reduces manual overhead while improving resource efficiency and system reliability across platforms.
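Piqama's internals are not spelled out here, but the primitive a unified capacity/rate-limiting framework builds on can be sketched as a token bucket, where sustained rate and burst capacity are the two knobs that automated rightsizing would tune. This is a minimal generic sketch, not Piqama's actual implementation.

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: `capacity` caps bursts, `rate` sets
    the sustained budget. A platform framework would layer quota
    accounting and automated rightsizing on top of a primitive like this."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Expressing both capacity limits and rate limits in one token model is what lets a single framework reason about budgets: rightsizing becomes "adjust `rate` and `capacity` to observed demand" rather than hand-editing per-service hard limits.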
Why it matters: OOM errors are a primary cause of Spark job failures at scale. Pinterest's elastic executor sizing allows jobs to be tuned for average usage while automatically handling memory-intensive tasks, significantly reducing manual tuning effort, job failures, and infrastructure costs.
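The core idea, sizing for the average task and escalating memory only when an OOM actually occurs, can be illustrated with a toy retry loop. This is purely illustrative: real elastic executor sizing happens inside the cluster manager and Spark runtime, not in application code, and all names and sizes below are assumptions.

```python
class OOMError(Exception):
    """Stand-in for an executor out-of-memory failure."""

def run_with_elastic_memory(task, base_mb=2048, max_mb=16384, factor=2):
    """Start at the average-sized memory grant; on OOM, retry with an
    escalated grant instead of provisioning every task for the worst case."""
    mem = base_mb
    while True:
        try:
            return task(mem)
        except OOMError:
            if mem >= max_mb:
                raise  # even the ceiling was not enough; surface the failure
            mem = min(max_mb, mem * factor)

# A task that needs 6 GB succeeds after two escalations (2 -> 4 -> 8 GB),
# while typical tasks keep running at the cheap 2 GB baseline.
def needs_6gb(mem_mb):
    if mem_mb < 6144:
        raise OOMError
    return f"ok with {mem_mb} MB"

print(run_with_elastic_memory(needs_6gb))  # ok with 8192 MB
```

The payoff mirrors the article's claim: the common case pays only for average usage, and the rare memory-hungry task is handled automatically instead of failing the whole job or forcing global over-provisioning.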
Why it matters: Transitioning to GPU serving for lightweight ranking allows engineers to deploy sophisticated architectures like MMOE-DCN. This shift significantly improves prediction accuracy and business metrics without sacrificing the strict latency requirements of real-time recommendation systems.
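For readers unfamiliar with the MMoE half of MMOE-DCN, a multi-gate mixture-of-experts forward pass can be sketched in a few lines: all tasks share the same expert outputs, but each task mixes them with its own softmax gate. The experts and gate logits below are toy values, not the production architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mmoe_forward(x, experts, gates):
    """Multi-gate mixture-of-experts: every task reuses the shared expert
    outputs but weights them with its own learned gate.

    x: input feature vector
    experts: list of functions mapping x -> a scalar representation
    gates: per-task lists of gate logits (one logit per expert)
    """
    expert_out = [e(x) for e in experts]
    task_out = []
    for logits in gates:
        w = softmax(logits)
        task_out.append(sum(wi * ei for wi, ei in zip(w, expert_out)))
    return task_out

# Two toy experts shared by two tasks with opposite gate preferences.
experts = [lambda x: sum(x), lambda x: max(x)]
gates = [[2.0, 0.0],   # task A leans on the "sum" expert
         [0.0, 2.0]]   # task B leans on the "max" expert
print(mmoe_forward([1.0, 2.0, 3.0], experts, gates))
```

The gating is cheap, but the experts themselves are wide matrix multiplies in practice, which is exactly why this family of models becomes attractive once serving moves to GPUs.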
Why it matters: Transitioning from batch to real-time ingestion is critical for modern data-driven apps. Pinterest's architecture shows how to use CDC and Iceberg to reduce latency from days to minutes while cutting costs and ensuring compliance through efficient row-level updates and unified pipelines.
Why it matters: Moving beyond Two-Tower models allows for more expressive ranking but introduces massive latency. This architecture demonstrates how to integrate heavy GPU inference into real-time stacks by optimizing feature fetching and moving business logic to the device.
Why it matters: This article demonstrates how to scale personalized recommendation systems using transformer-based sequence modeling. It provides a blueprint for transitioning from coarse-grained to fine-grained candidate generation, improving ad relevance and efficiency in large-scale production environments.
Why it matters: This case study demonstrates how to scale multimodal LLMs for production by combining expensive VLM extraction with efficient dual-encoder retrieval. This architecture allows platforms to organize billions of items into searchable collections while maintaining high precision and low operational costs.
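The cost split described above can be sketched as follows: the expensive VLM pass runs once offline to produce item embeddings, and organizing items into collections then reduces to a cheap dot product against precomputed collection embeddings. Collection names and vectors here are toy assumptions, not the article's data.

```python
def assign_to_collection(item_vec, collections):
    """Dual-encoder assignment: item embeddings (from a one-time, costly
    VLM extraction pass) are matched to collection embeddings with a
    cheap dot product, so billions of items can be organized without
    re-running the VLM at serving time."""
    best_name, best_score = None, float("-inf")
    for name, cvec in collections.items():
        score = sum(a * b for a, b in zip(item_vec, cvec))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy collection embeddings in a 2-d space.
collections = {
    "home_decor":   [0.9, 0.1],
    "outdoor_gear": [0.1, 0.9],
}
print(assign_to_collection([0.8, 0.2], collections))  # home_decor
```

In production this brute-force scan would be replaced by an approximate nearest-neighbor index, but the economics are the same: pay the VLM cost once per item, then serve with embedding arithmetic.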