Redundant processing of duplicate URLs wastes massive computational resources. This automated, data-driven approach to normalization reduces infrastructure costs and improves data quality by identifying content identity before expensive rendering or ingestion steps occur.
Shanhai Liao | Senior Software Engineer, Content Acquisition and Media Platform; Di Ruan | Senior Staff Software Engineer, Content Acquisition and Media Platform; Evan Li | Senior Engineering Manager, Content Acquisition and Media Platform
Accurate content understanding underpins Pinterest’s ability to drive distribution and engagement. This requires deep insight not just into the image itself, but also the outbound links or items to which those images point. At the foundation of this process lies a deceptively simple problem: URL normalization.
When Pinterest ingests content from millions of merchant domains, the same product page often appears under many different URLs. A single pair of shoes might be referenced by dozens of URL variations — each one decorated with different tracking parameters, session tokens, or analytics tags. While downstream systems can eventually deduplicate by content identity, the inability to recognize these duplicates at the URL level means every variation is independently fetched, rendered, and processed. At scale, this redundant ingestion and processing represents a significant waste of computational resources — rendering the same page dozens of times simply because its URLs differ in irrelevant parameters.
Item canonicalization — ensuring that identical items represented by different URLs are unified — is critical for organizing shopping catalogs and presenting a consistent experience to users. For many partners, a provided item ID determines canonical identity, but in its absence, the onus falls to advanced URL normalization to deduplicate effectively.
This post details the technical journey behind the Minimal Important Query Param Set (MIQPS) algorithm: a system that automatically learns which URL parameters matter for content identity, enabling dynamic and precise URL normalization at scale.
Consider a typical product URL from an e-commerce site:
https://example.com/shoes?id=42&color=red
This URL identifies a specific product variant. But in practice, the same product page is often reached through URLs like:
https://example.com/shoes?id=42&color=red&utm_source=facebook&session=abc123
https://example.com/shoes?id=42&color=red&ref=pinterest&click_id=xyz
https://example.com/shoes?id=42&color=red&tracking=campaign_spring
Figure 1: The URL duplication problem. Multiple URLs with different tracking parameters all resolve to the same product content.

The parameters utm_source, session, ref, click_id, and tracking are all neutral: they don’t change the content of the page. Meanwhile, id and color are non-neutral: they determine which product and variant are displayed.
The challenge is distinguishing between the two. For well-known e-commerce platforms, this can be solved with curated rules. Shopify URLs, for example, use the variant parameter as the key product differentiator. Salesforce Commerce Cloud uses parameters like start, sz, prefn1, and prefv1. For these platforms, static allowlists are sufficient.
But Pinterest ingests content from a large number of domains operating on a wide variety of platforms. For this long tail of domains, URL parameter conventions vary wildly, and static rules cannot scale to cover them all. We need a dynamic, data-driven approach.
The core insight behind MIQPS is straightforward: if removing a query parameter changes the content of a page, that parameter is important; if it doesn’t, the parameter is noise and can be safely stripped. Crucially, this analysis runs independently per domain — each merchant site gets its own MIQPS map, because the same parameter name can be meaningful on one domain and irrelevant on another.
The algorithm operates in three steps.
As Pinterest’s content ingestion pipeline processes URLs from domains, the system accumulates a corpus of observed URLs per domain. This corpus is stored durably and represents a snapshot of all the URL variations seen for a given domain. It serves as the input to the MIQPS analysis.
Not all URLs from a domain share the same set of query parameters. A product page URL might carry {id, color, utm_source} while a category page might carry {category, page, sort}. Analyzing them together would be meaningless.
Moreover, the same parameter name can play different roles depending on its context. Consider the parameter `ref`: on a product page URL like `example.com/product?id=42&ref=homepage`, `ref` is purely a tracking parameter and is neutral; removing it doesn’t change the product displayed. But on a comparison page URL like `example.com/compare?ref=99`, the same `ref` parameter identifies which items to compare and is non-neutral. By grouping URLs by their full parameter pattern, the algorithm evaluates each parameter within its specific context, correctly classifying it as neutral in one pattern and non-neutral in another.
To address this, the algorithm groups URLs by their query parameter pattern — the sorted set of parameter names present in the URL. For example:
https://example.com/shoes?id=42&color=red&utm_source=facebook → pattern {color, id, utm_source}
https://example.com/shoes?id=7&color=blue&utm_source=email → pattern {color, id, utm_source}
https://example.com/list?category=boots&page=2&sort=price → pattern {category, page, sort}
URLs sharing the same query pattern are grouped together. The top K patterns by URL count are selected for analysis, focusing computational resources on the patterns that matter most.
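To make the grouping step concrete, here is a minimal Python sketch. It assumes a flat list of observed URLs for a single domain; the function names and the use of Python’s standard urllib are illustrative, not a description of Pinterest’s actual implementation.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qsl

def query_pattern(url: str) -> tuple:
    """Return the sorted set of query parameter names, used as the grouping key."""
    names = {k for k, _ in parse_qsl(urlparse(url).query, keep_blank_values=True)}
    return tuple(sorted(names))

def top_k_patterns(urls: list, k: int) -> dict:
    """Group a domain's URL corpus by query pattern and keep the k largest groups."""
    groups = defaultdict(list)
    for url in urls:
        groups[query_pattern(url)].append(url)
    ranked = sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)
    return dict(ranked[:k])
```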
For each query parameter within a pattern, the algorithm determines whether it is neutral or non-neutral through empirical testing (sketched in code after these steps):
1. Sample: Select up to S URLs with distinct values for the parameter under test.
2. Compare: For each sampled URL, compute the content ID — a fingerprint derived from the page’s rendered visual content — for both:
- The original URL (with the parameter present)
- A modified URL (with the parameter removed)
3. Classify: If removing the parameter changes the content ID in at least T% of samples, the parameter is classified as non-neutral (important). Otherwise, it is neutral (safe to drop).
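Here is a hedged sketch of that test in Python. `content_id` is a stand-in for the rendering-based fingerprint described below, and `s` and `t` correspond to the S and T% knobs above; the helper names and default values are assumptions for illustration.

```python
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

def drop_param(url: str, param: str) -> str:
    """Return the URL with one query parameter removed, others untouched."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    return urlunparse(parts._replace(query=urlencode(kept)))

def sample_distinct_values(urls: list, param: str, s: int) -> list:
    """Pick up to s URLs whose values for `param` are pairwise distinct."""
    seen, picked = set(), []
    for url in urls:
        value = dict(parse_qsl(urlparse(url).query, keep_blank_values=True)).get(param)
        if value is not None and value not in seen:
            seen.add(value)
            picked.append(url)
        if len(picked) == s:
            break
    return picked

def is_non_neutral(urls: list, param: str, content_id, s: int = 10, t: float = 0.5) -> bool:
    """Classify `param` as non-neutral if dropping it changes the content ID
    in at least a fraction t of up to s sampled URLs."""
    sample = sample_distinct_values(urls, param, s)
    if not sample:
        return True  # no evidence either way: err on the side of keeping the parameter
    changed = sum(content_id(u) != content_id(drop_param(u, param)) for u in sample)
    return changed / len(sample) >= t
```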
The content ID is a hash of the page’s visual representation, meaning two URLs that render the same visible content will produce the same content ID, even if their underlying HTML differs slightly. This particular fingerprinting approach leverages Pinterest’s in-house page rendering infrastructure, which is tailored to our content pipeline. The core MIQPS algorithm, however, is agnostic to how the content fingerprint is produced — it only requires a function that returns the same identifier for the same page content. Third parties looking to adopt a similar approach could substitute alternatives such as DOM tree hashing, HTTP response body checksums, or even simpler heuristics like comparing the `<title>` and Open Graph metadata across URL variants. The key principle remains the same: compare some representation of the page content with and without each parameter to determine its importance.
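For readers without rendering infrastructure, a deliberately simple fingerprint along the lines of the `<title>`/Open Graph heuristic above might look like the sketch below. It assumes the third-party `requests` and `beautifulsoup4` packages and is a rough stand-in, not Pinterest’s production content ID.

```python
import hashlib
import requests
from bs4 import BeautifulSoup

def metadata_content_id(url: str) -> str:
    """Hash the page title plus Open Graph tags as a cheap content fingerprint."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = (soup.title.string or "") if soup.title else ""
    og_tags = sorted(
        f"{m.get('property')}={m.get('content')}"
        for m in soup.find_all("meta")
        if (m.get("property") or "").startswith("og:")
    )
    return hashlib.sha256("\n".join([title] + og_tags).encode("utf-8")).hexdigest()
```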
A natural question is: why not simply use the **canonical URL** declared in the page’s HTML (via the `<link rel="canonical">` tag) to resolve duplicates? If the merchant provides a canonical URL, two variant URLs pointing to the same product should share the same canonical, making deduplication trivial. In practice, however, canonical URLs are unreliable at scale. Many merchant sites omit them entirely, set them incorrectly (e.g., pointing every page to the homepage), or include tracking parameters in the canonical URL itself. Because we cannot assume canonical URLs are present or correct across the long tail of merchant domains, MIQPS uses visual content comparison as a ground-truth signal that works regardless of how well-maintained a site’s metadata is.
The behavior of the MIQPS algorithm is governed by a small set of tunable parameters:
- K: the number of top query patterns (by URL count) analyzed per domain
- S: the maximum number of URLs sampled when testing a single parameter
- T: the percentage of samples whose content ID must change for a parameter to be classified as non-neutral
- A: the percentage of anomalous patterns above which a newly computed MIQPS is rejected (used by the anomaly detection step described below)
Two additional design choices make the algorithm practical at scale:
Figure 2: The MIQPS computation pipeline.
The output of this pipeline is a MIQPS map: a mapping from each query parameter pattern to the set of non-neutral parameters within that pattern. This map is published to a configuration store and consumed at runtime during URL normalization.
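The post doesn’t spell out the map’s exact structure, but based on the description it can be pictured as something like the following, reusing the illustrative patterns from earlier:

```python
# Illustrative MIQPS map for one domain: query pattern -> non-neutral parameters.
MIQPS_MAP = {
    ("color", "id", "session", "utm_source"): {"id", "color"},
    ("category", "page", "sort", "utm_campaign"): {"category", "page", "sort"},
    ("ref",): {"ref"},  # e.g., a comparison-page pattern where ref is non-neutral
}
```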

MIQPS does not operate in isolation. In production, URL normalization combines static rules with the dynamically computed MIQPS. Static rules capture known conventions — curated allowlists for recognized e-commerce platforms and regex patterns for widely used parameter naming schemes. These rules handle cases where we already have high confidence about which parameters matter.
MIQPS complements these static rules by covering the long tail of domains where no predefined rules exist. A URL parameter is kept if it is matched by either the static rules or the MIQPS non-neutral set. Only parameters that pass neither check are stripped. This combination ensures broad coverage: static rules provide immediate, reliable handling for known platforms, while MIQPS dynamically adapts to everything else.
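Putting the two together, the runtime path might look like the sketch below. The specific allowlist entries, the regex, and the choice to sort surviving parameters are assumptions for illustration; the keep-if-either-matches logic follows the description above.

```python
import re
from urllib.parse import urlparse, urlunparse, urlencode, parse_qsl

STATIC_KEEP = {"variant", "id", "color", "start", "sz"}  # curated allowlist (illustrative)
STATIC_KEEP_RE = re.compile(r"^(prefn|prefv)\d+$")       # pattern-based rule (illustrative)

def normalize(url: str, miqps_map: dict) -> str:
    """Strip parameters that pass neither the static rules nor the MIQPS check."""
    parts = urlparse(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    pattern = tuple(sorted({k for k, _ in params}))
    non_neutral = miqps_map.get(pattern, set())
    kept = sorted(
        (k, v) for k, v in params
        if k in STATIC_KEEP or STATIC_KEEP_RE.match(k) or k in non_neutral
    )
    return urlunparse(parts._replace(query=urlencode(kept)))
```

With the illustrative map above, all three tracking-decorated shoe URLs from Figure 1 collapse to the same normalized form, `https://example.com/shoes?color=red&id=42`.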
Computing MIQPS is inherently dependent on external page rendering. Pages can change, rendering infrastructure can have transient issues, and a domain’s URL structure can shift between analysis runs. Without safeguards, a bad MIQPS computation could cause the system to start dropping parameters that are actually important — leading to content deduplication errors and degraded catalog quality.
To address this, the system includes an anomaly detection layer that compares each newly computed MIQPS against the previously published version. The comparison follows a set of conservative rules:
If more than A% of existing patterns are flagged as anomalous, the entire MIQPS update is rejected and the previous version is retained. This ensures the system never regresses — it errs on the side of over-keeping parameters rather than accidentally dropping ones that affect content identity.
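The exact comparison rules aren’t spelled out in this post, but one plausible conservative check, consistent with the never-regress behavior described above, is to flag any pattern whose previously non-neutral parameters would newly be dropped:

```python
def should_publish(new_map: dict, old_map: dict, a: float = 0.1) -> bool:
    """Reject the update if more than a fraction `a` of existing patterns regress.
    A pattern regresses if a previously non-neutral parameter became neutral.
    `a` stands in for the A% threshold; 0.1 is an illustrative default."""
    shared = set(old_map) & set(new_map)
    if not shared:
        return True
    anomalous = sum(1 for p in shared if not old_map[p] <= new_map[p])
    return anomalous / len(shared) <= a
```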
The MIQPS system fits into Pinterest’s content processing pipeline as follows:
Figure 3: End-to-end system architecture.

The architecture has three distinct phases:
- Corpus collection: the ingestion pipeline accumulates observed URLs per domain.
- Offline MIQPS computation: the analysis pipeline computes the MIQPS map and publishes it to the configuration store.
- Runtime normalization: the content pipeline consults the published map to strip neutral parameters.
This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup.
An alternative design would be to determine parameter importance **in realtime** — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons:
- Latency: Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline.
- Cost: Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive.
- Reliability: Transient rendering failures in an offline job are isolated and retryable. In a realtime path, they would directly block content processing.
In practice, the offline approach is a natural fit because URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity.
URL normalization may seem like a mundane infrastructure problem, but at Pinterest’s scale — with a large number of domains and billions of URLs — getting it right has outsized impact on content quality.
The MIQPS algorithm brings several key properties to this challenge:
- Data-driven: parameter importance is learned empirically from observed URLs rather than hand-curated.
- Domain-specific: each domain gets its own MIQPS map, so the same parameter name can be treated differently across sites.
- Grounded in content identity: classification is based on rendered content fingerprints, not unreliable metadata like canonical tags.
- Safe to operate: anomaly detection ensures a bad computation cycle never regresses the published map.
By aligning normalization strategies with proven content identity signals, MIQPS ensures every unique item or experience is surfaced cleanly — improving search and recommendations, downstream catalog management, and ultimately the user experience.