Dmitry Kislyuk | Director, Machine Learning; Ryan Galgon | Director, Product Management; Chuck Rosenberg | Vice President, Engineering; Matt Madrigal | Chief Technology Officer

Foreword from Bill Ready, CEO
The AI landscape is undergoing a fundamental shift, and it’s not the one you think. The competitive frontier isn’t only about building the largest proprietary models. There are two other major trends emerging that haven’t had enough discussion:
Open-source models have made tremendous strides, especially on cost relative to performance.
Compact, fit-for-purpose models can meaningfully outperform general-purpose LLMs on specific tasks, and do so at dramatically lower cost.
Our Chief Technology Officer and AI team share how we are using open-source AI models at Pinterest to achieve similar performance at less than 10% of the cost of leading, proprietary AI models. They also share how Pinterest has built in-house, fit-for-purpose models that are able to significantly outperform leading, proprietary general purpose models.
The race to build the largest, most powerful models is profound and meaningful. If you want to see a thriving ecosystem of innovation in an AI-driven world, you should also want to see a thriving open-source AI community that creates democratization and transparency. It’s a good thing for us all that open source is in the race.
For our part, we’ll continue to share our findings in leveraging open-source AI so that more companies and builders can benefit from the democratizing effect of open-source AI.
Pinterest helps users worldwide search, save, and shop for the best ideas through our visual AI capabilities. These capabilities are powered by a mix of models operating across different modalities; a recent development is that for applications requiring LLMs and VLMs¹, we have found significant advantages in adapting open-source models with Pinterest’s unique data and existing technologies. As a result, Pinterest has been shifting more of our AI investments toward fine-tuned open-source models, achieving similar quality at a fraction of the cost, particularly for visual and multimodal tasks. This shift reflects a broader industry trend: core LLM architectures are commoditizing, while differentiation increasingly comes from domain-specific data, personalization, and product integration.
It is worth taking a closer look at the technical strategy behind foundation models at Pinterest. Just because a capability can be built in-house does not mean it should or must be. The build-versus-buy-versus-adapt tradeoff is a well-understood concept in the industry, and AI models are no different. At Pinterest, we structure our thinking about this question by looking at the primary modality each foundation model is optimized for.
What we are observing now is that open-source multimodal LLM architectures have begun to level the playing field. Critically, across many product categories at Pinterest, the core differentiation is shifting to the ability to fine-tune models with domain-specific data and to invest in end-to-end optimization and integration.
The trend toward domain-specific data and deep product integration as a core differentiator can be seen as a reversion to a common trend in the ML industry. In the first decade of the AlexNet era, core architectures were routinely commoditized, and either fine-tuning open-source models or training models on specific web-scale datasets was the most common form of development. We saw this first-hand with our development of various visual encoders (e.g. UVE, PinCLIP), where training embedding models from scratch on Pinterest image and visual search data has yielded meaningful retrieval gains over off-the-shelf embedding models⁴. Recently, we’ve also seen this with Pinterest Canvas, our image generation foundation model, where tuning an internally-trained diffusion model for specific image editing and enhancement use-cases with Pinterest data has thus far yielded better results than using larger but more general-purpose visual generative models.
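The internals of encoders like UVE and PinCLIP are not spelled out here, but the contrastive training signal this family of models is typically built on can be sketched concisely. Below is a minimal numpy illustration of the symmetric CLIP-style contrastive (InfoNCE) objective; the function name, batch construction, and temperature value are illustrative assumptions, not Pinterest's actual training code.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays where row i of each is a matching pair.
    """
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matching pairs sit on the diagonal

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Average the image->text and text->image directions, as in CLIP.
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from every other caption in the batch, which is what makes the resulting encoders usable for retrieval. Training on domain-specific pairs (e.g. Pinterest images with search queries rather than generic web captions) changes what "similar" means to the model, which is one way domain data differentiates an otherwise commodity architecture.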
Our most recent data point in this trend comes from the beta launch of Pinterest Assistant in October of this year. We can think of the Pinterest Assistant as being broken down into two sets of ML technologies. First, there is an underlying set of multimodal retrieval systems, recommendation services, and specialized generative models (including other LLMs) that serve as tools for an agentic LLM to invoke. These tools are predominantly Pinterest-native and rely on our user and visual foundation models.
And second, there is the core multimodal LLM itself, which oversees the agentic loop and is responsible for query understanding, query planning, and effective tool calling. The key point is that this LLM acts as an intelligent router that recursively delegates much of the recommendation and agentic work to the aforementioned Pinterest-native tools. In this design, the biggest levers for product improvement are scaling the quality of the tools and scaling test-time compute (e.g., breaking a call down into more advanced steps), as opposed to focusing solely on using the largest core LLM possible. Indeed, as comparisons of LLMs start showing small or negligible differences, we have observed that open-source solutions meet our product needs; we get more value by building out more domain-specific tools, fine-tuning for product-specific use cases, and optimizing for latency. There are benefits we are particularly excited about as we adopt more open-source multimodal LLMs at Pinterest.
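The router-over-tools design described above can be sketched in a few lines. Everything here is hypothetical — the tool names, the plan format, and the keyword-based planner (a stand-in for the core LLM's query-planning step) are illustrative assumptions, not Pinterest Assistant's actual interfaces.

```python
# Hypothetical agentic-router sketch: a small set of domain-specific tools
# and a planner that decides which to invoke. Tool names are illustrative.
TOOLS = {
    "visual_search":  lambda q: f"pins visually similar to {q!r}",
    "recommend":      lambda q: f"personalized recommendations for {q!r}",
    "generate_image": lambda q: f"generated variation of {q!r}",
}

def plan(query):
    """Stand-in for the core multimodal LLM's query-planning step.

    A real system would prompt the LLM to emit structured tool calls;
    here we fake the routing with keyword rules to keep the sketch runnable.
    """
    if "similar" in query or "like this" in query:
        return [("visual_search", query)]
    if "edit" in query or "variation" in query:
        return [("generate_image", query)]
    return [("recommend", query)]

def run_agent(query):
    # The agentic loop: plan, invoke each selected tool, collect results.
    return [TOOLS[name](arg) for name, arg in plan(query)]
```

The sketch makes the scaling argument concrete: product quality improves when the entries in `TOOLS` get better or when `plan` is allowed more steps, independent of how large the planning model itself is.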
Looking ahead, ML and AI capabilities at Pinterest will continue to be powered by a mix of internally developed foundation models, fine-tuned open-source models, and licensed third-party models. In addition, third-party AI platforms are widely used by Pinterest engineering teams for coding tools, internal productivity, and rapid prototyping. However, the scalability advantages and capability gains from all forms of internally hosted models, whether trained from scratch or fine-tuned, are leading to a change in technology defaults at Pinterest. Furthermore, the development of model families that provide generative capabilities across a variety of latency and throughput requirements has enabled a development pattern where product teams prototype and iterate with third-party models, while ML teams develop more scalable and personalized internal models for the relevant capability.
How long will this open-source trend hold? We can only make an educated guess. The large-scale buildout of AI data centers may result in more step-function jumps in quality and emergent capabilities for proprietary models. In parallel, the supply-side growth in chip production may drive down fine-tuned open-source inference costs even further. Either way, our strategy at Pinterest to bring inspiration to all of our users will remain the same: leverage our visual, graph, and recommendation data to build the best and most efficient models we can, and address any capability gaps by partnering with third-party providers, alongside regular research & development from Pinterest Labs.
¹We use “LLM” to refer to both text-only models and multimodal visual LLMs, which contain an image encoder and are sometimes referred to as Visual Language Models (VLMs). Most applications of generative models at Pinterest require visual inputs, so internally we assume multimodal capabilities as a default.
²Visual models are commonly trained with text supervision via contrastive learning or other forms of conditioning. But the dominant training signal for the model remains the raw visual input.
³Most LLMs benefit from a mix of modalities, with VLMs designed explicitly for this purpose. However, pre-training remains focused on an autoregressive text token prediction task, which is why we characterize them as text-dominant models.
⁴For example, we have seen our PinCLIP system outperform state-of-the-art open-source multimodal embeddings by more than 30% on core retrieval tasks.
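The 30% figure above comes from Pinterest's internal benchmarks, whose details are not published here. As a generic illustration of how such retrieval comparisons are typically scored, here is a minimal recall@k sketch; the function name and the cosine-similarity setup are our assumptions, not Pinterest's evaluation code.

```python
import numpy as np

def recall_at_k(query_emb, index_emb, ground_truth, k=10):
    """Fraction of queries whose ground-truth item appears in the
    top-k nearest neighbors by cosine similarity.

    query_emb: (n_queries, dim), index_emb: (n_items, dim),
    ground_truth: correct item index for each query.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    x = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = q @ x.T                               # (n_queries, n_items)
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of k most similar items
    hits = (topk == np.asarray(ground_truth)[:, None]).any(axis=1)
    return hits.mean()
```

Running this metric over the same query/item set with two different embedding models — one fine-tuned on domain data, one off-the-shelf — is the standard way to quantify a relative retrieval gain like the one cited.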
On the (re)-prioritization of open-source AI was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.