This article demonstrates how to overcome legacy observability challenges by pragmatically integrating AI agents and context engineering, offering a blueprint for unifying fragmented data without costly overhauls.
Marcel Mateos Salles | Software Engineer Intern; Jorge Chavez | Sr. Software Engineer; Khashayar Kamran | Software Engineer II; Andres Almeida | Software Engineer; Peter Kim | Manager II; Ajay Jha | Sr. Manager
At Pinterest, inspiration isn’t just for our users — it shapes how we build and care for our platform. Until recently, our own observability (o11y) tools told a fragmented story: logs over here, traces over there, and metrics somewhere else. We’ve always excelled at collecting signals: time-series metrics, traces, logs, and change-related events. But without the seamless context and unity now promised by open standards like OpenTelemetry (OTel), we were missing the big picture: the full narrative behind every anomaly and alert. This is a common reality for mature, large-scale infrastructures: systems that predate the widespread adoption of such standards are composed of powerful but disconnected data silos. We addressed this challenge pragmatically, leveraging AI agents through a centralized Model Context Protocol (MCP) server to bridge the gaps without mandating a complete infrastructure overhaul.
The Pinterest Observability team is charting a new course that meets the moment. We’re working both left and right: “shift-left” practices to bake better logging and instrumentation into the heart of our code, and “shift-right” strategies to keep production observability robust and responsive. Still, we know that tools alone aren’t enough; the real breakthrough comes from bringing more intelligence and context into the mix. We are embracing the new era of AI, and at its core are the Model Context Protocol (MCP), Agent2Agent (A2A), and context engineering: a new way to bring all our observability signals together and feed them into intelligent agents. Beginning with the MCP server, we aim to make every major pillar of observability data available in a unified, contextual stream.
Observability analysis systems can dig deep: asking the right questions, following clues across logs, metrics, traces, and change events, and iteratively building insight, much like a Pinterest board comes together piece by piece. The result? Faster, clearer root-cause analysis and actionable guidance for our engineers, right where they need it. This isn’t just about connecting yesterday’s silos; it’s about creating new frontiers for discovery and problem-solving, empowering every Pinterest team to build their own context-aware tools and shape observability that grows with us.
The field of observability (o11y) reaches a major turning point every few years; the most recent came when OpenTelemetry (OTel) and similar tools entered the picture. These tools facilitate the o11y process by enabling context propagation across the different pillars of o11y data while remaining vendor- and language-agnostic. For example, under a single SDK, you can generate metrics, logs, and traces that share an ID, allowing for correlation and connections between those otherwise distinct data pillars.
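Conceptually, the correlation that OTel enables can be sketched in a few lines of plain Python: a trace ID is minted once per request and stamped onto every signal, so any one pillar can be joined to the others. This is an illustrative sketch with hypothetical function names, not the real OTel SDK.

```python
import time
import uuid

def new_trace_context():
    """Mint a shared context; a real SDK propagates this automatically."""
    return {"trace_id": uuid.uuid4().hex}

def emit_log(ctx, message):
    return {"trace_id": ctx["trace_id"], "ts": time.time(), "msg": message}

def emit_metric(ctx, name, value):
    return {"trace_id": ctx["trace_id"], "name": name, "value": value}

def emit_span(ctx, name):
    return {"trace_id": ctx["trace_id"], "span": name}

ctx = new_trace_context()
log = emit_log(ctx, "request failed")
metric = emit_metric(ctx, "error_count", 1)
span = emit_span(ctx, "handle_request")

# All three pillars now carry the same ID, so a log line can be traced
# back to its span and the metric spike it contributed to.
assert log["trace_id"] == metric["trace_id"] == span["trace_id"]
```

Without such a shared ID baked in at generation time, that join has to be reconstructed after the fact, which is exactly the gap described below.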
However, our o11y infrastructure was set up before conventions and tools like OTel were available, and it is not feasible to overturn our entire o11y infrastructure to incorporate them into our stack, so we lack the built-in correlation they provide. We had to implement separate tools and pipelines for ingesting logs, metrics, and traces from our services. The result is a strong yet fragmented system in which each pillar is constrained to its own domain, with no clear matching across data points. An on-call engineer must jump between multiple unique interfaces when root-causing an issue, potentially losing valuable time, and the steep learning curve of each pillar’s tools extends that loss for newer engineers. Consequently, building advanced o11y analysis that leverages machine learning or other techniques to holistically understand the health of our systems poses non-trivial challenges for the o11y team.

Knowing these limitations, the o11y team here at Pinterest is committed to closing these gaps through what we call “shifting-left” and “shifting-right.” When shifting left, we prioritize the integration and standardization of o11y practices and tools, which facilitates the proactive identification and resolution of issues. When shifting right, we focus on maintaining system visibility in production through our alerting and health-inferencing systems.
This means that we have to continue to innovate and connect the dots across our pillars while ensuring teams can continue to monitor the health of their services and quickly solve problems when they arise.
Enter the era of AI and agents. What if our limitations didn’t truly matter? We could provide our data to Large Language Models (LLMs) acting as agents and have them connect the dots for us: find correlations, return meaningful information to our users in a single interface, facilitate the root-causing process, and, in the future, evolve into a system that autonomously resolves issues as they arise. We are working toward that future and are excited to share the work we have taken up in that regard.
An AI agent is only as good as the information that it has access to, so we knew that we had to build a system that would be able to provide our o11y agents with as much relevant data as possible. LLMs are impressive on their own, but with some real context engineering behind them, they become so impressive that you begin to feel like you are living in the advanced future from your favorite Sci-Fi movies and shows.
Different techniques have sprung up recently to facilitate sharing context with an agent. The most prominent and widely accepted is the Model Context Protocol (MCP), released by Anthropic in late 2024. This protocol has become the new standard and a staple of agentic projects for companies and enthusiasts alike. In short, it provides an agent with different tools that it can utilize when working to resolve a request, giving it the flexibility to choose what to use (if it calls anything at all) as it organically works through a task with its reasoning and newfound information. MCP was the perfect fit to help us sidestep our limitations and begin to drive Pinterest o11y into a new era, as it grants the following:

And so, the o11y team’s very own MCP server was born. It is now available internally for Pinterest engineers to use and is a central part of our move towards autonomous o11y. Currently, it provides models with tooling for accessing the following data:
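To make the pattern concrete, here is a minimal sketch of the tool-exposure idea behind an MCP server: tools are registered with names and descriptions that the agent can discover, and the agent then chooses which to invoke. All names here are hypothetical; this is neither the real MCP wire protocol nor our internal implementation.

```python
# Registry mapping tool names to their descriptions and implementations.
TOOLS = {}

def tool(name, description):
    """Decorator that registers a function as an agent-callable tool."""
    def register(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("query_metrics", "Fetch time-series metrics for a service and window")
def query_metrics(service, minutes=15):
    return f"metrics for {service} over last {minutes}m"

@tool("search_logs", "Search recent log lines for a service")
def search_logs(service, pattern):
    return f"logs for {service} matching {pattern!r}"

def list_tools():
    """What the agent sees on connect: tool names plus descriptions."""
    return {name: meta["description"] for name, meta in TOOLS.items()}

def call_tool(name, **kwargs):
    """Dispatch a tool invocation the agent chose to make."""
    return TOOLS[name]["fn"](**kwargs)
```

The key property is that the agent is never forced down a fixed path: it reads `list_tools()`, reasons about the task, and calls only what it needs.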
Its development was a great experience and taught us an important lesson about applied AI, one partly a consequence of our data but a limitation anyone building with agents should consider: the model’s context size. Going in, we overestimated how much information a model could take in while underestimating how much data we own as a team. The o11y team processes around 3 billion data points per minute, 12 billion keys (tag key/value combinations) per minute, 7 TB of logs per day, and 7 TB of traces per day — no small amount of data! If we let an agent look through this data organically, it would query for too much at a time (even with only a 15-minute window), overflowing its context window and crashing. We came up with two main solutions to prevent this, the first a short-term measure while we test the other:
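As an illustration of the kind of short-term guard this calls for (a hypothetical sketch, not necessarily our exact implementation), a tool response can be capped to a rough token budget before it ever reaches the agent:

```python
def approx_tokens(text):
    """Rough heuristic: ~4 characters per token. An assumption, not exact."""
    return len(text) // 4

def cap_to_budget(rows, budget_tokens=2000):
    """Keep only as many rows as fit in the budget, noting what was cut.

    The goal is that no single tool response can overflow the caller's
    context window, whatever the underlying query returned."""
    kept, used = [], 0
    for row in rows:
        cost = approx_tokens(row)
        if used + cost > budget_tokens:
            kept.append(f"[truncated: {len(rows) - len(kept)} more rows omitted]")
            break
        kept.append(row)
        used += cost
    return kept
```

A truncation marker matters: the agent learns that more data exists and can narrow its next query instead of assuming it saw everything.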
We are also testing another solution with the Spark team within Pinterest, who are building a similar agent. In this design, we leverage an additional LLM within the server (with a fresh context) to summarize the data, returning only the summary to the agent connected to the MCP server, which in theory conserves a lot of context space. We still need to verify that these summaries don’t drastically degrade the agent’s performance.
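The summarization pattern might look like the following sketch, with a trivial digest standing in for the second LLM call; the function names are hypothetical.

```python
def summarize_with_llm(raw_rows):
    """Stand-in for a second LLM call made with a fresh context.

    Here we fake the summary with a trivial digest; in the real design this
    would be a model invocation whose context is discarded after the call."""
    errors = [r for r in raw_rows if "ERROR" in r]
    return (f"{len(raw_rows)} log lines scanned, {len(errors)} errors; "
            f"first error: {errors[0] if errors else 'none'}")

def query_logs_summarized(raw_rows):
    """The tool returns only the summary, conserving the caller's context."""
    return summarize_with_llm(raw_rows)
```

The raw rows are consumed server-side, so the agent pays only for the summary's tokens; the open question is how much signal the summary preserves.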
Our MCP server agent is called Tricorder Agent. It is designed to help engineers quickly analyze problems and resolve incidents, and it is part of a broader suite of new tools under development by the o11y team, collectively known as the Tricorder. An engineer can provide the Tricorder with their alert link or number and sit back while it gathers the relevant information for the investigation. Previously this was extremely time consuming, as engineers had to switch between all our interfaces and apply filters to find relevant data. Tricorder also queries our services directly to understand what is going on and hypothesize a cause, offering suggestions and next steps as it gains more information. Throughout this process, the Tricorder has pleasantly surprised us many times. For example, a lot of information is unlocked once a dependency graph becomes available: with no specific prompting to do so, the agents use tools on multiple parts of the graph, exploring all incoming and outgoing dependencies to check the overall health of connections. And when generating links and narrowing down to relevant services, they include the services in the dependency graph, knowing the problem could be stemming from them.
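The dependency-graph exploration described above can be sketched as a simple breadth-first walk over a service graph, collecting every node that is not healthy; all service names and health states here are made up for illustration.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it calls downstream.
DEPS = {
    "home-feed": ["ads-ranking", "user-store"],
    "ads-ranking": ["feature-cache"],
    "user-store": [],
    "feature-cache": [],
}
HEALTH = {
    "home-feed": "degraded",
    "ads-ranking": "degraded",
    "user-store": "healthy",
    "feature-cache": "unhealthy",
}

def explore_unhealthy(start):
    """Walk outward from the alerting service, collecting every degraded or
    unhealthy node, the way an agent explores connections on the graph."""
    seen, suspects = set(), []
    queue = deque([start])
    while queue:
        svc = queue.popleft()
        if svc in seen:
            continue
        seen.add(svc)
        if HEALTH.get(svc) != "healthy":
            suspects.append(svc)
        queue.extend(DEPS.get(svc, []))
    return suspects
```

Run from the alerting service, the walk surfaces `feature-cache` as the deepest unhealthy dependency, the kind of candidate root cause the agent would flag for the engineer.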
Autonomous Observability at Pinterest (Part 1 of 2) was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.