Soam Acharya: Principal Engineer · Rainie Li: Manager, Data Processing Infrastructure · William Tom: Senior Staff Software Engineer · Ang Zhang: Director, Big Data Platform
As Pinterest’s data processing needs grow and our current Hadoop-based platform (Monarch) ages, the Big Data Platform (BDP) team within Pinterest Data Engineering began considering alternatives for our next-generation, massive-scale data processing platform. In this blog post series, we share details of our subsequent journey, the architecture of our next-gen data processing platform, and some insights we gained along the way. In part one, we provide the rationale for our new technical direction before outlining the overall design and detailing the application-focused layer of our platform. We conclude with our current status and some of our learnings. Part two of our series, which spotlights the infrastructure-focused aspects of our platform, can be found here.
Encouraged by its growing popularity and increasing adoption in the Big Data community, we explored Kubernetes (K8s)-based systems as the most likely replacement for Hadoop 2.x. Candidate platforms had to meet the following criteria:
Armed with these requirements, we performed a comprehensive evaluation of running Spark on various platforms during 2022. We leaned towards Kubernetes-focused frameworks for the following advantages they offered:
We consider each advantage in turn.
Unlike Hadoop, Kubernetes was built first and foremost as a container orchestration system. Consequently, it provides more fine-grained support for container management and deployment than other systems. Similarly, there’s extensive built-in support for Role-Based Access Control (RBAC) and account management at the container level, as well as for networking within a Kubernetes cluster.
However, as a general-purpose system, Kubernetes does not have the built-in support for data management, storage, and processing that Hadoop does, nor does it support batch-based scheduling out of the box. As a result, we have to rely on Spark, S3, and third-party schedulers like YuniKorn, respectively, to fashion a Kubernetes-based data processing solution.
Our current deployment model in Hadoop is cumbersome, involving a mixture of Terraform, Puppet, and custom scripts. In Kubernetes, deployment is typically handled via a mixture of:
On the surface, this promised to be a more straightforward approach, with custom container images serving as an effective tool for deploying customer-specific environments.
A number of frameworks are available for Kubernetes, such as Prometheus and FluentBit for logging and monitoring, KubeCost for resource usage tracking, Spark Operator for Spark application deployment, and pluggable third party schedulers such as YuniKorn and Volcano. While not all of these frameworks apply well to our use case, their availability made it easier for us to evaluate multiple approaches to find the best fit.
The Kubernetes architecture and deployment model makes a number of levers available when optimizing for performance:
Our POC indicated we would be able to achieve acceptable performance with Spark on AWS Elastic Kubernetes Service (EKS) clusters running both x86 and ARM instances, depending on the price/performance observed for the workload.
Building a new platform that leverages Kubernetes and EKS to replace Monarch at Pinterest introduced several challenges. These included integrating EKS into the existing Pinterest environment, finding replacements for Hadoop components, ensuring compatibility with the Pinterest ecosystem, building an operational framework to observe and support the new platform, and optimizing overall cost-effectiveness through the use of Graviton instances. This required thorough understanding of EKS as well as the ecosystem of supporting tools, together with careful planning and implementation. In particular:
Figure 1 illustrates the initial high-level design of Moka, our new Spark on EKS platform. For phase one, we built a system able to process batch Spark workloads that access only non-sensitive data. In this post, we will detail our design for phase one. We have since added other functionality to Moka, such as interactive query submission and fine-grained access control (FGAC), and will present those changes at a later date.

In our first iteration, jobs are submitted and processed as follows:
Next, we provide more details on the core application-focused aspects of our platform. Infrastructure and other remaining components will be described in the second part of our blog series.
To manage the deployment and lifecycle of Spark applications at Pinterest, we decided to leverage the Spark Operator instead of running spark-submit directly. Spark Operator exposes the SparkApplication Custom Resource Definition (CRD), which allows us to define Spark applications in a declarative manner and leave it to the Spark Operator to handle all of the underlying submission details. Internally, Spark Operator still utilizes the native spark-submit command. The Spark Operator will run spark-submit in cluster mode to start the driver pod, and the driver pod will internally run spark-submit in client mode to start the driver process.
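For illustration, here is a minimal SparkApplication manifest of the kind the operator consumes. The namespace, image, class, and resource values are placeholders, not Pinterest’s actual configuration:

```yaml
# Hypothetical SparkApplication manifest; all names and values are
# illustrative. The operator translates this declarative spec into a
# spark-submit invocation in cluster mode.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-spark-job          # placeholder name
  namespace: spark-jobs            # placeholder namespace
spec:
  type: Scala
  mode: cluster                    # operator runs spark-submit in cluster mode
  image: example.registry/spark:3.2.0            # placeholder image
  mainClass: com.example.ExampleJob              # placeholder class
  mainApplicationFile: s3a://example-bucket/jars/example-job.jar
  sparkVersion: "3.2.0"
  driver:
    cores: 1
    memory: 4g
  executor:
    instances: 10
    cores: 2
    memory: 8g
```

Defining jobs this way lets the operator own the full application lifecycle (submission, status tracking, restart policy) rather than each client shelling out to spark-submit.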

During our evaluation process and migration, we identified several bottlenecks or scalability issues caused by the Spark Operator. Here are some of the issues we found and how we addressed them:
When the number of pods in K8s reaches a certain threshold (in our case 12.5k) defined by the PodGC controller configuration, the PodGC controller triggers cleanup of terminal pods. We observed cases where a driver pod completed and was cleaned up before the Spark Operator had a chance to retrieve the pod status and update the SparkApplication. In this case, the Spark Operator incorrectly interprets the missing pod and marks the SparkApplication as FAILED. To prevent premature cleanup of the driver pod by the PodGC controller, we utilize pod templates to add a finalizer to all Spark driver pods upon creation. If a finalizer exists on a pod, it prevents the PodGC controller from removing it. We added logic to the Spark Operator to remove the finalizer on the driver pod only when the final status of the driver has been retrieved and the SparkApplication has transitioned to a terminal state.
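A sketch of such a driver pod template follows. The finalizer name is hypothetical; the key point is that any finalizer present in pod metadata blocks PodGC deletion until the operator removes it:

```yaml
# Driver pod template (sketch). The finalizer name below is illustrative,
# not Pinterest's actual value; any finalizer in metadata prevents the
# PodGC controller from deleting the pod until it is removed.
apiVersion: v1
kind: Pod
metadata:
  finalizers:
    - example.pinterest.com/spark-driver-status   # hypothetical finalizer
spec:
  containers:
    - name: spark-kubernetes-driver   # container name Spark expects for the driver
```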
In Moka, we utilize volume mounts to allow access to predefined host-level directories from within the pod as soon as the pod starts. For example, Normandie, an internal security process that manages certificates, exposes a FUSE endpoint on a fixed path on every Pinterest host and should be accessible as soon as the Spark process starts.
Originally, we relied on the Spark Operator’s mutating admission webhook to mutate the pod after it was created to add the volume mounts. However, as the platform scaled, we found that the increased load caused increased latency against the K8s control plane. As a mitigation, we deployed multiple Spark Operators to the platform. To fully resolve the webhook-related latency issues, we instead configured the volume mounts with pod templates, which we passed to Spark via its pod template configs. This allowed us to remove our reliance on the Spark Operator’s webhook and return to a single Spark Operator deployment that manages multiple namespaces.
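As a hedged sketch, a pod template that mounts a fixed host path (such as a Normandie-style FUSE endpoint) might look like the following; the mount path and volume name are illustrative, not Pinterest’s actual layout:

```yaml
# Pod template (sketch) that mounts a host-level directory into Spark pods
# at startup, avoiding the mutating webhook. Paths and names are hypothetical.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-driver
      volumeMounts:
        - name: normandie
          mountPath: /var/lib/normandie   # hypothetical fixed path
          readOnly: true
  volumes:
    - name: normandie
      hostPath:
        path: /var/lib/normandie
        type: Directory
```

Spark picks such templates up via its standard `spark.kubernetes.driver.podTemplateFile` and `spark.kubernetes.executor.podTemplateFile` settings.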
Our existing job submission service (Argonath, aka JSS) for Hadoop only supports job deployment to YARN clusters. To support job submission at large scale, more accurately track job status, and provide job uniqueness checks, we built Archer — a new job submission service for Spark on K8s, with a focus on EKS. Archer provides the following features: job submission, job deletion, and job information tracking. It integrates with existing user interface frameworks such as
Figure 3 focuses on the components that interact with Archer.


Archer comprises two layers:
1. Service layer
2. Worker layer
It is possible for Archer to get job status directly from the K8s api-server, which would make the state machine unnecessary. With this approach, the worker layer would only handle lightweight operations (dequeueing from the DB, calling the K8s API, balancing load between multiple Spark Operators, etc.), and we could combine the worker layer and service layer into one to simplify the design. However, keeping separate layers provides the following benefits:
Migrating Spark jobs from Hadoop to EKS is a non-trivial undertaking. We’re essentially building a new platform on top of EKS while ensuring everything remains performant and compatible. Here are a few key challenges we encountered:
To address these challenges and ensure job reliability in production, we designed and implemented a “dry run” process.

Whenever there is a production job submission to JSS, our Monarch/Hadoop job submission service, we submit the same request to Archer. Archer automatically rewrites all the prod output buckets and tables to point at test buckets, then submits the dryRun request to the Moka staging clusters. Once both the Monarch production run and the Moka staging run complete, Archer automatically triggers data validation. This includes:
With the dryRun pipeline, we were able to automatically detect unexpected failures for prod jobs in the staging environment to avoid job failures in production.
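The output-rewriting step above can be sketched in a few lines. The mapping and function names here are purely illustrative, not Archer’s actual internals:

```python
# Illustrative sketch of the dry-run rewrite step: production output
# locations are swapped for staging/test equivalents before submission.
# PROD_TO_TEST and rewrite_output_location are hypothetical names.

PROD_TO_TEST = {
    "s3://prod-bucket": "s3://dryrun-test-bucket",  # hypothetical buckets
}

def rewrite_output_location(uri: str) -> str:
    """Map a production S3 output URI onto its dry-run test bucket."""
    for prod, test in PROD_TO_TEST.items():
        if uri.startswith(prod):
            return test + uri[len(prod):]
    # Locations with no mapping are left untouched.
    return uri

print(rewrite_output_location("s3://prod-bucket/warehouse/events/dt=2023-01-01"))
```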
Once jobs pass dryRun validation, they are ready to be migrated to Moka prod. We designed and implemented the migration flow by extending the Airflow/Spinner-based Spark Operator (not to be confused with the Kubernetes Spark Operator) and Spark SQL Operator to support both the existing JSS YARN operator and the new Archer operator. The extended Airflow operators decide at runtime whether to route to Monarch or Moka based on the corresponding info in the migration DB. Overall, our dryRun framework greatly eased the Spark workflow validation and migration process.

Data shuffling is a process in which data is redistributed across partitions to enable parallel processing and better performance, which is important for big data. Figure 7 shows how this applies to MapReduce. On Hadoop YARN, we use the external shuffle service (ESS). By utilizing ESS, we can support dynamic allocation, since we gain the ability to scale down executors that no longer have the responsibility of managing shuffle data. However, there are two challenges with ESS on YARN: shuffle timeouts, and noisy-neighbor interference caused by shared disks filling up.
To enable dynamic allocation on K8s, we adopted Apache Celeborn as the Remote Shuffle Service (RSS) on Moka. Here are some of the key advantages of a remote shuffle service:
We’ve found our usage of Celeborn for RSS has improved platform reliability and stability, with overall Spark job performance improving 5% on average.
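For reference, wiring Spark to Celeborn amounts to a few spark-defaults entries along these lines. This is a sketch based on Celeborn’s documented Spark client integration, not Pinterest’s actual configuration; the shuffle manager class and keys can vary by Celeborn version:

```properties
# Illustrative spark-defaults entries for Celeborn (placeholder endpoints).
# Older Celeborn releases used a differently named shuffle manager class.
spark.shuffle.manager            org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints  celeborn-master-0:9097,celeborn-master-1:9097
spark.shuffle.service.enabled    false
spark.dynamicAllocation.enabled  true
```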
We set up a data processing pipeline to collect historical workflow resource usage from our past Monarch and Moka jobs. This pipeline generates a resource strategy based on historical data and workflow SLOs. It sets queue assignments for each workflow and populates a resource DB with routing information. When a workflow is submitted, Archer queries this resource DB and routes the workflow to a specific queue on a specific cluster.
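The routing lookup described above can be sketched as follows. The table contents, field names, and default route are hypothetical, not Pinterest’s actual schema:

```python
# Hypothetical sketch of the resource-DB routing lookup: each workflow maps
# to a target cluster and queue derived from historical usage and SLOs.

RESOURCE_DB = {
    "workflow_daily_metrics": {"cluster": "moka-prod-1", "queue": "root.analytics.tier1"},
    "workflow_adhoc_backfill": {"cluster": "moka-prod-2", "queue": "root.analytics.tier3"},
}

# Fallback used when a workflow has no recorded history yet.
DEFAULT_ROUTE = {"cluster": "moka-prod-1", "queue": "root.default"}

def route(workflow_id: str) -> dict:
    """Return the cluster/queue routing decision for a workflow."""
    return RESOURCE_DB.get(workflow_id, DEFAULT_ROUTE)

print(route("workflow_daily_metrics"))
```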

We adopted Apache YuniKorn as the scheduler on Moka, since YuniKorn provides several advantages over the default K8s scheduler.
1. YuniKorn provides queue-based scheduling & maxApplication enforcement.
YuniKorn’s queue model is very similar to YARN’s. This is useful, as queue structures are important for batch applications: they allow us to control resource allocation between different organizations, teams, and projects. YuniKorn also supports maxApplication enforcement, which was a critical feature for us on YARN. Without it, when a large number of concurrent jobs all compete for resources, a certain amount of resources is wasted, which can lead to job failure.
An example of org-based queue structure in Moka:
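As a hedged sketch, such a hierarchy in YuniKorn’s queue configuration might look like the following; the org and team names, limits, and maxapplications values are illustrative, not Pinterest’s actual layout:

```yaml
# Sketch of a YuniKorn hierarchical queue config (queues.yaml).
# All names and resource values below are hypothetical.
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: org-analytics
            maxapplications: 200      # caps concurrent apps in this queue
            resources:
              max:
                vcore: 4000
                memory: 16000Gi
            queues:
              - name: team-metrics
              - name: team-experiments
          - name: org-ads
            maxapplications: 100
```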
2. YuniKorn provides preemption.
The preemption feature allows higher-priority jobs to dynamically reallocate resources by preempting lower-priority ones, ensuring critical workloads get the resources they need. Our workload tiers are defined using the K8s priorityClass:
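For illustration, tier definitions of this kind use the standard Kubernetes PriorityClass resource. The names, values, and descriptions below are placeholders, not Pinterest’s actual tiers:

```yaml
# Example PriorityClass definitions for workload tiers (hypothetical
# names and priority values; higher value = higher priority).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-tier-1
value: 1000000
globalDefault: false
description: "Highest-priority Spark workloads; may preempt lower tiers."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: spark-tier-3
value: 1000
globalDefault: false
description: "Best-effort workloads; first to be preempted under contention."
```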
We’ve currently migrated approximately 70% of our batch Spark workloads from Monarch to Moka while managing high growth on both platforms, with all new Spark and SparkSQL workloads running on Moka by default. We expect to have all Spark workloads running on Moka by the end of the year. In tandem, we’re also working on transitioning non-Spark batch workloads to Spark, which will allow us to sunset all Hadoop clusters at Pinterest.
Regarding cost savings, we benefit from being able to run Moka clusters with ARM instance types. However, it’s been our experience that not all workloads perform well on ARM. In addition, since we’re sunsetting our Hadoop clusters and moving the same EC2 instances to Moka, many of our Moka clusters run the same CPUs as our Monarch clusters (i.e., x86-based instance types). Even so, other components in our architecture ensure we continue extracting meaningful cost efficiencies via Moka:
Moka was a massive project that necessitated and continues to require extensive cross functional collaboration between teams at multiple organizations at Pinterest and elsewhere. Here’s an incomplete list of folks and teams that helped us build our first set of production Moka clusters:
Data Processing Infra: Aria Wang, Bhavin Pathak, Hengzhe Guo, Royce Poon, Bogdan Pisica
Big Data Query Platform: Zaheen Aziz, Sophia Hetts, Ashish Singh
Batch Processing Platform: Nan Zhu, Yongjun Zhang, Zirui Li, Frida Pulido, Chen
SRE/PE: Ashim Shrestha, Samuel Bahr, Ihor Chaban, Byron Benitez, Juan Pablo Daniel Borgna
TPM: Ping-Huai Jen, Svetlana Vaz Menezes Pereira, Hannah Chen
Cloud Architecture: James Fogel, Sekou Doumbouya
Traffic: James Fish, Scott Beardsley
Security: Henry Luo, Jeremy Krach, Ali Yousefi, Victor Savu, Vedant Radhakrishnan
Continuous Integration Platform: Anh Nguyen
Infra Provisioning: Su Song, Matthew Tejo
Cloud Runtime: David Westbrook, Quentin Miao, Yi Li, Ming Zong
Workflow Platform: Dinghang Yu, Yulei Li, Jin Hyuk Chang
ML platform: Se Won Jang, Anderson Lam
AWS Team: Doug Youd, Alan Tyson, Vara Bonthu, Aaron Miller, Sahas Maheswari, Vipul Viswanath, Raj Gajjar, Nirmal Mehta
Leadership: Chunyan Wang, Dave Burgess, David Chaiken, Madhuri Racherla, Jooseong Kim, Anthony Suarez, Amine Kamel, Rizwan Tanoli, Alvaro Lopez Ortega
Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2) was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.