Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.

  • Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
  • EKS cluster deployment utilizes Terraform with a layered structure of AWS-originated and Pinterest-specific modules and Helm charts.
  • A comprehensive logging strategy is implemented for Moka, addressing EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, event logs), and system pod logs.
  • A key challenge in logging is ensuring reliable upload of Spark event logs to S3, even during job failures, for consumption by Spark History Server.
  • They are exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis.
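The event-log persistence problem above boils down to "upload must happen even when the job dies." A minimal sketch of that guarantee, assuming a generic `upload` callable standing in for an S3 client call and an `atexit` hook approximating a custom Spark listener's application-end callback — this is illustrative, not Pinterest's actual implementation:

```python
import atexit
import time

def persist_event_log(log_path, upload, attempts=3, backoff_s=0.0):
    """Try to push the local Spark event log to durable storage,
    retrying on transient failures. `upload` is a hypothetical hook
    standing in for an S3 put; it raises OSError on failure."""
    for i in range(attempts):
        try:
            upload(log_path)
            return True
        except OSError:
            time.sleep(backoff_s * (2 ** i))
    return False

def install_event_log_hook(log_path, upload):
    # Registering via atexit approximates a listener's onApplicationEnd:
    # the upload fires even if the driver's job logic failed.
    atexit.register(persist_event_log, log_path, upload)
```

A sidecar container achieves the same effect from outside the driver process, which also covers hard crashes that skip in-process hooks.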

Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.

  • Pinterest adopted an Internal Developer Platform (IDP) strategy to counter engineering velocity degradation caused by increasing complexity and tool fragmentation.
  • They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
  • PinConsole aims to provide consistent abstractions, self-service capabilities, and reduce cognitive overhead for engineers by unifying disparate tools and workflows.
  • The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure and seamless authentication within the platform.
  • The IDP addresses critical challenges such as inconsistent workflows, tool discovery issues, and fragmented documentation, significantly enhancing overall developer experience.

Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.

  • Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
  • The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf node architecture, with leaf nodes handling query processing, retrieval, and ranking.
  • Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during the index retrieval and ranking phases, affecting roughly one in a million requests.
  • Extensive initial troubleshooting, including using dedicated nodes, removing cgroup limits, and OS-level profiling, failed to isolate the root cause of the performance degradation.
  • The problem persisted even when running Manas outside its container directly on the host, suggesting a subtle interaction unique to the Kubernetes provisioning on the AMI.
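Spotting this class of issue means separating rare worst-case outliers from normal tail latency. A toy sketch of the detection step, flagging time windows whose P100 dwarfs the window's median — window size and threshold are illustrative assumptions, not the tooling Pinterest used:

```python
import statistics
from collections import defaultdict

def p100_spikes(samples, window, threshold):
    """Group (timestamp_s, latency_ms) samples into fixed windows and
    flag windows whose worst-case (P100) latency is at least
    `threshold` times the window median -- the shape of the periodic
    per-node spikes described above."""
    buckets = defaultdict(list)
    for ts, lat in samples:
        buckets[ts // window].append(lat)
    flagged = []
    for w in sorted(buckets):
        lats = buckets[w]
        p100 = max(lats)
        med = statistics.median(lats)
        if med > 0 and p100 / med >= threshold:
            flagged.append((w * window, p100))
    return flagged
```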

Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.

  • Pinterest is migrating from its aging Hadoop 2.x (Monarch) data platform to a new Kubernetes (K8s) based system, Moka, for massive-scale data processing.
  • The shift to K8s is driven by needs for enhanced container isolation, security, improved performance with Spark, lower operational costs, and better developer velocity.
  • Kubernetes offers built-in container support, streamlined deployment via Terraform/Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
  • Performance optimizations include leveraging newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
  • Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionalities like YARN UI, job submission, resource management, log aggregation, and security.
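One concrete piece of the YARN-to-K8s replacement is job submission: Spark's native Kubernetes mode points `spark-submit` at the cluster API server instead of a YARN ResourceManager. A sketch of the argument set, where the image name, namespace, and executor count are placeholder values rather than Moka's real configuration:

```python
def spark_k8s_submit_args(app_jar, main_class, image, namespace, executors=2):
    """Build a spark-submit argument list targeting a Kubernetes
    master. The spark.kubernetes.* keys are standard Spark-on-K8s
    configuration; the concrete values here are illustrative."""
    return [
        "spark-submit",
        "--master", "k8s://https://kubernetes.default.svc",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]
```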

Why it matters: This article demonstrates how to significantly accelerate ML development and deployment by leveraging Ray for end-to-end data pipelines. Engineers can learn to build more efficient, scalable, and faster ML iteration systems, reducing costs and time-to-market for new features.

  • Pinterest expanded Ray's role from ML training to the entire ML infrastructure, including feature development, sampling, and label modeling, to accelerate iteration.
  • A Ray Data native pipeline API was developed for on-the-fly feature transformations, eliminating slow Spark backfills and costly feature joins.
  • Efficient Iceberg bucket joins were implemented in Ray, enabling dynamic dataset joining at runtime and reducing feature experimentation from days to hours.
  • Ray-based Iceberg write mechanisms facilitate data persistence, caching transformed features for reuse, enhancing iteration efficiency and production data generation.
  • This integrated Ray architecture provides a more scalable, efficient, and faster end-to-end ML development and deployment process.
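The bucket-join idea above is worth making concrete: if both tables are written with the same hash partitioning on the join key, matching rows are guaranteed to land in the same bucket, so the join proceeds bucket-by-bucket with no global shuffle. A toy pure-Python sketch of that invariant — not Ray's or Iceberg's implementation:

```python
from collections import defaultdict

def bucket(key, n_buckets):
    # Both tables use the same hash partitioning, so rows with equal
    # keys always map to the same bucket index.
    return hash(key) % n_buckets

def bucket_join(left, right, n_buckets=4):
    """Join two lists of (key, value) rows bucket-by-bucket: only
    same-index buckets are co-read, which is what makes Iceberg
    bucket joins cheap at runtime."""
    left_b, right_b = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_b[bucket(k, n_buckets)].append((k, v))
    for k, v in right:
        right_b[bucket(k, n_buckets)].append((k, v))
    out = []
    for b in range(n_buckets):
        lookup = defaultdict(list)
        for k, v in right_b[b]:
            lookup[k].append(v)
        for k, v in left_b[b]:
            out.extend((k, v, rv) for rv in lookup[k])
    return out
```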

Why it matters: This article demonstrates how Pinterest optimizes ad retrieval by strategically using offline ANN to reduce infrastructure costs and improve efficiency for static contexts, complementing real-time online ANN. This is crucial for scaling ad platforms.

  • Pinterest employs both online and offline Approximate Nearest Neighbors (ANN) for ad retrieval, balancing real-time personalization with cost efficiency.
  • Online ANN handles dynamic user behavior but struggles with scalability and cost as ad inventories expand.
  • Offline ANN precomputes ad candidates, significantly reducing infrastructure costs (up to 80%) by minimizing online lookup and repetitive searches.
  • Ideal for stable query contexts, it delivers high throughput and low latency, though it lacks real-time adaptability.
  • Pinterest's "Similar Item Ads" use case demonstrated offline ANN's superior engagement, conversion, and cost-effectiveness over its online counterpart.
  • The adoption of IVF algorithms for larger ad indexes necessitated offline ANN to control escalating infrastructure expenses.
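The offline pattern can be pictured as a batch job that ranks the ad index once per static query context and caches the top-k, turning serving into a dictionary lookup. A miniature sketch with brute-force cosine similarity standing in for IVF; the embeddings and k are illustrative:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def precompute_candidates(contexts, ads, k=2):
    """Offline ANN in miniature: for each static query context, rank
    every (ad_id, embedding) pair once and cache the top-k ad ids.
    Online serving then skips the ANN search entirely."""
    cache = {}
    for ctx_id, ctx_emb in contexts.items():
        scored = sorted(ads, key=lambda a: cosine(ctx_emb, a[1]), reverse=True)
        cache[ctx_id] = [ad_id for ad_id, _ in scored[:k]]
    return cache
```

The trade-off in the bullets falls directly out of this shape: the cache is cheap and fast to serve but goes stale until the next batch run, which is why it suits stable contexts like Similar Item Ads.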

Why it matters: This article details how Pinterest scaled its recommendation system to leverage vast lifelong user data, significantly improving personalization and user engagement through innovative ML models and efficient serving infrastructure.

  • Pinterest's TransActV2 significantly enhances personalization by modeling up to 16,000 lifelong user actions, a 160x increase over previous systems.
  • It introduces a Next Action Loss (NAL) as an auxiliary task, improving user action forecasting beyond traditional CTR models.
  • To handle long sequences efficiently, TransActV2 uses Nearest Neighbor (NN) selection at inference, feeding only the most relevant actions to the model.
  • The system employs a multi-headed transformer encoder architecture with causal masking and explicit action features.
  • Industrial-scale deployment challenges are addressed through NN feature logging, on-device NN search, and custom OpenAI Triton kernels for low-latency serving.
  • Lifelong behavior modeling captures evolving, multi-seasonal, and less-frequent user interests, leading to richer personalization.
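The inference-time NN selection step can be sketched as scoring every past action against the candidate item and keeping only the top-k, in temporal order, for the transformer. The dot-product metric and tiny dimensions are illustrative assumptions, not TransActV2's exact scoring:

```python
def select_relevant_actions(action_embs, candidate_emb, k):
    """Nearest-neighbor selection over a lifelong action sequence:
    score each action embedding against the candidate item and keep
    only the top-k, so the transformer sees a short, relevant
    sequence instead of all 16,000 actions."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    top = sorted(
        range(len(action_embs)),
        key=lambda i: dot(action_embs[i], candidate_emb),
        reverse=True,
    )[:k]
    # Preserve temporal order so causal masking still makes sense.
    return [action_embs[i] for i in sorted(top)]
```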

Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.

  • Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
  • Traditional manual scale-in procedures for Hadoop clusters were tedious and error-prone, involving many steps such as updating exclude files, monitoring data drainage, and managing ASGs.
  • HCC enables in-place cluster migrations by introducing new Auto Scaling Groups (ASGs) with updated AMIs/instance types, avoiding costly and risky full cluster replacements.
  • The tool streamlines scaling-in by managing node decommissioning and ensuring HDFS data replication to new nodes before termination, preventing data loss or workload impact.
  • HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
  • Its architecture includes a manager node for API calls and caching, and worker nodes per VPC to manage clusters, facilitating automated and efficient cluster administration.
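The scale-in safety gate described above reduces to a simple invariant: an excluded node may be terminated only once HDFS reports it fully decommissioned and no blocks remain under-replicated. A sketch of that check, where the node fields and state names are illustrative rather than the real HDFS/HCC API:

```python
def nodes_safe_to_terminate(nodes, under_replicated_blocks):
    """Model HCC's scale-in gate: a node placed in the exclude file
    is only eligible for termination once it is DECOMMISSIONED and
    the cluster reports zero under-replicated blocks, preventing
    data loss during ASG shrink."""
    if under_replicated_blocks > 0:
        return []  # data drainage still in progress; terminate nothing
    return [
        n["host"]
        for n in nodes
        if n["excluded"] and n["state"] == "DECOMMISSIONED"
    ]
```

In HCC this check would run in a loop per worker node, with ASG termination issued only for the hosts the gate returns.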