Posts tagged with dist
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- They addressed challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption during various event types.
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
- Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
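The coordination logic described above can be sketched in miniature: replace one node at a time, and refuse any replacement that would drop the cluster below its quorum floor. This is a schematic illustration, not Airbnb's actual operator; the class and method names are invented for the example.

```python
# Schematic sketch of node-replacement coordination; names are illustrative,
# not Airbnb's operator API.

class NodeReplacementCoordinator:
    """Replaces database nodes one at a time so quorum is never lost."""

    def __init__(self, cluster, min_healthy):
        self.cluster = cluster          # node name -> "healthy" | "draining"
        self.min_healthy = min_healthy  # quorum floor the operator must respect

    def healthy_count(self):
        return sum(1 for s in self.cluster.values() if s == "healthy")

    def replace(self, old_node, new_node):
        # Refuse the replacement if it would drop the cluster below quorum,
        # mirroring the admission-hook style gating described above.
        if self.healthy_count() - 1 < self.min_healthy:
            raise RuntimeError("replacement would violate quorum; deferring")
        self.cluster[old_node] = "draining"   # stop routing writes to it
        self.cluster[new_node] = "healthy"    # bring up the replacement first
        del self.cluster[old_node]            # then remove the drained node


cluster = {"db-0": "healthy", "db-1": "healthy", "db-2": "healthy"}
coord = NodeReplacementCoordinator(cluster, min_healthy=2)
coord.replace("db-0", "db-3")
print(sorted(cluster))  # → ['db-1', 'db-2', 'db-3']
```

The real operator additionally has to sequence data movement and handle concurrent events, but the "add replacement before removing the old node" ordering is the core safety property.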
Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.
- Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
- The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf node architecture, with leaf nodes handling query processing, retrieval, and ranking.
- Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during index retrieval and ranking phases, affecting roughly one in a million requests.
- Initial extensive troubleshooting, including dedicated nodes, removed cgroups, and OS-level profiling, failed to isolate the root cause of the performance degradation.
- The problem persisted even when running Manas outside its container directly on the host, suggesting a subtle interaction unique to the Kubernetes provisioning on the AMI.
Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.
- Pinterest is migrating from its aging Hadoop 2.x (Monarch) data platform to a new Kubernetes (K8s) based system, Moka, for massive-scale data processing.
- The shift to K8s is driven by needs for enhanced container isolation, security, improved performance with Spark, lower operational costs, and better developer velocity.
- Kubernetes offers built-in container support, streamlined deployment via Terraform/Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
- Performance optimizations include leveraging newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
- Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionalities like YARN UI, job submission, resource management, log aggregation, and security.
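To make the "replacing YARN job submission" point concrete, the sketch below assembles a spark-submit invocation targeting a Kubernetes API server instead of a YARN ResourceManager. The `--master k8s://` and `spark.kubernetes.*` flags are standard Spark-on-Kubernetes options; the image name, API server address, and job jar are placeholders, not Pinterest's actual values.

```python
# Sketch: how a Spark job targets Kubernetes instead of YARN. Flags are
# standard spark-submit Kubernetes options; concrete values are placeholders.

def k8s_spark_submit_args(app_jar, image, api_server, namespace="spark"):
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",   # K8s API server replaces the YARN RM
        "--deploy-mode", "cluster",          # the driver runs in a pod
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]

args = k8s_spark_submit_args(
    "local:///opt/jobs/etl.jar",             # hypothetical job jar
    "registry.example.com/spark:3.5",        # hypothetical image
    "https://eks.example.com:443",
)
print(args[2])  # → k8s://https://eks.example.com:443
```

In practice a platform layer (Moka, in Pinterest's case) wraps this so users never construct the command by hand.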
Why it matters: Engineers often struggle to balance robust security with system performance. This approach demonstrates how to implement scalable, team-level encryption at rest using HSMs without sacrificing the speed of file sharing or the functionality of content search in a distributed environment.
- Dropbox developed a team-based encryption system using Hardware Security Modules (HSM) for secure key generation and storage.
- The architecture solves the performance bottleneck of re-encrypting 4MB file blocks during cross-team sharing operations.
- Unique top-level keys allow enterprise teams to instantly disable access to their data, providing granular control over sensitive information.
- The system balances high security with usability, maintaining features like content search that are often lost in traditional end-to-end encryption.
- This security framework serves as the foundation for protecting AI-driven tools like Dropbox Dash and its universal search capabilities.
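The performance point in the second bullet comes down to a key hierarchy: the 4MB block is encrypted once under a data key, and cross-team sharing re-wraps only that small data key under the other team's key. The toy XOR keystream below is a stand-in for a real AEAD cipher such as AES-GCM, and the keys here are random bytes, whereas Dropbox's top-level keys live in HSMs; only the wrapping structure is the point.

```python
import hashlib, secrets

# Toy envelope-encryption sketch: the XOR keystream stands in for a real
# AEAD cipher (e.g. AES-GCM). Real top-level keys are HSM-backed.

def keystream_xor(key: bytes, data: bytes) -> bytes:
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def wrap(team_key: bytes, data_key: bytes) -> bytes:
    return keystream_xor(team_key, data_key)       # unwrap is the same XOR

block = b"x" * 4 * 1024 * 1024                     # a 4 MB file block
data_key = secrets.token_bytes(32)
ciphertext = keystream_xor(data_key, block)        # encrypted exactly once

team_a, team_b = secrets.token_bytes(32), secrets.token_bytes(32)
wrapped_a = wrap(team_a, data_key)
# Sharing with team B touches 32 bytes, not 4 MB: unwrap with A, wrap with B.
wrapped_b = wrap(team_b, wrap(team_a, wrapped_a))

assert keystream_xor(wrap(team_b, wrapped_b), ciphertext) == block
```

Deleting a team's top-level key orphans every data key wrapped under it, which is what makes the "instantly disable access" property in the third bullet possible.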
Why it matters: This article demonstrates how to significantly accelerate ML development and deployment by leveraging Ray for end-to-end data pipelines. Engineers can learn to build more efficient, scalable, and faster ML iteration systems, reducing costs and time-to-market for new features.
- Pinterest expanded Ray's role from ML training to the entire ML infrastructure, including feature development, sampling, and label modeling, to accelerate iteration.
- A Ray Data native pipeline API was developed for on-the-fly feature transformations, eliminating slow Spark backfills and costly feature joins.
- Efficient Iceberg bucket joins were implemented in Ray, enabling dynamic dataset joining at runtime and reducing feature experimentation from days to hours.
- Ray-based Iceberg write mechanisms facilitate data persistence, caching transformed features for reuse, enhancing iteration efficiency and production data generation.
- This integrated Ray architecture provides a more scalable, efficient, and faster end-to-end ML development and deployment process.
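The bucket-join idea can be shown in plain Python: when both datasets are pre-partitioned by `hash(key) % n_buckets` (as Iceberg bucket partitioning does), the join only pairs matching buckets and never shuffles either full dataset. The function names and sample rows below are illustrative, not Ray's or Pinterest's APIs.

```python
from collections import defaultdict

# Plain-Python sketch of a bucket join over hash-partitioned datasets.
# Names and data are illustrative.

def bucketize(rows, key, n_buckets):
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % n_buckets].append(row)
    return buckets

def bucket_join(left, right, key, n_buckets=4):
    lb = bucketize(left, key, n_buckets)
    rb = bucketize(right, key, n_buckets)
    out = []
    for b in range(n_buckets):                 # each bucket joins independently,
        index = {r[key]: r for r in rb[b]}     # so buckets can run in parallel
        for row in lb[b]:
            if row[key] in index:
                out.append({**row, **index[row[key]]})
    return out

pins = [{"pin_id": 1, "feat_a": 0.3}, {"pin_id": 2, "feat_a": 0.7}]
labels = [{"pin_id": 2, "label": 1}]
print(bucket_join(pins, labels, "pin_id"))
# → [{'pin_id': 2, 'feat_a': 0.7, 'label': 1}]
```

Because bucket assignments are fixed by the table layout, no runtime repartitioning is needed, which is where the days-to-hours speedup in feature experimentation comes from.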
Why it matters: This article demonstrates how Pinterest optimizes ad retrieval by strategically using offline ANN to reduce infrastructure costs and improve efficiency for static contexts, complementing real-time online ANN. This is crucial for scaling ad platforms.
- Pinterest employs both online and offline Approximate Nearest Neighbors (ANN) for ad retrieval, balancing real-time personalization with cost efficiency.
- Online ANN handles dynamic user behavior but struggles with scalability and cost as ad inventories expand.
- Offline ANN precomputes ad candidates, significantly reducing infrastructure costs (up to 80%) by minimizing online lookup and repetitive searches.
- Offline ANN is ideal for stable query contexts, delivering high throughput and low latency, though it lacks real-time adaptability.
- Pinterest's "Similar Item Ads" use case demonstrated offline ANN's superior engagement, conversion, and cost-effectiveness over its online counterpart.
- The adoption of IVF algorithms for larger ad indexes necessitated offline ANN to control escalating infrastructure expenses.
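The offline/online trade-off reduces to where the similarity search runs: offline ANN precomputes top-k ad candidates per stable query context into a lookup table, so serving becomes a dictionary read instead of a per-request vector search. The vectors, IDs, and function names below are illustrative only, and brute-force cosine scoring stands in for a real ANN index such as IVF.

```python
import math

# Sketch of offline candidate precomputation; brute-force cosine scoring
# stands in for a real ANN index. All data and names are illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def precompute_candidates(contexts, ads, k=2):
    # Runs in batch, off the request path.
    table = {}
    for ctx_id, ctx_vec in contexts.items():
        scored = sorted(ads, key=lambda ad: cosine(ctx_vec, ad[1]), reverse=True)
        table[ctx_id] = [ad_id for ad_id, _ in scored[:k]]
    return table

contexts = {"pin_123": [1.0, 0.0]}           # a stable query context
ads = [("ad_a", [0.9, 0.1]), ("ad_b", [0.0, 1.0]), ("ad_c", [0.7, 0.7])]
table = precompute_candidates(contexts, ads)
print(table["pin_123"])                       # serving is now an O(1) lookup
```

The cost saving comes from amortizing the search across all requests that share a context; the price is staleness, which is why dynamic user contexts stay on the online path.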
Why it matters: This framework helps engineers proactively identify bottlenecks, evaluate capacity, and ensure system reliability through robust, decentralized, and automated load testing integrated with CI/CD.
- Airbnb's Impulse is a decentralized load-testing-as-a-service framework for robust system performance evaluation.
- It features a context-aware load generator, an out-of-process dependency mocker, a traffic collector, and a testing API generator.
- The load generator uses Java/Kotlin for flexible test logic, containerized for isolation, scalability, and cost-efficiency.
- The dependency mocker enables selective stubbing of HTTP, Thrift, and GraphQL dependencies with configurable latency, isolating the system under test (SUT).
- Impulse integrates with CI/CD for automated testing across warm-up, steady-state, and peak phases, using synthetic or collected traffic.
- Its architecture empowers self-service load tests, minimizing manual effort and enhancing proactive issue detection.
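The phased ramp mentioned above can be sketched as a simple runner that walks through warm-up, steady-state, and peak target rates. This is a schematic in Python for brevity; Impulse's actual generator is a containerized Java/Kotlin service, and the phase names and rates here are illustrative.

```python
import time

# Schematic phased load-test runner; rates and names are illustrative.

PHASES = [("warm-up", 50), ("steady-state", 200), ("peak", 500)]  # target req/s

def run_load_test(send_request, phase_seconds=1):
    results = {}
    for name, rps in PHASES:
        sent = 0
        deadline = time.monotonic() + phase_seconds
        # Send until the phase's request budget or time window is exhausted.
        while time.monotonic() < deadline and sent < rps * phase_seconds:
            send_request()            # call into the SUT; dependencies mocked out
            sent += 1
        results[name] = sent
    return results

calls = []
stats = run_load_test(lambda: calls.append(1))
print(stats["peak"])  # → 500
```

Running this from CI/CD after each deploy is what turns load testing from a manual campaign into a regression gate.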
Why it matters: This article details how Pinterest scaled its recommendation system to leverage vast lifelong user data, significantly improving personalization and user engagement through innovative ML models and efficient serving infrastructure.
- Pinterest's TransActV2 significantly enhances personalization by modeling up to 16,000 lifelong user actions, a 160x increase over previous systems.
- It introduces a Next Action Loss (NAL) as an auxiliary task, improving user action forecasting beyond traditional CTR models.
- To handle long sequences efficiently, TransActV2 uses Nearest Neighbor (NN) selection at inference, feeding only the most relevant actions to the model.
- The system employs a multi-headed transformer encoder architecture with causal masking and explicit action features.
- Industrial-scale deployment challenges are addressed through NN feature logging, on-device NN search, and custom OpenAI Triton kernels for low-latency serving.
- Lifelong behavior modeling captures evolving, multi-seasonal, and less-frequent user interests, leading to richer personalization.
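The inference-time NN selection works by scoring the lifelong action history against the candidate item's embedding and keeping only the top-k most similar actions, so the transformer sees a short, relevant sequence instead of all 16,000 actions. The embeddings, action names, and function below are invented for illustration.

```python
# Sketch of inference-time nearest-neighbor action selection.
# Embeddings and names are illustrative, not Pinterest's features.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_relevant_actions(candidate_emb, action_history, k):
    # action_history: list of (action_id, embedding) pairs spanning years.
    ranked = sorted(action_history,
                    key=lambda a: dot(candidate_emb, a[1]),
                    reverse=True)
    return [action_id for action_id, _ in ranked[:k]]  # short sequence for the model

history = [("saved_recipe", [0.9, 0.1]),
           ("clicked_shoe_ad", [0.1, 0.9]),
           ("searched_pasta", [0.8, 0.2])]
candidate = [1.0, 0.0]                       # e.g. a food Pin's embedding
print(select_relevant_actions(candidate, history, k=2))
# → ['saved_recipe', 'searched_pasta']
```

Because the transformer's cost grows with sequence length, shrinking 16,000 actions to the k most relevant is what keeps serving latency within budget.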
Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.
- Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
- Traditional manual scale-in procedures for Hadoop clusters were tedious, error-prone, and involved many steps like updating exclude files, monitoring data drainage, and managing ASGs.
- HCC enables in-place cluster migrations by introducing new Auto Scaling Groups (ASGs) with updated AMIs/instance types, avoiding costly and risky full cluster replacements.
- The tool streamlines scaling-in by managing node decommissioning and ensuring HDFS data replication to new nodes before termination, preventing data loss or workload impact.
- HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
- Its architecture includes a manager node for API calls and caching, and worker nodes per VPC to manage clusters, facilitating automated and efficient cluster administration.
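The scale-in sequence HCC automates follows a fixed order: mark the node for decommission, wait for HDFS to finish re-replicating its blocks, and only then terminate it. The sketch below captures that ordering against a fake cluster object; the method names are stand-ins, not the real HDFS or ASG APIs.

```python
# Sketch of the automated scale-in sequence; the cluster object is a
# stand-in, not the real HDFS/ASG APIs.

def drain_and_terminate(cluster, node):
    cluster.add_to_exclude(node)             # step 1: update the exclude file
    while cluster.under_replicated_blocks(node) > 0:
        cluster.replicate_step(node)         # step 2: wait for data drainage
    cluster.terminate(node)                  # step 3: safe to shrink the ASG

class FakeCluster:
    def __init__(self, blocks):
        self.blocks = blocks                 # node -> under-replicated block count
        self.excluded = set()
        self.terminated = set()
    def add_to_exclude(self, node): self.excluded.add(node)
    def under_replicated_blocks(self, node): return self.blocks[node]
    def replicate_step(self, node): self.blocks[node] -= 1
    def terminate(self, node): self.terminated.add(node)

c = FakeCluster({"dn-7": 3})
drain_and_terminate(c, "dn-7")
print("dn-7" in c.terminated)  # → True
```

Encoding the ordering in a tool rather than a runbook is what removes the error-prone manual steps: termination simply cannot happen before drainage completes.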
Why it matters: This article details how to build secure, privacy-preserving enterprise search and AI features. It offers a blueprint for integrating external data without compromising user data, leveraging RAG, federated search, and strict access controls. Essential for engineers building secure data platforms.
- Slack's enterprise search and AI uphold strict security and privacy by keeping customer data within Slack's trust boundary, utilizing an AWS escrow VPC for LLMs.
- The system employs Retrieval Augmented Generation (RAG) instead of training Large Language Models (LLMs) on customer data, ensuring data privacy and preventing retention.
- Enterprise search operates on a federated, real-time model, never storing external source data in Slack's databases, but rather fetching it via partner APIs.
- Access to external content is strictly permissioned based on the user's existing Access Control Lists (ACLs) and requires explicit user/admin consent, adhering to the principle of least privilege.
- External data and permissions are always up-to-date with the source system, ensuring accuracy and compliance.
- Search Answer summaries generated by the AI are ephemeral, shown to the user and immediately discarded, further enhancing privacy.
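The permission-aware retrieval step can be sketched as: results fetched live from a partner API are filtered against the requesting user's ACLs before anything reaches the LLM prompt, and nothing is persisted along the way. Every function and document below is an illustrative stand-in, not Slack's actual APIs.

```python
# Sketch of ACL-filtered RAG retrieval; all names and data are illustrative.

def fetch_external_results(query):
    # In the real system this is a live partner-API call; results are
    # never written to a database.
    return [{"doc": "Q3 roadmap", "allowed_users": {"alice"}},
            {"doc": "HR review notes", "allowed_users": {"bob"}}]

def retrieve_for_user(user, query):
    results = fetch_external_results(query)
    # Enforce least privilege: only documents the user can already see
    # in the source system make it into the context.
    return [r["doc"] for r in results if user in r["allowed_users"]]

def answer(user, query):
    context = retrieve_for_user(user, query)
    prompt = f"Answer {query!r} using only: {context}"  # sent to an escrowed LLM
    return prompt                                       # summary is ephemeral

print(retrieve_for_user("alice", "roadmap"))  # → ['Q3 roadmap']
```

Because permissions are checked against the source system at query time rather than mirrored into an index, the last two bullets (freshness and ephemerality) fall out of the design rather than needing separate enforcement.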