Posts tagged with sre
Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.
- Pinterest adopted an Internal Developer Platform (IDP) strategy to counter the engineering velocity degradation caused by growing complexity and tool fragmentation.
- They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
- PinConsole unifies disparate tools and workflows, providing consistent abstractions and self-service capabilities while reducing cognitive overhead for engineers.
- The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure, seamless authentication within the platform (see the sketch after this list).
- The IDP addresses challenges such as inconsistent workflows, poor tool discoverability, and fragmented documentation, significantly improving the overall developer experience.
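The article doesn't include code for these integrations; below is a generic Python sketch of the kind of check an internal OAuth-plus-LDAP integration performs (token introspection, then group lookup). The endpoints, base DN, and response fields are hypothetical placeholders, not Pinterest's or Backstage's actual APIs.

```python
# Generic sketch (not Pinterest's implementation): validate an OAuth bearer
# token against an internal identity provider, then resolve the engineer's
# LDAP groups to decide what the portal should show. All endpoints, fields,
# and attribute names below are hypothetical placeholders.
import requests                                 # pip install requests
from ldap3 import Server, Connection, SUBTREE   # pip install ldap3

OAUTH_INTROSPECT_URL = "https://oauth.internal.example.com/introspect"  # placeholder
LDAP_URL = "ldaps://ldap.internal.example.com"                          # placeholder

def resolve_user(bearer_token: str) -> dict:
    # 1. Token introspection (RFC 7662-style) against the internal OAuth server.
    resp = requests.post(OAUTH_INTROSPECT_URL, data={"token": bearer_token}, timeout=5)
    resp.raise_for_status()
    claims = resp.json()
    if not claims.get("active"):
        raise PermissionError("token is expired or revoked")

    # 2. Look up the user's groups in LDAP to drive authorization in the portal.
    #    (Anonymous bind for brevity; a real integration would authenticate.)
    conn = Connection(Server(LDAP_URL), auto_bind=True)
    conn.search(
        search_base="ou=people,dc=example,dc=com",   # placeholder base DN
        search_filter=f"(uid={claims['username']})",
        search_scope=SUBTREE,
        attributes=["memberOf"],
    )
    groups = conn.entries[0].memberOf.values if conn.entries else []
    return {"user": claims["username"], "groups": list(groups)}
```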
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
- ML observability is essential for monitoring, understanding, and gaining insight into production ML models, ensuring reliability and continuous improvement.
- At Netflix, it is crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
- An effective framework includes automatic issue detection and root-cause analysis, and it builds stakeholder trust by explaining system behavior.
- Netflix's approach focuses on stakeholder-facing outcomes and is structured into logging, monitoring, and explaining modules.
- Logging requires a comprehensive data schema that captures model inputs, outputs, and metadata for effective analysis.
- Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
- Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual predictions.
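The post names SHAP but shows no code; here is a minimal, self-contained Python sketch of both aggregate and per-instance explanations using the open-source shap package. The tree model and payment-style features are illustrative, not Netflix's actual model.

```python
# Minimal sketch of aggregate and per-prediction explainability with SHAP.
# The model and payment-style features are made up for illustration.
import shap                      # pip install shap (plots need matplotlib)
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical features logged for each payment attempt.
X = pd.DataFrame({
    "retry_count":       [0, 1, 3, 0],
    "days_since_signup": [10, 400, 35, 2],
    "amount_usd":        [9.99, 15.49, 9.99, 19.99],
})
y = [1, 0, 1, 1]  # 1 = payment succeeded

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Aggregate view: which features drive the model's decisions overall.
shap.summary_plot(shap_values, X, show=False)

# Instance view: why one specific payment was scored the way it was.
print(dict(zip(X.columns, shap_values[0])))
```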
Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.
- Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel over 4.5 years, achieving 3-5x faster local builds and tests and 2-3x faster deploys.
- The move to Bazel was driven by the need for superior build speed via remote execution, enhanced reliability through hermeticity, and uniform build infrastructure across all language repos.
- Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by enabling parallel actions and reducing data transfer (see the sketch after this list).
- Hermetic builds, enforced by sandboxing, ensured consistent, repeatable results by isolating build actions from external environment dependencies.
- The migration strategy included a proof of concept on a critical service with co-existing Gradle/Bazel builds, followed by a breadth-first rollout.
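As a concrete reference for the RBE and "Build without the Bytes" point above, here is a hedged sketch of a Bazel invocation wrapped in Python. The flags are standard Bazel options; the remote-execution endpoint and target label are placeholders, not Airbnb's configuration.

```python
# Hedged sketch: invoking Bazel with remote execution and
# "Build without the Bytes" from a wrapper script.
# The endpoint URL and target label are placeholders, not Airbnb's.
import subprocess

def remote_build(target: str) -> None:
    cmd = [
        "bazel", "build", target,
        # Send actions to a remote execution cluster instead of building locally.
        "--remote_executor=grpc://remote-exec.example.com:8980",
        # "Build without the Bytes": only download needed outputs,
        # leaving intermediate artifacts in the remote cache.
        "--remote_download_minimal",
        # Keep actions hermetic: don't leak the host's PATH/environment.
        "--incompatible_strict_action_env",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    remote_build("//services/example:example_service")  # placeholder label
```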
Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.
- Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
- The strategy employs Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
- Upgrades are atomic: new istio-proxy versions roll out and connect to the corresponding new Istiod revision simultaneously.
- A rollouts.yml file dictates the gradual rollout, defining namespace patterns and percentage distributions for Istio versions using consistent hashing (see the sketch after this list).
- For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label.
- The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
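The post describes percentage-based version assignment driven by rollouts.yml but not the algorithm itself; the Python sketch below uses simple hash bucketing as a stand-in for the consistent hashing it mentions. The rule schema and revision names are invented.

```python
# Hedged sketch: deterministically assign a namespace to an Istio revision
# based on a percentage split, so the same namespace stays in the same bucket
# as the rollout percentage grows. Simple hash bucketing stands in for the
# consistent hashing the article mentions; the rule below is hypothetical.
import hashlib

# Example rule: 20% of matching namespaces get the newer revision.
ROLLOUT = {
    "namespace_pattern": "prod-*",
    "revisions": [("istio-1-20", 80), ("istio-1-21", 20)],  # (revision, percent)
}

def bucket(namespace: str) -> int:
    """Map a namespace to a stable bucket in [0, 100) using a hash."""
    digest = hashlib.sha256(namespace.encode()).hexdigest()
    return int(digest, 16) % 100

def revision_for(namespace: str) -> str:
    """Walk the percentage ranges and pick the revision for this namespace."""
    b = bucket(namespace)
    cumulative = 0
    for revision, percent in ROLLOUT["revisions"]:
        cumulative += percent
        if b < cumulative:
            return revision
    return ROLLOUT["revisions"][-1][0]

# The chosen revision would then be applied as the namespace's istio.io/rev
# label, which the mutating webhook uses to pick the matching istio-proxy.
print(revision_for("prod-payments"))  # stable output for a given namespace
```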
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- They addressed the challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption across various event types (see the sketch after this list).
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail-latency spikes mitigated by read timeouts, transparent retries, and stale reads.
- Overprovisioning the database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
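As a rough illustration of the operator's coordination role, here is a hedged Python sketch that checks a hypothetical replication-health endpoint before cordoning a node with the standard kubernetes client. It is a simplified stand-in, not Airbnb's operator.

```python
# Hedged sketch of the coordination idea: before a node is replaced, verify
# the database reports full replica health, then cordon the node so no new
# pods schedule there. The health endpoint and its response shape are
# hypothetical; the cordon call uses the standard kubernetes Python client.
import requests
from kubernetes import client, config   # pip install kubernetes

DB_HEALTH_URL = "http://db-admin.internal.example.com/replication/health"  # placeholder

def replication_healthy() -> bool:
    # Hypothetical admin endpoint returning {"under_replicated_ranges": 0, ...}.
    status = requests.get(DB_HEALTH_URL, timeout=5).json()
    return status.get("under_replicated_ranges", 1) == 0

def cordon(node_name: str) -> None:
    # Mark the node unschedulable (the first step of a safe replacement).
    v1 = client.CoreV1Api()
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

def begin_replacement(node_name: str) -> None:
    if not replication_healthy():
        raise RuntimeError("refusing to replace node: ranges are under-replicated")
    cordon(node_name)
    # A real operator would then evict pods, wait for data to rebalance,
    # and only afterwards let the cloud provider terminate the instance.

if __name__ == "__main__":
    config.load_kube_config()   # or load_incluster_config() inside a pod
    begin_replacement("ip-10-0-1-23.ec2.internal")  # placeholder node name
```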
Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.
- Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
- The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf architecture, with leaf nodes handling query processing, retrieval, and ranking.
- Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during the index retrieval and ranking phases, indicating a roughly one-in-a-million slow request.
- Extensive initial troubleshooting, including dedicated nodes, removed cgroups, and OS-level profiling, failed to isolate the root cause of the performance degradation.
- The problem persisted even when Manas ran outside its container directly on the host, suggesting a subtle interaction specific to how the Kubernetes nodes were provisioned on the AMI.
Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.
- Pinterest is migrating from its aging Hadoop 2.x (Monarch) data platform to a new Kubernetes-based (K8s) system, Moka, for massive-scale data processing.
- The shift to K8s is driven by the need for stronger container isolation and security, improved Spark performance, lower operational costs, and better developer velocity.
- Kubernetes offers built-in container support, streamlined deployment via Terraform/Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
- Performance optimizations include leveraging newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
- Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionality such as the YARN UI, job submission, resource management, log aggregation, and security.
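Replacing YARN job submission is one of the challenges listed above; the sketch below shows a standard Spark-on-Kubernetes spark-submit invocation wrapped in Python. The API server URL, namespace, image, and jar path are placeholders, and this is not Moka's actual submission path.

```python
# Hedged sketch: submitting a Spark application to a Kubernetes cluster using
# standard Spark-on-K8s properties (the kind of flow that replaces YARN-based
# job submission). All URLs, names, and paths are placeholders.
import subprocess

def submit_spark_job() -> None:
    cmd = [
        "spark-submit",
        "--master", "k8s://https://eks-cluster.example.com:443",
        "--deploy-mode", "cluster",
        "--name", "example-batch-job",
        "--class", "com.example.BatchJob",
        "--conf", "spark.executor.instances=50",
        "--conf", "spark.kubernetes.namespace=batch",
        "--conf", "spark.kubernetes.container.image=registry.example.com/spark:3.5.0",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
        # Jar baked into the container image, referenced with the local:// scheme.
        "local:///opt/spark/jars/example-batch-job.jar",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit_spark_job()
```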
Why it matters: Dropbox's 7th-gen hardware shows how custom infrastructure at exabyte scale drives massive efficiency. By co-designing hardware and software, they achieve superior performance-per-watt and density, essential for modern AI-driven workloads and sustainable growth.
- Dropbox launched its seventh-generation hardware platform featuring specialized tiers: Crush (compute), Dexter (database), Sonic (storage), and Gumby/Godzilla (GPUs).
- The architecture doubles available rack power from 17kW to 35kW and transitions to 400G networking to support high-bandwidth AI and data workloads.
- Storage density is optimized using 28TB SMR drives and a custom chassis designed to mitigate vibration and heat, supporting exabyte-scale data.
- The compute tier utilizes 128-core AMD EPYC processors and PCIe Gen5, providing significant performance-per-watt improvements over previous generations.
- New GPU tiers are specifically integrated to power AI products like Dropbox Dash, focusing on high-performance training and inference capabilities.
Why it matters: This framework helps engineers proactively identify bottlenecks, evaluate capacity, and ensure system reliability through robust, decentralized, and automated load testing integrated with CI/CD.
- Airbnb's Impulse is a decentralized load-testing-as-a-service framework for robust system performance evaluation.
- It features a context-aware load generator, an out-of-process dependency mocker, a traffic collector, and a testing API generator.
- The load generator uses Java/Kotlin for flexible test logic and is containerized for isolation, scalability, and cost efficiency.
- The dependency mocker enables selective stubbing of HTTP, Thrift, and GraphQL dependencies with configurable latency, isolating the system under test.
- Impulse integrates with CI/CD for automated testing across warm-up, steady-state, and peak phases, using synthetic or collected traffic (see the sketch after this list).
- Its architecture empowers self-service load tests, minimizing manual effort and improving proactive issue detection.
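Impulse's load generator is Java/Kotlin; purely as an illustration of the phased profile (warm-up, steady-state, peak), here is a language-agnostic Python sketch against a hypothetical endpoint. The rates, durations, and URL are invented, and this is not Impulse's API.

```python
# Language-agnostic sketch of a phased load profile (warm-up -> steady-state
# -> peak). The real Impulse generator is Java/Kotlin and containerized; the
# target URL, rates, and durations here are illustrative only.
import time
import requests

TARGET_URL = "https://service-under-test.example.com/api/ping"  # placeholder

# (phase name, requests per second, duration in seconds)
PHASES = [
    ("warm-up", 5, 30),
    ("steady-state", 50, 300),
    ("peak", 200, 60),
]

def run_phase(name: str, rps: int, duration_s: int) -> None:
    print(f"phase={name} rps={rps} duration={duration_s}s")
    interval = 1.0 / rps
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        start = time.monotonic()
        try:
            requests.get(TARGET_URL, timeout=2)
        except requests.RequestException:
            pass  # a real harness would record errors and latency percentiles
        # Pace requests to hold the target rate (single-threaded approximation).
        time.sleep(max(0.0, interval - (time.monotonic() - start)))

if __name__ == "__main__":
    for phase in PHASES:
        run_phase(*phase)
```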
Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.
- Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
- Traditional manual scale-in procedures for Hadoop clusters were tedious and error-prone, involving many steps such as updating exclude files, monitoring data drainage, and managing ASGs (see the sketch after this list).
- HCC enables in-place cluster migrations by introducing new Auto Scaling Groups (ASGs) with updated AMIs/instance types, avoiding costly and risky full cluster replacements.
- The tool streamlines scale-in by managing node decommissioning and ensuring HDFS data is replicated to new nodes before termination, preventing data loss or workload impact.
- HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
- Its architecture includes a manager node for API calls and caching, plus worker nodes per VPC that manage clusters, enabling automated, efficient cluster administration.
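The manual steps HCC automates map onto standard HDFS admin commands; below is a hedged Python sketch of that decommission loop (update the exclude file, refresh the NameNode, wait for blocks to drain). The exclude-file path is a placeholder and this is not HCC's implementation.

```python
# Hedged sketch of the decommission loop HCC automates: add hosts to the HDFS
# exclude file, refresh the NameNode, and poll until the nodes finish draining
# before the ASG terminates them. The path is a placeholder; the hdfs dfsadmin
# commands are standard Hadoop admin commands.
import subprocess
import time

EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"  # placeholder path

def decommission(hosts: list[str], poll_s: int = 60) -> None:
    # 1. Append the hosts to the exclude file read by the NameNode.
    with open(EXCLUDE_FILE, "a") as f:
        f.write("\n".join(hosts) + "\n")

    # 2. Tell the NameNode to re-read its include/exclude files.
    subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)

    # 3. Wait until no host still appears in the decommissioning report,
    #    i.e. its blocks have been re-replicated to the remaining/new nodes.
    while True:
        report = subprocess.run(
            ["hdfs", "dfsadmin", "-report", "-decommissioning"],
            check=True, capture_output=True, text=True,
        ).stdout
        still_draining = [h for h in hosts if h in report]
        if not still_draining:
            print("all nodes decommissioned; safe to terminate instances")
            return
        print(f"waiting on {len(still_draining)} nodes to drain...")
        time.sleep(poll_s)
```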