Posts tagged with sre

Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.

  • Pinterest adopted an Internal Developer Platform (IDP) strategy to counter engineering velocity degradation caused by increasing complexity and tool fragmentation.
  • They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
  • PinConsole aims to provide consistent abstractions, self-service capabilities, and reduce cognitive overhead for engineers by unifying disparate tools and workflows.
  • The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure and seamless authentication within the platform.
  • The IDP addresses critical challenges such as inconsistent workflows, tool discovery issues, and fragmented documentation, significantly enhancing overall developer experience.
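
The post stays at the architecture level and doesn't publish integration code; purely as a hedged illustration of the kind of OAuth flow an IDP backend performs against an internal identity provider, here is a minimal sketch of a standard authorization-code token exchange in Python. The endpoint and credentials are hypothetical placeholders, not Pinterest's actual systems.

```python
import requests

# Hypothetical internal OAuth endpoint; Pinterest's real integration is not public.
TOKEN_URL = "https://oauth.internal.example.com/oauth2/token"

def exchange_code_for_token(code: str, client_id: str, client_secret: str,
                            redirect_uri: str) -> dict:
    """Standard OAuth2 authorization-code exchange, as an IDP auth backend
    might perform after the identity provider redirects back to it."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": code,
            "redirect_uri": redirect_uri,
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # contains access_token, expires_in, etc.
```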

Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.

  • ML Observability is essential for monitoring, understanding, and gaining insights into production ML models, ensuring reliability and continuous improvement.
  • At Netflix, it's crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
  • An effective framework includes automatic issue detection, root cause analysis, and builds stakeholder trust by explaining system behavior.
  • Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
  • Logging requires a comprehensive data schema to capture model inputs, outputs, and metadata for effective analysis.
  • Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
  • Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances.
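
The summary above is conceptual; as a minimal sketch of the kind of aggregate and instance-level explanation the explaining module produces, here is a generic SHAP example on a toy tree model (the model and features are stand-ins, not Netflix's payment models).

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy stand-in for a payment model; the real features and targets are not public.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50, verbosity=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Aggregate view: which features drive predictions across all scored requests.
global_importance = np.abs(shap_values).mean(axis=0)
print("mean |SHAP| per feature:", dict(enumerate(global_importance.round(3))))

# Instance view: why the model scored this one request the way it did.
print("contributions for request 0:", dict(enumerate(shap_values[0].round(3))))
```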

Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.

  • Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel over a 4.5-year effort, achieving 3-5x faster local builds/tests and 2-3x faster deploys.
  • The move to Bazel was driven by needs for superior build speed via remote execution, enhanced reliability through hermeticity, and a uniform build infrastructure across all language repos.
  • Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by enabling parallel actions and reducing data transfer.
  • Hermetic builds, enforced by sandboxing, ensured consistent, repeatable results by isolating build actions from external environment dependencies.
  • The migration strategy included a proof-of-concept on a critical service with co-existing Gradle/Bazel builds, followed by a breadth-first rollout.
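
The post describes these properties at a high level; the snippet below is a conceptual sketch (not Bazel's actual implementation) of why hermeticity enables a shared remote cache: if an action's output depends only on its declared inputs and command line, a digest of those is a safe cache key that any machine can compute identically.

```python
import hashlib
import json

def action_cache_key(command: list[str], input_digests: dict[str, str]) -> str:
    """Conceptual illustration of content-addressed caching: a hermetic action's
    result is fully determined by its command line and input file digests, so
    their combined hash can safely key a shared remote cache."""
    payload = json.dumps({"cmd": command, "inputs": sorted(input_digests.items())})
    return hashlib.sha256(payload.encode()).hexdigest()

# Two machines that agree on inputs and command compute the same key and can
# reuse each other's outputs; any undeclared dependency would silently break this.
key = action_cache_key(
    ["javac", "-d", "out", "Foo.java"],
    {"Foo.java": hashlib.sha256(b"class Foo {}").hexdigest()},
)
print(key)
```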

Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.

  • Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
  • The strategy employs Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
  • Upgrades are atomic, rolling out new istio-proxy versions and connecting them to the corresponding new Istiod revision simultaneously.
  • A rollouts.yml file dictates the gradual rollout, defining namespace patterns and percentage distributions for Istio versions using consistent hashing.
  • For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label (the revision-selection logic is sketched below).
  • The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
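
The actual rollouts.yml schema isn't shown in the post; as a hedged sketch of the described selection mechanism, the snippet below hashes a namespace into a stable 0-99 bucket and walks the configured percentage ranges to pick a revision, so a namespace keeps its assignment as percentages shift. Field names and version strings here are illustrative assumptions.

```python
import hashlib

# Illustrative stand-in for a rollouts.yml entry; field names are assumptions.
ROLLOUT = {"namespace_pattern": "payments-*",
           "revisions": [("1-20-2", 90), ("1-21-0", 10)]}

def bucket(namespace: str) -> int:
    """Consistently hash a namespace into a 0-99 bucket so the same namespace
    always maps to the same point in the rollout, run after run."""
    digest = hashlib.sha256(namespace.encode()).hexdigest()
    return int(digest, 16) % 100

def revision_for(namespace: str, revisions) -> str:
    """Walk the cumulative percentage ranges to pick the Istiod revision whose
    istio.io/rev label this namespace's sidecars should be injected with."""
    b, cumulative = bucket(namespace), 0
    for rev, pct in revisions:
        cumulative += pct
        if b < cumulative:
            return rev
    return revisions[-1][0]

print(revision_for("payments-api", ROLLOUT["revisions"]))
```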

Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.

  • Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
  • They addressed challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
  • A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption during various event types.
  • Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
  • AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
  • Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
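
A quick back-of-the-envelope on that last point (the specific numbers are an assumption; the article doesn't quantify its headroom): with three AZ-aligned clusters surviving the loss of any one, each cluster's steady-state utilization has to stay under about two thirds.

```python
def per_cluster_utilization_cap(num_clusters: int, tolerated_failures: int) -> float:
    """Maximum steady-state utilization per cluster so the survivors can absorb
    the full load if `tolerated_failures` clusters (or their AZs) go down."""
    survivors = num_clusters - tolerated_failures
    return survivors / num_clusters

# Three AZ-aligned clusters, tolerating one full AZ loss: run each at <= ~67%,
# i.e. overprovision total capacity by ~50% over what steady state needs.
print(per_cluster_utilization_cap(3, 1))  # 0.666...
```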

Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.

  • Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
  • The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf node architecture, with leaf nodes handling query processing, retrieval, and ranking.
  • Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during the index retrieval and ranking phases, affecting roughly one in a million requests.
  • Extensive initial troubleshooting, including dedicating nodes, removing cgroups, and OS-level profiling, failed to isolate the root cause of the performance degradation.
  • The problem persisted even when Manas ran outside its container, directly on the host, suggesting a subtle interaction specific to how the Kubernetes nodes' AMI was provisioned.
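
The write-up is mostly a debugging narrative, but the symptom it hinges on, brief P100 spikes on single leaf nodes, is the kind of signal fleet-wide averages hide; below is a toy sketch (with made-up log fields) of per-node, per-window max-latency bucketing that surfaces it.

```python
from collections import defaultdict

def p100_by_node_and_window(samples, window_s=60):
    """Group (node, timestamp_s, latency_ms) samples into per-node time windows
    and report the max (P100) latency in each, so a spike on one leaf node every
    few minutes is not averaged away across the fleet."""
    buckets = defaultdict(list)
    for node, ts, latency_ms in samples:
        buckets[(node, ts // window_s)].append(latency_ms)
    return {k: max(v) for k, v in buckets.items()}

# Hypothetical samples: leaf-7 is healthy except for one pathological request.
samples = [("leaf-7", 10, 8.0), ("leaf-7", 15, 9.5), ("leaf-7", 20, 900.0),
           ("leaf-3", 12, 8.2)]
print(p100_by_node_and_window(samples))
```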

Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.

  • Pinterest is migrating from its aging Hadoop 2.x (Monarch) data platform to a new Kubernetes (K8s) based system, Moka, for massive-scale data processing.
  • The shift to K8s is driven by needs for enhanced container isolation, security, improved performance with Spark, lower operational costs, and better developer velocity.
  • Kubernetes offers built-in container support, streamlined deployment via Terraform/Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
  • Performance optimizations include leveraging newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
  • Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionalities like YARN UI, job submission, resource management, log aggregation, and security.
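
The post covers platform design rather than job-level mechanics; as one hedged illustration of what running Spark natively on Kubernetes looks like (standard Spark-on-K8s configuration properties, not Pinterest's Moka-specific submission layer), a job can be pointed at a cluster's API server as below. The image, namespace, and API server address are placeholders.

```python
import subprocess

# Standard Spark-on-Kubernetes submission; the container image, namespace, and
# API server address are placeholders, and Moka's real submission layer wraps more.
cmd = [
    "spark-submit",
    "--master", "k8s://https://kubernetes.default.svc:443",
    "--deploy-mode", "cluster",
    "--name", "example-etl",
    "--conf", "spark.kubernetes.namespace=batch",
    "--conf", "spark.kubernetes.container.image=registry.example.com/spark:3.5.0",
    "--conf", "spark.executor.instances=8",
    "local:///opt/spark/examples/src/main/python/pi.py", "1000",
]
subprocess.run(cmd, check=True)
```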

Why it matters: Dropbox's 7th-gen hardware shows how custom infrastructure at exabyte scale drives massive efficiency. By co-designing hardware and software, they achieve superior performance-per-watt and density, essential for modern AI-driven workloads and sustainable growth.

  • Dropbox launched its seventh-generation hardware platform featuring specialized tiers: Crush (compute), Dexter (database), Sonic (storage), and Gumby/Godzilla (GPUs).
  • The architecture more than doubles available rack power, from 17kW to 35kW, and transitions to 400G networking to support high-bandwidth AI and data workloads.
  • Storage density is optimized using 28TB SMR drives and a custom chassis designed to mitigate vibration and heat, supporting exabyte-scale data.
  • The compute tier utilizes 128-core AMD EPYC processors and PCIe Gen5, providing significant performance-per-watt improvements over previous generations.
  • New GPU tiers are specifically integrated to power AI products like Dropbox Dash, focusing on high-performance training and inference capabilities.

Why it matters: This framework helps engineers proactively identify bottlenecks, evaluate capacity, and ensure system reliability through robust, decentralized, and automated load testing integrated with CI/CD.

  • Airbnb's Impulse is a decentralized load-testing-as-a-service framework for robust system performance evaluation.
  • It features a context-aware load generator, an out-of-process dependency mocker, a traffic collector, and a testing API generator.
  • The load generator uses Java/Kotlin for flexible test logic, containerized for isolation, scalability, and cost-efficiency.
  • The dependency mocker enables selective stubbing of HTTP, Thrift, and GraphQL dependencies with configurable latency, isolating the system under test (SUT).
  • Impulse integrates with CI/CD for automated testing across warm-up, steady-state, and peak phases, using synthetic or collected traffic.
  • Its architecture empowers self-service load tests, minimizing manual effort and enhancing proactive issue detection.
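
Impulse itself is a Java/Kotlin service and isn't reproduced in the post; purely to illustrate the phased traffic shape it describes (warm-up, steady state, peak), here is a toy asyncio load generator against a placeholder endpoint, with made-up phase durations and rates.

```python
import asyncio
import time
import aiohttp

# (phase name, duration in seconds, requests per second) -- toy numbers, and a
# placeholder target; Impulse's real generator is a Java/Kotlin service.
PHASES = [("warm-up", 10, 5), ("steady-state", 30, 20), ("peak", 10, 50)]
TARGET = "http://localhost:8080/health"

async def fire(session: aiohttp.ClientSession, latencies: list) -> None:
    start = time.monotonic()
    try:
        async with session.get(TARGET) as resp:
            await resp.read()
    except aiohttp.ClientError:
        pass  # count failed attempts too in this toy sketch
    finally:
        latencies.append(time.monotonic() - start)

async def run() -> None:
    async with aiohttp.ClientSession() as session:
        for name, duration, rps in PHASES:
            latencies: list[float] = []
            tasks = []
            end = time.monotonic() + duration
            while time.monotonic() < end:
                tasks.append(asyncio.create_task(fire(session, latencies)))
                await asyncio.sleep(1 / rps)  # open-loop pacing at the target rate
            await asyncio.gather(*tasks)
            print(f"{name}: sent {len(latencies)} requests, "
                  f"max latency {max(latencies):.3f}s")

asyncio.run(run())
```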

Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.

  • Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
  • Traditional manual scale-in procedures for Hadoop clusters were tedious, error-prone, and involved many steps like updating exclude files, monitoring data drainage, and managing ASGs.
  • HCC enables in-place cluster migrations by introducing new Auto Scaling Groups (ASGs) with updated AMIs/instance types, avoiding costly and risky full cluster replacements.
  • The tool streamlines scaling-in by managing node decommissioning and ensuring HDFS data replication to new nodes before termination, preventing data loss or workload impact.
  • HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
  • Its architecture includes a manager node for API calls and caching, and worker nodes per VPC to manage clusters, facilitating automated and efficient cluster administration.
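
HCC is internal to Pinterest; as a rough, hedged sketch of the manual sequence it automates for one node (exclude from HDFS, wait for block replication to drain, then shrink the ASG), here is the shape of it in Python with boto3 and the stock Hadoop CLI. Paths, hostnames, and the report parsing are simplified placeholders, and the real tool handles far more failure modes.

```python
import subprocess
import time
import boto3

EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"   # placeholder path on the NameNode host
ASG_CLIENT = boto3.client("autoscaling")

def decommission_and_scale_in(hostname: str, instance_id: str) -> None:
    """Rough shape of one scale-in step HCC automates: exclude the DataNode,
    wait for HDFS to finish replicating its blocks elsewhere, then remove the
    instance from its Auto Scaling Group."""
    # 1. Add the node to the HDFS exclude file and tell the NameNode to re-read it.
    #    (Assumes this runs where the Hadoop CLI and NameNode config are available.)
    with open(EXCLUDE_FILE, "a") as f:
        f.write(hostname + "\n")
    subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)

    # 2. Poll until the NameNode reports the node as fully decommissioned,
    #    i.e. its data has drained to other nodes (naive report parsing).
    while True:
        report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                                capture_output=True, text=True, check=True).stdout
        node_section = report.split(hostname, 1)[-1].split("Name:", 1)[0]
        if "Decommissioned" in node_section:
            break
        time.sleep(60)

    # 3. Terminate the instance and shrink the ASG's desired capacity with it.
    ASG_CLIENT.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True)
```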