
Posts tagged with sre

Why it matters: This article introduces Sapling's directory branching solution for monorepos, enabling scalable version management and merging without compromising performance or developer experience. It's essential reading for engineers who need to keep large codebases agile while managing multiple code versions.

  • Meta's Sapling-based monorepo uses two distinct branching workflows to balance scalability and developer experience.
  • Non-mergeable full-repo branching, supported by `sl bookmark`, is ideal for temporary product releases that do not require merging back to the main branch.
  • Mergeable directory branching is a novel solution for product development, allowing specific directories to be treated like traditional repository branches.
  • This new workflow enables copying, cherry-picking, and merging changes between directories using `sl subtree` commands.
  • Crucially, directory merges appear as linear commits in the monorepo's commit graph, preserving performance for operations like `sl log` and `sl blame`.
  • This approach resolves the challenges of managing multiple code versions within a large monorepo without sacrificing performance or essential developer tools.

Why it matters: This article offers engineers actionable design principles to reduce IT hardware's environmental impact, fostering sustainability and cost savings through circularity and emissions reduction in data center infrastructure.

  • Meta introduces "Design for Sustainability" principles for IT hardware to cut emissions and costs via reuse, extended life, and optimized design.
  • Key strategies include modularity, retrofitting, dematerialization, greener materials, and extending hardware lifecycles in data centers.
  • The focus is on reducing Scope 3 emissions from manufacturing, delivery, and end-of-life of IT hardware components.
  • Methods involve optimizing material selection, using lower carbon alternatives, extending rack life, and harvesting components for reuse.
  • These principles apply across various rack types (AI, Compute, Storage, Network) and target components like compute, storage, and cooling.
  • Collaboration with suppliers to electrify processes and transition to renewable energy is crucial for achieving net-zero goals.
  • The initiative also significantly reduces electronic waste (e-waste) generated from data centers.

Why it matters: This article details how to build resilient distributed systems by moving beyond static rate limits to adaptive traffic management. Engineers can learn to maximize goodput and ensure reliability in high-traffic, multi-tenant environments.

  • Airbnb evolved Mussel, their multi-tenant key-value store, from static QPS rate limiting to adaptive traffic management for improved reliability and goodput during traffic spikes.
  • The initial QoS system used simple Redis-backed QPS limits, effective for basic isolation but unable to account for varying request costs or adapt to real-time traffic shifts.
  • Resource-aware rate control (RARC) was introduced, charging requests in "request units" (RU) based on fixed overhead, rows processed, payload bytes, and, crucially, latency, reflecting actual backend load.
  • RARC uses a linear model for RU calculation, allowing the system to differentiate between cheap and expensive operations even when their surface metrics look similar (see the sketch after this list).
  • Future layers include load shedding with criticality tiers for priority traffic and hot-key detection/DDoS mitigation to handle skewed access patterns and shield the backend.
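
The RU formula is described as a simple linear combination of those inputs. Below is a minimal sketch of what such a model could look like; the coefficient values, field names, and example numbers are illustrative assumptions, not Airbnb's actual tuning.

```python
# Minimal sketch of a linear request-unit (RU) model in the spirit of RARC.
# The article only states that RUs combine a fixed overhead, rows processed,
# payload bytes, and observed latency; the weights below are made up.
from dataclasses import dataclass


@dataclass
class RequestStats:
    rows: int            # rows read or written by the request
    payload_bytes: int   # request plus response payload size
    latency_ms: float    # observed backend latency


# Hypothetical per-term weights; in practice these would be fit against
# measured backend resource usage.
BASE_COST = 1.0
COST_PER_ROW = 0.01
COST_PER_KB = 0.05
COST_PER_MS = 0.02


def request_units(stats: RequestStats) -> float:
    """Charge a request in RUs so cheap and expensive calls are billed differently."""
    return (
        BASE_COST
        + COST_PER_ROW * stats.rows
        + COST_PER_KB * (stats.payload_bytes / 1024)
        + COST_PER_MS * stats.latency_ms
    )


# A small point lookup vs. a wide scan that would look similar under plain QPS limits:
print(request_units(RequestStats(rows=1, payload_bytes=512, latency_ms=2)))
print(request_units(RequestStats(rows=5000, payload_bytes=2_000_000, latency_ms=180)))
```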

Why it matters: This article details Slack's successful Deploy Safety Program, which drastically cut customer impact from deployments. It provides a practical framework for improving reliability, incident response, and development velocity in complex, distributed systems.

  • Slack's Deploy Safety Program reduced customer impact from change-triggered incidents by 90% over 1.5 years while maintaining development velocity.
  • The program tackled the 73% of customer-facing incidents that were caused by code deploys across diverse services and deployment systems.
  • North Star goals included automated detection/remediation within 10 minutes and preventing problematic deployments from reaching 10% of the fleet.
  • A custom metric, "Hours of customer impact from high/selected medium severity change-triggered incidents," measured program effectiveness.
  • Investment prioritized known pain points, rapid iteration, and scaling successful patterns like automated monitoring and rollbacks.
  • Key projects involved automating deployments, rollbacks, and blast radius control for critical systems like Webapp backend and frontend.
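
To make the blast-radius and ten-minute goals concrete, here is a generic sketch of a staged rollout gate with automated health checks and rollback. The stage sizes, bake time, and hook signatures are assumptions for illustration, not Slack's implementation.

```python
# Illustrative sketch of a staged rollout gate: each stage bakes under
# automated monitoring, and an unhealthy signal triggers rollback so a bad
# deploy cannot keep spreading across the fleet.
import time
from typing import Callable

STAGES = (0.01, 0.05, 0.10, 0.50, 1.00)  # fraction of the fleet per stage (assumed)
BAKE_TIME_S = 10 * 60                    # mirrors the 10-minute detection goal
POLL_INTERVAL_S = 30                     # how often monitoring is consulted


def staged_rollout(
    deploy_id: str,
    deploy_to_fraction: Callable[[str, float], None],  # pushes the build to a fleet fraction
    is_healthy: Callable[[str], bool],                  # automated monitoring signal
    rollback: Callable[[str], None],                    # automated remediation
) -> bool:
    for fraction in STAGES:
        deploy_to_fraction(deploy_id, fraction)
        deadline = time.monotonic() + BAKE_TIME_S
        while time.monotonic() < deadline:
            if not is_healthy(deploy_id):
                rollback(deploy_id)      # remediate instead of advancing the rollout
                return False
            time.sleep(POLL_INTERVAL_S)
    return True                          # reached 100% with no health regressions
```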

Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.

  • Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
  • The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
  • Its simple WriteToLog API abstracts internal complexities, using namespaces to select the backing store (Kafka, SQS) and its configuration (a hypothetical sketch of the call shape follows this list).
  • Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
  • It facilitates generic cross-region data replication for services like EVCache.
  • The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
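
The post names a WriteToLog API keyed by namespace but does not spell out the request shape, so the following is a hypothetical sketch of how a namespace-driven client could route writes to Kafka or SQS. All class, field, and method names here are assumptions.

```python
# Hypothetical sketch of a namespace-driven WriteToLog call. The article only
# states that the API hides the backing store (Kafka, SQS) behind a namespace
# and its configuration; everything concrete below is invented for illustration.
from dataclasses import dataclass, field


@dataclass
class WriteToLogRequest:
    namespace: str                    # selects the target, e.g. a Kafka topic or SQS queue
    payload: bytes                    # opaque message body
    delay_seconds: int = 0            # useful for delayed-queue / retry use cases
    metadata: dict = field(default_factory=dict)


class WalClient:
    def __init__(self, namespace_configs: dict):
        # namespace -> {"backend": "kafka" | "sqs", ...}, resolved from configuration
        self.namespace_configs = namespace_configs

    def write_to_log(self, request: WriteToLogRequest) -> None:
        config = self.namespace_configs[request.namespace]
        if config["backend"] == "kafka":
            self._produce_to_kafka(config, request)
        elif config["backend"] == "sqs":
            self._send_to_sqs(config, request)
        else:
            raise ValueError(f"unknown backend {config['backend']!r}")

    def _produce_to_kafka(self, config: dict, request: WriteToLogRequest) -> None:
        ...  # e.g. hand the payload to a Kafka producer for config["topic"]

    def _send_to_sqs(self, config: dict, request: WriteToLogRequest) -> None:
        ...  # e.g. send to config["queue_url"] with the requested delay
```

Callers only pick a namespace and hand over a payload; swapping the underlying queue or adding a delay becomes a configuration change rather than a code change.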

Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.

  • Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands, massive data, and improve operational efficiency.
  • Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
  • Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability, flexible consistency, and enhanced cost visibility.
  • The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
  • Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
  • Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
  • A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and no data loss, as sketched below.
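
As a rough illustration of the dual-write step in such a blue/green migration, the wrapper below writes to both stores while reads stay on the old one until the new store is verified. The client interface is assumed for the sketch and is not Airbnb's actual API.

```python
# Generic dual-write wrapper: the old (v1) store remains the source of truth,
# the new (v2) store receives shadow writes, and reads flip to v2 only after
# the migration is verified.

class DualWriteStore:
    def __init__(self, v1_client, v2_client, read_from_v2: bool = False):
        self.v1 = v1_client
        self.v2 = v2_client
        self.read_from_v2 = read_from_v2   # flipped only after verification

    def put(self, key: str, value: bytes) -> None:
        self.v1.put(key, value)            # source of truth during the migration
        try:
            self.v2.put(key, value)        # best-effort shadow write to the new store
        except Exception:
            # A failed shadow write must not fail the user request; divergence
            # is reconciled later (e.g. by re-bootstrapping the affected range).
            pass

    def get(self, key: str) -> bytes:
        return self.v2.get(key) if self.read_from_v2 else self.v1.get(key)
```

Bootstrapping backfills historical data into the new cluster before the read path flips, so the dual writes only need to cover traffic that arrives during the migration window.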

Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.

  • Netflix transitioned from a centralized SRE-led incident management system to a decentralized, "paved road" approach to empower all engineers.
  • The previous system, relying on basic tools, failed to scale with Netflix's growth, leading to missed learning opportunities from numerous uncaptured incidents.
  • They adopted Incident.io after a build-vs-buy analysis, prioritizing intuitive UX, internal data integration, balanced customization, and an approachable design.
  • Key to successful adoption was the tool's intuitive design, which fostered a cultural shift, making incident management less intimidating and more accessible.
  • Organizational investment in standardized processes, educational resources, and internal data integrations significantly reduced cognitive load and accelerated adoption.
  • This transformation aimed to make incident declaration and management easy for any engineer, even for minor issues, to foster continuous improvement and system reliability.

Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.

  • Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
  • It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
  • The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers.
  • Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
  • This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.

Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.

  • Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
  • EKS cluster deployment utilizes Terraform with a layered structure of AWS-originated and Pinterest-specific modules and Helm charts.
  • A comprehensive logging strategy is implemented for Moka, addressing EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, event logs), and system pod logs.
  • A key challenge in logging is ensuring the reliable upload of Spark event logs to S3, even during job failures, for consumption by the Spark History Server.
  • They are exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis.
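
As a rough sketch of the sidecar idea (not Pinterest's implementation), the loop below keeps syncing the driver's local Spark event log directory to S3 so the logs survive a job failure. The paths, bucket name, and interval are placeholder assumptions.

```python
# Sidecar-style uploader: periodically copy Spark event logs from the driver's
# local event log directory to S3, where the Spark History Server can read them
# even if the application dies before a final upload.
import os
import time

import boto3

EVENT_LOG_DIR = "/var/log/spark-events"   # assumed spark.eventLog.dir on the driver
BUCKET = "example-spark-history-bucket"   # hypothetical destination bucket
PREFIX = "eventlogs/"
SYNC_INTERVAL_S = 30

s3 = boto3.client("s3")


def sync_once() -> None:
    for name in os.listdir(EVENT_LOG_DIR):
        local_path = os.path.join(EVENT_LOG_DIR, name)
        if os.path.isfile(local_path):
            # Each upload replaces the S3 object with the latest contents of the
            # append-only log, so repeating it is safe.
            s3.upload_file(local_path, BUCKET, PREFIX + name)


if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_S)
```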

Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.

  • Dropbox engineers developed a custom liquid cooling system for GPU servers during Hack Week 2025 to address the thermal demands of AI workloads.
  • The team built a prototype from scratch using radiators, pumps, reservoirs, and manifolds when pre-assembled units were unavailable.
  • Stress tests revealed that liquid cooling reduced operating temperatures by 20–30°C compared to standard air-cooled production systems.
  • The project enabled reduced fan speeds for secondary components, leading to quieter operation and potential power savings.
  • The initiative serves as a proof-of-concept for future-proofing data center infrastructure against the rising power consumption of next-gen GPUs.
  • Future plans include expanding testing with dedicated liquid cooling labs across multiple Dropbox data centers.