Curated topic
Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.
Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.
Why it matters: Postgres's logical replication design creates a tight coupling between CDC consumers and HA failover. Unlike MySQL's GTID approach, Postgres requires active subscriber participation to make replicas failover-ready, potentially stalling maintenance or breaking data pipelines during outages.
Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.
Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.
Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.
Why it matters: Neki brings proven sharding expertise from the Vitess team to the Postgres ecosystem, enabling massive horizontal scaling for Postgres users. This provides a path for high-growth applications to scale without abandoning the Postgres feature set or switching to proprietary solutions.
Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.