Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.
Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.
Why it matters: This article details how Netflix is innovating data engineering to tackle the unique challenges of media data for advanced ML. It offers insights into building specialized data platforms and roles for multi-modal content, crucial for any company dealing with large-scale unstructured media.
Why it matters: Dropbox's jump to 90% AI adoption provides a blueprint for scaling developer productivity. It shows how combining leadership alignment with a mix of third-party and internal tools can transform the SDLC and overcome developer skepticism toward AI-assisted workflows.
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.
Why it matters: Neki brings proven sharding expertise from the Vitess team to the Postgres ecosystem, enabling massive horizontal scaling for Postgres users. This provides a path for high-growth applications to scale without abandoning the Postgres feature set or switching to proprietary solutions.
Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
Why it matters: This article highlights how difficult it is to debug elusive, high-impact performance issues in complex distributed systems during a migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.