Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.
- Dropbox engineers developed a custom liquid cooling system for GPU servers during Hack Week 2025 to address the thermal demands of AI workloads.
- The team built a prototype from scratch using radiators, pumps, reservoirs, and manifolds when pre-assembled units were unavailable.
- Stress tests revealed that liquid cooling reduced operating temperatures by 20–30°C compared to standard air-cooled production systems (a minimal measurement sketch follows this list).
- The project enabled reduced fan speeds for secondary components, leading to quieter operation and potential power savings.
- The initiative serves as a proof of concept for future-proofing data center infrastructure against the rising power consumption of next-gen GPUs.
- Future plans include expanding testing with dedicated liquid cooling labs across multiple Dropbox data centers.
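The post doesn't include the team's measurement harness, so the following is only a rough sketch of how a stress-test temperature delta could be recorded, polling per-GPU temperatures with nvidia-smi (the polling interval and duration are arbitrary):

```python
import subprocess
import time

def gpu_temperatures():
    """Read the current temperature (Celsius) of each GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.strip().splitlines()]

def monitor(duration_s=600, interval_s=5):
    """Poll temperatures during a stress run and report the peak per GPU."""
    peaks = {}
    end = time.time() + duration_s
    while time.time() < end:
        for gpu, temp in enumerate(gpu_temperatures()):
            peaks[gpu] = max(peaks.get(gpu, 0), temp)
        time.sleep(interval_s)
    for gpu, peak in sorted(peaks.items()):
        print(f"GPU {gpu}: peak {peak} C")

if __name__ == "__main__":
    monitor()
```

Running the same monitor against an air-cooled and a liquid-cooled chassis under identical load yields the kind of 20–30°C comparison reported above.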
Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.
- Pinterest adopted an Internal Developer Platform (IDP) strategy to counter the degradation of engineering velocity caused by increasing complexity and tool fragmentation.
- They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
- PinConsole aims to provide consistent abstractions and self-service capabilities, reducing cognitive overhead for engineers by unifying disparate tools and workflows.
- The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure, seamless authentication within the platform.
- The IDP addresses challenges such as inconsistent workflows, tool discovery issues, and fragmented documentation, significantly improving the overall developer experience (a small catalog-query sketch follows this list).
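PinConsole itself is built as Backstage TypeScript plugins, which the article doesn't reproduce; as a hedged illustration of the unified service catalog an IDP exposes, here is a small Python sketch against Backstage's standard catalog REST endpoint (the host, token handling, and owner value are invented):

```python
import requests

BACKSTAGE_URL = "https://pinconsole.example.com"  # hypothetical host
TOKEN = "..."  # bearer token, issued per deployment's auth setup

def list_components(owner: str):
    """Fetch catalog components owned by a team via the Backstage catalog API."""
    resp = requests.get(
        f"{BACKSTAGE_URL}/api/catalog/entities",
        params={"filter": f"kind=component,spec.owner={owner}"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return [entity["metadata"]["name"] for entity in resp.json()]

print(list_components("team-search"))  # hypothetical team name
```

A single queryable catalog like this is what replaces ad hoc tool discovery: ownership, services, and docs resolve through one API instead of scattered wikis.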
Why it matters: This article details how Netflix is innovating data engineering to tackle the unique challenges of media data for advanced ML. It offers insights into building specialized data platforms and roles for multi-modal content, crucial for any company dealing with large-scale unstructured media.
- Netflix is evolving its data engineering function into "Media ML Data Engineering" to handle complex, multi-modal media data at scale.
- This new specialization focuses on centralizing, standardizing, and managing media assets and their metadata for machine learning applications.
- The "Media Data Lake" is introduced as a platform for storing and serving media assets, leveraging vector storage solutions like LanceDB (see the sketch after this list).
- Its architecture includes a Media Table for metadata, a robust data model, a Pythonic Data API, and distributed compute for ML training and inference.
- The initiative aims to bridge creative media workflows with cutting-edge ML demands, enabling applications like content embedding and quality measures.
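Netflix doesn't publish the Media Table schema, so the following is a minimal sketch of the vector-storage idea using LanceDB's Python API, with invented fields and toy four-dimensional vectors standing in for real content embeddings:

```python
import lancedb

# Connect to a local (or object-store-backed) LanceDB database.
db = lancedb.connect("./media_data_lake")

# Hypothetical media-asset rows: metadata alongside a content embedding
# (LanceDB treats a column named "vector" as the default vector column).
assets = [
    {"asset_id": "clip-001", "title_id": "show-42", "modality": "video",
     "vector": [0.12, 0.48, 0.33, 0.91]},
    {"asset_id": "clip-002", "title_id": "show-42", "modality": "audio",
     "vector": [0.05, 0.61, 0.27, 0.88]},
]
table = db.create_table("media_assets", data=assets)

# Nearest-neighbor search over embeddings, e.g. for similar-content lookup.
query = [0.10, 0.50, 0.30, 0.90]
for match in table.search(query).limit(2).to_list():
    print(match["asset_id"], match["modality"])
```

In the real platform, the Pythonic Data API described above would sit between users and this storage layer rather than exposing raw table handles.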
Why it matters: Dropbox's jump to 90% AI adoption provides a blueprint for scaling developer productivity. It shows how combining leadership alignment with a mix of third-party and internal tools can transform the SDLC and overcome developer skepticism toward AI-assisted workflows.
- Dropbox achieved over 90% AI tool adoption among engineers by 2025 through strong leadership alignment and a structured change management plan.
- The engineering organization utilizes AI across the entire software development lifecycle, including code generation, testing, debugging, and incident resolution.
- A three-pronged strategy was employed: evaluating external tools like GitHub Copilot, developing custom internal AI solutions, and fostering a culture of knowledge sharing.
- Initial adoption challenges, such as distrust of output quality and workflow friction, were addressed through peer-to-peer training and clear performance metrics.
- The company balances third-party integrations with in-house development to solve specific organizational problems while building internal machine learning expertise.
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
- ML observability is essential for monitoring, understanding, and gaining insight into production ML models, ensuring reliability and continuous improvement.
- At Netflix, it's crucial for optimizing payment processing, reducing friction, and ensuring seamless subscriptions and renewals.
- An effective framework detects issues automatically, supports root cause analysis, and builds stakeholder trust by explaining system behavior.
- Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
- Logging requires a comprehensive data schema that captures model inputs, outputs, and metadata for effective analysis.
- Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
- Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances (a minimal sketch follows this list).
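Netflix's internal explainability tooling isn't shown in the article; as a minimal sketch of the SHAP technique it names, here is how a tree model's decisions can be explained both in aggregate and per instance (the payment-style features and labels are invented for illustration):

```python
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Invented payment-style features; real inputs would come from the logging schema.
X = pd.DataFrame({
    "retry_count": [0, 2, 1, 5, 0, 3],
    "days_since_signup": [400, 12, 90, 3, 800, 30],
    "amount_usd": [9.99, 15.49, 9.99, 19.99, 9.99, 15.49],
})
y = [1, 0, 1, 0, 1, 0]  # 1 = payment succeeded

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Aggregate view: mean |SHAP| per feature approximates global importance.
print(dict(zip(X.columns, abs(shap_values).mean(axis=0))))

# Per-instance view: why did the model score row 3 the way it did?
print(dict(zip(X.columns, shap_values[3])))
```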
Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.
- Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel over a 4.5-year effort, achieving 3–5x faster local builds and tests and 2–3x faster deploys.
- The move to Bazel was driven by the need for faster builds via remote execution, greater reliability through hermeticity, and uniform build infrastructure across all language repos.
- Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by parallelizing actions and reducing data transfer.
- Hermetic builds, enforced by sandboxing, ensure consistent, repeatable results by isolating build actions from external environment dependencies.
- The migration strategy began with a proof of concept on a critical service with co-existing Gradle and Bazel builds, followed by a breadth-first rollout.
Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.
- Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
- The strategy uses Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
- Each workload's upgrade is atomic: the new istio-proxy version rolls out and connects to the corresponding new Istiod revision in a single step.
- A rollouts.yml file dictates the gradual rollout, defining namespace patterns and per-version percentage distributions assigned via consistent hashing (sketched after this list).
- For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label.
- The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
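The article doesn't reproduce rollouts.yml or the assignment code; the sketch below shows one straightforward reading of the consistent-hashing idea, hashing each workload's identity into a stable bucket and mapping buckets onto the configured percentage split (the revision names and field shapes are assumptions):

```python
import hashlib

def assign_revision(workload_key: str, distribution: dict[str, int]) -> str:
    """Deterministically map a workload to an Istio revision.

    Hashing the workload identity (rather than drawing randomly) means
    the same workload always lands on the same revision, so repeated
    reconciles and partial rollouts stay stable.
    """
    digest = hashlib.sha256(workload_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    cumulative = 0
    for revision, percent in distribution.items():
        cumulative += percent
        if bucket < cumulative:
            return revision
    raise ValueError("distribution percentages must sum to 100")

# Hypothetical stanza mirroring a rollouts.yml entry: 10% of matching
# workloads on the new revision, the rest on the old one.
dist = {"1-20-2": 10, "1-19-5": 90}
print(assign_revision("payments-api/deployment-a", dist))
```

As long as the revision order in the file stays stable, raising a revision's percentage only moves the marginal buckets; workloads already placed on it are never reshuffled.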
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- They addressed the challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption across various event types (a simplified drain-gate sketch follows this list).
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
- Overprovisioning the database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
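The operator's internals aren't published; the following self-contained Python sketch shows only the coordination idea, a gate that blocks a node replacement until enough healthy replicas live elsewhere (the database admin API is invented and stubbed with a fake client):

```python
import time
from dataclasses import dataclass

@dataclass
class Replica:
    node: str
    healthy: bool

class FakeDBClient:
    """Stand-in for the database's admin API (hypothetical)."""
    def __init__(self, replicas):
        self.replicas = replicas
    def replication_status(self):
        return self.replicas
    def decommission_replicas(self, node):
        self.replicas = [r for r in self.replicas if r.node != node]

def safe_to_drain(db, node, required_healthy):
    """Draining `node` must still leave enough healthy replicas behind."""
    healthy_elsewhere = sum(
        1 for r in db.replication_status() if r.healthy and r.node != node
    )
    return healthy_elsewhere >= required_healthy

def drain(db, node, required_healthy=2, poll_s=1):
    """Block the node replacement until the cluster tolerates losing it."""
    while not safe_to_drain(db, node, required_healthy):
        time.sleep(poll_s)  # wait for rebalancing/recovery to catch up
    db.decommission_replicas(node)

db = FakeDBClient([Replica("n1", True), Replica("n2", True), Replica("n3", True)])
drain(db, "n1")
print([r.node for r in db.replication_status()])  # ['n2', 'n3']
```

A real operator would drive this check from Kubernetes node and pod events; the sketch isolates just the safety condition that prevents overlapping disruptions.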
Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.
- Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
- The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf architecture, with leaf nodes handling query processing, retrieval, and ranking.
- Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during the index retrieval and ranking phases, affecting roughly one in a million requests.
- Extensive initial troubleshooting, including dedicated nodes, removing cgroups, and OS-level profiling, failed to isolate the root cause of the performance degradation.
- The problem persisted even when Manas ran outside its container directly on the host, suggesting a subtle interaction unique to the Kubernetes provisioning on the AMI.
Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.
- Pinterest is migrating from its aging Hadoop 2.x platform (Monarch) to a new Kubernetes-based system, Moka, for massive-scale data processing.
- The shift to Kubernetes is driven by the need for stronger container isolation and security, better Spark performance, lower operational costs, and improved developer velocity (a generic Spark-on-Kubernetes sketch follows this list).
- Kubernetes offers built-in container support, streamlined deployment via Terraform and Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
- Performance optimizations include newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
- Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionality such as the YARN UI, job submission, resource management, log aggregation, and security.
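Moka's job-submission path isn't detailed in the article; as a generic illustration of Spark running on Kubernetes instead of YARN, here is a standard PySpark session configured against a Kubernetes API server (the endpoint, image, namespace, and service account are placeholders, and a real client-mode setup needs additional driver networking config):

```python
from pyspark.sql import SparkSession

# Executors are scheduled as pods by Kubernetes rather than YARN containers.
spark = (
    SparkSession.builder
    .master("k8s://https://eks-cluster.example.com:443")
    .appName("moka-style-batch-job")
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.5")
    .config("spark.kubernetes.namespace", "batch")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Trivial job to confirm the session: sum the first million integers.
df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").collect())
spark.stop()
```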