Why it matters: This article details how Pinterest uses advanced ML and LLMs to understand complex user intent, moving beyond simple recommendations to goal-oriented assistance. It offers a practical blueprint for building robust, extensible recommendation systems from limited initial data.
- Pinterest developed a system to identify "user journeys" – sequences of user-item interactions revealing long-term goals beyond immediate interests.
- The system uses a dynamic keyword extraction approach, leveraging user search history, activity, and boards.
- Keywords are processed with pretrained text embeddings (e.g., SearchSage) and then hierarchically clustered to form journey candidates (see the sketch after this list).
- Specialized models handle journey naming (currently keyword-based, evolving to LLMs), expansion (LLM-generated recommendations), ranking, and diversification.
- The architecture emphasizes lean development, starting small with annotated data, and extensibility for future advanced ML/LLM techniques.
- The inference pipeline runs on a streaming system for quick adaptation to recent user activities.
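A minimal sketch of the keyword-to-journey clustering step, under stated assumptions: `embed_keywords` is a hypothetical stand-in for a pretrained embedder such as SearchSage, and the keywords, distance threshold, and use of SciPy's agglomerative clustering are illustrative choices, not Pinterest's implementation.

```python
# Illustrative sketch: cluster user activity keywords into journey candidates
# by embedding them and cutting an agglomerative-clustering dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def embed_keywords(keywords: list[str]) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text embedder (e.g., SearchSage)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(keywords), 64))

keywords = ["nursery ideas", "crib bedding", "baby shower games",
            "patio furniture", "backyard landscaping"]
emb = embed_keywords(keywords)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine geometry

# Hierarchical clustering; cutting the tree at a distance threshold yields
# journey candidates (groups of related keywords). With a real embedder the
# nursery-related keywords would land in one cluster, backyard in another.
tree = linkage(emb, method="average", metric="cosine")
labels = fcluster(tree, t=0.6, criterion="distance")
for cluster_id in sorted(set(labels)):
    print(cluster_id, [k for k, lbl in zip(keywords, labels) if lbl == cluster_id])
```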
Why it matters: This article details how Netflix scaled real-time recommendations for live events to millions of users, solving the "thundering herd" problem. It offers a robust, two-phase architectural pattern for high-concurrency, low-latency updates, crucial for distributed systems engineers.
- Netflix developed a real-time recommendation system for live events to handle millions of concurrent users without overwhelming cloud services.
- The core solution involves a two-phase approach: prefetching data to devices ahead of time and broadcasting low-cardinality messages to trigger updates (see the sketch after this list).
- Prefetching distributes load over a longer period, avoiding traffic spikes and optimizing request throughput and compute cardinality.
- Real-time broadcasting uses state keys and timestamps to ensure devices update locally with prefetched data, guaranteeing delivery even on unstable networks.
- This system successfully delivers updates to over 100 million devices in under a minute during peak live event loads.
- It leverages a robust two-tier pub/sub architecture built on Pushy (WebSocket proxy), Apache Kafka, and Netflix's KV store for efficient, low-latency fanout.
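A toy sketch of the two-phase idea, not Netflix's implementation: payloads are prefetched and cached per state key well before the event, then a tiny broadcast message (state key plus timestamp) tells the device which cached state to apply, with the timestamp filtering stale or duplicate deliveries. All class, field, and key names are assumptions.

```python
# Illustrative device-side sketch: prefetch ahead of time, then apply updates
# locally when a low-cardinality broadcast arrives, so no request spike hits
# the backend at event time.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Broadcast:
    state_key: str   # e.g. "pre_show", "live", "post_show" (invented keys)
    timestamp: int   # monotonically increasing; guards against stale/replayed messages

class DeviceCache:
    def __init__(self):
        self.prefetched = {}   # state_key -> payload, filled well before the event
        self.applied_ts = 0

    def prefetch(self, state_key: str, payload: dict) -> None:
        self.prefetched[state_key] = payload

    def on_broadcast(self, msg: Broadcast) -> Optional[dict]:
        # Ignore out-of-order or duplicate messages from an unstable network.
        if msg.timestamp <= self.applied_ts:
            return None
        self.applied_ts = msg.timestamp
        # Apply locally from prefetched data; only a cache miss would need a fetch.
        return self.prefetched.get(msg.state_key)

device = DeviceCache()
device.prefetch("live", {"row": "Live Now", "items": ["event-123"]})
print(device.on_broadcast(Broadcast(state_key="live", timestamp=1700000000)))
```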
Why it matters: DSF revolutionizes AI network scaling by overcoming traditional fabric limitations. Its disaggregated architecture, packet spraying, and advanced congestion control ensure high-performance, lossless connectivity for massive GPU clusters, crucial for the future of large-scale AI model training.
- Meta's Disaggregated Scheduled Fabric (DSF) is a next-generation network technology designed to scale AI training networks beyond the physical limits of traditional Clos-based architectures.
- DSF disaggregates line cards (Interface Nodes) and fabric cards (Fabric Nodes) into distinct hardware, creating a distributed system for enhanced scalability and performance.
- It addresses critical challenges in AI workloads, such as "elephant flows" and "low entropy" traffic patterns, which cause congestion and suboptimal utilization in conventional IP fabrics.
- The system employs a two-domain architecture, packet spraying, and a credit-based congestion control algorithm for efficient, lossless traffic management (see the sketch after this list).
- Built on open standards like OCP-SAI and managed by FBOSS, DSF enables the creation of large virtual chassis switches capable of interconnecting thousands of GPUs for massive AI clusters.
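A toy sketch of the two mechanisms named above, not FBOSS or DSF code: packets are sprayed round-robin across every fabric link rather than hashed to one, and a sender transmits only while it holds receiver-granted credits, which is the essence of credit-based, lossless scheduling. Link names and the credit API are invented for the example.

```python
# Illustrative sketch: packet spraying plus credit-gated sending. An elephant
# flow is spread across all fabric links instead of pinning to one, and packets
# are held (never dropped) when the destination has not granted credits.
from collections import defaultdict
from itertools import cycle

class CreditScheduler:
    def __init__(self, fabric_links: list[str]):
        self.links = cycle(fabric_links)   # spray packets across all fabric links
        self.credits = defaultdict(int)    # destination -> credits granted by receiver

    def grant(self, dst: str, n: int) -> None:
        """Receiver grants credits only when it has buffer space for them."""
        self.credits[dst] += n

    def send(self, dst: str, packets: list[bytes]) -> list[tuple[str, bytes]]:
        sent = []
        for pkt in packets:
            if self.credits[dst] == 0:             # no credit -> hold, never drop
                break
            self.credits[dst] -= 1
            sent.append((next(self.links), pkt))   # next fabric link, round-robin
        return sent

sched = CreditScheduler(["fabric-0", "fabric-1", "fabric-2", "fabric-3"])
sched.grant("gpu-host-17", 3)
print(sched.send("gpu-host-17", [b"p0", b"p1", b"p2", b"p3"]))  # p3 waits for credit
```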
Why it matters: This article details how Netflix built a real-time distributed graph to unify disparate data from microservices, enabling complex relationship analysis and personalized experiences. It showcases a robust stream processing architecture for internet-scale data.
- Netflix developed a Real-Time Distributed Graph (RDG) to unify member interaction data across diverse services and devices, addressing data silos from their microservices architecture.
- The RDG provides advantages like relationship-centric queries, schema flexibility, and efficient pattern detection over traditional data warehousing.
- Its ingestion and processing pipeline relies on a stream processing architecture for real-time updates, crucial for maintaining an up-to-date graph.
- Apache Kafka acts as the ingestion backbone, handling up to 1M messages/second, with Avro-encoded records and a schema registry.
- Apache Flink jobs process these Kafka streams in near real-time, leveraging robust internal platform support for integration (see the sketch after this list).
- Data is also persisted to Apache Iceberg for backfilling, complementing Kafka's retention policies.
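A minimal sketch of the event-to-edge transformation at the heart of such a pipeline, with the Kafka/Avro/Flink plumbing stubbed out; the field names, edge labels, and in-memory "graph store" are assumptions made for illustration, not Netflix's schema.

```python
# Illustrative sketch: turn one decoded interaction record from the stream
# into a relationship-centric graph edge, upserted with last-write-wins.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str          # e.g. "profile:123" (invented entity naming)
    dst: str          # e.g. "title:789"
    label: str        # relationship type, e.g. "PLAYED"
    updated_at: int   # event timestamp for last-write-wins upserts

graph: dict[tuple[str, str, str], Edge] = {}  # stand-in for the graph store

def process_event(event: dict) -> None:
    """Near-real-time handler for one decoded interaction record."""
    edge = Edge(
        src=f"profile:{event['profile_id']}",
        dst=f"title:{event['title_id']}",
        label=event["action"].upper(),
        updated_at=event["ts"],
    )
    key = (edge.src, edge.dst, edge.label)
    current = graph.get(key)
    if current is None or edge.updated_at > current.updated_at:
        graph[key] = edge  # upsert keeps the graph current as events stream in

process_event({"profile_id": 123, "title_id": 789, "action": "played", "ts": 1700000000})
print(list(graph.values()))
```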
Why it matters: This article details Meta's innovations in LLM inference parallelism, offering critical strategies for engineers to achieve high throughput, low latency, and better resource efficiency when deploying large language models at scale. It provides practical solutions for optimizing performance.
- Meta developed advanced parallelism techniques to optimize LLM inference for resource efficiency, throughput, and latency, crucial for applications like Meta AI.
- LLM inference comprises two stages: compute-bound prefill for prompt processing and memory-bound decoding for token generation, each with distinct computational demands.
- Tensor Parallelism shards model layers across GPUs, utilizing novel Direct Data Access (DDA) algorithms (flat, tree) to significantly reduce 'allreduce' communication latency (see the sketch after this list).
- DDA solutions demonstrated substantial speedups (10-50% for decode, 10-30% for prefill) on AMD MI300X, achieving performance parity with NVIDIA H100.
- Context Parallelism, implemented via 'ring attention' variants (Pass-KV, Pass-Q), addresses the challenges of processing extremely long contexts by distributing input tokens and exchanging tensors.
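A minimal PyTorch sketch of row-wise tensor parallelism for a single linear layer, where each rank's partial product is combined with a standard NCCL `all_reduce`; Meta's DDA algorithms target exactly this collective but are not reproduced here, and the layer sizes and launch setup are arbitrary assumptions.

```python
# Illustrative sketch: row-parallel linear layer. Each GPU holds a slice of the
# input features and the matching rows of the weight matrix; summing partial
# products across ranks (allreduce) reproduces the full matmul output.
import torch
import torch.distributed as dist

def row_parallel_linear(x_shard: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    partial = x_shard @ w_shard                     # [batch, out_features], partial sum
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # combine shards across GPUs
    return partial

if __name__ == "__main__":
    dist.init_process_group("nccl")                 # one process per GPU, e.g. via torchrun
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)
    in_features, out_features, batch = 4096, 4096, 8
    shard = in_features // world                    # assumes in_features divides evenly
    x_shard = torch.randn(batch, shard, device="cuda")
    w_shard = torch.randn(shard, out_features, device="cuda")
    y = row_parallel_linear(x_shard, w_shard)       # identical full output on every rank
    print(rank, y.shape)
```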
Why it matters: This article introduces Sapling's innovative directory branching solution for monorepos, enabling scalable version management and merging without compromising performance or developer experience. It's crucial for engineers working with large codebases to maintain agility.
- Meta's Sapling monorepo utilizes two distinct branching workflows to effectively balance scalability and developer experience.
- Non-mergeable full-repo branching, supported by `sl bookmark`, is ideal for temporary product releases that do not require merging back to the main branch.
- Mergeable directory branching is a novel solution for product development, allowing specific directories to be treated like traditional repository branches.
- This new workflow enables copying, cherry-picking, and merging changes between directories using `sl subtree` commands.
- Crucially, directory merges appear as linear commits in the monorepo's commit graph, preserving performance for operations like `sl log` and `sl blame`.
- This approach resolves the challenges of managing multiple code versions within a large monorepo without sacrificing performance or essential developer tools.
Why it matters: This article details how Meta is re-architecting its core network infrastructure to handle the massive data demands of AI, offering insights into large-scale network design for future-proof, high-capacity connectivity.
- Meta is scaling its Express Backbone (EBB) network by 10x to support the increasing demands of AI workloads and extend AI clusters across data centers.
- The Backbone comprises Classic Backbone (CBB) for global DC-to-POP connectivity and EBB for scalable DC-to-DC interconnection, with EBB facing the most significant scaling challenges.
- EBB employs a customized software stack, including the Open/R routing protocol and an in-house traffic engineering system with onbox agents and a centralized controller (see the sketch after this list).
- A key scaling technique for the 10x Backbone is the DC Metro Architecture, which pre-builds fiber rings and POPs to simplify and standardize new data center connectivity.
- This architecture separates metro and long-haul networks, enabling faster, more reliable high-capacity connections for rapidly expanding data centers.
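A toy sketch of the kind of decision a centralized traffic engineering controller makes, reduced to a greedy allocator that places demand on whichever parallel path has the most remaining headroom; the path names, capacities, and algorithm are illustrative assumptions, not Meta's TE system.

```python
# Illustrative sketch: split a demand across parallel backbone paths in small
# increments, always choosing the path with the most spare capacity, to keep
# peak link utilization low.
def allocate(demand_gbps: float, paths: dict[str, float], step: float = 10.0) -> dict[str, float]:
    """paths maps path name -> available capacity in Gbps (invented values)."""
    remaining = dict(paths)
    placed = {p: 0.0 for p in paths}
    left = demand_gbps
    while left > 0:
        best = max(remaining, key=remaining.get)   # most headroom first
        chunk = min(step, left, remaining[best])
        if chunk <= 0:
            break                                  # demand exceeds total capacity
        placed[best] += chunk
        remaining[best] -= chunk
        left -= chunk
    return placed

print(allocate(400.0, {"metro-ring-a": 300.0, "metro-ring-b": 200.0, "long-haul-1": 100.0}))
```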
Why it matters: This article offers engineers actionable design principles to reduce IT hardware's environmental impact, fostering sustainability and cost savings through circularity and emissions reduction in data center infrastructure.
- Meta introduces "Design for Sustainability" principles for IT hardware to cut emissions and costs via reuse, extended life, and optimized design.
- Key strategies include modularity, retrofitting, dematerialization, greener materials, and extending hardware lifecycles in data centers.
- The focus is on reducing Scope 3 emissions from manufacturing, delivery, and end-of-life of IT hardware components.
- Methods involve optimizing material selection, using lower carbon alternatives, extending rack life, and harvesting components for reuse.
- These principles apply across various rack types (AI, Compute, Storage, Network) and target components like compute, storage, and cooling.
- Collaboration with suppliers to electrify processes and transition to renewable energy is crucial for achieving net-zero goals.
- The initiative also significantly reduces electronic waste (e-waste) generated from data centers.
Why it matters: This article details how to build resilient distributed systems by moving beyond static rate limits to adaptive traffic management. Engineers can learn to maximize goodput and ensure reliability in high-traffic, multi-tenant environments.
- Airbnb evolved Mussel, their multi-tenant key-value store, from static QPS rate limiting to adaptive traffic management for improved reliability and goodput during traffic spikes.
- The initial QoS system used simple Redis-backed QPS limits, effective for basic isolation but unable to account for varying request costs or adapt to real-time traffic shifts.
- Resource-aware rate control (RARC) was introduced, charging requests in "request units" (RU) based on fixed overhead, rows processed, payload bytes, and, crucially, latency, reflecting actual backend load.
- RARC uses a linear model for RU calculation, allowing the system to differentiate between cheap and expensive operations, even with similar surface metrics (see the sketch after this list).
- Future layers include load shedding with criticality tiers for priority traffic and hot-key detection/DDoS mitigation to handle skewed access patterns and shield the backend.
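A minimal sketch of the linear RU model described above; the coefficient values are invented, since the article describes the model's shape (fixed overhead, rows, payload bytes, latency) rather than specific weights.

```python
# Illustrative sketch: charge each request in "request units" via a linear
# model over its measured cost drivers, so expensive operations are rate
# limited more heavily than cheap ones with similar surface metrics.
def request_units(rows: int, payload_bytes: int, latency_ms: float,
                  base: float = 1.0,       # fixed per-request overhead (invented weight)
                  per_row: float = 0.05,   # cost per row processed (invented weight)
                  per_kib: float = 0.2,    # cost per KiB of payload (invented weight)
                  per_ms: float = 0.1) -> float:  # cost per ms of backend latency
    return (base
            + per_row * rows
            + per_kib * (payload_bytes / 1024)
            + per_ms * latency_ms)

# Two requests that look similar at the QPS level can cost very differently
# once rows, bytes, and backend latency are charged.
cheap = request_units(rows=1, payload_bytes=512, latency_ms=2)
expensive = request_units(rows=5000, payload_bytes=2_000_000, latency_ms=80)
print(f"{cheap:.1f} RU vs {expensive:.1f} RU")
```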
Why it matters: This article details Slack's successful Deploy Safety Program, which drastically cut customer impact from deployments. It provides a practical framework for improving reliability, incident response, and development velocity in complex, distributed systems.
- Slack's Deploy Safety Program reduced customer impact from change-triggered incidents by 90% in 1.5 years while maintaining development velocity.
- The program targeted the 73% of customer-facing incidents that were caused by code deploys across diverse services and deployment systems.
- North Star goals included automated detection/remediation within 10 minutes and preventing problematic deployments from reaching 10% of the fleet.
- A custom metric, "Hours of customer impact from high/selected medium severity change-triggered incidents," measured program effectiveness.
- Investment prioritized known pain points, rapid iteration, and scaling successful patterns like automated monitoring and rollbacks (see the sketch after this list).
- Key projects involved automating deployments, rollbacks, and blast radius control for critical systems like Webapp backend and frontend.
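A toy sketch of a staged rollout with automated rollback, illustrating the detect-and-remediate loop and blast-radius limits the program scaled; the stage percentages, thresholds, and function names are assumptions for the example, not Slack's tooling.

```python
# Illustrative sketch: promote a change in stages, soak while monitoring, and
# roll back automatically on a regression so a bad change never reaches a
# large fraction of the fleet.
import time

STAGES = [1, 5, 10, 25, 50, 100]   # percent of the fleet per promotion step
ERROR_BUDGET = 1.5                 # allowed multiple of the baseline error rate
SOAK_SECONDS = 600                 # monitoring window per step (~10 minutes)

def deploy(version, promote, error_rate, rollback, baseline) -> bool:
    """Promote in stages; detect regressions and remediate without a human."""
    for pct in STAGES:
        promote(version, pct)
        time.sleep(SOAK_SECONDS)                   # soak: watch error/latency metrics
        if error_rate() > ERROR_BUDGET * baseline:
            rollback(version)                      # automated remediation
            return False                           # stopped early, small blast radius
    return True
```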