Netflix Tech Blog
https://netflixtechblog.com/

Why it matters: Translating natural language to complex DSLs reduces friction for subject matter experts interacting with massive, federated datasets. This approach bridges the gap between intuitive human intent and rigid technical schemas, improving productivity across hundreds of enterprise applications.
- Netflix is evolving its Graph Search platform to support natural language queries using Large Language Models (LLMs).
- The system translates ambiguous user input into a structured Filter Domain-Specific Language (DSL) for federated GraphQL data.
- Accuracy is maintained by ensuring syntactic, semantic, and pragmatic correctness through schema validation and controlled vocabularies.
- The architecture utilizes Retrieval-Augmented Generation (RAG) to provide domain-specific data processing without replacing existing UIs.
- Pre-processing and context engineering are critical to prevent LLM hallucinations and ensure fields match the underlying index.
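The post doesn't publish the Filter DSL grammar or the index schema, but the validation idea can be sketched. Assuming a simple field/operator/value filter shape (the field names, operators, and schema below are hypothetical), a validator against a controlled vocabulary might look like:

```python
# Hypothetical sketch: the real Filter DSL and index schema are Netflix-internal.
# The goal is to reject LLM output whose fields, operators, or value types do
# not exist in the underlying index, before the query ever runs.

ALLOWED_FIELDS = {"title": str, "releaseYear": int, "genre": str}  # assumed schema
ALLOWED_OPS = {"eq", "lt", "gt", "contains"}                       # assumed grammar

def validate_filter(f: dict) -> list:
    """Return a list of validation errors; an empty list means the filter is valid."""
    errors = []
    field, op, value = f.get("field"), f.get("op"), f.get("value")
    if field not in ALLOWED_FIELDS:
        errors.append(f"unknown field: {field!r}")       # semantic: field must exist in index
    if op not in ALLOWED_OPS:
        errors.append(f"unknown operator: {op!r}")       # syntactic: op must be in the grammar
    elif field in ALLOWED_FIELDS and not isinstance(value, ALLOWED_FIELDS[field]):
        errors.append(f"type mismatch for {field!r}")    # pragmatic: value type must match schema
    return errors

print(validate_filter({"field": "releaseYear", "op": "lt", "value": 2020}))  # valid filter
print(validate_filter({"field": "rating", "op": "lt", "value": 2020}))       # unknown field
```

A hallucinated field name fails fast with a machine-readable error, which can be fed back to the LLM for a corrected attempt.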
Why it matters: This article demonstrates how a Durable Execution platform like Temporal can drastically improve the reliability of critical cloud operations and continuous delivery pipelines, reducing complex failure handling and state management for engineers.
- Netflix significantly improved the reliability of its Spinnaker deployments by adopting Temporal, reducing transient failures from 4% to 0.0001%.
- Temporal is a Durable Execution platform that allows engineers to write resilient code, abstracting away complexities of distributed system failures.
- The previous Spinnaker architecture suffered from complex, undifferentiated internal orchestration, retry logic, and a homegrown Saga framework within its Clouddriver service.
- Prior to Temporal, Clouddriver's instance-local task state led to lost operation progress if the service crashed, impacting deployment reliability.
- Temporal helped streamline cloud operations by offloading complex state management and failure handling, allowing services like Clouddriver to focus on core infrastructure changes.
Why it matters: This article details how Netflix built a robust, high-performance live streaming origin and optimized its CDN for live content. It offers insights into handling real-time data defects, ensuring resilience, and optimizing content delivery at scale.
- Netflix Live Origin is a multi-tenant microservice bridging cloud live streaming pipelines and Open Connect CDN, managing content distribution.
- It ensures resilience through redundant regional pipelines and server-side failover, utilizing epoch locking for intelligent segment selection.
- The Origin detects and mitigates live stream defects (e.g., short, missing segments) by selecting valid candidates from multiple pipelines.
- Open Connect's nginx-based CDN was optimized for live streaming, extending proxy-caching and adding millisecond-grain caching.
- Live Origin "holds open" requests for yet-to-be-published segments, reducing network chatter and improving efficiency.
- HTTP headers are leveraged for scalable streaming metadata, providing real-time event notifications to client devices via OCAs.
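Epoch locking and the real selection logic are internal to Live Origin, but the defect-mitigation idea from the bullets above can be sketched: given the same segment produced by redundant regional pipelines, serve the first candidate that is neither missing nor short (the duration-tolerance check is an assumed, simplified defect test):

```python
# Simplified sketch of server-side failover across redundant pipelines (the
# real Live Origin also uses epoch locking and richer defect detection).
# candidates: list of (pipeline_name, segment_or_None) for the same segment.

def pick_segment(candidates, expected_duration_ms, tolerance_ms=50):
    for pipeline, segment in candidates:
        if segment is None:                                    # missing from this pipeline
            continue
        if abs(segment["duration_ms"] - expected_duration_ms) > tolerance_ms:
            continue                                           # short/long segment: a defect
        return pipeline, segment                               # first healthy candidate wins
    return None                                                # no valid candidate anywhere

# One pipeline produced a short segment; failover serves the healthy copy.
short = {"duration_ms": 1200}
good = {"duration_ms": 2000}
choice = pick_segment([("us-east", short), ("us-west", good)], expected_duration_ms=2000)
```

Because every pipeline publishes the same segment timeline, the selection is transparent to clients: they always receive a valid segment regardless of which region produced it.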
Why it matters: This article highlights how open video codecs like AV1 drive significant improvements in streaming quality and network efficiency. It showcases a successful large-scale rollout across diverse devices, offering valuable insights into optimizing content delivery and user experience.
- Netflix's AV1 codec adoption has reached 30% of all streaming, becoming their second most-used codec due to its superior efficiency.
- AV1 delivers higher video quality (4.3 VMAF points over AVC) with one-third less bandwidth and 45% fewer buffering interruptions.
- The rollout began with Android mobile in 2020 using the dav1d software decoder, expanding to smart TVs, web browsers, and Apple devices with hardware support.
- This advanced codec significantly improves network efficiency for Netflix's Open Connect CDN and partner ISPs by reducing overall internet bandwidth consumption.
- AV1 enables advanced features like HDR10+ streaming and cinematic film grain, enhancing the overall viewing experience for members.
Why it matters: This article introduces "Spin," a new Metaflow feature that significantly improves the iterative development experience for ML/AI engineers. It allows faster experimentation and debugging, bridging the gap between workflow orchestrators and interactive notebooks.
- Metaflow, an open-sourced Netflix framework, streamlines ML/AI workflows from prototype to production, emphasizing rapid iteration and reliable operations.
- The new "Spin" command in Metaflow 2.19 significantly accelerates iterative ML/AI development by enabling quick, stateful execution of individual steps.
- ML/AI development requires fast, stateful iteration due to large, mutable data and models, and computationally expensive processes.
- Metaflow steps function as explicit, deterministic checkpoints, persisting state as versioned artifacts.
- "Spin" allows developers to execute a single Metaflow step with inherited state, mimicking notebook cell behavior for instant feedback.
- Unlike `run` or `resume`, `spin` is designed for fast, untracked, throw-away iterations, optimizing the development loop.
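The `spin` internals aren't shown in the post; conceptually, it runs one step function against artifacts a previous run already persisted, rather than re-executing upstream steps. A toy, non-Metaflow illustration of that "inherited state" behavior (the artifact store and step names here are invented):

```python
# Toy illustration of spin-style iteration (not the Metaflow API): steps read
# from checkpointed artifacts, so a single step can be re-run in isolation
# with notebook-cell-like feedback while upstream results stay untouched.

artifact_store = {  # stand-in for artifacts persisted by a prior `run`
    "train": {"model": [0.1, 0.2], "n_rows": 1000},
}

def spin_step(step_fn, upstream_step):
    state = dict(artifact_store[upstream_step])   # inherit checkpointed state, don't recompute
    return step_fn(state)                         # execute just this one step

def evaluate(state):
    # The step under iteration: edit, re-spin, see results instantly.
    return {"score": sum(state["model"]), "rows": state["n_rows"]}

result = spin_step(evaluate, "train")
```

The payoff is the tight loop: editing `evaluate` and re-spinning never pays the cost of re-training, because `train`'s artifacts are simply loaded.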
Why it matters: This article introduces A-SFT, a novel post-training algorithm for generative recommenders. It addresses key challenges like noisy reward models and lack of counterfactual data, offering a practical way to improve recommendation quality by better aligning models with user preferences.
- Generative Recommenders (GRs) model user behavior as a sequential transduction task, inspired by transformer architectures.
- Applying RLHF to GRs is challenging due to the lack of counterfactual feedback and the inherent noisiness of recommendation reward signals.
- User feedback is on-policy, making it impractical to obtain evaluations for hypothetical or unseen recommendations.
- Reward models in recommendation systems often exhibit high uncertainty, as user choices are less structured and more random than language data.
- The paper proposes Advantage-Weighted Supervised Fine-tuning (A-SFT) to overcome these post-training challenges.
- A-SFT combines supervised fine-tuning with the advantage function, effectively guiding optimization even with high-variance reward models.
- This approach improves alignment between pre-trained generative recommenders and reward models, balancing offline RL and behavior cloning.
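The paper's exact objective isn't reproduced in this summary; schematically, advantage-weighted SFT scales each example's log-likelihood term by a function of its advantage, so high-advantage actions are imitated more strongly. A minimal numeric sketch using exponential weighting (the specific weighting and the temperature `beta` are illustrative assumptions, not the paper's formulation):

```python
import math

# Schematic advantage-weighted SFT loss (illustrative only): each example's
# negative log-likelihood is weighted by exp(A / beta). Large beta approaches
# plain behavior cloning; small beta leans harder on the (noisy) advantages,
# which is the offline-RL vs. behavior-cloning balance the summary describes.

def a_sft_loss(log_probs, advantages, beta=1.0):
    weights = [math.exp(a / beta) for a in advantages]   # soft advantage weighting
    total_w = sum(weights)
    # normalized, advantage-weighted negative log-likelihood
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / total_w

# With zero advantages the loss reduces to ordinary SFT (behavior cloning).
bc_loss = a_sft_loss(log_probs=[-1.0, -1.0], advantages=[0.0, 0.0])
```

The weighted form tolerates high-variance reward models better than hard filtering, since a noisy low-advantage example is down-weighted rather than discarded.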
Why it matters: This article details how Netflix scaled real-time recommendations for live events to millions of users, solving the "thundering herd" problem. It offers a robust, two-phase architectural pattern for high-concurrency, low-latency updates, crucial for distributed systems engineers.
- Netflix developed a real-time recommendation system for live events to handle millions of concurrent users without overwhelming cloud services.
- The core solution involves a two-phase approach: prefetching data to devices ahead of time and broadcasting low-cardinality messages to trigger updates.
- Prefetching distributes load over a longer period, avoiding traffic spikes and optimizing request throughput and compute cardinality.
- Real-time broadcasting uses state keys and timestamps to ensure devices update locally with prefetched data, guaranteeing delivery even on unstable networks.
- This system successfully delivers updates to over 100 million devices in under a minute during peak live event loads.
- It leverages a robust two-tier pub/sub architecture built on Pushy (WebSocket proxy), Apache Kafka, and Netflix's KV store for efficient, low-latency fanout.
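The two-phase pattern above can be sketched without any of the real infrastructure (Pushy, Kafka, and the KV store are replaced here by in-memory stand-ins, and the state key name is invented): full payloads are prefetched during a quiet period, and the live broadcast carries only a tiny state key plus timestamp that each device resolves locally.

```python
import time

# Sketch of the two-phase prefetch + broadcast pattern (not Netflix's
# implementation). Phase 1 pushes full payloads to devices ahead of time;
# phase 2 broadcasts only (state_key, timestamp), so millions of devices
# update without a thundering herd of synchronous fetches.

class Device:
    def __init__(self):
        self.prefetched = {}   # state_key -> payload, filled during the quiet period
        self.active = None

    def prefetch(self, state_key, payload):
        self.prefetched[state_key] = payload            # phase 1: load spread over time

    def on_broadcast(self, state_key, ts):
        payload = self.prefetched.get(state_key)
        if payload is not None:
            self.active = (payload, ts)                 # phase 2: purely local swap
        return payload is not None                      # False -> device falls back to a fetch

devices = [Device() for _ in range(3)]
for d in devices:
    d.prefetch("halftime_row", {"rows": ["live-recs-v2"]})   # hypothetical payload
applied = [d.on_broadcast("halftime_row", time.time()) for d in devices]
```

Because the broadcast message is identical for every device, its cardinality is constant no matter how many devices are connected; only the prefetch phase scales with audience size, and it is deliberately spread over time.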
Why it matters: This article details how Netflix built a real-time distributed graph to unify disparate data from microservices, enabling complex relationship analysis and personalized experiences. It showcases a robust stream processing architecture for internet-scale data.
- Netflix developed a Real-Time Distributed Graph (RDG) to unify member interaction data across diverse services and devices, addressing data silos from their microservices architecture.
- The RDG provides advantages like relationship-centric queries, schema flexibility, and efficient pattern detection over traditional data warehousing.
- Its ingestion and processing pipeline relies on a stream processing architecture for real-time updates, crucial for maintaining an up-to-date graph.
- Apache Kafka acts as the ingestion backbone, handling up to 1M messages/second, with Avro-encoded records and schema registry.
- Apache Flink jobs process these Kafka streams in near real-time, leveraging robust internal platform support for integration.
- Data is also persisted to Apache Iceberg for backfilling, complementing Kafka's retention policies.
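The shape of the update path can be illustrated with a tiny in-memory stand-in (the real pipeline consumes Avro records from Kafka via Flink; the event schema below is invented for illustration): each interaction event upserts an edge into the graph, and the upsert is idempotent so Kafka's at-least-once redelivery is harmless.

```python
from collections import defaultdict

# Illustrative only: an in-memory stand-in for the RDG update path. Each
# event upserts an edge keyed by (subject, relation); using a set makes the
# write idempotent, so duplicate deliveries from the stream are safe.

graph = defaultdict(set)  # (entity, relation) -> set of related entities

def apply_event(event):
    key = (event["subject"], event["relation"])
    graph[key].add(event["object"])               # idempotent upsert: replays are no-ops

stream = [
    {"subject": "member:1", "relation": "WATCHED", "object": "title:42"},
    {"subject": "member:1", "relation": "WATCHED", "object": "title:42"},  # duplicate delivery
    {"subject": "member:1", "relation": "RATED",   "object": "title:7"},
]
for event in stream:
    apply_event(event)
```

The relationship-centric payoff is that a query like "everything member:1 watched" is a single key lookup rather than a join over warehouse tables.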
Why it matters: This article demonstrates how Netflix optimized its workflow orchestrator by 100X, crucial for supporting evolving business needs like real-time data processing and low-latency applications. It highlights the importance of engine redesign for scalability and developer productivity.
- Netflix's Maestro workflow orchestrator achieved a 100X performance improvement, reducing overhead from seconds to milliseconds for Data/ML workflows.
- The previous Maestro engine, based on the deprecated Conductor 2.x, suffered from performance bottlenecks and race conditions due to its internal flow engine layer.
- New business needs like Live, Ads, Games, and low-latency use cases necessitated a high-performance workflow engine.
- The team evaluated options including upgrading Conductor, using Temporal, or implementing a custom internal flow engine.
- They opted to rewrite Maestro's internal flow engine to simplify the architecture, eliminate complex database synchronizations, and ensure strong guarantees.
Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.
- Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
- The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
- Its simple WriteToLog API abstracts internal complexities, using namespaces to define storage (Kafka, SQS) and configurations.
- Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
- It facilitates generic cross-region data replication for services like EVCache.
- The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
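The namespace idea behind the WriteToLog API can be sketched as follows (only the name WriteToLog comes from the article; the namespace names, config fields, and in-memory backends are assumptions): a namespace resolves to a backing target and its configuration, so producers never hard-code Kafka-versus-SQS details.

```python
# Sketch of namespace-based dispatch for a WriteToLog-style API (illustrative;
# the real WAL's targets, configs, and durability machinery are internal).

NAMESPACES = {
    "retry-queue": {"target": "SQS",   "delay_s": 30},  # delayed-retry persona
    "replication": {"target": "Kafka", "delay_s": 0},   # cross-region replication persona
}

logs = {ns: [] for ns in NAMESPACES}  # in-memory stand-in for the durable backends

def write_to_log(namespace, payload):
    cfg = NAMESPACES[namespace]                      # resolve storage + config by namespace
    entry = {"payload": payload,
             "target": cfg["target"],
             "delay_s": cfg["delay_s"]}
    logs[namespace].append(entry)                    # append durably before acking the caller
    return entry

entry = write_to_log("retry-queue", {"op": "mutate", "key": "user:9"})
```

Swapping a namespace's backend (say, SQS to Kafka) then becomes a configuration change rather than a code change in every producer, which is the abstraction the simple API is buying.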