Netflix Tech Blog

Why it matters: This article introduces "Spin," a new Metaflow feature that significantly improves the iterative development experience for ML/AI engineers. It allows faster experimentation and debugging, bridging the gap between workflow orchestrators and interactive notebooks.

Metaflow, a framework open-sourced by Netflix, streamlines ML/AI workflows from prototype to production, emphasizing rapid iteration and reliable operations.
The new "Spin" command in Metaflow 2.19 significantly accelerates iterative ML/AI development by enabling quick, stateful execution of individual steps.
ML/AI development requires fast, stateful iteration because data and models are large and mutable, and the processes that produce them are computationally expensive.
Metaflow steps function as explicit, deterministic checkpoints, persisting state as versioned artifacts.
"Spin" allows developers to execute a single Metaflow step with inherited state, mimicking notebook cell behavior for instant feedback.
Unlike `run` or `resume`, `spin` is designed for fast, untracked, throw-away iterations, optimizing the development loop.
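
The checkpoint-and-spin idea can be illustrated with a minimal sketch in plain Python. This is not Metaflow's API: `ArtifactStore`, `run`, and `spin` here are hypothetical stand-ins that only mimic the behavior described above, where a full run persists each step's state and a spin re-executes a single step from its parent's persisted state without saving anything.

```python
import pickle
import tempfile
from pathlib import Path


class ArtifactStore:
    """Hypothetical store: persists each step's output state as a pickle file."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, step_name, state):
        (self.root / f"{step_name}.pkl").write_bytes(pickle.dumps(state))

    def load(self, step_name):
        return pickle.loads((self.root / f"{step_name}.pkl").read_bytes())


def load_data(state):
    state["rows"] = [1.0, 2.0, 3.0, 4.0]
    return state


def train(state):
    # Stand-in for an expensive training step.
    state["model_mean"] = sum(state["rows"]) / len(state["rows"])
    return state


STEPS = [("load_data", load_data), ("train", train)]


def run(store):
    """Full run: execute every step in order and checkpoint after each one."""
    state = {}
    for name, fn in STEPS:
        state = fn(dict(state))
        store.save(name, state)
    return state


def spin(store, step_name):
    """Re-execute one step from its parent's checkpoint; persist nothing."""
    names = [name for name, _ in STEPS]
    idx = names.index(step_name)
    parent_state = store.load(names[idx - 1]) if idx > 0 else {}
    return dict(STEPS)[step_name](parent_state)


store = ArtifactStore(tempfile.mkdtemp())
run(store)                    # tracked run: every step is checkpointed
spun = spin(store, "train")   # throw-away iteration on a single step
```

Because `spin` writes nothing back, the stored checkpoints are unchanged no matter how many times the step is edited and re-executed.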

#mlp #data

Netflix Tech Blog, Oct 25, 2025

Post-Training Generative Recommenders with Advantage-Weighted Supervised Finetuning

Why it matters: This article introduces A-SFT, a novel post-training algorithm for generative recommenders. It addresses key challenges like noisy reward models and lack of counterfactual data, offering a practical way to improve recommendation quality by better aligning models with user preferences.

Generative Recommenders (GRs) model user behavior as a sequential transduction task, inspired by transformer architectures.
Applying RLHF to GRs is challenging due to the lack of counterfactual feedback and the inherent noisiness of recommendation reward signals.
User feedback is on-policy, making it impractical to obtain evaluations for hypothetical or unseen recommendations.
Reward models in recommendation systems often exhibit high uncertainty, as user choices are less structured and more random than language data.
The paper proposes Advantage-Weighted Supervised Fine-tuning (A-SFT) to overcome these post-training challenges.
A-SFT combines supervised fine-tuning with the advantage function, effectively guiding optimization even with high-variance reward models.
This approach improves alignment between pre-trained generative recommenders and reward models, balancing offline RL and behavior cloning.
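
As a rough illustration of advantage weighting, the sketch below scales each example's negative log-likelihood by an exponentiated advantage. This follows the general form used in advantage-weighted regression; the paper's exact A-SFT objective may differ. The function name, the `beta` temperature, the weight clipping, and the normalization by total weight are all assumptions made for this example.

```python
import math


def advantage_weighted_nll(log_probs, rewards, values, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood over logged (observed) actions.

    log_probs: model log-probability of each logged action
    rewards:   reward-model score for each logged action
    values:    baseline estimate for each context
    beta:      temperature; a larger beta flattens weights toward plain SFT
    """
    total, weight_sum = 0.0, 0.0
    for lp, r, v in zip(log_probs, rewards, values):
        advantage = r - v
        # Clip the weight so one noisy reward estimate cannot dominate.
        w = min(math.exp(advantage / beta), max_weight)
        total += -w * lp
        weight_sum += w
    return total / weight_sum


# The first action has a positive advantage and is already likely under the model;
# the second has a negative advantage and is unlikely.
loss = advantage_weighted_nll(
    log_probs=[-0.1, -3.0], rewards=[1.0, 0.0], values=[0.5, 0.5]
)
unweighted = (0.1 + 3.0) / 2
```

As `beta` grows, every weight approaches 1 and the loss approaches ordinary supervised fine-tuning (behavior cloning); a small `beta` leans harder on the reward model, which is the trade-off the summary describes.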

#data #mlp

Netflix Tech Blog, Oct 21, 2025

Behind the Streams: Real-Time Recommendations for Live Events Part 3

Why it matters: This article details how Netflix scaled real-time recommendations for live events to millions of users, solving the "thundering herd" problem. It offers a robust, two-phase architectural pattern for high-concurrency, low-latency updates, crucial for distributed systems engineers.

Netflix developed a real-time recommendation system for live events to handle millions of concurrent users without overwhelming cloud services.
The core solution involves a two-phase approach: prefetching data to devices ahead of time and broadcasting low-cardinality messages to trigger updates.
Prefetching distributes load over a longer period, avoiding traffic spikes and optimizing request throughput and compute cardinality.
Real-time broadcasting uses state keys and timestamps to ensure devices update locally with prefetched data, guaranteeing delivery even on unstable networks.
This system successfully delivers updates to over 100 million devices in under a minute during peak live event loads.
It leverages a robust two-tier pub/sub architecture built on Pushy (WebSocket proxy), Apache Kafka, and Netflix's KV store for efficient, low-latency fanout.
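
The two-phase pattern can be sketched from the device's point of view. This is a simplified illustration, not Netflix's client code: the `Device` class, its field names, and the example state keys are hypothetical. Payloads arrive ahead of time, and the broadcast carries only a small state key and timestamp that tells the device which prefetched payload to show.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    prefetched: dict = field(default_factory=dict)  # state_key -> payload
    current: str = "default-row"
    last_applied_ts: int = 0

    def prefetch(self, payloads):
        """Phase 1: store payloads well before the event, spread over time."""
        self.prefetched.update(payloads)

    def on_broadcast(self, state_key, timestamp):
        """Phase 2: apply a tiny broadcast message using local data only."""
        if timestamp <= self.last_applied_ts:
            return False  # stale or duplicate message; safe to ignore
        if state_key not in self.prefetched:
            return False  # nothing local to show; a real client would refetch
        self.current = self.prefetched[state_key]
        self.last_applied_ts = timestamp
        return True


device = Device()
device.prefetch({"pre_event": "Starting soon", "live": "Watch live now"})
device.on_broadcast("live", timestamp=100)       # switches to the live payload
device.on_broadcast("pre_event", timestamp=90)   # older message, ignored
```

Because the timestamp check makes the update idempotent, the same message can be redelivered on a flaky connection without the device regressing to an older state.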

#dist #sre

Netflix Tech Blog, Oct 17, 2025

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…

Why it matters: This article details how Netflix built a real-time distributed graph to unify disparate data from microservices, enabling complex relationship analysis and personalized experiences. It showcases a robust stream processing architecture for internet-scale data.

Netflix developed a Real-Time Distributed Graph (RDG) to unify member interaction data across diverse services and devices, addressing the data silos created by its microservices architecture.
The RDG provides advantages like relationship-centric queries, schema flexibility, and efficient pattern detection over traditional data warehousing.
Its ingestion and processing pipeline relies on a stream processing architecture for real-time updates, crucial for maintaining an up-to-date graph.
Apache Kafka acts as the ingestion backbone, handling up to 1M messages/second, with Avro-encoded records and a schema registry.
Apache Flink jobs process these Kafka streams in near real-time, leveraging robust internal platform support for integration.
Data is also persisted to Apache Iceberg for backfilling, complementing Kafka's retention policies.

#dist #data

Netflix Tech Blog, Sep 29, 2025

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Why it matters: This article demonstrates how Netflix optimized its workflow orchestrator by 100X, crucial for supporting evolving business needs like real-time data processing and low-latency applications. It highlights the importance of engine redesign for scalability and developer productivity.

Netflix's Maestro workflow orchestrator achieved a 100X performance improvement, reducing overhead from seconds to milliseconds for Data/ML workflows.
The previous Maestro engine, based on deprecated Conductor 2.x, suffered from performance bottlenecks and race conditions due to its internal flow engine layer.
New business needs like Live, Ads, Games, and low-latency use cases necessitated a high-performance workflow engine.
The team evaluated options including upgrading Conductor, using Temporal, or implementing a custom internal flow engine.
They opted to rewrite Maestro's internal flow engine to simplify the architecture, eliminate complex database synchronizations, and ensure strong guarantees.

#data #dist #mlp

Netflix Tech Blog, Sep 26, 2025

Building a Resilient Data Platform with Write-Ahead Log at Netflix

Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.

Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
Its simple WriteToLog API abstracts internal complexities, using namespaces to define storage (Kafka, SQS) and configurations.
Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
It facilitates generic cross-region data replication for services like EVCache.
The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
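
The delayed-queue use case can be sketched with an in-memory stand-in. This is not the Netflix WAL or its actual `WriteToLog` signature: `InMemoryWal`, `write_to_log`, `poll`, and their parameters are assumptions for illustration, and the sketch has none of the durability a real WAL provides. It shows only how a namespace-scoped write with a delay can drive reliable retries.

```python
import heapq
import itertools


class InMemoryWal:
    """Hypothetical, non-durable stand-in for a namespaced delayed queue."""

    def __init__(self):
        self._queues = {}             # namespace -> heap of (deliver_at, seq, payload)
        self._seq = itertools.count() # tie-breaker keeps insertion order stable

    def write_to_log(self, namespace, payload, now, delay_seconds=0):
        heap = self._queues.setdefault(namespace, [])
        heapq.heappush(heap, (now + delay_seconds, next(self._seq), payload))

    def poll(self, namespace, now):
        """Return every payload in the namespace whose delivery time has passed."""
        heap = self._queues.get(namespace, [])
        due = []
        while heap and heap[0][0] <= now:
            due.append(heapq.heappop(heap)[2])
        return due


wal = InMemoryWal()
# A failed downstream call is re-enqueued with a 30-second delay.
wal.write_to_log("pipeline-retries", {"id": "a", "attempt": 2}, now=0, delay_seconds=30)
wal.write_to_log("pipeline-retries", {"id": "b", "attempt": 1}, now=0)
ready_now = wal.poll("pipeline-retries", now=10)
ready_later = wal.poll("pipeline-retries", now=30)
```

In this sketch the caller passes `now` explicitly so the behavior is deterministic; a real system would use its own clock and a durable backing store selected by the namespace configuration.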

#data #dist #sre

Netflix Tech Blog, Sep 22, 2025

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Why it matters: This article details how Netflix scaled a critical OLAP application to handle trillions of rows and complex queries. It showcases practical strategies using approximate distinct counts (HLL) and in-memory precomputed aggregates (Hollow) to achieve high performance and data accuracy.

Netflix's Muse application, an OLAP system for creative insights, evolved its architecture to handle trillions of rows and complex queries.
The updated data serving layer leverages HyperLogLog (HLL) sketches for efficient, approximate distinct counts, reducing query latencies by approximately 50%.
Hollow is used as a read-only, in-memory key-value store for precomputed aggregates, offloading Druid and improving performance for specific data access patterns.
The architecture now includes React, GraphQL, and Spring Boot gRPC microservices, with significant tuning applied to the Druid cluster.
The solution addresses challenges like dynamic analysis by audience affinities and combinatorial data explosion.
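
To show why HLL sketches suit this kind of workload, here is a small textbook HyperLogLog in plain Python. It is an educational sketch, not the implementation Muse or Druid uses; the class name, the SHA-1 hash choice, and the precision `p=12` are assumptions. The key property is that sketches merge by taking the register-wise maximum, so distinct counts for any combination of precomputed slices can be estimated without rescanning raw rows.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                    # precision: number of index bits
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches: register-wise maximum."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        est = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return est


# Two overlapping audiences, sketched separately and merged at query time.
a, b = HyperLogLog(), HyperLogLog()
for user_id in range(0, 6000):
    a.add(user_id)
for user_id in range(4000, 10000):
    b.add(user_id)
a.merge(b)
estimate = a.count()   # approximates the 10,000 distinct users in the union
```

With 4,096 registers the expected relative error is roughly 1.6%, which is the accuracy-for-speed trade-off that approximate distinct counts make.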

#data #dist

Netflix Tech Blog, Sep 19, 2025

Empowering Netflix Engineers with Incident Management

Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.

Netflix transitioned from a centralized SRE-led incident management system to a decentralized, "paved road" approach to empower all engineers.
The previous system, relying on basic tools, failed to scale with Netflix's growth, leading to missed learning opportunities from numerous uncaptured incidents.
They adopted incident.io after a build-vs-buy analysis, prioritizing intuitive UX, internal data integration, balanced customization, and an approachable design.
Key to successful adoption was the tool's intuitive design, which fostered a cultural shift, making incident management less intimidating and more accessible.
Organizational investment in standardized processes, educational resources, and internal data integrations significantly reduced cognitive load and accelerated adoption.
This transformation aimed to make incident declaration and management easy for any engineer, even for minor issues, to foster continuous improvement and system reliability.

#sre #culture

Netflix Tech Blog, Aug 21, 2025

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix

Why it matters: This article details how Netflix is innovating data engineering to tackle the unique challenges of media data for advanced ML. It offers insights into building specialized data platforms and roles for multi-modal content, crucial for any company dealing with large-scale unstructured media.

Netflix is evolving its data engineering function to "Media ML Data Engineering" to handle complex, multi-modal media data at scale.
This new specialization focuses on centralizing, standardizing, and managing media assets and their metadata for machine learning applications.
The "Media Data Lake" is introduced as a platform for storing and serving media assets, leveraging vector storage solutions like LanceDB.
Its architecture includes a Media Table for metadata, a robust data model, a Pythonic Data API, and distributed compute for ML training and inference.
The initiative aims to bridge creative media workflows with cutting-edge ML demands, enabling applications like content embedding and quality measures.

#data #mlp #dist

Netflix Tech Blog, Aug 18, 2025

ML Observability: Bringing Transparency to Payments and Beyond

Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.

ML Observability is essential for monitoring, understanding, and gaining insights into production ML models, ensuring reliability and continuous improvement.
At Netflix, it's crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
An effective framework includes automatic issue detection and root cause analysis, and it builds stakeholder trust by explaining system behavior.
Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
Logging requires a comprehensive data schema to capture model inputs, outputs, and metadata for effective analysis.
Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances.
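
The logging and monitoring modules can be sketched as a record schema plus one outcome-focused metric. The field names, the `PredictionLog` class, and the `success_rate_by_version` helper are hypothetical and are not Netflix's schema; the sketch only illustrates capturing inputs, outputs, and metadata together so that real-world outcomes can later be joined to each prediction.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PredictionLog:
    request_id: str
    model_version: str
    features: dict                   # model inputs as seen at inference time
    score: float                     # raw model output
    decision: str                    # action taken based on the score
    outcome: Optional[bool] = None   # filled in later, e.g. payment succeeded
    metadata: dict = field(default_factory=dict)


def success_rate_by_version(logs):
    """Outcome-focused metric: share of resolved requests that succeeded."""
    counts = defaultdict(lambda: [0, 0])   # version -> [successes, resolved]
    for log in logs:
        if log.outcome is None:
            continue                       # outcome not yet observed
        counts[log.model_version][0] += int(log.outcome)
        counts[log.model_version][1] += 1
    return {version: s / n for version, (s, n) in counts.items()}


logs = [
    PredictionLog("r1", "v1", {"retry_count": 0}, 0.91, "route_a", outcome=True),
    PredictionLog("r2", "v1", {"retry_count": 2}, 0.40, "route_b", outcome=False),
    PredictionLog("r3", "v2", {"retry_count": 1}, 0.77, "route_a", outcome=True),
    PredictionLog("r4", "v2", {"retry_count": 0}, 0.65, "route_a"),
]
rates = success_rate_by_version(logs)
```

Keeping features alongside each decision in the same record is also what makes later explanation possible, whether in aggregate or for a single request.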

#mlp #data #sre
