Why it matters: Building reliable LLM applications requires moving beyond ad-hoc testing. This framework shows engineers how to implement a rigorous, code-like evaluation pipeline to manage the unpredictability of probabilistic AI components and ensure consistent performance at scale.
- LLM pipelines involve complex probabilistic stages like intent classification and retrieval, requiring systematic evaluation to prevent regressions.
- Dropbox Dash moved from ad-hoc testing to an evaluation-first approach, treating every model or prompt change with the same rigor as production code.
- A hybrid dataset strategy combines public benchmarks like MS MARCO for baselining with internal production logs to capture real-world user behavior.
- Synthetic data generation using LLMs helps create evaluation sets for diverse content types, including tables, images, and factual lookups.
- Traditional NLP metrics like BLEU and ROUGE are often inadequate for RAG systems, necessitating the development of more actionable, task-specific rubrics (a rough sketch of one follows below).
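A minimal sketch of a rubric-style check of the kind described above, assuming a generic `llm_judge` callable (any LLM client wrapped as prompt-in, text-out); the criteria, scoring scale, and function names are illustrative, not Dropbox's actual rubric:

```python
import json
from typing import Callable

# Illustrative rubric; real rubrics would be tuned per task (search, summarization, Q&A).
RUBRIC = """Score the ANSWER against the SOURCES, 0-2 per criterion:
- groundedness: every claim is supported by the SOURCES
- completeness: the ANSWER addresses the whole QUESTION
- citation: the ANSWER points to the supporting source
Return JSON: {"groundedness": int, "completeness": int, "citation": int}"""

def score_answer(question: str, sources: list[str], answer: str,
                 llm_judge: Callable[[str], str]) -> dict:
    """Ask a judge model to grade one RAG answer against the rubric."""
    prompt = (f"{RUBRIC}\n\nQUESTION: {question}\nSOURCES:\n"
              + "\n".join(sources) + f"\nANSWER: {answer}")
    return json.loads(llm_judge(prompt))

def run_eval(dataset: list[dict], llm_judge: Callable[[str], str]) -> float:
    """Rows look like {'question', 'sources', 'answer'}; returns the mean total score (0-6)."""
    totals = [sum(score_answer(r["question"], r["sources"], r["answer"], llm_judge).values())
              for r in dataset]
    return sum(totals) / len(totals) if totals else 0.0
```

Unlike BLEU or ROUGE, a score like this traces back to a named failure mode (ungrounded claim, missing citation), which is what makes it actionable.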
Why it matters: This article demonstrates how Netflix made its workflow orchestrator 100X faster, a change crucial for supporting evolving business needs like real-time data processing and low-latency applications. It highlights how an engine redesign can pay off in both scalability and developer productivity.
- Netflix's Maestro workflow orchestrator achieved a 100X performance improvement, reducing overhead from seconds to milliseconds for Data/ML workflows.
- The previous Maestro engine, based on the deprecated Conductor 2.x, suffered from performance bottlenecks and race conditions due to its internal flow engine layer.
- New business needs like Live, Ads, Games, and low-latency use cases necessitated a high-performance workflow engine.
- The team evaluated options including upgrading Conductor, adopting Temporal, or implementing a custom internal flow engine.
- They opted to rewrite Maestro's internal flow engine to simplify the architecture, eliminate complex database synchronizations, and ensure strong guarantees (a generic sketch of such a flow engine follows below).
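For readers unfamiliar with the term, an "internal flow engine" is essentially a state machine that advances workflow steps. The toy below keeps that state machine in process and in memory; it is a generic illustration under assumed names (`Step`, `State`, `run_flow`), not Maestro's actual engine:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

@dataclass
class Step:
    name: str
    action: Callable[[], None]
    state: State = State.PENDING

def run_flow(steps: list[Step]) -> State:
    """Advance each step through its transitions in memory; stop on the first failure.
    Keeping transitions in process is what avoids per-step database synchronization."""
    for step in steps:
        step.state = State.RUNNING
        try:
            step.action()
            step.state = State.SUCCEEDED
        except Exception:
            step.state = State.FAILED
            return State.FAILED
    return State.SUCCEEDED

# Usage: a two-step workflow whose per-step overhead is just a function call.
print(run_flow([Step("extract", lambda: None), Step("load", lambda: None)]))
```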
Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.
- Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
- The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
- Its simple WriteToLog API abstracts internal complexities, using namespaces to define storage (Kafka, SQS) and configurations (see the sketch after this list).
- Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
- It facilitates generic cross-region data replication for services like EVCache.
- The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
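A minimal sketch of the namespace idea behind a WriteToLog-style call, with invented names (`WalClient`, `NamespaceConfig`) standing in for whatever the real interfaces look like: the caller only names a namespace, and the namespace's configuration decides the backing queue and any delivery delay:

```python
from dataclasses import dataclass

@dataclass
class NamespaceConfig:
    backend: str            # e.g. "kafka" or "sqs"
    target: str             # topic or queue name
    delay_seconds: int = 0  # > 0 enables delayed delivery for retry use cases

class WalClient:
    def __init__(self, namespaces: dict[str, NamespaceConfig]):
        self.namespaces = namespaces

    def write_to_log(self, namespace: str, key: bytes, payload: bytes) -> None:
        cfg = self.namespaces[namespace]
        # A real client would append durably to the configured backend; the point here
        # is only that the storage choice stays hidden behind the namespace.
        print(f"append -> {cfg.backend}:{cfg.target} delay={cfg.delay_seconds}s key={key!r}")

# Usage: the caller never touches Kafka or SQS directly.
wal = WalClient({
    "evcache-replication": NamespaceConfig("kafka", "wal-evcache-replication"),
    "kv-retries": NamespaceConfig("sqs", "wal-kv-retries", delay_seconds=60),
})
wal.write_to_log("kv-retries", b"row:42", b'{"op": "put"}')
```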
Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.
- Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands and massive data volumes, and to improve operational efficiency.
- Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
- Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability, flexible consistency, and enhanced cost visibility.
- The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
- Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
- Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
- A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and no data loss (the dual-write pattern is sketched below).
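A bare-bones sketch of the dual-write step in such a blue/green migration, with assumed `put`/`get` store interfaces rather than Mussel's actual ones: every write lands on both stores while reads stay on v1, and a single flag flips reads to v2 once it is verified:

```python
class DualWriteStore:
    """Wraps the old (v1) and new (v2) stores during migration."""

    def __init__(self, v1, v2, read_from_v2: bool = False):
        self.v1, self.v2 = v1, v2
        self.read_from_v2 = read_from_v2

    def put(self, key: str, value: bytes) -> None:
        self.v1.put(key, value)      # old store stays authoritative
        try:
            self.v2.put(key, value)  # shadow write to the new store
        except Exception:
            pass                     # a v2 hiccup must not break production writes

    def get(self, key: str) -> bytes:
        primary = self.v2 if self.read_from_v2 else self.v1
        return primary.get(key)
```

Any shadow writes missed this way would be reconciled by the bootstrapping/backfill step before reads move over.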
Why it matters: This article details how Netflix scaled a critical OLAP application to handle trillions of rows and complex queries. It showcases practical strategies using approximate distinct counts (HLL) and in-memory precomputed aggregates (Hollow) to achieve high performance and data accuracy.
- Netflix's Muse application, an OLAP system for creative insights, evolved its architecture to handle trillions of rows and complex queries.
- The updated data serving layer leverages HyperLogLog (HLL) sketches for efficient, approximate distinct counts, reducing query latencies by approximately 50% (a toy HLL is sketched after this list).
- Hollow is used as a read-only, in-memory key-value store for precomputed aggregates, offloading Druid and improving performance for specific data access patterns.
- The architecture now includes React, GraphQL, and Spring Boot gRPC microservices, with significant tuning applied to the Druid cluster.
- The solution addresses challenges like dynamic analysis by audience affinities and combinatorial data explosion.
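To make the HLL part concrete, here is a self-contained toy HyperLogLog (small- and large-range corrections omitted). It only shows how register-based approximate distinct counting works; the article's serving layer uses HLL sketches in its query path rather than hand-rolled code like this:

```python
import hashlib

class HyperLogLog:
    """Toy HyperLogLog: approximate distinct count using 2**p registers."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")  # 64-bit hash
        idx = h >> (64 - self.p)                       # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        return self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)

# Usage: roughly 1-2% error with p=12, using a few KB instead of a set of 100k strings.
hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"profile-{i}")
print(round(hll.estimate()))
```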
Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.
- Netflix transitioned from a centralized SRE-led incident management system to a decentralized, "paved road" approach to empower all engineers.
- The previous system, relying on basic tools, failed to scale with Netflix's growth, leading to missed learning opportunities from numerous uncaptured incidents.
- They adopted Incident.io after a build-vs-buy analysis, prioritizing intuitive UX, internal data integration, balanced customization, and an approachable design.
- Key to successful adoption was the tool's intuitive design, which fostered a cultural shift, making incident management less intimidating and more accessible.
- Organizational investment in standardized processes, educational resources, and internal data integrations significantly reduced cognitive load and accelerated adoption.
- This transformation aimed to make incident declaration and management easy for any engineer, even for minor issues, to foster continuous improvement and system reliability.
Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.
- Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
- It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
- The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers (illustrated in the sketch after this list).
- Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
- This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.
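A plain-Python stand-in for the two resolver kinds (Viaduct's real Tenant API is Kotlin and GraphQL-native, and the decorator names below are invented): a node resolver loads an object by id, while a field resolver lets a different module contribute a computed field to that type without a code dependency on the owning module:

```python
from typing import Callable

node_resolvers: dict[str, Callable] = {}
field_resolvers: dict[tuple[str, str], Callable] = {}

def node_resolver(type_name: str):
    def register(fn: Callable) -> Callable:
        node_resolvers[type_name] = fn
        return fn
    return register

def field_resolver(type_name: str, field_name: str):
    def register(fn: Callable) -> Callable:
        field_resolvers[(type_name, field_name)] = fn
        return fn
    return register

# A "listings" module owns the Listing node...
@node_resolver("Listing")
def resolve_listing(listing_id: str) -> dict:
    return {"id": listing_id, "nightly_price": 120}

# ...and a separate "pricing" module contributes a field to it, composing through the
# schema rather than importing the listings module's code.
@field_resolver("Listing", "totalPrice")
def resolve_total_price(listing: dict, nights: int) -> int:
    return listing["nightly_price"] * nights

listing = node_resolvers["Listing"]("L1")
print(field_resolvers[("Listing", "totalPrice")](listing, nights=3))  # 360
```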
Why it matters: This article introduces a novel approach to managing complex microservice architectures. By shifting to a data-oriented service mesh with a central GraphQL schema, engineers can significantly improve modularity, simplify dependency management, and enhance data agility in large-scale SOAs.
- Airbnb introduced Viaduct, a data-oriented service mesh, to improve modularity and address the complexity of massive dependency graphs in microservices-based Service-Oriented Architectures (SOA).
- Traditional service meshes are procedure-oriented, leading to "spaghetti SOA" where managing and modifying services becomes increasingly difficult.
- Viaduct shifts to a data-oriented design, leveraging GraphQL to define a central schema comprising types, queries, and mutations across the entire service mesh.
- This data-oriented approach abstracts service dependencies from data consumers, as Viaduct intelligently routes requests to the appropriate microservices (see the routing sketch after this list).
- The central GraphQL schema acts as a single source of truth, aiming to define service APIs and potentially database schemas, which significantly enhances data agility.
- By centralizing schema definition, Viaduct seeks to streamline changes, allowing database updates to propagate to client code with a single, coordinated update, reducing weeks of effort.
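A toy sketch of the data-oriented routing idea, with an invented field-to-service table and a stub RPC (not how Viaduct is implemented): the consumer names the fields it wants from the central schema, and the mesh decides which services to call:

```python
# Central-schema concept: each field has exactly one owning service.
FIELD_OWNERS = {
    "user.profile": "user-service",
    "user.reservations": "reservations-service",
    "listing.pricing": "pricing-service",
}

def fetch_from_service(service: str, field: str, entity_id: str) -> dict:
    """Stand-in for an RPC to the owning microservice."""
    return {"field": field, "served_by": service, "id": entity_id}

def resolve(requested_fields: list[str], entity_id: str) -> dict:
    """Consumers express a data need, not a call graph; routing stays in the mesh."""
    return {f: fetch_from_service(FIELD_OWNERS[f], f, entity_id)
            for f in requested_fields}

print(resolve(["user.profile", "user.reservations"], entity_id="u123"))
```

Moving a field to a different owning service only changes the central mapping, which is the data-agility point made in the last bullet.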
Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.
- Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
- EKS cluster deployment utilizes Terraform with a layered structure of AWS-originated and Pinterest-specific modules and Helm charts.
- A comprehensive logging strategy is implemented for Moka, addressing EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, event logs), and system pod logs.
- A key challenge in logging is ensuring reliable upload of Spark event logs to S3, even during job failures, for consumption by Spark History Server.
- They are exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis (the sidecar idea is sketched below).
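One way to approximate the "event logs survive a failed driver" requirement is a sidecar that keeps re-uploading the driver's event log file to S3 on a short interval. The paths, bucket name, and interval below are assumptions, and this is a sketch of the sidecar idea rather than Pinterest's implementation:

```python
import time
from pathlib import Path

import boto3

EVENT_LOG_DIR = Path("/var/log/spark-events")  # volume shared with the Spark driver container
BUCKET = "example-spark-history"               # hypothetical bucket read by Spark History Server

def sync_event_logs(s3, interval_seconds: int = 30) -> None:
    """Periodically copy every event log to S3 so the latest state outlives a driver crash."""
    while True:
        for log_file in EVENT_LOG_DIR.glob("*"):
            if log_file.is_file():
                s3.upload_file(str(log_file), BUCKET, f"spark-events/{log_file.name}")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    sync_event_logs(boto3.client("s3"))
```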
Why it matters: This article details Slack's Anomaly Event Response, showcasing a real-world example of building a proactive, automated security system. Engineers can learn about designing multi-tiered architectures for real-time threat detection and response, crucial for modern platform security.
- Slack's Anomaly Event Response (AER) is a proactive security system that detects and responds to emerging threat behaviors in real time.
- AER automatically terminates suspicious user sessions, reducing the detection-to-response gap from hours or days to minutes.
- It targets common threats like Tor access, excessive downloads, data scraping, session fingerprint mismatches, and unusual API patterns.
- The system uses a multi-tiered architecture: detection engine, decision framework, and response orchestrator (sketched after this list).
- Enterprise Grid customers can configure which anomalies trigger automated responses and set notification preferences.
- This native solution disrupts attack chains, preventing data exfiltration and system compromise without additional tooling or manual effort.
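A compressed sketch of the three tiers; the event fields, thresholds, and actions are invented for illustration and are not Slack's actual detection rules: detectors emit anomaly events, a decision step applies the customer's configuration, and a response step acts on the session:

```python
from dataclasses import dataclass

@dataclass
class AnomalyEvent:
    user_id: str
    kind: str       # e.g. "tor_access", "excessive_downloads"
    severity: int   # 1 (low) .. 3 (high)

def detect(activity: dict) -> list[AnomalyEvent]:
    """Detection engine: turn raw activity into anomaly events."""
    events = []
    if activity.get("downloads_last_hour", 0) > 500:
        events.append(AnomalyEvent(activity["user_id"], "excessive_downloads", 2))
    if activity.get("via_tor"):
        events.append(AnomalyEvent(activity["user_id"], "tor_access", 3))
    return events

def decide(event: AnomalyEvent, enabled_responses: set[str]) -> str:
    """Decision framework: admin configuration controls which anomalies auto-respond."""
    if event.kind in enabled_responses and event.severity >= 2:
        return "terminate_session"
    return "notify_only"

def respond(event: AnomalyEvent, action: str) -> None:
    """Response orchestrator: act on the session (stubbed here as a print)."""
    print(f"{action}: user={event.user_id} anomaly={event.kind}")

for evt in detect({"user_id": "U1", "downloads_last_hour": 900, "via_tor": True}):
    respond(evt, decide(evt, enabled_responses={"tor_access", "excessive_downloads"}))
```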