
Posts tagged with data

Why it matters: This article details how Pinterest uses advanced ML and LLMs to understand complex user intent, moving beyond simple recommendations to goal-oriented assistance. It offers a practical blueprint for building robust, extensible recommendation systems from limited initial data.

  • Pinterest developed a system to identify "user journeys" – sequences of user-item interactions revealing long-term goals beyond immediate interests.
  • The system uses a dynamic keyword extraction approach, leveraging user search history, activity, and boards.
  • Keywords are processed with pretrained text embeddings (e.g., SearchSage) and then hierarchically clustered to form journey candidates (see the sketch after this list).
  • Specialized models handle journey naming (currently keyword-based, evolving to LLMs), expansion (LLM-generated recommendations), ranking, and diversification.
  • The architecture emphasizes lean development, starting small with annotated data, and extensibility for future advanced ML/LLM techniques.
  • The inference pipeline runs on a streaming system for quick adaptation to recent user activities.
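
A minimal sketch of the journey-candidate step described above: embed extracted keywords, then hierarchically cluster them so related queries collapse into candidate journeys. SearchSage is internal to Pinterest, so `embed()` below is a hypothetical stand-in for any pretrained text-embedding model, and the keywords and threshold are illustrative.

```python
# Sketch: keyword embeddings -> hierarchical clustering -> journey candidates.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed(keywords: list[str]) -> np.ndarray:
    # Stand-in for a pretrained text embedder such as SearchSage (assumption).
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(keywords), 64))

keywords = ["nursery ideas", "crib bedding", "baby shower games",
            "home office desk", "standing desk setup"]

# distance_threshold controls how aggressively keywords merge into journeys.
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=12.0, linkage="average")
labels = clusterer.fit_predict(embed(keywords))

for label in sorted(set(labels)):
    members = [k for k, lbl in zip(keywords, labels) if lbl == label]
    print(f"journey candidate {label}: {members}")
```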

Why it matters: DSF tackles the scaling limits of traditional network fabrics for AI. Its disaggregated architecture, packet spraying, and credit-based congestion control deliver high-performance, lossless connectivity for massive GPU clusters, a prerequisite for large-scale AI model training.

  • Meta's Disaggregated Scheduled Fabric (DSF) is a next-generation network technology designed to scale AI training networks beyond the physical limits of traditional Clos-based architectures.
  • DSF disaggregates line cards (Interface Nodes) and fabric cards (Fabric Nodes) into distinct hardware, creating a distributed system for enhanced scalability and performance.
  • It addresses critical challenges in AI workloads, such as "elephant flows" and "low entropy" traffic patterns, which cause congestion and suboptimal utilization in conventional IP fabrics.
  • The system employs a two-domain architecture, packet spraying, and a credit-based congestion control algorithm for efficient, lossless traffic management (a toy model of the last two ideas follows this list).
  • Built on open standards like OCP-SAI and managed by FBOSS, DSF enables the creation of large virtual chassis switches capable of interconnecting thousands of GPUs for massive AI clusters.
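
A toy model of two of those mechanisms, purely illustrative and not Meta's implementation: packets of one flow are sprayed round-robin across all fabric links, and the ingress transmits only when the egress has granted it a buffer credit, so queues can never overflow (the lossless property). The two-domain architecture is not modeled here.

```python
# Toy packet spraying + credit-based flow control.
from collections import deque
from itertools import cycle

FABRIC_LINKS = 4
BUFFER_SLOTS = 8  # egress buffer capacity, expressed as credits

class EgressNode:
    """Grants credits against its free buffer slots, so it never overflows."""
    def __init__(self):
        self.queue = deque()
        self.free_credits = BUFFER_SLOTS

    def grant(self) -> bool:
        if self.free_credits == 0:
            return False
        self.free_credits -= 1
        return True

    def receive(self, pkt):
        self.queue.append(pkt)      # a granted credit reserved this slot

    def drain(self):
        if self.queue:
            self.queue.popleft()
            self.free_credits += 1  # freed slot becomes a fresh credit

def spray(packets, egress: EgressNode):
    links = cycle(range(FABRIC_LINKS))  # spread one flow over every link
    for pkt in packets:
        while not egress.grant():       # no credit: wait, never drop
            egress.drain()              # (modeled by draining one packet)
        egress.receive(pkt)
        print(f"packet {pkt} -> fabric link {next(links)}")

spray(range(12), EgressNode())
```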

Why it matters: This article details how Netflix built a real-time distributed graph to unify disparate data from microservices, enabling complex relationship analysis and personalized experiences. It showcases a robust stream processing architecture for internet-scale data.

  • Netflix developed a Real-Time Distributed Graph (RDG) to unify member interaction data across diverse services and devices, addressing data silos from their microservices architecture.
  • The RDG provides advantages like relationship-centric queries, schema flexibility, and efficient pattern detection over traditional data warehousing.
  • Its ingestion and processing pipeline relies on a stream processing architecture for real-time updates, crucial for maintaining an up-to-date graph.
  • Apache Kafka acts as the ingestion backbone, handling up to 1M messages/second, with Avro-encoded records and schema registry.
  • Apache Flink jobs process these Kafka streams in near real-time, leveraging robust internal platform support for integration (see the ingestion sketch after this list).
  • Data is also persisted to Apache Iceberg for backfilling, complementing Kafka's retention policies.
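
A minimal PyFlink sketch of that ingestion path: consume interaction events from Kafka and map them to graph-edge upserts. The topic name, event fields, and print sink are assumptions; the real jobs also decode Avro records against a schema registry, omitted here for brevity.

```python
# Sketch: Kafka source -> Flink map -> graph-edge upserts.
import json
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import KafkaSource, KafkaOffsetsInitializer

env = StreamExecutionEnvironment.get_execution_environment()

source = (KafkaSource.builder()
          .set_bootstrap_servers("kafka:9092")          # assumption
          .set_topics("member-interactions")            # assumption
          .set_group_id("rdg-ingest")
          .set_starting_offsets(KafkaOffsetsInitializer.latest())
          .set_value_only_deserializer(SimpleStringSchema())
          .build())

def to_edge(raw: str) -> str:
    # Turn one interaction event into a graph edge upsert (hypothetical shape).
    event = json.loads(raw)
    return json.dumps({"src": event["member_id"], "dst": event["entity_id"],
                       "rel": event["action"], "ts": event["timestamp"]})

(env.from_source(source, WatermarkStrategy.no_watermarks(), "interactions")
    .map(to_edge)
    .print())  # stand-in for the real graph-store sink

env.execute("rdg-ingestion-sketch")
```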

Why it matters: This article offers engineers actionable design principles to reduce IT hardware's environmental impact, fostering sustainability and cost savings through circularity and emissions reduction in data center infrastructure.

  • Meta introduces "Design for Sustainability" principles for IT hardware to cut emissions and costs via reuse, extended life, and optimized design.
  • Key strategies include modularity, retrofitting, dematerialization, greener materials, and extending hardware lifecycles in data centers.
  • The focus is on reducing Scope 3 emissions from manufacturing, delivery, and end-of-life of IT hardware components.
  • Methods involve optimizing material selection, using lower carbon alternatives, extending rack life, and harvesting components for reuse.
  • These principles apply across various rack types (AI, Compute, Storage, Network) and target components like compute, storage, and cooling.
  • Collaboration with suppliers to electrify processes and transition to renewable energy is crucial for achieving net-zero goals.
  • The initiative also significantly reduces electronic waste (e-waste) generated from data centers.

Why it matters: Building reliable LLM applications requires moving beyond ad-hoc testing. This framework shows engineers how to implement a rigorous, code-like evaluation pipeline to manage the unpredictability of probabilistic AI components and ensure consistent performance at scale.

  • LLM pipelines involve complex probabilistic stages like intent classification and retrieval, requiring systematic evaluation to prevent regressions.
  • Dropbox Dash moved from ad-hoc testing to an evaluation-first approach, treating every model or prompt change with the same rigor as production code.
  • A hybrid dataset strategy combines public benchmarks like MS MARCO for baselining with internal production logs to capture real-world user behavior.
  • Synthetic data generation using LLMs helps create evaluation sets for diverse content types, including tables, images, and factual lookups.
  • Traditional NLP metrics like BLEU and ROUGE are often inadequate for RAG systems, necessitating more actionable, task-specific rubrics (a test-style gate is sketched after this list).
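
A minimal sketch of the evaluation-first idea: every model or prompt change runs a fixed eval set through the pipeline and is gated on task-specific metrics, the way unit tests gate code. The dataset shape, retrieve() stub, and threshold are assumptions.

```python
# Sketch: a retrieval eval that runs (and fails) like a unit test.
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    relevant_doc_ids: set[str]  # labeled from logs or synthetic generation

def retrieve(query: str, k: int = 10) -> list[str]:
    # Stand-in for the real retrieval stage under test (assumption).
    return ["doc-1", "doc-7", "doc-3"]

def hit_rate_at_k(cases: list[EvalCase], k: int = 10) -> float:
    # Fraction of queries whose top-k results contain any relevant doc.
    hits = sum(bool(set(retrieve(c.query, k)) & c.relevant_doc_ids)
               for c in cases)
    return hits / len(cases)

def test_retrieval_no_regression():
    cases = [EvalCase("quarterly revenue table", {"doc-3"}),
             EvalCase("onboarding checklist", {"doc-7", "doc-9"})]
    # Gate the change: fail CI if the metric drops below the current baseline.
    assert hit_rate_at_k(cases) >= 0.80
```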

Why it matters: This article details how Netflix made its workflow orchestrator 100X faster, crucial for supporting evolving business needs like real-time data processing and low-latency applications. It shows how an engine redesign can pay off in both scalability and developer productivity.

  • Netflix's Maestro workflow orchestrator achieved a 100X performance improvement, reducing overhead from seconds to milliseconds for Data/ML workflows.
  • The previous Maestro engine, based on deprecated Conductor 2.x, suffered from performance bottlenecks and race conditions due to its internal flow engine layer.
  • New business needs like Live, Ads, Games, and low-latency use cases necessitated a high-performance workflow engine.
  • The team evaluated options including upgrading Conductor, using Temporal, or implementing a custom internal flow engine.
  • They opted to rewrite Maestro's internal flow engine to simplify the architecture, eliminate complex database synchronizations, and ensure strong guarantees.

Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.

  • Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
  • The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
  • Its simple WriteToLog API abstracts internal complexities, using namespaces to select the backing storage (Kafka, SQS) and its configuration (a hypothetical caller is sketched after this list).
  • Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
  • It facilitates generic cross-region data replication for services like EVCache.
  • The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
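
A hypothetical sketch of what calling the WAL might look like. The article names the WriteToLog API and its namespace abstraction; the request shape, field names, and client below are assumptions.

```python
# Sketch: a delayed-queue retry via a WriteToLog-style API (hypothetical).
from dataclasses import dataclass

@dataclass
class WriteToLogRequest:
    namespace: str          # selects the backing queue (e.g., Kafka or SQS)
    payload: bytes
    delay_seconds: int = 0  # >0 models the delayed-queue persona for retries

class WalClient:
    def write_to_log(self, req: WriteToLogRequest) -> None:
        # The real service durably appends the record, then a consumer
        # delivers it downstream after req.delay_seconds.
        print(f"appended {len(req.payload)} bytes to '{req.namespace}' "
              f"(deliver after {req.delay_seconds}s)")

# A pipeline retries a failed mutation by re-enqueueing it with a delay.
wal = WalClient()
wal.write_to_log(WriteToLogRequest(namespace="kv-mutations",
                                   payload=b'{"op": "put", "key": "k1"}',
                                   delay_seconds=30))
```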

Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.

  • Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands, massive data, and improve operational efficiency.
  • Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
  • Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability (sketched after this list), flexible consistency, and enhanced cost visibility.
  • The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
  • Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
  • Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
  • A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and data loss.
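
A minimal sketch of the dynamic range sharding that replaced v1's static partitioning, under assumed boundaries and split policy: keys route to the shard whose key range contains them, and a hot range is relieved by simply inserting a new boundary rather than rebuilding a static layout.

```python
# Sketch: range-sharded key routing with dynamic splits.
import bisect

class RangeShardMap:
    def __init__(self, boundaries: list[str]):
        self.boundaries = sorted(boundaries)  # shard i covers keys < boundaries[i]

    def shard_for(self, key: str) -> int:
        return bisect.bisect_right(self.boundaries, key)

    def split(self, boundary: str) -> None:
        # Splitting a hot range is just adding a boundary; traffic for the
        # range is then shared by two shards, relieving the hotspot.
        bisect.insort(self.boundaries, boundary)

shards = RangeShardMap(boundaries=["g", "n", "t"])
print(shards.shard_for("listing:1234"))  # -> shard 1 ('g' <= key < 'n')
shards.split("listing:m")                # split the hot 'g'..'n' range
print(shards.shard_for("listing:1234"))  # still shard 1, but its range shrank
```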

Why it matters: This article details how Netflix scaled a critical OLAP application to handle trillions of rows and complex queries. It showcases practical strategies using approximate distinct counts (HLL) and in-memory precomputed aggregates (Hollow) to achieve high performance and data accuracy.

  • Netflix's Muse application, an OLAP system for creative insights, evolved its architecture to handle trillions of rows and complex queries.
  • The updated data serving layer leverages HyperLogLog (HLL) sketches for efficient, approximate distinct counts, cutting query latencies by roughly 50% (see the sketch after this list).
  • Hollow is used as a read-only, in-memory key-value store for precomputed aggregates, offloading Druid and improving performance for specific data access patterns.
  • The architecture now includes React, GraphQL, and Spring Boot gRPC microservices, with significant tuning applied to the Druid cluster.
  • The solution addresses challenges like dynamic analysis by audience affinities and combinatorial data explosion.
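
A minimal illustration of why HLL sketches cut latency: distinct counts are precomputed as small, mergeable sketches, so query time merges fixed-size structures instead of re-scanning raw rows. This uses the datasketch library as a stand-in; Muse relies on Druid's native HLL support.

```python
# Sketch: precomputed, mergeable approximate distinct counts with HLL.
from datasketch import HyperLogLog

def sketch(ids):
    hll = HyperLogLog(p=12)  # 2^12 registers: ~1.6% relative error
    for i in ids:
        hll.update(str(i).encode("utf-8"))
    return hll

# Precompute one sketch per day (or per audience segment) at ingest time.
monday = sketch(range(0, 60_000))
tuesday = sketch(range(40_000, 90_000))  # overlaps Monday's audience

# At query time, merging sketches answers "distinct viewers Mon-Tue"
# without touching raw rows; the union is approximate but cheap.
monday.merge(tuesday)
print(f"approx distinct: {monday.count():,.0f} (exact: 90,000)")
```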

Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.

  • Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
  • It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
  • The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers.
  • Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
  • This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.
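
A rough Python analogue of the two resolver kinds the Tenant API settled on: a node resolver loads an entity by id, and field resolvers derive individual fields on it. Viaduct itself is a JVM/GraphQL system; the ariadne library and the Listing schema below are stand-ins.

```python
# Sketch: node resolver vs. field resolver over a tiny GraphQL schema.
from ariadne import ObjectType, QueryType, graphql_sync, make_executable_schema

type_defs = """
type Query {
  node(id: ID!): Listing
}
type Listing {
  id: ID!
  title: String!
  displayTitle: String!
}
"""

query = QueryType()
listing = ObjectType("Listing")

@query.field("node")
def resolve_node(_, info, id):
    # Node resolver: the one place that knows how to fetch the entity (stubbed).
    return {"id": id, "title": "Cozy cabin"}

@listing.field("displayTitle")
def resolve_display_title(obj, info):
    # Field resolver: derives a field from the node; owned by one tenant module.
    return obj["title"].upper()

schema = make_executable_schema(type_defs, query, listing)
_, result = graphql_sync(schema, {"query": '{ node(id: "1") { displayTitle } }'})
print(result["data"])  # {'node': {'displayTitle': 'COZY CABIN'}}
```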