Posts tagged with data
Why it matters: This article introduces a novel approach to managing complex microservice architectures. By shifting to a data-oriented service mesh with a central GraphQL schema, engineers can significantly improve modularity, simplify dependency management, and enhance data agility in large-scale SOAs.
- Airbnb introduced Viaduct, a data-oriented service mesh, to improve modularity and tame the massive dependency graphs of its microservices-based Service-Oriented Architecture (SOA).
- Traditional service meshes are procedure-oriented, leading to "spaghetti SOA" where managing and modifying services becomes increasingly difficult.
- Viaduct shifts to a data-oriented design, leveraging GraphQL to define a central schema comprising types, queries, and mutations across the entire service mesh.
- This data-oriented approach abstracts service dependencies away from data consumers, as Viaduct routes requests to the appropriate microservices.
- The central GraphQL schema acts as a single source of truth, aiming to define service APIs and potentially database schemas, which significantly enhances data agility.
- By centralizing schema definition, Viaduct seeks to streamline changes, allowing database updates to propagate to client code with a single, coordinated update instead of weeks of effort (see the schema sketch after this list).
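To make the data-oriented model concrete, here is a minimal sketch of a central schema plus a routing-style resolver, written with the open-source graphql-core Python library; the Listing type, its fields, and the stub resolver are hypothetical illustrations, not Viaduct's actual schema or routing layer.

```python
# Minimal sketch of a central, data-oriented GraphQL schema (assumes
# graphql-core). Consumers query types, not service endpoints; the
# resolver stands in for the mesh's routing layer.
from graphql import build_schema, graphql_sync

schema = build_schema("""
    type Listing {
        id: ID!
        title: String
        hostId: ID
    }
    type Query {
        listing(id: ID!): Listing
    }
""")

def resolve_listing(info, id):
    # A real mesh would route this to whichever microservice owns
    # the Listing type; here the call is stubbed with static data.
    return {"id": id, "title": "Cozy loft", "hostId": "h42"}

result = graphql_sync(
    schema,
    '{ listing(id: "123") { title hostId } }',
    root_value={"listing": resolve_listing},
)
print(result.data)  # {'listing': {'title': 'Cozy loft', 'hostId': 'h42'}}
```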
Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.
- Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
- EKS cluster deployment uses Terraform, with a layered structure of AWS-originated and Pinterest-specific modules plus Helm charts.
- A comprehensive logging strategy covers Moka's EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, and event logs), and system pod logs.
- A key challenge is ensuring Spark event logs are reliably uploaded to S3, even when jobs fail, so the Spark History Server can consume them.
- The team is exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis (a sidecar-style sketch follows this list).
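A minimal sketch of the sidecar idea, assuming a boto3 uploader that periodically syncs the driver's local event log directory to S3 so the Spark History Server still gets logs from failed jobs; the bucket, paths, and interval are hypothetical, and Pinterest's actual listener/sidecar design is not public.

```python
# Sidecar-style event log uploader (sketch). Re-uploads any event log
# file that has grown since the last sync, so progress survives a
# driver crash. Assumes AWS credentials are available to boto3.
import os
import time
import boto3

EVENT_LOG_DIR = "/var/log/spark-events"  # matches spark.eventLog.dir
BUCKET = "moka-spark-history"            # hypothetical S3 bucket

s3 = boto3.client("s3")
uploaded_sizes: dict[str, int] = {}

def sync_event_logs() -> None:
    for name in os.listdir(EVENT_LOG_DIR):
        path = os.path.join(EVENT_LOG_DIR, name)
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        if uploaded_sizes.get(name) != size:  # new or grown since last sync
            s3.upload_file(path, BUCKET, f"eventlogs/{name}")
            uploaded_sizes[name] = size

while True:
    sync_event_logs()
    time.sleep(30)  # keep syncing until the pod is torn down
```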
Why it matters: This article details how Netflix is innovating data engineering to tackle the unique challenges of media data for advanced ML. It offers insights into building specialized data platforms and roles for multi-modal content, crucial for any company dealing with large-scale unstructured media.
- Netflix is evolving its data engineering function to "Media ML Data Engineering" to handle complex, multi-modal media data at scale.
- This new specialization focuses on centralizing, standardizing, and managing media assets and their metadata for machine learning applications.
- The "Media Data Lake" is introduced as a platform for storing and serving media assets, leveraging vector storage solutions like LanceDB (see the sketch after this list).
- Its architecture includes a Media Table for metadata, a robust data model, a Pythonic Data API, and distributed compute for ML training and inference.
- The initiative aims to bridge creative media workflows with cutting-edge ML demands, enabling applications like content embedding and quality measures.
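As a flavor of what a vector-backed Media Table might look like, here is a minimal sketch using LanceDB's Python client; the table name, fields, and embeddings are hypothetical stand-ins, since Netflix's Media Table and Data API are internal.

```python
# Media-metadata table with vector search (sketch, assumes lancedb).
import lancedb

db = lancedb.connect("/tmp/media-data-lake")  # local path for illustration

# Each row pairs an asset's metadata with a content embedding.
rows = [
    {"asset_id": "ep_101", "kind": "video", "vector": [0.1, 0.9, 0.0, 0.2]},
    {"asset_id": "ep_102", "kind": "video", "vector": [0.8, 0.1, 0.3, 0.0]},
]
table = db.create_table("media_assets", data=rows, mode="overwrite")

# Nearest-neighbor query: assets whose embeddings resemble a probe vector.
hits = table.search([0.1, 0.8, 0.1, 0.1]).limit(1).to_pandas()
print(hits[["asset_id", "kind"]])
```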
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
- ML observability is essential for monitoring, understanding, and gaining insight into production ML models, ensuring reliability and continuous improvement.
- At Netflix, it is crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
- An effective framework includes automatic issue detection and root cause analysis, and builds stakeholder trust by explaining system behavior.
- Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
- Logging requires a comprehensive data schema that captures model inputs, outputs, and metadata for effective analysis.
- Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
- Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances (a small SHAP sketch follows this list).
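For the explainability piece, here is a minimal SHAP sketch on a toy tree model; Netflix's payment models and features are not public, so the dataset and model here are placeholders.

```python
# Per-instance and aggregate explanations with SHAP on a toy classifier.
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = xgboost.XGBClassifier(n_estimators=20).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

print(shap_values[0])            # the "why" behind one prediction
print(abs(shap_values).mean(0))  # aggregate feature importance
```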
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- The team addressed the challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption across event types.
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads (sketched after this list).
- Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
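A minimal sketch of that read path, with a strict deadline, one transparent retry, and a stale-read fallback; the timeout value, retry budget, and the primary_read/stale_read callables are hypothetical stand-ins for the database client's internals.

```python
# Bounded read: try the primary under a deadline, retry once, then fall
# back to a possibly stale replica read instead of blocking the caller.
import concurrent.futures

READ_TIMEOUT_S = 0.05  # cut off reads stuck behind an EBS latency spike
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_with_fallback(primary_read, stale_read, retries=1):
    for _ in range(retries + 1):
        future = pool.submit(primary_read)
        try:
            return future.result(timeout=READ_TIMEOUT_S)
        except concurrent.futures.TimeoutError:
            continue  # transparent retry against the primary
    return stale_read()  # bounded staleness beats unbounded latency

# e.g. read_with_fallback(lambda: primary.get("k"), lambda: replica.get("k"))
```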
Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.
- Pinterest is migrating from its aging Hadoop 2.x platform (Monarch) to Moka, a new Kubernetes-based system, for massive-scale data processing.
- The shift to Kubernetes is driven by the need for stronger container isolation and security, better Spark performance, lower operational costs, and faster developer velocity.
- Kubernetes offers built-in container support, streamlined deployment via Terraform and Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
- Performance optimizations include newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
- Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionality such as the YARN UI, job submission, resource management, log aggregation, and security (a job-submission sketch follows this list).
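As an illustration of how job submission changes once YARN is gone, here is a minimal PySpark sketch using Spark's standard Kubernetes settings; the API server URL, container image, and namespace are hypothetical placeholders for a Moka-style cluster.

```python
# Submitting Spark directly to Kubernetes instead of YARN (sketch).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://eks-api.example.com:443")  # EKS API server
    .config("spark.kubernetes.namespace", "moka-jobs")
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.5")
    .config("spark.executor.instances", "8")  # replaces YARN resource sizing
    .appName("moka-batch-job")
    .getOrCreate()
)

spark.range(1_000_000).selectExpr("sum(id)").show()
```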
Why it matters: Engineers often struggle to balance robust security with system performance. This approach demonstrates how to implement scalable, team-level encryption at rest using HSMs without sacrificing the speed of file sharing or the functionality of content search in a distributed environment.
- Dropbox developed a team-based encryption system that uses Hardware Security Modules (HSMs) for secure key generation and storage.
- The architecture removes the performance bottleneck of re-encrypting 4MB file blocks during cross-team sharing (see the envelope-encryption sketch after this list).
- Unique top-level keys let enterprise teams instantly disable access to their data, providing granular control over sensitive information.
- The system balances high security with usability, retaining features like content search that are often lost with traditional end-to-end encryption.
- This security framework serves as the foundation for protecting AI-driven tools like Dropbox Dash and its universal search capabilities.
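The sharing trick is classic envelope encryption: each block is encrypted once under its own key, and only that small key is re-wrapped for a new team. A minimal sketch with the cryptography library follows; keys live in software here, whereas in Dropbox's design the team keys are generated and held in HSMs.

```python
# Envelope encryption sketch: re-wrap the ~32-byte block key on a
# cross-team share instead of re-encrypting the 4MB ciphertext.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def wrap(team_key: bytes, block_key: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(team_key).encrypt(nonce, block_key, None)

def unwrap(team_key: bytes, wrapped: bytes) -> bytes:
    return AESGCM(team_key).decrypt(wrapped[:12], wrapped[12:], None)

team_a = AESGCM.generate_key(bit_length=256)
team_b = AESGCM.generate_key(bit_length=256)
block_key = AESGCM.generate_key(bit_length=256)

# Encrypt the 4MB block once, under its own key.
nonce = os.urandom(12)
ciphertext = AESGCM(block_key).encrypt(nonce, b"...4MB file block...", None)

# Cross-team share: unwrap with A's key, re-wrap with B's. The block
# ciphertext above is never touched.
wrapped_for_a = wrap(team_a, block_key)
wrapped_for_b = wrap(team_b, unwrap(team_a, wrapped_for_a))
```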
Why it matters: Dropbox's 7th-gen hardware shows how custom infrastructure at exabyte scale drives massive efficiency. By co-designing hardware and software, they achieve superior performance-per-watt and density, essential for modern AI-driven workloads and sustainable growth.
- Dropbox launched its seventh-generation hardware platform with specialized tiers: Crush (compute), Dexter (database), Sonic (storage), and Gumby/Godzilla (GPUs).
- The architecture more than doubles available rack power, from 17kW to 35kW, and moves to 400G networking to support high-bandwidth AI and data workloads.
- Storage density is optimized with 28TB SMR drives and a custom chassis designed to mitigate vibration and heat, supporting exabyte-scale data.
- The compute tier uses 128-core AMD EPYC processors and PCIe Gen5, delivering significant performance-per-watt improvements over previous generations.
- New GPU tiers are integrated specifically to power AI products like Dropbox Dash, focusing on high-performance training and inference.
Why it matters: This article demonstrates how to significantly accelerate ML development and deployment by leveraging Ray for end-to-end data pipelines. Engineers can learn to build more efficient, scalable, and faster ML iteration systems, reducing costs and time-to-market for new features.
- Pinterest expanded Ray's role from ML training to the entire ML infrastructure, including feature development, sampling, and label modeling, to accelerate iteration.
- A Ray Data native pipeline API was developed for on-the-fly feature transformations, eliminating slow Spark backfills and costly feature joins (a minimal Ray Data sketch follows this list).
- Efficient Iceberg bucket joins were implemented in Ray, enabling dynamic dataset joining at runtime and cutting feature experimentation from days to hours.
- Ray-based Iceberg write mechanisms persist and cache transformed features for reuse, improving iteration efficiency and production data generation.
- This integrated Ray architecture yields a more scalable, efficient, and faster end-to-end ML development and deployment process.
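A minimal Ray Data sketch of an on-the-fly transform applied at read time rather than via a Spark backfill; the dataset and transform are toy stand-ins, and Pinterest's pipeline API and Iceberg bucket joins are internal extensions on top of this pattern.

```python
# On-the-fly feature transformation with Ray Data (sketch).
import ray

ds = ray.data.range(1_000_000)  # stand-in for an Iceberg table scan

def add_features(batch):
    # Computed at read time, so no offline backfill is needed.
    batch["id_squared"] = batch["id"] ** 2
    return batch

features = ds.map_batches(add_features, batch_format="numpy")
print(features.take(3))
```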
Why it matters: This article details how Pinterest scaled its recommendation system to leverage vast lifelong user data, significantly improving personalization and user engagement through innovative ML models and efficient serving infrastructure.
- Pinterest's TransActV2 significantly enhances personalization by modeling up to 16,000 lifelong user actions, a 160x increase over previous systems.
- It introduces a Next Action Loss (NAL) as an auxiliary task, improving user action forecasting beyond traditional CTR models.
- To handle long sequences efficiently, TransActV2 applies Nearest Neighbor (NN) selection at inference, feeding only the most relevant actions to the model (see the sketch after this list).
- The system employs a multi-headed transformer encoder architecture with causal masking and explicit action features.
- Industrial-scale deployment challenges are addressed through NN feature logging, on-device NN search, and custom OpenAI Triton kernels for low-latency serving.
- Lifelong behavior modeling captures evolving, multi-seasonal, and less frequent user interests, leading to richer personalization.
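A minimal PyTorch sketch of the NN-selection idea: score all lifelong actions against the candidate, keep the top-k, and run only those through a causally masked encoder. The dimensions and k are made up, and the production system relies on logged NN features and custom Triton kernels rather than this naive version.

```python
# Nearest-neighbor action selection before the transformer (sketch).
import torch

seq_len, dim, k = 16_000, 64, 256
actions = torch.randn(seq_len, dim)  # lifelong user action embeddings
candidate = torch.randn(dim)         # embedding of the pin being ranked

scores = actions @ candidate                 # relevance of each past action
top_idx = scores.topk(k).indices.sort().values  # top-k, back in time order
relevant = actions[top_idx]                  # (k, dim) kept for the model

# Causally masked transformer encoder over the selected actions.
mask = torch.nn.Transformer.generate_square_subsequent_mask(k)
layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoded = layer(relevant.unsqueeze(0), src_mask=mask)  # (1, k, dim)
```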