Engineering at Meta

Why it matters: This research addresses the challenge of sparse signal optimization in massive-scale recommendation systems. By using hierarchical graph learning and multimodal enrichment, engineers can improve deep funnel performance and better align user intent with content in high-sparsity environments.

Meta is developing a Hierarchical Interest Representation layer to optimize deep funnel ads by mapping latent user interests to advertiser offerings.
The system uses a transformer-based graph learning architecture with bias-aware attention and self-supervised cross-view distillation.
It integrates multimodal data via LLMs to enrich sparse interaction signals and improve generalization for rare or unseen entities.
The model projects raw interaction graphs into super-graphs of learned interest primitives, reducing dimensionality and stabilizing the vocabulary.
It outputs universal embeddings and Bag-of-Meaning interest tokens to power retrieval and ranking across Meta's ads stack at the scale of billions of interactions.

#mlp #data #dist

Engineering at Meta, Jul 13, 2026

Modernizing the Meta Ads Service With an Open-Source Kernel Scheduler

Why it matters: This demonstrates how BPF-based extensible scheduling allows engineers to bypass general-purpose kernel limitations. By tailoring CPU scheduling to specific workload patterns, Meta achieved massive latency reductions and power efficiency gains that standard schedulers couldn't provide.

Meta implemented sched_ext, a BPF-based extensible scheduling framework, to address latency regressions caused by the Linux kernel's EEVDF scheduler.
The custom scheduling policy soft-partitions CPUs into pools for latency-critical and background tasks, optimizing for specific ad-serving workload patterns.
The BPF-based approach allows for rapid iteration and deployment of scheduling updates via user-space binaries without requiring kernel reboots.
Optimization of CPU affinity improved L3 cache locality and reduced DRAM access, contributing to a 28% reduction in p99 tail latency.
The deployment produced a 1.1% increase in ads ranked and 3.28 megawatts of power savings across the fleet.
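
The soft-partitioning idea can be sketched in a few lines. This is only an illustration in Python: real sched_ext policies are BPF programs, and the summary does not describe Meta's actual policy. The names `Task` and `pick_cpu`, the pool layout, and the reading of "soft" as "prefer your own pool, borrow from the other when it has no idle CPU" are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    latency_critical: bool


def pick_cpu(task, idle_cpus, critical_pool, background_pool):
    """Pick an idle CPU, preferring the pool that matches the task class.

    The partition is "soft" (assumed meaning): a task may borrow an idle
    CPU from the other pool when its own pool has none available.
    """
    preferred = critical_pool if task.latency_critical else background_pool
    fallback = background_pool if task.latency_critical else critical_pool
    for pool in (preferred, fallback):
        candidates = sorted(pool & idle_cpus)
        if candidates:
            return candidates[0]
    return None  # nothing idle: the task has to wait


critical = {0, 1, 2, 3}   # hypothetical CPU ids
background = {4, 5}

print(pick_cpu(Task("rank_ads", True), {1, 4}, critical, background))  # 1
print(pick_cpu(Task("rank_ads", True), {4}, critical, background))     # 4
```

Keeping a task class on a stable set of CPUs is also what would make the cache-locality gain plausible, since the same cores keep serving the same working set.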

#sre #dist #finops

Engineering at Meta, Jul 1, 2026

Meta’s AI Storage Blueprint at Scale

Why it matters: Storage bottlenecks are a primary cause of GPU stalls in AI workloads. Optimizing BLOB storage for low-latency retrieval is critical for maximizing expensive compute utilization and accelerating the development of frontier models.

Meta's storage architecture is built on Tectonic, a regional, multi-tenant block layer that uses erasure-coding and media tiering to manage exabyte-scale data.
AI training workloads require bounded pMax latencies because GPU synchronization steps stall the entire cluster if a single data fetch is delayed.
Legacy BLOB storage architectures suffered from excessive metadata lookups across multiple stateful layers, creating bottlenecks for high-speed flash storage.
Meta is migrating its AI training stack to a unified BLOB-storage interface to provide high-performance access to massive, geo-distributed data lakes.
Optimizing storage retrieval is essential for maximizing GPU utilization and increasing research velocity for frontier models like Llama.
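
A toy model shows why the maximum fetch latency matters more than the average for synchronous training. It assumes, for simplicity, that data fetch and compute do not overlap (no prefetching) and that every worker must finish its fetch before the step can proceed. The function names and all numbers are hypothetical.

```python
def step_time_ms(fetch_latencies_ms, compute_ms):
    """A synchronous step waits for the slowest fetch, then computes."""
    return max(fetch_latencies_ms) + compute_ms


def gpu_utilization(fetch_latencies_ms, compute_ms):
    """Fraction of the step that GPUs spend computing rather than waiting."""
    return compute_ms / step_time_ms(fetch_latencies_ms, compute_ms)


healthy = [5.0] * 1024                 # every worker fetches in 5 ms
straggler = [5.0] * 1023 + [200.0]     # one worker hits a 200 ms fetch

print(step_time_ms(healthy, 100.0))     # 105.0
print(step_time_ms(straggler, 100.0))   # 300.0
```

In this sketch a single slow fetch out of 1,024 nearly triples the step time for the whole cluster, which is why a bound on the worst case is the useful target.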

#data #mlp #dist

Engineering at Meta, Jun 30, 2026

10 Years of Meta’s Commitment to Python

Why it matters: Meta's decade-long support for the PSF underscores the importance of corporate investment in open-source stability. This commitment ensures that Python remains a secure, high-performance, and innovative tool for the global engineering community, particularly in AI and infrastructure.

Meta marks 10 years of sponsoring the Python Software Foundation (PSF) to support the language's long-term sustainability.
Python is the most used language at Meta, powering the backends of Instagram and Threads and driving AI research.
Meta engineers contribute directly to the language through Python Enhancement Proposals (PEPs) and core feature development.
The company supports the Developer-in-Residence program, which employs full-time developers to improve the Python ecosystem.
Sponsorship funds have enabled critical security enhancements for the Python Package Index (PyPI) and core infrastructure.
Meta's open-source contributions include major projects like PyTorch and developer tools like the Pyrefly type checker.

#mlp #culture #security

Engineering at Meta, Jun 25, 2026

Privacy-Aware Infrastructure in the AI-Native Era: An Asset Classification Case Study

Why it matters: Scaling privacy controls in AI environments requires balancing model flexibility with deterministic reliability. This hybrid approach allows engineers to automate data classification at scale while maintaining the auditability and low latency required for production enforcement.

Meta utilizes a hybrid classification system that combines LLMs for reasoning through ambiguous data with deterministic rules for production enforcement.
The infrastructure prioritizes context building by gathering metadata like schema names and data lineage before applying model reasoning.
LLMs are used to interpret novel or complex assets, which are then distilled into versioned, human-reviewed rules to ensure auditability and low latency.
Asset classification serves as the 'Understand' layer, forming the load-bearing foundation for data discovery, enforcement, and compliance demonstration.
The system maintains a human-in-the-loop approach where people adjudicate reference labels and approve rule promotions that change protection enforcement.
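
The hybrid flow described above might look roughly like the following. It is a minimal sketch and not Meta's implementation: `Asset`, `Rule`, `Classifier`, the `model` callable (a stand-in for an LLM call), and the review queue are hypothetical names. Only deterministic rules produce an enforced label; model output is queued as a proposal until a person approves a rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Asset:
    schema_name: str
    column_name: str


@dataclass
class Rule:
    version: int
    label: str
    matches: Callable[[Asset], bool]


@dataclass
class Classifier:
    rules: list
    model: Callable[[Asset], str]            # stand-in for an LLM call
    review_queue: list = field(default_factory=list)

    def classify(self, asset: Asset) -> Optional[str]:
        """Return the enforced label, or None while a proposal awaits review."""
        for rule in self.rules:
            if rule.matches(asset):
                return rule.label
        # No rule matched: ask the model, but do not enforce its answer.
        self.review_queue.append((asset, self.model(asset)))
        return None

    def promote(self, rule: Rule, approved_by: str) -> None:
        """Add a distilled rule only with a named human approver."""
        if not approved_by:
            raise ValueError("rule promotion needs a human approver")
        self.rules.append(rule)
```

A short usage example with a fake model:

```python
clf = Classifier(
    rules=[Rule(1, "email", lambda a: "email" in a.column_name)],
    model=lambda a: "contact_info",
)
phone = Asset("users", "phone_no")
print(clf.classify(phone))      # None: proposal queued for review
clf.promote(Rule(2, "contact_info", lambda a: "phone" in a.column_name),
            approved_by="reviewer")
print(clf.classify(phone))      # contact_info
```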

#data #security #mlp

Engineering at Meta, Jun 23, 2026

How Meta Engineered Ultra-Narrow Batteries for AI Glasses

Why it matters: This engineering feat demonstrates how hardware constraints drive innovation in battery architecture and firmware. Rethinking cell design and power management is essential for enabling high-performance AI features in extremely constrained wearable form factors.

Meta developed ultra-narrow 7mm steel-can batteries to fit the slim temple arms of smart glasses, replacing traditional pouch cells that waste volume.
Engineers replaced traditional 'jelly roll' electrodes with die-cut stacked layers to lower impedance and prevent brownouts during peak AI and camera workloads.
The steel-can architecture maintains tolerances within 100 microns, maximizing energy density by reclaiming space lost in traditional battery designs.
System-level optimizations in firmware and power management achieved a 2x runtime increase despite only a 30% increase in raw battery capacity.
Dual-battery configurations in specific models introduced complex sequencing challenges to manage uneven electrical loads and prevent cross-charging risks.
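
The runtime and capacity figures imply how much of the gain came from power management. As a back-of-the-envelope check, assume runtime is stored energy divided by average power draw, and that the 30% capacity figure refers to stored energy. Neither assumption is stated in the summary.

```python
def implied_power_ratio(capacity_gain, runtime_gain):
    """New average power draw as a fraction of the old one.

    Assumes runtime = energy / average_power, so
    power_new / power_old = capacity_gain / runtime_gain.
    """
    return capacity_gain / runtime_gain


ratio = implied_power_ratio(capacity_gain=1.3, runtime_gain=2.0)
print(f"{ratio:.2f}")        # 0.65
print(f"{1 - ratio:.0%}")    # 35%
```

Under these assumptions, the average power draw would have dropped by roughly 35%, which would be the share of the improvement attributable to firmware and power management rather than to the cell itself.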

#mobile #mlp

Engineering at Meta, Jun 22, 2026

Adopting AV1 for Real-Time Communication (RTC) at Scale

Why it matters: AV1 adoption for RTC demonstrates how to balance high-efficiency video compression with the strict latency and power constraints of mobile devices. It provides a blueprint for scaling modern codecs to billions of users while maintaining performance on low-end hardware.

Meta deployed the AV1 video codec across Messenger and WhatsApp to improve real-time communication quality, particularly on low-bandwidth networks.
AV1 achieves at least a 20% bitrate reduction compared to H.264/AVC, enabling clear video at bitrates as low as 100 kbps.
To overcome mobile power constraints, Meta implemented a custom low-complexity internal encoder that matches the power profile of H.264.
The implementation utilizes specialized tools like palette mode and intra-block copy to significantly improve the clarity of screen-shared content.
Engineers addressed RTC-specific challenges by maintaining end-to-end latency below 300ms and optimizing rate control to prevent video freezes.
Advanced error resilience techniques were integrated to handle network fluctuations and packet loss common in emerging markets.

#mobile #dist

Engineering at Meta, Jun 3, 2026

Lights Out, Systems On: Validating Instant Power Loss Readiness

Why it matters: Zero-notice power failures pose a massive risk to availability. Meta's approach shows how to handle regional outages by combining hardware persistence with automated dependency management, ensuring complex distributed systems can bootstrap autonomously from scratch.

Meta introduced 'Instantaneous PowerLoss Storm' to validate data center readiness for zero-notice power failures at a regional scale.
The strategy utilizes defense-in-depth, combining hardware-level batteries and Power Loss Siren (PLS) with software-level signaling.
A primary challenge is regional bootstrapping, where millions of services must restart and discover each other autonomously after a total outage.
To prevent circular dependencies ('ouroboros') in the control plane, Meta uses Belljar tests in CI/CD and the Twrko recovery kit for jumpstarting.
Testing at a regional scale revealed unique vulnerabilities in replica placement and autonomous recovery that single-fault domain tests missed.
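
Catching a circular bootstrap dependency before it reaches production comes down to cycle detection on the service dependency graph. The sketch below uses Python's standard `graphlib` module. It does not model Belljar or the recovery kit, whose internals the summary does not describe, and the service names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def bootstrap_order(depends_on):
    """Return a start order in which each service follows its dependencies.

    `depends_on` maps a service name to the set of services it needs first.
    Raises ValueError naming the cycle if the graph cannot be bootstrapped.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        cycle = err.args[1]  # list of nodes, first and last are the same
        raise ValueError("circular dependency: " + " -> ".join(cycle)) from err


# Acyclic graph: dns starts first, then discovery, then config.
print(bootstrap_order({"dns": set(), "discovery": {"dns"}, "config": {"discovery"}}))

# An "ouroboros": dns needs config and config needs dns.
try:
    bootstrap_order({"dns": {"config"}, "config": {"dns"}})
except ValueError as exc:
    print(exc)
```

A check like this could run in CI on every dependency change, so a cycle is rejected at review time instead of being discovered during a cold start.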

#sre #dist

Engineering at Meta, May 26, 2026

SilverTorch: Index as Model — A New Retrieval Paradigm for Recommendation Systems

Why it matters: SilverTorch breaks the performance ceiling of microservice-based recommendation systems. By unifying retrieval into a single GPU-accelerated model, engineers can reduce latency, lower TCO, and eliminate the friction between ML and infrastructure development cycles.

Introduces SilverTorch, a unified 'Index as Model' architecture that replaces traditional microservice-based retrieval with a single neural network.
Consolidates user towers, item indices, eligibility filters, and scoring layers into integrated PyTorch model modules.
Achieves up to 23.7x higher throughput and 20.9x better cost efficiency compared to traditional CPU-based microservice baselines.
Eliminates latency overhead from network round-trips and data serialization by executing the entire retrieval pipeline in a single forward pass.
Solves version inconsistency issues by ensuring all retrieval components update simultaneously within a single deployment artifact.
Enables sub-100ms latency for processing millions of items while increasing model complexity and candidate evaluation volume.
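
To make the "single forward pass" idea concrete, here is a dependency-free toy in plain Python. It is not SilverTorch or PyTorch code: the class `RetrievalModel`, its fields, and the dot-product scoring are assumptions chosen only to show the user tower, item index, eligibility filter, and scoring living in one object and running in one call, with no network hop between them.

```python
import heapq


class RetrievalModel:
    """Toy 'index as model': all retrieval stages share one forward()."""

    def __init__(self, user_weights, item_embeddings, eligible):
        self.user_weights = user_weights        # user tower: one linear layer
        self.item_embeddings = item_embeddings  # item index: one row per item
        self.eligible = eligible                # eligibility filter: bool per item

    def forward(self, user_features, k):
        # User tower: project raw features into the embedding space.
        user_vec = [
            sum(w * x for w, x in zip(row, user_features))
            for row in self.user_weights
        ]
        # Filter and score in the same pass over the index.
        scored = [
            (sum(u * e for u, e in zip(user_vec, emb)), item_id)
            for item_id, emb in enumerate(self.item_embeddings)
            if self.eligible[item_id]
        ]
        # Top-k selection by score, highest first.
        return [item_id for _, item_id in heapq.nlargest(k, scored)]


model = RetrievalModel(
    user_weights=[[1.0, 0.0], [0.0, 1.0]],
    item_embeddings=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    eligible=[True, True, False],
)
print(model.forward([1.0, 0.2], k=2))   # [0, 1]
```

Because the weights, index, and filter are attributes of one object, they would ship and update together, which mirrors the version-consistency point in the summary.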

#mlp #dist #data

Engineering at Meta, May 13, 2026

Reel Friends: Building Social Discovery that Scales to Billions

Why it matters: This article highlights the hidden complexity of scaling social features. It demonstrates how machine learning and platform-specific user behavior analysis are critical for delivering personalized experiences to billions, proving that simple UI often masks deep engineering challenges.

Explores the engineering behind 'Friend Bubbles,' a social discovery feature for Facebook Reels.
Discusses the iterative evolution of machine learning models used to rank and recommend social content.
Addresses the technical challenges of scaling social discovery features to billions of global users.
Highlights behavioral differences between iOS and Android users that influenced feature development.
Emphasizes that seemingly simple UI features often require significant backend and ML infrastructure.

#mlp #mobile #dist
