Engineering at Meta

Why it matters: Scaling security updates across massive codebases is traditionally slow and error-prone. By combining secure-by-default frameworks with AI-powered codemods, Meta demonstrates how to automate large-scale security migrations, reducing developer friction and improving app safety.

Meta uses a two-pronged strategy for mobile security: secure-by-default frameworks and AI-driven automated migrations.
Secure-by-default frameworks wrap unsafe Android OS APIs to ensure the secure path is the easiest for developers to follow.
Generative AI automates the migration of legacy code to these secure frameworks at massive scale.
The system automates the proposal, validation, and submission of security patches across millions of lines of code.
This approach significantly reduces the developer friction and manual effort required to maintain security across a sprawling codebase.

#security #mobile #mlp

Engineering at Meta · Mar 9, 2026

How Advanced Browsing Protection Works in Messenger

Why it matters: It demonstrates how to implement privacy-preserving security features in end-to-end encrypted environments. Engineers can learn how to balance cryptographic privacy primitives like PIR and OPRF with the practical performance requirements of large-scale real-time messaging.

Messenger's Advanced Browsing Protection (ABP) uses Private Information Retrieval (PIR) to check links against a malicious URL database without revealing user activity to the server.
The system employs Oblivious Pseudorandom Functions (OPRF) to ensure the server cannot see the specific content of the client's query during the lookup process.
To handle URL prefix matching for subpaths, the system groups links by domain rather than requiring exact matches, preventing multiple queries that could leak data.
ABP addresses the privacy-efficiency tradeoff by sharding the database into buckets, carefully managing the number of bits leaked to the server to optimize performance.
The architecture is designed to scale to millions of potentially malicious websites while maintaining low latency for users within end-to-end encrypted chats.
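The bucket tradeoff described above can be illustrated with a minimal sketch. It does not implement PIR or OPRF; it only shows how sharding by a hash prefix trades leaked bits against bucket size. The hash choice (SHA-256) and the names `bucket_index` and `build_buckets` are assumptions for illustration, not details from the original article.

```python
import hashlib


def bucket_index(domain: str, num_bits: int) -> int:
    """Return the first num_bits bits of SHA-256(domain) as a bucket index.

    Hypothetical helper: the real system's hashing scheme is not specified here.
    """
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    value = int.from_bytes(digest[:4], "big")  # first 32 bits of the digest
    return value >> (32 - num_bits)


def build_buckets(domains, num_bits):
    """Group domains into at most 2**num_bits buckets by hash prefix."""
    buckets = {}
    for domain in domains:
        buckets.setdefault(bucket_index(domain, num_bits), []).append(domain)
    return buckets


def expected_bucket_size(total_entries: int, num_bits: int) -> float:
    """Average entries per bucket if the hash spreads entries evenly."""
    return total_entries / (2 ** num_bits)
```

Revealing `num_bits` bits of the hash tells the server which bucket is being queried but nothing more, while the private lookup only has to cover that one bucket. More bits mean smaller buckets and faster lookups at the cost of a larger leak.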

#security #dist #data

Engineering at Meta · Mar 2, 2026

FFmpeg at Meta: Media Processing at Scale

Why it matters: Meta's move from a custom fork to upstream FFmpeg shows how large-scale needs drive open-source evolution. It highlights optimizations in multi-lane transcoding and real-time quality metrics that significantly reduce compute costs and maintenance overhead at massive scale.

Meta processes tens of billions of media files daily, necessitating highly optimized transcoding workflows.
The company collaborated with the FFmpeg community to upstream critical features, allowing them to deprecate a long-standing internal fork.
New multi-lane transcoding capabilities enable parallelized encoding of multiple resolutions from a single source, reducing compute overhead.
Upstreamed support for real-time reference quality metrics like VMAF allows for quality monitoring without separate post-processing steps.
Significant refactoring in FFmpeg 6.0 and 8.0 was driven by these scale requirements, improving efficiency for the broader open-source community.
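The idea of producing several resolutions from one source can be sketched with the standard FFmpeg command line, where a single decoded stream is split and scaled into multiple outputs. This is an illustration only: the helper name `build_multilane_command` and the output file names are hypothetical, and the command is not taken from Meta's pipeline.

```python
def build_multilane_command(source, heights):
    """Build an ffmpeg argument list that decodes `source` once and
    encodes one output per target height.

    Hypothetical helper; only standard ffmpeg options are used
    (-i, -filter_complex with split/scale, -map).
    """
    n = len(heights)
    split_labels = "".join(f"[v{i}]" for i in range(n))
    # Split the single decoded video stream into n identical branches.
    chains = [f"[0:v]split={n}{split_labels}"]
    for i, height in enumerate(heights):
        # -2 keeps the aspect ratio and rounds the width to an even number.
        chains.append(f"[v{i}]scale=-2:{height}[out{i}]")
    cmd = ["ffmpeg", "-i", source, "-filter_complex", ";".join(chains)]
    for i, height in enumerate(heights):
        cmd += ["-map", f"[out{i}]", f"out_{height}p.mp4"]
    return cmd
```

Passing the resulting list to `subprocess.run` would run the transcode. Audio is left unmapped here to keep the sketch short.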

#dist #data

Engineering at Meta · Mar 2, 2026

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Why it matters: jemalloc is a critical foundation for high-performance systems. Meta's renewed commitment ensures the allocator evolves with modern hardware like ARM64 and complex workloads, reducing technical debt and improving memory efficiency for the entire open-source ecosystem.

Meta is unarchiving and renewing its stewardship of the jemalloc open-source repository to ensure long-term infrastructure health.
The project will prioritize technical debt reduction and refactoring to improve maintainability and ease of use for the community.
A key focus is enhancing the Huge-Page Allocator (HPA) to better utilize transparent hugepages for increased CPU efficiency.
Planned improvements to packing, caching, and purging mechanisms aim to optimize overall memory efficiency and performance.
The roadmap includes specific performance optimizations for the AArch64 (ARM64) platform to ensure high out-of-the-box performance.
Meta is shifting back to principled engineering practices, moving away from short-term hacks that previously accumulated technical debt.

#sre #data

Engineering at Meta · Feb 24, 2026

RCCLX: Innovating GPU communications on AMD platforms

Why it matters: RCCLX optimizes GPU communication on AMD platforms, addressing bottlenecks in LLM inference and training. By reducing AllReduce latency and using FP8 quantization, it significantly improves performance for decoding and prefill stages on modern AMD hardware.

Meta open-sourced RCCLX, an enhanced version of the ROCm Communication Collective Library (RCCL) optimized for AMD GPU platforms.
Integrated with Torchcomms, RCCLX includes CTran features like AllToAllvDynamic to enable GPU-resident collectives.
Introduces Direct Data Access (DDA) algorithms that reduce AllReduce latency by 10-50% for LLM decoding and 10-30% for prefill on MI300X GPUs.
DDA flat and tree algorithms optimize small and medium message sizes by allowing ranks to load memory directly from other ranks.
Supports low-precision collectives using FP8 quantization to achieve up to 4:1 compression, significantly reducing communication overhead for large messages.
Leverages AMD Infinity Fabric for high-bandwidth peer-to-peer mesh communication while maintaining numerical stability through FP32 compute steps.
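The low-precision collective described above can be sketched in plain Python. This sketch swaps in 8-bit linear integer quantization as a stand-in for FP8, and Python floats stand in for the FP32 accumulation step; it is a CPU-only illustration of the idea, not RCCLX code. All function names are hypothetical.

```python
def quantize(values):
    """Map floats to signed 8-bit codes with one shared scale.

    Stand-in for FP8: linear int8 quantization, chosen for simplicity.
    """
    peak = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = peak / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(codes, scale):
    """Recover approximate floats from 8-bit codes."""
    return [c * scale for c in codes]


def allreduce_sum_quantized(rank_buffers):
    """Sum one buffer per rank; only 8-bit codes 'travel', and the
    reduction itself is accumulated in higher precision."""
    total = [0.0] * len(rank_buffers[0])
    for buf in rank_buffers:
        codes, scale = quantize(buf)  # what would go on the wire
        for i, v in enumerate(dequantize(codes, scale)):
            total[i] += v  # higher-precision accumulation
    return total


def compression_ratio(num_elements):
    """Bytes for FP32 payload vs. 1 byte per code plus a 4-byte scale."""
    return (4 * num_elements) / (num_elements + 4)
```

With one scale per buffer the ratio approaches, but never reaches, 4:1 as the buffer grows, which is consistent with the "up to 4:1" figure above.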

#dist #mlp

Engineering at Meta · Feb 11, 2026

The Death of Traditional Testing: Agentic Development Broke a 50-Year-Old Field, JiTTesting Can Revive It

Why it matters: Traditional testing is a bottleneck for AI-accelerated development. JiTTesting automates the test lifecycle—from generation to validation—eliminating maintenance toil and ensuring high-signal bug detection in high-velocity environments.

Agentic software development is accelerating code changes beyond the capacity of traditional, manually maintained test suites.
Just-in-Time Tests (JiTTests) are LLM-generated on the fly for specific pull requests to catch regressions before they reach production.
The system uses mutation testing to deliberately insert faults, simulating potential failures to verify that generated tests are effective.
JiTTests are ephemeral and do not reside in the codebase, eliminating the long-term burden of test maintenance and code review.
Ensembles of rule-based and LLM-based assessors are used to filter results, significantly reducing false positives and engineer toil.
The approach shifts testing focus from generic code quality to high-signal fault detection tailored to the specific intent of a code change.
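The mutation-testing step above can be sketched as follows: a fault is injected into the code under test, and a generated test is kept only if it passes on the original and fails on the mutant. The pricing function, the two candidate tests, and the helper `kills_mutant` are hypothetical examples, not part of the system described in the article.

```python
import operator


def make_discount(compare):
    """Build a pricing function (amounts in cents); `compare` is the
    operator that a mutant may swap."""
    def price_after_discount(total):
        # 10% off when the comparison against 100 holds.
        return total - total // 10 if compare(total, 100) else total
    return price_after_discount


original = make_discount(operator.ge)  # intended: discount at 100 and above
mutant = make_discount(operator.gt)    # injected fault: boundary excluded


def weak_test(fn):
    """Checks values far from the boundary only."""
    return fn(200) == 180 and fn(50) == 50


def boundary_test(fn):
    """Checks the exact boundary the change is about."""
    return fn(100) == 90


def kills_mutant(test):
    """A useful test passes on the original and fails on the mutant."""
    return test(original) and not test(mutant)
```

Here `kills_mutant(boundary_test)` is true while `kills_mutant(weak_test)` is false, so only the boundary test would be considered high-signal for this change.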

#mlp #sre

Engineering at Meta · Feb 9, 2026

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Why it matters: Scaling AI to gigawatt levels requires solving massive networking bottlenecks. BAG enables petabit-scale interconnectivity between distributed data centers, allowing thousands of GPUs to function as a single cluster, which is essential for training next-generation large-scale AI models.

Meta is developing Prometheus, a 1-gigawatt AI cluster designed to interconnect tens of thousands of GPUs across multiple data centers.
Backend Aggregation (BAG) serves as a centralized Ethernet-based super spine layer, enabling petabit-scale bandwidth (16-48 Pbps) between regions.
The architecture bridges two distinct network fabrics: Disaggregated Scheduled Fabric (DSF) and Non-Scheduled Fabric (NSF).
BAG utilizes planar and spread topologies to optimize for either management simplicity or enhanced path diversity and resilience.
The system manages strict distance, buffer, and latency constraints to maintain high-performance GPU-to-GPU communication.
BAG acts as the critical aggregation point between regional networks and Meta's backbone to support massive AI training demands.

#dist #mlp

Engineering at Meta · Feb 4, 2026

No Display? No Problem: Cross-Device Passkey Authentication for XR Devices

Why it matters: This approach enables secure, phishing-resistant authentication for devices with limited UI, like XR headsets and IoT. By replacing QR codes with companion app transport, it maintains FIDO security standards while significantly improving the user experience for passwordless logins.

Meta introduced a novel cross-device passkey flow for XR and screenless devices that eliminates the need for scanning QR codes.
The approach utilizes the FIDO CTAP hybrid protocol, replacing visual data transfer with a secure message transport via a companion mobile app.
The XR device generates a FIDO URL containing an ECDH public key and session secret, delivered to the phone via GraphQL-based push notifications.
The companion app triggers native iOS or Android passkey verification flows upon receiving the notification, maintaining standard security requirements.
Proximity verification via Bluetooth or NFC is still enforced to ensure the authenticating device is physically near the XR hardware.

#security #mobile

Engineering at Meta · Jan 27, 2026

Rust at Scale: An Added Layer of Security for WhatsApp

Why it matters: WhatsApp's migration demonstrates that Rust is production-ready for massive-scale, cross-platform applications. It proves memory-safe languages can replace legacy C++ to eliminate vulnerabilities while improving performance and maintainability.

WhatsApp replaced its wamedia C++ library with a Rust implementation to mitigate memory-related vulnerabilities in media file processing.
The migration reduced the codebase from 160,000 lines of C++ to 90,000 lines of Rust while improving performance and memory efficiency.
The Kaleidoscope system performs structural checks on media, detects masquerading file types, and flags high-risk elements like embedded scripts.
WhatsApp utilized differential fuzzing and extensive integration testing to ensure compatibility between the legacy C++ and new Rust versions.
This deployment represents one of the largest global rollouts of Rust, spanning billions of devices across Android, iOS, Web, and wearables.
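Differential fuzzing, mentioned above, feeds the same random inputs to two implementations and reports every input on which they disagree. The sketch below uses a toy length-prefixed format and two hypothetical Python parsers standing in for the legacy and rewritten libraries; it illustrates the technique only and says nothing about WhatsApp's actual harness.

```python
import random


def parse_legacy(data: bytes):
    """Reference parser: 1-byte length prefix followed by the payload.
    Returns None for malformed input."""
    if not data:
        return None
    length = data[0]
    if len(data) - 1 < length:
        return None
    return data[1:1 + length]


def parse_new(data: bytes):
    """Reimplementation that must agree with parse_legacy on every input."""
    if len(data) == 0 or data[0] > len(data) - 1:
        return None
    return data[1:data[0] + 1]


def differential_fuzz(impl_a, impl_b, runs=2000, seed=0):
    """Return the random inputs on which the two implementations differ."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(runs):
        size = rng.randrange(0, 16)
        data = bytes(rng.randrange(256) for _ in range(size))
        if impl_a(data) != impl_b(data):
            mismatches.append(data)
    return mismatches
```

An empty result from `differential_fuzz(parse_legacy, parse_new)` gives some evidence of compatibility; any returned input is a concrete reproduction case for a behavioral difference.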

#security #mobile #dist

Engineering at Meta · Jan 14, 2026

Adapting the Facebook Reels RecSys AI Model Based on User Feedback

Why it matters: Traditional engagement metrics like watch time don't always reflect true user interest. By integrating direct survey feedback into ranking models, engineers can reduce noise, improve long-term retention, and better align content with niche user preferences in large-scale recommendation systems.

Facebook Reels transitioned from relying solely on engagement metrics like watch time to integrating direct user feedback via the User True Interest Survey (UTIS) model.
The UTIS model acts as a lightweight alignment layer trained on binarized survey responses to predict user satisfaction and content relevance.
Research indicated that traditional interest heuristics only achieved 48.3% precision, highlighting the gap between engagement signals and true user interest.
The system addresses sampling and nonresponse bias by weighting survey data to ensure the training set accurately reflects the broader user base.
Integrating survey-based interest matching led to significant improvements in long-term user retention, engagement, and satisfaction across video surfaces.
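The bias correction mentioned above can be illustrated with post-stratification weighting, where each respondent is weighted by the ratio of their group's population share to its share among respondents. The summary does not say which weighting scheme is used, so this choice, the group names, and the function names are all assumptions for illustration.

```python
from collections import Counter


def poststratification_weights(respondent_groups, population_share):
    """Weight per group = population share / share among respondents.

    Hypothetical helper; groups could be, e.g., activity tiers.
    """
    counts = Counter(respondent_groups)
    n = len(respondent_groups)
    return {g: population_share[g] / (counts[g] / n) for g in counts}


def weighted_positive_rate(responses, weights):
    """responses: list of (group, label), label 1 = interested, 0 = not."""
    numerator = sum(weights[g] * y for g, y in responses)
    denominator = sum(weights[g] for g, _ in responses)
    return numerator / denominator


# Toy data: heavy users answer surveys far more often than light users.
responses = [("heavy", 1)] * 6 + [("heavy", 0)] * 2 + [("light", 0)] * 2
weights = poststratification_weights(
    [g for g, _ in responses], {"heavy": 0.5, "light": 0.5}
)
```

In this toy data the unweighted positive rate is 0.6, while the weighted rate drops to about 0.375 once light users count in proportion to their assumed share of the user base.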

#mlp #data
