Engineering at Meta

https://engineering.fb.com/

Why it matters: This project demonstrates cutting-edge subsea cable engineering, utilizing SDM and optical switching to build massive-scale, open-access infrastructure. It's crucial for global connectivity, supporting future AI, cloud, and high-bandwidth applications across three continents.

  • The core 2Africa system, the world's longest open-access subsea cable, is complete, connecting 33 countries across Africa, Europe, and Asia.
  • It's the first cable to link East and West Africa continuously and to connect Africa with the Middle East, South Asia, and Europe.
  • The project, led by a Meta-consortium, uses an open-access model to promote competition and accelerate digital transformation.
  • Engineering innovations include Spatial Division Multiplexing (SDM), supporting 16 fiber pairs (double that of older systems), and undersea optical wavelength switching.
  • This infrastructure supports evolving demands for AI, cloud, and high-bandwidth applications, enabling connectivity for 3 billion people.
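
As a back-of-the-envelope illustration of the SDM bullet above: aggregate design capacity scales linearly with fiber-pair count. The per-pair rate below is an assumed figure for illustration, not a published 2Africa specification.

```python
# Illustrative sketch only: SDM scales capacity by adding fiber pairs.
# The per-pair rate is an assumption for illustration, not a 2Africa spec.

def aggregate_capacity_tbps(fiber_pairs: int, tbps_per_pair: float) -> float:
    """Aggregate design capacity = per-pair capacity x number of pairs."""
    return fiber_pairs * tbps_per_pair

older = aggregate_capacity_tbps(fiber_pairs=8, tbps_per_pair=12.0)   # pre-SDM pair count
sdm = aggregate_capacity_tbps(fiber_pairs=16, tbps_per_pair=12.0)    # SDM doubles the pairs
print(older, sdm)  # 96.0 192.0 -- doubling the pairs doubles capacity
```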

Why it matters: This article details the intricate process of preserving HDR video metadata (Dolby Vision, AMVE) across a large-scale video pipeline. It's crucial for engineers working on media processing, mobile development, and ensuring high-quality user experiences on global platforms.

  • Instagram for iOS now supports Dolby Vision and Ambient Viewing Environment (AMVE) metadata for enhanced HDR video playback.
  • This involved preserving unique Dolby Vision and AMVE metadata from iPhone-produced HDR videos throughout Meta's video processing pipeline.
  • Previously, Meta's FFmpeg-based transcoding discarded this metadata, degrading picture consistency, especially at low screen brightness.
  • Meta collaborated with the community to add AMVE support to FFmpeg and adopted Dolby Vision Profile 10 for AV1 delivery.
  • This enhancement makes Instagram the first Meta app to support Dolby Vision video, with future expansion across other Meta platforms.
  • The solution addresses challenges like carrying Dolby Vision metadata in non-HEVC codecs and managing different Dolby Vision profiles.
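
The pipeline-preservation idea in the bullets above can be sketched in miniature: extract codec side data before the transcode step that would otherwise drop it, then re-attach it to the output. The types and field names here are hypothetical stand-ins, not Meta's pipeline or the FFmpeg API:

```python
# Minimal sketch (hypothetical types, not Meta's pipeline): preserving
# per-stream HDR side data (e.g., Dolby Vision RPUs, AMVE payloads)
# across a transcode step that would otherwise drop it.
from dataclasses import dataclass, field

@dataclass
class Video:
    codec: str
    side_data: dict = field(default_factory=dict)  # e.g. {"dovi_rpu": ..., "amve": ...}

def transcode(src: Video, target_codec: str) -> Video:
    # Models a legacy transcode: pixels are converted,
    # codec-specific side data is silently discarded.
    return Video(codec=target_codec)

def transcode_preserving_metadata(src: Video, target_codec: str) -> Video:
    # Extract side data up front, transcode, then re-attach it so the
    # delivery codec (e.g., AV1 with Dolby Vision Profile 10) keeps it.
    preserved = dict(src.side_data)
    out = transcode(src, target_codec)
    out.side_data = preserved
    return out

src = Video("hevc", {"dovi_rpu": b"\x01", "amve": b"\x02"})
print(transcode(src, "av1").side_data)                      # {} -- metadata lost
print(transcode_preserving_metadata(src, "av1").side_data)  # metadata retained
```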

Why it matters: Engineers can learn how open hardware, AI, and collaborative projects like OCP are crucial for achieving environmental sustainability goals in tech. It highlights practical applications of AI in reducing carbon footprints for IT infrastructure and data centers.

  • Meta's podcast discusses open hardware and the Open Compute Project (OCP) for environmental sustainability.
  • OCP, a collaborative initiative with over 400 companies, focuses on open hardware designs to reduce environmental impact.
  • Meta leverages AI and open hardware to advance its goal of achieving net-zero emissions by 2030.
  • A new open methodology employs AI to enhance the accuracy of Scope 3 emission estimates for IT hardware.
  • AI is also being used to innovate concrete mixes, leading to lower-carbon data center construction.

Why it matters: StyleX offers a robust solution for managing CSS at scale, providing performance benefits of static CSS with the developer experience of CSS-in-JS. It ensures maintainability, reduces bundle sizes, and prevents styling conflicts in large, complex applications.

  • StyleX is Meta's open-sourced styling system, combining CSS-in-JS ergonomics with static CSS performance for large-scale applications.
  • It functions as a build-time compiler, extracting styles to generate collision-free, atomic CSS, significantly reducing CSS bundle size.
  • StyleX addresses historical CSS challenges at Meta, such as specificity wars and large bundles, by enforcing constraints for predictable and scalable styling.
  • The system enables expressive, type-safe style authoring in JavaScript, supporting composition and conditional logic while compiling to static output.
  • Its core is a Babel plugin that processes style objects, normalizes values, and outputs optimized, atomic CSS classes for efficient rendering.
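
A toy model of the compile step described above: each unique CSS declaration becomes one atomic class, so declarations repeated across components deduplicate into a single rule. This is a sketch of the idea, not StyleX's actual Babel plugin:

```python
# Toy sketch of StyleX-style atomic compilation (not StyleX's implementation):
# each (property, value) declaration compiles to one reusable atomic class.

def compile_atomic(styles):
    """Map each (property, value) pair to one atomic class; return
    per-style-name class lists plus the deduplicated CSS rules."""
    rules = {}    # (prop, value) -> generated class name
    classes = {}  # style name -> list of atomic class names
    for name, decls in styles.items():
        classes[name] = []
        for prop, value in decls.items():
            cls = rules.setdefault((prop, value), f"x{len(rules)}")
            classes[name].append(cls)
    css = "\n".join(f".{c}{{{p}:{v}}}" for (p, v), c in rules.items())
    return classes, css

classes, css = compile_atomic({
    "button": {"color": "red", "margin": "8px"},
    "link":   {"color": "red", "padding": "4px"},  # "color: red" reuses a class
})
print(classes)  # button and link share the same class for color:red
print(css)      # only three rules emitted, not four
```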

Why it matters: This article details how Meta built and scaled a massive LLM-inspired foundation model for ads, showcasing innovations in architecture, training, and knowledge transfer for significant performance gains. It offers insights into building large-scale recommendation systems.

  • Meta's Generative Ads Model (GEM) is a new LLM-inspired foundation model enhancing ad recommendation performance and advertiser ROI.
  • Its novel architecture allows efficient scaling and precise predictions, leveraging thousands of GPUs for training.
  • GEM propagates learnings across Meta's ad model fleet through advanced post-training and knowledge transfer techniques.
  • It has already delivered significant increases in ad conversions on Instagram (5%) and Facebook (3%).
  • GEM achieves 4x efficiency in performance gains, 2x knowledge-transfer effectiveness, and a 23x increase in training FLOPs.
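
One common knowledge-transfer technique is distillation, in which a smaller production model is trained against a larger teacher's soft predictions. The article does not publish GEM's exact methods at this level of detail, so the sketch below is a generic, hypothetical illustration:

```python
# Generic distillation sketch (hypothetical; not GEM's actual post-training):
# the student is trained to match the teacher's softened output distribution.
import math

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits):
    """Cross-entropy of student predictions against the teacher's soft targets."""
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

# A student that matches the teacher exactly attains the minimum loss
# (the entropy of the teacher's distribution); a mismatched student scores worse.
teacher = [2.0, 0.5, -1.0]
print(distill_loss(teacher, teacher) < distill_loss(teacher, [0.0, 0.0, 0.0]))  # True
```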

Why it matters: This article details how Meta scaled invisible video watermarking, a critical technology for content provenance. It's vital for engineers tackling challenges like detecting AI-generated media and ensuring content authenticity at massive scale with operational efficiency.

  • Meta utilizes invisible watermarking for content provenance, enabling detection of AI-generated videos, verification of original posters, and identification of content sources.
  • Invisible watermarking embeds imperceptible signals into media, designed to be robust and persistent through transcodes and edits, unlike traditional metadata.
  • Scaling this technology presented significant challenges related to deployment environments, bitrate increases, and maintaining visual quality.
  • Meta developed a CPU-based solution for invisible video watermarking that achieves performance comparable to GPU-based systems while offering superior operational efficiency.
  • This technology is crucial for maintaining content authenticity and distinguishing between real and AI-generated media in today's rapidly evolving digital landscape.
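
The embedding idea can be illustrated with a toy keyed-signal scheme: add a small pseudo-random perturbation derived from a secret key, then detect it later by correlating against that same signal. This is only the principle; Meta's scheme is not public at this level, real detectors are blind (no original frame needed), and production watermarks must survive transcodes and edits:

```python
# Toy illustration of invisible watermarking (not Meta's scheme): embed a
# keyed pseudo-random signal into pixel values, detect it by correlation.
# For simplicity this detector compares against the original frame;
# production detection works without it.
import random

def keyed_signal(key, n, strength=2):
    rng = random.Random(key)
    return [rng.choice((-strength, strength)) for _ in range(n)]

def embed(pixels, key):
    sig = keyed_signal(key, len(pixels))
    return [max(0, min(255, p + s)) for p, s in zip(pixels, sig)]

def detect(pixels, original, key):
    """Correlate the residual against the keyed signal; a strongly
    positive score indicates the watermark is present."""
    sig = keyed_signal(key, len(pixels))
    score = sum((p - o) * s for p, o, s in zip(pixels, original, sig))
    return score > 0.5 * sum(s * s for s in sig)

frame = [128] * 1000
marked = embed(frame, key=42)
print(detect(marked, frame, key=42))  # True: watermark found
print(detect(frame, frame, key=42))   # False: no watermark present
print(detect(marked, frame, key=7))   # False: wrong key does not correlate
```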

Why it matters: This article is crucial for engineers building GenAI products, demonstrating how to integrate privacy-aware infrastructure and data lineage to manage complex data flows, ensure compliance, and accelerate innovation responsibly.

  • Meta addresses GenAI privacy challenges by scaling its Privacy Aware Infrastructure (PAI), using AI glasses as a key example.
  • GenAI products like AI glasses introduce new data types, increased volumes, and complex real-time data flows, necessitating robust privacy systems.
  • Key challenges include managing explosive data growth, adapting to shifting privacy requirements, and supporting rapid innovation cycles.
  • PAI leverages data lineage insights and automated privacy controls to embed privacy deeply into product development.
  • This approach enables Meta to accelerate GenAI product innovation while upholding user trust and data protection.
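
In the spirit of the lineage-driven controls described above, one can sketch purpose-limited data flow: data carries an annotation of permitted purposes, and an automated check blocks any use outside them. All names here are invented for illustration; this is not PAI's API:

```python
# Hypothetical sketch (invented names, not PAI): purpose annotations travel
# with the data, and every downstream use is checked against them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotated:
    value: object
    allowed_purposes: frozenset  # purposes this data may be used for

def use(data, purpose):
    # Automated control: a purpose outside the annotation is rejected
    # before the data ever reaches the consumer.
    if purpose not in data.allowed_purposes:
        raise PermissionError(f"purpose {purpose!r} not permitted")
    return data.value

audio = Annotated("voice-clip", frozenset({"assistant_response"}))
print(use(audio, "assistant_response"))  # permitted flow proceeds
try:
    use(audio, "ads_ranking")            # disallowed flow is blocked
except PermissionError as e:
    print(e)
```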

Why it matters: DSF revolutionizes AI network scaling by overcoming traditional fabric limitations. Its disaggregated architecture, packet spraying, and advanced congestion control ensure high-performance, lossless connectivity for massive GPU clusters, crucial for the future of large-scale AI model training.

  • Meta's Disaggregated Scheduled Fabric (DSF) is a next-generation network technology designed to scale AI training networks beyond the physical limits of traditional Clos-based architectures.
  • DSF disaggregates line cards (Interface Nodes) and fabric cards (Fabric Nodes) into distinct hardware, creating a distributed system for enhanced scalability and performance.
  • It addresses critical challenges in AI workloads, such as "elephant flows" and "low entropy" traffic patterns, which cause congestion and suboptimal utilization in conventional IP fabrics.
  • The system employs a two-domain architecture, packet spraying, and a credit-based congestion control algorithm for efficient, lossless traffic management.
  • Built on open standards like OCP-SAI and managed by FBOSS, DSF enables the creation of large virtual chassis switches capable of interconnecting thousands of GPUs for massive AI clusters.
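
Two of the mechanisms above can be sketched in simplified form: spraying the packets of one large flow across all fabric links (with sequence numbers so the egress can restore order), and credit-based admission so a sender only transmits what the receiver has granted. This is a conceptual model, not the FBOSS implementation:

```python
# Simplified sketch of two DSF ideas (conceptual, not the FBOSS code):
# packet spraying to spread an "elephant flow" across fabric links, and
# credit-based admission that keeps the fabric lossless.
from collections import deque

def spray(packets, num_links):
    """Round-robin each packet (tagged with a sequence number) over links."""
    links = [deque() for _ in range(num_links)]
    for seq, pkt in enumerate(packets):
        links[seq % num_links].append((seq, pkt))
    return links

def egress_reassemble(links):
    """Collect packets from all links and restore order via sequence numbers."""
    received = [p for link in links for p in link]
    return [pkt for _, pkt in sorted(received)]

def credit_send(packets, credits):
    """Send at most `credits` packets now; the rest wait for more grants."""
    return packets[:credits], packets[credits:]

flow = [f"pkt{i}" for i in range(8)]
granted, queued = credit_send(flow, credits=6)
links = spray(granted, num_links=3)
print([len(l) for l in links])              # [2, 2, 2] -- even spread over links
print(egress_reassemble(links) == granted)  # True: original order restored
```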

Why it matters: This article details Meta's innovations in LLM inference parallelism, offering critical strategies for engineers to achieve high throughput, low latency, and better resource efficiency when deploying large language models at scale. It provides practical solutions for optimizing performance.

  • Meta developed advanced parallelism techniques to optimize LLM inference for resource efficiency, throughput, and latency, crucial for applications like Meta AI.
  • LLM inference comprises two stages: compute-bound prefill for prompt processing and memory-bound decoding for token generation, each with distinct computational demands.
  • Tensor Parallelism shards model layers across GPUs, using novel Direct Data Access (DDA) algorithms (flat and tree variants) to significantly reduce allreduce communication latency.
  • DDA solutions demonstrated substantial speedups (10-50% for decode, 10-30% for prefill) on AMD MI300X, achieving performance parity with NVIDIA H100.
  • Context Parallelism, implemented via 'ring attention' variants (Pass-KV, Pass-Q), addresses the challenges of processing extremely long contexts by distributing input tokens and exchanging tensors.
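
The allreduce contrast above can be modeled abstractly: a ring allreduce needs 2*(N-1) serialized neighbor exchanges, while a flat one-shot scheme in the spirit of DDA lets each rank read every peer's buffer directly in a single round. This is a pure-Python counting model of communication rounds, not Meta's GPU kernels:

```python
# Conceptual model: ring allreduce vs. a flat one-shot reduction in the
# spirit of flat DDA. "steps" counts serialized communication rounds; the
# actual kernels use direct GPU peer-memory access, not Python lists.

def ring_allreduce(vectors):
    """Classic ring: reduce-scatter then all-gather, 2*(n-1) rounds."""
    n = len(vectors)
    v = [list(x) for x in vectors]
    steps = 0
    for s in range(n - 1):  # reduce-scatter: accumulate partial sums
        sends = [(r, (r - s) % n, v[r][(r - s) % n]) for r in range(n)]
        for r, c, val in sends:
            v[(r + 1) % n][c] += val
        steps += 1
    for s in range(n - 1):  # all-gather: circulate completed chunks
        sends = [(r, (r + 1 - s) % n, v[r][(r + 1 - s) % n]) for r in range(n)]
        for r, c, val in sends:
            v[(r + 1) % n][c] = val
        steps += 1
    return v, steps

def flat_allreduce(vectors):
    """Flat one-shot scheme: every rank reads all peers and reduces locally."""
    total = [sum(col) for col in zip(*vectors)]
    return [list(total) for _ in vectors], 1

ranks = [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
print(ring_allreduce(ranks))  # ([[30, 33, 36]] * 3, 4 rounds)
print(flat_allreduce(ranks))  # same result in 1 round
```

Fewer serialized rounds is what makes the flat variant attractive for the small, latency-bound allreduces of decode-stage tensor parallelism.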

Why it matters: This article introduces Sapling's innovative directory branching solution for monorepos, enabling scalable version management and merging without compromising performance or developer experience. It's crucial for engineers working with large codebases to maintain agility.

  • Meta's Sapling monorepo utilizes two distinct branching workflows to effectively balance scalability and developer experience.
  • Non-mergeable full-repo branching, supported by `sl bookmark`, is ideal for temporary product releases that do not require merging back to the main branch.
  • Mergeable directory branching is a novel solution for product development, allowing specific directories to be treated like traditional repository branches.
  • This new workflow enables copying, cherry-picking, and merging changes between directories using `sl subtree` commands.
  • Crucially, directory merges appear as linear commits in the monorepo's commit graph, preserving performance for operations like `sl log` and `sl blame`.
  • This approach resolves the challenges of managing multiple code versions within a large monorepo without sacrificing performance or essential developer tools.
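
Why linear commits keep history operations fast can be modeled with a toy commit graph: a subtree merge is recorded as an ordinary single-parent commit that carries its merge provenance as metadata, so walking history never branches into a second parent chain. This is an illustrative model, not Sapling's data structures:

```python
# Toy model (not Sapling's internals): a directory ("subtree") merge is a
# single-parent commit with merge metadata, so history stays linear and
# log/blame-style walks never traverse a second parent chain.
from dataclasses import dataclass, field

@dataclass
class Commit:
    msg: str
    parents: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def subtree_merge(head, from_dir, to_dir):
    # One parent only: provenance lives in metadata, not in the graph shape.
    return Commit(f"subtree merge {from_dir} -> {to_dir}", [head],
                  {"subtree_from": from_dir, "subtree_to": to_dir})

def log(head):
    """Linear walk: with single-parent commits this is O(history length)."""
    out, c = [], head
    while c:
        out.append(c.msg)
        c = c.parents[0] if c.parents else None
    return out

root = Commit("init")
work = Commit("feature work in project/v2", [root])
merged = subtree_merge(work, "project/v2", "project/main")
print(len(merged.parents))  # 1 -- linear, unlike a two-parent merge commit
print(log(merged))
```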