Airbnb Engineering
https://medium.com/airbnb-engineering

Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.
- Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
- Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
- Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and SCA.
- Categorized 20+ local payment methods into three standardized archetypes—Redirect, Async, and Direct flows—to maximize code reuse.
- Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
- Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions.
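The archetype idea above can be sketched in a few lines: classify each local payment method once, then write flow-handling code against three archetypes instead of 20+ bespoke integrations. The class, method names, and mappings below are illustrative assumptions, not Airbnb's actual API.

```python
from enum import Enum

# Hypothetical archetype taxonomy; names are illustrative, not Airbnb's real code.
class Archetype(Enum):
    REDIRECT = "redirect"  # buyer is sent to the PSP's hosted page
    ASYNC = "async"        # confirmation arrives later via webhook or polling
    DIRECT = "direct"      # synchronous authorization in a single call

# Each local payment method is classified exactly once (example entries).
METHOD_ARCHETYPES = {
    "ideal": Archetype.REDIRECT,
    "boleto": Archetype.ASYNC,
    "card": Archetype.DIRECT,
}

def start_payment(method: str) -> str:
    """Route a payment to the handler for its archetype, not its PSP."""
    archetype = METHOD_ARCHETYPES[method]
    if archetype is Archetype.REDIRECT:
        return "redirect_buyer_to_psp"
    if archetype is Archetype.ASYNC:
        return "await_webhook_or_poll"
    return "authorize_synchronously"
```

Adding a new market's payment method then reduces to one mapping entry plus a PSP connector, with no new flow logic.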
Why it matters: This innovation significantly streamlines frontend and mobile development by automating the creation of realistic, type-safe mock data. It frees engineers from tedious manual work, accelerates feature delivery, and improves the reliability of tests and demos.
- Airbnb introduces @generateMock, a new GraphQL client directive, to automate the creation and maintenance of realistic, type-safe mock data.
- The solution combines GraphQL schema validation, rich product context, and Large Language Models (LLMs) to generate convincing mock data.
- Engineers can use @generateMock on any GraphQL operation, fragment, or field, providing optional hints and design URLs to guide the LLM's data generation.
- Integrated with Airbnb's Niobe CLI tool, it generates JSON mock files and helper functions (TypeScript/Kotlin/Swift) for seamless consumption in tests and demo apps.
- This approach eliminates the tedious manual process of writing and updating mocks, enabling faster parallel client/server development and ensuring data consistency.
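An annotated operation might look like the following; the query shape, field names, and the `hint` argument are illustrative guesses, since the post summarized here does not publish the directive's exact signature.

```graphql
query ListingCard @generateMock(hint: "a beachfront listing in Lisbon") {
  listing(id: "demo-listing") {
    title
    nightlyPrice
    host {
      name
      isSuperhost
    }
  }
}
```

The tooling would validate the generated values against the schema's types, so a mocked `nightlyPrice` can never come back as a string.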
Why it matters: This article details how to build resilient distributed systems by moving beyond static rate limits to adaptive traffic management. Engineers can learn to maximize goodput and ensure reliability in high-traffic, multi-tenant environments.
- Airbnb evolved Mussel, their multi-tenant key-value store, from static QPS rate limiting to adaptive traffic management for improved reliability and goodput during traffic spikes.
- The initial QoS system used simple Redis-backed QPS limits, effective for basic isolation but unable to account for varying request costs or adapt to real-time traffic shifts.
- Resource-aware rate control (RARC) was introduced, charging requests in "request units" (RU) based on fixed overhead, rows processed, payload bytes, and, crucially, latency, reflecting actual backend load.
- RARC uses a linear model for RU calculation, allowing the system to differentiate between cheap and expensive operations even when their surface metrics look similar.
- Future layers include load shedding with criticality tiers for priority traffic and hot-key detection/DDoS mitigation to handle skewed access patterns and shield the backend.
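The linear RU model described above can be sketched as a weighted sum over the four cost inputs. The coefficients here are made-up placeholders, not Airbnb's actual tuning; the point is that a large scan and a point read are charged very differently even if both are "one request".

```python
# Illustrative linear request-unit (RU) model. Coefficients are hypothetical.
BASE_RU = 1.0      # fixed per-request overhead
RU_PER_ROW = 0.1   # cost per row processed
RU_PER_KB = 0.05   # cost per KB of payload
RU_PER_MS = 0.02   # cost per millisecond of backend latency

def request_units(rows: int, payload_bytes: int, latency_ms: float) -> float:
    """Charge a request in RUs reflecting actual backend load."""
    return (BASE_RU
            + RU_PER_ROW * rows
            + RU_PER_KB * (payload_bytes / 1024)
            + RU_PER_MS * latency_ms)
```

A tenant's budget is then debited in RUs rather than raw QPS, so one expensive scan consumes the quota of many cheap reads.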
Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.
- Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands, massive data, and improve operational efficiency.
- Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
- Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability, flexible consistency, and enhanced cost visibility.
- The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
- Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
- Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
- A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and no data loss.
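The dual-write phase of such a blue/green migration can be sketched as a thin wrapper: every write lands in both stores, reads stay on the old store until parity is verified, then reads flip. The class and interface below are a hypothetical illustration, not Mussel's actual API.

```python
# Minimal sketch of the dual-write phase of a blue/green store migration.
class DualWriteStore:
    """Writes go to both stores; reads come from the primary until cutover."""

    def __init__(self, primary: dict, secondary: dict):
        self.primary = primary      # e.g. the v1 store during migration
        self.secondary = secondary  # e.g. the v2 store being bootstrapped

    def put(self, key, value):
        # Write both copies so the new store stays consistent with the old one.
        self.primary[key] = value
        self.secondary[key] = value

    def get(self, key):
        return self.primary.get(key)

    def cutover(self):
        # Flip reads to the new store once parity checks pass; keep dual writes
        # so a rollback is just another swap.
        self.primary, self.secondary = self.secondary, self.primary
```

Because both stores hold every post-bootstrap write, cutover and rollback are both metadata-only operations with no data movement.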
Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.
- Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
- It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
- The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers.
- Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
- This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.
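A Tenant API reduced to node and field resolvers can be modeled in miniature: a node resolver fetches an object by id, and a field resolver derives a field from it. Everything below (registries, decorator names, the `Listing` example) is a toy illustration, not Viaduct's actual interfaces.

```python
# Toy model of a resolver-only Tenant API; names are illustrative.
NODE_RESOLVERS = {}   # GraphQL type -> fetch-by-id function
FIELD_RESOLVERS = {}  # (type, field) -> compute function

def node_resolver(type_name):
    def register(fn):
        NODE_RESOLVERS[type_name] = fn
        return fn
    return register

def field_resolver(type_name, field):
    def register(fn):
        FIELD_RESOLVERS[(type_name, field)] = fn
        return fn
    return register

@node_resolver("Listing")
def fetch_listing(listing_id):
    # In a real mesh this would call the owning backend service.
    return {"id": listing_id, "nightly_price": 120, "nights": 3}

@field_resolver("Listing", "totalPrice")
def total_price(listing):
    # Derived field: business logic lives inside the mesh, next to the schema.
    return listing["nightly_price"] * listing["nights"]

def resolve(type_name, obj_id, field):
    obj = NODE_RESOLVERS[type_name](obj_id)
    return FIELD_RESOLVERS[(type_name, field)](obj)
```

With only these two extension points, every tenant contribution fits the same shape, which is what makes a two-mechanism API easier to learn than the multiple mechanisms it replaced.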
Why it matters: This article introduces a novel approach to managing complex microservice architectures. By shifting to a data-oriented service mesh with a central GraphQL schema, engineers can significantly improve modularity, simplify dependency management, and enhance data agility in large-scale SOAs.
- Airbnb introduced Viaduct, a data-oriented service mesh, to improve modularity and address the complexity of massive dependency graphs in microservices-based Service-Oriented Architectures (SOA).
- Traditional service meshes are procedure-oriented, leading to 'spaghetti SOA' where managing and modifying services becomes increasingly difficult.
- Viaduct shifts to a data-oriented design, leveraging GraphQL to define a central schema comprising types, queries, and mutations across the entire service mesh.
- This data-oriented approach abstracts service dependencies from data consumers, as Viaduct intelligently routes requests to the appropriate microservices.
- The central GraphQL schema acts as a single source of truth, aiming to define service APIs and potentially database schemas, which significantly enhances data agility.
- By centralizing schema definition, Viaduct seeks to streamline changes, allowing database updates to propagate to client code with a single, coordinated update, reducing weeks of effort.
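The routing abstraction above boils down to ownership metadata on the central schema: consumers ask for a type, never for a service. The mapping and service names below are hypothetical.

```python
# Sketch of data-oriented routing: the mesh, not the caller, knows which
# service owns each schema type. Service names are hypothetical.
SCHEMA_OWNERS = {
    "User": "user-service",
    "Listing": "listing-service",
    "Reservation": "reservation-service",
}

def route(query_type: str) -> str:
    """Resolve the owning service for a requested type."""
    return SCHEMA_OWNERS[query_type]
```

Moving a type to a different backing service then changes one ownership entry, and no consumer has to be updated, which is the modularity win over procedure-oriented meshes.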
Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.
- Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel, achieving 3-5x faster local builds/tests and 2-3x faster deploys over 4.5 years.
- The move to Bazel was driven by needs for superior build speed via remote execution, enhanced reliability through hermeticity, and a uniform build infrastructure across all language repos.
- Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by enabling parallel actions and reducing data transfer.
- Hermetic builds, enforced by sandboxing, ensured consistent, repeatable results by isolating build actions from external environment dependencies.
- The migration strategy included a proof-of-concept on a critical service with co-existing Gradle/Bazel builds, followed by a breadth-first rollout.
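For readers unfamiliar with Bazel, a JVM target in such a monorepo is declared in a BUILD file like the sketch below. This is a config fragment, not code from the post: the target names and paths are invented, and the `load` path follows current rules_kotlin conventions, which may differ from what Airbnb uses internally.

```python
# Hypothetical BUILD file for a Kotlin library target.
load("@rules_kotlin//kotlin:jvm.bzl", "kt_jvm_library")

kt_jvm_library(
    name = "payments_lib",
    srcs = glob(["src/main/kotlin/**/*.kt"]),
    deps = ["//common:logging"],  # explicit deps enable hermetic, cacheable builds
)
```

Because every input is declared explicitly, Bazel can sandbox the compile action, cache it remotely, and skip downloading intermediate outputs ("Build without the Bytes") when only the final artifact is needed.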
Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.
- Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
- The strategy employs Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
- Upgrades are atomic, rolling out new istio-proxy versions and connecting them to the corresponding new Istiod revision simultaneously.
- A rollouts.yml file dictates the gradual rollout, defining namespace patterns and percentage distributions for Istio versions using consistent hashing.
- For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label.
- The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
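The percentage-plus-consistent-hashing idea can be sketched as follows. The function, revision names, and bucketing scheme are illustrative assumptions; the key property is that a namespace's bucket is derived from a stable hash, so raising the rollout percentage only ever moves namespaces from old to new, never back.

```python
import hashlib

# Sketch of percentage-based revision assignment via consistent hashing,
# in the spirit of a rollouts.yml entry like {pattern: "payments-*", new_pct: 25}.
def revision_for(namespace: str, new_pct: int,
                 old_rev: str = "istio-1-19", new_rev: str = "istio-1-20") -> str:
    """Deterministically pin a namespace to a revision based on a rollout %."""
    digest = hashlib.sha256(namespace.encode()).digest()
    bucket = digest[0] * 100 // 256  # stable bucket in [0, 100)
    return new_rev if bucket < new_pct else old_rev
```

Determinism means repeated reconciliations never flap a namespace between revisions, and rollback is just lowering `new_pct`.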
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- They addressed challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption during various event types.
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
- Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
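The overprovisioning requirement follows from simple arithmetic: with n zones, the surviving n-1 must absorb peak load, so each zone needs peak/(n-1) capacity. The function and numbers below are a back-of-the-envelope illustration, not Airbnb's actual capacity model.

```python
# AZ-failure sizing sketch: survive the loss of any one zone at peak load.
def per_zone_capacity(peak_load: float, zones: int) -> float:
    """Capacity each zone needs so losing one zone still serves peak_load."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return peak_load / (zones - 1)
```

With three zones and a peak of 90 units, each zone needs 45 units, i.e. the fleet runs at 1.5x peak in steady state, which is the cost of tolerating a full-AZ failure.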