Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.

  • Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
  • Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
  • Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and SCA.
  • Categorized 20+ local payment methods into three standardized archetypes—Redirect, Async, and Direct flows—to maximize code reuse.
  • Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
  • Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions.
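The archetype categorization and strict idempotency above can be sketched together as a minimal example; the names, in-memory storage, and statuses below are illustrative, not Airbnb's actual implementation:

```python
from enum import Enum

class Archetype(Enum):
    REDIRECT = "redirect"  # user is sent to the PSP and returns with a result
    ASYNC = "async"        # confirmation arrives later via webhook or polling
    DIRECT = "direct"      # synchronous authorization in a single call

# Hypothetical idempotency store; a real system would use durable storage.
_processed: dict[str, dict] = {}

def charge(idempotency_key: str, amount: int, method: Archetype) -> dict:
    """Return the stored result on replay instead of charging twice."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = {
        "status": "authorized" if method is Archetype.DIRECT else "pending",
        "amount": amount,
    }
    _processed[idempotency_key] = result
    return result
```

Replaying the same idempotency key returns the original result, which is what makes retries safe across flaky PSP connections.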

Why it matters: This innovation significantly streamlines frontend and mobile development by automating the creation of realistic, type-safe mock data. It frees engineers from tedious manual work, accelerates feature delivery, and improves the reliability of tests and demos.

  • Airbnb introduces @generateMock, a new GraphQL client directive, to automate the creation and maintenance of realistic, type-safe mock data.
  • The solution combines GraphQL schema validation, rich product context, and Large Language Models (LLMs) to generate convincing mock data.
  • Engineers can use @generateMock on any GraphQL operation, fragment, or field, providing optional hints and design URLs to guide the LLM's data generation.
  • Integrated with Airbnb's Niobe CLI tool, it generates JSON mock files and helper functions (TypeScript/Kotlin/Swift) for seamless consumption in tests and demo apps.
  • This approach eliminates the tedious manual process of writing and updating mocks, enabling faster parallel client/server development and ensuring data consistency.
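As a rough sketch of the pipeline these bullets describe — schema-valid placeholder values that optional hints (and, in the real system, an LLM pass) would refine — consider the following; the schema, field names, and placeholder values are all invented:

```python
# Hypothetical simplified type map standing in for a GraphQL schema.
SCHEMA = {
    "Listing": {"id": "ID", "title": "String", "nightlyPrice": "Int"},
}

# Schema-valid defaults; an LLM pass would replace these with realistic data.
PLACEHOLDERS = {"ID": "listing-1", "String": "Cozy cabin", "Int": 120}

def mock_for(type_name, hints=None):
    """Build a type-safe mock object, letting caller hints override defaults."""
    base = {field: PLACEHOLDERS[t] for field, t in SCHEMA[type_name].items()}
    base.update(hints or {})
    return base
```

Because every field comes from the schema's type map, the mock can never drift out of sync with the schema the way hand-written mocks do.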

Why it matters: This article details how to build resilient distributed systems by moving beyond static rate limits to adaptive traffic management. Engineers can learn to maximize goodput and ensure reliability in high-traffic, multi-tenant environments.

  • Airbnb evolved Mussel, their multi-tenant key-value store, from static QPS rate limiting to adaptive traffic management for improved reliability and goodput during traffic spikes.
  • The initial QoS system used simple Redis-backed QPS limits, effective for basic isolation but unable to account for varying request costs or adapt to real-time traffic shifts.
  • Resource-aware rate control (RARC) was introduced, charging requests in "request units" (RU) based on fixed overhead, rows processed, payload bytes, and, crucially, latency, so charges reflect actual backend load.
  • RARC uses a linear model for RU calculation, allowing the system to differentiate between cheap and expensive operations, even with similar surface metrics.
  • Future layers include load shedding with criticality tiers for priority traffic and hot-key detection/DDoS mitigation to handle skewed access patterns and shield the backend.
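The linear RU model might look like the sketch below; the coefficients are invented, since the article only names the inputs (fixed overhead, rows processed, payload bytes, latency):

```python
# Hypothetical coefficients for the linear request-unit model.
FIXED_RU = 1.0
RU_PER_ROW = 0.05
RU_PER_KB = 0.1
RU_PER_MS = 0.02

def request_units(rows: int, payload_bytes: int, latency_ms: float) -> float:
    """Charge a request in RU proportional to the backend work it caused."""
    return (FIXED_RU
            + RU_PER_ROW * rows
            + RU_PER_KB * payload_bytes / 1024
            + RU_PER_MS * latency_ms)

# Two requests with identical surface metrics (1 row, small payload) but very
# different backend latency are charged very differently:
cheap = request_units(rows=1, payload_bytes=256, latency_ms=2)
expensive = request_units(rows=1, payload_bytes=256, latency_ms=900)
```

This is the property the bullet describes: cheap and expensive operations are separated even when their QPS-level metrics look the same.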

Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.

  • Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands, massive data, and improve operational efficiency.
  • Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
  • Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability, flexible consistency, and enhanced cost visibility.
  • The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
  • Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
  • Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
  • A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and no data loss.
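Dynamic range sharding, as described above, can be sketched with a sorted list of split points; the split points and shard layout here are hypothetical:

```python
import bisect

# Hypothetical split points: each shard owns keys from its lower bound
# (inclusive) up to the next split point (exclusive). Splits can be moved
# at runtime to break up hotspots, unlike static hash partitioning.
SPLIT_POINTS = ["", "g", "p"]  # shard 0: ["", "g"), shard 1: ["g", "p"), shard 2: ["p", ...)

def shard_for(key: str) -> int:
    """Binary-search the split points to find the owning shard."""
    return bisect.bisect_right(SPLIT_POINTS, key) - 1
```

The advantage over v1's static partitioning is that adding a split point rebalances only the shard being split, not the whole keyspace.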

Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.

  • Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
  • It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
  • The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers.
  • Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
  • This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.
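The reduced Tenant API — just node and field resolvers — might be sketched as a pair of registries; the names and decorator mechanism below are illustrative, not Viaduct's actual Kotlin API:

```python
NODE_RESOLVERS = {}   # type name -> fetch-an-entity-by-id function
FIELD_RESOLVERS = {}  # (type, field) -> compute-one-field-from-parent function

def node_resolver(type_name):
    def register(fn):
        NODE_RESOLVERS[type_name] = fn
        return fn
    return register

def field_resolver(type_name, field):
    def register(fn):
        FIELD_RESOLVERS[(type_name, field)] = fn
        return fn
    return register

@node_resolver("Listing")
def fetch_listing(node_id):
    # A node resolver loads an entity by its global id.
    return {"id": node_id, "nightlyPrice": 120}

@field_resolver("Listing", "displayPrice")
def display_price(listing):
    # A field resolver derives one field from its parent object.
    return f"${listing['nightlyPrice']}/night"
```

Every piece of tenant logic fits one of these two shapes, which is the simplification the "Viaduct Modern" bullet describes.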

Why it matters: This article introduces a novel approach to managing complex microservice architectures. By shifting to a data-oriented service mesh with a central GraphQL schema, engineers can significantly improve modularity, simplify dependency management, and enhance data agility in large-scale SOAs.

  • Airbnb introduced Viaduct, a data-oriented service mesh, to improve modularity and address the complexity of massive dependency graphs in microservices-based Service-Oriented Architectures (SOA).
  • Traditional service meshes are procedure-oriented, leading to 'spaghetti SOA' where managing and modifying services becomes increasingly difficult.
  • Viaduct shifts to a data-oriented design, leveraging GraphQL to define a central schema comprising types, queries, and mutations across the entire service mesh.
  • This data-oriented approach abstracts service dependencies from data consumers, as Viaduct intelligently routes requests to the appropriate microservices.
  • The central GraphQL schema acts as a single source of truth, aiming to define service APIs and potentially database schemas, which significantly enhances data agility.
  • By centralizing schema definition, Viaduct seeks to streamline changes, allowing database updates to propagate to client code in a single, coordinated update instead of weeks of cross-team effort.
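The routing abstraction can be sketched as a field-to-owner table plus a tiny query planner; the ownership table and function names are hypothetical:

```python
# Hypothetical ownership table derived from a central GraphQL schema:
# each (type, field) pair maps to the microservice that resolves it.
FIELD_OWNERS = {
    ("Listing", "title"): "listing-service",
    ("Listing", "reviews"): "reviews-service",
}

def plan(parent_type, fields):
    """Group requested fields by owning service; consumers never name services."""
    groups = {}
    for field in fields:
        groups.setdefault(FIELD_OWNERS[(parent_type, field)], []).append(field)
    return groups
```

A consumer asks for data by type and field; which services get called is an implementation detail of the mesh, which is what "abstracts service dependencies from data consumers" means in practice.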

Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.

  • Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel, achieving 3-5x faster local builds/tests and 2-3x faster deploys over 4.5 years.
  • The move to Bazel was driven by needs for superior build speed via remote execution, enhanced reliability through hermeticity, and a uniform build infrastructure across all language repos.
  • Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by enabling parallel actions and reducing data transfer.
  • Hermetic builds, enforced by sandboxing, ensured consistent, repeatable results by isolating build actions from external environment dependencies.
  • The migration strategy included a proof-of-concept on a critical service with co-existing Gradle/Bazel builds, followed by a breadth-first rollout.
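A Bazel target for such a JVM service makes the hermeticity point concrete: every dependency is declared explicitly, so sandboxed actions see nothing else and can be fanned out to remote executors safely. The target and dependency labels below are invented for illustration:

```starlark
# Hypothetical BUILD file for one migrated JVM service.
java_library(
    name = "payments_core",
    srcs = glob(["src/main/java/**/*.java"]),
    deps = ["//libs/common:util"],  # all inputs are explicit, enabling sandboxing
)

java_test(
    name = "payments_core_test",
    srcs = glob(["src/test/java/**/*.java"]),
    deps = [":payments_core"],
)
```

Because inputs and outputs are fully declared, Bazel can cache and skip any action whose inputs are unchanged, which is where the 3-5x local speedups largely come from.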

Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.

  • Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
  • The strategy employs Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
  • Upgrades are atomic, rolling out new istio-proxy versions and connecting them to the corresponding new Istiod revision simultaneously.
  • A rollouts.yml file dictates the gradual rollout, defining namespace patterns and percentage distributions for Istio versions using consistent hashing.
  • For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label.
  • The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
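The consistent-hashing assignment driven by rollouts.yml might be sketched as follows; the hashing scheme and function names are assumptions, not Airbnb's actual implementation:

```python
import hashlib

def revision_for(namespace: str, rollout: list) -> str:
    """Pick an Istio revision for a namespace from ordered (revision, percent)
    pairs summing to 100. Hashing the namespace (rather than choosing randomly)
    keeps each namespace pinned to the same revision across reconciliations."""
    bucket = int(hashlib.md5(namespace.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for revision, percent in rollout:
        cumulative += percent
        if bucket < cumulative:
            return revision
    return rollout[-1][0]
```

Shifting the percentages in the rollout list moves only the namespaces whose hash buckets fall in the changed range, which is what makes the rollout gradual and reversible.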

Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.

  • Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
  • They addressed challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
  • A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption during various event types.
  • Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
  • AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
  • Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
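The retry-then-stale-read mitigation for EBS tail latency can be sketched as below; the function names, timeout, and retry count are illustrative:

```python
def read_with_fallback(read_primary, read_stale, timeout_s=0.05, retries=2):
    """Try the primary with a short timeout; on repeated latency spikes,
    serve a slightly stale replica read rather than fail the request."""
    for _ in range(retries):
        try:
            return read_primary(timeout=timeout_s)
        except TimeoutError:
            continue  # transparent retry absorbs a transient spike
    return read_stale()  # stale-but-fast beats timed-out for most reads
```

The trade-off is explicit: callers that can tolerate staleness get bounded latency, while strict readers can simply not supply a fallback.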