sre

Posts tagged with sre

Why it matters: For global-scale perimeter services, traditional sequential rollbacks are too slow. This architecture demonstrates how to achieve 10-minute global recovery through warm-standby blue-green deployments and synchronized autoscaling, ensuring high availability for trillions of requests.

  • Salesforce Edge manages a global perimeter platform handling 1.5 trillion monthly requests across 21+ points of presence.
  • Transitioned from sequential regional rollbacks taking up to 12 hours to a global blue-green model that recovers in 10 minutes.
  • Implemented parallel blue and green Kubernetes deployments to maintain a warm standby fleet capable of immediate full-load handling.
  • Customized Horizontal Pod Autoscalers (HPA) to ensure the inactive fleet scales identically to the active fleet, preventing capacity mismatches.
  • Automated traffic redirection using native Kubernetes labels and selectors instead of external L7 routing tools like Argo (see the cutover sketch after this list).
  • Integrated TCP connection draining and controlled traffic cutover to preserve four-nines availability during global rollback events.
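
A minimal sketch of that label-based cutover, using the official kubernetes Python client; the Service name, namespace, and "color" label are illustrative assumptions, not Salesforce's actual manifests.

```python
# Sketch: flip live traffic between the blue and green fleets by patching the
# Service selector. Assumes pods carry an illustrative "color" label and that
# the warm-standby fleet is already scaled to take full load.
# Requires the official client: pip install kubernetes
from kubernetes import client, config


def cutover(service_name: str, namespace: str, target_color: str) -> None:
    """Point the Service at the standby fleet (e.g. blue -> green)."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()

    # Patch only the selector; endpoints switch to the target fleet, while
    # connections to the old fleet drain via the pods' termination grace period.
    patch = {"spec": {"selector": {"app": "edge-proxy", "color": target_color}}}
    core.patch_namespaced_service(service_name, namespace, patch)


if __name__ == "__main__":
    cutover("edge-proxy", "perimeter", "green")
```

Because the rollback is just a selector patch, reverting is the same operation with the colors swapped, which is what keeps global recovery in the range of minutes rather than hours.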

Why it matters: Understanding global connectivity disruptions helps engineers build more resilient, multi-homed architectures. It highlights the fragility of physical infrastructure like submarine cables and the impact of BGP routing and government policy on service availability.

  • Q4 2025 saw over 180 global Internet disruptions caused by government mandates, physical infrastructure damage, and technical failures.
  • Tanzania implemented a near-total Internet shutdown during its presidential election, resulting in a 90% traffic drop and fluctuations in BGP address space announcements.
  • Submarine cable cuts, specifically to the PEACE and WACS systems, significantly impacted connectivity in Pakistan and Cameroon.
  • Infrastructure vulnerabilities in Haiti led to multiple outages for Digicel users due to international fiber optic cuts.
  • Beyond physical damage, disruptions were linked to hyperscaler cloud platform issues and ongoing military conflicts affecting regional network stability.

Why it matters: This incident highlights how minor automation errors in BGP policy configuration can cause global traffic disruptions. It underscores the risks of permissive routing filters and the importance of robust validation in network automation to prevent large-scale route leaks.

  • An automated routing policy change intended to remove IPv6 prefix advertisements for a Bogotá data center caused a major BGP route leak in Miami.
  • The removal of specific prefix lists from policy statements resulted in overly permissive terms, unintentionally redistributing peer routes to other providers (a validation sketch follows this list).
  • The incident lasted 25 minutes, causing significant congestion on Miami backbone infrastructure and affecting both Cloudflare customers and external parties.
  • The leak was classified as a mixture of Type 3 and Type 4 route leaks according to RFC7908, violating standard valley-free routing principles.
  • Impact was limited to IPv6 traffic and was mitigated by manually reverting the configuration and pausing the automation platform.
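
The corrective lesson is pre-deployment validation of generated policies. Below is a hypothetical sketch of such a check, which refuses to roll out an export term that accepts routes with no prefix list attached; the data model and names are illustrative, not the actual automation platform.

```python
# Hypothetical pre-deployment check: refuse to deploy an export policy in which
# an "accept" term has lost all of its prefix lists and therefore matches
# everything. The data model below is illustrative, not the real automation.
from dataclasses import dataclass, field


@dataclass
class PolicyTerm:
    name: str
    action: str                                   # "accept" or "reject"
    prefix_lists: list[str] = field(default_factory=list)


def overly_permissive_terms(terms: list[PolicyTerm]) -> list[str]:
    """Return the names of accept terms with no prefix-list match left."""
    return [t.name for t in terms if t.action == "accept" and not t.prefix_lists]


def validate_or_abort(terms: list[PolicyTerm]) -> None:
    bad = overly_permissive_terms(terms)
    if bad:
        raise ValueError(f"refusing to deploy: unconditional accept in terms {bad}")


if __name__ == "__main__":
    # Removing a data center's IPv6 prefix list from a term should fail the
    # check, not silently turn the term into "accept and re-advertise everything".
    try:
        validate_or_abort([PolicyTerm("export-to-transit-v6", "accept")])
    except ValueError as err:
        print(err)
```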

Why it matters: Supporting open-source sustainability is crucial for the reliability of modern software stacks. This initiative demonstrates how large engineering organizations can mitigate supply chain risks and ensure the longevity of critical dependencies.

  • Spotify has announced the 2025 recipients of its Free and Open Source Software (FOSS) Fund.
  • The fund was established in 2022 to provide financial support to critical open source projects that Spotify relies on.
  • The initiative aims to ensure the long-term sustainability and health of the global open source ecosystem.
  • This program highlights the importance of corporate responsibility in maintaining the software infrastructure used by millions.

Why it matters: Securing AI agents at scale requires balancing rapid innovation with enterprise-grade protection. This architecture demonstrates how to manage 11M+ daily calls by decoupling security layers, ensuring multi-tenant reliability, and maintaining request integrity across distributed systems.

  • Salesforce's Developer Access team manages a secure access plane for Agentforce, handling over 11 million daily agent calls across production environments.
  • The architecture utilizes a layered access-control plane that separates authentication at the edge from authorization within the core platform to reduce latency and operational risk.
  • A middle-layer API service acts as a technical control point, ensuring all agentic traffic follows consistent security protocols and cannot bypass protection boundaries.
  • Security invariants include edge-level authentication validation, core-platform-enforced authorization, and end-to-end request integrity using Salesforce-minted tokens (sketched after this list).
  • The system is designed to contain multi-tenant blast radius risks, preventing runaway agents or malformed requests from impacting other customers in a shared environment.
  • Strict egress traffic filtering and cross-boundary revalidation are employed to maintain the principle of least privilege across the distributed compute layer.
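
A minimal sketch of the layered idea, with stdlib HMAC standing in for Salesforce-minted tokens; the tenant model, key handling, and function names are assumptions for illustration only.

```python
# Sketch of a layered access-control check: the edge validates token integrity,
# and the core platform independently revalidates and enforces authorization.
# HMAC here stands in for Salesforce-minted tokens; all names are illustrative.
import hashlib
import hmac

SIGNING_KEY = b"demo-key-not-for-production"


def mint_token(tenant_id: str, agent_id: str) -> str:
    payload = f"{tenant_id}:{agent_id}"
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"


def edge_authenticate(token: str) -> tuple[str, str]:
    """Layer 1 (edge): cheap integrity check before traffic enters the core."""
    tenant_id, agent_id, sig = token.rsplit(":", maxsplit=2)
    expected = hmac.new(SIGNING_KEY, f"{tenant_id}:{agent_id}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("edge: token failed integrity check")
    return tenant_id, agent_id


def core_authorize(tenant_id: str, agent_id: str, resource_tenant: str) -> None:
    """Layer 2 (core): authorization is enforced here, never trusted from the edge.
    Containing blast radius: an agent may only touch its own tenant's data."""
    if tenant_id != resource_tenant:
        raise PermissionError("core: cross-tenant access denied")


if __name__ == "__main__":
    token = mint_token("acme", "agent-42")
    tenant, agent = edge_authenticate(token)      # edge boundary
    core_authorize(tenant, agent, "acme")         # core platform boundary
    print("request admitted for", tenant, agent)
```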

Why it matters: Benchmarking AI systems against live providers is expensive and noisy. This mock service provides a deterministic, cost-effective way to validate performance and reliability at scale, allowing engineers to iterate faster without financial friction or external latency fluctuations.

  • Salesforce developed an internal LLM mock service to simulate AI provider behavior, supporting benchmarks of over 24,000 requests per minute.
  • The service reduced annual token-based costs by over $500,000 by replacing live LLM dependencies during performance and regression testing.
  • Deterministic latency controls allow engineers to isolate internal code performance from external provider variability, ensuring repeatable results (see the sketch after this list).
  • The mock layer enables rapid scale and failover benchmarking by simulating high-volume traffic and controlled outages without external infrastructure.
  • By providing a shared platform capability, the service accelerates development loops and improves confidence in performance signals.
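
A minimal sketch of the concept, assuming canned completions derived from a prompt hash plus configurable latency and scripted failure injection; this is not the actual service's interface.

```python
# Sketch of a deterministic LLM mock: canned responses, fixed latency, and
# scripted failure injection, so benchmark runs are repeatable and free of
# provider cost or jitter. Names and behavior are illustrative assumptions.
import hashlib
import time


class MockLLM:
    def __init__(self, latency_s: float = 0.150, fail_every_n: int = 0):
        self.latency_s = latency_s        # deterministic latency per request
        self.fail_every_n = fail_every_n  # inject an outage every Nth call (0 = never)
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        time.sleep(self.latency_s)        # simulate provider latency, not real compute
        if self.fail_every_n and self.calls % self.fail_every_n == 0:
            raise TimeoutError("injected provider outage")   # for failover benchmarks
        # Deterministic "completion": the same prompt always yields the same output.
        digest = hashlib.sha256(prompt.encode()).hexdigest()[:12]
        return f"mock-completion-{digest}"


if __name__ == "__main__":
    llm = MockLLM(latency_s=0.01, fail_every_n=100)
    assert llm.complete("hello") == llm.complete("hello")    # repeatable results
    print(llm.complete("hello"))
```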

Why it matters: Security mitigations added during incidents can become technical debt that degrades user experience. This case study emphasizes the need for lifecycle management and observability in defense systems to ensure temporary protections don't inadvertently block legitimate traffic as patterns evolve.

  • GitHub identified that emergency defense mechanisms, such as rate limits and traffic controls, were inadvertently blocking legitimate users after outliving their original purpose.
  • The issue stemmed from composite signals that combined industry-standard fingerprinting with platform-specific business logic, leading to false positives during normal browsing.
  • While the false-positive rate was low (0.003-0.004% of total traffic), it caused consistent disruption for logged-out users following external links.
  • The investigation involved tracing requests across a multi-layered infrastructure built on HAProxy to pinpoint which specific defense layer was triggering the blocks.
  • The incident reinforces that observability and lifecycle management are as critical for security mitigations as they are for core product features; a sketch of that lifecycle idea follows this list.
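
A hypothetical sketch of that idea: every emergency mitigation carries an owner and an expiry, and every block decision is logged so false positives can be measured; the rule model is illustrative, not GitHub's implementation.

```python
# Hypothetical sketch: every emergency mitigation records an owner and an
# expiry, and every block decision is logged, so a rule that has outlived its
# purpose is surfaced instead of silently blocking legitimate logged-out users.
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

log = logging.getLogger("mitigations")


@dataclass
class Mitigation:
    name: str
    owner: str
    expires_at: datetime

    def matches(self, fingerprint: str) -> bool:
        # Placeholder for the composite signal (fingerprinting + business logic).
        return fingerprint.startswith("suspect-")


def should_block(rule: Mitigation, fingerprint: str) -> bool:
    if datetime.now(timezone.utc) >= rule.expires_at:
        log.warning("mitigation %s (owner: %s) has expired; not enforcing",
                    rule.name, rule.owner)
        return False
    blocked = rule.matches(fingerprint)
    # Observability: record every decision so false-positive rates can be measured.
    log.info("mitigation=%s fingerprint=%s blocked=%s", rule.name, fingerprint, blocked)
    return blocked


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    rule = Mitigation("emergency-rate-limit", "traffic-team",
                      expires_at=datetime.now(timezone.utc) + timedelta(days=14))
    print(should_block(rule, "suspect-abc123"))
```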

Why it matters: This report highlights the operational challenges of scaling AI-integrated services and global infrastructure. It provides insights into managing model-backed dependencies, handling cross-cloud network issues, and mitigating traffic spikes to maintain high availability for developer tools.

  • A Kafka misconfiguration prevented agent session data from reaching the AI Controls page, leading to improved pre-deployment validation.
  • Copilot Code Review experienced degradation due to model-backed dependency latency, mitigated by bypassing fix suggestions and increasing worker capacity.
  • Network packet loss between West US runners and an edge site caused GitHub Actions timeouts, resolved by rerouting traffic away from the affected site.
  • A database migration caused schema drift that blocked Copilot policy updates, resulting in hardened service synchronization and deployment pipelines.
  • Unauthenticated traffic spikes to search endpoints caused page load failures, addressed through improved limiters and proactive traffic monitoring.

Why it matters: This incident highlights how subtle optimizations can break systems by violating undocumented assumptions in legacy clients. It serves as a reminder that even when a protocol doesn't mandate order, real-world implementations often depend on it.

  • A memory optimization in Cloudflare's 1.1.1.1 resolver inadvertently changed the order of records in DNS responses.
  • The code change moved CNAME records to the end of the answer section instead of the beginning when merging cached partial chains.
  • While the DNS protocol technically treats record order as irrelevant, many client implementations process records sequentially.
  • Legacy implementations like glibc's getaddrinfo fail to resolve addresses if the A record appears before the CNAME that defines the alias.
  • The incident was resolved by reverting the optimization, restoring the original record ordering where CNAMEs precede final answers (sketched below).
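
A simplified sketch of the ordering invariant the revert restored, assuming a toy record representation: when merging cached partial chains, the CNAME chain is emitted before the terminal address records.

```python
# Sketch of the ordering invariant: in the answer section, the CNAME chain must
# precede the terminal A/AAAA records, because sequential clients (e.g. older
# getaddrinfo implementations) walk the answers in order. The record
# representation here is deliberately simplified.

def order_answer_section(records: list[dict]) -> list[dict]:
    """Place CNAME records (the alias chain) before address records."""
    cnames = [r for r in records if r["type"] == "CNAME"]
    addresses = [r for r in records if r["type"] in ("A", "AAAA")]
    return cnames + addresses


if __name__ == "__main__":
    # A merged cache result where the optimization had pushed the CNAME last:
    merged = [
        {"name": "target.example.net", "type": "A", "data": "192.0.2.10"},
        {"name": "www.example.com", "type": "CNAME", "data": "target.example.net"},
    ]
    for record in order_answer_section(merged):
        print(record["name"], record["type"], record["data"])
```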

Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.

  • Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
  • Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
  • Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and SCA.
  • Categorized 20+ local payment methods into three standardized archetypes (Redirect, Async, and Direct flows) to maximize code reuse, as sketched after this list.
  • Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
  • Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions.
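
A hypothetical sketch of the archetype idea in Python: PSP connectors are grouped into three standardized flows behind one processor-agnostic interface; the class and method names are illustrative assumptions, not the actual Payments LTA code.

```python
# Hypothetical sketch: PSP connectors are classified into three standardized
# flows (Redirect, Async, Direct) behind one processor-agnostic interface, so
# orchestration code is reused across 20+ local payment methods.
from abc import ABC, abstractmethod
from enum import Enum


class Archetype(Enum):
    REDIRECT = "redirect"   # shopper is sent to the PSP (e.g. for SCA) and returns
    ASYNC = "async"         # confirmation arrives later via webhook or polling
    DIRECT = "direct"       # PSP responds synchronously in the same call


class PaymentConnector(ABC):
    archetype: Archetype

    @abstractmethod
    def authorize(self, payment_id: str, amount_minor: int, currency: str) -> str:
        """Start the PSP-specific flow; returns a reference for later steps."""


class ExampleRedirectConnector(PaymentConnector):
    archetype = Archetype.REDIRECT

    def authorize(self, payment_id: str, amount_minor: int, currency: str) -> str:
        # Illustrative only: a real connector would call the PSP's API here and
        # reuse payment_id as the idempotency key on retries.
        return f"https://psp.example/redirect/{payment_id}"


if __name__ == "__main__":
    connector = ExampleRedirectConnector()
    print(connector.archetype, connector.authorize("pay_123", 2500, "EUR"))
```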