Why it matters: Skipper offers a lightweight alternative to heavy orchestrators like Temporal. It allows engineers to build reliable, multi-step processes using existing infrastructure, significantly reducing operational complexity while maintaining high reliability for critical transactions.
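The core idea behind this kind of lightweight orchestration can be sketched without the article's actual code: record completed steps in a database table so a crashed process can re-run the workflow and skip work already done. Everything below (the `done` table, `run_workflow`) is hypothetical, using SQLite to stand in for whatever database a team already operates:

```python
import sqlite3

def run_workflow(db, workflow_id, steps):
    """Run (workflow_id, [(step_name, fn), ...]) durably: each completed
    step is recorded, so a rerun after a crash skips finished steps."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS done ("
        "  workflow_id TEXT, step TEXT,"
        "  PRIMARY KEY (workflow_id, step))"
    )
    for name, fn in steps:
        row = db.execute(
            "SELECT 1 FROM done WHERE workflow_id = ? AND step = ?",
            (workflow_id, name),
        ).fetchone()
        if row:
            continue  # idempotent resume: this step already ran
        fn()
        db.execute("INSERT INTO done VALUES (?, ?)", (workflow_id, name))
        db.commit()  # commit per step so progress survives a crash

log = []
db = sqlite3.connect(":memory:")
steps = [("reserve", lambda: log.append("reserve")),
         ("charge", lambda: log.append("charge"))]
run_workflow(db, "order-42", steps)
run_workflow(db, "order-42", steps)  # second run performs no steps
```

The per-step commit is what buys durability: a real implementation would also need idempotent step bodies, but the state machine itself is just a table and a loop.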
Why it matters: This incident highlights how a minor sanitization failure in an internal protocol can lead to critical RCE. It underscores the importance of defense-in-depth: removing unused code paths shrinks the attack surface, while robust telemetry can verify the absence of exploitation.
Why it matters: As AI agents accelerate development, platforms like GitHub face unprecedented load. This update highlights how massive scale requires shifting from monoliths to isolated services and multi-cloud strategies to maintain reliability under exponential growth.
Why it matters: Code coverage is often a structural issue rather than a testing one. Refactoring data models to remove boilerplate allows teams to meet CI requirements while improving maintainability and reducing CI runtime, avoiding the trap of writing low-value tests.
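The structural point can be sketched in Python: every hand-written dunder method is a line CI expects tests to cover, while an equivalent `@dataclass` generates the same behavior with nothing left to cover. The class names here are hypothetical:

```python
from dataclasses import dataclass

# Hand-written model: __init__, __eq__, and __repr__ are all lines that
# count against coverage, inviting low-value tests just to exercise them.
class UserManual:
    def __init__(self, name, email):
        self.name = name
        self.email = email

    def __eq__(self, other):
        return (self.name, self.email) == (other.name, other.email)

    def __repr__(self):
        return f"UserManual({self.name!r}, {self.email!r})"

# Equivalent dataclass: the boilerplate, and its coverage burden, disappear.
@dataclass
class User:
    name: str
    email: str
```

Same semantics, fewer lines in the coverage denominator, and less code for CI to collect coverage over.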
Why it matters: Automating dataset migrations at scale reduces developer toil and prevents technical debt. By using background agents to update downstream consumers, organizations can accelerate infrastructure evolution without overwhelming product teams with manual migration tasks.
Why it matters: This update solves sandbox poisoning, where a single Rust panic could leave an entire Wasm instance in an unrecoverable state. By upstreaming panic recovery to wasm-bindgen, engineers get better reliability for stateful workloads like Durable Objects and improved error handling across the Rust-JS boundary.
Why it matters: Scaling observability for 1,000+ services requires balancing multi-tenant isolation with operational efficiency. Airbnb's approach to shuffle sharding and automated control planes provides a blueprint for building resilient, petabyte-scale metrics systems that avoid 'flying blind' during outages.
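Shuffle sharding itself is simple to sketch: each tenant is deterministically assigned a small pseudo-random subset of shards, so a noisy tenant can only affect the few tenants whose subsets happen to overlap. The shard and replica counts below are illustrative, not Airbnb's actual parameters:

```python
import hashlib
import random

SHARDS = [f"shard-{i}" for i in range(16)]
REPLICAS = 4  # shards assigned per tenant

def tenant_shards(tenant_id: str) -> list[str]:
    """Deterministic pseudo-random shard subset for a tenant."""
    # Seed a PRNG from a hash of the tenant id so the assignment is
    # stable across processes without any coordination or lookup table.
    seed = int.from_bytes(hashlib.sha256(tenant_id.encode()).digest()[:8], "big")
    return random.Random(seed).sample(SHARDS, REPLICAS)

a = tenant_shards("tenant-a")
b = tenant_shards("tenant-b")
overlap = set(a) & set(b)  # typically small: blast radius of a bad tenant
```

With 16 shards choose 4, two tenants share all four shards with probability 1/1820, which is the property that keeps one tenant's overload from taking out another's metrics.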
Why it matters: Choosing the right multi-tenancy model is critical for database scalability and security. This guide helps engineers avoid common pitfalls like RLS complexity or schema sprawl, favoring a performant shared-schema approach that scales to thousands of tenants.
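A minimal sketch of the shared-schema approach: one table, a `tenant_id` column, a tenant-leading index, and every query scoped in the application layer. Table and function names are hypothetical, with SQLite standing in for a production database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Every row carries its tenant; a tenant-leading index keeps
# per-tenant lookups fast as the shared table grows.
db.execute("CREATE TABLE projects (tenant_id TEXT NOT NULL, name TEXT)")
db.execute("CREATE INDEX idx_projects_tenant ON projects (tenant_id)")
db.executemany(
    "INSERT INTO projects VALUES (?, ?)",
    [("acme", "site"), ("acme", "api"), ("globex", "etl")],
)

def projects_for(tenant_id: str) -> list[str]:
    # Isolation lives here: every query must be scoped by tenant_id.
    rows = db.execute(
        "SELECT name FROM projects WHERE tenant_id = ? ORDER BY name",
        (tenant_id,),
    )
    return [name for (name,) in rows]
```

The trade-off versus RLS or per-tenant schemas is that isolation is enforced in application code, so a repository layer that injects the `tenant_id` filter everywhere becomes the critical piece to get right.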
Why it matters: High-intensity agentic workflows are forcing a shift in AI resource management. Engineers must now optimize token consumption and model selection to maintain productivity within new usage constraints and avoid service interruptions.
Why it matters: Scaling AI code reviews requires moving beyond simple prompts to multi-agent orchestration. This architecture demonstrates how to integrate LLMs into CI/CD pipelines reliably, handling large-scale diffs and specialized domain knowledge while maintaining high signal-to-noise ratios.
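One way to picture that orchestration layer: split a large unified diff into per-file chunks, then route each chunk to specialized reviewer agents by domain. The reviewer functions below are hypothetical stand-ins for LLM-backed agents, and the routing table is illustrative:

```python
def split_diff(diff: str) -> dict[str, str]:
    """Split a unified diff into per-file chunks keyed by path."""
    chunks, current = {}, None
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            current = line.split(" b/")[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}

# Hypothetical specialized agents; real ones would call an LLM with a
# domain-specific prompt and the file's chunk as context.
def security_reviewer(path: str, chunk: str) -> str:
    return f"[security] reviewed {path}"

def style_reviewer(path: str, chunk: str) -> str:
    return f"[style] reviewed {path}"

ROUTES = {".sql": [security_reviewer],
          ".py": [security_reviewer, style_reviewer]}

def review(diff: str) -> list[str]:
    findings = []
    for path, chunk in split_diff(diff).items():
        ext = "." + path.rsplit(".", 1)[-1]
        for agent in ROUTES.get(ext, []):
            findings.append(agent(path, chunk))
    return findings

diff = ("diff --git a/app.py b/app.py\n+print('hi')\n"
        "diff --git a/q.sql b/q.sql\n+SELECT 1;")
findings = review(diff)
```

Chunking per file keeps each agent's context small, and the routing table is where domain knowledge accumulates without bloating a single prompt.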