sre

Posts tagged with sre

Why it matters: This article demonstrates how a Durable Execution platform like Temporal can drastically improve the reliability of critical cloud operations and continuous delivery pipelines by offloading complex failure handling and state management from engineers.

  • Netflix significantly improved the reliability of its Spinnaker deployments by adopting Temporal, reducing transient failures from 4% to 0.0001%.
  • Temporal is a Durable Execution platform that allows engineers to write resilient code, abstracting away complexities of distributed system failures.
  • The previous Spinnaker architecture suffered from complex, undifferentiated internal orchestration, retry logic, and a homegrown Saga framework within its Clouddriver service.
  • Prior to Temporal, Clouddriver's instance-local task state led to lost operation progress if the service crashed, impacting deployment reliability.
  • Temporal helped streamline cloud operations by offloading complex state management and failure handling, allowing services like Clouddriver to focus on core infrastructure changes.
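To make the durable-execution idea concrete, here is a minimal sketch using Temporal's Python SDK. The workflow and activity names (`DeployWorkflow`, `deploy_server_group`) are illustrative stand-ins for a Clouddriver-style operation, not Netflix's actual code; the point is that retries, timeouts, and durable progress are declared once and handled by the platform instead of hand-rolled Saga and retry logic.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def deploy_server_group(region: str) -> str:
    # Hypothetical activity: the side-effecting cloud call lives here. If the
    # worker running it crashes, Temporal re-dispatches it to another worker.
    return f"deployed to {region}"


@workflow.defn
class DeployWorkflow:
    @workflow.run
    async def run(self, regions: list[str]) -> list[str]:
        results = []
        for region in regions:
            # Progress is persisted in the workflow's event history, so a crash
            # between regions resumes here rather than losing the operation.
            result = await workflow.execute_activity(
                deploy_server_group,
                region,
                start_to_close_timeout=timedelta(minutes=10),
                retry_policy=RetryPolicy(maximum_attempts=5),
            )
            results.append(result)
        return results
```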

Why it matters: This article details how Netflix built a robust, high-performance live streaming origin and optimized its CDN for live content. It offers insights into handling real-time data defects, ensuring resilience, and optimizing content delivery at scale.

  • Netflix Live Origin is a multi-tenant microservice bridging cloud live streaming pipelines and Open Connect CDN, managing content distribution.
  • It ensures resilience through redundant regional pipelines and server-side failover, utilizing epoch locking for intelligent segment selection.
  • The Origin detects and mitigates live stream defects (e.g., short, missing segments) by selecting valid candidates from multiple pipelines.
  • Open Connect's nginx-based CDN was optimized for live streaming, extending proxy-caching and adding millisecond-granularity caching.
  • Live Origin "holds open" requests for yet-to-be-published segments, reducing network chatter and improving efficiency.
  • HTTP headers are leveraged for scalable streaming metadata, providing real-time event notifications to client devices via Open Connect Appliances (OCAs).
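The defect-mitigation bullets above can be illustrated with a small, hypothetical sketch (field names and tolerances are assumptions, not Netflix's implementation): because segments are epoch-locked, the same sequence number exists on every redundant pipeline, so the origin can pick the first candidate that is present and of the expected duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SegmentCandidate:
    pipeline: str        # which redundant regional pipeline produced it
    sequence: int        # epoch-locked sequence number, identical across pipelines
    duration_ms: int
    available: bool


def pick_segment(candidates: list[SegmentCandidate],
                 expected_duration_ms: int,
                 tolerance_ms: int = 100) -> Optional[SegmentCandidate]:
    """Return a defect-free candidate for one epoch-aligned segment, if any.

    A candidate is skipped if it is missing or noticeably shorter than expected
    (a "short segment" defect); the tolerance value here is illustrative.
    """
    for candidate in candidates:
        if not candidate.available:
            continue
        if abs(candidate.duration_ms - expected_duration_ms) > tolerance_ms:
            continue
        return candidate
    return None  # caller can fail over to another region or wait and retry
```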

Why it matters: This article introduces "Continuous Efficiency," an AI-driven method to embed sustainable and efficient coding practices directly into development workflows. It offers a practical path for engineers to improve code quality, performance, and reduce operational costs without manual effort.

  • "Continuous Efficiency" integrates AI-powered automation with green software principles to embed sustainability into development workflows.
  • This approach combines LLM-powered Continuous AI for CI/CD with Green Software practices, aiming for more performant, resilient, and cost-effective code.
  • It addresses the typically low priority given to green software by making efficiency optimization near-effortless and always-on, reducing environmental impact alongside cost.
  • Implemented via Agentic Workflows in GitHub Actions, it lets teams define engineering standards in natural language and apply them at scale.
  • Benefits include declarative rule authoring, semantic generalizability across languages, and intelligent remediation like automated pull requests.
  • Pilot projects demonstrate success in applying green software rules and Web Sustainability Guidelines, yielding measurable performance gains.
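As a rough sketch of what one of these agentic checks could look like (not the article's actual workflow; `call_llm` is a placeholder for whatever model endpoint is used), a CI step can pair a natural-language efficiency rule with the pull request's diff and ask the model for violations and suggested fixes:

```python
import subprocess

# A "green software" rule written in natural language; the agentic workflow
# applies it semantically rather than through a hand-written linter.
EFFICIENCY_RULE = (
    "Flag code that polls on a fixed interval where an event-driven or "
    "batched approach would reduce CPU wake-ups and network calls."
)


def call_llm(prompt: str) -> str:
    """Placeholder for the model call the workflow would make (assumption)."""
    raise NotImplementedError


def review_pull_request() -> str:
    # Diff of the pull request branch against the default branch (simplified).
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = (
        f"Rule: {EFFICIENCY_RULE}\n\n"
        "Review the following diff. List each violation with file, line, and a "
        f"suggested fix suitable for an automated pull request:\n{diff}"
    )
    return call_llm(prompt)
```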

Why it matters: The article details how GitHub Actions' core infrastructure was re-architected to support massive scale and deliver crucial features. This ensures improved reliability, performance, and flexibility for developers using CI/CD pipelines, addressing long-standing community requests.

  • GitHub Actions underwent a significant re-architecture of its core backend services to handle massive growth, now processing 71 million jobs daily.
  • This re-architecture improved performance, scalability, and reliability, laying the foundation for future feature development.
  • Key quality-of-life improvements recently shipped include support for YAML anchors to reduce workflow duplication.
  • Non-public workflow templates enable consistent, private CI scaffolding across organizations.
  • Reusable workflow limits were increased, allowing for more modular and deeply nested CI/CD pipelines.
  • The cache size limit per repository was removed, addressing a pain point for large projects with heavy dependencies.

Why it matters: This report highlights common infrastructure challenges like rate limiting, certificate management, and configuration errors. It offers valuable insights into incident response, mitigation strategies, and proactive measures for maintaining high availability in complex distributed systems.

  • GitHub experienced three incidents in November 2025, affecting Dependabot, Git operations, and Copilot services.
  • A Dependabot incident was caused by hitting GitHub Container Registry rate limits, resolved by adjusting job rates and increasing limits.
  • All Git operations failed due to an expired TLS certificate for internal service-to-service communication, mitigated by certificate replacement and service restarts.
  • A Copilot outage for the Claude Sonnet 4.5 model resulted from a misconfiguration in an internal service, which was resolved by reverting the change.
  • Post-incident actions include adding new monitoring, auditing certificates, accelerating automation for certificate management, and improving cross-service deploy safeguards.
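Two of the follow-ups, certificate auditing and better monitoring, are easy to prototype outside of GitHub's own tooling. The sketch below is a generic check (not GitHub's implementation) that reports how many days remain before a host's TLS certificate expires, which can feed an alert well before anything breaks:

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Return the number of days until the certificate at host:port expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).total_seconds() / 86400


if __name__ == "__main__":
    remaining = days_until_cert_expiry("github.com")
    print(f"certificate expires in {remaining:.1f} days")
    if remaining < 30:
        raise SystemExit("renew soon")  # in practice, page or open a ticket
```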

Why it matters: Engineers can leverage AI for rapid development while maintaining high code quality. This article introduces tools and strategies, like GitHub Code Quality and effective prompting, to prevent "AI slop" and ensure reliable, maintainable code in an accelerated workflow.

  • AI significantly accelerates development but risks generating "AI slop" and technical debt without proper quality control.
  • GitHub Code Quality, leveraging AI and CodeQL, ensures high standards by automatically detecting and suggesting fixes for maintainability and reliability issues in pull requests.
  • Key features include one-click enablement, automated fixes for common errors, enforcing quality bars with rulesets, and surfacing legacy technical debt.
  • Engineers must "drive" AI by providing clear, constrained prompts, focusing on goals, context, and desired output formats to maximize quality.
  • This approach allows teams to achieve both speed and control, preventing trade-offs between velocity and code reliability in the AI era.
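As an illustration of the "drive the AI" point (an assumed example, not a GitHub Code Quality feature), a constrained prompt spells out the goal, the context, the hard constraints, and the output format before the model writes a line of code:

```python
# Illustrative prompt skeleton: the more of these slots are filled in,
# the less room the model has to produce "AI slop". All names below
# (fetch_user_profile, UpstreamError, app/net/retry.py) are hypothetical.
PROMPT = """\
Goal: add retry with exponential backoff to fetch_user_profile().

Context:
- Python 3.12 service using httpx; existing helpers live in app/net/retry.py.
- Do not add new dependencies.

Constraints:
- Preserve the public signature of fetch_user_profile().
- Raise UpstreamError after 3 failed attempts.
- Include unit tests using pytest.

Output format: a unified diff only, no prose.
"""
```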

Why it matters: This expansion provides engineers with more Azure regions and Availability Zones, enabling highly resilient, performant, and geographically diverse cloud architectures for critical applications and AI workloads.

  • Microsoft is significantly expanding its cloud infrastructure in the US, including a new East US 3 region in Atlanta by early 2027.
  • The East US 3 region will incorporate Availability Zones for enhanced resiliency and support advanced Azure workloads, including AI.
  • Five existing US Azure regions (North Central US, West Central US, US Gov Arizona, East US 2, South Central US) will also gain Availability Zones by 2026-2027.
  • These expansions aim to meet growing customer demand for cloud and AI services, offering greater capacity, resiliency, and agility.
  • The new infrastructure emphasizes sustainability, with the East US 3 region designed for LEED Gold Certification and water conservation.
  • Leveraging Availability Zones and multi-region architectures is highlighted for improving application performance, latency, and overall resilience.

Why it matters: As AI agents become integrated into development, ensuring their output is safe and predictable is critical. This system provides a blueprint for building trust in automated code generation through rigorous feedback loops and validation.

  • Spotify's system focuses on making AI coding agents predictable and trustworthy through structured feedback loops.
  • The architecture ensures that agent-generated code is validated against existing engineering standards and tests.
  • Background agents operate asynchronously to improve code quality without disrupting the primary developer workflow.
  • The framework addresses the challenge of moving from experimental AI generation to production-ready software engineering.
  • Automated verification steps are integrated to prevent the introduction of bugs or technical debt by autonomous agents.
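A minimal sketch of such a validation gate, assuming the agent's patch is already applied on a working branch (the specific tools here are common Python choices, not Spotify's internal stack), runs the existing quality checks and only lets a fully passing change move forward:

```python
import subprocess
import sys


def run_check(cmd: list[str]) -> bool:
    """Run one quality gate and report whether it passed."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "ok" if result.returncode == 0 else "FAILED"
    print(f"[{status}] {' '.join(cmd)}")
    return result.returncode == 0


def validate_agent_change() -> bool:
    checks = [
        ["ruff", "check", "."],  # lint / style gate
        ["mypy", "."],           # static type-checking gate
        ["pytest", "-q"],        # the existing test suite must still pass
    ]
    # Run every gate so the agent gets complete feedback in one iteration.
    results = [run_check(check) for check in checks]
    return all(results)


if __name__ == "__main__":
    # Only a passing change is surfaced for human review; otherwise the
    # failures go back to the agent as structured feedback.
    sys.exit(0 if validate_agent_change() else 1)
```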

Why it matters: This article provides a blueprint for implementing "shift left" security and IaC at enterprise scale, crucial for preventing misconfigurations, enhancing consistency, and improving operational efficiency in large, complex environments.

  • Cloudflare adopted "shift left" principles and Infrastructure as Code (IaC) to manage its critical platform securely and consistently at enterprise scale.
  • All production account configurations are managed via IaC using Terraform, integrated with a custom CI/CD pipeline (Atlantis, GitLab, tfstate-butler).
  • A centralized monorepo holds all configurations, with teams owning their specific sections, promoting accountability and consistency.
  • Security baselines are enforced through Policy as Code (Open Policy Agent with Rego), shifting validation to the earliest stages of development.
  • Policies are automatically checked on every merge request, preventing misconfigurations before deployment and minimizing human error.
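Cloudflare's gate is built on OPA and Rego, but the shape of the check can be sketched in Python against the JSON that `terraform show -json plan.out` produces (the specific rule below, requiring new DNS records to be proxied, is a hypothetical baseline used only for illustration):

```python
import json
import sys


def planned_resource_changes(plan_path: str) -> list[dict]:
    """Load resource changes from `terraform show -json` output."""
    with open(plan_path) as f:
        plan = json.load(f)
    return plan.get("resource_changes", [])


def violations(changes: list[dict]) -> list[str]:
    """Hypothetical baseline: every newly created DNS record must be proxied."""
    problems = []
    for change in changes:
        if change.get("type") != "cloudflare_record":
            continue
        if "create" not in change.get("change", {}).get("actions", []):
            continue
        after = change.get("change", {}).get("after") or {}
        if not after.get("proxied", False):
            problems.append(f"{change['address']}: new DNS record is not proxied")
    return problems


if __name__ == "__main__":
    found = violations(planned_resource_changes(sys.argv[1]))
    for problem in found:
        print(f"policy violation: {problem}")
    sys.exit(1 if found else 0)
```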

Why it matters: Achieving sub-second latency in voice AI requires rethinking performance metrics and optimizing every microservice. This article shows how semantic end-pointing and synthetic testing are critical for building responsive, human-like voice agents at scale.

  • Developed the Flash Reasoning Engine to achieve sub-second Time to First Audio (TTFA) for natural, human-fast voice interactions.
  • Optimized the real-time voice pipeline by shaving hundreds of milliseconds from microservices, synchronous calls, and serialization paths.
  • Implemented semantic end-pointing algorithms that use confidence thresholds to distinguish between meaningful pauses and true utterance completion.
  • Created AI-driven synthetic customer testing frameworks to generate repeatable data sets and eliminate noise in performance metrics.
  • Resolved measurement inaccuracies where initial tests incorrectly reported 70-second latencies by focusing on TTFA instead of total output duration.
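A minimal sketch of semantic end-pointing, assuming a model score for how complete the transcript is (the thresholds and names below are illustrative, not the article's measured values), combines that confidence with the length of the trailing pause:

```python
from dataclasses import dataclass


@dataclass
class EndpointDecision:
    finished: bool
    reason: str


def should_end_utterance(silence_ms: float,
                         completeness_confidence: float,
                         min_silence_ms: float = 200.0,
                         max_silence_ms: float = 1200.0,
                         confidence_threshold: float = 0.85) -> EndpointDecision:
    """Decide whether a pause marks the true end of an utterance.

    completeness_confidence is a 0..1 score that the transcript so far forms a
    complete request; all thresholds here are placeholders, not tuned values.
    """
    if silence_ms < min_silence_ms:
        return EndpointDecision(False, "pause too short to evaluate")
    if completeness_confidence >= confidence_threshold:
        # Cutting a confident, semantically complete utterance early is where
        # most of the perceived latency win comes from.
        return EndpointDecision(True, "semantically complete")
    if silence_ms >= max_silence_ms:
        # Hard timeout so hesitant speakers are not left waiting indefinitely.
        return EndpointDecision(True, "silence timeout")
    return EndpointDecision(False, "likely a mid-thought pause")
```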