Posts tagged with sre

Salesforce Engineering · Dec 29, 2025

How Agentforce Empowered Incident Response Automation to Cut Common Resolution Time by 70 – 80%

Why it matters: Automating incident response at hyperscale reduces human error and cognitive load during high-pressure events. By using AI agents to correlate billions of signals, teams can cut resolution times by up to 80%, shifting from reactive manual triage to proactive, explainable mitigation.

Salesforce developed the Incident Command Deputy (ICD) platform, a multi-agent system powered by Agentforce to automate incident response.
The system utilizes AI-based anomaly detection across metrics, logs, and traces to replace static thresholds and manual monitoring at hyperscale.
ICD unifies fragmented data from observability, CI/CD, and change management systems into a single reasoning surface for AI agents.
Agentforce-powered agents automate evidence collection and hypothesis generation, significantly reducing cognitive load for engineers during 3:00 AM incidents.
The platform has successfully reduced resolution time for common Severity 2 incidents by 70-80%, with many detected and resolved within ten minutes.

#sre #mlp #data


GitHub Engineering · Dec 22, 2025

This year’s most influential open source projects

Why it matters: These projects represent the backbone of modern developer productivity. By automating releases, simplifying backend infrastructure, and building independent engines, they empower engineers to bypass boilerplate and focus on high-impact innovation within the open source ecosystem.

Appwrite provides a comprehensive backend-as-a-service (BaaS) platform with APIs for databases, authentication, and storage to reduce development boilerplate.
GoReleaser automates the Go project release lifecycle, handling packaging and distribution for major tools including the GitHub CLI.
Homebrew remains the essential package management standard for macOS and Linux, facilitating environment bootstrapping and DevOps automation.
Ladybird is an independent browser being built from scratch in C++, aiming for high performance and privacy without relying on existing engines like Chromium.
The featured projects highlight a growing trend toward developer-centric tools that prioritize automation and independent engineering craft.

#culture #sre


Salesforce Engineering · Dec 22, 2025

Shattering AWS’s 250K-IP Ceiling: How Data 360 Reached 1 Million IPs with Zero-Downtime Migration

Why it matters: Scaling to 100,000+ tenants requires overcoming cloud provider networking limits. This migration demonstrates how to bypass AWS IP ceilings using prefix delegation and custom observability without downtime, ensuring infrastructure doesn't bottleneck hyperscale data growth.

Overcame the AWS Network Address Usage (NAU) hard limit of 250,000 IPs per VPC to support 1 million IPs for Data 360.
Implemented AWS prefix delegation, which assigns IP addresses in contiguous 16-address blocks to significantly increase network efficiency.
Navigated Hyperforce architectural constraints, including immutable subnet structures and strict security group rules, without altering VPC boundaries.
Developed custom observability tools to monitor IP fragmentation and contiguous block availability, filling gaps in native AWS and Hyperforce metrics.
Utilized AI-driven validation and phased rollouts to ensure zero-downtime migration for massive Spark-driven data processing workloads.
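
The 16-address blocks mentioned above correspond to /28 IPv4 prefixes. The following is a minimal Python sketch, not Salesforce's tooling, of the kind of check a fragmentation monitor might run: counting how many /28 blocks in a subnet are still completely free. The subnet, the in-use addresses, and the function name are all hypothetical, and the sketch ignores the addresses AWS reserves in every subnet.

```python
import ipaddress

def free_prefix_blocks(subnet_cidr, used_ips, prefix_len=28):
    """Return the blocks of size /prefix_len in the subnet that hold no used address."""
    subnet = ipaddress.ip_network(subnet_cidr)
    used = {ipaddress.ip_address(ip) for ip in used_ips}
    free = []
    for block in subnet.subnets(new_prefix=prefix_len):
        # A single stray IP makes the whole block unusable as a delegated prefix.
        if not any(ip in block for ip in used):
            free.append(block)
    return free

# Hypothetical data: a /24 subnet with three scattered single-IP allocations.
blocks = free_prefix_blocks("10.0.0.0/24", ["10.0.0.5", "10.0.0.20", "10.0.0.200"])
print(len(blocks), "of 16 /28 blocks are fully free")
```

Three scattered addresses consume only 3 of 256 IPs but make 3 of the 16 blocks unusable for delegation, which is why contiguous block availability is worth tracking separately from raw IP utilization.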

#sre #dist #data


Cloudflare Blog · Dec 22, 2025

How Workers powers our internal maintenance scheduling pipeline

Why it matters: Manual infrastructure management fails at scale. This article shows how Cloudflare uses serverless Workers and graph-based data modeling to automate global maintenance scheduling, preventing downtime by programmatically enforcing safety constraints across distributed data centers.

Cloudflare transitioned from manual maintenance coordination to an automated scheduler built on Cloudflare Workers to manage 330+ global data centers.
The system enforces safety constraints to prevent simultaneous downtime of redundant edge routers and customer-specific egress IP pools.
To solve 'out of memory' errors on the Workers platform, the team implemented a graph-based data interface inspired by Facebook’s TAO.
The scheduler uses a graph model of objects and associations to load only the regional data necessary for specific maintenance requests.
The tool programmatically identifies overlapping maintenance windows and alerts operators to potential conflicts to ensure high availability.
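
As a rough illustration of the conflict check described above, here is a small Python sketch that flags overlapping maintenance windows on routers in the same redundancy group. The router names, the grouping table, and the window format are assumptions made for this example. Cloudflare's scheduler runs on Workers and uses a graph data model, which this sketch does not attempt to reproduce.

```python
from datetime import datetime
from itertools import combinations

# Hypothetical mapping: routers in the same group must not be down at the same time.
REDUNDANCY_GROUPS = {"edge-a1": "pop-a", "edge-a2": "pop-a", "edge-b1": "pop-b"}

def overlaps(w1, w2):
    """Two half-open windows [start, end) overlap if each starts before the other ends."""
    return w1["start"] < w2["end"] and w2["start"] < w1["end"]

def find_conflicts(windows):
    """Return pairs of distinct routers in one redundancy group with overlapping windows."""
    conflicts = []
    for w1, w2 in combinations(windows, 2):
        same_group = REDUNDANCY_GROUPS[w1["router"]] == REDUNDANCY_GROUPS[w2["router"]]
        if w1["router"] != w2["router"] and same_group and overlaps(w1, w2):
            conflicts.append((w1["router"], w2["router"]))
    return conflicts

windows = [
    {"router": "edge-a1", "start": datetime(2025, 12, 22, 1), "end": datetime(2025, 12, 22, 3)},
    {"router": "edge-a2", "start": datetime(2025, 12, 22, 2), "end": datetime(2025, 12, 22, 4)},
    {"router": "edge-b1", "start": datetime(2025, 12, 22, 2), "end": datetime(2025, 12, 22, 4)},
]
conflicts = find_conflicts(windows)
print(conflicts)
```

Only the two pop-a routers are reported. The pop-b window overlaps in time but belongs to a different redundancy group, so it does not violate the constraint.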

#sre #dist #data


Cloudflare Blog · Dec 19, 2025

Code Orange: Fail Small — our resilience plan following recent incidents

Why it matters: This initiative highlights the danger of instant global configuration propagation. By treating config as code and implementing gated rollouts, Cloudflare demonstrates how to mitigate blast radius in hyperscale systems, a critical lesson for SRE and platform engineers.

Cloudflare launched 'Code Orange: Fail Small' to prioritize network resilience after two major outages caused by rapid configuration deployments.
The plan mandates controlled, gated rollouts for all configuration changes, mirroring the existing Health Mediated Deployment (HMD) process used for software binaries.
Teams must now define success metrics and automated rollback triggers for configuration updates to prevent global propagation of errors.
Engineers are reviewing failure modes across traffic-handling systems to ensure predictable behavior during unexpected error states.
The initiative aims to eliminate circular dependencies and improve 'break glass' procedures for faster emergency access during incidents.
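
The gated-rollout idea (apply a change to one stage, check a success metric, and roll back automatically on a bad signal) can be sketched in a few lines of Python. This is an illustrative outline only: the stage names, the error-rate threshold, and the callback signatures are hypothetical and are not taken from Cloudflare's Health Mediated Deployment system.

```python
def gated_rollout(stages, apply_config, error_rate, rollback, max_error_rate=0.01):
    """Apply a change stage by stage; undo every applied stage on a bad health signal."""
    applied = []
    for stage in stages:
        apply_config(stage)
        applied.append(stage)
        if error_rate(stage) > max_error_rate:
            # Automated rollback trigger: revert in reverse order and stop propagating.
            for done in reversed(applied):
                rollback(done)
            return False, applied
    return True, applied

# Hypothetical stages and health readings, purely for illustration.
ERROR_RATES = {"canary": 0.001, "region-1": 0.05, "global": 0.0}
rolled_back = []

ok, applied = gated_rollout(
    stages=["canary", "region-1", "global"],
    apply_config=lambda stage: None,  # stand-in for pushing the config
    error_rate=lambda stage: ERROR_RATES[stage],
    rollback=rolled_back.append,
)
print(ok, applied, rolled_back)
```

In this run the change never reaches the "global" stage: the unhealthy reading in "region-1" stops the rollout and reverts the two stages that had already received it.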

#sre #dist #security


Engineering at Meta · Dec 19, 2025

DrP: Meta’s Root Cause Analysis Platform at Scale

Why it matters: DrP automates manual incident triaging at scale. By codifying expert knowledge into executable playbooks, it reduces MTTR and lets engineers focus on resolution rather than data gathering, improving system reliability in complex microservice environments.

DrP is Meta's programmatic root cause analysis (RCA) platform that automates incident investigation through an expressive SDK and scalable backend.
The platform uses 'analyzers' (codified investigation playbooks) to perform anomaly detection, dimension analysis, and time series correlation.
It integrates directly with alerting and incident management systems to trigger automated investigations immediately upon alert activation.
The system supports analyzer chaining, allowing for complex investigations across interconnected microservices and dependencies.
DrP includes a post-processing layer that can automate mitigation steps, such as creating pull requests or tasks based on findings.
The platform handles 50,000 daily analyses across 300+ teams, reducing Mean Time to Resolve (MTTR) by 20% to 80%.

#sre #dist


PlanetScale Tech Blog · Dec 17, 2025

Postgres 18 is now available

Why it matters: Postgres 18 introduces critical performance features like Skip Scans and async I/O, while native UUIDv7 support simplifies modern ID generation. PlanetScale's immediate support allows developers to leverage these optimizations alongside their managed infrastructure.

PlanetScale now defaults to Postgres 18.1 for all new database creations.
Postgres 18 introduces a new asynchronous I/O subsystem intended to improve query performance.
Native support for UUIDv7 is now available via the built-in uuidv7() function.
The Skip Scan optimization allows more queries to leverage multi-column indexes effectively.
Upgrades from version 17 require a manual online migration to a new database instance.
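
To show why UUIDv7 values suit ID generation, the sketch below builds an identifier with the RFC 9562 version 7 layout in Python: a 48-bit Unix millisecond timestamp in the most significant bits, followed by random bits. The function name `uuid7_sketch` is made up for this example. It is not Postgres's `uuidv7()` implementation, which may fill the non-timestamp bits differently.

```python
import os
import time
import uuid

def uuid7_sketch(unix_ms=None):
    """Build a UUID with the version 7 layout: 48-bit ms timestamp, then random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")      # 80 random bits
    value = ((unix_ms & ((1 << 48) - 1)) << 80) | rand
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # variant bits = 10
    return uuid.UUID(int=value)

earlier = uuid7_sketch(unix_ms=1_000)
later = uuid7_sketch(unix_ms=2_000)
print(earlier.version, earlier < later)
```

Because the timestamp occupies the leading bits, IDs generated in different milliseconds sort by creation time, which tends to keep B-tree index inserts clustered near one end of the index instead of scattered across it.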

#data #sre


Salesforce Engineering · Dec 16, 2025

How AI-Enabled Tooling Boosted Code Output 30% — While Keeping Quality and Deployment Safety Intact

Why it matters: AI tools can boost code output by 30%, but this creates downstream bottlenecks in testing and review. This article shows how to scale quality gates and deployment safety alongside velocity, ensuring that increased speed doesn't compromise system reliability or engineer well-being.

Unified fragmented tooling across Java, .NET, and Python using a portfolio approach including Cursor, Windsurf, and Claude Code.
Achieved a 30% increase in code production with 85% weekly adoption of AI-assisted development tools among eligible engineers.
Mitigated senior engineer bottlenecks by implementing AI-assisted code reviews to handle routine checks and initial analysis.
Scaled quality gates by automating test coverage and validation workflows to keep pace with accelerated development cycles.
Integrated AIOps and telemetry analysis to maintain high availability and improve incident response across 25 Hyperforce regions.

#sre #culture


Netflix Tech Blog · Dec 15, 2025

How Temporal Powers Reliable Cloud Operations at Netflix

Why it matters: This article demonstrates how a Durable Execution platform like Temporal can drastically improve the reliability of critical cloud operations and continuous delivery pipelines, reducing complex failure handling and state management for engineers.

Netflix significantly improved the reliability of its Spinnaker deployments by adopting Temporal, reducing transient failures from 4% to 0.0001%.
Temporal is a Durable Execution platform that allows engineers to write resilient code, abstracting away complexities of distributed system failures.
The previous Spinnaker architecture carried complex, undifferentiated machinery inside its Clouddriver service: internal orchestration, custom retry logic, and a homegrown Saga framework.
Prior to Temporal, Clouddriver's instance-local task state led to lost operation progress if the service crashed, impacting deployment reliability.
Temporal helped streamline cloud operations by offloading complex state management and failure handling, allowing services like Clouddriver to focus on core infrastructure changes.

#sre #dist


Netflix Tech Blog · Dec 15, 2025

Netflix Live Origin

Why it matters: This article details how Netflix built a robust, high-performance live streaming origin and optimized its CDN for live content. It offers insights into handling real-time data defects, ensuring resilience, and optimizing content delivery at scale.

Netflix Live Origin is a multi-tenant microservice that bridges the cloud live streaming pipelines and the Open Connect CDN and manages content distribution between them.
It ensures resilience through redundant regional pipelines and server-side failover, utilizing epoch locking for intelligent segment selection.
The Origin detects and mitigates live stream defects (e.g., short or missing segments) by selecting valid candidates from multiple pipelines.
Open Connect's nginx-based CDN was optimized for live streaming, extending proxy-caching and adding millisecond-grain caching.
Live Origin "holds open" requests for yet-to-be-published segments, reducing network chatter and improving efficiency.
HTTP headers are leveraged for scalable streaming metadata, providing real-time event notifications to client devices via OCAs.

#dist #sre

