Posts tagged with sre

Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.

  • Pinterest adopted an Internal Developer Platform (IDP) strategy to counter engineering velocity degradation caused by increasing complexity and tool fragmentation.
  • They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
  • PinConsole aims to provide consistent abstractions, self-service capabilities, and reduce cognitive overhead for engineers by unifying disparate tools and workflows.
  • The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure and seamless authentication within the platform.
  • The IDP addresses critical challenges such as inconsistent workflows, tool discovery issues, and fragmented documentation, significantly enhancing overall developer experience.
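
The post stays at the architecture level and doesn't publish integration code; purely as a hedged illustration of the kind of OAuth flow an IDP backend performs against an internal identity provider, here is a minimal sketch of a standard authorization-code token exchange in Python. The endpoint and credentials are hypothetical placeholders, not Pinterest's actual systems.

```python
import requests

# Hypothetical internal OAuth endpoint; Pinterest's real integration is not public.
TOKEN_URL = "https://oauth.internal.example.com/oauth2/token"

def exchange_code_for_token(code: str, client_id: str, client_secret: str,
                            redirect_uri: str) -> dict:
    """Standard OAuth2 authorization-code exchange, as an IDP auth backend
    might perform after the identity provider redirects back to it."""
    resp = requests.post(
        TOKEN_URL,
        data={
            "grant_type": "authorization_code",
            "code": code,
            "redirect_uri": redirect_uri,
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # contains access_token, expires_in, etc.
```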

Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.

  • ML Observability is essential for monitoring, understanding, and gaining insights into production ML models, ensuring reliability and continuous improvement.
  • At Netflix, it's crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
  • An effective framework includes automatic issue detection, root cause analysis, and builds stakeholder trust by explaining system behavior.
  • Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
  • Logging requires a comprehensive data schema to capture model inputs, outputs, and metadata for effective analysis.
  • Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
  • Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances.
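
The summary above is conceptual; as a minimal sketch of the kind of aggregate and instance-level explanation the explaining module produces, here is a generic SHAP example on a toy tree model (the model and features are stand-ins, not Netflix's payment models).

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy stand-in for a payment model; the real features and targets are not public.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50, verbosity=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Aggregate view: which features drive predictions across all scored requests.
global_importance = np.abs(shap_values).mean(axis=0)
print("mean |SHAP| per feature:", dict(enumerate(global_importance.round(3))))

# Instance view: why the model scored this one request the way it did.
print("contributions for request 0:", dict(enumerate(shap_values[0].round(3))))
```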

Why it matters: This article demonstrates how a large-scale monorepo build system migration can dramatically improve developer productivity and build reliability. It provides valuable insights into leveraging Bazel's features like remote execution and hermeticity for complex JVM environments.

  • Airbnb migrated its JVM monorepo (Java, Kotlin, Scala) to Bazel over a 4.5-year effort, achieving 3-5x faster local builds/tests and 2-3x faster deploys.
  • The move to Bazel was driven by needs for superior build speed via remote execution, enhanced reliability through hermeticity, and a uniform build infrastructure across all language repos.
  • Bazel's remote build execution (RBE) and "Build without the Bytes" boosted performance by enabling parallel actions and reducing data transfer.
  • Hermetic builds, enforced by sandboxing, ensured consistent, repeatable results by isolating build actions from external environment dependencies.
  • The migration strategy included a proof-of-concept on a critical service with co-existing Gradle/Bazel builds, followed by a breadth-first rollout.
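
The post describes these properties at a high level; the snippet below is a conceptual sketch (not Bazel's actual implementation) of why hermeticity enables a shared remote cache: if an action's output depends only on its declared inputs and command line, a digest of those is a safe cache key that any machine can compute identically.

```python
import hashlib
import json

def action_cache_key(command: list[str], input_digests: dict[str, str]) -> str:
    """Conceptual illustration of content-addressed caching: a hermetic action's
    result is fully determined by its command line and input file digests, so
    their combined hash can safely key a shared remote cache."""
    payload = json.dumps({"cmd": command, "inputs": sorted(input_digests.items())})
    return hashlib.sha256(payload.encode()).hexdigest()

# Two machines that agree on inputs and command compute the same key and can
# reuse each other's outputs; any undeclared dependency would silently break this.
key = action_cache_key(
    ["javac", "-d", "out", "Foo.java"],
    {"Foo.java": hashlib.sha256(b"class Foo {}").hexdigest()},
)
print(key)
```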

Why it matters: This article details how to perform large-scale, zero-downtime Istio upgrades across diverse environments. It offers a blueprint for managing complex service mesh updates, ensuring high availability and minimizing operational overhead for thousands of workloads.

  • Airbnb developed a robust process for seamless Istio upgrades across tens of thousands of pods and VMs on dozens of Kubernetes clusters.
  • The strategy employs Istio's canary upgrade model, running multiple Istiod revisions concurrently within a single logical service mesh.
  • Upgrades are atomic, rolling out new istio-proxy versions and connecting them to the corresponding new Istiod revision simultaneously.
  • A rollouts.yml file dictates the gradual rollout, defining namespace patterns and percentage distributions for Istio versions using consistent hashing.
  • For Kubernetes, MutatingAdmissionWebhooks inject the correct istio-proxy and configure its connection to the specific Istiod revision based on an istio.io/rev label (the revision-selection logic is sketched below).
  • The process prioritizes zero downtime, gradual rollouts, easy rollbacks, and independent upgrades for thousands of diverse workloads.
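
The actual rollouts.yml schema isn't shown in the post; as a hedged sketch of the described selection mechanism, the snippet below hashes a namespace into a stable 0-99 bucket and walks the configured percentage ranges to pick a revision, so a namespace keeps its assignment as percentages shift. Field names and version strings here are illustrative assumptions.

```python
import hashlib

# Illustrative stand-in for a rollouts.yml entry; field names are assumptions.
ROLLOUT = {"namespace_pattern": "payments-*",
           "revisions": [("1-20-2", 90), ("1-21-0", 10)]}

def bucket(namespace: str) -> int:
    """Consistently hash a namespace into a 0-99 bucket so the same namespace
    always maps to the same point in the rollout, run after run."""
    digest = hashlib.sha256(namespace.encode()).hexdigest()
    return int(digest, 16) % 100

def revision_for(namespace: str, revisions) -> str:
    """Walk the cumulative percentage ranges to pick the Istiod revision whose
    istio.io/rev label this namespace's sidecars should be injected with."""
    b, cumulative = bucket(namespace), 0
    for rev, pct in revisions:
        cumulative += pct
        if b < cumulative:
            return rev
    return revisions[-1][0]

print(revision_for("payments-api", ROLLOUT["revisions"]))
```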

Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.

  • Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
  • They addressed challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
  • A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption during various event types.
  • Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
  • AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads.
  • Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
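
A quick back-of-the-envelope on that last point (the specific numbers are an assumption; the article doesn't quantify its headroom): with three AZ-aligned clusters surviving the loss of any one, each cluster's steady-state utilization has to stay under about two thirds.

```python
def per_cluster_utilization_cap(num_clusters: int, tolerated_failures: int) -> float:
    """Maximum steady-state utilization per cluster so the survivors can absorb
    the full load if `tolerated_failures` clusters (or their AZs) go down."""
    survivors = num_clusters - tolerated_failures
    return survivors / num_clusters

# Three AZ-aligned clusters, tolerating one full AZ loss: run each at <= ~67%,
# i.e. overprovision total capacity by ~50% over what steady state needs.
print(per_cluster_utilization_cap(3, 1))  # 0.666...
```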

Why it matters: This article highlights the extreme difficulty of debugging elusive, high-impact performance issues in complex distributed systems during migration. It showcases the systematic troubleshooting required to uncover subtle interactions between applications and their underlying infrastructure.

  • Pinterest encountered a rare, severe latency issue (100x slower) when migrating its memory-intensive Manas search infrastructure to Kubernetes.
  • The in-house Manas search system, critical for recommendations, uses a two-tier root-leaf node architecture, with leaf nodes handling query processing, retrieval, and ranking.
  • Debugging revealed sharp P100 latency spikes every few minutes on individual leaf nodes during the index retrieval and ranking phases, affecting roughly one in a million requests.
  • Extensive initial troubleshooting, including dedicating nodes, removing cgroups, and OS-level profiling, failed to isolate the root cause of the performance degradation.
  • The problem persisted even when Manas ran outside its container, directly on the host, suggesting a subtle interaction specific to how the Kubernetes nodes' AMI was provisioned.
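
The write-up is mostly a debugging narrative, but the symptom it hinges on, brief P100 spikes on single leaf nodes, is the kind of signal fleet-wide averages hide; below is a toy sketch (with made-up log fields) of per-node, per-window max-latency bucketing that surfaces it.

```python
from collections import defaultdict

def p100_by_node_and_window(samples, window_s=60):
    """Group (node, timestamp_s, latency_ms) samples into per-node time windows
    and report the max (P100) latency in each, so a spike on one leaf node every
    few minutes is not averaged away across the fleet."""
    buckets = defaultdict(list)
    for node, ts, latency_ms in samples:
        buckets[(node, ts // window_s)].append(latency_ms)
    return {k: max(v) for k, v in buckets.items()}

# Hypothetical samples: leaf-7 is healthy except for one pathological request.
samples = [("leaf-7", 10, 8.0), ("leaf-7", 15, 9.5), ("leaf-7", 20, 900.0),
           ("leaf-3", 12, 8.2)]
print(p100_by_node_and_window(samples))
```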

Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.

  • Pinterest is migrating from its aging Hadoop 2.x (Monarch) data platform to a new Kubernetes (K8s) based system, Moka, for massive-scale data processing.
  • The shift to K8s is driven by needs for enhanced container isolation, security, improved performance with Spark, lower operational costs, and better developer velocity.
  • Kubernetes offers built-in container support, streamlined deployment via Terraform/Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
  • Performance optimizations include leveraging newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
  • Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionalities like YARN UI, job submission, resource management, log aggregation, and security.
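
The post covers platform design rather than job-level mechanics; as one hedged illustration of what running Spark natively on Kubernetes looks like (standard Spark-on-K8s configuration properties, not Pinterest's Moka-specific submission layer), a job can be pointed at a cluster's API server as below. The image, namespace, and API server address are placeholders.

```python
import subprocess

# Standard Spark-on-Kubernetes submission; the container image, namespace, and
# API server address are placeholders, and Moka's real submission layer wraps more.
cmd = [
    "spark-submit",
    "--master", "k8s://https://kubernetes.default.svc:443",
    "--deploy-mode", "cluster",
    "--name", "example-etl",
    "--conf", "spark.kubernetes.namespace=batch",
    "--conf", "spark.kubernetes.container.image=registry.example.com/spark:3.5.0",
    "--conf", "spark.executor.instances=8",
    "local:///opt/spark/examples/src/main/python/pi.py", "1000",
]
subprocess.run(cmd, check=True)
```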

Why it matters: Dropbox's 7th-gen hardware shows how custom infrastructure at exabyte scale drives massive efficiency. By co-designing hardware and software, they achieve superior performance-per-watt and density, essential for modern AI-driven workloads and sustainable growth.

  • Dropbox launched its seventh-generation hardware platform featuring specialized tiers: Crush (compute), Dexter (database), Sonic (storage), and Gumby/Godzilla (GPUs).
  • The architecture more than doubles available rack power, from 17kW to 35kW, and transitions to 400G networking to support high-bandwidth AI and data workloads.
  • Storage density is optimized using 28TB SMR drives and a custom chassis designed to mitigate vibration and heat, supporting exabyte-scale data.
  • The compute tier utilizes 128-core AMD EPYC processors and PCIe Gen5, providing significant performance-per-watt improvements over previous generations.
  • New GPU tiers are specifically integrated to power AI products like Dropbox Dash, focusing on high-performance training and inference capabilities.

Why it matters: This framework helps engineers proactively identify bottlenecks, evaluate capacity, and ensure system reliability through robust, decentralized, and automated load testing integrated with CI/CD.

  • Airbnb's Impulse is a decentralized load-testing-as-a-service framework for robust system performance evaluation.
  • It features a context-aware load generator, an out-of-process dependency mocker, a traffic collector, and a testing API generator.
  • The load generator uses Java/Kotlin for flexible test logic, containerized for isolation, scalability, and cost-efficiency.
  • The dependency mocker enables selective stubbing of HTTP, Thrift, and GraphQL dependencies with configurable latency, isolating the system under test (SUT).
  • Impulse integrates with CI/CD for automated testing across warm-up, steady-state, and peak phases, using synthetic or collected traffic.
  • Its architecture empowers self-service load tests, minimizing manual effort and enhancing proactive issue detection.
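
Impulse itself is a Java/Kotlin service and isn't reproduced in the post; purely to illustrate the phased traffic shape it describes (warm-up, steady state, peak), here is a toy asyncio load generator against a placeholder endpoint, with made-up phase durations and rates.

```python
import asyncio
import time
import aiohttp

# (phase name, duration in seconds, requests per second) -- toy numbers, and a
# placeholder target; Impulse's real generator is a Java/Kotlin service.
PHASES = [("warm-up", 10, 5), ("steady-state", 30, 20), ("peak", 10, 50)]
TARGET = "http://localhost:8080/health"

async def fire(session: aiohttp.ClientSession, latencies: list) -> None:
    start = time.monotonic()
    try:
        async with session.get(TARGET) as resp:
            await resp.read()
    except aiohttp.ClientError:
        pass  # count failed attempts too in this toy sketch
    finally:
        latencies.append(time.monotonic() - start)

async def run() -> None:
    async with aiohttp.ClientSession() as session:
        for name, duration, rps in PHASES:
            latencies: list[float] = []
            tasks = []
            end = time.monotonic() + duration
            while time.monotonic() < end:
                tasks.append(asyncio.create_task(fire(session, latencies)))
                await asyncio.sleep(1 / rps)  # open-loop pacing at the target rate
            await asyncio.gather(*tasks)
            print(f"{name}: sent {len(latencies)} requests, "
                  f"max latency {max(latencies):.3f}s")

asyncio.run(run())
```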

Why it matters: This article demonstrates how to automate the challenging process of migrating and scaling stateful Hadoop clusters, significantly reducing manual effort and operational risk. It offers a blueprint for managing large-scale distributed data infrastructure efficiently.

  • Pinterest developed Hadoop Control Center (HCC) to automate complex migration and scaling operations for its large, stateful Hadoop clusters on AWS.
  • Traditional manual scale-in procedures for Hadoop clusters were tedious, error-prone, and involved many steps like updating exclude files, monitoring data drainage, and managing ASGs.
  • HCC enables in-place cluster migrations by introducing new Auto Scaling Groups (ASGs) with updated AMIs/instance types, avoiding costly and risky full cluster replacements.
  • The tool streamlines scaling-in by managing node decommissioning and ensuring HDFS data replication to new nodes before termination, preventing data loss or workload impact.
  • HCC provides a centralized platform for various Hadoop-related tasks, including ASG resizing, node status monitoring, YARN application reporting, and AWS event tracking.
  • Its architecture includes a manager node for API calls and caching, and worker nodes per VPC to manage clusters, facilitating automated and efficient cluster administration.
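
HCC is internal to Pinterest; as a rough, hedged sketch of the manual sequence it automates for one node (exclude from HDFS, wait for block replication to drain, then shrink the ASG), here is the shape of it in Python with boto3 and the stock Hadoop CLI. Paths, hostnames, and the report parsing are simplified placeholders, and the real tool handles far more failure modes.

```python
import subprocess
import time
import boto3

EXCLUDE_FILE = "/etc/hadoop/conf/dfs.exclude"   # placeholder path on the NameNode host
ASG_CLIENT = boto3.client("autoscaling")

def decommission_and_scale_in(hostname: str, instance_id: str) -> None:
    """Rough shape of one scale-in step HCC automates: exclude the DataNode,
    wait for HDFS to finish replicating its blocks elsewhere, then remove the
    instance from its Auto Scaling Group."""
    # 1. Add the node to the HDFS exclude file and tell the NameNode to re-read it.
    #    (Assumes this runs where the Hadoop CLI and NameNode config are available.)
    with open(EXCLUDE_FILE, "a") as f:
        f.write(hostname + "\n")
    subprocess.run(["hdfs", "dfsadmin", "-refreshNodes"], check=True)

    # 2. Poll until the NameNode reports the node as fully decommissioned,
    #    i.e. its data has drained to other nodes (naive report parsing).
    while True:
        report = subprocess.run(["hdfs", "dfsadmin", "-report"],
                                capture_output=True, text=True, check=True).stdout
        node_section = report.split(hostname, 1)[-1].split("Name:", 1)[0]
        if "Decommissioned" in node_section:
            break
        time.sleep(60)

    # 3. Terminate the instance and shrink the ASG's desired capacity with it.
    ASG_CLIENT.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True)
```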