
Posts tagged with sre

Why it matters: This article introduces Sapling's directory branching solution for monorepos, enabling scalable version management and merging without compromising performance or developer experience. It's essential reading for engineers who need to keep large codebases agile while managing multiple code versions.

  • Meta's Sapling-based monorepo uses two distinct branching workflows to balance scalability and developer experience.
  • Non-mergeable full-repo branching, supported by `sl bookmark`, is ideal for temporary product releases that do not require merging back to the main branch.
  • Mergeable directory branching is a novel solution for product development, allowing specific directories to be treated like traditional repository branches.
  • This new workflow enables copying, cherry-picking, and merging changes between directories using `sl subtree` commands.
  • Crucially, directory merges appear as linear commits in the monorepo's commit graph, preserving performance for operations like `sl log` and `sl blame`.
  • This approach resolves the challenges of managing multiple code versions within a large monorepo without sacrificing performance or essential developer tools.

Why it matters: This article offers engineers actionable design principles to reduce IT hardware's environmental impact, fostering sustainability and cost savings through circularity and emissions reduction in data center infrastructure.

  • Meta introduces "Design for Sustainability" principles for IT hardware to cut emissions and costs via reuse, extended life, and optimized design.
  • Key strategies include modularity, retrofitting, dematerialization, greener materials, and extending hardware lifecycles in data centers.
  • The focus is on reducing Scope 3 emissions from manufacturing, delivery, and end-of-life of IT hardware components.
  • Methods involve optimizing material selection, using lower carbon alternatives, extending rack life, and harvesting components for reuse.
  • These principles apply across various rack types (AI, Compute, Storage, Network) and target components like compute, storage, and cooling.
  • Collaboration with suppliers to electrify processes and transition to renewable energy is crucial for achieving net-zero goals.
  • The initiative also significantly reduces electronic waste (e-waste) generated from data centers.

Why it matters: This article details how to build resilient distributed systems by moving beyond static rate limits to adaptive traffic management. Engineers can learn to maximize goodput and ensure reliability in high-traffic, multi-tenant environments.

  • Airbnb evolved Mussel, their multi-tenant key-value store, from static QPS rate limiting to adaptive traffic management for improved reliability and goodput during traffic spikes.
  • The initial QoS system used simple Redis-backed QPS limits, effective for basic isolation but unable to account for varying request costs or adapt to real-time traffic shifts.
  • Resource-aware rate control (RARC) was introduced, charging requests in "request units" (RU) based on fixed overhead, rows processed, payload bytes, and, crucially, latency, reflecting actual backend load.
  • RARC uses a linear model for RU calculation, allowing the system to differentiate between cheap and expensive operations even when their surface metrics look similar (see the sketch after this list).
  • Future layers include load shedding with criticality tiers for priority traffic and hot-key detection/DDoS mitigation to handle skewed access patterns and shield the backend.
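
The RU formula is described as a simple linear combination of those inputs. Below is a minimal sketch of what such a model could look like; the coefficient values, field names, and example numbers are illustrative assumptions, not Airbnb's actual tuning.

```python
# Minimal sketch of a linear request-unit (RU) model in the spirit of RARC.
# The article only states that RUs combine a fixed overhead, rows processed,
# payload bytes, and observed latency; the weights below are made up.
from dataclasses import dataclass


@dataclass
class RequestStats:
    rows: int            # rows read or written by the request
    payload_bytes: int   # request plus response payload size
    latency_ms: float    # observed backend latency


# Hypothetical per-term weights; in practice these would be fit against
# measured backend resource usage.
BASE_COST = 1.0
COST_PER_ROW = 0.01
COST_PER_KB = 0.05
COST_PER_MS = 0.02


def request_units(stats: RequestStats) -> float:
    """Charge a request in RUs so cheap and expensive calls are billed differently."""
    return (
        BASE_COST
        + COST_PER_ROW * stats.rows
        + COST_PER_KB * (stats.payload_bytes / 1024)
        + COST_PER_MS * stats.latency_ms
    )


# A small point lookup vs. a wide scan that would look similar under plain QPS limits:
print(request_units(RequestStats(rows=1, payload_bytes=512, latency_ms=2)))
print(request_units(RequestStats(rows=5000, payload_bytes=2_000_000, latency_ms=180)))
```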

Why it matters: This article details Slack's successful Deploy Safety Program, which drastically cut customer impact from deployments. It provides a practical framework for improving reliability, incident response, and development velocity in complex, distributed systems.

  • Slack's Deploy Safety Program reduced customer impact from change-triggered incidents by 90% over 1.5 years while maintaining development velocity.
  • The program tackled the 73% of customer-facing incidents that were caused by code deploys across diverse services and deployment systems.
  • North Star goals included automated detection/remediation within 10 minutes and preventing problematic deployments from reaching 10% of the fleet.
  • A custom metric, "Hours of customer impact from high/selected medium severity change-triggered incidents," measured program effectiveness.
  • Investment prioritized known pain points, rapid iteration, and scaling successful patterns like automated monitoring and rollbacks.
  • Key projects involved automating deployments, rollbacks, and blast radius control for critical systems like Webapp backend and frontend.
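
To make the blast-radius and ten-minute goals concrete, here is a generic sketch of a staged rollout gate with automated health checks and rollback. The stage sizes, bake time, and hook signatures are assumptions for illustration, not Slack's implementation.

```python
# Illustrative sketch of a staged rollout gate: each stage bakes under
# automated monitoring, and an unhealthy signal triggers rollback so a bad
# deploy cannot keep spreading across the fleet.
import time
from typing import Callable

STAGES = (0.01, 0.05, 0.10, 0.50, 1.00)  # fraction of the fleet per stage (assumed)
BAKE_TIME_S = 10 * 60                    # mirrors the 10-minute detection goal
POLL_INTERVAL_S = 30                     # how often monitoring is consulted


def staged_rollout(
    deploy_id: str,
    deploy_to_fraction: Callable[[str, float], None],  # pushes the build to a fleet fraction
    is_healthy: Callable[[str], bool],                  # automated monitoring signal
    rollback: Callable[[str], None],                    # automated remediation
) -> bool:
    for fraction in STAGES:
        deploy_to_fraction(deploy_id, fraction)
        deadline = time.monotonic() + BAKE_TIME_S
        while time.monotonic() < deadline:
            if not is_healthy(deploy_id):
                rollback(deploy_id)      # remediate instead of advancing the rollout
                return False
            time.sleep(POLL_INTERVAL_S)
    return True                          # reached 100% with no health regressions
```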

Why it matters: This article details how Netflix built a robust WAL system to solve common, critical data challenges like consistency, replication, and reliable retries at massive scale. It offers a blueprint for building resilient data platforms, enhancing developer efficiency and preventing outages.

  • Netflix developed a generic, distributed Write-Ahead Log (WAL) system to address critical data challenges at scale, including data loss, corruption, and replication.
  • The WAL provides strong durability guarantees and reliably delivers data changes to various downstream consumers.
  • Its simple WriteToLog API abstracts internal complexities, using namespaces to select the backing store (Kafka, SQS) and its configuration (a hypothetical sketch of the call shape follows this list).
  • Key use cases (personas) include enabling delayed message queues for reliable retries in real-time data pipelines.
  • It facilitates generic cross-region data replication for services like EVCache.
  • The WAL also supports complex operations like handling multi-partition mutations in Key-Value stores, ensuring eventual consistency via two-phase commit.
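
The post names a WriteToLog API keyed by namespace but does not spell out the request shape, so the following is a hypothetical sketch of how a namespace-driven client could route writes to Kafka or SQS. All class, field, and method names here are assumptions.

```python
# Hypothetical sketch of a namespace-driven WriteToLog call. The article only
# states that the API hides the backing store (Kafka, SQS) behind a namespace
# and its configuration; everything concrete below is invented for illustration.
from dataclasses import dataclass, field


@dataclass
class WriteToLogRequest:
    namespace: str                    # selects the target, e.g. a Kafka topic or SQS queue
    payload: bytes                    # opaque message body
    delay_seconds: int = 0            # useful for delayed-queue / retry use cases
    metadata: dict = field(default_factory=dict)


class WalClient:
    def __init__(self, namespace_configs: dict):
        # namespace -> {"backend": "kafka" | "sqs", ...}, resolved from configuration
        self.namespace_configs = namespace_configs

    def write_to_log(self, request: WriteToLogRequest) -> None:
        config = self.namespace_configs[request.namespace]
        if config["backend"] == "kafka":
            self._produce_to_kafka(config, request)
        elif config["backend"] == "sqs":
            self._send_to_sqs(config, request)
        else:
            raise ValueError(f"unknown backend {config['backend']!r}")

    def _produce_to_kafka(self, config: dict, request: WriteToLogRequest) -> None:
        ...  # e.g. hand the payload to a Kafka producer for config["topic"]

    def _send_to_sqs(self, config: dict, request: WriteToLogRequest) -> None:
        ...  # e.g. send to config["queue_url"] with the requested delay
```

Callers only pick a namespace and hand over a payload; swapping the underlying queue or adding a delay becomes a configuration change rather than a code change.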

Why it matters: This article details how a large-scale key-value store was rearchitected to meet modern demands for real-time data, scalability, and operational efficiency. It offers valuable insights into addressing common distributed system challenges and executing complex migrations.

  • Airbnb rearchitected its core key-value store, Mussel, from v1 to v2 to handle real-time demands, massive data, and improve operational efficiency.
  • Mussel v1 faced issues with operational complexity, static partitioning leading to hotspots, limited consistency, and opaque costs.
  • Mussel v2 leverages Kubernetes for automation, dynamic range sharding for scalability, flexible consistency, and enhanced cost visibility.
  • The new architecture includes a stateless Dispatcher, Kafka-backed writes for durability, and an event-driven model for ingestion.
  • Bulk data loading is supported via Airflow orchestration and distributed workers, maintaining familiar semantics.
  • Automated TTL in v2 uses a topology-aware expiration service for efficient, parallel data deletion, improving on v1's compaction cycle.
  • A blue/green migration strategy with custom bootstrapping and dual writes ensured a seamless transition with zero downtime and no data loss, as sketched below.
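
As a rough illustration of the dual-write step in such a blue/green migration, the wrapper below writes to both stores while reads stay on the old one until the new store is verified. The client interface is assumed for the sketch and is not Airbnb's actual API.

```python
# Generic dual-write wrapper: the old (v1) store remains the source of truth,
# the new (v2) store receives shadow writes, and reads flip to v2 only after
# the migration is verified.

class DualWriteStore:
    def __init__(self, v1_client, v2_client, read_from_v2: bool = False):
        self.v1 = v1_client
        self.v2 = v2_client
        self.read_from_v2 = read_from_v2   # flipped only after verification

    def put(self, key: str, value: bytes) -> None:
        self.v1.put(key, value)            # source of truth during the migration
        try:
            self.v2.put(key, value)        # best-effort shadow write to the new store
        except Exception:
            # A failed shadow write must not fail the user request; divergence
            # is reconciled later (e.g. by re-bootstrapping the affected range).
            pass

    def get(self, key: str) -> bytes:
        return self.v2.get(key) if self.read_from_v2 else self.v1.get(key)
```

Bootstrapping backfills historical data into the new cluster before the read path flips, so the dual writes only need to cover traffic that arrives during the migration window.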

Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.

  • Netflix transitioned from a centralized SRE-led incident management system to a decentralized, "paved road" approach to empower all engineers.
  • The previous system, relying on basic tools, failed to scale with Netflix's growth, leading to missed learning opportunities from numerous uncaptured incidents.
  • They adopted Incident.io after a build-vs-buy analysis, prioritizing intuitive UX, internal data integration, balanced customization, and an approachable design.
  • Key to successful adoption was the tool's intuitive design, which fostered a cultural shift, making incident management less intimidating and more accessible.
  • Organizational investment in standardized processes, educational resources, and internal data integrations significantly reduced cognitive load and accelerated adoption.
  • This transformation aimed to make incident declaration and management easy for any engineer, even for minor issues, to foster continuous improvement and system reliability.

Why it matters: This article showcases a successful approach to managing a large, evolving data graph in a service-oriented architecture. It provides insights into how a data-oriented service mesh can simplify developer experience, improve modularity, and scale efficiently.

  • Viaduct, Airbnb's data-oriented service mesh, has been open-sourced after five years of significant growth and evolution within the company.
  • It's built on three core principles: a central, integrated GraphQL schema, hosting business logic directly within the mesh, and re-entrancy for modular composition.
  • The "Viaduct Modern" initiative simplified its developer-facing Tenant API, reducing complexity from multiple mechanisms to just node and field resolvers.
  • Modularity was enhanced through formal "tenant modules," enabling teams to own schema and code while composing via GraphQL fragments and queries, avoiding direct code dependencies.
  • This modernization effort has allowed Viaduct to scale dramatically (8x traffic, 3x codebase) while maintaining operational efficiency and reducing incidents.

Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.

  • Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
  • EKS cluster deployment utilizes Terraform with a layered structure of AWS-originated and Pinterest-specific modules and Helm charts.
  • A comprehensive logging strategy is implemented for Moka, addressing EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, event logs), and system pod logs.
  • A key challenge in logging is ensuring the reliable upload of Spark event logs to S3, even during job failures, for consumption by the Spark History Server.
  • They are exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis.
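
As a rough sketch of the sidecar idea (not Pinterest's implementation), the loop below keeps syncing the driver's local Spark event log directory to S3 so the logs survive a job failure. The paths, bucket name, and interval are placeholder assumptions.

```python
# Sidecar-style uploader: periodically copy Spark event logs from the driver's
# local event log directory to S3, where the Spark History Server can read them
# even if the application dies before a final upload.
import os
import time

import boto3

EVENT_LOG_DIR = "/var/log/spark-events"   # assumed spark.eventLog.dir on the driver
BUCKET = "example-spark-history-bucket"   # hypothetical destination bucket
PREFIX = "eventlogs/"
SYNC_INTERVAL_S = 30

s3 = boto3.client("s3")


def sync_once() -> None:
    for name in os.listdir(EVENT_LOG_DIR):
        local_path = os.path.join(EVENT_LOG_DIR, name)
        if os.path.isfile(local_path):
            # Each upload replaces the S3 object with the latest contents of the
            # append-only log, so repeating it is safe.
            s3.upload_file(local_path, BUCKET, PREFIX + name)


if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_S)
```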

Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.

  • Dropbox engineers developed a custom liquid cooling system for GPU servers during Hack Week 2025 to address the thermal demands of AI workloads.
  • The team built a prototype from scratch using radiators, pumps, reservoirs, and manifolds when pre-assembled units were unavailable.
  • Stress tests revealed that liquid cooling reduced operating temperatures by 20–30°C compared to standard air-cooled production systems.
  • The project enabled reduced fan speeds for secondary components, leading to quieter operation and potential power savings.
  • The initiative serves as a proof-of-concept for future-proofing data center infrastructure against the rising power consumption of next-gen GPUs.
  • Future plans include expanding testing with dedicated liquid cooling labs across multiple Dropbox data centers.