Curated topic

sre

Posts tagged with sre

GitHub Engineering, Dec 12, 2025

The future of AI-powered software optimization (and how it can help your team)

Why it matters: This article introduces "Continuous Efficiency," an AI-driven method to embed sustainable and efficient coding practices directly into development workflows. It offers engineers a practical path to improve code quality and performance and to reduce operational costs without manual effort.

"Continuous Efficiency" integrates AI-powered automation with green software principles to embed sustainability into development workflows.
This approach combines LLM-powered Continuous AI for CI/CD with Green Software practices, aiming for more performant, resilient, and cost-effective code.
It addresses the low priority of green software by enabling near-effortless, always-on optimization for efficiency and reduced environmental impact.
Implemented via Agentic Workflows in GitHub Actions, it allows defining engineering standards in natural language for scalable application.
Benefits include declarative rule authoring, semantic generalizability across languages, and intelligent remediation like automated pull requests.
Pilot projects demonstrate success in applying green software rules and Web Sustainability Guidelines, yielding measurable performance gains.

#mlp #sre #finops

GitHub Engineering, Dec 11, 2025

Let’s talk about GitHub Actions

Why it matters: The article details how GitHub Actions' core infrastructure was re-architected to support massive scale and deliver crucial features. This ensures improved reliability, performance, and flexibility for developers using CI/CD pipelines, addressing long-standing community requests.

GitHub Actions underwent a significant re-architecture of its core backend services to handle massive growth, now processing 71 million jobs daily.
This re-architecture improved performance, scalability, and reliability, laying the foundation for future feature development.
Key quality-of-life improvements recently shipped include support for YAML anchors to reduce workflow duplication.
Non-public workflow templates enable consistent, private CI scaffolding across organizations.
Reusable workflow limits were increased, allowing for more modular and deeply nested CI/CD pipelines.
The cache size limit per repository was removed, addressing a pain point for large projects with heavy dependencies.

#sre #dist

GitHub Engineering, Dec 11, 2025

GitHub Availability Report: November 2025

Why it matters: This report highlights common infrastructure challenges like rate limiting, certificate management, and configuration errors. It offers valuable insights into incident response, mitigation strategies, and proactive measures for maintaining high availability in complex distributed systems.

GitHub experienced three incidents in November 2025, affecting Dependabot, Git operations, and Copilot services.
A Dependabot incident was caused by hitting GitHub Container Registry rate limits, resolved by adjusting job rates and increasing limits.
All Git operations failed due to an expired TLS certificate for internal service-to-service communication, mitigated by certificate replacement and service restarts.
A Copilot outage for the Claude Sonnet 4.5 model resulted from a misconfiguration in an internal service, which was resolved by reverting the change.
Post-incident actions include adding new monitoring, auditing certificates, accelerating automation for certificate management, and improving cross-service deploy safeguards.
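The certificate audit mentioned above can be illustrated with a minimal sketch. It is not GitHub's tooling: the function names and the 30-day warning window are assumptions made for illustration, and the only library call relied on is the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` string format that Python reports for peer certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and a certificate's notAfter time.

    `not_after` uses the format Python's ssl module reports,
    e.g. "Dec 31 00:00:00 2025 GMT".
    """
    expires_ts = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_ts, tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """Flag a certificate once it is within `warn_days` of expiring (assumed threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    now = datetime(2025, 12, 1, tzinfo=timezone.utc)
    print(needs_renewal("Dec 31 00:00:00 2025 GMT", now))
```

Running a check like this on a schedule, and alerting well before the threshold, is one way to catch an internal certificate before it expires.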

#sre #dist

GitHub Engineering, Dec 9, 2025

Speed is nothing without control: How to keep quality high in the AI era

Why it matters: Engineers can leverage AI for rapid development while maintaining high code quality. This article introduces tools and strategies, like GitHub Code Quality and effective prompting, to prevent "AI slop" and ensure reliable, maintainable code in an accelerated workflow.

AI significantly accelerates development but risks generating "AI slop" and technical debt without proper quality control.
GitHub Code Quality, leveraging AI and CodeQL, ensures high standards by automatically detecting and suggesting fixes for maintainability and reliability issues in pull requests.
Key features include one-click enablement, automated fixes for common errors, enforcing quality bars with rulesets, and surfacing legacy technical debt.
Engineers must "drive" AI by providing clear, constrained prompts, focusing on goals, context, and desired output formats to maximize quality.
This approach allows teams to achieve both speed and control, preventing trade-offs between velocity and code reliability in the AI era.

#sre #mlp

Microsoft Azure Blog, Dec 9, 2025

Microsoft’s commitment to supporting cloud infrastructure demand in the United States

Why it matters: This expansion provides engineers with more Azure regions and Availability Zones, enabling highly resilient, performant, and geographically diverse cloud architectures for critical applications and AI workloads.

Microsoft is significantly expanding its cloud infrastructure in the US, including a new East US 3 region in Atlanta by early 2027.
The East US 3 region will incorporate Availability Zones for enhanced resiliency and support advanced Azure workloads, including AI.
Five existing US Azure regions (North Central US, West Central US, US Gov Arizona, East US 2, South Central US) will also gain Availability Zones by 2026-2027.
These expansions aim to meet growing customer demand for cloud and AI services, offering greater capacity, resiliency, and agility.
The new infrastructure emphasizes sustainability, with the East US 3 region designed for LEED Gold Certification and water conservation.
Leveraging Availability Zones and multi-region architectures is highlighted as a way to improve application performance and resilience and to reduce latency.

#dist #sre #mlp

Spotify Engineering, Dec 9, 2025

Background Coding Agents: Predictable Results Through Strong Feedback Loops (Part 3)

Why it matters: As AI agents become integrated into development, ensuring their output is safe and predictable is critical. This system provides a blueprint for building trust in automated code generation through rigorous feedback loops and validation.

Spotify's system focuses on making AI coding agents predictable and trustworthy through structured feedback loops.
The architecture ensures that agent-generated code is validated against existing engineering standards and tests.
Background agents operate asynchronously to improve code quality without disrupting the primary developer workflow.
The framework addresses the challenge of moving from experimental AI generation to production-ready software engineering.
Automated verification steps are integrated to prevent the introduction of bugs or technical debt by autonomous agents.

#mlp #culture #sre

Cloudflare Blog, Dec 9, 2025

Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code

Why it matters: This article provides a blueprint for implementing "shift left" security and IaC at enterprise scale, crucial for preventing misconfigurations, enhancing consistency, and improving operational efficiency in large, complex environments.

Cloudflare adopted "shift left" principles and Infrastructure as Code (IaC) to manage its critical platform securely and consistently at enterprise scale.
All production account configurations are managed via IaC using Terraform, integrated with a custom CI/CD pipeline (Atlantis, GitLab, tfstate-butler).
A centralized monorepo holds all configurations, with teams owning their specific sections, promoting accountability and consistency.
Security baselines are enforced through Policy as Code (Open Policy Agent with Rego), shifting validation to the earliest stages of development.
Policies are automatically checked on every merge request, preventing misconfigurations before deployment and minimizing human error.

#security #sre

Salesforce Engineering, Dec 8, 2025

How AI-Powered Testing Enabled Sub-Second Latency for Agentforce Voice

Why it matters: Achieving sub-second latency in voice AI requires rethinking performance metrics and optimizing every microservice. This article shows how semantic end-pointing and synthetic testing are critical for building responsive, human-like voice agents at scale.

Salesforce developed the Flash Reasoning Engine to achieve sub-second Time to First Audio (TTFA) for natural, human-fast voice interactions.
The team optimized the real-time voice pipeline by shaving hundreds of milliseconds from microservices, synchronous calls, and serialization paths.
It implemented semantic end-pointing algorithms that use confidence thresholds to distinguish between meaningful pauses and true utterance completion.
It created AI-driven synthetic customer testing frameworks to generate repeatable data sets and eliminate noise in performance metrics.
It resolved a measurement inaccuracy in which initial tests reported 70-second latencies; the fix was to measure TTFA instead of total output duration.
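The difference between TTFA and total output duration can be shown with a small sketch. It is not Salesforce's test harness: `measure_stream`, the injectable `clock` parameter, and the timing values in the usage example are assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float]:
    """Consume a stream of audio chunks and return (ttfa, total_duration) in seconds.

    ttfa is the time until the first chunk arrives; total_duration is the time
    until the stream is exhausted, which grows with the length of the reply.
    """
    start = clock()
    first_chunk_at = None
    for _ in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # record only the first arrival
    end = clock()
    if first_chunk_at is None:
        raise ValueError("stream produced no audio chunks")
    return first_chunk_at - start, end - start


if __name__ == "__main__":
    # A fake clock makes the example deterministic: the first chunk arrives
    # at 0.4 s, and the full reply finishes playing out at 70 s.
    ticks = iter([0.0, 0.4, 70.0])
    ttfa, total = measure_stream([b"a", b"b", b"c"], clock=lambda: next(ticks))
    print(f"TTFA={ttfa:.1f}s total={total:.1f}s")
```

A long spoken reply inflates the total duration even when the agent starts speaking quickly, which is why the two numbers have to be reported separately.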

#mlp #dist #sre

Cloudflare Blog, Dec 8, 2025

Python Workers redux: fast cold starts, packages, and a uv-first workflow

Why it matters: Engineers can now deploy Python applications globally on Cloudflare Workers with full package support and exceptionally fast cold starts. This significantly improves serverless Python development, offering a highly performant and flexible platform for a wide range of edge computing use cases.

Cloudflare Python Workers now support any Pyodide-compatible package, including pure Python and many dynamic libraries, enhancing developer flexibility.
A uv-first workflow and pywrangler tooling simplify package installation and global deployment of Python applications on the Workers platform.
Significant cold start performance improvements have been achieved through dedicated memory snapshots, making Python Workers 2.4x faster than AWS Lambda and 3x faster than Google Cloud Run for package-heavy applications.
The platform offers a free tier and supports various use cases, from FastAPI apps and HTML templating to real-time chat with Durable Objects and image generation.
These advancements provide a Python-native serverless experience with global deployment and minimal latency.

#dist #sre

Cloudflare Blog, Dec 5, 2025

Cloudflare outage on December 5, 2025

Why it matters: This incident underscores the critical impact of configuration management in distributed systems. It highlights how rapid, global deployments without gradual rollouts and robust error handling can lead to widespread outages, even from seemingly minor code paths.

A 25-minute Cloudflare outage on Dec 5, 2025, impacted 28% of HTTP traffic due to a configuration change.
The incident stemmed from disabling an internal WAF testing tool as part of work to mitigate a React Server Components vulnerability (CVE-2025-55182).
A global configuration system, lacking gradual rollout, propagated a change that triggered a Lua runtime error in the FL1 proxy.
The error was an attempt to access a nil value ('rule_result.execute') when a killswitch skipped an "execute" action rule, a bug undetected for years.
This highlights the need for robust type systems and safe deployment practices, especially for critical infrastructure.
Cloudflare acknowledges similar past incidents and is prioritizing enhanced rollouts and versioning to prevent future widespread impacts.
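The class of bug described above, reading a field on a value that a killswitch left unset, can be sketched as a Python analogue. This is not Cloudflare's Lua code: the `RuleResult` and `ExecuteInfo` types and the `ruleset_id` field are hypothetical, and only the `execute` field name comes from the summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecuteInfo:
    ruleset_id: str  # hypothetical payload attached to an "execute" action


@dataclass
class RuleResult:
    action: str
    # Left as None when a killswitch skips the rule before it is evaluated.
    execute: Optional[ExecuteInfo] = None


def sub_ruleset_unsafe(result: RuleResult) -> Optional[str]:
    """Assumes every "execute" result carries its payload; fails when it does not."""
    if result.action == "execute":
        return result.execute.ruleset_id  # AttributeError when execute is None
    return None


def sub_ruleset_safe(result: RuleResult) -> Optional[str]:
    """Treats a missing payload as "rule skipped" instead of raising."""
    if result.action == "execute" and result.execute is not None:
        return result.execute.ruleset_id
    return None


if __name__ == "__main__":
    skipped = RuleResult(action="execute")  # killswitch applied, payload never set
    print(sub_ruleset_safe(skipped))
    try:
        sub_ruleset_unsafe(skipped)
    except AttributeError as exc:
        print(f"unsafe path failed: {exc}")
```

Marking the field as `Optional` lets a static type checker flag the unguarded access, which is the kind of protection the summary attributes to stronger type systems.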

#sre #dist #security
