sre

Posts tagged with sre

Why it matters: For global-scale perimeter services, traditional sequential rollbacks are too slow. This architecture demonstrates how to achieve 10-minute global recovery through warm-standby blue-green deployments and synchronized autoscaling, ensuring high availability for trillions of requests.

  • Salesforce Edge manages a global perimeter platform handling 1.5 trillion monthly requests across 21+ points of presence.
  • Transitioned from sequential regional rollbacks taking up to 12 hours to a global blue-green model that recovers in 10 minutes.
  • Implemented parallel blue and green Kubernetes deployments to maintain a warm standby fleet capable of immediate full-load handling.
  • Customized Horizontal Pod Autoscalers (HPA) to ensure the inactive fleet scales identically to the active fleet, preventing capacity mismatches.
  • Automated traffic redirection using native Kubernetes labels and selectors instead of external L7 routing tools like Argo (see the cutover sketch after this list).
  • Integrated TCP connection draining and controlled traffic cutover to preserve four-nines availability during global rollback events.
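
A minimal sketch of that label-based cutover, using the official kubernetes Python client; the Service name, namespace, and "color" label are illustrative assumptions, not Salesforce's actual manifests.

```python
# Sketch: flip live traffic between the blue and green fleets by patching the
# Service selector. Assumes pods carry an illustrative "color" label and that
# the warm-standby fleet is already scaled to take full load.
# Requires the official client: pip install kubernetes
from kubernetes import client, config


def cutover(service_name: str, namespace: str, target_color: str) -> None:
    """Point the Service at the standby fleet (e.g. blue -> green)."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()

    # Patch only the selector; endpoints switch to the target fleet, while
    # connections to the old fleet drain via the pods' termination grace period.
    patch = {"spec": {"selector": {"app": "edge-proxy", "color": target_color}}}
    core.patch_namespaced_service(service_name, namespace, patch)


if __name__ == "__main__":
    cutover("edge-proxy", "perimeter", "green")
```

Because the rollback is just a selector patch, reverting is the same operation with the colors swapped, which is what keeps global recovery in the range of minutes rather than hours.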

Why it matters: Understanding global connectivity disruptions helps engineers build more resilient, multi-homed architectures. It highlights the fragility of physical infrastructure like submarine cables and the impact of BGP routing and government policy on service availability.

  • Q4 2025 saw over 180 global Internet disruptions caused by government mandates, physical infrastructure damage, and technical failures.
  • Tanzania implemented a near-total Internet shutdown during its presidential election, resulting in a 90% traffic drop and fluctuations in BGP address space announcements.
  • Submarine cable cuts, specifically to the PEACE and WACS systems, significantly impacted connectivity in Pakistan and Cameroon.
  • Infrastructure vulnerabilities in Haiti led to multiple outages for Digicel users due to international fiber optic cuts.
  • Beyond physical damage, disruptions were linked to hyperscaler cloud platform issues and ongoing military conflicts affecting regional network stability.

Why it matters: This incident highlights how minor automation errors in BGP policy configuration can cause global traffic disruptions. It underscores the risks of permissive routing filters and the importance of robust validation in network automation to prevent large-scale route leaks.

  • An automated routing policy change intended to remove IPv6 prefix advertisements for a Bogotá data center caused a major BGP route leak in Miami.
  • The removal of specific prefix lists from policy statements resulted in overly permissive terms, unintentionally redistributing peer routes to other providers (a validation sketch follows this list).
  • The incident lasted 25 minutes, causing significant congestion on Miami backbone infrastructure and affecting both Cloudflare customers and external parties.
  • The leak was classified as a mixture of Type 3 and Type 4 route leaks according to RFC7908, violating standard valley-free routing principles.
  • Impact was limited to IPv6 traffic and was mitigated by manually reverting the configuration and pausing the automation platform.
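
The corrective lesson is pre-deployment validation of generated policies. Below is a hypothetical sketch of such a check, which refuses to roll out an export term that accepts routes with no prefix list attached; the data model and names are illustrative, not the actual automation platform.

```python
# Hypothetical pre-deployment check: refuse to deploy an export policy in which
# an "accept" term has lost all of its prefix lists and therefore matches
# everything. The data model below is illustrative, not the real automation.
from dataclasses import dataclass, field


@dataclass
class PolicyTerm:
    name: str
    action: str                                   # "accept" or "reject"
    prefix_lists: list[str] = field(default_factory=list)


def overly_permissive_terms(terms: list[PolicyTerm]) -> list[str]:
    """Return the names of accept terms with no prefix-list match left."""
    return [t.name for t in terms if t.action == "accept" and not t.prefix_lists]


def validate_or_abort(terms: list[PolicyTerm]) -> None:
    bad = overly_permissive_terms(terms)
    if bad:
        raise ValueError(f"refusing to deploy: unconditional accept in terms {bad}")


if __name__ == "__main__":
    # Removing a data center's IPv6 prefix list from a term should fail the
    # check, not silently turn the term into "accept and re-advertise everything".
    try:
        validate_or_abort([PolicyTerm("export-to-transit-v6", "accept")])
    except ValueError as err:
        print(err)
```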

Why it matters: Supporting open-source sustainability is crucial for the reliability of modern software stacks. This initiative demonstrates how large engineering organizations can mitigate supply chain risks and ensure the longevity of critical dependencies.

  • Spotify has announced the 2025 recipients of its Free and Open Source Software (FOSS) Fund.
  • The fund was established in 2022 to provide financial support to critical open source projects that Spotify relies on.
  • The initiative aims to ensure the long-term sustainability and health of the global open source ecosystem.
  • This program highlights the importance of corporate responsibility in maintaining the software infrastructure used by millions.

Why it matters: Securing AI agents at scale requires balancing rapid innovation with enterprise-grade protection. This architecture demonstrates how to manage 11M+ daily calls by decoupling security layers, ensuring multi-tenant reliability, and maintaining request integrity across distributed systems.

  • Salesforce's Developer Access team manages a secure access plane for Agentforce, handling over 11 million daily agent calls across production environments.
  • The architecture utilizes a layered access-control plane that separates authentication at the edge from authorization within the core platform to reduce latency and operational risk.
  • A middle-layer API service acts as a technical control point, ensuring all agentic traffic follows consistent security protocols and cannot bypass protection boundaries.
  • Security invariants include edge-level authentication validation, core-platform-enforced authorization, and end-to-end request integrity using Salesforce-minted tokens (sketched after this list).
  • The system is designed to contain multi-tenant blast radius risks, preventing runaway agents or malformed requests from impacting other customers in a shared environment.
  • Strict egress traffic filtering and cross-boundary revalidation are employed to maintain the principle of least privilege across the distributed compute layer.
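
A minimal sketch of the layered idea, with stdlib HMAC standing in for Salesforce-minted tokens; the tenant model, key handling, and function names are assumptions for illustration only.

```python
# Sketch of a layered access-control check: the edge validates token integrity,
# and the core platform independently revalidates and enforces authorization.
# HMAC here stands in for Salesforce-minted tokens; all names are illustrative.
import hashlib
import hmac

SIGNING_KEY = b"demo-key-not-for-production"


def mint_token(tenant_id: str, agent_id: str) -> str:
    payload = f"{tenant_id}:{agent_id}"
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"


def edge_authenticate(token: str) -> tuple[str, str]:
    """Layer 1 (edge): cheap integrity check before traffic enters the core."""
    tenant_id, agent_id, sig = token.rsplit(":", maxsplit=2)
    expected = hmac.new(SIGNING_KEY, f"{tenant_id}:{agent_id}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("edge: token failed integrity check")
    return tenant_id, agent_id


def core_authorize(tenant_id: str, agent_id: str, resource_tenant: str) -> None:
    """Layer 2 (core): authorization is enforced here, never trusted from the edge.
    Containing blast radius: an agent may only touch its own tenant's data."""
    if tenant_id != resource_tenant:
        raise PermissionError("core: cross-tenant access denied")


if __name__ == "__main__":
    token = mint_token("acme", "agent-42")
    tenant, agent = edge_authenticate(token)      # edge boundary
    core_authorize(tenant, agent, "acme")         # core platform boundary
    print("request admitted for", tenant, agent)
```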

Why it matters: Benchmarking AI systems against live providers is expensive and noisy. This mock service provides a deterministic, cost-effective way to validate performance and reliability at scale, allowing engineers to iterate faster without financial friction or external latency fluctuations.

  • Salesforce developed an internal LLM mock service to simulate AI provider behavior, supporting benchmarks of over 24,000 requests per minute.
  • The service reduced annual token-based costs by over $500,000 by replacing live LLM dependencies during performance and regression testing.
  • Deterministic latency controls allow engineers to isolate internal code performance from external provider variability, ensuring repeatable results (see the sketch after this list).
  • The mock layer enables rapid scale and failover benchmarking by simulating high-volume traffic and controlled outages without external infrastructure.
  • By providing a shared platform capability, the service accelerates development loops and improves confidence in performance signals.
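
A minimal sketch of the concept, assuming canned completions derived from a prompt hash plus configurable latency and scripted failure injection; this is not the actual service's interface.

```python
# Sketch of a deterministic LLM mock: canned responses, fixed latency, and
# scripted failure injection, so benchmark runs are repeatable and free of
# provider cost or jitter. Names and behavior are illustrative assumptions.
import hashlib
import time


class MockLLM:
    def __init__(self, latency_s: float = 0.150, fail_every_n: int = 0):
        self.latency_s = latency_s        # deterministic latency per request
        self.fail_every_n = fail_every_n  # inject an outage every Nth call (0 = never)
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        time.sleep(self.latency_s)        # simulate provider latency, not real compute
        if self.fail_every_n and self.calls % self.fail_every_n == 0:
            raise TimeoutError("injected provider outage")   # for failover benchmarks
        # Deterministic "completion": the same prompt always yields the same output.
        digest = hashlib.sha256(prompt.encode()).hexdigest()[:12]
        return f"mock-completion-{digest}"


if __name__ == "__main__":
    llm = MockLLM(latency_s=0.01, fail_every_n=100)
    assert llm.complete("hello") == llm.complete("hello")    # repeatable results
    print(llm.complete("hello"))
```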

Why it matters: Security mitigations added during incidents can become technical debt that degrades user experience. This case study emphasizes the need for lifecycle management and observability in defense systems to ensure temporary protections don't inadvertently block legitimate traffic as patterns evolve.

  • GitHub identified that emergency defense mechanisms, such as rate limits and traffic controls, were inadvertently blocking legitimate users after outliving their original purpose.
  • The issue stemmed from composite signals that combined industry-standard fingerprinting with platform-specific business logic, leading to false positives during normal browsing.
  • While the false-positive rate was low (0.003-0.004% of total traffic), it caused consistent disruption for logged-out users following external links.
  • The investigation involved tracing requests across a multi-layered infrastructure built on HAProxy to pinpoint which specific defense layer was triggering the blocks.
  • The incident reinforces that observability and lifecycle management are as critical for security mitigations as they are for core product features; a sketch of that lifecycle idea follows this list.
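
A hypothetical sketch of that idea: every emergency mitigation carries an owner and an expiry, and every block decision is logged so false positives can be measured; the rule model is illustrative, not GitHub's implementation.

```python
# Hypothetical sketch: every emergency mitigation records an owner and an
# expiry, and every block decision is logged, so a rule that has outlived its
# purpose is surfaced instead of silently blocking legitimate logged-out users.
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

log = logging.getLogger("mitigations")


@dataclass
class Mitigation:
    name: str
    owner: str
    expires_at: datetime

    def matches(self, fingerprint: str) -> bool:
        # Placeholder for the composite signal (fingerprinting + business logic).
        return fingerprint.startswith("suspect-")


def should_block(rule: Mitigation, fingerprint: str) -> bool:
    if datetime.now(timezone.utc) >= rule.expires_at:
        log.warning("mitigation %s (owner: %s) has expired; not enforcing",
                    rule.name, rule.owner)
        return False
    blocked = rule.matches(fingerprint)
    # Observability: record every decision so false-positive rates can be measured.
    log.info("mitigation=%s fingerprint=%s blocked=%s", rule.name, fingerprint, blocked)
    return blocked


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    rule = Mitigation("emergency-rate-limit", "traffic-team",
                      expires_at=datetime.now(timezone.utc) + timedelta(days=14))
    print(should_block(rule, "suspect-abc123"))
```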

Why it matters: This report highlights the operational challenges of scaling AI-integrated services and global infrastructure. It provides insights into managing model-backed dependencies, handling cross-cloud network issues, and mitigating traffic spikes to maintain high availability for developer tools.

  • A Kafka misconfiguration prevented agent session data from reaching the AI Controls page, leading to improved pre-deployment validation.
  • Copilot Code Review experienced degradation due to model-backed dependency latency, mitigated by bypassing fix suggestions and increasing worker capacity.
  • Network packet loss between West US runners and an edge site caused GitHub Actions timeouts, resolved by rerouting traffic away from the affected site.
  • A database migration caused schema drift that blocked Copilot policy updates, resulting in hardened service synchronization and deployment pipelines.
  • Unauthenticated traffic spikes to search endpoints caused page load failures, addressed through improved limiters and proactive traffic monitoring.

Why it matters: This incident highlights how subtle optimizations can break systems by violating undocumented assumptions in legacy clients. It serves as a reminder that even when a protocol doesn't mandate order, real-world implementations often depend on it.

  • A memory optimization in Cloudflare's 1.1.1.1 resolver inadvertently changed the order of records in DNS responses.
  • The code change moved CNAME records to the end of the answer section instead of the beginning when merging cached partial chains.
  • While the DNS protocol technically treats record order as irrelevant, many client implementations process records sequentially.
  • Legacy implementations like glibc's getaddrinfo fail to resolve addresses if the A record appears before the CNAME that defines the alias.
  • The incident was resolved by reverting the optimization, restoring the original record ordering where CNAMEs precede final answers (sketched below).
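
A simplified sketch of the ordering invariant the revert restored, assuming a toy record representation: when merging cached partial chains, the CNAME chain is emitted before the terminal address records.

```python
# Sketch of the ordering invariant: in the answer section, the CNAME chain must
# precede the terminal A/AAAA records, because sequential clients (e.g. older
# getaddrinfo implementations) walk the answers in order. The record
# representation here is deliberately simplified.

def order_answer_section(records: list[dict]) -> list[dict]:
    """Place CNAME records (the alias chain) before address records."""
    cnames = [r for r in records if r["type"] == "CNAME"]
    addresses = [r for r in records if r["type"] in ("A", "AAAA")]
    return cnames + addresses


if __name__ == "__main__":
    # A merged cache result where the optimization had pushed the CNAME last:
    merged = [
        {"name": "target.example.net", "type": "A", "data": "192.0.2.10"},
        {"name": "www.example.com", "type": "CNAME", "data": "target.example.net"},
    ]
    for record in order_answer_section(merged):
        print(record["name"], record["type"], record["data"])
```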

Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.

  • Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
  • Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
  • Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and SCA.
  • Categorized 20+ local payment methods into three standardized archetypes (Redirect, Async, and Direct flows) to maximize code reuse, as sketched after this list.
  • Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
  • Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions.
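
A hypothetical sketch of the archetype idea in Python: PSP connectors are grouped into three standardized flows behind one processor-agnostic interface; the class and method names are illustrative assumptions, not the actual Payments LTA code.

```python
# Hypothetical sketch: PSP connectors are classified into three standardized
# flows (Redirect, Async, Direct) behind one processor-agnostic interface, so
# orchestration code is reused across 20+ local payment methods.
from abc import ABC, abstractmethod
from enum import Enum


class Archetype(Enum):
    REDIRECT = "redirect"   # shopper is sent to the PSP (e.g. for SCA) and returns
    ASYNC = "async"         # confirmation arrives later via webhook or polling
    DIRECT = "direct"       # PSP responds synchronously in the same call


class PaymentConnector(ABC):
    archetype: Archetype

    @abstractmethod
    def authorize(self, payment_id: str, amount_minor: int, currency: str) -> str:
        """Start the PSP-specific flow; returns a reference for later steps."""


class ExampleRedirectConnector(PaymentConnector):
    archetype = Archetype.REDIRECT

    def authorize(self, payment_id: str, amount_minor: int, currency: str) -> str:
        # Illustrative only: a real connector would call the PSP's API here and
        # reuse payment_id as the idempotency key on retries.
        return f"https://psp.example/redirect/{payment_id}"


if __name__ == "__main__":
    connector = ExampleRedirectConnector()
    print(connector.archetype, connector.authorize("pay_123", 2500, "EUR"))
```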