Posts tagged with sre

GitHub Engineering, Mar 3, 2026

How we rebuilt the search architecture for high availability in GitHub Enterprise Server

Why it matters: This architectural shift eliminates common failure modes in high-availability setups where search indexes could become locked or corrupted during upgrades. By using native Cross Cluster Replication, engineers gain a more resilient, easier-to-maintain search infrastructure.

GitHub Enterprise Server (GHES) transitioned from a single multi-node Elasticsearch cluster to independent single-node clusters per instance.
The previous architecture allowed primary shards to migrate to read-only replica nodes, causing system locks during maintenance and upgrades.
The new architecture utilizes Elasticsearch’s Cross Cluster Replication (CCR) to synchronize data between independent clusters.
CCR ensures data durability by replicating information only after it has been persisted to the underlying Lucene segments.
A custom bootstrap workflow was developed to attach followers to existing indexes and configure auto-follow for future data.
This shift aligns search infrastructure with the standard leader/follower pattern used across the rest of the GHES platform.
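A minimal sketch of what such a bootstrap workflow might send to the follower cluster, assuming the standard Elasticsearch CCR endpoints (`PUT /<index>/_ccr/follow` and `PUT /_ccr/auto_follow/<name>`). The function names, the `ghes-search` pattern name, and the index names are hypothetical and are not taken from the GHES implementation; the code only builds the requests and does not contact a cluster.

```python
def follow_request(follower_index, remote_cluster, leader_index):
    """Request that attaches a follower index to an existing leader index."""
    path = f"/{follower_index}/_ccr/follow"
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return ("PUT", path, body)


def auto_follow_request(pattern_name, remote_cluster, leader_patterns):
    """Request that makes the follower pick up indexes created later."""
    path = f"/_ccr/auto_follow/{pattern_name}"
    body = {
        "remote_cluster": remote_cluster,
        "leader_index_patterns": list(leader_patterns),
        # Keep the follower index name identical to the leader's name.
        "follow_index_pattern": "{{leader_index}}",
    }
    return ("PUT", path, body)


def bootstrap_plan(remote_cluster, existing_indexes, leader_patterns):
    """Follow every existing index first, then register auto-follow."""
    plan = [follow_request(name, remote_cluster, name) for name in existing_indexes]
    # "ghes-search" is a hypothetical pattern name used for illustration.
    plan.append(auto_follow_request("ghes-search", remote_cluster, leader_patterns))
    return plan
```

Calling `bootstrap_plan("primary", ["issues", "code"], ["*"])` yields two follow requests followed by one auto-follow request, which mirrors the two-phase order described above: existing indexes first, future ones second.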

#sre #dist #data

Engineering at Meta, Mar 2, 2026

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Why it matters: jemalloc is a critical foundation for high-performance systems. Meta's renewed commitment ensures the allocator evolves with modern hardware like ARM64 and complex workloads, reducing technical debt and improving memory efficiency for the entire open-source ecosystem.

Meta is unarchiving and renewing its stewardship of the jemalloc open-source repository to ensure long-term infrastructure health.
The project will prioritize technical debt reduction and refactoring to improve maintainability and ease of use for the community.
A key focus is enhancing the Huge-Page Allocator (HPA) to better utilize transparent hugepages for increased CPU efficiency.
Planned improvements to packing, caching, and purging mechanisms aim to optimize overall memory efficiency and performance.
The roadmap includes specific performance optimizations for the AArch64 (ARM64) platform to ensure high out-of-the-box performance.
Meta is shifting back to principled engineering practices, moving away from short-term hacks that previously accumulated technical debt.

#sre #data

Cloudflare Blog, Mar 2, 2026

Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey

Why it matters: Project Helix reduces Zero Trust adoption barriers by replacing manual, error-prone configurations with automated best practices. This allows engineers to deploy secure, optimized SASE environments in minutes while ensuring consistency across complex network architectures.

Project Helix automates the configuration of Cloudflare One to eliminate the 'blank slate' problem and reduce deployment friction.
The system codifies best practices from solutions engineers into reusable Terraform templates for consistent, error-free SASE setups.
Automated policies cover DNS protection, TLS inspection, Remote Browser Isolation, and visibility controls for AI applications.
Helix addresses complex networking tasks like configuring CGNAT ranges for private app routing and managing split-tunneling for real-time apps.
The solution leverages a web-based UI hosted on Cloudflare Workers to trigger infrastructure-as-code deployments via API.
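To illustrate the kind of check that CGNAT range configuration involves, here is a small sketch using Python's standard `ipaddress` module. It assumes "CGNAT range" refers to the RFC 6598 shared address space (100.64.0.0/10); the function names and example CIDRs are hypothetical, and this is not how Project Helix itself validates ranges.

```python
import ipaddress

# RFC 6598 shared address space, commonly called the CGNAT range.
CGNAT_SPACE = ipaddress.ip_network("100.64.0.0/10")


def in_cgnat_space(cidr):
    """Return True if the IPv4 network lies entirely inside 100.64.0.0/10."""
    net = ipaddress.ip_network(cidr, strict=True)
    # Check the version first: subnet_of raises TypeError across IP versions.
    return net.version == 4 and net.subnet_of(CGNAT_SPACE)


def overlapping(cidr, existing):
    """List the already-routed networks that overlap the candidate range."""
    net = ipaddress.ip_network(cidr)
    return [e for e in existing if net.overlaps(ipaddress.ip_network(e))]
```

A candidate range that passes `in_cgnat_space` but returns a non-empty list from `overlapping` would collide with routes already in use, which is the sort of mistake a template-driven setup is meant to catch before deployment.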

#security #sre

Cloudflare Blog, Mar 2, 2026

Modernizing with agile SASE: a Cloudflare One blog takeover

Why it matters: Agile SASE moves security from rigid hardware silos to a programmable, single-pass global network. For engineers, this reduces technical debt, eliminates performance bottlenecks caused by service-chaining, and enables custom security logic via native developer platforms like Cloudflare Workers.

Cloudflare One introduces agile SASE, a composable platform designed to replace fragmented legacy hardware and VPNs with a unified connectivity cloud.
The platform uses a single-pass architecture across 300+ cities: every server can run all security checks in one pass, which eliminates service-chaining bottlenecks.
Integration with Cloudflare Workers allows developers to write custom code for real-time security event interception, moving beyond static allow/block rules.
Modernization focuses on five key areas: remote access, AI-powered email protection, DNS filtering, safe AI governance, and simplified branch networking.
Upcoming technical deep-dives will cover identity evolution, AI-driven threat detection, and the engineering behind the autonomous edge for performance-centric security.

#security #dist #sre

Cloudflare Blog, Mar 2, 2026

The truly programmable SASE platform

Why it matters: Cloudflare's programmable SASE allows engineers to build context-aware security policies using code. By executing logic at the edge, teams can integrate external data into access decisions in real-time, reducing latency and complexity compared to traditional webhook-based automation.

Cloudflare defines true programmability as the ability to intercept security events, enrich them with external context, and act in real-time.
The platform integrates SASE (Cloudflare One) with the Developer Platform (Workers), allowing security logic to run on the same global edge network.
Engineers can move beyond static 'allow/block' rules by using Workers to inject dynamic headers or query external risk APIs inline.
Managed actions provide templates for common IT and compliance workflows, while custom actions allow for bespoke logic via serverless functions.
This architecture eliminates the latency overhead typically associated with routing traffic to separate cloud environments for policy enforcement.
Real-world use cases include automated device session revocation and context-aware access based on external training or certification databases.
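The intercept, enrich, act flow can be sketched as a plain function. This is an illustrative model in Python rather than Workers code: `AccessEvent`, `decide`, the `X-Risk-Score` header, and the threshold of 70 are all hypothetical, and the two lookups stand in for whatever external risk API or training database a team might query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessEvent:
    """Hypothetical shape of an intercepted access request."""
    user: str
    app: str
    device_id: str


def decide(event, lookup_risk, has_training, risk_threshold=70):
    """Enrich the event with external context, then pick an action."""
    # Enrich: both callables are stand-ins for external data sources.
    risk = lookup_risk(event.device_id)
    if risk >= risk_threshold:
        return {"action": "block", "reason": "device risk too high"}
    if not has_training(event.user, event.app):
        return {"action": "block", "reason": "required training missing"}
    # Act: allow, and pass the context downstream as a header.
    return {"action": "allow", "headers": {"X-Risk-Score": str(risk)}}
```

Passing the lookups in as parameters keeps the decision logic testable without network access; in an edge deployment they would be inline calls made while the request is in flight.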

#security #dist #sre

Netflix Tech Blog, Feb 28, 2026

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Why it matters: Rapidly scaling containers with many layers can trigger kernel VFS lock contention when using idmap mounts for security. Understanding how hardware architecture, like NUMA domains and cache line bouncing, impacts system-level locks is crucial for high-density container orchestration.

Netflix identified a container startup bottleneck where nodes stalled for 30+ seconds due to kernel-level VFS mount lock contention.
The issue was triggered by the new container runtime's use of idmap mounts for unique user namespaces, which significantly increased mount operations.
Container images with many layers (50+) exacerbated the problem, requiring thousands of mount/unmount operations during rapid scale-up events.
Performance analysis using Intel's Top-down Microarchitecture Analysis (TMA) revealed that 95.5% of pipeline slots were stalled on contested accesses and cache line bouncing.
Older multi-socket hardware (r5.metal) suffered more than newer single-socket instances (m7i, m7a) due to NUMA-related latency in lock acquisition.
The investigation highlights the critical intersection of container runtime security features, kernel lock management, and modern CPU architecture.

#sre #security

Salesforce Engineering, Feb 26, 2026

Hyperforce Migration at Scale: How Deterministic Automation Replaced Manual Spreadsheets Across 95,000 Organizations

Why it matters: Automating large-scale infrastructure migrations is critical for reducing operational risk. MIPS demonstrates how to build a deterministic decision engine that maintains auditability and customer trust while scaling to handle tens of thousands of complex organization moves.

Salesforce developed the Migration Intake and Processing Service (MIPS) to automate the migration of 95,000 organizations to Hyperforce.
The platform replaced manual spreadsheets and email-based coordination with a deterministic decision engine for eligibility and capacity.
MIPS consolidates distributed sources of truth into a centralized layer using well-defined APIs and explicit data contracts.
The system achieves a 90%+ auto-approval rate while escalating complex exceptions for human review to maintain throughput.
Every migration decision is fully traceable and auditable, ensuring customer trust and data residency requirements are met.
The architecture utilizes continuous data quality checks to prevent misinterpretations of regional capacity or scheduling windows.
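A deterministic decision engine of this kind can be sketched as a pure function that returns both an outcome and the reasons behind it. The field names (`has_open_exceptions`, `allowed_regions`, `size_units`) and the three rules below are assumptions made for illustration; the actual MIPS rules and data contracts are not described in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    org_id: str
    outcome: str    # "auto_approve" or "escalate"
    reasons: tuple  # recorded with the outcome so the decision is auditable


def decide_migration(org, region_capacity):
    """Same inputs always give the same outcome and the same reasons."""
    reasons = []
    if org["has_open_exceptions"]:
        reasons.append("open exceptions require human review")
    if org["target_region"] not in org["allowed_regions"]:
        reasons.append("target region violates data residency")
    if region_capacity.get(org["target_region"], 0) < org["size_units"]:
        reasons.append("insufficient capacity in target region")
    outcome = "escalate" if reasons else "auto_approve"
    return Decision(org["org_id"], outcome, tuple(reasons) or ("all checks passed",))
```

Because the function has no hidden state, replaying a stored input reproduces the stored decision, which is what makes an audit trail meaningful.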

#sre #dist #data

Pinterest Engineering, Feb 24, 2026

Piqama: Pinterest Quota Management Ecosystem

Why it matters: Managing resources at scale requires more than just hard limits. Piqama provides a unified framework for capacity and rate-limiting, enabling automated rightsizing and budget alignment. This reduces manual overhead while improving resource efficiency and system reliability across platforms.

Piqama is a unified quota management ecosystem at Pinterest handling physical resources, service limits like QPS, and application-specific units.
The platform manages the full quota lifecycle including schema definition, pluggable validation rules, and ownership-based authorization.
It supports both capacity-based quotas for Big Data workloads (integrated with Apache YuniKorn) and rate-limiting quotas for online storage services.
A centralized management portal provides visibility and self-service capabilities for quota updates and usage tracking.
Governance features include automated usage statistics collection via Apache Iceberg and an auto-rightsizing service for predictive resource allocation.
The system integrates with Pinterest's chargeback and entitlement systems to align resource consumption with financial budgets.
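For the rate-limiting side, a token bucket is one common way to enforce a QPS quota. The summary does not say which algorithm Piqama uses, so treat the sketch below as a generic illustration; the class name and parameters are hypothetical, and the clock is passed in explicitly to keep the behavior deterministic.

```python
class TokenBucket:
    """Generic token bucket: `rate_per_sec` sustained, `burst` at most at once."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = float(now)

    def allow(self, now, cost=1.0):
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(rate_per_sec=2, burst=2)`, two requests at time 0 pass, a third is rejected, and one more passes at time 0.5 once a single token has been refilled.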

#sre #data #finops

Cloudflare Blog, Feb 21, 2026

Cloudflare outage on February 20, 2026

Why it matters: This incident highlights the risks of automated configuration propagation in global networks. It demonstrates how a single API change can trigger widespread BGP withdrawals and how software bugs can complicate recovery, emphasizing the need for 'fail small' deployment strategies.

A configuration change to the BYOIP management pipeline caused the unintentional withdrawal of approximately 1,100 BGP prefixes.
The outage lasted 6 hours and 7 minutes, impacting services including Magic Transit, Spectrum, and Dedicated Egress.
While many prefixes were restored by reverting the change, a software bug deleted edge configurations for 300 prefixes, requiring manual restoration.
The Addressing API's immediate propagation mechanism meant that the configuration error was distributed globally to the edge almost instantly.
Impacted customers experienced BGP Path Hunting, where connections repeatedly failed while trying to find valid routes to destination IPs.
Cloudflare is implementing a 'Fail Small' resilience plan to improve how these systems roll out changes and prevent global-scale failures.
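One way to picture a "fail small" rollout is a staged deployment that stops at the first unhealthy stage and undoes what it touched. The sketch below is a generic pattern, not Cloudflare's actual plan; the stage names and the `apply`, `healthy`, and `rollback` callbacks are hypothetical.

```python
def staged_rollout(stages, apply, healthy, rollback):
    """Apply a change stage by stage; on failure, roll back in reverse order."""
    touched = []
    for stage in stages:
        apply(stage)
        touched.append(stage)
        if not healthy(stage):
            # Undo the most recent stage first, then the earlier ones.
            for done in reversed(touched):
                rollback(done)
            return {"status": "rolled_back", "failed_stage": stage, "touched": touched}
    return {"status": "complete", "touched": touched}
```

With stages ordered from smallest to largest blast radius, a bad change fails at an early stage and never reaches the later ones, in contrast to the immediate global propagation described above.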

#sre #dist

PlanetScale Tech Blog, Feb 19, 2026

Faster PlanetScale Postgres connections with Cloudflare Hyperdrive

Why it matters: This article provides a blueprint for building high-concurrency, real-time applications by combining edge computing with optimized database pooling. It demonstrates how to minimize latency between globally distributed users and centralized stateful databases.

Cloudflare Hyperdrive optimizes Postgres performance by automating connection pooling and reducing the latency of the seven-step connection handshake.
PlanetScale Postgres Metal provides a high-performance backend using locally-attached NVMe SSDs rather than network-attached storage.
The architecture leverages Cloudflare Workers for global distribution and Durable Objects to manage stateful WebSocket connections for real-time updates.
Engineers must evaluate 'smart placement' to decide whether running Workers closer to the database or closer to the user yields better latency for their specific workload.
Hyperdrive consists of an edge component for connection preparation and a connection pooler located physically near the database to maintain warm connections.
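The benefit of keeping warm connections can be shown with a toy pool that counts how often a full handshake is paid. This is a simplified model, not Hyperdrive's implementation; `WarmPool`, `max_idle`, and the `connect` callback are hypothetical, and a real pooler would also handle health checks, timeouts, and closing dropped connections.

```python
from collections import deque


class WarmPool:
    """Toy connection pool that reuses idle connections when it can."""

    def __init__(self, connect, max_idle=4):
        self._connect = connect  # callable that performs the costly handshake
        self._idle = deque()
        self._max_idle = max_idle
        self.handshakes = 0      # number of times a new connection was opened

    def acquire(self):
        if self._idle:
            return self._idle.pop()  # warm connection, no handshake needed
        self.handshakes += 1
        return self._connect()

    def release(self, conn):
        # Keep the connection warm if there is room; otherwise drop it.
        if len(self._idle) < self._max_idle:
            self._idle.append(conn)
```

Acquiring, releasing, and acquiring again returns the same connection object with `handshakes` still at 1, so only the first request pays the setup cost.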

#dist #data #sre
