sre

Posts tagged with sre

Why it matters: Engineers can now deploy Python applications globally on Cloudflare Workers with full package support and exceptionally fast cold starts. This significantly improves serverless Python development, offering a highly performant and flexible platform for a wide range of edge computing use cases.

  • Cloudflare Python Workers now support any Pyodide-compatible package, including pure Python and many dynamic libraries, enhancing developer flexibility.
  • A uv-first workflow and pywrangler tooling simplify package installation and global deployment of Python applications on the Workers platform.
  • Significant cold start performance improvements have been achieved through dedicated memory snapshots, making Python Workers 2.4x faster than AWS Lambda and 3x faster than Google Cloud Run for package-heavy applications.
  • The platform offers a free tier and supports various use cases, from FastAPI apps and HTML templating to real-time chat with Durable Objects and image generation.
  • These advancements provide a Python-native serverless experience with global deployment and minimal latency; a minimal Worker is sketched below.
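To make this concrete, here is a minimal Python Worker using Cloudflare's documented on_fetch entry point. The routing and response text are illustrative, not taken from the announcement:

```python
# A minimal Cloudflare Python Worker.
# The `workers` module is provided by the Workers runtime (Pyodide).
from workers import Response

async def on_fetch(request, env):
    # Tiny illustrative router keyed on the request path.
    if request.url.endswith("/health"):
        return Response("ok")
    return Response(
        "Hello from a Python Worker!",
        headers={"content-type": "text/plain"},
    )
```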

Why it matters: This incident underscores the critical impact of configuration management in distributed systems. It highlights how rapid, global deployments without gradual rollouts and robust error handling can lead to widespread outages, even from seemingly minor code paths.

  • A 25-minute Cloudflare outage on Dec 5, 2025, impacted 28% of HTTP traffic due to a configuration change.
  • The incident stemmed from disabling an internal WAF testing tool as part of mitigating a React Server Components vulnerability (CVE-2025-55182).
  • A global configuration system, lacking gradual rollout, propagated a change that triggered a Lua runtime error in the FL1 proxy.
  • The error was an attempt to access a nil value ('rule_result.execute') when a killswitch skipped an "execute" action rule, a bug that had gone undetected for years (an illustrative sketch follows this list).
  • This highlights the need for robust type systems and safe deployment practices, especially for critical infrastructure.
  • Cloudflare acknowledges similar past incidents and is prioritizing enhanced rollouts and versioning to prevent future widespread impacts.
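The failure mode is easy to reproduce in miniature. Below is a Python analogue of the bug (Cloudflare's FL1 proxy runs Lua; all names here are hypothetical): a killswitch suppresses the step that populates a result, and a later unconditional field access fails, which a nil/None guard prevents:

```python
# Illustrative analogue of the FL1 bug; names are hypothetical.
class RuleResult:
    def __init__(self, execute=None):
        self.execute = execute  # populated only when the rule actually runs

def evaluate_rule(rule, killswitched: bool):
    if killswitched:
        # The killswitch skips the "execute" action entirely, leaving
        # the result unpopulated (Lua's nil, Python's None).
        return None
    return RuleResult(execute=lambda: print(f"running {rule}"))

def process(rule, killswitched):
    rule_result = evaluate_rule(rule, killswitched)
    # Buggy version: rule_result.execute() raises AttributeError here
    # (the Lua equivalent is an error on accessing a nil value).
    # Guarded version:
    if rule_result is not None and rule_result.execute is not None:
        rule_result.execute()

process("waf-test-rule", killswitched=True)  # safely does nothing
```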

Why it matters: This article demonstrates how to overcome legacy observability challenges by pragmatically integrating AI agents and context engineering, offering a blueprint for unifying fragmented data without costly overhauls.

  • Pinterest faced fragmented observability data (logs, traces, metrics) due to legacy infrastructure predating OpenTelemetry, hindering efficient root-cause analysis.
  • They adopted a pragmatic solution using AI agents and a Model Context Protocol (MCP) server to unify disparate observability signals without a full infrastructure overhaul.
  • The MCP server allows AI agents to interact simultaneously with various data pillars (metrics, logs, traces, change events) to find correlations and build hypotheses (a minimal server sketch follows this list).
  • This "context engineering" approach aims to provide intelligent agents with comprehensive data, leading to faster, clearer root-cause analysis and actionable insights.
  • The initiative represents a "shift-left" (proactive integration) and "shift-right" (production visibility) strategy, leveraging AI to overcome existing observability limitations.
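As a sketch of what such an integration can look like, the snippet below uses the FastMCP helper from the official MCP Python SDK to expose observability lookups as agent-callable tools. The tool names and stubbed data are assumptions, not Pinterest's actual implementation:

```python
# A minimal MCP server exposing observability lookups as agent tools.
# Uses the official `mcp` Python SDK; tool bodies are stubs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability")

@mcp.tool()
def query_metrics(service: str, window: str = "15m") -> str:
    """Return summary metrics for a service over a time window."""
    # A real server would query the metrics backend here.
    return f"p99 latency for {service} over {window}: 412ms"

@mcp.tool()
def recent_deploys(service: str) -> str:
    """List recent change events for a service (hypothetical source)."""
    return f"{service}: deploy v2.31 at 14:02 UTC"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```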

Why it matters: Custom agents in GitHub Copilot empower engineering teams to embed their unique rules and workflows directly into their AI assistant. This streamlines development, ensures consistency across the SDLC, and automates complex tasks, boosting efficiency and adherence to standards.

  • GitHub Copilot now supports custom agents, extending its AI assistance across the entire software development lifecycle, not just code generation.
  • These Markdown-defined agents act as domain experts, integrating team-specific rules, tools, and workflows for areas like observability, security, and IaC (see the example definition after this list).
  • Custom agents can be deployed at repository, organization, or enterprise levels and are accessible via Copilot CLI, VS Code Chat, and github.com.
  • They enable engineers to enforce standards, automate multi-step tasks, and integrate third-party tools directly within their development environment.
  • A growing ecosystem of partner-built agents is available for various domains, including security, databases, DevOps, and incident management.
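For flavor, a custom agent definition might look roughly like the following: a Markdown file whose YAML frontmatter names the agent and its allowed tools, followed by prose instructions. The frontmatter fields and contents here are an approximation; GitHub's documentation is authoritative:

```markdown
---
name: incident-responder
description: Guides on-call engineers through our incident runbook.
tools: ["read", "search", "terminal"]
---

You are our incident-response agent. When asked about an outage:
1. Check recent deploys and feature-flag changes first.
2. Follow the severity matrix in docs/runbooks/severity.md.
3. Draft a status update using our incident template.
```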

Why it matters: This article highlights the engineering complexities and architectural decisions behind building a robust, local-first distributed system for the physical world. It showcases how open-source governance can be a technical requirement for long-term project integrity and user control.

  • Home Assistant is a fast-growing open-source home automation platform, used in over 2 million households and attracting 21,000 contributors annually.
  • It champions a local-first architecture for privacy and interoperability, enabling control of thousands of devices on user hardware without cloud dependency.
  • The platform abstracts diverse devices into local entities with states and events, acting as a distributed event-driven runtime for complex home automations (see the example automation after this list).
  • This local-first approach presents significant engineering challenges, demanding optimizations for device discovery, state management, and network communication on constrained hardware.
  • Governance by the Open Home Foundation ensures its open-source integrity, protecting against commercial acquisition and maintaining its core local-first philosophy.
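The entity/state/event model shows up directly in Home Assistant's automation configuration. The YAML below is a small conventional example (entity IDs are hypothetical): a sun event triggers an action on a light entity, gated by a state condition, all evaluated locally:

```yaml
# Example automation: the state/event model in practice.
automation:
  - alias: "Porch light at sunset"
    trigger:
      - platform: sun
        event: sunset
    condition:
      # Only if someone is home (a state check on a person entity).
      - condition: state
        entity_id: person.alice
        state: home
    action:
      - service: light.turn_on
        target:
          entity_id: light.porch
        data:
          brightness_pct: 60
```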

Why it matters: This article highlights Azure's commitment to scaling its network for demanding AI workloads and enhancing resilience. Engineers gain insights into new features like zone-redundant NAT Gateway V2, crucial for building highly available and performant cloud-native applications.

  • Azure's global network has expanded to 18 Pbps WAN capacity, optimized for hyperscale AI and data workloads across 60+ AI regions.
  • The network fabric is specifically engineered for AI, integrating InfiniBand and high-speed Ethernet for low-latency, high-bandwidth GPU cluster communication and distributed AI WAN.
  • Azure is enhancing resiliency with zone-redundant services, including the public preview of Standard NAT Gateway V2.
  • Standard NAT Gateway V2 provides zone-redundant outbound connectivity, 100 Gbps throughput, 10M packets/sec, IPv6 support, and flow logs.

Why it matters: This release provides engineers with a powerful new AI model, Claude Opus 4.5, on Microsoft's platform, significantly boosting productivity, code quality, and enabling advanced agentic workflows for complex engineering challenges.

  • Claude Opus 4.5 is now in public preview on Microsoft Foundry, GitHub Copilot, and Microsoft Copilot Studio, marking a shift to AI as a genuine collaborator.
  • This new model excels in coding, agentic workflows, and enterprise productivity, outperforming previous versions and competitors at a better price point.
  • Opus 4.5 achieves state-of-the-art performance on software engineering benchmarks, scoring 80.9% on SWE-bench Verified, and improves multilingual coding and code generation.
  • It accelerates engineering velocity by handling complex tasks, interpreting ambiguous requirements, and reasoning about architectural tradeoffs.
  • Microsoft Foundry ensures Azure customers get immediate access to advanced AI models, supporting secure and scalable deployment of AI applications.

Why it matters: Zoomer is crucial for optimizing AI performance at Meta's massive scale, ensuring efficient GPU utilization, reducing energy consumption, and cutting operational costs. This accelerates AI development and innovation across all Meta products, from GenAI to recommendations.

  • Zoomer is Meta's automated, comprehensive platform for debugging and optimizing AI training and inference workloads at scale.
  • It provides deep performance insights, leading to significant energy savings, accelerated workflows, and improved efficiency across Meta's AI infrastructure.
  • The platform has reduced training times and improved Queries Per Second (QPS), making it Meta's primary tool for AI performance optimization.
  • Zoomer's architecture comprises an Infrastructure/Platform layer for scalability, an Analytics/Insights Engine for deep analysis (built on Kineto, Strobelight, and dyno telemetry; see the profiler sketch after this list), and a Visualization/UI layer for actionable insights.
  • It addresses critical challenges of GPU underutilization, operational costs, and suboptimal hardware use in large-scale AI environments.
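Zoomer itself is internal to Meta, but one of its named building blocks, Kineto, is the trace engine behind PyTorch's public profiler. The snippet below shows the kind of per-operator trace data such a system aggregates; the model and input are placeholders:

```python
# Collecting a Kineto-backed trace with the public PyTorch profiler.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(512, 512)   # placeholder workload
inputs = torch.randn(64, 512)

with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on GPU hosts
    record_shapes=True,
) as prof:
    model(inputs)

# Per-operator timing table: the raw material for utilization analysis.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```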

Why it matters: Optimizing tool selection for LLM agents significantly boosts performance and reliability. This approach reduces latency and improves success rates for AI assistants like GitHub Copilot, making them faster and more effective for developers.

  • GitHub Copilot Chat's performance was hindered by reasoning across hundreds of tools via the Model Context Protocol (MCP).
  • New systems, embedding-guided tool routing and adaptive tool clustering, were developed to optimize tool selection for LLM agents.
  • The default toolset was reduced from 40 to 13 core tools, and 'virtual tools' were introduced to group functionally similar tools.
  • Adaptive tool clustering uses Copilot's internal embedding model and cosine similarity to efficiently group tools, replacing slower LLM-based categorization.
  • Embedding-guided tool routing pre-selects the most semantically relevant tools based on query embeddings, reducing unnecessary exploratory calls (a minimal sketch follows this list).
  • These optimizations improved success rates by 2-5 percentage points on benchmarks and reduced response latency by an average of 400 milliseconds.
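A minimal sketch of embedding-guided routing, with a generic embed() standing in for Copilot's internal embedding model: embed each tool description once, embed the query, and pre-select the top-k tools by cosine similarity:

```python
# Embedding-guided tool routing: pre-select top-k tools for a query.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy pseudo-embedding (not semantic); a real system would call
    # an embedding model here and cache the results.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

TOOLS = {
    "read_file": "Read the contents of a file in the workspace",
    "run_tests": "Run the project's test suite and report failures",
    "search_code": "Search the codebase for a symbol or string",
    "open_pr": "Open a pull request with the current changes",
}

# Embed tool descriptions once, ahead of time.
tool_vecs = {name: embed(desc) for name, desc in TOOLS.items()}

def route(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    # Cosine similarity reduces to a dot product on unit vectors.
    ranked = sorted(tool_vecs, key=lambda n: float(q @ tool_vecs[n]), reverse=True)
    return ranked[:k]

print(route("why is test_parser failing on CI?"))
```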

Why it matters: Engineers can leverage Ax, an open-source ML-driven platform, to efficiently optimize complex systems like AI models and infrastructure. It streamlines experimentation, reduces resource costs, and provides deep insights into system behavior, accelerating development and deployment.

  • Ax 1.0 is an open-source adaptive experimentation platform leveraging machine learning for efficient optimization of complex systems.
  • It's widely used at Meta to improve AI models, tune production infrastructure, and accelerate advances in ML and hardware design.
  • The platform employs Bayesian optimization to guide resource-intensive experiments, identifying optimal configurations efficiently (see the example loop after this list).
  • Ax provides advanced analytical tools, including Pareto frontiers and sensitivity analysis, for deeper system understanding beyond just finding optimal settings.
  • An accompanying paper details Ax's core architecture, methodology, and performance comparison against other black-box optimization libraries.
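As a taste of Ax's service-style API (the toy objective and parameter names are ours, and exact signatures vary across Ax versions), a basic optimization loop looks like this:

```python
# Minimal Ax optimization loop over a toy 2-D objective (Booth function).
from ax.service.ax_client import AxClient, ObjectiveProperties

def booth(x1: float, x2: float) -> float:
    return (x1 + 2 * x2 - 7) ** 2 + (2 * x1 + x2 - 5) ** 2

ax_client = AxClient()
ax_client.create_experiment(
    name="booth_demo",
    parameters=[
        {"name": "x1", "type": "range", "bounds": [-10.0, 10.0]},
        {"name": "x2", "type": "range", "bounds": [-10.0, 10.0]},
    ],
    objectives={"booth": ObjectiveProperties(minimize=True)},
)

for _ in range(20):
    params, trial_index = ax_client.get_next_trial()
    ax_client.complete_trial(
        trial_index=trial_index,
        raw_data=booth(params["x1"], params["x2"]),
    )

best_params, _ = ax_client.get_best_parameters()
print(best_params)  # should approach the optimum at (1, 3)
```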