Curated topic

sre

Posts tagged with sre

Netflix Tech Blog | Apr 2, 2026

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events

Why it matters: Moving to VBR for live streaming balances video quality and bandwidth efficiency but introduces traffic volatility. Engineers must adapt capacity planning and steering logic to account for sudden bitrate spikes, ensuring CDN stability during high-concurrency global events.

Netflix transitioned its Live event encoding from Constant Bitrate (CBR) to Variable Bitrate (VBR) to optimize for scene complexity.
VBR implementation resulted in a 15% reduction in average bandwidth usage and a 10% reduction in peak minute traffic.
User experience improved with a 5% decrease in rebuffers and reduced startup delays due to more efficient data transfer.
The shift complicated capacity planning: a server that looks lightly loaded during low-bitrate scenes can accept too many sessions and become over-utilized when scene complexity suddenly spikes.
Netflix updated its traffic-steering logic to reserve server capacity based on nominal bitrates rather than real-time consumption to ensure stability.
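
The difference between reserving on nominal bitrate and trusting real-time consumption can be sketched as follows. This is a minimal illustration, not Netflix's steering code: the `Server` class, the `can_admit` function, and all numbers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Server:
    capacity_mbps: float          # total egress the server can sustain
    sessions: int = 0             # streams currently assigned to the server
    measured_mbps: float = 0.0    # real-time throughput (volatile under VBR)


def can_admit(server: Server, nominal_mbps: float, use_nominal: bool = True) -> bool:
    """Decide whether one more stream fits on the server."""
    if use_nominal:
        # Reserve the nominal bitrate for every session, whatever it uses right now.
        committed = server.sessions * nominal_mbps
    else:
        # Naive approach: trust the current measurement.
        committed = server.measured_mbps
    return committed + nominal_mbps <= server.capacity_mbps


# Hypothetical quiet scene: 100 sessions each using 2 Mbps against a 5 Mbps nominal rate.
server = Server(capacity_mbps=500.0, sessions=100, measured_mbps=200.0)
print(can_admit(server, nominal_mbps=5.0, use_nominal=False))  # True
print(can_admit(server, nominal_mbps=5.0, use_nominal=True))   # False
```

With the naive check the server keeps accepting sessions during the quiet scene and has no headroom left when every stream climbs back toward its nominal rate.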

#dist #sre

PlanetScale Tech Blog | Apr 2, 2026

Patterns for Postgres Traffic Control

Why it matters: This approach moves database resource management from reactive monitoring to proactive enforcement. By tagging queries at the application layer, teams can isolate noisy neighbors, protect critical paths, and limit the blast radius of new features without manual intervention.

Database Traffic Control enables resource budgeting for specific Postgres traffic slices to prevent runaway queries from impacting critical flows.
Implementation relies on the SQLCommenter format, appending URL-encoded key-value pairs as comments to SQL queries.
Go applications can use context-based helpers to propagate tags through the call stack and automatically inject them into database queries.
Isolation patterns include per-service roles, route-level tagging via middleware, and deployment-specific tags for canary releases.
Multi-tenant applications can enforce tier-based resource limits, such as Free vs. Enterprise, directly at the database level.
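
The tagging step can be sketched in a few lines. This is a rough illustration of the SQLCommenter comment shape (URL-encoded keys and values, values in single quotes, pairs sorted and comma-separated), not PlanetScale's helper library; the function name `add_sql_comment` and the `route` and `tier` tag names are assumptions.

```python
from urllib.parse import quote


def add_sql_comment(sql: str, tags: dict[str, str]) -> str:
    """Append tags to a SQL statement as a SQLCommenter-style trailing comment."""
    if not tags:
        return sql
    pairs = []
    for key, value in tags.items():
        # URL-encode both sides; quote() also encodes single quotes as %27.
        pairs.append(f"{quote(key, safe='')}='{quote(value, safe='')}'")
    pairs.sort()  # deterministic order makes tagged queries easier to group
    return f"{sql.rstrip()} /*{','.join(pairs)}*/"


tagged = add_sql_comment(
    "SELECT * FROM orders WHERE id = 1",
    {"tier": "enterprise", "route": "/checkout"},  # hypothetical tags
)
print(tagged)
# SELECT * FROM orders WHERE id = 1 /*route='%2Fcheckout',tier='enterprise'*/
```

In a real application the tags would normally be set once per request, for example in middleware, and injected by the database layer rather than at each call site.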

#data #sre

Slack Engineering | Mar 31, 2026

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Why it matters: As HTTP/3 and QUIC become standard, legacy monitoring tools often fail to provide visibility into UDP-based traffic. Open-sourcing these capabilities into Prometheus BBE enables engineers to monitor modern network protocols without relying on fragmented or proprietary solutions.

Slack faced a critical observability gap when rolling out HTTP/3 because existing SaaS and internal tools lacked support for UDP-based QUIC probing.
An engineering intern developed and open-sourced QUIC support for the Prometheus Blackbox Exporter (BBE) using the quic-go library.
The implementation integrated a new HTTP/3 transport into BBE's client while maintaining existing configuration patterns and composability.
The new probing system enables a unified view of HTTP/1.1, HTTP/2, and HTTP/3 metrics within Grafana for easier correlation and debugging.
Open-sourcing the contribution future-proofs Slack's infrastructure and provides the wider Prometheus community with native HTTP/3 monitoring capabilities.
Future roadmap items include Server Name Indication (SNI) routing validation and hop-by-hop end-to-end network path visualization.

#sre #dist

Cloudflare Blog | Mar 31, 2026

Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers

Why it matters: Engineers can now extend Cloudflare's DDoS protection with custom eBPF logic. This is crucial for proprietary UDP-based applications like gaming or VoIP, where generic rate limiting causes collateral damage. It provides granular, stateful control over traffic filtering at the network edge.

Cloudflare launched Programmable Flow Protection, allowing Magic Transit customers to deploy custom DDoS mitigation logic via eBPF.
The system addresses the limitations of generic DDoS protection for proprietary or custom UDP protocols used in gaming, VoIP, and streaming.
Customers can write eBPF programs to define precise 'good' versus 'bad' packet logic, which Cloudflare executes across its global edge network.
To ensure security and flexibility, these programs run in a userspace environment rather than the Linux kernel, using a specialized API for stateful mitigation.
The platform includes helper functions for storing client state, performing cryptographic validation, and issuing challenge packets to mitigate attacks without impacting legitimate users.

#security #dist #sre

PlanetScale Tech Blog | Mar 31, 2026

Graceful degradation in Postgres

Why it matters: Resource exhaustion often leads to total outages. Implementing graceful degradation at the database level ensures core services remain functional during traffic spikes, preventing a complete system failure by shedding non-critical load dynamically.

PlanetScale's Traffic Control introduces resource budgets to protect high-priority database queries during load spikes.
Engineers can categorize traffic into tiers like Critical, Important, and Best-effort to define degradation strategies.
The sqlcommenter standard enables tagging SQL statements with metadata for granular traffic identification.
Budgets support limits on server share, concurrency, and query duration to prevent resource starvation.
A warn mode allows for safe testing and tuning of limits against real-world traffic patterns before enforcement.
Dynamic budget adjustments enable live load shedding of non-essential features during viral events or DDoS attacks.
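
How a concurrency budget behaves in warn mode versus enforcement can be sketched as below. This is a toy model under stated assumptions, not PlanetScale's API: the `Budget` class, the `try_start` function, and the tier name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_concurrency: int
    enforce: bool = True   # False models warn mode: record the breach, let the query run
    in_flight: int = 0


def try_start(budget: Budget, warnings: list[str], tier: str) -> bool:
    """Return True if a query in this tier may start now."""
    if budget.in_flight >= budget.max_concurrency:
        if budget.enforce:
            return False                          # shed the query
        warnings.append(f"{tier}: over budget")   # warn mode: log only
    budget.in_flight += 1
    return True


warnings: list[str] = []
best_effort = Budget(max_concurrency=2, enforce=False)

# Warn mode: the third query exceeds the budget but still runs.
results = [try_start(best_effort, warnings, "best-effort") for _ in range(3)]
print(results, warnings)   # [True, True, True] ['best-effort: over budget']

# Switching to enforcement at runtime sheds further best-effort queries.
best_effort.enforce = True
print(try_start(best_effort, warnings, "best-effort"))   # False
```

Running in warn mode first shows how often a limit would have triggered against real traffic before any query is actually rejected.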

#data #sre

PlanetScale Tech Blog | Mar 30, 2026

High memory usage in Postgres is good, actually

Why it matters: Engineers often misinterpret high memory as a failure state. Distinguishing between beneficial caching and dangerous RSS pressure prevents unnecessary hardware scaling and helps teams correctly diagnose performance bottlenecks and OOM risks in database clusters.

High memory usage in Postgres is often desirable as the system uses RAM to cache data, significantly reducing slow disk I/O operations.
Postgres relies on two caching layers: the internal shared_buffers pool and the standard Linux OS page cache for frequently accessed data.
Memory is divided into reclaimable cache (active/inactive) and Resident Set Size (RSS), which is non-reclaimable process memory.
High RSS, rather than total memory usage, is the primary indicator of memory pressure and potential Out of Memory (OOM) risks.
RSS growth is driven by per-connection overhead, catalog bloat, and memory-intensive operations like sorts and hashes, each of which can use up to work_mem.
Connection pooling with tools like PgBouncer is the most effective way to reduce RSS by limiting the number of active backend processes.
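
A back-of-the-envelope estimate shows why fewer backends matter. The formula below is a rough worst-case sketch, not a measurement method from the post: the function name and the 10 MB per-backend baseline are assumptions, and 4 MB is the stock work_mem default.

```python
def estimate_backend_memory_mb(
    backends: int,
    base_mb_per_backend: float,
    work_mem_mb: float,
    ops_per_backend: int,
) -> float:
    """Rough upper bound on process memory across all Postgres backends.

    Each backend carries a baseline cost, and each sort or hash operation it
    runs may use up to work_mem on top of that.
    """
    return backends * (base_mb_per_backend + work_mem_mb * ops_per_backend)


# Hypothetical workload: two memory-hungry operations per backend.
direct = estimate_backend_memory_mb(500, base_mb_per_backend=10, work_mem_mb=4, ops_per_backend=2)
pooled = estimate_backend_memory_mb(50, base_mb_per_backend=10, work_mem_mb=4, ops_per_backend=2)
print(direct, pooled)   # 9000 900
```

Under these assumed numbers, pooling 500 client connections down to 50 backends cuts the worst-case figure by a factor of ten without touching work_mem.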

#data #sre

Cloudflare Blog | Mar 27, 2026

How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams

Why it matters: Visualizing code-based workflows is difficult due to dynamic logic like loops and parallel promises. Using ASTs to generate diagrams provides critical observability into complex durable executions, helping engineers debug and verify logic whether written by humans or AI agents.

Cloudflare Workflows now automatically generates visual diagrams for every deployment to improve observability of complex logic.
Unlike declarative workflow engines using YAML or JSON, Cloudflare uses Abstract Syntax Trees (ASTs) to statically derive graphs from dynamic JavaScript/TypeScript code.
The visualizer tracks Promise and await relationships to accurately represent parallel execution versus sequential blocking steps.
Diagrams are generated at deploy time by fetching bundled scripts and traversing an intermediate graph of WorkflowEntrypoints.
The underlying architecture uses a Durable Object engine to manage execution and dynamic dispatch to trigger user workers.
The system handles minified code from various bundlers like esbuild and rspack to extract step relationships and flow control.
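
The idea of statically deriving a step graph from await relationships can be illustrated with Python's own `ast` module. This is an analogy in a different language, not Cloudflare's visualizer, which works on bundled JavaScript/TypeScript; the `step(...)` convention and the function names are assumptions.

```python
import ast

# The source is only parsed, never executed, so asyncio need not be imported.
SOURCE = '''
async def run(step):
    user = await step("fetch_user")
    a, b = await asyncio.gather(step("send_email"), step("update_crm"))
    await step("finish")
'''


def step_name(call: ast.AST) -> str:
    """Return the first string argument of a call such as step("name")."""
    if isinstance(call, ast.Call) and call.args and isinstance(call.args[0], ast.Constant):
        return str(call.args[0].value)
    return "<unknown>"


def derive_graph(source: str) -> list:
    """List steps in source order; a nested list marks steps awaited together."""
    func = ast.parse(source).body[0]
    graph = []
    for stmt in func.body:
        awaited = getattr(stmt, "value", None)
        if not isinstance(awaited, ast.Await):
            continue
        call = awaited.value
        is_gather = (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and call.func.attr == "gather"
        )
        if is_gather:
            # Arguments of gather(...) run in parallel.
            graph.append([step_name(arg) for arg in call.args])
        else:
            # A plain await blocks until the step completes.
            graph.append(step_name(call))
    return graph


print(derive_graph(SOURCE))
# ['fetch_user', ['send_email', 'update_crm'], 'finish']
```

A production visualizer also has to follow loops, branches, and promises that are created in one place and awaited in another, which this sketch ignores.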

#dist #sre

GitHub Engineering | Mar 26, 2026

What’s coming to our GitHub Actions 2026 security roadmap

Why it matters: CI/CD pipelines are prime targets for supply chain attacks. GitHub's roadmap moves to secure-by-design infrastructure, providing engineers with deterministic dependencies, granular policy controls, and real-time observability to protect sensitive code and credentials.

Introducing workflow-level dependency locking to ensure deterministic, auditable CI/CD runs with cryptographic commit SHA verification.
Transitioning to immutable releases for the Actions ecosystem to prevent malicious code propagation via mutable tags and branches.
Implementing policy-driven execution via GitHub rulesets to centrally manage workflow triggers and allowed event types.
Enhancing credential security with scoped, least-privileged tokens to limit the blast radius of compromised CI environments.
Deploying real-time observability through an optional agent on runners to detect suspicious processes and unauthorized network activity.
Enforcing network boundaries to restrict egress traffic, preventing data exfiltration from CI/CD runners during execution.
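
Until such locking is available, the risk from mutable tags and branches can be approximated with a simple lint that flags action references not pinned to a full commit SHA. The sketch below is an independent illustration, not a GitHub feature; the function name is an assumption and the SHA in the example is a placeholder.

```python
import re

# owner/repo[/path]@<40 hex chars> is a reference pinned to a full commit SHA.
PINNED = re.compile(r"^[\w.-]+/[\w./-]+@[0-9a-f]{40}$")
USES = re.compile(r"\s*-?\s*uses:\s*(\S+)")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that point at a mutable tag or branch."""
    found = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        ref = match.group(1)
        if ref.startswith("./"):
            continue  # local actions live in the same repository
        if not PINNED.match(ref):
            found.append(ref)
    return found


WORKFLOW = """
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@0123456789abcdef0123456789abcdef01234567
  - uses: ./.github/actions/build
"""
print(unpinned_actions(WORKFLOW))   # ['actions/checkout@v4']
```

The check is line-based and does not parse YAML, so it is only suitable as a quick audit of conventionally formatted workflow files.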

#security #sre

Cloudflare Blog | Mar 26, 2026

A one-line Kubernetes fix that saved 600 hours a year

Why it matters: Default Kubernetes volume management can cause massive downtime for stateful apps with many small files. Understanding fsGroupChangePolicy is crucial for SREs to prevent recursive ownership checks from blocking pod startups and wasting hundreds of engineering hours.

Atlantis restarts were taking 30 minutes due to a large number of files on the Persistent Volume (PV) causing inode exhaustion and slow mounts.
By default, Kubernetes recursively changes the ownership and permissions (chgrp -R) of every file in a volume each time it is mounted, if fsGroup is specified in the pod's securityContext.
For volumes with millions of files, this recursive check becomes a significant bottleneck, leading to context deadline exceeded errors in kubelet.
The issue was diagnosed by analyzing kubelet systemd logs rather than standard pod events, which failed to show the underlying volume mounting delay.
The fix involved setting fsGroupChangePolicy to OnRootMismatch, which triggers the recursive change only if the ownership and permissions of the volume's root directory don't match what is expected.
This one-line configuration change reduced restart times from 30 minutes to under 2 minutes, saving approximately 600 hours of engineering time annually.
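
The change itself amounts to one extra field in the pod-level securityContext. The sketch below expresses it as a patch on a pod spec held in a Python dict; the helper name, the fsGroup value, and the container name are assumptions for illustration, while `fsGroupChangePolicy` and `OnRootMismatch` are the real Kubernetes field and value.

```python
import json


def with_root_mismatch_policy(pod_spec: dict) -> dict:
    """Return a copy of the pod spec with fsGroupChangePolicy set to OnRootMismatch."""
    security_context = dict(pod_spec.get("securityContext", {}))
    security_context["fsGroupChangePolicy"] = "OnRootMismatch"
    return {**pod_spec, "securityContext": security_context}


# Hypothetical pod spec fragment for illustration.
pod_spec = {
    "securityContext": {"fsGroup": 1000},
    "containers": [{"name": "atlantis"}],
}
patched = with_root_mismatch_policy(pod_spec)
print(json.dumps(patched["securityContext"], indent=2))
```

The original dict is left untouched, so the same helper can be applied safely when generating manifests for several workloads.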

#sre #security

PlanetScale Tech Blog | Mar 26, 2026

Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI

Why it matters: This partnership simplifies infrastructure management by centralizing database provisioning and billing within the Stripe CLI. It addresses workflow fragmentation and provides a standardized way for developers and AI agents to handle credentials and payments across service providers.

PlanetScale is a launch partner for Stripe Projects, a new developer preview for centralizing tool provisioning and billing.
Engineers can now provision fully managed MySQL or Postgres databases directly from the Stripe CLI without leaving the terminal.
The integration simplifies credential management by allowing users to sync database connection strings directly to local .env files.
Stripe Projects aims to standardize the provisioning and credential handoff process, specifically addressing gaps highlighted by AI coding agents.
The workflow reduces context switching between dashboards and manual payment entry for infrastructure services.

#data #finops #sre
