Posts tagged with sre

Salesforce Engineering, Jan 21, 2026

How Agentforce Runs Secure AI Agents at 11 Million Calls Per Day

Why it matters: Securing AI agents at scale requires balancing rapid innovation with enterprise-grade protection. This architecture demonstrates how to manage 11M+ daily calls by decoupling security layers, ensuring multi-tenant reliability, and maintaining request integrity across distributed systems.

Salesforce's Developer Access team manages a secure access plane for Agentforce, handling over 11 million daily agent calls across production environments.
The architecture uses a layered access-control plane that separates authentication at the edge from authorization within the core platform, reducing latency and operational risk.
A middle-layer API service acts as a technical control point, ensuring all agentic traffic follows consistent security protocols and cannot bypass protection boundaries.
Security invariants include edge-level authentication validation, core-platform-enforced authorization, and end-to-end request integrity using Salesforce-minted tokens.
The system is designed to contain multi-tenant blast radius risks, preventing runaway agents or malformed requests from impacting other customers in a shared environment.
Strict egress traffic filtering and cross-boundary revalidation are employed to maintain the principle of least privilege across the distributed compute layer.

#security #dist #sre


Salesforce Engineering, Jan 15, 2026

How a Mock LLM Service Cut $500K in AI Benchmarking Costs, Boosted Developer Productivity

Why it matters: Benchmarking AI systems against live providers is expensive and noisy. This mock service provides a deterministic, cost-effective way to validate performance and reliability at scale, allowing engineers to iterate faster without financial friction or external latency fluctuations.

Salesforce developed an internal LLM mock service to simulate AI provider behavior, supporting benchmarks of over 24,000 requests per minute.
The service reduced annual token-based costs by over $500,000 by replacing live LLM dependencies during performance and regression testing.
Deterministic latency controls allow engineers to isolate internal code performance from external provider variability, ensuring repeatable results.
The mock layer enables rapid scale and failover benchmarking by simulating high-volume traffic and controlled outages without external infrastructure.
By providing a shared platform capability, the service accelerates development loops and improves confidence in performance signals.
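The deterministic-latency idea summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Salesforce's service: the class name `MockLLM`, the latency formula (a fixed base plus a per-word cost), and the `fail_every` outage knob are all hypothetical.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class MockLLM:
    """Hypothetical stand-in for a live LLM provider during benchmarks."""
    base_latency_ms: float = 200.0          # assumed fixed overhead per call
    per_token_ms: float = 5.0               # assumed cost per whitespace-separated word
    fail_every: int = 0                     # 0 disables simulated outages
    sleep: Callable[[float], None] = time.sleep  # injectable so tests need not wait
    _calls: int = 0

    def latency_ms(self, prompt: str) -> float:
        # Latency depends only on the prompt, so repeated runs are comparable.
        return self.base_latency_ms + self.per_token_ms * len(prompt.split())

    def complete(self, prompt: str) -> str:
        self._calls += 1
        # Every Nth call fails, giving a controlled outage pattern.
        if self.fail_every and self._calls % self.fail_every == 0:
            raise ConnectionError("simulated provider outage")
        self.sleep(self.latency_ms(prompt) / 1000.0)
        # The response is a pure function of the prompt: no tokens are billed.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        return f"mock-response-{digest}"
```

Because both the delay and the response are pure functions of the prompt, any run-to-run difference in a benchmark comes from the code under test rather than from the provider.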

#mlp #finops #sre


GitHub Engineering, Jan 15, 2026

When protections outlive their purpose: A lesson on managing defense systems at scale

Why it matters: Security mitigations added during incidents can become technical debt that degrades user experience. This case study emphasizes the need for lifecycle management and observability in defense systems to ensure temporary protections don't inadvertently block legitimate traffic as patterns evolve.

GitHub identified that emergency defense mechanisms, such as rate limits and traffic controls, were inadvertently blocking legitimate users after outliving their original purpose.
The issue stemmed from composite signals that combined industry-standard fingerprinting with platform-specific business logic, leading to false positives during normal browsing.
While the false-positive rate was low (0.003-0.004% of total traffic), it caused consistent disruption for logged-out users following external links.
The investigation involved tracing requests across a multi-layered infrastructure built on HAProxy to pinpoint which specific defense layer was triggering the blocks.
The incident reinforces that observability and lifecycle management are as critical for security mitigations as they are for core product features.
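One way to picture lifecycle management for mitigations is to attach an owner and a review deadline to every emergency rule. The sketch below is only an illustration of that idea: the `Mitigation` record, its fields, and `due_for_review` are hypothetical and are not taken from GitHub's tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Mitigation:
    """Hypothetical record for an emergency defense rule."""
    name: str
    owner: str
    created_at: datetime
    review_after: timedelta  # how long the rule may live before someone re-justifies it


def due_for_review(rules: List[Mitigation], now: datetime) -> List[str]:
    """Return the names of rules whose review window has elapsed."""
    return [r.name for r in rules if now - r.created_at >= r.review_after]
```

A periodic job could call `due_for_review` and open a ticket for each returned rule, so that a temporary protection is either renewed deliberately or removed.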

#sre #security


GitHub Engineering, Jan 14, 2026

GitHub Availability Report: December 2025

Why it matters: This report highlights the operational challenges of scaling AI-integrated services and global infrastructure. It provides insights into managing model-backed dependencies, handling cross-cloud network issues, and mitigating traffic spikes to maintain high availability for developer tools.

A Kafka misconfiguration prevented agent session data from reaching the AI Controls page, leading to improved pre-deployment validation.
Copilot Code Review experienced degradation due to model-backed dependency latency, mitigated by bypassing fix suggestions and increasing worker capacity.
Network packet loss between West US runners and an edge site caused GitHub Actions timeouts, resolved by rerouting traffic away from the affected site.
A database migration caused schema drift that blocked Copilot policy updates, resulting in hardened service synchronization and deployment pipelines.
Unauthenticated traffic spikes to search endpoints caused page load failures, addressed through improved limiters and proactive traffic monitoring.

#sre #dist #mlp


PlanetScale Tech Blog, Jan 14, 2026

Database Transactions

Why it matters: Understanding transaction internals like MVCC and undo logs is crucial for optimizing database performance, managing concurrency, and ensuring data integrity. It helps engineers choose between Postgres and MySQL based on their specific storage and maintenance needs.

Transactions ensure atomicity by grouping multiple SQL operations into a single unit that either fully succeeds via commit or fails via rollback.
Postgres implements consistent reads through multi-versioning, using xmin and xmax metadata to track row visibility across concurrent sessions.
MySQL achieves isolation by overwriting rows in place while maintaining an undo log, from which it reconstructs previous versions for other transactions.
Postgres requires periodic maintenance via VACUUM to reclaim storage space from obsolete row versions created during updates.
Consistent reads allow transactions to maintain an isolated view of data, preventing interference from simultaneous external modifications.
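The xmin/xmax visibility rule can be illustrated with a deliberately simplified model. In this sketch a snapshot is just the set of transaction IDs that had committed when the reader started; real Postgres visibility checks also handle aborted transactions, hint bits, and more, so treat `RowVersion` and `is_visible` as hypothetical teaching names rather than Postgres internals.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RowVersion:
    value: str
    xmin: int                   # ID of the transaction that created this version
    xmax: Optional[int] = None  # ID of the transaction that deleted or replaced it


def is_visible(row: RowVersion, snapshot: FrozenSet[int], own_txid: int) -> bool:
    """Simplified check: a version is visible if its creator is visible
    to the reader and its deleter (if any) is not."""
    created = row.xmin == own_txid or row.xmin in snapshot
    deleted = row.xmax is not None and (
        row.xmax == own_txid or row.xmax in snapshot
    )
    return created and not deleted


# An UPDATE by transaction 20 leaves two versions of the same logical row.
versions = [RowVersion("old", xmin=10, xmax=20), RowVersion("new", xmin=20)]
```

A reader whose snapshot contains only transaction 10 keeps seeing `old` even after transaction 20 commits, which is the consistent-read behavior described above. The `old` version stays on disk until no snapshot needs it, which is why VACUUM is required.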

#data #sre


Cloudflare Blog, Jan 14, 2026

What came first: the CNAME or the A record?

Why it matters: This incident highlights how subtle optimizations can break systems by violating undocumented assumptions in legacy clients. It serves as a reminder that even when a protocol doesn't mandate order, real-world implementations often depend on it.

A memory optimization in Cloudflare's 1.1.1.1 resolver inadvertently changed the order of records in DNS responses.
The code change moved CNAME records to the end of the answer section instead of the beginning when merging cached partial chains.
While the DNS protocol technically treats record order as irrelevant, many client implementations process records sequentially.
Legacy implementations like glibc's getaddrinfo fail to resolve addresses if the A record appears before the CNAME that defines the alias.
The incident was resolved by reverting the optimization, restoring the original record ordering where CNAMEs precede final answers.
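The ordering dependency is easy to reproduce with a toy single-pass resolver. The function below is a hypothetical model of a client that processes the answer section sequentially; it is not glibc's code, and the record tuples use documentation-reserved names and addresses.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (owner name, record type, data)


def resolve_sequential(qname: str, answers: List[Record]) -> List[str]:
    """Follow a CNAME chain in one forward pass over the answer section."""
    expected = qname
    addresses: List[str] = []
    for name, rtype, data in answers:
        if name != expected:
            continue  # record is for a name we are not (yet) looking for
        if rtype == "CNAME":
            expected = data  # from here on, look for the alias target
        elif rtype == "A":
            addresses.append(data)
    return addresses


cname_first = [
    ("www.example.com", "CNAME", "cdn.example.net"),
    ("cdn.example.net", "A", "192.0.2.1"),
]
a_first = list(reversed(cname_first))
```

With `cname_first` the pass learns the alias before it meets the A record and returns the address. With `a_first` the A record is skipped because its owner name is not yet expected, so the result is empty even though both responses carry the same records.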

#sre #dist


Airbnb Engineering, Jan 12, 2026

Pay As a Local

Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.

Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and SCA.
Categorized 20+ local payment methods into three standardized archetypes (Redirect, Async, and Direct flows) to maximize code reuse.
Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions.
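Idempotency matters here because webhooks and polling can both deliver the same confirmation, and clients retry on timeouts. A minimal sketch of the idea follows, assuming an in-memory dictionary as the key store; `IdempotentCharger` and `charge_fn` are hypothetical names, and a production system would need a durable store with an atomic insert-if-absent.

```python
from typing import Callable, Dict


class IdempotentCharger:
    """Hypothetical wrapper that executes each idempotency key at most once."""

    def __init__(self, charge_fn: Callable[[int], str]) -> None:
        self._charge_fn = charge_fn          # assumed call into a PSP connector
        self._results: Dict[str, str] = {}   # in-memory only; not durable

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with a known key returns the stored result
        # instead of charging the customer a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._charge_fn(amount_cents)
        self._results[idempotency_key] = result
        return result
```

This version is not safe under concurrent requests for the same key; it only shows the contract that a retried request observes the original outcome.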

#dist #finops #sre


Salesforce Engineering, Jan 7, 2026

Migration at Scale: Moving Marketing Cloud Caching from Memcached to Redis at 1.5M RPS Without Downtime

Why it matters: This migration provides a blueprint for modernizing stateful infrastructure at massive scale. It demonstrates how to achieve engine-level transitions without downtime or application changes while maintaining sub-millisecond performance and high availability.

Successfully migrated Marketing Cloud's caching layer from Memcached to Redis Cluster at 1.5M RPS with zero downtime.
Implemented a Dynamic Cache Router to enable percentage-based traffic shifts and double-writes for cache warm-up without application code changes.
Addressed functional parity risks by standardizing TTL semantics and key-handling behaviors across more than 50 distinct services.
Utilized service grouping by key ownership to prevent split-brain scenarios and data inconsistencies during the transition.
Maintained strict performance SLAs throughout the migration, sustaining P50 latency near 1ms and P99 latency around 20ms.
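A percentage-based shift with double-writes might look roughly like the sketch below. It assumes reads are bucketed by a hash of the key so that a given key always takes the same path; the class name, its parameters, and the use of plain dictionaries in place of Memcached and Redis clients are all hypothetical.

```python
import zlib
from typing import MutableMapping, Optional


class DynamicCacheRouter:
    """Hypothetical router placed between the application and two cache backends."""

    def __init__(
        self,
        old: MutableMapping[str, str],
        new: MutableMapping[str, str],
        new_read_percent: int = 0,
        double_write: bool = True,
    ) -> None:
        self.old = old
        self.new = new
        self.new_read_percent = new_read_percent  # 0..100, adjustable at runtime
        self.double_write = double_write

    @staticmethod
    def _bucket(key: str) -> int:
        # Stable bucket in 0..99, so the same key is always routed the same way.
        return zlib.crc32(key.encode("utf-8")) % 100

    def set(self, key: str, value: str) -> None:
        self.old[key] = value
        if self.double_write:
            self.new[key] = value  # warms the new cache before reads move over

    def get(self, key: str) -> Optional[str]:
        use_new = self._bucket(key) < self.new_read_percent
        backend = self.new if use_new else self.old
        return backend.get(key)
```

Raising `new_read_percent` step by step moves read traffic to the new backend while the application keeps calling the same `get` and `set` methods, which is how a shift can avoid application code changes.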

#sre #dist #data


Salesforce Engineering, Jan 6, 2026

Scaling Sales Agents: Engineering Next-Gen AI for the Enterprise Era

Why it matters: Scaling AI agents to enterprise levels requires moving beyond simple task assignment to robust orchestration. This architecture shows how to manage LLM rate limits and provider constraints using queues and dispatchers, ensuring reliability for high-volume, time-sensitive workflows.

Transitioned from a single-agent MVP to a dispatcher-orchestrated multi-agent architecture to support over 1 million monthly outreach actions.
Implemented persistent queuing to decouple task arrival from processing, creating a natural buffer for workload spikes and preventing retry storms.
Developed a constraint engine to enforce provider-specific quotas and LLM rate limits, ensuring compliance with Gmail and O365 delivery caps.
Utilized fairness algorithms like Round-Robin and priority-aware polling to prevent resource monopolization and ensure timely processing of urgent tasks.
Adopted a phased scaling strategy to grow throughput from 15,000 to over 1 million messages per month through parallel execution across 20 agents.
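The interaction between queuing, quotas, and round-robin fairness can be sketched as a single dispatch pass. This is an illustration under assumptions, not the Salesforce dispatcher: the function name, the per-tenant `quota` map, and the overall `budget` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def dispatch_round_robin(
    queues: Dict[str, Deque[int]],
    quota: Dict[str, int],
    budget: int,
) -> List[Tuple[str, int]]:
    """Take one task per tenant per round until the overall budget is spent,
    every queue is empty, or every remaining tenant has hit its quota."""
    sent: List[Tuple[str, int]] = []
    used = {tenant: 0 for tenant in queues}
    progress = True
    while len(sent) < budget and progress:
        progress = False
        for tenant, tasks in queues.items():
            if len(sent) >= budget:
                break
            # Skip tenants with nothing queued or with no quota left.
            if tasks and used[tenant] < quota.get(tenant, 0):
                sent.append((tenant, tasks.popleft()))
                used[tenant] += 1
                progress = True
    return sent
```

Tasks that exceed a tenant's quota simply stay in the queue for a later pass, so a busy tenant is delayed rather than allowed to starve the others.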

#dist #mlp #sre


Microsoft Azure Blog, Jan 5, 2026

Microsoft’s strategic AI datacenter planning enables seamless, large-scale NVIDIA Rubin deployments

Why it matters: Azure's proactive infrastructure design ensures engineers can deploy next-gen AI models on NVIDIA Rubin hardware immediately. By solving power, cooling, and networking bottlenecks at the datacenter level, Microsoft enables massive-scale AI training and inference with minimal friction.

Azure's datacenter infrastructure is pre-engineered to support NVIDIA's Rubin platform, including Vera Rubin NVL72 racks.
The Rubin platform delivers a 5x performance jump over GB200, offering 50 petaFLOPS (PF) of NVFP4 inference per chip and 3.6 exaFLOPS (EF) per 72-GPU NVL72 rack.
Infrastructure upgrades include 6th-gen NVLink fabric with ~260 TB/s bandwidth and ConnectX-9 1,600 Gb/s scale-out networking.
Azure utilizes a systems approach, integrating liquid cooling, Azure Boost offload engines, and Azure Cobalt CPUs to optimize GPU utilization.
Advanced memory architectures like HBM4/HBM4e and SOCAMM2 are supported through pre-validated thermal and density planning.

#mlp #dist #sre

