Posts tagged with dist

Cloudflare Blog, Feb 23, 2026

Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform

Why it matters: With NIST planning to deprecate quantum-vulnerable public-key algorithms such as RSA and elliptic-curve cryptography after 2030, engineers must adopt post-quantum standards now to prevent 'Harvest Now, Decrypt Later' attacks. This update provides built-in crypto agility for SASE, simplifying the transition to quantum-resistant networking.

Cloudflare One is the first SASE platform to implement post-quantum (PQ) encryption across its entire suite, including Secure Web Gateway, Zero Trust, and WAN.
The platform utilizes hybrid ML-KEM (Module-Lattice-based Key-Encapsulation Mechanism) alongside classical ECDHE to provide quantum-resistant key agreement.
Support extends to Cloudflare IPsec and the Cloudflare One Appliance, securing both site-to-site and outbound internet traffic against future quantum threats.
The implementation specifically targets 'Harvest Now, Decrypt Later' attacks by upgrading the key establishment phase of cryptographic handshakes.
Cloudflare One Appliance version 2026.2.0 is generally available with these upgrades, while the Cloudflare IPsec upgrade has entered closed beta.
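
As a rough illustration of the hybrid key agreement described above, the sketch below derives one session key from two independent shared secrets, so the result stays secret as long as either input does. It is not Cloudflare's implementation: the random byte strings stand in for real ML-KEM and ECDHE outputs (Python's standard library implements neither), and combine_hybrid, the all-zero salt, and the "hybrid-demo" label are assumptions made for this example.

```python
import hashlib
import hmac
import secrets


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def combine_hybrid(pq_secret: bytes, classical_secret: bytes) -> bytes:
    """Hypothetical combiner: concatenate both shared secrets and run them through a KDF."""
    return hkdf_sha256(pq_secret + classical_secret, salt=b"\x00" * 32, info=b"hybrid-demo")


# Stand-ins for the two shared secrets (assumption: real code would obtain these
# from an ML-KEM decapsulation and an ECDHE exchange).
pq_secret = secrets.token_bytes(32)
classical_secret = secrets.token_bytes(32)
session_key = combine_hybrid(pq_secret, classical_secret)
```

Because both secrets feed the same derivation, an attacker who records traffic today would need to break both the classical and the post-quantum exchange to recover session_key.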

#security #dist


Cloudflare Blog, Feb 21, 2026

Cloudflare outage on February 20, 2026

Why it matters: This incident highlights the risks of automated configuration propagation in global networks. It demonstrates how a single API change can trigger widespread BGP withdrawals and how software bugs can complicate recovery, emphasizing the need for 'fail small' deployment strategies.

A configuration change to the BYOIP management pipeline caused the unintentional withdrawal of approximately 1,100 BGP prefixes.
The outage lasted 6 hours and 7 minutes, impacting services including Magic Transit, Spectrum, and Dedicated Egress.
While many prefixes were restored by reverting the change, a software bug deleted edge configurations for 300 prefixes, requiring manual restoration.
The Addressing API's immediate propagation mechanism meant that the configuration error was distributed globally to the edge almost instantly.
Impacted customers experienced BGP Path Hunting, where connections repeatedly failed while trying to find valid routes to destination IPs.
Cloudflare is implementing a 'Fail Small' resilience plan to improve how these systems roll out changes and prevent global-scale failures.
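
One way to picture a 'fail small' safeguard for the kind of change described above is a pre-flight check that refuses any update withdrawing more than a small share of advertised prefixes. This is a hypothetical sketch, not Cloudflare's pipeline: review_prefix_change, ChangeVerdict, and the 5% threshold are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeVerdict:
    allowed: bool
    withdrawn: frozenset
    reason: str


def review_prefix_change(current, proposed, max_withdraw_fraction=0.05):
    """Reject a change that would withdraw too large a share of advertised prefixes.

    The 5% default is an arbitrary illustrative limit, not a published value.
    """
    current, proposed = frozenset(current), frozenset(proposed)
    withdrawn = current - proposed
    if not current:
        return ChangeVerdict(True, withdrawn, "nothing advertised yet")
    fraction = len(withdrawn) / len(current)
    if fraction > max_withdraw_fraction:
        return ChangeVerdict(False, withdrawn, f"would withdraw {fraction:.0%} of prefixes")
    return ChangeVerdict(True, withdrawn, "within limit")
```

A guard like this only bounds the damage of a single change; it complements, rather than replaces, gradual propagation and fast rollback.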

#sre #dist


Cloudflare Blog, Feb 20, 2026

Code Mode: give agents an entire API in 1,000 tokens

Why it matters: Code Mode solves the context window bottleneck for AI agents by replacing thousands of tool definitions with a programmable interface. This allows agents to interact with massive APIs efficiently and securely, significantly reducing token costs and latency while improving task performance.

Cloudflare's Code Mode reduces AI agent context usage by 99.9%, fitting the entire Cloudflare API into just 1,000 tokens.
It replaces thousands of individual tool definitions with two primary tools: search() for discovery and execute() for action.
Agents generate JavaScript code to interact with a typed SDK, enabling complex multi-step operations in a single round trip.
Execution occurs within a secure Dynamic Worker isolate, a V8 sandbox that prevents prompt injection leaks and unauthorized access.
This pattern allows agents to handle massive APIs that would otherwise exceed the context limits of modern foundation models.
Cloudflare has open-sourced a Code Mode SDK to facilitate this pattern in third-party MCP servers and AI agents.
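
The two-tool pattern above can be sketched in a few lines. Everything here is hypothetical: the three-entry CATALOG, the FakeApi class, and the search/execute signatures are invented for illustration and are not Cloudflare's SDK. The sketch is in Python rather than the JavaScript that Code Mode agents generate, and Python's exec is not a security sandbox, so it only mimics the shape of the interface, not the isolation a V8 isolate provides.

```python
# Hypothetical operation catalog; a real API would have thousands of entries.
CATALOG = {
    "zones.list": "List zones in the account",
    "dns.records.create": "Create a DNS record in a zone",
    "dns.records.list": "List DNS records in a zone",
}


class FakeApi:
    """In-memory stand-in for a typed SDK (assumption for this example)."""

    def __init__(self):
        self.records = []

    def call(self, op, **params):
        if op not in CATALOG:
            raise KeyError(op)
        if op == "dns.records.create":
            self.records.append(params)
            return params
        if op == "dns.records.list":
            return list(self.records)
        return ["example.com"]


def search(query):
    """Discovery tool: return operation names whose name or description matches."""
    q = query.lower()
    return sorted(n for n, doc in CATALOG.items() if q in n or q in doc.lower())


def execute(code, api):
    """Action tool: run agent-written code against the API and return `result`.

    WARNING: exec() with emptied builtins is NOT a sandbox; this is illustrative only.
    """
    scope = {"__builtins__": {}, "api": api, "result": None}
    exec(code, scope)
    return scope["result"]


# One round trip performs three API calls.
snippet = """
api.call("dns.records.create", name="a", content="192.0.2.1")
api.call("dns.records.create", name="b", content="192.0.2.2")
result = api.call("dns.records.list")
"""
records = execute(snippet, FakeApi())
```

The agent's context only needs the descriptions of search and execute; the catalog itself stays on the server and is queried on demand.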

#mlp #security #dist


Spotify Engineering, Feb 19, 2026

Our Multi-Agent Architecture for Smarter Advertising

Why it matters: This shift from monolithic AI features to a multi-agent architecture demonstrates how to scale complex ML systems. It provides a blueprint for managing autonomous components that collaborate to solve high-stakes business problems like ad optimization.

Spotify transitioned from simple AI features to a robust multi-agent architecture for its advertising platform.
The architecture addresses structural inefficiencies by delegating specialized tasks to autonomous agents.
This approach enables better decision-making and optimization across complex advertising workflows.
The system focuses on scalability and modularity, allowing for independent agent updates and improvements.
By using multi-agent systems, Spotify can handle high-dimensional data and real-time constraints more effectively.

#mlp #dist #data


PlanetScale Tech Blog, Feb 19, 2026

Faster PlanetScale Postgres connections with Cloudflare Hyperdrive

Why it matters: This article provides a blueprint for building high-concurrency, real-time applications by combining edge computing with optimized database pooling. It demonstrates how to minimize latency between globally distributed users and centralized stateful databases.

Cloudflare Hyperdrive optimizes Postgres performance by automating connection pooling and cutting the latency of the seven-step connection handshake.
PlanetScale Postgres Metal provides a high-performance backend using locally-attached NVMe SSDs rather than network-attached storage.
The architecture leverages Cloudflare Workers for global distribution and Durable Objects to manage stateful WebSocket connections for real-time updates.
Engineers must evaluate 'smart placement' to decide whether running Workers closer to the database or closer to the user yields better latency for their specific workload.
Hyperdrive consists of an edge component for connection preparation and a connection pooler located physically near the database to maintain warm connections.
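
To see why keeping warm connections near the database helps, consider this minimal pooling sketch. It is a generic illustration, not Hyperdrive's design: FakeConnection (which merely counts handshakes) and ConnectionPool are hypothetical names, and a real pooler would also handle timeouts, health checks, and concurrency.

```python
from contextlib import contextmanager


class FakeConnection:
    """Stand-in for a database connection; counts how many handshakes occurred."""

    handshakes = 0

    def __init__(self):
        FakeConnection.handshakes += 1  # each new connection pays the full handshake

    def query(self, sql):
        return f"ran: {sql}"


class ConnectionPool:
    """Hands out idle connections when available instead of opening new ones."""

    def __init__(self, max_idle=2):
        self._idle = []
        self._max_idle = max_idle

    @contextmanager
    def connection(self):
        conn = self._idle.pop() if self._idle else FakeConnection()
        try:
            yield conn
        finally:
            # Return the connection to the pool so the next caller skips the handshake.
            if len(self._idle) < self._max_idle:
                self._idle.append(conn)
```

With this pool, any number of sequential requests share a single handshake; only concurrent demand forces additional connections to be opened.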

#dist #data #sre


GitHub Engineering, Feb 18, 2026

What to expect for open source in 2026

Why it matters: As open source scales globally and AI-generated contributions surge, engineers must shift from ad-hoc management to formal governance and automated triaging. This shift is vital for building sustainable projects that can handle increased volume without burning out maintainers.

Open source is becoming increasingly global, with significant developer growth in India, Brazil, and Indonesia requiring asynchronous communication strategies.
AI has lowered the entry barrier for new developers but introduced 'AI slop,' leading to a high volume of low-quality pull requests and issues.
Maintainers are adopting AI defensively to automate triage, label issues, and detect duplicates to manage the influx of contributions.
Sustainable projects must implement formal governance models and clear advancement paths from contributor to maintainer to prevent burnout.
While AI-focused projects represent 60% of top growth, established tools like VS Code continue to thrive by supporting broad international communities.

#culture #mlp #dist


Airbnb Engineering, Feb 18, 2026

Safeguarding Dynamic Configuration Changes at Scale

Why it matters: Dynamic configuration is a powerful but risky tool. Airbnb's approach demonstrates how to treat configuration with the same rigor as code, using staged rollouts and architectural separation to prevent global outages while maintaining developer velocity.

Airbnb's Sitar platform manages dynamic configurations using a Git-based workflow to provide versioning, peer reviews, and automated CI/CD validation.
The architecture separates the control plane, which handles rollout logic and authorization, from the data plane, which manages high-scale distribution.
Safety is enforced through staged rollouts that gradually expand the blast radius across AWS zones or Kubernetes pod percentages.
A sidecar agent model fetches configurations and maintains a local cache, ensuring low-latency access and system resilience during network partitions.
The platform supports multi-tenancy, allowing individual teams to define custom guardrails, deployment triggers, and risk profiles for their services.
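
A percentage-based staged rollout like the one described above is often implemented by hashing each target into a stable bucket. The sketch below shows that general technique; it is not Sitar's code, and STAGES, bucket, receives_new_config, and pods_at_stage are names invented for this example.

```python
import hashlib

# Hypothetical rollout stages, as percentages of pods.
STAGES = (1, 10, 50, 100)


def bucket(config_key: str, pod_id: str) -> int:
    """Map a (config, pod) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{config_key}:{pod_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def receives_new_config(config_key: str, pod_id: str, percent: int) -> bool:
    """A pod is in the rollout once the stage percentage exceeds its bucket."""
    return bucket(config_key, pod_id) < percent


def pods_at_stage(config_key, pods, percent):
    """Return the set of pods that should receive the new value at this stage."""
    return {p for p in pods if receives_new_config(config_key, p, percent)}
```

Because each pod's bucket never changes, every stage is a superset of the previous one: pods that already received the new value keep it as the rollout widens, and including the config key in the hash means different configs start with different pods.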

#sre #dist


Pinterest Engineering, Feb 17, 2026

Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest

Why it matters: OOM errors are a primary cause of Spark job failures at scale. Pinterest's elastic executor sizing allows jobs to be tuned for average usage while automatically handling memory-intensive tasks, significantly reducing manual tuning effort, job failures, and infrastructure costs.

Pinterest implemented Auto Memory Retries to handle Spark Out-of-Memory (OOM) errors by dynamically adjusting resource profiles for failed tasks.
The system uses a hybrid strategy: first raising the per-task CPU requirement (spark.task.cpus in stock Spark) to reduce task concurrency on existing executors, then launching physically larger executors if OOMs persist.
Core Spark classes like TaskSetManager and TaskSchedulerImpl were modified to support task-level resource profiles, deviating from the standard TaskSet-wide configuration.
This elastic sizing allows engineers to tune jobs for P90 memory usage rather than peak requirements, improving overall cluster resource efficiency.
A proactive OOM Prediction feature was introduced to preemptively assign larger resource profiles to tasks likely to fail based on historical job data.
The implementation resulted in a 40% reduction in OOM-related job failures and a 2.5% reduction in total cluster memory consumption.
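
The two-step escalation described above can be modeled with simple arithmetic: fewer concurrent tasks per executor means more memory per task, and only when a task already has the executor to itself does the executor need to grow. The sketch below is a toy model under that assumption, not Pinterest's modified Spark scheduler; Profile, escalate, and run_with_retries are hypothetical names, and the doubling policy is invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Profile:
    executor_memory_mb: int
    executor_cores: int
    task_cpus: int

    def memory_per_task_mb(self) -> int:
        # Concurrent tasks per executor = cores // cpus-per-task.
        return self.executor_memory_mb // (self.executor_cores // self.task_cpus)


def escalate(p: Profile, max_memory_mb: int) -> Optional[Profile]:
    """Step 1: raise CPUs per task (less concurrency). Step 2: bigger executors."""
    if p.task_cpus < p.executor_cores:
        return replace(p, task_cpus=min(p.task_cpus * 2, p.executor_cores))
    if p.executor_memory_mb * 2 <= max_memory_mb:
        return replace(p, executor_memory_mb=p.executor_memory_mb * 2)
    return None  # nothing left to try


def run_with_retries(task_peak_mb: int, profile: Profile, max_memory_mb: int):
    """Return the list of profiles attempted until the task fits in memory."""
    attempts = [profile]
    while profile.memory_per_task_mb() < task_peak_mb:
        profile = escalate(profile, max_memory_mb)
        if profile is None:
            raise MemoryError("task does not fit in the largest allowed profile")
        attempts.append(profile)
    return attempts
```

For example, starting from Profile(8192, 4, 1), a task peaking at 6,000 MB succeeds on the third attempt without any new executors, because raising task_cpus to 4 gives it the executor's full 8,192 MB.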

#data #dist #finops


Microsoft Azure Blog, Feb 17, 2026

Azure reliability, resiliency, and recoverability: Build continuity by design

Why it matters: Distinguishing between reliability, resiliency, and recoverability prevents architectural anti-patterns. It ensures engineers don't over-invest in recovery when resiliency is needed, or assume redundancy alone guarantees a reliable customer experience.

Reliability is the ultimate outcome where a service consistently performs at its intended level within business-defined constraints.
Resiliency is the architectural ability to withstand faults, such as regional outages or sudden load spikes, without visible customer disruption.
Recoverability involves the processes and tools required to restore a workload to a reliable state once resiliency limits are exceeded.
The Azure Well-Architected Framework and Reliability Guides help define shared responsibility boundaries and service-specific fault behaviors.
Operationalizing reliability requires defining service levels, monitoring steady-state behavior, and validating assumptions via Azure Chaos Studio.

#sre #dist


Salesforce Engineering, Feb 16, 2026

How Agentforce Achieved Accurate Flow Generation Across 461 Billion Monthly Executions Using a Constrained DSL

Why it matters: This approach demonstrates how to scale LLM-driven automation by replacing black-box fine-tuning with deterministic DSLs. It ensures reliability and debuggability for mission-critical workflows while significantly reducing the operational overhead of model maintenance.

Salesforce transitioned from fine-tuned LLMs to a constrained, multi-stage DSL framework to improve the accuracy of natural-language-to-Flow generation.
The system manages over 461 billion monthly executions across 63+ Flow varieties by enforcing strict metadata rules and validation gates.
A modular pipeline separates the process into an Architect phase for structural planning and a Developer phase for low-level metadata production.
DSL constructs are derived programmatically from Flow Metadata WSDL, ensuring generation rules stay synchronized with evolving platform schemas.
This deterministic approach eliminates expensive model retraining cycles, allowing for faster response to schema changes and correctness fixes.
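
A validation gate over a constrained DSL, as described above, might look roughly like the following. The element types, field names, and validate_flow function are hypothetical and far simpler than Salesforce's Flow metadata; the point is only that generated output is checked against explicit rules before it is accepted.

```python
# Hypothetical DSL schema: each element type lists its required fields.
ALLOWED_ELEMENTS = {
    "start": {"next"},
    "decision": {"condition", "if_true", "if_false"},
    "record_update": {"object", "fields", "next"},
    "end": set(),
}

# Fields that must point at another element in the same flow.
REFERENCE_FIELDS = ("next", "if_true", "if_false")


def validate_flow(flow: dict) -> list:
    """Return a list of rule violations; an empty list means the flow passes the gate."""
    errors = []
    for name, element in flow.items():
        kind = element.get("type")
        if kind not in ALLOWED_ELEMENTS:
            errors.append(f"{name}: unknown element type {kind!r}")
            continue
        missing = ALLOWED_ELEMENTS[kind] - set(element)
        if missing:
            errors.append(f"{name}: missing fields {sorted(missing)}")
        for ref in REFERENCE_FIELDS:
            target = element.get(ref)
            if target is not None and target not in flow:
                errors.append(f"{name}: {ref} points to undefined element {target!r}")
    return errors
```

Since the rules live in data rather than in model weights, a schema change means editing ALLOWED_ELEMENTS (or regenerating it from a schema source) instead of retraining a model.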

#mlp #dist

