Curated topics
Why it matters: Cloudflare's Gen 13 hardware shows how software shifts like the Rust-based FL2 enable radical hardware optimizations. With cache dependency reduced, Cloudflare achieved 2x throughput and 50% better power efficiency, which is critical for scaling global edge networks sustainably.
Why it matters: Cloudflare is evolving Workers AI into a full-stack agent platform by adding frontier-scale models. By combining large context windows with optimized inference and usage-based pricing, they enable cost-effective, high-performance autonomous agents at enterprise scale.
Why it matters: Managing observability at scale requires balancing cost and utility. Airbnb's shift to an in-house, automated platform demonstrates how to regain control over data, standardize metrics across thousands of services, and reduce operational overhead through self-service migration tools.
Why it matters: Optimizing Kubernetes scheduling for bursty Spark workloads resolves the conflict between cost efficiency and job stability. By moving from reactive consolidation to proactive bin-packing, engineers can achieve significant cost savings without triggering disruptive pod evictions.
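Under the hood, the difference is the node-scoring function: spreading favors the emptiest node, while bin-packing favors the node that would be fullest after placement, keeping spare nodes empty so they can be drained without evictions. A minimal Java sketch of that scoring idea follows; kube-scheduler ships it as the MostAllocated scoring strategy of its NodeResourcesFit plugin, and the Node/Pod types below are illustrative stand-ins, not Kubernetes API objects.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Minimal sketch of "most-allocated" node scoring, the idea behind
// proactive bin-packing. Node and Pod are hypothetical stand-ins.
record Node(String name, long cpuCapacityMilli, long cpuAllocatedMilli) {}
record Pod(String name, long cpuRequestMilli) {}

public class BinPackScheduler {
    // Score in [0, 100]: higher when the node would be fuller after
    // placing the pod, steering bursty pods onto already-busy nodes.
    static long score(Node node, Pod pod) {
        long allocated = node.cpuAllocatedMilli() + pod.cpuRequestMilli();
        if (allocated > node.cpuCapacityMilli()) return -1; // does not fit
        return 100 * allocated / node.cpuCapacityMilli();
    }

    static Optional<Node> pickNode(List<Node> nodes, Pod pod) {
        return nodes.stream()
                .filter(n -> score(n, pod) >= 0)
                .max(Comparator.comparingLong(n -> score(n, pod)));
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("node-a", 8000, 6000),   // 75% allocated
                new Node("node-b", 8000, 1000));  // 12.5% allocated
        Pod executor = new Pod("spark-exec-1", 1000);
        // Bin-packing picks node-a (87% after placement), leaving
        // node-b free to be scaled down without evicting running pods.
        System.out.println(pickNode(nodes, executor).orElseThrow().name());
    }
}
```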
Why it matters: This shows how to optimize high-scale Java services using the JDK Vector API. It highlights that compute-heavy kernels like matrix multiplication need cache-friendly data layouts and SIMD acceleration to overcome JNI overhead and GC bottlenecks in production environments.
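As a sketch of what the SIMD half looks like, here is a vectorized dot product, the inner loop of a cache-friendly matrix multiply, using the incubating jdk.incubator.vector API (compile and run with --add-modules jdk.incubator.vector); the data sizes are illustrative.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// SIMD dot product with the JDK Vector API (incubating). Both operands
// are contiguous in memory, so each fromArray is a single vector load,
// which is why a cache-friendly (row-major) layout matters.
public class VectorDot {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);               // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }

    public static void main(String[] args) {
        float[] a = new float[1024], b = new float[1024];
        java.util.Arrays.fill(a, 1.5f);
        java.util.Arrays.fill(b, 2.0f);
        System.out.println(dot(a, b)); // 1024 * 3.0 = 3072.0
    }
}
```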
Why it matters: Managing resources at scale requires more than just hard limits. Piqama provides a unified framework for capacity and rate-limiting, enabling automated rightsizing and budget alignment. This reduces manual overhead while improving resource efficiency and system reliability across platforms.
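The blurb doesn't describe Piqama's internals, so the sketch below is a generic token bucket, a common building block for this kind of rate limiting; the capacity and refill-rate knobs are exactly what an automated rightsizing loop would tune. All names and numbers are hypothetical.

```java
// Generic token-bucket rate limiter; not Piqama's actual design,
// which the post does not spell out.
public class TokenBucket {
    private final long capacity;        // burst budget
    private final double refillPerSec;  // sustained rate; a rightsizing
                                        // loop would tune these two knobs
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(long capacity, double refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // caller sheds or queues the request
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(10, 5.0); // 10 burst, 5 rps
        for (int i = 0; i < 12; i++) {
            System.out.println("request " + i + " allowed=" + bucket.tryAcquire());
        }
    }
}
```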
Why it matters: OOM errors are a primary cause of Spark job failures at scale. Pinterest's elastic executor sizing allows jobs to be tuned for average usage while automatically handling memory-intensive tasks, significantly reducing manual tuning effort, job failures, and infrastructure costs.
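Pinterest's mechanism is internal to their Spark platform, but its effect can be approximated externally: tune for the average case and escalate executor memory only when a run fails. A hedged sketch using Spark's standard SparkLauncher API; the paths, class names, and memory ladder are illustrative.

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Approximates elastic executor sizing from outside the cluster by
// retrying a job with escalating spark.executor.memory on failure
// (e.g. OOM). Not Pinterest's actual implementation.
public class ElasticSubmit {
    public static void main(String[] args) throws Exception {
        String[] memoryLadder = {"4g", "8g", "16g"}; // average-first tuning
        for (String mem : memoryLadder) {
            SparkAppHandle handle = new SparkLauncher()
                    .setAppResource("/jobs/etl.jar")       // illustrative path
                    .setMainClass("com.example.EtlJob")    // illustrative class
                    .setConf(SparkLauncher.EXECUTOR_MEMORY, mem)
                    .setConf("spark.executor.memoryOverhead", "1g")
                    .startApplication();
            while (!handle.getState().isFinal()) Thread.sleep(5_000);
            if (handle.getState() == SparkAppHandle.State.FINISHED) {
                System.out.println("succeeded with executor memory " + mem);
                return; // most runs succeed at the average size
            }
            System.out.println("failed at " + mem + "; escalating");
        }
    }
}
```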
Why it matters: As AI models scale to trillions of parameters, low-bit inference is essential for maintaining low latency and cost-efficiency. It allows engineers to deploy sophisticated models on existing hardware by optimizing memory usage and maximizing throughput via specialized GPU cores.
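At its simplest, low-bit inference trades float multiplies for integer multiplies plus a per-tensor scale. A minimal Java sketch of symmetric int8 quantization follows; real deployments use per-channel scales and int8/int4 tensor cores on the GPU, which this arithmetic-only sketch does not model.

```java
// Symmetric int8 quantization: weights shrink 4x vs float32 and the
// hot loop becomes integer math with one dequantize multiply at the end.
public class Int8Dot {
    // Quantize: q = round(x / scale), with scale chosen so max|x| -> 127.
    static byte[] quantize(float[] x, float scale) {
        byte[] q = new byte[x.length];
        for (int i = 0; i < x.length; i++) {
            q[i] = (byte) Math.max(-127, Math.min(127, Math.round(x[i] / scale)));
        }
        return q;
    }

    static float maxAbs(float[] x) {
        float m = 0;
        for (float v : x) m = Math.max(m, Math.abs(v));
        return m;
    }

    public static void main(String[] args) {
        float[] weights = {0.12f, -0.80f, 0.33f, 0.57f};
        float[] activations = {1.0f, 0.5f, -2.0f, 0.25f};
        float wScale = maxAbs(weights) / 127f;
        float aScale = maxAbs(activations) / 127f;
        byte[] wq = quantize(weights, wScale);
        byte[] aq = quantize(activations, aScale);

        // Integer dot product with a wide accumulator:
        // y ~= (wScale * aScale) * sum(wq[i] * aq[i])
        int acc = 0;
        for (int i = 0; i < wq.length; i++) acc += wq[i] * aq[i];
        float approx = acc * wScale * aScale;

        float exact = 0;
        for (int i = 0; i < weights.length; i++) exact += weights[i] * activations[i];
        System.out.printf("exact=%.4f approx=%.4f%n", exact, approx);
    }
}
```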
Why it matters: As AI agents become primary web consumers, serving them raw HTML is inefficient and costly. This feature treats agents as first-class citizens, delivering clean, structured data directly at the network edge, which cuts LLM token costs by 80%, improves parsing accuracy, and simplifies the data ingestion pipelines behind agent-friendly applications.
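The exact mechanism isn't described in these blurbs, so this is an assumption: if the edge exposes markdown through standard HTTP content negotiation, an agent-side fetch would look like the Java sketch below, where the URL and user agent are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hedged sketch of an agent fetching edge-rendered markdown. It assumes
// the feature uses standard content negotiation (Accept: text/markdown);
// the source does not confirm this, and https://example.com/docs is a
// placeholder URL.
public class AgentFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/docs"))
                .header("Accept", "text/markdown") // ask for the agent-friendly form
                .header("User-Agent", "example-agent/1.0")
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // Markdown carries the same content in far fewer tokens than the
        // full HTML document (scripts, styling, and layout stripped).
        System.out.println(response.headers().firstValue("Content-Type").orElse("?"));
        System.out.println(response.body());
    }
}
```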