Posts tagged with finops
Why it matters: This proof of concept demonstrates how to transform heavy, stateful communication protocols into serverless architectures. It reduces operational overhead and costs to near zero while future-proofing security with post-quantum encryption at the edge.
- Ported the Matrix homeserver protocol to Cloudflare Workers using TypeScript and the Hono framework.
- Replaced traditional stateful infrastructure with serverless primitives: D1 for SQL, KV for caching, R2 for media, and Durable Objects for state resolution (a minimal sketch of these bindings follows this list).
- Achieved a scale-to-zero cost model, eliminating the fixed overhead of running dedicated virtual private servers.
- Integrated post-quantum cryptography by default using hybrid X25519MLKEM768 key agreement for TLS 1.3 connections.
- Leveraged Cloudflare's global edge network to reduce latency by executing homeserver logic in over 300 locations.
- Maintained end-to-end encryption (Megolm) while adding a quantum-resistant transport layer for defense-in-depth.
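The sketch below is illustrative only, not the project's actual code: a Hono route on Cloudflare Workers that reads a Matrix event from D1 and caches it in KV, with the R2 binding declared for media. The binding names (DB, CACHE, MEDIA) and the SQL schema are assumptions.

```typescript
// Minimal sketch of serverless bindings on Cloudflare Workers with Hono.
// Binding names and table layout are hypothetical; types such as D1Database
// come from @cloudflare/workers-types.
import { Hono } from "hono";

type Bindings = {
  DB: D1Database;      // relational state (rooms, events)
  CACHE: KVNamespace;  // short-lived lookups
  MEDIA: R2Bucket;     // uploaded media blobs
};

const app = new Hono<{ Bindings: Bindings }>();

// Hypothetical read path: fetch a room event, preferring the KV cache.
app.get("/_matrix/client/v3/rooms/:roomId/event/:eventId", async (c) => {
  const { roomId, eventId } = c.req.param();
  const cacheKey = `event:${roomId}:${eventId}`;

  const cached = await c.env.CACHE.get(cacheKey);
  if (cached) return c.json(JSON.parse(cached));

  const row = await c.env.DB
    .prepare("SELECT json FROM events WHERE room_id = ?1 AND event_id = ?2")
    .bind(roomId, eventId)
    .first<{ json: string }>();
  if (!row) return c.json({ errcode: "M_NOT_FOUND" }, 404);

  await c.env.CACHE.put(cacheKey, row.json, { expirationTtl: 300 });
  return c.json(JSON.parse(row.json));
});

export default app;
```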
Why it matters: Maia 200 represents a shift toward custom first-party silicon optimized for LLM inference. It offers engineers high-performance FP4/FP8 compute and a flexible software stack, significantly reducing the cost and latency of deploying massive models like GPT-5.2 at scale.
- Maia 200 is built on a TSMC 3nm process, featuring 140 billion transistors and delivering 10 petaFLOPS of FP4 and 5 petaFLOPS of FP8 performance.
- The memory architecture utilizes 216GB of HBM3e at 7 TB/s alongside 272MB of on-chip SRAM to maximize token generation throughput.
- It employs a custom Ethernet-based scale-up network providing 2.8 TB/s of bidirectional bandwidth for clusters of up to 6,144 accelerators.
- The software ecosystem includes the Maia SDK with a Triton compiler, PyTorch integration, and a low-level programming language (NPL).
- Engineered for efficiency, it achieves 30% better performance per dollar than existing hardware for models like GPT-5.2 and synthetic data generation.
Why it matters: Benchmarking AI systems against live providers is expensive and noisy. This mock service provides a deterministic, cost-effective way to validate performance and reliability at scale, allowing engineers to iterate faster without financial friction or external latency fluctuations.
- Salesforce developed an internal LLM mock service to simulate AI provider behavior, supporting benchmarks of over 24,000 requests per minute.
- The service reduced annual token-based costs by over $500,000 by replacing live LLM dependencies during performance and regression testing.
- Deterministic latency controls allow engineers to isolate internal code performance from external provider variability, ensuring repeatable results (see the sketch after this list).
- The mock layer enables rapid scale and failover benchmarking by simulating high-volume traffic and controlled outages without external infrastructure.
- By providing a shared platform capability, the service accelerates development loops and improves confidence in performance signals.
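Salesforce's internal service is not public; the snippet below only sketches the pattern the bullets describe: an OpenAI-style mock completion with a fixed, configurable delay and controlled failure injection, so benchmark runs measure the caller's own code. All names and fields are hypothetical.

```typescript
// Illustrative mock of an LLM provider with deterministic behavior.
type MockConfig = {
  latencyMs: number;      // fixed, repeatable "provider" latency
  failEveryNth?: number;  // optional: simulate controlled outages
  cannedText: string;     // canned completion content
};

let requestCount = 0;

async function mockChatCompletion(_body: unknown, cfg: MockConfig) {
  requestCount++;

  // Deterministic latency instead of real provider variability.
  await new Promise((resolve) => setTimeout(resolve, cfg.latencyMs));

  // Controlled failure injection for failover benchmarks.
  if (cfg.failEveryNth && requestCount % cfg.failEveryNth === 0) {
    return { status: 503, body: { error: "simulated provider outage" } };
  }

  return {
    status: 200,
    body: {
      id: `mock-${requestCount}`,
      object: "chat.completion",
      choices: [{ index: 0, message: { role: "assistant", content: cfg.cannedText } }],
      usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 },
    },
  };
}

// Example: 150 ms fixed latency, one simulated outage per 1,000 calls.
// const res = await mockChatCompletion(payload, {
//   latencyMs: 150, failEveryNth: 1000, cannedText: "stub response",
// });
```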
Why it matters: Engineers must balance speed-to-market with customizability. This ecosystem simplifies the 'build vs. buy' decision by providing pre-vetted models and agents that integrate with existing stacks while ensuring governance and cost optimization through cloud consumption commitments.
- Microsoft Marketplace provides a central catalog of over 11,000 AI models and 4,000 apps to support build, buy, or hybrid AI strategies.
- Pro-code developers can access foundational models from Anthropic, Meta, and OpenAI via Azure Foundry to maintain full control over custom logic and IP.
- Low-code development is enabled through Microsoft Copilot Studio, allowing teams to build agents grounded in organizational data with minimal coding.
- Ready-made agents and multi-agent systems can be deployed directly into Microsoft 365 Copilot to accelerate time-to-value for common business use cases.
- Governance tools like Private Azure Marketplace allow IT teams to curate approved solutions and maintain oversight of AI deployments.
- Marketplace transactions can be applied toward Microsoft Azure Consumption Commitment (MACC), helping organizations optimize cloud spend and procurement.
Why it matters: This architecture demonstrates how to scale global payment systems by abstracting vendor-specific complexities into standardized archetypes. It enables rapid expansion into new markets while maintaining high reliability and consistency through domain-driven design and asynchronous orchestration.
- Replatformed from a monolith to a domain-driven microservices architecture (Payments LTA) to improve scalability and team autonomy.
- Implemented a connector and plugin-based architecture to standardize third-party Payment Service Provider (PSP) integrations.
- Developed the Multi-Step Transactions (MST) framework, a processor-agnostic system for handling complex flows like redirects and Strong Customer Authentication (SCA).
- Categorized 20+ local payment methods into three standardized archetypes—Redirect, Async, and Direct flows—to maximize code reuse.
- Utilized asynchronous orchestration with webhooks and polling to manage external payment confirmations and ensure data consistency.
- Enforced strict idempotency and built comprehensive observability dashboards to monitor transaction success rates and latency across regions (a sketch of the idempotent webhook pattern follows this list).
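As a rough sketch under assumptions (the connector interface, field names, and in-memory store are illustrative, not the platform's actual code), the snippet below shows the two ideas the list leans on: flow archetypes behind a common PSP connector interface, and idempotent handling of asynchronous confirmations keyed on the provider's event id.

```typescript
// Hypothetical connector archetypes and an idempotent webhook handler.
type FlowArchetype = "redirect" | "async" | "direct";

interface PspConnector {
  archetype: FlowArchetype;
  // Initiate a payment; redirect flows return a URL the buyer must visit.
  initiate(paymentId: string, amountMinor: number, currency: string):
    Promise<{ externalRef: string; redirectUrl?: string }>;
}

// The PSP's event id acts as the idempotency key, so retried or duplicated
// deliveries cannot double-apply a confirmation.
const processedEvents = new Set<string>(); // stand-in for a persistent store

async function handlePspWebhook(event: {
  id: string;
  paymentId: string;
  status: "succeeded" | "failed";
}) {
  if (processedEvents.has(event.id)) {
    return { applied: false, reason: "duplicate delivery" };
  }
  processedEvents.add(event.id);

  // Apply the state transition exactly once, then notify downstream consumers.
  // (A real system would do this in a transaction against durable storage.)
  return { applied: true, paymentId: event.paymentId, status: event.status };
}
```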
Why it matters: This article introduces "Continuous Efficiency," an AI-driven method to embed sustainable and efficient coding practices directly into development workflows. It offers a practical path for engineers to improve code quality, performance, and reduce operational costs without manual effort.
- •"Continuous Efficiency" integrates AI-powered automation with green software principles to embed sustainability into development workflows.
- •This approach combines LLM-powered Continuous AI for CI/CD with Green Software practices, aiming for more performant, resilient, and cost-effective code.
- •It addresses the low priority of green software by enabling near-effortless, always-on optimization for efficiency and reduced environmental impact.
- •Implemented via Agentic Workflows in GitHub Actions, it allows defining engineering standards in natural language for scalable application.
- •Benefits include declarative rule authoring, semantic generalizability across languages, and intelligent remediation like automated pull requests.
- •Pilot projects demonstrate success in applying green software rules and Web Sustainability Guidelines, yielding measurable performance gains.
Why it matters: This article highlights how a decade-long partnership between Microsoft and Red Hat has driven significant advancements in hybrid cloud, open source, and AI. Engineers can learn about integrated platforms like ARO, cost-saving benefits, and tools for modernizing applications and scaling AI.
- Microsoft and Red Hat mark a decade of partnership, advancing open source and enterprise cloud innovation, particularly for hybrid cloud transformation.
- Key offerings include Red Hat Enterprise Linux (RHEL) on Azure and Azure Red Hat OpenShift (ARO), a jointly engineered, fully managed application platform.
- The collaboration has enabled digital transformation, cost savings, and accelerated AI initiatives for global enterprises across various industries.
- Technical accomplishments include deep integration of Red Hat solutions on Azure, OpenShift Virtualization, Confidential Containers, and contributions to Kubernetes.
- The partnership provides a secure, governable foundation for scalable AI adoption, leveraging ARO with Azure OpenAI Service and Microsoft Foundry.
- Flexible pricing through Azure Hybrid Benefit for RHEL helps optimize costs for organizations running workloads on Azure.
Why it matters: Zoomer is crucial for optimizing AI performance at Meta's massive scale, ensuring efficient GPU utilization, reducing energy consumption, and cutting operational costs. This accelerates AI development and innovation across all Meta products, from GenAI to recommendations.
- Zoomer is Meta's automated, comprehensive platform for debugging and optimizing AI training and inference workloads at scale.
- It provides deep performance insights, leading to significant energy savings, accelerated workflows, and improved efficiency across Meta's AI infrastructure.
- The platform has reduced training times and improved Queries Per Second (QPS), making it Meta's primary tool for AI performance optimization.
- Zoomer's architecture comprises an Infrastructure/Platform layer for scalability, an Analytics/Insights Engine for deep analysis (using Kineto, StrobeLight, and dyno telemetry), and a Visualization/UI layer for actionable insights.
- It addresses critical challenges of GPU underutilization, operational costs, and suboptimal hardware use in large-scale AI environments.
Why it matters: This update to Azure Ultra Disk offers significant latency reductions and cost optimization through granular control, crucial for engineers managing high-performance, mission-critical cloud applications.
- Azure Ultra Disk has received a transformative update, enhancing speed, resilience, and cost efficiency for mission-critical workloads.
- Platform enhancements deliver an 80% reduction in P99.9 and outlier latency, alongside a 30% improvement in average latency, making it ideal for I/O-intensive applications.
- The new provisioning model offers finer-grained control over capacity and performance, allowing significant cost savings (up to 50% for small disks, 25% for large disks); a worked example follows this list.
- Key changes include 1 GiB capacity billing granularity, a higher maximum of 1,000 IOPS per GiB, and lower minimum IOPS and MB/s per disk.
- Ultra Disk, combined with Azure Boost, now enables a new class of high-performance workloads, exemplified by the Mbv3 VM supporting up to 550,000 IOPS.
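The small worked check below illustrates the provisioning model the bullets describe. The 1,000 IOPS-per-GiB ceiling and 1 GiB billing granularity come from the post; the helper itself is illustrative and is not an Azure API.

```typescript
// Illustrative check of a granular Ultra Disk provisioning request.
const MAX_IOPS_PER_GIB = 1_000; // per the announced limit

function validateUltraDiskRequest(sizeGiB: number, requestedIops: number) {
  // Capacity is billed per GiB, so size must be a whole number of GiB.
  if (!Number.isInteger(sizeGiB) || sizeGiB < 1) {
    throw new Error("size must be a whole number of GiB (1 GiB billing granularity)");
  }
  const maxIops = sizeGiB * MAX_IOPS_PER_GIB;
  if (requestedIops > maxIops) {
    throw new Error(`requested ${requestedIops} IOPS exceeds ${maxIops} for a ${sizeGiB} GiB disk`);
  }
  return { sizeGiB, requestedIops, headroomIops: maxIops - requestedIops };
}

// Example: a 32 GiB disk can be provisioned up to 32,000 IOPS, so high
// performance no longer forces over-provisioning capacity.
console.log(validateUltraDiskRequest(32, 20_000));
```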
Why it matters: This article demonstrates how Pinterest optimizes ad retrieval by strategically using offline ANN to reduce infrastructure costs and improve efficiency for static contexts, complementing real-time online ANN. This is crucial for scaling ad platforms.
- Pinterest employs both online and offline Approximate Nearest Neighbors (ANN) for ad retrieval, balancing real-time personalization with cost efficiency.
- Online ANN handles dynamic user behavior but struggles with scalability and cost as ad inventories expand.
- Offline ANN precomputes ad candidates, cutting infrastructure costs by up to 80% by minimizing online lookups and repetitive searches (see the sketch after this list).
- Ideal for stable query contexts, it delivers high throughput and low latency, though it lacks real-time adaptability.
- Pinterest's "Similar Item Ads" use case demonstrated offline ANN's superior engagement, conversion, and cost-effectiveness over its online counterpart.
- The adoption of IVF algorithms for larger ad indexes necessitated offline ANN to control escalating infrastructure expenses.
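Pinterest's retrieval stack is internal; the sketch below only illustrates the core offline-ANN tradeoff: precompute top-k ad candidates per stable query key in a batch job, then serve them with a cheap key-value lookup instead of an online vector search on every request. Brute-force similarity is used here for clarity where the post describes IVF-style indexes.

```typescript
// Illustrative offline precomputation of ad candidates.
type Embedding = number[];

function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Offline batch job: rank ads for each stable query key (e.g. an item id for
// "Similar Item Ads") and keep the top k.
function precomputeCandidates(
  queries: Map<string, Embedding>,
  ads: Map<string, Embedding>,
  k: number,
): Map<string, string[]> {
  const out = new Map<string, string[]>();
  for (const [queryId, q] of queries) {
    const ranked = [...ads.entries()]
      .map(([adId, e]) => ({ adId, score: cosineSimilarity(q, e) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, k)
      .map((r) => r.adId);
    out.set(queryId, ranked);
  }
  return out; // published to a serving store; the request path is a single lookup
}
```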