Posts tagged with data
Why it matters: This article introduces a novel approach to managing complex microservice architectures. By shifting to a data-oriented service mesh with a central GraphQL schema, engineers can significantly improve modularity, simplify dependency management, and enhance data agility in large-scale SOAs.
- Airbnb introduced Viaduct, a data-oriented service mesh, to improve modularity and tame the massive dependency graphs of its microservices-based Service-Oriented Architecture (SOA).
- Traditional service meshes are procedure-oriented, leading to "spaghetti SOA" where managing and modifying services becomes increasingly difficult.
- Viaduct shifts to a data-oriented design, leveraging GraphQL to define a central schema comprising types, queries, and mutations across the entire service mesh.
- This data-oriented approach abstracts service dependencies away from data consumers, as Viaduct routes requests to the appropriate microservices.
- The central GraphQL schema acts as a single source of truth, aiming to define service APIs and potentially database schemas, which significantly enhances data agility.
- By centralizing schema definition, Viaduct seeks to streamline changes, allowing database updates to propagate to client code with a single, coordinated update instead of weeks of effort (see the schema sketch after this list).
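To make the data-oriented model concrete, here is a minimal sketch of a central schema plus a routing-style resolver, written with the open-source graphql-core Python library; the Listing type, its fields, and the stub resolver are hypothetical illustrations, not Viaduct's actual schema or routing layer.

```python
# Minimal sketch of a central, data-oriented GraphQL schema (assumes
# graphql-core). Consumers query types, not service endpoints; the
# resolver stands in for the mesh's routing layer.
from graphql import build_schema, graphql_sync

schema = build_schema("""
    type Listing {
        id: ID!
        title: String
        hostId: ID
    }
    type Query {
        listing(id: ID!): Listing
    }
""")

def resolve_listing(info, id):
    # A real mesh would route this to whichever microservice owns
    # the Listing type; here the call is stubbed with static data.
    return {"id": id, "title": "Cozy loft", "hostId": "h42"}

result = graphql_sync(
    schema,
    '{ listing(id: "123") { title hostId } }',
    root_value={"listing": resolve_listing},
)
print(result.data)  # {'listing': {'title': 'Cozy loft', 'hostId': 'h42'}}
```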
Why it matters: This article details Pinterest's approach to building a scalable data processing platform on EKS, covering deployment and critical logging infrastructure. It offers insights into managing large-scale data systems and ensuring observability in cloud-native environments.
- Pinterest is transitioning to Moka, a new data processing platform, deploying it on AWS EKS across standardized test, dev, staging, and production environments.
- EKS cluster deployment uses Terraform, with a layered structure of AWS-originated and Pinterest-specific modules plus Helm charts.
- A comprehensive logging strategy covers Moka's EKS control plane logs (via CloudWatch), Spark application logs (driver, executor, and event logs), and system pod logs.
- A key challenge is ensuring Spark event logs are reliably uploaded to S3, even when jobs fail, so the Spark History Server can consume them.
- The team is exploring custom Spark listeners and sidecar containers to guarantee event log persistence and availability for debugging and performance analysis (a sidecar-style sketch follows this list).
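A minimal sketch of the sidecar idea, assuming a boto3 uploader that periodically syncs the driver's local event log directory to S3 so the Spark History Server still gets logs from failed jobs; the bucket, paths, and interval are hypothetical, and Pinterest's actual listener/sidecar design is not public.

```python
# Sidecar-style event log uploader (sketch). Re-uploads any event log
# file that has grown since the last sync, so progress survives a
# driver crash. Assumes AWS credentials are available to boto3.
import os
import time
import boto3

EVENT_LOG_DIR = "/var/log/spark-events"  # matches spark.eventLog.dir
BUCKET = "moka-spark-history"            # hypothetical S3 bucket

s3 = boto3.client("s3")
uploaded_sizes: dict[str, int] = {}

def sync_event_logs() -> None:
    for name in os.listdir(EVENT_LOG_DIR):
        path = os.path.join(EVENT_LOG_DIR, name)
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        if uploaded_sizes.get(name) != size:  # new or grown since last sync
            s3.upload_file(path, BUCKET, f"eventlogs/{name}")
            uploaded_sizes[name] = size

while True:
    sync_event_logs()
    time.sleep(30)  # keep syncing until the pod is torn down
```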
Why it matters: This article details how Netflix is innovating data engineering to tackle the unique challenges of media data for advanced ML. It offers insights into building specialized data platforms and roles for multi-modal content, crucial for any company dealing with large-scale unstructured media.
- Netflix is evolving its data engineering function to "Media ML Data Engineering" to handle complex, multi-modal media data at scale.
- This new specialization focuses on centralizing, standardizing, and managing media assets and their metadata for machine learning applications.
- The "Media Data Lake" is introduced as a platform for storing and serving media assets, leveraging vector storage solutions like LanceDB (see the sketch after this list).
- Its architecture includes a Media Table for metadata, a robust data model, a Pythonic Data API, and distributed compute for ML training and inference.
- The initiative aims to bridge creative media workflows with cutting-edge ML demands, enabling applications like content embedding and quality measures.
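As a flavor of what a vector-backed Media Table might look like, here is a minimal sketch using LanceDB's Python client; the table name, fields, and embeddings are hypothetical stand-ins, since Netflix's Media Table and Data API are internal.

```python
# Media-metadata table with vector search (sketch, assumes lancedb).
import lancedb

db = lancedb.connect("/tmp/media-data-lake")  # local path for illustration

# Each row pairs an asset's metadata with a content embedding.
rows = [
    {"asset_id": "ep_101", "kind": "video", "vector": [0.1, 0.9, 0.0, 0.2]},
    {"asset_id": "ep_102", "kind": "video", "vector": [0.8, 0.1, 0.3, 0.0]},
]
table = db.create_table("media_assets", data=rows, mode="overwrite")

# Nearest-neighbor query: assets whose embeddings resemble a probe vector.
hits = table.search([0.1, 0.8, 0.1, 0.1]).limit(1).to_pandas()
print(hits[["asset_id", "kind"]])
```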
Why it matters: This article highlights how robust ML observability is critical for maintaining reliable, high-performing ML systems in production, especially for sensitive areas like payment processing. It provides a practical framework for implementing effective monitoring and explainability.
- ML observability is essential for monitoring, understanding, and gaining insight into production ML models, ensuring reliability and continuous improvement.
- At Netflix, it is crucial for optimizing payment processing, reducing friction, and ensuring seamless user subscriptions and renewals.
- An effective framework includes automatic issue detection and root cause analysis, and builds stakeholder trust by explaining system behavior.
- Netflix's approach focuses on stakeholder-facing outcomes, structured into logging, monitoring, and explaining modules.
- Logging requires a comprehensive data schema that captures model inputs, outputs, and metadata for effective analysis.
- Monitoring emphasizes online, outcome-focused metrics to understand real-world model behavior and health.
- Explainability, using tools like SHAP, clarifies the "why" behind ML decisions, both in aggregate and for individual instances (a small SHAP sketch follows this list).
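For the explainability piece, here is a minimal SHAP sketch on a toy tree model; Netflix's payment models and features are not public, so the dataset and model here are placeholders.

```python
# Per-instance and aggregate explanations with SHAP on a toy classifier.
import shap
import xgboost
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = xgboost.XGBClassifier(n_estimators=20).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

print(shap_values[0])            # the "why" behind one prediction
print(abs(shap_values).mean(0))  # aggregate feature importance
```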
Why it matters: This article provides a detailed blueprint for achieving high availability and fault tolerance for distributed databases on Kubernetes in a multi-cloud environment. Engineers can learn best practices for managing stateful services, mitigating risks, and designing resilient systems at scale.
- Airbnb achieved high availability for a distributed SQL database by deploying it across multiple Kubernetes clusters, each in a different AWS Availability Zone, a complex but effective strategy.
- The team addressed the challenges of running stateful databases on Kubernetes, particularly node replacements and upgrades, using custom Kubernetes operators and admission hooks.
- A custom Kubernetes operator coordinates node replacements, ensuring data consistency and preventing service disruption across event types.
- Deploying across three independent Kubernetes clusters in different AWS AZs significantly limits the blast radius of infrastructure or deployment issues.
- AWS EBS provides rapid volume reattachment and durability, with tail latency spikes mitigated by read timeouts, transparent retries, and stale reads (sketched after this list).
- Overprovisioning database clusters ensures sufficient capacity even if an entire AZ or Kubernetes cluster fails.
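A minimal sketch of that read path, with a strict deadline, one transparent retry, and a stale-read fallback; the timeout value, retry budget, and the primary_read/stale_read callables are hypothetical stand-ins for the database client's internals.

```python
# Bounded read: try the primary under a deadline, retry once, then fall
# back to a possibly stale replica read instead of blocking the caller.
import concurrent.futures

READ_TIMEOUT_S = 0.05  # cut off reads stuck behind an EBS latency spike
pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def read_with_fallback(primary_read, stale_read, retries=1):
    for _ in range(retries + 1):
        future = pool.submit(primary_read)
        try:
            return future.result(timeout=READ_TIMEOUT_S)
        except concurrent.futures.TimeoutError:
            continue  # transparent retry against the primary
    return stale_read()  # bounded staleness beats unbounded latency

# e.g. read_with_fallback(lambda: primary.get("k"), lambda: replica.get("k"))
```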
Why it matters: This article details Pinterest's strategic move from Hadoop to Kubernetes for data processing at scale. It offers valuable insights into the challenges and benefits of modernizing big data infrastructure, providing a blueprint for other organizations facing similar migration decisions.
- Pinterest is migrating from its aging Hadoop 2.x platform (Monarch) to Moka, a new Kubernetes-based system, for massive-scale data processing.
- The shift to Kubernetes is driven by the need for stronger container isolation and security, better Spark performance, lower operational costs, and faster developer velocity.
- Kubernetes offers built-in container support, streamlined deployment via Terraform and Helm, and a rich ecosystem of monitoring, logging, and scheduling frameworks.
- Performance optimizations include newer JDKs, GPU support, ARM/Graviton instances, and Kubernetes' native autoscaling capabilities.
- Key design challenges involve integrating EKS into Pinterest's existing infrastructure and replacing core Hadoop functionality such as the YARN UI, job submission, resource management, log aggregation, and security (a job-submission sketch follows this list).
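As an illustration of how job submission changes once YARN is gone, here is a minimal PySpark sketch using Spark's standard Kubernetes settings; the API server URL, container image, and namespace are hypothetical placeholders for a Moka-style cluster.

```python
# Submitting Spark directly to Kubernetes instead of YARN (sketch).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://eks-api.example.com:443")  # EKS API server
    .config("spark.kubernetes.namespace", "moka-jobs")
    .config("spark.kubernetes.container.image", "registry.example.com/spark:3.5")
    .config("spark.executor.instances", "8")  # replaces YARN resource sizing
    .appName("moka-batch-job")
    .getOrCreate()
)

spark.range(1_000_000).selectExpr("sum(id)").show()
```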
Why it matters: Engineers often struggle to balance robust security with system performance. This approach demonstrates how to implement scalable, team-level encryption at rest using HSMs without sacrificing the speed of file sharing or the functionality of content search in a distributed environment.
- Dropbox developed a team-based encryption system that uses Hardware Security Modules (HSMs) for secure key generation and storage.
- The architecture removes the performance bottleneck of re-encrypting 4MB file blocks during cross-team sharing (see the envelope-encryption sketch after this list).
- Unique top-level keys let enterprise teams instantly disable access to their data, providing granular control over sensitive information.
- The system balances high security with usability, retaining features like content search that are often lost with traditional end-to-end encryption.
- This security framework serves as the foundation for protecting AI-driven tools like Dropbox Dash and its universal search capabilities.
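The sharing trick is classic envelope encryption: each block is encrypted once under its own key, and only that small key is re-wrapped for a new team. A minimal sketch with the cryptography library follows; keys live in software here, whereas in Dropbox's design the team keys are generated and held in HSMs.

```python
# Envelope encryption sketch: re-wrap the ~32-byte block key on a
# cross-team share instead of re-encrypting the 4MB ciphertext.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def wrap(team_key: bytes, block_key: bytes) -> bytes:
    nonce = os.urandom(12)
    return nonce + AESGCM(team_key).encrypt(nonce, block_key, None)

def unwrap(team_key: bytes, wrapped: bytes) -> bytes:
    return AESGCM(team_key).decrypt(wrapped[:12], wrapped[12:], None)

team_a = AESGCM.generate_key(bit_length=256)
team_b = AESGCM.generate_key(bit_length=256)
block_key = AESGCM.generate_key(bit_length=256)

# Encrypt the 4MB block once, under its own key.
nonce = os.urandom(12)
ciphertext = AESGCM(block_key).encrypt(nonce, b"...4MB file block...", None)

# Cross-team share: unwrap with A's key, re-wrap with B's. The block
# ciphertext above is never touched.
wrapped_for_a = wrap(team_a, block_key)
wrapped_for_b = wrap(team_b, unwrap(team_a, wrapped_for_a))
```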
Why it matters: Dropbox's 7th-gen hardware shows how custom infrastructure at exabyte scale drives massive efficiency. By co-designing hardware and software, they achieve superior performance-per-watt and density, essential for modern AI-driven workloads and sustainable growth.
- Dropbox launched its seventh-generation hardware platform with specialized tiers: Crush (compute), Dexter (database), Sonic (storage), and Gumby/Godzilla (GPUs).
- The architecture more than doubles available rack power, from 17kW to 35kW, and moves to 400G networking to support high-bandwidth AI and data workloads.
- Storage density is optimized with 28TB SMR drives and a custom chassis designed to mitigate vibration and heat, supporting exabyte-scale data.
- The compute tier uses 128-core AMD EPYC processors and PCIe Gen5, delivering significant performance-per-watt improvements over previous generations.
- New GPU tiers are integrated specifically to power AI products like Dropbox Dash, focusing on high-performance training and inference.
Why it matters: This article demonstrates how to significantly accelerate ML development and deployment by leveraging Ray for end-to-end data pipelines. Engineers can learn to build more efficient, scalable, and faster ML iteration systems, reducing costs and time-to-market for new features.
- Pinterest expanded Ray's role from ML training to the entire ML infrastructure, including feature development, sampling, and label modeling, to accelerate iteration.
- A Ray Data native pipeline API was developed for on-the-fly feature transformations, eliminating slow Spark backfills and costly feature joins (a minimal Ray Data sketch follows this list).
- Efficient Iceberg bucket joins were implemented in Ray, enabling dynamic dataset joining at runtime and cutting feature experimentation from days to hours.
- Ray-based Iceberg write mechanisms persist and cache transformed features for reuse, improving iteration efficiency and production data generation.
- This integrated Ray architecture yields a more scalable, efficient, and faster end-to-end ML development and deployment process.
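A minimal Ray Data sketch of an on-the-fly transform applied at read time rather than via a Spark backfill; the dataset and transform are toy stand-ins, and Pinterest's pipeline API and Iceberg bucket joins are internal extensions on top of this pattern.

```python
# On-the-fly feature transformation with Ray Data (sketch).
import ray

ds = ray.data.range(1_000_000)  # stand-in for an Iceberg table scan

def add_features(batch):
    # Computed at read time, so no offline backfill is needed.
    batch["id_squared"] = batch["id"] ** 2
    return batch

features = ds.map_batches(add_features, batch_format="numpy")
print(features.take(3))
```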
Why it matters: This article details how Pinterest scaled its recommendation system to leverage vast lifelong user data, significantly improving personalization and user engagement through innovative ML models and efficient serving infrastructure.
- Pinterest's TransActV2 significantly enhances personalization by modeling up to 16,000 lifelong user actions, a 160x increase over previous systems.
- It introduces a Next Action Loss (NAL) as an auxiliary task, improving user action forecasting beyond traditional CTR models.
- To handle long sequences efficiently, TransActV2 applies Nearest Neighbor (NN) selection at inference, feeding only the most relevant actions to the model (see the sketch after this list).
- The system employs a multi-headed transformer encoder architecture with causal masking and explicit action features.
- Industrial-scale deployment challenges are addressed through NN feature logging, on-device NN search, and custom OpenAI Triton kernels for low-latency serving.
- Lifelong behavior modeling captures evolving, multi-seasonal, and less frequent user interests, leading to richer personalization.
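A minimal PyTorch sketch of the NN-selection idea: score all lifelong actions against the candidate, keep the top-k, and run only those through a causally masked encoder. The dimensions and k are made up, and the production system relies on logged NN features and custom Triton kernels rather than this naive version.

```python
# Nearest-neighbor action selection before the transformer (sketch).
import torch

seq_len, dim, k = 16_000, 64, 256
actions = torch.randn(seq_len, dim)  # lifelong user action embeddings
candidate = torch.randn(dim)         # embedding of the pin being ranked

scores = actions @ candidate                 # relevance of each past action
top_idx = scores.topk(k).indices.sort().values  # top-k, back in time order
relevant = actions[top_idx]                  # (k, dim) kept for the model

# Causally masked transformer encoder over the selected actions.
mask = torch.nn.Transformer.generate_square_subsequent_mask(k)
layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
encoded = layer(relevant.unsqueeze(0), src_mask=mask)  # (1, k, dim)
```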