Why it matters: Agent HQ unifies diverse AI coding agents directly within GitHub, streamlining development workflows. The integration gives engineers a central command center for orchestrating agents, improving productivity, code quality, and control over AI-assisted development.
- GitHub introduces Agent HQ, an open ecosystem integrating various AI coding agents (Anthropic, OpenAI, Google, etc.) directly into the GitHub platform.
- Agents will be native to the GitHub workflow, accessible via a paid GitHub Copilot subscription, enhancing existing development processes.
- A new "mission control" provides a central hub to assign, steer, and track multiple agents, streamlining complex tasks.
- Enhanced VS Code integration allows for planning and customizing agent behavior, improving developer control.
- Enterprise features include agentic code review, a control plane for AI governance, and a metrics dashboard to monitor AI impact.
- The initiative aims to orchestrate specialized agents for parallel task execution, leveraging familiar GitHub primitives like Git and pull requests.
Why it matters: This framework helps engineers understand and quantify network resilience, moving beyond abstract concepts to actionable metrics. It provides insights into securing routing, diversifying infrastructure, and building more robust systems to prevent catastrophic outages.
- Internet resilience is the measurable ability of a network ecosystem to maintain diverse, secure routing paths and rapidly restore connectivity after disruptions, beyond just uptime.
- The Internet's decentralized structure means local decisions by Autonomous Systems (ASes) collectively determine global resilience, emphasizing diverse and secure interconnections.
- Resilience requires a multi-layered approach: diverse physical infrastructure, robust network routing hygiene (BGP, RPKI, ROV), and application-level optimizations like CDNs.
- Route hygiene, particularly RPKI and Route Origin Validation, is crucial for securing BGP routing against hijacks and leaks, preventing widespread outages (see the validation sketch after this list).
- The article proposes a data-driven framework to quantify Internet resilience using public data, aiming to foster a more reliable and secure global network.
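To make the route-hygiene point concrete, here is a minimal sketch of Route Origin Validation semantics (per RFC 6811): an announcement is valid only if some ROA covers the prefix, authorizes the origin AS, and permits the announced prefix length. The `Roa` structure and example prefixes are illustrative, not from the article.

```python
from dataclasses import dataclass
from ipaddress import ip_network

@dataclass
class Roa:
    prefix: str      # ROA prefix, e.g. "203.0.113.0/24"
    max_length: int  # longest announcement length the ROA authorizes
    origin_asn: int  # AS authorized to originate the prefix

def rov_state(announced_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    """Classify a BGP announcement as valid / invalid / unknown."""
    announced = ip_network(announced_prefix)
    covered = False
    for roa in roas:
        # A ROA "covers" the announcement if the announced prefix falls within the ROA prefix.
        if announced.subnet_of(ip_network(roa.prefix)):
            covered = True
            if roa.origin_asn == origin_asn and announced.prefixlen <= roa.max_length:
                return "valid"
    # Covered by at least one ROA but no match -> invalid (likely hijack or leak); no ROA -> unknown.
    return "invalid" if covered else "unknown"

roas = [Roa("203.0.113.0/24", 24, 64500)]
print(rov_state("203.0.113.0/24", 64500, roas))   # valid
print(rov_state("203.0.113.0/24", 64501, roas))   # invalid (wrong origin AS)
print(rov_state("198.51.100.0/24", 64500, roas))  # unknown (no covering ROA)
```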
Why it matters: Quantum computers pose a severe threat to current internet security. This initiative introduces Merkle Tree Certificates to proactively transition the WebPKI to quantum-safe cryptography, ensuring future internet security without compromising performance.
- Quantum computers threaten current internet cryptography, particularly TLS certificates, by enabling "harvest now, decrypt later" attacks and server impersonation.
- Post-Quantum (PQ) algorithms like ML-DSA-44 have significantly larger signatures and public keys (roughly a 20x increase), which would degrade TLS handshake performance if adopted directly.
- Cloudflare, in collaboration with industry partners and the IETF, is proposing Merkle Tree Certificates (MTCs) to redesign the WebPKI for PQ authentication.
- MTCs aim to drastically reduce the number of public keys and signatures exchanged during a TLS handshake, making PQ certificates performant enough for widespread deployment (see the sketch after this list).
- The goal is to enable a smooth transition to quantum-safe authentication today, without waiting for Q-day, and without impacting performance.
- Cloudflare is experimentally deploying MTCs in collaboration with Chrome Security to test their real-world impact and ensure safe implementation.
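To see why batching certificates under a Merkle tree shrinks what a handshake must carry, here is a minimal, hypothetical sketch (plain SHA-256, not the actual MTC draft encoding): one signature over the root can cover an entire batch, and each certificate only needs a logarithmic-size inclusion proof.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root_and_proof(leaves: list[bytes], index: int):
    """Build a Merkle tree over leaf hashes; return the root and the inclusion proof for one leaf."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:                          # duplicate the last node on odd-sized levels
            level.append(level[-1])
        proof.append((index % 2, level[index ^ 1]))  # (node is right child?, sibling hash)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify(leaf: bytes, proof, root: bytes) -> bool:
    node = h(leaf)
    for node_is_right, sibling in proof:
        node = h(sibling + node) if node_is_right else h(node + sibling)
    return node == root

certs = [f"cert-{i}".encode() for i in range(8)]    # a batch of stand-in certificates
root, proof = merkle_root_and_proof(certs, index=5)
# A single CA signature over `root` covers all 8 certs; cert 5 ships only 3 sibling hashes.
print(verify(certs[5], proof, root))  # True
```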
Why it matters: Engineers must understand the accelerating threat of quantum computers to current encryption. Proactive migration to post-quantum cryptography is crucial to secure data against future decryption, as Q-day is approaching faster than anticipated.
- As of late 2025, over 50% of Cloudflare's human-initiated traffic utilizes post-quantum encryption, mitigating "harvest-now-decrypt-later" attacks.
- Quantum computers pose a significant threat to current cryptographic standards like RSA and ECC, necessitating a shift to post-quantum cryptography.
- "Q-day," when quantum computers can break current encryption, is estimated to be less than three years after they surpass classical computers in factoring.
- Progress towards Q-day involves advancements in both quantum hardware (e.g., qubit count, error correction, scalable architectures like Google's Willow chip) and quantum algorithms.
- Different quantum computer technologies (silicon-based, trapped-ion) have varying characteristics regarding scalability, noise, and error correction requirements.
Why it matters: This article introduces A-SFT, a novel post-training algorithm for generative recommenders. It addresses key challenges like noisy reward models and lack of counterfactual data, offering a practical way to improve recommendation quality by better aligning models with user preferences.
- Generative Recommenders (GRs) model user behavior as a sequential transduction task, inspired by transformer architectures.
- Applying RLHF to GRs is challenging due to the lack of counterfactual feedback and the inherent noisiness of recommendation reward signals.
- User feedback is on-policy, making it impractical to obtain evaluations for hypothetical or unseen recommendations.
- Reward models in recommendation systems often exhibit high uncertainty, as user choices are less structured and more random than language data.
- The paper proposes Advantage-Weighted Supervised Fine-tuning (A-SFT) to overcome these post-training challenges.
- A-SFT combines supervised fine-tuning with the advantage function, effectively guiding optimization even with high-variance reward models (see the loss sketch after this list).
- This approach improves alignment between pre-trained generative recommenders and reward models, balancing offline RL and behavior cloning.
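The paper's exact formulation isn't reproduced here, but the general advantage-weighted SFT idea can be sketched as follows: the log-likelihood of logged user interactions is reweighted by an exponentiated, clipped advantage estimate, so higher-reward items pull the policy harder while noisy rewards degrade gracefully toward plain behavior cloning. The function names, the temperature `beta`, and the clipping cap are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def advantage_weighted_sft_loss(logits, target_items, rewards, baseline, beta=1.0, w_max=20.0):
    """Sketch of an advantage-weighted SFT objective for a generative recommender.

    logits:       (batch, num_items) next-item scores from the pre-trained recommender
    target_items: (batch,) item ids the user actually engaged with (on-policy feedback)
    rewards:      (batch,) noisy reward-model scores for those interactions
    baseline:     (batch,) value/baseline estimate used to form the advantage
    """
    advantage = rewards - baseline
    # Exponentiated advantage, clipped so a noisy reward model cannot dominate the update.
    weights = torch.clamp(torch.exp(advantage / beta), max=w_max).detach()
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, target_items, reduction="none")  # standard SFT term
    return (weights * nll).mean()

# Large beta drives all weights toward 1, recovering plain behavior cloning;
# small beta leans harder on the (possibly noisy) reward signal.
```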
Why it matters: This article demonstrates a practical approach to enhancing configuration management safety and reliability in large-scale cloud environments. Engineers can learn how to reduce deployment risks and improve system resilience through environment segmentation and phased rollouts.
- Slack enhanced its Chef infrastructure for safer deployments by addressing reliability risks associated with a single shared production environment.
- They transitioned from a monolithic production Chef environment to multiple isolated `prod-X` environments, dynamically mapped to instances based on their Availability Zones.
- The `Poptart Bootstrap` tool, baked into AMIs, was extended to assign instances to these specific Chef environments during boot time.
- This environment segmentation enables independent updates, significantly reducing the blast radius of potentially problematic configuration changes.
- A staggered deployment strategy was implemented, utilizing `prod-1` as a canary for hourly updates and a release train model for `prod-2` through `prod-6` to ensure progressive rollout and early issue detection.
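The post's actual tooling isn't shown here, but the segmentation pattern it describes can be sketched roughly as follows; the AZ-to-environment mapping and the release-train schedule below are illustrative assumptions, not Slack's real configuration.

```python
# Hypothetical sketch: map an instance's Availability Zone to an isolated Chef environment,
# then promote configuration changes on a canary + release-train schedule.

AZ_TO_CHEF_ENV = {
    "us-east-1a": "prod-1",  # canary environment, picks up new cookbooks first
    "us-east-1b": "prod-2",
    "us-east-1c": "prod-3",
    "us-east-1d": "prod-4",
    "us-east-1e": "prod-5",
    "us-east-1f": "prod-6",
}

# Hours after cutting a release at which each environment is allowed to converge on it.
RELEASE_TRAIN = {"prod-1": 0, "prod-2": 4, "prod-3": 8, "prod-4": 12, "prod-5": 16, "prod-6": 20}

def chef_environment_for(availability_zone: str) -> str:
    """Resolve the Chef environment an instance should join at boot time."""
    return AZ_TO_CHEF_ENV.get(availability_zone, "prod-1")

def environments_ready(hours_since_release: int) -> list[str]:
    """Environments permitted to converge on the new release so far."""
    return [env for env, offset in RELEASE_TRAIN.items() if hours_since_release >= offset]

print(chef_environment_for("us-east-1c"))  # prod-3
print(environments_ready(9))               # ['prod-1', 'prod-2', 'prod-3']
```

Isolating each environment this way caps the blast radius: a bad cookbook change only reaches later environments if the canary and earlier train stops converge cleanly.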
Why it matters: This simplifies complex cloud-to-cloud data migrations, especially from AWS S3 to Azure Blob, reducing operational overhead and costs. Engineers can now securely and efficiently move large datasets, accelerating multicloud strategies and leveraging Azure's advanced analytics and AI.
- Cloud-to-cloud migration from AWS S3 to Azure Blob Storage is now generally available in Azure Storage Mover.
- This fully managed service simplifies data transfers by removing the need for agents, scripts, or third-party tools, reducing overhead and costs.
- Key features include high-speed parallel transfers, integrated automation, secure encrypted data movement, and incremental sync capabilities.
- The service provides comprehensive monitoring via Azure Monitor and Log Analytics for tracking migration progress.
- Customers have successfully migrated petabytes of data and can leverage Azure's analytics and AI capabilities immediately.
- New updates also include migration support for on-premises SMB shares to Azure object storage and NFS shares to Azure Files NFS 4.1.
Why it matters: Engineers must process massive unstructured multimedia data efficiently. This integration demonstrates how specialized architectures can achieve deep multimodal understanding at exabyte scale while maintaining low computational overhead and high search relevance.
- Dropbox is integrating Mobius Labs' Aana models into Dropbox Dash to enhance multimodal search and understanding.
- The Aana architecture is designed for high efficiency, significantly reducing computational requirements compared to traditional multimodal models.
- Unlike siloed processing, Aana analyzes the relationships between text, audio, and video to interpret complex scenes and actions.
- The system is built to handle 'Dropbox scale,' processing exabytes of rich media content across various applications.
- This integration allows users to query multimedia files for specific insights without manual tagging or folder navigation.
Why it matters: This article is crucial for engineers building GenAI products, demonstrating how to integrate privacy-aware infrastructure and data lineage to manage complex data flows, ensure compliance, and accelerate innovation responsibly.
- Meta addresses GenAI privacy challenges by scaling its Privacy Aware Infrastructure (PAI), using AI glasses as a key example.
- GenAI products like AI glasses introduce new data types, increased volumes, and complex real-time data flows, necessitating robust privacy systems.
- Key challenges include managing explosive data growth, adapting to shifting privacy requirements, and supporting rapid innovation cycles.
- PAI leverages data lineage insights and automated privacy controls to embed privacy deeply into product development.
- This approach enables Meta to accelerate GenAI product innovation while upholding user trust and data protection.
Why it matters: HQQ enables engineers to deploy massive LLMs on consumer-grade hardware with minimal setup. By removing the need for calibration data and drastically reducing quantization time, it simplifies the pipeline for optimizing and testing state-of-the-art models at scale.
- Introduces Half-Quadratic Quantization (HQQ), a data-free quantization technique for Large Language Models.
- Achieves quantization speeds up to 50x faster than GPTQ, processing a Llama-2-70B model in under 5 minutes.
- Eliminates the need for calibration datasets, removing data bias and reducing computational overhead during deployment.
- Utilizes a sparsity-promoting loss with hyper-Laplacian distributions to better model weight outliers than standard squared error (see the solver sketch after this list).
- Demonstrates that 2-bit quantized large models can outperform smaller full-precision models within similar memory constraints.
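The paper's exact solver isn't reproduced here, but the core idea of half-quadratic splitting, alternating a closed-form shrinkage step on the quantization error (standing in for the outlier-tolerant hyper-Laplacian penalty) with a closed-form zero-point update, can be sketched roughly as below. The fixed per-row scale, the plain soft-threshold used in place of the paper's generalized shrinkage operator, and all names are assumptions for illustration.

```python
import numpy as np

def hqq_zero_point(W, scale, n_bits=2, iters=20, beta=1.0):
    """Rough sketch of HQQ-style half-quadratic splitting for the zero-point.

    No calibration data is used: only the weights themselves are needed.
    """
    qmin, qmax = 0, 2 ** n_bits - 1
    z = -W.min(axis=1, keepdims=True) / scale            # per-row zero-point initialization

    def quantize(W, z):
        return np.clip(np.round(W / scale + z), qmin, qmax)

    for _ in range(iters):
        Wq = quantize(W, z)
        Wdq = (Wq - z) * scale                            # dequantized weights
        err = W - Wdq
        # (a) Shrinkage on the error: soft-threshold as a simple stand-in for the
        #     generalized shrinkage operator of an l_p (p < 1) penalty.
        We = np.sign(err) * np.maximum(np.abs(err) - 1.0 / beta, 0.0)
        # (b) Closed-form zero-point update given the current error estimate.
        z = np.mean(Wq - (W - We) / scale, axis=1, keepdims=True)
    return z, quantize(W, z)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64)).astype(np.float32)
scale = (W.max(axis=1, keepdims=True) - W.min(axis=1, keepdims=True)) / (2 ** 2 - 1)
z, Wq = hqq_zero_point(W, scale)
```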