Explore the latest engineering posts and summaries

Posts indexed: 584

Last indexed: Apr 30, 2026

Salesforce Engineering · Dec 22, 2025

Shattering AWS’s 250K-IP Ceiling: How Data 360 Reached 1 Million IPs with Zero-Downtime Migration

Why it matters: Scaling to 100,000+ tenants requires overcoming cloud provider networking limits. This migration demonstrates how to bypass AWS IP ceilings using prefix delegation and custom observability without downtime, ensuring infrastructure doesn't bottleneck hyperscale data growth.

Overcame the AWS Network Address Usage (NAU) hard limit of 250,000 IPs per VPC to support 1 million IPs for Data 360.
Implemented AWS prefix delegation, which assigns IP addresses in contiguous 16-address blocks to significantly increase network efficiency.
Navigated Hyperforce architectural constraints, including immutable subnet structures and strict security group rules, without altering VPC boundaries.
Developed custom observability tools to monitor IP fragmentation and contiguous block availability, filling gaps in native AWS and Hyperforce metrics.
Utilized AI-driven validation and phased rollouts to ensure zero-downtime migration for massive Spark-driven data processing workloads.
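
The fragmentation concern in the bullets above can be illustrated with a little arithmetic: a subnet may have many free addresses and still few free 16-address blocks. The following is a minimal sketch, assuming IPv4 /28 prefixes; the function name `free_prefix_blocks` and the sample subnet are hypothetical, and this is not the team's actual tooling. It also ignores the addresses AWS reserves in every subnet.

```python
import ipaddress

def free_prefix_blocks(subnet_cidr, used_ips, prefix_len=28):
    """List the aligned /prefix_len blocks in a subnet that hold no used address."""
    subnet = ipaddress.ip_network(subnet_cidr)
    # Map every used address to the aligned block that contains it.
    occupied = {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in used_ips
    }
    return [b for b in subnet.subnets(new_prefix=prefix_len) if b not in occupied]

# Hypothetical subnet with four scattered addresses in use.
used = ["10.0.0.5", "10.0.0.20", "10.0.0.21", "10.0.0.200"]
free_blocks = free_prefix_blocks("10.0.0.0/24", used)
free_addresses = 256 - len(used)
delegable = len(free_blocks) * 16
print(f"{free_addresses} addresses free, {delegable} usable as /28 prefixes")
```

Four scattered addresses remove three whole blocks from the delegable pool, which is why a metric for contiguous block availability is more informative than a plain free-address count.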

#sre #dist #data

Engineering at Meta · Dec 22, 2025

Python Typing Survey 2025: Code Quality and Flexibility As Top Reasons for Typing Adoption

Why it matters: This survey highlights the maturation of Python's type system as a standard for professional development. Understanding these trends helps engineers optimize their toolchains, improve codebase maintainability, and align with community best practices for large-scale Python projects.

Python type hint adoption remains high at 86%, with developers citing improved code quality, readability, and IDE support as primary benefits.
Adoption peaks at 93% for developers with 5-10 years of experience, while senior developers (10+ years) show slightly lower usage at 80%.
Mypy remains the most popular type checker, though Pyright and Pylance are gaining significant traction due to speed and IDE integration.
The community values the gradual typing approach, allowing incremental adoption in legacy codebases without sacrificing Python's dynamic nature.
Key pain points include the steep learning curve for complex types and concerns regarding runtime performance overhead.
Developers express a strong desire for unified tooling and better support for runtime type validation in future Python versions.
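
As an illustration of the gradual typing point above, annotated and unannotated functions can live side by side in one module. This is a minimal sketch with hypothetical function names, assuming Python 3.9 or later for the built-in generic syntax; it is not taken from the survey.

```python
def legacy_total(items):
    # No annotations: a type checker treats the parameter and result as dynamic.
    return sum(item["price"] * item["qty"] for item in items)

def typed_total(items: list[dict[str, float]]) -> float:
    # Same logic, now annotated so a checker can validate the callers.
    return sum(item["price"] * item["qty"] for item in items)

cart = [{"price": 10.0, "qty": 2.0}, {"price": 5.0, "qty": 1.0}]
print(legacy_total(cart), typed_total(cart))
```

Both functions behave identically at runtime; the annotations only add information for static tools, so a legacy codebase can be migrated one function at a time.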

#culture

Cloudflare Blog · Dec 22, 2025

How Workers powers our internal maintenance scheduling pipeline

Why it matters: Manual infrastructure management fails at scale. This article shows how Cloudflare uses serverless Workers and graph-based data modeling to automate global maintenance scheduling, preventing downtime by programmatically enforcing safety constraints across distributed data centers.

Cloudflare transitioned from manual maintenance coordination to an automated scheduler built on Cloudflare Workers to manage 330+ global data centers.
The system enforces safety constraints to prevent simultaneous downtime of redundant edge routers and customer-specific egress IP pools.
To solve 'out of memory' errors on the Workers platform, the team implemented a graph-based data interface inspired by Facebook’s TAO.
The scheduler uses a graph model of objects and associations to load only the regional data necessary for specific maintenance requests.
The tool programmatically identifies overlapping maintenance windows and alerts operators to potential conflicts to ensure high availability.
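
The conflict check described above can be sketched as an interval-overlap test restricted to routers linked by a redundancy association. This is a minimal sketch under stated assumptions: the `Window` type, the `redundant_with` mapping, and the router names are hypothetical, and the real scheduler's data model is not shown in the summary.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Window:
    router: str
    start: datetime
    end: datetime

def overlaps(a, b):
    # Two half-open intervals overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end

def find_conflicts(windows, redundant_with):
    """Return pairs of overlapping windows on routers that back each other up."""
    conflicts = []
    for i, a in enumerate(windows):
        for b in windows[i + 1:]:
            if b.router in redundant_with.get(a.router, set()) and overlaps(a, b):
                conflicts.append((a, b))
    return conflicts

# Hypothetical association edges: r1 and r2 are a redundant pair.
redundant_with = {"r1": {"r2"}, "r2": {"r1"}}
day = lambda h: datetime(2025, 12, 22, h)
windows = [
    Window("r1", day(10), day(12)),
    Window("r2", day(11), day(13)),  # overlaps r1 and is its backup: conflict
    Window("r3", day(11), day(12)),  # overlaps, but not redundant with r1
    Window("r2", day(14), day(15)),  # redundant with r1, but no overlap
]
conflicts = find_conflicts(windows, redundant_with)
```

Looking up only the associations of the routers named in a request mirrors the idea of loading just the data a specific maintenance request needs.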

#sre #dist #data

Cloudflare Blog · Dec 19, 2025

Code Orange: Fail Small — our resilience plan following recent incidents

Why it matters: This initiative highlights the danger of instant global configuration propagation. By treating config as code and implementing gated rollouts, Cloudflare demonstrates how to mitigate blast radius in hyperscale systems, a critical lesson for SRE and platform engineers.

Cloudflare launched 'Code Orange: Fail Small' to prioritize network resilience after two major outages caused by rapid configuration deployments.
The plan mandates controlled, gated rollouts for all configuration changes, mirroring the existing Health Mediated Deployment (HMD) process used for software binaries.
Teams must now define success metrics and automated rollback triggers for configuration updates to prevent global propagation of errors.
Engineers are reviewing failure modes across traffic-handling systems to ensure predictable behavior during unexpected error states.
The initiative aims to eliminate circular dependencies and improve 'break glass' procedures for faster emergency access during incidents.
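
A gated rollout with an automated rollback trigger, as described above, can be sketched as a loop over stages. This is an illustrative sketch only: the stage names and the `apply`, `is_healthy`, and `rollback` callbacks are hypothetical, and it does not represent Cloudflare's Health Mediated Deployment implementation.

```python
def gated_rollout(stages, apply, is_healthy, rollback):
    """Apply a change stage by stage; undo everything if a stage looks unhealthy."""
    applied = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Roll back in reverse order so the widest stage is undone first.
            for done in reversed(applied):
                rollback(done)
            return False, stage
    return True, None

log = []
ok, failed_at = gated_rollout(
    ["canary", "one-region", "global"],
    apply=lambda s: log.append(("apply", s)),
    is_healthy=lambda s: s != "one-region",  # simulate a failing success metric
    rollback=lambda s: log.append(("rollback", s)),
)
```

In this run the change never reaches the "global" stage, which is the blast-radius property the plan is after.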

#sre #dist #security

Engineering at Meta · Dec 19, 2025

DrP: Meta’s Root Cause Analysis Platform at Scale

Why it matters: DrP automates manual incident triaging at scale. By codifying expert knowledge into executable playbooks, it reduces MTTR and lets engineers focus on resolution rather than data gathering, improving system reliability in complex microservice environments.

DrP is Meta's programmatic root cause analysis (RCA) platform that automates incident investigation through an expressive SDK and scalable backend.
The platform uses 'analyzers'—codified investigation playbooks—to perform anomaly detection, dimension analysis, and time series correlation.
It integrates directly with alerting and incident management systems to trigger automated investigations immediately upon alert activation.
The system supports analyzer chaining, allowing for complex investigations across interconnected microservices and dependencies.
DrP includes a post-processing layer that can automate mitigation steps, such as creating pull requests or tasks based on findings.
The platform handles 50,000 daily analyses across 300+ teams, reducing Mean Time to Resolve (MTTR) by 20% to 80%.
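
The analyzer chaining idea above can be sketched as a small work queue in which each analyzer may name follow-up analyzers. Everything here is hypothetical (the `Finding` type, the `run_chain` function, the analyzer names, and the metrics); it is not the DrP SDK, whose API the summary does not show.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    analyzer: str
    summary: str
    follow_up: list = field(default_factory=list)  # analyzers to run next

def run_chain(start, analyzers, context, max_steps=10):
    """Run analyzers breadth-first, following the follow-ups each one requests."""
    findings, queue, seen = [], [start], set()
    while queue and len(findings) < max_steps:
        name = queue.pop(0)
        if name in seen or name not in analyzers:
            continue  # skip repeats and unknown analyzers
        seen.add(name)
        finding = analyzers[name](context)
        findings.append(finding)
        queue.extend(finding.follow_up)
    return findings

def api_errors(ctx):
    # Dimension analysis in miniature: which dependency has the most errors?
    worst = max(ctx["errors_by_dependency"], key=ctx["errors_by_dependency"].get)
    return Finding("api_errors", f"most errors come from {worst}", [f"{worst}_latency"])

def db_latency(ctx):
    regressed = ctx["db_p99_ms"] > 2 * ctx["db_p99_baseline_ms"]
    return Finding("db_latency", "p99 regressed" if regressed else "p99 normal")

context = {
    "errors_by_dependency": {"db": 120, "cache": 3},
    "db_p99_ms": 900,
    "db_p99_baseline_ms": 200,
}
findings = run_chain("api_errors", {"api_errors": api_errors, "db_latency": db_latency}, context)
```

The `seen` set and `max_steps` bound keep a chain from looping when services depend on each other in a cycle.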

#sre #dist

Cloudflare Blog · Dec 19, 2025

Innovating to address streaming abuse — and our latest transparency report

Why it matters: Cloudflare is scaling its abuse mitigation by integrating AI and real-time APIs. For engineers, this demonstrates how to handle high-volume legal and security compliance through automation and service-specific policies while maintaining network performance and reliability.

Cloudflare's H1 2025 transparency report highlights a significant increase in automated abuse detection and response capabilities.
The company is utilizing AI and machine learning to identify sophisticated patterns in unauthorized streaming and phishing campaigns.
A new API-driven reporting system for rightsholders has scaled DMCA processing, increasing actions from 1,000 to 54,000 in six months.
Cloudflare applies service-specific abuse policies, distinguishing between hosted content and CDN/security services.
Technical measures prevent free-tier plans from being used for high-bandwidth video streaming, protecting network resources.
Collaborative data sharing with rightsholders enables real-time identification and mitigation of domains involved in streaming abuse.

#security #mlp #dist

Dropbox Tech Blog · Dec 18, 2025

Inside the feature store powering real-time AI in Dropbox Dash

Why it matters: Building a scalable feature store is essential for real-time AI applications that require low-latency retrieval of complex user signals across hybrid environments. This approach enables engineers to move quickly from experimentation to production without managing underlying infrastructure.

Dropbox Dash utilizes a custom feature store to manage data signals for real-time machine learning ranking across fragmented company content.
The system bridges a hybrid infrastructure consisting of on-premises low-latency services and a Spark-native cloud environment for data processing.
Engineers selected Feast as the framework for its modular architecture and clear separation between feature definitions and infrastructure management.
To meet sub-100ms latency requirements, the store uses an in-house DynamoDB-compatible solution (Dynovault) for high-concurrency parallel reads.
The architecture supports both batch processing of historical data and real-time streaming ingestion to capture immediate user intent.

#mlp #data #dist

Cloudflare Blog · Dec 18, 2025

Announcing support for GROUP BY, SUM, and other aggregation queries in R2 SQL

Why it matters: Engineers can now perform complex analytical queries directly on R2 data without egress or external processing. This distributed approach to aggregations enables high-performance log analysis and reporting across massive datasets using familiar SQL syntax.

Cloudflare R2 SQL now supports SQL aggregations including GROUP BY, SUM, COUNT, and HAVING statements.
The engine executes queries over Apache Parquet files stored in the R2 Data Catalog using a distributed architecture.
The engine uses a scatter-gather approach in which worker nodes compute pre-aggregates, so computation scales horizontally.
Pre-aggregates represent partial states, such as intermediate sums and counts, which are merged by a coordinator node.
Introduces shuffling aggregations to handle complex operations like ORDER BY and HAVING on computed aggregate columns.
The system is designed to spot trends, generate reports, and identify anomalies in large-scale log data.
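
The scatter-gather flow above can be sketched in a few lines: each worker reduces its shard to partial sums and counts, and a coordinator merges them. This is a minimal single-process sketch with hypothetical function names and sample data; it does not model R2 SQL's distributed execution or its shuffling aggregations, and it applies the HAVING filter at the coordinator for simplicity.

```python
from collections import defaultdict

def pre_aggregate(rows):
    """Worker step: reduce (key, value) rows to a partial [sum, count] per key."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)

def merge(partials):
    """Coordinator step: combine partial states, then derive the final values."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {k: {"sum": s, "count": c, "avg": s / c} for k, (s, c) in merged.items()}

# Two hypothetical shards of (method, latency_ms) log rows.
shards = [
    [("GET", 10.0), ("POST", 4.0), ("GET", 20.0)],
    [("GET", 30.0), ("POST", 6.0)],
]
result = merge(pre_aggregate(shard) for shard in shards)
# Equivalent of: GROUP BY method HAVING AVG(latency_ms) > 10
having = {k: v for k, v in result.items() if v["avg"] > 10}
```

Workers ship sums and counts rather than averages because averages of shards cannot be combined correctly without knowing how many rows each one covers.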

#dist #data

Microsoft Azure Blog · Dec 17, 2025

Microsoft named a Leader in Gartner® Magic Quadrant™ for AI Application Development Platforms

Why it matters: Microsoft's leadership in AI platforms highlights the transition from experimental LLM demos to production-grade agentic workflows. For engineers, this provides a unified framework for data grounding, multi-agent orchestration, and governance across cloud and edge environments.

Microsoft Foundry serves as a unified platform for building, deploying, and governing agentic AI applications at scale.
Foundry IQ and Tools provide a secure grounding API with over 1,400 connectors to integrate agents with real-world enterprise data.
Foundry Agent Service supports multi-agent orchestration, allowing autonomous agents to coordinate and drive complex business workflows.
The Foundry Control Plane offers enterprise-grade observability, audit trails, and policy enforcement for autonomous systems.
Deployment flexibility is enabled through Foundry Models for cloud-based GenAI Ops and Foundry Local for low-latency, on-device AI execution.

#mlp #data #dist

Engineering at Meta · Dec 17, 2025

How We Built Meta Ray-Ban Display: From Zero to Polish

Why it matters: This article offers insights into the complex engineering and design challenges of developing advanced wearable AI glasses, providing valuable lessons for hardware and software engineers working on next-gen devices and user interfaces.

The Meta Tech Podcast delves into the engineering challenges behind the Meta Ray-Ban Display, Meta's advanced AI glasses.
Engineers Kenan and Emanuel discuss unique design hurdles, from display technology to emerging UI patterns for wearable glasses.
The episode explores the intersection of particle physics and hardware design in developing cutting-edge wearable tech.
It highlights the importance of celebrating incremental wins within a fast-moving development culture for innovative products.

#mobile #culture
