Why it matters: These Azure Storage innovations provide engineers with enhanced scalability, performance, and simplified management for AI workloads, from training to inference, enabling more efficient development and deployment of advanced AI solutions.
- Azure Blob Storage is significantly enhanced for the entire AI lifecycle, offering exabyte scale, tens of Tbps of throughput, and millions of IOPS to power GPU-intensive AI model training and deployment.
- Azure Managed Lustre (AMLFS) 2.0 (preview) provides a high-performance parallel file system for petabyte-scale AI training data, supporting 25 PiB namespaces and up to 512 GBps throughput, with Hierarchical Storage Management (HSM) integration for Azure Blob Storage.
- AMLFS includes new auto-import and auto-export features to move data efficiently between Lustre and Blob Storage, optimizing GPU utilization and streamlining the AI data pipeline.
- Premium Blob Storage delivers consistently low latency and up to 3x faster retrieval performance, crucial for AI inferencing workloads such as Retrieval-Augmented Generation (RAG) agents, while preserving enterprise data security.
- The new LangChain Azure Blob Loader offers improved security, memory efficiency, and up to 5x faster performance for open-source AI frameworks.
- New AI-driven tools like Storage Discovery and Copilot simplify exabyte-scale data management and analysis through intuitive dashboards and natural language queries.
Why it matters: This approach enables faster, more cost-effective evaluation of search ranking models in A/B tests. Engineers can detect smaller, more nuanced effects, accelerating product iteration and improving user experience by deploying features with higher confidence.
- Pinterest uses fine-tuned open-source LLMs to automate search relevance assessment, overcoming the limitations of costly and slow human annotations.
- The LLMs are trained on a 5-level relevance guideline using a cross-encoder architecture and comprehensive Pin textual features, supporting multilingual search.
- This approach significantly reduces labeling costs and time, enabling much larger and more sophisticated stratified query sampling designs.
- Stratified sampling, based on query interest and popularity, ensures sample representativeness and drastically reduces measurement variance.
- The implementation led to a significant reduction in Minimum Detectable Effects (MDEs), from 1.3-1.5% to ≤ 0.25%, accelerating A/B experiment velocity and feature deployment.
- Paired sampling and sDCG@K are used to measure the relevance impact of A/B experiments on search ranking.
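The core of the stratified design can be illustrated with a minimal sketch: per-stratum sample means are combined using each stratum's share of overall query traffic, so high-variance tail queries no longer dominate the estimate. The strata names, weights, and scores below are hypothetical, not Pinterest's data.

```python
# Hypothetical strata: queries grouped by popularity, each with an
# assumed traffic weight and LLM-labeled relevance scores.
strata = {
    "head":  {"weight": 0.2, "scores": [0.90, 0.88, 0.92]},  # popular queries
    "torso": {"weight": 0.3, "scores": [0.75, 0.70, 0.80]},
    "tail":  {"weight": 0.5, "scores": [0.55, 0.60, 0.50]},  # rare queries
}

def stratified_mean(strata):
    """Combine per-stratum sample means, weighted by each stratum's
    share of overall query traffic."""
    return sum(s["weight"] * sum(s["scores"]) / len(s["scores"])
               for s in strata.values())

print(round(stratified_mean(strata), 4))  # 0.68
```

Because each stratum is sampled and averaged separately, between-stratum variance drops out of the estimator, which is what shrinks the MDE in the A/B analysis.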
Why it matters: This article details significant AI platform advancements from Microsoft Ignite, offering developers more model choices and improved semantic understanding for building robust, secure, and flexible AI applications and agents.
- Microsoft Ignite 2025 showcased significant advancements in agentic AI and cloud solutions, emphasizing rapid developer adoption.
- Microsoft Foundry now integrates Claude models (Sonnet, Opus) alongside OpenAI's GPT, providing developers with diverse model choices for AI application and agent development.
- This model diversity in Microsoft Foundry offers flexibility, enterprise-grade security, compliance, and governance for building AI solutions.
- New Microsoft IQ offerings aim to enhance semantic understanding, connecting productivity apps, analytics platforms, and AI development environments.
Why it matters: This move provides a stable, open-source foundation for AI agent development, standardizing how LLMs securely interact with external systems. It resolves critical integration challenges, accelerating the creation of robust, production-ready AI tools across industries.
- The Model Context Protocol (MCP), an open-source standard for connecting LLMs to external tools, has been donated by Anthropic to the Agentic AI Foundation under the Linux Foundation.
- MCP addresses the "N×M integration problem" by providing a vendor-neutral protocol, standardizing how AI models communicate with diverse services like databases and CI pipelines.
- Before MCP, developers faced fragmented APIs and brittle, platform-specific integrations, hindering secure and consistent AI agent development.
- This transition ensures long-term stewardship and a stable foundation for developers building production AI agents and enterprise systems.
- MCP's rapid adoption highlights its critical role in enabling secure, auditable, and cross-platform communication for AI in various industries.
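MCP frames communication as JSON-RPC 2.0 messages, so a tool invocation has the same shape regardless of which service sits behind it. The sketch below shows that message shape; the tool name and arguments are hypothetical, and a real client would send this over an MCP transport rather than just build the string.

```python
import json

def make_tool_call(req_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request in the shape MCP uses to invoke a
    tool. 'run_query' and its arguments below are illustrative only."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

msg = make_tool_call(1, "run_query", {"sql": "SELECT count(*) FROM builds"})
print(json.loads(msg)["method"])  # tools/call
```

Standardizing on one envelope like this is what removes the N×M problem: each model speaks one protocol, and each service implements one server.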
Why it matters: Engineers can leverage AI for rapid development while maintaining high code quality. This article introduces tools and strategies, like GitHub Code Quality and effective prompting, to prevent "AI slop" and ensure reliable, maintainable code in an accelerated workflow.
- AI significantly accelerates development but risks generating "AI slop" and technical debt without proper quality control.
- GitHub Code Quality, leveraging AI and CodeQL, ensures high standards by automatically detecting and suggesting fixes for maintainability and reliability issues in pull requests.
- Key features include one-click enablement, automated fixes for common errors, enforcing quality bars with rulesets, and surfacing legacy technical debt.
- Engineers must "drive" AI by providing clear, constrained prompts, focusing on goals, context, and desired output formats to maximize quality.
- This approach allows teams to achieve both speed and control, preventing trade-offs between velocity and code reliability in the AI era.
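The goal/context/output-format structure for prompts can be made concrete with a small template helper. This is a sketch of the convention, not a GitHub-specified schema; the section labels and the example refactoring task are illustrative.

```python
def build_prompt(goal, context, output_format):
    """Assemble a constrained prompt with an explicit goal, relevant
    context, and a required output format, so the model's response is
    scoped and easy to verify."""
    return (
        f"Goal: {goal}\n"
        f"Context: {context}\n"
        f"Output format: {output_format}\n"
        "Only produce output matching the format above."
    )

prompt = build_prompt(
    goal="Refactor parse_config() to remove the duplicated validation",
    context="Python 3.12 service; validation also lives in validate.py",
    output_format="a unified diff touching only parse_config()",
)
print(prompt.splitlines()[0])  # Goal: Refactor parse_config() to remove the duplicated validation
```

Constraining the output format ("a unified diff touching only...") is what makes the result reviewable against the quality bar instead of an open-ended rewrite.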
Why it matters: This expansion provides engineers with more Azure regions and Availability Zones, enabling highly resilient, performant, and geographically diverse cloud architectures for critical applications and AI workloads.
- Microsoft is significantly expanding its cloud infrastructure in the US, including a new East US 3 region in Atlanta by early 2027.
- The East US 3 region will incorporate Availability Zones for enhanced resiliency and support advanced Azure workloads, including AI.
- Five existing US Azure regions (North Central US, West Central US, US Gov Arizona, East US 2, South Central US) will also gain Availability Zones by 2026-2027.
- These expansions aim to meet growing customer demand for cloud and AI services, offering greater capacity, resiliency, and agility.
- The new infrastructure emphasizes sustainability, with the East US 3 region designed for LEED Gold Certification and water conservation.
- Leveraging Availability Zones and multi-region architectures is highlighted for improving application performance, latency, and overall resilience.
Why it matters: As AI agents become integrated into development, ensuring their output is safe and predictable is critical. This system provides a blueprint for building trust in automated code generation through rigorous feedback loops and validation.
- Spotify's system focuses on making AI coding agents predictable and trustworthy through structured feedback loops.
- The architecture ensures that agent-generated code is validated against existing engineering standards and tests.
- Background agents operate asynchronously to improve code quality without disrupting the primary developer workflow.
- The framework addresses the challenge of moving from experimental AI generation to production-ready software engineering.
- Automated verification steps are integrated to prevent the introduction of bugs or technical debt by autonomous agents.
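The generate-validate-retry pattern behind such feedback loops can be sketched in a few lines. The agent and checks below are toy stand-ins, not Spotify's implementation: validation failures are fed back into the next generation attempt until the candidate passes every check or a retry budget runs out.

```python
def feedback_loop(generate, validate, max_attempts=3):
    """Generate a candidate, validate it against checks (tests, linters,
    engineering standards), and feed failures back into the next attempt."""
    feedback = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = validate(candidate)
        if not feedback:
            return candidate  # passed every check
    raise RuntimeError("no candidate passed validation")

# Toy agent: adds a docstring once feedback says one is missing.
def toy_generate(feedback):
    code = "def add(a, b):\n    return a + b\n"
    if any("docstring" in f for f in feedback):
        code = code.replace("return", '"""Add two numbers."""\n    return')
    return code

def toy_validate(code):
    return [] if '"""' in code else ["missing docstring"]

result = feedback_loop(toy_generate, toy_validate)
```

Bounding the loop with `max_attempts` is what keeps the agent predictable: it either produces code that passes the existing verification gates or fails loudly instead of merging unchecked output.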
Why it matters: This article provides a blueprint for implementing "shift left" security and IaC at enterprise scale, crucial for preventing misconfigurations, enhancing consistency, and improving operational efficiency in large, complex environments.
- Cloudflare adopted "shift left" principles and Infrastructure as Code (IaC) to manage its critical platform securely and consistently at enterprise scale.
- All production account configurations are managed via IaC using Terraform, integrated with a custom CI/CD pipeline (Atlantis, GitLab, tfstate-butler).
- A centralized monorepo holds all configurations, with teams owning their specific sections, promoting accountability and consistency.
- Security baselines are enforced through Policy as Code (Open Policy Agent with Rego), shifting validation to the earliest stages of development.
- Policies are automatically checked on every merge request, preventing misconfigurations before deployment and minimizing human error.
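The essence of such a Policy-as-Code check is a pure function over planned resources that returns violations before anything deploys. Cloudflare uses OPA with Rego for this; the Python stand-in below illustrates the idea with a hypothetical resource shape and rule.

```python
# Stand-in for a Policy-as-Code baseline check; the resource schema
# and the public-read rule are hypothetical, not Cloudflare's.
def violations(resources):
    """Flag storage buckets that allow public reads: the kind of
    misconfiguration a baseline policy would block at merge time."""
    return [
        r["name"] for r in resources
        if r.get("type") == "bucket" and r.get("public_read", False)
    ]

planned = [
    {"name": "logs", "type": "bucket", "public_read": False},
    {"name": "assets", "type": "bucket", "public_read": True},
]
print(violations(planned))  # ['assets']
```

Running this against the Terraform plan on every merge request is what "shifts left": the misconfigured `assets` bucket is rejected in review, not discovered in production.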
Why it matters: Achieving sub-second latency in voice AI requires rethinking performance metrics and optimizing every microservice. This article shows how semantic end-pointing and synthetic testing are critical for building responsive, human-like voice agents at scale.
- Developed the Flash Reasoning Engine to achieve sub-second Time to First Audio (TTFA) for natural, human-fast voice interactions.
- Optimized the real-time voice pipeline by shaving hundreds of milliseconds from microservices, synchronous calls, and serialization paths.
- Implemented semantic end-pointing algorithms that use confidence thresholds to distinguish between meaningful pauses and true utterance completion.
- Created AI-driven synthetic customer testing frameworks to generate repeatable data sets and eliminate noise in performance metrics.
- Resolved measurement inaccuracies where initial tests incorrectly reported 70-second latencies by focusing on TTFA instead of total output duration.
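Semantic end-pointing can be sketched as a decision rule that trades pause length against the model's confidence that the partial transcript is already a complete request: high confidence lets the agent respond after a short pause, low confidence forces it to wait longer. The thresholds below are illustrative, not the article's values.

```python
def utterance_complete(semantic_confidence, pause_ms,
                       conf_threshold=0.85, base_pause_ms=700):
    """Decide whether a pause ends the utterance. When the model is
    confident the transcript forms a complete request, a shorter pause
    suffices; otherwise require a longer one to avoid cutting the
    caller off mid-thought. All thresholds here are hypothetical."""
    required_pause = base_pause_ms * (
        0.5 if semantic_confidence >= conf_threshold else 1.5
    )
    return pause_ms >= required_pause

print(utterance_complete(0.95, 400))  # True: confident, short pause suffices
print(utterance_complete(0.40, 400))  # False: likely a mid-thought pause
```

This is why the technique helps TTFA: for clearly complete requests the agent starts speaking hundreds of milliseconds sooner, while ambiguous pauses still get the longer wait.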
Why it matters: This article helps developers understand the evolving landscape of software engineering in the AI era, highlighting the shift in core skills from coding to AI orchestration and strategy, and offers guidance on how to adapt and thrive in future roles.
- AI is transforming the developer role from "code producer" to "creative director of code," emphasizing orchestration and verification.
- Early AI adoption (2023) showed developers seeking AI for summaries and plans, but resisting full implementation due to identity concerns.
- Advanced AI users (2025) achieve fluency through consistent trial-and-error, integrating AI into daily workflows for diverse tasks.
- The developer journey with AI progresses through stages: Skeptic, Explorer, Collaborator, and ultimately, Strategist.
- Key skills now include effective prompting, iterating, and strategic decision-making on when and how to deploy various AI tools and agents.