dist

Posts tagged with dist

Why it matters: Engineers face increasing data fragmentation across SaaS silos. This post details how to build a unified context engine using knowledge graphs, multimodal processing, and prompt optimization (DSPy) to enable effective RAG and agentic workflows over proprietary enterprise data.

  • Dropbox Dash functions as a universal context engine, integrating disparate SaaS applications and proprietary content into a unified searchable index.
  • The system utilizes custom crawlers to navigate complex API rate limits, diverse authentication schemes, and granular permission systems (ACLs).
  • Content enrichment normalizes files into markdown and applies multimodal models for scene extraction from video and transcription of audio.
  • Knowledge graphs are employed to map relationships between entities across platforms, providing deeper context for agentic queries.
  • The engineering team leverages DSPy for programmatic prompt optimization and 'LLM as a judge' frameworks for automated evaluation.
  • The architecture explores the Model Context Protocol (MCP) to standardize how LLMs interact with external data sources and tools.
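A minimal sketch of the cross-platform knowledge-graph idea described above (all entity names hypothetical, not Dropbox's implementation): entities from different SaaS silos become nodes, relationships become edges, and an agent traverses edges to assemble context for a query.

```python
# Toy knowledge graph linking entities across SaaS silos (names hypothetical).
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        # node -> list of (relation, neighbor) pairs
        self.edges = defaultdict(list)

    def add(self, src, relation, dst):
        self.edges[src].append((relation, dst))
        self.edges[dst].append((f"inverse:{relation}", src))

    def neighbors(self, node, relation=None):
        return [dst for rel, dst in self.edges[node]
                if relation is None or rel == relation]

# Entities drawn from different tools: a doc, a ticket, a meeting recording.
kg = KnowledgeGraph()
kg.add("doc:launch-plan", "discussed_in", "meeting:2024-06-01")
kg.add("ticket:PROJ-42", "references", "doc:launch-plan")
kg.add("user:alice", "owns", "ticket:PROJ-42")

# An agentic query can hop across silos: which docs relate to alice's tickets?
tickets = kg.neighbors("user:alice", "owns")
docs = [d for t in tickets for d in kg.neighbors(t, "references")]
```

The inverse edges make traversal bidirectional, which is what lets a query starting from any platform's entity reach related content in another silo.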

Why it matters: WhatsApp's migration demonstrates that Rust is production-ready for massive-scale, cross-platform applications. It proves memory-safe languages can replace legacy C++ to eliminate vulnerabilities while improving performance and maintainability.

  • WhatsApp replaced its wamedia C++ library with a Rust implementation to mitigate memory-related vulnerabilities in media file processing.
  • The migration reduced the codebase from 160,000 lines of C++ to 90,000 lines of Rust while improving performance and memory efficiency.
  • The Kaleidoscope system performs structural checks on media, detects masquerading file types, and flags high-risk elements like embedded scripts.
  • WhatsApp utilized differential fuzzing and extensive integration testing to ensure compatibility between the legacy C++ and new Rust versions.
  • This deployment represents one of the largest global rollouts of Rust, spanning billions of devices across Android, iOS, Web, and wearables.
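The "masquerading file type" check mentioned above can be illustrated with a short sketch (not WhatsApp's Kaleidoscope code): compare the extension a file claims against the magic bytes it actually starts with.

```python
# Hedged sketch: flag files whose magic bytes contradict their extension.
from typing import Optional

MAGIC = {
    b"\xFF\xD8\xFF": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF8": "gif",
    b"%PDF": "pdf",
}

def sniff(data: bytes) -> Optional[str]:
    """Return the file type implied by the leading magic bytes, if known."""
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return kind
    return None

def is_masquerading(filename: str, data: bytes) -> bool:
    claimed = filename.rsplit(".", 1)[-1].lower()
    actual = sniff(data)
    return actual is not None and actual != claimed
```

A PDF payload renamed to `photo.jpg` is flagged, while a genuine JPEG passes; production checks go much further, but the extension-versus-content comparison is the core of the idea.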

Why it matters: For global-scale perimeter services, traditional sequential rollbacks are too slow. This architecture demonstrates how to achieve 10-minute global recovery through warm-standby blue-green deployments and synchronized autoscaling, ensuring high availability for trillions of requests.

  • Salesforce Edge manages a global perimeter platform handling 1.5 trillion monthly requests across 21+ points of presence.
  • Transitioned from sequential regional rollbacks taking up to 12 hours to a global blue-green model that recovers in 10 minutes.
  • Implemented parallel blue and green Kubernetes deployments to maintain a warm standby fleet capable of immediate full-load handling.
  • Customized Horizontal Pod Autoscalers (HPA) to ensure the inactive fleet scales identically to the active fleet, preventing capacity mismatches.
  • Automated traffic redirection using native Kubernetes labels and selectors instead of external L7 routing tools like Argo.
  • Integrated TCP connection draining and controlled traffic cutover to preserve four-nines availability during global rollback events.
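The label-and-selector cutover described above can be simulated in a few lines (an illustrative model, not Salesforce's code): with blue and green fleets both running warm, traffic moves by repointing the Service selector, so no external L7 router is needed.

```python
# Simulated Kubernetes-style Service selector flip for blue-green cutover.
pods = [
    {"name": "edge-blue-0",  "labels": {"app": "edge", "fleet": "blue"}},
    {"name": "edge-green-0", "labels": {"app": "edge", "fleet": "green"}},
]

service = {"selector": {"app": "edge", "fleet": "blue"}}  # blue is active

def endpoints(service, pods):
    """Pods whose labels satisfy every key/value in the Service selector."""
    sel = service["selector"]
    return [p["name"] for p in pods
            if all(p["labels"].get(k) == v for k, v in sel.items())]

# Rollback: flip one label value; the warm green fleet takes full load at once.
service["selector"]["fleet"] = "green"
```

Because both fleets are already scaled identically (per the HPA customization above), the flip is effectively instantaneous rather than gated on pod startup.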

Why it matters: This proof of concept demonstrates how to transform heavy, stateful communication protocols into serverless architectures. It reduces operational overhead and costs to near zero while future-proofing security with post-quantum encryption at the edge.

  • Ported the Matrix homeserver protocol to Cloudflare Workers using TypeScript and the Hono framework.
  • Replaced traditional stateful infrastructure with serverless primitives: D1 for SQL, KV for caching, R2 for media, and Durable Objects for state resolution.
  • Achieved a scale-to-zero cost model, eliminating the fixed overhead of running dedicated virtual private servers.
  • Integrated post-quantum cryptography by default using hybrid X25519MLKEM768 key agreement for TLS 1.3 connections.
  • Leveraged Cloudflare's global edge network to reduce latency by executing homeserver logic in over 300 locations.
  • Maintained end-to-end encryption (Megolm) while adding a quantum-resistant transport layer for defense-in-depth.
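The "Durable Objects for state resolution" role can be sketched with a toy single-writer container (real Matrix state resolution, v2, is considerably more involved; this model simply keeps one winning event per `(type, state_key)` pair, newest timestamp wins):

```python
# Toy stand-in for a Durable-Object-style room state container.
class RoomState:
    def __init__(self):
        self.state = {}  # (type, state_key) -> winning event

    def apply(self, event):
        key = (event["type"], event["state_key"])
        current = self.state.get(key)
        if current is None or event["ts"] >= current["ts"]:
            self.state[key] = event

    def get(self, type_, state_key=""):
        ev = self.state.get((type_, state_key))
        return ev["content"] if ev else None

room = RoomState()
room.apply({"type": "m.room.name", "state_key": "", "ts": 1, "content": "Old"})
room.apply({"type": "m.room.name", "state_key": "", "ts": 2, "content": "New"})
```

The appeal of Durable Objects here is that each room's state lives behind one serialized execution context, so conflicting writes are ordered without a dedicated server process.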

Why it matters: Translating natural language to complex DSLs reduces friction for subject matter experts interacting with massive, federated datasets. This approach bridges the gap between intuitive human intent and rigid technical schemas, improving productivity across hundreds of enterprise applications.

  • Netflix is evolving its Graph Search platform to support natural language queries using Large Language Models (LLMs).
  • The system translates ambiguous user input into a structured Filter Domain Specific Language (DSL) for federated GraphQL data.
  • Accuracy is maintained by ensuring syntactic, semantic, and pragmatic correctness through schema validation and controlled vocabularies.
  • The architecture utilizes Retrieval-Augmented Generation (RAG) to provide domain-specific data processing without replacing existing UIs.
  • Pre-processing and context engineering are critical to prevent LLM hallucinations and ensure fields match the underlying index.
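The schema-validation step above can be sketched as a pre-execution gate (field names hypothetical, not Netflix's schema): every field and enum value in an LLM-generated filter is checked against the index schema and a controlled vocabulary, so hallucinated fields never reach the query engine.

```python
# Hedged sketch: validate an LLM-produced filter against a known schema.
SCHEMA = {
    "title":  {"type": "string"},
    "status": {"type": "enum", "values": {"IN_PRODUCTION", "RELEASED"}},
}

def validate_filter(filt):
    """Return a list of validation errors; an empty list means safe to run."""
    errors = []
    for field, value in filt.items():
        spec = SCHEMA.get(field)
        if spec is None:
            errors.append(f"unknown field: {field}")
        elif spec["type"] == "enum" and value not in spec["values"]:
            errors.append(f"invalid value for {field}: {value}")
    return errors
```

Rejected filters can be fed back to the model with the error list, which is a common repair loop for NL-to-DSL translation.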

Why it matters: Maia 200 represents a shift toward custom first-party silicon optimized for LLM inference. It offers engineers high-performance FP4/FP8 compute and a flexible software stack, significantly reducing the cost and latency of deploying massive models like GPT-5.2 at scale.

  • Maia 200 is built on a TSMC 3nm process, featuring 140 billion transistors and delivering 10 petaFLOPS of FP4 and 5 petaFLOPS of FP8 performance.
  • The memory architecture utilizes 216GB of HBM3e at 7 TB/s alongside 272MB of on-chip SRAM to maximize token generation throughput.
  • It employs a custom Ethernet-based scale-up network providing 2.8 TB/s of bidirectional bandwidth for clusters of up to 6,144 accelerators.
  • The software ecosystem includes the Maia SDK with a Triton compiler, PyTorch integration, and a low-level programming language (NPL).
  • Engineered for efficiency, it achieves 30% better performance per dollar than existing hardware for models like GPT-5.2 and synthetic data generation.
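A back-of-envelope estimate shows why the 7 TB/s HBM figure matters for token generation: LLM decode is typically memory-bound, so throughput per sequence is roughly bandwidth divided by bytes read per token. The model size below is a hypothetical example, not a figure from the announcement.

```python
# Memory-bound decode throughput estimate (model size is an assumption).
hbm_bandwidth = 7e12          # 7 TB/s, per the Maia 200 spec above
model_bytes = 100e9 * 0.5     # hypothetical 100B-parameter model at FP4 (0.5 bytes/param)

# Upper bound on single-sequence decode speed: each token touches all weights.
tokens_per_sec = hbm_bandwidth / model_bytes
```

Batching and on-chip SRAM reuse raise aggregate throughput well beyond this single-sequence bound, which is why the 272MB of SRAM appears alongside the HBM figure.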

Why it matters: Understanding global connectivity disruptions helps engineers build more resilient, multi-homed architectures. It highlights the fragility of physical infrastructure like submarine cables and the impact of BGP routing and government policy on service availability.

  • Q4 2025 saw over 180 global Internet disruptions caused by government mandates, physical infrastructure damage, and technical failures.
  • Tanzania implemented a near-total Internet shutdown during its presidential election, resulting in a 90% traffic drop and fluctuations in BGP address space announcements.
  • Submarine cable cuts, specifically to the PEACE and WACS systems, significantly impacted connectivity in Pakistan and Cameroon.
  • Infrastructure vulnerabilities in Haiti led to multiple outages for Digicel users due to international fiber optic cuts.
  • Beyond physical damage, disruptions were linked to hyperscaler cloud platform issues and ongoing military conflicts affecting regional network stability.

Why it matters: This incident highlights how minor automation errors in BGP policy configuration can cause global traffic disruptions. It underscores the risks of permissive routing filters and the importance of robust validation in network automation to prevent large-scale route leaks.

  • An automated routing policy change intended to remove IPv6 prefix advertisements for a Bogotá data center caused a major BGP route leak in Miami.
  • The removal of specific prefix lists from policy statements resulted in overly permissive terms, unintentionally redistributing peer routes to other providers.
  • The incident lasted 25 minutes, causing significant congestion on Miami backbone infrastructure and affecting both Cloudflare customers and external parties.
  • The leak was classified as a mixture of Type 3 and Type 4 route leaks according to RFC 7908, violating standard valley-free routing principles.
  • Impact was limited to IPv6 traffic and was mitigated by manually reverting the configuration and pausing the automation platform.
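The failure mode above can be shown with a simplified policy evaluator (not Cloudflare's actual policy language): a term whose prefix list is removed matches every route, so a scoped advertisement silently becomes a leak of peer-learned routes.

```python
# Simplified routing-policy sketch: a missing prefix list means "match all".
def evaluate(route, terms):
    """Return True if any term advertises the route; implicit deny otherwise."""
    for term in terms:
        prefixes = term.get("prefix_list")  # None models a removed prefix list
        if prefixes is None or route["prefix"] in prefixes:
            return term["action"] == "advertise"
    return False

peer_route = {"prefix": "2001:db8:beef::/48", "learned_from": "peer"}

# Intended policy: advertise only the data center's own prefixes.
scoped = [{"prefix_list": {"2001:db8:1::/48"}, "action": "advertise"}]

# After automation stripped the prefix list but kept the term:
leaky = [{"prefix_list": None, "action": "advertise"}]
```

This is why validation in network automation typically rejects a term whose match conditions have all been removed rather than treating it as match-everything.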

Why it matters: This article details the architectural shift from fragmented point solutions to a unified AI stack. It provides a blueprint for solving data consistency and metadata scaling challenges, essential for engineers building reliable, real-time agentic systems at enterprise scale.

  • Salesforce unified its data, agent, and application layers into the Agentforce 360 stack to ensure consistent context and reasoning across all surfaces.
  • The platform uses Data 360 as a universal semantic model, harmonizing signals from streaming, batch, and zero-copy sources into a single pane of glass.
  • Engineers addressed metadata scaling by treating metadata as data, enabling efficient indexing and retrieval for massive entity volumes.
  • A harmonization metamodel defines mappings and transformations to generate canonical customer profiles from heterogeneous data sources.
  • The architecture centralizes freshness and ingest control to maintain identical answers across different AI agents and applications.
  • Real-time event correlation is optimized to update unified context immediately while balancing storage costs for large-scale personalization.
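The harmonization metamodel described above can be sketched as per-source field mappings that project heterogeneous records onto one canonical profile (field and source names are hypothetical, not Salesforce's schema):

```python
# Hedged sketch: a harmonization metamodel as per-source field mappings.
METAMODEL = {
    "crm":    {"email": "email_addr", "name": "full_name"},
    "stream": {"email": "user_email", "name": "display_name"},
}

def harmonize(source, record):
    """Rename a raw record's fields to the canonical vocabulary."""
    mapping = METAMODEL[source]
    return {canonical: record.get(raw) for canonical, raw in mapping.items()}

def canonical_profile(records):
    """Merge harmonized records; later sources fill gaps, never overwrite."""
    profile = {}
    for source, record in records:
        for k, v in harmonize(source, record).items():
            if v is not None and k not in profile:
                profile[k] = v
    return profile

records = [
    ("crm",    {"email_addr": "a@x.com"}),
    ("stream", {"user_email": "a@x.com", "display_name": "Alice"}),
]
profile = canonical_profile(records)
```

Real harmonization metamodels also carry type transformations and identity-resolution rules, but the mapping-to-canonical-vocabulary step is the core that makes every agent see the same profile.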

Why it matters: Azure Storage is shifting from passive storage to an active, AI-optimized platform. Engineers must understand these scale and performance improvements to architect systems capable of handling the high-concurrency, high-throughput demands of autonomous agents and LLM lifecycles.

  • Azure Storage is evolving into a unified platform supporting the full AI lifecycle, from frontier model training to large-scale inferencing and agentic applications.
  • Blob scaled accounts now support millions of objects across hundreds of scale units, enabling massive datasets for training and tuning.
  • Azure Managed Lustre (AMLFS) has expanded to support 25 PiB namespaces and 512 GBps throughput to maximize GPU utilization in high-performance computing.
  • Deep integration with frameworks like Microsoft Foundry, Ray, and LangChain facilitates seamless data grounding and low-latency context serving for RAG architectures.
  • Elastic SAN and Azure Container Storage (ACStor) are being optimized for 'agentic scale' to handle the high concurrency and query volume of autonomous agents.
  • New storage tiers and performance updates, such as Premium SSD v2 and Cold/Archive tiers for Azure Files, focus on reducing TCO for mission-critical workloads.