Slack Engineering

Slack Engineering · Jul 14, 2026

Shipyard: How We Built Slack’s Next-Generation EC2 Platform

Why it matters: Shipyard brings container-like deployment practices to EC2, enabling immutability and automated safety at scale. By treating infrastructure as artifacts, Slack reduces drift and improves security for critical workloads that cannot easily migrate to containers.

Shipyard transitions Slack from mutable, long-lived EC2 instances to an immutable infrastructure model where instances are treated as deployable artifacts.
The platform utilizes a layered image strategy starting with 'slack-zero,' a hardened base image providing standardized security, monitoring, and networking.
Deployments are managed via Gondola, enabling progressive rollouts with metrics-driven safety checks and automated rollback capabilities.
Configuration management moves into the image-baking and initial-provisioning phases, eliminating background drift and reducing system complexity.
A new inventory system called Peekaboo provides real-time fleet visibility by integrating AWS EventBridge, OpenSearch, and Lambda.
Instances are short-lived and continuously rotated to ensure security compliance and maintain the integrity of the immutable deployment model.
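
The progressive rollout with metrics-driven safety checks described above can be sketched roughly as follows. This is a minimal illustration, not Gondola's actual logic: the function `progressive_rollout`, the stage percentages, and the error-rate threshold are all hypothetical.

```python
from typing import Callable, List


def progressive_rollout(
    stages: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> dict:
    """Advance through rollout stages (percent of fleet on the new image).

    After each stage, a metric is checked; if it exceeds the threshold,
    the rollout stops and reports a rollback. All names are hypothetical.
    """
    completed: List[int] = []
    for percent in stages:
        rate = error_rate_at(percent)  # stand-in for a real metrics query
        if rate > max_error_rate:
            return {"status": "rolled_back", "failed_at": percent, "completed": completed}
        completed.append(percent)
    return {"status": "deployed", "failed_at": None, "completed": completed}


# A healthy image passes every stage; an unhealthy one is caught at 50%.
healthy = progressive_rollout([1, 10, 50, 100], lambda p: 0.001)
unhealthy = progressive_rollout([1, 10, 50, 100], lambda p: 0.05 if p >= 50 else 0.001)
```

The point of the staged shape is that a bad image is only ever exposed to the fraction of the fleet covered by the stage that failed.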

#sre #security


Slack Engineering · Jun 11, 2026

Agentic Testing: Where Agents Fit in the E2E Testing Stack

Why it matters: Agentic testing shifts E2E focus from rigid journeys to goal-based verification. While too slow and costly for every PR, it provides a powerful exploratory layer that adapts to UI changes and handles complex state transitions where traditional deterministic scripts often fail.

Traditional E2E tests enforce specific UI journeys, while agentic tests focus on achieving high-level goals through adaptive, non-deterministic action sequences.
The study compared three approaches: Agent + Playwright MCP, Agent + Playwright CLI, and agent-generated deterministic Playwright code.
Agentic workflows are significantly slower (10+ minutes) and costlier ($15–$30/run) than traditional scripts, limiting their use in standard CI/CD.
Structured YAML inputs outperformed natural language for complex workflows by providing explicit mapping between instructions and browser actions.
Agents excel at exploratory testing and self-healing during UI updates, making them ideal for post-deployment verification rather than pre-merge checks.
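
To make the idea of structured inputs concrete, here is a small sketch of a goal specification whose steps map explicitly to browser actions, written as the Python equivalent of a YAML document. The field names (`goal`, `steps`, `action`, `target`) and the action vocabulary are assumptions for illustration; the source does not describe Slack's actual schema.

```python
# Hypothetical action vocabulary; the real set is not given in the source.
ALLOWED_ACTIONS = {"navigate", "click", "fill", "assert_visible"}


def validate_spec(spec: dict) -> list:
    """Return a list of problems found in a structured test spec."""
    problems = []
    if not spec.get("goal"):
        problems.append("missing goal")
    for i, step in enumerate(spec.get("steps", [])):
        if step.get("action") not in ALLOWED_ACTIONS:
            problems.append(f"step {i}: unknown action {step.get('action')!r}")
        if "target" not in step:
            problems.append(f"step {i}: missing target")
    return problems


good_spec = {
    "goal": "User can post a message in a channel",
    "steps": [
        {"action": "navigate", "target": "/client"},
        {"action": "fill", "target": "message-input", "value": "hello"},
        {"action": "click", "target": "send-button"},
        {"action": "assert_visible", "target": "message:hello"},
    ],
}
bad_spec = {"goal": "", "steps": [{"action": "hover"}]}

good_problems = validate_spec(good_spec)
bad_problems = validate_spec(bad_spec)
```

Because each step names one action and one target, an agent (or a validator like this one) has less to infer than it would from a free-form paragraph.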

#frontend #mlp


Slack Engineering · May 28, 2026

Slack AI: The Path to Multi-Cloud

Why it matters: This article provides a blueprint for scaling enterprise LLM infrastructure. It details the transition from manual GPU management to managed services, highlighting how to balance security, cost-efficiency, and reliability through strategic multi-cloud orchestration and capacity forecasting.

Slack transitioned from AWS SageMaker to Amazon Bedrock to reduce operational overhead and address GPU scarcity.
The architecture uses an escrow VPC strategy to maintain a zero-knowledge environment, ensuring data privacy and model security.
Infrastructure evolved from managing raw GPU instances to utilizing Model Units for deterministic throughput and easier scaling.
Slack optimized costs by using Provisioned Throughput for interactive features and On Demand capacity for bursty, scheduled tasks.
A zero-incident migration was achieved through extensive load testing, shadow requests, and gradual feature flag rollouts.
The shift enabled faster adoption of new LLMs, reducing the feature lag previously experienced with custom serving solutions.
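
A gradual feature-flag rollout of the kind mentioned above is often implemented with deterministic bucketing, so that raising the percentage only ever adds users. The sketch below assumes that approach; the function names and the choice of SHA-256 are illustrative, not taken from Slack's implementation.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_backend(user_id: str, rollout_percent: int) -> bool:
    """Route to the new serving backend if the user's bucket is under the flag value."""
    return bucket(user_id) < rollout_percent


users = [f"U{i:05d}" for i in range(1000)]
at_10 = {u for u in users if use_new_backend(u, 10)}
at_50 = {u for u in users if use_new_backend(u, 50)}
```

Since a user's bucket never changes, everyone routed to the new backend at 10% stays there at 50%, which keeps behavior stable while the flag is ramped.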

#mlp #sre #security


Slack Engineering · May 5, 2026

From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines

Why it matters: This migration demonstrates how to eliminate stateful, insecure SSH dependencies in large-scale data platforms. It shows a path toward better reliability, finer audit granularity, and modern infrastructure like Spark on Kubernetes by adopting stateless REST-based orchestration.

Slack migrated 700+ data pipeline jobs from SSH-based submission to a REST-based architecture across 8 regions with zero downtime.
The legacy SSH pattern caused resource contention on EMR master nodes and created 'zombie' jobs when network connections dropped.
The team utilized YARN Distributed Shell to wrap arbitrary CLI commands and MapReduce jobs into a REST-compatible format via the YARN API.
Custom Airflow operators were developed to abstract the submission protocol, allowing a seamless transition for data engineers.
Eliminating SSH dependencies unblocked critical infrastructure goals, including Spark on Kubernetes and enhanced network isolation via child accounts.
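
For orientation, a REST submission to YARN replaces the SSH session with a JSON request to the ResourceManager. The sketch below only builds the request body and sends nothing. The field names follow the ResourceManager's "Submit Application" REST request as commonly documented, but verify them against your Hadoop version. The helper `build_submit_body`, the application ID, and the placeholder launch command are hypothetical, and the real Distributed Shell ApplicationMaster command is deliberately left elided.

```python
import json


def build_submit_body(app_id: str, name: str, am_command: str,
                      memory_mb: int = 1024, vcores: int = 1) -> dict:
    """Build a JSON body for POST /ws/v1/cluster/apps (sketch only).

    app_id would come from a prior POST /ws/v1/cluster/apps/new-application.
    am_command must launch a real ApplicationMaster (for example, the
    Distributed Shell AM wrapping the CLI command); it is not filled in here.
    """
    return {
        "application-id": app_id,
        "application-name": name,
        "application-type": "YARN",
        "am-container-spec": {"commands": {"command": am_command}},
        "resource": {"memory": memory_mb, "vCores": vcores},
        "max-app-attempts": 1,
    }


body = build_submit_body(
    app_id="application_0000000000000_0001",          # placeholder ID
    name="nightly-export",                             # hypothetical job name
    am_command="<distributed-shell AM launch command>",  # elided on purpose
)
payload = json.dumps(body, indent=2)
```

Because the submission is a single stateless request, a dropped client connection no longer strands a running job the way a broken SSH session could.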

#data #security #sre


Slack Engineering · Apr 13, 2026

Managing context in long-run agentic applications

Why it matters: Managing context in long-run agentic systems is critical as context windows fill and performance degrades. This architecture shows how to use structured memory and specialized agent roles to maintain coherence and accuracy across complex, multi-step workflows.

LLMs are stateless, requiring the full message history to be sent with each request, which can quickly exhaust context windows in long-running tasks.
Slack uses a multi-agent architecture consisting of a Director, Experts, and a Critic to manage security investigations spanning hundreds of inference requests.
The Director's Journal provides structured working memory, using specific entry types like 'hypothesis' and 'decision' to maintain strategic alignment.
The Critic agent generates a Review and Timeline with credibility scores to consolidate findings and prevent information overload for other agents.
Effective context management balances the need for team-wide coherence with the need to limit information sharing to prevent confirmation bias.
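
A structured journal of the kind described above might look like the following sketch. Only the entry types 'hypothesis' and 'decision' come from the summary; the class names, the `render` method, and the idea of passing just the most recent entries to the next request are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# 'hypothesis' and 'decision' are named in the summary; a real system
# likely defines more entry types than these two.
ENTRY_TYPES = {"hypothesis", "decision"}


@dataclass
class JournalEntry:
    entry_type: str
    text: str


@dataclass
class Journal:
    entries: List[JournalEntry] = field(default_factory=list)

    def add(self, entry_type: str, text: str) -> None:
        """Append a typed entry, rejecting unknown entry types."""
        if entry_type not in ENTRY_TYPES:
            raise ValueError(f"unknown entry type: {entry_type}")
        self.entries.append(JournalEntry(entry_type, text))

    def render(self, last_n: int = 5) -> str:
        """Render the most recent entries as compact context for the next request."""
        recent = self.entries[-last_n:]
        return "\n".join(f"[{e.entry_type}] {e.text}" for e in recent)


journal = Journal()
journal.add("hypothesis", "Credential reuse from a phished account")
journal.add("decision", "Ask the access expert to pull recent login history")
context = journal.render()
```

Sending a short rendered journal instead of the full message history is one way to keep a long investigation inside the context window.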

#security #mlp #dist


Slack Engineering · Mar 31, 2026

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

Why it matters: As HTTP/3 and QUIC become standard, legacy monitoring tools often fail to provide visibility into UDP-based traffic. Open-sourcing these capabilities into Prometheus BBE enables engineers to monitor modern network protocols without relying on fragmented or proprietary solutions.

Slack faced a critical observability gap when rolling out HTTP/3 because existing SaaS and internal tools lacked support for UDP-based QUIC probing.
An engineering intern developed and open-sourced QUIC support for the Prometheus Blackbox Exporter (BBE) using the quic-go library.
The implementation integrated a new HTTP/3 transport into BBE's client while maintaining existing configuration patterns and composability.
The new probing system enables a unified view of HTTP/1.1, HTTP/2, and HTTP/3 metrics within Grafana for easier correlation and debugging.
Open-sourcing the contribution future-proofs Slack's infrastructure and provides the wider Prometheus community with native HTTP/3 monitoring capabilities.
Roadmap items include Server Name Indication (SNI) routing validation and hop-by-hop visualization of the end-to-end network path.

#sre #dist


Slack Engineering · Mar 19, 2026

How Slack Rebuilt Notifications 📣

Why it matters: Scaling notification systems requires balancing high-volume delivery with user cognitive load. Slack's rebuild demonstrates how architectural simplification and cross-platform consistency reduce technical debt and improve UX by making complex systems predictable.

Consolidated four conflicting mental models into a single, unified notification system across desktop and mobile platforms.
Decoupled notification triggers from delivery mechanisms to allow users to separate in-app awareness from push notifications.
Simplified channel-level settings into three clear, predictable options: All new posts, Mentions, or Mute.
Improved synchronization logic to ensure preference states remain consistent and reliable across all client devices.
Redesigned the global preference architecture to provide a centralized, hierarchical structure for both basic and advanced user controls.
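
One way to picture the decoupling of triggers from delivery is to compute them in two separate steps, as in the sketch below. The three channel settings come from the summary; everything else, including the assumption that a muted channel suppresses mentions and the `push_enabled` flag, is hypothetical and may not match Slack's real behavior.

```python
def should_trigger(channel_setting: str, is_mention: bool) -> bool:
    """Decide whether a message triggers a notification at all.

    channel_setting is one of "all", "mentions", "mute". Treating "mute"
    as suppressing mentions too is an assumption of this sketch.
    """
    if channel_setting == "mute":
        return False
    if channel_setting == "mentions":
        return is_mention
    if channel_setting == "all":
        return True
    raise ValueError(f"unknown channel setting: {channel_setting}")


def deliveries(channel_setting: str, is_mention: bool, push_enabled: bool) -> list:
    """Decide how a triggered notification is delivered, separately from whether it fires."""
    if not should_trigger(channel_setting, is_mention):
        return []
    channels = ["in_app"]
    if push_enabled:
        channels.append("push")
    return channels


quiet_awareness = deliveries("all", is_mention=False, push_enabled=False)
full_delivery = deliveries("mentions", is_mention=True, push_enabled=True)
suppressed = deliveries("mentions", is_mention=False, push_enabled=True)
```

Keeping the two decisions apart is what lets a user stay aware of activity in-app without also receiving a push notification for it.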

#mobile #frontend #dist


Slack Engineering · Dec 1, 2025

Streamlining Security Investigations with Agents

Why it matters: This article details how Slack built robust AI agent systems for security investigations by moving from single prompts to chained, structured model invocations, offering a blueprint for reliable AI application development.

Slack's Security Engineering team implemented AI agents to streamline security investigations, processing billions of events daily.
Initial prototypes, relying on a single large prompt, exhibited inconsistent performance despite prompt refinement attempts.
The team's solution involved breaking down complex investigations into a sequence of chained, single-purpose model invocations.
Utilizing structured output, defined by JSON schema, was key to achieving fine-grained control and predictable behavior at each step.
The production system employs a team of 'personas' (agents) for specific tasks, with the application orchestrating their interactions and context propagation.
This method significantly improves consistency and reliability in AI-driven security analysis, moving beyond simple prompt engineering.
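
The chaining pattern described above can be sketched as a sequence of small invocations, each validated against an expected output shape before the next one runs. This is a simplified illustration: `fake_model` stands in for a real model call, the key-to-type map is a stand-in for a full JSON schema, and all step and field names are hypothetical.

```python
import json


def validate(payload: dict, schema: dict) -> dict:
    """Check required keys and types; a simplified stand-in for JSON Schema validation."""
    for key, expected_type in schema.items():
        if key not in payload:
            raise ValueError(f"missing key: {key}")
        if not isinstance(payload[key], expected_type):
            raise ValueError(f"wrong type for {key}")
    return payload


def run_step(model, prompt: str, schema: dict) -> dict:
    """One single-purpose invocation whose output must match the schema."""
    return validate(json.loads(model(prompt)), schema)


def fake_model(prompt: str) -> str:
    """Hypothetical model stub that returns canned structured output."""
    if prompt.startswith("classify"):
        return json.dumps({"severity": "high", "needs_followup": True})
    return json.dumps({"summary": "login from new device", "confidence": 0.8})


def investigate(alert: str, model) -> dict:
    """Chain two steps; the application, not the model, decides what runs next."""
    triage = run_step(model, f"classify: {alert}",
                      {"severity": str, "needs_followup": bool})
    if not triage["needs_followup"]:
        return {"triage": triage, "analysis": None}
    analysis = run_step(model, f"analyze: {alert} severity={triage['severity']}",
                        {"summary": str, "confidence": float})
    return {"triage": triage, "analysis": analysis}


result = investigate("unusual sign-in for user 42", fake_model)
```

Because each step has a narrow job and a checked output, a malformed response fails loudly at that step instead of silently degrading the rest of the investigation.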

#security #mlp


Slack Engineering · Nov 19, 2025

Android VPAT journey

Why it matters: Ensuring mobile accessibility is critical for legal compliance and inclusive user experiences. This post provides practical implementation details for common Android a11y hurdles, like custom actions and semantic announcements, helping engineers build more robust, accessible apps.

Slack conducted a VPAT audit in 2024 to identify and fix accessibility (a11y) gaps in its Android app following a major UI redesign.
Improved error communication by updating OutlinedTextField and SKBanner to trigger TalkBack announcements when validation fails.
Enhanced navigation by explicitly marking headings using semantics { heading() } in Jetpack Compose and accessibilityHeading in XML.
Resolved list count inaccuracies for screen readers by implementing CollectionInfo and CollectionItemInfo semantics.
Implemented CustomAccessibilityAction to make drag-and-drop functionality in the workspace switcher accessible to non-visual users.
Utilized TtsSpan to ensure screen readers announce text formatting such as strikethrough, which they ignore by default.

#mobile #frontend


Slack Engineering · Nov 6, 2025

Build better software to build software better

Why it matters: This article demonstrates how applying core software engineering principles like caching and parallelization to build systems can drastically improve developer experience and delivery speed, transforming slow pipelines into agile ones.

Slack cut the 60-minute build times of the Quip and Slack Canvas backend by applying software engineering principles to the build pipeline.
They leveraged modern tooling like Bazel and modeled builds as directed acyclic graphs (DAGs) to identify optimization opportunities.
Key strategies included caching (doing less work) and parallelization (sharing the load) to improve build performance.
Effective caching relies on hermetic, idempotent units of work and granular cache keys for high hit rates.
Parallelization requires well-defined inputs/outputs and robust handling of work completion/failure across compute boundaries.

#sre #dist #culture

