Slack Engineering

Slack Engineering · Dec 1, 2025

Streamlining Security Investigations with Agents

Why it matters: This article details how Slack built robust AI agent systems for security investigations by moving from single prompts to chained, structured model invocations, offering a blueprint for reliable AI application development.

Slack's Security Engineering team implemented AI agents to streamline security investigations, processing billions of events daily.
Initial prototypes, relying on a single large prompt, exhibited inconsistent performance despite prompt refinement attempts.
The team's solution involved breaking down complex investigations into a sequence of chained, single-purpose model invocations.
Utilizing structured output, defined by JSON schema, was key to achieving fine-grained control and predictable behavior at each step.
The production system employs a team of 'personas' (agents) for specific tasks, with the application orchestrating their interactions and context propagation.
This method significantly improves consistency and reliability in AI-driven security analysis, moving beyond simple prompt engineering.
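
The chained, schema-checked approach summarized above can be sketched roughly as follows. This is a minimal illustration, not Slack's implementation: `fake_model`, `TRIAGE_SCHEMA`, and the two-step flow are hypothetical stand-ins, and the hand-rolled `validate` only approximates what a real JSON Schema validator would do.

```python
import json

# Hypothetical schema for one step: required keys and their Python types.
TRIAGE_SCHEMA = {"severity": str, "needs_followup": bool}


def validate(raw: str, schema: dict) -> dict:
    """Parse model output and reject anything that does not match the schema."""
    data = json.loads(raw)
    for key, expected in schema.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")
    return data


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); returns canned JSON."""
    if "triage" in prompt:
        return json.dumps({"severity": "high", "needs_followup": True})
    return json.dumps({"summary": "suspicious login from new ASN"})


def investigate(alert: str, model=fake_model) -> dict:
    # Step 1: one narrow question, one validated structured answer.
    triage = validate(model(f"triage: {alert}"), TRIAGE_SCHEMA)
    if not triage["needs_followup"]:
        return {"triage": triage, "summary": None}
    # Step 2: the application, not the model, decides which context moves on.
    prompt = f"summarize ({triage['severity']}): {alert}"
    summary = validate(model(prompt), {"summary": str})
    return {"triage": triage, "summary": summary["summary"]}
```

Because each step returns a validated structure, a malformed response fails at the step that produced it rather than leaking into later prompts.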

#security #mlp

Slack Engineering · Nov 19, 2025

Android VPAT journey

Why it matters: Ensuring mobile accessibility is critical for legal compliance and inclusive user experiences. This post provides practical implementation details for common Android a11y hurdles, like custom actions and semantic announcements, helping engineers build more robust, accessible apps.

Slack conducted a VPAT audit in 2024 to identify and fix accessibility (a11y) gaps in their Android app following a major UI redesign.
Improved error communication by updating `OutlinedTextField` and `SKBanner` to trigger TalkBack announcements when validation fails.
Enhanced navigation by explicitly marking headings using `semantics { heading() }` in Jetpack Compose and `accessibilityHeading` in XML.
Resolved list count inaccuracies for screen readers by implementing `CollectionInfo` and `CollectionItemInfo` semantics.
Implemented `CustomAccessibilityAction` to make drag-and-drop functionality in the workspace switcher accessible to non-visual users.
Utilized `TtsSpan` to ensure screen readers announce text formatting like strikethrough, which is otherwise ignored by default.

#mobile #frontend

Slack Engineering · Nov 6, 2025

Build better software to build software better

Why it matters: This article demonstrates how applying core software engineering principles like caching and parallelization to build systems can drastically improve developer experience and delivery speed, transforming slow pipelines into agile ones.

Slack cut build times that had reached 60 minutes for the Quip and Slack Canvas backend by applying software engineering principles to the build pipeline.
They leveraged modern tooling like Bazel and modeled builds as directed acyclic graphs (DAGs) to identify optimization opportunities.
Key strategies included caching (doing less work) and parallelization (sharing the load) to improve build performance.
Effective caching relies on hermetic, idempotent units of work and granular cache keys for high hit rates.
Parallelization requires well-defined inputs/outputs and robust handling of work completion/failure across compute boundaries.
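
As a rough illustration of the caching idea above, the sketch below models a build as a DAG and derives each target's cache key from its own inputs plus the keys of its dependencies. It assumes a toy in-memory cache and made-up target names, and it is not how Bazel or Slack's pipeline is implemented.

```python
import hashlib
from graphlib import TopologicalSorter


def cache_key(target: str, sources: dict, dep_keys: list) -> str:
    """Key depends only on the target's own inputs and its dependencies' keys."""
    h = hashlib.sha256(target.encode())
    for name in sorted(sources):
        h.update(name.encode() + b"\0" + sources[name].encode())
    for key in sorted(dep_keys):
        h.update(key.encode())
    return h.hexdigest()


def build(graph: dict, sources: dict, cache: dict) -> list:
    """Build targets in dependency order; return the targets actually rebuilt.

    `graph` maps each target to the set of targets it depends on.
    """
    keys, rebuilt = {}, []
    for target in TopologicalSorter(graph).static_order():
        dep_keys = [keys[dep] for dep in graph.get(target, ())]
        key = cache_key(target, sources.get(target, {}), dep_keys)
        keys[target] = key
        if key not in cache:
            cache[key] = f"artifact:{target}"  # placeholder for real build work
            rebuilt.append(target)
    return rebuilt
```

With granular keys, changing one library rebuilds only that library and its dependents; unrelated targets stay cached. The same graph could drive parallel execution through `TopologicalSorter.get_ready()` and `done()`, which hand out targets whose dependencies have already finished.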

#sre #dist #culture

Slack Engineering · Oct 23, 2025

Advancing Our Chef Infrastructure: Safety Without Disruption

Why it matters: This article demonstrates a practical approach to enhancing configuration management safety and reliability in large-scale cloud environments. Engineers can learn how to reduce deployment risks and improve system resilience through environment segmentation and phased rollouts.

Slack enhanced its Chef infrastructure for safer deployments by addressing reliability risks associated with a single shared production environment.
They transitioned from a monolithic production Chef environment to multiple isolated `prod-X` environments, dynamically mapped to instances based on their Availability Zones.
The `Poptart Bootstrap` tool, baked into AMIs, was extended to assign instances to these specific Chef environments during boot time.
This environment segmentation enables independent updates, significantly reducing the blast radius of potentially problematic configuration changes.
A staggered deployment strategy was implemented, utilizing `prod-1` as a canary for hourly updates and a release train model for `prod-2` through `prod-6` to ensure progressive rollout and early issue detection.
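
The environment mapping and staggered rollout described above might look something like the sketch below. The rule that the Availability Zone's trailing letter selects the environment is purely an assumption for illustration (the summary does not state the real mapping), and `next_rollout_step` is a hypothetical helper, not part of Chef or `Poptart Bootstrap`.

```python
from typing import Optional

ENVIRONMENTS = [f"prod-{i}" for i in range(1, 7)]  # prod-1 .. prod-6, as in the text


def environment_for(az: str) -> str:
    """Assumed rule: the AZ's trailing letter picks the Chef environment."""
    index = ord(az[-1]) - ord("a")
    return ENVIRONMENTS[index % len(ENVIRONMENTS)]


def next_rollout_step(deployed: dict, candidate: str) -> Optional[str]:
    """Return the next environment that should receive `candidate`.

    prod-1 acts as the canary; later environments follow in order.
    Returns None once every environment runs the candidate version.
    """
    for env in ENVIRONMENTS:
        if deployed.get(env) != candidate:
            return env
    return None
```

Promoting one environment at a time means a bad change reaches only the instances mapped to the environments already updated.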

#sre #dist

Slack Engineering · Oct 7, 2025

Deploy Safety: Reducing customer impact from change

Why it matters: This article details Slack's successful Deploy Safety Program, which drastically cut customer impact from deployments. It provides a practical framework for improving reliability, incident response, and development velocity in complex, distributed systems.

Slack's Deploy Safety Program reduced customer impact from change-triggered incidents by 90% over 1.5 years while maintaining development velocity.
The program targeted the 73% of customer-facing incidents that were triggered by changes, chiefly code deploys, across diverse services and deployment systems.
North Star goals included automated detection/remediation within 10 minutes and preventing problematic deployments from reaching 10% of the fleet.
A custom metric, "Hours of customer impact from high/selected medium severity change-triggered incidents," measured program effectiveness.
Investment prioritized known pain points, rapid iteration, and scaling successful patterns like automated monitoring and rollbacks.
Key projects involved automating deployments, rollbacks, and blast radius control for critical systems like Webapp backend and frontend.
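
One way to picture the blast-radius goal above is a staged rollout that checks health after each stage and rolls back automatically. The stage percentages and the `is_healthy` callback below are assumptions for illustration; the summary does not describe Slack's actual stages or health signals.

```python
def staged_deploy(fleet_size, is_healthy, stage_percents=(1, 5, 10, 50, 100)):
    """Deploy in growing stages; roll back at the first unhealthy check.

    `is_healthy` receives the number of hosts reached so far.
    Returns a (status, hosts_reached) tuple.
    """
    reached = 0
    for percent in stage_percents:
        reached = fleet_size * percent // 100
        if not is_healthy(reached):
            return ("rolled_back", reached)
    return ("complete", reached)
```

Small early stages give automated detection a chance to fire before a problematic deploy reaches 10% of the fleet.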

#sre #dist #culture

Slack Engineering · Sep 4, 2025

Building Slack’s Anomaly Event Response

Why it matters: This article details Slack's Anomaly Event Response, showcasing a real-world example of building a proactive, automated security system. Engineers can learn about designing multi-tiered architectures for real-time threat detection and response, crucial for modern platform security.

Slack's Anomaly Event Response (AER) is a proactive security system that detects and responds to emerging threat behaviors in real-time.
AER automatically terminates suspicious user sessions, reducing the detection-to-response gap from hours/days to minutes.
It targets common threats like Tor access, excessive downloads, data scraping, session fingerprint mismatches, and unusual API patterns.
The system uses a multi-tiered architecture: detection engine, decision framework, and response orchestrator.
Enterprise Grid customers can configure AER to select which anomalies trigger automated responses and notification preferences.
This native solution disrupts attack chains, preventing data exfiltration and system compromise without requiring additional tools or staff.
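
The three tiers named above (detection, decision, response) can be sketched as a small pipeline. The anomaly kinds, the `notify_only` action, and the shape of the event dictionaries are hypothetical; this only illustrates how customer configuration might gate the automated response.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    session_id: str
    kind: str  # e.g. "tor_access" or "excessive_downloads" (illustrative names)


KNOWN_KINDS = {"tor_access", "excessive_downloads", "fingerprint_mismatch"}


def detect(events):
    """Detection tier: flag events that match a known anomaly kind."""
    return [Anomaly(e["session_id"], e["kind"]) for e in events if e["kind"] in KNOWN_KINDS]


def decide(anomaly, enabled_kinds):
    """Decision tier: act only on anomaly kinds the customer has opted into."""
    return "terminate_session" if anomaly.kind in enabled_kinds else "notify_only"


def respond(anomaly, action, terminated):
    """Response tier: carry out the action (here, record terminated sessions)."""
    if action == "terminate_session":
        terminated.add(anomaly.session_id)
    return action


def run(events, enabled_kinds):
    terminated = set()
    actions = [respond(a, decide(a, enabled_kinds), terminated) for a in detect(events)]
    return actions, terminated
```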

#security

Slack Engineering · Apr 14, 2025

Optimizing Our E2E Pipeline

Why it matters: This article demonstrates a practical approach to significantly improve CI/CD pipeline efficiency and developer experience. By intelligently caching and reusing build artifacts, engineering teams can drastically reduce build times and infrastructure costs.

Slack's DevXP team optimized their E2E testing pipeline by addressing redundant frontend builds in a large monorepo.
Previously, frontend code was built for every pull request, consuming 5 minutes per run even without relevant changes, leading to significant time and resource waste.
The solution implemented conditional builds, using `git diff` to detect actual frontend changes before initiating a new build.
If no frontend changes were detected, the pipeline reused existing production frontend assets stored in AWS S3 and served via an internal CDN.
This optimization resulted in a 60% reduction in build frequency and a 50% decrease in overall build time, saving thousands of engineering hours and terabytes of storage.
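
The conditional-build check described above can be approximated as follows. The path prefixes and the `reuse-prebuilt-assets` label are assumptions; only `git diff --name-only` is taken from standard Git behaviour.

```python
import subprocess

# Hypothetical locations of frontend code in the monorepo.
FRONTEND_PREFIXES = ("frontend/", "package.json")


def changed_files(base: str, head: str = "HEAD") -> list:
    """List files that differ between the merge base of `base` and `head`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def needs_frontend_build(files) -> bool:
    """True when any changed path falls under a frontend prefix."""
    return any(path.startswith(FRONTEND_PREFIXES) for path in files)


def build_plan(files) -> str:
    """Decide between a fresh build and reusing previously built assets."""
    return "build" if needs_frontend_build(files) else "reuse-prebuilt-assets"
```

In a pipeline, `build_plan(changed_files("origin/main"))` would run first, and the expensive build step would be skipped whenever the plan is to reuse assets.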

#sre #finops #frontend

Slack Engineering · Mar 7, 2025

How we built enterprise search to be secure and private

Why it matters: This article details how to build secure, privacy-preserving enterprise search and AI features. It offers a blueprint for integrating external data without compromising user data, leveraging RAG, federated search, and strict access controls. Essential for engineers building secure data platforms.

Slack's enterprise search and AI uphold strict security and privacy by keeping customer data within Slack's trust boundary, utilizing an AWS escrow VPC for LLMs.
The system employs Retrieval Augmented Generation (RAG) instead of training Large Language Models (LLMs) on customer data, ensuring data privacy and preventing retention.
Enterprise search operates on a federated, real-time model, never storing external source data in Slack's databases, but rather fetching it via partner APIs.
Access to external content is strictly permissioned based on the user's existing Access Control Lists (ACLs) and requires explicit user/admin consent, adhering to the principle of least privilege.
External data and permissions are always up-to-date with the source system, ensuring accuracy and compliance.
Search Answer summaries generated by the AI are ephemeral, shown to the user and immediately discarded, further enhancing privacy.
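
A simplified picture of the federated, permission-checked search described above: external sources are queried at request time, only consented sources are contacted, and results are filtered by the source's access list before anything is returned. The connector callables and the `allowed_users` field are assumptions for illustration; in practice the partner API may enforce permissions itself.

```python
def federated_search(user, query, connectors, consented):
    """Query each consented external source at request time; persist nothing.

    `connectors` maps a source name to a callable that returns documents,
    each a dict with "title" and "allowed_users" (hypothetical shape).
    """
    results = []
    for name, search in connectors.items():
        if name not in consented:
            continue  # no explicit consent, so the source is never queried
        for doc in search(query):
            if user in doc["allowed_users"]:  # respect the source system's ACL
                results.append({"source": name, "title": doc["title"]})
    return results
```

Because results are computed per request and never written to a database, permission changes in the source system take effect on the next query.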

#security #data #dist

Slack Engineering · Jan 7, 2025

Automated Accessibility Testing at Slack

Why it matters: This article highlights the practical challenges and solutions in integrating automated accessibility testing into existing frontend development workflows. It provides valuable insights for engineers looking to enhance their testing strategies without disrupting core framework functionalities.

Slack integrates automated accessibility testing into its development process to supplement manual testing and ensure compliance with Web Content Accessibility Guidelines (WCAG).
Automated testing is viewed as a valuable addition to a comprehensive strategy, not a replacement for human judgment, as it has limitations in catching nuanced accessibility issues.
Initial attempts to embed Axe accessibility checks directly into the React Testing Library (RTL) framework with Jest were abandoned due to complexities with Slack's custom Jest setup.
The team pivoted to using Playwright, Slack's end-to-end (E2E) test framework, integrating Axe via the `@axe-core/playwright` package.
Directly embedding Axe checks into Playwright's `Locator` object methods proved challenging because `Locator` ensures individual element readiness, not full page rendering, which is crucial for accurate accessibility audits.
Workarounds involved leveraging Playwright's flexibility and Axe Core's customization features, such as filtering rules and specific accessibility tags, for selective application of checks.

#frontend #culture

Slack Engineering · Dec 16, 2024

Migration Automation: Easing the Jenkins → GHA shift with help from AI

Why it matters: This article showcases a successful, automated approach to a common, complex CI/CD migration challenge. It provides valuable insights into leveraging existing tools and AI to reduce manual effort and accelerate infrastructure shifts, directly impacting developer productivity and system reliability.

Slack successfully migrated its CI infrastructure from Jenkins to GitHub Actions, addressing developer frustration and improving UX.
An intern-developed automation tool, leveraging AI, significantly streamlined the migration of Jenkins pipelines to GHA workflows.
This tool is projected to cut migration time by half and save over 1,300 hours across 242 pipelines.
The process involved using GitHub Actions Importer, followed by custom Python scripts and LLMs to correct partially converted workflows.
Key challenges included identifying and addressing common unsupported Jenkins steps and replacing rate-limited GHA actions with internal mirrors.
The project demonstrates a practical, hybrid approach to large-scale CI/CD system migration.
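
The post-processing step described above (custom scripts that fix up partially converted workflows) might resemble the sketch below. The mirror mapping and the `# UNSUPPORTED:` marker are hypothetical placeholders; the summary does not say what the real tool matches on.

```python
import re

# Hypothetical mapping from public actions to internal mirrors.
MIRRORS = {"actions/checkout": "internal-mirror/checkout"}
# Hypothetical marker left on steps the importer could not convert.
UNSUPPORTED_MARKER = "# UNSUPPORTED:"

USES_PATTERN = re.compile(r"(\s*-?\s*uses:\s*)([\w./-]+)(@\S+)?")


def rewrite_workflow(text: str):
    """Swap mirrored actions and collect line numbers that still need review."""
    needs_review, out = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        match = USES_PATTERN.match(line)
        if match and match.group(2) in MIRRORS:
            # Keep the indentation, version pin, and any trailing text intact.
            line = (match.group(1) + MIRRORS[match.group(2)]
                    + (match.group(3) or "") + line[match.end():])
        if UNSUPPORTED_MARKER in line:
            needs_review.append(number)
        out.append(line)
    return "\n".join(out), needs_review
```

Lines reported in `needs_review` are the ones a follow-up pass, manual or LLM-assisted, would still have to correct.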

#sre #dist
