Posts tagged with culture
Why it matters: This article offers valuable lessons on building and scaling an AI platform over a decade, emphasizing the interplay between technical choices, organizational alignment, and adapting to rapid ML advancements. It's crucial for engineers developing complex ML infrastructure.
- Pinterest's AI Platform evolved over a decade from fragmented team stacks to a unified system, driven by organizational alignment and technical necessity.
- Platform foundations are layered, bottom-up, and temporary, demanding rebuilds to adapt to new ML paradigms like DNNs, GPUs, and LLMs.
- Early efforts like Linchpin DSL and Scorpion inference unified features and serving, addressing training-serving skew.
- Custom DSLs proved brittle as ML evolved, underscoring the need for flexible, industry-standard solutions.
- Successful platform adoption requires strong organizational incentives, leadership sponsorship, and alignment with product goals.
- Efficiency and velocity are boosted by concurrent advances in modeling and platform infrastructure, especially for frontier models.
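To make the training-serving skew point concrete, here is a minimal sketch of the general idea: define each feature once and derive both the training row and the serving payload from that single definition. This is not Linchpin or Scorpion; the registry `FEATURES`, the feature names, and the helper functions are all hypothetical and exist only for illustration.

```python
import math
from typing import Callable, Dict, Mapping, Tuple

RawEvent = Mapping[str, float]

# Hypothetical feature registry: every feature is defined exactly once.
FEATURES: Dict[str, Callable[[RawEvent], float]] = {
    "log_clicks": lambda raw: math.log1p(max(raw.get("clicks", 0.0), 0.0)),
    "ctr": lambda raw: raw.get("clicks", 0.0) / max(raw.get("impressions", 0.0), 1.0),
}


def build_features(raw: RawEvent) -> Dict[str, float]:
    """Compute all registered features from one raw event."""
    return {name: fn(raw) for name, fn in FEATURES.items()}


def training_example(raw: RawEvent, label: int) -> Tuple[Dict[str, float], int]:
    """Offline path: features plus a label, built from the shared definitions."""
    return build_features(raw), label


def serving_features(raw: RawEvent) -> Dict[str, float]:
    """Online path: the same shared definitions, with no separate reimplementation."""
    return build_features(raw)


event = {"clicks": 3.0, "impressions": 40.0}
features, label = training_example(event, 1)
# Both paths go through build_features, so they cannot drift apart.
assert features == serving_features(event)
```

Because both paths call `build_features`, a change to a feature definition reaches training and serving together, which is the property a unified feature layer is meant to provide.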
Why it matters: This event fosters innovation and skill development in game creation, encouraging engineers to experiment with new technologies and collaborative workflows. It's an excellent opportunity to build a portfolio project and engage with a global developer community.
- GitHub's annual Game Off 2025 game jam has announced "WAVES" as its theme, challenging developers to create games based on this concept.
- Participants must develop their games and submit them to itch.io by December 1, 2025, with source code hosted in a public GitHub repository.
- The jam encourages diverse interpretations of the "WAVES" theme, offering various conceptual ideas from physics puzzlers to rhythm games.
- Developers can work solo or in teams, using any programming languages, game engines (e.g., Godot, Unity, Bevy), or AI-assisted tools.
- Games will be evaluated by participants across categories like gameplay, graphics, audio, innovation, and theme interpretation.
- The event is designed to be beginner-friendly, welcoming both experienced and first-time game developers to explore game creation.
Why it matters: This article highlights the transformative impact of AI agents on software development, enabling developers to focus on higher-value tasks and accelerating innovation. It showcases GitHub's platform and Microsoft's infrastructure as key enablers for this "new era of collaboration."
- GitHub Universe 2025 emphasized a "new era of collaboration" where AI agents automate repetitive coding, freeing developers for complex problem-solving.
- GitHub launched Agent HQ, an open ecosystem providing a single mission control for assigning, governing, and tracking multiple AI agents.
- Microsoft Azure supplies the underlying infrastructure, which the article presents as a way to accelerate agentic AI adoption and turn it into a strategic advantage.
- The Octoverse 2025 report reveals significant growth: 180M+ developers, 80% of new users adopting Copilot in their first week, and 4.3M+ AI-related repositories.
- AI's influence is evident as TypeScript and Python are now the top two most used languages, reflecting AI development preferences.
- AI and agentic workflows are reinventing software development, boosting efficiency for enterprises and enabling startups to ship faster.
Why it matters: This centralizes diverse AI coding agents within GitHub, streamlining developer workflows and enhancing productivity. It offers a unified command center and integrated AI capabilities, making AI a native part of development rather than an add-on for complex tasks.
- GitHub introduces Agent HQ, an open ecosystem integrating AI coding agents from various providers (Anthropic, OpenAI, Google) directly into the GitHub platform.
- A "mission control" command center allows users to assign, steer, and track multiple agents across GitHub, VS Code, mobile, and CLI.
- The platform maintains core GitHub primitives (Git, pull requests) while enhancing workflows with AI-driven capabilities like agentic code review.
- New VS Code "Plan Mode" helps developers define step-by-step approaches for tasks, improving AI context and identifying project gaps early.
- Enterprise features include a control plane for AI access, agent behavior governance, and metrics dashboards to track AI impact.
- Access to these integrated agents is included with a paid GitHub Copilot subscription.
Why it matters: This article details Slack's successful Deploy Safety Program, which drastically cut customer impact from deployments. It provides a practical framework for improving reliability, incident response, and development velocity in complex, distributed systems.
- Slack's Deploy Safety Program reduced customer impact from change-triggered incidents by 90% over 1.5 years while maintaining development velocity.
- The program targeted the 73% of customer-facing incidents that were caused by code deploys, which span diverse services and deployment systems.
- North Star goals included automated detection and remediation within 10 minutes, and stopping problematic deployments before they reach 10% of the fleet.
- A custom metric, "Hours of customer impact from high/selected medium severity change-triggered incidents," measured program effectiveness.
- Investment prioritized known pain points, rapid iteration, and scaling successful patterns like automated monitoring and rollbacks.
- Key projects involved automating deployments, rollbacks, and blast radius control for critical systems like Webapp backend and frontend.
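The blast-radius goal above can be illustrated with a small staged-rollout gate. This is a sketch under stated assumptions, not Slack's deployment system: the stage fractions, the `staged_rollout` function, and the `is_healthy` callback are hypothetical, and the 10-minute detection window is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool
    rolled_back: bool
    max_fraction_reached: float


def staged_rollout(
    stages: Sequence[float],
    is_healthy: Callable[[float], bool],
) -> RolloutResult:
    """Deploy to growing shares of the fleet; roll back on the first failed check."""
    reached = 0.0
    for fraction in stages:
        reached = fraction  # deploy to this share of the fleet
        if not is_healthy(fraction):
            # Automated remediation: stop promoting and roll back.
            return RolloutResult(completed=False, rolled_back=True,
                                 max_fraction_reached=reached)
    return RolloutResult(completed=True, rolled_back=False,
                         max_fraction_reached=reached)


# Hypothetical stages: small canaries before the 10% mark, then wider steps.
STAGES = (0.01, 0.05, 0.10, 0.50, 1.0)

# A bad deploy whose health check starts failing once 5% of the fleet runs it.
bad = staged_rollout(STAGES, lambda fraction: fraction < 0.05)
assert bad.rolled_back and bad.max_fraction_reached < 0.10

# A good deploy passes every check and reaches the whole fleet.
good = staged_rollout(STAGES, lambda fraction: True)
assert good.completed and good.max_fraction_reached == 1.0
```

Placing several small stages below the 10% mark gives automated checks a chance to catch a bad change while its blast radius is still limited.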
Why it matters: This article details how Netflix scaled incident management by empowering all engineers with an intuitive tool and process. It offers a blueprint for other organizations seeking to democratize incident response and foster a culture of continuous learning and reliability.
- Netflix transitioned from a centralized SRE-led incident management system to a decentralized, "paved road" approach to empower all engineers.
- The previous system, relying on basic tools, failed to scale with Netflix's growth, leading to missed learning opportunities from numerous uncaptured incidents.
- They adopted Incident.io after a build-vs-buy analysis, prioritizing intuitive UX, internal data integration, balanced customization, and an approachable design.
- Key to successful adoption was the tool's intuitive design, which fostered a cultural shift, making incident management less intimidating and more accessible.
- Organizational investment in standardized processes, educational resources, and internal data integrations significantly reduced cognitive load and accelerated adoption.
- This transformation aimed to make incident declaration and management easy for any engineer, even for minor issues, to foster continuous improvement and system reliability.
Why it matters: As AI workloads push GPU power consumption beyond the limits of traditional air cooling, liquid cooling becomes essential. This project demonstrates a viable path for maintaining hardware reliability and efficiency in high-density data centers.
- Dropbox engineers developed a custom liquid cooling system for GPU servers during Hack Week 2025 to address the thermal demands of AI workloads.
- The team built a prototype from scratch using radiators, pumps, reservoirs, and manifolds when pre-assembled units were unavailable.
- Stress tests revealed that liquid cooling reduced operating temperatures by 20–30°C compared to standard air-cooled production systems.
- The project enabled reduced fan speeds for secondary components, leading to quieter operation and potential power savings.
- The initiative serves as a proof-of-concept for future-proofing data center infrastructure against the rising power consumption of next-gen GPUs.
- Future plans include expanding testing with dedicated liquid cooling labs across multiple Dropbox data centers.
Why it matters: This article details Pinterest's journey in building PinConsole, an Internal Developer Platform based on Backstage, to enhance developer experience and scale engineering velocity by abstracting complexity and unifying tools.
- Pinterest adopted an Internal Developer Platform (IDP) strategy to counter engineering velocity degradation caused by increasing complexity and tool fragmentation.
- They chose Backstage as the open-source foundation for their IDP, PinConsole, due to its community adoption, extensible plugin architecture, and active development.
- PinConsole aims to provide consistent abstractions, self-service capabilities, and reduce cognitive overhead for engineers by unifying disparate tools and workflows.
- The architecture includes custom integrations with Pinterest's internal OAuth and LDAP systems for secure and seamless authentication within the platform.
- The IDP addresses critical challenges such as inconsistent workflows, tool discovery issues, and fragmented documentation, significantly enhancing overall developer experience.
Why it matters: Dropbox's jump to 90% AI adoption provides a blueprint for scaling developer productivity. It shows how combining leadership alignment with a mix of third-party and internal tools can transform the SDLC and overcome developer skepticism toward AI-assisted workflows.
- Dropbox achieved over 90% AI tool adoption among engineers by 2025 through strong leadership alignment and a structured change management plan.
- The engineering organization utilizes AI across the entire software development lifecycle, including code generation, testing, debugging, and incident resolution.
- A three-pronged strategy was employed: evaluating external tools like GitHub Copilot, developing custom internal AI solutions, and fostering a culture of knowledge sharing.
- Initial adoption challenges, such as distrust of output quality and workflow friction, were addressed through peer-to-peer training and clear performance metrics.
- The company balances third-party integrations with in-house development to solve specific organizational problems while building internal machine learning expertise.
Why it matters: This article highlights the practical challenges and solutions in integrating automated accessibility testing into existing frontend development workflows. It provides valuable insights for engineers looking to enhance their testing strategies without disrupting core framework functionalities.
- Slack integrates automated accessibility testing into its development process to supplement manual testing and ensure compliance with Web Content Accessibility Guidelines (WCAG).
- Automated testing is viewed as a valuable addition to a comprehensive strategy, not a replacement for human judgment, as it has limitations in catching nuanced accessibility issues.
- Initial attempts to embed Axe accessibility checks directly into the React Testing Library (RTL) framework with Jest were abandoned due to complexities with Slack's custom Jest setup.
- The team pivoted to using Playwright, Slack's end-to-end (E2E) test framework, integrating Axe via the @axe-core/playwright package.
- Directly embedding Axe checks into Playwright's Locator object methods proved challenging: a Locator waits for an individual element to be ready, not for the full page to render, and accurate accessibility audits need the fully rendered page.
- Workarounds involved leveraging Playwright's flexibility and Axe Core's customization features, such as filtering rules and specific accessibility tags, for selective application of checks.
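The rule and tag filtering mentioned above can be sketched against the shape of an Axe results object, whose `violations` entries carry an `id`, an `impact`, and a list of `tags`. The helper `filter_violations` and the sample data are hypothetical and are not Slack's code; with @axe-core/playwright this kind of selection is normally configured on the builder itself (for example through `withTags` and `disableRules`) rather than applied afterwards.

```python
from typing import Any, Dict, Iterable, List

AxeResults = Dict[str, Any]


def filter_violations(
    results: AxeResults,
    include_tags: Iterable[str],
    disabled_rules: Iterable[str] = (),
) -> List[Dict[str, Any]]:
    """Keep violations that match at least one wanted tag and are not disabled."""
    wanted = set(include_tags)
    skipped = set(disabled_rules)
    return [
        violation
        for violation in results.get("violations", [])
        if violation["id"] not in skipped
        and wanted.intersection(violation.get("tags", []))
    ]


# Hand-written sample in the shape of an Axe results object (not real scan output).
sample_results: AxeResults = {
    "violations": [
        {"id": "color-contrast", "impact": "serious", "tags": ["cat.color", "wcag2aa"]},
        {"id": "image-alt", "impact": "critical", "tags": ["cat.text-alternatives", "wcag2a"]},
        {"id": "region", "impact": "moderate", "tags": ["cat.keyboard", "best-practice"]},
    ]
}

remaining = filter_violations(
    sample_results,
    include_tags=["wcag2a", "wcag2aa"],
    disabled_rules=["color-contrast"],
)
assert [v["id"] for v in remaining] == ["image-alt"]
```

Scoping checks by WCAG tag while temporarily disabling a known noisy rule lets a team adopt automated audits incrementally without blocking every test on pre-existing issues.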