This report highlights common infrastructure challenges like rate limiting, certificate management, and configuration errors. It offers valuable insights into incident response, mitigation strategies, and proactive measures for maintaining high availability in complex distributed systems.
In November, we experienced three incidents that resulted in degraded performance across GitHub services.
November 17 16:52 UTC (lasting 2 hours and 16 minutes)
On November 17, 2025, from 16:52 to 19:08 UTC, Dependabot was hitting a rate limit in GitHub Container Registry (GHCR) and was unable to complete about 57% of jobs within SLO.
To mitigate the issue, we lowered the rate at which Dependabot started jobs and increased the GHCR rate limit, which resolved the incident.
Longer term, we’re adding new monitors and alerts to help prevent this in the future.
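As an illustration of what lowering a job start rate can look like, here is a minimal token-bucket sketch. It is not Dependabot's actual scheduler, which the report does not describe; the `TokenBucket` class, its parameters, and the injectable `clock` are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Permit roughly `rate` job starts per second, with bursts up to `capacity`.

    Hypothetical sketch: not the mechanism Dependabot or GHCR actually use.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens that can accumulate
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        """Return True if a job may start now, consuming one token."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a design like this, reducing `rate` on the caller's side keeps requests under the downstream limit, while raising the downstream limit adds headroom. The mitigation described above combined both.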
November 18 20:30 UTC (lasting 1 hour and 4 minutes)
On November 18, 2025, from 20:30 to 21:34 UTC, we experienced failures on all Git operations, including both SSH and HTTP Git client interactions, as well as raw file access. These failures also impacted products that rely on Git operations.
The root cause was an expired TLS certificate used for internal service-to-service communication. We mitigated the incident by replacing the expired certificate and restarting impacted services. Once those services were restarted, we saw a full recovery.
We have updated our alerting to cover the expired certificate, and we are performing an audit of other certificates in this area to ensure they also have the proper alerting and automation before expiration. In parallel, we are accelerating efforts to eliminate our remaining manually managed certificates, ensuring all service-to-service communication is fully automated.
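A certificate expiry alert of the kind mentioned above could be sketched as follows. This is only an assumption about what such a check might look like, not GitHub's tooling: the function names, the `warn_days` threshold, and the shape of the `certs` mapping (certificate name to its `notAfter` string) are hypothetical. The only library call relied on is the standard library's `ssl.cert_time_to_seconds`.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days from `now` until a certificate's notAfter time.

    `not_after` uses the format returned by ssl.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def expiring_soon(certs, now, warn_days=30):
    """Return the sorted names of certificates that expire within
    `warn_days`, including any that have already expired."""
    return sorted(
        name
        for name, not_after in certs.items()
        if days_until_expiry(not_after, now) < warn_days
    )
```

Run on a schedule against an inventory of certificates, a check like this would raise a warning well before expiry, which is the gap the alerting update is meant to close.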
November 28 05:59 UTC (lasting 2 hours and 24 minutes)
On November 28, 2025, between approximately 05:59 and 08:24 UTC, Copilot experienced an outage affecting the Claude Sonnet 4.5 model. Users attempting to use this model received an HTTP 400 error indicating that no model was available, and the error persisted until they selected an alternative model. Other models were not impacted.
The issue was caused by a misconfiguration deployed to an internal service, which caused Claude Sonnet 4.5 to be erroneously listed as unavailable. The problem was identified and mitigated by reverting the configuration change. We are working to improve cross-service deploy safeguards to prevent similar incidents in the future.
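The report does not say what the improved safeguards will be. One possible pre-deploy check, sketched here purely as an assumption, compares the current and proposed configuration and refuses a change that silently marks a model unavailable. The function names and the config shape (model name mapped to an availability flag) are hypothetical.

```python
def newly_unavailable(current, proposed):
    """Return the sorted names of models that are available in `current`
    but missing or disabled in `proposed`."""
    return sorted(
        name
        for name, available in current.items()
        if available and not proposed.get(name, False)
    )


def check_deploy(current, proposed, acknowledged=()):
    """Raise ValueError if the change disables a model that the deployer
    has not explicitly listed in `acknowledged`."""
    unexpected = [
        name
        for name in newly_unavailable(current, proposed)
        if name not in acknowledged
    ]
    if unexpected:
        raise ValueError(f"models would become unavailable: {unexpected}")
```

Requiring an explicit acknowledgement keeps intentional removals possible while turning an accidental one into a failed deploy rather than a user-facing error.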
Follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the engineering section on the GitHub Blog.
The post GitHub Availability Report: November 2025 appeared first on The GitHub Blog.