This report highlights common infrastructure challenges like rate limiting, certificate management, and configuration errors. It offers valuable insights into incident response, mitigation strategies, and proactive measures for maintaining high availability in complex distributed systems.
In November, we experienced three incidents that resulted in degraded performance across GitHub services.
November 17 16:52 UTC (lasting 2 hours and 16 minutes)
On November 17, 2025, from 16:52 to 19:08 UTC, Dependabot began hitting a rate limit in GitHub Container Registry (GHCR) and was unable to complete about 57% of jobs within SLO.
To mitigate the issue, we lowered the rate at which Dependabot started jobs and increased the GHCR rate limit, which restored normal job throughput and resolved the incident.
Longer term, we’re adding new monitors and alerts to help prevent this in the future.
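Lowering the rate at which jobs start is typically done with a client-side limiter so the worker backs off before the registry rejects it. As a minimal sketch (the `JobThrottle` class and its parameters are hypothetical, not GitHub's actual implementation), a token-bucket limiter looks like this:

```python
import time

class JobThrottle:
    """Token-bucket limiter: caps the rate at which new jobs may start.

    Hypothetical sketch; the real Dependabot scheduler is not public.
    """

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = burst         # maximum burst of immediate starts
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_start(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        # Caller should delay and retry rather than hit the registry's limit.
        return False
```

Tuning `rate_per_sec` below the downstream service's limit keeps the client from triggering rate-limit errors even under a backlog of queued jobs.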
November 18 20:30 UTC (lasting 1 hour and 4 minutes)
On November 18, 2025, from 20:30 to 21:34 UTC, we experienced failures on all Git operations, including both SSH and HTTP Git client interactions, as well as raw file access. These failures also impacted products that rely on Git operations.
The root cause was an expired TLS certificate used for internal service-to-service communication. We mitigated the incident by replacing the expired certificate and restarting impacted services. Once those services were restarted we saw a full recovery.
We have updated our alerting to cover the expired certificate, and we are performing an audit of other certificates in this area to ensure they also have the proper alerting and automation before expiration. In parallel, we are accelerating efforts to eliminate our remaining manually managed certificates, ensuring all service-to-service communication is fully automated.
November 28 05:59 UTC (lasting 2 hours and 24 minutes)
On November 28, 2025, between approximately 05:59 and 08:24 UTC, Copilot experienced an outage affecting the Claude Sonnet 4.5 model. Users attempting to use this model received an HTTP 400 error indicating no model was available until an alternative model was selected. Other models were not impacted.
The issue was caused by a misconfiguration deployed to an internal service, which caused Claude Sonnet 4.5 to be erroneously listed as unavailable. The problem was identified and mitigated by reverting the configuration change. We are working to improve cross-service deploy safeguards to prevent similar incidents in the future.
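One common deploy safeguard is to validate a configuration against an expected set of entries before it ships, so a change that silently drops a model is rejected rather than deployed. A hedged sketch (the model names, config keys, and registry are all hypothetical illustrations, not GitHub's actual schema):

```python
# Hypothetical registry of models the service is expected to know about.
KNOWN_MODELS = {"claude-sonnet-4.5", "gpt-4o"}

def validate_model_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means safe to deploy."""
    errors = []
    available = set(config.get("available_models", []))
    # A required model missing from the available list is exactly the kind of
    # regression that would make clients receive "no model available" errors.
    for model in config.get("required_models", []):
        if model not in available:
            errors.append(f"required model {model!r} missing from available_models")
    for model in available:
        if model not in KNOWN_MODELS:
            errors.append(f"unknown model {model!r} in available_models")
    return errors
```

Wiring a check like this into the deploy pipeline turns a runtime outage into a failed pre-deploy validation.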
Follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the engineering section on the GitHub Blog.
The post GitHub Availability Report: November 2025 appeared first on The GitHub Blog.