capacity-planning

1 article
sort: new top best
clear filter
0 5/10

GitHub published a detailed postmortem of three major availability incidents (Feb 2, Feb 9, Mar 5) caused by rapid usage growth, architectural coupling in authentication/user management database clusters, insufficient load shedding mechanisms, and latent failover configuration issues. The incidents revealed single points of failure across critical infrastructure including Actions runners and Redis clusters, with mitigation strategies including user cache redesign, infrastructure isolation, and migration to Azure.

GitHub Azure Redis GitHub Actions
github.blog · tjwds · 1 day ago · details · hn