# How the healthcheck plugin and multi-site keep your sleep at night
GerritHub.io had an outage on the 3rd of November at 15:20 GMT,
caused by a [critical Gerrit issue](https://bugs.chromium.org/p/gerrit/issues/detail?id=16384)
discovered that day.
The issue was deep in the core of Gerrit: it involved loading the
accounts' external IDs and affected almost everything that required
authenticated traffic. However, none of the GerritHub.io users
noticed any issues, delays, slowdowns, or reduced functionality.
In this talk, Tony and Luca will describe what happened and how
the GerritForge Team detected, analyzed, and mitigated the problem,
avoiding a global outage.
The lessons from this story can help other Gerrit admins
set up operating practices for metrics, high availability,
and service resilience with Gerrit that are useful for
preventing sleepless nights and managing outages.
*[Luca Milanesio, GerritForge](../speakers.md#lmilanesio)*
*[Antonio Barone, GerritForge](../speakers.md#abarone)*