How the healthcheck plugin and multi-site let you sleep at night

GerritHub.io had an outage on the 3rd of November at 15:20 GMT, which was caused by a critical Gerrit issue discovered that day.

The issue was deep in the core of Gerrit: it involved loading the accounts' external IDs and affected almost everything that required authenticated traffic. However, none of the GerritHub.io users noticed any issues, delays, slowdowns, or reduced functionality.

In this talk, Tony and Luca will describe what happened and how the GerritForge Team detected, analyzed, and mitigated the problem, avoiding a global outage.

The lessons from this story can help other Gerrit admins set up operating practices for metrics, high availability, and service resilience, which are useful for managing outages and preventing sleepless nights.

Luca Milanesio, GerritForge
Antonio Barone, GerritForge