As many people have noticed, we had an outage this morning from 07:23 to 14:05 PST (roughly 7 hours). That’s serious downtime, and we’d like to explain what happened.
So, what happened?
Yesterday, to cope with the increased volume of links, we added more crawler processes on one of our machines. (more details about our infrastructure)
Unfortunately, the same machine also hosted our MySQL instance, which stores user data and is required by all our services. At some point this morning the load increased 5x, the machine became unresponsive and eventually hung, taking down all the other services with it.
We have email alerts for these kinds of situations (we’re using Monit, Munin, and BlameStella), but they don’t help much while you’re sleeping. We noticed the issue and started working on it as soon as we woke up.
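For the curious, this is roughly the kind of Monit system check involved here — a minimal sketch, not our actual configuration; the mail server, alert address, and thresholds are illustrative:

```
# Illustrative Monit config: alert when system load or memory climbs too high
set mailserver smtp.example.com
set alert ops@example.com

check system myhost
  if loadavg (5min) > 4 then alert
  if memory usage > 80% then alert
```

A check like this fires an email when the load average stays elevated, which is exactly the signal we got this morning — it just went to inboxes nobody was reading at 7 AM.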
We take these problems very seriously and we’re really sorry about the outage! We’ve learned something important today and we’re working hard to prevent further issues.
How are we going to prevent it?
- Keep MySQL and other critical services on separate, dedicated boxes
- Set up phone and SMS alerts with PagerDuty so we actually wake up
- Avoid high loads by re-distributing some of our services to new machines
Thank you for your patience! If you have any questions or feedback email us at firstname.lastname@example.org