
Downtime Postmortem

As many people have noticed, we had an outage this morning from 07:23 to 14:05 PST (~7 hours). That’s some serious downtime and we’d like to explain what happened.

So, what happened?

Yesterday, to cope with the increased volume of links, we added more crawler processes on one of our machines. (more details about our infrastructure)

Unfortunately, the same machine also hosts our MySQL instance, which stores user data and is required by all our services. At some point this morning the load increased 5x; the machine became unresponsive and eventually hung, taking down all the other services with it.

We have email alerts for this kind of situation (we’re using Monit, Munin, and BlameStella), but they don’t really help while you’re sleeping. We noticed the issue and started working on it as soon as we woke up.

We take these problems very seriously and we’re really sorry about the outage! We’ve learned something important today and we’re working hard to prevent further issues.

How are we going to prevent it?

  • Keep MySQL and other critical services on separate, dedicated boxes
  • Set up phone and SMS alerts with PagerDuty so we can wake up
  • Avoid high loads by re-distributing some of our services to new machines
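
To make the "wake us up before the box hangs" idea concrete, here is a minimal sketch of a load check that calls a paging hook once the 1-minute load average per core crosses a threshold. It is an illustration under stated assumptions, not our actual setup: the function names, the `notify` callback, and the default threshold of 2.0 are all hypothetical, and the real phone/SMS delivery would be whatever PagerDuty integration is configured. `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_1min, cores):
    """Normalize the 1-minute load average by the number of CPU cores."""
    return load_1min / max(cores, 1)


def should_page(load_1min, cores, threshold=2.0):
    """Return True when per-core load meets or exceeds the threshold.

    The default threshold of 2.0 is an assumed value for illustration.
    """
    return load_per_core(load_1min, cores) >= threshold


def check_host(notify, threshold=2.0):
    """Read the current load and call notify(message) if it is too high.

    `notify` is a hypothetical callback standing in for a real paging
    integration (phone/SMS); this sketch does not call any external service.
    """
    load_1min, _, _ = os.getloadavg()  # Unix only
    cores = os.cpu_count() or 1
    if should_page(load_1min, cores, threshold):
        notify(f"load {load_1min:.2f} on {cores} cores exceeds {threshold} per core")
        return True
    return False
```

Run from cron or a monitoring agent, a check like this could page someone while the machine is still responsive, for example `check_host(print)` during testing, with `print` swapped for the real alerting hook.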

Thank you for your patience! If you have any questions or feedback, email us at team@summify.com
