Last night and today we had a significant outage, with intermittent service for a number of hours. This post explains what happened. Hopefully it will put people’s fears to rest and give a sense of what’s happening behind the scenes.
Yesterday (Sunday) we cleared out some old data from our database, which is something we’ve done a few times before, but not in the last year. This time, halfway through the operation, the site stopped working. A few common queries were running extremely slowly, piling up, and blocking data access for all users.
Last night we figured out what the issue was and blocked those types of queries, and got the site back up, but it wasn’t working for anyone who had to make one of those slow queries.
Today, we tried to slowly let more and more of the blocked users back in, which seemed to be working, but then things went sideways in a weird way. It turned out the problem was being compounded by a separate issue stemming from an infrastructure transition we recently made. There was a bug in the web server we use (Apache) that we had never run into before, but which apparently has bitten a lot of people, so we had to resolve that as well in order to get things up and running again.
After fixing that bug, we tried letting everyone back in, and the site immediately went down again. We re-blocked people, and the site took a while to come back up. At this point, we thought, “Well, that’s not gonna work” and started looking at a way to code around the problem. It turned out to be pretty simple to just avoid that slow query for people and still keep the site working as expected for them, so that’s what we did.
And now the site is up and running, humming along. We added a few more servers too, just to make sure it was extra speedy.
As to why we’re doing infrastructure upgrades and clearing out old data in the first place: we are preparing for a large migration that’s coming soon. We are sharding our database, which, for those of you who aren’t programmers, means splitting the data in our one big database out across many little databases. It’s a big deal, and a milestone for us that we’ve reached a scale where this is even needed.
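For the programmers reading, here is a rough sketch of the general idea behind sharding. This is not our actual setup: the shard names, the number of shards, and the choice to route by a hash of the user ID are all assumptions made up for illustration.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(user_id: str, shards=SHARDS) -> str:
    """Pick which database holds a given user's data.

    Assumes hash-based routing: hash the user ID and take it modulo
    the number of shards, so the same user always maps to the same shard.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for uid in ["alice", "bob", "carol"]:
        print(uid, "->", shard_for(uid))
```

Each query for a user then only touches one small database instead of the one big one. A real system also has to handle things like adding shards later, which this sketch ignores.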
It’s a big project, but we’re pretty close to done. And when we finish, we’re excited to start working on some awesome improvements for the product, which our users have been patiently waiting for.