WorkFlowy Outage Post-mortem

Last night and today we had a significant outage, with intermittent service for a number of hours. This post explains what happened. Hopefully it will put people’s fears to rest and give a sense of what’s happening behind the scenes.

Yesterday (Sunday) we cleared out some old data from our database, which is something we’ve done a few times before, but not in the last year. This time, halfway through the operation, the site stopped working. A few common queries were running extremely slowly, piling up, and blocking data access for all users.

Last night we figured out what the issue was, blocked those types of queries, and got the site back up, but it still wasn’t working for anyone who had to make one of those slow queries.

Today, we tried to slowly let more and more of the blocked users back in, which seemed to be working, but then things went sideways in a weird way. It turned out the issue was being compounded by a separate problem stemming from an infrastructure transition we recently made. There was a bug in the web server we use (Apache) that we had never run into before, but which has apparently bitten a lot of people, so we had to resolve that as well to get things up and running again.

After fixing that bug, we tried letting everyone back in. The site immediately went down again, so we re-blocked those users, and the site took a while to come back up. At this point, we thought, “Well, that’s not gonna work” and started looking for a way to code around the problem. It turned out to be pretty simple to just avoid that slow query for people and still keep the site working as expected for them, so that’s what we did.

And now the site is up and running, humming along. We added a few more servers too, just to make sure it was extra speedy.

As to why we’re doing infrastructure upgrades and clearing out old data in the first place, we are preparing for a large migration we’re doing soon. We are sharding our database, which, for those of you who aren’t programmers, means splitting the data in our one big database out across many little databases. It’s a big deal, and a milestone for us that we’ve reached a scale where this is even needed.
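For the programmers reading, the general idea of sharding can be sketched in a few lines. This is only an illustration of one common approach (hashing a user ID to pick a database), not a description of how WorkFlowy's migration actually works. The names `shard_for`, `database_for`, and `SHARD_NAMES`, and the choice of four shards, are all made up for the example.

```python
import hashlib

# Hypothetical list of small databases; a real setup would hold
# connection details here rather than plain names.
SHARD_NAMES = [f"shard_{i}" for i in range(4)]


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index in the range [0, num_shards)."""
    # A stable hash (unlike Python's built-in hash(), which is randomized
    # per process for strings) keeps the mapping the same across restarts.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_for(user_id: str) -> str:
    """Return the name of the database that holds this user's data."""
    return SHARD_NAMES[shard_for(user_id, len(SHARD_NAMES))]


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, "->", database_for(user))
```

Each user's data lives in exactly one of the small databases, so a slow query on one shard only affects the users on that shard. One known trade-off of plain modulo hashing is that changing the number of shards moves most users to a different shard, which is why some systems use a lookup table or consistent hashing instead.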

It’s a big project, but we’re pretty close to done. And when we finish, we’re excited to start working on some awesome improvements for the product, which our users have been patiently waiting for.

12 thoughts on “WorkFlowy Outage Post-mortem”

  1. It’s ok, Jesse. Workflowy access has always been strong, sturdy and reliable. So Workflowy is allowed to fail at least once every 5 years. Seriously, I was not able to get in, but I never got alarmed or anything. I think when you are a truly Workflowy believer, there’s really nothing to worry about. Happy growing!

    1. 100% agree, and thank you also for sharing the post-mortem.
      I briefly considered suicide as an option yesterday given the importance Workflowy has taken in my life, but then renounced when I realized the impact it would have on your revenue.

  2. Hi guys. I did think it a bit odd that my Android client reported it hadn’t synched in a couple of days but of course the offline functionality in the client just kept me ticking along nicely. Thanks for the update about the cause, always great to see a postmortem rather than an “it works for me” response.

  3. I haven’t backed up my WrkFlwy for weeks :(. I think this is a reminder!

    The good thing about these guys is that they don’t lock you in. So long as you keep your backups going then you can carry on in any OPML file without the need for WrkFlwy 🙂

    I could really do with a couple of days of IT at some point. My IT systems are creaking.

  4. integrity matters, thanks for sharing
    your track record is pretty damn good, and i’m confident the learning that comes out of this will make it even better

    loyalty is earned, you’ve got mine
