WorkFlowy Outage Post-mortem

August 28, 2017

Last night and today we had a significant outage, with intermittent service for a number of hours. This post explains what happened. Hopefully it will put people’s fears to rest and give a sense of what’s happening behind the scenes.

Yesterday (Sunday) we cleared out some old data from our database, which is something we’ve done a few times before, but not in the last year. This time, halfway through the operation, the site stopped working. A few common queries were running extremely slowly, piling up, and blocking data access for all users.

Last night we figured out what the issue was, blocked those types of queries, and got the site back up, but it still wasn’t working for anyone who needed to make one of those slow queries.

Today, we tried to slowly let more and more of the blocked users back in, which seemed to be working, but then things went sideways in a weird way. It turned out the problem was being compounded by a separate issue stemming from an infrastructure transition we recently made. There was a bug in the web server we use (Apache) that we had never run into before, but which apparently has bitten a lot of people, so we had to resolve that as well in order to get things up and running again.
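
For the programmers reading, one common way to let blocked users back in a few at a time is a percentage rollout keyed on a hash of the user ID. The sketch below is purely illustrative and is not taken from WorkFlowy’s code: the function name, the use of SHA-256, and the 100-bucket scheme are all assumptions.

```python
import hashlib


def is_readmitted(user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage.

    Hypothetical helper: each user is hashed into a stable bucket from 0 to 99,
    and is admitted when the bucket is below rollout_percent.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_percent
```

Because the bucket for a given user never changes, raising the percentage only ever adds users, and dropping it back to 0 blocks everyone again.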

After we fixed that bug and tried to let everyone back in, the site immediately went down again. We re-blocked people, and the site took a while to come back up. At this point, we thought, “Well, that’s not gonna work” and started looking at a way to code around the problem. It turned out to be pretty simple to just avoid that slow query for people and still keep the site working as expected for them, so that’s what we did.

And now the site is up and running, humming along. We added a few more servers too, just to make sure it was extra speedy.

As to why we’re doing infrastructure upgrades and clearing out old data in the first place, we are preparing for a large migration we’re doing soon. We are sharding our database, which, for those of you who aren’t programmers, means splitting the data in our one big database out across many little databases. It’s a big deal, and a milestone for us that we’ve reached a scale where this is even needed.
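
For those of you who are programmers, here is a minimal sketch of what routing users to shards can look like. It is an illustration of the general idea only, not WorkFlowy’s actual scheme: the shard names, the function names, and the choice to hash on a user ID are all assumptions.

```python
import hashlib

# Hypothetical names for the "many little databases".
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def database_for_user(user_id: str) -> str:
    """Pick the database that holds all of this user's data."""
    return SHARDS[shard_for_user(user_id, len(SHARDS))]
```

The same user always lands on the same shard, so a request only needs to talk to one small database. One caveat with plain modulo hashing is that changing the number of shards remaps most users, which is why real systems often use a lookup table or consistent hashing instead.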

It’s a big project, but we’re pretty close to done. And when we finish, we’re excited to start working on some awesome improvements for the product, which our users have been patiently waiting for.

12 Comments

Carlos Sicilia
4 years ago

It’s ok, Jesse. Workflowy access has always been strong, sturdy and reliable. So Workflowy is allowed to fail at least once every 5 years. Seriously, I was not able to get in, but I never got alarmed or anything. I think when you are a true Workflowy believer, there’s really nothing to worry about. Happy growing!

rekluce
4 years ago
Reply to  Carlos Sicilia

Very well said, sir. Nailed it.

Jean-Jacques Auffret
4 years ago
Reply to  Carlos Sicilia

100% agree, and thank you also for sharing the post-mortem.
I briefly considered suicide as an option yesterday given the importance Workflowy has taken in my life, but then renounced when I realized the impact it would have on your revenue.

Kaj Kolding
4 years ago

Well said, Carlos! Jesse, I hope you’re getting some sleep tonight. Everybody keep calm and Workflowy on.

Juergen Allgayer
4 years ago

thanks for getting your arms around it!
was checking furiously http://www.isitdownrightnow.com/workflowy.com.html 🙂
made me notice how much I rely on WFy, and how rarely I have any problems!!
-J

Olaf Bond (@olafbond)
4 years ago

I’ve got the Android application update so I thought that was the issue.
Glad to hear these stories saying you are aware and restlessly working on making the product better.

Martin Pugh
4 years ago

Hi guys. I did think it a bit odd that my Android client reported it hadn’t synched in a couple of days but of course the offline functionality in the client just kept me ticking along nicely. Thanks for the update about the cause, always great to see a postmortem rather than an “it works for me” response.

Reynard Falconer
4 years ago

I haven’t backed up my WrkFlwy for weeks :(. I think this is a reminder!

The good thing about these guys is that they don’t lock you in. So long as you keep your backups going then you can carry on in any OPML file without the need for WrkFlwy 🙂

I could really do with a couple of days of IT at some point. My IT systems are creaking.

Sam
4 years ago

Thank you!

Robert Pollini
4 years ago

integrity matters, thanks for sharing
your track record is pretty damn good, and i’m confident the learning that comes out of this will make it even better

loyalty is earned, you’ve got mine

luc4leone
4 years ago

thanks for explaining jesse, well done!

John A. Taylor (@johnataylor)

Thanks for maintaining a great product, and being transparent when there is a problem. You guys rock it!
