Summary: I got a false alert that the issues were resolved. Foolishly, I didn’t look into it further and didn’t realize we had an ongoing outage.
My deepest apologies for the two outages we’ve had in the past day. The first happened yesterday at around 5pm Pacific time, the second today at around 6am. Both were caused by the same simple, easy-to-resolve issue. In both cases, I did not resolve the issue causing the outage because I didn’t realize the outages were happening. Our monitoring system sent me alerts saying the issues had resolved themselves just moments after it pointed out the problems in the first place. I should have looked into the issues upon getting the alerts, regardless of their supposed resolution, and I didn’t. Thus, we had hours of downtime when we should have had minutes.
I am ashamed of the situation, and stressed by the potential repercussions on the trust of our users. Due to my actions, tens of thousands of people around the world were blocked from doing their work over the past day. My apologies to all of you.
Here is what I will do to prevent this from happening in the future:
- I’ve added two new alerts to our monitoring which are direct indicators of our most common issues. These alerts will not incorrectly resolve themselves while an outage remains active.
- I will always look into outage alerts, even if I get an automated alert saying an issue is resolved.
- After an alert, I will look at our core metrics dashboards, rather than at how responsive the app feels to me, to determine whether the app is healthy. Loading the home page is an incredibly dumb way to verify the health of our servers. I did it simply because I was away from my computer and had never seen a real outage falsely resolved without additional alerts. That was a huge mistake.
- After an alert, I will monitor our user communication channels, such as Twitter and email, to get a sense of how widespread an issue is.
Again, I am truly sorry for these completely unnecessary outages. We will do our best to get back to the reliability we had before them.
P.S. Please consider using our offline Chrome Desktop Application. This way you will have access to your WorkFlowy lists even if we have an outage, or if you’re away from the internet.