What Should You Do When Your Web Service Blows Up?

Every major software or web company I’ve ever been involved in has had a catastrophic outage of some sort.  I view it as a rite of passage: when it happens while your company is young and no one notices, it gives you a chance to get better.  But eventually you’ll have one when you are big enough for people to notice.  How you handle it and what you learn from it speaks volumes about your future.

Last week, two companies that we are investors in had shitty experiences.  SendGrid’s was short (it only lasted a few hours) and was quickly diagnosed.  BigDoor’s was longer and took several days to repair and get things back to a stable state.  Both companies handled their problems with grace and transparency, announcing that all was back to normal with a blog post describing in detail what happened.

While you never ever want something like this to happen, it’s inevitable.  I’m very proud of how both BigDoor and SendGrid handled their respective outages and know that they’ve each learned a lot, both in how to communicate about what happened and in ensuring that this particular type of outage won’t happen again.

In both cases, they ended up with 100% system recovery.  In addition, each company took responsibility for the problem and didn’t shift the blame to a particular person.  I’m especially impressed by how my friends at BigDoor handled this, since the root cause of the problem was a mistake by a new employee.  They explain this in detail in their post and end with the following:

“Yes, this employee is still with us, and here’s why: when exceptions like this occur, what’s important is how we react to the crisis, accountability, and how hard we drive to quickly resolve things in the best way possible for our customers.   I’m incredibly impressed with how this individual reacted throughout, and my theory is that they’ll become one of our legendary stars in years to come.”

I still remember the first time I was ever involved in a catastrophic data loss.  I was 17 and working at Petcom, my first real programming job.  It was late on a Friday night and I got a call from a Petcom customer.  I was the only person around so I answered the phone.  The person was panicked – their hard drive had lost all of its data (it was an Apple III ProFile hard drive – probably 5 MB).  The person was the accounting manager and they were trying to run some process but couldn’t get anything to work.  I remember discerning that it seemed like the hard drive was fine but she had deleted all of her data.  Fortunately, Petcom was obsessive about backups and made all of their clients buy a tape drive – in this case, one from Tallgrass (I vaguely remember that they were in Overland Park, KS – I can’t figure out why I remember that.)

After determining the tape drive software was working and was available, I started walking the person through restoring her data.  She was talking out loud as she brought up the tape drive menu and started pressing keys before I had a chance to say anything, at which point she pressed the key to format the tape that was in the drive.  I sat in shock for a second and asked her if she had another backup tape.  She told me that she didn’t – this was the only one she ever used.  I asked her what it said on the screen.  She said something like “formatting tape.”  I asked again if there was another backup tape.  Nope.  I told her that I thought she had just overwritten her only backup.  Now, in addition to having deleted all of her data, she had wiped out her backup.  We spent a little more time trying to figure this out, at which point she started crying.  I doubt she realized she was talking to a 17 year old.  She eventually calmed down but neither of us knew what to do next.  Eventually the call ended and I went into the bathroom and threw up.

I eventually got in touch with the owner of Petcom (Chris) at his house who told me to go home and not to worry about it, they’d figure it out over the weekend.  I can’t remember the resolution, but I think Chris had a backup for the client from the previous month so they only lost a month or so worth of data.  But that evening made an incredible impression on me.  Yes, I finished the evening with at least one illegal drink (since the drinking age at the time in Texas was 18.)

It’s 28 years later and computers still crash, backups are still not 100% failsafe, and the stress of massive system failure still causes people to go into the bathroom and throw up.  It’s just part of how this works.  So, before you end up in pain, I encourage you to think hard about your existing backup, failover, and disaster recovery approaches.  And, when the unexpected, unanticipated, unaccounted-for thing happens, make sure you communicate continually and clearly about what is going on, no matter how painful it might be.

  • mikeyavo

    We had one at my last company, Quigo. It was both terrifying and devastating. It's also a good problem to have especially if it is caused by an unexpected increase in traffic. That's what happened to us, but mixed in, of course, with some poor software architecture. It took us 6 months to fix the underlying problems AND properly scale for the future. Totally worth it looking back but painful while it's happening. No real customer-facing innovation happened during this period of time.

  • Your Petcom story is amazing!

    Sorting out accountability is a hard one. Stay calm in the case of emergency!

    — Stephan

  • Back in 2002 I had a contracting gig at a credit card processing company in Boulder. I was working on some new reporting software and mostly worked in a dummy database, but occasionally would touch the real one for testing at scale. (It was many terabytes.) Sure enough, one day I typed "DELETE FROM transactions;" in the wrong window…

    It was like that feeling when you see you've cut your finger but the pain hasn't come yet. I slowly got up and left the conference room where I worked. I could already hear confusion from the staff developers in their cubes. The CTO was remarkably calm when I explained what happened. He called the Head of Engineering to begin restoring a backup. That would take 12-18 hours, and it was only 1pm. No one else could do any work until tomorrow. He said all the developers had done it at one time or another, and I should bring a six-pack into work tomorrow as penance.

    As I walked out, I looked down into the first-floor "bullpen" where about 150 transaction processors worked in a large open room. They were packing up their things to go home, some celebrating, some grumbling. I calculated in my head how much I had just cost the company in salaries alone.

    In hindsight I can say that the DBA should never have given me the access rights to do such a thing, but at the time I felt guilty as hell. Didn't throw up, but felt about the same!
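    One habit that blunts this kind of accident is running destructive statements inside an explicit transaction and sanity-checking the affected row count before committing. A minimal sketch using Python's sqlite3 module (the table contents and the safety threshold here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(i, 10.0 * i) for i in range(1000)])
conn.commit()

# The destructive statement runs inside a transaction; nothing is
# final until commit(), so a surprising row count can still be backed
# out instead of triggering an 18-hour restore.
cur = conn.execute("DELETE FROM transactions WHERE amount > ?", (50.0,))
if cur.rowcount > 100:   # illustrative "too many rows" threshold
    conn.rollback()      # touched far more than expected -- back out
else:
    conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(remaining)  # -> 1000: the rollback undid the over-broad delete
```

    The same discipline applies at the psql/mysql prompt: BEGIN first, read the reported row count, and only then COMMIT.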

  • Bill Mosby

    As I read this I was thinking to myself that people have a better understanding of the potential of that simple word, "format", than they did a decade or two ago. Then I remembered that I absent-mindedly formatted the memory card in my camera a few years ago with about 300 mostly irreplaceable shots on it. Luckily, that was only a hobby and not business related, lol!

  • You should always have a disaster recovery plan in place no matter how big you are. At first it may be just taking regular backups of your database to ensure integrity and making sure they are pushed to another server. Then maybe you get a bit more sophisticated and have multiple servers running in a cluster, so if one dies you get session replication and you keep on trucking. Then eventually you run passive nodes located in another data center that you can bring online at will. Eventually you do full data replication to another site with a full set of prod hardware, and if your system goes down you point your DNS entry at your backup cluster and go to town.

    However, all of this hinges on the fact that you REGULARLY test these systems. I have heard far too many stories of people who have disaster recovery systems in place but don't test them regularly, only to find their backup takes just as long to bring up as it would to completely rebuild from scratch.

    It also helps to have some type of automated build system that stores all your production builds as backups. That way, when you do have a failure that may be code related, you can roll back the database, grab the last good prod build, and bring your systems back up.
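    The "regularly test your backups" step can itself be automated: take the backup, do a trial restore to scratch space, and compare checksums the same night. A minimal sketch in Python, where a plain file and gzip stand in for a real database's dump/restore tools (pg_dump, mysqldump, etc.):

```python
import gzip
import hashlib
import os
import tempfile

def sha256(path):
    """Checksum a file so the source and restored copy can be compared."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def backup_and_verify(src, backup_dir):
    """Compress src into backup_dir, trial-restore it, and confirm the
    restored bytes match the original. Raises if the round trip fails,
    which is the failure you want to see the night the backup is taken,
    not during a disaster."""
    backup = os.path.join(backup_dir, os.path.basename(src) + ".gz")
    with open(src, "rb") as f_in, gzip.open(backup, "wb") as f_out:
        f_out.writelines(f_in)

    # Trial restore to scratch space.
    restored = backup + ".restored"
    with gzip.open(backup, "rb") as f_in, open(restored, "wb") as f_out:
        f_out.writelines(f_in)

    if sha256(src) != sha256(restored):
        raise RuntimeError("backup failed verification: " + backup)
    os.remove(restored)
    return backup

# Example run against a throwaway file standing in for a DB dump.
with tempfile.TemporaryDirectory() as scratch:
    src = os.path.join(scratch, "data.db")
    with open(src, "wb") as f:
        f.write(b"pretend this is a database dump\n" * 100)
    print(os.path.basename(backup_and_verify(src, scratch)))
```

    A real nightly job would swap the gzip round trip for a restore into a scratch database instance, but the shape of the check is the same.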

  • pankajunk

    I think this underlines the importance of experience in a web company. You know the pitfalls, have experienced them, and know better not to commit the same mistakes twice.


    • Yes, but there are always new and exciting mistakes to make. Experience doesn't help that much with these – you just have to live through them.

  • Brad…interesting thoughts and comments. You can never completely prevent human error (we have all done it) or mechanical failure (they all fail at one time or another). The best medicine is to have some way to recover, or to prevent the user from feeling the pain of the outage through failover strategies. I am working with an interesting early stage company that you might want to take a quick look at – http://www.zeronines.com – that has a unique and straightforward way of addressing just this situation.