Jul 27 2005

Acceptable Downtime

I was on a board call for a company today where we talked about “acceptable downtime” for their web-based service.  This company has commercial customers that depend on its software to run their businesses and the software in question is delivered “as a service.”  I’ve got a number of companies using this approach (vs. a straight software license approach) and I have a lot of experience with this issue dating back to investments in the 1990s in Critical Path, Raindance, and Service Metrics.

While it’s easy to talk about “5 nines” (99.999% uptime), there are plenty of people who think this metric doesn’t make sense, especially when you are building an emerging company and have a difficult time predicting user adoption (if you are growing > 20% per month, you understand what I mean).  While most companies have SLAs, these also don’t really take into consideration the actual dynamics associated with uptime for a mission critical app.
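To make “5 nines” concrete, here’s a quick sketch of the arithmetic behind each availability tier.  The percentages are the standard tiers; the minutes-per-year figures are simple arithmetic, not numbers from any particular SLA:

```python
# How much downtime each "nines" uptime level actually allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignoring leap years)

def allowed_downtime_minutes(uptime_pct):
    """Minutes of downtime per year permitted at a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                   ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{label} ({pct}%): {allowed_downtime_minutes(pct):.1f} min/yr")
```

Five nines works out to roughly 5.3 minutes of downtime per year, which is why the metric feels disconnected from reality for a young company: a single one-hour outage blows the budget for the next decade.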

For example, when I was on the board of Raindance, the CEO (Paul Berberian) described Raindance’s system as similar to an incredibly finely tuned and awesomely powerful fighter plane (one that even Jack Bauer would be proud to fly in).  There was no question that it was by far the best service delivery platform for reservationless audio conferencing in the late 1990s.  However, in Paul’s words, “when the plane has problems, it simply explodes in the air.”  Basically, there was no possible way to guarantee that there would never be a catastrophic system failure and – in this situation – while there was plenty of failover capacity, it was too expensive to create 100% redundancy, so it would take some time (usually in Raindance’s case an hour or two, although it once took two days) to get the system back up.  The capex investment in Raindance’s core infrastructure (at the time) was around $40m – we simply couldn’t afford another $40m for a fully redundant system, even if we could configure something so it was – in fact – fully redundant.

There have been many high profile services that have had catastrophic multi-day outages.  eBay had a number of multi-day outages in 1999; Critical Path had a two day outage in 1999 (I remember it not so fondly because I was without email for two days in the middle of an IPO I was involved in); Amazon had some issues as recently as last year’s holiday season; my website was down for an hour last night because of a firewall configuration issue; the list goes on.

Interestingly, as an online service (consumer or enterprise) becomes more popular, the importance of it being up and operational all the time increases.  While this is a logical idea, it’s a feedback loop that, at some point, creates real pain for a young but rapidly growing company.  As the importance of the service increases, expectations increase and – when there is the inevitable failure (whether it’s for a minute, an hour, or a day) – more people notice (since you have more users).

After watching this play out many times, I think every company gets a couple of free passes.  However, once you’ve used them up, the tides turn and users become much more impatient with you, even if your overall performance increases.  Ironically, I can’t seem to find any correlation with price – the behavior that I’ve witnessed seems to be comparable between free services, services that you make money from (e.g. eBay), and services that you pay for.

My advice on the board call today was that – based on our rate of growth (rapid but not completely out of control yet) – we should get ahead of the issue and invest in a much more redundant infrastructure today.  We haven’t yet used up our first free pass (we’ve had several small downtime incidents, but nothing that wasn’t quickly recovered from), but we had a scare recently (fortunately it was in the middle of the night on a weekend and – given that we are a commercial service – didn’t affect many users).  The debate that ensued balanced cost and redundancy (do we spend $10k, $100k, $500k, or $1m) and we concluded that spending up to roughly 50% of our current capex cost was a reasonable ceiling that should give us plenty of redundant capacity in case of a major outage.  Of course, the network architecture and failover plan is probably as important (or more important) than the equipment.
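The decision rule we landed on can be sketched as a one-liner.  Note that the capex figure below is made up purely for illustration – the post doesn’t disclose this company’s actual numbers, only the spending options debated and the 50%-of-capex ceiling:

```python
# Hypothetical sketch of the board's decision rule: cap redundancy spending
# at ~50% of current infrastructure capex.

def redundancy_budget_ceiling(current_capex, ratio=0.5):
    """Upper bound on redundant-capacity spend, per the 50%-of-capex rule."""
    return current_capex * ratio

options = [10_000, 100_000, 500_000, 1_000_000]  # the options debated on the call
capex = 1_500_000  # hypothetical current infrastructure capex (not from the post)

ceiling = redundancy_budget_ceiling(capex)
affordable = [o for o in options if o <= ceiling]
print(f"ceiling=${ceiling:,.0f}, options within budget: {affordable}")
```

The point of the rule is that the redundancy budget scales with what you’ve already sunk into infrastructure, rather than being an arbitrary fixed number.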

I’m searching for a way to describe “acceptable downtime” for an early stage company on a steep adoption curve.  I’m still looking (and I’m sure I’ll feel pain during my search – both as a user and an investor), but there must be a better way than simply saying “5 nines.”