Acceptable Downtime

I was on a board call today for a company where we talked about “acceptable downtime” for their web-based service.  This company has commercial customers that depend on its software to run their businesses, and the software in question is delivered “as a service.”  I’ve got a number of companies using this approach (vs. a straight software license approach) and I have a lot of experience with this issue dating back to investments in the 1990s in Critical Path, Raindance, and Service Metrics.

While it’s easy to talk about “5 nines” (99.999% uptime), there are plenty of people who think this metric doesn’t make sense, especially when you are building an emerging company and have a difficult time predicting user adoption (if you are growing > 20% per month, you understand what I mean.)  While most companies have SLAs, these also don’t really take into consideration the actual dynamics associated with uptime for a mission-critical app.
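For concreteness, “N nines” converts directly into a yearly downtime budget. A quick sketch of the arithmetic shows why the metric feels so disconnected from reality for a young company:

```python
# How much downtime per year each "number of nines" actually allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Minutes per year a service may be down at a given uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for label, pct in [("3 nines", 99.9), ("4 nines", 99.99), ("5 nines", 99.999)]:
    print(f"{label} ({pct}%): {downtime_budget_minutes(pct):.1f} minutes/year")
```

At five nines you get about 5.3 minutes of downtime per year; a single two-hour outage of the kind described below blows through more than twenty years’ worth of that budget.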

For example, when I was on the board of Raindance, the CEO (Paul Berberian) described Raindance’s system as similar to an incredibly finely tuned and awesomely powerful fighter plane (one that even Jack Bauer would be proud to fly in.)  There was no question that it was by far the best service delivery platform for reservationless audio conferencing in the late 1990s.  However, in Paul’s words, “when the plane has problems, it simply explodes in the air.”  Basically, there was no possible way to guarantee that there would not be a catastrophic system failure and, in this situation, while there was plenty of failover capacity, it’s too expensive to create 100% redundancy, so it would take some time (usually in Raindance’s case an hour or two, although it once took two days) to get the system back up.  The capex investment in Raindance’s core infrastructure (at the time) was around $40m – we simply couldn’t afford another $40m for a fully redundant system, even if we could configure something so it was, in fact, fully redundant.

There have been many high-profile services that have had catastrophic multi-day outages.  eBay had a number of multi-day outages in 1999; Critical Path had a two-day outage in 1999 (I remember it not so fondly because I was without email for two days in the middle of an IPO I was involved in); Amazon had some issues as recently as last year’s holiday season; my website was down for an hour last night because of a firewall configuration issue; the list goes on.

Interestingly, as an online service (consumer or enterprise) becomes more popular, the importance of it being up and operational all the time increases.  While this is a logical idea, it’s a feedback loop that creates some pain at some point for a young, but rapidly growing company.  As the importance of the service increases, expectations increase and – when there is the inevitable failure (whether it’s for a minute, an hour, or a day) – more people notice (since you have more users).

After watching this play out many times, I think every company gets a couple of free passes.  However, once you’ve used them up, the tide turns and users become much more impatient with you, even if your overall performance increases.  Ironically, I can’t seem to find any correlation with price – the behavior that I’ve witnessed seems to be comparable between free services, services that you make money from (e.g. eBay), and services that you pay for.

My advice on the board call today was that – based on our rate of growth (rapid but not completely out of control yet) – we should get ahead of the issue and invest in a much more redundant infrastructure today.  We haven’t yet used up our first free pass (we’ve had several small downtime incidents, but nothing that wasn’t quickly recovered from), but we had a scare recently (fortunately it was in the middle of the night on a weekend and – given that we are a commercial service – didn’t affect many users).  The debate that ensued balanced cost and redundancy (do we spend $10k, $100k, $500k, or $1m) and we concluded that spending roughly up to 50% of our current capex cost was a reasonable ceiling that should give us plenty of redundant capacity in case of a major outage.  Of course, the network architecture and failover plan is probably as important (or more important) than the equipment.

I’m searching for a way to describe “acceptable downtime” for an early stage company on a steep adoption curve.  I’m still looking (and I’m sure I’ll feel pain during my search – both as a user and an investor), but there must be a better way than simply saying “5 nines.”

  • It’s an interesting question for high growth companies. The problem is you can’t redefine service expectations, shit works or it don’t.

But you could unbundle and share risk. Enterprise services could offer a premium service, an SLA option, for customers who want the vendor to bear uptime risk. Those willing to take the risk themselves, sharing in the risk of 20% monthly growth, can go with the basic service.

No matter how you package it, this is just a case of an immature industry when it comes to risk. Advantage will come from offering risk management amidst volatility. Greater rewards come from cooperating market participants who identify shared risk and standardize offerings to grow together.

  • An equally challenging problem is the measurement and definition of availability, especially for mission-critical services delivered via the internet. There are several potential causes, some controllable, some not, when using TCP/IP and the public net as a development platform. Is the app dead while the “box” is still responding to monitoring? Local DNS issues?

    The next step in service levels is usability which is even harder to quantify but in my view necessary.

    Aria gives SLAs with penalties in the form of discounts, but we also demand projections from our clients.

  • Dave Jilk

    I’ve written on this issue and take a distinctly different point of view, at least for hosted/web-based applications. First, it doesn’t have to be a huge capex to have the necessary redundancy and further there are service providers such as Akamai that can provide most of the application infrastructure not only redundantly but on the network edge.

    Further, as my article points out, the focus on infrastructure uptime misses a lot of the point. Many providers of hosted applications have pretty high uptime already (as your company example indicates) but their applications have bugs. Users don’t care WHY it doesn’t work, they just care THAT it doesn’t, so application issues should be included in these uptime/downtime statistics to be correct.

    I do agree with you for services that have telephony and other complex infrastructure.

  • Mike

    Very interesting. I can imagine that there are probably a few decision points in creating the system architecture for these types of service platforms where the trade-off between performance and manageability would benefit from this type of perspective.

    It seems like systems engineers and software architects usually want to design elegant, high-performance software, and marketing people always want feature-rich applications, so who is there to advocate for reliability?


  • Third Time Lucky?

    Any downtime in the online channel is unacceptable. This is the equivalent of putting a CLOSED sign in your window.

    Do you accept downtime in your real-world channel? If a customer is in your store, with a product in hand, and cash at the ready…do you say sorry?

    We are 10 years into online commerce. Firms have to stop factoring in downtime. It indicates that companies still think like old-school mainframe folks.

    Stop. The. Madness.


  • My link didn’t work in my earlier comment:

  • To the customer, uptime and five nines mean nothing. The assumption is it should just work. The end user needs to have some feedback if something goes wrong. We have learned that the best thing to do when you have a problem is to create a system that fails gracefully. This means if the primary system is down, then have a cheap read-only version which can pick up the slack. Maybe with 1-2 hour old data. This is better than being completely down, and far, far less expensive than totally realtime redundant systems. As a startup ourselves, we have struggled with this issue. Since cash is limited, we could not afford all the bells and whistles. Sometimes this helps focus your mind.

    It turns out, after doing a lot of research, our customers will accept a read-only version while we get the other system back online. While we have never been down for very long, it is better to have something up than nothing. Also, communication is important. Whenever we have had a problem, we shoot an email to all our administrators and call our major customers on the phone. The worst thing for our customers is to not be able to log into the system and not know what is going on.

    To put it in VC terms, it is ALWAYS better to tell bad news right away to your customers, board, investors, than to wait and hope.
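The read-only fallback this commenter describes can be sketched in a few lines. The names and the simulated outage below are invented for illustration:

```python
# Graceful degradation: serve 1-2 hour old read-only data when the
# primary is down, and tell the user the data may be stale.

def read_primary(key):
    """Stand-in for the live database; here we simulate an outage."""
    raise ConnectionError("primary is down")

# A cheap, periodically refreshed snapshot (may be 1-2 hours old).
REPLICA_SNAPSHOT = {"orders": ["#1001", "#1002"]}

def handle_request(key):
    try:
        return {"data": read_primary(key), "stale": False}
    except ConnectionError:
        # Better a stale answer plus a warning banner than a dead site.
        return {"data": REPLICA_SNAPSHOT.get(key), "stale": True}

print(handle_request("orders"))
```

The `stale` flag is the important part: it gives the end user the feedback the commenter says they need when something goes wrong.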

  • Colin Evans

    My experience working at a telco infrastructure provider was that downtime was always bad, but customers mind it less if you can make them aware of when and why it is happening and minimize unexpected side effects (billing errors, delivery errors, etc).

    Sometimes all this means is having good enough monitoring that you can gracefully take the system down when it starts to fault, and put up a polite apology on the website while this is happening. This turns an “unexpected outage” into a “planned outage”.

    Of course, if you have “planned outages” too often, your customers have good reason to be unhappy.
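The “turn an unexpected outage into a planned outage” idea amounts to a tripwire on a monitoring signal. A minimal sketch, with an invented threshold and page text:

```python
# When the fault signal crosses a threshold, switch to a polite
# maintenance page before the system fails outright.

MAINTENANCE_PAGE = "Sorry -- we're doing brief maintenance and will be back shortly."

def should_take_down(error_rate: float, threshold: float = 0.05) -> bool:
    """Fault detected: choose a planned outage over an uncontrolled crash."""
    return error_rate > threshold

def render(error_rate: float, normal_page: str) -> str:
    return MAINTENANCE_PAGE if should_take_down(error_rate) else normal_page

print(render(0.12, "<normal site>"))  # error spike: serve the apology page
```

In practice the signal would come from real monitoring (error rates, latency, queue depth) rather than a single number, but the decision structure is the same.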

  • 1) *Anybody* using an “emerging” service simply has to be honest with themselves and admit that excessive downtime is a risk they agree to assume in exchange for gaining access to the innovative new service. The customer needs to provision their own fallback plans, regardless of the vendor’s offerings and “commitments”.

    2) I thought hardware and open-source software were supposed to be so cheap as to be essentially free? At least that’s the hype or “spin”. If so, what excuse is there for a lack of 400% redundancy? Every business and technology plan should have a special section entitled “How Scalable are we Really?”.

    3) If the business plan for a VC-funded venture targets major accounts which have critical uptime requirements, isn’t it the responsibility of the entrepreneur to gain full funding for the appropriate level of redundancy and fallback to meet targeted customer requirements? If so, what’s the issue here? A failure of the entrepreneur to do robust planning? A failure of due diligence by the VCs?

    4) The important thing is that entrepreneur be up-front and spin-free with customers as to what levels of service they can expect and what levels of “drama” are likely.

    — Jack Krupansky

  • Acceptable Downtime

    Brad Feld wrote a post yesterday titled “Acceptable Downtime”, where he explains that he has a position on the board of a startup which is considering adding redundancy to their web based service to mitigate the possibility of a catastrophi…


  • Downtime unaccepted.

    Brad Feld, a Colorado-based VC, wrote an insightful piece on acceptable downtime for rapidly growing companies. It’s rarely the case, however, that executive management gets full disclosure on what root causes are responsible for the embarrasin…

  • Two avenues for consideration.

    First, couched in terms of the technology product adoption lifecycle, the ‘acceptable’ downtime depends upon where you are on the curve. “Innovators” will put up with a certain amount of crashes and failures. Early adopters have less tolerance. The pain comes as you cross the chasm. The early majority simply won’t ‘buy’ a service/product unless they believe it’s stable and reliable. Downtime at this point could kill the market. So the further you are along the curve, the more you ought to invest in resilience of the systems.

    The second stream of thought is a consideration of how mission-critical the ‘service’ is to end users and how ‘real time’ their requirements are. A system dealing with truck appointments and gate receipts at a container terminal is going to need to be very resilient, as delays can cause major operational impact (and many dollars in consequential losses).

    On the other hand, a t-shirt website has less need to be resilient, since failure is merely inconvenient and may cause the site loss of revenue, but not customers. The tradeoff is loss of reputation and negative word of mouth.

    It’s good advice to develop practical and simple workaround support if the system fails. This isn’t feasible for all online services, but if you can switch to a ‘manual’ mode to maintain service, then customers will prefer that to being totally down.

    This might mean creating static mailto HTML forms and having to maintain manual data entry and manual email response on the backend (we spoofed an auction in Japan this way for a week until a critical bug was resolved) or it might mean fail over to a professional (outsourced) customer service hotline (costs being kept down by it only being used in emergencies).

    In a bootstrapping start-up role, I’ve had to trade off the risk of system outage (or overload at peak times) against the desire (and need) to grow the customer base. This was a prepaid calling card business in Hong Kong. We managed some of the loading by influencing customer behaviour through our pricing strategies. We didn’t have capital to build a fully resilient platform.

    Even where we invested in resilience there were still problems. We hosted our services in a state-of-the-art telco facility and bought capacity from world-leading telco providers. The FM building lost all 3 redundant aircon systems in a typhoon, and their key switching systems overheated and shut down the facility. A separate software upgrade on a provider’s network tripped a major outage on the routers which rippled around the network and caused sporadic failure for a week. Even when you pay for the so-called best you can face downtime.

    I’ve also been involved in projects where a systems architect has gone mad spending millions of dollars on redundant infrastructure way ahead of progress on customer adoption. When customer numbers / transaction volumes are lower, the workarounds are obviously easier. As the numbers grow, it becomes more necessary to build greater resilience into the systems.

    To assist in assessing how much to invest and when, I’d recommend laying out a matrix of: risk of failure, investment cost of redundancy or resilience, risk/cost to customer/business, and time to recovery.
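One way to make that matrix actionable is to rank components by expected loss avoided per dollar of redundancy spend. Every component, probability, and dollar figure below is invented for illustration:

```python
# The commenter's matrix as data:
# component, annual failure probability, redundancy cost ($k),
# cost of one outage ($k), typical recovery time (hours)
ROWS = [
    ("database",      0.20, 400, 500, 8),
    ("web servers",   0.50,  50,  80, 1),
    ("load balancer", 0.10,  20, 200, 4),
]

def expected_annual_loss(p_failure, outage_cost_k):
    """Probability-weighted cost of failure per year, in $k."""
    return p_failure * outage_cost_k

# Spend first where each redundancy dollar avoids the most expected loss.
# (Recovery time could further weight outage cost; omitted for brevity.)
ranked = sorted(ROWS,
                key=lambda r: expected_annual_loss(r[1], r[3]) / r[2],
                reverse=True)

for name, p, redund_k, outage_k, hrs in ranked:
    loss = expected_annual_loss(p, outage_k)
    print(f"{name}: ~${loss:.0f}k expected loss/yr vs ${redund_k}k to make redundant")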
