The Downtime Dilemma: Reliability in the Cloud


CRM Analyst,

When I’m not blogging for Software Advice, I like to do a little personal writing of my own. I use Google’s Blogger as my platform for reflection. A couple of weeks ago, I tried to create a new post, but like thousands of other Blogger-ites, I was unable to do so. After a quick search on Twitter and various user boards, I realized Blogger was down.

The application was unavailable for about 20 hours. This outage is just one in what seems to be a string of recent cloud failures. Amazon’s EC2 is probably the biggest fail story lately. But Microsoft’s BPOS hosted bundle also experienced a significant amount of downtime recently. And, earlier this week, little monsters everywhere went gaga when Amazon released a digital copy of “Born This Way” for 99 cents, causing Amazon to experience another unfortunate crash.

These incidents have been covered extensively on the major tech news outlets, leading the technorati to once again question the reliability of cloud computing. One contributor wrote on the Microsoft Service Forum:

“Back to in-house servers we go, I suppose. This string of incidents will set the cloud/off-site model back months, if not years, I fear…"

When things go awry in the cloud, many companies are affected. And because these periods of downtime are public knowledge, the publicity creates a misconception that cloud computing is unreliable and should be avoided. When on-premise systems falter, however, the failure stays hidden behind the corporate curtain.

Despite cloud computing’s proven track record and growing popularity as a cost-effective solution, it still manages to get a bad rap. But even in light of these highly visible incidents, is the bashing of cloud computing really warranted?

Downtime in the cloud

Anyone who has ever purchased a cloud-based software system is familiar with the Service Level Agreement (SLA). In the SLA, the provider commits to a percentage of uptime, the amount of time the system can be expected to run without interruption. Ideally this would be 100%, but as with most technology, hiccups in service delivery are inevitable.

When creating the SLA, vendors take into account regularly scheduled maintenance, as well as unplanned outages or downtime. Even after making those allowances, most cloud companies can still quote about 99.9% uptime. That looks pretty impressive and is in line with the kind of performance we have come to expect from SaaS vendors. Unfortunately, naysayers still like to harp on that remaining 0.1%.
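To put that 0.1% in perspective, here is a quick back-of-the-envelope sketch (my own illustration, not taken from any vendor’s actual SLA terms) of how an uptime percentage translates into permitted downtime:

```python
# Back-of-the-envelope: how much downtime does a given SLA percentage allow?
def allowed_downtime_minutes(uptime_pct: float, period_minutes: float) -> float:
    """Minutes of downtime permitted by an uptime percentage over a period."""
    return period_minutes * (1 - uptime_pct / 100)

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 (assuming a 30-day month)
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600

for pct in (99.0, 99.9, 99.99):
    per_month = allowed_downtime_minutes(pct, MINUTES_PER_MONTH)
    per_year = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
    print(f"{pct}% uptime -> {per_month:.1f} min/month, {per_year / 60:.2f} h/year")
```

In other words, a 99.9% SLA still leaves room for roughly 43 minutes of downtime a month, or close to nine hours a year.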

Even though cloud systems are recognized as the cash-flow-friendly alternative to on-premise systems, we still have traditionalists who refuse to embrace the cloud. Many prefer to dwell instead on the “what ifs.” What if the host’s servers go down? What if mission-critical data is lost? While these are clearly valid questions, for many on-site purists, what it really comes down to is control. Users feel more secure when they are in control of the system. However, Walter Scott, CEO of GFI Software, offers a reminder:

“Cloud-based solution vendors not only have the latest technology, the latest firewalls, the best data centers and the highest levels of redundancy possible but they will apply multiple layers of [in-depth defense] that your average business (a Fortune 500 company may be an exception) can never have."

Downtime on the ground

Like their cloud computing counterparts, on-premise systems make promises about uptime. The difference is that when outages occur inside organizations, we typically don’t hear about them. As a result, the perception of the always-on on-premise model is skewed.

This lack of coverage also makes it difficult to track down any data on the performance of on-premise systems. However, the Radicati Group conducted a study in 2008 on on-premise email solutions that exposed some interesting points.

Chart: Monthly Downtime for Email Providers

Most notable in the findings is that among the most popular email systems (Microsoft Exchange, IBM Lotus Notes, etc.), there was an average of 30-60 minutes of unscheduled downtime per month. On top of that, there was an average of 36-90 minutes of scheduled downtime. That stands in stark contrast to Gmail’s total downtime of 10-15 minutes.

Clearly, based on these findings, servers can and will fail on occasion no matter where they are being hosted. And from this chart, one might deduce that cloud companies are more efficient at getting back online than companies that host their own servers.
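Annualized, the gap is striking. The sketch below uses the midpoints of the ranges reported above; the midpoint treatment is my own simplification of the study’s figures, not something the Radicati report itself computes:

```python
# Rough annualized comparison using the midpoints of the reported ranges.
def annual_hours(monthly_minutes: float) -> float:
    """Convert minutes of downtime per month into hours per year."""
    return monthly_minutes * 12 / 60

on_prem_unscheduled = (30 + 60) / 2  # 45 min/month midpoint
on_prem_scheduled = (36 + 90) / 2    # 63 min/month midpoint
gmail_total = (10 + 15) / 2          # 12.5 min/month midpoint

on_prem_total = on_prem_unscheduled + on_prem_scheduled  # 108 min/month
print(f"On-premise email: ~{annual_hours(on_prem_total):.1f} hours/year")
print(f"Gmail:            ~{annual_hours(gmail_total):.1f} hours/year")
```

By these rough numbers, the on-premise systems rack up around 21–22 hours of downtime a year versus roughly 2.5 for Gmail.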

Getting to 100%

There is one foreseeable upside to this negative press: it puts a fire under the backsides of cloud computing vendors to constantly improve and stay on the leading edge of technology. I spoke with Denis Pombriant of Beagle Research Group about an article he wrote recently in which he discusses reliability in cloud computing in terms of what users expect from vendors:

“You have to be always up, always available. So, what does that mean? It means that you can’t have a single point of failure.”

That is a tall order, but it’s what the user requires. So, how can we achieve this standard? For starters, Pombriant proposes better system modeling in the cloud. In other words, the architecture needs to be improved.

“If you’re going to have a truly robust and reliable infrastructure, you’re going to have to build much greater reliability into your systems,” he says. “Take electric utilities these days. They all have more generating capacity than is online at any one time because they take plants down, and then they put them up. They eliminate all of the obvious possibilities for failure. That’s what cloud computing has to evolve towards.”

Denis makes a really great point, although the current cloud infrastructure is probably already about 10 times as redundant as most on-premise systems. I think the cloud is simply suffering the consequences of fame. On-premise systems experience the same failures as cloud systems – probably more – but cloud is the “celebrity” model right now, so it gets all the attention, good and bad.

Think about it. Arnold Schwarzenegger isn’t the first guy, or politician for that matter, to have a child with a mistress, but because he was a governor and, more importantly, the Terminator, we hear about it when he does. I do apologize to the cloud for that comparison. The cloud is far more reliable than Arnold, but you catch my drift.

  • ENKI

    Lauren, as you pointed out, reliability isn’t really the issue, since Amazon never violated its own statements about reliability, and despite the meltdown, their overall reliability is still better than most businesses receive from internal infrastructure.  This is true across the board at SaaS vendors like NetSuite or SalesForce, Platform-as-a-Service vendors like Heroku, or Infrastructure-as-a-Service vendors like Amazon.

    Instead, I see the reliability issue as composed of two parts.  First, there is misalignment of *expectations* about reliability versus what cloud vendors actually offer.  Second, this gap is the result of a customer base that is becoming increasingly unable (due to lack of keeping skilled staff on the payroll or because they buy remarketed cloud as SaaS or PaaS) to discern or solve potential reliability problems because they have outsourced the problem to cloud vendors – or at least they think they have!

For example, those customers of Amazon’s who were down for 20 hours thought they were buying hosting, but all along, due to their use model of the cloud, they were paying for remote servers with a fancy provisioning interface and no real high-availability features. Since you brought up politics, I’ll use a political analogy. Much like the problem with our federal deficit, this is the result of people wanting to have their cake (reliability) and eat it too (cost savings). Unfortunately, the laws of physics and probability require more infrastructure for reliability, and that costs money – real customer money. For example, your article talks about getting to 100%, yet I can show you that no system will ever reach 100%, and when you go past four nines, the costs climb exponentially with every additional 9 of uptime. Faced with such a sober accounting, cloud users can decide for themselves how much reliability is enough.

    Where things can be improved immediately is for infrastructure cloud vendors and their customers to talk about the cost/reliability tradeoff so the expectations are aligned.  Yet this is difficult in a mass-market, vending-machine self service market, which is why I advocate that infrastructure cloud customers outsource cloud deployment/management if they don’t want to develop it as an in-house skill (and they shouldn’t have to if they don’t want to!)  In any case, infrastructure cloud customers cannot escape the reality at the moment that they need someone on their side to ensure that the way they use the cloud will result in the uptime they expect.  This also goes for performance, by the way.

    For SaaS cloud customers, the issue is more complex, since they have no control over how their SaaS vendor designs and buys their infrastructure.  The best they can do is make sure that they’re getting a reasonable SLA from their vendor, and that there are meaningful penalty clauses that will keep the vendor focused on reliability.
    All things considered, cloud can still enable significant cost savings and thereby permit web-based businesses to exist that could never have been viable before, so it’s here to stay. But I think the honeymoon period in which the cloud seemed a magical panacea for all IT ills is over. There still has to be someone at the helm of any cloud deployment who knows how to ensure that it is reliable and performs as expected.

    Eric Novikoff

  • Robert.R.Cathey

    Great post, Lauren. Public cloud availability and security failures are indeed much more visible than when similar issues strike the corporate data center.

    Netflix provides an excellent template for how to improve availability and reliability while leveraging the cost and flexibility advantages of public commodity cloud. By spanning multiple data centers and thinking analytically about base/peak demand, they’ve successfully put a critical piece of their product strategy on the lowest cost infrastructure.

    Which brings up an instructive point: All of this assumes that we’re talking about public COMMODITY (or webscale) cloud. Public “clouds” that are essentially virtualization 2.0 strategies built on legacy client/server architectures are not competitive.

  • Lance Becker

    Thank you for sharing some very pertinent facts about Cloud technology in the midst of the media energy around the recent Cloud-based issues/events/failures.
    I think we can safely say that there are issues with every technology and it really doesn’t matter if the media considers them to be great or small when those “issues” happen to your own business. Every IT strategy has a risk-reward scenario that should be reviewed, digested and applied to each individual business environment. The Cloud is no different.
    I’ve posted some of my concerns and suggestions as a business owner and a provider of IT services to other small businesses and I think the items reflect some valid points in approaching Cloud strategies and any other strategies that promote significant changes in dynamics for businesses.
    The Cloud is a very powerful strategy that can – and is – working for many businesses. The Cloud is, however, immature. There are stability and security issues that will, no doubt, be addressed in its coming development stages.
    So it is best to: Practice caution.  Research thoroughly. Use back-ups. Engage experts.
    And watch for what is next on the horizon for the Cloud.
    Lance Becker
    Responza | www.responza.com | Totally Managed IT for Businesses
