In Search of the Unattainable 100% Uptime
Last week, a piece of hardware failed at one of our hosting partners’ data centers, taking a number of websites down temporarily. Thanks to the proactive monitoring service, the situation was discovered right away. Once the cause of the server failure was assessed, the parts in question were immediately replaced and all affected websites were back up within a couple of hours.
Some of our clients whose websites were affected by the downtime naturally inquired about the situation. We gave them as much information as we had available and sent them a number of updates until the situation was resolved. So the #1 thing to do is to ALWAYS communicate with clients about any situation affecting their services and, depending on the particular case, to discuss the available options with them.
But the main purpose of this article is to explain what “Uptime” means when it comes to website availability and how it is measured in the industry. MGR’s Service Level Agreement (SLA) mirrors that of our hosting partners at WP Engine and RackSpace. In our particular case, the SLA reads: MGR through its hosting partners will provide service availability of 99.95% (“Service Availability”), calculated on a calendar month basis. You can read the full SLA here.
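To make “calculated on a calendar month basis” concrete, here is a rough Python sketch of how the availability achieved in a given month could be worked out from a list of outage durations. The function names and the sample outages are made up for illustration. Real SLAs usually spell out exclusions (scheduled maintenance, for example) that this sketch ignores.

```python
import calendar
from datetime import timedelta


def monthly_availability(year: int, month: int, outages: list[timedelta]) -> float:
    """Percentage of the calendar month the service was up, given outage durations."""
    days_in_month = calendar.monthrange(year, month)[1]
    total = timedelta(days=days_in_month)
    down = sum(outages, timedelta())
    if down > total:
        raise ValueError("total downtime is longer than the month itself")
    return 100 * (1 - down / total)


def meets_sla(year: int, month: int, outages: list[timedelta],
              target_pct: float = 99.95) -> bool:
    """True if the month's availability reaches the target (99.95% by default)."""
    return monthly_availability(year, month, outages) >= target_pct


# Hypothetical month with two short outages.
sample = [timedelta(minutes=12), timedelta(minutes=5)]
print(monthly_availability(2024, 4, sample), meets_sla(2024, 4, sample))
```

Because the month is the unit of measurement, the same outage weighs slightly more in February than in a 31-day month.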
Of course, I understand that nobody likes reading all sorts of fine print documents, including Terms & Conditions, Acceptable Use Policies (AUP) or SLAs. Most of us simply check the required box and move on. But it is at least important to know what the key terms mean.
Uptime is one of the most important pieces of data that you can use to measure the performance you get from your web hosting provider. It shows the time, or percentage of time, that the service (a website or any other online application) is up and operational. The opposite of that is downtime: the time, or percentage of time, that the service was unavailable or, put simply, that your website was down.
Availability is basically the same as uptime. It’s often measured in “Nines”, as well as in fractions in between the nines. The chart below shows how each level of service availability translates to downtime over a 30-day period (or calendar month basis):
|Availability|Nines|Downtime in 30 days|
|---|---|---|
|99%|Two Nines|7 hours 12 minutes|
|99.9%|Three Nines|43 minutes 12 seconds|
|99.99%|Four Nines|4 minutes 19 seconds|
|99.999%|Five Nines|26 seconds|
|99.9999%|Six Nines|3 seconds|
Of course, if I told you that your website would have a guaranteed uptime of 99% (2 nines), you would probably be plenty satisfied. However, that would mean your website could be down (read: not available) for a whopping 7 hours a month, or about 88 hours in a year, and it would still fall within the terms of the SLA.
As you can see in the above table, the downtime at 3 nines is ten times smaller than at 2 nines. For a website, the difference between being down 7 hours in a month or just 43 minutes can matter a great deal, depending on the website’s traffic and a number of other factors.
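If you would like to check the table yourself, the arithmetic is simple: multiply the length of the period by the fraction of time the service is allowed to be down. Here is a minimal Python sketch. The function name `allowed_downtime` is made up for this illustration and is not part of any SLA tooling.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Longest total outage that still meets `availability_pct` over `period`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period * ((100 - availability_pct) / 100)


# Downtime budgets for a 30-day month and a 365-day year (illustrative periods).
for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    month = allowed_downtime(pct, timedelta(days=30))
    year = allowed_downtime(pct, timedelta(days=365))
    print(f"{pct}% -> {month} per 30 days, {year} per year")
```

For 99% over 30 days this works out to 7 hours and 12 minutes, the same figure as in the table.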
Why Can’t I Just Get 100% Uptime?
The simple answer is that uptime is never 100%. It is important to understand that data centers are just large groups of computer servers networked together for the remote storage, processing or distribution of large amounts of data. I know we like to think of “the cloud” as some sort of celestial infrastructure that magically allows us to access all sorts of documents no matter where we are. But in reality, the cloud is just a terrestrial building with freezing rooms full of computers. And yes, just like your own personal computer, the servers in these data centers also fail and crash from time to time.
For that reason, you will not find any hosting provider that guarantees 100% uptime. Not even Amazon Web Services, Google’s Gmail, Docs, etc. or even Facebook and much less Twitter. Just last weekend, social media rants were all over the place because some Netflix servers were down and people were not able to stream their favorite movies (those people also need to get a life, by the way). Apparently, this happens so frequently that there are a number of websites tracking Netflix outages too.
If Not 100%, How Many “Nines” Can I Get?
Most entry-level (budget) hosting providers offer at least a 99% uptime SLA. When you start looking for more robust hosting solutions, you will see SLAs offering 99.9% and even 99.99% uptime. As you may have guessed, the major difference between two, three or four nines is cost.
If you look at the above table again, you can determine how much potential downtime you can “live with” without affecting the normal operation of your business. For example, if you own an e-Commerce website handling hundreds of sales and transactions every hour, a single hour of downtime could potentially cost you thousands of dollars. On the other hand, if you have a smaller website that is mostly used as an online catalog or brochure for potential clients, chances are your business will not suffer much if the website is down for a brief period of time.
As a rule of thumb, you can assume that every additional “9” that you add to your uptime (say, from 99% to 99.9% or 99.99%) also adds complexity and redundancy to your hosting setup, to the point that in order to eliminate a few minutes of downtime, you can easily expect your monthly hosting costs to double or triple.
In sum, what you need to determine is what trade-off between money and 9’s is appropriate for your particular case.
The Redundant Server Solution
One of the most commonly offered solutions to avoid (or minimize) downtime caused by hardware or software failure on a single server is to have other servers in the same cluster that you can fall back on. Of course, this option automatically increases costs two or three times. Most hosting companies already run multiple servers; however, to reduce their own costs and remain competitive, they tend to add customers to the cluster faster than they add secondary servers. What this means is that if a single server goes down while the others are already running close to 100% capacity, they will be unable to “pick up the slack” for the failed server, and the entire cluster will still go down.
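A quick way to see why a crowded cluster cannot “pick up the slack” is to work out the load on the surviving servers. The sketch below assumes identical servers and that the load of a failed server is spread evenly over the rest. Both are simplifications, and the function names are invented for this example.

```python
def load_after_failure(servers: int, utilization: float, failed: int = 1) -> float:
    """Per-server utilization (1.0 = 100%) once `failed` servers drop out.

    Assumes identical servers and an even redistribution of the lost load.
    """
    survivors = servers - failed
    if survivors <= 0:
        raise ValueError("no servers left to take over the load")
    return utilization * servers / survivors


def survives_failure(servers: int, utilization: float, failed: int = 1) -> bool:
    """True if the surviving servers stay at or below 100% capacity."""
    return load_after_failure(servers, utilization, failed) <= 1.0


# Hypothetical 4-server cluster: compare 90% and 70% utilization per server.
for busy in (0.9, 0.7):
    print(busy, load_after_failure(4, busy), survives_failure(4, busy))
```

With four servers each at 90%, losing one pushes the remaining three to about 120%, which is more than they can handle. At 70% each, the same failure leaves them at roughly 93%.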
The more robust solution is to add redundancy, not just by adding more servers to the same data center, but by setting up redundant servers in completely different data centers; and by different I mean data centers in different geographic locations. It also means ensuring that each redundant server is capable of handling 100% of the traffic on its own should the primary server fail for some reason.
So now, not only do we have two servers mirroring each other at different locations, but each one is also dedicated and sized so that it can run on its own if needed. Add to that the extra infrastructure and resources (human and technical) required to keep both running efficiently 24/7 and you will see how you end up trading “9s” for dollars!
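To get a feel for what mirrored servers buy you, here is a back-of-the-envelope Python sketch. It assumes that each server can carry all the traffic alone, that the servers fail independently of each other (the reason for putting them in different locations, though never perfectly true), and that failover is instant. The function name is made up for the example.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` independent servers, any one of which can serve all traffic.

    The service is down only when every replica is down at the same time,
    so the combined figure is 1 - (1 - single) ** replicas.
    """
    if not 0 <= single <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one server")
    return 1 - (1 - single) ** replicas


# Illustrative only: servers that are each available 99% of the time.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6%}")
```

Under those assumptions, two 99% servers come out at roughly 99.99% on paper. In practice, failover delays and anything the servers share will eat into that number, and every extra server still has to be paid for.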
When a website goes down, nobody is happy. At MGR we hold ourselves to the highest standards for every service that we provide. But we also understand that perfection is not attainable. It is important not to get caught up in the moment, whether that moment is the 0.1% or 0.01% of the time a website is down, and to reflect on the other 99.9% of the time when everything is working just fine.
The truth is that nobody can predict these types of “Black Swan” events. You can prepare for them and try to anticipate them, but at some point you realize that by throwing a lot of dollars at adding one more “9”, your prevention becomes much more expensive than the actual cure.
You can rest assured that we learn from any situation that we face so that we’re better prepared next time. That’s why we consider other factors, such as track record, how incidents are handled, how frequently they happen, customer service and tech support, as well as the availability of multiple data center locations, among others, to be the more relevant ones when it comes to choosing the right hosting partner.
What are your thoughts? I’d love to hear from you.
Until next time, this is Manuel Gil del Real (MGR)