Hacker News
Cloudy Snake Oil (blog.pinboard.in)
54 points by jmduke on April 19, 2014 | 6 comments


That is the biggest nitpick ever. Nobody on Earth except you read Amazon's claim the way you did. Usually I'm against referring directly to the person making the argument rather than the argument itself, but this is just ridiculous, and it reeks of purposefully misrepresenting Amazon's claims.

Of course all of these insane random catastrophic events could occur. The numbers given by Amazon are representative of the service they've been able to responsibly provide in the past, and they're extrapolating from that. Nobody in their right mind would expect Amazon to maintain those numbers if, for instance, the Earth spontaneously imploded.

Disclaimer for future readers: this comment was written in the year 2014, about three years before Amazon built their Moon Base Backup Storage Service™ (MB2S2)


I understand your point of view, but I have a different opinion.

When I see these kinds of claims, my guess is that they are usually based on a failure analysis given a particular replication degree and estimates of failure probabilities of various components and failure domains. Underlying this analysis is usually an assumption about the independence of failures between the failure domains.

The good news is that industry has moved away from the computer as the unit of independent failure to a much larger failure domain: often a cluster within a data center, or an entire data center. This means that the analysis takes into account the infrequent occurrence of a large number of correlated failures within the failure domain.

The bad news is that there are inevitably correlated failures across the failure domains, regardless of how carefully you design to avoid them. Software bugs, coordinated attacks, operator errors, cascading failures caused by well-intentioned but runaway control loops and automated failover mechanisms, and so on, can be the culprit.
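A toy sketch of why that independence assumption dominates the analysis. All numbers here are hypothetical for illustration, not Amazon's actual figures or methodology:

```python
# Hypothetical figures, purely illustrative -- not Amazon's actual numbers.
p_domain = 1e-3   # assumed annual failure probability of one failure domain
replicas = 3      # assumed replication degree across "independent" domains

# Under the independence assumption, data is lost only if every replica's
# failure domain fails in the same period.
p_loss_independent = p_domain ** replicas   # ≈ 1e-9

# A single correlated failure mode (software bug, operator error, runaway
# automation) that takes out all domains at once is added on top. Even a
# small correlated probability swamps the independent term.
p_correlated = 1e-6
p_loss_total = p_loss_independent + p_correlated   # ≈ 1e-6, dominated by the correlated term
```

The point of the sketch: the impressive tiny number comes entirely from multiplying small probabilities together, and a correlated failure mode three orders of magnitude larger makes the multiplication irrelevant.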

So, here's the problem. This statistic from Amazon, if taken at face value, would say that relying on Amazon to keep your data durable and safe is practically risk-free to the point of durability issues never happening in your lifetime (or, alternatively, to such a dramatically small fraction of objects that you might not care).

In practice, however, I suspect you do want to plan for the "unknown unknowns" that will cause data loss at low probability, but much higher probability than 0.000000001%.

Here's another way to look at it: I'd love it if Amazon posted some data about the rate at which they've experienced durability failures in the past year or two, rather than posting what I'm supposing (I might be wrong!) are calculations based on assumptions of independent failure probabilities.


Any risk estimate makes implicit assumptions, so these are conditional probabilities rather than absolute ones. Presumably Amazon's engineers mean to prefix these data-loss numbers with "under normal operation of our services".

Exceptional and unpredictable incidents certainly make the absolute risk higher. But in the absence of historical data about the rate of occurrence of such events, a reliable assessment of absolute risk is impossible. Not only is it impossible to assign a realistic probability to events you know can happen (e.g. a catastrophic bug), but many of the possible incidents are events you aren't even aware can happen (e.g. are you going to factor in the risk of a meteor strike destroying your data centers?).


If your service has existed for T years and you store N objects and have never lost any objects, then at most you can claim "less than 100/(T*N)%" yearly object loss rate. If you claim any less than that, you have no empirical evidence to back it up.

(The more accurate formula would take into account the length of time each object has been stored, since not all objects were stored for the full time period.)
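The bound above is easy to compute directly. A minimal sketch of the commenter's formula (the function name and example inputs are mine, chosen for illustration):

```python
def max_supported_loss_rate_pct(n_objects: float, years: float) -> float:
    """Upper bound (in percent) on the yearly per-object loss rate that
    zero observed losses over n_objects * years object-years can support.
    Claiming any rate below this has no empirical backing."""
    object_years = n_objects * years
    return 100.0 / object_years

# E.g., a trillion objects stored for 5 years with no observed losses
# supports a claim of "less than 2e-11 % yearly loss rate" -- still far
# short of the 1e-9 % implied by eleven nines of durability per object-decade scale.
bound = max_supported_loss_rate_pct(1e12, 5)
```

Note the simplification the parenthetical points out: this treats every object as stored for the full T years, so the real supportable bound is somewhat looser.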


Didn't Netflix at one point lose data on S3 because an Amazon employee was working in production without realizing it?


This is marketing 101: they're making their product sound impressive by throwing out improbable numbers, but not even the moron in a hurry would truly believe they can store their one object and then check in on it 100 billion years later. The important number (the 99.99...%) is there in the text; the rest is a hyperbolic, arbitrary example intended only to turn percentages that are not easily comprehensible into an MTBF-style frame of reference.
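The marketing arithmetic does check out, for what it's worth. A quick sketch converting the advertised eleven-nines annual durability into the "100 billion years" framing:

```python
# S3's advertised design durability: 99.999999999% per object per year.
durability = 0.99999999999
p_loss_per_year = 1 - durability        # ≈ 1e-11 chance of losing a given object in a year

# Mean time until a single stored object is lost, if that rate held forever:
mean_years_to_loss = 1 / p_loss_per_year   # ≈ 1e11 years, i.e. 100 billion years
```

So the hyperbolic example is just the reciprocal of the loss probability, restated as a waiting time.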

I would also suggest they may not be using historical data to calculate their object durability, but rather a worst-case calculation from current failure models. That's pretty standard practice in all future risk modelling, especially for new systems. Comparing theoretical future risk with measured past resilience is a slightly different beast.



