Why would users refuse to buy hardware that works 99.9999999999% of the time when they apparently have no problem buying software that works 99% of the time?
Radioactive decays and cosmic particles flipping bits give an upper bound for reliability. You are not going to see low-background packages and rad-hard chips in your iPhone.
> Radioactive decays and cosmic particles flipping bits give an upper bound for reliability well below 99.9999999999%
If it works 99.9999999999% of the time, then it has a failure rate of 0.0000000001%, or 1E-12. Considering that a modern CPU executes roughly 1E9 operations per second, and that regular HDDs have a worst-case BER of 1 in 1E14 bits, 1E-12 would actually be rather horrible; the real error rate of computer hardware is much better than that.
Imagine if a CPU calculated 1+1=3 every 1E12 instructions. At current clock rates, that's a fraction of an hour. Computers simply would not work if CPUs had such an error rate.
I picked the 1E-12 number arbitrarily, but it's quite illustrative of the reliability computers are expected to have, despite their flaws.
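For concreteness, a quick back-of-the-envelope check of those numbers, using nothing but the rough figures above:

```python
# Rough sanity check: at ~1e9 instructions per second, an error rate of 1e-12
# per instruction works out to one wrong result about every 1000 seconds.
ops_per_second = 1e9     # rough modern-CPU throughput assumed above
error_rate = 1e-12       # the (arbitrary) one-bad-instruction-per-1e12 figure

seconds_between_errors = 1 / (ops_per_second * error_rate)
print(seconds_between_errors, "seconds, or about",
      seconds_between_errors / 60, "minutes between errors")
```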
Generally speaking, a piece of JavaScript in some 0x0 pixel iframe in a tab you're not even looking at can't summon cosmic particles to flip bits in your computer's main memory. Rowhammer, though, lets it flip those bits without any cosmic particles at all.
No, but it’s straightforward engineering to prevent many of these problems. Error-correcting codes have been well understood for most of a century. Yes, it costs in performance and in money. For life-safety applications, it shouldn’t be optional.
Amazon is hosting life-safety applications in EC2. Relying on commodity x86 hardware in that environment is grossly negligent.
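To make the "well understood for most of a century" point concrete, here is a minimal sketch of Hamming's original single-error-correcting code from 1950, a Hamming(7,4) encoder/decoder. Real DRAM ECC uses wider SECDED codes over 64-bit words, so treat this purely as an illustration of the idea:

```python
# Minimal Hamming(7,4) sketch: 4 data bits, 3 parity bits, corrects any
# single flipped bit. Bit positions are 1..7 with parity bits at 1, 2, 4.

def hamming74_encode(data_bits):
    """Encode 4 data bits (0/1) into a 7-bit codeword."""
    code = [0] * 8                              # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = data_bits
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions 4,5,6,7
    return code[1:]

def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                     # XOR of the positions of set bits
    if syndrome:                                # nonzero syndrome names the bad bit
        code[syndrome] ^= 1
    return [code[3], code[5], code[6], code[7]]

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                                    # simulate a bit flip at position 5
assert hamming74_decode(word) == [1, 0, 1, 1]   # the original data comes back
```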
ECC alone is absolutely insufficient, but ECC can be part of a system design that includes active monitoring and response. I'd expect that system design to also include measuring ECC events under ordinary conditions, regular re-measurement, and funding for an analysis that explains any change in the numbers, just like you'd find in safety engineering at a coal plant, in an MRI machine, or in any discipline that keeps a professional scientist or engineer on site supervising operations.
Of course, you'll also find a tendency there toward specified hardware. They bend or break that rule to use COTS x86 machines, but (as I think I saw in a comment here last week) nearly nobody ever specified AMT in the initial design, so it's pretty weird that we're all buying and deploying it.
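On the "measurement of ECC events under ordinary conditions" point: on a Linux box with an EDAC driver loaded, the corrected/uncorrected error counters are already sitting in sysfs. A rough sketch of reading them (the exact paths can vary with the kernel version and memory-controller driver, so take this as an outline rather than a recipe):

```python
# Read per-memory-controller ECC counters from the Linux EDAC sysfs interface.
# Assumes an EDAC driver is loaded; otherwise the glob simply matches nothing.
import glob
import pathlib

def read_edac_counts():
    counts = {}
    for mc in sorted(glob.glob("/sys/devices/system/edac/mc/mc*")):
        mc_path = pathlib.Path(mc)
        ce = int((mc_path / "ce_count").read_text())   # corrected errors
        ue = int((mc_path / "ue_count").read_text())   # uncorrected errors
        counts[mc_path.name] = {"corrected": ce, "uncorrected": ue}
    return counts

if __name__ == "__main__":
    # Log this periodically and investigate whenever the baseline rate changes.
    print(read_edac_counts())
```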
Almost everything I've seen on error rates from radioactive decay and cosmic particles has been on servers in data centers.
I wonder if home systems are equally vulnerable, or if there is something about data center system design or facilities that makes them more susceptible?
I ask because I had a couple of home desktop Linux boxes once, without ECC RAM, that were running as lightly loaded servers. I ran a background process on both that just allocated a big memory buffer, wrote a pattern into it, and then cycled through it verifying that the pattern was still there.
Based on the error rates I'd seen published, I expected to see a few flipped bits over the year (if I recall correctly) that I ran these, but I didn't catch a single one.
Later, I bought a 2008 Mac Pro for home and a 2009 Mac Pro for work (I didn't like the PC the office supplied), and used both until mid-2017. They had ECC memory, and whenever I checked the memory status, I never saw any report that they had ever actually had to correct anything.
So... what's the deal here? What do I need to do to see a bit flip from radioactive decay or cosmic rays on my own computer?
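For reference, the pattern-scrubbing test described above is only a few lines. Something like this sketch (buffer size and pattern are arbitrary, pure Python is slow at this, and catching a real flip likely means leaving it running for months):

```python
# Fill a large buffer with a known pattern and keep re-verifying it.
import time

SIZE = 1 << 30                     # 1 GiB under test; adjust to taste
PATTERN = 0xAA                     # alternating 10101010 bit pattern
reference = bytes([PATTERN]) * SIZE

buf = bytearray(reference)         # the memory being watched

while True:
    if buf != reference:           # fast C-level comparison of the whole buffer
        bad = [i for i, b in enumerate(buf) if b != PATTERN]
        print("flipped bytes at offsets:", bad[:10])
        for i in bad:
            buf[i] = PATTERN       # repair and keep watching
    time.sleep(60)                 # one verification pass per minute
```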
I think it's multiplication. The odds are low, but the number of potential instances is huge. Data centers have far more machines, and those machines are doing repeated work whose results actually get checked.
Personal machines, by contrast, are limited by what your senses can catch. There are few of them to begin with, and they idle a lot. Even if some piece failed inexplicably, it's not likely to be something you'd personally notice.
(I have personally observed RAM and disk failures on personal machines anyway. And I have seen entries in dmesg indicating hardware faults on my personal desktops, but rarely in a way I'd notice in actual use without looking at dmesg.)
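The multiplication point in numbers, with a made-up per-machine rate purely for illustration (published figures for DRAM error rates vary wildly, so don't read anything into the constant):

```python
# Illustrative only: the per-machine flip rate is assumed, not measured.
flips_per_machine_per_year = 0.5

home_machines = 1
fleet_machines = 100_000

print("expected flips/year at home: ", home_machines * flips_per_machine_per_year)
print("expected flips/year in fleet:", fleet_machines * flips_per_machine_per_year)
# A rare event on one idle desktop becomes a constant background rate across
# a large fleet whose results are actually being checked.
```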
> I wonder if home systems are equally vulnerable, or if there is something about data center system design or facilities that makes them more susceptible?
I was told once that today's concrete has a much higher background radiation than brick and mortar from before the 50s. There is also more steel in data centres.
However, I'm not at all sure if background radiation of building materials is even in the right order of magnitude to matter here. Probably not.
A bit of a tangent, but somewhat related to your bit about concrete: steel salvaged from ships built before 1945 is less radioactive than modern steel and is useful for devices that are extremely sensitive to radiation: https://en.wikipedia.org/wiki/Low-background_steel
The reason the modern stuff is more radioactive is the massive number of atmospheric nuclear weapons tests conducted starting in 1945. I imagine the concrete has the same issue.
I think you need high-energy radiation like cosmic rays from space to create problems, so machines at higher elevations are at greater risk. Heavy material like concrete may block some of this radiation.
Alpha particle emission is a common cause of single-bit errors, especially from the ceramic packaging materials of integrated circuits. Mitigating soft errors from circuit packaging is an active area of research in materials science. Parity bits and CRC error checking exist precisely to reduce the impact of these errors to manageable levels.
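As a tiny illustration of the detection side of that (detection, not correction), here is the idea using the stock CRC-32 from Python's zlib. Real memory and bus protections use different codes, but the principle is the same:

```python
import zlib

payload = bytearray(b"sensor reading: 42")
stored_crc = zlib.crc32(payload)   # checksum taken while the data was known good

payload[3] ^= 0x08                 # simulate a single flipped bit in storage/transit

if zlib.crc32(payload) != stored_crc:
    print("corruption detected; re-read or retransmit the data")
```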
GP is saying the radiation is emitted by the component package itself, i.e. the random decay of atoms in the ceramic surrounding an IC can cause errors.
You are not running uniformly random instructions on the CPU. It doesn't matter how many 9s there are in that percentage: if an attacker knows the 0.000...01% of code that triggers the failure, you have a problem. That actually makes it more insidious, since the chance that it occurs accidentally is basically zero (unlike previous CPU bugs).