Good reliability data is both highly prized in computing and frustratingly difficult to come by. Occasionally, a third-party firm like SquareTrade will publish its own figures, but these reports are few and far between. It’s effectively impossible to track how a manufacturer evolves from year to year without a set of consistent criteria and multi-year tracking. European reseller Mindfactory recently chose to share its GPU RMA data for AMD versus Nvidia products and the results are quite interesting.
I’ve written about Mindfactory’s data before and I’m willing to use it as a source for this article, but I want to note an important caveat that I don’t have an explanation for. According to this data set, Mindfactory sold very few RTX 2070s and 2080s, and only ever shipped a handful of SKUs. I suspect this implies that the data only covers the previous 12 months. That’s relevant if we’re going to draw any conclusions about the relative age of the process nodes these GPUs were built on. The report covers 44,100 AMD GPUs and 76,280 Nvidia GPUs, and is likely large enough to be a meaningful sample of the retail channel cards sold by either company in Europe for the relevant time period.
All of the usual caveats apply. Mindfactory is one European retailer. It isn’t a US company and its data is a snapshot of one slice of the total market, nothing else. These results should not be taken as determinative, they should be read with a grain of salt, contest not valid in Alaska or Hawaii, no participation necessary, see store for details, etc, etc. Moving on.
Here are the high-level takeaways from the chart, in no particular order:
- Less-complex, less-powerful GPUs fail less often than more complex, more powerful GPUs.
- AMD’s midrange and budget cards do not fail more often than midrange and budget cards from Nvidia.
- PowerColor AMD GPUs fail more often than other brands.
- The RTX 2080 Ti is the GPU statistically most likely to fail. It is the only GPU with double-digit failure rates (11 percent) reported from multiple vendors.
- AMD high-end GPUs fail more often, in absolute terms, than Nvidia GPUs, even if we remove the impact of PowerColor from the AMD data. The gap is significantly smaller if you do, however.
Some years ago, a report came out showing failure rates between different types of RAM. If anyone can recall it, shoot me a link — I’ve not had any luck finding the article. What it showed was that it was more common for high-end enthusiast DRAM to fail than low-end basic parts from the likes of Kingston or Crucial. Failure rates didn’t correlate perfectly with clock, but as clock speed climbed, so did the RMA rate. The article I’m recalling wasn’t the Google 2009 study, or the 2012 follow-up, and I don’t think it was the Microsoft 2012 study, either. It was based on consumer hardware, not enterprise or server tech. The point was, enthusiast hardware running close to the margin of what’s possible has a higher failure rate than bog-standard parts that are well within clock and voltage margins.
We see evidence of a very similar trend here. If we assume that this data covers July 2019 – July 2020, it means that Nvidia was still having real problems with the RTX 2080 Ti when the GPU was nearly a year old, long after the company began shipping the card. Conversely, if the data set reaches back to Turing’s launch, it may simply capture the already-known high launch failure rate for the RTX 2080 Ti.
I wish we had more data on the RTX 2070 and 2080, because the limited data we do have suggests some high rates of return on Gainward cards for the RTX 2080 and KFA2 cards for the RTX 2070. The RTX 2070 Super and RTX 2080 Super return rates are excellent. Are they excellent because Nvidia had months to refine Turing, or were they excellent from the beginning? The answer to that question would meaningfully impact how we interpret AMD’s higher RMA rates given that the 5700 XT and 5700 launched on a brand-new 7nm process.
The fact that we see a trend towards lower failure rates on simpler, smaller GPUs from both companies is very likely relevant. The RTX 2080 Ti’s higher failure rate fits with this — the chip was a reticle-buster that pushed engineering to its limit. As for the different manufacturer failure rates, we’ve got nothing but questions. Why did MSI’s Gaming Z Trio RTX 2080 Ti have a 1 percent failure rate with 2 returns (~200 GPUs sold), while the MSI Lightning Z had an 11 percent failure rate with 14 returns (~130 GPUs sold)?
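For readers who want to sanity-check numbers like these, here is a rough Python sketch (not anything Mindfactory published) that turns returns and units sold into a failure rate plus a 95 percent Wilson score interval. The unit counts are the approximate figures quoted above, and the function name `wilson_interval` and the `cards` table are illustrative names chosen for this example.

```python
import math


def wilson_interval(returns, sold, z=1.96):
    """Wilson score interval for a return rate (z=1.96 gives roughly 95%)."""
    if sold <= 0:
        raise ValueError("sold must be positive")
    p = returns / sold                      # observed failure rate
    denom = 1 + z ** 2 / sold
    center = (p + z ** 2 / (2 * sold)) / denom
    half = z * math.sqrt(p * (1 - p) / sold + z ** 2 / (4 * sold ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Approximate figures from the article: (returns, GPUs sold).
cards = {
    "MSI Gaming Z Trio RTX 2080 Ti": (2, 200),
    "MSI Lightning Z RTX 2080 Ti": (14, 130),
}

for name, (returns, sold) in cards.items():
    low, high = wilson_interval(returns, sold)
    print(f"{name}: {returns / sold:.1%} observed, "
          f"95% interval {low:.1%} to {high:.1%}")
```

With samples this small, the intervals span several percentage points, which is a good reason to treat any single card’s rate as a rough estimate rather than a precise figure.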
Drastic variation in GPU failure rates could implicate the manufacturer’s cooling practices or reflect the fact that a company introduced new models of GPU over the course of a year and these later cards failed less often. Higher failure rates on AMD cards could reflect the fact that AMD pushes its GPUs closer to the edge of stability or that AMD’s OEM partners are willing to ride the ragged edge a little closer on AMD cards than on Nvidia because Nvidia has more authority and opportunity to play hardball (and to demand that its GPUs are properly supported). One of the reasons why AMD motherboards were historically less reliable than Intel boards was that AMD could neither force VIA to fix its bugs (like the infamous KT133A southbridge problem) nor require motherboard vendors to devote as much time to debugging and improving AMD motherboard BIOSes as they were willing to invest in Intel boards. Could a similar dynamic be at work here? It could be. The point is, we don’t know. No Sapphire GPU has more than a 2 percent failure rate, and 2 percent matches any Nvidia card. So is this an AMD problem or a PowerColor problem? And if we say it’s a PowerColor problem, was the 2080 Ti a multi-manufacturer issue or something specific to Nvidia?
This is why manufacturers don’t like releasing quality data. Questions beget questions beget questions. Even if we knew the relevant time period, we wouldn’t know when the GPUs Mindfactory sold were actually made. Maybe the retailer got a big batch of initial GPUs of every sort that failed and all failure rates today are basically equal (1-2 percent) between all cards and manufacturers. Maybe the failure rates have spiked recently because COVID-19 killed quality control and companies are just pumping out whatever they can sell. Without more information, we can’t know — and it’s that “more information” that companies don’t want to hand over in the first place.