During AI Day this week, Tesla shattered all rules and established industry standards when it comes to making a computer. The presentation, just like on Autonomy Day, was rather technical, and the people making the presentation again may have failed to take into account that not everyone is fully literate in microprocessor design and engineering. Though, AI Day was geared to excite the geeks and try to hire industry experts, so this was likely an intentional choice.
In this deep dive, we scrutinize and explain everything Tesla has said about computer hardware and compare it to the competition and the way things are normally done. Full warning: this presentation is still quite technical, but we do try to explain it all in plain English. If you still have any questions, please leave them down below in the comments and we will try to answer anything that we can.
To make this easier to digest, we are also splitting it into a series of 4 or 5 articles.
The Tesla GPU Stack
In case it wasn’t clear, Tesla has built, with NVIDIA GPUs, one of the most powerful supercomputers in the world. That is what they call the GPU stack, and it is what they hope their programmers will want to turn off and never use again as soon as Dojo is up and running. During the presentation, they said that the number of GPUs is “more than top 5 supercomputer(s) in the world.” I had to dig it up, but what Tesla most likely meant is that they have more GPUs than the 5th most powerful supercomputer in the world, which would be a supercomputer called Selene that has 4,480 NVIDIA A100 GPUs. If you add the top 5 together, however, Tesla would not be beating the total GPU count. It’s not even close.
However, Tesla’s GPU-based supercomputer, or at least its biggest cluster, is on its own quite possibly also the 5th most powerful supercomputer in the world. We can see that Tesla started receiving GPUs in mid-2019 for its first cluster. This date, and the fact that they mentioned ray tracing during the presentation, could mean that Tesla ordered NVIDIA Quadro RTX cards, although they might have also been older NVIDIA V100 GPUs. Since NVIDIA released the A100 in November 2020, cluster number 2 is likely also made up of the older hardware. If they used V100 GPUs, that would put the second cluster at around 22 PetaFLOPS. That would be right at the very bottom edge of the top 10 list in 2020, might not have even made the list, and would certainly not make the top 10 list now.
Fortunately for us, Andrej Karpathy revealed in a presentation he gave in June that the largest cluster is made up of NVIDIA’s new A100 GPUs. He said that this is roughly the 5th most powerful supercomputer in the world. Considering the components, the theoretical maximum would equal 112.32 PetaFLOPS, putting it in 4th place. However, since there is always some scaling inefficiency when that many GPUs work together, 5th place is most likely an accurate estimate. If we divide the FP16 TFLOP performance in half to estimate the FP32 performance, we get around 90 PetaFLOPS, just a bit less than the Sunway TaihuLight supercomputer in China.
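For anyone who wants to check the 112.32 PetaFLOPS figure, here is a minimal sketch of the arithmetic. Note that both inputs are assumptions that do not appear in the text above: a GPU count of 5,760 (the number given in Karpathy's June talk) and NVIDIA's published peak of 19.5 FP32 TFLOPS per A100. The function name is made up for illustration.

```python
# Assumptions (not stated in the article text):
#   - 5,760 GPUs in the largest cluster, per Karpathy's June talk
#   - 19.5 TFLOPS of non-tensor FP32 peak per A100, per NVIDIA's spec sheet
ASSUMED_GPU_COUNT = 5760
ASSUMED_A100_FP32_TFLOPS = 19.5


def cluster_peak_pflops(gpu_count, tflops_per_gpu):
    """Theoretical peak in PetaFLOPS, ignoring every scaling loss."""
    return gpu_count * tflops_per_gpu / 1000.0  # 1 PetaFLOPS = 1000 TFLOPS


peak = cluster_peak_pflops(ASSUMED_GPU_COUNT, ASSUMED_A100_FP32_TFLOPS)
print(f"Theoretical peak: {peak:.2f} PetaFLOPS")
```

This is a best-case number: it simply multiplies one GPU's peak by the GPU count, which is why the real-world ranking is likely a place lower.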
So, at first glance, it might appear that with 1.1 Exaflop, Dojo would become the most powerful supercomputer in the world. However, Tesla sugarcoated the numbers a bit, and Dojo will in fact become the 6th most powerful computer in the world. Right now, the most powerful supercomputer is “Fugaku” in Kobe, Japan, with a world record of 442 PetaFLOPS, roughly three times faster than the second most powerful supercomputer, “Summit,” in Tennessee, USA, which has 148.6 PetaFLOPS. Dojo, with its 68.75 PetaFLOPS (approximately), would then be in 6th place. In fact, because the next 3 supercomputers are quite close (61.4 to 64.59 PetaFLOPS), it’s possible that Dojo is in seventh, eighth, or even ninth place. Later on in this series, we will explain this in greater detail under the colorfully named section “Tesla flops the FLOPS test.”
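To put those figures side by side, here is a small sketch that uses only the numbers quoted in the paragraph above. The variable and function names are made up for illustration, and Dojo's 68.75 PetaFLOPS is itself an approximation.

```python
# PetaFLOPS figures quoted in the paragraph above.
FUGAKU = 442.0
SUMMIT = 148.6
DOJO_APPROX = 68.75
NEXT_THREE_RANGE = (61.4, 64.59)  # lowest and highest of the next three systems


def lead_percent(system, rival):
    """How far ahead `system` is of `rival`, as a percentage of `rival`."""
    return (system - rival) / rival * 100.0


fugaku_vs_summit = FUGAKU / SUMMIT
dojo_lead = lead_percent(DOJO_APPROX, max(NEXT_THREE_RANGE))

print(f"Fugaku is {fugaku_vs_summit:.2f}x Summit")
print(f"Dojo leads its closest rival by {dojo_lead:.1f}%")
```

A lead of only a few percent over the closest rival is why a small shortfall in Dojo's real-world performance could drop it by several places.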
Nonetheless, this is absolutely nothing to laugh at. In fact, when it comes to the specific tasks that Tesla is creating Dojo for, it is very likely that Dojo will outperform all other supercomputers in the world combined, and by a very large margin. To put it another way, the standard test for supercomputers is peeling apples, but Tesla has a yard full of oranges and designed a tool for that. The mere fact that, in addition to being the best in the world at peeling oranges, it is still able to get 6th place for peeling apples just shows how incredible this system is.
Moving away from raw compute performance, Dojo and its jaw-dropping engineering put all supercomputers to shame in almost every other way imaginable. To logically explain this, we need to start at the beginning, at the small scale.
What is an SoC
The way every computer works right now is that you have a processor (CPU), or in some cases two in a business server. The processor or processors go onto a motherboard that houses the RAM (fast temporary memory, 8–16GB in good laptops and desktops), and the computer has a power supply that feeds electricity into the motherboard to power everything. Most consumer desktops have a separate graphics card (GPU), but most consumer processors also have built-in graphics.
Now, if you haven’t read it yet, you may want to read my previous article in which I analyze Tesla’s Hardware 3 chip (Elon Musk called it a good analysis on Twitter, so you can’t go wrong there). To very quickly summarize: the Tesla HW3 chip and most consumer processors are actually SoCs, or “systems on a chip,” because they include cache memory (sadly, only a few megabytes) as well as a processor and a graphics card and, in the case of the Tesla HW3 chip, two NPUs, or Neural Processing Units.
Now, before we move on, I need to explain something about how processors, graphics cards, and SoCs are normally made. All of the components, like transistors, are not added to individual processors. They are all placed while the processor is part of a circular disc called a wafer, which you can see in the image above. That wafer is then cut into pieces, each of which becomes an individual processor, GPU, or SoC. The chip fabrication does not always go well and often some processors don’t work or are only partially operational. In the industry, the common term to describe this issue is a low wafer yield.
Even most people who don’t know much about computer hardware know that Intel offers Celeron/Pentium, i3, i5, i7, and i9 processors, and that this order runs from weakest to strongest. What most people don’t know is that, due to problems with wafer yield, some of those processors are defective and only partially work. So what Intel does is disable the broken part of the chip and sell it as a cheaper version. This is called binning. So, a Celeron/Pentium is a broken i3 and an i5 is a broken i7. Even within a tier there are various versions of an i5 and i7: chips that can’t reach the maximum clock speed are locked and sold as a cheaper variant of that chip. Whether Intel still does this today with their latest chips, I am not sure, but they still did this as recently as 2017. The point is that rather than throw away a defective wafer or defective chips in a wafer, you can still salvage your yield.
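The binning idea described above can be sketched in a few lines of code. Everything in this sketch is hypothetical and only there for illustration: the tier names, the core counts, and the clock speed thresholds are made up and are not Intel's actual criteria.

```python
from dataclasses import dataclass


@dataclass
class Die:
    working_cores: int     # cores that passed testing
    max_stable_ghz: float  # highest clock speed the die sustained in testing


# Hypothetical bins, best first: (name, minimum cores, minimum GHz).
HYPOTHETICAL_BINS = [
    ("top tier", 8, 4.5),
    ("mid tier", 6, 4.0),
    ("entry tier", 4, 3.0),
]


def assign_bin(die):
    """Return the best bin whose requirements the die meets, else 'scrap'."""
    for name, min_cores, min_ghz in HYPOTHETICAL_BINS:
        if die.working_cores >= min_cores and die.max_stable_ghz >= min_ghz:
            return name
    return "scrap"


def sellable_yield(dies):
    """Fraction of dies on a wafer that land in any sellable bin."""
    if not dies:
        return 0.0
    sellable = sum(1 for die in dies if assign_bin(die) != "scrap")
    return sellable / len(dies)


wafer = [Die(8, 4.7), Die(8, 4.2), Die(5, 4.8), Die(2, 5.0)]
for die in wafer:
    print(die, "->", assign_bin(die))
print(f"Sellable yield: {sellable_yield(wafer):.0%}")
```

In this made-up wafer, only one of the four dies is good enough for the top tier, but binning lets three of the four be sold instead of one.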