Tesla’s Dojo Supercomputer Breaks All Established Industry Standards — CleanTechnica Deep Dive, Part 3

August 22, 2021, by Chanan Bos


If you missed the initial parts of this series, first read: Tesla’s Dojo Supercomputer Breaks All Established Industry Standards — CleanTechnica Deep Dive, Part 1 and Tesla’s Dojo Supercomputer Breaks All Established Industry Standards — CleanTechnica Deep Dive, Part 2.

Networking SoCs Together

Normally, each SoC sends signals through pins down into the motherboard, which then redistributes them. Tesla doesn’t cut the SoCs out of the wafer. Instead, it connects all of the SoCs on the wafer with 72 networking nodes for a total of 16 TB/s, or 4 TB/s on each edge that connects to a neighboring SoC. This means that each networking node on the chip is capable of roughly 222 GB/s (16 TB/s divided by 72). During the presentation, Tesla said that this is twice as fast as the current state-of-the-art networking switch chips. At first I was skeptical of that claim, but after doing some research, I found that it holds up in theory: large network chip companies like Broadcom and Cisco have only been able to achieve speeds of 25.6 terabits per second per chip, which converts to 3.2 terabytes per second.

I do see why Tesla was surprised that the gold standard did not seem all that great compared to what they were able to make, especially since networking is not the primary purpose of this chip, whereas for networking chips it is. Tesla’s D1 chip and Training Tile just keep impressing at every single turn.
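The bandwidth arithmetic in this section is easy to check. The short Python sketch below only uses the figures quoted above and assumes decimal units (1 TB = 1,000 GB) and 8 bits per byte; the variable and function names are made up for illustration.

```python
# Figures quoted in the article (decimal units: 1 TB = 1000 GB).
TOTAL_BANDWIDTH_TB_S = 16   # total bandwidth quoted for one SoC
NETWORKING_NODES = 72       # networking nodes quoted for one SoC
EDGES = 4                   # the total is split over four edges
SWITCH_CHIP_TBIT_S = 25.6   # quoted switch-chip figure, in terabits per second


def per_node_gb_s(total_tb_s: float, nodes: int) -> float:
    """Bandwidth of a single networking node in GB/s."""
    return total_tb_s * 1000 / nodes


def terabits_to_terabytes(tbit: float) -> float:
    """Convert terabits to terabytes (8 bits per byte)."""
    return tbit / 8


node_gb_s = per_node_gb_s(TOTAL_BANDWIDTH_TB_S, NETWORKING_NODES)
edge_tb_s = TOTAL_BANDWIDTH_TB_S / EDGES
switch_tb_s = terabits_to_terabytes(SWITCH_CHIP_TBIT_S)

print(f"per node:    {node_gb_s:.0f} GB/s")
print(f"per edge:    {edge_tb_s:.0f} TB/s")
print(f"switch chip: {switch_tb_s:.1f} TB/s")
```

The per-node value comes out at about 222 GB/s and the switch-chip figure at 3.2 TB/s, matching the numbers in the text.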

Networking Training Tiles Together

Now, for the next unit of measurement, here is some good background information. A traditional hard drive with spinning disks, the kind nearly everyone owns and that can easily hold multiple terabytes, is unfortunately somewhat slow, with a read/write speed between 50 and 150 MB/s. Keep in mind that we are now talking about sequential speeds, like transferring files, and not the random speeds related to RAM. A regular solid-state disk, or SSD, which uses NAND flash memory and is connected via a standard SATA port, will have a speed between 200 and 500 MB/s. The newer NVMe SSDs connected via an M.2 slot can reach a speed of 8 GB/s, and the very latest SSDs making use of the new PCI-e Gen 4 connection have a theoretical limit of 64 GB/s, though the fastest product available on the market only reaches 15 GB/s. Speaking of PCI-e Gen 4, Tesla also uses that to connect its Training Tiles (or wafers). With 40 connectors and 36 TB/s of bandwidth, each connector enjoys a speed of 900 GB/s. But how is that possible when I just said that 64 GB/s is the limit for PCI-e Gen 4?

Well, that only holds true for the largest connection available to consumers, which is the PCI-e Gen 4 x16 slot. In the image above, you can see the difference between the connectors. As Tesla announced, it made its own custom connectors, and that is how each connector reaches 900 GB/s. In essence, this makes Tesla’s connector, which is relatively compact all things considered, about 14 times faster than the best connector a regular motherboard has to offer.
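As a quick sanity check on the connector comparison, the sketch below takes the 900 GB/s per-connector figure and the 64 GB/s PCI-e Gen 4 x16 limit quoted above, and works out both the speedup and the total that 40 such connectors add up to. The constant names are illustrative, and decimal units (1 TB = 1,000 GB) are assumed.

```python
PCIE4_X16_GB_S = 64          # article's stated limit for a PCI-e Gen 4 x16 slot
TESLA_CONNECTOR_GB_S = 900   # article's per-connector figure for the Training Tile
CONNECTORS = 40              # connectors per Training Tile, per the article

# How many times faster one Tesla connector is than a PCI-e Gen 4 x16 slot.
speedup = TESLA_CONNECTOR_GB_S / PCIE4_X16_GB_S

# Aggregate off-tile bandwidth implied by 40 connectors at 900 GB/s each.
implied_total_tb_s = CONNECTORS * TESLA_CONNECTOR_GB_S / 1000

print(f"speedup over PCI-e Gen 4 x16: {speedup:.1f}x")
print(f"implied total bandwidth:      {implied_total_tb_s:.0f} TB/s")
```

The ratio works out to roughly 14, and 40 connectors at 900 GB/s correspond to 36 TB/s in total.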

The Tesla D1 Chip Specs

According to its specifications, the D1 chip has 50 billion transistors. Among processors, that beats the current record held by AMD’s Epyc Rome chip, which has 39.54 billion transistors. Among graphics cards, though, NVIDIA’s GA100 Ampere SoC still comes out on top with 54 billion transistors. The fact that a 7nm process has been used to fabricate the chip tells us that Tesla used either Samsung or TSMC to make it happen. Personally, I think Samsung is more likely, since it is also Samsung that fabricated Tesla’s HW3 chip.

Moore’s Law Rant

This paragraph is a bit of a tangent. Someone added the D1 chip to a Moore’s law chart on Twitter, and Elon replied that it is “Pretty wild.” I just want to set something straight: that chart is very misleading.

First of all, the data on it is completely cherry-picked to fit that line. There are all kinds of chips with various transistor counts at various points in time. As was mentioned earlier, NVIDIA has an SoC with 4 billion more transistors than Tesla’s D1 chip. Then, trying to compare top-tier supercomputers with regular desktop computers or something with vacuum tubes is just apples to oranges. The only reason the chart even forms that line is that its author used a logarithmic scale and cherry-picked data, and even then most of the data points are unlabeled, all of which obfuscates the truth.

Try putting all Intel chips from the same price category on a chart (or at least their top tiers) and see how Moore’s law falls apart at the seams. Moore’s law was true at first, but as we continued to die shrink to a lower nm count and started to approach the point where the Heisenberg uncertainty principle makes it hard to guarantee that an electron stays in the transistor, progress has slowed down significantly and no longer follows this trend line.

The Cooling and the Power

This wasn’t completely clear until later in the Q&A, though I suspected it all along: the whole Training Tile is liquid cooled. Interestingly enough, they did not say water cooled, so I wonder what liquid they use. Nonetheless, the real revelation here is how well they are able to cool this silicon wafer. Tesla has a lot of experience with power electronics and cooling, and they put that expertise to great use here.

Normally, one side of a processor is a piece of motherboard-quality silicon with pins that lead signals into the motherboard, and that side is obviously impossible to cool. On the other side is the SoC, which is covered by some thermal grease (usually not very good thermal grease either) and then a metal heat spreader, which is what gives a processor the metal look you might have seen before. A manufacturer, computer repair person, or PC enthusiast then puts more thermal grease onto the heat spreader and mounts the smooth metal of the cooling block on top of it. The cooling block passes the absorbed heat either directly into a radiator with a fan or into a liquid (usually water), which carries the heat to a larger radiator farther away from the processor, to which you can attach multiple fans.

In the case of the Tesla training tile, one side of the wafer with all SoCs is as exposed as on a regular processor (even more exposed, since there is no heat spreader) and can be cooled directly. The other side has voltage regulators covering every SoC. So, there are two innovations here. First of all, the voltage regulator is usually located on the motherboard right next to the processor, meaning that the current needs to travel through the motherboard, the socket, the pins, and the motherboard-quality silicon on which the SoC is located. Here, the regulators sit directly on the back of the wafer.

However, that is not all. The much bigger innovation is also the final step that makes this whole regulators-on-top arrangement possible. Usually, the current reaches the SoC from all of the sides via pins. If you have ever seen a basic old chip with lots of pins on all sides, it’s basically like that, but obviously a lot more advanced and with a lot more pins. In this case, the power travels directly onto the SoC. How exactly they managed to do this is unclear, but it is rather impressive, and depending on how it was done, it might also produce less heat if the voltage can be introduced at multiple points of the chip so that the current does not have to travel as far.

For the heat that all these voltage regulators do emit, a cooling block with some holes for connectors surrounds the voltage regulators to take away as much heat as possible from that side as well. As I said, the cooling block has holes, and a single power supply unit powers all of the voltage regulators simultaneously. It plugs right in on top, and right on top of that is yet another cooling block to cool the power supply unit, even though it looks suspiciously like a radiator.

Tesla Flops the FLOPS Test

Now that we have gone through all of the nitty-gritty details, we can finally compare Dojo to the competition.

A single Training Tile has 9 PetaFLOPS of computational power. I kind of skipped over what a PetaFLOP even is since you, dear reader, were not yet down the rabbit hole, but now that you are: a PetaFLOP is composed of two parts. Peta is the prefix that comes after Tera, Giga, Mega, and Kilo, and FLOPS stands for floating-point operations per second. This is different from TOPS, or tera operations per second. Those are used for INT8, INT16, and INT32 calculations, and we can forget about them now (though I should mention that NVIDIA sadly sometimes only releases performance in TOPS rather than FLOPS). I am not going to try to get technical and explain what these performance values stand for. I am just going to make sure you don’t accidentally end up comparing apples to oranges, as Tesla sort of did.

You see, when someone gives you a FLOPS figure, you need to check whether they mean FP64, FP32, or FP16, since each is twice as hard as the next. Since Dojo only supports FP32 and the hybrid version of FP32 and FP16, which Tesla referred to as BF16, I originally assumed that the 1.1 ExaFLOPS figure represented FP32 performance. This would have been great news, since the most common test that a supercomputer goes through is the HPL-AI test, which gives us a bunch of FP32 PetaFLOPS scores that we can compare this to. However, on closer inspection, Tesla’s 1.1 ExaFLOPS figure was for BF16/CFP8 and not FP32. Thank goodness that on one slide they gave the FP32 performance of a single SoC, which is 22.6 TeraFLOPS, and it happens to be right next to the BF16/CFP8 score, which is 362 TeraFLOPS.

Now, performance per SoC and performance as a whole do not always scale equally, and these tasks are not exactly equal either. The math we did here is pretty simple, though: if you divide the supercomputer’s BF16/CFP8 score by the BF16/CFP8 score of one SoC and multiply the result by the FP32 score of one SoC, you get 68.674 PetaFLOPS. In reality, this number could be a bit higher or lower. However, as I quickly mentioned in the introduction, since the 5th and 6th places are so close to this figure, it’s possible that Dojo is anywhere from the 4th to the 7th most powerful supercomputer in the world. My bet would be 4th place, though.
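For anyone who wants to redo that estimate, here is a minimal Python sketch of the same calculation. It assumes, as the paragraph above does, that FP32 throughput scales linearly with the number of SoCs, which real systems rarely achieve exactly. The function and constant names are invented for illustration.

```python
# SI prefixes used for FLOPS figures.
TERA = 1e12
PETA = 1e15
EXA = 1e18

# Figures quoted from Tesla's slides in the article.
DOJO_BF16_CFP8_FLOPS = 1.1 * EXA   # whole system, BF16/CFP8
SOC_BF16_CFP8_FLOPS = 362 * TERA   # one SoC, BF16/CFP8
SOC_FP32_FLOPS = 22.6 * TERA       # one SoC, FP32


def estimate_fp32_flops(system_low: float, soc_low: float, soc_fp32: float) -> float:
    """Scale one SoC's FP32 score by the SoC count implied by the low-precision scores.

    Assumes perfect linear scaling, so treat the result as a rough estimate.
    """
    implied_soc_count = system_low / soc_low
    return implied_soc_count * soc_fp32


implied_socs = DOJO_BF16_CFP8_FLOPS / SOC_BF16_CFP8_FLOPS
fp32_pflops = estimate_fp32_flops(
    DOJO_BF16_CFP8_FLOPS, SOC_BF16_CFP8_FLOPS, SOC_FP32_FLOPS
) / PETA

print(f"implied SoC count: {implied_socs:.0f}")
print(f"estimated FP32:    {fp32_pflops:.3f} PetaFLOPS")
```

Running it gives an implied count of about 3,039 SoCs and an FP32 estimate of 68.674 PetaFLOPS, the figure used in the ranking comparison.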

Stay tuned for the final part of this series, publishing shortly….
