If you missed it, first read: Tesla’s Dojo Supercomputer Breaks All Established Industry Standards — CleanTechnica Deep Dive, Part 1.
Tesla Breaks the Rules
What Tesla is planning to do with its Dojo training tiles is to throw out the whole industry standard of cutting the wafer into pieces. It is leaving 25 SoCs on the wafer and using that expensive, super-high-quality silicon to let the chips communicate with each other without any loss in speed caused by large, bulky cables or even the lower-quality silicon of a motherboard. As far as I am aware, this is completely unprecedented.
The biggest challenge facing Tesla, however, is making sure the wafer has a 5 by 5 section with each SoC working flawlessly in order to make the system work the way they expect it to. Given the training tile’s shape with rounded edges, it’s very possible that this represents the entire wafer and that the whole thing needs to work flawlessly — an empty wafer does have a dark grey color, after all.
Wafer yield may therefore be an issue for Tesla. Still, considering that it only needs 120 fully functional wafers for Dojo, it should be able to pull it off. By comparison, Intel made more than 130,000 wafers in 2014, and those were large 300 mm wafers, not the smaller ones that Tesla is using. Also, since the smaller wafer is not even filled to the brim as a normal wafer would be, the costs should be significantly lower. That said, the perfect-quality silicon a wafer is made of is not cheap either.
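To get a feel for why an all-or-nothing 5 by 5 grid is hard on yield, here is a rough sketch. It assumes that defects hit each SoC independently, which is a simplification, and the per-die yields are made-up illustration values, since Tesla has not published any. The function names are my own.

```python
import math

DIES_PER_TILE = 25      # the 5 by 5 grid of SoCs described above
TILES_NEEDED = 120      # fully functional wafers quoted in the article


def tile_yield(per_die_yield: float, dies: int = DIES_PER_TILE) -> float:
    """Chance that every die on the tile works, assuming independent defects."""
    return per_die_yield ** dies


def wafers_to_start(good_tiles: int, per_tile_yield: float) -> int:
    """Average number of wafers to process to end up with `good_tiles` good ones."""
    return math.ceil(good_tiles / per_tile_yield)


# Hypothetical per-die yields; these are NOT Tesla figures.
for p in (0.90, 0.95, 0.99):
    y = tile_yield(p)
    print(f"per-die yield {p:.0%}: tile yield {y:.1%}, "
          f"about {wafers_to_start(TILES_NEEDED, y)} wafers for {TILES_NEEDED} good tiles")
```

The takeaway is that the tile yield falls off quickly as the per-die yield drops, because all 25 dies have to be good at once. With only 120 good wafers needed, even a poor tile yield still means a small absolute number of wafers.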
What is also unprecedented (as far as I am aware) is a computer that does not have any RAM outside of the SoC. Even a smartphone and Tesla’s HW3 have RAM chips outside of the SoC. Even the fastest new hard drives (which we will get to in a bit) cannot access memory randomly as fast as RAM can, and so cannot replace it. Theoretically, the latest PCIe 4.0 tech available on the market would only reach 0.5–3 GB/s rather than the 20–25 GB/s that is standard for consumer computers with DDR4 RAM, or even up to 50 GB/s for the next-generation DDR5 RAM that is starting to roll out in data centers. When it comes to size, smartphones and consumer computers usually use 4–32 GB of RAM, and professional workstations can even reach 512 GB of RAM.
So if Tesla’s training tile has no RAM, what gives? Well, there is an even faster tier of random-access memory, and that is called a cache. This is something I also covered last time but will detail once more. DRAM, or RAM as most people call it, has a response time of around 60 nanoseconds when the SoC/CPU calls on it, whereas the L3 cache or the on-chip SRAM can have a response time as low as 10 nanoseconds. The largest L3 cache Intel has right now is 57 MB, IBM’s record is 120 MB, AMD’s most powerful processors have 256 MB of L3 cache, and Tesla’s HW3 chip announced back in 2019 has 64 MB of SRAM.
Then, finally, Tesla’s new Training Node has 1.25 MB of high-speed SRAM. Wait, what? That sounds wrong. Well, that is because we are talking about nodes, and 354 nodes make up a compute array. This means that an SoC has 442.5 MB of cache memory (354 × 1.25 MB), beating all the others. However, I don’t believe the fun ends there. Considering that the SRAM is located directly on each node and that Tesla called it “High-Speed” SRAM, I suspect that rather than an L3 cache, we are talking about the even faster L2 cache. A non-shared L3 cache the way IBM does it is also a possibility, but a lot less likely, since those caches are 10 MB in size, which is a different league, whereas Intel has 1 MB of L2 cache per core.
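As a quick sanity check on the per-chip total, here is the multiplication spelled out, together with a comparison against the other cache sizes quoted earlier. All figures are the ones used in this article. The variable names are just mine.

```python
NODES_PER_CHIP = 354       # node count quoted in the article
SRAM_PER_NODE_MB = 1.25    # per-node SRAM quoted in the article

total_sram_mb = NODES_PER_CHIP * SRAM_PER_NODE_MB
print(f"Dojo chip: {total_sram_mb} MB of on-chip SRAM")

# Largest caches quoted earlier in the article, in MB.
others = {"Intel L3": 57, "Tesla HW3 SRAM": 64, "IBM L3": 120, "AMD L3": 256}
for name, size_mb in others.items():
    ratio = total_sram_mb / size_mb
    print(f"{name}: {size_mb} MB (the Dojo chip has {ratio:.1f}x as much)")
```

Keep in mind that this adds up memory that is private to each node, so it is not directly comparable to one shared L3 cache that every core can reach.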
Considering the size of the 1.25 MB cache per node, I would be willing to bet that this is an L2 cache. One of the main differences between the L1 & L2 cache vs. L3 cache (besides their speed and size, which we will get to in a moment) is that the L1 & L2 cache are usually located directly on each node/core, whereas L3 is usually (with the exception of IBM) located elsewhere on the chip and is shared by all cores/nodes.
So, if the 1.25 MB is an L2 cache, this would put it ahead of that Intel chip we mentioned earlier. Even though Intel’s L3 cache was 57 MB, it has only 1 MB of L2 cache per core. And since Intel’s core count of 38 is much lower than Tesla’s node count of 354, the overall amount of cache on the Intel processor (38 MB of L2 plus 57 MB of L3, or 95 MB in total) is a lot lower. Since I failed to mention it until now: an L1 cache has a response time of 0.5 ns and an L2 cache has a response time of 3–4 ns. As mentioned earlier, the L3 cache has a response time of 10 ns and DRAM has a response time of 60 ns.
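To put those response times side by side, here is a small sketch that expresses each tier as a multiple of L1 and totals up the Intel and Tesla cache figures used above. The 3.5 ns value for L2 is my own midpoint of the 3–4 ns range, and all numbers are the approximate ones from this article rather than measurements.

```python
# Approximate response times from the article, in nanoseconds.
# The L2 value is an assumed midpoint of the quoted 3-4 ns range.
LATENCY_NS = {"L1": 0.5, "L2": 3.5, "L3": 10.0, "DRAM": 60.0}

for tier, ns in LATENCY_NS.items():
    print(f"{tier}: {ns} ns, {ns / LATENCY_NS['L1']:.0f}x the L1 response time")

# Total cache per chip, using the per-core and per-node figures quoted above.
intel_total_mb = 38 * 1.0 + 57.0   # 38 cores x 1 MB of L2, plus 57 MB of shared L3
tesla_total_mb = 354 * 1.25        # 354 nodes x 1.25 MB of per-node SRAM

print(f"Intel: {intel_total_mb} MB, Tesla: {tesla_total_mb} MB")
```

If the per-node SRAM really does behave like an L2 cache, every access would land in the 3–4 ns tier instead of falling through to a 10 ns L3 or 60 ns DRAM.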
Next, as you can see, there is something Tesla labeled as either “1 cache”, “i cache”, or lowercase “l cache”. My bet would be that this is the fastest tier, the L1 cache, and more specifically the L1 instruction cache. Most processors have two L1 caches, one for instructions and one for data, though in the past this was one single cache that was used for both. In any case, assuming that Tesla eliminated the L1 data cache and this is a 32 KB instruction cache, then the chip has 11.328 MB of L1 cache (354 × 32 KB), or double that if Tesla does have an L1 data cache and is counting the two as one in its graphic.
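Here is the same estimate as a short sketch covering both scenarios. The 32 KB per-node size is my assumption from above, not a figure Tesla has confirmed, and the function name is made up for illustration.

```python
NODES_PER_CHIP = 354
L1_PER_NODE_KB = 32   # assumed size of one L1 cache per node, not confirmed by Tesla


def l1_total_kb(caches_per_node: int) -> int:
    """Total L1 across the chip, in KB, for 1 (instruction only) or 2 (plus data) caches."""
    return NODES_PER_CHIP * L1_PER_NODE_KB * caches_per_node


for caches in (1, 2):
    kb = l1_total_kb(caches)
    # Show both conventions: 1 MB = 1000 KB (used in the text) and 1 MiB = 1024 KiB.
    print(f"{caches} L1 cache(s) per node: {kb} KB = {kb / 1000} MB = {kb / 1024} MiB")
```

The MB figure in the text uses 1,000 KB per MB. With the binary convention of 1,024 KB, the single-cache total comes out slightly lower.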
Back to the matter at hand, it was already weird enough that the Training Tile has no DRAM, but it gets even weirder when you realize that its SoC does not include a shared L3 cache either. It’s important to keep in mind that this is a very specific system fine-tuned to a very particular task, whereas most processors have a wider array of components so that they are flexible enough to fit all kinds of tasks. So, as strange as the design sounds, the missing components that you would usually expect to find in an SoC might have been unnecessary and were removed for the sake of cost and simplicity, or they may have even been a crutch that would slow down the system.
Stay tuned for part 3 and part 4.