Published on June 15th, 2019 | by Chanan Bos
Tesla’s New HW3 Self-Driving Computer — It’s A Beast (CleanTechnica Deep Dive)
A month ago, Tesla revealed several secrets regarding the new chip the Silicon Valley company has designed for full self-driving capability. However, some of the people making that presentation may have failed to take into account that not everyone is fully literate in microprocessor design and engineering. I am not fully literate in it either, but I have been a computer enthusiast for quite some time and know a few things that might help me pick out some of the highlights, point out why they are so exciting, and further communicate how Tesla really is way ahead of the competition. I also have some theories about what this new chip can lead to.
Full warning: this presentation is still quite technical, but I do try to explain the key points in plain English as well.
What’s on the Board?
First, we can see a general image of the board. The board has complete redundancy, meaning that any system on the board can fail and the computer will continue operating as if nothing happened. On the right side of the board is where all the cameras plug in, and on the left side is where the power supply connects, along with some input and output connectors. In the middle of the board you see the stars of the show, the 2 processors (note that “processor” is only a semi-accurate description, as you will see later). Tesla uses 2 processors for redundancy and for cross-referencing the results, not to increase performance.
Good analysis. Both computers will be used & sync ~20 times per second. This is a long time to a computer. Like a twin-engine plane, use both engines to max for normal operation, but can safely operate on just one.
— Elon Musk (@elonmusk) June 16, 2019
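To get a feel for why roughly 20 syncs per second is "a long time to a computer," here is a minimal back-of-the-envelope sketch. It assumes the 2.2GHz CPU clock quoted elsewhere in this article, and the function names are made up for illustration. It says nothing about how Tesla actually implements the sync.

```python
# Figures taken from the article: ~20 syncs per second (the tweet above)
# and a 2.2 GHz CPU clock. Everything else here is illustrative only.
SYNCS_PER_SECOND = 20
CPU_CLOCK_HZ = 2.2e9


def sync_interval_ms(syncs_per_second: float) -> float:
    """Time between two synchronization points, in milliseconds."""
    return 1000.0 / syncs_per_second


def cycles_between_syncs(clock_hz: float, syncs_per_second: float) -> float:
    """Clock cycles a single core gets through between two syncs."""
    return clock_hz / syncs_per_second


if __name__ == "__main__":
    interval = sync_interval_ms(SYNCS_PER_SECOND)
    cycles = cycles_between_syncs(CPU_CLOCK_HZ, SYNCS_PER_SECOND)
    print(f"Time between syncs: {interval:.0f} ms")
    print(f"Clock cycles per core in that window: {cycles:,.0f}")
```

In other words, under these assumptions each core has on the order of a hundred million clock cycles of work between two sync points.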
Below and a bit to the left of each processor (marked light blue) are the flash memory chips that store the operating system. The capacity of each chip is unknown at the moment, but considering that nowadays you can buy a micro-SD card with a capacity of 500GB, it could potentially be quite big.
On the left and right side of each processor, you see 4 LPDDR4 chips (marked green). The processor is being fabricated by Samsung, and this has mistakenly led some people to believe that the RAM is from Samsung too, but that is actually not the case.
If you take a really close look at the chip, you can see a little logo. Samsung does not put a logo like that on its RAM chips, but Micron does, and its logo, especially the one it puts on chips, seems to very closely match what we see in this image. Micron also happens to manufacture LPDDR4 RAM and even has a product line targeting the automotive industry. The reason why Tesla went for Micron rather than Samsung is probably that Micron’s LPDDR4 RAM has a higher clock rate, 2133MHz rather than Samsung’s 1600MHz.
LPDDR4 chips are a type of DRAM, which is how they will be referred to later in the article. For some perspective, LPDDR4 is the low-power version of DDR4, which is currently used in desktops and laptops. LPDDR4 is a bit slower than DDR4, but depending on the variant can in some cases outperform DDR3. LPDDR4 is also what is currently used in smartphones.
Once you take off the heat spreader, the die is revealed, and Tesla has told us a lot about it. The size of the die is around 260mm². To put that in perspective, the processor in an iPhone is around 80–120mm², an Intel laptop/desktop processor die is around 180mm², NVIDIA’s Xavier chip die is 350mm², and the chips on dedicated graphics cards range from 400 to 800mm².
The Guts (or Brains) of Tesla’s SoC
First let’s clear up a huge misconception. I called Tesla’s chip a processor earlier, but that is not completely accurate. It is actually a full system-on-a-chip (SoC). Tesla put a processor, a graphics card, a neural processor, as well as a bunch of other things you probably didn’t even know existed onto this single chip.
Tesla explains the whole process by roughly following the path that the data from the cameras take. First, the data come in through what is labeled as “Input” at a maximum rate of 2.5 billion pixels per second, which roughly equates to 20 Full HD 1080p screens at 60 frames per second. This is a hell of a lot more data than the currently installed sensors create. The data then travel into the DRAM we discussed earlier, which is one of the first and main bottlenecks of the chip since it is the slowest component. Then the data go back into the chip and through the image signal processor, which can process 1 billion pixels per second (roughly 8 Full HD 1080p screens at 60 frames per second). This part of the chip turns the raw RGB data from the camera sensors into data that is actually useful, in addition to enhancing the tone and removing noise.
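The comparison to Full HD screens is simple arithmetic, and a short sketch makes it easy to re-run with other resolutions or frame rates. It assumes a 1080p frame is 1920 × 1080 pixels and that every stream runs at 60 frames per second. The function name is hypothetical.

```python
# Pixel rates quoted in the article.
INPUT_PIXELS_PER_SECOND = 2.5e9   # maximum rate of the camera input
ISP_PIXELS_PER_SECOND = 1.0e9     # image signal processor throughput

# Assumption: "Full HD 1080p" means 1920 x 1080 pixels per frame.
FULL_HD_PIXELS = 1920 * 1080
FRAMES_PER_SECOND = 60


def full_hd_streams(pixels_per_second: float, fps: int = FRAMES_PER_SECOND) -> float:
    """How many 1080p streams at `fps` add up to the given pixel rate."""
    return pixels_per_second / (FULL_HD_PIXELS * fps)


if __name__ == "__main__":
    print(f"Input: {full_hd_streams(INPUT_PIXELS_PER_SECOND):.1f} Full HD streams")
    print(f"ISP:   {full_hd_streams(ISP_PIXELS_PER_SECOND):.1f} Full HD streams")
    ratio = INPUT_PIXELS_PER_SECOND / ISP_PIXELS_PER_SECOND
    print(f"The input can accept {ratio:.1f}x what the ISP can process")
```

The last line is the interesting one: the input can take in two and a half times more pixels than the image signal processor can handle.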
Then we finally get to the most interesting part of the whole chip, the neural network processor, or NPU. The first step in the process is that the data get stored in the SRAM array. Now, a lot of people, even ones who know a bit about computer components, may be wondering, “What on earth is SRAM?” Well, the closest comparison would be the shared L3 cache you will find on your computer’s processor. What does all of this mean, though? It means storage that is really fast but also expensive. Right now, Intel’s largest L3 cache is 45MB (it was 16MB until 2010 and 24MB before 2014). Most consumer laptop and desktop processors have between 8 and 12MB of L3 cache. Tesla’s chip has a whopping total of 64MB of SRAM, divided into two 32MB segments, one for each of the two neural network processors. Tesla considers its large SRAM capacity to be one of its biggest advantages over any other kind of chip it could have potentially used.
This might actually be enough memory to store, render, and process a single frame of all cameras and sensor inputs combined, but because the frames are not bad-quality JPEGs but instead large enhanced lossless frames, it probably isn’t. Keep in mind that if the cameras do indeed work at 60 frames per second and one combined frame fills the 64MB of SRAM, that alone would equal 3.84 gigabytes of data processed per second. Since a single frame is probably larger, I don’t even want to venture a guess how many gigabytes per second this is, but I do know it’s less than 68 gigabytes.
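The 3.84 gigabytes per second figure can be reproduced in a few lines. This is only a sketch of the arithmetic: it assumes decimal units (1MB = 1,000,000 bytes) and treats the 68 gigabytes per second DRAM bandwidth as a hard ceiling, ignoring that frames are both written to and read from DRAM. The function names are invented for illustration.

```python
# Figures from the article, counted in decimal units (an assumption).
SRAM_BYTES = 64 * 10**6             # 64MB of SRAM in total
FRAMES_PER_SECOND = 60
DRAM_BYTES_PER_SECOND = 68 * 10**9  # 68 gigabytes per second


def data_rate_gb_per_s(frame_bytes: float, fps: float) -> float:
    """Data rate if one combined frame of `frame_bytes` arrives `fps` times a second."""
    return frame_bytes * fps / 1e9


def largest_frame_mb(bandwidth_bytes_per_s: float, fps: float) -> float:
    """Largest combined frame that fits in the bandwidth at `fps`.

    Rough upper bound only: real traffic also includes reads, weights, and
    other data, so the usable frame size would be smaller.
    """
    return bandwidth_bytes_per_s / fps / 1e6


if __name__ == "__main__":
    rate = data_rate_gb_per_s(SRAM_BYTES, FRAMES_PER_SECOND)
    print(f"One SRAM-sized frame at 60 fps: {rate:.2f} GB/s")
    print(f"Share of DRAM bandwidth: {rate * 1e9 / DRAM_BYTES_PER_SECOND:.1%}")
    limit = largest_frame_mb(DRAM_BYTES_PER_SECOND, FRAMES_PER_SECOND)
    print(f"Frame size that would saturate DRAM: about {limit:.0f} MB")
```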
All the data travel through the primary corridors/hallways of the chip, also known as the “Network on a Chip” or “NoC” (painted blue in the image), and then to the LPDDR4 DRAM, which has a bandwidth of 68 gigabytes per second. Tesla indicated during the Autonomy Day presentation that this is enough but could be better, and from that we gather that Tesla will likely improve it in its next-generation product. Right now, it’s not totally clear whether the bottleneck is the bandwidth of the DRAM or the amount of SRAM.
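The 68 gigabytes per second figure lines up with the 2133MHz clock rate mentioned earlier if you make two assumptions that are not spelled out in this article: that the memory transfers data twice per clock cycle (it is double data rate memory) and that the combined memory bus is 128 bits wide. A minimal sketch under those assumptions:

```python
# From the article: LPDDR4 clocked at 2133 MHz (versus 1600 MHz for the alternative).
CLOCK_MHZ = 2133

# ASSUMPTIONS, not stated in the article:
TRANSFERS_PER_CLOCK = 2   # double data rate: one transfer on each clock edge
BUS_WIDTH_BITS = 128      # total width of the memory bus


def peak_bandwidth_gb_per_s(clock_mhz: float,
                            bus_width_bits: int = BUS_WIDTH_BITS,
                            transfers_per_clock: int = TRANSFERS_PER_CLOCK) -> float:
    """Theoretical peak bandwidth in gigabytes per second (decimal units)."""
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = bus_width_bits / 8
    return transfers_per_second * bytes_per_transfer / 1e9


if __name__ == "__main__":
    for mhz in (2133, 1600):
        print(f"{mhz} MHz -> {peak_bandwidth_gb_per_s(mhz):.1f} GB/s peak")
```

Under the same assumptions, the 1600MHz parts would top out noticeably lower, which fits the guess that clock rate drove the choice of memory supplier.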
The neural network processor is an incredibly powerful tool. A lot of the data go through it, but some of the computational tasks have not yet been adjusted to work on a neural network processor or are not suitable for that kind of processor. This is where the GPU comes in. The GPU in this chip has (per Tesla) modest performance, runs at 1GHz, and is capable of 600 GFLOPS. Tesla indicated that the GPU currently performs some post-processing tasks, which could potentially include the creation of pictures and videos that are understandable for humans. However, from the way Tesla described the role of the GPU in its presentation, expect the next iteration of the chip to have a much smaller GPU.
There are also some general-purpose processing tasks unsuitable for the neural processor that are done by the CPU. The way Tesla explained it, there are 12 ARM Cortex A72 64-bit CPUs in the chip running at 2.2GHz. A more accurate description, though, would be to say that there are three 4-core CPUs in there. Tesla’s choice of going with ARM’s Cortex A72 architecture is a bit puzzling, however. Cortex A72 is an architecture from 2015. Since then, the A73, A75, and a few days ago even A77 architectures have been released. Elon and team explained it by saying that this was what was available when they started the design of the chip 2 years ago. Perhaps this was a cheaper option for Tesla, which would make sense if multithreaded performance is more important to them than single-thread performance, hence the inclusion of 3 older processors rather than one or two newer or more powerful ones. Multithreading usually requires a bit more programming work to distribute tasks properly, but hey, this is Tesla we’re talking about, so it’s probably a piece of cake for the company. In any case, the CPU performance on this chip is 2.5 times higher than what Tesla had in the previous version, HW2.
NVIDIA Feels the Need to Save Face
That was a lot of technical talk, so let’s take a short break and I’ll show you something funny. After Tesla’s Autonomy Day, NVIDIA published a new blog entry complimenting Tesla for “raising the bar for self-driving.” Immediately after that, NVIDIA tried to save face by patting itself on the back with useless metrics of comparison.
Tesla’s HW2 is powered by an NVIDIA Xavier chip that can do 21 to 30 TOPS (tera operations per second). Tesla’s new HW3 chip can do 144 TOPS.
Tesla in its presentation stated that NVIDIA’s Xavier chip is capable of 21 TOPS. NVIDIA tried to correct Tesla in its blog, saying that it’s actually 30 TOPS instead of 21. The thing is, NVIDIA’s Xavier chip is built for multiple purposes and tries its best to conform to the requirements of multiple potential clients. Thus, the chip doesn’t have a neural network processor, but can successfully simulate one using software and some of its deep learning–focused hardware. When Tesla said 21 TOPS, that was the result it got by running its simulated neural networks on the GPU of the chip. Tesla’s benchmark is very simple in its measure: “How many TOPS can our software reach on this hardware?” That is an entirely different question from how many TOPS the hardware can produce with software written to fully utilize the chip. Theoretically, if the chip were given some other task in another scenario, it might be able to reach that 30 TOPS figure, but that is a pretty useless metric in this context. Nonetheless, it’s understandable that NVIDIA would like to set the record straight for other customers or potential customers.
Worth remembering is that, when benchmarking a complex piece of software, it is all about the performance that specific software can realize. This is why the best hardware is not always the hardware with the highest theoretical performance.
In the past, we used to only have a general purpose processor with a numerical co-processor. Then we got the graphical co-processor, and now the NeuralNet co-processor. Although, ironically, in this case, the CPU is more of a co-processor to the neural processing unit. Basically, what Tesla did was create a specialized processor that is way better at an extremely specific task, but would suck at general purpose processing. So, yeah, the only game this chip is good at is running through the roads in the matrix we all live in — but it’s really good at that.
To further defend its pride, NVIDIA stated that when you combine Xavier with a powerful GPU in the company’s DRIVE AGX Pegasus product, you can achieve 160 TOPS. If Tesla for its purposes can again only utilize 70% of that (the same ratio as 21 out of 30 TOPS on Xavier) due to the need to virtualize the neural network processor, that translates to a maximum of 112 TOPS and wastes a lot of power. NVIDIA also went on to say that the DRIVE AGX Pegasus can reach 320 TOPS by stacking two units in parallel, but this is unrealistic for this particular application.
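The 70% and 112 TOPS figures follow directly from the numbers quoted so far. As a rough sketch, and assuming (as the reasoning above does) that the ratio Tesla measured on Xavier carries over unchanged to Pegasus, the arithmetic looks like this. The function names are hypothetical.

```python
# TOPS figures quoted in the article.
XAVIER_MEASURED_TOPS = 21    # what Tesla reported for its own software
XAVIER_PEAK_TOPS = 30        # what NVIDIA states the chip can do
PEGASUS_PEAK_TOPS = 160      # NVIDIA's figure for DRIVE AGX Pegasus
HW3_TOPS = 144               # Tesla's figure for the HW3 chip


def utilization(measured_tops: float, peak_tops: float) -> float:
    """Fraction of the theoretical peak that a given workload reaches."""
    return measured_tops / peak_tops


def effective_tops(peak_tops: float, utilization_ratio: float) -> float:
    """Peak figure scaled by the utilization a workload actually achieves."""
    return peak_tops * utilization_ratio


if __name__ == "__main__":
    ratio = utilization(XAVIER_MEASURED_TOPS, XAVIER_PEAK_TOPS)
    # Assumption: the same ratio applies to Pegasus, which is not a given.
    pegasus = effective_tops(PEGASUS_PEAK_TOPS, ratio)
    print(f"Utilization on Xavier: {ratio:.0%}")
    print(f"Estimated usable TOPS on Pegasus: {pegasus:.0f}")
    print(f"HW3 advantage over that estimate: {HW3_TOPS / pegasus:.2f}x")
```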
When we talk about Internet speed, we care not only about the speed, but also about the latency/response time. Tesla in this case already complains about the latency of information reaching the chip from the DRAM, which is right next to it. The latency of data traveling between multiple loosely interconnected chips over a flimsy NVLink cable would be totally unacceptable.
Also, that doesn’t take into account that an electric car is powered by a battery, not a nuclear reactor, and the amount of electricity you would need to power this 4-chip solution would drain your battery before you even reach the highway. Efficiency really is key here.
NVIDIA’s solutions focus more on combining multiple chips for performance. The company is stuck in its marketing need for more cores, better CPUs, and better GPUs, connected with NVLink, rather than building chips for a specific use case. This is great for companies trying to perfect software or universities working on a project, but this solution is not efficient enough for real-world applications.
So, there you go — that is Tesla’s hardware version 3. So then, what can we expect for hardware version 4? Right now, all we know is that it will be aimed at further improving safety. The only thing that really tells us is that it will not be focused on making an old car learn new tricks, but that doesn’t mean it won’t include some of that, too. Here is my list of potential changes and improvements HW4 could have, ranked from most likely to most speculative:
- Tesla will most likely use a newer CPU version. Based on when Tesla started designing the architecture, that will likely be the Cortex A75. The increased processing power gives Tesla the opportunity to save power and space on the chip, making room for more important components.
- Tesla may upgrade to LPDDR5, which would result in a significant speed increase and a reduction in power consumption. However, if the HW4 chip is already in the design process, or to keep costs down, Tesla may go with LPDDR4X. By using a lower voltage, LPDDR4X saves power, but it can still result in a speed increase if multiple chips are used in parallel. That parallel configuration would not save power compared to HW3, though. Either choice would represent an overall improvement over HW3.
- Further improved neural processing units with even more SRAM.
- Depending on whether or not the processing capacity on the chip can handle the full resolution and frame rate that the cameras are capable of, Tesla’s HW4 might come with new cameras and sensors with a higher resolution and maybe even a higher frame rate. Higher resolution images are critical, as more detail will help the computer identify objects more accurately, and at greater distance.
- An upgraded image signal processor (ISP). Tesla wanted to make its chip as cheap and as powerful as possible. That’s why there is a large disconnect in HW3 between what the chip input is capable of handling and what the ISP is capable of handling, hence the need for a beefier or secondary ISP, depending on which solution requires less power or less space, or costs less.
- A smaller GPU. One of the reasons there is still a moderate GPU in the HW3 SoC is that not all of the processing tasks have been transferred to the neural network yet. Including a moderate GPU may have been a shortcut for Tesla to give its programmers enough time to re-allocate any remaining GPU processing tasks to either the NPU or CPU. Eliminating the GPU entirely might not be possible; however, a smaller GPU with a smaller footprint on the SoC needs less NoC, which frees up budget and room for more critical components like more SRAM.
Tesla’s HW3 computer is an absolute beast. It can handle 7 times as many frames, has 7 times larger neural nets, and as was said in the presentation, “There are a lot of ways you can spend that.” Being a computer tech enthusiast, watching Tesla’s Autonomy Day presentation was better than going to Disneyland. When it comes to achieving Full Self-Driving capability, the first step is having your priorities straight, and Tesla certainly does.
There are a few points that have not been stressed enough, and this leads to people underestimating or not understanding why Tesla is actually leading the race to develop fully autonomous vehicles, and leading by a significant margin. There is a really good analogous problem. All other manufacturers that are now starting to make EVs have some advanced tech, but still have not been able to beat Tesla’s Model S from 2012, and that’s just on the EV side of things, not even considering the computer/software/UI side of things. The reason why competing EV tech hasn’t caught up is simple: vertical integration.
Let me dumb it down slightly to explain: Imagine you are a manufacturer and you need to build a website. You could go to one of those platforms where you drag and drop some widgets, pages, and solutions on there and type some text, or you could have a whole team of dedicated programmers make a professional website. Traditional automakers are trying to do the former with electric cars and self-driving. They are ordering LEGOs from different companies and hoping they fit together. Where they do not fit, they simply use a knife to carve one LEGO to make it fit the other. Tesla, on the other hand, is a lot more like what you see in this Tweet liked by Elon Musk:
Gotta love zero tolerance machining 🖤 pic.twitter.com/EaXmJ8w9mK
— World of Engineering (@engineers_feed) June 2, 2019
With Tesla’s new HW3 computer, everything is made to fit like a glove, to fit almost as well as what you see in the tweet above. Elon Musk has said that Full Self-Driving only really makes sense in an electric car, and he is right. To put that a bit more pointedly, it isn’t worth doing it for an internal combustion engine car. The lack of instant torque makes self-driving less effective and less safe when it comes to avoiding crashes and dealing with slippery and icy road conditions, something we will dive into in an upcoming article.
The most important reason, however, is that investing resources in a self-driving solution for a dying and soon-to-be-extinct product category like a gas car is just dumb, plain and simple.
When making a self-driving solution for an electric car, power efficiency might be the second most important metric after safety, and it is currently not getting nearly enough consideration (or, at least, not effectively so) by any other automaker or chipmaker. This is yet another reason why Tesla is light years ahead.