Do you remember that clip of the Tesla learning how to drive in a simulation during Tesla’s AI Day event? The video included footage of a photorealistic simulation that teaches Tesla’s vehicles to handle difficult corner cases better. The YouTube channel Two Minute Papers dives into that much more deeply and also links it to specific recent research in the AI space.
During Tesla’s AI Day event, Tesla shared a lot of its AI development with the world and showed how the Tesla team is improving the technology as they go, with the aim of one day enabling fully autonomous Tesla vehicles. Much of it went over our heads, though. Two Minute Papers slows things down a bit for us here and more clearly explains some processes.
At the same time, Two Minute Papers is using this Tesla video to show how the findings from various academic AI research papers are making their way into real-world applications, and quickly. At points, the narrator pinpoints specific papers linked to specific tasks Tesla FSD is doing.
One thing we all know about Tesla vehicles is that they have eight cameras. One key step for Tesla is to create a “vector space view” from those eight camera feeds, something similar to a map or video game version of the roads and surrounding objects. The video emphasizes that this is a difficult problem because the feeds from all of the cameras have to be combined into one “3D map” of the world the car is seeing. Each camera sees only part of the surroundings, and each of those parts needs to be stitched together with the others. In the video, a truck is used as an example.
The narrator is struck by the fact that even papers from 2020 are being used to inform and improve Tesla FSD development, an incredibly fast turnaround from publication to influencing real-world AI.
Back to the example from the Tesla video: a large truck is passing the Tesla. One camera saw the truck in its entirety, while another saw only the cab. A third camera saw the trailers, and a fourth saw the cab of the truck along with a partial view of the trailer. The car has to figure out exactly what it’s looking at and how long the truck is in order to accurately place the truck into the vector space view.
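To make the truck problem a bit more concrete, here is a deliberately naive sketch in Python. It is not how Tesla does it. It assumes each camera has already reported which stretch of the truck it can see, measured in metres along the road in one shared coordinate frame, and that we already know all the views belong to the same truck. The camera names and every number are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartialView:
    camera: str     # hypothetical camera name, for illustration only
    start_m: float  # nearest visible point of the truck along the road, in metres
    end_m: float    # farthest visible point of the truck along the road, in metres


def merge_extent(views):
    """Union the partial extents into one overall extent.

    Assumes every view is in the same coordinate frame and shows the same
    truck, which is exactly the part that is hard in practice.
    """
    if not views:
        raise ValueError("need at least one view")
    start = min(v.start_m for v in views)
    end = max(v.end_m for v in views)
    return start, end, end - start


# Made-up observations mirroring the four views described above.
views = [
    PartialView("camera_a", 4.0, 20.0),  # sees the whole truck
    PartialView("camera_b", 4.0, 6.5),   # sees only the cab
    PartialView("camera_c", 6.5, 20.0),  # sees the trailers
    PartialView("camera_d", 4.0, 12.0),  # sees the cab and part of the trailer
]

start, end, length = merge_extent(views)
print(f"truck spans {start} m to {end} m, length {length} m")
```

The catch is in the two assumptions: real cameras deliver pixels, not neat measurements in a shared frame, and nothing tells the car in advance that four partial shapes are one truck.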
“What we need for this is a technique that can fuse information from many cameras together intelligently.”
The difficulty lies in the different calibrations, locations, viewing directions, and other properties of each camera. However difficult it may be, though, it’s not impossible, and the video points out that a “transformer neural network” is able to accomplish this. The technique demonstrated in the video worked significantly better than the single-camera network. Again, the narrator ties this to a specific published paper.
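For readers curious what sits at the heart of a transformer, the core operation is called attention. The Python sketch below shows scaled dot-product attention for a single query: one spot in the top-down view "asks" which camera features are relevant to it and blends them accordingly. This is a toy illustration under heavy assumptions, not Tesla's network. In a real transformer the queries, keys, and values come from learned layers, while here they are hand-picked numbers, and the three "cameras" are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is compared with the query; the resulting weights decide how
    much of each value goes into the fused output.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


# Hand-picked toy numbers (assumptions, not real data).
query = [2.0, 0.0]              # what one cell of the top-down view is looking for
keys = [[1.0, 0.0],             # camera 1 feature: strong match
        [0.0, 1.0],             # camera 2 feature: no match
        [0.7, 0.7]]             # camera 3 feature: partial match
values = [[1.0, 0.0, 0.0],      # one-hot values, so the fused output
          [0.0, 1.0, 0.0],      # simply shows the attention weights
          [0.0, 0.0, 1.0]]

fused, weights = attend(query, keys, values)
print("attention weights:", [round(w, 3) for w in weights])
```

The camera whose feature best matches the query gets the largest weight, the partial match gets less, and the mismatch gets the least. Letting the network learn these comparisons is what allows it to cope with cameras that differ in position, direction, and calibration.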
If you’re interested in a more in-depth analysis of some of the AI technology used and the published papers behind it, watch the full video here.