In a video by YouTuber Dr. Know-it-all Knows it all, the good doctor shared his thoughts on Tesla’s Full Self-Driving (FSD) Beta and why it is such a big deal. Yes, we all know how revolutionary the technology is behind FSD, but do we really? The doctor jumped into trajectory prediction and 4D data continuity. He explained the reason why Andrej Karpathy and Elon Musk “went on and on” about how Tesla’s new 4D training and processing is so important.
“The two biggest items are data continuity and trajectory projection,” he explained. “Together, these two bring about a massive leap forward and open a path to true level 4 autonomy.” He started by giving a bit of his background with his master’s thesis, Image-Based Content Retrieval via Class-Based Histogram Comparisons, for the Institute for Artificial Intelligence, at the University of Georgia, Athens.
He described the image-based system as a quasi-semantic way of identifying objects via a neural network and then finding other matching objects in a database of images. “Semantically, for a computer to recognize a cat doesn’t mean that it understands that it’s a cat. It just means that that arrangement of edges and pixels and colors and so forth is associated with a label that it knows is called ‘cat.’ And so that’s how it’s doing it.”
He calls it quasi-semantic because the computer does not understand the object the way we as humans do. Humans learn things along with language. A certain blob of colors is mom or dad. Another blob could be your hand. “What we think of as basic knowledge like this is really difficult for humans to understand,” he pointed out, adding that we have evolution helping us to understand. “It’s also super difficult for computers to understand that. Therefore, this is actually still kind of cutting edge research.”
How This Affects Driving
The issue is that knowing a picture contains something related to a semantic label (for example, a dog) doesn’t give the computer any more information about what a dog actually is. It doesn’t give any insight into how a dog behaves. In the video, he compared a black Labrador Retriever with a trash bag; to a computer, these could look similar.
In order to drive effectively at Level 4 and Level 5 autonomy, the computer needs to know the difference between these. He pointed out as well that if the computer was able to identify a dog in one picture, then traditionally that didn’t actually help it figure out what was in the next picture. “That’s continuity,” he noted, and then added, “If I know what something is in one picture, I should be able to know that it’s that in the next picture and the next picture, in other words, four-dimensionally over time.” This is why Musk and Karpathy keep talking about “4D.” Time, or continuity, is the fourth dimension.
Dog Example: Google Search vs. Video Sequence Processing
The doctor compared a Google image search for a dog with video sequence processing to point out the major differences. When you do an image search for a dog, the computer pulls up thousands of images of different dogs in different situations. The search engine is able to figure out what these pixels and shapes show, and only has to find images that match that particular label. In the case of video sequence processing, the computer has to process each frame. The doctor showed video stills of a dog in a blue box, frame by frame.
In the first frame, the dog is in a standing position with a branch in its mouth. In the next frame, the dog has moved, but it is still standing in a similar position to the one in the first frame. Frame three, however, showed the dog in a completely different stance, and he pointed out that the computer may have a bit more of an issue identifying it as a dog. “This is very inefficient because it has to reprocess everything at every single frame and it’s prone to error at any given frame, because it might just decide that that’s not a dog on one given frame,” he said. Suddenly, you’ll have dog, dog, dog, dog, not dog, not dog, dog, dog, an inconsistency that will cause problems for the driving capabilities.
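The flicker described above can be illustrated with a minimal Python sketch. The confidence scores and the 0.5 threshold are invented for illustration and do not come from the video or from Tesla's software; the function names are hypothetical.

```python
# Hypothetical per-frame "dog" confidence scores from an image classifier.
# These numbers are made up to mirror the article's example sequence.
FRAME_SCORES = [0.91, 0.88, 0.85, 0.90, 0.42, 0.38, 0.87, 0.93]
THRESHOLD = 0.5  # assumed decision threshold


def classify_independently(scores, threshold=THRESHOLD):
    """Label each frame on its own, with no memory of earlier frames."""
    return ["dog" if s >= threshold else "not dog" for s in scores]


def count_flips(labels):
    """Count how often the label changes between consecutive frames."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


labels = classify_independently(FRAME_SCORES)
print(labels)
print("label flips:", count_flips(labels))
```

Because every frame is judged in isolation, two low-scoring frames in the middle are enough to make the object briefly stop being a dog as far as the planner is concerned.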
The doctor pointed out that this would lead to very rudimentary driving skills. “It’s kind of like having an early student driver in the car. They’re super cautious. They make dumb mistakes. It hugs the center of the lane. If it gets confused, it slows down in the middle of the road for no reason whatsoever, just phantom braking, etc.” He noted that these problematic areas need to be resolved if we’re really going to have the sci-fi full self-driving we expect, with the car driving itself. In order to make this happen, we need “continuity of data over a long sequence of frames or video plus a temporally semantic understanding of what the car is seeing.” This is where 4D training comes in.
He noted that Tesla’s previous FSD software looked at eight individual cameras, along with the radar and sonar, about every thirtieth of a second. That is some seriously fast processing. “It processed all of this information separately. It identified the objects and then it acted on this,” he said, while adding that for the next frame, it threw out all of this information and started over again.
What Does the New FSD Beta with 4D Capabilities Give Us?
As Elon has stated before, and as the doctor has said in his videos, this is a complete, ground-up rewrite of the software, with new neural network models and new algorithms. It runs on the Tesla inference engine, the chips that Tesla has in its vehicles. All eight cameras are now linked together in one view, which he noted is treated as a video sequence instead of individual images. Along with this, object detection now carries more information.
Going back to the dog example, the doctor pointed out that, in this case, the computer is able to intelligently figure out whether or not the object in frame 2 is a dog by asking itself if the dog moved a little bit. This makes mistakes in identifying that dog over time less likely. The good doctor said, “Obviously, this data continuity over time is exactly the 4D that Musk and Karpathy are talking about. What this 4D training opens the door to is trajectory projection.” Trajectory projection requires the computer to possess a great deal of knowledge because it has to understand what these objects are.
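One simple way to picture data continuity is to let each frame's decision lean on what earlier frames already established. The Python sketch below uses an exponential moving average of per-frame confidence scores. This is a stand-in chosen for illustration, not Tesla's actual method, and the scores, the `ALPHA` weight, and the threshold are all assumptions.

```python
# Hypothetical per-frame "dog" confidence scores; frames 5 and 6 are weak.
FRAME_SCORES = [0.91, 0.88, 0.85, 0.90, 0.42, 0.38, 0.87, 0.93]
THRESHOLD = 0.5  # assumed decision threshold
ALPHA = 0.6      # assumed weight given to the belief carried over from earlier frames


def smooth_scores(scores, alpha=ALPHA):
    """Blend each new score with the running belief from previous frames."""
    belief = scores[0]
    smoothed = [belief]
    for score in scores[1:]:
        belief = alpha * belief + (1 - alpha) * score
        smoothed.append(belief)
    return smoothed


def to_labels(scores, threshold=THRESHOLD):
    """Turn confidence scores into labels."""
    return ["dog" if s >= threshold else "not dog" for s in scores]


raw_labels = to_labels(FRAME_SCORES)
smoothed_labels = to_labels(smooth_scores(FRAME_SCORES))
print("per-frame:", raw_labels)
print("with memory:", smoothed_labels)
```

With these made-up numbers, the per-frame labels drop to "not dog" twice, while the version that carries belief forward keeps calling the object a dog through the two weak frames.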
Back to the dog — in the first frame, the computer processes and understands that this is a dog, an object that can move under its own power in certain potential ways. In frame 2, the dog is still there but it has moved and changed position. In the third frame, the computer realizes that the dog is taking a step forward and towards the front right of the car, and the computer is able to understand that this object can move rapidly and can intersect what the doctor calls the “ego car,” which is a name for the car containing the computer. “Thus, the computer knows now that it needs to take immediate steps to avoid this object which can move on its own or, at the very least, it needs to brake quickly to avoid collision with the object.”
In the case of a trash bag, the computer would treat this differently from the dog, because it has semantic knowledge of what those two objects are. The computer would know that a trash bag is an object that doesn’t move on its own, and if it is in the way, it would use different calculations to avoid hitting it.
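The dog versus trash bag comparison can be sketched as a constant-velocity projection in Python. Everything here is an illustrative assumption rather than Tesla's implementation: the `Track` class, the `SELF_PROPELLED` set, the safety radii, the straight-line ego path, and the positions are all hypothetical.

```python
import math
from dataclasses import dataclass

FRAME_DT = 1 / 30            # seconds between frames (the article's "thirtieth of a second")
SELF_PROPELLED = {"dog"}     # assumed set of labels that can move on their own


@dataclass
class Track:
    label: str
    prev_x: float  # position one frame ago, in metres
    prev_y: float
    x: float       # current position, in metres
    y: float


def estimate_velocity(track):
    """Velocity in m/s from the last two frames."""
    return ((track.x - track.prev_x) / FRAME_DT,
            (track.y - track.prev_y) / FRAME_DT)


def time_to_conflict(track, ego_speed, horizon_s=3.0, step_s=0.1):
    """Project the object and the ego car forward; return the first time (s)
    they come within the safety radius, or None if they never do.
    The ego car starts at (0, 0) and drives straight along +y."""
    # Assumed semantic prior: give self-moving objects a wider margin.
    radius = 3.0 if track.label in SELF_PROPELLED else 1.5
    vx, vy = estimate_velocity(track)
    for i in range(int(round(horizon_s / step_s)) + 1):
        t = i * step_s
        obj_x, obj_y = track.x + vx * t, track.y + vy * t
        if math.hypot(obj_x, obj_y - ego_speed * t) <= radius:
            return t
    return None


# Both objects sit 4 m to the right and 20 m ahead; only the dog is moving
# (2 m/s toward the lane), so its previous position is slightly farther right.
dog = Track("dog", prev_x=4.0 + 2.0 * FRAME_DT, prev_y=20.0, x=4.0, y=20.0)
bag = Track("trash bag", prev_x=4.0, prev_y=20.0, x=4.0, y=20.0)

dog_t = time_to_conflict(dog, ego_speed=10.0)
bag_t = time_to_conflict(bag, ego_speed=10.0)
print("dog conflict at (s):", dog_t)
print("trash bag conflict at (s):", bag_t)
```

In this toy setup the dog's projected path brings it within the margin of the ego car at about 1.8 seconds, while the stationary trash bag never comes close, so the two objects call for different reactions even though they start in the same spot.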
“This type of trajectory projection is what humans do extremely well — at least, if we’re paying attention when we’re driving. Distracted driving is a whole other thing, but computers have been just really, really unable to do this before, which is why they’re really not very good at self-driving. That’s why you have companies like Waymo and GM Cruise and so forth that are mapping out exact environments around them with LiDAR and they can follow those things.” He compared such systems to following the track on a roller coaster. They can detect objects in the way and brake if they need to, but they don’t have the understanding that Tesla’s new FSD software has.
Not only are there a variety of objects in the world that may come into the path of the car, but the computer has to figure out whether these objects are moving, staying still, how dangerous they are, and then be able to react if, for example, a log falls off the back of a truck. Mixed in with this, the computer also needs to determine if things are far enough away or not — and, if not, how it’s going to react. A great example of that is me crossing in the crosswalk. If the driver isn’t paying attention and accidentally has their foot on the gas instead of the brake, they could hit me — this has happened on several occasions. With FSD, the computer would notice that their light is red and probably have the car at a full stop before I get into the crosswalk.
The video that Dr. Know-it-all Knows it all shared was very informative and provided a bit of insight into this world of FSD, neural networks, and artificial intelligence. You can watch the full video here.