Autonomous driving systems comprise environmental perception and localization, behavior prediction, and planning and control. As an autonomous driving perception algorithm engineer, I will focus only on the knowledge required for environmental perception.
This article focuses on the task of environmental perception in autonomous driving, covering the origins, current status, and latest trends of perception technology. Because most perception work centers on algorithm design, this column also takes algorithms, especially deep learning algorithms, as its main thread. It also touches on data acquisition, system testing, algorithm deployment, and analysis of mass-production perception systems.
"Environmental perception in autonomous driving" contains two concepts: autonomous driving and environmental perception. First of all, what is autonomous driving? Here is how Wikipedia defines a self-driving car.
A self-driving car, also known as a driverless car, computer-driven car, unmanned vehicle, or autonomous car, is a vehicle that requires only limited driver assistance or no human control at all. As an automated vehicle, a self-driving car can sense its environment and navigate without human operation.
There are several key phrases in the above definition. The first is "car": the self-driving technology discussed here concerns cars, not vehicles such as airplanes or trains. The second is "sense its environment and navigate": a self-driving car can independently collect and understand information about its surroundings, then make decisions and drive toward a set destination. The last is "requires only limited driver assistance or no human control at all", which touches on the classification of autonomous driving systems, a very important concept that is worth expanding on below.
Autonomous driving technology is not a qualitative leap from 0 to 1, but a process of gradual iteration. The most commonly used classification of autonomous driving systems is the one developed by SAE International (the Society of Automotive Engineers). Standards set by other bodies differ slightly, but the basic concepts are consistent. The following table summarizes the six levels from L0 (manual driving) to L5 (fully autonomous driving).

These definitions may seem abstract, but they become easy to grasp when mapped to functions found on real vehicles. For example, the anti-lock braking system (ABS) and electronic stability program (ESP) that are now standard in automobiles fall under L1. Cruise control, adaptive cruise control (ACC), and lane keeping assist (LKA) are also L1 features, because each of them controls the vehicle in only one direction (lateral or longitudinal). If ACC and LKA are combined, the car reaches L2. For systems at L2 and below, the driver must monitor the surrounding environment and be ready to take over at any time. This is critical; the main reason many L2-class vehicles get into traffic accidents is that drivers expect too much from the system and do not keep watching their surroundings while driving.

If a vehicle is equipped with some kind of Pilot system, such as a Traffic Jam Pilot, it reaches L3. This means that in certain specific scenarios (such as highways or traffic jams), the driver no longer needs to monitor road conditions at all times; they can take their hands off the wheel, feet off the pedals, and eyes off the road, and only need to take over when prompted by the system. Within this limited scope, the driver has effectively become a passenger.

L4-level systems currently exist only in demonstration vehicles. Reports such as "manufacturer X's vehicle achieved XX hours of autonomous driving on road Y without manual takeover" belong to the L4 category. The biggest difference from L3 is that no manual takeover is needed: within the limited scenario, the vehicle drives completely autonomously. L5 removes the "limited scenario" condition. The defining feature of an L5 vehicle is that there is no steering wheel; everyone on board is a passenger, and all control of the vehicle belongs to the system.
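To summarize the levels just described, here is a small, purely illustrative Python mapping; it paraphrases the description above rather than the official SAE J3016 text, and the field names are my own.

```python
# Illustrative summary of the SAE levels as described above; simplified,
# not the official J3016 wording.
SAE_LEVELS = {
    0: {"name": "No automation",          "system_controls": "nothing (driver does everything)",
        "driver_monitors_environment": True,  "takeover_expected": True},
    1: {"name": "Driver assistance",      "system_controls": "lateral OR longitudinal (e.g. ACC or LKA)",
        "driver_monitors_environment": True,  "takeover_expected": True},
    2: {"name": "Partial automation",     "system_controls": "lateral AND longitudinal (e.g. ACC + LKA)",
        "driver_monitors_environment": True,  "takeover_expected": True},
    3: {"name": "Conditional automation", "system_controls": "everything in limited scenarios (e.g. Traffic Jam Pilot)",
        "driver_monitors_environment": False, "takeover_expected": True},   # take over when prompted
    4: {"name": "High automation",        "system_controls": "everything in limited scenarios",
        "driver_monitors_environment": False, "takeover_expected": False},
    5: {"name": "Full automation",        "system_controls": "everything, everywhere (no steering wheel needed)",
        "driver_monitors_environment": False, "takeover_expected": False},
}
```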
After understanding what autonomous driving is, let's look at how an autonomous driving system is realized. In general, it contains three main modules: perception, decision-making, and control. Roughly speaking, these correspond to the eyes, brain, and limbs of a biological system. The perception system (eyes) is responsible for understanding information about surrounding obstacles and the road; the decision-making system (brain) determines the next action to perform based on the environment and the set goal; and the control system (limbs) executes these actions, such as steering, acceleration, and braking. Going one level deeper, the perception system covers two tasks: environmental perception and vehicle localization. Environmental perception is responsible for detecting various moving and stationary obstacles (such as vehicles, pedestrians, and buildings) and for collecting various kinds of information on the road (such as drivable areas, lane lines, traffic signs, and traffic lights); the main sensors used here are cameras, lidar, and millimeter-wave radar. Vehicle localization builds on the information obtained from environmental perception to determine the vehicle's position in the environment, which requires high-definition maps as well as assistance from an inertial measurement unit (IMU) and the Global Positioning System (GPS).
This column focuses on environmental perception systems and on the three main sensors, namely cameras, lidar, and millimeter-wave radar, as well as their fusion. Different sensors have different characteristics, each with advantages and disadvantages, and so each suits different tasks. The camera is the most commonly used sensor in a perception system; its advantage is that it captures rich texture and color information, which makes it well suited to classifying targets. Its disadvantages are weak distance perception and strong dependence on lighting conditions. Lidar makes up for the camera's shortcomings to a certain extent: it can accurately measure the distance and shape of objects, so it is well suited to detecting and ranging targets at short and medium range. Its disadvantages are high cost, difficulty in mass production, limited perception range, and sensitivity to weather. Millimeter-wave radar works in all weather conditions, measures the speed and distance of targets fairly accurately, has a long perception range, and is relatively cheap, so it suits low-cost perception systems or serves as a complement to other sensors. Its disadvantages are low resolution in elevation and in the lateral direction, and limited ability to perceive stationary objects.
A variety of sensors in an environmental perception system
Technical overview
As mentioned in the previous section, the hardware basis of an environmental perception system is multiple sensors and their combinations, while the core of the software side is the perception algorithm. In general, perception algorithms accomplish two main tasks: object detection and semantic segmentation. The former obtains information about important targets in the scene, including position, size, and speed, which is a sparse representation; the latter obtains semantic information about every location in the scene, such as drivable area or obstacle, which is a dense representation. The combination of these two tasks is known as panoptic segmentation, a concept that has recently emerged in the fields of autonomous driving and robotics. For countable object targets (e.g. vehicles and pedestrians), panoptic segmentation outputs the segmentation mask, category, and instance ID; for non-object targets (e.g. roads and buildings), only the segmentation mask and category are output. The ultimate goal of the environmental perception system is to obtain a panoptic segmentation of the three-dimensional space around the vehicle. Of course, different levels of autonomous driving and different application scenarios require different perception outputs.
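To make the distinction concrete, here is a minimal, hypothetical sketch of what these outputs might look like as data structures; the field names are illustrative and do not correspond to any standard interface.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class DetectedObject:
    """Sparse output of object detection: one entry per important target."""
    position: np.ndarray        # (x, y, z) center in vehicle coordinates, meters
    size: np.ndarray            # (length, width, height), meters
    velocity: np.ndarray        # (vx, vy), m/s
    category: str               # e.g. "vehicle", "pedestrian"

# Dense output of semantic segmentation: one class ID per cell of a grid
# around the vehicle (or per pixel of an image).
semantic_map = np.zeros((500, 500), dtype=np.int32)   # 0 = drivable area, 1 = obstacle, ...

@dataclass
class PanopticSegment:
    """Panoptic segmentation combines both: mask + category, plus an instance
    ID for countable objects ("things"); None for background classes ("stuff")."""
    mask: np.ndarray            # boolean mask over the grid or image
    category: str               # e.g. "vehicle", "road", "building"
    instance_id: Optional[int]  # e.g. 3 for the third vehicle, None for "road"
```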
The explosion of this round of self-driving technology is largely due to the breakthroughs deep learning made in computer vision, and those breakthroughs began with image classification and object detection in images. In autonomous driving perception, the first task where deep learning found application was object detection in a single two-dimensional image. Classic algorithms in this field, such as Faster R-CNN, YOLO, and CenterNet, have been the mainstream visual perception algorithms of their respective periods. However, a vehicle cannot rely solely on detection results in a two-dimensional image. To meet the needs of autonomous driving applications, these basic algorithms must be extended further, most importantly by incorporating temporal information and three-dimensional information. The former leads to object tracking algorithms, and the latter to monocular, stereo, and multi-camera three-dimensional object detection algorithms. Analogously, the semantic segmentation side includes image semantic segmentation, video semantic segmentation, and dense depth estimation.
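As a minimal illustration of 2D object detection on a single image, the sketch below runs a pretrained Faster R-CNN from torchvision (assuming torchvision 0.13 or later for the weights argument); it only shows the input/output format, not a production perception stack.

```python
import torch
import torchvision

# Load a detector pretrained on COCO; weights="DEFAULT" assumes torchvision >= 0.13.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy RGB image with values in [0, 1]; in practice this would be a camera frame.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]       # the model takes a list of images

# Keep only confident detections; each box is (x1, y1, x2, y2) in pixels.
keep = predictions["scores"] > 0.5
boxes = predictions["boxes"][keep]
labels = predictions["labels"][keep]      # COCO class indices (e.g. 1 = person, 3 = car)
```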
To obtain more accurate three-dimensional information, lidar has also become an important part of autonomous driving perception systems, especially for L3/L4 applications. Lidar data is a relatively sparse point cloud, which differs greatly from the dense grid structure of an image, so algorithms common in the image domain need to be modified before they can be applied to point cloud data. Point cloud perception can likewise be divided into object detection, which outputs three-dimensional bounding boxes of objects, and semantic segmentation, which outputs the semantic category of each point in the cloud. To take advantage of algorithms from the image domain, a point cloud can be converted into a dense grid structure under a bird's-eye view (BEV) or a range view. Alternatively, the convolutional neural networks (CNNs) used in deep learning can be replaced or adapted with architectures suited to sparse point cloud structures, such as PointNet or graph neural networks.
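As a simple example of turning a sparse point cloud into a dense grid that image-style CNNs can consume, the following hypothetical sketch builds a bird's-eye-view occupancy grid with NumPy; real pipelines usually also encode height, intensity, and point density per cell.

```python
import numpy as np

def pointcloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), resolution=0.1):
    """Project an (N, 3) lidar point cloud into a 2D bird's-eye-view occupancy grid."""
    x, y = points[:, 0], points[:, 1]
    # Keep only points inside the region of interest in front of / beside the vehicle.
    mask = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y = x[mask], y[mask]
    # Discretize metric coordinates into grid cells.
    height = int((x_range[1] - x_range[0]) / resolution)
    width = int((y_range[1] - y_range[0]) / resolution)
    rows = ((x - x_range[0]) / resolution).astype(np.int32)
    cols = ((y - y_range[0]) / resolution).astype(np.int32)
    bev = np.zeros((height, width), dtype=np.float32)
    bev[rows, cols] = 1.0                 # mark occupied cells
    return bev

# Example: 10,000 random points standing in for a lidar sweep.
bev_grid = pointcloud_to_bev(np.random.uniform(-50, 80, size=(10000, 3)))
```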
Millimeter-wave radar is also widely used in autonomous driving perception systems thanks to its all-weather operation, accurate speed measurement, and low cost, but it generally appears in L2-level systems or as an aid to other sensors in L3/L4-level systems. Millimeter-wave radar data is also usually a point cloud, but one that is sparser and has lower spatial resolution than a lidar point cloud. Because the data density is so low compared with cameras and lidar, traditional methods (such as clustering and Kalman filtering) perform not much worse than deep learning, and they require relatively little computation. In recent years, researchers have begun to work with lower-level radar data, replacing classical radar signal processing with deep learning and achieving, through end-to-end learning, a perception effect closer to that of lidar.
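As a sketch of the traditional approach mentioned above, the following clusters sparse radar detections into object candidates with DBSCAN from scikit-learn; a Kalman filter would then track each cluster over time. The feature choice and parameters are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_radar_points(points_xy, velocities, eps=2.0, min_samples=2):
    """Group sparse radar detections into object candidates.
    points_xy: (N, 2) positions in meters; velocities: (N,) radial velocities in m/s."""
    # Cluster on position plus radial velocity so that nearby points moving
    # together end up in the same object candidate (feature scaling omitted).
    features = np.column_stack([points_xy, velocities])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    clusters = []
    for label in set(labels) - {-1}:      # -1 marks noise points
        idx = labels == label
        clusters.append({
            "center": points_xy[idx].mean(axis=0),
            "velocity": float(velocities[idx].mean()),
            "num_points": int(idx.sum()),
        })
    return clusters
```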
The perception of any single sensor is always limited, so if system cost is set aside, multi-sensor fusion is naturally the better choice. In general, the camera is an indispensable sensor for the perception system; to obtain depth information and a 360-degree field of view, a stereo or multi-camera fusion scheme can be used. To obtain three-dimensional and motion information more accurately, cameras can also be fused with lidar and millimeter-wave radar. These sensors have different coordinate systems, different data formats, and even different acquisition frequencies, so designing the fusion algorithm is not a simple task. Roughly speaking, fusion can take place at the decision layer (fusing the outputs of different sensors) or at the data layer (fusing raw data or intermediate results from different sensors). Data-layer fusion is theoretically the better approach, but it places higher demands on the spatial and temporal alignment between sensors.
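Spatial alignment usually relies on a calibrated rigid transform between sensor frames. Below is a minimal sketch, under the assumption of a known 4x4 lidar-to-camera extrinsic matrix and a 3x3 camera intrinsic matrix, of projecting lidar points onto the image plane; time alignment, lens distortion, and motion compensation are ignored.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Spatially align lidar points with a camera image.
    points_lidar: (N, 3) points in the lidar frame; T_cam_from_lidar: 4x4 extrinsic
    transform (lidar frame -> camera frame); K: 3x3 camera intrinsic matrix."""
    # Transform to the camera frame using homogeneous coordinates.
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1            # keep points in front of the camera
    pts_cam = pts_cam[in_front]
    # Perspective projection onto the image plane.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    return uv, pts_cam[:, 2]                  # pixel coordinates and depths
```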
The above roughly covers the algorithmic side of environmental perception; other algorithm topics, such as multi-camera fusion and the spatial and temporal alignment of multiple sensors, will also be introduced later.
In addition to the core algorithm design, other important parts of the perception system include data acquisition and annotation, algorithm testing and iteration, and system deployment, which will be analyzed later in the column.
The current state of the industry
Now that we understand the technologies included in the perception system, let's take a look at the current state of application of these sensors in production or demonstration vehicles.
Roughly speaking, self-driving companies fall into two main categories. One category consists of traditional automakers (abroad, Volkswagen, BMW, GM, Toyota, etc.; domestically, Great Wall, Geely, etc.), new energy vehicle companies (such as Tesla, NIO, and XPeng), and Tier 1 suppliers (established players such as Bosch, Continental, and Aptiv abroad, as well as domestic newcomers such as Huawei and DJI). The primary goal of these companies is mass production; their solutions are generally L2-level and are currently expanding toward L3. The other category consists of solution providers and startups (such as Waymo, Mobileye, Pony.ai, Momenta, and TuSimple), which are working on L4-level autonomous driving technology for applications such as Robotaxi, Robotruck, and Robobus.
Different levels of autonomous driving and different application scenarios call for different sensor configurations. For L2-level applications such as emergency braking and adaptive cruise, a forward-looking monocular camera or a forward millimeter-wave radar is sufficient. If lane change assist is required, additional sensors must be added to perceive adjacent lanes; a common solution is to add several corner radars to the front and rear of the car to achieve 360-degree target detection. For L3-level applications, the vehicle must drive fully autonomously in specific scenarios, so its perception of the surrounding environment needs to be extended. This typically means adding lidar, side-view and rear-view cameras, and additional millimeter-wave radars, as well as GPS, an IMU, and high-definition maps to assist vehicle localization (see the illustrative configurations sketched below). From L4 upward, since no manual takeover is expected within the designated scenarios, the sensors need not only high accuracy but also high reliability, which requires sensor redundancy, in other words a backup system.
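The sketch below summarizes the kinds of sensor suites described above as simple Python dictionaries; the sensor counts and placements are illustrative assumptions, not any specific manufacturer's configuration.

```python
# Illustrative sensor suites; counts and placements are examples only.
L2_SENSOR_SUITE = {
    "front_camera": 1,        # emergency braking, ACC, lane keeping
    "front_radar": 1,         # long-range millimeter-wave radar
    "corner_radars": 4,       # lane-change assist, 360-degree target detection
}

L3_SENSOR_SUITE = {
    **L2_SENSOR_SUITE,
    "side_rear_cameras": 4,   # extended field of view around the vehicle
    "lidar": 1,               # accurate range and shape within the limited scenario
    "gps_imu": 1,             # with a high-definition map for localization
}
```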
Let's look at a few specific cases.
The first is Tesla's recently launched pure vision solution. Although Tesla is the first name that comes to mind when autonomous driving is mentioned, Tesla's Autopilot is actually only an L2-level (or "advanced L2") system, because it still requires the driver to be ready to take over the vehicle at any time. Comparing horizontally within L2-level systems, though, Tesla's solution is still very competitive. The system uses only vision sensors: cameras installed at different positions on the body, with multiple focal lengths and fields of view. Together they cover a 360-degree field of view with some redundancy. Tesla's deep-learning-based multi-camera fusion algorithm, demonstrated on AI Day, is well worth studying, and a follow-up article will analyze it in detail.
Tesla's pure vision sensor configuration (L2 level)
In the summer of 2017, Audi released the fourth-generation A8, whose biggest highlight was the Traffic Jam Pilot (TJP) system. As mentioned earlier, a TJP system already belongs to the L3 category, so the Audi A8 can be called the world's first "mass-production" L3-level system. Why the quotation marks? Because the function was never enabled in delivered vehicles; users could only experience it in Audi's own demo cars. Audi's official explanation was a regulatory issue, but the core reason is in fact technical: the so-called "takeover paradox" of L3. In structured-road traffic jams below 60 km/h, the TJP system allows the driver to look down at a phone or even doze off; if something unexpected happens, the driver may not be able to take over in time. Although Audi canceled its L3 autonomous driving project at the end of 2019, the exploration provided valuable experience for the subsequent development of L4 and various advanced L2 systems. Without going into further detail, let's look at the sensor solution in this system. The Audi A8 has a total of 12 ultrasonic sensors, 4 surround-view cameras, 1 front camera, 4 medium-range radars, 1 long-range radar, and 1 infrared camera. In addition, the Audi A8 was the first car equipped with a 4-line automotive-grade lidar, along with a central driver assistance controller (zFAS); both are essential building blocks of an L3-level autonomous driving system.
Sensor configuration for the Audi A8 (L3 level)
From L2 to L3 to L4, the biggest change in the sensor suite is the addition of lidar, in gradually increasing numbers. In Waymo's sensor solution, for example, besides the forward-facing lidar, 360-degree lidars have been added at the rear and on the roof. Moreover, the number of lidar beams has increased significantly, reaching a perception range of about 300 meters. Not only Waymo: L4 systems from other companies also invariably include one or more lidars. Judging from current trends, reaching L4-level autonomous driving relies mainly on adding sensors to greatly improve perception of road conditions and the environment, and the most important of these sensors is lidar. At L4, the vehicle drives completely autonomously within its limited scenarios, where 99% accuracy is no longer enough and something like 99.99999% is required; lidar is what secures those extra decimal places. That guarantee comes from the coordination between lidar and the other sensors, not from simply stacking hardware, so efficient and accurate sensor fusion plays a crucial role in L4-level systems.
Waymo's sensor configuration (L4 level)
-- END --