On February 23rd, the second lecture of the "Horizon AI Chip Technology Special Session" hosted by Horizon on the Zhidong Open Class was successfully completed. It was delivered by Ling Kun, Senior R&D Director at Horizon, under the theme "A Good Autonomous Driving AI Chip Is an 'Easy-to-Use' Chip".
Starting from the software 2.0 development paradigm and drawing on Horizon's practice in autonomous driving AI chip development, Ling Kun gave an in-depth explanation of how to build an "easy-to-use" autonomous driving AI chip, covering the balance between hardware-software co-design and hardware-software decoupling, AI chip design principles, the AIDI AI development platform as software 2.0 infrastructure, the Tiangong Kaiwu toolchain, and a rich software stack.
First of all, welcome everyone to this course, and thank you to the Zhidong Open Class for providing the platform that gives us the opportunity to exchange ideas. In the last lecture, Dr. Luo Heng of Horizon focused on what a good autonomous driving chip should look like; today's topic is how to turn a good autonomous driving chip into an easy-to-use chip.
My name is Ling Kun, Senior R&D Director at Horizon. I graduated from the Institute of Computing Technology of the Chinese Academy of Sciences and have focused for more than ten years on the joint optimization and implementation of compilers and instruction set architectures for CPUs, DSPs, and DSAs. I joined Horizon in 2016 and am responsible for the R&D teams behind the Tiangong Kaiwu open toolchain and the AIDI AI development platform. I previously served as head of Horizon's compiler R&D department and have taken part in instruction set architecture definition, compiler and toolchain R&D, productization, marketing, and mass-production work.
This course is divided into the following three parts:
1. Focus on hardware-software decoupling on the premise of hardware-software co-design
2. The key to ease of use: improving product R&D efficiency
3. Empowering developers with software 2.0 infrastructure, a toolchain, an open software stack, and rich samples
For many years we did algorithm and software development in the traditional sense. Under that system, the programmer must first understand what the problem is and how to solve it, write the code on that basis, and then run the code to see whether it is correct. As chip performance keeps rising and storage and model capacity keep growing, machine learning methods have helped us solve many practical problems, and we have entered the software 2.0 era. As Moore's Law continues to evolve, more and more R&D work will be carried out on a software 2.0 basis, and we must be ready for the software 2.0 era of development.
Looking back at the development paradigm of software 2.0, it is completely different from software 1.0. In software 1.0, when we want to solve a problem, the developer must first define the problem very clearly, then decompose it into concrete steps and think through the solution to each step, and then write the code; after testing, the code is integrated to see whether it solves the actual problem. If it does not, we work backwards: was the problem not clearly defined? Not properly decomposed? Is the solution incorrect? Or is the code simply not well written? Repeated inspection, debugging, and verification form the closed-loop iterative process of software 1.0 development. For example, when controlling a car moving forward: if the road speed limit is not exceeded and there is no obstacle ahead, the vehicle can accelerate, and up to what speed it may accelerate is also specified. This is a typical if-then-else problem, so programs in the software 1.0 era revolve around concepts such as if-then-else, for loops, and functions.
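To make the contrast concrete, here is a minimal sketch of that software 1.0 style in Python; the speed values and rules are invented purely for illustration and are not taken from any real controller.

```python
# A toy illustration of software 1.0: every rule is written out by the programmer
# as explicit if-then-else logic.
def next_speed(current_speed, speed_limit, obstacle_ahead):
    if obstacle_ahead:
        return max(current_speed - 10, 0)           # brake toward a stop
    elif current_speed < speed_limit:
        return min(current_speed + 5, speed_limit)  # accelerate, but never past the limit
    else:
        return current_speed                        # hold speed

print(next_speed(current_speed=60, speed_limit=80, obstacle_ahead=False))  # 65
```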
In the software 2.0 era, the development model is completely different. You still need to define the problem first, but you also need a large amount of data that represents the different situations involved; then you design an appropriate model that can classify or detect; then you train the model on a large amount of labeled data; after training you deploy and integrate it, and observe in the target scenario how many results are correct and how many are wrong; then you keep collecting data, labeling, and retraining, or change the model design, to resolve these bad cases.
In this process, no programmer can think through the solution explicitly. For example, to recognize a cat, a programmer does not reason pixel by pixel about whether the cat's fur is curly or straight, whether its coat is patterned or solid, or whether its ears stand up or bend. These are data-driven problems solved by training a model on one million or ten million different images. In this mode, the programmer does not need a deep understanding of how the problem is solved step by step, nor of what the computer does at each step; he only needs to pay attention to the capacity of the convolutional neural network and the information inside it, how gradients propagate during backpropagation, and how the activation functions should be set.
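As a rough sketch of that data-driven loop, the fragment below trains a tiny convolutional classifier with PyTorch; the random tensors stand in for a labeled image set, and the network shape is invented for illustration only.

```python
# A minimal sketch of the software 2.0 workflow: labeled data in, a trained model out.
import torch
import torch.nn as nn

images = torch.randn(256, 3, 32, 32)        # stand-in for labeled pictures
labels = torch.randint(0, 2, (256,))        # e.g. "cat" vs "not cat"

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)   # how wrong the model currently is
    loss.backward()                         # backpropagation computes the gradients
    optimizer.step()                        # the data, not hand-written rules, adjusts the weights
```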
The software 2.0 approach allows machines to perceive and understand the world around them the way humans do, so it has a wide range of application scenarios, and with the continued progress of Moore's Law there is plenty of room for growth. Making autonomous driving chips easy to use should likewise revolve around the software 2.0 development paradigm: the 1.0 era has accumulated forty or fifty years of experience and its tools are already very mature, so what happens on top of it is mostly incremental innovation, whereas the 2.0 era is a disruptive innovation at the level of the underlying methodology.
Under this development paradigm, software 2.0 technology allows a machine to perceive the world around it, know where it is in that world, predict the trajectories of the many autonomously moving targets around it, plan whether to go around a target, drive straight ahead, or stop, and then control itself to complete the action.
1. Focus on hardware-software decoupling on the premise of hardware-software co-design
First, let's look back at history. One very important observation is that "the application's pursuit of performance is endless." Because of this, chips keep moving forward generation after generation: since 1970, a wide variety of chips and computing devices have emerged in an endless stream, creating the PC era and the eras that followed.
The blue chart on the left of the figure above is a 2021 result from Communications of the ACM, which shows that the improvement in microprocessor performance has gradually slowed over the past 20 years. At the same time, from the supplier's perspective, the yellow chart shows that the performance gained per unit of R&D investment in microprocessors is also shrinking, so the R&D input-output ratio of general-purpose processors will keep falling. More and more companies are therefore putting their effort into multicore and heterogeneous accelerators, as the chart on the right shows.
Since the pursuit of performance never ends, the number of transistors on a single chip keeps growing exponentially, yet single-thread performance began to plateau around 2010. Physical limits mean the clock frequency no longer rises, and with changes in process technology, the power consumption of the whole chip has also stagnated. Meanwhile a single chip now contains many logical processor cores; because single-thread performance no longer improves significantly, we cannot get good results from a single-threaded general-purpose processor and can only parallelize the code or introduce heterogeneous accelerators to optimize performance.
While single-thread performance is growing slowly or even stagnating, there is still a lot of room for optimization at the software and algorithm level. The upper part of the figure above is the well-known matrix multiplication example. Take a matrix multiplication implemented in Python as the baseline speed of 1: rewriting the same code in Java or C already yields an 11x or even 47x improvement. That is purely a change of programming language, independent of the chip architecture. Then loop parallelism spreads the work across the chip's cores; blocking splits the matrix into tiles that fit in the cache; automatic vectorization uses the data-parallel SIMD instructions the chip already provides; and writing AVX intrinsics directly in the code with wide AVX vectors finally yields a speedup of more than 60,000x. So when we mine more hardware features around the characteristics of the software and the algorithm, this hardware-software co-design yields a very large performance improvement and a very cost-effective computing platform.
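The gap can be felt even without leaving Python: the sketch below compares a naive triple loop with the same multiplication dispatched to NumPy's optimized, vectorized BLAS routine. The exact ratio depends on the machine; the figures quoted in the lecture come from the cited study, not from this snippet.

```python
# Naive interpreted loops vs. an optimized BLAS call for the same matrix multiply.
import time
import numpy as np

n = 256
a, b = np.random.rand(n, n), np.random.rand(n, n)

def matmul_naive(x, y):
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i, j] += x[i, k] * y[k, j]
    return c

t0 = time.time(); matmul_naive(a, b); t1 = time.time()
t2 = time.time(); a @ b; t3 = time.time()
print(f"naive loops: {t1 - t0:.2f}s, vectorized BLAS: {t3 - t2:.5f}s")
```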
As a side note, there was an interesting piece of news recently: Intel dropped AVX-512 support in some chips in order to guarantee software compatibility across its new hybrid big-core/little-core architecture.
Let's look at how hardware-software combination and decoupling have traditionally been handled across the technology stack. Standard C and C++ code has nothing to do with any particular chip and thus achieves hardware-software decoupling. How? Take the LLVM compiler as an example: it has a front end, a middle end, and a back end. The front end and middle end perform a large amount of code analysis, optimization, and transformation that is independent of the chip architecture. The back end contains the chip-specific parts, such as the ARM back end and the RISC-V back end, through which the compiler turns the code into executables that can be deployed on ARM or RISC-V chips.
In contrast to the Intel AVX-512 news, the latest ARM processor architectures introduce the SVE vector processing unit as a replacement for NEON. NEON is a SIMD extension of the ARM architecture, similar to AVX in being single-instruction multiple-data, but its width is fixed at 128 bits. SVE can be wider, supporting 128-bit, 256-bit, and 512-bit widths, and in performance evaluations it achieves speedups of up to 3.5x. The SVE instruction set does not fix the vector width; the width is set by the specific chip implementation, and the instructions are written in a vector-length-agnostic way, so the same binary code can run on hardware with 128-bit, 256-bit, or 512-bit vectors, and the Intel AVX-512 situation does not arise. In terms of data parallelism, ARM's thinking on binary compatibility is therefore relatively advanced: by trying to avoid the problem and maintain binary compatibility, it lets the underlying transistors serve the software above them while preserving good hardware-software decoupling.
We have just talked about C and C++. When it comes to the AI era proper, we have to mention GPUs, that is, CUDA. The GPU's massively parallel single-instruction multiple-thread architecture makes very complex data-parallel operations possible, accelerating the tensor computations of the upper layers and delivering better AI performance.
CUDA was proposed by NVIDIA. Code is compiled by the NVCC compiler into PTX, each generation of GPU has its own PTX-to-GPU compiler, and the result runs through the driver on an Ampere-architecture or Turing-architecture GPU. Over many years NVIDIA has accumulated a great deal here and built a fairly strong market dominance.
In the AI era NVIDIA is essentially playing the lead role. Although AMD's market value has risen sharply recently, AMD has been in a passive position with regard to AI. AMD's recent big move is the ROCm compiler stack; although this is not stated explicitly, it can be read as an effort at better compatibility with the CUDA ecosystem: CUDA code is first converted into HIP code by a converter, then compiled by the ROCm compiler, and finally run on AMD GPUs.
The right part of the figure above is drawn with a dotted line because, when deciding whether to buy an NVIDIA GPU or an AMD GPU for AI acceleration, most developers choose NVIDIA: they do not have to worry about compiler or runtime problems, or undetected bugs that would badly hurt productivity. That is why the right side is dotted. The path exists, but for better or worse many users and developers have voted with their feet, and this is a typical example of how hard the trade-off between hardware-software combination and decoupling can be. On NVIDIA's GPUs, PTX code achieves good hardware-software decoupling, while the co-design shows up in how deeply the NVCC compiler and CUDA mine and exploit the chip architecture. All of this is still software 1.0.
Turning to software 2.0, its development process is roughly divided into the following stages: first train a model; then quantize it, because quantization improves the efficiency and performance obtained from the chip; then check whether the accuracy is up to standard; then compile the model for the chip platform and run it. For these steps, most chip manufacturers, Horizon included, can achieve fairly good hardware-software decoupling. This form of decoupling ensures that code developers wrote in the past still runs well on the platform, and it also provides a degree of supply-chain security.
Looking back at history again, hardware and software design for AI computing also needs a relatively complete engineering framework behind it. Performance comes first, and on that basis we consider how to decouple software from hardware. In Horizon's practice this is approached at two levels: hardware design mainly concerns storage, the organization of tensor computation, and instruction set design, while the software side covers computation analysis and parallel optimization, data parallelism and dependency analysis, on-chip storage management, and instruction scheduling. The core goal of all of them is to maximize the utilization of hardware resources.
More transistors are needed to do more and keep the chip competitive in the future, but what exactly should those transistors be used for? Ensuring these resources are put to best use and provide enough AI computing power for the upper-layer AI algorithms and applications requires a complete hardware-software engineering iteration framework.
Horizon has a BPU architecture modeling tool that models power consumption, performance, and area, with instruction sequences as input; the modeling tool feeds hardware configuration information and instruction performance information to a model performance analysis tool. The model performance analysis tool produces analysis results on performance and accuracy and in turn provides input for BPU architecture modeling. BPU architecture modeling explores the future chip architecture, while the model performance analysis tool explores what the next generation of compiler, model quantization tool, and training tool should do. They share one very important input: the testing benchmark. If the testing benchmark is poorly chosen, the whole closed loop is distorted, so its selection is very important.
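As a purely illustrative sketch of such a closed loop, the fragment below scores invented hardware configurations against an invented benchmark with a crude roofline-style estimate; none of the numbers or names correspond to Horizon's actual BPU modeling tool.

```python
# Toy design-space exploration: each candidate configuration is evaluated on every
# benchmark workload, and latency is bounded by whichever resource saturates first.
benchmark = [                      # (name, compute in MACs, DDR traffic in bytes), invented
    ("resnet_like", 4e9, 5e7),
    ("transformer_like", 8e9, 3e8),
]
candidates = [                     # (name, MACs per cycle, DDR bytes per cycle), invented
    ("config_A", 2048, 32),
    ("config_B", 4096, 16),
]

for name, macs_per_cycle, bytes_per_cycle in candidates:
    cycles = sum(max(macs / macs_per_cycle, ddr / bytes_per_cycle)
                 for _, macs, ddr in benchmark)
    print(f"{name}: ~{cycles:.3e} cycles over the benchmark")
```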
The selection of the testing benchmark must capture the trend of algorithm evolution, because the benchmark contains a rich set of algorithm models representing where algorithms are heading; with the benchmark and the related transformations, the balance between hardware-software co-design and decoupling can be struck better. For example, Horizon's Journey 2 and Journey 3 chips have reached shipments in the millions, and back in 2016 and 2017 the evolution trends of the relevant algorithms were already taken into account.
Horizon has a very strong algorithm and software team that constantly watches, listens to, and practices the actual application of algorithms and their future evolution, providing better input for the testing benchmark.
The next part will be explained in light of Horizon's actual practice, hoping to give you some new inspiration or a different perspective. Horizon's chip architecture design, including the selection of the testing benchmark, targets the key algorithms of important future scenarios, and the architecture iteration must be product-driven: how well the model generalizes in the product, how it actually performs, which targets it recognizes well and which remain problematic, all of these points are mined as much as possible at the product level. The testing benchmark is then formed from the combination of product-driven agile architecture iteration and the key algorithms of important future scenarios.
At the same time, Horizon has a team of world-leading experts who combine decades of accumulation in computing architecture, software, hardware, chips, and algorithms to predict what else can be optimized and innovated at the level of AI computing. Efficiency comes first, but flexibility must also be taken into account, and the work proceeds from the three perspectives of chip architecture, algorithms, and compiler, with many cross-disciplinary collisions of ideas and iterations of engineering practice among them. For example, when we look at an instruction set, we do not just look at the RISC-V instruction set but at what tensor computation looks like in the eyes of the compiler, and from there at how the instruction set, the elastic tensor core, on-chip storage, and the programmable stream processing architecture should be designed. The specific technical points below should give you a general feel for how hardware-software co-design and decoupling can be balanced.
When we talk about hardware-software co-design and decoupling, the final chip needs to maximize developer productivity and let developers build products quickly. Horizon therefore insists on building good automated tools that exploit the chip's characteristics automatically; if a chip characteristic cannot be used automatically, that is a tool problem, a chip architecture design problem, or an algorithm-level problem, and it must be rigorously examined. On this basis, the tools automatically exploit these features, automatically analyze the model and its dependencies, apply transformations, improve performance, and reduce bandwidth.
In compiler optimization, the first step is to split the tensor computation. By splitting the computation over feature maps and convolution kernels, the compiler can describe the computation at a finer granularity, avoid introducing unnecessary dependencies, improve data parallelism, and create potential scheduling opportunities.
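Below is a minimal sketch of the idea, assuming an invented tile height chosen so that one tile fits in on-chip memory; a real convolution tile would also need a small halo of extra rows for the kernel's receptive field, which is omitted here.

```python
# Splitting a feature map into independent row tiles so each slice can be
# loaded, computed, and stored on its own.
import numpy as np

feature_map = np.random.rand(1, 64, 224, 224)   # N, C, H, W
tile_h = 56                                      # assumed to fit the on-chip buffer

tiles = [feature_map[:, :, y:y + tile_h, :]
         for y in range(0, feature_map.shape[2], tile_h)]

# each tile now has no dependence on the others, exposing parallelism and
# giving the scheduler more freedom
print(len(tiles), "tiles of shape", tiles[0].shape)
```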
Then there is instruction scheduling, another very classic compiler optimization, and we have done a lot of work here at the compiler level. The objects being scheduled are tensors, and the big difference from registers is that a tensor is variable in shape: it has different channels, kernel sizes, and convolution kernels. The tensor data therefore has to be modeled, and the software also needs strong instruction pipeline scheduling, that is, software pipelining.
The flow of software pipelining is shown in the upper-left corner of the figure: Load, Conv, Store, then Load, Conv, Store again. Since there is no necessary dependence between two Loads, the loop can be turned into a pipeline as in the lower-left corner of the figure. Each group of blocks is itself a loop body, but the three instructions inside the loop body have no unavoidable dependence on one another, so all three can run simultaneously and flexibly. The lower-right corner shows an actual network executing: the convolution array is essentially completely full, with no gaps; the ddr_load in the middle feeds inputs to the convolution array while other operations run alongside. Overall, very high convolution utilization is achieved.
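The schedule below is a hedged, purely illustrative rendering of that Load/Conv/Store pipeline: once the loop is software-pipelined, tile i's Store, tile i+1's Conv, and tile i+2's Load can occupy the same step because they do not depend on one another.

```python
# Print a toy software-pipeline schedule for a Load -> Conv -> Store loop.
def pipelined_schedule(n_tiles):
    steps = []
    for t in range(n_tiles + 2):                 # two extra steps to drain the pipeline
        steps.append({
            "Load":  f"tile{t}"     if t < n_tiles else "-",
            "Conv":  f"tile{t - 1}" if 0 <= t - 1 < n_tiles else "-",
            "Store": f"tile{t - 2}" if 0 <= t - 2 < n_tiles else "-",
        })
    return steps

for step, stage in enumerate(pipelined_schedule(5)):
    print(step, stage)                           # from step 2 on, all three units are busy
```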
Convolution splitting and instruction scheduling were covered above, but the key question is how to split across so many layers. The first thing that comes to mind is what a C compiler does: it analyzes and compiles within each function and decides how the code inside that function should be optimized as a whole. Similarly, if one inference pass of a convolutional neural network is treated as a function, we should look at how the computation is arranged across the entire execution inside that function.
In Horizon's practice, we use a set of computation fusion techniques to combine operators. As shown on the left of the figure above, operators are fused together as a whole. Because on-chip memory is always very limited and very expensive, data is spilled to DDR only when necessary, freeing up a piece of on-chip space so execution can proceed; the smaller the spilled data, the better, so that the DDR memory bandwidth used by the whole execution stays very small.
The picture in the lower-right corner shows a 720p image fed into a ResNet101 network. The image is first loaded onto the chip, passes through the intermediate operations, and is then stored back to DDR. The intermediate process has only three DDR accesses, which move some data in and out: roughly an 18-layer fused output at the front, a 14-layer fused output, and finally a 3-layer fused output, which minimizes the pressure on DDR bandwidth during the whole inference. At the same time you can also see how the on-chip memory is used and where it can be used more fully, whether through compiler optimization or through more reasonable adjustment of the chip architecture or the algorithm's tensor sizes.
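The fragment below is a toy sketch of that fusion decision, with an assumed on-chip memory budget and invented per-layer activation sizes: layers are fused into one group until an intermediate activation no longer fits, and only then is a DDR round trip inserted.

```python
# Greedy layer fusion under an on-chip memory budget (all numbers invented).
ON_CHIP_BYTES = 2 * 1024 * 1024                  # assumed on-chip SRAM budget

layer_output_bytes = [1.8e6, 1.5e6, 2.5e6, 1.2e6, 0.9e6, 3.0e6, 0.5e6]

groups, current = [], []
for layer, size in enumerate(layer_output_bytes):
    current.append(layer)
    if size > ON_CHIP_BYTES:                     # activation spills, so the fused group ends here
        groups.append(current)
        current = []
if current:
    groups.append(current)

print("fused groups:", groups)
print("DDR round trips between groups:", len(groups) - 1)
```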
The figure above shows all of these effects together at a macro level: global computation fusion, single-layer computation splitting, and dependency analysis with instruction scheduling.
2. The key to ease of use: improving product R&D efficiency
What is "easy to use"? I think ease of use means letting developers keep their minds on the product, use the supporting facilities the chip can provide, and create the product they want most in the fastest way; in other words, improving R&D efficiency.
R&D efficiency is not an easy thing to measure, so let's look at history first. The figure above shows the development of computing technology over the past 100 years, from the earliest mechanical code-breaking machines and plugboard-wired ballistic computing, to commercial computing, office work, games, and cloud services, and on to the mobile communications we are more familiar with, including Android and iOS. Throughout that whole period, it was impossible to bypass the Turing machine and the programming model that controls it in order to make the machine behave as people wanted.
If you have explored several different programming languages, you will see that whether it is C, C++, Java, or Python, the core is still if-else, loops, and jumps; there are differences in purpose, compilation, and computation, but the basics have not changed. In the era of autonomous robots it is completely different: machines can perceive and understand the world around them in a data-driven, differentiable, programmable way. Throughout the process, the application scenarios and the development paradigm keep iterating.
Looking back, the 2.0 era is becoming more and more important, so "easy to use" should revolve around development in the 2.0 era. But this raises a question: what exactly should be done, and by what standard can we measure whether development for the software 2.0 era is well served? I consulted a teacher, the father of C++, Bjarne Stroustrup, who published a paper in 2020 on the development of C++ from 2006 to 2020.
The vertical axis of the figure above is the activity level of the C++ community and the horizontal axis is time. After a period of very rapid growth, C++ declined around 2000, reached an inflection point in 2006, and then, with C++11, C++14, C++17, and C++20, the number of software developers began to grow very strongly again.
Over those 14 years, Bjarne led the C++ standards committee in discussing exactly what should go into the C++ standard so that C++ engineers around the world could do good work with whatever compiler they can get, such as Visual Studio on Windows or GCC, LLVM, or some other commercial compiler on Linux. So what features should be added to C++ to make it better to use, and by what criteria should "good to use" be measured?
He summarized two points: C++ must let applications make good use of the hardware's performance characteristics, while better controlling the complexity of low-level programming. The two goals pull against each other: exploiting the hardware efficiently is a form of hardware-software combination, while effectively reducing complexity is, in a sense, hardware-software decoupling. He also proposed a principle: enable programmers to write good code and create good applications, rather than merely preventing them from making bugs. The one is patching over mistakes; the other is guidance that makes better developers.
From this we are even more convinced that AI chip development also needs to fully release the performance of the hardware while reducing development complexity, letting AI developers build their most important applications on the AI chip platform and turn them into good products for customers and the market. Drawing on those 14 years of C++ history, we believe that such a platform is an "easy-to-use" development platform.
3. Empowering developers with software 2.0 infrastructure, a toolchain, an open software stack, and rich samples
Given the programming model and paradigm of software 2.0, what should be done to build an easy-to-use chip and the software development environment on top of it? Horizon has been practicing this as well, so the following introduces some of our thinking in terms of software 2.0 infrastructure, the toolchain, the open software stack, and rich samples. We believe this may be the road that easy-to-use chips, especially easy-to-use AI chips and autonomous driving chips, must take.
An AI chip needs a toolchain and software 2.0 infrastructure. On the infrastructure side, there must be data annotation and a model training platform that supports algorithm development and training, algorithm evaluation, and an end-to-end data closed loop, so that more data flows back into the infrastructure for data-driven software 2.0 development. Software 1.0 has become very mature after more than 40 years, and on top of it we will keep making incremental innovations.
The other side is model deployment optimization and performance analysis: an AI algorithm put on the chip should be fast, perform well, and keep high accuracy, and when problems occur there must be a way to analyze them. Once the algorithm is in use, what remains is development of the whole application, which ultimately helps the developer reach the goal of the final product.
First, the infrastructure, explained through the practice of Horizon's AIDI AI development tool platform. AIDI is an efficient software 2.0 training, testing, and management platform. It consists of several parts: on the edge side there are vehicles and chips, and the data is sent back over encrypted transmission. In the cloud there is a complete infrastructure including semi-automatic annotation tools, automated model training, long-tail scenario management, automatic software integration, and automated regression testing, and finally the whole set of models is deployed to the chip through OTA upgrades. On the edge side there are also shadow mode, mass-production model deployment, and functional safety and information security work.
This whole set of work is not only for Horizon's chips but also for other chips; only the model deployment differs, while the methodology for software 2.0 is the same. Developers mine the problems of key scenarios, and automating the whole model iteration process greatly improves algorithm R&D efficiency; the platform can also connect openly to various terminals. In this way, the R&D efficiency of algorithm developers is greatly improved.
The figure above is a preliminary analysis and model of algorithm engineers' R&D efficiency, giving some efficiency-improvement figures for reference. Take data mining, including long-tail data management and shadow mode: shadow mode on the device is like a child doing exam questions. If a child does 100 questions and gets only one wrong, the other 99 are of little further use; the key is that the wrong question should go into the wrong-question notebook to be reviewed and iterated on. Shadow mode and long-tail data management are exactly that very valuable wrong-question notebook for AI models, and with them the efficiency of data upload and storage improves greatly.
Next is data labeling. Originally every picture had to be labeled. But an actual vehicle operates in a temporally and spatially continuous state. For example, when overtaking a slightly slower car in another lane, that car may stay in my field of view for the next 10 seconds and then disappear from view bit by bit. If within that 10-second window you capture frames at 30 frames per second and annotate them all, it is very time-consuming and laborious. In fact it is enough to label one frame and exploit the continuity of time and space to achieve automated labeling; you can even train a large model to do the labeling, and after labeling only a slight correction is needed, yielding a very large improvement in labeling efficiency.
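A minimal sketch of that temporal-continuity trick: annotate only keyframes and linearly interpolate the bounding box for the frames in between, then correct by hand where needed. The box values here are made up, and real tools track appearance as well, not just positions.

```python
# Propagate a (x, y, w, h) box between two hand-annotated keyframes by linear interpolation.
def interpolate_boxes(box_start, box_end, n_frames):
    boxes = []
    for i in range(1, n_frames):
        alpha = i / n_frames
        boxes.append(tuple((1 - alpha) * a + alpha * b
                           for a, b in zip(box_start, box_end)))
    return boxes

key0 = (100, 200, 80, 60)      # keyframe at t = 0 s
key1 = (160, 210, 90, 66)      # keyframe at t = 1 s (30 frames later)
auto_labels = interpolate_boxes(key0, key1, 30)
print(len(auto_labels), "frames labeled automatically, first:", auto_labels[0])
```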
On the AIDI platform there is a cluster of devices whose chips and boards are exactly the same as those in the vehicle. This device cluster lets each board run as if it were in a car, except that it sits in a machine room and its input is not images collected on the street but backfilled video. On it we do a great deal of exploration, such as exploring AI models and the compilation architecture, and we can also do a lot of application code modification and regression debugging, which greatly reduces the cost of testing devices, code, and software algorithms.
Beyond that, there is the bad-case management system, which holds not only images but also other inputs, as well as small software 1.0 and software 2.0 cases for autonomous driving.
Through the management of these cases, algorithm developers can go straight to the wrong-question notebook, see how to solve each problem, and greatly improve R&D efficiency. With this improvement in R&D efficiency, product development on the chip is served much better.
Now for the toolchain and application development. If you have development experience on the NVIDIA platform, the process is roughly as follows: first train the floating-point model, then quantize it, check whether the accuracy is up to standard, iterate if it is not, then compile the model and run it on the platform. For its toolchain and application development, Horizon also respects these developer habits.
Horizon has served more than 100 customers. Developers at these customers use these tools, read these documents, and study these examples, then build on them to put their own ideas and creativity to work, analyzing and debugging when they run into problems.
We take the problems and ideas they raise, and the obstacles they hit while exercising their creativity, and use them to improve the Tiangong Kaiwu toolchain and further raise efficiency. In addition, because the toolchain targets vehicles, Horizon also follows the complete ISO 26262 process in its development. The full functional safety certification is expected to be completed this year, so the whole toolchain can be delivered to customers and developers with more assurance and security.
The post-training quantization tool is also a typical hardware-software decoupling tool: any trained floating-point model that meets the specification requirements can be deployed on a Horizon chip. The tool itself is software, but a lot of joint optimization is done together with the chip, and that joint optimization in turn improves quantization accuracy. That is, after conversion by the quantization tool, without any retraining, its accuracy loss compared with NVIDIA's quantization is as shown in the figure above, and you can see that the quantized accuracy is better than NVIDIA's. This is a typical case of decoupling hardware and software while also combining them, producing a better tool for programmers and developers.
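At the tensor level, post-training quantization boils down to picking a scale and mapping floats to int8 without retraining. The sketch below shows that generic int8 arithmetic on a random stand-in for a trained weight tensor; it is not the Horizon tool itself, and real tools choose scales from calibration data rather than a single max.

```python
# Generic symmetric per-tensor int8 quantization and the error it introduces.
import numpy as np

weights = np.random.randn(64, 64).astype(np.float32)   # stand-in for a trained layer

scale = np.abs(weights).max() / 127.0                   # map the largest magnitude to 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale                  # what the integer hardware effectively computes

rel_err = np.abs(dequant - weights).mean() / np.abs(weights).mean()
print(f"mean relative quantization error: {rel_err:.4%}")
```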
On the training tool side we have similar innovations, such as training plugins, and there are many related tools for optimization and the compiler, all of which grew out of customers' valuable suggestions and developers' practical demands.
On this basis there is a very rich software stack: at the bottom of the stack is the OS, below that are the development boards, and above are the software and reference development programs. When you get our chips or tools, they include a complete toolchain and development components that reduce development complexity. Many application reference solutions are also made available to developers as white boxes or open source. We believe these, too, can greatly improve developer efficiency, so that developers can build better products and get good results.
In short, on the AIDI platform and the Tiangong Kaiwu toolchain, Horizon covers algorithm development and application, algorithm evaluation, the end-to-end data closed loop, AI algorithm deployment, application development, diagnosis, debugging, and performance tuning. This whole process is developer-oriented, entirely for developers and for the efficiency of their product development. Throughout, we uphold the principles of openness, flexibility, compatibility, and high performance, and under their guidance we draw on historical experience and lessons, taking history as a mirror, to see what to do next to make AI chip development tools easier to use.
The development tools described above still need to be validated and iterated in the market. Looking at the vehicle chips on the market as a whole, China, and its domestic independent brands in particular, has become the world's "gladiatorial arena" for top automotive intelligent chips, algorithms, and computing platforms. For example, in 2021 Mobileye's EyeQ5, Qualcomm's Snapdragon Ride, and NVIDIA's Xavier and Orin all debuted on domestic independent brands' own models, alongside Horizon's Journey 3 and Journey 5.
Finally, to review today's content in light of Horizon's practice: we are building an easy-to-use, world-class AI computing platform whose goal is to let developers build better AI-based products on it. This includes the Rising Sun and Journey chips, the integration of chip architecture, compiler, SoC, AI algorithms, and deep learning framework technology, and hardware-software collaborative optimization: improving performance and reliability through the BPU microarchitecture, layout, timing, on-chip network, instruction architecture, operation scheduling, communication, power consumption, and other aspects, verifying through tape-out, and putting the chip to work on machines so that machines can perceive and understand the surrounding world like humans. This is the core of how hardware-software co-design and decoupling inside the chip ensure that the chip is easy to use.
In addition, the Tiangong Kaiwu toolchain has distilled the most advanced practices in lightweight model R&D, model compression, quantization-aware training, post-training quantization, deep learning frameworks, the runtime environment, and AI application solutions into the toolchain, and through automation, tooling, and samples it serves thousands of developers, makes AI inclusive, and makes empowering machines more efficient.
Horizon's AIDI platform improves algorithm R&D efficiency through device-cloud collaboration, the data closed loop, automatic data mining, and automated annotation, and improves edge-side iteration efficiency through the evaluation cluster and hardware-in-the-loop testing. Behind it is a complete set of hardware and evaluation clusters: to train the models, the GPU cluster must be managed, the data must have storage management, and the whole process must be automated and efficiently scheduled. If the former improves the R&D efficiency of the algorithm, the latter improves the utilization of hardware resources.
What you see above are the products, and behind the products is a whole set of software code and infrastructure: how to write code that achieves high system throughput, low latency, high performance, and a low memory footprint, and that makes the system run correctly and reliably. From the boot loader in the chip, to the architectural design on the chip and the assembly-level tensor instruction code written against it, to the operating system drivers and kernel, to the compiler and runtime environment, to the deep learning framework and quantization, this entire technology stack has been accumulated bit by bit in code. It also involves multithreading, high-performance algorithm libraries, single-core and multi-core multithreading, communication and scheduling across multiple SoCs, and vehicle-grade functional safety, all of which must be considered at the software level.