
Xpeng Motors uses PolarDB to achieve high-frequency updates and real-time analysis of tables with tens of billions of rows

Author: Alibaba Cloud Yaochi Database
The following article comes from the PolarDB team, written by Yuanyun and Jinbei

Customer testimonials

Alibaba Cloud Yaochi's PolarDB for PostgreSQL offers elastically scalable storage of up to 100 TB. Its large-table optimization and elastic cross-node parallel query (ePQ) solved the slow queries and concurrent-update problems that community PostgreSQL suffers from on large tables. In Xpeng's intelligent assisted driving business, terabyte-scale tables now absorb 70 million row updates per day while analytical queries on them return in seconds.

——Head of SRE of Xpeng Motors Intelligent Assisted Driving

1. Customer introduction

1.1 About Xpeng Motors

Xpeng Motors is a leading Chinese smart electric vehicle company that designs, develops, manufactures, and markets smart EVs for technology-savvy consumers. Its core mission is to drive the transformation of smart electric vehicles through technology and lead the future of mobility. To improve the driving experience, Xpeng has invested heavily in in-house R&D of full-stack intelligent driver-assistance technology, an in-vehicle intelligent operating system, and core vehicle systems covering the powertrain and the electronic and electrical architecture.


1.2 Business Scenarios

Intelligent assisted driving is a key technical direction for Xpeng Motors, and large volumes of images and videos are collected and uploaded every day. The raw data lives in object storage, while a "directory" entry must be generated in a relational database for every file, recording its location, attributes, metrics, and so on, so that files can be located and managed in batches. This "directory" therefore has to cover every data file, which makes it a very large single table, and the full table must be updated frequently as metrics change.

2. Hitting a bottleneck: the test of massive data

Xpeng Motors originally used the community edition of PostgreSQL, but as the intelligent assisted driving business grew rapidly, the system faced three data-processing challenges:

2.1 Large table query is slow

Faced with massive data, the parallel processing capability of a single machine had reached its limit and could no longer handle queries on terabyte-scale tables, putting Xpeng's intelligent assisted driving analytics under unprecedented pressure. The largest table in Xpeng's data platform had climbed to 7 TB, and as many as four tables exceeded 1 TB. With analytical queries on these tables taking tens of minutes or even hours, this data-processing bottleneck had become a key constraint on the intelligent assisted driving business, and a new technical solution was urgently needed.

2.2 Large tables are updated frequently

Xpeng's annotation business faced an increasingly serious problem: terabyte-scale tables were receiving as many as 70 million row updates per day. This traffic caused a cascade of issues. In particular, during updates to these large tables, excessive file-verification work drove the file system's IOPS to its limit, and system performance fell sharply. Eventually, updating a single row could take minutes under this overload, triggering a database avalanche that directly threatened business operations.

2.3 Storage space is growing rapidly

The total database capacity had grown to 30 TB and was still increasing at a striking 2 TB per month. Community PostgreSQL faces a serious limitation here: its storage cannot scale automatically. This rapidly growing storage demand was approaching the system's limits and urgently needed a solution.

3. Breaking the deadlock: the PolarDB solution

As data volumes grew, community PostgreSQL ran into performance bottlenecks and the data-processing pipeline backed up; batch updates in particular became a choke point that hurt intelligent assisted driving R&D efficiency. Given this situation and the need to stay compatible with the PostgreSQL ecosystem, PolarDB for PostgreSQL (PolarDB-PG) was adopted: it remains compatible with existing business code, removes the single-node performance bottleneck, and improves R&D efficiency.


3.1 ePQ accelerates the analysis and query of terabyte-level large tables

ePQ is short for Elastic Parallel Query. The ePQ optimizer in PolarDB-PG generates execution plans that can be executed in parallel by multiple compute nodes, and the ePQ execution engine coordinates plan execution across those nodes, using their combined CPU, memory, and I/O bandwidth to scan and compute data. The architecture of a PolarDB-PG ePQ cluster is as follows:

[Figure: PolarDB-PG ePQ cluster architecture]

Based on this architecture, PolarDB-PG's ePQ has the following advantages over community PostgreSQL:

  • Extreme analytical query performance: on a 1 TB TPC-H test, performance improves by an average of 23 times, and it scales linearly with the degree of parallelism and the number of nodes.
  • Full business transparency: no business code needs to change; turning on the polar_enable_px switch in the console is enough to use ePQ.
  • All-in-one storage: TP and AP share a single copy of the data, reducing storage costs. Even under heavy TP write pressure, the AP engine provides millisecond-level data freshness.

3.2 Large table optimization solves the problem of frequent update of large tables


▶︎ Customer business issues

In the Xpeng scenario, a single table can reach the terabyte level, with the largest at 7 TB. The business side accesses these terabyte-scale tables with short connections and high concurrency (> 400) for updates. Under these conditions, file-access operations (such as open/lseek) on the large tables consume the entire database's IOPS, causing I/O latency to spike and triggering a database avalanche.

▶︎ Cause analysis ️

In community PostgreSQL, each table's files are stored as segments of 1 GB each. In the following three scenarios, open/lseek file operations are needed to obtain file sizes:

  1. Flushing dirty pages: when a dirty page is written back, its exact write location must be determined. This requires opening all segment files that precede the page, obtaining their sizes via lseek and summing them, and finally verifying that the page is written to the correct location.
  2. DML-triggered table extension: during an Insert/Update, if no free page can be found, the table is extended. Extension needs the current table size to locate the new end of the table, and obtaining that size means running open/lseek on the segment files one by one.
  3. Optimizer cost estimation: when the optimizer estimates the cost of scanning an ordinary table, it needs the table size to choose between a sequential scan and an index scan. Again, obtaining the current table size requires open/lseek on each segment file.

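The linear cost in the scenarios above can be sketched as follows. This is a hypothetical Python model, not PostgreSQL source code: each segment of a table is a separate 1 GB file, and computing the table size costs one file operation per segment.

```python
# Hypothetical model of community PostgreSQL's table-size lookup: one
# open/lseek per 1 GB segment file, so the cost is O(N) in segment count.

SEGMENT_SIZE = 1 << 30  # 1 GB per segment, as in PostgreSQL

def table_size_linear(segment_sizes):
    """Sum every segment's size; each iteration models one open/lseek."""
    probes = 0
    total = 0
    for size in segment_sizes:
        probes += 1  # one file operation per segment
        total += size
    return total, probes

# A ~7 TB table split into 7,000 segments (the last one partial).
segments = [SEGMENT_SIZE] * 6999 + [512 << 20]
size, probes = table_size_linear(segments)
print(probes)  # 7000 file operations for a single size lookup
```

With 7,000 segments, every single size lookup in the write path turns into 7,000 file operations, which is exactly the amplification described in the next paragraph.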
In Xpeng's intelligent assisted driving scenario, a 7 TB table corresponds to roughly 7,000 segment files, so every dirty-page flush or table extension amplifies a single file write into about 7,000 segment-length checks. Meanwhile, because the business side accesses the large tables through short connections at high concurrency, file handles cannot be cached across requests, and segment lengths can only be obtained through raw file operations (open/lseek).

▶︎ Solution

PolarDB-PG solves the problem of writing to large tables in the following ways:

  1. Binary search to obtain the table size: table extension and optimizer cost estimation both need the table size. Community PostgreSQL obtains it by reading the size of every segment one by one, at O(N) complexity. PolarDB-PG instead probes segment files at power-of-two numbers (1, 2, 4, 8, 16, 32, ...) until one does not exist, bracketing the answer into a range such as [64, 128), and then binary-searches that range to find the segment N that exists while segment N+1 does not. The table size is then (N-1) × segment_size plus the size of segment N (segment_size is 1 GB). This reduces the complexity of obtaining a table's file size from O(N) to O(log N).
  2. Reduced redundant file verification: when PostgreSQL writes a page, it opens all preceding segment files one by one to compute the page's write location and verify that it is correct. PolarDB-PG removes this redundancy and obtains only two file sizes, SegN-1 and SegN, which is enough to guarantee data correctness.
  3. Relation size cache: PolarDB-PG caches table sizes so that optimizer cost estimation reads the cache first instead of issuing I/O to obtain file sizes.

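The probing strategy in point 1 can be sketched as follows. This is a hypothetical Python model of the described behavior, not PolarDB-PG source code: segment numbers are probed at powers of two until one is missing, then the bracket is binary-searched for the last existing segment.

```python
# Hypothetical sketch of the O(log N) table-size lookup described above:
# exponentially probe segment numbers until one is missing, then
# binary-search the bracket for the last existing segment.

SEGMENT_SIZE = 1 << 30  # 1 GB per segment

def table_size_logarithmic(segment_sizes):
    """Find the table size with O(log N) existence probes instead of O(N)."""
    probes = 0

    def exists(i):
        nonlocal probes
        probes += 1  # models one open() existence check
        return i < len(segment_sizes)

    if not exists(0):
        return 0, probes
    # Phase 1: double the segment number until we overshoot.
    hi = 1
    while exists(hi):
        hi *= 2
    lo = hi // 2  # segment lo exists, segment hi does not
    # Phase 2: binary-search (lo, hi) for the last existing segment.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid
    # Full segments before the last one, plus the (possibly partial) last one.
    return lo * SEGMENT_SIZE + segment_sizes[lo], probes

segments = [SEGMENT_SIZE] * 6999 + [512 << 20]
size, probes = table_size_logarithmic(segments)
print(size, probes)  # ~2 * log2(7000) probes instead of 7000
```

For the same 7,000-segment table, the lookup needs only a few dozen existence probes instead of 7,000 file operations.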
3.3 Separation of storage and computing to achieve elastic expansion


▶︎ Customer business issues

Xpeng's data volume has reached 30 TB and is growing at an average of 2 TB per month. A self-managed PostgreSQL database on ECS instances can no longer cope with this growth.

▶︎ Solution

With its storage-compute separation architecture, PolarDB-PG scales storage independently and elastically, bills storage on a pay-as-you-go basis, and supports up to 100 TB; the PolarStore cluster's read/write bandwidth stays above 1.6 GB/s. Large capacity plus high bandwidth ensure that neither I/O nor storage space becomes a bottleneck for PolarDB-PG.

4. The turnaround: a transformation in system performance

4.1 Large-table analytical queries are 3.6 times faster

Take a 7.6 TB large table xxx as an example. The following statement queries it:

select count(1) as cnt from xxx
where create_time >= '2024-03-19 01:00:00'
  and create_time < '2024-03-19 02:00:00';

Without ePQ, native single-node parallel query with a parallelism of 6 takes 66 seconds.

[Figure: execution time without ePQ: 66 seconds]

With ePQ on 2 read-only (RO) nodes and a parallelism of 24, the execution time drops to 18 seconds.

[Figure: execution time with ePQ: 18 seconds]

Cross-node parallel query with ePQ is thus 3.6 times faster than single-node parallel query, and query performance improves linearly with parallelism; adding read-only nodes increases query speed further.

4.2 File wait events are reduced to almost zero

The following figure shows the change in file system seek and open IOPS, with performance before optimization to the left of the red line and after optimization to the right.

Before optimization: overall open/seek IOPS of 30,000 to 40,000.

After optimization: overall open/seek IOPS of about 5,000, a reduction of roughly 80%.

[Figure: file system open/seek IOPS before and after optimization]

In the following charts, each item represents the total number of wait events while the database runs. FileOpen/FileSeek wait events dropped from an average of 600+ to almost none (< 1).

Before:

[Figure: wait events before optimization]

After:

[Figure: wait events after optimization]

4.3 Elastic storage scaling through storage-compute separation

The chart below shows the customer's total data volume and its growth over 7 days. The total has reached 37 TB, with growth peaking at 1.5 TB per 7 days and averaging 2 TB per month, yet the PolarDB-PG cluster's storage space has not become a bottleneck.

[Figure: total data volume and 7-day data growth]

The following figure shows the I/O read/write bandwidth of Xpeng's intelligent assisted driving service during peak hours, stable above 1.6 GB/s. PolarDB-PG's high read/write bandwidth keeps the service running normally at peak load.

[Figure: peak-hour I/O read/write bandwidth]

5. Summary

PolarDB-PG's automatic elastic storage scaling, large-table optimization, and elastic cross-node parallel query have become powerful tools for Xpeng Motors' intelligent assisted driving business in handling annotation, analysis, and querying of terabyte-scale tables. More broadly, PolarDB-PG's high-performance support for large tables meets the database access requirements of intelligent assisted driving workloads in the automotive industry:

  1. Second-level analytical queries on terabyte-scale large tables;
  2. 70 million annotation updates per day on terabyte-scale large tables;
  3. Up to 100 TB of storage space, with automatic elastic scaling and pay-as-you-go billing;
  4. Large-table optimizations that are fully transparent to the business, requiring no code changes and adding no development or O&M burden.
