
Explore the application of parallel computing technology in databases

author:HUAWEI CLOUD Developer Alliance

This article is shared from the HUAWEI CLOUD community's "GaussTech Technical Column: Exploring the Application of Parallel Computing Technology in Databases-Cloud Community-HUAWEI CLOUD", written by GaussDB database.

Parallel computing is one of the most important ways to improve system performance. It speeds up task processing by running work in parallel across multiple servers, multiple processors, multiple cores within a processor, and SIMD instruction sets. It is used in many areas of computing, such as image processing, big data processing, scientific computing, and databases.

Parallel processing techniques in databases

1. Distributed parallel processing architecture

Parallel processing database architectures date back to the 1980s. At that time, computer performance was very limited, but enterprises already needed to process data at large scale.

So how did the technology industry improve its data processing capabilities at that time?

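At the time, three parallel architectures were proposed and widely debated: Shared Nothing, Shared Disk, and Shared Memory. In 1985, Turing Award winner Michael Stonebraker published "The Case for Shared Nothing", which compared the capabilities of the three architectures along several dimensions. Because of its advantages in cost, scalability, and availability, Shared Nothing became the mainstream design approach.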

1) The earliest commercial Shared Nothing product

The first Shared Nothing data processing system was the DBC/1012, the first generation of which was released by Teradata in 1984.


Figure 1 DBC/1012 architecture

The key components of DBC/1012's system architecture are:


DBC/1012 began as a backend for IBM 370 mainframes and later served as a backend for a variety of other mainframes, minicomputers, and workstations. A distribution algorithm spreads the data evenly across local disks, each managed by an AMP (Access Module Processor), and AMPs normally do not exchange data with one another. Adding AMPs increases both the data capacity and the performance of the whole system.
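The even distribution described above can be illustrated with a short sketch. It assumes a hash of the row key picks the owning AMP; the function names `amp_for_key` and `distribute`, the choice of SHA-256, and the sample keys are hypothetical and are not taken from the DBC/1012 design.

```python
import hashlib
from collections import Counter


def amp_for_key(key: str, num_amps: int) -> int:
    """Map a row key to an AMP index using a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_amps


def distribute(keys, num_amps: int) -> Counter:
    """Count how many rows land on each AMP."""
    return Counter(amp_for_key(k, num_amps) for k in keys)


keys = [f"customer-{i}" for i in range(10_000)]
for n in (4, 8):
    counts = distribute(keys, n)
    # Each AMP should receive roughly len(keys) / n rows.
    print(n, "AMPs ->", sorted(counts.values()))
```

Because every row has exactly one owner, each AMP can scan its own disk without coordinating with the others, which is why adding AMPs adds both capacity and throughput.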

Although it now looks like a piece of history, Teradata handled large-scale data very well with Shared Nothing technology. This won it large, high-value customers and helped it achieve commercial success.

2) MPP (Massively Parallel Processing) and Shared Nothing

Massively Parallel Processing (MPP), a term that comes up often in discussions of parallel database processing, originally refers to one class of server system architecture. Besides MPP, there are two other classes: SMP and NUMA.

  • SMP (Symmetric MultiProcessing): symmetric multiprocessor architecture

The main feature of an SMP server is sharing. All resources in the system (such as memory and I/O) are shared among the processors, so its scalability is limited.

SMP is also sometimes referred to as a Uniform Memory Access (UMA) architecture, because memory is shared uniformly by all processors. Unlike NUMA, every SMP processor has the same access time to all memory.


Figure 2 SMP schematic

  • NUMA (Non-Uniform Memory Access): non-uniform memory access architecture

The main feature of a NUMA server is that it has multiple CPU modules, which are connected to one another and exchange information through an interconnect module.

Each CPU can access the memory of the entire system, but not at the same speed. A CPU accesses its local memory much faster than it accesses memory attached to other nodes in the system.

The difference between NUMA and MPP is that a NUMA system is a single physical server, whereas an MPP system consists of multiple servers.


Figure 3 Schematic illustration of NUMA

  • MPP (Massively Parallel Processing): massively parallel processing architecture

In an MPP system, multiple server nodes are connected through an interconnect network. Each server node accesses only its local resources (memory and storage), and the servers share nothing with one another.

In the database field, calling a database an MPPDB means that it is designed and implemented to scale out across an MPP server cluster, expanding database performance by adding servers that share nothing with each other. In other words, MPPDB == Shared Nothing database.
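A minimal sketch of how a query can run on such a cluster: each node aggregates only its own rows, and a coordinator merges the small partial results. The `Node` class and `global_avg` function are hypothetical names used for illustration and do not describe any particular MPP product.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One hypothetical shared-nothing node holding only its own rows."""
    rows: list

    def partial_avg(self):
        # Local step: touches only local data and returns a small partial state.
        return sum(self.rows), len(self.rows)


def global_avg(nodes):
    # Coordinator step: merge the partial states, never the raw rows.
    partials = [n.partial_avg() for n in nodes]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


nodes = [Node([10, 20]), Node([30]), Node([40, 50, 60])]
print(global_avg(nodes))
```

Note that the average is computed from per-node (sum, count) pairs rather than from per-node averages, since averaging the averages would give the wrong answer when nodes hold different numbers of rows.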

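Many database products currently support the MPP architecture, for example: Netezza (PostgreSQL-based; inactive since its acquisition by IBM), Greenplum (PostgreSQL-based; VMware), Vertica (HP), Sybase IQ (SAP), TD Aster Data (Teradata), Doris (Baidu), ClickHouse (ClickHouse, Inc.), GaussDB (Huawei), and SeaboxMPP (Dongfang Jinxin, 东方金信).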

2. SMP parallelism

One size does not fit all. Shared Nothing parallelism scales out well horizontally, but as the hardware of a single physical server grows more powerful (tens to hundreds of cores per server), Shared Nothing alone cannot fully exploit that hardware. Many of the individual machines in a Shared Nothing database cluster are SMP systems, and even on NUMA machines each NUMA domain can be treated approximately as an SMP system. The industry therefore added SMP parallel execution to improve the scale-up capability of a single machine and optimize processing performance.

SMP parallelism makes full and efficient use of a machine's computing resources by executing a query as multiple subtasks on multiple threads in parallel.
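The idea of splitting one operation into subtasks that run on several threads can be sketched as follows. The function names and the even-number filter are hypothetical. Note also that CPython threads share a global interpreter lock, so this example shows the structure of intra-node parallel execution rather than a real speedup; a database engine would run such subtasks on native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk):
    # Subtask: filter and partially aggregate one slice of the table.
    return sum(v for v in chunk if v % 2 == 0)


def parallel_sum_even(values, workers=4):
    # Split the scan into roughly equal slices, one subtask per slice.
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(scan_chunk, chunks))
    # Gather step: combine the partial results from all workers.
    return sum(partials)


table = list(range(1_000))
print(parallel_sum_even(table) == sum(v for v in table if v % 2 == 0))
```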

3. Other parallel technologies

Other forms of parallelism exist as well. For example, ARM and x86 processors usually provide SIMD instruction sets, which widen the amount of data a single instruction can process. For reasons of space, these technologies will be discussed in later articles in the GaussTech series and are not covered here.

Applications of parallel technologies in open-source databases

Two of the most popular open-source databases are MySQL and PostgreSQL. Let's look at how Shared Nothing and SMP technologies are used in these two open-source database families.

1. Shared Nothing

In the MySQL ecosystem, a Shared Nothing database cluster is built by combining MySQL databases with middleware, either vendor-developed or open source, to provide distributed parallel processing. Examples include GoldenDB and TDSQL-MySQL. MySQL also offers MySQL NDB Cluster, which can be used to build distributed clusters.

PostgreSQL follows a similar approach, for example TDSQL-PostgreSQL and the open-source middleware popular in the PostgreSQL ecosystem, such as Postgres-XL, Postgres-XC, and Citus.

As you can see, both MySQL and PostgreSQL gain Shared Nothing capabilities by becoming distributed databases through a middleware architecture.


Although this type of database can scale data processing horizontally, it also has shortcomings in feature coverage (some functions are lost or degraded), global transaction support, high availability, and performance, each of which needs targeted enhancement.
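The middleware approach can be sketched in a few lines. This is a toy model under stated assumptions: the `ShardRouter` class, the `orders` table, and the CRC32-based routing are hypothetical, and in-memory SQLite databases stand in for independent MySQL or PostgreSQL backends. It does not reflect how any of the products named above are implemented.

```python
import sqlite3
import zlib


class ShardRouter:
    """Hypothetical middleware that routes rows to independent backends."""

    def __init__(self, num_shards):
        # Each in-memory SQLite database stands in for one backend server.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for db in self.shards:
            db.execute(
                "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)"
            )

    def _shard_for(self, order_id):
        # A stable hash of the shard key picks the backend.
        index = zlib.crc32(str(order_id).encode()) % len(self.shards)
        return self.shards[index]

    def insert(self, order_id, amount):
        db = self._shard_for(order_id)
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        db.commit()  # commits on this one shard only

    def total_amount(self):
        # Fan the query out to every shard, then merge the partial sums.
        return sum(
            db.execute("SELECT COALESCE(SUM(amount), 0) FROM orders").fetchone()[0]
            for db in self.shards
        )


router = ShardRouter(3)
for i in range(1, 101):
    router.insert(i, i * 10)
print(router.total_amount())
```

Each insert here commits on a single backend. A transaction that touched rows on several shards would need extra coordination between backends, which is one reason global transaction support is listed among the weaknesses of this architecture.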

2. SMP parallel technology

MySQL 8.0.14, released in 2019, introduced parallel query capability for the first time. It lets a single SQL statement take advantage of the host's multiple CPU cores and improves the handling of large, complex queries.

Parallel processing capabilities are mainly provided by the InnoDB storage engine.

(1) innodb_parallel_read_threads: Configure the maximum number of threads used for parallel scanning.

(2) innodb_ddl_threads: Controls the maximum number of parallel threads that InnoDB can use to create (sort and build) secondary indexes.

PostgreSQL has supported parallel sequential scans and aggregation since version 9.6, released in 2016. Version 11, released in 2018, added more parallel operators, such as parallel hash join, parallel append, and parallel index creation.

PostgreSQL provides parameters to control parallelism, such as max_parallel_workers_per_gather. When the optimizer estimates that parallel execution would cost more than a serial plan, it does not generate a parallel execution plan.
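The cost-based decision described above can be illustrated with a deliberately simplified model. The formula below is a toy, not PostgreSQL's actual planner arithmetic, and `choose_plan` is a hypothetical function. The default values are borrowed from the defaults of PostgreSQL's cpu_tuple_cost, parallel_setup_cost, and parallel_tuple_cost settings purely to make the numbers plausible.

```python
def choose_plan(rows, workers, cpu_cost_per_row=0.01,
                setup_cost=1000.0, transfer_cost_per_row=0.1,
                result_rows=1):
    """Return (plan, serial_cost, parallel_cost) for a toy table scan."""
    serial_cost = rows * cpu_cost_per_row
    if workers < 1:
        # No workers allowed, so only the serial plan is considered.
        return "serial", serial_cost, None
    # The scan work is divided among workers, but starting them and
    # shipping their result rows to the leader adds overhead.
    parallel_cost = (setup_cost
                     + rows * cpu_cost_per_row / workers
                     + result_rows * transfer_cost_per_row)
    plan = "parallel" if parallel_cost < serial_cost else "serial"
    return plan, serial_cost, parallel_cost


for rows in (10_000, 10_000_000):
    print(rows, choose_plan(rows, workers=4)[0])
```

In this model a small table stays serial because the fixed startup overhead outweighs the divided scan cost, while a large table switches to the parallel plan.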

As you can see, PostgreSQL and MySQL, the leading open-source databases, both use SMP thread-level parallel processing to improve single-node database performance.

Summary

As an important way to improve database processing performance, parallel computing has been widely adopted in existing database products. This article briefly introduced inter-node parallel processing, represented by Shared Nothing, and intra-node SMP parallel processing, along with their use in open-source databases.

As an enterprise-grade database, GaussDB also uses both technologies to improve processing performance. Compared with the open-source implementations, GaussDB adds further features, driven by practical scenarios, to improve distributed processing performance. We will explain these in the next article.
