
The basic process of running Spark

Author: JD Cloud developer

Preamble:

I've been interested in how Spark works lately, so I read the book Spark Big Data Processing: Technologies, Applications, and Performance Optimization, which covers Spark's core technologies, practical application scenarios, and performance optimization methods. The purpose of this article is to record and share the basic process of running Spark.

1. The basic components of Spark and their concepts

1. ClusterManager

In standalone mode, this is the Master, which controls the entire cluster and monitors the Workers. In YARN mode, it is the ResourceManager.

2. Application

A user-defined Spark program. After the user submits it, Spark allocates resources to the application, transforms the program, and executes it.

3. Driver

The Driver is a core concept in Spark: the main process of a Spark application, sometimes also called the application's master node. It runs the Application's main() function and creates the SparkContext.
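For illustration, here is a minimal sketch of a driver program, assuming a local master URL and a hypothetical HDFS input path; in a real cluster the master URL is normally supplied by spark-submit rather than hard-coded.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver: main() creates the SparkContext, which registers the
// application with the cluster manager and coordinates its execution.
object WordCountDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count").setMaster("local[*]") // placeholders
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)                   // action: triggers a job
    sc.stop()
  }
}
```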

4. Worker

The slave node, which manages a compute node and starts the Executor or the Driver. In YARN mode, this role is played by the NodeManager.

5. Executor

An Executor is a component on a worker node that runs tasks in a thread pool it starts for the application. Each application has its own set of executors.
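Executor resources are requested per application through configuration; the sketch below shows the usual knobs, with purely illustrative values (assumptions, not recommendations).

```scala
import org.apache.spark.SparkConf

// Illustrative executor settings; suitable values depend on the cluster.
val conf = new SparkConf()
  .setAppName("executor-demo")
  .set("spark.executor.instances", "4") // number of executors (YARN/Kubernetes)
  .set("spark.executor.cores", "2")     // size of each executor's task thread pool
  .set("spark.executor.memory", "4g")   // heap memory per executor
```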

6. RDD Graph

RDD is the core data structure of Spark and is manipulated through a series of operators (mainly Transformations and Actions). When an RDD encounters an Action operator, all the preceding operators are assembled into a directed acyclic graph (DAG), the RDD Graph. This graph is then converted into a job and submitted to the cluster for execution. An application can contain multiple jobs.
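A small sketch of how an RDD Graph builds up, assuming an existing SparkContext named sc: transformations are lazy and only accumulate lineage, and the final action closes the DAG and submits it as a job.

```scala
// Transformations are lazy: they only record lineage in the RDD Graph.
val nums    = sc.parallelize(1 to 1000, numSlices = 4)
val squares = nums.map(n => n * n)         // transformation
val evens   = squares.filter(_ % 2 == 0)   // transformation

// The action closes the DAG and submits it to the cluster as a job.
val total = evens.reduce(_ + _)
println(total)
```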

7. Job

A job is triggered by a Spark Action operator on an RDD Graph and is submitted to Spark through the SparkContext's runJob method.
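Actions such as count() or collect() call SparkContext.runJob internally; calling it directly, as in the sketch below (again assuming an existing sc), makes the job boundary explicit, though user code rarely needs to do so.

```scala
val rdd = sc.parallelize(1 to 100, 4)

// runJob applies a function to each partition and returns one result per
// partition; here, each task computes the sum of its own partition.
val partitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)
println(partitionSums.mkString(", "))
```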

8. Stage

Each job is divided into multiple stages based on the wide dependencies in the RDD lineage. Each stage contains a group of identical tasks, one per partition, which is also called a TaskSet.

9. Task

Each partition corresponds to one task, and the task executes the operators of its stage on that partition. Tasks are encapsulated and run in the executor's thread pool.
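The one-task-per-partition relationship can be made visible with mapPartitionsWithIndex, as in this small sketch (sc is assumed to exist):

```scala
val data = sc.parallelize(1 to 12, 3)
println(s"partitions, and therefore tasks per stage: ${data.getNumPartitions}")

// Each task processes exactly one partition; here every task reports
// its partition id and the number of records it saw.
val perPartition = data.mapPartitionsWithIndex { (partitionId, iter) =>
  Iterator((partitionId, iter.size))
}
perPartition.collect().foreach(println)
```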

2. Spark architecture

The Spark architecture uses the master-slave model of distributed computing. The Master, as the controller of the entire cluster, is responsible for its normal operation. A Worker is a compute node that receives commands from the master node and reports its status. The Executor is responsible for executing tasks, the Client submits applications on behalf of the user, and the Driver controls the execution of an application.

[Figure: The basic process of running Spark]

As shown in the figure, after a Spark cluster is deployed, the Master process and the Worker processes are started on the master node and the slave nodes respectively to control the entire cluster. During the execution of a Spark application, the Driver and the Workers are the two key roles. The Driver program is the starting point of the application logic and is responsible for job scheduling, that is, the distribution of tasks, while the Workers manage the compute nodes and create Executors to process tasks in parallel. During execution, the Driver serializes each task together with the files and jars it depends on and passes them to the corresponding Worker, and the Executor processes the tasks for its data partitions.
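Dependencies can also be registered explicitly from the driver; the sketch below uses SparkContext.addJar and addFile with hypothetical paths, and Spark ships these to the workers alongside the serialized tasks.

```scala
import org.apache.spark.SparkFiles

// Hypothetical dependency jar and side file registered from the driver.
sc.addJar("hdfs:///libs/my-udfs.jar")
sc.addFile("hdfs:///conf/lookup.csv")

// Inside a task (or on the driver), SparkFiles.get resolves the local copy
// of a file that was shipped with addFile.
val localPath = SparkFiles.get("lookup.csv")
```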

3. How Spark works

1. The overall process of Spark

The Client submits the application, and the Master finds a Worker on which to start the Driver. The Driver applies to the Master or the resource manager for resources and then converts the application into an RDD Graph. The DAGScheduler turns the RDD Graph into a directed acyclic graph of Stages and submits it to the TaskScheduler, which submits the tasks to Executors for execution.
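The resulting job and stage decomposition can be observed from the driver side; the sketch below (assuming an existing sc) tags a job group and then queries the status tracker for the jobs it produced.

```scala
// Tag subsequent jobs so they can be looked up afterwards.
sc.setJobGroup("demo-group", "inspect jobs and stages")

val result = sc.parallelize(1 to 100, 4)
  .map(n => (n % 3, n))
  .reduceByKey(_ + _)   // wide dependency: DAGScheduler cuts a second stage here
  .collect()            // action: SparkContext submits the job

// Job ids recorded for the group; each job was split into stages by DAGScheduler.
val jobIds = sc.statusTracker.getJobIdsForGroup("demo-group")
println(jobIds.mkString(", "))
```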

[Figure: The basic process of running Spark]

As shown in the figure, the execution of a Spark application logically forms a directed acyclic graph. After an Action operator is triggered, all accumulated operators are assembled into the graph, and the scheduler then schedules the tasks on this graph for computation. Spark divides the graph into stages based on the dependencies between RDDs, and each stage contains a pipeline of function executions. A, B, C, D, E, and F denote different RDDs, and the boxes inside them represent partitions. Data is imported from HDFS into Spark to form RDD A and RDD C; a map operation on RDD C converts it to RDD D; RDD B and RDD E are then converted into RDD F. The conversion of B and E into F involves a shuffle, and RDD F is finally output and saved to HDFS with saveAsSequenceFile.
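A rough Scala sketch of the pipeline described above; since the figure only names the RDDs A through F, the concrete operators between them are assumptions, but the overall shape (narrow transformations, a shuffle producing F, and saveAsSequenceFile at the end) matches the description.

```scala
val a = sc.textFile("hdfs:///data/a")   // hypothetical input forming RDD A
val c = sc.textFile("hdfs:///data/c")   // hypothetical input forming RDD C

val d = c.map(line => (line.takeWhile(_ != ','), line))  // map: C -> D (narrow)
val e = d.filter { case (_, v) => v.nonEmpty }           // assumed: D -> E (narrow)
val b = a.map(line => (line.takeWhile(_ != ','), line))  // assumed: A -> B (narrow)

// Combining B and E requires a shuffle (wide dependency), producing F.
val f = b.join(e).mapValues { case (left, right) => s"$left|$right" }

f.saveAsSequenceFile("hdfs:///data/f")  // final output written back to HDFS
```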

2. Classification of Stages

As shown in the figure above, in Apache Spark a job is divided into stages, each containing a set of parallel tasks. This division is based mainly on the wide and narrow dependencies between RDDs and enables more efficient task scheduling and execution. Here are some key points about stage division in Spark:

• Wide and narrow dependencies

Narrow Dependency: each partition of the parent RDD is used by at most one partition of the child RDD, or several parent RDD partitions are used by a single child RDD partition. Narrow dependencies allow the parent partitions to be computed in a pipelined manner on a single cluster node without shuffling data across the network.

Wide Dependency: each partition of the parent RDD may be used by multiple partitions of the child RDD, which causes a shuffle (a short example follows this list).

• Division of stages

Spark divides stages based on the wide and narrow dependencies between RDDs: whenever a wide dependency is encountered, a new stage is created. Each stage contains multiple tasks, and the number of tasks is determined by the number of partitions of the last RDD in the stage. Tasks within a stage can run in parallel, while stages are executed serially: computation of the next stage begins only after all tasks of the current stage have finished.

• Shuffle and stage boundaries

When Spark encounters a wide dependency (such as reduceByKey or groupBy), it creates a new stage at that operation. Wide dependencies require shuffling data, and shuffles typically involve disk and network I/O, so using wide dependencies as the boundaries between stages makes execution more efficient.
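The sketch below (hypothetical log path, existing sc assumed) chains narrow operators into one stage and then hits a wide dependency; toDebugString shows the resulting two-stage structure.

```scala
val logs   = sc.textFile("hdfs:///logs/app.log")          // hypothetical path
val errors = logs.filter(_.contains("ERROR"))             // narrow dependency
val pairs  = errors.map(line => (line.split(" ")(0), 1))  // narrow dependency

// reduceByKey introduces a wide dependency, i.e. a shuffle and a new stage.
val perKey = pairs.reduceByKey(_ + _)

println(perKey.toDebugString)  // lineage with the shuffle/stage boundary
perKey.count()                 // action: triggers the two-stage job
```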

3. Stage and task scheduling mode

Stage scheduling is performed by DAGScheduler, which splits the directed acyclic graph (DAG) of RDDs into stages. If a submitted stage still has unfinished parent stages, it must wait for those parents to finish. DAGScheduler also maintains several important key-value collections to record stage state, so that stages are neither executed prematurely nor submitted repeatedly: waitingStages records stages whose parents have not yet finished, runningStages holds the stages currently being executed to prevent duplicate execution, and failedStages stores stages whose execution failed and that need to be resubmitted.
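As a purely conceptual illustration of this bookkeeping (not Spark's actual implementation), the sets could be imagined roughly like this:

```scala
import scala.collection.mutable

// Simplified illustration only; DAGScheduler's real code is considerably richer.
case class StageInfo(id: Int, parents: Set[Int])

val waitingStages = mutable.HashSet[Int]() // parents not yet finished
val runningStages = mutable.HashSet[Int]() // currently executing
val failedStages  = mutable.HashSet[Int]() // failed, to be resubmitted

def submitStage(stage: StageInfo, finishedStages: Set[Int]): Unit = {
  val missingParents = stage.parents.diff(finishedStages)
  if (missingParents.nonEmpty)
    waitingStages += stage.id                 // wait until all parents complete
  else if (!runningStages.contains(stage.id))
    runningStages += stage.id                 // avoid duplicate submission
}
```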

Each stage contains a set of parallel tasks, which are organized into a TaskSet. DAGScheduler submits the partitioned TaskSets to the TaskScheduler, the component responsible for task scheduling and cluster resource management. The TaskScheduler manages each TaskSet through a TaskSetManager, which tracks and controls the execution of the tasks under its jurisdiction, including task launching, status monitoring, and failure retries. When a TaskSet is submitted to the TaskScheduler, it decides which executors should run the tasks and distributes them to the appropriate nodes through a cluster manager such as YARN, Mesos, or Spark Standalone. After an executor receives a task, it runs it in the thread pool it manages. During execution, the task's status is continuously updated and reported to the TaskSetManager through the status update mechanism. The TaskSetManager tracks execution based on these status updates and, if a task fails, triggers the retry mechanism until the configured maximum number of retries is reached.
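The retry behaviour mentioned above is driven by configuration; a small sketch of two commonly tuned settings follows (the values shown are only illustrative).

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("scheduling-demo")
  .set("spark.task.maxFailures", "4") // task attempts allowed before the job fails
  .set("spark.speculation", "true")   // re-launch suspiciously slow tasks speculatively
```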

When all tasks are executed, the TaskScheduler notifies DAGScheduler, which triggers the execution of subsequent stages (if any).

4. Shuffle mechanism

Why does the Spark computing model need a shuffle? Spark computes in a distributed environment, so all of the data cannot fit in a single process space. Instead, the data is partitioned by key into many small partitions scattered across the memory of the processes in the cluster. However, not every operator is satisfied with one fixed partitioning: for example, when data needs to be sorted or grouped, it must be repartitioned according to certain rules. Shuffle is the process of reorganizing data that underlies the various operators that require repartitioning.
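Repartitioning pair data by key is exactly the kind of reorganization a shuffle performs; partitionBy makes it explicit, as in this sketch (existing sc assumed):

```scala
import org.apache.spark.HashPartitioner

val sales = sc.parallelize(Seq(("cn", 3), ("us", 5), ("cn", 7), ("de", 1)))

// Hash-partition by key into 4 partitions: records sharing a key end up in
// the same partition, which requires moving data across the cluster.
val byKey = sales.partitionBy(new HashPartitioner(4))
println(byKey.partitioner) // Some(org.apache.spark.HashPartitioner@...)
```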

[Figure: The basic process of running Spark]

As shown in the figure, the whole job is divided into three stages, Stage1 to Stage3. Execution starts from the upstream Stage2 and Stage3: each stage applies a pipeline of transformation functions to each partition and, at the end of the stage, performs a Shuffle Write, dividing the data into buckets according to the number of partitions of the next stage and writing the buckets to disk. This is known as the Shuffle Write phase. After Stage2 and Stage3 finish, Stage1 fetches the shuffle data it needs from the disks of the nodes that wrote it, pulls it to the local machine, and performs the user-defined aggregation function. This phase is called Shuffle Fetch, and it includes the aggregation step. In this way, the shuffle is completed between adjacent stages.
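Both sides of the shuffle can be tuned through configuration; the sketch below lists a few settings that affect the write and fetch phases described above (the values shown are the usual defaults, given only for illustration).

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.compress", "true")       // compress shuffle (bucket) files on write
  .set("spark.shuffle.file.buffer", "32k")     // write-side buffer before spilling to disk
  .set("spark.reducer.maxSizeInFlight", "48m") // data a reducer fetches concurrently
```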

4. Conclusion

After reading the book Spark Big Data Processing: Technologies, Applications, and Performance Optimization, I gained a general understanding of Spark's operating mechanisms and principles. The above is only a brief summary and does not go into the details. The original book gives a very detailed introduction, covering Spark's fault tolerance, I/O, and network mechanisms, and analyzes Spark's execution process from the source code. It also shows, through many practical cases, how to use Spark for data processing, analysis, and mining in concrete applications, combining theory with practice. If you are interested, it is worth reading yourself.