13 | With the same essence, why is Spark more efficient?

In the last issue we discussed Spark’s programming model; in this issue we look at Spark’s architectural principles. Like MapReduce, Spark follows the basic principle of big data computing that moving computation is cheaper than moving data. But compared with MapReduce’s rigid two-stage Map and Reduce execution, Spark’s computing framework is more elastic and flexible, and as a result it performs better.

Spark’s computation stages

Let’s compare the two. Unlike MapReduce, in which an application runs only one map phase and one reduce phase, a Spark application can be divided into many more computation stages depending on its complexity. These stages form a directed acyclic graph (DAG), and Spark’s task scheduler executes the stages according to the dependencies recorded in the DAG.

Remember that in the last issue we compared the performance of a logistic regression machine learning job and found Spark to be more than 100 times faster than MapReduce. Some machine learning algorithms need a huge number of iterations, producing tens of thousands of computation stages; Spark processes all of these stages within a single application, instead of launching tens of thousands of applications as MapReduce would, which greatly improves running efficiency.
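
To make this concrete, here is a minimal, contrived sketch in Scala of that pattern: an iterative computation that makes many passes over a cached data set inside a single driver program, so each pass only adds stages to the same running application. The local[*] master and the toy data are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeDemo").setMaster("local[*]"))

    // Cache the data set so every pass reads it from memory instead of recomputing it from the source.
    val data = sc.parallelize(1 to 1000).map(_.toDouble).cache()

    var result = 0.0
    for (i <- 1 to 10) {
      // Each pass is a full map + reduce over the cached data, but all ten passes
      // run as stages of this one application; MapReduce would launch a job per pass.
      result += data.map(x => x * i).sum()
    }

    println(result)
    sc.stop()
  }
}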

The so-called DAG is a directed acyclic graph, meaning that the dependencies between stages have a direction: execution can only proceed along that direction, and a stage cannot start until every stage it depends on has finished. At the same time, the dependencies must not form a cycle, otherwise execution would loop forever. The diagram below depicts the stages of a typical Spark DAG.

As the diagram shows, the application is divided into three stages. Stage 3 depends on stage 1 and stage 2, while stage 1 and stage 2 are independent of each other. When scheduling, Spark first executes stage 1 and stage 2, and runs stage 3 once both have completed. Spark’s strategy is the same no matter how many stages there are: build the DAG from the program, establish the dependencies, and then execute each stage in dependency order until the whole big data computation is done.

The Spark pseudocode corresponding to the DAG above is as follows.

val rddB = rddA.groupBy(key)    // wide dependency: groupBy shuffles rddA (stage 1)
val rddD = rddC.map(func)       // narrow dependency: no shuffle (stage 2)
val rddF = rddD.union(rddE)     // narrow dependency: no shuffle (stage 2)
val rddG = rddB.join(rddF)      // join shuffles rddF, but not the already-partitioned rddB (stage 3)

So, as you can see, the core of Spark job scheduling and execution is the DAG. With the DAG, the whole application is divided into stages and the dependencies between stages are clear. Each stage is then turned into a task set (TaskSet) according to the amount of data it has to process, and each task is handed to a task process for execution. This is how Spark carries out distributed computation over big data.
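
If you want to see how a concrete program is cut into stages, one way (a minimal sketch, assuming sc is an existing SparkContext and input.txt is an assumed input file) is to inspect the lineage and partition count of an RDD: the indentation in toDebugString marks the shuffle boundaries where DAGScheduler will cut stages, and the number of partitions of a stage’s final RDD corresponds to the number of tasks in its task set.

// A minimal sketch, assuming sc is an existing SparkContext.
val pairs = sc.textFile("input.txt")          // "input.txt" is an assumed input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)         // reduceByKey introduces a shuffle

println(counts.toDebugString)                 // prints the lineage; indentation marks the shuffle (stage) boundary
println(counts.partitions.length)             // number of partitions = number of tasks in the final stage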

Specifically, the component responsible for generating and managing a Spark application’s DAG is DAGScheduler. DAGScheduler generates the DAG from the program code, distributes the program to the distributed computing cluster, and schedules execution according to the order of the stages.

So what does Spark use as the basis for dividing stages? Clearly, not every transformation on an RDD produces a new stage: the example above has four transformations but only three stages.

Look at the DAG diagram again and you can see the rule for dividing stages: a new stage is created wherever the connections between RDDs become many-to-many cross connections. An RDD represents a data set; each RDD in the figure contains several small blocks, and each small block is one partition (shard) of that RDD.

When multiple partitions of one data set have to be repartitioned and written into different partitions of another data set, we get the kind of cross-partition data transfer we also saw during the execution of MapReduce.

Yes, this is the shuffle. Spark likewise relies on shuffle to regroup data, bringing records with the same key together for aggregation, joins, and other operations, so every shuffle produces a new stage. This is also why stages have dependencies: the data a stage needs comes from one or more earlier stages, and it must wait for those stages to finish before it can shuffle and fetch that data.

Note in particular that stages are divided by shuffle, not by the type of transformation function: the same function may shuffle in one case and not in another. In the example above, RDD B and RDD F are joined to produce RDD G; RDD F needs to be shuffled, but RDD B does not.

That is because RDD B was already partitioned by key in the previous stage, during stage 1’s shuffle. As long as the number of partitions and the partitioning key stay the same, no further shuffle is needed.

Dependencies that do not require a shuffle are called narrow dependencies in Spark; conversely, dependencies that do require a shuffle are called wide dependencies. As in MapReduce, shuffle is one of the most important parts of Spark: only through shuffle can related data be brought together and computed on, building up complex application logic.
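
Here is a minimal sketch of that distinction, assuming an existing SparkContext sc and tiny in-memory data for illustration. rddB is hash-partitioned by key first, like the output of stage 1 in the example, so when the same partitioner is passed to join, rddB’s side becomes a narrow dependency while rddF’s side is still a wide dependency that has to be shuffled.

import org.apache.spark.HashPartitioner

// A minimal sketch, assuming sc is an existing SparkContext.
val partitioner = new HashPartitioner(4)

// Partitioned by key once, like RDD B after stage 1's shuffle.
val rddB = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(partitioner)

// Not partitioned the same way, like RDD F.
val rddF = sc.parallelize(Seq(("a", "x"), ("b", "y")))

// Narrow dependency on rddB (same partitioner, no shuffle); wide dependency on rddF (shuffled).
val rddG = rddB.join(rddF, partitioner)
rddG.collect().foreach(println)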

Now that you are familiar with Spark’s shuffle mechanism, let’s return to the title of this article: Spark also has to go through shuffle, so why can it be more efficient?

In essence, Spark can be regarded as a different implementation of the MapReduce computing model. Hadoop MapReduce simply and crudely divides the computation into two stages, Map and Reduce, around a single shuffle, and calls it done. Spark is more refined: it joins the preceding Reduce and the following Map into one continuous stage, forming a more elegant and efficient computing model, even though its essence is still Map and Reduce. Executing multiple dependent stages within one application sharply reduces both the accesses to HDFS and the number of job scheduling rounds, so execution is faster.

Also, unlike Hadoop MapReduce, which mainly stores shuffle data on disk, Spark prefers to keep data in memory, including both shuffle data and RDD data, falling back to disk only when memory is insufficient. This is another reason Spark outperforms Hadoop.
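
For RDD data, this memory-first behavior is something you can steer explicitly. A minimal sketch, assuming an existing SparkContext sc and an assumed input path:

import org.apache.spark.storage.StorageLevel

// A minimal sketch, assuming sc is an existing SparkContext; the path is an assumption.
val logs = sc.textFile("hdfs:///data/access.log")
val errors = logs.filter(_.contains("ERROR"))

// Keep the filtered RDD in memory, spilling to disk only when memory runs out,
// so later actions reuse it instead of rereading HDFS.
errors.persist(StorageLevel.MEMORY_AND_DISK)

println(errors.count())                                 // first action computes and caches the RDD
println(errors.filter(_.contains("timeout")).count())   // reuses the cached data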

Spark job management

As mentioned in an earlier column, Spark has two kinds of RDD functions. One kind is the transformation functions, which return another RDD when called; the computation logic of an RDD program is expressed mainly through transformations.

The other kind is the action functions, which do not return an RDD. For example, count() returns the number of elements in the RDD, and saveAsTextFile(path) writes the RDD’s data to the given path. Spark’s DAGScheduler creates a new stage whenever it encounters a shuffle, and creates a job whenever it encounters an action function.

Spark creates one computing task for each data shard (partition) of an RDD, so a stage contains many computing tasks.
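
Putting these pieces together, here is a minimal sketch (assuming an existing SparkContext sc; the output path is an assumption) of how transformations stay lazy until an action triggers a job, and how the partition count decides how many tasks each stage gets:

// A minimal sketch, assuming sc is an existing SparkContext.
val nums = sc.parallelize(1 to 1000, 8)      // 8 partitions, so 8 tasks in each stage over this RDD
val squares = nums.map(n => n * n)           // transformation: nothing runs yet

println(squares.count())                     // action 1: triggers job 1 (one stage, 8 tasks)
squares.saveAsTextFile("/tmp/squares")       // action 2: triggers job 2; "/tmp/squares" is an assumed output path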

The dependencies and time sequence of jobs, stages, and tasks can be seen in the figure below.

The horizontal axis in the figure represents time and the vertical axis represents tasks. The span between two thick black lines is a job, and the span between two thin lines is a stage; a job contains at least one stage. The horizontal red lines are tasks: each stage consists of many tasks, and together they form a task set.

After DAGScheduler generates the DAG from the code, Spark schedules work in units of tasks, assigning the tasks to different machines in the distributed cluster for execution.

Spark execution process

Spark supports several deployment modes, such as Standalone, Yarn, Mesos, and Kubernetes. The principle is the same in all of them; only the names and roles of some components differ, while the core functions and running process are alike.

The picture above shows Spark’s running process. Let’s walk through it step by step.

First, the Spark application starts in its own JVM process, the Driver process. Once started, it calls SparkContext to initialize the execution configuration and input data. SparkContext then starts DAGScheduler, which builds the DAG to execute and splits it into the smallest execution units, the computing tasks.

Next, the Driver requests computing resources from the Cluster Manager for the distributed execution of the DAG. After receiving the request, the Cluster Manager sends the Driver’s host address and related information to all the Worker nodes in the cluster.

After receiving this information, each Worker contacts and registers with the Driver at that host address, and then reports how many tasks it can take on based on its free resources. The Driver then starts assigning tasks to the registered Workers according to the DAG.

When a Worker receives tasks, it starts an Executor process to execute them. The Executor first checks whether it already has the Driver’s execution code; if not, it downloads the code from the Driver, loads it through Java reflection, and begins execution.
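
In code, the Driver is simply your main program. A minimal sketch in Scala, where the local[*] master and the executor memory setting are assumptions for illustration (on a real cluster they usually come from spark-submit rather than being hard-coded):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // This main() runs in the Driver JVM.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("local[*]")                 // assumed master for illustration
      .set("spark.executor.memory", "2g")    // assumed resource setting

    val sc = new SparkContext(conf)          // creating the SparkContext kicks off the process described above

    val result = sc.parallelize(1 to 100).map(_ * 2).sum()   // transformations plus an action produce a job
    println(result)

    sc.stop()
  }
}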

Summary

To sum up, Spark has three main strengths: the RDD programming model is simpler, the DAG-based multi-stage execution is faster, and keeping intermediate results in memory is more efficient. Together these give Spark faster execution and simpler programming than Hadoop MapReduce.

The emergence and popularity of Spark were in some sense inevitable, the combined effect of the right time, the right place, and the right people. First, Spark took off around 2012, when memory capacity had grown and memory cost had fallen by an order of magnitude compared with roughly a decade earlier, when MapReduce appeared, so the conditions for Spark’s memory-first design were ripe. Second, the demand for machine learning on big data was growing ever stronger and was no longer the simple data analysis workload of earlier years. Most machine learning algorithms need many rounds of iteration, and compared with MapReduce’s crude Map and Reduce split, Spark’s stage-based execution offers a friendlier programming experience and higher execution efficiency. So it is no surprise that Spark became the new king of big data computing.
