Spark Core - number of RDD partitions

1. Why should we discuss the number of RDD partitions? Because each task operates on one partition: changing the number of partitions indirectly changes the number of tasks, which in turn changes the execution efficiency of the whole Spark job. 2. Without a shuffle operation, […]
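As a small illustration of this point (not from the excerpted article; the data and partition counts are made up), the partition count can be inspected and changed from a spark-shell, where sc is predefined:

val rdd = sc.parallelize(1 to 100, 4) // explicitly request 4 partitions
rdd.getNumPartitions                  // 4: one task per partition in each stage
val wider = rdd.repartition(8)        // more, smaller tasks, at the cost of a shuffle
val narrower = rdd.coalesce(2)        // fewer partitions, avoiding a full shuffle
(wider.getNumPartitions, narrower.getNumPartitions) // (8, 2)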

Spark data structure: RDD

Table of Contents: 1. Overview of RDD (1. RDD definition; 2. Core points of RDD design; 3. Characteristics of RDD); 2. RDD creation ((1) Creating an RDD; (2) Number of RDD partitions); 3. RDD functions ((1) Function classification). 1. Overview of RDD, 1. RDD definition: RDD (Resilient Distributed Dataset) is a resilient distributed dataset. It is […]
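A brief sketch of the two creation routes the table of contents points to, in a spark-shell; the collection and the HDFS path are placeholders, not taken from the article:

val fromCollection = sc.parallelize(Seq("a", "b", "c"))           // 1) parallelize an in-memory collection
val fromFile = sc.textFile("hdfs://namenode:9000/input/data.txt") // 2) load an external dataset (hypothetical path)
fromCollection.getNumPartitions // partition count falls back to defaults unless requested explicitly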

Several metric statistics with Spark RDD and Spark SQL in Scala, packaged and uploaded to a Spark cluster and run in YARN mode

Requirements: Use Spark RDD and Spark SQL programming to complete the following data analysis, compare the performance of the two with web UI monitoring, and give a reasoned explanation of the results. 1. Count the numbers of users, genders, and occupations respectively. 2. Compute the age distribution (divided into 7 segments by age). 3. Compute the […]
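A hedged sketch of requirement 1 with the RDD API, in a spark-shell; the file path, the comma delimiter, and the column layout (user id, gender, occupation) are assumptions, since the excerpt does not show the schema:

val users = sc.textFile("hdfs://master:9000/input/users.dat") // hypothetical path
val fields = users.map(_.split(","))                          // delimiter assumed
fields.count()                                                // total number of users
fields.map(f => (f(1), 1)).reduceByKey(_ + _).collect()       // counts per gender
fields.map(f => (f(2), 1)).reduceByKey(_ + _).collect()       // counts per occupation

The Spark SQL variant would express the same counts as GROUP BY queries, which is what the web UI performance comparison in the requirement is about.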

Python big data PySpark (6): RDD operations

Article directory: RDD operations; function classification; Transformation functions; Action functions; basic exercises [WordCount quick demonstration]; Transformation operators; Action operators; important functions; postscript. RDD operations, function classification: *a Transformation only establishes the computation relationship, while an Action is the actual executor*. Transformation operators (conversion operators): no computation is triggered between these operations. If you want […]
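The original article demonstrates this in PySpark; the same transformation-versus-action split, sketched here in Scala with a hypothetical input path:

val lines = sc.textFile("input/word.txt")                               // transformation: nothing runs yet
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) // still only a lineage graph
counts.collect().foreach(println)                                       // action: triggers the actual computation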

2023_Spark_Experiment 11: advanced RDD operators

// checkpoint:
sc.setCheckpointDir("hdfs://Master:9000/ck") // set the checkpoint directory
val rdd = sc.textFile("hdfs://Master:9000/input/word.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _) // the wordcount transformation
rdd.checkpoint // mark this RDD for checkpointing
rdd.isCheckpointed
rdd.count // trigger the computation; the log shows: ReliableRDDCheckpointData: Done checkpointing RDD 27 to hdfs://hadoop001:9000/ck/fce48fd4-d76f-4322-8d23-6a48d1aed7b5/rdd-27, new parent is RDD 28
rdd.isCheckpointed // res61: Boolean = true
rdd.getCheckpointFile // Option[String] = […]

Spark-RDD programming (1)

Introduction to RDD: RDD (Resilient Distributed Dataset) is a resilient distributed dataset and the most basic abstract class in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. In Spark, all operations on data consist of creating RDDs, transforming existing RDDs (with operators), and calling actions on RDDs for evaluation […]
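A minimal end-to-end sketch of the three kinds of operations the introduction lists (create, transform, evaluate), in a spark-shell with invented data:

val nums = sc.parallelize(Seq(1, 2, 3, 4)) // create
val squares = nums.map(n => n * n)         // transform: returns a new, immutable RDD
squares.reduce(_ + _)                      // action: evaluates the lineage and returns 30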

Spark [RDD Programming (4) Comprehensive Case]

Case 1 - Top N values in the data
Input data:
1,1768,50,155
2,1218,600,211
3,2239,788,242
4,3101,28,599
5,4899,290,129
6,3110,54,1201
7,4436,259,877
8,2369,7890,27
Processing code:
def main(args: Array[String]): Unit = {
  // create the SparkContext object
  val conf: SparkConf = new SparkConf()
  conf.setAppName("test1").setMaster("local")
  val sc: SparkContext = new SparkContext(conf)
  var index: Int = 0
  // create an RDD object by loading data from the local […]
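Since the excerpt cuts off, here is one hedged way the top-N computation could continue, ranking by the second comma-separated field in descending order; the choice of field and N = 3 are assumptions:

val data = sc.textFile("file:///input/topn.txt") // hypothetical path to the input above
val top3 = data.map(_.split(","))
  .filter(_.length == 4)             // keep only well-formed rows
  .map(f => f(1).toInt)              // second field as the value to rank
  .sortBy(v => v, ascending = false) // descending sort
  .take(3)                           // highest 3 values: 4899, 4436, 3110
top3.foreach(println)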

Spark [RDD Programming (3) Key-value Pair RDD]

Introduction: A key-value pair RDD is an RDD whose elements are key-value pairs of type (key, value). It is a common kind of RDD and applies in many scenarios. After all, as our earlier study of Hadoop showed, data processing is essentially unified batch processing in the form of […]
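A small sketch of the (key, value) shape described above, again with invented data:

val pairs = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))
pairs.reduceByKey(_ + _).collect() // aggregates values by key: (spark,2), (hadoop,1)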

Spark – Kernel Scheduling for RDD

1. DAG definition: The core of Spark is implemented on top of RDDs, and the SparkScheduler is an important part of the Spark core implementation; its role is task scheduling. Spark's task scheduling is about how to organize tasks to process the data of each RDD partition, and it is built on the dependencies between RDDs […]
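The lineage that the scheduler turns into a DAG can be inspected directly from a spark-shell; a sketch with a hypothetical input path:

val wc = sc.textFile("input/word.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(wc.toDebugString) // prints the lineage; the ShuffledRDD marks a stage boundary
wc.dependencies           // reduceByKey introduces a wide (shuffle) dependency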

Autonomous Optimization of Spark RDD Lazy Computing

Original / Zhu Jiqian: The data in an RDD (Resilient Distributed Dataset) is like a value declared final: it can only be read, not modified. To transform or otherwise operate on an RDD, you must create a new RDD to hold the result, which is why transformation and action operators are needed. Spark executes lazily. […]
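To make the read-only, new-RDD-per-operation point concrete, a sketch in a spark-shell with invented data:

val base = sc.parallelize(Seq(1, 2, 3))
val doubled = base.map(_ * 2) // does not modify base; lazily builds a new RDD
base.collect()                // elements 1, 2, 3: the original is untouched
doubled.collect()             // elements 2, 4, 6: computed only at this action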