Differences between Spark, RDD, Hive, Hadoop-Hive and traditional relational databases

Hive. Hadoop. The difference between Hive and traditional relational databases. Spark concepts: a memory-based distributed computing framework, responsible only for computation, not for storage. Spark is similar to MapReduce in its offline computing functions. Disadvantages of MapReduce: it runs slowly (it does not fully utilize memory); the interface is relatively simple and only supports Map and Reduce. The function […]

[Python] PySpark data calculation ⑤ (RDD#sortBy method – sorting elements in RDD)

Article directory: 1. RDD#sortBy method (1. Introduction to RDD#sortBy syntax; 2. Analysis of the function parameter passed to RDD#sortBy). 2. Code example – RDD#sortBy example (1. Requirements analysis; 2. Code example; 3. Execution results). 1. RDD#sortBy method. 1. Introduction to RDD#sortBy syntax. The RDD#sortBy method is used to sort the elements in an RDD according to […]
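To make the idea in this excerpt concrete, here is a minimal plain-Python sketch of sorting by a key function. It does not call PySpark: `sort_by` is a hypothetical local helper that only mirrors the key-function and `ascending` arguments of PySpark's `RDD.sortBy`, and the sample data is made up for illustration.

```python
def sort_by(elements, keyfunc, ascending=True):
    """Hypothetical helper: sort a local list the way sortBy orders an RDD.

    keyfunc extracts the value to compare from each element;
    ascending=False reverses the order.
    """
    return sorted(elements, key=keyfunc, reverse=not ascending)


# Made-up (word, count) pairs, sorted by count from largest to smallest.
word_counts = [("hive", 2), ("spark", 3), ("hadoop", 1)]
by_count_desc = sort_by(word_counts, lambda pair: pair[1], ascending=False)
print(by_count_desc)
```

In real PySpark the sort also involves a partition count, which this local sketch leaves out.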

[Python] PySpark data calculation ④ ( RDD#filter method – filter elements in RDD | RDD#distinct method – deduplicate elements in RDD )

Article directory: 1. RDD#filter method (1. Introduction to the RDD#filter method; 2. RDD#filter function syntax; 3. Code example – RDD#filter method example). 2. RDD#distinct method (1. Introduction to the RDD#distinct method; 2. Code example – RDD#distinct method example). 1. RDD#filter method. 1. Introduction to the RDD#filter method. The RDD#filter method can filter the elements in an RDD object according […]

[Python] PySpark data calculation ③ ( RDD#reduceByKey function concept | RDD#reduceByKey method workflow | RDD#reduceByKey syntax | code example )

Article directory: 1. RDD#reduceByKey method (1. RDD#reduceByKey method concept; 2. RDD#reduceByKey method workflow; 3. RDD#reduceByKey function syntax). 2. Code example – RDD#reduceByKey method (1. Code example; 2. Execution results). 3. Code example – using RDD#reduceByKey to count file content (1. Requirements analysis; 2. Code example). 1. RDD#reduceByKey method. 1. RDD#reduceByKey method concept. The RDD#reduceByKey method is […]
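As a rough illustration of what this excerpt describes, the plain-Python sketch below mimics the semantics of reducing by key: values that share a key are grouped, then folded pairwise with a two-argument function. It does not use PySpark; the helper name `reduce_by_key` and the word list are hypothetical, and a real cluster would apply the same function within each partition and then across partitions, which is why the function should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Hypothetical helper: emulate reduceByKey semantics on a local list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)  # collect the values that share a key
    # Fold each group's values pairwise with the supplied function.
    return {key: reduce(func, values) for key, values in groups.items()}


# Made-up word-count input: each word becomes a (word, 1) pair.
words = ["spark", "hive", "spark", "hadoop", "spark", "hive"]
pairs = [(word, 1) for word in words]
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
```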

[Python] PySpark data calculation ① ( RDD#map method | RDD#map syntax | pass in ordinary functions | pass in lambda anonymous functions | chain calls )

Article directory: 1. RDD#map method (1. Introduction to the RDD#map method; 2. RDD#map syntax; 3. RDD#map usage; 4. Code example – RDD#map numerical calculation, passing in an ordinary function; 5. Code example – RDD#map numerical calculation, passing in a lambda anonymous function; 6. Code example – RDD#map numerical calculation, chained calls). 1. RDD#map method. 1. Introduction to the RDD#map method. In […]

Sample code for various Spark RDD operators

Code executed beforehand in the Anaconda environment:

    # Initialize the context
    import findspark
    findspark.init()
    from pyspark import SparkConf, SparkContext
    # conf = SparkConf().setMaster("local[*]").setAppName("test")
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Local file path
    path = "file:///app/notebook/testdata/"

Article directory: Code executed beforehand in the Anaconda environment. RDD operators. The first operator: the map operator. The second operator: the flatMap operator. The third operator: the reduceByKey operator, for KV-type RDDs, can be grouped […]

Spark RDD dataframe hehe

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark. An RDD represents a read-only, partitioned, immutable collection of data. It is a distributed memory abstraction, as is Distributed Shared Memory (DSM), but the two are different. RDD supports two types of operations: […]

Five characteristics of Spark-RDD

1. Five characteristics of RDD. A list of partitions: each RDD is divided into multiple partitions, which are distributed across different nodes of the cluster. Each partition is processed by one computing task, so the number of partitions determines the degree of parallelism. The user can specify the number of […]

3.3 Mastering RDD partitions

1. RDD partitions. (1) The concept of RDD partitions: an RDD is a large data collection that is divided into multiple sub-collections and distributed to different nodes; each sub-collection is called a partition. It can therefore also be said that an RDD is composed of several partitions. (2) The purpose of RDD partitions: in a distributed program, the overhead […]
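The partition concept in this excerpt can be sketched in plain Python: a collection is cut into sub-collections, each one is processed independently, and the partial results are combined. This is only an illustration under stated assumptions. The helper `split_into_partitions` is hypothetical, it uses simple contiguous slicing, and it says nothing about how Spark actually places partitions on nodes.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut a list into contiguous, near-equal partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


data = list(range(10))
partitions = split_into_partitions(data, 3)

# Each partition is handled on its own (here sequentially; on a cluster,
# one task per partition), then the partial results are merged.
partial_sums = [sum(partition) for partition in partitions]
total = sum(partial_sums)
print(partitions, partial_sums, total)
```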

Detection and identification analysis of urban road pavement damage: taking the integrated Czech-India-Japan data set of the RDD event as an example, developing and building an urban road damage detection and identification system based on the yolov5m model

Urban road damage detection has recently become a popular task area. The core idea is to apply existing deep learning research results to real-time detection and identification of urban road surface damage. In many of my previous blog posts, I have done similar projects related to object detection of bridges, cracks and fissures […]