20 | Spark performance optimization case analysis (Part 2)

In the last issue, we saw that software performance optimization must be driven by performance testing and grounded in an understanding of the software's architecture and technology. Today, we use several Spark performance optimization cases to see how those principles are applied in practice. If you have forgotten the principles of performance optimization, you can […]

7. Spark SQL programming

Directory: Overview; The difference between RDDs, Datasets, and DataFrames; Datasets, DataFrames and RDDs; Getting started; people.json; SparkSession; Create DataFrames; DataFrame operations; Run SQL queries programmatically; Create Datasets; Convert DataFrames to RDDs and back; Using reflection inference mode; Encoding issues; Programmatically specify schema; The problem of incomplete code in official documents; Finish. Overview: The Spark version is […]

13 | With the same essence, why can Spark be more efficient?

In the last issue, we discussed Spark’s programming model. In this issue, we look at Spark’s architectural principles. Like MapReduce, Spark follows the basic principle of big data computing: moving computation is more cost-effective than moving data. However, compared with MapReduce’s rigid Map and Reduce staged computation, Spark’s computing framework […]

From reading data with Spark.sql to LightGBM model storage

Summary: This article walks through the steps from reading data with Spark.sql to storing a LightGBM model. Overall process: import the essential toolkits, read the data, preprocess the data, build the model, evaluate the model, filter fields, store the model. Technical details: 1. Import the necessary tool packages: from pyspark.conf import SparkConf  # SparkConf contains various parameters for Spark cluster configuration from pyspark.sql import […]

Error when compiling Spark source code locally in IDEA

Report the error content first:
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.15
[INFO] os.detected.version.major: 10
[INFO] os.detected.version.minor: 15
[INFO] os.detected.classifier: osx-x86_64
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM [pom] […]

Spark memory management

Introduction: Since there is an overflow (spill) mechanism, why does OOM still occur? What are these two memory settings used for? set spark.executor.memory = 4g; set spark.executor.memoryOverhead = 3g; Is it possible to rely directly on JVM garbage collection for memory management? 1. Problems the memory management mechanism must solve: Big data processing frameworks such as […]
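As a rough illustration of the two settings quoted in the excerpt above, the sketch below works through the arithmetic in plain Python. It assumes Spark's documented defaults (an overhead of 10% of executor memory with a 384 MiB floor when spark.executor.memoryOverhead is unset, a 300 MiB reserved region, spark.memory.fraction = 0.6, spark.memory.storageFraction = 0.5). The function names are hypothetical helpers, not Spark APIs. The sketch ignores other components such as off-heap or PySpark memory, and it treats the full heap as usable, which slightly overstates what the JVM actually reports.

```python
# Hypothetical helpers for illustration only; these are not Spark APIs.

RESERVED_MIB = 300  # fixed reserve subtracted before the unified region is sized


def container_request_mib(executor_memory_mib, memory_overhead_mib=None):
    """Approximate memory requested per executor container, in MiB.

    spark.executor.memory sizes the JVM heap; spark.executor.memoryOverhead
    covers non-heap usage (native buffers, thread stacks, and so on).
    """
    if memory_overhead_mib is None:
        # Assumed default: 10% of executor memory, but at least 384 MiB.
        memory_overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + memory_overhead_mib


def unified_memory_mib(executor_memory_mib, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate split of the heap's unified region into storage and execution."""
    usable = executor_memory_mib - RESERVED_MIB
    unified = usable * memory_fraction
    storage = unified * storage_fraction
    return {"unified": unified, "storage": storage, "execution": unified - storage}


# The values from the excerpt: 4g of heap plus 3g of overhead.
print(container_request_mib(4096, 3072))  # 7168
print(unified_memory_mib(4096))
```

Under these assumptions, the container asks the cluster manager for about 7 GiB, while only a little over half of the 4 GiB heap is available to the unified execution and storage region. The remainder is left for user data structures and Spark's internal metadata.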