Table of Contents
Preface
Spark
Scala
Data Sources
Process
Preparation
Download the Scala plugin
Create a new Scala project
Upload jsonl file to Hadoop
Code (five indicator requirements)
1. Cinemas with an attendance rate higher than 50%
The running results are as follows:
2. Count how many cinemas there are with the same name
The running results are as follows:
3. Calculate the average number of visitors to the cinema on that day
The running results are as follows:
4. Count the top 15 cinemas with the highest number of visitors on that day
The running results are as follows:
5. Query the box office of a cinema on a certain day
The running results are as follows:
Summary:
Preface
Spark
Spark is an open-source big data processing framework designed to provide fast, versatile, and easy-to-use distributed data processing and analysis. Originally developed by UC Berkeley’s AMPLab and open-sourced in 2010, Spark greatly simplifies big data processing while offering high performance and flexibility, making it easier for developers to process and analyze large-scale data sets. It has become one of the most widely used frameworks in the big data field.
Scala
Scala is a multi-paradigm programming language that runs on the Java Virtual Machine and combines the features of object-oriented programming and functional programming. Scala’s design goal is to provide a concise, efficient, and type-safe programming language while maintaining interoperability with Java. It provides many features of modern programming languages, such as a powerful static type system, pattern matching, higher-order functions, closures, type inference, and concurrent programming support.
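A minimal sketch (not from the original post) of the Scala features just mentioned: pattern matching, higher-order functions, and type inference. The `Shape` hierarchy and the numbers are invented for illustration.

```scala
// Pattern matching on a sealed hierarchy: the compiler checks
// that every case is covered.
sealed trait Shape
case class Circle(r: Double) extends Shape
case class Rect(w: Double, h: Double) extends Shape

def area(s: Shape): Double = s match {
  case Circle(r)  => math.Pi * r * r
  case Rect(w, h) => w * h
}

// Higher-order function: map takes a function as its argument;
// the type of `doubled` is inferred as List[Int].
val doubled = List(1, 2, 3).map(n => n * 2)
```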
Data Sources
The data comes from the following website, which hosts many datasets suitable for all kinds of tests.
Data science research and teaching integrated platform (idatascience.cn)
Process
Preparation
Download the Scala plugin
The project was written in IntelliJ IDEA, so the Scala plugin needs to be installed first.
Create a new Scala project
Upload jsonl file to Hadoop
Code (five indicator requirements)
1. Cinemas with an attendance rate higher than 50%
The Attendance class finds cinemas whose attendance rate is higher than 50%. This shows which cinemas are the most popular and can be used to judge their service quality. Grouping, counting, descending ordering, and deduplication are used to compute the ranking.
The file is read from HDFS.
import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}

object Attendance {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSession.builder
      .appName("Attendance")
      .master("local")
      .getOrCreate

    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // Cinemas whose attendance rate is higher than 50%
    val city: Dataset[Row] = df.where("Attendance > 50")
    val select: Dataset[Row] = city.select("Attendance", "CinemaName")
    select.groupBy("Attendance", "CinemaName")
      .count
      .dropDuplicates("Attendance")  // keep one row per attendance value
      .orderBy(new Column("count").desc)
      .show()
  }
}
The running results are as follows:
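The same pipeline (filter, group, count, order descending) can be mimicked on plain Scala collections, which is a handy way to sanity-check the logic without a cluster. The cinema names and rates below are invented sample data.

```scala
// Plain-collections sketch: keep records with attendance > 50,
// group by cinema name, count, and sort by count descending.
case class Record(cinemaName: String, attendance: Double)

val records = List(
  Record("Cinema A", 62.0),
  Record("Cinema B", 48.0),
  Record("Cinema A", 71.0),
  Record("Cinema C", 55.0)
)

val counts: List[(String, Int)] =
  records
    .filter(_.attendance > 50)                   // where("Attendance > 50")
    .groupBy(_.cinemaName)                       // groupBy("CinemaName")
    .map { case (name, rs) => (name, rs.size) }  // count
    .toList
    .sortBy { case (_, n) => -n }                // orderBy(desc("count"))
```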
2. Count how many cinemas there are with the same name
This can be used to judge the popularity of a cinema brand.
import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}

object CinemaNameSum {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSession.builder
      .appName("CinemaNameSum")
      .master("local")
      .getOrCreate

    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // Count how many cinemas share the same name
    // val where: Dataset[Row] = df.where("CinemaName = 'Dena International Cinema Hangzhou Xiaoshan Store'")
    val select: Dataset[Row] = df.select("CinemaName")
    select.groupBy("CinemaName")
      .count
      .orderBy(new Column("count").desc, new Column("CinemaName"))
      .show(50)
  }
}
The running results are as follows:
3. Calculate the average number of visitors to the cinema on that day
The higher the average attendance, the better the cinema’s screenings match the public’s taste.
import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}

object Date {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSession.builder
      .appName("one")
      .master("local")
      .getOrCreate

    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // The average number of visitors per day, and the number of cinemas
    // with that average on that day
    val city: Dataset[Row] = df.where("data = '2018/10/11'")
    // select from the filtered `city` so the date filter actually applies
    val select: Dataset[Row] = city.select("data", "AvgPeople")
    select.groupBy("data", "AvgPeople")
      .count
      .orderBy(new Column("count").desc)
      .show()
  }
}
The running results are as follows:
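Computing an average per group can also be sketched with plain collections, analogous to grouping by date and averaging `AvgPeople`. The dates and visitor numbers below are made-up sample data.

```scala
// Group visits by date, then average the per-cinema visitor counts
// within each date.
case class Visit(date: String, avgPeople: Double)

val visits = List(
  Visit("2018/10/11", 10.0),
  Visit("2018/10/11", 30.0),
  Visit("2018/10/12", 20.0)
)

val avgByDay: Map[String, Double] =
  visits
    .groupBy(_.date)
    .map { case (d, vs) => (d, vs.map(_.avgPeople).sum / vs.size) }
```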
4. Count the top 15 cinemas with the highest number of visitors on that day
The file is read locally; counting the top 15 cinemas with the highest number of visitors on that day indicates how popular each cinema is.
import org.apache.spark.sql._
import org.apache.spark.sql.functions.sum

object TodayAudienceCount {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder
      .appName("TodayAudienceCount")
      .master("local")
      .getOrCreate

    // Read the file locally (note the escaped backslashes in the Windows path)
    val avg = sparkSession.read.json("D:\\course\\Spark\\End\\Final\\src\\DataCinema.jsonl")
    // val srcRdd = sc.textFile("hdfs://192.168.53.169:9000/can_data/2023-09-24/*1")

    // Top 15 cinemas with the highest number of visitors on that day
    val price = avg.select("CinemaName", "TodayAudienceCount")
    price.groupBy("TodayAudienceCount", "CinemaName")
      .agg(sum("TodayAudienceCount").alias("Today's Audience Count"))
      .orderBy(new Column("Today's Audience Count").desc)
      .show(15)
  }
}
The running results are as follows:
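The top-N pattern used above (sum per group, order descending, take 15) can be sketched with plain collections. The cinema names and audience counts below are sample data.

```scala
// Sum audience counts per cinema, sort by total descending,
// and keep the 15 largest.
case class Entry(cinemaName: String, todayAudienceCount: Long)

val entries = List(
  Entry("Cinema A", 120L), Entry("Cinema B", 300L),
  Entry("Cinema A", 80L),  Entry("Cinema C", 50L)
)

val top: List[(String, Long)] =
  entries
    .groupBy(_.cinemaName)
    .map { case (name, es) => (name, es.map(_.todayAudienceCount).sum) }
    .toList
    .sortBy { case (_, total) => -total }
    .take(15)
```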
5. Query the box office of a cinema on a certain day
This gives a direct view of a cinema’s figures for comparison with other cinemas. (This example is written in Java rather than Scala.)
import org.apache.spark.sql.*;

public class TodayBox1 {
    public static void main(String[] args) throws AnalysisException {
        SparkSession ss = SparkSession.builder().appName("one").master("local").getOrCreate();
        // Note the escaped backslashes in the Windows path
        Dataset<Row> df = ss.read().json("D:\\course\\Spark\\End\\Final\\src\\DataCinema.jsonl");

        // The box office of a cinema on a certain day
        Dataset<Row> city = df.where("data = '2018/10/25'");
        // register the filtered rows so the date filter actually applies
        city.createTempView("TodayBox");
        Dataset<Row> oo = ss.sql(
            "select data, TodayBox, CinemaName from TodayBox group by data, TodayBox, CinemaName");
        oo.orderBy(new Column("TodayBox")).show(30);
    }
}
The running results are as follows:
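The box-office query (filter to one date, then order by that day’s takings) can likewise be sketched on plain Scala collections. All figures below are sample data.

```scala
// Filter to one date, then sort ascending by box office,
// mirroring where("data = '2018/10/25'") and orderBy("TodayBox").
case class BoxRecord(date: String, todayBox: Double, cinemaName: String)

val boxData = List(
  BoxRecord("2018/10/25", 1500.0, "Cinema A"),
  BoxRecord("2018/10/24",  900.0, "Cinema B"),
  BoxRecord("2018/10/25",  700.0, "Cinema C")
)

val onDay: List[BoxRecord] =
  boxData
    .filter(_.date == "2018/10/25")
    .sortBy(_.todayBox)
```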
Summary:
In studying Spark and Scala we run into many problems, but with persistence a solution can always be found. The road is long, and only by persevering can you reach the other side of success.
Spark and Scala are both excellent technologies. My skills are still shallow, but I am glad to share what I have learned with you.
Part the clouds and mist to see the blue sky; hold on through the clouds to see the bright moon!