Data Statistics with Spark and Scala

Table of Contents

Preface

Spark

Scala

Data Sources

Process

Preparation

Download plugin

Create a new Scala project

Upload jsonl file to Hadoop

Code (five indicator requirements)

1. Cinemas with an attendance rate higher than 50%

The running results are as follows:

2. Count how many cinemas there are with the same name

The running results are as follows:

3. Calculate the average number of visitors to the cinema on that day

The running results are as follows:

4. Count the top 15 cinemas with the highest number of visitors on that day

The running results are as follows:

5. Query the box office of a cinema on a certain day

The running results are as follows:

Summary:


Preface

Spark

Spark is an open-source big data processing framework designed to provide fast, versatile, and easy-to-use distributed data processing and analysis. It was originally developed at UC Berkeley's AMPLab and open-sourced in 2010. Spark greatly simplifies large-scale data processing while offering high performance and flexibility, making it easier for developers to process and analyze large-scale data sets, and it has become one of the most widely used frameworks in the big data field.

Scala

Scala is a multi-paradigm programming language that runs on the Java Virtual Machine and combines the features of object-oriented programming and functional programming. Scala’s design goal is to provide a concise, efficient, and type-safe programming language while maintaining interoperability with Java. It provides many features of modern programming languages, such as a powerful static type system, pattern matching, higher-order functions, closures, type inference, and concurrent programming support.
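As a quick illustration (my own minimal sketch, not part of the original project), a few of the features listed above look like this in practice:

```scala
// Minimal demonstration of type inference, higher-order functions,
// and pattern matching. All names and values here are invented examples.
object ScalaFeatures {
  // Higher-order function: takes another function as a parameter
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Pattern matching on both values and types
  def describe(x: Any): String = x match {
    case 0         => "zero"
    case n: Int    => s"the int $n"
    case s: String => s"the string $s"
    case _         => "something else"
  }

  def main(args: Array[String]): Unit = {
    val doubled = List(1, 2, 3).map(_ * 2) // type inferred as List[Int]
    println(doubled)                       // List(2, 4, 6)
    println(applyTwice(_ + 1, 0))
    println(describe("spark"))
  }
}
```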

Data Sources

The following website offers plenty of datasets and is suitable for all kinds of testing:

Data science research and teaching integrated platform (idatascience.cn)

Process

Preparation

Download plugin

I wrote the code in IntelliJ IDEA, so the Scala plugin needs to be installed first.

Create a new Scala project

Upload jsonl file to Hadoop

Code (five indicator requirements)

1. Cinemas with an attendance rate higher than 50%

This indicator finds cinemas whose attendance rate is higher than 50%. It tells us which cinemas are the most popular, which can be used to judge their service quality. Here we use grouping, a count aggregation, a descending sort, and deduplication to compute the ranking.

Read the file from HDFS:

import org.apache.spark.sql._

object Attendance {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSession.builder.appName("Attendance").master("local").getOrCreate
    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // Cinemas whose attendance rate is higher than 50%
    val filtered: Dataset[Row] = df.where("Attendance > 50")
    val selected: Dataset[Row] = filtered.select("Attendance", "CinemaName")
    selected.groupBy("Attendance", "CinemaName")
      .count()
      .dropDuplicates("Attendance")
      .orderBy(new Column("count").desc)
      .show()
  }
}
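Before running a query like this on the cluster, the filter-group-sort logic can be sanity-checked on plain Scala collections. The records and values below are invented for illustration; only the field names mirror the JSONL schema:

```scala
// Toy stand-in for the cinema records (sample values are made up).
object AttendanceSketch {
  case class Cinema(cinemaName: String, attendance: Double)

  // Keep rows with attendance > 50, count occurrences per cinema,
  // and sort the counts in descending order
  def popular(rows: Seq[Cinema]): Seq[(String, Int)] =
    rows.filter(_.attendance > 50)
      .groupBy(_.cinemaName)
      .map { case (name, rs) => (name, rs.size) }
      .toSeq
      .sortBy { case (_, count) => -count }

  def main(args: Array[String]): Unit = {
    val sample = Seq(Cinema("A", 72.0), Cinema("A", 55.5), Cinema("B", 40.0))
    println(popular(sample)) // only cinema A passes the 50% filter
  }
}
```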

The running results are as follows:

2. Count how many cinemas there are with the same name

The more branches that share a name, the more widespread the chain is, so this can be used to judge a cinema's popularity.

import org.apache.spark.sql._

object CinemaNameSum {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSession.builder.appName("CinemaNameSum").master("local").getOrCreate
    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // Count how many cinemas share the same name
    val selected: Dataset[Row] = df.select("CinemaName")
    selected.groupBy("CinemaName")
      .count()
      .orderBy(new Column("count").desc, new Column("CinemaName"))
      .show(50)
  }
}
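The counting step is equivalent to a group-and-count on ordinary Scala collections; here is a small sketch (with made-up names) that builds the same count map with a `foldLeft`:

```scala
object NameCountSketch {
  // Count occurrences of each cinema name by folding over the list,
  // updating a count map one element at a time
  def countNames(names: Seq[String]): Map[String, Int] =
    names.foldLeft(Map.empty[String, Int]) { (acc, name) =>
      acc.updated(name, acc.getOrElse(name, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq("Dena", "Wanda", "Dena", "Dena") // invented names
    // Sort descending by count, like orderBy(desc("count")) in the query above
    println(countNames(sample).toSeq.sortBy(-_._2))
  }
}
```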

The running results are as follows:

3. Calculate the average number of visitors to the cinema on that day

The higher the average attendance, the better the films shown by the cinema match the public's taste.

import org.apache.spark.sql.{Column, Dataset, Row, SparkSession}

object Date {

  def main(args: Array[String]): Unit = {

    val ss: SparkSession = SparkSession.builder.appName("Date").master("local").getOrCreate
    val df: Dataset[Row] = ss.read.json("hdfs://192.168.187.169:9000/01/DataCinema.jsonl")

    // Average number of visitors per cinema on that day, and how many cinemas
    // share each average (the date field is named "data" in the source file)
    val day: Dataset[Row] = df.where("data = '2018/10/11'")
    val selected: Dataset[Row] = day.select("data", "AvgPeople")
    selected.groupBy("data", "AvgPeople")
      .count()
      .orderBy(new Column("count").desc)
      .show()
  }
}
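On plain Scala collections the same filter-then-average computation looks like this (a toy sketch with invented values, not the real data):

```scala
object AvgPeopleSketch {
  case class Visit(date: String, avgPeople: Double)

  // Mean of avgPeople over the records matching one date;
  // None if no rows match, to avoid dividing by zero
  def dailyAverage(rows: Seq[Visit], day: String): Option[Double] = {
    val matched = rows.filter(_.date == day).map(_.avgPeople)
    if (matched.isEmpty) None else Some(matched.sum / matched.size)
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Visit("2018/10/11", 10.0),
      Visit("2018/10/11", 20.0),
      Visit("2018/10/12", 99.0)
    )
    println(dailyAverage(sample, "2018/10/11")) // Some(15.0)
  }
}
```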

The running results are as follows:

4. Count the top 15 cinemas with the highest number of visitors on that day

This time the file is read from the local disk. Counting the top 15 cinemas by number of visitors on that day shows which cinemas are the most popular.

import org.apache.spark.sql._
import org.apache.spark.sql.functions.sum

object TodayAudienceCount {

  def main(args: Array[String]): Unit = {

    val ss = SparkSession.builder.appName("TodayAudienceCount").master("local").getOrCreate

    // Read the jsonl file from the local disk this time
    val df = ss.read.json("D:/course/Spark/End/Final/src/DataCinema.jsonl")

    // Top 15 cinemas by number of viewers on that day
    val audience = df.select("CinemaName", "TodayAudienceCount")
    audience.groupBy("CinemaName")
      .agg(sum("TodayAudienceCount").alias("TotalAudience"))
      .orderBy(new Column("TotalAudience").desc)
      .show(15)
  }
}
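The group-sum-sort-take pipeline can be sketched on ordinary Scala collections (made-up cinema names and counts) to check the logic before running it on real data:

```scala
object TopNSketch {
  // Sum the audience per cinema, then take the top n totals, descending
  def topN(rows: Seq[(String, Long)], n: Int): Seq[(String, Long)] =
    rows.groupBy(_._1)
      .map { case (name, rs) => (name, rs.map(_._2).sum) }
      .toSeq
      .sortBy { case (_, total) => -total }
      .take(n)

  def main(args: Array[String]): Unit = {
    val sample = Seq(("A", 100L), ("B", 300L), ("A", 250L), ("C", 50L))
    println(topN(sample, 2)) // A totals 350, B totals 300; C is cut off
  }
}
```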

The running results are as follows:

5. Query the box office of a cinema on a certain day

This lets you see a cinema's takings at a glance and compare them with other cinemas. Note that this indicator is written with Spark's Java API.

import org.apache.spark.sql.*;

public class TodayBox1 {

    public static void main(String[] args) throws AnalysisException {
        SparkSession ss = SparkSession.builder().appName("TodayBox1").master("local").getOrCreate();
        Dataset<Row> df = ss.read().json("D:/course/Spark/End/Final/src/DataCinema.jsonl");

        // Box office of each cinema on a certain day
        Dataset<Row> day = df.where("data = '2018/10/25'");

        // Register the filtered rows, then use GROUP BY to deduplicate
        day.createTempView("TodayBox");
        Dataset<Row> result = ss.sql(
                "select data, TodayBox, CinemaName from TodayBox group by data, TodayBox, CinemaName");
        result.orderBy(new Column("TodayBox")).show(30);
    }
}
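The SQL above uses GROUP BY over all three columns purely to deduplicate rows; on plain Scala collections (a toy model with invented values) the same effect is `distinct` followed by a sort:

```scala
object TodayBoxSketch {
  case class BoxRecord(date: String, todayBox: Double, cinemaName: String)

  // Keep rows for one date, drop exact duplicates, sort by box office ascending
  def boxOfficeOn(rows: Seq[BoxRecord], day: String): Seq[BoxRecord] =
    rows.filter(_.date == day).distinct.sortBy(_.todayBox)

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      BoxRecord("2018/10/25", 1200.0, "A"),
      BoxRecord("2018/10/25", 1200.0, "A"), // duplicate row, removed by distinct
      BoxRecord("2018/10/25", 800.0, "B"),
      BoxRecord("2018/10/24", 999.0, "C")   // wrong date, removed by filter
    )
    println(boxOfficeOn(sample, "2018/10/25"))
  }
}
```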

The running results are as follows:

Summary:

In studying Spark and Scala we will run into many problems, but with persistence we can always find a solution. The road is long, but only by persevering can we reach the other shore of success.

Spark and Scala are both excellent tools, though my skills are still shallow. I am happy to share what I have learned with you.

Push aside the clouds and mist to see the blue sky; part the clouds to see the bright moon!
