Reading notes on "R Language Data Analysis and Mining Practice" (3)

Chapter 3 Data Exploration

3.1 Data Quality Analysis

Missing values are generally handled in one of three ways: deletion, imputation of plausible values, or no treatment.
Outlier analysis, also called anomaly analysis, can be carried out with simple statistics, the 3σ rule, or box plots (a box plot identifies outliers purely from quantiles and makes no distributional assumptions, so its results are more objective).
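The 3σ rule mentioned above can be sketched in a few lines of R. This is a minimal illustration on synthetic data (not the book's sales file): for roughly normal data, values more than three standard deviations from the mean are flagged as outliers.

```r
# 3-sigma rule sketch: flag values more than 3 standard deviations from the mean
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), 200)  # 100 normal values plus one outlier
m <- mean(x)
s <- sd(x)
outliers <- x[abs(x - m) > 3 * s]
print(outliers)  # only the injected value 200 is flagged
```

Note that the rule only works well when the bulk of the data is approximately normal; a single gross outlier inflates `s` and can mask smaller ones.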

# set workspace
# Copy the "data and program" folder to the F disk, and then use setwd to set the workspace
setwd("F:/data and program/chapter3/sample program")
# read data
saledata <- read.csv(file = "./data/catering_sale.csv", header = TRUE)

# Missing value detection. R treats TRUE and FALSE as 1 and 0, so sum() and
# mean() give the number of incomplete samples and the missing proportion
sum(complete.cases(saledata))
sum(!complete.cases(saledata))
mean(!complete.cases(saledata))
saledata[!complete.cases(saledata), ]

# Outlier detection boxplot
sp <- boxplot(saledata$"sales", boxwex = 0.7)
title("Sales outlier detection box plot")
xi <- 1.1
sd.s <- sd(saledata[complete.cases(saledata), ]$"sales")
mn.s <- mean(saledata[complete.cases(saledata), ]$"sales")
points(xi, mn.s, col = "red", pch = 18)
arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink", angle = 75, length = .1)
text(rep(c(1.05, 1.05, 0.95, 0.95), length = length(sp$out)),
     labels = sp$out[order(sp$out)], sp$out[order(sp$out)] +
       rep(c(150, -150, 150, -150), length = length(sp$out)), col = "red")

The run shows one missing value, accounting for 0.497% of the sample, located in row 15 of the data.

3.2 Analysis of data characteristics

Distribution analysis: for quantitative data, frequency distribution tables, histograms, and stem-and-leaf plots reveal the shape of the distribution; for qualitative data, pie charts and bar charts display it.
Contrastive analysis: comparison of absolute numbers reveals differences, while comparison of relative numbers reflects the degree of association between phenomena.
Statistical analysis: the mean and median are commonly used to describe the central tendency of an indicator, while the standard deviation and interquartile range describe its dispersion.
Mean: $mean(x)=\bar{x}=\frac{\sum x_i}{n}$; for weighted data, $mean(x)=\frac{\sum w_i x_i}{\sum w_i}=\sum f_i x_i$, where $f_i=w_i/\sum w_i$ are the normalized frequencies. The mean is very sensitive to extreme values; if the data are skewed, a truncated (trimmed) mean or the median can be used to measure central tendency instead.
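The truncated mean mentioned above is built into R: `mean()` takes a `trim` argument giving the fraction of observations dropped from each tail before averaging. A small made-up example:

```r
# Trimmed mean: drop a fraction of observations from each tail before averaging
x <- c(1, 2, 3, 4, 100)   # 100 is an extreme value
mean(x)                   # 22 -- dragged upward by the extreme value
mean(x, trim = 0.2)       # 3  -- averages 2, 3, 4 after dropping one value per tail
median(x)                 # 3  -- also robust to the extreme value
```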
Median: with the observations sorted in ascending order, $M=x_{(\frac{n+1}{2})}$ when $n$ is odd, and $M=\frac{1}{2}\left(x_{(\frac{n}{2})}+x_{(\frac{n}{2}+1)}\right)$ when $n$ is even.
Range: $max-min$. The range is sensitive to outliers and ignores how the data are distributed between the two extremes.
Standard deviation: $s=\sqrt{\frac{\sum(x_i-\bar{x})^2}{n}}$, which measures how far the data deviate from the mean.
Coefficient of variation: $CV=\frac{s}{\bar{x}}\times 100\%$, which measures the size of the standard deviation $s$ relative to the mean $\bar{x}$, and allows comparison of dispersion across data sets with different units or different scales.
Interquartile range: the difference between the upper quartile $Q_U$ and the lower quartile $Q_L$; the middle half of the data falls within it. The larger the value, the greater the variability of the data, and vice versa.

# set workspace
# Copy the "data and program" folder to the F disk, and then use setwd to set the workspace
setwd("F:/data and program/chapter3/sample program")
# read data
saledata <- read.table(file = "./data/catering_sale.csv", sep=",", header = TRUE)
sales <- saledata[, 2]

# Statistics analysis
# mean
mean_ <- mean(sales, na.rm = T)
# median
median_ <- median(sales, na.rm = T)
# range
range_ <- max(sales, na.rm = T) - min(sales, na.rm = T)
# standard deviation
std_ <- sqrt(var(sales, na.rm = T))
# coefficient of variation
variation_ <- std_ / mean_
# interquartile range
q1 <- quantile(sales, 0.25, na.rm = T)
q3 <- quantile(sales, 0.75, na.rm = T)
distance <- q3 - q1
a <- matrix(c(mean_, median_, range_, std_, variation_, q1, q3, distance),
            1, byrow = T)
colnames(a) <- c("mean", "median", "range", "standard deviation", "variation coefficient",
                 "1/4 quantile", "3/4 quantile", "interquartile range")
print(a)

Cycle analysis: explore whether a variable exhibits a periodic pattern over time.
Contribution analysis: also known as Pareto analysis; based on the 80/20 rule, it tries to identify the roughly 20% of items that produce most of the value so that resources can be allocated to them.

# set workspace
# Copy the "data and program" folder to the F disk, and then use setwd to set the workspace
setwd("F:/data and program/chapter3/sample program")
# Read dish data and draw Pareto chart
dishdata <- read.csv(file = "./data/catering_dish_profit.csv")
barplot(dishdata[, 3], col = "blue1", names.arg = dishdata[, 2], width = 1,
        space = 0, ylim = c(0, 10000), xlab = "dishes", ylab = "profit: yuan")
accratio <- dishdata[, 3]
for (i in 1:length(accratio)) {
  accratio[i] <- sum(dishdata[1:i, 3]) / sum(dishdata[, 3])
}

par(new = T, mar = c(4, 4, 4, 4))
points(accratio * 10000 ~ c(1:length(accratio) - 0.5),
       type = "b", col = "red")
axis(4, col = "red", col.axis = "red", at = seq(0, 10000, by = 1000),
     labels = seq(0, 1, by = 0.1))
mtext("cumulative percentage", 4, 2)

points(6.5, accratio[7] * 10000, col="red")
text(7, accratio[7] * 10000,paste(round(accratio[7] + 0.00001, 4) * 100, "%"))

Correlation analysis: draw scatter plots, analyze scatter-plot matrices, and compute correlation coefficients.
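The scatter-plot matrix mentioned above can be drawn with `pairs()`. A small sketch using the built-in iris data rather than the book's catering CSV:

```r
# Scatter-plot matrix: every pairwise scatter plot of a numeric data frame
out <- tempfile(fileext = ".pdf")
pdf(out)                                   # write to a temporary PDF device
pairs(iris[, 1:4], main = "Scatter-plot matrix of the iris measurements")
dev.off()
cat("scatter-plot matrix written to", out, "\n")
```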
Pearson correlation coefficient:

$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

As a rule of thumb, $|r|$ between 0 and 0.3 indicates very weak linear correlation, 0.3 to 0.5 low, 0.5 to 0.8 significant, and 0.8 to 1 high linear correlation. The Pearson coefficient assumes the continuous variables follow a normal distribution; for non-normal data, correlation can be described with the rank correlation coefficient.

Spearman rank correlation coefficient:

$r_s=1-\frac{6\sum_{i=1}^{n}(R_i-Q_i)^2}{n(n^2-1)}$

where $R_i$ and $Q_i$ are the ranks of $x_i$ and $y_i$. Spearman correlation only requires the two variables to be related by a strictly monotonic function, whereas Pearson correlation requires the relationship to be linear.
Coefficient of determination: the square of the correlation coefficient, $r^2$, measures how much of the variation in $y$ the regression equation explains; the closer it is to 1, the stronger the linear relationship between $x$ and $y$.
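This relationship is easy to verify in R: for simple linear regression, the squared correlation coefficient equals the R-squared reported by `lm()`. A sketch on synthetic data (not the book's):

```r
# Coefficient of determination: cor(x, y)^2 equals R-squared of lm(y ~ x)
set.seed(2)
x <- 1:20
y <- 2 * x + rnorm(20)
r2_from_cor <- cor(x, y)^2
r2_from_lm  <- summary(lm(y ~ x))$r.squared
print(c(cor_squared = r2_from_cor, lm_r_squared = r2_from_lm))
```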

# Correlation analysis of catering sales data
# set workspace
# Copy the "data and program" folder to the F disk, and then use setwd to set the workspace
setwd("F:/data and program/chapter3/sample program")
# read data
cordata <- read.csv(file = "./data/catering_sale_all.csv", header = TRUE)
# Find the correlation coefficient matrix
cor(cordata[, 2:11])
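The `cor()` call above returns Pearson coefficients by default; passing `method = "spearman"` gives the rank correlation instead. A small synthetic example: y is a monotone but non-linear function of x, so Spearman's rho is exactly 1 while Pearson's r is below 1.

```r
# Pearson vs Spearman on a monotone but non-linear relationship
x <- 1:10
y <- exp(x)
r_pearson  <- cor(x, y)                       # linear correlation
r_spearman <- cor(x, y, method = "spearman")  # rank correlation
print(c(pearson = r_pearson, spearman = r_spearman))
```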

3.3 Main data exploration functions in R


Statistical graphing functions
barplot(): Draw a simple bar graph

##barplot(X, horiz=FALSE, main, xlab, ylab); horiz is a logical value, TRUE draws a horizontal bar graph
x=sample(rep(c("A","B","C"),20),50) ##Generate a random vector containing ABC
counts=table(x) ##Statistics of the three samples are stored in counts
barplot(counts) ##Draw a bar graph to show the number of three samples

pie(): Draws a pie chart; slice sizes are proportional to the values in x

##pie(X) pie chart
x=sample(rep(c("A","B","C"),20),50) ##Generate a random vector containing ABC
counts=table(x) ##Statistics of the three samples are stored in counts
pct=round(counts/sum(counts)*100)
lbls=paste(c("A","B","C"),pct,"%") ##Calculate the percentage for easy labeling
pie(counts,labels = lbls) ##draw pie chart

hist(): histogram

##hist(X, freq=TRUE); freq=TRUE plots counts (frequencies), freq=FALSE plots densities
x=sample(1:999,100) ##take 100 random numbers between 1 and 999
hist(x,freq=FALSE,breaks=7,main="Histogram of x") ##draw a density histogram
lines(density(x),col="red") ##add a density curve

boxplot(): box plot

##boxplot(X, notch, horizontal); when notch is TRUE, a notched box plot is drawn
x1=c(rnorm(50,5,2),11,1) ##50 random numbers with mean 5 and standard deviation 2, plus 2 constants
x2=c(rnorm(50,7,4),10,2)
boxplot(x1,x2,notch=TRUE) ##Draw a box plot with notch