Chapter 3 Data Exploration
3.1 Data Quality Analysis
Handling of missing values generally falls into three categories: deleting the affected records, imputing plausible values, and leaving the values untreated.
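A minimal sketch of the three strategies on a toy vector (the vector `v` and the choice of mean imputation are illustrative, not from the text):

```r
# Toy vector with one missing value
v <- c(2, 4, NA, 8)

# 1. Deletion: drop incomplete observations
v_deleted <- v[!is.na(v)]                      # or na.omit(v)

# 2. Imputation: replace NA with a plausible value, e.g. the mean
v_imputed <- v
v_imputed[is.na(v_imputed)] <- mean(v, na.rm = TRUE)

# 3. No processing: keep the NA and let each function handle it via na.rm
mean(v, na.rm = TRUE)
```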
Outlier analysis (also called anomaly detection) can be carried out with simple statistics, the 3σ rule, or a box plot. The box plot identifies outliers using only quantiles, so it needs no assumption about the data's distribution and its results are more objective.
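The 3σ rule mentioned above can be sketched on synthetic data (the vector and the planted outlier are my own example):

```r
# 3σ rule: for approximately normal data, values more than three standard
# deviations from the mean are flagged as outliers
x <- c(rep(c(9, 10, 11), 30), 100)   # regular values plus one planted outlier
m <- mean(x)
s <- sd(x)
outliers <- x[abs(x - m) > 3 * s]    # values outside mean ± 3σ
print(outliers)                      # flags the planted value 100
```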
# Set the workspace: copy the "data and program" folder to the F disk,
# then use setwd() to set the working directory
setwd("F:/data and program/chapter3/sample program")

# Read the data
saledata <- read.csv(file = "./data/catering_sale.csv", header = TRUE)

# Missing-value detection. R treats TRUE and FALSE as 1 and 0, so sum()
# and mean() give the number of missing samples and the missing ratio
sum(complete.cases(saledata))
sum(!complete.cases(saledata))
mean(!complete.cases(saledata))
saledata[!complete.cases(saledata), ]

# Outlier detection with a box plot
sp <- boxplot(saledata$"sales", boxwex = 0.7)
title("Sales outlier detection box plot")
xi <- 1.1
sd.s <- sd(saledata[complete.cases(saledata), ]$"sales")
mn.s <- mean(saledata[complete.cases(saledata), ]$"sales")
points(xi, mn.s, col = "red", pch = 18)
arrows(xi, mn.s - sd.s, xi, mn.s + sd.s, code = 3, col = "pink",
       angle = 75, length = .1)
text(rep(c(1.05, 1.05, 0.95, 0.95), length = length(sp$out)),
     sp$out[order(sp$out)] + rep(c(150, -150, 150, -150),
                                 length = length(sp$out)),
     labels = sp$out[order(sp$out)], col = "red")
The output shows one missing value, accounting for 0.497% of the samples; it is located in row 15.
3.2 Analysis of data characteristics
Distribution analysis: for quantitative data, frequency distribution tables, histograms, and stem-and-leaf displays reveal the shape of the distribution; for qualitative data, pie charts and bar charts show the distribution across categories.
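These tools can be sketched on small made-up samples (the data vectors and bin breaks are illustrative only):

```r
# Quantitative data: frequency distribution table, histogram, stem-and-leaf
x <- c(12, 15, 17, 21, 22, 25, 28, 31, 35, 38)
bins <- cut(x, breaks = seq(10, 40, by = 10))  # group into intervals of width 10
table(bins)                                    # frequency distribution table
hist(x)                                        # histogram
stem(x)                                        # stem-and-leaf display

# Qualitative data: pie chart and bar chart of category counts
g <- c("A", "A", "B", "C", "C", "C")
pie(table(g))
barplot(table(g))
```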
Comparative analysis: comparisons of absolute numbers reveal differences in magnitude, while comparisons of relative numbers reflect the degree of relationship between phenomena.
Statistical measures: the mean and median are commonly used to describe the central tendency of an indicator, while the standard deviation and interquartile range describe its dispersion.
Mean

mean(x)=\bar{x}=\frac{\sum x_i}{n};\qquad mean(x)=\frac{\sum w_i x_i}{\sum w_i}=\sum f_i x_i

The first form is the simple arithmetic mean; the second is the weighted mean, where f_i = w_i / \sum w_i are the relative frequencies. The mean is very sensitive to extreme values; if the data follow a skewed distribution, a trimmed mean or the median can be used instead to measure central tendency.
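R's mean() supports trimming directly; a small illustration (the vector is my own example):

```r
x <- c(1, 2, 3, 4, 100)   # one extreme value skews the mean
mean(x)                   # ordinary mean, pulled up to 22 by the outlier
mean(x, trim = 0.2)       # drop the lowest and highest 20% before averaging: 3
median(x)                 # the median ignores the extreme value entirely: 3
```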
Median

M=x_{(\frac{n+1}{2})}\ (n\text{ odd});\qquad M=\frac{1}{2}\left(x_{(\frac{n}{2})}+x_{(\frac{n}{2}+1)}\right)\ (n\text{ even})

When n is odd the median is the middle order statistic; when n is even it is the average of the two middle order statistics.
Range:

max-min

The range is sensitive to outliers and ignores how the data are distributed between the two extremes.
Standard deviation:

s=\sqrt{\frac{\sum(x_i-\bar x)^2}{n}}
Coefficient of variation:

CV=\frac{s}{\bar x}\times 100\%

The CV measures the dispersion of the standard deviation s relative to the mean \bar x, allowing comparison of data sets with different units or different scales of fluctuation.
Interquartile range: the difference between the upper quartile QU and the lower quartile QL; the middle half of the data falls inside it. The larger the value, the greater the variability of the data, and vice versa.
# Set the workspace: copy the "data and program" folder to the F disk,
# then use setwd() to set the working directory
setwd("F:/data and program/chapter3/sample program")

# Read the data
saledata <- read.table(file = "./data/catering_sale.csv", sep = ",",
                       header = TRUE)
sales <- saledata[, 2]

# Descriptive statistics
mean_ <- mean(sales, na.rm = T)                          # mean
median_ <- median(sales, na.rm = T)                      # median
range_ <- max(sales, na.rm = T) - min(sales, na.rm = T)  # range
std_ <- sqrt(var(sales, na.rm = T))                      # standard deviation
variation_ <- std_ / mean_                               # coefficient of variation
q1 <- quantile(sales, 0.25, na.rm = T)                   # lower quartile
q3 <- quantile(sales, 0.75, na.rm = T)                   # upper quartile
distance <- q3 - q1                                      # interquartile range

a <- matrix(c(mean_, median_, range_, std_, variation_, q1, q3, distance),
            1, byrow = T)
colnames(a) <- c("mean", "median", "range", "standard deviation",
                 "variation coefficient", "1/4 quantile", "3/4 quantile",
                 "interquartile range")
print(a)
Cycle Analysis: Explore whether a variable exhibits a cyclical trend over time.
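A minimal sketch of such a check on synthetic data (the 12-month cycle, amplitude, and noise level are all assumptions of this example):

```r
# Plot a monthly series against time and look for a repeating pattern
set.seed(123)                                   # reproducible noise
t <- 1:72                                       # six years of monthly data
sales <- 100 + 20 * sin(2 * pi * t / 12) +      # yearly cycle
         rnorm(72, sd = 2)                      # random noise
plot(t, sales, type = "b", xlab = "month", ylab = "sales")
```

A strong autocorrelation at the cycle length (here lag 12) is another sign of periodicity.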
Contribution analysis: also known as Pareto analysis, it rests on the 80/20 rule: roughly 80% of the result often comes from 20% of the items, so the aim is to identify that key 20% and concentrate resources on it.
# Set the workspace: copy the "data and program" folder to the F disk,
# then use setwd() to set the working directory
setwd("F:/data and program/chapter3/sample program")

# Read the dish data and draw a Pareto chart
dishdata <- read.csv(file = "./data/catering_dish_profit.csv")
barplot(dishdata[, 3], col = "blue1", names.arg = dishdata[, 2],
        width = 1, space = 0, ylim = c(0, 10000),
        xlab = "dishes", ylab = "profit: yuan")

# Cumulative percentage of profit
accratio <- dishdata[, 3]
for (i in 1:length(accratio)) {
  accratio[i] <- sum(dishdata[1:i, 3]) / sum(dishdata[, 3])
}
par(new = T, mar = c(4, 4, 4, 4))
points(accratio * 10000 ~ c(1:length(accratio) - 0.5), type = "b")
axis(4, col = "red", col.axis = "red", at = seq(0, 10000, by = 2500),
     labels = seq(0, 1, by = 0.25))
mtext("cumulative percentage", 4, 2)
points(6.5, accratio[7] * 10000, col = "red")
text(7, accratio[7] * 10000,
     paste(round(accratio[7] + 0.00001, 4) * 100, "%"))
Correlation analysis: draw scatter plots and scatter-plot matrices, and compute correlation coefficients.
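A quick sketch of a scatter-plot matrix with its matching correlation matrix, using R's built-in iris data in place of the catering data:

```r
# Scatter-plot matrix of the four numeric iris columns
pairs(iris[, 1:4])

# The corresponding Pearson correlation matrix
cor(iris[, 1:4])
```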
Pearson correlation coefficient
r=\frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^{n}(x_i-\bar x)^2\sum_{i=1}^{n}(y_i-\bar y)^2}}
Spearman rank correlation coefficient

r_s=1-\frac{6\sum_{i=1}^{n}(R_i-Q_i)^2}{n(n^2-1)}

where R_i and Q_i are the ranks of x_i and y_i within their respective variables. The Spearman correlation only requires the two variables to be related by a strictly monotonic function, while the Pearson correlation requires a linear relationship between them.
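The difference can be seen on a strictly monotonic but nonlinear relationship (the choice y = x³ is my own example):

```r
x <- 1:20
y <- x^3                        # strictly monotonic, but not linear, in x
cor(x, y)                       # Pearson: below 1, the relation is not linear
cor(x, y, method = "spearman")  # Spearman: 1, since the ranks agree perfectly
```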
Coefficient of determination: the square of the correlation coefficient, r^2, measures the proportion of the variation in y explained by the regression equation; the closer it is to 1, the stronger the linear relationship between x and y.
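For simple linear regression, r² equals the R-squared reported by lm(); a small check (the data vectors are made up for illustration):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
r2 <- cor(x, y)^2               # squared Pearson correlation
fit <- summary(lm(y ~ x))       # simple linear regression of y on x
c(r2, fit$r.squared)            # the two values coincide
```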
# Correlation analysis of catering sales data
# Set the workspace: copy the "data and program" folder to the F disk,
# then use setwd() to set the working directory
setwd("F:/data and program/chapter3/sample program")

# Read the data
cordata <- read.csv(file = "./data/catering_sale_all.csv", header = TRUE)

# Correlation coefficient matrix
cor(cordata[, 2:11])
3.3 R language main data exploration functions
Statistical graphing functions
barplot(): Draw a simple bar graph
## barplot(X, horiz = FALSE, main, xlab, ylab); horiz is a logical value,
## TRUE draws a horizontal bar chart
x <- sample(rep(c("A", "B", "C"), 20), 50)  # random vector containing A, B, C
counts <- table(x)                          # counts of the three levels
barplot(counts)                             # bar chart of the three counts
pie(): draws a pie chart; R normalizes the values by their sum, so each slice's area is proportional to its share of the total.
## pie(X) draws a pie chart
x <- sample(rep(c("A", "B", "C"), 20), 50)  # random vector containing A, B, C
counts <- table(x)                          # counts of the three levels
pct <- round(counts / sum(counts) * 100)    # percentages for the labels
lbls <- paste(c("A", "B", "C"), pct, "%")
pie(counts, labels = lbls)                  # draw the pie chart
hist(): histogram
## hist(X, freq = TRUE): freq = TRUE plots counts (frequencies),
## freq = FALSE plots the density
x <- sample(1:999, 100)                          # 100 random numbers between 1 and 999
hist(x, freq = FALSE, breaks = 7, main = "H x")  # density histogram
lines(density(x), col = "red")                   # add a density curve
boxplot(): box plot
## boxplot(X, notch, horizontal); when notch = TRUE a notched box plot is drawn
x1 <- c(rnorm(50, 5, 2), 11, 1)  # 50 normal numbers, mean 5, sd 2, plus 2 constants
x2 <- c(rnorm(50, 7, 4), 10, 2)
boxplot(x1, x2, notch = TRUE)    # notched box plots