Data sharing | R: logistic regression, linear discriminant analysis (LDA), GAM, MARS, KNN, QDA, decision tree, random forest, and SVM classification of wine quality with cross-validation and ROC…

Full text link: http://tecdat.cn/?p=27384

In this article, the data contains information about the Portuguese wine "Vinho Verde" (click "Read More" at the end of the article to get the full code and data).

Introduction

This data set (see the end of the article for how to obtain the data) has 1599 observations and 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chloride, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality. Fixed acidity, volatile acidity, citric acid, residual sugar, chloride, free sulfur dioxide, total sulfur dioxide, density, pH, sulfate, and alcohol are the independent variables and are continuous. Quality is the dependent variable, measured on a score from 0 to 10.
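As a quick orientation, the data can be loaded and inspected as follows. This is a minimal sketch: the file name winequality-red.csv and the semicolon separator are assumptions based on the standard UCI distribution of this data set, and the library calls cover the functions used later in the article.

# Assumed file name/separator (UCI "winequality-red" distribution)
library(dplyr); library(caret); library(pROC)
library(tableone); library(ranger)
library(AppliedPredictiveModeling)  # transparentTheme()

wine <- read.csv("winequality-red.csv", sep = ";")
dim(wine)              # 1599 observations, 12 variables
summary(wine$quality)  # integer score from 0 to 10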


Exploratory analysis

A total of 855 wines were classified as "good" quality and 744 as "poor" quality. Fixed acidity, volatile acidity, citric acid, chloride, free sulfur dioxide, total sulfur dioxide, density, sulfate, and alcohol were significantly associated with wine quality (t-test P value < 0.05), indicating significant predictors. We also constructed density plots to explore the distributions of the 11 continuous variables across "poor" and "good" wine quality. As can be seen from the figure, pH shows little difference between good and poor wines, while the other variables differ between the two classes, which is consistent with the t-test results.
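The t-test code itself is not shown in the extract; a minimal sketch of how the per-variable tests could be run (it assumes the binary qual factor created in the next code block):

# Two-sample t-tests of each continuous predictor against good/poor quality
predictors <- setdiff(names(wine), c("quality", "qual"))
pvals <- sapply(predictors,
                function(v) t.test(wine[[v]] ~ wine$qual)$p.value)
sort(round(pvals, 4))  # predictors with p < 0.05 are treated as significant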

# Recode quality into a binary factor and drop the raw score
wine <- wine %>%
  na.omit() %>%
  mutate(qual = case_when(quality > 5 ~ "good",
                          quality <= 5 ~ "poor")) %>%
  mutate(qual = as.factor(qual)) %>%
  dplyr::select(-quality)

# Density plots of the predictors by class
transparentTheme(trans = .4)
featurePlot(x = wine %>% dplyr::select(-qual),
            y = wine$qual,
            plot = "density", pch = "|",
            auto.key = list(columns = 2))


Figure 1. Density plots of the predictor variables by wine quality.
Table 1. Basic characteristics of good and bad wines.

# Create Table 1 with the variables we want
# listVars: vector of the 11 predictor names (definition not shown in the original)
tab1 <- CreateTableOne(vars = listVars, strata = "qual", data = wine)
tab1



Model

We randomly selected 70% of the observations as training data and the rest as testing data. All 11 predictor variables were included in the analysis. We used linear methods, non-linear methods, tree methods, and support vector machines to predict the classification of wine quality. For linear methods, we trained (penalized) logistic regression models and linear discriminant analysis (LDA). The assumptions of logistic regression include independent observations and a linear relationship between the independent variables and the log odds. LDA and QDA assume normality, i.e., the predictor variables are normally distributed within both the "good" and "poor" classes. For nonlinear models, we fitted generalized additive models (GAM), multivariate adaptive regression splines (MARS), KNN models, and quadratic discriminant analysis (QDA). For tree models, we fitted classification tree and random forest models. SVMs with linear and radial kernels were also fitted. We calculated the ROC and accuracy for model selection and investigated variable importance. 10-fold cross-validation (CV) was used for all models.

indexTrain <- createDataPartition(y = wine$qual, p = 0.7, list = FALSE)
trainData <- wine[indexTrain, ]
testData  <- wine[-indexTrain, ]

Linear model

Multiple logistic regression showed that among the 11 predictors, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulfate, and alcohol were significantly associated with wine quality (P value < 0.05), explaining 25.1% of the total variance in wine quality. When the model was applied to the test data, the accuracy was 0.75 (95% CI: 0.71-0.79) and the ROC was 0.818, indicating a good fit to the data. When performing penalized logistic regression, we found that when maximizing ROC, the best tuning parameters were alpha = 1 and lambda = 0.00086, the accuracy was 0.75 (95% CI: 0.71-0.79), and the ROC was also 0.818. Since lambda is close to zero and the ROC is the same as that of the unpenalized logistic regression model, the penalty is relatively small.

However, because logistic regression requires little or no multicollinearity among the independent variables, the model may be disturbed by collinearity (if any) among the 11 predictor variables. As for LDA, when applying the model to the test data, the ROC was 0.819 and the accuracy was 0.762 (95% CI: 0.72-0.80). The most important variables in predicting wine quality are alcohol, volatile acidity and sulfates. Compared to logistic regression models, LDA is more helpful in cases of smaller sample sizes or good class separation, provided normality assumptions are met.
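The LDA fitting code does not appear in the extract; a minimal sketch that mirrors the caret pattern used for the other models (ctrl is the shared trainControl object defined in the logistic regression block below):

### LDA (sketch; reuses trainData/testData and the shared ctrl object)
set.seed(1)
model.lda <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "lda",
                   metric = "ROC",
                   trControl = ctrl)
test.pred.prob.lda <- predict(model.lda, newdata = testData, type = "prob")
roc.lda <- roc(testData$qual, test.pred.prob.lda$good)  # reported test ROC: 0.819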

### Logistic regression
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1)
model.glm <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "glm",
                   metric = "ROC",
                   trControl = ctrl)

# Check the importance of predictors
summary(model.glm)


# Create confusion matrix
test.pred.prob <- predict(model.glm, newdata = testData, type = "prob")
test.pred <- rep("good", length(test.pred.prob$good))
test.pred[test.pred.prob$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred),
                reference = testData$qual,
                positive = "good")


# Draw the test ROC curve
roc.glm <- roc(testData$qual, test.pred.prob$good)
plot(roc.glm, legacy.axes = TRUE, print.auc = TRUE)


## Test error and training error
test.error.glm <- mean(testData$qual != test.pred)
train.pred.prob.glm <- predict(model.glm, newdata = trainData, type = "prob")

## Penalized logistic regression (glmnet); the tuning grid is not legible in
## the original, so the values below are assumptions
glmnGrid <- expand.grid(alpha = seq(0, 1, length = 6),
                        lambda = exp(seq(-8, -1, length = 20)))
set.seed(1)
model.glmn <- train(x = trainData %>% dplyr::select(-qual),
                    y = trainData$qual,
                    method = "glmnet",
                    tuneGrid = glmnGrid,
                    metric = "ROC",
                    trControl = ctrl)
plot(model.glmn, xTrans = function(x) log(x))


# Select the best tuning parameters
model.glmn$bestTune


# Create confusion matrix
test.pred.prob2 <- predict(model.glmn, newdata = testData, type = "prob")
test.pred2 <- rep("good", length(test.pred.prob2$good))
test.pred2[test.pred.prob2$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred2),
                reference = testData$qual,
                positive = "good")


Nonlinear model

In the GAM model, only volatile acidity has degrees of freedom equal to 1, indicating a linear association, while smoothing splines are applied to the other 10 variables.

The results showed that alcohol, citric acid, residual sugar, sulfate, fixed acidity, volatile acidity, chloride and total sulfur dioxide were significant predictors (P value <0.05).

Overall, these variables explained 39.1% of the total variation in wine quality. A confusion matrix using the test data showed that GAM had an accuracy of 0.76 (95%CI: 0.72-0.80) and an ROC of 0.829.

The MARS model shows that when maximizing ROC we include 5 terms among 11 predictors with nprune equal to 5 and degree 2. Together, these predictors and hinge functions explained 32.2% of the total variance. According to MARS output, the three most important predictors are total sulfur dioxide, alcohol, and sulfate.

When the MARS model was applied to the test data, the accuracy was 0.75 (95% CI: 0.72-0.80) and the ROC was 0.823. We also fitted a KNN model for classification. ROC is maximized when k equals 22. The accuracy of the KNN model is 0.63 (95% CI: 0.59-0.68) and the ROC is 0.672.

The QDA model showed an ROC of 0.784 and an accuracy of 0.71 (95% CI: 0.66-0.75). The most important variables in predicting wine quality are alcohol, volatile acidity, and sulfates.

The advantage of GAM and MARS is that both models are nonparametric and capable of handling highly complex nonlinear relationships. Specifically, MARS models can include potential interactions in the model. However, limitations of both models are due to model complexity, time-consuming calculations, and a high tendency to overfit. For the KNN model, when k is large, the prediction may be inaccurate.

### GAM
set.seed(1)
model.gam <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "gam",
                   metric = "ROC",
                   trControl = ctrl)

model.gam$finalModel


summary(model.gam)


# Create confusion matrix
test.pred.prob3 <- predict(model.gam, newdata = testData, type = "prob")
test.pred3 <- rep("good", length(test.pred.prob3$good))
test.pred3[test.pred.prob3$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred3),
                reference = testData$qual,
                positive = "good")

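The MARS training call itself is not visible in the extract; a minimal sketch assuming caret's "earth" method, with a tuning grid chosen to be consistent with the reported best values (nprune = 5, degree = 2):

### MARS (sketch; tuning grid assumed)
marsGrid <- expand.grid(degree = 1:3, nprune = 2:15)
set.seed(1)
model.mars <- train(x = trainData %>% dplyr::select(-qual),
                    y = trainData$qual,
                    method = "earth",
                    tuneGrid = marsGrid,
                    metric = "ROC",
                    trControl = ctrl)
test.pred.prob4 <- predict(model.mars, newdata = testData, type = "prob")
train.pred.mars <- predict(model.mars, newdata = trainData, type = "raw")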

model.mars$finalModel


vip(model.mars$finalModel)


# Draw the test ROC curve
roc.mars <- roc(testData$qual, test.pred.prob4$good)
## Setting levels: control = good, case = poor
## Setting direction: controls > cases
plot(roc.mars, legacy.axes = TRUE, print.auc = TRUE)
plot(smooth(roc.mars), col = 4, add = TRUE)


## MARS training error
error.train.mars <- mean(trainData$qual != train.pred.mars)

### KNN
knnGrid <- expand.grid(k = seq(from = 1, to = 40, by = 1))
set.seed(1)
fit.knn <- train(qual ~ .,
                 data = trainData,
                 method = "knn",
                 metric = "ROC",
                 tuneGrid = knnGrid,
                 trControl = ctrl)
ggplot(fit.knn)


# Create confusion matrix
test.pred.prob7 <- predict(fit.knn, newdata = testData, type = "prob")
test.pred7 <- ifelse(test.pred.prob7$good >= 0.5, "good", "poor")
confusionMatrix(data = as.factor(test.pred7),
                reference = testData$qual,
                positive = "good")


### QDA
set.seed(1)
model.qda <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "qda",
                   metric = "ROC",
                   trControl = ctrl)

# Create confusion matrix
test.pred.prob6 <- predict(model.qda, newdata = testData, type = "prob")
test.pred6 <- rep("good", length(test.pred.prob6$good))
test.pred6[test.pred.prob6$good < 0.5] <- "poor"


Tree method

Based on the classification tree, the final tree size when maximizing AUC is 41. The test error rate was 0.24 and the ROC was 0.809. The accuracy of this classification tree was 0.76 (95% CI: 0.72-0.80). We also fitted a random forest to study variable importance. In the random forest, alcohol is the most important variable, followed by sulfate, volatile acidity, total sulfur dioxide, density, chloride, fixed acidity, citric acid, free sulfur dioxide, and residual sugar; pH is the least important variable. For the random forest model, the test error rate was 0.163, the accuracy was 0.84 (95% CI: 0.80-0.87), and the ROC was 0.900. A potential limitation of tree methods is that they are sensitive to changes in the data: small changes in the data may cause large changes in the resulting tree.

# Classification tree
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(1)
# cp grid endpoints are not legible in the original; values below are assumptions
rpart_grid <- data.frame(cp = exp(seq(-10, -4, len = 50)))
class.tree <- train(qual ~ ., data = trainData,
                    method = "rpart",
                    tuneGrid = rpart_grid,
                    metric = "ROC",
                    trControl = ctrl)
ggplot(class.tree, highlight = TRUE)


## Calculate test error
rpart.pred <- predict(class.tree, newdata = testData, type = "raw")
test.error.ctree <- mean(testData$qual != rpart.pred)
rpart.pred_train <- predict(class.tree, newdata = trainData, type = "raw")



# Create confusion matrix
test.pred.prob8 <- predict(class.tree, newdata = testData, type = "prob")
test.pred8 <- rep("good", length(test.pred.prob8$good))
test.pred8[test.pred.prob8$good < 0.5] <- "poor"


# Draw the test ROC curve
roc.ctree <- roc(testData$qual, test.pred.prob8$good)
plot(roc.ctree, legacy.axes = TRUE, print.auc = TRUE)
plot(smooth(roc.ctree), col = 4, add = TRUE)


# Random forest and variable importance
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
rf.grid <- expand.grid(mtry = 1:10,
                       splitrule = "gini",
                       min.node.size = seq(from = 1, to = 12, by = 2))
set.seed(1)
rf.fit <- train(qual ~ .,
                data = trainData,
                method = "ranger",
                tuneGrid = rf.grid,
                metric = "ROC",
                trControl = ctrl)
ggplot(rf.fit, highlight = TRUE)


# Refit with permutation importance and plot it (the ranger call is only
# partially legible in the original; hyperparameters are taken from rf.fit)
random.forest <- ranger(qual ~ ., data = trainData,
                        mtry = rf.fit$bestTune$mtry,
                        min.node.size = rf.fit$bestTune$min.node.size,
                        importance = "permutation",
                        scale.permutation.importance = TRUE)
barplot(sort(ranger::importance(random.forest)), las = 2, horiz = TRUE)

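The code behind the reported random forest test error (0.163) and ROC (0.900) is not shown; a sketch following the same pattern as the other models:

# Random forest test error and ROC (sketch)
rf.pred <- predict(rf.fit, newdata = testData, type = "raw")
test.error.rf <- mean(testData$qual != rf.pred)       # reported: 0.163
rf.pred.prob <- predict(rf.fit, newdata = testData, type = "prob")
roc.rf <- roc(testData$qual, rf.pred.prob$good)       # reported ROC: 0.900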

Support Vector Machine

We use SVM with a linear kernel and tune the cost parameter. We found that the model maximizing ROC had cost = 0.59078. The model had an ROC of 0.816 and an accuracy of 0.75 (test error of 0.25) (95% CI: 0.71-0.79). The most important variable for quality prediction is alcohol; volatile acidity and total sulfur dioxide are also important. If the true boundary is non-linear, an SVM with a radial kernel performs better.

set.seed(1)
# cost grid endpoints are not fully legible in the original; values assumed
svml.fit <- train(qual ~ .,
                  data = trainData,
                  method = "svmLinear2",
                  tuneGrid = data.frame(cost = exp(seq(-2, 5, len = 50))),
                  metric = "ROC",
                  trControl = ctrl)

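The evaluation code for the linear SVM is likewise not visible; a minimal sketch using raw class predictions for accuracy and test error:

# Linear SVM test performance (sketch)
svml.pred <- predict(svml.fit, newdata = testData)
test.error.svml <- mean(testData$qual != svml.pred)   # reported test error: 0.25
confusionMatrix(data = svml.pred,
                reference = testData$qual,
                positive = "good")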

## SVM with radial kernel
# grid endpoints are only partially legible in the original; values assumed
svmr.grid <- expand.grid(C = exp(seq(1, 4, len = 10)),
                         sigma = exp(seq(-8, 0, len = 10)))
set.seed(1)
svmr.fit <- train(qual ~ .,
                  data = trainData,
                  method = "svmRadialSigma",
                  preProcess = c("center", "scale"),
                  tuneGrid = svmr.grid,
                  metric = "ROC",
                  trControl = ctrl)


Model comparison

After the models were built, we performed model comparisons based on the training and testing performance of all models. The table below shows the cross-validated classification error rate and ROC for all models. In the results, the random forest model has the largest AUC value, while the KNN has the smallest. Therefore, we selected the random forest model as the best predictive classification model for our data. Based on the random forest model, alcohol, sulfate, volatile acidity, total sulfur dioxide, and density were the top 5 significant predictors that helped us predict wine quality classification. Such findings are in line with our expectations since factors such as alcohol, sulfates, and volatile acidity may determine the flavor and mouthfeel of wine. When looking at the summary of each model, we realized that the KNN model had the lowest AUC value and the highest test classification error rate of 0.367. The other nine models had similar AUC values of approximately 82%.

# The model list is reconstructed from the models fitted above
resamp <- resamples(list(glm = model.glm, glmn = model.glmn,
                         lda = model.lda, gam = model.gam,
                         mars = model.mars, knn = fit.knn,
                         qda = model.qda, ctree = class.tree,
                         rf = rf.fit, svml = svml.fit, svmr = svmr.fit))
summary(resamp)


comparison <- summary(resamp)$statistics$ROC
knitr::kable(comparison[, 1:6])


bwplot(resamp, metric = "ROC")


# Model_Name, Train_Error, Test_Error, Test_ROC are assembled from the results above
df <- data.frame(Model_Name, Train_Error, Test_Error, Test_ROC)
knitr::kable(df)


Conclusion

The model building process showed that alcohol, sulfates, volatile acidity, total sulfur dioxide, and density were the top 5 significant predictors of wine quality classification in the training dataset. We selected the random forest model because it had the largest AUC value and the lowest classification error rate. The model also performs well on the test dataset. Therefore, this random forest model is an effective method for wine quality classification.

Data acquisition

Reply "WineData" in the background of the official account below to get the complete data.


Click "Read original text" at the end of the article to get the full text and complete information.

This article is selected from "R Language Penalized Logistic Regression, Linear Discriminant Analysis LDA, Generalized Additive Model GAM, Multivariate Adaptive Regression Splines MARS, KNN, Quadratic Discriminant Analysis QDA, Decision Tree, Random Forest, Support Vector Machine SVM Classification of Good and Bad Quality Wine, 10-fold Cross-Validation and ROC Visualization".


