Data sharing | R: logistic regression, linear discriminant analysis (LDA), GAM, MARS, KNN, QDA, decision tree, random forest, and SVM classification of wine quality with cross-validation and ROC…

Full text link: http://tecdat.cn/?p=27384

In this article, the data contains information about the Portuguese wine "Vinho Verde" (click "Read More" at the end of the article to get the full code and data).

Introduction

This data set (see the end of the article for how to obtain the data) has 1599 observations and 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chloride, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality. Fixed acidity, volatile acidity, citric acid, residual sugar, chloride, free sulfur dioxide, total sulfur dioxide, density, pH, sulfate, and alcohol are the independent variables and are continuous. Quality is the dependent variable, measured on a score from 0 to 10.
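As a quick orientation, the data can be loaded and inspected as follows. This is a minimal sketch: the file name winequality-red.csv and the semicolon separator are assumptions based on the standard UCI distribution of this data set, and the library calls cover the functions used later in the article.

# Assumed file name/separator (UCI "winequality-red" distribution)
library(dplyr); library(caret); library(pROC)
library(tableone); library(ranger)
library(AppliedPredictiveModeling)  # transparentTheme()

wine <- read.csv("winequality-red.csv", sep = ";")
dim(wine)              # 1599 observations, 12 variables
summary(wine$quality)  # integer score from 0 to 10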


Exploratory analysis

A total of 855 wines were classified as "good" quality and 744 as "poor" quality. Fixed acidity, volatile acidity, citric acid, chloride, free sulfur dioxide, total sulfur dioxide, density, sulfate, and alcohol were significantly associated with wine quality (t-test P value < 0.05), indicating significant predictors. We also constructed density plots to explore the distributions of the 11 continuous variables across "poor" and "good" wine quality. As can be seen from the figure, pH shows little difference between good and poor wines, while the other variables differ between the two classes, which is consistent with the t-test results.
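The t-test code itself is not shown in the extract; a minimal sketch of how the per-variable tests could be run (it assumes the binary qual factor created in the next code block):

# Two-sample t-tests of each continuous predictor against good/poor quality
predictors <- setdiff(names(wine), c("quality", "qual"))
pvals <- sapply(predictors,
                function(v) t.test(wine[[v]] ~ wine$qual)$p.value)
sort(round(pvals, 4))  # predictors with p < 0.05 are treated as significant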

# Recode quality into a binary factor and drop the raw score
wine <- wine %>%
  na.omit() %>%
  mutate(qual = case_when(quality > 5 ~ "good",
                          quality <= 5 ~ "poor")) %>%
  mutate(qual = as.factor(qual)) %>%
  dplyr::select(-quality)

# Density plots of the predictors by class
transparentTheme(trans = .4)
featurePlot(x = wine %>% dplyr::select(-qual),
            y = wine$qual,
            plot = "density", pch = "|",
            auto.key = list(columns = 2))


Figure 1. Density plots of the predictor variables by wine quality.
Table 1. Basic characteristics of good and bad wines.

# Create Table 1 with the variables we want
# listVars: vector of the 11 predictor names (definition not shown in the original)
tab1 <- CreateTableOne(vars = listVars, strata = "qual", data = wine)
tab1



Model

We randomly selected 70% of the observations as training data and the rest as testing data. All 11 predictor variables were included in the analysis. We used linear methods, non-linear methods, tree methods, and support vector machines to predict the classification of wine quality. For linear methods, we trained (penalized) logistic regression models and linear discriminant analysis (LDA). The assumptions of logistic regression include independent observations and a linear relationship between the independent variables and the log odds. LDA and QDA assume normality, i.e., the predictor variables are normally distributed within both the "good" and "poor" classes. For nonlinear models, we fitted generalized additive models (GAM), multivariate adaptive regression splines (MARS), KNN models, and quadratic discriminant analysis (QDA). For tree models, we fitted classification tree and random forest models. SVMs with linear and radial kernels were also fitted. We calculated the ROC and accuracy for model selection and investigated variable importance. 10-fold cross-validation (CV) was used for all models.

indexTrain <- createDataPartition(y = wine$qual, p = 0.7, list = FALSE)
trainData <- wine[indexTrain, ]
testData  <- wine[-indexTrain, ]

Linear model

Multiple logistic regression showed that among the 11 predictors, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulfate, and alcohol were significantly associated with wine quality (P value < 0.05), explaining 25.1% of the total variance in wine quality. When the model was applied to the test data, the accuracy was 0.75 (95% CI: 0.71-0.79) and the ROC was 0.818, indicating a good fit to the data. When performing penalized logistic regression, we found that when maximizing ROC, the best tuning parameters were alpha = 1 and lambda = 0.00086, the accuracy was 0.75 (95% CI: 0.71-0.79), and the ROC was also 0.818. Since lambda is close to zero and the ROC is the same as that of the unpenalized logistic regression model, the penalty is relatively small.

However, because logistic regression requires little or no multicollinearity among the independent variables, the model may be disturbed by collinearity (if any) among the 11 predictor variables. As for LDA, when applying the model to the test data, the ROC was 0.819 and the accuracy was 0.762 (95% CI: 0.72-0.80). The most important variables in predicting wine quality are alcohol, volatile acidity and sulfates. Compared to logistic regression models, LDA is more helpful in cases of smaller sample sizes or good class separation, provided normality assumptions are met.
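The LDA fitting code does not appear in the extract; a minimal sketch that mirrors the caret pattern used for the other models (ctrl is the shared trainControl object defined in the logistic regression block below):

### LDA (sketch; reuses trainData/testData and the shared ctrl object)
set.seed(1)
model.lda <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "lda",
                   metric = "ROC",
                   trControl = ctrl)
test.pred.prob.lda <- predict(model.lda, newdata = testData, type = "prob")
roc.lda <- roc(testData$qual, test.pred.prob.lda$good)  # reported test ROC: 0.819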

### Logistic regression
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(1)
model.glm <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "glm",
                   metric = "ROC",
                   trControl = ctrl)

# Check the importance of predictors
summary(model.glm)


# Create confusion matrix
test.pred.prob <- predict(model.glm, newdata = testData, type = "prob")
test.pred <- rep("good", length(test.pred.prob$good))
test.pred[test.pred.prob$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred),
                reference = testData$qual,
                positive = "good")


# Draw the test ROC curve
roc.glm <- roc(testData$qual, test.pred.prob$good)
plot(roc.glm, legacy.axes = TRUE, print.auc = TRUE)


## Test error and training error
test.error.glm <- mean(testData$qual != test.pred)
train.pred.prob.glm <- predict(model.glm, newdata = trainData, type = "prob")

## Penalized logistic regression (glmnet); the tuning grid is not legible in
## the original, so the values below are assumptions
glmnGrid <- expand.grid(alpha = seq(0, 1, length = 6),
                        lambda = exp(seq(-8, -1, length = 20)))
set.seed(1)
model.glmn <- train(x = trainData %>% dplyr::select(-qual),
                    y = trainData$qual,
                    method = "glmnet",
                    tuneGrid = glmnGrid,
                    metric = "ROC",
                    trControl = ctrl)
plot(model.glmn, xTrans = function(x) log(x))


# Select the best tuning parameters
model.glmn$bestTune


# Create confusion matrix
test.pred.prob2 <- predict(model.glmn, newdata = testData, type = "prob")
test.pred2 <- rep("good", length(test.pred.prob2$good))
test.pred2[test.pred.prob2$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred2),
                reference = testData$qual,
                positive = "good")


Nonlinear model

In the GAM model, only volatile acidity has degrees of freedom equal to 1, indicating a linear association, while smoothing splines are applied to the other 10 variables.

The results showed that alcohol, citric acid, residual sugar, sulfate, fixed acidity, volatile acidity, chloride and total sulfur dioxide were significant predictors (P value <0.05).

Overall, these variables explained 39.1% of the total variation in wine quality. A confusion matrix using the test data showed that GAM had an accuracy of 0.76 (95%CI: 0.72-0.80) and an ROC of 0.829.

The MARS model shows that when maximizing ROC we include 5 terms among 11 predictors with nprune equal to 5 and degree 2. Together, these predictors and hinge functions explained 32.2% of the total variance. According to MARS output, the three most important predictors are total sulfur dioxide, alcohol, and sulfate.

When the MARS model was applied to the test data, the accuracy was 0.75 (95% CI: 0.72-0.80) and the ROC was 0.823. We also fitted a KNN model for classification. ROC is maximized when k equals 22. The accuracy of the KNN model is 0.63 (95% CI: 0.59-0.68) and the ROC is 0.672.

The QDA model showed an ROC of 0.784 and an accuracy of 0.71 (95% CI: 0.66-0.75). The most important variables in predicting wine quality are alcohol, volatile acidity, and sulfates.

The advantage of GAM and MARS is that both models are nonparametric and capable of handling highly complex nonlinear relationships. Specifically, MARS models can include potential interactions in the model. However, limitations of both models are due to model complexity, time-consuming calculations, and a high tendency to overfit. For the KNN model, when k is large, the prediction may be inaccurate.

### GAM
set.seed(1)
model.gam <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "gam",
                   metric = "ROC",
                   trControl = ctrl)

model.gam$finalModel


summary(model.gam)


# Create confusion matrix
test.pred.prob3 <- predict(model.gam, newdata = testData, type = "prob")
test.pred3 <- rep("good", length(test.pred.prob3$good))
test.pred3[test.pred.prob3$good < 0.5] <- "poor"
confusionMatrix(data = as.factor(test.pred3),
                reference = testData$qual,
                positive = "good")

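The MARS training call itself is not visible in the extract; a minimal sketch assuming caret's "earth" method, with a tuning grid chosen to be consistent with the reported best values (nprune = 5, degree = 2):

### MARS (sketch; tuning grid assumed)
marsGrid <- expand.grid(degree = 1:3, nprune = 2:15)
set.seed(1)
model.mars <- train(x = trainData %>% dplyr::select(-qual),
                    y = trainData$qual,
                    method = "earth",
                    tuneGrid = marsGrid,
                    metric = "ROC",
                    trControl = ctrl)
test.pred.prob4 <- predict(model.mars, newdata = testData, type = "prob")
train.pred.mars <- predict(model.mars, newdata = trainData, type = "raw")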

model.mars$finalModel


vip(model.mars$finalModel)


# Draw the test ROC curve
roc.mars <- roc(testData$qual, test.pred.prob4$good)
## Setting levels: control = good, case = poor
## Setting direction: controls > cases
plot(roc.mars, legacy.axes = TRUE, print.auc = TRUE)
plot(smooth(roc.mars), col = 4, add = TRUE)


## MARS training error
error.train.mars <- mean(trainData$qual != train.pred.mars)

### KNN
knnGrid <- expand.grid(k = seq(from = 1, to = 40, by = 1))
set.seed(1)
fit.knn <- train(qual ~ .,
                 data = trainData,
                 method = "knn",
                 metric = "ROC",
                 tuneGrid = knnGrid,
                 trControl = ctrl)
ggplot(fit.knn)


# Create confusion matrix
test.pred.prob7 <- predict(fit.knn, newdata = testData, type = "prob")
test.pred7 <- ifelse(test.pred.prob7$good >= 0.5, "good", "poor")
confusionMatrix(data = as.factor(test.pred7),
                reference = testData$qual,
                positive = "good")


### QDA
set.seed(1)
model.qda <- train(x = trainData %>% dplyr::select(-qual),
                   y = trainData$qual,
                   method = "qda",
                   metric = "ROC",
                   trControl = ctrl)

# Create confusion matrix
test.pred.prob6 <- predict(model.qda, newdata = testData, type = "prob")
test.pred6 <- rep("good", length(test.pred.prob6$good))
test.pred6[test.pred.prob6$good < 0.5] <- "poor"


Tree method

Based on the classification tree, the final tree size when maximizing AUC is 41. The test error rate was 0.24 and the ROC was 0.809. The accuracy of this classification tree was 0.76 (95% CI: 0.72-0.80). We also fitted a random forest to study variable importance. In the random forest, alcohol is the most important variable, followed by sulfate, volatile acidity, total sulfur dioxide, density, chloride, fixed acidity, citric acid, free sulfur dioxide, and residual sugar; pH is the least important variable. For the random forest model, the test error rate was 0.163, the accuracy was 0.84 (95% CI: 0.80-0.87), and the ROC was 0.900. A potential limitation of tree methods is that they are sensitive to changes in the data: small changes in the data may cause large changes in the resulting tree.

# Classification tree
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(1)
# cp grid endpoints are not legible in the original; values below are assumptions
rpart_grid <- data.frame(cp = exp(seq(-10, -4, len = 50)))
class.tree <- train(qual ~ ., data = trainData,
                    method = "rpart",
                    tuneGrid = rpart_grid,
                    metric = "ROC",
                    trControl = ctrl)
ggplot(class.tree, highlight = TRUE)


## Calculate test error
rpart.pred <- predict(class.tree, newdata = testData, type = "raw")
test.error.ctree <- mean(testData$qual != rpart.pred)
rpart.pred_train <- predict(class.tree, newdata = trainData, type = "raw")



# Create confusion matrix
test.pred.prob8 <- predict(class.tree, newdata = testData, type = "prob")
test.pred8 <- rep("good", length(test.pred.prob8$good))
test.pred8[test.pred.prob8$good < 0.5] <- "poor"


# Draw the test ROC curve
roc.ctree <- roc(testData$qual, test.pred.prob8$good)
plot(roc.ctree, legacy.axes = TRUE, print.auc = TRUE)
plot(smooth(roc.ctree), col = 4, add = TRUE)


# Random forest and variable importance
ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
rf.grid <- expand.grid(mtry = 1:10,
                       splitrule = "gini",
                       min.node.size = seq(from = 1, to = 12, by = 2))
set.seed(1)
rf.fit <- train(qual ~ .,
                data = trainData,
                method = "ranger",
                tuneGrid = rf.grid,
                metric = "ROC",
                trControl = ctrl)
ggplot(rf.fit, highlight = TRUE)


# Refit with permutation importance and plot it (the ranger call is only
# partially legible in the original; hyperparameters are taken from rf.fit)
random.forest <- ranger(qual ~ ., data = trainData,
                        mtry = rf.fit$bestTune$mtry,
                        min.node.size = rf.fit$bestTune$min.node.size,
                        importance = "permutation",
                        scale.permutation.importance = TRUE)
barplot(sort(ranger::importance(random.forest)), las = 2, horiz = TRUE)

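The code behind the reported random forest test error (0.163) and ROC (0.900) is not shown; a sketch following the same pattern as the other models:

# Random forest test error and ROC (sketch)
rf.pred <- predict(rf.fit, newdata = testData, type = "raw")
test.error.rf <- mean(testData$qual != rf.pred)       # reported: 0.163
rf.pred.prob <- predict(rf.fit, newdata = testData, type = "prob")
roc.rf <- roc(testData$qual, rf.pred.prob$good)       # reported ROC: 0.900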

Support Vector Machine

We use SVM with a linear kernel and tune the cost parameter. We found that the model maximizing ROC had cost = 0.59078. The model had an ROC of 0.816 and an accuracy of 0.75 (test error of 0.25) (95% CI: 0.71-0.79). The most important variable for quality prediction is alcohol; volatile acidity and total sulfur dioxide are also important. If the true boundary is non-linear, an SVM with a radial kernel performs better.

set.seed(1)
# cost grid endpoints are not fully legible in the original; values assumed
svml.fit <- train(qual ~ .,
                  data = trainData,
                  method = "svmLinear2",
                  tuneGrid = data.frame(cost = exp(seq(-2, 5, len = 50))),
                  metric = "ROC",
                  trControl = ctrl)

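The evaluation code for the linear SVM is likewise not visible; a minimal sketch using raw class predictions for accuracy and test error:

# Linear SVM test performance (sketch)
svml.pred <- predict(svml.fit, newdata = testData)
test.error.svml <- mean(testData$qual != svml.pred)   # reported test error: 0.25
confusionMatrix(data = svml.pred,
                reference = testData$qual,
                positive = "good")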

## SVM with radial kernel
# grid endpoints are only partially legible in the original; values assumed
svmr.grid <- expand.grid(C = exp(seq(1, 4, len = 10)),
                         sigma = exp(seq(-8, 0, len = 10)))
set.seed(1)
svmr.fit <- train(qual ~ .,
                  data = trainData,
                  method = "svmRadialSigma",
                  preProcess = c("center", "scale"),
                  tuneGrid = svmr.grid,
                  metric = "ROC",
                  trControl = ctrl)


Model comparison

After the models were built, we performed model comparisons based on the training and testing performance of all models. The table below shows the cross-validated classification error rate and ROC for all models. In the results, the random forest model has the largest AUC value, while the KNN has the smallest. Therefore, we selected the random forest model as the best predictive classification model for our data. Based on the random forest model, alcohol, sulfate, volatile acidity, total sulfur dioxide, and density were the top 5 significant predictors that helped us predict wine quality classification. Such findings are in line with our expectations since factors such as alcohol, sulfates, and volatile acidity may determine the flavor and mouthfeel of wine. When looking at the summary of each model, we realized that the KNN model had the lowest AUC value and the highest test classification error rate of 0.367. The other nine models had similar AUC values of approximately 82%.

# The model list is reconstructed from the models fitted above
resamp <- resamples(list(glm = model.glm, glmn = model.glmn,
                         lda = model.lda, gam = model.gam,
                         mars = model.mars, knn = fit.knn,
                         qda = model.qda, ctree = class.tree,
                         rf = rf.fit, svml = svml.fit, svmr = svmr.fit))
summary(resamp)


comparison <- summary(resamp)$statistics$ROC
knitr::kable(comparison[, 1:6])


bwplot(resamp, metric = "ROC")


# Model_Name, Train_Error, Test_Error, Test_ROC are assembled from the results above
df <- data.frame(Model_Name, Train_Error, Test_Error, Test_ROC)
knitr::kable(df)


Conclusion

The model building process showed that alcohol, sulfates, volatile acidity, total sulfur dioxide, and density were the top 5 significant predictors of wine quality classification in the training dataset. We selected the random forest model because it had the largest AUC value and the lowest classification error rate. The model also performs well on the test dataset. Therefore, this random forest model is an effective method for wine quality classification.

Data acquisition

Reply "WineData" in the background of the official account below to get the complete data.


Click "Read original text" at the end of the article to get the full text and complete information.

This article is selected from "R Language Penalized Logistic Regression, Linear Discriminant Analysis LDA, Generalized Additive Model GAM, Multivariate Adaptive Regression Splines MARS, KNN, Quadratic Discriminant Analysis QDA, Decision Tree, Random Forest, Support Vector Machine SVM Classification of Good and Bad Quality Wine, 10-fold Cross-Validation and ROC Visualization".


