In this article, the data contains information about the Portuguese wine “Vinho Verde” (click “Read More” at the end of the article to get the full code and data).


This data set has 1599 observations and 12 variables: fixed acidity, volatile acidity, citric acid, residual sugar, chloride, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, alcohol, and quality. The first 11 variables are the independent variables and are continuous. Quality is the dependent variable, measured on a score from 0 to 10.

Exploratory analysis

A total of 855 wines were classified as “good” quality and 744 wines were classified as “poor” quality. Fixed acidity, volatile acidity, citric acid, chloride, free sulfur dioxide, total sulfur dioxide, density, sulfate, and alcohol were significantly associated with wine quality (t-test P value < 0.05), which suggests they are useful predictors. We also constructed density plots to explore the distribution of the 11 continuous variables across “poor” and “good” wines. As can be seen from the figure, there is no visible difference in pH between good and poor wines, while the distributions of the other variables differ between the two classes, which is consistent with the t-test results.
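The group comparison described above can be sketched for a single variable as follows. This is only an illustration under stated assumptions: the article does not say which t-test variant was used, so Welch's unequal-variance form is assumed here, the p-value uses a large-sample normal approximation instead of Student's t distribution, and the alcohol values are synthetic rather than taken from the real data.

```python
import math
import random
from statistics import NormalDist, mean, variance


def welch_t_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for two samples."""
    var_a, var_b = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    t = (mean(a) - mean(b)) / se
    # Large-sample approximation: standard normal instead of Student's t.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical alcohol values for the two classes (synthetic, not the real data).
rng = random.Random(0)
good = [rng.gauss(10.9, 1.0) for _ in range(855)]
poor = [rng.gauss(9.9, 0.8) for _ in range(744)]

t_stat, p_value = welch_t_test(good, poor)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

With group sizes in the hundreds, the normal approximation is very close to the exact t-based p-value. For small samples a proper t distribution should be used instead.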

Figure 1. Descriptive plot between wine quality and predicted characteristics.
Table 1. Basic characteristics of good and bad wines.

We randomly selected 70% of the observations as training data and used the rest as test data. All 11 predictor variables were included in the analysis. We used linear methods, nonlinear methods, tree methods and support vector machines (SVM) to predict the classification of wine quality. For linear methods, we trained (penalized) logistic regression models and linear discriminant analysis (LDA). The assumptions of logistic regression include independent observations and a linear relationship between the independent variables and the log odds. LDA and QDA assume that the predictor variables are normally distributed within both the “good” and the “poor” class. For nonlinear models, we fitted generalized additive models (GAM), multivariate adaptive regression splines (MARS), k-nearest neighbors (KNN) models, and quadratic discriminant analysis (QDA). For tree models, we fitted classification tree and random forest models. SVMs with linear and radial kernels were also fitted. We calculated the ROC and accuracy for model selection and investigated the importance of variables. 10-fold cross-validation (CV) was used for all models.
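The 70/30 split and the 10-fold cross-validation can be sketched with index lists alone. The function names, the seed, and the plain (non-stratified) shuffling are assumptions made for this illustration. The article does not say whether the original split was stratified by class.

```python
import random


def train_test_split_indices(n, train_fraction=0.7, seed=1):
    """Shuffle 0..n-1 and cut it into training and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (fit, holdout) index lists; each index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, holdout


train_idx, test_idx = train_test_split_indices(1599)
cv_splits = list(k_fold(train_idx, k=10))
print(len(train_idx), len(test_idx), len(cv_splits))
```

Each model would be fitted on the `fit` indices and scored on the `holdout` indices of every fold. The test indices stay untouched until the final evaluation.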

Linear model

Multiple logistic regression showed that among the 11 predictors, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulfate and alcohol were significantly associated with wine quality (P value < 0.05), explaining 25.1% of the total variance in wine quality. When the model was applied to the test data, the accuracy was 0.75 (95% CI: 0.71-0.79) and the ROC was 0.818, indicating a good fit to the data. For penalized logistic regression, the tuning parameters that maximized ROC were alpha=1 and lambda=0.00086, giving an accuracy of 0.75 (95% CI: 0.71-0.79) and an ROC of 0.818 as well. Since lambda is close to zero and the ROC is the same as for the logistic regression model, the penalty is relatively small.

However, because logistic regression requires little or no multicollinearity among the independent variables, the model may be disturbed by collinearity (if any) among the 11 predictor variables. As for LDA, when applying the model to the test data, the ROC was 0.819 and the accuracy was 0.762 (95% CI: 0.72-0.80). The most important variables in predicting wine quality are alcohol, volatile acidity and sulfates. Compared to logistic regression models, LDA is more helpful in cases of smaller sample sizes or good class separation, provided normality assumptions are met.
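To make the tuning parameters of the penalized logistic regression more concrete, the sketch below evaluates an elastic-net penalty. It assumes the common parameterization in which alpha mixes the L1 and L2 terms and lambda scales the whole penalty. The article does not name the software it used, so this form is an assumption, and the coefficients are hypothetical.

```python
def elastic_net_penalty(coefs, alpha, lam):
    """Penalty lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|))."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


# Hypothetical slope coefficients, for illustration only.
coefs = [0.9, -0.6, 0.0, 0.3]

lasso = elastic_net_penalty(coefs, alpha=1.0, lam=0.00086)  # pure L1 penalty
ridge = elastic_net_penalty(coefs, alpha=0.0, lam=0.00086)  # pure L2 penalty
print(lasso, ridge)
```

Under this parameterization alpha=1 corresponds to a lasso penalty. A lambda as small as 0.00086 adds very little to the loss, which fits the observation that the penalized and unpenalized models perform the same.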

Nonlinear model

In the GAM model, only volatile acidity has degrees of freedom equal to 1, indicating a linear association, while smoothing splines are applied to the other 10 variables.

The results showed that alcohol, citric acid, residual sugar, sulfate, fixed acidity, volatile acidity, chloride and total sulfur dioxide were significant predictors (P value <0.05).

Overall, these variables explained 39.1% of the total variation in wine quality. A confusion matrix using the test data showed that GAM had an accuracy of 0.76 (95%CI: 0.72-0.80) and an ROC of 0.829.
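An accuracy figure with a 95% confidence interval, like the one above, can be derived from the confusion-matrix counts roughly as follows. The article does not state which interval method was used, so the Wilson score interval is swapped in here as one reasonable choice. The counts are hypothetical, and the test-set size of 480 is assumed from 30% of 1599 observations.

```python
import math


def accuracy_with_ci(n_correct, n_total, z=1.96):
    """Return (accuracy, lower, upper) using the Wilson score interval."""
    p = n_correct / n_total
    denom = 1.0 + z * z / n_total
    center = (p + z * z / (2.0 * n_total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n_total + z * z / (4.0 * n_total**2)) / denom
    return p, center - half, center + half


# Hypothetical counts: 365 correct predictions out of an assumed 480 test wines.
acc, low, high = accuracy_with_ci(365, 480)
print(f"accuracy = {acc:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With a test set of a few hundred wines, the interval spans roughly four percentage points on either side of the point estimate, so accuracy differences of one or two points between models should not be over-interpreted.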

For the MARS model, ROC was maximized with nprune equal to 5 and degree 2, i.e., the final model keeps 5 terms built from the 11 predictors. Together, these predictors and hinge functions explained 32.2% of the total variance. According to the MARS output, the three most important predictors are total sulfur dioxide, alcohol, and sulfate.
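The hinge functions mentioned above are the building blocks of a MARS model. The sketch below shows their shape and how a degree-2 term arises as a product of two hinges. The knots, weights, and choice of variables are made up for illustration and are not the fitted model from the article.

```python
def hinge(x, knot, direction=1):
    """Hinge basis: max(0, x - knot), or max(0, knot - x) if direction is -1."""
    return max(0.0, direction * (x - knot))


def mars_score(alcohol, sulfates):
    """Linear predictor built from hinge terms; all knots and weights are made up."""
    h_alcohol = hinge(alcohol, knot=10.0)
    h_sulfates = hinge(sulfates, knot=0.6, direction=-1)
    # A degree-2 term is the product of two hinge functions (an interaction).
    return -0.5 + 0.8 * h_alcohol - 2.0 * h_sulfates + 1.5 * h_alcohol * h_sulfates


print(mars_score(alcohol=11.0, sulfates=0.5))
```

Each hinge is zero on one side of its knot and linear on the other, which is how the model captures piecewise-linear effects and, through products of hinges, interactions.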

When the MARS model was applied to the test data, the accuracy was 0.75 (95% CI: 0.72-0.80) and the ROC was 0.823. We also fitted a KNN model for classification. ROC is maximized when k equals 22. The accuracy of the KNN model is 0.63 (95% CI: 0.59-0.68) and the ROC is 0.672.
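A KNN classifier of the kind described above can be sketched in a few lines. The data points are made up, only two predictors are used, and k is set to 3 because the toy data set is tiny. The article's tuned value of k=22 applies to the full training set.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Tiny made-up data set with two predictors (alcohol, volatile acidity).
train_x = [(9.4, 0.70), (9.8, 0.88), (9.5, 0.66), (11.2, 0.28), (12.0, 0.35), (11.5, 0.30)]
train_y = ["poor", "poor", "poor", "good", "good", "good"]

print(knn_predict(train_x, train_y, query=(11.0, 0.40), k=3))
```

Because the vote relies on Euclidean distance, predictors on larger scales dominate unless the variables are standardized first. Whether the article's KNN model was fitted on standardized predictors is not stated.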

The advantage of GAM and MARS is that both models are nonparametric and capable of handling highly complex nonlinear relationships. In addition, MARS models can include potential interactions between predictors. However, both models are limited by their complexity, time-consuming computation, and a high tendency to overfit. For the KNN model, the prediction may be inaccurate when k is large.

Tree method

For the classification tree, the final tree size when maximizing AUC is 41. The test error rate was 0.24 and the ROC was 0.809. The accuracy of this classification tree was 0.76 (95% CI: 0.72-0.80). We also fitted a random forest to study the importance of variables. According to the random forest, alcohol is the most important variable, followed by sulfate, volatile acidity, total sulfur dioxide, density, chloride, fixed acidity, citric acid, free sulfur dioxide and residual sugar. pH is the least important variable. For the random forest model, the test error rate was 0.163, the accuracy was 0.84 (95% CI: 0.80-0.87), and the ROC was 0.900. A potential limitation of tree methods is that they are sensitive to changes in the data, i.e., small changes in the data may cause large changes in the classification tree.

Model comparison

After the models were built, we compared them based on the training and test performance of all models. The table below shows the cross-validated classification error rate and ROC for all models. The random forest model has the largest AUC value, while KNN has the smallest AUC value and the highest test classification error rate, 0.367. The remaining nine models had similar AUC values of approximately 0.82. Therefore, we selected the random forest model as the best predictive classification model for our data. Based on the random forest model, alcohol, sulfate, volatile acidity, total sulfur dioxide, and density were the top 5 predictors of wine quality classification. Such findings are in line with our expectations, since factors such as alcohol, sulfates, and volatile acidity may determine the flavor and mouthfeel of wine.
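Since the comparison rests on AUC, it may help to recall what the number measures: the probability that a randomly chosen good wine receives a higher predicted score than a randomly chosen poor wine. The sketch below computes it directly from that definition. The labels and scores are made up, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities of "good" for six wines (1 = good, 0 = poor).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported range from 0.672 for KNN to 0.900 for the random forest into context.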

The model building process showed that alcohol, sulfates, volatile acidity, total sulfur dioxide, and density were the top 5 significant predictors of wine quality classification in the training dataset. We selected the random forest model because it had the largest AUC value and the lowest classification error rate. The model also performs well on the test dataset. Therefore, this random forest model is an effective method for wine quality classification.

