Analysis of consumer reviews of clothing products based on sentiment analysis + cluster analysis + LDA topic analysis

Personal homepage: @aiperson's personal homepage

About the author: Python learner

Table of Contents

1. Project background

2. Introduction to data sets

3. Technical tools

4. Experimental process

4.1 Loading data

4.2 Data preprocessing

4.3 Data Visualization

4.4 Sentiment Analysis

4.5 Correlation Analysis

4.6 Feature Importance Analysis

4.7 Cluster Analysis

4.8 LDA topic analysis

5. Summary


1. Project background

With the rapid development of the Internet and e-commerce, more and more consumers choose to purchase clothing products online. This trend has brought about a large number of consumer reviews, which contain consumers’ opinions, feelings and usage experiences about products, and are valuable information resources. Sentiment analysis, cluster analysis and LDA topic analysis of these comments can help companies understand consumer needs more comprehensively, optimize product design, improve service quality, and achieve higher market competitiveness.

However, traditional analysis methods often can only provide single, shallow-level information, making it difficult to deeply mine the multi-dimensional information in reviews. Therefore, this study uses a method that combines sentiment analysis, cluster analysis and LDA topic analysis to conduct a comprehensive analysis of consumer reviews of clothing products. In this way, consumers’ emotional tendencies, group characteristics and hot spots of concern can be more accurately grasped, providing a more comprehensive and in-depth basis for corporate decision-making.

Specifically, the research objectives of this study include:

Use sentiment analysis to understand consumers’ overall emotional tendencies toward clothing products, as well as the emotional differences between different products and brands.
Use cluster analysis to discover the characteristics and behavior patterns of consumer groups, which gives enterprises a reference for formulating personalized marketing strategies.
Use LDA topic analysis to mine the key topics and concerns in the reviews, which gives enterprises directions for optimizing product design and improving service quality.

In summary, this study aims to provide enterprises with more comprehensive and in-depth market insights and decision-making support through a comprehensive analysis of consumer reviews of clothing products. At the same time, the methods and results of this study can also provide reference for consumer review analysis in other fields.

2. Introduction to data sets

This data set comes from Kaggle. The original data set has 49338 rows and 9 feature variables. The meaning of each variable is as follows:

Title: review title

Review: review content

Cons_rating: review rating

Cloth_class: clothing type

Materials: cloth type

Construction: cloth structure

Color: color

Finishing: the meaning is unknown and will be ignored for now

Durability: durability

3. Technical tools

Python version: 3.9

Code editor: Jupyter Notebook

4. Experimental process

4.1 Loading data

First, import the third-party libraries used in this experiment and load the original data set:

# Standard library
import math
import warnings

# Data handling, plotting and modeling
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from colorama import Fore, Style
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")  # Hide library warnings in the notebook output

# Shared plot style for all figures
rc = {
    "axes.facecolor": "#E6FFE6",
    "figure.facecolor": "#E6FFE6",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4
}
sns.set(rc=rc)

# Shortcuts for colored console output
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
mgt = Style.BRIGHT + Fore.MAGENTA
gld = Style.BRIGHT + Fore.YELLOW
res = Style.RESET_ALL

# Load the comma-separated export of the original spreadsheet
df = pd.read_csv("data_amazon.xlsx - Sheet1.csv")
df.head()

View data size

View basic information about data

It can be seen that there are a large number of missing values in the variables in the last five columns, which need to be processed later.
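The size and missing-value checks above can be sketched as follows. This is a minimal illustration on a small made-up frame: `toy` and its values are hypothetical, not the Kaggle file. On the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the review data; the values are invented.
toy = pd.DataFrame({
    "Review": ["Nice fit", "Too small", "Love it", "Color faded"],
    "Cons_rating": [5, 2, 5, 3],
    "Durability": [np.nan, 1.0, np.nan, 2.0],
})

n_rows, n_cols = toy.shape              # data size: (rows, columns)
missing_per_column = toy.isna().sum()   # number of missing values in each column

print(n_rows, n_cols)
print(missing_per_column)
toy.info()                              # dtypes and non-null counts per column
```

Columns whose non-null count is well below the row count are the ones that need attention during preprocessing.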

4.2 Data preprocessing

First check the missing values in the original data set

import missingno as msno
# Missing value analysis
fig, ax = plt.subplots(2,2,figsize=(12,7))
axs = np.ravel(ax)
msno.matrix(df, fontsize=9, color=(0.25,0,0.5),ax=axs[0]);
msno.bar(df, fontsize=8, color=(0.25,0,0.5), ax=axs[1]);
msno.heatmap(df,fontsize=8,ax=axs[2]);
msno.dendrogram(df,fontsize=8,ax=axs[3], orientation='top')

fig.suptitle('Missing Values Analysis', y=1.01, fontsize=15)
# plt.savefig('missing_values_analysis.png') # Save the picture
plt.show()

Fill missing values and remove duplicates

df.fillna(0, inplace=True) # Fill missing values with 0
df = df.drop_duplicates() # Remove duplicate values
df.info()

Descriptive statistics provide a summary of the main characteristics of the data set. This includes measures such as mean, median, standard deviation, minimum, maximum, etc.
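The call that produced these statistics is not shown in the article; the sketch below assumes pandas' `describe()`. It also illustrates a side effect worth keeping in mind: once missing values are filled with 0, those zeros count as ratings and pull the mean down. The `ratings` series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing entry; the values are invented.
ratings = pd.Series([4.0, 5.0, np.nan, 3.0], name="Durability")

mean_ignoring_missing = ratings.mean()            # NaN is skipped: (4 + 5 + 3) / 3
mean_after_zero_fill = ratings.fillna(0).mean()   # 0 counts as a rating: 12 / 4

print(ratings.fillna(0).describe())   # count, mean, std, min, quartiles, max
print(mean_ignoring_missing, mean_after_zero_fill)
```

If the zero-filled columns are later used for correlation or clustering, the same shift applies there, so it may be worth comparing results with and without the filled rows.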

4.3 Data Visualization

# Calculate the frequency of occurrence of each cloth category
cloth_class_counts = df['Cloth_class'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(cloth_class_counts, labels=cloth_class_counts.index, autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Distribution of Cloth Classes', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.axis('equal')
plt.savefig('Distribution of Cloth Classes.png')
plt.show()

This pie chart shows the distribution of clothing categories in the data set. Each slice represents one category, and the size of the slice represents that category's share of the data set.
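The percentages drawn on the slices can also be read as numbers. A minimal sketch, assuming a hypothetical `cloth_class` series in place of `df['Cloth_class']`:

```python
import pandas as pd

# Hypothetical clothing categories; the labels are invented.
cloth_class = pd.Series(["Dresses", "Jeans", "Dresses", "Blouses", "Dresses"])

# normalize=True returns the fraction of rows per category instead of raw counts.
shares = cloth_class.value_counts(normalize=True)
print((shares * 100).round(1))
```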

# Distribution of ratings
plt.figure(figsize=(12, 6))
sns.histplot(df['Cons_rating'], kde=True, color='skyblue')
plt.title('Distribution of Cons Ratings', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Cons Rating', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Frequency', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.savefig('Distribution of Con Ratings.png')
plt.show()

This histogram visualizes the distribution of “Cons_rating”. It shows how often each rating appears in the data set, which helps to understand how review ratings are distributed overall.

# Construction vs. review rating
plt.figure(figsize=(12, 6))
sns.boxplot(x='Construction', y='Cons_rating', data=df, palette='pastel')
plt.title('Construction vs. Cons Ratings', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Construction', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Cons Rating', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.savefig('Construction vs. Cons Ratings.png')
plt.show()

This box plot helps visualize the relationship between “Construction” and “Cons_rating”. It shows the distribution of review ratings for each Construction value, which helps to judge whether the two are related.

# Color distribution
plt.figure(figsize=(12, 6))
sns.countplot(x='Color', data=df, palette='pastel')
plt.title('Distribution of Colors', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Color', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Frequency', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.xticks(rotation=45)
plt.savefig('Distribution of Colors.png')
plt.show()

This count plot shows how often each color value appears. It provides an overview of the color distribution in the data set.

sns.pairplot(df[['Cons_rating', 'Materials', 'Construction', 'Finishing', 'Durability']], diag_kind='kde')
plt.suptitle('Pairplot of Numerical Variables', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.savefig('Pairplot of Numerical Variables.png')
plt.show()

This pair plot shows scatter plots of the numerical variables against each other, with a kernel density estimate for each variable on the diagonal. It is useful for visualizing relationships and distributions between numerical attributes.

df['Review'] = df['Review'].astype(str)
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(df['Review']))
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Reviews', fontsize = 14, fontweight = 'bold', color = 'darkblue')
plt.axis('off')
# plt.savefig('Word Cloud of Reviews.png')  # Save the image
plt.show()

This word cloud visually represents the words that appear most frequently in reviews. The size of each word is proportional to its frequency. It gives a quick overview of the main themes or points expressed in the review.
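The frequencies behind a word cloud can be approximated with a plain word count. This is only a sketch on invented reviews: `WordCloud` additionally applies its own stop-word list and can merge frequent word pairs by default, so its counts will not match a raw tally exactly.

```python
import re
from collections import Counter

# Hypothetical reviews; the text is invented.
reviews = [
    "Great dress, great fabric",
    "The fabric feels cheap",
    "Great fit",
]

# Lowercase the text, split it into alphabetic tokens, and count occurrences.
tokens = re.findall(r"[a-z']+", " ".join(reviews).lower())
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```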

4.4 Sentiment Analysis

Sentiment analysis involves using natural language processing techniques to determine the sentiment or emotion expressed in a piece of text. In this case, it is applied to the “Review” column to evaluate whether reviews are generally positive, negative, or neutral.

We use the TextBlob library, which provides a simple API for common NLP tasks including sentiment analysis. For each review, we calculate polarity, a sentiment measure ranging from -1 (negative) to 1 (positive).

from textblob import TextBlob
df['Sentiment'] = df['Review'].apply(lambda x: TextBlob(x).sentiment.polarity)
plt.figure(figsize=(12, 4))
sns.histplot(df['Sentiment'], kde=True, color='skyblue')
plt.title('Distribution of Sentiment Scores', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.xlabel('Sentiment Score', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Frequency', fontsize = 12, fontweight = 'bold', color = 'darkblue')
# plt.savefig('Distribution of Sentiment Scores.png')
plt.show()

It can be seen that the sentiment scores of comments are concentrated around 0.25, indicating that positive comments still account for the majority.
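To put a number on "the majority", the polarity scores can be bucketed into labels. A minimal sketch: the `label_polarity` helper and its 0.05 threshold are assumptions for illustration, not part of TextBlob, and the scores are invented.

```python
def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, such as those stored in df['Sentiment'].
scores = [0.25, 0.6, -0.3, 0.0, 0.02]
labels = [label_polarity(s) for s in scores]
positive_share = labels.count("positive") / len(labels)

print(labels)
print(positive_share)
```

On the real data the same function could be applied with `df['Sentiment'].apply(label_polarity)`. The threshold is a judgment call and changes the size of the neutral group.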

4.5 Correlation Analysis

Here we use a heat map to show the correlation coefficients between the numeric variables.

# Delete non-numeric columns
df_numeric = df.drop(columns=['Title', 'Review', 'Cloth_class'])
correlation_matrix = df_numeric.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap", fontsize = 14, fontweight = 'bold', color = 'darkblue')
plt.savefig('Correlation Heatmap.png')
plt.show()
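As a small check of what the heat map cells mean, the sketch below computes Pearson correlations on an invented frame where the relationships are known in advance. Selecting numeric columns with `select_dtypes` is an alternative to dropping the text columns by name; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical ratings; the values are invented.
toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Cons_rating": [1, 2, 4, 5],
    "Durability": [2, 4, 8, 10],   # exactly 2 * Cons_rating
    "Color": [5, 4, 2, 1],         # exactly 6 - Cons_rating
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

A value near 1 means two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.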

4.6 Feature Importance Analysis

Feature importance analysis determines which variables have the greatest impact on the target variable in the predictive model. It helps to understand which attributes are most influential in making predictions. We use a random forest regressor to estimate feature importance based on the trained model.

X = df.drop(columns=['Cons_rating', 'Title', 'Review', 'Cloth_class'])
y = df['Cons_rating']
model = RandomForestRegressor(random_state=0)  # Fixed seed so the importances are reproducible
model.fit(X, y)
feature_importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

print("\nFeature Importance:")
print(feature_importance)
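To see how these importances behave, here is a sketch on synthetic data where the answer is known: the target depends only on a column named `signal`, while `noise` is irrelevant. Both column names and all values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant.
X = pd.DataFrame({
    "signal": rng.uniform(1, 5, size=300),
    "noise": rng.uniform(1, 5, size=300),
})
y = 3 * X["signal"] + rng.normal(0, 0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized so that they sum to 1.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
```

Because the importances are relative shares, a high value only says that a feature was used more than the others in this model, not that it causes the rating.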

4.7 Cluster Analysis

Clustering is a technique used to group similar data points together. It helps in discovering patterns and structures in data. We use the K-Means clustering algorithm to cluster the reviewed items based on the attributes “Materials”, “Construction”, “Color”, “Finishing” and “Durability”. Each item is then assigned a cluster label.

from sklearn.cluster import KMeans
X_cluster = df[['Materials', 'Construction', 'Color', 'Finishing', 'Durability']]
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_cluster)
df['Cluster'] = kmeans.labels_
plt.figure(figsize=(12, 4))
plt.scatter(X_cluster['Materials'], X_cluster['Construction'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Materials', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.ylabel('Construction', fontsize = 12, fontweight = 'bold', color = 'darkblue')
plt.title('Clustering of Materials vs. Construction', fontsize = 14, fontweight = 'bold', color = 'darkgreen')
plt.colorbar(label='Cluster')
plt.savefig('Clustering of Materials vs. Construction.png')
plt.show()
cluster_counts = df['Cluster'].value_counts()
print("Cluster Counts:")
print(cluster_counts)

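Judging from the cluster sizes and the earlier sentiment analysis results, cluster 0 appears to correspond to positive reviews, cluster 1 to negative reviews, and cluster 2 to neutral reviews. Note that the clusters are built from the attribute columns rather than from the sentiment scores, so this mapping is an interpretation and not something K-Means outputs directly.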
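One way to check a reading like this is to compare the sentiment scores across clusters. The sketch below does so on invented data. The column names `Cluster` and `Sentiment` match those created earlier in the article, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical cluster labels and sentiment scores; the values are invented.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "Sentiment": [0.6, 0.4, -0.5, -0.3, 0.0, 0.1],
})

# Mean sentiment and number of reviews in each cluster.
summary = toy.groupby("Cluster")["Sentiment"].agg(["mean", "count"])
print(summary)
```

If the mean sentiment of the real clusters barely differs, the positive/negative/neutral labels would not be supported by the data.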

4.8 LDA topic analysis

Topic modeling is a technique for discovering themes in a collection of text documents. This helps to understand the main topics discussed in the reviews. We use Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to identify topics in the reviews.

from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_nlp = vectorizer.fit_transform(df['Review'])
lda = LatentDirichletAllocation(n_components=5, random_state=0)
topics = lda.fit_transform(X_nlp)

# Find the keywords for each topic
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
top_words = []

for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[:-10-1:-1]
    top_words.append([feature_names[i] for i in top_words_idx])

# Print out the keywords for each topic
for i, words in enumerate(top_words):
    print(f"Topic {i + 1}:")
    print(", ".join(words))
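The slice `argsort()[:-10-1:-1]` used above is compact but easy to misread. The sketch below shows the same idiom with three top words on an invented vocabulary. `vocab`, `topic_weights` and `n_top` are hypothetical names, and `topic_weights` stands in for one row of `lda.components_`.

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights.
vocab = np.array(["fit", "fabric", "size", "color", "price"])
topic_weights = np.array([0.1, 0.7, 0.2, 0.9, 0.4])

n_top = 3
# argsort() orders indices by ascending weight; the slice walks backwards
# from the last index and stops after n_top entries, giving the heaviest words first.
top_idx = topic_weights.argsort()[:-n_top - 1:-1]
top_terms = vocab[top_idx].tolist()

print(top_terms)
```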

5. Summary

This experiment uses a method that combines sentiment analysis, cluster analysis and LDA topic analysis to conduct a comprehensive analysis of consumer reviews of clothing products. Through experiments, we obtained rich and valuable results. The following is a summary of the experiments:

Sentiment analysis reveals consumers’ emotional tendencies toward apparel products. The polarity scores of the reviews show that most consumers’ emotions are positive, reflecting satisfaction with the products. There are also some reviews with negative sentiment, which point companies to where their products can be improved.
Cluster analysis helps us discover different characteristics and behavioral patterns of consumer groups. Through clustering, we divide the reviews into different groups according to how the product attributes were rated. This provides a reference for enterprises to formulate personalized marketing strategies and to adopt different promotion measures for different groups.
LDA topic analysis mines the key topics and concerns in the reviews. Through topic analysis, we found that consumers mainly focus on product quality, comfort, style design, price, etc. This gives enterprises a clear direction for optimizing product design and improving service quality based on consumer concerns.

To sum up, this experiment provides more comprehensive and in-depth market insights and decision-making support through a combined analysis of consumer reviews of clothing products. Enterprises can adjust product strategies, improve service quality, and enhance market competitiveness based on the experimental results. The methods and results of this experiment can also serve as a reference for consumer review analysis in other fields. In the future, the analysis can be extended and combined with more dimensions of data to gain more accurate insights into consumer needs and market trends.
