Author: @aiperson (Python learner)
Table of Contents
1. Project background
2. Introduction to data sets
3. Technical tools
4. Experimental process
4.1 Loading data
4.2 Data preprocessing
4.3 Data Visualization
4.4 Sentiment Analysis
4.5 Correlation Analysis
4.6 Feature Importance Analysis
4.7 Cluster Analysis
4.8 LDA topic analysis
5. Summary
1. Project background
With the rapid development of the Internet and e-commerce, more and more consumers choose to purchase clothing products online. This trend has brought about a large number of consumer reviews, which contain consumers’ opinions, feelings and usage experiences about products, and are valuable information resources. Sentiment analysis, cluster analysis and LDA topic analysis of these comments can help companies understand consumer needs more comprehensively, optimize product design, improve service quality, and achieve higher market competitiveness.
However, traditional analysis methods often yield only a single, shallow view of the data, which makes it hard to mine the multi-dimensional information in reviews. This study therefore combines sentiment analysis, cluster analysis and LDA topic analysis to analyze consumer reviews of clothing products. This makes it possible to capture consumers’ emotional tendencies, group characteristics and main concerns more accurately, and gives corporate decision-making a broader and deeper basis.
Specifically, the research objectives of this study include:
- Through sentiment analysis, understand consumers’ overall emotional tendencies toward clothing products, as well as the emotional differences between products and brands.
- Through cluster analysis, discover the characteristics and behavior patterns of consumer groups, providing a reference for enterprises formulating personalized marketing strategies.
- Through LDA topic analysis, mine the key topics and concerns in the reviews, giving enterprises directions for optimizing product design and improving service quality.
In summary, this study aims to provide enterprises with more comprehensive and in-depth market insights and decision-making support through a comprehensive analysis of consumer reviews of clothing products. At the same time, the methods and results of this study can also provide reference for consumer review analysis in other fields.
2. Introduction to data sets
This data set comes from Kaggle. The original data set has 49,338 rows and 9 feature variables. The meaning of each variable is as follows:
Title: review title
Review: review content
Cons_rating: review rating
Cloth_class: clothing type
Materials: cloth type
Construction: cloth structure
Color: color
Finishing: meaning unknown; ignored for now
Durability: durability
3. Technical tools
Python version: 3.9
Code editor: Jupyter Notebook
4. Experimental process
4.1 Loading data
First, import the third-party libraries used in this experiment and load the original data set:
import math
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from colorama import Fore, Style
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Plot theme shared by all figures
rc = {
    "axes.facecolor": "#E6FFE6",
    "figure.facecolor": "#E6FFE6",
    "axes.edgecolor": "#000000",
    "grid.color": "#EBEBE7",
    "font.family": "serif",
    "axes.labelcolor": "#000000",
    "xtick.color": "#000000",
    "ytick.color": "#000000",
    "grid.alpha": 0.4,
}
sns.set(rc=rc)

# Colored console output helpers
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
mgt = Style.BRIGHT + Fore.MAGENTA
gld = Style.BRIGHT + Fore.YELLOW
res = Style.RESET_ALL

# Load the raw data (comma-separated file)
df = pd.read_table("data_amazon.xlsx - Sheet1.csv", delimiter=",")
df.head()
View data size
View basic information about data
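The code for these two inspection steps is not shown in the article. A minimal sketch of what they might look like follows, using a tiny made-up frame with the same column names. All values are invented for illustration and are not taken from the Kaggle file; on the real data the same calls would be made on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data set; every value here is invented.
toy = pd.DataFrame({
    "Title": ["Great fit", "Too small", None],
    "Review": ["Love this shirt", "Runs small and the fabric is thin", "Okay for the price"],
    "Cons_rating": [5, 2, 3],
    "Cloth_class": ["Blouses", "Dresses", "Pants"],
    "Materials": [5.0, np.nan, np.nan],
    "Construction": [4.0, np.nan, 3.0],
    "Color": [5.0, np.nan, np.nan],
    "Finishing": [np.nan, np.nan, np.nan],
    "Durability": [4.0, 1.0, np.nan],
})

# Data size: (number of rows, number of columns)
print(toy.shape)

# Basic information: dtype and non-null count per column
toy.info()

# Number of missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

The non-null counts printed by `info()` are what reveal which columns have gaps.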
It can be seen that there are a large number of missing values in the variables in the last five columns, which need to be processed later.
4.2 Data preprocessing
First check the missing values in the original data set
import missingno as msno

# Missing value analysis: matrix, bar chart, heatmap and dendrogram
fig, ax = plt.subplots(2, 2, figsize=(12, 7))
axs = np.ravel(ax)
msno.matrix(df, fontsize=9, color=(0.25, 0, 0.5), ax=axs[0])
msno.bar(df, fontsize=8, color=(0.25, 0, 0.5), ax=axs[1])
msno.heatmap(df, fontsize=8, ax=axs[2])
msno.dendrogram(df, fontsize=8, ax=axs[3], orientation='top')
fig.suptitle('Missing Values Analysis', y=1.01, fontsize=15)
# plt.savefig('missing_values_analysis.png')  # Save the picture
plt.show()
Fill missing values and remove duplicates
df.fillna(0, inplace=True)  # Fill missing values with 0
df = df.drop_duplicates()   # Remove duplicate rows
df.info()
Descriptive statistics provide a summary of the main characteristics of the data set. This includes measures such as mean, median, standard deviation, minimum, maximum, etc.
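The code for this step is not shown; in pandas it would typically be a call such as `df.describe()`. One point is worth keeping in mind when reading the output: because missing values were filled with 0 above, those zeros are counted as real ratings. The sketch below uses an invented series (not the article's data) to show how zero-filling shifts the summary statistics.

```python
import numpy as np
import pandas as pd

# Invented ratings with gaps, standing in for a column such as 'Durability'.
ratings = pd.Series([5.0, 4.0, np.nan, np.nan, 3.0], name="Durability")

# describe() skips NaN, so the statistics cover the three observed values only.
observed = ratings.describe()

# After fillna(0) the zeros are treated as ratings and pull the mean and minimum down.
zero_filled = ratings.fillna(0).describe()

print(observed[["count", "mean", "min"]])
print(zero_filled[["count", "mean", "min"]])
```

Here the mean drops from 4.0 to 2.4 and the minimum from 3 to 0, so statistics computed after zero-filling describe the filled column rather than the ratings consumers actually gave.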
4.3 Data Visualization
# Calculate the frequency of occurrence of each cloth category
cloth_class_counts = df['Cloth_class'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(cloth_class_counts, labels=cloth_class_counts.index, autopct='%1.1f%%',
        startangle=140, colors=sns.color_palette('pastel'))
plt.title('Distribution of Cloth Classes', fontsize=14, fontweight='bold', color='darkgreen')
plt.axis('equal')
plt.savefig('Distribution of Cloth Classes.png')
plt.show()
This pie chart shows the distribution of clothing categories in the data set. Each slice represents one category, and the size of the slice is proportional to that category's share of the reviews.
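When many categories have similar shares, the slices of a pie chart are hard to compare by eye, so it can help to print the same proportions as a table. A minimal sketch, using invented category labels (the real `Cloth_class` values may differ):

```python
import pandas as pd

# Invented category labels standing in for df['Cloth_class'].
cloth_class = pd.Series(
    ["Dresses", "Blouses", "Dresses", "Pants", "Dresses"], name="Cloth_class"
)

# normalize=True returns shares instead of raw counts; convert to percentages.
shares = cloth_class.value_counts(normalize=True).mul(100).round(1)
print(shares)
```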
# Distribution of ratings
plt.figure(figsize=(12, 6))
sns.histplot(df['Cons_rating'], kde=True, color='skyblue')
plt.title('Distribution of Cons Ratings', fontsize=14, fontweight='bold', color='darkgreen')
plt.xlabel('Cons Rating', fontsize=12, fontweight='bold', color='darkblue')
plt.ylabel('Frequency', fontsize=12, fontweight='bold', color='darkblue')
plt.savefig('Distribution of Con Ratings.png')
plt.show()
This histogram visualizes the distribution of “Cons_rating”, the review rating. It shows how often each rating appears in the data set, which helps to understand how consumers rate the products overall.
# Review rating by Construction score
plt.figure(figsize=(12, 6))
sns.boxplot(x='Construction', y='Cons_rating', data=df, palette='pastel')
plt.title('Construction vs. Cons Ratings', fontsize=14, fontweight='bold', color='darkgreen')
plt.xlabel('Construction', fontsize=12, fontweight='bold', color='darkblue')
plt.ylabel('Cons Rating', fontsize=12, fontweight='bold', color='darkblue')
plt.savefig('Construction vs. Cons Ratings.png')
plt.show()
This box plot shows how “Cons_rating” is distributed at each value of “Construction”, which helps to judge whether the two move together. Because missing values were filled with 0 in section 4.2, the group at 0 holds the reviews that had no “Construction” value.
# Color distribution
plt.figure(figsize=(12, 6))
sns.countplot(x='Color', data=df, palette='pastel')
plt.title('Distribution of Colors', fontsize=14, fontweight='bold', color='darkgreen')
plt.xlabel('Color', fontsize=12, fontweight='bold', color='darkblue')
plt.ylabel('Frequency', fontsize=12, fontweight='bold', color='darkblue')
plt.xticks(rotation=45)
plt.savefig('Distribution of Colors.png')
plt.show()
This count plot shows how often each value of “Color” appears. It provides an overview of the color distribution in the data set.
sns.pairplot(df[['Cons_rating', 'Materials', 'Construction', 'Finishing', 'Durability']],
             diag_kind='kde')
plt.suptitle('Pairplot of Numerical Variables', fontsize=14, fontweight='bold', color='darkgreen')
plt.savefig('Pairplot of Numerical Variables.png')
plt.show()
This pair plot shows scatter plots of the numerical variables against each other, with a kernel density estimate for each variable on the diagonal. It is useful for visualizing relationships between numerical attributes and their individual distributions.
from wordcloud import WordCloud

# Make sure every review is a string before joining them
df['Review'] = df['Review'].astype(str)

wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(df['Review']))

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Reviews', fontsize=14, fontweight='bold', color='darkblue')
plt.axis('off')
# plt.savefig('Word Cloud of Reviews.png')  # Save image
plt.show()
This word cloud shows the words that appear most frequently in the reviews. The more often a word occurs, the larger it is drawn, which gives a quick overview of the main themes expressed in the reviews.
4.4 Sentiment Analysis
Sentiment analysis involves using natural language processing techniques to determine the sentiment or emotion expressed in a piece of text. In this case, it is applied to the “Review” column to evaluate whether reviews are generally positive, negative, or neutral.
We use the TextBlob library, which provides a simple API for common NLP tasks including sentiment analysis. For each review, we calculate polarity, a sentiment measure ranging from -1 (negative) to 1 (positive).
from textblob import TextBlob

# Polarity score in [-1, 1] for each review
df['Sentiment'] = df['Review'].apply(lambda x: TextBlob(x).sentiment.polarity)

plt.figure(figsize=(12, 4))
sns.histplot(df['Sentiment'], kde=True, color='skyblue')
plt.title('Distribution of Sentiment Scores', fontsize=14, fontweight='bold', color='darkgreen')
plt.xlabel('Sentiment Score', fontsize=12, fontweight='bold', color='darkblue')
plt.ylabel('Frequency', fontsize=12, fontweight='bold', color='darkblue')
# plt.savefig('Distribution of Sentiment Scores.png')
plt.show()
The sentiment scores are concentrated around 0.25, which indicates that most reviews are mildly positive.
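To turn the polarity scores into the positive, negative and neutral categories mentioned at the start of this section, one option is a simple threshold rule. The sketch below is an illustration only: the function `label_sentiment` and its 0.05 cut-off are assumptions, not part of the article or of TextBlob, and the scores are invented.

```python
import pandas as pd


def label_sentiment(polarity: float, threshold: float = 0.05) -> str:
    """Map a polarity score in [-1, 1] to a coarse label.

    The 0.05 cut-off is an illustrative assumption, not a value from the article.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Invented polarity scores standing in for df['Sentiment'].
scores = pd.Series([0.6, 0.25, 0.0, -0.4, 0.03])
labels = scores.apply(label_sentiment)

# Share of reviews per label
share_by_label = labels.value_counts(normalize=True)
print(share_by_label)
```

On the real data, the same function could be applied to `df['Sentiment']` to put a number on how large the positive majority is.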
4.5 Correlation Analysis
Here we use a heat map to show the correlation coefficients between the variables.
# Drop the text columns so that only numeric columns remain
df_numeric = df.drop(columns=['Title', 'Review', 'Cloth_class'])
correlation_matrix = df_numeric.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap", fontsize=14, fontweight='bold', color='darkblue')
plt.savefig('Correlation Heatmap.png')
plt.show()
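An alternative to listing the text columns by name is to let pandas pick the numeric columns itself, which keeps working if text columns are added or renamed. A minimal sketch on invented rows (column names follow the data set, values do not):

```python
import pandas as pd

# Invented rows; 'Color' is constructed to fall as 'Cons_rating' rises.
toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Cons_rating": [1, 2, 4, 5],
    "Durability": [1, 2, 4, 5],
    "Color": [5, 4, 2, 1],
})

# Keep only numeric columns instead of naming the text columns to drop.
numeric = toy.select_dtypes(include="number")

# DataFrame.corr() computes Pearson correlation by default.
corr = numeric.corr()
print(corr.round(2))
```

Values near 1 or -1 mark strong linear relationships, and values near 0 mark weak ones.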
4.6 Feature Importance Analysis
Feature importance analysis determines which variables have the greatest impact on the target variable in the predictive model. It helps to understand which attributes are most influential in making predictions. We use a random forest regressor to estimate feature importance based on the trained model.
# Features: all remaining columns, including the 'Sentiment' score added in 4.4
X = df.drop(columns=['Cons_rating', 'Title', 'Review', 'Cloth_class'])
y = df['Cons_rating']

model = RandomForestRegressor()
model.fit(X, y)

feature_importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print("\nFeature Importance:")
print(feature_importance)
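Feature importances are easier to trust when the model is also checked on data it has not seen. The article imports `train_test_split` but fits on all rows, so the sketch below shows one possible hold-out check. The data is synthetic and built so that the target depends almost entirely on `Durability`; the 80/20 split and `random_state=0` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends almost entirely on 'Durability'.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.integers(1, 6, size=(500, 3)),
    columns=["Materials", "Construction", "Durability"],
)
y = 0.9 * X["Durability"] + rng.normal(0, 0.1, size=500)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# R^2 on the held-out rows
r2 = model.score(X_test, y_test)

# Impurity-based importances; they sum to 1 across features.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(f"held-out R^2: {r2:.3f}")
print(importance)
```

If the held-out score were low, the importance ranking would say little about what actually drives the rating.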
4.7 Cluster Analysis
Clustering is a technique for grouping similar data points together, which helps in discovering patterns and structure in data. We use the K-Means algorithm to cluster the reviews based on attributes such as “Materials” and “Construction”, and assign the resulting cluster label to each review.
from sklearn.cluster import KMeans

X_cluster = df[['Materials', 'Construction', 'Color', 'Finishing', 'Durability']]
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_cluster)
df['Cluster'] = kmeans.labels_

plt.figure(figsize=(12, 4))
plt.scatter(X_cluster['Materials'], X_cluster['Construction'], c=df['Cluster'], cmap='viridis')
plt.xlabel('Materials', fontsize=12, fontweight='bold', color='darkblue')
plt.ylabel('Construction', fontsize=12, fontweight='bold', color='darkblue')
plt.title('Clustering of Materials vs. Construction', fontsize=14, fontweight='bold', color='darkgreen')
plt.colorbar(label='Cluster')
plt.savefig('Clustering of Materials vs. Construction.png')
plt.show()

cluster_counts = df['Cluster'].value_counts()
print("Cluster Counts:")
print(cluster_counts)
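The code above fixes the number of clusters at 3 without showing how that value was chosen. One common way to compare candidate values is the silhouette score, sketched below on synthetic, well-separated blobs. The data, the candidate range 2 to 5 and `n_init=10` are assumptions for illustration; the article's data may behave differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Fit K-Means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```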
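Judging from the cluster sizes and the earlier sentiment analysis results, cluster 0 appears to correspond to positive reviews, cluster 1 to negative reviews, and cluster 2 to neutral reviews. This reading is tentative, because the clusters were built from the attribute columns only and not from the sentiment scores.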
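One way to check this interpretation is to summarize the sentiment score and the rating within each cluster. The sketch below uses invented values standing in for `df[['Cluster', 'Sentiment', 'Cons_rating']]`, and the output column names are chosen for the example.

```python
import pandas as pd

# Invented values; on the real data these columns would come from df.
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2],
    "Sentiment": [0.5, 0.3, -0.2, -0.4, 0.0, 0.1],
    "Cons_rating": [5, 4, 1, 2, 3, 3],
})

# Size, mean sentiment and mean rating per cluster
profile = toy.groupby("Cluster").agg(
    n_reviews=("Sentiment", "size"),
    mean_sentiment=("Sentiment", "mean"),
    mean_rating=("Cons_rating", "mean"),
)
print(profile)
```

If the clusters really separated positive, negative and neutral reviews, their mean sentiment and mean rating would differ clearly, as they do in this toy example by construction.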
4.8 LDA topic analysis
Topic modeling is a technique for discovering themes in a collection of text documents, which helps to understand the main topics discussed in the reviews. We use Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to identify topics in the reviews.
from sklearn.decomposition import LatentDirichletAllocation

# Bag-of-words counts for the 1000 most frequent non-stop-word terms
vectorizer = CountVectorizer(max_features=1000, stop_words='english')
X_nlp = vectorizer.fit_transform(df['Review'])

lda = LatentDirichletAllocation(n_components=5, random_state=0)
topics = lda.fit_transform(X_nlp)

# Find the keywords for each topic
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vectorizer.get_feature_names_out()
top_words = []
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[:-10-1:-1]  # indices of the 10 highest-weighted words
    top_words.append([feature_names[i] for i in top_words_idx])

# Print out the keywords for each topic
for i, words in enumerate(top_words):
    print(f"Topic {i + 1}:")
    print(", ".join(words))
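Besides the keywords per topic, the matrix returned by `fit_transform` gives each review's topic mixture, which can be used to tag every review with its dominant topic. A minimal sketch on four invented reviews with two topics (both choices are assumptions for the example, and on such a tiny corpus the topics themselves are not meaningful):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented reviews standing in for df['Review'].
reviews = [
    "soft fabric and comfortable fit",
    "fabric feels cheap and thin",
    "great price and fast shipping",
    "shipping was slow but price is fair",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)

lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per review, one column per topic; each row sums to 1.
doc_topic = lda.fit_transform(counts)

# Index of the highest-weighted topic for each review
dominant_topic = doc_topic.argmax(axis=1)
for text, topic in zip(reviews, dominant_topic):
    print(f"topic {topic}: {text}")
```

On the real data, the same `argmax` over the article's `topics` array would show how many reviews fall mainly under each topic.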
5. Summary
This experiment combines sentiment analysis, cluster analysis and LDA topic analysis to analyze consumer reviews of clothing products. The main findings are summarized below:
- Sentiment analysis reveals consumers’ emotional tendencies toward apparel products. The polarity scores show that most reviews are positive, expressing satisfaction with the products. Some reviews are negative, and these point companies to where their products can be improved.
- Cluster analysis helps to discover the characteristics and behavior patterns of different consumer groups. Clustering divides the reviews into groups, each with its own purchasing preferences and consumption habits. This gives enterprises a reference for formulating personalized marketing strategies and for adopting different promotion measures for different groups.
- LDA topic analysis mines the key topics and concerns in the reviews. The topics suggest that consumers focus mainly on product quality, comfort, style and design, and price. This gives enterprises a clear direction for optimizing product design and improving service quality based on what consumers care about.
To sum up, this experiment provides market insights and decision-making support through a combined analysis of consumer reviews of clothing products. Enterprises can use the results to adjust product strategies, improve service quality, and strengthen their market competitiveness. The methods and results may also serve as a reference for consumer review analysis in other fields. In the future, the analysis can be extended and combined with more dimensions of data to gain more accurate insights into consumer needs and market trends.