1. Project background
India is a multi-ethnic and multi-cultural country with rich and diverse food traditions, and its food culture attracts a large number of tourists. Different regions in India have distinct differences in flavor and cooking methods, and understanding these differences is crucial to exploring the mysteries of Indian cuisine. Through research and analysis of Indian cuisine, we can help tourists better understand the cuisine characteristics of different regions and provide them with a richer travel experience.
2. Business needs
1. Analyze the characteristics of Indian cuisine.
2. Find out what Indian food has in common.
3. Project implementation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
1. Clean data
# Import data
ifood = pd.read_excel('indian_food1.xlsx')
ifood.head()

# Basic data situation
ifood.info()

# Replacement value
# File description: Values of -1 are NaN values
ifood = ifood.replace(-1, float('nan'))
ifood = ifood.replace('-1', float('nan'))
ifood.isna().sum()
1.1. Fill in missing values in prep_time and cook_time columns
# Fill in missing values
# After grouping by dish type (course), fill prep_time and cook_time with the group mode
ls = ['prep_time', 'cook_time']
for i in ls:
    mode_per_group = ifood.groupby('course')[i].apply(
        lambda x: x.mode().iloc[0] if not x.mode().empty else None
    )
    ifood[i] = ifood.apply(
        lambda row: mode_per_group[row['course']] if pd.isna(row[i]) else row[i],
        axis=1,
    )
ifood.isna().sum()
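The group-wise mode fill above can also be expressed with groupby().transform, which avoids the row-by-row apply. The following is only a minimal sketch on a made-up frame: the toy values and the helper name fill_with_group_mode are assumptions for illustration and are not taken from indian_food1.xlsx.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not read from indian_food1.xlsx)
toy = pd.DataFrame({
    'course': ['dessert', 'dessert', 'dessert', 'snack', 'snack', 'snack'],
    'prep_time': [10.0, 10.0, np.nan, 5.0, np.nan, 5.0],
})

def fill_with_group_mode(s):
    """Fill NaN in one group with that group's most frequent value."""
    modes = s.mode()  # mode() ignores NaN by default
    # If the whole group is NaN there is no mode, so leave it unchanged
    return s.fillna(modes.iloc[0]) if not modes.empty else s

# transform() applies the helper per course and keeps the original row order
toy['prep_time'] = toy.groupby('course')['prep_time'].transform(fill_with_group_mode)
print(toy)
```

On real data the same call could be repeated for cook_time; a group with several equally frequent values takes the smallest one, because mode() returns its results sorted.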
1.2. Fill in missing values in the state column
1.2.1. Encode and divide the data set
# Fill in missing values in the state column
from sklearn.preprocessing import LabelEncoder

# Temporarily fill the missing values of the 'state' column with 'A'.
# The code below relies on 'A' sorting first, so that LabelEncoder assigns it label 0.
ifood['state'] = ifood['state'].fillna('A')

# One-hot encode the 'diet', 'flavor_profile' and 'course' columns; 'state' is the label column
dfcs = ifood[['diet', 'flavor_profile', 'course', 'state']]
dfcs_gd = pd.get_dummies(dfcs, columns=['diet', 'flavor_profile', 'course'])

# Label-encode the 'state' column
le = LabelEncoder()
dfcs_gd['state'] = le.fit_transform(dfcs_gd['state'])

# Training set: the rows where the 'state' column is not missing
dfcs_gd_train = dfcs_gd[dfcs_gd['state'] != 0]
dfcs_gd_train.head()

# Prediction set: the rows where the 'state' column is missing (placeholder label 0)
dfcs_gd_pred = dfcs_gd[dfcs_gd['state'] == 0].copy()
dfcs_gd_pred.head()

from sklearn.model_selection import train_test_split

X = dfcs_gd_train.iloc[:, 1:]
y = dfcs_gd_train.iloc[:, 0]

# Divide the data set into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
1.2.2. Prediction model selection
# Compare several classifiers on the same train/test split
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB

k = 3  # k value for the KNN classifier
models = {
    'Decision tree': DecisionTreeClassifier(),
    'Random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=k),
    'Multinomial naive Bayes': MultinomialNB(),
    'Gaussian naive Bayes': GaussianNB(),
}

for name, clf in models.items():
    print(name)
    # Train the model on the training set
    clf.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = clf.predict(X_test)
    # Calculate classification accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")
    # Calculate F1 score
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"F1 score: {f1}")
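The models are compared on accuracy and on the F1 score with average='weighted'. As a rough illustration of what that averaging mode computes, the sketch below rebuilds it by hand with NumPy: a per-class F1, weighted by how often each class occurs in the true labels. The function name weighted_f1 and the toy labels are assumptions made for this example; it is intended to mirror the idea, not to replace the scikit-learn implementation.

```python
import numpy as np

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * np.sum(y_true == c)           # weight by class support
    return total / len(y_true)

# Toy labels (hypothetical, three classes)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

With many states and only about 200 rows, some classes will have very few test examples, so the weighted score is dominated by the most frequent states.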
1.2.3. Use the model to predict the missing values in the state column
# After comparing the accuracy and F1 scores of many models, we finally decided to use
# the random forest model to predict the missing values in the 'state' column
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the random forest model on the training set
clf.fit(X_train, y_train)
# Predict the missing values in the 'state' column
y_pred = clf.predict(dfcs_gd_pred.iloc[:, 1:])
y_pred
1.2.4. Fill in the prediction results into the original data table
# Add the predicted new label to dfcs_gd_pred (work on a copy to avoid SettingWithCopyWarning)
dfcs_gd_pred = dfcs_gd_pred.copy()
dfcs_gd_pred['label'] = y_pred

unique_states = ifood['state'].unique()  # Get an array of unique values
# Sort the unique values in ascending order, which matches the label order used by LabelEncoder
unique_states_sorted = sorted(unique_states, reverse=False)

# Generate a dictionary that maps each integer label back to its state name
result_dict = {}
for index, value in enumerate(unique_states_sorted):
    result_dict[index] = value
print(result_dict)

# Replace the encoded value with the predicted state name
dfcs_gd_pred['state'] = dfcs_gd_pred['label'].map(result_dict).fillna(dfcs_gd_pred['state'])

print('Before filling:', ifood['state'].unique())
for row_index in ifood.index:
    if row_index in dfcs_gd_pred.index:
        ifood.loc[ifood.index == row_index, 'state'] = dfcs_gd_pred.loc[dfcs_gd_pred.index == row_index, 'state'].values[0]
print('After filling:', ifood['state'].unique())
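The dictionary built above works because the integer labels follow the ascending sort order of the distinct state values; scikit-learn's LabelEncoder keeps its classes sorted in classes_ and also offers inverse_transform for the reverse mapping. The sketch below shows the same idea with NumPy only. The state values are hypothetical and 'A' stands for the placeholder used for missing entries.

```python
import numpy as np

# Hypothetical state values; 'A' is the placeholder for missing entries
states = np.array(['Punjab', 'A', 'Assam', 'Punjab', 'Gujarat', 'A'])

# np.unique returns the distinct values sorted ascending, plus the
# integer code of every element with respect to that sorted array
classes, codes = np.unique(states, return_inverse=True)

# Index -> name dictionary, built the same way as result_dict above
code_to_state = {index: str(value) for index, value in enumerate(classes)}
print(code_to_state)

# Decoding the integer codes recovers the original values
decoded = [code_to_state[c] for c in codes.tolist()]
print(decoded)
```

Because 'A' sorts first here, it receives code 0, which is the property the training/prediction split relies on.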
1.3. Fill in missing values in region
sr = ifood[ifood['region'].notna()][['state', 'region']].drop_duplicates()
# Use .to_dict() to create a dictionary
sr_dict = sr.set_index('state')['region'].to_dict()
sr_dict
ifood['region'] = ifood['state'].map(sr_dict).fillna(ifood['region'])
1.4. Check if there are any missing values
ifood.isna().sum()
1.5. Outlier processing
# Summarize prep_time and cook_time by course
ifood.groupby('course')[['prep_time', 'cook_time']].describe()

time = ['prep_time', 'cook_time']
course = ['dessert', 'main course', 'snack', 'starter']
for i in time:
    for j in course:
        mask = ifood['course'] == j
        a = ifood.loc[mask, i]
        mode = a.mode().values[0]
        q1 = a.quantile(0.25)
        q3 = a.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        # Replace values outside the 1.5 * IQR bounds with the mode of the group
        ifood.loc[mask, i] = a.apply(lambda x: mode if x < lower_bound or x > upper_bound else x)

# After handling outliers
ifood.groupby('course')[['prep_time', 'cook_time']].describe()
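To make the outlier rule concrete, here is a small worked example of the 1.5 * IQR bounds on a single series. The cooking times are invented for illustration, and Series.mask is used instead of apply; this is a sketch of the same rule, not the code used to clean the dataset.

```python
import pandas as pd

# Hypothetical cooking times in minutes, with one extreme value
cook = pd.Series([20, 25, 30, 30, 35, 40, 720])

q1, q3 = cook.quantile(0.25), cook.quantile(0.75)  # 27.5 and 37.5
iqr = q3 - q1                                      # 10.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # 12.5 and 52.5

mode_value = cook.mode().iloc[0]                   # most frequent value: 30
is_outlier = (cook < lower) | (cook > upper)

# Replace every outlier with the mode and keep all other values
cleaned = cook.mask(is_outlier, mode_value)
print(cleaned.tolist())
```

Only the 720-minute entry falls outside the bounds, so it is the only value replaced.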
2. Data visualization
df = pd.read_excel('indian_food1_clear.xlsx')
df.head()
2.1. Diet type
diet = df['diet'].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
color = ['lime', 'limegreen']

axes[0].pie(diet, labels=diet.index, colors=color, autopct='%1.1f%%', startangle=140)
axes[0].set_title('Diet type proportion')

diet_bar = axes[1].bar(diet.index, diet, color=color)
axes[1].bar_label(diet_bar, label_type='edge')
axes[1].set_title('Diet type summary')
axes[1].set_ylim((0, 250))

plt.tight_layout()
plt.show()
Most of the dishes in the dataset are vegetarian: 226 are vegetarian and 29 are non-vegetarian.
2.2. Flavor characteristic types
flavor_profile = df['flavor_profile'].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
color = ['cornflowerblue', 'royalblue', 'mediumblue', 'navy']

axes[0].pie(flavor_profile, labels=flavor_profile.index, colors=color, autopct='%1.1f%%', pctdistance=0.9)
axes[0].set_title('Proportion of flavor characteristics')

flavor_profile_bar = axes[1].bar(flavor_profile.index, flavor_profile, color=color)
axes[1].bar_label(flavor_profile_bar, label_type='edge')
axes[1].set_ylim((0, 160))
axes[1].set_title('Summary of flavor characteristics')

plt.tight_layout()
plt.show()
The taste of Indian food is mainly spicy and sweet. Among the 200+ Indian dishes, 145 are spicy, 102 are sweet, 5 are bitter, and 3 are sour.
2.3. Types of dishes
course = df['course'].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
color = ['aquamarine', 'turquoise', 'mediumturquoise', 'lightseagreen']

axes[0].pie(course, labels=course.index, colors=color, autopct='%1.1f%%', pctdistance=0.85)
axes[0].set_title('Dish type proportion')

course_bar = axes[1].bar(course.index, course, color=color)
axes[1].bar_label(course_bar, label_type='edge')
axes[1].set_ylim((0, 140))
axes[1].set_title('Summary of dish types')

plt.tight_layout()
plt.show()
Among the 200+ Indian dishes, main courses account for 50.6%, desserts for 33.3%, snacks for 15.3%, and starters for only 0.8%.
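The percentages shown in the pie chart can also be obtained directly as numbers with value_counts(normalize=True). In the sketch below the per-course counts are assumptions: they are chosen so that they add up to 255 dishes (226 vegetarian plus 29 non-vegetarian) and reproduce the percentages reported above, but they were not read from the dataset.

```python
import pandas as pd

# Hypothetical course labels; counts picked to match the reported shares
course_labels = pd.Series(
    ['main course'] * 129 + ['dessert'] * 85 + ['snack'] * 39 + ['starter'] * 2
)

# normalize=True returns fractions of the total instead of raw counts
share = course_labels.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the cleaned data the equivalent call would be applied to the course column of the loaded DataFrame.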
2.4. Summary of raw ingredients
df1 = pd.read_excel('food.xlsx')
df2 = df1.sort_values(by='frequency', ascending=False).head(30)
# The column name must match the header in food.xlsx; 'raw material' is used consistently here
print('The total number of types of ingredients is', len(df1['raw material']))

plt.figure(figsize=(10, 4))
x = df2['raw material']
y = df2['frequency']
df1_bar = plt.bar(x, y, color='deepskyblue')
plt.bar_label(df1_bar, label_type='edge')
plt.title('Raw ingredient summary TOP30')
plt.ylim((0, 60))
plt.xticks(x, rotation=90)
plt.show()
The 200+ Indian dishes use 384 different main ingredients. Sugar, ginger, curry leaves, ghee, green chili, etc. are used as seasonings. Commonly used ingredients include rice noodles, tomatoes, potatoes, wheat flour, coconut, sesame, etc.
2.5. Preparation time & cooking time
cook = df['cook_time'].value_counts()
plt.figure(figsize=(10, 4))
x = np.arange(len(cook))
y = cook
cook_bar = plt.bar(x, y, color='salmon')
plt.bar_label(cook_bar, label_type='edge')
plt.title('Ingredients cooking time')
plt.xlabel('minutes')
plt.xticks(x, cook.index)
plt.show()

prep = df['prep_time'].value_counts()
plt.figure(figsize=(10, 4))
x = np.arange(len(prep))
y = prep
prep_bar = plt.bar(x, y, color='palegreen')
plt.bar_label(prep_bar, label_type='edge')
plt.title('Ingredients preparation time')
plt.xlabel('minutes')
plt.xticks(x, prep.index)
plt.show()
Overall, the ingredient preparation time for each dish is about 10-20 minutes, and the cooking time is about 20-40 minutes.
2.6. Serving speed of different dish types
cook_time = pd.crosstab(index=df['cook_time'], columns=df['course'])
fig, axes = plt.subplots(2, 2, figsize=(18, 10))
x = np.arange(len(cook_time.index))
title_fontsize = 25

# (crosstab column, panel title, bar color) for each of the four panels
panels = [
    ('dessert', 'dessert', 'gold'),
    ('main course', 'main_course', 'khaki'),
    ('snack', 'snack', 'goldenrod'),
    ('starter', 'starter', 'darkkhaki'),
]
for ax, (column, title, bar_color) in zip(axes.flat, panels):
    bars = ax.bar(x, cook_time[column], label=title, color=bar_color)
    ax.bar_label(bars, label_type='edge')
    ax.set_title(title, fontsize=title_fontsize)
    ax.set_xlabel('minutes')
    ax.set_xticks(x)
    ax.set_xticklabels(cook_time.index)

plt.tight_layout()
plt.show()
2.7. Regional differences in diet
region_diet = pd.crosstab(index=df['region'], columns=df['diet'])
region_flavor_profile = pd.crosstab(index=df['region'], columns=df['flavor_profile'])

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
width1 = 0.25
x1 = np.arange(len(region_diet.index))
width2 = 0.25
x2 = np.arange(len(region_flavor_profile.index))

# Layout of the four flavor bars inside each region group
group_num = 4
group_width = 1
bar_span = group_width / group_num
baseline_x = x2 - (group_width - bar_span) / 2

# Top panel: vegetarian vs non-vegetarian dishes per region
non_vegetarian_bar = axes[0].bar(x1 - width1/2, region_diet['non vegetarian'], width1, color='brown', label='non vegetarian')
vegetarian_bar = axes[0].bar(x1 + width1/2, region_diet['vegetarian'], width1, color='lightcoral', label='vegetarian')
axes[0].bar_label(non_vegetarian_bar, label_type='edge')
axes[0].bar_label(vegetarian_bar, label_type='edge')
axes[0].set_xticks(x1)
axes[0].set_xticklabels(region_diet.index)
axes[0].legend()
axes[0].set_title('Regional dietary differences')

# Bottom panel: one bar per flavor profile in each region
flavor_colors = {'bitter': 'thistle', 'sour': 'plum', 'spicy': 'violet', 'sweet': 'fuchsia'}
for n, (flavor, flavor_color) in enumerate(flavor_colors.items()):
    flavor_bar = axes[1].bar(baseline_x + n*bar_span, region_flavor_profile[flavor], width2, color=flavor_color, label=flavor)
    axes[1].bar_label(flavor_bar, label_type='edge')
axes[1].set_xticks(x2)
axes[1].set_xticklabels(region_flavor_profile.index)
axes[1].legend()
axes[1].set_title('Differences in regional taste characteristics')

plt.tight_layout()
plt.show()
Every region has vegetarian food, and every region's cuisine has spicy and sweet flavors. The South stands out, with the most vegetarian, spicy, and sweet dishes.
All dishes from the Central and East regions are vegetarian.
The number of sour and bitter dishes in each region is very small, and some regions have none at all.
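The comparisons above use raw counts, so a region with more dishes in the dataset will tend to lead every category. One way to compare regions on equal footing is to normalize each row of the crosstab, which pandas supports through the normalize='index' argument. The sketch below uses a tiny invented frame; the rows are assumptions for illustration only.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the cleaned dataset)
toy = pd.DataFrame({
    'region': ['South', 'South', 'South', 'South', 'East', 'East'],
    'flavor_profile': ['spicy', 'spicy', 'spicy', 'sweet', 'sweet', 'sweet'],
})

# Raw counts per region and flavor
counts = pd.crosstab(toy['region'], toy['flavor_profile'])

# Row-normalized shares: each region's row sums to 1
shares = pd.crosstab(toy['region'], toy['flavor_profile'], normalize='index')

print(counts)
print(shares)
```

In this toy frame the South has more spicy dishes in absolute terms, while the shares show what fraction of each region's dishes falls into each flavor.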
# Number of states in each region
state_region_num = df[['state', 'region']].drop_duplicates()['region'].value_counts()
x = np.arange(len(state_region_num.index))
plt.subplots(figsize=(10, 6))
state_region_num_bar = plt.bar(x, state_region_num, color='blueviolet')
plt.bar_label(state_region_num_bar, label_type='edge')
plt.title('Number of states in each region')
plt.ylim((0, 10))
plt.xticks(x, state_region_num.index)
plt.show()
# Number of dishes in each state
region_state_food = df.groupby('region')['state'].value_counts().reset_index(name='frequency')

# Create colormap and category list
color_mapping = {'Central': 'b', 'East': 'orange', 'North': 'r', 'North East': 'c', 'South': 'g', 'West': 'y'}
categories = region_state_food['region'].unique()

# Draw a column chart
fig, ax = plt.subplots(figsize=(16, 6))
for category in categories:
    subset = region_state_food[region_state_food['region'] == category]
    x = subset['state']
    y = subset['frequency']
    color = color_mapping.get(category, 'gray')  # Get the color mapping; if there is no match, use gray
    a = ax.bar(x, y, color=color, label=category)
    ax.bar_label(a, label_type='edge')

plt.xlabel('state')
plt.title('Number of dishes in each state')
plt.xticks(region_state_food['state'], rotation=90)
# Add legend
ax.legend(title='Region')
plt.show()
4. Data analysis results:
Conclusion:
In terms of diet type, the dataset includes more than 200 dishes, most of which are vegetarian; only a few are non-vegetarian.
From the perspective of flavor characteristics, these delicacies are mainly characterized by spiciness and sweetness, providing people with a variety of taste options.
In terms of the types of dishes, main courses are the most numerous, followed by desserts and snacks, giving diners a wide range of choices.
The cooking time of ingredients is usually between 20-40 minutes, while the preparation time of ingredients is generally around 10-20 minutes. Different types of dishes may require different cooking times, but overall, the cooking times are roughly similar and the dish is usually ready to serve in 20-40 minutes.
In terms of regional differences, the South (South India) and West (West India) regions have the most vegetarian dishes. States such as Gujarat, Punjab, West Bengal, Maharashtra, and Assam offer many spicy and sweet dishes. Overall, each region has a state that excels when it comes to gastronomy, displaying a diverse range of tastes and cooking styles.
Suggestion:
Try vegetarian food: Since most of the dishes are vegetarian, it is recommended that tourists try some colorful vegetarian dishes to experience the unique charm of Indian vegetarian culture.
Explore spicy and sweet flavors: Spicy and sweet flavors are the main characteristics of Indian cuisine. Try some spicy and sweet dishes, but be aware of your personal spiciness preference. Some Indian foods can be very spicy.
Diverse choices: There is a rich choice of main courses, desserts, and snacks. It is recommended that tourists try a variety of dish types to fully experience the flavors of India.
Consider waiting time: Cooking time is usually between 20 and 40 minutes, so visitors need to allow for waiting time. During peak hours in particular, it may be better to order in advance or to dine at a less busy time.
Explore regional cuisine: Given the regional differences, it is worth exploring the cuisine of different regions, such as tasting local vegetarian and spicy cuisine in the South and West regions, and trying local specialties in states such as Gujarat, Punjab, West Bengal, Maharashtra, and Assam.
Know the serving speed: Various dishes may be served at different times, and some dishes may be served faster. Make your choice based on your personal schedule.
All in all, India’s food culture is rich and colorful, and tourists can try different types of dishes and enjoy unique tastes and cultural experiences.