Indian Cuisine Analysis and Similarity Study

1. Project background

India is a multi-ethnic, multicultural country with rich and diverse food traditions, and its food culture attracts a large number of tourists. Flavors and cooking methods differ markedly from region to region, so understanding these differences is key to understanding Indian cuisine. By analyzing a dataset of Indian dishes, we aim to help tourists understand the culinary characteristics of each region and enjoy a richer travel experience.

2. Business needs

1. Analyze the characteristics of Indian cuisine.

2. Find out what Indian food has in common.

3. Project implementation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False # Used to display negative signs normally

1. Clean data

# Import data
ifood = pd.read_excel('indian_food1.xlsx')
ifood.head()

# Basic data situation
ifood.info()

# Replace placeholder values
# Per the file description, a value of -1 marks a missing value (NaN)
ifood = ifood.replace(-1,float('nan'))
ifood = ifood.replace('-1',float('nan'))
ifood.isna().sum()

1.1. Fill in missing values in prep_time and cook_time columns

# Fill in missing values
# After classifying according to dish type, use mode filling for prep_time and cook_time

ls = ['prep_time', 'cook_time']
for i in ls:
    mode_per_group = ifood.groupby('course')[i].apply(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
    ifood[i] = ifood.apply(lambda row: mode_per_group[row['course']] if pd.isna(row[i]) else row[i], axis=1)
ifood.isna().sum()
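
The group-wise mode fill above can also be written with `groupby(...).transform`, which avoids the row-by-row `apply`. The following is only a minimal sketch on made-up toy data; the helper name `fill_with_group_mode` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up for illustration)
toy = pd.DataFrame({
    'course': ['dessert', 'dessert', 'dessert', 'snack', 'snack', 'snack'],
    'prep_time': [10, 10, np.nan, 5, np.nan, 5],
})

def fill_with_group_mode(frame, group_col, value_col):
    """Fill NaNs in value_col with the most frequent value of the row's group."""
    def first_mode(s):
        m = s.mode()  # mode() ignores NaN by default
        return m.iloc[0] if not m.empty else np.nan
    # transform broadcasts each group's mode back to every row of that group
    group_mode = frame.groupby(group_col)[value_col].transform(first_mode)
    return frame[value_col].fillna(group_mode)

toy['prep_time'] = fill_with_group_mode(toy, 'course', 'prep_time')
print(toy)
```

Only the missing entries are touched; existing values are left unchanged.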

1.2. Fill in missing values in the state column

1.2.1. Encode and divide the data set

# Fill in missing values in state column
from sklearn.preprocessing import LabelEncoder

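# Temporarily fill the missing values of the 'state' column with the placeholder 'A'
# (assumes every real state name sorts after 'A', so LabelEncoder gives the placeholder code 0)
ifood['state'] = ifood['state'].fillna('A')

# One-hot encode the 'diet', 'flavor_profile', and 'course' columns; 'state' is kept as the label column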
dfcs = ifood[['diet','flavor_profile','course','state']]
dfcs_gd = pd.get_dummies(dfcs,columns=['diet','flavor_profile','course'])

# Label-encode the 'state' column
le = LabelEncoder()
dfcs_gd['state'] = le.fit_transform(dfcs_gd['state'])

# The training set is the rows where the 'state' column is not missing (code 0 is the 'A' placeholder)
dfcs_gd_train = dfcs_gd[dfcs_gd['state']!=0]
dfcs_gd_train.head()

# The prediction set is the rows where 'state' is missing
# (.copy() so that columns can be added later without a SettingWithCopyWarning)
dfcs_gd_pred = dfcs_gd[dfcs_gd['state']==0].copy()
dfcs_gd_pred.head()

from sklearn.model_selection import train_test_split

X = dfcs_gd_train.iloc[:,1:]
y = dfcs_gd_train.iloc[:,0]

# Divide the data set into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
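
The split above relies on the placeholder 'A' receiving label code 0. A possible alternative, sketched below on made-up toy data, is to separate known and missing rows with `notna()` before fitting the encoder, so no placeholder is needed. The names `toy`, `le_toy`, `X_known`, and `X_missing` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up): one row has a missing state
toy = pd.DataFrame({
    'diet': ['vegetarian', 'vegetarian', 'non vegetarian', 'vegetarian'],
    'course': ['dessert', 'snack', 'main course', 'dessert'],
    'state': ['Punjab', np.nan, 'Assam', 'Punjab'],
})

# One-hot encode the feature columns only
features = pd.get_dummies(toy[['diet', 'course']], columns=['diet', 'course'])

# Boolean mask of rows whose label is known
known = toy['state'].notna()

# Fit the encoder on real state names only, so every code maps to a real state
le_toy = LabelEncoder()
y_known = le_toy.fit_transform(toy.loc[known, 'state'])

X_known = features[known]      # rows usable for training and testing
X_missing = features[~known]   # rows whose state must be predicted

print(le_toy.classes_, y_known, len(X_missing))
```

With this layout, predictions for `X_missing` can never come back as the placeholder class.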

1.2.2. Prediction model selection

# Decision tree prediction

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier


# Create a decision tree classifier
clf = DecisionTreeClassifier()

# Train the decision tree model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")


# Random forest prediction

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the random forest model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")


# KNN prediction

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN classifier and set the k value
k=3
clf = KNeighborsClassifier(n_neighbors=k)

# Train the KNN model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")


# Multinomial naive Bayes prediction

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.naive_bayes import MultinomialNB

# Create a multinomial naive Bayes classifier
clf = MultinomialNB()

# Train a naive Bayes model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")


# Gaussian naive Bayes prediction

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian naive Bayes classifier
clf = GaussianNB()

# Train a naive Bayes model on the training set
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate classification accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"F1 score: {f1}")
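
The five train/predict/score cells above follow the same pattern, so the comparison can be condensed into one loop. This is a sketch on synthetic binary features (a stand-in for the one-hot columns), not a rerun of the notebook; the data, the `models` dictionary, and the `scores` dictionary are assumptions for illustration, and the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 0/1 features, non-negative so MultinomialNB accepts them
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(200, 8))
# Synthetic label: 1 only when the first two features are both 1
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'decision tree': DecisionTreeClassifier(random_state=42),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'knn (k=3)': KNeighborsClassifier(n_neighbors=3),
    'multinomial NB': MultinomialNB(),
    'gaussian NB': GaussianNB(),
}

# Fit each model on the same split and record both metrics
scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    scores[name] = {
        'accuracy': accuracy_score(yte, pred),
        'f1': f1_score(yte, pred, average='weighted'),
    }

for name, s in scores.items():
    print(f"{name}: accuracy={s['accuracy']:.3f}, f1={s['f1']:.3f}")
```

Keeping every model on the same split and the same metrics makes the scores directly comparable.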

1.2.3. Use the model to predict the missing values in the state column

# After comparing the accuracy and F1 scores of many models, we finally decided to use the random forest model to predict the missing values in the 'state' column

from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the random forest model on the training set
clf.fit(X_train, y_train)

# Predict the missing values in the 'state' column
y_pred = clf.predict(dfcs_gd_pred.iloc[:,1:])
y_pred

1.2.4. Fill in the prediction results into the original data table

# Add the predicted new label to dfcs_gd_pred
dfcs_gd_pred['label'] = y_pred

unique_states = ifood['state'].unique() # Get an array of unique values (including the 'A' placeholder)
unique_states_sorted = sorted(unique_states, reverse=False) # Sort in ascending order, matching LabelEncoder's class order

# Generate dictionary
result_dict = {}
for index, value in enumerate(unique_states_sorted):
    result_dict[index] = value

print(result_dict)

# Replace with new value
dfcs_gd_pred['state'] = dfcs_gd_pred['label'].map(result_dict).fillna(dfcs_gd_pred['state'])


print('Before filling:',ifood['state'].unique())

# Write the predicted states back into the original table (rows are aligned on the index)
ifood.loc[dfcs_gd_pred.index, 'state'] = dfcs_gd_pred['state']

print('After filling:',ifood['state'].unique())
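
The code-to-name dictionary built by hand above mirrors what the fitted `LabelEncoder` already stores in `classes_`. A minimal sketch on made-up state values shows `inverse_transform` doing the same decoding; the names `enc`, `pred_codes`, and `decoded` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels for illustration
states = np.array(['Punjab', 'Assam', 'Kerala', 'Assam'])

enc = LabelEncoder()
codes = enc.fit_transform(states)   # classes are stored in sorted order

# Pretend these integer codes came out of a classifier
pred_codes = np.array([2, 0])

# Decode integer codes back to the original state names
decoded = enc.inverse_transform(pred_codes)

# The equivalent hand-built mapping, for comparison
manual_map = dict(enumerate(enc.classes_))

print(decoded, manual_map)
```

Decoding through the encoder keeps the mapping tied to the exact object that produced the codes.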

1.3. Fill in missing values in region

sr = ifood[ifood['region'].notna()][['state','region']].drop_duplicates()

# Use .to_dict() to create a dictionary
sr_dict = sr.set_index('state')['region'].to_dict()
sr_dict

ifood['region'] = ifood['state'].map(sr_dict).fillna(ifood['region'])
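
The state-to-region lookup can be illustrated on a small made-up table. Unlike the line above, this sketch fills only the rows where `region` is missing and leaves existing values untouched; that ordering is a design choice of the sketch, and the names `toy` and `lookup` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up table: two rows lack a region
toy = pd.DataFrame({
    'state': ['Punjab', 'Punjab', 'Kerala', 'Kerala'],
    'region': ['North', np.nan, 'South', np.nan],
})

# Build a state -> region lookup from the rows where region is known
lookup = (
    toy.dropna(subset=['region'])
       .drop_duplicates('state')
       .set_index('state')['region']
)

# Fill only the missing regions using the state of each row
toy['region'] = toy['region'].fillna(toy['state'].map(lookup))
print(toy)
```

A state with no known region anywhere in the table would stay missing, which the check in the next step would reveal.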

1.4. Check if there are any missing values

ifood.isna().sum()

1.5. Outlier processing

# Summary statistics by course (double brackets select multiple columns; the tuple form was removed in pandas 2.0)
ifood.groupby('course')[['prep_time','cook_time']].describe()

time = ['prep_time','cook_time']
course = ['dessert','main course','snack','starter']

for i in time:
    for j in course:
        a = ifood[i].loc[ifood['course']==j]
        mode = a.mode().values[0]
        q1 = a.quantile(0.25)
        q3 = a.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5*iqr
        upper_bound = q3 + 1.5*iqr
        # Replace outliers with the mode; .loc on the DataFrame avoids chained assignment
        ifood.loc[ifood['course']==j, i] = a.apply(lambda x:mode if x < lower_bound or x > upper_bound else x)

# After handling outliers
ifood.groupby('course')[['prep_time','cook_time']].describe()
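
The IQR rule used above flags a value as an outlier when it falls below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The sketch below packages that rule as a function and applies it per group on made-up data; the helper name `replace_outliers_with_mode` and the toy values are assumptions for illustration.

```python
import pandas as pd

def replace_outliers_with_mode(values):
    """Replace values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] with the mode."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mode = values.mode().iloc[0]  # smallest value when several modes tie
    # where() keeps in-range values and substitutes the mode elsewhere
    return values.where(values.between(lower, upper), mode)

# Made-up data: 720 minutes is far outside the rest of the group
toy = pd.DataFrame({
    'course': ['snack'] * 6,
    'cook_time': [10, 10, 15, 20, 20, 720],
})

# transform applies the function to each course separately
toy['cook_time'] = toy.groupby('course')['cook_time'].transform(replace_outliers_with_mode)
print(toy['cook_time'].tolist())
```

Here Q1 = 11.25 and Q3 = 20, so the upper bound is 33.125 and only the 720 entry is replaced.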

2. Data visualization

# Load the cleaned data (assumes the cleaned table was saved as 'indian_food1_clear.xlsx')
df = pd.read_excel('indian_food1_clear.xlsx')
df.head()

2.1. Diet type

diet = df['diet'].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

color = ['lime','limegreen']

axes[0].pie(diet, labels=diet.index, colors=color ,autopct='%1.1f%%', startangle=140)
axes[0].set_title('Diet type proportion')

diet_bar = axes[1].bar(diet.index, diet,color=color)
axes[1].bar_label(diet_bar,label_type='edge')
axes[1].set_title('Diet type summary')
axes[1].set_ylim((0,250))

plt.tight_layout()
plt.show()

Most dishes in the dataset are vegetarian: 226 are vegetarian and 29 are non-vegetarian.

2.2. Flavor characteristic types

flavor_profile = df['flavor_profile'].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

color = ['cornflowerblue','royalblue','mediumblue','navy']

axes[0].pie(flavor_profile, labels=flavor_profile.index,colors=color, autopct='%1.1f%%',pctdistance=0.9)
axes[0].set_title('Proportion of flavor characteristics')

flavor_profile_bar = axes[1].bar(flavor_profile.index, flavor_profile,color=color)
axes[1].bar_label(flavor_profile_bar,label_type='edge')
axes[1].set_ylim((0,160))
axes[1].set_title('Summary of flavor characteristics')

plt.tight_layout()
plt.show()

Indian dishes are mainly spicy or sweet. Among the 200+ dishes, 145 are spicy, 102 are sweet, 5 are bitter, and 3 are sour.

2.3. Types of dishes

course = df['course'].value_counts()

fig, axes = plt.subplots(1, 2, figsize=(10, 5))

color = ['aquamarine','turquoise','mediumturquoise','lightseagreen']

axes[0].pie(course, labels=course.index, colors=color, autopct='%1.1f%%',pctdistance=0.85)
axes[0].set_title('Dish type proportion')

course_bar = axes[1].bar(course.index, course,color=color)
axes[1].bar_label(course_bar,label_type='edge')
axes[1].set_ylim((0,140))
axes[1].set_title('Summary of dish types')

plt.tight_layout()
plt.show()

Among the 200+ dishes, main courses account for 50.6%, desserts for 33.3%, snacks for 15.3%, and starters for only 0.8%.

2.4. Summary of raw ingredients

df1 = pd.read_excel('food.xlsx')
df2 = df1.sort_values(by='frequency',ascending=False).head(30)

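# Use the row count of the ingredient table; the column is named 'raw material' below, so the 'raw materials' key would not match
print('The total number of types of ingredients is',len(df1))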

plt.figure(figsize=(10,4))

x = df2['raw material']
y = df2['frequency']

df1_bar = plt.bar(x,y,color='deepskyblue')
plt.bar_label(df1_bar,label_type='edge')
plt.title('Top 30 raw ingredients')
plt.ylim((0,60))
plt.xticks(x,rotation=90)
plt.show()

The 200+ dishes use 384 distinct ingredients. Sugar, ginger, curry leaves, ghee, green chili, and the like are used as seasonings; other commonly used ingredients include rice noodles, tomatoes, potatoes, wheat flour, coconut, and sesame.

2.5. Preparation time & cooking time

cook = df['cook_time'].value_counts()

plt.figure(figsize=(10,4))

x = np.arange(len(cook))
y = cook

cook_bar = plt.bar(x,y,color='salmon')
plt.bar_label(cook_bar,label_type='edge')
plt.title('Ingredients cooking time')
plt.xlabel('minutes')
plt.xticks(x,cook.index)
plt.show()

prep = df['prep_time'].value_counts()

plt.figure(figsize=(10,4))

x = np.arange(len(prep))
y = prep

prep_bar = plt.bar(x,y,color='palegreen')
plt.bar_label(prep_bar,label_type='edge')
plt.title('Ingredients preparation time')
plt.xlabel('minutes')
plt.xticks(x,prep.index)
plt.show()

Overall, ingredient preparation takes about 10-20 minutes per dish, and cooking takes about 20-40 minutes.
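
Note that `value_counts()` orders its result by frequency, so the bars in the two charts above appear from most to least common rather than in order of minutes. If a time-ordered axis is preferred, one option is to sort the counts by index first. The sketch below uses made-up times; the names `times`, `by_frequency`, and `by_minutes` are hypothetical.

```python
import pandas as pd

# Made-up cooking times in minutes
times = pd.Series([30, 10, 30, 20, 30, 10], name='cook_time')

# Default order: most frequent value first
by_frequency = times.value_counts()

# Sorted by the time value, which gives a time-ordered x axis when plotted
by_minutes = times.value_counts().sort_index()

print(by_frequency.index.tolist())
print(by_minutes.index.tolist(), by_minutes.tolist())
```

A time-ordered axis makes ranges such as "20-40 minutes" easier to read off the chart.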

2.6. Serving speed of different dish types

cook_time = pd.crosstab(index=df['cook_time'], columns=df['course'])

fig,axes = plt.subplots(2,2,figsize=(18,10))

x = np.arange(len(cook_time.index))
dessert = cook_time['dessert']
main_course = cook_time['main course']
snack = cook_time['snack']
starter = cook_time['starter']

title_fontsize=25

dessert_bar = axes[0,0].bar(x,dessert,label='dessert',color='gold')
axes[0,0].bar_label(dessert_bar,label_type='edge')
axes[0,0].set_title('dessert', fontsize=title_fontsize)
axes[0,0].set_xlabel('minutes')
axes[0,0].set_xticks(x)
axes[0,0].set_xticklabels(cook_time.index)

main_course_bar = axes[0,1].bar(x,main_course,label='main_course',color='khaki')
axes[0,1].bar_label(main_course_bar,label_type='edge')
axes[0,1].set_title('main_course', fontsize=title_fontsize)
axes[0,1].set_xlabel('minutes')
axes[0,1].set_xticks(x)
axes[0,1].set_xticklabels(cook_time.index)

snack_bar = axes[1,0].bar(x,snack,label='snack',color='goldenrod')
axes[1,0].bar_label(snack_bar,label_type='edge')
axes[1,0].set_title('snack', fontsize=title_fontsize)
axes[1,0].set_xlabel('minutes')
axes[1,0].set_xticks(x)
axes[1,0].set_xticklabels(cook_time.index)

starter_bar = axes[1,1].bar(x,starter,label='starter',color='darkkhaki')
axes[1,1].bar_label(starter_bar,label_type='edge')
axes[1,1].set_title('starter', fontsize=title_fontsize)
axes[1,1].set_xlabel('minutes')
axes[1,1].set_xticks(x)
axes[1,1].set_xticklabels(cook_time.index)

plt.tight_layout()
plt.show()
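
The four near-identical plotting blocks above can be expressed as one loop over the axes and the crosstab columns. This is a sketch on made-up data, assuming matplotlib 3.4 or later for `bar_label`; the non-interactive `Agg` backend and the names `toy` and `table` are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: two dishes per course
toy = pd.DataFrame({
    'cook_time': [10, 20, 20, 30, 10, 30, 20, 10],
    'course': ['dessert', 'dessert', 'main course', 'main course',
               'snack', 'snack', 'starter', 'starter'],
})

# Rows are cooking times, columns are courses, cells are dish counts
table = pd.crosstab(index=toy['cook_time'], columns=toy['course'])

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
x = np.arange(len(table.index))

# One subplot per course, drawn with identical settings
for ax, name in zip(axes.flat, table.columns):
    bars = ax.bar(x, table[name], label=name)
    ax.bar_label(bars, label_type='edge')
    ax.set_title(name)
    ax.set_xlabel('minutes')
    ax.set_xticks(x)
    ax.set_xticklabels(table.index)

plt.tight_layout()
```

Per-course colors could be supplied by zipping a list of colors into the same loop.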

2.7. Regional differences in diet

region_diet = pd.crosstab(index=df['region'], columns=df['diet'])
region_flavor_profile = pd.crosstab(index=df['region'], columns=df['flavor_profile'])

fig, axes = plt.subplots(2, 1, figsize=(10, 8))

width1 = 0.25
x1 = np.arange(len(region_diet.index))

width2 = 0.25
x2 = np.arange(len(region_flavor_profile.index))
group_num = 4
group_width = 1
bar_span = group_width / group_num
bar_width = bar_span - 0.1
baseline_x = x2 - (group_width - bar_span) / 2

non_vegetarian_bar = axes[0].bar(x1 - width1/2, region_diet['non vegetarian'], width1,color='brown', label='non vegetarian')
vegetarian_bar = axes[0].bar(x1 + width1/2, region_diet['vegetarian'], width1,color='lightcoral', label='vegetarian')
axes[0].bar_label(non_vegetarian_bar,label_type='edge')
axes[0].bar_label(vegetarian_bar,label_type='edge')
axes[0].set_xticks(x1)
axes[0].set_xticklabels(region_diet.index)
axes[0].legend()
axes[0].set_title('Regional dietary differences')

bitter_bar = axes[1].bar(baseline_x + 0*bar_span, region_flavor_profile['bitter'], width2,color='thistle', label='bitter')
sour_bar = axes[1].bar(baseline_x + 1*bar_span, region_flavor_profile['sour'], width2,color='plum', label='sour')
spicy_bar = axes[1].bar(baseline_x + 2*bar_span, region_flavor_profile['spicy'], width2,color='violet', label='spicy')
sweet_bar = axes[1].bar(baseline_x + 3*bar_span, region_flavor_profile['sweet'], width2,color='fuchsia', label='sweet')
axes[1].bar_label(bitter_bar,label_type='edge')
axes[1].bar_label(sour_bar,label_type='edge')
axes[1].bar_label(spicy_bar,label_type='edge')
axes[1].bar_label(sweet_bar,label_type='edge')
axes[1].set_xticks(x2)
axes[1].set_xticklabels(region_flavor_profile.index)
axes[1].legend()
axes[1].set_title('Differences in regional taste characteristics')

plt.tight_layout()
plt.show()

Every region has vegetarian dishes, and every region’s cuisine includes spicy and sweet flavors. The South stands out, with the most vegetarian, spicy, and sweet dishes.

The dishes from the Central and East regions are all vegetarian.

Sour and bitter dishes are rare in every region and absent from some.

# Number of states in each region
state_region_num = df[['state','region']].drop_duplicates()['region'].value_counts()

x = np.arange(len(state_region_num.index))

plt.subplots(figsize=(10, 6))

state_region_num_bar = plt.bar(x,state_region_num,color='blueviolet')
plt.bar_label(state_region_num_bar,label_type='edge')
plt.title('Number of states in each region')
plt.ylim((0,10))
plt.xticks(x,state_region_num.index)
plt.show()

# Number of dishes in each state

region_state_food = df.groupby('region')['state'].value_counts().reset_index(name='frequency')

# Create the region-to-color mapping and the category list
color_mapping = {'Central': 'b', 'East': 'orange', 'North': 'r', 'North East': 'c', 'South': 'g','West':'y'}
categories = region_state_food['region'].unique()

# Draw a column chart
fig, ax = plt.subplots(figsize=(16, 6))

for category in categories:
    subset = region_state_food[region_state_food['region'] == category]
    x = subset['state']
    y = subset['frequency']
    color = color_mapping.get(category, 'gray') # Get the color mapping, if there is no match, use gray
    a=ax.bar(x, y, color=color, label=category)
    ax.bar_label(a,label_type='edge')
plt.xlabel('state')
plt.title('Number of dishes in each state')
plt.xticks(region_state_food['state'],rotation=90)
# Add legend
ax.legend(title='Region')

plt.show()

4. Data analysis results:

Conclusion:

In terms of diet type, the dataset covers more than 200 dishes, most of which are vegetarian; only a few are non-vegetarian.

From the perspective of flavor characteristics, these delicacies are mainly characterized by spiciness and sweetness, providing people with a variety of taste options.

In terms of dish type, main courses are the most numerous, followed by desserts and snacks, giving diners a wide range of choices.

Cooking usually takes 20-40 minutes, while ingredient preparation generally takes about 10-20 minutes. Different types of dishes may require different cooking times, but overall the cooking times are broadly similar, and a dish is usually ready to serve in 20-40 minutes.

In terms of regional differences, the South (South India) and West (West India) regions have the most vegetarian dishes. States such as Gujarat, Punjab, West Bengal, Maharashtra, and Assam are known for their spicy and sweet delicacies and offer many dishes. Overall, each region has a state that excels in gastronomy, displaying a diverse range of tastes and cooking styles.

Suggestion:

Try vegetarian food: Since most of the dishes are vegetarian, it is recommended that tourists try some colorful vegetarian dishes to experience the unique charm of Indian vegetarian culture.

Explore spicy and sweet flavors: Spicy and sweet flavors are the main characteristics of Indian cuisine. Try some spicy and sweet dishes, but be aware of your personal spiciness preference. Some Indian foods can be very spicy.

Diverse choices: Main courses, desserts, and snacks all offer rich choices. Tourists are encouraged to try several types of dishes to fully experience the flavors of India.

Consider waiting time: Cooking usually takes 20-40 minutes, so visitors should allow for the wait. During peak hours in particular, it may be better to order in advance or to dine at a less busy time.

Explore regional cuisine: Given the regional differences, it is worth exploring the cuisine of different regions, such as tasting local vegetarian and spicy dishes in the South and West regions and trying local specialties in states such as Gujarat, Punjab, West Bengal, Maharashtra, and Assam.

Know the serving speed: Various dishes may be served at different times, and some dishes may be served faster. Make your choice based on your personal schedule.

All in all, India’s food culture is rich and colorful, and tourists can try different types of dishes and enjoy unique tastes and cultural experiences.