Environment construction:
Environment: Windows 10 + Anaconda + Jupyter Notebook
Libraries: NumPy, pandas, matplotlib, seaborn, missingno. Packages are installed and managed mainly with conda and pip.
Dataset: Insurance Claims Analysis
Questions to explore:
1. Gender distribution of claims
2. Discrete distribution of age
3. Regional distribution
4. Discrete distribution of monthly income
5. Discrete distribution of annual income
6. Distribution of claims status
7. Reasons for claim failure
8. Distribution of insurance types
9. Physical condition analysis
# Import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
# Configure the notebook to output high-definition vector graphics:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline
# Read the data with pandas:
data = pd.read_excel("D:/Insurance_claims.xls")
# Output the main information:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18740 entries, 0 to 18739
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   gender                    18740 non-null  object
 1   name                      18740 non-null  object
 2   age                       18740 non-null  int64
 3   region                    18740 non-null  object
 4   insurance claim date      18740 non-null  object
 5   monthly income            18740 non-null  int64
 6   annual income             18740 non-null  int64
 7   phone                     18740 non-null  int64
 8   physical condition        18740 non-null  object
 9   policy number             18740 non-null  object
 10  claim status              18740 non-null  object
 11  reason for claim failure  11388 non-null  object
 12  insurance claim payment   18740 non-null  int64
 13  insurance type            18740 non-null  object
dtypes: int64(5), object(9)
memory usage: 2.0+ MB
# Get the number of rows and columns
rows = len(data)
columns = len(data.columns)
print(rows, columns)
# Output the data type of each column
columns_type = data.dtypes
columns_type
18740 14
gender                      object
name                        object
age                          int64
region                      object
insurance claim date        object
monthly income               int64
annual income                int64
phone                        int64
physical condition          object
policy number               object
claim status                object
reason for claim failure    object
insurance claim payment      int64
insurance type              object
dtype: object
# Configure matplotlib so that Chinese characters display correctly
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# The info() output above shows that the data contains missing values; count them per column:
missing_values = data.isnull().sum()
print(missing_values)
# Visualize the missing values:
import missingno as msno
msno.matrix(data, figsize=(12, 5), labels=True)
gender                         0
name                           0
age                            0
region                         0
insurance claim date           0
monthly income                 0
annual income                  0
phone                          0
physical condition             0
policy number                  0
claim status                   0
reason for claim failure    7352
insurance claim payment        0
insurance type                 0
dtype: int64
<Axes:>
msno.bar(data,figsize = (15,5)) # bar graph display
<Axes:>
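A missing failure reason may simply mean that the claim succeeded. The following is a minimal sketch of how that could be checked, using hypothetical toy rows: the column names follow the code in this post, but the values are made up and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({
    "claim status": ["Success", "Fail", "Fail", "Success"],
    "reason for claim failure": [
        np.nan,
        "Client conceals facts",
        "Does not meet the scope of claims",
        np.nan,
    ],
})

# True where the failure reason is missing.
reason_missing = toy["reason for claim failure"].isnull()
# True where the claim succeeded.
claim_succeeded = toy["claim status"] == "Success"

# If the two masks are identical, the gaps are structural rather than data errors.
gaps_are_structural = bool((reason_missing == claim_succeeded).all())
print(gaps_are_structural)
```

If the same comparison holds on the real data, the gaps in that column do not need to be filled or dropped.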
Data cleaning:
The data processing at this stage targets the problem found in the previous stage: missing values. Cleaning has two goals. The first is to make the data usable. The second is to transform the data into a form that is better suited to analysis. In short, "dirty" data has to be cleaned before it can be analyzed.
The usual way to handle missing data is to remove it. In pandas, dropna() drops rows that contain NaN values.
Alternatively, missing values can be filled in. Numerical data can be replaced by the mean or median of the column. Categorical data can be filled with the most frequent value (the mode) of the column. If a missing value cannot be resolved, it can be left in place for now rather than deleted, because later operations can skip null values. In this dataset the only column with gaps is the claim failure reason, and its 7352 missing values match the 7352 successful claims, so the gaps are expected and are left in place.
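The options described above can be sketched on a small example. The toy DataFrame and its values are hypothetical and only illustrate the calls; dropna(), fillna(), median(), and mode() are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative, not the real dataset.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "region": ["Beijing", "Shanghai", None, "Beijing"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median ...
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
# ... and categorical gaps with the most frequent value (the mode).
filled["region"] = filled["region"].fillna(filled["region"].mode()[0])

print(len(dropped))
print(filled)
```

Dropping is the simplest choice but discards whole rows; filling keeps every row at the cost of introducing estimated values.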
# Gender distribution:
gender = data['gender'].value_counts(ascending=True)
print(gender)
gender.plot.pie()
# Most claim cases involve women, perhaps including many female drivers

Male       3837
Female    14903
Name: gender, dtype: int64
<Axes: ylabel='gender'>
# Age distribution:
df = data['age'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize=(30, 15))
plt.show()
# Claim cases are more common among older people; younger people may be less involved, less experienced, or unable to afford insurance.
# Regional distribution:
region = data['region'].value_counts(ascending=True)
print(region)
plt.figure(figsize=(25, 5))  # create a canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(region, linewidth=3, marker='o', markerfacecolor='blue', markersize=5)
plt.title('regional distribution')
plt.show()
# Hainan Province has the most cases

Shanxi Province                             555
Jiangsu Province                            585
Tibet Autonomous Region                     586
Jilin Province                              588
Tianjin                                     593
Xinjiang Uygur Autonomous Region            596
Inner Mongolia Autonomous Region            597
Shandong Province                           604
Liaoning Province                           604
Macau Special Administrative Region         607
Henan Province                              608
Fujian Province                             610
Beijing                                     616
Gansu Province                              620
Zhejiang Province                           621
Sichuan Province                            627
Shanghai                                    628
Yunnan Province                             635
Jiangxi Province                            637
Chongqing                                   640
Guizhou Province                            641
Ningxia Hui Autonomous Region               642
Shaanxi Province                            644
Anhui Province                              644
Hebei Province                              645
Heilongjiang Province                       649
Qinghai Province                            662
Hubei Province                              665
Hong Kong Special Administrative Region     692
Hainan Province                             699
Name: region, dtype: int64
# Monthly income analysis:
data['monthly income'].describe()

count    18740.000000
mean      4759.274066
std       6170.065566
min        200.000000
25%        919.000000
50%       2584.500000
75%       5434.250000
max      49934.000000
Name: monthly income, dtype: float64
# Annual income analysis:
data['annual income'].describe()

count     18740.000000
mean      73578.522785
std       74427.940579
min        5563.000000
25%       29920.000000
50%       47891.000000
75%       83821.000000
max      625194.000000
Name: annual income, dtype: float64
Data visualization:
The pandas.pivot_table function takes four main parameters plus several optional ones. The main parameters are data (the source DataFrame), index (the row keys), columns (the column keys), and values (the column to aggregate). The optional parameters control how values are aggregated (aggfunc), how NaN values are handled (fill_value, dropna), and whether summary rows and columns are shown (margins).
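As an illustration of these parameters, here is a minimal sketch that counts claims by insurance type and claim status. The toy rows are hypothetical; only the column names are borrowed from this post.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "insurance type": ["Medical", "Medical", "Medical", "Liability", "Liability"],
    "claim status": ["Success", "Fail", "Fail", "Success", "Fail"],
    "insurance claim payment": [1000, 0, 0, 500, 0],
})

table = pd.pivot_table(
    toy,
    index="insurance type",            # row keys
    columns="claim status",            # column keys
    values="insurance claim payment",  # column to aggregate
    aggfunc="count",                   # count rows in each cell
    fill_value=0,                      # show 0 instead of NaN for empty cells
    margins=True,                      # add an "All" summary row and column
)
print(table)
```

Changing aggfunc to "sum" or "mean" would summarize the payment amounts instead of counting the claims.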
Visual analysis relies on Python's commonly used plotting libraries, matplotlib and seaborn. Plenty of user guides are already available online, so I won't go into detail here; I will write up a summary when I have time.
# Monthly income scatter plot analysis:
df = data['monthly income'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize=(30, 5))
plt.show()
# The plot shows four bands: <=5000, <=10000, <=20000, and >20000. Most people have a monthly income of around 5000 or less.
# Annual income scatter plot analysis:
df = data['annual income'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize=(30, 5))
plt.show()
# The annual income also falls into several intervals
# Divide the values into intervals with pd.cut
data["monthly income class"] = pd.cut(data["monthly income"], 4)  # split the monthly income column into 4 equal-width bins
price_info = data["monthly income class"].value_counts(sort=False)  # count how many people fall in each bin
price_info.plot(label='quantity', title='monthly income distribution', figsize=(11, 5))
plt.show()
# The equal-width intervals turn out to be rather coarse
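When equal-width bins are too coarse, pd.cut also accepts explicit bin edges. The sketch below assumes the four bands noted in the monthly income scatter plot as edges; the sample incomes and the label strings are hypothetical.

```python
import pandas as pd

# Hypothetical sample of monthly incomes, for illustration only.
income = pd.Series([200, 919, 2584, 5434, 9000, 15000, 49934],
                   name="monthly income")

# Explicit edges; each bin includes its right edge (pd.cut default right=True).
bins = [0, 5000, 10000, 20000, float("inf")]
labels = ["<=5000", "5000-10000", "10000-20000", ">20000"]

income_class = pd.cut(income, bins=bins, labels=labels)
counts = income_class.value_counts(sort=False)
print(counts)
```

Because income is heavily skewed toward low values, hand-picked edges like these keep most of the data from collapsing into a single bin.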
# Claim status distribution:
status = data['claim status'].value_counts(ascending=True)
print(status)
status.plot.pie()
# Most claims fail

Success     7352
Fail       11388
Name: claim status, dtype: int64
<Axes: ylabel='claim status'>
# Claim failure analysis:
reason = data['reason for claim failure'].value_counts(ascending=True)
print(reason)
plt.figure(figsize=(10, 5))  # create a canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(reason, linewidth=3, marker='o', markerfacecolor='blue', markersize=5)
plt.title('Cause Analysis of Claim Failure')
plt.show()

Does not meet the scope of claims                                                    707
Client conceals facts                                                               1409
The parties concerned do not understand the relevant business scope of insurance    9272
Name: reason for claim failure, dtype: int64
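The counts are easier to compare as percentages. A small sketch that recomputes the shares from the three counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
reason_counts = pd.Series({
    "Does not meet the scope of claims": 707,
    "Client conceals facts": 1409,
    "The parties concerned do not understand the relevant business scope of insurance": 9272,
})

# Share of each reason among all failed claims, in percent.
shares = (reason_counts / reason_counts.sum() * 100).round(1)
print(shares)
```

On the full DataFrame, data['reason for claim failure'].value_counts(normalize=True) should give the same proportions directly, since value_counts ignores NaN by default.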
# Distribution of insurance types:
kind = data['insurance type'].value_counts(ascending=True)
print(kind)
plt.figure(figsize=(10, 5))  # create a canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(kind, linewidth=3, marker='o', markerfacecolor='blue', markersize=5)
plt.title('Insurance type distribution')
plt.show()
# Most claims are for medical insurance

Pension insurance               427
Property damage insurance       630
Credit Guarantee Insurance      885
Maternity insurance            1238
Unemployment Insurance         1392
Liability insurance            1417
Work injury insurance          2203
Medical insurance             10548
Name: insurance type, dtype: int64
# Physical condition distribution:
body = data['physical condition'].value_counts(ascending=True)
print(body)
body.plot.pie()
# Many claimants are in a disease state

Sub-health       2037
Health           7638
Disease state    9065
Name: physical condition, dtype: int64
<Axes: ylabel='physical condition'>