Insurance_claims

Environment setup:

Environment: Windows 10 + Anaconda + Jupyter Notebook
Libraries: NumPy, pandas, matplotlib, seaborn, missingno. Packages are installed and managed mainly with conda and pip.
Dataset: Insurance Claims Analysis

Questions to explore:
1. Gender distribution of claims
2. Discrete distribution of age
3. Regional distribution
4. Discrete distribution of monthly income
5. Discrete distribution of annual income
6. Distribution of claims status
7. Reasons for claim failure
8. Distribution of insurance types
9. Physical condition analysis

# Import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt


# Set the configuration to output high-definition vector graphics:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

# Use pandas for data reading and analysis:
data = pd.read_excel("D:/Insurance_claims.xls")

# Output main information:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18740 entries, 0 to 18739
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   gender                   18740 non-null  object
 1   name                     18740 non-null  object
 2   age                      18740 non-null  int64
 3   region                   18740 non-null  object
 4   insurance claim date     18740 non-null  object
 5   monthly income           18740 non-null  int64
 6   annual income            18740 non-null  int64
 7   phone                    18740 non-null  int64
 8   physical condition       18740 non-null  object
 9   policy number            18740 non-null  object
 10  claim status             18740 non-null  object
 11  claim failure reason     11388 non-null  object
 12  insurance claim payment  18740 non-null  int64
 13  insurance type           18740 non-null  object
dtypes: int64(5), object(9)
memory usage: 2.0+ MB
# Get the number of rows and columns:
rows = len(data)
columns = len(data.columns)
print(rows, columns)
# Output the data type of each column:
columns_type = data.dtypes
columns_type
18740 14

gender                     object
name                       object
age                         int64
region                     object
insurance claim date       object
monthly income              int64
annual income               int64
phone                       int64
physical condition         object
policy number              object
claim status               object
claim failure reason       object
insurance claim payment     int64
insurance type             object
dtype: object
# Configure a font so that Chinese characters display correctly in plots:
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# The info() output above shows that the data contains missing values; count them per column:
missing_values = data.isnull().sum()
print(missing_values)
# Visualize the missing values:
import missingno as msno
msno.matrix(data, figsize=(12, 5), labels=True)
gender                        0
name                          0
age                           0
region                        0
insurance claim date          0
monthly income                0
annual income                 0
phone                         0
physical condition            0
policy number                 0
claim status                  0
claim failure reason       7352
insurance claim payment       0
insurance type                0
dtype: int64





<Axes:>


msno.bar(data, figsize=(15, 5))  # bar chart display
<Axes:>
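The only column with missing values is the claim failure reason (7352 missing), and 7352 is also the number of successful claims reported in the claim status section. That suggests, but does not prove, that the gaps are structural: a successful claim simply has no failure reason. A minimal sketch of how this could be checked, using a small made-up DataFrame (the column names follow this article; the rows are hypothetical stand-ins for the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the rows are invented for illustration.
toy = pd.DataFrame({
    "claim status": ["Success", "Failure", "Failure", "Success"],
    "claim failure reason": [np.nan, "Client conceals facts",
                             "Does not meet the scope of claims", np.nan],
})

# Flag the rows whose failure reason is missing.
reason_missing = toy["claim failure reason"].isnull().rename("reason missing")

# Cross-tabulate claim status against the missing flag.
check = pd.crosstab(toy["claim status"], reason_missing)
print(check)

# True only if the reason is missing exactly when the claim succeeded.
structural = (reason_missing == (toy["claim status"] == "Success")).all()
print(structural)
```

If the same check holds on the real data, the missing reasons need no imputation and can be left as they are.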


Data cleaning:

The data processing at this stage targets the problem found in the previous stage: missing values. Cleaning has two goals: first, to make the data usable; second, to transform it into a form better suited to analysis. Generally speaking, “dirty” data must be cleaned up before it can be analyzed reliably.
The simplest way to handle missing data is to remove it. In pandas, dropna() drops rows containing NaN values.
Alternatively, missing numerical data can be replaced by the mean or median of the column, and missing categorical data can be filled with the column's most frequent value (the mode). If a missing value cannot be resolved, set it aside for now rather than rushing to delete it, because later operations can skip null values.
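The options described above can be sketched on a small made-up DataFrame. The values here are hypothetical and are not taken from the claims dataset; only dropna(), fillna(), median() and mode() are standard pandas calls:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing number and one missing category.
toy = pd.DataFrame({
    "monthly income": [2000.0, np.nan, 5000.0, 3000.0],
    "region": ["Beijing", "Shanghai", None, "Beijing"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill instead of dropping.
filled = toy.copy()
# Numerical column: replace NaN with the column median.
income_median = filled["monthly income"].median()
filled["monthly income"] = filled["monthly income"].fillna(income_median)
# Categorical column: replace missing entries with the most frequent value.
region_mode = filled["region"].mode()[0]
filled["region"] = filled["region"].fillna(region_mode)

print(dropped)
print(filled)
```

Dropping keeps only complete rows, while filling keeps all rows at the cost of introducing estimated values.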

# Gender distribution:
gender = data['gender'].value_counts(ascending=True)
print(gender)
gender.plot.pie()
# Most of the claim cases involve women (14,903 of 18,740), possibly female drivers
Male 3837
Female 14903
Name: gender, dtype: int64





<Axes: ylabel='gender'>


# Age distribution:
type(data['age'])
df = data['age'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize = (30,15))
plt.show()
# Claims are more frequent in the older age ranges; perhaps young people are involved in fewer incidents, are less experienced, or do not have enough money to buy insurance.


# Regional distribution:
region = data['region'].value_counts(ascending=True)
print(region)
plt.figure(figsize=(25, 5))  # create the canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(region, linewidth=3, marker='o',
         markerfacecolor='blue', markersize=5)

plt.title('Regional distribution')
plt.show()
# Hainan Province has the largest number of cases (699)
Shanxi Province 555
Jiangsu Province 585
Tibet Autonomous Region 586
Jilin Province 588
Tianjin 593
Xinjiang Uygur Autonomous Region 596
Inner Mongolia Autonomous Region 597
Shandong Province 604
Liaoning Province 604
Macau Special Administrative Region 607
Henan Province 608
Fujian Province 610
Beijing 616
Gansu Province 620
Zhejiang Province 621
Sichuan Province 627
Shanghai 628
Yunnan Province 635
Jiangxi Province 637
Chongqing 640
Guizhou Province 641
Ningxia Hui Autonomous Region 642
Shaanxi Province 644
Anhui Province 644
Hebei Province 645
Heilongjiang Province 649
Qinghai Province 662
Hubei Province 665
Hong Kong Special Administrative Region 692
Hainan Province 699
Name: region, dtype: int64

# Monthly income analysis:
data['monthly income'].describe()
count 18740.000000
mean 4759.274066
std 6170.065566
min 200.000000
25% 919.000000
50% 2584.500000
75% 5434.250000
max 49934.000000
Name: monthly income, dtype: float64
# Annual income analysis:
data['annual income'].describe()
count 18740.000000
mean 73578.522785
std 74427.940579
min 5563.000000
25% 29920.000000
50% 47891.000000
75% 83821.000000
max 625194.000000
Name: annual income, dtype: float64


Data visualization

The pandas.pivot_table function takes four main parameters, plus some optional ones. The four main parameters are the data source (data), the row index (index), the columns (columns), and the values to aggregate (values). Optional parameters control how values are summarized (aggfunc), how NaN values are filled (fill_value), and whether summary rows and columns are displayed (margins).
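As a minimal sketch of these parameters, the following builds a pivot table on a small made-up DataFrame. The column names follow this article, but the rows and payment amounts are hypothetical:

```python
import pandas as pd

# Hypothetical toy data; not taken from the real claims file.
toy = pd.DataFrame({
    "region": ["Beijing", "Beijing", "Shanghai", "Shanghai"],
    "claim status": ["Success", "Failure", "Failure", "Failure"],
    "insurance claim payment": [1000, 0, 0, 0],
})

table = pd.pivot_table(
    toy,                               # data source
    values="insurance claim payment",  # values to aggregate
    index="region",                    # row index
    columns="claim status",            # column labels
    aggfunc="sum",                     # how values are summarized
    fill_value=0,                      # value used for empty cells
    margins=True,                      # add an "All" summary row and column
)
print(table)
```

Shanghai has no successful claim in the toy data, so that cell would be empty without fill_value; with margins=True the "All" row and column hold the totals.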

For visual analysis, we use Python's commonly used plotting libraries, matplotlib and seaborn. Plenty of user guides for both are already available online, so I won't go into detail here; I will write up some summaries when I have time.

# Monthly income scatter plot analysis:
type(data['monthly income'])
df = data['monthly income'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize = (30,5))
plt.show()
# The plot shows four classes: <=5000, <=10000, <=20000, and >20000. Most people have a monthly income of around 5000 or less.

![Monthly income scatter plot (output_15_0.svg)](https://img-blog.csdnimg.cn/a673e6343c6b40a99aecbb685365a29c.png)

# Annual income scatter plot analysis:
type(data['annual income'])
df = data['annual income'].to_frame()
column = df.columns[0]
df['index'] = df.index.tolist()
df.plot.scatter(x='index', y=column, figsize = (30,5))
plt.show()
# The annual income also falls into several distinct intervals


# Use pd.cut for interval division:
data["monthly income class"] = pd.cut(data["monthly income"], 4)  # divide monthly income into 4 equal-width intervals
price_info = data["monthly income class"].value_counts(sort=False)  # count how many people fall into each interval

price_info.plot(label='quantity', title='Monthly income distribution', figsize=(11, 5))
plt.show()
# These equal-width intervals turn out to be too coarse
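One possible alternative to equal-width intervals is to pass explicit bin edges to pd.cut, for example the four classes read off the monthly income scatter plot (<=5000, <=10000, <=20000, >20000). The sketch below uses a handful of illustrative income values rather than the real column, and the labels are my own naming:

```python
import pandas as pd

# Illustrative income values; the real column has 18740 entries.
income = pd.Series([200, 919, 2584, 5434, 9000, 15000, 49934],
                   name="monthly income")

# Explicit edges: (0, 5000], (5000, 10000], (10000, 20000], (20000, inf).
bins = [0, 5000, 10000, 20000, float("inf")]
labels = ["<=5000", "5000-10000", "10000-20000", ">20000"]

income_class = pd.cut(income, bins=bins, labels=labels)

# Count the entries per class, keeping the classes in bin order.
counts = income_class.value_counts(sort=False)
print(counts)
```

With edges chosen this way, the dense low-income range is no longer lumped into a single wide interval.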


# Claim status distribution:
status = data['claim status'].value_counts(ascending=True)
print(status)
status.plot.pie()
# Most claims fail (11,388 of 18,740)
Success 7352
Failure 11388
Name: claim status, dtype: int64

<Axes: ylabel='claim status'>


# Claim failure analysis:
reason = data['claim failure reason'].value_counts(ascending=True)
print(reason)
plt.figure(figsize=(10, 5))  # create the canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(reason, linewidth=3, marker='o',
         markerfacecolor='blue', markersize=5)

plt.title('Cause Analysis of Claim Failure')
plt.show()
Does not meet the scope of claims 707
Client conceals facts 1409
The parties concerned do not understand the relevant business scope of insurance 9272
Name: claim failure reason, dtype: int64
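To make the counts above easier to compare, they can be converted to shares of all failed claims. This is a small sketch that rebuilds the Series from the printed counts; the shortened labels are my own, and on the real data value_counts(normalize=True) would give the same proportions directly:

```python
import pandas as pd

# Counts copied from the output above; labels shortened for readability.
reason_counts = pd.Series({
    "Does not meet the scope of claims": 707,
    "Client conceals facts": 1409,
    "Does not understand the insurance business scope": 9272,
})

# Share of each reason among all failed claims.
reason_share = reason_counts / reason_counts.sum()
print(reason_share.round(3))
```

By this calculation, not understanding the business scope of the insurance accounts for roughly four out of five failed claims.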


# Distribution of insurance types:
kind = data['insurance type'].value_counts(ascending=True)
print(kind)
plt.figure(figsize=(10, 5))  # create the canvas
plt.xticks(rotation=90)  # rotate the x-axis labels
plt.plot(kind, linewidth=3, marker='o',
         markerfacecolor='blue', markersize=5)

plt.title('Insurance type distribution')
plt.show()
# Most claims are for medical insurance
Pension insurance 427
Property damage insurance 630
Credit guarantee insurance 885
Maternity insurance 1238
Unemployment insurance 1392
Liability insurance 1417
Work injury insurance 2203
Medical insurance 10548
Name: insurance type, dtype: int64


# Physical condition distribution:
body = data['physical condition'].value_counts(ascending=True)
print(body)
body.plot.pie()
# Nearly half of the applicants are ill (9,065 of 18,740)
Sub-health 2037
Healthy 7638
Disease state 9065
Name: physical condition, dtype: int64





<Axes: ylabel='physical condition'>
