Statistics Analysis using Python
Here, we explore statistics concepts of mean, median and mode. Skewness of the data is also being explored. And we end with the kernal density estimation (KDE) plot.
DS102 Problem Set - Week 5
HWA-DS102-PS5-201810
If you have any questions to the Problem Set, feel free to send a message to your TAs or Instructors in the Telegram chat. You are encouraged to also discuss the problem sets within your own groups.
# Rename the notebook. Then, fill in your details here.
# Name : Barbara Yam
# Class : 5 May 2019
Datasets Required:¶
insurance-claims.csv
In this problem set, you will be using what you have learnt in DS102 to understand an applied problem in actuarial science, in particular the concept of moral hazard.
Problem Statement¶
In insurance, an insurance policy is a contract between a policy holder and the insurance company. For a fee or premium, the policy holder receives financial protection in the event of a loss. This claim due to the loss will be paid by the insurance company.
The claim that is paid by the insurance company or insurer to the policy holder or insured (usually an individual) is subject to a deductible, a co-insurance payment and a policy limit. The definition for each of the terms is as follows:
A Deductible is the minimum sum that is payable by the policy holder, before the insurance company needs to pay out to the policy holder.
A Co-insurance is the proportion of the bill that is payable by the insurance company. The remaining proportion is payable by the policy holder.
A Out-of-pocket Limit / Policy Limit is the maximum amount that the insurance company has to pay for a claim. The remaining amount needs to be paid by the policy holder.
Consider the following example:
Two travellers, A and B paid for a travel insurance policy before their flights. This policy has a deductible of $d=\$300$. The **co-insurance** ratio is $c=0.2$ or $20\%$ and the** policy limit** is $l=\$5,000$.
A. Traveller A was not able to fly back due to a typhoon at the holiday destination. A files for a claim of $\$2,000$ for A's flight back. The payout from the insurer is calculated as follows:
- After subjecting to a deductible, the amount to be paid by the insurer is $2000-d=2000-300=1700$
- After subjecting to co-insurance, the amount to be paid is $(1-c)\times 1700 = (1-0.2)\times 1700=1360$
and hence A will receive a claim amount of $\$1360$ from the insurer and his out-of-pocket payment is $\$640$.
B. Traveller B's valuables and laptop was stolen while travelling. B files for a claim of $\$12,000$ which was the value of all the items stolen. The payout from the insurer is calculated as follows:
- After subjecting to a deductible, the amount to be paid by the insurer is $12000-d=12000-300=11700$
- After subjecting to co-insurance, the amount to be paid is $(1-c)\times 11700 = (1-0.2)\times 11700=9360$
- The policy limit is $l=5000$. Since the claim amount, $\$9360$ is higher than $\$5000$, the amount to be paid is only the policy limit of $\$5000$
and hence B will receive a claim amount of $\$5000$ from the insurer and his out-of-pocket payment is $\$7000$.
In this problem, the task is to study the effects of each of the following instruments on the total amount the insurance company needs to pay out.
Dataset¶
You are given a baggage loss dataset from the TSA Claims Database. In this dataset, the case_number
represents the unique ID associated with the case from the insurer and claim_amount
is the amount of claim filed by the passenger, or the loss due to travelling borne by the passenger.
Answer the following questions. Mark breakdowns are stated in the question e.g. [1]
. **Where stated, both the code and typed answers must be provided to receive full credit.**
# import the relevant packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Read from CSV to df
¶
Read the dataset from insurance-claims.csv
to a DataFrame
called df
.
# Read the dataset from CSV to a df. Write your code here
df= pd.read_csv('insurance-claims.csv')
Part 1 (6 Marks)¶
1 i) What is the mean of claim_amount
? [0.5]
# What is the mean of claim_amount?
# (Write your code here)
#
print(df['claim_amount'].mean())
# (Type your answer here)
# The mean is $131.41266399999995.
1 ii) What is the variance of claim_amount
? [0.5]
# What is the variance of claim_amount?
# (Write your code here)
#
print(df['claim_amount'].var(axis=0))
# (Type your answer here)
#The variance is $25235.105180495306.
1 iii)
What is the skewness of
claim_amount
?[0.5]
How do you interpret the result of
claim_amount
's skewness value?[0.5]
# What is the skewness of claim_amount?
# (Write your code here)
#
print(df['claim_amount'].skew())
# (Type your answer here)
#The skew is 2.4650649305683907
# How do you interpret the result of claim_amount's skewness value?
# (Type your answer here)
# The skewness is positive (>0), indicating that claim_amount is skewed right (right-tailed).
1 iv) Fill in the blanks: Using the interquartile-range method, a claim_amount
is considered an outlier if it is below _1.5*IQR_____
or above _1.5*IQR_____
. [1]
# Find the range of values an outlier of claim_amount can take (find the lower bound & upper bound)
# (Write your code here)
#
quartile_1 = np.percentile(df['claim_amount'], 25)
quartile_3 = np.percentile(df['claim_amount'], 75)
iqr = quartile_3 - quartile_1
lower_bound = quartile_1 - 1.5*iqr
upper_bound = quartile_3 + 1.5*iqr
print(lower_bound)
print(upper_bound)
# (Type your answer here)
# A claim_amount is considered an outlier if it is below _-153.81_____ or above _352.53_____
1 v) What proportion of the observations in claim_amount
are outliers? [1]
# What proportion of the observations in claim_amount are outliers?
# (Write your code here)
#
outliers_df = df[(df['claim_amount']<-153.81)|(df['claim_amount']>352.53)]
print(len(outliers_df)/len(df)*100)
# (Type your answer here)
#8.04% of the observations are outliers.
1 vi) Plot a Kernel Density Estimate (KDE) plot for claim_amount
using sns.kdeplot()
. Use the Gaussian kernel with bandwidth 25. In other words, add the parameters kernel='gau'
and bw=25
when plotting. Also add the parameter shade=True
to shade the area under the graph.[1]
Hint: Use the following lines to help you.
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
# Plot a KDE plot for claim_amount
# (Write your code here)
fig = plt.figure(figsize=(16,8))
ax1 = fig.add_subplot(111)
sns.kdeplot(df['claim_amount'], kernel='gau', bw=25, shade= True, ax=ax1)
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
ax1.legend(['claim_amount'])
1 vii)
- What is the difference when calculating ($\text{median}$ - $\text{mean}$) of
claim_amount
? How will you interpret this value, in relation to the skewness of the distribution? Hint: Use the answers in 1 i) and 1 vii) to help you.[1]
# What is the difference when calculating (median - mean) of claim_amount?
# (Write your code here)
#
print(df['claim_amount'].median()-df['claim_amount'].mean())
# (Type your answer here)
# The difference is $-60.28766399999995.
# How will you interpret the value of (median - mean), in relation to the skewness
# of the distribution?
# (Type your answer here)
# Since (median-mean) < 0, it indicates that median is on the left of the mean.
# The graph is positively skewed and skewed to the right (right tailed).
Part 2 (6 Marks)¶
For this question, the following values are defined:
Instrument | Variable | Value |
---|---|---|
Deductible | $d$ | $50$ |
Co-insurance ratio | $c$ | $0.3$ |
Policy Limit | $l$ | $500$ |
2 i) Write a function compute_claim_after_deductible(claim_amount)
that takes in a claim amount claim_amount
as the only parameter. compute_claim_after_deductible()
returns 0
if the claim_amount
is less than 50
, and returns (claim_amount - 50
) if the claim amount is above 50
.
Hint: You may use the following code to help you. If you do, take note that you need to correct the logic errors. [0.5]
# Write a function compute_claim_after_deductible(claim_amount)
# (Write your code here)
def compute_claim_after_deductible(claim_amount):
claim = claim_amount
d = 50
if d <= claim:
return(claim - d)
else:
return(0)
2 ii) Now, create a new column claim_amount_with_deductible
. This is the result of apply
ing compute_claim_after_deductible
to the column claim_amount
. [0.5]
# Create a new column claim_amount_with_deductible by using Series.apply()
# (Write your code here)
df['claim_amount_with_deductible'] = df['claim_amount'].apply(compute_claim_after_deductible)
2 iii) Plot a KDE plot for both claim_amount
and claim_amount_with_deductible
using sns.kdeplot()
. They MUST be in the same KDE plot.
Use the Gaussian kernel with bandwidth 25. In other words, add the parameters kernel='gau', bw=25
when plotting. Also, add the parameters shade=True
to shade the area under the KDE plot. [1]
Hint: Use the following lines to help you.
fig2 = plt.figure(figsize=(16, 8))
ax1 = fig2.add_subplot(111)
# Plot a KDE plot for both claim_amount and claim_amount_with_deductible. They MUST be
# in the same plot.
# (Write your code here)
fig2 = plt.figure(figsize=(16, 8))
ax1 = fig2.add_subplot(111)
sns.kdeplot(df['claim_amount'], kernel='gau', bw=25, shade= True, ax=ax1,color='blue')
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount")
ax1.legend(['claim_amount','claim_amount_with_deductible'])
sns.kdeplot(df['claim_amount_with_deductible'], kernel='gau', bw=25, shade= True, ax=ax1,color='red')
2 iv) How has the deductible impacted the shape of the KDE plot? Describe the change in terms of the shift of the mean and the proportion of the observations having the value of 0.
Hint: Consider using the results of df.describe()
to explain your answer. [1]
# How has the deductible impacted the shape of the KDE plot?
# Code is not required. However, you could use df.describe() to help you.
#
df.describe()
# (Type your answer here)
# The mean for claim_amount_with_deductible is lower than that of claim_amount.
# The mean is shifted to the left.
# The standard deviation has also been reduced; this makes the graph more skewed to the right.
# The claims less than $50 become $0 with deductible.
# There is a higher percentage of observations having 0, the bottom 25% claims 0 now,
# compared to the values in the claim_amount column.
2 v) Write a function compute_claim_payable_by_insurer(claim_amount)
that takes in a claim amount claim_amount
as the only parameter. compute_claim_payable_by_insurer()
performs the following:
- if the
claim_amount
is less than50
, return0
- if the claim amount is above
50
, deduct50
from theclaim_amount
, then calculate the copayment paid by the insurer which is $1-0.3=0.7$ of the remaining amount - if the value calculated in step 2 exceeds the policy limit of
500
, then return500
. If not, return the value calculated in step 2
Hint: You may use the following code to help you. If you do, take note that you need to correct the logic errors. [1]
# Write a function compute_claim_payable_by_insurer(claim_amount)
# (Write your code here)
#
def compute_claims_payable_by_insurer(claim_amount):
d = 50
c = 0.3
l = 500
claim = claim_amount
payout = 0
if claim < d:
payout = payout
else:
payout = (claim-d) * (1-c)
if payout > l:
payout = l
return (payout)
2 vi) Now, create a new column claim_amount_by_insurer
. This is the result of apply
ing compute_claim_payable_by_insurer()
to the column claim_amount
Be careful! Choose the correct column. [0.5]
# Create a new column claim_amount_with_deductible by using Series.apply()
# (Write your code here)
df['claim_amount_by_insurer']= df['claim_amount'].apply(compute_claims_payable_by_insurer)
df.head()
2 vii) Run the following cell to find out key statistics of the df
. Fill in the blanks:
The maximum of claim_amount_by_insurer
is this equal to the ( ? )
.
[0.5]
# Uncomment and run the following lines of code to see the report.
print("=== SUMMARY STATISTICS ===")
print(df.describe())
print()
print("=== SKEWNESS OF SERIES ===")
print(df.skew())
print()
print("=== SUM OF SERIES ========")
print(df[['claim_amount', 'claim_amount_with_deductible', 'claim_amount_by_insurer']].sum())
print()
print("=== VARIANCE OF SERIES ===")
print(df[['claim_amount', 'claim_amount_with_deductible', 'claim_amount_by_insurer']].var())
# Uncomment and run the following lines to see the KDE plot for both series.
fig3 = plt.figure(figsize=(16, 6))
ax1 = fig3.add_subplot(111)
sns.kdeplot(df['claim_amount'], shade=True, kernel='gau', bw=25, ax=ax1, color='red')
sns.kdeplot(df['claim_amount_by_insurer'], shade=True, kernel='gau', bw=25, ax=ax1, color='purple')
ax1.set_xlabel("claim_amount")
ax1.set_ylabel("density")
ax1.set_title("KDE plot of claim_amount, loss and paid by insurer",fontsize=16)
plt.show()
# No additional code is required.
# Answer the following questions:
# Fill in the blanks with one of the following: deductible, co-payment, policy limit
# The maximum of claim_amount_by_insurer is this equal to the __policy limit____.
2 viii) The definition of moral hazard. It states, with the application to property:
When a property owner obtains insurance on a property, the contract is based on the idea that the property owner will avoid situations that may damage the property. The moral hazard exists that the property owner, because of the availability of the insurance, may be less inclined to protect the property, since the payment from an insurance company lessens the burden on the property owner in the case of a disaster.
Using the KDE plot and report in 2 vii), answer the following questions.
i) When a traveller buys travel insurance for his/her trip, how does moral hazard exist? Explain by comparing how his / her behaviour changes when there is and when there isn't travel insurance for the trip.
ii) By considering the burden of payment of both the insurer and insured, and using the concept of moral hazard, explain why an insurance company needs to implement a deductible, co-insurance and a policy limit for this travel insurance policy.
[1]
# No additional code is required.
# Answer the following questions: When a traveller buys travel insurance for his/her
# trip, how does moral hazard exist?
# (Type your answer here)
For example, since there is availability of insurance, the traveller is less likely to take care of
his/her own belongings, or even himself/herself up to the policy limit.
If personal items get stolen, and traveller can claim and buy a new one at no
additional costs on his/her part, traveller is more inclined to declare the stolen items.
# No additional code is required.
# Why does an insurance company need to implement a deductible,
# co-insurance and a policy limit for this travel insurance policy.
# (Type your answer here)
Since there is now a copayment where insurer and insured to share the costs of the claims,
70% and 30% respectively after deductible, there is less incentives for the insured to lose their items
or to be reckless, sometimes or to even to make a claim, especially if a loss is less than $50.
This encourages travellers to take partial responsibility, and to reduce unethical behaviours.
The policy limit safeguards the insurance company from going out of business as there is a policy limit
to each claim.
Credits
- healthcare.gov for the definitions of instruments
- Investopedia for the definitions
- TSA Claims Database for the dataset