DS102 Problem Set - Week 3

Hackwagon Academy

HWA-DS102-PS3-201810

If you have any questions to the Problem Set, feel free to send a message to your TAs or Instructors in the Telegram chat. You are encouraged to also discuss the problem sets within your own groups.

Before you start, fill in your details for this notebook and rename the notebook to `_-problem-set-week-3`, for example `jane-doe-problem-set-week-3`.

# Rename the notebook. Then, fill in your details here.
# Name  :Barbara Yam
# Class : 5 May 2019, 7 pm

Datasets Required:¶

sgp-literacy-rate.csv
insurance-claims.csv
ks-projects-15k.csv
wines-200.csv
loans-by-grade.csv

# import the relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set matplotlib inline
%matplotlib inline

Part 1 (5 Marks)¶

Mark breakdowns are stated in the question e.g. [1]. For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**

1A.

i) What are the key features of a plot that needs to be explicitly stated? [0.5]

# What are the key features of a plot that needs to be explicitly stated?
# Code is not required.
# (Type your answer here)
#Title
#Legend
#Axes Labels for both the horizontal or  x -axis and the vertical or  y -axis
#Reasonable tick values

ii) You are given the following data in sgp-literacy-rate.csv, taken from World Bank Open Data. The definition of each column is as follows:

sgp-literacy-rate.csv

col name	definition	datatype	example
literacy_rate	The percentage of people aged 15 and above who can both read and write	float	96.716660
year	The year in which the literacy rate was obtained	int	2010

Read from CSV to df

Read the data from the sgp-literacy-rate.csv CSV file into a DataFrame called sgp_lr_df.

# Read from CSV to sgp_lr_df. Write your code here.

sgp_lr_df = pd.read_csv('sgp-literacy-rate.csv',sep="|")

sgp_lr_df
sgp_lr_df.reset_index()

The following code to plot the line graph is given to you. It plots the literacy rate of Singapore (% of people ages 15 and above) from 2010 to 2016. Perform the following steps:

Run the code to plot the line graph.
Using your answer from 1A or otherwise, modify the code to add the missing features of the plot. [1]

# Instructions: modify the code to add the missing features of the plot

DIMS = (10,5)
fig, ax1 = plt.subplots()
ax1.set_title("Singapore's Literacy Rate, 2010 to 2016", fontsize=16)
sgp_lr_df.plot(kind='line',x ='year',y ='literacy_rate',figsize=DIMS, ax=ax1, marker='o')

plt.xlabel("Year")
plt.ylabel('Literacy Rate')
ax1.legend(["Singapore's Literacy Rate"], fontsize=12)
# Display the line graph
plt.show()

1B. You are given a baggage loss dataset from the TSA Claims Database. In this dataset, the case_number represents the unique ID associated with the case from the insurer and claim_amount is the amount of claim filed by the passenger, or the loss due to travelling borne by the passenger.

Read from CSV to `df`¶

Read the dataset from insurance-claims.csv to a DataFrame called ins_df.

# Read from CSV to ins_df. Write your code here.
#

ins_df= pd.read_csv('insurance-claims.csv')

ins_df.head()

i) Plot a histogram for claim_amount using $8$ bins. Use the bins parameter to specify this. Also, as we do not need the KDE component, set the parameter kde=False. Ensure that all features of the plot have to be shown for full credit. Clue: Use sns.distplot(). [1]

Note: If you see this error:

Warning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.

You can ignore it. The plot will not be affected.

# Plot a histogram for claim_amount using 8 bins.
# Write your code here.

fig = plt.figure(figsize=(18, 8))
ax1 = fig.add_subplot(111)

sns.distplot(ins_df['claim_amount'],kde=False,ax=ax1,bins=8)

ax1.set_title("Histogram of Claim Amount, 8 bins", fontsize=16)
ax1.set_ylabel("Number of Claims")

Text(0, 0.5, 'Number of Claims')

ii) The following DataFrame called ins_df_agg_c is given to you. Run the code. What is the interpretation of count in ins_df_agg_c with respect to the plotted histogram in i) ? [0.5]

# No additional code is required.
ins_df_agg = pd.cut(ins_df['claim_amount'], 8).reset_index(name='bin_range')
ins_df_agg_c = ins_df_agg.groupby('bin_range').size().reset_index(name='count')
ins_df_agg_c

# What is the interpretation of count in ins_df_agg_c with respect to the plotted histogram?
# (Type your answer here)
# The number of claims within the amount that is specified the bin_range. 
# Bin range is gotten by dividing insurance amounts into 8 equal intervals.

1C. You are given the details of some projects taken from the KickStarter Projects dataset. The definition of each column is as follows:

ks-projects-15k.csv

col name	definition	datatype	example
ID	internal kickstart ID	int	59178964
name	name of a kickstarter project	str	I Believe in King The Kid
main_category	category of a campaign	str	Music
backers	the number of backers a project has	int	444
usd_pledged_real	the amount that was pledged for the project, in USD	float	27591.46
usd_goal_real	the fundraising goal of the project, in USD	float	55000

Read from CSV to `df`¶

Read the dataset from ks-projects-15k.csv to a DataFrame called kst_df.

# Read from CSV to kst_df. Write your code here.

kst_df = pd.read_csv('ks-projects-15k.csv')

kst_df.sample()

i) Filter for all Kickstarter Projects where the total amount pledged in USD, or usd_pledged_real is greater than $\$10000$. Store this in a new DataFrame called kst_df_10k. [0.5]

# Filter for all Kickstarter Projects where the usd_pledged_real is greater than 10000.
# (Write your code here)

kst_df_10k = kst_df[kst_df['usd_pledged_real']>10000]

kst_df_10k.sample()

ii) Create a new DataFrame, kst_df_agg where the first column is the main_category and the second column is called n. In kst_df_agg, main_category is the main category of kst_df_10k and n is the number of projects belonging to that main_category. Hint: use df.groupby().size(). [0.5]

# Create an aggregated df with 1 column being the main_category and the other column
# being the no. of projects in that category.
# (Write your code here)

kst_df_agg = kst_df.groupby('main_category').size()
kst_df_agg.reset_index(name='n')

iii) Using your answers to (i) and (ii) or otherwise, plot a bar graph where the $x$-axis is the main_category and the $y$-axis is the number of projects belonging to that main_category. [1]

# Plot the bar graph where the x-axis is the main_category and 
# the y-axis is the number of projects belonging to that main_category
# (Write your code here)

fig = plt.figure(figsize=(18, 7))
ax = fig.add_subplot(111)
kst_df_agg.plot(kind='bar', x='main_category', y='n',ax=ax,color='skyblue')
# Add the ylabel using ax.set_ylabel() and the title using ax.set_title()
ax.set_ylabel("Number of Projects")
ax.set_xlabel('Main Category')
ax.set_title("Number of Projects in each Main Category")

Text(0.5, 1.0, 'Number of Projects in each Main Category')

**Checkpoint**: When you have completed Part 1, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

Part 2 (7 Marks)¶

For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**

2A. You are given the score and price of some bottles of wines taken from the Wine Reviews dataset. The definition of each column is as follows:

wines-200.csv

col name	definition	datatype	example
country	The country that the wine is from	str	US
points	The number of points WineEnthusiast rated the wine on a scale of 1-100	int	86
price	The price of the bottle of wine	float	12.0

Read from CSV to `df`¶

Read the dataset from wines-200.csv to a DataFrame called wines_df.

# Read from CSV to wines_df. Write your code here.

wines_df = pd.read_csv('wines-200.csv')

wines_df.head()

i). Using dropna(), remove all records where the price value is not given. What is the shape of wines_df after removing the values?[0.5]

# Remove the records where the price is not given
# (Write your code here)

wines_df = wines_df.dropna(subset=['price'])
# What is the shape of wines_df after removing the values?
# (Write your code here)

#wine_df.dropna(inplace=True)

wines_df.shape

#185 entries, 3 columns

(185, 3)

ii). Using your answers to i) or otherwise, plot a scatterplot where the $x$-axis represents the number of points a wine has received, and the $y$-axis is the price of the wine. Only plot for points where both the point and price value exist. The title, $x$-axis label and $y$-axis label must be shown to obtain full credit. [1]

# Plot the scatterplot where the x-axis represents the number of points a wine has received, 
# and the y-axis is the price of the wine.
# Write your code here

fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(111)

wines_df.plot(kind='scatter',ax=ax,x='points',y='price',title='Scatterplot of Price to Points of Wine')

<matplotlib.axes._subplots.AxesSubplot at 0x1ac1e6ba848>

iii) Complete the code below to plot a boxplot showing the distribution of the points of the bottles of wines. [0.5]

# Complete the code below to plot a boxplot.
sns.set(style="whitegrid")
sns.boxplot(wines_df['points'], orient="h", width=0.2)
plt.title('Boxplot of point values of wine')
# Show the boxplot
plt.show()

#sns.reset_orig()

#reset the seaborn library

iv) Using the plot of iii) to help you or otherwise, find the number of outliers for the points distribtution of the wines dataset. [0.5]

# Find the number of outliers for the points distribtution of the wines dataset.
# (Write your code here)
# Hint: Use the following code to help you:
p25 = np.percentile(wines_df['points'], 25)
p75 = np.percentile(wines_df['points'], 75)
iqr = p75 - p25
# p25, p75 = np.percentile(wines_df['points'],[25,75])
#lo, hi = p25-1.5*iqr, p75+1.5*iqr

lower_bound = p25 - 1.5*iqr
upper_bound = p75 + 1.5*iqr

wine_df = wines_df[(wines_df['points'] < lower_bound) | (wines_df['points'] > upper_bound)]
wine_df.count()
# How many outliers are there for the points distribtution of the wines dataset?
# (Type your answer here)

#1 outlier

country    1
points     1
price      1
dtype: int64

2B. You are given the number of loans per term for each grade taken from a sample from the Lending Club Dataset. The definition of each column is as below.

loans-by-grade.csv

col name	definition	datatype	example
grade	The grade scored	str	A
36 months	The number of loans with a 36 month term (repayment must be complete in 3 years)	int	18572
60 months	The number of loans with a 60 month term (repayment must be complete in 5 years)	int	8013

Read from CSV to `df`¶

Read the dataset from loans-by-grade.csv to a DataFrame called loans_df.

# Read from CSV to loans_df. Write your code here.

loans_df= pd.read_csv('loans-by-grade.csv')

loans_df.head()

i) Add a column to loans_df called Total Loans. This represents the total number of 36 & 60 month loans belonging to the grade. [0.5].

#Add a column to loans_df called 'Total Loans'
# Write your code here

loans_df['Total Loans'] = loans_df['36 months'] + loans_df['60 months']

loans_df

ii)

What proportion of the loans in Grade A are 36 month loans?
What proportion of the loans in Grade A are 60 month loans?

[0.5]

# What proportion of the loans in Grade A are 36 month loans?
# Code is not required. However, you could modify this line of code to help you
18572 / (18572 + 670) *100

# (Type your answer here)
# 96.51803346845442% of loans in Grade A are 36 month loans.

96.51803346845442

# What proportion of the loans in Grade A are 60 month loans?
# Code is not required. However, you could modify this line of code to help you
670 / (18572 + 670)* 100

# (Type your answer here)
# 3.481966531545577% of loans in Grade A are 60 month loans.

3.4819665315455772

iii) Using the answers to i) and ii) or otherwise, add 2 columns to loan_df:

36 months percentage is the proportion of loans that has a 36 month term belonging to the grade
60 months percentage is the proportion of loans that has a 60 month term belonging to the grade

[1]

# Write the code for adding the columns to loans_df below.
# Write your code here

loans_df['36 months percentage'] = loans_df['36 months']/ loans_df['Total Loans'] * 100
loans_df['60 months percentage'] = loans_df['60 months']/ loans_df['Total Loans'] * 100

loans_df

iv) Plot a stacked bar-graph where each bar in the $x$-axis is a grade and the $y$-axis represents the percentage of loans belonging to the grade. Hint: As each bar represents a percentage, all bars have the same height of $100\%$. Also, use DataFrame.plot(kind='bar', stacked=True) and add more parameters accordingly. [1]

# Write the code for plotting the stacked bar-graph below.
# Write your code here

#loans_df = loans_df[['grade'],['36 months percentage'],['60 months percentage']]
#loans_df.index = loans_df['grade']

loans_df1 = loans_df.drop(columns = ['36 months','60 months','Total Loans'])
loans_df1.head()

fig, ax1 = plt.subplots()
loans_df1.plot(kind='bar', stacked=True, figsize=(18,10),  
                                     title="No. of Loans by Grade", ax=ax1)
ax1.set_ylabel("No. of Loans")
plt.show()

2C. You will be using the wines-200.csv again for this question. Read from the wines-200.csv dataset from 2A to wines_df2. Remember to perform the dropna() step.

# Read from CSV file to wines_df2
# Remember to dropna()
# Write your code here.

wines_df2 = pd.read_csv('wines-200.csv')

wines_df2 = wines_df2.dropna()
wines_df2.head()

i) Find the top 2 countries with the most wines in the dataset. [0.5]

# What are the top 2 countries in terms of number of wines produced?
# Write your code here
#
country_count = wines_df2.groupby('country').size().reset_index(name='c')
country_count.sort_values('c',ascending=False,inplace=True)
print()
top_2_countries = country_count.iloc[:2]['country']
print(top_2_countries)
top_2_countries = top_2_countries.tolist()
print()
print(top_2_countries)

#country_count = wines_df.grouby('country').size().reset_index(name='c')
#country_count.sort_values('c',ascending=False, inplace= True)

#store the list of top 2 into a list

# (Type your answer here)
# US, Italy has the most wines in the dataset.

15       US
9     Italy
Name: country, dtype: object

['US', 'Italy']

ii) Using your answer in (i) or otherwise, debug and complete the following code to plot a violin plot of the wine prices of the top 2 countries with the most wines in the dataset. There should be a seperate violin plot for each country in the top 2. Refer to the seaborn documentation, this Stackoverflow answer and this Stackoverflow answer to help you. [1]

# Debug and complete the code for plotting the violin below.

#wines_df = wines_df[wines_df['country'].isin(top_2_countries)]
#wines_df['one']=1
#sns.violinplot(x="one", y='price', hue="country',data=wines_df3, palette="muted")

wines_df3= wines_df[wines_df['country'].isin(top_2_countries)]
wines_df3['one'] =1 
fig= plt.figure(figsize=(8,8))
ax= fig.add_subplot(111)
sns.violinplot(x="one", y='price', hue='country', data=wines_df3, split=True, palette="muted",ax=ax)
ax.set_xticklabels(list(top_2_countries))


ax.set_ylabel("Price")
ax.set_xlabel("")
ax.set_xticklabels(['country'])
ax.set_title("Violin plot of wine prices by country", fontsize=14)

plt.show()
#plot a double joint violin plot to see the distribution

C:\Users\yamba\Anaconda3\lib\site-packages\ipykernel_launcher.py:8: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

**Checkpoint**: When you have completed Part 2, upload your progress to eLearn. eLearn only accepts the latest notebooks so if you have an existing notebook, your latest upload will **override** it.

	index	literacy_rate	year
0	0	95.857330	2010
1	1	96.186081	2011
2	2	96.369789	2012
3	3	96.548080	2013
4	4	96.716660	2014
5	5	96.825493	2015
6	6	97.049591	2016

	case_number	claim_amount
0	2004103065920	50.00
1	0226059M	312.50
2	2006012099982	90.00
3	2005010372123	216.00
4	2004092263584	178.74

	bin_range	count
0	(1.011, 125.625]	1689
1	(125.625, 249.25]	440
2	(249.25, 372.875]	181
3	(372.875, 496.5]	76
4	(496.5, 620.125]	48
5	(620.125, 743.75]	26
6	(743.75, 867.375]	27
7	(867.375, 991.0]	13

	main_category	n
0	Art	77
1	Comics	70
2	Crafts	13
3	Dance	11
4	Design	289
5	Fashion	94
6	Film & Video	386
7	Food	155
8	Games	280
9	Journalism	12
10	Music	256
11	Photography	34
12	Publishing	122
13	Technology	241
14	Theater	47

	grade	36 months	60 months
0	A	18572	670
1	B	27570	5572
2	C	21087	10739
3	D	9973	8013
4	E	2859	6296

	country	points	price
0	US	86	50.0
1	Italy	92	48.0
2	France	89	19.0
3	US	92	34.0
4	US	84	20.0

	grade	36 months	60 months	Total Loans	36 months percentage	60 months percentage
0	A	18572	670	19242	96.518033	3.481967
1	B	27570	5572	33142	83.187496	16.812504
2	C	21087	10739	31826	66.257148	33.742852
3	D	9973	8013	17986	55.448682	44.551318
4	E	2859	6296	9155	31.228837	68.771163
5	F	580	2353	2933	19.774974	80.225026
6	G	96	620	716	13.407821	86.592179

	country	points	price
0	US	86	50.0
1	Italy	92	48.0
2	France	89	19.0
3	US	92	34.0
4	US	84	20.0