Data Visualization using Python
In this exercise, we explored Data Visualization using Matplotlib and the Seaborn Library
DS102 Problem Set - Week 3
HWA-DS102-PS3-201810
If you have any questions to the Problem Set, feel free to send a message to your TAs or Instructors in the Telegram chat. You are encouraged to also discuss the problem sets within your own groups.
# Rename the notebook. Then, fill in your details here.
# Name :Barbara Yam
# Class : 5 May 2019, 7 pm
Datasets Required:¶
sgp-literacy-rate.csv
insurance-claims.csv
ks-projects-15k.csv
wines-200.csv
loans-by-grade.csv
# import the relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set matplotlib inline
%matplotlib inline
Part 1 (5 Marks)¶
Mark breakdowns are stated in the question e.g. [1]
. For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**
1A.
i) What are the key features of a plot that needs to be explicitly stated? [0.5]
# What are the key features of a plot that needs to be explicitly stated?
# Code is not required.
# (Type your answer here)
#Title
#Legend
#Axes Labels for both the horizontal or x -axis and the vertical or y -axis
#Reasonable tick values
ii) You are given the following data in sgp-literacy-rate.csv
, taken from World Bank Open Data. The definition of each column is as follows:
sgp-literacy-rate.csv
col name | definition | datatype | example |
---|---|---|---|
literacy_rate | The percentage of people aged 15 and above who can both read and write | float | 96.716660 |
year | The year in which the literacy rate was obtained | int | 2010 |
Read from CSV to df
Read the data from the sgp-literacy-rate.csv
CSV file into a DataFrame
called sgp_lr_df
.
# Read from CSV to sgp_lr_df. Write your code here.
sgp_lr_df = pd.read_csv('sgp-literacy-rate.csv',sep="|")
sgp_lr_df
sgp_lr_df.reset_index()
The following code to plot the line graph is given to you. It plots the literacy rate of Singapore (% of people ages 15 and above) from 2010 to 2016. Perform the following steps:
- Run the code to plot the line graph.
- Using your answer from 1A or otherwise, modify the code to add the missing features of the plot.
[1]
# Instructions: modify the code to add the missing features of the plot
DIMS = (10,5)
fig, ax1 = plt.subplots()
ax1.set_title("Singapore's Literacy Rate, 2010 to 2016", fontsize=16)
sgp_lr_df.plot(kind='line',x ='year',y ='literacy_rate',figsize=DIMS, ax=ax1, marker='o')
plt.xlabel("Year")
plt.ylabel('Literacy Rate')
ax1.legend(["Singapore's Literacy Rate"], fontsize=12)
# Display the line graph
plt.show()
1B. You are given a baggage loss dataset from the TSA Claims Database. In this dataset, the case_number
represents the unique ID associated with the case from the insurer and claim_amount
is the amount of claim filed by the passenger, or the loss due to travelling borne by the passenger.
Read from CSV to df
¶
Read the dataset from insurance-claims.csv
to a DataFrame
called ins_df
.
# Read from CSV to ins_df. Write your code here.
#
ins_df= pd.read_csv('insurance-claims.csv')
ins_df.head()
i) Plot a histogram for claim_amount
using $8$ bins. Use the bins
parameter to specify this. Also, as we do not need the KDE component, set the parameter kde=False
. Ensure that all features of the plot have to be shown for full credit. Clue: Use sns.distplot()
. [1]
Note: If you see this error:
Warning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
You can ignore it. The plot will not be affected.
# Plot a histogram for claim_amount using 8 bins.
# Write your code here.
fig = plt.figure(figsize=(18, 8))
ax1 = fig.add_subplot(111)
sns.distplot(ins_df['claim_amount'],kde=False,ax=ax1,bins=8)
ax1.set_title("Histogram of Claim Amount, 8 bins", fontsize=16)
ax1.set_ylabel("Number of Claims")
ii) The following DataFrame
called ins_df_agg_c
is given to you. Run the code. What is the interpretation of count
in ins_df_agg_c
with respect to the plotted histogram in i) ? [0.5]
# No additional code is required.
ins_df_agg = pd.cut(ins_df['claim_amount'], 8).reset_index(name='bin_range')
ins_df_agg_c = ins_df_agg.groupby('bin_range').size().reset_index(name='count')
ins_df_agg_c
# What is the interpretation of count in ins_df_agg_c with respect to the plotted histogram?
# (Type your answer here)
# The number of claims within the amount that is specified the bin_range.
# Bin range is gotten by dividing insurance amounts into 8 equal intervals.
1C. You are given the details of some projects taken from the KickStarter Projects dataset. The definition of each column is as follows:
ks-projects-15k.csv
col name | definition | datatype | example |
---|---|---|---|
ID | internal kickstart ID | int | 59178964 |
name | name of a kickstarter project | str | I Believe in King The Kid |
main_category | category of a campaign | str | Music |
backers | the number of backers a project has | int | 444 |
usd_pledged_real | the amount that was pledged for the project, in USD | float | 27591.46 |
usd_goal_real | the fundraising goal of the project, in USD | float | 55000 |
Read from CSV to df
¶
Read the dataset from ks-projects-15k.csv
to a DataFrame
called kst_df
.
# Read from CSV to kst_df. Write your code here.
kst_df = pd.read_csv('ks-projects-15k.csv')
kst_df.sample()
i) Filter for all Kickstarter Projects where the total amount pledged in USD, or usd_pledged_real
is greater than $\$10000$. Store this in a new DataFrame
called kst_df_10k
. [0.5]
# Filter for all Kickstarter Projects where the usd_pledged_real is greater than 10000.
# (Write your code here)
kst_df_10k = kst_df[kst_df['usd_pledged_real']>10000]
kst_df_10k.sample()
ii) Create a new DataFrame
, kst_df_agg
where the first column is the main_category
and the second column is called n
. In kst_df_agg
, main_category
is the main category of kst_df_10k
and n
is the number of projects belonging to that main_category
. Hint: use df.groupby().size()
. [0.5]
# Create an aggregated df with 1 column being the main_category and the other column
# being the no. of projects in that category.
# (Write your code here)
kst_df_agg = kst_df.groupby('main_category').size()
kst_df_agg.reset_index(name='n')
iii) Using your answers to (i) and (ii) or otherwise, plot a bar graph where the $x$-axis is the main_category
and the $y$-axis is the number of projects belonging to that main_category
. [1]
# Plot the bar graph where the x-axis is the main_category and
# the y-axis is the number of projects belonging to that main_category
# (Write your code here)
fig = plt.figure(figsize=(18, 7))
ax = fig.add_subplot(111)
kst_df_agg.plot(kind='bar', x='main_category', y='n',ax=ax,color='skyblue')
# Add the ylabel using ax.set_ylabel() and the title using ax.set_title()
ax.set_ylabel("Number of Projects")
ax.set_xlabel('Main Category')
ax.set_title("Number of Projects in each Main Category")
Part 2 (7 Marks)¶
For questions wtih short answers, **both the code and typed answers must be provided to receive full credit.** **For plotting, all features must be present to receive full credit.**
2A. You are given the score and price of some bottles of wines taken from the Wine Reviews dataset. The definition of each column is as follows:
wines-200.csv
col name | definition | datatype | example |
---|---|---|---|
country | The country that the wine is from | str | US |
points | The number of points WineEnthusiast rated the wine on a scale of 1-100 | int | 86 |
price | The price of the bottle of wine | float | 12.0 |
Read from CSV to df
¶
Read the dataset from wines-200.csv
to a DataFrame
called wines_df
.
# Read from CSV to wines_df. Write your code here.
wines_df = pd.read_csv('wines-200.csv')
wines_df.head()
i). Using dropna()
, remove all records where the price
value is not given. What is the shape
of wines_df
after removing the values?[0.5]
# Remove the records where the price is not given
# (Write your code here)
wines_df = wines_df.dropna(subset=['price'])
# What is the shape of wines_df after removing the values?
# (Write your code here)
#wine_df.dropna(inplace=True)
wines_df.shape
#185 entries, 3 columns
ii). Using your answers to i) or otherwise, plot a scatterplot where the $x$-axis represents the number of points a wine has received, and the $y$-axis is the price of the wine. Only plot for points where both the point and price value exist. The title, $x$-axis label and $y$-axis label must be shown to obtain full credit. [1]
# Plot the scatterplot where the x-axis represents the number of points a wine has received,
# and the y-axis is the price of the wine.
# Write your code here
fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(111)
wines_df.plot(kind='scatter',ax=ax,x='points',y='price',title='Scatterplot of Price to Points of Wine')
iii) Complete the code below to plot a boxplot showing the distribution of the point
s of the bottles of wines. [0.5]
# Complete the code below to plot a boxplot.
sns.set(style="whitegrid")
sns.boxplot(wines_df['points'], orient="h", width=0.2)
plt.title('Boxplot of point values of wine')
# Show the boxplot
plt.show()
#sns.reset_orig()
#reset the seaborn library
iv) Using the plot of iii) to help you or otherwise, find the number of outliers for the point
s distribtution of the wines dataset. [0.5]
# Find the number of outliers for the points distribtution of the wines dataset.
# (Write your code here)
# Hint: Use the following code to help you:
p25 = np.percentile(wines_df['points'], 25)
p75 = np.percentile(wines_df['points'], 75)
iqr = p75 - p25
# p25, p75 = np.percentile(wines_df['points'],[25,75])
#lo, hi = p25-1.5*iqr, p75+1.5*iqr
lower_bound = p25 - 1.5*iqr
upper_bound = p75 + 1.5*iqr
wine_df = wines_df[(wines_df['points'] < lower_bound) | (wines_df['points'] > upper_bound)]
wine_df.count()
# How many outliers are there for the points distribtution of the wines dataset?
# (Type your answer here)
#1 outlier
2B. You are given the number of loans per term for each grade taken from a sample from the Lending Club Dataset. The definition of each column is as below.
loans-by-grade.csv
col name | definition | datatype | example |
---|---|---|---|
grade | The grade scored | str | A |
36 months | The number of loans with a 36 month term (repayment must be complete in 3 years) | int | 18572 |
60 months | The number of loans with a 60 month term (repayment must be complete in 5 years) | int | 8013 |
Read from CSV to df
¶
Read the dataset from loans-by-grade.csv
to a DataFrame
called loans_df
.
# Read from CSV to loans_df. Write your code here.
loans_df= pd.read_csv('loans-by-grade.csv')
loans_df.head()
i) Add a column to loans_df
called Total Loans
. This represents the total number of 36 & 60 month loans belonging to the grade. [0.5]
.
#Add a column to loans_df called 'Total Loans'
# Write your code here
loans_df['Total Loans'] = loans_df['36 months'] + loans_df['60 months']
loans_df
ii)
- What proportion of the loans in Grade A are 36 month loans?
- What proportion of the loans in Grade A are 60 month loans?
[0.5]
# What proportion of the loans in Grade A are 36 month loans?
# Code is not required. However, you could modify this line of code to help you
18572 / (18572 + 670) *100
# (Type your answer here)
# 96.51803346845442% of loans in Grade A are 36 month loans.
# What proportion of the loans in Grade A are 60 month loans?
# Code is not required. However, you could modify this line of code to help you
670 / (18572 + 670)* 100
# (Type your answer here)
# 3.481966531545577% of loans in Grade A are 60 month loans.
iii) Using the answers to i) and ii) or otherwise, add 2 columns to loan_df
:
36 months percentage
is the proportion of loans that has a 36 month term belonging to the grade60 months percentage
is the proportion of loans that has a 60 month term belonging to the grade
[1]
# Write the code for adding the columns to loans_df below.
# Write your code here
loans_df['36 months percentage'] = loans_df['36 months']/ loans_df['Total Loans'] * 100
loans_df['60 months percentage'] = loans_df['60 months']/ loans_df['Total Loans'] * 100
loans_df
iv) Plot a stacked bar-graph where each bar in the $x$-axis is a grade and the $y$-axis represents the percentage of loans belonging to the grade. Hint: As each bar represents a percentage, all bars have the same height of $100\%$. Also, use DataFrame.plot(kind='bar', stacked=True)
and add more parameters accordingly. [1]
# Write the code for plotting the stacked bar-graph below.
# Write your code here
#loans_df = loans_df[['grade'],['36 months percentage'],['60 months percentage']]
#loans_df.index = loans_df['grade']
loans_df1 = loans_df.drop(columns = ['36 months','60 months','Total Loans'])
loans_df1.head()
fig, ax1 = plt.subplots()
loans_df1.plot(kind='bar', stacked=True, figsize=(18,10),
title="No. of Loans by Grade", ax=ax1)
ax1.set_ylabel("No. of Loans")
plt.show()
2C. You will be using the wines-200.csv
again for this question. Read from the wines-200.csv
dataset from 2A to wines_df2
. Remember to perform the dropna()
step.
# Read from CSV file to wines_df2
# Remember to dropna()
# Write your code here.
wines_df2 = pd.read_csv('wines-200.csv')
wines_df2 = wines_df2.dropna()
wines_df2.head()
i) Find the top 2 countries with the most wines in the dataset. [0.5]
# What are the top 2 countries in terms of number of wines produced?
# Write your code here
#
country_count = wines_df2.groupby('country').size().reset_index(name='c')
country_count.sort_values('c',ascending=False,inplace=True)
print()
top_2_countries = country_count.iloc[:2]['country']
print(top_2_countries)
top_2_countries = top_2_countries.tolist()
print()
print(top_2_countries)
#country_count = wines_df.grouby('country').size().reset_index(name='c')
#country_count.sort_values('c',ascending=False, inplace= True)
#store the list of top 2 into a list
# (Type your answer here)
# US, Italy has the most wines in the dataset.
ii) Using your answer in (i) or otherwise, debug and complete the following code to plot a violin plot of the wine prices of the top 2 countries with the most wines in the dataset. There should be a seperate violin plot for each country in the top 2. Refer to the seaborn documentation, this Stackoverflow answer and this Stackoverflow answer to help you. [1]
# Debug and complete the code for plotting the violin below.
#wines_df = wines_df[wines_df['country'].isin(top_2_countries)]
#wines_df['one']=1
#sns.violinplot(x="one", y='price', hue="country',data=wines_df3, palette="muted")
wines_df3= wines_df[wines_df['country'].isin(top_2_countries)]
wines_df3['one'] =1
fig= plt.figure(figsize=(8,8))
ax= fig.add_subplot(111)
sns.violinplot(x="one", y='price', hue='country', data=wines_df3, split=True, palette="muted",ax=ax)
ax.set_xticklabels(list(top_2_countries))
ax.set_ylabel("Price")
ax.set_xlabel("")
ax.set_xticklabels(['country'])
ax.set_title("Violin plot of wine prices by country", fontsize=14)
plt.show()
#plot a double joint violin plot to see the distribution