Exploring Tidyverse and Dplyr
It looks as though I have had a bit of a hiatus from R programming. After the A to Z course by Kirill on Udemy, I went on to study a number of courses on Probability, Statistics and preparing for Data Science interviews. I also spent most of May working on creative visualisations, and with the circuit breaker coming to an end, it has been a very good 3 months for me working from home.
We will continue to work from home for the time being. But this circuit breaker season has given me a long-needed reset of my life. I got to declutter my shelves and cupboards, and I also got to look at all the courses I had placed on the back burner and actually complete them.
I was first exposed to the dplyr and tidyverse packages in R in another course, Wiley Certified Data Analyst. I finally got to put that head knowledge into practice here.
Exploring Dplyr
Barbara Yam
17 Jun 2020
Task 1 to 3
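# Read the wide table; check.names = FALSE keeps the year columns named "2010", "2011", ...
f1 = read.csv('f1.csv', check.names = FALSE)
pacman::p_load(dplyr, tidyverse)
# Reshape to long form: one row per category and year
f1New = f1 %>%
  gather('year', 'count', 2:ncol(f1), convert = TRUE) %>%
  rename('categories' = Variables) %>%
  arrange(categories)
head(f1New, 10)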
## categories year count
## 1 Available Room-Nights (Number) 2010 11262019
## 2 Available Room-Nights (Number) 2011 12377895
## 3 Available Room-Nights (Number) 2012 12450851
## 4 Available Room-Nights (Number) 2013 13118384
## 5 Available Room-Nights (Number) 2014 14241499
## 6 Available Room-Nights (Number) 2015 15130568
## 7 Available Room-Nights (Number) 2016 16161862
## 8 Hotel Food & Beverage Revenue (Thousand Dollars) 2010 1052016
## 9 Hotel Food & Beverage Revenue (Thousand Dollars) 2011 1315098
## 10 Hotel Food & Beverage Revenue (Thousand Dollars) 2012 1309864
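`gather()` still works, but tidyr has since superseded it with `pivot_longer()`. Below is a minimal sketch of the same wide-to-long reshaping on a small made-up data frame. The name `toy` and its values are hypothetical and are not taken from f1.csv, and `names_transform` assumes a reasonably recent tidyr.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one row per category, one column per year
toy <- data.frame(
  Variables = c("Rooms", "Revenue"),
  "2010" = c(10, 100),
  "2011" = c(12, 110),
  check.names = FALSE
)

# Every column except Variables becomes a (year, count) pair
toy_long <- toy %>%
  pivot_longer(cols = -Variables,
               names_to = "year",
               values_to = "count",
               names_transform = list(year = as.integer)) %>%
  rename(categories = Variables) %>%
  arrange(categories)

print(toy_long)
```

Here `names_transform` plays the role that `convert = TRUE` plays in `gather()`: it turns the year labels into integers.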
f1New %>%
  subset(year == 2011)
## categories year count
## 2 Available Room-Nights (Number) 2011 12377895.0
## 9 Hotel Food & Beverage Revenue (Thousand Dollars) 2011 1315097.6
## 16 Hotel Room Revenue (Thousand Dollars) 2011 2643538.8
## 23 Number Of Gazetted Hotels (At End Year) (Number) 2011 98.0
## 30 Standard Average Occupancy Rate (Per Cent) 2011 86.4
## 37 Standard Average Room Rate (Dollar) 2011 247.1
library(ggplot2)
f1New %>%
  ggplot(aes(x = year, y = count, color = categories)) +
  geom_line()
f1New %>%
  ggplot(aes(x = year, y = count)) +
  geom_line() +
  facet_wrap(~categories)
df <- data.frame(x=c(NA,"a.b","a.d","b.c"))
df %>% separate(x,c("A","B"))
## A B
## 1 <NA> <NA>
## 2 a b
## 3 a d
## 4 b c
df <- data.frame(x=c("x:123","y:error:7"))
df %>% separate(x,c("key","value"))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [2].
## key value
## 1 x 123
## 2 y error
df <- data.frame(x=c("x:123","y:error:7"))
df %>% separate(x,c("key","value"),":",extra="merge")
## key value
## 1 x 123
## 2 y error:7
df <- data.frame(x=c("a","a b","a b c", NA))
df %>% separate(x,c("a","b"),extra="drop",fill="right")
## a b
## 1 a <NA>
## 2 a b
## 3 a b
## 4 <NA> <NA>
df %>% separate(x,c("a","b"),extra="merge",fill="left")
## a b
## 1 <NA> a
## 2 a b
## 3 a b c
## 4 <NA> <NA>
Task 5 Part 1
pacman::p_load(dplyr,tidyverse)
f1New %>%
  group_by(categories) %>%
  summarise(mean = round(mean(count), 2),
            median = round(median(count), 2),
            standardDeviation = round(sd(count), 2),
            no.of.rows = n())
## # A tibble: 6 x 5
## categories mean median standardDeviati~ no.of.rows
## <chr> <dbl> <dbl> <dbl> <int>
## 1 " Available Room-Nights (Number)~ 1.35e7 1.31e7 1722385. 7
## 2 " Hotel Food & Beverage Revenue ~ 1.34e6 1.34e6 143975. 7
## 3 " Hotel Room Revenue (Thousand D~ 2.86e6 2.92e6 397677. 7
## 4 " Number Of Gazetted Hotels (At ~ 1.21e2 1.13e2 24.7 7
## 5 " Standard Average Occupancy Rat~ 8.56e1 8.60e1 0.83 7
## 6 " Standard Average Room Rate (Do~ 2.46e2 2.47e2 15.4 7
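In summaries like this one, it matters where the closing parenthesis of `round()` goes. Here is a small base R sketch of the difference. The vector `x` is made up for illustration and is not taken from the hotel data.

```r
# Hypothetical values, not from f1.csv
x <- c(85.2, 86.4, 85.95)

# Median first, then round to 2 decimal places
a <- round(median(x), 2)

# Here the 2 is passed to median() as its second argument (na.rm),
# and round() falls back to its default of 0 decimal places
b <- round(median(x, 2))

print(c(a, b))
```

The second form runs without any warning, so the slip is easy to miss when the printed tibble shows only three significant digits.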
Task 5 Part 2
messyData = data.frame(
name = c('Tom','Bob','Merv'),
"2010" = c("married","single","single"),
"2012" = c("father","married","single")
)
library(dplyr)
library(tidyverse)
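# data.frame() renames the columns "2010" and "2012" to X2010 and X2012,
# so split on "X" and drop the empty piece in front of it
messyData %>%
  gather(key = 'Year', value = "Event", -name) %>%
  separate(col = 'Year', into = c('X', 'Year'), sep = "X") %>%
  select(-X)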
## name Year Event
## 1 Tom 2010 married
## 2 Bob 2010 single
## 3 Merv 2010 single
## 4 Tom 2012 father
## 5 Bob 2012 married
## 6 Merv 2012 single
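The `X` that has to be split off comes from `data.frame()` repairing the non-syntactic column names. One possible alternative, sketched below, is to build the data frame with `check.names = FALSE` so the year columns keep their names and no `separate()` step is needed. The object names `messy` and `tidy` are made up for this sketch, and it uses `pivot_longer()` rather than `gather()`.

```r
library(tidyr)

messy <- data.frame(
  name = c("Tom", "Bob", "Merv"),
  "2010" = c("married", "single", "single"),
  "2012" = c("father", "married", "single"),
  check.names = FALSE  # keep "2010"/"2012" instead of "X2010"/"X2012"
)

# One row per person and year
tidy <- pivot_longer(messy, cols = -name,
                     names_to = "Year", values_to = "Event")

print(tidy)
```

Note that `pivot_longer()` orders the result person by person, whereas `gather()` stacks one year column after another, so the rows come out in a different order from the output above.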
Task 6
# Keep only the revenue categories and plot each in its own panel
f1New %>%
  filter(str_detect(categories, 'Revenue')) %>%
  ggplot(aes(x = year, y = count)) +
  facet_wrap(~categories) +
  geom_line()
Task 7
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
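# Stack the four measurement columns, then split names like "Sepal.Length" into Part and Measure
tidy_iris = iris %>%
  gather('Type', 'Value', 1:4) %>%
  separate(col = 'Type', into = c("Part", "Measure"))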
head(tidy_iris)
## Species Part Measure Value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
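The `separate()` call in Task 7 gives no `sep`, so it relies on the default, which splits on any run of non-alphanumeric characters. That is why `Sepal.Length` breaks at the dot. As a sketch, the same step can name the separator explicitly. The object name `tidy_iris_explicit` is made up here.

```r
library(tidyr)
library(dplyr)

# Same reshaping as above, but with the dot named as the separator.
# sep is a regular expression, so the dot has to be escaped.
tidy_iris_explicit <- iris %>%
  gather("Type", "Value", 1:4) %>%
  separate(col = "Type", into = c("Part", "Measure"), sep = "\\.")

head(tidy_iris_explicit)
```

Spelling out the separator makes the intent easier to read and avoids surprises if a column name contains other punctuation.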
ggplot(tidy_iris, aes(x = Measure, y = Value, color = Part)) +
  geom_jitter() +
  facet_grid(. ~ Species)