Data Visualization with R

In this notebook, we will start by showing how to produce simple graphics using the base R installation. We will eventually introduce more sophisticated graphics via the tidyverse and the ggplot2 library. A few examples are adapted from R. Irizarry’s dslabs documentation and other sources.

1. SCATTERPLOTS

On a scatter plot, the features you study depend on the scale of interest:

for “big” patterns, look at:
- form and direction
- strength
for “small” patterns, look at:
- deviations from the pattern
- outliers

1.1 `swiss` dataset

We start with a built-in R dataset called swiss.

str(swiss) # structure of the swiss dataset

## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

pairs(swiss) # scatter plot matrix for the swiss dataset

Let’s focus on one specific pair: Fertility vs. Education

# raw plot
plot(swiss$Fertility, swiss$Education)

The same plot can be prettified and made more informative:

# add a title and axis labels
plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education", main="Education vs Fertility (by province), Switzerland, 1888", las=1)

# add the line of best fit (in red)
abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5) 

# add the smoothing lowess curve (in blue)
lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5) 

# add a legend
legend(75,50, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))

Compare that graph with the one found below:

plot(swiss$Education, xlab="Province", ylab="Education", main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5) 
lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5) 
legend(5,52, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))

So we can get an actual graph here, but … why doesn’t it actually make sense to produce that specific graph?

The take-away here is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way.

Let’s see what the unadorned scatter plots look like with ggplot2.

## Again, with ggplot2
library(ggplot2)
qplot(Fertility, Education, data=swiss)

qplot(Fertility, Education, xlim=c(min(0,10*floor(min(swiss$Fertility)/10)),10*ceiling(max(swiss$Fertility)/10)), ylim=c(min(0,10*floor(min(swiss$Education)/10)),10*ceiling(max(swiss$Education)/10)), data=swiss)

1.2 `iris` dataset

Let’s do the same thing for the built-in iris dataset.

str(iris) # structure of the dataset

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(iris) # information on the distributions for each feature

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

?iris # information on the dataset itself

# scatter plot matrix on which the lower panel has been removed due to redundancy
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)

We can compare the sepal width and length variables in a manner similar to what we did with the swiss dataset.

## Iris 1
plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length", ylab="Sepal Width", main="Sepal Width vs Sepal Length, Anderson's Iris Dataset", las=1, bg=c("yellow","black","green")[unclass(iris$Species)])
abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5) 
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))

There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length?

## Iris 2
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) 
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))

Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation.

## Iris 3
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) 
legend(2,7, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))

A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph.

Colour can also be used to highlight various data elements.

# colour each observation differently according to its species
plot(iris$Sepal.Length, iris$Sepal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=)

This can be done on all scatterplots concurrently using pairs.

# scatterplot matrix with species membership
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)

The redundancy in the scatter plot matrix can be used to display other data elements as well: GGally allows for feature distributions in the diagonal entries, and correlations between pairs of variables and density plots in the redundant panels (among other things). Be weary of utilizing too many colours or features in these plots, however. They can easily become too difficult to read to provide any meaningful insights.

2. HISTOGRAM AND BAR CHARTS

In histograms or frequency charts, be on the lookout for bin size effects.

2.1 `swiss` dataset

For instance, what does the distribution of the Education variable in the swiss dataset look like?

## Histogram/Bar Charts
hist(swiss$Education)   # default number of bins

hist(swiss$Education, breaks=10)   # with 10 bins

hist(swiss$Education, breaks=20)   # with 20 bins

The distribution pattern is distincly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up maskig trends if the dataset isn’t large enough.

We can look for best fits for various parametric distributions:

#Swiss 1 - default number of bine
hist(swiss$Education, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4) # fit an exponential distribution
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4) # fit a normal distribution
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))

# Swiss 2 - 10 bins
hist(swiss$Education, breaks=10, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))

# Swiss 3 - 20 bins
hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))

With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised.

2.2 `iris` dataset

Can you figure out what is happening with these visualizations of the iris dataset?

hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15))

# what happens if you replace freq=FALSE with freq=TRUE?

# Another feature
hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width",main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15))

Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information.

# create our own function (see R-blogger's scatterhist function)
scatterhist = function(x, y, xlab="", ylab=""){
  zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
  layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
  xhist = hist(x, plot=FALSE)
  yhist = hist(y, plot=FALSE)
  top = max(c(xhist$counts, yhist$counts))
  par(mar=c(3,3,1,1))
  plot(x,y)
  par(mar=c(0,3,1,1))
  barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
  par(mar=c(3,0,1,1))
  barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
  par(oma=c(3,3,0,0))
  mtext(xlab, side=1, line=1, outer=TRUE, adj=0, 
    at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
  mtext(ylab, side=2, line=1, outer=TRUE, adj=0, 
    at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}

ds = iris
with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal.Length", ylab="Sepal.Width"))

3. BUBBLE CHARTS

Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: size, colour, and shape of bubbles can also be added to represent different data elements.

3.1 Canadian 2011 demographic dataset

For this example, we’ll look at demographic data regarding Canada’s CMA and CA (from StatsCan).

can.2011=read.csv("../Data/Canada2011.csv", head=TRUE) # import the data
head(can.2011) # take a look at the first 6 entries

str(can.2011) # take a look at the structure of the data

## 'data.frame':    147 obs. of  12 variables:
##  $ Geographic.code       : int  1 5 10 15 105 110 205 210 215 220 ...
##  $ Geographic.name       : Factor w/ 147 levels "Abbotsford - Mission",..: 117 7 42 26 20 121 46 56 133 73 ...
##  $ Province              : Factor w/ 12 levels "AB","BC","MB",..: 5 5 5 5 9 9 6 6 6 6 ...
##  $ Region                : Factor w/ 6 levels "Atlantic","British Columbia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Type                  : Factor w/ 2 levels "CA","CMA": 2 1 1 1 1 1 2 1 1 1 ...
##  $ pop_2011              : int  196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...
##  $ log_pop_2011          : num  5.29 4.04 4.14 4.43 4.81 ...
##  $ pop_rank_2011         : int  20 147 128 94 52 120 13 97 67 78 ...
##  $ priv_dwell_2011       : int  84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
##  $ occ_private_dwell_2011: int  78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...
##  $ occ_rate_2011         : num  0.934 0.917 0.933 0.95 0.907 ...
##  $ med_total_income_2011 : int  33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ...

summary(can.2011[,3:12], maxsum=13) # provide a distribution information for features 3 to 12, allowing for up to 13 factors in the categorical distributions

##  Province              Region    Type        pop_2011      
##  AB:18    Atlantic        :18   CA :114   Min.   :  10871  
##  BC:25    British Columbia:25   CMA: 33   1st Qu.:  18429  
##  MB: 5    North           : 2             Median :  40077  
##  NB: 7    Ontario         :43             Mean   : 186632  
##  NL: 4    Prairies        :31             3rd Qu.:  98388  
##  NS: 5    Quebec          :28             Max.   :5583064  
##  NW: 1                                                     
##  ON:43                                                     
##  PE: 2                                                     
##  QC:28                                                     
##  SK: 8                                                     
##  YT: 1                                                     
##   log_pop_2011   pop_rank_2011   priv_dwell_2011   occ_private_dwell_2011
##  Min.   :4.036   Min.   :  1.0   Min.   :   4601   Min.   :   4218       
##  1st Qu.:4.265   1st Qu.: 37.5   1st Qu.:   8292   1st Qu.:   7701       
##  Median :4.603   Median : 74.0   Median :  17428   Median :  16709       
##  Mean   :4.720   Mean   : 74.0   Mean   :  78691   Mean   :  74130       
##  3rd Qu.:4.993   3rd Qu.:110.5   3rd Qu.:  44674   3rd Qu.:  41142       
##  Max.   :6.747   Max.   :147.0   Max.   :2079459   Max.   :1989705       
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##  occ_rate_2011    med_total_income_2011
##  Min.   :0.6492   Min.   :22980        
##  1st Qu.:0.9173   1st Qu.:27630        
##  Median :0.9348   Median :29910        
##  Mean   :0.9276   Mean   :31311        
##  3rd Qu.:0.9475   3rd Qu.:33255        
##  Max.   :0.9723   Max.   :73030        
##                   NA's   :5            
##                                        
##                                        
##                                        
##                                        
##

Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions.

pairs(can.2011[,c(7,9,10,11,12)])

Meh… not that great. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass.

Can anything else be derived using bubble charts? We use median income as the radius for the bubbles, and focus on occupancy rates and population.

radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.45, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")

Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.25, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")

Nope. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.15, fg="white", bg=c("red","blue","black","green","yellow","violet")[can.2011$Region], xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")

What do you think?

While we’re at it, can you identify the regions without the legend?

3.2 Canadian 2011 CMA demographic dataset

Perhaps the CA distort the full picture (given that they are small and more numerous). Let’s analyze the world of CMAs instead.

can.2011.CMA=read.csv("../Data/Canada2011_CMA.csv", head=TRUE) # import the data
#head(can.2011.CMA)
str(can.2011.CMA) # see the dataset's structure (notice the number of observations)

## 'data.frame':    33 obs. of  15 variables:
##  $ Geographic.code       : int  1 205 305 310 408 421 433 442 462 505 ...
##  $ Geographic.name       : Factor w/ 33 levels "Abbotsford - Mission",..: 26 8 14 22 21 19 24 29 15 17 ...
##  $ Province              : Factor w/ 9 levels "AB","BC","MB",..: 5 6 4 4 8 8 8 8 8 7 ...
##  $ Region                : Factor w/ 5 levels "Atlantic","British Columbia",..: 1 1 1 1 5 5 5 5 5 3 ...
##  $ Type                  : Factor w/ 1 level "CMA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pop_2011              : int  196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...
##  $ log_pop_2011          : num  5.29 5.59 5.14 5.11 5.2 ...
##  $ fem_pop_2011          : int  103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...
##  $ prop_fem_2011         : num  52.6 52.4 51.1 51.5 50.6 ...
##  $ pop_rank_2011         : int  20 13 29 31 26 7 19 27 2 4 ...
##  $ priv_dwell_2011       : int  84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
##  $ occ_private_dwell_2011: int  78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...
##  $ occ_rate_2011         : num  0.934 0.932 0.934 0.921 0.942 ...
##  $ med_unemployment_2011 : num  6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...
##  $ med_total_income_2011 : int  33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...

summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution

##  Province              Region    Type       pop_2011        log_pop_2011  
##  AB: 2    Atlantic        : 4   CMA:33   Min.   : 118975   Min.   :5.075  
##  BC: 4    British Columbia: 4            1st Qu.: 159561   1st Qu.:5.203  
##  MB: 1    Ontario         :15            Median : 260600   Median :5.416  
##  NB: 2    Prairies        : 5            Mean   : 700710   Mean   :5.553  
##  NL: 1    Quebec          : 5            3rd Qu.: 721053   3rd Qu.:5.858  
##  NS: 1                                   Max.   :5583064   Max.   :6.747  
##  ON:15                                                                    
##  QC: 5                                                                    
##  SK: 2                                                                    
##   fem_pop_2011     prop_fem_2011   pop_rank_2011 priv_dwell_2011  
##  Min.   :  63260   Min.   :50.55   Min.   : 1    Min.   :  53730  
##  1st Qu.:  83243   1st Qu.:51.55   1st Qu.: 9    1st Qu.:  72817  
##  Median : 135561   Median :52.17   Median :17    Median : 110314  
##  Mean   : 365093   Mean   :52.05   Mean   :17    Mean   : 291519  
##  3rd Qu.: 378690   3rd Qu.:52.41   3rd Qu.:25    3rd Qu.: 294150  
##  Max.   :2947971   Max.   :53.17   Max.   :33    Max.   :2079459  
##                                                                   
##                                                                   
##                                                                   
##  occ_private_dwell_2011
##  Min.   :  48848       
##  1st Qu.:  67767       
##  Median : 104237       
##  Mean   : 275587       
##  3rd Qu.: 282186       
##  Max.   :1989705       
##                        
##                        
##

We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.

radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi)
symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011, circles=radius.med.income.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)

Part of the issue is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say?

radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011, circles=radius.pop.rank.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Median Income", xlab="Unemployment Rate")
title("Population Rank, by CMA and CA (2011)")
text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)

There’s a bit more structure here, isn’t there?

4. ALGAE BLOOMS

The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.

The dataset which is used to train the learning model consists of: - chemical properties of various water samples of European rivers - the quantity of seven algae in each of the samples, and - the characteristics of the collection process for each sample.

Notes: - we have already cleaned this dataset in notebook DS 02, but it would have made more sense to visualize it beforehand… we will revisit the process in detail in notebook DS 07. For now, let’s just focus on producing some graphs.
- The dataset is stored in the CSV file algae_blooms.csv (in the same directory as this notebook).

algae_blooms<-read.csv("../Data/algae_blooms.csv", sep=",", header=TRUE)

We can get an idea of the data frame’s structure by calling the str function.

str(algae_blooms)

## 'data.frame':    340 obs. of  18 variables:
##  $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
##  $ size  : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
##  $ mxPH  : num  8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
##  $ mnO2  : num  9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
##  $ Cl    : num  60.8 57.8 40 77.4 55.4 ...
##  $ NO3   : num  6.24 1.29 5.33 2.3 10.42 ...
##  $ NH4   : num  578 370 346.7 98.2 233.7 ...
##  $ oPO4  : num  105 428.8 125.7 61.2 58.2 ...
##  $ PO4   : num  170 558.8 187.1 138.7 97.6 ...
##  $ Chla  : num  50 1.3 15.6 1.4 10.5 ...
##  $ a1    : num  0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
##  $ a2    : num  0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
##  $ a3    : num  0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
##  $ a4    : num  0 1.9 0 0 0 0 3.9 0 0 2.9 ...
##  $ a5    : num  34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
##  $ a6    : num  8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
##  $ a7    : num  0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...

Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.

Basic histograms can be constructed with the hist function.

hist(algae_blooms$mnO2)

This basic histogram can be spruced up by using Hadley Wickham’s (in)famous ggplot2 library.

library(ggplot2)

ggplot(algae_blooms,aes(x=mnO2)) +        # plotting mnO2 from the algae_blooms dataset ...
    geom_histogram(aes(y=..density..)) +  # as a histogram, where the vertical axis is the density ... 
    geom_density(color="blue") +          # on which will be layered a blue density curve ... 
    geom_rug() +                          # and a rug (or comb) showing where the observations actually fall...
    ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ... 
    xlab("") +                                                           # no x axis label ...   
    ylab("")                                                             # and no y axis label

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 2 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing non-finite values (stat_density).

Let’s do the same for a1.

ggplot(algae_blooms,aes(x=a1)) + 
    geom_histogram(aes(y=..density..)) +
    geom_density(color="red") + 
    geom_rug() +
    ggtitle("Histogram of minimum value of a1 among 340 observations") + 
    xlab("") +
    ylab("")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

\(qq-\)plots, another traditional statistical plot, can be produced with the car libary’s qqPlot function.

library(car)

## Loading required package: carData

qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")

## [1] 69 70

Again, we can see that the normal distribution is not a good fit for mnO2, but that it is even worse for a1 (see below).

qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")

## [1]  49 118

Now let’s take a look at some plots involving NH4.

ggplot(algae_blooms,aes(x=factor(0),y=NH4)) +   # plotting NH4 from the algae_blooms dataset ...
    geom_boxplot() +                            # as a boxplot ... 
    geom_rug() +                                # with a rug on which the true values are shown ...
    geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
    ylab("Ammonium (NH4+)") +                   
    xlab("") +
    scale_x_discrete(breaks=NULL)

## Warning: Use of `algae_blooms$NH4` is discouraged. Use `NH4` instead.

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

We don’t see much here because of the suspected outliers.

plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1)     # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3)   # median value of NH4, tight dashed line We can also look at the data and see which observations have values of NH4 below 3000 (roughly all values below the long dashed line above)

Let’s see what the boxplot above looks like without the suspected outliers.

ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) + 
    geom_boxplot() + 
    geom_rug() +
    geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
    ylab("Ammonium (NH4+)") +
    xlab("") +
    scale_x_discrete(breaks=NULL)

## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

It’s a bit better, to be sure (the box structure has expanded). There still seems to be a bit of very high values. Perhaps that’s normal? How would we find out?

Now let’s take a look at some of the algae levels.

ggplot(algae_blooms,aes(x=season,y=a3)) +   # plot a3 by season ...
    geom_boxplot() +                        # in a series of boxplots ...
    xlab("Season") +                        # with x axis as Seasons and y axis as a3
    ylab("Algae Level a3")

We can re-arrange the factors’ order, but it requires a bit of fancy footwork using the forcats’library fct_relevel function, and dplyr’s mutate.

library(forcats) # for fct_relevel
library(dplyr)   # for mutate

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

algae_blooms = mutate(algae_blooms, 
                     size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
                     speed=fct_relevel(speed,c("low","medium","high")),  # ditto
                     season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
                     )

ggplot(algae_blooms,aes(x=season,y=a3)) + 
    geom_boxplot() +
    xlab("Season") +
    ylab("Algae Level a3")

Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3 trend?

ggplot(algae_blooms,aes(x=season,y=a3)) +  # plot a3 by season ...
    geom_violin() +                        # in a series of violin plots ...  
    geom_jitter() +                        # with some jitter to avoid all the points being on top of one another ...
    xlab("Season") + 
    ylab("Algae Level a3")

(What happens if you turn off the jitter option?)

Let’s take a look at both NH4 and season.

We only keep the observations for which NH4\(> 3000\), and we bin them with respect to the quartiles.

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>% 
    filter(NH4<3000) %>% 
    mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")

## 
##       [5,36.8]     (36.8,103]      (103,226] (226,2.17e+03] 
##             82             82             81             82

ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    guides(color=FALSE) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

library(ggplot2)

library("dslabs")
data(package="dslabs")

5. `ggplot2` and the Tidyverse

5.1 US GUN MURDERS (2010)

data("murders")

?murders

head(murders)

str(murders)

## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────── tidyverse 1.2.1 ──

## ✓ tibble  3.0.1     ✓ readr   1.3.1
## ✓ tidyr   1.0.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ stringr 1.4.0

## Warning: package 'tibble' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some()   masks car::some()

r <- murders %>%
  summarize(pop=sum(population), tot=sum(total)) %>%
  mutate(rate = tot/pop*10^6) %>% .$rate

r

## [1] 30.34555

library(ggrepel)
library(ggthemes)
murders %>% ggplot(aes(x = population/10^6, y = total, label = abb)) +
  geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
  geom_point(aes(color=region), size = 3) +
  geom_text_repel() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Populations in millions (log scale)") +
  ylab("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name="Region")

5.2. GAPMINDER

data("gapminder")
?gapminder

head(gapminder)

str(gapminder)

## 'data.frame':    10545 obs. of  9 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

summary(gapminder)

##                 country           year      infant_mortality
##  Albania            :   57   Min.   :1960   Min.   :  1.50  
##  Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00  
##  Angola             :   57   Median :1988   Median : 41.50  
##  Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31  
##  Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10  
##  Armenia            :   57   Max.   :2016   Max.   :276.90  
##  (Other)            :10203                  NA's   :1453    
##  life_expectancy   fertility       population             gdp           
##  Min.   :13.20   Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07  
##  1st Qu.:57.50   1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09  
##  Median :67.54   Median :3.750   Median :5.009e+06   Median :7.794e+09  
##  Mean   :64.81   Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11  
##  3rd Qu.:73.00   3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10  
##  Max.   :83.90   Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13  
##                  NA's   :187     NA's   :185         NA's   :2972       
##     continent                region    
##  Africa  :2907   Western Asia   :1026  
##  Americas:2052   Eastern Africa : 912  
##  Asia    :2679   Western Africa : 912  
##  Europe  :2223   Caribbean      : 741  
##  Oceania : 684   South America  : 684  
##                  Southern Europe: 684  
##                  (Other)        :5586

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>%
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))

filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
         !is.na(fertility) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population/10^6) %>%
  ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) + 
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(30, 85)) +
  xlab("Fertility rate (births per woman)") +
  ylab("Life Expectancy") +
  geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
  facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

gapminder$gdppc = gapminder$gdp/gapminder$population
gapminder2 <- gapminder[,c(1,2,4,6,8,9,11)]
head(gapminder2)

filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = (population/10^6)) %>%
  ggplot( aes(x=gdppc, y=life_expectancy, col = continent, size = population_in_millions, label = country)) +
  geom_point(alpha = 1) + 
  geom_text_repel() +
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(40, 85)) +
  scale_x_log10() +
  xlab("GDP per capita (log scale)") +
  ylab("Life Expectancy (in years)") +
  ggtitle("Health and Wealth of Nations (2011)") +
  #facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = (population/10^6)) %>%
  ggplot( aes(x=gdppc, y=life_expectancy, col = continent, size = population_in_millions, label = country)) +
  geom_point(alpha = 1) + 
  geom_smooth(method='lm',formula=y~x, se=FALSE) + 
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(40, 85)) +
  scale_x_log10() +
  xlab("GDP per capita (log scale)") +
  ylab("Life Expectancy (in years)") +
  ggtitle("Health and Wealth of Nations (2011)") +
  #facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = (population/10^6)) %>%
  ggplot( aes(x=gdppc, y=life_expectancy, size = population_in_millions, label = country)) +
  geom_point(alpha = 1) + 
  geom_smooth(method='lm',formula=y~x, se=FALSE) + 
  guides(size=FALSE) +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(40, 85)) +
  scale_x_log10() +
  xlab("GDP per capita (log scale)") +
  ylab("Life Expectancy (in years)") +
  ggtitle("Health and Wealth of Nations (2011)") +
  #facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

5.3 FIVETHIRTYEIGHT’s 2016 POLL DATA

data(polls_us_election_2016)
?polls_us_election_2016
head(polls_us_election_2016)

str(polls_us_election_2016)

## 'data.frame':    4208 obs. of  15 variables:
##  $ state           : Factor w/ 57 levels "Alabama","Alaska",..: 50 50 50 50 50 50 50 50 37 50 ...
##  $ startdate       : Date, format: "2016-11-03" "2016-11-01" ...
##  $ enddate         : Date, format: "2016-11-06" "2016-11-07" ...
##  $ pollster        : Factor w/ 196 levels "ABC News/Washington Post",..: 1 63 81 194 65 55 18 113 195 76 ...
##  $ grade           : Factor w/ 10 levels "D","C-","C","C+",..: 10 6 8 6 5 9 8 8 NA 8 ...
##  $ samplesize      : int  2220 26574 2195 3677 16639 1295 1426 1282 8439 1107 ...
##  $ population      : chr  "lv" "lv" "lv" "lv" ...
##  $ rawpoll_clinton : num  47 38 42 45 47 ...
##  $ rawpoll_trump   : num  43 35.7 39 41 43 ...
##  $ rawpoll_johnson : num  4 5.46 6 5 3 3 5 6 6 7.1 ...
##  $ rawpoll_mcmullin: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ adjpoll_clinton : num  45.2 43.3 42 45.7 46.8 ...
##  $ adjpoll_trump   : num  41.7 41.2 38.8 40.9 42.3 ...
##  $ adjpoll_johnson : num  4.63 5.18 6.84 6.07 3.73 ...
##  $ adjpoll_mcmullin: num  NA NA NA NA NA NA NA NA NA NA ...

summary(polls_us_election_2016)

##             state        startdate             enddate          
##  U.S.          :1106   Min.   :2015-11-06   Min.   :2015-11-08  
##  Florida       : 148   1st Qu.:2016-08-10   1st Qu.:2016-08-21  
##  North Carolina: 125   Median :2016-09-23   Median :2016-09-30  
##  Pennsylvania  : 125   Mean   :2016-08-31   Mean   :2016-09-06  
##  Ohio          : 115   3rd Qu.:2016-10-20   3rd Qu.:2016-10-28  
##  New Hampshire : 112   Max.   :2016-11-06   Max.   :2016-11-07  
##  (Other)       :2477                                            
##                                      pollster        grade     
##  Ipsos                                   : 919   A-     :1085  
##  Google Consumer Surveys                 : 743   B      :1011  
##  SurveyMonkey                            : 660   C-     : 693  
##  YouGov                                  : 130   C+     : 329  
##  Rasmussen Reports/Pulse Opinion Research: 125   B+     : 204  
##  USC Dornsife/LA Times                   : 121   (Other): 457  
##  (Other)                                 :1510   NA's   : 429  
##    samplesize       population        rawpoll_clinton rawpoll_trump  
##  Min.   :   35.0   Length:4208        Min.   :11.04   Min.   : 4.00  
##  1st Qu.:  447.5   Class :character   1st Qu.:38.00   1st Qu.:35.00  
##  Median :  772.0   Mode  :character   Median :43.00   Median :40.00  
##  Mean   : 1148.2                      Mean   :41.99   Mean   :39.83  
##  3rd Qu.: 1236.5                      3rd Qu.:46.20   3rd Qu.:45.00  
##  Max.   :84292.0                      Max.   :88.00   Max.   :68.00  
##  NA's   :1                                                           
##  rawpoll_johnson  rawpoll_mcmullin adjpoll_clinton adjpoll_trump   
##  Min.   : 0.000   Min.   : 9.0     Min.   :17.06   Min.   : 4.373  
##  1st Qu.: 5.400   1st Qu.:22.5     1st Qu.:40.21   1st Qu.:38.429  
##  Median : 7.000   Median :25.0     Median :44.15   Median :42.765  
##  Mean   : 7.382   Mean   :24.0     Mean   :43.32   Mean   :42.674  
##  3rd Qu.: 9.000   3rd Qu.:27.9     3rd Qu.:46.92   3rd Qu.:46.290  
##  Max.   :25.000   Max.   :31.0     Max.   :86.77   Max.   :72.433  
##  NA's   :1409     NA's   :4178                                     
##  adjpoll_johnson  adjpoll_mcmullin
##  Min.   :-3.668   Min.   :11.03   
##  1st Qu.: 3.145   1st Qu.:23.11   
##  Median : 4.384   Median :25.14   
##  Mean   : 4.660   Mean   :24.51   
##  3rd Qu.: 5.756   3rd Qu.:27.98   
##  Max.   :20.367   Max.   :31.57   
##  NA's   :1409     NA's   :4178

polls_us_election_2016 %>%
  filter(state == "U.S." & enddate>="2016-07-01") %>%
  select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
  rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
  gather(candidate, percentage, -enddate, -pollster) %>% 
  mutate(candidate = factor(candidate, levels = c("Trump","Clinton")))%>%
  group_by(pollster) %>%
  filter(n()>=10) %>%
  ungroup() %>%
  ggplot(aes(enddate, percentage, color = candidate)) +  
  geom_point(show.legend = FALSE, alpha=0.4)  + 
  geom_smooth(method = "loess", span = 0.15) +
  scale_y_continuous(limits = c(30,50))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 22 rows containing non-finite values (stat_smooth).

## Warning: Removed 22 rows containing missing values (geom_point).

5.4 CONTAGIOUS DISEASES (US STATES)

library(RColorBrewer)
data("us_contagious_diseases")
?us_contagious_diseases
head(us_contagious_diseases)

str(us_contagious_diseases)

## 'data.frame':    16065 obs. of  6 variables:
##  $ disease        : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ state          : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year           : num  1966 1967 1968 1969 1970 ...
##  $ weeks_reporting: num  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : num  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : num  3345787 3364130 3386068 3412450 3444165 ...

summary(us_contagious_diseases)

##         disease            state            year      weeks_reporting
##  Hepatitis A:2346   Alabama   :  315   Min.   :1928   Min.   : 0.00  
##  Measles    :3825   Alaska    :  315   1st Qu.:1950   1st Qu.:31.00  
##  Mumps      :1785   Arizona   :  315   Median :1975   Median :46.00  
##  Pertussis  :2856   Arkansas  :  315   Mean   :1971   Mean   :37.38  
##  Polio      :2091   California:  315   3rd Qu.:1990   3rd Qu.:50.00  
##  Rubella    :1887   Colorado  :  315   Max.   :2011   Max.   :52.00  
##  Smallpox   :1275   (Other)   :14175                                 
##      count          population      
##  Min.   :     0   Min.   :   86853  
##  1st Qu.:     7   1st Qu.: 1018755  
##  Median :    69   Median : 2749249  
##  Mean   :  1492   Mean   : 4107584  
##  3rd Qu.:   525   3rd Qu.: 4996229  
##  Max.   :132342   Max.   :37607525  
##                   NA's   :214

the_disease <- "Pertussis"
us_contagious_diseases %>%
  filter(disease ==  the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate)) %>%
  ggplot(aes(year, state,  fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")

5.5 ARTIFICIAL DATASET

library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# artificial data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
  group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
  group2 = gl(2, 8))
head(d)

ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
  facet_grid(~group2)

5.6 NEW YORK CHORAL SOCIETY SINGERS

library("ggplot2")
data(singer, package="lattice")
?singer
summary(singer,8)

##      height         voice.part
##  Min.   :60.0   Bass 2   :26  
##  1st Qu.:65.0   Bass 1   :39  
##  Median :67.0   Tenor 2  :21  
##  Mean   :67.3   Tenor 1  :21  
##  3rd Qu.:70.0   Alto 2   :27  
##  Max.   :76.0   Alto 1   :35  
##                 Soprano 2:30  
##                 Soprano 1:36

table(singer$height,singer$voice.part)

##     
##      Bass 2 Bass 1 Tenor 2 Tenor 1 Alto 2 Alto 1 Soprano 2 Soprano 1
##   60      0      0       0       0      0      1         3         1
##   61      0      0       0       0      0      4         2         2
##   62      0      0       0       0      0      3         5         6
##   63      0      0       0       0      3      4         3         3
##   64      0      0       0       2      5      3         4         1
##   65      0      0       0       1      5      4         5        15
##   66      1      2       1       3      6      7         3         6
##   67      2      0       0       2      2      4         3         1
##   68      2      6       3       3      0      2         1         1
##   69      1      3       7       1      1      1         0         0
##   70      4      8       2       2      5      1         1         0
##   71      1      6       6       2      0      0         0         0
##   72      7      6       0       2      0      1         0         0
##   73      0      3       1       1      0      0         0         0
##   74      4      1       0       1      0      0         0         0
##   75      4      4       0       0      0      0         0         0
##   76      0      0       1       1      0      0         0         0

ggplot(singer, aes(x=height)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(singer, aes(x=voice.part, y=height)) + geom_boxplot()

ggplot(data=singer, aes(x=height)) +
       geom_histogram() +
       facet_wrap(~voice.part, nrow=4)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

5.7 UNIVERSITY PROFESSORS SALARIES

data(package="car")

## no data sets found

head(Salaries)

summary(Salaries)

##         rank     discipline yrs.since.phd    yrs.service        sex     
##  AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
##  AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
##  Prof     :266              Median :21.00   Median :16.00               
##                             Mean   :22.31   Mean   :17.61               
##                             3rd Qu.:32.00   3rd Qu.:27.00               
##                             Max.   :56.00   Max.   :60.00               
##      salary      
##  Min.   : 57800  
##  1st Qu.: 91000  
##  Median :107300  
##  Mean   :113706  
##  3rd Qu.:134185  
##  Max.   :231545

ggplot(Salaries, aes(x=rank, y=salary)) +
       geom_boxplot(fill="cornflowerblue",
       color="black", notch=TRUE)+
       geom_point(position="jitter", color="blue", alpha=.5)+
       geom_rug(side="l", color="black")

## Warning: Ignoring unknown parameters: side

ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank,
       shape=rank)) + geom_point() + facet_grid(.~sex)

p1 <- ggplot(data=Salaries, aes(x=rank)) + geom_bar()
p2 <- ggplot(data=Salaries, aes(x=sex)) + geom_bar()
p3 <- ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) + geom_point()

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(p1, p2, p3, ncol=3)

library(car)
library(ggplot2)
mytheme <- theme(plot.title=element_text( 
                face="bold.italic", size="14", 
                color="brown"), axis.title= 
                element_text( face="bold.italic",
                    size=10, color="brown"),
                axis.text=element_text( 
                    face="bold", size=9,
                    color="darkblue"),
                panel.background=element_rect( 
                    fill="white",color="darkblue"),
                panel.grid.major.y=element_line( 
                    color="grey", linetype=1),
                panel.grid.minor.y=element_line( 
                    color="grey", linetype=2),
                panel.grid.minor.x=element_blank(),
                legend.position="top")

ggplot(Salaries, aes(x=rank, y=salary, fill=sex)) +
       geom_boxplot() +
       labs(title="Salary by Rank and Sex", x="Rank", y="Salary") +
       mytheme

5.8 MPG

head(mpg)

?mpg

ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = class))

ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "red")

5.9 WORLD PHONES

data("WorldPhones")
head(WorldPhones)

##      N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
## 1951  45939  21574 2876   1815    1646     89      555
## 1956  60423  29990 4708   2568    2366   1411      733
## 1957  64721  32510 5230   2695    2526   1546      773
## 1958  68484  35218 6662   2845    2691   1663      836
## 1959  71799  37598 6856   3000    2868   1769      911
## 1960  76036  40341 8220   3145    3054   1905     1008

help(WorldPhones)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

WorldPhones.m = melt(WorldPhones)
head(WorldPhones.m)

colnames(WorldPhones.m) = c("Year", "Continent", "Phones")
head(WorldPhones.m)

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_point()

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_line()

ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_line() + scale_y_log10()

6. EXAMPLES and METHODS

6.1 Smoothing Lines

# install.packages("ggplot2")
# load package and data
options(scipen=999)  # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.
data("midwest", package = "ggplot2")
# midwest <- read.csv("http://goo.gl/G1K41K")  # bkup data source

# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot", 
       caption = "Source: midwest")

plot(gg)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 15 rows containing non-finite values (stat_smooth).

## Warning: Removed 15 rows containing missing values (geom_point).

6.2 Scatterplots and Jitter Charts

data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")

mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) + 
  labs(subtitle="mpg: Displacement vs City Mileage",
       title="Bubble chart")

g + geom_jitter(aes(col=manufacturer, size=hwy)) + 
  geom_smooth(aes(col=manufacturer), method="lm", se=F)

## `geom_smooth()` using formula 'y ~ x'

6.3 Animations

# Source: https://github.com/dgrtwo/gganimate
#install.packages("devtools")
# install.packages("cowplot")  # a gganimate dependency
# devtools::install_github("dgrtwo/gganimate")


library(ggplot2)
library(gganimate)

## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session

library(gapminder)

## 
## Attaching package: 'gapminder'

## The following object is masked _by_ '.GlobalEnv':
## 
##     gapminder

## The following object is masked from 'package:dslabs':
## 
##     gapminder

theme_set(theme_bw())  # pre-set the bw theme.

head(gapminder)

ggplot(gapminder, aes(gdppc, life_expectancy, size = population, colour = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  #scale_colour_manual(values = country_colors) +
  scale_size(range = c(2, 12)) +
  scale_x_log10() +
  facet_wrap(~continent) +
  # Here comes the gganimate specific bits
  labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')

## Warning: No renderer available. Please install the gifski, av, or magick
## package to create animated output

## NULL

# anim_save(file="gapminder.gif") # saved, not plotted

6.4 Marginal Distributions

library(ggplot2)
library(ggExtra)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
mpg_select <- mpg[mpg$hwy >= 35 & mpg$cty > 27, ]
g <- ggplot(mpg, aes(cty, hwy)) + 
  geom_count() + 
  geom_smooth(method="lm", se=F)

ggMarginal(g, type = "histogram", fill="transparent")

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

ggMarginal(g, type = "boxplot", fill="transparent")

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# ggMarginal(g, type = "density", fill="transparent")

6.5 Diverging Bar Charts

library(ggplot2)
theme_set(theme_bw())  
# Data Prep
data("mtcars")  # load data
mtcars$`car name` <- rownames(mtcars)  # create new column for car names
mtcars$mpg_z <- round((mtcars$mpg - mean(mtcars$mpg))/sd(mtcars$mpg), 2)  # compute normalized mpg
mtcars$mpg_type <- ifelse(mtcars$mpg_z < 0, "below", "above")  # above / below avg flag
mtcars <- mtcars[order(mtcars$mpg_z), ]  # sort
mtcars$`car name` <- factor(mtcars$`car name`, levels = mtcars$`car name`)  # convert to factor to retain sorted order in plot.

# Diverging Barcharts
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) + 
  geom_bar(stat='identity', aes(fill=mpg_type), width=.5)  +
  scale_fill_manual(name="Mileage", 
                    labels = c("Above Average", "Below Average"), 
                    values = c("above"="#00ba38", "below"="#f8766d")) + 
  labs(subtitle="Normalised mileage from 'mtcars'", 
       title= "Diverging Bars") + 
  coord_flip()

6.6 Area Charts

library(ggplot2)
#install.packages("quantmod")
library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Version 0.4-0 included new data defaults. See ?getSymbols.

data("economics", package = "ggplot2")

# Compute % Returns
economics$returns_perc <- c(0, diff(economics$psavert)/economics$psavert[-length(economics$psavert)])

# Create break points and labels for axis ticks
brks <- economics$date[seq(1, length(economics$date), 12)]
#install.packages("lubridate")
lbls <- lubridate::year(economics$date[seq(1, length(economics$date), 12)]) 

# Plot
ggplot(economics[1:100, ], aes(date, returns_perc)) + 
  geom_area() + 
  scale_x_date(breaks=brks, labels=lbls) + 
  theme(axis.text.x = element_text(angle=90)) + 
  labs(title="Area Chart", 
       subtitle = "Perc Returns for Personal Savings", 
       y="% Returns for Personal savings", 
       caption="Source: economics")

6.7 Funnel Charts

library(ggplot2)
library(ggthemes)
options(scipen = 999)  # turns of scientific notations like 1e+40
# Read data
email_campaign_funnel <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

# X Axis Breaks and Labels 
brks <- seq(-15000000, 15000000, 5000000)
lbls = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")

# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) +   # Fill column
                              geom_bar(stat = "identity", width = .6) +   # draw the bars
                              scale_y_continuous(breaks = brks,   # Breaks
                                                 labels = lbls) + # Labels
                              coord_flip() +  # Flip axes
                              labs(title="Email Campaign Funnel") +
                              theme_tufte() +  # Tufte theme from ggfortify
                              theme(plot.title = element_text(hjust = .5), 
                                    axis.ticks = element_blank()) +   # Centre plot title
                              scale_fill_brewer(palette = "Dark2")  # Color palette

6.8 Calendar Heatmaps

# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:purrr':
## 
##     compact

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(zoo)

df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date)  # format date
df <- df[df$year >= 2012, ]  # filter reqd years

# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df,.(yearmonthf), transform, monthweek=1+week-min(week))  # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek", "weekdayf", "VIX.Close")]
head(df)

#>   year yearmonthf monthf week monthweek weekdayf VIX.Close
#> 1 2012   Jan 2012    Jan    1         1      Tue     22.97
#> 2 2012   Jan 2012    Jan    1         1      Wed     22.22
#> 3 2012   Jan 2012    Jan    1         1      Thu     21.48
#> 4 2012   Jan 2012    Jan    1         1      Fri     20.63
#> 5 2012   Jan 2012    Jan    2         2      Mon     21.07
#> 6 2012   Jan 2012    Jan    2         2      Tue     20.69


# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) + 
  geom_tile(colour = "white") + 
  facet_grid(year~monthf) + 
  scale_fill_gradient(low="red", high="green") +
  labs(x="Week of Month",
       y="",
       title = "Time-Series Calendar Heatmap", 
       subtitle="Yahoo Closing Price", 
       fill="Close")

6.9 Ordered Bar Charts

# Prepare data: group mean city mileage by manufacturer.
cty_mpg <- aggregate(mpg$cty, by=list(mpg$manufacturer), FUN=mean)  # aggregate
colnames(cty_mpg) <- c("make", "mileage")  # change column names
cty_mpg <- cty_mpg[order(cty_mpg$mileage), ]  # sort
cty_mpg$make <- factor(cty_mpg$make, levels = cty_mpg$make)  # to retain the order in plot.
head(cty_mpg, 4)

#>          make  mileage
#> 9     lincoln 11.33333
#> 8  land rover 11.50000
#> 3       dodge 13.13514
#> 10    mercury 13.25000

#The X variable is now a factor, let's plot.

library(ggplot2)
theme_set(theme_bw())

# Draw plot
ggplot(cty_mpg, aes(x=make, y=mileage)) + 
  geom_bar(stat="identity", width=.5, fill="tomato3") + 
  labs(title="Ordered Bar Chart", 
       subtitle="Make Vs Avg. Mileage", 
       caption="source: mpg") + 
  theme(axis.text.x = element_text(angle=65, vjust=0.6))

6.10 Correlograms

#install.packages("ggcorrplot")
library(ggplot2)
library(ggcorrplot)

# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)

# Plot
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of mtcars", 
           ggtheme=theme_bw)

6.11 Treemaps

library(devtools)

## Warning: package 'devtools' was built under R version 3.6.2

## Loading required package: usethis

## Warning: package 'usethis' was built under R version 3.6.2

#devtools::install_github("wilkox/treemapify")
library(treemapify)
library(ggplot2)
data(G20)
head(G20)

#       region       country gdp_mil_usd   hdi econ_classification
#       Africa  South Africa      384315 0.629          Developing
# North America United States    15684750 0.937            Advanced
# North America        Canada     1819081 0.911            Advanced
# North America        Mexico     1177116 0.775          Developing
# South America        Brazil     2395968 0.730          Developing
# South America     Argentina      474954 0.811          Developing

ggplot(G20, aes(area = gdp_mil_usd, fill = region, label = country)) +
  geom_treemap() +
  geom_treemap_text(grow = T, reflow = T, colour = "black") +
  facet_wrap( ~ econ_classification) +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom") +
  labs(
    title = "The G-20 major economies",
    caption = "The area of each country is proportional to its relative GDP
    within the economic group (advanced or developing)",
    fill = "Region"
  )

6.12 Network Charts

library(ggplot2)
library(ggnetwork)
library(geomnet)

## Registered S3 methods overwritten by 'geomnet':
##   method          from     
##   fortify.igraph  ggnetwork
##   fortify.network ggnetwork

library(network)

## network: Classes for Relational Data
## Version 1.16.0 created on 2019-11-30.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
##                     Mark S. Handcock, University of California -- Los Angeles
##                     David R. Hunter, Penn State University
##                     Martina Morris, University of Washington
##                     Skye Bender-deMoll, University of Washington
##  For citation information, type citation("network").
##  Type help("network-package") to get started.

## 
## Attaching package: 'network'

## The following object is masked from 'package:plyr':
## 
##     is.discrete

# make the data available
data(madmen, package = 'geomnet')
# data step for  ggnetwork
# create undirected network
mm.net <- network(madmen$edges[, 1:2], directed = FALSE)
mm.net # glance at network object

##  Network attributes:
##   vertices = 45 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 39 
##     missing edges= 0 
##     non-missing edges= 39 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

## Network attributes:
## vertices = 45
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 39
## missing edges= 0
## non-missing edges= 39
##
## Vertex attribute names:
## vertex.names
## No edge attributes
# create node attribute (gender)
rownames(madmen$vertices) <- madmen$vertices$label
mm.net %v% "gender" <- as.character(
  madmen$vertices[ network.vertex.names(mm.net), "Gender"]
)
# gender color palette
mm.col <- c("female" = "#ff0000", "male" = "#00ff00")
set.seed(10052016)
ggplot(data = ggnetwork(mm.net, layout = "kamadakawai"),
       aes(x, y, xend = xend, yend = yend)) +
  geom_edges(color = "grey50") + # draw edge layer
  geom_nodes(aes(colour = gender), size = 2) + # draw node layer
  geom_nodetext(aes(colour = gender, label = vertex.names),
                size = 3, vjust = -0.6) + # draw node label layer
  scale_colour_manual(values = mm.col) +
  xlim(c(-0.05, 1.05)) +
  theme_blank() +
  theme(legend.position = "bottom")

6.13 Parallel Coordinates

library(triangle)
set.seed(0)

q1_d1 <- round(rtriangle(1000, 1, 7, 5))
q1_d2 <- round(rtriangle(1000, 1, 7, 6))
q1_d3 <- round(rtriangle(1000, 1, 7, 2))
df <- data.frame(q1_d1 = factor(q1_d1), q1_d2 = factor(q1_d2), q1_d3 =  factor(q1_d3))

library(dplyr)

# group by combinations and count
df_grouped <- df %>% group_by(q1_d1, q1_d2, q1_d3) %>% count()

# set an id string that denotes the value combination
df_grouped <- df_grouped %>% mutate(id = factor(paste(q1_d1, q1_d2, q1_d3, sep = '-')))

order.freq <- order(df_grouped[,4],decreasing=TRUE)

# sort by count and select top rows
df_grouped <- df_grouped[order.freq[1:25],]

library(reshape2)
library(ggplot2)
# create long format
df_pcp <- melt(df_grouped, id.vars = c('id', 'freq'))
df_pcp$value <- factor(df_pcp$value)

y_levels <- levels(factor(1:7))
ggplot(df_pcp, aes(x = variable, y = value, group = id)) +   # group = id is important!
  geom_path(aes(size = freq, color = id),
            alpha = 0.5,
            lineend = 'round', linejoin = 'round') +
  scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
  scale_size(breaks = NULL, range = c(1, 7))

6.14 Time Series and Variants

## From Timeseries object (ts)
library(ggplot2)
library(ggfortify)

## Warning: package 'ggfortify' was built under R version 3.6.2

theme_set(theme_classic())

# Plot 
autoplot(AirPassengers) + 
  labs(title="AirPassengers") + 
  theme(plot.title = element_text(hjust=0.5))

library(ggplot2)
theme_set(theme_classic())
# Allow Default X Axis Labels
ggplot(economics, aes(x=date)) + 
  geom_line(aes(y=unemploy)) + 
  labs(title="Time Series Chart", 
       subtitle="Number of unemployed in thousands from 'Economics-US' Dataset", 
       caption="Source: Economics", 
       y="unemploy")

library(ggplot2)
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:plyr':
## 
##     here

## The following object is masked from 'package:base':
## 
##     date

theme_set(theme_bw())

economics_m <- economics[1:24, ]

# labels and breaks for X axis text
lbls <- paste0(month.abb[month(economics_m$date)], " ", lubridate::year(economics_m$date))
brks <- economics_m$date

# plot
ggplot(economics_m, aes(x=date)) + 
  geom_line(aes(y=pce)) + 
  labs(title="Monthly Time Series", 
       subtitle="Personal consumption expenditures, in billions of dollars", 
       caption="Source: Economics", 
       y="pce") +  # title and caption
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

economics_y <- economics[1:90, ]

# labels and breaks for X axis text
brks <- economics_y$date[seq(1, length(economics_y$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(economics_y, aes(x=date)) + 
  geom_line(aes(y=psavert)) + 
  labs(title="Yearly Time Series", 
       subtitle="Personal savings rate", 
       caption="Source: Economics", 
       y="psavert") +  # title and caption
  scale_x_date(labels = lbls, 
               breaks = brks) +  # change to monthly ticks and labels
  theme(axis.text.x = element_text(angle = 90, vjust=0.5),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

data(economics_long, package = "ggplot2")
head(economics_long)

#>         date variable value      value01
#>       <date>   <fctr> <dbl>        <dbl>
#> 1 1967-07-01      pce 507.4 0.0000000000
#> 2 1967-08-01      pce 510.5 0.0002660008
#> 3 1967-09-01      pce 516.3 0.0007636797
#> 4 1967-10-01      pce 512.9 0.0004719369
#> 5 1967-11-01      pce 518.1 0.0009181318
#> 6 1967-12-01      pce 525.8 0.0015788435
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics_long[economics_long$variable %in% c("psavert", "uempmed"), ]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) + 
  geom_line(aes(y=value, col=variable)) + 
  labs(title="Time Series of Returns Percentage", 
       subtitle="Drawn from Long Data format", 
       caption="Source: Economics", 
       y="Returns %", 
       color=NULL) +  # title and caption
  scale_x_date(labels = lbls, breaks = brks) +  # change to monthly ticks and labels
  scale_color_manual(labels = c("psavert", "uempmed"), 
                     values = c("psavert"="#00ba38", "uempmed"="#f8766d")) +  # line color
  theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8),  # rotate x axis text
        panel.grid.minor = element_blank())  # turn off minor grid

library(ggplot2)
library(lubridate)
theme_set(theme_bw())

df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]

# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)

# plot
ggplot(df, aes(x=date)) + 
  geom_area(aes(y=psavert+uempmed, fill="psavert")) + 
  geom_area(aes(y=uempmed, fill="uempmed")) + 
  labs(title="Area Chart of Returns Percentage", 
       subtitle="From Wide Data format", 
       caption="Source: Economics", 
       y="Returns %") +  # title and caption
  scale_x_date(labels = lbls, breaks = brks) +  # change to monthly ticks and labels
  scale_fill_manual(name="", 
                    values = c("psavert"="#00ba38", "uempmed"="#f8766d")) +  # line color
  theme(panel.grid.minor = element_blank())  # turn off minor grid

library(ggplot2)
library(forecast)

## Warning: package 'forecast' was built under R version 3.6.2

## Registered S3 methods overwritten by 'forecast':
##   method                 from     
##   autoplot.Arima         ggfortify
##   autoplot.acf           ggfortify
##   autoplot.ar            ggfortify
##   autoplot.bats          ggfortify
##   autoplot.decomposed.ts ggfortify
##   autoplot.ets           ggfortify
##   autoplot.forecast      ggfortify
##   autoplot.stl           ggfortify
##   autoplot.ts            ggfortify
##   fitted.ar              ggfortify
##   fortify.ts             ggfortify
##   residuals.ar           ggfortify

theme_set(theme_classic())

# Subset data
nottem_small <- window(nottem, start=c(1920, 1), end=c(1925, 12))  # subset a smaller timewindow

# Plot
ggseasonplot(AirPassengers) + labs(title="Seasonal plot: International Airline Passengers")

ggseasonplot(nottem_small) + labs(title="Seasonal plot: Air temperatures at Nottingham Castle")

6.15 Clusters

# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from     
##   fortify.table           ggfortify
##   grid.draw.absoluteGrob  ggplot2  
##   grobHeight.absoluteGrob ggplot2  
##   grobWidth.absoluteGrob  ggplot2  
##   grobX.absoluteGrob      ggplot2  
##   grobY.absoluteGrob      ggplot2

library(ggfortify)
theme_set(theme_classic())
# Compute data with principal components 
df <- iris[c(1, 2, 3, 4)]
pca_mod <- prcomp(df)  # compute principal components
# Data frame of principal components -
df_pc <- data.frame(pca_mod$x, Species=iris$Species)  # dataframe of principal components
df_pc_vir <- df_pc[df_pc$Species == "virginica", ]  # df for 'virginica'
df_pc_set <- df_pc[df_pc$Species == "setosa", ]  # df for 'setosa'
df_pc_ver <- df_pc[df_pc$Species == "versicolor", ]  # df for 'versicolor'

# Plot -
ggplot(df_pc, aes(PC1, PC2, col=Species)) + 
  geom_point(aes(shape=Species), size=2) +   # draw points
  labs(title="Iris Clustering", 
       subtitle="With principal components PC1 and PC2 as X and Y axis",
       caption="Source: Iris") + 
  coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)), 
                  ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2))) +   # change axis limits
  geom_encircle(data = df_pc_vir, aes(x=PC1, y=PC2)) +   # draw circles
  geom_encircle(data = df_pc_set, aes(x=PC1, y=PC2)) + 
  geom_encircle(data = df_pc_ver, aes(x=PC1, y=PC2))

6.16 Dumbbell Charts

# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
theme_set(theme_classic())

health <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")

# for right ordering of the dumbells
health$Area <- factor(health$Area, levels=as.character(health$Area))  
# health$Area <- factor(health$Area)
gg <- ggplot(health, aes(x=pct_2013, xend=pct_2014, y=Area, group=Area)) + 
        geom_dumbbell(color="#a3c4dc", 
                      size=0.75, 
                      point.colour.l="#0e668b") + 
        scale_x_continuous(label=waiver()) + 
        labs(x=NULL, 
             y=NULL, 
             title="Dumbbell Chart", 
             subtitle="Pct Change: 2013 vs 2014", 
             caption="Source: https://github.com/hrbrmstr/ggalt") +
        theme(plot.title = element_text(hjust=0.5, face="bold"),
              plot.background=element_rect(fill="#f7f7f7"),
              panel.background=element_rect(fill="#f7f7f7"),
              panel.grid.minor=element_blank(),
              panel.grid.major.y=element_blank(),
              panel.grid.major.x=element_line(),
              axis.ticks=element_blank(),
              legend.position="top",
              panel.border=element_blank())

## Warning: Ignoring unknown parameters: point.colour.l

plot(gg)

6.17 Slope Charts

library(dplyr)
theme_set(theme_classic())
source_df <- read.csv("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv")

# Define functions. Source: https://github.com/jkeirstead/r-slopegraph
tufte_sort <- function(df, x="year", y="value", group="group", method="tufte", min.space=0.05) {
    ## First rename the columns for consistency
    ids <- match(c(x, y, group), names(df))
    df <- df[,ids]
    names(df) <- c("x", "y", "group")

    ## Expand grid to ensure every combination has a defined value
    tmp <- expand.grid(x=unique(df$x), group=unique(df$group))
    tmp <- merge(df, tmp, all.y=TRUE)
    df <- mutate(tmp, y=ifelse(is.na(y), 0, y))
  
    ## Cast into a matrix shape and arrange by first column
    require(reshape2)
    tmp <- dcast(df, group ~ x, value.var="y")
    ord <- order(tmp[,2])
    tmp <- tmp[ord,]
    
    min.space <- min.space*diff(range(tmp[,-1]))
    yshift <- numeric(nrow(tmp))
    ## Start at "bottom" row
    ## Repeat for rest of the rows until you hit the top
    for (i in 2:nrow(tmp)) {
        ## Shift subsequent row up by equal space so gap between
        ## two entries is >= minimum
        mat <- as.matrix(tmp[(i-1):i, -1])
        d.min <- min(diff(mat))
        yshift[i] <- ifelse(d.min < min.space, min.space - d.min, 0)
    }

    
    tmp <- cbind(tmp, yshift=cumsum(yshift))

    scale <- 1
    tmp <- melt(tmp, id=c("group", "yshift"), variable.name="x", value.name="y")
    ## Store these gaps in a separate variable so that they can be scaled ypos = a*yshift + y

    tmp <- transform(tmp, ypos=y + scale*yshift)
    return(tmp)
   
}
plot_slopegraph <- function(df) {
    ylabs <- subset(df, x==head(x,1))$group
    yvals <- subset(df, x==head(x,1))$ypos
    fontSize <- 3
    gg <- ggplot(df,aes(x=x,y=ypos)) +
        geom_line(aes(group=group),colour="grey80") +
        geom_point(colour="white",size=8) +
        geom_text(aes(label=y), size=fontSize, family="American Typewriter") +
        scale_y_continuous(name="", breaks=yvals, labels=ylabs)
    return(gg)
}    

## Prepare data    
df <- tufte_sort(source_df, 
                 x="year", 
                 y="value", 
                 group="group", 
                 method="tufte", 
                 min.space=0.05)

df <- transform(df, 
                x=factor(x, levels=c(5,10,15,20), 
                            labels=c("5 years","10 years","15 years","20 years")), 
                y=round(y))

## Plot
plot_slopegraph(df) + labs(title="Estimates of % survival rates") + 
                      theme(axis.title=element_blank(),
                            axis.ticks = element_blank(),
                            plot.title = element_text(hjust=0.5,
                                                      family = "American Typewriter",
                                                      face="bold"),
                            axis.text = element_text(family = "American Typewriter",
                                                     face="bold"))

6.18 Dendrograms

#install.packages("ggdendro")
library("ggplot2")
library("ggdendro")
theme_set(theme_bw())
hc <- hclust(dist(USArrests), "ave")  # hierarchical clustering
# plot
ggdendrogram(hc, rotate = TRUE, size = 2)

6.19 Density Plots

library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) + 
    labs(title="Density Plot", 
         subtitle="City Mileage Grouped by Number of cylinders",
         caption="Source: mpg",
         x="City Mileage",
         fill="# Cylinders")

6.20 Boxplots

library(ggplot2)
theme_set(theme_classic())

# Plot
g <- ggplot(mpg, aes(class, cty))
g + geom_boxplot(varwidth=T, fill="plum") + 
    labs(title="Boxplot", 
         subtitle="City Mileage grouped by Class of vehicle",
         caption="Source: mpg",
         x="Class of Vehicle",
         y="City Mileage")

6.21 Boxplots and Dotplots

library(ggplot2)
theme_set(theme_bw())

# plot
g <- ggplot(mpg, aes(manufacturer, cty))
g + geom_boxplot() + 
  geom_dotplot(binaxis='y', 
               stackdir='center', 
               dotsize = .5, 
               fill="red") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) + 
  labs(title="Boxplot + Dotplot", 
       subtitle="City Mileage vs Class: Each dot represents 1 row in source data",
       caption="Source: mpg",
       x="Class of Vehicle",
       y="City Mileage")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

6.22 Waffle Charts

library(ggplot2)
var <- mpg$class  # the categorical data 
## Prep data (nothing to change here)
nrows <- 10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(var) * ((nrows*nrows)/(length(var))))
categ_table

## var
##    2seater    compact    midsize    minivan     pickup subcompact 
##          2         20         18          5         14         15 
##        suv 
##         26

#>   2seater    compact    midsize    minivan     pickup subcompact        suv 
#>         2         20         18          5         14         15         26 
df$category <- factor(rep(names(categ_table), categ_table))  
# NOTE: if sum(categ_table) is not 100 (i.e. nrows^2), it will need adjustment to make the sum to 100.
## Plot
ggplot(df, aes(x = x, y = y, fill = category)) + 
        geom_tile(color = "black", size = 0.5) +
        scale_x_continuous(expand = c(0, 0)) +
        scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
        scale_fill_brewer(palette = "Set3") +
        labs(title="Waffle Chart", subtitle="'Class' of vehicles",
             caption="Source: mpg") + 
        theme(panel.border = element_rect(size = 2),
              plot.title = element_text(size = rel(1.2)),
              axis.text = element_blank(),
              axis.title = element_blank(),
              axis.ticks = element_blank(),
              legend.title = element_blank(),
              legend.position = "right")