In this notebook, we show how to produce more sophisticated charts in R using lattice, ggplot2, and the tidyverse.
As always, do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.).
We will explore a dataset related to the Titanic. It can be found online (see the URL below) in a txt format. We use read.table() to read the dataset into a data frame.
titanic.url <- "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
titanic <- read.table(titanic.url, sep=",", header=TRUE)
str(titanic)
## 'data.frame': 1313 obs. of 11 variables:
## $ row.names: int 1 2 3 4 5 6 7 8 9 10 ...
## $ pclass : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 0 0 0 1 1 1 0 1 0 ...
## $ name : Factor w/ 1310 levels "Abbing, Mr Anthony",..: 22 25 26 27 24 31 45 46 50 54 ...
## $ age : num 29 2 30 25 0.917 ...
## $ embarked : Factor w/ 4 levels "","Cherbourg",..: 4 4 4 4 4 4 4 4 4 2 ...
## $ home.dest: Factor w/ 372 levels "","?Havana, Cuba",..: 312 233 233 233 233 239 164 26 24 231 ...
## $ room : Factor w/ 54 levels "","2131","A-11",..: 13 42 42 42 41 51 48 6 19 1 ...
## $ ticket : Factor w/ 42 levels ""," ","111361 L57 19s 7d",..: 31 1 1 1 1 1 9 1 1 1 ...
## $ boat : Factor w/ 100 levels "","(101)","(103)",..: 88 1 13 1 80 89 79 1 88 35 ...
## $ sex : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 1 2 ...
It contains 1313 observations and 11 variables (the first of which is a row identifier). We see that most variables are factors (categorical variables). Based on the names of some of the variables, we might expect some of them to be character variables instead.
head(titanic)
## row.names pclass survived
## 1 1 1st 1
## 2 2 1st 0
## 3 3 1st 0
## 4 4 1st 0
## 5 5 1st 1
## 6 6 1st 1
## name age embarked
## 1 Allen, Miss Elisabeth Walton 29.0000 Southampton
## 2 Allison, Miss Helen Loraine 2.0000 Southampton
## 3 Allison, Mr Hudson Joshua Creighton 30.0000 Southampton
## 4 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton
## 5 Allison, Master Hudson Trevor 0.9167 Southampton
## 6 Anderson, Mr Harry 47.0000 Southampton
## home.dest room ticket boat sex
## 1 St Louis, MO B-5 24160 L221 2 female
## 2 Montreal, PQ / Chesterville, ON C26 female
## 3 Montreal, PQ / Chesterville, ON C26 (135) male
## 4 Montreal, PQ / Chesterville, ON C26 female
## 5 Montreal, PQ / Chesterville, ON C22 11 male
## 6 New York, NY E-12 3 male
We can remedy the situation by specifying that the first column corresponds to the row number, and by making sure that strings do not get imported as factors.
titanic <- read.table( file=titanic.url, sep=",", header=TRUE, stringsAsFactors=FALSE, row.names=1)
# stringsAsFactors=FALSE to keep strings as strings.
# row.names=1 tells read.table that column 1 contains the row names.
We do not need to call str() to see the type of each variable; we can use sapply() instead.
sapply(titanic, class)
## pclass survived name age embarked home.dest
## "character" "integer" "character" "numeric" "character" "character"
## room ticket boat sex
## "character" "character" "character" "character"
Now the problem is that some of the variables that should have been (ordered) factors or logicals are character strings. We will fix that issue using the transform() function.
titanic <- transform(titanic,
pclass = ordered(pclass, levels=c("3rd", "2nd", "1st")),
survived = as.logical(survived),
embarked = as.factor(embarked),
sex = as.factor(sex))
sapply(titanic, class)
## $pclass
## [1] "ordered" "factor"
##
## $survived
## [1] "logical"
##
## $name
## [1] "character"
##
## $age
## [1] "numeric"
##
## $embarked
## [1] "factor"
##
## $home.dest
## [1] "character"
##
## $room
## [1] "character"
##
## $ticket
## [1] "character"
##
## $boat
## [1] "character"
##
## $sex
## [1] "factor"
The dataset can be summarized using the summary() function.
summary(titanic)
## pclass survived name age
## 3rd:711 Mode :logical Length:1313 Min. : 0.1667
## 2nd:280 FALSE:864 Class :character 1st Qu.:21.0000
## 1st:322 TRUE :449 Mode :character Median :30.0000
## Mean :31.1942
## 3rd Qu.:41.0000
## Max. :71.0000
## NA's :680
## embarked home.dest room
## :492 Length:1313 Length:1313
## Cherbourg :203 Class :character Class :character
## Queenstown : 45 Mode :character Mode :character
## Southampton:573
##
##
##
## ticket boat sex
## Length:1313 Length:1313 female:463
## Class :character Class :character male :850
## Mode :character Mode :character
##
##
##
##
We notice, among other things, that a number of age observations are missing, and that a fair number of passengers have no recorded port of embarkation.
What can we say about this dataset as a whole?
First, build a contingency table of survival by passenger class.
(pclass.survived <- with(titanic, table(pclass, survived)))
## survived
## pclass FALSE TRUE
## 3rd 574 137
## 2nd 161 119
## 1st 129 193
It certainly seems as though the passenger class was linked to survival (at least, at first glance). A barplot can help illustrate the relationship.
barplot(pclass.survived, beside=TRUE, legend.text=levels(titanic$pclass))
Splitting along the survived (TRUE) / did not survive (FALSE) axis can muddle the situation to some extent: of course there were more 3rd class passengers who didn’t survive – there were more 3rd class passengers, period.
Compare with the following chart.
library(lattice)
barchart(pclass.survived, stack=FALSE, auto.key=TRUE)
The relative proportions of survivors per passenger class are striking.
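These proportions can also be computed directly with prop.table(); here is a quick sketch that rebuilds the contingency table from the counts shown above, so the snippet stands on its own:

```r
# Rebuild the survival-by-class table from the counts above
pclass.survived <- matrix(c(574, 161, 129, 137, 119, 193), nrow = 3,
                          dimnames = list(pclass   = c("3rd", "2nd", "1st"),
                                          survived = c("FALSE", "TRUE")))
# Row-wise proportions: survival rate within each passenger class
prop.table(pclass.survived, margin = 1)
```

Roughly 60% of 1st class passengers survived, versus under 20% of 3rd class passengers.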
Here is another way to display the same information.
barchart(t(pclass.survived), groups=FALSE, layout=c(3,1), horizontal=FALSE)
Visualization experts claim that the most efficient visualization dimensions for comparing quantities are, in order:
position
length
slope
angle
Let’s see if that seems reasonable by comparing how lengths and angles do the job.
(pass.class <- table(titanic$pclass))
##
## 3rd 2nd 1st
## 711 280 322
par(mfrow=c(1,2)) # two charts side by side
pie(pass.class) # angle
barplot(pass.class) # length
What do you think? Is the much-maligned pie chart THAT bad?
Let’s see what things look like when we combine 3 variables instead of 2.
(sv.pc.sx <- with(titanic, table(survived, pclass, sex)))
## , , sex = female
##
## pclass
## survived 3rd 2nd 1st
## FALSE 134 13 9
## TRUE 79 94 134
##
## , , sex = male
##
## pclass
## survived 3rd 2nd 1st
## FALSE 440 148 120
## TRUE 58 25 59
As a start, the table of frequencies is a tad harder to read/interpret.
But we can get a good sense of the underlying data distributions with a bar chart.
barchart(sv.pc.sx, horizontal=FALSE, stack=FALSE, layout=c(3,1), auto.key=list(columns=2))
Histograms, density plots, and dot plots can also provide an idea of the underlying distributions (implemented with lattice’s histogram(), densityplot(), and dotplot()).
Here are histograms of age by survival status:
histogram(~age | survived, data=titanic)
We might want to reverse this and get the distributions of survival statuses by age:
# start by creating age groups
titanic$age.groups <- cut(titanic$age, breaks=seq(0,80,by=20))
table(titanic$age.groups)
##
## (0,20] (20,40] (40,60] (60,80]
## 145 327 140 21
densityplot(~as.numeric(survived) | age.groups, data=titanic)
Is what’s represented in this chart clear to you? What happens to the passengers for which no age was recorded?
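Note that cut() maps missing ages to NA, so passengers with no recorded age fall outside every age group and are silently dropped from the panels; a minimal sketch:

```r
# cut() returns NA for missing values, so a passenger without a recorded
# age belongs to no age group
ages <- c(29, 2, NA, 50)
table(cut(ages, breaks = seq(0, 80, by = 20)), useNA = "ifany")
```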
We could also look at the distribution of age within each passenger class:
dotplot(~age | pclass, data=titanic, layout=c(1,3))
(This example is taken from Practical Data Science with R, by Nina Zumel and John Mount, Manning, 2014.) The custdata.tsv file (in the Data folder) is derived from Census PUMS data.
The business objective is to predict whether a customer has health insurance. This synthetic dataset contains customer information for individuals whose health insurance status is known.
We start by importing the data into a data frame using the read.delim() function (to handle the tab-separated format).
df <- read.delim("Data/custdata.tsv")
class(df)
## [1] "data.frame"
dim(df)
## [1] 1000 11
head(df)
## custid sex is.employed income marital.stat health.ins
## 1 2068 F NA 11300 Married TRUE
## 2 2073 F NA 0 Married TRUE
## 3 2848 M TRUE 4500 Never Married FALSE
## 4 5641 M TRUE 20000 Never Married FALSE
## 5 6369 F TRUE 12000 Never Married TRUE
## 6 8322 F TRUE 180000 Never Married TRUE
## housing.type recent.move num.vehicles age state.of.res
## 1 Homeowner free and clear FALSE 2 49 Michigan
## 2 Rented TRUE 3 40 Florida
## 3 Rented TRUE 3 22 Georgia
## 4 Occupied with no rent FALSE 0 22 New Mexico
## 5 Rented TRUE 1 31 Florida
## 6 Homeowner with mortgage/loan FALSE 1 40 New York
We see that we have 1000 observations of 11 variables. We obtain a data summary via summary().
summary(df)
## custid sex is.employed income
## Min. : 2068 F:440 Mode :logical Min. : -8700
## 1st Qu.: 345667 M:560 FALSE:73 1st Qu.: 14600
## Median : 693403 TRUE :599 Median : 35000
## Mean : 698500 NA's :328 Mean : 53505
## 3rd Qu.:1044606 3rd Qu.: 67000
## Max. :1414286 Max. :615000
##
## marital.stat health.ins
## Divorced/Separated:155 Mode :logical
## Married :516 FALSE:159
## Never Married :233 TRUE :841
## Widowed : 96
##
##
##
## housing.type recent.move num.vehicles
## Homeowner free and clear :157 Mode :logical Min. :0.000
## Homeowner with mortgage/loan:412 FALSE:820 1st Qu.:1.000
## Occupied with no rent : 11 TRUE :124 Median :2.000
## Rented :364 NA's :56 Mean :1.916
## NA's : 56 3rd Qu.:2.000
## Max. :6.000
## NA's :56
## age state.of.res
## Min. : 0.0 California :100
## 1st Qu.: 38.0 New York : 71
## Median : 50.0 Pennsylvania: 70
## Mean : 51.7 Texas : 56
## 3rd Qu.: 64.0 Michigan : 52
## Max. :146.7 Ohio : 51
## (Other) :600
Right off the bat, we see that there are some problems with the data (NAs, impossible ranges, etc.). We can count the number of NAs per variable using the following code block:
mv <- colSums(is.na(df))
cbind(mv) # cbind to display as column
## mv
## custid 0
## sex 0
## is.employed 328
## income 0
## marital.stat 0
## health.ins 0
## housing.type 56
## recent.move 56
## num.vehicles 56
## age 0
## state.of.res 0
The fact that three of the variables have the same number of missing values suggests that there are 56 observations with no measurement for housing.type, recent.move, and num.vehicles, but that is no guarantee. We still need to check (try it!).
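One way to perform that check: count, for each row, how many of the three variables are missing; if the only counts that occur are 0 and 3, the NAs coincide. A sketch on a toy data frame standing in for the real data (swap toy for df to run the actual check):

```r
# Toy stand-in for df: rows are either complete or missing all three fields
toy <- data.frame(housing.type = c("Rented", NA, "Rented"),
                  recent.move  = c(TRUE, NA, FALSE),
                  num.vehicles = c(2, NA, 1))
vars <- c("housing.type", "recent.move", "num.vehicles")
na.per.row <- rowSums(is.na(toy[vars]))
table(na.per.row)                        # only 0's and 3's here
all(na.per.row %in% c(0, length(vars)))  # TRUE if the NA's coincide
```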
As for the ranges, something is definitely fishy with income and age:
summary(df$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8700 14600 35000 53505 67000 615000
summary(df$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 38.0 50.0 51.7 64.0 146.7
What does it mean for incomes to be negative? For a customer to be 0 years old? Or worse, 146.7?
We use ggplot2 to visually explore the data. Note that:
the ggplot() function works only on data frames (or tibbles)
it does not create a graph; it creates an object
graphs are produced from layers, which are added to the object
aesthetics are the visual elements of the graph, e.g., the \(x\) and \(y\) variables, the size of markers, colors, etc.
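The "object first, layers after" idea can be seen in a minimal sketch (using a tiny throwaway data frame):

```r
library(ggplot2)
# ggplot() alone returns an object, not a drawing
p <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y))
inherits(p, "ggplot")  # TRUE: just an object so far
# layers are added with `+`; the chart only renders when the object is printed
p + geom_point()
```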
We will start by providing a number of univariate visualizations, beginning with the age variable.
library(ggplot2)
ggplot(df) + geom_histogram(aes(x=age), binwidth=5, fill="gray")
What happens if we use a different bin width?
ggplot(df) + geom_histogram(aes(x=age), binwidth=10, fill="gray")
ggplot(df) + geom_histogram(aes(x=age), binwidth=1, fill="gray")
We can also get some (univariate) information about the income variable:
library(scales)
ggplot(df) + geom_histogram(aes(x=income), binwidth = 10000) + scale_x_continuous(labels=dollar)
ggplot(df) + geom_histogram(aes(x=income), binwidth = 5000) + scale_x_continuous(labels=dollar)
We have already seen that there are negative income values in the dataset. Let’s restrict the data to those customers with positive values, and display using a logarithmic scale.
df.2 <- subset(df, income > 0)
ggplot(df.2) +
geom_histogram(aes(x=income), binwidth = 5000) +
scale_x_log10(breaks=10^(1:6), labels=dollar)
Whoa, that’s … entirely useless. The problem here is that, on a log10 scale, binwidth refers to powers of 10, not to the raw dollar amounts.
ggplot(df.2) +
geom_histogram(aes(x=income), binwidth = 0.05) +
scale_x_log10(breaks=10^(1:6), labels=dollar)
The density plot estimates the probability density function.
library(scales)
ggplot(df.2) + geom_density(aes(x=income)) + scale_x_continuous(labels=dollar)
The tail is a bit long/heavy: it might be more useful to use a logarithmic scale.
# try logarithmic scale
ggplot(df.2) + geom_density(aes(x=income)) +
scale_x_log10(breaks=10^(2:5), labels=dollar) +
annotation_logticks()
What can we say about the distribution of marital status?
ggplot(df) + geom_bar(aes(x=marital.stat), fill="gray")
Nothing is too surprising so far (although, as mentioned, something is definitely off with some of the age and income measurements).
If we try to get information about a variable with 10+ levels (state.of.res), we see that the charts can get busy.
ggplot(df) + geom_bar(aes(x=state.of.res), fill="gray")
# same, but with a coordinate flip
ggplot(df) + geom_bar(aes(x=state.of.res), fill="gray") +
coord_flip()
# same, but with text resizing for readability
ggplot(df) + geom_bar(aes(x=state.of.res), fill="gray") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.6)))
The flipped chart is clearly easier to read.
Currently, the chart displays the states in alphabetical order; to order them by the number of observations in each state, we first need to modify the data using transform(), which re-orders the levels of state.of.res by their counts in the dataset (presumably similar to their order by population in the US).
tbl <- as.data.frame(table(df$state.of.res))
colnames(tbl) <- c("state.of.res", "count")
tbl <- transform(tbl, state.of.res=reorder(state.of.res, count))
ggplot(tbl) +
geom_bar(aes(x=state.of.res, y=count), stat="identity") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.6)))
What is the average number of vehicles per customer in each state?
For instance, in Delaware and Alaska, it is:
with(df, mean(num.vehicles[state.of.res=="Delaware"], na.rm=TRUE))
## [1] 2
with(df, mean(num.vehicles[state.of.res=="Alaska"], na.rm=TRUE))
## [1] 2.333333
(Note the na.rm=TRUE, which avoids issues in computations involving observations with no measurement.)
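By default, mean() returns NA as soon as one value is missing; na.rm=TRUE drops the missing values first:

```r
x <- c(1, 2, NA)
mean(x)               # NA: the missing value propagates
mean(x, na.rm = TRUE) # 1.5: the NA is dropped before averaging
```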
We could repeat the process 50 times (once for each state), or we could use either a split/apply/combine approach (in base R) or a tidyverse-style approach (using plyr).
# split
pieces <- split(df, df$state.of.res)
# apply
result <- lapply(pieces, function(p) data.frame(
state.of.res=p$state.of.res[[1]],
state.avg.vehicles=mean(p$num.vehicles, na.rm=TRUE)
)
)
result <- do.call("rbind", result)
rownames(result) <- c()
result
## state.of.res state.avg.vehicles
## 1 Alabama 2.100000
## 2 Alaska 2.333333
## 3 Arizona 1.888889
## 4 Arkansas 1.833333
## 5 California 2.098901
## 6 Colorado 1.636364
## 7 Connecticut 2.000000
## 8 Delaware 2.000000
## 9 Florida 1.866667
## 10 Georgia 1.708333
## 11 Hawaii 1.750000
## 12 Idaho 1.666667
## 13 Illinois 2.183673
## 14 Indiana 2.000000
## 15 Iowa 2.000000
## 16 Kansas 1.750000
## 17 Kentucky 1.933333
## 18 Louisiana 1.533333
## 19 Maine 2.200000
## 20 Maryland 2.750000
## 21 Massachusetts 1.833333
## 22 Michigan 1.843137
## 23 Minnesota 2.350000
## 24 Mississippi 1.500000
## 25 Missouri 1.950000
## 26 Montana 2.500000
## 27 Nebraska 1.500000
## 28 Nevada 2.000000
## 29 New Hampshire 1.800000
## 30 New Jersey 1.555556
## 31 New Mexico 2.333333
## 32 New York 1.928571
## 33 North Carolina 1.666667
## 34 North Dakota 3.000000
## 35 Ohio 1.836735
## 36 Oklahoma 1.818182
## 37 Oregon 2.285714
## 38 Pennsylvania 1.938462
## 39 Rhode Island 2.000000
## 40 South Carolina 1.785714
## 41 South Dakota 1.600000
## 42 Tennessee 1.571429
## 43 Texas 1.833333
## 44 Utah 1.750000
## 45 Vermont 1.666667
## 46 Virginia 1.884615
## 47 Washington 2.235294
## 48 West Virginia 1.666667
## 49 Wisconsin 1.692308
## 50 Wyoming 2.000000
The tidyverse-like solution is much more elegant, however:
library(plyr)
result <- ddply(
df, # dataframe
"state.of.res", # split-by variable
summarize, # function to apply to each piece
# function arguments
state.avg.vehicles=mean(num.vehicles, na.rm=TRUE)
)
result
## state.of.res state.avg.vehicles
## 1 Alabama 2.100000
## 2 Alaska 2.333333
## 3 Arizona 1.888889
## 4 Arkansas 1.833333
## 5 California 2.098901
## 6 Colorado 1.636364
## 7 Connecticut 2.000000
## 8 Delaware 2.000000
## 9 Florida 1.866667
## 10 Georgia 1.708333
## 11 Hawaii 1.750000
## 12 Idaho 1.666667
## 13 Illinois 2.183673
## 14 Indiana 2.000000
## 15 Iowa 2.000000
## 16 Kansas 1.750000
## 17 Kentucky 1.933333
## 18 Louisiana 1.533333
## 19 Maine 2.200000
## 20 Maryland 2.750000
## 21 Massachusetts 1.833333
## 22 Michigan 1.843137
## 23 Minnesota 2.350000
## 24 Mississippi 1.500000
## 25 Missouri 1.950000
## 26 Montana 2.500000
## 27 Nebraska 1.500000
## 28 Nevada 2.000000
## 29 New Hampshire 1.800000
## 30 New Jersey 1.555556
## 31 New Mexico 2.333333
## 32 New York 1.928571
## 33 North Carolina 1.666667
## 34 North Dakota 3.000000
## 35 Ohio 1.836735
## 36 Oklahoma 1.818182
## 37 Oregon 2.285714
## 38 Pennsylvania 1.938462
## 39 Rhode Island 2.000000
## 40 South Carolina 1.785714
## 41 South Dakota 1.600000
## 42 Tennessee 1.571429
## 43 Texas 1.833333
## 44 Utah 1.750000
## 45 Vermont 1.666667
## 46 Virginia 1.884615
## 47 Washington 2.235294
## 48 West Virginia 1.666667
## 49 Wisconsin 1.692308
## 50 Wyoming 2.000000
library(dplyr) # to avoid compatibility issues with plyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
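With dplyr loaded, the ddply() computation above would typically be written with group_by() and summarise(); a sketch on a toy frame (replace toy with df to reproduce the per-state averages):

```r
library(dplyr)
# Toy stand-in for df with the two relevant columns
toy <- data.frame(state.of.res = c("Alaska", "Alaska", "Delaware"),
                  num.vehicles = c(2, NA, 2))
toy %>%
  group_by(state.of.res) %>%
  summarise(state.avg.vehicles = mean(num.vehicles, na.rm = TRUE))
```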
When it comes to univariate representations:
use a histogram or density plot to look for outliers or incorrect values in numerical variables
which will also give a sense of the distribution – is it symmetric, normal, lognormal, etc.
use a bar chart to compare frequencies for categorical variables
We can also look at bivariate charts, say involving age and income. Let’s start by removing the weird observations.
df.3 <- with(df, subset(df, age>0 & age < 100 & income > 0))
We can prepare a scatterplot:
ggplot(df.3, aes(x=age, y=income)) +
geom_point() + # scatterplot call
scale_y_continuous(labels=dollar)
Or colour the dots according to the health insurance status.
ggplot(df.3, aes(x=age, y=income, colour = health.ins)) + # chart elements
geom_point() + # scatterplot call
scale_y_continuous(labels=dollar)
The relationship between age and income is not linear, so adding the line of best fit might not provide much in the way of insight, but it can be done nonetheless.
ggplot(df, aes(x=age, y=income)) +
geom_point() +
geom_smooth(method="lm") +
scale_y_continuous(labels=dollar)
## `geom_smooth()` using formula 'y ~ x'
ggplot(df, aes(x=age, y=income, colour = health.ins)) +
geom_point() +
geom_smooth(method="lm") +
scale_y_continuous(labels=dollar)
## `geom_smooth()` using formula 'y ~ x'
A heat map (where a cell’s colour represents the number of observations in the cell) might be more apropos.
ggplot(df, aes(x=age, y=income)) +
geom_bin2d() +
scale_y_continuous(labels=dollar)
ggplot(df, aes(x=age, y=income)) +
geom_bin2d() +
scale_y_continuous(labels=dollar) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Is the smoothing curve a bit too much, here?
How about a hexbin heat map?
library(hexbin)
ggplot(df, aes(x=age, y=income)) + # selecting data elements
geom_hex(binwidth=c(5, 20000)) + # hexagon heat map
scale_y_continuous(labels=dollar) +
geom_smooth() # smooth loess curve of best fit
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Other plots can be produced: how about a plot of the relationship between age and health.ins?
ggplot(df.3, aes(x=age, y=health.ins)) + geom_point()
That’s not really surprising, is it? As one gets older, one is more likely to have health insurance. It doesn’t seem as though there are that many more people with insurance than without, but that’s an illusion: all the observations with the same age are represented by a single dot.
A heatmap could incorporate the number of observations into the picture.
ggplot(df.3, aes(x=age, y=health.ins)) +
geom_bin2d()
Mmhhh… that’s not nearly as insightful as I was expecting.
One of the geoms can come in handy here: geom_jitter().
ggplot(df.3, aes(x=age, y=health.ins)) +
geom_jitter(height=0.2)
Now we can clearly see that there are substantially fewer customers without health insurance.
Why stop at only 2 variables when we could add income to the picture?
ggplot(df.3, aes(x=age, y=health.ins, colour=log10(income))) +
geom_jitter(height=0.2)
Is there anything insightful in there?
We could also try to link marital status to health insurance status.
ggplot(df) + geom_bar(aes(x=marital.stat, fill=health.ins))
Stacked bar charts are the pie charts of bar charts – much better to put the bars side-by-side.
ggplot(df) + geom_bar(aes(x=marital.stat, fill=health.ins), position="dodge")
One exception could be made for proportion stacked bar charts:
ggplot(df, aes(x=marital.stat)) +
geom_bar(aes(fill=health.ins), position="fill")
But we do lose the sense of the size of each sub-category’s population. Some jitter functionality comes to the rescue once again!
last_plot() +
geom_point(aes(x=marital.stat, y=-0.05), position=position_jitter(h=0.02), size=0.75, alpha=0.75)
Might there be a link between housing type and marital status?
ggplot(df.3) +
geom_bar(aes(housing.type, fill=marital.stat), position="dodge")
ggplot(df.3) +
geom_bar(aes(housing.type, fill=marital.stat), position="dodge") +
coord_flip()
ggplot(subset(df.3, !is.na(housing.type))) +
geom_bar(aes(housing.type, fill=marital.stat), position="dodge") +
theme(axis.text.x=element_text(angle=15))
It’s easy to see how some fine-tuning can make the charts easier to read (which can only help when it comes to extracting actionable insights).
We’ll end the exploration of custdata.tsv by showing how to build a small multiple chart:
ggplot(subset(df.3, !is.na(housing.type))) +
geom_bar(aes(marital.stat)) +
facet_wrap(~housing.type, scales="free_y") + # small multiple, by housing.type
theme(axis.text.x=element_text(size=rel(0.8)))
Any link with health insurance status?
ggplot(subset(df.3, !is.na(housing.type))) +
geom_bar(aes(marital.stat, fill=health.ins), position="dodge") +
facet_wrap(~housing.type, ncol=2, labeller=label_wrap_gen(width=20, multi_line=TRUE)) +
theme(axis.text.x=element_text(size=rel(0.6))) + coord_flip()
Here, we show how one would produce a gapminder-style chart.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ tibble 3.0.1 ✓ purrr 0.3.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange() masks plyr::arrange()
## x readr::col_factor() masks scales::col_factor()
## x purrr::compact() masks plyr::compact()
## x dplyr::count() masks plyr::count()
## x purrr::discard() masks scales::discard()
## x dplyr::failwith() masks plyr::failwith()
## x dplyr::filter() masks stats::filter()
## x dplyr::id() masks plyr::id()
## x dplyr::lag() masks stats::lag()
## x dplyr::mutate() masks plyr::mutate()
## x dplyr::rename() masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(dslabs)
library(wesanderson)
library(ggrepel)
# get range of available data
summary(gapminder)
## country year infant_mortality
## Albania : 57 Min. :1960 Min. : 1.50
## Algeria : 57 1st Qu.:1974 1st Qu.: 16.00
## Angola : 57 Median :1988 Median : 41.50
## Antigua and Barbuda: 57 Mean :1988 Mean : 55.31
## Argentina : 57 3rd Qu.:2002 3rd Qu.: 85.10
## Armenia : 57 Max. :2016 Max. :276.90
## (Other) :10203 NA's :1453
## life_expectancy fertility population gdp
## Min. :13.20 Min. :0.840 Min. :3.124e+04 Min. :4.040e+07
## 1st Qu.:57.50 1st Qu.:2.200 1st Qu.:1.333e+06 1st Qu.:1.846e+09
## Median :67.54 Median :3.750 Median :5.009e+06 Median :7.794e+09
## Mean :64.81 Mean :4.084 Mean :2.701e+07 Mean :1.480e+11
## 3rd Qu.:73.00 3rd Qu.:6.000 3rd Qu.:1.523e+07 3rd Qu.:5.540e+10
## Max. :83.90 Max. :9.220 Max. :1.376e+09 Max. :1.174e+13
## NA's :187 NA's :185 NA's :2972
## continent region
## Africa :2907 Western Asia :1026
## Americas:2052 Eastern Africa : 912
## Asia :2679 Western Africa : 912
## Europe :2223 Caribbean : 741
## Oceania : 684 South America : 684
## Southern Europe: 684
## (Other) :5586
# year for plotting
yr = 2009
chart_title <- paste("Health & Wealth of Nations \nGapminder (",yr,")",sep="")
# sort the countries by inverse population
gapminder <- gapminder[with(gapminder, order(year, -population)), ]
# select which countries will have their names labelled
num.countries = 20
# tidyverse + ggplot2
filtered.gapminder = gapminder %>%
filter(year==yr) %>%
mutate(pop_m = population/1e6, gdppc = gdp/population)
weights = filtered.gapminder$pop_m/sum(filtered.gapminder$pop_m)
p = sample(nrow(filtered.gapminder), num.countries, prob = weights)
filtered.gapminder$country.display <- ifelse(seq_len(nrow(filtered.gapminder)) %in% p, as.character(filtered.gapminder$country), "")
filtered.gapminder %>%
ggplot(aes(x=gdppc,
y=life_expectancy,
size=pop_m)) +
geom_point(aes(fill=continent), pch=21) +
scale_fill_manual(values=wes_palette(n=5, name="Darjeeling1")) +
scale_x_log10() +
geom_label_repel(aes(label=country.display, size=sqrt(pop_m/pi)),
alpha=0.9,
fontface="bold",
min.segment.length = unit(0, 'lines'),
show.legend=FALSE) +
ggtitle(chart_title) +
theme(plot.title = element_text(size=14, face="bold")) +
xlab('log GDP per capita ($/year)') +
ylab('Life expectancy (years)') +
ylim(45,85) +
scale_size_continuous(range=c(1,40),
breaks = c(1,10,100,1000),
limits = c(0, 1500),
labels = c("1","10","100","1000")) +
guides(fill = guide_legend(override.aes = list(size = 5))) +
labs(fill="Continent", size="Population (M)") +
theme_bw() +
theme(plot.title = element_text(size=16, face="bold"))
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_label_repel).
Same, but with a colourblind-friendly palette.
filtered.gapminder %>%
ggplot(aes(x=gdppc,
y=life_expectancy,
size=pop_m)) +
geom_point(aes(fill=continent), pch=21) +
scale_fill_manual(values=c("#E1DAAE","#FF934F","#CC2D35","#058ED9","#2D3142","#848FA2")) +
scale_x_log10() +
geom_label_repel(aes(label=country.display, size=sqrt(pop_m/pi)),
alpha=0.9,
fontface="bold",
min.segment.length = unit(0, 'lines'),
show.legend=FALSE) +
ggtitle(chart_title) +
theme(plot.title = element_text(size=14, face="bold")) +
xlab('log GDP per capita ($/year)') +
ylab('Life expectancy (years)') +
ylim(45,85) +
scale_size_continuous(range=c(1,40),
breaks = c(1,10,100,1000),
limits = c(0, 1500),
labels = c("1","10","100","1000")) +
guides(fill = guide_legend(override.aes = list(size = 5))) +
labs(fill="Continent", size="Population (M)") +
theme_bw() +
theme(plot.title = element_text(size=16, face="bold"))
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_label_repel).
This code follows very closely M. Friendly’s tutorial recreating Minard’s famous March to Moscow chart.
# from M. Friendly, York University
# getting the data
data(Minard.troops, Minard.cities, Minard.temp, package="HistData")
# required packages
library(ggplot2)
library(scales) # additional formatting for scales
library(grid) # combining plots
library(gridExtra) # combining plots
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(dplyr) # tidy data manipulations
#Minard.troops <- Minard.troops[order(Minard.troops$direction), ]
#Minard.troops$temp <- 52 - 1:nrow(Minard.troops)
#Minard.troops <- Minard.troops[order(Minard.troops$temp), ]
# the troops/cities upper chart
breaks <- c(1, 2, 3) * 10^5
plot.troops.cities <- Minard.troops %>%
ggplot(aes(long, lat)) +
geom_path(aes(size = survivors, colour = direction, group = group),
lineend="round") +
scale_size("Survivors", range = c(0,20),
breaks=breaks, labels=scales::comma(breaks)) +
scale_colour_manual("Direction",
values = c("#E80000", "#1F1A1B"),
labels=c("Advance", "Retreat")) +
ylim(53.8,56.0) + coord_cartesian(xlim = c(24, 38)) +
labs(x = NULL, y = NULL) + theme_bw() + guides(color = FALSE, size = FALSE) +
geom_point(data = Minard.cities, size=10, pch=22, color = "black", fill="gray") +
geom_label_repel(data = Minard.cities, aes(label = city), size = 5, vjust = -1.5)
# replacing a missing value in the temperature data
Minard.temp$date = factor(Minard.temp$date, levels=c(levels(Minard.temp$date), "Unknown"))
Minard.temp$date[is.na(Minard.temp$date)] <- "Unknown"
# the temperature lower chart
plot.temp <- Minard.temp %>%
mutate(label = paste0(temp, "° ", date)) %>%
ggplot(aes(long, temp)) +
geom_path(color="grey", size=2, group=1) +
geom_point(size=1) +
geom_label_repel(aes(label=label), size=4) +
coord_cartesian(xlim = c(24, 38)) +
labs(x = NULL, y="Temperature") +
theme_bw() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
axis.text.x = element_blank(), axis.ticks = element_blank(),
panel.border = element_blank())
# combining both charts
grid.arrange(plot.troops.cities, plot.temp, nrow=2, heights=c(3.5, 1.2))
grid.rect(width = .95, height = .95, gp = gpar(lwd = 0, col = "white", fill = NA))
(Adapted from Lanzetta’s R Data Visualization Recipes)
This code re-creates John Snow’s historical map of the 1854 London cholera outbreak.
# getting the data and required packages
data(Snow.streets, Snow.deaths, Snow.pumps, package="HistData")
library(ggplot2); library(ggrepel)
# street map, death locations, water pump locations
ggplot(data = Snow.streets) +
geom_path(aes(x = x, y = y, group = street)) +
geom_point(data = Snow.deaths, aes(x = x, y=y, colour="black", shape="15")) +
geom_point(data = Snow.pumps, aes(x = x, y=y, colour="red", shape="16"), size=4) +
scale_colour_manual("Locations",
values = c("black","red"),
labels=c("deaths","water pumps")) +
scale_shape_manual("Locations",
values = c(16,15),
labels=c("deaths","water pumps")) +
geom_label_repel(data = Snow.pumps, aes(x=x, y=y, label=label), colour="black", size=5, vjust=-1.5, alpha=0.8) +
ggtitle("John Snow's London Cholera Outbreak Map (1854)") + theme_bw() +
theme(plot.title = element_text(size=16, face="bold"),
axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(),
axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks.y=element_blank())