In this notebook, we show how to produce simple graphics using the base
R installation. We will eventually introduce more sophisticated graphics via thelattice
,tidyverse
and the ggplot2
library in other notebooks.
On a scatter plot, the features you study depend on the scale of interest:
swiss
datasetWe start with a built-in R dataset called swiss
.
str(swiss) # structure of the swiss dataset
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
pairs(swiss) # scatter plot matrix for the swiss dataset
Let’s focus on one specific pair: Fertility vs. Education
# raw plot
plot(swiss$Fertility, swiss$Education)
The same plot can be prettified and made more informative:
# add a title and axis labels
plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education", main="Education vs Fertility (by province), Switzerland, 1888", las=1)
# add the line of best fit (in red)
abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5)
# add the smoothing lowess curve (in blue)
lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5)
# add a legend
legend(75,50, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
Compare that graph with the one found below:
plot(swiss$Education, xlab="Province", ylab="Education", main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5)
lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5)
legend(5,52, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
So we can get an actual graph here, but … why doesn’t it actually make sense to produce that specific graph?
The take-away here is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way.
Let’s see what the unadorned scatter plots look like with ggplot2
.
## Again, with ggplot2
library(ggplot2)
qplot(Fertility, Education, data=swiss)
qplot(Fertility, Education, xlim=c(min(0,10*floor(min(swiss$Fertility)/10)),10*ceiling(max(swiss$Fertility)/10)), ylim=c(min(0,10*floor(min(swiss$Education)/10)),10*ceiling(max(swiss$Education)/10)), data=swiss)
iris
datasetLet’s do the same thing for the built-in iris
dataset.
str(iris) # structure of the dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris) # information on the distributions for each feature
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
?iris # information on the dataset itself
# scatter plot matrix on which the lower panel has been removed due to redundancy
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)
We can compare the sepal width and length variables in a manner similar to what we did with the swiss
dataset.
## Iris 1
plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length", ylab="Sepal Width", main="Sepal Width vs Sepal Length, Anderson's Iris Dataset", las=1, bg=c("yellow","black","green")[unclass(iris$Species)])
abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length?
## Iris 2
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation.
## Iris 3
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(2,7, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph.
Colour can also be used to highlight various data elements.
# colour each observation differently according to its species
plot(iris$Sepal.Length, iris$Sepal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=)
This can be done on all scatterplots concurrently using pairs
.
# scatterplot matrix with species membership
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)
The redundancy in the scatter plot matrix can be used to display other data elements as well: GGally
allows for feature distributions in the diagonal entries, and correlations between pairs of variables and density plots in the redundant panels (among other things). Be weary of utilizing too many colours or features in these plots, however. They can easily become too difficult to read to provide any meaningful insights.
In histograms or frequency charts, be on the lookout for bin size effects.
swiss
dataset For instance, what does the distribution of the Education
variable in the swiss
dataset look like?
## Histogram/Bar Charts
hist(swiss$Education) # default number of bins
hist(swiss$Education, breaks=10) # with 10 bins
hist(swiss$Education, breaks=20) # with 20 bins
The distribution pattern is distincly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up masking trends if the dataset isn’t large enough.
We can look for best fits for various parametric distributions:
#Swiss 1 - default number of bine
hist(swiss$Education, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4) # fit an exponential distribution
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4) # fit a normal distribution
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
# Swiss 2 - 10 bins
hist(swiss$Education, breaks=10, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
# Swiss 3 - 20 bins
hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised.
iris
dataset Can you figure out what is happening with these visualizations of the iris
dataset?
hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15))
# what happens if you replace freq=FALSE with freq=TRUE?
# Another feature
hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width",main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15))
Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information.
# create our own function (see R-blogger's scatterhist function)
scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}
ds = iris
with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal.Length", ylab="Sepal.Width"))
Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: size, colour, and shape of bubbles can also be added to represent different data elements.
For this example, we’ll look at demographic data regarding Canada’s CMA and CA (from StatsCan).
can.2011=read.csv("Data/Canada2011.csv", head=TRUE) # import the data
head(can.2011) # take a look at the first 6 entries
str(can.2011) # take a look at the structure of the data
## 'data.frame': 147 obs. of 12 variables:
## $ Geographic.code : int 1 5 10 15 105 110 205 210 215 220 ...
## $ Geographic.name : Factor w/ 147 levels "Abbotsford - Mission",..: 117 7 42 26 20 121 46 56 133 73 ...
## $ Province : Factor w/ 12 levels "AB","BC","MB",..: 5 5 5 5 9 9 6 6 6 6 ...
## $ Region : Factor w/ 6 levels "Atlantic","British Columbia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 2 levels "CA","CMA": 2 1 1 1 1 1 2 1 1 1 ...
## $ pop_2011 : int 196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...
## $ log_pop_2011 : num 5.29 4.04 4.14 4.43 4.81 ...
## $ pop_rank_2011 : int 20 147 128 94 52 120 13 97 67 78 ...
## $ priv_dwell_2011 : int 84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
## $ occ_private_dwell_2011: int 78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...
## $ occ_rate_2011 : num 0.934 0.917 0.933 0.95 0.907 ...
## $ med_total_income_2011 : int 33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ...
summary(can.2011[,3:12], maxsum=13) # provide a distribution information for features 3 to 12, allowing for up to 13 factors in the categorical distributions
## Province Region Type pop_2011
## AB:18 Atlantic :18 CA :114 Min. : 10871
## BC:25 British Columbia:25 CMA: 33 1st Qu.: 18429
## MB: 5 North : 2 Median : 40077
## NB: 7 Ontario :43 Mean : 186632
## NL: 4 Prairies :31 3rd Qu.: 98388
## NS: 5 Quebec :28 Max. :5583064
## NW: 1
## ON:43
## PE: 2
## QC:28
## SK: 8
## YT: 1
## log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011
## Min. :4.036 Min. : 1.0 Min. : 4601 Min. : 4218
## 1st Qu.:4.265 1st Qu.: 37.5 1st Qu.: 8292 1st Qu.: 7701
## Median :4.603 Median : 74.0 Median : 17428 Median : 16709
## Mean :4.720 Mean : 74.0 Mean : 78691 Mean : 74130
## 3rd Qu.:4.993 3rd Qu.:110.5 3rd Qu.: 44674 3rd Qu.: 41142
## Max. :6.747 Max. :147.0 Max. :2079459 Max. :1989705
##
##
##
##
##
##
## occ_rate_2011 med_total_income_2011
## Min. :0.6492 Min. :22980
## 1st Qu.:0.9173 1st Qu.:27630
## Median :0.9348 Median :29910
## Mean :0.9276 Mean :31311
## 3rd Qu.:0.9475 3rd Qu.:33255
## Max. :0.9723 Max. :73030
## NA's :5
##
##
##
##
##
Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions.
pairs(can.2011[,c(7,9,10,11,12)])
Meh… not that great. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass.
Can anything else be derived using bubble charts? We use median income
as the radius for the bubbles, and focus on occupancy rates and population.
radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.45, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.25, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Nope. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.15, fg="white", bg=c("red","blue","black","green","yellow","violet")[can.2011$Region], xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
What do you think?
While we’re at it, can you identify the regions without the legend?
Perhaps the CA distort the full picture (given that they are small and more numerous). Let’s analyze the world of CMAs instead.
can.2011.CMA=read.csv("Data/Canada2011_CMA.csv", head=TRUE) # import the data
#head(can.2011.CMA)
str(can.2011.CMA) # see the dataset's structure (notice the number of observations)
## 'data.frame': 33 obs. of 15 variables:
## $ Geographic.code : int 1 205 305 310 408 421 433 442 462 505 ...
## $ Geographic.name : Factor w/ 33 levels "Abbotsford - Mission",..: 26 8 14 22 21 19 24 29 15 17 ...
## $ Province : Factor w/ 9 levels "AB","BC","MB",..: 5 6 4 4 8 8 8 8 8 7 ...
## $ Region : Factor w/ 5 levels "Atlantic","British Columbia",..: 1 1 1 1 5 5 5 5 5 3 ...
## $ Type : Factor w/ 1 level "CMA": 1 1 1 1 1 1 1 1 1 1 ...
## $ pop_2011 : int 196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...
## $ log_pop_2011 : num 5.29 5.59 5.14 5.11 5.2 ...
## $ fem_pop_2011 : int 103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...
## $ prop_fem_2011 : num 52.6 52.4 51.1 51.5 50.6 ...
## $ pop_rank_2011 : int 20 13 29 31 26 7 19 27 2 4 ...
## $ priv_dwell_2011 : int 84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
## $ occ_private_dwell_2011: int 78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...
## $ occ_rate_2011 : num 0.934 0.932 0.934 0.921 0.942 ...
## $ med_unemployment_2011 : num 6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...
## $ med_total_income_2011 : int 33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...
summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution
## Province Region Type pop_2011 log_pop_2011
## AB: 2 Atlantic : 4 CMA:33 Min. : 118975 Min. :5.075
## BC: 4 British Columbia: 4 1st Qu.: 159561 1st Qu.:5.203
## MB: 1 Ontario :15 Median : 260600 Median :5.416
## NB: 2 Prairies : 5 Mean : 700710 Mean :5.553
## NL: 1 Quebec : 5 3rd Qu.: 721053 3rd Qu.:5.858
## NS: 1 Max. :5583064 Max. :6.747
## ON:15
## QC: 5
## SK: 2
## fem_pop_2011 prop_fem_2011 pop_rank_2011 priv_dwell_2011
## Min. : 63260 Min. :50.55 Min. : 1 Min. : 53730
## 1st Qu.: 83243 1st Qu.:51.55 1st Qu.: 9 1st Qu.: 72817
## Median : 135561 Median :52.17 Median :17 Median : 110314
## Mean : 365093 Mean :52.05 Mean :17 Mean : 291519
## 3rd Qu.: 378690 3rd Qu.:52.41 3rd Qu.:25 3rd Qu.: 294150
## Max. :2947971 Max. :53.17 Max. :33 Max. :2079459
##
##
##
## occ_private_dwell_2011
## Min. : 48848
## 1st Qu.: 67767
## Median : 104237
## Mean : 275587
## 3rd Qu.: 282186
## Max. :1989705
##
##
##
We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.
radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi)
symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011, circles=radius.med.income.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)
Part of the issue is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say?
radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011, circles=radius.pop.rank.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Median Income", xlab="Unemployment Rate")
title("Population Rank, by CMA and CA (2011)")
text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)
There’s a bit more structure here, isn’t there?
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset which is used to train the learning model consists of: - chemical properties of various water samples of European rivers - the quantity of seven algae in each of the samples, and - the characteristics of the collection process for each sample.
Notes: - we have already cleaned this dataset in notebook DS 02, but it would have made more sense to visualize it beforehand… we will revisit the process in detail in notebook DS 07. For now, let’s just focus on producing some graphs.
- The dataset is stored in the CSV file algae_blooms.csv
(in the same directory as this notebook).
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame’s structure by calling the str
function.
str(algae_blooms)
## 'data.frame': 340 obs. of 18 variables:
## $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
## $ size : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
## $ mxPH : num 8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
## $ mnO2 : num 9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
## $ Cl : num 60.8 57.8 40 77.4 55.4 ...
## $ NO3 : num 6.24 1.29 5.33 2.3 10.42 ...
## $ NH4 : num 578 370 346.7 98.2 233.7 ...
## $ oPO4 : num 105 428.8 125.7 61.2 58.2 ...
## $ PO4 : num 170 558.8 187.1 138.7 97.6 ...
## $ Chla : num 50 1.3 15.6 1.4 10.5 ...
## $ a1 : num 0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
## $ a2 : num 0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
## $ a3 : num 0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
## $ a4 : num 0 1.9 0 0 0 0 3.9 0 0 2.9 ...
## $ a5 : num 34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
## $ a6 : num 8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
## $ a7 : num 0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...
Evidently, algae_blooms
is a data frame with 340 observations of 18 variables each.
Basic histograms can be constructed with the hist
function.
hist(algae_blooms$mnO2)
This basic histogram can be spruced up by using Hadley Wickham’s (in)famous ggplot2
library.
library(ggplot2)
ggplot(algae_blooms,aes(x=mnO2)) + # plotting mnO2 from the algae_blooms dataset ...
geom_histogram(aes(y=..density..)) + # as a histogram, where the vertical axis is the density ...
geom_density(color="blue") + # on which will be layered a blue density curve ...
geom_rug() + # and a rug (or comb) showing where the observations actually fall...
ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ...
xlab("") + # no x axis label ...
ylab("") # and no y axis label
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing non-finite values (stat_density).
Let’s do the same for a1
.
ggplot(algae_blooms,aes(x=a1)) +
geom_histogram(aes(y=..density..)) +
geom_density(color="red") +
geom_rug() +
ggtitle("Histogram of minimum value of a1 among 340 observations") +
xlab("") +
ylab("")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\(qq-\)plots, another traditional statistical plot, can be produced with the car
libary’s qqPlot
function.
library(car)
## Loading required package: carData
qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")
## [1] 69 70
Again, we can see that the normal distribution is not a good fit for mnO2
, but that it is even worse for a1
(see below).
qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")
## [1] 49 118
Now let’s take a look at some plots involving NH4
.
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) + # plotting NH4 from the algae_blooms dataset ...
geom_boxplot() + # as a boxplot ...
geom_rug() + # with a rug on which the true values are shown ...
geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
## Warning: Use of `algae_blooms$NH4` is discouraged. Use `NH4` instead.
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
We don’t see much here because of the suspected outliers.
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1) # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3) # median value of NH4, tight dashed line We can also look at the data and see which observations have values of NH4 below 3000 (roughly all values below the long dashed line above)
Let’s see what the boxplot above looks like without the suspected outliers.
ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) +
geom_boxplot() +
geom_rug() +
geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
It’s a bit better, to be sure (the box structure has expanded). There still seems to be a bit of very high values. Perhaps that’s normal? How would we find out?
Now let’s take a look at some of the algae levels.
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_boxplot() + # in a series of boxplots ...
xlab("Season") + # with x axis as Seasons and y axis as a3
ylab("Algae Level a3")
We can re-arrange the factors’ order, but it requires a bit of fancy footwork using the forcats
’library fct_relevel
function, and dplyr
’s mutate
.
library(forcats) # for fct_relevel
library(dplyr) # for mutate
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
algae_blooms = mutate(algae_blooms,
size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
speed=fct_relevel(speed,c("low","medium","high")), # ditto
season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
)
ggplot(algae_blooms,aes(x=season,y=a3)) +
geom_boxplot() +
xlab("Season") +
ylab("Algae Level a3")
Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3
trend?
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_violin() + # in a series of violin plots ...
geom_jitter() + # with some jitter to avoid all the points being on top of one another ...
xlab("Season") +
ylab("Algae Level a3")
(What happens if you turn off the jitter option?)
Let’s take a look at both NH4
and season
.
We only keep the observations for which NH4
\(> 3000\), and we bin them with respect to the quartiles.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
##
## [5,36.8] (36.8,103] (103,226] (226,2.17e+03]
## 82 82 81 82
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")