In this notebook, we will start by showing how to produce simple graphics using the base
R installation. We will eventually introduce more sophisticated graphics via the tidyverse
and the ggplot2
library. A few examples are adapted from R. Irizarry’s dslabs
documentation and other sources.
ggplot2
and the Tidyverse (US Murders, Gapminder, 2016 US Election - Polling, Diseases, Artificial Data, New York Choral Society Singers, University Professors Salaries, MPG, World Phones)On a scatter plot, the features you study depend on the scale of interest:
swiss
datasetWe start with a built-in R dataset called swiss
.
str(swiss) # structure of the swiss dataset
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
pairs(swiss) # scatter plot matrix for the swiss dataset
Let’s focus on one specific pair: Fertility vs. Education
# raw plot
plot(swiss$Fertility, swiss$Education)
The same plot can be prettified and made more informative:
# add a title and axis labels
plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education", main="Education vs Fertility (by province), Switzerland, 1888", las=1)
# add the line of best fit (in red)
abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5)
# add the smoothing lowess curve (in blue)
lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5)
# add a legend
legend(75,50, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
Compare that graph with the one found below:
plot(swiss$Education, xlab="Province", ylab="Education", main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5)
lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5)
legend(5,52, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
So we can get an actual graph here, but … why doesn’t it actually make sense to produce that specific graph?
The take-away here is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way.
Let’s see what the unadorned scatter plots look like with ggplot2
.
## Again, with ggplot2
library(ggplot2)
qplot(Fertility, Education, data=swiss)
qplot(Fertility, Education, xlim=c(min(0,10*floor(min(swiss$Fertility)/10)),10*ceiling(max(swiss$Fertility)/10)), ylim=c(min(0,10*floor(min(swiss$Education)/10)),10*ceiling(max(swiss$Education)/10)), data=swiss)
iris
datasetLet’s do the same thing for the built-in iris
dataset.
str(iris) # structure of the dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris) # information on the distributions for each feature
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
?iris # information on the dataset itself
# scatter plot matrix on which the lower panel has been removed due to redundancy
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)
We can compare the sepal width and length variables in a manner similar to what we did with the swiss
dataset.
## Iris 1
plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length", ylab="Sepal Width", main="Sepal Width vs Sepal Length, Anderson's Iris Dataset", las=1, bg=c("yellow","black","green")[unclass(iris$Species)])
abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length?
## Iris 2
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation.
## Iris 3
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(2,7, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue"))
A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph.
Colour can also be used to highlight various data elements.
# colour each observation differently according to its species
plot(iris$Sepal.Length, iris$Sepal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)], main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=)
This can be done on all scatterplots concurrently using pairs
.
# scatterplot matrix with species membership
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5)
The redundancy in the scatter plot matrix can be used to display other data elements as well: GGally
allows for feature distributions in the diagonal entries, and correlations between pairs of variables and density plots in the redundant panels (among other things). Be weary of utilizing too many colours or features in these plots, however. They can easily become too difficult to read to provide any meaningful insights.
In histograms or frequency charts, be on the lookout for bin size effects.
swiss
dataset For instance, what does the distribution of the Education
variable in the swiss
dataset look like?
## Histogram/Bar Charts
hist(swiss$Education) # default number of bins
hist(swiss$Education, breaks=10) # with 10 bins
hist(swiss$Education, breaks=20) # with 20 bins
The distribution pattern is distincly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up maskig trends if the dataset isn’t large enough.
We can look for best fits for various parametric distributions:
#Swiss 1 - default number of bine
hist(swiss$Education, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4) # fit an exponential distribution
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4) # fit a normal distribution
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
# Swiss 2 - 10 bins
hist(swiss$Education, breaks=10, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
# Swiss 3 - 20 bins
hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education",main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3"))
With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised.
iris
dataset Can you figure out what is happening with these visualizations of the iris
dataset?
hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15))
# what happens if you replace freq=FALSE with freq=TRUE?
# Another feature
hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width",main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15))
Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information.
# create our own function (see R-blogger's scatterhist function)
scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}
ds = iris
with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal.Length", ylab="Sepal.Width"))
Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: size, colour, and shape of bubbles can also be added to represent different data elements.
For this example, we’ll look at demographic data regarding Canada’s CMA and CA (from StatsCan).
can.2011=read.csv("../Data/Canada2011.csv", head=TRUE) # import the data
head(can.2011) # take a look at the first 6 entries
str(can.2011) # take a look at the structure of the data
## 'data.frame': 147 obs. of 12 variables:
## $ Geographic.code : int 1 5 10 15 105 110 205 210 215 220 ...
## $ Geographic.name : Factor w/ 147 levels "Abbotsford - Mission",..: 117 7 42 26 20 121 46 56 133 73 ...
## $ Province : Factor w/ 12 levels "AB","BC","MB",..: 5 5 5 5 9 9 6 6 6 6 ...
## $ Region : Factor w/ 6 levels "Atlantic","British Columbia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : Factor w/ 2 levels "CA","CMA": 2 1 1 1 1 1 2 1 1 1 ...
## $ pop_2011 : int 196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...
## $ log_pop_2011 : num 5.29 4.04 4.14 4.43 4.81 ...
## $ pop_rank_2011 : int 20 147 128 94 52 120 13 97 67 78 ...
## $ priv_dwell_2011 : int 84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
## $ occ_private_dwell_2011: int 78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...
## $ occ_rate_2011 : num 0.934 0.917 0.933 0.95 0.907 ...
## $ med_total_income_2011 : int 33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ...
summary(can.2011[,3:12], maxsum=13) # provide a distribution information for features 3 to 12, allowing for up to 13 factors in the categorical distributions
## Province Region Type pop_2011
## AB:18 Atlantic :18 CA :114 Min. : 10871
## BC:25 British Columbia:25 CMA: 33 1st Qu.: 18429
## MB: 5 North : 2 Median : 40077
## NB: 7 Ontario :43 Mean : 186632
## NL: 4 Prairies :31 3rd Qu.: 98388
## NS: 5 Quebec :28 Max. :5583064
## NW: 1
## ON:43
## PE: 2
## QC:28
## SK: 8
## YT: 1
## log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011
## Min. :4.036 Min. : 1.0 Min. : 4601 Min. : 4218
## 1st Qu.:4.265 1st Qu.: 37.5 1st Qu.: 8292 1st Qu.: 7701
## Median :4.603 Median : 74.0 Median : 17428 Median : 16709
## Mean :4.720 Mean : 74.0 Mean : 78691 Mean : 74130
## 3rd Qu.:4.993 3rd Qu.:110.5 3rd Qu.: 44674 3rd Qu.: 41142
## Max. :6.747 Max. :147.0 Max. :2079459 Max. :1989705
##
##
##
##
##
##
## occ_rate_2011 med_total_income_2011
## Min. :0.6492 Min. :22980
## 1st Qu.:0.9173 1st Qu.:27630
## Median :0.9348 Median :29910
## Mean :0.9276 Mean :31311
## 3rd Qu.:0.9475 3rd Qu.:33255
## Max. :0.9723 Max. :73030
## NA's :5
##
##
##
##
##
Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions.
pairs(can.2011[,c(7,9,10,11,12)])
Meh… not that great. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass.
Can anything else be derived using bubble charts? We use median income
as the radius for the bubbles, and focus on occupancy rates and population.
radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.45, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.25, fg="white", bg="red", xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Nope. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, inches=0.15, fg="white", bg=c("red","blue","black","green","yellow","violet")[can.2011$Region], xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
What do you think?
While we’re at it, can you identify the regions without the legend?
Perhaps the CA distort the full picture (given that they are small and more numerous). Let’s analyze the world of CMAs instead.
can.2011.CMA=read.csv("../Data/Canada2011_CMA.csv", head=TRUE) # import the data
#head(can.2011.CMA)
str(can.2011.CMA) # see the dataset's structure (notice the number of observations)
## 'data.frame': 33 obs. of 15 variables:
## $ Geographic.code : int 1 205 305 310 408 421 433 442 462 505 ...
## $ Geographic.name : Factor w/ 33 levels "Abbotsford - Mission",..: 26 8 14 22 21 19 24 29 15 17 ...
## $ Province : Factor w/ 9 levels "AB","BC","MB",..: 5 6 4 4 8 8 8 8 8 7 ...
## $ Region : Factor w/ 5 levels "Atlantic","British Columbia",..: 1 1 1 1 5 5 5 5 5 3 ...
## $ Type : Factor w/ 1 level "CMA": 1 1 1 1 1 1 1 1 1 1 ...
## $ pop_2011 : int 196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...
## $ log_pop_2011 : num 5.29 5.59 5.14 5.11 5.2 ...
## $ fem_pop_2011 : int 103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...
## $ prop_fem_2011 : num 52.6 52.4 51.1 51.5 50.6 ...
## $ pop_rank_2011 : int 20 13 29 31 26 7 19 27 2 4 ...
## $ priv_dwell_2011 : int 84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
## $ occ_private_dwell_2011: int 78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...
## $ occ_rate_2011 : num 0.934 0.932 0.934 0.921 0.942 ...
## $ med_unemployment_2011 : num 6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...
## $ med_total_income_2011 : int 33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...
summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution
## Province Region Type pop_2011 log_pop_2011
## AB: 2 Atlantic : 4 CMA:33 Min. : 118975 Min. :5.075
## BC: 4 British Columbia: 4 1st Qu.: 159561 1st Qu.:5.203
## MB: 1 Ontario :15 Median : 260600 Median :5.416
## NB: 2 Prairies : 5 Mean : 700710 Mean :5.553
## NL: 1 Quebec : 5 3rd Qu.: 721053 3rd Qu.:5.858
## NS: 1 Max. :5583064 Max. :6.747
## ON:15
## QC: 5
## SK: 2
## fem_pop_2011 prop_fem_2011 pop_rank_2011 priv_dwell_2011
## Min. : 63260 Min. :50.55 Min. : 1 Min. : 53730
## 1st Qu.: 83243 1st Qu.:51.55 1st Qu.: 9 1st Qu.: 72817
## Median : 135561 Median :52.17 Median :17 Median : 110314
## Mean : 365093 Mean :52.05 Mean :17 Mean : 291519
## 3rd Qu.: 378690 3rd Qu.:52.41 3rd Qu.:25 3rd Qu.: 294150
## Max. :2947971 Max. :53.17 Max. :33 Max. :2079459
##
##
##
## occ_private_dwell_2011
## Min. : 48848
## 1st Qu.: 67767
## Median : 104237
## Mean : 275587
## 3rd Qu.: 282186
## Max. :1989705
##
##
##
We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.
radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi)
symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011, circles=radius.med.income.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)
Part of the issue is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say?
radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011, circles=radius.pop.rank.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[can.2011.CMA$Region], ylab="Median Income", xlab="Unemployment Rate")
title("Population Rank, by CMA and CA (2011)")
text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,can.2011.CMA$Geographic.code, cex=0.5)
There’s a bit more structure here, isn’t there?
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset which is used to train the learning model consists of: - chemical properties of various water samples of European rivers - the quantity of seven algae in each of the samples, and - the characteristics of the collection process for each sample.
Notes: - we have already cleaned this dataset in notebook DS 02, but it would have made more sense to visualize it beforehand… we will revisit the process in detail in notebook DS 07. For now, let’s just focus on producing some graphs.
- The dataset is stored in the CSV file algae_blooms.csv
(in the same directory as this notebook).
algae_blooms<-read.csv("../Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame’s structure by calling the str
function.
str(algae_blooms)
## 'data.frame': 340 obs. of 18 variables:
## $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
## $ size : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
## $ mxPH : num 8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
## $ mnO2 : num 9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
## $ Cl : num 60.8 57.8 40 77.4 55.4 ...
## $ NO3 : num 6.24 1.29 5.33 2.3 10.42 ...
## $ NH4 : num 578 370 346.7 98.2 233.7 ...
## $ oPO4 : num 105 428.8 125.7 61.2 58.2 ...
## $ PO4 : num 170 558.8 187.1 138.7 97.6 ...
## $ Chla : num 50 1.3 15.6 1.4 10.5 ...
## $ a1 : num 0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
## $ a2 : num 0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
## $ a3 : num 0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
## $ a4 : num 0 1.9 0 0 0 0 3.9 0 0 2.9 ...
## $ a5 : num 34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
## $ a6 : num 8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
## $ a7 : num 0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...
Evidently, algae_blooms
is a data frame with 340 observations of 18 variables each.
Basic histograms can be constructed with the hist
function.
hist(algae_blooms$mnO2)
This basic histogram can be spruced up by using Hadley Wickham’s (in)famous ggplot2
library.
library(ggplot2)
ggplot(algae_blooms,aes(x=mnO2)) + # plotting mnO2 from the algae_blooms dataset ...
geom_histogram(aes(y=..density..)) + # as a histogram, where the vertical axis is the density ...
geom_density(color="blue") + # on which will be layered a blue density curve ...
geom_rug() + # and a rug (or comb) showing where the observations actually fall...
ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ...
xlab("") + # no x axis label ...
ylab("") # and no y axis label
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing non-finite values (stat_density).
Let’s do the same for a1
.
ggplot(algae_blooms,aes(x=a1)) +
geom_histogram(aes(y=..density..)) +
geom_density(color="red") +
geom_rug() +
ggtitle("Histogram of minimum value of a1 among 340 observations") +
xlab("") +
ylab("")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\(qq-\)plots, another traditional statistical plot, can be produced with the car
libary’s qqPlot
function.
library(car)
## Loading required package: carData
qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")
## [1] 69 70
Again, we can see that the normal distribution is not a good fit for mnO2
, but that it is even worse for a1
(see below).
qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")
## [1] 49 118
Now let’s take a look at some plots involving NH4
.
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) + # plotting NH4 from the algae_blooms dataset ...
geom_boxplot() + # as a boxplot ...
geom_rug() + # with a rug on which the true values are shown ...
geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
## Warning: Use of `algae_blooms$NH4` is discouraged. Use `NH4` instead.
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
We don’t see much here because of the suspected outliers.
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1) # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3) # median value of NH4, tight dashed line We can also look at the data and see which observations have values of NH4 below 3000 (roughly all values below the long dashed line above)
Let’s see what the boxplot above looks like without the suspected outliers.
ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) +
geom_boxplot() +
geom_rug() +
geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
It’s a bit better, to be sure (the box structure has expanded). There still seems to be a bit of very high values. Perhaps that’s normal? How would we find out?
Now let’s take a look at some of the algae levels.
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_boxplot() + # in a series of boxplots ...
xlab("Season") + # with x axis as Seasons and y axis as a3
ylab("Algae Level a3")
We can re-arrange the factors’ order, but it requires a bit of fancy footwork using the forcats
’library fct_relevel
function, and dplyr
’s mutate
.
library(forcats) # for fct_relevel
library(dplyr) # for mutate
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
algae_blooms = mutate(algae_blooms,
size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
speed=fct_relevel(speed,c("low","medium","high")), # ditto
season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
)
ggplot(algae_blooms,aes(x=season,y=a3)) +
geom_boxplot() +
xlab("Season") +
ylab("Algae Level a3")
Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3
trend?
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_violin() + # in a series of violin plots ...
geom_jitter() + # with some jitter to avoid all the points being on top of one another ...
xlab("Season") +
ylab("Algae Level a3")
(What happens if you turn off the jitter option?)
Let’s take a look at both NH4
and season
.
We only keep the observations for which NH4
\(> 3000\), and we bin them with respect to the quartiles.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
##
## [5,36.8] (36.8,103] (103,226] (226,2.17e+03]
## 82 82 81 82
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
library(ggplot2)
library("dslabs")
data(package="dslabs")
ggplot2
and the Tidyverse data("murders")
?murders
head(murders)
str(murders)
## 'data.frame': 51 obs. of 5 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ abb : chr "AL" "AK" "AZ" "AR" ...
## $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
## $ population: num 4779736 710231 6392017 2915918 37253956 ...
## $ total : num 135 19 232 93 1257 ...
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ tibble 3.0.1 ✓ readr 1.3.1
## ✓ tidyr 1.0.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ stringr 1.4.0
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::recode() masks car::recode()
## x purrr::some() masks car::some()
r <- murders %>%
summarize(pop=sum(population), tot=sum(total)) %>%
mutate(rate = tot/pop*10^6) %>% .$rate
r
## [1] 30.34555
library(ggrepel)
library(ggthemes)
murders %>% ggplot(aes(x = population/10^6, y = total, label = abb)) +
geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
geom_point(aes(color=region), size = 3) +
geom_text_repel() +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
scale_color_discrete(name="Region")
data("gapminder")
?gapminder
head(gapminder)
str(gapminder)
## 'data.frame': 10545 obs. of 9 variables:
## $ country : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
## $ infant_mortality: num 115.4 148.2 208 NA 59.9 ...
## $ life_expectancy : num 62.9 47.5 36 63 65.4 ...
## $ fertility : num 6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
## $ population : num 1636054 11124892 5270844 54681 20619075 ...
## $ gdp : num NA 1.38e+10 NA NA 1.08e+11 ...
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
## $ region : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
summary(gapminder)
## country year infant_mortality
## Albania : 57 Min. :1960 Min. : 1.50
## Algeria : 57 1st Qu.:1974 1st Qu.: 16.00
## Angola : 57 Median :1988 Median : 41.50
## Antigua and Barbuda: 57 Mean :1988 Mean : 55.31
## Argentina : 57 3rd Qu.:2002 3rd Qu.: 85.10
## Armenia : 57 Max. :2016 Max. :276.90
## (Other) :10203 NA's :1453
## life_expectancy fertility population gdp
## Min. :13.20 Min. :0.840 Min. :3.124e+04 Min. :4.040e+07
## 1st Qu.:57.50 1st Qu.:2.200 1st Qu.:1.333e+06 1st Qu.:1.846e+09
## Median :67.54 Median :3.750 Median :5.009e+06 Median :7.794e+09
## Mean :64.81 Mean :4.084 Mean :2.701e+07 Mean :1.480e+11
## 3rd Qu.:73.00 3rd Qu.:6.000 3rd Qu.:1.523e+07 3rd Qu.:5.540e+10
## Max. :83.90 Max. :9.220 Max. :1.376e+09 Max. :1.174e+13
## NA's :187 NA's :185 NA's :2972
## continent region
## Africa :2907 Western Asia :1026
## Americas:2052 Eastern Africa : 912
## Asia :2679 Western Africa : 912
## Europe :2223 Caribbean : 741
## Oceania : 684 South America : 684
## Southern Europe: 684
## (Other) :5586
west <- c("Western Europe","Northern Europe","Southern Europe",
"Northern America","Australia and New Zealand")
gapminder <- gapminder %>%
mutate(group = case_when(
region %in% west ~ "The West",
region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
TRUE ~ "Others"))
gapminder <- gapminder %>%
mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
!is.na(fertility) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = population/10^6) %>%
ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
geom_point(alpha = 0.8) +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(30, 85)) +
xlab("Fertility rate (births per woman)") +
ylab("Life Expectancy") +
geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
gapminder$gdppc = gapminder$gdp/gapminder$population
gapminder2 <- gapminder[,c(1,2,4,6,8,9,11)]
head(gapminder2)
filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = (population/10^6)) %>%
ggplot( aes(x=gdppc, y=life_expectancy, col = continent, size = population_in_millions, label = country)) +
geom_point(alpha = 1) +
geom_text_repel() +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(40, 85)) +
scale_x_log10() +
xlab("GDP per capita (log scale)") +
ylab("Life Expectancy (in years)") +
ggtitle("Health and Wealth of Nations (2011)") +
#facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = (population/10^6)) %>%
ggplot( aes(x=gdppc, y=life_expectancy, col = continent, size = population_in_millions, label = country)) +
geom_point(alpha = 1) +
geom_smooth(method='lm',formula=y~x, se=FALSE) +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(40, 85)) +
scale_x_log10() +
xlab("GDP per capita (log scale)") +
ylab("Life Expectancy (in years)") +
ggtitle("Health and Wealth of Nations (2011)") +
#facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
filter(gapminder2, year%in%c(2011) & !is.na(gdppc) & !is.na(life_expectancy)) %>%
mutate(population_in_millions = (population/10^6)) %>%
ggplot( aes(x=gdppc, y=life_expectancy, size = population_in_millions, label = country)) +
geom_point(alpha = 1) +
geom_smooth(method='lm',formula=y~x, se=FALSE) +
guides(size=FALSE) +
theme(plot.title = element_blank(), legend.title = element_blank()) +
coord_cartesian(ylim = c(40, 85)) +
scale_x_log10() +
xlab("GDP per capita (log scale)") +
ylab("Life Expectancy (in years)") +
ggtitle("Health and Wealth of Nations (2011)") +
#facet_grid(. ~ year) +
theme(strip.background = element_blank(),
strip.text.x = element_blank(),
strip.text.y = element_blank(),
legend.position = "top")
data(polls_us_election_2016)
?polls_us_election_2016
head(polls_us_election_2016)
str(polls_us_election_2016)
## 'data.frame': 4208 obs. of 15 variables:
## $ state : Factor w/ 57 levels "Alabama","Alaska",..: 50 50 50 50 50 50 50 50 37 50 ...
## $ startdate : Date, format: "2016-11-03" "2016-11-01" ...
## $ enddate : Date, format: "2016-11-06" "2016-11-07" ...
## $ pollster : Factor w/ 196 levels "ABC News/Washington Post",..: 1 63 81 194 65 55 18 113 195 76 ...
## $ grade : Factor w/ 10 levels "D","C-","C","C+",..: 10 6 8 6 5 9 8 8 NA 8 ...
## $ samplesize : int 2220 26574 2195 3677 16639 1295 1426 1282 8439 1107 ...
## $ population : chr "lv" "lv" "lv" "lv" ...
## $ rawpoll_clinton : num 47 38 42 45 47 ...
## $ rawpoll_trump : num 43 35.7 39 41 43 ...
## $ rawpoll_johnson : num 4 5.46 6 5 3 3 5 6 6 7.1 ...
## $ rawpoll_mcmullin: num NA NA NA NA NA NA NA NA NA NA ...
## $ adjpoll_clinton : num 45.2 43.3 42 45.7 46.8 ...
## $ adjpoll_trump : num 41.7 41.2 38.8 40.9 42.3 ...
## $ adjpoll_johnson : num 4.63 5.18 6.84 6.07 3.73 ...
## $ adjpoll_mcmullin: num NA NA NA NA NA NA NA NA NA NA ...
summary(polls_us_election_2016)
## state startdate enddate
## U.S. :1106 Min. :2015-11-06 Min. :2015-11-08
## Florida : 148 1st Qu.:2016-08-10 1st Qu.:2016-08-21
## North Carolina: 125 Median :2016-09-23 Median :2016-09-30
## Pennsylvania : 125 Mean :2016-08-31 Mean :2016-09-06
## Ohio : 115 3rd Qu.:2016-10-20 3rd Qu.:2016-10-28
## New Hampshire : 112 Max. :2016-11-06 Max. :2016-11-07
## (Other) :2477
## pollster grade
## Ipsos : 919 A- :1085
## Google Consumer Surveys : 743 B :1011
## SurveyMonkey : 660 C- : 693
## YouGov : 130 C+ : 329
## Rasmussen Reports/Pulse Opinion Research: 125 B+ : 204
## USC Dornsife/LA Times : 121 (Other): 457
## (Other) :1510 NA's : 429
## samplesize population rawpoll_clinton rawpoll_trump
## Min. : 35.0 Length:4208 Min. :11.04 Min. : 4.00
## 1st Qu.: 447.5 Class :character 1st Qu.:38.00 1st Qu.:35.00
## Median : 772.0 Mode :character Median :43.00 Median :40.00
## Mean : 1148.2 Mean :41.99 Mean :39.83
## 3rd Qu.: 1236.5 3rd Qu.:46.20 3rd Qu.:45.00
## Max. :84292.0 Max. :88.00 Max. :68.00
## NA's :1
## rawpoll_johnson rawpoll_mcmullin adjpoll_clinton adjpoll_trump
## Min. : 0.000 Min. : 9.0 Min. :17.06 Min. : 4.373
## 1st Qu.: 5.400 1st Qu.:22.5 1st Qu.:40.21 1st Qu.:38.429
## Median : 7.000 Median :25.0 Median :44.15 Median :42.765
## Mean : 7.382 Mean :24.0 Mean :43.32 Mean :42.674
## 3rd Qu.: 9.000 3rd Qu.:27.9 3rd Qu.:46.92 3rd Qu.:46.290
## Max. :25.000 Max. :31.0 Max. :86.77 Max. :72.433
## NA's :1409 NA's :4178
## adjpoll_johnson adjpoll_mcmullin
## Min. :-3.668 Min. :11.03
## 1st Qu.: 3.145 1st Qu.:23.11
## Median : 4.384 Median :25.14
## Mean : 4.660 Mean :24.51
## 3rd Qu.: 5.756 3rd Qu.:27.98
## Max. :20.367 Max. :31.57
## NA's :1409 NA's :4178
polls_us_election_2016 %>%
filter(state == "U.S." & enddate>="2016-07-01") %>%
select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
gather(candidate, percentage, -enddate, -pollster) %>%
mutate(candidate = factor(candidate, levels = c("Trump","Clinton")))%>%
group_by(pollster) %>%
filter(n()>=10) %>%
ungroup() %>%
ggplot(aes(enddate, percentage, color = candidate)) +
geom_point(show.legend = FALSE, alpha=0.4) +
geom_smooth(method = "loess", span = 0.15) +
scale_y_continuous(limits = c(30,50))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
library(RColorBrewer)
data("us_contagious_diseases")
?us_contagious_diseases
head(us_contagious_diseases)
str(us_contagious_diseases)
## 'data.frame': 16065 obs. of 6 variables:
## $ disease : Factor w/ 7 levels "Hepatitis A",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : num 1966 1967 1968 1969 1970 ...
## $ weeks_reporting: num 50 49 52 49 51 51 45 45 45 46 ...
## $ count : num 321 291 314 380 413 378 342 467 244 286 ...
## $ population : num 3345787 3364130 3386068 3412450 3444165 ...
summary(us_contagious_diseases)
## disease state year weeks_reporting
## Hepatitis A:2346 Alabama : 315 Min. :1928 Min. : 0.00
## Measles :3825 Alaska : 315 1st Qu.:1950 1st Qu.:31.00
## Mumps :1785 Arizona : 315 Median :1975 Median :46.00
## Pertussis :2856 Arkansas : 315 Mean :1971 Mean :37.38
## Polio :2091 California: 315 3rd Qu.:1990 3rd Qu.:50.00
## Rubella :1887 Colorado : 315 Max. :2011 Max. :52.00
## Smallpox :1275 (Other) :14175
## count population
## Min. : 0 Min. : 86853
## 1st Qu.: 7 1st Qu.: 1018755
## Median : 69 Median : 2749249
## Mean : 1492 Mean : 4107584
## 3rd Qu.: 525 3rd Qu.: 4996229
## Max. :132342 Max. :37607525
## NA's :214
the_disease <- "Pertussis"
us_contagious_diseases %>%
filter(disease == the_disease) %>%
mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
mutate(state = reorder(state, rate)) %>%
ggplot(aes(year, state, fill = rate)) +
geom_tile(color = "grey50") +
scale_x_continuous(expand=c(0,0)) +
scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
geom_vline(xintercept=1963, col = "blue") +
theme_minimal() + theme(panel.grid = element_blank()) +
ggtitle(the_disease) +
ylab("") +
xlab("")
library("ggplot2")
theme_set(theme_bw()) # use the black and white theme throughout
# artificial data:
d <- data.frame(x = c(1:8, 1:8), y = runif(16),
group1 = rep(gl(2, 4, labels = c("a", "b")), 2),
group2 = gl(2, 8))
head(d)
ggplot(data = d) + geom_point(aes(x, y, colour = group1)) +
facet_grid(~group2)
library("ggplot2")
data(singer, package="lattice")
?singer
summary(singer,8)
## height voice.part
## Min. :60.0 Bass 2 :26
## 1st Qu.:65.0 Bass 1 :39
## Median :67.0 Tenor 2 :21
## Mean :67.3 Tenor 1 :21
## 3rd Qu.:70.0 Alto 2 :27
## Max. :76.0 Alto 1 :35
## Soprano 2:30
## Soprano 1:36
table(singer$height,singer$voice.part)
##
## Bass 2 Bass 1 Tenor 2 Tenor 1 Alto 2 Alto 1 Soprano 2 Soprano 1
## 60 0 0 0 0 0 1 3 1
## 61 0 0 0 0 0 4 2 2
## 62 0 0 0 0 0 3 5 6
## 63 0 0 0 0 3 4 3 3
## 64 0 0 0 2 5 3 4 1
## 65 0 0 0 1 5 4 5 15
## 66 1 2 1 3 6 7 3 6
## 67 2 0 0 2 2 4 3 1
## 68 2 6 3 3 0 2 1 1
## 69 1 3 7 1 1 1 0 0
## 70 4 8 2 2 5 1 1 0
## 71 1 6 6 2 0 0 0 0
## 72 7 6 0 2 0 1 0 0
## 73 0 3 1 1 0 0 0 0
## 74 4 1 0 1 0 0 0 0
## 75 4 4 0 0 0 0 0 0
## 76 0 0 1 1 0 0 0 0
ggplot(singer, aes(x=height)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(singer, aes(x=voice.part, y=height)) + geom_boxplot()
ggplot(data=singer, aes(x=height)) +
geom_histogram() +
facet_wrap(~voice.part, nrow=4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
data(package="car")
## no data sets found
head(Salaries)
summary(Salaries)
## rank discipline yrs.since.phd yrs.service sex
## AsstProf : 67 A:181 Min. : 1.00 Min. : 0.00 Female: 39
## AssocProf: 64 B:216 1st Qu.:12.00 1st Qu.: 7.00 Male :358
## Prof :266 Median :21.00 Median :16.00
## Mean :22.31 Mean :17.61
## 3rd Qu.:32.00 3rd Qu.:27.00
## Max. :56.00 Max. :60.00
## salary
## Min. : 57800
## 1st Qu.: 91000
## Median :107300
## Mean :113706
## 3rd Qu.:134185
## Max. :231545
ggplot(Salaries, aes(x=rank, y=salary)) +
geom_boxplot(fill="cornflowerblue",
color="black", notch=TRUE)+
geom_point(position="jitter", color="blue", alpha=.5)+
geom_rug(side="l", color="black")
## Warning: Ignoring unknown parameters: side
ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank,
shape=rank)) + geom_point() + facet_grid(.~sex)
p1 <- ggplot(data=Salaries, aes(x=rank)) + geom_bar()
p2 <- ggplot(data=Salaries, aes(x=sex)) + geom_bar()
p3 <- ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) + geom_point()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p1, p2, p3, ncol=3)
library(car)
library(ggplot2)
mytheme <- theme(plot.title=element_text(
face="bold.italic", size="14",
color="brown"), axis.title=
element_text( face="bold.italic",
size=10, color="brown"),
axis.text=element_text(
face="bold", size=9,
color="darkblue"),
panel.background=element_rect(
fill="white",color="darkblue"),
panel.grid.major.y=element_line(
color="grey", linetype=1),
panel.grid.minor.y=element_line(
color="grey", linetype=2),
panel.grid.minor.x=element_blank(),
legend.position="top")
ggplot(Salaries, aes(x=rank, y=salary, fill=sex)) +
geom_boxplot() +
labs(title="Salary by Rank and Sex", x="Rank", y="Salary") +
mytheme
head(mpg)
?mpg
ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = class))
ggplot(mpg, aes(cty, hwy)) + geom_point(colour = "red")
data("WorldPhones")
head(WorldPhones)
## N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
## 1951 45939 21574 2876 1815 1646 89 555
## 1956 60423 29990 4708 2568 2366 1411 733
## 1957 64721 32510 5230 2695 2526 1546 773
## 1958 68484 35218 6662 2845 2691 1663 836
## 1959 71799 37598 6856 3000 2868 1769 911
## 1960 76036 40341 8220 3145 3054 1905 1008
help(WorldPhones)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
WorldPhones.m = melt(WorldPhones)
head(WorldPhones.m)
colnames(WorldPhones.m) = c("Year", "Continent", "Phones")
head(WorldPhones.m)
ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_point()
ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_line()
ggplot(WorldPhones.m, aes(x=Year, y=Phones, color=Continent)) + geom_line() + scale_y_log10()
# install.packages("ggplot2")
# load package and data
options(scipen=999) # turn-off scientific notation like 1e+48
library(ggplot2)
theme_set(theme_bw()) # pre-set the bw theme.
data("midwest", package = "ggplot2")
# midwest <- read.csv("http://goo.gl/G1K41K") # bkup data source
# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) +
xlim(c(0, 0.1)) +
ylim(c(0, 500000)) +
labs(subtitle="Area Vs Population",
y="Population",
x="Area",
title="Scatterplot",
caption = "Source: midwest")
plot(gg)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 15 rows containing non-finite values (stat_smooth).
## Warning: Removed 15 rows containing missing values (geom_point).
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]
# Scatterplot
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) +
labs(subtitle="mpg: Displacement vs City Mileage",
title="Bubble chart")
g + geom_jitter(aes(col=manufacturer, size=hwy)) +
geom_smooth(aes(col=manufacturer), method="lm", se=F)
## `geom_smooth()` using formula 'y ~ x'
# Source: https://github.com/dgrtwo/gganimate
#install.packages("devtools")
# install.packages("cowplot") # a gganimate dependency
# devtools::install_github("dgrtwo/gganimate")
library(ggplot2)
library(gganimate)
## No renderer backend detected. gganimate will default to writing frames to separate files
## Consider installing:
## - the `gifski` package for gif output
## - the `av` package for video output
## and restarting the R session
library(gapminder)
##
## Attaching package: 'gapminder'
## The following object is masked _by_ '.GlobalEnv':
##
## gapminder
## The following object is masked from 'package:dslabs':
##
## gapminder
theme_set(theme_bw()) # pre-set the bw theme.
head(gapminder)
ggplot(gapminder, aes(gdppc, life_expectancy, size = population, colour = country)) +
geom_point(alpha = 0.7, show.legend = FALSE) +
#scale_colour_manual(values = country_colors) +
scale_size(range = c(2, 12)) +
scale_x_log10() +
facet_wrap(~continent) +
# Here comes the gganimate specific bits
labs(title = 'Year: {frame_time}', x = 'GDP per capita', y = 'life expectancy') +
transition_time(year) +
ease_aes('linear')
## Warning: No renderer available. Please install the gifski, av, or magick
## package to create animated output
## NULL
# anim_save(file="gapminder.gif") # saved, not plotted
library(ggplot2)
library(ggExtra)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")
# Scatterplot
theme_set(theme_bw()) # pre-set the bw theme.
mpg_select <- mpg[mpg$hwy >= 35 & mpg$cty > 27, ]
g <- ggplot(mpg, aes(cty, hwy)) +
geom_count() +
geom_smooth(method="lm", se=F)
ggMarginal(g, type = "histogram", fill="transparent")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
ggMarginal(g, type = "boxplot", fill="transparent")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# ggMarginal(g, type = "density", fill="transparent")
library(ggplot2)
theme_set(theme_bw())
# Data Prep
data("mtcars") # load data
mtcars$`car name` <- rownames(mtcars) # create new column for car names
mtcars$mpg_z <- round((mtcars$mpg - mean(mtcars$mpg))/sd(mtcars$mpg), 2) # compute normalized mpg
mtcars$mpg_type <- ifelse(mtcars$mpg_z < 0, "below", "above") # above / below avg flag
mtcars <- mtcars[order(mtcars$mpg_z), ] # sort
mtcars$`car name` <- factor(mtcars$`car name`, levels = mtcars$`car name`) # convert to factor to retain sorted order in plot.
# Diverging Barcharts
ggplot(mtcars, aes(x=`car name`, y=mpg_z, label=mpg_z)) +
geom_bar(stat='identity', aes(fill=mpg_type), width=.5) +
scale_fill_manual(name="Mileage",
labels = c("Above Average", "Below Average"),
values = c("above"="#00ba38", "below"="#f8766d")) +
labs(subtitle="Normalised mileage from 'mtcars'",
title= "Diverging Bars") +
coord_flip()
library(ggplot2)
#install.packages("quantmod")
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.
data("economics", package = "ggplot2")
# Compute % Returns
economics$returns_perc <- c(0, diff(economics$psavert)/economics$psavert[-length(economics$psavert)])
# Create break points and labels for axis ticks
brks <- economics$date[seq(1, length(economics$date), 12)]
#install.packages("lubridate")
lbls <- lubridate::year(economics$date[seq(1, length(economics$date), 12)])
# Plot
ggplot(economics[1:100, ], aes(date, returns_perc)) +
geom_area() +
scale_x_date(breaks=brks, labels=lbls) +
theme(axis.text.x = element_text(angle=90)) +
labs(title="Area Chart",
subtitle = "Perc Returns for Personal Savings",
y="% Returns for Personal savings",
caption="Source: economics")
library(ggplot2)
library(ggthemes)
options(scipen = 999) # turns of scientific notations like 1e+40
# Read data
email_campaign_funnel <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
# X Axis Breaks and Labels
brks <- seq(-15000000, 15000000, 5000000)
lbls = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "m")
# Plot
ggplot(email_campaign_funnel, aes(x = Stage, y = Users, fill = Gender)) + # Fill column
geom_bar(stat = "identity", width = .6) + # draw the bars
scale_y_continuous(breaks = brks, # Breaks
labels = lbls) + # Labels
coord_flip() + # Flip axes
labs(title="Email Campaign Funnel") +
theme_tufte() + # Tufte theme from ggfortify
theme(plot.title = element_text(hjust = .5),
axis.ticks = element_blank()) + # Centre plot title
scale_fill_brewer(palette = "Dark2") # Color palette
# http://margintale.blogspot.in/2012/04/ggplot2-time-series-heatmaps.html
library(ggplot2)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(zoo)
df <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df$date <- as.Date(df$date) # format date
df <- df[df$year >= 2012, ] # filter reqd years
# Create Month Week
df$yearmonth <- as.yearmon(df$date)
df$yearmonthf <- factor(df$yearmonth)
df <- ddply(df,.(yearmonthf), transform, monthweek=1+week-min(week)) # compute week number of month
df <- df[, c("year", "yearmonthf", "monthf", "week", "monthweek", "weekdayf", "VIX.Close")]
head(df)
#> year yearmonthf monthf week monthweek weekdayf VIX.Close
#> 1 2012 Jan 2012 Jan 1 1 Tue 22.97
#> 2 2012 Jan 2012 Jan 1 1 Wed 22.22
#> 3 2012 Jan 2012 Jan 1 1 Thu 21.48
#> 4 2012 Jan 2012 Jan 1 1 Fri 20.63
#> 5 2012 Jan 2012 Jan 2 2 Mon 21.07
#> 6 2012 Jan 2012 Jan 2 2 Tue 20.69
# Plot
ggplot(df, aes(monthweek, weekdayf, fill = VIX.Close)) +
geom_tile(colour = "white") +
facet_grid(year~monthf) +
scale_fill_gradient(low="red", high="green") +
labs(x="Week of Month",
y="",
title = "Time-Series Calendar Heatmap",
subtitle="Yahoo Closing Price",
fill="Close")
# Prepare data: group mean city mileage by manufacturer.
cty_mpg <- aggregate(mpg$cty, by=list(mpg$manufacturer), FUN=mean) # aggregate
colnames(cty_mpg) <- c("make", "mileage") # change column names
cty_mpg <- cty_mpg[order(cty_mpg$mileage), ] # sort
cty_mpg$make <- factor(cty_mpg$make, levels = cty_mpg$make) # to retain the order in plot.
head(cty_mpg, 4)
#> make mileage
#> 9 lincoln 11.33333
#> 8 land rover 11.50000
#> 3 dodge 13.13514
#> 10 mercury 13.25000
#The X variable is now a factor, let's plot.
library(ggplot2)
theme_set(theme_bw())
# Draw plot
ggplot(cty_mpg, aes(x=make, y=mileage)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
labs(title="Ordered Bar Chart",
subtitle="Make Vs Avg. Mileage",
caption="source: mpg") +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
#install.packages("ggcorrplot")
library(ggplot2)
library(ggcorrplot)
# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)
# Plot
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 3,
method="circle",
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of mtcars",
ggtheme=theme_bw)
library(devtools)
## Warning: package 'devtools' was built under R version 3.6.2
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 3.6.2
#devtools::install_github("wilkox/treemapify")
library(treemapify)
library(ggplot2)
data(G20)
head(G20)
# region country gdp_mil_usd hdi econ_classification
# Africa South Africa 384315 0.629 Developing
# North America United States 15684750 0.937 Advanced
# North America Canada 1819081 0.911 Advanced
# North America Mexico 1177116 0.775 Developing
# South America Brazil 2395968 0.730 Developing
# South America Argentina 474954 0.811 Developing
ggplot(G20, aes(area = gdp_mil_usd, fill = region, label = country)) +
geom_treemap() +
geom_treemap_text(grow = T, reflow = T, colour = "black") +
facet_wrap( ~ econ_classification) +
scale_fill_brewer(palette = "Set1") +
theme(legend.position = "bottom") +
labs(
title = "The G-20 major economies",
caption = "The area of each country is proportional to its relative GDP
within the economic group (advanced or developing)",
fill = "Region"
)
library(ggplot2)
library(ggnetwork)
library(geomnet)
## Registered S3 methods overwritten by 'geomnet':
## method from
## fortify.igraph ggnetwork
## fortify.network ggnetwork
library(network)
## network: Classes for Relational Data
## Version 1.16.0 created on 2019-11-30.
## copyright (c) 2005, Carter T. Butts, University of California-Irvine
## Mark S. Handcock, University of California -- Los Angeles
## David R. Hunter, Penn State University
## Martina Morris, University of Washington
## Skye Bender-deMoll, University of Washington
## For citation information, type citation("network").
## Type help("network-package") to get started.
##
## Attaching package: 'network'
## The following object is masked from 'package:plyr':
##
## is.discrete
# make the data available
data(madmen, package = 'geomnet')
# data step for ggnetwork
# create undirected network
mm.net <- network(madmen$edges[, 1:2], directed = FALSE)
mm.net # glance at network object
## Network attributes:
## vertices = 45
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 39
## missing edges= 0
## non-missing edges= 39
##
## Vertex attribute names:
## vertex.names
##
## No edge attributes
## Network attributes:
## vertices = 45
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 39
## missing edges= 0
## non-missing edges= 39
##
## Vertex attribute names:
## vertex.names
## No edge attributes
# create node attribute (gender)
rownames(madmen$vertices) <- madmen$vertices$label
mm.net %v% "gender" <- as.character(
madmen$vertices[ network.vertex.names(mm.net), "Gender"]
)
# gender color palette
mm.col <- c("female" = "#ff0000", "male" = "#00ff00")
set.seed(10052016)
ggplot(data = ggnetwork(mm.net, layout = "kamadakawai"),
aes(x, y, xend = xend, yend = yend)) +
geom_edges(color = "grey50") + # draw edge layer
geom_nodes(aes(colour = gender), size = 2) + # draw node layer
geom_nodetext(aes(colour = gender, label = vertex.names),
size = 3, vjust = -0.6) + # draw node label layer
scale_colour_manual(values = mm.col) +
xlim(c(-0.05, 1.05)) +
theme_blank() +
theme(legend.position = "bottom")
library(triangle)
set.seed(0)
q1_d1 <- round(rtriangle(1000, 1, 7, 5))
q1_d2 <- round(rtriangle(1000, 1, 7, 6))
q1_d3 <- round(rtriangle(1000, 1, 7, 2))
df <- data.frame(q1_d1 = factor(q1_d1), q1_d2 = factor(q1_d2), q1_d3 = factor(q1_d3))
library(dplyr)
# group by combinations and count
df_grouped <- df %>% group_by(q1_d1, q1_d2, q1_d3) %>% count()
# set an id string that denotes the value combination
df_grouped <- df_grouped %>% mutate(id = factor(paste(q1_d1, q1_d2, q1_d3, sep = '-')))
order.freq <- order(df_grouped[,4],decreasing=TRUE)
# sort by count and select top rows
df_grouped <- df_grouped[order.freq[1:25],]
library(reshape2)
library(ggplot2)
# create long format
df_pcp <- melt(df_grouped, id.vars = c('id', 'freq'))
df_pcp$value <- factor(df_pcp$value)
y_levels <- levels(factor(1:7))
ggplot(df_pcp, aes(x = variable, y = value, group = id)) + # group = id is important!
geom_path(aes(size = freq, color = id),
alpha = 0.5,
lineend = 'round', linejoin = 'round') +
scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
scale_size(breaks = NULL, range = c(1, 7))
## From Timeseries object (ts)
library(ggplot2)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 3.6.2
theme_set(theme_classic())
# Plot
autoplot(AirPassengers) +
labs(title="AirPassengers") +
theme(plot.title = element_text(hjust=0.5))
library(ggplot2)
theme_set(theme_classic())
# Allow Default X Axis Labels
ggplot(economics, aes(x=date)) +
geom_line(aes(y=unemploy)) +
labs(title="Time Series Chart",
subtitle="Number of unemployed in thousands from 'Economics-US' Dataset",
caption="Source: Economics",
y="unemploy")
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:plyr':
##
## here
## The following object is masked from 'package:base':
##
## date
theme_set(theme_bw())
economics_m <- economics[1:24, ]
# labels and breaks for X axis text
lbls <- paste0(month.abb[month(economics_m$date)], " ", lubridate::year(economics_m$date))
brks <- economics_m$date
# plot
ggplot(economics_m, aes(x=date)) +
geom_line(aes(y=pce)) +
labs(title="Monthly Time Series",
subtitle="Personal consumption expenditures, in billions of dollars",
caption="Source: Economics",
y="pce") + # title and caption
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor grid
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
economics_y <- economics[1:90, ]
# labels and breaks for X axis text
brks <- economics_y$date[seq(1, length(economics_y$date), 12)]
lbls <- lubridate::year(brks)
# plot
ggplot(economics_y, aes(x=date)) +
geom_line(aes(y=psavert)) +
labs(title="Yearly Time Series",
subtitle="Personal savings rate",
caption="Source: Economics",
y="psavert") + # title and caption
scale_x_date(labels = lbls,
breaks = brks) + # change to monthly ticks and labels
theme(axis.text.x = element_text(angle = 90, vjust=0.5), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor grid
data(economics_long, package = "ggplot2")
head(economics_long)
#> date variable value value01
#> <date> <fctr> <dbl> <dbl>
#> 1 1967-07-01 pce 507.4 0.0000000000
#> 2 1967-08-01 pce 510.5 0.0002660008
#> 3 1967-09-01 pce 516.3 0.0007636797
#> 4 1967-10-01 pce 512.9 0.0004719369
#> 5 1967-11-01 pce 518.1 0.0009181318
#> 6 1967-12-01 pce 525.8 0.0015788435
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics_long[economics_long$variable %in% c("psavert", "uempmed"), ]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]
# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)
# plot
ggplot(df, aes(x=date)) +
geom_line(aes(y=value, col=variable)) +
labs(title="Time Series of Returns Percentage",
subtitle="Drawn from Long Data format",
caption="Source: Economics",
y="Returns %",
color=NULL) + # title and caption
scale_x_date(labels = lbls, breaks = brks) + # change to monthly ticks and labels
scale_color_manual(labels = c("psavert", "uempmed"),
values = c("psavert"="#00ba38", "uempmed"="#f8766d")) + # line color
theme(axis.text.x = element_text(angle = 90, vjust=0.5, size = 8), # rotate x axis text
panel.grid.minor = element_blank()) # turn off minor grid
library(ggplot2)
library(lubridate)
theme_set(theme_bw())
df <- economics[, c("date", "psavert", "uempmed")]
df <- df[lubridate::year(df$date) %in% c(1967:1981), ]
# labels and breaks for X axis text
brks <- df$date[seq(1, length(df$date), 12)]
lbls <- lubridate::year(brks)
# plot
ggplot(df, aes(x=date)) +
geom_area(aes(y=psavert+uempmed, fill="psavert")) +
geom_area(aes(y=uempmed, fill="uempmed")) +
labs(title="Area Chart of Returns Percentage",
subtitle="From Wide Data format",
caption="Source: Economics",
y="Returns %") + # title and caption
scale_x_date(labels = lbls, breaks = brks) + # change to monthly ticks and labels
scale_fill_manual(name="",
values = c("psavert"="#00ba38", "uempmed"="#f8766d")) + # line color
theme(panel.grid.minor = element_blank()) # turn off minor grid
library(ggplot2)
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.2
## Registered S3 methods overwritten by 'forecast':
## method from
## autoplot.Arima ggfortify
## autoplot.acf ggfortify
## autoplot.ar ggfortify
## autoplot.bats ggfortify
## autoplot.decomposed.ts ggfortify
## autoplot.ets ggfortify
## autoplot.forecast ggfortify
## autoplot.stl ggfortify
## autoplot.ts ggfortify
## fitted.ar ggfortify
## fortify.ts ggfortify
## residuals.ar ggfortify
theme_set(theme_classic())
# Subset data
nottem_small <- window(nottem, start=c(1920, 1), end=c(1925, 12)) # subset a smaller timewindow
# Plot
ggseasonplot(AirPassengers) + labs(title="Seasonal plot: International Airline Passengers")
ggseasonplot(nottem_small) + labs(title="Seasonal plot: Air temperatures at Nottingham Castle")
# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
## Registered S3 methods overwritten by 'ggalt':
## method from
## fortify.table ggfortify
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
library(ggfortify)
theme_set(theme_classic())
# Compute data with principal components
df <- iris[c(1, 2, 3, 4)]
pca_mod <- prcomp(df) # compute principal components
# Data frame of principal components -
df_pc <- data.frame(pca_mod$x, Species=iris$Species) # dataframe of principal components
df_pc_vir <- df_pc[df_pc$Species == "virginica", ] # df for 'virginica'
df_pc_set <- df_pc[df_pc$Species == "setosa", ] # df for 'setosa'
df_pc_ver <- df_pc[df_pc$Species == "versicolor", ] # df for 'versicolor'
# Plot -
ggplot(df_pc, aes(PC1, PC2, col=Species)) +
geom_point(aes(shape=Species), size=2) + # draw points
labs(title="Iris Clustering",
subtitle="With principal components PC1 and PC2 as X and Y axis",
caption="Source: Iris") +
coord_cartesian(xlim = 1.2 * c(min(df_pc$PC1), max(df_pc$PC1)),
ylim = 1.2 * c(min(df_pc$PC2), max(df_pc$PC2))) + # change axis limits
geom_encircle(data = df_pc_vir, aes(x=PC1, y=PC2)) + # draw circles
geom_encircle(data = df_pc_set, aes(x=PC1, y=PC2)) +
geom_encircle(data = df_pc_ver, aes(x=PC1, y=PC2))
# devtools::install_github("hrbrmstr/ggalt")
library(ggplot2)
library(ggalt)
theme_set(theme_classic())
health <- read.csv("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
# for right ordering of the dumbells
health$Area <- factor(health$Area, levels=as.character(health$Area))
# health$Area <- factor(health$Area)
gg <- ggplot(health, aes(x=pct_2013, xend=pct_2014, y=Area, group=Area)) +
geom_dumbbell(color="#a3c4dc",
size=0.75,
point.colour.l="#0e668b") +
scale_x_continuous(label=waiver()) +
labs(x=NULL,
y=NULL,
title="Dumbbell Chart",
subtitle="Pct Change: 2013 vs 2014",
caption="Source: https://github.com/hrbrmstr/ggalt") +
theme(plot.title = element_text(hjust=0.5, face="bold"),
plot.background=element_rect(fill="#f7f7f7"),
panel.background=element_rect(fill="#f7f7f7"),
panel.grid.minor=element_blank(),
panel.grid.major.y=element_blank(),
panel.grid.major.x=element_line(),
axis.ticks=element_blank(),
legend.position="top",
panel.border=element_blank())
## Warning: Ignoring unknown parameters: point.colour.l
plot(gg)
library(dplyr)
theme_set(theme_classic())
source_df <- read.csv("https://raw.githubusercontent.com/jkeirstead/r-slopegraph/master/cancer_survival_rates.csv")
# Define functions. Source: https://github.com/jkeirstead/r-slopegraph
tufte_sort <- function(df, x="year", y="value", group="group", method="tufte", min.space=0.05) {
## First rename the columns for consistency
ids <- match(c(x, y, group), names(df))
df <- df[,ids]
names(df) <- c("x", "y", "group")
## Expand grid to ensure every combination has a defined value
tmp <- expand.grid(x=unique(df$x), group=unique(df$group))
tmp <- merge(df, tmp, all.y=TRUE)
df <- mutate(tmp, y=ifelse(is.na(y), 0, y))
## Cast into a matrix shape and arrange by first column
require(reshape2)
tmp <- dcast(df, group ~ x, value.var="y")
ord <- order(tmp[,2])
tmp <- tmp[ord,]
min.space <- min.space*diff(range(tmp[,-1]))
yshift <- numeric(nrow(tmp))
## Start at "bottom" row
## Repeat for rest of the rows until you hit the top
for (i in 2:nrow(tmp)) {
## Shift subsequent row up by equal space so gap between
## two entries is >= minimum
mat <- as.matrix(tmp[(i-1):i, -1])
d.min <- min(diff(mat))
yshift[i] <- ifelse(d.min < min.space, min.space - d.min, 0)
}
tmp <- cbind(tmp, yshift=cumsum(yshift))
scale <- 1
tmp <- melt(tmp, id=c("group", "yshift"), variable.name="x", value.name="y")
## Store these gaps in a separate variable so that they can be scaled ypos = a*yshift + y
tmp <- transform(tmp, ypos=y + scale*yshift)
return(tmp)
}
plot_slopegraph <- function(df) {
ylabs <- subset(df, x==head(x,1))$group
yvals <- subset(df, x==head(x,1))$ypos
fontSize <- 3
gg <- ggplot(df,aes(x=x,y=ypos)) +
geom_line(aes(group=group),colour="grey80") +
geom_point(colour="white",size=8) +
geom_text(aes(label=y), size=fontSize, family="American Typewriter") +
scale_y_continuous(name="", breaks=yvals, labels=ylabs)
return(gg)
}
## Prepare data
df <- tufte_sort(source_df,
x="year",
y="value",
group="group",
method="tufte",
min.space=0.05)
df <- transform(df,
x=factor(x, levels=c(5,10,15,20),
labels=c("5 years","10 years","15 years","20 years")),
y=round(y))
## Plot
plot_slopegraph(df) + labs(title="Estimates of % survival rates") +
theme(axis.title=element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust=0.5,
family = "American Typewriter",
face="bold"),
axis.text = element_text(family = "American Typewriter",
face="bold"))
#install.packages("ggdendro")
library("ggplot2")
library("ggdendro")
theme_set(theme_bw())
hc <- hclust(dist(USArrests), "ave") # hierarchical clustering
# plot
ggdendrogram(hc, rotate = TRUE, size = 2)
library(ggplot2)
theme_set(theme_classic())
# Plot
g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
labs(title="Density Plot",
subtitle="City Mileage Grouped by Number of cylinders",
caption="Source: mpg",
x="City Mileage",
fill="# Cylinders")
library(ggplot2)
theme_set(theme_classic())
# Plot
g <- ggplot(mpg, aes(class, cty))
g + geom_boxplot(varwidth=T, fill="plum") +
labs(title="Boxplot",
subtitle="City Mileage grouped by Class of vehicle",
caption="Source: mpg",
x="Class of Vehicle",
y="City Mileage")
library(ggplot2)
theme_set(theme_bw())
# plot
g <- ggplot(mpg, aes(manufacturer, cty))
g + geom_boxplot() +
geom_dotplot(binaxis='y',
stackdir='center',
dotsize = .5,
fill="red") +
theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
labs(title="Boxplot + Dotplot",
subtitle="City Mileage vs Class: Each dot represents 1 row in source data",
caption="Source: mpg",
x="Class of Vehicle",
y="City Mileage")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
library(ggplot2)
var <- mpg$class # the categorical data
## Prep data (nothing to change here)
nrows <- 10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(var) * ((nrows*nrows)/(length(var))))
categ_table
## var
## 2seater compact midsize minivan pickup subcompact
## 2 20 18 5 14 15
## suv
## 26
#> 2seater compact midsize minivan pickup subcompact suv
#> 2 20 18 5 14 15 26
df$category <- factor(rep(names(categ_table), categ_table))
# NOTE: if sum(categ_table) is not 100 (i.e. nrows^2), it will need adjustment to make the sum to 100.
## Plot
ggplot(df, aes(x = x, y = y, fill = category)) +
geom_tile(color = "black", size = 0.5) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
scale_fill_brewer(palette = "Set3") +
labs(title="Waffle Chart", subtitle="'Class' of vehicles",
caption="Source: mpg") +
theme(panel.border = element_rect(size = 2),
plot.title = element_text(size = rel(1.2)),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.title = element_blank(),
legend.position = "right")