In this notebook, we will show how to solve a variety of statistical analysis problems using R. The selection of problems is not intended to be complete, but it provides a glimpse of the myriad ways to approach data analysis with R.
Do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.).
In this section, we illustrate how to create SRS (simple random sampling) and StS (stratified sampling) estimates from scratch.
You are charged with estimating the yearly salary of data scientists in Canada. Identify potential:
populations (target, study, respondent, sampling frames)
samples (intended, achieved)
unit information (unit, response variate, population attribute)
sources of bias (coverage, nonresponse, sampling, measurement) and variability (sampling, measurement).
Solution: here are some possible answers.
Target Population: all data scientists in Canada (is that a well-defined population?)
Study Population: this one poses a big challenge as there is no professional association of data scientists in Canada which is likely to contain even a fair number of data scientists (at least, none that we are aware of or members of). Perhaps we could use membership lists of the Statistical Society of Canada (SSC)? Or scrape social platforms like LinkedIn for Canadian data scientists?
Sampling Frame: SSC membership directory, or the scraped LinkedIn data.
Intended Sample: a sample of SSC members or LinkedIn account holders who have been identified as data scientists.
Respondent Population: those SSC members or LinkedIn-identified data scientists who would respond if selected to participate.
Unit: a data scientist.
Response Variate: the salary of a data scientist.
Population Attribute: the average salary of all data scientists in the population.
Coverage Bias: in terms of salary, SSC members or LinkedIn-identified data scientists might not be representative of all data scientists in Canada.
Nonresponse Bias: in terms of salary, SSC members or LinkedIn-identified data scientists who would respond if selected might not be representative of all data scientists in Canada.
Sampling Bias: for those SSC members or LinkedIn-identified data scientists who would respond if selected, it could be that those that are sampled are not representative (in terms of salary) of all who would have responded if selected.
Measurement Bias: for some or all of the SSC members or LinkedIn-identified data scientists who participated in the study, the actual salaries might not have been assessed correctly.
Sampling Variability: different samples of SSC members or LinkedIn-identified data scientists who would respond if selected might produce different results in terms of salary.
Measurement Variability: for those SSC members or LinkedIn-identified data scientists who participated in the survey, repeatedly asking an individual about their salary could produce a number of different responses.
The file cities.csv contains population information for a country’s cities. A city is classified as “small” if its population is below 75K, as “medium” if it falls between 75K and 1M, and as “large” otherwise.
Solutions: the dataset can be found in the Data folder. It can be loaded using the regular functionality, and investigated at a high level using the str() and summary() calls. A simple visualization is produced using ggplot2’s qplot().
cities <- read.csv('Data/cities.csv',sep=",")
str(cities)
## 'data.frame': 37 obs. of 3 variables:
## $ category : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population_1999: int 4590 5210 6030 6580 7300 7940 7960 8430 8750 9080 ...
## $ population_2019: int 5090 7090 7710 8120 7320 6780 10050 11060 12140 12480 ...
summary(cities)
## category population_1999 population_2019
## large : 2 Min. : 4590 Min. : 5090
## medium: 6 1st Qu.: 9080 1st Qu.: 12480
## small :29 Median : 29500 Median : 31160
## Mean : 201091 Mean : 249796
## 3rd Qu.: 43360 3rd Qu.: 52250
## Max. :3009490 Max. :3742940
library(ggplot2)
qplot(population_1999,population_2019,data=cities)
Evidently, the dataset consists of the populations of 37 cities, in 1999 and in 2019; 2 of the cities are classified as large, 6 as medium, and 29 as small. Does this make sense to you? Are the classifications constant over the 20-year interval?
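Since the classification rule is given explicitly, we can probe that question directly. Here is a minimal sketch, assuming the rule is meant to apply to the 2019 populations; off-diagonal counts in the cross-tabulation would flag cities whose recorded category no longer matches their 2019 population.
# re-derive the classification from the 2019 populations with cut()
category_2019 <- cut(cities$population_2019,
                     breaks=c(0, 75000, 1000000, Inf),
                     labels=c("small", "medium", "large"))
table(category_2019, cities$category)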
Summary statistics can also be displayed using the describe() function (loaded with the psych package).
library(psych)
##
## Attaching package: 'psych'
## The following object is masked _by_ '.GlobalEnv':
##
## cities
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(cities[,2:3])
## vars n mean sd median trimmed mad min
## population_1999 1 37 201090.8 666239.4 29500 41693.23 30274.69 4590
## population_2019 2 37 249796.5 838340.9 31160 49518.39 28199.05 5090
## max range skew kurtosis se
## population_1999 3009490 3004900 3.74 12.49 109529.1
## population_2019 3742940 3737850 3.75 12.48 137822.4
The statistics by groups can be obtained by first separating the data into 3 subsets (that’s not the only way to do this, of course; a one-call alternative is sketched after the three subsets below, and we’ll re-visit this at a later stage in this module):
cities_large <- subset(cities, category=="large") # note the "==", instead of simply "="
summary(cities_large)
## category population_1999 population_2019
## large :2 Min. :2869350 Min. :3652020
## medium:0 1st Qu.:2904385 1st Qu.:3674750
## small :0 Median :2939420 Median :3697480
## Mean :2939420 Mean :3697480
## 3rd Qu.:2974455 3rd Qu.:3720210
## Max. :3009490 Max. :3742940
describe(cities_large[,2:3])
## vars n mean sd median trimmed mad min
## population_1999 1 2 2939420 99093.94 2939420 2939420 103885.8 2869350
## population_2019 2 2 3697480 64290.15 3697480 3697480 67399.0 3652020
## max range skew kurtosis se
## population_1999 3009490 140140 0 -2.75 70070
## population_2019 3742940 90920 0 -2.75 45460
cities_medium <- subset(cities, category=="medium")
summary(cities_medium)
## category population_1999 population_2019
## large :0 Min. : 80720 Min. : 81750
## medium:6 1st Qu.: 93658 1st Qu.:107170
## small :0 Median :148265 Median :176750
## Mean :152875 Mean :179953
## 3rd Qu.:195575 3rd Qu.:244448
## Max. :253200 Max. :293480
describe(cities_medium[,2:3])
## vars n mean sd median trimmed mad min
## population_1999 1 6 152875.0 70933.87 148265 152875.0 78459.19 80720
## population_2019 2 6 179953.3 90642.84 176750 179953.3 102751.59 81750
## max range skew kurtosis se
## population_1999 253200 172480 0.19 -1.97 28958.63
## population_2019 293480 211730 0.07 -2.14 37004.78
cities_small <- subset(cities, category=="small")
summary(cities_small)
## category population_1999 population_2019
## large : 0 Min. : 4590 Min. : 5090
## medium: 0 1st Qu.: 8430 1st Qu.:11060
## small :29 Median :20520 Median :25000
## Mean :22216 Mean :26476
## 3rd Qu.:30900 3rd Qu.:37120
## Max. :58080 Max. :63730
describe(cities_small[,2:3])
## vars n mean sd median trimmed mad min
## population_1999 1 29 22216.21 14378.98 20520 21321.2 17924.63 4590
## population_2019 2 29 26475.52 16402.64 25000 25597.6 20667.44 5090
## max range skew kurtosis se
## population_1999 58080 53490 0.56 -0.68 2670.11
## population_2019 63730 58640 0.41 -0.96 3045.89
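As promised, here is a sketch of one alternative that avoids creating the three subsets explicitly, using base R's aggregate():
# group means and standard deviations, one call each
aggregate(cbind(population_1999, population_2019) ~ category, data=cities, FUN=mean)
aggregate(cbind(population_1999, population_2019) ~ category, data=cities, FUN=sd)
Next, we draw a simple random sample of \(n=10\) of the \(N=37\) cities and use it to estimate the mean population \(\mu\) in each year.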
set.seed(222)# fixing the seed for replicability purposes
N=dim(cities)[1]
n=10
(cities_SRS = cities[sample(1:N,n, replace=FALSE),])
## category population_1999 population_2019
## 15 small 20520 27710
## 18 small 21640 25000
## 23 small 35520 45800
## 24 small 35520 48280
## 22 small 30900 34100
## 26 small 40940 49040
## 20 small 29830 39120
## 9 small 8750 12140
## 10 small 9080 12480
## 30 medium 80720 81750
For each year, the SRS estimate for \(\mu\) is given simply by computing the mean of the population in the sample:
(mean_SRS=sapply(Filter(is.numeric, cities_SRS), mean))
## population_1999 population_2019
## 31342 37542
Under SRS, the bound on the error is given by \[B=2\sqrt{\frac{s^2}{n}\left(1-\frac{n}{N}\right)},\] where \(s\) is the standard deviation of the sample, \(n\) the sample size, and \(N\) the population size:
(sd_SRS=sapply(Filter(is.numeric, cities_SRS), sd))
## population_1999 population_2019
## 20507.29 20579.72
(bound_SRS = 2*sqrt(sd_SRS^2/n*(1-n/N)))
## population_1999 population_2019
## 11079.48 11118.61
Finally, the \(95\%\) SRS C.I. is given by mean_SRS ± bound_SRS, which yields:
(CI_SRS = cbind(mean_SRS-bound_SRS,mean_SRS+bound_SRS))
## [,1] [,2]
## population_1999 20262.52 42421.48
## population_2019 26423.39 48660.61
That is, we would expect the mean population of the cities to fall between the lower bound and the upper bound obtained this way roughly 19 times out of every 20 times we sampled the data using SRS.
There is just one problem: the true means were \(201,091\) in 1999 and \(249,796\) in 2019, so they are not, in fact, in the confidence intervals!
This is due, of course, to the specific sample that was selected, which does not contain a large city. If we try with set.seed(2) instead of set.seed(222) above, we obtain a sample with a large city, and the true population means are indeed in the corresponding \(95\%\) C.I. (there are even some samples that give negative lower bounds).
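We can quantify this behaviour with a small simulation (a sketch): repeat the SRS procedure a large number of times and record how often the interval covers the true 1999 mean.
set.seed(314)
true_mean_1999 <- mean(cities$population_1999)
covered <- replicate(10000, {
  srs <- cities$population_1999[sample(1:N, n, replace=FALSE)]
  bound <- 2*sqrt(var(srs)/n*(1-n/N))
  (mean(srs)-bound <= true_mean_1999) & (true_mean_1999 <= mean(srs)+bound)
})
mean(covered) # likely well below the nominal 95% for this skewed population
The heavily skewed population (two very large cities among the 37) is to blame: samples without a large city produce intervals that sit far too low. Stratified sampling (StS), which we build next, mitigates exactly this problem by sampling within each city-size class.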
# compute the estimate and bound components from the "large" stratum
n_l=2
N_l=2
cities_large_StS = cities_large[sample(1:N_l,n_l, replace=FALSE),]
mean_large_SRS = sapply(Filter(is.numeric, cities_large_StS), mean)
sd_large_SRS = sapply(Filter(is.numeric, cities_large_StS), sd)
bound_large_SRS = 1/N^2*N_l^2*sd_large_SRS^2/n_l*(1-n_l/N_l)
# compute the estimate and bound components from the "medium" stratum
n_m=3
N_m=6
cities_medium_StS = cities_medium[sample(1:N_m,n_m, replace=FALSE),]
mean_medium_SRS = sapply(Filter(is.numeric, cities_medium_StS), mean)
sd_medium_SRS = sapply(Filter(is.numeric, cities_medium_StS), sd)
bound_medium_SRS = 1/N^2*N_m^2*sd_medium_SRS^2/n_m*(1-n_m/N_m)
# compute the estimate and bound components from the "small" stratum
n_s=5
N_s=29
cities_small_StS = cities_small[sample(1:N_s,n_s, replace=FALSE),]
mean_small_SRS = sapply(Filter(is.numeric, cities_small_StS), mean)
sd_small_SRS = sapply(Filter(is.numeric, cities_small_StS), sd)
bound_small_SRS = 1/N^2*N_s^2*sd_small_SRS^2/n_s*(1-n_s/N_s)
# compute the StS estimate and bound
(mean_StS = N_l/N*mean_large_SRS + N_m/N*mean_medium_SRS + N_s/N*mean_small_SRS)
## population_1999 population_2019
## 199651.1 246210.5
bound_StS = 2*sqrt(bound_large_SRS + bound_medium_SRS + bound_small_SRS)
# derive the StS 95% CI
(CI_StS = cbind(mean_StS-bound_StS,mean_StS+bound_StS))
## [,1] [,2]
## population_1999 185450.5 213851.7
## population_2019 232668.4 259752.6
The \(95\%\) C.I. are much tighter (and incidentally, provide estimates that are closer to the true value, although that is secondary to the tightness of the intervals).
Write R code that lets you draw “random” samples from the various distributions discussed in the slide deck section on distributions.
Uniformly distributed numbers can be generated by runif()
(call ?runif
at the prompt to get the function information).
The default call is runif(n), producing \(n\) uniformly distributed numbers on \([0,1]\).
runif(1) # generate a single number from U(0,1)
## [1] 0.5122694
(x<-runif(10))
## [1] 0.24427790 0.95997506 0.72662884 0.47102506 0.42161913 0.90393694
## [7] 0.11783592 0.67981691 0.02245412 0.53831510
hist(x)
We can also compare the sample mean and variance with the true (theoretical) mean and variance.
(mean_x = mean(x))
## [1] 0.5085885
(true_mean = (1+0)/2)
## [1] 0.5
(var_x = var(x))
## [1] 0.1006271
(true_var = 1/12*(1-0)^2)
## [1] 0.08333333
For uniform distributions with specific lower and upper bounds \(a\) and \(b\), we need to modify the call to runif().
a=-2
b=3
n=250
y<-runif(n,min=a,max=b) # generate a sample of size 250 from U(-2,3)
hist(y) # coarse histogram (default bins)
brks = seq(a,b,0.1)
hist(y, breaks = brks)# histogram with more bins
What is the next block of code supposed to do?
(mean_y = mean(y))
## [1] 0.5166937
(true_mean = (a+b)/2)
## [1] 0.5
(var_y = var(y))
## [1] 1.927141
(true_var = 1/12*(b-a)^2)
## [1] 2.083333
Normally distributed numbers can be generated by rnorm(), which accepts three parameters: n, mean, and sd. The default parameter values are mean=0 and sd=1.
rnorm(1) # generate a single number from N(0,1)
## [1] 0.3703478
z<-rnorm(500) # generate a sample of size 500 from N(0,1)
hist(z)
brks = seq(min(z),max(z),(max(z)-min(z))/20) # what does this line do?
hist(z, breaks = brks)
For normal distributions with specific mean \(\mu\) and standard deviation \(\sigma\), we need to modify the call to rnorm().
w<-rnorm(5000, sd=3, mean=-2)
mean(w)
## [1] -2.087508
sd(w)
## [1] 2.980652
brks = seq(min(w),max(w),(max(w)-min(w))/50)
hist(w, breaks = brks)
You won’t be surprised to find out that things are pretty similar for the Poisson distribution \(P(\lambda)\), except that we use the function rpois(), with required parameters n and lambda.
rpois(1,lambda=13)
## [1] 19
u<-rpois(500,lambda=13)
head(u)
## [1] 8 11 15 11 11 11
mean(u)
## [1] 13.182
var(u)
## [1] 12.48986
hist(u)
Would you believe that rbinom() does the trick? Parameters are n, size, and prob. You can probably guess what the next two blocks do.
rbinom(1, size=11, prob=0.2)
## [1] 1
v<- rbinom(1000,size=11, prob=0.2)
mean(v)
## [1] 2.144
var(v)
## [1] 1.654919
brks = min(v):max(v)
hist(v, breaks = brks)
# let's try another one
v<- rbinom(1000,size=19, prob=0.7)
mean(v)
## [1] 13.348
var(v)
## [1] 3.966863
brks = min(v):max(v)
hist(v, breaks = brks)
What about log-normally distributed numbers? rlnorm() does the trick, with parameters n, meanlog, and sdlog (whose default values are 0 and 1, respectively).
rlnorm(1)
## [1] 2.233227
s <- rlnorm(250,meanlog=0,sdlog=1)
hist(s)
Exponentially distributed numbers are generated by rexp(), with required parameters n and rate.
rexp(1,100)
## [1] 0.004335029
q<-rexp(1000,100)
mean(q)
## [1] 0.009842063
var(q)
## [1] 0.0001046535
hist(q)
Gamma distributed numbers are generated by rgamma(), with required parameters n, shape, and scale.
rgamma(1,shape=2,scale=1/3)
## [1] 0.2865575
q<-rgamma(1000,shape=2, scale=1/3)
mean(q)
## [1] 0.6807423
var(q)
## [1] 0.2262379
hist(q)
We can generate a multivariate joint normal via MASS’s mvrnorm(), whose required parameters are n, a mean vector mu, and a covariance matrix Sigma.
# let's start with a standard bivariate joint normal
mu = rep(0,2)
Sigma = matrix(c(1,0,0,1),2,2)
We’ll sample 1000 observations from the joint normal \(N(\mu,\Sigma)\).
library(MASS)
a<-mvrnorm(1000,mu,Sigma)
a<-data.frame(a)
str(a)
## 'data.frame': 1000 obs. of 2 variables:
## $ X1: num -0.307 0.923 0.651 -2.111 -0.158 ...
## $ X2: num 0.5828 -0.8599 -0.0108 -1.4353 -1.2544 ...
# let's plot the data... what would you expect to see here?
library(ggplot2)
library(hexbin)
qplot(X1, X2, data=a, geom="hex")
The covariance matrix was diagonal, so we expect the blob to be circular. What happens if we use a non-diagonal covariance matrix?
mu = c(-3,12)
Sigma = matrix(c(110,15,15,3),2,2)
a<-mvrnorm(1000,mu,Sigma)
a<-data.frame(a)
qplot(X1, X2, data=a, geom="hex") + ylim(-40,40) + xlim(-40,40)
A hospital needs to schedule night shifts in the maternity ward.
It is known that there are \(3000\) deliveries per year; if these happened randomly round the clock (is this a reasonable assumption?), we would expect \(1000\) deliveries between the hours of midnight and 8.00 a.m., a time when much of the staff is off-duty.
It is thus important to ensure that the night shift is sufficiently staffed to allow the maternity ward to cope with the workload on any particular night, or at least, on a high proportion of nights.
The average number of deliveries per night is \(\lambda = 1000/365.25\approx 2.74\).
Solution: we are looking for \[f(x)=P(X=x)=\frac{\lambda^x\exp(-\lambda)}{x!},\] for \(x=0,1,2,\ldots\).
For a Poisson distribution, the probability mass values \(f(x)\) can be obtained using dpois() (for a general distribution, replace the r in the rxxxxx(...) random number generators by d: dxxxxx(...)).
lambda = 2.74 # Poisson distribution parameter
x=0:10 # in theory, goes up to infinity, but we've got to stop somewhere...
y=dpois(x,lambda) # pmf
z=ppois(x,lambda) # cmf
(prob.details=data.frame(x,y,z))
## x y z
## 1 0 0.0645703469 0.06457035
## 2 1 0.1769227505 0.24149310
## 3 2 0.2423841682 0.48387727
## 4 3 0.2213775403 0.70525481
## 5 4 0.1516436151 0.85689842
## 6 5 0.0831007011 0.93999912
## 7 6 0.0379493202 0.97794844
## 8 7 0.0148544482 0.99280289
## 9 8 0.0050876485 0.99789054
## 10 9 0.0015489063 0.99943945
## 11 10 0.0004244003 0.99986385
# pmf plot
plot(x,y, type="h", col=2, main="Poisson PMF", xlab="x", ylab="f(x)=P(X=x)")
points(x,y, col=2)
abline(h=0, col=4)
# cmf plot
plot(c(-1,x),c(0,z), type="s", col=2, main="Poisson CMF", xlab="x", ylab="F(x)=P(X<=x)")
abline(h=0:1, col=4)
Solution: we seek an \(x\) for which \[P(X\leq x-1)\leq 0.80\leq P(X\leq x);\] a visual display simplifies the task.
# cmf plot
plot(c(-1,x),c(0,z), type="s", col=2, main="Poisson CMF", xlab="x", ylab="F(x)=P(X<=x)")
abline(h=0:1, col=4)
abline(h=0.8, col=1)
The \(y=0.8\) line crosses the CMF at \(x=4\); let’s evaluate \(F(3)=P(X\leq 3)\) and \(F(4)=P(X\leq 4)\) to confirm that \(F(3)\leq 0.8 \leq F(4)\).
lambda=2.74
ppois(3,lambda)
## [1] 0.7052548
ppois(4,lambda)
## [1] 0.8568984
It is indeed the case. Thus, if the hospital prepares for \(4\) deliveries a night, they will be ready for the worst on at least \(80\%\) of the nights (closer to \(85.7\%\), actually).
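Incidentally, R's quantile function for the Poisson distribution returns this threshold directly: qpois(p, lambda) gives the smallest \(x\) for which \(P(X\leq x)\geq p\).
qpois(0.8, lambda) # returns 4, the smallest x with F(x) >= 0.8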
Note that this is different than asking how many deliveries are expected nightly (namely, \(E[X]=2.74\)).
Solution: we need to evaluate \[ 365.25\cdot P(X\ge 5)=365.25 (1-P(X\le 4)).\]
365.25*(1-ppois(4,2.74))
## [1] 52.26785
That is, roughly \(14\%\) of the nights.
Solution: we are looking for largest value of \(x\) for which \(365.25\cdot P(X=x)\geq 1\) (if \(365.25\cdot P(X=x)<1\), then the probability of that number of deliveries is too low to expect that we would ever see it during the year).
lambda=2.74
# initializing vector
nights=c()
# expected number of nights with each number of deliveries
for(j in 0:10){
nights[j+1]=365.25*dpois(j,lambda)
}
rbind(0:10,nights)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 0.00000 1.00000 2.00000 3.00000 4.00000 5.00000 6.00000
## nights 23.58432 64.62103 88.53082 80.85815 55.38783 30.35253 13.86099
## [,8] [,9] [,10] [,11]
## 7.000000 8.000000 9.000000 10.0000000
## nights 5.425587 1.858264 0.565738 0.1550122
# identify largest index
max(which(nights>1))-1
## [1] 8
Namely, \(x=8\). Indeed, for larger values of \(x\), \(365.25\cdot P(X=x)<1\).
lambda=2.74
365.25*dpois(8,lambda)
## [1] 1.858264
365.25*dpois(9,lambda)
## [1] 0.565738
In this section, we illustrate how to use the Central Limit Theorem to answer questions.
A large freight elevator can transport a maximum of 9800 lbs. Suppose a load containing 49 boxes must be transported. From experience, the weight of boxes follows a distribution with mean \(\mu=205\) lbs and standard deviation \(\sigma=15\) lbs.
Estimate the probability that all 49 boxes can be safely loaded onto the freight elevator and transported.
Solution: we are given \(n=49\), \(\mu=205\), and \(\sigma=15\). Let’s assume further that the boxes all come from different sources (i.e., the box weights \(x_i\), \(i=1,\ldots,49\), are independent of one another).
To get a sense of the task’s feasibility, let’s simulate a few scenarios. You’ll notice that the problem makes no mention of the type of distribution followed by the weights.
For now, let us assume that the weights are normally distributed.
set.seed(0) # to ensure replicability; the seed only applies to the next call requiring the pseudo-random generator
x<-rnorm(49,mean=205,sd=15)
brks = seq(min(x),max(x),(max(x)-min(x))/10)
hist(x, breaks = brks)
The elevator can transport up to 9800 lbs; the \(n=49\) boxes can be transported if their total weight \[T=49w=x_1+\cdots +x_{49},\] where \(w=\overline{x}\), is less than 9800 lbs. In mathematical terms, we’re interested in the probability \(P(T<9800)\).
For the sample x
from above, we get:
(T<-sum(x))
## [1] 10066.36
and so that specific group of 49 boxes would be too heavy to carry in one trip. Perhaps we were simply unlucky – perhaps another group of boxes would have been light enough. Let’s try the same thing, but with a different group of boxes.
set.seed(999)
(T=sum(rnorm(49,mean=205,sd=15)))
## [1] 9852.269
Close, but no cigar. However, two tries aren’t enough to establish a trend and to estimate \(P(T<9800)\).
We’ll write a little function to help us find an estimate. The idea is simple: if we were to try a large number of random combinations of 49 boxes, the proportion of the attempts for which the total weight \(T\) falls below 9800 is (hopefully?) going to approximate \(P(T<9800)\).
# see if you can figure out what kind of inputs these are meant to be, and what this code does
# running this cell will compile the function, but
# it still needs to be called with appropriate inputs to provide an estimate for P(T<9800)
estimate_T.normal <- function(n, T.threshold, mean, sd, num.tries){
a=0
for(j in 1:num.tries){
if(sum(rnorm(n,mean=mean,sd=sd))<T.threshold){
a=a+1
}
}
estimate_T.normal <- a/num.tries
}
We’ll try the experiment 10, 100, 1000, 10000, 100000, and 1000000 times.
(c(estimate_T.normal(49,9800,205,15,10),
estimate_T.normal(49,9800,205,15,100),
estimate_T.normal(49,9800,205,15,1000),
estimate_T.normal(49,9800,205,15,10000),
estimate_T.normal(49,9800,205,15,100000),
estimate_T.normal(49,9800,205,15,1000000)))
## [1] 0.00000 0.01000 0.00700 0.00990 0.00973 0.00975
We can’t say too much from such a simple set up, but it certainly seems as though we could expect success about \(1\%\) of the time.
That’s low, but perhaps that was because we assumed normality. What would happen if we used other distributions with the same characteristics, such as \(U(179.0192379,230.9807621)\) or \(\Lambda(5.320340142,0.005339673624)\)?
(How would you verify that these distributions indeed have the right characteristics? How would you determine the appropriate parameters?)
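One way to answer both questions (a sketch): match the first two moments. For \(U(a,b)\), the mean is \((a+b)/2\) and the variance is \((b-a)^2/12\); for the log-normal \(\Lambda(\mu,\sigma^2)\), the mean is \(\exp(\mu+\sigma^2/2)\) and the variance is \((\exp(\sigma^2)-1)\exp(2\mu+\sigma^2)\). Setting these equal to \(205\) and \(15^2\), respectively, and solving recovers the parameters quoted above.
mu.X <- 205; sd.X <- 15
# uniform: a = mu - sd*sqrt(3), b = mu + sd*sqrt(3)
c(a=mu.X - sd.X*sqrt(3), b=mu.X + sd.X*sqrt(3))
# log-normal: sigma^2 = log(1 + sd^2/mu^2), mu = log(mean) - sigma^2/2
sdlog2 <- log(1 + sd.X^2/mu.X^2)
c(meanlog=log(mu.X) - sdlog2/2, sdlog2=sdlog2)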
Let’s try with those distributions as well.
estimate_T.unif <- function(n, T.threshold, min, max, num.tries){
a=0
for(j in 1:num.tries){
if(sum(runif(n,min=min,max=max))<T.threshold){
a=a+1
}
}
estimate_T.unif <- a/num.tries
}
(c(estimate_T.unif(49,9800,179.0192379,230.9807621,10),
estimate_T.unif(49,9800,179.0192379,230.9807621,100),
estimate_T.unif(49,9800,179.0192379,230.9807621,1000),
estimate_T.unif(49,9800,179.0192379,230.9807621,10000),
estimate_T.unif(49,9800,179.0192379,230.9807621,100000),
estimate_T.unif(49,9800,179.0192379,230.9807621,1000000)))
## [1] 0.000000 0.010000 0.008000 0.007900 0.010230 0.009613
estimate_T.lnorm <- function(n, T.threshold, meanlog, sdlog, num.tries){
a=0
for(j in 1:num.tries){
if(sum(rlnorm(n,meanlog=meanlog,sdlog=sdlog))<T.threshold){
a=a+1
}
}
estimate_T.lnorm <- a/num.tries
}
(c(estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),10),
estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),100),
estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),1000),
estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),10000),
estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),100000),
estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),1000000)))
## [1] 0.000000 0.000000 0.006000 0.009500 0.009060 0.009184
Under all three distributions, it appears as though \(P(T<9800)\) converges to a value near \(1\%\), even though the three distributions are very different. That might be surprising at first glance, but it isn’t really, given the Central Limit Theorem (CLT).
In effect, we are interested in estimating \(P(T<9800)=P(w<9800/49)=P(w<200)\), where \(w\) is the mean weight of the boxes.
According to the CLT, the distribution of \(w\) is approximately normal with mean \(\mu=205\) and variance \(\sigma^2/n=15^2/49\), even if the weights themselves were not normally distributed.
By subtracting the mean of \(w\) and dividing by the standard deviation we obtain a new random variable \(z\) which is approximately the standard unit normal, i.e. \[P(w<200)\approx P\left(z<\frac{200−205}{15/7}\right).\]
(200-205)/(15/7)
## [1] -2.333333
Thus, \(P(w<200)\approx P(z<−2.33)\) and we need to find the probability that a standard normal random variable takes a value smaller than \(-2.33\).
This can be calculated with the pnorm()
function:
pnorm(-2.33, mean=0, sd=1)
## [1] 0.009903076
Hence, \(P(T<9800)\approx 0.0099\), which means that it’s highly unlikely (\(\approx 1\%\) probability) that the 49 boxes can be transported in the elevator all at once.
What elevator threshold would be required to reach a probability of success of \(10\%\)? \(50\%\)? \(75\%\)?
The following routine approximates the probability in question without resorting to simulating the weights (that is, independently of the underlying distribution of weights) for given n, threshold, mean, and sd. Can you figure out what pnorm() is doing?
prob_T <- function(n,threshold,mean,sd){
prob_T=pnorm((threshold/n - mean)/(sd/sqrt(n)),0,1)
}
max(which(prob_T(49,1:12000,205,15)<0.1))
## [1] 9910
max(which(prob_T(49,1:12000,205,15)<0.5))
## [1] 10044
max(which(prob_T(49,1:12000,205,15)<0.75))
## [1] 10115
plot((prob_T(49,1:12000,205,15)))
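Alternatively, since prob_T() is a monotone function of the threshold, we can invert the normal approximation in closed form with qnorm() instead of searching a grid (a sketch): \(P(T<t)=p\) exactly when \(t=n\mu+\sqrt{n}\,\sigma\,z_p\).
threshold_for_prob <- function(n, mean, sd, p){
  n*mean + sqrt(n)*sd*qnorm(p)
}
threshold_for_prob(49, 205, 15, c(0.10, 0.50, 0.75)) # compare with the grid search above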
In this section, we show how to run and interpret the results of linear regression models.
A specific auto part is manufactured by a company once a month in lots that vary in size as demand fluctuates. The data below represent observations on lot size \(y\) and number of employee-hours of labor \(x\) for 10 recent production runs.
Fit a simple regression model \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), where \(E(\varepsilon_i)=0\), \(E(\varepsilon_i\varepsilon_j)=0\) for \(i\neq j\), and \(V(\varepsilon_i)=\sigma^2\), if the observations are: \[\begin{align*}\mathbf{Y}&=[73,50,128,170,87,108,135,69,148,132]^{\!\top} \\ \mathbf{x}&=[30,20,60,80,40,50,60,30,70,60]^{\!\top} \end{align*}\]
Solution: the line of best fit can be obtained via the design matrix \(\mathbf{X}=[\mathbf{1}\ \mathbf{x}]\), which allows us to compute the quantities required to solve the canonical equation: \((\mathbf{X}^{\!\top}\mathbf{X})^{-1}\), \(\mathbf{X}^{\!\top}\mathbf{Y}\), and \(\mathbf{\hat{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y}\).
X=cbind(matrix(rep(1,10),10,1),matrix(c(30,20,60,80,40,50,60,30,70,60),10,1))
Y=matrix(c(73,50,128,170,87,108,135,69,148,132),10,1)
t(X)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 1 1 1 1 1 1 1 1
## [2,] 30 20 60 80 40 50 60 30 70 60
t(Y)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 73 50 128 170 87 108 135 69 148 132
beta=solve(t(X)%*%X)%*%t(X)%*%Y
beta
## [,1]
## [1,] 10
## [2,] 2
The line of best fit (in the ordinary least square sense) is thus \(y=10+2x\).
data.set = data.frame(cbind(X[,2],Y))
colnames(data.set)<-c("X","Y")
str(data.set)
## 'data.frame': 10 obs. of 2 variables:
## $ X: num 30 20 60 80 40 50 60 30 70 60
## $ Y: num 73 50 128 170 87 108 135 69 148 132
plot(data.set)
abline(a=beta[1], b=beta[2], col="red")
Of course, no one computes the line of best fit explicitly in practice.
Once you’ve convinced yourself that the OLS framework is valid for your data, it makes much more sense to use the built-in functionality, such as is offered by lm().
It’s not the only such method; we’re presenting it mostly to hammer home the point that the R syntax for predictive models usually follows the same format:
a method produces an object (the model),
the method call requires the user to pass on a model structure (y ~ predictors),
to specify the dataset, and
a number of optional parameters.
best.fit = lm(Y~X, data=data.set)
best.fit
##
## Call:
## lm(formula = Y ~ X, data = data.set)
##
## Coefficients:
## (Intercept) X
## 10 2
We see quickly that the answer that was computed directly above is the same as the one provided by lm(), which is good news, but we can get more information via the summary() function.
summary(best.fit)
##
## Call:
## lm(formula = Y ~ X, data = data.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0 -2.0 -0.5 1.5 5.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.00000 2.50294 3.995 0.00398 **
## X 2.00000 0.04697 42.583 1.02e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.739 on 8 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.9951
## F-statistic: 1813 on 1 and 8 DF, p-value: 1.02e-10
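The fitted object can also be used for prediction. As a quick sketch, for a hypothetical production run requiring \(x=45\) employee-hours (a value we picked for illustration), the point estimate should be \(10+2\cdot 45=100\):
predict(best.fit, newdata=data.frame(X=45), interval="prediction") # with a 95% prediction interval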
(Question 4.7.11 in An Introduction to Statistical Learning, with Applications in R, pp. 171-2.)
This problem involves the ISLR’s Auto dataset.
library(ISLR)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.
Solution: there are many ways to do this – one such way is to use the tidyverse’s pipeline operator %>%.
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ tibble 3.0.1 ✓ purrr 0.3.4
## ✓ tidyr 1.0.2 ✓ dplyr 0.8.5
## ✓ readr 1.3.1 ✓ stringr 1.4.0
## ✓ tibble 3.0.1 ✓ forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
Auto.mpg01 <- Auto %>%
mutate(mpg01 = ifelse(mpg > median(mpg), 1, 0))
Auto.mpg01$mpg01 <- factor(Auto.mpg01$mpg01)
Auto.mpg01$origin <- factor(Auto.mpg01$origin)
str(Auto.mpg01)
## 'data.frame': 392 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
## $ mpg01 : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01?
Solution: here are a number of charts for the Auto.mpg01 dataset.
We start with a scatterplot matrix that uses GGally
’s ggscatmat()
.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
ggscatmat(as.data.frame(Auto.mpg01), columns = 1:8,
color = "mpg01")
## Warning in ggscatmat(as.data.frame(Auto.mpg01), columns = 1:8, color =
## "mpg01"): Factor variables are omitted in plot
The correlations are easier to spot using the corrplot()
function.
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(Auto.mpg01[,c(1:7)]),type="upper")
We could produce many more plots, but let’s stick to boxplots for now.
Auto.mpg01 <- Auto.mpg01[,-1]
Auto.mpg01 %>%
ggplot(aes(x=mpg01, displacement)) +
geom_boxplot()
Auto.mpg01 %>%
ggplot(aes(mpg01, horsepower)) +
geom_boxplot()
Auto.mpg01 %>%
ggplot(aes(mpg01, weight)) +
geom_boxplot()
Auto.mpg01 %>%
ggplot(aes(mpg01, acceleration)) +
geom_boxplot()
ggplot(Auto.mpg01) +
geom_count(aes(mpg01, cylinders))
ggplot(Auto.mpg01) +
geom_boxplot(aes(mpg01, cylinders))
The correlation between mpg01 and each of acceleration and year is positive, whereas the correlation between mpg01 and each of cylinders, displacement, horsepower, and weight is negative.
(Note that the origin is treated as a numeric variable due to the way it was encoded, but it should really be treated as a categorical variable.)
Split the data into a training set and a test set, and perform logistic regression in order to predict mpg01 for the test observations, using the variables that seemed most associated with mpg01 in the previous question. What is the test error of the model obtained?
Solution: we will use a \(70\%/30\%\) split.
# 70%/30% split
set.seed(31415) # for replicability
tmp <- Auto.mpg01 %>% mutate(id = row_number()) # we create a new variable id corresponding to the row number
Auto.train <- tmp %>% sample_frac(.70)
Auto.test <- anti_join(tmp, Auto.train, by = 'id')
nrow(Auto.train)
## [1] 274
nrow(Auto.test)
## [1] 118
We start by running one logistic regression analysis (using glm(..., family = binomial)) with the variable year included in the set of predictors (cylinders, displacement, horsepower, weight, acceleration, year), and one without (strictly speaking, we should probably create a categorical variable with levels Old and New instead of the numerical year values, but that’s a problem for another day).
Auto.log <- glm(mpg01 ~ cylinders + displacement +
horsepower + weight + acceleration + year,
data = Auto.train,
family = binomial)
summary(Auto.log)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + acceleration + year, family = binomial, data = Auto.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3502 -0.1216 -0.0009 0.2216 3.1910
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.427487 6.616579 -1.576 0.1150
## cylinders -0.532987 0.482273 -1.105 0.2691
## displacement 0.004309 0.011357 0.379 0.7044
## horsepower -0.039762 0.028604 -1.390 0.1645
## weight -0.003979 0.001281 -3.105 0.0019 **
## acceleration -0.013896 0.173644 -0.080 0.9362
## year 0.364917 0.085721 4.257 2.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 379.83 on 273 degrees of freedom
## Residual deviance: 112.38 on 267 degrees of freedom
## AIC: 126.38
##
## Number of Fisher Scoring iterations: 8
Only the variables weight, year, and nox are significant at the \(\alpha=0.1\) level (we’re still going to use all of the variables to make the predictions, however).
The “probability” predictions on the test data are obtained using the predict()
function:
# the predicted probability that mpg01 = 1
Auto.log.probs <- predict(Auto.log, Auto.test,
type = "response")
Auto.log.probs
## 1 2 3 4 5
## 8.223165e-04 5.175335e-04 1.533244e-06 2.095788e-06 2.065226e-04
## 6 7 8 9 10
## 2.468985e-01 2.940733e-01 7.595980e-01 6.530110e-06 4.233838e-05
## 11 12 13 14 15
## 5.248310e-05 4.402751e-07 4.965142e-02 7.763576e-01 8.499461e-01
## 16 17 18 19 20
## 2.386803e-05 1.475026e-06 1.613606e-04 1.562219e-04 1.101385e-01
## 21 22 23 24 25
## 9.186736e-01 8.944352e-01 8.857737e-01 2.068811e-04 3.997363e-02
## 26 27 28 29 30
## 1.600116e-01 6.396976e-06 7.681503e-06 2.194514e-01 7.524787e-01
## 31 32 33 34 35
## 8.553042e-01 3.278123e-06 6.186946e-01 1.854775e-01 7.433616e-02
## 36 37 38 39 40
## 2.174472e-04 1.073646e-01 2.036722e-01 4.605751e-02 3.703533e-05
## 41 42 43 44 45
## 9.836101e-01 9.605762e-01 9.625620e-01 9.443120e-01 9.327093e-02
## 46 47 48 49 50
## 2.933971e-01 1.522843e-05 4.468975e-05 3.004570e-05 7.982652e-01
## 51 52 53 54 55
## 9.644078e-01 3.809157e-01 8.998975e-01 9.672360e-01 1.216884e-01
## 56 57 58 59 60
## 7.360486e-02 9.952170e-01 9.782763e-01 9.254961e-01 1.997725e-01
## 61 62 63 64 65
## 3.706235e-05 1.269055e-04 9.760979e-01 9.936897e-01 9.943298e-01
## 66 67 68 69 70
## 9.810295e-01 9.979466e-01 6.451700e-02 3.804970e-02 1.108955e-02
## 71 72 73 74 75
## 9.105604e-01 9.609822e-01 9.235237e-01 1.078530e-01 9.952231e-01
## 76 77 78 79 80
## 6.306253e-01 8.366398e-01 9.022938e-03 4.911004e-03 1.785491e-03
## 81 82 83 84 85
## 9.974656e-01 9.966332e-01 9.639892e-02 6.114174e-01 9.519968e-01
## 86 87 88 89 90
## 9.950282e-01 9.985572e-01 9.963954e-01 9.934700e-01 9.783528e-01
## 91 92 93 94 95
## 9.985216e-01 9.959005e-01 7.964433e-01 9.988975e-01 3.331308e-01
## 96 97 98 99 100
## 9.707782e-01 9.924896e-01 9.673216e-01 7.835725e-01 9.897617e-01
## 101 102 103 104 105
## 9.973371e-01 9.925931e-01 9.788452e-01 7.840992e-01 5.817979e-01
## 106 107 108 109 110
## 2.221020e-01 3.786351e-01 9.729914e-01 9.819340e-01 9.645347e-01
## 111 112 113 114 115
## 9.988681e-01 9.988207e-01 9.981796e-01 9.972075e-01 9.991459e-01
## 116 117 118
## 9.619199e-01 9.989901e-01 9.713495e-01
We use a probability threshold of \(0.5\) for the class predictions (i.e. if probability is smaller than 0.5, predict that mpg01
is 0, else predict that it is 1).
# setting up the categorical response
Auto.log.preds <- rep(0,length(Auto.test$mpg01))
Auto.log.preds[Auto.log.probs > 0.5] <- 1
table(Auto.log.preds)
## Auto.log.preds
## 0 1
## 53 65
We can now compare how often the prediction matches the reality and build a confusion matrix
table(Auto.log.preds, Auto.test$mpg01)
##
## Auto.log.preds 0 1
## 0 50 3
## 1 8 57
with accuracy \(\frac{50+57}{118}\approx 90.7\%\), error rate \(\frac{3+8}{118}\approx 9.3\%\), true negative rate \(\frac{50}{50+8}\approx 86.2\%\), and true positive rate \(\frac{57}{57+3}=95\%\).
mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.9067797
1 - mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.09322034
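The metrics can also be extracted programmatically from the confusion matrix (a sketch; in our table() call, the rows are the predictions and the columns the true classes):
cm <- table(Auto.log.preds, Auto.test$mpg01)
c(accuracy = sum(diag(cm))/sum(cm),
  TNR = cm[1,1]/sum(cm[,1]), # true negatives / actual negatives
  TPR = cm[2,2]/sum(cm[,2])) # true positives / actual positives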
The second model doesn’t use the variable year.
# without the variable year
Auto.log <- glm(mpg01 ~ cylinders + displacement
+ horsepower + weight + acceleration,
data = Auto.train,
family = binomial)
summary(Auto.log)
##
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower +
## weight + acceleration, family = binomial, data = Auto.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5472 -0.1501 -0.0036 0.3566 3.3249
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 14.264874 3.568420 3.998 6.4e-05 ***
## cylinders -0.546774 0.433090 -1.262 0.2068
## displacement -0.001175 0.010203 -0.115 0.9083
## horsepower -0.047321 0.025437 -1.860 0.0628 .
## weight -0.002187 0.001093 -2.000 0.0455 *
## acceleration -0.032208 0.156340 -0.206 0.8368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 379.83 on 273 degrees of freedom
## Residual deviance: 137.49 on 268 degrees of freedom
## AIC: 149.49
##
## Number of Fisher Scoring iterations: 7
Only the variables horsepower and weight are significant at the \(\alpha=0.1\) level (we’re still going to use all of the variables to make the predictions, however).
The “probability” predictions on the test data are:
Auto.log.probs <- predict(Auto.log, Auto.test,
type = "response")
Auto.log.preds <- rep(0,length(Auto.test$mpg01))
Auto.log.preds[Auto.log.probs > 0.5] <- 1
table(Auto.log.preds)
## Auto.log.preds
## 0 1
## 54 64
The confusion matrix is similar to, but not exactly the same as, the previous one (how the two compare depends on the seed used to split the training and testing data):
table(Auto.log.preds, Auto.test$mpg01)
##
## Auto.log.preds 0 1
## 0 49 5
## 1 9 55
with accuracy \(\frac{49+55}{118}\approx 88.1\%\), error rate \(\frac{5+9}{118}\approx 11.9\%\), true negative rate \(\frac{49}{49+9}\approx 84.5\%\), and true positive rate \(\frac{55}{55+5}\approx 91.7\%\).
mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.8813559
1 - mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.1186441
Even though the model with the year predictor performs slightly better on the test set than the one without it, it is difficult to justify the use of year as a numerical variable in the model.
Consequently, we select the second model for any future prediction endeavour.
(I suppose that we could also create a model with a categorical age variable, but my interest in this dataset is waning even as I type this).
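(For the record, such a model might look like the sketch below, with a hypothetical Old/New split at the median model year; the cut-off is an arbitrary choice on our part.)
Auto.train$era <- factor(ifelse(Auto.train$year > median(Auto.train$year), "New", "Old"))
Auto.log.era <- glm(mpg01 ~ cylinders + displacement + horsepower +
                    weight + acceleration + era,
                    data = Auto.train, family = binomial)
summary(Auto.log.era)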
(Question 3.7.15 in An Introduction to Statistical Learning, with Applications in R, p. 126.)
This problem involves the MASS’s Boston dataset.
library(MASS)
Boston$chas <- factor(Boston$chas, labels = c("N","Y"))
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 N 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 N 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 N 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 N 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 N 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 N 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
We are interested in the relationship between the per capita crime rate (response \(Y\)) and the other variables (predictors \(X_j\)).
Solution: we build simple linear regression models \[\hat{Y}=\beta_{0,j}+\beta_jX_j\] for each predictor \(X_j\) in the dataset, except for \(j=1\), for which we build \(\hat{Y}=\beta_{0,1}\).
One of the ways in which we can determine if we reject the null hypothesis \(H^j_0:\beta_{j}=0\) for the \(j\)th model is to see if the associated \(p-\)value is below some significance threshold, say \(\alpha=0.05\).
To do so, we fit each of the 14 models (one for each simple regression model), then we extract the \(p-\)value corresponding to the null hypothesis for each model, and we look for those values that are above \(\alpha=0.05\) as being predictors for which we cannot reject the possibility that there is no relation between the predictor and the response crim.
Since some of the \(p-\)values are quite small, we plot their logarithms instead, and we look for predictors for which \(\ln(p-\text{value})>-3\), which correspond to a \(p-\)value greater than \(\approx 0.05\).
# initialize model, pval, and coeff objects
model=list()
pval=c()
coeff=c()
# build the first model (constant response), and extract the corresponding p-value and coefficient
model[[1]] <- lm(crim ~ 1)
pval[1] <- summary(model[[1]])$coefficients[1,4]
coeff[1] <- coefficients(model[[1]])[1]
model
## [[1]]
##
## Call:
## lm(formula = crim ~ 1)
##
## Coefficients:
## (Intercept)
## 3.614
pval
## [1] 1.262593e-19
coeff
## [1] 3.613524
# repeat the procedure for each of the other predictors
for(i in 1:(ncol(Boston)-1)){
model[[i+1]] <- lm(crim ~., data=Boston[,c(1,i+1)])
pval[i+1] <- summary(model[[i+1]])$coefficients[2,4]
coeff[i+1] <- coefficients(model[[i+1]])[2]
}
# combine results
results<-data.frame(pval)
rownames(results)<-colnames(Boston)
# plot p-values
plot(log(pval), pch=20, ylim=c(-140, 0),
xaxt="n", xlab="Predictor", ylab="log(pval)")
# Plot the axis separately
axis(1, at=1:14, labels=rownames(results))
abline(h=-3,col=2)
The only predictor for which we cannot conclude that there is a statistically significant association with the response is chas (\(X_3\)), with \(p-\)value \(\approx 0.21\).
# p-value of predictor X3
pval[4]
## [1] 0.2094345
Solution: the output for the full regression model \[\hat{Y}=\beta_1+\sum_{j=2}^{14}\beta_jX_j\] is shown below.
# building the full multiple regression model
lm.full = lm(crim~., data=Boston)
summary(lm.full)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chasY -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
# diagnostic plots
par(mfrow=c(2,2))
plot(lm.full)
In the full model, we can only reject the null hypothesis \(H_0:\beta_j=0\) at \(\alpha=0.05\) for zn, dis, rad, black, and medv.
The diagnostic plots for the full model are… well, does it look to you as though the various assumptions required to use a linear model are met?
Solution: we create a plot showing the relationship between the simple regression coefficients and the multiple regression coefficients (each predictor is displayed as a single point on the plot).
What would you expect to see here?
# multiple regression coefficient
y=coefficients(lm.full)[2:14]
# recall that the simple regression coefficients were stored in coeff
par(mfrow=c(1,1))
plot(coeff[2:14], y, xlab="Simple Regression Coefficients", ylab="Multiple Regression Coefficients")
In most cases, the coefficient from the simple regression is fairly close to the corresponding coefficient from the multiple regression, except for one predictor (bottom right corner). Which predictor is this?
colnames(Boston)[which(coeff[2:14]> 30)+1]
## [1] "nox"
Let’s zoom in on the upper left corner to get a better sense of the relationship among the non-nox
predictors.
plot(coeff[-c(1,5)],y[-c(4)], xlab="Simple Regression Coefficients", ylab="Multiple Regression Coefficients")
We can see that there is still a fair amount of variation, save for a small group of predictors for which the simple regression coefficients lie between \(-0.5\) and \(0.75\) (eyeballing things, here).
colnames(Boston)[(which(coeff[2:14]>-1 & coeff[2:14]<= 0.55))+1]
## [1] "zn" "indus" "age" "tax" "black" "lstat" "medv"
The predictors for which both coefficients are “mostly” the same are thus zn, indus, age, tax, black, lstat, and medv.
Solution: to answer this question, we will fit, for every predictor \(X_j\), a cubic polynomial model
\[\hat{Y}=\beta_{0,j}+\beta_{1,j}X_j+\beta_{2,j}X_j^2+\beta_{3,j}X_j^3, \] except for \(X_j=\) chas, as polynomial regression is nonsensical for categorical features.
We are interested in finding associated \(p-\)values below \(0.05\) for the higher order terms in each model \(j\), in order to reject the null hypotheses \[H_{0,j}: \beta_{2,j}=\beta_{3,j}=0 \quad \text{against}\quad H_{1,j}: \beta_{2,j}^2+\beta_{3,j}^2\neq 0\] in the \(j\)th model.
To do so, we repeat the procedure of the first question.
# remove chas from the dataset
Boston2 <-Boston[,-c(4)]
detach(Boston)
attach(Boston2)
# initialize model, pval, and p.array (higher degree) objects
model=list()
pval=list()
p.array=matrix(rep(NA,26),nrow=13,ncol=2)
# build the first model (constant response), and extract the corresponding p-value
model[[1]] <- lm(crim ~ 1)
pval[[1]] <- summary(model[[1]])$coefficients[1,4]
# repeat the procedure for each of the other predictors
for(i in 1:(ncol(Boston2)-1)){
model[[i+1]] <- lm(crim ~ poly(Boston2[,c(i+1)],3),
data=Boston[,c(1,i+1)])
pval[[i+1]] <- summary(model[[i+1]])$coefficients[3:4,4]
p.array[i+1,1] <- as.numeric(pval[[i+1]][1]) # quadratic term
p.array[i+1,2] <- as.numeric(pval[[i+1]][2]) # cubic term
}
rownames(p.array)<-c("crim","zn","indus","nox","rm",
"age","dis","rad","tax","ptratio",
"black","lstat","medv")
p.array<-p.array[2:13,]
We find evidence of a non-linear association between crim and all predictors, except for black (and chas, of course).
rowSums(p.array< 0.05)
## zn indus nox rm age dis rad tax ptratio
## 1 2 2 1 2 2 1 1 2
## black lstat medv
## 0 1 2
detach(Boston2)
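The joint hypotheses \(H_{0,j}\) can also be tested head-on with a partial \(F\)-test, by comparing the nested linear and cubic fits with anova(); here is a sketch for a single predictor, dis:
fit.lin <- lm(crim ~ dis, data=Boston)
fit.cub <- lm(crim ~ poly(dis, 3), data=Boston)
anova(fit.lin, fit.cub) # a small p-value points to a non-linear association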
Suppose that a researcher wants to determine if, as she believes, a new teaching method enables students to understand elementary statistical concepts better than the traditional lectures given in a university setting.
She recruits \(N=80\) second-year students to test her claim. The students are randomly assigned to one of two groups:
students in group \(A\) are given the traditional lectures,
whereas students in group \(B\) are taught using the new teaching method.
After three weeks, a short quiz is administered to the students in order to assess their understanding of statistical concepts.
The results are found in the teaching.csv
dataset.
teaching <- read.csv("Data/teaching.csv", header = TRUE)
str(teaching)
## 'data.frame': 80 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 2 levels "A","B": 2 2 1 1 2 1 2 2 2 2 ...
## $ Grade: num 75.5 77.5 73.5 75 77 79 79 75.5 80 79.5 ...
summary(teaching)
## ID Group Grade
## Min. : 1.00 A:40 Min. :68.50
## 1st Qu.:20.75 B:40 1st Qu.:75.00
## Median :40.50 Median :77.25
## Mean :40.50 Mean :77.06
## 3rd Qu.:60.25 3rd Qu.:79.12
## Max. :80.00 Max. :84.00
head(teaching)
## ID Group Grade
## 1 1 B 75.5
## 2 2 B 77.5
## 3 3 A 73.5
## 4 4 A 75.0
## 5 5 B 77.0
## 6 6 A 79.0
ANOVA is a statistical method that partitions a dataset’s variability into explainable variability (model-based) and unexplained variability (error) using various statistical models, to determine whether (multiple) treatment groups have significantly different group means.
The total sample variability of a feature \(y\) in a dataset is defined as \[ \text{SS}_{\textrm{tot}}=\sum_{k=1}^{N}(y_{k}-\bar{y})^{2}, \] where \(\bar{y}\) is the overall mean of the data.
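For the quiz data, this quantity is easily computed directly (a sketch):
(SS.tot <- sum((teaching$Grade - mean(teaching$Grade))^2))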
Solution: since the assignment of student ID is arbitrary when it comes to performance (at least, in theory), we should not observe any pattern.
# class mean
(mu = mean(teaching$Grade))
## [1] 77.0625
# scatterplot
plot(teaching$ID,teaching$Grade)
# add a horizontal line at the mean level
abline(h = mu)
If we had to guess someone’s score with no knowledge except for their participant ID, then picking the sample mean is as good a guess as any other.
Statistically speaking, this means that the null model is \[ y_{i,j}=\mu+\varepsilon_{i,j}, \] where \(\mu\) is the class mean, \(i\in\{A,B\}\) refers to the teaching method, and \(j=1,\ldots,40\) indexes the students in each group.
This model does not explain any of the variability in the student scores other than through a noise term.
But the students DID NOT all receive the same treatment: \(40\) randomly selected students were assigned to group \(A\), and the other \(40\) to group \(B\), and each group was taught using a different method.
# compute the means in each treatment group
means.by.group = aggregate(teaching[, 3], list(teaching$Group), mean)
# plot students from the two groups using different markers
plot(teaching$ID,teaching$Grade,
col=c("red","blue")[as.numeric(teaching$Group)],
pch=c(15,17)[as.numeric(teaching$Group)])
# add the class mean and the mean in each group to the plot
abline(h = mu)
abline(h = means.by.group[2,2], col="blue", lwd=2, lty=2)
abline(h = means.by.group[2,2], col="red", lwd=2, lty=2)
# add legend
legend("topleft", legend=c("Group A", "Group B"),
col=c("red", "blue"), pch=c(15,17), cex=0.8)
We clearly see that the two study groups show different characteristics in term of their average scores.
With the group assignment information, we can refine our null model into the treatment-based model \[ y_{i,j}=\mu_{i}+\varepsilon_{i,j}, \] where \(\mu_i\), \(i\in\{A,B\}\), represent the group means.
Using this model, we can decompose \(\text{SS}_{\textrm{tot}}\) into between-treatment sum of squares and error (within-treatment) sum of squares as \[ \text{SS}_{\textrm{tot}}=\sum_{i,j}(y_{i,j}-\bar{y})^{2}=\sum_{i,j}(y_{i,j}-\bar{y}_{i}+\bar{y}_{i}-\bar{y})^{2} =\sum_{i}N_{i}(\bar{y}_{i}-\bar{y})^{2}+\sum_{i,j}(y_{i,j}-\bar{y}_{i})^2=\text{SS}_{\textrm{treat}}+\text{SS}_{\textrm{e}}. \] The \(\text{SS}_{\textrm{treat}}\) component looks at the difference between each of the treatment means and the overall mean, which we consider to be explainable; the \(\text{SS}_{\textrm{e}}\) component, on the other hand, looks at the difference between each observation and its own group mean, and is considered to be random.
The spread about each group mean is fairly large (relatively-speaking), so the researcher would be right to suggest that the treatment-based model does not capture all the variability in the data on its own, but we will not pursue this further.
In general, \(\text{SS}_{\textrm{treat}}/\text{SS}_{\textrm{tot}}\times 100\%\) of the total variability can be explained using a treatment-based model. This ratio is the coefficient of determination, denoted by \(R^{2}\).
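Before recovering these quantities from R's ANOVA machinery, we can compute the decomposition by hand (a sketch):
SS.tot <- sum((teaching$Grade - mean(teaching$Grade))^2)
SS.treat <- sum(tapply(teaching$Grade, teaching$Group,
                function(g) length(g)*(mean(g) - mean(teaching$Grade))^2))
SS.e <- sum(tapply(teaching$Grade, teaching$Group, function(g) sum((g - mean(g))^2)))
c(SS.treat=SS.treat, SS.e=SS.e, R2=SS.treat/SS.tot)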
Solution: we can recover the ANOVA table in R
by first running a linear regression on the data.
# run the linear regression
model.lm <- lm(Grade ~ Group, data = teaching)
# compute the ANOVA table
SS.Table <- anova(model.lm)
SS.Table
## Analysis of Variance Table
##
## Response: Grade
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 300.31 300.312 49.276 7.227e-10 ***
## Residuals 78 475.38 6.095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The test statistic \(F_{0}\) follows an \(F\)-distribution with \[(\text{df}_{\textrm{treat}}, \text{df}_{\textrm{e}})=(1,n_1+n_2-2)=(1,78)\] degrees of freedom.
At a significance level of \(\alpha=0.05\), the critical value \(F^*=F_{0.95, 1, 78}=3.96\) is substantially smaller than the test statistic \(F_{0}=49.28\), implying that the two-treatment model is statistically significant.
This, in turn, means that the model recognises a statistically significant difference between the students’ scores, based on the teaching methods.
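Both the critical value and the p-value can be verified directly with the F-distribution functions (a sketch):
qf(0.95, df1=1, df2=78)       # critical value F*, about 3.96
1 - pf(49.276, df1=1, df2=78) # p-value, matching the ANOVA table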
The coefficient of determination \(R^2\) provides a way to measure the model’s significance. We can compute it directly from the ANOVA table produced by lm(): \[R^{2}=\frac{\text{SS}_{\textrm{treat}}}{\text{SS}_{\textrm{tot}}}=\frac{300.31}{775.69}\approx 0.39,\] or we can extract it from model.lm:
# provide a summary of the linear regression model
summary(model.lm)
##
## Call:
## lm(formula = Grade ~ Group, data = teaching)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.625 -1.500 0.000 1.906 5.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.1250 0.3903 192.46 < 2e-16 ***
## GroupB 3.8750 0.5520 7.02 7.23e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 78 degrees of freedom
## Multiple R-squared: 0.3872, Adjusted R-squared: 0.3793
## F-statistic: 49.28 on 1 and 78 DF, p-value: 7.227e-10
# highlights which attributes exist to find R2
attributes(summary(model.lm))
## $names
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
##
## $class
## [1] "summary.lm"
# extract R2
(R2 = summary(model.lm)$r.squared)
## [1] 0.3871566
Either way, we find \(R^2\approx 0.39\), which means that \(39\%\) of the total variation in the data can be explained by our two-treatment model.
Is this good enough? That depends on the specifics of the situation (in particular, on the researchers’ needs).
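For completeness, the same ratio can also be recovered from the stored ANOVA table; a quick sketch using the SS.Table object computed above:
# recover R^2 from the ANOVA table's Sum Sq column
SS.treat <- SS.Table$`Sum Sq`[1]
SS.tot   <- sum(SS.Table$`Sum Sq`)
SS.treat/SS.tot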
Test the three additional assumptions:
normality of the error terms;
constant variance (within treatment groups), and
equal variances (across treatment groups).
Solution: we can get an idea as to whether these are valid by looking at the diagnostic plots associated with the model.
Normality of the errors can be tested visually with the help of a normal-QQ plot, which compares the standardized residuals quantiles against the theoretical quantiles of the standard normal distribution \(N(0,1)\) (a straight line indicates normality… some departures in the tails are allowed).
To test the assumptions of constant and equal variance, we can visually inspect plots of residuals vs. fitted values, and/or residuals vs. order/time.
plot(model.lm)
The visual evidence seems to support the use of the two-treatment model.
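If a formal complement to the visual inspection is desired, base R offers tests of these assumptions; a minimal sketch (the usual caveats about test power on small samples apply):
# formal checks of the model assumptions
shapiro.test(residuals(model.lm))              # H0: the residuals are normally distributed
bartlett.test(Grade ~ Group, data = teaching)  # H0: the group variances are equal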
Consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels (\(x\)) and pure oxygen levels (\(y\)) in fuels:
x=c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y=c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
Solution: we can compute the strength of association using the correlation \[\rho=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}=\frac{\sum_i (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_i (x_i-\overline{x})^2\sum_i (y_i-\overline{y})^2}}.\]
First we plot the data:
# produce the scatterplot of x,y
plot(x,y)
It seems that the points lie close to a hidden line!
The correlation can be computed directly with cor()
:
# compute the correlation rho
cor(x,y)
## [1] 0.9367154
or computed using the formula:
Sxy=sum((x-mean(x))*(y-mean(y)))
Sxx=sum((x-mean(x))^2)
Syy=sum((y-mean(y))^2)
(rho=Sxy/(sqrt(Sxx*Syy)))
## [1] 0.9367154
Either way, the correlation is pretty strong, which is aligned with our observation that the points are scattered around a hidden line.
Solution: we use the lm()
function.
# build a data frame with the variables
fuels=data.frame(x,y)
# run the linear regression
model <- lm(y ~ x, data=fuels)
# display the model summary
summary(model)
##
## Call:
## lm(formula = y ~ x, data = fuels)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83029 -0.73334 0.04497 0.69969 1.96809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.283 1.593 46.62 < 2e-16 ***
## x 14.947 1.317 11.35 1.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 18 degrees of freedom
## Multiple R-squared: 0.8774, Adjusted R-squared: 0.8706
## F-statistic: 128.9 on 1 and 18 DF, p-value: 1.227e-09
The line of best fit is \(y=b_0+b_1x= 74.283+14.947x\); the predicted values are \(\hat{y}_i=74.283+14.947x_i\), and the residuals are \(e_i=y_i-\hat{y}_i\).
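As a cross-check, the coefficients can be recovered from the closed-form least-squares formulas, re-using the quantities Sxx and Sxy computed earlier:
# closed-form least-squares estimates (should match coef(model))
b1 = Sxy/Sxx               # slope
b0 = mean(y) - b1*mean(x)  # intercept
c(b0, b1)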
We can plot the line of best fit and the residuals directly:
# for line of best fit, residual plots
library(ggplot2)
# line of best fit
ggplot(model) + geom_point(aes(x=x, y=y)) +
geom_line(aes(x=x, y=.fitted), color="blue" ) +
theme_bw()
# residuals
ggplot(model) + geom_point(aes(x=x, y=y)) + ### plotting residuals
geom_line(aes(x=x, y=.fitted), color="blue" ) +
geom_linerange(aes(x=x, ymin=.fitted, ymax=y), color="red") +
theme_bw()
Solution: if the linear regression assumptions are satisfied, we expect the residuals to follow a \(N(0,\sigma^2)\) distribution. The mean squared error (MSE) is \[ \hat{\sigma}^2 =\frac{\sum_ie_i^2}{n-2}=\frac{\sum_i(y_i-\hat{y}_i)^2}{n-2}=\frac{S_{yy}-b_1S_{xy}}{n-2},\] where \(n\) represents the number of observations.
To compute the variance in a full population, we would use a denominator of \(n\). For a population sample, we would use \(n-1\). For the regression error, we use \(n-2\) because \(2\) parameters had to be estimated in order to obtain \(\hat{y}_i\): \(b_0\) and \(b_1\).
We can compute it directly using the formula:
# number of observations
n=length(x)
# extracting b1 as the 2nd entry in model$coefficients (verify!)
sigma2 = (Syy-as.numeric(model$coefficients[2])*Sxy)/(n-2)
sigma2
## [1] 1.180545
But we can also extract it directly from the model summary:
# to find the estimated sigma (note that attributes(model) does not list sigma)
attributes(summary(model))
## $names
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
##
## $class
## [1] "summary.lm"
# getting the MSE from the summary
summary(model)$sigma^2
## [1] 1.180545
The estimate is \(\hat{\sigma}^2\approx 1.18\).
We can use R
and ggplot2
to present information about univariate datasets.
The grades for a midterm are shown below. Discuss the results.
grades<-c(80,73,83,60,49,96,87,87,60,53,66,83,32,80,66,90,72,55,76,46,48,69,45,48,77,52,59,97,76,89,73,73,48,59,55,76,87,55,80,90,83,66,80,97,80,55,94,73,49,32,76,57,42,94,80,90,90,62,85,87,97,50,73,77,66,35,66,76,90,73,80,70,73,94,59,52,81,90,55,73,76,90,46,66,76,69,76,80,42,66,83,80,46,55,80,76,94,69,57,55,66,46,87,83,49,82,93,47,59,68,65,66,69,76,38,99,61,46,73,90,66,100,83,48,97,69,62,80,66,55,28,83,59,48,61,87,72,46,94,48,59,69,97,83,80,66,76,25,55,69,76,38,21,87,52,90,62,73,73,89,25,94,27,66,66,76,90,83,52,52,83,66,48,62,80,35,59,72,97,69,62,90,48,83,55,58,66,100,82,78,62,73,55,84,83,66,49,76,73,54,55,87,50,73,54,52,62,36,87,80,80)
Solution: we can start by building a histogram of the grades.
hist(grades, breaks = seq(20,100,10))
We can find the mode of the data with this function:
fun.mode<-function(x){as.numeric(names(sort(-table(x)))[1])}
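For instance, applied to the grades:
# most frequent grade in the sample
fun.mode(grades)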
Let’s spruce up the histogram a bit by adding some details through ggplot2:
library(ggplot2)
ggplot(data=data.frame(grades), aes(grades)) +
geom_histogram(aes(y =..density..), # approximated pdf
breaks=seq(20, 100, by = 10), # 8 bins from 20 to 100
col="black", # colour of outline
fill="blue", # fill colour of bars
alpha=.2) + # transparency
geom_density(col=2) + # colour of pdf curve
geom_rug(aes(grades)) + # adding a rug on x-axis
geom_vline(aes(xintercept = mean(grades)),col='red',size=2) + # vertical line for mean
geom_vline(aes(xintercept = median(grades)),col='darkblue',size=2) + # vertical line for median
geom_vline(aes(xintercept = fun.mode(grades)),col='black',size=2) # vertical line for mode
We can also visualize the data using a boxplot:
boxplot(grades)
And get summary statistics using summary()
and psych
’s describe()
:
summary(grades)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 55.00 70.00 68.74 82.50 100.00
library(psych)
describe(grades)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 211 68.74 17.37 70 69.43 19.27 21 100 79 -0.37 -0.46
## se
## X1 1.2
How is the class doing?
What can we say about confidence intervals?
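As a quick sketch, a \(95\%\) confidence interval for the true mean grade can be obtained via t.test():
# 95% confidence interval for the true mean grade
t.test(grades)$conf.int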
A sample of size \(n=17\) is selected from a normal population with mean \(\mu=-3\) (which is unknown) and standard deviation \(\sigma=2\), which is known.
Solution: we sample using rnorm()
.
set.seed(0) # for replicability
n=17
mu=-3
sigma=2
x=rnorm(n,mu,sigma)
The histogram of the sample looks like:
hist(x)
The sample mean is given by:
mean(x)
## [1] -2.804917
The \(95\%\) confidence interval is
alpha = 0.05
lower.bound = mean(x) + qnorm(alpha/2)*sigma/sqrt(n) # read the documentation for qnorm(), to be certain to use the right form
upper.bound = mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
c(lower.bound,upper.bound)
## [1] -3.755640 -1.854195
We notice that \(\mu=-3\) is indeed found in the confidence interval:
lower.bound<mu & mu<upper.bound
## [1] TRUE
Solution: we just put all the code together in one routine:
set.seed(0) # for replicability
# parameters
n=17
mu=-3
sigma=2
alpha=0.05
N=1000
# initializing the vector which determines if mu is in the C.I.
is.mu.in <- c()
# initializing the vector of sample means
sample.means <- c()
for(j in 1:N){
x=rnorm(n,mu,sigma)
sample.means[j] = mean(x)
lower.bound = sample.means[j] + qnorm(alpha/2)*sigma/sqrt(n)
upper.bound = sample.means[j] + qnorm(1-alpha/2)*sigma/sqrt(n)
is.mu.in[j] = lower.bound<mu & mu<upper.bound
}
# proportion of the times when mu was in the C.I.
table(is.mu.in)/N
## is.mu.in
## FALSE TRUE
## 0.055 0.945
# histogram of the sample means (compare with histogram of sample)
hist(sample.means)
What happens if the parameters n
, alpha
, and/or N
are modified?
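One way to find out is to wrap the simulation in a helper function and vary the inputs; a sketch under the same normal-data, known-\(\sigma\) assumptions (the function name coverage.z is ours):
# estimated coverage of the z-based C.I. for given n, alpha, N
coverage.z <- function(n, alpha, N, mu = -3, sigma = 2){
  hits <- replicate(N, {
    x = rnorm(n, mu, sigma)
    lower = mean(x) + qnorm(alpha/2)*sigma/sqrt(n)
    upper = mean(x) + qnorm(1 - alpha/2)*sigma/sqrt(n)
    lower < mu & mu < upper
  })
  mean(hits)   # proportion of intervals containing mu
}
coverage.z(n = 17, alpha = 0.05, N = 1000)   # should be near 0.95
coverage.z(n = 100, alpha = 0.01, N = 1000)  # should be near 0.99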
Solution: we’ll use the same sample as in the first question, to make the comparison more meaningful.
set.seed(0) # for replicability
n=17
mu=-3
sigma=2
x=rnorm(n,mu,sigma)
The sample mean and the sample variance are given by:
mean(x)
## [1] -2.804917
var(x)
## [1] 4.286731
The \(95\%\) confidence interval in this case is
# read the documentation for qt(), to make sure you are using it properly
alpha = 0.05
lower.bound = mean(x) + qt(alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
upper.bound = mean(x) + qt(1-alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
c(lower.bound,upper.bound)
## [1] -3.869441 -1.740394
We notice that \(\mu=-3\) is indeed found in the confidence interval:
lower.bound<mu & mu<upper.bound
## [1] TRUE
Solution: we just put all the code together in one routine:
set.seed(0) # for replicability
# parameters
n=17
mu=-3
sigma=2
alpha=0.05
N=1000
# initializing the vector which determines if mu is in the C.I.
is.mu.in <- c()
# initializing the vector of sample means
sample.means <- c()
for(j in 1:N){
x=rnorm(n,mu,sigma)
sample.means[j] = mean(x)
lower.bound = sample.means[j] + qt(alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
upper.bound = sample.means[j] + qt(1-alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
is.mu.in[j] = lower.bound<mu & mu<upper.bound
}
# proportion of the times when mu was in the C.I.
table(is.mu.in)/N
## is.mu.in
## FALSE TRUE
## 0.05 0.95
# histogram of the sample means (compare with histogram of sample)
hist(sample.means)
Any surprises?
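Perhaps one: the \(t-\)based intervals are a little wider than the \(z-\)based ones, since they must also absorb the uncertainty in estimating \(\sigma\). Compare the multipliers:
# interval half-width multipliers at the 95% level, for n = 17
qt(0.975, df = 16)  # t-based
qnorm(0.975)        # z-based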
5.4.9 in JWHT, p. 201
library(MASS)
set.seed(1234)
attach(Boston)
# a) sample mean of medv
medv.mean = mean(medv)
medv.mean
## [1] 22.53281
# b) standard error of the sample mean
medv.err = sd(medv)/sqrt(length(medv))
medv.err
## [1] 0.4088611
# c) bootstrap estimate of the standard error of the mean
boot.fn = function(data, index){return(mean(data[index]))}
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
bstrap = boot(medv, boot.fn, 1000)
bstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0199747 0.4173862
# bstrap$t holds the bootstrap replicates; their SD estimates the standard error
sd.medv = sd(bstrap$t)
# d) 95% confidence interval for the mean: t-test vs. bootstrap
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
c(bstrap$t0 - 2*sd.medv, bstrap$t0 + 2*sd.medv)
## [1] 21.69803 23.36758
# e) sample median of medv
medv.med = median(medv)
medv.med
## [1] 21.2
# f) bootstrap estimate of the standard error of the median
boot.fn = function(data, index){
return(median(data[index]))}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0243 0.3736497
# g) tenth percentile of medv
medv.tenth = quantile(medv, c(0.1))
medv.tenth
## 10%
## 12.75
# h) bootstrap estimate of the standard error of the tenth percentile
boot.fn = function(data, index){
return(quantile(data[index], c(0.1)))}
boot(medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0389 0.5016258
There are built in functions in R
that allow for hypothesis testing. For instance,
t.test(x,mu=5)
tests for \(H_0:\mu=5\) against \(H_1:\mu\not=5\) when \(\sigma\) is unknown (\(2-\)sided \(t-\)test)
t.test(x,mu=5,alternative="greater")
tests for \(H_0:\mu=5\) against \(H_1:\mu>5\) when \(\sigma\) is unknown (\(1-\)sided \(t-\)test)
t.test(x,mu=5,alternative="less")
tests for \(H_0:\mu=5\) against \(H_1:\mu<5\) when \(\sigma\) is unknown (\(1-\)sided \(t-\)test)
t.test(x,y,var.equal=TRUE)
tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1\neq \mu_2\) in case of two independent samples, when variances are unknown but equal.
t.test(x,y,var.equal=TRUE,alternative="greater")
tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1>\mu_2\) in case of two independent samples, when variances are unknown but equal.
t.test(x,y,var.equal=TRUE,alternative="less")
tests for \(H_0:\mu_1=\mu_2\) against \(H_1:\mu_1<\mu_2\) in case of two independent samples, when variances are unknown but equal.
For all these tests, we reject the null hypothesis \(H_0\) at significance level \(\alpha\) if the \(p-\)value of the test is below \(\alpha\) (which means that the probability of wrongly rejecting \(H_0\) when \(H_0\) is in fact true is below \(\alpha\), usually taken to be \(0.05\) or \(0.01\)).
If the \(p-\)value of the test is greater than the significance level \(\alpha\), then we fail to reject the null hypothesis \(H_0\) at significance level \(\alpha\), WHICH IS NOT THE SAME AS ACCEPTING THE NULL HYPOTHESIS!!
Note that the \(p-\)value for the test will appear in the output, but it can also be computed directly using the appropriate formula. The corresponding \(95\%\) confidence intervals also appear in the output.
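For instance, for the \(2-\)sided one-sample test, the \(p-\)value can be recomputed from the \(t-\)statistic; a minimal sketch (the helper name is ours):
# 2-sided p-value for H0: mu = mu0, computed from first principles
p.two.sided <- function(x, mu0){
  t0 = (mean(x) - mu0)/(sd(x)/sqrt(length(x)))
  2*pt(-abs(t0), df = length(x) - 1)
}
# e.g., p.two.sided(x, 5) should match t.test(x, mu = 5)$p.value for the sample below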
x=c(4,5,4,6,4,4,5)
Let \(\mu_x\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_x=5\)?
Solution: we can test for \(H_0: \mu_x=5\) against \(H_1:\mu_x\neq 5\) simply by calling
t.test(x,mu=5)
##
## One Sample t-test
##
## data: x
## t = -1.4412, df = 6, p-value = 0.1996
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 3.843764 5.299093
## sample estimates:
## mean of x
## 4.571429
All the important information is in the output: the observed \(t-\)statistic \(t_0=-1.4412\), to be compared against Student's \(T-\)distribution with \(n-1=6\) degrees of freedom, the \(p-\)value \(=0.1996\) (the probability, under \(H_0\), of observing a test statistic at least as extreme as \(t_0\)), and the \(95\%\) confidence interval \((3.843764, 5.299093)\) for \(\mu_x\), whose point estimate is \(\overline{x}=4.571429\).
Since the \(p-\)value is greater than \(\alpha=0.05\), we fail to reject the null hypothesis that \(\mu_x=5\); there is not enough evidence in the data to categorically state that \(\mu_x\neq 5\).
(is it problematic that the sample size \(n=7\) is small?)
y=c(1,2,1,4,3,2,4,3,2)
Let \(\mu_y\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_y=5\)?
Solution: we can test for \(H_0: \mu_y=5\) against \(H_1:\mu_y\neq 5\) simply by calling
t.test(y,mu=5)
##
## One Sample t-test
##
## data: y
## t = -6.7823, df = 8, p-value = 0.0001403
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 1.575551 3.313338
## sample estimates:
## mean of x
## 2.444444
The \(p-\)value is \(0.0001403\), which is substantially smaller than \(\alpha=0.05\), and we reject the null hypothesis that the true mean is \(5\). The test provides no information about what the true mean could be, but the \(95\%\) confidence interval \((1.575551, 3.313338)\) does: we would expect \(\mu_y\approx 2.5\).
Solution: let’s run
t.test(y,mu=2.5)
##
## One Sample t-test
##
## data: y
## t = -0.14744, df = 8, p-value = 0.8864
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 1.575551 3.313338
## sample estimates:
## mean of x
## 2.444444
Large \(p-\)value, much above \(\alpha=0.05\)… we’d like to say that this clinches it: we accept the null hypothesis! But no, no. We can’t do that. All we can say, sadly, is that we do not have enough evidence to reject the null hypothesis that \(\mu_y=2.5\).
Suppose that a researcher wants to determine if, as she believes, a new teaching method enables students to understand elementary statistical concepts better than the traditional lectures given in a university setting.
She recruits \(N=80\) second-year students to test her claim. The students are randomly assigned to one of two groups:
students in group \(A\) are given the traditional lectures,
whereas students in group \(B\) are taught using the new teaching method.
After three weeks, a short quiz is administered to the students in order to assess their understanding of statistical concepts.
The results are found in the teaching.csv
dataset.
teaching <- read.csv("Data/teaching.csv", header = TRUE)
head(teaching)
## ID Group Grade
## 1 1 B 75.5
## 2 2 B 77.5
## 3 3 A 73.5
## 4 4 A 75.0
## 5 5 B 77.0
## 6 6 A 79.0
Is there enough evidence to suggest that the new teaching is more effective (as measured by test performance)?
Solution: we can summarize the results (sample size, sample mean, sample variance) as follows:
counts.by.group = aggregate(x = teaching$Grade,
by = list(teaching$Group),
FUN = length)
means.by.group = aggregate(x = teaching$Grade,
by = list(teaching$Group),
FUN = mean)
variances.by.group = aggregate(x = teaching$Grade,
by = list(teaching$Group),
FUN = var)
library(dplyr) # needed for %>% and full_join()
teaching.summary <- counts.by.group %>%
full_join(means.by.group,
by="Group.1" ) %>%
full_join(variances.by.group,
by="Group.1" )
colnames(teaching.summary) <- c("Group",
"Sample Size",
"Sample Mean",
"Sample Variance")
teaching.summary
## Group Sample Size Sample Mean Sample Variance
## 1 A 40 75.125 6.650641
## 2 B 40 79.000 5.538462
If the researcher assumes that both groups have similar background knowledge prior to being taught (which she attempts to enforce by randomising the group assignment), then the effectiveness of the teaching methods may be compared using two hypotheses: the null hypothesis \(H_0\) and the alternative \(H_{1}\).
Let \(\mu_i\) represent the true performance of method \(i\).
Since the researcher wants to claim that the new method is more effective than the traditional ones, it is most appropriate for her to use one-sided hypothesis testing with \[H_{0}: \mu_{A} \geq \mu_{B} \quad\mbox{against}\quad H_{1}: \mu_{A} < \mu_{B}.\]
The testing procedure is simple:
calculate an appropriate test statistic under \(H_0\);
reject \(H_0\) in favour of \(H_1\) if the test statistic falls in the critical region (also called the rejection region) of an associated distribution, and
fail to reject \(H_0\) otherwise (WHICH I REMIND YOU IS NOT THE SAME AS ACCEPTING \(H_0\)!!)
In this case, she wants to use a two-sample \(t-\)test. Assuming that the variability in the two groups is roughly the same, the test statistic is given by:
\[ t_{0}=\frac{\bar{y}_{B}-\bar{y}_{A}}{S_{p}\sqrt{\frac{1}{N_{A}}+\frac{1}{N_{B}}}}, \] where the pooled variance \(S^{2}_{p}\) is \[ S^{2}_{p}=\frac{(N_{A}-1)S^{2}_{A}+(N_{B}-1)S^{2}_{B}}{N_{A}+N_{B}-2}. \]
With her data, she obtains the following \(t-\)statistic:
# number of observations in each group
N.A = teaching.summary[1,2]
N.B = teaching.summary[2,2]
N=N.A+N.B
# sample mean score in each group
y.bar.A = teaching.summary[1,3]
y.bar.B = teaching.summary[2,3]
# sample variance of scores in each group
S2.A = teaching.summary[1,4]
S2.B = teaching.summary[2,4]
# sample pooled variance of scores
S2.P = ((N.A-1)*S2.A+(N.B-1)*S2.B)/(N.A+N.B-2)
# t-statistic
t0 = (y.bar.B - y.bar.A) / sqrt(S2.P*(1/N.A+1/N.B))
t0
## [1] 7.019656
The test statistic value is \(t_{0} = 7.02\).
In order to reject or fail to reject the null hypothesis, she needs to compare it against the critical value of the Student \(T\) distribution with \(N-2=78\) degrees of freedom at significance level \(\alpha=0.05\), say.
# significance level
alpha=0.05
# this will give you a critical value on the wrong side of the distribution's mean
t.star = qt(alpha,N-2)
t.star
## [1] -1.664625
# this gives the correct critical value
t.star = qt(alpha,N-2, lower.tail=FALSE)
t.star
## [1] 1.664625
The appropriate critical value is \[t^*= t_{1-\alpha, N-2}=t_{0.95, 78}=1.665.\]
Since \(t_{0} > t^*\) at \(\alpha=0.05\), she happily rejects the null hypothesis \(H_0:\mu_A\geq \mu_B\), which is to say that she has enough evidence to support the claim that the new teaching method is more effective than the traditional methods, at \(\alpha=0.05\).
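As a cross-check, the built-in two-sample test should reproduce her computation; since the factor level A comes first, t.test() works with \(\bar{y}_A-\bar{y}_B\), so the appropriate alternative is "less":
# one-sided pooled two-sample t-test of H1: mu_A < mu_B
t.test(Grade ~ Group, data = teaching, alternative = "less", var.equal = TRUE)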
Consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels (\(x\)) and pure oxygen levels (\(y\)) in fuels:
x=c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)
y=c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65,
93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
The line of best fit is given by the lm()
function.
# build a data frame with the variables
fuels=data.frame(x,y)
# run the linear regression
model <- lm(y ~ x, data=fuels)
# display the model summary
summary(model)
##
## Call:
## lm(formula = y ~ x, data = fuels)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83029 -0.73334 0.04497 0.69969 1.96809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.283 1.593 46.62 < 2e-16 ***
## x 14.947 1.317 11.35 1.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 18 degrees of freedom
## Multiple R-squared: 0.8774, Adjusted R-squared: 0.8706
## F-statistic: 128.9 on 1 and 18 DF, p-value: 1.227e-09
# some quantities that will be useful
n= length(x)
Sxx=sum((x-mean(x))^2)
Syy=sum((y-mean(y))^2)
Sxy=sum((x-mean(x))*(y-mean(y)))
sigma2 = (Syy-as.numeric(model$coefficients[2])*Sxy)/(n-2)
b0 = as.numeric(model$coefficients[1]) ### intercept
b1 = as.numeric(model$coefficients[2]) ### slope
beta00 = 75 ### for 1.
beta10 = 10 ### for 2.
We might be interested in testing whether the true intercept \(\beta_0\) is equal to some candidate value \(\beta_{0,0}\), i.e. \[H_0: \beta_0=\beta_{0,0}~~\mbox{against}~~ H_1:\beta_0\not= \beta_{0,0}.\]
The test statistic for this situation \[ t_0=\frac{b_0-\beta_{0,0}}{\sqrt{\hat{\sigma}^2\frac{\sum x_i^2}{nS_{xx}}}} \] follows a Student \(t-\)distribution with \(n-2\) degrees of freedom.
The critical/rejection region in this case is:
alternative hypothesis: \(H_1:\beta_0>\beta_{0,0}\); critical/rejection region: \(t_0>t_\alpha(n-2)\)
alternative hypothesis: \(H_1:\beta_0<\beta_{0,0}\); critical/rejection region: \(t_0<-t_\alpha(n-2)\)
alternative hypothesis: \(H_1:\beta_0\neq \beta_{0,0}\); critical/rejection region: \(|t_0|>t_{\alpha/2}(n-2)\)
The decision rule is: Reject \(H_0\) if \(t_0\) is in the critical region.
If instead we are interested in testing whether the true slope \(\beta_1\) is equal to some candidate value \(\beta_{1,0}\), i.e. \[H_0: \beta_1=\beta_{1,0}~~\mbox{against}~~ H_1:\beta_1\not= \beta_{1,0},\]
the test statistic for this situation is \[ t_0=\frac{b_1-\beta_{1,0}}{\sqrt{\hat{\sigma}^2/S_{xx}}} \] which follows a Student \(t-\)distribution with \(n-2\) degrees of freedom.
The critical/rejection region in this case is:
alternative hypothesis: \(H_1:\beta_1>\beta_{1,0}\); critical/rejection region: \(t_0>t_\alpha(n-2)\)
alternative hypothesis: \(H_1:\beta_1<\beta_{1,0}\); critical/rejection region: \(t_0<-t_\alpha(n-2)\)
alternative hypothesis: \(H_1:\beta_1\neq \beta_{1,0}\); critical/rejection region: \(|t_0|>t_{\alpha/2}(n-2)\)
The decision rule is: Reject \(H_0\) if \(t_0\) is in the critical region.
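Equivalently, each test can be summarized through a \(p-\)value; for instance, for the \(2-\)sided test of \(H_0:\beta_1=0\), a sketch using the quantities defined above (the result should agree with the slope line of summary(model)):
# 2-sided p-value for H0: beta1 = 0
t0 = b1/sqrt(sigma2/Sxx)
2*pt(-abs(t0), df = n - 2)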
Solution:
# test statistic
(t0a = (b0-beta00)/sqrt(sigma2*sum(x^2)/n/Sxx))
## [1] -0.4497632
# critical value at alpha=0.05
(crit_t005_18a = qt(0.05,n-2))
## [1] -1.734064
# test for critical region
t0a < crit_t005_18a
## [1] FALSE
As the test statistic does not fall in the critical region (it lies above the critical value), we fail to reject \(H_0\).
Solution:
# test statistic
(t0b = (b1-beta10)/sqrt(sigma2/Sxx))
## [1] 3.757318
# critical value at alpha=0.05
(crit_t005_18b = - qt(0.05,n-2))
## [1] 1.734064
# test for critical region
t0b > crit_t005_18b
## [1] TRUE
As the test statistic is larger than the critical value, it falls in the critical region and we reject \(H_0\).
Solution:
# test statistic
(t0c = b1/sqrt(sigma2/Sxx))
## [1] 11.35173
# critical value at alpha=0.05
(crit_t0025_18c = - qt(0.025,18))
## [1] 2.100922
# test for critical region
abs(t0c) > crit_t0025_18c
## [1] TRUE
As \(|t_0|\) is larger than the critical value, it falls in the critical region and we reject \(H_0\).
Here we give a pot-pourri of miscellaneous results.
Suppose you are on a game show, and you are given the choice of three doors. Behind one door is a prize; behind the others, dirty and smelly rubbish bins.
You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, behind which is a bin. She then says to you, “Do you want to switch from door No. 1 to No. 2?” Is it to your advantage to do so?
Solution: this classic problem has left even the greatest minds gasping for air.
In what follows, let \(S\) and \(D\) represent the events that switching to another door is a successful strategy and that the prize is behind the original door, respectively.
We start by tackling a different scenario, in which the host DOES NOT open one of the two other doors after you have selected a door. What is the probability that switching to another door would prove to be a successful strategy?
If the prize is behind the original door, switching would succeed \(0\%\) of the time: \(P(S\mid D)=0\). Note that the prior is \(P(D)=1/3\).
If the prize is not behind the original door, switching would succeed \(50\%\) of the time: \(P(S \mid D^c)=1/2\). Note that the prior is \(P(D^c)=2/3\).
Thus, \[\begin{align*}P(S) = P(S\mid D) P(D) +P(S\mid D^c)P(D^c) =0\cdot \frac{1}{3}+\frac{1}{2}\cdot \frac{2}{3}\approx 33\%. \end{align*}\]
Does this make sense to you? There is no sense in switching because, in this scenario, switching is equivalent to having changed your mind about which door to select: the prize is behind one of the three doors, and whether you select the door originally or after having been offered the choice to switch, the probability that the prize is behind any door is just 1/3.
Now let’s return to the original scenario, in which the host opens one of the other two doors to show a rubbish bin. What is the probability that switching to another door in this scenario would prove to be a successful strategy?
If the prize is behind the original door, switching would still succeed \(0\%\) of the time: \(P(S\mid D)=0\). Note that the prior is \(P(D)=1/3\).
If the prize is not behind the original door, switching would succeed \(100\%\) of the time: \(P(S\mid D^c)=1\). Note that the prior is \(P(D^c)=2/3\).
Thus, \[\begin{align*}P(S) = P(S\mid D) P(D) +P(S\mid D^c)P(D^c) =0\cdot \frac{1}{3}+1\cdot \frac{2}{3}\approx 67\%. \end{align*}\]
Evidently, it makes sense to switch in the original scenario. This is counter-intuitive to a lot of people: the prize has a probability 1/3 of being behind any of the 3 doors, so why is the probability not 1/3, as described above? Isn’t this just the same problem?
Well, something has changed: the information in your possession changes once the host opens one of the two non-prize doors. You now know the location of one of the bins with \(100\%\) certainty.
Do you find this explanation satisfying?
Perhaps you would rather see what happens in practice: if you could pit two players against one another (one who never switches and one who always does so) in a series of Monty Hall games, which one would come out on top in the long run?
# number of games (if N is too small, we don't have long-run behaviour)
N=500
# fixing the seed for replicability purposes
set.seed(1234)
# placing the prize behind one of the 3 doors for each game
locations = sample(c("A","B","C"), N, replace = TRUE)
# verification that the prize gets placed behind each door roughly 33% of the time
table(locations)/N
## locations
## A B C
## 0.302 0.344 0.354
# getting the player to select a door for each of the N games (in practice, this should be independent of where the prize actually is)
player.guesses = sample(c("A","B","C"), N, replace = TRUE)
# create a data frame telling the analyst where the prize is, and what door the player has selected
games = data.frame(locations, player.guesses)
head(games)
## locations player.guesses
## 1 B B
## 2 B B
## 3 A B
## 4 C C
## 5 A C
## 6 A A
# how often does the player guess correctly, before a door is opened (should be about 33%)
table(games$locations==games$player.guesses)
##
## FALSE TRUE
## 333 167
# initialize the process to find out which door the host will open
games$open.door <- NA
# for each game, the host opens a door which is not the one selected by the player, nor the one behind which the prize is found
# the union() call enumerates the doors that the host cannot open
# the setdiff() call finds the complement of the doors that the host cannot open (i.e., the doors that she can open)
# the sample() call picks one of those doors
for(j in 1:N){
games$open.door[j] <- sample(setdiff(c("A","B","C"), union(games$locations[j],games$player.guesses[j])), 1)
}
head(games)
## locations player.guesses open.door
## 1 B B C
## 2 B B C
## 3 A B C
## 4 C C A
## 5 A C B
## 6 A A B
# if the player never switches, they win whenever they had originally guessed the location of the prize correctly
games$no.switch.win <- games$player.guesses==games$locations
# let's find which door the player would have selected if they always switched (the door that is neither the opened door nor the one they had originally selected)
games$switch.door <- NA
for(j in 1:N){
games$switch.door[j] <- sample(setdiff(c("A","B","C"), union(games$open.door[j],games$player.guesses[j])), 1)
}
# if the player always switches, they win whenever their switched guess is where the prize is located
games$switch.win <- games$switch.door==games$locations
head(games)
## locations player.guesses open.door no.switch.win switch.door switch.win
## 1 B B C TRUE A FALSE
## 2 B B C TRUE A FALSE
## 3 A B C FALSE A TRUE
## 4 C C A TRUE B FALSE
## 5 A C B FALSE A TRUE
## 6 A A B TRUE C FALSE
# chance of winning by not switching
table(games$no.switch.win)/N
##
## FALSE TRUE
## 0.666 0.334
# chance of winning by switching
table(games$switch.win)/N
##
## FALSE TRUE
## 0.334 0.666
Pretty wild, eh?