(based on a Case Study by L.Torgo)
This notebook provides a concrete and thorough illustration of the data science process using a realistic dataset: algae_blooms.csv
, which is also available at the UCI Machine Learning Repository.
The problem is to predict the occurrence of harmful algae in water samples.
We also use it to highlight various aspects of data exploration, data cleaning, and R syntax.
Before we can take a look at the data and begin the process in earnest, we need to load it in the in the R workspace.
The dataset is stored in the CSV file algae_blooms.csv
(in the Data
directory).
# The data can also be loaded as part of Torgo's DMwR package,
# which could be loaded directly as below (by uncommenting the
# line of code), but that is not how we will access it.
# library(DMwR)
As the data resides in a CSV file, we use the read.csv
R function to load it and store it to an R data frame (Torgo stores it to a tibble
object).
(Note the sep
and header
options – what do you think their function is?)
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame’s structure by calling the str
function.
str(algae_blooms)
## 'data.frame': 340 obs. of 18 variables:
## $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
## $ size : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
## $ mxPH : num 8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
## $ mnO2 : num 9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
## $ Cl : num 60.8 57.8 40 77.4 55.4 ...
## $ NO3 : num 6.24 1.29 5.33 2.3 10.42 ...
## $ NH4 : num 578 370 346.7 98.2 233.7 ...
## $ oPO4 : num 105 428.8 125.7 61.2 58.2 ...
## $ PO4 : num 170 558.8 187.1 138.7 97.6 ...
## $ Chla : num 50 1.3 15.6 1.4 10.5 ...
## $ a1 : num 0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
## $ a2 : num 0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
## $ a3 : num 0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
## $ a4 : num 0 1.9 0 0 0 0 3.9 0 0 2.9 ...
## $ a5 : num 34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
## $ a6 : num 8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
## $ a7 : num 0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...
Evidently, algae_blooms
is a data frame with 340 observations of 18 variables each.
Notes: - 3 of the fields are categorical (season
, size
, speed
, which refer to the data collection process) - of the numerical fields, 8 have names that sound vaguely “chemical” - presumably, the remaining fields refer to the algae blooms
We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head()
or tail()
functions.
head(algae_blooms)
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
## 1 winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3
## 3 autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4
## 5 autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5
## 6 winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4
## a1 a2 a3 a4 a5 a6 a7
## 1 0.0 0.0 0.0 0.0 34.2 8.3 0.0
## 2 1.4 7.6 4.8 1.9 6.7 0.0 2.1
## 3 3.3 53.6 1.9 0.0 0.0 0.0 9.7
## 4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
## 5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
## 6 15.1 14.6 1.4 0.0 22.5 12.6 2.9
First, we filter the algae_blooms
dataset to remove the 2 observations with missing values (is this always a good idea?)
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) # %>%
# filter(NH4<3000) %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
## 1 winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3
## 3 autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4
## 5 autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5
## 6 winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4
## a1 a2 a3 a4 a5 a6 a7
## 1 0.0 0.0 0.0 0.0 34.2 8.3 0.0
## 2 1.4 7.6 4.8 1.9 6.7 0.0 2.1
## 3 3.3 53.6 1.9 0.0 0.0 0.0 9.7
## 4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
## 5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
## 6 15.1 14.6 1.4 0.0 22.5 12.6 2.9
nrow(f.NH4.data)
## [1] 338
Next we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd “argument”)
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) # %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
## 1 winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3
## 3 autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4
## 5 autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5
## 6 winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4
## a1 a2 a3 a4 a5 a6 a7
## 1 0.0 0.0 0.0 0.0 34.2 8.3 0.0
## 2 1.4 7.6 4.8 1.9 6.7 0.0 2.1
## 3 3.3 53.6 1.9 0.0 0.0 0.0 9.7
## 4 3.1 41.0 18.9 0.0 1.4 0.0 1.4
## 5 9.2 2.9 7.5 0.0 7.5 4.1 1.0
## 6 15.1 14.6 1.4 0.0 22.5 12.6 2.9
nrow(f.NH4.data)
## [1] 327
Finally, we add a new variable to tell us in which quartile the NH4 value falls.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla
## 1 winter small medium 8.00 9.8 60.800 6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35 8.0 57.750 1.288 370.000 428.750 558.750 1.3
## 3 autumn small medium 8.10 11.4 40.020 5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07 4.8 77.364 2.302 98.182 61.182 138.700 1.4
## 5 autumn small medium 8.06 9.0 55.350 10.416 233.700 58.222 97.580 10.5
## 6 winter small high 8.25 13.1 65.750 9.248 430.000 18.250 56.667 28.4
## a1 a2 a3 a4 a5 a6 a7 q.NH4
## 1 0.0 0.0 0.0 0.0 34.2 8.3 0.0 (226,2.17e+03]
## 2 1.4 7.6 4.8 1.9 6.7 0.0 2.1 (226,2.17e+03]
## 3 3.3 53.6 1.9 0.0 0.0 0.0 9.7 (226,2.17e+03]
## 4 3.1 41.0 18.9 0.0 1.4 0.0 1.4 (36.8,103]
## 5 9.2 2.9 7.5 0.0 7.5 4.1 1.0 (226,2.17e+03]
## 6 15.1 14.6 1.4 0.0 22.5 12.6 2.9 (226,2.17e+03]
nrow(f.NH4.data)
## [1] 327
We can now use the new q.NH4
variable to make multi-variate comparisons, say between a1
, NH4
, and season
.
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
geom_point() +
facet_wrap(~q.NH4) + # for each of the quartiles (each plot contains ~25% of the observations)
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
That’s odd… why are we seeing missing values here? Haven’t we removed the NAs? Let’s delve in a bit deeper.
f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]
## season size speed mxPH mnO2 Cl NO3 NH4 oPO4 PO4 Chla a1 a2 a3 a4
## 53 spring small medium 5.6 11.8 NA 2.22 5 1 1 NA 82.7 0 0 0
## 223 autumn small high 5.9 11.9 NA 1.88 5 1 2 NA 75.2 0 0 0
## a5 a6 a7 q.NH4
## 53 0 0 0 <NA>
## 223 0 0 0 <NA>
table(f.NH4.data$q.NH4,useNA="ifany")
##
## (5,36.8] (36.8,103] (103,226] (226,2.17e+03] <NA>
## 80 82 81 82 2
Ah, it seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate()
call.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
##
## [5,36.8] (36.8,103] (103,226] (226,2.17e+03]
## 82 82 81 82
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
The NAs have disappeared from the graph (but the observations have not!). Either way, it seems as though the a1
levels are inversely correlated with the NH4+
levels but that the season doesn’t have much of an effect (although we would need more work to conclude with any degree of certainty).
ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")
Our new dataset still contains observations with missing cases, however.
algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)
## [1] 338 18
What can we do with the other observations for which values are missing?
One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with annother field. That way, if there is a missing value in the first field, the second could be used to impute it.
IMPORTANT NOTE: - this approach only works for variables that are linearly correlated to a single other variable. Non-linear correlations and multi-variate associations will not be uncovered.
symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))
## mP mO Cl NO NH o P Ch a1 a2 a3 a4 a5 a6 a7
## mxPH 1
## mnO2 1
## Cl 1
## NO3 1
## NH4 . 1
## oPO4 . 1
## PO4 . . . * 1
## Chla . 1
## a1 . . . 1
## a2 1
## a3 1
## a4 . . . . 1
## a5 1
## a6 . . 1
## a7 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
This might be a bit hard to read. We can use the corrplot
library to visualize the (linear) correlations.
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")
The correlation between PO4
(which has missing cases) and oPO4
(which does not, anymore) is clear.
What is the nature of the relation? We’ll use the set of complete cases to find it.
algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]
dim(algae_blooms.nona)
## [1] 306 18
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model
##
## Call:
## lm(formula = PO4 ~ oPO4, data = algae_blooms.nona)
##
## Coefficients:
## (Intercept) oPO4
## 51.811 1.203
The regression function is PO4
\(=51.811+1.203\)oPO4
(note that I’m not really interested in the fit statistics).
Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]
What are the observations for which PO4
is missing?
which(is.na(algae_blooms.sna$PO4)==TRUE)
## [1] 28 221 291 326 331 335
Let’s use the regression function to impute the missing PO4
values.
algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4),max(Intercept + Slope*algae_blooms.sna2$oPO4,0),algae_blooms.sna2$PO4)
We can clearly see that no values of PO4
are missing anymore.
which(is.na(algae_blooms.sna2$PO4)==TRUE)
## integer(0)
That takes care of the missing values with strong linear correlation to another field. Where do we stand now?
summary(algae_blooms.sna2)
## season size speed mxPH mnO2
## spring:84 small :120 low : 58 Min. :5.600 Min. : 1.500
## summer:85 medium:136 medium:138 1st Qu.:7.750 1st Qu.: 8.000
## autumn:80 large : 82 high :142 Median :8.055 Median : 9.700
## winter:89 Mean :8.002 Mean : 9.161
## 3rd Qu.:8.400 3rd Qu.:10.800
## Max. :9.700 Max. :13.400
## NA's :2 NA's :1
## Cl NO3 NH4 oPO4
## Min. : 0.222 Min. : 0.000 Min. : 5.00 Min. : 1.00
## 1st Qu.: 10.994 1st Qu.: 1.147 1st Qu.: 37.86 1st Qu.: 13.00
## Median : 32.470 Median : 2.356 Median : 107.36 Median : 37.24
## Mean : 42.517 Mean : 3.121 Mean : 471.73 Mean : 73.09
## 3rd Qu.: 57.750 3rd Qu.: 4.147 3rd Qu.: 244.90 3rd Qu.: 88.11
## Max. :391.500 Max. :45.650 Max. :24064.00 Max. :1435.00
## NA's :14
## PO4 Chla a1 a2
## Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
## 1st Qu.: 40.75 1st Qu.: 2.133 1st Qu.: 1.50 1st Qu.: 0.000
## Median : 104.86 Median : 5.111 Median : 7.10 Median : 2.800
## Mean : 166.18 Mean : 12.796 Mean :16.74 Mean : 7.207
## 3rd Qu.: 206.12 3rd Qu.: 17.200 3rd Qu.:25.32 3rd Qu.:10.025
## Max. :1777.93 Max. :110.456 Max. :89.80 Max. :72.600
## NA's :21
## a3 a4 a5 a6
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.400 Median : 0.000 Median : 2.200 Median : 0.000
## Mean : 3.917 Mean : 1.812 Mean : 5.548 Mean : 6.438
## 3rd Qu.: 4.675 3rd Qu.: 2.300 3rd Qu.: 8.000 3rd Qu.: 7.075
## Max. :42.800 Max. :44.600 Max. :61.100 Max. :77.600
##
## a7
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 2.2
## 3rd Qu.: 2.2
## Max. :31.6
##
Alright, so we don’t have as many missing values as before, but we still have some. And the correlation trick isn’t going to work this time.
What can we do? There are many ways to tackle the problem, but we will use \(k\)NN imputation. The principle is simple. Using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the \(k\) nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing case in the \(k-\)group of complete observations, and use that value as the imputed value.
IMPORTANT NOTES: - as we have seen when we were discussing, we often suggest scaling the data when dealing with distance metrics. I elected not to scale the data explicitely here. How much of an effect can that have?
- we are going to be using DMwR
s implementation of knnImputation()
(below). How would you go about determining if the routine scales the data internally?
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.
Sure enough, there are no further observations with incomplete cases.
summary(algae_blooms.sna2)
## season size speed mxPH mnO2
## spring:84 small :120 low : 58 Min. :5.600 Min. : 1.500
## summer:85 medium:136 medium:138 1st Qu.:7.755 1st Qu.: 8.000
## autumn:80 large : 82 high :142 Median :8.045 Median : 9.750
## winter:89 Mean :8.001 Mean : 9.165
## 3rd Qu.:8.393 3rd Qu.:10.800
## Max. :9.700 Max. :13.400
## Cl NO3 NH4 oPO4
## Min. : 0.222 Min. : 0.000 Min. : 5.00 Min. : 1.00
## 1st Qu.: 10.581 1st Qu.: 1.147 1st Qu.: 37.86 1st Qu.: 13.00
## Median : 31.233 Median : 2.356 Median : 107.36 Median : 37.24
## Mean : 41.326 Mean : 3.121 Mean : 471.73 Mean : 73.09
## 3rd Qu.: 57.323 3rd Qu.: 4.147 3rd Qu.: 244.90 3rd Qu.: 88.11
## Max. :391.500 Max. :45.650 Max. :24064.00 Max. :1435.00
## PO4 Chla a1 a2
## Min. : 1.00 Min. : 0.200 Min. : 0.00 Min. : 0.000
## 1st Qu.: 40.75 1st Qu.: 2.100 1st Qu.: 1.50 1st Qu.: 0.000
## Median : 104.86 Median : 5.042 Median : 7.10 Median : 2.800
## Mean : 166.18 Mean : 12.328 Mean :16.74 Mean : 7.207
## 3rd Qu.: 206.12 3rd Qu.: 16.021 3rd Qu.:25.32 3rd Qu.:10.025
## Max. :1777.93 Max. :110.456 Max. :89.80 Max. :72.600
## a3 a4 a5 a6
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.400 Median : 0.000 Median : 2.200 Median : 0.000
## Mean : 3.917 Mean : 1.812 Mean : 5.548 Mean : 6.438
## 3rd Qu.: 4.675 3rd Qu.: 2.300 3rd Qu.: 8.000 3rd Qu.: 7.075
## Max. :42.800 Max. :44.600 Max. :61.100 Max. :77.600
## a7
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 2.2
## 3rd Qu.: 2.2
## Max. :31.6
table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
##
## 0
## 338
We see that the adjusted \(R^2\) coefficient is fairly small. Furthermore, if the linear model is a good fit, the residuals should have a mean of 0 and be as “small”. That being said, the \(F-\)statistic seems to indicate that there is some (linear) dependence on the predictor variables.
We can get a better handle on the regression diagnostics by calling the plot()
method on the object.
plot(linear.model.a2)
The fit isn’t much better, but an ANOVA on the 2 suggested models shows that we’re ~88% that the models are different (see below).
anova(linear.model.a2,linear.model.a2.mod)
## Analysis of Variance Table
##
## Model 1: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 +
## PO4 + Chla
## Model 2: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + oPO4 +
## PO4 + Chla
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 202 23309
## 2 203 23453 -1 -143.86 1.2467 0.2655
The function step
uses AIC to perform a model search (using backward elimination), implemented in the method step()
.
final.linear.model.a2 <- step(linear.model.a2)
## Start: AIC=1050.52
## a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 +
## PO4 + Chla
##
## Df Sum of Sq RSS AIC
## - season 3 157.44 23466 1046.0
## - oPO4 1 6.20 23315 1048.6
## - PO4 1 7.96 23317 1048.6
## - Cl 1 15.85 23325 1048.7
## - mnO2 1 25.40 23334 1048.8
## - NO3 1 57.19 23366 1049.0
## - size 2 282.28 23591 1049.1
## - NH4 1 143.86 23453 1049.9
## <none> 23309 1050.5
## - speed 2 967.47 24276 1055.4
## - mxPH 1 1121.22 24430 1058.8
## - Chla 1 1478.18 24787 1061.9
##
## Step: AIC=1045.98
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 +
## Chla
##
## Df Sum of Sq RSS AIC
## - oPO4 1 2.54 23469 1044.0
## - PO4 1 4.10 23470 1044.0
## - mnO2 1 6.61 23473 1044.0
## - Cl 1 15.59 23482 1044.1
## - size 2 257.60 23724 1044.4
## - NO3 1 47.04 23513 1044.4
## - NH4 1 114.06 23580 1045.0
## <none> 23466 1046.0
## - speed 2 1035.56 24502 1051.4
## - mxPH 1 1052.01 24518 1053.5
## - Chla 1 1477.06 24943 1057.3
##
## Step: AIC=1044.01
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + PO4 + Chla
##
## Df Sum of Sq RSS AIC
## - PO4 1 1.62 23470 1042.0
## - mnO2 1 7.17 23476 1042.1
## - Cl 1 14.19 23483 1042.1
## - NO3 1 44.93 23514 1042.4
## - size 2 266.73 23736 1042.5
## - NH4 1 114.91 23584 1043.1
## <none> 23469 1044.0
## - speed 2 1050.55 24519 1049.5
## - mxPH 1 1099.78 24569 1052.0
## - Chla 1 1480.47 24949 1055.3
##
## Step: AIC=1042.02
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + Chla
##
## Df Sum of Sq RSS AIC
## - mnO2 1 6.59 23477 1040.1
## - Cl 1 17.42 23488 1040.2
## - size 2 265.19 23736 1040.5
## - NO3 1 51.04 23521 1040.5
## - NH4 1 140.72 23611 1041.3
## <none> 23470 1042.0
## - speed 2 1050.42 24521 1047.6
## - mxPH 1 1105.21 24576 1050.0
## - Chla 1 1482.34 24953 1053.4
##
## Step: AIC=1040.08
## a2 ~ size + speed + mxPH + Cl + NO3 + NH4 + Chla
##
## Df Sum of Sq RSS AIC
## - Cl 1 13.41 23490 1038.2
## - size 2 260.65 23738 1038.5
## - NO3 1 44.48 23522 1038.5
## - NH4 1 135.66 23613 1039.3
## <none> 23477 1040.1
## - speed 2 1121.64 24599 1046.3
## - mxPH 1 1103.17 24580 1048.1
## - Chla 1 1492.55 24970 1051.5
##
## Step: AIC=1038.21
## a2 ~ size + speed + mxPH + NO3 + NH4 + Chla
##
## Df Sum of Sq RSS AIC
## - NO3 1 36.13 23526 1036.5
## - size 2 275.91 23766 1036.8
## - NH4 1 128.31 23619 1037.4
## <none> 23490 1038.2
## - speed 2 1172.78 24663 1044.8
## - mxPH 1 1089.85 24580 1046.1
## - Chla 1 1490.94 24981 1049.6
##
## Step: AIC=1036.54
## a2 ~ size + speed + mxPH + NH4 + Chla
##
## Df Sum of Sq RSS AIC
## - size 2 244.91 23771 1034.8
## - NH4 1 93.48 23620 1035.4
## <none> 23526 1036.5
## - speed 2 1164.36 24691 1043.1
## - mxPH 1 1053.88 24580 1044.1
## - Chla 1 1611.04 25138 1049.0
##
## Step: AIC=1034.8
## a2 ~ speed + mxPH + NH4 + Chla
##
## Df Sum of Sq RSS AIC
## - NH4 1 82.62 23854 1033.6
## <none> 23771 1034.8
## - mxPH 1 850.56 24622 1040.5
## - speed 2 1085.45 24857 1040.5
## - Chla 1 1540.50 25312 1046.5
##
## Step: AIC=1033.56
## a2 ~ speed + mxPH + Chla
##
## Df Sum of Sq RSS AIC
## <none> 23854 1033.6
## - speed 2 1021.27 24875 1038.7
## - mxPH 1 928.72 24783 1039.9
## - Chla 1 1479.59 25334 1044.7
An alternative to regression is the use of regression trees, an implementation of which is found in the library rpart
. The syntax is similar to lm
.
library(rpart)
regression.tree.a2 <-rpart(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train)
The outcome can be displayed by calling the object directly.
regression.tree.a2
## n= 218
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 218 29355.1300 7.6366970
## 2) Cl< 16.6875 83 1193.6400 1.8891570
## 4) size=small,medium 67 398.6457 0.9447761 *
## 5) size=large 16 485.0194 5.8437500 *
## 3) Cl>=16.6875 135 23733.9200 11.1703700
## 6) mxPH< 8.065 59 3831.8290 5.3864410
## 12) season=autumn,winter 29 561.8414 2.5172410 *
## 13) season=spring,summer 30 2800.4720 8.1600000
## 26) mxPH< 7.9375 23 889.9730 5.3173910 *
## 27) mxPH>=7.9375 7 1114.0000 17.5000000 *
## 7) mxPH>=8.065 76 16396.0400 15.6605300
## 14) Chla>=2.65 68 9694.0890 13.8544100
## 28) Chla< 14.8875 29 2747.5810 8.7172410
## 56) NH4< 226.875 21 558.4257 5.7857140 *
## 57) NH4>=226.875 8 1534.9490 16.4125000 *
## 29) Chla>=14.8875 39 5612.0940 17.6743600
## 58) mnO2< 11.05 30 3139.0940 15.4233300
## 116) NH4>=158.409 8 577.1000 8.9000000 *
## 117) NH4< 158.409 22 2097.7700 17.7954500
## 234) season=spring,autumn 14 674.7521 14.6642900 *
## 235) season=summer,winter 8 1045.5550 23.2750000 *
## 59) mnO2>=11.05 9 1814.2760 25.1777800 *
## 15) Chla< 2.65 8 4594.6690 31.0125000 *
The tree structure can be hard to determine when there is a large number of nodes; let’s see if we can improve on the visuals using rpart.plot
.
library(rpart.plot)
prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")
Details on the regression tree can be obtained by calling the summary()
method on the object.
summary(regression.tree.a2)
## Call:
## rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl +
## NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)
## n= 218
##
## CP nsplit rel error xerror xstd
## 1 0.15082765 0 1.0000000 1.0094399 0.1994604
## 2 0.11943572 1 0.8491724 0.9479860 0.1735351
## 3 0.07178590 2 0.7297366 0.9342445 0.1677879
## 4 0.04545758 3 0.6579507 0.8986776 0.1682068
## 5 0.02243987 4 0.6124932 0.9669438 0.1825075
## 6 0.02228595 5 0.5900533 1.0092162 0.1893206
## 7 0.02156378 6 0.5677673 1.0157825 0.1894284
## 8 0.01581407 8 0.5246398 0.9891863 0.1771592
## 9 0.01285848 9 0.5088257 0.9422149 0.1718359
## 10 0.01055949 10 0.4959672 0.9573215 0.1762788
## 11 0.01000000 11 0.4854077 0.9488856 0.1746610
##
## Variable importance
## Chla NH4 Cl mxPH oPO4 PO4 NO3 speed mnO2 season
## 19 14 14 13 11 9 6 5 4 3
## size
## 2
##
## Node number 1: 218 observations, complexity param=0.1508276
## mean=7.636697, MSE=134.6565
## left son=2 (83 obs) right son=3 (135 obs)
## Primary splits:
## Cl < 16.6875 to the left, improve=0.15082760, (0 missing)
## mxPH < 7.94 to the left, improve=0.14900670, (0 missing)
## oPO4 < 45.1 to the left, improve=0.11106510, (0 missing)
## Chla < 12.21 to the left, improve=0.09817759, (0 missing)
## NH4 < 46.35 to the left, improve=0.09032131, (0 missing)
## Surrogate splits:
## PO4 < 70.465 to the left, agree=0.844, adj=0.590, (0 split)
## oPO4 < 19.8635 to the left, agree=0.835, adj=0.566, (0 split)
## NH4 < 46.35 to the left, agree=0.807, adj=0.494, (0 split)
## Chla < 2.225 to the left, agree=0.807, adj=0.494, (0 split)
## speed splits as RRL, agree=0.775, adj=0.410, (0 split)
##
## Node number 2: 83 observations, complexity param=0.01055949
## mean=1.889157, MSE=14.38121
## left son=4 (67 obs) right son=5 (16 obs)
## Primary splits:
## size splits as LLR, improve=0.25968900, (0 missing)
## Chla < 4.4835 to the left, improve=0.16315540, (0 missing)
## mxPH < 7.835 to the left, improve=0.07327114, (0 missing)
## oPO4 < 18.889 to the left, improve=0.05858864, (0 missing)
## speed splits as LRL, improve=0.05534763, (0 missing)
## Surrogate splits:
## speed splits as LRL, agree=0.843, adj=0.188, (0 split)
## mxPH < 8.385 to the left, agree=0.831, adj=0.125, (0 split)
## mnO2 < 8.3 to the right, agree=0.831, adj=0.125, (0 split)
## Cl < 15.2705 to the left, agree=0.819, adj=0.062, (0 split)
##
## Node number 3: 135 observations, complexity param=0.1194357
## mean=11.17037, MSE=175.8068
## left son=6 (59 obs) right son=7 (76 obs)
## Primary splits:
## mxPH < 8.065 to the left, improve=0.14772320, (0 missing)
## NO3 < 0.3785 to the right, improve=0.09864039, (0 missing)
## Chla < 2.85 to the right, improve=0.06998520, (0 missing)
## mnO2 < 11.05 to the left, improve=0.06691501, (0 missing)
## NH4 < 474.1875 to the right, improve=0.05884159, (0 missing)
## Surrogate splits:
## NH4 < 272.2915 to the right, agree=0.689, adj=0.288, (0 split)
## Chla < 11.65 to the left, agree=0.667, adj=0.237, (0 split)
## NO3 < 5.37 to the right, agree=0.659, adj=0.220, (0 split)
## mnO2 < 6.55 to the left, agree=0.622, adj=0.136, (0 split)
## oPO4 < 279.5085 to the right, agree=0.607, adj=0.102, (0 split)
##
## Node number 4: 67 observations
## mean=0.9447761, MSE=5.949935
##
## Node number 5: 16 observations
## mean=5.84375, MSE=30.31371
##
## Node number 6: 59 observations, complexity param=0.02156378
## mean=5.386441, MSE=64.94626
## left son=12 (29 obs) right son=13 (30 obs)
## Primary splits:
## season splits as RRLL, improve=0.12253050, (0 missing)
## NH4 < 339.9305 to the right, improve=0.08821754, (0 missing)
## mxPH < 7.905 to the left, improve=0.07581077, (0 missing)
## Chla < 2.5 to the right, improve=0.05654547, (0 missing)
## oPO4 < 25.75 to the left, improve=0.05324680, (0 missing)
## Surrogate splits:
## NO3 < 2.8605 to the right, agree=0.712, adj=0.414, (0 split)
## mnO2 < 9.75 to the right, agree=0.661, adj=0.310, (0 split)
## oPO4 < 144.1885 to the left, agree=0.627, adj=0.241, (0 split)
## Chla < 6.85 to the right, agree=0.627, adj=0.241, (0 split)
## mxPH < 7.975 to the right, agree=0.610, adj=0.207, (0 split)
##
## Node number 7: 76 observations, complexity param=0.0717859
## mean=15.66053, MSE=215.7374
## left son=14 (68 obs) right son=15 (8 obs)
## Primary splits:
## Chla < 2.65 to the right, improve=0.12852400, (0 missing)
## mnO2 < 11.05 to the left, improve=0.10098910, (0 missing)
## NO3 < 0.6045 to the right, improve=0.08727846, (0 missing)
## Cl < 35.705 to the right, improve=0.07266453, (0 missing)
## mxPH < 8.375 to the right, improve=0.04660101, (0 missing)
## Surrogate splits:
## Cl < 19.11 to the right, agree=0.921, adj=0.25, (0 split)
## NO3 < 0.076 to the right, agree=0.921, adj=0.25, (0 split)
## NH4 < 2713 to the left, agree=0.921, adj=0.25, (0 split)
## oPO4 < 8 to the right, agree=0.921, adj=0.25, (0 split)
## PO4 < 40.3335 to the right, agree=0.921, adj=0.25, (0 split)
##
## Node number 12: 29 observations
## mean=2.517241, MSE=19.37384
##
## Node number 13: 30 observations, complexity param=0.02156378
## mean=8.16, MSE=93.34907
## left son=26 (23 obs) right son=27 (7 obs)
## Primary splits:
## mxPH < 7.9375 to the left, improve=0.28441600, (0 missing)
## NH4 < 119.2855 to the left, improve=0.10978520, (0 missing)
## oPO4 < 42.857 to the left, improve=0.06038492, (0 missing)
## Cl < 37.624 to the left, improve=0.05321153, (0 missing)
## Chla < 3.1 to the right, improve=0.04623221, (0 missing)
## Surrogate splits:
## NO3 < 1.0425 to the right, agree=0.833, adj=0.286, (0 split)
##
## Node number 14: 68 observations, complexity param=0.04545758
## mean=13.85441, MSE=142.5601
## left son=28 (29 obs) right son=29 (39 obs)
## Primary splits:
## Chla < 14.8875 to the left, improve=0.13765220, (0 missing)
## Cl < 68.875 to the right, improve=0.10411670, (0 missing)
## NH4 < 228.175 to the left, improve=0.09585894, (0 missing)
## NO3 < 5.247 to the left, improve=0.08016472, (0 missing)
## oPO4 < 45.1 to the left, improve=0.05486673, (0 missing)
## Surrogate splits:
## size splits as LRR, agree=0.706, adj=0.310, (0 split)
## mxPH < 8.215 to the left, agree=0.676, adj=0.241, (0 split)
## oPO4 < 180.8335 to the right, agree=0.662, adj=0.207, (0 split)
## NO3 < 3.779 to the right, agree=0.647, adj=0.172, (0 split)
## season splits as RLRR, agree=0.632, adj=0.138, (0 split)
##
## Node number 15: 8 observations
## mean=31.0125, MSE=574.3336
##
## Node number 26: 23 observations
## mean=5.317391, MSE=38.69448
##
## Node number 27: 7 observations
## mean=17.5, MSE=159.1429
##
## Node number 28: 29 observations, complexity param=0.02228595
## mean=8.717241, MSE=94.74419
## left son=56 (21 obs) right son=57 (8 obs)
## Primary splits:
## NH4 < 226.875 to the left, improve=0.23810280, (0 missing)
## Cl < 68.875 to the right, improve=0.12545720, (0 missing)
## NO3 < 3.8045 to the right, improve=0.10340650, (0 missing)
## mnO2 < 9.6 to the right, improve=0.08307998, (0 missing)
## oPO4 < 45.1 to the left, improve=0.07494295, (0 missing)
## Surrogate splits:
## Chla < 12.08 to the left, agree=0.862, adj=0.5, (0 split)
##
## Node number 29: 39 observations, complexity param=0.02243987
## mean=17.67436, MSE=143.8999
## left son=58 (30 obs) right son=59 (9 obs)
## Primary splits:
## mnO2 < 11.05 to the left, improve=0.11737600, (0 missing)
## mxPH < 8.35 to the right, improve=0.11494980, (0 missing)
## Cl < 51.6665 to the right, improve=0.11368430, (0 missing)
## NO3 < 5.0885 to the left, improve=0.10842770, (0 missing)
## NH4 < 231.3 to the left, improve=0.08663984, (0 missing)
## Surrogate splits:
## NH4 < 245.682 to the left, agree=0.821, adj=0.222, (0 split)
## oPO4 < 19.4585 to the right, agree=0.821, adj=0.222, (0 split)
## PO4 < 96.283 to the right, agree=0.821, adj=0.222, (0 split)
## Chla < 16.4 to the right, agree=0.821, adj=0.222, (0 split)
## size splits as RLL, agree=0.795, adj=0.111, (0 split)
##
## Node number 56: 21 observations
## mean=5.785714, MSE=26.5917
##
## Node number 57: 8 observations
## mean=16.4125, MSE=191.8686
##
## Node number 58: 30 observations, complexity param=0.01581407
## mean=15.42333, MSE=104.6365
## left son=116 (8 obs) right son=117 (22 obs)
## Primary splits:
## NH4 < 158.409 to the right, improve=0.14788480, (0 missing)
## Chla < 48.15 to the left, improve=0.13800510, (0 missing)
## mnO2 < 9.95 to the right, improve=0.12649850, (0 missing)
## NO3 < 0.698 to the right, improve=0.11199640, (0 missing)
## oPO4 < 144.0415 to the left, improve=0.08054276, (0 missing)
## Surrogate splits:
## mxPH < 8.27 to the left, agree=0.767, adj=0.125, (0 split)
## oPO4 < 36.4 to the left, agree=0.767, adj=0.125, (0 split)
##
## Node number 59: 9 observations
## mean=25.17778, MSE=201.5862
##
## Node number 116: 8 observations
## mean=8.9, MSE=72.1375
##
## Node number 117: 22 observations, complexity param=0.01285848
## mean=17.79545, MSE=95.35316
## left son=234 (14 obs) right son=235 (8 obs)
## Primary splits:
## season splits as LRLR, improve=0.1799351, (0 missing)
## NO3 < 1.2775 to the right, improve=0.1189142, (0 missing)
## Chla < 31.869 to the left, improve=0.1160835, (0 missing)
## NH4 < 99.6665 to the left, improve=0.1091819, (0 missing)
## Cl < 42.818 to the left, improve=0.1085589, (0 missing)
## Surrogate splits:
## NO3 < 3.2455 to the left, agree=0.727, adj=0.250, (0 split)
## oPO4 < 103.293 to the left, agree=0.727, adj=0.250, (0 split)
## size splits as RLL, agree=0.682, adj=0.125, (0 split)
## speed splits as LRL, agree=0.682, adj=0.125, (0 split)
## mnO2 < 7.4 to the left, agree=0.682, adj=0.125, (0 split)
##
## Node number 234: 14 observations
## mean=14.66429, MSE=48.19658
##
## Node number 235: 8 observations
## mean=23.275, MSE=130.6944
The entire process is automated in the wrapper method rpartXse()
provided with the DMwr
library. In our implementation, we (abitrarily) use se
\(=0.3\).
library(DMwR)
(regression.tree.a2.final <- rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,se=0.3))
## n= 218
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 218 29355.13 7.636697 *
summary(regression.tree.a2.final)
## Call:
## rpart(formula = form, data = data, cp = cp, minsplit = minsplit)
## n= 218
##
## CP nsplit rel error xerror xstd
## 1 0.1508276 0 1 1.009958 0.199121
##
## Node number 1: 218 observations
## mean=7.636697, MSE=134.6565
prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")
## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
## Call prp with roundint=FALSE,
## or rebuild the rpart model with model=TRUE.
The resulting tree is not nearly as complex as the original tree (hence discourages overfitting) but is still more complex than the pruned tree (which should improve predicting accuracy).
This time, however, we will learn models and perform evaluation for all target variables (a1
-a7
) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.
This first require some code to create the appropriate model formulas (a1 ~ .
, … ,a7 ~ .
) and the appropriate training data.
gg<-function(x,list.of.variables){
PredTask(as.formula(paste(x,"~ .")),algae.train[,c(list.of.variables,x)], x, copy=TRUE)
}
data.sources <- sapply(names(algae.train[12:18]),gg,names(algae.train[1:11]))
data.sources
## $a1
## Prediction Task Object:
## Task Name :: a1
## Task Type :: regression
## Target Feature :: a1
## Formula :: a1 ~ .
## <environment: 0x7f8b7765c960>
## Task Data Source :: internal 218x12 data frame.
##
## $a2
## Prediction Task Object:
## Task Name :: a2
## Task Type :: regression
## Target Feature :: a2
## Formula :: a2 ~ .
## <environment: 0x7f8b775caac8>
## Task Data Source :: internal 218x12 data frame.
##
## $a3
## Prediction Task Object:
## Task Name :: a3
## Task Type :: regression
## Target Feature :: a3
## Formula :: a3 ~ .
## <environment: 0x7f8b783b5228>
## Task Data Source :: internal 218x12 data frame.
##
## $a4
## Prediction Task Object:
## Task Name :: a4
## Task Type :: regression
## Target Feature :: a4
## Formula :: a4 ~ .
## <environment: 0x7f8b78373378>
## Task Data Source :: internal 218x12 data frame.
##
## $a5
## Prediction Task Object:
## Task Name :: a5
## Task Type :: regression
## Target Feature :: a5
## Formula :: a5 ~ .
## <environment: 0x7f8b781bd0d0>
## Task Data Source :: internal 218x12 data frame.
##
## $a6
## Prediction Task Object:
## Task Name :: a6
## Task Type :: regression
## Target Feature :: a6
## Formula :: a6 ~ .
## <environment: 0x7f8b78845590>
## Task Data Source :: internal 218x12 data frame.
##
## $a7
## Prediction Task Object:
## Task Name :: a7
## Task Type :: regression
## Target Feature :: a7
## Formula :: a7 ~ .
## <environment: 0x7f8b77faa390>
## Task Data Source :: internal 218x12 data frame.
We shall use e1071
’s implementation of svm()
, with various values of the svm()
-specific parameters cost
and gamma
.
library(e1071)
kCV.results.algae.all <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
##
##
## ##### PERFORMANCE ESTIMATION USING CROSS VALIDATION #####
##
## ** PREDICTIVE TASK :: a1
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a2
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a3
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a4
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a5
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a6
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a7
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.
plot(kCV.results.algae.all)
rankWorkflows(kCV.results.algae.all,top=3)
## $a1
## $a1$nmse
## Workflow Estimate
## 1 rpartXse.v1 0.6163406
## 2 rpartXse.v2 0.6278027
## 3 rpartXse.v3 0.6430736
##
##
## $a2
## $a2$nmse
## Workflow Estimate
## 1 lm 0.9723125
## 2 svm 0.9954432
## 3 rpartXse.v3 1.0041667
##
##
## $a3
## $a3$nmse
## Workflow Estimate
## 1 svm 0.9497730
## 2 lm 0.9801662
## 3 rpartXse.v2 1.0000000
##
##
## $a4
## $a4$nmse
## Workflow Estimate
## 1 rpartXse.v3 1.001453
## 2 rpartXse.v2 1.351494
## 3 lm 1.357243
##
##
## $a5
## $a5$nmse
## Workflow Estimate
## 1 svm 0.9968475
## 2 rpartXse.v3 0.9990465
## 3 rpartXse.v2 1.0194733
##
##
## $a6
## $a6$nmse
## Workflow Estimate
## 1 rpartXse.v2 1.010069
## 2 rpartXse.v3 1.010069
## 3 svm 1.054975
##
##
## $a7
## $a7$nmse
## Workflow Estimate
## 1 rpartXse.v2 1.00000
## 2 rpartXse.v3 1.00000
## 3 rpartXse.v1 1.00797
topPerformers(kCV.results.algae.all)
## $a1
## Workflow Estimate
## nmse rpartXse.v1 0.616
##
## $a2
## Workflow Estimate
## nmse lm 0.972
##
## $a3
## Workflow Estimate
## nmse svm 0.95
##
## $a4
## Workflow Estimate
## nmse rpartXse.v3 1.001
##
## $a5
## Workflow Estimate
## nmse svm 0.997
##
## $a6
## Workflow Estimate
## nmse rpartXse.v2 1.01
##
## $a7
## Workflow Estimate
## nmse rpartXse.v2 1
For a1
, the models seem to perform reasonably well. But it’s not so rosy for the other target variables, where the baseline model is sometimes better.
Again, this could be built-in in the data, but we might benefit from incorporating more models. As a last attempt to improve the situation, let’s include random forests.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:psych':
##
## outlier
kCV.results.algae.all.rf <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
##
##
## ##### PERFORMANCE ESTIMATION USING CROSS VALIDATION #####
##
## ** PREDICTIVE TASK :: a1
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a2
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a3
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a4
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a5
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a6
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ** PREDICTIVE TASK :: a7
##
## ++ MODEL/WORKFLOW :: lm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: rpartXse.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v1
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v2
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
##
##
## ++ MODEL/WORKFLOW :: randomForest.v3
## Task for estimating nmse using
## 5 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**************************************************
rankWorkflows(kCV.results.algae.all.rf,top=3)
## $a1
## $a1$nmse
## Workflow Estimate
## 1 randomForest.v2 0.5217204
## 2 randomForest.v3 0.5228744
## 3 randomForest.v1 0.5264328
##
##
## $a2
## $a2$nmse
## Workflow Estimate
## 1 randomForest.v3 0.7798749
## 2 randomForest.v2 0.7806831
## 3 randomForest.v1 0.7849360
##
##
## $a3
## $a3$nmse
## Workflow Estimate
## 1 randomForest.v3 0.9377108
## 2 randomForest.v2 0.9400108
## 3 randomForest.v1 0.9431801
##
##
## $a4
## $a4$nmse
## Workflow Estimate
## 1 rpartXse.v3 1.001453
## 2 randomForest.v3 1.006496
## 3 randomForest.v1 1.006806
##
##
## $a5
## $a5$nmse
## Workflow Estimate
## 1 randomForest.v1 0.7626241
## 2 randomForest.v2 0.7675794
## 3 randomForest.v3 0.7681834
##
##
## $a6
## $a6$nmse
## Workflow Estimate
## 1 randomForest.v2 0.8590227
## 2 randomForest.v3 0.8621478
## 3 randomForest.v1 0.8663869
##
##
## $a7
## $a7$nmse
## Workflow Estimate
## 1 rpartXse.v2 1.00000
## 2 rpartXse.v3 1.00000
## 3 rpartXse.v1 1.00797
rankWorkflows()
does not report on the standard error, and so we cannot tell whether the differences between the score of the best model and the other models is statistically significant.
randomForest.v3
seems to have the best ranking across all learning tasks, so we will use it as the baseline model.
p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test
## $chi
## [1] 22.86905
##
## $FF
## [1] 5.251025
##
## $critVal
## [1] 0.7071231
##
## $rejNull
## [1] TRUE
p$nmse$BonferroniDunn.test
## $critDif
## [1] 3.52218
##
## $baseline
## [1] "randomForest.v3"
##
## $rkDifs
## lm svm rpartXse.v1 rpartXse.v2
## 4.1428571 2.8571429 4.1428571 2.6428571
## rpartXse.v3 randomForest.v1 randomForest.v2
## 1.9285714 0.8571429 0.0000000
##
## $signifDifs
## lm svm rpartXse.v1 rpartXse.v2
## TRUE FALSE TRUE FALSE
## rpartXse.v3 randomForest.v1 randomForest.v2
## FALSE FALSE FALSE
We can reject with 95% certainty that the performance of the baseline method (randomForest.v3
) is the same as that of the linear model and the first 2 regression trees, but not that it is better than svm
, rpartXse.v3
, and the other 2 random forests.
The information is also displayed in the Bonferroni-Dunn CD diagram below.
CDdiagram.BD(p)