(based on a Case Study by L. Torgo)
This notebook provides a concrete and thorough illustration of the data science process using a realistic dataset: algae_blooms.csv, which is also available at the UCI Machine Learning Repository.
The problem is to predict the occurrence of harmful algae in water samples.
We also use it to highlight various aspects of data exploration, data cleaning, and R syntax.
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset used to train the learning model consists of water samples, each described by the season in which it was collected, the size and speed of the river, 8 chemical measurements (mxPH, mnO2, Cl, NO3, NH4, oPO4, PO4, Chla), and the concentrations of 7 algae (a1 to a7).
What is the data science motivation for such a model? After all, we can analyze water samples to determine if various harmful algae are present or absent.
The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow.
Another answer is that analyzing the samples for harmful content does not provide a better understanding of algae drivers: it just tells us which samples contain algae.
Before we can take a look at the data and begin the process in earnest, we need to load it into the R workspace.
The dataset is stored in the CSV file algae_blooms.csv (in the Data directory).
# The data can also be loaded as part of Torgo's DMwR package,
# which could be loaded directly as below (by uncommenting the
# line of code), but that is not how we will access it.
# library(DMwR)
As the data resides in a CSV file, we use the read.csv R function to load it and store it to an R data frame (Torgo stores it to a tibble object).
(Note the sep and header options -- what do you think their function is?)
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame's structure by calling the str function.
str(algae_blooms)
Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.
Notes:
- three of the variables (season, size, speed) are categorical, and refer to the data collection process
We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() or tail() functions.
head(algae_blooms)
As it happens, we are not given an awful lot of information about the dataset's domain (I, for one, remain woefully ill-prepared to deal with matters of a chemical nature, to my eternal shame).
Data exploration, in the form of summaries and visualization, can provide a better handle on the problem at hand.
A call to the summary function provides frequency counts for categorical variables, and 6-point summaries for numerical variables. As a bonus, the number of missing values is also tabulated.
(The default setting only lists a limited number of categorical levels -- the summary documentation explains how to increase the number of levels that are displayed.)
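(For the curious: the argument in question is maxsum; a quick illustration below, with an arbitrary value of 10, assuming the categorical columns were read in as factors.)
summary(algae_blooms, maxsum=10) # show up to 10 levels per factor before lumping the rest into "(Other)"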
IMPORTANT NOTE: I may have given you the impression (just now) that exploration is only really necessary when domain expertise escapes us. Domain expertise can help analysts frame the problem and the analysis results in the appropriate manner, but it often also gives them a false sense of security. Errors can creep in anywhere -- data exploration at an early stage may save you a lot of embarrassing back-tracking at a later stage.
summary(algae_blooms)
Notes:
Of course, these summaries each apply to a single variable (1-way tables). Can we spot anything else using $n$-way tables (on the categorical variables)?
# 2-way tables
table(algae_blooms$speed,algae_blooms$size)
table(algae_blooms$speed,algae_blooms$season)
table(algae_blooms$season,algae_blooms$size)
# 3-way table
table(algae_blooms$season,algae_blooms$size,algae_blooms$speed)
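If the 3-way table is hard to scan, flattening it can help (a small aside; ftable() is base R):
ftable(algae_blooms$season, algae_blooms$size, algae_blooms$speed) # flattened 3-way contingency table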
The 6-point summary provides some information about the underlying distribution, but not much on the parametric front. A more traditional summary can be displayed using the psych library's describe function.
library(psych)
describe(algae_blooms)
Notes:
- categorical variables are flagged with an *; their levels are coded with integers and treated as numerical variables for the purpose of the analysis, so the results for these fields are meaningless
- the trimmed variable refers to the trimmed mean, the mean obtained when a certain percentage of the observations is removed from both ends of the spectrum (what percentage, exactly? see the quick check below)
- the mad variable refers to the median absolute deviation (from the median)
I personally find such a table hard to read and really grasp once there are more than a few variables in the dataset. Visualization comes in handy then.
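To answer the trimming question: recent versions of psych::describe() use trim=0.1 by default (worth confirming in your installed version); we can compare against base R's trimmed mean for a single variable.
mean(algae_blooms$mxPH, trim=0.1, na.rm=TRUE) # should match the trimmed entry for mxPH in the describe() output (assuming trim=0.1)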
Basic histograms can be constructed with the hist() function.
hist(algae_blooms$mnO2)
Based on this histogram, we can conclude that the underlying distribution of mnO2 has a negative skew, say, which is confirmed by the table above.
This basic histogram can be spruced up by using Hadley Wickham's (in)famous ggplot2 library.
library(ggplot2)
ggplot(algae_blooms,aes(x=mnO2)) + # plotting mnO2 from the algae_blooms dataset ...
geom_histogram(aes(y=..density..)) + # as a histogram, where the vertical axis is the density ...
geom_density(color="blue") + # on which will be layered a blue density curve ...
geom_rug() + # and a rug (or comb) showing where the observations actually fall...
ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ...
xlab("") + # no x axis label ...
ylab("") # and no y axis label
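(Depending on your ggplot2 version, the ..density.. pronoun may trigger a deprecation warning; with ggplot2 3.3.0 or later, the equivalent after_stat() notation can be used instead, as in the short sketch below.)
ggplot(algae_blooms,aes(x=mnO2)) +
  geom_histogram(aes(y=after_stat(density))) + # after_stat(density) replaces the older ..density.. notation
  geom_density(color="blue") +
  geom_rug()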
While mnO2 clearly does not follow a normal distribution (no negative value, high skew), we see that such an approximation wouldn't be so bad, compared to a1.
ggplot(algae_blooms,aes(x=a1)) +
geom_histogram(aes(y=..density..)) +
geom_density(color="red") +
geom_rug() +
ggtitle("Histogram of minimum value of a1 among 340 observations") +
xlab("") +
ylab("")
$qq$-plots, another traditional statistical plot, can be produced with the car library's qqPlot() function.
library(car)
qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")
Again, we can see that the normal distribution is not a good fit for mnO2, but that it is even worse for a1 (see below).
qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")
Now let's take a look at some of the odd values for NH4.
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) + # plotting NH4 from the algae_blooms dataset ...
geom_boxplot() + # as a boxplot ...
geom_rug() + # with a rug on which the true values are shown ...
geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
We see that there is a string of values falling way above the box. If the underlying distribution were normal, say, these would definitely be considered outliers.
Let's investigate further.
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1) # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3) # median value of NH4, tight dashed line
We can also look at the data and see which observations have values of NH4 below 3000 (roughly, all values below the long dashed line above).
algae_blooms[-which(algae_blooms$NH4>3000),] # keeping all values for which NH4 is NOT > 3000
nrow(algae_blooms[-which(algae_blooms$NH4>3000),]) # seeing how many are being kept
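For what it's worth, the same subset can be obtained with dplyr (a sketch; note that filter() drops rows for which the condition evaluates to NA, so the missing NH4 values must be kept explicitly to mirror the negative-indexing version above):
library(dplyr)
nrow(filter(algae_blooms, NH4 <= 3000 | is.na(NH4))) # same count as above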
Let's see what the boxplot above looks like without the suspected outliers.
ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) +
geom_boxplot() +
geom_rug() +
geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
It's a bit better, to be sure (the box structure has expanded). There still seem to be a few very high values. Perhaps that's normal? How would we find out?
Now let's take a look at some of the algae levels.
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_boxplot() + # in a series of boxplots ...
xlab("Season") + # with x axis as Seasons and y axis as a3
ylab("Algae Level a3")
What does that tell you? Or is it hard to get a good handle on the situation because the seasons are out of order?
We can re-arrange the factors, but it requires a bit of fancy footwork using the forcats library's fct_relevel() function, and dplyr's mutate().
library(forcats) # for fct_relevel
library(dplyr) # for mutate
# note: the factors are releveled with base R's factor() below; the equivalent mutate()/fct_relevel() approach is shown (commented out) underneath
algae_blooms$size <- factor(algae_blooms$size,levels = c("small", "medium", "large"))
algae_blooms$speed <- factor(algae_blooms$speed,levels = c("low","medium","high"))
algae_blooms$season <- factor(algae_blooms$season,levels = c("spring","summer","autumn","winter"))
#algae_blooms = mutate(algae_blooms,
# size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
# speed=fct_relevel(speed,c("low","medium","high")), # ditto
# season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
# )
ggplot(algae_blooms,aes(x=season,y=a3)) +
geom_boxplot() +
xlab("Season") +
ylab("Algae Level a3")
We only have 1 year's worth of data, so it's perhaps too early to tell, but it certainly seems as though the a3 levels decrease from spring to winter.
Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3 trend?
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_violin() + # in a series of violin plots ...
geom_jitter() + # with some jitter to avoid all the points being on top of one another ...
xlab("Season") +
ylab("Algae Level a3")
This plot certainly seems to suggest that a3 levels are cyclic, with a peak in the spring and low levels in the fall.
(What happens if you turn off the jitter option?)
Let's go back to NH4 for a second to see if we can spot a link with the season (as we did for a3).
We only keep the observations for which the NH4 value is at most 3000, and we bin them with respect to the quartiles.
This will also give us the chance to demonstrate the usage of the pipeline operator %>%.
First, we filter the algae_blooms dataset to remove the 2 observations with missing NH4 values (is this always a good idea?).
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) # %>%
# filter(NH4<3000) %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
Next, we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd "argument").
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) # %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
Finally, we add a new variable to tell us in which quartile the NH4 value falls.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
We can now use the new q.NH4 variable to make multi-variate comparisons, say between a1, NH4, and season.
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
geom_point() +
facet_wrap(~q.NH4) + # for each of the quartiles (each plot contains ~25% of the observations)
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
That's odd... why are we seeing missing values here? Haven't we removed the NAs? Let's delve a bit deeper.
f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]
table(f.NH4.data$q.NH4,useNA="ifany")
Ah, it seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate() call.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
The NAs have disappeared from the graph (but the observations have not!). Either way, it seems as though the a1 levels are inversely correlated with the NH4+ levels, but that the season doesn't have much of an effect (although we would need more work to conclude with any degree of certainty).
ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")
We found some potential anomalies in the data when we explored the dataset (although I'm electing to keep them in the dataset as I lack the domain expertise to make a reasonable decision on that front); now let's take some time to clean the data to some extent.
Anomalies come in various flavours; we have already explored some potential outlying behaviour, and in this section we'll handle missing values.
library(dplyr)
The function complete.cases lists the observations for which every field is present (note that it says nothing about the validity of the case).
table(complete.cases(algae_blooms))
The vast majority of observations do not have missing cases, but a few still do. Let's take a look at them to see if there is anything special about them. Are the values missing completely at random?
str(filter(algae_blooms, !complete.cases(algae_blooms)))
summary(filter(algae_blooms, !complete.cases(algae_blooms)))
Right off the bat, missing cases seem to be over-represented in small rivers and under-represented in low-speed rivers. But upon further investigation (that is, comparing with the original dataset), the under-representation of low-speed rivers is not problematic, as it is in line with the numbers in the larger dataset (by which I mean that low-speed rivers don't seem to have a systematic missing value problem, rather than that the under-representation of low-speed rivers in the original dataset is not problematic).
I will assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly Cl and Chla) is also not a problem (YOU SHOULD PROBABLY VERIFY IF THAT IS INDEED THE CASE...).
And the bulk of the missing values seems to come from either Cl, Chla, or PO4. There also seem to be 2 observations for which we have very little information, but we can't use that summary to determine if it's always the same two observations.
For instance, let's see what observations have missing NH4 values.
algae_blooms[which(is.na(algae_blooms$NH4)),]
While these observations also have missing values in other fields, they do have some non-missing fields as well.
But they're both missing 6 of the predictor variables. How useful could they be in training a predictive model? (That depends on the model, of course....)
We can write a function that computes how many missing values there are for each observation.
table(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2)
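An equivalent (and arguably more direct) way to count missing predictor values per observation uses rowSums() (a quick check, not in the original notebook):
table(rowSums(is.na(algae_blooms[,1:11]))) # same distribution as the apply() version
which(rowSums(is.na(algae_blooms[,1:11])) > 2) # same observations as the which()/apply() call above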
Most observations have no missing cases, which is great, and there are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11).
Based on the small number of such wild cards, we elect to remove them from the analysis.
IMPORTANT NOTES:
Our new dataset still contains observations with missing cases, however.
algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)
What can we do with the other observations for which values are missing?
One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with another field. That way, if there is a missing value in the first field, the second could be used to impute it.
IMPORTANT NOTE:
symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))
This might be a bit hard to read. We can use the corrplot library to visualize the (linear) correlations.
library(corrplot)
corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")
The correlation between PO4 (which has missing cases) and oPO4 (which does not, anymore) is clear.
What is the nature of the relation? We'll use the set of complete cases to find it.
algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]
dim(algae_blooms.nona)
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model
The regression function is PO4 $= 51.811 + 1.203 \cdot$ oPO4 (note that I'm not really interested in the fit statistics).
Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]
What are the observations for which PO4 is missing?
which(is.na(algae_blooms.sna$PO4)==TRUE)
Let's use the regression function to impute the missing PO4 values.
algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4), pmax(Intercept + Slope*algae_blooms.sna2$oPO4, 0), algae_blooms.sna2$PO4) # pmax() (rather than max()) keeps the imputation element-wise and non-negative
We can clearly see that no values of PO4 are missing anymore.
which(is.na(algae_blooms.sna2$PO4)==TRUE)
That takes care of the missing values with strong linear correlation to another field. Where do we stand now?
summary(algae_blooms.sna2)
Alright, so we don't have as many missing values as before, but we still have some. And the correlation trick isn't going to work this time.
What can we do? There are many ways to tackle the problem, but we will use $k$NN imputation. The principle is simple: using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the $k$ nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing field over these $k$ complete observations, and use that value as the imputed value.
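Purely for illustration (this is a sketch, not the routine we will actually use), a mean-based version of the idea could look something like the function below; it assumes that columns 4 to 11 hold the numeric predictors and that standardizing them before computing distances is appropriate. The actual imputation relies on DMwR's knnImputation().
knn.impute.one <- function(df, target, k = 10) {
  num <- scale(df[, 4:11])                      # standardized numeric predictors (NAs preserved)
  miss <- which(is.na(df[[target]]))
  done <- which(!is.na(df[[target]]))
  for (i in miss) {
    # Euclidean distance from observation i to each complete observation,
    # skipping coordinates that are missing on either side
    d <- apply(num[done, , drop = FALSE], 1,
               function(r) sqrt(sum((r - num[i, ])^2, na.rm = TRUE)))
    nn <- done[order(d)[1:k]]                   # indices of the k nearest neighbours
    df[i, target] <- mean(df[[target]][nn])     # impute with the neighbours' mean value
  }
  df
}
# e.g., tmp <- knn.impute.one(algae_blooms.sna2, "Chla", k = 10)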
IMPORTANT NOTES: we use the DMwR library's implementation of knnImputation() (below). How would you go about determining if the routine scales the data internally?
library(DMwR)
algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.
Sure enough, there are no further observations with incomplete cases.
summary(algae_blooms.sna2)
table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
For supervised learning tasks, the bias-variance trade-off means that we need to set aside a testing set on which the models (which learned on the training set) are evaluated.
There are no hard-and-fast rules to determine how to split the data - if the signal is very strong, a small training set should capture its important features, but we don't typically know how strong the signal is before we start the modeling process. On the other hand, if the training set is too large, we run the risk of overfitting the data.
We've elected to use a 65%/35% split in favour of the training set.
The training data should also be representative, to avoid injecting biases in the model (in case the data was provided according to some systematic but unknown rule). There are numerous ways to do this (see any introduction to sampling design), but we can do so using a simple random sample of 218 observations (we could also have stratified according to season, size, etc.).
To avoid issues related to repeatability, we'll all use the same training set; I've commented out the code that would allow you to sample randomly.
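(If you do opt for a random split at some point, fixing the seed keeps the sampling reproducible; a small sketch, not used in this notebook, with an arbitrary seed value:)
set.seed(1234) # arbitrary seed; any fixed value will do
ind.random <- sample(1:nrow(algae_blooms.sna2), 218, replace=FALSE) # reproducible random training indices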
#ind<-sample(1:dim(algae_blooms.sna2)[1],218, replace=FALSE)
ind <- 1:218
218/338 # proportion of observations assigned to the training set (roughly the 65% mentioned above)
algae.train<-algae_blooms.sna2[ind,]
algae.test<-algae_blooms.sna2[-ind,]
dim(algae.train)
dim(algae.test)
Before we get too excited about using various machine learning methods, let's see what the traditional approach yields.
We'll run a linear model to predict a2, for example, against all the predictor variables, but using only the training set as data.
linear.model.a2 <- lm(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,data=algae.train)
A summary of the results can be given by calling the summary method on the resulting object.
summary(linear.model.a2)
We see that the adjusted $R^2$ coefficient is fairly small. Furthermore, if the linear model were a good fit, the residuals would have a mean of 0 and be "small". That being said, the $F$-statistic seems to indicate that there is some (linear) dependence on the predictor variables.
We can get a better handle on the regression diagnostics by calling the plot() method on the object.
plot(linear.model.a2)
All in all, the linear model is nowhere near a great one, let alone a good one.
The significance of some of the coefficients is questionable, however, and we might wonder what effect their inclusion might have.
anova(linear.model.a2)
An ANOVA finds that NH4 contributes the least to the prediction of a2, and we might be interested in the results of a linear regression with that predictor removed.
linear.model.a2.mod <- update(linear.model.a2, . ~ . - NH4)
summary(linear.model.a2.mod)
The fit isn't much better, but an ANOVA on the 2 suggested models shows that we can be ~88% certain that the models are different (see below).
anova(linear.model.a2,linear.model.a2.mod)
The function step() uses the AIC to perform a model search, using backward elimination.
final.linear.model.a2 <- step(linear.model.a2)
The "best" linear model for a2 is shown below.
summary(final.linear.model.a2)
It's still not a great fit (the adjusted $R^2$ is quite small), so the linear model is not ideal for a2.
anova(final.linear.model.a2)
anova(linear.model.a2,final.linear.model.a2)
plot(final.linear.model.a2)
An alternative to regression is the use of regression trees, an implementation of which is found in the library rpart. The syntax is similar to that of lm().
library(rpart)
regression.tree.a2 <-rpart(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train)
The outcome can be displayed by calling the object directly.
regression.tree.a2
The tree structure can be hard to determine when there is a large number of nodes; let's see if we can improve on the visuals using rpart.plot.
library(rpart.plot)
prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")
Details on the regression tree can be obtained by calling the summary() method on the object.
summary(regression.tree.a2)
rpart() grows a tree until one of the following criteria is met:
- the decrease in deviance brought by a split falls below a certain threshold (controlled by cp)
- the number of observations in a node falls below a certain threshold (controlled by minsplit)
- the tree depth exceeds a maximum value (controlled by maxdepth)
rpart implements a pruning method based on cost complexity: finding the value of cp which best balances predictive accuracy and tree size.
printcp(regression.tree.a2)
The tree returned by rpart() is the final one (cp $=0.01$ is the default value); it requires 11 tests and has a relative error of 0.485. Internally, rpart() uses 10-fold cross-validation to estimate that the tree has an average relative error of $0.98\pm 0.18$ (these values might change when you run the model due to the stochastic nature of the internal cross-validation routines).
In this framework, the optimal tree minimizes the value of xerror. Alternatively, one could use the 1-SE rule and select the smallest tree whose xerror is within one xstd of the minimum.
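The selection can also be done programmatically from the cptable (a sketch, not part of the original notebook), either by minimal xerror or by the 1-SE rule:
cp.tab <- regression.tree.a2$cptable
best <- which.min(cp.tab[,"xerror"])                       # row with the minimal cross-validated error
threshold <- cp.tab[best,"xerror"] + cp.tab[best,"xstd"]   # 1-SE threshold
row.1se <- min(which(cp.tab[,"xerror"] <= threshold))      # simplest tree within 1 SE of the minimum
prune(regression.tree.a2, cp=cp.tab[row.1se,"CP"])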
plotcp(regression.tree.a2)
The scree plot above suggests that cp $=0.08$ (that value may change when you run yours, due to the stochastic nature of the internal cross-validation algorithm) is a special value for tree growth, so we could prune the tree using a value in that vicinity (we use cp $=0.05$ below).
regression.tree.a2.mod <- prune(regression.tree.a2,cp=0.05)
regression.tree.a2.mod
prp(regression.tree.a2.mod,extra=101,box.col="orange",split.box.col="gray")
The entire process is automated in the wrapper method rpartXse() provided with the DMwR library. In our implementation, we (arbitrarily) use se $=0.3$.
library(DMwR)
(regression.tree.a2.final <- rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,se=0.3))
summary(regression.tree.a2.final)
prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")
The resulting tree is not nearly as complex as the original tree (which discourages overfitting), but it is still more complex than the pruned tree (which should improve predictive accuracy).
At this stage, we know that the linear model is not great for a2, and we've seen how to grow a regression tree for a2, but we have not yet discussed whether this model is a good fit, to say nothing of the remaining 6 algae concentrations.
Can we get a better handle on these models' performance (i.e. comparing the model predictions to the real values of the target variable in the test data)?
Various metrics can be used to determine how the predicted and actual values compare: the mean absolute error (MAE), the mean squared error (MSE), and the normalized mean squared error (NMSE), among others.
As the ratio of the model's MSE to that of a baseline predictor (which always predicts the mean value of the target), the NMSE is unitless. NMSE values between 0 and 1 (smaller is better) indicate that the model performs better than the baseline; values greater than 1 indicate that the model's performance is sub-par.
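Concretely, if $\hat{y}_i$ denotes the model's prediction for observation $i$, $y_i$ the observed value, and $\bar{y}$ the mean of the observed target values (the baseline prediction), then
$$\mathrm{NMSE} = \frac{\sum_{i}(\hat{y}_i - y_i)^2}{\sum_{i}(\bar{y} - y_i)^2};$$
this is exactly the computation carried out at the end of this notebook (with $\bar{y}$ taken from the training set).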
There are other metrics, but these are the most common.
So how does model evaluation work? Typically, one computes the selected evaluation metric for a variety of models, and picks the model which provides the optimal metric value.
$k$-fold cross-validation is often used to try to mitigate the effects of overfitting for mid-size datasets (no more than 50,000 observations, say).
performanceEstimation() can be used to automate the process. Here, we'll use the linear model and 5 regression trees (parameterized by se), together with 5 repetitions of $10$-fold cross-validation, to select the best predictive model for a2.
library(performanceEstimation)
kCV.results.algae.a2 <- performanceEstimation(
PredTask(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,"a2"),
c(Workflow(learner="lm",post="onlyPos"),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.25,0.5,0.75,1)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
A summary and plot of the cross-validation results for NMSE can be displayed using calls to summary() and plot().
summary(kCV.results.algae.a2)
plot(kCV.results.algae.a2)
It's not necessarily clear which of the models has smaller values of NMSE, although it does seem that the latter versions of the regression tree models are not substantially better than the baseline model.
The first regression tree model sometimes produces very small NMSE values, but that's offset by some of the larger values it also produces (similarly for the linear model).
At any rate, visual evidence seems to suggest that the linear model is the best predictive model for a2 given the training data (in this version of $k$CV), which is corroborated by a call to topPerformers().
topPerformers(kCV.results.algae.a2)
This might seem disheartening at first given how poorly the linear model performed, but it might be helpful to remember that there is no guarantee that a decent predictive model even exists.
Furthermore, regression trees and linear models are only two of a whole collection of possible models. How do support vector machines perform the task, for instance?
This time, however, we will learn models and perform evaluation for all target variables (a1-a7) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.
This first requires some code to create the appropriate model formulas (a1 ~ ., ..., a7 ~ .) and the appropriate training data.
gg<-function(x,list.of.variables){
PredTask(as.formula(paste(x,"~ .")),algae.train[,c(list.of.variables,x)], x, copy=TRUE)
}
data.sources <- sapply(names(algae.train[12:18]),gg,names(algae.train[1:11]))
data.sources
We shall use e1071's implementation of svm(), with various values of the svm()-specific parameters cost and gamma.
library(e1071)
kCV.results.algae.all <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.
plot(kCV.results.algae.all)
rankWorkflows(kCV.results.algae.all,top=3)
topPerformers(kCV.results.algae.all)
For a1, the models seem to perform reasonably well. But it's not so rosy for the other target variables, where the baseline model is sometimes better.
Again, this could be inherent to the data, but we might benefit from incorporating more models. As a last attempt to improve the situation, let's include random forests.
library(randomForest)
kCV.results.algae.all.rf <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
rankWorkflows(kCV.results.algae.all.rf,top=3)
rankWorkflows() does not report the standard error, so we cannot tell whether the differences between the score of the best model and those of the other models are statistically significant.
randomForest.v3 seems to have the best ranking across all learning tasks, so we will use it as the baseline model.
p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test
p$nmse$BonferroniDunn.test
We can reject, with 95% certainty, the hypothesis that the performance of the baseline method (randomForest.v3) is the same as that of the linear model and the first 2 regression trees, but we cannot conclude that it performs better than svm, rpartXse.v3, and the other 2 random forests.
The information is also displayed in the Bonferroni-Dunn CD diagram below.
CDdiagram.BD(p)
Finally, we might actually be interested in generating predictions for each of the target variables in the testing set.
This requires simply that the best performers for each target be brought together in an R object.
best.performers <- sapply(taskNames(kCV.results.algae.all.rf), function(x) topPerformer(kCV.results.algae.all.rf,metric="nmse",task=x))
best.performers
The observations that form the testing set are shown below; we also set up an array to hold the corresponding actual and predicted values for each of the 7 targets.
rownames(algae.test)
test.observations <- array(dim=c(nrow(algae.test),7,2),dimnames=list(rownames(algae.test),paste("a",1:7),c("actual","predicted")))
The function runWorkflow() will compute the predicted values for each of the targets' best performers. We can then plot the predicted and actual values for each of the testing set targets.
for(j in 1:7){
results <- runWorkflow(best.performers[[j]],
as.formula(paste(names(best.performers)[j],"~ .")),
algae.train[,c(1:11,11+j)],
algae.test[,c(1:11,11+j)])
test.observations[,j,"actual"] <- results$trues
test.observations[,j,"predicted"] <- results$preds
}
df.a1 <- as.data.frame(test.observations[,1,])
df.a2 <- as.data.frame(test.observations[,2,])
df.a3 <- as.data.frame(test.observations[,3,])
df.a4 <- as.data.frame(test.observations[,4,])
df.a5 <- as.data.frame(test.observations[,5,])
df.a6 <- as.data.frame(test.observations[,6,])
df.a7 <- as.data.frame(test.observations[,7,])
plot(df.a1,main="a1 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a2,main="a2 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a3,main="a3 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a4,main="a4 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a5,main="a5 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a6,main="a6 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a7,main="a7 - predicted vs. actual")
abline(0,1,col="red")
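Equivalently, the seven data frames and plots above could have been produced with a single loop (a compact sketch of the same plots):
for(j in 1:7){
  df <- as.data.frame(test.observations[,j,]) # actual vs. predicted values for the j-th algae
  plot(df, main=paste0("a",j," - predicted vs. actual"))
  abline(0,1,col="red") # the perfect-prediction line
}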
The models simply are not that great, but we already expected that. The average value of each target over the training set (which serves as the baseline prediction) is shown below.
average.prediction <- apply(algae.train[,12:18],2, mean)
average.prediction
Finally, you might be interested in the NMSE metrics for the predicted values and how they compare to the NMSE metrics on the training set. Is any of this surprising?
apply((test.observations[,,"actual"]-test.observations[,,"predicted"])^2,2,sum) / apply((scale(test.observations[,,"actual"],average.prediction,FALSE))^2,2,sum)
svm model?