(based on a Case Study by L. Torgo)
This notebook provides a concrete and thorough illustration of the data science process using a realistic dataset: algae_blooms.csv, which is also available at the UCI Machine Learning Repository.
The problem is to predict the occurrence of harmful algae in water samples.
We also use it to highlight various aspects of data exploration, data cleaning, and R syntax.
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset used to train the learning model consists of water samples, each described by the season in which it was collected, the size and speed of the river, 8 chemical measurements (mxPH, mnO2, Cl, NO3, NH4, oPO4, PO4, Chla), and the concentrations of 7 algae (a1 to a7).
What is the data science motivation for such a model? After all, we can analyze water samples to determine if various harmful algae are present or absent.
The answer is simple: chemical monitoring is cheap and easy to automate, whereas biological analysis of samples is expensive and slow.
Another answer is that analyzing the samples for harmful content does not provide a better understanding of algae drivers: it just tells us which samples contain algae.
Before we can take a look at the data and begin the process in earnest, we need to load it into the R workspace.
The dataset is stored in the CSV file algae_blooms.csv (in the Data directory).
# The data can also be loaded as part of Torgo's DMwR package,
# which could be loaded directly as below (by uncommenting the
# line of code), but that is not how we will access it.
# library(DMwR)
As the data resides in a CSV file, we use the read.csv R function to load it and store it to an R data frame (Torgo stores it to a tibble object).
(Note the sep and header options -- what do you think their function is?)
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame's structure by calling the str function.
str(algae_blooms)
Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.
Notes:
- three of the variables (season, size, speed) are categorical, and refer to the data collection process
We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() or tail() functions.
head(algae_blooms)
As it happens, we are not given an awful lot of information about the dataset's domain (I, for one, remain woefully ill-prepared to deal with matters of a chemical nature, to my eternal shame).
Data exploration, in the form of summaries and visualization, can provide a better handle on the problem at hand.
A call to the summary function provides frequency counts for categorical variables, and 6-point summaries for numerical variables. As a bonus, the number of missing values is also tabulated.
(The default setting only lists a limited number of categorical levels -- the summary documentation explains how to increase the number of levels that are displayed.)
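(For the curious: the argument in question is maxsum; a quick illustration below, with an arbitrary value of 10, assuming the categorical columns were read in as factors.)
summary(algae_blooms, maxsum=10) # show up to 10 levels per factor before lumping the rest into "(Other)"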
IMPORTANT NOTE: I may have given you the impression (just now) that exploration is only really necessary when domain expertise escapes us. Domain expertise can help analysts frame the problem and the analysis results in the appropriate manner, but it often also gives them a false sense of security. Errors can creep in anywhere -- data exploration at an early stage may save you a lot of embarrassing back-tracking at a later stage.
summary(algae_blooms)
Notes:
Of course, these summaries each apply to a single variable (1-way tables). Can we spot anything else using $n$-way tables (on the categorical variables)?
# 2-way tables
table(algae_blooms$speed,algae_blooms$size)
table(algae_blooms$speed,algae_blooms$season)
table(algae_blooms$season,algae_blooms$size)
# 3-way table
table(algae_blooms$season,algae_blooms$size,algae_blooms$speed)
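If the 3-way table is hard to scan, flattening it can help (a small aside; ftable() is base R):
ftable(algae_blooms$season, algae_blooms$size, algae_blooms$speed) # flattened 3-way contingency table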
The 6-point summary provides some information about the underlying distribution, but not much on the parametric front. A more traditional summary can be displayed using the psych library's describe function.
library(psych)
describe(algae_blooms)
Notes:
- categorical variables are flagged with an *; their levels are coded with integers and treated as numerical variables for the purpose of the analysis, so the results for these fields are meaningless
- the trimmed variable refers to the trimmed mean, the mean obtained when a certain percentage of the observations is removed from both ends of the spectrum (what percentage, exactly? see the quick check below)
- the mad variable refers to the median absolute deviation (from the median)
I personally find such a table hard to read and really grasp once there are more than a few variables in the dataset. Visualization comes in handy then.
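To answer the trimming question: recent versions of psych::describe() use trim=0.1 by default (worth confirming in your installed version); we can compare against base R's trimmed mean for a single variable.
mean(algae_blooms$mxPH, trim=0.1, na.rm=TRUE) # should match the trimmed entry for mxPH in the describe() output (assuming trim=0.1)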
Basic histograms can be constructed with the hist() function.
hist(algae_blooms$mnO2)
Based on this histogram, we can conclude that the underlying distribution of mnO2 has a negative skew, say, which is confirmed by the table above.
This basic histogram can be spruced up by using Hadley Wickham's (in)famous ggplot2 library.
library(ggplot2)
ggplot(algae_blooms,aes(x=mnO2)) + # plotting mnO2 from the algae_blooms dataset ...
geom_histogram(aes(y=..density..)) + # as a histogram, where the vertical axis is the density ...
geom_density(color="blue") + # on which will be layered a blue density curve ...
geom_rug() + # and a rug (or comb) showing where the observations actually fall...
ggtitle("Histogram of minimum value of O2 among 340 observations") + # with this title ...
xlab("") + # no x axis label ...
ylab("") # and no y axis label
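(Depending on your ggplot2 version, the ..density.. pronoun may trigger a deprecation warning; with ggplot2 3.3.0 or later, the equivalent after_stat() notation can be used instead, as in the short sketch below.)
ggplot(algae_blooms,aes(x=mnO2)) +
  geom_histogram(aes(y=after_stat(density))) + # after_stat(density) replaces the older ..density.. notation
  geom_density(color="blue") +
  geom_rug()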
While mnO2 clearly does not follow a normal distribution (no negative value, high skew), we see that such an approximation wouldn't be so bad, compared to a1.
ggplot(algae_blooms,aes(x=a1)) +
geom_histogram(aes(y=..density..)) +
geom_density(color="red") +
geom_rug() +
ggtitle("Histogram of minimum value of a1 among 340 observations") +
xlab("") +
ylab("")
$qq$-plots, another traditional statistical plot, can be produced with the car library's qqPlot() function.
library(car)
qqPlot(algae_blooms$mnO2, main='Normal QQ plot for minimum values of O2', ylab="")
Again, we can see that the normal distribution is not a good fit for mnO2, but that it is even worse for a1 (see below).
qqPlot(algae_blooms$a1, main='Normal QQ plot for a1', ylab="")
Now let's take a look at some of the odd values for NH4.
ggplot(algae_blooms,aes(x=factor(0),y=NH4)) + # plotting NH4 from the algae_blooms dataset ...
geom_boxplot() + # as a boxplot ...
geom_rug() + # with a rug on which the true values are shown ...
geom_hline(aes(yintercept=mean(algae_blooms$NH4, na.rm=TRUE)), linetype=2, colour="pink") + # and a horizontal line showing where the mean NH4 value falls ...
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
We see that there is a string of values falling way above the box. If the underlying distribution were normal, say, these would definitely be considered outliers.
Let's investigate further.
plot(algae_blooms$NH4, xlab="", ylab="Ammonium (NH4+)") # scatter plot of NH4 against observation number
abline(h=mean(algae_blooms$NH4, na.rm=TRUE), lty=1) # mean value of NH4, solid line
abline(h=mean(algae_blooms$NH4, na.rm=TRUE) + sd(algae_blooms$NH4, na.rm=TRUE), lty=2) # mean + sd value of NH4, dashed line
abline(h=median(algae_blooms$NH4, na.rm=TRUE), lty=3) # median value of NH4, tight dashed line
We can also look at the data and see which observations have values of NH4 below 3000 (roughly, all values below the long dashed line above).
algae_blooms[-which(algae_blooms$NH4>3000),] # keeping all values for which NH4 is NOT > 3000
nrow(algae_blooms[-which(algae_blooms$NH4>3000),]) # seeing how many are being kept
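For what it's worth, the same subset can be obtained with dplyr (a sketch; note that filter() drops rows for which the condition evaluates to NA, so the missing NH4 values must be kept explicitly to mirror the negative-indexing version above):
library(dplyr)
nrow(filter(algae_blooms, NH4 <= 3000 | is.na(NH4))) # same count as above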
Let's see what the boxplot above looks like without the suspected outliers.
ggplot(algae_blooms[-which(algae_blooms$NH4>3000),],aes(x=factor(0),y=NH4)) +
geom_boxplot() +
geom_rug() +
geom_hline(aes(yintercept=mean(algae_blooms[-which(algae_blooms$NH4>3000),8], na.rm=TRUE)), linetype=2, colour="pink") +
ylab("Ammonium (NH4+)") +
xlab("") +
scale_x_discrete(breaks=NULL)
It's a bit better, to be sure (the box structure has expanded). There still seem to be a few very high values. Perhaps that's normal? How would we find out?
Now let's take a look at some of the algae levels.
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_boxplot() + # in a series of boxplots ...
xlab("Season") + # with x axis as Seasons and y axis as a3
ylab("Algae Level a3")
What does that tell you? Or is it hard to get a good handle on the situation because the seasons are out of order?
We can re-arrange the factors, but it requires a bit of fancy footwork using the forcats library's fct_relevel() function, and dplyr's mutate().
library(forcats) # for fct_relevel
library(dplyr) # for mutate
# note: the factors are releveled with base R's factor() below; the equivalent mutate()/fct_relevel() approach is shown (commented out) underneath
algae_blooms$size <- factor(algae_blooms$size,levels = c("small", "medium", "large"))
algae_blooms$speed <- factor(algae_blooms$speed,levels = c("low","medium","high"))
algae_blooms$season <- factor(algae_blooms$season,levels = c("spring","summer","autumn","winter"))
#algae_blooms = mutate(algae_blooms,
# size=fct_relevel(size,c("small","medium","large")), # factors should appear in the desired order
# speed=fct_relevel(speed,c("low","medium","high")), # ditto
# season=fct_relevel(season,c("spring","summer","autumn","winter")) # same here
# )
ggplot(algae_blooms,aes(x=season,y=a3)) +
geom_boxplot() +
xlab("Season") +
ylab("Algae Level a3")
We only have 1 year's worth of data, so it's perhaps too early to tell, but it certainly seems as though the a3 levels decrease from spring to winter.
Violin plots are cousins to the boxplots. Can we get a bit more insight on the a3 trend?
ggplot(algae_blooms,aes(x=season,y=a3)) + # plot a3 by season ...
geom_violin() + # in a series of violin plots ...
geom_jitter() + # with some jitter to avoid all the points being on top of one another ...
xlab("Season") +
ylab("Algae Level a3")
This plot certainly seems to suggest that a3 levels are cyclic, with a peak in the spring and low levels in the fall.
(What happens if you turn off the jitter option?)
Let's go back to NH4 for a second to see if we can spot a link with the season (as we did for a3).
We only keep the observations for which the NH4 value is at most 3000, and we bin them with respect to the quartiles.
This will also give us the chance to demonstrate the usage of the pipeline operator %>%.
First, we filter the algae_blooms dataset to remove the 2 observations with missing NH4 values (is this always a good idea?).
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) # %>%
# filter(NH4<3000) %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
Next, we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd "argument").
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) # %>%
# mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
Finally, we add a new variable to tell us in which quartile the NH4 value falls.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)
nrow(f.NH4.data)
We can now use the new q.NH4 variable to make multi-variate comparisons, say between a1, NH4, and season.
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
geom_point() +
facet_wrap(~q.NH4) + # for each of the quartiles (each plot contains ~25% of the observations)
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
That's odd... why are we seeing missing values here? Haven't we removed the NAs? Let's delve a bit deeper.
f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]
table(f.NH4.data$q.NH4,useNA="ifany")
Ah, it seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate() call.
f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>%
filter(NH4<3000) %>%
mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")
ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")
The NAs have disappeared from the graph (but the observations have not!). Either way, it seems as though the a1 levels are inversely correlated with the NH4+ levels, but that the season doesn't have much of an effect (although we would need more work to conclude with any degree of certainty).
ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
geom_point() +
facet_wrap(~q.NH4) +
guides(color=FALSE) +
ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")
We found some potential anomalies in the data when we explored the dataset (although I'm electing to keep them in the dataset as I lack the domain expertise to make a reasonable decision on that front); now let's take some time to clean the data to some extent.
Anomalies come in various flavours; we have already explored some potential outlying behaviour, and in this section we'll handle missing values.
library(dplyr)
The function complete.cases lists the observations for which every field is present (note that it says nothing about the validity of the case).
table(complete.cases(algae_blooms))
The vast majority of observations do not have missing cases, but a few still do. Let's take a look at them to see if there is anything special about them. Are the values missing completely at random?
str(filter(algae_blooms, !complete.cases(algae_blooms)))
summary(filter(algae_blooms, !complete.cases(algae_blooms)))
Right off the bat, missing cases seem to be over-represented in small rivers and under-represented in low-speed rivers. But upon further investigation (that is, comparing with the original dataset), the under-representation of low-speed rivers is not problematic, as it is in line with the numbers in the larger dataset (by which I mean that low-speed rivers don't seem to have a systematic missing value problem, rather than that the under-representation of low-speed rivers in the original dataset is not problematic).
I will assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly Cl and Chla) is also not a problem (YOU SHOULD PROBABLY VERIFY IF THAT IS INDEED THE CASE...).
And the bulk of the missing values seems to come from either Cl, Chla, or PO4. There also seem to be 2 observations for which we have very little information, but we can't use that summary to determine if it's always the same two observations.
For instance, let's see what observations have missing NH4 values.
algae_blooms[which(is.na(algae_blooms$NH4)),]
While these observations also have missing values in other fields, they do have some non-missing fields as well.
But they're both missing 6 of the predictor variables. How useful could they be in training a predictive model? (That depends on the model, of course....)
We can write a function that computes how many missing values there are for each observation.
table(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2)
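An equivalent (and arguably more direct) way to count missing predictor values per observation uses rowSums() (a quick check, not in the original notebook):
table(rowSums(is.na(algae_blooms[,1:11]))) # same distribution as the apply() version
which(rowSums(is.na(algae_blooms[,1:11])) > 2) # same observations as the which()/apply() call above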
Most observations have no missing cases, which is great, and there are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11).
Based on the small number of such wild cards, we elect to remove them from the analysis.
IMPORTANT NOTES:
Our new dataset still contains observations with missing cases, however.
algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)
What can we do with the other observations for which values are missing?
One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with another field. That way, if there is a missing value in the first field, the second could be used to impute it.
IMPORTANT NOTE:
symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))
This might be a bit hard to read. We can use the corrplot library to visualize the (linear) correlations.
library(corrplot)
corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")
The correlation between PO4 (which has missing cases) and oPO4 (which does not, anymore) is clear.
What is the nature of the relation? We'll use the set of complete cases to find it.
algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]
dim(algae_blooms.nona)
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model
The regression function is PO4 $= 51.811 + 1.203 \cdot$ oPO4 (note that I'm not really interested in the fit statistics).
Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]
What are the observations for which PO4 is missing?
which(is.na(algae_blooms.sna$PO4)==TRUE)
Let's use the regression function to impute the missing PO4 values.
algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4), pmax(Intercept + Slope*algae_blooms.sna2$oPO4, 0), algae_blooms.sna2$PO4) # pmax() (rather than max()) keeps the imputation element-wise and non-negative
We can clearly see that no values of PO4 are missing anymore.
which(is.na(algae_blooms.sna2$PO4)==TRUE)
That takes care of the missing values with strong linear correlation to another field. Where do we stand now?
summary(algae_blooms.sna2)
Alright, so we don't have as many missing values as before, but we still have some. And the correlation trick isn't going to work this time.
What can we do? There are many ways to tackle the problem, but we will use $k$NN imputation. The principle is simple: using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the $k$ nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing field over these $k$ complete observations, and use that value as the imputed value.
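Purely for illustration (this is a sketch, not the routine we will actually use), a mean-based version of the idea could look something like the function below; it assumes that columns 4 to 11 hold the numeric predictors and that standardizing them before computing distances is appropriate. The actual imputation relies on DMwR's knnImputation().
knn.impute.one <- function(df, target, k = 10) {
  num <- scale(df[, 4:11])                      # standardized numeric predictors (NAs preserved)
  miss <- which(is.na(df[[target]]))
  done <- which(!is.na(df[[target]]))
  for (i in miss) {
    # Euclidean distance from observation i to each complete observation,
    # skipping coordinates that are missing on either side
    d <- apply(num[done, , drop = FALSE], 1,
               function(r) sqrt(sum((r - num[i, ])^2, na.rm = TRUE)))
    nn <- done[order(d)[1:k]]                   # indices of the k nearest neighbours
    df[i, target] <- mean(df[[target]][nn])     # impute with the neighbours' mean value
  }
  df
}
# e.g., tmp <- knn.impute.one(algae_blooms.sna2, "Chla", k = 10)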
IMPORTANT NOTES: we use the DMwR library's implementation of knnImputation() (below). How would you go about determining if the routine scales the data internally?
library(DMwR)
algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.
Sure enough, there are no further observations with incomplete cases.
summary(algae_blooms.sna2)
table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
For supervised learning tasks, the bias-variance trade-off means that we need to set aside a testing set on which the models (which learned on the training set) are evaluated.
There are no hard-and-fast rules to determine how to split the data - if the signal is very strong, a small training set should capture its important features, but we don't typically know how strong the signal is before we start the modeling process. On the other hand, if the training set is too large, we run the risk of overfitting the data.
We've elected to use a 65%/35% split in favour of the training set.
The training data should also be representative, to avoid injecting biases in the model (in case the data was provided according to some systematic but unknown rule). There are numerous ways to do this (see any introduction to sampling design), but we can do so using a simple random sample of 218 observations (we could also have stratified according to season, size, etc.).
To avoid issues related to repeatability, we'll all use the same training set; I've commented out the code that would allow you to sample randomly.
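(If you do opt for a random split at some point, fixing the seed keeps the sampling reproducible; a small sketch, not used in this notebook, with an arbitrary seed value:)
set.seed(1234) # arbitrary seed; any fixed value will do
ind.random <- sample(1:nrow(algae_blooms.sna2), 218, replace=FALSE) # reproducible random training indices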
#ind<-sample(1:dim(algae_blooms.sna2)[1],218, replace=FALSE)
ind <- 1:218
218/338 # proportion of observations assigned to the training set (roughly the 65% mentioned above)
algae.train<-algae_blooms.sna2[ind,]
algae.test<-algae_blooms.sna2[-ind,]
dim(algae.train)
dim(algae.test)
Before we get too excited about using various machine learning methods, let's see what the traditional approach yields.
We'll run a linear model to predict a2, for example, against all the predictor variables, but using only the training set as data.
linear.model.a2 <- lm(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,data=algae.train)
A summary of the results can be given by calling the summary method on the resulting object.
summary(linear.model.a2)
We see that the adjusted $R^2$ coefficient is fairly small. Furthermore, if the linear model were a good fit, the residuals would have a mean of 0 and be "small". That being said, the $F$-statistic seems to indicate that there is some (linear) dependence on the predictor variables.
We can get a better handle on the regression diagnostics by calling the plot() method on the object.
plot(linear.model.a2)
All in all, the linear model is nowhere near a great one, let alone a good one.
The significance of some of the coefficients is questionable, however, and we might wonder what effect their inclusion might have.
anova(linear.model.a2)
An ANOVA finds that NH4 contributes the least to the prediction of a2, and we might be interested in the results of a linear regression with that predictor removed.
linear.model.a2.mod <- update(linear.model.a2, . ~ . - NH4)
summary(linear.model.a2.mod)
The fit isn't much better, but an ANOVA on the 2 suggested models shows that we can be ~88% certain that the models are different (see below).
anova(linear.model.a2,linear.model.a2.mod)
The function step() uses the AIC to perform a model search, using backward elimination.
final.linear.model.a2 <- step(linear.model.a2)
The "best" linear model for a2 is shown below.
summary(final.linear.model.a2)
It's still not a great fit (the adjusted $R^2$ is quite small), so the linear model is not ideal for a2.
anova(final.linear.model.a2)
anova(linear.model.a2,final.linear.model.a2)
plot(final.linear.model.a2)
An alternative to regression is the use of regression trees, an implementation of which is found in the library rpart. The syntax is similar to that of lm().
library(rpart)
regression.tree.a2 <-rpart(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train)
The outcome can be displayed by calling the object directly.
regression.tree.a2
The tree structure can be hard to determine when there is a large number of nodes; let's see if we can improve on the visuals using rpart.plot.
library(rpart.plot)
prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")
Details on the regression tree can be obtained by calling the summary() method on the object.
summary(regression.tree.a2)
rpart() grows a tree until one of the following criteria is met:
- the decrease in deviance brought by a split falls below a certain threshold (controlled by cp)
- the number of observations in a node falls below a certain threshold (controlled by minsplit)
- the tree depth exceeds a maximum value (controlled by maxdepth)
rpart implements a pruning method based on cost complexity: finding the value of cp which best balances predictive accuracy and tree size.
printcp(regression.tree.a2)
The tree returned by rpart() is the final one (cp $=0.01$ is the default value); it requires 11 tests and has a relative error of 0.485. Internally, rpart() uses 10-fold cross-validation to estimate that the tree has an average relative error of $0.98\pm 0.18$ (these values might change when you run the model due to the stochastic nature of the internal cross-validation routines).
In this framework, the optimal tree minimizes the value of xerror. Alternatively, one could use the 1-SE rule and select the smallest tree whose xerror is within one xstd of the minimum.
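The selection can also be done programmatically from the cptable (a sketch, not part of the original notebook), either by minimal xerror or by the 1-SE rule:
cp.tab <- regression.tree.a2$cptable
best <- which.min(cp.tab[,"xerror"])                       # row with the minimal cross-validated error
threshold <- cp.tab[best,"xerror"] + cp.tab[best,"xstd"]   # 1-SE threshold
row.1se <- min(which(cp.tab[,"xerror"] <= threshold))      # simplest tree within 1 SE of the minimum
prune(regression.tree.a2, cp=cp.tab[row.1se,"CP"])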
plotcp(regression.tree.a2)
The scree plot above suggests that cp $=0.08$ (that value may change when you run yours, due to the stochastic nature of the internal cross-validation algorithm) is a special value for tree growth, so we could prune the tree using a value in that vicinity (we use cp $=0.05$ below).
regression.tree.a2.mod <- prune(regression.tree.a2,cp=0.05)
regression.tree.a2.mod
prp(regression.tree.a2.mod,extra=101,box.col="orange",split.box.col="gray")
The entire process is automated in the wrapper method rpartXse() provided with the DMwR library. In our implementation, we (arbitrarily) use se $=0.3$.
library(DMwR)
(regression.tree.a2.final <- rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,se=0.3))
summary(regression.tree.a2.final)
prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")
The resulting tree is not nearly as complex as the original tree (which discourages overfitting), but it is still more complex than the pruned tree (which should improve predictive accuracy).
At this stage, we know that the linear model is not great for a2, and we've seen how to grow a regression tree for a2, but we have not yet discussed whether this model is a good fit, to say nothing of the remaining 6 algae concentrations.
Can we get a better handle on these models' performance (i.e. comparing the model predictions to the real values of the target variable in the test data)?
Various metrics can be used to determine how the predicted and actual values compare: the mean absolute error (MAE), the mean squared error (MSE), and the normalized mean squared error (NMSE), among others.
As the ratio of the model's MSE to that of a baseline predictor (which always predicts the mean value of the target), the NMSE is unitless. NMSE values between 0 and 1 (smaller is better) indicate that the model performs better than the baseline; values greater than 1 indicate that the model's performance is sub-par.
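Concretely, if $\hat{y}_i$ denotes the model's prediction for observation $i$, $y_i$ the observed value, and $\bar{y}$ the mean of the observed target values (the baseline prediction), then
$$\mathrm{NMSE} = \frac{\sum_{i}(\hat{y}_i - y_i)^2}{\sum_{i}(\bar{y} - y_i)^2};$$
this is exactly the computation carried out at the end of this notebook (with $\bar{y}$ taken from the training set).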
There are other metrics, but these are the most common.
So how does model evaluation work? Typically, one computes the selected evaluation metric for a variety of models, and picks the model which provides the optimal metric value.
$k$-fold cross-validation is often used to try to mitigate the effects of overfitting for mid-size datasets (no more than 50,000 observations, say).
performanceEstimation() can be used to automate the process. Here, we'll use the linear model and 5 regression trees (parameterized by se), together with 5 repetitions of $10$-fold cross-validation, to select the best predictive model for a2.
library(performanceEstimation)
kCV.results.algae.a2 <- performanceEstimation(
PredTask(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,"a2"),
c(Workflow(learner="lm",post="onlyPos"),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.25,0.5,0.75,1)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
A summary and plot of the cross-validation results for NMSE can be displayed using calls to summary() and plot().
summary(kCV.results.algae.a2)
plot(kCV.results.algae.a2)
It's not necessarily clear which of the models has smaller values of NMSE, although it does seem that the latter versions of the regression tree models are not substantially better than the baseline model.
The first regression tree model sometimes produces very small NMSE values, but that's offset by some of the larger values it also produces (similarly for the linear model).
At any rate, visual evidence seems to suggest that the linear model is the best predictive model for a2 given the training data (in this version of $k$CV), which is corroborated by a call to topPerformers().
topPerformers(kCV.results.algae.a2)
This might seem disheartening at first given how poorly the linear model performed, but it might be helpful to remember that there is no guarantee that a decent predictive model even exists.
Furthermore, regression trees and linear models are only two of a whole collection of possible models. How do support vector machines perform the task, for instance?
This time, however, we will learn models and perform evaluation for all target variables (a1-a7) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.
This first requires some code to create the appropriate model formulas (a1 ~ ., ..., a7 ~ .) and the appropriate training data.
gg<-function(x,list.of.variables){
PredTask(as.formula(paste(x,"~ .")),algae.train[,c(list.of.variables,x)], x, copy=TRUE)
}
data.sources <- sapply(names(algae.train[12:18]),gg,names(algae.train[1:11]))
data.sources
We shall use e1071's implementation of svm(), with various values of the svm()-specific parameters cost and gamma.
library(e1071)
kCV.results.algae.all <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.
plot(kCV.results.algae.all)
rankWorkflows(kCV.results.algae.all,top=3)
topPerformers(kCV.results.algae.all)
For a1, the models seem to perform reasonably well. But it's not so rosy for the other target variables, where the baseline model is sometimes better.
Again, this could be inherent to the data, but we might benefit from incorporating more models. As a last attempt to improve the situation, let's include random forests.
library(randomForest)
kCV.results.algae.all.rf <- performanceEstimation(
data.sources,
c(Workflow(learner="lm",post="onlyPos"),
Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
),
EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10))
)
rankWorkflows(kCV.results.algae.all.rf,top=3)
rankWorkflows() does not report the standard error, so we cannot tell whether the differences between the score of the best model and those of the other models are statistically significant.
randomForest.v3 seems to have the best ranking across all learning tasks, so we will use it as the baseline model.
p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test
p$nmse$BonferroniDunn.test
We can reject, with 95% certainty, the hypothesis that the performance of the baseline method (randomForest.v3) is the same as that of the linear model and the first 2 regression trees, but we cannot conclude that it performs better than svm, rpartXse.v3, and the other 2 random forests.
The information is also displayed in the Bonferroni-Dunn CD diagram below.
CDdiagram.BD(p)
Finally, we might actually be interested in generating predictions for each of the target variables in the testing set.
This requires simply that the best performers for each target be brought together in an R object.
best.performers <- sapply(taskNames(kCV.results.algae.all.rf), function(x) topPerformer(kCV.results.algae.all.rf,metric="nmse",task=x))
best.performers
The observations that form the testing set are shown below; we also set up an array to hold the corresponding actual and predicted values for each of the 7 targets.
rownames(algae.test)
test.observations <- array(dim=c(nrow(algae.test),7,2),dimnames=list(rownames(algae.test),paste("a",1:7),c("actual","predicted")))
The function runWorkflow() will compute the predicted values for each of the targets' best performers. We can then plot the predicted and actual values for each of the testing set targets.
for(j in 1:7){
results <- runWorkflow(best.performers[[j]],
as.formula(paste(names(best.performers)[j],"~ .")),
algae.train[,c(1:11,11+j)],
algae.test[,c(1:11,11+j)])
test.observations[,j,"actual"] <- results$trues
test.observations[,j,"predicted"] <- results$preds
}
df.a1 <- as.data.frame(test.observations[,1,])
df.a2 <- as.data.frame(test.observations[,2,])
df.a3 <- as.data.frame(test.observations[,3,])
df.a4 <- as.data.frame(test.observations[,4,])
df.a5 <- as.data.frame(test.observations[,5,])
df.a6 <- as.data.frame(test.observations[,6,])
df.a7 <- as.data.frame(test.observations[,7,])
plot(df.a1,main="a1 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a2,main="a2 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a3,main="a3 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a4,main="a4 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a5,main="a5 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a6,main="a6 - predicted vs. actual")
abline(0,1,col="red")
plot(df.a7,main="a7 - predicted vs. actual")
abline(0,1,col="red")
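Equivalently, the seven data frames and plots above could have been produced with a single loop (a compact sketch of the same plots):
for(j in 1:7){
  df <- as.data.frame(test.observations[,j,]) # actual vs. predicted values for the j-th algae
  plot(df, main=paste0("a",j," - predicted vs. actual"))
  abline(0,1,col="red") # the perfect-prediction line
}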
The models simply are not that great, but we already expected that. The average value of each target over the training set (which serves as the baseline prediction) is shown below.
average.prediction <- apply(algae.train[,12:18],2, mean)
average.prediction
Finally, you might be interested in the NMSE metrics for the predicted values and how they compare to the NMSE metrics on the training set. Is any of this surprising?
apply((test.observations[,,"actual"]-test.observations[,,"predicted"])^2,2,sum) / apply((scale(test.observations[,,"actual"],average.prediction,FALSE))^2,2,sum)
svm model?