PREDICTING ALGAE BLOOMS

(based on a Case Study by L.Torgo)

This notebook provides a concrete and thorough illustration of the data science process using a realistic dataset: algae_blooms.csv, which is also available at the UCI Machine Learning Repository.

The problem is to predict the occurrence of harmful algae in water samples.

We also use it to highlight various aspects of data exploration, data cleaning, and R syntax.

OUTLINE

Problem Description
Loading the Data
Summary and Visualization
Data Cleaning
Prediction Models
- General Linear Model
- Regression Tree Model
Model Evaluation
Model Predictions
Exercises

LOADING THE DATA

Before we can take a look at the data and begin the process in earnest, we need to load it in the in the R workspace.

The dataset is stored in the CSV file algae_blooms.csv (in the Data directory).

# The data can also be loaded as part of Torgo's DMwR package, 
# which could be loaded directly as below (by uncommenting the 
# line of code), but that is not how we will access it.

# library(DMwR)

As the data resides in a CSV file, we use the read.csv R function to load it and store it to an R data frame (Torgo stores it to a tibble object).

(Note the sep and header options – what do you think their function is?)

algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)

We can get an idea of the data frame’s structure by calling the str function.

str(algae_blooms)

## 'data.frame':    340 obs. of  18 variables:
##  $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
##  $ size  : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ speed : Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 1 1 1 3 1 ...
##  $ mxPH  : num  8 8.35 8.1 8.07 8.06 8.25 8.15 8.05 8.7 7.93 ...
##  $ mnO2  : num  9.8 8 11.4 4.8 9 13.1 10.3 10.6 3.4 9.9 ...
##  $ Cl    : num  60.8 57.8 40 77.4 55.4 ...
##  $ NO3   : num  6.24 1.29 5.33 2.3 10.42 ...
##  $ NH4   : num  578 370 346.7 98.2 233.7 ...
##  $ oPO4  : num  105 428.8 125.7 61.2 58.2 ...
##  $ PO4   : num  170 558.8 187.1 138.7 97.6 ...
##  $ Chla  : num  50 1.3 15.6 1.4 10.5 ...
##  $ a1    : num  0 1.4 3.3 3.1 9.2 15.1 2.4 18.2 25.4 17 ...
##  $ a2    : num  0 7.6 53.6 41 2.9 14.6 1.2 1.6 5.4 0 ...
##  $ a3    : num  0 4.8 1.9 18.9 7.5 1.4 3.2 0 2.5 0 ...
##  $ a4    : num  0 1.9 0 0 0 0 3.9 0 0 2.9 ...
##  $ a5    : num  34.2 6.7 0 1.4 7.5 22.5 5.8 5.5 0 0 ...
##  $ a6    : num  8.3 0 0 0 4.1 12.6 6.8 8.7 0 0 ...
##  $ a7    : num  0 2.1 9.7 1.4 1 2.9 0 0 0 1.7 ...

Evidently, algae_blooms is a data frame with 340 observations of 18 variables each.

Notes: - 3 of the fields are categorical (season, size, speed, which refer to the data collection process) - of the numerical fields, 8 have names that sound vaguely “chemical” - presumably, the remaining fields refer to the algae blooms

We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head() or tail() functions.

head(algae_blooms)

##   season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla
## 1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3
## 3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4
## 5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5
## 6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4
##     a1   a2   a3  a4   a5   a6  a7
## 1  0.0  0.0  0.0 0.0 34.2  8.3 0.0
## 2  1.4  7.6  4.8 1.9  6.7  0.0 2.1
## 3  3.3 53.6  1.9 0.0  0.0  0.0 9.7
## 4  3.1 41.0 18.9 0.0  1.4  0.0 1.4
## 5  9.2  2.9  7.5 0.0  7.5  4.1 1.0
## 6 15.1 14.6  1.4 0.0 22.5 12.6 2.9

First, we filter the algae_blooms dataset to remove the 2 observations with missing values (is this always a good idea?)

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) # %>% 
    # filter(NH4<3000) %>% 
    # mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)

##   season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla
## 1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3
## 3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4
## 5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5
## 6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4
##     a1   a2   a3  a4   a5   a6  a7
## 1  0.0  0.0  0.0 0.0 34.2  8.3 0.0
## 2  1.4  7.6  4.8 1.9  6.7  0.0 2.1
## 3  3.3 53.6  1.9 0.0  0.0  0.0 9.7
## 4  3.1 41.0 18.9 0.0  1.4  0.0 1.4
## 5  9.2  2.9  7.5 0.0  7.5  4.1 1.0
## 6 15.1 14.6  1.4 0.0 22.5 12.6 2.9

nrow(f.NH4.data)

## [1] 338

Next we remove the 11 observations for which NH4 > 3000 (again, based on the mean + sd “argument”)

f.NH4.data <- filter(algae_blooms,!is.na(NH4))  %>% 
     filter(NH4<3000) # %>% 
    # mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)

##   season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla
## 1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3
## 3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4
## 5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5
## 6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4
##     a1   a2   a3  a4   a5   a6  a7
## 1  0.0  0.0  0.0 0.0 34.2  8.3 0.0
## 2  1.4  7.6  4.8 1.9  6.7  0.0 2.1
## 3  3.3 53.6  1.9 0.0  0.0  0.0 9.7
## 4  3.1 41.0 18.9 0.0  1.4  0.0 1.4
## 5  9.2  2.9  7.5 0.0  7.5  4.1 1.0
## 6 15.1 14.6  1.4 0.0 22.5 12.6 2.9

nrow(f.NH4.data)

## [1] 327

Finally, we add a new variable to tell us in which quartile the NH4 value falls.

f.NH4.data <- filter(algae_blooms,!is.na(NH4))  %>% 
     filter(NH4<3000)  %>% 
     mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1))))
head(f.NH4.data)

##   season  size  speed mxPH mnO2     Cl    NO3     NH4    oPO4     PO4 Chla
## 1 winter small medium 8.00  9.8 60.800  6.238 578.000 105.000 170.000 50.0
## 2 spring small medium 8.35  8.0 57.750  1.288 370.000 428.750 558.750  1.3
## 3 autumn small medium 8.10 11.4 40.020  5.330 346.667 125.667 187.057 15.6
## 4 spring small medium 8.07  4.8 77.364  2.302  98.182  61.182 138.700  1.4
## 5 autumn small medium 8.06  9.0 55.350 10.416 233.700  58.222  97.580 10.5
## 6 winter small   high 8.25 13.1 65.750  9.248 430.000  18.250  56.667 28.4
##     a1   a2   a3  a4   a5   a6  a7          q.NH4
## 1  0.0  0.0  0.0 0.0 34.2  8.3 0.0 (226,2.17e+03]
## 2  1.4  7.6  4.8 1.9  6.7  0.0 2.1 (226,2.17e+03]
## 3  3.3 53.6  1.9 0.0  0.0  0.0 9.7 (226,2.17e+03]
## 4  3.1 41.0 18.9 0.0  1.4  0.0 1.4     (36.8,103]
## 5  9.2  2.9  7.5 0.0  7.5  4.1 1.0 (226,2.17e+03]
## 6 15.1 14.6  1.4 0.0 22.5 12.6 2.9 (226,2.17e+03]

nrow(f.NH4.data)

## [1] 327

We can now use the new q.NH4 variable to make multi-variate comparisons, say between a1, NH4, and season.

ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) + # plot the a1 levels by season ...
    geom_point() +
    facet_wrap(~q.NH4) +                             # for each of the quartiles (each plot contains ~25% of the observations)
    guides(color=FALSE) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

That’s odd… why are we seeing missing values here? Haven’t we removed the NAs? Let’s delve in a bit deeper.

f.NH4.data[which(is.na(f.NH4.data$q.NH4)),]

##     season  size  speed mxPH mnO2 Cl  NO3 NH4 oPO4 PO4 Chla   a1 a2 a3 a4
## 53  spring small medium  5.6 11.8 NA 2.22   5    1   1   NA 82.7  0  0  0
## 223 autumn small   high  5.9 11.9 NA 1.88   5    1   2   NA 75.2  0  0  0
##     a5 a6 a7 q.NH4
## 53   0  0  0  <NA>
## 223  0  0  0  <NA>

table(f.NH4.data$q.NH4,useNA="ifany")

## 
##       (5,36.8]     (36.8,103]      (103,226] (226,2.17e+03]           <NA> 
##             80             82             81             82              2

Ah, it seems as though the quartiles do not include their lower bound. We can remedy the situation by including an additional parameter in the mutate() call.

f.NH4.data <- filter(algae_blooms,!is.na(NH4)) %>% 
    filter(NH4<3000) %>% 
    mutate(q.NH4=cut(NH4,quantile(NH4,c(0,0.25,0.5,0.75,1)), include.lowest=TRUE))
table(f.NH4.data$q.NH4,useNA="ifany")

## 
##       [5,36.8]     (36.8,103]      (103,226] (226,2.17e+03] 
##             82             82             81             82

ggplot(f.NH4.data,aes(x=a1,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    guides(color=FALSE) +
    ggtitle("Algae Level a1 by Season and Ammonium Quartiles NH4")

The NAs have disappeared from the graph (but the observations have not!). Either way, it seems as though the a1 levels are inversely correlated with the NH4+ levels but that the season doesn’t have much of an effect (although we would need more work to conclude with any degree of certainty).

ggplot(f.NH4.data,aes(x=a3,y=season,color=season)) +
    geom_point() +
    facet_wrap(~q.NH4) +
    guides(color=FALSE) +
    ggtitle("Algae Level a3 by Season and Ammonium Quartiles NH4")

Our new dataset still contains observations with missing cases, however.

algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)

## [1] 338  18

What can we do with the other observations for which values are missing?

One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with annother field. That way, if there is a missing value in the first field, the second could be used to impute it.

IMPORTANT NOTE: - this approach only works for variables that are linearly correlated to a single other variable. Non-linear correlations and multi-variate associations will not be uncovered.

symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))

##      mP mO Cl NO NH o P Ch a1 a2 a3 a4 a5 a6 a7
## mxPH 1                                         
## mnO2    1                                      
## Cl         1                                   
## NO3           1                                
## NH4           .  1                             
## oPO4             .  1                          
## PO4     .  .     .  * 1                        
## Chla .                  1                      
## a1         .        . .    1                   
## a2                            1                
## a3                               1             
## a4      .        .  . .             1          
## a5                                     1       
## a6            .                        .  1    
## a7                                           1 
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

This might be a bit hard to read. We can use the corrplot library to visualize the (linear) correlations.

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")

The correlation between PO4 (which has missing cases) and oPO4 (which does not, anymore) is clear.

What is the nature of the relation? We’ll use the set of complete cases to find it.

algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]

dim(algae_blooms.nona)

## [1] 306  18

PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model

## 
## Call:
## lm(formula = PO4 ~ oPO4, data = algae_blooms.nona)
## 
## Coefficients:
## (Intercept)         oPO4  
##      51.811        1.203

The regression function is PO4\(=51.811+1.203\)oPO4 (note that I’m not really interested in the fit statistics).

Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]

What are the observations for which PO4 is missing?

which(is.na(algae_blooms.sna$PO4)==TRUE)

## [1]  28 221 291 326 331 335

Let’s use the regression function to impute the missing PO4 values.

algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4),max(Intercept + Slope*algae_blooms.sna2$oPO4,0),algae_blooms.sna2$PO4)

We can clearly see that no values of PO4 are missing anymore.

which(is.na(algae_blooms.sna2$PO4)==TRUE)

## integer(0)

That takes care of the missing values with strong linear correlation to another field. Where do we stand now?

summary(algae_blooms.sna2)

##     season       size        speed          mxPH            mnO2       
##  spring:84   small :120   low   : 58   Min.   :5.600   Min.   : 1.500  
##  summer:85   medium:136   medium:138   1st Qu.:7.750   1st Qu.: 8.000  
##  autumn:80   large : 82   high  :142   Median :8.055   Median : 9.700  
##  winter:89                             Mean   :8.002   Mean   : 9.161  
##                                        3rd Qu.:8.400   3rd Qu.:10.800  
##                                        Max.   :9.700   Max.   :13.400  
##                                        NA's   :2       NA's   :1       
##        Cl               NO3              NH4                oPO4        
##  Min.   :  0.222   Min.   : 0.000   Min.   :    5.00   Min.   :   1.00  
##  1st Qu.: 10.994   1st Qu.: 1.147   1st Qu.:   37.86   1st Qu.:  13.00  
##  Median : 32.470   Median : 2.356   Median :  107.36   Median :  37.24  
##  Mean   : 42.517   Mean   : 3.121   Mean   :  471.73   Mean   :  73.09  
##  3rd Qu.: 57.750   3rd Qu.: 4.147   3rd Qu.:  244.90   3rd Qu.:  88.11  
##  Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :1435.00  
##  NA's   :14                                                             
##       PO4               Chla               a1              a2        
##  Min.   :   1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:  40.75   1st Qu.:  2.133   1st Qu.: 1.50   1st Qu.: 0.000  
##  Median : 104.86   Median :  5.111   Median : 7.10   Median : 2.800  
##  Mean   : 166.18   Mean   : 12.796   Mean   :16.74   Mean   : 7.207  
##  3rd Qu.: 206.12   3rd Qu.: 17.200   3rd Qu.:25.32   3rd Qu.:10.025  
##  Max.   :1777.93   Max.   :110.456   Max.   :89.80   Max.   :72.600  
##                    NA's   :21                                        
##        a3               a4               a5               a6        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 1.400   Median : 0.000   Median : 2.200   Median : 0.000  
##  Mean   : 3.917   Mean   : 1.812   Mean   : 5.548   Mean   : 6.438  
##  3rd Qu.: 4.675   3rd Qu.: 2.300   3rd Qu.: 8.000   3rd Qu.: 7.075  
##  Max.   :42.800   Max.   :44.600   Max.   :61.100   Max.   :77.600  
##                                                                     
##        a7      
##  Min.   : 0.0  
##  1st Qu.: 0.0  
##  Median : 0.0  
##  Mean   : 2.2  
##  3rd Qu.: 2.2  
##  Max.   :31.6  
##

Alright, so we don’t have as many missing values as before, but we still have some. And the correlation trick isn’t going to work this time.

What can we do? There are many ways to tackle the problem, but we will use \(k\)NN imputation. The principle is simple. Using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the \(k\) nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing case in the \(k-\)group of complete observations, and use that value as the imputed value.

IMPORTANT NOTES: - as we have seen when we were discussing, we often suggest scaling the data when dealing with distance metrics. I elected not to scale the data explicitely here. How much of an effect can that have?
- we are going to be using DMwRs implementation of knnImputation() (below). How would you go about determining if the routine scales the data internally?

library(DMwR)

## Loading required package: lattice

## Loading required package: grid

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.

Sure enough, there are no further observations with incomplete cases.

summary(algae_blooms.sna2)

##     season       size        speed          mxPH            mnO2       
##  spring:84   small :120   low   : 58   Min.   :5.600   Min.   : 1.500  
##  summer:85   medium:136   medium:138   1st Qu.:7.755   1st Qu.: 8.000  
##  autumn:80   large : 82   high  :142   Median :8.045   Median : 9.750  
##  winter:89                             Mean   :8.001   Mean   : 9.165  
##                                        3rd Qu.:8.393   3rd Qu.:10.800  
##                                        Max.   :9.700   Max.   :13.400  
##        Cl               NO3              NH4                oPO4        
##  Min.   :  0.222   Min.   : 0.000   Min.   :    5.00   Min.   :   1.00  
##  1st Qu.: 10.581   1st Qu.: 1.147   1st Qu.:   37.86   1st Qu.:  13.00  
##  Median : 31.233   Median : 2.356   Median :  107.36   Median :  37.24  
##  Mean   : 41.326   Mean   : 3.121   Mean   :  471.73   Mean   :  73.09  
##  3rd Qu.: 57.323   3rd Qu.: 4.147   3rd Qu.:  244.90   3rd Qu.:  88.11  
##  Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :1435.00  
##       PO4               Chla               a1              a2        
##  Min.   :   1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.:  40.75   1st Qu.:  2.100   1st Qu.: 1.50   1st Qu.: 0.000  
##  Median : 104.86   Median :  5.042   Median : 7.10   Median : 2.800  
##  Mean   : 166.18   Mean   : 12.328   Mean   :16.74   Mean   : 7.207  
##  3rd Qu.: 206.12   3rd Qu.: 16.021   3rd Qu.:25.32   3rd Qu.:10.025  
##  Max.   :1777.93   Max.   :110.456   Max.   :89.80   Max.   :72.600  
##        a3               a4               a5               a6        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 1.400   Median : 0.000   Median : 2.200   Median : 0.000  
##  Mean   : 3.917   Mean   : 1.812   Mean   : 5.548   Mean   : 6.438  
##  3rd Qu.: 4.675   3rd Qu.: 2.300   3rd Qu.: 8.000   3rd Qu.: 7.075  
##  Max.   :42.800   Max.   :44.600   Max.   :61.100   Max.   :77.600  
##        a7      
##  Min.   : 0.0  
##  1st Qu.: 0.0  
##  Median : 0.0  
##  Mean   : 2.2  
##  3rd Qu.: 2.2  
##  Max.   :31.6

table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns

## 
##   0 
## 338

We see that the adjusted \(R^2\) coefficient is fairly small. Furthermore, if the linear model is a good fit, the residuals should have a mean of 0 and be as “small”. That being said, the \(F-\)statistic seems to indicate that there is some (linear) dependence on the predictor variables.

We can get a better handle on the regression diagnostics by calling the plot() method on the object.

plot(linear.model.a2)

The fit isn’t much better, but an ANOVA on the 2 suggested models shows that we’re ~88% that the models are different (see below).

anova(linear.model.a2,linear.model.a2.mod)

## Analysis of Variance Table
## 
## Model 1: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
##     PO4 + Chla
## Model 2: a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + oPO4 + 
##     PO4 + Chla
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1    202 23309                           
## 2    203 23453 -1   -143.86 1.2467 0.2655

The function step uses AIC to perform a model search (using backward elimination), implemented in the method step().

final.linear.model.a2 <- step(linear.model.a2)

## Start:  AIC=1050.52
## a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
##     PO4 + Chla
## 
##          Df Sum of Sq   RSS    AIC
## - season  3    157.44 23466 1046.0
## - oPO4    1      6.20 23315 1048.6
## - PO4     1      7.96 23317 1048.6
## - Cl      1     15.85 23325 1048.7
## - mnO2    1     25.40 23334 1048.8
## - NO3     1     57.19 23366 1049.0
## - size    2    282.28 23591 1049.1
## - NH4     1    143.86 23453 1049.9
## <none>                23309 1050.5
## - speed   2    967.47 24276 1055.4
## - mxPH    1   1121.22 24430 1058.8
## - Chla    1   1478.18 24787 1061.9
## 
## Step:  AIC=1045.98
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + 
##     Chla
## 
##         Df Sum of Sq   RSS    AIC
## - oPO4   1      2.54 23469 1044.0
## - PO4    1      4.10 23470 1044.0
## - mnO2   1      6.61 23473 1044.0
## - Cl     1     15.59 23482 1044.1
## - size   2    257.60 23724 1044.4
## - NO3    1     47.04 23513 1044.4
## - NH4    1    114.06 23580 1045.0
## <none>               23466 1046.0
## - speed  2   1035.56 24502 1051.4
## - mxPH   1   1052.01 24518 1053.5
## - Chla   1   1477.06 24943 1057.3
## 
## Step:  AIC=1044.01
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + PO4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - PO4    1      1.62 23470 1042.0
## - mnO2   1      7.17 23476 1042.1
## - Cl     1     14.19 23483 1042.1
## - NO3    1     44.93 23514 1042.4
## - size   2    266.73 23736 1042.5
## - NH4    1    114.91 23584 1043.1
## <none>               23469 1044.0
## - speed  2   1050.55 24519 1049.5
## - mxPH   1   1099.78 24569 1052.0
## - Chla   1   1480.47 24949 1055.3
## 
## Step:  AIC=1042.02
## a2 ~ size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - mnO2   1      6.59 23477 1040.1
## - Cl     1     17.42 23488 1040.2
## - size   2    265.19 23736 1040.5
## - NO3    1     51.04 23521 1040.5
## - NH4    1    140.72 23611 1041.3
## <none>               23470 1042.0
## - speed  2   1050.42 24521 1047.6
## - mxPH   1   1105.21 24576 1050.0
## - Chla   1   1482.34 24953 1053.4
## 
## Step:  AIC=1040.08
## a2 ~ size + speed + mxPH + Cl + NO3 + NH4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - Cl     1     13.41 23490 1038.2
## - size   2    260.65 23738 1038.5
## - NO3    1     44.48 23522 1038.5
## - NH4    1    135.66 23613 1039.3
## <none>               23477 1040.1
## - speed  2   1121.64 24599 1046.3
## - mxPH   1   1103.17 24580 1048.1
## - Chla   1   1492.55 24970 1051.5
## 
## Step:  AIC=1038.21
## a2 ~ size + speed + mxPH + NO3 + NH4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - NO3    1     36.13 23526 1036.5
## - size   2    275.91 23766 1036.8
## - NH4    1    128.31 23619 1037.4
## <none>               23490 1038.2
## - speed  2   1172.78 24663 1044.8
## - mxPH   1   1089.85 24580 1046.1
## - Chla   1   1490.94 24981 1049.6
## 
## Step:  AIC=1036.54
## a2 ~ size + speed + mxPH + NH4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - size   2    244.91 23771 1034.8
## - NH4    1     93.48 23620 1035.4
## <none>               23526 1036.5
## - speed  2   1164.36 24691 1043.1
## - mxPH   1   1053.88 24580 1044.1
## - Chla   1   1611.04 25138 1049.0
## 
## Step:  AIC=1034.8
## a2 ~ speed + mxPH + NH4 + Chla
## 
##         Df Sum of Sq   RSS    AIC
## - NH4    1     82.62 23854 1033.6
## <none>               23771 1034.8
## - mxPH   1    850.56 24622 1040.5
## - speed  2   1085.45 24857 1040.5
## - Chla   1   1540.50 25312 1046.5
## 
## Step:  AIC=1033.56
## a2 ~ speed + mxPH + Chla
## 
##         Df Sum of Sq   RSS    AIC
## <none>               23854 1033.6
## - speed  2   1021.27 24875 1038.7
## - mxPH   1    928.72 24783 1039.9
## - Chla   1   1479.59 25334 1044.7

REGRESSION TREE MODEL

An alternative to regression is the use of regression trees, an implementation of which is found in the library rpart. The syntax is similar to lm.

library(rpart)

regression.tree.a2 <-rpart(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train)

The outcome can be displayed by calling the object directly.

regression.tree.a2

## n= 218 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 218 29355.1300  7.6366970  
##     2) Cl< 16.6875 83  1193.6400  1.8891570  
##       4) size=small,medium 67   398.6457  0.9447761 *
##       5) size=large 16   485.0194  5.8437500 *
##     3) Cl>=16.6875 135 23733.9200 11.1703700  
##       6) mxPH< 8.065 59  3831.8290  5.3864410  
##        12) season=autumn,winter 29   561.8414  2.5172410 *
##        13) season=spring,summer 30  2800.4720  8.1600000  
##          26) mxPH< 7.9375 23   889.9730  5.3173910 *
##          27) mxPH>=7.9375 7  1114.0000 17.5000000 *
##       7) mxPH>=8.065 76 16396.0400 15.6605300  
##        14) Chla>=2.65 68  9694.0890 13.8544100  
##          28) Chla< 14.8875 29  2747.5810  8.7172410  
##            56) NH4< 226.875 21   558.4257  5.7857140 *
##            57) NH4>=226.875 8  1534.9490 16.4125000 *
##          29) Chla>=14.8875 39  5612.0940 17.6743600  
##            58) mnO2< 11.05 30  3139.0940 15.4233300  
##             116) NH4>=158.409 8   577.1000  8.9000000 *
##             117) NH4< 158.409 22  2097.7700 17.7954500  
##               234) season=spring,autumn 14   674.7521 14.6642900 *
##               235) season=summer,winter 8  1045.5550 23.2750000 *
##            59) mnO2>=11.05 9  1814.2760 25.1777800 *
##        15) Chla< 2.65 8  4594.6690 31.0125000 *

The tree structure can be hard to determine when there is a large number of nodes; let’s see if we can improve on the visuals using rpart.plot.

library(rpart.plot)

prp(regression.tree.a2,extra=101,box.col="orange",split.box.col="gray")

Details on the regression tree can be obtained by calling the summary() method on the object.

summary(regression.tree.a2)

## Call:
## rpart(formula = a2 ~ season + size + speed + mxPH + mnO2 + Cl + 
##     NO3 + NH4 + oPO4 + PO4 + Chla, data = algae.train)
##   n= 218 
## 
##            CP nsplit rel error    xerror      xstd
## 1  0.15082765      0 1.0000000 1.0094399 0.1994604
## 2  0.11943572      1 0.8491724 0.9479860 0.1735351
## 3  0.07178590      2 0.7297366 0.9342445 0.1677879
## 4  0.04545758      3 0.6579507 0.8986776 0.1682068
## 5  0.02243987      4 0.6124932 0.9669438 0.1825075
## 6  0.02228595      5 0.5900533 1.0092162 0.1893206
## 7  0.02156378      6 0.5677673 1.0157825 0.1894284
## 8  0.01581407      8 0.5246398 0.9891863 0.1771592
## 9  0.01285848      9 0.5088257 0.9422149 0.1718359
## 10 0.01055949     10 0.4959672 0.9573215 0.1762788
## 11 0.01000000     11 0.4854077 0.9488856 0.1746610
## 
## Variable importance
##   Chla    NH4     Cl   mxPH   oPO4    PO4    NO3  speed   mnO2 season 
##     19     14     14     13     11      9      6      5      4      3 
##   size 
##      2 
## 
## Node number 1: 218 observations,    complexity param=0.1508276
##   mean=7.636697, MSE=134.6565 
##   left son=2 (83 obs) right son=3 (135 obs)
##   Primary splits:
##       Cl   < 16.6875  to the left,  improve=0.15082760, (0 missing)
##       mxPH < 7.94     to the left,  improve=0.14900670, (0 missing)
##       oPO4 < 45.1     to the left,  improve=0.11106510, (0 missing)
##       Chla < 12.21    to the left,  improve=0.09817759, (0 missing)
##       NH4  < 46.35    to the left,  improve=0.09032131, (0 missing)
##   Surrogate splits:
##       PO4   < 70.465   to the left,  agree=0.844, adj=0.590, (0 split)
##       oPO4  < 19.8635  to the left,  agree=0.835, adj=0.566, (0 split)
##       NH4   < 46.35    to the left,  agree=0.807, adj=0.494, (0 split)
##       Chla  < 2.225    to the left,  agree=0.807, adj=0.494, (0 split)
##       speed splits as  RRL,          agree=0.775, adj=0.410, (0 split)
## 
## Node number 2: 83 observations,    complexity param=0.01055949
##   mean=1.889157, MSE=14.38121 
##   left son=4 (67 obs) right son=5 (16 obs)
##   Primary splits:
##       size  splits as  LLR,          improve=0.25968900, (0 missing)
##       Chla  < 4.4835   to the left,  improve=0.16315540, (0 missing)
##       mxPH  < 7.835    to the left,  improve=0.07327114, (0 missing)
##       oPO4  < 18.889   to the left,  improve=0.05858864, (0 missing)
##       speed splits as  LRL,          improve=0.05534763, (0 missing)
##   Surrogate splits:
##       speed splits as  LRL,          agree=0.843, adj=0.188, (0 split)
##       mxPH  < 8.385    to the left,  agree=0.831, adj=0.125, (0 split)
##       mnO2  < 8.3      to the right, agree=0.831, adj=0.125, (0 split)
##       Cl    < 15.2705  to the left,  agree=0.819, adj=0.062, (0 split)
## 
## Node number 3: 135 observations,    complexity param=0.1194357
##   mean=11.17037, MSE=175.8068 
##   left son=6 (59 obs) right son=7 (76 obs)
##   Primary splits:
##       mxPH < 8.065    to the left,  improve=0.14772320, (0 missing)
##       NO3  < 0.3785   to the right, improve=0.09864039, (0 missing)
##       Chla < 2.85     to the right, improve=0.06998520, (0 missing)
##       mnO2 < 11.05    to the left,  improve=0.06691501, (0 missing)
##       NH4  < 474.1875 to the right, improve=0.05884159, (0 missing)
##   Surrogate splits:
##       NH4  < 272.2915 to the right, agree=0.689, adj=0.288, (0 split)
##       Chla < 11.65    to the left,  agree=0.667, adj=0.237, (0 split)
##       NO3  < 5.37     to the right, agree=0.659, adj=0.220, (0 split)
##       mnO2 < 6.55     to the left,  agree=0.622, adj=0.136, (0 split)
##       oPO4 < 279.5085 to the right, agree=0.607, adj=0.102, (0 split)
## 
## Node number 4: 67 observations
##   mean=0.9447761, MSE=5.949935 
## 
## Node number 5: 16 observations
##   mean=5.84375, MSE=30.31371 
## 
## Node number 6: 59 observations,    complexity param=0.02156378
##   mean=5.386441, MSE=64.94626 
##   left son=12 (29 obs) right son=13 (30 obs)
##   Primary splits:
##       season splits as  RRLL,         improve=0.12253050, (0 missing)
##       NH4    < 339.9305 to the right, improve=0.08821754, (0 missing)
##       mxPH   < 7.905    to the left,  improve=0.07581077, (0 missing)
##       Chla   < 2.5      to the right, improve=0.05654547, (0 missing)
##       oPO4   < 25.75    to the left,  improve=0.05324680, (0 missing)
##   Surrogate splits:
##       NO3  < 2.8605   to the right, agree=0.712, adj=0.414, (0 split)
##       mnO2 < 9.75     to the right, agree=0.661, adj=0.310, (0 split)
##       oPO4 < 144.1885 to the left,  agree=0.627, adj=0.241, (0 split)
##       Chla < 6.85     to the right, agree=0.627, adj=0.241, (0 split)
##       mxPH < 7.975    to the right, agree=0.610, adj=0.207, (0 split)
## 
## Node number 7: 76 observations,    complexity param=0.0717859
##   mean=15.66053, MSE=215.7374 
##   left son=14 (68 obs) right son=15 (8 obs)
##   Primary splits:
##       Chla < 2.65     to the right, improve=0.12852400, (0 missing)
##       mnO2 < 11.05    to the left,  improve=0.10098910, (0 missing)
##       NO3  < 0.6045   to the right, improve=0.08727846, (0 missing)
##       Cl   < 35.705   to the right, improve=0.07266453, (0 missing)
##       mxPH < 8.375    to the right, improve=0.04660101, (0 missing)
##   Surrogate splits:
##       Cl   < 19.11    to the right, agree=0.921, adj=0.25, (0 split)
##       NO3  < 0.076    to the right, agree=0.921, adj=0.25, (0 split)
##       NH4  < 2713     to the left,  agree=0.921, adj=0.25, (0 split)
##       oPO4 < 8        to the right, agree=0.921, adj=0.25, (0 split)
##       PO4  < 40.3335  to the right, agree=0.921, adj=0.25, (0 split)
## 
## Node number 12: 29 observations
##   mean=2.517241, MSE=19.37384 
## 
## Node number 13: 30 observations,    complexity param=0.02156378
##   mean=8.16, MSE=93.34907 
##   left son=26 (23 obs) right son=27 (7 obs)
##   Primary splits:
##       mxPH < 7.9375   to the left,  improve=0.28441600, (0 missing)
##       NH4  < 119.2855 to the left,  improve=0.10978520, (0 missing)
##       oPO4 < 42.857   to the left,  improve=0.06038492, (0 missing)
##       Cl   < 37.624   to the left,  improve=0.05321153, (0 missing)
##       Chla < 3.1      to the right, improve=0.04623221, (0 missing)
##   Surrogate splits:
##       NO3 < 1.0425   to the right, agree=0.833, adj=0.286, (0 split)
## 
## Node number 14: 68 observations,    complexity param=0.04545758
##   mean=13.85441, MSE=142.5601 
##   left son=28 (29 obs) right son=29 (39 obs)
##   Primary splits:
##       Chla < 14.8875  to the left,  improve=0.13765220, (0 missing)
##       Cl   < 68.875   to the right, improve=0.10411670, (0 missing)
##       NH4  < 228.175  to the left,  improve=0.09585894, (0 missing)
##       NO3  < 5.247    to the left,  improve=0.08016472, (0 missing)
##       oPO4 < 45.1     to the left,  improve=0.05486673, (0 missing)
##   Surrogate splits:
##       size   splits as  LRR,          agree=0.706, adj=0.310, (0 split)
##       mxPH   < 8.215    to the left,  agree=0.676, adj=0.241, (0 split)
##       oPO4   < 180.8335 to the right, agree=0.662, adj=0.207, (0 split)
##       NO3    < 3.779    to the right, agree=0.647, adj=0.172, (0 split)
##       season splits as  RLRR,         agree=0.632, adj=0.138, (0 split)
## 
## Node number 15: 8 observations
##   mean=31.0125, MSE=574.3336 
## 
## Node number 26: 23 observations
##   mean=5.317391, MSE=38.69448 
## 
## Node number 27: 7 observations
##   mean=17.5, MSE=159.1429 
## 
## Node number 28: 29 observations,    complexity param=0.02228595
##   mean=8.717241, MSE=94.74419 
##   left son=56 (21 obs) right son=57 (8 obs)
##   Primary splits:
##       NH4  < 226.875  to the left,  improve=0.23810280, (0 missing)
##       Cl   < 68.875   to the right, improve=0.12545720, (0 missing)
##       NO3  < 3.8045   to the right, improve=0.10340650, (0 missing)
##       mnO2 < 9.6      to the right, improve=0.08307998, (0 missing)
##       oPO4 < 45.1     to the left,  improve=0.07494295, (0 missing)
##   Surrogate splits:
##       Chla < 12.08    to the left,  agree=0.862, adj=0.5, (0 split)
## 
## Node number 29: 39 observations,    complexity param=0.02243987
##   mean=17.67436, MSE=143.8999 
##   left son=58 (30 obs) right son=59 (9 obs)
##   Primary splits:
##       mnO2 < 11.05    to the left,  improve=0.11737600, (0 missing)
##       mxPH < 8.35     to the right, improve=0.11494980, (0 missing)
##       Cl   < 51.6665  to the right, improve=0.11368430, (0 missing)
##       NO3  < 5.0885   to the left,  improve=0.10842770, (0 missing)
##       NH4  < 231.3    to the left,  improve=0.08663984, (0 missing)
##   Surrogate splits:
##       NH4  < 245.682  to the left,  agree=0.821, adj=0.222, (0 split)
##       oPO4 < 19.4585  to the right, agree=0.821, adj=0.222, (0 split)
##       PO4  < 96.283   to the right, agree=0.821, adj=0.222, (0 split)
##       Chla < 16.4     to the right, agree=0.821, adj=0.222, (0 split)
##       size splits as  RLL,          agree=0.795, adj=0.111, (0 split)
## 
## Node number 56: 21 observations
##   mean=5.785714, MSE=26.5917 
## 
## Node number 57: 8 observations
##   mean=16.4125, MSE=191.8686 
## 
## Node number 58: 30 observations,    complexity param=0.01581407
##   mean=15.42333, MSE=104.6365 
##   left son=116 (8 obs) right son=117 (22 obs)
##   Primary splits:
##       NH4  < 158.409  to the right, improve=0.14788480, (0 missing)
##       Chla < 48.15    to the left,  improve=0.13800510, (0 missing)
##       mnO2 < 9.95     to the right, improve=0.12649850, (0 missing)
##       NO3  < 0.698    to the right, improve=0.11199640, (0 missing)
##       oPO4 < 144.0415 to the left,  improve=0.08054276, (0 missing)
##   Surrogate splits:
##       mxPH < 8.27     to the left,  agree=0.767, adj=0.125, (0 split)
##       oPO4 < 36.4     to the left,  agree=0.767, adj=0.125, (0 split)
## 
## Node number 59: 9 observations
##   mean=25.17778, MSE=201.5862 
## 
## Node number 116: 8 observations
##   mean=8.9, MSE=72.1375 
## 
## Node number 117: 22 observations,    complexity param=0.01285848
##   mean=17.79545, MSE=95.35316 
##   left son=234 (14 obs) right son=235 (8 obs)
##   Primary splits:
##       season splits as  LRLR,         improve=0.1799351, (0 missing)
##       NO3    < 1.2775   to the right, improve=0.1189142, (0 missing)
##       Chla   < 31.869   to the left,  improve=0.1160835, (0 missing)
##       NH4    < 99.6665  to the left,  improve=0.1091819, (0 missing)
##       Cl     < 42.818   to the left,  improve=0.1085589, (0 missing)
##   Surrogate splits:
##       NO3   < 3.2455   to the left,  agree=0.727, adj=0.250, (0 split)
##       oPO4  < 103.293  to the left,  agree=0.727, adj=0.250, (0 split)
##       size  splits as  RLL,          agree=0.682, adj=0.125, (0 split)
##       speed splits as  LRL,          agree=0.682, adj=0.125, (0 split)
##       mnO2  < 7.4      to the left,  agree=0.682, adj=0.125, (0 split)
## 
## Node number 234: 14 observations
##   mean=14.66429, MSE=48.19658 
## 
## Node number 235: 8 observations
##   mean=23.275, MSE=130.6944

The entire process is automated in the wrapper method rpartXse() provided with the DMwr library. In our implementation, we (abitrarily) use se\(=0.3\).

library(DMwR)

(regression.tree.a2.final <- rpartXse(a2 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + PO4 + Chla,algae.train,se=0.3))

## n= 218 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 218 29355.13 7.636697 *

summary(regression.tree.a2.final)

## Call:
## rpart(formula = form, data = data, cp = cp, minsplit = minsplit)
##   n= 218 
## 
##          CP nsplit rel error   xerror     xstd
## 1 0.1508276      0         1 1.009958 0.199121
## 
## Node number 1: 218 observations
##   mean=7.636697, MSE=134.6565

prp(regression.tree.a2.final,extra=101,box.col="orange",split.box.col="gray")

## Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
## To silence this warning:
##     Call prp with roundint=FALSE,
##     or rebuild the rpart model with model=TRUE.

The resulting tree is not nearly as complex as the original tree (hence discourages overfitting) but is still more complex than the pruned tree (which should improve predicting accuracy).

This time, however, we will learn models and perform evaluation for all target variables (a1-a7) simultaneously. This does not mean that we are looking for a single model which will optimize all learning tasks at once, but rather that we can prepare and evaluate the models for each target variable with the same bit of code.

This first require some code to create the appropriate model formulas (a1 ~ ., … ,a7 ~ .) and the appropriate training data.

gg<-function(x,list.of.variables){    
    PredTask(as.formula(paste(x,"~ .")),algae.train[,c(list.of.variables,x)], x, copy=TRUE)
}
data.sources <- sapply(names(algae.train[12:18]),gg,names(algae.train[1:11]))
data.sources

## $a1
## Prediction Task Object:
##  Task Name         :: a1 
##  Task Type         :: regression 
##  Target Feature    :: a1 
##  Formula           :: a1 ~ .
## <environment: 0x7f8b7765c960>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a2
## Prediction Task Object:
##  Task Name         :: a2 
##  Task Type         :: regression 
##  Target Feature    :: a2 
##  Formula           :: a2 ~ .
## <environment: 0x7f8b775caac8>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a3
## Prediction Task Object:
##  Task Name         :: a3 
##  Task Type         :: regression 
##  Target Feature    :: a3 
##  Formula           :: a3 ~ .
## <environment: 0x7f8b783b5228>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a4
## Prediction Task Object:
##  Task Name         :: a4 
##  Task Type         :: regression 
##  Target Feature    :: a4 
##  Formula           :: a4 ~ .
## <environment: 0x7f8b78373378>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a5
## Prediction Task Object:
##  Task Name         :: a5 
##  Task Type         :: regression 
##  Target Feature    :: a5 
##  Formula           :: a5 ~ .
## <environment: 0x7f8b781bd0d0>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a6
## Prediction Task Object:
##  Task Name         :: a6 
##  Task Type         :: regression 
##  Target Feature    :: a6 
##  Formula           :: a6 ~ .
## <environment: 0x7f8b78845590>
##  Task Data Source  :: internal  218x12 data frame.
## 
## $a7
## Prediction Task Object:
##  Task Name         :: a7 
##  Task Type         :: regression 
##  Target Feature    :: a7 
##  Formula           :: a7 ~ .
## <environment: 0x7f8b77faa390>
##  Task Data Source  :: internal  218x12 data frame.

We shall use e1071’s implementation of svm(), with various values of the svm()-specific parameters cost and gamma.

library(e1071)

kCV.results.algae.all <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)

## 
## 
## ##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####
## 
## ** PREDICTIVE TASK :: a1
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a2
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a3
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a4
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a5
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a6
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a7
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************

The rest of the evaluation proceeds much as before, except that we can display results for the 7 target variables simultaneously.

plot(kCV.results.algae.all)

rankWorkflows(kCV.results.algae.all,top=3)

## $a1
## $a1$nmse
##      Workflow  Estimate
## 1 rpartXse.v1 0.6163406
## 2 rpartXse.v2 0.6278027
## 3 rpartXse.v3 0.6430736
## 
## 
## $a2
## $a2$nmse
##      Workflow  Estimate
## 1          lm 0.9723125
## 2         svm 0.9954432
## 3 rpartXse.v3 1.0041667
## 
## 
## $a3
## $a3$nmse
##      Workflow  Estimate
## 1         svm 0.9497730
## 2          lm 0.9801662
## 3 rpartXse.v2 1.0000000
## 
## 
## $a4
## $a4$nmse
##      Workflow Estimate
## 1 rpartXse.v3 1.001453
## 2 rpartXse.v2 1.351494
## 3          lm 1.357243
## 
## 
## $a5
## $a5$nmse
##      Workflow  Estimate
## 1         svm 0.9968475
## 2 rpartXse.v3 0.9990465
## 3 rpartXse.v2 1.0194733
## 
## 
## $a6
## $a6$nmse
##      Workflow Estimate
## 1 rpartXse.v2 1.010069
## 2 rpartXse.v3 1.010069
## 3         svm 1.054975
## 
## 
## $a7
## $a7$nmse
##      Workflow Estimate
## 1 rpartXse.v2  1.00000
## 2 rpartXse.v3  1.00000
## 3 rpartXse.v1  1.00797

topPerformers(kCV.results.algae.all)

## $a1
##         Workflow Estimate
## nmse rpartXse.v1    0.616
## 
## $a2
##      Workflow Estimate
## nmse       lm    0.972
## 
## $a3
##      Workflow Estimate
## nmse      svm     0.95
## 
## $a4
##         Workflow Estimate
## nmse rpartXse.v3    1.001
## 
## $a5
##      Workflow Estimate
## nmse      svm    0.997
## 
## $a6
##         Workflow Estimate
## nmse rpartXse.v2     1.01
## 
## $a7
##         Workflow Estimate
## nmse rpartXse.v2        1

For a1, the models seem to perform reasonably well. But it’s not so rosy for the other target variables, where the baseline model is sometimes better.

Again, this could be built-in in the data, but we might benefit from incorporating more models. As a last attempt to improve the situation, let’s include random forests.

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:psych':
## 
##     outlier

kCV.results.algae.all.rf <- performanceEstimation(
    data.sources,
    c(Workflow(learner="lm",post="onlyPos"), 
      Workflow(learner="svm",learner.pars=list(cost=c(10,1,0.1),gamma=0.1)),
      workflowVariants(learner="rpartXse",learner.pars=list(se=c(0,0.7,1))),
      workflowVariants(learner="randomForest",learner.pars=list(ntree=c(200,500,700)))
     ),
    EstimationTask(metrics="nmse",method=CV(nReps=5,nFolds=10)) 
)

## 
## 
## ##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####
## 
## ** PREDICTIVE TASK :: a1
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a2
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a3
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a4
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a5
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a6
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ** PREDICTIVE TASK :: a7
## 
## ++ MODEL/WORKFLOW :: lm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: svm 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: rpartXse.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  nmse  using
##  5 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**************************************************

rankWorkflows(kCV.results.algae.all.rf,top=3)

## $a1
## $a1$nmse
##          Workflow  Estimate
## 1 randomForest.v2 0.5217204
## 2 randomForest.v3 0.5228744
## 3 randomForest.v1 0.5264328
## 
## 
## $a2
## $a2$nmse
##          Workflow  Estimate
## 1 randomForest.v3 0.7798749
## 2 randomForest.v2 0.7806831
## 3 randomForest.v1 0.7849360
## 
## 
## $a3
## $a3$nmse
##          Workflow  Estimate
## 1 randomForest.v3 0.9377108
## 2 randomForest.v2 0.9400108
## 3 randomForest.v1 0.9431801
## 
## 
## $a4
## $a4$nmse
##          Workflow Estimate
## 1     rpartXse.v3 1.001453
## 2 randomForest.v3 1.006496
## 3 randomForest.v1 1.006806
## 
## 
## $a5
## $a5$nmse
##          Workflow  Estimate
## 1 randomForest.v1 0.7626241
## 2 randomForest.v2 0.7675794
## 3 randomForest.v3 0.7681834
## 
## 
## $a6
## $a6$nmse
##          Workflow  Estimate
## 1 randomForest.v2 0.8590227
## 2 randomForest.v3 0.8621478
## 3 randomForest.v1 0.8663869
## 
## 
## $a7
## $a7$nmse
##      Workflow Estimate
## 1 rpartXse.v2  1.00000
## 2 rpartXse.v3  1.00000
## 3 rpartXse.v1  1.00797

rankWorkflows() does not report on the standard error, and so we cannot tell whether the differences between the score of the best model and the other models is statistically significant.

randomForest.v3 seems to have the best ranking across all learning tasks, so we will use it as the baseline model.

p<-pairedComparisons(kCV.results.algae.all.rf,baseline="randomForest.v3")
p$nmse$F.test

## $chi
## [1] 22.86905
## 
## $FF
## [1] 5.251025
## 
## $critVal
## [1] 0.7071231
## 
## $rejNull
## [1] TRUE

p$nmse$BonferroniDunn.test

## $critDif
## [1] 3.52218
## 
## $baseline
## [1] "randomForest.v3"
## 
## $rkDifs
##              lm             svm     rpartXse.v1     rpartXse.v2 
##       4.1428571       2.8571429       4.1428571       2.6428571 
##     rpartXse.v3 randomForest.v1 randomForest.v2 
##       1.9285714       0.8571429       0.0000000 
## 
## $signifDifs
##              lm             svm     rpartXse.v1     rpartXse.v2 
##            TRUE           FALSE            TRUE           FALSE 
##     rpartXse.v3 randomForest.v1 randomForest.v2 
##           FALSE           FALSE           FALSE

We can reject with 95% certainty that the performance of the baseline method (randomForest.v3) is the same as that of the linear model and the first 2 regression trees, but not that it is better than svm, rpartXse.v3, and the other 2 random forests.

The information is also displayed in the Bonferroni-Dunn CD diagram below.

CDdiagram.BD(p)

TDA 07 - Predicting Algae Blooms (L.Torgo).ipynb

PREDICTING ALGAE BLOOMS

OUTLINE

LOADING THE DATA

REGRESSION TREE MODEL