(based on a Case Study by L. Torgo)
Data cleaning is an essential (but not overly exciting) part of the analytical process. In this notebook, we show how such a task could be accomplished using the Algae Blooms Dataset, which we will revisit in notebooks DS 03 and DS 07.
We will also show how to run a Principal Component Analysis and a Feature Selection on this dataset.
The ability to monitor and perform early forecasts of various river algae blooms is crucial to control the ecological harm they can cause.
The dataset used to train the learning model consists of:
For now, we assume that we have already explored the data (see notebooks DS 03 and DS 07).
Before we can take a look at the data and begin the process in earnest, we need to load it into the R workspace.
The dataset is stored in the CSV file algae_blooms.csv (in the Data directory).
algae_blooms<-read.csv("Data/algae_blooms.csv", sep=",", header=TRUE)
We can get an idea of the data frame's structure by calling the str
function.
str(algae_blooms)
Evidently, algae_blooms
is a data frame with 340 observations of 18 variables each.
Notes: three of the variables (season, size, speed) are categorical, and refer to the data collection process.

We can get a better feel for the data frame by observing it in its natural habitat, so to speak, using the head or tail functions.
head(algae_blooms)
We found some anomalies in the data when we explored the dataset (see notebooks DS 03 and DS 07); let's take some time to clean it up (somewhat).
Anomalies come in various flavours; in this section we'll handle missing values (outlying behaviour is tackled in notebook DS 07).
library(dplyr)
The function complete.cases
lists the observations for which every field is present (note that it says nothing about the validity of the case).
table(complete.cases(algae_blooms))
The vast majority of observations do not have missing cases, but a few still do. Let's take a look at them to see if there is anything special about them. Are the values missing completely at random?
str(filter(algae_blooms, !complete.cases(algae_blooms)))
summary(filter(algae_blooms, !complete.cases(algae_blooms)))
Right off the bat, missing cases seem to be over-represented in small rivers and under-represented in low-speed rivers. Upon further investigation (that is, comparing with the original dataset), the under-representation of low-speed rivers is not problematic: it is in line with the proportions in the larger dataset, so low-speed rivers don't seem to have a systematic missing value problem; they are simply under-represented in the original dataset to begin with.
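As a quick sanity check on the missing-at-random question, we can compare the proportion of incomplete observations within each size category; a minimal sketch, assuming algae_blooms is loaded as above:

```r
# Proportion of incomplete observations within each river size category.
# A much higher rate in one category would suggest the values are not
# missing completely at random.
incomplete <- !complete.cases(algae_blooms)
round(prop.table(table(algae_blooms$size, incomplete), margin = 1), 3)
```

The same table can be produced for speed (or season) by swapping the grouping column.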
I will assume for now (in the interest of efficiency) that the fact that small rivers have a lot of missing cases (mostly Cl
and Chla
) is also not a problem (YOU SHOULD PROBABLY VERIFY IF THAT IS INDEED THE CASE...)
And the bulk of the missing values seem to come from either Cl
, Chla
, or PO4
. There also seems to be 2 observations for which we have very little information, but we can't use that smmary to determine if it's always the same two observations.
For instance, let's see what observations have missing NH4
values.
algae_blooms[which(is.na(algae_blooms$NH4)),]
While these observations also have missing values in other fields, they do have some non-missing fields as well.
But they're both missing 6 of the predictor variables. How useful could they be in training a predictive model? (That depends on the model, of course...)
We can write a function that will compute how many missing cases there are for each observation.
table(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2)
Most observations have no missing cases, which is great, and there are a few with 1 or 2, but observations 62 and 199 are wild cards, with 6 missing predictors (out of 11).
Based on the small number of such wild cards, we elect to remove them from the analysis.
IMPORTANT NOTES:
Our new dataset still contains observations with missing cases, however.
algae_blooms.sna = algae_blooms[-which(apply(algae_blooms[,1:11],1, function(x) sum(is.na(x)))>2),]
dim(algae_blooms.sna)
What can we do with the other observations for which values are missing?
One possibility is to use the set of complete observations to compute a correlation matrix, and to see if any numerical field is strongly correlated with another field. That way, if there is a missing value in the first field, the second could be used to impute it.
IMPORTANT NOTE:
symnum(cor(algae_blooms.sna[,4:18], use="complete.obs"))
This might be a bit hard to read. We can use the corrplot
library to visualize the (linear) correlations.
library(corrplot)
corrplot(cor(algae_blooms.sna[,4:18], use="complete.obs"),type="upper",tl.pos="d")
The correlation between PO4
(which has missing cases) and oPO4
(which does not, anymore) is clear.
What is the nature of the relation? We'll use the set of complete cases to find it.
algae_blooms.nona <- algae_blooms.sna[-which(apply(algae_blooms.sna,1, function(x) sum(is.na(x)))>0),]
dim(algae_blooms.nona)
PO4.oPO4.model = lm(PO4 ~ oPO4, data=algae_blooms.nona)
PO4.oPO4.model
The regression function is PO4 $= 51.811 + 1.203\cdot$ oPO4 (note that I'm not really interested in the fit statistics).
Intercept = PO4.oPO4.model$coefficients[[1]]
Slope = PO4.oPO4.model$coefficients[[2]]
What are the observations for which PO4
is missing?
which(is.na(algae_blooms.sna$PO4)==TRUE)
Let's use the regression function to impute the missing PO4
values.
algae_blooms.sna2 <- algae_blooms.sna
algae_blooms.sna2$PO4 <- ifelse(is.na(algae_blooms.sna2$PO4), pmax(Intercept + Slope*algae_blooms.sna2$oPO4, 0), algae_blooms.sna2$PO4) # pmax clamps each imputed value at 0 elementwise (max would collapse the whole vector to a single value)
We can clearly see that no values of PO4
are missing anymore.
which(is.na(algae_blooms.sna2$PO4)==TRUE)
That takes care of the missing values with strong linear correlation to another field. Where do we stand now?
summary(algae_blooms.sna2)
Alright, so we don't have as many missing values as before, but we still have some. And the correlation trick isn't going to work this time.
What can we do? There are many ways to tackle the problem, but we will use $k$NN imputation. The principle is simple. Using some similarity/distance metric (typically based on the Euclidean distance between points), we start by identifying the $k$ nearest (complete) neighbours of each observation with a missing case. We then compute the mean value of the missing field over this group of $k$ complete observations, and use that mean as the imputed value.
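To make the principle concrete, here is a from-scratch sketch of the idea on a toy data frame (illustration only: this hypothetical helper does not scale the variables first, and it imputes a single entry at a time):

```r
# Minimal sketch of kNN imputation for one missing entry of a numeric
# data frame (toy version, for illustration only).
knn_impute_one <- function(data, i, col, k = 3) {
  donors <- data[complete.cases(data), , drop = FALSE]  # fully observed rows
  obs_cols <- which(!is.na(data[i, ]))                  # fields observed for row i
  target <- unlist(data[i, obs_cols])
  # Euclidean distance from row i to each donor, over the observed fields
  d <- apply(donors[, obs_cols, drop = FALSE], 1,
             function(x) sqrt(sum((x - target)^2)))
  # impute with the mean of the k nearest donors' values in column `col`
  mean(donors[order(d)[seq_len(k)], col])
}

# toy example: impute the missing entry of a small random data frame
set.seed(1)
toy <- as.data.frame(matrix(rnorm(40), ncol = 4))
toy[2, 3] <- NA
knn_impute_one(toy, i = 2, col = 3, k = 3)
```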
IMPORTANT NOTE: we use DMwR's implementation of knnImputation (below). How would you go about determining if the routine scales the data internally?

library(DMwR)
algae_blooms.sna2 <- knnImputation(algae_blooms.sna2,k=10) # the choice of k=10 is arbitrary here.
Sure enough, there are no further observations with incomplete cases.
summary(algae_blooms.sna2)
table(apply(algae_blooms.sna2,1, function(x) sum(is.na(x)))) # 1 for rows, 2 for columns
PCA is typically used on the (numeric) predictor variables. Methods exist to combine numeric and categorical variables, but for the purposes of this example, we will neglect the effect of the categorical fields.
pca.algae = algae_blooms.sna2[,4:11]
head(pca.algae)
pca.1 = princomp(scale(pca.algae))
summary(pca.1)
If we can live with 75% of the variance being explained, then we can reduce the dimension by 4.
reduced.algae = data.frame(algae_blooms.sna2[,1:3], pca.1$scores[,1:4], algae_blooms.sna2[,12:18])
head(reduced.algae)
Whether this will prove useful or not remains to be seen.
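The 4-component choice can also be checked programmatically from the cumulative proportion of variance; a sketch, assuming pca.1 as computed above and the (assumed) 75% threshold:

```r
# cumulative proportion of variance explained by the components
var.explained <- pca.1$sdev^2 / sum(pca.1$sdev^2)
cumsum(var.explained)

# smallest number of components reaching the 75% threshold
which(cumsum(var.explained) >= 0.75)[1]
```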
We will use CORElearn
's metrics to illustrate the feature selection process.
For supervised problems with a numeric target (as is the case when trying to predict a1
-a7
), several metrics are available.
library(CORElearn)
infoCore(what="attrEvalReg") # what metrics are available for regression problems
The feature selection process is dependent on the target variable. For the sake of simplification, we will use a1
as our target variable.
attrEval(a1 ~ .,algae_blooms.sna2[,1:12],estimator="MSEofMean")
Alright, they're all negative... is this supposed to happen? Are we looking for features that are very negative or only a little bit negative? (For classification problems, important features would be those for which the outputs would be near 1).
We could look at the documentation (and we should, at some point!), but let's see if we can't get the answer some other way.
We'll create a dataset for which the target variable is one of the predictor variables. No matter what kind of method is used, we would expect the associated value to be optimal.
For instance, let's re-create the analysis by replacing a1
with mxPH
.
test = algae_blooms.sna2[,1:11]
what = test$mxPH
test = cbind(test,what)
attrEval(what ~ .,test,estimator="MSEofMean")
The value of mxPH
is closest to 0, which seems to imply that we're looking for values close to 0 (at least for the MSEofMean
estimator). In particular, Cl
, oPO4
and PO4
seem to be the best features at explaining a1
in the MSE of Mean framework.
Is that also the case for the other estimators?
attrEval(a1 ~ .,algae_blooms.sna2[,1:12],estimator="MAEofModel")
attrEval(a1 ~ .,algae_blooms.sna2[,1:12],estimator="MSEofModel")
The results are somewhat consistent (for the 3 estimators that were chosen).
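To check this consistency claim more systematically, we can rank the features under each estimator and compare the rankings side by side; a sketch, assuming CORElearn and algae_blooms.sna2 as above:

```r
# Feature rankings under each of the three chosen estimators.
estimators <- c("MSEofMean", "MAEofModel", "MSEofModel")
scores <- sapply(estimators, function(e)
  attrEval(a1 ~ ., algae_blooms.sna2[, 1:12], estimator = e))

# scores are negative and closer to 0 is better, so rank -scores
# (rank 1 = best feature under that estimator)
apply(-scores, 2, rank)
```

Features whose rank stays small across all columns are the consistently useful ones.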
The only thing I will add here is that there is an awful lot more we could have said about feature selection. It is an important sub-field of data science, and you've barely been introduced to it (one could probably fill a full graduate course with the topic).
In many applications, the subset of retained features has a marked effect on analysis results, and thus plays an important role in their interpretation and eventual use in decision-making.
Please keep this in mind.
Exercises:

1. Repeat the feature selection process on algae_blooms.sna2 for the target variables a2-a7. Are there features that are consistently retained (compare with a1)?
2. Repeat the analysis with the iris dataset.