1 Story: with Iris toy dataset

1.1 Objective

The objective of this story is to cover large breath of topics in data science/data handling in R and Python using the ubiquitous Iris flower dataset. Topics covered are data structure, column manipulation, clustering, PCA, various ML algorithms and visualization. Doing parallel exercise in Python helps in quick learning of other language.

1.2 A bit about Iris flower data set

Let us get to know about the Iris flower dataset. It is one of the most popular toy dataset for ML (Machine Learning) which is about identification of the species based on four oberservatios made on petal’s and sepal’s length and width. The dataset consist of four predictors (also called features) columns that are numeric data type. Based on these features, one can predict the species of the flower - setosa, versicolor, virginica. There are 50 measurements (one row has one measurement) for each species.

1.3 What Kind of dataset is Iris?

In machine language, it is multinomial classification problem. The Response variable (the variable that needs to be predicted, also called as label or target) have more than 2 outcompes – setosa, versicolor, and virginica. If there were two outcomes, it would be binary classification problem. If the Response variable is number like sales amount, then it would be regression problem.

The Iris data in R already exist in base package and so no library needs to be invoked to access the iris data set.

1.4 Did I find anything interesting in the Data?

Surprisingly yes! One of the prevalent wisdom for reducing the predictor columns is to remove columns that are highly coorelated with other column. We find that it is not a good idea to remove highly correlated column without any due consideration. To directly jump at this finding click the \(\boxed{RED\;}\) box

In next section, we take the first step in handling the data which is the follwong:

  1. Put the data as DataFrame.
  2. Standardize the predictor columns, so that each column has mean = 0 and standard deviation = 1.
  3. Split the data into training and test set.

2 Data Preparation

2.1 Putting data in the right format

Iris dataset comes as dataframe. We divide the dataset vertically into only predictors – xiris and one column response – yiris.

class(iris) # check the way the data is stored
## [1] "data.frame"
dim(iris)   # dimension -- rows, column for dataframe
## [1] 150   5
str(iris)   # data type for each columns
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)  
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
colnames(iris)  # show column names
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"
# rename the first 4 column names
colnames(iris)[1:4]= c("Sepal_Length",'Sepal_Width','Petal_Length','Petal_Width') 
colnames(iris)
## [1] "Sepal_Length" "Sepal_Width"  "Petal_Length" "Petal_Width" 
## [5] "Species"
xiris <- iris[,c(1:4)]  # or iris[,c("Sepal_Length",'Sepal_Width','Petal_Length','Petal_Width')]
yiris <- iris[,5, drop=FALSE] # When one column is assigned, by default Drop=TRUE, will convert yiris as vector erasing the column name. 
# in python yiris existed in values of 0,1,2 for the species
# we will map the species (setosa, versicolor, virginica) to (0,1,2) just as an exercise.

y_temp <-yiris
y_temp <- sapply(yiris, function(x) {ifelse(x=='setosa',0,ifelse(x=='versicolor',1,2))})  # not as elegant as python's map
table(y_temp)  # check the frequency of values to confirm there are 50 of each 0,1,2 as response variable
## y_temp
##  0  1  2 
## 50 50 50
head(xiris,6)
tail(yiris,6)
cat(dim(xiris))        # dim(df) in R
## 150 4
cat(str(xiris))        # str(df) in R
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal_Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal_Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal_Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal_Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#rm(list=ls())

2.2 Normalizing or Standardizing of data

We transform predictors column so that each column will have mean = 0 and std = 1. This is important because some of the ML algorithms like KNN will give more weightage to predictors that have larger range.

In R, we do in with a versatile caret package. Later we check the mean, standard deviation and range

library(caret)
preObj <- preProcess(xiris, method=c("center", "scale"))
xiris <- predict(preObj, xiris)
iris <-data.frame(xiris,yiris)
apply(xiris, 2, mean)  # here 2 means operate on column.
##  Sepal_Length   Sepal_Width  Petal_Length   Petal_Width 
## -4.484318e-16  2.034094e-16 -2.895326e-17 -2.989362e-17
apply(xiris, 2, sd)
## Sepal_Length  Sepal_Width Petal_Length  Petal_Width 
##            1            1            1            1
apply(xiris,2,range)  
##      Sepal_Length Sepal_Width Petal_Length Petal_Width
## [1,]    -1.863780   -2.425820    -1.562342   -1.442245
## [2,]     2.483699    3.080455     1.779869    1.706379

2.3 Split data to Train and Test set

Splitting of the data is done for cross-validation, which is best method to optimize the model fit – i.e., neither underfit nor overfit. This is done by seperating small portion (20%-30%) of the data on which we do not train the model. We then test the prediction on this data. If the model fit is optimized, the prediction will be comparable to that of prediction on traing set.

Optimization is having a balance between bias and variance. A no-model (y = constant, say y = avg(y) or mode(y)) has low bias but will have high variance. An over fit model will have low variance but have high bias.

set.seed(12345)     # set seed so that results are reproducible when run again
# creates a sequence from 1 to 2 of the size of the data with probability distribution of 70 and 30
# We can create three data sets by replacing with seq(1,3) and prob=c(70,15,15)
idx <- sample(seq(1, 2), size = nrow(xiris), replace = TRUE, prob = c(70,30))
table(idx)/length(idx)*100  # check back the percentages of 1 and 2
## idx
##        1        2 
## 64.66667 35.33333
# Defining new sets seperatly for tain set and test set
xiris_tr <- xiris[idx==1,]
yiris_tr <- yiris[idx==1,]
xiris_te <- xiris[idx==2,]
yiris_te <- yiris[idx==2,]
iris_te <-iris[idx==1,]
iris_tr <-iris[idx==2,]

3 Finding Important Predictors

We will use following approach to find important predictors. This is esepcially useful when the number of predictors are in hundreds or thousands where reduction of predictors before processing becomes very important.

  1. Correlation function
  2. regsubsets from leaps library
  3. Principal Component Analysis (PCA)
  4. Linear Determinant Analysis (LDA)

3.1 1. using correlation function

Correlation function finds correlation between pairs of variable. If two predictors are highly correlated, one can be dropped as it may not have additional prediction value. (I am reminded by General Colin Powell – If you have a yes man working for you, one of you is redundant)

# the cor function calculates correlation between all possible pairs of dataframe.
cor(xiris)
##              Sepal_Length Sepal_Width Petal_Length Petal_Width
## Sepal_Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal_Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal_Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal_Width     0.8179411  -0.3661259    0.9628654   1.0000000

Based on above data, (Petal_Width, Petal_Length) are hightly correlated ( >96% ) and therefore one could be dropped. It will be a surprise find by the next method that will suggest to the contrary!)

3.2 2. using regsubsets

regsubsets makes many iterations of checking the predictability of different set of predictor variable. In first iteration, regsubsets uses one predictor at a time to see which is most important predictor variable for prediction of Species. Under next itegration it uses two variables and checks which two are the most important, and so on. There are various ways (which is not discussed here) to find the best iteration, This algorithm is especially important if there are hundreds or thousands of predictor variables. If the dataset is large, we must not use exhaustive method due to large combinations of predictors. We use “forward” or “backward” method in such cases.

library leaps is managed by Professor Trevor Hastie, who is also co-author of one of the best and free data science book: An Introduction to Statistical Learning with Applications in R.

library(leaps)
# In R, a formula clause such as, "Sepecies~." can be inserted in function like regsubsets, which is interpreted as Species as the response variable and all other variable as predictor columns that must be considered in evaluation.
fit <-regsubsets(Species ~.,  data = iris_tr, method="exhaustive", nvmax = 4) # nvmax no. of iterations.
summary(fit)
## Subset selection object
## Call: regsubsets.formula(Species ~ ., data = iris_tr, method = "exhaustive", 
##     nvmax = 4)
## 4 Variables  (and intercept)
##              Forced in Forced out
## Sepal_Length     FALSE      FALSE
## Sepal_Width      FALSE      FALSE
## Petal_Length     FALSE      FALSE
## Petal_Width      FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          Sepal_Length Sepal_Width Petal_Length Petal_Width
## 1  ( 1 ) " "          " "         " "          "*"        
## 2  ( 1 ) " "          " "         "*"          "*"        
## 3  ( 1 ) "*"          " "         "*"          "*"        
## 4  ( 1 ) "*"          "*"         "*"          "*"

The above matrix output with “*" shows most significant column for prediction. In first iteration, it is only one column, which is Petal_Width. Next iteration has two most significant columns (Petal_Width, Petal_Length) and so on. We see that Petal_Width is the most important variable as it consistently appears in all iterations, hence must not be dropped. Second important variable is Petal_Length! This appears to be in contradiction with wisdom that highly correlated variable (>96% )observation that Petal_Width and Petal_length are >96% correlated, hence one can drop one of the variable.. The only logical explaination is the contribution of other variables – sepal length and sepal width are less than the few percentage contributed from highly correlated Petal_length. We will prove this in ML algorithm section. Third variable of importance is Sepal_Length.

We shall use (Petal_Width, Petal_Width, Sepal_Length) for ML algorithms

3.3 Principal Component Analysis (PCA)

PCA technique transforms the predictor columns in a frame where the first variable (PC1) is along the axis of maximum variation, the second variable (PC2) is perpendicular to axis of PC1 and where the variation is maximum and so on. So we have variables, PC1, PC2, … with decreasing order of importance (variation). We can plot the proportion of variance and place a cut-off variable beyond which we can drop the variables as the variation is very small.

The ease of this method to select importnat (albeit transformed) variables makes it a very poular method for reducing predictor dimensions. Reducing the dimension of the predictor variables can help ML algorithms process fast without loosing the power of prediction.

The PCA method is implemented as follows:

pca.out=prcomp(xiris, scale=TRUE)
#biplot(pca.out, scale=1) # We will not use this biplot, instead we use a fancy one.
library(devtools)
#install_github("vqv/ggbiplot")
library(ggbiplot)
print(ggbiplot(pca.out, obs.scale = 1, var.scale = 1, groups = iris$Species, ellipse = TRUE, circle = TRUE))

summary(pca.out) # get proportion of variance
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# plot proportion of variance
plot(pca.out$sdev^2/sum(pca.out$sdev^2)*100, type="b", col="blue", ylim=c(0,100), axes=FALSE, ann=FALSE)
axis(1, at=1:4, lab=paste0("PC", 1:4))
axis(2, las=1, at=seq(0,100,10))
#grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE) # not flexible
abline(h=seq(0,100,10), v=1:4, col="grey", lty="dotted")
title(xlab="PCA Variables", col.lab="black", line=2.2)
title(ylab="variance (in %)", col.lab="black", cex.lab=1.1, 
      line=2.4) #line gives seperation of label from axes
title(main="PCA's Proportion of variance plot", col.main="black")
box()

#colors()            # show all the colors

Reference: good examples of plot can be found in this link

3.4 Linear Discriminant Analysis

LDA technique transforms the predictor columns in a frame where the first variable (LD1) is along the axis of maximum class-seperation (in response variable), the second variable (LD2) is again along the axis of maximum class-seperation, but perpendicular to LD1 axis. So we have variables, LD1, LD2, … with decreasing order of class seperation. This is very similar to PCA in approach, but PCA deals with variation in predictor data where as LDA deals with class-seperation. PCA us unsupervised (i.e., does not need reponse variable – Species), LDA is supervised as the seperation is based on knowing the response class. LDA works only on multinomial or binomial classification data set (it does not work on regression data set)

Since LDA wil be using the repsonse variable, we will not analyse on entire data, but on training set – iris_tr. We will keep the test set – iris_te only for prediction. This way we ensure the ML algorithm in unbiased in the prediction.

Following code will explore the data with LDA. Interestingly we find the LDA1 and LDA2 depend mostly on Petal_width and Petal_Length! Inspite of having >96% orrelation between the Petal_Wideth and Petal_Length, it is more important than the other two - Sepal predictors. This is exactly the same as concluded by regsubsets. Hence we must not remove a highly correlated variable without doing further analysis.

We use ggplot2 plotting library, to plot overlapping histograms. ggplot library is the standard in R and Python.

3.4.1 \(\boxed{*\;}\) Interesting way to map

c(“green”,“blue”,“red”)[yiris_tr] is interesting way to map Response classification into color variable. formula variable could be Sepal~. or Sepal~Petal_Width+Petal_Length, Sepal~.-Petal_Length to select appropriate predictor variables.

library(MASS) # for lda, qda
lda.fit <- lda(Species~., data=iris_tr)                         # Use all predictors
#lda.fit <- lda(Species~Petal_Width+Petal_Length, data=iris_tr) # Use only Petal_Width and Petal_Length predictors 
#lda.fit <- lda(Species~.-Petal_Length, data=iris_tr)           # Use all but Petal_Length as predictors
prop.table(lda.fit$svd^2)                                       # gives the proportion of trace
## [1] 0.995062308 0.004937692
print(lda.fit)
## Call:
## lda(Species ~ ., data = iris_tr)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3018868  0.3584906  0.3396226 
## 
## Group means:
##            Sepal_Length Sepal_Width Petal_Length Petal_Width
## setosa       -1.1844867   0.5710849   -1.3286707  -1.2618549
## versicolor    0.1764839  -0.5662448    0.3100117   0.1596868
## virginica     1.0680843  -0.1825228    1.0434497   1.0795700
## 
## Coefficients of linear discriminants:
##                     LD1        LD2
## Sepal_Length  0.8080343  0.4573182
## Sepal_Width   0.9880170  0.9684004
## Petal_Length -5.1898199 -1.8644682
## Petal_Width  -2.5576125  1.8705542
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9951 0.0049
library(devtools)
#install_github('fawda123/ggord')
library(ggord)
ggord(lda.fit, iris_tr$Species)

lda.pred <- predict(lda.fit, newdata=xiris_tr)
plot(lda.pred$x, pch=19, col=c("green","blue","red")[yiris_tr])

df <- data.frame(lda.pred$x, "Species"=yiris_tr) 
library(ggplot2)
ggplot(data=df, aes(x=LD1, group=Species, fill=Species)) + geom_histogram(position="identity", binwidth=.5, alpha=.8)

From the proportion of trace LD1 essientially contributes to seperating the class variable, which is evident in the ggord plot and the overlapping histogram.

The print(lda.fit) shows the LD1 having coefficents (-5.2,-2.6,0.9, 0.8) from (Petal_Length, Petal Width, Sepal_Width, Sepal_Length). Since the predictors are standardized, the coefficients shows the strength of contribution from each variables. It shows (Petal_Length, Petal Width) are far stronger variable than (Sepal_Width, Sepal_Length). It reaffirms our assertion, that even though (Petal_Length, Petal) are more than 96% correlated, both are still important as other variables (Sepal_Width, Sepal_Length) has far less contribution.

4 Processing with Machine Learning (ML) Algorithms

LDA is also a ML algorithm grounded on statistics. We continue with previous section and use LDA to predict on the test data. We then compare the predicted data with the real data to find goodness of fit.

4.0.1 Interesting Ideas: Goodness of Fit

For regression problem the goodness of fit can be simple standard deviation. In case of classification problem, it is confusion matrix. There are two more interesting measures – balanced accuracy and kappa – that should be explained. In a binomial classification, when one outcome is very less compared to another, balanced accuracy must be refered to gauge goodness of fit. Kappa is important when we have multinomial classication with ordinal outcomes, say low, medium, and high. Here mis-classification from low to medium is less severe than low to high. Kappa will be higher in former case than later. Higher the kappa better is the goodness of fit.

We will only confine to Confusion Matrix for this simple example.

lda.pred <- predict(lda.fit, newdata=xiris_te)

library(caret)
confusionMatrix(lda.pred$class, yiris_te)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         0
##   virginica       0          0        18
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9328, 1)
##     No Information Rate : 0.3585     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3019            0.3585           0.3396
## Detection Rate              0.3019            0.3585           0.3396
## Detection Prevalence        0.3019            0.3585           0.3396
## Balanced Accuracy           1.0000            1.0000           1.0000
summary(lda.fit)
##         Length Class  Mode     
## prior    3     -none- numeric  
## counts   3     -none- numeric  
## means   12     -none- numeric  
## scaling  8     -none- numeric  
## lev      3     -none- character
## svd      2     -none- numeric  
## N        1     -none- numeric  
## call     3     -none- call     
## terms    3     terms  call     
## xlevels  0     -none- list
 library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
## 
##     alpha
## The following object is masked from 'package:ggplot2':
## 
##     alpha
 model8<-ksvm(Species~.,data=iris_tr)
 
 #Summarize the model
 summary(model8)
## Length  Class   Mode 
##      1   ksvm     S4
 #Predict using the model
 pred_svm<-predict(model8,xiris_te,type="response")

confusionMatrix(pred_svm, yiris_te)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         0
##   virginica       0          0        18
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9328, 1)
##     No Information Rate : 0.3585     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3019            0.3585           0.3396
## Detection Rate              0.3019            0.3585           0.3396
## Detection Prevalence        0.3019            0.3585           0.3396
## Balanced Accuracy           1.0000            1.0000           1.0000

4.1 Support Vector Machine (SVM)

[Following paragraph on SVM is based on my rudimendary understanding thus far. Expect this section to change in due time. Please feel free to correct me]

A Linear SVM tries to seperate the class by transforming the points to higher dimensions where they can be seperated. Then it finds the decision boundary that has widest margin. To illustrate, imagine building the widest but straight road seperating the cluster of two kind of trees. The trees that touches the road boundary on both sizes are called Support vectors. The making of widest road, places the constraint on the direction of the road. The direction of the road gives the predictive power to seperate the two clusters of the test data.

For some dataset the seperation is just not posible. Here we used non-linear transformation via choosing the kernal. It is like building the curved road with some predefined curvature.

SVN is naturally built for binary classification. Yet it can solve a multinomial classification by following technique..

4.1.1 \(\boxed{*\;}\) One-vs-One (OVO) and One-vs-All (OVA)

Any multinomial classification can be solved by solving a set of binomial classification problems. There are two ways OVA and OVO. Let us assume a multinomial data with k outcomes.

In OVA, we take one class vs rest of the class. We do this for all the class so that we have k such binary classifcation sets. We solve and pick the one that has maximum probability. In Iris dataset we will have three such data sets – setosa vs (versicolor or virginica), versicolor vs (setosa or virginica), and virginica vs (setosa or versicolor), Here we must be careful as the outcome is imbalanced – setosa will be fewer that (versicolr or virginica).

In OVO, we take one class vs other and do this for all different possible – k(k-1)/2 – pairs. We train the binary classifcation for these sets and pick which ever gets most selected (in case of tie, we can add the probabilities). In Iris dataset we have the following sets setosa vs versicolor, setosa vs virginica, versicolor vs virginica. Here we have large number of datasets – k(k-1)/2 compared to OVA’s k datasets. However, the datasets will be smaller and the outcomes are more balanced where the algorithms are efficient. I would personally prefer OVO.

library("e1071")
df<-iris_tr[!(yiris_tr=="setosa"),]
fit.svm <- svm(Species ~ ., data = iris_tr)
summary(fit.svm)
#svmfit <- svm(cross=3,Species~., data = df, kernel = "linear", cost = 10, scale = FALSE) 

plot(fit.svm, df, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = 0, Sepal_Length = 0))
plot(fit.svm, df, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = 0, Sepal_Length = 1))


range(iris_tr$Sepal_Length)

print(svmfit)
#plot(svmfit, iris_tr[,c(3,4,5)])

iris.svm <- svm(Species ~ ., data = iris)

plot(svmfit, iris_tr, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = -1, Sepal_Length = -2))

colnames(iris_tr[,c(1,2,5)])

fit.svm$x.scale
library("e1071")

fit.nb <- naiveBayes(Species ~ ., data = iris_tr)
class(fit.nb)
## [1] "naiveBayes"
summary(fit.nb)
##         Length Class  Mode     
## apriori 3      table  numeric  
## tables  4      -none- list     
## levels  3      -none- character
## call    4      -none- call
print(fit.nb)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     setosa versicolor  virginica 
##  0.3018868  0.3584906  0.3396226 
## 
## Conditional probabilities:
##             Sepal_Length
## Y                  [,1]      [,2]
##   setosa     -1.1844867 0.3440518
##   versicolor  0.1764839 0.6414172
##   virginica   1.0680843 0.6494136
## 
##             Sepal_Width
## Y                  [,1]      [,2]
##   setosa      0.5710849 0.6125846
##   versicolor -0.5662448 0.7869781
##   virginica  -0.1825228 0.5193487
## 
##             Petal_Length
## Y                  [,1]      [,2]
##   setosa     -1.3286707 0.0922157
##   versicolor  0.3100117 0.2253865
##   virginica   1.0434497 0.2621248
## 
##             Petal_Width
## Y                  [,1]      [,2]
##   setosa     -1.2618549 0.1784413
##   versicolor  0.1596868 0.2296557
##   virginica   1.0795700 0.3615488
pred_nb <- predict(fit.nb, newdata = xiris_te)
confusionMatrix(pred_nb, yiris_te)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         0
##   virginica       0          0        18
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9328, 1)
##     No Information Rate : 0.3585     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3019            0.3585           0.3396
## Detection Rate              0.3019            0.3585           0.3396
## Detection Prevalence        0.3019            0.3585           0.3396
## Balanced Accuracy           1.0000            1.0000           1.0000
table(pred_nb,yiris_te)
##             yiris_te
## pred_nb      setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         0
##   virginica       0          0        18

4.2 Decision Tree

Decision Tree is continued segmentation of data based of simple statistics. Plotting as decision tree help us in gaining insight in the patterns.For such a simple algorithm it provides good predictions.

The main disadvantage if the sample for training data is slightly different, it may lead to different decision trees.

So why not take averages of trees? We do exactly this, which is taking the contribution of many decision trees in Random Forest and Gradient Boost Model. These models are the most powerful algorithms today. (Although recently Deep Learning algorithm is considered most powerful in image, audio recognition, Natural Language Processing (NLP),

Following is the decision tree algorithm and we can see the prediction is fair but not the best.

library(rpart)
fit.dt<-rpart(Species~.,data=iris_tr)
library(rpart.plot) # elegant plot for decision tree
prp(fit.dt)

#summary(fit.dt)
pred <-predict(fit.dt, type="class")
confusionMatrix(pred, yiris_te)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         18         0
##   virginica       0          1        18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9811          
##                  95% CI : (0.8993, 0.9995)
##     No Information Rate : 0.3585          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9716          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9474           1.0000
## Specificity                 1.0000            1.0000           0.9714
## Pos Pred Value              1.0000            1.0000           0.9474
## Neg Pred Value              1.0000            0.9714           1.0000
## Prevalence                  0.3019            0.3585           0.3396
## Detection Rate              0.3019            0.3396           0.3396
## Detection Prevalence        0.3019            0.3396           0.3585
## Balanced Accuracy           1.0000            0.9737           0.9857

4.3 Random Forest

seq(1:3)

bs <- sample(seq(0, 1), size = 100, replace = TRUE, prob = c(15,85))
table(bs)
## bs
##  0  1 
## 15 85
cat(mean(bs), "+/-", sd(bs))
## 0.85 +/- 0.3588703
a <- replicate(200, sample(bs, size = 100, replace = TRUE), simplify = TRUE)
mean(apply(a,2,mean))
## [1] 0.85005

4.4 RandomForest Algorithm

This can be considered as the most powerful out of self algorithm. It requires no tuning parameters, although some tuning can be done. The library randomForest has some limitations, like max number of columns (about thousand columns) and max cardinality of categorical predictors (in thirties). (See the next section why cardinality of categorical predictors matter.)

RandomForest from H2o package is the most worry free algorithm where benchmark can be established. It is ten times faster than randomForest package and has no such limitations of number of columns and cardinality of categorical predictors. It also resolves the missing data. Lastly, most of the R packages works in memory, therefore size of RAM is crucial worry when working with large datasets. H2o package is written in Java, where upper working memory limit can be fixed by user! Almost magical when I first ran the code! We will use randomForst algorithm for our iris dataset, which is more informative.

Random Forst working …

bootstrapping technique

me tell you a very interesting idea. In a sample of data, you might think knowing mean, standard deviation, skewness, kurtosis and other moments is all there is to the data. But there are two more things that can be sq

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
 # no tuning
fit.rf <- randomForest(Species~.,data=iris_tr, importance=TRUE)
fit.rf$importance
##                   setosa  versicolor   virginica MeanDecreaseAccuracy
## Sepal_Length 0.130360317 0.037801010 0.067939755          0.075627805
## Sepal_Width  0.006494444 0.005198413 0.001207937          0.004097907
## Petal_Length 0.321311111 0.303278355 0.362278355          0.320265008
## Petal_Width  0.334059957 0.228189971 0.208605556          0.245722356
##              MeanDecreaseGini
## Sepal_Length        4.7786733
## Sepal_Width         0.4944315
## Petal_Length       16.3820470
## Petal_Width        12.8671853
# with tuning
#fit_rf<-randomForest(Species~.,data=iris_tr, ntree=ihp_NTREES, mtry=ihp_MTRY, nodesize= ihp_NODESIZE, importance =TRUE)
#summary(fit.rf)
pred_rf<-predict(fit.rf,xiris_te)
#confusionMatrix(pred_rf, yiris_te)
rm(list=setdiff(ls(),  c("iris","iris_te", "iris_tr", "xiris", "xiris_te", "xiris_tr", "yiris", "yiris_te", "yiris_tr")))
library(gbm)
## Warning: package 'gbm' was built under R version 3.3.3
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
my_DISTRIBUTION <- "multinomial" # possible values - "gaussian","adaboost","bernoulli"
hp_NTREES <- 5001; hp_SHRINKAGE <- 0.01; hp_DEPTH <- 2; hp_MINOBS <- 3 # default
#gbm_fit <- gbm(Species~.,data=iris_tr, distribution=my_DISTRIBUTION,
#              n.trees=hp_NTREES, shrinkage=hp_SHRINKAGE,
#              interaction.depth = hp_DEPTH, n.minobsinnode = hp_MINOBS, verbose=TRUE,
#              cv.folds = 0, n.cores=1)
gbm.fit <- gbm(Species~.,data=iris_tr, distribution=my_DISTRIBUTION,
              n.trees=hp_NTREES, shrinkage=hp_SHRINKAGE,
              cv.folds = 0, n.cores=1)

gbm.perf(gbm.fit)
## Using OOB method...
## Warning in gbm.perf(gbm.fit): OOB generally underestimates the optimal
## number of iterations although predictive performance is reasonably
## competitive. Using cv.folds>0 when calling gbm usually results in improved
## predictive performance.

## [1] 391
#summary(gbm.fit, n.trees=1)
summary(gbm.fit, n.trees=gbm.perf(gbm.fit, plot=FALSE), plotit=TRUE, method = relative.influence)
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.
plot.gbm(gbm.fit, n.trees=gbm.perf(gbm.fit, plot=FALSE))
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.

#predict.gbm(gbm_fit, newdata=xiris_tr)
#pred_gbm <-predict.gbm(gbm,fit, newdata=xiris_tr, n.trees=gbm.perf(gbm.fit, plot=FALSE), single.tree = TRUE, type="link")
pred_gbm <-predict(gbm.fit, newdata=xiris_te, n.trees=gbm.perf(gbm.fit, plot=FALSE), type="response")
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.
pred_gbm[1:4,,]
##         setosa versicolor    virginica
## [1,] 0.9803740 0.01892186 0.0007041781
## [2,] 0.9296742 0.06751518 0.0028105755
## [3,] 0.9803740 0.01892186 0.0007041781
## [4,] 0.9775433 0.02163097 0.0008257231
pred_gbm <- colnames(pred_gbm)[apply(pred_gbm, 1, which.max)]
confusionMatrix(pred_gbm, yiris_te)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         16          0         0
##   versicolor      0         19         0
##   virginica       0          0        18
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9328, 1)
##     No Information Rate : 0.3585     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3019            0.3585           0.3396
## Detection Rate              0.3019            0.3585           0.3396
## Detection Prevalence        0.3019            0.3585           0.3396
## Balanced Accuracy           1.0000            1.0000           1.0000
rm(list=setdiff(ls(),  c("iris","iris_te", "iris_tr", "xiris", "xiris_te", "xiris_tr", "yiris", "yiris_te", "yiris_tr")))