The objective of this story is to cover a broad breadth of topics in data science and data handling in R and Python using the ubiquitous Iris flower dataset. Topics covered include data structures, column manipulation, clustering, PCA, various ML algorithms and visualization. Doing the parallel exercise in Python helps in quickly picking up the other language.
Let us get to know the Iris flower dataset. It is one of the most popular toy datasets for ML (Machine Learning): the task is to identify the species based on four observations made on the petal's and sepal's length and width. The dataset consists of four predictor columns (also called features) of numeric data type. Based on these features, one can predict the species of the flower – setosa, versicolor, or virginica. There are 50 measurements (one row per measurement) for each species.
In machine learning terms, this is a multinomial classification problem: the response variable (the variable that needs to be predicted, also called the label or target) has more than two outcomes – setosa, versicolor, and virginica. If there were two outcomes, it would be a binary classification problem. If the response variable were a number, like a sales amount, it would be a regression problem.
The Iris data already exists in R's base installation (the datasets package), so no library needs to be loaded to access the iris data set.
Surprisingly, yes! A prevalent piece of wisdom for reducing the number of predictor columns is to remove columns that are highly correlated with another column. We find that it is not a good idea to remove a highly correlated column without due consideration. To jump directly to this finding, click the RED box.
In the next section, we take the first step in handling the data, which is the following:
The Iris dataset comes as a dataframe. We split it vertically into the predictors only – xiris – and the single response column – yiris.
class(iris) # check the way the data is stored
## [1] "data.frame"
dim(iris) # dimension -- rows, column for dataframe
## [1] 150 5
str(iris) # data type for each columns
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
colnames(iris) # show column names
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
# rename the first 4 column names
colnames(iris)[1:4] <- c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")
colnames(iris)
## [1] "Sepal_Length" "Sepal_Width" "Petal_Length" "Petal_Width"
## [5] "Species"
xiris <- iris[, 1:4] # or iris[, c("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width")]
yiris <- iris[, 5, drop=FALSE] # with a single column selected, the default drop=TRUE would convert yiris to a vector, erasing the column name
# in Python the response would typically be coded as 0, 1, 2 for the species
# as an exercise, we map the species (setosa, versicolor, virginica) to (0, 1, 2)
y_temp <- yiris
y_temp <- sapply(yiris, function(x) {ifelse(x == 'setosa', 0, ifelse(x == 'versicolor', 1, 2))}) # not as elegant as Python's map
table(y_temp) # check the frequency of values to confirm there are 50 of each 0,1,2 as response variable
## y_temp
## 0 1 2
## 50 50 50
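A more idiomatic R alternative (a quick sketch of my own) exploits the fact that a factor is already stored as integer codes under the hood:
# Factors are integer-coded 1, 2, 3 internally, so subtracting 1 gives the Python-style 0, 1, 2
y_temp2 <- as.integer(yiris$Species) - 1   # setosa = 0, versicolor = 1, virginica = 2
table(y_temp2)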
head(xiris,6)
tail(yiris,6)
cat(dim(xiris)) # dim(df) in R
## 150 4
str(xiris) # data type of each column
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal_Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal_Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal_Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal_Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#rm(list=ls())
We transform the predictor columns so that each column has mean = 0 and standard deviation = 1. This is important because some ML algorithms, like KNN, will give more weight to predictors that have a larger range.
In R, we do this with the versatile caret package. Afterwards we check the mean, standard deviation and range.
library(caret)
preObj <- preProcess(xiris, method=c("center", "scale"))
xiris <- predict(preObj, xiris)
iris <-data.frame(xiris,yiris)
apply(xiris, 2, mean) # MARGIN = 2 means operate column-wise
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## -4.484318e-16 2.034094e-16 -2.895326e-17 -2.989362e-17
apply(xiris, 2, sd)
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## 1 1 1 1
apply(xiris,2,range)
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## [1,] -1.863780 -2.425820 -1.562342 -1.442245
## [2,] 2.483699 3.080455 1.779869 1.706379
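As an aside, the same centering and scaling can be done in base R with scale(); a minimal sketch of my own on the untouched copy of the data in the datasets package (the names xraw and xscaled are illustrative only and not used further):
# Base-R alternative to caret's preProcess (sketch; datasets::iris still has the original dot-separated column names)
xraw <- datasets::iris[, 1:4]          # the original, unscaled measurements
xscaled <- as.data.frame(scale(xraw))  # center each column to mean 0 and scale to sd 1
round(apply(xscaled, 2, mean), 10)     # effectively 0 for every column
apply(xscaled, 2, sd)                  # exactly 1 for every column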
Splitting of the data is done for cross-validation, which is the best method to optimize the model fit – i.e., neither underfit nor overfit. This is done by separating out a small portion (20%-30%) of the data on which we do not train the model. We then test the prediction on this held-out data. If the model fit is optimized, the prediction will be comparable to the prediction on the training set.
Optimization is about striking a balance between bias and variance. A no-model (y = constant, say y = avg(y) or mode(y)) has high bias but low variance. An overfit model has low bias but high variance.
set.seed(12345) # set seed so that results are reproducible when run again
# sample the value 1 or 2 for each row, with probabilities in the ratio 70:30
# We could create three data sets by using seq(1, 3) and prob = c(70, 15, 15)
idx <- sample(seq(1, 2), size = nrow(xiris), replace = TRUE, prob = c(70,30))
table(idx)/length(idx)*100 # check back the percentages of 1 and 2
## idx
## 1 2
## 64.66667 35.33333
# Define separate train and test sets
xiris_tr <- xiris[idx==1,]
yiris_tr <- yiris[idx==1,]
xiris_te <- xiris[idx==2,]
yiris_te <- yiris[idx==2,]
iris_tr <- iris[idx==1,]
iris_te <- iris[idx==2,]
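As an aside, caret also offers a stratified split that keeps the class proportions equal in both sets; a minimal sketch (the names iris_tr2 and iris_te2 are illustrative only – the sample()-based split above is what we actually use):
# Stratified ~70/30 split with caret (sketch only; not used further)
set.seed(12345)
tr_idx <- caret::createDataPartition(iris$Species, p = 0.7, list = FALSE)
iris_tr2 <- iris[tr_idx, ]   # ~70% of each species
iris_te2 <- iris[-tr_idx, ]  # remaining ~30%
table(iris_tr2$Species)      # roughly 35 of each species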
We will use the following approaches to find important predictors. This is especially useful when the number of predictors runs into the hundreds or thousands, where reducing the predictors before processing becomes very important.
The correlation function finds the correlation between every pair of variables. If two predictors are highly correlated, one can be dropped as it may not add predictive value. (I am reminded of General Colin Powell's quip – if you have a yes man working for you, one of you is redundant.)
# the cor function calculates correlation between all possible pairs of dataframe.
cor(xiris)
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## Sepal_Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal_Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal_Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal_Width 0.8179411 -0.3661259 0.9628654 1.0000000
Based on the above output, Petal_Width and Petal_Length are highly correlated (>96%) and therefore one of them could be dropped. It will be a surprise that the next method suggests the contrary!
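As an aside, for large predictor sets caret's findCorrelation() automates this cutoff-based screening; a minimal sketch (illustration only – we do not actually drop anything):
# Which column would a naive 0.9 correlation cutoff flag for removal? (illustration only)
high_cor_idx <- caret::findCorrelation(cor(xiris), cutoff = 0.9)
colnames(xiris)[high_cor_idx]  # flags one of the two Petal columns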
regsubsets performs many iterations, checking the predictive power of different sets of predictor variables. In the first iteration, regsubsets uses one predictor at a time to see which single variable is most important for predicting Species. In the next iteration it uses two variables and checks which pair is most important, and so on. There are various ways (not discussed here) to pick the best iteration. This algorithm is especially useful when there are hundreds or thousands of predictor variables. If the dataset is large, we should not use the exhaustive method due to the huge number of predictor combinations; we use the "forward" or "backward" method in such cases (a sketch of the forward variant appears at the end of this section).
The leaps library is used in the labs of one of the best (and free) data science books, An Introduction to Statistical Learning with Applications in R, co-authored by Professor Trevor Hastie.
library(leaps)
# In R, a formula such as "Species ~ ." can be passed to functions like regsubsets; it means: Species is the response variable and all other variables are predictors to be considered in the evaluation.
fit <- regsubsets(Species ~ ., data = iris_tr, method = "exhaustive", nvmax = 4) # nvmax = largest subset size to examine
summary(fit)
## Subset selection object
## Call: regsubsets.formula(Species ~ ., data = iris_tr, method = "exhaustive",
## nvmax = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## Sepal_Length FALSE FALSE
## Sepal_Width FALSE FALSE
## Petal_Length FALSE FALSE
## Petal_Width FALSE FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## 1 ( 1 ) " " " " " " "*"
## 2 ( 1 ) " " " " "*" "*"
## 3 ( 1 ) "*" " " "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
The matrix output above marks the most significant columns for prediction with "*". In the first iteration only one column is chosen, which is Petal_Width. The next iteration picks the two most significant columns (Petal_Width, Petal_Length), and so on. We see that Petal_Width is the most important variable, as it consistently appears in all iterations, hence it must not be dropped. The second most important variable is Petal_Length! This appears to contradict the earlier observation that Petal_Width and Petal_Length are >96% correlated, and hence one of them could be dropped. The only logical explanation is that the contribution of the other variables – Sepal_Length and Sepal_Width – is smaller than the few percent of extra information contributed by the highly correlated Petal_Length. We will verify this in the ML algorithm section. The third variable of importance is Sepal_Length.
We shall use (Petal_Width, Petal_Length, Sepal_Length) for the ML algorithms.
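For reference, the forward variant mentioned above would look like this; a minimal sketch (the exhaustive search above is what we actually rely on):
# Forward stepwise search (sketch): cheaper than exhaustive search when predictors number in the hundreds
fit_fwd <- leaps::regsubsets(Species ~ ., data = iris_tr, method = "forward", nvmax = 4)
summary(fit_fwd)$which  # logical matrix: which predictors enter at each subset size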
The PCA technique transforms the predictor columns into a new frame where the first variable (PC1) lies along the axis of maximum variation, the second variable (PC2) is perpendicular to PC1 and captures the maximum remaining variation, and so on. So we have variables PC1, PC2, … in decreasing order of importance (variation). We can plot the proportion of variance and place a cut-off beyond which we drop the variables, as the variation they explain is very small.
The ease of this method for selecting important (albeit transformed) variables makes it a very popular method for reducing predictor dimensions. Reducing the dimension of the predictor variables can help ML algorithms run faster without losing predictive power.
The PCA method is implemented as follows:
pca.out=prcomp(xiris, scale=TRUE)
#biplot(pca.out, scale=1) # We will not use this biplot, instead we use a fancy one.
library(devtools)
#install_github("vqv/ggbiplot")
library(ggbiplot)
print(ggbiplot(pca.out, obs.scale = 1, var.scale = 1, groups = iris$Species, ellipse = TRUE, circle = TRUE))
summary(pca.out) # get proportion of variance
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
# plot proportion of variance
plot(pca.out$sdev^2/sum(pca.out$sdev^2)*100, type="b", col="blue", ylim=c(0,100), axes=FALSE, ann=FALSE)
axis(1, at=1:4, lab=paste0("PC", 1:4))
axis(2, las=1, at=seq(0,100,10))
#grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted", lwd = par("lwd"), equilogs = TRUE) # not flexible
abline(h=seq(0,100,10), v=1:4, col="grey", lty="dotted")
title(xlab="PCA Variables", col.lab="black", line=2.2)
title(ylab="variance (in %)", col.lab="black", cex.lab=1.1,
line=2.4) # line gives the separation of the label from the axis
title(main="PCA's Proportion of variance plot", col.main="black")
box()
#colors() # show all the colors
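If we actually wanted to reduce the dimensions, we could keep the components that cover, say, 95% of the variance and work with the rotated scores; a minimal sketch (xiris_pca is illustrative only and not used in the rest of this post):
# Keep the principal components explaining ~95% of the cumulative variance (sketch only)
var_explained <- pca.out$sdev^2 / sum(pca.out$sdev^2)
n_keep <- which(cumsum(var_explained) >= 0.95)[1]                # smallest number of PCs reaching 95% (here, 2)
xiris_pca <- as.data.frame(pca.out$x[, 1:n_keep, drop = FALSE])  # rotated scores PC1..PCn_keep
dim(xiris_pca)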
The LDA technique transforms the predictor columns into a frame where the first variable (LD1) lies along the axis of maximum class separation (in the response variable), and the second variable (LD2) again lies along an axis of maximum class separation, but perpendicular to LD1. So we have variables LD1, LD2, … in decreasing order of class separation. This is very similar to PCA in approach, but PCA deals with variation in the predictor data whereas LDA deals with class separation. PCA is unsupervised (i.e., it does not need the response variable – Species); LDA is supervised, as the separation is based on knowing the response class. LDA works only on multinomial or binomial classification datasets (it does not work on regression datasets).
Since LDA will be using the response variable, we will not analyse the entire data, but only the training set – iris_tr. We will keep the test set – iris_te – only for prediction. This way we ensure the ML algorithm is unbiased in its prediction.
The following code explores the data with LDA. Interestingly, we find that LD1 and LD2 depend mostly on Petal_Width and Petal_Length! In spite of the >96% correlation between Petal_Width and Petal_Length, they are more important than the two Sepal predictors. This is exactly what regsubsets concluded. Hence we must not remove a highly correlated variable without further analysis.
We use the ggplot2 plotting library to plot overlapping histograms; ggplot2 is the de facto standard plotting library in R (a Python port exists as well).
c("green","blue","red")[yiris_tr] is an interesting way to map the response classes to colors. The formula could be Species~., Species~Petal_Width+Petal_Length, or Species~.-Petal_Length to select the appropriate predictor variables.
library(MASS) # for lda, qda
lda.fit <- lda(Species~., data=iris_tr) # Use all predictors
#lda.fit <- lda(Species~Petal_Width+Petal_Length, data=iris_tr) # Use only Petal_Width and Petal_Length predictors
#lda.fit <- lda(Species~.-Petal_Length, data=iris_tr) # Use all but Petal_Length as predictors
prop.table(lda.fit$svd^2) # gives the proportion of trace
## [1] 0.995062308 0.004937692
print(lda.fit)
## Call:
## lda(Species ~ ., data = iris_tr)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3018868 0.3584906 0.3396226
##
## Group means:
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## setosa -1.1844867 0.5710849 -1.3286707 -1.2618549
## versicolor 0.1764839 -0.5662448 0.3100117 0.1596868
## virginica 1.0680843 -0.1825228 1.0434497 1.0795700
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal_Length 0.8080343 0.4573182
## Sepal_Width 0.9880170 0.9684004
## Petal_Length -5.1898199 -1.8644682
## Petal_Width -2.5576125 1.8705542
##
## Proportion of trace:
## LD1 LD2
## 0.9951 0.0049
library(devtools)
#install_github('fawda123/ggord')
library(ggord)
ggord(lda.fit, iris_tr$Species)
lda.pred <- predict(lda.fit, newdata=xiris_tr)
plot(lda.pred$x, pch=19, col=c("green","blue","red")[yiris_tr])
df <- data.frame(lda.pred$x, "Species"=yiris_tr)
library(ggplot2)
ggplot(data=df, aes(x=LD1, group=Species, fill=Species)) + geom_histogram(position="identity", binwidth=.5, alpha=.8)
From the proportion of trace, LD1 essentially does all the work of separating the classes, which is evident in the ggord plot and the overlapping histogram.
print(lda.fit) shows LD1 with coefficients of roughly (-5.2, -2.6, 1.0, 0.8) for (Petal_Length, Petal_Width, Sepal_Width, Sepal_Length). Since the predictors are standardized, the coefficients show the strength of the contribution from each variable. They show that (Petal_Length, Petal_Width) are far stronger variables than (Sepal_Width, Sepal_Length). This reaffirms our assertion that even though Petal_Length and Petal_Width are more than 96% correlated, both are still important, because the other variables (Sepal_Width, Sepal_Length) contribute far less.
LDA is also an ML algorithm grounded in statistics. We continue from the previous section and use LDA to predict on the test data. We then compare the predicted data with the actual data to assess goodness of fit.
For a regression problem the goodness of fit can be a simple measure such as the standard deviation of the residuals; for a classification problem, it is the confusion matrix. There are two more interesting measures – balanced accuracy and kappa – that deserve explanation. In a binomial classification where one outcome is much rarer than the other, balanced accuracy should be consulted to gauge goodness of fit. Kappa (in its weighted form) is important when we have a multinomial classification with ordinal outcomes, say low, medium, and high: mis-classifying low as medium is less severe than mis-classifying low as high, and kappa will be higher in the former case than in the latter. The higher the kappa, the better the goodness of fit.
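To make these two measures concrete, here is a toy imbalanced binary confusion matrix with balanced accuracy and (unweighted) kappa computed by hand; the counts are made up purely for illustration:
# Toy imbalanced 2x2 confusion matrix (made-up counts): 10 positives, 90 negatives
TP <- 8; FN <- 2; FP <- 5; TN <- 85; n <- TP + FN + FP + TN
sensitivity <- TP / (TP + FN)                          # 0.80
specificity <- TN / (TN + FP)                          # ~0.94
balanced_accuracy <- (sensitivity + specificity) / 2   # ~0.87
accuracy <- (TP + TN) / n                              # 0.93 -- looks better than it is
expected <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / n^2  # agreement expected by chance
kappa <- (accuracy - expected) / (1 - expected)        # ~0.66
c(accuracy = accuracy, balanced_accuracy = balanced_accuracy, kappa = round(kappa, 2))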
We will confine ourselves to the confusion matrix for this simple example.
lda.pred <- predict(lda.fit, newdata=xiris_te)
library(caret)
confusionMatrix(lda.pred$class, yiris_te)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 0
## virginica 0 0 18
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9328, 1)
## No Information Rate : 0.3585
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3019 0.3585 0.3396
## Detection Rate 0.3019 0.3585 0.3396
## Detection Prevalence 0.3019 0.3585 0.3396
## Balanced Accuracy 1.0000 1.0000 1.0000
summary(lda.fit)
## Length Class Mode
## prior 3 -none- numeric
## counts 3 -none- numeric
## means 12 -none- numeric
## scaling 8 -none- numeric
## lev 3 -none- character
## svd 2 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 0 -none- list
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:scales':
##
## alpha
## The following object is masked from 'package:ggplot2':
##
## alpha
model8<-ksvm(Species~.,data=iris_tr)
#Summarize the model
summary(model8)
## Length Class Mode
## 1 ksvm S4
#Predict using the model
pred_svm<-predict(model8,xiris_te,type="response")
confusionMatrix(pred_svm, yiris_te)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 0
## virginica 0 0 18
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9328, 1)
## No Information Rate : 0.3585
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3019 0.3585 0.3396
## Detection Rate 0.3019 0.3585 0.3396
## Detection Prevalence 0.3019 0.3585 0.3396
## Balanced Accuracy 1.0000 1.0000 1.0000
[The following paragraphs on SVM are based on my rudimentary understanding thus far. Expect this section to change in due time. Please feel free to correct me.]
A linear SVM tries to separate the classes by transforming the points to higher dimensions where they can be separated. It then finds the decision boundary that has the widest margin. To illustrate, imagine building the widest possible, but straight, road separating two clusters of trees of different kinds. The trees that touch the road boundary on either side are called support vectors. Making the road as wide as possible constrains the direction of the road, and that direction gives the predictive power to separate the two clusters in the test data.
For some datasets such a separation is just not possible. Here we use a non-linear transformation by choosing a kernel. It is like building a curved road with some predefined curvature.
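To make the kernel idea concrete, here is a small sketch of my own using e1071 (which is also used below); the kernel argument picks a straight or a curved boundary, and gamma controls how sharply the curved boundary may bend (fit_straight and fit_curved are illustrative names only):
library(e1071)
# Straight "road": a linear kernel
fit_straight <- svm(Species ~ ., data = iris_tr, kernel = "linear", cost = 1)
# Curved "road": a radial (RBF) kernel; larger gamma allows sharper curvature
fit_curved <- svm(Species ~ ., data = iris_tr, kernel = "radial", cost = 1, gamma = 2)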
SVM is naturally built for binary classification, yet it can solve a multinomial classification with the following techniques.
Any multinomial classification can be solved by solving a set of binomial classification problems. There are two ways: OVA (one-vs-all) and OVO (one-vs-one). Let us assume multinomial data with k outcomes.
In OVA, we take one class vs the rest of the classes. We do this for every class, so that we have k such binary classification sets. We solve them and pick the class with the maximum probability. In the Iris dataset we would have three such data sets – setosa vs (versicolor or virginica), versicolor vs (setosa or virginica), and virginica vs (setosa or versicolor). Here we must be careful, as each outcome is imbalanced – setosa will be fewer than (versicolor or virginica).
In OVO, we take one class vs another and do this for all possible – k(k-1)/2 – pairs. We train a binary classifier for each pair and pick whichever class gets selected most often (in case of a tie, we can add the probabilities). In the Iris dataset we have the sets setosa vs versicolor, setosa vs virginica, and versicolor vs virginica. Here we have a larger number of datasets – k(k-1)/2 compared to OVA's k – but the datasets are smaller and the outcomes are more balanced, where the algorithms are efficient. I would personally prefer OVO; a sketch follows below.
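Here is a minimal hand-rolled OVO sketch of my own (for illustration only; note that e1071's svm already performs one-vs-one voting internally for multiclass problems, so none of this is needed in practice):
library(e1071)
classes <- levels(iris_tr$Species)
class_pairs <- combn(classes, 2, simplify = FALSE)   # the k(k-1)/2 = 3 class pairs
votes <- matrix(0, nrow = nrow(xiris_te), ncol = length(classes),
                dimnames = list(NULL, classes))
for (p in class_pairs) {
  sub <- droplevels(iris_tr[iris_tr$Species %in% p, ])   # keep only the two classes of this pair
  fit <- svm(Species ~ ., data = sub, kernel = "linear")
  pred <- as.character(predict(fit, xiris_te))
  for (cl in p) votes[, cl] <- votes[, cl] + (pred == cl) # one vote for the pairwise winner
}
ovo_pred <- factor(classes[max.col(votes)], levels = classes)  # class with the most votes
table(ovo_pred, yiris_te)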
library("e1071")
df<-iris_tr[!(yiris_tr=="setosa"),]
fit.svm <- svm(Species ~ ., data = iris_tr)
summary(fit.svm)
#svmfit <- svm(cross=3,Species~., data = df, kernel = "linear", cost = 10, scale = FALSE)
plot(fit.svm, df, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = 0, Sepal_Length = 0))
plot(fit.svm, df, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = 0, Sepal_Length = 1))
range(iris_tr$Sepal_Length)
print(svmfit)
#plot(svmfit, iris_tr[,c(3,4,5)])
iris.svm <- svm(Species ~ ., data = iris)
plot(svmfit, iris_tr, Petal_Width ~ Petal_Length, slice = list(Sepal_Width = -1, Sepal_Length = -2))
colnames(iris_tr[,c(1,2,5)])
fit.svm$x.scale
library("e1071")
fit.nb <- naiveBayes(Species ~ ., data = iris_tr)
class(fit.nb)
## [1] "naiveBayes"
summary(fit.nb)
## Length Class Mode
## apriori 3 table numeric
## tables 4 -none- list
## levels 3 -none- character
## call 4 -none- call
print(fit.nb)
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## setosa versicolor virginica
## 0.3018868 0.3584906 0.3396226
##
## Conditional probabilities:
## Sepal_Length
## Y [,1] [,2]
## setosa -1.1844867 0.3440518
## versicolor 0.1764839 0.6414172
## virginica 1.0680843 0.6494136
##
## Sepal_Width
## Y [,1] [,2]
## setosa 0.5710849 0.6125846
## versicolor -0.5662448 0.7869781
## virginica -0.1825228 0.5193487
##
## Petal_Length
## Y [,1] [,2]
## setosa -1.3286707 0.0922157
## versicolor 0.3100117 0.2253865
## virginica 1.0434497 0.2621248
##
## Petal_Width
## Y [,1] [,2]
## setosa -1.2618549 0.1784413
## versicolor 0.1596868 0.2296557
## virginica 1.0795700 0.3615488
pred_nb <- predict(fit.nb, newdata = xiris_te)
confusionMatrix(pred_nb, yiris_te)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 0
## virginica 0 0 18
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9328, 1)
## No Information Rate : 0.3585
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3019 0.3585 0.3396
## Detection Rate 0.3019 0.3585 0.3396
## Detection Prevalence 0.3019 0.3585 0.3396
## Balanced Accuracy 1.0000 1.0000 1.0000
table(pred_nb,yiris_te)
## yiris_te
## pred_nb setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 0
## virginica 0 0 18
A decision tree is a repeated segmentation of the data based on simple statistics. Plotting the decision tree helps us gain insight into the patterns. For such a simple algorithm it provides good predictions.
The main disadvantage is that if the sample used for training is slightly different, it may lead to a very different decision tree.
So why not take an average of many trees? We do exactly this – combining the contributions of many decision trees – in Random Forest and in the Gradient Boosting Model. These models are among the most powerful algorithms today (although recently Deep Learning is considered the most powerful for image and audio recognition and Natural Language Processing (NLP)).
Below is the decision tree algorithm; we can see the prediction is fair but not the best.
library(rpart)
fit.dt<-rpart(Species~.,data=iris_tr)
library(rpart.plot) # elegant plot for decision tree
prp(fit.dt)
#summary(fit.dt)
pred <- predict(fit.dt, newdata = xiris_te, type = "class") # predict on the held-out test set
confusionMatrix(pred, yiris_te)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 18 0
## virginica 0 1 18
##
## Overall Statistics
##
## Accuracy : 0.9811
## 95% CI : (0.8993, 0.9995)
## No Information Rate : 0.3585
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9716
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9474 1.0000
## Specificity 1.0000 1.0000 0.9714
## Pos Pred Value 1.0000 1.0000 0.9474
## Neg Pred Value 1.0000 0.9714 1.0000
## Prevalence 0.3019 0.3585 0.3396
## Detection Rate 0.3019 0.3396 0.3396
## Detection Prevalence 0.3019 0.3396 0.3585
## Balanced Accuracy 1.0000 0.9737 0.9857
# A quick aside on bootstrap resampling (the idea behind bagging, which random forests build on):
# draw a 0/1 sample, then resample it with replacement many times
bs <- sample(seq(0, 1), size = 100, replace = TRUE, prob = c(15,85))
table(bs)
## bs
## 0 1
## 15 85
cat(mean(bs), "+/-", sd(bs))
## 0.85 +/- 0.3588703
a <- replicate(200, sample(bs, size = 100, replace = TRUE), simplify = TRUE) # 200 bootstrap resamples of bs
mean(apply(a,2,mean)) # the average of the resample means recovers the original sample mean
## [1] 0.85005
Random Forest can be considered the most powerful off-the-shelf algorithm. It requires no tuning parameters, although some tuning can be done. The randomForest library has some limitations, such as the maximum number of columns (about a thousand) and the maximum cardinality of categorical predictors (in the thirties). (See the next section for why the cardinality of categorical predictors matters.)
The random forest in the H2O package is the most worry-free algorithm with which a benchmark can be established. It is about ten times faster than the randomForest package and has no such limitations on the number of columns or the cardinality of categorical predictors. It also handles missing data. Lastly, most R packages work in memory, so the size of the RAM is a crucial worry when working with large datasets. The H2O package is written in Java, where the upper working-memory limit can be set by the user! Almost magical when I first ran the code! We will use the randomForest package for our Iris dataset, which is more informative here.
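For completeness, a sketch of what the H2O equivalent might look like (commented out and not run here; it assumes the h2o package and Java are installed, and the argument values are illustrative):
# H2O random forest sketch (not run); note the user-set memory cap
# library(h2o)
# h2o.init(max_mem_size = "2g")                        # upper working-memory limit
# tr_h2o <- as.h2o(iris_tr); te_h2o <- as.h2o(iris_te)
# fit_h2o <- h2o.randomForest(x = colnames(xiris), y = "Species",
#                             training_frame = tr_h2o, ntrees = 200)
# pred_h2o <- h2o.predict(fit_h2o, te_h2o)
# h2o.shutdown(prompt = FALSE)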
Random Forest working …
Let me tell you a very interesting idea. In a sample of data, you might think knowing the mean, standard deviation, skewness, kurtosis and other moments is all there is to the data. But there are two more things that can be sq
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# no tuning
fit.rf <- randomForest(Species~.,data=iris_tr, importance=TRUE)
fit.rf$importance
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal_Length 0.130360317 0.037801010 0.067939755 0.075627805
## Sepal_Width 0.006494444 0.005198413 0.001207937 0.004097907
## Petal_Length 0.321311111 0.303278355 0.362278355 0.320265008
## Petal_Width 0.334059957 0.228189971 0.208605556 0.245722356
## MeanDecreaseGini
## Sepal_Length 4.7786733
## Sepal_Width 0.4944315
## Petal_Length 16.3820470
## Petal_Width 12.8671853
# with tuning
#fit_rf<-randomForest(Species~.,data=iris_tr, ntree=ihp_NTREES, mtry=ihp_MTRY, nodesize= ihp_NODESIZE, importance =TRUE)
#summary(fit.rf)
pred_rf<-predict(fit.rf,xiris_te)
#confusionMatrix(pred_rf, yiris_te)
rm(list=setdiff(ls(), c("iris","iris_te", "iris_tr", "xiris", "xiris_te", "xiris_tr", "yiris", "yiris_te", "yiris_tr")))
library(gbm)
## Warning: package 'gbm' was built under R version 3.3.3
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
my_DISTRIBUTION <- "multinomial" # possible values - "gaussian","adaboost","bernoulli"
hp_NTREES <- 5001; hp_SHRINKAGE <- 0.01; hp_DEPTH <- 2; hp_MINOBS <- 3 # default
#gbm_fit <- gbm(Species~.,data=iris_tr, distribution=my_DISTRIBUTION,
# n.trees=hp_NTREES, shrinkage=hp_SHRINKAGE,
# interaction.depth = hp_DEPTH, n.minobsinnode = hp_MINOBS, verbose=TRUE,
# cv.folds = 0, n.cores=1)
gbm.fit <- gbm(Species~.,data=iris_tr, distribution=my_DISTRIBUTION,
n.trees=hp_NTREES, shrinkage=hp_SHRINKAGE,
cv.folds = 0, n.cores=1)
gbm.perf(gbm.fit)
## Using OOB method...
## Warning in gbm.perf(gbm.fit): OOB generally underestimates the optimal
## number of iterations although predictive performance is reasonably
## competitive. Using cv.folds>0 when calling gbm usually results in improved
## predictive performance.
## [1] 391
#summary(gbm.fit, n.trees=1)
summary(gbm.fit, n.trees=gbm.perf(gbm.fit, plot=FALSE), plotit=TRUE, method = relative.influence)
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.
plot.gbm(gbm.fit, n.trees=gbm.perf(gbm.fit, plot=FALSE))
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.
#predict.gbm(gbm_fit, newdata=xiris_tr)
#pred_gbm <-predict.gbm(gbm,fit, newdata=xiris_tr, n.trees=gbm.perf(gbm.fit, plot=FALSE), single.tree = TRUE, type="link")
pred_gbm <-predict(gbm.fit, newdata=xiris_te, n.trees=gbm.perf(gbm.fit, plot=FALSE), type="response")
## Using OOB method...
## Warning in gbm.perf(gbm.fit, plot = FALSE): OOB generally underestimates
## the optimal number of iterations although predictive performance is
## reasonably competitive. Using cv.folds>0 when calling gbm usually results
## in improved predictive performance.
pred_gbm[1:4,,]
## setosa versicolor virginica
## [1,] 0.9803740 0.01892186 0.0007041781
## [2,] 0.9296742 0.06751518 0.0028105755
## [3,] 0.9803740 0.01892186 0.0007041781
## [4,] 0.9775433 0.02163097 0.0008257231
pred_gbm <- colnames(pred_gbm)[apply(pred_gbm, 1, which.max)]
confusionMatrix(pred_gbm, yiris_te)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 16 0 0
## versicolor 0 19 0
## virginica 0 0 18
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9328, 1)
## No Information Rate : 0.3585
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3019 0.3585 0.3396
## Detection Rate 0.3019 0.3585 0.3396
## Detection Prevalence 0.3019 0.3585 0.3396
## Balanced Accuracy 1.0000 1.0000 1.0000
rm(list=setdiff(ls(), c("iris","iris_te", "iris_tr", "xiris", "xiris_te", "xiris_tr", "yiris", "yiris_te", "yiris_tr")))