On my path to becoming a data scientist, I would like to showcase this self-executed project. It uses the unbalanced binary-classification bank marketing data from UC Irvine. I chose this dataset because it is realistic and simple, yet it tests many concepts and ML (Machine Learning) algorithms.
The business-case result is presented in a single plot from which various business advantage scenarios can be exploited (jump to this section).
To get the best ML prediction, we go through the following steps:
The R code that reproduces the results presented here is also included.
The Bank Marketing Dataset has about 41k observations with 19 predictor variables related to the following (more details are given in Appendix II - Data Dictionary):
The response (also known as target or label) variable is the column named “y”. The response variable has two levels, “yes” and “no”.
The response “yes” makes up only 11.3% of the data, meaning that with no model, the random chance of picking a “yes” is only 11.3%. This is an unbalanced binary classification problem, so we will track kappa and balanced accuracy as the best indicators of how well we isolate the “yes” customers.
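To see why plain accuracy is a poor guide here, consider a minimal sketch (assuming the irvn data frame loaded in the next section): a trivial classifier that always answers “no” looks accurate but is useless for finding subscribers.
always_no <- factor(rep("no", nrow(irvn)), levels = c("no", "yes"))
mean(always_no == irvn$y) # accuracy ~0.887, just the majority-class rate
suppressMessages(library(caret))
confusionMatrix(always_no, irvn$y) # Kappa : 0, Balanced Accuracy : 0.5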
The data at hand does not have any NAs (missing values). The original data did have NAs, but they were replaced by “unknown”.
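As a quick check (a sketch; it assumes the factor columns use the literal level “unknown”), we can count how many of these placeholders each predictor carries:
# Count the "unknown" placeholders (formerly NAs) in each column
sapply(irvn, function(x) sum(x == "unknown"))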
Based on the description of the data (see Data Dictionary), we remove the duration variable: the call duration is known only after the outcome of the call, so keeping it would bias the prediction.
We will split the data into three sets, (1) Training Set, (2) Test Set and (3) Validation Set, in the ratio 60:20:20. We shall train the ML algorithms with the Training Set and test them with the Test Set. Once we have the final model, we will additionally validate it with the Validation Set.
The following code reads the Bank Marketing data from the file and removes the “duration” variable:
rm(list=ls()) # remove all objects
irvn_org <- read.csv(file="C:\\ds_local\\dataset\\irvine-bank-marketing.txt")
irvn_org$duration <- NULL
#
irvn <-irvn_org
yirvn <- irvn_org[,"y",drop=FALSE]
dim(irvn)
## [1] 41188 20
The following code splits the data into the three sets: Training Set, Test Set and Validation Set:
set.seed(12345)
idx <- sample(seq(1,3), size = nrow(irvn_org), replace = TRUE, prob = c(60,20,20))
table(idx)
## idx
## 1 2 3
## 24903 8121 8164
# Define the response vectors separately for the train, test and validation sets
yirvn_tr <- yirvn[idx==1,]
yirvn_te <- yirvn[idx==2,]
yirvn_vl <- yirvn[idx==3,]
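Since the split is random rather than stratified, a quick sanity check (sketch) confirms that each set retains roughly the 11.3% “yes” rate of the full data:
# Sanity check: the "yes" rate (~11.3%) should be similar in all three sets
round(100 * rbind(train = table(yirvn_tr)/length(yirvn_tr),
                  test  = table(yirvn_te)/length(yirvn_te),
                  valid = table(yirvn_vl)/length(yirvn_vl)), 1)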
The following code converts the “month” variable from a three-character month name to a month integer, which hints to the ML algorithm that it should be used as an ordinal variable:
library(plyr)
idx_month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
length(idx_month) # check if it is same as data record count
## [1] 41188
The following code checks (i) the column names, (ii) the structure, (iii) whether any NAs are present, and (iv) the ratio of the binary response.
colnames(irvn) # column names
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "campaign" "pdays"
## [13] "previous" "poutcome" "emp.var.rate" "cons.price.idx"
## [17] "cons.conf.idx" "euribor3m" "nr.employed" "y"
str(irvn) # structure
## 'data.frame': 41188 obs. of 20 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
anyNA(irvn) # Are there any NAs
## [1] FALSE
table(irvn$y)/length(irvn$y)*100 # Frequency (in %) of response
##
## no yes
## 88.73458 11.26542
After the data is prepared, we choose to run the Random Forest (RF) algorithm because it is one of the best off-the-shelf ML algorithms (i.e., no tuning is needed!). We run Random Forest on the as-is data to get the performance metrics and the variable importance (jump to the code section).
After the run, the RF algorithm provides a list of important predictors that help in predicting the response. In our case, the important predictors found are:
It is surprising that no variables depend strongly on the customer’s own information. Apart from the campaign info (month), all important variables relate to socio-economic indicators.
The Confusion Matrix is a powerful representation from which a variety of performance metrics can be derived. In this section, we select the metrics that will help in the business case.
Looking at the output of the confusion matrix (from the caret package), we note the following (jump to the code section):
\[ \begin{array} {c|c|c} n=8121 & \text{Actual:}\; ``No" & \text{Actual:}\; ``Yes" \\[10pt] \hline\text{Predicted:}\; ``No" & 6978 & 673 \\[10pt] \hline\text{Predicted:}\;``Yes" & 186 & 284 \end{array} \]
Kappa and balanced accuracy are our key performance metrics because we are dealing with an unbalanced binary classification problem. Running the RF gives kappa = 0.35 and balanced accuracy = 63.5%.
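For reference, Cohen’s kappa compares the observed accuracy \(p_o\) with the accuracy \(p_e\) expected by chance, and balanced accuracy averages the per-class rates; the latter reproduces the 63.5% quoted above from the caret output:
\[ \kappa = {p_o - p_e \over 1 - p_e}, \qquad \small\text{Balanced Accuracy} = {\text{Sensitivity} + \text{Specificity} \over 2} = {0.9740 + 0.2968 \over 2} \approx 0.635 \]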
We find the “Neg Pred Value” to be related to the Subscription Conversion Rate probability, which is given by: \[\small\text{Subscription Conversion Rate probability}= {\text{Predicted to be "Yes" which are Actually "Yes"}\over{\text{Predicted to be "Yes"}}} \]
\[ = {284\over{284+186}} = 60.4\% \]
We also find Specificity to be related to the Potential Customer Capture probability, which is given by: \[ \small\text{Potential Customer Capture probability}= {\text{Predicted to be "Yes" which are Actually "Yes"}\over{\text{Total of Actual "Yes"}}} \]
\[= {284\over{284+673}} = 29.7\% \]
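Both metrics can also be read directly from caret’s confusionMatrix object (a sketch, assuming the RF predictions pred_rf_te computed in the code section below; since the positive class is “no”, the “yes” performance appears as Specificity and Neg Pred Value):
cm_rf <- confusionMatrix(pred_rf_te, yirvn_te)
cm_rf$byClass[c("Specificity", "Neg Pred Value")] # ~0.297 and ~0.604, as above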
Re-emphasizing the above findings, we will use the results of the Random Forest algorithm as a benchmark against which to compare the results of the other ML algorithms. The metrics that we shall track are:
This section provides the code that was run to get the results mentioned above. The output generated by the code has “##” at the beginning of the line.
Each execution of the Random Forest algorithm generates performance metrics with slightly different values, because the algorithm chooses a random subset of columns at each split. We set a seed so that we can reproduce the same results given in this report.
In the code section, the #** markers can be removed to generate 10 such confusion-matrix metrics. We will need these samples for the t-test comparing the performance of RF against RF with GLM-blended data.
ii=1
#**for (ii in 1:10){ # For reproducibility of results
set.seed(15678+100*ii)
irvn_rf <- irvn_org
suppressMessages(library(randomForest))
system.time(fit.rf <- randomForest(y~.,data=irvn_rf[idx==1,], importance=TRUE)) # no tuning
## user system elapsed
## 437.38 3.78 526.42
# order the important variable with MeanDecreaseAccuracy
fit.rf$importance[order(fit.rf$importance[,c("MeanDecreaseAccuracy"),drop=FALSE], decreasing=TRUE),]
## no yes MeanDecreaseAccuracy
## euribor3m 0.0536284541 0.0099282621 0.0487561295
## month 0.0556855391 -0.0163023306 0.0476684822
## emp.var.rate 0.0375027283 0.0014252131 0.0334871425
## nr.employed 0.0328411421 0.0343756696 0.0330152431
## cons.price.idx 0.0320845576 -0.0138443382 0.0269851416
## cons.conf.idx 0.0256174112 -0.0084752547 0.0218233727
## job 0.0078963833 -0.0028437353 0.0066989724
## day_of_week 0.0060532939 0.0014447611 0.0055403697
## contact 0.0044716828 0.0109108869 0.0051909338
## age 0.0057110196 0.0001541611 0.0050909315
## pdays 0.0015780862 0.0309723640 0.0048581232
## poutcome 0.0036173047 0.0104042941 0.0043749108
## education 0.0044062615 -0.0007721131 0.0038287474
## previous 0.0024845988 -0.0005689782 0.0021421292
## campaign 0.0010916766 0.0040852531 0.0014262540
## marital 0.0012962282 -0.0013996937 0.0009952635
## default 0.0004272557 0.0023974583 0.0006467961
## housing 0.0003367612 -0.0004123314 0.0002523711
## loan 0.0002354970 -0.0003352199 0.0001713429
## MeanDecreaseGini
## euribor3m 563.18214
## month 150.72579
## emp.var.rate 107.39701
## nr.employed 294.58695
## cons.price.idx 97.23562
## cons.conf.idx 124.14249
## job 367.94255
## day_of_week 250.64099
## contact 56.80678
## age 533.66915
## pdays 170.55446
## poutcome 144.07894
## education 285.67158
## previous 75.66117
## campaign 261.69188
## marital 130.92431
## default 47.21366
## housing 120.82400
## loan 87.60118
pred_rf_te<-predict(fit.rf, irvn_rf[idx==2,])
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
length(yirvn_te)
## [1] 8121
print(confusionMatrix(pred_rf_te, yirvn_te)) # generates rich array of metric
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6978 673
## yes 186 284
##
## Accuracy : 0.8942
## 95% CI : (0.8873, 0.9008)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.0003334
##
## Kappa : 0.3474
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9740
## Specificity : 0.2968
## Pos Pred Value : 0.9120
## Neg Pred Value : 0.6043
## Prevalence : 0.8822
## Detection Rate : 0.8593
## Detection Prevalence : 0.9421
## Balanced Accuracy : 0.6354
##
## 'Positive' Class : no
##
#pred_rf<-predict(fit.rf, irvn_te, type = "prob")
#**} #-- for (ii in 1:10){
It is always instructive to fit the data with the simplest model that gives quick insight into the data. The GLM (Generalized Linear Model) with a binomial family is the simplest model for binary classification.
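For the binomial family used below, the GLM is logistic regression: it models the log-odds of subscription as a linear function of the predictors,
\[ \log{P(y=\text{"yes"} \mid x) \over 1 - P(y=\text{"yes"} \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \]
so each coefficient in the summary below is the change in log-odds per unit change of that predictor.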
Since our purpose is only to gain insight into the data, we place additional conditions on the data before the GLM is run.
We do the following:
Key findings are:
The following code identifies the numerical predictors, prepares the data frame irvn_glm holding them, converts month to numeric, runs the GLM, makes predictions on the test data and calculates the confusion matrix.
#-- identify which variables are numeric
a <- (sapply(irvn_org, is.factor))
colnames(irvn)[!a]
## [1] "age" "campaign" "pdays" "previous"
## [5] "emp.var.rate" "cons.price.idx" "cons.conf.idx" "euribor3m"
## [9] "nr.employed"
#----
suppressMessages(library(leaps))
irvn_glm <-irvn_org
#irvn_glm <- irvn_glm[, c("age","month", "emp.var.rate", "cons.price.idx","cons.conf.idx", "euribor3m", "nr.employed", "y")]
irvn_glm <- irvn_org[, c("age", "month", "campaign", "pdays", "previous", "emp.var.rate",
"cons.price.idx","cons.conf.idx", "euribor3m", "nr.employed", "y")]
suppressMessages(library(plyr))
irvn_glm$month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
fit_glm <- glm(y~., data=irvn_glm[idx==1,], family="binomial")
summary(fit_glm)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = irvn_glm[idx ==
## 1, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8589 -0.3594 -0.3182 -0.2910 2.7425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 59.9155540 19.2727986 3.109 0.001878 **
## age 0.0014290 0.0018664 0.766 0.443889
## month -0.1248313 0.0100266 -12.450 < 2e-16 ***
## campaign -0.0419749 0.0118303 -3.548 0.000388 ***
## pdays -0.0018505 0.0001012 -18.290 < 2e-16 ***
## previous -0.3054655 0.0455858 -6.701 2.07e-11 ***
## emp.var.rate -0.6312929 0.0760333 -8.303 < 2e-16 ***
## cons.price.idx 0.2324117 0.1161670 2.001 0.045428 *
## cons.conf.idx -0.0053956 0.0071629 -0.753 0.451286
## euribor3m 0.6889780 0.1134361 6.074 1.25e-09 ***
## nr.employed -0.0163045 0.0019208 -8.488 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17424 on 24902 degrees of freedom
## Residual deviance: 14097 on 24892 degrees of freedom
## AIC: 14119
##
## Number of Fisher Scoring iterations: 5
#plot(fit_glm)
pred_glm_tr <- ifelse(predict(fit_glm, newdata=irvn_glm[idx==1,],
type="response") <0.5,"no","yes")
pred_glm_te <- ifelse(predict(fit_glm, newdata=irvn_glm[idx==2,],
type="response") <0.5,"no","yes")
pred_glm <- predict(fit_glm, newdata=irvn_glm, type="response")
suppressMessages(library(caret))
print(confusionMatrix(pred_glm_te, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7084 765
## yes 80 192
##
## Accuracy : 0.8959
## 95% CI : (0.8891, 0.9025)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 4.789e-05
##
## Kappa : 0.2746
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9888
## Specificity : 0.2006
## Pos Pred Value : 0.9025
## Neg Pred Value : 0.7059
## Prevalence : 0.8822
## Detection Rate : 0.8723
## Detection Prevalence : 0.9665
## Balanced Accuracy : 0.5947
##
## 'Positive' Class : no
##
Here we run the RF algorithm again, but with a few changes in the predictor columns. The changes are:
We do not worry about overfitting because:
To be ultimately sure, we could use cross-validation, and we still have the validation dataset, yirvn_vl, created in the data preparation section; a sketch of such a check follows.
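A minimal sketch of such a check, assuming the caret package (the fold count and the single mtry value are illustrative choices, not tuned):
# Sketch only: 5-fold cross-validation of the random forest, tracking Kappa
suppressMessages(library(caret))
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(y ~ ., data = irvn_rf[idx==1,], method = "rf",
                metric = "Kappa", trControl = ctrl,
                tuneGrid = data.frame(mtry = 4))
cv_fit$results # cross-validated Accuracy and Kappa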
Unlike GLM or GBM (used later), RF gives a different value of kappa on each run, because RF builds a set of decision trees on random subsets of the columns. Therefore, to compare the two models, (1) RF as-is and (2) RF with GLM-blended data, we run each model 10 times (see the t-test sub-section). The ten kappa values from each model give averages of 0.349 and 0.359, an increase of about 3%.
To check whether this increase is statistically significant, we perform a t-test to show with high confidence that the increase in kappa is real.
The code below shows the results produced by RF with the benchmark data blended with the added feature from GLM.
In the code section, the #** markers can be removed to generate 10 such confusion-matrix metrics. We will need these samples for the t-test against the performance of RF on the as-is data.
ii=1
#**for (ii in 1:10){
set.seed(15678+100*ii)
irvn_rf <- irvn_org
suppressMessages(library(plyr))
irvn_rf$month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
head(pred_glm)
## 1 2 3 4 5 6
## 0.06226579 0.06234928 0.06069923 0.06094410 0.06226579 0.06135429
irvn_rf <-data.frame(irvn_rf, "pred_glm" = pred_glm)
#colnames(irvn_rf)
# (iii)
suppressMessages(library(randomForest))
#trn_tr <- trn_tr[sample(1:nrow(trn_tr), size=3256, replace=FALSE),]
system.time(fit.rf <- randomForest(y~.,data=irvn_rf[idx==1,], importance=TRUE)) # no tuning
## user system elapsed
## 492.14 3.51 722.46
# order the important variable with MeanDecreaseAccuracy
fit.rf$importance[order(fit.rf$importance[,c("MeanDecreaseAccuracy"),drop=FALSE], decreasing=TRUE),]
## no yes MeanDecreaseAccuracy
## pred_glm 0.0859842064 2.138129e-02 0.0787557032
## euribor3m 0.0546902865 1.963450e-03 0.0487997088
## emp.var.rate 0.0377255834 -5.773245e-03 0.0328569774
## nr.employed 0.0328037294 1.346084e-02 0.0306433577
## cons.price.idx 0.0250531720 -9.723472e-03 0.0211754999
## cons.conf.idx 0.0215515628 -6.579895e-03 0.0184081472
## month 0.0189797247 -5.437232e-03 0.0162485038
## campaign 0.0102824125 -9.723903e-04 0.0090235760
## age 0.0086712535 -1.461326e-03 0.0075373979
## contact 0.0064149117 1.107347e-02 0.0069312688
## job 0.0072782959 -9.700742e-04 0.0063567981
## day_of_week 0.0059574042 1.866308e-03 0.0054980322
## education 0.0042642416 -2.163409e-04 0.0037607681
## poutcome 0.0034377774 5.427623e-03 0.0036608225
## pdays 0.0009185496 2.337004e-02 0.0034251283
## previous 0.0031791954 1.631461e-03 0.0030013249
## marital 0.0011638645 -9.826225e-04 0.0009225576
## default 0.0003517304 2.090441e-03 0.0005459634
## housing 0.0002730435 1.776021e-05 0.0002439789
## loan 0.0002288314 -1.687919e-04 0.0001839560
## MeanDecreaseGini
## pred_glm 847.60069
## euribor3m 483.48081
## emp.var.rate 75.84654
## nr.employed 226.82973
## cons.price.idx 97.56846
## cons.conf.idx 104.89536
## month 79.83420
## campaign 196.42228
## age 467.35438
## contact 51.89025
## job 348.01830
## day_of_week 230.67927
## education 266.23953
## poutcome 91.58307
## pdays 136.84899
## previous 58.36255
## marital 118.53717
## default 44.59103
## housing 110.46733
## loan 79.34210
pred_rf_te<-predict(fit.rf, irvn_rf[idx==2,])
suppressMessages(library(caret))
print(confusionMatrix(pred_rf_te, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6969 668
## yes 195 289
##
## Accuracy : 0.8937
## 95% CI : (0.8868, 0.9004)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.0005552
##
## Kappa : 0.3496
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9728
## Specificity : 0.3020
## Pos Pred Value : 0.9125
## Neg Pred Value : 0.5971
## Prevalence : 0.8822
## Detection Rate : 0.8581
## Detection Prevalence : 0.9404
## Balanced Accuracy : 0.6374
##
## 'Positive' Class : no
##
#pred_rf<-predict(fit.rf, irvn_te, type = "prob")
#**} #-- for (ii in 1:10){
With the GLM-blended data and month converted to numeric, we see a slight increase in kappa. We are interested in determining whether this is a true increase or just statistical noise.
For this we run the following code 10 times:
Note: In the above code, just uncomment (by removing #**) the opening and closing lines of the for loop.
We perform a t-test to check whether the difference in kappa between the two samples, as-is and GLM-blend, is statistically significant.
Here the null hypothesis H0 is that the means of the two samples are equal. The alternative hypothesis H1 is that the means are truly different.
From the code below, we see that the p-value is < 0.01, which means the null hypothesis can be rejected with 99% confidence. In plain English, we have > 99% confidence that the means of the two samples are different.
kappa_asis <- c(0.3517,0.3455, 0.3552, 0.349, 0.3486, 0.3548, 0.3404, 0.344, 0.3506, 0.3513)
kappa_glm_blend <- c(0.3496, 0.3598, 0.3572, 0.3553, 0.3597, 0.3643, 0.3504, 0.359, 0.3656, 0.3681)
t.test(kappa_asis,kappa_glm_blend)
##
## Welch Two Sample t-test
##
## data: kappa_asis and kappa_glm_blend
## t = -4.0221, df = 16.896, p-value = 0.0008935
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.014927772 -0.004652228
## sample estimates:
## mean of x mean of y
## 0.34911 0.35890
The Gradient Boosting Model (GBM) is one of the best ML algorithms. This model needs a handful of tuning parameters (also known as hyperparameters). We will tune and select the parameters that give the maximum kappa.
Normally the GBM algorithm makes the binary decision by calculating the response probability and using a cut-off of 0.5, i.e., if the response probability is < 0.5 the response is considered “no”, otherwise “yes”.
In our case, we will choose our own cut-off for the binary decision. We select the cut-off along with the tuning parameters and settle on the combination that gives the maximum kappa, running GBM stepwise over all possible combinations of the tuning parameters.
So far we have relied on the caret package to get kappa. For tuning, we define our own kappa function, as it is much faster:
kp <- function(pred, obs){
  cf <- table(pred, obs)                      # 2x2 confusion matrix
  if(sum(dim(cf)) < 4){return(0)}             # guard: a class level is missing
  a <- cf[1,1]; c <- cf[2,1]; b <- cf[1,2]; d <- cf[2,2]
  po <- (a+d)/sum(cf)                         # observed agreement
  pe <- ((a+b)*(a+c)+(c+d)*(b+d))/sum(cf)^2   # agreement expected by chance
  return(ifelse(pe==1, 0, (po-pe)/(1-pe)))}   # Cohen's kappa
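As a quick sanity check (using the RF test-set predictions from earlier), kp should reproduce the kappa reported by caret:
kp(pred_rf_te, yirvn_te) # ~0.35, matching caret's Kappa for the RF above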
In the following code we run the GBM for various values of the parameters, as follows:
We need to limit the values of the parameters, as many combinations are possible. Even with the above values, the GBM algorithm has to be run \(2\times3\times 5\times 3 = 90\) times, which could take a whole day!
We ran all 90 combinations and found the maximum kappa when: hp_NTREES = 501, hp_SHRINKAGE = 0.05, hp_DEPTH = 49, hp_MINOBS = 3.
We were able to get a Cohen’s kappa of 0.445. Compared to the RF benchmark of 0.35, this is quite good performance.
The code below is used both for tuning the parameters and for preparing this report with the final tuned model.
When tuning needs to be done, comment out the following single line of code so that all the chosen combinations set by the previous line are run: hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.05); hp_DEPTH <- c(49);hp_MINOBS <- c(3);
irvn_gbm <- irvn_org
irvn_gbm <- data.frame(model.matrix(y~.,data=irvn_org), "pred_glm"=pred_glm, "y"=irvn_org$y)
#irvn_glm <- data.frame(model.matrix(y~.,data=irvn_org), "y"=irvn_org$y)
irvn_gbm$y <- ifelse(irvn_gbm$y=="yes",1,0)
suppressMessages(library(gbm))
suppressMessages(library(caret))
my_DISTRIBUTION <- "adaboost" #adaboost
hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.001, 0.01, 0.05); hp_DEPTH <- c( 5, 10, 30, 40, 49);hp_MINOBS <- c(3);
# line below is commented out when we like to run with the tuning parameters combination provided above
hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.05); hp_DEPTH <- c(49);hp_MINOBS <- c(3);
for(ihp_NTREES in hp_NTREES){for (ihp_SHRINKAGE in hp_SHRINKAGE) { for (ihp_DEPTH in hp_DEPTH){for (ihp_MINOBS in hp_MINOBS){
cat("--ntrees: ", ihp_NTREES, ", shrinkage:", ihp_SHRINKAGE,
", interaction_depth:", ihp_DEPTH, ", minobs:", ihp_MINOBS, "\n")
gbm_fit <- gbm(y~., data=irvn_gbm[idx==1,], distribution=my_DISTRIBUTION,
n.trees=ihp_NTREES, shrinkage=ihp_SHRINKAGE,
interaction.depth = ihp_DEPTH, n.minobsinnode = ihp_MINOBS, verbose=FALSE,
cv.folds =0, n.cores=2)
#gbm_fit
#gbm.perf(gbm_fit)
#summary(gbm_fit, n.trees=1)
#summary(gbm_fit, n.trees=gbm.perf(gbm_fit, plot=FALSE))
#----- relative influence --------
#cat("@FYI - Relative influence -------")
#rel_inf <- relative.influence(gbm_fit, n.trees=gbm.perf(gbm_fit, plot=FALSE))
#print(data.frame("relative_influence"=sort(rel_inf[rel_inf>0],decreasing = TRUE)))
#cat("@FYI End - Relative influence -------")
#-------------
#pred_gbm <-predict(gbm_fit, newdata=x_feed_trn_te,
# n.trees=gbm.perf(gbm_fit, plot=FALSE))
pred_gbm <-predict.gbm(gbm_fit, newdata=irvn_gbm[idx==2,],
n.trees=gbm.perf(gbm_fit, plot=FALSE), type="response")
#pred <- ifelse(pred_gbm < 0.5, "no","yes")
#print(confusionMatrix(pred, yirvn_te))
#-------- search 4 max kappa
chk_max <- -1
for (ii in 1:100){
pred <- ifelse(pred_gbm < ii/100, "no","yes")
tmp <- kp(pred, yirvn_te)
if (chk_max < tmp) {ii_4max <- ii; chk_max <- tmp}
}
cat("--@ntrees: ", ihp_NTREES, ", shrinkage:", ihp_SHRINKAGE,
", interaction_depth:", ihp_DEPTH, ", minobs:", ihp_MINOBS,
"ii for max kappa = ", ii_4max, ", max kappa =", chk_max, "\n")
#--------
}}}} # for hyper parameters
## --ntrees: 501 , shrinkage: 0.05 , interaction_depth: 49 , minobs: 3
## Using OOB method...
## --@ntrees: 501 , shrinkage: 0.05 , interaction_depth: 49 , minobs: 3 ii for max kappa = 21 , max kappa = 0.4454907
pred <- ifelse(pred_gbm < ii_4max/100, "no","yes")
print(confusionMatrix(pred, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6614 435
## yes 550 522
##
## Accuracy : 0.8787
## 95% CI : (0.8714, 0.8857)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.8367214
##
## Kappa : 0.4455
## Mcnemar's Test P-Value : 0.0002809
##
## Sensitivity : 0.9232
## Specificity : 0.5455
## Pos Pred Value : 0.9383
## Neg Pred Value : 0.4869
## Prevalence : 0.8822
## Detection Rate : 0.8144
## Detection Prevalence : 0.8680
## Balanced Accuracy : 0.7343
##
## 'Positive' Class : no
##
#save(file="c:\\scratch\\irvene\\irvine_bus_case.rdo", pred_gbm, pred, ii_4max, irvn_gbm, idx, gbm_fit)
In the code below, the metrics kappa, Subscription Conversion Rate (Neg Pred Value) and Potential Customer Capture (Specificity) are calculated for various response cut-off values. We focus only on these metrics because they were identified earlier as the ones useful for business needs.
The following code calculates kappa, Neg Pred Value and Specificity for various response probability cut-offs:
# defining the limits
ii_min <- as.integer(min(pred_gbm*100))+1
ii_max <- as.integer(max(pred_gbm*100))-1
m_data <-matrix(, nrow = 3, ncol = ii_max-ii_min) # empty metrics
for (ii in ii_min:(ii_max-1)){
cf<-table(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_te)
kappa <- kp(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_te)
spec <- cf[2,2]/(cf[1,2]+cf[2,2])
npred <- cf[2,2]/(cf[2,1]+cf[2,2])
m_data[1,ii-ii_min+1] <- kappa
m_data[2,ii-ii_min+1] <- spec
m_data[3,ii-ii_min+1] <- npred
}
The following code prepares the ggplot graph:
# These row names become the 'metric' labels after reshape-melting; they
# are (kappa, specificity, negPredValue) but given appropriate business names
row.names(m_data) <- c("kappa", "\"Potential-Customer\" capture",
"Subscription conversion rate")
# The column names become the x-axis values (the cut-off probabilities)
colnames(m_data) <- (ii_min:(ii_max-1)-ii_min+1)/100
suppressMessages(library(reshape))
df_melted <- melt(m_data)
colnames(df_melted) <- c("metric","predicted_probability", "value")
# making good looking plot
suppressMessages(library(ggplot2))
show_plot <- ggplot(df_melted, aes(x = predicted_probability, y = value)) +
geom_line(aes(color = metric, group = metric), size=0.7) +
xlab("ML prediction response cut-of for \"Yes\"") + ylab("metric values") +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
geom_vline(xintercept=0.08, linetype="dashed", size=0.5, colour="green") +
geom_vline(xintercept=0.238, linetype="dashed", size=0.5, colour="green") +
geom_vline(xintercept=0.455, linetype="dashed", size=0.55, colour="blue") +
theme(legend.position = c(0.81, 0.5)) +
#legendBackground: c(fill, lineSize, lineType, lineColor)
annotate("rect", xmin=0.08, xmax=0.238, ymin=0., ymax=1.0, alpha=.1, fill="green") +
ggtitle("Plot for optimization of business metrics") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=0.01, linetype="dashed", size=0.7, colour="red") +
annotate("text", label="no model", x=0.025, y=0.5, colour="blue", alpha=.5) +
annotate("text", label="cut-1", x=0.104, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-2", x=0.262, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-3", x=0.48, y=0.95, colour="blue", alpha=.5)
The following graph shows the values predicted by the tuned GBM algorithm. The region of higher kappa is usually the most optimal, but depending on the business case various scenarios can be selected. Let us understand the graph below.
show_plot
The x-axis shows the cut-off on the prediction response of the GBM algorithm, beyond which a customer is identified as “yes”, i.e., a customer who subscribed.
The red line shows the chance of capturing the customers that could potentially subscribe. From a business perspective, a higher number with reasonable effort is desired. (In the confusion matrix we identified this metric as Specificity.)
The blue line is the Subscription conversion rate. From a business perspective, a higher conversion rate without losing potential customers is desired. (In the confusion matrix we identified this metric as “Neg Pred Value”.)
The green line is the Cohen’s kappa coefficient curve. Higher kappa signifies better identification of the subscribing customers.
With this understanding, we can construct a few business-case scenarios. For example:
If marketing starts with a small budget, it can go with cut-3. This gives a conversion rate of about 65% but captures only 25% of the potential customers.
Below cut-1, the conversion rate drops sharply and the campaign may not be cost-effective.
This concludes the analysis and the development of the business case for this data.
It is quite possible that the model becomes slightly biased by the repeated use of the test data over many tuning iterations. So, to be extra sure, we run the model on the validation data, which was never used for training or testing.
The following code creates the prediction response on the Validation Set. The rest of the code is nearly the same as used earlier.
# create prediction vector response for the validation set based
# on the same gbm model fit
pred_gbm <-predict.gbm(gbm_fit, newdata=irvn_gbm[idx==3,],
n.trees=gbm.perf(gbm_fit, plot=FALSE), type="response")
## Using OOB method...
ii_min <- as.integer(min(pred_gbm*100))+1
ii_max <- as.integer(max(pred_gbm*100))-1
m_data <-matrix(, nrow = 3, ncol = ii_max-ii_min)
for (ii in ii_min:(ii_max-1)){
cf<-table(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_vl)
kappa <- kp(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_vl)
spec <- cf[2,2]/(cf[1,2]+cf[2,2])
npred <- cf[2,2]/(cf[2,1]+cf[2,2])
m_data[1,ii-ii_min+1] <- kappa
m_data[2,ii-ii_min+1] <- spec
m_data[3,ii-ii_min+1] <- npred
}
#m_data
row.names(m_data) <- c("kappa", "\"Potential-Customer\" capture", "Subscription conversion rate")
#c("kappa", "specificity", "negPredValue")
colnames(m_data) <- (ii_min:(ii_max-1)-ii_min+1)/100
library(reshape)
df_melted <- melt(m_data)
colnames(df_melted) <- c("metric","predicted_probability", "value")
library(ggplot2)
obj_plot <- ggplot(df_melted, aes(x = predicted_probability, y = value)) +
geom_line(aes(color = metric, group = metric)) +
xlab("ML prediction response cut-of for \"Yes\"") + ylab("metric values") +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
geom_vline(xintercept=0.08, linetype="dashed", size=0.7, colour="green") +
geom_vline(xintercept=0.238, linetype="dashed", size=0.7, colour="green") +
geom_vline(xintercept=0.455, linetype="dashed", size=0.7, colour="blue") +
theme(legend.position = c(0.81, 0.5)) +
#legendBackground: c(fill, lineSize, lineType, lineColor)
annotate("rect", xmin=0.08, xmax=0.238, ymin=0., ymax=1.0, alpha=.1, fill="green") +
ggtitle("Plot for optimization of business metrics") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=0.01, linetype="dashed", size=0.7, colour="red") +
annotate("text", label="no model", x=0.025, y=0.5, colour="blue", alpha=.5) +
annotate("text", label="cut-1", x=0.104, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-2", x=0.262, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-3", x=0.48, y=0.95, colour="blue", alpha=.5)
The following plot was generated from the Validation Set. When compared to the plot created from the Test Set (jump to this section), it shows slight degradation, revealing the bias created by the many iterations used for tuning.
obj_plot
A simple eyeball comparison of the two plots (generated from the Test Set vs the Validation Set) shows:
The plot from the Validation Set is the truer representation of predictive power. The business can use this information to optimize its effort versus subscribing customers.
The UC Irvine bank marketing dataset is available from here.
I follow Data Science Central on Twitter, which has been an amazing source of knowledge in both breadth and depth. So much so that I owe much of my breadth of data science knowledge to this site. I am ever grateful to the people behind it.
One of the articles on Data Science Central led me to Jitesh Shah’s blog, where I found the UC Irvine bank marketing dataset. I realized this was the right dataset to showcase my data science skills, as this report shows. I would also like to thank Jitesh Shah for the article.
The data is originally available from UC Irvine Machine Learning Repository. The Bank Marketing Data has ID 45175.
The predictor (i.e., feature) variables are divided into customer data, campaign-related attributes and social/economic indicators.
Attributes related to the last touchpoint for the current campaign:
There are several missing values in some categorical attributes, all coded with the “unknown” label. These missing values can be treated as a possible class label, or deletion or imputation techniques can be applied to them.
12.4 Social and economic context attributes: