On my path to becoming a data scientist, I would like to showcase this self-executed project. It uses the unbalanced binary-classification bank marketing data from UC Irvine. I chose this dataset because it is realistic and simple, yet it tests many concepts and ML (Machine Learning) algorithms.
The business-case result is presented in a single plot from which various business advantage scenarios can be exploited (jump to this section).
To get the best ML prediction, we go through the following steps:
The R code that reproduces the results presented here is also included.
The Bank Marketing Dataset has about 41k observations with 19 predictor variables related to the following (more details are given in Appendix II - Data Dictionary):
The response (also known as target or label) variable is the column named “y”. The response variable has two levels, “yes” and “no”.
The response “yes” makes up only 11.3% of the data, meaning that with no model, the random chance of picking a “yes” is only 11.3%. This is an unbalanced binary classification problem, so we will track kappa and balanced accuracy as the best indicators of how well we isolate the “yes” customers.
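To see why plain accuracy is a poor guide here, consider a minimal sketch (assuming the irvn data frame loaded in the next section): a trivial classifier that always answers “no” looks accurate but is useless for finding subscribers.
always_no <- factor(rep("no", nrow(irvn)), levels = c("no", "yes"))
mean(always_no == irvn$y) # accuracy ~0.887, just the majority-class rate
suppressMessages(library(caret))
confusionMatrix(always_no, irvn$y) # Kappa : 0, Balanced Accuracy : 0.5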
The data at hand does not have any NAs (missing values). The original data did have NAs, but they were replaced by “unknown”.
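As a quick check (a sketch; it assumes the factor columns use the literal level “unknown”), we can count how many of these placeholders each predictor carries:
# Count the "unknown" placeholders (formerly NAs) in each column
sapply(irvn, function(x) sum(x == "unknown"))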
Based on the description of the data (see Data Dictionary), we remove the duration variable: the call duration is known only after the outcome of the call, so keeping it would bias the prediction.
We will split the data into three sets, (1) Training Set, (2) Test Set and (3) Validation Set, in the ratio 60:20:20. We shall train the ML algorithms with the Training Set and test them with the Test Set. Once we have the final model, we will additionally validate it with the Validation Set.
The following code reads the Bank Marketing data from the file and removes the “duration” variable:
rm(list=ls()) # remove all objects
irvn_org <- read.csv(file="C:\\ds_local\\dataset\\irvine-bank-marketing.txt")
irvn_org$duration <- NULL
#
irvn <-irvn_org
yirvn <- irvn_org[,"y",drop=FALSE]
dim(irvn)
## [1] 41188 20
The following code splits the data into the three sets: Training Set, Test Set and Validation Set:
set.seed(12345)
idx <- sample(seq(1,3), size = nrow(irvn_org), replace = TRUE, prob = c(60,20,20))
table(idx)
## idx
## 1 2 3
## 24903 8121 8164
# Define the response vectors separately for the train, test and validation sets
yirvn_tr <- yirvn[idx==1,]
yirvn_te <- yirvn[idx==2,]
yirvn_vl <- yirvn[idx==3,]
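Since the split is random rather than stratified, a quick sanity check (sketch) confirms that each set retains roughly the 11.3% “yes” rate of the full data:
# Sanity check: the "yes" rate (~11.3%) should be similar in all three sets
round(100 * rbind(train = table(yirvn_tr)/length(yirvn_tr),
                  test  = table(yirvn_te)/length(yirvn_te),
                  valid = table(yirvn_vl)/length(yirvn_vl)), 1)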
The following code converts the “month” variable from a three-character month name to a month integer, which hints to the ML algorithm that it should be used as an ordinal variable:
library(plyr)
idx_month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
length(idx_month) # check if it is same as data record count
## [1] 41188
The following code checks (i) the column names, (ii) the structure, (iii) whether any NAs are present, and (iv) the ratio of the binary response.
colnames(irvn) # column names
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "campaign" "pdays"
## [13] "previous" "poutcome" "emp.var.rate" "cons.price.idx"
## [17] "cons.conf.idx" "euribor3m" "nr.employed" "y"
str(irvn) # structure
## 'data.frame': 41188 obs. of 20 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
anyNA(irvn) # Are there any NAs
## [1] FALSE
table(irvn$y)/length(irvn$y)*100 # Frequency (in %) of response
##
## no yes
## 88.73458 11.26542
After the data is prepared, we choose to run the Random Forest (RF) algorithm because it is one of the best off-the-shelf ML algorithms (i.e., no tuning is needed!). We run Random Forest on the as-is data to get the performance metrics and the variable importance (jump to the code section).
After the run, the RF algorithm provides a list of important predictors that help in predicting the response. In our case, the important predictors found are:
It is surprising that no variables depend strongly on the customer’s own information. Apart from the campaign info (month), all important variables relate to socio-economic indicators.
The Confusion Matrix is a powerful representation from which a variety of performance metrics can be derived. In this section, we select the metrics that will help in the business case.
Looking at the output of the confusion matrix (from the caret package), we note the following (jump to the code section):
\[ \begin{array} {c|c|c} n=8121 & \text{Actual:}\; ``No" & \text{Actual:}\; ``Yes" \\[10pt] \hline\text{Predicted:}\; ``No" & 6978 & 673 \\[10pt] \hline\text{Predicted:}\;``Yes" & 186 & 284 \end{array} \]
Kappa and balanced accuracy are our key performance metrics because we are dealing with an unbalanced binary classification problem. Running the RF gives kappa = 0.35 and balanced accuracy = 63.5%.
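For reference, Cohen’s kappa compares the observed accuracy \(p_o\) with the accuracy \(p_e\) expected by chance, and balanced accuracy averages the per-class rates; the latter reproduces the 63.5% quoted above from the caret output:
\[ \kappa = {p_o - p_e \over 1 - p_e}, \qquad \small\text{Balanced Accuracy} = {\text{Sensitivity} + \text{Specificity} \over 2} = {0.9740 + 0.2968 \over 2} \approx 0.635 \]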
We find the “Neg Pred Value” to be related to the Subscription Conversion Rate probability, which is given by: \[\small\text{Subscription Conversion Rate probability}= {\text{Predicted to be "Yes" which are Actually "Yes"}\over{\text{Predicted to be "Yes"}}} \]
\[ = {284\over{284+186}} = 60.4\% \]
We also find Specificity to be related to the Potential Customer Capture probability, which is given by: \[ \small\text{Potential Customer Capture probability}= {\text{Predicted to be "Yes" which are Actually "Yes"}\over{\text{Total of Actual "Yes"}}} \]
\[= {284\over{284+673}} = 29.7\% \]
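Both metrics can also be read directly from caret’s confusionMatrix object (a sketch, assuming the RF predictions pred_rf_te computed in the code section below; since the positive class is “no”, the “yes” performance appears as Specificity and Neg Pred Value):
cm_rf <- confusionMatrix(pred_rf_te, yirvn_te)
cm_rf$byClass[c("Specificity", "Neg Pred Value")] # ~0.297 and ~0.604, as above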
Re-emphasizing the above findings, we will use the results of the Random Forest algorithm as a benchmark against which to compare the results of the other ML algorithms. The metrics that we shall track are:
This section provides the code that was run to get the results mentioned above. The output generated by the code has “##” at the beginning of the line.
Each execution of the Random Forest algorithm generates performance metrics with slightly different values, because the algorithm chooses a random subset of columns at each split. We set a seed so that we can reproduce the same results given in this report.
In the code section, the #** markers can be removed to generate 10 such confusion-matrix metrics. We will need these samples for the t-test comparing the performance of RF against RF with GLM-blended data.
ii=1
#**for (ii in 1:10){ # For reproducibility of results
set.seed(15678+100*ii)
irvn_rf <- irvn_org
suppressMessages(library(randomForest))
system.time(fit.rf <- randomForest(y~.,data=irvn_rf[idx==1,], importance=TRUE)) # no tuning
## user system elapsed
## 437.38 3.78 526.42
# order the important variable with MeanDecreaseAccuracy
fit.rf$importance[order(fit.rf$importance[,c("MeanDecreaseAccuracy"),drop=FALSE], decreasing=TRUE),]
## no yes MeanDecreaseAccuracy
## euribor3m 0.0536284541 0.0099282621 0.0487561295
## month 0.0556855391 -0.0163023306 0.0476684822
## emp.var.rate 0.0375027283 0.0014252131 0.0334871425
## nr.employed 0.0328411421 0.0343756696 0.0330152431
## cons.price.idx 0.0320845576 -0.0138443382 0.0269851416
## cons.conf.idx 0.0256174112 -0.0084752547 0.0218233727
## job 0.0078963833 -0.0028437353 0.0066989724
## day_of_week 0.0060532939 0.0014447611 0.0055403697
## contact 0.0044716828 0.0109108869 0.0051909338
## age 0.0057110196 0.0001541611 0.0050909315
## pdays 0.0015780862 0.0309723640 0.0048581232
## poutcome 0.0036173047 0.0104042941 0.0043749108
## education 0.0044062615 -0.0007721131 0.0038287474
## previous 0.0024845988 -0.0005689782 0.0021421292
## campaign 0.0010916766 0.0040852531 0.0014262540
## marital 0.0012962282 -0.0013996937 0.0009952635
## default 0.0004272557 0.0023974583 0.0006467961
## housing 0.0003367612 -0.0004123314 0.0002523711
## loan 0.0002354970 -0.0003352199 0.0001713429
## MeanDecreaseGini
## euribor3m 563.18214
## month 150.72579
## emp.var.rate 107.39701
## nr.employed 294.58695
## cons.price.idx 97.23562
## cons.conf.idx 124.14249
## job 367.94255
## day_of_week 250.64099
## contact 56.80678
## age 533.66915
## pdays 170.55446
## poutcome 144.07894
## education 285.67158
## previous 75.66117
## campaign 261.69188
## marital 130.92431
## default 47.21366
## housing 120.82400
## loan 87.60118
pred_rf_te<-predict(fit.rf, irvn_rf[idx==2,])
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
length(yirvn_te)
## [1] 8121
print(confusionMatrix(pred_rf_te, yirvn_te)) # generates rich array of metric
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6978 673
## yes 186 284
##
## Accuracy : 0.8942
## 95% CI : (0.8873, 0.9008)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.0003334
##
## Kappa : 0.3474
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9740
## Specificity : 0.2968
## Pos Pred Value : 0.9120
## Neg Pred Value : 0.6043
## Prevalence : 0.8822
## Detection Rate : 0.8593
## Detection Prevalence : 0.9421
## Balanced Accuracy : 0.6354
##
## 'Positive' Class : no
##
#pred_rf<-predict(fit.rf, irvn_te, type = "prob")
#**} #-- for (ii in 1:10){
It is always instructive to fit the data with the simplest model that gives quick insight into the data. The GLM (Generalized Linear Model) with a binomial family is the simplest model for binary classification.
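For the binomial family used below, the GLM is logistic regression: it models the log-odds of subscription as a linear function of the predictors,
\[ \log{P(y=\text{"yes"} \mid x) \over 1 - P(y=\text{"yes"} \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \]
so each coefficient in the summary below is the change in log-odds per unit change of that predictor.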
Since our purpose is only to gain insight into the data, we place additional conditions on the data before the GLM is run.
We do the following:
Key findings are:
The following code identifies the numerical predictors, prepares the data frame irvn_glm holding them, converts month to numeric, runs the GLM, makes predictions on the test data and calculates the confusion matrix.
#-- identify which variables are numeric
a <- (sapply(irvn_org, is.factor))
colnames(irvn)[!a]
## [1] "age" "campaign" "pdays" "previous"
## [5] "emp.var.rate" "cons.price.idx" "cons.conf.idx" "euribor3m"
## [9] "nr.employed"
#----
suppressMessages(library(leaps))
irvn_glm <-irvn_org
#irvn_glm <- irvn_glm[, c("age","month", "emp.var.rate", "cons.price.idx","cons.conf.idx", "euribor3m", "nr.employed", "y")]
irvn_glm <- irvn_org[, c("age", "month", "campaign", "pdays", "previous", "emp.var.rate",
"cons.price.idx","cons.conf.idx", "euribor3m", "nr.employed", "y")]
suppressMessages(library(plyr))
irvn_glm$month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
fit_glm <- glm(y~., data=irvn_glm[idx==1,], family="binomial")
summary(fit_glm)
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = irvn_glm[idx ==
## 1, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8589 -0.3594 -0.3182 -0.2910 2.7425
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 59.9155540 19.2727986 3.109 0.001878 **
## age 0.0014290 0.0018664 0.766 0.443889
## month -0.1248313 0.0100266 -12.450 < 2e-16 ***
## campaign -0.0419749 0.0118303 -3.548 0.000388 ***
## pdays -0.0018505 0.0001012 -18.290 < 2e-16 ***
## previous -0.3054655 0.0455858 -6.701 2.07e-11 ***
## emp.var.rate -0.6312929 0.0760333 -8.303 < 2e-16 ***
## cons.price.idx 0.2324117 0.1161670 2.001 0.045428 *
## cons.conf.idx -0.0053956 0.0071629 -0.753 0.451286
## euribor3m 0.6889780 0.1134361 6.074 1.25e-09 ***
## nr.employed -0.0163045 0.0019208 -8.488 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17424 on 24902 degrees of freedom
## Residual deviance: 14097 on 24892 degrees of freedom
## AIC: 14119
##
## Number of Fisher Scoring iterations: 5
#plot(fit_glm)
pred_glm_tr <- ifelse(predict(fit_glm, newdata=irvn_glm[idx==1,],
type="response") <0.5,"no","yes")
pred_glm_te <- ifelse(predict(fit_glm, newdata=irvn_glm[idx==2,],
type="response") <0.5,"no","yes")
pred_glm <- predict(fit_glm, newdata=irvn_glm, type="response")
suppressMessages(library(caret))
print(confusionMatrix(pred_glm_te, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7084 765
## yes 80 192
##
## Accuracy : 0.8959
## 95% CI : (0.8891, 0.9025)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 4.789e-05
##
## Kappa : 0.2746
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9888
## Specificity : 0.2006
## Pos Pred Value : 0.9025
## Neg Pred Value : 0.7059
## Prevalence : 0.8822
## Detection Rate : 0.8723
## Detection Prevalence : 0.9665
## Balanced Accuracy : 0.5947
##
## 'Positive' Class : no
##
Here we run the RF algorithm again, but with a few changes in the predictor columns. The changes are:
We do not worry about overfitting because:
To be ultimately sure, we could use cross-validation, and we still have the validation dataset, yirvn_vl, created in the data preparation section; a sketch of such a check follows.
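A minimal sketch of such a check, assuming the caret package (the fold count and the single mtry value are illustrative choices, not tuned):
# Sketch only: 5-fold cross-validation of the random forest, tracking Kappa
suppressMessages(library(caret))
ctrl <- trainControl(method = "cv", number = 5)
cv_fit <- train(y ~ ., data = irvn_rf[idx==1,], method = "rf",
                metric = "Kappa", trControl = ctrl,
                tuneGrid = data.frame(mtry = 4))
cv_fit$results # cross-validated Accuracy and Kappa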
Unlike GLM or GBM (used later), RF gives a different value of kappa on each run, because RF builds a set of decision trees on random subsets of the columns. Therefore, to compare the two models, (1) RF as-is and (2) RF with GLM-blended data, we run each model 10 times (see the t-test sub-section). The ten kappa values from each model give averages of 0.349 and 0.359, an increase of about 3%.
To check whether this increase is statistically significant, we perform a t-test to show with high confidence that the increase in kappa is real.
The code below shows the results produced by RF with the benchmark data blended with the added feature from GLM.
In the code section, the #** markers can be removed to generate 10 such confusion-matrix metrics. We will need these samples for the t-test against the performance of RF on the as-is data.
ii=1
#**for (ii in 1:10){
set.seed(15678+100*ii)
irvn_rf <- irvn_org
suppressMessages(library(plyr))
irvn_rf$month <- as.numeric(mapvalues(irvn_org$month,
from=c("jan","feb","mar","apr","may",
"jun","jul","aug","sep","oct","nov","dec"),
to=c(1:12)))
## The following `from` values were not present in `x`: jan, feb
head(pred_glm)
## 1 2 3 4 5 6
## 0.06226579 0.06234928 0.06069923 0.06094410 0.06226579 0.06135429
irvn_rf <-data.frame(irvn_rf, "pred_glm" = pred_glm)
#colnames(irvn_rf)
# (iii)
suppressMessages(library(randomForest))
#trn_tr <- trn_tr[sample(1:nrow(trn_tr), size=3256, replace=FALSE),]
system.time(fit.rf <- randomForest(y~.,data=irvn_rf[idx==1,], importance=TRUE)) # no tuning
## user system elapsed
## 492.14 3.51 722.46
# order the important variable with MeanDecreaseAccuracy
fit.rf$importance[order(fit.rf$importance[,c("MeanDecreaseAccuracy"),drop=FALSE], decreasing=TRUE),]
## no yes MeanDecreaseAccuracy
## pred_glm 0.0859842064 2.138129e-02 0.0787557032
## euribor3m 0.0546902865 1.963450e-03 0.0487997088
## emp.var.rate 0.0377255834 -5.773245e-03 0.0328569774
## nr.employed 0.0328037294 1.346084e-02 0.0306433577
## cons.price.idx 0.0250531720 -9.723472e-03 0.0211754999
## cons.conf.idx 0.0215515628 -6.579895e-03 0.0184081472
## month 0.0189797247 -5.437232e-03 0.0162485038
## campaign 0.0102824125 -9.723903e-04 0.0090235760
## age 0.0086712535 -1.461326e-03 0.0075373979
## contact 0.0064149117 1.107347e-02 0.0069312688
## job 0.0072782959 -9.700742e-04 0.0063567981
## day_of_week 0.0059574042 1.866308e-03 0.0054980322
## education 0.0042642416 -2.163409e-04 0.0037607681
## poutcome 0.0034377774 5.427623e-03 0.0036608225
## pdays 0.0009185496 2.337004e-02 0.0034251283
## previous 0.0031791954 1.631461e-03 0.0030013249
## marital 0.0011638645 -9.826225e-04 0.0009225576
## default 0.0003517304 2.090441e-03 0.0005459634
## housing 0.0002730435 1.776021e-05 0.0002439789
## loan 0.0002288314 -1.687919e-04 0.0001839560
## MeanDecreaseGini
## pred_glm 847.60069
## euribor3m 483.48081
## emp.var.rate 75.84654
## nr.employed 226.82973
## cons.price.idx 97.56846
## cons.conf.idx 104.89536
## month 79.83420
## campaign 196.42228
## age 467.35438
## contact 51.89025
## job 348.01830
## day_of_week 230.67927
## education 266.23953
## poutcome 91.58307
## pdays 136.84899
## previous 58.36255
## marital 118.53717
## default 44.59103
## housing 110.46733
## loan 79.34210
pred_rf_te<-predict(fit.rf, irvn_rf[idx==2,])
suppressMessages(library(caret))
print(confusionMatrix(pred_rf_te, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6969 668
## yes 195 289
##
## Accuracy : 0.8937
## 95% CI : (0.8868, 0.9004)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.0005552
##
## Kappa : 0.3496
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9728
## Specificity : 0.3020
## Pos Pred Value : 0.9125
## Neg Pred Value : 0.5971
## Prevalence : 0.8822
## Detection Rate : 0.8581
## Detection Prevalence : 0.9404
## Balanced Accuracy : 0.6374
##
## 'Positive' Class : no
##
#pred_rf<-predict(fit.rf, irvn_te, type = "prob")
#**} #-- for (ii in 1:10){
With the GLM-blended data and month converted to numeric, we see a slight increase in kappa. We are interested in determining whether this is a true increase or just statistical noise.
For this we run the following code 10 times:
Note: In the above code, just uncomment (by removing #**) the opening and closing lines of the for loop.
We perform a t-test to check whether the difference in kappa between the two samples, as-is and GLM-blend, is statistically significant.
Here the null hypothesis H0 is that the means of the two samples are equal. The alternative hypothesis H1 is that the means are truly different.
From the code below, we see that the p-value is < 0.01, which means the null hypothesis can be rejected with 99% confidence. In plain English, we have > 99% confidence that the means of the two samples are different.
kappa_asis <- c(0.3517,0.3455, 0.3552, 0.349, 0.3486, 0.3548, 0.3404, 0.344, 0.3506, 0.3513)
kappa_glm_blend <- c(0.3496, 0.3598, 0.3572, 0.3553, 0.3597, 0.3643, 0.3504, 0.359, 0.3656, 0.3681)
t.test(kappa_asis,kappa_glm_blend)
##
## Welch Two Sample t-test
##
## data: kappa_asis and kappa_glm_blend
## t = -4.0221, df = 16.896, p-value = 0.0008935
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.014927772 -0.004652228
## sample estimates:
## mean of x mean of y
## 0.34911 0.35890
The Gradient Boosting Model (GBM) is one of the best ML algorithms. This model needs a handful of tuning parameters (also known as hyperparameters). We will tune and select the parameters that give the maximum kappa.
Normally the GBM algorithm makes the binary decision by calculating the response probability and using a cut-off of 0.5, i.e., if the response probability is < 0.5 the response is considered “no”, otherwise “yes”.
In our case, we will choose our own cut-off for the binary decision. We select the cut-off along with the tuning parameters and settle on the combination that gives the maximum kappa, running GBM stepwise over all possible combinations of the tuning parameters.
So far we have relied on the caret package to get kappa. For tuning, we define our own kappa function, as it is much faster:
kp <- function(pred, obs){
  cf <- table(pred, obs)                      # 2x2 confusion matrix
  if(sum(dim(cf)) < 4){return(0)}             # guard: a class level is missing
  a <- cf[1,1]; c <- cf[2,1]; b <- cf[1,2]; d <- cf[2,2]
  po <- (a+d)/sum(cf)                         # observed agreement
  pe <- ((a+b)*(a+c)+(c+d)*(b+d))/sum(cf)^2   # agreement expected by chance
  return(ifelse(pe==1, 0, (po-pe)/(1-pe)))}   # Cohen's kappa
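As a quick sanity check (using the RF test-set predictions from earlier), kp should reproduce the kappa reported by caret:
kp(pred_rf_te, yirvn_te) # ~0.35, matching caret's Kappa for the RF above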
In the following code we run the GBM for various values of the parameters, as follows:
We need to limit the values of the parameters, as many combinations are possible. Even with the above values, the GBM algorithm has to be run \(2\times3\times 5\times 3 = 90\) times, which could take a whole day!
We ran all 90 combinations and found the maximum kappa when: hp_NTREES = 501, hp_SHRINKAGE = 0.05, hp_DEPTH = 49, hp_MINOBS = 3.
We were able to get a Cohen’s kappa of 0.445. Compared to the RF benchmark of 0.35, this is quite good performance.
The code below is used both for tuning the parameters and for preparing this report with the final tuned model.
When tuning needs to be done, comment out the following single line of code so that all the chosen combinations set by the previous line are run: hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.05); hp_DEPTH <- c(49);hp_MINOBS <- c(3);
irvn_gbm <- irvn_org
irvn_gbm <- data.frame(model.matrix(y~.,data=irvn_org), "pred_glm"=pred_glm, "y"=irvn_org$y)
#irvn_glm <- data.frame(model.matrix(y~.,data=irvn_org), "y"=irvn_org$y)
irvn_gbm$y <- ifelse(irvn_gbm$y=="yes",1,0)
suppressMessages(library(gbm))
suppressMessages(library(caret))
my_DISTRIBUTION <- "adaboost" #adaboost
hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.001, 0.01, 0.05); hp_DEPTH <- c( 5, 10, 30, 40, 49);hp_MINOBS <- c(3);
# line below is commented out when we like to run with the tuning parameters combination provided above
hp_NTREES <- c(501); hp_SHRINKAGE <- c( 0.05); hp_DEPTH <- c(49);hp_MINOBS <- c(3);
for(ihp_NTREES in hp_NTREES){for (ihp_SHRINKAGE in hp_SHRINKAGE) { for (ihp_DEPTH in hp_DEPTH){for (ihp_MINOBS in hp_MINOBS){
cat("--ntrees: ", ihp_NTREES, ", shrinkage:", ihp_SHRINKAGE,
", interaction_depth:", ihp_DEPTH, ", minobs:", ihp_MINOBS, "\n")
gbm_fit <- gbm(y~., data=irvn_gbm[idx==1,], distribution=my_DISTRIBUTION,
n.trees=ihp_NTREES, shrinkage=ihp_SHRINKAGE,
interaction.depth = ihp_DEPTH, n.minobsinnode = ihp_MINOBS, verbose=FALSE,
cv.folds =0, n.cores=2)
#gbm_fit
#gbm.perf(gbm_fit)
#summary(gbm_fit, n.trees=1)
#summary(gbm_fit, n.trees=gbm.perf(gbm_fit, plot=FALSE))
#----- relative influence --------
#cat("@FYI - Relative influence -------")
#rel_inf <- relative.influence(gbm_fit, n.trees=gbm.perf(gbm_fit, plot=FALSE))
#print(data.frame("relative_influence"=sort(rel_inf[rel_inf>0],decreasing = TRUE)))
#cat("@FYI End - Relative influence -------")
#-------------
#pred_gbm <-predict(gbm_fit, newdata=x_feed_trn_te,
# n.trees=gbm.perf(gbm_fit, plot=FALSE))
pred_gbm <-predict.gbm(gbm_fit, newdata=irvn_gbm[idx==2,],
n.trees=gbm.perf(gbm_fit, plot=FALSE), type="response")
#pred <- ifelse(pred_gbm < 0.5, "no","yes")
#print(confusionMatrix(pred, yirvn_te))
#-------- search 4 max kappa
chk_max <- -1
for (ii in 1:100){
pred <- ifelse(pred_gbm < ii/100, "no","yes")
tmp <- kp(pred, yirvn_te)
if (chk_max < tmp) {ii_4max <- ii; chk_max <- tmp}
}
cat("--@ntrees: ", ihp_NTREES, ", shrinkage:", ihp_SHRINKAGE,
", interaction_depth:", ihp_DEPTH, ", minobs:", ihp_MINOBS,
"ii for max kappa = ", ii_4max, ", max kappa =", chk_max, "\n")
#--------
}}}} # for hyper parameters
## --ntrees: 501 , shrinkage: 0.05 , interaction_depth: 49 , minobs: 3
## Using OOB method...
## --@ntrees: 501 , shrinkage: 0.05 , interaction_depth: 49 , minobs: 3 ii for max kappa = 21 , max kappa = 0.4454907
pred <- ifelse(pred_gbm < ii_4max/100, "no","yes")
print(confusionMatrix(pred, yirvn_te))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6614 435
## yes 550 522
##
## Accuracy : 0.8787
## 95% CI : (0.8714, 0.8857)
## No Information Rate : 0.8822
## P-Value [Acc > NIR] : 0.8367214
##
## Kappa : 0.4455
## Mcnemar's Test P-Value : 0.0002809
##
## Sensitivity : 0.9232
## Specificity : 0.5455
## Pos Pred Value : 0.9383
## Neg Pred Value : 0.4869
## Prevalence : 0.8822
## Detection Rate : 0.8144
## Detection Prevalence : 0.8680
## Balanced Accuracy : 0.7343
##
## 'Positive' Class : no
##
#save(file="c:\\scratch\\irvene\\irvine_bus_case.rdo", pred_gbm, pred, ii_4max, irvn_gbm, idx, gbm_fit)
In the code below, the metrics kappa, Subscription Conversion Rate (Neg Pred Value) and Potential Customer Capture (Specificity) are calculated for various response cut-off values. We focus only on these metrics because they were identified earlier as the ones useful for business needs.
The following code calculates kappa, Neg Pred Value and Specificity for various response probability cut-offs:
# defining the limits
ii_min <- as.integer(min(pred_gbm*100))+1
ii_max <- as.integer(max(pred_gbm*100))-1
m_data <-matrix(, nrow = 3, ncol = ii_max-ii_min) # empty metrics
for (ii in ii_min:(ii_max-1)){
cf<-table(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_te)
kappa <- kp(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_te)
spec <- cf[2,2]/(cf[1,2]+cf[2,2])
npred <- cf[2,2]/(cf[2,1]+cf[2,2])
m_data[1,ii-ii_min+1] <- kappa
m_data[2,ii-ii_min+1] <- spec
m_data[3,ii-ii_min+1] <- npred
}
The following code prepares the ggplot graph:
# These row names become the 'metric' labels after reshape-melting; they
# are (kappa, specificity, negPredValue) but given appropriate business names
row.names(m_data) <- c("kappa", "\"Potential-Customer\" capture",
"Subscription conversion rate")
# The column names become the x-axis values (the cut-off probabilities)
colnames(m_data) <- (ii_min:(ii_max-1)-ii_min+1)/100
suppressMessages(library(reshape))
df_melted <- melt(m_data)
colnames(df_melted) <- c("metric","predicted_probability", "value")
# making good looking plot
suppressMessages(library(ggplot2))
show_plot <- ggplot(df_melted, aes(x = predicted_probability, y = value)) +
geom_line(aes(color = metric, group = metric), size=0.7) +
xlab("ML prediction response cut-of for \"Yes\"") + ylab("metric values") +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
geom_vline(xintercept=0.08, linetype="dashed", size=0.5, colour="green") +
geom_vline(xintercept=0.238, linetype="dashed", size=0.5, colour="green") +
geom_vline(xintercept=0.455, linetype="dashed", size=0.55, colour="blue") +
theme(legend.position = c(0.81, 0.5)) +
#legendBackground: c(fill, lineSize, lineType, lineColor)
annotate("rect", xmin=0.08, xmax=0.238, ymin=0., ymax=1.0, alpha=.1, fill="green") +
ggtitle("Plot for optimization of business metrics") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=0.01, linetype="dashed", size=0.7, colour="red") +
annotate("text", label="no model", x=0.025, y=0.5, colour="blue", alpha=.5) +
annotate("text", label="cut-1", x=0.104, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-2", x=0.262, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-3", x=0.48, y=0.95, colour="blue", alpha=.5)
The following graph shows the values predicted by the tuned GBM algorithm. The region of higher kappa is usually the most optimal, but depending on the business case various scenarios can be selected. Let us understand the graph below.
show_plot
The x-axis shows the cut-off on the prediction response of the GBM algorithm, beyond which a customer is identified as “yes”, i.e., a customer who subscribed.
The red line shows the chance of capturing the customers that could potentially subscribe. From a business perspective, a higher number with reasonable effort is desired. (In the confusion matrix we identified this metric as Specificity.)
The blue line is the Subscription conversion rate. From a business perspective, a higher conversion rate without losing potential customers is desired. (In the confusion matrix we identified this metric as “Neg Pred Value”.)
The green line is the Cohen’s kappa coefficient curve. Higher kappa signifies better identification of the subscribing customers.
With this understanding, we can construct a few business-case scenarios. For example:
If marketing starts with a small budget, it can go with cut-3. This gives a conversion rate of about 65% but captures only 25% of the potential customers.
Below cut-1, the conversion rate drops sharply and the campaign may not be cost-effective.
This concludes the analysis and the development of the business case for this data.
It is quite possible that the model becomes slightly biased by the repeated use of the test data over many tuning iterations. So, to be extra sure, we run the model on the validation data, which was never used for training or testing.
The following code creates the prediction response on the Validation Set. The rest of the code is nearly the same as used earlier.
# create prediction vector response for the validation set based
# on the same gbm model fit
pred_gbm <-predict.gbm(gbm_fit, newdata=irvn_gbm[idx==3,],
n.trees=gbm.perf(gbm_fit, plot=FALSE), type="response")
## Using OOB method...
ii_min <- as.integer(min(pred_gbm*100))+1
ii_max <- as.integer(max(pred_gbm*100))-1
m_data <-matrix(, nrow = 3, ncol = ii_max-ii_min)
for (ii in ii_min:(ii_max-1)){
cf<-table(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_vl)
kappa <- kp(ifelse(pred_gbm > ii/100,"yes","no"), yirvn_vl)
spec <- cf[2,2]/(cf[1,2]+cf[2,2])
npred <- cf[2,2]/(cf[2,1]+cf[2,2])
m_data[1,ii-ii_min+1] <- kappa
m_data[2,ii-ii_min+1] <- spec
m_data[3,ii-ii_min+1] <- npred
}
#m_data
row.names(m_data) <- c("kappa", "\"Potential-Customer\" capture", "Subscription conversion rate")
#c("kappa", "specificity", "negPredValue")
colnames(m_data) <- (ii_min:(ii_max-1)-ii_min+1)/100
library(reshape)
df_melted <- melt(m_data)
colnames(df_melted) <- c("metric","predicted_probability", "value")
library(ggplot2)
obj_plot <- ggplot(df_melted, aes(x = predicted_probability, y = value)) +
geom_line(aes(color = metric, group = metric)) +
xlab("ML prediction response cut-of for \"Yes\"") + ylab("metric values") +
theme(panel.border = element_rect(colour = "black", fill=NA, size=0.5)) +
geom_vline(xintercept=0.08, linetype="dashed", size=0.7, colour="green") +
geom_vline(xintercept=0.238, linetype="dashed", size=0.7, colour="green") +
geom_vline(xintercept=0.455, linetype="dashed", size=0.7, colour="blue") +
theme(legend.position = c(0.81, 0.5)) +
#legendBackground: c(fill, lineSize, lineType, lineColor)
annotate("rect", xmin=0.08, xmax=0.238, ymin=0., ymax=1.0, alpha=.1, fill="green") +
ggtitle("Plot for optimization of business metrics") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(xintercept=0.01, linetype="dashed", size=0.7, colour="red") +
annotate("text", label="no model", x=0.025, y=0.5, colour="blue", alpha=.5) +
annotate("text", label="cut-1", x=0.104, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-2", x=0.262, y=0.95, colour="blue", alpha=.5) +
annotate("text", label="cut-3", x=0.48, y=0.95, colour="blue", alpha=.5)
The following plot was generated from the Validation Set. When compared to the plot created from the Test Set (jump to this section), it shows slight degradation, revealing the bias created by the many iterations used for tuning.
obj_plot
A simple eyeball comparison of the two plots (generated from the Test Set vs the Validation Set) shows:
The plot from the Validation Set is the truer representation of predictive power. The business can use this information to optimize its effort versus subscribing customers.
The UC Irvine bank marketing dataset is available from here.
I follow Data Science Central on Twitter, which has been an amazing source of knowledge in both breadth and depth. So much so that I owe much of my breadth of data science knowledge to this site. I am ever grateful to the people behind it.
One of the articles on Data Science Central led me to Jitesh Shah’s blog, where I found the UC Irvine bank marketing dataset. I realized this was the right dataset to showcase my data science skills, as this report shows. I would also like to thank Jitesh Shah for the article.
The data is originally available from UC Irvine Machine Learning Repository. The Bank Marketing Data has ID 45175.
The predictor (i.e., feature) variables are divided into customer data, campaign-related attributes and social/economic indicators.
Attributes related to the last touchpoint for the current campaign:
There are several missing values in some categorical attributes, all coded with the “unknown” label. These missing values can be treated as a possible class label, or deletion or imputation techniques can be applied to them.
12.4 Social and economic context attributes: