Story: with the Iris toy dataset

The objective of this story is to cover a wide range of topics in data science and data handling in R and Python. Topics covered include data structures, column manipulation, clustering, PCA, various ML algorithms, and visualization. Doing the parallel exercise in R helps in quickly learning the other language.

But first, let us describe the Iris flower dataset a bit. It is one of the most popular toy datasets for ML (Machine Learning); the task is to identify the species based on four observations made on the petal's and sepal's length and width. The dataset consists of four predictor (also called feature) columns of numeric data type. Based on these features, one can predict the species of the flower: setosa, versicolor, or virginica. There are 50 measurements (one measurement per row) for each species.

In machine learning terms, this is a multinomial classification problem. The response variable (also called the label or target) is the variable that needs to be predicted; here, it is the species.

The data comes with the sklearn package, along with instructions on how to unpack it. The response variable takes the values 0, 1, and 2, which we will convert into setosa, versicolor, and virginica.

The data comes as a numpy array, which we then convert to a pandas DataFrame. A DataFrame is highly optimized for data munging, slicing, and fetching. Numpy's array and pandas' DataFrame are essential knowledge for any data scientist using Python.

The data preparation will have the following steps:

  1. Put the data into a pandas DataFrame
  2. Standardize it so that each column has mean = 0 and standard deviation = 1
  3. Split the data into training and test sets.

Data Preparation

Putting data in the right format

Here we put the data into DataFrames. The DataFrame xiris contains only the four predictor columns; yiris contains a single column with the response (or target, or label) variable.

In [4]:
import pandas as pd             # for the dataframe conversion
import numpy as np              # numpy array
from sklearn import datasets    # contain iris dataset as numpy array 
iris = datasets.load_iris()     # use the following to fetch the iris data
print(type(iris.data))          # shows the data is numpy array
iris.data                       # shows the data of predictors or features
iris.feature_names              # This contains the feature names (column names)
iris.target                     # shows the data of response (1 column)
iris.target_names               # the name of the response
xiris = pd.DataFrame(iris.data) # put the data into panda data frame
xiris.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'] # rename the columns
yiris = pd.DataFrame(iris.target)
yiris.columns = ['Species']     # rename the response column name
print(xiris.head(n=6))
print(yiris.tail(n=6))
print(xiris.shape)             # dim(df) in R
print(xiris.describe())        # summary(df) in R
# replace response of (0,1,2) to the species name
yiris['Species'] = yiris['Species'].map({0:'setosa', 1:'versicolor', 2:'virginica'})
print(yiris.iloc[101:105])            # show rows 101 to 104 (the end index is exclusive)
<class 'numpy.ndarray'>
   Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2
5           5.4          3.9           1.7          0.4
     Species
144        2
145        2
146        2
147        2
148        2
149        2
(150, 4)
       Sepal_Length  Sepal_Width  Petal_Length  Petal_Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
       Species
101  virginica
102  virginica
103  virginica
104  virginica

In R, the fundamental data structures are vector, matrix, data frame, and list (of objects). In Python, the two most important libraries for data structures are numpy, which provides arrays, and pandas, which provides the DataFrame.

| Task | Python | R |
| --- | --- | --- |
| data structures | Python's list or pandas' Series, numpy's array, pandas' DataFrame | vector, matrix, data frame, list |
| check type of data structure | type(xiris) | class(xiris) |
| no. of rows and columns | xiris.shape | dim(xiris) |
| column names: get, set | xiris.columns.values, xiris.columns = ['a','b','c','d'] | colnames(xiris), colnames(xiris) = c('a','b','c','d') |
| get 5 records of the data frame, slice rows | xiris.head(n=5), xiris.tail(n=5), xiris[3:7] | head(xiris, 5), tail(xiris, 5), xiris[3:7,] |
| get the 3rd, the 1st and 3rd, columns 1 to 3 | xiris.iloc[:,2:3], xiris.iloc[:,[0,2]], xiris.iloc[:,0:3] | xiris[,3], xiris[,c(1,3)], xiris[,1:3] |
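
To make the Python column of this table concrete, here is a small sketch (assuming the xiris DataFrame built above); the main trap for R users is that pandas is 0-indexed and slices are end-exclusive:

# quick tour of the pandas equivalents from the table above
print(type(xiris))            # <class 'pandas.core.frame.DataFrame'>, class(xiris) in R
print(xiris.shape)            # (rows, columns), dim(xiris) in R
print(xiris.columns.values)   # get column names, colnames(xiris) in R
print(xiris.head(n=5))        # first 5 rows, head(xiris, 5) in R
print(xiris[3:7])             # rows 3..6: end-exclusive, unlike R's xiris[3:7,]
print(xiris.iloc[:, [0, 2]])  # 1st and 3rd columns, xiris[,c(1,3)] in R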

Normalizing or Standardizing of data

We transform the predictor columns so that each column has mean = 0 and std = 1. This is important because some ML algorithms, such as KNN, would otherwise give more weight to predictors that have a larger range.

In [5]:
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(xiris.values)
xiris = pd.DataFrame(scaled_features, index=xiris.index, columns=xiris.columns)
print(xiris.mean(), xiris.std())
Sepal_Length   -1.690315e-15
Sepal_Width    -1.637024e-15
Petal_Length   -1.482518e-15
Petal_Width    -1.623146e-15
dtype: float64 Sepal_Length    1.00335
Sepal_Width     1.00335
Petal_Length    1.00335
Petal_Width     1.00335
dtype: float64
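
Note that the printed standard deviation is 1.00335 rather than exactly 1. This is a degrees-of-freedom convention: StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() reports the sample standard deviation (ddof=1), and sqrt(150/149) ≈ 1.00335. A quick check:

# the std discrepancy is just the degrees-of-freedom convention
print(xiris.std(ddof=0))   # population std: exactly 1.0 in every column
print(np.sqrt(150/149))    # ≈ 1.00335, the ratio seen in the output above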

Split data into train and test sets

We split the data for cross-validation. Cross-validation is the best way to check that the model fit generalizes. We do this by separating out a small portion (20%-30%) of the data on which we do not train the model; we then test the prediction on this held-out data. If the model fit is well optimized, the prediction accuracy will be comparable to that on the training set.

Optimization means striking a balance between bias and variance. A no-model (y = constant, say y = avg(y) or mode(y)) has low variance but high bias. An overfit model has low bias on the training data but high variance.

In [6]:
# split data into train and test
from sklearn.model_selection import train_test_split
xiris_tr, xiris_te, yiris_tr, yiris_te = train_test_split(xiris, yiris, test_size=0.3, random_state=0) # test size is 30%
# check to see rows and columns of the train and test set.
print(xiris.shape, xiris_tr.shape, xiris_te.shape)
print(yiris.shape, yiris_tr.shape, yiris_te.shape)
(150, 4) (105, 4) (45, 4)
(150, 1) (105, 1) (45, 1)
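
As a side note, the split above does not guarantee that each species is equally represented in both sets. If that matters, train_test_split accepts a stratify argument; a minimal variant (the _s-suffixed names here are just illustrative):

# stratified variant: preserves the 50/50/50 class proportions in both sets
xtr_s, xte_s, ytr_s, yte_s = train_test_split(
    xiris, yiris, test_size=0.3, random_state=0, stratify=yiris)
print(yte_s['Species'].value_counts())   # expect 15 rows per species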

Principal Component Analysis

If there are many predictors (or features), it is good to know which few of them mainly affect the response (or target). These predictors can then be plotted to get better visual insight. Variables that are strongly correlated carry redundant information and can cause unnecessary confusion in Exploratory Data Analysis (EDA), so they are candidates to be dropped.

In [8]:
a_map = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
# interesting: one can use either way to get colors
colors = yiris['Species'].map(a_map)
#colors = yiris['Species'].apply(lambda x: a_map[x])
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA() # PCA(n_components=4)
pca.fit(xiris)
# cumulative variance explained (in percent)
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
plt.plot(list(range(1,var1.shape[0]+1)),var1,  '-o')
plt.show()
# get the PCA transformed variables
xiris_pca = pd.DataFrame(pca.transform(xiris))
xiris_pca.shape
var1
Out[8]:
array([ 72.77,  95.8 ,  99.48, 100.  ])
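
The cumulative-variance curve tells us the first two components explain about 96% of the variance, but not which original features they are built from. For that one can inspect pca.components_; a small sketch printing the loadings as a labelled DataFrame:

# loadings: each row is a principal component expressed in the original features
loadings = pd.DataFrame(pca.components_, columns=xiris.columns,
                        index=['PC1', 'PC2', 'PC3', 'PC4'])
print(loadings.round(3))   # large absolute values mark the influential predictors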

Visualization

Visualization provides insight into the data. Here we have plots of each pair of predictor variables. Such a visualization can reveal correlations between variables. Pair plots of the PCA-transformed predictor variables, however, can provide insight into the separability of the species.

In [9]:
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
colormap = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
colors = yiris['Species'].map(colormap) # interesting: one can use either way to get colors
#colors = yiris['Species'].apply(lambda x: colormap[x])
df = xiris.iloc[:,0:4] # dataframe
axes = pd.plotting.scatter_matrix(df, alpha=0.7,c=colors, diagonal='hist') # diagonal='kde'
plt.tight_layout()
#plt.savefig('scatter_matrix.png')
plt.suptitle('Visualization of pairs of predictor variables')
plt.subplots_adjust(top=0.92)
plt.show()
In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
colormap = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
colors = yiris['Species'].map(colormap) # interesting: one can use either way to get colors
#colors = yiris['Species'].apply(lambda x: colormap[x])
df = xiris_pca.iloc[:,0:4] # dataframe
axes = pd.plotting.scatter_matrix(df, alpha=0.9, c=colors)
plt.tight_layout()
#plt.savefig('scatter_matrix.png')
plt.suptitle('Visualization of pairs of PCA transformed variables')
plt.subplots_adjust(top=0.92)
plt.show()

3-D Visualization

A 3-D plot is always handy and good for presentations, though I never expect much from it. %matplotlib notebook creates an interactive 3-D plot that can be rotated; once the right perspective is reached, the interactive mode can be switched off.

In [11]:
%matplotlib notebook
import pylab as p
import mpl_toolkits.mplot3d.axes3d as p3

fig = p.figure()
ax = p3.Axes3D(fig)    # attach a 3-D axes to the figure (older matplotlib style)
ax.scatter(xiris_pca.iloc[:,0], xiris_pca.iloc[:,1], xiris_pca.iloc[:,2], c=colors)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
fig.add_axes(ax)
p.show()
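
Note that p3.Axes3D(fig) together with fig.add_axes(ax) is an older construction style; newer matplotlib releases prefer add_subplot with projection='3d'. An equivalent sketch using that idiom, assuming xiris_pca and colors from above:

# modern idiom: 3-D scatter of the first three PCA components
from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection on older matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xiris_pca.iloc[:,0], xiris_pca.iloc[:,1], xiris_pca.iloc[:,2], c=colors)
ax.set_xlabel('PC1'); ax.set_ylabel('PC2'); ax.set_zlabel('PC3')
plt.show()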
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

#data = pandas.read_csv('iris.data', sep=',')
data = pd.concat([xiris,yiris], axis=1)  # axis=1 means columnwise
parallel_coordinates(data, 'Species')
plt.show()

t-SNE method of separation: converting multiple dimensions to 2-D

It shows whether the data has the potential to be identified as clusters: http://www.datasciencecentral.com/profiles/blogs/t-sne-algo-in-r-and-python-made-with-same-dataset

It is always beneficial to have this visualization to get insight into separability.

The tuning parameter is perplexity, which is recommended to be between 5 and 50.

In [49]:
from sklearn import manifold
# perplexity should be between 5 and 50, read http://distill.pub/2016/misread-tsne/ for "How to Use t-SNE Effectively"
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0,perplexity=13,verbose=1, n_iter=1200)
xiris_tsne = tsne.fit_transform(xiris)
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 40 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 150 / 150
[t-SNE] Mean sigma: 0.419267
[t-SNE] KL divergence after 100 iterations with early exaggeration: 1.512934
[t-SNE] Error after 350 iterations: 1.512934
In [59]:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(10,5))
plt.subplot2grid((1,2), (0,0))
#plt.scatter(X_pca[:, 0], X_pca[:, 1], c=digits.target)
#plt.subplot2grid((1,2), (0,1), rowspan=1, colspan=2)
plt.title('t-SNE')

# this plot is of one color, to test whether the clusters can still be identified by eye
#plt.scatter(xiris_tsne[:, 0], xiris_tsne[:, 1], c = ['blue']*xiris.shape[0]) 
plt.scatter(xiris_tsne[:, 0], xiris_tsne[:, 1], c = colors)
#plt.scatter(xiris.iloc[:,1], xiris.iloc[:, 2], c = colors)
plt.show()
#colors
## ORIGINAL DATA DIMENSIONS
#print('ORIGINAL DATA DIMENSION:',np.array(xiris).shape)

## DIMENSIONS AFTER t-SNE
print('DIMENSIONS AFTER t-SNE',np.array(xiris_tsne).shape)
DIMENSIONS AFTER t-SNE (150, 2)
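
Since perplexity is the main tuning knob, it can be instructive to sweep a few values within the recommended 5-50 range and compare the embeddings side by side; a sketch (fast for only 150 rows):

# compare t-SNE embeddings for three perplexity values
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, [5, 13, 40]):
    emb = manifold.TSNE(n_components=2, init='pca', random_state=0,
                        perplexity=perp).fit_transform(xiris)
    ax.scatter(emb[:, 0], emb[:, 1], c=colors)
    ax.set_title('perplexity = %d' % perp)
plt.tight_layout()
plt.show()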

K Means cluster

This algorithm needs two parameters: (1) the number of clusters to find, and (2) the number of starting points it should use when searching for the clusters. Ref: https://github.com/Apress/mastering-ml-w-python-in-six-steps/blob/master/Chapter_3_Code/Code/Clustering.ipynb

In [13]:
xiris.describe() # xiris.mean(), .std(), min(), max() can be used
# K Means Cluster
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=11)    # set the clustering parameters
model.fit(xiris)
# The cluster label calculated by KMeans is in model.labels_
pd.value_counts(pd.Series(model.labels_))   # compared to R's one-liner table(<vector>), this is one of the more painful commands
#yiris[(yiris=="setosa")]
#type(yiris)
# calculate the frequency of each predicted cluster value
idx =yiris.loc[yiris["Species"]=='setosa']
lbl = model.labels_[idx.index]
print('frequency of setosa\n',pd.value_counts(pd.Series(lbl)))

idx =yiris.loc[yiris["Species"]=='versicolor']
lbl = model.labels_[idx.index]
print('frequency of versicolor\n',pd.value_counts(pd.Series(lbl)))

idx =yiris.loc[yiris["Species"]=='virginica']
lbl = model.labels_[idx.index]
print('frequency of virginica\n', pd.value_counts(pd.Series(lbl)))
# ?? it would be interesting to remove the offset created by the 'frequency of ..' prints
# This is unsupervised learning, so we need to associate the predicted cluster labels with Species.
# From the frequency tables above, the mapping is (setosa -> 1, versicolor -> 0, virginica -> 2)
#--

dct = {1:'setosa', 0:'versicolor', 2:'virginica'}
yiris_pred = pd.DataFrame([dct[i] for i in model.labels_])
#type(yiris)
yiris_pred
# calculation of confusion matrix
from sklearn import metrics
# generate evaluation metrics
print ("Train - Accuracy :", metrics.accuracy_score(yiris, yiris_pred))
print ("Train - Confusion matrix :\n",metrics.confusion_matrix(yiris, yiris_pred))
print ("Train - classification report :", metrics.classification_report(yiris, yiris_pred))
frequency of setosa
 1    50
dtype: int64
frequency of versicolor
 0    39
2    11
dtype: int64
frequency of virginica
 2    36
0    14
dtype: int64
Train - Accuracy : 0.833333333333
Train - Confusion matrix :
 [[50  0  0]
 [ 0 39 11]
 [ 0 14 36]]
Train - classification report :              precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        50
 versicolor       0.74      0.78      0.76        50
  virginica       0.77      0.72      0.74        50

avg / total       0.83      0.83      0.83       150
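
We passed n_clusters=3 because we already know there are three species. Without that knowledge, a common heuristic is the elbow method: fit KMeans for several values of k and plot the inertia_ attribute (within-cluster sum of squares), looking for the bend in the curve. A minimal sketch:

# elbow method: within-cluster sum of squares versus number of clusters
import matplotlib.pyplot as plt
inertias = [KMeans(n_clusters=k, random_state=11).fit(xiris).inertia_
            for k in range(1, 9)]
plt.plot(range(1, 9), inertias, '-o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (within-cluster sum of squares)')
plt.show()   # look for the "elbow" where the curve flattens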

Misc

Installation of the Python IDE

If one is new to Python, it can take some time to get set up before one can write code. This section will help you set up quickly in a Windows environment.

Refer to: Learn Python in 3 days : Step by Step Guide

Install Anaconda and we use two applications:

  • Jupyter Notebook (this is the upgraded version of the IPython notebook; the name is derived from Julia + Python + R). Opening this application shows the Anaconda Prompt, and the browser opens at http://localhost:8888/tree. Select the "New" dropdown on the upper right and choose Python 3. This opens a new page for a new project. In the cell In: [ ], type 2+3 and press Ctrl+Enter. You get the output. You have just written your first working code!

    To rename the project, select Untitled* at the top, and you will get a dialog box to rename the file.

    This project was saved as basic_iris_python.ipynb

  • Anaconda Prompt: opening this application gives a command-prompt-like window.

    This window can also be used to open a new Jupyter notebook with the command "jupyter notebook".

    This window is also used to install a library in Anaconda. Say, to install the library pydot, type the command: "python -m pip install pydot". The command will install the pydot library.

Creation of the html

In the Anaconda prompt, type: jupyter nbconvert --to html basic_iris_python.ipynb. A file named basic_iris_python.html will be created.
