The objective of this story is to cover a wide range of topics in data science and data handling in R and Python. Topics covered are data structures, column manipulation, clustering, PCA, various ML algorithms, and visualization. Doing the same exercise in parallel in R helps one pick up the other language quickly.
But first, let us describe the Iris flower dataset a bit. It is one of the most popular toy datasets for ML (Machine Learning): the task is to identify the species based on four observations made on the petals' and sepals' length and width. The dataset consists of four predictor (also called feature) columns of numeric data type. Based on these features, one can predict the species of the flower: setosa, versicolor, or virginica. There are 50 measurements (one row per measurement) for each species.
In machine learning terms, this is a multinomial classification problem. The response variable (also called the label or target; it is the variable that needs to be predicted) is the species.
The data ships with the sklearn package, along with instructions to unpack it. The response variable takes the values 0, 1 and 2, which we will convert into setosa, versicolor and virginica.
We take the data first as a numpy array and then convert it to a pandas DataFrame. The DataFrame is highly optimized for data munging, slicing and fetching. Numpy's array and pandas' DataFrame are must-know data structures for any Data Scientist using Python.
The data preparation has the following steps:
import pandas as pd # for the dataframe conversion
import numpy as np # numpy array
from sklearn import datasets # contains the iris dataset as numpy arrays
iris = datasets.load_iris() # fetch the iris data
print(type(iris.data)) # shows the data is a numpy array
iris.data # shows the data of predictors or features
iris.feature_names # This contains the feature names (column names)
iris.target # shows the data of response (1 column)
iris.target_names # the name of the response
xiris = pd.DataFrame(iris.data) # put the data into a pandas DataFrame
xiris.columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'] # rename the columns
yiris = pd.DataFrame(iris.target)
yiris.columns = ['Species'] # rename the response column name
print(xiris.head(n=6))
print(yiris.tail(n=6))
print(xiris.shape) # dim(df) in R
print(xiris.describe()) # summary(df) in R; str(df) corresponds more closely to xiris.info()
# replace response of (0,1,2) to the species name
yiris['Species'] = yiris['Species'].map({0:'setosa', 1:'versicolor', 2:'virginica'})
print(yiris.iloc[101:105]) # rows at positions 101-104 (the end of an iloc slice is exclusive)
In R, the fundamental data structures are the vector, matrix, data.frame and list (of objects). In Python, the two most important libraries for data structures are numpy, which provides arrays, and pandas, which provides the DataFrame (and Series).
Task | Python | R | Comment |
---|---|---|---|
data structure | Python's list, pandas' Series, numpy's array, pandas' DataFrame | vector, matrix, data.frame, list | |
check type of data structure | type(xiris) | class(xiris) | |
no. of rows and columns | xiris.shape | dim(xiris) | |
column names: get, set | xiris.columns.values, xiris.columns = ['a','b','c','d'] | colnames(xiris), colnames(xiris) = c('a','b','c','d') | |
get a few records of df | xiris.head(n=5), xiris.tail(n=5), xiris[3:7] | head(xiris, 5), tail(xiris, 5), xiris[3:7,] | for pandas' df |
get 3rd, 1st and 3rd, 1 to 3 columns | xiris.iloc[:,2:3], xiris.iloc[:,[0,2]], xiris.iloc[:,0:3] | xiris[,3], xiris[,c(1,3)], xiris[,1:3] | |
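As a quick check, the pandas side of the table can be exercised in a few lines (a minimal sketch, assuming xiris as built above):
print(type(xiris)) # <class 'pandas.core.frame.DataFrame'>
print(xiris.shape) # (150, 4)
print(xiris.columns.values) # array of the four column names
print(xiris[3:7]) # rows at positions 3 to 6
print(xiris.iloc[:,[0,2]].head()) # 1st and 3rd columns by position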
We transform the predictor columns so that each column has mean = 0 and std = 1. This is important because some ML algorithms, such as KNN, will give more weight to predictors that have a larger range.
from sklearn.preprocessing import StandardScaler
scaled_features = StandardScaler().fit_transform(xiris.values)
xiris = pd.DataFrame(scaled_features, index=xiris.index, columns=xiris.columns)
print(xiris.mean(), xiris.std()) # means are ~0; pandas' std() uses ddof=1, so values are ~1.003 rather than exactly 1
We split the data for validation. Holding out a portion of the data (the simplest form of cross-validation) is a sound way to check the model fit: we separate a small portion (20%-30%) of the data on which we do not train the model, and then test the prediction on this held-out data. If the model fit is good, prediction accuracy on the held-out set will be comparable to that on the training set.
Optimization means striking a balance between bias and variance. A no-model (y = constant, say y = avg(y) or mode(y)) has low variance but high bias. An overfit model has low bias on the training data but high variance.
# split data into train and test
from sklearn.model_selection import train_test_split
xiris_tr, xiris_te, yiris_tr, yiris_te = train_test_split(xiris, yiris, test_size=0.3, random_state=0) # test size is 30%
# check to see rows and columns of the train and test set.
print(xiris.shape, xiris_tr.shape, xiris_te.shape)
print(yiris.shape, yiris_tr.shape, yiris_te.shape)
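To make the bias-variance trade-off above concrete, here is a minimal sketch on the split we just made. KNeighborsClassifier is used only for illustration (it is an assumption, not part of the original pipeline): k=1 is very flexible and overfits, while a very large k behaves like a no-model.
from sklearn.neighbors import KNeighborsClassifier
for k in [1, 15, 100]: # very flexible -> moderate -> nearly a no-model
    knn = KNeighborsClassifier(n_neighbors=k).fit(xiris_tr, yiris_tr.values.ravel())
    print(k, knn.score(xiris_tr, yiris_tr.values.ravel()), knn.score(xiris_te, yiris_te.values.ravel()))
# expect k=1 to score ~1.0 on train but lower on test (high variance),
# and k=100 to score lower on both (high bias)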
If there are a large number of predictors (or features), it is good to know which few predictors affect the response (or target) the most. Those predictors can then be plotted to get better visual insight. Variables that are strongly correlated carry redundant information, can cause unnecessary confusion in Exploratory Data Analysis (EDA), and are candidates for dropping.
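Before turning to PCA, a one-line correlation check (a sketch using only pandas) shows which predictors are strongly correlated:
# pairwise Pearson correlations; values near +/-1 flag redundant predictors
print(xiris.corr().round(2)) # in iris, petal length and petal width correlate strongly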
a_map = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
# interestingly, one can use either of these ways to get the colors
colors = yiris['Species'].map(a_map)
#colors = yiris['Species'].apply(lambda x: a_map[x])
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA() # PCA(n_components=4)
pca.fit(xiris)
# cumulative variance explained (%) vs number of components
var1 = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
plt.plot(list(range(1, var1.shape[0]+1)), var1, '-o')
plt.show()
# get the PCA transformed variables
xiris_pca = pd.DataFrame(pca.transform(xiris))
xiris_pca.shape
var1
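Besides the cumulative curve, it helps to see how much variance each component explains on its own; a small sketch reusing the fitted pca object from above:
# per-component explained variance ratio, labelled PC1..PC4
for i, r in enumerate(pca.explained_variance_ratio_, start=1):
    print('PC%d: %.1f%%' % (i, 100*r))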
Visualization provides insight into the data. Here we have plots of each pair of predictor variables. This visualization can reveal correlation between variables. Pair plots of the PCA-transformed predictor variables, however, can provide insight into the separability of the species.
%matplotlib notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
colormap = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
colors = yiris['Species'].map(colormap) # interestingly, one can use either way to get the colors
#colors = yiris['Species'].apply(lambda x: colormap[x])
df = xiris.iloc[:,0:4] # dataframe
axes = pd.plotting.scatter_matrix(df, alpha=0.7,c=colors, diagonal='hist') # diagonal='kde'
plt.tight_layout()
#plt.savefig('scatter_matrix.png')
plt.suptitle('Visualization of pairs of predictor variables')
plt.subplots_adjust(top=0.92)
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
colormap = {'setosa':'red', 'versicolor':'blue', 'virginica':'green'}
colors = yiris['Species'].map(colormap) # interestingly, one can use either way to get the colors
#colors = yiris['Species'].apply(lambda x: colormap[x])
df = xiris_pca.iloc[:,0:4] # dataframe
axes = pd.plotting.scatter_matrix(df, alpha=0.9, c=colors)
plt.tight_layout()
#plt.savefig('scatter_matrix.png')
plt.suptitle('Visualization of pairs of PCA transformed variables')
plt.subplots_adjust(top=0.92)
plt.show()
A 3-D plot is always handy and good for a presentation, though I never expect too much from it. %matplotlib notebook creates an interactive 3-D plot which can be rotated; once the right perspective is reached, the interactivity can be switched off.
%matplotlib notebook
import pylab as p
import mpl_toolkits.mplot3d.axes3d as p3
#t = np.arange(100)
fig=p.figure()
ax = p3.Axes3D(fig)
ax.scatter(xiris_pca.iloc[:,0],xiris_pca.iloc[:,1],xiris_pca.iloc[:,2], c=colors)
#ax.scatter(a,b,c, c=colors)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
fig.add_axes(ax)
p.show()
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
#data = pandas.read_csv('iris.data', sep=',')
data = pd.concat([xiris,yiris], axis=1) # axis=1 means columnwise
parallel_coordinates(data, 'Species')
plt.show()
t-SNE provides a view in which potential clusters in the data can be identified: http://www.datasciencecentral.com/profiles/blogs/t-sne-algo-in-r-and-python-made-with-same-dataset
It is always beneficial to have this visualization to get insight into separability.
The tuning parameter is perplexity, which is recommended to be between 5 and 50.
from sklearn import manifold
# perplexity should be between 5 and 50, read http://distill.pub/2016/misread-tsne/ for "How to Use t-SNE Effectively"
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0,perplexity=13,verbose=1, n_iter=1200)
xiris_tsne = tsne.fit_transform(xiris)
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure(figsize=(10,5))
plt.subplot2grid((1,2), (0,0))
plt.title('t-SNE')
# the commented plot below uses a single color, to test whether the clusters can still be spotted by eye
#plt.scatter(xiris_tsne[:, 0], xiris_tsne[:, 1], c = ['blue']*xiris.shape[0])
plt.scatter(xiris_tsne[:, 0], xiris_tsne[:, 1], c = colors)
#plt.scatter(xiris.iloc[:,1], xiris.iloc[:, 2], c = colors)
plt.show()
print('ORIGINAL DATA DIMENSIONS:', np.array(xiris).shape)
print('DIMENSIONS AFTER t-SNE:', np.array(xiris_tsne).shape)
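Because the t-SNE picture changes with perplexity, a short sweep (a sketch; the values 5, 13 and 40 are arbitrary picks within the recommended 5-50 range) makes the sensitivity visible:
# one embedding per perplexity value, plotted side by side
fig, axes = plt.subplots(1, 3, figsize=(12,4))
for ax, perp in zip(axes, [5, 13, 40]):
    emb = manifold.TSNE(n_components=2, init='pca', random_state=0, perplexity=perp).fit_transform(xiris)
    ax.scatter(emb[:,0], emb[:,1], c=colors)
    ax.set_title('perplexity=%d' % perp)
plt.tight_layout()
plt.show()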
This algorithm needs two parameters: (1) the number of clusters to find and (2) the number of random starting points (initializations) it should use when searching for the clusters. Ref: https://github.com/Apress/mastering-ml-w-python-in-six-steps/blob/master/Chapter_3_Code/Code/Clustering.ipynb
xiris.describe() # xiris.mean(), .std(), min(), max() can be used
# K Means Cluster
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=11) # set the clustering parameters
model.fit(xiris)
# The cluster labels calculated by KMeans are in model.labels_
pd.value_counts(pd.Series(model.labels_)) # count the frequency of each label; the R equivalent is table(<vector>)
# calculate the frequency of each predicted cluster label within each true species
for sp in ['setosa', 'versicolor', 'virginica']:
    idx = yiris.loc[yiris["Species"] == sp]
    lbl = model.labels_[idx.index]
    print('frequency of %s\n' % sp, pd.value_counts(pd.Series(lbl)))
# Note: the 'frequency of ...' prefix offsets the printed counts; this is cosmetic only.
# This is unsupervised learning, so we must associate each predicted cluster label with a Species.
# From the frequency tables above (with random_state=11), the mapping is (1 -> setosa, 0 -> versicolor, 2 -> virginica).
dct = {1:'setosa', 0:'versicolor', 2:'virginica'}
yiris_pred = pd.DataFrame([dct[i] for i in model.labels_])
yiris_pred
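Reading the mapping off the frequency tables by hand breaks as soon as random_state changes. A sketch that derives it automatically by majority vote (dct_auto is a hypothetical name; this assumes each cluster is dominated by one species):
ct = pd.crosstab(model.labels_, yiris['Species']) # clusters x species counts
dct_auto = ct.idxmax(axis=1).to_dict() # majority species per cluster, e.g. {0:'versicolor', 1:'setosa', 2:'virginica'}
print(dct_auto)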
# calculation of confusion matrix
from sklearn import metrics
# generate evaluation metrics
print ("Train - Accuracy :", metrics.accuracy_score(yiris, yiris_pred))
print ("Train - Confusion matrix :\n",metrics.confusion_matrix(yiris, yiris_pred))
print ("Train - classification report :", metrics.classification_report(yiris, yiris_pred))
If one is new to Python, it can take some time to set things up before one can write code. This section will help you get set up quickly in a Windows environment.
Refer to: Learn Python in 3 days : Step by Step Guide
Install Anaconda; we use two of its applications:
Jupyter Notebook (this is the upgraded version of the IPython notebook; the name is derived from Julia + Python + R). Opening this application shows an Anaconda Prompt, and your browser opens at http://localhost:8888/tree. Select the "New" dropdown at the upper right and choose Python 3. This opens a new page for a new project. In the cell In: [ ], type 2+3 and press Ctrl+Enter. You get the output. You have just made your first working code!
To rename the project, select Untitled* at the top, and you will get a dialog box to rename the file.
This project was saved as basic_iris_python.ipynb
Anaconda Prompt: opening this application gives a command-prompt-like window.
This window can also be used to open a new Jupyter notebook with the command "jupyter notebook".
This window is also used to install a library into Anaconda. Say, to install the library pydot, we type the command: "python -m pip install pydot". The command will install the pydot library.
In the Anaconda Prompt, type: jupyter nbconvert --to html basic_iris_python.ipynb. A file basic_iris_python.html will be created.