Recognizing Handwritten Digits (UCI ML Repo) with Support Vector Machines (SVM)
Support Vector Machines (SVMs) are a powerful family of supervised learning algorithms used for both classification and regression. SVMs are discriminative classifiers: rather than modeling each class, they learn a boundary that separates the classes. In this post I will demonstrate handwritten digit recognition using the SVC classifier in scikit-learn. We'll make use of the optical digits dataset available in the UCI Machine Learning Repository.
Let's begin by importing the required packages.
In [1]:
import matplotlib.pyplot as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from pprint import pprint
import urllib
import seaborn
%matplotlib inline
In [2]:
#Let's generate some simple blobs in sklearn, with 150 samples and 2 features
X, y = make_blobs(n_samples=150, centers=2, random_state=0, cluster_std=0.4)
#Let's plot the datasets to visualize
plt.scatter(X[:,0], X[:,1], c=y, s=50, cmap=plt.cm.Paired)
plt.show()
#OK great, the dataset looks fairly easily separable (at least to the naked eye). Let's get our hands dirty with SVC, then
#move on to bigger and better things.
In [3]:
#Let's fit the SVC to our dataset. The linear kernel essentially means the decision boundary is a line.
svc = SVC(kernel='linear').fit(X,y)
#Great, now let's run predictions and see how well the SVM has done on our dataset.
y_pred=svc.predict(X)
print metrics.classification_report(y, y_pred)
#As expected it worked perfectly, basically separating the data exactly as we generated it.
In [4]:
#OK, let's now define a plot decision function that will help us visualize the decision boundary and hyperplane
#We will be calling this routine multiple times from here on.
#This will only work for 2D datasets
def plot_decision_function(X, y, svc):
    #Plot the data points as a scatter plot first
    plt.scatter(X[:,0], X[:,1], c=y, s=30, cmap=plt.cm.Paired)
    #Form the grid for the plot
    x1 = np.linspace(plt.xlim()[0], plt.xlim()[1], 30)
    y1 = np.linspace(plt.ylim()[0], plt.ylim()[1], 30)
    y2, x2 = np.meshgrid(y1, x1)
    #OK now we will use the decision_function method provided by SVC
    #to plot the hyperplane and the margins. decision_function gives
    #the signed distance of each point from the separating hyperplane.
    P = np.zeros_like(x2)
    for i, xi in enumerate(x1):
        for j, yj in enumerate(y1):
            P[i,j] = svc.decision_function([[xi, yj]])
    return plt.contour(x2, y2, P, colors='black', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
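As an aside, the double loop over the grid can be replaced with a single vectorized call to decision_function. Below is a minimal sketch of that alternative; it assumes the same x2/y2 grid built inside the function above and a fitted two-class svc, so treat it as illustrative rather than a drop-in replacement.
#Score every grid point in one call: stack the grid into an (n_points, 2) array,
#then reshape the signed distances back into the grid shape for contouring
grid_points = np.c_[x2.ravel(), y2.ravel()]
P = svc.decision_function(grid_points).reshape(x2.shape)
plt.contour(x2, y2, P, colors='black', levels=[-1, 0, 1], linestyles=['--', '-', '--'])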
In [5]:
#Let's plot the decision boundary for the SVC we just fit
plt.title('SVC hyperplane with a well-separated sample dataset', fontsize=14)
plot_decision_function(X,y,svc)
#Notice the couple of points that touch the dashed lines. These are the support vectors, and they are the only points
#that influence the fitted SVM. If any other point on either side of the hyperplane moves (without crossing the margin,
#of course), the classification doesn't change. If a support vector moves, however, the decision boundary moves with it.
#(In our case the classes are extremely well separated, which doesn't happen very often with real data.)
#OK, let's highlight the support vectors in the plot. They are available in the support_vectors_ attribute of the
#classifier.
plt.scatter(svc.support_vectors_[:,0], svc.support_vectors_[:,1], s=200, facecolor='none', edgecolor='black', linewidths=1.5)
plt.show()
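A quick way to convince ourselves that only the support vectors matter is to refit the SVC on just those points and check that the predictions don't change. This is a small sketch using the svc, X, and y already in scope; the equality check should hold for this well-separated dataset.
#support_ holds the indices of the support vectors within the training data
sv_idx = svc.support_
svc_sv = SVC(kernel='linear').fit(X[sv_idx], y[sv_idx])
#A model trained on only the support vectors should reproduce the original predictions
print np.all(svc.predict(X) == svc_sv.predict(X))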
In [6]:
#OK now let's make a slightly more complex (and imbalanced) sample dataset with 1,000 samples.
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=1, weights=[0.2], random_state=48)
#Let's plot the data
plt.scatter(X[:,0], X[:,1], c=y, s=40, cmap=plt.cm.Paired)
#Alright, this sure looks challenging. Let's use the rbf (radial basis function) kernel since this data is definitely
#not linearly separable
svc=SVC(kernel='rbf', C=1.0).fit(X,y)
y_pred=svc.predict(X)
#C is the regularization parameter (it addresses overfitting). If C is too large, the model will tend to overfit the
#training set, meaning it will try to classify every single training example correctly. Such a model will not
#generalize (work on unseen data) very well since it is too fine-tuned to the training set. On the other hand, if C is
#too small, the regularization is too strong and the model will underfit, meaning it won't even fit the training set well.
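To make the effect of C concrete, we can fit the same rbf SVC with a few different values and compare training accuracy and the number of support vectors. This is just an illustrative sketch on the X, y generated above; roughly, a very small C (heavy regularization) should give lower training accuracy and more support vectors, while a very large C hugs the training data more tightly.
#Compare a few values of C on the same training data
for C_try in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='rbf', C=C_try).fit(X, y)
    print "C=%g: training accuracy=%.3f, support vectors=%d" % (
        C_try, metrics.accuracy_score(y, clf.predict(X)), len(clf.support_))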
In [7]:
#Let's see how well we performed
print metrics.classification_report(y,y_pred)
#Plotting the results to visualize the hyperplane
plt.title('SVC (rbf kernel) hyperplane on a more complex sample dataset', fontsize=14)
plot_decision_function(X,y,svc)
plt.scatter(svc.support_vectors_[:,0], svc.support_vectors_[:,1], s=200, facecolor='none', edgecolor='black', linewidths=1.5)
plt.show()
#There, the rbf kernel has done a very nice job of separating out the data. Except for a few points, all have been
#classified correctly.
Recognizing Handwritten Digits (UCI ML Repo)
In [8]:
#OK, let's actually get started with real data. We'll begin with the famous handwritten digit recognition task.
#For this we'll use the dataset provided by the UCI Machine Learning Repository.
#The dataset has 8x8 pixel images of handwritten digits (stored as 64 feature dimensions) and the corresponding
#labels of the actual digits. Preprocessing has already been performed on the dataset, and the train and test sets are
#available for download separately. Let's go get 'em.
#Training dataset, last column is the labels y
url="https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra"
raw_data = urllib.urlopen(url)
digi=np.genfromtxt(raw_data, delimiter=',')
X_train, y_train = digi[:,:-1], digi[:,-1:].squeeze()
print X_train.shape, y_train.shape
#Test dataset, last column is the labels y
url="https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes"
raw_data = urllib.urlopen(url)
digi=np.genfromtxt(raw_data, delimiter=',')
X_test, y_test = digi[:,:-1], digi[:,-1:].squeeze()
print X_test.shape, y_test.shape
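Before modeling, a quick sanity check on the downloaded data doesn't hurt. Below is a small sketch using the arrays we just built; if the download went well, the optdigits features should be counts in the 0 to 16 range and the ten digit classes should be roughly balanced.
#Quick sanity check on the training data we just loaded
print "Feature value range: %d to %d" % (X_train.min(), X_train.max())
print "Training class counts:", np.bincount(y_train.astype(int))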
In [9]:
#Below is the plot_learning_curve function provided in the scikit-learn examples. It lets us quickly and easily visualize
#how well the model performs as a function of the number of training samples, which helps diagnose situations such as
#high variance or high bias.
#We'll call this function in the next segment. Everything it needs has already been imported above.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [10]:
#All right, let's do this the right way. We'll use a cross-validation generator to select train and CV subsets to fine-tune
#hyperparameters such as C (the regularization parameter we saw earlier). These hyperparameters are critical to the model.
#If we tuned them against the test set, we would end up biasing the model towards the test set and would once again
#fail to generalize; worse, we would have no honest way to detect it, since we would have effectively trained on all our data.
#Luckily scikit-learn has built-in utilities that help with this. We'll use a cross-validation generator that
#splits a cross-validation subset (cv) out of the training set. A different cv subset is picked for each iteration, and we
#control the number of iterations. We will then feed these cv/train splits to a grid search function that evaluates the
#model on each split for every parameter combination and returns the parameters that give the best result.
#Defining this as a function so we can call it anytime we want
#Defining this as a function so we can call it anytime we want
def fit_SVM(kernel, n_jobs, C, gamma):
    #Choose the estimator: SVC
    estimator = SVC(kernel=kernel)
    #Choose a cross-validation generator - let's choose ShuffleSplit, which randomly shuffles and selects train and CV sets
    #for each iteration. There are other methods, such as the KFold split.
    cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)
    #Apply the cross-validation iterator on the training set using GridSearchCV. This will run the classifier on the
    #different train/cv splits for each parameter combination and return the model that has the best results.
    #Note that we are tuning based on the F1 score, 2PR/(P+R), where P is precision and R is recall. This may not always
    #be the best score to tune the model on; I will explore this further in a separate exercise. For now, we'll use F1.
    from sklearn.grid_search import GridSearchCV
    classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=dict(C=C, gamma=gamma),
                              n_jobs=n_jobs, scoring='f1')
    #Also note that we're feeding multiple candidate values of C and gamma to the grid search to try out.
    #We'll now fit the training dataset to this classifier
    classifier.fit(X_train, y_train)
    #Let's look at the best estimator that was found by GridSearchCV
    print "Best Estimator learned through GridSearch"
    print classifier.best_estimator_
    return classifier.best_estimator_.C, classifier.best_estimator_.gamma, cv
In [11]:
#WARNING - THIS WILL TAKE A WHILE TO RUN. TRY ADJUSTING parameters such as n_jobs (jobs to run in parallel; before
#increasing this, make sure your system can handle it), n_iter for ShuffleSplit (in the function definition), and reducing
#the number of values being tried for C and gamma.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
#C_vals=[.01, .03, .1, .3, 1, 3, 10, 30]
#gamma_vals=[0.001, .003, .01, .03, .1, 1, 3]
#C_vals=[.3, 1, 3]
#gamma_vals=[0.001, .003, .01]
C_vals=[3]
gamma_vals=[.001]
#Let's fit SVM to the digits training dataset by calling the function we just created.
C, gamma,cv=fit_SVM('rbf', 10, C_vals, gamma_vals)
In [12]:
#OK, let's look at the learning curve for our model as a function of the number of training examples. We'll call the
#plot_learning_curve function, feeding it the estimator (built with the best parameters returned by the grid search)
#and the train/cv splits.
#The function simply runs the estimator multiple times on subsets of the data provided and plots the train and cv scores.
#Note that we're feeding the best parameters we've learned from GridSearchCV to the estimator now.
#We may need to adjust the hyperparameters further if there is overfitting (or underfitting, though that's less likely here).
title = "Learning Curves (SVM, C=%.6f, gamma=%.6f)" %(C,gamma)
estimator = SVC(kernel='rbf', C=C, gamma=gamma)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv)
plt.show()
In [13]:
#Looks like there is some overfitting; look at the gap between the training score and the cv score. Since we don't have
#any more examples to train on, let's adjust the regularization parameter C instead. Bringing this value down (from the
#best value learned through GridSearchCV) increases regularization and should close the gap.
C=.5
#Run the learning curve again
title = "Learning Curves (SVM, C=%.6f, gamma=%.6f)" %(C,gamma)
estimator = SVC(kernel='rbf', C=C, gamma=gamma)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv)
plt.show()
In [14]:
#Great, we see that as the number of training samples increases, both training and cv scores rise, and at around 3,000
#examples they are both high and close to each other. This is a good sign. If the training score rose while the cv score
#stayed well below it, that would indicate high variance (overfitting). If both train and cv scores were low, that would
#indicate high bias (underfitting).
#Let's see how our trained model performs on the test set. We are not going to train on this set; we are merely looking at
#how well our model can generalize.
#Calling fit on the estimator object (on the training set only) so we can predict. We're NOT training on the test set here.
estimator.fit(X_train, y_train)
y_pred=estimator.predict(X_test)
print metrics.classification_report(y_test,y_pred)
print "Final Generalization Accuracy: %.6f" %metrics.accuracy_score(y_test,y_pred)
#Excellent, we have a 97.7% accuracy on the test set, which is in line with what is to be expected for this dataset.
In [34]:
#Just for kicks, let's visualize images of some digits and label them with what our model is predicting.
fig = plt.figure(figsize=(6,6))
fig.suptitle('Sample Digit predictions by SVM', fontsize=15)
for i in range(20):
    ax = fig.add_subplot(5, 4, i+1, xticks=[], yticks=[])
    ax.imshow(X_test[i,:].reshape(8,8), cmap=plt.cm.binary)
    ax.text(-1, 7, y_pred[i].astype(int), fontsize=15, color='black')
plt.show()
In [32]:
#scikit-learn provides something called a confusion matrix that helps us easily visualize the differences between
#actuals and predictions. It's especially helpful with digits data as the results are easier to comprehend.
print metrics.confusion_matrix(y_test, y_pred)
#Now, interestingly, this is much easier to read if we plot it.
seaborn.heatmap(metrics.confusion_matrix(y_test, y_pred))
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
#The confusion matrix is basically a utility to compare predictions with actuals. Each row is a true digit and each
#column a predicted digit, so an ideal matrix would have all of its mass on the diagonal and zeros everywhere else.
#We can see that we're almost there, apart from a few patches here and there.
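If we want to know which specific digits get mixed up, we can zero out the diagonal of the confusion matrix and look at the largest remaining entry. Below is a minimal sketch reusing y_test and y_pred from above.
#Find the single most common misclassification
cm = metrics.confusion_matrix(y_test, y_pred)
np.fill_diagonal(cm, 0)            #keep only the off-diagonal (misclassified) counts
true_d, pred_d = np.unravel_index(cm.argmax(), cm.shape)
print "Most common confusion: true %d predicted as %d (%d times)" % (true_d, pred_d, cm[true_d, pred_d])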
Share more, Learn more!