# Understanding k-Nearest Neighbours with the PIMA Indians Diabetes dataset

K nearest neighbors (kNN) is one of the simplest supervised learning strategies: given a new, unknown observation, it looks up which observations in the reference data have the closest features and assigns the predominant class among those neighbors. Let's try to understand kNN with examples.
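To make that concrete before reaching for scikit-learn, here is a minimal from-scratch sketch of the idea (the toy points and queries below are invented purely for illustration): measure the distance from the new point to every stored point, take the k closest, and let them vote.

```
#A bare-bones kNN classifier in plain numpy (toy data, for illustration only)
import numpy as np

#Reference "database": six points with two features each, plus their known classes
X_ref = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_ref = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(query, k=3):
    #Euclidean distance from the query to every stored point
    dists = np.sqrt(((X_ref - query) ** 2).sum(axis=1))
    #Indices of the k nearest points
    nearest = np.argsort(dists)[:k]
    #Majority vote among the neighbors' classes
    return np.bincount(y_ref[nearest]).argmax()

print knn_predict(np.array([1.1, 0.9]))  #lands near the first cluster -> 0
print knn_predict(np.array([5.1, 5.0]))  #lands near the second cluster -> 1
```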

In [20]:
```#Importing required packages

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn
from pprint import pprint
%matplotlib inline
```
In [21]:
```#Let's begin by exploring one of scikit-learn's simplest sample datasets, the Iris.
from sklearn.datasets import load_iris
iris = load_iris()
print iris.keys()
```
```['target_names', 'data', 'target', 'DESCR', 'feature_names']

```
In [22]:
```#The Iris dataset contains data about 3 types of Iris flowers, namely:
print iris.target_names

#Let's look at the shape of the Iris dataset
print iris.data.shape
print iris.target.shape

#So there is data for 150 Iris flowers and a target set with 0,1,2 depending on the type of Iris.
#Let's look at the features
print iris.feature_names

#Great. Now the objective is to learn from this dataset so that, given a new Iris flower, we can best guess its type.
#Let's keep this simple to start with and train on the whole dataset.
```
```['setosa' 'versicolor' 'virginica']
(150, 4)
(150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

```
In [4]:
```#Fitting the Iris dataset using KNN
X,y = iris.data, iris.target

#Fitting KNN with 1 Neighbor. This is generally a very bad idea: when we evaluate on the training data,
#the closest neighbor to each point is the point itself, so we will definitely overfit. It's equivalent
#to hardcoding a label lookup for each row in the dataset.
iris_knn = KNeighborsClassifier(n_neighbors = 1).fit(X,y)

#Now running predictions on the same dataset X. We should obviously get 100% accuracy
y_pred = iris_knn.predict(X)

#Confirming our assumption that if we run predictions on X, we will accurately predict all 150 labels
print np.all(y_pred==y)
iris_knn.score(X,y) #Score is 100%
```
```True

```
Out[4]:
```1.0
```
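One way to see why a single neighbor just memorizes the training data: the classifier's kneighbors method returns the closest stored points and their distances for any query. If we query with the training points themselves (a quick sanity check against the 1-neighbor iris_knn fitted above), every point should sit at distance 0 from its nearest stored point, namely itself:

```
#Each training point's nearest stored point is at distance zero (the point itself),
#which is exactly why predicting on the training set gives a perfect score
dist, ind = iris_knn.kneighbors(X, n_neighbors=1)
print np.all(dist == 0)  #expected: True
```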
In [23]:
```#Let's run KNN with something more reasonable, neighbors = 5
iris_knn = KNeighborsClassifier(n_neighbors = 5).fit(X,y)
print iris_knn.score(X,y)
y_pred=iris_knn.predict(X)
```
```0.803385416667

```
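If it's unclear what a 'reasonable' number of neighbors looks like, a rough sketch is to sweep a few values of k and watch the training score. Note this still scores on the training data, so it only shows the memorization effect fading as k grows, not true generalization:

```
#Training-set score for a range of k: small k memorizes, larger k smooths
for k in [1, 3, 5, 10, 20]:
    score = KNeighborsClassifier(n_neighbors=k).fit(X, y).score(X, y)
    print "k=%d, training score=%.3f" % (k, score)
```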
In [27]:
```#Let's run a quick PCA to reduce X down to 2D so we can visualize
from sklearn.decomposition import PCA
X_proj = PCA(2).fit_transform(X)

#Let's plot the actual labels and those predicted by KNN side-by-side for comparison
fig, ax = plt.subplots(1,2,figsize=(8,3))
ax[0].scatter(X_proj[:,0], X_proj[:,1], c=y, cmap=plt.cm.Paired)
ax[0].set_title('Actual labels')
ax[1].scatter(X_proj[:,0], X_proj[:,1], c=y_pred, cmap=plt.cm.Paired)
ax[1].set_title('Predicted labels')
plt.show()
```
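One caveat on these plots: the 2-D PCA projection discards information, so points that look like neighbors on screen are not necessarily neighbors in the full 4-D feature space. We can check how much variance the two plotted components actually retain:

```
#Fraction of the total variance captured by the two plotted components
pca = PCA(2).fit(X)
print pca.explained_variance_ratio_        #per-component share
print pca.explained_variance_ratio_.sum()  #total retained in the 2-D plot
```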
In [28]:
```#OK let's now split the iris dataset into Training and Test sets so we can see how well
#the classifier generalizes
X_train, X_test, y_train, y_test = train_test_split(X,y)
print X_train.shape, X_test.shape
```
```(576, 8) (192, 8)

```
In [29]:
```#Now let's train on X_train and predict on X_test
iris_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = iris_knn.predict(X_test)

#Let's check the scores for X_train and X_test
print "n_neighbors=5, Training cross-validation score", iris_knn.score(X_train, y_train)
print "n_neighbors=5, Test cross-validation score", iris_knn.score(X_test, y_test)

#They're almost the same and looks like the algorithm is generalizing pretty well.

#Just to confirm understanding let's run with neighbors=1. This must yield a very high training score
#but not a great test score.
#Now let's train on X_train and predict on X_test
iris_knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = iris_knn.predict(X_test)

#Let's check the scores for X_train and X_test
print "n_neighbors=1, Training cross-validation score", iris_knn.score(X_train, y_train)
print "n_neighbors=1, Test cross-validation score", iris_knn.score(X_test, y_test)

#That's perfect! Reverting back to 5 neighbors so we can plot a comparison figure
iris_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = iris_knn.predict(X_test)
```
```n_neighbors=5, Training score 0.791666666667
n_neighbors=5, Test score 0.734375
n_neighbors=1, Training score 1.0
n_neighbors=1, Test score 0.708333333333

```
In [32]:
```#Plotting test actuals and predicted for comparison
X_proj = PCA(2).fit_transform(X_test)

fig, ax = plt.subplots(1,2,figsize=(8,3))
ax[0].scatter(X_proj[:,0], X_proj[:,1], c=y_test, s=50, cmap=plt.cm.Paired)
ax[0].set_title('Test Actuals')
ax[1].scatter(X_proj[:,0], X_proj[:,1], c=y_pred, s=50, cmap=plt.cm.Paired)
ax[1].set_title('Test Predicted')
plt.show()
```

## PIMA Indians Diabetes Data (UCI ML Repo)

Let's move on to bigger and better things. We'll analyze the Pima Indians Diabetes dataset, available for download from the UCI Machine Learning Repository.

In [33]:
```#Download the data from the UCI website using urllib
import urllib
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data")
raw_data = urllib.urlopen(url)

#The file is a CSV, let's read it into a numpy array
#Note: not using Pandas to examine/clean the dataset at this point since this dataset is pretty well-cleansed.
diab = np.genfromtxt(raw_data, delimiter=",")
print diab.shape

#This dataset has 9 columns; the 9th is the label, 1 for Diabetes and 0 for no Diabetes.
#Let's split into X,y
X,y = diab[:,:-1], diab[:,-1:].squeeze() #squeeze to flatten the labels into the vector y

print X.shape, y.shape

#Let's run the train/test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
print X_train.shape, X_test.shape
```
```(768, 9)
(768, 8) (768,)
(576, 8) (192, 8)

```
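Before judging any accuracy numbers on this dataset, it's worth checking the class balance: the majority-class fraction is the accuracy a trivial model that always predicts "No Diabetes" would get, and any real classifier should beat it. A quick sketch:

```
#Class balance check: the majority-class share is the baseline accuracy to beat
counts = np.bincount(y.astype(int))
print counts                        #samples per class (0 = No Diabetes, 1 = Diabetes)
print counts.max() / float(len(y))  #baseline accuracy of always predicting the majority
```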
In [34]:
```#Great! now, let's run the kNN classifier with 3 neighbors and see how it does.
diab_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = diab_knn.predict(X_test)
y_train_pred = diab_knn.predict(X_train)

#Let's get the score summary
print "Results with 3 Neighbors"
print metrics.classification_report(y_test, y_pred, target_names=['No Diabetes', 'Diabetes'])
print metrics.classification_report(y_train, y_train_pred, target_names=['No Diabetes', 'Diabetes'])

#So we had a training accuracy of about 84% but a test accuracy of only about 69%.
#The precision/recall metrics tell us, for each class (No Diabetes or Diabetes), how many of our predictions
#were correct (precision) and how many of the actual cases we caught (recall).
```
```Results with 3 Neighbors
             precision    recall  f1-score   support

No Diabetes       0.79      0.73      0.76       128
   Diabetes       0.53      0.61      0.57        64

avg / total       0.70      0.69      0.70       192

             precision    recall  f1-score   support

No Diabetes       0.86      0.90      0.88       372
   Diabetes       0.80      0.73      0.76       204

avg / total       0.83      0.84      0.83       576

```
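To see where those precision/recall figures come from, we can also print the raw confusion matrix for the test set; rows are the actual classes, columns the predicted ones, so the diagonal holds the correct predictions:

```
#Raw counts behind the report above (rows = actual, columns = predicted)
print metrics.confusion_matrix(y_test, y_pred)
```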
In [35]:
```#OK, not so great so far. Let's see if we can improve the score by increasing neighbors.
diab_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = diab_knn.predict(X_test)
y_train_pred = diab_knn.predict(X_train)

#Let's get the score summary
print "Results with 5 Neighbors"
print metrics.classification_report(y_test, y_pred, target_names=['No Diabetes', 'Diabetes'])
print metrics.classification_report(y_train, y_train_pred, target_names=['No Diabetes', 'Diabetes'])

#Hmm, we actually did slightly worse on both the test set and the training set.
#If we continue to tune in this fashion, we will end up biasing our model towards the test set.
#Trying out even more values for neighbors, it turns out our first choice of neighbors=3 is actually the best.
#Is there any other setting we could play with?
```
```Results with 5 Neighbors
             precision    recall  f1-score   support

No Diabetes       0.76      0.74      0.75       128
   Diabetes       0.51      0.53      0.52        64

avg / total       0.68      0.67      0.67       192

             precision    recall  f1-score   support

No Diabetes       0.84      0.88      0.86       372
   Diabetes       0.76      0.70      0.73       204

avg / total       0.81      0.81      0.81       576

```
In [36]:
```#Let's try the Ball Tree algorithm, just for kicks!
diab_knn = KNeighborsClassifier(n_neighbors=3, algorithm="ball_tree").fit(X_train, y_train)
y_pred = diab_knn.predict(X_test)
y_train_pred = diab_knn.predict(X_train)

#Let's get the score summary
print "Results with 3 Neighbors and the Ball Tree algo"
print metrics.classification_report(y_test, y_pred, target_names=['No Diabetes', 'Diabetes'])
print metrics.classification_report(y_train, y_train_pred, target_names=['No Diabetes', 'Diabetes'])

# The results are unchanged. The algorithm option only changes how the neighbors are searched for,
# not which neighbors are found, so identical results are expected (see the check below).
# Looks like we've hit the ceiling of kNN's main settings; more sophisticated supervised learning
# algorithms are a topic for another exercise.
```
```Results with 3 Neighbors and the Ball Tree algo
             precision    recall  f1-score   support

No Diabetes       0.79      0.73      0.76       128
   Diabetes       0.53      0.61      0.57        64

avg / total       0.70      0.69      0.70       192

             precision    recall  f1-score   support

No Diabetes       0.86      0.90      0.88       372
   Diabetes       0.80      0.73      0.76       204

avg / total       0.83      0.84      0.83       576

```
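The identical report is expected: the algorithm option ('brute', 'kd_tree', 'ball_tree') only changes how the nearest neighbors are searched for, not which neighbors are found, so predictions should agree exactly (barring ties in distance). A quick check, assuming X_train and X_test are still the splits from above:

```
#The neighbor-search algorithm affects speed, not results: predictions should agree
preds = {}
for algo in ["brute", "kd_tree", "ball_tree"]:
    model = KNeighborsClassifier(n_neighbors=3, algorithm=algo).fit(X_train, y_train)
    preds[algo] = model.predict(X_test)
print np.all(preds["brute"] == preds["kd_tree"])   #expected: True
print np.all(preds["brute"] == preds["ball_tree"]) #expected: True
```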
In [37]:
```#Below is the plot_learning_curve module provided in the scikit-learn examples. It allows us to quickly
#visualize how well the model is performing based on the number of samples we're training on. It helps to
#diagnose situations such as high variance or high bias.

#We'll call this module in the next segment.

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    #Shade one standard deviation around each mean curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
```
```Automatically created module for IPython interactive environment

```
In [38]:
```#Let's try one last technique: creating a cross-validation set. As mentioned earlier, when you tune parameters
#based on Test results, you can end up biasing your model towards the Test set. Although this may not be an issue
#in many instances, you can create a cross-validation set to avoid it.

#Luckily scikit-learn has built-in packages that can help with this. We'll use a cross-validation generator that
#trains the model on one subset and evaluates it on a cross-validation (cv) subset picked from within the training
#set. A different cv subset is picked for each iteration, and we control the number of iterations. Then we'll run
#a grid search over these train/cv splits, evaluating the model on each split and returning the parameter setting
#that gives the best result.

#OK let's now follow these steps to implement the above and see if it makes any difference to the result

#80/20 test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Choose Estimator as KNN
estimator = KNeighborsClassifier(n_neighbors=3)

#Choose cross-validation generator - let's choose ShuffleSplit which randomly shuffles and selects Train and CV sets
#for each iteration. There are other methods like the KFold split.
from sklearn.cross_validation import ShuffleSplit
#cv = KFold(X_train.shape[0])
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)

#Apply the cross-validation iterator on the Training set using GridSearchCV. This will run the classifier on the
#different train/cv splits using parameters specified and return the model that has the best results

#Note that we are tuning based on the F1 score, 2PR/(P+R), where P is Precision and R is Recall. This may not always be
#the best score to tune our model on. I will explore this area further in a separate exercise. For now, we'll use F1.

from sklearn.grid_search import GridSearchCV
classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=dict(n_neighbors=[2,3,4,5]), scoring='f1')

#Also note that we're feeding multiple neighbors to the GridSearch to try out.

#We'll now fit the training dataset to this classifier
classifier.fit(X_train, y_train)

#Let's look at the best estimator that was found by GridSearchCV
print classifier.best_estimator_

#Let's look at the learning curve for our model based on number of training examples. We'll call the plot_learning_curve
#module by feeding it the estimator (best estimator returned from GS) and train/cv sets.

#The module simply runs the estimator multiple times on subsets of the data provided and plots the train and cv scores.
from sklearn.learning_curve import learning_curve

title = 'Learning Curves (kNN, n_neighbors=%d)' % classifier.best_estimator_.n_neighbors
estimator = KNeighborsClassifier(n_neighbors=classifier.best_estimator_.n_neighbors)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv)
plt.show()

#Great, we see that as the number of training samples increases, both training and cv scores are on the rise. This is
#a good sign. If training score rose but cv score fell that would indicate high variance (overfitting). If both train and cv
#scores are low that would indicate high bias (underfitting).
```
```KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

```
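Before scoring the test set, it can be instructive to see how each candidate k fared inside the grid search. The fitted GridSearchCV from this older sklearn.grid_search API exposes a grid_scores_ attribute with the mean cross-validated score per parameter setting:

```
#Mean cross-validated F1 for each n_neighbors value the grid search tried
for params, mean_score, cv_scores in classifier.grid_scores_:
    print "n_neighbors=%d, mean F1=%.3f" % (params['n_neighbors'], mean_score)
```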
In [42]:
```#Finally, let's predict the test labels and look at the final scores. Remember, this test split played no part in
#the parameter tuning, so this gives a fair estimate of how well the model will generalize.
y_pred = classifier.predict(X_test)
print "Final Classification Report"
print metrics.classification_report(y_test, y_pred)
print "Generalization Accuracy: %.6f" %metrics.accuracy_score(y_test,y_pred)
```
```Final Classification Report
             precision    recall  f1-score   support

        0.0       0.83      0.81      0.82       107
        1.0       0.59      0.62      0.60        47

avg / total       0.76      0.75      0.75       154

Generalization Accuracy: 0.753247

```

So we have a pretty decent kNN-based model that predicts, with roughly 75% accuracy, whether a person has diabetes, given the kind of measurements available in the PIMA Indians Diabetes dataset from UCI.