Predicting Forest Cover Types with Ensemble Learning
This is a write-up of one of my approaches to solving the forest cover type prediction challenge hosted by Kaggle. Feel free to use it for your own reference, and let me know if you have any suggestions on how I can improve the model :-)
Being a nature lover, I found this topic very engaging. The features are also very friendly and don't require much domain knowledge to explore (and hopefully to engineer new features from).
OK let's get started.
Below is an excerpt from the competition page:
In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
So we're going to be provided training data based on existing cover types and we're expected to predict integer classifications for each forest cover type. Let's look at the types of forest covers we're going to predict.
1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz
Get the data
Now is the time to go grab the train and test datasets from here. Place them in your working directory (or host them somewhere like Github). Let's very quickly explore the features available to us for this classification task:
Elevation - Elevation in meters
Aspect - Aspect in degrees azimuth
Slope - Slope in degrees
Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation
There are 4 different wilderness areas and 40 different soil types. These have already been converted into machine-learning-ready binary columns for us (you should almost never expect this to be done for you in real life!).
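For reference, if the categorical columns had not already been expanded, converting them yourself is a one-liner with pandas. The toy Soil_Type frame below is made up purely for illustration:
#Hypothetical example: expanding a single categorical column into binary indicator
#columns, the way Wilderness_Area and Soil_Type already come prepared in this dataset
import pandas as pd
raw = pd.DataFrame({'Soil_Type': ['ST1', 'ST2', 'ST1', 'ST3']})
soil_dummies = pd.get_dummies(raw.Soil_Type, prefix='Soil_Type')
print soil_dummies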
Let's now read the dataset using pandas.
#Import the required Python packages
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split, Bootstrap
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
%matplotlib inline
traincsv=pd.read_csv('train.csv', index_col='Id')
testcsv=pd.read_csv('test.csv', index_col='Id')
orig_traincsv = traincsv.copy()
orig_testcsv = testcsv.copy()
Let's take a look at a summary of the training set. The test set should look the same, except for the Cover_Type column.
traincsv.head(2)
The data looks as smooth as silk, no dates, no strings, nothing alarming. Are there any nulls?
traincsv[traincsv.isnull().any(axis=1)]
There are no rows with null values. Since all the columns are numeric, we should also keep an eye on 0 as a possible missing value. Now, before we proceed to explore the features, I'd like to do a quick and dirty submission with a Random Forest model, with no changes to the dataset. This gives us a baseline to compare against as we make changes down the line.
Initial Submission (Random Forests Classifier)
Let's fit a Random Forest Classifier to the traincsv dataset. We'll then quickly run predictions on the testcsv dataset and submit to Kaggle.
Note that for this initial submission we are not performing any cross-validation or even a train/test split; we're simply running a quick and dirty submission to check out the results. For our subsequent submissions, we will follow due process.
#Initial Random Forest Model fitting with 1000 estimators. Note below, that the labels lie in the last column, Cover Type.
rf_initial=RandomForestClassifier(n_estimators=1000, bootstrap=True, oob_score=True)
rf_initial.fit(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
print "Initial Traincsv score: %.2f" %rf_initial.score(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
So with 1000 estimators and no cross-validation or hyperparameter tuning, we're able to predict 100% of the Cover Types in the train.csv file. Note that we wouldn't know at this point if there is any overfitting; we also do not have a test set (not the same as test.csv) for comparison. OK, let's now run predictions on testcsv and submit to Kaggle.
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=rf_initial.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('RF-Initial.csv', header=True)
The submission to Kaggle scored 0.75366, putting us ahead of more than half of the leaderboard. And we haven't even touched the dataset yet; we simply fit an out-of-the-box random forest to it.
OK let's see if and how we can improve this score. We'll run through a series of visualizations to understand the data better. But before we do that, let's look at a very valuable statistic provided by Random Forests, Feature Importances. This will tell us which features in the dataset matter the most to our predictions. It helps to understand where we should focus our energy.
Feature Importances
We'll define this as a method so we can call on this later if necessary.
#Always call fit on the estimator before invoking this method.
def importances(estimator, col_array, title):
    #Calculate the feature ranking - Top 20
    importances = estimator.feature_importances_
    indices = np.argsort(importances)[::-1]
    print "%s Top 20 Important Features\n" %title
    for f in range(20):
        print("%d. %s (%f)" % (f + 1, col_array.columns[indices[f]], importances[indices[f]]))
    #Mean Feature Importance
    print "\nMean Feature Importance %.6f" %np.mean(importances)
    #Plot the Top 10 feature importances of the forest
    indices = indices[:10]
    plt.figure()
    plt.title(title+" Top 10 Feature importances")
    plt.bar(range(10), importances[indices],
            color=["g", "r"], align="center")
    plt.xticks(range(10), col_array.columns[indices], fontsize=14, rotation=90)
    plt.xlim([-1, 10])
    plt.show()
#Call the method we just created to display the feature importances
importances(rf_initial, traincsv, "Cover Type (Initial RF)")
So there it is: Elevation is the most important feature in the dataset, almost 2.5 times more important than the second feature in the Top 10. If we can manufacture more features from Elevation, we'll almost certainly improve our prediction accuracy.
We'll continue to refer back to these importances as we move through this exercise.
Data Exploration
We'll now begin to examine the data through a series of visualizations. Let's start by looking at histograms of the features to find any missing values.
traincsv.ix[:,:10].hist(figsize=(16,10), bins=50)
plt.show()
Looks like there are some missing values (zeros) for Hillshade at 3 PM. If there was shade at 9 AM, I'd expect little shade at 3 PM, but 0 doesn't sit right. We'll explore this further down the line, since Hillshade 3 PM is in the Top 10.
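Before moving on, a quick count of how many rows are affected (just a rough check; zero is a legitimate hillshade value, so this only flags candidates):
#Count the rows where Hillshade_3pm is exactly zero in each dataset
print "Zero Hillshade_3pm rows in train.csv: %d" %(traincsv.Hillshade_3pm == 0).sum()
print "Zero Hillshade_3pm rows in test.csv: %d" %(testcsv.Hillshade_3pm == 0).sum()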
Let's now plot an SPLOM to analyze the Top 5 features. I am interested to see how Elevation relates to the others.
with seaborn.axes_style('white'):
    smaller_frame = traincsv[['Elevation', 'Horizontal_Distance_To_Roadways',
                              'Horizontal_Distance_To_Fire_Points',
                              'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology']]
    smaller_frame.columns = smaller_frame.columns.map(lambda x: x.replace('Horizontal_Distance_To', 'HD'))
    smaller_frame.columns = smaller_frame.columns.map(lambda x: x.replace('Vertical_Distance_To', 'VD'))
    scatter_matrix(smaller_frame, figsize=(14, 14), diagonal="kde")
plt.show()
Linearity of Cover Types
I am interested to explore some of these relationships further. Especially how Elevation plays with the following:
- Vertical Distance to Hydrology
- Horizontal Distance to Hydrology
- Horizontal Distance to Roadways
The reason for my interest is linearity. Tree-based models make decisions by grouping values based on certain criteria for each label, for instance: if HD_Hydrology > 25, Cover Type is x. If we can create a feature in which more similar labels fall under the same bracket, we might achieve more success with Random Forests.
#Let's add Cover Type to smaller_frame so we can color code.
smaller_frame['Cover_Type']=traincsv.Cover_Type
#Elevation vs HD Hydrology
plt.figure(figsize=(8,8))
plt.scatter(smaller_frame.Elevation, smaller_frame.HD_Hydrology, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r)
plt.xlabel("Elevation")
plt.ylabel("HD Hydrology")
plt.show()
So if we created a new feature that reduced the Elevation values slightly in relation to HD Hydrology, we probably could get more similar labels to align with each other. Let's find out.
#Elevation adjusted vs HD Hydrology
plt.figure(figsize=(8,8))
plt.scatter(smaller_frame.Elevation-0.2*smaller_frame.HD_Hydrology,
smaller_frame.HD_Hydrology, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r)
plt.xlabel("Elevation to HD Hydrology")
plt.ylabel("HD Hydrology")
plt.show()
You get the idea. We can do the same with the other two features as well.
#Elevation vs HD Roadways
plt.figure(figsize=(8,8))
plt.scatter(smaller_frame.Elevation-0.05*smaller_frame.HD_Roadways,
smaller_frame.HD_Roadways, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r)
plt.xlabel("Elevation to HD Roadways")
plt.ylabel("HD Roadways")
plt.show()
#Elevation vs VD Hydrology
plt.figure(figsize=(8,8))
plt.scatter(smaller_frame.Elevation-smaller_frame.VD_Hydrology,
smaller_frame.VD_Hydrology, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r)
plt.xlabel("Elevation to VD Hydrology")
plt.ylabel("VD Hydrology")
plt.show()
Filling in missing values
Let's now turn our attention to Hillshade 3 PM where we found some missing values (zeros). We will plot an SPLOM to understand how the hillshade features are related.
with seaborn.axes_style('white'):
    smaller_frame = traincsv[['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']]
    scatter_matrix(smaller_frame, alpha=0.8, figsize=(12, 12), diagonal="kde")
plt.show()
OK, that makes sense. Looking at the relationship between Hillshade 3 PM and 9 AM, they have a clear negative correlation, while 3 PM and Noon have a positive correlation. But are there values that are 0?
plt.figure(figsize=(12,8))
plt.scatter(traincsv.Hillshade_3pm,traincsv.Hillshade_Noon, c=traincsv.Cover_Type, s=50, cmap=plt.cm.BrBG)
plt.xlabel('Hillshade @ 3 PM')
plt.ylabel('Hillshade @ Noon')
plt.show()
plt.figure(figsize=(12,8))
plt.scatter(traincsv.Hillshade_3pm,traincsv.Hillshade_9am, c=traincsv.Cover_Type, s=50, cmap=plt.cm.BuPu)
plt.xlabel('Hillshade @ 3 PM')
plt.ylabel('Hillshade @ 9 AM')
plt.show()
There are clearly Hillshade 3 PM values that are zeros. Before we fill these in, just for kicks let's plot the relationship between hillshade at 9 AM and Noon where the 3 PM hillshade is 0. There must be some trend there that we can spot.
smaller_frame=traincsv[traincsv.Hillshade_3pm==0]
plt.figure(figsize=(12,8))
plt.scatter(smaller_frame.Hillshade_9am,smaller_frame.Hillshade_Noon, c=smaller_frame.Cover_Type, s=50, cmap=plt.cm.Oranges)
plt.suptitle('Correlation b/w 9 AM to Noon where 3 PM hillshade is 0', fontsize=15)
plt.xlabel('Hillshade @ 9 AM')
plt.ylabel('Hillshade @ Noon')
plt.show()
Great, we see a clear positive correlation there. Perhaps, we could fit a regressor to the traincsv dataset to predict the missing 3 PM hillshade values.
Predicting missing Hillshade 3 PM values using Gradient Boosted Regression Trees
OK, let's now extract the columns we've chosen and fit a GBRT model to this dataset. Note that any changes we make should also be made to the test.csv dataset so we can generate the final predictions. Let's play nice and train only on values from train.csv, then use the model to predict the missing values in both datasets.
Reorder Columns
I like reordering columns so that the value we're predicting (a continuous one in this case) is the last column. Let's make a copy this time, though, since we don't want to change the original traincsv dataset.
Hillshade 3 PM is the 9th column and we want to move it to the end.
#Make a copy and Reorder train.csv columns
temp=traincsv.copy()
cols=temp.columns.tolist()
cols=cols[:8]+cols[9:]+[cols[8]]
temp=temp[cols]
#Delete the Forest Cover Type column. We can't train on this since it doesn't exist in the test.csv file
del temp['Cover_Type']
#Split the train.csv file into train (available values) and missing values based on hillshade 3 PM.
X,y,X_traincsv_missing,y_traincsv_missing= temp[temp.Hillshade_3pm!=0].values[:,:-1], \
temp[temp.Hillshade_3pm!=0].values[:,-1:].ravel(), \
temp[temp.Hillshade_3pm==0].values[:,:-1], \
temp[temp.Hillshade_3pm==0].values[:,-1:].ravel()
#Let's very quickly do a train/cv split so we can see how the model will generalize. Note that we're calling this cv since
#we're not going to be doing exhaustive cross-validation. We'll test performance using the cv set and predict on test. I am
#calling this cv so as to not confuse with terminology.
X_train,X_test,y_train,y_test=train_test_split(X,y)
#Fit a Gradient Boosted Regression Tree model to this dataset.
gbrt=GradientBoostingRegressor(n_estimators=1000)
gbrt.fit(X_train,y_train)
Let's quickly check how well the model fits the dataset and whether there's any overfitting. Note that we're not doing any extensive cross-validation in this case; we'll save that for the actual forest cover type prediction.
print 'Training R-squared value: %.2f' %gbrt.score(X_train, y_train)
print 'Test R-squared value: %.2f' %gbrt.score(X_test, y_test)
Great, looks like we've done a quick and dirty but fairly neat job: the cross-validation R-squared is as good as the training result. Before we move on, I always like to look at the feature importances to see what matters the most (and the least).
# Calculate the feature ranking - Top 10
importances(gbrt, temp, "Hillshade 3 PM")
So it looks like, apart from our earlier suspects, Slope has contributed quite a bit. Great, let's now run predictions on both train.csv and test.csv and map them back to our original datasets.
#Predict and fill missing values in train.csv
temp.Hillshade_3pm.loc[temp.Hillshade_3pm==0]=gbrt.predict(X_traincsv_missing)
traincsv.Hillshade_3pm=temp.Hillshade_3pm
#Make a copy and Reorder test.csv columns
temp=testcsv.copy()
cols=temp.columns.tolist()
cols=cols[:8]+cols[9:]+[cols[8]]
temp=temp[cols]
#Extract missing rows from test.csv then predict and fill in the blanks.
X_testcsv_missing= temp[temp.Hillshade_3pm==0].values[:,:-1]
temp.Hillshade_3pm.loc[temp.Hillshade_3pm==0]=gbrt.predict(X_testcsv_missing)
testcsv.Hillshade_3pm=temp.Hillshade_3pm
OK, I can't put this to bed until I validate that we've done the job well. Let's re-plot Hillshade 3 PM against Noon for both train.csv and test.csv to be sure.
fig, ax=plt.subplots(1,2,figsize=(12,4))
ax[0].scatter(traincsv.Hillshade_3pm,traincsv.Hillshade_Noon, c=traincsv.Cover_Type, s=50, cmap=plt.cm.OrRd)
ax[0].set_xlabel('Hillshade @ 3 PM')
ax[0].set_ylabel('Hillshade @ Noon')
ax[0].set_title('Train.csv', fontsize=14)
ax[1].scatter(testcsv.Hillshade_3pm,testcsv.Hillshade_Noon, s=50, cmap=plt.cm.PuBu)
ax[1].set_xlabel('Hillshade @ 3 PM')
ax[1].set_ylabel('Hillshade @ Noon')
ax[1].set_title('Test.csv', fontsize=14)
plt.show()
That does tell us we've actually done a pretty good job of filling in the blanks.
Great, perhaps now we can look at the distribution of cover types to see if anything stands out.
traincsv.Cover_Type.hist(bins=50)
plt.show()
Huh! That tells us we're still in dreamland: the cover types are perfectly evenly distributed in the training set.
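The exact counts back that up (a quick check):
#Exact number of training rows per cover type - the classes are perfectly balanced
print traincsv.Cover_Type.value_counts()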
Feature Engineering
Can we manufacture any new features for this dataset? The first one that springs to mind is the slope to the water features (hydrology), since we've already been given the vertical and horizontal distances.
Slope distance and Percent to Hydrology
Let's calculate the slope distance and slope percent using the formulas below. (In the code we'll keep the rise/run ratio without the ×100; a constant scaling makes no difference to tree-based models.)
\[slope\ distance = \sqrt{(vertical\ distance)^2 + (horizontal\ distance)^2}\]
\[slope\ percent = \frac{rise}{run} \times 100\]
#Create two new columns named Slope hydrology and Slope hydrology percent and remove any infinite values that may result
traincsv['slope_hyd'] = np.sqrt(traincsv.Vertical_Distance_To_Hydrology**2 + \
traincsv.Horizontal_Distance_To_Hydrology**2)
traincsv.slope_hyd=traincsv.slope_hyd.map(lambda x: 0 if np.isinf(x) else x)
traincsv['slope_hyd_pct'] = traincsv.Vertical_Distance_To_Hydrology / traincsv.Horizontal_Distance_To_Hydrology
traincsv.slope_hyd_pct=traincsv.slope_hyd_pct.map(lambda x: 0 if np.isinf(x) else x)
#Apply changes to test.csv as well
testcsv['slope_hyd'] = np.sqrt(testcsv.Vertical_Distance_To_Hydrology**2 + \
testcsv.Horizontal_Distance_To_Hydrology**2)
testcsv.slope_hyd=testcsv.slope_hyd.map(lambda x: 0 if np.isinf(x) else x)
testcsv['slope_hyd_pct'] = testcsv.Vertical_Distance_To_Hydrology / testcsv.Horizontal_Distance_To_Hydrology
testcsv.slope_hyd_pct=testcsv.slope_hyd_pct.map(lambda x: 0 if np.isinf(x) else x)
Elevation adjusted by distance to Hydrology and Roadways
Let's now add the features we discussed earlier - Elevation adjusted by the horizontal and vertical distances to Hydrology and Roadways.
#Elevation adjusted by Horizontal distance to Hydrology
traincsv['Elev_to_HD_Hyd']=traincsv.Elevation - 0.2 * traincsv.Horizontal_Distance_To_Hydrology
testcsv['Elev_to_HD_Hyd']=testcsv.Elevation - 0.2 * testcsv.Horizontal_Distance_To_Hydrology
#Elevation adjusted by Horizontal distance to Roadways
traincsv['Elev_to_HD_Road']=traincsv.Elevation - 0.05 * traincsv.Horizontal_Distance_To_Roadways
testcsv['Elev_to_HD_Road']=testcsv.Elevation - 0.05 * testcsv.Horizontal_Distance_To_Roadways
#Elevation adjusted by Vertical distance to Hydrology
traincsv['Elev_to_VD_Hyd']=traincsv.Elevation - traincsv.Vertical_Distance_To_Hydrology
testcsv['Elev_to_VD_Hyd']=testcsv.Elevation - testcsv.Vertical_Distance_To_Hydrology
Distance to Amenities
We saw earlier that the distances to amenities like water, fire points and roadways played a key role in determining the cover type. Let's create some more features based on these horizontal distances.
#Mean distance to Amenities
traincsv['Mean_Amenities']=(traincsv.Horizontal_Distance_To_Fire_Points +
traincsv.Horizontal_Distance_To_Hydrology +
traincsv.Horizontal_Distance_To_Roadways) / 3
testcsv['Mean_Amenities']=(testcsv.Horizontal_Distance_To_Fire_Points +
testcsv.Horizontal_Distance_To_Hydrology +
testcsv.Horizontal_Distance_To_Roadways) / 3
#Mean Distance to Fire and Water
traincsv['Mean_Fire_Hyd']=(traincsv.Horizontal_Distance_To_Fire_Points +
traincsv.Horizontal_Distance_To_Hydrology) / 2
testcsv['Mean_Fire_Hyd']=(testcsv.Horizontal_Distance_To_Fire_Points +
testcsv.Horizontal_Distance_To_Hydrology) / 2
Great, I can't think of any other new features at this point. I'll come back and add more if I get any ideas; for now, let's move forward.
Reorder columns
Let's reorder the columns in train and test csv datasets again so Cover Type is the last column.
#Reorder train.csv columns
cols=traincsv.columns.tolist()
cols=cols[:10]+cols[-7:]+cols[10:-7]
traincsv=traincsv[cols]
#Reorder test.csv columns
cols=testcsv.columns.tolist()
cols=cols[:10]+cols[-7:]+cols[10:-7]
testcsv=testcsv[cols]
#Take a backup of Train.csv and Test.csv so we don't need to run from the beginning if there are issues.
traincsvbk=traincsv.copy()
testcsvbk=testcsv.copy()
OK, that brings our exploration phase to a close (at least for the time being). Note that typically I would create a single function to apply any changes to both train and test (instead of doing everything twice), but I've chosen a write-as-I-go approach here for convenience.
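For the record, such a helper might look something like the sketch below (hypothetical - add_features isn't used anywhere else in this notebook; it just bundles the features we created above):
#Hypothetical helper that applies the engineered features to any dataframe, so the
#train.csv and test.csv transformations can't drift out of sync (not used below)
def add_features(df):
    df = df.copy()
    df['slope_hyd'] = np.sqrt(df.Vertical_Distance_To_Hydrology**2 +
                              df.Horizontal_Distance_To_Hydrology**2)
    df['Elev_to_HD_Hyd'] = df.Elevation - 0.2 * df.Horizontal_Distance_To_Hydrology
    df['Elev_to_HD_Road'] = df.Elevation - 0.05 * df.Horizontal_Distance_To_Roadways
    df['Elev_to_VD_Hyd'] = df.Elevation - df.Vertical_Distance_To_Hydrology
    df['Mean_Amenities'] = (df.Horizontal_Distance_To_Fire_Points +
                            df.Horizontal_Distance_To_Hydrology +
                            df.Horizontal_Distance_To_Roadways) / 3
    return df

#Usage would then simply be:
#traincsv = add_features(traincsv)
#testcsv = add_features(testcsv)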
Let's now proceed to train multiple classifiers on this dataset, make submissions to Kaggle and run model comparisons.
Model Fitting and Cross Validation
We'll now fit our dataset through a series of models and hypertune each through cross-validation. Let's start with everyone's favorite, the Random Forests Classifier.
Train Test Split
Before we do anything else, let's split the traincsv into training and test sets. Note that test from this point on refers to the test split that we're creating now (from traincsv). We will NOT be using the testcsv dataset until we're ready to make predictions.
#75/25 Split
X_train, X_test, y_train, y_test = train_test_split(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
Cross Validation Generator
All right let's do this the right way.
We'll use a cross-validation generator to carve out train and CV subsets for fine-tuning hyperparameters such as the number of trees and the maximum depth. These hyperparameters are critical to most models. If we tuned them against the test set, we would bias ourselves towards the test set and once again fail to generalize well; worse, we would have no good way of finding out, since we would have essentially trained on all our data.
Luckily scikit-learn has built-in tools that help with this. We'll use a cross-validation generator that trains the model and tunes its parameters against a cross-validation subset (cv) picked from within the training set; a different cv subset is picked for each iteration, and we control the number of iterations. We'll then pass these train/cv splits to a grid search function that evaluates the model on each split and returns the parameter combination that gives the best result.
Let's define this as a function so we can call on it as many times as we please later on.
def CrossValidate(estimator, param_grid, n_jobs):
    #Choose a cross-validation generator - ShuffleSplit randomly shuffles and selects Train and CV sets
    #for each iteration. There are other methods such as the KFold split.
    cv = ShuffleSplit(X_train.shape[0], n_iter=3, test_size=0.2)
    #Apply the cross-validation iterator on the training set using GridSearchCV. This will run the classifier
    #on the different train/cv splits using the parameters specified and return the model with the best results.
    #Note that we are tuning based on classification accuracy, which is the metric Kaggle uses to evaluate
    #this competition.
    classifier = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid, n_jobs=n_jobs, scoring='accuracy')
    #Note that we're feeding multiple candidate parameter values to the grid search to try out.
    #We'll now fit the training dataset to this classifier
    classifier.fit(X_train, y_train)
    #Let's look at the best estimator that was found by GridSearchCV
    print "Best Estimator learned through GridSearch"
    print classifier.best_estimator_
    return cv, classifier.best_estimator_
Learning Curves
Let's define the plot_learning_curve function provided in the scikit-learn documentation. It allows us to quickly visualize how well the model performs as a function of the number of samples we train on, and helps us spot situations such as high variance or high bias that often arise during model fitting.
We'll once again define this as a function so we can call on it multiple times.
#numpy, matplotlib and learning_curve are already imported above, so we can define the function directly.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
Random Forest Classifier
Alright, here goes. Let's fit an RF classifier to our dataset and tune its hyperparameters using cross-validation. The most critical parameters for a Random Forest are the max depth (how deep we grow the trees) and the number of estimators (trees).
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. TRY ADJUSTING parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition) and
#reducing the number of values being tried for max_depth/n_estimators.
#EXERCISE CAUTION WHILE SETTING n_jobs --> set it based on available h/w resources.
#BEWARE OF THE max_features=None setting - it can take an extremely long time to cross-validate.
#You can leave it out instead.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
estimator=RandomForestClassifier()
param_grid={'n_estimators': [1000], #[100, 1000, 5000],
'max_depth': [8, 10, 12, 14],
# 'max_features': None,
# 'oob_score': ['True']
# 'n_jobs': [4]
}
n_jobs=4
#Let's fit RF to the training dataset by calling the function we just created.
cv,best_estimator=CrossValidate(estimator, param_grid, n_jobs)
So it looks like the best Random Forest estimator found by our grid search has 1000 trees grown to a max depth of 14. How well did we score on the training set?
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_estimator.n_estimators
print "max_depth: %d" %best_estimator.max_depth
print
print "Training Score(F1): %.2f" %best_estimator.score(X_train,y_train)
Let's now look at the feature importances to see how our features have fared.
#GridSearchCV has already refit the best estimator on the training set, so feature_importances_ is available.
# best_estimator.fit(X_train, y_train)
#Call the Importances Method
importances(best_estimator, traincsv, "Cover Type - Random Forests try I")
So it looks like we've done a pretty good job manufacturing features: almost all of the features we engineered have made it into the Top 10 and have eaten into the importance score of Elevation.
Great, let's now plot the learning curve to see if there's any overfitting (or underfitting). We'll call plot_learning_curve, feeding it the best estimator returned by the grid search and the train/cv splits. The function simply runs the estimator multiple times on subsets of the data provided and plots the train and cv scores. Note that we're now feeding the estimator the best parameters learned from GridSearchCV; we may need to adjust the hyperparameters depending on what the learning curve shows.
title = "Learning Curves (Random Forests, n_estimators=%d, max_depth=%.6f)" %(best_estimator.n_estimators, \
best_estimator.max_depth)
plot_learning_curve(best_estimator, title, X_train, y_train, cv=cv, n_jobs=4)
plt.show()
That's pretty great performance on the training set but not so great on cross-validation, only about 85% accuracy. The gap between the curves is typical of an overfitting scenario, where the training accuracy is far better than the cross-validation accuracy. Let's try changing a couple of parameters to see if we can improve on this.
max_features - we can try adjusting this. The default ('auto') leaves it at sqrt(n_features), but I'd like to consider more of the top features per split (based on our feature importance ranking), roughly those whose importance is at or above the mean importance.
We'll also experiment with a couple of other parameters and try to tune them.
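As a quick sanity check on that max_features choice, here's a rough sketch that reuses the fitted estimator from the grid search:
#How many features sit at or above the mean importance? A rough guide for max_features.
imps = best_estimator.feature_importances_
print "Features at or above mean importance: %d" %np.sum(imps >= np.mean(imps))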
best_estimator.set_params(max_features=25, min_samples_leaf=2)
title = "Learning Curves (Random Forests, n_estimators=%d, max_depth=%.6f, max_features=%d, min_samples_leaf=%d)" \
%(best_estimator.n_estimators, best_estimator.max_depth, best_estimator.max_features, \
best_estimator.min_samples_leaf)
plot_learning_curve(best_estimator, title, X_train, y_train, cv=cv, n_jobs=4)
plt.show()
The test accuracy seems to have gone up ever so slightly (almost hard to notice). Let's run predictions on the test set (not the testcsv file) to see how we fare.
best_estimator.set_params(max_features=25, min_samples_leaf=1)
#Call fit on X_train so we can run predictions
best_estimator.fit(X_train, y_train)
#Run predictions on the Test set
y_pred=best_estimator.predict(X_test)
print "Training Score: %.2f" %best_estimator.score(X_train,y_train)
print "Test Score: %.2f" %best_estimator.score(X_test,y_test)
print
print "Classification Report - Test"
print metrics.classification_report(y_test, y_pred)
That's interesting to see. The two cover types our model struggles with are labels 1 and 2; the rest score at least around 80%. Overall we have about an 86% success rate on the test set. Compared to the training performance this is nowhere close, but let's see how well we do on the Kaggle leaderboard.
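To see exactly which classes get mistaken for one another (labels 1 and 2 in particular), a confusion matrix is a handy optional check:
#Optional: confusion matrix on our held-out test split
#(rows = true cover types 1-7, columns = predicted cover types)
print metrics.confusion_matrix(y_test, y_pred)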
Run Predictions and Submit to Kaggle
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=best_estimator.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('RF-tune1.csv', header=True)
Surprise! The leaderboard score was only 0.71, so after putting in all that effort on feature engineering, we've only managed to drag ourselves downwards. But we're going to brave through this and find out why it happened. I can think of two reasons:
- When we submitted the initial result, we trained on the entire traincsv file, but here we made a train/test split and thereby lost about 3K rows that we could have used for training. Still, if we split our train/test sets well (looking at the classification report above, the classes were quite evenly distributed in X_test) and tuned our parameters well, this shouldn't be such a big deal.
Variation b/w cross-val and Kaggle LB scores
- This is the most disappointing part. Our own validation says we should get about 86% accuracy, while the Kaggle LB says we only got 71%. That is a huge gap. Why? Our cross-validation score is an average over all labels, and we know the labels were distributed equally in the training set. But what happened with the testcsv dataset? Which labels did we predict the most, and for those labels, what was our accuracy?
Let's find out.
#Derive counts of each predicted label in the testcsv dataset
class_weights=pd.DataFrame({'Class_Count':temp.groupby(temp).agg(len)}, index=None)
print class_weights
Wow, look at that. Labels 1 and 2, which ironically we had the most trouble predicting, make up the bulk of the labels our model predicted for testcsv; the remaining labels occur at roughly an eighth of that count or less. Based on our cross-validation, the mean accuracy for labels 1 and 2 is only about 75%, which is pretty close to our LB score. So let's take heart, our cross-validation isn't all that bad.
Now that we know how non-uniform our testcsv predictions are, let's try creating a weight for each class and feeding it to our model. For the weights, I am simply going to use each class's fraction of occurrences in testcsv. Note that we don't actually know the real labels of that dataset; we're just basing this on our own predictions.
Determine Class Weights based on test.csv predictions
We already know the number of predictions for each label. Let's just divide that by the total number of predictions to get the weight for each class.
We'll then feed this back to our model.
class_weights['Class_Weights'] = temp.groupby(temp).agg(len)/len(temp)
print class_weights
There we have it. Now we just need to map these weights onto our list of training labels (y_train). We can do this with a simple label-based indexing trick in pandas.
sample_weights=class_weights.ix[y_train]
print sample_weights
Re-train the Random Forest
Let's now retrain our Random Forest model based on the class weights. We can then continue to predict the testcsv labels and submit to Kaggle.
best_estimator.fit(X_train, y_train, sample_weight=sample_weights.Class_Weights.values)
Now, we'll regenerate the test classification report and generate a new submission file for Kaggle.
#Run predictions on the Test set
y_pred=best_estimator.predict(X_test)
print "Training Score: %.2f" %best_estimator.score(X_train,y_train)
print "Test Score: %.2f" %best_estimator.score(X_test,y_test)
print
print "Classification Report - Test"
print metrics.classification_report(y_test, y_pred)
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=best_estimator.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('RF-tune2.csv', header=True)
The model scored 0.7694 on the leaderboard, which is definitely an improvement over where we were before. I'd like to try a couple of things now. We split our train and test sets 75/25 earlier; seeing how much of an impact training on a bit more data has with this dataset, I'm inclined to change that ratio. Let's try 90/10 and see how we fare.
#90/10 Split
X_train, X_test, y_train, y_test = train_test_split(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel(),
test_size=0.1)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
#Regenerate Sample Weights since train size has changed
sample_weights=class_weights.ix[y_train]
#Call Fit on the training set
best_estimator.fit(X_train, y_train, sample_weight=sample_weights.Class_Weights.values)
#Run predictions on the Test set
y_pred=best_estimator.predict(X_test)
print "Training Score: %.2f" %best_estimator.score(X_train,y_train)
print "Test Score: %.2f" %best_estimator.score(X_test,y_test)
print
print "Classification Report - Test"
print metrics.classification_report(y_test, y_pred)
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=best_estimator.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('RF-tune3.csv', header=True)
rf_final=best_estimator
So that did not improve our Leaderboard score. Let's try another classifier to see if we can better our predictions.
ExtraTrees Classifier
Let's now try fitting an ExtraTrees classifier to our dataset. There are two critical differences between an Extra Trees classifier and a Random Forest:
- Extremely randomized trees don’t apply the bagging procedure to construct a set of the training samples for each tree. The same input training set is used to train all trees.
- Extremely randomized trees pick node splits far more randomly: candidate thresholds are drawn at random for each candidate feature, whereas a Random Forest searches for the optimal split (by variable index and splitting value) among a random subset of variables.
We're not going to tune any hyperparameters, since they're very similar to those of a Random Forest, but we'll still measure the test score while fitting the model to the training set.
ext=ExtraTreesClassifier(n_estimators=1000, oob_score=True, bootstrap=True)
ext.fit(X_train, y_train, sample_weight=sample_weights.Class_Weights.values)
#Run predictions on the Test set
y_pred=ext.predict(X_test)
print "Training Score: %.2f" %ext.score(X_train,y_train)
print "Test Score: %.2f" %ext.score(X_test,y_test)
print "OOB Generalization Score: %.2f" %ext.oob_score_
print
print "Classification Report - Test"r
print metrics.classification_report(y_test, y_pred)
So it looks like our generalization performance (the test score as well as the out-of-bag score, the model's own estimate of generalization) has improved considerably by switching from Random Forests to the ExtraTrees classifier. Let's try our luck with the leaderboard now.
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=ext.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('ET-tune1.csv', header=True)
The model did not score any better; looks like we've hit a stone wall at 0.7694. Nevertheless, let me try one last technique before I move on to something else: blending.
Ensemble Learning - Blending
Blending is an ensemble learning method wherein you take the results from multiple classifiers and "blend" them together in some way.
There are multiple ways to do this, such as voting or simply averaging the results. But we're going to try training a logistic regression model on top of the predictions generated by our existing models, then use the results from the logistic regression model for the actual submission.
I am not sure whether this is the best approach for this problem or how well it will work, but I'd simply like to get a taste of it. Note that the models we've already built are ensemble learners themselves, so I'm not quite sure how much we're going to gain here.
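For comparison, the simpler voting option mentioned above might look something like the sketch below (a hypothetical helper, not used for the submissions here; with only two level-0 models, ties limit its usefulness):
#Majority voting across level-0 classifiers - a simpler alternative to a meta-model
from scipy.stats import mode

def majority_vote(level0_predictions):
    #level0_predictions: array of shape (n_samples, n_models) holding predicted labels
    votes, _ = mode(level0_predictions, axis=1)
    return votes.ravel()

#e.g. majority_vote(X_blend_test) once the level-0 predictions below have been stacked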
Gather predictions from Level 0 Models
Don't be put off by the fancy term; we're simply going to generate predictions with the Random Forests and Extra Trees models we've already trained.
estimators = [best_estimator, ext]
temp=testcsv.copy()
X_blend_train=[]
X_blend_test=[]
X_blend_testcsv=[]
for i, est in enumerate(estimators):
    X_blend_train.append(est.predict(X_train))
    X_blend_test.append(est.predict(X_test))
    X_blend_testcsv.append(est.predict(temp.values))

X_blend_train = np.array(X_blend_train).T
X_blend_test = np.array(X_blend_test).T
X_blend_testcsv = np.array(X_blend_testcsv).T
We'll now use these predictions (training set) to train a Logistic Regression model.
from sklearn.linear_model import LogisticRegression
lo=LogisticRegression().fit(X_blend_train, y_train)
Now we will generate the scores for Train and Test and also view the classification accuracy to see if anything has improved.
y_pred=lo.predict(X_blend_test)
print "Ensemble Learning - Blending Random Forests & Extra Trees Results"
print "-----------------------------------------------------------------"
print "Blended Training Score: %.2f" %lo.score(X_blend_train,y_train)
print "Blended Test Score: %.2f" %lo.score(X_blend_test,y_test)
print
print "Classification Report - Blended Test"
print metrics.classification_report(y_test, y_pred)
Looks like there's only a slight improvement: the test score has gone up by a point to 87%, and the training accuracy has also improved slightly. Let's submit to Kaggle and see if all of this work made any difference to the end result.
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=lo.predict(X_blend_testcsv)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('Blended-tune1.csv', header=True)
So there was no change to the result. We've had a good run: a not-so-great ranking on the LB, but we tried several techniques that might have made all the difference on another day. I probably should have done certain things differently, but I got what I wanted out of this - to learn and try something new. One thing I could have tried is adding a different kind of classifier, such as an SVM, to the blend, but I'll save that for another exercise.
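For the record, folding an SVM into the blend might have looked roughly like the sketch below (a hypothetical I haven't run; the SVC settings and the scaling step are my own assumptions, and an SVM on this much data can be slow):
#Hypothetical: add an SVM as a third level-0 model and re-fit the blender
from sklearn.svm import SVC

scaler = StandardScaler().fit(X_train)          #SVMs care about feature scaling
svm = SVC().fit(scaler.transform(X_train), y_train)

#Append the SVM predictions as an extra column of level-0 features
X_blend_train_svm = np.column_stack([X_blend_train, svm.predict(scaler.transform(X_train))])
X_blend_test_svm = np.column_stack([X_blend_test, svm.predict(scaler.transform(X_test))])

lo_svm = LogisticRegression().fit(X_blend_train_svm, y_train)
print "Blended Test Score (with SVM): %.2f" %lo_svm.score(X_blend_test_svm, y_test)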
Still, we learned a very valuable machine learning lesson: sometimes the simplest of models will beat the oh-so-complicated ones.
But I'll take some brownie points for trying.
Hope you got something out of this...Happy Reading!
Share more, Learn more!