Predicting Forest Cover Types with Ensemble Learning
This post documents one of my approaches to the forest cover type prediction challenge hosted by Kaggle. Feel free to use it as a reference, and let me know if you have any suggestions on how I can improve the model :-)
Being a nature lover, I found this topic very engaging. The features are also quite friendly and don't require much domain knowledge to explore (and, hopefully, to engineer new features from).
OK let's get started.
Below is an excerpt from the competition page:
In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
So we're going to be provided training data based on existing cover types and we're expected to predict integer classifications for each forest cover type. Let's look at the types of forest covers we're going to predict.
1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz
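Since we'll mostly see these classes as bare integers in the data, a small lookup table (a helper I'm adding here for convenience; it isn't part of the competition files) can make later plots and printouts more readable:
#Illustrative helper: map Cover_Type codes to readable names
cover_type_names = {1: 'Spruce/Fir', 2: 'Lodgepole Pine', 3: 'Ponderosa Pine',
                    4: 'Cottonwood/Willow', 5: 'Aspen', 6: 'Douglas-fir',
                    7: 'Krummholz'}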
Get the data
Now is the time to go grab the train and test datasets from here. Place them in your working directory (or host them somewhere like Github). Let's very quickly explore the features available to us for this classification task:
Elevation - Elevation in meters
Aspect - Aspect in degrees azimuth
Slope - Slope in degrees
Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation
There are 4 different wilderness areas and 40 different soil types. These have already been converted into machine-learning-ready binary (one-hot) columns for us (almost never expect this to happen in real life!).
Let's now read the dataset using pandas.
#Importing required Python packages
import matplotlib.pyplot as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs
from sklearn.decomposition import PCA
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from pprint import pprint
import pandas as pd
from pandas.plotting import scatter_matrix
import urllib
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
plt.style.use('ggplot') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
%matplotlib inline
traincsv=pd.read_csv('train.csv', index_col='Id')
testcsv=pd.read_csv('test.csv', index_col='Id')
orig_traincsv = traincsv.copy()
orig_testcsv = testcsv.copy()
Let's take a look at a summary of the training set. The test set should look the same, except for the Cover_Type column.
traincsv.head(2)
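As an extra check (a small addition on my part), the dtypes confirm that every column is numeric:
#All columns should come back as integer dtypes
traincsv.dtypes.value_counts()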
The data looks as smooth as silk, no dates, no strings, nothing alarming. Are there any nulls?
traincsv[traincsv.isnull().any(axis=1)]
There are no rows with null values. Considering all the columns are numeric, we should also keep an eye on 0 as a possible stand-in for missing values. Now, before we proceed to explore the features, I'd like to do a quick and dirty submission with a Random Forest model, with no changes to the dataset. This gives us a baseline against which any changes we make down the line can be compared.
Initial Submission (Random Forests Classifier)
Let's fit a Random Forest Classifier to the traincsv dataset. We'll then quickly run predictions on the testcsv dataset and submit to Kaggle.
Note that for this initial submission we are not performing any cross-validation or even a train/test split. We're simply running a quick and dirty submission to check out the results. For our subsequent submissions, we will follow due process.
#Initial Random Forest model fit with 1000 estimators. Note below that the labels lie in the last column, Cover_Type.
rf_initial = RandomForestClassifier(n_estimators=1000, bootstrap=True, oob_score=True)
rf_initial.fit(traincsv.iloc[:, :-1].values, traincsv.iloc[:, -1].values)
print("Initial traincsv score: %.2f" % rf_initial.score(traincsv.iloc[:, :-1].values, traincsv.iloc[:, -1].values))
So with 1000 estimators and no cross-validation or hyperparameter tuning, we're able to predict 100% of Cover Types on the train.csv file. Note that we wouldn't know at this point whether we're overfitting, since we haven't held out a test set of our own (not the same as test.csv) for comparison.
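One sanity check we do get for free is the out-of-bag estimate, since the model was fit with oob_score=True. Below is a small sketch I'm adding for illustration (not part of the original run) that prints the OOB score alongside a 5-fold cross-validation estimate; the exact numbers will vary.
#Sanity checks added for illustration: OOB estimate and 5-fold cross-validation
from sklearn.model_selection import cross_val_score
print("OOB estimate: %.3f" % rf_initial.oob_score_)
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100, n_jobs=-1),
                            traincsv.iloc[:, :-1].values, traincsv.iloc[:, -1].values, cv=5)
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))
OK, let's now run predictions on testcsv and submit to Kaggle.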
#Make a copy of the test.csv file
temp=testcsv.copy()
#Run Predictions on test.csv
temp['Cover_Type']=rf_initial.predict(temp.values)
#Create Submissions csv file
temp=temp['Cover_Type']
temp.to_csv('RF-Initial.csv', header=True)
The submission to Kaggle scored 0.75366, placing us in the top half of the leaderboard. And we haven't even touched the dataset yet; we simply fit an out-of-the-box random forest to it.
OK, let's see if and how we can improve this score. We'll run through a series of visualizations to understand the data better. But before we do that, let's look at a very valuable statistic provided by Random Forests: feature importances. This tells us which features in the dataset matter most to our predictions and helps us decide where to focus our energy.
Feature Importances
We'll define this as a method so we can call on this later if necessary.
#Always call fit on the estimator before invoking this method.
def importances(estimator, col_array, title):
    # Rank features by importance and print the top 20
    importances = estimator.feature_importances_
    indices = np.argsort(importances)[::-1]
    print("%s Top 20 Important Features\n" % title)
    for f in range(20):
        print("%d. %s (%f)" % (f + 1, col_array.columns[indices[f]], importances[indices[f]]))
    #Mean Feature Importance
    print("\nMean Feature Importance %.6f" % np.mean(importances))
    #Plot the top 10 feature importances
    indices = indices[:10]
    plt.figure()
    plt.title(title + " Top 10 Feature importances")
    plt.bar(range(10), importances[indices], color="g", align="center")
    plt.xticks(range(10), col_array.columns[indices], fontsize=14, rotation=90)
    plt.xlim([-1, 10])
    plt.show()
#Call the method we just created to display the feature importances
importances(rf_initial, traincsv, "Cover Type (Initial RF)")
So there it is: Elevation is the most important feature in the dataset, almost 2.5 times more important than the second feature in the Top 10. If we can manufacture more features from Elevation, we'll almost certainly improve our prediction accuracy.
We'll continue to refer back to these importances as we move through this exercise.
Data Exploration
We'll now begin to examine the data through a series of visualizations. Let's start by looking at histograms of the features to find any missing values.
traincsv.iloc[:, :10].hist(figsize=(16,10), bins=50)
plt.show()
Looks like there are some missing values for Hillshade at 3 PM. If a cell was shaded at 9 AM, I'd expect it to have relatively little shade at 3 PM, so a hillshade index of exactly 0 (complete shadow) doesn't sit right. We'll explore this further down the line, since Hillshade 3 PM is in the Top 10.
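As a quick check (a small addition here), let's count how many training rows sit at exactly 0 in that column:
#Rows where Hillshade_3pm is exactly 0 - candidates for missing values
print((traincsv.Hillshade_3pm == 0).sum())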
Let's now plot an SPLOM to analyze the Top 5 features. I am interested to see how Elevation relates to the others.
with seaborn.axes_style('white'):
    smaller_frame = traincsv[['Elevation', 'Horizontal_Distance_To_Roadways',
                              'Horizontal_Distance_To_Fire_Points',
                              'Horizontal_Distance_To_Hydrology',
                              'Vertical_Distance_To_Hydrology']].copy()
    smaller_frame.columns = smaller_frame.columns.map(lambda x: x.replace('Horizontal_Distance_To', 'HD'))
    smaller_frame.columns = smaller_frame.columns.map(lambda x: x.replace('Vertical_Distance_To', 'VD'))
    scatter_matrix(smaller_frame, figsize=(14, 14), diagonal="kde")
    plt.show()
Linearity of Cover Types
I'd like to explore some of these relationships further, especially how Elevation plays with the following:
- Vertical Distance to Hydrology
- Horizontal Distance to Hydrology
- Horizontal Distance to Roadways
The reason for my interest is linearity. Tree-based models split the data into groups based on threshold criteria for each label, for instance: if HD_Hydrology > 25, then Cover Type is x. If we can create a feature in which more of the similar labels fall into the same bracket, we might be able to achieve more success with Random Forests.
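As an example of the kind of engineered feature this line of thinking suggests (a sketch only; the names are mine and we'd still have to verify that these actually help), Elevation can be combined with the hydrology distances:
#Illustrative engineered features - not added to traincsv yet
elev_minus_vd_hydro = traincsv.Elevation - traincsv.Vertical_Distance_To_Hydrology
euclid_dist_to_hydro = np.sqrt(traincsv.Horizontal_Distance_To_Hydrology ** 2 +
                               traincsv.Vertical_Distance_To_Hydrology ** 2)
Before committing to anything like that, though, let's look at the relationships directly.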
#Let's add Cover Type to smaller_frame so we can color code.
smaller_frame['Cover_Type']=traincsv.Cover_Type
#Elevation vs HD Hydrology
plt.figure(figsize=(8,8))
plt.scatter(smaller_frame.Elevation, smaller_frame.HD_Hydrology, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r)
plt.xlabel("Elevation")
plt.ylabel("HD Hydrology")
plt.show()