Predicting Forest Cover Types with Ensemble Learning

This is a documentation of one of my approaches to solving the forest cover type prediction challenge hosted by Kaggle. Feel free to use for your own reference and let me know if you have any suggestions on how I can improve the model :-)

I found this topic very engaging being a nature lover. Also the features are very friendly and don't require much domain knowledge to explore (and hopefully engineer new features).

OK let's get started.

Below is an excerpt from the competition page:

In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.

This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.

So we're going to be provided training data based on existing cover types and we're expected to predict integer classifications for each forest cover type. Let's look at the types of forest covers we're going to predict.

1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz

Get the data

Now is the time to go grab the train and test datasets from here. Place them in your working directory (or host them somewhere like Github). Let's very quickly explore the features available to us for this classification task:

Elevation - Elevation in meters  
Aspect - Aspect in degrees azimuth  
Slope - Slope in degrees  
Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features  
Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features  
Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway  
Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice  
Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice  
Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice  
Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points  
Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation  
Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation  
Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation  

There are 4 different wilderness areas and 40 different soil types. These're already converted to machine-learning readable binary formats for us (almost never expect this to happen in real-time, ever!).

Let's now read the dataset using pandas.

In [4]:
#Let's first import required Python packages.
#Importing required Python packages
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split, Bootstrap
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import pandas as pd
from import scatter_matrix
import urllib
import seaborn

np.random.seed(sum(map(ord, "aesthetics")))

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5) 

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50

%matplotlib inline
In [5]:
traincsv=pd.read_csv('train.csv', index_col='Id')
testcsv=pd.read_csv('test.csv', index_col='Id')

orig_traincsv = traincsv.copy()
orig_testcsv = testcsv.copy()

Let's take a look at a summary of the training set. The test set should be the same except the cover type column.

In [3]:
    Elevation  Aspect  Slope  Horizontal_Distance_To_Hydrology  \
1        2596      51      3                               258   
2        2590      56      2                               212   

    Vertical_Distance_To_Hydrology  Horizontal_Distance_To_Roadways  \
1                                0                              510   
2                               -6                              390   

    Hillshade_9am  Hillshade_Noon  Hillshade_3pm  \
1             221             232            148   
2             220             235            151   

    Horizontal_Distance_To_Fire_Points  Wilderness_Area1  Wilderness_Area2  \
1                                 6279                 1                 0   
2                                 6225                 1                 0   

    Wilderness_Area3  Wilderness_Area4  Soil_Type1  Soil_Type2  Soil_Type3  \
1                  0                 0           0           0           0   
2                  0                 0           0           0           0   

    Soil_Type4  Soil_Type5  Soil_Type6     ...      Soil_Type22  Soil_Type23  \
Id                                         ...                                 
1            0           0           0     ...                0            0   
2            0           0           0     ...                0            0   

    Soil_Type24  Soil_Type25  Soil_Type26  Soil_Type27  Soil_Type28  \
1             0            0            0            0            0   
2             0            0            0            0            0   

    Soil_Type29  Soil_Type30  Soil_Type31  Soil_Type32  Soil_Type33  \
1             1            0            0            0            0   
2             1            0            0            0            0   

    Soil_Type34  Soil_Type35  Soil_Type36  Soil_Type37  Soil_Type38  \
1             0            0            0            0            0   
2             0            0            0            0            0   

    Soil_Type39  Soil_Type40  Cover_Type  
1             0            0           5  
2             0            0           5  

[2 rows x 55 columns]

The data looks as smooth as silk, no dates, no strings, nothing alarming. Are there any nulls?

In [4]:
Empty DataFrame
Columns: [Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, Hillshade_Noon, Hillshade_3pm, Horizontal_Distance_To_Fire_Points, Wilderness_Area1, Wilderness_Area2, Wilderness_Area3, Wilderness_Area4, Soil_Type1, Soil_Type2, Soil_Type3, Soil_Type4, Soil_Type5, Soil_Type6, Soil_Type7, Soil_Type8, Soil_Type9, Soil_Type10, Soil_Type11, Soil_Type12, Soil_Type13, Soil_Type14, Soil_Type15, Soil_Type16, Soil_Type17, Soil_Type18, Soil_Type19, Soil_Type20, Soil_Type21, Soil_Type22, Soil_Type23, Soil_Type24, Soil_Type25, Soil_Type26, Soil_Type27, Soil_Type28, Soil_Type29, Soil_Type30, Soil_Type31, Soil_Type32, Soil_Type33, Soil_Type34, Soil_Type35, Soil_Type36, Soil_Type37, Soil_Type38, Soil_Type39, Soil_Type40, Cover_Type]
Index: []

[0 rows x 55 columns]

There are no rows with null values. Considering all the columns are numeric we should really look at 0 as a possible missing value. Now, before we proceed to explore the features, I'd like to do a quick and dirty submission with a Random Forest model, with no changes to the dataset. This helps to understand how any changes we make down the line fare in comparison.

Initial Submission (Random Forests Classifier)

Let's fit a Random Forest Classifier to the traincsv dataset. We'll then quickly run predictions on the testcsv dataset and submit to Kaggle.

Note, for this initial submission we are not performing any cross-validation or even a train/test split. We're simply running a quick and dirty submission to check out the results. For our subsqeuent submissions, we will follow due process.

In [5]:
#Initial Random Forest Model fitting with 1000 estimators. Note below, that the labels lie in the last column, Cover Type.
rf_initial=RandomForestClassifier(n_estimators=1000, bootstrap=True, oob_score=True)[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
print "Initial Traincsv score: %.2f" %rf_initial.score(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
Initial Traincsv score: 1.00

So with 1000 estimators and no crossvalidation/hypertuning etc. we're able to predict 100% of Cover Types on the train.csv file. Note that we wouldn't know at this point if there is any overfitting. We also do not have a test set (not same as test.csv) for comparison. OK, let's now run predictions on testcsv and submit to Kaggle.

In [6]:
#Make a copy of the test.csv file

#Run Predictions on test.csv

#Create Submissions csv file
temp.to_csv('RF-Initial.csv', header=True)

The submission to Kaggle scored 0.75366, taking us to better than 50% of the leaderboard. And we haven't even touched the dataset yet. Simply fit an out-of-the-box random forest to the dataset.

OK let's see if and how we can improve this score. We'll run through a series of visualizations to understand the data better. But before we do that, let's look at a very valuable statistic provided by Random Forests, Feature Importances. This will tell us which features in the dataset matter the most to our predictions. It helps to understand where we should focus our energy.

Feature Importances
We'll define this as a method so we can call on this later if necessary.

In [108]:
#Always call fit on the estimator before invoking this method.
def importances(estimator, col_array, title):

# Calculate the feature ranking - Top 10
    importances = estimator.feature_importances_
    indices = np.argsort(importances)[::-1]

    print "%s Top 20 Important Features\n" %title
    for f in range(20):
        print("%d. %s   (%f)" % (f + 1, col_array.columns[indices[f]], importances[indices[f]]))
#Mean Feature Importance
    print "\nMean Feature Importance %.6f" %np.mean(importances)
#Plot the feature importances of the forest
    plt.title(title+" Top 10 Feature importances"), importances[indices],
            color="gr", align="center")
    plt.xticks(range(10), col_array.columns[indices], fontsize=14, rotation=90)
    plt.xlim([-1, 10])
In [88]:
#Call the method we just created to display the feature importances
importances(rf_initial, traincsv, "Cover Type (Initial RF)")
Cover Type (Initial RF) Top 20 Important Features

1. Elevation   (0.224641)
2. Horizontal_Distance_To_Roadways   (0.092400)
3. Horizontal_Distance_To_Fire_Points   (0.073394)
4. Horizontal_Distance_To_Hydrology   (0.062296)
5. Vertical_Distance_To_Hydrology   (0.054044)
6. Hillshade_9am   (0.051547)
7. Aspect   (0.049650)
8. Hillshade_3pm   (0.045999)
9. Hillshade_Noon   (0.044683)
10. Wilderness_Area4   (0.044381)
11. Slope   (0.036172)
12. Soil_Type10   (0.022764)
13. Soil_Type38   (0.020321)
14. Soil_Type3   (0.018344)
15. Soil_Type39   (0.018343)
16. Wilderness_Area1   (0.017865)
17. Wilderness_Area3   (0.017311)
18. Soil_Type4   (0.012687)
19. Soil_Type40   (0.010560)
20. Soil_Type30   (0.008182)

Mean Feature Importance 0.018519

So there it is, Elevation is the most important feature in the dataset, that's almost 2.5 times more important than the second feature in the Top 10. If we can manufacture more features with Elevation, we'll almost certainly up our success rate of prediction.

We'll continue to refer back to these importances as we move through this exercise.

Data Exploration

We'll now begin to examine the data through a series of visualizations. Let's start by looking at histograms of the features to find any missing values.

In [9]:
traincsv.ix[:,:10].hist(figsize=(16,10), bins=50)

Looks like there're some missing values for Hillshade at 3 PM. If there was shade at 9 AM, I'd expect little shade at 3 PM but 0 doesn't sit right. We'll explore this further down the line, since Hillshade 3 PM is in the Top 10.

Let's now plot an SPLOM to analyze the Top 5 features. I am interested to see how Elevation relates to the others.

In [10]:
with seaborn.axes_style('white'): 
    smaller_frame=traincsv[['Elevation', 'Horizontal_Distance_To_Roadways', 
                            'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology']] x: x.replace('Horizontal_Distance_To','HD')) x: x.replace('Vertical_Distance_To','VD'))
    scatter_matrix(smaller_frame, figsize=(14, 14), diagonal="kde")

Linearity of Cover Types
I am interested to explore some of these relationships further. Especially how Elevation plays with the following:
- Vertical Distance to Hydrology - Horizontal Distance to Hydrology - Horizontal Distance to Roadways

The reason for my interest is linearity. Tree based models make decisions to group values based on certain criteria for each label. For instance, if HD_Hyrdrology > 25, Cover Type is x. If we can manage to create a feature where in more similar labels fall under the same bracket then we might be able to achieve more success with Random Forests.

In [39]:
#Let's add Cover Type to smaller_frame so we can color code.
In [42]:
#Elevation vs HD Hydrology
plt.scatter(smaller_frame.Elevation, smaller_frame.HD_Hydrology, c=smaller_frame.Cover_Type, s=75,
plt.ylabel("HD Hydrology")