This is a documentation of one of my approaches to solving the forest cover type prediction challenge hosted by Kaggle. Feel free to use for your own reference and let me know if you have any suggestions on how I can improve the model :-)
I found this topic very engaging being a nature lover. Also the features are very friendly and don't require much domain knowledge to explore (and hopefully engineer new features).
OK let's get started.
Below is an excerpt from the competition page:
In this competition you are asked to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
So we're going to be provided training data based on existing cover types and we're expected to predict integer classifications for each forest cover type. Let's look at the types of forest covers we're going to predict.
1 - Spruce/Fir
2 - Lodgepole Pine
3 - Ponderosa Pine
4 - Cottonwood/Willow
5 - Aspen
6 - Douglas-fir
7 - Krummholz
Get the data
Now is the time to go grab the train and test datasets from here. Place them in your working directory (or host them somewhere like Github). Let's very quickly explore the features available to us for this classification task:
Elevation - Elevation in meters Aspect - Aspect in degrees azimuth Slope - Slope in degrees Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation
There are 4 different wilderness areas and 40 different soil types. These're already converted to machine-learning readable binary formats for us (almost never expect this to happen in real-time, ever!).
Let's now read the dataset using pandas.
#Let's first import required Python packages. #Importing required Python packages import matplotlib.pylab as plt import numpy as np from scipy import sparse from sklearn.datasets import make_classification, make_blobs, load_boston from sklearn.decomposition import PCA from sklearn.cross_validation import ShuffleSplit, train_test_split, Bootstrap from sklearn.preprocessing import StandardScaler from sklearn import metrics from sklearn.learning_curve import learning_curve from sklearn.linear_model import LinearRegression from sklearn.ensemble import GradientBoostingRegressor from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.grid_search import GridSearchCV from pprint import pprint import pandas as pd from pandas.tools.plotting import scatter_matrix import urllib import seaborn np.random.seed(sum(map(ord, "aesthetics"))) seaborn.set_context('notebook') pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier plt.rcParams['figure.figsize'] = (15, 5) # Set some Pandas options pd.set_option('display.notebook_repr_html', False) pd.set_option('display.max_columns', 40) pd.set_option('display.max_rows', 25) pd.options.display.max_colwidth = 50 %matplotlib inline
traincsv=pd.read_csv('train.csv', index_col='Id') testcsv=pd.read_csv('test.csv', index_col='Id') orig_traincsv = traincsv.copy() orig_testcsv = testcsv.copy()
Let's take a look at a summary of the training set. The test set should be the same except the cover type column.
Elevation Aspect Slope Horizontal_Distance_To_Hydrology \ Id 1 2596 51 3 258 2 2590 56 2 212 Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways \ Id 1 0 510 2 -6 390 Hillshade_9am Hillshade_Noon Hillshade_3pm \ Id 1 221 232 148 2 220 235 151 Horizontal_Distance_To_Fire_Points Wilderness_Area1 Wilderness_Area2 \ Id 1 6279 1 0 2 6225 1 0 Wilderness_Area3 Wilderness_Area4 Soil_Type1 Soil_Type2 Soil_Type3 \ Id 1 0 0 0 0 0 2 0 0 0 0 0 Soil_Type4 Soil_Type5 Soil_Type6 ... Soil_Type22 Soil_Type23 \ Id ... 1 0 0 0 ... 0 0 2 0 0 0 ... 0 0 Soil_Type24 Soil_Type25 Soil_Type26 Soil_Type27 Soil_Type28 \ Id 1 0 0 0 0 0 2 0 0 0 0 0 Soil_Type29 Soil_Type30 Soil_Type31 Soil_Type32 Soil_Type33 \ Id 1 1 0 0 0 0 2 1 0 0 0 0 Soil_Type34 Soil_Type35 Soil_Type36 Soil_Type37 Soil_Type38 \ Id 1 0 0 0 0 0 2 0 0 0 0 0 Soil_Type39 Soil_Type40 Cover_Type Id 1 0 0 5 2 0 0 5 [2 rows x 55 columns]
The data looks as smooth as silk, no dates, no strings, nothing alarming. Are there any nulls?
Empty DataFrame Columns: [Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, Hillshade_Noon, Hillshade_3pm, Horizontal_Distance_To_Fire_Points, Wilderness_Area1, Wilderness_Area2, Wilderness_Area3, Wilderness_Area4, Soil_Type1, Soil_Type2, Soil_Type3, Soil_Type4, Soil_Type5, Soil_Type6, Soil_Type7, Soil_Type8, Soil_Type9, Soil_Type10, Soil_Type11, Soil_Type12, Soil_Type13, Soil_Type14, Soil_Type15, Soil_Type16, Soil_Type17, Soil_Type18, Soil_Type19, Soil_Type20, Soil_Type21, Soil_Type22, Soil_Type23, Soil_Type24, Soil_Type25, Soil_Type26, Soil_Type27, Soil_Type28, Soil_Type29, Soil_Type30, Soil_Type31, Soil_Type32, Soil_Type33, Soil_Type34, Soil_Type35, Soil_Type36, Soil_Type37, Soil_Type38, Soil_Type39, Soil_Type40, Cover_Type] Index:  [0 rows x 55 columns]
There are no rows with null values. Considering all the columns are numeric we should really look at 0 as a possible missing value. Now, before we proceed to explore the features, I'd like to do a quick and dirty submission with a Random Forest model, with no changes to the dataset. This helps to understand how any changes we make down the line fare in comparison.
Initial Submission (Random Forests Classifier)
Let's fit a Random Forest Classifier to the traincsv dataset. We'll then quickly run predictions on the testcsv dataset and submit to Kaggle.
Note, for this initial submission we are not performing any cross-validation or even a train/test split. We're simply running a quick and dirty submission to check out the results. For our subsqeuent submissions, we will follow due process.
#Initial Random Forest Model fitting with 1000 estimators. Note below, that the labels lie in the last column, Cover Type. rf_initial=RandomForestClassifier(n_estimators=1000, bootstrap=True, oob_score=True) rf_initial.fit(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel()) print "Initial Traincsv score: %.2f" %rf_initial.score(traincsv.ix[:,:-1].values, traincsv.ix[:,-1:].values.ravel())
Initial Traincsv score: 1.00
So with 1000 estimators and no crossvalidation/hypertuning etc. we're able to predict 100% of Cover Types on the train.csv file. Note that we wouldn't know at this point if there is any overfitting. We also do not have a test set (not same as test.csv) for comparison. OK, let's now run predictions on testcsv and submit to Kaggle.
#Make a copy of the test.csv file temp=testcsv.copy() #Run Predictions on test.csv temp['Cover_Type']=rf_initial.predict(temp.values) #Create Submissions csv file temp=temp['Cover_Type'] temp.to_csv('RF-Initial.csv', header=True)
The submission to Kaggle scored 0.75366, taking us to better than 50% of the leaderboard. And we haven't even touched the dataset yet. Simply fit an out-of-the-box random forest to the dataset.
OK let's see if and how we can improve this score. We'll run through a series of visualizations to understand the data better. But before we do that, let's look at a very valuable statistic provided by Random Forests, Feature Importances. This will tell us which features in the dataset matter the most to our predictions. It helps to understand where we should focus our energy.
We'll define this as a method so we can call on this later if necessary.
#Always call fit on the estimator before invoking this method. def importances(estimator, col_array, title): # Calculate the feature ranking - Top 10 importances = estimator.feature_importances_ indices = np.argsort(importances)[::-1] print "%s Top 20 Important Features\n" %title for f in range(20): print("%d. %s (%f)" % (f + 1, col_array.columns[indices[f]], importances[indices[f]])) #Mean Feature Importance print "\nMean Feature Importance %.6f" %np.mean(importances) #Plot the feature importances of the forest indices=indices[:10] plt.figure() plt.title(title+" Top 10 Feature importances") plt.bar(range(10), importances[indices], color="gr", align="center") plt.xticks(range(10), col_array.columns[indices], fontsize=14, rotation=90) plt.xlim([-1, 10]) plt.show()
#Call the method we just created to display the feature importances importances(rf_initial, traincsv, "Cover Type (Initial RF)")
Cover Type (Initial RF) Top 20 Important Features 1. Elevation (0.224641) 2. Horizontal_Distance_To_Roadways (0.092400) 3. Horizontal_Distance_To_Fire_Points (0.073394) 4. Horizontal_Distance_To_Hydrology (0.062296) 5. Vertical_Distance_To_Hydrology (0.054044) 6. Hillshade_9am (0.051547) 7. Aspect (0.049650) 8. Hillshade_3pm (0.045999) 9. Hillshade_Noon (0.044683) 10. Wilderness_Area4 (0.044381) 11. Slope (0.036172) 12. Soil_Type10 (0.022764) 13. Soil_Type38 (0.020321) 14. Soil_Type3 (0.018344) 15. Soil_Type39 (0.018343) 16. Wilderness_Area1 (0.017865) 17. Wilderness_Area3 (0.017311) 18. Soil_Type4 (0.012687) 19. Soil_Type40 (0.010560) 20. Soil_Type30 (0.008182) Mean Feature Importance 0.018519
So there it is, Elevation is the most important feature in the dataset, that's almost 2.5 times more important than the second feature in the Top 10. If we can manufacture more features with Elevation, we'll almost certainly up our success rate of prediction.
We'll continue to refer back to these importances as we move through this exercise.
We'll now begin to examine the data through a series of visualizations. Let's start by looking at histograms of the features to find any missing values.
traincsv.ix[:,:10].hist(figsize=(16,10), bins=50) plt.show()
Looks like there're some missing values for Hillshade at 3 PM. If there was shade at 9 AM, I'd expect little shade at 3 PM but 0 doesn't sit right. We'll explore this further down the line, since Hillshade 3 PM is in the Top 10.
Let's now plot an SPLOM to analyze the Top 5 features. I am interested to see how Elevation relates to the others.
with seaborn.axes_style('white'): smaller_frame=traincsv[['Elevation', 'Horizontal_Distance_To_Roadways', 'Horizontal_Distance_To_Fire_Points', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology']] smaller_frame.columns=smaller_frame.columns.map(lambda x: x.replace('Horizontal_Distance_To','HD')) smaller_frame.columns=smaller_frame.columns.map(lambda x: x.replace('Vertical_Distance_To','VD')) scatter_matrix(smaller_frame, figsize=(14, 14), diagonal="kde") plt.show()
Linearity of Cover Types
I am interested to explore some of these relationships further. Especially how Elevation plays with the following:
- Vertical Distance to Hydrology - Horizontal Distance to Hydrology - Horizontal Distance to Roadways
The reason for my interest is linearity. Tree based models make decisions to group values based on certain criteria for each label. For instance, if HD_Hyrdrology > 25, Cover Type is x. If we can manage to create a feature where in more similar labels fall under the same bracket then we might be able to achieve more success with Random Forests.
#Let's add Cover Type to smaller_frame so we can color code. smaller_frame['Cover_Type']=traincsv.Cover_Type
#Elevation vs HD Hydrology plt.figure(figsize=(8,8)) plt.scatter(smaller_frame.Elevation, smaller_frame.HD_Hydrology, c=smaller_frame.Cover_Type, s=75, cmap=plt.cm.Greens_r) plt.xlabel("Elevation") plt.ylabel("HD Hydrology") plt.show()