Gradient Boosted Regression Trees (GBRT), or Gradient Boosting for short, is a flexible non-parametric statistical learning technique for classification and regression. In this notebook I'll demonstrate learning with GBRT through multiple examples.
Feel free to use it for your own reference. Let's get started.
In [26]:
#Importing required Python packages
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston, fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import requests
import zipfile
import StringIO
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
In [2]:
#Let's begin by downloading one of scikit-learn's sample datasets - California Housing
cal=fetch_california_housing()
In [8]:
#Let's check out the structure of the dataset
print cal.keys()
print
#DESCR contains a description of the dataset
print cal.DESCR
#Great, as expected the dataset contains housing data with several features, including median income, average number of
#rooms/bedrooms etc., with the median house value as the target variable.
#Let's use GBRT to build a model that can predict house prices.
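Before modeling, it can help to glance at the feature names and a quick statistical summary. A small optional aside (not part of the original flow), using the pandas import from the setup cell:
#Optional: a quick look at the features before modeling
print cal.feature_names
cal_df = pd.DataFrame(cal.data, columns=cal.feature_names)
print cal_df.describe()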
In [3]:
#Let's first do a train/test split so we hold out some examples to find out how well our model generalizes. If you're new
#to these concepts, check out one of the other repos in my profile, like the SVM one, where I illustrate them in detail.
X_train, X_test, y_train, y_test = train_test_split(cal.data, cal.target)
print X_train.shape, X_test.shape
#Default split is 75/25
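It also helps to know what the target looks like before fitting a regressor. As a quick optional aside: the California housing target is the median house value, expressed in units of $100,000.
#Optional: inspect the target we're trying to predict
print "Target range: %.2f - %.2f" % (y_train.min(), y_train.max())
plt.hist(y_train, bins=50)
plt.xlabel("Median house value (in $100,000s)")
plt.ylabel("Count")
plt.show()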
In [25]:
#OK, now let's fit a GBRT regressor to the training dataset. The model has several hyperparameters which we will tune
#in subsequent steps. For now, we'll just build a simple model.
#Let's instantiate, fit and predict.
gbrt=GradientBoostingRegressor(n_estimators=100)
gbrt.fit(X_train, y_train)
y_pred=gbrt.predict(X_test)
In [26]:
#Let's get some stats.
#One of the benefits of growing trees is that we can see how important each feature is
print "Feature Importances"
print gbrt.feature_importances_
print
#Let's print the R-squared value for train/test. This tells us how much of the variance in the data the model
#is able to explain.
print "R-squared for Train: %.2f" %gbrt.score(X_train, y_train)
print "R-squared for Test: %.2f" %gbrt.score(X_test, y_test)
In [4]:
#All right, let's do this the right way. We'll use a cross-validation generator to select train and CV sets so we can
#fine-tune hyperparameters such as the learning rate, tree depth and number of estimators. These hyperparameters are
#extremely critical to the model.
#Now, if we tune parameters against the Test dataset, we will end up biasing towards the test set and will once again
#not generalize very well. We will also have no good way to find out, since we would essentially have tuned on all our data.
#Luckily scikit-learn has built-in packages that can help with this. We'll use a cross-validation generator that
#trains the model while tuning the parameters on a cross-validation subset (cv) picked from within the
#training set. A different cv subset will be picked for each iteration; we control the number of iterations. We'll then
#feed these cv/train splits to a grid search function that evaluates the model on each split and returns the parameter
#combination that gives the optimal result.
#Defining this as a function so we can call it anytime we want
def GradientBooster(param_grid, n_jobs):
    estimator = GradientBoostingRegressor()
    #Choose a cross-validation generator - let's choose ShuffleSplit which randomly shuffles and selects Train and CV sets
    #for each iteration. There are other methods like the KFold split.
    cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2)
    #Apply the cross-validation iterator on the Training set using GridSearchCV. This will run the regressor on the
    #different train/cv splits using the parameters specified and return the model that has the best results.
    #Note that, for a regressor, GridSearchCV tunes on the estimator's default score, which here is R-squared. That may
    #not always be the best metric to tune on; I'll explore scoring choices further in a separate exercise.
    regressor = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid, n_jobs=n_jobs)
    #Also note that we're feeding multiple candidate values for each hyperparameter to the grid search to try out.
    #We'll now fit the training dataset to this regressor
    regressor.fit(X_train, y_train)
    #Let's look at the best estimator that was found by GridSearchCV
    print "Best Estimator learned through GridSearch"
    print regressor.best_estimator_
    return cv, regressor.best_estimator_
In [5]:
#Below is a plot_learning_curve module that's provided by scikit-learn. It allows us to quickly and easily visualize how
#well the model is performing based on the number of samples we're training on. It helps to diagnose situations such as
#high variance or high bias.
#We'll call this module in the next segment.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [13]:
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. Try adjusting parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition), and the
#number of values being tried for max_depth/n_estimators.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
param_grid={'n_estimators':[100],
'learning_rate': [0.1],# 0.05, 0.02, 0.01],
'max_depth':[6],#4,6],
'min_samples_leaf':[3],#,5,9,17],
'max_features':[1.0],#,0.3]#,0.1]
}
n_jobs=4
#Let's fit GBRT to the California housing training set by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
In [14]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)
#Each of these parameters is critical to learning. Some of them also help address overfitting. For more info about
#overfitting and regularization, check out the SVM notebook in my GitHub repos where I cover the subject in more detail.
In [12]:
#OK, we'll now call the plot_learning_curve module, feeding it the best estimator returned by GridSearchCV and the
#train/cv sets. The module simply runs the estimator multiple times on subsets of the data provided and plots the
#train and cv scores. We may need to adjust the hyperparameters further if there is overfitting (or underfitting,
#though that's unlikely).
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Looks like we've done a reasonable job, getting about ~0.85 R-squared on the cv set, and it looks from the learning
#curve like we may be able to do a bit better with more estimators, although we may then need to reduce the learning rate
#even further to keep overfitting in check.
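Before doing that full re-run below, here's a cheaper way to eyeball how the score evolves with the number of trees: staged_predict yields the model's predictions after every boosting stage, so a single 1000-tree fit tells us where the curve flattens out. This is a sketch I'm adding for illustration (ideally you'd score a separate validation split rather than the final test set):
#Fit once with plenty of trees, then score every stage on held-out data
est_stage = GradientBoostingRegressor(n_estimators=1000, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate,
                                      min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
est_stage.fit(X_train, y_train)
stage_scores = [metrics.r2_score(y_test, y_stage) for y_stage in est_stage.staged_predict(X_test)]
plt.plot(np.arange(1, len(stage_scores) + 1), stage_scores)
plt.xlabel("Number of trees")
plt.ylabel("Held-out R-squared")
plt.show()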
In [17]:
#Let's test that hunch: we'll bump n_estimators up to 1000, keep the other best parameters from GridSearchCV, and
#re-plot the learning curves to see whether more trees help.
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=1000, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK, that didn't really help. It did improve the training score, but there's noticeably more overfitting (look at the wide
#gap between the train and cv curves). That could possibly be addressed by further reducing the learning rate. But also note
#that the cv scores are pretty much the same, so we probably won't see a major improvement unless we can procure more samples.
In [16]:
#Let's try one more trick. We'll trim the training set to its most important features and re-train to see if
#that helps.
title = "Learning Curves (Gradient Boosted Regression Trees)"
#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)
#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)
#Trimming feature matrices to include only those features that are more important than the mean of all importances.
X_train_trim=estimator.transform(X_train, threshold='mean')
#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold='mean')
#Re-plotting learning curves, this time on the trimmed feature matrix.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#So what do we infer from this plot? We seem to have addressed overfitting much better but the overall score of both train
#and cv has gone down considerably, indicating that the features we dropped were actually collectively contributing
#to the model. Let's go back to the first model that the Grid Search returned and run our test scores.
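As a quick aside before switching back: the transform(threshold='mean') call used above is roughly equivalent to masking columns manually with feature_importances_, which makes it easier to see exactly which features survive the trim. A sketch added for illustration:
#Roughly what transform(threshold='mean') does under the hood
mask = estimator.feature_importances_ >= estimator.feature_importances_.mean()
X_train_manual = X_train[:, mask]
print "Features kept:", list(np.array(cal.feature_names)[mask])
print "Trimmed shape:", X_train_manual.shape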
In [15]:
#Switching back to the best model from gridsearch
estimator = best_est
#Re-fitting to the train set
estimator.fit(X_train, y_train)
#Calculating train/test scores - R-squared value
print "Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Test R-squared: %.2f" %estimator.score(X_test, y_test)
#There you have it, our final R-squared on the California housing dataset, 0.82
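For context, here's a quick baseline with the LinearRegression model imported in the setup cell. This comparison wasn't part of the original run, so treat the resulting number as indicative only:
#A plain linear model on the same train/test split for comparison
lr = LinearRegression()
lr.fit(X_train, y_train)
print "LinearRegression Test R-squared: %.2f" % lr.score(X_test, y_test)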
In [27]:
#OK, let's run through a more complex example this time. We'll explore anonymized loan data provided by Lending Club
#and try to predict the interest rate for loan applications based on the data provided. Let's first get the data into
#a pandas dataframe.
#The dataset is a zip file, so we'll download it with requests, unpack it with zipfile, and then hand it to pandas
#through the read_csv command.
url=requests.get('https://resources.lendingclub.com/LoanStats3c.csv.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))
loan=pd.read_csv(z.open('LoanStats3c.csv'), skiprows=1, parse_dates=True, index_col='id')
loanbk=loan.copy() #Backup of the dataframe so we don't have to download the data every time
In [28]:
#Let's take a quick peek at the dataset
loan.describe()
Out[28]:
In [29]:
#For simplicity, let's first drop rows with nulls in the dataset. axis=0 indicates we drop rows, not columns.
loan = loan.dropna(axis=0)
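It's worth knowing how aggressive that dropna call was. A quick optional check against the backup dataframe we kept earlier:
#Optional sanity check: how many rows did dropna remove?
print "Rows before dropna: %d, after: %d" % (loanbk.shape[0], loan.shape[0])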
In [30]:
#OK, let's take a look at the columns and see if there are any we can drop before we get started.
loan.columns.values
#There are plenty that don't seem very relevant. Let's drop them.
loan=loan.drop(['member_id', 'grade', 'sub_grade', 'emp_title', 'issue_d',
'pymnt_plan', 'url', 'desc', 'title', 'initial_list_status',
'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
'policy_code', 'emp_length', 'addr_state','zip_code'], axis=1)
#Check the data dictionary for this dataset at https://resources.lendingclub.com/LCDataDictionary.xlsx for more details
In [31]:
# Get rid of non-numeric values throughout the DataFrame:
for col in loan.columns.values:
loan[col] = loan[col].replace('[^0-9]+.-', '', regex=True)
loan.head(2)
Out[31]:
In [32]:
#Remove % symbol from the interest rate & revolving utilization
loan.int_rate=loan.int_rate.str.split('%',1).str[0]
loan.revol_util=loan.revol_util.str.split('%',1).str[0]
#Remove "months" from the loan period
loan.term=loan.term.str.split(' ',2).str[1]
loan.head(2)
Out[32]:
In [33]:
#Let's convert the income-verification, home-ownership, loan-status and purpose columns, which currently hold
#textual labels, to numeric codes.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
loan.is_inc_v = le.fit_transform(loan.is_inc_v.values)
loan.home_ownership=le.fit_transform(loan.home_ownership.values)
loan.loan_status=le.fit_transform(loan.loan_status.values)
loan.purpose=le.fit_transform(loan.purpose.values)
#Finally let's be sure we convert all fields to numeric
loan=loan.convert_objects(convert_numeric=True)
loan.head(2)
Out[33]:
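One caveat worth flagging: LabelEncoder maps categories to arbitrary integers, which implies an ordering that fields like purpose or home_ownership don't really have. Tree ensembles can usually cope with that, but one-hot encoding is the more neutral choice. A small optional sketch against the untouched backup dataframe (not used in the rest of the notebook):
#Optional: one-hot encode a categorical column instead of label-encoding it
purpose_dummies = pd.get_dummies(loanbk.purpose, prefix='purpose')
print purpose_dummies.head(2)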
In [34]:
#OK great, let's now get our X and y. We know that interest rate is y.
#Pandas is fantastic, all you need to do is use .values to get the data in numpy format
y=loan.int_rate.values
#Let's remove y from the df so we can get X
del loan['int_rate']
X=loan.values
#Now, the train test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
In [38]:
#Great - in no time we've grabbed an unfamiliar dataset from the web, munged it with pandas, and now have ready-to-go
#training and test numpy arrays for the GBRT regressor. Let's go!
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. Try adjusting parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition), and the
#number of values being tried for max_depth/n_estimators.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
param_grid={'n_estimators':[100],#,500,1000],
'learning_rate': [0.1,0.05,0.02],# 0.01],
'max_depth':[4,6],
'min_samples_leaf':[3,5,9,17],
'max_features':[1.0,0.3,0.1]
}
n_jobs=4
#Let's fit GBRT to the Lending Club training set by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
In [39]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)
#The training R-squared is almost 1.0, which means the model explains nearly all of the variance in the training data.
#That also suggests we might be overfitting. Let's check with the learning curves below.
In [45]:
#As before, we'll call plot_learning_curve with the best estimator returned by GridSearchCV and the train/cv sets, and
#adjust the hyperparameters further if there is overfitting (or underfitting, though that's unlikely).
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK yes, there is some overfitting there. We can see the training score in red staying close to 1.0 and the cv score
#struggling to catch up as the number of examples increases; that's the classic signature of overfitting. To address it,
#GBRT basically gives us the following parameters to fine-tune: learning rate, max depth, min samples leaf and
#max features.
In [46]:
#The typically recommended range for max_depth is 4 to 6, so let's leave it where it is. Let's try increasing the
#min_samples_leaf parameter instead; this enforces a lower bound on the number of samples in any given leaf.
min_samples_leaf=9
title = "Learning Curves (Gradient Boosted Regression Trees), min_samples_leaf=9"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK that did not really help.
In [47]:
#Let's try reducing the max_features parameter. This caps the number of features considered when looking for the best
#split in each tree. It's said to work well when n_features > 30. We'll also drop min_samples_leaf for this run.
max_features=0.5
title = "Learning Curves (Gradient Boosted Regression Trees), max_features=50%"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, max_features=max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Nope, that didn't really improve the cv score either. What happens if we reduce the learning rate?
In [48]:
#The lower the learning rate, the more trees we need to train, since each tree now contributes a smaller step
#towards the overall fit.
learning_rate=.01
n_estimators=1000
title = "Learning Curves (Gradient Boosted Regression Trees), 1000 Trees at learning rate .01"
estimator = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=best_est.max_depth,
learning_rate=learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Perhaps that improved it a tiny little bit.
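One more regularization knob we didn't include in the grid is subsample: values below 1.0 train each tree on a random fraction of the rows (stochastic gradient boosting), which usually trades a little bias for lower variance. A sketch with an arbitrary, untuned value of 0.8:
#Try row subsampling as an additional regularizer
estimator_sub = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                          learning_rate=best_est.learning_rate,
                                          min_samples_leaf=best_est.min_samples_leaf,
                                          max_features=best_est.max_features, subsample=0.8)
plot_learning_curve(estimator_sub, "Learning Curves (GBRT), subsample=0.8", X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()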
In [81]:
#Before we try anything else, I would like to explore one of the beautiful advantages of growing trees: feature
#importances. Now that we have a publicly available (though anonymized) collection of loan applications, I'm really
#curious to see what impacts the interest rate of a loan application the most.
#Let's take a look
#Calling fit on the estimator so we can look at feature_importances.
estimator.fit(X_train, y_train)
# Calculate the feature ranking - Top 10
importances = estimator.feature_importances_
indices = np.argsort(importances)[::-1]
print "Lending Club Loan Data - Top 10 Important Features\n"
for f in range(10):
print("%d. %s (%f)" % (f + 1, loan.columns[indices[f]], importances[indices[f]]))
#Plot the feature importances of the forest
indices=indices[:10]
plt.figure()
plt.title("Top 10 Feature importances")
plt.bar(range(10), importances[indices],
color="r", align="center")
plt.xticks(range(10), loan.columns[indices], fontsize=14, rotation=45)
plt.xlim([-1, 10])
plt.show()
#Mean Feature Importance
print "Mean Feature Importance %.6f" %np.mean(importances)
#Interesting - the total amount of interest received to date is the single most influential feature for the interest
#rate. Good for the lenders, eh? Pay more interest and we'll give you a cut on the rate. Of course!
In [83]:
#Can we actually trim (like before) and get a better result? Perhaps not, but who's to stop us from trying.
title = "Learning Curves (Gradient Boosted Regression Trees) - Trimmed features to > 1% importance"
#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)
#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)
#Trimming the feature matrix to include only those features with more than 1% importance (threshold=.01).
X_train_trim=estimator.transform(X_train, threshold=.01)
#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold=.01)
#Re-plotting learning curves, this time on the trimmed feature matrix.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Nope. The curve looks like it overfits less, but look at the cv score - across all our attempts it never really crossed
#that ~0.8 R-squared barrier. That tells me we already have a decent model at hand, and that a 0.9+ R-squared is not always
#achievable, at least on real-world data. Let's wrap this up.
In [85]:
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
estimator.fit(X_train, y_train)
print "Final Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Final Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Final Test R-squared: %.2f" %estimator.score(X_test, y_test)
There it is: a final 0.83 R-squared value on the Test set. Compare this to the 0.72 we get with a LinearRegression model (see the Linear Regression notebook in my repos) and you'll see that we've done noticeably better.
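If you want to reuse the final fitted model without retraining, it can be persisted to disk. A minimal sketch using the joblib bundled with scikit-learn (the filename is arbitrary):
#Persist the fitted model for later reuse
from sklearn.externals import joblib
joblib.dump(estimator, 'gbrt_lendingclub.pkl')
#Load it back later with: estimator = joblib.load('gbrt_lendingclub.pkl')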