California house price predictions with Gradient Boosted Regression Trees

Gradient Boosted Regression Trees (GBRT), or Gradient Boosting for short, is a flexible non-parametric statistical learning technique for classification and regression. I'll demonstrate learning with GBRT using multiple examples in this notebook.

Feel free to use this for your own reference. Let's get started.
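As a warm-up, here is a minimal sketch of the core idea (my own illustration, not part of the original notebook and not how scikit-learn implements it internally): with squared-error loss, each new shallow tree is fit to the residuals of the current ensemble, and its predictions are added in scaled by a small learning rate. The function names below are made up for illustration.

#Toy illustration of gradient boosting with squared-error loss (not the sklearn implementation)
from sklearn.tree import DecisionTreeRegressor
import numpy as np

def toy_gbrt_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                       #start from a constant prediction
    pred = np.zeros(len(y)) + base
    trees = []
    for _ in range(n_estimators):
        residual = y - pred               #what the ensemble still gets wrong (the negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def toy_gbrt_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)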

In [26]:
#Importing required Python packages
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston, fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import requests
import zipfile
import StringIO
import seaborn

np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5) 

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
In [2]:
#Let's begin by downloading one of scikit-learn's sample datasets - California Housing
cal=fetch_california_housing()
In [8]:
#Let's check out the structure of the dataset
print cal.keys()
print

#DESCR contains a description of the dataset
print cal.DESCR

#Great, as expected the dataset contains housing data with several features including income, number of rooms, etc.,
#with the average house value as the target variable.

#Let's use GBRT to build a model that can predict house prices.
['data', 'feature_names', 'DESCR', 'target']

California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.



In [3]:
#Let's first do a train/test split so we have held-out examples to measure how well our model generalizes. If you're new
#to these concepts, check out one of the other repos in my profile (e.g. the SVM notebook) where I illustrate them in detail.
X_train, X_test, y_train, y_test = train_test_split(cal.data, cal.target)

print X_train.shape, X_test.shape

#Default split is 75/25
(15480, 8) (5160, 8)
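As a small aside (not in the original notebook): if you want this split to be reproducible across runs, you can pass test_size and random_state explicitly; the seed value below is arbitrary.

#Explicit, reproducible 75/25 split (random_state value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(cal.data, cal.target, test_size=0.25, random_state=42)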

In [25]:
#OK now, let's fit a GBRT regressor to the training dataset. The model has several hyperparameters which we will tune
#in subsequent steps. For now, we'll just build a simple model.

#Let's go instantiate, fit and predict.
gbrt=GradientBoostingRegressor(n_estimators=100)
gbrt.fit(X_train, y_train)
y_pred=gbrt.predict(X_test)
In [26]:
#Let's get some stats. 

#One of the benefits of growing trees is that we can see how important each of the features is
print "Feature Importances"
print gbrt.feature_importances_
print

#Let's print the R-squared value for train/test. This tells us how much of the variance in the data our model is
#able to explain.
print "R-squared for Train: %.2f" %gbrt.score(X_train, y_train)
print "R-squared for Test: %.2f" %gbrt.score(X_test, y_test)
Feature Importances
[ 0.19300523  0.05615198  0.0547014   0.05048903  0.01719911  0.08529837
  0.25995097  0.28320391]

R-squared for Train: 0.80
R-squared for Test: 0.79
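The raw importances array is hard to read on its own; pairing it with cal.feature_names makes it clearer which columns the model leans on. A small helper I'm adding for readability (not in the original notebook):

#Pair each importance with its feature name and print them, most important first
for name, imp in sorted(zip(cal.feature_names, gbrt.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print "%-12s %.3f" % (name, imp)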

In [4]:
#All right, let's do this the right way. We'll use a cross-validation generator to select train and CV subsets so we can
#fine-tune hyperparameters such as the learning rate, tree depth and number of estimators. These hyperparameters are
#extremely critical to the model. If we tuned them against the test dataset, we would end up biasing towards the test set
#and would once again not generalize very well. We would also have no good way to find out, since we would have
#essentially trained on all our data.

#Luckily scikit-learn has built-in packages that can help with this. We'll use a cross-validation generator that
#trains the model on a training subset and tunes the parameters against a cross-validation subset (cv) picked from within
#the training set. A different cv subset will be picked for each iteration; we control the number of iterations. Then we
#run a grid search that evaluates the model on each train/cv split and returns the parameter combination that gives the
#best result.

#Defining this as a function so we can call it anytime we want

def GradientBooster(param_grid, n_jobs):

    estimator = GradientBoostingRegressor()

#Choose cross-validation generator - let's choose ShuffleSplit which randomly shuffles and selects Train and CV sets
#for each iteration. There are other methods like the KFold split.
    cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2)

#Apply the cross-validation iterator on the training set using GridSearchCV. This will run the regressor on the
#different train/cv splits using the parameters specified and return the model that has the best results

#Note that since this is a regressor, GridSearchCV tunes on the estimator's default score, which for
#GradientBoostingRegressor is R-squared. This may not always be the best metric to tune on; I will explore that
#further in a separate exercise. For now, we'll use R-squared.

    regressor = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid, n_jobs=n_jobs)

#Also note that we're feeding multiple candidate values for each hyperparameter for the grid search to try out.

#We'll now fit the training dataset to this regressor
    regressor.fit(X_train, y_train)

#Let's look at the best estimator that was found by GridSearchCV
    print "Best Estimator learned through GridSearch"
    print regressor.best_estimator_

    return cv, regressor.best_estimator_
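Since GridSearchCV falls back to the estimator's own score for a regressor, the tuning metric above is implicitly R-squared. If you want to make that explicit, or tune on a different criterion, the scoring argument does it; a sketch of the same line inside GradientBooster (assuming the sklearn version used in this notebook):

#Same grid search with the scoring criterion spelled out explicitly ('r2' matches the default for regressors)
    regressor = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid,
                             n_jobs=n_jobs, scoring='r2')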
In [5]:
#Below is a plot_learning_curve module that's provided by scikit-learn. It allows us to quickly and easily visualize how
#well the model is performing based on the number of samples we're training on. It helps diagnose situations such as
#high variance or high bias.

#We'll call this module in the next segment. 

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_digits
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
Automatically created module for IPython interactive environment

In [13]:
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. TRY ADJUSTING parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition) and reducing
#the number of values being tried for max_depth/n_estimators.

#SELECT KERNEL > INTERRUPT IN THE MENU IF YOU NEED TO STOP EXECUTION

param_grid={'n_estimators':[100], 
            'learning_rate': [0.1],# 0.05, 0.02, 0.01],
            'max_depth':[6],#4,6], 
            'min_samples_leaf':[3],#,5,9,17], 
            'max_features':[1.0],#,0.3]#,0.1]
           }
n_jobs=4

#Let's fit GBRT to the California housing training dataset by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
Best Estimator learned through GridSearch
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=6, max_features=1.0, max_leaf_nodes=None,
             min_samples_leaf=3, min_samples_split=2, n_estimators=100,
             random_state=None, subsample=1.0, verbose=0, warm_start=False)

In [14]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features

print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)

#Each of these parameters is critical to learning. Some of them will help address overfitting issues as well. For more 
#info about overfitting and regularization, check out the SVM notebook in my Github repos where I provide more info on 
#the subject.
Best Estimator Parameters
---------------------------
n_estimators: 100
max_depth: 6
Learning Rate: 0.1
min_samples_leaf: 3
max_features: 1.0

Train R-squared: 0.90

In [12]:
#OK we'll now call the plot_learning_curve module by feeding it the estimator (best estimator returned from GS) 
#and train/cv sets.

#The module simply runs the estimator multiple times on subsets of the data provided and plots the train and cv scores.
#Note that we're feeding the best parameters we've learned from GridSearchCV to the estimator now.
#We may need to adjust the hyperparameters further if there is overfitting (or underfitting, though unlikely)
title = "Learning Curves (Gradient Boosted Regression Trees)" 
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#Looks like we've done a reasonable job, getting roughly ~0.85 R-squared on the cv set, and the learning curve suggests
#we may be able to do a bit better with more estimators, although we may then also need to reduce the learning rate to
#keep overfitting in check.
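Rather than re-running the whole grid search with more trees, GradientBoostingRegressor's staged_predict can score every intermediate ensemble size in a single pass. A quick diagnostic sketch using the fitted best_est and the held-out set from above (note: using the test set this way is for inspection only; proper selection should use a validation split):

#Score every intermediate number of trees on the held-out set in one pass (diagnostic only)
test_r2 = [metrics.r2_score(y_test, y_pred) for y_pred in best_est.staged_predict(X_test)]
print "Best number of trees: %d (R-squared %.2f)" % (np.argmax(test_r2) + 1, max(test_r2))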
In [17]:
#Let's re-run the learning curves, this time bumping n_estimators up to 1000 while keeping the other best parameters
#learned from GridSearchCV, to see whether more trees actually help.
title = "Learning Curves (Gradient Boosted Regression Trees)" 
estimator = GradientBoostingRegressor(n_estimators=1000, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#OK, that didn't really help. It did improve the training score, but there's way more overfitting (look at the wide gap
#between the train and cv curves). This could possibly be addressed by further reducing the learning rate. But also note
#that the cv scores are pretty much the same, so we might not see a major improvement unless we can procure more samples.
In [16]:
#Let's try one more trick. We'll trim the training set to its most important features and re-train to see if 
#that helps.
title = "Learning Curves (Gradient Boosted Regression Trees)" 

#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)

#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)

#Trimming feature matrices to include only those features that are more important than the mean of all importances.
X_train_trim=estimator.transform(X_train, threshold='mean')

#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold='mean')

#Re-plotting learning curves on the trimmed training set.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#So what do we infer from this plot? We seem to have addressed overfitting much better, but the overall scores of both
#train and cv have gone down considerably, indicating that the features we dropped were collectively contributing to the
#model (note we also fell back to the default max_depth here, which contributes to the lower training score). Let's go
#back to the first model that the grid search returned and run our test scores.
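As an aside, estimator.transform(threshold=...) is specific to older scikit-learn releases; the same trimming can be written explicitly with a boolean mask over feature_importances_, which also makes it obvious which columns survive. A sketch assuming the fitted estimator above:

#Equivalent manual trimming: keep only features with above-average importance
importances = estimator.feature_importances_
mask = importances > importances.mean()
X_train_trim = X_train[:, mask]
X_test_trim = X_test[:, mask]
print "Kept features:", [cal.feature_names[i] for i in np.where(mask)[0]]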
In [15]:
#Switching back to the best model from gridsearch
estimator = best_est

#Re-fitting to the train set
estimator.fit(X_train, y_train)

#Calculating train/test scores - R-squared value
print "Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Test R-squared: %.2f" %estimator.score(X_test, y_test)

#There you have it, our final R-squared on the California housing dataset, 0.82
Train R-squared: 0.90
Test R-squared: 0.82
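R-squared is unitless; since the target in this dataset is the house value expressed in units of $100,000, error metrics in those units can be easier to communicate. A quick sketch using the metrics module imported at the top:

#Mean absolute error and root mean squared error on the test set (in units of $100,000)
y_pred = estimator.predict(X_test)
print "Test MAE:  %.3f" % metrics.mean_absolute_error(y_test, y_pred)
print "Test RMSE: %.3f" % np.sqrt(metrics.mean_squared_error(y_test, y_pred))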

In [27]:
#OK, let's run through a more complex example this time. We'll explore anonymized loan data provided by Lending Club.
#We'll try to predict the interest rate for loan applications based on the data provided. Let's first download the data
#into a pandas DataFrame.

#The dataset is a zip file, so let's first fetch it with requests and then hand it to pandas through the
#read_csv command.
url=requests.get('https://resources.lendingclub.com/LoanStats3c.csv.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))

loan=pd.read_csv(z.open('LoanStats3c.csv'), skiprows=1, parse_dates=True, index_col='id')
loanbk=loan.copy() #Backup of the dataframe so we don't have to download the data every time
/home/shanks/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:1139: DtypeWarning: Columns (0,19) have mixed types. Specify dtype option on import or set low_memory=False.
  data = self._reader.read(nrows)
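The DtypeWarning above comes from pandas inferring column types chunk by chunk; it's harmless here, but it can be avoided by letting pandas scan the whole file (or by passing explicit dtypes). A sketch of the same read with that option:

#Same read, but infer dtypes over the whole file to avoid the mixed-type warning
loan = pd.read_csv(z.open('LoanStats3c.csv'), skiprows=1, parse_dates=True,
                   index_col='id', low_memory=False)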

In [28]:
#Let's take a quick peek at the dataset
loan.describe()
Out[28]:
             member_id      loan_amnt    funded_amnt  funded_amnt_inv  \
count    161231.000000  161231.000000  161231.000000    161231.000000   
mean   19096209.203993   14864.642190   14864.642190     14858.829723   
std     5402435.130132    8412.637516    8412.637516      8409.342582   
min      137225.000000    1000.000000    1000.000000       950.000000   
25%    14557573.500000    8400.000000    8400.000000      8400.000000   
50%    18284171.000000   13000.000000   13000.000000     13000.000000   
75%    23292780.000000   20000.000000   20000.000000     20000.000000   
max    31526675.000000   35000.000000   35000.000000     35000.000000   

         installment      annual_inc            dti    delinq_2yrs  \
count  161231.000000   161231.000000  161231.000000  161231.000000   
mean      446.003881    75258.495695      17.685653       0.341609   
std       245.922837    57407.341278       7.733079       0.891854   
min        23.360000     3000.000000       0.000000       0.000000   
25%       267.790000    46000.000000      11.890000       0.000000   
50%       387.240000    65000.000000      17.360000       0.000000   
75%       585.390000    90000.000000      23.320000       0.000000   
max      1409.990000  7500000.000000      39.990000      22.000000   

       inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
count   161231.000000            81008.000000            28929.000000   
mean         0.833078               33.436759               71.541706   
std          1.094065               21.866078               28.736096   
min          0.000000                0.000000                0.000000   
25%          0.000000               15.000000               50.000000   
50%          0.000000               30.000000               70.000000   
75%          1.000000               49.000000              100.000000   
max          6.000000              188.000000              121.000000   

            open_acc        pub_rec       revol_bal     total_acc  \
count  161231.000000  161231.000000   161231.000000  161231.00000   
mean       11.632620       0.224324    16034.328436      26.09554   
std         5.175382       0.608610    19376.339472      11.73985   
min         0.000000       0.000000        0.000000       2.00000   
25%         8.000000       0.000000     6323.000000      18.00000   
50%        11.000000       0.000000    11607.000000      24.00000   
75%        14.000000       0.000000    20209.500000      33.00000   
max        76.000000      63.000000  1190046.000000     121.00000   

           out_prncp  out_prncp_inv    total_pymnt  total_pymnt_inv  \
count  161231.000000  161231.000000  161231.000000    161231.000000   
mean    12597.017050   12592.210838    3142.971573      3141.605224   
std      8136.428865    8133.531787    3753.058690      3751.385803   
min         0.000000       0.000000       0.000000         0.000000   
25%      6511.070000    6511.070000    1220.995000      1220.020000   
50%     11167.950000   11166.840000    2138.880000      2138.650000   
75%     17587.260000   17579.925000    3626.665000      3623.660000   
max     35000.000000   35000.000000   40554.969852     40554.970000   

       total_rec_prncp  total_rec_int  total_rec_late_fee     recoveries  \
count    161231.000000  161231.000000       161231.000000  161231.000000   
mean       2209.550404     933.078710            0.040902       0.301558   
std        3496.448030     831.964426            1.152217      30.839240   
min           0.000000       0.000000            0.000000       0.000000   
25%         731.210000     358.200000            0.000000       0.000000   
50%        1280.400000     676.500000            0.000000       0.000000   
75%        2246.190000    1226.200000            0.000000       0.000000   
max       35000.000000    7301.320000          100.380000    4935.000000   

       collection_recovery_fee  last_pymnt_amnt  collections_12_mths_ex_med  \
count            161231.000000    161231.000000               161231.000000   
mean                  0.003164      1048.075335                    0.014191   
std                   0.314840      3186.231922                    0.135941   
min                   0.000000         0.000000                    0.000000   
25%                   0.000000       275.260000                    0.000000   
50%                   0.000000       404.270000                    0.000000   
75%                   0.000000       629.010000                    0.000000   
max                  49.350000     35757.360000                   20.000000   

       mths_since_last_major_derog  policy_code  
count                 45175.000000       161231  
mean                     43.363475            1  
std                      22.153984            0  
min                       0.000000            1  
25%                      26.000000            1  
50%                      43.000000            1  
75%                      60.000000            1  
max                     188.000000            1  
In [29]:
#For simplicity, let's first drop rows with nulls in the dataset. axis=0 indicates we'll drop rows, not columns.
loan = loan.dropna(axis=0)
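Dropping every row with any null is a blunt instrument, so it's worth checking how many rows survive and which columns are responsible before committing to it. A quick check against the loanbk backup made above:

#How many rows survive the dropna, and where the nulls are concentrated
print "Rows before/after dropna: %d / %d" % (loanbk.shape[0], loan.shape[0])
print loanbk.isnull().sum()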
In [30]:
#OK let's take a look at the columns and see if there are any we can drop before we get started.
loan.columns.values

#There are plenty that don't seem very relevant. Let's drop them.
loan=loan.drop(['member_id', 'grade', 'sub_grade', 'emp_title', 'issue_d',
          'pymnt_plan', 'url', 'desc', 'title', 'initial_list_status',
          'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
          'policy_code', 'emp_length', 'addr_state','zip_code'], axis=1)

#Check the data dictionary for this dataset at https://resources.lendingclub.com/LCDataDictionary.xlsx for more details
In [31]:
# Strip stray non-numeric characters throughout the DataFrame (this pattern is conservative; the % signs and
# the 'months' suffix are handled explicitly below):
for col in loan.columns.values:
  loan[col] = loan[col].replace('[^0-9]+.-', '', regex=True)
loan.head(2)
Out[31]:
          loan_amnt  funded_amnt  funded_amnt_inv        term int_rate  \
id                                                                       
12958045      17700        17700            17700   36 months   24.99%   
12968021      19500        19500            19500   60 months   16.59%   

          installment home_ownership  annual_inc         is_inc_v loan_status  \
id                                                                              
12958045       703.66           RENT       90000     Not Verified     Current   
12968021       480.34           RENT       63048  Source Verified     Current   

                     purpose    dti  delinq_2yrs earliest_cr_line  \
id                                                                  
12958045  debt_consolidation  15.52            0             2002   
12968021  debt_consolidation  11.44            0             1992   

          inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
id                                                                         
12958045               1                      56                     107   
12968021               1                      66                      82   

          open_acc  pub_rec  revol_bal revol_util  total_acc  out_prncp  \
id                                                                        
12958045        10        1      12466      65.3%         21   15202.88   
12968021         7        2      19308      66.1%         24   17962.12   

          out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  \
id                                                                       
12958045       15202.88      4925.62          4925.62          2497.12   
12968021       17962.12      3362.38          3362.38          1537.88   

          total_rec_int  total_rec_late_fee  recoveries  \
id                                                        
12958045         2428.5                   0           0   
12968021         1824.5                   0           0   

          collection_recovery_fee  collections_12_mths_ex_med  \
id                                                              
12958045                        0                           0   
12968021                        0                           0   

          mths_since_last_major_derog  
id                                     
12958045                           65  
12968021                           66  
In [32]:
#Remove % symbol from the interest rate & revolving utilization
loan.int_rate=loan.int_rate.str.split('%',1).str[0]
loan.revol_util=loan.revol_util.str.split('%',1).str[0]

#Remove "months" from the loan period
loan.term=loan.term.str.split(' ',2).str[1]

loan.head(2)
Out[32]:
          loan_amnt  funded_amnt  funded_amnt_inv term int_rate  installment  \
id                                                                             
12958045      17700        17700            17700   36    24.99       703.66   
12968021      19500        19500            19500   60    16.59       480.34   

         home_ownership  annual_inc         is_inc_v loan_status  \
id                                                                 
12958045           RENT       90000     Not Verified     Current   
12968021           RENT       63048  Source Verified     Current   

                     purpose    dti  delinq_2yrs earliest_cr_line  \
id                                                                  
12958045  debt_consolidation  15.52            0             2002   
12968021  debt_consolidation  11.44            0             1992   

          inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
id                                                                         
12958045               1                      56                     107   
12968021               1                      66                      82   

          open_acc  pub_rec  revol_bal revol_util  total_acc  out_prncp  \
id                                                                        
12958045        10        1      12466       65.3         21   15202.88   
12968021         7        2      19308       66.1         24   17962.12   

          out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  \
id                                                                       
12958045       15202.88      4925.62          4925.62          2497.12   
12968021       17962.12      3362.38          3362.38          1537.88   

          total_rec_int  total_rec_late_fee  recoveries  \
id                                                        
12958045         2428.5                   0           0   
12968021         1824.5                   0           0   

          collection_recovery_fee  collections_12_mths_ex_med  \
id                                                              
12958045                        0                           0   
12968021                        0                           0   

          mths_since_last_major_derog  
id                                     
12958045                           65  
12968021                           66  
In [33]:
#Let's encode the text columns (income verification, home ownership, loan status, purpose), which currently hold
#textual labels, as numeric.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
loan.is_inc_v = le.fit_transform(loan.is_inc_v.values)
loan.home_ownership=le.fit_transform(loan.home_ownership.values)
loan.loan_status=le.fit_transform(loan.loan_status.values)
loan.purpose=le.fit_transform(loan.purpose.values)

#Finally let's be sure we convert all fields to numeric
loan=loan.convert_objects(convert_numeric=True)

loan.head(2)
Out[33]:
          loan_amnt  funded_amnt  funded_amnt_inv  term  int_rate  \
id                                                                  
12958045      17700        17700            17700    36     24.99   
12968021      19500        19500            19500    60     16.59   

          installment  home_ownership  annual_inc  is_inc_v  loan_status  \
id                                                                         
12958045       703.66               2       90000         0            0   
12968021       480.34               2       63048         1            0   

          purpose    dti  delinq_2yrs  earliest_cr_line  inq_last_6mths  \
id                                                                        
12958045        2  15.52            0              2002               1   
12968021        2  11.44            0              1992               1   

          mths_since_last_delinq  mths_since_last_record  open_acc  pub_rec  \
id                                                                            
12958045                      56                     107        10        1   
12968021                      66                      82         7        2   

          revol_bal  revol_util  total_acc  out_prncp  out_prncp_inv  \
id                                                                     
12958045      12466        65.3         21   15202.88       15202.88   
12968021      19308        66.1         24   17962.12       17962.12   

          total_pymnt  total_pymnt_inv  total_rec_prncp  total_rec_int  \
id                                                                       
12958045      4925.62          4925.62          2497.12         2428.5   
12968021      3362.38          3362.38          1537.88         1824.5   

          total_rec_late_fee  recoveries  collection_recovery_fee  \
id                                                                  
12958045                   0           0                        0   
12968021                   0           0                        0   

          collections_12_mths_ex_med  mths_since_last_major_derog  
id                                                                 
12958045                           0                           65  
12968021                           0                           66  
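A caveat on LabelEncoder: it maps categories to arbitrary integers, which imposes an ordering the categories don't actually have. Tree ensembles often cope with that, but one-hot encoding is the more neutral choice. A sketch of the pattern for one column (loan_ohe is just an illustrative name and isn't used downstream):

#One-hot encode a categorical column instead of label-encoding it (pattern per column)
purpose_dummies = pd.get_dummies(loan['purpose'], prefix='purpose')
loan_ohe = loan.drop('purpose', axis=1).join(purpose_dummies)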
In [34]:
#OK great, let's now get our X and y. We know that interest rate is y.
#Pandas is fantastic, all you need to do is use .values to get the data in numpy format
y=loan.int_rate.values

#Let's remove y from the df so we can get X
del loan['int_rate']
X=loan.values

#Now, the train test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
In [38]:
#Great, in no time we have grabbed an unknown dataset from the web, munged it using Pandas and now have ready-to-go
#training and test numpy arrays for running the GBRT regressor. Let's go!

#WARNING - THIS MIGHT TAKE A WHILE TO RUN. TRY ADJUSTING parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition) and reducing
#the number of values being tried for max_depth/n_estimators.

#SELECT KERNEL > INTERRUPT IN THE MENU IF YOU NEED TO STOP EXECUTION

param_grid={'n_estimators':[100],#,500,1000],
            'learning_rate': [0.1,0.05,0.02],# 0.01],
            'max_depth':[4,6], 
            'min_samples_leaf':[3,5,9,17], 
            'max_features':[1.0,0.3,0.1]
           }
n_jobs=4

#Let's fit GBRT to the Lending Club training dataset by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
Best Estimator learned through GridSearch
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.1, loss='ls',
             max_depth=4, max_features=1.0, max_leaf_nodes=None,
             min_samples_leaf=5, min_samples_split=2, n_estimators=100,
             random_state=None, subsample=1.0, verbose=0, warm_start=False)

In [39]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features

print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)

#The training R-squared is almost 1.0, which means the model explains about 99% of the variance in the training data;
#it also means there's a good chance we're overfitting. Let's check with the learning curves below.
Best Estimator Parameters
---------------------------
n_estimators: 100
max_depth: 4
Learning Rate: 0.1
min_samples_leaf: 5
max_features: 1.0

Train R-squared: 0.99

In [45]:
#OK we'll now call the plot_learning_curve module by feeding it the estimator (best estimator returned from GS) 
#and train/cv sets.

#The module simply runs the estimator multiple times on subsets of the data provided and plots the train and cv scores.
#Note that we're feeding the best parameters we've learned from GridSearchCV to the estimator now.
#We may need to adjust the hyperparameters further if there is overfitting (or underfitting, though unlikely)
title = "Learning Curves (Gradient Boosted Regression Trees)" 
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#OK yes, there is some overfitting there. We can see the training scores in red staying close to 1.0 and the cv scores
#trying their best to catch up as the number of examples increases. This is what overfitting looks like. To address it,
#GBRT basically gives us the following parameters to fine-tune: learning rate, max depth, min samples per leaf and
#max features.
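One knob the grid above didn't touch is subsample: setting it below 1.0 gives stochastic gradient boosting, where each tree sees only a random fraction of the training rows, and that extra randomness often helps with exactly this kind of overfitting. A sketch (the 0.5 value is just illustrative):

#Stochastic gradient boosting: each tree is fit on a random 50% of the training rows
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features, subsample=0.5)
plot_learning_curve(estimator, "Learning Curves (Gradient Boosted Regression Trees), subsample=0.5",
                    X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()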
In [46]:
#The typically recommended values for max depth are 4 to 6, so let's leave it at 4. Let's try increasing the min
#samples leaf parameter; this enforces a lower bound on the number of samples in any given leaf.
min_samples_leaf=9

title = "Learning Curves (Gradient Boosted Regression Trees), min_samples_leaf=9" 
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=min_samples_leaf,
                                      max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#OK that did not really help. 
In [47]:
#Let's try reducing the max features parameter. This puts an upper bound on the number of features considered when
#splitting each node. It's said to work well when n_features > 30. We'll also drop min samples leaf for this run.
max_features=0.5

title = "Learning Curves (Gradient Boosted Regression Trees), max_features=50%" 
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, max_features=max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#Nope that didn't quite improve the cv score either. What happens if we reduce learning rate?
In [48]:
#The lower the learning rate, the more trees we need to train, because each tree's contribution to the ensemble is
#scaled down by that rate.
learning_rate=.01
n_estimators=1000

title = "Learning Curves (Gradient Boosted Regression Trees), 1000 Trees at learning rate .01"
estimator = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#Perhaps that improved it a tiny little bit.
In [81]:
#Before we try anything else, I would like to explore one of the beautiful advantages of growing trees: feature
#importances. Now that we have a publicly available (though anonymized) collection of loan applications, I'm really
#curious to see what impacts the interest rate for a loan application the most.

#Let's take a look

#Calling fit on the estimator so we can look at feature_importances.
estimator.fit(X_train, y_train)

# Calculate the feature ranking - Top 10
importances = estimator.feature_importances_
indices = np.argsort(importances)[::-1]

print "Lending Club Loan Data - Top 10 Important Features\n"

for f in range(10):
    print("%d. %s   (%f)" % (f + 1, loan.columns[indices[f]], importances[indices[f]]))
    
#Plot the top 10 feature importances
indices=indices[:10]
plt.figure()
plt.title("Top 10 Feature importances")
plt.bar(range(10), importances[indices],
       color="r", align="center")
plt.xticks(range(10), loan.columns[indices], fontsize=14, rotation=45)
plt.xlim([-1, 10])
plt.show()

#Mean Feature Importance
print "Mean Feature Importance %.6f" %np.mean(importances)

#Interesting, the total amount of interest received to date is the top predictor of the interest rate. Not too
#surprising when you think about it: the interest received on a loan is partly a consequence of its rate, so this is a
#predictive relationship rather than a causal one.
Lending Club Loan Data - Top 10 Important Features

1. total_rec_int   (0.226486)
2. total_rec_prncp   (0.178744)
3. out_prncp_inv   (0.046132)
4. term   (0.044804)
5. out_prncp   (0.042197)
6. revol_bal   (0.038520)
7. installment   (0.033520)
8. mths_since_last_record   (0.031467)
9. annual_inc   (0.030583)
10. purpose   (0.027809)

Mean Feature Importance 0.031250

In [83]:
#Can we actually trim (like before) and get a better result? Perhaps not, but who's to stop us from trying.
title = "Learning Curves (Gradient Boosted Regression Trees) - Trimmed features to > 1% importance" 

#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)

#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)

#Trimming feature matrices to include only those features with importance greater than 1%.
X_train_trim=estimator.transform(X_train, threshold=.01)

#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold=.01)

#Re-plotting learning curves on the trimmed training set.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()

#Nope; the curve looks like it overfits less, but look at the cv score: in all our fancy attempts it never really crossed
#that ~0.8 R-squared barrier. That tells me we already have a decent model at hand, and also that a 0.9+ R-squared value
#is not always achievable, at least on real-world data. Let's wrap this up.
In [85]:
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
estimator.fit(X_train, y_train)

print "Final Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Final Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Final Test R-squared: %.2f" %estimator.score(X_test, y_test)
Final Estimator Parameters
---------------------------
n_estimators: 100
max_depth: 4
Learning Rate: 0.1
min_samples_leaf: 5
max_features: 1.0

Final Train R-squared: 0.99
Final Test R-squared: 0.83

There it is: a final 0.83 R-squared value on the test set. Compare this to the 0.72 we get with a LinearRegression model (see the Linear Regression notebook in my repos) and you'll see that we've done considerably better.
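For completeness, the linear baseline can be reproduced in a couple of lines with the LinearRegression import already at the top of this notebook; the exact number will vary with the split, so treat this as a sketch rather than the figure quoted from the other notebook.

#Quick ordinary least squares baseline on the same train/test split
lr = LinearRegression()
lr.fit(X_train, y_train)
print "LinearRegression Test R-squared: %.2f" % lr.score(X_test, y_test)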