Gradient Boosted Regression Trees (GBRT), or Gradient Boosting for short, is a flexible non-parametric statistical learning technique for classification and regression. In this notebook I'll demonstrate learning with GBRT through multiple examples.
Feel free to use it for your own reference. Let's get started.
In [26]:
#Importing required Python packages
%matplotlib inline
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston, fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.grid_search import GridSearchCV
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import requests
import zipfile
import StringIO
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
In [2]:
#Let's begin by downloading one of scikit-learn's sample datasets - California Housing
cal=fetch_california_housing()
In [8]:
#Let's check out the structure of the dataset
print cal.keys()
print
#DESCR contains a description of the dataset
print cal.DESCR
#Great, as expected the dataset contains housing data with several features, including median income, average number of
#rooms/bedrooms etc., with the median house value as the target variable.
#Let's use GBRT to build a model that can predict house prices.
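Before modeling, it can help to glance at the feature names and a quick statistical summary. A small optional aside (not part of the original flow), using the pandas import from the setup cell:
#Optional: a quick look at the features before modeling
print cal.feature_names
cal_df = pd.DataFrame(cal.data, columns=cal.feature_names)
print cal_df.describe()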
In [3]:
#Let's first do a train/test split so we hold out some examples to find out how well our model generalizes. If you're new
#to these concepts, check out one of the other repos in my profile, like the SVM one, where I illustrate them in detail.
X_train, X_test, y_train, y_test = train_test_split(cal.data, cal.target)
print X_train.shape, X_test.shape
#Default split is 75/25
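It also helps to know what the target looks like before fitting a regressor. As a quick optional aside: the California housing target is the median house value, expressed in units of $100,000.
#Optional: inspect the target we're trying to predict
print "Target range: %.2f - %.2f" % (y_train.min(), y_train.max())
plt.hist(y_train, bins=50)
plt.xlabel("Median house value (in $100,000s)")
plt.ylabel("Count")
plt.show()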
In [25]:
#OK, now let's fit a GBRT regressor to the training dataset. The model has several hyperparameters which we will tune
#in subsequent steps. For now, we'll just build a simple model.
#Let's instantiate, fit and predict.
gbrt=GradientBoostingRegressor(n_estimators=100)
gbrt.fit(X_train, y_train)
y_pred=gbrt.predict(X_test)
In [26]:
#Let's get some stats.
#One of the benefits of growing trees is that we can see how important each feature is
print "Feature Importances"
print gbrt.feature_importances_
print
#Let's print the R-squared value for train/test. This tells us how much of the variance in the data the model
#is able to explain.
print "R-squared for Train: %.2f" %gbrt.score(X_train, y_train)
print "R-squared for Test: %.2f" %gbrt.score(X_test, y_test)
In [4]:
#All right, let's do this the right way. We'll use a cross-validation generator to select train and CV sets so we can
#fine-tune hyperparameters such as the learning rate, tree depth and number of estimators. These hyperparameters are
#extremely critical to the model.
#Now, if we tune parameters against the Test dataset, we will end up biasing towards the test set and will once again
#not generalize very well. We will also have no good way to find out, since we would essentially have tuned on all our data.
#Luckily scikit-learn has built-in packages that can help with this. We'll use a cross-validation generator that
#trains the model while tuning the parameters on a cross-validation subset (cv) picked from within the
#training set. A different cv subset will be picked for each iteration; we control the number of iterations. We'll then
#feed these cv/train splits to a grid search function that evaluates the model on each split and returns the parameter
#combination that gives the optimal result.
#Defining this as a function so we can call it anytime we want
def GradientBooster(param_grid, n_jobs):
    estimator = GradientBoostingRegressor()
    #Choose a cross-validation generator - let's choose ShuffleSplit which randomly shuffles and selects Train and CV sets
    #for each iteration. There are other methods like the KFold split.
    cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2)
    #Apply the cross-validation iterator on the Training set using GridSearchCV. This will run the regressor on the
    #different train/cv splits using the parameters specified and return the model that has the best results.
    #Note that, for a regressor, GridSearchCV tunes on the estimator's default score, which here is R-squared. That may
    #not always be the best metric to tune on; I'll explore scoring choices further in a separate exercise.
    regressor = GridSearchCV(estimator=estimator, cv=cv, param_grid=param_grid, n_jobs=n_jobs)
    #Also note that we're feeding multiple candidate values for each hyperparameter to the grid search to try out.
    #We'll now fit the training dataset to this regressor
    regressor.fit(X_train, y_train)
    #Let's look at the best estimator that was found by GridSearchCV
    print "Best Estimator learned through GridSearch"
    print regressor.best_estimator_
    return cv, regressor.best_estimator_
In [5]:
#Below is a plot_learning_curve module that's provided by scikit-learn. It allows us to quickly and easily visualize how
#well the model is performing based on the number of samples we're training on. It helps to diagnose situations such as
#high variance or high bias.
#We'll call this module in the next segment.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [13]:
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. Try adjusting parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition), and the
#number of values being tried for max_depth/n_estimators.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
param_grid={'n_estimators':[100],
'learning_rate': [0.1],# 0.05, 0.02, 0.01],
'max_depth':[6],#4,6],
'min_samples_leaf':[3],#,5,9,17],
'max_features':[1.0],#,0.3]#,0.1]
}
n_jobs=4
#Let's fit GBRT to the California housing training set by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
In [14]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)
#Each of these parameters is critical to learning. Some of them also help address overfitting. For more info about
#overfitting and regularization, check out the SVM notebook in my GitHub repos where I cover the subject in more detail.
In [12]:
#OK, we'll now call the plot_learning_curve module, feeding it the best estimator returned by GridSearchCV and the
#train/cv sets. The module simply runs the estimator multiple times on subsets of the data provided and plots the
#train and cv scores. We may need to adjust the hyperparameters further if there is overfitting (or underfitting,
#though that's unlikely).
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Looks like we've done a reasonable job, getting about ~0.85 R-squared on the cv set, and it looks from the learning
#curve like we may be able to do a bit better with more estimators, although we may then need to reduce the learning rate
#even further to keep overfitting in check.
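Before doing that full re-run below, here's a cheaper way to eyeball how the score evolves with the number of trees: staged_predict yields the model's predictions after every boosting stage, so a single 1000-tree fit tells us where the curve flattens out. This is a sketch I'm adding for illustration (ideally you'd score a separate validation split rather than the final test set):
#Fit once with plenty of trees, then score every stage on held-out data
est_stage = GradientBoostingRegressor(n_estimators=1000, max_depth=best_est.max_depth,
                                      learning_rate=best_est.learning_rate,
                                      min_samples_leaf=best_est.min_samples_leaf,
                                      max_features=best_est.max_features)
est_stage.fit(X_train, y_train)
stage_scores = [metrics.r2_score(y_test, y_stage) for y_stage in est_stage.staged_predict(X_test)]
plt.plot(np.arange(1, len(stage_scores) + 1), stage_scores)
plt.xlabel("Number of trees")
plt.ylabel("Held-out R-squared")
plt.show()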
In [17]:
#Let's test that hunch: we'll bump n_estimators up to 1000, keep the other best parameters from GridSearchCV, and
#re-plot the learning curves to see whether more trees help.
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=1000, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK, that didn't really help. It did improve the training score, but there's noticeably more overfitting (look at the wide
#gap between the train and cv curves). That could possibly be addressed by further reducing the learning rate. But also note
#that the cv scores are pretty much the same, so we probably won't see a major improvement unless we can procure more samples.
In [16]:
#Let's try one more trick. We'll trim the training set to its most important features and re-train to see if
#that helps.
title = "Learning Curves (Gradient Boosted Regression Trees)"
#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)
#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)
#Trimming feature matrices to include only those features that are more important than the mean of all importances.
X_train_trim=estimator.transform(X_train, threshold='mean')
#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold='mean')
#Re-plotting learning curves, this time on the trimmed feature matrix.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#So what do we infer from this plot? We seem to have addressed overfitting much better but the overall score of both train
#and cv has gone down considerably, indicating that the features we dropped were actually collectively contributing
#to the model. Let's go back to the first model that the Grid Search returned and run our test scores.
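As a quick aside before switching back: the transform(threshold='mean') call used above is roughly equivalent to masking columns manually with feature_importances_, which makes it easier to see exactly which features survive the trim. A sketch added for illustration:
#Roughly what transform(threshold='mean') does under the hood
mask = estimator.feature_importances_ >= estimator.feature_importances_.mean()
X_train_manual = X_train[:, mask]
print "Features kept:", list(np.array(cal.feature_names)[mask])
print "Trimmed shape:", X_train_manual.shape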
In [15]:
#Switching back to the best model from gridsearch
estimator = best_est
#Re-fitting to the train set
estimator.fit(X_train, y_train)
#Calculating train/test scores - R-squared value
print "Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Test R-squared: %.2f" %estimator.score(X_test, y_test)
#There you have it, our final R-squared on the California housing dataset, 0.82
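For context, here's a quick baseline with the LinearRegression model imported in the setup cell. This comparison wasn't part of the original run, so treat the resulting number as indicative only:
#A plain linear model on the same train/test split for comparison
lr = LinearRegression()
lr.fit(X_train, y_train)
print "LinearRegression Test R-squared: %.2f" % lr.score(X_test, y_test)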
In [27]:
#OK, let's run through a more complex example this time. We'll explore anonymized loan data provided by Lending Club
#and try to predict the interest rate for loan applications based on the data provided. Let's first get the data into
#a pandas dataframe.
#The dataset is a zip file, so we'll download it with requests, unpack it with zipfile, and then hand it to pandas
#through the read_csv command.
url=requests.get('https://resources.lendingclub.com/LoanStats3c.csv.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))
loan=pd.read_csv(z.open('LoanStats3c.csv'), skiprows=1, parse_dates=True, index_col='id')
loanbk=loan.copy() #Backup of the dataframe so we don't have to download the data every time
In [28]:
#Let's take a quick peek at the dataset
loan.describe()
Out[28]:
In [29]:
#For simplicity, let's first drop rows with nulls in the dataset. axis=0 indicates we drop rows, not columns.
loan = loan.dropna(axis=0)
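It's worth knowing how aggressive that dropna call was. A quick optional check against the backup dataframe we kept earlier:
#Optional sanity check: how many rows did dropna remove?
print "Rows before dropna: %d, after: %d" % (loanbk.shape[0], loan.shape[0])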
In [30]:
#OK, let's take a look at the columns and see if there are any we can drop before we get started.
loan.columns.values
#There are plenty that don't seem very relevant. Let's drop them.
loan=loan.drop(['member_id', 'grade', 'sub_grade', 'emp_title', 'issue_d',
'pymnt_plan', 'url', 'desc', 'title', 'initial_list_status',
'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
'policy_code', 'emp_length', 'addr_state','zip_code'], axis=1)
#Check the data dictionary for this dataset at https://resources.lendingclub.com/LCDataDictionary.xlsx for more details
In [31]:
# Get rid of non-numeric values throughout the DataFrame:
for col in loan.columns.values:
loan[col] = loan[col].replace('[^0-9]+.-', '', regex=True)
loan.head(2)
Out[31]:
In [32]:
#Remove % symbol from the interest rate & revolving utilization
loan.int_rate=loan.int_rate.str.split('%',1).str[0]
loan.revol_util=loan.revol_util.str.split('%',1).str[0]
#Remove "months" from the loan period
loan.term=loan.term.str.split(' ',2).str[1]
loan.head(2)
Out[32]:
In [33]:
#Let's convert the income-verification, home-ownership, loan-status and purpose columns, which currently hold
#textual labels, to numeric codes.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
loan.is_inc_v = le.fit_transform(loan.is_inc_v.values)
loan.home_ownership=le.fit_transform(loan.home_ownership.values)
loan.loan_status=le.fit_transform(loan.loan_status.values)
loan.purpose=le.fit_transform(loan.purpose.values)
#Finally let's be sure we convert all fields to numeric
loan=loan.convert_objects(convert_numeric=True)
loan.head(2)
Out[33]:
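One caveat worth flagging: LabelEncoder maps categories to arbitrary integers, which implies an ordering that fields like purpose or home_ownership don't really have. Tree ensembles can usually cope with that, but one-hot encoding is the more neutral choice. A small optional sketch against the untouched backup dataframe (not used in the rest of the notebook):
#Optional: one-hot encode a categorical column instead of label-encoding it
purpose_dummies = pd.get_dummies(loanbk.purpose, prefix='purpose')
print purpose_dummies.head(2)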
In [34]:
#OK great, let's now get our X and y. We know that interest rate is y.
#Pandas is fantastic, all you need to do is use .values to get the data in numpy format
y=loan.int_rate.values
#Let's remove y from the df so we can get X
del loan['int_rate']
X=loan.values
#Now, the train test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
In [38]:
#Great - in no time we've grabbed an unfamiliar dataset from the web, munged it with pandas, and now have ready-to-go
#training and test numpy arrays for the GBRT regressor. Let's go!
#WARNING - THIS MIGHT TAKE A WHILE TO RUN. Try adjusting parameters such as n_jobs (jobs to run in parallel; before
#increasing this make sure your system can handle it), n_iter for ShuffleSplit (in the function definition), and the
#number of values being tried for max_depth/n_estimators.
#SELECT INTERRUPT IN THE MENU AND PRESS INTERRUPT KERNEL IF YOU NEED TO STOP EXECUTION
param_grid={'n_estimators':[100],#,500,1000],
'learning_rate': [0.1,0.05,0.02],# 0.01],
'max_depth':[4,6],
'min_samples_leaf':[3,5,9,17],
'max_features':[1.0,0.3,0.1]
}
n_jobs=4
#Let's fit GBRT to the Lending Club training set by calling the function we just created.
cv,best_est=GradientBooster(param_grid, n_jobs)
In [39]:
#OK great, so we got back the best estimator parameters as follows:
print "Best Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Train R-squared: %.2f" %best_est.score(X_train,y_train)
#The training R-squared is almost 1.0, which means the model explains nearly all of the variance in the training data.
#That also suggests we might be overfitting. Let's check with the learning curves below.
In [45]:
#As before, we'll call plot_learning_curve with the best estimator returned by GridSearchCV and the train/cv sets, and
#adjust the hyperparameters further if there is overfitting (or underfitting, though that's unlikely).
title = "Learning Curves (Gradient Boosted Regression Trees)"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK yes, there is some overfitting there. We can see the training score in red staying close to 1.0 and the cv score
#struggling to catch up as the number of examples increases; that's the classic signature of overfitting. To address it,
#GBRT basically gives us the following parameters to fine-tune: learning rate, max depth, min samples leaf and
#max features.
In [46]:
#The typically recommended range for max_depth is 4 to 6, so let's leave it where it is. Let's try increasing the
#min_samples_leaf parameter instead; this enforces a lower bound on the number of samples in any given leaf.
min_samples_leaf=9
title = "Learning Curves (Gradient Boosted Regression Trees), min_samples_leaf=9"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#OK that did not really help.
In [47]:
#Let's try reducing the max_features parameter. This caps the number of features considered when looking for the best
#split in each tree. It's said to work well when n_features > 30. We'll also drop min_samples_leaf for this run.
max_features=0.5
title = "Learning Curves (Gradient Boosted Regression Trees), max_features=50%"
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, max_features=max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Nope, that didn't really improve the cv score either. What happens if we reduce the learning rate?
In [48]:
#The lower the learning rate, the more trees we need to train, since each tree now contributes a smaller step
#towards the overall fit.
learning_rate=.01
n_estimators=1000
title = "Learning Curves (Gradient Boosted Regression Trees), 1000 Trees at learning rate .01"
estimator = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=best_est.max_depth,
learning_rate=learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Perhaps that improved it a tiny little bit.
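One more regularization knob we didn't include in the grid is subsample: values below 1.0 train each tree on a random fraction of the rows (stochastic gradient boosting), which usually trades a little bias for lower variance. A sketch with an arbitrary, untuned value of 0.8:
#Try row subsampling as an additional regularizer
estimator_sub = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
                                          learning_rate=best_est.learning_rate,
                                          min_samples_leaf=best_est.min_samples_leaf,
                                          max_features=best_est.max_features, subsample=0.8)
plot_learning_curve(estimator_sub, "Learning Curves (GBRT), subsample=0.8", X_train, y_train, cv=cv, n_jobs=n_jobs)
plt.show()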
In [81]:
#Before we try anything else, I would like to explore one of the beautiful advantages of growing trees: feature
#importances. Now that we have a publicly available (though anonymized) collection of loan applications, I'm really
#curious to see what impacts the interest rate of a loan application the most.
#Let's take a look
#Calling fit on the estimator so we can look at feature_importances.
estimator.fit(X_train, y_train)
# Calculate the feature ranking - Top 10
importances = estimator.feature_importances_
indices = np.argsort(importances)[::-1]
print "Lending Club Loan Data - Top 10 Important Features\n"
for f in range(10):
print("%d. %s (%f)" % (f + 1, loan.columns[indices[f]], importances[indices[f]]))
#Plot the feature importances of the forest
indices=indices[:10]
plt.figure()
plt.title("Top 10 Feature importances")
plt.bar(range(10), importances[indices],
color="r", align="center")
plt.xticks(range(10), loan.columns[indices], fontsize=14, rotation=45)
plt.xlim([-1, 10])
plt.show()
#Mean Feature Importance
print "Mean Feature Importance %.6f" %np.mean(importances)
#Interesting - the total amount of interest received to date is the single most influential feature for the interest
#rate. Good for the lenders, eh? Pay more interest and we'll give you a cut on the rate. Of course!
In [83]:
#Can we actually trim (like before) and get a better result? Perhaps not, but who's to stop us from trying.
title = "Learning Curves (Gradient Boosted Regression Trees) - Trimmed features to > 1% importance"
#Dropping all parameters except n_estimators and learning_rate since we're going to trim the features anyway.
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, learning_rate=best_est.learning_rate)
#Calling fit on the estimator so we can transform the X matrices.
estimator.fit(X_train, y_train)
#Trimming the feature matrix to include only those features with more than 1% importance (threshold=.01).
X_train_trim=estimator.transform(X_train, threshold=.01)
#Trimming test as well in case we end up going with this model as final.
X_test_trim=estimator.transform(X_test, threshold=.01)
#Re-plotting learning curves, this time on the trimmed feature matrix.
plot_learning_curve(estimator, title, X_train_trim, y_train, cv=cv, n_jobs=n_jobs)
plt.show()
#Nope. The curve looks like it overfits less, but look at the cv score - across all our attempts it never really crossed
#that ~0.8 R-squared barrier. That tells me we already have a decent model at hand, and that a 0.9+ R-squared is not always
#achievable, at least on real-world data. Let's wrap this up.
In [85]:
estimator = GradientBoostingRegressor(n_estimators=best_est.n_estimators, max_depth=best_est.max_depth,
learning_rate=best_est.learning_rate, min_samples_leaf=best_est.min_samples_leaf,
max_features=best_est.max_features)
estimator.fit(X_train, y_train)
print "Final Estimator Parameters"
print"---------------------------"
print "n_estimators: %d" %best_est.n_estimators
print "max_depth: %d" %best_est.max_depth
print "Learning Rate: %.1f" %best_est.learning_rate
print "min_samples_leaf: %d" %best_est.min_samples_leaf
print "max_features: %.1f" %best_est.max_features
print
print "Final Train R-squared: %.2f" %estimator.score(X_train, y_train)
print "Final Test R-squared: %.2f" %estimator.score(X_test, y_test)
There it is: a final 0.83 R-squared value on the Test set. Compare this to the 0.72 we get with a LinearRegression model (see the Linear Regression notebook in my repos) and you'll see that we've done noticeably better.
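If you want to reuse the final fitted model without retraining, it can be persisted to disk. A minimal sketch using the joblib bundled with scikit-learn (the filename is arbitrary):
#Persist the fitted model for later reuse
from sklearn.externals import joblib
joblib.dump(estimator, 'gbrt_lendingclub.pkl')
#Load it back later with: estimator = joblib.load('gbrt_lendingclub.pkl')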