# Predicting LendingClub interest rates with Linear Regression

Most often in real world applications we need to understand how one variable is determined by a number of others.

For example:

• How is the interest rate charged on a loan affected by credit history and by loan amount?
• How does sales volume change with changes in price. How is this affected by a change in product design?
• How are housing prices affected based on the location, size of the home etc.?

Answering questions like these, requires us to create a model. One of the simplest models we can create is a Linear Model where we start with the assumption that the dependent variable (E.g, House Price) varies linearly with the independent variable(s) (E.g, Size of Home). Essentially fitting a straight line through the data and expecting it to give us a good prediction for values we haven't seen.

For instance, predict the price of a new home (unseen in the training examples) with the model. Naturally this comes at a cost. Depending on the samples we trained on and how well the model was put together, we're going to have deal with an error, that is the cost of making predictions. In simple words, the difference between predictions and actuals.

The idea of building a model is to minimize this error so that when we make a new prediction we can do so with utmost confidence (~95% is a good benchmark). There are multiple ways of minimizing this error, simplest being the least-squares method. In other words, calculating the sum of squares of each error (to eliminate negatives) and minimizing this number.

I intend to explore all of these concepts (and some more) using sklearn in this notebook. scikit-learn makes it easier to seamlessly switch between algorithms because of its uniform implementation.

Feel free to use for your own reference.

In [446]:
```#Importing required Python packages
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import seaborn

np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
```
In [16]:
```#Let's load the Boston house prices dataset provided by sklearn
print "Boston dataset keys"
print boston.keys()
print

#Let's look at number of samples and features
print boston.data.shape
print boston.target.shape

#So there are 506 examples with 13 features and of course 506 labels we can train on (though we won't use the whole set)
```
```Boston dataset keys
['data', 'feature_names', 'DESCR', 'target']

(506, 13)
(506,)

```
In [17]:
```#The DESCR object holds a description of the dataset
print boston.DESCR
```
```Boston House Prices dataset

Notes
------
Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- TAX      full-value property-tax rate per \$10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in \$1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

```
In [26]:
```#Let's do a simple histogram on available house prices to see what we'll be working on
plt.hist(boston.target,bins=20)
plt.suptitle('Boston Housing Prices in \$1000s', fontsize=15)
plt.xlabel('Prices in \$1000s')
plt.ylabel('Count')
plt.show()
```
In [56]:
```#We'd now like to see which features seem to influence prices the most so we can filter out unnecessary features
#from the model. Quickest way to do this is a SPLOM, scatter plot matrix. It's easy to do one in Pandas.

#Let's first convert our data into a pandas dataframe. Here we are taking data and target and plugging it into a single
#dataframe so we can compare side-by-side. If this is not self-explanatory, refer to the pandas notebooks in my repo
#for a better explanation.
df = pd.concat([pd.DataFrame(boston.data, columns=boston.feature_names),
pd.DataFrame(boston.target, columns=['MEDV'])], axis=1)
df.describe()
```
Out[56]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000
In [59]:
```#Let's now look at a few features that we think might be more relevant
#        - CRIM     per capita crime rate by town
#        - RM       average number of rooms per dwelling
#        - DIS      weighted distances to five Boston employment centres
#        - TAX      full-value property-tax rate per \$10,000
#Of course we'll compare all of these to the median house prices.
with seaborn.axes_style('white'):
smaller_frame = df[['CRIM', 'RM', 'DIS', 'RAD', 'TAX','MEDV']]
scatter_matrix(smaller_frame, alpha=0.8, figsize=(12, 12), diagonal="kde")
plt.show()
```
In [62]:
```#So by looking at the SPLOM we can clearly see that the number of rooms (RM) has a clear positive correlation with
#house prices. More the number of rooms, more than price. Similarly, the crime rate has a negative correlation, higher the
#crime rate lower the house price. Of course we knew these even without looking at anything, just based on common sense.

#This is one type of feature selection, done maually. There are other better ways of doing this which we'll explore later.
#For now, lets look at the correlation to make sure we're interpreting things correctly.
df[['MEDV','RM','CRIM']].corr()

#There's the confirmation.
```
Out[62]:
MEDV RM CRIM
MEDV 1.000000 0.69536 -0.385832
RM 0.695360 1.00000 -0.219940
CRIM -0.385832 -0.21994 1.000000
In [278]:
```#OK let's now fit a LinearRegression model to this dataset. We'll start with a very simple classifier in LinearRegression.
#This uses the least squares method I'd mentioned earlier.

#To make it easier for us to visualize this dataset (and how our model fits), let's use PCA to reduce this to a
#single dimension. For more information about PCA, refer to a seperate notebook on PCA in my repo.
data_reduced=PCA(n_components=1).fit_transform(boston.data)

#Let's now split the dataset into train and test sets so we can find out how well the model can generalize.
X_train, X_test, y_train, y_test = train_test_split(data_reduced, boston.target)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape

#Let's fit the LinearRegression classifier to the training set
linr=LinearRegression().fit(X_train, y_train)

#We'll now run predictions on the Test set using the model we just trained
y_pred = linr.predict(X_test)

#Let's check out the score - in this case, this is the R-squared which tells us how much of the
#variance of the data is captured by the model. The higher this number is, the better.
print "R-squared for train: %.2f" %linr.score(X_train, y_train)
print "R-squared for test: %.2f" %linr.score(X_test, y_test)

#That's pretty reasonable. We're able to capture about 67% of variance in the test dataset.
```
```(379, 1) (127, 1) (379,) (127,)
R-squared for train: 0.20
R-squared for test: 0.31

```
In [204]:
```#Let's get some more key stats:
print "Coefficients (Parameters theta_1..theta_n"
print linr.coef_
print "Y intercept (theta_0): %.2f" %linr.intercept_
```
```Coefficients (Parameters theta_1..theta_n
[-0.02636759]
Y intercept (theta_0): 22.82

```
In [279]:
```with seaborn.axes_style('white'):

#Let's quickly plot the data so we can visualize better.
plt.scatter(data_reduced, boston.target, c='r')
plt.plot(X_test, y_pred,'--k', c='b')

plt.show()

#There's our straight line fit to the data. The model's done a reasonable job although there's plenty of bias. Our accuracy
#of predicting new values should be just about average.
```
In [453]:
```#OK let's run through one more example before moving onto more advanced regressors.
#We'll explore anonymous loan data provided by lendingclub. We'll try to predict the interest rate for loan applications
#based on data provided. Let's first download data to a pandas df.

#The Dataset is a zip file. So let's first read in the dataset through requests then pass it on to Pandas through the
url=requests.get('https://resources.lendingclub.com/LoanStats3c.csv.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))

loanbk=loan.copy() #Backup of the dataframe so we don't have to download data everytime
```
In [454]:
```#Let's take a quick peek at the dataset
loan.describe()
```
Out[454]:
```             member_id      loan_amnt    funded_amnt  funded_amnt_inv  \
count    161231.000000  161231.000000  161231.000000    161231.000000
mean   19096209.203993   14864.642190   14864.642190     14858.829723
std     5402435.130132    8412.637516    8412.637516      8409.342582
min      137225.000000    1000.000000    1000.000000       950.000000
25%    14557573.500000    8400.000000    8400.000000      8400.000000
50%    18284171.000000   13000.000000   13000.000000     13000.000000
75%    23292780.000000   20000.000000   20000.000000     20000.000000
max    31526675.000000   35000.000000   35000.000000     35000.000000

installment      annual_inc            dti    delinq_2yrs  \
count  161231.000000   161231.000000  161231.000000  161231.000000
mean      446.003881    75258.495695      17.685653       0.341609
std       245.922837    57407.341278       7.733079       0.891854
min        23.360000     3000.000000       0.000000       0.000000
25%       267.790000    46000.000000      11.890000       0.000000
50%       387.240000    65000.000000      17.360000       0.000000
75%       585.390000    90000.000000      23.320000       0.000000
max      1409.990000  7500000.000000      39.990000      22.000000

inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
count   161231.000000            81008.000000            28929.000000
mean         0.833078               33.436759               71.541706
std          1.094065               21.866078               28.736096
min          0.000000                0.000000                0.000000
25%          0.000000               15.000000               50.000000
50%          0.000000               30.000000               70.000000
75%          1.000000               49.000000              100.000000
max          6.000000              188.000000              121.000000

open_acc        pub_rec       revol_bal     total_acc  \
count  161231.000000  161231.000000   161231.000000  161231.00000
mean       11.632620       0.224324    16034.328436      26.09554
std         5.175382       0.608610    19376.339472      11.73985
min         0.000000       0.000000        0.000000       2.00000
25%         8.000000       0.000000     6323.000000      18.00000
50%        11.000000       0.000000    11607.000000      24.00000
75%        14.000000       0.000000    20209.500000      33.00000
max        76.000000      63.000000  1190046.000000     121.00000

out_prncp  out_prncp_inv    total_pymnt  total_pymnt_inv  \
count  161231.000000  161231.000000  161231.000000    161231.000000
mean    12597.017050   12592.210838    3142.971573      3141.605224
std      8136.428865    8133.531787    3753.058690      3751.385803
min         0.000000       0.000000       0.000000         0.000000
25%      6511.070000    6511.070000    1220.995000      1220.020000
50%     11167.950000   11166.840000    2138.880000      2138.650000
75%     17587.260000   17579.925000    3626.665000      3623.660000
max     35000.000000   35000.000000   40554.969852     40554.970000

total_rec_prncp  total_rec_int  total_rec_late_fee     recoveries  \
count    161231.000000  161231.000000       161231.000000  161231.000000
mean       2209.550404     933.078710            0.040902       0.301558
std        3496.448030     831.964426            1.152217      30.839240
min           0.000000       0.000000            0.000000       0.000000
25%         731.210000     358.200000            0.000000       0.000000
50%        1280.400000     676.500000            0.000000       0.000000
75%        2246.190000    1226.200000            0.000000       0.000000
max       35000.000000    7301.320000          100.380000    4935.000000

collection_recovery_fee  last_pymnt_amnt  collections_12_mths_ex_med  \
count            161231.000000    161231.000000               161231.000000
mean                  0.003164      1048.075335                    0.014191
std                   0.314840      3186.231922                    0.135941
min                   0.000000         0.000000                    0.000000
25%                   0.000000       275.260000                    0.000000
50%                   0.000000       404.270000                    0.000000
75%                   0.000000       629.010000                    0.000000
max                  49.350000     35757.360000                   20.000000

mths_since_last_major_derog  policy_code
count                 45175.000000       161231
mean                     43.363475            1
std                      22.153984            0
min                       0.000000            1
25%                      26.000000            1
50%                      43.000000            1
75%                      60.000000            1
max                     188.000000            1
```
In [489]:
```#For simplicity, let's first drop nulls in the dataset. axis=1 indicates we'll drop rows not cols.
loan = loan.dropna(axis=0)
```
In [490]:
```#OK let's take a look at the columns and see if there are any we can drop any before we get started.
loan.columns.values

#There're plenty that don't seem very relevant. Let's drop them.
'pymnt_plan', 'url', 'desc', 'title', 'initial_list_status',
'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',

#Check the data dictionary for this dataset at https://resources.lendingclub.com/LCDataDictionary.xlsx for more details
```
In [491]:
```# Get rid of non-numeric values throughout the DataFrame:
for col in loan.columns.values:
loan[col] = loan[col].replace('[^0-9]+.-', '', regex=True)
```
Out[491]:
```          loan_amnt  funded_amnt  funded_amnt_inv        term int_rate  \
id
12958045      17700        17700            17700   36 months   24.99%
12968021      19500        19500            19500   60 months   16.59%

installment home_ownership  annual_inc         is_inc_v loan_status  \
id
12958045       703.66           RENT       90000     Not Verified     Current
12968021       480.34           RENT       63048  Source Verified     Current

purpose    dti  delinq_2yrs earliest_cr_line  \
id
12958045  debt_consolidation  15.52            0             2002
12968021  debt_consolidation  11.44            0             1992

inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
id
12958045               1                      56                     107
12968021               1                      66                      82

open_acc  pub_rec  revol_bal revol_util  total_acc  out_prncp  \
id
12958045        10        1      12466      65.3%         21   15202.88
12968021         7        2      19308      66.1%         24   17962.12

out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  \
id
12958045       15202.88      4925.62          4925.62          2497.12
12968021       17962.12      3362.38          3362.38          1537.88

total_rec_int  total_rec_late_fee  recoveries  \
id
12958045         2428.5                   0           0
12968021         1824.5                   0           0

collection_recovery_fee  collections_12_mths_ex_med  \
id
12958045                        0                           0
12968021                        0                           0

mths_since_last_major_derog
id
12958045                           65
12968021                           66
```
In [492]:
```#Remove % symbol from the interest rate & revolving utilization
loan.int_rate=loan.int_rate.str.split('%',1).str[0]
loan.revol_util=loan.revol_util.str.split('%',1).str[0]

#Remove "months" from the loan period
loan.term=loan.term.str.split(' ',2).str[1]

```
Out[492]:
```          loan_amnt  funded_amnt  funded_amnt_inv term int_rate  installment  \
id
12958045      17700        17700            17700   36    24.99       703.66
12968021      19500        19500            19500   60    16.59       480.34

home_ownership  annual_inc         is_inc_v loan_status  \
id
12958045           RENT       90000     Not Verified     Current
12968021           RENT       63048  Source Verified     Current

purpose    dti  delinq_2yrs earliest_cr_line  \
id
12958045  debt_consolidation  15.52            0             2002
12968021  debt_consolidation  11.44            0             1992

inq_last_6mths  mths_since_last_delinq  mths_since_last_record  \
id
12958045               1                      56                     107
12968021               1                      66                      82

open_acc  pub_rec  revol_bal revol_util  total_acc  out_prncp  \
id
12958045        10        1      12466       65.3         21   15202.88
12968021         7        2      19308       66.1         24   17962.12

out_prncp_inv  total_pymnt  total_pymnt_inv  total_rec_prncp  \
id
12958045       15202.88      4925.62          4925.62          2497.12
12968021       17962.12      3362.38          3362.38          1537.88

total_rec_int  total_rec_late_fee  recoveries  \
id
12958045         2428.5                   0           0
12968021         1824.5                   0           0

collection_recovery_fee  collections_12_mths_ex_med  \
id
12958045                        0                           0
12968021                        0                           0

mths_since_last_major_derog
id
12958045                           65
12968021                           66
```
In [493]:
```#Let's change the Income Verified column, which currently has textual labels to numeric.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
loan.is_inc_v = le.fit_transform(loan.is_inc_v.values)
loan.home_ownership=le.fit_transform(loan.home_ownership.values)
loan.loan_status=le.fit_transform(loan.loan_status.values)
loan.purpose=le.fit_transform(loan.purpose.values)

#Finally let's be sure we convert all fields to numeric
loan=loan.convert_objects(convert_numeric=True)

```
Out[493]:
```          loan_amnt  funded_amnt  funded_amnt_inv  term  int_rate  \
id
12958045      17700        17700            17700    36     24.99
12968021      19500        19500            19500    60     16.59

installment  home_ownership  annual_inc  is_inc_v  loan_status  \
id
12958045       703.66               2       90000         0            0
12968021       480.34               2       63048         1            0

purpose    dti  delinq_2yrs  earliest_cr_line  inq_last_6mths  \
id
12958045        2  15.52            0              2002               1
12968021        2  11.44            0              1992               1

mths_since_last_delinq  mths_since_last_record  open_acc  pub_rec  \
id
12958045                      56                     107        10        1
12968021                      66                      82         7        2

revol_bal  revol_util  total_acc  out_prncp  out_prncp_inv  \
id
12958045      12466        65.3         21   15202.88       15202.88
12968021      19308        66.1         24   17962.12       17962.12

total_pymnt  total_pymnt_inv  total_rec_prncp  total_rec_int  \
id
12958045      4925.62          4925.62          2497.12         2428.5
12968021      3362.38          3362.38          1537.88         1824.5

total_rec_late_fee  recoveries  collection_recovery_fee  \
id
12958045                   0           0                        0
12968021                   0           0                        0

collections_12_mths_ex_med  mths_since_last_major_derog
id
12958045                           0                           65
12968021                           0                           66
```
In [494]:
```#OK great, let's now get our X and y. We know that interest rate is y.
#Pandas is fantastic, all you need to do is use .values to get the data in numpy format
y=loan.int_rate.values

#Let's remove y from the df so we can get X
del loan['int_rate']
X=loan.values

#Now, the train test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
```
In [501]:
```#Alright let's now fit a linear regression model to the training set
linr=LinearRegression().fit(X_train, y_train)

#Alright let's get the parameters we've learned
print "Coefficients (theta_1..theta_n)"
print linr.coef_
print
print "Y Intercept(theta0)"
print linr.intercept_
print

#Let's also spit out the R-squared values for Train and Test
print "R-squared for Train: %.2f" %linr.score(X_train, y_train)
print "R-squared for Test: %.2f" %linr.score(X_test, y_test)

#There we have it, the R-squared value on the test set is about 72%, which is not great but understandable considering
#the data must be much more sophisticated than a straight line. The only other thing we can do with this regressor is
#to normalize the data before training (value - mean /std) so all values are in the same range from 0 to 1.
#Let's try this in the next step
```
```Coefficients (theta_1..theta_n)
[ -5.41845587e-03  -5.41845587e-03   2.56023202e-02   2.37941420e-01
3.02727769e-02   1.93785478e-01  -1.49959149e-06   1.05069805e-02
8.99988441e-01   3.67418300e-01  -7.10859025e-03   5.79387360e-02
6.49132404e-03   1.92378475e-01  -3.13657813e-03  -2.81881841e-03
4.87178733e-03   1.86943103e-02   5.48467229e-06   8.12539204e-03
-4.15528637e-03  -2.55097759e-02   9.36511649e-03   8.31602560e-02
-1.19542112e-01   2.00913201e-02   4.07658777e-02   2.23030582e-02
0.00000000e+00   0.00000000e+00   3.14373914e-01   4.38635209e-03]

Y Intercept(theta0)
-9.3670079647

R-squared for Train: 0.81
R-squared for Test: 0.72

```
In [504]:
```#Let's now fit a linear regression model with normalize=True to the training set
linr=LinearRegression(normalize=True).fit(X_train, y_train)

#Alright let's get the parameters we've learned
print "Coefficients (theta_1..theta_n)"
print linr.coef_
print
print "Y Intercept(theta0)"
print linr.intercept_
print

#Let's also spit out the R-squared values for Train and Test
print "R-squared for Train: %.2f" %linr.score(X_train, y_train)
print "R-squared for Test: %.2f" %linr.score(X_test, y_test)

#There we have it, the R-squared value on the test set is about 71%, actually a bit less than what we managed before.

#I will explore more sophisticated regressors in a seperate post.
```
```Coefficients (theta_1..theta_n)
[  4.04645767e+09   3.87810757e+09   2.73530283e-02   2.37214963e-01
3.02840682e-02   1.78692334e-01  -1.49583810e-06   3.53254708e-03
8.94575080e-01   3.68010061e-01  -8.21068167e-03   7.43240431e-02
7.57253404e-03   1.93104864e-01  -3.27704686e-03  -2.96730415e-03
3.71869073e-03   2.02756071e-02   5.26288744e-06   8.20479422e-03
-3.05924327e-03  -7.92456524e+09   7.66817237e-03  -5.15822934e+09
-1.20501121e-01  -2.76633589e+09   5.15822934e+09   5.15822934e+09
-1.16820204e-01   0.00000000e+00   3.31651317e-01   4.79238930e-03]

Y Intercept(theta0)
-11.450916369

R-squared for Train: 0.82
R-squared for Test: 0.71

```