In real-world applications, we most often need to understand how one variable is determined by a number of others.
For example:
- How is the interest rate charged on a loan affected by credit history and by loan amount?
- How does sales volume change with changes in price? How is it affected by a change in product design?
- How are housing prices affected by location, the size of the home, etc.?
Answering questions like these requires us to create a model. One of the simplest models we can create is a linear model, where we start with the assumption that the dependent variable (e.g., house price) varies linearly with the independent variable(s) (e.g., size of the home). Essentially, we fit a straight line through the data and expect it to give us good predictions for values we haven't seen.
For instance, we could use the model to predict the price of a new home (one unseen in the training examples). Naturally this comes at a cost: depending on the samples we trained on and how well the model was put together, we're going to have to deal with an error, the cost of making predictions. In simple terms, it is the difference between predictions and actuals.
The idea of building a model is to minimize this error so that when we make a new prediction we can do so with high confidence (~95% is a common benchmark). There are multiple ways of minimizing this error, the simplest being the least-squares method: we square each error (so that positive and negative errors don't cancel out), sum the squares, and minimize that sum.
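To make the idea concrete, here is a tiny hand-rolled sketch (the numbers and names below are made up purely for illustration): with one feature the model is just y = theta_0 + theta_1 * x, and the least-squares cost is the sum of squared differences between predictions and actuals.
import numpy as np
x = np.array([1.0, 2.0, 3.0])            # e.g. home sizes (illustrative values)
y = np.array([1.1, 1.9, 3.2])            # e.g. actual prices (illustrative values)
theta0, theta1 = 0.0, 1.0                # a candidate intercept and slope
predictions = theta0 + theta1 * x        # the straight-line model
cost = np.sum((predictions - y) ** 2)    # sum of squared errors, the quantity least squares minimizes
print "Least-squares cost: %.2f" %cost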
I intend to explore all of these concepts (and a few more) using sklearn in this notebook. scikit-learn makes it easy to seamlessly switch between algorithms because of its uniform API.
Feel free to use this for your own reference.
#Importing required Python packages
import matplotlib.pylab as plt
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification, make_blobs, load_boston
from sklearn.decomposition import PCA
from sklearn.cross_validation import ShuffleSplit, train_test_split
from sklearn import metrics
from sklearn.learning_curve import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from pprint import pprint
import pandas as pd
from pandas.tools.plotting import scatter_matrix
import urllib
import seaborn
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 40)
pd.set_option('display.max_rows', 25)
pd.options.display.max_colwidth = 50
#Let's load the Boston house prices dataset provided by sklearn
boston=load_boston()
print "Boston dataset keys"
print boston.keys()
print
#Let's look at number of samples and features
print boston.data.shape
print boston.target.shape
#So there are 506 examples with 13 features and of course 506 labels we can train on (though we won't use the whole set)
#The DESCR object holds a description of the dataset
print boston.DESCR
#Let's do a simple histogram on available house prices to see what we'll be working on
plt.hist(boston.target,bins=20)
plt.suptitle('Boston Housing Prices in $1000s', fontsize=15)
plt.xlabel('Prices in $1000s')
plt.ylabel('Count')
plt.show()
#We'd now like to see which features seem to influence prices the most so we can filter out unnecessary features
#from the model. The quickest way to do this is a SPLOM (scatter plot matrix), and it's easy to do one in Pandas.
#Let's first convert our data into a pandas dataframe. Here we take the data and target and plug them into a single
#dataframe so we can compare them side-by-side. If this is not self-explanatory, refer to the pandas notebooks in my repo
#for a better explanation.
df = pd.concat([pd.DataFrame(boston.data, columns=boston.feature_names),
pd.DataFrame(boston.target, columns=['MEDV'])], axis=1)
df.describe()
#Let's now look at a few features that we think might be more relevant
# - CRIM per capita crime rate by town
# - RM average number of rooms per dwelling
# - DIS weighted distances to five Boston employment centres
# - RAD index of accessibility to radial highways
# - TAX full-value property-tax rate per $10,000
#Of course we'll compare all of these to the median house prices.
with seaborn.axes_style('white'):
    smaller_frame = df[['CRIM', 'RM', 'DIS', 'RAD', 'TAX','MEDV']]
    scatter_matrix(smaller_frame, alpha=0.8, figsize=(12, 12), diagonal="kde")
    plt.show()
#So by looking at the SPLOM we can clearly see that the number of rooms (RM) has a clear positive correlation with
#house prices: the more rooms, the higher the price. Similarly, the crime rate has a negative correlation: the higher the
#crime rate, the lower the house price. Of course we knew this even without looking at anything, just based on common sense.
#This is one type of feature selection, done manually. There are other, better ways of doing this which we'll explore later.
#For now, let's look at the correlations to make sure we're interpreting things correctly.
df[['MEDV','RM','CRIM']].corr()
#There's the confirmation.
#OK let's now fit a LinearRegression model to this dataset. We'll start with a very simple estimator, LinearRegression.
#This uses the least-squares method I mentioned earlier.
#To make it easier for us to visualize this dataset (and how our model fits), let's use PCA to reduce it to a
#single dimension. For more information about PCA, refer to a separate notebook on PCA in my repo.
data_reduced=PCA(n_components=1).fit_transform(boston.data)
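#Optional sanity check (an extra step, not required for the fit below): how much of the original
#variance does this single component keep? The exact figure depends on the data.
pca_check = PCA(n_components=1).fit(boston.data)
print "Variance ratio of the 1st component: %.2f" %pca_check.explained_variance_ratio_[0]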
#Let's now split the dataset into train and test sets so we can find out how well the model can generalize.
X_train, X_test, y_train, y_test = train_test_split(data_reduced, boston.target)
print X_train.shape, X_test.shape, y_train.shape, y_test.shape
#Let's fit the LinearRegression model to the training set
linr=LinearRegression().fit(X_train, y_train)
#We'll now run predictions on the Test set using the model we just trained
y_pred = linr.predict(X_test)
#Let's check out the score - in this case, this is the R-squared which tells us how much of the
#variance of the data is captured by the model. The higher this number is, the better.
print "R-squared for train: %.2f" %linr.score(X_train, y_train)
print "R-squared for test: %.2f" %linr.score(X_test, y_test)
#That's pretty reasonable. We're able to capture about 67% of variance in the test dataset.
#Let's get some more key stats:
print "Coefficients (Parameters theta_1..theta_n"
print linr.coef_
print "Y intercept (theta_0): %.2f" %linr.intercept_
with seaborn.axes_style('white'):
    #Let's quickly plot the data so we can visualize the fit better.
    plt.scatter(data_reduced, boston.target, c='r')
    plt.plot(X_test, y_pred, '--', c='b')
    plt.show()
#There's our straight line fit to the data. The model has done a reasonable job, although there's plenty of bias, so our
#accuracy in predicting new values should be just about average.
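#To put a number on that error (an optional extra, using the metrics module imported at the top):
print "MSE on test: %.2f" %metrics.mean_squared_error(y_test, y_pred)
print "MAE on test: %.2f" %metrics.mean_absolute_error(y_test, y_pred)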
#OK let's run through one more example before moving onto more advanced regressors.
#We'll explore anonymous loan data provided by lendingclub. We'll try to predict the interest rate for loan applications
#based on data provided. Let's first download data to a pandas df.
#The dataset is a zip file, so let's first fetch it with requests and then pass it on to Pandas through the
#read_csv command.
import requests, zipfile, StringIO  #needed for the download below; these aren't in the earlier import cell
url=requests.get('https://resources.lendingclub.com/LoanStats3c.csv.zip')
z=zipfile.ZipFile(StringIO.StringIO(url.content))
loan=pd.read_csv(z.open('LoanStats3c.csv'), skiprows=1, parse_dates=True, index_col='id')
loanbk=loan.copy() #Backup of the dataframe so we don't have to download data everytime
#Let's take a quick peek at the dataset
loan.describe()
#For simplicity, let's first drop rows with nulls in the dataset. axis=0 indicates we drop rows, not columns.
loan = loan.dropna(axis=0)
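#Optional check: compare against the backup we kept earlier to see how many rows dropna removed
print loanbk.shape, loan.shape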
#OK let's take a look at the columns and see if there are any we can drop any before we get started.
loan.columns.values
#There're plenty that don't seem very relevant. Let's drop them.
loan=loan.drop(['member_id', 'grade', 'sub_grade', 'emp_title', 'issue_d',
'pymnt_plan', 'url', 'desc', 'title', 'initial_list_status',
'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
'policy_code', 'emp_length', 'addr_state','zip_code'], axis=1)
#Check the data dictionary for this dataset at https://resources.lendingclub.com/LCDataDictionary.xlsx for more details
# Get rid of non-numeric values throughout the DataFrame:
for col in loan.columns.values:
    loan[col] = loan[col].replace('[^0-9]+.-', '', regex=True)
loan.head(2)
#Remove % symbol from the interest rate & revolving utilization
loan.int_rate=loan.int_rate.str.split('%',1).str[0]
loan.revol_util=loan.revol_util.str.split('%',1).str[0]
#Remove "months" from the loan period
loan.term=loan.term.str.split(' ',2).str[1]
loan.head(2)
#Let's change the Income Verified column (and a few other categorical columns), which currently hold textual labels, to numeric.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
loan.is_inc_v = le.fit_transform(loan.is_inc_v.values)
loan.home_ownership=le.fit_transform(loan.home_ownership.values)
loan.loan_status=le.fit_transform(loan.loan_status.values)
loan.purpose=le.fit_transform(loan.purpose.values)
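#Optional check: classes_ on the encoder shows which label was mapped to which integer
#(le was last fit on the 'purpose' column, so this prints that column's mapping)
print dict(zip(le.classes_, le.transform(le.classes_)))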
#Finally let's be sure we convert all fields to numeric
loan=loan.convert_objects(convert_numeric=True)
loan.head(2)
#OK great, let's now get our X and y. We know that the interest rate is y.
#Pandas is fantastic; all you need to do is use .values to get the data in numpy format.
y=loan.int_rate.values
#Let's remove y from the df so we can get X
del loan['int_rate']
X=loan.values
#Now, the train test split
X_train, X_test, y_train, y_test = train_test_split(X,y)
#Alright let's now fit a linear regression model to the training set
linr=LinearRegression().fit(X_train, y_train)
#Alright let's get the parameters we've learned
print "Coefficients (theta_1..theta_n)"
print linr.coef_
print
print "Y Intercept(theta0)"
print linr.intercept_
print
#Let's also spit out the R-squared values for Train and Test
print "R-squared for Train: %.2f" %linr.score(X_train, y_train)
print "R-squared for Test: %.2f" %linr.score(X_test, y_test)
#There we have it, the R-squared value on the test set is about 72%, which is not great but understandable considering
#the data must be more sophisticated than a straight line. The only other thing we can do with this regressor is
#to normalize the features before training so they're all on a comparable scale instead of wildly different ranges.
#A rough sketch of what that looks like by hand is below; in the next step we'll simply use the built-in normalize=True flag.
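#A minimal sketch of standardizing the features by hand (for illustration only; note that
#sklearn's normalize=True is similar in spirit but divides each centered feature by its norm
#rather than its standard deviation):
X_train_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
X_test_std = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)   #scale test data with the training stats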
#Let's now fit a linear regression model with normalize=True to the training set
linr=LinearRegression(normalize=True).fit(X_train, y_train)
#Alright let's get the parameters we've learned
print "Coefficients (theta_1..theta_n)"
print linr.coef_
print
print "Y Intercept(theta0)"
print linr.intercept_
print
#Let's also spit out the R-squared values for Train and Test
print "R-squared for Train: %.2f" %linr.score(X_train, y_train)
print "R-squared for Test: %.2f" %linr.score(X_test, y_test)
#There we have it, the R-squared value on the test set is about 71%, actually a bit less than what we managed before.
#I will explore more sophisticated regressors in a separate post.