Saving the Titanic with R & IPython
The following is an illustration of one of my approaches to solving the Titanic Survival prediction challenge hosted by Kaggle. Below is an excerpt from the competition page.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Disclaimer
This pursuit infuses my own ideas with others I've had the privilege to learn from. I write to further my learning.
To get started, download the data files from Kaggle's website here. You will see two CSV files, train and test. Download them to your working directory.
OK let's now take a look at the column descriptions provided for the dataset.
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
Bottom line: we have some information about the passengers aboard the Titanic, and we need to train a model that can predict whether a passenger survived based on data similar to what's provided in the dataset. Without further ado, let's get started.
#Load the R Magic so we can execute R scripts within this notebook
%load_ext rmagic
%%R
#Note that every code block in this notebook needs to start with the line above so IPython knows we're writing R.
#I've downloaded the train and test CSV files to my working directory. You should too, unless you cloned this repo.
#While reading the CSV files into R, let's do some data handling that will save us headaches later on.
#Define a read function so we don't need to write it twice. Column types specifies the data type for each column and
#missing types specifies the different kinds of null values possible.
read_better <- function(file.name, column.types, missing.types) {
read.csv( file.name,
colClasses=column.types,
na.strings=missing.types )
}
#Let's now define the column types
column.types=c('integer', # PassengerId
'factor', # Survived
'factor', # Pclass
'character', # Name
'factor', # Sex
'numeric', # Age
'integer', # SibSp
'integer', # Parch
'character', # Ticket
'numeric', # Fare
'character', # Cabin
'factor') # Embarked
#Different types of null values
missing.types=c('NA','')
#Alright, let's read train
orig_train<-read_better('train.csv', column.types, missing.types)
#For test, the Survived column (2nd col) doesn't exist, let's remove that type before reading.
orig_test<-read_better('test.csv', column.types[-2], missing.types)
#Let's make copies so we never have to read again
train<-orig_train
test<-orig_test
#Quickly print a summary of train
summary(train)
Data Munging
Munging is essentially cleansing the data so it's ready for our super sophisticated Machine Learning algorithms :-)
Ideally I'd like to use Pandas, which is an awesome tool for these kinds of things, but considering Pandas itself was inspired by R, we will try the whole thing in R this time. I'll create a separate notebook later to do it all in sklearn/pandas.
Alright, let's get started. The first step of any data cleansing process is visualization. Why is that? I asked the question myself, but how would you cleanse something when you don't know what it is? And what better way to understand data than by looking at it in colourful visualizations. Let's go and create some.
%%R
#I loved the look of this R package that provides the missing map which can give you a super quick peek into the dataset.
#You'll need the Amelia package for this visualization.
#install.packages("Amelia")
require(Amelia)
missmap(train, main="Titanic - Missing Data Map", col=c("forestgreen","lightskyblue2"), legend=FALSE)
Looking at this map, it's obvious that most of the missing values are in the Cabin and Age columns. Cabin refers to the cabin number of each passenger and, considering how many values are missing, we might as well just drop it. Age, however, is critical and we need a better mechanism to handle it. There are also two missing Embarked values.
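If you prefer exact counts to a picture, a quick base-R sketch like the one below tallies the NAs per column; the numbers should line up with what the missing map shows.
%%R
#Count missing values per column - a numeric companion to the missing map
sapply(train, function(x) sum(is.na(x)))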
%%R
#OK let's plot a series of visualizations that can hopefully explain the data better
#Proportion of Survivors
barplot(prop.table(table(train$Survived)), names.arg=c('Didn\'t Survive', 'Survived'), main="Proportion of Survivors",
col=c('mistyrose','lightseagreen'))
#Clearly more people died than survived.
%%R
#Proportion of Survivors by Gender
barplot(prop.table(table(train$Sex, train$Survived), 1), names.arg=c('Didn\'t Survive', 'Survived'),
main="Proportion of Survivors by Gender", legend=TRUE, col=c('darksalmon','paleturquoise'))
#More Females survived than Males. This is understandable considering the ladies and children first approach for survival.
%%R
#Proportion of Survivors by Class of Travel - let's do a mosaicplot this time for fun.
mosaicplot(prop.table(table(train$Pclass, train$Survived), 1), main="Proportion of Survivors by Pclass",
xlab='Pclass', ylab='Survived ?',col=c('darkturquoise','mediumspringgreen'))
#Looks like those traveling in upper class (Pclass 1) were luckier than the rest. We'll keep this in mind.
%%R
#OK let's do a quick plot by Age - remember that we need to fill in missing values for this column.
boxplot(train$Age~train$Survived,main="Proportion of Survivors by Age", col=c('darkseagreen4','salmon4'), xlab="Survived ?",
ylab="Age")
#OK, that was a helpful plot; it tells us that most survivors were in the 20-35 age bracket. Young legs, perhaps?
%%R
#OK let's get down to business. We'll look at how many values are missing for Age.
summary(train$Age)
#177 is a lot considering our dataset is really small. Let's try to find a meaningful way to fill these up.
#The determining factors so far have been Gender and Pclass. Can we find out how many Age values are missing by Gender?
barplot(table(train$Sex[which(is.na(train$Age))]),main="Proportion of Missing Ages by Gender",
col=c('lightsteelblue','bisque3'), xlab="Gender", ylab="Missing Ages")
#OK that's clearly tilted in favor of males. We're a sloppy bunch, aren't we?
%%R
#How about missing Age values by Pclass?
barplot(table(train$Pclass[which(is.na(train$Age))]),main="Proportion of Missing Ages by Pclass",
col=c('mediumseagreen','rosybrown4','mediumslateblue'), xlab="Pclass", ylab="Missing Ages")
#There's our most important highlight yet. Most of the ages we're missing are in Pclass 3. This goes to show we will
#probably be better off taking the median age for each Pclass and gender combination and filling the null values with it.
#That should be better than simply filling them all with the overall median.
%%R
#Let's go ahead and do the honors. Note that we could do all of this with a single sophisticated function but I am
#choosing to keep things simple at the moment.
#Before we make any changes to Train, let's combine Train and Test temporarily into a new dataset. Since any change we
#make to Train needs to be made to Test as well, we can make the changes once and split the datasets again later.
#We'll first add the Survived column to the Test set (it doesn't exist there) and initialize it to 0, since we won't use it.
test$Survived<-rep(0,nrow(test))
titanic<-rbind(train,test)
#na.rm=TRUE computes the median over non-null values only. We then plug each group's median into its missing slots.
titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="female" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="female" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="male" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="male")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="male" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="male")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="male" & is.na(titanic$Age))]<-
median(titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="male")],na.rm=TRUE)
#Let's do a quick summary of the Age column to make sure there are no nulls remaining
summary(titanic$Age)
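As an aside, the six assignments above can be collapsed into a couple of lines with base R's ave() function, which computes a statistic per group and returns it aligned with the original rows. The sketch below is just a more compact version of the same Pclass/Sex group-median idea, not a different imputation strategy.
%%R
#Sketch: the same Pclass/Sex group-median imputation done with ave() instead of six assignments.
#group.median holds, for every row, the median Age of that row's Pclass/Sex group (NAs ignored).
group.median <- ave(titanic$Age, titanic$Pclass, titanic$Sex,
                    FUN=function(x) median(x, na.rm=TRUE))
titanic$Age[is.na(titanic$Age)] <- group.median[is.na(titanic$Age)]
summary(titanic$Age)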
%%R
#Let's turn our attention to the Embarked Column
summary(titanic$Embarked)
#There are just 2 missing values and S seems to be the majority so let's plug in S
titanic$Embarked[which(is.na(titanic$Embarked))] <- 'S'
summary(titanic$Embarked)
%%R
#Let's remove the Cabin column, which contains way too many NULLs and probably won't give us anything new considering
#we already have the port of embarkation.
titanic$Cabin<-NULL
Feature Engineering
Feature engineering refers to manufacturing new features, either because they're easier for the machine learning algorithms to digest than the originals, or because they extend the value of existing features and are more relevant to the predictions we're trying to make. Let's see what new features we can add to the Titanic dataset.
%%R
#OK, we all know that they tried to save the women and children first aboard the Titanic. We have Gender, which identifies
#women, but no identifier for children. Let's call all passengers with Age < 18 children, shall we?
titanic$Child<-0
titanic$Child[which(titanic$Age<18)]<-1
%%R
#We have a column called Fare, which is the numeric value of the actual fare paid for the trip. As-is this column
#might not be very useful, but what if we break it down into buckets - say <10, 10-20, 20-40 and 40+?
titanic$FareGroup<-'40+'
titanic$FareGroup[which(titanic$Fare<10)] <- '<10'
titanic$FareGroup[which(titanic$Fare>=10 & titanic$Fare<20)] <- '10-20'
titanic$FareGroup[which(titanic$Fare>=20 & titanic$Fare<40)] <- '20-40'
titanic$FareGroup<-as.factor(titanic$FareGroup)
barplot(table(titanic$FareGroup),col='tomato1',main='Fare Groups')
%%R
#Name is another candidate that might not seem to be of great value to us. What could a name contribute to predicting
#whether a passenger actually survived? Well, maybe not much directly, but it could contain things that influence the
#prediction. Let's print a few names to see.
tail(titanic$Name)
#We can see that the names are in a similar format and have a title between the surname and the first name. Can we
#extract the titles from the names?
%%R
#We can use the strsplit function to split on ',' and '.', then capture the middle item, which should be the title. The
#sapply function applies this to the whole column; the function keyword is just like lambda in Python. We will
#also remove any extra whitespace.
titanic$Title<-sapply(titanic$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic$Title<-sub(' ','',titanic$Title)
head(titanic$Title)
%%R
#Let's take a look at the different values of Title and see if further grouping is necessary.
#To be safe, we'll perform any analysis only on the training set. Let's temporarily split Train.
temp <- titanic[1:nrow(train),]
table(temp$Title)
#OK, there are obviously only a few titles that are prominent, namely Mr, Miss, Mrs and Master. Perhaps we can group the
#other titles further?
%%R
#Looking at wikipedia definitions, the following grouping makes reasonable sense because of similar definitions.
#Note it's definitely possible to nitpick these groupings, feel free to make your own judgement call.
titanic$Title[titanic$Title %in% c('Dona','Ms','Lady','the Countess','Jonkheer')] <- 'Mrs'
titanic$Title[titanic$Title %in% c('Col','Dr','Rev')] <- 'Noble'
titanic$Title[titanic$Title %in% c('Mme','Mlle')] <- 'Miss'
titanic$Title[titanic$Title %in% c('Capt','Don','Major','Sir')] <- 'Mr'
titanic$Title<-as.factor(titanic$Title)
table(titanic$Title)
%%R
#OK so we split the Titles out, but what about Surnames? Surnames could indicate families traveling together, maybe
#many of them tried to stick together trying to escape? Let's capture the surnames.
titanic$Surname<-sapply(titanic$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
titanic$Surname<-sub(' ','',titanic$Surname)
head(titanic$Surname)
%%R
#Let's get a quick rundown of the families
temp <- titanic[1:nrow(train),]
fams<-data.frame(table(temp$Surname))
print(summary(fams$Freq))
#There we have it. Looks like a lot of the passengers don't share a Surname with anyone else, since the median is 1,
#but there are also families with up to 9 members (in the training set). So we were on the right track.
hist(fams$Freq, ylim=c(0,50), col='darkcyan')
#The histogram tells us there are a lot of ones and about 50-odd families of sizes 2 and 3; the remaining few are
#large families. Perhaps there's a bigger question: how do we know which of them are families? There could be multiple
#passengers with the same surname traveling in different groups, which would defeat our purpose.
%%R
#To simplify this, let's use the variables we haven't explored so far, SibSp (No. of Siblings and Spouses) and Parch (No.
#of Parents and Children). Summing these up (+1 for self) should give us the family size of each passenger. Assuming
#these attributes were entered correctly, we can then surmise that passengers with the same surname and
#family size belong to the same family. Of course, there are still loopholes if we want to nitpick, but I am stopping here.
titanic$FamSize<-titanic$SibSp + titanic$Parch + 1
#Club FamilySize with Surname to create a new Family ID
titanic$FamID<-paste(titanic$Surname,as.character(titanic$FamSize),sep='')
#We'll call all passengers with Family Size = 1 as traveling Solo.
titanic$FamID[which(titanic$FamSize==1)]<-'Solo'
titanic$FamID<-as.factor(titanic$FamID)
#Finally let's remove some columns we won't use for model building.
#Name and Ticket are unique for each passenger (we assume) and couldn't possibly add any relevance to our prediction. Even
#if multiple passengers had the same name, it shouldn't really help us decide whether one or more survived.
titanic$Name<-NULL
titanic$Ticket<-NULL
head(titanic)
%%R
#Great we're through. We'll now get back our Train and Test sets from Titanic.
train<-titanic[1:nrow(train),]
temp=nrow(train)+1
kaggletest<-titanic[temp:nrow(titanic),]
print(nrow(train))
print(nrow(kaggletest))
print(nrow(titanic))
Model Fitting and Evaluation
I want to use this opportunity and challenge to try out different models in R and see how they stack up against each other. We'll run through one at a time, fit the parameters and submit to Kaggle. We'll then wrap things up by comparing the different approaches. I think this is a great chance to learn tuning models in R.
First up, before we start training and running cross-validation, I would like to try simple, plain-old logistic regression. We won't be tuning anything yet, simply fitting a model and checking how the different features contribute to the predictions.
Before we get started we need to split the "train.csv" file we're holding in the train dataframe into train/test sets. The reason we do this is so we have a baseline to run our models against before submitting to Kaggle. It gives us a good approximation of how well a model can generalize.
We'll be using the caret package for the rest of this approach. The caret package has several functions that attempt to streamline the model building and evaluation process. One of them is the createDataPartition function which can be used to create a stratified random sample of the data into training and test sets.
%%R
#80/20 Train/Test Split
#install.packages('caret')
#Setting a random seed. We will be using this throughout to make sure our results are consistently comparable.
library(caret)
set.seed(35)
trainrows<-createDataPartition(train$Survived, p = 0.8, list=FALSE)
train.set<-train[trainrows,]
test.set<-train[-trainrows,]
print(nrow(train.set))
print(nrow(test.set))
#Remember, from this point on Test does NOT refer to the test.csv file. I will call out explicitly when it does and it won't
#until the very end when we submit predictions to Kaggle for each model.
Logistic Regression
Before we start training models with caret, I would like to first explore simple logistic regression through the glm() method. glm (Generalized Linear Models) is easy to use and lets us fit several types of generalized linear models, logistic regression being one of them. Let's begin.
%%R
#To start with I am not including any of the features we manufactured. Let's see how the raw features perform.
Titan.logit.1 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Parch + Embarked + Fare,
data = train.set, family=binomial("logit"))
print(Titan.logit.1)
A couple of observations on the results above. The figures we're interested in are the deviance and the degrees of freedom. The null deviance is based on how well we can predict survival with a "null" model, which uses only a constant (a "grand mean"), so it is expected to be high. The residual deviance tells us how much the inclusion of features has brought the null deviance down. In our first run, the null deviance was 950.9 and the residual deviance was 633.3, so including the raw features (after the data munging process) brought the deviance down by ~318 points for a 713-704=9 change in degrees of freedom. If you're interested like I am, google these topics to learn more. The coefficients are the parameters (theta) for each of the features, and the Intercept is the theta0 term.
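If you'd rather pull these numbers out programmatically than read them off the printout, the fitted glm object exposes them directly as components; a minimal sketch:
%%R
#The fitted glm object stores the deviances and degrees of freedom as components
cat("Null deviance:    ", Titan.logit.1$null.deviance, "on", Titan.logit.1$df.null, "df\n")
cat("Residual deviance:", Titan.logit.1$deviance, "on", Titan.logit.1$df.residual, "df\n")
cat("Reduction:        ", Titan.logit.1$null.deviance - Titan.logit.1$deviance, "\n")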
Let's run the extractor function, anova(), which gives us an analysis-of-deviance table. I am using the chi-square or "goodness of fit" test. The lower the value, the better.
%%R
anova(Titan.logit.1, test="Chisq")
Looking at the individual deviances, we see that Sex and Pclass accounted for the biggest reductions while Age and SibSp seem to be contributing somewhat. Embarked and Fare are on the lower end while the contribution of Parch is negligible. Let's now make some changes, include our new features and remove Fare and Parch.
%%R
Titan.logit.2 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title + FamSize,
data = train.set, family=binomial("logit"))
print(Titan.logit.2)
%%R
anova(Titan.logit.2, test="Chisq")
Looks like FamID and, to an extent, FareGroup contribute well to the model. Although, looking at the extraordinarily high deviance and df for FamID, I am suspicious that this might cause overfitting - meaning we've modeled too closely to the training set and hence won't generalize very well to new examples. This can be addressed by resampling and tuning parameters with cross-validation. We essentially split the training set further into train and CV sets, then tune parameters against the CV set. This is repeated multiple times, each time with a different train/CV sample.
OK, let's proceed to use the train method of the caret package to train a logistic regression model. We'll use one of the most common cross-validation schemes, 10-fold CV repeated 3 times. That is, the data is split into 10 folds for train/CV, and the whole procedure is repeated 3 times.
%%R
#Define the 3x 10-fold CV control using the trainControl method of caret.
tenfoldcv<-trainControl(method='repeatedcv', number=10, repeats=3)
%%R
#Train a logistic regression classifier using the train method of the caret package. Everything is the same as before
#except we use the train function and pass glm as the method.
#Install the doSNOW package to leverage multiple cores - parallelization
#install.packages("doSNOW")
library(doSNOW)
#Set the first argument of makeCluster below to the number of cores you'd like to run in parallel. I have 4 and am using 3 of them!
cl <- makeCluster(3, type="SOCK")
registerDoSNOW(cl)
#Note that I've also added options below to normalize the features (Feature Scaling)
set.seed(35)
logit.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune1
#May need to install a dependency for caret train
#install.packages('e1071', dependencies=TRUE)
%%R
summary(logit.tune1)
Okay, looks like we're doing (un)reasonably well. Let's try a couple of interesting ideas. Class compression refers to collapsing some levels of a categorical variable - in layman's terms, making a factor two-level. For instance, we have Embarked, most of which has the value 'S' as we saw earlier. We can use the I() function in the model formula to shrink it to 'S' or otherwise. Let's do that.
%%R
#Let's set class compression on Embarked to 'S' or otherwise.
set.seed(35)
logit.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + I(Embarked=='S') + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune2
%%R
summary(logit.tune2)
So that didn't really help. Let's try one last trick: interaction. Let's work in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. higher-class women were much more likely to survive than lower-class women, whereas first-class men were more likely to survive than 2nd- or 3rd-class men, but not by the same margin as amongst the women). We saw this during our initial visualizations. Besides, Pclass and Sex have been the biggest determining factors so far.
%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35)
logit.tune3<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune3
%%R
summary(logit.tune3)
So we did a little bit better. I would just like to test a theory. We manufactured the FamilyID and have been doing well so far. What happens if we take it out? Will we do worse or better? Let's check it out.
%%R
#Let's take FamID out of the model this time.
set.seed(35)
logit.tune4<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune4
%%R
summary(logit.tune4)
So we actually did much better. Lesson learnt: engineering new features is a great idea but may or may not positively impact your model. In fact, if we don't get it right, it could have an adverse impact. Let's see if we can make any more tiny improvements with the Title. After grouping, Title has five levels, and I am going to class compress each of Mr, Mrs, Miss and Noble.
%%R
#Let's class compress each of the Title values.
set.seed(35)
logit.tune5<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + I(Title=='Mr') +
I(Title=='Mrs') + I(Title=='Miss') + I(Title=='Noble'),
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune5
%%R
summary(logit.tune5)
Hmm, didn't really make a difference. Let's try one last thing, adding Child, which we derived during the feature engineering exercise.
%%R
#Let's add Child to the mix.
set.seed(35)
logit.tune6<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)
logit.tune6
%%R
summary(logit.tune6)
So we got that tiny push we were looking for. Let's now go ahead and try this model on our test set as well as submit to Kaggle.
Model Evaluation - Logistic Regression
We can now begin to evaluate model performance by putting together some cross-tabulations of the observed and predicted Survival for the passengers in the test.set data. caret makes this easy with the confusionMatrix function.
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
logit.pred<-predict(logit.tune6, test.set)
#Generate the confusion matrix
confusionMatrix(logit.pred, test.set$Survived)
The metric we're looking at here is called specificity. It is, out of all the passengers who actually survived, how many we predicted would survive. So it's 50/68 = 73.53%. Not too shabby, though I'd like to do better. Let's try to make a submission anyway and find out for real.
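For the curious, here is a small sketch of how that specificity figure can be computed by hand from the raw cross-tabulation (caret already reports it, so this is purely illustrative). Note that confusionMatrix treats the first factor level, '0', as the positive class by default, which is why the survivors end up on the specificity side.
%%R
#Sketch: compute specificity by hand from the cross-tabulation of predictions vs. actuals.
#Specificity here = correctly predicted survivors ("1") / all actual survivors.
tab <- table(Predicted=logit.pred, Actual=test.set$Survived)
print(tab)
print(tab["1","1"] / sum(tab[,"1"]))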
Submit Results to Kaggle
Let's now submit the results from the LR model to Kaggle to see how we fare.
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(logit.tune6, kaggletest)
lr.predictions<-as.data.frame(Survived)
lr.predictions$PassengerId<-kaggletest$PassengerId
#Write results as a CSV file
write.csv(lr.predictions[,c('PassengerId','Survived')], file="LR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
The model scored 0.76555 which put us ahead of only about 1/4th of the teams on the leaderboard. Let us keep trying to improve.
Support Vector Machines
Support Vector Machines (SVMs) are a powerful class of supervised learning algorithms used for classification or regression. SVMs are discriminative classifiers: that is, they draw a boundary between clusters of data. The process of fitting an SVM-based model to our dataset is very similar to what we just did with glm but involves an additional step to tune parameters. The key parameter for an SVM is C, which can be thought of as 1/lambda, where lambda is the regularization term. We talked about overfitting previously; regularization is the process of offsetting it. If there is overfitting, we would increase lambda, or equivalently decrease C.
The caret package automatically selects the best C value by tuning it during cross-validation, but we need to supply a range of C values for the train method to try. For SVMs, this can be handled through the tuneLength (or tuneGrid) parameter. By default the length is 3 and the values tried are 0.25, 0.5 and 1; setting it to 6 would try 0.25 through 8. Let's get started.
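If you'd rather spell the candidate values out yourself, tuneGrid takes a data frame with one column per tuning parameter; for svmRadial those are sigma and C. A minimal sketch follows (the particular values here are just illustrative assumptions, not the ones used below):
%%R
#Sketch: an explicit tuning grid for svmRadial instead of tuneLength.
#Column names must match the model's tuning parameters (sigma and C for svmRadial).
svmgrid <- expand.grid(sigma=c(0.05, 0.1), C=c(0.25, 0.5, 1, 2, 4))
#This grid could then be passed to train(..., method='svmRadial', tuneGrid=svmgrid)
print(svmgrid)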
%%R
set.seed(35)
#Training an SVM model with the RBF kernel - Radial Basis Function
svm.tune1<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
method='svmRadial',
tuneLength=9,
preProcess = c("center", "scale"),
trControl=tenfoldcv)
svm.tune1
Great, so caret automatically tried 9 different values of C while holding the other parameter, sigma, constant, and picked the value that gave the best results. We really didn't have to do much there. Let's go ahead and evaluate the model as well as submit to Kaggle.
Model Evaluation - SVM
%%R
#Derive predictions using our final SVM model on the test set (this is NOT the test.csv file from Kaggle)
svm.pred<-predict(svm.tune1, test.set)
#Generate the confusion matrix
confusionMatrix(svm.pred, test.set$Survived)
The specificity is actually a bit down from the LR model; we're at 70%. Other than that, everything looks pretty similar. I also just discovered that a Support Vector Machine automatically captures interactions between variables, so the Pclass:Sex interaction we put in has no importance to this model. We'll remove it going forward, since Random Forests also identify interactions automatically.
Submit to Kaggle
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(svm.tune1, kaggletest)
svm.predictions<-as.data.frame(Survived)
svm.predictions$PassengerId<-kaggletest$PassengerId
#Write results as a CSV file
write.csv(svm.predictions[,c('PassengerId','Survived')], file="SVM_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
We scored 0.77512, an improvement over the Logistic Regression model that took us up the leaderboard a few notches. We're not going to let this go!
Random Forests
Next up, a very popular and easy-to-use model: Random Forests. RF builds on the concept of decision trees and expands it by growing many trees and averaging out their results to find the best fit. The parameter we need to tune is mtry, the number of features to try at each node. A commonly recommended value for this parameter is the square root of the number of features. Let's give this a shot.
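As a quick sanity check on that heuristic, the sketch below counts the 8 raw predictors we pass in below and takes the square root. Keep in mind this is only a rough guide: caret's formula interface expands factors into dummy columns, so the effective number of predictors the forest sees is somewhat higher.
%%R
#Sketch: the square-root heuristic for mtry, based on the raw predictors in the formula below.
predictors <- c('Sex','Pclass','Age','SibSp','Embarked','FareGroup','Title','Child')
floor(sqrt(length(predictors)))
#The grid below simply tries values at and around this heuristic.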
%%R
set.seed(35)
rfgrid<-data.frame(.mtry=c(2,3,4))
#Training a Random Forest model
rf.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
method='rf',
tuneGrid=rfgrid,
trControl=tenfoldcv)
rf.tune1
Looks like the best mtry value found was 4. Let's complete the formalities.
Model Evaluation - Random Forests
%%R
#Derive predictions using our final RF model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune1, test.set)
#Generate the confusion matrix
confusionMatrix(rf.pred, test.set$Survived)
Well, the specificity is exactly the same, at 70%. I doubt this is going to give us a different result with Kaggle but let's try anyway.
Submit to Kaggle
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune1, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions$PassengerId<-kaggletest$PassengerId
#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
Surprise, we scored 0.78469 bringing us midway on the leaderboard. So we're definitely making headway!
Feature Importances
Before we move ahead let's look at a key statistic that comes from running a tree based model. It's called feature importances, which tells us how much of an impact each of the features we're feeding to the model has to the final outcome.
%%R
#Print variable importances
varImp(rf.tune1$finalModel)
That's very interesting. We always knew that Gender was the most important variable; Sexmale, by the way, is simply how the model matrix encodes Sex (a male indicator against the female baseline). We see here that Age and the "Mr" Title take the second spot. That is a revelation of sorts. SibSp also seems quite important compared to, say, Embarked.
Before we move ahead with other models, I am really curious to try a few more things with RF. I would like to measure the impact of adding a couple of features we've left out, FamSize and Parch, and see how they pan out in terms of importance. Offline I tried adding FamID and it again had a negative impact, so I am leaving it out for good.
%%R
set.seed(35)
#Notice how we're trying out 2-5 features at each node now since we're adding a couple of features to train.
rfgrid<-data.frame(.mtry=c(2,3,4,5))
#Training a Random Forest model
rf.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
data=train.set,
method='rf',
tuneGrid=rfgrid,
trControl=tenfoldcv)
rf.tune2
As expected, the best mtry value this time was 5. Let's look at the feature importances.
%%R
varImp(rf.tune2$finalModel)
So it worked out well. Parch and especially FamSize seem to be reasonably important. I believe feature selection plays a very important role in tree-based models (not that it doesn't in others, but it is perhaps even more important here). OK, let's run predictions and then re-submit to Kaggle to see if we can improve the score.
%%R
#Derive predictions using our final RF model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune2, test.set)
#Generate the confusion matrix
confusionMatrix(rf.pred, test.set$Survived)
No difference in predictions on our test set; the specificity is the same. But let's try submitting to Kaggle anyway.
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune2, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions$PassengerId<-kaggletest$PassengerId
#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
Guess what, we did make an improvement to the Kaggle score. This model scored 0.78947 bringing us to the top 1/3rd of the leaderboard. Since tree based models are giving us great results let's try one more, a very interesting model called Conditional Trees.
Conditional Trees
Standard tree-based models tend to be biased toward selecting variables that have many possible splits or many missing values. Conditional inference trees address this: instead of deciding which variables are important purely with an information measure, the way Random Forests does, they perform a significance test at each split to decide which feature yields the best split. Let's give this a whirl.
Note that there are two conditional tree methods in caret (ctree, ctree2). We will be using ctree2, which allows us to tune the maximum depth - how deep the trees can grow.
%%R
set.seed(35)
ctrgrid<-data.frame(.maxdepth=c(2,3,4,5,6))
#Training a Conditional Tree model
ctr.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
data=train.set,
method='ctree2',
tuneGrid=ctrgrid,
trControl=tenfoldcv)
ctr.tune1
OK that did give us a better accuracy at Max Depth 4. Before we run predictions, let's visualize the tree that was built for the final model.
%%R
plot(ctr.tune1$finalModel)
We can observe here that Sexmale, as expected, was the starting node. From there, Pclass and TitleMr took the honors at level 2, leading further on to the other variables. We're now ready to evaluate the model and run the predictions.
Model Evaluation - Conditional Trees
%%R
#Derive predictions using our final conditional tree model on the test set (this is NOT the test.csv file from Kaggle)
ctr.pred<-predict(ctr.tune1, test.set)
#Generate the confusion matrix
confusionMatrix(ctr.pred, test.set$Survived)
So that gave us a better Specificity on the Test set at 72.06%. I am interested to see how this tests out at Kaggle.
Submit to Kaggle
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(ctr.tune1, kaggletest)
ctr.predictions<-as.data.frame(Survived)
ctr.predictions$PassengerId<-kaggletest$PassengerId
#Write results as a CSV file
write.csv(ctr.predictions[,c('PassengerId','Survived')], file="CTR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
The model scored 0.77512 which is actually slightly worse than how Random Forests did. So the best model in terms of the Kaggle leaderboard has been Random Forests. But let's run a formal comparison of all the models we've built so far.
Model Comparison
The resamples method in the caret package makes it easy to compare results between different models.
%%R
modelcompare<-resamples(list(Logit = logit.tune6, SVM = svm.tune1, RF = rf.tune2, CTREE = ctr.tune1 ))
summary(modelcompare)
We can see that for the metric we chose (Accuracy), Random Forests outperformed the other models. Note that we could have chosen ROC (Receiver Operating Characteristic) as the metric, in which case we'd have needed to generate class probabilities - that is, the probability of survived/not survived for every data item - rather than letting cross-validation generate class predictions automatically. I intend to learn and demonstrate these concepts in a separate session. For now, let's plot the results in a couple of different ways.
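For reference, here is a sketch of what the ROC setup might look like; it is not run here, and note that caret's twoClassSummary needs the outcome levels to be valid R names, so the 0/1 Survived factor would first have to be recoded (say to 'No'/'Yes').
%%R
#Sketch only - not run in this notebook. To optimize ROC instead of Accuracy, the control object
#needs class probabilities and the twoClassSummary function, and train() needs metric='ROC'.
#The Survived factor would also need valid level names (e.g. 'No'/'Yes') rather than 0/1.
roc.cv <- trainControl(method='repeatedcv', number=10, repeats=3,
                       classProbs=TRUE, summaryFunction=twoClassSummary)
#roc.model <- train(Survived ~ ., data=train.set, method='glm', metric='ROC', trControl=roc.cv)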
Box Plot of Model Comparison Results
%%R
bwplot(modelcompare)
Dot Plot of Model Comparison Results
%%R
dotplot(modelcompare)
So there you have it. We took the Titanic dataset presented by Kaggle, imported and visualized the data in a series of plots, munged the data to address gaps, and fitted 4 different models with varying results, all in R, with special thanks to the caret package. We only managed to get up to 0.78947 on the Kaggle leaderboard, but the point of this exercise was to learn and demonstrate machine learning concepts in R. I hope we've achieved that objective.
I will be happy to receive your feedback positive or negative but try to be nice, I am just learning :-)
Share more, Learn more!