Saving the Titanic with R & IPython

The following is an illustration of one of my approaches to solving the Titanic Survival prediction challenge hosted by Kaggle. Below is an excerpt from the competition page.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Disclaimer
This pursuit infuses my own ideas with others I've had the privilege to learn from. I write to further my learning.

To get started, download the data files from Kaggle's website here. You will see two CSV files, train and test. Download them to your working directory.

OK let's now take a look at the column descriptions provided for the dataset.

VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

Bottomline, we have some information about passengers traveling aboard the Titanic and we need to predict if train a model that can predict if one survived or not based on data similar to that provided in the dataset. Without further ado, let's get started.

In [1]:
#Load the R Magic so we can execute R scripts within this notebook
%load_ext rmagic
In [2]:
%%R
#Note that every code block in this notebook will need to have the above line to enable IPython to understand we're coding R.

#I've downloaded the train and test CSV files to my work directory. You should too unless you cloned this repo.
#While downloading the CSV files in R, let's do some data handling so it saves us the headache later on.

#Define a read function so we don't need to do it twice. Column types specifies data types for each column and missing
#types specify different types of null values possible.
read_better <- function(file.name, column.types, missing.types) {
  read.csv( file.name, 
            colClasses=column.types,
            na.strings=missing.types )
}

#Let's now define the column types
column.types=c('integer',   # PassengerId
               'factor',    # Survived 
               'factor',    # Pclass
               'character', # Name
               'factor',    # Sex
               'numeric',   # Age
               'integer',   # SibSp
               'integer',   # Parch
               'character', # Ticket
               'numeric',   # Fare
               'character', # Cabin
               'factor')    # Embarked
#Different types of null values
missing.types=c('NA','') 

#Alright,let's read train
orig_train<-read_better('train.csv', column.types, missing.types)

#For test, the Survived column (2nd col) doesn't exist, let's remove that type before reading.
orig_test<-read_better('test.csv', column.types[-2], missing.types)

#Let's make copies so we never have to read again
train<-orig_train
test<-orig_test

#Quickly print a summary of train
summary(train)
  PassengerId    Survived Pclass      Name               Sex     
 Min.   :  1.0   0:549    1:216   Length:891         female:314  
 1st Qu.:223.5   1:342    2:184   Class :character   male  :577  
 Median :446.0            3:491   Mode  :character               
 Mean   :446.0                                                   
 3rd Qu.:668.5                                                   
 Max.   :891.0                                                   
                                                                 
      Age            SibSp           Parch           Ticket         
 Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891        
 1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character  
 Median :28.00   Median :0.000   Median :0.0000   Mode  :character  
 Mean   :29.70   Mean   :0.523   Mean   :0.3816                     
 3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000                     
 Max.   :80.00   Max.   :8.000   Max.   :6.0000                     
 NA's   :177                                                        
      Fare           Cabin           Embarked  
 Min.   :  0.00   Length:891         C   :168  
 1st Qu.:  7.91   Class :character   Q   : 77  
 Median : 14.45   Mode  :character   S   :644  
 Mean   : 32.20                      NA's:  2  
 3rd Qu.: 31.00                                
 Max.   :512.33                                
                                               

Data Munging

Munging is essentially cleansing the data so it's ready for our super sophisticated Machine Learning algorithms :-)

Ideally I'd like to use Pandas which is an awesome tool for these types of things but considering Pandas itself was inspired from R, we will try the whole thing in R this time. I'll create a seperate notebook later to do it all in sklearn/pandas.

Alright let's get started. The first step of any Data Cleansing process is Visualization. Why is that? I asked the question myself but how would you cleanse something when you don't know what it is? And what better way to understand data than by looking at in colourful visualizations. Let's go and create some.

In [3]:
%%R
#I loved the look of this R package that provides the missing map which can give you a super quick peek into the dataset.
#You'll need the Amelia package for this visualization.

#install.packages("Amelia")
require(Amelia)
missmap(train, main="Titanic - Missing Data Map", col=c("forestgreen","lightskyblue2"), legend=FALSE)
Loading required package: Amelia
Loading required package: Rcpp
## 
## Amelia II: Multiple Imputation
## (Version 1.7.3, built: 2014-11-14)
## Copyright (C) 2005-2014 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
## 

It's obvious looking at this map that the most values missing are in the Cabin and Age columns. Cabin basically refers to the cabin# of each passenger and considering how many values are missing, we could as well just drop it. Age however is critical and we need a better mechanism to handle. There is also one missing Embarked value.

In [4]:
%%R
#OK let's plot a series of visualizations that can hopefully explain the data better"

#Proportion of Survivors
barplot(prop.table(table(train$Survived)), names.arg=c('Didn\'t Survive', 'Survived'), main="Proportion of Survivors",
       col=c('mistyrose','lightseagreen'))

#Clearly more people died than survived.
In [5]:
%%R
#Proportion of Survivors by Gender
barplot(prop.table(table(train$Sex, train$Survived), 1), names.arg=c('Didn\'t Survive', 'Survived'), 
        main="Proportion of Survivors by Gender", legend=TRUE, col=c('darksalmon','paleturquoise'))

#More Females survived than Males. This is understandable considering the ladies and children first approach for survival.
In [6]:
%%R
#Proportion of Survivors by Class of Travel - let's do a mosaicplot this time for fun.
mosaicplot(prop.table(table(train$Pclass, train$Survived), 1), main="Proportion of Survivors by Pclass", 
           xlab='Pclass', ylab='Survived ?',col=c('darkturquoise','mediumspringgreen'))

#Looks like those traveling in upper class (3) were more lucky than the rest. We'll keep this in mind.
In [7]:
%%R
#OK let's do a quick plot by Age - remember that we need to fill in missing values for this column.
boxplot(train$Age~train$Survived,main="Proportion of Survivors by Age", col=c('darkseagreen4','salmon4'), xlab="Survived ?",
       ylab="Age")

#OK that was a helpful plot, it clearly tells us that there were more survivors in the 20-35 Age bracket, young legs perhaps?
In [8]:
%%R
#OK let's get down to business. We'll look at how many values are missing for Age.
summary(train$Age)

#177 is a lot considering our dataset is really small. Let's try to find a meaningful way to fill these up.

#The determining factors so far has been Gender and Pclass. Can we find out how many Age values are missing by Gender?
barplot(table(train$Sex[which(is.na(train$Age))]),main="Proportion of Missing Ages by Gender", 
        col=c('lightsteelblue','bisque3'), xlab="Gender", ylab="Missing Ages")

#OK that's clearly tilted in favor of males. We're a sloppy bunch, aren't we?
In [9]:
%%R
#How about missing Age values by Pclass?
barplot(table(train$Pclass[which(is.na(train$Age))]),main="Proportion of Missing Ages by Pclass", 
        col=c('mediumseagreen','rosybrown4','mediumslateblue'), xlab="Pclass", ylab="Missing Ages")

#There's our most important highlight yet. Most of the ages we're missing are in Pclass 3. So this goes to show, we will
#probably be better off taking the median of ages for each Pclass and gender and filling the null values with it. That
#should be better than simply filling all with overall median.
In [10]:
%%R
#Let's go ahead and do the honors. Note that we could do all of this with a single sophisticated function but I am
#choosing to keep things simple at the moment.

#Before we make any changes to Train, let's combine Train and Test temporarily to a new dataset. Since any change we need
#to make to Train needs to be made to Test as well, we can do the changes only once and split the datasets again later.

#We'll first add the Survived Column to the Test set since it doesn't exist and init to NULL values since we won't use it.
test$Survived<-rep(0,nrow(test))
titanic<-rbind(train,test)

#na.rm basically calculates mean for all non-null values. We then plug the medi
titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="female" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="female" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="male" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==2 & titanic$Sex=="male" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)
titanic$Age[which(titanic$Pclass==1 & titanic$Sex=="male" & is.na(titanic$Age))]<-
        median(titanic$Age[which(titanic$Pclass==3 & titanic$Sex=="female")],na.rm=TRUE)

#Let's do a quick summary of the Age column to make sure there are no nulls remaining

summary(titanic$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.17   22.00   24.00   28.30   35.00   80.00 

In [11]:
%%R
#Let's turn our attention to the Embarked Column
summary(titanic$Embarked)

#There are just 2 missing values and S seems to be the majority so let's plug in S
titanic$Embarked[which(is.na(titanic$Embarked))] <- 'S'

summary(titanic$Embarked)
  C   Q   S 
270 123 916 

In [12]:
%%R
#Let's remove the Cabin column which contains way too many NULLs and will probably not give anything new considering
#we already have the port of embarcation.
titanic$Cabin<-NULL
Feature Engineering

Feature Engineering refers to manufacturing new features based on the idea that they may replace existing ones because they are easier for the machine learning algorithms to digest or extend the value of existing feature(s) because they're more relevant for the predictions we're trying to make. Let's see what new features we can add to the Titanic dataset.

In [13]:
%%R
#OK we all know that they tried to save the women and children first aboard the titanic. We have gender which identifies
#women but no identifier for children. Let's call all passengers with Age < 18 children shall we?
titanic$Child<-0
titanic$Child[which(titanic$Age<18)]<-1
In [14]:
%%R
#We have a column called fare which is a numeric value of the actual fare that was paid for the trip. This column as such
#might not be very useful but what if we can break it down to different buckets - say <20, 20-40, 40-60, 60+. Recall that
#we had noticed a curious fact with respect to the range 20-35, let's make sure that is covered in a single bucket.
titanic$FareGroup<-'40+'
titanic$FareGroup[which(titanic$Fare<10)] <- '<10'
titanic$FareGroup[which(titanic$Fare>=10 & titanic$Fare<20)] <- '10-20'
titanic$FareGroup[which(titanic$Fare>=20 & titanic$Fare<40)] <- '20-40'

titanic$FareGroup<-as.factor(titanic$FareGroup)

barplot(table(titanic$FareGroup),col='tomato1',main='Fare Groups')
In [15]:
%%R
#Name is another clear candidate that might not be of great value to us. What could a name contribute to predicting if he/she
#actually survived? Well, it may not directly but it could contain things that influence the prediction. Let's print a few 
#names to see.
tail(titanic$Name)

#We can see that the names seem to be in similar format and have a title in between the surname and first name. Can we 
#extract the Titles from the names?
[1] "Henriksson, Miss. Jenny Lovisa" "Spector, Mr. Woolf"            
[3] "Oliva y Ocana, Dona. Fermina"   "Saether, Mr. Simon Sivertsen"  
[5] "Ware, Mr. Frederick"            "Peter, Master. Michael J"      

In [16]:
%%R
#We could use the strsplit function to split based on , and . then capture the middle items which should be the title. The
#sapply function helps to perform this for the entire dataframe. function keyword is just like lambda in python. We will
#also remove any extra whitespaces.
titanic$Title<-sapply(titanic$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic$Title<-sub(' ','',titanic$Title)
head(titanic$Title)
[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"  

In [17]:
%%R
#Let's take a look at the different values of Title and see if further grouping is necessary.
#To be safe, we'll perform any analysis only on the training set. Let's temporarily split Train.
temp <- titanic[1:nrow(train),]
table(temp$Title)

#OK there's obviously only a few titles that are most prominent namely Mr, Miss, Mrs and Master. Perhaps we can group the
#other titles further?

        Capt          Col          Don           Dr     Jonkheer         Lady 
           1            2            1            7            1            1 
       Major       Master         Miss         Mlle          Mme           Mr 
           2           40          182            2            1          517 
         Mrs           Ms          Rev          Sir the Countess 
         125            1            6            1            1 

In [18]:
%%R
#Looking at wikipedia definitions, the following grouping makes reasonable sense because of similar definitions.
#Note it's definitely possible to nitpick these groupings, feel free to make your own judgement call.
titanic$Title[titanic$Title %in% c('Dona','Ms','Lady','the Countess','Jonkheer')] <- 'Mrs'
titanic$Title[titanic$Title %in% c('Col','Dr','Rev')] <- 'Noble'
titanic$Title[titanic$Title %in% c('Mme','Mile','Mlle')] <- 'Miss'
titanic$Title[titanic$Title %in% c('Capt','Don','Major','Sir')] <- 'Mr'

titanic$Title<-as.factor(titanic$Title)

table(titanic$Title)

Master   Miss     Mr    Mrs  Noble 
    61    263    762    203     20 

In [19]:
%%R
#OK so we split the Titles out, but what about Surnames? Surnames could indicate families traveling together, maybe 
#many of them tried to stick together trying to escape? Let's capture the surnames.
titanic$Surname<-sapply(titanic$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
titanic$Surname<-sub(' ','',titanic$Surname)
head(titanic$Surname)
[1] "Braund"    "Cumings"   "Heikkinen" "Futrelle"  "Allen"     "Moran"    

In [20]:
%%R
#Very quickly let's get a quick rundown of the families
temp <- titanic[1:nrow(train),]
fams<-data.frame(table(temp$Surname))
print(summary(fams$Freq))
#There we we have it. Looks like a lot of the passengers don't share a Surname with each other since the Median is 1
#but there are families as well with upto 9 members (in the training set). So we were on the right track.
hist(fams$Freq, ylim=c(0,50), col='darkcyan')

#The histogram tells us there are a lot of ones and about 50 odd families with sizes 2 and 3, the remaining few have
#large families. Perhaps there's a bigger question, how do we which of them are families? There could be multiple passengers
#with the same surname traveling different groups, which would negate our purpose.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.336   1.000   9.000 

In [21]:
%%R
#To simplify this, let's use the variables we haven't explored so far, SibSp (No. of Siblings and Spouses) and Parch (No.
#Parents and Children). Summing these up (+1 for self) should gives us the family size of each passenger. Making an
#assumption that these attributes were entered correctly, we could then surmise that passengers with the same surname and 
#family size belong to same families. Of course, there're still loopholes if we want to nitpick but I am stopping here.
titanic$FamSize<-titanic$SibSp + titanic$Parch + 1

#Club FamilySize with Surname to create a new Family ID 
titanic$FamID<-paste(titanic$Surname,as.character(titanic$FamSize),sep='')

#We'll call all passengers with Family Size = 1 as traveling Solo.
titanic$FamID[which(titanic$FamSize==1)]<-'Solo'
titanic$FamID<-as.factor(titanic$FamID)

#Finally let's remove some columns we won't use for model building.
#Name and Ticket are unique for each passenger (we assume) and couldn't possible add any relevance to our prediction. Even
#if multiple passengers had the same name, it shouldn't really help us decide if one or more survived.
titanic$Name<-NULL
titanic$Ticket<-NULL

head(titanic)
  PassengerId Survived Pclass    Sex Age SibSp Parch    Fare Embarked Child
1           1        0      3   male  22     1     0  7.2500        S     0
2           2        1      1 female  38     1     0 71.2833        C     0
3           3        1      3 female  26     0     0  7.9250        S     0
4           4        1      1 female  35     1     0 53.1000        S     0
5           5        0      3   male  35     0     0  8.0500        S     0
6           6        0      3   male  22     0     0  8.4583        Q     0
  FareGroup Title   Surname FamSize     FamID
1       <10    Mr    Braund       2   Braund2
2       40+   Mrs   Cumings       2  Cumings2
3       <10  Miss Heikkinen       1      Solo
4       40+   Mrs  Futrelle       2 Futrelle2
5       <10    Mr     Allen       1      Solo
6       <10    Mr     Moran       1      Solo

In [22]:
%%R
#Great we're through. We'll now get back our Train and Test sets from Titanic.
train<-titanic[1:nrow(train),]
temp=nrow(train)+1
kaggletest<-titanic[temp:nrow(titanic),]

print(nrow(train))
print(nrow(kaggletest))
print(nrow(titanic))
[1] 891
[1] 418
[1] 1309

Model Fitting and Evaluation

I want to use this opportunity and challenge to try out different models in R and see how they stack up against each other. We'll run through one at a time, fit the parameters and submit to Kaggle. We'll then wrap things up by comparing the different approaches. I think this is a great chance to learn tuning models in R.

First up, before we start training and running cross-validation, I would like to try simple, plain-old logistic regression. We'll not be training but simply fitting a model and checking how the different parameters we've created contribute to the predictions.

Before we get started we need to split the "train.csv" file we're holding in the train dataframe to train/test sets. The reason we do this is so we have a baseline to run our model against before the submission to Kaggle. It enables a good approximation on how well the model can generalize.

We'll be using the caret package for the rest of this approach. The caret package has several functions that attempt to streamline the model building and evaluation process. One of them is the createDataPartition function which can be used to create a stratified random sample of the data into training and test sets.

In [23]:
%%R
#75/25 Train/Test Split
#install.packages('caret')
#Setting a random seed. We will be using this throughout to make sure our results are consistently comparable.
library(caret)
set.seed(35) 
trainrows<-createDataPartition(train$Survived, p = 0.8, list=FALSE)
train.set<-train[trainrows,]
test.set<-train[-trainrows,]

print(nrow(train.set))
print(nrow(test.set))
      
#Remember, from this point on Test does NOT refer to the test.csv file. I will call out explicitly when it does and it won't
#until the very end when we submit predictions to Kaggle for each model.
Loading required package: lattice
Loading required package: ggplot2
[1] 714
[1] 177

Logistic Regression

Before we start training models with caret, I would like to first explore simple logistic regression through the glm() method. glm (Generalized Linear Models) is easy-to-use and lets us several types of linear models, logistic regression being one of them. Let's begin

In [24]:
%%R
#To start with I am not including any of the features we manufactured. Let's see how the raw features perform.

Titan.logit.1 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Parch + Embarked + Fare, 
                       data = train.set, family=binomial("logit"))
print(Titan.logit.1)

Call:  glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Parch + 
    Embarked + Fare, family = binomial("logit"), data = train.set)

Coefficients:
(Intercept)      Sexmale      Pclass2      Pclass3          Age        SibSp  
   4.034633    -2.640662    -1.101916    -2.263913    -0.038039    -0.263821  
      Parch    EmbarkedQ    EmbarkedS         Fare  
  -0.112988    -0.001405    -0.455287     0.002163  

Degrees of Freedom: 713 Total (i.e. Null);  704 Residual
Null Deviance:	    950.9 
Residual Deviance: 633.3 	AIC: 653.3

Couple of observations on the results above. The factors we're interested in are Deviance and Degrees of Freedom. The null observations are based on how well we can predict survival given a "null" model, which works only based on a constant, mean of means or a "grand mean". The null deviance is expected to be high. While the residual deviance tells us how much the inclusion of features has brought down the null deviance. So for instance, in our first run, the null deviance was 950.9 and the residual deviance was 633.3. So including the raw features (after the data munging process) brought down the deviance by ~327 points with a 713-704=9 change in degrees of freedom. If you're interested like I am, google to learn more about these topics. The coefficients are the parameters (theta) for each of the features and the Intercept is the theta0 term.

Let's run the extractor function, anova(), which gives us the result of analysis. I am using the chi-square or "goodness of fit" statistic. Lower the value the better.

In [25]:
%%R
anova(Titan.logit.1, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Survived

Terms added sequentially (first to last)


         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                       713     950.86              
Sex       1  206.206       712     744.66 < 2.2e-16 ***
Pclass    2   77.642       710     667.02 < 2.2e-16 ***
Age       1   18.755       709     648.26 1.486e-05 ***
SibSp     1    8.977       708     639.29  0.002734 ** 
Parch     1    0.640       707     638.65  0.423776    
Embarked  2    4.625       705     634.02  0.098991 .  
Fare      1    0.723       704     633.30  0.395187    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Looking at the individual deviances, we see that Sex and Pclass accounted for the biggest reductions while Age and SibSp seem to be contributing somewhat. Embarked and Fare are on the lower end while the contribution of Parch is negligible. Let's now make some changes, include our new features and remove Fare and Parch.

In [26]:
%%R
Titan.logit.2 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title + FamSize, 
                       data = train.set, family=binomial("logit"))
print(Titan.logit.2)

Call:  glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Embarked + 
    FareGroup + FamID + Title + FamSize, family = binomial("logit"), 
    data = train.set)

Coefficients:
              (Intercept)                    Sexmale  
                3.022e+01                 -2.932e+01  
                  Pclass2                    Pclass3  
               -1.165e+00                 -1.093e+00  
                      Age                      SibSp  
               -2.415e-02                  2.153e-01  
                EmbarkedQ                  EmbarkedS  
                2.933e-01                 -2.709e-01  
           FareGroup10-20             FareGroup20-40  
                5.610e-01                  1.247e+00  
             FareGroup40+              FamIDAbelson2  
                1.058e+00                 -1.144e-01  
                FamIDAks2              FamIDAllison4  
                4.504e+15                 -2.955e+01  
    FamIDAndersen-Jensen2            FamIDAndersson7  
                2.559e+01                 -2.594e+00  
            FamIDAndrews2             FamIDAppleton3  
                2.591e+01                  2.754e+01  
     FamIDArnold-Franchi2              FamIDAsplund7  
               -2.780e+01                 -9.996e-01  
          FamIDBackstrom2            FamIDBackstrom4  
               -2.342e+01                  2.432e+01  
            FamIDBaclini4              FamIDBarbara2  
                2.514e+01                 -2.695e+01  
             FamIDBaxter2                FamIDBeane2  
               -6.829e-01                  3.796e+01  
             FamIDBecker4             FamIDBeckwith3  
                4.504e+15                  2.686e+01  
             FamIDBishop2               FamIDBoulos3  
                1.901e+01                 -2.693e+01  
             FamIDBourke3             FamIDBowerman2  
               -2.735e+01                  2.270e+01  
              FamIDBrown3                FamIDBryhl2  
                6.636e-01                 -4.504e+15  
           FamIDCaldwell3                FamIDCaram2  
                2.510e+01                 -3.145e+01  
            FamIDCardeza2               FamIDCarter2  
                2.719e+01                 -2.599e+01  
             FamIDCarter4            FamIDCavendish2  
                1.396e+01                 -2.490e+01  
           FamIDChambers2              FamIDChristy3  
                2.565e+01                  2.697e+01  
       FamIDChronopoulos2               FamIDClarke2  
               -2.292e+01                  2.775e+01  
            FamIDCollyer3              FamIDCompton3  
                1.118e+00                  2.399e+01  
             FamIDCoutts3                FamIDCribb2  
                2.256e+01                 -7.634e+01  
             FamIDCrosby3              FamIDCumings2  
                8.250e-01                  2.315e+01  
             FamIDDanbom3             FamIDDavidson2  
               -2.789e+01                 -4.445e+01  
             FamIDDavies3              FamIDDavison2  
                7.337e-01                  2.443e+01  
               FamIDDean4             FamIDdelCarlo2  
                7.449e-01                 -2.444e+01  
      FamIDdeMessemaeker2                 FamIDDick2  
                2.064e+01                  2.697e+01  
              FamIDDodge3               FamIDDoling2  
                2.502e+01                  4.261e+01  
            FamIDDouglas2           FamIDDuffGordon2  
               -2.482e+01                  2.711e+01  
        FamIDDurany More2                FamIDElias3  
                2.638e+01                 -2.680e+01  
             FamIDEustis2           FamIDFaunthorpe2  
                4.504e+15                  2.679e+01  
               FamIDFord5              FamIDFortune6  
               -1.731e+01                 -6.254e-01  
         FamIDFrauenthal2           FamIDFrauenthal3  
                2.072e+01                  2.958e+01  
          FamIDFrolicher3     FamIDFrolicher-Stehli3  
                2.450e+01                  2.806e+01  
           FamIDFutrelle2                 FamIDGale2  
               -6.513e-01                 -2.397e+01  
              FamIDGiles2           FamIDGoldenberg2  
               -2.348e+01                  2.839e+01  
          FamIDGoldsmith3              FamIDGoodwin8  
                2.487e+01                 -3.330e+01  
             FamIDGraham2           FamIDGreenfield2  
                2.441e+01                  2.620e+01  
         FamIDGustafsson3              FamIDHagland2  
               -4.187e+01                 -1.861e+01  
         FamIDHamalainen3               FamIDHansen2  
                2.591e+01                 -2.340e+01  
             FamIDHansen3               FamIDHarder2  
               -5.239e+01                  2.766e+01  
             FamIDHarper2               FamIDHarris2  
                2.961e+00                 -5.547e-01  
               FamIDHart3                 FamIDHays3  
                1.289e+00                  2.229e+01  
             FamIDHerman4              FamIDHickman3  
                2.614e+01                 -3.538e+01  
            FamIDHippach2             FamIDHirvonen2  
                5.357e+01                  2.910e+01  
            FamIDHocking4                 FamIDHold2  
               -2.375e+01                 -2.059e+01  
          FamIDHolverson2                 FamIDHoyt2  
               -2.258e+01                  1.834e+01  
         FamIDIlmakangas2            FamIDJacobsohn2  
               -2.598e+01                 -1.966e+01  
          FamIDJacobsohn4               FamIDJensen2  
                2.239e+01                 -2.301e+01  
            FamIDJohnson3             FamIDJohnston4  
                2.611e+01                 -2.985e+01  
            FamIDJussila2               FamIDKantor2  
               -2.704e+01                  1.565e-01  
             FamIDKenyon2              FamIDKiernan2  
                2.598e+01                 -1.207e+03  
            FamIDKimball2                 FamIDKink3  
                2.856e+01                 -2.513e+01  
      FamIDKink-Heilmann3               FamIDKlasen3  
                2.618e+01                 -2.429e+01  
           FamIDLahtinen3              FamIDLaroche4  
               -2.831e+01                  8.050e-01  
            FamIDLefebre5               FamIDLennon2  
               -2.767e+01                 -4.504e+15  
            FamIDLindell2            FamIDLindqvist2  
               -2.437e+01                  2.823e+01  
              FamIDLines2                 FamIDLobb2  
                2.471e+01                 -2.340e+01  
              FamIDLouch2               FamIDMadill2  
                2.383e+01                  2.414e+01  
             FamIDMallet3               FamIDMarvin2  
                7.140e-01                 -2.524e+01  
              FamIDMcCoy3              FamIDMcNamee2  
                3.043e+01                 -2.358e+01  
              FamIDMeyer2              FamIDMinahan2  
               -1.188e+00                  2.485e+01  
            FamIDMinahan3                 FamIDMoor2  
               -2.357e+01                  2.595e+01  
              FamIDMoran2             FamIDMoubarek3  
                3.095e-01                  1.393e+01  
             FamIDMurphy2                FamIDNakid3  
                2.723e+01                  2.852e+01  
             FamIDNasser2               FamIDNatsch2  
               -2.532e-01                 -2.146e+01  
           FamIDNavratil3               FamIDNewell2  
                2.643e+01                  2.878e+01  
             FamIDNewell3               FamIDNewsom3  
               -2.082e+01                  2.493e+01  
           FamIDNicholls3        FamIDNicola-Yarred2  
               -1.773e+01                  2.645e+01  
            FamIDO'Brien2                FamIDOlsen2  
                2.950e+01                 -2.216e+01  
            FamIDPalsson5               FamIDPanula6  
               -4.936e+01                 -2.862e+01  
            FamIDParrish2                FamIDPears2  
                3.472e+01                 -9.049e-01  
FamIDPenascoy Castellana2              FamIDPersson2  
               -1.369e+00                  3.023e+01  
              FamIDPeter3            FamIDPetterson2  
                2.598e+01                 -2.278e+01  
             FamIDPotter2                FamIDQuick3  
                2.398e+01                  2.556e+01  
             FamIDRenouf4                 FamIDRice6  
                2.379e+01                 -2.202e+01  
           FamIDRichards3             FamIDRichards6  
                2.988e+01                  2.444e+01  
             FamIDRobert2               FamIDRobins2  
                2.393e+01                 -2.787e+01  
            FamIDRosblom3              FamIDRyerson5  
               -1.694e+01                  2.468e+01  
              FamIDSage11               FamIDSamaan3  
               -2.948e+01                 -2.481e+01  
          FamIDSandstrom3              FamIDShelley2  
                2.597e+01                  2.361e+01  
             FamIDSilven3               FamIDSilvey2  
                2.797e+01                 -4.461e-01  
              FamIDSkoog6                  FamIDSolo  
               -2.788e+01                  1.459e+00  
            FamIDSpencer2           FamIDStephenson2  
                2.281e+01                  2.347e+01  
              FamIDStrom2                FamIDStrom3  
               -2.698e+01                 -2.850e+01  
            FamIDTaussig3               FamIDTaylor2  
                2.470e+01                  2.732e+01  
             FamIDThayer3               FamIDThomas2  
                2.664e+01                  2.664e+01  
       FamIDThorneycroft2               FamIDTurpin2  
                6.012e-01                 -2.782e+01  
        FamIDvanBilliard3         FamIDVanderPlanke2  
               -2.289e+01                 -4.026e+02  
       FamIDVanderPlanke3              FamIDVanImpe3  
               -2.312e+05                 -2.825e+01  
             FamIDWarren2                FamIDWeisz2  
                2.370e+01                  2.415e+01  
              FamIDWells3                 FamIDWest4  
                2.545e+01                  1.068e+00  
              FamIDWhite2                 FamIDWick3  
               -2.411e+01                  2.541e+01  
            FamIDWidener3             FamIDWilliams2  
               -2.544e+01                 -2.437e+01  
             FamIDZabour2                  TitleMiss  
               -2.698e+01                 -2.917e+01  
                  TitleMr                   TitleMrs  
               -2.822e+00                 -2.725e+01  
               TitleNoble                    FamSize  
               -3.994e+00                         NA  

Degrees of Freedom: 713 Total (i.e. Null);  517 Residual
Null Deviance:	    950.9 
Residual Deviance: 373.3 	AIC: 767.3

In [27]:
%%R
anova(Titan.logit.2, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Survived

Terms added sequentially (first to last)


           Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                         713     950.86              
Sex         1  206.206       712     744.66 < 2.2e-16 ***
Pclass      2   77.642       710     667.02 < 2.2e-16 ***
Age         1   18.755       709     648.26 1.486e-05 ***
SibSp       1    8.977       708     639.29 0.0027343 ** 
Embarked    2    4.742       706     634.54 0.0933863 .  
FareGroup   3    1.034       703     633.51 0.7930468    
FamID     182  248.620       521     384.89 0.0007554 ***
Title       4   11.591       517     373.30 0.0206677 *  
FamSize     0    0.000       517     373.30              
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Looks like FamID and to an extent FareGroup contribute well to the model. Although looking at the extraordinarily high deviance and df for FamID, I am suspicious that this might cause overfitting - meaning we've modeled in extreme based on the training set hence won't generalize very well on new examples. This can be addressed by resampling and hypertuning parameters based on crossvalidation. We essentially split the training set further into train and cv, then hypertune parameters against the cv set. This is repeated multiple times, each with a different sample of train/cv.

OK let's proceed to use the train method of the caret package to train a logistic regression model. We'll use one of the most common crossvalidation methods namely the 3x 10-fold CV. That is 10 folds of data to split train and cv repeated 3 times.

In [28]:
%%R
#Define the 3x 10 fold cv control using the traincontrol method of caret.
tenfoldcv<-trainControl(method='repeatedcv', number=10, repeats=3)
In [283]:
%%R
#Train a logistic regression classifier using the train method of the caret package. Everything is same as before except
#we use the train function and pass glm as the method.

#Install the doSNOW package to leverage multiple cores - parallelization
#install.packages(doSNOW)
library(doSNOW)

#Set 4 below to the number of cores you'd like to run in parallel. I have 4 and using 3 of them!
cl <- makeCluster(3, type="SOCK")
registerDoSNOW(cl)

#Note that I've also added options below to normalize the features (Feature Scaling)
set.seed(35) 
logit.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune1

#May need to install a dependency for caret train
#install.packages('e1071', dependencies=TRUE)
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD 
  0.7829812  0.5332698  0.05484398   0.1230111

 

In [284]:
%%R
summary(logit.tune1)

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.28940  -0.43374  -0.00008   0.00010   2.42885  

Coefficients: (45 not defined because of singularities)
                              Estimate Std. Error z value Pr(>|z|)  
(Intercept)                 -6.178e-01  2.122e+02  -0.003   0.9977  
Sexmale                     -1.065e+01  3.540e+03  -0.003   0.9976  
Pclass2                     -4.784e-01  2.887e-01  -1.657   0.0975 .
Pclass3                     -5.451e-01  3.414e-01  -1.597   0.1103  
Age                         -3.162e-01  1.743e-01  -1.814   0.0697 .
SibSp                        2.275e-01  6.706e-01   0.339   0.7344  
EmbarkedQ                    8.080e-02  1.658e-01   0.487   0.6259  
EmbarkedS                   -1.214e-01  1.804e-01  -0.673   0.5010  
`FareGroup10-20`             2.253e-01  2.610e-01   0.863   0.3880  
`FareGroup20-40`             5.179e-01  2.891e-01   1.792   0.0732 .
`FareGroup40+`               4.250e-01  2.883e-01   1.474   0.1404  
FamIDAbelson2               -6.050e-03  1.922e-01  -0.031   0.9749  
FamIDAhlin2                         NA         NA      NA       NA  
FamIDAks2                    6.886e-01  4.025e+02   0.002   0.9986  
FamIDAllison4               -1.184e+00  3.883e+02  -0.003   0.9976  
`FamIDAndersen-Jensen2`      7.532e-01  4.025e+02   0.002   0.9985  
FamIDAndersson7             -1.938e-01  2.168e-01  -0.894   0.3714  
FamIDAndrews2                7.125e-01  4.025e+02   0.002   0.9986  
FamIDAngle2                         NA         NA      NA       NA  
FamIDAppleton3               6.236e-01  4.025e+02   0.002   0.9988  
`FamIDArnold-Franchi2`      -1.073e+00  3.092e+02  -0.003   0.9972  
FamIDAsplund7               -6.470e-02  1.963e-01  -0.330   0.7417  
FamIDAstor2                         NA         NA      NA       NA  
FamIDBackstrom2             -6.092e-01  4.025e+02  -0.002   0.9988  
FamIDBackstrom4              6.570e-01  4.025e+02   0.002   0.9987  
FamIDBaclini4                9.648e-01  3.949e+02   0.002   0.9981  
FamIDBarbara2               -1.079e+00  3.893e+02  -0.003   0.9978  
FamIDBaxter2                -3.612e-02  1.847e-01  -0.195   0.8450  
FamIDBeane2                  8.323e-01  4.025e+02   0.002   0.9983  
FamIDBecker4                 6.907e-01  4.025e+02   0.002   0.9986  
FamIDBeckwith3               1.070e+00  3.166e+02   0.003   0.9973  
FamIDBishop2                 5.908e-01  4.025e+02   0.001   0.9988  
FamIDBoulos3                -7.514e-01  4.025e+02  -0.002   0.9985  
FamIDBourke3                -1.310e+00  3.150e+02  -0.004   0.9967  
FamIDBowerman2               6.835e-01  4.025e+02   0.002   0.9986  
FamIDBraund2                        NA         NA      NA       NA  
FamIDBrown3                  3.510e-02  2.015e-01   0.174   0.8617  
FamIDBryhl2                 -6.385e-01  4.025e+02  -0.002   0.9987  
FamIDCaldwell3               9.659e-01  3.860e+02   0.003   0.9980  
FamIDCaram2                 -8.114e-01  4.025e+02  -0.002   0.9984  
FamIDCardeza2                7.973e-01  4.025e+02   0.002   0.9984  
FamIDCarter2                -1.066e+00  2.916e+02  -0.004   0.9971  
FamIDCarter4                 1.284e+00  3.196e+02   0.004   0.9968  
FamIDCavendish2             -6.651e-01  4.025e+02  -0.002   0.9987  
FamIDChaffee2                       NA         NA      NA       NA  
FamIDChambers2               1.056e+00  3.148e+02   0.003   0.9973  
FamIDChapman2                       NA         NA      NA       NA  
FamIDChibnall2                      NA         NA      NA       NA  
FamIDChristy3                7.147e-01  4.025e+02   0.002   0.9986  
FamIDChronopoulos2          -6.247e-01  4.025e+02  -0.002   0.9988  
FamIDClark2                         NA         NA      NA       NA  
FamIDClarke2                 6.457e-01  4.025e+02   0.002   0.9987  
FamIDCollyer3                7.236e-02  2.007e-01   0.361   0.7184  
FamIDCompton3                6.806e-01  4.025e+02   0.002   0.9987  
FamIDCornell3                       NA         NA      NA       NA  
FamIDCoutts3                 7.235e-01  4.025e+02   0.002   0.9986  
FamIDCribb2                 -5.903e-01  4.025e+02  -0.001   0.9988  
FamIDCrosby3                 4.363e-02  1.757e-01   0.248   0.8038  
FamIDCumings2                6.080e-01  4.025e+02   0.002   0.9988  
FamIDDanbom3                -1.061e+00  3.096e+02  -0.003   0.9973  
FamIDDavidson2              -6.696e-01  4.025e+02  -0.002   0.9987  
FamIDDavidson4                      NA         NA      NA       NA  
FamIDDavies3                 3.880e-02  1.637e-01   0.237   0.8126  
FamIDDavison2                6.632e-01  4.025e+02   0.002   0.9987  
FamIDDean4                   3.940e-02  1.691e-01   0.233   0.8157  
FamIDdelCarlo2              -6.450e-01  4.025e+02  -0.002   0.9987  
FamIDdeMessemaeker2          6.759e-01  4.025e+02   0.002   0.9987  
FamIDDick2                   1.058e+00  3.063e+02   0.003   0.9972  
FamIDDodge3                  6.729e-01  4.025e+02   0.002   0.9987  
FamIDDoling2                 7.164e-01  4.025e+02   0.002   0.9986  
FamIDDouglas2               -6.626e-01  4.025e+02  -0.002   0.9987  
FamIDDouglas3                       NA         NA      NA       NA  
FamIDDrew3                          NA         NA      NA       NA  
FamIDDuffGordon2             1.068e+00  3.094e+02   0.003   0.9972  
`FamIDDurany More2`          7.320e-01  4.025e+02   0.002   0.9985  
FamIDDyker2                         NA         NA      NA       NA  
FamIDEarnshaw2                      NA         NA      NA       NA  
FamIDElias3                 -6.119e-01  4.025e+02  -0.002   0.9988  
FamIDEustis2                 6.942e-01  4.025e+02   0.002   0.9986  
FamIDFaunthorpe2             6.466e-01  4.025e+02   0.002   0.9987  
FamIDFord5                  -1.315e+00  3.334e+02  -0.004   0.9969  
FamIDFortune6               -4.048e-02  2.085e-01  -0.194   0.8460  
FamIDFrauenthal2             6.037e-01  4.025e+02   0.001   0.9988  
FamIDFrauenthal3             8.479e-01  4.025e+02   0.002   0.9983  
FamIDFrolicher3              6.733e-01  4.025e+02   0.002   0.9987  
`FamIDFrolicher-Stehli3`     8.109e-01  4.025e+02   0.002   0.9984  
FamIDFutrelle2              -3.445e-02  1.937e-01  -0.178   0.8588  
FamIDGale2                  -6.303e-01  4.025e+02  -0.002   0.9988  
FamIDGibson2                        NA         NA      NA       NA  
FamIDGiles2                 -6.164e-01  4.025e+02  -0.002   0.9988  
FamIDGoldenberg2             1.065e+00  3.013e+02   0.004   0.9972  
FamIDGoldsmith3              6.457e-01  4.025e+02   0.002   0.9987  
FamIDGoodwin8               -1.580e+00  3.994e+02  -0.004   0.9968  
FamIDGraham2                 6.443e-01  4.025e+02   0.002   0.9987  
FamIDGreenfield2             7.856e-01  4.025e+02   0.002   0.9984  
FamIDGustafsson3            -8.422e-01  4.018e+02  -0.002   0.9983  
FamIDHagland2               -6.182e-01  4.025e+02  -0.002   0.9988  
FamIDHakkarainen2                   NA         NA      NA       NA  
FamIDHamalainen3             9.970e-01  3.930e+02   0.003   0.9980  
FamIDHansen2                -5.936e-01  4.025e+02  -0.001   0.9988  
FamIDHansen3                -6.091e-01  4.025e+02  -0.002   0.9988  
FamIDHarder2                 7.793e-01  4.025e+02   0.002   0.9985  
FamIDHarper2                 2.212e-01  2.270e-01   0.974   0.3300  
FamIDHarris2                -2.934e-02  1.980e-01  -0.148   0.8822  
FamIDHart3                   8.346e-02  2.031e-01   0.411   0.6811  
FamIDHays3                   6.308e-01  4.025e+02   0.002   0.9987  
FamIDHerman4                 7.208e-01  4.025e+02   0.002   0.9986  
FamIDHickman3               -9.069e-01  4.021e+02  -0.002   0.9982  
FamIDHiltunen3                      NA         NA      NA       NA  
FamIDHippach2                9.173e-01  3.896e+02   0.002   0.9981  
FamIDHirvonen2               7.249e-01  4.025e+02   0.002   0.9986  
FamIDHirvonen3                      NA         NA      NA       NA  
FamIDHocking4               -6.227e-01  4.025e+02  -0.002   0.9988  
FamIDHocking5                       NA         NA      NA       NA  
FamIDHogeboom2                      NA         NA      NA       NA  
FamIDHold2                  -6.213e-01  4.025e+02  -0.002   0.9988  
FamIDHolverson2             -6.597e-01  4.025e+02  -0.002   0.9987  
FamIDHoward2                        NA         NA      NA       NA  
FamIDHoyt2                   6.154e-01  4.025e+02   0.002   0.9988  
FamIDIlmakangas2            -7.058e-01  4.025e+02  -0.002   0.9986  
FamIDJacobsohn2             -6.231e-01  4.025e+02  -0.002   0.9988  
FamIDJacobsohn4              6.340e-01  4.025e+02   0.002   0.9987  
FamIDJefferys3                      NA         NA      NA       NA  
FamIDJensen2                -6.017e-01  4.025e+02  -0.001   0.9988  
FamIDJohnson3                1.017e+00  3.989e+02   0.003   0.9980  
FamIDJohnston4              -7.552e-01  4.025e+02  -0.002   0.9985  
FamIDJussila2               -1.003e+00  4.022e+02  -0.002   0.9980  
FamIDKantor2                 8.279e-03  1.951e-01   0.042   0.9661  
FamIDKarun2                         NA         NA      NA       NA  
FamIDKenyon2                 6.037e-01  4.025e+02   0.001   0.9988  
FamIDKhalil2                        NA         NA      NA       NA  
FamIDKiernan2               -6.183e-01  4.025e+02  -0.002   0.9988  
FamIDKimball2                8.048e-01  4.025e+02   0.002   0.9984  
FamIDKink3                  -6.017e-01  4.025e+02  -0.001   0.9988  
`FamIDKink-Heilmann3`        7.011e-01  4.025e+02   0.002   0.9986  
`FamIDKink-Heilmann5`               NA         NA      NA       NA  
FamIDKlasen3                -6.008e-01  4.025e+02  -0.001   0.9988  
FamIDLahtinen3              -8.206e-01  4.025e+02  -0.002   0.9984  
FamIDLaroche4                5.211e-02  2.064e-01   0.252   0.8007  
FamIDLefebre5               -1.330e+00  3.994e+02  -0.003   0.9973  
FamIDLennon2                -6.393e-01  4.025e+02  -0.002   0.9987  
FamIDLindell2               -6.056e-01  4.025e+02  -0.002   0.9988  
FamIDLindqvist2              8.655e-01  4.025e+02   0.002   0.9983  
FamIDLines2                  6.710e-01  4.025e+02   0.002   0.9987  
FamIDLobb2                  -6.110e-01  4.025e+02  -0.002   0.9988  
FamIDLouch2                  6.583e-01  4.025e+02   0.002   0.9987  
FamIDMadill2                 6.771e-01  4.025e+02   0.002   0.9987  
FamIDMallet3                 3.776e-02  1.753e-01   0.215   0.8294  
FamIDMarvin2                -6.804e-01  4.025e+02  -0.002   0.9987  
FamIDMcCoy3                  1.063e+00  3.296e+02   0.003   0.9974  
FamIDMcNamee2               -6.164e-01  4.025e+02  -0.002   0.9988  
FamIDMellinger2                     NA         NA      NA       NA  
FamIDMeyer2                 -6.282e-02  1.958e-01  -0.321   0.7484  
FamIDMinahan2                6.642e-01  4.025e+02   0.002   0.9987  
FamIDMinahan3               -6.432e-01  4.025e+02  -0.002   0.9987  
FamIDMock2                          NA         NA      NA       NA  
FamIDMoor2                   1.008e+00  3.892e+02   0.003   0.9979  
FamIDMoran2                  1.637e-02  1.617e-01   0.101   0.9193  
FamIDMoubarek3               1.032e+00  4.022e+02   0.003   0.9980  
FamIDMurphy2                 7.138e-01  4.025e+02   0.002   0.9986  
FamIDNakid3                  1.121e+00  3.216e+02   0.003   0.9972  
FamIDNasser2                -1.339e-02  2.015e-01  -0.066   0.9470  
FamIDNatsch2                -6.733e-01  4.025e+02  -0.002   0.9987  
FamIDNavratil3               7.005e-01  4.025e+02   0.002   0.9986  
FamIDNewell2                 9.467e-01  4.019e+02   0.002   0.9981  
FamIDNewell3                -6.473e-01  4.025e+02  -0.002   0.9987  
FamIDNewsom3                 6.737e-01  4.025e+02   0.002   0.9987  
FamIDNicholls3              -6.439e-01  4.025e+02  -0.002   0.9987  
`FamIDNicola-Yarred2`        1.016e+00  3.995e+02   0.003   0.9980  
`FamIDO'Brien2`              6.421e-01  4.025e+02   0.002   0.9987  
FamIDOlsen2                 -5.711e-01  4.025e+02  -0.001   0.9989  
FamIDOstby2                         NA         NA      NA       NA  
FamIDPalsson5               -1.583e+00  3.966e+02  -0.004   0.9968  
FamIDPanula6                -1.702e+00  3.394e+02  -0.005   0.9960  
FamIDParrish2                6.736e-01  4.025e+02   0.002   0.9987  
FamIDPeacock3                       NA         NA      NA       NA  
FamIDPears2                 -4.786e-02  1.964e-01  -0.244   0.8074  
`FamIDPenascoy Castellana2` -7.240e-02  1.934e-01  -0.374   0.7082  
FamIDPersson2                8.700e-01  4.025e+02   0.002   0.9983  
FamIDPeter3                  9.555e-01  3.847e+02   0.002   0.9980  
FamIDPetterson2             -5.945e-01  4.025e+02  -0.001   0.9988  
FamIDPhillips2                      NA         NA      NA       NA  
FamIDPotter2                 6.323e-01  4.025e+02   0.002   0.9987  
FamIDQuick3                  6.939e-01  4.025e+02   0.002   0.9986  
FamIDRenouf2                        NA         NA      NA       NA  
FamIDRenouf4                 6.314e-01  4.025e+02   0.002   0.9987  
FamIDRice6                  -1.813e+00  4.006e+02  -0.005   0.9964  
FamIDRichards3               7.262e-01  4.025e+02   0.002   0.9986  
FamIDRichards6               6.597e-01  4.025e+02   0.002   0.9987  
FamIDRobert2                 6.307e-01  4.025e+02   0.002   0.9987  
FamIDRobins2                -7.787e-01  4.025e+02  -0.002   0.9985  
FamIDRosblom3               -1.074e+00  3.243e+02  -0.003   0.9974  
FamIDRothschild2                    NA         NA      NA       NA  
FamIDRyerson5                6.536e-01  4.025e+02   0.002   0.9987  
FamIDSage11                 -1.755e+00  3.590e+02  -0.005   0.9961  
FamIDSamaan3                -6.621e-01  4.025e+02  -0.002   0.9987  
FamIDSandstrom3              9.897e-01  3.910e+02   0.003   0.9980  
FamIDSchabert2                      NA         NA      NA       NA  
FamIDShelley2                6.510e-01  4.025e+02   0.002   0.9987  
FamIDSilven3                 7.421e-01  4.025e+02   0.002   0.9985  
FamIDSilvey2                -2.359e-02  1.986e-01  -0.119   0.9055  
FamIDSkoog6                 -1.722e+00  3.458e+02  -0.005   0.9960  
FamIDSmith2                         NA         NA      NA       NA  
FamIDSnyder2                        NA         NA      NA       NA  
FamIDSolo                    7.131e-01  1.282e+00   0.556   0.5781  
FamIDSpedden3                       NA         NA      NA       NA  
FamIDSpencer2                5.935e-01  4.025e+02   0.001   0.9988  
FamIDStengel2                       NA         NA      NA       NA  
FamIDStephenson2             6.207e-01  4.025e+02   0.002   0.9988  
FamIDStraus2                        NA         NA      NA       NA  
FamIDStrom2                 -7.396e-01  4.025e+02  -0.002   0.9985  
FamIDStrom3                 -7.949e-01  4.025e+02  -0.002   0.9984  
FamIDTaussig3                9.287e-01  3.840e+02   0.002   0.9981  
FamIDTaylor2                 1.078e+00  3.017e+02   0.004   0.9971  
FamIDThayer3                 1.042e+00  3.205e+02   0.003   0.9974  
FamIDThomas2                 7.401e-01  4.025e+02   0.002   0.9985  
FamIDThomas3                        NA         NA      NA       NA  
FamIDThorneycroft2           3.180e-02  1.904e-01   0.167   0.8674  
FamIDTouma3                         NA         NA      NA       NA  
FamIDTurpin2                -1.095e+00  3.113e+02  -0.004   0.9972  
FamIDvanBilliard3           -5.934e-01  4.025e+02  -0.001   0.9988  
FamIDVanderPlanke2          -7.931e-01  4.025e+02  -0.002   0.9984  
FamIDVanderPlanke3          -9.929e-01  3.306e+02  -0.003   0.9976  
FamIDVanderPlanke4                  NA         NA      NA       NA  
FamIDVanImpe3               -1.126e+00  3.838e+02  -0.003   0.9977  
FamIDWare2                          NA         NA      NA       NA  
FamIDWarren2                 6.279e-01  4.025e+02   0.002   0.9988  
FamIDWeisz2                  6.466e-01  4.025e+02   0.002   0.9987  
FamIDWells3                  6.957e-01  4.025e+02   0.002   0.9986  
FamIDWest4                   6.913e-02  2.036e-01   0.340   0.7341  
FamIDWhite2                 -6.408e-01  4.025e+02  -0.002   0.9987  
FamIDWick3                   6.916e-01  4.025e+02   0.002   0.9986  
FamIDWidener3               -6.753e-01  4.025e+02  -0.002   0.9987  
FamIDWiklund2                       NA         NA      NA       NA  
FamIDWilkes2                        NA         NA      NA       NA  
FamIDWilliams2              -6.536e-01  4.025e+02  -0.002   0.9987  
FamIDYasbeck2                       NA         NA      NA       NA  
FamIDZabour2                -7.397e-01  4.025e+02  -0.002   0.9985  
TitleMiss                   -8.997e+00  3.011e+03  -0.003   0.9976  
TitleMr                     -1.393e+00  7.860e-01  -1.772   0.0764 .
TitleMrs                    -7.105e+00  2.603e+03  -0.003   0.9978  
TitleNoble                  -5.542e-01  2.803e-01  -1.977   0.0480 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 373.30  on 517  degrees of freedom
AIC: 767.3

Number of Fisher Scoring iterations: 18


Okay looks like we're doing (un)reasonably well. Let's try a couple of interesting ideas. Class Compression refers to collapsing some levels on a categorical variable. In layman terms, making a factor two-level. So for instance, we have Embarked, most of which has the value 'S' as we saw earlier. We can use the I() function when training to shrink this to S or otherwise. Let's do that.

In [285]:
%%R
#Let's set class compression on Embarked to 'S' or otherwise.
set.seed(35) 

logit.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + I(Embarked=='S') + FareGroup + FamID + Title,
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune2
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa    Accuracy SD  Kappa SD 
  0.7820488  0.53063  0.05507752   0.1240941

 

In [286]:
%%R
summary(logit.tune2)

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.34231  -0.43582  -0.00008   0.00010   2.42395  

Coefficients: (45 not defined because of singularities)
                              Estimate Std. Error z value Pr(>|z|)  
(Intercept)                 -6.156e-01  2.122e+02  -0.003   0.9977  
Sexmale                     -1.065e+01  3.548e+03  -0.003   0.9976  
Pclass2                     -4.596e-01  2.869e-01  -1.602   0.1092  
Pclass3                     -5.264e-01  3.407e-01  -1.545   0.1223  
Age                         -3.148e-01  1.749e-01  -1.800   0.0718 .
SibSp                        2.230e-01  6.716e-01   0.332   0.7398  
`I(Embarked == "S")TRUE`    -1.728e-01  1.444e-01  -1.197   0.2315  
`FareGroup10-20`             2.101e-01  2.593e-01   0.810   0.4177  
`FareGroup20-40`             5.142e-01  2.900e-01   1.773   0.0762 .
`FareGroup40+`               4.068e-01  2.876e-01   1.415   0.1572  
FamIDAbelson2               -1.258e-02  1.918e-01  -0.066   0.9477  
FamIDAhlin2                         NA         NA      NA       NA  
FamIDAks2                    6.882e-01  4.025e+02   0.002   0.9986  
FamIDAllison4               -1.181e+00  3.885e+02  -0.003   0.9976  
`FamIDAndersen-Jensen2`      7.511e-01  4.025e+02   0.002   0.9985  
FamIDAndersson7             -1.949e-01  2.170e-01  -0.898   0.3690  
FamIDAndrews2                7.132e-01  4.025e+02   0.002   0.9986  
FamIDAngle2                         NA         NA      NA       NA  
FamIDAppleton3               6.264e-01  4.025e+02   0.002   0.9988  
`FamIDArnold-Franchi2`      -1.072e+00  3.092e+02  -0.003   0.9972  
FamIDAsplund7               -6.467e-02  1.965e-01  -0.329   0.7421  
FamIDAstor2                         NA         NA      NA       NA  
FamIDBackstrom2             -6.081e-01  4.025e+02  -0.002   0.9988  
FamIDBackstrom4              6.584e-01  4.025e+02   0.002   0.9987  
FamIDBaclini4                9.586e-01  3.950e+02   0.002   0.9981  
FamIDBarbara2               -1.085e+00  3.895e+02  -0.003   0.9978  
FamIDBaxter2                -3.857e-02  1.849e-01  -0.209   0.8347  
FamIDBeane2                  8.320e-01  4.025e+02   0.002   0.9984  
FamIDBecker4                 6.901e-01  4.025e+02   0.002   0.9986  
FamIDBeckwith3               1.073e+00  3.164e+02   0.003   0.9973  
FamIDBishop2                 5.893e-01  4.025e+02   0.001   0.9988  
FamIDBoulos3                -7.564e-01  4.025e+02  -0.002   0.9985  
FamIDBourke3                -1.297e+00  3.146e+02  -0.004   0.9967  
FamIDBowerman2               6.842e-01  4.025e+02   0.002   0.9986  
FamIDBraund2                        NA         NA      NA       NA  
FamIDBrown3                  3.452e-02  2.015e-01   0.171   0.8640  
FamIDBryhl2                 -6.388e-01  4.025e+02  -0.002   0.9987  
FamIDCaldwell3               9.648e-01  3.862e+02   0.002   0.9980  
FamIDCaram2                 -8.146e-01  4.025e+02  -0.002   0.9984  
FamIDCardeza2                7.956e-01  4.025e+02   0.002   0.9984  
FamIDCarter2                -1.067e+00  2.919e+02  -0.004   0.9971  
FamIDCarter4                 1.289e+00  3.194e+02   0.004   0.9968  
FamIDCavendish2             -6.623e-01  4.025e+02  -0.002   0.9987  
FamIDChaffee2                       NA         NA      NA       NA  
FamIDChambers2               1.060e+00  3.146e+02   0.003   0.9973  
FamIDChapman2                       NA         NA      NA       NA  
FamIDChibnall2                      NA         NA      NA       NA  
FamIDChristy3                7.125e-01  4.025e+02   0.002   0.9986  
FamIDChronopoulos2          -6.279e-01  4.025e+02  -0.002   0.9988  
FamIDClark2                         NA         NA      NA       NA  
FamIDClarke2                 6.453e-01  4.025e+02   0.002   0.9987  
FamIDCollyer3                7.044e-02  2.012e-01   0.350   0.7262  
FamIDCompton3                6.772e-01  4.025e+02   0.002   0.9987  
FamIDCornell3                       NA         NA      NA       NA  
FamIDCoutts3                 7.242e-01  4.025e+02   0.002   0.9986  
FamIDCribb2                 -5.894e-01  4.025e+02  -0.001   0.9988  
FamIDCrosby3                 4.598e-02  1.763e-01   0.261   0.7943  
FamIDCumings2                6.064e-01  4.025e+02   0.002   0.9988  
FamIDDanbom3                -1.059e+00  3.096e+02  -0.003   0.9973  
FamIDDavidson2              -6.668e-01  4.025e+02  -0.002   0.9987  
FamIDDavidson4                      NA         NA      NA       NA  
FamIDDavies3                 3.841e-02  1.639e-01   0.234   0.8148  
FamIDDavison2                6.643e-01  4.025e+02   0.002   0.9987  
FamIDDean4                   3.913e-02  1.693e-01   0.231   0.8172  
FamIDdelCarlo2              -6.496e-01  4.025e+02  -0.002   0.9987  
FamIDdeMessemaeker2          6.769e-01  4.025e+02   0.002   0.9987  
FamIDDick2                   1.062e+00  3.064e+02   0.003   0.9972  
FamIDDodge3                  6.751e-01  4.025e+02   0.002   0.9987  
FamIDDoling2                 7.141e-01  4.025e+02   0.002   0.9986  
FamIDDouglas2               -6.642e-01  4.025e+02  -0.002   0.9987  
FamIDDouglas3                       NA         NA      NA       NA  
FamIDDrew3                          NA         NA      NA       NA  
FamIDDuffGordon2             1.065e+00  3.082e+02   0.003   0.9972  
`FamIDDurany More2`          7.266e-01  4.025e+02   0.002   0.9986  
FamIDDyker2                         NA         NA      NA       NA  
FamIDEarnshaw2                      NA         NA      NA       NA  
FamIDElias3                 -6.165e-01  4.025e+02  -0.002   0.9988  
FamIDEustis2                 6.907e-01  4.025e+02   0.002   0.9986  
FamIDFaunthorpe2             6.462e-01  4.025e+02   0.002   0.9987  
FamIDFord5                  -1.315e+00  3.326e+02  -0.004   0.9968  
FamIDFortune6               -3.674e-02  2.089e-01  -0.176   0.8604  
FamIDFrauenthal2             6.064e-01  4.025e+02   0.002   0.9988  
FamIDFrauenthal3             8.513e-01  4.025e+02   0.002   0.9983  
FamIDFrolicher3              6.698e-01  4.025e+02   0.002   0.9987  
`FamIDFrolicher-Stehli3`     8.093e-01  4.025e+02   0.002   0.9984  
FamIDFutrelle2              -3.060e-02  1.936e-01  -0.158   0.8744  
FamIDGale2                  -6.307e-01  4.025e+02  -0.002   0.9987  
FamIDGibson2                        NA         NA      NA       NA  
FamIDGiles2                 -6.156e-01  4.025e+02  -0.002   0.9988  
FamIDGoldenberg2             1.063e+00  3.015e+02   0.004   0.9972  
FamIDGoldsmith3              6.456e-01  4.025e+02   0.002   0.9987  
FamIDGoodwin8               -1.578e+00  3.993e+02  -0.004   0.9968  
FamIDGraham2                 6.467e-01  4.025e+02   0.002   0.9987  
FamIDGreenfield2             7.839e-01  4.025e+02   0.002   0.9984  
FamIDGustafsson3            -8.424e-01  4.018e+02  -0.002   0.9983  
FamIDHagland2               -6.171e-01  4.025e+02  -0.002   0.9988  
FamIDHakkarainen2                   NA         NA      NA       NA  
FamIDHamalainen3             9.976e-01  3.930e+02   0.003   0.9980  
FamIDHansen2                -5.939e-01  4.025e+02  -0.001   0.9988  
FamIDHansen3                -6.079e-01  4.025e+02  -0.002   0.9988  
FamIDHarder2                 7.778e-01  4.025e+02   0.002   0.9985  
FamIDHarper2                 2.196e-01  2.276e-01   0.965   0.3347  
FamIDHarris2                -2.551e-02  1.979e-01  -0.129   0.8974  
FamIDHart3                   8.158e-02  2.036e-01   0.401   0.6886  
FamIDHays3                   6.334e-01  4.025e+02   0.002   0.9987  
FamIDHerman4                 7.200e-01  4.025e+02   0.002   0.9986  
FamIDHickman3               -9.052e-01  4.021e+02  -0.002   0.9982  
FamIDHiltunen3                      NA         NA      NA       NA  
FamIDHippach2                9.129e-01  3.897e+02   0.002   0.9981  
FamIDHirvonen2               7.241e-01  4.025e+02   0.002   0.9986  
FamIDHirvonen3                      NA         NA      NA       NA  
FamIDHocking4               -6.217e-01  4.025e+02  -0.002   0.9988  
FamIDHocking5                       NA         NA      NA       NA  
FamIDHogeboom2                      NA         NA      NA       NA  
FamIDHold2                  -6.217e-01  4.025e+02  -0.002   0.9988  
FamIDHolverson2             -6.569e-01  4.025e+02  -0.002   0.9987  
FamIDHoward2                        NA         NA      NA       NA  
FamIDHoyt2                   6.181e-01  4.025e+02   0.002   0.9988  
FamIDIlmakangas2            -7.080e-01  4.025e+02  -0.002   0.9986  
FamIDJacobsohn2             -6.235e-01  4.025e+02  -0.002   0.9988  
FamIDJacobsohn4              6.338e-01  4.025e+02   0.002   0.9987  
FamIDJefferys3                      NA         NA      NA       NA  
FamIDJensen2                -6.020e-01  4.025e+02  -0.001   0.9988  
FamIDJohnson3                1.017e+00  3.987e+02   0.003   0.9980  
FamIDJohnston4              -7.571e-01  4.025e+02  -0.002   0.9985  
FamIDJussila2               -1.006e+00  4.022e+02  -0.003   0.9980  
FamIDKantor2                 7.818e-03  1.951e-01   0.040   0.9680  
FamIDKarun2                         NA         NA      NA       NA  
FamIDKenyon2                 6.064e-01  4.025e+02   0.002   0.9988  
FamIDKhalil2                        NA         NA      NA       NA  
FamIDKiernan2               -6.120e-01  4.025e+02  -0.002   0.9988  
FamIDKimball2                8.075e-01  4.025e+02   0.002   0.9984  
FamIDKink3                  -6.018e-01  4.025e+02  -0.001   0.9988  
`FamIDKink-Heilmann3`        6.991e-01  4.025e+02   0.002   0.9986  
`FamIDKink-Heilmann5`               NA         NA      NA       NA  
FamIDKlasen3                -6.011e-01  4.025e+02  -0.001   0.9988  
FamIDLahtinen3              -8.210e-01  4.025e+02  -0.002   0.9984  
FamIDLaroche4                4.529e-02  2.065e-01   0.219   0.8264  
FamIDLefebre5               -1.332e+00  3.992e+02  -0.003   0.9973  
FamIDLennon2                -6.315e-01  4.025e+02  -0.002   0.9987  
FamIDLindell2               -6.045e-01  4.025e+02  -0.002   0.9988  
FamIDLindqvist2              8.652e-01  4.025e+02   0.002   0.9983  
FamIDLines2                  6.704e-01  4.025e+02   0.002   0.9987  
FamIDLobb2                  -6.099e-01  4.025e+02  -0.002   0.9988  
FamIDLouch2                  6.579e-01  4.025e+02   0.002   0.9987  
FamIDMadill2                 6.779e-01  4.025e+02   0.002   0.9987  
FamIDMallet3                 3.085e-02  1.749e-01   0.176   0.8600  
FamIDMarvin2                -6.776e-01  4.025e+02  -0.002   0.9987  
FamIDMcCoy3                  1.072e+00  3.278e+02   0.003   0.9974  
FamIDMcNamee2               -6.153e-01  4.025e+02  -0.002   0.9988  
FamIDMellinger2                     NA         NA      NA       NA  
FamIDMeyer2                 -6.498e-02  1.959e-01  -0.332   0.7401  
FamIDMinahan2                6.718e-01  4.025e+02   0.002   0.9987  
FamIDMinahan3               -6.330e-01  4.025e+02  -0.002   0.9987  
FamIDMock2                          NA         NA      NA       NA  
FamIDMoor2                   1.008e+00  3.893e+02   0.003   0.9979  
FamIDMoran2                  2.453e-02  1.613e-01   0.152   0.8791  
FamIDMoubarek3               1.027e+00  4.022e+02   0.003   0.9980  
FamIDMurphy2                 7.197e-01  4.025e+02   0.002   0.9986  
FamIDNakid3                  1.116e+00  3.201e+02   0.003   0.9972  
FamIDNasser2                -1.989e-02  2.010e-01  -0.099   0.9212  
FamIDNatsch2                -6.764e-01  4.025e+02  -0.002   0.9987  
FamIDNavratil3               6.998e-01  4.025e+02   0.002   0.9986  
FamIDNewell2                 9.419e-01  4.019e+02   0.002   0.9981  
FamIDNewell3                -6.491e-01  4.025e+02  -0.002   0.9987  
FamIDNewsom3                 6.731e-01  4.025e+02   0.002   0.9987  
FamIDNicholls3              -6.442e-01  4.025e+02  -0.002   0.9987  
`FamIDNicola-Yarred2`        1.010e+00  3.993e+02   0.003   0.9980  
`FamIDO'Brien2`              6.498e-01  4.025e+02   0.002   0.9987  
FamIDOlsen2                 -5.716e-01  4.025e+02  -0.001   0.9989  
FamIDOstby2                         NA         NA      NA       NA  
FamIDPalsson5               -1.585e+00  3.966e+02  -0.004   0.9968  
FamIDPanula6                -1.702e+00  3.386e+02  -0.005   0.9960  
FamIDParrish2                6.730e-01  4.025e+02   0.002   0.9987  
FamIDPeacock3                       NA         NA      NA       NA  
FamIDPears2                 -4.395e-02  1.963e-01  -0.224   0.8228  
`FamIDPenascoy Castellana2` -7.452e-02  1.935e-01  -0.385   0.7001  
FamIDPersson2                8.697e-01  4.025e+02   0.002   0.9983  
FamIDPeter3                  9.473e-01  3.850e+02   0.002   0.9980  
FamIDPetterson2             -5.948e-01  4.025e+02  -0.001   0.9988  
FamIDPhillips2                      NA         NA      NA       NA  
FamIDPotter2                 6.305e-01  4.025e+02   0.002   0.9988  
FamIDQuick3                  6.918e-01  4.025e+02   0.002   0.9986  
FamIDRenouf2                        NA         NA      NA       NA  
FamIDRenouf4                 6.313e-01  4.025e+02   0.002   0.9987  
FamIDRice6                  -1.799e+00  4.005e+02  -0.004   0.9964  
FamIDRichards3               7.265e-01  4.025e+02   0.002   0.9986  
FamIDRichards6               6.606e-01  4.025e+02   0.002   0.9987  
FamIDRobert2                 6.332e-01  4.025e+02   0.002   0.9987  
FamIDRobins2                -7.777e-01  4.025e+02  -0.002   0.9985  
FamIDRosblom3               -1.075e+00  3.238e+02  -0.003   0.9974  
FamIDRothschild2                    NA         NA      NA       NA  
FamIDRyerson5                6.504e-01  4.025e+02   0.002   0.9987  
FamIDSage11                 -1.752e+00  3.578e+02  -0.005   0.9961  
FamIDSamaan3                -6.662e-01  4.025e+02  -0.002   0.9987  
FamIDSandstrom3              9.893e-01  3.912e+02   0.003   0.9980  
FamIDSchabert2                      NA         NA      NA       NA  
FamIDShelley2                6.505e-01  4.025e+02   0.002   0.9987  
FamIDSilven3                 7.408e-01  4.025e+02   0.002   0.9985  
FamIDSilvey2                -1.979e-02  1.986e-01  -0.100   0.9206  
FamIDSkoog6                 -1.723e+00  3.454e+02  -0.005   0.9960  
FamIDSmith2                         NA         NA      NA       NA  
FamIDSnyder2                        NA         NA      NA       NA  
FamIDSolo                    7.121e-01  1.284e+00   0.555   0.5791  
FamIDSpedden3                       NA         NA      NA       NA  
FamIDSpencer2                5.920e-01  4.025e+02   0.001   0.9988  
FamIDStengel2                       NA         NA      NA       NA  
FamIDStephenson2             6.190e-01  4.025e+02   0.002   0.9988  
FamIDStraus2                        NA         NA      NA       NA  
FamIDStrom2                 -7.404e-01  4.025e+02  -0.002   0.9985  
FamIDStrom3                 -7.939e-01  4.025e+02  -0.002   0.9984  
FamIDTaussig3                9.304e-01  3.843e+02   0.002   0.9981  
FamIDTaylor2                 1.082e+00  3.019e+02   0.004   0.9971  
FamIDThayer3                 1.040e+00  3.203e+02   0.003   0.9974  
FamIDThomas2                 7.349e-01  4.025e+02   0.002   0.9985  
FamIDThomas3                        NA         NA      NA       NA  
FamIDThorneycroft2           3.334e-02  1.905e-01   0.175   0.8611  
FamIDTouma3                         NA         NA      NA       NA  
FamIDTurpin2                -1.096e+00  3.112e+02  -0.004   0.9972  
FamIDvanBilliard3           -5.926e-01  4.025e+02  -0.001   0.9988  
FamIDVanderPlanke2          -7.921e-01  4.025e+02  -0.002   0.9984  
FamIDVanderPlanke3          -9.930e-01  3.288e+02  -0.003   0.9976  
FamIDVanderPlanke4                  NA         NA      NA       NA  
FamIDVanImpe3               -1.127e+00  3.840e+02  -0.003   0.9977  
FamIDWare2                          NA         NA      NA       NA  
FamIDWarren2                 6.262e-01  4.025e+02   0.002   0.9988  
FamIDWeisz2                  6.462e-01  4.025e+02   0.002   0.9987  
FamIDWells3                  6.936e-01  4.025e+02   0.002   0.9986  
FamIDWest4                   6.739e-02  2.041e-01   0.330   0.7412  
FamIDWhite2                 -6.383e-01  4.025e+02  -0.002   0.9987  
FamIDWick3                   6.923e-01  4.025e+02   0.002   0.9986  
FamIDWidener3               -6.770e-01  4.025e+02  -0.002   0.9987  
FamIDWiklund2                       NA         NA      NA       NA  
FamIDWilkes2                        NA         NA      NA       NA  
FamIDWilliams2              -6.554e-01  4.025e+02  -0.002   0.9987  
FamIDYasbeck2                       NA         NA      NA       NA  
FamIDZabour2                -7.448e-01  4.025e+02  -0.002   0.9985  
TitleMiss                   -8.980e+00  3.017e+03  -0.003   0.9976  
TitleMr                     -1.399e+00  7.872e-01  -1.778   0.0755 .
TitleMrs                    -7.107e+00  2.608e+03  -0.003   0.9978  
TitleNoble                  -5.585e-01  2.807e-01  -1.989   0.0467 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 373.54  on 518  degrees of freedom
AIC: 765.54

Number of Fisher Scoring iterations: 18


So that didn't really help. Let's try one last trick, Interaction. Let's work in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. Higher class women were much more likely to survive than lower class women, whereas first class Men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women). We saw this during our initial visualizations. Besides Pclass and Sex have been the biggest determining factors so far.

In [287]:
%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35) 

logit.tune3<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune3
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD 
  0.7904864  0.5472096  0.05401862   0.1249195

 

In [288]:
%%R
summary(logit.tune3)

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.85112  -0.46296  -0.00009   0.00009   2.54282  

Coefficients: (45 not defined because of singularities)
                              Estimate Std. Error z value Pr(>|z|)  
(Intercept)                 -5.350e-01  2.135e+02  -0.003   0.9980  
Sexmale                     -1.099e+01  3.407e+03  -0.003   0.9974  
Pclass2                     -7.112e-01  6.345e-01  -1.121   0.2623  
Pclass3                     -1.487e+00  7.122e-01  -2.088   0.0368 *
Age                         -3.107e-01  1.799e-01  -1.727   0.0842 .
SibSp                        2.551e-02  6.888e-01   0.037   0.9705  
EmbarkedQ                    1.342e-01  1.615e-01   0.831   0.4058  
EmbarkedS                   -1.049e-01  1.818e-01  -0.577   0.5640  
`FareGroup10-20`             1.840e-01  2.615e-01   0.704   0.4816  
`FareGroup20-40`             5.310e-01  2.953e-01   1.798   0.0721 .
`FareGroup40+`               3.425e-01  3.013e-01   1.137   0.2556  
FamIDAbelson2               -2.723e-02  1.935e-01  -0.141   0.8881  
FamIDAhlin2                         NA         NA      NA       NA  
FamIDAks2                    7.047e-01  4.025e+02   0.002   0.9986  
FamIDAllison4               -1.250e+00  3.928e+02  -0.003   0.9975  
`FamIDAndersen-Jensen2`      7.543e-01  4.025e+02   0.002   0.9985  
FamIDAndersson7             -1.563e-01  1.910e-01  -0.818   0.4132  
FamIDAndrews2                6.498e-01  4.025e+02   0.002   0.9987  
FamIDAngle2                         NA         NA      NA       NA  
FamIDAppleton3               5.904e-01  4.025e+02   0.001   0.9988  
`FamIDArnold-Franchi2`      -1.044e+00  3.351e+02  -0.003   0.9975  
FamIDAsplund7               -3.992e-02  1.723e-01  -0.232   0.8168  
FamIDAstor2                         NA         NA      NA       NA  
FamIDBackstrom2             -6.261e-01  4.025e+02  -0.002   0.9988  
FamIDBackstrom4              6.981e-01  4.025e+02   0.002   0.9986  
FamIDBaclini4                9.878e-01  3.963e+02   0.002   0.9980  
FamIDBarbara2               -1.059e+00  3.933e+02  -0.003   0.9979  
FamIDBaxter2                -7.557e-02  1.924e-01  -0.393   0.6945  
FamIDBeane2                  8.272e-01  4.025e+02   0.002   0.9984  
FamIDBecker4                 7.046e-01  4.025e+02   0.002   0.9986  
FamIDBeckwith3               1.064e+00  3.177e+02   0.003   0.9973  
FamIDBishop2                 5.524e-01  4.025e+02   0.001   0.9989  
FamIDBoulos3                -7.450e-01  4.025e+02  -0.002   0.9985  
FamIDBourke3                -1.292e+00  3.411e+02  -0.004   0.9970  
FamIDBowerman2               6.143e-01  4.025e+02   0.002   0.9988  
FamIDBraund2                        NA         NA      NA       NA  
FamIDBrown3                  1.151e-02  2.058e-01   0.056   0.9554  
FamIDBryhl2                 -6.435e-01  4.025e+02  -0.002   0.9987  
FamIDCaldwell3               9.558e-01  3.665e+02   0.003   0.9979  
FamIDCaram2                 -7.831e-01  4.025e+02  -0.002   0.9984  
FamIDCardeza2                7.874e-01  4.025e+02   0.002   0.9984  
FamIDCarter2                -1.102e+00  2.822e+02  -0.004   0.9969  
FamIDCarter4                 1.278e+00  3.216e+02   0.004   0.9968  
FamIDCavendish2             -6.692e-01  4.025e+02  -0.002   0.9987  
FamIDChaffee2                       NA         NA      NA       NA  
FamIDChambers2               1.051e+00  3.161e+02   0.003   0.9973  
FamIDChapman2                       NA         NA      NA       NA  
FamIDChibnall2                      NA         NA      NA       NA  
FamIDChristy3                6.650e-01  4.025e+02   0.002   0.9987  
FamIDChronopoulos2          -6.402e-01  4.025e+02  -0.002   0.9987  
FamIDClark2                         NA         NA      NA       NA  
FamIDClarke2                 6.180e-01  4.025e+02   0.002   0.9988  
FamIDCollyer3                2.449e-02  2.003e-01   0.122   0.9027  
FamIDCompton3                6.197e-01  4.025e+02   0.002   0.9988  
FamIDCornell3                       NA         NA      NA       NA  
FamIDCoutts3                 7.184e-01  4.025e+02   0.002   0.9986  
FamIDCribb2                 -6.146e-01  4.025e+02  -0.002   0.9988  
FamIDCrosby3                -8.693e-03  1.950e-01  -0.045   0.9644  
FamIDCumings2                5.693e-01  4.025e+02   0.001   0.9989  
FamIDDanbom3                -1.032e+00  3.356e+02  -0.003   0.9975  
FamIDDavidson2              -6.737e-01  4.025e+02  -0.002   0.9987  
FamIDDavidson4                      NA         NA      NA       NA  
FamIDDavies3                 3.311e-02  1.406e-01   0.235   0.8138  
FamIDDavison2                6.902e-01  4.025e+02   0.002   0.9986  
FamIDDean4                   1.679e-02  1.481e-01   0.113   0.9097  
FamIDdelCarlo2              -6.487e-01  4.025e+02  -0.002   0.9987  
FamIDdeMessemaeker2          7.026e-01  4.025e+02   0.002   0.9986  
FamIDDick2                   1.053e+00  3.083e+02   0.003   0.9973  
FamIDDodge3                  6.735e-01  4.025e+02   0.002   0.9987  
FamIDDoling2                 6.597e-01  4.025e+02   0.002   0.9987  
FamIDDouglas2               -6.656e-01  4.025e+02  -0.002   0.9987  
FamIDDouglas3                       NA         NA      NA       NA  
FamIDDrew3                          NA         NA      NA       NA  
FamIDDuffGordon2             1.063e+00  3.093e+02   0.003   0.9973  
`FamIDDurany More2`          6.887e-01  4.025e+02   0.002   0.9986  
FamIDDyker2                         NA         NA      NA       NA  
FamIDEarnshaw2                      NA         NA      NA       NA  
FamIDElias3                 -6.311e-01  4.025e+02  -0.002   0.9987  
FamIDEustis2                 6.331e-01  4.025e+02   0.002   0.9987  
FamIDFaunthorpe2             6.189e-01  4.025e+02   0.002   0.9988  
FamIDFord5                  -1.295e+00  3.571e+02  -0.004   0.9971  
FamIDFortune6               -7.440e-02  2.115e-01  -0.352   0.7250  
FamIDFrauenthal2             5.638e-01  4.025e+02   0.001   0.9989  
FamIDFrauenthal3             8.475e-01  4.025e+02   0.002   0.9983  
FamIDFrolicher3              6.055e-01  4.025e+02   0.002   0.9988  
`FamIDFrolicher-Stehli3`     8.078e-01  4.025e+02   0.002   0.9984  
FamIDFutrelle2              -6.571e-02  2.075e-01  -0.317   0.7515  
FamIDGale2                  -6.355e-01  4.025e+02  -0.002   0.9987  
FamIDGibson2                        NA         NA      NA       NA  
FamIDGiles2                 -6.164e-01  4.025e+02  -0.002   0.9988  
FamIDGoldenberg2             1.061e+00  3.036e+02   0.003   0.9972  
FamIDGoldsmith3              6.675e-01  4.025e+02   0.002   0.9987  
FamIDGoodwin8               -1.520e+00  3.999e+02  -0.004   0.9970  
FamIDGraham2                 5.967e-01  4.025e+02   0.001   0.9988  
FamIDGreenfield2             7.758e-01  4.025e+02   0.002   0.9985  
FamIDGustafsson3            -8.615e-01  4.018e+02  -0.002   0.9983  
FamIDHagland2               -6.350e-01  4.025e+02  -0.002   0.9987  
FamIDHakkarainen2                   NA         NA      NA       NA  
FamIDHamalainen3             9.973e-01  3.640e+02   0.003   0.9978  
FamIDHansen2                -6.143e-01  4.025e+02  -0.002   0.9988  
FamIDHansen3                -6.191e-01  4.025e+02  -0.002   0.9988  
FamIDHarder2                 7.767e-01  4.025e+02   0.002   0.9985  
FamIDHarper2                 1.991e-01  2.017e-01   0.987   0.3236  
FamIDHarris2                -6.069e-02  2.139e-01  -0.284   0.7766  
FamIDHart3                   3.698e-02  2.039e-01   0.181   0.8561  
FamIDHays3                   5.905e-01  4.025e+02   0.001   0.9988  
FamIDHerman4                 6.800e-01  4.025e+02   0.002   0.9987  
FamIDHickman3               -8.913e-01  4.021e+02  -0.002   0.9982  
FamIDHiltunen3                      NA         NA      NA       NA  
FamIDHippach2                8.313e-01  3.934e+02   0.002   0.9983  
FamIDHirvonen2               7.229e-01  4.025e+02   0.002   0.9986  
FamIDHirvonen3                      NA         NA      NA       NA  
FamIDHocking4               -6.155e-01  4.025e+02  -0.002   0.9988  
FamIDHocking5                       NA         NA      NA       NA  
FamIDHogeboom2                      NA         NA      NA       NA  
FamIDHold2                  -6.266e-01  4.025e+02  -0.002   0.9988  
FamIDHolverson2             -6.639e-01  4.025e+02  -0.002   0.9987  
FamIDHoward2                        NA         NA      NA       NA  
FamIDHoyt2                   5.754e-01  4.025e+02   0.001   0.9989  
FamIDIlmakangas2            -7.049e-01  4.025e+02  -0.002   0.9986  
FamIDJacobsohn2             -6.284e-01  4.025e+02  -0.002   0.9988  
FamIDJacobsohn4              6.136e-01  4.025e+02   0.002   0.9988  
FamIDJefferys3                      NA         NA      NA       NA  
FamIDJensen2                -6.223e-01  4.025e+02  -0.002   0.9988  
FamIDJohnson3                1.018e+00  4.013e+02   0.003   0.9980  
FamIDJohnston4              -7.554e-01  4.025e+02  -0.002   0.9985  
FamIDJussila2               -1.002e+00  4.022e+02  -0.002   0.9980  
FamIDKantor2                -1.484e-02  1.977e-01  -0.075   0.9401  
FamIDKarun2                         NA         NA      NA       NA  
FamIDKenyon2                 5.638e-01  4.025e+02   0.001   0.9989  
FamIDKhalil2                        NA         NA      NA       NA  
FamIDKiernan2               -6.449e-01  4.025e+02  -0.002   0.9987  
FamIDKimball2                8.006e-01  4.025e+02   0.002   0.9984  
FamIDKink3                  -6.152e-01  4.025e+02  -0.002   0.9988  
`FamIDKink-Heilmann3`        6.940e-01  4.025e+02   0.002   0.9986  
`FamIDKink-Heilmann5`               NA         NA      NA       NA  
FamIDKlasen3                -6.214e-01  4.025e+02  -0.002   0.9988  
FamIDLahtinen3              -8.482e-01  4.025e+02  -0.002   0.9983  
FamIDLaroche4                2.688e-02  2.057e-01   0.131   0.8960  
FamIDLefebre5               -1.312e+00  4.011e+02  -0.003   0.9974  
FamIDLennon2                -6.620e-01  4.025e+02  -0.002   0.9987  
FamIDLindell2               -6.226e-01  4.025e+02  -0.002   0.9988  
FamIDLindqvist2              8.448e-01  4.025e+02   0.002   0.9983  
FamIDLines2                  5.931e-01  4.025e+02   0.001   0.9988  
FamIDLobb2                  -6.279e-01  4.025e+02  -0.002   0.9988  
FamIDLouch2                  6.305e-01  4.025e+02   0.002   0.9988  
FamIDMadill2                 6.081e-01  4.025e+02   0.002   0.9988  
FamIDMallet3                 3.582e-02  1.574e-01   0.228   0.8200  
FamIDMarvin2                -6.843e-01  4.025e+02  -0.002   0.9986  
FamIDMcCoy3                  1.042e+00  3.470e+02   0.003   0.9976  
FamIDMcNamee2               -6.332e-01  4.025e+02  -0.002   0.9987  
FamIDMellinger2                     NA         NA      NA       NA  
FamIDMeyer2                 -9.190e-02  2.106e-01  -0.436   0.6625  
FamIDMinahan2                5.962e-01  4.025e+02   0.001   0.9988  
FamIDMinahan3               -6.493e-01  4.025e+02  -0.002   0.9987  
FamIDMock2                          NA         NA      NA       NA  
FamIDMoor2                   1.003e+00  3.969e+02   0.003   0.9980  
FamIDMoran2                 -7.498e-03  1.385e-01  -0.054   0.9568  
FamIDMoubarek3               1.027e+00  4.022e+02   0.003   0.9980  
FamIDMurphy2                 7.128e-01  4.025e+02   0.002   0.9986  
FamIDNakid3                  1.104e+00  3.350e+02   0.003   0.9974  
FamIDNasser2                -3.445e-02  2.056e-01  -0.168   0.8669  
FamIDNatsch2                -6.921e-01  4.025e+02  -0.002   0.9986  
FamIDNavratil3               7.073e-01  4.025e+02   0.002   0.9986  
FamIDNewell2                 8.609e-01  4.019e+02   0.002   0.9983  
FamIDNewell3                -6.576e-01  4.025e+02  -0.002   0.9987  
FamIDNewsom3                 5.957e-01  4.025e+02   0.001   0.9988  
FamIDNicholls3              -6.488e-01  4.025e+02  -0.002   0.9987  
`FamIDNicola-Yarred2`        1.019e+00  4.013e+02   0.003   0.9980  
`FamIDO'Brien2`              6.632e-01  4.025e+02   0.002   0.9987  
FamIDOlsen2                 -5.992e-01  4.025e+02  -0.001   0.9988  
FamIDOstby2                         NA         NA      NA       NA  
FamIDPalsson5               -1.559e+00  3.981e+02  -0.004   0.9969  
FamIDPanula6                -1.680e+00  3.489e+02  -0.005   0.9962  
FamIDParrish2                6.385e-01  4.025e+02   0.002   0.9987  
FamIDPeacock3                       NA         NA      NA       NA  
FamIDPears2                 -7.889e-02  2.114e-01  -0.373   0.7090  
`FamIDPenascoy Castellana2` -1.013e-01  2.069e-01  -0.490   0.6243  
FamIDPersson2                8.493e-01  4.025e+02   0.002   0.9983  
FamIDPeter3                  9.624e-01  3.894e+02   0.002   0.9980  
FamIDPetterson2             -6.152e-01  4.025e+02  -0.002   0.9988  
FamIDPhillips2                      NA         NA      NA       NA  
FamIDPotter2                 5.861e-01  4.025e+02   0.001   0.9988  
FamIDQuick3                  6.446e-01  4.025e+02   0.002   0.9987  
FamIDRenouf2                        NA         NA      NA       NA  
FamIDRenouf4                 6.180e-01  4.025e+02   0.002   0.9988  
FamIDRice6                  -1.800e+00  4.008e+02  -0.004   0.9964  
FamIDRichards3               7.379e-01  4.025e+02   0.002   0.9985  
FamIDRichards6               6.443e-01  4.025e+02   0.002   0.9987  
FamIDRobert2                 5.834e-01  4.025e+02   0.001   0.9988  
FamIDRobins2                -7.521e-01  4.025e+02  -0.002   0.9985  
FamIDRosblom3               -1.063e+00  3.483e+02  -0.003   0.9976  
FamIDRothschild2                    NA         NA      NA       NA  
FamIDRyerson5                6.002e-01  4.025e+02   0.001   0.9988  
FamIDSage11                 -1.635e+00  3.694e+02  -0.004   0.9965  
FamIDSamaan3                -6.754e-01  4.025e+02  -0.002   0.9987  
FamIDSandstrom3              1.003e+00  3.942e+02   0.003   0.9980  
FamIDSchabert2                      NA         NA      NA       NA  
FamIDShelley2                6.163e-01  4.025e+02   0.002   0.9988  
FamIDSilven3                 6.904e-01  4.025e+02   0.002   0.9986  
FamIDSilvey2                -5.504e-02  2.148e-01  -0.256   0.7977  
FamIDSkoog6                 -1.697e+00  3.623e+02  -0.005   0.9963  
FamIDSmith2                         NA         NA      NA       NA  
FamIDSnyder2                        NA         NA      NA       NA  
FamIDSolo                    4.404e-01  1.094e+00   0.403   0.6872  
FamIDSpedden3                       NA         NA      NA       NA  
FamIDSpencer2                5.550e-01  4.025e+02   0.001   0.9989  
FamIDStengel2                       NA         NA      NA       NA  
FamIDStephenson2             5.817e-01  4.025e+02   0.001   0.9988  
FamIDStraus2                        NA         NA      NA       NA  
FamIDStrom2                 -7.416e-01  4.025e+02  -0.002   0.9985  
FamIDStrom3                 -7.681e-01  4.025e+02  -0.002   0.9985  
FamIDTaussig3                8.421e-01  3.904e+02   0.002   0.9983  
FamIDTaylor2                 1.073e+00  3.040e+02   0.004   0.9972  
FamIDThayer3                 1.030e+00  3.244e+02   0.003   0.9975  
FamIDThomas2                 7.254e-01  4.025e+02   0.002   0.9986  
FamIDThomas3                        NA         NA      NA       NA  
FamIDThorneycroft2           3.898e-02  1.546e-01   0.252   0.8010  
FamIDTouma3                         NA         NA      NA       NA  
FamIDTurpin2                -1.127e+00  2.917e+02  -0.004   0.9969  
FamIDvanBilliard3           -6.177e-01  4.025e+02  -0.002   0.9988  
FamIDVanderPlanke2          -7.663e-01  4.025e+02  -0.002   0.9985  
FamIDVanderPlanke3          -9.833e-01  3.480e+02  -0.003   0.9977  
FamIDVanderPlanke4                  NA         NA      NA       NA  
FamIDVanImpe3               -1.106e+00  3.902e+02  -0.003   0.9977  
FamIDWare2                          NA         NA      NA       NA  
FamIDWarren2                 5.888e-01  4.025e+02   0.001   0.9988  
FamIDWeisz2                  6.189e-01  4.025e+02   0.002   0.9988  
FamIDWells3                  6.463e-01  4.025e+02   0.002   0.9987  
FamIDWest4                   2.669e-02  2.030e-01   0.132   0.8954  
FamIDWhite2                 -6.523e-01  4.025e+02  -0.002   0.9987  
FamIDWick3                   6.223e-01  4.025e+02   0.002   0.9988  
FamIDWidener3               -6.851e-01  4.025e+02  -0.002   0.9986  
FamIDWiklund2                       NA         NA      NA       NA  
FamIDWilkes2                        NA         NA      NA       NA  
FamIDWilliams2              -6.638e-01  4.025e+02  -0.002   0.9987  
FamIDYasbeck2                       NA         NA      NA       NA  
FamIDZabour2                -7.335e-01  4.025e+02  -0.002   0.9985  
TitleMiss                   -8.526e+00  2.897e+03  -0.003   0.9977  
TitleMr                     -1.242e+00  7.618e-01  -1.630   0.1030  
TitleMrs                    -6.906e+00  2.505e+03  -0.003   0.9978  
TitleNoble                  -5.001e-01  2.775e-01  -1.802   0.0716 .
`Sexmale:Pclass2`            1.211e-01  4.900e-01   0.247   0.8048  
`Sexmale:Pclass3`            1.035e+00  6.352e-01   1.630   0.1031  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 366.58  on 515  degrees of freedom
AIC: 764.58

Number of Fisher Scoring iterations: 18


So we did a little bit better. I would just like to test a theory. We manufactured the FamilyID and have been doing well so far. What happens if we take it out? Will we do worse or better? Let's check it out.

In [289]:
%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35) 

logit.tune4<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title,
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune4
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.8123305  0.5934912  0.04130739   0.09243173

 

In [290]:
%%R
summary(logit.tune4)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6591  -0.5397  -0.4099   0.3937   2.4194  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.56336    0.12309  -4.577 4.72e-06 ***
Sexmale            -8.13159  278.33333  -0.029 0.976693    
Pclass2            -0.36485    0.33842  -1.078 0.280987    
Pclass3            -1.62269    0.37865  -4.285 1.82e-05 ***
Age                -0.30128    0.13506  -2.231 0.025693 *  
SibSp              -0.62972    0.17467  -3.605 0.000312 ***
EmbarkedQ           0.04973    0.11805   0.421 0.673568    
EmbarkedS          -0.17619    0.12541  -1.405 0.160047    
`FareGroup10-20`    0.07364    0.14908   0.494 0.621326    
`FareGroup20-40`   -0.01342    0.17629  -0.076 0.939314    
`FareGroup40+`      0.11277    0.22144   0.509 0.610572    
TitleMiss          -6.75940  236.67998  -0.029 0.977216    
TitleMr            -1.55762    0.29962  -5.199 2.01e-07 ***
TitleMrs           -5.78595  204.63657  -0.028 0.977443    
TitleNoble         -0.46573    0.14315  -3.253 0.001140 ** 
`Sexmale:Pclass2`  -0.32045    0.29394  -1.090 0.275630    
`Sexmale:Pclass3`   0.65008    0.35598   1.826 0.067823 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.85  on 697  degrees of freedom
AIC: 610.85

Number of Fisher Scoring iterations: 13


So we actually did much better. So lesson learnt, engineering new features is a great idea but may or may not positively impact your model. In fact, if we don't get it right, it could have an adverse impact. Let's see if we can make anymore tiny improvements with the Title. We have 4 possible values and I am going to class compress each.

In [291]:
%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35) 

logit.tune5<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + I(Title=='Mr') +
                   I(Title=='Mrs') + I(Title=='Miss') + I(Title=='Noble'),
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune5
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD  
  0.8123305  0.5934912  0.04130739   0.09243173

 

In [292]:
%%R
summary(logit.tune5)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6591  -0.5397  -0.4099   0.3937   2.4194  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)                -0.56336    0.12309  -4.577 4.72e-06 ***
Sexmale                    -8.13159  278.33333  -0.029 0.976693    
Pclass2                    -0.36485    0.33842  -1.078 0.280987    
Pclass3                    -1.62269    0.37865  -4.285 1.82e-05 ***
Age                        -0.30128    0.13506  -2.231 0.025693 *  
SibSp                      -0.62972    0.17467  -3.605 0.000312 ***
EmbarkedQ                   0.04973    0.11805   0.421 0.673568    
EmbarkedS                  -0.17619    0.12541  -1.405 0.160047    
`FareGroup10-20`            0.07364    0.14908   0.494 0.621326    
`FareGroup20-40`           -0.01342    0.17629  -0.076 0.939314    
`FareGroup40+`              0.11277    0.22144   0.509 0.610572    
`I(Title == "Mr")TRUE`     -1.55762    0.29962  -5.199 2.01e-07 ***
`I(Title == "Mrs")TRUE`    -5.78595  204.63657  -0.028 0.977443    
`I(Title == "Miss")TRUE`   -6.75940  236.67998  -0.029 0.977216    
`I(Title == "Noble")TRUE`  -0.46573    0.14315  -3.253 0.001140 ** 
`Sexmale:Pclass2`          -0.32045    0.29394  -1.090 0.275630    
`Sexmale:Pclass3`           0.65008    0.35598   1.826 0.067823 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.85  on 697  degrees of freedom
AIC: 610.85

Number of Fisher Scoring iterations: 13


Hmm, didn't really make a difference. Let's try one last thing, adding Child, which we derived during the feature engineering exercise.

In [120]:
%%R
#Let's add Child to the mix.
set.seed(35) 

logit.tune6<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
                data=train.set,
                method='glm',
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

logit.tune6
Generalized Linear Model 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results

  Accuracy   Kappa      Accuracy SD  Kappa SD
  0.8144806  0.5969237  0.04436537   0.101153

 

In [294]:
%%R
summary(logit.tune6)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7285  -0.5465  -0.4164   0.4031   2.4338  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -0.56186    0.12306  -4.566 4.98e-06 ***
Sexmale            -8.16008  276.99294  -0.029 0.976498    
Pclass2            -0.36500    0.33876  -1.077 0.281271    
Pclass3            -1.62927    0.37881  -4.301 1.70e-05 ***
Age                -0.25715    0.14521  -1.771 0.076581 .  
SibSp              -0.64181    0.17585  -3.650 0.000262 ***
EmbarkedQ           0.05690    0.11859   0.480 0.631327    
EmbarkedS          -0.17477    0.12575  -1.390 0.164569    
`FareGroup10-20`    0.05529    0.15150   0.365 0.715165    
`FareGroup20-40`   -0.03809    0.17965  -0.212 0.832083    
`FareGroup40+`      0.09460    0.22321   0.424 0.671714    
TitleMiss          -6.73181  235.54018  -0.029 0.977199    
TitleMr            -1.47297    0.31722  -4.643 3.43e-06 ***
TitleMrs           -5.73725  203.65108  -0.028 0.977525    
TitleNoble         -0.44396    0.14534  -3.055 0.002254 ** 
Child               0.11549    0.14579   0.792 0.428237    
`Sexmale:Pclass2`  -0.31971    0.29422  -1.087 0.277200    
`Sexmale:Pclass3`   0.64601    0.35560   1.817 0.069267 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.23  on 696  degrees of freedom
AIC: 612.23

Number of Fisher Scoring iterations: 13


So we got that tiny push we were looking for. Let's now go ahead and try this model on our test set as well as submit to Kaggle.

Model Evaluation - Logistic Regression
We can now begin to evaluate model performance by putting together some cross-tabulations of the observed and predicted Survival for the passengers in the test.set data. caret makes this easy with the confusionMatrix function.

In [295]:
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
logit.pred<-predict(logit.tune6, test.set)

#Generate the confusion matrix
confusionMatrix(logit.pred, test.set$Survived)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 99 18
         1 10 50
                                          
               Accuracy : 0.8418          
                 95% CI : (0.7795, 0.8922)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 4.229e-11       
                                          
                  Kappa : 0.6581          
 Mcnemar's Test P-Value : 0.1859          
                                          
            Sensitivity : 0.9083          
            Specificity : 0.7353          
         Pos Pred Value : 0.8462          
         Neg Pred Value : 0.8333          
             Prevalence : 0.6158          
         Detection Rate : 0.5593          
   Detection Prevalence : 0.6610          
      Balanced Accuracy : 0.8218          
                                          
       'Positive' Class : 0               
                                          

The metric we're looking for here is called Specificity. It basically is, out of all that actually survived how many we predicted will survive. So, it's 50/68 = 73.53%. Not too shabby though I'd like to do better. Let's anyway try to make a submission and find out for real.

Submit Results to Kaggle
Let's now submit the results from the LR model to Kaggle to see how we fare.

In [356]:
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(logit.tune6, kaggletest)
lr.predictions<-as.data.frame(Survived)
lr.predictions$PassengerId<-kaggletest$PassengerId

#Write results as a CSV file
write.csv(lr.predictions[,c('PassengerId','Survived')], file="LR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)

The model scored 0.76555 which put us ahead of only about 1/4th of the teams on the leaderboard. Let us keep trying to improve.

Support Vector Machines
Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification or for regression. SVMs are a discriminative classifier: that is, they draw a boundary between clusters of data. The process of fitting an SVM based model to our dataset is very similar to what we just did with glm but involves an additional step to hypertune parameters. The key parameter for SVM is C, which can be considered as 1/lambda where lambda is the regularization term. We talked about overfitting previously, regularization is the process of offsetting it. If there is overfitting, we would increase lambda, so conversely decrease C.

The caret package automatically selects the best C value by hypertuning it during crossvalidation. But we need to supply a range of C values for the train method to try. For SVMs, this could be handled by updating the parameter, tunelength (or tunegrid). By default, the length is 3 and the values tried are 0.25, 0.5 and 1. Setting it to 6 would try 0.25 - 8. Let's get started.

In [124]:
%%R
set.seed(35) 

#Training an SVM model with the RBF kernel - Radial Basis Function
svm.tune1<-train(Survived ~ Sex + Pclass +  Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
                data=train.set,
                method='svmRadial',
                tuneLength=9,
                preProcess = c("center", "scale"),
                trControl=tenfoldcv)

svm.tune1
Support Vector Machines with Radial Basis Function Kernel 

714 samples
 14 predictors
  2 classes: '0', '1' 

Pre-processing: centered, scaled 
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results across tuning parameters:

  C      Accuracy   Kappa      Accuracy SD  Kappa SD 
   0.25  0.8049426  0.5722555  0.05162691   0.1132959
   0.50  0.8117884  0.5843046  0.05585971   0.1259916
   1.00  0.8226591  0.6100882  0.05152085   0.1165301
   2.00  0.8249804  0.6164518  0.05146277   0.1156968
   4.00  0.8160650  0.5986011  0.05477380   0.1213830
   8.00  0.8131307  0.5934170  0.04906936   0.1095503
  16.00  0.8131172  0.5940660  0.05068143   0.1116418
  32.00  0.8150599  0.5995714  0.05261309   0.1151817
  64.00  0.8092588  0.5876340  0.05748809   0.1249829

Tuning parameter 'sigma' was held constant at a value of 0.07616343
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.07616343 and C = 2. 

Great, so SVM automatically tried 9 different values of C while holding the other parameter Sigma constant and picked the values that gave the best results. We really didn't have to do much there. Let's go ahead and evaluate the model as well as submit to Kaggle.

Model Evaluation - SVM

In [312]:
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
svm.pred<-predict(svm.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(svm.pred, test.set$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 100  20
         1   9  48
                                          
               Accuracy : 0.8362          
                 95% CI : (0.7732, 0.8874)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.38e-10        
                                          
                  Kappa : 0.6429          
 Mcnemar's Test P-Value : 0.06332         
                                          
            Sensitivity : 0.9174          
            Specificity : 0.7059          
         Pos Pred Value : 0.8333          
         Neg Pred Value : 0.8421          
             Prevalence : 0.6158          
         Detection Rate : 0.5650          
   Detection Prevalence : 0.6780          
      Balanced Accuracy : 0.8117          
                                          
       'Positive' Class : 0               
                                          

The specificity is actually a bit down from the LR model, we're at 70%. Other than that eveyrthing looks pretty similar. I also just discovered that a Support Vector Machine automatically captures interaction between variables. So the Pclass:Sex interaction we put in has no importance to this model. We'll remove it going forward since Random Forests also automatically identifies interactions.

Submit to Kaggle

In [355]:
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(svm.tune1, kaggletest)
svm.predictions<-as.data.frame(Survived)
svm.predictions$PassengerId<-kaggletest$PassengerId

#Write results as a CSV file
write.csv(svm.predictions[,c('PassengerId','Survived')], file="SVM_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)

We scored 0.77512, an improvement over the Logistic Regression model that took us up the leaderboard a few notches. We're not going to let this go!

Random Forests
Next up, a very popular and easy to use model, Random Forests. RF builds on the concept of decision trees and expands it by growing multiple trees and averaging out the results to find the best fit. The parameter we need to tune is mtry, that number of features to try at each node. The best recommended value for this parameter is typically the square root of the number of features. Let's give this a shot.

In [365]:
%%R
set.seed(35) 

rfgrid<-data.frame(.mtry=c(2,3,4))

#Training a Random Forest model
rf.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
                data=train.set,
                method='rf',
                tuneGrid=rfgrid,
                trControl=tenfoldcv)

rf.tune1
Random Forest 

714 samples
 14 predictors
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD 
  2     0.8096114  0.5868040  0.05614486   0.1231515
  3     0.8120599  0.5869056  0.05485173   0.1228654
  4     0.8151139  0.5951016  0.05099510   0.1145908

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 4. 

Looks like the best mtry value found was 4. Let's complete the formalities.

Model Evaluation - Random Forests

In [353]:
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(rf.pred, test.set$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 101  20
         1   8  48
                                          
               Accuracy : 0.8418          
                 95% CI : (0.7795, 0.8922)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 4.229e-11       
                                          
                  Kappa : 0.6542          
 Mcnemar's Test P-Value : 0.03764         
                                          
            Sensitivity : 0.9266          
            Specificity : 0.7059          
         Pos Pred Value : 0.8347          
         Neg Pred Value : 0.8571          
             Prevalence : 0.6158          
         Detection Rate : 0.5706          
   Detection Prevalence : 0.6836          
      Balanced Accuracy : 0.8162          
                                          
       'Positive' Class : 0               
                                          

Well, the specificity is exactly the same, at 70%. I doubt this is going to give us a different result with Kaggle but let's try anyway.

Submit to Kaggle

In [357]:
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune1, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions$PassengerId<-kaggletest$PassengerId

#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)

Surprise, we scored 0.78469 bringing us midway on the leaderboard. So we're definitely making headway!

Feature Importances
Before we move ahead let's look at a key statistic that comes from running a tree based model. It's called feature importances, which tells us how much of an impact each of the features we're feeding to the model has to the final outcome.

In [370]:
%%R
#Print variable importan
varImp(rf.tune1$finalModel)
                 Overall
Sexmale        41.576129
Pclass2         6.801074
Pclass3        21.558667
Age            36.931313
SibSp          15.046491
EmbarkedQ       2.972848
EmbarkedS       6.683872
FareGroup10-20  3.980921
FareGroup20-40  6.418674
FareGroup40+   10.203688
TitleMiss      11.145308
TitleMr        36.630889
TitleMrs       10.509372
TitleNoble      2.862573
Child           4.153887

That's very interesting. We always knew that Gender was the most important variable. Sexmale by the way shows up there because it's the class that suffered the most. We see here that Age and "Mr." Title take the second spot. That is a revelation of sorts. SibSp also seems quite important compared to say Embarked.

Before we move ahead with other models, I am really curious to try a few more things out with RF. I would like to measure the impact of adding a couple of features we've left out, Family Size and Parch and see how they pan out in terms of importance. Offline I tried adding FamID and it again created a negative impact so I am leaving it out for good

In [41]:
%%R
set.seed(35) 

#Notice how we're trying out 2-5 features at each node now since we're adding a couple of features to train.
rfgrid<-data.frame(.mtry=c(2,3,4,5))

#Training a Random Forest model
rf.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
                data=train.set,
                method='rf',
                tuneGrid=rfgrid,
                trControl=tenfoldcv)

rf.tune2
Random Forest 

714 samples
 14 predictors
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results across tuning parameters:

  mtry  Accuracy   Kappa      Accuracy SD  Kappa SD  
  2     0.8161450  0.6018844  0.05130416   0.11229173
  3     0.8114502  0.5872835  0.05293296   0.11855522
  4     0.8133020  0.5920896  0.04751211   0.10699385
  5     0.8221049  0.6147302  0.04405445   0.09805046

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 5. 

As expected, the best mtry value this time was 5. Let's look at the feature importances.

In [73]:
%%R
varImp(rf.tune2$finalModel)
                 Overall
Sexmale        43.453469
Pclass2         6.453174
Pclass3        19.967399
Age            42.531124
SibSp          10.604394
EmbarkedQ       2.998889
EmbarkedS       6.815784
FareGroup10-20  4.111547
FareGroup20-40  6.365776
FareGroup40+    9.403774
TitleMiss      10.173660
TitleMr        38.239391
TitleMrs        8.937783
TitleNoble      2.831949
Child           3.719191
FamSize        19.145703
Parch           7.960594

So it worked out well. Parch and especially FamSize seem to be reasonably important. I believe feature selection plays a very important role in Tree based on models (not that they don't in others but perhaps even more important in this case). OK let's run predictions then re-submit to Kaggle to see if we can improve the score.

In [43]:
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune2, test.set)

#Generate the confusion matrix
confusionMatrix(rf.pred, test.set$Survived)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 98 20
         1 11 48
                                          
               Accuracy : 0.8249          
                 95% CI : (0.7607, 0.8778)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 1.304e-09       
                                          
                  Kappa : 0.6204          
 Mcnemar's Test P-Value : 0.1508          
                                          
            Sensitivity : 0.8991          
            Specificity : 0.7059          
         Pos Pred Value : 0.8305          
         Neg Pred Value : 0.8136          
             Prevalence : 0.6158          
         Detection Rate : 0.5537          
   Detection Prevalence : 0.6667          
      Balanced Accuracy : 0.8025          
                                          
       'Positive' Class : 0               
                                          

No difference on predictions in our Test set. The sensitivity is the same. But let's try submitting to Kaggle anyway.

In [44]:
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune2, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions$PassengerId<-kaggletest$PassengerId

#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)

Guess what, we did make an improvement to the Kaggle score. This model scored 0.78947 bringing us to the top 1/3rd of the leaderboard. Since tree based models are giving us great results let's try one more, a very interesting model called Conditional Trees.

Conditional Trees
Conditional Tree based models supposedly tend to select variables that have many possible splits or many missing values. So instead of Random Forests which tries to find out which variables are important, using an information measure, these models perform sort of a significance test to see which features will yield the best results at each split. Let's give this a whirl.

Note that there are two Conditional Tree packages in caret (ctree, ctree2). We will be using ctree2 which allows us to tune the Max Depth, how deep the trees can grow.

In [75]:
%%R
set.seed(35) 

ctrgrid<-data.frame(.maxdepth=c(2,3,4,5,6))

#Training a Conditional Tree model
ctr.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
                data=train.set,
                method='ctree2',
                tuneGrid=ctrgrid,
                trControl=tenfoldcv)

ctr.tune1
Conditional Inference Tree 

714 samples
 14 predictors
  2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ... 

Resampling results across tuning parameters:

  maxdepth  Accuracy   Kappa      Accuracy SD  Kappa SD 
  2         0.7707812  0.4755401  0.04604332   0.1104954
  3         0.8077726  0.5855616  0.05702946   0.1237835
  4         0.8236502  0.6130369  0.05491988   0.1236674
  5         0.8194444  0.6055659  0.04897090   0.1102774
  6         0.8152061  0.5960590  0.05126715   0.1141253

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was maxdepth = 4. 

OK that did give us a better accuracy at Max Depth 4. Before we run predictions, let's visualize the tree that was built for the final model.

In [115]:
%%R
plot(ctr.tune1$finalModel)

We can observe here that Sexmale as expected was the starting node. From then on, Pclass and TitleMr took the honors for level 2 leading further then to the other variables. We're now ready to evaluate the model and run the predictions.

Model Evaluation - Conditional Trees

In [76]:
%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
ctr.pred<-predict(ctr.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(ctr.pred, test.set$Survived)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 102  19
         1   7  49
                                          
               Accuracy : 0.8531          
                 95% CI : (0.7922, 0.9017)
    No Information Rate : 0.6158          
    P-Value [Acc > NIR] : 3.506e-12       
                                          
                  Kappa : 0.6789          
 Mcnemar's Test P-Value : 0.03098         
                                          
            Sensitivity : 0.9358          
            Specificity : 0.7206          
         Pos Pred Value : 0.8430          
         Neg Pred Value : 0.8750          
             Prevalence : 0.6158          
         Detection Rate : 0.5763          
   Detection Prevalence : 0.6836          
      Balanced Accuracy : 0.8282          
                                          
       'Positive' Class : 0               
                                          

So that gave us a better Specificity on the Test set at 72.06%. I am interested to see how this tests out at Kaggle.

Submit to Kaggle

In [91]:
%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(ctr.tune1, kaggletest)
ctr.predictions<-as.data.frame(Survived)
ctr.predictions$PassengerId<-kaggletest$PassengerId

#Write results as a CSV file
write.csv(ctr.predictions[,c('PassengerId','Survived')], file="CTR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)

The model scored 0.77512 which is actually slightly worse than how Random Forests did. So the best model in terms of the Kaggle leaderboard has been Random Forests. But let's run a formal comparison of all the models we've built so far.

Model Comparison
The resamples method in the caret package makes it easy to compare results between different models.

In [126]:
%%R
modelcompare<-resamples(list(Logit = logit.tune6, SVM = svm.tune1, RF = rf.tune2, CTREE = ctr.tune1 ))
summary(modelcompare)

Call:
summary.resamples(object = modelcompare)

Models: Logit, SVM, RF, CTREE 
Number of resamples: 24 

Accuracy 
        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
Logit 0.7465  0.7887 0.8042 0.8145  0.8333 0.9437    0
SVM   0.7606  0.7778 0.8042 0.8244  0.8677 0.9296    0
RF    0.7606  0.7917 0.8182 0.8250  0.8592 0.9155    0
CTREE 0.7361  0.7770 0.8099 0.8216  0.8502 0.9577    0

Kappa 
        Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
Logit 0.4295  0.5334 0.5747 0.5969  0.6505 0.8787    0
SVM   0.4570  0.5146 0.5612 0.6149  0.7173 0.8495    0
RF    0.4652  0.5392 0.6140 0.6195  0.7012 0.8232    0
CTREE 0.4338  0.5003 0.5771 0.6083  0.6847 0.9097    0


We can see that for the metric we chose (Accuracy), Random Forests outperformed the other models. Note here that we could've chosen ROC (Receiver Operating Characteristic) as the metric in which case, we'd have needed to generate class probabilties - that is, probability for survived/not survived for every data item rather than letting crossvalidation generate predictions automatically. I intend to learn and demonstrate these concepts in a seperate session. For now, let's plot the results in a couple of different ways.

Box Plot of Model Comparison Results

In [128]:
%%R
bwplot(modelcompare)

Dot Plot of Model Comparison Results

In [129]:
%%R
dotplot(modelcompare)

So there you have it. We took the Titanic dataset presented by Kaggle, imported and visualized the data in a series of plots, munged the data to address gaps and fitted 4 different models with varying results, all in R, special thanks to the caret package. We only managed to get up to 0.7894 on the Kaggle leaderboard but the point of this exercise was to learn and demonstrate machine learning concepts in R. Hope we've achieved that objective.

I will be happy to receive your feedback positive or negative but try to be nice, I am just learning :-)


Share more, Learn more!





Comments

Comments powered by Disqus