# Saving the Titanic with R & IPython

The following is an illustration of one of my approaches to solving the Titanic Survival prediction challenge hosted by Kaggle. Below is an excerpt from the competition page.

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Disclaimer
This pursuit infuses my own ideas with others I've had the privilege to learn from. I write to further my learning.

OK let's now take a look at the column descriptions provided for the dataset.

``````VARIABLE DESCRIPTIONS:
survival        Survival
(0 = No; 1 = Yes)
pclass          Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.``````

Bottomline, we have some information about passengers traveling aboard the Titanic and we need to predict if train a model that can predict if one survived or not based on data similar to that provided in the dataset. Without further ado, let's get started.

In [1]:
```#Load the R Magic so we can execute R scripts within this notebook
```
In [2]:
```%%R
#Note that every code block in this notebook will need to have the above line to enable IPython to understand we're coding R.

#I've downloaded the train and test CSV files to my work directory. You should too unless you cloned this repo.

#Define a read function so we don't need to do it twice. Column types specifies data types for each column and missing
#types specify different types of null values possible.
read_better <- function(file.name, column.types, missing.types) {
colClasses=column.types,
na.strings=missing.types )
}

#Let's now define the column types
column.types=c('integer',   # PassengerId
'factor',    # Survived
'factor',    # Pclass
'character', # Name
'factor',    # Sex
'numeric',   # Age
'integer',   # SibSp
'integer',   # Parch
'character', # Ticket
'numeric',   # Fare
'character', # Cabin
'factor')    # Embarked
#Different types of null values
missing.types=c('NA','')

#For test, the Survived column (2nd col) doesn't exist, let's remove that type before reading.

#Let's make copies so we never have to read again
train<-orig_train
test<-orig_test

#Quickly print a summary of train
summary(train)
```
```  PassengerId    Survived Pclass      Name               Sex
Min.   :  1.0   0:549    1:216   Length:891         female:314
1st Qu.:223.5   1:342    2:184   Class :character   male  :577
Median :446.0            3:491   Mode  :character
Mean   :446.0
3rd Qu.:668.5
Max.   :891.0

Age            SibSp           Parch           Ticket
Min.   : 0.42   Min.   :0.000   Min.   :0.0000   Length:891
1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000   Class :character
Median :28.00   Median :0.000   Median :0.0000   Mode  :character
Mean   :29.70   Mean   :0.523   Mean   :0.3816
3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
Max.   :80.00   Max.   :8.000   Max.   :6.0000
NA's   :177
Fare           Cabin           Embarked
Min.   :  0.00   Length:891         C   :168
1st Qu.:  7.91   Class :character   Q   : 77
Median : 14.45   Mode  :character   S   :644
Mean   : 32.20                      NA's:  2
3rd Qu.: 31.00
Max.   :512.33

```

#### Data Munging

Munging is essentially cleansing the data so it's ready for our super sophisticated Machine Learning algorithms :-)

Ideally I'd like to use Pandas which is an awesome tool for these types of things but considering Pandas itself was inspired from R, we will try the whole thing in R this time. I'll create a seperate notebook later to do it all in sklearn/pandas.

Alright let's get started. The first step of any Data Cleansing process is Visualization. Why is that? I asked the question myself but how would you cleanse something when you don't know what it is? And what better way to understand data than by looking at in colourful visualizations. Let's go and create some.

In [3]:
```%%R
#I loved the look of this R package that provides the missing map which can give you a super quick peek into the dataset.
#You'll need the Amelia package for this visualization.

#install.packages("Amelia")
require(Amelia)
missmap(train, main="Titanic - Missing Data Map", col=c("forestgreen","lightskyblue2"), legend=FALSE)
```
```Loading required package: Amelia
##
## Amelia II: Multiple Imputation
## (Version 1.7.3, built: 2014-11-14)
## Copyright (C) 2005-2014 James Honaker, Gary King and Matthew Blackwell
##

```

It's obvious looking at this map that the most values missing are in the Cabin and Age columns. Cabin basically refers to the cabin# of each passenger and considering how many values are missing, we could as well just drop it. Age however is critical and we need a better mechanism to handle. There is also one missing Embarked value.

In [4]:
```%%R
#OK let's plot a series of visualizations that can hopefully explain the data better"

#Proportion of Survivors
barplot(prop.table(table(train\$Survived)), names.arg=c('Didn\'t Survive', 'Survived'), main="Proportion of Survivors",
col=c('mistyrose','lightseagreen'))

#Clearly more people died than survived.
```
In [5]:
```%%R
#Proportion of Survivors by Gender
barplot(prop.table(table(train\$Sex, train\$Survived), 1), names.arg=c('Didn\'t Survive', 'Survived'),
main="Proportion of Survivors by Gender", legend=TRUE, col=c('darksalmon','paleturquoise'))

#More Females survived than Males. This is understandable considering the ladies and children first approach for survival.
```
In [6]:
```%%R
#Proportion of Survivors by Class of Travel - let's do a mosaicplot this time for fun.
mosaicplot(prop.table(table(train\$Pclass, train\$Survived), 1), main="Proportion of Survivors by Pclass",
xlab='Pclass', ylab='Survived ?',col=c('darkturquoise','mediumspringgreen'))

#Looks like those traveling in upper class (3) were more lucky than the rest. We'll keep this in mind.
```
In [7]:
```%%R
#OK let's do a quick plot by Age - remember that we need to fill in missing values for this column.
boxplot(train\$Age~train\$Survived,main="Proportion of Survivors by Age", col=c('darkseagreen4','salmon4'), xlab="Survived ?",
ylab="Age")

#OK that was a helpful plot, it clearly tells us that there were more survivors in the 20-35 Age bracket, young legs perhaps?
```
In [8]:
```%%R
#OK let's get down to business. We'll look at how many values are missing for Age.
summary(train\$Age)

#177 is a lot considering our dataset is really small. Let's try to find a meaningful way to fill these up.

#The determining factors so far has been Gender and Pclass. Can we find out how many Age values are missing by Gender?
barplot(table(train\$Sex[which(is.na(train\$Age))]),main="Proportion of Missing Ages by Gender",
col=c('lightsteelblue','bisque3'), xlab="Gender", ylab="Missing Ages")

#OK that's clearly tilted in favor of males. We're a sloppy bunch, aren't we?
```
In [9]:
```%%R
#How about missing Age values by Pclass?
barplot(table(train\$Pclass[which(is.na(train\$Age))]),main="Proportion of Missing Ages by Pclass",
col=c('mediumseagreen','rosybrown4','mediumslateblue'), xlab="Pclass", ylab="Missing Ages")

#There's our most important highlight yet. Most of the ages we're missing are in Pclass 3. So this goes to show, we will
#probably be better off taking the median of ages for each Pclass and gender and filling the null values with it. That
#should be better than simply filling all with overall median.
```
In [10]:
```%%R
#Let's go ahead and do the honors. Note that we could do all of this with a single sophisticated function but I am
#choosing to keep things simple at the moment.

#Before we make any changes to Train, let's combine Train and Test temporarily to a new dataset. Since any change we need
#to make to Train needs to be made to Test as well, we can do the changes only once and split the datasets again later.

#We'll first add the Survived Column to the Test set since it doesn't exist and init to NULL values since we won't use it.
test\$Survived<-rep(0,nrow(test))
titanic<-rbind(train,test)

#na.rm basically calculates mean for all non-null values. We then plug the medi
titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)
titanic\$Age[which(titanic\$Pclass==2 & titanic\$Sex=="female" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)
titanic\$Age[which(titanic\$Pclass==1 & titanic\$Sex=="female" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)
titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="male" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)
titanic\$Age[which(titanic\$Pclass==2 & titanic\$Sex=="male" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)
titanic\$Age[which(titanic\$Pclass==1 & titanic\$Sex=="male" & is.na(titanic\$Age))]<-
median(titanic\$Age[which(titanic\$Pclass==3 & titanic\$Sex=="female")],na.rm=TRUE)

#Let's do a quick summary of the Age column to make sure there are no nulls remaining

summary(titanic\$Age)
```
```   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.17   22.00   24.00   28.30   35.00   80.00

```
In [11]:
```%%R
#Let's turn our attention to the Embarked Column
summary(titanic\$Embarked)

#There are just 2 missing values and S seems to be the majority so let's plug in S
titanic\$Embarked[which(is.na(titanic\$Embarked))] <- 'S'

summary(titanic\$Embarked)
```
```  C   Q   S
270 123 916

```
In [12]:
```%%R
#Let's remove the Cabin column which contains way too many NULLs and will probably not give anything new considering
#we already have the port of embarcation.
titanic\$Cabin<-NULL
```
##### Feature Engineering

Feature Engineering refers to manufacturing new features based on the idea that they may replace existing ones because they are easier for the machine learning algorithms to digest or extend the value of existing feature(s) because they're more relevant for the predictions we're trying to make. Let's see what new features we can add to the Titanic dataset.

In [13]:
```%%R
#OK we all know that they tried to save the women and children first aboard the titanic. We have gender which identifies
#women but no identifier for children. Let's call all passengers with Age < 18 children shall we?
titanic\$Child<-0
titanic\$Child[which(titanic\$Age<18)]<-1
```
In [14]:
```%%R
#We have a column called fare which is a numeric value of the actual fare that was paid for the trip. This column as such
#might not be very useful but what if we can break it down to different buckets - say <20, 20-40, 40-60, 60+. Recall that
#we had noticed a curious fact with respect to the range 20-35, let's make sure that is covered in a single bucket.
titanic\$FareGroup<-'40+'
titanic\$FareGroup[which(titanic\$Fare<10)] <- '<10'
titanic\$FareGroup[which(titanic\$Fare>=10 & titanic\$Fare<20)] <- '10-20'
titanic\$FareGroup[which(titanic\$Fare>=20 & titanic\$Fare<40)] <- '20-40'

titanic\$FareGroup<-as.factor(titanic\$FareGroup)

barplot(table(titanic\$FareGroup),col='tomato1',main='Fare Groups')
```
In [15]:
```%%R
#Name is another clear candidate that might not be of great value to us. What could a name contribute to predicting if he/she
#actually survived? Well, it may not directly but it could contain things that influence the prediction. Let's print a few
#names to see.
tail(titanic\$Name)

#We can see that the names seem to be in similar format and have a title in between the surname and first name. Can we
#extract the Titles from the names?
```
```[1] "Henriksson, Miss. Jenny Lovisa" "Spector, Mr. Woolf"
[3] "Oliva y Ocana, Dona. Fermina"   "Saether, Mr. Simon Sivertsen"
[5] "Ware, Mr. Frederick"            "Peter, Master. Michael J"

```
In [16]:
```%%R
#We could use the strsplit function to split based on , and . then capture the middle items which should be the title. The
#sapply function helps to perform this for the entire dataframe. function keyword is just like lambda in python. We will
#also remove any extra whitespaces.
titanic\$Title<-sapply(titanic\$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic\$Title<-sub(' ','',titanic\$Title)
```
```[1] "Mr"   "Mrs"  "Miss" "Mrs"  "Mr"   "Mr"

```
In [17]:
```%%R
#Let's take a look at the different values of Title and see if further grouping is necessary.
#To be safe, we'll perform any analysis only on the training set. Let's temporarily split Train.
temp <- titanic[1:nrow(train),]
table(temp\$Title)

#OK there's obviously only a few titles that are most prominent namely Mr, Miss, Mrs and Master. Perhaps we can group the
#other titles further?
```
```
Capt          Col          Don           Dr     Jonkheer         Lady
1            2            1            7            1            1
Major       Master         Miss         Mlle          Mme           Mr
2           40          182            2            1          517
Mrs           Ms          Rev          Sir the Countess
125            1            6            1            1

```
In [18]:
```%%R
#Looking at wikipedia definitions, the following grouping makes reasonable sense because of similar definitions.
#Note it's definitely possible to nitpick these groupings, feel free to make your own judgement call.
titanic\$Title[titanic\$Title %in% c('Dona','Ms','Lady','the Countess','Jonkheer')] <- 'Mrs'
titanic\$Title[titanic\$Title %in% c('Col','Dr','Rev')] <- 'Noble'
titanic\$Title[titanic\$Title %in% c('Mme','Mile','Mlle')] <- 'Miss'
titanic\$Title[titanic\$Title %in% c('Capt','Don','Major','Sir')] <- 'Mr'

titanic\$Title<-as.factor(titanic\$Title)

table(titanic\$Title)
```
```
Master   Miss     Mr    Mrs  Noble
61    263    762    203     20

```
In [19]:
```%%R
#OK so we split the Titles out, but what about Surnames? Surnames could indicate families traveling together, maybe
#many of them tried to stick together trying to escape? Let's capture the surnames.
titanic\$Surname<-sapply(titanic\$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
titanic\$Surname<-sub(' ','',titanic\$Surname)
```
```[1] "Braund"    "Cumings"   "Heikkinen" "Futrelle"  "Allen"     "Moran"

```
In [20]:
```%%R
#Very quickly let's get a quick rundown of the families
temp <- titanic[1:nrow(train),]
fams<-data.frame(table(temp\$Surname))
print(summary(fams\$Freq))
#There we we have it. Looks like a lot of the passengers don't share a Surname with each other since the Median is 1
#but there are families as well with upto 9 members (in the training set). So we were on the right track.
hist(fams\$Freq, ylim=c(0,50), col='darkcyan')

#The histogram tells us there are a lot of ones and about 50 odd families with sizes 2 and 3, the remaining few have
#large families. Perhaps there's a bigger question, how do we which of them are families? There could be multiple passengers
#with the same surname traveling different groups, which would negate our purpose.
```
```   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.000   1.000   1.000   1.336   1.000   9.000

```
In [21]:
```%%R
#To simplify this, let's use the variables we haven't explored so far, SibSp (No. of Siblings and Spouses) and Parch (No.
#Parents and Children). Summing these up (+1 for self) should gives us the family size of each passenger. Making an
#assumption that these attributes were entered correctly, we could then surmise that passengers with the same surname and
#family size belong to same families. Of course, there're still loopholes if we want to nitpick but I am stopping here.
titanic\$FamSize<-titanic\$SibSp + titanic\$Parch + 1

#Club FamilySize with Surname to create a new Family ID
titanic\$FamID<-paste(titanic\$Surname,as.character(titanic\$FamSize),sep='')

#We'll call all passengers with Family Size = 1 as traveling Solo.
titanic\$FamID[which(titanic\$FamSize==1)]<-'Solo'
titanic\$FamID<-as.factor(titanic\$FamID)

#Finally let's remove some columns we won't use for model building.
#Name and Ticket are unique for each passenger (we assume) and couldn't possible add any relevance to our prediction. Even
#if multiple passengers had the same name, it shouldn't really help us decide if one or more survived.
titanic\$Name<-NULL
titanic\$Ticket<-NULL

```
```  PassengerId Survived Pclass    Sex Age SibSp Parch    Fare Embarked Child
1           1        0      3   male  22     1     0  7.2500        S     0
2           2        1      1 female  38     1     0 71.2833        C     0
3           3        1      3 female  26     0     0  7.9250        S     0
4           4        1      1 female  35     1     0 53.1000        S     0
5           5        0      3   male  35     0     0  8.0500        S     0
6           6        0      3   male  22     0     0  8.4583        Q     0
FareGroup Title   Surname FamSize     FamID
1       <10    Mr    Braund       2   Braund2
2       40+   Mrs   Cumings       2  Cumings2
3       <10  Miss Heikkinen       1      Solo
4       40+   Mrs  Futrelle       2 Futrelle2
5       <10    Mr     Allen       1      Solo
6       <10    Mr     Moran       1      Solo

```
In [22]:
```%%R
#Great we're through. We'll now get back our Train and Test sets from Titanic.
train<-titanic[1:nrow(train),]
temp=nrow(train)+1
kaggletest<-titanic[temp:nrow(titanic),]

print(nrow(train))
print(nrow(kaggletest))
print(nrow(titanic))
```
```[1] 891
[1] 418
[1] 1309

```

#### Model Fitting and Evaluation

I want to use this opportunity and challenge to try out different models in R and see how they stack up against each other. We'll run through one at a time, fit the parameters and submit to Kaggle. We'll then wrap things up by comparing the different approaches. I think this is a great chance to learn tuning models in R.

First up, before we start training and running cross-validation, I would like to try simple, plain-old logistic regression. We'll not be training but simply fitting a model and checking how the different parameters we've created contribute to the predictions.

Before we get started we need to split the "train.csv" file we're holding in the train dataframe to train/test sets. The reason we do this is so we have a baseline to run our model against before the submission to Kaggle. It enables a good approximation on how well the model can generalize.

We'll be using the caret package for the rest of this approach. The caret package has several functions that attempt to streamline the model building and evaluation process. One of them is the createDataPartition function which can be used to create a stratified random sample of the data into training and test sets.

In [23]:
```%%R
#75/25 Train/Test Split
#install.packages('caret')
#Setting a random seed. We will be using this throughout to make sure our results are consistently comparable.
library(caret)
set.seed(35)
trainrows<-createDataPartition(train\$Survived, p = 0.8, list=FALSE)
train.set<-train[trainrows,]
test.set<-train[-trainrows,]

print(nrow(train.set))
print(nrow(test.set))

#Remember, from this point on Test does NOT refer to the test.csv file. I will call out explicitly when it does and it won't
#until the very end when we submit predictions to Kaggle for each model.
```
```Loading required package: lattice
[1] 714
[1] 177

```
##### Logistic Regression

Before we start training models with caret, I would like to first explore simple logistic regression through the glm() method. glm (Generalized Linear Models) is easy-to-use and lets us several types of linear models, logistic regression being one of them. Let's begin

In [24]:
```%%R
#To start with I am not including any of the features we manufactured. Let's see how the raw features perform.

Titan.logit.1 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Parch + Embarked + Fare,
data = train.set, family=binomial("logit"))
print(Titan.logit.1)
```
```
Call:  glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Parch +
Embarked + Fare, family = binomial("logit"), data = train.set)

Coefficients:
(Intercept)      Sexmale      Pclass2      Pclass3          Age        SibSp
4.034633    -2.640662    -1.101916    -2.263913    -0.038039    -0.263821
Parch    EmbarkedQ    EmbarkedS         Fare
-0.112988    -0.001405    -0.455287     0.002163

Degrees of Freedom: 713 Total (i.e. Null);  704 Residual
Null Deviance:	    950.9
Residual Deviance: 633.3 	AIC: 653.3

```

Couple of observations on the results above. The factors we're interested in are Deviance and Degrees of Freedom. The null observations are based on how well we can predict survival given a "null" model, which works only based on a constant, mean of means or a "grand mean". The null deviance is expected to be high. While the residual deviance tells us how much the inclusion of features has brought down the null deviance. So for instance, in our first run, the null deviance was 950.9 and the residual deviance was 633.3. So including the raw features (after the data munging process) brought down the deviance by ~327 points with a 713-704=9 change in degrees of freedom. If you're interested like I am, google to learn more about these topics. The coefficients are the parameters (theta) for each of the features and the Intercept is the theta0 term.

Let's run the extractor function, anova(), which gives us the result of analysis. I am using the chi-square or "goodness of fit" statistic. Lower the value the better.

In [25]:
```%%R
anova(Titan.logit.1, test="Chisq")
```
```Analysis of Deviance Table

Response: Survived

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                       713     950.86
Sex       1  206.206       712     744.66 < 2.2e-16 ***
Pclass    2   77.642       710     667.02 < 2.2e-16 ***
Age       1   18.755       709     648.26 1.486e-05 ***
SibSp     1    8.977       708     639.29  0.002734 **
Parch     1    0.640       707     638.65  0.423776
Embarked  2    4.625       705     634.02  0.098991 .
Fare      1    0.723       704     633.30  0.395187
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

```

Looking at the individual deviances, we see that Sex and Pclass accounted for the biggest reductions while Age and SibSp seem to be contributing somewhat. Embarked and Fare are on the lower end while the contribution of Parch is negligible. Let's now make some changes, include our new features and remove Fare and Parch.

In [26]:
```%%R
Titan.logit.2 <- glm(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title + FamSize,
data = train.set, family=binomial("logit"))
print(Titan.logit.2)
```
```
Call:  glm(formula = Survived ~ Sex + Pclass + Age + SibSp + Embarked +
FareGroup + FamID + Title + FamSize, family = binomial("logit"),
data = train.set)

Coefficients:
(Intercept)                    Sexmale
3.022e+01                 -2.932e+01
Pclass2                    Pclass3
-1.165e+00                 -1.093e+00
Age                      SibSp
-2.415e-02                  2.153e-01
EmbarkedQ                  EmbarkedS
2.933e-01                 -2.709e-01
FareGroup10-20             FareGroup20-40
5.610e-01                  1.247e+00
FareGroup40+              FamIDAbelson2
1.058e+00                 -1.144e-01
FamIDAks2              FamIDAllison4
4.504e+15                 -2.955e+01
2.559e+01                 -2.594e+00
FamIDAndrews2             FamIDAppleton3
2.591e+01                  2.754e+01
FamIDArnold-Franchi2              FamIDAsplund7
-2.780e+01                 -9.996e-01
FamIDBackstrom2            FamIDBackstrom4
-2.342e+01                  2.432e+01
FamIDBaclini4              FamIDBarbara2
2.514e+01                 -2.695e+01
FamIDBaxter2                FamIDBeane2
-6.829e-01                  3.796e+01
FamIDBecker4             FamIDBeckwith3
4.504e+15                  2.686e+01
FamIDBishop2               FamIDBoulos3
1.901e+01                 -2.693e+01
FamIDBourke3             FamIDBowerman2
-2.735e+01                  2.270e+01
FamIDBrown3                FamIDBryhl2
6.636e-01                 -4.504e+15
FamIDCaldwell3                FamIDCaram2
2.510e+01                 -3.145e+01
FamIDCardeza2               FamIDCarter2
2.719e+01                 -2.599e+01
FamIDCarter4            FamIDCavendish2
1.396e+01                 -2.490e+01
FamIDChambers2              FamIDChristy3
2.565e+01                  2.697e+01
FamIDChronopoulos2               FamIDClarke2
-2.292e+01                  2.775e+01
FamIDCollyer3              FamIDCompton3
1.118e+00                  2.399e+01
FamIDCoutts3                FamIDCribb2
2.256e+01                 -7.634e+01
FamIDCrosby3              FamIDCumings2
8.250e-01                  2.315e+01
FamIDDanbom3             FamIDDavidson2
-2.789e+01                 -4.445e+01
FamIDDavies3              FamIDDavison2
7.337e-01                  2.443e+01
FamIDDean4             FamIDdelCarlo2
7.449e-01                 -2.444e+01
FamIDdeMessemaeker2                 FamIDDick2
2.064e+01                  2.697e+01
FamIDDodge3               FamIDDoling2
2.502e+01                  4.261e+01
FamIDDouglas2           FamIDDuffGordon2
-2.482e+01                  2.711e+01
FamIDDurany More2                FamIDElias3
2.638e+01                 -2.680e+01
FamIDEustis2           FamIDFaunthorpe2
4.504e+15                  2.679e+01
FamIDFord5              FamIDFortune6
-1.731e+01                 -6.254e-01
FamIDFrauenthal2           FamIDFrauenthal3
2.072e+01                  2.958e+01
FamIDFrolicher3     FamIDFrolicher-Stehli3
2.450e+01                  2.806e+01
FamIDFutrelle2                 FamIDGale2
-6.513e-01                 -2.397e+01
FamIDGiles2           FamIDGoldenberg2
-2.348e+01                  2.839e+01
FamIDGoldsmith3              FamIDGoodwin8
2.487e+01                 -3.330e+01
FamIDGraham2           FamIDGreenfield2
2.441e+01                  2.620e+01
FamIDGustafsson3              FamIDHagland2
-4.187e+01                 -1.861e+01
FamIDHamalainen3               FamIDHansen2
2.591e+01                 -2.340e+01
FamIDHansen3               FamIDHarder2
-5.239e+01                  2.766e+01
FamIDHarper2               FamIDHarris2
2.961e+00                 -5.547e-01
FamIDHart3                 FamIDHays3
1.289e+00                  2.229e+01
FamIDHerman4              FamIDHickman3
2.614e+01                 -3.538e+01
FamIDHippach2             FamIDHirvonen2
5.357e+01                  2.910e+01
FamIDHocking4                 FamIDHold2
-2.375e+01                 -2.059e+01
FamIDHolverson2                 FamIDHoyt2
-2.258e+01                  1.834e+01
FamIDIlmakangas2            FamIDJacobsohn2
-2.598e+01                 -1.966e+01
FamIDJacobsohn4               FamIDJensen2
2.239e+01                 -2.301e+01
FamIDJohnson3             FamIDJohnston4
2.611e+01                 -2.985e+01
FamIDJussila2               FamIDKantor2
-2.704e+01                  1.565e-01
FamIDKenyon2              FamIDKiernan2
2.598e+01                 -1.207e+03
FamIDKimball2                 FamIDKink3
2.856e+01                 -2.513e+01
FamIDKink-Heilmann3               FamIDKlasen3
2.618e+01                 -2.429e+01
FamIDLahtinen3              FamIDLaroche4
-2.831e+01                  8.050e-01
FamIDLefebre5               FamIDLennon2
-2.767e+01                 -4.504e+15
FamIDLindell2            FamIDLindqvist2
-2.437e+01                  2.823e+01
FamIDLines2                 FamIDLobb2
2.471e+01                 -2.340e+01
2.383e+01                  2.414e+01
FamIDMallet3               FamIDMarvin2
7.140e-01                 -2.524e+01
FamIDMcCoy3              FamIDMcNamee2
3.043e+01                 -2.358e+01
FamIDMeyer2              FamIDMinahan2
-1.188e+00                  2.485e+01
FamIDMinahan3                 FamIDMoor2
-2.357e+01                  2.595e+01
FamIDMoran2             FamIDMoubarek3
3.095e-01                  1.393e+01
FamIDMurphy2                FamIDNakid3
2.723e+01                  2.852e+01
FamIDNasser2               FamIDNatsch2
-2.532e-01                 -2.146e+01
FamIDNavratil3               FamIDNewell2
2.643e+01                  2.878e+01
FamIDNewell3               FamIDNewsom3
-2.082e+01                  2.493e+01
FamIDNicholls3        FamIDNicola-Yarred2
-1.773e+01                  2.645e+01
FamIDO'Brien2                FamIDOlsen2
2.950e+01                 -2.216e+01
FamIDPalsson5               FamIDPanula6
-4.936e+01                 -2.862e+01
FamIDParrish2                FamIDPears2
3.472e+01                 -9.049e-01
-1.369e+00                  3.023e+01
FamIDPeter3            FamIDPetterson2
2.598e+01                 -2.278e+01
FamIDPotter2                FamIDQuick3
2.398e+01                  2.556e+01
FamIDRenouf4                 FamIDRice6
2.379e+01                 -2.202e+01
FamIDRichards3             FamIDRichards6
2.988e+01                  2.444e+01
FamIDRobert2               FamIDRobins2
2.393e+01                 -2.787e+01
FamIDRosblom3              FamIDRyerson5
-1.694e+01                  2.468e+01
FamIDSage11               FamIDSamaan3
-2.948e+01                 -2.481e+01
FamIDSandstrom3              FamIDShelley2
2.597e+01                  2.361e+01
FamIDSilven3               FamIDSilvey2
2.797e+01                 -4.461e-01
FamIDSkoog6                  FamIDSolo
-2.788e+01                  1.459e+00
FamIDSpencer2           FamIDStephenson2
2.281e+01                  2.347e+01
FamIDStrom2                FamIDStrom3
-2.698e+01                 -2.850e+01
FamIDTaussig3               FamIDTaylor2
2.470e+01                  2.732e+01
FamIDThayer3               FamIDThomas2
2.664e+01                  2.664e+01
FamIDThorneycroft2               FamIDTurpin2
6.012e-01                 -2.782e+01
FamIDvanBilliard3         FamIDVanderPlanke2
-2.289e+01                 -4.026e+02
FamIDVanderPlanke3              FamIDVanImpe3
-2.312e+05                 -2.825e+01
FamIDWarren2                FamIDWeisz2
2.370e+01                  2.415e+01
FamIDWells3                 FamIDWest4
2.545e+01                  1.068e+00
FamIDWhite2                 FamIDWick3
-2.411e+01                  2.541e+01
FamIDWidener3             FamIDWilliams2
-2.544e+01                 -2.437e+01
FamIDZabour2                  TitleMiss
-2.698e+01                 -2.917e+01
TitleMr                   TitleMrs
-2.822e+00                 -2.725e+01
TitleNoble                    FamSize
-3.994e+00                         NA

Degrees of Freedom: 713 Total (i.e. Null);  517 Residual
Null Deviance:	    950.9
Residual Deviance: 373.3 	AIC: 767.3

```
In [27]:
```%%R
anova(Titan.logit.2, test="Chisq")
```
```Analysis of Deviance Table

Response: Survived

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                         713     950.86
Sex         1  206.206       712     744.66 < 2.2e-16 ***
Pclass      2   77.642       710     667.02 < 2.2e-16 ***
Age         1   18.755       709     648.26 1.486e-05 ***
SibSp       1    8.977       708     639.29 0.0027343 **
Embarked    2    4.742       706     634.54 0.0933863 .
FareGroup   3    1.034       703     633.51 0.7930468
FamID     182  248.620       521     384.89 0.0007554 ***
Title       4   11.591       517     373.30 0.0206677 *
FamSize     0    0.000       517     373.30
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

```

Looks like FamID and to an extent FareGroup contribute well to the model. Although looking at the extraordinarily high deviance and df for FamID, I am suspicious that this might cause overfitting - meaning we've modeled in extreme based on the training set hence won't generalize very well on new examples. This can be addressed by resampling and hypertuning parameters based on crossvalidation. We essentially split the training set further into train and cv, then hypertune parameters against the cv set. This is repeated multiple times, each with a different sample of train/cv.

OK let's proceed to use the train method of the caret package to train a logistic regression model. We'll use one of the most common crossvalidation methods namely the 3x 10-fold CV. That is 10 folds of data to split train and cv repeated 3 times.

In [28]:
```%%R
#Define the 3x 10 fold cv control using the traincontrol method of caret.
tenfoldcv<-trainControl(method='repeatedcv', number=10, repeats=3)
```
In [283]:
```%%R
#Train a logistic regression classifier using the train method of the caret package. Everything is same as before except
#we use the train function and pass glm as the method.

#Install the doSNOW package to leverage multiple cores - parallelization
#install.packages(doSNOW)
library(doSNOW)

#Set 4 below to the number of cores you'd like to run in parallel. I have 4 and using 3 of them!
cl <- makeCluster(3, type="SOCK")
registerDoSNOW(cl)

#Note that I've also added options below to normalize the features (Feature Scaling)
set.seed(35)
logit.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune1

#May need to install a dependency for caret train
#install.packages('e1071', dependencies=TRUE)
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.7829812  0.5332698  0.05484398   0.1230111

```
In [284]:
```%%R
summary(logit.tune1)
```
```
Call:
NULL

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.28940  -0.43374  -0.00008   0.00010   2.42885

Coefficients: (45 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -6.178e-01  2.122e+02  -0.003   0.9977
Sexmale                     -1.065e+01  3.540e+03  -0.003   0.9976
Pclass2                     -4.784e-01  2.887e-01  -1.657   0.0975 .
Pclass3                     -5.451e-01  3.414e-01  -1.597   0.1103
Age                         -3.162e-01  1.743e-01  -1.814   0.0697 .
SibSp                        2.275e-01  6.706e-01   0.339   0.7344
EmbarkedQ                    8.080e-02  1.658e-01   0.487   0.6259
EmbarkedS                   -1.214e-01  1.804e-01  -0.673   0.5010
`FareGroup10-20`             2.253e-01  2.610e-01   0.863   0.3880
`FareGroup20-40`             5.179e-01  2.891e-01   1.792   0.0732 .
`FareGroup40+`               4.250e-01  2.883e-01   1.474   0.1404
FamIDAbelson2               -6.050e-03  1.922e-01  -0.031   0.9749
FamIDAhlin2                         NA         NA      NA       NA
FamIDAks2                    6.886e-01  4.025e+02   0.002   0.9986
FamIDAllison4               -1.184e+00  3.883e+02  -0.003   0.9976
`FamIDAndersen-Jensen2`      7.532e-01  4.025e+02   0.002   0.9985
FamIDAndrews2                7.125e-01  4.025e+02   0.002   0.9986
FamIDAngle2                         NA         NA      NA       NA
FamIDAppleton3               6.236e-01  4.025e+02   0.002   0.9988
`FamIDArnold-Franchi2`      -1.073e+00  3.092e+02  -0.003   0.9972
FamIDAsplund7               -6.470e-02  1.963e-01  -0.330   0.7417
FamIDAstor2                         NA         NA      NA       NA
FamIDBackstrom2             -6.092e-01  4.025e+02  -0.002   0.9988
FamIDBackstrom4              6.570e-01  4.025e+02   0.002   0.9987
FamIDBaclini4                9.648e-01  3.949e+02   0.002   0.9981
FamIDBarbara2               -1.079e+00  3.893e+02  -0.003   0.9978
FamIDBaxter2                -3.612e-02  1.847e-01  -0.195   0.8450
FamIDBeane2                  8.323e-01  4.025e+02   0.002   0.9983
FamIDBecker4                 6.907e-01  4.025e+02   0.002   0.9986
FamIDBeckwith3               1.070e+00  3.166e+02   0.003   0.9973
FamIDBishop2                 5.908e-01  4.025e+02   0.001   0.9988
FamIDBoulos3                -7.514e-01  4.025e+02  -0.002   0.9985
FamIDBourke3                -1.310e+00  3.150e+02  -0.004   0.9967
FamIDBowerman2               6.835e-01  4.025e+02   0.002   0.9986
FamIDBraund2                        NA         NA      NA       NA
FamIDBrown3                  3.510e-02  2.015e-01   0.174   0.8617
FamIDBryhl2                 -6.385e-01  4.025e+02  -0.002   0.9987
FamIDCaldwell3               9.659e-01  3.860e+02   0.003   0.9980
FamIDCaram2                 -8.114e-01  4.025e+02  -0.002   0.9984
FamIDCardeza2                7.973e-01  4.025e+02   0.002   0.9984
FamIDCarter2                -1.066e+00  2.916e+02  -0.004   0.9971
FamIDCarter4                 1.284e+00  3.196e+02   0.004   0.9968
FamIDCavendish2             -6.651e-01  4.025e+02  -0.002   0.9987
FamIDChaffee2                       NA         NA      NA       NA
FamIDChambers2               1.056e+00  3.148e+02   0.003   0.9973
FamIDChapman2                       NA         NA      NA       NA
FamIDChibnall2                      NA         NA      NA       NA
FamIDChristy3                7.147e-01  4.025e+02   0.002   0.9986
FamIDChronopoulos2          -6.247e-01  4.025e+02  -0.002   0.9988
FamIDClark2                         NA         NA      NA       NA
FamIDClarke2                 6.457e-01  4.025e+02   0.002   0.9987
FamIDCollyer3                7.236e-02  2.007e-01   0.361   0.7184
FamIDCompton3                6.806e-01  4.025e+02   0.002   0.9987
FamIDCornell3                       NA         NA      NA       NA
FamIDCoutts3                 7.235e-01  4.025e+02   0.002   0.9986
FamIDCribb2                 -5.903e-01  4.025e+02  -0.001   0.9988
FamIDCrosby3                 4.363e-02  1.757e-01   0.248   0.8038
FamIDCumings2                6.080e-01  4.025e+02   0.002   0.9988
FamIDDanbom3                -1.061e+00  3.096e+02  -0.003   0.9973
FamIDDavidson2              -6.696e-01  4.025e+02  -0.002   0.9987
FamIDDavidson4                      NA         NA      NA       NA
FamIDDavies3                 3.880e-02  1.637e-01   0.237   0.8126
FamIDDavison2                6.632e-01  4.025e+02   0.002   0.9987
FamIDDean4                   3.940e-02  1.691e-01   0.233   0.8157
FamIDdelCarlo2              -6.450e-01  4.025e+02  -0.002   0.9987
FamIDdeMessemaeker2          6.759e-01  4.025e+02   0.002   0.9987
FamIDDick2                   1.058e+00  3.063e+02   0.003   0.9972
FamIDDodge3                  6.729e-01  4.025e+02   0.002   0.9987
FamIDDoling2                 7.164e-01  4.025e+02   0.002   0.9986
FamIDDouglas2               -6.626e-01  4.025e+02  -0.002   0.9987
FamIDDouglas3                       NA         NA      NA       NA
FamIDDrew3                          NA         NA      NA       NA
FamIDDuffGordon2             1.068e+00  3.094e+02   0.003   0.9972
`FamIDDurany More2`          7.320e-01  4.025e+02   0.002   0.9985
FamIDDyker2                         NA         NA      NA       NA
FamIDEarnshaw2                      NA         NA      NA       NA
FamIDElias3                 -6.119e-01  4.025e+02  -0.002   0.9988
FamIDEustis2                 6.942e-01  4.025e+02   0.002   0.9986
FamIDFaunthorpe2             6.466e-01  4.025e+02   0.002   0.9987
FamIDFord5                  -1.315e+00  3.334e+02  -0.004   0.9969
FamIDFortune6               -4.048e-02  2.085e-01  -0.194   0.8460
FamIDFrauenthal2             6.037e-01  4.025e+02   0.001   0.9988
FamIDFrauenthal3             8.479e-01  4.025e+02   0.002   0.9983
FamIDFrolicher3              6.733e-01  4.025e+02   0.002   0.9987
`FamIDFrolicher-Stehli3`     8.109e-01  4.025e+02   0.002   0.9984
FamIDFutrelle2              -3.445e-02  1.937e-01  -0.178   0.8588
FamIDGale2                  -6.303e-01  4.025e+02  -0.002   0.9988
FamIDGibson2                        NA         NA      NA       NA
FamIDGiles2                 -6.164e-01  4.025e+02  -0.002   0.9988
FamIDGoldenberg2             1.065e+00  3.013e+02   0.004   0.9972
FamIDGoldsmith3              6.457e-01  4.025e+02   0.002   0.9987
FamIDGoodwin8               -1.580e+00  3.994e+02  -0.004   0.9968
FamIDGraham2                 6.443e-01  4.025e+02   0.002   0.9987
FamIDGreenfield2             7.856e-01  4.025e+02   0.002   0.9984
FamIDGustafsson3            -8.422e-01  4.018e+02  -0.002   0.9983
FamIDHagland2               -6.182e-01  4.025e+02  -0.002   0.9988
FamIDHakkarainen2                   NA         NA      NA       NA
FamIDHamalainen3             9.970e-01  3.930e+02   0.003   0.9980
FamIDHansen2                -5.936e-01  4.025e+02  -0.001   0.9988
FamIDHansen3                -6.091e-01  4.025e+02  -0.002   0.9988
FamIDHarder2                 7.793e-01  4.025e+02   0.002   0.9985
FamIDHarper2                 2.212e-01  2.270e-01   0.974   0.3300
FamIDHarris2                -2.934e-02  1.980e-01  -0.148   0.8822
FamIDHart3                   8.346e-02  2.031e-01   0.411   0.6811
FamIDHays3                   6.308e-01  4.025e+02   0.002   0.9987
FamIDHerman4                 7.208e-01  4.025e+02   0.002   0.9986
FamIDHickman3               -9.069e-01  4.021e+02  -0.002   0.9982
FamIDHiltunen3                      NA         NA      NA       NA
FamIDHippach2                9.173e-01  3.896e+02   0.002   0.9981
FamIDHirvonen2               7.249e-01  4.025e+02   0.002   0.9986
FamIDHirvonen3                      NA         NA      NA       NA
FamIDHocking4               -6.227e-01  4.025e+02  -0.002   0.9988
FamIDHocking5                       NA         NA      NA       NA
FamIDHogeboom2                      NA         NA      NA       NA
FamIDHold2                  -6.213e-01  4.025e+02  -0.002   0.9988
FamIDHolverson2             -6.597e-01  4.025e+02  -0.002   0.9987
FamIDHoward2                        NA         NA      NA       NA
FamIDHoyt2                   6.154e-01  4.025e+02   0.002   0.9988
FamIDIlmakangas2            -7.058e-01  4.025e+02  -0.002   0.9986
FamIDJacobsohn2             -6.231e-01  4.025e+02  -0.002   0.9988
FamIDJacobsohn4              6.340e-01  4.025e+02   0.002   0.9987
FamIDJefferys3                      NA         NA      NA       NA
FamIDJensen2                -6.017e-01  4.025e+02  -0.001   0.9988
FamIDJohnson3                1.017e+00  3.989e+02   0.003   0.9980
FamIDJohnston4              -7.552e-01  4.025e+02  -0.002   0.9985
FamIDJussila2               -1.003e+00  4.022e+02  -0.002   0.9980
FamIDKantor2                 8.279e-03  1.951e-01   0.042   0.9661
FamIDKarun2                         NA         NA      NA       NA
FamIDKenyon2                 6.037e-01  4.025e+02   0.001   0.9988
FamIDKhalil2                        NA         NA      NA       NA
FamIDKiernan2               -6.183e-01  4.025e+02  -0.002   0.9988
FamIDKimball2                8.048e-01  4.025e+02   0.002   0.9984
FamIDKink3                  -6.017e-01  4.025e+02  -0.001   0.9988
`FamIDKink-Heilmann3`        7.011e-01  4.025e+02   0.002   0.9986
`FamIDKink-Heilmann5`               NA         NA      NA       NA
FamIDKlasen3                -6.008e-01  4.025e+02  -0.001   0.9988
FamIDLahtinen3              -8.206e-01  4.025e+02  -0.002   0.9984
FamIDLaroche4                5.211e-02  2.064e-01   0.252   0.8007
FamIDLefebre5               -1.330e+00  3.994e+02  -0.003   0.9973
FamIDLennon2                -6.393e-01  4.025e+02  -0.002   0.9987
FamIDLindell2               -6.056e-01  4.025e+02  -0.002   0.9988
FamIDLindqvist2              8.655e-01  4.025e+02   0.002   0.9983
FamIDLines2                  6.710e-01  4.025e+02   0.002   0.9987
FamIDLobb2                  -6.110e-01  4.025e+02  -0.002   0.9988
FamIDLouch2                  6.583e-01  4.025e+02   0.002   0.9987
FamIDMallet3                 3.776e-02  1.753e-01   0.215   0.8294
FamIDMarvin2                -6.804e-01  4.025e+02  -0.002   0.9987
FamIDMcCoy3                  1.063e+00  3.296e+02   0.003   0.9974
FamIDMcNamee2               -6.164e-01  4.025e+02  -0.002   0.9988
FamIDMellinger2                     NA         NA      NA       NA
FamIDMeyer2                 -6.282e-02  1.958e-01  -0.321   0.7484
FamIDMinahan2                6.642e-01  4.025e+02   0.002   0.9987
FamIDMinahan3               -6.432e-01  4.025e+02  -0.002   0.9987
FamIDMock2                          NA         NA      NA       NA
FamIDMoor2                   1.008e+00  3.892e+02   0.003   0.9979
FamIDMoran2                  1.637e-02  1.617e-01   0.101   0.9193
FamIDMoubarek3               1.032e+00  4.022e+02   0.003   0.9980
FamIDMurphy2                 7.138e-01  4.025e+02   0.002   0.9986
FamIDNakid3                  1.121e+00  3.216e+02   0.003   0.9972
FamIDNasser2                -1.339e-02  2.015e-01  -0.066   0.9470
FamIDNatsch2                -6.733e-01  4.025e+02  -0.002   0.9987
FamIDNavratil3               7.005e-01  4.025e+02   0.002   0.9986
FamIDNewell2                 9.467e-01  4.019e+02   0.002   0.9981
FamIDNewell3                -6.473e-01  4.025e+02  -0.002   0.9987
FamIDNewsom3                 6.737e-01  4.025e+02   0.002   0.9987
FamIDNicholls3              -6.439e-01  4.025e+02  -0.002   0.9987
`FamIDNicola-Yarred2`        1.016e+00  3.995e+02   0.003   0.9980
`FamIDO'Brien2`              6.421e-01  4.025e+02   0.002   0.9987
FamIDOlsen2                 -5.711e-01  4.025e+02  -0.001   0.9989
FamIDOstby2                         NA         NA      NA       NA
FamIDPalsson5               -1.583e+00  3.966e+02  -0.004   0.9968
FamIDPanula6                -1.702e+00  3.394e+02  -0.005   0.9960
FamIDParrish2                6.736e-01  4.025e+02   0.002   0.9987
FamIDPeacock3                       NA         NA      NA       NA
FamIDPears2                 -4.786e-02  1.964e-01  -0.244   0.8074
`FamIDPenascoy Castellana2` -7.240e-02  1.934e-01  -0.374   0.7082
FamIDPeter3                  9.555e-01  3.847e+02   0.002   0.9980
FamIDPetterson2             -5.945e-01  4.025e+02  -0.001   0.9988
FamIDPhillips2                      NA         NA      NA       NA
FamIDPotter2                 6.323e-01  4.025e+02   0.002   0.9987
FamIDQuick3                  6.939e-01  4.025e+02   0.002   0.9986
FamIDRenouf2                        NA         NA      NA       NA
FamIDRenouf4                 6.314e-01  4.025e+02   0.002   0.9987
FamIDRice6                  -1.813e+00  4.006e+02  -0.005   0.9964
FamIDRichards3               7.262e-01  4.025e+02   0.002   0.9986
FamIDRichards6               6.597e-01  4.025e+02   0.002   0.9987
FamIDRobert2                 6.307e-01  4.025e+02   0.002   0.9987
FamIDRobins2                -7.787e-01  4.025e+02  -0.002   0.9985
FamIDRosblom3               -1.074e+00  3.243e+02  -0.003   0.9974
FamIDRothschild2                    NA         NA      NA       NA
FamIDRyerson5                6.536e-01  4.025e+02   0.002   0.9987
FamIDSage11                 -1.755e+00  3.590e+02  -0.005   0.9961
FamIDSamaan3                -6.621e-01  4.025e+02  -0.002   0.9987
FamIDSandstrom3              9.897e-01  3.910e+02   0.003   0.9980
FamIDSchabert2                      NA         NA      NA       NA
FamIDShelley2                6.510e-01  4.025e+02   0.002   0.9987
FamIDSilven3                 7.421e-01  4.025e+02   0.002   0.9985
FamIDSilvey2                -2.359e-02  1.986e-01  -0.119   0.9055
FamIDSkoog6                 -1.722e+00  3.458e+02  -0.005   0.9960
FamIDSmith2                         NA         NA      NA       NA
FamIDSnyder2                        NA         NA      NA       NA
FamIDSolo                    7.131e-01  1.282e+00   0.556   0.5781
FamIDSpedden3                       NA         NA      NA       NA
FamIDSpencer2                5.935e-01  4.025e+02   0.001   0.9988
FamIDStengel2                       NA         NA      NA       NA
FamIDStephenson2             6.207e-01  4.025e+02   0.002   0.9988
FamIDStraus2                        NA         NA      NA       NA
FamIDStrom2                 -7.396e-01  4.025e+02  -0.002   0.9985
FamIDStrom3                 -7.949e-01  4.025e+02  -0.002   0.9984
FamIDTaussig3                9.287e-01  3.840e+02   0.002   0.9981
FamIDTaylor2                 1.078e+00  3.017e+02   0.004   0.9971
FamIDThayer3                 1.042e+00  3.205e+02   0.003   0.9974
FamIDThomas2                 7.401e-01  4.025e+02   0.002   0.9985
FamIDThomas3                        NA         NA      NA       NA
FamIDThorneycroft2           3.180e-02  1.904e-01   0.167   0.8674
FamIDTouma3                         NA         NA      NA       NA
FamIDTurpin2                -1.095e+00  3.113e+02  -0.004   0.9972
FamIDvanBilliard3           -5.934e-01  4.025e+02  -0.001   0.9988
FamIDVanderPlanke2          -7.931e-01  4.025e+02  -0.002   0.9984
FamIDVanderPlanke3          -9.929e-01  3.306e+02  -0.003   0.9976
FamIDVanderPlanke4                  NA         NA      NA       NA
FamIDVanImpe3               -1.126e+00  3.838e+02  -0.003   0.9977
FamIDWare2                          NA         NA      NA       NA
FamIDWarren2                 6.279e-01  4.025e+02   0.002   0.9988
FamIDWeisz2                  6.466e-01  4.025e+02   0.002   0.9987
FamIDWells3                  6.957e-01  4.025e+02   0.002   0.9986
FamIDWest4                   6.913e-02  2.036e-01   0.340   0.7341
FamIDWhite2                 -6.408e-01  4.025e+02  -0.002   0.9987
FamIDWick3                   6.916e-01  4.025e+02   0.002   0.9986
FamIDWidener3               -6.753e-01  4.025e+02  -0.002   0.9987
FamIDWiklund2                       NA         NA      NA       NA
FamIDWilkes2                        NA         NA      NA       NA
FamIDWilliams2              -6.536e-01  4.025e+02  -0.002   0.9987
FamIDYasbeck2                       NA         NA      NA       NA
FamIDZabour2                -7.397e-01  4.025e+02  -0.002   0.9985
TitleMiss                   -8.997e+00  3.011e+03  -0.003   0.9976
TitleMr                     -1.393e+00  7.860e-01  -1.772   0.0764 .
TitleMrs                    -7.105e+00  2.603e+03  -0.003   0.9978
TitleNoble                  -5.542e-01  2.803e-01  -1.977   0.0480 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 373.30  on 517  degrees of freedom
AIC: 767.3

Number of Fisher Scoring iterations: 18

```

Okay looks like we're doing (un)reasonably well. Let's try a couple of interesting ideas. Class Compression refers to collapsing some levels on a categorical variable. In layman terms, making a factor two-level. So for instance, we have Embarked, most of which has the value 'S' as we saw earlier. We can use the I() function when training to shrink this to S or otherwise. Let's do that.

In [285]:
```%%R
#Let's set class compression on Embarked to 'S' or otherwise.
set.seed(35)

logit.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + I(Embarked=='S') + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune2
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa    Accuracy SD  Kappa SD
0.7820488  0.53063  0.05507752   0.1240941

```
In [286]:
```%%R
summary(logit.tune2)
```
```
Call:
NULL

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.34231  -0.43582  -0.00008   0.00010   2.42395

Coefficients: (45 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -6.156e-01  2.122e+02  -0.003   0.9977
Sexmale                     -1.065e+01  3.548e+03  -0.003   0.9976
Pclass2                     -4.596e-01  2.869e-01  -1.602   0.1092
Pclass3                     -5.264e-01  3.407e-01  -1.545   0.1223
Age                         -3.148e-01  1.749e-01  -1.800   0.0718 .
SibSp                        2.230e-01  6.716e-01   0.332   0.7398
`I(Embarked == "S")TRUE`    -1.728e-01  1.444e-01  -1.197   0.2315
`FareGroup10-20`             2.101e-01  2.593e-01   0.810   0.4177
`FareGroup20-40`             5.142e-01  2.900e-01   1.773   0.0762 .
`FareGroup40+`               4.068e-01  2.876e-01   1.415   0.1572
FamIDAbelson2               -1.258e-02  1.918e-01  -0.066   0.9477
FamIDAhlin2                         NA         NA      NA       NA
FamIDAks2                    6.882e-01  4.025e+02   0.002   0.9986
FamIDAllison4               -1.181e+00  3.885e+02  -0.003   0.9976
`FamIDAndersen-Jensen2`      7.511e-01  4.025e+02   0.002   0.9985
FamIDAndrews2                7.132e-01  4.025e+02   0.002   0.9986
FamIDAngle2                         NA         NA      NA       NA
FamIDAppleton3               6.264e-01  4.025e+02   0.002   0.9988
`FamIDArnold-Franchi2`      -1.072e+00  3.092e+02  -0.003   0.9972
FamIDAsplund7               -6.467e-02  1.965e-01  -0.329   0.7421
FamIDAstor2                         NA         NA      NA       NA
FamIDBackstrom2             -6.081e-01  4.025e+02  -0.002   0.9988
FamIDBackstrom4              6.584e-01  4.025e+02   0.002   0.9987
FamIDBaclini4                9.586e-01  3.950e+02   0.002   0.9981
FamIDBarbara2               -1.085e+00  3.895e+02  -0.003   0.9978
FamIDBaxter2                -3.857e-02  1.849e-01  -0.209   0.8347
FamIDBeane2                  8.320e-01  4.025e+02   0.002   0.9984
FamIDBecker4                 6.901e-01  4.025e+02   0.002   0.9986
FamIDBeckwith3               1.073e+00  3.164e+02   0.003   0.9973
FamIDBishop2                 5.893e-01  4.025e+02   0.001   0.9988
FamIDBoulos3                -7.564e-01  4.025e+02  -0.002   0.9985
FamIDBourke3                -1.297e+00  3.146e+02  -0.004   0.9967
FamIDBowerman2               6.842e-01  4.025e+02   0.002   0.9986
FamIDBraund2                        NA         NA      NA       NA
FamIDBrown3                  3.452e-02  2.015e-01   0.171   0.8640
FamIDBryhl2                 -6.388e-01  4.025e+02  -0.002   0.9987
FamIDCaldwell3               9.648e-01  3.862e+02   0.002   0.9980
FamIDCaram2                 -8.146e-01  4.025e+02  -0.002   0.9984
FamIDCardeza2                7.956e-01  4.025e+02   0.002   0.9984
FamIDCarter2                -1.067e+00  2.919e+02  -0.004   0.9971
FamIDCarter4                 1.289e+00  3.194e+02   0.004   0.9968
FamIDCavendish2             -6.623e-01  4.025e+02  -0.002   0.9987
FamIDChaffee2                       NA         NA      NA       NA
FamIDChambers2               1.060e+00  3.146e+02   0.003   0.9973
FamIDChapman2                       NA         NA      NA       NA
FamIDChibnall2                      NA         NA      NA       NA
FamIDChristy3                7.125e-01  4.025e+02   0.002   0.9986
FamIDChronopoulos2          -6.279e-01  4.025e+02  -0.002   0.9988
FamIDClark2                         NA         NA      NA       NA
FamIDClarke2                 6.453e-01  4.025e+02   0.002   0.9987
FamIDCollyer3                7.044e-02  2.012e-01   0.350   0.7262
FamIDCompton3                6.772e-01  4.025e+02   0.002   0.9987
FamIDCornell3                       NA         NA      NA       NA
FamIDCoutts3                 7.242e-01  4.025e+02   0.002   0.9986
FamIDCribb2                 -5.894e-01  4.025e+02  -0.001   0.9988
FamIDCrosby3                 4.598e-02  1.763e-01   0.261   0.7943
FamIDCumings2                6.064e-01  4.025e+02   0.002   0.9988
FamIDDanbom3                -1.059e+00  3.096e+02  -0.003   0.9973
FamIDDavidson2              -6.668e-01  4.025e+02  -0.002   0.9987
FamIDDavidson4                      NA         NA      NA       NA
FamIDDavies3                 3.841e-02  1.639e-01   0.234   0.8148
FamIDDavison2                6.643e-01  4.025e+02   0.002   0.9987
FamIDDean4                   3.913e-02  1.693e-01   0.231   0.8172
FamIDdelCarlo2              -6.496e-01  4.025e+02  -0.002   0.9987
FamIDdeMessemaeker2          6.769e-01  4.025e+02   0.002   0.9987
FamIDDick2                   1.062e+00  3.064e+02   0.003   0.9972
FamIDDodge3                  6.751e-01  4.025e+02   0.002   0.9987
FamIDDoling2                 7.141e-01  4.025e+02   0.002   0.9986
FamIDDouglas2               -6.642e-01  4.025e+02  -0.002   0.9987
FamIDDouglas3                       NA         NA      NA       NA
FamIDDrew3                          NA         NA      NA       NA
FamIDDuffGordon2             1.065e+00  3.082e+02   0.003   0.9972
`FamIDDurany More2`          7.266e-01  4.025e+02   0.002   0.9986
FamIDDyker2                         NA         NA      NA       NA
FamIDEarnshaw2                      NA         NA      NA       NA
FamIDElias3                 -6.165e-01  4.025e+02  -0.002   0.9988
FamIDEustis2                 6.907e-01  4.025e+02   0.002   0.9986
FamIDFaunthorpe2             6.462e-01  4.025e+02   0.002   0.9987
FamIDFord5                  -1.315e+00  3.326e+02  -0.004   0.9968
FamIDFortune6               -3.674e-02  2.089e-01  -0.176   0.8604
FamIDFrauenthal2             6.064e-01  4.025e+02   0.002   0.9988
FamIDFrauenthal3             8.513e-01  4.025e+02   0.002   0.9983
FamIDFrolicher3              6.698e-01  4.025e+02   0.002   0.9987
`FamIDFrolicher-Stehli3`     8.093e-01  4.025e+02   0.002   0.9984
FamIDFutrelle2              -3.060e-02  1.936e-01  -0.158   0.8744
FamIDGale2                  -6.307e-01  4.025e+02  -0.002   0.9987
FamIDGibson2                        NA         NA      NA       NA
FamIDGiles2                 -6.156e-01  4.025e+02  -0.002   0.9988
FamIDGoldenberg2             1.063e+00  3.015e+02   0.004   0.9972
FamIDGoldsmith3              6.456e-01  4.025e+02   0.002   0.9987
FamIDGoodwin8               -1.578e+00  3.993e+02  -0.004   0.9968
FamIDGraham2                 6.467e-01  4.025e+02   0.002   0.9987
FamIDGreenfield2             7.839e-01  4.025e+02   0.002   0.9984
FamIDGustafsson3            -8.424e-01  4.018e+02  -0.002   0.9983
FamIDHagland2               -6.171e-01  4.025e+02  -0.002   0.9988
FamIDHakkarainen2                   NA         NA      NA       NA
FamIDHamalainen3             9.976e-01  3.930e+02   0.003   0.9980
FamIDHansen2                -5.939e-01  4.025e+02  -0.001   0.9988
FamIDHansen3                -6.079e-01  4.025e+02  -0.002   0.9988
FamIDHarder2                 7.778e-01  4.025e+02   0.002   0.9985
FamIDHarper2                 2.196e-01  2.276e-01   0.965   0.3347
FamIDHarris2                -2.551e-02  1.979e-01  -0.129   0.8974
FamIDHart3                   8.158e-02  2.036e-01   0.401   0.6886
FamIDHays3                   6.334e-01  4.025e+02   0.002   0.9987
FamIDHerman4                 7.200e-01  4.025e+02   0.002   0.9986
FamIDHickman3               -9.052e-01  4.021e+02  -0.002   0.9982
FamIDHiltunen3                      NA         NA      NA       NA
FamIDHippach2                9.129e-01  3.897e+02   0.002   0.9981
FamIDHirvonen2               7.241e-01  4.025e+02   0.002   0.9986
FamIDHirvonen3                      NA         NA      NA       NA
FamIDHocking4               -6.217e-01  4.025e+02  -0.002   0.9988
FamIDHocking5                       NA         NA      NA       NA
FamIDHogeboom2                      NA         NA      NA       NA
FamIDHold2                  -6.217e-01  4.025e+02  -0.002   0.9988
FamIDHolverson2             -6.569e-01  4.025e+02  -0.002   0.9987
FamIDHoward2                        NA         NA      NA       NA
FamIDHoyt2                   6.181e-01  4.025e+02   0.002   0.9988
FamIDIlmakangas2            -7.080e-01  4.025e+02  -0.002   0.9986
FamIDJacobsohn2             -6.235e-01  4.025e+02  -0.002   0.9988
FamIDJacobsohn4              6.338e-01  4.025e+02   0.002   0.9987
FamIDJefferys3                      NA         NA      NA       NA
FamIDJensen2                -6.020e-01  4.025e+02  -0.001   0.9988
FamIDJohnson3                1.017e+00  3.987e+02   0.003   0.9980
FamIDJohnston4              -7.571e-01  4.025e+02  -0.002   0.9985
FamIDJussila2               -1.006e+00  4.022e+02  -0.003   0.9980
FamIDKantor2                 7.818e-03  1.951e-01   0.040   0.9680
FamIDKarun2                         NA         NA      NA       NA
FamIDKenyon2                 6.064e-01  4.025e+02   0.002   0.9988
FamIDKhalil2                        NA         NA      NA       NA
FamIDKiernan2               -6.120e-01  4.025e+02  -0.002   0.9988
FamIDKimball2                8.075e-01  4.025e+02   0.002   0.9984
FamIDKink3                  -6.018e-01  4.025e+02  -0.001   0.9988
`FamIDKink-Heilmann3`        6.991e-01  4.025e+02   0.002   0.9986
`FamIDKink-Heilmann5`               NA         NA      NA       NA
FamIDKlasen3                -6.011e-01  4.025e+02  -0.001   0.9988
FamIDLahtinen3              -8.210e-01  4.025e+02  -0.002   0.9984
FamIDLaroche4                4.529e-02  2.065e-01   0.219   0.8264
FamIDLefebre5               -1.332e+00  3.992e+02  -0.003   0.9973
FamIDLennon2                -6.315e-01  4.025e+02  -0.002   0.9987
FamIDLindell2               -6.045e-01  4.025e+02  -0.002   0.9988
FamIDLindqvist2              8.652e-01  4.025e+02   0.002   0.9983
FamIDLines2                  6.704e-01  4.025e+02   0.002   0.9987
FamIDLobb2                  -6.099e-01  4.025e+02  -0.002   0.9988
FamIDLouch2                  6.579e-01  4.025e+02   0.002   0.9987
FamIDMallet3                 3.085e-02  1.749e-01   0.176   0.8600
FamIDMarvin2                -6.776e-01  4.025e+02  -0.002   0.9987
FamIDMcCoy3                  1.072e+00  3.278e+02   0.003   0.9974
FamIDMcNamee2               -6.153e-01  4.025e+02  -0.002   0.9988
FamIDMellinger2                     NA         NA      NA       NA
FamIDMeyer2                 -6.498e-02  1.959e-01  -0.332   0.7401
FamIDMinahan2                6.718e-01  4.025e+02   0.002   0.9987
FamIDMinahan3               -6.330e-01  4.025e+02  -0.002   0.9987
FamIDMock2                          NA         NA      NA       NA
FamIDMoor2                   1.008e+00  3.893e+02   0.003   0.9979
FamIDMoran2                  2.453e-02  1.613e-01   0.152   0.8791
FamIDMoubarek3               1.027e+00  4.022e+02   0.003   0.9980
FamIDMurphy2                 7.197e-01  4.025e+02   0.002   0.9986
FamIDNakid3                  1.116e+00  3.201e+02   0.003   0.9972
FamIDNasser2                -1.989e-02  2.010e-01  -0.099   0.9212
FamIDNatsch2                -6.764e-01  4.025e+02  -0.002   0.9987
FamIDNavratil3               6.998e-01  4.025e+02   0.002   0.9986
FamIDNewell2                 9.419e-01  4.019e+02   0.002   0.9981
FamIDNewell3                -6.491e-01  4.025e+02  -0.002   0.9987
FamIDNewsom3                 6.731e-01  4.025e+02   0.002   0.9987
FamIDNicholls3              -6.442e-01  4.025e+02  -0.002   0.9987
`FamIDNicola-Yarred2`        1.010e+00  3.993e+02   0.003   0.9980
`FamIDO'Brien2`              6.498e-01  4.025e+02   0.002   0.9987
FamIDOlsen2                 -5.716e-01  4.025e+02  -0.001   0.9989
FamIDOstby2                         NA         NA      NA       NA
FamIDPalsson5               -1.585e+00  3.966e+02  -0.004   0.9968
FamIDPanula6                -1.702e+00  3.386e+02  -0.005   0.9960
FamIDParrish2                6.730e-01  4.025e+02   0.002   0.9987
FamIDPeacock3                       NA         NA      NA       NA
FamIDPears2                 -4.395e-02  1.963e-01  -0.224   0.8228
`FamIDPenascoy Castellana2` -7.452e-02  1.935e-01  -0.385   0.7001
FamIDPeter3                  9.473e-01  3.850e+02   0.002   0.9980
FamIDPetterson2             -5.948e-01  4.025e+02  -0.001   0.9988
FamIDPhillips2                      NA         NA      NA       NA
FamIDPotter2                 6.305e-01  4.025e+02   0.002   0.9988
FamIDQuick3                  6.918e-01  4.025e+02   0.002   0.9986
FamIDRenouf2                        NA         NA      NA       NA
FamIDRenouf4                 6.313e-01  4.025e+02   0.002   0.9987
FamIDRice6                  -1.799e+00  4.005e+02  -0.004   0.9964
FamIDRichards3               7.265e-01  4.025e+02   0.002   0.9986
FamIDRichards6               6.606e-01  4.025e+02   0.002   0.9987
FamIDRobert2                 6.332e-01  4.025e+02   0.002   0.9987
FamIDRobins2                -7.777e-01  4.025e+02  -0.002   0.9985
FamIDRosblom3               -1.075e+00  3.238e+02  -0.003   0.9974
FamIDRothschild2                    NA         NA      NA       NA
FamIDRyerson5                6.504e-01  4.025e+02   0.002   0.9987
FamIDSage11                 -1.752e+00  3.578e+02  -0.005   0.9961
FamIDSamaan3                -6.662e-01  4.025e+02  -0.002   0.9987
FamIDSandstrom3              9.893e-01  3.912e+02   0.003   0.9980
FamIDSchabert2                      NA         NA      NA       NA
FamIDShelley2                6.505e-01  4.025e+02   0.002   0.9987
FamIDSilven3                 7.408e-01  4.025e+02   0.002   0.9985
FamIDSilvey2                -1.979e-02  1.986e-01  -0.100   0.9206
FamIDSkoog6                 -1.723e+00  3.454e+02  -0.005   0.9960
FamIDSmith2                         NA         NA      NA       NA
FamIDSnyder2                        NA         NA      NA       NA
FamIDSolo                    7.121e-01  1.284e+00   0.555   0.5791
FamIDSpedden3                       NA         NA      NA       NA
FamIDSpencer2                5.920e-01  4.025e+02   0.001   0.9988
FamIDStengel2                       NA         NA      NA       NA
FamIDStephenson2             6.190e-01  4.025e+02   0.002   0.9988
FamIDStraus2                        NA         NA      NA       NA
FamIDStrom2                 -7.404e-01  4.025e+02  -0.002   0.9985
FamIDStrom3                 -7.939e-01  4.025e+02  -0.002   0.9984
FamIDTaussig3                9.304e-01  3.843e+02   0.002   0.9981
FamIDTaylor2                 1.082e+00  3.019e+02   0.004   0.9971
FamIDThayer3                 1.040e+00  3.203e+02   0.003   0.9974
FamIDThomas2                 7.349e-01  4.025e+02   0.002   0.9985
FamIDThomas3                        NA         NA      NA       NA
FamIDThorneycroft2           3.334e-02  1.905e-01   0.175   0.8611
FamIDTouma3                         NA         NA      NA       NA
FamIDTurpin2                -1.096e+00  3.112e+02  -0.004   0.9972
FamIDvanBilliard3           -5.926e-01  4.025e+02  -0.001   0.9988
FamIDVanderPlanke2          -7.921e-01  4.025e+02  -0.002   0.9984
FamIDVanderPlanke3          -9.930e-01  3.288e+02  -0.003   0.9976
FamIDVanderPlanke4                  NA         NA      NA       NA
FamIDVanImpe3               -1.127e+00  3.840e+02  -0.003   0.9977
FamIDWare2                          NA         NA      NA       NA
FamIDWarren2                 6.262e-01  4.025e+02   0.002   0.9988
FamIDWeisz2                  6.462e-01  4.025e+02   0.002   0.9987
FamIDWells3                  6.936e-01  4.025e+02   0.002   0.9986
FamIDWest4                   6.739e-02  2.041e-01   0.330   0.7412
FamIDWhite2                 -6.383e-01  4.025e+02  -0.002   0.9987
FamIDWick3                   6.923e-01  4.025e+02   0.002   0.9986
FamIDWidener3               -6.770e-01  4.025e+02  -0.002   0.9987
FamIDWiklund2                       NA         NA      NA       NA
FamIDWilkes2                        NA         NA      NA       NA
FamIDWilliams2              -6.554e-01  4.025e+02  -0.002   0.9987
FamIDYasbeck2                       NA         NA      NA       NA
FamIDZabour2                -7.448e-01  4.025e+02  -0.002   0.9985
TitleMiss                   -8.980e+00  3.017e+03  -0.003   0.9976
TitleMr                     -1.399e+00  7.872e-01  -1.778   0.0755 .
TitleMrs                    -7.107e+00  2.608e+03  -0.003   0.9978
TitleNoble                  -5.585e-01  2.807e-01  -1.989   0.0467 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 373.54  on 518  degrees of freedom
AIC: 765.54

Number of Fisher Scoring iterations: 18

```

So that didn't really help. Let's try one last trick, Interaction. Let's work in an interaction effect between passenger class and sex, as passenger class showed a much bigger difference in survival rate amongst the women compared to the men (i.e. Higher class women were much more likely to survive than lower class women, whereas first class Men were more likely to survive than 2nd or 3rd class men, but not by the same margin as amongst the women). We saw this during our initial visualizations. Besides Pclass and Sex have been the biggest determining factors so far.

In [287]:
```%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35)

logit.tune3<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + FamID + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune3
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.7904864  0.5472096  0.05401862   0.1249195

```
In [288]:
```%%R
summary(logit.tune3)
```
```
Call:
NULL

Deviance Residuals:
Min        1Q    Median        3Q       Max
-2.85112  -0.46296  -0.00009   0.00009   2.54282

Coefficients: (45 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -5.350e-01  2.135e+02  -0.003   0.9980
Sexmale                     -1.099e+01  3.407e+03  -0.003   0.9974
Pclass2                     -7.112e-01  6.345e-01  -1.121   0.2623
Pclass3                     -1.487e+00  7.122e-01  -2.088   0.0368 *
Age                         -3.107e-01  1.799e-01  -1.727   0.0842 .
SibSp                        2.551e-02  6.888e-01   0.037   0.9705
EmbarkedQ                    1.342e-01  1.615e-01   0.831   0.4058
EmbarkedS                   -1.049e-01  1.818e-01  -0.577   0.5640
`FareGroup10-20`             1.840e-01  2.615e-01   0.704   0.4816
`FareGroup20-40`             5.310e-01  2.953e-01   1.798   0.0721 .
`FareGroup40+`               3.425e-01  3.013e-01   1.137   0.2556
FamIDAbelson2               -2.723e-02  1.935e-01  -0.141   0.8881
FamIDAhlin2                         NA         NA      NA       NA
FamIDAks2                    7.047e-01  4.025e+02   0.002   0.9986
FamIDAllison4               -1.250e+00  3.928e+02  -0.003   0.9975
`FamIDAndersen-Jensen2`      7.543e-01  4.025e+02   0.002   0.9985
FamIDAndrews2                6.498e-01  4.025e+02   0.002   0.9987
FamIDAngle2                         NA         NA      NA       NA
FamIDAppleton3               5.904e-01  4.025e+02   0.001   0.9988
`FamIDArnold-Franchi2`      -1.044e+00  3.351e+02  -0.003   0.9975
FamIDAsplund7               -3.992e-02  1.723e-01  -0.232   0.8168
FamIDAstor2                         NA         NA      NA       NA
FamIDBackstrom2             -6.261e-01  4.025e+02  -0.002   0.9988
FamIDBackstrom4              6.981e-01  4.025e+02   0.002   0.9986
FamIDBaclini4                9.878e-01  3.963e+02   0.002   0.9980
FamIDBarbara2               -1.059e+00  3.933e+02  -0.003   0.9979
FamIDBaxter2                -7.557e-02  1.924e-01  -0.393   0.6945
FamIDBeane2                  8.272e-01  4.025e+02   0.002   0.9984
FamIDBecker4                 7.046e-01  4.025e+02   0.002   0.9986
FamIDBeckwith3               1.064e+00  3.177e+02   0.003   0.9973
FamIDBishop2                 5.524e-01  4.025e+02   0.001   0.9989
FamIDBoulos3                -7.450e-01  4.025e+02  -0.002   0.9985
FamIDBourke3                -1.292e+00  3.411e+02  -0.004   0.9970
FamIDBowerman2               6.143e-01  4.025e+02   0.002   0.9988
FamIDBraund2                        NA         NA      NA       NA
FamIDBrown3                  1.151e-02  2.058e-01   0.056   0.9554
FamIDBryhl2                 -6.435e-01  4.025e+02  -0.002   0.9987
FamIDCaldwell3               9.558e-01  3.665e+02   0.003   0.9979
FamIDCaram2                 -7.831e-01  4.025e+02  -0.002   0.9984
FamIDCardeza2                7.874e-01  4.025e+02   0.002   0.9984
FamIDCarter2                -1.102e+00  2.822e+02  -0.004   0.9969
FamIDCarter4                 1.278e+00  3.216e+02   0.004   0.9968
FamIDCavendish2             -6.692e-01  4.025e+02  -0.002   0.9987
FamIDChaffee2                       NA         NA      NA       NA
FamIDChambers2               1.051e+00  3.161e+02   0.003   0.9973
FamIDChapman2                       NA         NA      NA       NA
FamIDChibnall2                      NA         NA      NA       NA
FamIDChristy3                6.650e-01  4.025e+02   0.002   0.9987
FamIDChronopoulos2          -6.402e-01  4.025e+02  -0.002   0.9987
FamIDClark2                         NA         NA      NA       NA
FamIDClarke2                 6.180e-01  4.025e+02   0.002   0.9988
FamIDCollyer3                2.449e-02  2.003e-01   0.122   0.9027
FamIDCompton3                6.197e-01  4.025e+02   0.002   0.9988
FamIDCornell3                       NA         NA      NA       NA
FamIDCoutts3                 7.184e-01  4.025e+02   0.002   0.9986
FamIDCribb2                 -6.146e-01  4.025e+02  -0.002   0.9988
FamIDCrosby3                -8.693e-03  1.950e-01  -0.045   0.9644
FamIDCumings2                5.693e-01  4.025e+02   0.001   0.9989
FamIDDanbom3                -1.032e+00  3.356e+02  -0.003   0.9975
FamIDDavidson2              -6.737e-01  4.025e+02  -0.002   0.9987
FamIDDavidson4                      NA         NA      NA       NA
FamIDDavies3                 3.311e-02  1.406e-01   0.235   0.8138
FamIDDavison2                6.902e-01  4.025e+02   0.002   0.9986
FamIDDean4                   1.679e-02  1.481e-01   0.113   0.9097
FamIDdelCarlo2              -6.487e-01  4.025e+02  -0.002   0.9987
FamIDdeMessemaeker2          7.026e-01  4.025e+02   0.002   0.9986
FamIDDick2                   1.053e+00  3.083e+02   0.003   0.9973
FamIDDodge3                  6.735e-01  4.025e+02   0.002   0.9987
FamIDDoling2                 6.597e-01  4.025e+02   0.002   0.9987
FamIDDouglas2               -6.656e-01  4.025e+02  -0.002   0.9987
FamIDDouglas3                       NA         NA      NA       NA
FamIDDrew3                          NA         NA      NA       NA
FamIDDuffGordon2             1.063e+00  3.093e+02   0.003   0.9973
`FamIDDurany More2`          6.887e-01  4.025e+02   0.002   0.9986
FamIDDyker2                         NA         NA      NA       NA
FamIDEarnshaw2                      NA         NA      NA       NA
FamIDElias3                 -6.311e-01  4.025e+02  -0.002   0.9987
FamIDEustis2                 6.331e-01  4.025e+02   0.002   0.9987
FamIDFaunthorpe2             6.189e-01  4.025e+02   0.002   0.9988
FamIDFord5                  -1.295e+00  3.571e+02  -0.004   0.9971
FamIDFortune6               -7.440e-02  2.115e-01  -0.352   0.7250
FamIDFrauenthal2             5.638e-01  4.025e+02   0.001   0.9989
FamIDFrauenthal3             8.475e-01  4.025e+02   0.002   0.9983
FamIDFrolicher3              6.055e-01  4.025e+02   0.002   0.9988
`FamIDFrolicher-Stehli3`     8.078e-01  4.025e+02   0.002   0.9984
FamIDFutrelle2              -6.571e-02  2.075e-01  -0.317   0.7515
FamIDGale2                  -6.355e-01  4.025e+02  -0.002   0.9987
FamIDGibson2                        NA         NA      NA       NA
FamIDGiles2                 -6.164e-01  4.025e+02  -0.002   0.9988
FamIDGoldenberg2             1.061e+00  3.036e+02   0.003   0.9972
FamIDGoldsmith3              6.675e-01  4.025e+02   0.002   0.9987
FamIDGoodwin8               -1.520e+00  3.999e+02  -0.004   0.9970
FamIDGraham2                 5.967e-01  4.025e+02   0.001   0.9988
FamIDGreenfield2             7.758e-01  4.025e+02   0.002   0.9985
FamIDGustafsson3            -8.615e-01  4.018e+02  -0.002   0.9983
FamIDHagland2               -6.350e-01  4.025e+02  -0.002   0.9987
FamIDHakkarainen2                   NA         NA      NA       NA
FamIDHamalainen3             9.973e-01  3.640e+02   0.003   0.9978
FamIDHansen2                -6.143e-01  4.025e+02  -0.002   0.9988
FamIDHansen3                -6.191e-01  4.025e+02  -0.002   0.9988
FamIDHarder2                 7.767e-01  4.025e+02   0.002   0.9985
FamIDHarper2                 1.991e-01  2.017e-01   0.987   0.3236
FamIDHarris2                -6.069e-02  2.139e-01  -0.284   0.7766
FamIDHart3                   3.698e-02  2.039e-01   0.181   0.8561
FamIDHays3                   5.905e-01  4.025e+02   0.001   0.9988
FamIDHerman4                 6.800e-01  4.025e+02   0.002   0.9987
FamIDHickman3               -8.913e-01  4.021e+02  -0.002   0.9982
FamIDHiltunen3                      NA         NA      NA       NA
FamIDHippach2                8.313e-01  3.934e+02   0.002   0.9983
FamIDHirvonen2               7.229e-01  4.025e+02   0.002   0.9986
FamIDHirvonen3                      NA         NA      NA       NA
FamIDHocking4               -6.155e-01  4.025e+02  -0.002   0.9988
FamIDHocking5                       NA         NA      NA       NA
FamIDHogeboom2                      NA         NA      NA       NA
FamIDHold2                  -6.266e-01  4.025e+02  -0.002   0.9988
FamIDHolverson2             -6.639e-01  4.025e+02  -0.002   0.9987
FamIDHoward2                        NA         NA      NA       NA
FamIDHoyt2                   5.754e-01  4.025e+02   0.001   0.9989
FamIDIlmakangas2            -7.049e-01  4.025e+02  -0.002   0.9986
FamIDJacobsohn2             -6.284e-01  4.025e+02  -0.002   0.9988
FamIDJacobsohn4              6.136e-01  4.025e+02   0.002   0.9988
FamIDJefferys3                      NA         NA      NA       NA
FamIDJensen2                -6.223e-01  4.025e+02  -0.002   0.9988
FamIDJohnson3                1.018e+00  4.013e+02   0.003   0.9980
FamIDJohnston4              -7.554e-01  4.025e+02  -0.002   0.9985
FamIDJussila2               -1.002e+00  4.022e+02  -0.002   0.9980
FamIDKantor2                -1.484e-02  1.977e-01  -0.075   0.9401
FamIDKarun2                         NA         NA      NA       NA
FamIDKenyon2                 5.638e-01  4.025e+02   0.001   0.9989
FamIDKhalil2                        NA         NA      NA       NA
FamIDKiernan2               -6.449e-01  4.025e+02  -0.002   0.9987
FamIDKimball2                8.006e-01  4.025e+02   0.002   0.9984
FamIDKink3                  -6.152e-01  4.025e+02  -0.002   0.9988
`FamIDKink-Heilmann3`        6.940e-01  4.025e+02   0.002   0.9986
`FamIDKink-Heilmann5`               NA         NA      NA       NA
FamIDKlasen3                -6.214e-01  4.025e+02  -0.002   0.9988
FamIDLahtinen3              -8.482e-01  4.025e+02  -0.002   0.9983
FamIDLaroche4                2.688e-02  2.057e-01   0.131   0.8960
FamIDLefebre5               -1.312e+00  4.011e+02  -0.003   0.9974
FamIDLennon2                -6.620e-01  4.025e+02  -0.002   0.9987
FamIDLindell2               -6.226e-01  4.025e+02  -0.002   0.9988
FamIDLindqvist2              8.448e-01  4.025e+02   0.002   0.9983
FamIDLines2                  5.931e-01  4.025e+02   0.001   0.9988
FamIDLobb2                  -6.279e-01  4.025e+02  -0.002   0.9988
FamIDLouch2                  6.305e-01  4.025e+02   0.002   0.9988
FamIDMallet3                 3.582e-02  1.574e-01   0.228   0.8200
FamIDMarvin2                -6.843e-01  4.025e+02  -0.002   0.9986
FamIDMcCoy3                  1.042e+00  3.470e+02   0.003   0.9976
FamIDMcNamee2               -6.332e-01  4.025e+02  -0.002   0.9987
FamIDMellinger2                     NA         NA      NA       NA
FamIDMeyer2                 -9.190e-02  2.106e-01  -0.436   0.6625
FamIDMinahan2                5.962e-01  4.025e+02   0.001   0.9988
FamIDMinahan3               -6.493e-01  4.025e+02  -0.002   0.9987
FamIDMock2                          NA         NA      NA       NA
FamIDMoor2                   1.003e+00  3.969e+02   0.003   0.9980
FamIDMoran2                 -7.498e-03  1.385e-01  -0.054   0.9568
FamIDMoubarek3               1.027e+00  4.022e+02   0.003   0.9980
FamIDMurphy2                 7.128e-01  4.025e+02   0.002   0.9986
FamIDNakid3                  1.104e+00  3.350e+02   0.003   0.9974
FamIDNasser2                -3.445e-02  2.056e-01  -0.168   0.8669
FamIDNatsch2                -6.921e-01  4.025e+02  -0.002   0.9986
FamIDNavratil3               7.073e-01  4.025e+02   0.002   0.9986
FamIDNewell2                 8.609e-01  4.019e+02   0.002   0.9983
FamIDNewell3                -6.576e-01  4.025e+02  -0.002   0.9987
FamIDNewsom3                 5.957e-01  4.025e+02   0.001   0.9988
FamIDNicholls3              -6.488e-01  4.025e+02  -0.002   0.9987
`FamIDNicola-Yarred2`        1.019e+00  4.013e+02   0.003   0.9980
`FamIDO'Brien2`              6.632e-01  4.025e+02   0.002   0.9987
FamIDOlsen2                 -5.992e-01  4.025e+02  -0.001   0.9988
FamIDOstby2                         NA         NA      NA       NA
FamIDPalsson5               -1.559e+00  3.981e+02  -0.004   0.9969
FamIDPanula6                -1.680e+00  3.489e+02  -0.005   0.9962
FamIDParrish2                6.385e-01  4.025e+02   0.002   0.9987
FamIDPeacock3                       NA         NA      NA       NA
FamIDPears2                 -7.889e-02  2.114e-01  -0.373   0.7090
`FamIDPenascoy Castellana2` -1.013e-01  2.069e-01  -0.490   0.6243
FamIDPeter3                  9.624e-01  3.894e+02   0.002   0.9980
FamIDPetterson2             -6.152e-01  4.025e+02  -0.002   0.9988
FamIDPhillips2                      NA         NA      NA       NA
FamIDPotter2                 5.861e-01  4.025e+02   0.001   0.9988
FamIDQuick3                  6.446e-01  4.025e+02   0.002   0.9987
FamIDRenouf2                        NA         NA      NA       NA
FamIDRenouf4                 6.180e-01  4.025e+02   0.002   0.9988
FamIDRice6                  -1.800e+00  4.008e+02  -0.004   0.9964
FamIDRichards3               7.379e-01  4.025e+02   0.002   0.9985
FamIDRichards6               6.443e-01  4.025e+02   0.002   0.9987
FamIDRobert2                 5.834e-01  4.025e+02   0.001   0.9988
FamIDRobins2                -7.521e-01  4.025e+02  -0.002   0.9985
FamIDRosblom3               -1.063e+00  3.483e+02  -0.003   0.9976
FamIDRothschild2                    NA         NA      NA       NA
FamIDRyerson5                6.002e-01  4.025e+02   0.001   0.9988
FamIDSage11                 -1.635e+00  3.694e+02  -0.004   0.9965
FamIDSamaan3                -6.754e-01  4.025e+02  -0.002   0.9987
FamIDSandstrom3              1.003e+00  3.942e+02   0.003   0.9980
FamIDSchabert2                      NA         NA      NA       NA
FamIDShelley2                6.163e-01  4.025e+02   0.002   0.9988
FamIDSilven3                 6.904e-01  4.025e+02   0.002   0.9986
FamIDSilvey2                -5.504e-02  2.148e-01  -0.256   0.7977
FamIDSkoog6                 -1.697e+00  3.623e+02  -0.005   0.9963
FamIDSmith2                         NA         NA      NA       NA
FamIDSnyder2                        NA         NA      NA       NA
FamIDSolo                    4.404e-01  1.094e+00   0.403   0.6872
FamIDSpedden3                       NA         NA      NA       NA
FamIDSpencer2                5.550e-01  4.025e+02   0.001   0.9989
FamIDStengel2                       NA         NA      NA       NA
FamIDStephenson2             5.817e-01  4.025e+02   0.001   0.9988
FamIDStraus2                        NA         NA      NA       NA
FamIDStrom2                 -7.416e-01  4.025e+02  -0.002   0.9985
FamIDStrom3                 -7.681e-01  4.025e+02  -0.002   0.9985
FamIDTaussig3                8.421e-01  3.904e+02   0.002   0.9983
FamIDTaylor2                 1.073e+00  3.040e+02   0.004   0.9972
FamIDThayer3                 1.030e+00  3.244e+02   0.003   0.9975
FamIDThomas2                 7.254e-01  4.025e+02   0.002   0.9986
FamIDThomas3                        NA         NA      NA       NA
FamIDThorneycroft2           3.898e-02  1.546e-01   0.252   0.8010
FamIDTouma3                         NA         NA      NA       NA
FamIDTurpin2                -1.127e+00  2.917e+02  -0.004   0.9969
FamIDvanBilliard3           -6.177e-01  4.025e+02  -0.002   0.9988
FamIDVanderPlanke2          -7.663e-01  4.025e+02  -0.002   0.9985
FamIDVanderPlanke3          -9.833e-01  3.480e+02  -0.003   0.9977
FamIDVanderPlanke4                  NA         NA      NA       NA
FamIDVanImpe3               -1.106e+00  3.902e+02  -0.003   0.9977
FamIDWare2                          NA         NA      NA       NA
FamIDWarren2                 5.888e-01  4.025e+02   0.001   0.9988
FamIDWeisz2                  6.189e-01  4.025e+02   0.002   0.9988
FamIDWells3                  6.463e-01  4.025e+02   0.002   0.9987
FamIDWest4                   2.669e-02  2.030e-01   0.132   0.8954
FamIDWhite2                 -6.523e-01  4.025e+02  -0.002   0.9987
FamIDWick3                   6.223e-01  4.025e+02   0.002   0.9988
FamIDWidener3               -6.851e-01  4.025e+02  -0.002   0.9986
FamIDWiklund2                       NA         NA      NA       NA
FamIDWilkes2                        NA         NA      NA       NA
FamIDWilliams2              -6.638e-01  4.025e+02  -0.002   0.9987
FamIDYasbeck2                       NA         NA      NA       NA
FamIDZabour2                -7.335e-01  4.025e+02  -0.002   0.9985
TitleMiss                   -8.526e+00  2.897e+03  -0.003   0.9977
TitleMr                     -1.242e+00  7.618e-01  -1.630   0.1030
TitleMrs                    -6.906e+00  2.505e+03  -0.003   0.9978
TitleNoble                  -5.001e-01  2.775e-01  -1.802   0.0716 .
`Sexmale:Pclass2`            1.211e-01  4.900e-01   0.247   0.8048
`Sexmale:Pclass3`            1.035e+00  6.352e-01   1.630   0.1031
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 366.58  on 515  degrees of freedom
AIC: 764.58

Number of Fisher Scoring iterations: 18

```

So we did a little bit better. I would just like to test a theory. We manufactured the FamilyID and have been doing well so far. What happens if we take it out? Will we do worse or better? Let's check it out.

In [289]:
```%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35)

logit.tune4<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune4
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.8123305  0.5934912  0.04130739   0.09243173

```
In [290]:
```%%R
summary(logit.tune4)
```
```
Call:
NULL

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6591  -0.5397  -0.4099   0.3937   2.4194

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)        -0.56336    0.12309  -4.577 4.72e-06 ***
Sexmale            -8.13159  278.33333  -0.029 0.976693
Pclass2            -0.36485    0.33842  -1.078 0.280987
Pclass3            -1.62269    0.37865  -4.285 1.82e-05 ***
Age                -0.30128    0.13506  -2.231 0.025693 *
SibSp              -0.62972    0.17467  -3.605 0.000312 ***
EmbarkedQ           0.04973    0.11805   0.421 0.673568
EmbarkedS          -0.17619    0.12541  -1.405 0.160047
`FareGroup10-20`    0.07364    0.14908   0.494 0.621326
`FareGroup20-40`   -0.01342    0.17629  -0.076 0.939314
`FareGroup40+`      0.11277    0.22144   0.509 0.610572
TitleMiss          -6.75940  236.67998  -0.029 0.977216
TitleMr            -1.55762    0.29962  -5.199 2.01e-07 ***
TitleMrs           -5.78595  204.63657  -0.028 0.977443
TitleNoble         -0.46573    0.14315  -3.253 0.001140 **
`Sexmale:Pclass2`  -0.32045    0.29394  -1.090 0.275630
`Sexmale:Pclass3`   0.65008    0.35598   1.826 0.067823 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.85  on 697  degrees of freedom
AIC: 610.85

Number of Fisher Scoring iterations: 13

```

So we actually did much better. So lesson learnt, engineering new features is a great idea but may or may not positively impact your model. In fact, if we don't get it right, it could have an adverse impact. Let's see if we can make anymore tiny improvements with the Title. We have 4 possible values and I am going to class compress each.

In [291]:
```%%R
#Let's work in an interaction between Pclass and Sex.
set.seed(35)

logit.tune5<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + I(Title=='Mr') +
I(Title=='Mrs') + I(Title=='Miss') + I(Title=='Noble'),
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune5
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.8123305  0.5934912  0.04130739   0.09243173

```
In [292]:
```%%R
summary(logit.tune5)
```
```
Call:
NULL

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.6591  -0.5397  -0.4099   0.3937   2.4194

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)                -0.56336    0.12309  -4.577 4.72e-06 ***
Sexmale                    -8.13159  278.33333  -0.029 0.976693
Pclass2                    -0.36485    0.33842  -1.078 0.280987
Pclass3                    -1.62269    0.37865  -4.285 1.82e-05 ***
Age                        -0.30128    0.13506  -2.231 0.025693 *
SibSp                      -0.62972    0.17467  -3.605 0.000312 ***
EmbarkedQ                   0.04973    0.11805   0.421 0.673568
EmbarkedS                  -0.17619    0.12541  -1.405 0.160047
`FareGroup10-20`            0.07364    0.14908   0.494 0.621326
`FareGroup20-40`           -0.01342    0.17629  -0.076 0.939314
`FareGroup40+`              0.11277    0.22144   0.509 0.610572
`I(Title == "Mr")TRUE`     -1.55762    0.29962  -5.199 2.01e-07 ***
`I(Title == "Mrs")TRUE`    -5.78595  204.63657  -0.028 0.977443
`I(Title == "Miss")TRUE`   -6.75940  236.67998  -0.029 0.977216
`I(Title == "Noble")TRUE`  -0.46573    0.14315  -3.253 0.001140 **
`Sexmale:Pclass2`          -0.32045    0.29394  -1.090 0.275630
`Sexmale:Pclass3`           0.65008    0.35598   1.826 0.067823 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.85  on 697  degrees of freedom
AIC: 610.85

Number of Fisher Scoring iterations: 13

```

Hmm, didn't really make a difference. Let's try one last thing, adding Child, which we derived during the feature engineering exercise.

In [120]:
```%%R
#Let's add Child to the mix.
set.seed(35)

logit.tune6<-train(Survived ~ Sex + Pclass + Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
method='glm',
preProcess = c("center", "scale"),
trControl=tenfoldcv)

logit.tune6
```
```Generalized Linear Model

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results

Accuracy   Kappa      Accuracy SD  Kappa SD
0.8144806  0.5969237  0.04436537   0.101153

```
In [294]:
```%%R
summary(logit.tune6)
```
```
Call:
NULL

Deviance Residuals:
Min       1Q   Median       3Q      Max
-2.7285  -0.5465  -0.4164   0.4031   2.4338

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)        -0.56186    0.12306  -4.566 4.98e-06 ***
Sexmale            -8.16008  276.99294  -0.029 0.976498
Pclass2            -0.36500    0.33876  -1.077 0.281271
Pclass3            -1.62927    0.37881  -4.301 1.70e-05 ***
Age                -0.25715    0.14521  -1.771 0.076581 .
SibSp              -0.64181    0.17585  -3.650 0.000262 ***
EmbarkedQ           0.05690    0.11859   0.480 0.631327
EmbarkedS          -0.17477    0.12575  -1.390 0.164569
`FareGroup10-20`    0.05529    0.15150   0.365 0.715165
`FareGroup20-40`   -0.03809    0.17965  -0.212 0.832083
`FareGroup40+`      0.09460    0.22321   0.424 0.671714
TitleMiss          -6.73181  235.54018  -0.029 0.977199
TitleMr            -1.47297    0.31722  -4.643 3.43e-06 ***
TitleMrs           -5.73725  203.65108  -0.028 0.977525
TitleNoble         -0.44396    0.14534  -3.055 0.002254 **
Child               0.11549    0.14579   0.792 0.428237
`Sexmale:Pclass2`  -0.31971    0.29422  -1.087 0.277200
`Sexmale:Pclass3`   0.64601    0.35560   1.817 0.069267 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 950.86  on 713  degrees of freedom
Residual deviance: 576.23  on 696  degrees of freedom
AIC: 612.23

Number of Fisher Scoring iterations: 13

```

So we got that tiny push we were looking for. Let's now go ahead and try this model on our test set as well as submit to Kaggle.

Model Evaluation - Logistic Regression
We can now begin to evaluate model performance by putting together some cross-tabulations of the observed and predicted Survival for the passengers in the test.set data. caret makes this easy with the confusionMatrix function.

In [295]:
```%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
logit.pred<-predict(logit.tune6, test.set)

#Generate the confusion matrix
confusionMatrix(logit.pred, test.set\$Survived)
```
```Confusion Matrix and Statistics

Reference
Prediction  0  1
0 99 18
1 10 50

Accuracy : 0.8418
95% CI : (0.7795, 0.8922)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 4.229e-11

Kappa : 0.6581
Mcnemar's Test P-Value : 0.1859

Sensitivity : 0.9083
Specificity : 0.7353
Pos Pred Value : 0.8462
Neg Pred Value : 0.8333
Prevalence : 0.6158
Detection Rate : 0.5593
Detection Prevalence : 0.6610
Balanced Accuracy : 0.8218

'Positive' Class : 0

```

The metric we're looking for here is called Specificity. It basically is, out of all that actually survived how many we predicted will survive. So, it's 50/68 = 73.53%. Not too shabby though I'd like to do better. Let's anyway try to make a submission and find out for real.

Submit Results to Kaggle
Let's now submit the results from the LR model to Kaggle to see how we fare.

In [356]:
```%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(logit.tune6, kaggletest)
lr.predictions<-as.data.frame(Survived)
lr.predictions\$PassengerId<-kaggletest\$PassengerId

#Write results as a CSV file
write.csv(lr.predictions[,c('PassengerId','Survived')], file="LR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
```

The model scored 0.76555 which put us ahead of only about 1/4th of the teams on the leaderboard. Let us keep trying to improve.

Support Vector Machines
Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for classification or for regression. SVMs are a discriminative classifier: that is, they draw a boundary between clusters of data. The process of fitting an SVM based model to our dataset is very similar to what we just did with glm but involves an additional step to hypertune parameters. The key parameter for SVM is C, which can be considered as 1/lambda where lambda is the regularization term. We talked about overfitting previously, regularization is the process of offsetting it. If there is overfitting, we would increase lambda, so conversely decrease C.

The caret package automatically selects the best C value by hypertuning it during crossvalidation. But we need to supply a range of C values for the train method to try. For SVMs, this could be handled by updating the parameter, tunelength (or tunegrid). By default, the length is 3 and the values tried are 0.25, 0.5 and 1. Setting it to 6 would try 0.25 - 8. Let's get started.

In [124]:
```%%R
set.seed(35)

#Training an SVM model with the RBF kernel - Radial Basis Function
svm.tune1<-train(Survived ~ Sex + Pclass +  Sex:Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
tuneLength=9,
preProcess = c("center", "scale"),
trControl=tenfoldcv)

svm.tune1
```
```Support Vector Machines with Radial Basis Function Kernel

714 samples
14 predictors
2 classes: '0', '1'

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results across tuning parameters:

C      Accuracy   Kappa      Accuracy SD  Kappa SD
0.25  0.8049426  0.5722555  0.05162691   0.1132959
0.50  0.8117884  0.5843046  0.05585971   0.1259916
1.00  0.8226591  0.6100882  0.05152085   0.1165301
2.00  0.8249804  0.6164518  0.05146277   0.1156968
4.00  0.8160650  0.5986011  0.05477380   0.1213830
8.00  0.8131307  0.5934170  0.04906936   0.1095503
16.00  0.8131172  0.5940660  0.05068143   0.1116418
32.00  0.8150599  0.5995714  0.05261309   0.1151817
64.00  0.8092588  0.5876340  0.05748809   0.1249829

Tuning parameter 'sigma' was held constant at a value of 0.07616343
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.07616343 and C = 2.

```

Great, so SVM automatically tried 9 different values of C while holding the other parameter Sigma constant and picked the values that gave the best results. We really didn't have to do much there. Let's go ahead and evaluate the model as well as submit to Kaggle.

Model Evaluation - SVM

In [312]:
```%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
svm.pred<-predict(svm.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(svm.pred, test.set\$Survived)
```
```Confusion Matrix and Statistics

Reference
Prediction   0   1
0 100  20
1   9  48

Accuracy : 0.8362
95% CI : (0.7732, 0.8874)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.38e-10

Kappa : 0.6429
Mcnemar's Test P-Value : 0.06332

Sensitivity : 0.9174
Specificity : 0.7059
Pos Pred Value : 0.8333
Neg Pred Value : 0.8421
Prevalence : 0.6158
Detection Rate : 0.5650
Detection Prevalence : 0.6780
Balanced Accuracy : 0.8117

'Positive' Class : 0

```

The specificity is actually a bit down from the LR model, we're at 70%. Other than that eveyrthing looks pretty similar. I also just discovered that a Support Vector Machine automatically captures interaction between variables. So the Pclass:Sex interaction we put in has no importance to this model. We'll remove it going forward since Random Forests also automatically identifies interactions.

Submit to Kaggle

In [355]:
```%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(svm.tune1, kaggletest)
svm.predictions<-as.data.frame(Survived)
svm.predictions\$PassengerId<-kaggletest\$PassengerId

#Write results as a CSV file
write.csv(svm.predictions[,c('PassengerId','Survived')], file="SVM_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
```

We scored 0.77512, an improvement over the Logistic Regression model that took us up the leaderboard a few notches. We're not going to let this go!

Random Forests
Next up, a very popular and easy to use model, Random Forests. RF builds on the concept of decision trees and expands it by growing multiple trees and averaging out the results to find the best fit. The parameter we need to tune is mtry, that number of features to try at each node. The best recommended value for this parameter is typically the square root of the number of features. Let's give this a shot.

In [365]:
```%%R
set.seed(35)

rfgrid<-data.frame(.mtry=c(2,3,4))

#Training a Random Forest model
rf.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child,
data=train.set,
method='rf',
tuneGrid=rfgrid,
trControl=tenfoldcv)

rf.tune1
```
```Random Forest

714 samples
14 predictors
2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results across tuning parameters:

mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
2     0.8096114  0.5868040  0.05614486   0.1231515
3     0.8120599  0.5869056  0.05485173   0.1228654
4     0.8151139  0.5951016  0.05099510   0.1145908

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 4.

```

Looks like the best mtry value found was 4. Let's complete the formalities.

Model Evaluation - Random Forests

In [353]:
```%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(rf.pred, test.set\$Survived)
```
```Confusion Matrix and Statistics

Reference
Prediction   0   1
0 101  20
1   8  48

Accuracy : 0.8418
95% CI : (0.7795, 0.8922)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 4.229e-11

Kappa : 0.6542
Mcnemar's Test P-Value : 0.03764

Sensitivity : 0.9266
Specificity : 0.7059
Pos Pred Value : 0.8347
Neg Pred Value : 0.8571
Prevalence : 0.6158
Detection Rate : 0.5706
Detection Prevalence : 0.6836
Balanced Accuracy : 0.8162

'Positive' Class : 0

```

Well, the specificity is exactly the same, at 70%. I doubt this is going to give us a different result with Kaggle but let's try anyway.

Submit to Kaggle

In [357]:
```%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune1, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions\$PassengerId<-kaggletest\$PassengerId

#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
```

Surprise, we scored 0.78469 bringing us midway on the leaderboard. So we're definitely making headway!

Feature Importances
Before we move ahead let's look at a key statistic that comes from running a tree based model. It's called feature importances, which tells us how much of an impact each of the features we're feeding to the model has to the final outcome.

In [370]:
```%%R
#Print variable importan
varImp(rf.tune1\$finalModel)
```
```                 Overall
Sexmale        41.576129
Pclass2         6.801074
Pclass3        21.558667
Age            36.931313
SibSp          15.046491
EmbarkedQ       2.972848
EmbarkedS       6.683872
FareGroup10-20  3.980921
FareGroup20-40  6.418674
FareGroup40+   10.203688
TitleMiss      11.145308
TitleMr        36.630889
TitleMrs       10.509372
TitleNoble      2.862573
Child           4.153887

```

That's very interesting. We always knew that Gender was the most important variable. Sexmale by the way shows up there because it's the class that suffered the most. We see here that Age and "Mr." Title take the second spot. That is a revelation of sorts. SibSp also seems quite important compared to say Embarked.

Before we move ahead with other models, I am really curious to try a few more things out with RF. I would like to measure the impact of adding a couple of features we've left out, Family Size and Parch and see how they pan out in terms of importance. Offline I tried adding FamID and it again created a negative impact so I am leaving it out for good

In [41]:
```%%R
set.seed(35)

#Notice how we're trying out 2-5 features at each node now since we're adding a couple of features to train.
rfgrid<-data.frame(.mtry=c(2,3,4,5))

#Training a Random Forest model
rf.tune2<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
data=train.set,
method='rf',
tuneGrid=rfgrid,
trControl=tenfoldcv)

rf.tune2
```
```Random Forest

714 samples
14 predictors
2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results across tuning parameters:

mtry  Accuracy   Kappa      Accuracy SD  Kappa SD
2     0.8161450  0.6018844  0.05130416   0.11229173
3     0.8114502  0.5872835  0.05293296   0.11855522
4     0.8133020  0.5920896  0.04751211   0.10699385
5     0.8221049  0.6147302  0.04405445   0.09805046

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 5.

```

As expected, the best mtry value this time was 5. Let's look at the feature importances.

In [73]:
```%%R
varImp(rf.tune2\$finalModel)
```
```                 Overall
Sexmale        43.453469
Pclass2         6.453174
Pclass3        19.967399
Age            42.531124
SibSp          10.604394
EmbarkedQ       2.998889
EmbarkedS       6.815784
FareGroup10-20  4.111547
FareGroup20-40  6.365776
FareGroup40+    9.403774
TitleMiss      10.173660
TitleMr        38.239391
TitleMrs        8.937783
TitleNoble      2.831949
Child           3.719191
FamSize        19.145703
Parch           7.960594

```

So it worked out well. Parch and especially FamSize seem to be reasonably important. I believe feature selection plays a very important role in Tree based on models (not that they don't in others but perhaps even more important in this case). OK let's run predictions then re-submit to Kaggle to see if we can improve the score.

In [43]:
```%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
rf.pred<-predict(rf.tune2, test.set)

#Generate the confusion matrix
confusionMatrix(rf.pred, test.set\$Survived)
```
```Confusion Matrix and Statistics

Reference
Prediction  0  1
0 98 20
1 11 48

Accuracy : 0.8249
95% CI : (0.7607, 0.8778)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.304e-09

Kappa : 0.6204
Mcnemar's Test P-Value : 0.1508

Sensitivity : 0.8991
Specificity : 0.7059
Pos Pred Value : 0.8305
Neg Pred Value : 0.8136
Prevalence : 0.6158
Detection Rate : 0.5537
Detection Prevalence : 0.6667
Balanced Accuracy : 0.8025

'Positive' Class : 0

```

No difference on predictions in our Test set. The sensitivity is the same. But let's try submitting to Kaggle anyway.

In [44]:
```%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(rf.tune2, kaggletest)
rf.predictions<-as.data.frame(Survived)
rf.predictions\$PassengerId<-kaggletest\$PassengerId

#Write results as a CSV file
write.csv(rf.predictions[,c('PassengerId','Survived')], file="RF_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
```

Guess what, we did make an improvement to the Kaggle score. This model scored 0.78947 bringing us to the top 1/3rd of the leaderboard. Since tree based models are giving us great results let's try one more, a very interesting model called Conditional Trees.

Conditional Trees
Conditional Tree based models supposedly tend to select variables that have many possible splits or many missing values. So instead of Random Forests which tries to find out which variables are important, using an information measure, these models perform sort of a significance test to see which features will yield the best results at each split. Let's give this a whirl.

Note that there are two Conditional Tree packages in caret (ctree, ctree2). We will be using ctree2 which allows us to tune the Max Depth, how deep the trees can grow.

In [75]:
```%%R
set.seed(35)

ctrgrid<-data.frame(.maxdepth=c(2,3,4,5,6))

#Training a Conditional Tree model
ctr.tune1<-train(Survived ~ Sex + Pclass + Age + SibSp + Embarked + FareGroup + Title + Child + FamSize + Parch,
data=train.set,
method='ctree2',
tuneGrid=ctrgrid,
trControl=tenfoldcv)

ctr.tune1
```
```Conditional Inference Tree

714 samples
14 predictors
2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)

Summary of sample sizes: 643, 643, 643, 642, 643, 643, ...

Resampling results across tuning parameters:

maxdepth  Accuracy   Kappa      Accuracy SD  Kappa SD
2         0.7707812  0.4755401  0.04604332   0.1104954
3         0.8077726  0.5855616  0.05702946   0.1237835
4         0.8236502  0.6130369  0.05491988   0.1236674
5         0.8194444  0.6055659  0.04897090   0.1102774
6         0.8152061  0.5960590  0.05126715   0.1141253

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was maxdepth = 4.

```

OK that did give us a better accuracy at Max Depth 4. Before we run predictions, let's visualize the tree that was built for the final model.

In [115]:
```%%R
plot(ctr.tune1\$finalModel)
```

We can observe here that Sexmale as expected was the starting node. From then on, Pclass and TitleMr took the honors for level 2 leading further then to the other variables. We're now ready to evaluate the model and run the predictions.

Model Evaluation - Conditional Trees

In [76]:
```%%R
#Derive predictions using our final LR model on the test set (this is NOT the test.csv file from Kaggle)
ctr.pred<-predict(ctr.tune1, test.set)

#Generate the confusion matrix
confusionMatrix(ctr.pred, test.set\$Survived)
```
```Confusion Matrix and Statistics

Reference
Prediction   0   1
0 102  19
1   7  49

Accuracy : 0.8531
95% CI : (0.7922, 0.9017)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 3.506e-12

Kappa : 0.6789
Mcnemar's Test P-Value : 0.03098

Sensitivity : 0.9358
Specificity : 0.7206
Pos Pred Value : 0.8430
Neg Pred Value : 0.8750
Prevalence : 0.6158
Detection Rate : 0.5763
Detection Prevalence : 0.6836
Balanced Accuracy : 0.8282

'Positive' Class : 0

```

So that gave us a better Specificity on the Test set at 72.06%. I am interested to see how this tests out at Kaggle.

Submit to Kaggle

In [91]:
```%%R
#Generate predictions and write as a dataframe, then include PassengerID
Survived<-predict(ctr.tune1, kaggletest)
ctr.predictions<-as.data.frame(Survived)
ctr.predictions\$PassengerId<-kaggletest\$PassengerId

#Write results as a CSV file
write.csv(ctr.predictions[,c('PassengerId','Survived')], file="CTR_Titanic_Predictions.csv", row.names=FALSE, quote=FALSE)
```

The model scored 0.77512 which is actually slightly worse than how Random Forests did. So the best model in terms of the Kaggle leaderboard has been Random Forests. But let's run a formal comparison of all the models we've built so far.

Model Comparison
The resamples method in the caret package makes it easy to compare results between different models.

In [126]:
```%%R
modelcompare<-resamples(list(Logit = logit.tune6, SVM = svm.tune1, RF = rf.tune2, CTREE = ctr.tune1 ))
summary(modelcompare)
```
```
Call:
summary.resamples(object = modelcompare)

Models: Logit, SVM, RF, CTREE
Number of resamples: 24

Accuracy
Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
Logit 0.7465  0.7887 0.8042 0.8145  0.8333 0.9437    0
SVM   0.7606  0.7778 0.8042 0.8244  0.8677 0.9296    0
RF    0.7606  0.7917 0.8182 0.8250  0.8592 0.9155    0
CTREE 0.7361  0.7770 0.8099 0.8216  0.8502 0.9577    0

Kappa
Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
Logit 0.4295  0.5334 0.5747 0.5969  0.6505 0.8787    0
SVM   0.4570  0.5146 0.5612 0.6149  0.7173 0.8495    0
RF    0.4652  0.5392 0.6140 0.6195  0.7012 0.8232    0
CTREE 0.4338  0.5003 0.5771 0.6083  0.6847 0.9097    0

```

We can see that for the metric we chose (Accuracy), Random Forests outperformed the other models. Note here that we could've chosen ROC (Receiver Operating Characteristic) as the metric in which case, we'd have needed to generate class probabilties - that is, probability for survived/not survived for every data item rather than letting crossvalidation generate predictions automatically. I intend to learn and demonstrate these concepts in a seperate session. For now, let's plot the results in a couple of different ways.

Box Plot of Model Comparison Results

In [128]:
```%%R
bwplot(modelcompare)
```

Dot Plot of Model Comparison Results

In [129]:
```%%R
dotplot(modelcompare)
```

So there you have it. We took the Titanic dataset presented by Kaggle, imported and visualized the data in a series of plots, munged the data to address gaps and fitted 4 different models with varying results, all in R, special thanks to the caret package. We only managed to get up to 0.7894 on the Kaggle leaderboard but the point of this exercise was to learn and demonstrate machine learning concepts in R. Hope we've achieved that objective.

I will be happy to receive your feedback positive or negative but try to be nice, I am just learning :-)