Thanks for sharing your article and for the effort you put into it. One thing I am missing a bit is a discussion of causal inference vs. correlation. Take the “embarked” variable: what could possibly be the reason that these passengers had a lower rate of survival? Is there a direct connection between the port of embarkation and the chance of surviving a catastrophe, or is the variable correlated with another variable that actually explains the causal relationship (did only old men embark in Southampton, or did they get cabins in a specific part of the ship)? A similar point can be made about the ticket fare. Does the ticket price really explain the chance of survival, or is it merely correlated with the location of the cabin and a status of importance that played a role during the evacuation? If so, would this not already be captured by the passenger class variable? If I understand correctly, the competition is only about forecast accuracy, so the causal relationship of individual variables is less of a concern. But I always think it is nice to be able to “tell a story” when investigating data; it often helps to understand the data better and leads to more meaningful results.
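To make the question about “embarked” concrete, here is a minimal sketch of the kind of check I have in mind: compare the survival rate per port with the survival rate per port within each passenger class. The column names (`Survived`, `Embarked`, `Pclass`) are my assumption about the dataset, the helpers `survival_rate_by` and `make_rows` are hypothetical, and the numbers are made-up toy data, not the real Titanic figures.

```python
from collections import defaultdict


def survival_rate_by(rows, keys):
    """Return {group: share of survivors} for rows grouped by the given keys."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        group = tuple(row[k] for k in keys)
        counts[group][0] += row["Survived"]
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


def make_rows(embarked, pclass, n, survivors):
    """Build n toy passengers, the first `survivors` of whom survived."""
    return [
        {"Embarked": embarked, "Pclass": pclass, "Survived": int(i < survivors)}
        for i in range(n)
    ]


# Made-up toy data (NOT real Titanic figures): port "S" simply has more
# third-class passengers than port "C".
toy = (
    make_rows("C", 1, 8, 6) + make_rows("C", 3, 4, 1)
    + make_rows("S", 1, 4, 3) + make_rows("S", 3, 8, 2)
)

# Marginal view: the port looks like it matters.
by_port = survival_rate_by(toy, ["Embarked"])

# Stratified view: within each class, the port makes no difference.
by_class_and_port = survival_rate_by(toy, ["Pclass", "Embarked"])

for group, rate in sorted(by_port.items()):
    print(group, round(rate, 2))
for group, rate in sorted(by_class_and_port.items()):
    print(group, round(rate, 2))
```

In this toy example the gap between the ports disappears once passenger class is held fixed, which is the pattern I would look for in the real data. With pandas, the same comparison should be possible with `df.groupby(["Pclass", "Embarked"])["Survived"].mean()`.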
Looking forward to your future articles ;)