Hey Rafael, thanks for the nice article. I am fairly new to machine learning and might not get all the nuances between the different methodologies, but wouldn’t an anomaly detection algorithm using a (multivariate) Gaussian distribution, with some feature preparation and selection up front, be a good candidate for such a skewed data set? My concern with undersampling would be that the logistic regression still has few (fewer than 500) fraud cases to build its estimate on. Can you elaborate on some pros and cons of the different methodologies? Thanks
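For concreteness, the kind of anomaly detection I have in mind could be sketched roughly as below. This is only a minimal illustration, not anything taken from the article: the data is synthetic, the helper names `fit_gaussian` and `log_density` are made up, and the 0.1% threshold is an arbitrary assumption that would normally be tuned on a labelled validation set.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from (mostly normal) rows."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, sigma

def log_density(X, mu, sigma):
    """Log of the multivariate Gaussian density for each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance per row; solve() avoids inverting sigma explicitly
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Synthetic stand-ins for the real features (assumption, for illustration only)
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(5000, 2))  # "legitimate" transactions
fraud = rng.normal(6.0, 1.0, size=(20, 2))     # "fraudulent" transactions

# Fit on legitimate rows only; the few fraud cases are used for evaluation
mu, sigma = fit_gaussian(normal)

# Hypothetical threshold: the lowest 0.1% of densities seen on legitimate data
threshold = np.quantile(log_density(normal, mu, sigma), 0.001)
flagged = log_density(fraud, mu, sigma) < threshold
```

The point of the sketch is that the model is fit on the majority class alone, so the small number of fraud cases only matters for choosing the threshold, not for estimating parameters.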

Written by

Economist turned Data Scientist. Creating human mobility insights at Unacast
