Hacker News

I recently came across a similar flaw in EEG classification experiments. I think most results should be taken with a grain of salt until comprehensively confirmed by independent teams.

https://news.ycombinator.com/item?id=26696546



This contamination of the test data by training data reminds me of "Overly Optimistic Prediction Results on Imbalanced Data: a Case Study of Flaws and Benefits when Applying Over-sampling" [1], where almost half of 24 peer-reviewed studies applying machine learning to a particular publicly available dataset claimed near-perfect accuracy at predicting a patient's risk of pre-term birth, but were accidentally testing on training data.

[1]: https://arxiv.org/abs/2001.06296


Oversampling, then applying a train-test split? Jesus, that's like machine learning 101. But then again, I see a lot of questionable practices in the application of ML in biology.
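To make the leak concrete, here's a minimal, self-contained sketch (toy data, plain Python, no ML libraries assumed) of what goes wrong when you oversample the minority class *before* splitting: exact duplicates of training samples land in the test set, so any classifier that memorizes training points looks artificially accurate.

```python
import random

random.seed(0)

# Toy imbalanced dataset: 90 majority samples (label 0), 10 minority (label 1).
data = [((random.random(), random.random()), 0) for _ in range(90)]
data += [((random.random(), random.random()), 1) for _ in range(10)]

# WRONG ORDER: oversample the minority class before the train-test split.
minority = [d for d in data if d[1] == 1]
oversampled = data + minority * 8  # naive oversampling by duplication
random.shuffle(oversampled)

split = int(0.8 * len(oversampled))
train, test = oversampled[:split], oversampled[split:]

# Count test samples that are exact copies of training samples.
train_points = {x for x, _ in train}
leaked = sum(1 for x, _ in test if x in train_points)
print(f"{leaked}/{len(test)} test samples also appear in the training set")
```

A memorizing model (e.g. 1-nearest-neighbor) gets every leaked sample right for free, which is exactly the inflated-accuracy failure mode. The fix is to split first and oversample only the training fold.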


Great find, thanks!



