Greetings everyone. I have a question that sounds very simple but is quite tricky. In the attached figure I drew an end-to-end workflow for model selection and evaluation. In principle, there's nothing wrong with it, but the first split (into train/test sets) introduces bias, even though I bootstrap the test set in the final evaluation module. My question is: how can I make it robust? If I repeat the exact same workflow 1000 times, I'll keep getting a different "best configuration", and that's not what I want. I work with tabular data: 835 samples and 80 features, for a binary classification task with a 65/35 class imbalance.
Someone could argue that the first split (train/test) basically corresponds to reality, and that whatever the result is, it should be reported.
Others could argue that the first split (train/test) might lead to a poor score for the best configuration, because the split was done only once and the resulting sets might not be ideal.
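
To make the setup concrete, here is a minimal sketch of roughly what one pass of the workflow does, repeated over a few split seeds. It is not my exact pipeline. The synthetic data from `make_classification`, the logistic regression model, the `C` grid, ROC AUC as the metric, the 80/20 split, and the 5 seeds are all placeholder assumptions standing in for what the figure shows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the real data
# (835 samples, 80 features, roughly 65/35 class balance).
X, y = make_classification(
    n_samples=835, n_features=80, n_informative=10,
    weights=[0.65, 0.35], random_state=0,
)

C_GRID = [0.01, 0.1, 1.0, 10.0]  # hypothetical hyperparameter grid


def run_workflow(seed, n_boot=200):
    """One pass: single stratified split, CV model selection, bootstrapped test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    search = GridSearchCV(
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        param_grid={"logisticregression__C": C_GRID},
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
    )
    search.fit(X_tr, y_tr)  # refits the best configuration on the full training set
    scores = search.predict_proba(X_te)[:, 1]

    # Bootstrap the held-out test set to get a resampled AUC estimate.
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_te), size=len(y_te))
        if len(np.unique(y_te[idx])) < 2:
            continue  # AUC is undefined if a resample contains a single class
        aucs.append(roc_auc_score(y_te[idx], scores[idx]))
    return search.best_params_["logisticregression__C"], float(np.mean(aucs))


# Only the split seed changes between passes.
results = [run_workflow(seed) for seed in range(5)]
for seed, (best_c, auc) in enumerate(results):
    print(f"seed={seed}  best C={best_c}  bootstrapped test AUC={auc:.3f}")
```

Comparing the selected `C` and the bootstrapped AUC across seeds shows how much of the reported result depends on that single first split. The bootstrap only resamples the one test set it was given, so it cannot capture that split-to-split variation.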