#RandomForest ML
I would start by framing the problem. Do you have current loan data? How familiar are you with measuring performance, assessing a model, and tuning it? Are you creating this model from scratch?
Hi, thank you so much. I already did this:
- Cleaned the data (from LendingClub)
- Removed unnecessary columns (I now have 43 of the original 151 columns left)
After this, I did the following, but I am not sure this is the way to do it.
- Divided the data into X and y (X = data without loan status; y = loan status)
- Split X and y with train_test_split
- Took the 5 best features using SelectKBest from sklearn
- Defined the RandomForest model and fitted it on the selected features
- Evaluated with predict_proba(X_test)
- Created a ROC curve
The reason I am not sure this is the right approach is that my ROC curve looks too good, I think, which probably means the model is overfitted?
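For reference, the steps listed above could be sketched roughly like this. It is only a minimal sketch on synthetic data: `make_classification` stands in for the cleaned LendingClub table, and the 42 feature columns, `f_classif` scoring function, and forest settings are assumptions, not the poster's actual choices. `SelectKBest` sits inside a `Pipeline` so that the feature selection is fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the cleaned loan data (assumption: 42 features, binary target).
X, y = make_classification(
    n_samples=2000, n_features=42, n_informative=8, random_state=0
)

# Hold out a test split before any feature selection or fitting happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Selection and model in one pipeline: both are fitted on X_train only.
model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("forest", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(X_train, y_train)

# Probability of the positive class on the held-out data.
test_scores = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print(f"test ROC AUC: {test_auc:.3f}")
```

If the feature selection were fitted on the full data before splitting, information from the test rows could leak into the model, which is one possible reason for a curve that looks too good.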
Thank you again!
I do have that, but I am very new to all of this. I have some sample code from lectures, but other than that, everything is from scratch.
What is the orange line?
the orange line should be the performance per threshold
How come it says feature selection?
Because this is just a name I gave it, since I used the features from SelectKBest 🙂
^ +1. Is this curve the performance on the train dataset or the test dataset?
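One way to answer the train-versus-test question is to compute the ROC on both splits and compare them. The sketch below uses synthetic data with some label noise (`flip_y=0.1`) as an assumption; the dataset and forest settings are illustrative only. `roc_curve` returns one (false positive rate, true positive rate) point per threshold, which is what a ROC plot draws.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data (assumption: not the real loan data).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Positive-class probabilities for the data the model saw and for held-out data.
train_scores = forest.predict_proba(X_train)[:, 1]
test_scores = forest.predict_proba(X_test)[:, 1]

# One (FPR, TPR) pair per decision threshold on the test split.
fpr, tpr, thresholds = roc_curve(y_test, test_scores)

train_auc = roc_auc_score(y_train, train_scores)
test_auc = roc_auc_score(y_test, test_scores)
print(f"train ROC AUC: {train_auc:.3f}")
print(f"test ROC AUC:  {test_auc:.3f}")
```

A near-perfect curve on the training split is expected for a random forest and says little on its own. The test-split curve is the one to judge, and a large gap between the two is a sign of overfitting.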