#planttraits2024 | Kaggle | Page 1

eager radish Feb 21, 2024, 7:02 PM

#

I found that submitting the mean of each column gives a bad score (it should be 0), and submitting the mean clipped at the 0.001 and 0.999 quantiles gives a score of -0.001. But that implies the average leaf area in the test dataset is about 2200 (I guess cm^2?), which is huge; almost no leaves are that big, and most of the big leaves seem mislabeled. The same thing happens with the seeds. So I think a big part of the extreme values comes from mislabeled data, and looking at the images confirms this. I think whoever created the competition should have used the log of the columns (except X4) instead. With the raw values the scoring is kind of pointless: right now it doesn't matter if you mistake a 1 milligram seed for a 1 gram seed, or a 1 square centimeter leaf for a 10 square centimeter leaf. Only the extreme values matter in the score.
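To make the point about extreme values concrete, here is a rough sketch. It assumes the score is an R2-style metric (which the "it should be 0" remark suggests) and uses made-up leaf areas, not competition data. The `r2_score` helper and the numbers are illustrative only.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical leaf areas: mostly small, plus one extreme (possibly mislabeled) value.
y = np.array([1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 20000.0])

# Predicting the mean of the same data scores exactly 0 by construction.
mean_baseline = np.full_like(y, y.mean())
r2_mean = r2_score(y, mean_baseline)

# Get every small leaf wrong by a factor of 10, but the extreme value exactly right.
pred = y.copy()
pred[:-1] *= 10

r2_raw = r2_score(y, pred)                      # stays close to 1
r2_log = r2_score(np.log10(y), np.log10(pred))  # drops a lot

print(f"mean baseline: {r2_mean:.3f}, raw: {r2_raw:.4f}, log10: {r2_log:.3f}")
```

On raw values the tenfold errors on the small leaves barely move the score because the single large value dominates the variance. After a log transform the same errors are clearly penalized.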

quasi ore Feb 28, 2024, 4:55 PM

#

Hi, does anyone know why on the leaderboard the top submission is miles better than everyone else?

#

Could it be a case of overfitting to the public test set?

heavy kite Mar 6, 2024, 9:02 AM

#

According to these two messages from @eager radish and @quasi ore, in my opinion there is mislabeled data, and I think it's because there is no clear description: even on the website with the data there is no information about units. So maybe some of the target values are in cm2, some in mm2, and so on... About the huge gap on the leaderboard: maybe they found the distribution of the hidden target, or they somehow found those plants in the TRY Plant Trait Database dataset 😛

quasi ore Mar 6, 2024, 10:54 AM

#

heavy kite according to these two messages @eager radish and @quasi ore ...

You can do that? 🤯

#

Would something like that hold up when it comes to the hidden eval set?

heavy kite Mar 6, 2024, 11:26 AM

#

True, it wouldn’t 😆 So maybe something based on images, like a classifier first and then regression 🤷🏻‍♀️
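A toy sketch of what "classifier first, then regression" could look like. Nobody in the thread ran this. It assumes image features are already extracted into a matrix `X` and that some group label (for example species) is available for training. The nearest-centroid classifier, the per-group linear fit, and the names `fit_two_stage` / `predict_two_stage` are all hypothetical stand-ins.

```python
import numpy as np

def fit_two_stage(X, labels, y):
    """Per group: store the feature centroid and a least-squares linear fit."""
    models = {}
    for c in np.unique(labels):
        mask = labels == c
        # Append a bias column so the linear fit has an intercept.
        Xc = np.hstack([X[mask], np.ones((mask.sum(), 1))])
        coef, *_ = np.linalg.lstsq(Xc, y[mask], rcond=None)
        models[c] = (X[mask].mean(axis=0), coef)
    return models

def predict_two_stage(models, X):
    classes = list(models)
    centroids = np.stack([models[c][0] for c in classes])
    # Stage 1: assign each sample to the nearest group centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    # Stage 2: apply the regression fitted for the assigned group.
    Xb = np.hstack([X, np.ones((len(X), 1))])
    preds = np.empty(len(X))
    for i, idx in enumerate(nearest):
        preds[i] = Xb[i] @ models[classes[idx]][1]
    return preds

# Synthetic stand-in for image features: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
# Each group follows its own (made-up) linear relation to the trait.
y = np.where(labels == 0, 1 + X[:, 0], 100 + 2 * X[:, 1])

models = fit_two_stage(X, labels, y)
preds = predict_two_stage(models, X)
```

In practice both stages would probably be learned models on top of an image backbone. The point is only the structure: pick a group first, then regress within that group.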