knotty sun Jun 28, 2024, 4:53 PM

#

What's it for?

weak jay Jun 28, 2024, 9:10 PM

#

Finally

#

I found it

#

So what's up everyone, how are you guys approaching it?

#

Such imbalanced data

stable barn Jun 29, 2024, 12:22 AM

#

weak jay So what's up everyone, how are you guys approaching it?

So, I haven't worked with this kind of data before (this is also my first Kaggle competition) so first I am thinking of cleaning the data and then subsequent stuff later on

fossil thicket Jun 29, 2024, 1:16 AM

#

The data is so large

weak jay Jun 29, 2024, 9:59 AM

#

It's so biased

#

We should train on previous years' data too, I guess

stiff sparrow Jun 29, 2024, 1:05 PM

#

Read the discussion, I think you can add cancerous data to treat the imbalanced aspect

obsidian leaf Jun 29, 2024, 1:40 PM

#

Hi everyone, I’m new here and need some help. Can someone please explain what I need to do for this competition in a simple way?

What do I need to do? - Develop a program to identify dangerous skin spots from images.
What language and tools should I use? - Use Python and tools like TensorFlow or PyTorch.
Which data should I use? - Use the Training and Test data provided in the competition files or data from outside.

Thanks a lot!

fossil thicket Jun 29, 2024, 5:56 PM

#

Welcome

gentle yarrow Jun 29, 2024, 9:00 PM

#

stiff sparrow Read the discussion, I think you can add cancerous data to treat the imbalanced...

wdym exactly?

stiff sparrow Jun 29, 2024, 9:11 PM

#

It's better to flag a case as cancerous when it's actually benign than to call a case benign when it's actually cancerous. From what I've seen in the discussions, they have around 300 cancerous cases and 4000-something non-cancerous ones. My opinion is that I need to get more cancerous data so the model can learn more.

fringe trench Jun 30, 2024, 3:35 PM

#

how would we go about adding more cancerous data?

digital sky Jun 30, 2024, 5:40 PM

#

I think we can try manipulating the existing cancerous images: flipping, inverting, blurring, or shifting the lesion to the side instead of centring it. Although I am not sure whether this might backfire and make it harder for the model :jerry_stare:

digital sky Jun 30, 2024, 6:12 PM

#

fringe trench how would we go about adding more cancerous data?

https://www.tensorflow.org/tutorials/images/data_augmentation

TensorFlow

Data augmentation | TensorFlow Core

fringe trench Jun 30, 2024, 7:10 PM

#

Thanks. Also, is there a way to calculate the score before I submit? None of the ways I tried are giving me accurate results.

fringe trench Jun 30, 2024, 7:10 PM

#

digital sky I think we can try things like manipulating existing cancerous images like flipp...

I read that some people are just integrating more images from prior years

crude mist Jun 30, 2024, 7:45 PM

#

obsidian leaf Hi everyone, I’m new here and need some help. Can someone please explain what I ...

1. No, your model doesn't have to point at dangerous skin spots in the images; that is called image segmentation. It's enough if your model only classifies an image as cancer or not cancer (and it will probably spit out a probability instead of a hard 1 or 0, which is what you submit in your submission.csv).
2. Whatever you like. If you can look at the image name or average color and tell if it's cancer or not, you can do that. All that the competition cares about is accurate classification, measured by the competition metric (pAUC). If you get a good score, it doesn't matter what method you used.
With that said, the method that most participants will likely go with is deep learning, either by training their own models from scratch or by fine-tuning pre-trained models (e.g., from Hugging Face).
3. You're going to use the competition training data mainly, because they will likely be the most similar to the test data, which will make it easier for your model to pick up patterns that it's used to. Using external data will definitely help, as long as they are quality data and they are accessible to everyone else in the competition. Also, they should be available under a license that allows you to open-source your model if you win the competition, as required by the competition rules.
A wise strategy to start with is to fine-tune a pre-trained model on a subset of the competition training data, then increase the size of this subset, use augmentation methods, use external data, and finally (if you are willing enough) create your own synthetic data from the competition data and use them for further training (which I suspect only very few will do, if it's even worth the effort).
By the way, the competition test data are not actually available for you to train your model on. The test files are only available as a placeholder for when you submit your notebook to make predictions on them behind the scenes. You will not be able to see them yourself.
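For checking a score locally before submitting, here is a minimal sketch of a partial-AUC function. It assumes the competition's pAUC is the area under the ROC curve above an 80% true-positive rate (so values fall between 0 and 0.2); that threshold and the name `partial_auc_above_tpr` are assumptions not stated in this thread, so check the competition's evaluation page before trusting the numbers.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def partial_auc_above_tpr(y_true, y_score, min_tpr=0.80):
    """Area under the ROC curve restricted to TPR >= min_tpr (assumed metric)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # Flip labels and negate scores: the region TPR >= min_tpr of the original
    # ROC curve becomes the region FPR <= 1 - min_tpr of the flipped curve.
    max_fpr = 1.0 - min_tpr
    scaled = roc_auc_score(1 - y_true, -y_score, max_fpr=max_fpr)

    # roc_auc_score(max_fpr=...) returns a standardized value
    # (0.5 = chance, 1.0 = perfect); undo that to get the raw partial area.
    min_area = 0.5 * max_fpr ** 2
    max_area = max_fpr
    return min_area + (max_area - min_area) * (scaled - 0.5) / 0.5
```

Call it on held-out labels and predicted probabilities, e.g. `partial_auc_above_tpr(y_val, val_prob)`. A value computed on the training rows themselves will look better than the leaderboard.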

obsidian leaf Jul 1, 2024, 8:12 AM

#

crude mist 1. No, your model doesn't have to point at dangerous skin spots from the images,...

Thanks a lot bro for your detailed explanation, now it's clear. Really, thanks a lot bro

hasty whale Jul 1, 2024, 7:35 PM

#

crude mist 1. No, your model doesn't have to point at dangerous skin spots from the images,...

Can I ask what augmentation methods there are to use and where is best to learn about them?

crude mist Jul 1, 2024, 7:55 PM

#

hasty whale Can I ask what augmentation methods there are to use and where is best to learn ...

There are plenty of image augmentation methods. The aim is to make a lot of image perturbations to introduce more variability in the image data provided they have little to no effect on the signal that you're trying to capture.
For example, when physicians try to diagnose skin cancer, they typically look at the color of the lesion, its shape, number, base, edge, margin, floor, among other things. Assuming the deep-learning model will use the same features to make its predictions, you can introduce random variability in the images that doesn't affect these features, like flipping the image horizontally or vertically, rotating the image by an arbitrary amount, shifting the brightness a little up or down, and so forth. These will multiply the number of training examples that you have and will make your model more robust to changes in the images that should NOT matter in its predictions, making it more likely to look at the right features and less likely to overfit to irrelevant features that may exist only in the training data.
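The label-preserving perturbations described above can be sketched in a few lines. This is a plain NumPy stand-in, not a library pipeline: the function name `augment`, the 0.1 brightness range, and the restriction to 90-degree rotations (rather than arbitrary angles) are all assumptions chosen to keep the example short.

```python
import numpy as np


def augment(image, rng, max_brightness_shift=0.1):
    """Return a randomly perturbed copy of an HxWxC float image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]   # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]   # vertical flip
    # Rotate by a random multiple of 90 degrees in the image plane
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    # Small global brightness shift, clipped back into the valid range
    shift = rng.uniform(-max_brightness_shift, max_brightness_shift)
    return np.clip(out + shift, 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))   # stand-in for a real lesion image
augmented = [augment(image, rng) for _ in range(8)]
```

Each call produces a different variant of the same lesion, so applying it on the fly during training gives the model a fresh view every epoch without storing extra files.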

#

Note: There is an "art" part to the image augmentation process, and domain knowledge will help you make the right decision on which augmentations to use, and more importantly which to AVOID. Also, one of the powerful characteristics of deep-learning models is that they craft their own features, meaning that sometimes models will look at things that humans wouldn't even consider bothering with. For example, let's imagine that there is a metric called "skin color - ulcer color index" defined as the skin tone in RGB divided by the color of the lesion in RGB, and that this "index" actually turns out to be useful in predicting lesion malignancy. Human physicians would never be able to use this feature by eye, but a deep-learning model could, and can actually arrive at it via gradient descent. Now, wrongly assuming that your model will not use such an "index" (and picking augmentations that distort it) will actually hurt your model's performance. So, you should be careful about what assumptions you make and should experiment well with different augmentations, loss functions, etc.

#

By the way, here is an index of potential augmentation methods you can use in pytorch, provided by a package called "albumentations".
https://albumentations.ai/docs/api_reference/augmentations/

Albumentations Documentation - Index

Albumentations provides a comprehensive, high-performance framework for augmenting images to improve machine learning models. Ideal for computer vision applications, supporting a wide range of augmentations.

fringe trench Jul 3, 2024, 8:42 PM

#

Has anyone had any success with images?

deft scarab Jul 4, 2024, 12:03 AM

#

Has anyone gotten Error: Submission CSV Not Found, despite the submission being in the output folder? I posted a discussion on it. Can't seem to figure out how to debug it

quartz herald Jul 4, 2024, 7:41 AM

#

deft scarab Has anyone gotten Error: Submission CSV Not Found, despite submission being in t...

Are you submitting a notebook that saves its predictions as `submission.csv`?

deft scarab Jul 5, 2024, 3:22 PM

#

quartz herald Are you submitting NB which saves predictions as `submission.csv`?

turns out i had too many files in my output directory. i had to !rm all the files except submission.csv
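The cleanup described above can also be done from Python instead of `!rm`. A minimal sketch, assuming the notebook writes to a single output folder (on Kaggle this is usually `/kaggle/working`) and that only top-level files need removing; `keep_only_submission` is a hypothetical helper name.

```python
from pathlib import Path


def keep_only_submission(output_dir, keep="submission.csv"):
    """Delete every top-level file in output_dir except `keep`.

    Subdirectories are left untouched. Returns the removed file names.
    """
    removed = []
    for path in Path(output_dir).iterdir():
        if path.is_file() and path.name != keep:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Example call at the end of a notebook (path is an assumption):
# keep_only_submission("/kaggle/working")
```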

twilit aspen Jul 10, 2024, 11:38 AM

#

I get this error while downloading the data via API, any suggestions?

403 - Forbidden - You must accept this competition's rules before you'll be able to download files.

PS: I have already accepted the rules for the competition

hasty whale Jul 11, 2024, 8:49 PM

#

twilit aspen I get this error while downloading the data via API, any suggestions? 403 - Fo...

Have you verified your identity?

twilit aspen Jul 11, 2024, 9:10 PM

#

Yea, nvm the issue has been solved, there was a bug in the config file that needed fixing

novel falcon Jul 12, 2024, 10:01 PM

#

Hey guys, can you share your score if you have used CNN-based models, not LGBM or ...?

void solar Jul 15, 2024, 9:52 PM

#

hello isic-2024-challenge

#

Discussion Forum says - Discord is intended to be a more casual space, so be it ☕ 🐉

void solar Jul 15, 2024, 10:27 PM

#

jpg train images are redundant. hdf5 contains the same?

#

find out how to read hdf5 and see it for myself.

void solar Jul 15, 2024, 11:03 PM

#

hdf5 contains 401059 images, there are 401059 jpgs. so the number checks out.

void solar Jul 16, 2024, 12:10 AM

#

checked all the images one by one and they all turned out to be the same. 👍

#

next is to build a baseline classifier model and submit for a test

void solar Jul 16, 2024, 2:10 AM

#

interestingly, many of the public notebooks do not use images for training; they use tabular data only. this is very new to me. 🤔

void solar Jul 16, 2024, 8:01 AM

#

ok, tabular vs image was discussed enough in the forum.

void solar Jul 16, 2024, 7:35 PM

#

hope for the best, embrace for the worst. 🤣

buoyant anchor Jul 17, 2024, 8:40 PM

#

Bit of a noob question, I have a question about the "no internet access" if anyone knows. I would like to download previous years datasets. I assume I can't do this during runtime? But is it okay for me to add it to my notebook files directly?

unreal flower Jul 21, 2024, 6:57 AM

#

I think you'd need to upload them and then access them from within the kaggle directory, not 100% sure myself though, first time using the notebooks myself

cedar karma Jul 21, 2024, 1:53 PM

#

Hi, guys! Do I have to mention on the forum that I'm using external data given it is under CC BY 4.0 license? I remember there was an External Data Thread in every competition but not anymore.

stiff sparrow Jul 21, 2024, 5:39 PM

#

guys, I made a lot of wrong choices in my life

#

but running a transformer on cpu

#

is the worst

somber dirge Jul 24, 2024, 10:33 AM

#

How to use GCP credits for submission??

somber abyss Jul 25, 2024, 5:35 PM

#

void solar ok, tabular vs image was discussed enough in the forum.

Hi. I just joined. Could you tell me where I can find those discussions?

unkempt sorrel Jul 30, 2024, 3:46 PM

#

hello guys i'm kinda new to kaggle competitions and AI for medical uses, i'm wondering if we're supposed to predict the "target" column in the .csv file because there is a huge class imbalance, for instance there are about 1k images with target = 1 and the rest which is around 400k with target = 0, is it normal ?

novel falcon Jul 31, 2024, 11:01 AM

#

unkempt sorrel hello guys i'm kinda new to kaggle competitions and AI for medical uses, i'm won...

not normal, but that's the available data

unkempt sorrel Jul 31, 2024, 2:20 PM

#

novel falcon not normal but thats the available data

i see

#

well thank you sir

amber orchid Aug 3, 2024, 1:18 PM

#

Hi ! I'm a first-timer in a Kaggle competition. Getting my head around the skin cancer dataset. Can I ask a stupid question?
The AI model we are developing is for an end user to take a mobile photo at home and submit this. So there won't be any other information provided to our model for inference. So is there anything useful to us about all the metadata in the dataset? Surely we just focus on the photos and labels?

dense crescent Aug 4, 2024, 2:42 AM

#

amber orchid Hi ! I'm a first-timer in a Kaggle competition. Getting my head around the skin ...

Hey,
So just a reminder: don't submit the entire app code to the competition, just the model.
Second, in my model I have decided to use only images, and with some model tweaking I have been getting good results. The main problem though is that it takes a long time to train.

#

Hey Guys,
I am Ishaan. So I was wondering if you guys knew of any good ways of tweaking a model so that it can run faster?
I have used normalization, data parallelism, max pooling, and batch normalization. Yet it's still taking 15 mins per epoch

dense crescent Aug 4, 2024, 10:03 PM

#

@crude mist

#

Could you please help answer my question?

stiff sparrow Aug 6, 2024, 5:06 PM

#

Could I submit only my Jupyter notebook, or only my CSV file?

tidal sky Aug 8, 2024, 7:48 AM

#

Hey I'm getting submission error

#

rn_image_picker_lib_temp_d3948cc0-33ce-4163-89ca-3091fadcfbed.jpg

#

rn_image_picker_lib_temp_96198044-d25e-4be4-a20b-3c0774af6cb0.jpg

#

Somebody help me

hollow turtle Aug 8, 2024, 5:06 PM

#

tidal sky Hey I'm getting submission error

Hi, you're supposed to run inference with the trained model(s) in a notebook, not just upload the CSV files.

#

Check some inference notebooks in the code section of that particular competition

livid rapids Aug 12, 2024, 8:14 PM

#

dense crescent Could you please help answer my question?

You should use smaller models to start

stoic osprey Aug 13, 2024, 7:38 PM

#

Hi Everyone. I apologize for not posting in the 'General' area but I wanted to talk directly with the people on this challenge. I am very interested in this specific challenge. My wife died from cancer and I am trying to focus the rest of my life on researching cancer detection and treatments. I've been studying ML, DL and LLM's for the past six months. I literally just started learning CNN's and ResNet50 architectures last week.

I know it's late in this competition, but are there any teams that would accept me and let me look over your shoulder and learn?

dense crescent Aug 14, 2024, 4:57 AM

#

Hey, I had a question.
How is my public score decreasing when the notebook takes less time and also has better accuracy?

feral bison Aug 14, 2024, 8:33 AM

#

Hi everyone, I have a dumb question.
This isic-2024 challenge requires internet to be off, right? Then if I want to install some libraries, like 'pip install <library>', that won't be possible. What could be the alternative?

novel falcon Aug 14, 2024, 1:03 PM

#

feral bison Hi everyone, I have a dumb question. This isic-2024 challenge requires internet ...

You can train the model in a separate notebook with the internet on. Then, when you want to submit, upload the trained model to Kaggle Datasets (your own, private!) and load it from there in the submission notebook, with the internet off.

crude mist Aug 14, 2024, 10:39 PM

#

dense crescent Could you please help answer my question?

Hi! Sorry, I've been busy for the past month or so, and I've been offline most of the time.
Anyway, using a GPU and parallel processing (like multi-threading and/or multi-processing) is powerful for sure in speeding things up, but they only take you so far without using better resources 😦
As for normalization and batch normalization, these actually slow down your training. They're not used to speed it up. Batch normalization is used to make it easier for gradient descent to arrive at a good minimum, and maybe it can speed up the training by reducing the number of epochs you need to get there, but it's expected to take more time per epoch for sure. Data normalization is optimally only done once before feeding the data into the model, and if you're doing it on the fly (i.e., every training cycle), it's also expected to take more time. Its main use is to make it easy for the model to pick up patterns in the data and avoid giving more emphasis to features with larger values, and also to avoid exploding/vanishing gradients in deeper models.
I agree with @livid rapids's advice. You should try out smaller models in your initial experiments, and then invest in more depth in the architecture you see promising, and only if it proves beneficial to the score.

crude mist Aug 14, 2024, 10:42 PM

#

dense crescent Hey I had a question? so how is my public score decreasing when it takes less t...

Maybe you're overfitting to the training data. Make sure you use cross validation when evaluating your model(s) locally, and make sure the evaluation data (dev set or test set) do not leak into the training process (i.e., you shouldn't do backpropagation on them). Oh, and time has nothing to do with the score. The time limit only determines whether the notebook counts as a valid submission (it should run in no more than 9 hours).

#

@feral bison
Well, this condition (the internet off thing) is only to prevent participants from leaking the test data to a server that they have access to. Imagine someone submitting a notebook that reads in the test data (which is only available after you submit the notebook) and uploads them to their Google Drive.
A trick for your situation that kagglers do is that you can download the .whl files for the packages that you need to install at test time, upload them to a personal (private) dataset, add the dataset to your notebook, and use pip to install those .whl files offline. You can refer to this answer to know how:
https://stackoverflow.com/questions/27885397/how-do-i-install-a-python-package-with-a-whl-file

Stack Overflow

How do I install a Python package with a .whl file?

I'm having trouble installing a Python package on my Windows machine, and would like to install it with Christoph Gohlke's Window binaries. (Which, to my experience, alleviated much of the fuss for...

#

To download .whl files, go to PyPI and search for your desired package. For example, here are the .whl files for numpy: https://pypi.org/project/numpy/#files

Make sure you download a manylinux version of your package, as kaggle runs on linux behind the scenes.

PyPI

numpy

Fundamental package for array computing in Python
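The offline install trick above can be sketched from Python as follows. It relies on pip's `--no-index` and `--find-links` options, which tell pip to skip PyPI and look only in a local folder; the dataset path `/kaggle/input/my-wheels` and both helper names are hypothetical.

```python
import subprocess
import sys


def offline_install_command(wheel_dir, packages):
    """Build a pip command that installs only from local .whl files."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                     # never contact PyPI
        "--find-links", str(wheel_dir),   # look for wheels in this folder
        *packages,
    ]


def install_offline(wheel_dir, packages):
    """Run the offline install; raises CalledProcessError if pip fails."""
    subprocess.run(offline_install_command(wheel_dir, packages), check=True)


# Example (path and package name are assumptions):
# install_offline("/kaggle/input/my-wheels", ["numpy"])
```

If a package has dependencies, their wheels need to be in the same folder too, since pip cannot fetch them with the internet off.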

dense crescent Aug 14, 2024, 10:50 PM

#

crude mist Maybe you're overfitting to the training data. Make sure you use cross validatio...

Thanks a lot. Quick question: how does my score get calculated then? I got a higher accuracy, so shouldn't my score increase?

crude mist Aug 14, 2024, 10:51 PM

#

dense crescent Thanks a lot, quick question, how does my score get calculated then. I got a hig...

What do you mean you got a higher accuracy? I presume you're referring to the accuracy you calculated locally using the training data, right?

dense crescent Aug 14, 2024, 10:51 PM

#

yup

#

I split the data into testing and train

crude mist Aug 14, 2024, 10:53 PM

#

dense crescent I split the data into testing and train

Great. Isn't it possible that the testing data have easier examples than the hidden test data at kaggle?

dense crescent Aug 14, 2024, 10:53 PM

#

crude mist Great. Isn't it possible that the testing data have easier examples than the hid...

Oh yeah true,

#

Is there no way to check what was our accuracy with the test data, or is that the public score

#

Currently my score is 0.06 something

crude mist Aug 14, 2024, 10:56 PM

#

To get over that, people use cross validation. They don't only evaluate their model on one test set. They make five splits of the training data for example and train their model five times, each time on 4/5 of the data to be tested on the final 1/5. A different 1/5 of the data is used for evaluation each time and then you get the average "accuracy" or whatever evaluation metric you're using.

#

That would be a more accurate estimation of your score.
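The five-split procedure described above might look like the sketch below. It swaps in scikit-learn's `StratifiedKFold` (a stratified variant of plain k-fold, so every fold keeps some positives), and the synthetic data, the logistic-regression model, and ROC AUC as the metric are all placeholder assumptions for whatever you are actually training and scoring.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in data: 50 positives, 950 negatives
rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 950)
X = rng.normal(size=(len(y), 5)) + y[:, None] * 1.5  # positives are shifted

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in cv.split(X, y):
    # Train on 4/5 of the data, evaluate on the held-out 1/5
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    val_prob = model.predict_proba(X[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[val_idx], val_prob))

mean_score = float(np.mean(fold_scores))
print(f"per-fold: {np.round(fold_scores, 3)}, mean: {mean_score:.3f}")
```

The spread between folds is worth looking at too: a large spread suggests that any single train/test split could be misleading.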

dense crescent Aug 14, 2024, 10:56 PM

#

crude mist To get over that, people use cross validation. They don't only evaluate their mo...

ok thanks a lot

crude mist Aug 14, 2024, 10:56 PM

#

And even then, it's just an estimation. Leaderboard (LB) scores can be and are probably different.

#

Just make sure you don't overfit to your training data and you'll do well on both the public and private leaderboards.

crude mist Aug 14, 2024, 10:57 PM

#

dense crescent ok thanks a lot

Sure, no problem.

unkempt quest Aug 18, 2024, 4:09 PM

#

Is setting the stratify parameter to the target values on train_test_split() enough to remedy the imbalanced dataset, or should I compute class weights as well and pass them to a classifier

languid saffron Aug 19, 2024, 9:32 AM

#

Hey a beginner question,
I am building a model, and it is taking 2 hrs to train, but during that time I get a pop-up ('Are you still there?') and my session ends. The suggestion is to use Save Version and Run All, but then I would not get the state of the notebook: I would be unable to keep coding after the notebook runs because I would not have the variables. How do I get around this issue?

languid saffron Aug 19, 2024, 12:48 PM

#

Guys, can someone please help me with this? VERY URGENT

feral bison Aug 21, 2024, 11:35 AM

#

crude mist @feral bison Well, this condition (the internet off thing) is only to ...

Hi, if this is the case, could you please tell me how Kagglers are able to submit notebooks with "internet on", for example this notebook:
https://www.kaggle.com/code/motono0223/isic-pytorch-training-baseline-image-only#Training-Configuration-⚙️

[ISIC] Pytorch Training Baseline (Image only)

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

dense crescent Aug 21, 2024, 5:19 PM

#

unkempt quest Is setting the stratify parameter to the target values on train_test_split() eno...

Computing class_weights was what I did to make sure that the data was balanced

spring dome Aug 26, 2024, 7:31 PM

#

unkempt quest Is setting the stratify parameter to the target values on train_test_split() eno...

They are essentially two different things. Stratifying is used so that the proportion of positives is roughly the same across splits. Class weights are used to strengthen/dampen the effect of samples from each class on the loss function (and, thus, on model learning). Using both is nice (although you can argue that weighting messes with the expected probabilities of the classes)
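To make the difference concrete, here is a small sketch using both on synthetic data. The 1% positive rate and the feature matrix are made up; `train_test_split(..., stratify=y)` and `compute_class_weight("balanced", ...)` are the scikit-learn calls, and how the resulting weights get passed on depends on the classifier you use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced labels: 20 positives out of 2000 samples
rng = np.random.default_rng(0)
y = np.array([1] * 20 + [0] * 1980)
X = rng.normal(size=(len(y), 4))

# Stratifying keeps the positive rate the same in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Class weights change how much each class counts in the loss;
# "balanced" uses n_samples / (n_classes * class_count)
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(y_train.mean(), y_val.mean())  # same positive rate in both splits
print(class_weight)                  # the rare class gets the larger weight
```

Stratification alone leaves the 1% positive rate untouched; it only makes sure the validation split is not missing positives by bad luck.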

meager juniper Aug 26, 2024, 7:57 PM

#

spring dome They are essentially two different things. Stratifying is used so that the propo...

With an imbalance this severe, would you argue that using both is better?

spring dome Aug 26, 2024, 9:13 PM

#

using stratification is mandatory

#

it ensures that your splits will contain roughly the same class proportions

#

the weighting can be substituted for a loss function that handles imbalance like focal loss

#

you may also use under and oversampling

#

I think they all mess with probabilities (models are uncalibrated and their outputs will not be close to actual event probabilities)
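For reference, a minimal NumPy sketch of the binary focal loss mentioned above: it multiplies the usual cross-entropy term by (1 - p_t)^gamma so that easy, well-classified examples contribute less. The defaults gamma=2 and alpha=0.25 are common choices rather than anything recommended in this thread, and in practice you would use your framework's differentiable version instead of this one.

```python
import numpy as np


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss for labels in {0, 1} and predicted P(y = 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)

    # p_t is the probability the model assigned to the true class
    p_t = np.where(y_true == 1, p, 1.0 - p)
    # alpha weights positives, (1 - alpha) weights negatives
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)

    # (1 - p_t)^gamma shrinks the loss of confidently correct examples
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary binary cross-entropy, which is a handy sanity check.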

spring dome Aug 26, 2024, 9:16 PM

#

meager juniper Having this severe unbalance would you argue using both is better?

so yes, using both may be one approach

#

I suspect this might be just another Kaggle competition with most models hardly having any actual applicability (so hacking your way through might actually get you a medal)

stoic osprey Aug 29, 2024, 6:57 PM

#

Anyone know why I can not use ResNet152V2Backbone? I keep getting an error that it cannot find the preset "resnet152_v2".

unreal geyser Aug 30, 2024, 10:19 AM

#

Which data is mismatched or wrongly marked?

hazy thunder Aug 30, 2024, 8:57 PM

#

Hey guys! Does anyone have a tool for pulling out the tabular features from additional datasets? I can’t find one anywhere.

hollow turtle Sep 2, 2024, 3:26 PM

#

stoic osprey Anyone know why I can not use ResNet152V2Backbone? I keep getting an error that ...

Hi Brian, I haven’t tried that backbone, so I’m not sure. Are you using Keras or timm? In my experience, the best-performing backbones are the smaller ones.

stoic osprey Sep 3, 2024, 3:18 AM

#

hollow turtle Hi Brian, I haven’t tried that backbone, so I’m not sure. Are you using Keras or...

Hi Maximiliano, thanks for the reply. I am using Keras. I'll try again with ResNet50

hollow turtle Sep 3, 2024, 11:18 AM

#

stoic osprey Hi Maximiliano, thanks for the reply. I am using Keras. I'll try again with ResN...

You're welcome, Brian. Are you working locally or on Kaggle? Perhaps you don't have the latest version installed. I'm not a Keras user, so I'm not quite sure. Regarding the models, ResNets don't perform very well, at least in my case. I achieved 0.149 on the LB with a tiny_vit_5m_224 and tf_efficientnetv2_b0. And to reduce the gap between CV and LB, I had to use undersampling of the majority class.

stoic osprey Sep 3, 2024, 7:58 PM

#

Thanks for the feedback! I was working locally but now I'm mostly running on Kaggle. I haven't tried tiny_vit_5m_224. My best has been 0.135. 😕

hollow turtle Sep 3, 2024, 9:28 PM

#

You're welcome! Perhaps that was because you were trying bigger architectures; the smaller ones work better in this competition. Good luck!

tidal sky Sep 3, 2024, 10:42 PM

#

hollow turtle You're welcome, Brian. Are you working locally or on Kaggle? Perhaps you don't h...

You got that 0.149 score with that data only? Or did you use other datasets too? What was your class 1 to class 0 ratio?

hollow turtle Sep 3, 2024, 11:05 PM

#

tidal sky U got that 0.149 score with that data only ? Or did you used other dataset too ?...

For that data, do you mean the 2024 data? If that’s the case, yes, I only use the actual competition data. I don’t remember the exact class ratio, but it wasn’t as extreme as in the shared notebooks. I used 25% of the negative samples (around 100k), stratified over patient_id to ensure all patients were included.
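One way to read the undersampling described above is sketched below: keep every positive, and keep a fraction of each patient's negatives with at least one per patient so nobody drops out. This is an interpretation, not the poster's actual code; `undersample_negatives` is a hypothetical helper, and `patient_id` and `target` are the column names mentioned in this thread.

```python
import numpy as np


def undersample_negatives(patient_id, target, frac=0.25, seed=0):
    """Return sorted row indices: all positives plus ~frac of each patient's negatives."""
    rng = np.random.default_rng(seed)
    patient_id = np.asarray(patient_id)
    target = np.asarray(target)

    keep = np.flatnonzero(target == 1).tolist()   # never drop positives
    neg_idx = np.flatnonzero(target == 0)
    for pid in np.unique(patient_id[neg_idx]):
        rows = neg_idx[patient_id[neg_idx] == pid]
        # Keep at least one negative so every patient stays represented
        n_keep = max(1, int(round(frac * len(rows))))
        keep.extend(rng.choice(rows, size=n_keep, replace=False).tolist())
    return np.sort(np.array(keep))
```

The returned indices can be used to subset the training table before building folds, e.g. `train_df.iloc[idx]` if the metadata is in a pandas DataFrame.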

pastel forge Nov 19, 2024, 5:28 AM

#

Hi,
I was trying to use ImageDataGenerator for this competition. However, when I run the cell, it keeps running. I have given TRAIN_IMAGE_PATH as the path to the training image directory. Any suggestions on how to deal with this?

#

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# TRAIN_IMAGE_PATH is assumed to be defined earlier in the notebook
datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # Normalize pixel values to [0, 1]
    rotation_range=30,        # Random rotation
    width_shift_range=0.2,    # Random horizontal shift
    height_shift_range=0.2,   # Random vertical shift
    shear_range=0.2,          # Shearing transformations
    zoom_range=0.2,           # Random zoom
    horizontal_flip=True,     # Flip horizontally
    fill_mode='nearest'       # Fill strategy for empty pixels
)

# Flow from directory
train_generator = datagen.flow_from_directory(
    TRAIN_IMAGE_PATH,         # Path to training image directory
    target_size=(224, 224),   # Resize all images to 224x224
    batch_size=32,            # Number of images per batch
    class_mode='binary'       # Output labels (e.g., binary classification)
)

This is my code for reference
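One thing that may be worth checking here: `flow_from_directory` infers the labels from the folder layout and expects one subdirectory per class, and it has to scan the whole tree before it returns, which can take a while with roughly 400k JPGs. The sketch below only inspects the layout; `summarize_image_dir` is a hypothetical helper, and whether the training images actually sit in one flat folder is an assumption, not something stated in this thread.

```python
import os


def summarize_image_dir(path):
    """Count immediate subdirectories (candidate classes) and loose files."""
    subdirs, files = 0, 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir():
                subdirs += 1
            elif entry.is_file():
                files += 1
    return {"class_subdirs": subdirs, "loose_files": files}


# Example (TRAIN_IMAGE_PATH as in the snippet above):
# print(summarize_image_dir(TRAIN_IMAGE_PATH))
```

If this reports zero or one subdirectory, the generator has no per-class folders to read labels from, and the labels would have to come from the metadata CSV instead.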

#isic-2024-challenge
