#isic-2024-challenge
Finally
I found it
So, what's up everyone? How are you guys approaching it?
Such imbalanced data
So, I haven't worked with this kind of data before (this is also my first Kaggle competition) so first I am thinking of cleaning the data and then subsequent stuff later on
The data is so large
Read the discussion, I think you can add cancerous data to treat the imbalanced aspect
Hi everyone, I’m new here and need some help. Can someone please explain what I need to do for this competition in a simple way?
- What do I need to do? - Develop a program to identify dangerous skin spots from images.
- What language and tools should I use? - Use Python and tools like TensorFlow or PyTorch.
- Which data should I use? - Use the Training and Test data provided in the competition files or data from outside.
Thanks a lot!
Welcome
wdym exactly?
it's better to flag a benign case as cancerous than to miss a cancerous case by calling it benign. From what I've seen in the discussions, they have around 300 cancerous cases and 4000-something non-cancerous ones. My opinion is that I need to get more cancerous data so the model can learn more
how would we go about adding more cancerous data?
I think we can try things like manipulating existing cancerous images: flipping, inverting, blurring, or shifting the lesion to the side instead of centring it. Although, I am not sure if this might backfire and make it harder for the model
Thanks. Also, is there a way to calculate the score before I submit? All the ways I tried aren't giving me accurate results
I read that some people are just integrating more images from prior years
- No, your model doesn't have to point at dangerous skin spots in the images, which is called image segmentation. It's enough if your model only classifies an image as cancer or not cancer (and it will probably spit out a probability instead of a hard 1 or 0, which is what you submit in your submission.csv)
- Whatever you like. If you can look at the image name or average color and tell if it's cancer or not, you can do that. All that the competition cares about is accurate classification, measured by the competition metric (pAUC). If you get a good score, it doesn't matter what method you used.
With that said, the method that most participants will likely go with is deep learning, either by training their own models from scratch or by fine-tuning pre-trained models (e.g., from Hugging Face)
- You're going to use the competition training data mainly, because it will likely be the most similar to the test data, which will make it easier for your model to pick up patterns that it's used to. Using external data will definitely help, as long as it is quality data and accessible to everyone else in the competition. Also, it should be available under a license that allows you to open-source your model if you win the competition, as required by the competition rules.
A wise strategy to start with is to fine-tune a pre-trained model on a subset of the competition training data, then increase the size of this subset, use augmentation methods, use external data, and finally (if you are willing enough) create your own synthetic data from the competition data and use them for further training (which I suspect only very few will do, if it's even worth the effort)
By the way, the competition test data are not actually available for you to train your model on. The test files are only available as a placeholder for when you submit your notebook to make predictions on them behind the scenes. You will not be able to see them yourself.
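On checking a score locally before submitting: the metric named above is pAUC. Below is a minimal sketch, assuming the competition defines it as the partial area under the ROC curve above an 80% true positive rate (so values range from 0 to 0.2). The function name `pauc_above_tpr` and the `min_tpr` default are illustrative, and the official scoring code may treat edge cases differently.

```python
import numpy as np


def pauc_above_tpr(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr (assumed metric)."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.all() or not y_true.any():
        raise ValueError("need at least one positive and one negative label")

    # Sort by descending score and keep one ROC point per distinct threshold.
    order = np.argsort(-y_score, kind="mergesort")
    y_true, y_score = y_true[order], y_score[order]
    last_of_tie = np.r_[np.flatnonzero(np.diff(y_score)), y_true.size - 1]
    tps = np.cumsum(y_true)[last_of_tie]
    fps = (last_of_tie + 1) - tps
    tpr = np.r_[0.0, tps / tps[-1]]
    fpr = np.r_[0.0, fps / fps[-1]]

    # Integrate max(TPR - min_tpr, 0) over FPR, one ROC segment at a time.
    area = 0.0
    for x0, x1, y0, y1 in zip(fpr[:-1], fpr[1:], tpr[:-1], tpr[1:]):
        low, high = y0 - min_tpr, y1 - min_tpr
        if high <= 0:
            continue  # segment lies entirely below the TPR floor
        if low >= 0:
            area += (x1 - x0) * (low + high) / 2  # trapezoid above the floor
        else:
            # Segment crosses the floor: only the upper triangle counts.
            area += (x1 - x0) * (high / (high - low)) * high / 2
    return float(area)


labels = [0, 0, 0, 1, 0, 1]          # toy ground truth
perfect = [0.1, 0.2, 0.3, 0.9, 0.4, 0.8]  # every positive outranks every negative
print(pauc_above_tpr(labels, perfect))    # approximately 0.2, the assumed maximum
```

Computing this on held-out folds of the training data gives a number on the same scale as the leaderboard, though it is still only an estimate.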
Thanks a lot bro for your detailed explanation now I am clear, really thanks a lot bro
Can I ask what augmentation methods there are to use and where is best to learn about them?
There are plenty of image augmentation methods. The aim is to make a lot of image perturbations to introduce more variability in the image data provided they have little to no effect on the signal that you're trying to capture.
For example, when physicians try to diagnose skin cancer, they typically look at the color of the lesion, its shape, number, base, edge, margin, floor, among other stuff. Assuming the deep-learning model will use the same features to make its predictions, you can introduce random variability in the images that doesn't affect these features, like flipping the image horizontally or vertically, rotating the image by an arbitrary amount, shifting the brightness a little up or down, and so forth. These will multiply the number of training examples that you have and will make your model more robust to changes in the images that should NOT matter in its predictions, making it more likely to look at the right features and less likely to overfit to irrelevant features that only exist in the training data.
Note: There is an "art" part of the image augmentation process, and domain knowledge will help you make the right decision on which augmentations to use and, more importantly, which to AVOID. Also, one of the powerful characteristics of deep-learning models is that they craft their own features, meaning that sometimes models will look at things that humans wouldn't even consider bothering with. For example, let's imagine that there is a metric called "skin color - ulcer color index" defined as the skin tone in RGB divided by the color of the lesion in RGB, and that this "index" actually turns out to be useful in predicting lesion malignancy. Human physicians would never be able to use this feature by eye, but a deep-learning model could, and can actually arrive at it via gradient descent. Now, wrongly assuming that your model will not use such an "index" (for example, by applying color augmentations that destroy it) will actually hurt your model's performance. So, you should be careful about what assumptions you make and should experiment well with different augmentations, loss functions, etc.
By the way, here is an index of potential augmentation methods you can use in pytorch, provided by a package called "albumentations".
https://albumentations.ai/docs/api_reference/augmentations/
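The label-preserving perturbations described above (flips, rotations, small brightness shifts) can be sketched with plain NumPy; albumentations offers ready-made, configurable versions of the same ideas. In this sketch the function name `augment`, the restriction to 90-degree rotations, and the brightness range of ±0.1 are all assumptions chosen for illustration, not recommendations for this dataset.

```python
import numpy as np


def augment(image, rng):
    """Return a randomly flipped, rotated and brightness-shifted copy.

    `image` is an H x W x 3 float array with values in [0, 1].
    """
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :, :]  # vertical flip
    # Rotate by 0, 90, 180 or 270 degrees in the image plane.
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    shift = rng.uniform(-0.1, 0.1)  # assumed brightness range
    return np.clip(out + shift, 0.0, 1.0)


rng = np.random.default_rng(0)
lesion = rng.random((64, 64, 3))  # random stand-in for a real lesion image
batch = [augment(lesion, rng) for _ in range(4)]
print([a.shape for a in batch])
```

Note that the brightness shift changes color, which is exactly the kind of augmentation the note above says to test carefully rather than assume is harmless.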
Has anyone had any success with images?
Has anyone gotten Error: Submission CSV Not Found, despite the submission being in the output folder? I posted a discussion on it. Can't seem to figure out how to debug it
Are you submitting NB which saves predictions as submission.csv?
turns out i had too many files in my output directory. i had to !rm all the files except submission.csv
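A Python equivalent of that cleanup, as a minimal sketch: it deletes everything in an output directory except submission.csv. The helper name `keep_only_submission` is hypothetical, the demo runs on a throwaway directory, and the `isic_id,target` header written in the demo is an assumption about the submission format.

```python
from pathlib import Path
import shutil
import tempfile


def keep_only_submission(output_dir, keep="submission.csv"):
    """Delete every file and folder in output_dir except `keep`."""
    removed = []
    for entry in Path(output_dir).iterdir():
        if entry.name == keep:
            continue
        if entry.is_dir():
            shutil.rmtree(entry)  # remove folders recursively
        else:
            entry.unlink()        # remove single files
        removed.append(entry.name)
    return sorted(removed)


# Demo on a temporary directory; on Kaggle the call would presumably be
# keep_only_submission("/kaggle/working") at the end of the notebook.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "submission.csv").write_text("isic_id,target\n")  # assumed header
(demo_dir / "model.bin").write_text("weights")
(demo_dir / "logs").mkdir()
removed = keep_only_submission(demo_dir)
print(removed)
```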
I get this error while downloading the data via API, any suggestions?
403 - Forbidden - You must accept this competition's rules before you'll be able to download files.
PS: I have already accepted the rules for the competition
Have you verified your identity?
Yea, nvm, the issue has been solved. There was a bug in the config file that had to be fixed
hey guys, can you share your score if you have used CNN-based models, not LGBM or ...?
hello isic-2024-challenge
Discussion Forum says - Discord is intended to be a more casual space, so be it ☕ 🐉
jpg train images are redundant. hdf5 contains the same?
find out how to read hdf5 and see it for myself.
hdf5 contains 401059 images, there are 401059 jpgs. so the number checks out.
all images are checked one by one and turned out all same. 👍
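One way such a one-by-one check could be done, as a rough sketch: compare a hash of each HDF5 entry with a hash of the matching JPG file. This assumes the HDF5 file stores the encoded image bytes keyed by image id and that the JPGs are named `<id>.jpg`; the helper names, the placeholder ids, and the placeholder paths in the comments are hypothetical. A byte-level mismatch would not necessarily mean the pixels differ.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def to_bytes(raw):
    """Normalise an HDF5 value (bytes, NumPy scalar or uint8 array) to bytes."""
    if isinstance(raw, (bytes, bytearray)):
        return bytes(raw)
    return np.asarray(raw).tobytes()


def mismatched_ids(id_and_raw_pairs, jpg_dir):
    """Return ids whose stored bytes differ from, or lack, <jpg_dir>/<id>.jpg."""
    bad = []
    for image_id, raw in id_and_raw_pairs:
        jpg_path = Path(jpg_dir) / f"{image_id}.jpg"
        if not jpg_path.exists():
            bad.append(image_id)
            continue
        stored = hashlib.sha256(to_bytes(raw)).hexdigest()
        on_disk = hashlib.sha256(jpg_path.read_bytes()).hexdigest()
        if stored != on_disk:
            bad.append(image_id)
    return bad


# With h5py installed, the pairs could come from the real file, e.g.:
#   with h5py.File(HDF5_PATH, "r") as f:
#       print(mismatched_ids(((k, f[k][()]) for k in f.keys()), JPG_DIR))
# Self-contained demo with fake image bytes:
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "ID_1.jpg").write_bytes(b"\xff\xd8same\xff\xd9")
(demo_dir / "ID_2.jpg").write_bytes(b"\xff\xd8jpg-version\xff\xd9")
pairs = [
    ("ID_1", np.void(b"\xff\xd8same\xff\xd9")),
    ("ID_2", b"\xff\xd8hdf5-version\xff\xd9"),
]
print(mismatched_ids(pairs, demo_dir))
```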
next is to build a baseline classifier model and submit for a test
interestingly, many of the public notebooks do not use images for training. they use tabular data only, this is very new to me. 🤔
ok, tabular vs image was discussed enough in the forum.
hope for the best, embrace for the worst. 🤣
Bit of a noob question, I have a question about the "no internet access" rule if anyone knows. I would like to download previous years' datasets. I assume I can't do this during runtime? But is it okay for me to add them to my notebook files directly?
I think you'd need to upload them and then access them from within the kaggle directory, not 100% sure myself though, first time using the notebooks myself
Hi, guys! Do I have to mention on the forum that I'm using external data given it is under CC BY 4.0 license? I remember there was an External Data Thread in every competition but not anymore.
guys, I made a lot of wrong choices in my life
but running a transformer on cpu
is the worst
How to use GCP credits for submission??
Hi. I just joined. Could you tell me where I can find those discussions?
hello guys i'm kinda new to kaggle competitions and AI for medical uses, i'm wondering if we're supposed to predict the "target" column in the .csv file because there is a huge class imbalance, for instance there are about 1k images with target = 1 and the rest which is around 400k with target = 0, is it normal ?
not normal but that's the available data
Hi ! I'm a first-timer in a Kaggle competition. Getting my head around the skin cancer dataset. Can I ask a stupid question?
The AI model we are developing is for an end user to take a mobile photo at home and submit this. So there won't be any other information provided to our model for inference. So is there anything useful to us about all the metadata in the dataset? Surely we just focus on the photos and labels?
Hey,
Just a reminder: don't submit the entire app code to the competition, just the model.
Second of all, in my model I have decided to use only images, and with some model tweaking I have been getting good results. The main problem, though, is that it takes a long time to train.
Hey Guys,
I am Ishaan. So I was wondering if you guys knew of any good ways of tweaking a model so that it can run faster?
I have used normalization, data parallelism, max pooling, and batch normalization. Yet it's still taking 15 mins per epoch
could I submit only my Jupyter notebook or only my csv file?
Hi, you're supposed to run inference with the trained model(s) in a notebook, not just upload the csv files.
Check some inference notebooks in the code section of that particular competition
You should use smaller models to start
Hi Everyone. I apologize for not posting in the 'General' area but I wanted to talk directly with the people on this challenge. I am very interested in this specific challenge. My wife died from cancer and I am trying to focus the rest of my life on researching cancer detection and treatments. I've been studying ML, DL and LLM's for the past six months. I literally just started learning CNN's and ResNet50 architectures last week.
I know it's late in this competition, but are there any teams that would accept me and let me look over your shoulder and learn?
Hey I had a question?
so how is my public score decreasing when it takes less time plus also has a better accuracy?
Hi everyone, I have a dumb question,
this isic-2024 challenge requires internet to be off, right? Then if I want to install some libraries, like 'pip install <library>', that won't be possible. What could be the alternative?!!
you can train the model in a separate notebook with the internet on. Then, when you want to submit, upload the model to Kaggle datasets (your own, private!) and load it from there, with the internet off
Hi! Sorry, I've been busy for the past month or so, and I've been offline most of the time.
Anyway, using a GPU and parallel processing (like multi-threading and/or multi-processing) is powerful for sure in speeding things up, but they only take you so far without using better resources 😦
As for normalization and batch normalization, these actually add to your training time per epoch. They're not used to speed it up. Batch normalization is used to make it easier for gradient descent to converge, and maybe it can speed up the training overall by reducing the number of epochs you need to get there, but it's expected to take more time per epoch for sure. Data normalization is optimally only done once before feeding the data into the model, and if you're doing it on the fly (i.e., every training cycle), it's also expected to take more time. Its main use is to make it easy for the model to pick up patterns in the data and avoid giving more emphasis to features with larger values, and also to avoid exploding/vanishing gradients in deeper models.
I agree with @livid rapids's advice. You should try out smaller models in your initial experiments, and then invest in more depth in the architecture you see promising, and only if it proves beneficial to the score.
Maybe you're overfitting to the training data. Make sure you use cross validation in evaluating your model(s) locally, and make sure the evaluation data (dev set or test set) do not leak into the training process (i.e., you shouldn't run backpropagation on them). Oh, and time has nothing to do with the score. It only matters for the notebook to qualify as valid (should run in no more than 9 hours)
@feral bison
Well, this condition (the internet off thing) is only to prevent participants from leaking the test data to a server that they have access to. Imagine someone submitting a notebook that reads in the test data (which is only available after you submit the notebook) and uploads them to their Google Drive.
A trick for your situation that kagglers do is that you can download the .whl files for the packages that you need to install at test time, upload them to a personal (private) dataset, add the dataset to your notebook, and use pip to install those .whl files offline. You can refer to this answer to know how:
https://stackoverflow.com/questions/27885397/how-do-i-install-a-python-package-with-a-whl-file
To download .whl files, refer to PyPi and search for your desired package. For example, here are the .whl files for numpy: https://pypi.org/project/numpy/#files
Make sure you download a manylinux version of your package, as kaggle runs on linux behind the scenes.
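The offline install step can be sketched as below. `--no-index` and `--find-links` are standard pip options; the package name `some-package`, the dataset path `/kaggle/input/my-wheels`, and both helper names are hypothetical placeholders. Wheels for any dependencies of the package need to be in the same folder, since pip cannot fetch them.

```python
import subprocess
import sys


def offline_pip_command(package, wheel_dir):
    """Build a pip command that installs only from local wheel files."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI
        f"--find-links={wheel_dir}",  # look for .whl files here instead
        package,
    ]


def install_offline(package, wheel_dir):
    """Run the offline install; raises CalledProcessError if pip fails."""
    subprocess.run(offline_pip_command(package, wheel_dir), check=True)


# Hypothetical dataset path and package name; replace with your own.
cmd = offline_pip_command("some-package", "/kaggle/input/my-wheels")
print(" ".join(cmd))
# In the submission notebook you would call:
#   install_offline("some-package", "/kaggle/input/my-wheels")
```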
Thanks a lot, quick question, how does my score get calculated then. I got a higher accuracy so should not my score increase then?
What do you mean you got a higher accuracy? I presume you're referring to the accuracy you calculated locally using the training data, right?
Great. Isn't it possible that the testing data have easier examples than the hidden test data at kaggle?
Oh yeah true,
Is there no way to check what was our accuracy with the test data, or is that the public score
Currently my score is 0.06 something
To get over that, people use cross validation. They don't only evaluate their model on one test set. They make five splits of the training data for example and train their model five times, each time on 4/5 of the data to be tested on the final 1/5. A different 1/5 of the data is used for evaluation each time and then you get the average "accuracy" or whatever evaluation metric you're using.
That would be a more accurate estimation of your score.
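A minimal sketch of that five-way split, written in plain NumPy and stratified so each fold keeps roughly the same share of positives. The function name `stratified_folds` and the toy labels are made up for illustration; scikit-learn's `StratifiedKFold` does the same job, and the training and scoring steps are left as comments.

```python
import numpy as np


def stratified_folds(y, n_splits=5, seed=0):
    """Give each sample a fold id so every fold keeps a similar class ratio."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(y.size, dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(idx.size) % n_splits  # deal out round-robin
    return fold_of


y = np.array([1] * 10 + [0] * 990)  # toy labels with 1% positives
fold_of = stratified_folds(y)
for k in range(5):
    train_idx = np.flatnonzero(fold_of != k)  # 4/5 of the data
    valid_idx = np.flatnonzero(fold_of == k)  # the held-out 1/5
    # Train on train_idx, predict on valid_idx, compute the metric here,
    # then average the five metric values at the end.
    print(k, train_idx.size, valid_idx.size, int(y[valid_idx].sum()))
```

If several images come from the same patient, splitting by patient as well (for example with scikit-learn's `StratifiedGroupKFold`) avoids the same patient appearing in both training and validation folds.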
ok thanks a lot
And even then, it's just an estimation. Leaderboard (LB) scores can be and are probably different.
Just make sure you don't overfit to your training data and you'll do well on both the public and private Leaderboard.
Sure, no problem.
Is setting the stratify parameter to the target values on train_test_split() enough to remedy the imbalanced dataset, or should I compute class weights as well and pass them to a classifier
Hey a beginner question,
I am building a model, and it is taking 2 hrs for the model to train, but during that time I get a pop-up, 'Are you still there', and my session ends. The suggestion is to use Save Version and Run All, but then I would not get the state of the notebook: I would be unable to keep coding after the notebook runs, as I would not have the variables. How do I get over this issue??
Guys, can someone please help me with this? VERY URGENT
Hi, if this is the case, could you please tell how kaggler able to submit the notebook with "internet on", for example this notebook:
https://www.kaggle.com/code/motono0223/isic-pytorch-training-baseline-image-only#Training-Configuration-⚙️
Computing class_weights was what I did to make sure that the data was balanced
They are essentially two different things. Stratifying is used so that the proportion of positives is roughly the same across splits. Class weights are used to strengthen/dampen the effect of samples from each class on the loss function (and, thus, on model learning). Using both is nice (although you can argue that weighting messes with the expected probabilities of the classes)
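To make the class-weight half concrete, here is a small sketch of the common "balanced" heuristic, where each class gets weight n_samples / (n_classes * class_count). The function name is made up, and the 4000/300 split only echoes the rough numbers mentioned earlier in the channel.

```python
import numpy as np


def balanced_class_weights(y):
    """weight[c] = n_samples / (n_classes * count[c])."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = y.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


y = np.array([0] * 4000 + [1] * 300)  # illustrative class counts
weights = balanced_class_weights(y)
print(weights)

# Per-sample weights that a loss function or classifier could consume.
sample_weight = np.where(y == 1, weights[1], weights[0])
print(sample_weight[y == 0].sum(), sample_weight[y == 1].sum())
```

With these weights both classes contribute the same total weight to the loss, which is the "strengthen/dampen" effect described above.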
Given this severe imbalance, would you argue using both is better?
using stratification is mandatory
it ensures that your splits will contain roughly the same class proportions
the weighting can be substituted for a loss function that handles imbalance like focal loss
you may also use under and oversampling
I think they all mess with probabilities (models are uncalibrated and their outputs will not be close to actual event probabilities)
so yes, using both may be one approach
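For reference, focal loss down-weights examples the model already classifies confidently, so the many easy negatives contribute less to the total. The NumPy sketch below uses gamma=2.0 and alpha=0.25, which are commonly quoted defaults and only assumptions here, not values tuned for this competition.

```python
import numpy as np


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean of -alpha_t * (1 - p_t)**gamma * log(p_t) over all samples."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # per-class balancing factor
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))


easy_negative = binary_focal_loss([0], [0.01])  # confident and correct
hard_positive = binary_focal_loss([1], [0.10])  # confident and wrong
print(easy_negative, hard_positive)
```

Setting gamma to 0 and alpha to 0.5 reduces this to half of the ordinary binary cross-entropy, which is a handy sanity check.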
I suspect this might be just another Kaggle competition with most models hardly having any actual applicability (so hacking your way through might actually get you a medal)
Anyone know why I can not use ResNet152V2Backbone? I keep getting an error that it cannot find the preset "resnet152_v2".
which data is mismatched or wrongly marked?
Hey guys! Does anyone have a tool for pulling out the tabular features from additional datasets? I can’t find one anywhere.
Hi Brian, I haven’t tried that backbone, so I’m not sure. Are you using Keras or timm? In my experience, the best-performing backbones are the smaller ones.
Hi Maximiliano, thanks for the reply. I am using Keras. I'll try again with ResNet50
You're welcome, Brian. Are you working locally or on Kaggle? Perhaps you don't have the latest version installed. I'm not a Keras user, so I'm not quite sure. Regarding the models, ResNets don't perform very well, at least in my case. I achieved 0.149 on the LB with a tiny_vit_5m_224 and tf_efficientnetv2_b0. And to reduce the gap between CV and LB, I had to use undersampling of the majority class.
Thanks for the feedback! I was working locally but now I'm mostly running on Kaggle. I haven't tried tiny_vit_5m_224. My best has been 0.135. 😕
You're welcome! Perhaps that was because you were trying bigger architectures; the smaller ones work better in this competition. Good luck!
You got that 0.149 score with that data only? Or did you use other datasets too? What was your class 1 to class 0 ratio?
For that data, do you mean the 2024 data? If that’s the case, yes, I only use the actual competition data. I don’t remember the exact class ratio, but it wasn’t as extreme as in the shared notebooks. I used 25% of the negative samples (around 100k), stratified over patient_id to ensure all patients were included.
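One possible reading of that undersampling scheme, as a sketch: keep every positive, and keep about 25% of each patient's negatives with at least one per patient so no patient disappears. The poster did not share code, so the function name, the rounding rule, and the toy data are all assumptions.

```python
import numpy as np


def undersample_negatives(target, patient_id, frac=0.25, seed=0):
    """Return row indices: all positives plus ~frac of each patient's negatives."""
    target = np.asarray(target)
    patient_id = np.asarray(patient_id)
    rng = np.random.default_rng(seed)
    keep = target == 1  # positives are always kept
    for pid in np.unique(patient_id):
        neg_idx = np.flatnonzero((patient_id == pid) & (target == 0))
        if neg_idx.size == 0:
            continue
        n_keep = max(1, int(round(frac * neg_idx.size)))  # at least one per patient
        keep[rng.choice(neg_idx, size=n_keep, replace=False)] = True
    return np.flatnonzero(keep)


# Toy data: patient A has 8 negatives and 1 positive, B has 4 negatives, C has 1.
target = np.array([0] * 8 + [1] + [0] * 4 + [0])
patient_id = np.array(["A"] * 9 + ["B"] * 4 + ["C"])
kept = undersample_negatives(target, patient_id)
print(kept.size, sorted(set(patient_id[kept].tolist())))
```

The per-patient loop is simple rather than fast; on the full training table a grouped operation in pandas would likely be quicker.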
Hi,
I was trying to use ImageDataGenerator for this competition. However, when I run the cell, it just keeps running. I have set TRAIN_IMAGE_PATH to the training image directory. Any suggestions on how to deal with this?
# Import added for completeness (assumes TensorFlow's bundled Keras)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings applied on the fly to each batch
datagen = ImageDataGenerator(
    rescale=1.0/255,         # Normalize pixel values to [0, 1]
    rotation_range=30,       # Random rotation
    width_shift_range=0.2,   # Random horizontal shift
    height_shift_range=0.2,  # Random vertical shift
    shear_range=0.2,         # Shearing transformations
    zoom_range=0.2,          # Random zoom
    horizontal_flip=True,    # Flip horizontally
    fill_mode='nearest'      # Fill strategy for empty pixels
)

# Flow from directory.
# Note: flow_from_directory() expects one subfolder per class inside
# TRAIN_IMAGE_PATH and takes the labels from the subfolder names.
train_generator = datagen.flow_from_directory(
    TRAIN_IMAGE_PATH,        # Path to training image directory
    target_size=(224, 224),  # Resize all images to 224x224
    batch_size=32,           # Number of images per batch
    class_mode='binary'      # Output labels (e.g., binary classification)
)
This is my code for reference