#isic-2024-challenge

knotty sun
#

What's it for?

weak jay
#

Finally

#

I found it

#

So what's up everyone, how are you guys approaching it?

#

Such imbalanced data

stable barn
fossil thicket
#

The data is so large

weak jay
#

It's so biased

#

We should add previous years' data for training too, I guess

stiff sparrow
#

Read the discussion, I think you can add cancerous data to treat the imbalanced aspect

obsidian leaf
#

Hi everyone, I’m new here and need some help. Can someone please explain what I need to do for this competition in a simple way?

  1. What do I need to do? - Develop a program to identify dangerous skin spots from images.
  2. What language and tools should I use? - Use Python and tools like TensorFlow or PyTorch.
  3. Which data should I use? - Use the Training and Test data provided in the competition files or data from outside.

Thanks a lot!

fossil thicket
#

Welcome

stiff sparrow
#

It's better to flag a case as cancerous when it's actually benign than to call a case benign when it's actually cancerous. From what I've seen in the discussions, they have like 300 cases which are cancerous and 4000-something not cancerous. My opinion is I need to get more cancerous data so the model can learn more.

fringe trench
#

how would we go about adding more cancerous data?

digital sky
#

I think we can try things like manipulating existing cancerous images: flipping, inverting, blurring, or shifting the lesion to the side instead of centring it. Although I am not sure if this might backfire and make it harder for the model.

fringe trench
#

Thanks. Also, is there a way to calculate the score before I submit? All the ways I tried aren't giving me accurate results.

fringe trench
crude mist
# obsidian leaf Hi everyone, I’m new here and need some help. Can someone please explain what I ...
  1. No, your model doesn't have to point at dangerous skin spots from the images, which is called image segmentation. It's enough if your model only classifies an image as cancer or not cancer (and it will probably spit out a probability instead of a hard 1 or 0, which is what you submit in your submission.csv)

  2. Whatever you like. If you can look at the image name or average color and tell if it's cancer or not, you can do that. All that the competition cares about is doing accurate classifications, measured by the competition metric (pAUC). If you get a good score, it doesn't matter what method you used.
    With that said, the method that most participants will likely go with is deep learning, either by training their own models from scratch or by fine-tuning pre-trained models (e.g., from huggingface)

  3. You're going to use the competition training data mainly, because they will likely be the most similar to the test data, which will make it easier for your model to pick up patterns that it's used to. Using external data will definitely help, as long as they are quality data and they are accessible to everyone else in the competition. Also, they should be available under a license that allows you to open-source your model if you win the competition, as required by the competition rules.
    A wise strategy to start with is to fine-tune a pre-trained model on a subset of the competition training data, then increase the size of this subset, use augmentation methods, use external data, and finally (if you are willing enough) create your own synthetic data from the competition data and use them for further training (which I suspect only very few will do, if it's even worth the effort)
    By the way, the competition test data are not actually available for you to train your model on. The test files are only available as a placeholder for when you submit your notebook to make predictions on them behind the scenes. You will not be able to see them yourself.
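
As a rough illustration of point 1 (submitting a probability per image rather than a hard 1 or 0), here is a minimal sketch of writing a submission file. The header names `isic_id` and `target`, the IDs, and the probabilities are assumptions for illustration; check the competition's sample submission for the real format.

```python
import csv
import os
import tempfile

def write_submission(path, image_ids, probabilities):
    """Write one row per test image: an ID and the predicted probability of malignancy."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["isic_id", "target"])  # assumed header; verify against the sample file
        for image_id, p in zip(image_ids, probabilities):
            writer.writerow([image_id, f"{p:.6f}"])

# Hypothetical IDs and model outputs, for illustration only
ids = ["ISIC_0000001", "ISIC_0000002"]
probs = [0.02, 0.87]

# Written to a temp folder here; in a notebook you would write "submission.csv" to the output folder
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(out_path, ids, probs)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```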

obsidian leaf
hasty whale
crude mist
# hasty whale Can I ask what augmentation methods there are to use and where is best to learn ...

There are plenty of image augmentation methods. The aim is to make a lot of image perturbations to introduce more variability in the image data provided they have little to no effect on the signal that you're trying to capture.
For example, when physicians try to diagnose skin cancer, they typically look at the color of the lesion, its shape, number, base, edge, margin, floor, among other things. Assuming the deep-learning model will use the same features to make its predictions, you can introduce random variability in the images that doesn't affect these features, like flipping the image horizontally or vertically, rotating the image by an arbitrary amount, shifting the brightness up or down a little, and so forth. These will multiply the number of training examples that you have and will make your model more robust to changes in the images that should NOT matter in its predictions, making it more likely to look at the right features and less likely to overfit to irrelevant features that may only exist in the training data.

#

Note: There is an "art" part of the image augmentation process and domain knowledge will help you make the right decision on which augmentations to use, and more importantly which to AVOID. Also, one of the powerful characteristics of deep-learning models is that they craft their own features, meaning that sometimes models will look at things that humans wouldn't even consider bothering with. For example, let's imagine that there is a metric called "skin color - ulcer color index" defined as the skin tone in RGB divided by the color of the lesion in RGB, and that "index" actually turns out to be useful in predicting the lesion malignancy. Human physicians would never be able to use this feature by eye, but a deep-learning model would, and can actually arrive at it via gradient descent. Now, wrongly assuming that your model will not use such an "index" will actually hurt your model's performance. So, you should be careful about what assumptions you make and should experiment well with different augmentations, loss functions, etc.
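
The flips, rotations, and brightness shifts described above can be sketched with plain NumPy. This is only an illustration under some assumptions: the image is a random array standing in for a lesion photo, rotation is limited to multiples of 90 degrees to avoid interpolation, and the brightness range of ±20 is an arbitrary choice, not a recommended value.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly perturbed copy of an H x W x 3 uint8 image."""
    out = image
    # Random horizontal and vertical flips, each with probability 0.5
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    if rng.random() < 0.5:
        out = out[::-1, :, :]
    # Rotate by a random multiple of 90 degrees (arbitrary angles would need interpolation)
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    # Small brightness shift, clipped back to the valid pixel range
    shift = rng.uniform(-20, 20)
    out = np.clip(out.astype(np.float32) + shift, 0, 255).astype(np.uint8)
    return out

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in for a lesion photo
augmented = augment(image, rng)
print(augmented.shape, augmented.dtype)
```

Note that the brightness shift does change color values, so whether it is "safe" depends on the caveat in the note above about color-based features.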

fringe trench
#

Has anyone had any success with images?

deft scarab
#

Has anyone gotten Error: Submission CSV Not Found, despite submission being in the output folder? I posted a discussion on it. Can't seem to figure out how to debug it

quartz herald
deft scarab
twilit aspen
#

I get this error while downloading the data via API, any suggestions?

403 - Forbidden - You must accept this competition's rules before you'll be able to download files.

PS: I have already accepted the rules for the competition

hasty whale
twilit aspen
#

Yea, nvm the issue has been solved, there was some bug in the config file that needed to be fixed

novel falcon
#

Hey guys, can you share your score if you have used CNN-based models, not LGBM or ... ?

void solar
#

hello isic-2024-challenge

#

Discussion Forum says - Discord is intended to be a more casual space, so be it ☕ 🐉

void solar
#

jpg train images are redundant. hdf5 contains the same?

#

find out how to read hdf5 and see it for myself.

void solar
#

hdf5 contains 401059 images, there are 401059 jpgs. so the number checks out.

void solar
#

all images are checked one by one and turned out all same. 👍

#

next is to build a baseline classifier model and submit for a test

void solar
#

interestingly, many of the public notebooks do not use images for training. they use tabular data only, this is very new to me. 🤔

void solar
#

ok, tabular vs image was discussed enough in the forum.

void solar
#

hope for the best, embrace for the worst. 🤣

buoyant anchor
#

Bit of a noob question, I have a question about the "no internet access" if anyone knows. I would like to download previous years' datasets. I assume I can't do this during runtime? But is it okay for me to add them to my notebook files directly?

unreal flower
#

I think you'd need to upload them and then access them from within the kaggle directory, not 100% sure myself though, first time using the notebooks myself

cedar karma
#

Hi, guys! Do I have to mention on the forum that I'm using external data given it is under CC BY 4.0 license? I remember there was an External Data Thread in every competition but not anymore.

stiff sparrow
#

guys, I made a lot of wrong choices in my life

#

but running a transformer on cpu

#

is the worst

somber dirge
#

How to use GCP credits for submission??

somber abyss
unkempt sorrel
#

hello guys i'm kinda new to kaggle competitions and AI for medical uses, i'm wondering if we're supposed to predict the "target" column in the .csv file because there is a huge class imbalance, for instance there are about 1k images with target = 1 and the rest which is around 400k with target = 0, is it normal ?

novel falcon
unkempt sorrel
#

well thank you sir

amber orchid
#

Hi ! I'm a first-timer in a Kaggle competition. Getting my head around the skin cancer dataset. Can I ask a stupid question?
The AI model we are developing is for an end user to take a mobile photo at home and submit this. So there won't be any other information provided to our model for inference. So is there anything useful to us about all the metadata in the dataset? Surely we just focus on the photos and labels?

dense crescent
#

Hey Guys,
I am Ishaan. So I was wondering if you guys knew of any good ways of tweaking a model so that it can run faster?
I have used normalization, data parallelism, max pooling, and batch normalization. Yet it's still taking 15 mins per epoch

dense crescent
#

@crude mist

#

Could you please help answer my question?

stiff sparrow
#

Can I submit only my Jupyter notebook, or only my CSV file?

tidal sky
#

Hey I'm getting submission error

#

Somebody help me

hollow turtle
#

Check some inference notebooks in the code section of that particular competition

livid rapids
stoic osprey
#

Hi Everyone. I apologize for not posting in the 'General' area but I wanted to talk directly with the people on this challenge. I am very interested in this specific challenge. My wife died from cancer and I am trying to focus the rest of my life on researching cancer detection and treatments. I've been studying ML, DL and LLMs for the past six months. I literally just started learning CNNs and ResNet50 architectures last week.

I know it's late in this competition, but are there any teams that would accept me and let me look over your shoulder and learn?

dense crescent
#

Hey, I had a question.
So how is my public score decreasing when the model takes less time and also has better accuracy?

feral bison
#

Hi everyone, I have a dumb question.
This isic-2024 challenge requires internet to be off, right? Then if I want to install some libraries, like 'pip install <library>', that won't be possible. What could be the alternative?

novel falcon
crude mist
# dense crescent Could you please help answer my question?

Hi! Sorry, I've been busy for the past month or so, and I've been offline most of the time.
Anyway, using a GPU and parallel processing (like multi-threading and/or multi-processing) is powerful for sure in speeding things up, but they only take you so far without using better resources 😦
As for normalization and batch normalization, these actually slow down your training. They're not used to speed it up. Batch normalization is used to make it easier for gradient descent to arrive at a good minimum, and maybe it can speed up the training by reducing the number of epochs you need to get there, but it's expected to take more time per epoch for sure. Data normalization is optimally only done once before feeding the data into the model, and if you're doing it on the fly (i.e., every training cycle), it's also expected to take more time. Its main use is to make it easy for the model to pick up patterns in the data and avoid giving more emphasis to features with larger values, and also to avoid exploding/vanishing gradients in deeper models.
I agree with @livid rapids's advice. You should try out smaller models in your initial experiments, and then invest in more depth in the architecture you see promising, and only if it proves beneficial to the score.

crude mist
# dense crescent Hey I had a question? so how is my public score decreasing when it takes less t...

Maybe you're overfitting to the training data. Make sure you use cross validation in evaluating your model(s) locally, and make sure the evaluation data (dev set or test set) do not leak into the training process (i.e., you shouldn't do backpropagation on them). Oh, and time has nothing to do with the score. It only matters for qualifying as a valid notebook (it should run in no more than 9 hours)

#

@feral bison
Well, this condition (the internet off thing) is only to prevent participants from leaking the test data to a server that they have access to. Imagine someone submitting a notebook that reads in the test data (which is only available after you submit the notebook) and uploads them to their Google Drive.
A trick for your situation that kagglers do is that you can download the .whl files for the packages that you need to install at test time, upload them to a personal (private) dataset, add the dataset to your notebook, and use pip to install those .whl files offline. You can refer to this answer to know how:
https://stackoverflow.com/questions/27885397/how-do-i-install-a-python-package-with-a-whl-file

#

To download .whl files, refer to PyPI and search for your desired package. For example, here are the .whl files for numpy: https://pypi.org/project/numpy/#files

Make sure you download a manylinux version of your package, as kaggle runs on linux behind the scenes.
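
The offline install step can be sketched as below. It only builds the pip command so it is safe to run anywhere; the dataset path `/kaggle/input/my-wheels` is a hypothetical name for the private dataset holding the .whl files, so substitute your own. `--no-index` and `--find-links` are standard pip options.

```python
import sys

def offline_pip_command(wheel_dir, package):
    """Build a pip command that installs `package` using only wheels found in `wheel_dir`."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",               # never contact PyPI
        "--find-links", wheel_dir,  # look for .whl files in this folder instead
        package,
    ]

# Hypothetical dataset path; use wherever your uploaded dataset is mounted
cmd = offline_pip_command("/kaggle/input/my-wheels", "numpy")
print(" ".join(cmd))
# In the notebook you would then run: subprocess.run(cmd, check=True)
```

Any dependencies of the package also need their wheels in the same folder, since pip cannot fetch them without internet.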

dense crescent
crude mist
dense crescent
#

yup

#

I split the data into testing and train

crude mist
dense crescent
#

Is there no way to check what our accuracy was on the test data, or is that the public score?

#

Currently my score is 0.06 something

crude mist
#

To get over that, people use cross validation. They don't only evaluate their model on one test set. They make five splits of the training data for example and train their model five times, each time on 4/5 of the data to be tested on the final 1/5. A different 1/5 of the data is used for evaluation each time and then you get the average "accuracy" or whatever evaluation metric you're using.

#

That would be a more accurate estimation of your score.
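
A minimal sketch of the five-split procedure described above, using toy labels. One addition that the message does not mention: the folds here are stratified (each fold gets roughly the same share of positives), which is an assumption that seems sensible for data this imbalanced. The helper name `stratified_folds` is made up for this example.

```python
import numpy as np

def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample to one of n_splits folds, keeping class proportions similar per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Deal this class's samples out to the folds round-robin
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of

labels = np.array([1] * 10 + [0] * 990)  # toy imbalanced labels
fold_of = stratified_folds(labels)

for k in range(5):
    val_idx = np.flatnonzero(fold_of == k)    # held-out 1/5 for evaluation
    train_idx = np.flatnonzero(fold_of != k)  # remaining 4/5 for training
    # Train on train_idx, score on val_idx, then average the five scores
    print(k, len(train_idx), len(val_idx), int(labels[val_idx].sum()))
```

In practice, scikit-learn's `StratifiedKFold` does the same job with more options.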

crude mist
#

And even then, it's just an estimation. Leaderboard (LB) scores can be and are probably different.

#

Just make sure you don't overfit to your training data and you'll do well on both the public and private leaderboards.

crude mist
unkempt quest
#

Is setting the stratify parameter to the target values on train_test_split() enough to remedy the imbalanced dataset, or should I compute class weights as well and pass them to a classifier?

languid saffron
#

Hey a beginner question,
I am building a model, and it is taking 2 hrs for the model to train, but during that time I get a pop-up, 'Are you still there', and my session ends. It is suggested to save a version and run all, but if I do that I won't keep the state of the notebook: I would be unable to continue coding after the notebook runs, as I would not have the variables. How do I get around this issue??

languid saffron
#

Guys, can someone please help me with this? VERY URGENT

feral bison
dense crescent
spring dome
meager juniper
spring dome
#

using stratification is mandatory

#

it ensures that your splits will contain roughly the same class proportions

#

the weighting can be substituted for a loss function that handles imbalance like focal loss

#

you may also use under and oversampling

#

I think they all mess with probabilities (models are uncalibrated and their outputs will not be close to actual event probabilities)
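
For reference, the focal loss mentioned above can be sketched in NumPy as follows. This is the binary form, with `gamma=2.0` and `alpha=0.25` as commonly used defaults; both values are assumptions here and would need tuning for this dataset.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)              # probability given to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # per-class weighting
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

# A confident correct prediction is down-weighted far more than a confident wrong one
easy = binary_focal_loss([1], [0.9])
hard = binary_focal_loss([1], [0.1])
print(easy, hard)
```

With `gamma=0` and `alpha=0.5` this reduces to half the ordinary binary cross-entropy, which is a handy sanity check. As noted above, training with it will shift the predicted probabilities away from true event rates.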

spring dome
#

I suspect this might be just another Kaggle competition with most models hardly having any actual applicability (so hacking your way through might actually get you a medal)

stoic osprey
#

Anyone know why I can not use ResNet152V2Backbone? I keep getting an error that it cannot find the preset "resnet152_v2".

unreal geyser
#

Which data is mismatched or wrongly marked?

hazy thunder
#

Hey guys! Does anyone have a tool for pulling out the tabular features from additional datasets? I can’t find one anywhere.

hollow turtle
stoic osprey
hollow turtle
# stoic osprey Hi Maximiliano, thanks for the reply. I am using Keras. I'll try again with ResN...

You're welcome, Brian. Are you working locally or on Kaggle? Perhaps you don't have the latest version installed. I'm not a Keras user, so I'm not quite sure. Regarding the models, ResNets don't perform very well, at least in my case. I achieved 0.149 on the LB with a tiny_vit_5m_224 and tf_efficientnetv2_b0. And to reduce the gap between CV and LB, I had to use undersampling of the majority class.

stoic osprey
#

Thanks for the feedback! I was working locally but now I'm mostly running on Kaggle. I haven't tried tiny_vit_5m_224. My best has been 0.135. 😕

hollow turtle
#

You're welcome! Perhaps that was because you were trying bigger architectures; the smaller ones work better in this competition. Good luck!

tidal sky
hollow turtle
pastel forge
#

Hi,
I was trying to use ImageDataGenerator for this competition. However, when I run the cell, it just keeps running. I have set TRAIN_IMAGE_PATH to the training image directory. Any suggestions on how to deal with this?

#

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings applied on the fly to each batch
datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # Normalize pixel values to [0, 1]
    rotation_range=30,       # Random rotation
    width_shift_range=0.2,   # Random horizontal shift
    height_shift_range=0.2,  # Random vertical shift
    shear_range=0.2,         # Shearing transformations
    zoom_range=0.2,          # Random zoom
    horizontal_flip=True,    # Flip horizontally
    fill_mode='nearest'      # Fill strategy for empty pixels
)

# Flow from directory (expects one subfolder per class inside the directory)
train_generator = datagen.flow_from_directory(
    TRAIN_IMAGE_PATH,        # Path to training image directory
    target_size=(224, 224),  # Resize all images to 224x224
    batch_size=32,           # Number of images per batch
    class_mode='binary'      # Output labels (e.g., binary classification)
)

This is my code for reference
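
One thing that may be worth checking, though it is only a guess since the folder layout isn't shown: `flow_from_directory` infers labels from one subfolder per class, so a single flat folder of images gives it no classes to work with, and scanning a very large folder can also be slow. A small sketch for inspecting the layout, with a temporary folder standing in for the real training directory:

```python
import os
import tempfile

def summarize_image_dir(root, extensions=(".jpg", ".jpeg", ".png")):
    """Count images directly in `root` and inside each immediate subdirectory."""
    top_level, per_subdir = 0, {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir():
                per_subdir[entry.name] = sum(
                    name.lower().endswith(extensions) for name in os.listdir(entry.path)
                )
            elif entry.name.lower().endswith(extensions):
                top_level += 1
    return top_level, per_subdir

# Demo on a throwaway flat folder; point this at your own image directory instead
with tempfile.TemporaryDirectory() as root:
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(root, name), "w").close()
    flat_count, class_dirs = summarize_image_dir(root)
print(flat_count, class_dirs)
```

If all images show up at the top level and there are no class subfolders, an alternative is to drive the generator from the labels CSV rather than from the directory structure.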