Shoutout to Tubotubo for an awesome starter notebook. https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/535121
#child-mind-institute-problematic-internet-use
I'm interested in joining a team!
Me too!
looking for a team.
Looking for a team. If anyone is interested, please DM me.
Can't we participate solo in the competition?
As far as I know, you can
Did you read something that suggests the contrary?
Just wanted to confirm
I'm actually a bit confused about getting started with the project tbh
ah yeah I can understand that
I'm yet to try this kaggle comp, but in general a good way to start is to read the description ofc, but also the main discussion posts & possibly check out some of the notebooks on what other people are doing.
But checking what others are doing, isn't that cheating?
not really
seeing how people load the data and create a submission is perfectly fine
I didn't mean exactly copying their code :P
but especially if you're new to kaggle, that can help with getting started
Ahh I see thanks
Is there any null value in the data provided in the competition's .parquet file?
No, I haven't seen any!
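A quick way to check this yourself rather than rely on someone's impression: a minimal sketch using pandas, run here on a toy frame. The column names `step` and `X` are illustrative assumptions, not taken from the competition files.

```python
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.Series:
    """Return missing-value counts per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy frame standing in for one participant's series (column names are made up).
toy = pd.DataFrame({"step": [0, 1, 2], "X": [0.1, None, 0.3], "weekday": [1, 1, 2]})
report = null_report(toy)
print(report if not report.empty else "no missing values")
```

On the real data you would pass the frame loaded from one parquet file instead of `toy`.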
Does anyone know why there are 3960 ids in train.csv but only 996 directories with ids to work on? Are the other ids important in any way?
same question
train.csv contains too many missing values.
I'm trying to figure out why that is...
I mean there must be some reason for that
ok
You could try a semi-supervised learning method to help with the unlabelled data
which method have you tried
I haven't yet, been too focused on gradient boosting models
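One simple semi-supervised option is pseudo-labelling on top of a gradient boosting model. The sketch below is only an illustration on synthetic data: the function name `pseudo_label`, the 0.9 confidence threshold, and the toy blobs are all assumptions, and nothing here was tuned for the competition.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def pseudo_label(X_lab, y_lab, X_unlab, threshold=0.9):
    """Fit on labelled rows, add confidently predicted unlabelled rows, then refit."""
    base = GradientBoostingClassifier(random_state=0).fit(X_lab, y_lab)
    proba = base.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = base.classes_[proba.argmax(axis=1)][confident]

    # Augment the labelled set with the confident pseudo-labels only.
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, pseudo_y])
    final = GradientBoostingClassifier(random_state=0).fit(X_aug, y_aug)
    return final, int(confident.sum())


# Synthetic stand-in data: two Gaussian blobs, part labelled and part not.
rng = np.random.default_rng(0)
X_lab = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y_lab = np.array([0] * 30 + [1] * 30)
X_unlab = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])

model, n_added = pseudo_label(X_lab, y_lab, X_unlab)
print(f"added {n_added} pseudo-labelled rows out of {len(X_unlab)}")
```

Pseudo-labels can reinforce the first model's mistakes, so it is worth comparing cross-validation scores with and without them.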
Are you planning to work on ser?
Hi, I am a beginner in the field of DS and looking for a partner for the Child-mind-institute-problematic-internet-use Kaggle competition, and also for partners I can learn from.
Hi there! Just a quick question: how do we interpret weekday in the parquet data? Does it start from Monday or Sunday?
The data description page says that only a proportion of the participants were asked to wear a wearable accelerometer for (allegedly) up to 30 days. Only those participants will have corresponding parquet data
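To see which ids actually have accelerometer data, you can compare the ids in the CSV against the directory names. This is a sketch that assumes a hive-style `id=<participant id>` directory layout; that layout and the helper names are assumptions, so adjust them to whatever you find on disk.

```python
import tempfile
from pathlib import Path


def ids_with_series(series_root: Path) -> set[str]:
    """Collect ids from sub-directory names, assuming an 'id=<value>' naming scheme."""
    return {
        p.name.split("=", 1)[1]
        for p in series_root.iterdir()
        if p.is_dir() and p.name.startswith("id=")
    }


def split_ids(tabular_ids, series_root: Path):
    """Split tabular ids into those with and those without a series directory."""
    have = ids_with_series(series_root)
    tabular = set(tabular_ids)
    return sorted(tabular & have), sorted(tabular - have)


# Demo on a temporary directory tree instead of the real competition folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "id=00008ff9").mkdir()
    (root / "id=00115b9f").mkdir()
    with_series, without_series = split_ids(["00008ff9", "000fd460", "00115b9f"], root)

print("with series:", with_series)
print("without series:", without_series)
```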
Thank you so much!
I highly recommend always having the data dictionary and the data description page open. You need to constantly look things up to make sense of all the fields
Although, spoiler alert, some descriptions are very inaccurate. Be critical and don't just take the description at face value
I suppose this is part of the competition, that's why they haven't corrected them
I would, however, really appreciate it if we are given some more information on the data columns. For example, the actual model of the ActiGraph used, the protocol for fitness endurance test, etc. I spent so many hours just to make sense of the data because of the scarcity of documentation from the competition
It's a wonderful learning experience, but extremely frustrating at times
I'm not sure, for example, whether the ActiGraph devices already implemented some form of vehicle detection algorithm or not.
And I have concerns about the implementation of GGIR package in wristpy, specifically for the dataset of this competition
The lack of true raw data really limits what we can do and what we can know
Fortunately wristpy does not do any further filtering, so the major information loss is the aggregation (and potentially arising from auto-calibration)
We also don't know the actual protocol of the accelerometer experiment
Like whether the device is on dominant or non-dominant hand
(By the way, if you need any help, I am happy to contribute to the wristpy repo. I have gone over the entire GGIR documentation and much of the relevant literature)
Oh sorry I'm stupid. It was written on the data description page. I'm used to digging things up for this project so I was overcomplicating it lol
Hi! I am Aakash, having experience in Web Development and now learning AI/ML.
This competition is the second one I've joined, and I got confused on seeing that ~21 columns are missing in the test set. And what's the story with the parquet files?
From the map provided, it seems that those 21-22 columns that are missing from the test df are crucial for predicting the target variable.
What is your approach to this challenge? Looking at some notebooks, I found that someone has used those missing columns as target columns. That seems a completely new concept to me.
Looking forward to hearing from you
Hello everyone! Seems like my local CV score and public LB are not very correlated. Has anyone found good local settings which give at least some correlation between local CV and public LB?
I saw 82 columns in the train dataset which we use to develop our model and was confused by that. I would love to work with someone or a team on this competition. Please DM me, thank you
Hello Folks!
I'm Dharmik and I'm a data scientist from India.
I've been working on this competition and have basically got the rough skeleton of the approach I had in mind ready.
The major question I had in mind was around model performance. Has anyone been able to get decent performance without any fancy feature engineering? I want to brainstorm ideas on how to further improve performance on two major fronts:
- Data -> includes imputation, feature engineering, and any transformations in general
- Model -> this would mainly include hyperparameter tuning, or later exploring whether other algorithms provide better performance.
Happy to get in touch with anyone here and do a problem solving session!!
What's your experience with coding? I am looking for good coders to team up with.
A PhD AI student this side
Who wants to team up?
Those 22 columns are not provided because they are questionnaire scores that are related to the target
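If you want to list those columns yourself, a set difference on the column names is enough. A minimal sketch on empty toy frames; every column name other than `id` and `sii` is a made-up placeholder.

```python
import pandas as pd


def train_only_columns(train: pd.DataFrame, test: pd.DataFrame) -> list[str]:
    """Return the columns present in train but absent from test, in train order."""
    test_cols = set(test.columns)
    return [c for c in train.columns if c not in test_cols]


# Toy frames: only the column names matter for this check.
train = pd.DataFrame(
    columns=["id", "feature_a", "questionnaire_item_1", "questionnaire_total", "sii"]
)
test = pd.DataFrame(columns=["id", "feature_a"])

missing = train_only_columns(train, test)
print(missing)
```

Anything on that list cannot be a model input at prediction time, since the test set will not contain it.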
Did anyone attend this webinar? If yes, can you share some insights, please?
Also, please tell me how to access the recording of the webinar
yes I used imputation, the data seemed highly imbalanced
I spent a bit doing EDA and PCA yesterday, ping me if you get stuck
Hi all, I wanted to ask how people are dealing with the outliers in the dataset and if anyone has used any transformations at all to improve performance?
I don't know how to read this big dataset: it takes too much time on my local machine, and in a Kaggle notebook it runs out of memory reading this many parquet files. Any idea how to tackle this? Link to the competition: https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/overview
I am trying to combine all these parquet files. Please guide me
you can always refer to other notebooks and see how they deal with the dataset
are all submitted notebooks public?
no, a notebook is public only if you turn that on in the notebook's settings
Aii thanks
they talked about outliers, metrics, CV score. It's worth a watch https://vimeo.com/1020745085
This is "Powering AI With Precision NVIDIA - Episode 2 - AI for Good with Child Mind Institute, Dell Technologies, and NVIDIA" by Dell Technologies Video…
Thank you very much dear @hexed mica for providing the source.
it's explained in the data tab -> weekday - The day of the week, coded as an integer with 1 being Monday and 7 being Sunday.
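That coding (1 = Monday, 7 = Sunday) matches the convention of Python's `date.isoweekday()`. A small sketch for turning the integer into a name or a weekend flag; the helper names are my own.

```python
from datetime import date

# 1 = Monday ... 7 = Sunday, as stated in the data description quoted above.
WEEKDAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}


def weekday_name(code: int) -> str:
    """Map the integer weekday code to its English name."""
    return WEEKDAY_NAMES[code]


def is_weekend(code: int) -> bool:
    """Saturday (6) and Sunday (7) count as the weekend under this coding."""
    return code in (6, 7)


# Python's isoweekday() uses the same numbering: 2024-01-01 was a Monday.
print(weekday_name(date(2024, 1, 1).isoweekday()), is_weekend(6))
```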
I'm having an issue when trying to submit to the competition.
I get the error: Submission Scoring Error
Your notebook generated a submission file with incorrect format. Some examples causing this are: wrong number of rows or columns, empty values, an incorrect data type for a value, or invalid submission values from what is expected.
I have searched the discussion to see if anyone else was having the issue: https://www.kaggle.com/competitions/child-mind-institute-problematic-internet-use/discussion/541102#3025616 but there seems to be no obvious solution, at least to me
I assume what they are hinting at in the post is that there are entries which are in both test and train, however, having tried to remove duplicates from either as well as both and retried to submit, I get the same error.
My submission looks fine to me when compared with test.csv:
id sii
00008ff9 0
000fd460 0
105258 0
00115b9f 0
0016bb22 0
001f3379 1
0038ba98 0
0068a485 0
0069fbed 0
0083e397 2
0087dd65 0
00abe655 0
00ae59c9 1
00af6387 0
00bd4359 0
00c0cd71 0
00d56d4b 0
00d9913d 0
00e6167c 0
00ebc35d 0
Anyone available for a little support?
Is your file really in CSV (comma-separated) format? It looks space-separated in what you have shown here.
Hi. I am new to Kaggle. I am trying to submit my notebook but the submit button is disabled. Why is that?
Yes, it should be - same as if I download the test.csv and open it. Always good to check though
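A few of the causes listed in that error message can be checked locally before submitting. The sketch below writes a comma-separated file with pandas and then re-reads it to look for the usual format problems. The `id` and `sii` column names come from the sample shown above; the helper names and the specific checks are my own assumptions, not Kaggle's validator.

```python
import io

import pandas as pd


def to_submission_csv(ids, preds) -> str:
    """Build comma-separated submission text with an 'id' and an integer 'sii' column."""
    sub = pd.DataFrame({"id": list(ids), "sii": [int(p) for p in preds]})
    return sub.to_csv(index=False)


def check_submission(csv_text: str, expected_ids) -> list[str]:
    """Return a list of format problems; an empty list means these checks passed."""
    # Read everything as text so that ids keep any leading zeros.
    sub = pd.read_csv(io.StringIO(csv_text), dtype=str)
    if list(sub.columns) != ["id", "sii"]:
        return [f"unexpected columns: {list(sub.columns)}"]
    problems = []
    if len(sub) != len(expected_ids):
        problems.append("wrong number of rows")
    if set(sub["id"]) != set(expected_ids):
        problems.append("ids differ from the expected ids")
    if sub["sii"].isna().any():
        problems.append("empty sii values")
    elif not sub["sii"].str.isdigit().all():
        problems.append("non-integer sii values")
    return problems


ids = ["00008ff9", "000fd460", "00115b9f"]
good = to_submission_csv(ids, [0, 0, 1])
bad = "id sii\n00008ff9 0\n000fd460 0\n00115b9f 1\n"  # space-separated on purpose

print(check_submission(good, ids))
print(check_submission(bad, ids))
```

The notebook is rerun on the hidden test set when you submit, so a file that passes locally can still fail there if the code assumes a fixed number of rows.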
any team up ?
Hello everyone, how are you planning to use the part-0.parquet files?
There is a notebook that shows how to read these big data files. It uses parallel computing.
Use polars to read the parquet files; it supports hive-based reading. Have a look at the docs
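One way to avoid running out of memory is to never hold all the series at once: read one file, reduce it to a row of summary statistics, and keep only that row. The sketch below uses pandas plus a thread pool rather than polars, purely to keep the example short. The `id=<participant id>/part-0.parquet` layout, the function names, and the choice of statistics are assumptions, and `pd.read_parquet` needs a parquet engine such as pyarrow installed.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collapse one participant's time series into a flat dict of summary statistics."""
    stats = df.select_dtypes("number").agg(["mean", "std", "min", "max"])
    return {
        f"{col}_{stat}": stats.loc[stat, col]
        for col in stats.columns
        for stat in stats.index
    }


def summarize_file(path: Path) -> dict:
    """Read a single parquet file and return its summary row, tagged with the id."""
    # Assumed layout: <root>/id=<participant id>/part-0.parquet
    row = summarize_frame(pd.read_parquet(path))
    row["id"] = path.parent.name.split("=", 1)[-1]
    return row


def build_feature_table(root: Path, workers: int = 4) -> pd.DataFrame:
    """Summarize every file in parallel; only the small summary rows stay in memory."""
    paths = sorted(root.glob("id=*/part-0.parquet"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(summarize_file, paths))
    return pd.DataFrame(rows)


# Demo of the per-file reduction on an in-memory toy frame (no files needed).
toy = pd.DataFrame({"X": [0.0, 1.0, 2.0], "weekday": [1, 1, 2]})
feats = summarize_frame(toy)
print(feats)
```

The resulting table has one row per id and can be merged with the tabular data on `id`.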
Hi guys, just a small question. How many columns does the test set actually have? I am trying to reduce the dimensionality of the dataset and it throws an error stating that the number of columns might actually be higher.
I am using cuda df, lmk I can provide code
for EDA
Could it be that the one-hot encoded columns have a few values that aren't present in the train set, etc.?
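If that is the cause, one common fix is to encode train and test separately and then reindex the test columns to match train. A sketch with pandas; the `season` column and its values are hypothetical.

```python
import pandas as pd


def one_hot_aligned(train: pd.DataFrame, test: pd.DataFrame, columns):
    """One-hot encode both frames and force test to have exactly train's columns."""
    train_enc = pd.get_dummies(train, columns=columns)
    test_enc = pd.get_dummies(test, columns=columns)
    # Categories unseen in train are dropped; categories missing from test become 0.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc


# Toy data: "Fall" appears only in test, "Spring" and "Winter" only in train.
train = pd.DataFrame({"id": ["a", "b", "c"], "season": ["Spring", "Summer", "Winter"]})
test = pd.DataFrame({"id": ["d", "e"], "season": ["Fall", "Summer"]})

train_enc, test_enc = one_hot_aligned(train, test, columns=["season"])
print(list(test_enc.columns))
```

Pass only feature columns here: a target column that exists in train alone would otherwise be added to test as zeros.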
I'm interested in joining a team!
are you solving it as a regression problem or classification problem?
I've been working on it as a regression problem - seems more flexible; Polars as well LazyFrame streaming to stay w/in the 30GB RAM limit on kaggle notebooks
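With the regression framing, the continuous predictions still have to be turned into class labels at the end. A minimal sketch with `np.digitize`, assuming four ordered classes 0 to 3 and placeholder cut points of 0.5, 1.5, and 2.5; both the class count and the thresholds are assumptions you would tune against your own validation data.

```python
import numpy as np


def to_classes(scores, thresholds=(0.5, 1.5, 2.5)):
    """Bin continuous scores into ordered integer classes using increasing cut points."""
    # A score equal to a threshold falls into the higher class.
    return np.digitize(scores, thresholds)


preds = to_classes([0.2, 0.5, 1.7, 3.4])
print(preds)
```

Tuning the cut points on out-of-fold predictions instead of using plain rounding is one reason the regression framing can feel more flexible.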
Can we participate in this competition without a team ?
Yes, we can.
Hi everyone,
I hope you're doing well. I recently completed a data science course on Udemy and am eager to enhance my skills further. As I am new to the field, could anyone kindly suggest ways to practice and work on projects that would help me improve?
I would truly appreciate any guidance, tips, or resources you could share to help me grow in this journey.
Can we use LLM API in CIU competition?
No you can't connect to the internet
Resources? 1. Copiously illustrated, will help you visualize just what algorithms are doing
- You name it, he's thought about it; strategies for every type of situation explored; a toolbox for reference
Hey all, I was recently working on my submission for this competition, but I seem to be running into an error when trying to submit. I noticed that other people are still submitting their predictions, but I had thought that because we were past the one deadline, it wasn't letting me submit anymore. I have gone through my code and run the notebook successfully, but every time I go to submit an entry, it says the notebook threw an exception. I looked through the logs, and they state that it ran successfully. I wonder if someone has seen this before, or am I missing something that means I can't submit any more entries? I was able to submit entries before and have submitted quite a few, so I'm not sure what is going on. Any help would be greatly appreciated!
If I derive a dataset by preprocessing the existing competition dataset, will I be allowed to use that derived dataset directly in my submission notebook? Or does the submission notebook need to include the preprocessing code too?
Hello @bold hound ,
Overfitting the leaderboard means that a notebook has learned patterns specific to the public portion of the data but may not generalize well to the private portion. Public leaderboard scores are often computed using a small subset of the validation data (for example, 20%), while the final scores are calculated on the larger, untested hold-out dataset (the remaining 80%).
As a result, some public notebooks will seem very accurate by blending multiple models to fit the public validation set very well. However, they can often end up learning noise rather than real patterns, causing them to perform poorly to very poorly when evaluated on the private dataset.
This is referred to as a "shake-up" on Kaggle. It varies depending on the dataset and the metric used. Sometimes, blending public notebooks leads to good results, but at other times, it causes severe overfitting and poor performance on the private leaderboard.
The recommended approach is to choose a local cross-validation strategy to test your local models and choose your best models based on this.
To determine if a shake-up will occur, people compare their local metrics to their public LB scores. When they correlate well, a shake-up is less likely. Here it seems a lot of people have noticed a strong difference, so a shake-up seems likely.
Hope that helps
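For anyone who wants a starting point for the local cross-validation mentioned above, here is a stratified K-fold sketch on synthetic data. The quadratic weighted kappa metric, the gradient boosting model, and the toy data are all assumptions chosen for illustration; substitute the competition's actual metric and your own model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold


def cv_scores(X, y, n_splits=5, seed=0):
    """Return one validation score per fold from a stratified K-fold split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in skf.split(X, y):
        model = GradientBoostingClassifier(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        # Assumed metric: quadratic weighted kappa for ordered classes.
        scores.append(cohen_kappa_score(y[valid_idx], pred, weights="quadratic"))
    return np.array(scores)


# Synthetic, imbalanced three-class problem standing in for the real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.digitize(X[:, 0] + 0.5 * rng.normal(size=200), [0.5, 1.5])

scores = cv_scores(X, y)
print("fold scores:", scores.round(3), "mean:", round(float(scores.mean()), 3))
```

A large spread between folds is itself a warning that a small public leaderboard sample may not reflect the private one.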
Hi
KNN imputation is all you need (from my experiments)
Correct
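For reference, KNN imputation is available in scikit-learn as `KNNImputer`. A minimal sketch on toy arrays; the wrapper name `knn_impute` and the neighbour count are arbitrary choices for illustration, not the settings used in the experiments mentioned above.

```python
import numpy as np
from sklearn.impute import KNNImputer


def knn_impute(train, test, n_neighbors=5):
    """Fit the imputer on train only, then fill missing values in both sets."""
    imputer = KNNImputer(n_neighbors=n_neighbors)
    train_filled = imputer.fit_transform(train)
    test_filled = imputer.transform(test)
    return train_filled, test_filled


# Toy numeric data with a few missing entries.
train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
test = np.array([[np.nan, 4.0], [2.5, np.nan]])

train_filled, test_filled = knn_impute(train, test, n_neighbors=2)
print(train_filled)
print(test_filled)
```

Because the neighbours are found by distance, features on very different scales are usually standardized first.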
Thank you very much @patent panther for the answer!
It was definitely a heartbreaking shake-up
My selected sub was 2500th+ in public LB
My rank came to be 1580 in public LB
I was in silver medal category hours back
I was staring at the screen in disbelief after the private LB was revealed. I expected a shake-up, but certainly not this.
Congrats btw,
Thank you