#child-mind-institute-detect-sleep-states
From the dataset description:
"The dataset comprises about 500 multi-day recordings of wrist-worn accelerometer data annotated with two event types: onset, the beginning of sleep, and wakeup, the end of sleep."
It looks like you'll need to do some of your own data processing to get labels in there. Everything else can be a feature.
Thanks. What is the final/y variable we'll be predicting?
Hey, can anyone give an idea of what is the feature 'steps' in the dataset?
The "event" column of the train_events.csv as I understand it. Some stitchwork is needed to link everything together between the parquet data and the csv files.
From what I'm seeing, it's a sequence number from beginning to end for a specific series_id.
For example, in the case of series_id 038441c925bb, it looks like each step equates to roughly a 5 second poll starting from 0.
It's a bit more obvious when you look at the submission response -- they only ask for the step value for a specific series, so it's used as a type of surrogate key to link across the two files.
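A rough sketch of that linking on made-up toy data, in case it helps. The column names `series_id`, `step`, and `event` follow the discussion above; the frame contents are invented, and treating one step as exactly 5 seconds is an assumption based on the observation above, not a documented fact.

```python
import pandas as pd

# Toy stand-ins for the series parquet and train_events.csv (values are invented).
series = pd.DataFrame({
    "series_id": ["a"] * 6,
    "step": range(6),
})
events = pd.DataFrame({
    "series_id": ["a", "a"],
    "step": [1, 4],
    "event": ["onset", "wakeup"],
})

# Assumption: one step is roughly 5 seconds, so step * 5 gives seconds since the start.
series["seconds_from_start"] = series["step"] * 5

# A left join on (series_id, step) keeps every step; steps without an event get NaN.
linked = series.merge(events, on=["series_id", "step"], how="left")
print(linked)
```

With `how="inner"` instead, only the rows that carry an event would survive the merge.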
Thanks Rob, I just need to figure how to reduce the data size before applying any model
Hii !! Can you guys help me get started with this competition !?
For ppl who’ve had successful submissions, how long did it typically take for your submissions to be scored? Mine is taking awfully long and eventually runs out of memory
I have prepared the basic model but I am confused about the score metrics and evaluation (Event Detection AP).
hi just started working on this competition I have loaded the data, checked for cleaning, and now about to do feature extraction once i get my gpu set up to jupyter notebook, anyone want to work together for fun?
During submission, mine takes ~1 hour with LightGBM (100 iters) & 18 features from memory-optimized lightweight training set. When only experimenting with Kaggle notebook without submitting, on the same setup it runs ~10mins. I guess the hidden test set is much larger than the training set one.
Hello to the Kaggle community,
I'm currently participating in the competition and have a question regarding the accelerometer data collection frequency. The competition overview and documentation do not provide specific details on the frequency of Z-axis data logging.
Do you think the Z-axis accelerometer data is recorded at regular intervals, every 5 seconds, or does it represent the total acceleration over a 5 second time window?
Thanks in advance for your help!
how does the scoring work? I'm quite confused
have you gotten this working?
What?
the score metrics and evaluation
I haven't looked at it yet
Hi Olivier, I think it recorded the data at 23:30:05 rather than summing or aggregating the data from 23:30:00 to 23:30:05
@fathom zodiac Thank you for your answer, I think more and more that this is the case.
Hi everyone, I am trying to get to understand the data and what we are dealing with. I do not have any technical expertise in the health related field so I do not understand the description of the data provided and how events are defined with all the talk about inactivity for more than 30 mins and one event per night …. Got a bit confused.
Plus I read that this can be considered as a time series problem which triggered a “what?!!!!! “ in my head as I was seeing this as a classification problem …. So any help regarding this please?
@sour vale No, I see this series as a signal, not a classification.
Hey everyone, I ran an RF classifier notebook. The notebook got submitted, but I see that submission.csv is empty. What am I missing here?
Are you able to submit
If you get a valid score then you are good to go
No, I'm getting this error.
@oblique fulcrum So we're supposed to get something in submission, right?
No this is because your notebook is consuming too much ram
You can look at my baseline work
I have taken care of this
Please share the link here
But why is the sample submission empty? It doesn't make any sense to me.
This is a time series problem. Not a time series forecasting problem but a time series classification problem. We are basically supposed to detect when a person goes to sleep and when the person wakes up.
The test file that is given to us is empty...that's correct
But after that your code will be run on a hidden test set which is not empty, and predictions will be made on the hidden test set. Your code used too much memory when run on their test set
There are tips on discussions for ways to reduce memory usage...this is a very common problem in this comp
Just wondering, is there an efficient way of loading and dealing with Parquet files?
I used pandas to process it as a dataframe
the files are pretty big though, I don't know if anyone has a way to deal with massive files?
Polars is a pretty good alternative
I think polars (lazy evaluation) may be necessary for this comp especially if you're not using Carl McBride's condensed dataset. Pandas is too slow to do all the data manipulation
Also you can cast the values to the smallest datatype (e.g. int16 instead of int64) possible to save memory
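A small sketch of that downcasting idea in pandas. The frame and the column names here are toy placeholders, not the competition schema, and float32 is only safe if the reduced precision is acceptable for your features.

```python
import numpy as np
import pandas as pd

# Toy frame; pandas defaults to 64-bit types for these columns.
df = pd.DataFrame({
    "step": np.arange(1000, dtype=np.int64),
    "value": np.random.default_rng(0).normal(size=1000),
})
before = df.memory_usage(deep=True).sum()

# downcast picks the smallest unsigned integer type that holds the observed values.
df["step"] = pd.to_numeric(df["step"], downcast="unsigned")
# float32 halves the size of float64 at the cost of some precision.
df["value"] = df["value"].astype(np.float32)

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(before, after)
```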
basically - lazy evaluation doesn't load the dataframe into memory. The dataframe is a huge memory bottleneck
Anyone willing to make team for cmi competition please dm
I can be your teammate
I'm in too.. I'm new to Kaggle.. trying to understand how it all works.. I would love to team up and start off
can anyone explain how to get model weights..
Use Polars
Would it work in 8 GB ram
hey guys i’m using dask .compute for loading the files but I’m still getting out of memory errors when i submit
should I just ditch dask and use polars?
Polars lazy execution may work but you will need to be very careful with your code
I will recommend you to train elsewhere and infer on the kaggle submission kernel for best results
what do you mean by this
also how does it know which file is my submission file? Do I need to name it something specific?
Please read the competition code instructions and you will be clear
You may download the training data and train your model on local pc
Save your model and infer in your kaggle kernel and submit
hey that’s really smart
I didn’t even think of that
Yes, name your submission "submission.csv"
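To make the "train elsewhere, infer in the kernel" flow concrete, here is a hedged sketch with a toy RandomForest. The features, labels, and predicted events are all invented, `model.pkl` is a hypothetical file name, and the submission column layout (`row_id`, `series_id`, `step`, `event`, `score`) is my recollection of the sample submission, so check it against the competition's Data tab.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# --- Offline (local machine): train on toy data and save the fitted model. ---
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# --- In the Kaggle notebook: load the saved model and only run inference. ---
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
X_test = rng.normal(size=(5, 3))
scores = loaded.predict_proba(X_test)[:, 1]

# Hypothetical predictions; the column layout is an assumption to verify.
submission = pd.DataFrame({
    "series_id": ["a"] * 5,
    "step": [10, 20, 30, 40, 50],
    "event": ["onset", "wakeup", "onset", "wakeup", "onset"],
    "score": scores,
})
submission.index.name = "row_id"
submission.to_csv("submission.csv")
```

In practice the saved model file would be uploaded as a Kaggle dataset and read from the notebook's input directory, so the training step never runs during submission.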
Polars definitely helped in processing the training set. Thanks to all the folks who put their starter notebooks on that
Hey how on earth do I predict on the test set if I don’t have the labels? Do I use the series id to predict?
thank so you much
Hey all, this will be my first real attempt at a kaggle competition!
Although I'm already 1 month late, I wish you all good luck ❤️
Hello everyone! This is my first code competition, and how much time do your notebooks usually take to submit? My notebook is submitting within an hour, although I just used a simple RandomForest model without any data optimization and it took 10 minutes to save version. Should I optimize the data more deeply?
My notebook is taking longer than an hour for submission, I also used a Random Forest classifier, some optimization using Optuna framework, I have a write-up on this: https://bayoadejare.medium.com/91ad8af99c24?sk=cfe58e4cdfd5cbb59401bd3ebcf05500 I'm getting some submission errors, and for further optimization getting some out-of-memory error, so not sure if it is as useful for this competition.
Thank you for your answer! It's really helpful!
Ok, I solved it. The problem was I was downloading the entire test set at once, and it seems to be too huge to download, since it takes about 9 hrs and then the notebook interrupts. I solved it downloading the test set only partially each time, and then concatenating them
Goal is not to win but to learn
Looking for collaboration
https://www.kaggle.com/code/sb0702/model-tensorflow-deep-learning-model
ah I'm back on Kaggle : ) just started working on the sleep states problem... turned the problem into a state prediction problem instead, because if you just keep the labels as events it's ugly, as the events are really sparse. idk what other approaches there are, heck I haven't even submitted a solution. Anyone who's in the same boat or better and wants to learn? I'm training an LSTM rn; my pipeline for this competition is not complete (evaluation and inference [submission] are left), nonetheless looking to learn... especially I want to get better at recurrent nets and maybe time series transformers too, but we saw way too much of that in the LLM science exam lol
I kinda am in the same boat. It's just that the data is too sparse to predict something tangible. A simple random forest is probably enough given the data size for the events
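A sketch of what turning sparse events into a per-step state could look like, on invented toy data. It assumes rows are sorted by step within each series and that every onset is followed by a matching wakeup; real data with missing or unpaired events would need extra handling.

```python
import pandas as pd

# Toy series of 8 steps with one onset at step 2 and one wakeup at step 5 (invented).
steps = pd.DataFrame({"series_id": ["a"] * 8, "step": range(8)})
events = pd.DataFrame({
    "series_id": ["a", "a"],
    "step": [2, 5],
    "event": ["onset", "wakeup"],
})

labeled = steps.merge(events, on=["series_id", "step"], how="left")

# +1 at an onset, -1 at a wakeup, 0 elsewhere; the running sum per series
# is then 1 while asleep and 0 while awake.
delta = labeled["event"].map({"onset": 1, "wakeup": -1}).fillna(0)
labeled["asleep"] = delta.groupby(labeled["series_id"]).cumsum().astype(int)
print(labeled[["step", "asleep"]])
```

The dense `asleep` column gives a label for every step, which is easier to train on than two event rows per night.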
ahan
Hello everyone, after using reduced memory dataset I'm trying to merge parquet and csv file but getting memory issue. Also, using merge I'm getting 9500 rows which I believe is incorrect. Can someone help me out, what is the final dataset used for ML?
Also, how can I install cuML successfully?
wanted to share some results i guess.. really hectic month but now i have some time, with 8 days left i made my first submission, really really bad results... just gotta make the validation functions and make the models better now, most of the software infrastructure is done.
how big is the test dataset for calculating leader board score? and also how big is the final real test dataset?
Hello, I'm currently finishing my first version and I'm trying to make a valid submission but the submission crashed and I'm trying to debug it.
The provided test_series.parquet is bothering me here: there are only 3 series and they only have a few data points each (and their very small size is very likely messing with my algo).
Should I consider that the real test events will look like the training ones, or should I expect them to look like the provided ones?
My crashing submission is giving me very little info on the cause of the crash, I'm a bit desperate...
- Found in the "Data" tab: Note that this is a Code Competition, in which the actual test set is hidden. In this public version, we give some sample data in the correct format to help you author your solutions. The full test set contains about 200 series.
- Found in the "Leaderboard" tab: This leaderboard is calculated with approximately 25% of the test data. The final results will be based on the other 75%, so the final standings may be different.
Thanks a lot!
Maybe it's because you generated too many predictions, causing an exception to be thrown. I once encountered this problem, and when I reduced the number of generated predictions, the error was no longer reported. You can try reducing the number of predictions.
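If it helps, a sketch of one way to trim predictions before writing them out. The `score` column, the threshold, and the per-series cap are all hypothetical values picked for illustration; this only shrinks the output and says nothing about what it does to the leaderboard score.

```python
import pandas as pd

# Hypothetical raw predictions (values invented); `score` is an assumed confidence column.
preds = pd.DataFrame({
    "series_id": ["a"] * 4 + ["b"] * 3,
    "step": [10, 12, 500, 900, 20, 25, 700],
    "event": ["onset", "onset", "wakeup", "onset", "onset", "onset", "wakeup"],
    "score": [0.9, 0.2, 0.8, 0.05, 0.6, 0.7, 0.1],
})

MIN_SCORE = 0.15      # hypothetical threshold
MAX_PER_SERIES = 2    # hypothetical cap

reduced = (
    preds[preds["score"] >= MIN_SCORE]      # drop low-confidence rows
    .sort_values("score", ascending=False)  # best predictions first
    .groupby("series_id")
    .head(MAX_PER_SERIES)                   # keep the top rows per series
    .sort_values(["series_id", "step"])
    .reset_index(drop=True)
)
print(reduced)
```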
Ok, I'll try thx!
Is there any NB (code) where validation accuracy is calculated?