#❓┊ask-a-question
Hi @indigo fulcrum, this post would be a better fit for the #🔗┊sharing-projects channel. Good luck on your journey to become a Notebook Expert!
Hello Everyone,
I have a question about the Kaggle competition.
There are many pre-trained models already available. If I use one of those models in a competition only on the test data, without doing any work on the training dataset, and submit the result, will it be acceptable? Or do I have to train a model first and only then test my trained model on the test set?
Example:
There are many English speech recognition models on Hugging Face. Can I use those pre-trained models only on the test set, and if that produces a good score on the leaderboard, will it be acceptable in the competition?
I know it's a noob question. Help me☺️
Thanks
Aditta
@empty lion Is there a way I can change my Kaggle username? There had been a minor spelling error while creating the username.
Hey Aditta, some competitions have specific rules about which data and models you can use, so you need to check each one. But usually it is fine to use external models (from Hugging Face, Kaggle Models, or anywhere else). That said, you'll probably find that on their own pre-trained models will only get you part of the way to a good score.
Unfortunately at this time we do not support changing usernames. If you have extenuating circumstances you can contact Kaggle support to request a change at https://www.kaggle.com/contact
If I get a good score without any training, is it acceptable?
🤔 🤔 🤔
@verbal crest sir
@sharp iris Yeah it's fine unless a specific competition has a specific rule against using certain models.
Thanks
How long does it take for competition results to be verified in general?
@elder flower Usually something like 2-7 days, sometimes longer depending on the competition.
Does anyone have tips for one to reach the Master tier or above on Kaggle ?
Not sure if this is the right channel to ask though
In the competition ranking or in some other ranking?
If it is about competitions, then you need to take part in many competitions, learn from the solutions of the winners of the past competitions and incorporate them into your approach.
In any category. I could not find many quality answers when searching on kaggle forums
Okay, so the official requirements are the following. I'll give suggestions for each ranking separately
Datasets:
There are two main parts: collecting interesting data and promoting it.
So, the first step is to collect some data. There are already thousands of datasets on Kaggle, so you would need to find some interesting data that hasn't been collected yet. Another approach would be to share some data for the ongoing competitions: for example, sharing relevant external data, doing some processing on the data and so on.
But simply making a good dataset isn't enough - you need to get people's attention. The first step is to make the dataset presentable. When you create a dataset, you see a score showing how well it is done; it covers descriptions, metadata and other things. So be sure to fill in all the fields.
And after the dataset is ready, you need to promote it - post about it on Kaggle forums and social media.
Discussions:
You need people to upvote your posts. 1 vote is bronze, 5 votes is silver, 10 votes is gold.
The "easiest" way to get upvotes is to be active on forums in an ongoing competition - share your insights, ask questions, participate in hot topics.
Some people simply share articles from the internet on Kaggle forums. It is a low-effort activity, but, unfortunately, it works.
Votes for the comments in the notebooks are counted too.
Notebooks:
Now, this becomes trickier. Personally I think that Notebooks (and Competitions) are much more competitive rankings compared to the two previous ones.
You need to make a good analysis, share it and get enough votes.
There are numerous ways to make good kaggle notebooks:
- build a good model for an ongoing competition and share it
- do an EDA (exploratory data analysis) for a competition or dataset and share it
And so on. What is important to know is that it is difficult to produce novel ideas, so many people try to get medals by joining a new competition and sharing a good analysis within the first 12-24 hours. It is tough, but doable.
It will take some time to be good at it, but it is definitely rewarding.
I'll share some resources to help you:
https://www.analyticsvidhya.com/blog/2020/12/exclusive-interview-with-andrey-lukyanenko/
https://www.youtube.com/watch?v=qKqLHs3J-Rc&ab_channel=AnalyticsVidhya
Competitions:
Now, this is the most difficult ranking on Kaggle. You need to take part in competitions and reach a high place in them. It is very difficult, and even experienced data scientists can fail. The important thing is to iterate over ideas fast, try many things and be prepared to spend a lot of time.
Here is a link where I talk about how I got a gold medal several years ago:
https://www.youtube.com/watch?v=rpClh8WmTdo&ab_channel=ChaiTimeDataScience
This channel has a lot of very useful interviews
That's it. If you have any further questions, I'll be happy to answer them
Thank you so much for the detailed explanation 
In the Learn Python tab of Kaggle, chapter 6, there is something that confuses me.
claim.startswith(planet)
>>> True
When I try it myself in a Jupyter notebook, it returns False, with the exact same code.
Also, why must the thing between the () be defined?
the argument of startswith method should be string, like "guud"
oh, it seems that the reason why it returned True initially was because planet was defined as a string somewhere earlier.
thank you for your help
This code should not return True!!!!!!!!!!!!!!!!!
This is the screenshot, and the link is https://www.kaggle.com/code/colinmorris/strings-and-dictionaries
Can you guys explain what happened?
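A minimal sketch of what is probably going on, assuming the lesson defines both names in an earlier cell (the values below are an assumption for illustration, not copied from the screenshot). The argument of startswith has to be an existing string, either a literal or a variable that was already assigned:

```python
# Assumed values: the names must be defined before the call.
planet = "Pluto"
claim = "Pluto is a planet!"

# Both names are defined strings, so the call works.
print(claim.startswith(planet))   # True

# A string literal is also a valid argument.
print(claim.startswith("guud"))   # False

# If the name was never assigned, Python raises NameError
# before startswith is even called.
try:
    claim.startswith(undefined_name)  # hypothetical, never defined
except NameError as err:
    print("NameError:", err)
```

So running only that one line in a fresh notebook gives a different result from the lesson unless the earlier cells that define planet and claim have been run with the same values.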

Are medals and votes linearly related, i.e. if someone gets 20 votes on a discussion post, does that mean he/she will get 2 gold medals? Or once you cross the 10-vote threshold, do you only get one gold medal irrespective of how many votes you get?
You can get only one medal on a post/notebook/dataset/competition.
Here is the progression information: https://www.kaggle.com/progression
Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals.
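To make the "one medal per post" rule concrete, here is a small sketch using the discussion thresholds quoted earlier in this chat (1 vote bronze, 5 silver, 10 gold). The function name is made up for illustration, and it assumes you pass in only the votes that actually count towards medals:

```python
def discussion_medal(counted_votes):
    """Return the single medal a post earns, or None.

    Thresholds are the ones mentioned in this chat for discussion
    posts; votes that Kaggle does not count are not modeled here.
    """
    if counted_votes >= 10:
        return "gold"
    if counted_votes >= 5:
        return "silver"
    if counted_votes >= 1:
        return "bronze"
    return None

# 20 votes is still one gold medal, not two.
print(discussion_medal(20))  # gold
print(discussion_medal(7))   # silver
print(discussion_medal(0))   # None
```

The progression page linked above is the authoritative source for the real thresholds in each category.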
Why doesn't the ICR competition appear in the competitions file of the Meta Kaggle dataset?
Hello, I'm new to kaggle and trying my luck with the CommonLit - Evaluate Student Summaries Competition. I wrote code for this and saved it in a submission.csv file at the end. But always get a scoring error. Can someone help me or give me a tip?
What's the error you get? It may be because the schema of your submission.csv doesn't align with what the competition expects. I have encountered a similar error in another competition; that was because when I saved the file, pandas by default wrote an additional index column, and leaving it out got the submission through.
Submission Scoring Error
Save and Run all works and I get a submission.csv file but the upload doesn't work
compare the columns of your submission.csv and the submission.csv from the competition.
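One way to do that comparison without eyeballing it is to read just the header rows. This is a rough sketch with made-up file contents (the column names are placeholders, not taken from the competition); a leading empty column name is the typical sign of a saved pandas index:

```python
import csv
import io

def header_of(csv_text):
    """Return the first row (the column names) of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))

# Hypothetical contents, for illustration only.
sample_csv = "student_id,content,wording\nabc,0.0,0.0\n"
mine_csv = ",student_id,content,wording\n0,abc,0.1,-0.2\n"

expected = header_of(sample_csv)
actual = header_of(mine_csv)

extra = [c for c in actual if c not in expected]
missing = [c for c in expected if c not in actual]

print("expected:", expected)
print("actual:  ", actual)
print("extra columns:", extra)      # [''] means an unnamed index column
print("missing columns:", missing)
```

If the extra column turns out to be the index, saving with `df.to_csv("submission.csv", index=False)` in pandas leaves it out.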
well the output of my submission.csv and the required version looks the same. Or do I have to store this in a pandas dataframe?
Okay, I mean, if that's the case, the low-effort posts sharing articles seem like a good approach for discussions. Basically, the more you post, the better the chances of getting a medal (at least bronze, although getting a gold may be difficult). Shared resources like a "Data Science cheat sheet" tend to do well in discussion forums.
Yes, in fact I remember that there were some discussion grandmasters who got their rank by sharing such articles
I do not understand what you mean
The above message was for a separate conversation. I am not sure exactly what could be the cause of your error; in my case, the submission error was generally because of a schema mismatch. My second guess was that maybe the scoring metric is undefined for some samples, for example the log of a negative number, but I briefly looked into the scoring metric of the CommonLit competition and it looks like they are using RMSE, which should be well defined for all samples.
Are you sure there isn't any more information about the error in the logs, like the full stack trace?
I actually have negative values in my predictions. Can it be that the MCRMSE is not implemented correctly?
I'm still a complete beginner. Please excuse me if I'm not doing everything right
So it can't be because of that, since both positive and negative values can occur
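For reference, here is a minimal sketch of MCRMSE as it is commonly defined (the mean over target columns of each column's RMSE); this is an illustration, not the competition's own scoring code. Because the errors are squared, negative predictions are handled like any other value:

```python
import math

def mcrmse(y_true, y_pred):
    """Mean columnwise RMSE. Rows are samples, columns are targets."""
    n_rows = len(y_true)
    n_cols = len(y_true[0])
    column_rmses = []
    for j in range(n_cols):
        squared_error = sum(
            (y_true[i][j] - y_pred[i][j]) ** 2 for i in range(n_rows)
        )
        column_rmses.append(math.sqrt(squared_error / n_rows))
    return sum(column_rmses) / n_cols

# Made-up values with negative targets and predictions.
y_true = [[-1.0, 0.5], [2.0, -0.5]]
y_pred = [[-1.0, 0.5], [1.0, -0.5]]

score = mcrmse(y_true, y_pred)
print(score)
```

So negative values alone should not trigger a scoring error; a format problem in the file is a more likely suspect.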
may I post some of the logs here?
How and where can I find out why my result disappeared from the final leaderboard of the ICR competition?
Usually results disappear from the leaderboard when admins decide that there was some kind of rule breaking.
Should I reply to kaggle competition admin that sent me instructions after the competition to ask some questions about it?
I think it would be a good idea
Are kaggle winnings considered like lottery wins or like income for taxation?
Probably depends on your jurisdiction
Has anyone gotten their silver medal converted to bronze? Got a mail for achieving silver on a notebook which I saw on Kaggle itself. But now after 2 hours, the medal is again bronze. Any ideas about it? The votes are still the same!
Someone may have downvoted the notebook. Even if the vote count is the same, not all votes count the same towards medals.
Ohh I see
Yea that's what I thought so asked it here
ohh, didn't know about that, my bad. I guess, my comment is only applicable for comments and discussion then @solid dome
Sounds like we might have a bug where the email is being sent on different logic than the medal is awarded. I'll make sure we look into it! The vote logic is very well tested, so it's very likely the email is what went wrong here.
@solid dome this is because someone retracted their upvote. He gave you the vote which is needed for the silver medal then deleted his account / retracted it a few hours later
Hi @verbal crest , I have seen the medal on Kaggle and the notebook also showed the silver medal icon. Email as well Kaggle notifications both showed silver. Attaching photos of the Kaggle notification as well as the notebook.
@solid dome The scenario bogoconic1 mentioned above is a very likely cause. This sort of thing happens all the time of course. Our system constantly calculates medals based on requirements, and it is possible to lose or downgrade medals.
Typically if you wait a little bit, you'll get some more upvotes and it will upgrade again.
@verbal crest got it. Thanks for the clarification.
why doesn't kaggle have any competition for audio for newbie?
Audio competitions are pretty rare on Kaggle, but I agree it would be pretty cool if we had a beginner competition in that category
How can we move this forward? I believe that we can take an open dataset and set up a competition
sorted_by_flavor_and_unitssold.to_markdown(max_rows = 20) This is throwing an error in Kaggle. What am I doing wrong?
max_rows int, optional
Maximum number of rows to display in the console.
Hello, is it possible for a notebook to have >= 5 non-novice votes, but still not be awarded a bronze medal?
Yes
hey guys , is it possible for ttest to return 0 as p-value ?
In the USA, lottery wins (and Kaggle wins) are considered an income. It may be different in other countries.
Pretty sure that if a user who upvoted your notebook was deleted, their vote goes away. If you were that close to silver, you will get it again.
Yes, happens all the time. Not sure if you want me to go into detail, but a short version is that people who upvote you frequently don't have their votes counted. Supposed to prevent the gaming of the voting system. I have hundreds of posts with >1 non-novice votes without a bronze medal, and probably dozens of posts >7-8 votes without a silver medal. Similar for >12-13 votes without a gold medal.
You can create a competition of any kind you like. Yet featured audio competitions are rare, presumably because competition hosts deal with other types of data.
that's fair, though i feel they should probably mention this on the progression system page (just like they do for discussions)
How can I help with that? I can try to do my best
There is a guide about community competitions. https://www.kaggle.com/c/about/community
Not sure if this can be talked about but where is the unlearning competition? The announcement said it would be on kaggle ages ago and it has to be done before neurips.
If I want to take an open dataset for a specific problem, could that cause any issues? e.g. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0
I am not a lawyer and I don't have any understanding of that
Hi @everyone,
Can someone please explain to me why the first output is a signed zero, even though arithmetically both should be unsigned zeros?
Float conversion… this is quite dangerous, esp with if conditions. Solution is to mind your type and use int() if you expect an int.
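The original snippet was an image, so this is only a guess at the kind of expression involved. It shows how a float operand can produce a negative zero while the integer version does not, and how int() removes the sign:

```python
import math

float_result = 0 * -1.0   # int 0 is converted to float first
int_result = 0 * -1       # stays an int, ints have no signed zero

print(float_result)   # -0.0
print(int_result)     # 0

# The two still compare equal, which is why if-conditions on
# equality behave, but printing or dividing can surprise you.
print(float_result == int_result)        # True

# copysign exposes the sign bit of the float zero.
print(math.copysign(1.0, float_result))  # -1.0

# Converting to int drops the sign.
print(int(float_result))                 # 0
```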
Votes that do not count towards medals also do not count towards points
I need to evaluate the output of an ML model using MATLAB, but don't have a license. Can someone run a script for me?
Could a server admin please change my server nickname to "Chris Akiki" ? 🙏
On this discord accounts are linked to Kaggle, so if you change your Kaggle profile name it will update here automatically.
I would rather keep my full name on my Kaggle profile, but it's not a big deal. Thank you for letting me know 🙂
Totally get the desire to differentiate. Right now we're sticking with the linking since we really want people to be able to find each other on Kaggle.com too.
Hey, I am doing a project with proteins and ligands in the form of mol2 and pdb files. Would anyone happen to know the best way to encode the files into a fixed-length vector while considering both structures?
Hey, I need resources on deploying ML models - Pytorch, Tensorflow based ones.
I want to know the best industry practices followed in deployment.
Any book/article related with it is appreciated.
@verbal crest , is it still unlinked
@ashen reef It's linked now! 🎉
Haa.. Finally, thanks..!!
hey , anyone familiar with the flair library , i need some help !
Anyone got any experience with similarity score scripting with chroma vector store?
I missed the BIPOC cohort deadlines, does anyone know how often does kaggle organise such cohorts?
@past spade Not sure when our next one will be, but I'd say it's probably about 6 months away or so. In the meantime you can learn a lot from all the helpful people here in the discord!
Oh okay cool, thanks!
What happened with plans to start machine unlearning challenge on the mid August?
Will this competition appear in the nearest future?
It's in the works
I need help on how to use the lasio python Library in Kaggle. I thought these libraries were auto imported. I have tried
!pip install lasio
..on a separate cell but it didn't work. At one time, it gave a network connectivity error but my wifi was good and fast.
I couldn't find any console to input commands either.
Please I need guidance 🙏
hey there I am using R and Error in predict.xgb.Booster(xgb_model, test_dmatrix) :
Feature names stored in object and newdata are different! this pops up can anybody take a look at my code
Hey all✌️ Anyone here ever messed around with creating synthetic datasets to train theorem-based neural networks for search and rec.? Basically hierarchy logic of domain-specific data. DM me, would love any suggestions 👍
Not sure how anyone can look at your code when you didn't provide a link. Based on the error message Feature names stored in object and newdata are different! I would conclude that you have different features in train and test data (the matrix you are trying to predict). This is usually the ID column or something similar. So whatever features you are adding or removing in the train data, make sure the same operations are applied to test data.
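The question is about R, but the check itself is language independent. As a rough Python sketch with made-up column names, you can list which features appear on only one side and then keep the shared ones in the training order before building the prediction matrix:

```python
# Hypothetical column names, for illustration only.
train_columns = ["id", "age", "income", "score"]
test_columns = ["age", "score", "income"]

only_in_train = [c for c in train_columns if c not in test_columns]
only_in_test = [c for c in test_columns if c not in train_columns]

# Features present in both, kept in the training order so the
# model sees the same names in the same positions at predict time.
shared_in_train_order = [c for c in train_columns if c in test_columns]

print("only in train:", only_in_train)
print("only in test: ", only_in_test)
print("use these:    ", shared_in_train_order)
```

Here the stray "id" column in the training data is the kind of difference that produces this error.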
Hello Kaggle community, I want to know if I can invite people to this discord server.
Yes, you are very welcome to! Our custom invite link is discord.gg/kaggle
Hello guys. Just started doing Kaggle and I’m curious how do you guys handle large image datasets. I am currently working on the RSNA challenge. It’s some 400 gb so I don’t think it’s possible to download locally. What would be the best option for online computing with persistent storage?
If you don't have the storage and the bandwidth to download these files - most people don't - my suggestion is to work with them directly on Kaggle. No need to download or move anything - simply write notebooks and do the training right there.
I m a newbie data scientist, what is the best possible way for me to advance in this field, also i m currently pursuing my masters
Hello, I have recently come across a problem trying to use kaggle website as the headers and other parts of the UI are overlapping making it difficult to use it as shown in the images below. I wanted to know if the problem is caused from settings in my browser(even though I have faced same issues in both chrome and FireFox) and how I can fix it.
I am really interested to know more about clustering algorithms from people who have used them. For example, perhaps data is broken down by age, gender, race, country, language. Standard questions to ask on forms. I know that in clustering, principal components of the cluster grouping boundaries don’t necessarily align with the predefined categories that set the axes. In fact, discovering structures in the data is the point. I have only used clustering at the very beginner level though. To what extent do demographics data with unusual individuals result in outliers from any cluster? This is a question I’ve been curious about for some time. Since I’m new to this discord I’m not sure if it’s too far off topic or if it’s a reasonable learning question. Could you please let me know if I should delete? I think it is clearly on the subject of data science but not a specific kaggle competition.
I would love to know how this works if anyone knows though
Like a social scientist or someone
xgb_train <- xgb.DMatrix(data = as.matrix(train_data_main), label = a)
xgb_test <- xgb.DMatrix(data = as.matrix(test_data_main))
bst<-xgboost(data = as.matrix(train_data_main), label = a, max.depth = 6,
eta = 0.3, nthread = 2, nround = 100, objective = "reg:absoluteerror")
I have a code like this how can I easily optimize it
Hi @graceful axle ! I attempted clustering using purchasing behavior (I know it’s not exactly demographics as you asked). The data I used does contain a bunch of people whose purchasing trends are what could be called outliers. My understanding is that people in one cluster aren’t going to be exactly alike, but more similar to those in the same cluster than to people who are in a different cluster. https://www.kaggle.com/code/mounikagoruganthu/mathematical-distance-in-ml
Please I'm still stuck. Any kind assistance will be greatly appreciated.
what online gpu platform should I use? I want to use about 1 TB of persistent storage. Thanks in advance.
The issue you are facing here is that none of our icons are loading and instead you are seeing their alt text on the page. I'm not sure what's causing them all to fail, I'll share it with the team.
Thank you for your support.
Interesting. I don't have the same issue. Maybe your browser setting?!
Hi everyone, can anyone tell me how we can extract data from mobile applications, such as API permissions, like in the CSV file I attached? I need this for my thesis research. @verbal crest
Hey, what type of data are you interested in?
Which permissions an application requests.
Has anyone read the Kaggle Workbook? I was gonna use it to check out if I can do my first kaggle competition from it or not. It’s from packt publications and not that well known or at least I hadn’t heard about it before, it’s available in an humble bundle now
Which browser setting do you think would bring such changes?
Where are you accessing Kaggle from? We've previously had issues with China's firewall blocking Google's CDN causing this bug specifically. Otherwise maybe something else is blocking that specific resource from loading. (We are also internally looking to try and fix this bug, but it might take a little while - it only happens rarely).
I am accessing it from Ethiopia.
Hello everyone! I am new in ML and did some basic models, feature engineering etc. Can anyone recommend me some basic knowledge competitions? I already did titanic and Spaceship competition. Thank you!
Hello everyone , I am thinking to start on ASR and LLM . Can anyone please suggest me a proper roadmap to start it
Hello, everyone, where can I submit a bug/improvement for the Learn notebooks on the platform?
Answer
Product Feedback.
Hello, I heard that Kaggle has demos in Google Cloud Next, how can I find those?
I'd love to know which session that might be also. I searched the Session Library, but couldn't find it.
But the first step is to register for complimentary access to recorded sessions via a Digital Pass: https://cloud.withgoogle.com/next/
Hello, I have only one learning path in my Google Cloud Skills Boost and my Mentor @trim lotus informed me that I am supposed to have more than one. May I kindly request that this issue be resolved? Thank you.
Hey, how are you? My name is Matviy and I am a high schooler from Ukraine. I would just like a quick word of advice. I got perfect accuracy on this model so I thought it was perfect, but I googled it and Google said it is more than likely a false accuracy. Could you let me know what you think on this matter?
It's probably overfitting, meaning the model has "memorized" the dataset. Usually you need to split the data into two parts, one for training and one for validating/testing. You would take the accuracy from the second set
Yeah, I did split the data with:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Hey Kagglers, if any senior data scientist or machine learning engineer is willing to mentor me, I would be very happy and pleased with that
Can I use chi2 test and pearson correlation coefficient in dataset containing both numerical and categorical variables?
I have a dataset which contains both numerical and categorical variables, So can I use mentioned two techniques separately to select features? For example - A, B, C, D, E are my columns wherein A, B are categorical so here I'll use chi2 test whereas C,D are numerical so i'll use pearson coeff. and E can be my target can be either categorical or numerical.
Yes, You can use the chi-squared test for categorical variables and the Pearson correlation coefficient for numerical variables to select features in a dataset. Adjust your approach for variable E based on whether it's categorical or numerical. Remember to interpret results carefully and consider the broader context of your data.
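To make the two measures concrete, here is a small plain-Python sketch that computes only the statistics (no p-values). The function names and the numbers are made up for illustration. In practice `scipy.stats.pearsonr` and `scipy.stats.chi2_contingency` do this and also return p-values:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two numerical columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def chi2_statistic(table):
    """Chi-squared statistic for a contingency table of counts
    (rows: categories of one variable, columns: of the other)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Numerical feature vs numerical target (made-up values).
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
print(r)

# Categorical feature vs categorical target (made-up counts).
chi2 = chi2_statistic([[10, 20], [20, 10]])
print(chi2)
```

Note that Pearson only makes sense when both columns are numerical and the chi-squared test when both are categorical, so which one applies also depends on the type of the target E.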
Thank you. I want to know doesn’t it affect my model if I use some test on some variables while excluding others?
For datasets containing both categorical and numerical features, consider using metrics like F1 score and AUC-PR (Area Under the Precision-Recall Curve). These metrics are well-suited to handle the mixed nature of your data and provide valuable insights into your classification model's performance.
I have a question..may be silly one... Can anyone tell me how efficient are Datacamp and udemy courses for Datascience ?
Hello everyone! I've just completed my first work on a classification algorithm using a spam email dataset. I would love to hear your thoughts and suggestions for any improvements I can make. Your insights would be greatly appreciated!
How to resolve this error?
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Error:
ValueError: Expected 2D array, got 1D array instead:
array=[1232. 677. 221. ... 1294. 860. 1126.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
reshape your x_train using array.reshape(1,-1)
Ahh , thanks
Did that, but it still didn't work
Check which of your X_train and X_test is a 1D array, and change it into a 2D array.
ok
The answer was already provided in the error message: reshape your data using array.reshape(-1, 1) rather than .reshape(1,-1). This is assuming you have a single column of data, or single feature as described in the error. It is unlikely that you have a single sample, so array.reshape(1, -1) probably would not work. Also, instead of array you need to use X_train or X_test, meaning the actual name of your array.
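A quick sketch of the difference between the two reshapes, using a short made-up array in place of the real X_train (NumPy is assumed to be installed):

```python
import numpy as np

# Stand-in for a 1D feature array like the one in the error message.
X_train = np.array([1232.0, 677.0, 221.0, 1294.0])
print(X_train.shape)            # (4,)

# One feature, many samples: a single column.
X_column = X_train.reshape(-1, 1)
print(X_column.shape)           # (4, 1)

# One sample, many features: a single row (rarely what you want here).
X_row = X_train.reshape(1, -1)
print(X_row.shape)              # (1, 4)
```

The column form is the one StandardScaler expects for a single feature. Remember to apply the same reshape to X_test before calling transform.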
Hi, I am new to kaggle competitions and have a few questions. Are we allowed to use a LLM like llama2 for the CommonLit - Evaluate Student Summaries competition?
It says "Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models".
Can we use a model we train and upload to huggingface?
Thanks
Can we use a model we train and upload to huggingface? No, but you can use a model that you train and upload to Kaggle. The key is that you can't use an internet connection, but anything uploaded to Kaggle can be accessed directly even with internet turned off.
@deft fox this is kinda random but you said you got your first computer in 1984. Was it the first macintosh?
Commodore 64. My first exposure to Macintosh was in 1989, which at that point was Mac II.
Pretty cool, thanks for that fun fact.
I am wondering:
code:
from sklearn.tree import DecisionTreeRegressor
Why do we import the specific name instead of just: import sklearn
👍
You don't have to import individual names, but there is a good reason to do so. It is not really memory: a from-import still loads the entire module it pulls the name from, so it does not save anything noticeable. The practical benefit is that later in the script the class is called only by its name (DecisionTreeRegressor), while you would have to type the whole thing if you used the full module path (sklearn.tree.DecisionTreeRegressor). So, mainly less typing and more readable code.
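One caveat worth checking for yourself: a from-import still loads the whole module, it only changes which name ends up in your namespace. A small sketch using the standard library json module as a stand-in for sklearn:

```python
import sys

# Bind only the function name in this namespace...
from json import dumps

# ...yet the whole json module has been loaded anyway.
print("json" in sys.modules)   # True

# The short name can be used directly.
print(dumps({"a": 1}))         # {"a": 1}

# Importing the module afterwards gives access to the same object
# under its longer, qualified name.
import json
print(json.dumps is dumps)     # True
```

So the choice between the two styles is mostly about readability and typing, not about how much gets loaded.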
Any one here from kaggle staff ? I want DM about my payment
Hey guys! I mostly been doing cv stuff, but I've been looking into Reinforcement Learning, especially with a robot simulation. Is there a pathway/free resources where I could look into deep RL with simulations in unity?
I need to feed my data into an LLM, and I am using LoRA to do it, but I have a large amount of text data, nearly 500M tokens. Does that harm the accuracy or efficiency of the model in any way? If so, are there any other methods to input data into LLMs?
Not sure about unity but hugging face has a pretty good intro course for RL
Hello,
I'm currently working on a time series project, and I intend to employ the EMD+CNN technique for forecasting the output. Upon applying EMD to the training data, I obtained a total of 14 Intrinsic Mode Functions (IMFs). Consequently, I constructed my CNN neural network with dimensions (30100, 20, 14, 1), with 20 representing the window size. However, I encountered an issue when attempting to decompose the test data using EMD, as it produced only 11 IMFs. This inconsistency caused an error when trying to execute the CNN model.
I have two questions: Is there a method to enforce a consistent number of IMFs during the EMD decomposition process? If not, is there an automated way to select the most significant IMFs?
Please note that I am utilizing the EMD-signal library in Python.
Thank you
Gotcha, thanks!
Hi all Kaggle Family!!
I recently published the following post in the Q&A Forum about a two step model for document classification. It would be great if you can have a look and help me with this problem I'm trying to face since I'm a bit lost at this point . Thanks a lot in advance! 🙂
https://www.kaggle.com/discussions/questions-and-answers/436192#2418471
Two Stage Model for Document Classification.
How can I exactly check and compare ?
If the prediction is only a price, how can I know which home was input, and hence check?
like the prediction is an array of prices, how can I check which price is for which home ? or even what features are being applied ?
if that makes sense.
Hi guys, newbie here. I submitted an answer for Digit Recognizer competition and it was accepted. Now I'm trying to use that model and create a website or desktop application. But I'm stuck. I tried to get a prediction using the test data using the below code, but an error was thrown. What can be the issue here? and as a start, how can I use this model on Gradio to make a simple digit recognizer. TIA!
test = X_test.iloc[0]
pred=model.predict(test)
print(pred)
The first image is the start of the error, and the second image is the end of the error.
In this prediction you need a 2D array but you pass a 1D array.
Always read errors because the solutions are there.
did anyone mess around w yolo enough i can ask them a question
By doing it this way you are basically asking someone to commit to answering your next question before they see it. I suggest you ask your question, and it may or may not be answered. That's the nature of these types of forums.
ok my bad
Nothing bad about it, just ask a question.
i have used yolo v8 to train my object detection model to train on a bunch of pics of apples and bananas
it generated a train folder that looks like this
i am now facing difficulties trying to use this to test on new pics that i have
i am done with the training part but i can't test on the pics i have of apples and so
Hello Everyone. I'm in the Intro to SQL course and facing some issues. Is anyone there to help me?
What's the issue?
Hey. Thank you but its solved. 🙂
Where I could get data on fire event globally
As a data scientist working remotely. Anybody has any recommendations on which country to migrate? Considering taxes, culture and all of that. (Not really important but would prefer a cold country, but still open to any country)
Hi there,
I have performed EDA on a dataset, but the notebook is not shown in the notebook section of that dataset
how can I have my notebook there?
Hey there, I am just starting off with Kaggle,
is there any list/sheet for different Kaggle Datasets to practice for beginners (equivalent to LeetCode 75 for example) to learn and implement different ML approaches?
Hey there, try model = YOLO("runs/detect/train5/weights/best.pt"), or the same file in a similar subfolder like runs/detect/train5.
little trick to find it,
❯ find runs/detect | grep -i best.pt
runs/detect/train5/weights/best.pt
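If you would rather locate the weights from inside the notebook, the same search can be sketched in Python. `find_weights` is a made-up helper name, and the default `runs/detect` root is only an assumption based on the path above.

```python
from pathlib import Path


def find_weights(root="runs/detect", name="best.pt"):
    """Return every file called `name` anywhere under `root`, sorted by path."""
    return sorted(Path(root).rglob(name))


# Example use (needs the ultralytics package and an existing training run):
#   from ultralytics import YOLO
#   candidates = find_weights()
#   model = YOLO(str(candidates[-1]))
```

Note that the paths are sorted as text, so with many runs the last entry is not necessarily the most recent one; check the list before picking.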
Hey, hope you are all doing well. I would like ideas for projects that could help me land a job in machine learning.
You need to make your notebook public (in the editing tab)
Got it
Thanks
Is it a good idea to post a notebook on statistical methods for data analysis like distributions method to get more upvotes ?
Hahah I recall now I am enrolled in a course.
@vast relic : i need some responses regarding machine learning survey forms can i post here ?
Hi! For the past few days, I've been trying to fine-tune a model using TPU parallelism / FSDP with a Kaggle TPU notebook. The reason I need to set up FSDP is because the model I'm using is very large (Openlm's open llama 3b v2). When I try to fine-tune it, I quickly run out of memory on the TPU. I'm not sure where to even begin with trying to get this to work, I was able to find this article in the documentation of Hugging Face Transformers Trainer, but I don't understand what I'm supposed to be doing...
Link: https://huggingface.co/docs/transformers/main_classes/trainer#pytorchxla-fully-sharded-data-parallel
My current code: https://www.kaggle.com/code/starblasters8/fine-tuning-llama
Any help would be greatly appreciated!!
Hi, has anyone come from an unrelated bachelors degree to a masters? Or have gotten into the field through alternative means other than achieving a Bachelors?
Currently getting a bachelors in an unrelated, but statistic heavy degree that I am completely uninterested in. I am looking to get into data science since the only thing I really enjoyed about my degree so far has been the stats lol.
Two questions related to creating a kaggle dataset:
- Isn't the data limit per dataset supposed to be 100GB? I currently have a dataset of size ~50GB and when trying to upload an additional ~16GB of data it says I'm exceeding the size limit.
- I have uploaded my data in batches (see attached image) but want to unpack the individual folders so that all the data is in one single folder. How do i do this?
Hello everyone, I'm currently working on estimating the market size of the retail credit market in South Africa, and I'm facing some challenges. I'd appreciate your insights and suggestions on which statistical models or methods might be suitable for this task. Additionally, if anyone has experience or expertise in market sizing, I'd be grateful for any guidance or best practices you can share. Thank you in advance for your assistance!
Hey there! I think you'd find a lot of people from non-CS backgrounds working in the DS / ML space. I have a bachelors in core electrical engineering but I didn't like the domain so I shifted to machine learning and general computer science on my own during undergrad. It just takes extra time and effort to make the shift. Easier to do it while you're a student.
Hello everybody, I am new and I have a problem with an exercise notebook: I deleted part of the initial code and I want to restart the file from the beginning.
Thanks for the reply! Good to know, I am thinking of pursuing a masters in DS but I think most of the programs have a ton of Prereqs
I am in a similar situation myself. I graduated with a degree in Biology but am trying to get into data science after taking biostatistics and using R for my senior thesis. I am pursuing my master's in biological data science, but I think, as in any field, learning the skills that you will likely need on the job and practicing them through your own projects will be crucial in landing a job.
I totally agree. I am a psych major but my school focuses heavily on research. I have taken like 6 different research stat classes. My fear is that most of my prereqs won't translate when I go to apply for a master's. Also, I have only ever used SPSS; we never really got to mess around with R or any database programs. From my understanding, STEM degrees have a far better chance of getting into those types of programs than social sciences.
is this normal?
when i run the code it sometimes has an error
and then it sometimes works
what error do you get
im trying to do something with this data:
https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
Looks like there's a row that contains inconsistent data, in this case the long string, so your model cannot interpret it. Try finding the row that contains this value and see if you can drop it.
found it, thanks
The line at the top and the bottom tell you what the problem is. Decision trees want purely numerical data, and you seem to have a mix of numerical and categorical variables. All non-numerical features must be converted to numbers.
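As a small illustration of "convert non-numerical features to numbers", here is a plain-Python one-hot encoding sketch. The `one_hot` helper and the `bpm` / `mode` rows are invented for the example; in a real notebook pandas.get_dummies or scikit-learn's OneHotEncoder would be the usual tools.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


# Made-up rows: one numeric feature and one categorical feature.
songs = [
    {"bpm": 125, "mode": "Major"},
    {"bpm": 92, "mode": "Minor"},
    {"bpm": 138, "mode": "Major"},
]
encoded = one_hot(songs, "mode")
print(encoded[0])  # {'bpm': 125, 'mode_Major': 1, 'mode_Minor': 0}
```

After this step every value the tree sees is a number, which is what the error message is asking for.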
btw here is what was wrong if anyone is curious
Hi, I am running a notebook and getting this error ?
I assume this is an older version, as I don't see " > | "
My account is phone verified, where is the connect to the internet setting ?
Hi there, I am trying to build a weather classification app with Streamlit. The problem is my model is over 25 MB (it's 87MB minimum let's say), which GitHub doesn't allow under its file size restriction. I am thinking of using Git LFS to store a pointer to that file, but I read that Streamlit doesn't interact with Git LFS to fetch the large object from the LFS repository.
I need advice on how I can push my large file into the repo directly for my app to find and use it.
It's under "Notebook options" in the right-hand pane within the editor
Thank you for sharing. That’s a long string value. Good job catching it! It blended in so well with other numerical values. 🤣
what do I do if my mentor has never shown up or answered my messages?
Hi @tawdry thorn - This is a channel for all the Kaggle members (14 million of them can view this post). I suggest you post KaggleX related topics in channels prefixed with "kx-" that stands for "KaggleX".
Read my message in a KaggleX channel here: https://discordapp.com/channels/1101210829807956100/1145761138248785982/1148605268599513108
Hey guys,
I have a general query
So, I started learning ML in June last year, then my college started and my first year was very rigorous, so I had to put my focus on it. Now that it's over, I want to resume my learning. What should I do, and which path do you recommend?
It will be really helpful for me!
Thanks 💛
More context: I did intro and intermediate ml courses on kaggle, participating in 2-3 beginner friendly competitions, and then started ML Specialization by Andrew Ng sir, did 2 courses and made a project.
Now that almost a year has passed, I am not able to recall most of the concepts, like how to handle bias and variance, gradient descent, entropy, etc.
Which path from following should I choose ?
- Do a recap of both kaggle courses and a fast revision of both ML specialization courses, and participate in more competitions, also make projects.
- Do the recap of kaggle courses and re-learn from both courses in ML specialization and participate in more competitions with projects.
Any other path you know which will help me better than above.
Or anything you would like to add on ?
It will be really helpful for me, I'll appreciate it!
Hello! I would like to fine tune LayoutLM using my own dataset of form images. These images are similar to those included in the Funsd dataset. I intend to annotate the data using the exact structure of the Funsd dataset. My question is regarding the block level annotations, do the bounding box coordinates of the block need to coincide with the bounding boxes coordinates returned by the OCR (in my case I'm using pyTesseract to get the box dimensions). The problem is that the blocks found by the pyTesseract do not always match the desired box boundaries.
Hello everyone, please what is the difference between Data science and machine learning? I am confused
In simpler words, Data Science is data driven decision making and Machine Learning focuses on learning from the data to train the models
Thank you so much 🙏
Can anyone recommend some coding/programming or machine learning internships for high school students?
The more the merrier!
Thank you in advance!
I am a newbie deep learning enthusiast and I am encountering a problem while creating and training a model: the accuracy changes every time I run the code, sometimes substantially. I created the model by following the instructor and checked the code thoroughly for typos. Everything looks right, but the accuracy changes with every instantiation of the model, which does not seem logical as I have already set a random seed for that model. Please see the attached screen recording. Can anybody explain this to me?
Computers do not work with text directly, and neither do machine learning algorithms, because every machine learning algorithm is just a math formula.
1- Set test_size in the train_test_split function.
2- Convert your categorical features to numerical ones.
Check out https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism for how to enable determinism (reproducible results), and its side-effects.
Configures TensorFlow ops to run deterministically.
That was really helpful for me to understand the issue. thank you for your input and time.😇
When you run a model for a fixed number of epochs (25 in your case), sometimes the model learns more up to that point, and sometimes less. It is a function of how close the initial weights were to the actual solution. Instead, you should run the model with an arbitrarily large number of epochs, say 500, and use early stopping with a patience of 5-10 epochs. That means the training stops if the loss function doesn't decrease for 5-10 epochs. If done this way, you will get more reproducible results, and the number of training epochs will likely be different each time as well. That is normal.
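The patience rule described above can be sketched in a few lines of plain Python. `early_stopping_epoch` and the list of validation losses are made up for illustration; in Keras the equivalent is the built-in tf.keras.callbacks.EarlyStopping callback with its `monitor` and `patience` arguments.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs. Epochs are numbered from 1.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience limit was hit.
    return len(val_losses), best_epoch


# Made-up losses: the best value is reached at epoch 3, then it plateaus.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

Two runs with different initial weights would typically stop at different epochs but end up near the same best loss, which is why the final scores vary less than with a fixed epoch count.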
Hi guys, I cannot import a library module in Kaggle, although it works on my local machine. Can someone help me please?
Thanks
@verbal crest
Your problem is not in the import - it is in not being able to install skimpy. If you read the whole error message, you will see that for some reason it can't resolve names and connect to PyPI to fetch skimpy.
Starting with ML, I have Pandas and Numpy done. What should I start learning in ML? Try to do something with Titanic dataset?
are you referring to the courses offered on kaggle?
Hello, everyone. Does anyone know where I can find sources to create my own dataset? I would like to create a project or dataset that predicts the time it takes lettuce to grow, based on temperature, humidity, TDS value, pH level, and nutrient solutions in a controlled environment. Thank you in advance.
Please, can anyone guide me on how to decide which algorithm to apply?
And what steps should I take to do EDA?
Hello there, I have a question guys. I have to work with Knowledge graphs and am completely new to ML. Could someone suggest some tutorial on PyKeen? It would be really helpful. Like a crash course or something
Hi Jjay, are you looking for existing data or do you want to collect new data by setting up a physical environment of lettuce growth?
I am trying to set up a physical environment. For now, however, I would like to get sample datasets like the one I mentioned and try to predict the time it takes to grow. Then I will set up a physical environment where those independent variables will be controlled.
Got it 😜 Maybe try searching in Kaggle Datasets and UCI Machine Learning Repository. Otherwise, maybe try to search in academic journals and research papers
I already tried searching it on kaggle, and cannot find one, or maybe I search for the wrong keyword. And also, I will try the others you mention. Thank you for your help.
Are you going to schedule a new diffusers event? I was looking forward to that.
How can I improve my DNN solution here:
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-deep-learning
Anyone else having an issue saying there is no CSV file found when submitting?
I know the file is being created, maybe it's not in the correct place?
Outputting it to /kaggle/working/submission.csv
When I submit predictions I see an error but when I check the latest version of the notebook under "Output" tab I see the file with data in the correct format
Guys, please help me find resources for: Analysis of News Articles and videos for regional languages
I want to make a Media News Monitoring and Feedback System that can handle multiple regional languages, categorize news stories, and notify me about negative coverage of news in the media.
Please suggest some good resources related to sentiment analysis from news articles and video transcripts
@fair ingot Try saving the file to "submission.csv" rather than "/kaggle/working/submission.csv"
Help Required: I am trying to detect and remove the outliers from a dataframe.
The dataframe is very extensive and huge so I have selected three key features ['TS', 'Mean_RMS', 'Mean_ToF']. The main idea is to calculate z scores and detect outliers (whose z scores are greater than 3 standard deviations). Then append the indices of those outliers in a separate list. After that use this list of indices to filter out the rows from the main dataframe df.
See my code and error I am encountering:
import numpy as np
from scipy import stats
from sklearn.utils import resample
from joblib import Parallel, delayed

# Define the number of samples to take
num_samples = 10000
# Define the number of parallel processes
num_processes = 4
# Define the threshold value
threshold = 3

# Define the outlier detection function
def detect_outliers(data):
    z_scores = stats.zscore(data)
    outlier_indices = np.argwhere(np.abs(z_scores) > threshold)[:, 0]
    return outlier_indices

# Select feature columns to detect outliers from
df_select = df[['TS', 'Mean_RMS', 'Mean_ToF']]
# Perform outlier detection on random samples in parallel
samples = [resample(df_select, n_samples=num_samples) for _ in range(num_processes)]
outlier_indices = Parallel(n_jobs=num_processes)(delayed(detect_outliers)(sample) for sample in samples)
# Flatten the list of outlier indices
outlier_indices = np.concatenate(outlier_indices)
# Remove the outliers from the DataFrame
df.drop(outlier_indices, inplace=True)
# Reset the index of the DataFrame
df.reset_index(drop=True, inplace=True)
Error:
ValueError: Shape of passed values is (2, 350), indices imply (10000, 3)
Please help me resolve this error. Thanks in advance.
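A possible way around the ValueError above, offered as a sketch and not a confirmed diagnosis: the message suggests np.argwhere is being handed a pandas object rather than a plain array, and separately, indices found in a resampled copy are positions within that sample, not row labels of df. The version below assumes the three feature columns fit in memory, computes z-scores once on a NumPy array, and builds a boolean row mask. `outlier_row_mask` and the toy data are made up for illustration.

```python
import numpy as np


def outlier_row_mask(values, threshold=3.0):
    """Return a boolean mask marking rows where any column has |z| > threshold.

    Uses the population standard deviation (ddof=0), the same default as
    scipy.stats.zscore. Assumes no column is constant (std would be 0).
    """
    values = np.asarray(values, dtype=float)
    z_scores = (values - values.mean(axis=0)) / values.std(axis=0)
    return (np.abs(z_scores) > threshold).any(axis=1)


# Toy data: 19 ordinary rows plus one row with an extreme first column.
data = [[0.0, float(i)] for i in range(19)] + [[100.0, 9.0]]
mask = outlier_row_mask(data)
print(int(mask.sum()))  # number of rows flagged as outliers
```

With the original dataframe this would be used as mask = outlier_row_mask(df[['TS', 'Mean_RMS', 'Mean_ToF']].to_numpy()) followed by df = df[~mask].reset_index(drop=True), which avoids mixing sample positions with dataframe labels.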
I have an old script that removes outliers by modified Z-scores:
llamaindex, langchain, assembly ai, weaviate, clarifai: if we are supposed to make a chatbot with one of these, which would be good and free? And can anyone share resources on making AI chatbots? 😅
Want to try Google Cloud AI Platform Notebooks. But getting the error below and don't see GPUs in any region on Google Cloud. How does one get around this?
nvidia-t4-1x: The zone 'projects/bkowshik-kaggle/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.: Something went wrong. Sorry about that.
@vivid owl I have a problem with an exercise notebook: I deleted part of the initial code and I want to restart the file from the beginning. It is the Python course, module 5, Exercise: Loops and List Comprehensions.
Heyy folks, I am relatively new to the field of deep learning. I was working on a project for time series forecasting. There are a lot of factors affecting the GDP of a country, so I was thinking about multivariate analysis, but it isn't working like it should. I tried different libraries and approaches, but the forecast graph never seems to be impacted much by the factors. I wasn't able to find any good resources on this either. Can anyone help me with that?
Step 1: Go to: https://www.kaggle.com/code/colinmorris/exercise-loops-and-list-comprehensions
Step 2: Click "Copy & Edit" that appears at the top right corner (marked in the screen print)
Hi - Discord is new and we all are still exploring and experimenting to find the best way to ask questions and get responses, but this worked for me and wanted to share.
- Describe the issue in detail so people will know exactly what issues you are experiencing
- Add a link to your Kaggle notebook so that people can take a look and investigate for you (vs. imagine what the error/issue might or could be 🤔 )
- People will respond by leaving suggestions in the Comment section of your notebook in the Kaggle platform or here in Discord
Hope you will find this tip helpful. Good luck!
Below is an example of what I described above:
https://discordapp.com/channels/1101210829807956100/1133184287886299237/1148812886026764360
the problem is solved ,thanks
Great! Thank you for letting me know. 👏 🥳 🤩
thanks a lot!
Hi - I am not knowledgeable in the space, but short courses offered by DeepLearning.AI might assist?!
Take your generative AI skills to the next level with short courses from DeepLearning.AI. Enroll today to learn directly from industry leaders, and practice generative AI concepts via hands-on exercises. Available free for a limited time.
I will go through them thank you😅😅
Please come back to Discord and let us know what you will have developed so we all can learn from you! 🤓
Yeahh sure I will try to build something😅
Beginner Notebooks on DNN, TFDF and RF: How can I improve the accuracy?
I am a beginner and I have 3 notebooks that use 3 different approaches to predict survival on Titanic. I tried many things but I was not able to get my accuracy above 80%. To break this wall, I need the advice of knowledgeable people in the Kaggle Community. Please share your advice with me regarding how to improve my accuracy!
DNN Approach (78% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-deep-learning
TFDF Approach (77% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-decision-forests
Random Forest Approach (74% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-random-forest
Explore and run machine learning code with Kaggle Notebooks | Using data from Titanic - Machine Learning from Disaster
Hey all, sorry for asking, but can anyone point me in the right direction on how to get started learning reinforcement learning on PyTorch?
I need the knowledge to solve one of freecodecamp's ML problems here https://www.freecodecamp.org/learn/machine-learning-with-python/machine-learning-with-python-projects/rock-paper-scissors but the course on FCC mainly used Tensorflow and TF runs very slow on Replit.
I know this is a bit subjective, but do you guys recommend going through all the Learn Lessons first then trying a competition?
Hello Kaggle Community,
I'm currently working on a project analyzing two decades of Premier League soccer data with the goal of creating a predictive model. However, I'm new to soccer datasets. If anyone has experience or insights to share on soccer data analysis and regression modeling, I'd greatly appreciate your guidance.
Specifically, I'm interested in predicting full time outcomes from half-time data, and predictive modeling based on the historical data. Your tips, resources, or collaboration would be invaluable.
Please reply or reach out if you can help. Thank you!
You might not have sufficient quota if your project is new. You can request an increase; instructions are here: https://cloud.google.com/compute/resource-usage#gpu_quota
what exactly do you need help with? Also, that looks to me like a job interview take home question, in which case it is not very appropriate to ask other people to help you solve it...
I needed some help to understand my code better and I am not asking anyone to solve the whole thing for me.
Anyone can help me with what I need to do in this competition?
https://www.kaggle.com/competitions/playground-series-s3e21/overview
Playground Series - Season 3, Episode 21
I don't know what to do, I want to know what to do 😁
💀
Hi everyone, I am in a mess! I am new to data science; I was initially working as a data analyst.
I need some help with a data science task that was assigned to me: I have to build a time series model in Python. Can anyone briefly share their experience here and guide me a little bit?
Hey guys, I got 0.78229 on my first submission.
Could anyone look over my code and offer some suggestions? This is my first ML project and I want to also make a YouTube video on how I built it out and such, I know I still need to leave a lot more comments/documentation and clean up a few sections
Hey guys plz help me
I recently learnt data visualization via Kaggle and I have completed the final project, but Kaggle shows it as only 75% done. Because of that the course shows 98% complete, and I am not getting my certificate for data visualization since it requires 100%.
Hi, everyone. Could you please take a look at my beginner projects, where I do a prediction of growth days of lettuce in a controlled environment? I think I am missing something I don't understand and I think my dataset also has missing features in order to predict the time it takes lettuce to grow in a controlled environment where temperature, humidity, tds value, ph level are automated and used for predictions.
Here is the link for my Kaggle notebook. If you can comment on what I did wrong, I will gladly take it as a stepping stone to further improve my knowledge. Thank you in advance. https://www.kaggle.com/datasets/jjayfabor/lettuce-growth-days
Hey, I just got a warning for self-promo on Kaggle. 100% deserved there, but it says that if you keep posting your account will be banned. Do they mean from now on, or should I go back and delete all the ones from the past?
@green haven It is best you ask those who warned you. I don't think the problem was that you were promoting your work. I think it is that you posted the same notebook announcement in multiple channels, such as in jobs, that had nothing to do with promoting notebooks. If you tamp that down and post in channels that are meant for sharing, I suspect you will be fine.
The first sentence of my response still applies.
Sure you can, go to their general discussions:
General.
Getting Started.
Generally speaking we mean from now on, but if people report your old spammy posts it might lead to future violations, so if you want to be extra safe you should clean up older spam.
Ok, thank you
Hello everyone!
I'm looking to practice feature engineering. Do you guys have any recommendations for a Get Started competition where this skill would be particularly useful?
Does anyone know why some people use a log softmax activation during training instead of separating the log and the softmax?
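One common reason is numerical stability, and a small plain-Python sketch can show it. Both function names below are made up for the example; frameworks ship their own fused versions (for instance torch.nn.functional.log_softmax in PyTorch), and this is only meant to show the idea behind them.

```python
import math


def naive_log_softmax(logits):
    """Softmax first, then log: breaks when a probability underflows to 0."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(logits):
    """Fused form: x - logsumexp(x), with the max subtracted before exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


logits = [-1000.0, 0.0]
print(stable_log_softmax(logits))  # [-1000.0, 0.0]

try:
    naive_log_softmax(logits)
except ValueError as err:
    # exp(-1000) underflows to 0.0, so the naive version ends up taking log(0).
    print("naive version failed:", err)
```

For ordinary logits the two give the same numbers; the fused form just never has to represent a probability so small that it rounds to zero before the log is taken.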
Hi everyone. It's been some time since I practiced ML. My focus was only on data visualization and analytics so neglected this area. How do I start all over again when it comes to ML?
hi guys, is there any good free online statistics book? I'm new to machine learning anyways
Here you go: https://openintro-ims.netlify.app/
@vivid owl thanks msn
Does anyone know how to config the UI? I want to exclude console UI. Kaggle notebook is awesome, but very hard to adjust the UIs
I am getting this error/warning message on Kaggle. please how do i solve it?
Looks like a weird bug is making the collapsed console space too big on your screen. I'd suggest posting in the product feedback forum so our engineers can take a look (ideally include your browser / operating system too).
What are you trying to run in your notebook? It looks like you are trying to load too much into memory and it crashed the notebook.
I am working on a dataset that contains csv files for 6 years (e.g. 2000.csv, 2001.csv, etc.). I am trying to merge the whole dataset into one file. Is there a way I can run this successfully? The dataset is from Kaggle (2GB memory).
2GB each? or 2GB in total?
Also, how are you planning on using the combined file? E.g. it is fairly easy to combine CSVs using a bash command (e.g. https://unix.stackexchange.com/questions/293775/merging-contents-of-multiple-csv-files-into-single-csv-file) without having to load everything into memory, but remember that you may still have issues trying to, say, load the combined CSV into pandas on Kaggle.
Total
hmm, that is odd, since 2 GB usually should load fine
I don't have the file in my local machine, it's in Kaggle.
you can run bash commands in kaggle using the ! operator
or using %%bash cell magic
I will try it out. This is the dataset i am talking about https://www.kaggle.com/datasets/yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018
cat *csv > combined.csv
To run this on Kaggle I have to use this !cat *csv > combined.csv right?
yeah
although do note that you have to remove the headers first. If you scroll a little through the comments on that answer, this one by hLk (Oct 12, 2019 at 1:00) is probably what you want:
"This answer will duplicate the headers. Use head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. Assumes that all CSVs share headers. It is called combined.out to prevent the statements from conflicting."
How do i achieve this?
(see my edited note 🙂 )
Just read that part in the article now
head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out. Let me try it
I am getting error
you need to run it on files. e.g. xxx.csv
the command in the screenshot is pointing to a directory
Where will the output be saved? combined.out?
Error
I tried and debugged it, but i am getting something else
you need to pass in the directory before the *.csv ie line-delay-xxx/*csv in the
i did that but there was no output
This is it
works for me
use
%%bash
head -n 1 /kaggle/input/airline-delay-and-cancellation-data-2009-2018/2011.csv > combined.out \
&& tail -n+2 -q /kaggle/input/airline-delay-and-cancellation-data-2009-2018/*.csv >> combined.out
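For anyone who prefers to stay in Python, the same head/tail idea can be sketched with the standard library. `merge_csvs` is a made-up helper; it assumes every file shares the same header, ends with a newline, and that the output path does not match the input pattern (the same reason the shell version writes to combined.out).

```python
import glob
import shutil


def merge_csvs(pattern, output_path):
    """Concatenate all CSVs matching `pattern`, keeping only the first header.

    Files are streamed in chunks, so nothing is loaded fully into memory.
    Returns the number of input files that were merged.
    """
    paths = sorted(glob.glob(pattern))
    with open(output_path, "wb") as out:
        for i, path in enumerate(paths):
            with open(path, "rb") as src:
                header = src.readline()       # first line of every file
                if i == 0:
                    out.write(header)         # written once, at the top
                shutil.copyfileobj(src, out)  # rest of the file, unchanged
    return len(paths)


# Example use on Kaggle (paths taken from the command above):
#   merge_csvs("/kaggle/input/airline-delay-and-cancellation-data-2009-2018/*.csv",
#              "combined.out")
```

As with the shell command, the result is still one very large file, so reading it with pandas may need nrows or chunksize to stay within memory.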
You used a different code, i will try this out now
The output is combined.out right?
yeah
But it is not in CSV format. Mine is still running.
Your code in the screenshot, cat combined.out | wc -l, what does it mean
?
Okay. I now changed the directory to path/combined.csv
I have shared my notebook with you https://www.kaggle.com/code/wwymak/notebook018f720f4d/edit
Still the same issue
@arctic marten I don't know if you realize how lucky you are that Wendy has been troubleshooting this with you line by line, and from what I can tell for the better part of the past hour. If Wendy wants to keep doing it, great for you. Still, at some point I think you have to invest a bit of your own time to figure things out, as these are fairly standard and straightforward operations. I realize that I am butting in without being asked anything, but it is important not to take other people's time for granted. Wendy would most likely not tell you even if that was the case.
Thank You
I appreciate it, but that should be directed to @red hawk
I can't open your notebook. It's saying permission denied.
Hello @red hawk, I was able to successfully load my data; I used nrows to specify the number of rows. Thank you so much for your time yesterday, I truly appreciate it. Thanks for teaching me how to use Unix commands to load data in CSV (I hadn't heard of that before). Do have a lovely day.
I need help setting up my GPU for Jupyter notebook. I followed the steps, but it still says my CUDA GPU is not available after importing torch.
hello everyone!
I have some questions related to data preprocessing. If you have any knowledge, please share it with me.
link:- https://www.kaggle.com/discussions/getting-started/439141
Questions About Data Preprocessing: Contribute Your Knowledge.
I have some notebook about data cleaning and data preprocessing. can you check it here. https://www.kaggle.com/zxarifi/code
Hi everyone, maybe a super dumb question, but I am going through the learning exercises and just built my own model based on DecisionTreeRegressor from sklearn. I understand X is the feature set and y is the prediction target. But when I have a prediction value on house prices, what is the y value about? I am unable to understand the prediction value when we do not have the concept of time, i.e., when we can expect the prices to match the prediction values.
So what exactly does y represent when we get the prediction in the end?
Does anyone have a unique project idea for ML?
Hello all,
I am currently trying to decrease the training time by sampling the dataset and then using that trained model to make predictions about the whole dataset.
After training on the sample, we checked the AUC for 10%, 30%, 50% and 100% sample sizes.
If the validation AUC for all of them is very close to each other we can minimize the training time by only training on the 10% of the sample for other datasets and can conclude that the predictions will be the same as that of when trained upon the whole dataset.
The problem is in the case of a very low minority class it is discarded in the sample and the predictions for those are not coming accurately.
The metrics I am using is AUC and the sampling method I am following is stratified sampling.
If you are aware of any better approaches I would like to discuss it.
It depends on the number of data points that you have. It is difficult to make suggestions because we don't know that. In addition, when taking a certain percentage of the data, did you consider whether the dataset will be imbalanced? That is, having more of certain classes than others.
I noticed there are different colors for the functions suggested when using the tab key.
I assume the blue one is for the imported package, purple is default, and the wrench is also default, related to settings, right? (just want to confirm my understanding)
Have you tried downsampling the other class(es) so that the minority class is better represented? - https://imbalanced-learn.org/stable/under_sampling.html
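For the sampling question above, one option is to keep the stratified sample but put a floor under each class, so a rare class cannot shrink to almost nothing at 10%. This is only a sketch: `stratified_sample`, `min_per_class`, and the toy labels are invented for the example, and the floor deliberately over-represents the minority class compared with the full data.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, min_per_class=1, seed=0):
    """Return sorted row indices: `fraction` of each class, with a per-class floor."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        k = max(min_per_class, round(len(indices) * fraction))
        k = min(k, len(indices))  # cannot take more rows than the class has
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy labels: 990 negatives and 10 positives (1% minority class).
labels = [0] * 990 + [1] * 10
sample = stratified_sample(labels, fraction=0.1, min_per_class=3)
print(len(sample), sum(labels[i] for i in sample))  # 102 3
```

Without the floor, a 10% sample here would keep a single positive row, which is too few for the model or the AUC estimate to say much about that class.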
Y is the prediction target. they are both the same.
I think Y is used in the actual code, while you can say prediction target when generally speaking or writing. it is just a convention.
Just like how a model can also be called an architecture, there are many similar examples in DS
(if I am wrong someone correct me plz)
SELECT u.id as id, MIN(q.q_creation_date) as q_creation_date, MIN(a.a_creation_date) as a_creation_date
FROM `bigquery-public-data.stackoverflow.posts_answers` a
FULL JOIN `bigquery-public-data.stackoverflow.posts_questions` q ON q.owner_user_id = a.owner_user_id
RIGHT JOIN `bigquery-public-data.stackoverflow.users` u ON u.id=q.owner_user_id
WHERE u.creation_date >= '2019-01-01'and u.creation_date <= '2019-01-31'
GROUP BY u.id
"""
SELECT u.id AS id,
MIN(q.creation_date) AS q_creation_date,
MIN(a.creation_date) AS a_creation_date
FROM `bigquery-public-data.stackoverflow.users` AS u
LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
ON u.id = a.owner_user_id
LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
ON q.owner_user_id = u.id
WHERE u.creation_date >= '2019-01-01' and u.creation_date < '2019-02-01'
GROUP BY id
""" ```
I am solving the first notebook in the Kaggle Advanced SQL lesson; the first query is mine and the second is the answer. When I work through it in my head, I think the JOIN logic should give the same results in the end, but unfortunately it is wrong as per the Kaggle check. I want to confirm whether it is wrong because I used a different join or because of the results themselves (if that makes sense).
Any help is appreciated ty!
What's the question? In any case, 'MIN(q.q_creation_date)' is wrong: the column is called 'creation_date' (and there is a similar error for 'a_creation_date'). I think the joins should give you the right results, but the 2nd way of doing joins is quite a bit clearer than your 1st version (maybe a personal preference, but I find thinking about 2 left joins a lot more intuitive than a full join and a right join...)
Tip: if you paste your query into the BigQuery UI it will highlight your query errors for you, something that running a query in a notebook doesn't do.
Oh thanks, I will check out the BigQuery UI. Yeah, it is annoying not to have specific query errors shown in the notebook.
Hmm, I must have tunnel-visioned and did not notice this creation date, thanks for pointing it out!
*update: Just checked, yes my logic works! *
I ran model.fit on the Kaggle online notebook and it is taking a very long time. (I ran it about 10 minutes ago and it's still 35/152 progress) Does running it online slow it down? Would it be faster to run it on a local PC? I have a gaming PC.
Without more details, it's hard to tell what the issue is. One likely guess is that you haven't enabled the GPU for your notebook session. But again, it is possible for some models to take very long if the dataset is huge and/or model parameters are in the range of billions e.g. LLMs.
Kaggle seems to prohibit code sharing outside the forum, does this mean that sharing code on GitHub is also prohibited?
https://www.kaggle.com/discussions/getting-started/440767 stuck with this problem. Can't save a NB as dataset. It caused my NB to run longer than anticipated and fail submission. All advice is appreciated.
HOW TO Guide for Dataset from NB..
At any given time there could be thousands of users running notebooks, so the available resources vary throughout the day. If you have a reasonably recent computer and a GPU, it is very likely that it will run faster locally.
Kaggle prohibits code sharing with non-group members outside of Kaggle. Anything you share with a broad audience is not a violation. That means if you post your code on GitHub and make a public link on Kaggle that anyone can access, you would be fine. Still, it is more convenient for most people to make a notebook on Kaggle and share it like that.
hello people, I am new to Datascience and ml, I have knowledge about performing Data Preprocessing and EDA and right now i am learning ML models starting from simple and multi linear regression
Can anyone suggest me some already done analysis and cleaning on datasets on kaggle? i would like to see how people go about doing data preprocessing in different ways
Also, a follow-up on this question: I would like to know where I can learn how to create pipelines for data science. I already know OOP and Python concepts and just want to know how they can be implemented.
Any good resource to learn shaders (glsl) for ai ?
guys it would mean a lot if anyone could let me know please why my program is performing poorly. I get 84% on test results
i tried changing the model architecture a lot but it always yielded worse results, this is not my first run
also somehow before using data augmentation, it had better results
if you run the notebook yourself and are trying to check the test results, the first 150 pics should be of horses and the remainder is of humans
i did a lot of research before asking here and i checked diff methods to yield better results but they did not help much
You don't have enough images to train a model from scratch and still have great accuracy. It takes at least tens of thousands, and even better hundreds of thousands of images, to get truly high performance. Instead, I suggest you start with one of pre-trained image models (VGG, ResNet, SqueezeNet, take your pick) and fine-tune it with your dataset. That should give you better performance.
I have a question about notebook-only competitions: what actually prevents me from loading a pretrained model, a model I trained myself, or even preprocessed data I created myself, and then running the notebook only for inference? I thought the goal was to test skills with limited hardware at one's disposal, so I am a bit confused.
Does anyone know why I'm getting this error in the "Intro to SQL" course on Kaggle?
It varies among competitions, but in some of them you can do exactly what you proposed: train locally, upload the files, and only do inference on Kaggle. I suggest you consult the competition rules and ask the same question in their discussion forum if unclear.
yes, check out #1130785765274685500 it is answered there. you will have to copy and use a different code that will be written in discussion forum.
alright noted, thank you so much for taking the time to help
Considering the imbalanced nature of the dataset, I would do the same by oversampling the minority class and performing the same operations.
Thanks for your suggestion
I used Save & Run All on my Kaggle notebook, but when I download the notebook the output does not appear. How can I solve this? Is it a bug or something else?
Hey guys,
Is there anyone here who have experience working with sound data, in particular sound as input and sound as output models? Would love to ask some questions regarding where to get started!
I do a variant of this response several times each session. The way you are asking a question is indirectly expecting someone to pre-commit to answering your future questions without knowing what they are going to be about. Instead, I suggest you simply go ahead and ask your question. There may be someone to answer it, or not. Yet you need to put in the initial effort.
i was learning data visualization and i stumbled across sns.clustermap
this was on a relatively small dataset
this was on a bigger dataset
are some people able to make sense of this or is it better suited for some datasets
@dapper stratus What you are showing is a two-way clustering by certain features on top, and some IDs (presumably users) on the left. The intensity of color corresponds to values for a given feature/user combinations. Features close to each other in the top dendrogram are more similar to each other, while features that are far are dissimilar. Same for IDs. The plot tells you that a number of weekend nights and a number of week nights are correlated features, while number of weekend nights and booking status are not. Same for users/IDs, except that it is very difficult to see most of them as the plot is crowded on the left and right sides.
Hi I have a question regarding the definition of an "old post".
I have a notebook that just reached 50 non-novice upvotes about 2 hours ago.
But it wouldn't update the status of the medal.
Is it because the notebook was initially created about 3 months ago? I have actively modified until last month and recently updated a bit.
I have googled the term "old post" from the progression section, but no post online or Kaggle discussion satisfied my curiosity.
Is it the matter of the "old post"? or is it my patience?
Hi, has anyone here worked on the Multi-label text classification problem?
Some of the features have very little labelled data.
I had tried my hands on SETFIT but it didn't give me good results.
Hey, here is a video on how to set up CUDA for PyTorch on Jupyter Notebooks. Hope this helps! https://youtu.be/d_jBX7OrptI
/// LINKS BELOW ///
Cuda Install
https://developer.nvidia.com/cuda-downloads
Cuda GPU Compatibility
https://developer.nvidia.com/cuda-gpus
Anaconda Install
https://www.anaconda.com/download/
PyTorch CUDA
https://pytorch.org/get-started/locally/
@molten wharf It is rare that a notebook gets a gold medal exactly at 50 non-novice upvotes. It could be because it is older, or because Kaggle has an undisclosed algorithm where they don't count votes from users who have heavily upvoted your posts or notebooks in the past. I think you will need to get at least a few votes over 50.
@deft fox Ohh...! Thank you so much! Then I may have to be patient for a bit more...
Thank you so much for your insight!🤩
I am new to stable diffusion. I am still wondering why do we do this step? And why exactly those numbers. This is an example from huggingface: https://huggingface.co/docs/diffusers/quicktour
The output from Stable Diffusion is in the -1.0 to 1.0 range as floats. PIL images need to be 8-bit ints (per channel). In order to convert a float into a valid int (with the correct range) you need to:
- add 1.0 to make the range 0.0 to 2.0
- multiply by 127.5 to make the range 0.0 to 255.0 (the number range for 8-bit integers)
- cast everything as uint8 now that the value is in the correct range
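A minimal sketch of those steps in plain Python, assuming the values really are floats in the -1.0 to 1.0 range. It uses the equivalent form "divide by 2, add 0.5, multiply by 255", which is the same mapping as "add 1.0, multiply by 127.5". The function name `float_to_uint8` is made up for illustration and is not part of diffusers or PIL.

```python
def float_to_uint8(value: float) -> int:
    """Map a float in [-1.0, 1.0] to an integer in [0, 255]."""
    scaled = value / 2 + 0.5                # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)    # guard against values slightly out of range
    return int(round(clamped * 255))        # [0, 1] -> [0, 255]


# Hypothetical pixel values covering the whole input range.
pixels = [-1.0, -0.5, 0.0, 0.5, 1.0]
print([float_to_uint8(p) for p in pixels])
```

On real image tensors the same arithmetic is applied to every element at once, and the final cast to uint8 is what makes the array acceptable to PIL.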
Got it. Thanks for the response!
Hello! I asked a question here a few weeks ago about LayoutLM and it never got an answer. #❓┊ask-a-question message
If no one can answer, does anyone have any suggestions on where I could find more info about fine-tuning LayoutLM.
I'm not getting the medals for the upvotes in the discussion tab [yes, those votes are neither SELF VOTES nor novice votes] and it says "too much requests" whenever I try to post a new topic. How do I fix it? It's been 8 hours. I have been constantly trying to post a new topic but it throws the same error. @tender trench @verbal crest @wind silo
We have secret algorithms that don't count all votes to progression (typically if the same person upvotes you multiple times we stop counting them), be patient and with more upvotes you'll get medals.
The too many requests error happens when you try to do something too often. Stop trying for a day and you should be able to make topics again.
Hey @vivid ice I have studied how to handle the audio data. And how it works
https://www.kaggle.com/code/sujaykapadnis/audio-machine-learning-for-speech-recog-intro
These were my notes, there're 4 notebooks. Link of next notebook is at the end of the current one
Thanks for clarification 👍🏻
Guys I need a help
how do I submit my model of titanic?
I have waited more than 30 hrs , it doesn't seem to go back to normal. Could u please help me?
@fervent ocean You'll have to contact support (although if you wait longer it will probably fix itself)
Hey, I have reached 99% of my certificate and I can't find the problem in this exercise. Can someone help me?
https://www.kaggle.com/code/otmanesajid/exercise-categorical-variables
Aight I will wait some more
It appears step_4.check() is missing.
it gives me :
/opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: sparse was renamed to sparse_output in version 1.2 and will be removed in 1.4. sparse_output is ignored unless you leave sparse to its default value.
warnings.warn(
when i run this code:
from sklearn.preprocessing import OneHotEncoder
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
when i change it to
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
it removes the warning, but it still does not tell me that I finished the certificate
Hi guys, I have a little problem that generated shocking results. Can anyone help me for just 5 minutes and explain the main problem with the result? (No need to help me correct it, I just want to know where the problem is.)
#cross_val_score(xgb,X_test,y_test,cv=5).mean() : 0.9928446688501186 y_test_pre=xgb.predict(X_test) mean_squared_error(y_test,y_test_pre) :84941265285.15845
When I saw that shocking result I wanted to try this: I calculated the mean squared error on the training data to see whether it equals 0, but: y_test_pre7=xgb.predict(X_train) mean_squared_error(y_train,y_test_pre7) btw xgb.score(X_train,y_test_pre7) : 1.0
Maybe this can be obvious for you guys but i have like just 2months of experience
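One possible reading of those numbers, offered as a sketch rather than a diagnosis: for a regressor, cross_val_score and .score report R² by default, which is unitless, while mean_squared_error is in squared units of the target, so a score near 1 and a huge MSE can both be true when the target values are large. Scoring the model against its own predictions, as in xgb.score(X_train, y_test_pre7), also gives 1.0 by construction. The toy data and the helper names `mse` and `r_squared` below are made up for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error, in squared units of the target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical large-scale targets (e.g. prices); each prediction is off by 50,000.
y_true = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
y_pred = [1_050_000, 1_950_000, 3_050_000, 3_950_000, 5_050_000]

error = mse(y_true, y_pred)
score = r_squared(y_true, y_pred)
print(f"MSE  = {error:,.0f}")
print(f"RMSE = {error ** 0.5:,.0f}")
print(f"R^2  = {score:.4f}")
```

Taking the square root of the MSE (the RMSE) puts the error back into the units of the target, which is usually easier to compare with typical target values.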
Hello sir
so Do you have a other questions?
No thanks you already answered
Hello everyone!
If you have any advice, please share it with me.
https://www.kaggle.com/discussions/questions-and-answers/442690
Improving Classification Accuracy.
I don't mean to be harsh, but you don't seem to be applying any of the previously learned lessons to this new dataset. I left some suggestions for you that will hopefully be helpful.
Hello sir, Sorry but Would you like to talk with you?
Yes, I totally forgot that 😛
I have those techniques too.
I just notice it now haha thanks a lot
How can I use a model from Hugging Face in a competition that bans internet access? Is there a better way than uploading it to /kaggle/input?
Can someone help me out with this... I am not able to understand which fields I should select.
In Predict Health Outcomes of Horses, when i try to submit the file then they show me error.
this my submission file
photo is not clear.
I've been trying to train latent diffusion model, somehow the loss does not converge. Is there any issue in my training loop?
How much time does it take for support team to get back to you?
I don't think there is a prescribed time. Besides, the weekend just ended in North America.
hello, sir.
Sorry for the inconvenience. I have been waiting for your response since yesterday.
Are you using a learning rate that is appropriate for your model and dataset?
Hey folks,
I've been working on a project involving a large dataset, and I've hit a bit of a roadblock. I was wondering if there's anyone here who might be able to lend a hand or share some insights.
If anyone has experience with data analysis or visualization and would be willing to help, I'd be incredibly grateful. It doesn't have to be a huge commitment – even a few pointers or suggestions would be immensely helpful.
Thanks a bunch! 🚀
How can I help you?
Can I Dm you?
Ok
is there a general rule to how much you should be testing?
for example, i'm training on 300 pics (the sense behind the challenge was trying to get a good accuracy with a low number of pics) so i used data augmentation and transfer learning and so
when it comes to testing though, they didn't restrict how much pics we got for testing, so i just went online and got a dataset that had almost 3000 testing pics but i only used around 80 pics to test my model that was trained on 250 pics
would this be considered bad practice and i should have used the whole testing dataset that i had?
I have no idea who you are or what you want, and you asked me if I would like to talk to myself. I have no response to that, and hopefully you understand that the lack of literal response is often a response. If you have a question, I suggest you ask it. Since all of us engage with others on a voluntary basis, you may or may not get an answer.
Hi!
I am working on a project in which I need to integrate a DL model into flutter app .
Could any body help me how to integrate that model into flutter app?
You can use the TensorFlow Lite
Could it decrease the accuracy of the original model?
Have you integrated one before? If so, could you share your git repo link?
Um, maybe. It is possible for the accuracy of the original ML model to decrease when integrating it into a Flutter app,
if it is too large or complex to run on a mobile device
or not optimized for the Flutter app.
I have integrated DL models into Flutterapps before, but I don't have any public repositories that I can share 😩
Alright, it is considered bad practice to only use a small subset of your testing dataset. The purpose of the testing dataset is to evaluate how well your model generalizes to new data, and using a small subset of the data will not give you an accurate assessment of this.
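To put a rough number on that, here is a small sketch using the usual binomial standard error of an accuracy estimate, sqrt(p * (1 - p) / n). The accuracy of 0.84 is only an illustrative value, and the function name is made up.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Compare a small test subset with the full test set mentioned in the question.
for n in (80, 3000):
    se = accuracy_standard_error(0.84, n)
    print(f"n={n}: 0.84 +/- {1.96 * se:.3f} (approximate 95% interval)")
```

With about 80 test images the interval is several percentage points wide in each direction, so two models that differ by a few points cannot be told apart; with a few thousand images it shrinks to roughly one point.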
@everyone is it possible to use image processing to find the soil nutrients like Nitrogen, Phosphorus and Potassium for agriculture. If anyone has any idea can you please reply me it would be really useful.
Um, one approach is to use the texture of the soil to estimate the nutrient levels. For example, soils that are high in clay content tend to have higher levels of nutrients than soils that are high in sand content. By analyzing the texture of the soil, it is possible to get a more accurate estimate of the nutrient levels.😀
@obsidian pulsar is it possible to find the approx level the nutrients in the soil using image processing techniques
Yes, it is also possible to use image processing to detect the presence of specific nutrients in the soil.
I am looking into converting images into HSV color space using computer vision
For example, nitrogen can be detected by looking for the presence of chlorophyll, which is a green pigment that is essential for plant growth.
So if you have some inputs or sources that I can use to develop my model
That would be really helpful
Thank you
Hi Dev,
Why is Kaggle not responding?
Hello everyone,
I got an actual score in my Notebook for the model I built and submitted it for a competition but my public score is 0
Does it take some time to load or is there a problem with how I submitted my output (The submission was successful)? Thanks
If it shows zero, that is your score. Kaggle doesn't use 0 as a placeholder until it calculates the score. As to why you got zero, impossible to tell without knowing more details. We don't even know what the metric is, or whether this is classification or regression. Log-loss of 0 would be excellent, and so would MAE or RMSE. Accuracy, less so.
Should I use knn model for training how a person looks?
or SVMs
Deep learning models
If you are a Messi fan, 😀 ~
lol 😂
Thanks
thanks for the insight. I realised I was submitting the predicting values as '0' or '1' instead of 'True' or 'False'. It was a classification task
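For anyone hitting the same thing, a minimal sketch of converting 0/1 predictions into the 'True'/'False' strings before writing the submission. The column names `id` and `target` and the sample predictions are hypothetical; check the competition's sample submission file for the real header and label format.

```python
import csv
import io


def to_bool_label(value) -> str:
    """Turn a 0/1 prediction into the string 'True' or 'False'."""
    return str(bool(int(value)))


# Hypothetical (id, prediction) pairs produced by a classifier.
predictions = [(1, 0), (2, 1), (3, 1)]

# Write to an in-memory buffer here; a real submission would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "target"])
for row_id, pred in predictions:
    writer.writerow([row_id, to_bool_label(pred)])

submission_csv = buffer.getvalue()
print(submission_csv)
```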
Currently I am working on a project in which I need to recognise diseases in plants/crops through image taken by mobile camera.
So, I am training my CNN model accordingly. Is it good to go with CNN or any other neural network that you recommend to increase the accuracy of diseases detection?
Hello, I am trying to resize the images but the disk space is full. Kindly guide me on how I can resolve this issue.
One option is to download the files to a local computer and run it there, assuming you have enough disk space. Yet another is to try and delete the original images on Kaggle AFTER you resize them, as that will presumably create more disk space. Not sure this option will work as the original dataset probably doesn't count as your image quota, and you may not have access to delete it.
Thanks @deft fox I appreciate your response 🙂
CNNs will likely work. However, it is unlikely that you will be able to collect a truly large number of images, say 10,000+, which is what is needed to train CNNs properly from scratch. I suggest you start with pretrained neural networks (VGG16, ResNet, SqueezeNet, EfficientNet, Inception or something along those lines) and fine-tune them on your images. https://www.analyticsvidhya.com/blog/2020/08/top-4-pre-trained-models-for-image-classification-with-python-code/
Guys i need a little help...
Actually i am new to ML field... i have started learning some algorithms but cannot understand how to move on the projects.
Can someone pls help me with how can i start my journey in a proper way in this field.
Also i started reading some ML related books....are they beneficial??
Currently i am working on a project for yield prediction using NDVI data but i cannot find much of the data on kaggle...can you suggest me any site or anything that can help me with NDVI dataset /_/\
I am looking for transcribed Australian podcasts on humour, sarcasm and everyday conversation, would anyone be able to help point me in the right direction?
@deft fox Thank you for your response and valuable suggestions 👍
Hi, I was wondering if anyone here is good with image analysis using Python. I am still very lost, trying to learn on my own... I have labeled image data in YOLO format using labelImg. I just don't know what to do with my image dataset and labels in Python.
I have to quantify fluorescently labeled cells in images, any advice would be appreciated 🙏😭
You can use the scikit-image library to load and view your images.
from skimage.io import imread
Did you preprocess your images before performing cell analysis?
A general approach here is to use train images with labels and object masks for fine-tuning the existing model. After that you test on a separate set of data that hasn't been seen during training. Not sure that YOLO has the ability to quantify fluorescence or anything else, as most of these types of models are meant to be qualitative rather than quantitative. It may be a bigger bite than what you can chew if you have no background in image analysis, as this is a decidedly non-trivial task.
So I have images like this. I have to count "Red" cells [Necrotic cells], "Green" cells [Live cells], "Green Yellow" cells [Early Apoptosis] and "Yellow Orange" cells [Late Apoptosis], and I used labelImg to box cells of each color type and label them as "Live", "Necrotic", "EA", and "LA". Not sure if that is the right approach.
This is different from how you initially described it, and might work. It is important to clearly delineate cells when drawing ovals/rectangles around them. Are you relying on your eyes to make a distinction between these colors? If so, the classifier will be only as good as your eyes are.
I also read that U-NET is something that could be used for something like this where I have different instances [colored] of cells, but not sure if that is the right approach or how to approach using such methodology and how to best label my images for such, I guess I was put in a bit of shark tank with no guidance to figure this out on my own since we are doing this for first time ever
Also, it seems that your signal is diffusely green in cytoplasm no matter what color is in nucleus, and that also might complicate things.
To train from scratch using U-NET or any other architecture, you would need at least thousands of images with many labeled cells in each (100+). That's why a pre-trained model might still be desirable as you only need to fine-tune it, which can be done with a relatively smaller number of images.
i have a dataset of close to 400 images
how should I approach a pre-trained model for such a task? Also I agree the diffuse cytoplasm may be an issue
400 images may sound like a lot to you - and I know from a personal experience that labeling that many images is a pain - but that's nothing for training models from scratch. You'd have to set aside at least a quarter of them for testing, and 300 images x 100 cells is not very much when you have 4 different categories.
Yes; I have been labeling a lot of images, but I am not sure if I took the right approach - I used the labelImg API and the yolo method of output, which is just the format like this:
1 0.699219 0.573423 0.159375 0.172072
0 0.684766 0.521622 0.016406 0.023423
2 0.288672 0.284234 0.205469 0.217117
it gives the label, coordinates and length i believe.
Which pre-trained model may be best suited. Also how to best approach learning and being able to code a model to perform the task I desire?
Those numbers seem like relative rather than absolute coordinates. What I have seen is something like 588, 417, 661, 479 which is xmin,ymin,xmax,ymax coordinates. Maybe when you multiply your numbers by image width and height they become whole numbers as well. YOLO-supplied models should work as pre-trained models to be tuned.
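A small sketch of that conversion, assuming the usual YOLO label layout of class id, x center, y center, box width and box height, all given as fractions of the image size. The image size of 1280 x 1110 pixels is a hypothetical value, since the real image dimensions are not given, and the function name is made up.

```python
def yolo_to_pixel_box(line: str, image_width: int, image_height: int):
    """Convert one YOLO label line to (class_id, xmin, ymin, xmax, ymax) in pixels."""
    class_id, x_center, y_center, width, height = line.split()
    # x values scale with the image width, y values with the image height.
    xc = float(x_center) * image_width
    yc = float(y_center) * image_height
    w = float(width) * image_width
    h = float(height) * image_height
    return (
        int(class_id),
        round(xc - w / 2),
        round(yc - h / 2),
        round(xc + w / 2),
        round(yc + h / 2),
    )


# Hypothetical image size; replace with the real width and height of each image.
IMAGE_WIDTH, IMAGE_HEIGHT = 1280, 1110

label_lines = [
    "1 0.699219 0.573423 0.159375 0.172072",
    "0 0.684766 0.521622 0.016406 0.023423",
    "2 0.288672 0.284234 0.205469 0.217117",
]
for label_line in label_lines:
    print(yolo_to_pixel_box(label_line, IMAGE_WIDTH, IMAGE_HEIGHT))
```

Counting cells per class then comes down to counting how many boxes carry each class id.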
You will have to research this on your own or better yet get local help from someone who knows, as I can't guide you through all the steps via keyboard.
Kaggle should have some notebooks that cover all these steps if you are patient and go through many search results.
Thank you so much, also once I have done that to my best abilities, would it be okay for me to reach out to you directly?
There is no guarantee that I or anyone here will respond when contacted, as all communication is done on a voluntary basis and depends on available time. But there is no harm in trying to get in touch.
Thank you so much for your help again!
Hello, this is a career related question to the data scientists and ml engineers of the industry.
How should an undergrad student navigate his way into internship and FTOs in this domain
The very basic yet crucial thing is to be good at mathematics and statistics. You have to learn all the concepts to proof (knowing applications is a plus).
Secondly, you have to be good at any programming language (I will suggest Python) from moderate to expert level.
Next, Learning various machine learning algorithms and practice them with industry based project (based on which domain you're or want to work in future).
Casting the acceptable RESUME..
Hope this helps..😌 🙂 🤝 🤞
Not really sure why I can't get credit on the last Python exercise, but I've even tried copy-pasting the solutions and it still won't fill it up: https://www.kaggle.com/code/vidmaric/exercise-working-with-external-libraries/edit any ideas?
Hi I'm Jonathan, Singapore born California bred. I work at the intersection between Quantum Computing and Web3 at pQCee, a post quantum computing startup based in Singapore. I'm the product owner of QuantumNFT, a platform that let's developers showcase their quantum programming skills. We're addressing the talent gap problem. We're validating in the QIF Quantum Games Hackathon. We want to do what Kaggle has done for Data science for Quantum Science. For this hackathon, an idea is to build out the competition workflow. Is there anyone from the team that can spare 30 minutes for a discovery call? 🙏
Hey there is this allowed
Say I have a friend who's NOT competing in a competition
They decide to lend me their account for GPU hours
This isn't code sharing since they aren't competing, and it isn't multiple accounts of the same person. So it shouldn't violate any rules.
Can any1/@mild geode staff confirm?
Pretty sure this is a violation but you should wait for the official response. It’s like having multiple accounts to work with but only submitting from one of them.
Yeah, waiting for the official response, since the point is I'm borrowing someone's real account for GPU hours (who isn't participating),
And I don't have multiple accounts so not violating rules)
It's a 'technicality' but I don't wanna get banned on it
cc @twin elbow
You are thinking only about what feels right to you, but moderators have to think globally. What if someone has 10-20 friends who are not participating and all of them are willing to donate their GPU hours? Do we draw the line at 5 friends that can contribute their GPU hours, or is 50 okay as well? If there is no line, soon enough everyone would be making Kaggle friends left and right, which would create inequity in how many GPU hours individuals have at their disposal.
Aaah yeah that makes sense !
But there's a loophole where the "friends" participate in a team with the person and pool GPU hours. The catch is that they didn't actually participate and just lent their account .
Since within a team people can share anything.
Not sure how that's moderated...
If it can't be moderated then it makes no sense not allowing what I proposed .
It could be capped the same as max team size which is 10
But yeah I agree there isn't any one size fits all 'fair' solution and there is a lot of nuance
@barren phoenix Again you are not thinking like a moderator, so this might help. Let's say that someone has multiple accounts (a violation) and is running notebooks on them using the same IP number. Now here comes you without multiple accounts and borrowing your friend's GPU hours, but you are also running notebooks from multiple accounts using the same IP number. How are Kaggle moderators going to distinguish between these two events? Would they even care to do it even if they could?
Aaah yup valid point thanks ! Ig I should just stick with my own account
i have a question please
if i am making a brain mri tumor classification program and we are given a dataset of 250 images to train and validate on
and this is kinda like a challenge
i tried to use transfer learning
i am testing on an online dataset of 60 images that aren't in my current training data
the highest accuracy i am getting with transfer learning is 78%
i am testing many models and they are all performing poorer than Xception that got 78%
is it even possible to get higher than 78% or am i wasting my time
it is worth to note i am using data augmentation for sure and i tried fine tuning hyper parameters like learning rate
I am looking for someone to run through the Titanic competition with me so that I can learn. If anyone is up for the challenge or has any insights for me, please share. Thanks
here is a link to the notebook with all the models i tried to use, i test on almost 1k images. any help would be appreciated
It is impossible for anyone who hasn't tried that exact dataset to tell you whether you are doing well or not, and we don't even know what dataset you are using. Generally speaking, it is very difficult to get excellent performance if you train on 250 images and validate on 60, but 78% sounds decent. You may want to try to split your dataset into 5 folds and make 5 models, and then average their predictions. That might give you a small boost.
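A rough sketch of the 5-fold idea in plain Python: split the 250 training indices into 5 folds, train one model per fold on the other 4 folds, then average the 5 models' predictions on the test images. Model training is left out; the helper names and the example probabilities are hypothetical.

```python
import random


def k_fold_indices(n_samples: int, n_folds: int = 5, seed: int = 0):
    """Return a list of (train_indices, validation_indices), one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one held out for validation.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        splits.append((train_idx, val_idx))
    return splits


def average_predictions(per_model_predictions):
    """Average predictions element-wise across models."""
    n_models = len(per_model_predictions)
    return [sum(values) / n_models for values in zip(*per_model_predictions)]


splits = k_fold_indices(250, n_folds=5)
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train on {len(train_idx)}, validate on {len(val_idx)}")

# Hypothetical tumor probabilities from the 5 fold models for 3 test images.
fold_predictions = [
    [0.9, 0.2, 0.6],
    [0.8, 0.1, 0.4],
    [0.7, 0.3, 0.5],
    [0.9, 0.2, 0.7],
    [0.7, 0.2, 0.3],
]
print(average_predictions(fold_predictions))
```

Each image is used for validation exactly once, so every model sees a slightly different 200-image training set, which is where the averaging gets its benefit.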
For brain MRI tumor models, using a single model trained on a larger data set is far superior to using an ensemble of models with cross-validation.
Currently i am working on a project for yield prediction using NDVI data but i cannot find much of the data on kaggle...can you suggest me any site or anything that can help me with NDVI dataset .
NDVI data is basically data extracted from satellite images
Please respond
NASA Earthdata and Google Earth Engine
or Sentinel Hub
I am new to data science and looking to get a headstart in this domain. I am currently learning python and its libraries.Should I do something else along with this?
For almost any type of data and models, using a larger dataset is superior to using a small dataset. It was implied in my statement that it is very difficult to get a high-performance model when training on 250 images. If you are saying that doing a single model on a large dataset is far superior to doing an ensemble on the same large dataset, that simply is not the case.
Understand, but You have to know Ensemble models are a type of machine learning model that combines the predictions of multiple base models to produce a more accurate prediction. Ensemble models can be trained on datasets of any size. However, they are often more effective when trained on larger datasets.
Again, you are not writing very precisely so others may get wrong ideas. Ensembles are not combining just base models - they can combine any kind of models. My original contention with your statement is that you seemed to suggest that single models on a large sample would do better than ensembles. Generally speaking, on the same dataset ensembles will do better than any single model. That goes for any dataset, whether big or small.
While ensemble modeling can offer excellent performance, it can be a complex process to implement and may not be as effective when handling large datasets. In such cases, it may be more appropriate to rely on a single model that can handle large datasets efficiently. This can help to streamline the overall modeling process and ensure that the final model meets the desired level of accuracy and performance.☺️
I am not disputing your last statement. Yes, some people may not care about a complex ensemble to get a 0.01% improvement when a single model may be lighter and easier to implement. Single models may be more appropriate, no doubt about that. Yet "more appropriate" doesn't mean "far superior" which was your original statement. Single models are not "far superior" to ensemble models, even though there could be good reasons to use them.
I said this because I saw cases where a single model could be convenient.🫡
And machine learning is not something you do for fun.
thank you for answering, my dataset is basically 250 pics split into two halves with half being pics of brains with a tumor and the other half being pics w no tumor, my issue was that i expected transfer learning to yield a higher accuracy but i could not get more than 81%, tbf i tested on 1k pics when i only trained on 250 which would not be a real life scenario since if i had 1k pics, i would have most probably used them for training but the challenge for the task was that i need to train on only 250 images
It is definitely challenging to train a model on such a small dataset. 😩
You can use data augmentation techniques to increase the size of your dataset. This can be done by flipping, rotating, cropping, and adding noise to your images.☺️
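The augmentations listed above can be sketched with NumPy alone. This is only an illustration: the `augment` helper, the 4-pixel crop margin, and the noise level are assumptions made for the example, and in practice a framework's built-in augmentation tools would usually be used instead.

```python
import numpy as np

def augment(image, rng, noise_std=0.05):
    """Return a randomly flipped, rotated, cropped and noised copy of `image`.

    `image` is assumed to be a square float array in [0, 1] with shape (H, W).
    All parameter choices are illustrative assumptions.
    """
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0/90/180/270 degrees
    # Random crop that removes a 4-pixel margin, then pad back to the original size.
    h, w = out.shape
    top, left = (int(v) for v in rng.integers(0, 5, size=2))
    cropped = out[top:h - 4 + top, left:w - 4 + left]
    out = np.pad(cropped, ((top, 4 - top), (left, 4 - left)), mode="edge")
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive Gaussian noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32))                        # stand-in for one grayscale scan
copies = [augment(image, rng) for _ in range(4)]    # four extra training samples
```

Augmented copies should only be generated from the training images, never from the validation or test images.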
👍
Is there any pytorch-based time series feature extraction libs? Most of the implementations I saw are based on dataframe.groupby and apply...
I mean if there's not I may have an idea and I can start something. I assume torch even on CPU utilizes maximum resources and can achieve better performance. (Or maybe I'm wrong?)
The libraries I'm looking at right now are tsfresh and tsfel.
You already picked good libraries for this purpose. Another good one is https://github.com/DataCanvasIO/HyperTS
Good to know. Thanks so much for that!
Is this the place to ask questions regarding specific Kaggle Courses/Exercises?
@thick vessel You can ask specific questions here or in the discussion forums for each course on the site.
🫡
unless you can utilise the gpu I don't think torch will give you any extra optimizations on top of numpy tbh. (and tsfresh is really decent, as is sktime )
I'm working on the Optiver competition dataset. TSfresh Takes ~25 secs(where 15 seconds for creating the rolling time series data frame) to do the feature extraction for the first stock in the training set and there are 200 stock_id in that dataset. If I use a for loop on top of my current code it'll probably take >1hr on CPU to run on all training data. This is fine and affordable for the training stage(since I can pre-compute them) but I may be using GPU for inference. But I think overall you are right. I realized that if I use another way to compute, it'll probably spend as much time in creating the rolling data frame.
I just created a notebook and I wish to share it to the competition's code section, can anyone teach me how to do that? 
In the competition page, Click the "Code" tab.
Upload your notebook file (I assume by searching and selecting it)
Open the notebook, save a version, then in the version history click the three dots; at the bottom you will find the "Submit to competition" button.
Hope this works!
Hello all! I had a question regarding the implementation of momentum based gradient descent. Should I be zeroing the momentum based gradient at the start of each epoch or keep updating it across epochs?
The momentum term in SGD is meant to overcome noisy gradients, especially in small datasets. In the Keras implementation its default value is 0, which means that it is not required. If you don't enter any momentum value, I suspect most SGD implementations will default to zero. Presumably that means it is safe not to use momentum, but it doesn't necessarily mean that momentum=0 is the best choice. I think you should stick with one value for it rather than try to change it from one epoch to another, as that will only add more complexity to the interpretation.
Sorry I was not referring to the momentum parameter in the update rule. Rather, the velocity that you maintain during training as (momentum parameter) * velocity + lr * gradient
should I reset it after each epoch of training or maintain throughout?
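For what it's worth, common optimizer implementations keep the velocity for the whole training run and do not zero it at epoch boundaries. A minimal sketch of that convention on a toy 1-D problem follows; the `sgd_momentum` function and its default values are assumptions made for illustration, using the same update rule as in the question.

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, epochs=5, steps_per_epoch=50):
    """Minimise a 1-D function with SGD plus momentum.

    The velocity is created once, before the first epoch, and carried
    across epoch boundaries instead of being reset to zero.
    """
    velocity = 0.0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            grad = grad_fn(w)
            velocity = momentum * velocity + lr * grad  # update rule from the question
            w = w - velocity
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
```

Resetting the velocity each epoch would throw away the accumulated gradient direction, which is the information momentum is meant to preserve.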
anyone know what the options are for persisting data in output folder across notebook sessions?
I'm converting some csv files to parquet, but I dont want to do that every time i boot up a notebook
You create a private dataset with those .parquet files, and simply point future notebook versions to that dataset.
I see, thank you. It’s my first time working on a Kaggle comp!
thanks!
i am taking an applied statistics class with R and i am stuck on an error in my code
is this normal/ok ?
I am downloading image files for an image model classifier, I noticed the red CPU bar and RAM, is this ok ? should I ignore it? is there a way to optimize it ?
Appreciate all the help 
getting this too, looks like I am also running out of RAM.
Hello everyone!
I am currently working on my second computer vision model, and I am having a hard time reducing the error_rate and improving accuracy, despite intensive data cleaning (it actually got worse). I would appreciate any suggestions!
notebook link:
https://www.kaggle.com/code/raedsherif/green-leaves-classifier/notebook
I also have a few questions:
- Is the input data available to you ? I could not find it in the notebook, should I download it to my local device and upload it as a Kaggle dataset, then input it in the notebook?
- As you can see, when installing libraries it generates a lengthy output. which looks bad, is there a way to clear or avoid displaying this output?
- After cleaning the images, should I create new dl and run fine_tune again ? Does it pick up from where it left off and saves previous progress?
My first model was 76% accurate at predicting plastic types and I was able to share it with a basic interface on Gradio, for this one, I'm aiming for a high level of accuracy and eventually want to deploy the model on a website. Any tips or tricks would be highly appreciated
thank you 
If you use pip install -q, that should make the installation quiet. I suggest you consider using validation loss as your metric rather than accuracy, as the loss is steadily going down in your case. I think you shouldn't stop the training after the first epoch in which the metric doesn't improve. Typical patience values (the number of epochs to wait without improvement) are 5-20 epochs, but in your case 2-5 may be more appropriate. It seems that you also need to fine-tune for more than 7 epochs, maybe with a decaying learning rate.
Thanks, I will apply these methods, what about using a better model compared to resnet18 ?
*(ofcrs, I want to make sure I improve my model with the resnet18 first, but generally speaking what is your take on using better models?) *
Better models will likely give better result, but it won't necessarily be very dramatic. I think working on your implementation is more likely to bring a substantial improvement.
can anyone tell what does q1.check() mean
hey guys, I would like to start my first data analysis project but I don't know where to start. do you recommend starting with a big dataset or a small dataset? could someone help me pls?
@patent kiln q1.check() is the function to check your answers, make sure you have run the cells above it to properly import the learntools package.
hello, @verbal crest
I want to get into NLP. I am now studying transformers; what is the next step for me?
What are the best books or courses that I should take?
oh i see, also it seems that i cannot copy paste the upper code into the blank one
You can think of it as a function written to check whether ur answers are right or wrong
And this error is likely bcoz u have not run the initial code cells to activate this service
oh okay thanks
@glossy edge start with titanic. Work on Eda first before trying a new model every day.
Have a vid covering titanic and a playlist of all the models if you want to check it out
Hi everyone, I have recently been learning about the preprocessing step in ML, and I have a question regarding standard scaling. Some algorithms require standard scaling to be more accurate, with regularization and so on. As I scroll through lectures and some shared notebooks on Kaggle, I notice that they apply the sklearn StandardScaler right after train_test_split, without considering the data type (like nominal features, or features after one-hot encoding). My question is: does this affect the performance of the algorithm?
Hello @everyone!
hello Mnihj
Depends on the data used, but you can throw it into a pipeline and select certain columns for it
I do standard scaling after train test split
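A minimal sketch of the "pipeline with selected columns" idea, assuming scikit-learn is available. The synthetic data and the column layout (two numeric columns followed by one already-binary column) are made up for the example; `ColumnTransformer` scales only the listed columns and passes the rest through unchanged.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): columns 0-1 are numeric, column 2 is a 0/1 indicator.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(40, 12, n)
income = rng.normal(50_000, 15_000, n)
is_member = rng.integers(0, 2, n)
X = np.column_stack([age, income, is_member])
y = (age + income / 1_000 + 10 * is_member + rng.normal(0, 5, n) > 95).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale only the numeric columns; leave the binary column untouched.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), [0, 1])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)  # scaler statistics come from the training split only

X_train_t = model.named_steps["preprocess"].transform(X_train)
```

Because the scaler sits inside the pipeline, it is fitted on the training split only and then reused on the test split, which also avoids leaking test statistics into preprocessing.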
code:-
t3 = tf.random.Generator.from_seed(12, alg="threefry")
t3.normal(shape=(2, 3))
error:-
InvalidArgumentError: {{function_node _wrapped__RngReadAndSkip_device/job:localhost/replica:0/task:0/device:CPU:0}} Unsupported algorithm id: 2 [Op:RngReadAndSkip]
why i get this error?
how to use hugging face model "bert-base-uncased" model in kaggle.
I was trying to login using hf-cli or using a token still autotokenizer is throwing private repo use token error.
Thank you for the advice.
Has anyone tried a decision tree on Titanic, and what is the accuracy?
It is a guarantee that many have tried decision trees on Titanic dataset https://www.kaggle.com/search?q=titanic+decision+tree
Hey! Does anyone know where to find examples of companies that reported a clear benefit to their business as a result of hosting a Kaggle competition or a competition on another platform? In Luca and Konrad's Kaggle book, they list three examples (Netflix, AllState and GE), but I would like to find more examples
Is there a good explanation anywhere of which GPU (P100 vs T4) you should use? I've struggled to find anything!
can i fine-tune a model with a json structure or even jsonl? i know the answer is yes. I just need to know if i always have to format the data this way when fine-tuning:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
anyone know how to embedded openAI keys to project
I am getting serious error , let me know if you can do it
Do you know a blog (or something else) that reviews Kaggle competitions (namely, describes the most common approaches, analyzes the best solutions, etc.)?
If not, I am thinking of starting one.
HI Everyone, does anyone understand the Time series forecasting (Sales forecasting )in Kaggle? I'm having a hard time understanding it.
i passed a json doc into model using langchain and when making an inference of the RAG model it does not seem to be responding based off the json data specifically, when asking it certain questions that should have a good response. Do i need to restructure my Json Data away from the nested dict style to something more condensed?
Anyone?
I think that it would be meaningful for everyone
yeah where can i watch it
In this Python Machine Learning Tutorial, we take a look at how you can split a data set through train test split in scikit learn.
This is a great method for prepping your data before you run a model.
@glossy edge 25 vids in here
Hello guys,
Please is there a place to see past project presentation slides and recordings?
Hello
hey guys, does saving a notebook in kaggle use up my gpu quota as well?
and when using the T4 gpu, is there anyway to use both of them at once? to maximize the usage of the quota
why is it that, when compiling the model , say running 80 epochs, i get to see a pattern in the change of values of validation loss and accuracy? and also there is a pattern in reduction and increment in the learning rate?
Hello @everyone, What can I help you?
is the scoring stage just comparing my submission with the real data, or will it use some private data to check? it has taken more than an hour and i'm still not getting my score
Heyy folks, any suggestions like tutorials to start working with tensorflow???
hi everyone, i'm working on a clustering model with k-means, but i have a dilemma. The elbow method says that the best k is 4, but when i look at a 2D PCA plot it looks like the best k should be 2
When i try using k=2 in the model, the clusters don't match the obvious groups. What am i doing wrong? should i recheck the documentation?
im looking for data for time series forecasting, it should be above 5 GB anything around 10gb+ would be nice to practice!
I have a question . On Kaggle Getting started competitions, we are provided with train and test sets separately. Is it okay to merge both of them for doing preprocessing easily or not ? According to this blog : analytics vidya blog (https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/) we should not do this because it can lead to Data Leakage . Can anyone tell ?
U can start with YT . There are some great tutorials available. Also on Coursera , u can find some great courses
i have the age column missing in some of the rows in my data, should I do mean imputation or resort to some other method
Hi, i have 50 clusters of numbers; each cluster corresponds to a plume shape, and i know the location of each origin point. i want to translate each plume (cluster) to the same position so that they are superposed on each other, because i want to evaluate their average values. does anyone know how to implement this in python?
i am trying this from yesterday, not able to implement it
Hey, I hope you are all doing well. I want to extract information from a resume. How can I do it? Any guidance, please?
hello everyone, can somebody help me and answer on my questions?
I don't think anyone can commit to helping you without knowing the actual question. Why don't you ask and see if anyone responds.
Hi, who should I contact for a Kaggle bug?
I see a weird bug on the Output => Submit page: in the CTF competition the web page starts blinking a lot, I cannot click "Submit", and after a few minutes I get a 429 "too many requests" error on any kaggle.com web page. It's like I'm banned for a few hours then.
Some pages return:
So...does data regularization(like min-max, normalization, box-cox, etc. to keep all data items in a limited range) improve performance for Gradient boosting trees like LightGBM(in a regression task)? I do assume they don't do much for standard decision trees in a classification task that I may learn in class since they are based on entropy.
I think since it's related to a competition, maybe opening a discussion post would be sufficient. Some hosts/admin have access to what's happening to your code on the testing data and can provide related information(that won't leak anything about the testing data of course) for you to debug.
@split epoch thanks
I don't think that they would have much of an effect, but it depends on the data. If some features in the data are very volatile, clipping the data may improve the model performance, but LightGBM tends to be very resilient so it probably wouldn't matter much.
Thanks, I guess I'll take a closer look at how the model works.
Very stupid question, but I am new to the competition in coding in general. Where do I find the data to download, Ik the competition lists pfr and nflverse but how do I download either of those so i can get started
You must join the competition before downloading the data
Scaling should not matter to GBMs.
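The usual argument is that a tree split depends only on the ordering of a feature's values, and any monotonic rescaling (min-max, log, and so on) preserves that ordering. A toy illustration in plain Python follows; `best_split_partition` is a hypothetical single-split search written for this example, not LightGBM's actual histogram-based algorithm.

```python
import math

def sse(targets, idx):
    """Sum of squared errors of `targets` at positions `idx` around their mean."""
    mean = sum(targets[i] for i in idx) / len(idx)
    return sum((targets[i] - mean) ** 2 for i in idx)

def best_split_partition(values, targets):
    """Return the set of row indices sent left by the best threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_cost, best_left = float("inf"), None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if values[left[-1]] == values[right[0]]:
            continue  # no threshold can separate equal values
        cost = sse(targets, left) + sse(targets, right)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

raw = [1200.0, 300.0, 4500.0, 800.0, 10000.0, 2500.0]
targets = [1.0, 0.5, 3.0, 0.8, 3.5, 2.8]

lo, hi = min(raw), max(raw)
min_max = [(v - lo) / (hi - lo) for v in raw]  # min-max scaling
logged = [math.log(v) for v in raw]            # log transform

same_split = (best_split_partition(raw, targets)
              == best_split_partition(min_max, targets)
              == best_split_partition(logged, targets))
```

The rows end up on the same side of the split in all three cases; only the numeric threshold changes.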
Hello, I'm currently training a Convolutional Neural Network. Is this a good way to train my model to unseen data?
The training accuracy jumps real high at the start while the validation accuracy gradually gets better
?
Hello, I am building an analytics dashboard on Streamlit; my code is above. I want to add the delta parameter to my metrics (st.metric(label=, value=, delta=)). My delta will be the total sales difference (increase or decrease). The aim is to show if sales are increasing or decreasing each year. All the code I have written to achieve this has gone wrong. I got it done in my Jupyter notebook, but if I implement the same on Streamlit I get an error that my value should be an int, str... Please, I need your assistance
I don't think auto is a valid option, is it? The argument is more in the line of: it can be a string for as long as the string contains a numerical value (So you can add an Unit e.g. Degrees Celsius).
Is not, I want to change it to the difference. So that when I filter, I will get for eg the total number of sales between the years (to know if the sales are decreasing or increasing with each passing year). So my problem is the Python code that i will use to implement in Streamlit.
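One possible cause of that type error, offered here as an assumption since the code is not shown, is passing a NumPy or pandas scalar (or a whole Series) where `st.metric` expects a built-in `int`, `float`, `str` or `None`. A minimal sketch of computing the year-over-year difference as plain Python numbers follows; `yearly_deltas` and the sales figures are made up for the example.

```python
def yearly_deltas(sales_by_year):
    """Map each year to (total, change vs. previous year).

    The first year has no previous year, so its change is None.
    """
    result = {}
    previous = None
    for year in sorted(sales_by_year):
        total = float(sales_by_year[year])  # cast NumPy/pandas scalars to built-in float
        delta = None if previous is None else total - previous
        result[year] = (total, delta)
        previous = total
    return result

sales = {2021: 120_000, 2022: 135_500, 2023: 128_250}  # hypothetical totals
deltas = yearly_deltas(sales)

# Assumed usage inside a Streamlit app (needs `import streamlit as st`):
#   total, delta = deltas[selected_year]
#   st.metric(label="Total sales", value=total, delta=delta)
```

With a pandas dataframe, the input dictionary could be built from a grouped sum per year before being passed in.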
Broad stroke question- but can someone explain data models to me and how to build them?
Is it more than just combining datasets into relational tables? i.e. Sales and Customer data joined on transaction id?
Or is it more forecasting or financial modeling? I think the job postings I read are throwing this term around very loosely...
Two things: your training curves are way ahead of validation so I suggest regularizing your network more or add/increase dropout; even after 60 epochs val accuracy is still going up and val loss still going down, so you should train longer.
A data model is a structured representation of how data is organized and accessed in a database or information system. It defines the relationships, rules, and structure of the data to ensure accurate storage, retrieval, and management of information.
One example of a data model is the Entity-Relationship Model (ERM), which represents entities, their attributes, and the relationships between them. For instance, in a university database, "Student" could be an entity with attributes like name and student ID, connected to other entities like "Course" through relationships indicating enrollment.
Got it, I'll update you when I do this later. Thanks for the suggestion!
Also, the graphs I showed you were a sign of overfitting, right?
Thank you 🙂
Is 15000 too big for RMSE?
it's relative to your dataset's value xD
i'm guessing you didnt normalise it
If you estimate number of sales for your local shop and you have a RMSE of 15000 it's huge yeah xD
if you estimate the number of sales for the whole McDonald's corporation, 15000 RMSE is very good
@celest sphinx can you check dm?
no i dont take DMs but i was just giving you a quick answer to your question
usually, to evaluate your model performance
you start by making a "baseline" model, so a very simple model, for example: always predict the average value of the dataset
there are more accurate baseline models of course
then you can compare your model's performance to those statistical baseline models
And to answer your question from a business view: your acceptable error depends on what the client is willing to lose
you should never aim to reach 0 RMSE anyway, since that means you learnt the noise in the data too. Usually the main way to know if your model is good is whether your training RMSE is close to your validation RMSE
i'd rather have a model with 0.11 TRAIN & VAL RMSE than a model with 0.0001 train RMSE and 0.10 VAL RMSE
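The baseline comparison described above can be sketched in a few lines. The numbers below are invented for illustration, and `model_val_pred` stands in for whatever a real model would output on the validation set.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_train = [10.0, 12.0, 9.0, 15.0, 14.0]
y_val = [11.0, 13.0, 16.0]
model_val_pred = [10.5, 13.5, 15.0]  # hypothetical model predictions

# Baseline: always predict the mean of the training targets.
baseline_pred = [sum(y_train) / len(y_train)] * len(y_val)

baseline_rmse = rmse(y_val, baseline_pred)
model_rmse = rmse(y_val, model_val_pred)
print(f"baseline RMSE = {baseline_rmse:.3f}, model RMSE = {model_rmse:.3f}")
```

An RMSE is only "big" or "small" relative to a reference like this one, measured in the same units as the target.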
Thank you
Don’t know if there was overfitting as validation scores were still improving.
I see, I see. Have you encountered this by any chance, or do you find this unusual when training a model?
Hi guys, I am studying machine learning and suddenly I got something in my mind,
Recently i learned and did some statistics problems basically hypothesis testing and stuff but I can't visualize their implementation in industry.
I heard that statistics is the base of all AI and ML but it's hard for me to solve problems using statistics and I just know theory of it. Does anyone know of good use cases or some books which can help me with it.
Much appreciated
Hello everyone, I trust this message finds you well. I am seeking advice on whether pursuing an online Master's degree in Data Science from Coursera is a prudent choice. Furthermore, do you have any recommendations concerning funding options for pursuing a Master's degree?
I have seen this before. Your model is training too fast because there isn’t enough regularization. Adding a dropout layer should slow it down and also run for longer. There shouldn’t be a predetermined number of epochs. Rather, go for an arbitrarily large number of epochs and use early stopping with patience at 5-15 epochs.
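Early stopping with patience can be sketched independently of any framework. The `early_stopping_epoch` function and the loss values below are assumptions made for illustration; in Keras the same behaviour is provided by the `EarlyStopping` callback with its `patience` argument.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # new best: remember it
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch             # patience exhausted
    return best_epoch, len(val_losses) - 1       # ran out of epochs first

# Hypothetical validation losses: a late improvement at epoch 6, then a plateau.
losses = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.64,
          0.66, 0.67, 0.68, 0.69, 0.70, 0.71]
best, stopped = early_stopping_epoch(losses, patience=5)
```

The weights saved at the best epoch, not the final ones, are normally the ones kept.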

What does Kaggle mean btw?
just a random word that sounds nice?
chatGPT said
"The name is a play on the Japanese word "kaggle" (pronounced "kah-glay"), which means a group of people who come together to learn and collaborate."
Hello everyone. I've a trouble. I have been using one notebook for some time and it worked just fine until few days ago I started to get the following error:
CUDA error: CUDA driver version is insufficient for CUDA runtime version
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Interestingly enough, my friend was using the same notebook and two days it worked fine for him while mine wasn't working already. Yesterday he started to receive this error too. Could anyone please tell a solution here?
Pretty believable! But the real answer can be found here: https://www.reddit.com/r/dataisbeautiful/comments/80xl66/hey_reddit_im_anthony_goldbloom_founder_of_kaggle/duyw6gm/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button 🙂
haha thanks
Plz share the list of all ML algorithms that are affected by correlated features.
Can anyone tell me what I'm doing wrong? I want to access a json endpoint from the chess.com API using python. But I'm getting error code 403 again and again. This is my code:
import requests
api_url = "https://api.chess.com/pub/player/jitesh117"
try:
    response = requests.get(api_url)
    if response.status_code == 200:
        data = response.json()
        print("API Response:")
        print(data)
    else:
        print(f"Request failed with status code: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
I reduced the batch size and added the dropout layer and it got better. Thanks for the help! On another note, can you give a few ideas on how to handle unknown classes? For example, the classes are dogs and cats and the unknown (well, known unknowns) would be all the other animals.
Hello ,I am new to data science and I want a little guidance on the binary classification with software defects
Why do people ask for a thumbs up on their projects?
Medals are awarded after a certain number of upvotes. Expertise levels increase as the number of medals goes up. In short, more upvotes means greater recognition.
Are transformers = attention models?
anyone can help?
hi i am a beginner how to tune your model to make it perfect for example when to add a dense layer or a drop out is there a course/book for that?
HTTP error 403 (Forbidden) means the server understood the request but refused it, often because of missing or rejected credentials. You should check the API specs for any required authentication tokens (perhaps an 'Authorization: Bearer' header or simply a URL token)
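Another possibility worth ruling out, stated here as an assumption about this particular case, is that the server rejects the HTTP client's default User-Agent rather than asking for credentials. A minimal sketch that sends an explicit User-Agent using only the standard library follows; the `build_request` and `fetch_player` helpers and the User-Agent string are hypothetical. With `requests`, the same dictionary can be passed as `requests.get(api_url, headers=headers)`.

```python
import json
import urllib.request

def build_request(username, contact="your-email@example.com"):
    """Build a GET request for a player profile with an explicit User-Agent."""
    url = f"https://api.chess.com/pub/player/{username}"
    headers = {"User-Agent": f"my-chess-script/0.1 (contact: {contact})"}
    return urllib.request.Request(url, headers=headers)

def fetch_player(username):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(build_request(username), timeout=10) as response:
        return json.load(response)

request = build_request("jitesh117")
# data = fetch_player("jitesh117")  # uncomment to actually call the API
```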
Missingpy is quite old, your scikit-learn package should match the version as is required by missingpy to remain operational (although I recommend looking for an alternative package to missingpy)
Maybe an automatic system update that updated your CUDA drivers (and maybe your friend received the update later)?
hello everyone. I'm from Vietnam and want to find a mentor for data science especially coding. Can someone help me please? thank you so much.
Hey could you please help out with this doubt: Not able to access the kaggle tpu in kaggle notebook
Have there been any competitions that nobody has been able to win?
Hello Everyone, Please I need help here. I am trying to forecast using Prophet. The first Screenshot is my dataframe but after creating my first model, the shape of my data reduced after i had added that i am making a prediction for 365 days (second screenshot). Please what is really going on? I should be expecting 3203 + 365 rows
Why AI won't replace data scientist/analyst
sorry but you are wrong
That is a biased question. It could replace data scientists. Don't be afraid, be a good scientist and accept the potential consequences of progress.
I was trying to plot a Plotly Express graph; the x axis was the day number and the y axis was a cumulative sum column. I added an average line for it in a different color, but it stays at the back. How do I bring it to the top?
Have you selected the Tpu from accelerator option?
Hello Everyone, can anyone walk me through the multidimensional data vectorization without creating a change in the data shape . I have been experiencing some challenges in this regard
coursera deeplearning.ai andrew ng
I think some of those big competitions are going to require significant compute resources and teamwork, i.e. money. I am just trying to win swag at this point.
I was trying that out
It says that the exercises are graded
But it only shows the questions
How do I get the gradings?
I want to know if my answers are what they are supposed to be
Or do they just refer to the little circle on the side?
I had a same issue, when I was imputing the data and handling categorical values, the shape changed dramatically. It was because the indexes were persisted, maybe resetting the indexes can help
I am thinking of flattening the columns (column-wise) so I can have my original matrix, will update you if mine eventually works
i have some questions abt regression and datasets shapes but i don't know if it's the right channel
If it is a question, then this is the right channel!
yeah i managed to talk with some of you guys already ! so basically my problem is the following one:
- I have 2 datasets: one contains 3072 animals with 875 columns, which are the bacteria found inside each animal, and the second one is a prediction dataframe with 840 animals and 6 attributes.
- Objective: predict the weight (real-valued), then another variable (real-valued), and finally both at the same time.
Problem: as you saw, my dataframes have different shapes, so a lot of regression models don't work with them, and I don't know if I should just take the first 840 animals in my first dataframe, because I don't know whether they are the labelled ones.
Solution: I first tried to predict the weight, which is a small real number. It was a disaster: negative score, an MSE of 2 billion, etc. So I transformed the weight into classes 0/1 (is fat or not) and now I'm using classification models, but without this I wouldn't know how to fix my previous problem.
2nd problem: I have a pretty good accuracy (0.90) using KNN, but it doesn't give me any information about which bacteria cause the animal to be fat or not. I am thinking of trying Naive Bayes or Logistic Regression to compare results.
Do you have any ideas on how to see the weight of the different attributes that lead to the classification?
Ok problem of size fixed we managed to get a new dataset with correct sizes , now remains the question of knowing the importance of parameters in the choice of the classifier
Assuming that you are using a KNN model and are trying to get the importances of each of the attributes, which are also features, you would probably need to use permutation importance to measure the effect of each feature on your classifier. Hope that helps. If you use sklearn logistic regression you can get the feature importances with model.coef_[0], I think.
Yep that's exactly what i am doing and what i want to find out ! Does scikit provides a function for the permutation importance using KNN ?
Not to my knowledge, you would need to write the permutation importance code yourself. Scikit learn has a permutation importance function that works with tree based models and linear models, but not one that works with KNN models.
Alr alr thanks for the answer! I'll take 5 more minutes of your time with a last question: I am predicting one attribute right now, then I'll need to predict a second one, and finally both at the same time. Which models allow predicting multiple targets at the same time? (I'll probably predict a binary value, and the 2nd one is actually a real number but I'll probably convert it into classes from 1 to 10)
I'm not exactly sure of how you would want to solve the problem. But if you are trying to predict a binary class and a real number, you should go with 2 different models. Otherwise if you are changing the 2nd one to categorical, you could use the MultiOutputClassifier in scikit learn.
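A minimal sketch of the `MultiOutputClassifier` suggestion, assuming scikit-learn is available. The data is synthetic: the first target is binary and the second is a real value binned into classes, mirroring the setup described in the question.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 120 samples, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y_binary = (X[:, 0] > 0).astype(int)                # e.g. "is fat or not"
y_binned = np.digitize(X[:, 1], bins=[-0.5, 0.5])   # real value cut into 3 classes
Y = np.column_stack([y_binary, y_binned])           # one column per target

# One KNN classifier is fitted per target column.
clf = MultiOutputClassifier(KNeighborsClassifier(n_neighbors=5)).fit(X, Y)
pred = clf.predict(X[:4])                           # shape (4, 2)
```

Each target gets its own fitted copy of the base estimator, so the two predictions do not share information beyond the input features.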
Alright i'll test it right away ! Thanks for the help ! 
Hi, Would anyone like to team up with me for the competition?
Hi guys, i am going through the python tutorial and doing the exercise for the arithmetic and variables section and i cannot figure out what i'm doing wrong. Any advice?
It says at the bottom that ‘survived’ is not defined. You have to first define that variable.
I have a dataframe where some of the items are lists. Normally the tables I've seen just have single items like "Age" where you just have one integer. How do I do EDA in this case?
Ig you can one-hot-encode it, but if you have every country in the world it might be a lot of columns
For EDA I’d split this data into separate columns. The first production company could be the main one and you could analyze movies with more than 1 company against the others for specific characteristics like movies with more companies have more budget or something like that
if you want to know how EDA is performed then have a look over here https://medium.com/@borhadepiyush/how-to-perform-eda-5ecaf4a3e52a. This is blog I posted on medium, I guarantee you that it will definitely help you into your study.
Oh okay so treat it as a normal categorical column
Alright thanks, nice read! I didn't really find the answer to the question there but I'll also have a look at the notebook you attached to see
wouldn't that leave a lot of rows with null values? I'd look into that as well though that's interesting, thanks!
sklearn's permutation importance can use any estimator. Think about what the permutation importance calculation is. You reshuffle the values in a feature and check how much the performance of the estimator degrades. The worse the degradation, the more important the feature is. There is no reason why the calculation should be limited to certain estimators.
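A minimal sketch of permutation importance with a KNN classifier, assuming scikit-learn is available. The data is synthetic and built so that only feature 0 carries signal, which makes the expected ranking easy to check.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption): 4 features, only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and measure the score drop.
result = permutation_importance(knn, X_val, y_val, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Computing the importances on held-out data rather than the training split gives a more honest picture of which features the model relies on.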
My bad, I thought I read somewhere that it doesn't work for tree-based models.
Trust the sklearn manual only. 🙂 https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance
Could someone recommend a video tutorial on creating project presentation recordings? I'm looking for guidance on the process. Your assistance is much appreciated.
anyone who uses rx 6650 xt i need some help?
There is a whole channel in KaggleX section devoted to recording tools. I will point you to the first post in that section: #1160980660966674452 message
okay thanks
Good evening everyone, I am Hassan passionate about learning data science as a beginner what resources and site to start from baby step? Happy to be here to learn unlearn and relearn.
when people split the training data into training and validation data, do they often do another pass at the end where they use all the possible training data before submitting? I imagine having a validation split is only useful for deciding if what youre doing improves the model or not
After hyperparameter tuning using cross-validation, the model is usually refit with the best parameters on the whole training set to fully utilize the data. I've heard more than once of how exposing the model to more data (train + validation) at the end boosts performance. It's worth a try.
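A minimal sketch of that workflow in plain Python, assuming a toy one-parameter ridge model; the data, the candidate values, and the helper names are invented for illustration. Cross-validation is used only to pick the hyperparameter, and the final model is then refit on every training row. (scikit-learn's `GridSearchCV` does the same final refit by default via `refit=True`.)

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_score(xs, ys, lam, k=5):
    """Average validation MSE over k folds."""
    scores = []
    for fold in range(k):
        val_idx = set(range(fold, len(xs), k))
        train_idx = [i for i in range(len(xs)) if i not in val_idx]
        w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        scores.append(mse([xs[i] for i in val_idx], [ys[i] for i in val_idx], w))
    return sum(scores) / k

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

# Step 1: use cross-validation only to choose the hyperparameter.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: cv_score(xs, ys, lam))

# Step 2: refit with the chosen hyperparameter on ALL training data.
final_w = fit_ridge_1d(xs, ys, best_lam)
print(best_lam, final_w)
```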
Hi, is it possible to pull a private notebook?
I have been trying to fine tune the faster RCNN with the help of this notebook https://www.kaggle.com/code/yerramvarun/fine-tuning-faster-rcnn-using-pytorch/notebook
but after fine tuning the model I don't know how to get the weight file and use it to run detections on my local system , it would be great if anyone can help me out.
I want to work together with a friend. What is a good place to share data?
What data do you want to share? I am also looking to work together.
i want to do the titanic competition just a place to put ideas and add data
okii....for the titanic the data is available to be downloaded on the competition page right?
yh, however i want something that we can add ideas and share helpful videos and potentially code.
you can dm someone who is interested....i am up for it if you want
Let me ask you, did you fine tune on your local or are you using Google Colab or something else?
I used Google colab for fine tuning
If the data is extremely skewed to one side and the boxplot shows a lot of outliers, are they really outliers, as in this data? It just seems I can't really consider these as outliers. Is the boxplot not a good enough test for outliers?
It depends on the type of data, but generally speaking these are not necessarily outliers. There are many types of skewed distributions that are legitimate - the tail of your histogram represents rare events. You may want to modify the data using Box-Cox (or log) transformation to bring it to something resembling a normal distribution.
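As a rough illustration, the sketch below draws a made-up right-skewed sample (log-normal, chosen only for the example), applies a plain log transform, and compares skewness and the share of points a boxplot's 1.5 × IQR rule would flag. It assumes strictly positive values; Box-Cox is the more general option mentioned above and is not shown here.

```python
import math
import random
import statistics

def skewness(values):
    """Third standardized moment; about 0 for a symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5

def boxplot_outlier_fraction(values):
    """Share of points outside the usual 1.5 * IQR boxplot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(v < low or v > high for v in values) / len(values)

rng = random.Random(0)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]  # skewed, all > 0
logged = [math.log(v) for v in raw]                        # log transform

print("skewness:", skewness(raw), "->", skewness(logged))
print("flagged by boxplot rule:",
      boxplot_outlier_fraction(raw), "->", boxplot_outlier_fraction(logged))
```

On the raw sample the boxplot rule flags a sizeable share of perfectly legitimate tail values; after the transform far fewer points are flagged.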
Can you use model.save_weights (for just the weights) or model.save (for the whole architecture)? Definitely use ChatGPT for those types of issues. It can usually point you in the right direction. Unless I am misunderstanding.
Thank you!
Reason: Duplicated text
I'm not sure how to proceed with my question now, I was wondering if the latest keras-core is supported
@balmy tundra Apologies, I think you hit a false positive on our auto-mod tool. Try asking your question again, I've updated the settings.
yeah i can use that but not sure how to use that weight file which is saved in my local system to run detections on new images
Is the latest keras-core supported in Kaggle notebooks? https://keras.io/keras_core/guides/getting_started_with_keras_core/ I ran the first 2 cells in this guide but got errors that keras wasn't defined.
The way to do that is to create a Kaggle dataset, upload your file, and then point to it from any Kaggle notebook.
Are all the for-medal competitions the ones that have money prizes right now? There are several pages of them, but I assume that's correct?
Hii
Forgive me for this stupid question but
How do I participate in a kaggle competition 🙂
I haven't participated in any competitions before this, and I want to partake in the ai text detection competition
More specifically, I want to know how I register myself, how to submit my model, and such...
Would there be a 'Kaggle DS & ML' survey-competition this year?
What does optimal training data for an ML trading model look like? Would it consist of just OHLCV values? Wouldn't you have to train the model on data that shows profit, since that is the ultimate goal of traders? How would you express that profit in a training set?
Start by going to a competition of interest, and click on the button "Join Competition" - it is on the right side. If you are logged on with your Kaggle account, it will take you to a page to read and accept the conditions. Then you go to the data section to download the data. Finally, there are discussions and code sections where you can find and re-use the code others have shared, or ask questions.
Hello all, I have recently joined my first Kaggle competition, but the rules of that competition say "Internet access disabled". Does that mean I can't import external libraries?
I see
Thanks a lot!
Hi, everyone
I am a fresh graduate who knows Python Data Structures and started working in a company with SQL and a little bit pyspark on JupyterHub. Wanted to have a guide to Kaggle how to start participation in contest and learn.
an easy way is to read some documentation https://www.kaggle.com/docs/competitions
Hi everyone,
I need some help with getting started. I want to work on the Detect AI-generated Text competition but I'm not sure how to get started, since I've never really worked on a project in Kaggle or participated in a competition.
I'm hoping to participate to learn as I go. I had to start somewhere so I chose this.
Any advice would be appreciated.
Thank you.
There's a couple of tutorials with the basics of Python and such. After that pick any "getting started" competition and look at published notebooks, starting with the short ones that don't have high scores (=> easier to understand).
You'll need to provide more context.
Are there any good resources from past competitions regarding heuristics or rules of thumb for large image resizing in CV, or are things like https://arxiv.org/abs/2103.09950v1 actually used?
For all the ways convolutional neural nets have revolutionized computer vision in recent years, one important aspect has received surprisingly little attention: the effect of image size on the accuracy of tasks being trained for. Typically, to be efficient, the input images are resized to a relatively small spatial resolution (e.g. 224x224), and...
@rare jetty To drop NaN values simply use df.dropna function in pandas. There are many ways to fill in missing values, as that is a non-trivial subject. The simplest way is to fill in mean or median values per column, but I suggest that you go to Kaggle and search for "missing values." There will be many notebooks showing ways this can be done.
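A small sketch of both options with pandas; the column names and values are made up for illustration. `dropna` discards incomplete rows, while `fillna` with the per-column median keeps every row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Option 1: keep only the rows that have no missing values.
dropped = df.dropna()

# Option 2: fill each column's gaps with that column's median.
filled = df.fillna(df.median(numeric_only=True))

print(dropped)
print(filled)
```

Dropping here loses half of the rows, which is the usual argument for imputing when many rows have at least one gap.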
Thanks everyone.
I have filled the NaN values with the help of ChatGPT.
Hi!
I've got a question about what would currently be the best deep learning architecture for analyzing features of Raman spectra. Has anyone worked with this type of data?
It is a 1-dimensional (vector) image in which all positions can present some information. CNNs and ResNets may be a nice option, what do you guys think? What about vision transformers or other architectures?
Thanks!
I would recommend trying a vision transformer only if you have large amounts of data; otherwise CNNs and ResNets generally work well.
hello,
I'm working on an AI project and there seem to be many null values in the dataset.
Would you advise me to go with fillna or dropna?
Also, if I use fillna and fill in average or random values, wouldn't it affect the dataset?
And since the project deals with healthcare, would there be a huge effect if I add in average values?
done
The same question was asked in general. So just scroll up a bit and check out the discussion.
oh okay
I'm not sure if this is the right place to ask, but I used to be able to edit my published notebooks without having to rerun the code cells. However, when I click on "edit" to do some changes to my markdown cells, I need to rerun all code cells to view their outputs. Does anybody know a way around this?
Hello everyone, I would like to ask where we can buy a dataset of pornographic text, we need to train chatbots
Does the link at the kaggle playbook (https://packt.link/KaggleDiscord ) refer to this server?
I don't know about the ethical implications of any of this, but you could probably use Whisper or any other voice-to-text model and just go over a database of videos.
Reason: Bad word usage
@verbal crest This bot may be a bit too sensitive. I got warned (and my message deleted) for using a word p_rn, which happens to be what we are legitimately discussing here.
@muted talon I think you may be assuming incorrectly that @inner cape has an access to a large database of p_rn videos.
@deft fox Sorry about that, we've got the word on the warn list just because it's one of the most commonly used words by spam bots.
How would you train a CNN model to identify if a picture is something and not something? For example, you're training a model to identify if the image is a dog or not a dog. This is different from training the model to identify if it is a dog or a cat. The "Not a Dog" could be anything such as buildings, other animals, colors, etc. Any ideas?
If you have an idea in mind, please reply to this message. Thank you!
I'm not assuming that per se, but with a scraper it should not be hard to obtain, considering they want/need the data.
Can I win a Kaggle competition without my own GPU?
Anyone??
Kaggle has its own gpu's for the notebooks, you just have to select it under accelerators. You get a certain amount of time per month.
I am going to play the odds here and say no. Most people don't win Kaggle competitions, with or without GPUs. But you can compete and do well without owning a GPU, because all Kagglers get 30 hours per week of free GPU time. There are also many competitions where GPU is not needed.
So, does kaggle think to increase the quanta?
One more question: why do boosting algorithms perform better than random forest on Kaggle? (I don't see any winning solutions with random forest, only XGBoost.)
Umm, gradient boosting algorithms are very efficient since they try to converge the gradient, unlike bagging algorithms, which rely on bootstrapping for better results, and this gradient-based approach leads to better results.
I don't understand why trying to converge the gradient is better than bagging.
As I see that, we want n estimators that should be better than random + independent given the target. In boosting, these assumptions aren't correct
Is there any way I can put Kaggle in dark mode?
Yes, these assumptions aren't correct for boosting, yet boosting seems to perform decently in these cases. As for why converging the gradient seems to be better: it usually captures more variance in the training data compared to bagging. I hope that gives you a little more intuition.
also let me know if you have anymore questions
I have a question regarding "Progression System",
does a Silver medal, count as Bronze as well?
So if I get 2 Silver medals, will I become a competition expert? Or do I need to achieve exactly 2 bronze medals ⁉️
You need exactly 2 bronze medals to become an expert, but if those medals later turn to silver you will still be an expert. Two bronze medals is a minimum, and anything else above it still qualifies you as an expert until you reach the next level.
perfect, thanks for clarification ❤️
There are many reasons that winning solutions use XGBoost. One of the reasons you might be overlooking is that the XGBoost implementation comes with a lot of optimization that goes beyond "boosting": efficient memory/computation, flexible objective and learning control parameters, robust default parameters, etc. These factors play a more important role than theoretical soundness in time-constrained competitions.
Does anyone have experience in generating MCQ questions using NLP? I mean, how should I approach the problem?
multiple choice question more than one correct
Maybe we can look at the probability of each option and compare them. If 2 options are correct, I think they should have a similar probability predicted by the model.
is it possible to edit a post to make it my team's solution?
https://www.kaggle.com/code/ayeshairshadcoder/big-mart-sales-prediction
I don't know why but the model is underfitting the data ...
The training accuracy is damn high but testing accuracy is low
like 80 / 50
Your model is overfitting the data. There could be many reasons as I didn't go through every single line, but for sure you won't get the best performance by using any regressor with default values such as in this line regressor = XGBRegressor(). Regressor parameters need to be tuned, and doing cross-validation would help with that. Also, those numbers are r2 scores rather than accuracy. Accuracy is a classification metric.
Why can't I use any AI model such as Mistral 7b? It allocates ram infinitely until it crashes the container.
How can I use an AI model in kaggle?
Thanks man, let me update my code
I don't think that this is the reason. I don't see any random forest solutions in the top places, which may point to something else.
Yes, you should be able to, but you have to do it correctly. It depends a lot on how you are loading the model. E.g., are you loading in the correct precision? (float32 will crash the GPU for sure.) Are you trying to fine-tune? (In which case you have to use BitsAndBytes 4-bit quantisation, otherwise the GPU will run out of memory.)
I finally figured it out after analyzing different notebook scripts. it's the bfloat yeah. I set it to auto and I can now finally manage to run some AI models and start building my dataset.
You are correct that the trees are independent in a Random Forest. But just because the assumption does not hold in Gradient Boosting, it does not mean that Gradient Boosting performs worse. It's quite the opposite, the assumption was relaxed for a reason. If the trees are not independent but rather they learn from the mistakes of the previous trees, the same predictive power can be achieved faster and with fewer trees. XGBoost went one step beyond Gradient Boosting because it is the first tree-based algorithm that has L1 and L2 regularization to help prevent overfitting. This is how tree-based algorithms evolved: RF => GB => XGB. So it is not a surprise that most winning models are based on XGB.
Does anyone know of a good explanation of LightGBM that includes a numeric example?
I am trying to get this question answered: if I upgrade to Google Cloud AI Platform Notebooks, can I also submit that notebook to a competition? Basically bypassing the runtime cap of 11 hours, for instance, or halving training time because I am paying?
Depends on the competition. For code competitions you can't use Google Cloud to surpass the limits. If it's a standard competition where you are submitting a CSV, you can use as much compute as you want (on your local machine or on Google Cloud or wherever).
You can do whatever you want for training, including training on the local machine if you have it. When you create a model, save it on Kaggle so it can be accessed from any of your notebooks. Yet the time limit will come into play for the inference, which has to be done via Kaggle notebooks, and that is where the time limit will be enforced without exception.
Thanks for that I was not aware I can only hook inference to the submission notebook, so I have been including training all this time :P. I actually had this thought at a random point, but I was not sure how to upload my model on Kaggle, is it via the datasets?
Yes, anything can be uploaded by creating a dataset. You link to it by adding a data source to your notebooks.
why the nfl competition is not accepting responses ?
Guys, is there any book for beginners in data science?
Hi, when awarding medals, does Kaggle also consider how old a notebook is? My notebook is 5 months old with 27 upvotes (22 non-novice), yet it hasn't got a silver medal.
https://www.kaggle.com/code/akshitsharma1/easy-peasy-detailed-cnn-tutorial-for-beginners This is the notebook I am facing the issue with. Please can someone check?
hi i was wondering if i could use ngrok to open a tunnel in order to remotely collect training data, i am taking part in the UBC Ovarian Cancer Subtype Classification and Outlier Detection competition
im passing this data to another computer with wandb...
Hello everyone!!! I am participating in a competition where it states that "Freely & publicly available external data is allowed, including pre-trained models" (so I understand I can use huggingface and other services) but it also states "Internet access disabled" for the notebook, so what do I do? Do I have to download the model?
Yes, the age of the notebooks is a factor. Also, not all non-novice votes count. Kaggle doesn't explain that in detail, but in general non-novices who often upvote your posts or notebooks also may not count.
for non internet notebook competitions, if I add a model using the sidebar while editing a notebook, it should be available when running on the hidden set right?
Right.
hey! does anyone know why this training loop might not be updating gradients correctly:
```
for epoch in range(num_epochs):
    epoch_list.append(epoch+1)
    model.train()
    train_loss = 0.0
    for images, depths in tqdm(train_loader):
        images = images.to(device)
        depths = depths.to(device)
        outputs = model(images)
        loss = depth_loss(outputs, depths)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, depths in tqdm(val_loader):
            images = images.to(device)
            depths = depths.to(device)
            outputs = model(images)
            loss = depth_loss(outputs, depths)
            val_loss += loss.item()
    val_loss /= len(val_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")
    train_losses.append(train_loss)
    val_losses.append(val_loss)
```
So, in your training loop, it looks like you missed adding optimizer.zero_grad(). This is super important in PyTorch because, without it, your gradients start to pile up across different batches of data. Think of it like this: each time you pass a batch through your model and compute the loss, PyTorch calculates how much it needs to adjust the weights (that's the gradients). But if you don't reset these gradients to zero before the next batch, you're not just adjusting based on the new data, but you're also including adjustments from the previous data. It's like trying to fix a recipe but you're also considering the ingredients from your last cooking session. Not ideal, right?
So, just pop optimizer.zero_grad() right at the start of your training loop, just before you feed the images and depths into your model. This will make sure that each batch's weight adjustments are made cleanly, based only on that batch's data. That should fix the gradient updating issue and get your training back on track! 🚀👍
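To see the effect without needing PyTorch installed, here is a toy stand-in written in plain Python. `Param`, `ToySGD`, and `backward` are hypothetical helpers that only mimic the one relevant behaviour: a backward pass adds to the existing gradient buffer instead of overwriting it. In the real loop the fix is simply calling `optimizer.zero_grad()` before `loss.backward()` for every batch.

```python
class Param:
    """A single weight with a .grad buffer that accumulates across backward calls."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class ToySGD:
    """Minimal SGD-style optimizer with the same zero_grad()/step() shape."""
    def __init__(self, param, lr):
        self.param = param
        self.lr = lr

    def zero_grad(self):
        self.param.grad = 0.0

    def step(self):
        self.param.value -= self.lr * self.param.grad

def backward(param, x, y):
    """Add d/dw of (w*x - y)^2 to the gradient buffer (accumulate, not overwrite)."""
    param.grad += 2.0 * (param.value * x - y) * x

def train(reset_each_step, steps=50):
    w = Param(0.0)
    optimizer = ToySGD(w, lr=0.1)
    for _ in range(steps):
        if reset_each_step:
            optimizer.zero_grad()   # clear gradients left over from the last batch
        backward(w, x=1.0, y=3.0)   # the ideal weight is 3.0
        optimizer.step()
    return w.value

w_fixed = train(reset_each_step=True)
w_buggy = train(reset_each_step=False)
print("with zero_grad:", w_fixed)
print("without zero_grad:", w_buggy)
```

With the reset, the weight settles at the target; without it, stale gradients keep pushing the weight past the target and it never settles.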
Is this part of an active competition? Make sure to not share code that’s currently being used for Kaggle competitions in the spirit of fairness
Thank you so much Derek! That was exactly the issue. This makes a ton of sense, thanks for the explanation
I'm so sorry, this is not code from a current competition, just a general question from a different project I'm working on!
Hello everyone, I downloaded the automotive vehicles engine health dataset. However, I'm facing a lot of issues with the data. I'm not getting an accuracy greater than 64-67% for multiple models. I used RF, DT, and MLP models. I'm focusing on MLP, honestly.
I've fixed the class imbalance issue, did some feature engineering, and fixed the outliers as well.
Is the issue on my side or in the data itself? Any suggestions on what I should do?
Hello guys, my model keeps improving in validation score but keeps decreasing in Kaggle score. Can you give me some advice on this?
ooh your model seems to over fit i guess think of it this way
your dataset might have black cats and
models learn only black cats are cats
which is problematic ofc
try using cross validation to get a better estimate of the error
and use some regularization methods
If I submitted before the deadline but after the notebook runs and it's scored it's past the deadline will that be considered?
My notebook cell freezes with the asterisk sign when I try to run it in Kaggle. Is this normal or there is some way to solve this problem. I restarted the kernel once already but the same issue happened.
There is nothing in this code snippet that would produce a visible output. What you think of as "freezing" is a notebook that imports several packages, sets a couple of parameters, and at that point it is done.
Try a boosted tree algo like xgb or lightgbm. Tbh for a tabular dataset I wouldn't go near an MLP (they're finicky to train correctly for tabular data). And unless your data is seriously unbalanced (i.e. < ~10% of 1 class), I wouldn't bother with class balancing either (mostly, except for the really unbalanced case, fixing it makes my model worse, not better). Other than that, you are the one who has explored the data, so you are better placed to answer if there are 'issues' with the data...
I second everything @red hawk said. It may be worth trying this neural network https://github.com/dreamquark-ai/tabnet . It won't necessarily produce better results than GBMs (it might on small datasets), but it typically produces different predictions than any other methods. Thus, it ensembles well with other models.
PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf
can anyone help me solve this?
hi everybody,
One of my submission notebooks has now been running for more than 2 hours, which is not normal. Is this a bug in Kaggle, or is something wrong with my notebook? Any idea?
Running times vary depending on server load. There are many notebooks running at once, and sometimes the system is slower.
Thanks for the response. It is now at 8 hours, and I double-checked to make sure that I don't have an endless loop or something like it in my notebook!
oserror: [e053] could not read config file from c:\users\….
I am facing this error, kindly help me.
You don't give us enough context. With the limited info we have, an educated guess is that either the file doesn't exist or you have a space in file/directory name.
oserror: [e053] could not read config file from c:\users\nandini agarwal\appdata\local\programs\python\python311\lib\site-packages\pyresparser\config.cfg
I would suggest that you run inference on a single test file that is available and time it, then multiply that time with 2000. That should give you some idea how long it is going to take.
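A back-of-the-envelope sketch of that estimate; `run_inference` and the file name are placeholders standing in for the real per-file inference call, and 2000 is the file count mentioned above.

```python
import time

def run_inference(path):
    """Hypothetical stand-in for the real per-file inference call."""
    time.sleep(0.01)  # pretend the model takes a moment
    return 0

N_TEST_FILES = 2000  # count taken from the message above

start = time.perf_counter()
run_inference("sample_test_file")  # placeholder path for the one available file
seconds_per_file = time.perf_counter() - start

# Linear extrapolation: assumes every file takes about as long as the sample.
estimated_total_minutes = seconds_per_file * N_TEST_FILES / 60
print(f"~{estimated_total_minutes:.1f} minutes for {N_TEST_FILES} files")
```

Timing a few files and averaging would smooth out one-off costs such as model loading on the first call.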
Like I said, you have a space in directory name.
Yes
What is the exact python line that generates the error?
For the mohs-hardness regression data set, where can I find what the feature acronyms actually mean, e.g. what is "el_neg_chi_Average"? Am I just supposed to google this, or is there somewhere I can find this for future competitions too?
https://www.kaggle.com/competitions/playground-series-s3e25/data?select=train.csv
Playground Series - Season 3, Episode 25
It could be because of backslash escapes e.g. c:\users\nandini\... is read as
c:\users
andini\...
\n is taken as "newline". Try using raw strings r"..." i.e.
r"c:\users\nandini agarwal\appdata\local\programs\python\python311\lib\site-packages\pyresparser\config.cfg"
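A quick way to see the difference, using a made-up path (`C:\temp\new.cfg`) rather than the one from the error. The made-up path is deliberate: in Python 3 a normal string literal containing `\u`, as in `c:\users`, is rejected outright as a malformed Unicode escape, so it cannot be used to show the silent `\n` and `\t` substitution.

```python
from pathlib import PureWindowsPath

plain = "C:\temp\new.cfg"   # \t becomes a tab, \n becomes a newline
raw = r"C:\temp\new.cfg"    # raw string: backslashes are kept literally

print(repr(plain))  # shows the tab and newline that replaced part of the path
print(repr(raw))    # shows the path with its backslashes intact

# PureWindowsPath parses Windows-style paths on any operating system.
parts = PureWindowsPath(raw).parts
print(parts)
```

Forward slashes (`"C:/temp/new.cfg"`) or doubled backslashes (`"C:\\temp\\new.cfg"`) avoid the problem as well.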
Hi everyone,
we are doing a project based on anomaly detection through video surveillance. Our project is used mainly in sports stadiums to detect anomalies such as assault, explosion, fighting among fans etc. The surveillance video is captured by autonomously repositioning slave robots through cameras. These robots then check for the anomalies. If an anomaly is found, it sends the video footage to a central server for anomaly classification. We want an unsupervised model which takes videos as inputs. It also learns from the live video it detects during deployment.
Can anyone suggest a model to be used at the slave robot cameras?
Hey! I've started the Python course and the first exercise shows me this error. How can I fix this?
I got it, I clicked "Run all" and it works.
Hi everyone, i'm currently participating in the SenNet competition and to send a submission you need to turn off the internet access of the notebook. Since i'm using the segmentation-models-pytorch library i had to upload the output of !pip download segmentation-models-pytorch as a kaggle dataset. But i get the following error... Anyone had a similar issue while trying to install a package without internet ?
put these types of errors on chat gpt or bing ai they help a lot
Hey, when there is an error in the submission, is there any way to check what went wrong? If not, how do people usually debug it? I am not sure why my submission is generating an error.
Is it normal for the icon of the .json file saved in kaggle to be marked as {i}? If not, how to solve it?
Is there a room for the optiver-trading-at-the-close competition?
Hello everyone, I have a college work similar to kaggle competitions, where we're required to submit our predictions of an "Xtest" and then we'll be evaluated based on how our model performs. For the training phase, I'm leaning towards using cross validation and evaluating my model based on its results, however, a friend is doing a split of the "training" data into training / validation / test, and they say that it gives more correct metrics. My stance is that doing a further split reduces the amount of data we're using to train, and we also risk having false confidence in our model. Am I correct on this matter? And, what's the general rule to follow in the scenario of competitions?
Hello everyone, I am working on a dataset that has a categorical column with 31 unique values. If I perform one-hot encoding on it I will have 31 extra columns. Is this the correct way to do it, or is there a better way?
It depends on the data. OneHotEncoder is used when there isn't any order in your data. If there's an underlying hierarchy in your categories (e.g., High School, Undergraduate, Graduate), you may use OrdinalEncoder instead of OneHotEncoder.
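The difference between the two encodings can be sketched in plain Python. These helpers are hand-rolled for illustration and are not scikit-learn's classes; the colour column is invented, while the education levels come from the example above. With scikit-learn, `OrdinalEncoder` accepts a `categories` argument for supplying the order explicitly.

```python
def one_hot(values):
    """One 0/1 column per category; no order between categories is implied."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal(values, order):
    """Map each category to its rank in an explicitly supplied order."""
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]

colors = ["red", "green", "blue", "green"]                  # no natural order
education = ["Graduate", "High School", "Undergraduate"]    # has a natural order

color_categories, color_matrix = one_hot(colors)
education_codes = ordinal(education, ["High School", "Undergraduate", "Graduate"])

print(color_categories, color_matrix)
print(education_codes)
```

One-hot encoding adds one column per category, which is where the 31 extra columns in the question come from; ordinal encoding keeps a single column but only makes sense when the ranks mean something.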
You can use the training data for cross validation, and also have a hold-out set for testing after CV.
If you have a very large dataset, a single train-validation set might give you accurate results. Overall, Cross Validation is preferable.
I see, thank you.
Generally speaking, you are correct. Your friend might have picked a fortuitously good data split that gives better metrics on test data, but it doesn't mean the model is better. If you do a 5-fold validation on train data and determine the CV score, in practice it should match better with test scores unless you got really poor train/test split. On the same train/test split, where you use your train data for cross-validation and your friend splits it additionally to train and validation, your approach should get better agreement between CV and test scores in a large majority of cases. Not always, though. That's not a method deficiency, but rather a luck of the splitting process.
Hello everyone, a question: a deeper tree can fit the training data better, but why can it also lead to overfitting?
It’s the quality of the training data that could lead to overfitting. If your model is trained to look for apples, then it may be too generalized and return everything that’s red and round - you may have a lot of training data, but does it describe the characteristics of an apple accurately?
See few-shot training.
Sorry to ask it here, but I am unable to link my kaggle account with discord, what do I do?
I think it is linked now, could someone help
Hello everyone, can someone suggest laptop hardware for machine learning? I'm new to it so I don't know what kind of hardware it needs, so please guide me.
I am learning Machine Learning/Deep Learning on Coursera and I also know some Python basics.
I am currently working on my BS Final Year Project named Data Driven Strategy for Load Forecasting of Power Systems.
I want to join a team or wanna work with some experts to learn
Please count me in.
this channel is meant for asking questions, so you are more likely to find someone to pair with in #👥┊looking-for-a-team or one of the dedicated competition channels
Anyone help?
Not sure anyone can help you other than admins. Pretty sure they don’t work on Sunday.
It would be interesting to know how much you're willing to spend to give you a more optimal suggestion. But overall, an Intel Core i5 and 8GB RAM is enough for most tasks. For Deep Learning and Neural Networks, you will need GPU. You don't need to buy a laptop with GPU though, since you can use cloud computing solutions. Kaggle offers free 30hrs/week cloud GPUs so you can train neural nets.
I agree that for most machine learning tasks that don't involve deep learning, almost any modern computer with multiple CPUs will do. But still, I think the comment "But overall, an Intel Core i5 and 8GB RAM is enough for most tasks" is off the mark. Hardly anything these days can be done with 8 GB of memory, as the operating system will take a good chunk of that memory, unless one is planning to use Linux exclusively. I don't think it is worth saving $150-200 on memory, and I strongly recommend at least 16 GB RAM. A GPU is a must for deep learning applications, but that will make a laptop expensive. I think the suggested Kaggle GPU solution is a good option.
Sure! I was considering the minimum requirements, but indeed at least 16GB of RAM would be much better. I have been using a computer with 8GB and recently upgraded to a 16GB one for more optimal performance.
to add to the other recommendations, I would say that if you don't have portability requirements, ie you don't plan to carry the computer around much, it is much better value for money to get a desktop. And definitely get at least 16GB. You don't need a GPU but it can be quite convenient to have one that you don't need to be worried about turning off. (and if you are into video games, you might as well accomplish two things at once by getting a decent nvidia gpu )
Hi! I joined a competition on Kaggle and they shared a customised python package along with the .csv files. I am using Windows and the package file is .SO which is only for Linux. Does anyone know how I can solve this issue? Right now, I cannot run the package since it doesn't recognize the extension.
I have a question, best answered soon if possible
Is it legal to obtain someone's health data to build an ML project? And without any doctor's or government consent? As we know, many health datasets are contributed by hospitals and medical researchers, but is it legal for them to be collected by students without proper knowledge of the field?
Hi everybody... I am searching for someone to team up with for a Kaggle competition, to learn and share knowledge while working on a project. If interested, please reply.
The ML course by Andrew Ng has a few assignments, and I can't solve the Practice Assignment of Week#2. Can you help me?
What sort of development environment should I be using as someone relatively new to all of this? Right now I'm just writing Python code in notepad++ and running via cmd line. Would it be more efficient/better in some way for me to use an IDE or some other tool instead?
@tropic copper Jupyter Notebooks are very popular for datascience, Kaggle's notebook editor (or Colab) are online versions of that style of IDE, but you can also set it up to use locally.
Thanks for the info! I've used those a lot in Coursera courses. What makes them so popular?
@tropic copper I think the ability to interweave code and output back and forth (and to go back and edit previous steps when needed) is all very handy when doing data science exploration.
Is an NVIDIA GTX 1650 sufficient for entry-level deep learning tasks?
Hello 👋
I'm participating in my first Kaggle competition and I'm blocked at the submission stage.
My submission notebook crashes and I'm trying to figure out exactly how it works to make sure I'm not doing it wrong.
Do you know if the submission notebook runs have access to the internet? My first notebook cell is a pip install and if the notebooks do not have access to the internet it would explain the failure :/
Hey all! I'm fairly new to CS as a whole and was wondering if there are some prerequisites I should know before attempting Kaggle comps? Projects are the best for learning, I'm told! But I also know I am fairly inexperienced and I will not learn much if it is too hard.
4 GB of memory is borderline even for entry-level deep learning tasks. If you don't need to download an external model, or you only deal with relatively small classification tasks, it might work. I have two GTX 1080s which are 8 GB each, and I find that insufficient more frequently these days than 5 years ago.
Hi all, I need help with best practices. I'm working on a project where I need to build a couple of tables. Keeping the most granular data in one table would create duplicates, so is it better to have 3 different tables, each with a primary key on one column, or one table whose primary key is a combination of multiple columns?
Hey Folks! I'm just starting on my ML adventure and I've got a question. I've created a simple one-layer neural network to solve the Titanic Challenge. I've set it up such that when training is done I export the weights to disk so they can be reused. Training appears to be working pretty well. However, when I start the network with the trained weights and train some more, the network starts in a state with a bit more loss than when training ended on the previous run. I would think that I would start at the point at which training ended. Does anyone know why this would be the case? Here's a link to my (very ROUGH/experimental) project in case it is helpful - https://github.com/chuckfinca/kaggle_titanic_competition
ML with Andrew Ng, Course#1, Week#2, Practice Lab Assignment
I am facing an issue regarding the assignment mentioned above.
After submitting, I receive an error. " Comment line with index: UNQ_C1 wasn’t found in code"
Can someone help me with this?
Link: https://lnkd.in/daynpUSe
Hi Kagglers, I hope everyone is doing well. I want some advice or help getting a job in machine learning and/or data science. Feel free to DM me.
Can Kaggle resources support multithreading for Mistral 7B? I want to build a dataset and I need an AI to help me do that, since ChatGPT has rate limits. The thing is that going one prompt at a time is very slow, and I am wondering if it is possible to multithread (so send the model multiple prompts at the same time to generate text).
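One way to issue several prompts at once is a thread pool. The sketch below is only an illustration: `generate` is a hypothetical stand-in for whatever call produces text (an HTTP request to a hosted model, for example), and `max_workers=4` is an arbitrary choice. Threads mainly help when each call spends its time waiting on I/O. For a single Mistral 7B model loaded on one local GPU, batching several prompts into one generation call is usually the more effective route.

```python
from concurrent.futures import ThreadPoolExecutor


def generate(prompt: str) -> str:
    """Hypothetical placeholder: replace with the real model or API call."""
    return f"response to: {prompt}"


def generate_many(prompts, max_workers=4):
    """Run generate() on several prompts concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(generate, prompts))


prompts = ["Write a question about pandas.", "Write a question about NumPy."]
results = generate_many(prompts)
```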
Code:
tuned_model = "codegood/HF_AWS_Mistral_SC"
trainer.model.config.save_pretrained(tuned_model + "config")
trainer.save_model(tuned_model)
torch.save(model.state_dict(), "/kaggle/working/HF_AWS_Mistral_SC/Mistral_torch_model.bin")
trainer.push_to_hub(tuned_model_SC)
print("Model saved to Huggingface")
Error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)
I'm trying to load and retrain my fine tuned model. I'm able to load the model, but get the above error during trainer.train(). Not able to figure out the problem.
Also, how to upload the bin model to the Huggingface directly?
Hi everyone, is it possible to delete a submission on the leaderboard, or just simply hide my score?
Hey everyone! I have one project which I would love to share on public platforms like Kaggle and GitHub. However, I cannot share it without anonymizing, any idea on how to anonymize large amounts of data in Excel? So that powerBI would show anonymized names?
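If the data can be exported from Excel and processed in Python first, one possible approach is to replace each name with a stable pseudonym before loading it into Power BI. This is a minimal sketch, not a complete anonymization scheme: `pseudonymize`, the `Person` prefix, and the key value are all made up for illustration. Note that this only hides the names; other columns may still identify people.

```python
import hashlib
import hmac


def pseudonymize(name: str, secret_key: bytes, prefix: str = "Person") -> str:
    """Map a name to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    normalized = name.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    # Keep only the first 8 hex characters for readability.
    return f"{prefix}_{digest[:8]}"


# Hypothetical key: use a private random value and keep it out of the shared files.
secret_key = b"replace-with-a-private-random-key"
names = ["Alice Smith", "Bob Jones", "alice smith "]
pseudonyms = [pseudonymize(n, secret_key) for n in names]
```

The same input always maps to the same pseudonym, so joins and per-person charts keep working on the anonymized column.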
Hey, I am new to Kaggle and I have a question.
Why use a Jupyter notebook with cells?
Why not use an IDE and simply type the code (without cells), instead of putting each function or line of code in its own cell?
Jupyter notebooks are easier to work with compared to using just a normal Python file. If you are working with really complex code that takes a long time to run, it can be easier to just run it one cell at a time. It is also easier to debug (sometimes) since the code can be debugged one cell at a time. It also may just be a personal preference shared by most of the community.
@echo latch In the Kaggle editor you can swap from a notebook to a 'script' which is just a single python file if you'd prefer to work that way. It's really just a matter of personal preference.
Would anyone in the data science field with at least 1-2 years of experience be willing to participate in an interview for a school project (I'm considering pursuing a career in data science) ? The questions are below (Feel free to dm me or respond within the channel), Thanks for your time!
What caused you to gain interest in data science, and how did you enter the field?
Can you describe a typical day or week in your role ?
What types of projects do you typically work on ?
What programming languages, tools, and libs are most essential for you ?
Can you share an example of a challenging problem you've faced in a project and how you went about solving it?
How do you stay updated with the latest developments and trends in the field ?
What are some common misconceptions about the field of data science, and how would you address them?
What advice do you have for someone just starting their career in data science ?
@echo latch I am pretty new and I just started using a locally run Jupyter notebook. It is SO much more convenient and easy than Notepad++ and than running via CMD prompt. It helps me chunk the program into easily digestible sections that I can move around, reposition, etc.
I've also found a lot of time can be saved with Jupyter because I can quickly add or remove "debugging" points, e.g if I want to see what the output looks like after I remove or edit a particular block of code, it seems much easier and quicker to do this in Jupyter.
Dear community,
This semester I started a course where we learn to code a kind of AI in Python. To me it looks more like machine learning, but never mind.
The task for the final exam is to code a program which can answer questions about a data set:
_
‒ Two individual / unique research questions per student are required
Procedure:
‒ Students search themselves for large and relevant data sets
‒ Students define two questions that should be answered for the selected data sets
‒ Lecturers check the data sets and the corresponding questions (such that problems are difficult enough but not too difficult)
‒ Implementation of the solutions by the students on their own systems using the presented libraries and methods of the lecture
The resulting program code (as *.py) and the corresponding program execution need to be analyzed regarding the run-time behavior. Which parameters influence the run-time in which way?
‒ Code analysis and run-time behavior evaluation need to be executed per research question.
The methods we learn and shall use:
Data and data preparation (Pandas etc.)
Classification I (Support Vector Machines)
Classification II (Decision trees, Random forests)
Clustering (kmeans, DBScan)
Testing and quality assurance (run time analysis)
Dimension reduction, anomaly detection
Neural Networks
Training deep learning networks
Pipelines and MLOps
As I'm doing all the homework, I don't think that the coding part is my challenge.
My problem is defining two questions which would fit the requirements. Can you give me examples? The questions should not be answerable by simple statistics.
For example I choose this dataset:
https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset/discussion
=> But I don't know which question I could define for this that can be solved by the methods above.
I am also open for new datasets.
Thanks in advance!
Best AI ML DL DS Roadmap
Hi! What is the best complete roadmap for AI, ML, DL, and Data Science?
Some roadmaps I have found:
- [roadmap.sh] AI and Data Scientist Roadmap ← Best?
- [i.am.ai] AI Expert Roadmap
- [github.com] mrdbourke/machine-learning-roadmap
- [github.com] luspr/awesome-ml-courses
- [rentry.org] Machine Learning Roadmap
Which one should I choose?
I am not a beginner in programming (8y as a hobby and 3y working), but it was not related to AI.
The best roadmap for any of those would change at least somewhat, since they are all slightly different. Many people get hung up on finding the best possible path, but the most important thing you can do is to start on a path. If you are looking for a great starting place (since you have some experience), I would recommend doing some analysis on Kaggle datasets using pandas, matplotlib, etc.
Hello !!
How can I improve the accuracy? I'm using an MLP model of only Dense Layers.
How is it possible to remove these crazy spikes.
I have tried the following:
1- Early Stopping
2- Reducing model complexity
3-Reduce LR
4- Dropout layers and Batch Normalization
5- Gaussian noise layer
6 - fixed the issues with the dataset.
Your help is much appreciated!
I suggest you reduce the width and the depth of your network, use batch normalization and dropout, smaller learning rate, and try larger batch size. But in the end you still may not get much better accuracy.
I’ve added dropout layers and batch normalization layers as well. Also, the batch size I have is 32; I increased it to 64 and it got even worse. I’m starting the LR at 1e-4.
Is there anything else I need to try? I’ll try to reduce the width and depth.
Maybe try smaller batch size than 32. Large dropout (0.4-0.5) may be needed. In the end, without more data (how much do you have?) this could be the best score one can get.
In total of 19,000 samples, around 11,000 for training set, 4000 for validation and 4000 for testing
Hey everyone!
Just finished “Machine Learning Specialization” on Coursera, from Andrew Ng. Excited to dive deeper into the field!
Any recommendations on what I should learn next or any valuable resources you could suggest? Your insights would be greatly appreciated!
Ive recently done the tutorial Titanic Competition, and wanted to redo it with an ML model. However, my model is now getting a 0 public score. Idk where I'm going wrong or how to test…
Here is the link to my notebook https://www.kaggle.com/code/abishekjayan/this-is-where-it-starts
In most competitions you are supposed to submit probabilities, not binary predictions. I suggest you delete your In: 12 cell and in the following cell use output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': predictions.flatten()})
ok let me try...btw is there any way to check the score without submitting for competition?
right now in order to know the score im just submitting and checking the public score
still 0 score
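One possible cause, offered as a guess without having run the notebook: the Titanic competition is scored on accuracy and expects `Survived` to be an integer 0 or 1, so raw probabilities such as 0.73 will not match any label. A minimal sketch of thresholding before writing the file, using only the standard library. The helper names, the 0.5 threshold, and the example values are assumptions for illustration.

```python
import csv


def to_binary_labels(probabilities, threshold=0.5):
    """Turn predicted probabilities into integer 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def write_submission(path, passenger_ids, labels):
    """Write a two-column CSV with the PassengerId and Survived headers."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["PassengerId", "Survived"])
        writer.writerows(zip(passenger_ids, labels))


# Made-up example values, not real competition data.
passenger_ids = [892, 893, 894]
probabilities = [0.12, 0.5, 0.97]
labels = to_binary_labels(probabilities)
write_submission("submission.csv", passenger_ids, labels)
```

To estimate the score without submitting, the same labels can be computed for a held-out slice of the training data and compared against the known `Survived` values.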
I need help to know more about feature engineering. Please provide me some resources so that I can catch the vital concept.
So for that very same competition, has anyone been able to fill the missing values in the Cabin section?
Hi.
Can anyone suggest a good unsupervised learning method for anomaly detection(like assault, robbery, vandalism etc)?
Can anyone help me with the following discussion : https://www.kaggle.com/discussions/questions-and-answers/461022
AI Art Generation.
Hello guys can someone provide an explanation for this?
Hi Alex, from the learning curve displayed, it seems that the algorithm hasn't learned the target function. This is shown by the high and increasing training error. Bias describes how well the learning algorithm can approximate the target function, and since, according to the question, the test error is unacceptable, the model isn't approximating the function well. The model also has little generalization error ("variance"): as the number of data points (the size of the dataset) increases, the training error and the test error come closer to each other, converging on that unacceptably high error value. Feel free to ask for any elaboration. The course is great, by the way, and helped me a lot while I was studying machine learning. I recommend this video from it for understanding the curve better: https://youtu.be/zrEyxfl2-a8?si=k4DdOTt0TM72kagH
Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The learning curves. Lecture 8 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 and on the course website - ht...
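For reference, the bias-variance decomposition can be written compactly. This assumes squared error and a noiseless target $f$; the notation is chosen here for illustration, with $g^{(\mathcal{D})}$ the hypothesis learned from dataset $\mathcal{D}$ and $\bar{g}$ the average hypothesis over datasets.

```latex
\mathbb{E}_{\mathcal{D}}\!\left[ E_{\text{out}}\!\left(g^{(\mathcal{D})}\right) \right]
  = \underbrace{\mathbb{E}_{x}\!\left[ \left(\bar{g}(x) - f(x)\right)^{2} \right]}_{\text{bias}}
  + \underbrace{\mathbb{E}_{x}\!\left[ \mathbb{E}_{\mathcal{D}}\!\left[ \left(g^{(\mathcal{D})}(x) - \bar{g}(x)\right)^{2} \right] \right]}_{\text{variance}},
\qquad
\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\!\left[ g^{(\mathcal{D})}(x) \right]
```

A curve where training and test error meet at a high value corresponds to a large first term and a small second term.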
tyvm
u r welcome
We can't help you here, you need to contact support (kaggle.com/contact)
does anybody know how can I learn to do ensembles that perform well in competitions? stacking, etc. that improves the metrics? Thank you!
Hi guys quick question I had
Can selectolax be used to scrape dynamic content of a webpage?
Hello , can someone please explain to me the cross-validation and how can i use it
You have to be willing to put in a minimal effort on your own. Your question is easily answered by Googling https://www.google.com/search?q=cross-validation
Hey, I'm hosting my own competition and I can't see how I can pin my demo notebook to the top of the code notebooks section? Any ideas? I've seen it done in other competitions. Is this something staff can answer @tardy lodge?
Does anyone know what could be the issue when submitting a notebook that runs fine in Kaggle? I submitted to the LLM Detection Competition using Keras NLP. I ran the notebook before and it was fine, training, evaluating, and saving the submission.csv. It failed when I submitted though, so I copied this from the Log: Downloading data from https://storage.googleapis.com/keras-nlp/models/distil_bert_base_en_uncased/v1/vocab.txt
272.5s 101 Traceback (most recent call last):
272.5s 102 File "<string>", line 1, in <module>
272.5s 103 File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 128, in execute_notebook
272.5s 104 raise_for_execution_errors(nb, output_path)
272.5s 105 File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
272.5s 106 raise error
272.5s 107 papermill.exceptions.PapermillExecutionError:
272.5s 108 ---------------------------------------------------------------------------
272.5s 109 Exception encountered at "In [18]":
272.5s 110 ---------------------------------------------------------------------------
272.5s 111 gaierror Traceback (most recent call last)
272.5s 112 File /opt/conda/lib/python3.10/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
272.5s 113 1347 try:
272.5s 114 -> 1348 h.request(req.get_method(), req.selector, req.data, headers,
272.5s 115 1349 encode_chunked=req.has_header('Transfer-encoding'))
272.5s 116 1350 except OSError as err: # timeout error
Try these things and if it doesn't work it has to do with the content of the notebook itself.
Hey there. Is anyone working on question generation using LLMs? Please help me.
Hi there. I posted a question here: https://www.kaggle.com/competitions/titanic/discussion/462500 . Can any1 help me? It's about the titanic challenge (im using pytorch)
Start here! Predict survival on the Titanic and get familiar with ML basics
Has anyone tried using Hugging Face AutoTrain Advanced on Kaggle? How was the experience? Please share.
Has anyone done any work on Predict Energy Behavior of Prosumers?
Hi guys,
I built a web app to predict the classification of flowers using machine learning. I just need help with the last step. So far I have been able to successfully connect the HTML pages and their corresponding routes; just the last step is not working.
When the model makes a prediction it returns a number, 0 or 1 or 2, depending on the flower it has predicted. Instead I get a None in the HTML file, but I checked the logs and the output from the predicting function is correct.
Kindly help me debug this.
Code link:
https://github.com/Kaus1kC0des/OIBSIP/tree/main/Data Science/Task 1
How many Kaggle notebooks can one run in parallel?
Mine gives error when I try the third.
I've run at least 5-6 notebooks in parallel, but it was a while ago. Maybe they changed something recently, or implemented stricter controls during busy times. What is the error message?
"GPU Session cap reached"
There is a 30-hour per week GPU limit on Kaggle. That message most likely means you have reached it and will have to wait until next week.
Does submitting a notebook in GPU mode consume our GPU quota?
For content-based recsys, when recommending based on popularity, a damped mean of the target metric is a common and solid way to mitigate the caveat of large discrepancies in the number of evaluations / ratings per item.
Was wondering, what other alternatives to the damped mean are there?
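For reference, the damped mean mentioned above can be sketched as follows. The function name, the damping value of 5, and the example ratings are assumptions chosen for illustration. The idea is to add `damping` pseudo-ratings equal to the global mean, so items with few ratings are pulled toward the average.

```python
def damped_mean(ratings, global_mean, damping=5.0):
    """Shrink an item's mean rating toward the global mean.

    Equivalent to adding `damping` pseudo-ratings equal to `global_mean`.
    """
    return (sum(ratings) + damping * global_mean) / (len(ratings) + damping)


global_mean = 3.5  # assumed average rating across all items
few_ratings = damped_mean([5.0, 5.0], global_mean)
many_ratings = damped_mean([5.0] * 200, global_mean)
no_ratings = damped_mean([], global_mean)
```

With these inputs, the item with two perfect ratings ranks well below the item with 200 perfect ratings, and an item with no ratings falls back to the global mean.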
How difficult would it be to implement graph-based neural networks (GCN and GNN) in Kaggle? I am struggling to find projects which utilize them.
dont spam
For the Titanic ML competition, a lot of data is missing in the Age and Cabin columns. My current guess is that age matters a lot for survival (physical ability etc.). I've found that https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#:~:text=The ship's passengers were divided,military personnel%2C industrialists%2C bankers%2C contains a list of passengers with their ages. Would it be acceptable to impute the values from this?
A total of 2,240 people sailed on the maiden voyage of the Titanic, the second of the White Star Line's Olympic-class ocean liners, from Southampton, England, to New York City. Partway through the voyage, the ship struck an iceberg and sank in the early morning of 15 April 1912, resulting in the deaths of 1,517 passengers.The ship's passengers w...
Check out some of the most-voted notebooks in the Titanic competition. Most talk about dealing with missing values in the data (an important data science skill the competition is trying to teach). You shouldn't look for external data with all the answers; the goal is to find ways to deal with missing data. Also check out the Kaggle course on missing values here: https://www.kaggle.com/code/alexisbcook/handling-missing-values
Because the output of pd.get_dummies() depends on the data it is applied to, df_train and df_test end up having different columns.
Therefore, if I fit a model on the training data, it cannot be applied to the test data.
How can I solve this?
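One common fix is to encode both frames and then reindex the test frame to the training columns. A minimal sketch with made-up toy data; the `Embarked` and `Fare` columns stand in for whatever the real frames contain.

```python
import pandas as pd

# Made-up toy frames; "Embarked" stands in for any categorical column.
df_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
df_test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [7.90, 13.00]})

X_train = pd.get_dummies(df_train, columns=["Embarked"])
X_test = pd.get_dummies(df_test, columns=["Embarked"])

# Give the test frame exactly the training columns, in the same order.
# Columns missing from the test frame are created and filled with 0;
# columns that only exist in the test frame are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

An alternative worth considering is scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"`, fitted on the training data only and then used to transform both frames.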
nvm, I figured out a basic one. Randomized adjacency matrix, so low accuracy, but it works as a proof of concept.
Hi kaggle community, I have been working on a project and I am unsure on how to do calculations on tuple data. My dataframe has data in the form (x, y) in every cell and I would like to add numbers of all the y data, depending on what x is, to a row total. I have about 1500 rows. what is the best way of doing this?
Here is the data I would need to do this on.
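Without seeing the attached data, one reading of the question is "for each row, sum the y values grouped by x". A minimal sketch under that assumption, with made-up rows; `row_totals_by_x` is a hypothetical helper name.

```python
from collections import defaultdict


def row_totals_by_x(row):
    """Sum the y values of one row's (x, y) cells, grouped by x."""
    totals = defaultdict(float)
    for x, y in row:
        totals[x] += y
    return dict(totals)


# Made-up rows; each cell is an (x, y) tuple.
rows = [
    [("a", 1.0), ("b", 2.5), ("a", 4.0)],
    [("b", 3.0), ("b", 1.0), ("c", 0.5)],
]
totals = [row_totals_by_x(row) for row in rows]
```

With a pandas DataFrame whose cells hold tuples, something like `df.apply(lambda r: row_totals_by_x(r.tolist()), axis=1)` would apply the same helper to every row. 1500 rows is small enough that a plain Python loop like this should be fast.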
Hi community,
I have a question related to k-fold cross-validation.
I'm currently training a classification model on a relatively small dataset (approximately 500 images across 5 classes) using a 5-fold approach. At the end of this process, I have five models. For my submissions, I utilize all five models to make predictions and take the average of their scores.
Is there any specific approach to replace these 5 models with only 1 model?
I need to do this to be able to use model ensemble method.
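The setup described above (split into folds, train one model per fold, average the scores) can be sketched in plain Python. This is an illustration only: it does not shuffle or stratify, which a real 5-class split on a small dataset probably should, and the scores are made up.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def average_predictions(per_model_scores):
    """Average class scores across models.

    per_model_scores[m][i][c] is model m's score for sample i, class c.
    """
    n_models = len(per_model_scores)
    return [
        [sum(class_scores) / n_models for class_scores in zip(*sample_scores)]
        for sample_scores in zip(*per_model_scores)
    ]


folds = list(k_fold_indices(n_samples=10, n_folds=5))

# Made-up scores from two fold models, for two samples and two classes.
scores = [
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.6, 0.4], [0.2, 0.8]],
]
averaged = average_predictions(scores)
```

One common way to end up with a single model is to use the folds only to choose settings (epochs, learning rate, augmentation) and then retrain once on all the training data with those settings. Whether that matches the averaged ensemble's score depends on the data.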
Similarly for the cabin variable, can I refer to external data (Schematics etc) to impute values or is that approach wrong?
Why can't I make my notebook public?
Hello all,
I'm just a beginner in this field. I'm facing a problem, or in simpler words, I'm stuck in a loop.
I'm fairly familiar with the theory and conceptual knowledge required (Python, Kaggle, maths, ML and all), but I'm not able to put things together to build my FIRST ML MODEL. Can any of you help me out with this?