#❓┊ask-a-question

empty lion
#

Hey everyone! I'm Jerad, a developer for Kaggle. We made this channel to give people a way to get help with Kaggle, or ask other questions in general that the Kaggle staff or community may be able to answer.

If there's anything you're not sure about, let us know! I'll actually be hanging out here, and may be able to help out!

gaunt oasis
#

Hi @indigo fulcrum, this post would be a better fit for the #🔗┊sharing-projects channel. Good luck on your journey to become a Notebook Expert!

sharp iris
#

Hello Everyone,
I have a question about the Kaggle competition.
There are many pre-trained models already available. If I run one of those models only on the test data, without any training on the train dataset, and submit the result, will that be acceptable? Or do I have to train a model first and then test my trained model on the test set?
Example:
There are many English speech recognition models on Hugging Face. Can I use those pre-trained models only on the test set, and if they produce a good score on the leaderboard, will that be acceptable in the competition?
I know it's a noob question. Please help me ☺️

Thanks
Aditta

deft stream
#

@empty lion Is there a way I can change my Kaggle username? I made a minor spelling error when creating the username.

verbal crest
verbal crest
sharp iris
verbal crest
#

@sharp iris Yeah it's fine unless a specific competition has a specific rule against using certain models.

sharp iris
#

Thanks

elder flower
#

How long does it take for competition results to be verified in general?

verbal crest
#

@elder flower Usually something like 2-7 days, sometimes longer depending on the competition.

coral pebble
#

Does anyone have tips for reaching the Master tier or above on Kaggle?

Not sure if this is the right channel to ask though

open isle
#

In the competition ranking or in some other ranking?
If it is about competitions, then you need to take part in many competitions, learn from the solutions of the winners of the past competitions and incorporate them into your approach.

coral pebble
open isle
#

Okay, so the official requirements are the following. I'll give suggestions for each ranking separately

#

Datasets:
There are two main parts: collecting interesting data and promoting it.
So, the first step is to collect some data. There are already thousands of datasets on Kaggle, so you would need to find some interesting data which hasn't been collected yet. Another approach would be to share some data for the ongoing competitions: for example, sharing relevant external data, doing some processing on the data and so on.
But simply making a good dataset isn't enough - you need to get people's attention. The first step is to make the dataset presentable. When you create a dataset, you see a score showing how well it is prepared; it covers descriptions, metadata and other things. So be sure to fill in all the fields.
And after the dataset is ready, you need to promote it - post about it on Kaggle forums and social media.

#

Discussions:
You need people to upvote your posts. 1 vote is bronze, 5 votes is silver, 10 votes is gold.
The "easiest" way to get upvotes is to be active on forums in an ongoing competition - share your insights, ask questions, participate in hot topics.
Some people simply share articles from the internet on Kaggle forums. It is a low-effort activity, but, unfortunately, it works.
Votes for the comments in the notebooks are counted too.

#

Notebooks:
Now, this becomes trickier. Personally I think that Notebooks (and Competitions) are much more competitive rankings compared to the two previous ones.
You need to make a good analysis, share it and get enough votes.
There are numerous ways to make good Kaggle notebooks:

  • build a good model for an ongoing competition and share it
  • do an EDA (exploratory data analysis) for a competition or dataset and share it

And so on. It is important to know that it is difficult to produce novel ideas, so many people try to get medals by joining a new competition and sharing a good analysis within the first 12-24 hours. It is tough, but doable.

It will take some time to be good at it, but it is definitely rewarding.

I'll share some resources to help you:
https://www.analyticsvidhya.com/blog/2020/12/exclusive-interview-with-andrey-lukyanenko/
https://www.youtube.com/watch?v=qKqLHs3J-Rc&ab_channel=AnalyticsVidhya

In this Interview, Andrey Lukyanenko joins us today to give insight into his data science journey and what pitfalls to avoid in the start.

Visualization is the best method to grasp the complex and hidden results from the data. Analyzing the visualizations is better than calculating data statistics and various plots and techniques can be used to do so.

In this DataHour, Andrey will share the history of data visualization. After which he will explain about different plot types and ...

#

Competitions:
Now, this is the most difficult ranking on Kaggle. You need to take part in competitions and place highly in them. It is very difficult, and even experienced data scientists can fail. The important things are to iterate over ideas fast, try many things and be prepared to spend a lot of time.

Here is a link where I talk about how I got a gold medal several years ago:
https://www.youtube.com/watch?v=rpClh8WmTdo&ab_channel=ChaiTimeDataScience
This channel has a lot of very useful interviews.

Audio (Podcast Version) available here: https://anchor.fm/chaitimedatascience

In this episode, Sanyam Bhutani interviews the king of kaggle kernels, Grandmaster Andrew Lukyanenko Ranked #1 about his journey into Data Science, Kaggle. They also talk about his pipeline for writing kernels.

Follow:
Andrew Lukyaneko
https://twitter.com/AndLukyane...

#

That's it. If you have any further questions, I'll be happy to answer them

coral pebble
#

Thank you so much for the detailed explanation kerneler

summer drum
#

In the Learn Python course on Kaggle, lesson 6, there is something that confuses me.

```py
claim.startswith(planet)
>>> True
```
When I try it myself in a Jupyter notebook, it returns False with the exact same code.
Also, why must the thing between the () be defined?
open isle
#

The argument of the startswith method should be a string, like "guud".

summer drum
#

Oh, it seems the reason it returned True originally was that planet had been defined as a string somewhere earlier.
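
For anyone who hits the same thing, here is a minimal sketch of what the lesson appears to be doing. The values of `planet` and `claim` are an assumption about what an earlier course cell defined; the point is only that the name passed to `startswith` must already hold a string.

```python
# Assumed values; in the course notebook these come from an earlier cell.
planet = "Pluto"
claim = "Pluto is a planet!"

# startswith compares the beginning of the string, case-sensitively.
matches = claim.startswith(planet)
lowercase_matches = claim.startswith("pluto")

print(matches)            # True
print(lowercase_matches)  # False
```

If `planet` holds a different string in your own notebook, the same call can legitimately return False.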

#

thank you for your help

dire geyser
summer drum
#

Can you guys explain what happened?

reef bough
open isle
desert tusk
#

Why doesn't the ICR competition appear in the Competitions file of the Meta Kaggle dataset?

torpid cipher
#

Hello, I'm new to Kaggle and trying my luck with the CommonLit - Evaluate Student Summaries competition. I wrote code for this and saved the result in a submission.csv file at the end, but I always get a scoring error. Can someone help me or give me a tip?

reef bough
torpid cipher
#

Submission Scoring Error

#

Save and Run all works and I get a submission.csv file but the upload doesn't work

reef bough
torpid cipher
#

Well, the output in my submission.csv and the required version look the same. Or do I have to store this in a pandas DataFrame?

reef bough
open isle
#

Yes, in fact I remember that there were some discussion grandmasters who got their rank by sharing such articles

torpid cipher
#

I do not understand what you mean

reef bough
# torpid cipher I do not understand what you mean

The above message was for a separate conversation. I am not sure exactly what could be the cause of your error; in my case, submission errors were generally caused by a schema mismatch. My second guess was that the scoring metric might be undefined for some samples, for example the log of a negative number, but I briefly looked into the scoring metric of the CommonLit competition and it looks like they are using RMSE, which should be well defined for all samples.

reef bough
torpid cipher
#

I actually have negative values in my predictions. Can it be that the MCRMSE is not implemented correctly?

#

I'm still a complete beginner. Please excuse me if I'm not doing everything right

#

So it can't be because of that, since both positive and negative values can occur.
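
A small sketch of why negative values are not a problem for this metric, assuming MCRMSE means the average of the per-column RMSE. The column names `content` and `wording` and all the numbers are made up for illustration; this is not the competition's scoring code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for one target column."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mcrmse(true_columns, pred_columns):
    """Mean of the per-column RMSE values."""
    scores = [rmse(true_columns[name], pred_columns[name]) for name in true_columns]
    return sum(scores) / len(scores)

# Hypothetical targets and predictions, including negative values.
y_true = {"content": [0.5, -1.0, 1.5], "wording": [-0.5, 0.0, 2.0]}
y_pred = {"content": [0.5, -2.0, 1.5], "wording": [-0.5, 0.0, -1.0]}

score = mcrmse(y_true, y_pred)
print(score)
```

Because every error is squared before the square root is taken, the metric stays defined for any real-valued prediction.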

#

may I post some of the logs here?

limber moat
#

How and where can I find out why my result disappeared from the final leaderboard of the ICR competition?

open isle
#

Usually results disappear from the leaderboard when the admins decide that there was some kind of rule breaking.

elder flower
#

Should I reply to the Kaggle competition admin who sent me instructions after the competition, to ask some questions about it?

open isle
#

I think it would be a good idea

elder flower
#

Are Kaggle winnings treated like lottery wins or like income for taxation?

dense atlas
#

Probably depends on your jurisdiction.

solid dome
#

Has anyone had their silver medal converted to bronze? I got an email about achieving silver on a notebook, and I saw the medal on Kaggle itself. But now, after 2 hours, the medal is bronze again. Any ideas? The votes are still the same!

reef bough
solid dome
#

Ohh I see

solid dome
#

Yea that's what I thought so asked it here

reef bough
#

ohh, didn't know about that, my bad. I guess, my comment is only applicable for comments and discussion then @solid dome

verbal crest
coral pebble
#

@solid dome This is because someone retracted their upvote. They gave you the vote needed for the silver medal, then deleted their account or retracted the vote a few hours later.

solid dome
verbal crest
#

@solid dome The scenario bogoconic1 mentioned above is a very likely cause. This sort of thing happens all the time. Our system constantly recalculates medals based on the requirements, so it is possible to lose a medal or have it downgraded.

#

Typically if you wait a little bit, you'll get some more upvotes and it will upgrade again.

solid dome
#

@verbal crest got it. Thanks for the clarification.

desert tusk
#

Why doesn't Kaggle have any audio competitions for newbies?

verbal crest
desert tusk
slim meadow
#

sorted_by_flavor_and_unitssold.to_markdown(max_rows = 20) This is throwing an error in Kaggle. What am I doing wrong?

#

max_rows int, optional
Maximum number of rows to display in the console.
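
As far as I know, `max_rows` is an option of `DataFrame.to_string`, not of `to_markdown`, which would explain the error. A minimal sketch of limiting the rows first and rendering afterwards; the DataFrame below is a made-up stand-in for `sorted_by_flavor_and_unitssold`.

```python
import pandas as pd

# Hypothetical stand-in for sorted_by_flavor_and_unitssold.
df = pd.DataFrame({
    "flavor": ["vanilla", "mint", "cocoa"] * 10,
    "units_sold": range(30),
})
sorted_df = df.sort_values(["flavor", "units_sold"])

# Limit the rows first, then render the result.
first_20 = sorted_df.head(20)
text = first_20.to_string(index=False)
print(text)

# first_20.to_markdown() renders the same 20 rows as a Markdown table
# (it needs the optional tabulate package to be installed).
```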

copper carbon
#

Hello, is it possible for a notebook to have >= 5 non-novice votes, but still not be awarded a bronze medal?

light nest
#

Hey guys, is it possible for a t-test to return 0 as the p-value?

deft fox
deft fox
deft fox
# copper carbon Hello, is it possible for a notebook to have >= 5 non-novice votes, but still no...

Yes, happens all the time. Not sure if you want me to go into detail, but a short version is that people who upvote you frequently don't have their votes counted. Supposed to prevent the gaming of the voting system. I have hundreds of posts with >1 non-novice votes without a bronze medal, and probably dozens of posts >7-8 votes without a silver medal. Similar for >12-13 votes without a gold medal.

deft fox
copper carbon
desert tusk
deft fox
deft valley
#

Not sure if this can be talked about, but where is the unlearning competition? The announcement said ages ago that it would be on Kaggle, and it has to be done before NeurIPS.

desert tusk
#

I am not a lawyer and I don't have any understanding of that.

elder warren
#

Hi @everyone,
Can someone please explain to me why the first output is a signed zero, even though arithmetically both should be unsigned zeros?

dense atlas
#

Float conversion… this is quite dangerous, esp with if conditions. Solution is to mind your type and use int() if you expect an int.
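
The screenshot is not visible here, so this is only a guess at the situation: a minimal sketch of how a negative zero can appear once a float is involved, and how `int()` removes it.

```python
import math

product = 0 * -1.0                  # the int 0 is converted to a float first
print(product)                      # -0.0
print(product == 0.0)               # True: the two zeros compare equal
print(math.copysign(1.0, product))  # -1.0: the sign bit is set

as_int = int(product)               # integers have no signed zero
print(as_int)                       # 0
```

Since `-0.0 == 0.0` is True, comparisons still behave; only the printed form differs.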

coral pebble
abstract sequoia
#

I need to evaluate the output of an ML model using MATLAB, but I don't have a license. Can someone run a script for me?

tulip hare
#

Could a server admin please change my server nickname to "Chris Akiki" ? 🙏

verbal crest
tulip hare
verbal crest
#

Totally get the desire to differentiate. Right now we're sticking with the linking since we really want people to be able to find each other on Kaggle.com too.

vapid jungle
#

Hey, I am doing a project with proteins and ligands in the form of mol2 and pdb files. Would anyone happen to know the best way to encode the files into a fixed-length vector while considering both structures?

frigid dove
#

Hey, I need resources on deploying ML models, PyTorch and TensorFlow based ones.

I want to know the best industry practices followed in deployment.

Any book/article related to it is appreciated.

ashen reef
#

@verbal crest, is it still unlinked?

verbal crest
#

@ashen reef It's linked now! 🎉

ashen reef
#

Haa.. Finally, thanks..!!

stable pivot
#

Hey, is anyone familiar with the flair library? I need some help!

raven peak
#

Anyone got any experience with similarity score scripting with chroma vector store?

past spade
#

I missed the BIPOC cohort deadlines. Does anyone know how often Kaggle organises such cohorts?

verbal crest
#

@past spade Not sure when our next one will be, but I'd say it's probably about 6 months away or so. In the meantime you can learn a lot from all the helpful people here in the discord!

lilac sierra
#

What happened to the plans to start the machine unlearning challenge in mid-August?
Will this competition appear in the near future?

slate locust
#

I need help with using the lasio Python library in Kaggle. I thought these libraries were imported automatically. I have tried

!pip install lasio

in a separate cell, but it didn't work. At one point it gave a network connectivity error, even though my wifi was good and fast.

I couldn't find any console to enter commands either.

Please, I need guidance 🙏

placid valve
#

Hey there, I am using R and this error pops up:
Error in predict.xgb.Booster(xgb_model, test_dmatrix) :
Feature names stored in object and newdata are different!
Can anybody take a look at my code?

torpid crater
#

Hey all✌️ Has anyone here ever messed around with creating synthetic datasets to train theorem-based neural networks for search and rec.? Basically hierarchy logic of domain-specific data. DM me, I would love any suggestions 👍

deft fox
# placid valve hey there I am using R and Error in predict.xgb.Booster(xgb_model, test_dmatrix)...

Not sure how anyone can look at your code when you didn't provide a link. Based on the error message Feature names stored in object and newdata are different! I would conclude that you have different features in train and test data (the matrix you are trying to predict). This is usually the ID column or something similar. So whatever features you are adding or removing in the train data, make sure the same operations are applied to test data.
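
The question is about R, but the idea is language independent, so here is a hedged pandas sketch of it. The column names and values are invented; in R the equivalent would be selecting the same columns from both data frames before calling `as.matrix`.

```python
import pandas as pd

# Hypothetical data: the train set has a target and an ID, the test set has a different column order.
train = pd.DataFrame({"id": [1, 2], "area": [50, 80], "rooms": [2, 3], "price": [100, 160]})
test = pd.DataFrame({"id": [3], "rooms": [4], "area": [120]})

# Columns the model is trained on: everything except the ID and the target.
feature_cols = [c for c in train.columns if c not in ("id", "price")]

X_train = train[feature_cols]
# Select the same columns, in the same order, from the test data.
X_test = test[feature_cols]

print(list(X_train.columns) == list(X_test.columns))
```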

deft hearth
#

Hello Kaggle community, I want to know if I can invite people to this discord server.

verbal crest
primal wedge
#

Does anyone know how to deploy an ML model on a website with onnxruntime and the Next.js framework?

#

?

hot dock
#

Hello guys. I just started doing Kaggle and I'm curious how you handle large image datasets. I am currently working on the RSNA challenge. It's some 400 GB, so I don't think it's possible to download it locally. What would be the best option for online computing with persistent storage?

deft fox
hidden estuary
#

I'm a newbie data scientist. What is the best way for me to advance in this field? I'm currently pursuing my master's.

subtle dagger
#

Hello, I have recently run into a problem using the Kaggle website: the headers and other parts of the UI overlap, which makes it difficult to use, as shown in the images below. I wanted to know whether the problem is caused by settings in my browser (even though I have had the same issue in both Chrome and Firefox) and how I can fix it.

graceful axle
#

I am really interested to know more about clustering algorithms from people who have used them. For example, perhaps data is broken down by age, gender, race, country, language. Standard questions to ask on forms. I know that in clustering, principal components of the cluster grouping boundaries don’t necessarily align with the predefined categories that set the axes. In fact, discovering structures in the data is the point. I have only used clustering at the very beginner level though. To what extent do demographics data with unusual individuals result in outliers from any cluster? This is a question I’ve been curious about for some time. Since I’m new to this discord I’m not sure if it’s too far off topic or if it’s a reasonable learning question. Could you please let me know if I should delete? I think it is clearly on the subject of data science but not a specific kaggle competition.

#

I would love to know how this works if anyone knows though

#

Like a social scientist or someone

placid valve
#

xgb_train <- xgb.DMatrix(data = as.matrix(train_data_main), label = a)
xgb_test <- xgb.DMatrix(data = as.matrix(test_data_main))

bst<-xgboost(data = as.matrix(train_data_main), label = a, max.depth = 6,
eta = 0.3, nthread = 2, nround = 100, objective = "reg:absoluteerror")

I have code like this. How can I easily optimize it?

zinc walrus
# graceful axle I am really interested to know more about clustering algorithms from people who ...

Hi @graceful axle ! I attempted clustering using purchasing behavior (I know it’s not exactly demographics as you asked). The data I used does contain a bunch of people whose purchasing trends are what could be called outliers. My understanding is that people in one cluster aren’t going to be exactly alike, but more similar to those in the same cluster than to people who are in a different cluster. https://www.kaggle.com/code/mounikagoruganthu/mathematical-distance-in-ml

slate locust
#

Please I'm still stuck. Any kind assistance will be greatly appreciated.

I need help with using the lasio Python library in Kaggle. I thought these libraries were imported automatically. I have tried

!pip install lasio

in a separate cell, but it didn't work. At one point it gave a network connectivity error, even though my wifi was good and fast.

I couldn't find any console to enter commands either.

Please, I need guidance 🙏

hot dock
#

what online gpu platform should I use? I want to use about 1 TB of persistent storage. Thanks in advance.

verbal crest
vivid owl
fleet ingot
#

Hi everyone, can anyone tell me how to extract data such as API permissions from mobile applications, like in the CSV file I attached? I need this for my thesis research. @verbal crest

green haven
fleet ingot
pliant ermine
#

Has anyone read the Kaggle Workbook? I was going to use it to see whether I can do my first Kaggle competition with it. It's from Packt and not that well known, or at least I hadn't heard about it before. It's available in a Humble Bundle now.

subtle dagger
verbal crest
# subtle dagger Which browser setting do you think would bring such changes?

Where are you accessing Kaggle from? We've previously had issues with China's firewall blocking Google's CDN causing this bug specifically. Otherwise maybe something else is blocking that specific resource from loading. (We are also internally looking to try and fix this bug, but it might take a little while - it only happens rarely).

subtle dagger
rotund moth
#

Hello everyone! I am new to ML and have done some basic models, feature engineering etc. Can anyone recommend some beginner-level competitions? I already did the Titanic and Spaceship competitions. Thank you!

haughty basin
#

Hello everyone, I am thinking of getting started with ASR and LLMs. Can anyone please suggest a proper roadmap?

charred scroll
languid tartan
#

Hello, I heard that Kaggle has demos at Google Cloud Next. How can I find those?

vivid owl
# languid tartan Hello, I heard that Kaggle has demos in Google Cloud Next, how can I find those?

I'd love to know which session that might be also. I searched the Session Library, but couldn't find it.

But the first step is to register for complimentary access to recorded sessions via a Digital Pass: https://cloud.withgoogle.com/next/

fresh flume
#

Hello, I have only one learning path in my Google Cloud Skills Boost, and my mentor @trim lotus informed me that I am supposed to have more than one. May I kindly request that this issue be resolved? Thank you.

green haven
#

Hey, how are you? My name is Matviy and I am a high schooler from Ukraine. I would just like a quick word of advice. I got perfect accuracy on this model, so I thought it was perfect, but I googled it and Google said it is more than likely a false accuracy. Could you let me know what you think about this?

hot dock
green haven
#

Yeah, I did split the data with:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sweet ice
#

Hey Kagglers, if any senior data scientist or machine learning engineer is willing to mentor me, I would be happy and very pleased with that.

rotund moth
#

Can I use the chi2 test and the Pearson correlation coefficient on a dataset containing both numerical and categorical variables?
I have a dataset which contains both numerical and categorical variables, so can I use these two techniques separately to select features? For example, A, B, C, D, E are my columns, where A and B are categorical, so there I'll use the chi2 test, whereas C and D are numerical, so I'll use the Pearson coefficient, and E is my target, which can be either categorical or numerical.

sweet ice
rotund moth
sweet ice
gleaming beacon
#

I have a question, maybe a silly one... Can anyone tell me how effective DataCamp and Udemy courses are for data science?

thick glacier
#

Hello everyone! I've just completed my first work on a classification algorithm using a spam email dataset. I would love to hear your thoughts and suggestions for any improvements I can make. Your insights would be greatly appreciated!

https://www.kaggle.com/dinanksoni/spam-email-classification

graceful axle
#

How to resolve this error?

```py
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

Error:
ValueError: Expected 2D array, got 1D array instead:
array=[1232.  677.  221. ... 1294.  860. 1126.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

thick glacier
#

reshape your x_train using array.reshape(1,-1)

graceful axle
graceful axle
thick glacier
#

Check which of your X_train and X_test is a 1D array, and change it into a 2D array.

graceful axle
#

ok

deft fox
# graceful axle How to resolve this error? ```py py scaler = StandardScaler() scaler.fit(X_trai...

The answer was already provided in the error message: reshape your data using array.reshape(-1, 1) rather than .reshape(1, -1). This is assuming you have a single column of data, or a single feature as described in the error. It is unlikely that you have a single sample, so array.reshape(1, -1) probably would not work. Also, instead of array you need to use X_train or X_test, meaning the actual name of the array.
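
A minimal sketch of the two reshapes, using a few of the values from the error message as stand-in data. It only shows the resulting shapes; the 2D result is what would be passed to the scaler.

```python
import numpy as np

# A single feature stored as a 1D array, like the one in the error message.
X_train = np.array([1232.0, 677.0, 221.0, 1294.0, 860.0, 1126.0])
print(X_train.shape)       # (6,)

# One column and as many rows as needed: n samples with 1 feature.
X_train_2d = X_train.reshape(-1, 1)
print(X_train_2d.shape)    # (6, 1)

# The other form gives 1 sample with n features, which is rarely what is intended.
one_sample = X_train.reshape(1, -1)
print(one_sample.shape)    # (1, 6)
```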

rough cosmos
#

Hi, I am new to Kaggle competitions and have a few questions. Are we allowed to use an LLM like Llama 2 for the CommonLit - Evaluate Student Summaries competition?
It says "Internet access disabled
Freely & publicly available external data is allowed, including pre-trained models".
Can we use a model that we train ourselves and upload to Hugging Face?
Thanks

deft fox
green haven
#

@deft fox This is kinda random, but you said you got your first computer in 1984. Was it the first Macintosh?

deft fox
green haven
#

Pretty cool, thanks for that fun fact.

regal plank
#

I am wondering:

code:
from sklearn.tree import DecisionTreeRegressor

Why do we import the specific function instead of just: import sklearn

deft fox
# regal plank I am wondering: code: ``from sklearn.tree import DecisionTreeRegressor`` Why d...

You don't have to import individual names, but there are at least 2 reasons to do so: 1) a bare import sklearn may not load the sklearn.tree submodule at all, so you would need import sklearn.tree anyway (Python loads the whole module either way, so this is not really about saving memory); 2) later in the script the class is called only by its name (DecisionTreeRegressor), while you would have to type the whole thing if only the package was imported (sklearn.tree.DecisionTreeRegressor). So, explicit dependencies and less typing.
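
To see what a `from ... import ...` actually loads, here is a small sketch using the standard library's `json` package instead of sklearn, so it runs anywhere. The same mechanics should apply to `sklearn.tree`, but that part is an assumption and not tested here.

```python
import sys
from json.decoder import JSONDecoder

# Importing one name still runs the whole submodule and its parent package.
parent_loaded = "json" in sys.modules
submodule_loaded = "json.decoder" in sys.modules

# The short name is simply another reference to the same class object.
same_object = JSONDecoder is sys.modules["json.decoder"].JSONDecoder

print(parent_loaded, submodule_loaded, same_object)  # True True True
```

So the practical benefit of importing a single name is the shorter call site rather than a smaller memory footprint.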

graceful axle
#

Is anyone here from the Kaggle staff? I would like to DM about my payment.

rustic ether
#

Hey guys! I've mostly been doing CV stuff, but I've been looking into Reinforcement Learning, especially with a robot simulation. Is there a pathway or free resources where I could look into deep RL with simulations in Unity?

gloomy egret
#

I need to feed my data into an LLM, and I am using LoRA to do it, but I have a large amount of text data, nearly 500M tokens. Does that harm the accuracy or efficiency of the model in any way? If so, are there any other methods to feed data into LLMs?

rough cosmos
graceful axle
#

Hello,

I'm currently working on a time series project, and I intend to employ the EMD+CNN technique for forecasting the output. Upon applying EMD to the training data, I obtained a total of 14 Intrinsic Mode Functions (IMFs). Consequently, I constructed my CNN neural network with dimensions (30100, 20, 14, 1), with 20 representing the window size. However, I encountered an issue when attempting to decompose the test data using EMD, as it produced only 11 IMFs. This inconsistency caused an error when trying to execute the CNN model.

I have two questions: Is there a method to enforce a consistent number of IMFs during the EMD decomposition process? If not, is there an automated way to select the most significant IMFs?

Please note that I am utilizing the EMD-signal library in Python.
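
One possible workaround, sketched with NumPy only: force every decomposition to the same number of rows after the fact. The helper `fix_imf_count` is hypothetical and not part of the EMD-signal library. I believe the library lets you cap the number of IMFs with a `max_imf` argument, but that is only an upper bound, so padding may still be needed. Whether zero-padded components help the CNN is an open question.

```python
import numpy as np

def fix_imf_count(imfs, n_imfs):
    """Return an array with exactly n_imfs rows whose rows still sum to the signal.

    imfs: array of shape (k, n_samples), one component per row.
    Extra components are merged into the last kept row; missing ones become zero rows.
    """
    imfs = np.asarray(imfs, dtype=float)
    k, n_samples = imfs.shape
    if k > n_imfs:
        head = imfs[: n_imfs - 1]
        tail = imfs[n_imfs - 1:].sum(axis=0, keepdims=True)
        return np.vstack([head, tail])
    padding = np.zeros((n_imfs - k, n_samples))
    return np.vstack([imfs, padding])

# Random stand-ins for the 14 train IMFs and 11 test IMFs from the question.
rng = np.random.default_rng(0)
train_imfs = rng.normal(size=(14, 100))
test_imfs = rng.normal(size=(11, 100))

test_fixed = fix_imf_count(test_imfs, 14)    # padded up to 14 rows
train_merged = fix_imf_count(train_imfs, 11)  # merged down to 11 rows
print(test_fixed.shape, train_merged.shape)
```

Either direction keeps the sum of the rows equal to the original signal, so the reconstruction is unchanged.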

Thank you

sage belfry
regal plank
#

How exactly can I check and compare?

If the prediction is only the price, how can I know which home was the input, and hence check it?

Like, the prediction is an array of prices, so how can I check which price is for which home? Or even which features are being applied?

if that makes sense.
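
A hedged sketch of one way to line things up: predictions usually come back in the same order as the input rows, so they can be attached back to the feature table. The `predict` function below is a made-up stand-in for `model.predict`, and the homes and numbers are invented.

```python
import pandas as pd

# Hypothetical validation features; the index identifies each home.
X_val = pd.DataFrame(
    {"rooms": [2, 3, 4], "area": [50, 80, 120]},
    index=["home_a", "home_b", "home_c"],
)

def predict(features):
    """Stand-in for model.predict: returns one price per input row, in order."""
    return features["area"].to_numpy() * 2.0

preds = predict(X_val)

# Attach the predictions back to the rows they were computed from.
results = X_val.assign(predicted_price=preds)
print(results)
```

The columns of `X_val` are exactly the features the model saw, so the same table answers both questions.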

worn crow
#

Hi guys, newbie here. I submitted an answer for Digit Recognizer competition and it was accepted. Now I'm trying to use that model and create a website or desktop application. But I'm stuck. I tried to get a prediction using the test data using the below code, but an error was thrown. What can be the issue here? and as a start, how can I use this model on Gradio to make a simple digit recognizer. TIA!

test = X_test.iloc[0]
pred=model.predict(test)
print(pred)

The first image is the start of the error, and the second image is the end of the error.
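
The error images are not visible here, so this is only a guess: `X_test.iloc[0]` returns a 1D Series, while most models expect a batch. A sketch of keeping the batch dimension follows; the all-zero DataFrame is a stand-in for the real test set, and the 28x28x1 reshape assumes the model is a CNN trained on images of that shape.

```python
import numpy as np
import pandas as pd

# Stand-in for the Digit Recognizer test set: each row is a flattened 28x28 image.
X_test = pd.DataFrame(np.zeros((3, 784), dtype=np.uint8))

single_series = X_test.iloc[0]    # 1D: the batch dimension is lost
single_row = X_test.iloc[[0]]     # 2D: still a batch containing one sample
print(single_series.shape)        # (784,)
print(single_row.shape)           # (1, 784)

# If the model is a CNN trained on 28x28x1 images (an assumption),
# reshape the row the same way the training data was reshaped.
image_batch = single_row.to_numpy().reshape(-1, 28, 28, 1)
print(image_batch.shape)          # (1, 28, 28, 1)
```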

thick glacier
#

Always read errors because the solutions are there.

dapper stratus
#

Has anyone messed around with YOLO enough that I can ask them a question?

deft fox
dapper stratus
#

ok my bad

deft fox
dapper stratus
#

I have used YOLOv8 to train my object detection model on a bunch of pics of apples and bananas.

#

it generated a train folder that looks like this

#

I am now facing difficulties trying to use this to test on new pics that I have.

#

I am done with the training part, but I can't test on the pics I have of apples and so on.

blazing thicket
#

Hello everyone. I'm in the Intro to SQL course and facing some issues. Is anyone there to help me?

blazing thicket
heavy oasis
#

Where could I get data on fire events globally?

nova tulip
#

I'm a data scientist working remotely. Does anybody have any recommendations on which country to migrate to, considering taxes, culture and all of that? (Not really important, but I would prefer a cold country; still open to any country.)

slate pulsar
#

Hi there,
I have performed EDA on a dataset, but the notebook is not shown in the notebook section of that dataset
how can I have my notebook there?

solar plank
#

Hey there, I am just starting off with Kaggle,
is there any list/sheet for different Kaggle Datasets to practice for beginners (equivalent to LeetCode 75 for example) to learn and implement different ML approaches?

thorny mirage
# dapper stratus

Hey there, try model = YOLO("runs/detect/train5/weights/best.pt"), or look in a similar subfolder such as runs/detect/train5.

#

little trick to find it,

❯ find runs/detect | grep -i best.pt
runs/detect/train5/weights/best.pt
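
The same search can be done from inside a notebook with the standard library. This is a sketch only: the helper name `find_weights` is made up, and the temporary folder just mimics the layout shown above so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def find_weights(root, filename="best.pt"):
    """Return every file called `filename` under `root`, sorted by path."""
    return sorted(Path(root).rglob(filename))

# Build a throwaway folder that mimics the layout from the message above.
with tempfile.TemporaryDirectory() as tmp:
    weights_dir = Path(tmp) / "runs" / "detect" / "train5" / "weights"
    weights_dir.mkdir(parents=True)
    (weights_dir / "best.pt").touch()

    found = find_weights(Path(tmp) / "runs" / "detect")
    relative = [p.relative_to(tmp).as_posix() for p in found]
    print(relative)  # ['runs/detect/train5/weights/best.pt']
```

In a real project you would call `find_weights("runs/detect")` and pass the path you want to `YOLO(...)` as in the message above.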
twin lion
#

Hey, hope you all are doing well. I would like some ideas for projects that can help me land a job in machine learning.

dense atlas
slate pulsar
wraith ledge
#

Is it a good idea to post a notebook on statistical methods for data analysis, like distribution methods, to get more upvotes?

green haven
#

Hahah I recall now I am enrolled in a course.

junior walrus
#

@vast relic: I need some responses to machine learning survey forms. Can I post them here?

amber flax
#

Hi! For the past few days, I've been trying to fine-tune a model using TPU parallelism / FSDP with a Kaggle TPU notebook. The reason I need to set up FSDP is because the model I'm using is very large (Openlm's open llama 3b v2). When I try to fine-tune it, I quickly run out of memory on the TPU. I'm not sure where to even begin with trying to get this to work, I was able to find this article in the documentation of Hugging Face Transformers Trainer, but I don't understand what I'm supposed to be doing...

Link: https://huggingface.co/docs/transformers/main_classes/trainer#pytorchxla-fully-sharded-data-parallel

My current code: https://www.kaggle.com/code/starblasters8/fine-tuning-llama

Any help would be greatly appreciated!!

hidden plinth
#

Hi, has anyone gone from an unrelated bachelor's degree to a master's? Or gotten into the field through means other than a bachelor's?

#

I'm currently getting a bachelor's in an unrelated but statistics-heavy degree that I am completely uninterested in. I am looking to get into data science, since the only thing I have really enjoyed about my degree so far has been the stats lol.

unreal portal
#

Two questions related to creating a kaggle dataset:

  1. Isn't the data limit per dataset supposed to be 100GB? I currently have a dataset of size ~50GB and when trying to upload an additional ~16GB of data it says I'm exceeding the size limit.
  2. I have uploaded my data in batches (see attached image) but want to unpack the individual folders so that all the data is in one single folder. How do I do this?
pliant rune
#

Hello everyone, I'm currently working on estimating the market size of the retail credit market in South Africa, and I'm facing some challenges. I'd appreciate your insights and suggestions on which statistical models or methods might be suitable for this task. Additionally, if anyone has experience or expertise in market sizing, I'd be grateful for any guidance or best practices you can share. Thank you in advance for your assistance!

grave echo
pure burrow
#

Hello everybody, I am new and I have a problem with an exercise notebook: I deleted some part of the initial code and I want to restart the file from the beginning.

hidden plinth
foggy monolith
hidden plinth
# foggy monolith I am in a similar situation myself. I graduated with a degree in Biology but am ...

I totally agree. I am a psych major, but my school focuses heavily on research. I have taken like 6 different research stats classes. My fear is that most of my prereqs won't transfer when I apply for a master's. Also, I have only ever used SPSS; we never really got to mess around with R or any database programs. From my understanding, STEM degrees have a far better chance of getting into those types of programs than social sciences.

zinc thistle
#

when i run the code it sometimes has an error

#

and then it sometimes works

cedar violet
zinc thistle
cedar violet
# zinc thistle

Looks like there's a row that contains inconsistent data, in this case the long string, so your model cannot interpret it. Try checking the row that contains this particular data and see if you can drop it.

deft fox
# zinc thistle

The line at the top and the bottom tell you what the problem is. Decision trees want purely numerical data, and you seem to have a mix of numerical and categorical variables. All non-numerical features must be converted to numbers.
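
A minimal sketch of converting a categorical column to numbers with one-hot encoding. The screenshot is not visible here, so the DataFrame and the `city` column are invented; `pd.get_dummies` is one common option, not necessarily the best one for this dataset.

```python
import pandas as pd

# Hypothetical data with one numeric and one text column.
df = pd.DataFrame({
    "area": [50, 80, 120],
    "city": ["Kyiv", "Lviv", "Kyiv"],
})

# One-hot encode the text column; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)
print(encoded)
```

After this step every column is numeric, which is what the decision tree expects.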

zinc thistle
regal plank
#

Hi, I am running a notebook and getting this error.
I assume this is an older version, as I don't see " > | ".

My account is phone verified. Where is the setting to connect to the internet?

unreal salmon
#

Hi there, I am trying to build a weather classification app with Streamlit. The problem is that my model is over 25 MB (it's 87 MB at minimum), which GitHub doesn't allow under its size restriction. I am thinking of using Git LFS to store a pointer to that file, but I read that Streamlit doesn't interact with Git LFS to fetch the large object from the LFS repository.
I need advice on how I can push my large file into the repo directly so that my app can find and use it.

tardy lodge
vivid owl
tawdry thorn
#

what do I do if my mentor has never shown up or answered my messages?

vivid owl
thin sequoia
#

Hey guys,
I have a general query
So, I started learning ML in June last year, then my college started and my first year was very rigorous, so I had to put my focus on it. Now that it's over, I want to resume my learning. What should I do, and which path do you recommend?
It will be really helpful for me!
Thanks 💛
More context: I did the intro and intermediate ML courses on Kaggle, participated in 2-3 beginner-friendly competitions, and then started the ML Specialization by Andrew Ng, did 2 courses and made a project.
Now that almost a year has passed, I am not able to recall most of the concepts, like how to handle bias and variance, gradient descent, entropy, etc.

Which path from following should I choose ?

  1. Do a recap of both kaggle courses and a fast revision of both ML specialization courses, and participate in more competitions, also make projects.
  2. Do the recap of kaggle courses and re-learn from both courses in ML specialization and participate in more competitions with projects.

Is there any other path you know of that would help me more than the ones above?
Or anything you would like to add?
It would be really helpful for me, I'd appreciate it!

cerulean basin
#

Hello! I would like to fine-tune LayoutLM using my own dataset of form images. These images are similar to those included in the FUNSD dataset. I intend to annotate the data using the exact structure of the FUNSD dataset. My question is regarding the block-level annotations: do the bounding box coordinates of the block need to coincide with the bounding box coordinates returned by the OCR (in my case I'm using pyTesseract to get the box dimensions)? The problem is that the blocks found by pyTesseract do not always match the desired box boundaries.

charred rock
#

Hello everyone, please what is the difference between Data science and machine learning? I am confused

wind ice
#

In simpler words, Data Science is data driven decision making and Machine Learning focuses on learning from the data to train the models

hard ether
#

Can anyone recommend some coding/programming or machine learning internships for high school students?
The more the merrier!
Thank you in advance!

kindred rune
#

I am a newbie deep learning enthusiast, and I am encountering a problem while creating and training a model. The accuracy of the model changes every time I run the code, and the change is sometimes substantial. I created the model by following the instructor and checked the code thoroughly for typos. Everything looks correct, but the accuracy changes with every instantiation of the model, which does not seem logical, as I have already set a random seed for that model. Please see the attached screen recording to see the issue. Can anybody explain this to me?

silent kite
# zinc thistle is this normal?

Computers don't work with text, and neither do machine learning algorithms, because every machine learning algorithm is just a math formula.
1- Set test_size in the train_test_split function.
2- Convert your categorical features to numerical ones.

main mango
kindred rune
deft fox
# kindred rune I am a newbie deep learning enthusiast, I am encountering a problem while creati...

When you run a model for a fixed number of epochs (25 in your case), sometimes the model learns more up to that point, and sometimes less. It is a function of how close the initial weights were to the actual solution. Instead, you should run the model with an arbitrarily large number of epochs, say 500, and use early stopping with a patience of 5-10 epochs. That means the training stops if the loss function doesn't decrease for 5-10 epochs. If done this way, you will get more reproducible results, and the number of training epochs will likely be different each time as well. That is normal.
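The patience rule described above can be sketched in plain Python. This is only an illustration: the list of losses stands in for a real training loop, and the function name and numbers are made up. In Keras the same idea is available as the `EarlyStopping` callback with its `patience` argument.

```python
def train_with_early_stopping(val_losses, patience=5):
    """Stop once the validation loss has not improved for `patience` epochs.

    `val_losses` stands in for a real training loop: each item is the
    validation loss measured after one epoch.
    Returns (epochs_run, best_epoch, best_loss).
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch, best_loss


# Hypothetical loss curve: improves for 4 epochs, then plateaus.
losses = [0.90, 0.70, 0.55, 0.50] + [0.52] * 496
epochs_run, best_epoch, best_loss = train_with_early_stopping(losses, patience=5)
print(epochs_run, best_epoch, best_loss)
```

With a budget of 500 epochs, training here stops shortly after the plateau begins rather than running the full budget, and the stopping point depends on the loss curve rather than on a fixed epoch count.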

glossy moth
#

Hi guys, I cannot import a library module in kaggle while working on my own local. Can someone help me please ?
Thanks
@verbal crest

deft fox
graceful axle
#

Starting with ML, I have Pandas and Numpy done. What should I start learning in ML? Try to do something with Titanic dataset?

hidden plinth
thick current
#

Hello, everyone. Do any of you know where I can find sources to create my own dataset? I would like to create a project or dataset that predicts the time it takes lettuce to grow based on temperature, humidity, TDS value, pH level, and nutrient solutions in a controlled environment. Thank you in advance.

steady dune
#

Can anyone please guide me on how to decide which algorithm to apply?
And what steps should I take to do EDA?

waxen siren
#

Hello there, I have a question guys. I have to work with Knowledge graphs and am completely new to ML. Could someone suggest some tutorial on PyKeen? It would be really helpful. Like a crash course or something

pseudo holly
thick current
pseudo holly
thick current
south shard
#

Are you going to schedule a new diffusers event? I was looking forward to that.

severe pewter
fair ingot
#

Anyone else having an issue saying there is no CSV file found when submitting?
I know the file is being created, maybe it's not in the correct place?
Outputting it to /kaggle/working/submission.csv

#

When I submit predictions I see an error but when I check the latest version of the notebook under "Output" tab I see the file with data in the correct format

grave solstice
#

Guys, please help me find resources for: Analysis of News Articles and videos for regional languages

I want to make a Media News Monitoring and Feedback System that can handle multiple regional languages, categorize news stories, and notify me about negative coverage of news in the media.
Please suggest some good resources related to sentiment analysis from news articles and video transcripts

deft fox
kindred rune
#

Help Required: I am trying to detect and remove the outliers from a dataframe.

The dataframe is very extensive and huge so I have selected three key features ['TS', 'Mean_RMS', 'Mean_ToF']. The main idea is to calculate z scores and detect outliers (whose z scores are greater than 3 standard deviations). Then append the indices of those outliers in a separate list. After that use this list of indices to filter out the rows from the main dataframe df.
See my code and error I am encountering:

import numpy as np
from scipy import stats
from sklearn.utils import resample
from joblib import Parallel, delayed

# Define the number of samples to take
num_samples = 10000

# Define the number of parallel processes
num_processes = 4

# Define the threshold value
threshold = 3

# Define the outlier detection function
def detect_outliers(data):
    z_scores = stats.zscore(data)
    outlier_indices = np.argwhere(np.abs(z_scores) > threshold)[:, 0]
    return outlier_indices

# Select feature columns to detect outliers from
df_select = df[['TS', 'Mean_RMS', 'Mean_ToF']]

# Perform outlier detection on random samples in parallel
samples = [resample(df_select, n_samples=num_samples) for _ in range(num_processes)]
outlier_indices = Parallel(n_jobs=num_processes)(delayed(detect_outliers)(sample) for sample in samples)

# Flatten the list of outlier indices
outlier_indices = np.concatenate(outlier_indices)

# Remove the outliers from the DataFrame
df.drop(outlier_indices, inplace=True)

# Reset the index of the DataFrame
df.reset_index(drop=True, inplace=True)

Error:
ValueError: Shape of passed values is (2, 350), indices imply (10000, 3)

Please help me resolve this error. Thanks in advance.
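For reference, the idea described above (z-score each selected column, collect the indices of rows beyond 3 standard deviations, then filter them out) can be sketched without resampling or parallelism. This is a simplified sketch in plain Python on made-up data, not a fix for the traceback above. The column names are reused from the post, and the values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outlier_rows(rows, columns, threshold=3.0):
    """Return sorted indices of rows whose |z-score| exceeds `threshold` in any column."""
    outliers = set()
    for column in columns:
        values = [row[column] for row in rows]
        mu = mean(values)
        # Population standard deviation, matching scipy.stats.zscore's default ddof=0.
        sigma = pstdev(values)
        if sigma == 0:
            continue  # constant column: nothing can be an outlier
        for index, value in enumerate(values):
            if abs((value - mu) / sigma) > threshold:
                outliers.add(index)
    return sorted(outliers)


# Hypothetical data: 20 ordinary rows plus one extreme Mean_RMS value.
rows = [{"TS": float(i), "Mean_RMS": 1.0 + 0.01 * i, "Mean_ToF": 5.0} for i in range(20)]
rows.append({"TS": 10.0, "Mean_RMS": 50.0, "Mean_ToF": 5.0})

bad = zscore_outlier_rows(rows, ["TS", "Mean_RMS", "Mean_ToF"])
bad_set = set(bad)
cleaned = [row for index, row in enumerate(rows) if index not in bad_set]
print(bad, len(cleaned))
```

Because the indices are computed on the full data rather than on resampled copies, they refer directly to the rows being dropped.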

deft fox
olive tinsel
#

llamaindex, langchain, assembly ai, weaviate, clarifai if we are supposed to make a chatbot with one of these, which would be good and free and can anyone share resources in making ai chatbots😅

tired granite
#

Want to try Google Cloud AI Platform Notebooks. But getting the error below and don't see GPUs in any region on Google Cloud. How does one get around this?

nvidia-t4-1x: The zone 'projects/bkowshik-kaggle/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.: Something went wrong. Sorry about that.

pure burrow
#

@vivid owl I have a problem with an exercise notebook: I deleted part of the initial code and I want to restart the file from the beginning. It's the Python course, module 5, Exercise: Loops and List Comprehensions.

zinc karma
#

Hey folks, so I am relatively new to the field of deep learning. I was working on a project for time series forecasting. There are a lot of factors affecting the GDP of a country, so I was thinking about multivariate analysis, but it isn't working like it should. I tried using different libraries and approaches, but the graph never seems to be impacted much by the factors. I wasn't able to find any good resources on this either. Can anyone help me with that?

vivid owl
vivid owl
# zinc karma heyy folks, so I am relatively to new to the field of deep learning. I was worki...

Hi - Discord is new and we all are still exploring and experimenting to find the best way to ask questions and get responses, but this worked for me and wanted to share.

  • Describe the issue in detail so people will know exactly what issues you are experiencing
  • Add a link to your Kaggle notebook so that people can take a look and investigate for you (vs. imagine what the error/issue might or could be 🤔 )
  • People will respond by leaving suggestions in the Comment section of your notebook in the Kaggle platform or here in Discord

Hope you will find this tip helpful. Good luck!

Below is an example of what I described above:

https://discordapp.com/channels/1101210829807956100/1133184287886299237/1148812886026764360

pure burrow
vivid owl
vivid owl
olive tinsel
vivid owl
olive tinsel
severe pewter
#

Beginner Notebooks on DNN, TFDF and RF: How can I improve the accuracy?

I am a beginner and I have 3 notebooks that use 3 different approaches to predict survival on Titanic. I tried many things but I was not able to get my accuracy above 80%. To break through this wall, I need the advice of knowledgeable people in the Kaggle Community. Please share your advice on how to improve my accuracy!

DNN Approach (78% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-deep-learning

TFDF Approach (77% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-decision-forests

Random Forest Approach (74% accuracy):
https://www.kaggle.com/code/touhidurrr/predict-survival-in-titanic-with-random-forest

fringe igloo
hidden plinth
#

I know this is a bit subjective, but do you guys recommend going through all the Learn Lessons first then trying a competition?

hazy spire
#

Hello Kaggle Community,

I'm currently working on a project analyzing two decades of Premier League soccer data with the goal of creating a predictive model. However, I'm new to soccer datasets. If anyone has experience or insights to share on soccer data analysis and regression modeling, I'd greatly appreciate your guidance.

Specifically, I'm interested in predicting full time outcomes from half-time data, and predictive modeling based on the historical data. Your tips, resources, or collaboration would be invaluable.

Please reply or reach out if you can help. Thank you!

red hawk
#

what exactly do you need help with? Also, that looks to me like a job interview take home question, in which case it is not very appropriate to ask other people to help you solve it...

fading swift
boreal belfry
#

I don't know what to do, I want to know what to do 😁

dusk radish
#

💀

tacit remnant
#

Hi everyone, I am in a mess! I am new to data science and was initially working as a data analyst.
I need some help with a data science task that was assigned to me, where I have to build a time series model in Python. Can anyone briefly share their experience here and guide me a little bit?

split epoch
#

Hey guys, I got 0.78229 on my first submission.

Could anyone look over my code and offer some suggestions? This is my first ML project and I want to also make a YouTube video on how I built it out and such, I know I still need to leave a lot more comments/documentation and clean up a few sections

https://www.kaggle.com/ryannolan1/titanic-wip-78-accuracy

raw vortex
#

Hey guys plz help me

#

I recently learnt data visualization via Kaggle. I have completed the final project, but Kaggle shows it as only 75% done, so I am not getting my certificate for data visualization: the course shows 98% done and it requires 100%.

thick current
#

Hi, everyone. Could you please take a look at my beginner projects, where I do a prediction of growth days of lettuce in a controlled environment? I think I am missing something I don't understand and I think my dataset also has missing features in order to predict the time it takes lettuce to grow in a controlled environment where temperature, humidity, tds value, ph level are automated and used for predictions.

Here is the link to my Kaggle notebook. If you can comment on what I did wrong, I will gladly take it as a stepping stone to further improve my knowledge. Thank you in advance. https://www.kaggle.com/datasets/jjayfabor/lettuce-growth-days

green haven
#

Hey, just got a warning for self-promo on Kaggle 100% deserved there, but it says if you keep posting your account will be banned. They mean from now on or should I go back and delete all the ones from the past

deft fox
green haven
#

I mean on Kaggle website

#

Not discord

deft fox
#

The first sentence of my response still applies.

#

Sure you can, go to their general discussions:

verbal crest
green haven
#

Ok, thank you

low mural
#

Hello everyone!
I'm looking to practice feature engineering. Do you guys have any recommendations for a Get Started competition where this skill would be particularly useful?

hushed juniper
#

Does anyone know why some people use log softmax activation during training instead of separating the log and the softmax?

dull hornet
#

Hi everyone. It's been some time since I practiced ML. My focus was only on data visualization and analytics so neglected this area. How do I start all over again when it comes to ML?

mystic bolt
#

Hi guys, is there any good free online statistics book? I'm new to machine learning, by the way.

mystic bolt
signal lance
#

Does anyone know how to configure the UI? I want to hide the console UI. Kaggle notebooks are awesome, but it is very hard to adjust the UI.

arctic marten
#

I am getting this error/warning message on Kaggle. please how do i solve it?

verbal crest
red hawk
arctic marten
red hawk
#

Also, how are you planning on using the combined file? e.g. it is fairly easy to combine csvs together using a bash command (e.g. https://unix.stackexchange.com/questions/293775/merging-contents-of-multiple-csv-files-into-single-csv-file) without having to load everything in memory, but you do need to remember that you will also have issues trying to, say load the combined csv into pandas on kaggle

arctic marten
red hawk
arctic marten
red hawk
#

or using %%bash cell magic

arctic marten
#

cat *csv > combined.csv
To run this on Kaggle I have to use this !cat *csv > combined.csv right?

red hawk
#

yeah

#

although do note that you have to remove the headers first (if you scroll a little through the comments on that answer):

"This answer will duplicate the headers. Use `head -n 1 file1.csv > combined.out && tail -n+2 -q *.csv >> combined.out` where file1.csv is any of the files you want merged. This will merge all the CSVs into one like this answer, but with only one set of headers at the top. Assumes that all CSVs share headers. It is called combined.out to prevent the statements from conflicting." (comment by hLk, Oct 12, 2019 at 1:00)

is probably what you want

red hawk
arctic marten
#

I am getting error

red hawk
#

you need to run it on files. e.g. xxx.csv

#

the command in the screenshot is pointing to a directory

arctic marten
#

I have to run it in each of the files?

#

Then how is the merging taking place?

red hawk
#

no, just run it once

#

the *.csv means running over all the csv files

arctic marten
#

Where will the output be saved? combined.out?

#

I tried and debugged it, but i am getting something else

red hawk
arctic marten
#

This is it

red hawk
#

works for me

#

use

%%bash
head -n 1 /kaggle/input/airline-delay-and-cancellation-data-2009-2018/2011.csv > combined.out \
    && tail -n+2 -q /kaggle/input/airline-delay-and-cancellation-data-2009-2018/*.csv >> combined.out
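For anyone who prefers to stay in Python, here is a rough equivalent of the `head`/`tail` approach using only the standard library. It is a sketch under the same assumption as the shell version, namely that all the CSVs share the same header. The demo file names and columns are made up.

```python
import csv
import glob
import os
import tempfile


def merge_csvs(pattern, output_path):
    """Concatenate all CSVs matching `pattern`, keeping a single header row.

    Returns the number of data rows written (header excluded).
    """
    paths = sorted(glob.glob(pattern))
    header_written = False
    rows_written = 0
    with open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                # Rows are streamed one at a time, so memory use stays small.
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Demo on two throwaway files (names and columns are hypothetical).
workdir = tempfile.mkdtemp()
for year, delay in [("2009", "5"), ("2010", "7")]:
    with open(os.path.join(workdir, f"{year}.csv"), "w", newline="") as f:
        csv.writer(f).writerows([["year", "delay"], [year, delay]])

# Write the result to a different directory so the glob does not pick it up.
combined = os.path.join(tempfile.mkdtemp(), "combined.csv")
print(merge_csvs(os.path.join(workdir, "*.csv"), combined))
```

Like the shell command, this never loads a whole file into memory, but the combined file can still be too large to read into pandas in one go.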
arctic marten
#

You used a different code, i will try this out now

#

The output is combined.out right?

red hawk
#

yeah

arctic marten
#

But it is not in CSV format. Mine is still running

#

Your code in the screenshot, cat combined.out | wc -l, what does it mean

#

?

red hawk
#

the last line is a count (wc -l counts the number of lines in the file)

#

it's a text file, you can just rename it combined.csv

arctic marten
red hawk
arctic marten
#

Still the same issue

deft fox
#

@arctic marten I don't know if you realize how lucky you are that Wendy has been troubleshooting this with you line by line, and from what I can tell for the better part of the past hour. If Wendy wants to keep doing it, great for you. Still, at some point I think you have to invest a bit of your own time to figure things out, as these are fairly standard and straightforward operations. I realize that I am butting in without being asked anything, but it is important not to take other people's time for granted. Wendy would most likely not tell you even if that was the case.

deft fox
#

I appreciate it, but that should be directed to @red hawk

arctic marten
arctic marten
#

Hello @red hawk, I was able to successfully load my data; I used nrows to specify the number of rows. Thank you so much for your time yesterday, I truly appreciate it. Thanks for teaching me how to use Unix commands to load the data into a CSV (I hadn't heard of that before). Do have a lovely day.

velvet bridge
#

I need help setting up my GPU for Jupyter Notebook. I followed the steps, but it still says my CUDA GPU is not available after importing torch.

thick glacier
silent kite
glacial python
#

Hi everyone, maybe a super dumb question, but I am going through the learning exercises and just built my own model based on DecisionTreeRegressor from sklearn. I understand X is the feature set and y is the prediction target. But when I get a prediction value for house prices, what is the y value about? I am unable to understand the prediction value when we do not have the concept of time, i.e., when we can expect the prices to match the predicted values.

#

So what exactly does y represent when we get the prediction in the end?

deep flower
#

Does anyone have a unique project idea for ML?

hybrid halo
#

Hello all,

I am currently trying to decrease the training time by sampling the dataset and then using that trained model to make predictions about the whole dataset.

After training on the sample, we checked the AUC for 10%, 30%, 50% and 100% sample sizes.
If the validation AUC for all of them is very close to each other we can minimize the training time by only training on the 10% of the sample for other datasets and can conclude that the predictions will be the same as that of when trained upon the whole dataset.

The problem is in the case of a very low minority class it is discarded in the sample and the predictions for those are not coming accurately.

The metrics I am using is AUC and the sampling method I am following is stratified sampling.

If you are aware of any better approaches I would like to discuss it.
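One possible tweak, sketched here under stated assumptions: sample each class separately and enforce a floor on the number of rows kept per class, so a rare class is never dropped by rounding. The function name, the `min_per_class` parameter, and the label counts are all made up for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, min_per_class=1, seed=0):
    """Return sorted row indices sampled per class, keeping at least `min_per_class` of each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label, indices in by_class.items():
        # Proportional quota, but never drop a class entirely.
        quota = max(min_per_class, round(len(indices) * fraction))
        quota = min(quota, len(indices))
        chosen.extend(rng.sample(indices, quota))
    return sorted(chosen)


# Hypothetical imbalanced labels: 990 negatives, 10 positives.
labels = [0] * 990 + [1] * 10
sample = stratified_sample(labels, fraction=0.10, min_per_class=3)
positives = sum(labels[i] for i in sample)
print(len(sample), positives)
```

Note that the floor over-represents the minority class relative to the full data, so the class ratio in the sample no longer matches the original. That is worth keeping in mind when comparing metrics across sample sizes.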

real isle
regal plank
#

I noticed there are different colors for the functions that can be applied when using the Tab key.

I assume the blue one is for the imported package, purple is default, and the wrench is also default, related to settings, right? (Just want to confirm my understanding.)

main mango
regal plank
#

```
three_tables_query = """
SELECT u.id as id, MIN(q.q_creation_date) as q_creation_date, MIN(a.a_creation_date) as a_creation_date
FROM `bigquery-public-data.stackoverflow.posts_answers` a
FULL JOIN `bigquery-public-data.stackoverflow.posts_questions` q ON q.owner_user_id = a.owner_user_id
RIGHT JOIN `bigquery-public-data.stackoverflow.users` u ON u.id = q.owner_user_id
WHERE u.creation_date >= '2019-01-01' and u.creation_date <= '2019-01-31'
GROUP BY u.id
"""

"""
SELECT u.id AS id,
    MIN(q.creation_date) AS q_creation_date,
    MIN(a.creation_date) AS a_creation_date
FROM `bigquery-public-data.stackoverflow.users` AS u
    LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
        ON u.id = a.owner_user_id
    LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
        ON q.owner_user_id = u.id
WHERE u.creation_date >= '2019-01-01' and u.creation_date < '2019-02-01'
GROUP BY id
"""
```

I'm solving the first notebook in the Kaggle Advanced SQL lesson. The first query is mine and the second is the answer. When I visualize it in my head, I think the JOIN logic should give the same results in the end, but Kaggle's check marks it as wrong. I want to confirm whether it is wrong because I used a different join or because of the results themselves (if that makes sense).

Any help is appreciated ty!
red hawk
# regal plank ``` three_tables_query = """ SELECT u.id as id, MIN(q.q_creatio...

what's the question? in any case 'MIN(q.q_creation_date) ' is wrong syntax, the column is called 'creation_date' (and similar error for 'a_creation_date'). I think the joins should give you the right results but the 2nd way of doing joins is quite a bit more clear than your 1st version (maybe a personal preference but I find thinking about 2 left joins a lot more intuitive than a full join and a right join...)

tip: If you paste your query into the BigQuery UI, it will highlight your query errors for you, something that running a query in a notebook doesn't do

regal plank
open flicker
#

I ran model.fit on the Kaggle online notebook and it is taking a very long time. (I ran it about 10 minutes ago and it's still 35/152 progress) Does running it online slow it down? Would it be faster to run it on a local PC? I have a gaming PC.

main mango
open flicker
#

Kaggle seems to prohibit code sharing outside the forum, does this mean that sharing code on GitHub is also prohibited?

fluid hazel
deft fox
deft fox
dim quiver
#

hello people, I am new to Datascience and ml, I have knowledge about performing Data Preprocessing and EDA and right now i am learning ML models starting from simple and multi linear regression
Can anyone suggest me some already done analysis and cleaning on datasets on kaggle? i would like to see how people go about doing data preprocessing in different ways

#

Also a followup on this question, I would also like to know where can i learn how to create Pipelines for datascience, i already know OOPS and python concepts just wanted to know how can they be implemented

dim sundial
#

Any good resource to learn shaders (glsl) for ai ?

dapper stratus
#

i tried changing the model architecture a lot but it always yielded worse results, this is not my first run

#

also somehow before using data augmentation, it had better results

#

if you run the notebook yourself and are trying to check the test results, the first 150 pics should be of horses and the remainder is of humans
i did a lot of research before asking here and i checked diff methods to yield better results but they did not help much

deft fox
pseudo harness
#

I have a question about notebook-only competitions: what actually prevents me from loading a pretrained model, or a model I trained myself, or even preprocessed data I created myself, and then running the notebook for inference only? I thought the goal was to test skills with limited hardware at one's disposal, so I am a bit confused.

keen geode
#

Does anyone know why I'm getting this error in the "Intro to SQL" course on Kaggle?

deft fox
regal plank
dapper stratus
hybrid halo
primal wedge
#

I ran Save & Run All on my Kaggle notebook, but when I download the notebook, the output does not appear. How do I solve it? Is it a bug or something else?

vivid ice
#

Hey guys,

Is there anyone here who has experience working with sound data, in particular models with sound as input and sound as output? Would love to ask some questions about where to get started!

deft fox
dapper stratus
#

i was learning data visualization and i stumbled across sns.clustermap

#

this was on a relatively small dataset

#

this was on a bigger dataset

#

are some people able to make sense of this or is it better suited for some datasets

deft fox
#

@dapper stratus What you are showing is a two-way clustering by certain features on top, and some IDs (presumably users) on the left. The intensity of color corresponds to values for a given feature/user combinations. Features close to each other in the top dendrogram are more similar to each other, while features that are far are dissimilar. Same for IDs. The plot tells you that a number of weekend nights and a number of week nights are correlated features, while number of weekend nights and booking status are not. Same for users/IDs, except that it is very difficult to see most of them as the plot is crowded on the left and right sides.

molten wharf
#

Hi I have a question regarding the definition of an "old post".

I have a notebook that just reached 50 non-novice upvotes about 2 hours ago.
But it wouldn't update the status of the medal.

Is it because the notebook was initially created about 3 months ago? I have actively modified until last month and recently updated a bit.

I have googled the term "old post" from the progression section, but no post online or Kaggle discussion satisfied my curiosity.

Is it the matter of the "old post"? or is it my patience?

finite pulsar
#

Hi, has anyone here worked on the Multi-label text classification problem?
Some of the features have very little labelled data.
I had tried my hands on SETFIT but it didn't give me good results.

deft fox
molten wharf
#

@deft fox Ohh...! Thank you so much! Then I may have to be patient for a bit more...
Thank you so much for your insight!🤩

mental cliff
broken sphinx
mental cliff
cerulean basin
#

Hello! I asked a question here a few weeks ago about LayoutLM and it never got an answer. #❓┊ask-a-question message
If no one can answer, does anyone have any suggestions on where I could find more info about fine-tuning LayoutLM.

fervent ocean
#

I'm not getting the medals for the upvotes in the discussion tab (yes, those votes are neither self-votes nor novice votes), and it says "too much requests" whenever I try to post a new topic. How do I fix it? It's been 8 hours. I have been constantly trying to post a new topic but it throws the same error. @tender trench @verbal crest @wind silo

verbal crest
grim urchin
fervent ocean
sleek siren
#

Guys I need a help
how do I submit my model of titanic?

fervent ocean
verbal crest
#

@fervent ocean You'll have to contact support (although if you wait longer it will probably fix itself)

molten walrus
main mango
molten walrus
# main mango It appears `step_4.check()` is missing.

it gives me :
/opt/conda/lib/python3.10/site-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: sparse was renamed to sparse_output in version 1.2 and will be removed in 1.4. sparse_output is ignored unless you leave sparse to its default value.
warnings.warn(

#

when i run this code:

from sklearn.preprocessing import OneHotEncoder

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

#

when i change it to
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
it removes the error, but it does not tell me that I finished the certificate

outer apex
#

Hi guys, I've got a little problem that generated shocking results. Can anyone spare just 5 minutes to explain the main problem with the result? (No need to help me correct it, I just want to know where the problem is.)

cross_val_score(xgb, X_test, y_test, cv=5).mean() : 0.9928446688501186
y_test_pre = xgb.predict(X_test)
mean_squared_error(y_test, y_test_pre) : 84941265285.15845

When I saw that shocking result, I tried calculating the mean squared error on the training data to see whether it equals 0, but:
y_test_pre7 = xgb.predict(X_train)
mean_squared_error(y_train, y_test_pre7)
btw xgb.score(X_train, y_test_pre7) : 1.0

#

Maybe this can be obvious for you guys but i have like just 2months of experience

obsidian pulsar
outer apex
thick glacier
deft fox
obsidian pulsar
thick glacier
austere horizon
#

How do i upload files

#

Can anyone tell me plz

molten walrus
magic latch
#

How do I use a model from Hugging Face in a competition that bans internet access? Is there a better way than uploading it to /kaggle/input?

sleek siren
#

Can someone help me out with this? I'm not able to understand which fields I should select.

thick glacier
#

In Predict Health Outcomes of Horses, when I try to submit the file, it shows me an error.

#

this my submission file

mental cliff
#

I've been trying to train latent diffusion model, somehow the loss does not converge. Is there any issue in my training loop?

fervent ocean
#

How much time does it take for support team to get back to you?

deft fox
obsidian pulsar
obsidian pulsar
blazing thicket
#

Hey folks,
I've been working on a project involving a large dataset, and I've hit a bit of a roadblock. I was wondering if there's anyone here who might be able to lend a hand or share some insights.
If anyone has experience with data analysis or visualization and would be willing to help, I'd be incredibly grateful. It doesn't have to be a huge commitment – even a few pointers or suggestions would be immensely helpful.

Thanks a bunch! 🚀

blazing thicket
obsidian pulsar
dapper stratus
#

is there a general rule to how much you should be testing?

#

for example, i'm training on 300 pics (the sense behind the challenge was trying to get a good accuracy with a low number of pics) so i used data augmentation and transfer learning and so

#

when it comes to testing though, they didn't restrict how much pics we got for testing, so i just went online and got a dataset that had almost 3000 testing pics but i only used around 80 pics to test my model that was trained on 250 pics

#

would this be considered bad practice and i should have used the whole testing dataset that i had?

deft fox
steady dune
#

Hi!
I am working on a project in which I need to integrate a DL model into a Flutter app.

Could anybody help me with how to integrate that model into a Flutter app?

obsidian pulsar
steady dune
obsidian pulsar
#

Um, maybe. It is possible for the accuracy of an original ML model to decrease when integrating it into a Flutter app

#

It may be too large or complex to run on a mobile device

#

or not optimized for the Flutter app

#

I have integrated DL models into Flutter apps before, but I don't have any public repositories that I can share 😩

obsidian pulsar
sharp plume
#

@everyone is it possible to use image processing to find soil nutrients like Nitrogen, Phosphorus and Potassium for agriculture? If anyone has any idea, can you please reply to me? It would be really useful.

obsidian pulsar
sharp plume
#

@obsidian pulsar is it possible to find the approximate level of the nutrients in the soil using image processing techniques?

obsidian pulsar
#

Yes, it is also possible to use image processing to detect the presence of specific nutrients in the soil

sharp plume
#

I am looking into converting images into HSV color space using computer vision

obsidian pulsar
#

For example, nitrogen can be detected by looking for the presence of chlorophyll, which is a green pigment that is essential for plant growth.

sharp plume
#

So if you have some inputs or sources that I can use to develop my model

#

That would be really helpful

obsidian pulsar
#

Researchers at the University of Arizona

#

SoilOptix

#

Farmers Edge

sharp plume
#

Thank you

thick glacier
#

Hi Dev,

Why is Kaggle not responding?

lone perch
#

Hello everyone,
I got an actual score in my Notebook for the model I built and submitted it for a competition but my public score is 0
Does it take some time to load or is there a problem with how I submitted my output (The submission was successful)? Thanks

deft fox
fresh spruce
#

Should I use knn model for training how a person looks?

obsidian pulsar
#

Deep learning models

#

If you are a Messi fan, 😀 ~

fresh spruce
fresh spruce
lone perch
steady dune
#

Currently I am working on a project in which I need to recognise diseases in plants/crops through image taken by mobile camera.

So, I am training my CNN model accordingly. Is it good to go with CNN or any other neural network that you recommend to increase the accuracy of diseases detection?

blissful spire
#

Hello, I am trying to resize the images but the disk space is full. Kindly guide me on how I can resolve this issue.

deft fox
# blissful spire

One option is to download the files to a local computer and run it there, assuming you have enough disk space. Yet another is to try and delete the original images on Kaggle AFTER you resize them, as that will presumably create more disk space. Not sure this option will work as the original dataset probably doesn't count as your image quota, and you may not have access to delete it.

blissful spire
deft fox
# steady dune Currently I am working on a project in which I need to recognise diseases in pla...

CNNs will likely work. However, it is unlikely that you will be able to collect a truly large number of images, say 10,000+, which is what is needed to train CNNs properly from scratch. I suggest you start with pretrained neural networks (VGG16, ResNet, SqueezeNet, EfficientNet, Inception or something along those lines) and fine-tune them on your images. https://www.analyticsvidhya.com/blog/2020/08/top-4-pre-trained-models-for-image-classification-with-python-code/

We cover 4 pre-trained models for Image Classification that are state-of-the-art(SOTA) and are widely used in the industry as well.

lethal raft
#

Guys i need a little help...
Actually i am new to the ML field... i have started learning some algorithms but cannot understand how to move on to projects.
Can someone pls help me with how i can start my journey in this field in a proper way.
Also i started reading some ML related books....are they beneficial??

#

Currently i am working on a project for yield prediction using NDVI data but i cannot find much of the data on kaggle...can you suggest me any site or anything that can help me with NDVI dataset /_/\

spare musk
#

I am looking for transcribed Australian podcasts on humour, sarcasm and everyday conversation, would anyone be able to help point me in the right direction?

steady dune
cold torrent
#

Hi I was wondering if anyone here is good with image analysis using python. I am still very lost, trying to learn on my own... I have labeled image data in yolo format using labelImg. I just don't know what to do with my image data set and labels in python

I have to quantify fluorescently labeled cells in images, any advice would be appreciated 🙏😭

obsidian pulsar
#

Did you preprocess your images before performing cell analysis?

deft fox
# cold torrent Hi I was wondering if anyone here is good with image analysis using python. I am...

A general approach here is to use train images with labels and object masks for fine-tuning the existing model. After that you test on a separate set of data that hasn't been seen during training. Not sure that YOLO has the ability to quantify fluorescence or anything else, as most of these types of models are meant to be qualitative rather than quantitative. It may be a bigger bite than what you can chew if you have no background in image analysis, as this is a decidedly non-trivial task.

cold torrent
#

So I have images like this. I have to count "Red" cells [Necrotic cells], "Green" cells [Live cells], "Green Yellow" cells [Early Apoptosis] and "Yellow Orange" cells [Late Apoptosis], and I used labelImg to box cells of each color type and label them as "Live", "Necrotic", "EA", and "LA", not sure if that is the right approach

deft fox
cold torrent
#

I also read that U-NET is something that could be used for something like this where I have different instances [colored] of cells, but not sure if that is the right approach or how to approach using such methodology and how to best label my images for such, I guess I was put in a bit of shark tank with no guidance to figure this out on my own since we are doing this for first time ever

deft fox
#

Also, it seems that your signal is diffusely green in cytoplasm no matter what color is in nucleus, and that also might complicate things.

#

To train from scratch using U-NET or any other architecture, you would need at least thousands of images with many labeled cells in each (100+). That's why a pre-trained model might still be desirable as you only need to fine-tune it, which can be done with a relatively smaller number of images.

cold torrent
#

i have a dataset of close to 400 images

#

how should I approach a pre-trained model for such a task? Also I agree the diffuse cytoplasm may be an issue

deft fox
# cold torrent i have a dataset of close to 400 images

400 images may sound like a lot to you - and I know from personal experience that labeling that many images is a pain - but that's nothing for training models from scratch. You'd have to set aside at least a quarter of them for testing, and 300 images x 100 cells is not very much when you have 4 different categories.

cold torrent
#

Yes; I have been labeling a lot of images, but I am not sure if I took the right approach - I used the labelImg API and the yolo method of output, which is just the format like this:

1 0.699219 0.573423 0.159375 0.172072
0 0.684766 0.521622 0.016406 0.023423
2 0.288672 0.284234 0.205469 0.217117

it gives the label, coordinates and length i believe.
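
For reference, each line in a YOLO label file is `class_id x_center y_center width height`, with the four numbers normalized to the image width and height. A minimal sketch of reading such lines and counting boxes per class follows; the class order in `CLASS_NAMES` and the image size are assumptions for illustration and would have to match the real classes.txt and images.

```python
from collections import Counter

# Assumed class order; it must match the classes.txt written by labelImg.
CLASS_NAMES = ["Live", "Necrotic", "EA", "LA"]

def parse_yolo_labels(text, img_w, img_h):
    """Turn YOLO label lines into (class_name, x_min, y_min, x_max, y_max) in pixels."""
    boxes = []
    for line in text.strip().splitlines():
        class_id, xc, yc, w, h = line.split()
        xc, yc = float(xc) * img_w, float(yc) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        boxes.append((CLASS_NAMES[int(class_id)],
                      xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2))
    return boxes

labels = """1 0.699219 0.573423 0.159375 0.172072
0 0.684766 0.521622 0.016406 0.023423
2 0.288672 0.284234 0.205469 0.217117"""

boxes = parse_yolo_labels(labels, img_w=1280, img_h=1024)  # image size is assumed
counts = Counter(name for name, *_ in boxes)  # number of labeled cells per class
print(counts)
```

Counting boxes per class like this only summarizes the manual labels; counting cells in new images would still need a trained detector.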

cold torrent
deft fox
#

You will have to research this on your own or better yet get local help from someone who knows, as I can't guide you through all the steps via keyboard.

#

Kaggle should have some notebooks that cover all these steps if you are patient and go through many search results.

cold torrent
#

Thank you so much, also once I have done that to my best abilities, would it be okay for me to reach out to you directly?

deft fox
#

There is no guarantee that I or anyone here will respond when contacted, as all communication is done on a voluntary basis and depends on available time. But there is no harm in trying to get in touch.

cold torrent
#

Thank you so much for your help again!

mental storm
#

Hello, this is a career related question to the data scientists and ml engineers of the industry.

How should an undergrad student navigate his way into internship and FTOs in this domain

runic geyser
# mental storm Hello, this is a career related question to the data scientists and ml engineers...

The very basic yet crucial thing is to be good at mathematics and statistics. You have to learn all the concepts down to their proofs (knowing applications is a plus).

Secondly, you have to be good at any programming language (I will suggest Python) from moderate to expert level.

Next, learn various machine learning algorithms and practice them with industry-based projects (based on the domain you work in or want to work in in the future).

Finally, craft a solid résumé.

Hope this helps..😌 🙂 🤝 🤞

keen forum
buoyant harness
#

Hi I'm Jonathan, Singapore born California bred. I work at the intersection between Quantum Computing and Web3 at pQCee, a post quantum computing startup based in Singapore. I'm the product owner of QuantumNFT, a platform that let's developers showcase their quantum programming skills. We're addressing the talent gap problem. We're validating in the QIF Quantum Games Hackathon. We want to do what Kaggle has done for Data science for Quantum Science. For this hackathon, an idea is to build out the competition workflow. Is there anyone from the team that can spare 30 minutes for a discovery call? 🙏

barren phoenix
#

Hey there is this allowed
Say I have a friend who's NOT competing in a competition
They decide to lend me their account for GPU hours

This isn't code sharing since they aren't competing, and it isn't multiple accounts of the same person. So it shouldn't violate any rules
Can any1/@mild geode staff confirm?

deft fox
barren phoenix
#

cc @twin elbow

deft fox
# barren phoenix Yeah waiting for the official response since the point is I'm borrowing someone'...

You are thinking only about what feels right to you, but moderators have to think globally. What if someone has 10-20 friends who are not participating and all of them are willing to donate their GPU hours? Do we draw the line at 5 friends that can contribute their GPU hours, or is 50 okay as well? If there is no line, soon enough everyone would be making Kaggle friends left and right, which would create inequity in how many GPU hours individuals have at their disposal.

barren phoenix
# deft fox You are thinking only about what feels right to you, but moderators have to thin...

Aaah yeah that makes sense !

But there's a loophole where the "friends" participate in a team with the person and pool GPU hours. The catch is that they didn't actually participate and just lent their account .

Since within a team people can share anything.

Not sure how that's moderated...

If it can't be moderated then it makes no sense not allowing what I proposed .

It could be capped the same as max team size which is 10

But yeah I agree there isn't any one size fits all 'fair' solution and there is a lot of nuance

deft fox
#

@barren phoenix Again you are not thinking like a moderator, so this might help. Let's say that someone has multiple accounts (a violation) and is running notebooks on them using the same IP number. Now here comes you without multiple accounts and borrowing your friend's GPU hours, but you are also running notebooks from multiple accounts using the same IP number. How are Kaggle moderators going to distinguish between these two events? Would they even care to do it even if they could?

barren phoenix
#

Aaah yup valid point thanks ! Ig I should just stick with my own account

dapper stratus
#

i have a question please
if i am making a brain mri tumor classification program and we are given a dataset of 250 images to train and validate on

#

and this is kinda like a challenge

#

i tried to use transfer learning

#

i am testing on an online dataset of 60 images that aren't in my current training data

#

the highest accuracy i am getting with transfer learning is 78%

#

i am testing many models and they are all performing poorer than Xception that got 78%

#

is it even possible to get higher than 78% or am i wasting my time

#

it is worth noting that i am using data augmentation for sure and i tried tuning hyperparameters like the learning rate

thorny stone
#

I am looking for someone to run through the Titanic competition with me so that I can learn. If anyone is up for the challenge or has any insights for me please share. Thanks

dapper stratus
#

here is a link to the notebook with all the models i tried to use, i test on almost 1k images. any help would be appreciated

deft fox
# dapper stratus i have a question please if i am making a brain mri tumor classification program...

It is impossible for anyone who hasn't tried that exact dataset to tell you whether you are doing well or not, and we don't even know what dataset you are using. Generally speaking, it is very difficult to get excellent performance if you train on 250 images and validate on 60, but 78% sounds decent. You may want to try to split your dataset into 5 folds and make 5 models, and then average their predictions. That might give you a small boost.
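
The 5-fold idea can be sketched roughly as follows. This is a toy illustration only: `train_model` is a hypothetical stand-in that "learns" the tumor rate of its training split, the labels are made up, and a real version would fine-tune a pretrained network on each split instead.

```python
import random

def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle the indices once and deal them into n_folds validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

# Toy labels: half the 250 images have a tumor (1), half do not (0).
labels = [i % 2 for i in range(250)]

def train_model(train_idx):
    """Hypothetical stand-in for training: predicts the tumor rate of its training split."""
    rate = sum(labels[i] for i in train_idx) / len(train_idx)
    return lambda image: rate

folds = kfold_indices(len(labels), n_folds=5)
models = []
for k, val_idx in enumerate(folds):
    # Train on the four folds that are not the current validation fold.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    models.append(train_model(train_idx))

def ensemble_predict(image):
    """Average the predicted probabilities of the five fold models."""
    return sum(m(image) for m in models) / len(models)

print(len(models), ensemble_predict(image=None))
```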

obsidian pulsar
lethal raft
#

Currently i am working on a project for yield prediction using NDVI data but i cannot find much of the data on kaggle...can you suggest me any site or anything that can help me with NDVI dataset .

#

NDVI data is basically data extracted from satellite images

#

Please respond

obsidian pulsar
#

or Sentinel Hub

tough hornet
#

I am new to data science and looking to get a headstart in this domain. I am currently learning Python and its libraries. Should I do something else along with this?

deft fox
obsidian pulsar
deft fox
# obsidian pulsar Understand, but You have to know Ensemble models are a type of machine learning ...

Again, you are not writing very precisely so others may get wrong ideas. Ensembles are not combining just base models - they can combine any kind of models. My original contention with your statement is that you seemed to suggest that single models on a large sample would do better than ensembles. Generally speaking, on the same dataset ensembles will do better than any single model. That goes for any dataset, whether big or small.

obsidian pulsar
# deft fox Again, you are not writing very precisely so others may get wrong ideas. Ensembl...

While ensemble modeling can offer excellent performance, it can be a complex process to implement and may not be as effective when handling large datasets. In such cases, it may be more appropriate to rely on a single model that can handle large datasets efficiently. This can help to streamline the overall modeling process and ensure that the final model meets the desired level of accuracy and performance.☺️

deft fox
# obsidian pulsar While ensemble modeling can offer excellent performance, it can be a complex pro...

I am not disputing your last statement. Yes, some people may not care about a complex ensemble to get a 0.01% improvement when a single model may be lighter and easier to implement. Single models may be more appropriate, no doubt about that. Yet "more appropriate" doesn't mean "far superior" which was your original statement. Single models are not "far superior" to ensemble models, even though there could be good reasons to use them.

obsidian pulsar
dapper stratus
# deft fox It is impossible for anyone who hasn't tried that exact dataset to tell you whet...

thank you for answering, my dataset is basically 250 pics split into two halves with half being pics of brains with a tumor and the other half being pics w no tumor, my issue was that i expected transfer learning to yield a higher accuracy but i could not get more than 81%, tbf i tested on 1k pics when i only trained on 250 which would not be a real life scenario since if i had 1k pics, i would have most probably used them for training but the challenge for the task was that i need to train on only 250 images

obsidian pulsar
daring pine
#

Are there any pytorch-based time series feature extraction libs? Most of the implementations I saw are based on dataframe.groupby and apply...

#

I mean if there's not I may have an idea and I can start something. I assume torch even on CPU utilizes maximum resources and can achieve better performance. (Or maybe I'm wrong?)

#

The libraries I'm looking at right now are tsfresh and tsfel.

deft fox
daring pine
#

Good to know. Thanks so much for that!

thick vessel
#

Is this the place to ask questions regarding specific Kaggle Courses/Exercises?

verbal crest
#

@thick vessel You can ask specific questions here or in the discussion forums for each course on the site.

red hawk
daring pine
# red hawk unless you can utilise the gpu I don't think torch will give you any extra optim...

I'm working on the Optiver competition dataset. tsfresh takes ~25 secs (of which 15 seconds go to creating the rolling time series data frame) to do the feature extraction for the first stock in the training set, and there are 200 stock_id values in that dataset. If I use a for loop on top of my current code it'll probably take >1hr on CPU to run on all training data. This is fine and affordable for the training stage (since I can pre-compute the features), but I may be using GPU for inference. But I think overall you are right. I realized that if I use another way to compute, it'll probably spend as much time creating the rolling data frame.

crisp gust
#

I just created a notebook and I wish to share it to the competition's code section, can anyone teach me how to do that? pika_wow

regal plank
steep pebble
#

Hello all! I had a question regarding the implementation of momentum based gradient descent. Should I be zeroing the momentum based gradient at the start of each epoch or keep updating it across epochs?

deft fox
# steep pebble Hello all! I had a question regarding the implementation of momentum based gradi...

The momentum term in SGD is meant to overcome noisy gradients, especially in small datasets. In the Keras implementation its default value is 0, which means that it is not required. If you don't enter any momentum value, I suspect in most SGD implementations it will default to zero. Presumably that means it is safe not to use the momentum, but it doesn't necessarily mean that momentum=0 is the best choice. I think you should stick with one value for it rather than try to change it from one epoch to another, as that will only add more complexity to the interpretation.

steep pebble
#

Sorry I was not referring to the momentum parameter in the update rule. Rather, the velocity that you maintain during training as (momentum parameter) * velocity + lr * gradient
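
To make that quantity concrete, here is a minimal sketch of the update on a toy 1-D problem, with a flag for the two variants being asked about. The objective and all numbers are assumptions for illustration; as far as I know, common optimizers such as the SGD classes in PyTorch and Keras keep the velocity buffer for the whole run rather than resetting it per epoch.

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, epochs=3, steps_per_epoch=5,
                 reset_each_epoch=False):
    """Minimize a 1-D function with v = momentum * v + lr * g, then w -= v."""
    velocity = 0.0
    for _ in range(epochs):
        if reset_each_epoch:
            velocity = 0.0  # the variant in question: drop the accumulated history
        for _ in range(steps_per_epoch):
            g = grad_fn(w)
            velocity = momentum * velocity + lr * g
            w -= velocity
    return w

# Toy objective f(w) = w**2 with gradient 2*w (assumed purely for illustration).
grad = lambda w: 2.0 * w
w_keep = sgd_momentum(grad, w=5.0)                          # velocity carried across epochs
w_reset = sgd_momentum(grad, w=5.0, reset_each_epoch=True)  # velocity zeroed every epoch
print(w_keep, w_reset)
```

The two runs end at different points because an epoch boundary is just a bookkeeping line for the optimizer; resetting there throws away the smoothing the velocity had built up.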

steep pebble
gritty cloud
#

anyone know what the options are for persisting data in output folder across notebook sessions?

#

I'm converting some csv files to parquet, but I dont want to do that every time i boot up a notebook

deft fox
gritty cloud
lapis mango
#

i am taking an applied statistics class with R and i am stuck on an error in my code

regal plank
#

is this normal/ok ?

I am downloading image files for an image model classifier, I noticed the red CPU bar and RAM, is this ok ? should I ignore it? is there a way to optimize it ?

Appreciate all the help peepolove

regal plank
#

getting this too, looks like I am also running out of RAM.

regal plank
#

Hello everyone!

I am currently working on my second computer vision model, and I am having a hard time reducing the error_rate and improving accuracy, despite intensive data cleaning (it actually got worse). I would appreciate any suggestions!

notebook link:
https://www.kaggle.com/code/raedsherif/green-leaves-classifier/notebook

I also have a few questions:

  1. Is the input data available to you ? I could not find it in the notebook, should I download it to my local device and upload it as a Kaggle dataset, then input it in the notebook?
  2. As you can see, when installing libraries it generates a lengthy output. which looks bad, is there a way to clear or avoid displaying this output?
  3. After cleaning the images, should I create new dl and run fine_tune again ? Does it pick up from where it left off and saves previous progress?

My first model was 76% accurate at predicting plastic types and I was able to share it with a basic interface on Gradio, for this one, I'm aiming for a high level of accuracy and eventually want to deploy the model on a website. Any tips or tricks would be highly appreciated

thank you peepolove

deft fox
# regal plank Hello everyone! I am currently working on my second computer vision model, and ...

If you use pip install -q, that should make the installation quiet. I suggest you consider using validation loss as your metric rather than accuracy, as the loss is steadily going down in your case. I think you shouldn't stop the training after the first epoch in which the metric doesn't improve. Typical patience values, i.e. the number of epochs you tolerate without improvement, are 5-20 epochs, but in your case 2-5 may be more appropriate. It seems that you also need to fine-tune for more than 7 epochs, maybe with a decaying learning rate.
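
A minimal sketch of what patience means, independent of any framework. The function name and the loss values are made up for illustration; deep learning libraries typically ship their own early-stopping callbacks that do this for you.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # no improvement for `patience` epochs in a row
    return best_epoch, len(val_losses) - 1

# Made-up validation losses with one bad epoch early on (index 2).
losses = [0.90, 0.70, 0.72, 0.65, 0.66, 0.67, 0.68, 0.60]
patient = early_stopping_epoch(losses, patience=3)
impatient = early_stopping_epoch(losses, patience=1)
print(patient, impatient)
```

With patience=1 the run stops at the first bad epoch and never reaches the better loss at epoch 3, which is the situation described above.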

regal plank
deft fox
patent kiln
#

can anyone tell what does q1.check() mean

glossy edge
#

hey guys, I would like to start my first data analysis project but I don't know where to start. Do you recommend I start with a big dataset or a small dataset? Could someone help me pls?

verbal crest
#

@patent kiln q1.check() is the function to check your answers, make sure you have run the cells above it to properly import the learntools package.

quiet bison
#

I want to get into NLP. I am now studying transformers; what is the next step for me?
What are the best books or courses that I should take?

patent kiln
tidal echo
#

And this error is likely bcoz u have not run the initial code cells to activate this service

split epoch
#

@glossy edge start with titanic. Work on Eda first before trying a new model every day.

#

Have a vid covering titanic and a playlist of all the models if you want to check it out

summer drum
#

Hi everyone, I have recently been learning about the preprocessing step in ML. I have a question regarding standard scaling. Some algorithms require standard scaling to work accurately, e.g. for regularization. As I scroll through lectures and some shared notebooks on Kaggle, I notice that they apply the sklearn standard scaler right after train_test_split, without considering the data type (like nominal features, or features after one-hot encoding). My question is: does it affect the performance of the algorithm?

obsidian pulsar
#

Hello @everyone!

summer drum
#

hello Mnihj

split epoch
#

Depends on the data used, but you can throw it into a pipeline and select certain columns for it

#

I do standard scaling after train test split
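
A rough pure-Python sketch of that idea: learn the scaling statistics on the training split only and apply them just to the numeric columns, leaving one-hot columns alone. The column names and values are made up; in scikit-learn the same thing is usually done with `StandardScaler` inside a `ColumnTransformer`.

```python
from statistics import mean, pstdev

def fit_scaler(rows, numeric_cols):
    """Learn mean and standard deviation per numeric column from the training rows only."""
    return {c: (mean([r[c] for r in rows]), pstdev([r[c] for r in rows]))
            for c in numeric_cols}

def transform(rows, scaler):
    """Standardize the numeric columns; leave one-hot / nominal columns untouched."""
    out = []
    for r in rows:
        scaled = dict(r)
        for c, (mu, sigma) in scaler.items():
            scaled[c] = (r[c] - mu) / sigma if sigma else 0.0
        out.append(scaled)
    return out

# Hypothetical rows: two numeric features and one one-hot encoded feature.
train = [{"age": 20.0, "fare": 10.0, "sex_male": 1},
         {"age": 40.0, "fare": 30.0, "sex_male": 0}]
test = [{"age": 30.0, "fare": 50.0, "sex_male": 1}]

scaler = fit_scaler(train, numeric_cols=["age", "fare"])
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)  # reuses the training statistics
print(test_scaled)
```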

thick glacier
#

code:-
t3 = tf.random.Generator.from_seed(12, alg="threefry")
t3.normal(shape=(2, 3))
error:-
InvalidArgumentError: {{function_node _wrapped__RngReadAndSkip_device/job:localhost/replica:0/task:0/device:CPU:0}} Unsupported algorithm id: 2 [Op:RngReadAndSkip]

why i get this error?

reef furnace
#

how to use hugging face model "bert-base-uncased" model in kaggle.
I was trying to login using hf-cli or using a token still autotokenizer is throwing private repo use token error.

summer drum
lapis dirge
#

Has anyone tried a decision tree on Titanic... and what accuracy did you get?

deft fox
thorny sentinel
#

Hey! Does anyone know where to find examples of companies that reported a clear benefit to their business as a result of hosting a Kaggle competition or a competition on another platform? In Luca and Konrad's Kaggle book, they list three examples (Netflix, AllState and GE), but I would like to find more examples

lime wyvern
#

Is there a good explanation anywhere of which GPU (P100 vs T4) you should use? I've struggled to find anything!

velvet bridge
#

can i fine tune a model with a json structure or even jsonl? i know the answer is yes. I just need to know if i always have to format the data this way when fine tuning:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
rustic salmon
#

anyone know how to embed OpenAI keys in a project

#

I am getting serious error , let me know if you can do it

desert tusk
#

Do you know a blog (or something else) that reviews Kaggle competitions (namely, describes the most common approaches, analyzes the best solutions, etc)?

desert tusk
#

If not, I'm thinking of starting one

dapper bay
#

HI Everyone, does anyone understand the Time series forecasting (Sales forecasting )in Kaggle? I'm having a hard time understanding it.

velvet bridge
#

i passed a json doc into model using langchain and when making an inference of the RAG model it does not seem to be responding based off the json data specifically, when asking it certain questions that should have a good response. Do i need to restructure my Json Data away from the nested dict style to something more condensed?

desert tusk
#

I think that it would be meaningful for everyone

split epoch
#

In this Python Machine Learning Tutorial, we take a look at how you can split a data set through train test split in scikit learn.

This is a great method for prepping your data before you run a model.

Email: ryannolandata@gmail.com
LinkedIn: https://www.linkedin.com/in/ryan-p-nolan/
Twitter: https://twitter.com/RyanNolan_
GitHub: https://githu...

#

@glossy edge 25 vids in here

wheat furnace
#

Hello guys,
Please is there a place to see past project presentation slides and recordings?

fair hawk
#

Can someone help out with this project

keen mantle
#

Hello

crisp gust
#

hey guys, does saving a notebook in kaggle use up my gpu quota as well?

#

and when using the T4 gpu, is there anyway to use both of them at once? to maximize the usage of the quota

tidal echo
#

why is it that, when compiling the model , say running 80 epochs, i get to see a pattern in the change of values of validation loss and accuracy? and also there is a pattern in reduction and increment in the learning rate?

obsidian pulsar
#

Hello @everyone, how can I help you?

tardy mauve
#

is the scoring stage just comparing my submission with the real data or will it use some private data to check? it's taking more than an hour and i'm still not getting my score

olive tinsel
#

Heyy folks, any suggestions like tutorials to start working with tensorflow???

manic iris
#

hi everyone, i'm working on a clustering model with kmeans. But i have a dilemma. My elbow method says that the best k is 4, but when i look at a 2d pca plot it looks like the best k should be 2
When i try using k=2 in the model the clusters don't show the obvious groups. What am i doing wrong? should i recheck the documentation? harold

daring robin
#

im looking for data for time series forecasting, it should be above 5 GB anything around 10gb+ would be nice to practice!

woven topaz
#

I have a question . On Kaggle Getting started competitions, we are provided with train and test sets separately. Is it okay to merge both of them for doing preprocessing easily or not ? According to this blog : analytics vidya blog (https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/) we should not do this because it can lead to Data Leakage . Can anyone tell ?

In this article, we will discuss all the things related to Data Leakage including what it is, how it has happened, how to fix it,

woven topaz
dusty skiff
#

i have the age column missing in some of the rows in my data, should I do mean imputation or resort to some other method
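
One common option, sketched here under assumptions (toy rows, a hypothetical `age_missing` flag column): fill with the median of the training ages, which is less sensitive to outliers than the mean, and reuse that same value for the test set so no test information leaks into preprocessing.

```python
from statistics import median

def fit_age_imputer(train_rows):
    """Compute the fill value from the known ages in the training rows only."""
    return median(r["age"] for r in train_rows if r["age"] is not None)

def impute_age(rows, fill_value):
    """Return copies of the rows with missing ages filled, plus a 0/1 missing flag."""
    return [{**r,
             "age": fill_value if r["age"] is None else r["age"],
             "age_missing": int(r["age"] is None)}
            for r in rows]

# Toy data; None marks a missing age.
train = [{"age": 22.0}, {"age": None}, {"age": 38.0}, {"age": 26.0}]
test = [{"age": None}, {"age": 54.0}]

fill = fit_age_imputer(train)
train_filled = impute_age(train, fill)
test_filled = impute_age(test, fill)  # same fill value as for the training rows
print(fill, test_filled)
```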

naive raft
#

Hi, i have 50 clusters of numbers, each cluster corresponds to a plume shape. i know the location of the origin point, and i want to translate each plume (cluster) to the same position so that they are superposed on each other, because i want to evaluate their average values. does anyone know how to implement this in python

#

i am trying this from yesterday, not able to implement it

twin lion
#

Hey, I hope you are all doing well. I want to extract information from a resume. How can I do it? Any guidance, please?

hybrid needle
#

hello everyone, can somebody help me and answer on my questions?

deft fox
frank pivot
#

Hi, who should I contact for a Kaggle bug?
I see a weird bug on the Output => Submit page: on the CTF competition the web page starts blinking a lot, I cannot click "Submit" and after a few minutes I get a 429 too many requests error on any kaggle.com web page, it's like I'm banned for a few hours then.
Some pages return:

daring pine
#

So...does data regularization(like min-max, normalization, box-cox, etc. to keep all data items in a limited range) improve performance for Gradient boosting trees like LightGBM(in a regression task)? I do assume they don't do much for standard decision trees in a classification task that I may learn in class since they are based on entropy.

daring pine
glossy edge
#

@split epoch thanks

shrewd scarab
daring pine
shell sierra
#

Very stupid question, but I am new to competitions and coding in general. Where do I find the data to download? Ik the competition lists pfr and nflverse, but how do I download either of those so i can get started

deft fox
deft fox
urban hemlock
#

Hello, I'm currently training a Convolutional Neural Network. Is this a good way to train my model to unseen data?

#

The training accuracy jumps real high at the start while the validation accuracy gradually gets better

obsidian pulsar
arctic marten
#

Hello, I am building an analytics dashboard on Streamlit, my code is above. I want to add the delta parameter to the metrics (st.metric, label =, value=, delta = ). My delta will be the total sales difference (increase or decrease). The aim is to show if sales are increasing or decreasing each year. All the code I have written to achieve this has gone wrong. I got it done in my jupyter notebook but if I implement the same on Streamlit I get an error that my value should be an int, str... Please, I need your assistance
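
A guess at the cause, since the code itself is not visible here: the value or delta being passed may be a pandas or NumPy object rather than a plain number. The sketch below computes year-over-year deltas as plain Python floats; `yearly_deltas` and the sales totals are hypothetical, and the Streamlit call is only shown as a comment.

```python
def yearly_deltas(sales_by_year):
    """Map each year to (total, change vs. previous year); the first year has delta None."""
    result = {}
    previous = None
    for year in sorted(sales_by_year):
        total = float(sales_by_year[year])  # plain Python float, not a pandas/NumPy object
        delta = None if previous is None else round(total - previous, 2)
        result[year] = (total, delta)
        previous = total
    return result

sales = {2021: 1200.0, 2022: 1500.5, 2023: 1400.0}  # hypothetical yearly totals
metrics = yearly_deltas(sales)
for year, (total, delta) in metrics.items():
    # In the Streamlit app this would become something like:
    #   st.metric(label=f"Sales {year}", value=f"{total:,.2f}", delta=delta)
    print(year, total, delta)
```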

sullen pawn
arctic marten
latent bough
#

Broad stroke question- but can someone explain data models to me and how to build them?
Is it more than just combining datasets into relational tables? i.e. Sales and Customer data joined on transaction id?
Or is it more forecasting or financial modeling? I think the job postings I read are throwing this term around very loosely...

deft fox
celest sphinx
#

One example of a data model is the Entity-Relationship Model (ERM), which represents entities, their attributes, and the relationships between them. For instance, in a university database, "Student" could be an entity with attributes like name and student ID, connected to other entities like "Course" through relationships indicating enrollment.
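
As a small illustration of that university example, here is a sketch using Python's built-in sqlite3 module. The table and column names and the sample rows are assumptions; the point is only that entities become tables and the many-to-many enrollment relationship becomes a table of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- The many-to-many "enrollment" relationship becomes its own table.
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO course  VALUES (10, 'Statistics'), (20, 'Databases');
INSERT INTO enrollment VALUES (1, 10), (1, 20), (2, 20);
""")

# Follow the relationship to list which student is enrolled in which course.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM enrollment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)
```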

urban hemlock
urban hemlock
uneven radish
#

is 15000 too big for rmse?

celest sphinx
#

i'm guessing you didnt normalise it

#

If you estimate number of sales for your local shop and you have a RMSE of 15000 it's huge yeah xD

#

if you estimate the number of sales for the whole McDonald's corporation, 15000 RMSE is very good

uneven radish
#

@celest sphinx can you check dm?

celest sphinx
#

no i dont take DMs but i was just giving you a quick answer to your question

#

usually, to evaluate your model performance

#

you start by making a "baseline" model, so a very simple model for example : always predict the average value of the dataset

#

there are more accurate baseline models of course

#

then you can compare your model performances to those statistical baseline models

#

And to answer your question on a business view : Your acceptable error depends on what the client is willing to lose

#

you should never aim to reach 0 RMSE anyway, since that means you've learnt the noise in the data too. Usually the main way to tell if your model is good is whether your training RMSE is close to your validation RMSE

#

i'd rather have a model with 0.11 TRAIN & VAL RMSE than a model with 0.0001 train RMSE and 0.10 VAL RMSE
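A minimal sketch of the baseline comparison described above, assuming NumPy and scikit-learn are installed. The synthetic data, the `rmse` helper and the choice of `LinearRegression` are illustrative assumptions, not something from the original question:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic "sales" data (hypothetical): three features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 1000 * X[:, 0] + 500 * X[:, 1] + rng.normal(scale=200, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_val_rmse = rmse(y_val, baseline.predict(X_val))
model_train_rmse = rmse(y_train, model.predict(X_train))
model_val_rmse = rmse(y_val, model.predict(X_val))

print(f"baseline val RMSE: {baseline_val_rmse:.1f}")
print(f"model train RMSE:  {model_train_rmse:.1f}")
print(f"model val RMSE:    {model_val_rmse:.1f}")
```

An RMSE only means something relative to the baseline and to the scale of the target; a large gap between the train and validation numbers is the overfitting signal mentioned above.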

uneven radish
#

Thank you

deft fox
urban hemlock
dim quiver
#

Hi guys, I am studying machine learning and suddenly I got something in my mind,
Recently i learned and did some statistics problems basically hypothesis testing and stuff but I can't visualize their implementation in industry.
I heard that statistics is the base of all AI and ML but it's hard for me to solve problems using statistics and I just know theory of it. Does anyone know of good use cases or some books which can help me with it.
Much appreciated

torn jasper
#

Hello everyone, I trust this message finds you well. I am seeking advice on whether pursuing an online Master's degree in Data Science from Coursera is a prudent choice. Furthermore, do you have any recommendations concerning funding options for pursuing a Master's degree?

deft fox
celest sphinx
#

What does Kaggle mean btw?

#

just a random word that sounds nice?

#

chatGPT said

#

"The name is a play on the Japanese word "kaggle" (pronounced "kah-glay"), which means a group of people who come together to learn and collaborate."

chilly bison
#

Hello everyone. I have a problem. I have been using one notebook for some time and it worked just fine until a few days ago, when I started to get the following error:

CUDA error: CUDA driver version is insufficient for CUDA runtime version
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Interestingly enough, my friend was using the same notebook, and for two days it worked fine for him while mine was already failing. Yesterday he started to get this error too. Could anyone please suggest a solution?

celest sphinx
#

haha thanks

blazing salmon
#

Plz share the list of all ML algorithms that are affected by correlated features.

green kernel
#

Can anyone tell me what I'm doing wrong? I want to access a json endpoint from the chess.com API using python. But I'm getting error code 403 again and again. This is my code:

import requests

api_url = "https://api.chess.com/pub/player/jitesh117"  

try:
    response = requests.get(api_url)

    if response.status_code == 200:
        data = response.json()

        print("API Response:")
        print(data)
    else:
        print(f"Request failed with status code: {response.status_code}")

except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
urban hemlock
timid harbor
#

Hello ,I am new to data science and I want a little guidance on the binary classification with software defects

graceful axle
#

Why do people ask for a thumbs up on their projects?

deft fox
graceful axle
#

Are transformers = attention models?

outer apex
#

anyone can help?

storm slate
#

hi, I am a beginner. How do you tune your model to make it perfect, for example when to add a dense layer or a dropout? Is there a course/book for that?

sullen pawn
sullen pawn
# outer apex anyone can help?

Missingpy is quite old; your scikit-learn version has to match the one missingpy requires for it to keep working (although I recommend looking for an alternative package to missingpy)

sullen pawn
cinder gulch
#

hello everyone. I'm from Vietnam and want to find a mentor for data science especially coding. Can someone help me please? thank you so much.

fair hawk
graceful axle
#

Have there been any competitions that nobody has been able to win?

arctic marten
#

Hello Everyone, Please I need help here. I am trying to forecast using Prophet. The first Screenshot is my dataframe but after creating my first model, the shape of my data reduced after i had added that i am making a prediction for 365 days (second screenshot). Please what is really going on? I should be expecting 3203 + 365 rows

stoic bear
#

Why AI won't replace data scientist/analyst

obsidian pulsar
graceful axle
magic timber
#

I was trying to plot a Plotly Express graph; on the x axis it was the day number and on the y axis a cumulative sum column. I made an average line chart for it with a different color, but it stays in the back. How do I bring it to the top?

woven topaz
icy wind
#

Hello Everyone, can anyone walk me through the multidimensional data vectorization without creating a change in the data shape . I have been experiencing some challenges in this regard

daring hedge
graceful axle
#

I was trying that out

#

It says that the exercises are graded

#

But it only shows the questions

#

How do I get the gradings?

#

I want to know if my answers are what they are supposed to be

#

Or do they just refer to the little circle on the side?

copper minnow
icy wind
agile stag
#

i have some questions abt regression and datasets shapes but i don't know if it's the right channel

shrewd scarab
agile stag
#

yeah i managed to talk with some of you guys already ! so basically my problem is the following one:

  • I have 2 datasets: one contains 3072 animals with 875 columns, which are the bacteria inside each animal, and the 2nd one is a prediction dataframe with 840 animals and 6 attributes.

  • Objective: predict the weight (real-valued), another variable (real-valued), and finally both at the same time.

Problems: as you saw, my dataframes have different shapes, so a lot of regression models don't work with them, and I don't know whether I should just take the first 840 animals in my 1st dataframe, because I don't know if they are the labelled ones.

Solution: I first tried to predict the weight, which is a small real number. It was a disaster: negative score, an MSE of 2 billion, etc. So I transformed the weight into classes 0/1 (is fat or not) and now I'm using classification models, but apart from that I wouldn't know how to fix my previous problem.

2nd problem: I have a pretty good accuracy (0.90) using KNN, but it doesn't tell me which bacteria cause the animal to be fat or not. I'm thinking of trying Naive Bayes or Logistic Regression to compare results.

Do you guys have any ideas on how to see the weight of the different attributes that lead to the classification?

heavy fractalBOT
#
a_himitsu has been warned

Reason: Duplicated text

agile stag
#

Ok problem of size fixed we managed to get a new dataset with correct sizes , now remains the question of knowing the importance of parameters in the choice of the classifier

shrewd scarab
agile stag
shrewd scarab
agile stag
# shrewd scarab Not to my knowledge, you would need to write the permutation importance code you...

Alr alr thanks for the answer! I'll take 5 more minutes of your time with a last question: I am currently predicting one attribute, then I'll need to predict a second one, and finally both at the same time. Which models can predict several targets at once? (I'll probably predict a binary value, and the 2nd one is actually real-valued but I'll probably convert it into classes from 1 to 10.)

shrewd scarab
agile stag
fast plinth
#

Hi, Would anyone like to team up with me for the competition?

topaz vigil
#

Hi guys, I am going through the Python tutorial and doing the exercise for the arithmetic and variables section, and I cannot figure out what I'm doing wrong. Any advice?

deft fox
lone perch
#

I have a dataframe where some of the items are lists. Normally the tables I've seen just have single items like "Age" where you just have one integer. How do I do EDA in this case?

celest sphinx
idle bobcat
hearty oriole
lone perch
lone perch
lone perch
zealous creek
shrewd scarab
zealous creek
# shrewd scarab My bad, I thought I read somewhere that it doesn't work for tree-based models.
rustic salmon
#

Could someone recommend a video tutorial on creating project presentation recordings? I'm looking for guidance on the process. Your assistance is much appreciated.

lean kraken
#

anyone who uses rx 6650 xt i need some help?

deft fox
rustic salmon
#

okay thanks

finite crescent
#

Good evening everyone, I am Hassan passionate about learning data science as a beginner what resources and site to start from baby step? Happy to be here to learn unlearn and relearn.

proper solar
#

when people split the training data into training and validation data, do they often do another pass at the end where they use all the possible training data before submitting? I imagine having a validation split is only useful for deciding if what youre doing improves the model or not

main mango
wide fiber
#

Hi, is it possible to pull a private notebook?

thorny kelp
clear compass
#

I want to work together with a friend. What is a good place to share data?

jade geyser
clear compass
#

i want to do the titanic competition just a place to put ideas and add data

jade geyser
# clear compass ^

okii....for the titanic the data is available to be downloaded on the competition page right?

clear compass
jade geyser
daring hedge
thorny kelp
cunning thunder
#

If the data is extremely skewed to one side and the boxplot shows a lot of outliers, are they really outliers, as in this data? It just seems I can't really consider these as outliers. Is the boxplot not a good enough test for outliers?

deft fox
daring hedge
heavy fractalBOT
#
km2468 has been warned

Reason: Duplicated text

balmy tundra
#

I'm not sure how to proceed with my question now, I was wondering if the latest keras-core is supported

verbal crest
#

@balmy tundra Apologies, I think you hit a false positive on our auto-mod tool. Try asking your question again, I've updated the settings.

thorny kelp
balmy tundra
deft fox
proper solar
#

Are all the for-medal competitions the ones that have money prizes right now... there are several pages of them, but I assume that's correct?

velvet slate
#

Hii
Forgive me for this stupid question but
How do I participate in a kaggle competition 🙂
I haven't participated in any competitions before this, and I want to partake in the ai text detection competition

#

More specifically, I want to know how to register myself, how to submit my model, and so on.

exotic girder
#

Would there be a 'Kaggle DS & ML' survey-competition this year?

balmy tundra
#

What does optimal training data for an ML trading model look like? Would it consist of just OHLCV values? Wouldn't you have to train the model on data that shows profit, since that is the ultimate goal of traders? How would you express that profit in a training set?

deft fox
# velvet slate Hii Forgive me for this stupid question but How do I participate in a kaggle com...

Start by going to a competition of interest, and click on the button "Join Competition" - it is on the right side. If you are logged on with your Kaggle account, it will take you to a page to read and accept the conditions. Then you go to the data section to download the data. Finally, there are discussions and code sections where you can find and re-use the code others have shared, or ask questions.

sage mason
#

hello all, I have recently joined my first Kaggle competition, and the rules of that competition say "Internet access disabled". Does that mean I can't import external libraries?

onyx storm
#

Hi, everyone

#

I am a fresh graduate who knows Python Data Structures and started working in a company with SQL and a little bit pyspark on JupyterHub. Wanted to have a guide to Kaggle how to start participation in contest and learn.

red hawk
west galleon
#

Hi everyone,
I need some help with getting started. I want to work on the Detect AI-generated Text competition but I'm not sure how to get started, since I've never really worked on a project in Kaggle or participated in a competition.
I'm hoping to participate to learn as I go. I had to start somewhere so I chose this.
Any advice would be appreciated.
Thank you.

rare jetty
#

There are NaN values, and I need help to remove/replace them.

lone compass
lone compass
muted talon
#

Are there any good resources from past competitions regarding heuristics or rules of thumb for large image resizing in CV, or are things like https://arxiv.org/abs/2103.09950v1 actually used?

deft fox
# rare jetty

@rare jetty To drop NaN values simply use df.dropna function in pandas. There are many ways to fill in missing values, as that is a non-trivial subject. The simplest way is to fill in mean or median values per column, but I suggest that you go to Kaggle and search for "missing values." There will be many notebooks showing ways this can be done.
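As a rough sketch of both options (dropping vs. filling), assuming pandas and NumPy are available; the toy DataFrame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 41.0],
        "fare": [7.25, 71.28, np.nan, 8.05],
        "city": ["A", "B", None, "A"],
    }
)

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric columns with the column median and the
# categorical column with its most frequent value.
filled = df.copy()
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)
```

Dropping is simplest but throws away rows; median/mode filling keeps them at the cost of slightly distorting the column distributions, which is why the choice is worth checking against validation scores.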

rare jetty
#

Thanks everyone.
I have filled the NaN values with the help of ChatGPT

simple night
#

Hi!

I've got a question about what would currently be the best deep learning architecture for analyzing features of Raman spectra. Has anyone worked with this type of data?

It is a 1-dimensional (vector) image in which all positions can present some information. CNNs and ResNets may be a nice option, what do you guys think? What about visual transformers or other architectures?

Thanks!

light socket
shut ginkgo
#

hello,
I am working on an AI project and there seem to be many null values in the dataset.
Would you advise me to go with fillna or dropna?
Also, if I use fillna and fill in average or random values, wouldn't it affect the dataset?

And since the project deals with healthcare, would there be a huge effect if I add in average values?

shut ginkgo
#

done

zealous creek
# shut ginkgo done

The same question was asked in general. So just scroll up a bit and check out the discussion.

dusk jungle
#

I'm not sure if this is the right place to ask, but I used to be able to edit my published notebooks without having to rerun the code cells. However, when I click on "edit" to do some changes to my markdown cells, I need to rerun all code cells to view their outputs. Does anybody know a way around this?

inner cape
#

Hello everyone, I would like to ask where we can buy a dataset of pornographic text, we need to train chatbots

desert tusk
muted talon
heavy fractalBOT
#
tilii7 has been warned

Reason: Bad word usage

deft fox
#

@verbal crest This bot may be a bit too sensitive. I got warned (and my message deleted) for using a word p_rn, which happens to be what we are legitimately discussing here.

deft fox
verbal crest
#

@deft fox Sorry about that, we've got the word on the warn list just because it's one of the most commonly used words by spam bots.

urban hemlock
#

How would you train a CNN model to identify if a picture is something and not something? For example, you're training a model to identify if the image is a dog or not a dog. This is different from training the model to identify if it is a dog or a cat. The "Not a Dog" could be anything such as buildings, other animals, colors, etc. Any ideas?

If you have an idea in mind, please reply to this message. Thank you!

muted talon
desert tusk
#

Can I win a Kaggle competition without my own GPU?

shrewd scarab
deft fox
# desert tusk Can I win a Kaggle competition without my own GPU?

I am going to play the odds here and say no. Most people don't win Kaggle competitions, with or without GPUs. But you can compete and do well without owning a GPU, because all Kagglers get 30 hours per week of free GPU time. There are also many competitions where GPU is not needed.

desert tusk
desert tusk
#

One more question: why do boosting algorithms perform better than random forest on Kaggle? (I don't see any winning solutions with random forest, only XGBoost.)

acoustic flame
desert tusk
junior tapir
#

is there any way I can put Kaggle in dark mode?

acoustic flame
#

also let me know if you have anymore questions

graceful axle
#

I have a question regarding "Progression System",
does a Silver medal, count as Bronze as well?
So if I get 2 Silver medals, will I become a Competitions Expert? Or do I need to achieve exactly 2 Bronze medals ⁉️

deft fox
graceful axle
lone token
# desert tusk One more question: why do boosting algorithms perform better than random forest on K...

There are many reasons that winning solutions use XGBoost. One reason you might be overlooking is that the XGBoost implementation comes with a lot of optimization that goes beyond "boosting": efficient memory use and computation, flexible objectives and learning-control parameters, robust default parameters, etc. These factors play a more important role than theoretical soundness in time-constrained competitions.

stoic bear
#

Does anyone have experience generating MCQ questions using NLP? I mean, how should I approach the problem?

#

multiple choice question more than one correct

pulsar sparrow
mystic harness
#

is it possible to edit a post to make it my team's solution?

crystal maple
#

The training accuracy is damn high but the testing accuracy is low

#

like 80 / 50

deft fox
# crystal maple https://www.kaggle.com/code/ayeshairshadcoder/big-mart-sales-prediction i dont ...

Your model is overfitting the data. There could be many reasons as I didn't go through every single line, but for sure you won't get the best performance by using any regressor with default values such as in this line regressor = XGBRegressor(). Regressor parameters need to be tuned, and doing cross-validation would help with that. Also, those numbers are r2 scores rather than accuracy. Accuracy is a classification metric.
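A hedged sketch of what tuning with cross-validation could look like. To keep it dependency-light it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` (an assumption; `XGBRegressor` follows the scikit-learn estimator interface, so the same pattern should apply). The synthetic data and the small parameter grid are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the real dataset.
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=15.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A deliberately small grid; real searches usually cover more values.
param_grid = {
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}

# 5-fold cross-validation, scored with R^2 (a regression metric,
# unlike accuracy, which only applies to classification).
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

cv_r2 = search.best_score_
test_r2 = search.score(X_test, y_test)

print("best params:", search.best_params_)
print(f"mean CV R^2: {cv_r2:.3f}")
print(f"test R^2:    {test_r2:.3f}")
```

If the cross-validated score and the held-out score land close together, the tuned model is less likely to be overfitting than one whose train score is far above its test score.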

copper whale
#

Why can't I use any AI model such as Mistral 7b? It allocates ram infinitely until it crashes the container.

#

How can I use an AI model in kaggle?

crystal maple
desert tusk
red hawk
copper whale
zealous creek
# desert tusk I don't understand why trying to converge the gradient is better than bagging. A...

You are correct that the trees are independent in a Random Forest. But just because the assumption does not hold in Gradient Boosting, it does not mean that Gradient Boosting performs worse. It's quite the opposite: the assumption was relaxed for a reason. If the trees are not independent but instead learn from the mistakes of the previous trees, the same predictive power can be achieved faster and with fewer trees. XGBoost went one step beyond plain Gradient Boosting because it adds L1 and L2 regularization to help prevent overfitting. This is how tree-based algorithms evolved: RF => GB => XGB. So it is not a surprise that most winning models are based on XGB.

desert tusk
dusky nacelle
#

I am trying to get this question answered: if I upgrade to Google Cloud AI Platform Notebooks, can I also submit that notebook to a competition? Basically bypassing the runtime cap of 11 hours, for instance, or halving the training time because I am paying?

verbal crest
deft fox
dusky nacelle
deft fox
graceful flax
#

Why is the NFL competition not accepting responses?

mystic bolt
#

guys, is there any book for beginners in data science?

sly jolt
#

Hi, when awarding medals.. Does Kaggle also consider how old a notebook is? Like my notebook is 5 months old with 27 upvotes (22 non novice), yet it hasnt got silver medal

sly jolt
atomic tapir
#

hi i was wondering if i could use ngrok to open a tunnel in order to remotely collect training data, i am taking part in the UBC Ovarian Cancer Subtype Classification and Outlier Detection competition
im passing this data to another computer with wandb...

sweet jasper
#

Hello everyone!!! I am participating in a competition where it states that "Freely & publicly available external data is allowed, including pre-trained models" (so I understand I can use huggingface and other services) but it also states "Internet access disabled" for the notebook, so what do I do? Do I have to download the model?

deft fox
hollow grail
#

for non internet notebook competitions, if I add a model using the sidebar while editing a notebook, it should be available when running on the hidden set right?

summer crypt
#

hey! does anyone know why this training loop might not be updating gradients correctly:

for epoch in range(num_epochs):
    epoch_list.append(epoch+1)

    model.train()
    train_loss = 0.0

    for images, depths in tqdm(train_loader):
        images = images.to(device)
        depths = depths.to(device)

        outputs = model(images)

        loss = depth_loss(outputs, depths)

        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    train_loss /= len(train_loader)

    model.eval()
    val_loss = 0.0

    with torch.no_grad():
        for images, depths in tqdm(val_loader):
            images = images.to(device)
            depths = depths.to(device)

            outputs = model(images)

            loss = depth_loss(outputs, depths)
            val_loss += loss.item()

    val_loss /= len(val_loader)

    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

    train_losses.append(train_loss)
    val_losses.append(val_loss)
fickle perch
# summer crypt hey! does anyone know why this training loop might not be updating gradients cor...

So, in your training loop, it looks like you missed adding optimizer.zero_grad(). This is super important in PyTorch because, without it, your gradients start to pile up across different batches of data. Think of it like this: each time you pass a batch through your model and compute the loss, PyTorch calculates how much it needs to adjust the weights (that's the gradients). But if you don't reset these gradients to zero before the next batch, you're not just adjusting based on the new data, but you're also including adjustments from the previous data. It's like trying to fix a recipe but you're also considering the ingredients from your last cooking session. Not ideal, right?

So, just pop optimizer.zero_grad() right at the start of your training loop, just before you feed the images and depths into your model. This will make sure that each batch's weight adjustments are made cleanly, based only on that batch's data. That should fix the gradient updating issue and get your training back on track! 🚀👍
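For reference, a minimal sketch of where the call goes, assuming PyTorch is installed. The tiny linear model, random data and `nn.MSELoss` are stand-ins (assumptions) for the original `model`, `train_loader` and `depth_loss`, which aren't shown in full:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for the real model, loss and data loader.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

X = torch.randn(64, 4)
y = X @ torch.tensor([[1.0], [-2.0], [0.5], [3.0]])
batches = [(X[i:i + 16], y[i:i + 16]) for i in range(0, 64, 16)]

epoch_losses = []
for epoch in range(20):
    model.train()
    running_loss = 0.0

    for inputs, targets in batches:
        # Reset gradients left over from the previous batch;
        # otherwise backward() keeps adding to them.
        optimizer.zero_grad()

        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    epoch_losses.append(running_loss / len(batches))

print(f"first epoch loss: {epoch_losses[0]:.4f}")
print(f"last epoch loss:  {epoch_losses[-1]:.4f}")
```

Calling `optimizer.zero_grad()` right after `optimizer.step()` instead works just as well; what matters is that gradients are cleared once per batch before the next `backward()`.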

ornate ravine
summer crypt
summer crypt
void notch
#

Hello Everyone, I downloaded the dataset: automotive vehicles engine health dataset. However, I'm facing a lot of issues with the data. I'm not getting an accuracy greater than 64-67% for multiple models. I used RF, DT and MLP models. I'm focusing on the MLP, honestly.
I've fixed the class imbalance issue, did some feature engineering and handled the outliers as well.
Is the issue from my side or from the data itself? Any suggestion on what should I do?

haughty oriole
#

Hello guys, my model keeps improving in validation score but keeps getting worse in Kaggle score. Can you guys give me some advice on this?

acoustic flame
#

your dataset might have black cats and

#

models learn only black cats are cats

#

which is problematic ofc

#

try using cross validation to get a better estimate of the error

#

and use some regularization methods

barren phoenix
#

If I submitted before the deadline, but by the time the notebook has run and been scored the deadline has passed, will it still be considered?

finite galleon
#

My notebook cell freezes with the asterisk sign when I try to run it in Kaggle. Is this normal or there is some way to solve this problem. I restarted the kernel once already but the same issue happened.

deft fox
red hawk
# void notch Hello Everyone, I downloaded the dataset: automotive vehicles engine health datase...

try a boosted tree algo like xgb or lightgbm. Tbh for a tabular dataset I won't go near an MLP (they're finicky to train correctly for tabular data). And unless your data is seriously unbalanced (i.e. one class is under ~10%), I won't bother with class balancing either (mostly, except for the really unbalanced case, fixing it makes my model worse, not better). Other than that, you are the one who has explored the data, so you are better placed to answer whether there are 'issues' with the data...

deft fox
# red hawk try a boosted tree algo like xgb or lightgbm. Tbh for a tabular dataset I won't ...

I second everything @red hawk said. It may be worth trying this neural network https://github.com/dreamquark-ai/tabnet . It won't necessarily produce better results than GBMs (it might on small datasets), but it typically produces different predictions than other methods. Thus, it ensembles well with other models.

GitHub

PyTorch implementation of TabNet paper: https://arxiv.org/pdf/1908.07442.pdf - GitHub - dreamquark-ai/tabnet

slate pulsar
graceful axle
#

hi everybody,

one of my submission notebooks has now been running for more than 2 hours, which is not normal. Is this a bug in Kaggle, or is something wrong with my notebook? Any idea?

deft fox
graceful axle
graceful axle
#

oserror: [e053] could not read config file from c:\users\….
I am facing this error, kindly help me.

deft fox
graceful axle
#

oserror: [e053] could not read config file from c:\users\nandini agarwal\appdata\local\programs\python\python311\lib\site-packages\pyresparser\config.cfg

deft fox
deft fox
graceful axle
#

Yes

deft fox
slim parrot
main mango
austere pier
#

Hi everyone,
we are doing a project based on anomaly detection through video surveillance. Our project is used mainly in sports stadiums to detect anomalies such as assault, explosion, fighting among fans etc. The surveillance video is captured by autonomously repositioning slave robots through cameras. These robots then check for the anomalies. If an anomaly is found, it sends the video footage to a central server for anomaly classification. We want an unsupervised model which takes videos as inputs. It also learns from the live video it detects during deployment.
Can anyone suggest a model to be used at the slave robot cameras?

cerulean coral
#

hey! I've started the Python course and the first exercise shows me this error. How can I fix this?

#

I got it, I clicked "Run All" and it works

cunning iron
#

Hi everyone, i'm currently participating in the SenNet competition and to send a submission you need to turn off the internet access of the notebook. Since i'm using the segmentation-models-pytorch library i had to upload the output of !pip download segmentation-models-pytorch as a kaggle dataset. But i get the following error... Anyone had a similar issue while trying to install a package without internet ?

acoustic flame
livid bison
#

Hey, when there is an error in the submission, is there any way to check what went wrong? If not, how do people usually debug it? I am not sure why my submission is generating an error

heavy granite
#

Is it normal for the icon of the .json file saved in kaggle to be marked as {i}? If not, how to solve it?

hidden dome
#

Is there a channel for the optiver-trading-at-the-close competition?

atomic quarry
#

Hello everyone, I have a college work similar to kaggle competitions, where we're required to submit our predictions of an "Xtest" and then we'll be evaluated based on how our model performs. For the training phase, I'm leaning towards using cross validation and evaluating my model based on its results, however, a friend is doing a split of the "training" data into training / validation / test, and they say that it gives more correct metrics. My stance is that doing a further split reduces the amount of data we're using to train, and we also risk having false confidence in our model. Am I correct on this matter? And, what's the general rule to follow in the scenario of competitions?

slate citrus
#

Hello everyone, i am working on a dataset and it has a categorical column with 31 unique values in it, if i perform onehotencoding on it i will have 31 extra columns, so is this correct way to do it or is there any better way to do it

dusk jungle
dusk jungle
dusk jungle
atomic quarry
#

I see, thank you.

deft fox
# atomic quarry Hello everyone, I have a college work similar to kaggle competitions, where we'r...

Generally speaking, you are correct. Your friend might have picked a fortuitously good data split that gives better metrics on test data, but it doesn't mean the model is better. If you do 5-fold cross-validation on the train data and determine the CV score, in practice it should match the test score better, unless you got a really poor train/test split. On the same train/test split, where you use your train data for cross-validation and your friend additionally splits it into train and validation, your approach should get better agreement between CV and test scores in a large majority of cases. Not always, though. That's not a deficiency of the method, but rather the luck of the draw in the splitting process.
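A small sketch of the difference, assuming scikit-learn and NumPy are installed; the synthetic dataset and the logistic regression model are illustrative assumptions. It compares one 5-fold CV estimate with several single hold-out splits that differ only in the random seed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the provided "training" data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: every sample is used for validation once.
cv_scores = cross_val_score(model, X, y, cv=5)

# Single hold-out splits: the score depends on which rows land in
# the validation part, so it moves around with the seed.
holdout_scores = []
for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_val, y_val))

print(f"CV mean accuracy: {np.mean(cv_scores):.3f} (std {np.std(cv_scores):.3f})")
print("hold-out accuracies by seed:", np.round(holdout_scores, 3))
```

The spread of the hold-out scores across seeds is the "luck of the split" described above; averaging over folds is what makes the CV estimate steadier.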

slate citrus
#

Hello everyone, a question: a deeper tree can fit the training data better, but why can it also lead to overfitting?

vapid dirge
tender stratus
#

Sorry to ask it here, but I am unable to link my kaggle account with discord, what do I do?

tender stratus
placid hamlet
#

hello everyone can someone suggest laptop hardware for machine learning? I'm new to it, so I don't know what kind of hardware it needs. Please guide me.

austere horizon
#

Hi

#

My account has been blocked. Why?

craggy zephyr
#

I am learning Machine Learning/Deep Learning on Coursera and I also know some Python basics.
I am currently working on my BS Final Year Project named Data Driven Strategy for Load Forecasting of Power Systems.

I want to join a team or wanna work with some experts to learn
Please count me in.

red hawk
austere horizon
deft fox
dusk jungle
deft fox
# dusk jungle It would be interesting to know how much you're willing to spend to give you a m...

I agree that for most machine learning tasks that don't involve deep learning, almost any modern computer with multiple CPUs will do. But still, this comment, "But overall, an Intel Core i5 and 8GB RAM is enough for most tasks," I think is off the mark. Hardly anything these days can be done with 8 GB of memory, as the operating system will take a good chunk of that memory, unless one is planning to use Linux exclusively. I don't think it is worth saving $150-200 on memory, and I strongly recommend at least 16 GB of RAM. A GPU is a must for deep learning applications, but that will make a laptop expensive. I think the suggested Kaggle GPU solution is a good option.

dusk jungle
#

Sure! I was considering the minimum requirements, but indeed at least 16GB of RAM would be much better. I have been using a computer with 8GB and recently upgraded to a 16GB one for better performance.

red hawk
# placid hamlet hello everyone can someone suggest laptop hardware for machine learning? I'm new...

to add to the other recommendations, I would say that if you don't have portability requirements, ie you don't plan to carry the computer around much, it is much better value for money to get a desktop. And definitely get at least 16GB. You don't need a GPU but it can be quite convenient to have one that you don't need to be worried about turning off. (and if you are into video games, you might as well accomplish two things at once by getting a decent nvidia gpu )

graceful axle
#

Hi! I joined a competition on Kaggle and they shared a customised python package along with the .csv files. I am using Windows and the package file is .SO which is only for Linux. Does anyone know how I can solve this issue? Right now, I cannot run the package since it doesn't recognize the extension.

plain copper
#

I have a question, best answered soon if possible

#

Is it legal to obtain someone's health data to build an ML project, and to do so without any doctor's or government consent? As we know, many health datasets are contributed by hospitals and medical researchers, but is it legal for such data to be collected by students without proper knowledge of the field?

cosmic mortar
#

Hi everybody... I am looking for someone to team up with for a Kaggle competition, to learn and share knowledge while working on a project. If interested, please reply.

craggy zephyr
#

The ML course by Andrew Ng has a few assignments. I can't solve the Practice Assignment of Week#2. Can you help me?

tropic copper
#

What sort of development environment should I be using as someone relatively new to all of this? Right now I'm just writing Python code in notepad++ and running via cmd line. Would it be more efficient/better in some way for me to use an IDE or some other tool instead?

verbal crest
#

@tropic copper Jupyter Notebooks are very popular for data science. Kaggle's notebook editor (or Colab) is an online version of that style of IDE, but you can also set it up to use locally.

tropic copper
#

Thanks for the info! I've used those a lot in Coursera courses. What makes them so popular?

verbal crest
#

@tropic copper I think the ability to interweave code and output back and forth (and to go back and edit previous steps when needed) is all very handy when doing data science exploration.

placid hamlet
#

Is an NVIDIA GTX 1650 sufficient for entry-level deep learning tasks?

hushed crescent
#

Hello 👋
I'm participating in my first Kaggle competition and I'm blocked at the submission level.
My submission notebook crashes and I'm trying to figure out exactly how it works to make sure I'm not doing it wrong.

Do you know if the submission notebook runs have access to the internet? My first notebook cell is a pip install and if the notebooks do not have access to the internet it would explain the failure :/

sharp pawn
#

Hey all! I'm fairly new to CS as a whole and was wondering if there are some prerequisites I should know before attempting Kaggle comps? Projects are the best for learning, I'm told! But I also know I am fairly inexperienced and I will not learn much if it is too hard.

deft fox
timid forge
#

Hi all, I need help with best practices. I'm working on a project where I need to build a couple of tables. If I keep the most granular data in one table, it would create duplicates, so is it better to have 3 different tables, each with a primary key on one column, or one table where the primary key is a combination of multiple columns?

junior cave
#

Hey Folks! I'm just starting on my ML adventure and I've got a question. I've created a simple one layer neural network to solve the Titanic Challenge. I've set it up such that when training is done I export the weights to disk so they can be reused. Training appears to be working pretty well. However, when I start the network with the trained weights and train some more, the network starts in a state with a bit more loss than when training ended on the previous run. I would think that I would start at the point at which training ended. Does anyone know why this would be the case? Here's a link to my (very ROUGH/experimental) project in case it is helpful - https://github.com/chuckfinca/kaggle_titanic_competition

GitHub

Contribute to chuckfinca/kaggle_titanic_competition development by creating an account on GitHub.

craggy zephyr
#

ML with Andrew Ng, Course#1, Week#2, Practice Lab Assignment
I am facing an issue regarding the assignment mentioned above.
After submitting, I receive an error. " Comment line with index: UNQ_C1 wasn’t found in code"
Can someone help me with this?

craggy zephyr
#

ML with Andrew Ng, Course#1, Week#2, Practice Lab Assignment

I am facing an issue regarding the assignment mentioned above. After submitting, I receive an error. " Comment line with index: UNQ_C1 wasn’t found in code" Can someone help me with this?
Link: https://lnkd.in/daynpUSe

sweet ice
#

Hi Kagglers, I hope everyone is doing well. I want some advice or help with getting a job in machine learning and/or data science. Feel free to DM me.

copper whale
#

Can Kaggle resources support multithreading for Mistral 7B? I want to build a dataset and I need an AI to help me do that. ChatGPT has rate limits. The thing is that going one prompt at a time is very slow, and I'm wondering if it is possible to multithread (i.e., have the model generate text for multiple prompts at once).

austere horizon
#

Can anyone help me?

#

My old account has been blocked, please help me.

severe rune
#

Code:

tuned_model = "codegood/HF_AWS_Mistral_SC"

trainer.model.config.save_pretrained(tuned_model + "config")
trainer.save_model(tuned_model)
torch.save(model.state_dict(), "/kaggle/working/HF_AWS_Mistral_SC/Mistral_torch_model.bin")
trainer.push_to_hub(tuned_model_SC)
print("Model saved to Huggingface")

Error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

I'm trying to load and retrain my fine tuned model. I'm able to load the model, but get the above error during trainer.train(). Not able to figure out the problem.

Also, how do I upload the .bin model to Hugging Face directly?

little kelp
#

Hi everyone, is it possible to delete a submission from the leaderboard, or just simply hide my score?

tame bough
#

Hey everyone! I have a project which I would love to share on public platforms like Kaggle and GitHub. However, I cannot share it without anonymizing it. Any idea how to anonymize large amounts of data in Excel, so that Power BI would show anonymized names?

echo latch
#

Hey, I am new to Kaggle and I have a question.
Why use a Jupyter notebook with cells?
Why not use an IDE and simply type the code (without cells), instead of putting each function or line of code in its own cell?

shrewd scarab
verbal crest
#

@echo latch In the Kaggle editor you can swap from a notebook to a 'script' which is just a single python file if you'd prefer to work that way. It's really just a matter of personal preference.

velvet plume
#

Would anyone in the data science field with at least 1-2 years of experience be willing to participate in an interview for a school project (I'm considering pursuing a career in data science) ? The questions are below (Feel free to dm me or respond within the channel), Thanks for your time!

What caused you to gain interest in data science, and how did you enter the field?

Can you describe a typical day or week in your role ?

What types of projects do you typically work on ?

What programming languages, tools, and libs are most essential for you ?

Can you share an example of a challenging problem you've faced in a project and how you went about solving it?

How do you stay updated with the latest developments and trends in the field ?

What are some common misconceptions about the field of data science, and how would you address them?

What advice do you have for someone just starting their career in data science ?

tropic copper
#

@echo latch I am pretty new and I just started using a locally run Jupyter notebook. It is SO much more convenient and easy than Notepad++ and running via CMD prompt. It helps me chunk the program into easily digestible sections that I can move around, reposition, etc.

I've also found a lot of time can be saved with Jupyter because I can quickly add or remove "debugging" points, e.g if I want to see what the output looks like after I remove or edit a particular block of code, it seems much easier and quicker to do this in Jupyter.

gilded flame
#

Dear community,

this semester I started a course where we learn to code a kind of AI in Python. To me it looks more like machine learning. But nvm.

So the task for the final exam is to code a program which can answer questions about a data set:

_

‒ Two individual / unique research questions per student are required
Procedure:
‒ Students search themselves for large and relevant data sets
‒ Students define two questions that should be answered for the selected data sets
‒ Lecturers check the data sets and the corresponding questions (such that problems are difficult enough but not too difficult)
‒ Implementation of the solutions by the students on their own systems using the presented libraries and methods of the lecture

The resulting program code (as *.py) and the corresponding program execution need to be analyzed regarding the run-time behavior. Which parameters influence the run-time in which way?
‒ Code analysis and run-time behavior evaluation need to be executed per research question.

The methods we learn and shall use:

Data and data preparation (Pandas etc.)
Classification I (Support Vector Machines)
Classification II (Decision trees, Random forests)
Clustering (kmeans, DBScan)
Testing and quality assurance (run time analysis)
Dimension reduction, anomaly detection
Neural Networks
Training deep learning networks
Pipelines and MLOps

As I'm doing all the homework, I don't think that the coding part is my challenge.

My problem is defining two questions which would fit the requirements. Can you give me examples? The questions should not be answerable by statistics alone.

For example, I chose this dataset:
https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset/discussion
=> But I don't know which questions I could define for it that can be solved by the methods above.

I am also open for new datasets.

Thanks in advance!

cursive vigil
shrewd scarab
# cursive vigil ## Best AI ML DL DS Roadmap Hi! What is the best complete roadmap for AI, ML, D...

The best roadmap for any of those would change at least somewhat, since they are all slightly different. Many people get hung up on thinking about the best possible path, but the most important thing you can do is to start on a path. If you are looking for a great starting place (since you have some experience), I would recommend doing some analysis on Kaggle datasets using pandas, matplotlib, etc.

void notch
#

Hello!!
How can I improve the accuracy? I'm using an MLP model of only Dense layers.
How is it possible to remove these crazy spikes?

I have tried the following:
1- Early stopping
2- Reducing model complexity
3- Reducing the LR
4- Dropout layers and Batch Normalization
5- Gaussian noise layer
6- Fixing the issues with the dataset

Your help is much appreciated!

deft fox
void notch
deft fox
void notch
next prairie
#

Hey everyone!

Just finished “Machine Learning Specialization” on Coursera, from Andrew Ng. Excited to dive deeper into the field!

Any recommendations on what I should learn next or any valuable resources you could suggest? Your insights would be greatly appreciated!

night gorge
deft fox
#

In most competitions you are supposed to submit probabilities, not binary predictions. I suggest you delete your In: 12 cell and in the following cell use output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': predictions.flatten()})

night gorge
#

ok let me try... btw is there any way to check the score without submitting to the competition?

#

right now, in order to know the score, I'm just submitting and checking the public score

vague folio
#

I need help learning more about feature engineering. Please point me to some resources so that I can grasp the key concepts.

mossy delta
austere pier
#

Hi.
Can anyone suggest a good unsupervised learning method for anomaly detection (like assault, robbery, vandalism, etc.)?

quick mirage
tough panther
#

Hello guys can someone provide an explanation for this?

quick mirage
# tough panther Hello guys can someone provide an explanation for this?

Hi Alex, from the learning curve displayed, it seems that the algorithm hasn't learned the target function: the training error is high and increasing. Since bias describes the ability of the learning algorithm to approximate the target function, and, according to the question, the test error is unacceptable, the model isn't approximating the function well. The model also has little generalization error ("variance"), since as the number of data points increases, the training error and the test error come closer to each other (at the high error value that is unacceptable). I recommend this video for understanding the curve better: https://youtu.be/zrEyxfl2-a8?si=k4DdOTt0TM72kagH. It's from a great course, btw, that helped me a lot while I was studying machine learning. Feel free to ask for any elaboration.

Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The learning curves. Lecture 8 of 18 of Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. View course materials in iTunes U Course App - https://itunes.apple.com/us/course/machine-learning/id515364596 and on the course website - ht...

quick mirage
austere horizon
#

Please, can anyone help me with this?

verbal crest
austere sequoia
#

Does anybody know how I can learn to build ensembles (stacking, etc.) that perform well in competitions and improve the metrics? Thank you!

pulsar sky
#

Hi guys quick question I had
Can selectolax be used to scrape dynamic content of a webpage?

crimson dragon
#

Hello, can someone please explain cross-validation to me and how I can use it?

deft fox
# crimson dragon Hello , can someone please explain to me the cross-validation and how can i use ...

You have to be willing to put in a minimal effort on your own. Your question is easily answered by Googling https://www.google.com/search?q=cross-validation
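
As a concrete starting point, here is a minimal sketch of 5-fold cross-validation with scikit-learn. The iris dataset and the logistic regression model are stand-ins chosen for illustration, not something from this thread; swap in your own data and model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative data and model (assumptions, replace with your own).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split the data into 5 folds: train on 4 folds, score on the held-out fold,
# and repeat so that every fold is used for validation exactly once.
scores = cross_val_score(model, X, y, cv=5)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

The mean of the per-fold scores is usually a more stable estimate of how the model will do on unseen data than a single train/validation split.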

fallen mist
#

Hey, I'm hosting my own competition and I can't see how I can pin my demo notebook to the top of the code notebooks section? Any ideas? I've seen it done in other competitions. Is this something staff can answer @tardy lodge?

lone hearth
#

Does anyone know what could be the issue when submitting a notebook that runs fine in Kaggle? I submitted to the LLM Detection Competition using Keras NLP. I ran the notebook before and it was fine, training, evaluating, and saving the submission.csv. It failed when I submitted though, so I copied this from the Log: Downloading data from https://storage.googleapis.com/keras-nlp/models/distil_bert_base_en_uncased/v1/vocab.txt
272.5s 101 Traceback (most recent call last):
272.5s 102 File "<string>", line 1, in <module>
272.5s 103 File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 128, in execute_notebook
272.5s 104 raise_for_execution_errors(nb, output_path)
272.5s 105 File "/opt/conda/lib/python3.10/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
272.5s 106 raise error
272.5s 107 papermill.exceptions.PapermillExecutionError:
272.5s 108 ---------------------------------------------------------------------------
272.5s 109 Exception encountered at "In [18]":
272.5s 110 ---------------------------------------------------------------------------
272.5s 111 gaierror Traceback (most recent call last)
272.5s 112 File /opt/conda/lib/python3.10/urllib/request.py:1348, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
272.5s 113 1347 try:
272.5s 114 -> 1348 h.request(req.get_method(), req.selector, req.data, headers,
272.5s 115 1349 encode_chunked=req.has_header('Transfer-encoding'))
272.5s 116 1350 except OSError as err: # timeout error

frozen sail
urban bone
#

Hey there. Is anyone working on question generation using LLMs? Please help me.

spiral valve
dull mauve
#

Has anyone tried using Hugging Face AutoTrain Advanced on Kaggle? How was the experience? Please share.

lethal raft
#

Has anyone done any work on Predict Energy Behavior of Prosumers??

cold tundra
#

Hi guys,
I built a web app to predict the classification of flowers using Machine Learning. I just need help with the last step. So far I have been able to successfully connect the HTML pages and their corresponding routes; only the last step is not working.

When the model makes a prediction, it returns a number:

0 or 1 or 2, depending upon the flower it has predicted. Instead, I get a None in the HTML file, but I checked the logs and the output from the predicting function is correct.

Kindly help me debug this.

Code link:
https://github.com/Kaus1kC0des/OIBSIP/tree/main/Data Science/Task 1

GitHub

This repository contains all the code my projects during my internship with Oasis Infobyte - Kaus1kC0des/OIBSIP

plush cairn
#

How many Kaggle notebooks can one run in parallel?
Mine gives an error when I try the third.

deft fox
deft fox
ember void
#

Does submitting a notebook in GPU mode consume our GPU quota?

muted talon
#

For content-based recsys, when recommending based on popularity, a damped mean of the target metric is a common and solid way to mitigate large discrepancies in the number of evaluations / ratings per item.
I was wondering, what other alternatives to the damped mean are there?
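
For reference, the damped mean mentioned above can be sketched as follows. The damping constant `k`, the global mean, and the sample ratings are illustrative assumptions, not values from this thread.

```python
def damped_mean(ratings, global_mean, k=5):
    """Shrink an item's mean rating toward the global mean.

    k behaves like k extra pseudo-ratings equal to global_mean, so items
    with only a few ratings stay close to the global mean.
    """
    return (sum(ratings) + k * global_mean) / (len(ratings) + k)


global_mean = 3.0           # assumed average rating over all items
few = [5.0, 5.0]            # 2 ratings, raw mean 5.0
many = [4.5] * 100          # 100 ratings, raw mean 4.5

print(round(damped_mean(few, global_mean), 2))   # 3.57
print(round(damped_mean(many, global_mean), 2))  # 4.43
```

The item with only two perfect ratings ends up ranked below the item with many slightly lower ratings, which is the effect the damping is meant to achieve.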

split flume
#

How difficult would it be to implement graph-based neural networks (GCN and GNN) in Kaggle? I am struggling to find projects which utilize them.

frozen sail
#

Don't spam.

fathom grove
#

For the Titanic ML dataset competition, there is a lot of missing data present in the Age column and the Cabin column. My current guess is that age matters a lot for survival (physical ability, etc.). I've found that https://en.wikipedia.org/wiki/Passengers_of_the_Titanic#:~:text=The ship's passengers were divided,military personnel%2C industrialists%2C bankers%2C contains a list of passengers with their ages. Is it OK to impute the values from this?

A total of 2,240 people sailed on the maiden voyage of the Titanic, the second of the White Star Line's Olympic-class ocean liners, from Southampton, England, to New York City. Partway through the voyage, the ship struck an iceberg and sank in the early morning of 15 April 1912, resulting in the deaths of 1,517 passengers.The ship's passengers w...

verbal crest
# fathom grove For the Titanic ML dataset competition, there is a lot of missing data present i...

Check out some of the most voted notebooks in the Titanic competition. Most of them talk about dealing with missing values in the data (an important data science skill the competition is trying to teach). You shouldn't look for external data with all the answers; the goal is to find ways to deal with missing data. Also check out the Kaggle course on missing values here: https://www.kaggle.com/code/alexisbcook/handling-missing-values
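
One simple way of dealing with missing values, sketched here with pandas, is median imputation plus a "was missing" indicator column. The tiny frames below are made-up stand-ins for the Titanic train/test data, and `Age_was_missing` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the Titanic train/test data (illustration only).
train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, np.nan]})
test = pd.DataFrame({"Age": [np.nan, 40.0]})

# Learn the fill value from the training data only.
age_median = train["Age"].median()

for df in (train, test):
    # Record which rows were missing before filling them in.
    df["Age_was_missing"] = df["Age"].isna()
    df["Age"] = df["Age"].fillna(age_median)

print(train)
print(test)
```

Computing the median on the training set and reusing it for the test set keeps information from the test data out of the preprocessing step.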

glossy finch
#

Because the function pd.get_dummies() depends on the data it is being applied to, df_train and df_test end up having different columns.
Therefore, if I fit a model on the training data, it cannot be applied to the test data.

#

How can I solve this?
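
One common fix, sketched below, is to encode both frames and then reindex the test columns to match the training columns. The `Embarked` and `Fare` columns and their values are made up for illustration.

```python
import pandas as pd

# Made-up data: the test set only contains one of the three categories.
train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [8.0, 13.0]})

X_train = pd.get_dummies(train, columns=["Embarked"], dtype=int)
X_test = pd.get_dummies(test, columns=["Embarked"], dtype=int)

# Align test to the training layout: dummy columns missing from the test set
# are added and filled with 0, columns unseen in training are dropped, and the
# column order is made identical.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(list(X_train.columns))
print(list(X_test.columns))
```

An alternative is scikit-learn's `OneHotEncoder(handle_unknown="ignore")`, which is fitted on the training data once and then reused to transform the test data.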

split flume
hasty sand
#

Hi Kaggle community, I have been working on a project and I am unsure how to do calculations on tuple data. My dataframe has data in the form (x, y) in every cell, and I would like to add up all the y values, depending on what x is, into a row total. I have about 1500 rows. What is the best way of doing this?
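
A possible approach, assuming the goal is one total of y per distinct x value in each row, is a row-wise `apply`. The column names and the tuple values below are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Made-up frame where every cell holds an (x, y) tuple.
df = pd.DataFrame({
    "a": [("red", 1), ("blue", 4)],
    "b": [("red", 2), ("red", 5)],
    "c": [("blue", 3), ("blue", 6)],
})


def sum_y_by_x(row):
    """Add up the y values in one row, grouped by their x label."""
    totals = defaultdict(int)
    for x, y in row:
        totals[x] += y
    return pd.Series(totals)


# One output column per distinct x; 0 where an x never occurs in a row.
row_totals = df.apply(sum_y_by_x, axis=1).fillna(0)
print(row_totals)
```

For about 1500 rows a plain `apply` like this should be fast enough; splitting the tuples into separate x and y columns first would be the next step if it ever becomes a bottleneck.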

hasty sand
graceful axle
#

Hi community,

I have a question related to k-fold cross-validation.

I'm currently training a classification model on a relatively small dataset (approximately 500 images across 5 classes) using a 5-fold approach. At the end of this process, I have five models. For my submissions, I utilize all five models to make predictions and take the average of their scores.

Is there any specific approach to replace these 5 models with only 1 model?
I need to do this to be able to use a model ensemble method.
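
One option is to wrap the five fold models in a single object that averages their predictions, so the rest of the pipeline sees one model. The sketch below assumes each fold model exposes a `predict_proba` method; `FoldAverager` and `ConstantModel` are hypothetical names used only for this illustration.

```python
import numpy as np


class FoldAverager:
    """Wrap several fold models so they behave like a single model."""

    def __init__(self, models):
        self.models = models

    def predict_proba(self, X):
        # Average the class probabilities predicted by every fold model.
        return np.mean([m.predict_proba(X) for m in self.models], axis=0)

    def predict(self, X):
        # Pick the class with the highest averaged probability.
        return np.argmax(self.predict_proba(X), axis=1)


class ConstantModel:
    """Hypothetical stand-in for a trained fold model (illustration only)."""

    def __init__(self, probs):
        self.probs = np.asarray(probs)

    def predict_proba(self, X):
        return np.tile(self.probs, (len(X), 1))


fold_models = [ConstantModel([0.6, 0.4]), ConstantModel([0.2, 0.8])]
ensemble = FoldAverager(fold_models)

X_new = np.zeros((3, 4))  # 3 dummy samples with 4 features each
probs = ensemble.predict_proba(X_new)
print(probs)
print(ensemble.predict(X_new))
```

The other common route is to use cross-validation only for model selection and then retrain one model on the full training set with the chosen settings.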

fathom grove
elder flower
#

Why can't I make my notebook public?

golden nova
#

Hello all,
I'm just a beginner in this field. I'm facing a problem or, in simpler words, I'm stuck in a loop.
I'm pretty well aware of the theory and conceptual knowledge required for Python, Kaggle, maths, ML and all, but I'm not able to put things together to build my FIRST ML MODEL. Can any of you help me out with this?

muted cliff
#

Hi! Where are you stuck?

#

Did you go through the Kaggle tutorials @golden nova?