#data-processing | Kaggle | Page 1

cedar forge Sep 5, 2023, 2:41 PM

#

tropic sky Sep 5, 2023, 4:58 PM

#

👋

dreamy talon Sep 5, 2023, 11:52 PM

#

https://tenor.com/view/kto-lbow-hi-hello-hi-there-gif-25347432

Tenor

lost star Sep 6, 2023, 5:22 PM

#

👋🏾

spare swan Sep 6, 2023, 7:33 PM

#

hello guys, please, has anyone got some experience in processing speech/audio data?

rustic hollow Sep 7, 2023, 3:10 PM

#

hello everyone

hot star Sep 7, 2023, 5:45 PM

#

Hello team

royal seal Sep 8, 2023, 8:13 AM

#

flint glade Sep 14, 2023, 2:55 PM

#

Hello everyone! Can anyone suggest some learning materials or books on data processing?

oblique mirage Sep 20, 2023, 9:41 PM

#

flint glade Hello everyone ! can anyone suggest me some learning materials or books on data ...

kaggle courses

clever thunder Oct 23, 2023, 1:22 AM

#

Use Google Colab?

kind cedar Oct 31, 2023, 12:17 PM

#

Hello there! We're currently developing a cost-effective storage solution designed for dataset version control and large file storage (LFS). Would you be interested in trying it out? You can find more details and access the service here: https://underhive.in/ . Your feedback and support would be greatly appreciated!

Underhive

Collaboration platform for ML Teams.

valid swan Jan 27, 2024, 5:07 PM

#

fair bough Feb 5, 2024, 6:25 PM

#

Perhaps this question goes here, though it relates more to data cleaning/feature engineering... I've seen several articles discussing different processes for normalizing and standardizing data (such as https://www.datacamp.com/tutorial/normalization-in-machine-learning). Time and time again, I see the dataset being split into test/train and a normalization technique fitted on the training set. The same normalization model is then applied to the test set. How is this not an example of data leakage? The parameters used to normalize the train set are directly influencing the normalization of the test set...

calm hazel Feb 6, 2024, 12:18 PM

#

fair bough perhaps this question goes here, moreso relates to data cleaning/feature enginee...

If you normalize the whole set, you "give" information from the test set to the training set. If you first fit the normalization on your train data and then use it to normalize the test set, only the test data gets information about the train data, which is the intended way.
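
A minimal sketch of the direction of information flow described above, using only the Python standard library. The data and the helper names `fit_standardizer` and `transform` are invented for illustration; they mirror the usual fit-on-train, apply-to-test pattern rather than any specific library's API.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for constant features
    return mu, sigma

def transform(values, mu, sigma):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

# Hypothetical single-feature data, invented for illustration.
train = [10.0, 12.0, 14.0, 16.0]
test = [11.0, 30.0]

mu, sigma = fit_standardizer(train)        # parameters come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)   # test is scaled with train's parameters

# Leaky variant for comparison: parameters estimated on train + test together,
# so the test values shift the numbers the model is trained on.
leaky_mu, leaky_sigma = fit_standardizer(train + test)
```

In the first variant the training inputs are identical whether or not the test set exists; in the leaky variant the test value 30.0 moves the mean used to scale the training data.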

mild nimbus Apr 26, 2024, 3:58 PM

#

Hi, I have a question related to dividing audio for a classification task. I want to understand how audio is segmented into chunks of, say, 5 secs or 10 secs, considering the amplitude or the mel spectrogram. Because with random clipping, for example splitting a 60 secs audio into 6 parts of 10 secs, I fear that I might lose important features if a split falls in the middle of one. Thank you!

shrewd lily May 13, 2024, 10:14 AM

#

I was reading about PCA and had a doubt. If I have a dataset of size n×m, and I apply PCA to get a transformed dataset of size n×m (I am not reducing the features), the transformed data has features such that none of them are correlated. So is it always good to apply PCA to reduce the correlation among features? But I have never seen any Kaggle notebooks doing this. PCA is generally only used for feature selection.

void wharf Jul 1, 2024, 11:51 AM

#

Hello everyone, I have a question and I would really appreciate your assistance.
I have 2 networking and IP address data files in .RR format (ex: myipv6add.RR, myipv6add2.RR) and I want to extract them into a MySQL file. How can I write a script in Python to do that?

small python Jul 3, 2024, 6:25 PM

#

hi, I'm working on a trading algorithm and I made a big dataset with at least 50k rows, but I have a problem: the way I made the dataset, it's very biased (but it's real data with a baseline of accuracy), so maybe I can do SMOTE for data augmentation or use class_weight for a Keras model. Right now I'm trying to improve my algorithm with Markov chains and I'm trying to apply a smoothing_factor to the transition_matrix, but it's not working at all. Any idea which way I should take?

hot plinth Jul 7, 2024, 7:42 AM

#

You initially talk about data augmentation/class weight, which makes it seem like a standard NN task, then you talk about Markov chains, which makes it seem like online learning stuff, then a transition matrix, redirecting it to reinforcement learning.

Please decide what exactly the task is, and don't just use things blindly.

sterile nova Jul 21, 2024, 12:31 AM

#

Hi, I'm new to machine learning. I tried the last Playground episode from Kaggle, but I have a huge CSV file and I can't load it all into RAM. Does anyone have some suggestions for this task?

dim hamlet Jul 25, 2024, 1:29 PM

#

Yes, I am also new to ML. I am very enthusiastic and filled with curiosity to start learning ML. What is the path? How do I practice ML using Kaggle?

fallow ginkgo Aug 31, 2024, 9:21 PM

#

anyone have good resources for learning genetic algorithms for scientific research?

#

currently reading genetic algorithms with python but I want a book more tailored to a research application

mint sequoia Nov 11, 2024, 10:54 AM

#

Hi, I am Abdullah. I am an ML engineer and want to join any team to participate in Kaggle competitions.

graceful lotus Dec 6, 2024, 6:27 PM

#

📖 Check it out here: https://www.kaggle.com/discussions/questions-and-answers/550244 . It will help you gain knowledge in data preprocessing.

hidden sapphire Jan 15, 2025, 1:41 PM

#

Hello everyone! 😊

I hope you're all doing well. I’ve been actively working on creating and sharing insightful Kaggle notebooks, and I’d truly appreciate your support. It would mean a lot if you could take a moment to visit my Kaggle profile and check out my notebooks:

🔗 https://www.kaggle.com/sajjadalishah/code

If you find them helpful, I’d be grateful if you could upvote them. Your encouragement and feedback inspire me to keep learning and contributing to the community.

Thank you so much for your time and support! 🙏

glass wigeon Feb 11, 2025, 8:39 AM

#

Hey all, I just wrote a discussion post about my research focus, XAI - specifically, this post was about model distillation and how inexperienced junior data scientists will tend to just throw the biggest most complicated possible model at a problem with no regard for what is truly the best solution. Please give it a read and leave your thoughts in the comments -- thanks! 😄

https://www.kaggle.com/discussions/general/562312

Model Distillation for Human Interpretability | Kaggle

Model Distillation for Human Interpretability.

median geyser Feb 27, 2025, 5:27 PM

#

Hello everyone, I have a question on removing multicollinearity among features using VIF. I have seen models where they don't add a constant column to compute vif_score. Is it necessary to add a constant column to compute vif_score? Thanks!

weak fern Apr 17, 2025, 11:16 AM

#

Hey, I have a question you may find trivial, but why use a standard scaler before doing PCA or clustering? Doesn't the standard scaler change the relative distances between instances and therefore bias the clustering?

stoic flicker May 14, 2025, 8:21 AM

#

weak fern Hey, I have a question you may find trivial but why using a standard scaler befo...

According to my knowledge, we are trying to fit our data into a fixed range to standardize it. Please correct me if I am wrong.

prisma tree May 14, 2025, 9:22 AM

#

weak fern Hey, I have a question you may find trivial but why using a standard scaler befo...

Although it's a little distance (time-wise) away, I'd like to add my 2 cents 😛

I recently learnt the math behind PCA, and have yet to use it, because I want to do sensitivity analysis first which I need to learn more math for.

The answer is that, given the mean is 0 and the standard deviation is 1, the distribution of the data is preserved. Whereas if you used minmax, for example, outliers wouldn't correctly contribute to the variance of the data.

The Pearson correlation matrix is full of dot products, with the diagonal (the positions that match the identity matrix) representing the variance (as each column's dot product with itself is its variance).

Sometimes, you might even want to use a different scaling technique instead of z-score normalization (standard scaler) deliberately, to reduce the impact of outliers on variance for whatever reason.

As for the score of the dot products, the scalar returned is contextualized by the two vectors that formed it, hence why all columns should ideally be normalized first.

(Not sure about clustering, have yet to perform this myself, but will advise of the answer when I do this)

hallow mortar May 25, 2025, 10:51 PM

#

weak fern Hey, I have a question you may find trivial but why using a standard scaler befo...

Features with large ranges can dominate the clustering or PCA process, while features with small ranges might be ignored. However, in my opinion, a factor analysis may be done to determine key features, and dimensionality reduction may be attempted before standardization (if you feel that not all features need to be equally weighted or included in the model, based on domain validation of the problem as a whole).
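
To make the "large ranges dominate" point concrete, here is a small sketch with invented numbers (income in dollars, age in years). It compares Euclidean distances before and after z-scoring each column; the data and the helper name `standardize_columns` are assumptions for illustration only.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical rows: (annual income in dollars, age in years).
rows = [(30_000.0, 25.0), (31_000.0, 60.0), (60_000.0, 26.0)]

def standardize_columns(data):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    columns = list(zip(*data))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in data]

scaled = standardize_columns(rows)

# Raw distances: the income column decides almost everything.
raw_01 = dist(rows[0], rows[1])   # similar income, very different age
raw_02 = dist(rows[0], rows[2])   # similar age, very different income

# Standardized distances: both columns contribute on a comparable scale.
std_01 = dist(scaled[0], scaled[1])
std_02 = dist(scaled[0], scaled[2])
```

On the raw data, a 35-year age gap barely registers next to a 30,000-dollar income gap. After standardization the two pairs end up at similar distances, so the scaler does change relative distances, but it does so by removing the effect of the units rather than by favoring one feature.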

wooden knoll Jun 1, 2025, 10:00 PM

#

Hello everyone, I wanted to know the difference between Gradient Descent, Maximum Likelihood Estimation (MLE), and Ordinary Least Squares (OLS) with respect to linear regression. If anyone knows of a good article on it, please tell me.

vital venture Aug 7, 2025, 1:34 PM

#

Hey everyone!
I’m working on the CMI – Detect Behavior with Sensor Data Kaggle competition, where the goal is to classify BFRB vs non‑BFRB behaviors using wrist-worn sensor data (TOF, IMU, pressure, etc.)

https://www.kaggle.com/competitions/cmi-detect-behavior-with-sensor-data

I’ve trained an LSTM using PyTorch and got surprisingly strong results (i.e. accuracy = 93), which makes me worry about potential data leakage or preprocessing issues...

Here’s what I did to avoid leakage:

- Split data by sequence ID, no overlap between train/test

- Fit MinMaxScaler only on the training set, then applied to both

- Replaced NaNs, -1, and inf values with 0 before scaling

However, since 0 is a valid sensor reading, replacing missing/invalid values with 0 might introduce bias. I'm unsure whether I should switch to median, KNN, or use masking instead.

If anyone has experience with sensor data or wants to take a look at the code, I’d really appreciate the help and happy to include collaborators in the Kaggle submission team! Just DM me or reply here
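
A rough sketch of the masking option mentioned above, combined with the split-by-sequence and fit-on-train steps from the list. It uses only the standard library; the readings, sequence IDs, and helper names are invented for illustration and are not taken from the competition data or the poster's code.

```python
import math

def is_invalid(value):
    """Treat NaN, inf, and the -1 sentinel as missing (the -1 rule comes from the post)."""
    return value == -1 or math.isnan(value) or math.isinf(value)

def fit_min_max(train_values):
    """Learn min and max from valid training readings only."""
    valid = [v for v in train_values if not is_invalid(v)]
    return min(valid), max(valid)

def scale_with_mask(values, lo, hi):
    """Return (scaled, mask): mask is 1 for real readings and 0 for filled ones."""
    span = (hi - lo) or 1.0
    scaled, mask = [], []
    for v in values:
        if is_invalid(v):
            scaled.append(0.0)   # placeholder value; the mask marks it as missing
            mask.append(0)
        else:
            scaled.append((v - lo) / span)
            mask.append(1)
    return scaled, mask

# Hypothetical readings keyed by sequence ID, invented for illustration.
sequences = {
    "seq_a": [0.0, 50.0, float("nan"), 100.0],
    "seq_b": [25.0, -1, 75.0],
    "seq_c": [0.0, float("inf"), 200.0],
}
train_ids, test_ids = ["seq_a", "seq_b"], ["seq_c"]   # whole sequences, no overlap

train_values = [v for sid in train_ids for v in sequences[sid]]
lo, hi = fit_min_max(train_values)                     # fitted on train only

test_scaled, test_mask = scale_with_mask(sequences["seq_c"], lo, hi)
```

With the mask passed to the model as an extra channel, a genuine 0 reading and a filled 0 are no longer indistinguishable. Note that a test value above the training maximum scales to more than 1, which is expected when the scaler is fitted on the training set only.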

CMI - Detect Behavior with Sensor Data

Predicting Body Focused Repetitive Behaviors from a Wrist-Worn Device

rain cape Sep 1, 2025, 8:45 PM

#

wooden knoll hello everyone,i wanted to know the difference between Gradient Descent, Maximum...

You can ask GPT!! Learn AI using AI!!

still spear Sep 5, 2025, 4:10 PM

#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
-Minimum of 7–10 years of professional software development experience

-Proven experience working effectively in a remote environment

-Advanced English proficiency (C1 or higher); an American accent is preferred

-Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

safe belfry Sep 15, 2025, 6:57 PM

#

Hi, @everybody
I have one question. I'm training ML models for prediction, a classification problem with 3 classes, where the number of samples per class is similar but the predictions are skewed.
The first and second classes are predicted with low precision, though, and the third class is never predicted. What's the reason? I can't find it.
Before, when I applied reinforcement learning, where the three classes were assigned to three actions, one action was never selected either.
Actually, this is a prediction model for forex EUR/USD.

lofty iron Sep 23, 2025, 12:31 AM

#

J4C4U

Caring
And
Sharing
Honesty

The definitive reason for FUNDS and what our earnings are made of, I.T is the reality check of how we live and what purpose defines us.

For You and your community, these are creations made around and in all that we all exist for to both elevate our purpose and alleviation of all that challenges these goals.

https://maps.app.goo.gl/1tHMC1yYBneUqryB7

lyric trench Sep 27, 2025, 3:31 PM

#

RudraDB-Opin 1.0.0
🎉 Just dropped: The world's first FREE context-aware vector+graph database!
⚡ Why VibeCoding devs will love this:

🔥 Traditional vs RudraDB-Opin:
- Traditional: "Find similar documents"
+ RudraDB-Opin: "Find connected documents through relationships"

🚀 Zero-friction setup:
pip install rudradb-opin # 100% Free Forever!

✨ Context-aware intelligence:

🧠 Relationship-aware search - Finds connections similarity misses
🎯 Auto-dimension detection - Works with ANY embedding model
🔗 Multi-hop discovery - Traverses relationship chains
⚡ Context preservation - Maintains semantic relationships

💎 Perfect for rapid prototyping:

100 vectors + 500 relationships (ideal for demos/POCs)
5 relationship types (semantic, hierarchical, temporal, causal, associative)
Same API as production version (seamless scaling)

🛠️ Speed up your AI builds:

RAG systems with intelligent document connections
Chatbots that understand conversation context
Knowledge apps with relationship discovery
Recommendation engines beyond similarity

⚡ From zero to context-aware AI in 3 lines:
db = rudradb.RudraDB() # Auto-detects dimensions
db.add_relationship("intro", "advanced", "temporal", 0.9)
results = db.search(query, include_relationships=True) 🎯

🔗 Try it: pip install rudradb-opin
📚 Docs: rudradb.com

Who's building with relationship-aware search? Share your experiments! 👇

Innovation isn't iteration - it's transformation. 🚀

devout cairn Oct 5, 2025, 7:11 AM

#

https://www.kaggle.com/datasets/ziya07/high-speed-train-bogie-vibration-and-fault-diagnosis/data
This is a dataset of train bogie vibrations. I have tried everything: extracted time-domain features, extracted frequency-domain features, extracted time-frequency features like wavelets, etc. Tried classical ML, tried 1D conv on raw data, tried a sliding-window approach and 2D conv, tried anomaly detection. But I can't get the accuracy above 55%. Please help me understand this data and how to model it.

safe belfry Nov 10, 2025, 5:53 PM

#

I'm looking for a US developer for a collaboration. If anybody is interested, please DM me.

peak vault Dec 2, 2025, 5:08 PM

#

Hello Kaggle community,

I have a question that you might find trivial. Given a tabular task (so basically we use classic ML algorithms, namely LR, RF, XGBoost, ...), do we add or remove features (assume all features are numerical)? When, how, and why? I would like to make things clear in my head, because when I try to understand this from ChatGPT or similar LLMs, the answers are not consistent and look biased by how I ask the question. Sometimes it says it is good to make many new features from the original ones (using a sin function, the product of two features, ...) and to fix a threshold on their correlation with the target to remove "uninformative" features. Sometimes it says that if you have three highly correlated features, you should keep only one of them since the information is redundant (so in the correlation matrix we can find a small square block along the diagonal where the correlation coefficients are too high, like > 0.8), ... So in short, my understanding of "good" practices in this regard is pretty blurry and I would like to make it clear. I would really appreciate it if you could give me a clear, logical (math-based) answer or suggest where I could find one!
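
For reference, the redundancy rule described in this message (keep one feature out of a highly correlated group, using a threshold such as 0.8) can be sketched as below. This is only an illustration of that one heuristic, not an answer to when it should be applied; the data, the function names, and the choice to keep the feature most correlated with the target are all assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def prune_correlated(features, target, threshold=0.8):
    """Greedily keep features, skipping any too correlated with one already kept.

    Features are visited in order of |correlation with the target|, strongest
    first, so the survivor of each redundant group is the one most related to
    the target.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data, invented for illustration.
features = {
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.5],   # almost a copy of x
    "z": [3.0, 1.0, 4.0, 1.0, 5.0],             # only weakly related to x
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]

kept = prune_correlated(features, target)
```

Here `x_doubled` is dropped because it is nearly collinear with `x`, while `z` survives. Pearson correlation only captures linear relationships, so a filter like this says nothing about features that matter through interactions or nonlinear effects.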

crimson sleet Mar 12, 2026, 11:15 AM

#

Hello hackers,

I need some help. I’m training a conversation disentanglement model using this repo: https://github.com/jkkummerfeld/irc-disentanglement . It will be used to prepare a conversation dataset for a project.

I don’t have access to compute resources that can run continuously for five days. I’m using Google Colab, but sessions eventually stop when the tab closes or times out. I also can’t afford a cloud provider right now.

If anyone has a home setup that can run uninterrupted for several days and is willing to help, I would really appreciate it. Thanks!

sturdy glade Apr 5, 2026, 4:17 PM

#

hello

#

Hi everyone,

I’m currently working on a machine learning project for crypto trading using XGBoost, and I’m struggling with something that I can’t fully clarify.

My setup:

Minute-level data (order book + derived features)
Predicting a continuous target (entry/exit signals based on future price movement)
Using XGBoost with time-series CV (Spearman correlation as main metric)

The issue I’m facing is related to feature engineering vs feature selection.

I’m not sure how to properly decide:

When should I create new features (interactions, ratios, transformations)?
When should I remove features (low correlation, low importance, redundancy)?
Should I remove highly correlated features between themselves (e.g. corr > 0.8), or keep them since XGBoost can handle that?

What confuses me is that I get inconsistent guidance:

Sometimes it’s recommended to generate many new features and filter them
Other times to aggressively reduce features to avoid noise and overfitting

In practice, I observe:

If I use too many features → model becomes too smooth / low amplitude predictions
If I reduce too much → model becomes unstable or noisy

So my core question is:
👉 Is there a clear logic or framework (ideally math-based) for deciding feature creation and selection in tree-based models like XGBoost?

Or is it purely empirical (based on validation performance)?

Any insights, best practices, or resources would be really appreciated.

Thanks!

fallen stream Apr 6, 2026, 12:56 AM

#

Happy Weekend!

Hello Everyone!
If you know someone who has good skills in Python and machine learning, please invite me!

Our company is looking to hire Python and software engineers.

Requirements:
2+ years of Software Engineering Experience
C1 or Native English Level
Good awareness of software trends

Benefits:
Competitive Income
Support for several roles and opportunities
Working in multiple roles is possible

Important:
Our company is designed for capable people.

Questions:
For junior people?
Do not give up. Strong enthusiasm is also a big point, and our company also focuses on a person's enthusiasm.

Thanks again.
Sophia