#data-processing
Hello guys, does anyone have some experience processing speech/audio data?
hello everyone
Hello team
Hello everyone! Can anyone suggest some learning materials or books on data processing?
Kaggle courses
Use Google Colab?
Hello there! We're currently developing a cost-effective storage solution designed for dataset version control and large file storage (LFS). Would you be interested in trying it out? You can find more details and access the service here: https://underhive.in/ Your feedback and support would be greatly appreciated!
Collaboration platform for ML Teams.
Perhaps this question goes here, since it relates more to data cleaning/feature engineering. I've seen several articles discussing different processes for normalizing and standardizing data (such as https://www.datacamp.com/tutorial/normalization-in-machine-learning). Time and time again, I see the dataset being split into train/test and a normalization technique fitted on the training set. The same fitted normalization is then applied to the test set. How is this not an example of data leakage? The parameters learned from the training set directly influence the normalization of the test set...
If you normalize the whole set, you "give" information from the test set to the training set, which is leakage. If you first fit the normalization on your training data and then use it to normalize the test set, only the test data receives information from the training data, which is the intended direction.
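To make the direction of information flow concrete, here is a minimal sketch in plain Python. The numbers and the helper names `fit_scaler` and `transform` are made up for illustration; a library scaler would follow the same fit-on-train, apply-to-both pattern.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn standardization parameters from the training values only."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

mu, sigma = fit_scaler(train)               # statistics come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # test reuses the train statistics
```

The test values never enter `fit_scaler`, so nothing about the test set can influence what the model sees during training.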
Hi, I have a question about dividing audio for a classification task. I want to understand how audio is segmented into chunks of, say, 5 or 10 seconds, whether based on the amplitude or on the mel spectrogram. With random clipping, for example splitting a 60-second audio into 6 parts of 10 seconds, I fear I might lose important features if a split falls in the middle of one. Thank you!
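One common way to soften the boundary problem is to cut overlapping windows instead of back-to-back ones, so that an event split by one boundary sits whole inside a neighbouring window (as long as it is shorter than the overlap). A minimal sketch, assuming the audio is already a list of samples; the function name and the tiny 10 Hz rate are hypothetical and only keep the example small.

```python
def sliding_windows(samples, sample_rate, window_s, hop_s):
    """Yield (start_index, chunk) pairs of fixed length; hop < window gives overlap."""
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, len(samples) - window + 1, hop):
        yield start, samples[start:start + window]

# Toy "audio": 60 seconds at a deliberately tiny, hypothetical rate of 10 Hz.
sample_rate = 10
audio = [0.0] * (60 * sample_rate)

# 10-second windows that start every 5 seconds, so neighbours share 5 seconds.
chunks = list(sliding_windows(audio, sample_rate, window_s=10, hop_s=5))
```

With these settings the 60-second clip yields 11 chunks instead of 6. Whether to cut on the waveform or on the spectrogram is a separate choice; the same windowing idea applies to spectrogram frames.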
I was reading about PCA and had a doubt. If I have a dataset of size n×m and I apply PCA to get a transformed dataset of size n×m (I am not reducing the number of features), the transformed features are all uncorrelated with each other. So is it always good to apply PCA to remove the correlation among features? I have never seen any Kaggle notebooks doing this; PCA generally seems to be used only for reducing the number of features.
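For two features, keeping all components makes PCA just a rotation of the centered data, which can be written by hand. The sketch below uses made-up data and hypothetical helper names. It shows that the rotated features are uncorrelated while the total variance is unchanged, so the step re-expresses the same information rather than adding any.

```python
import math
from statistics import mean

def covariance(a, b):
    """Population covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def pca_rotate_2d(x, y):
    """Rotate two centered features onto their principal axes (no reduction)."""
    mx, my = mean(x), mean(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * covariance(x, y),
                             covariance(x, x) - covariance(y, y))
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(xc, yc)]
    pc2 = [-s * a + c * b for a, b in zip(xc, yc)]
    return pc1, pc2

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]   # strongly correlated with x
pc1, pc2 = pca_rotate_2d(x, y)
```

The price is that `pc1` and `pc2` are mixtures of the original columns and no longer have a direct meaning, which may be one reason the full rotation is rarely applied by default.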
Hello everyone, I have a question and I would really appreciate your assistance.
I have 2 data files of networking and IP addresses in .RR format (e.g. myipv6add.RR, myipv6add2.RR) and I want to extract them into a MySQL file. How can I write a Python script to do that?
Hi, I'm working on a trading algorithm and I built a big dataset with at least 50k rows, but the way I built it leaves it very biased (it is real data, with a baseline accuracy). So maybe I can use SMOTE for data augmentation, or class_weight for a Keras model. Right now I'm trying to improve my algorithm with Markov chains and I'm trying to apply a smoothing_factor to the transition_matrix, but it's not working at all. Any idea which way I should take?
You initially talk about data augmentation/class weights, which makes it seem like a standard NN task, then you talk about Markov chains, which makes it seem like online learning, then about a transition matrix, which points towards reinforcement learning.
Please decide what exactly the task is, rather than using things blindly.
Hi, I'm new to machine learning. I tried the latest Playground episode on Kaggle, and I have a huge CSV file that I can't load into RAM all at once. Does anyone have suggestions for this?
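One option is to stream the file and keep only running totals in memory. A minimal sketch with the standard library; the column names and the small stand-in file are hypothetical, since the actual competition file is not shown here.

```python
import csv
import os
import tempfile

def streaming_mean(path, column):
    """Compute a column mean while holding only one row in memory at a time."""
    total, count = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row[column])
            count += 1
    return total / count

# Small stand-in file; the real CSV and its columns are not known here.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "price"])            # hypothetical column names
    writer.writerows([[1, 10.0], [2, 20.0], [3, 30.0]])
    demo_path = tmp.name

demo_mean = streaming_mean(demo_path, "price")
os.remove(demo_path)                            # clean up the stand-in file
```

If you prefer pandas, `pandas.read_csv` accepts a `chunksize` argument that returns the file in pieces, which supports the same pattern of processing one piece at a time.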
Yes, I am also new to ML. I am very enthusiastic and curious to start learning ML. What is the path? How do I practice ML using Kaggle?
Does anyone have good resources for learning genetic algorithms for scientific research?
I'm currently reading Genetic Algorithms with Python, but I want a book more tailored to research applications.
Hi, I am Abdullah. I am an ML engineer and want to join a team to participate in Kaggle competitions.
Check it out here: https://www.kaggle.com/discussions/questions-and-answers/550244 (it will help you gain knowledge in data preprocessing)
Hello everyone!
I hope you're all doing well. I've been actively working on creating and sharing insightful Kaggle notebooks, and I'd truly appreciate your support. It would mean a lot if you could take a moment to visit my Kaggle profile and check out my notebooks:
https://www.kaggle.com/sajjadalishah/code
If you find them helpful, I'd be grateful if you could upvote them. Your encouragement and feedback inspire me to keep learning and contributing to the community.
Thank you so much for your time and support!
Sajjad Ali Shah
Data Scientist | Data Analytics | Machine Learning
Progressing in the realms of data science, machine learning, and data analytics.
BS (Software Engineering)
Kaggle Achievements:
[not mentioned]
Key Skills:
Data Science
Machine Learning
Data Analytics
[Any additional skills]
Projects:
[Brief overview of a coupl...
Hey all, I just wrote a discussion post about my research focus, XAI. Specifically, the post is about model distillation and how inexperienced junior data scientists tend to throw the biggest, most complicated model they can at a problem with no regard for what is truly the best solution. Please give it a read and leave your thoughts in the comments. Thanks!
Model Distillation for Human Interpretability.
Hello everyone, I have a question about removing multicollinearity among features using VIF. I have seen models where a constant column is not added before computing vif_score. Is it necessary to add a constant column to compute vif_score? Thanks!
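On the constant column: VIF is 1 / (1 - R^2), where R^2 comes from regressing one feature on the others. If that regression has no intercept, R^2 is the uncentered kind, and any feature whose mean is far from zero looks almost perfectly "explained". As far as I know, statsmodels' `variance_inflation_factor` fits the regression on exactly the columns it is given and does not add an intercept itself, which is why many tutorials add a constant first. A minimal sketch, assuming only two features so the regression has a closed form; the data and the function name are made up for illustration.

```python
from statistics import mean

def vif_two_features(x1, x2, with_constant):
    """VIF of x1 given x2 for the two-feature case.

    with_constant=True  -> centered R^2 (regression with an intercept)
    with_constant=False -> uncentered R^2 (regression through the origin)
    """
    if with_constant:
        m1, m2 = mean(x1), mean(x2)
        x1 = [v - m1 for v in x1]
        x2 = [v - m2 for v in x2]
    sxy = sum(a * b for a, b in zip(x1, x2))
    sxx = sum(a * a for a in x1)
    syy = sum(b * b for b in x2)
    r_squared = sxy * sxy / (sxx * syy)
    return 1.0 / (1.0 - r_squared)

# Two weakly related features whose means are far from zero.
x1 = [100.0, 101.0, 102.0, 103.0, 104.0]
x2 = [50.0, 49.0, 52.0, 51.0, 50.0]

vif_with_constant = vif_two_features(x1, x2, with_constant=True)
vif_without_constant = vif_two_features(x1, x2, with_constant=False)
```

Here the version with a constant stays close to 1, while the version without it is in the thousands, even though the two features are only weakly related.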
Hey, I have a question you may find trivial: why use a standard scaler before doing PCA or clustering? Doesn't the standard scaler change the relative distances between instances and therefore bias the clustering?
As far as I know, we are trying to fit our data into a fixed range to standardize it. Please correct me if I am wrong.
Although it's a little while later, I'd like to add my 2 cents.
I recently learnt the math behind PCA and have yet to use it, because I want to do sensitivity analysis first, which I need to learn more math for.
The answer is that standardization is only a shift and a rescale: with the mean at 0 and the standard deviation at 1, the shape of each feature's distribution is preserved. With min-max scaling, by contrast, the range is set by the most extreme values, so outliers would not contribute to the variance in the same way.
The Pearson correlation matrix is full of dot products between standardized columns. Its diagonal (the positions where the identity matrix has ones) holds each column's dot product with itself, which, up to a factor of 1/n, is that column's variance.
Sometimes you might even want to use a different scaling technique instead of z-score normalization (the standard scaler), deliberately, to reduce the impact of outliers on the variance.
As for the dot products themselves, the scalar returned only has meaning relative to the two vectors that formed it, which is why all columns should ideally be normalized first.
(Not sure about clustering, have yet to perform this myself, but will advise of the answer when I do this)
Features with large ranges can dominate the clustering or PCA process, while features with small ranges might be ignored. However, in my opinion, factor analysis may be done to determine the key features, and dimensionality reduction may be attempted before standardization (if domain validation of the problem suggests that not all features need to be equally weighted or included in the model).
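A small illustration of the "large ranges dominate" point, using made-up rows with two hypothetical features (income in dollars, age in years) and plain Python helpers:

```python
import math
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score every column using that column's own mean and standard deviation."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical features: income in dollars, age in years.
rows = [
    [30000.0, 25.0],
    [30500.0, 60.0],   # similar income, very different age
    [45000.0, 26.0],   # very different income, similar age
]

raw_01 = euclidean(rows[0], rows[1])
raw_02 = euclidean(rows[0], rows[2])

scaled = standardize_columns(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])
```

On the raw values the age difference barely registers next to the income difference. After standardization both pairs end up at a similar distance, so each feature gets a comparable say. The scaling does change relative distances, but that change is the intended effect, not a side effect.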
Hello everyone, I wanted to know the difference between gradient descent, maximum likelihood estimation (MLE), and ordinary least squares (OLS) with respect to linear regression. If anyone knows of a good article on it, please tell me.
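One way to see how the three relate: OLS names the objective (minimize the sum of squared residuals) and has a closed-form solution; gradient descent is a general optimizer that can minimize that same objective step by step; and MLE is an estimation principle that, if the noise is assumed Gaussian with constant variance, leads to exactly the squared-error objective. A minimal sketch for a single feature, with made-up data and a learning rate and step count chosen only for this toy example:

```python
from statistics import mean

def ols_fit(x, y):
    """Closed-form least squares for y = a + b * x."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def gd_fit(x, y, lr=0.05, steps=5000):
    """Minimize the same mean squared error iteratively with gradient descent."""
    a, b, n = 0.0, 0.0, len(x)
    for _ in range(steps):
        residuals = [a + b * xi - yi for xi, yi in zip(x, y)]
        grad_a = 2 * sum(residuals) / n
        grad_b = 2 * sum(r * xi for r, xi in zip(residuals, x)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

a_ols, b_ols = ols_fit(x, y)
a_gd, b_gd = gd_fit(x, y)
```

Both routes land on the same intercept and slope here, because they are solving the same problem in two different ways.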
Hey everyone!
I'm working on the CMI - Detect Behavior with Sensor Data Kaggle competition, where the goal is to classify BFRB vs non-BFRB behaviors using wrist-worn sensor data (TOF, IMU, pressure, etc.).
https://www.kaggle.com/competitions/cmi-detect-behavior-with-sensor-data
I've trained an LSTM using PyTorch and got surprisingly strong results (i.e. accuracy = 93), which makes me worry about potential data leakage or preprocessing issues.
Here's what I did to avoid leakage:
- Split data by sequence ID, no overlap between train/test
- Fit MinMaxScaler only on the training set, then applied to both
- Replaced NaNs, -1, and inf values with 0 before scaling
However, since 0 is a valid sensor reading, replacing missing/invalid values with 0 might introduce bias. I'm unsure whether I should switch to median, KNN, or use masking instead.
If anyone has experience with sensor data or wants to take a look at the code, I'd really appreciate the help, and I'm happy to include collaborators in the Kaggle submission team! Just DM me or reply here.
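On the fill-with-0 concern, one option is to combine two of the ideas mentioned: fill with a statistic learned from the training data and add a mask column, so the model can still tell a real 0 from a filled value. A minimal sketch in plain Python; `None` stands in for whatever has been flagged as missing or invalid, and the helper names and values are hypothetical.

```python
from statistics import median

MISSING = None  # stand-in for NaN / -1 / inf once they have been flagged

def fit_fill_value(train_column):
    """Median of the observed training values only (no test data involved)."""
    return median(v for v in train_column if v is not MISSING)

def impute_with_mask(column, fill_value):
    """Return (filled values, mask), where mask is 1 for originally missing entries."""
    filled = [fill_value if v is MISSING else v for v in column]
    mask = [1 if v is MISSING else 0 for v in column]
    return filled, mask

train_col = [0.0, 3.0, None, 5.0, 0.0]   # the zeros here are real readings
test_col = [None, 4.0]

fill = fit_fill_value(train_col)
train_filled, train_mask = impute_with_mask(train_col, fill)
test_filled, test_mask = impute_with_mask(test_col, fill)
```

The mask would be fed to the model as an extra input channel next to the filled values. Whether the median, KNN, or something else is the right fill for these sensors is an empirical question this sketch does not settle.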
You can use GPT! Learn AI using AI!
Job Title: Part-Time Senior AI/ML Engineer (Remote)
We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.
Requirements:
- Minimum of 7-10 years of professional software development experience
- Proven experience working effectively in a remote environment
- Advanced English proficiency (C1 or higher); an American accent is preferred
- Availability to work 10-15 hours per week during EST or CST business hours
If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we'd love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384
Hi, @everybody
I have one question. I'm training ML models for a 3-class classification problem where the number of samples per class is similar, but the predictions are skewed.
The first and second classes are predicted, though with low precision, and the third class is never predicted. What's the reason? I can't find it.
Earlier, when I applied reinforcement learning with the three classes assigned to three actions, one action was never selected either.
Actually, this is a prediction model for forex EUR/USD.
J4C4U
Caring
And
Sharing
Honesty
The definitive reason for FUNDS and what our earnings are made of, I.T is the reality check of how we live and what purpose defines us.
For You and your community, these are creations made around and in all that we all exist for to both elevate our purpose and alleviation of all that challenges these goals.
RudraDB-Opin 1.0.0
Just dropped: The world's first FREE context-aware vector+graph database!
Why VibeCoding devs will love this:
Traditional vs RudraDB-Opin:
- Traditional: "Find similar documents"
- RudraDB-Opin: "Find connected documents through relationships"
Zero-friction setup:
pip install rudradb-opin # 100% Free Forever!
Context-aware intelligence:
Relationship-aware search - Finds connections similarity misses
Auto-dimension detection - Works with ANY embedding model
Multi-hop discovery - Traverses relationship chains
Context preservation - Maintains semantic relationships
Perfect for rapid prototyping:
100 vectors + 500 relationships (ideal for demos/POCs)
5 relationship types (semantic, hierarchical, temporal, causal, associative)
Same API as production version (seamless scaling)
Speed up your AI builds:
RAG systems with intelligent document connections
Chatbots that understand conversation context
Knowledge apps with relationship discovery
Recommendation engines beyond similarity
From zero to context-aware AI in 3 lines:
db = rudradb.RudraDB() # Auto-detects dimensions
db.add_relationship("intro", "advanced", "temporal", 0.9)
results = db.search(query, include_relationships=True)
Try it: pip install rudradb-opin
Docs: rudradb.com
Who's building with relationship-aware search? Share your experiments!
Innovation isn't iteration - it's transformation.
https://www.kaggle.com/datasets/ziya07/high-speed-train-bogie-vibration-and-fault-diagnosis/data
This is a dataset of train bogie vibrations. I have tried everything: extracted time-domain features, frequency-domain features, and time-frequency features such as wavelets; tried classical ML, 1D conv on raw data, a sliding-window approach with 2D conv, and anomaly detection. But I can't get the accuracy above 55%. Please help me understand this data and how to model it.
I'm looking for a US developer to collaborate with. If anybody is interested, please DM me.
Hello Kaggle community,
I have a question that you might find trivial. Given a tabular task (so basically we use classic ML algorithms, namely LR, RF, XGBoost, ...), do we add or remove features (assume all features are numerical)? When, how, and why? I would like to make things clear in my head, because when I try to understand this from ChatGPT or similar LLMs, the answer is not consistent and looks biased by how I ask the question. Sometimes it says it is good to make many new features from the original ones (using the sin function, the product of two features, ...) and to fix a threshold on their correlation with the target to remove "uninformative" features. Sometimes it says that if you have three highly correlated features you should keep only one of them, since the information is redundant (so in the correlation matrix we can find a small square block along the diagonal where the correlation coefficients are too high, like >0.8). In short, my understanding of "good" practices in this regard is pretty blurry. I would really appreciate a clear, logical (math-based) answer, or a suggestion for where I could find one!
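For reference, the "keep only one of a highly correlated group" rule described above can be written as a greedy filter. This is a minimal sketch with made-up feature names and values, using the 0.8 threshold from the message. It illustrates the rule itself and is not a claim that the rule is always the right call.

```python
import math
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def drop_redundant(features, threshold=0.8):
    """Greedily keep a feature only if it is not too correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly 2 * f1, so redundant with f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_redundant(features)
```

The result depends on the order in which features are visited, which is one reason such filters are usually judged by validation score and not treated as a fixed rule.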
Hello hackers,
I need some help. I'm training a conversation disentanglement model using this repo: https://github.com/jkkummerfeld/irc-disentanglement
It will be used to prepare a conversation dataset for a project.
I don't have access to compute resources that can run continuously for five days. I'm using Google Colab, but sessions eventually stop when the tab closes or times out. I also can't afford a cloud provider right now.
If anyone has a home setup that can run uninterrupted for several days and is willing to help, I would really appreciate it. Thanks!
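If no uninterrupted machine turns up, a general workaround for sessions that get cut off is to save progress regularly and resume from the last save. The sketch below shows only the pattern in plain Python. I have not checked whether the linked repo supports resuming, and the function name, file name, and "work" step here are all placeholders.

```python
import json
import os
import tempfile

def run_with_checkpoints(total_steps, ckpt_path, stop_after=None):
    """Resume from ckpt_path if it exists; save progress after every step."""
    state = {"step": 0, "running_sum": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    done_this_session = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_session >= stop_after:
            break  # simulates the session being cut off
        state["running_sum"] += state["step"]   # placeholder for one unit of training
        state["step"] += 1
        done_this_session += 1
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)  # swap in the new file only once fully written
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "progress.json")
first = run_with_checkpoints(10, ckpt, stop_after=4)   # interrupted session
second = run_with_checkpoints(10, ckpt)                # resumed session
```

In a Colab setting the checkpoint path would need to live somewhere that survives the session ending, such as mounted storage.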
hello
Hi everyone,
I'm currently working on a machine learning project for crypto trading using XGBoost, and I'm struggling with something that I can't fully clarify.
My setup:
- Minute-level data (order book + derived features)
- Predicting a continuous target (entry/exit signals based on future price movement)
- Using XGBoost with time-series CV (Spearman correlation as main metric)
The issue I'm facing is related to feature engineering vs feature selection.
I'm not sure how to properly decide:
- When should I create new features (interactions, ratios, transformations)?
- When should I remove features (low correlation, low importance, redundancy)?
- Should I remove highly correlated features between themselves (e.g. corr > 0.8), or keep them since XGBoost can handle that?
What confuses me is that I get inconsistent guidance:
- Sometimes it's recommended to generate many new features and filter them
- Other times to aggressively reduce features to avoid noise and overfitting
In practice, I observe:
- If I use too many features -> model becomes too smooth / low amplitude predictions
- If I reduce too much -> model becomes unstable or noisy
So my core question is:
Is there a clear logic or framework (ideally math-based) for deciding feature creation and selection in tree-based models like XGBoost?
Or is it purely empirical (based on validation performance)?
Any insights, best practices, or resources would be really appreciated.
Thanks!
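If the decision does end up being empirical, the comparison between feature sets is only as trustworthy as the validation scheme behind it. A minimal sketch of walk-forward splits with a gap between training and validation, which matters when the target looks into the future. The function name, fold count, and gap size are assumptions for illustration, not taken from the setup described above.

```python
def walk_forward_splits(n_samples, n_folds, gap=0):
    """Yield (train_indices, valid_indices) with training always before validation."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        valid_start = train_end + gap          # skip rows whose target overlaps training
        valid_end = min(valid_start + fold_size, n_samples)
        yield range(0, train_end), range(valid_start, valid_end)

splits = list(walk_forward_splits(n_samples=100, n_folds=4, gap=5))
```

Each candidate feature set would then be scored on the same folds (for example with the Spearman metric mentioned above), and a feature kept only if it helps consistently across folds and not on one fold alone.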
Happy Weekend!
Hello Everyone!
If you know someone who has good skills in Python and machine learning, please introduce them to me!
Our company is looking to hire Python and software engineers.
Requirements:
2+ years of software engineering experience
C1 or native English level
Good awareness of software trends
Benefits:
Competitive income
Support for several roles and opportunities
Working in multiple roles is possible
Important:
Our company is designed for capable people.
Questions:
For junior people?
Do not give up; strong enthusiasm is also a big plus, and our company also values a person's enthusiasm.
Thanks again.
Sophia