#data-processing

cedar forge
tropic sky
#

👋

lost star
#

👋🏾

spare swan
#

Hello guys, does anyone have some experience processing speech/audio data?

rustic hollow
#

hello everyone

hot star
#

Hello team

royal seal
flint glade
#

Hello everyone! Can anyone suggest some learning materials or books on data processing?

clever thunder
#

Use Google Colab?

kind cedar
#

Hello there! We're currently developing a cost-effective storage solution for version-controlled datasets and large file storage (LFS). Would you be interested in trying it out? You can find more details and access the service here: https://underhive.in/ Your feedback and support would be greatly appreciated!

valid swan
fair bough
#

Perhaps this question belongs here, though it relates more to data cleaning/feature engineering. I've seen several articles discussing different processes for normalizing and standardizing data (such as https://www.datacamp.com/tutorial/normalization-in-machine-learning). Time and time again, I see the dataset being split into train/test and a normalization technique fitted on the training set. The same fitted normalization is then applied to the test set. How is this not an example of data leakage? The parameters used to normalize the training set directly influence the normalization of the test set...
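
A minimal sketch of the procedure described in this question, using synthetic data and plain NumPy in place of whatever scaler the linked tutorial uses (both are assumptions for illustration). The scaling parameters are estimated from the training rows alone and then reused on the test rows, so information flows from train to test only. Leakage would be the reverse: computing `mu` and `sigma` on the full dataset before splitting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data, for illustration only: 100 rows, 3 numeric features.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# Split first, so the test rows play no part in estimating the parameters.
X_train, X_test = X[:80], X[80:]

# "Fit": estimate mean and standard deviation from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "Transform": apply the same train-derived parameters to both splits.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
```

The test split ends up only approximately standardized, which is expected: it is treated like unseen data that arrives after the parameters are fixed.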

calm hazel
mild nimbus
#

Hi, I have a question about dividing audio for a classification task. I want to understand how audio is segmented into chunks of, say, 5 or 10 seconds, whether based on the amplitude or on the mel spectrogram. With random clipping, for example cutting a 60-second audio into 6 parts of 10 seconds, I fear I might lose important features if a split falls in the middle of one. Thank you!
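
One common way to soften the boundary problem raised here is to use overlapping windows, so that an event cut by one boundary falls inside a neighbouring window. The sketch below assumes a mono signal already loaded as a 1-D NumPy array that is at least one window long; `sliding_windows`, the 10 s window, and the 5 s hop are illustrative choices, not values from the question.

```python
import numpy as np

def sliding_windows(signal, sample_rate, window_s=10.0, hop_s=5.0):
    """Return overlapping fixed-length windows of a 1-D signal as a 2-D array."""
    win = int(window_s * sample_rate)   # window length in samples
    hop = int(hop_s * sample_rate)      # step between window starts
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

sr = 100  # hypothetical low sample rate to keep the example small
audio = np.arange(60 * sr, dtype=np.float32)  # stand-in for 60 s of audio
chunks = sliding_windows(audio, sr)  # each row overlaps the next by 5 s
```

With a hop equal to half the window, every sample away from the two ends of the recording appears in two windows, at the cost of roughly twice as many training examples.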

shrewd lily
#

I was reading about PCA and had a question. If I have a dataset of size n×m and I apply PCA to get a transformed dataset of size n×m (I am not reducing the number of features), the transformed features are all uncorrelated with each other. So is it always good to apply PCA to remove the correlation among features? I have never seen any Kaggle notebooks doing this; PCA generally seems to be used only for reducing the number of features.
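
A small check of the claim in this question, on synthetic data (an assumption for illustration): projecting centered data onto all eigenvectors of its covariance matrix keeps the shape n×m and removes linear correlation between the new columns. It only demonstrates the decorrelation itself; whether that helps a given model is a separate question and is not settled by this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)  # strongly correlated with x1
X = np.column_stack([x1, x2])

# PCA without dropping components: rotate onto the covariance eigenvectors.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
Z = Xc @ eigvecs                        # same shape as X

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.1e}")
```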

void wharf
#

Hello everyone, I have a question and would really appreciate your assistance.
I have 2 data files of networking and IP addresses in .RR format (e.g. myipv6add.RR, myipv6add2.RR) and I want to extract them into a MySQL file. How can I write a Python script to do that?

small python
#

Hi, I'm working on a trading algorithm and I built a large dataset with at least 50k rows. My problem is that, because of the way I built it, the dataset is very biased (though it is real data with a baseline accuracy), so maybe I can use SMOTE for data augmentation, or class_weight with a Keras model. Right now I'm trying to improve my algorithm with Markov chains, and I'm trying to apply a smoothing_factor to the transition_matrix, but it's not working at all. Any idea which way I should go?

hot plinth
#

You initially talk about data augmentation/class weights, which makes it seem like a standard NN task; then you talk about Markov chains, which makes it seem like online learning; then a transition matrix, which points towards reinforcement learning.

Please decide what exactly the task is, rather than using techniques blindly.

sterile nova
#

Hi, I'm new to machine learning. I tried the latest Playground episode on Kaggle, and it has a huge CSV file that I can't load entirely into RAM. Does anyone have suggestions for this?
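
One possible approach, sketched with the standard library only: stream the file in fixed-size chunks and keep running aggregates instead of holding every row in memory. The file name, column names, and `iter_chunks` helper are hypothetical; pandas users can get similar behaviour from the `chunksize` argument of `pandas.read_csv`.

```python
import csv
import itertools
import os
import tempfile

def iter_chunks(path, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) without loading the whole file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

with tempfile.TemporaryDirectory() as tmp_dir:
    # Small stand-in for the huge competition file (hypothetical columns).
    path = os.path.join(tmp_dir, "demo.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows((i, i * 2) for i in range(10))

    # Only one chunk of rows is in memory at any time.
    total = 0
    n_chunks = 0
    for chunk in iter_chunks(path, chunk_size=4):
        total += sum(int(row["value"]) for row in chunk)
        n_chunks += 1
```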

dim hamlet
#

Yes, I'm also new to ML. I'm very enthusiastic and curious to start learning it. What is the learning path? How do I practice ML using Kaggle?

fallow ginkgo
#

Does anyone have good resources for learning genetic algorithms for scientific research?

#

I'm currently reading Genetic Algorithms with Python, but I want a book more tailored to research applications.

mint sequoia
#

Hi, I am Abdullah. I am an ML engineer and want to join a team to participate in Kaggle competitions.

graceful lotus
hidden sapphire
#

Hello everyone! 😊

I hope you're all doing well. I’ve been actively working on creating and sharing insightful Kaggle notebooks, and I’d truly appreciate your support. It would mean a lot if you could take a moment to visit my Kaggle profile and check out my notebooks:

🔗 https://www.kaggle.com/sajjadalishah/code

If you find them helpful, I’d be grateful if you could upvote them. Your encouragement and feedback inspire me to keep learning and contributing to the community.

Thank you so much for your time and support! 🙏

glass wigeon
#

Hey all, I just wrote a discussion post about my research focus, XAI. Specifically, this post was about model distillation and how inexperienced junior data scientists will tend to just throw the biggest, most complicated possible model at a problem with no regard for what is truly the best solution. Please give it a read and leave your thoughts in the comments. Thanks! 😄

https://www.kaggle.com/discussions/general/562312

median geyser
#

Hello everyone, I have a question about removing multicollinearity among features using VIF. I have seen models where no constant column is added before computing vif_score. Is it necessary to add a constant column to compute vif_score? Thanks!
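
To make the role of the constant concrete, here is a from-scratch sketch of VIF as 1 / (1 - R²), where R² comes from regressing each feature on the others. It is not any particular library's implementation; the `vif` helper, its `add_constant` flag, and the synthetic features are assumptions. With independent features that have a large non-zero mean, leaving out the intercept inflates the score even though the features are unrelated.

```python
import numpy as np

def vif(X, add_constant=True):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the remaining columns."""
    n, p = X.shape
    scores = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        if add_constant:
            others = np.column_stack([np.ones(n), others])
            tss = np.sum((y - y.mean()) ** 2)  # centered total sum of squares
        else:
            tss = np.sum(y ** 2)               # uncentered, as in a no-intercept fit
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        rss = np.sum((y - others @ beta) ** 2)
        scores.append(tss / rss)               # algebraically equal to 1 / (1 - R^2)
    return np.array(scores)

rng = np.random.default_rng(0)
# Two independent features with a large mean (synthetic, for illustration).
X = rng.normal(loc=100.0, scale=1.0, size=(500, 2))

vif_with_const = vif(X, add_constant=True)
vif_without_const = vif(X, add_constant=False)
```

In this setup the scores with a constant stay close to 1, while the scores without it are far larger, purely because both columns share a large mean.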

weak fern
#

Hey, I have a question you may find trivial: why use a standard scaler before doing PCA or clustering? Doesn't the standard scaler change the relative distances between instances and therefore bias the clustering?

stoic flicker
prisma tree
# weak fern Hey, I have a question you may find trivial but why using a standard scaler befo...

Although it's a little distance (time-wise) away, I'd like to add my 2 cents 😛

I recently learnt the math behind PCA and have yet to use it, because I want to do sensitivity analysis first, which I need to learn more math for.

The answer is that z-scoring (mean 0, standard deviation 1) only shifts and rescales each column, so the shape of its distribution is preserved. Whereas if you used min-max scaling, for example, the range is set by the extreme values, so outliers wouldn't contribute correctly to the variance of the data.

Pearson's correlation matrix is full of dot products between the standardized columns (each divided by the number of samples), and its diagonal (which lines up with the identity matrix) holds each column's variance, since a column's dot product with itself gives its variance.

Sometimes you might even deliberately use a different scaling technique instead of z-score normalization (StandardScaler) to reduce the impact of outliers on the variance, for whatever reason.

As for the values of the dot products, the scalar returned is contextualized by the two vectors that formed it, which is why all columns should ideally be normalized first.

(Not sure about clustering, have yet to perform this myself, but will advise of the answer when I do this)

hallow mortar
# weak fern Hey, I have a question you may find trivial but why using a standard scaler befo...

Features with large ranges can dominate the clustering or PCA process, while features with small ranges might be ignored. However, in my opinion, a factor analysis may be done first to determine the key features, and dimensionality reduction may be attempted before standardization (if domain knowledge of the problem suggests that not all features need to be equally weighted or included in the model).
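
The first sentence of this reply can be illustrated with a short NumPy sketch. The two synthetic features (`income` and `age`, hypothetical names and scales) are independent, yet the first principal component of the raw data points almost entirely along the large-range feature; after standardization both features contribute.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, size=300)  # large numeric range
age = rng.normal(40, 10, size=300)             # small numeric range
X = np.column_stack([income, age])

def first_pc(data):
    """Unit-length direction of maximum variance (first principal component)."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigvecs[:, -1]  # eigh sorts eigenvalues in ascending order

pc_raw = first_pc(X)                           # dominated by income
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each column
pc_std = first_pc(X_std)                       # both features carry weight
```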

wooden knoll
#

Hello everyone, I wanted to know the difference between Gradient Descent, Maximum Likelihood Estimation (MLE), and Ordinary Least Squares (OLS) with respect to linear regression. If anyone knows of a good article on it, please tell me.
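
A rough way to see how the three relate, under the usual assumptions of linear regression: OLS names the objective (minimise the sum of squared residuals, which has a closed-form solution), gradient descent is an iterative optimiser that can minimise that same objective, and MLE with Gaussian noise of constant variance leads to the same squared-error objective. The sketch below uses synthetic data and an arbitrary learning rate to show the first two agreeing numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)  # synthetic line plus noise
A = np.column_stack([np.ones_like(x), x])            # design matrix with intercept

# OLS: closed-form least-squares solution.
w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient descent on the same mean-squared-error objective.
w_gd = np.zeros(2)
lr = 0.1  # arbitrary step size that happens to converge for this data
for _ in range(5000):
    grad = 2.0 / len(y) * A.T @ (A @ w_gd - y)
    w_gd -= lr * grad

print("OLS:", w_ols, "GD:", w_gd)
```

Both routes land on the same intercept and slope here; they differ in how the minimum is reached, not in what is being minimised.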

vital venture
#

Hey everyone!
I’m working on the CMI – Detect Behavior with Sensor Data Kaggle competition, where the goal is to classify BFRB vs non-BFRB behaviors using wrist-worn sensor data (TOF, IMU, pressure, etc.).

https://www.kaggle.com/competitions/cmi-detect-behavior-with-sensor-data

I’ve trained an LSTM using PyTorch and got surprisingly strong results (i.e. accuracy = 93), which makes me worry about potential data leakage or preprocessing issues...

Here’s what I did to avoid leakage:

- Split data by sequence ID, no overlap between train/test

- Fit MinMaxScaler only on the training set, then applied to both

- Replaced NaNs, -1, and inf values with 0 before scaling

However, since 0 is a valid sensor reading, replacing missing/invalid values with 0 might introduce bias. I'm unsure whether I should switch to median, KNN, or use masking instead.

If anyone has experience with sensor data or wants to take a look at the code, I’d really appreciate the help and am happy to include collaborators in the Kaggle submission team! Just DM me or reply here.
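
A hedged sketch of the masking option mentioned above: invalid readings are still filled with 0, but a parallel 0/1 mask tells the model which zeros are real measurements. The `clean_with_mask` helper and the tiny array are illustrative; treating -1 as the invalid marker follows the post, everything else is an assumption.

```python
import numpy as np

def clean_with_mask(x, invalid_value=-1.0):
    """Return (filled, mask): invalid entries set to 0 and flagged in a 0/1 mask."""
    x = np.asarray(x, dtype=float)
    valid = np.isfinite(x) & (x != invalid_value)  # False for NaN, inf, and -1
    filled = np.where(valid, x, 0.0)
    return filled, valid.astype(float)

# Tiny stand-in for sensor readings; 0.0 in the first row is a genuine value.
readings = np.array([[0.0, 5.0],
                     [np.nan, -1.0],
                     [2.0, np.inf]])
filled, mask = clean_with_mask(readings)

# Feed values and mask side by side so real zeros and filled zeros differ.
features = np.concatenate([filled, mask], axis=1)
```

If a scaler is used as well, fitting it on the valid training values only would keep the filled zeros from distorting the estimated range.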

rain cape
still spear
#

Job Title: Part-Time Senior AI/ML Engineer (Remote)

We are seeking a skilled and experienced Senior AI/ML Engineer to join our remote team on a part-time basis. The ideal candidate will have a strong technical background, excellent communication skills, and the ability to work independently in a fast-paced environment.

Requirements:
- Minimum of 7–10 years of professional software development experience

- Proven experience working effectively in a remote environment

- Advanced English proficiency (C1 or higher); an American accent is preferred

- Availability to work 10–15 hours per week during EST or CST business hours

If you're a highly motivated engineer with a passion for building high-quality software and can commit to a flexible part-time schedule, we’d love to hear from you.
You can connect with me on WhatsApp: +1 (567) 469-5384

safe belfry
#

Hi, @everybody
I have a question. I'm training ML models for a 3-class classification problem where the classes have similar numbers of samples, but the predictions are skewed.
The first and second classes are predicted, though with low precision, while the third class is never predicted. What could be the reason? I can't find it.
Previously, when I applied reinforcement learning with the three classes assigned to three actions, one action was never selected either.
The model is actually a prediction model for forex EUR/USD.

lofty iron
#

J4C4U

Caring
And
Sharing
Honesty

The definitive reason for FUNDS and what our earnings are made of, I.T is the reality check of how we live and what purpose defines us.

For You and your community, these are creations made around and in all that we all exist for to both elevate our purpose and alleviation of all that challenges these goals.

https://maps.app.goo.gl/1tHMC1yYBneUqryB7

lyric trench
#

RudraDB-Opin 1.0.0
🎉 Just dropped: The world's first FREE context-aware vector+graph database!
⚡ Why VibeCoding devs will love this:

🔥 Traditional vs RudraDB-Opin:
- Traditional: "Find similar documents"
+ RudraDB-Opin: "Find connected documents through relationships"

🚀 Zero-friction setup:
pip install rudradb-opin # 100% Free Forever!

✨ Context-aware intelligence:

🧠 Relationship-aware search - Finds connections similarity misses
🎯 Auto-dimension detection - Works with ANY embedding model
🔗 Multi-hop discovery - Traverses relationship chains
⚡ Context preservation - Maintains semantic relationships

💎 Perfect for rapid prototyping:

100 vectors + 500 relationships (ideal for demos/POCs)
5 relationship types (semantic, hierarchical, temporal, causal, associative)
Same API as production version (seamless scaling)

🛠️ Speed up your AI builds:

RAG systems with intelligent document connections
Chatbots that understand conversation context
Knowledge apps with relationship discovery
Recommendation engines beyond similarity

⚡ From zero to context-aware AI in 3 lines:
db = rudradb.RudraDB() # Auto-detects dimensions
db.add_relationship("intro", "advanced", "temporal", 0.9)
results = db.search(query, include_relationships=True) 🎯

🔗 Try it: pip install rudradb-opin
📚 Docs: rudradb.com

Who's building with relationship-aware search? Share your experiments! 👇

Innovation isn't iteration - it's transformation. 🚀

devout cairn
#

https://www.kaggle.com/datasets/ziya07/high-speed-train-bogie-vibration-and-fault-diagnosis/data
This is a dataset of train bogie vibrations. I have tried everything: extracting time-domain features, frequency-domain features, and time-frequency features such as wavelets. I tried classical ML, 1D convolutions on the raw data, a sliding-window approach with 2D convolutions, and anomaly detection, but I can't get the accuracy above 55%. Please help me understand this data and how to model it.

safe belfry
#

I'm looking for a US developer to collaborate with. If anybody is interested, please DM me.

peak vault
#

Hello Kaggle community,

I have a question that you might find trivial. Given a tabular task (so basically we use classic ML algorithms such as LR, RF, XGBoost, ...), should we add or remove features (assume all features are numerical)? When, how, and why?

I would like to get this clear in my head, because when I ask ChatGPT or similar LLMs, the answers are inconsistent and seem biased by how I phrase the question. Sometimes it says it is good to create many new features from the original ones (for example using the sin function, or the product of two features) and then set a threshold on their correlation with the target to remove "uninformative" features. Sometimes it says that if you have three highly correlated features you should keep only one of them, since the information is redundant (so in the correlation matrix we can find a small square block along the diagonal where the correlation coefficients are very high, e.g. > 0.8).

In short, my understanding of "good" practice here is pretty blurry. I would really appreciate a clear, logical (math-based) answer, or a suggestion of where I could find one!
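
The "keep one of several highly correlated features" heuristic mentioned here can be sketched as a greedy filter. This is one common convention rather than a definitive rule; the 0.8 threshold comes from the question, while `drop_correlated` and the synthetic columns are hypothetical.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.8):
    """Keep a feature only if |corr| with every already-kept feature is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

rng = np.random.default_rng(0)
a = rng.normal(size=400)
b = a + 0.1 * rng.normal(size=400)  # near-duplicate of a
c = rng.normal(size=400)            # independent of a and b
selected = drop_correlated(np.column_stack([a, b, c]), ["a", "b", "c"])
print(selected)
```

The result depends on column order (the first feature of a correlated group wins), which is one reason this kind of filter is usually checked against validation performance rather than trusted on its own.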

crimson sleet
#

Hello hackers,

I need some help. I’m training a conversation disentanglement model using this repo: https://github.com/jkkummerfeld/irc-disentanglement
It will be used to prepare a conversation dataset for a project.

I don’t have access to compute resources that can run continuously for five days. I’m using Google Colab, but sessions eventually stop when the tab closes or times out. I also can’t afford a cloud provider right now.

If anyone has a home setup that can run uninterrupted for several days and is willing to help, I would really appreciate it. Thanks!

sturdy glade
#

hello

#

Hi everyone,

I’m currently working on a machine learning project for crypto trading using XGBoost, and I’m struggling with something that I can’t fully clarify.

My setup:

  • Minute-level data (order book + derived features)
  • Predicting a continuous target (entry/exit signals based on future price movement)
  • Using XGBoost with time-series CV (Spearman correlation as main metric)

The issue I’m facing is related to feature engineering vs feature selection.

I’m not sure how to properly decide:

  • When should I create new features (interactions, ratios, transformations)?
  • When should I remove features (low correlation, low importance, redundancy)?
  • Should I remove highly correlated features between themselves (e.g. corr > 0.8), or keep them since XGBoost can handle that?

What confuses me is that I get inconsistent guidance:

  • Sometimes it’s recommended to generate many new features and filter them
  • Other times to aggressively reduce features to avoid noise and overfitting

In practice, I observe:

  • If I use too many features → model becomes too smooth / low amplitude predictions
  • If I reduce too much → model becomes unstable or noisy

So my core question is:
👉 Is there a clear logic or framework (ideally math-based) for deciding feature creation and selection in tree-based models like XGBoost?

Or is it purely empirical (based on validation performance)?

Any insights, best practices, or resources would be really appreciated.

Thanks!
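
If the decision does end up being empirical, one way to keep comparisons fair is to score every candidate feature set on the same walk-forward splits with the same metric. The sketch below is an assumption-laden outline: `walk_forward_splits` and `spearman` are hypothetical helpers, the expanding-window layout is one choice among several, and the rank computation ignores ties.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, valid_idx) where validation always follows training in time."""
    fold = (n_samples - min_train) // n_folds
    for i in range(n_folds):
        end_train = min_train + i * fold
        yield np.arange(end_train), np.arange(end_train, end_train + fold)

def spearman(a, b):
    """Spearman rank correlation for tie-free inputs (Pearson correlation of ranks)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Fixed splits that every candidate feature set would be evaluated on.
splits = list(walk_forward_splits(n_samples=100, n_folds=4, min_train=20))

# Rank correlation only cares about ordering, not scale.
rho_same_order = spearman(np.array([1.0, 2.0, 3.0, 4.0]),
                          np.array([1.0, 4.0, 9.0, 16.0]))
rho_reversed = spearman(np.array([1.0, 2.0, 3.0]),
                        np.array([3.0, 2.0, 1.0]))
```

Fitting the model on each `train_idx` and computing `spearman` between predictions and targets on the matching `valid_idx` would give one score per fold for each feature set.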

fallen stream
#

Happy Weekend!

Hello Everyone!
If you know someone with good skills in Python and machine learning, please refer them to me!

Our company is looking to hire Python and software engineers.

Requirements:
2+ years of software engineering experience
C1 or native English level
Good awareness of software trends

Benefits:
Competitive income
Support for several roles and opportunities
Working in multiple roles is possible

Important:
Our company is designed for capable people.

Questions:
For junior people?
Do not give up; strong enthusiasm is also a big plus, and our company also values a person's enthusiasm.

Thanks again.
Sophia