#💾┊data


fallow peak
wintry tree
fallow peak
wintry tree
fallow peak
quiet cypress
#

Is it ok to post our datasets here?

fallow peak
# quiet cypress Is it ok to post our datasets here?

If it's a dataset of yours you'd like to promote, I recommend sharing in #🔗┊sharing-projects. If people only promote their datasets in this channel, it won't be useful to many people. On the other hand, if someone in this channel asks for a data source and you feel yours is a good match then it would make sense to post in here.

#

^^ anyone interested in #💾┊data please feel free to chime in with your thoughts on what you'd like to see / not see in this channel

fallow peak
fallow peak
#

BTW for folks interested in #💾┊data, the Kaggle Datasets product team (incl. myself, engineers, designer) is thinking of doing an AMA in Discord in the next few weeks or so. Would people be interested? Anything in particular you'd look forward to learning or talking about with us?

hardy bay
#

I want to see how different countries move in the FIFA ranking table. The website https://www.fifa.com/fifa-world-ranking has the rankings. Can I build a small script to scrape this website, say once a day, and make that a daily updated dataset on Kaggle?

I looked around to see whether something like that exists and didn't find any datasets on rankings:
https://www.kaggle.com/datasets/?search=fifa
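
For the "once a day" part of the question above, here is a minimal sketch of the storage side only: appending each day's rows to one growing CSV and skipping a date that is already stored, so a rerun does not duplicate data. The scraping step is left out. The column names, the `append_snapshot` helper, and the sample rows are all hypothetical, not taken from the FIFA site.

```python
import csv
import os
import tempfile

# Hypothetical column layout for the accumulated dataset
FIELDS = ["date", "rank", "team", "points"]

def append_snapshot(path, date, rows):
    """Append one day's rows to a CSV, skipping dates that are already stored."""
    exists = os.path.exists(path)
    if exists:
        with open(path, newline="", encoding="utf-8") as f:
            if any(r["date"] == date for r in csv.DictReader(f)):
                return False  # this date was already written
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if not exists:
            writer.writeheader()
        for row in rows:
            writer.writerow({"date": date, **row})
    return True

# Made-up rows standing in for whatever a scraper would return
today = [
    {"rank": 1, "team": "Team A", "points": 1800.5},
    {"rank": 2, "team": "Team B", "points": 1790.0},
]
path = os.path.join(tempfile.mkdtemp(), "rankings.csv")
first = append_snapshot(path, "2024-01-01", today)
second = append_snapshot(path, "2024-01-01", today)  # same day again, skipped
```

Keeping every snapshot in a single long table with a `date` column makes it straightforward to plot how a team's rank moves over time.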

fallow peak
harsh quartz
#

Excited to share my Cat vs Cat Loaf 100x100 RGB Image Classification dataset! 🐱🍞 Perfect for image classification tasks.
I'd greatly appreciate an upvote to support my work.
Check it out here: https://www.kaggle.com/datasets/erogluegemen/cat-catloaf-classification

This dataset contains a collection of Turkish dictionary definitions extracted from the official website of the Turkish Language Association (TDK). It provides comprehensive definitions for a wide range of Turkish words and phrases. I don't know if you are interested but upvotes are highly appreciated!
Dataset Link: https://www.kaggle.com/datasets/erogluegemen/tdk-turkish-words

You can check out both of my datasets!! I'd be glad to hear your feedback 🥳

bold delta
bold delta
harsh quartz
bold delta
harsh quartz
#

I'm Turkish but my girlfriend is Russian/German 😂🥳

formal surgeBOT
verbal jay
#

Anyone ever used duckdb here?

serene tiger
#

Hii!
I was wondering if someone could help me find a dataset that can be used for a water footprint calculator project

#

I don't know if I phrased that right, since I am really new to ML, but any response is greatly appreciated!

candid python
#

https://www.kaggle.com/datasets/yaranathakur/ipl-all-time-best-batsman
Check out my dataset, recently published on Kaggle

"All-Time-Best-Batsman.csv" provides a condensed compilation of statistical data and performance metrics from the Indian Premier League (IPL). Spanning multiple seasons, the dataset highlights key batting statistics of legendary cricketers, showcasing their contributions to the league's history. Whether you're a cricket aficionado or an analyst, "All-Time-Best-Batsman.csv" offers insights into the runs, strike rates, and other statistics that have defined the IPL's thrilling cricketing saga.

ruby viper
#

During my research internship, I have been working with a lot of tabular Wikipedia infobox data. My work mostly revolves around the temporal aspect of this data, but I thought I could use the work done during this time to create a dataset of Wikipedia infobox data for all cricketers found on Wikipedia.

So, here it is,
Link to the Cricketer Infobox Dataset: https://www.kaggle.com/datasets/varunnagpalspyz/uncover-cricket-legends-cricketers-wikidata
Link to the Notebook which contains code for clean and efficient extraction of Wikipedia Infoboxes in JSON format: https://www.kaggle.com/code/varunnagpalspyz/uncover-cricket-legends-data-extraction-with-ease/notebook

If anyone is working with such semi-structured data and is interested in taking up projects in this domain or knows of any work opportunities in this domain, do let me know.

maiden walrus
fallow peak
shy gust
tawdry nexus
#

Hello, everyone. Do any of you know where I can find sources to create my own dataset? I would like to create a project or dataset that predicts how long lettuce takes to grow based on temperature, humidity, TDS value, pH level, and nutrient solutions in a controlled environment. Thank you in advance.

swift bough
#

Sounds like a Hydroponics project you're working on,
I had a mini-hydroponics project going a year ago whose data is on my Kaggle. I had to pause it since I couldn't control the environment during budding season

tawdry nexus
wanton topaz
#

Turkey Earthquakes Data (1994-2023) . https://www.kaggle.com/datasets/ozgecinko/turkey-earthquake-data-1914-2023

This is my first dataset and I've just published it on Kaggle! Since my country is located in an earthquake-prone zone, I have been searching for data on this subject, and I look forward to working with this data very soon. I want to express my gratitude to this Discord server for providing this opportunity to share. 🫶

I would be happy to hear your feedback!

lucid flame
#

Hello Kaggle Community!

Exciting news - I've just uploaded a multilabel tweet dataset containing three columns:

Tweet ID (String Format),
Tweet Text: The tweet's actual content ,
Labels: These cover a wide range of concerns, including effectiveness doubts and conspiracy theories.

Ideal for sentiment analysis, NLP, and multilabel classification, this dataset offers insights into diverse vaccine concerns shared on Twitter.
Explore it for your projects and research.

https://www.kaggle.com/datasets/prox37/twitter-multilabel-classification-dataset

obsidian idol
cosmic palm
obsidian idol
#

But I don't understand how I can convert an ECG into the same format to make predictions.

shadow river
#

Hello there, I am currently working on a Japanese comment sentiment-analysis model, for which I tried to find some good Japanese comment datasets (e.g. tweets, YouTube comments). I couldn't come across a proper one, so I wanted to ask whether there are any datasets like this available.
The best dataset I could find related to this is: Kojima Hideo Tweets

shell basalt
#

Hello Kaggle Community,

I'm currently working on a project analyzing two decades of Premier League soccer data with the goal of creating a predictive model. However, I'm new to soccer datasets. If anyone has experience or insights to share on soccer data analysis and regression modeling, I'd greatly appreciate your guidance.

Specifically, I'm interested in predicting full time outcomes from half-time data, and predictive modeling based on the historical data. Your tips, resources, or collaboration would be invaluable.

Please reply or reach out if you can help. Thank you!

lusty shore
#

Does anyone have an image dataset for tooth disease?

cobalt sonnet
iron stone
#

Hi.
I'm looking for an image dataset of small plastics on the beach.
The idea is to identify plastic straws, candy wrappers, popsicle packaging, plastic bags, bottle caps, plastic labels, etc.
Can someone point me to a project like this or similar?
Thank you.

zinc dew
#

How does dataset copyright work? Are datasets protected by copyright? Do I have to give credit in my project if I am using someone’s dataset, or are they all open?

shy gust
tough sleet
#

I want a dataset for virtual clothes try-on. Can anyone help me out with this?

idle crescent
#

Hi everyone, I am new to Kaggle and the world of data science. I have created my first dataset, based on pictures. Please check it out, upvote, and give me advice, because I know I'll need it; I have just started out.

https://www.Kaggle.com/minhajalii/datasets

old cobalt
wanton wadi
wanton wadi
quasi plume
#

how does this line of code work

wanton wadi
uneven marsh
#

Does anyone have a training dataset of Braille English characters?

opaque birch
#

🌍 Introducing "AFRICA: Soil Analysis for iSDAsoil Mapping" Dataset!

📊 Dive into the rich soil data of Africa with my meticulously curated dataset, crafted for iSDAsoil mapping exploration. This comprehensive collection offers a cleaned and structured repository of invaluable soil analysis.

🔗 Dataset URL: https://www.kaggle.com/datasets/agungpambudi/africa-soil-mapping-isdasoil-exploration

🌱 Why Explore This Dataset?

Clean: Ensuring accuracy and reliability
Rich Insights: Uncover a treasure trove of insights crucial for soil mapping and exploration within the African continent.
iSDAsoil Ready: Tailored for iSDAsoil analysis, this dataset simplifies the process for researchers, enthusiasts, and data-driven explorers.

🌟 Key Features:

Diverse soil attributes
Spatial and temporal data for comprehensive analysis
User-friendly, ready for immediate utilization in iSDAsoil tools
Join the journey to unravel the secrets hidden in Africa's soil! Whether you're a researcher, analyst, or enthusiast, this dataset is your gateway to valuable discoveries.

🌐 Don't miss out — Your feedback and contributions are highly welcomed!

tired bolt
#

Hello everyone!

I've created a new dataset that contains school performance of high school students, as well as their demographic, social, parent, and study data.

If you're interested in education and predicting student outcomes I think you'll really enjoy this dataset! I look forward to seeing what you make with it!

https://www.kaggle.com/datasets/dillonmyrick/high-school-student-performance-and-demographics

trim halo
#

Hey everyone 🌞🤗, I recently wrapped up my final project for KaggleX Cohort. As part of my final project I created two datasets, which I would like to share with the community. The inspiration behind my project was to explore the representation of BIPOC in data science, and different aspects like gender-ratio, unemployment etc.
1. Tech Diversity Dataset: https://www.kaggle.com/datasets/snehilsanyal/tech-diversity-dataset
This is a collection of real diversity datasets collected from big tech companies' diversity reports from 2014-2023 (soon to be updated with other companies).
2. US Data Scientist Demographics Data: https://www.kaggle.com/datasets/snehilsanyal/us-data-scientist-demographics-data/
This dataset explores data scientist demographics data in the US (race and ethnicity, gender-ratio, unemployment rate) from 2010-2021.

Please feel free to reach out in case of suggestions and feedback. I also plan to extend this dataset and explore features like dropouts in career, layoffs, career transitions and salary.

harsh palm
#

Hey, hi guys! The above dataset is extracted from Replit bounties section in order to help people know more about the freelancing market and pricing analysis based on the descriptions and titles of the bounties. The dataset can help normal folks like us to understand freelancing in a much more robust sense. Thank you

old cobalt
limber cipher
#

How does one create data

#

Surely data scraping is not the only way

old cobalt
muted dome
#

Looking for a Facebook comment database for time series and sentiment analysis.

sinful loom
#

Hello

indigo raven
#

Is it normal for the icon of a .json file saved in Kaggle to be marked as {i}? If not, how do I fix it?

blazing nest
#

Hi.
Can someone provide (or at least link to) a dataset like the UCF crime dataset but with bigger pictures, so that I may be able to annotate the images easily?
Thank you.

old cobalt
final lark
opaque birch
#

📊 Discover Insights with Kaggle Datasets!

Hi folks, I have uploaded these datasets. If you have time, then check this out and upvote:

☕ Coffee Shop Sales Trends . https://www.kaggle.com/datasets/agungpambudi/trends-product-coffee-shop-sales-revenue-dataset

Explore revenue patterns and product trends to boost your coffee shop business.

🍽️ Global Restaurant Orders Analysis . https://www.kaggle.com/datasets/agungpambudi/analyzing-restaurant-orders-international-dataset

Optimize your menu and operations by delving into international restaurant order data.

✈️ Airline Loyalty Impact . https://www.kaggle.com/datasets/agungpambudi/airline-loyalty-campaign-program-impact-on-flights

Decode the impact of loyalty campaigns on flights and enhance customer experience.

🚗 NZ Vehicle Theft Patterns . https://www.kaggle.com/datasets/agungpambudi/nz-crime-chronicles-motor-vehicle-theft-patterns

Enhance community safety by analyzing motor vehicle theft patterns in New Zealand.

🔍 Unlock Actionable Insights Today!

half basalt
quasi plume
#

hi
I have the raw vocal of a song and I want to split the song so it maps to the lyrics based on the timestamps, to create a dataset for my model. The lyrics look like this:
[00:28.27] फेरि त्यो दिन सम्झन चाहन्न
[00:33.20] त्यही कथा म दोहोर्याउन चाहन्न
[00:38.07] फेरि त्यो दिन सम्झन चाहन्न
[00:42.80] त्यही कथा म दोहोर्याउन चाहन्न
[00:47.64] माया यो आगो हो, पोल्छ, थाहा छ
[00:52.51] आफैलाई जलाउन चाहन्न
How do I do this?
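
One possible first step for the question above is turning the timestamped lyrics into (start, end, text) segments, which can then be used to cut the audio with whatever audio tool is preferred. This is a minimal sketch that assumes the LRC-style `[mm:ss.xx]` tags shown in the message and that each line lasts until the next line starts; the function names and the `track_end` parameter are hypothetical.

```python
import re

# Matches an LRC-style tag such as [00:28.27] followed by the lyric text
LRC_LINE = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]\s*(.*)")

def parse_lrc(text):
    """Return a list of (start_seconds, lyric) tuples in file order."""
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric))
    return entries

def to_segments(entries, track_end=None):
    """Use the next line's start time as the end time of each line."""
    segments = []
    for i, (start, lyric) in enumerate(entries):
        end = entries[i + 1][0] if i + 1 < len(entries) else track_end
        segments.append({"start": start, "end": end, "text": lyric})
    return segments

# Placeholder lyric text; the real lines would go here unchanged
lyrics = """[00:28.27] first line
[00:33.20] second line
[00:38.07] third line"""

segments = to_segments(parse_lrc(lyrics), track_end=42.80)
```

Each segment's `start` and `end` are in seconds, so they can be passed to an audio slicing step to produce one clip per lyric line.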

unreal breach
#

are there any data sets with the pharmaceutical drug names and active ingredients?

limber dew
sand sage
limber dew
# sand sage I would suggest you stay with models like xgboost/random forest for these tabula...

I see. The only reason I chose to use torch is because it's required for a job I'm applying to. Do you think there's a middle ground? Or do I have to either choose a different library or a different competition? Also, I tried to use the same model with more preprocessed data, such as age (I filled the NaN values in a smart way) and still got the same results, does it indicate something?

#

Oh and another question: When I handle a binary problem, and I want to set a threshold to round the outputs of my model into 0's and 1's, is the mode of the results a good threshold?
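
On the threshold question above: the mode of the model outputs is not tied to any evaluation metric, so a common alternative is to pick the threshold on held-out validation data by maximizing a metric you care about. The sketch below uses F1 as an example; the helper names and the toy validation values are made up for illustration.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels):
    """Try every distinct predicted probability as a candidate threshold."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Toy validation set: model outputs and the true labels
val_probs = [0.10, 0.35, 0.40, 0.62, 0.80, 0.91]
val_labels = [0, 0, 1, 0, 1, 1]

threshold = best_threshold(val_probs, val_labels)
best_f1 = f1_at_threshold(val_probs, val_labels, threshold)
```

The chosen threshold should then be kept fixed and applied to the test data, rather than being tuned again there.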

sand sage
# limber dew I see. The only reason I chose to use torch is because it's required for a job I...

If the company you’re applying to needs PyTorch, it is likely that they need it for unstructured data, e.g. computer vision or NLP. You should find a competition that aligns with what they’re working on. (And if they are using neural nets for tabular problems I would be really skeptical of how good their DS team is)

As for the features - I only saw you using max 2 features in your code (but I might have missed something) the dataset has more fields than that so you should look into using all of them (but which ones to use can be guided by your EDA)

limber dew
#

Well, they're actually using PyTorch for non-tabular data, so you're right about that :). Although I just looked online and apparently some people did get results with NNs and even torch. But I might take a different challenge. They probably recommended Titanic to learn ML in general, not necessarily torch..

#

Also, don't worry: the attempt I pointed you to in the post used only 1 feature. I've just added that I also tried more features, but it still didn't work.

#

I think I'll either learn random forests, or just switch to a benchmarked cv problem (I've heard about some). Thanks for the reply!

errant imp
limber dew
jaunty pollen
#

Any housing pricing data?

noble fractal
#

Hello everyone,

Check out this new dataset I've discovered and published on Kaggle:

https://www.kaggle.com/datasets/cauelias/dam-data-to-risk-analysis

This dataset contains a vast amount of information about Brazilian mineral barriers. With 190 columns of rich data, it can be utilized in multiple applications. You can attempt to predict the risk associated with certain barriers, classify them based on the type of minerals, or even utilize regression techniques to analyze the volume.

Take a look and explore the possibilities!

prisma path
#

I need a dataset. Can someone send me a dataset without categorical values?

cosmic palm
prisma path
cosmic palm
old cobalt
#

If you find it useful, do comment

zinc dew
#

Anyone ever taken a look at what the MMLU datasets look like?

restive thicket
#

Noticed Stanford Dogs has a lot of b&w and color pop images: https://www.kaggle.com/datasets/jessicali9530/stanford-dogs-dataset/discussion/486660
Filtering them out appears to improve the accuracy of models trained on the dataset.
I copied another highly rated notebook, https://www.kaggle.com/code/devang/transfer-learning-with-keras-and-efficientnets, added the b&w and color pop filtering, and ran it: https://www.kaggle.com/code/cosmicbee/transfer-learning-with-keras-and-efficientnets (perhaps some bug? although the data seems distinct so seems unclear, it does converge very fast to a close point)
This other model I tried had a less noticeable improvement: https://www.kaggle.com/code/cosmicbee/dog-breeds-classifier although it did converge faster to that point.

restive thicket
restive thicket
spiral shoal
#

I need a dataset of tweets posted by different people. I am a student and I can't afford the API 😭😭😭

sharp plume
#

Hi, I am looking for a large medical dataset that includes cost information

spiral shoal
#

Hello, does anyone know how to become a master at making datasets?

sharp plume
#

I think the idea is simple: you need data that is real and high quality. Depending on your goal, that might be information from official resources or books related to the subject matter, well organised. If the data is conversational, each conversation needs to be focused on one topic, with a clear separation between conversations.

#

But the more specific you are about your goal, the more precise the information should be. An ideal set for professional use covers all the possible scenarios for the profession and all the knowledge of a master of that profession.

zinc dew
neat crypt
#

Does anyone know where I could find datasets on air quality by year and state?? If you do, please reply to this, I would love to know :D

old cobalt
#

Okay sorry maybe a typo

#

Just wanted to say try out my first api

#

Let me know how it is

spiral shoal
#

thanks

wild echo
#

I'm working on something that would do RAG and entity resolution on a company's internal documents. The issue is that companies don't normally like to make those available. Are there any datasets that simulate that? Especially lots of documents with partial context? So far I'm looking at the Enron email dataset, but are there any others? Maybe documents from collaboration on open source projects?

stark wasp
autumn turtle
stark wasp
#

yeah! it's even challenging to load all the CSVs together into a Kaggle Notebook's RAM with pandas without doing some preprocessing or making trade-offs
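
One way to stay within notebook RAM in a situation like the one described above is to stream the files and keep only a running aggregate, instead of loading everything at once. Below is a minimal standard-library sketch of that idea; in pandas, `read_csv` with the `chunksize`, `usecols`, and `dtype` parameters serves a similar purpose. The column names and the in-memory stand-in files are hypothetical.

```python
import csv
import io
from collections import defaultdict

def streamed_totals(files, key_column, value_column):
    """Sum value_column per key_column across several CSV files, one row at a time."""
    totals = defaultdict(float)
    for handle in files:
        for row in csv.DictReader(handle):
            totals[row[key_column]] += float(row[value_column])
    return dict(totals)

# In-memory stand-ins for CSV files on disk
part_1 = io.StringIO("city,sales\nA,10\nB,5\n")
part_2 = io.StringIO("city,sales\nA,2.5\nC,1\n")

totals = streamed_totals([part_1, part_2], "city", "sales")
```

Only the aggregate dictionary is held in memory, so the peak footprint depends on the number of distinct keys rather than on the total number of rows.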

silk summit
#
  • For those interested in Medical Image Segmentation, I'm sharing two preprocessed benchmark datasets for cardiac segmentation.
  • Additionally, weakly-supervised learning, particularly scribble-supervised learning, has been gaining popularity in recent years. This is due to the high cost and difficulty of traditional labeling, especially in the medical field where data sensitivity is paramount.
  • Therefore, each image in my datasets also comes with corresponding scribble labels, facilitating superior learning in cardiac segmentation.
  • Moreover, I've included notebooks to guide you on how to load and visualize the data. To learn more about these datasets and access the code, feel free to visit the links below.
    https://www.kaggle.com/datasets/anhoangvo/acdc-dataset
    https://www.kaggle.com/datasets/anhoangvo/mscmrseg
old cobalt
#

Hey, do any of you guys know where we can sell data?

#

And projects too: where can we sell those?

opaque birch
solar kraken
solar kraken
solar kraken
lone anchorBOT
#
freeman3672 has been warned

Reason: Bad word usage

old cobalt
#

why am I getting this warning?

lone anchorBOT
#
freeman3672 has been warned

Reason: Bad word usage

#
freeman3672 has been banned

Reason: Too many infractions

silk summit
#

A preprocessed dataset for CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation. The dataset also contains scribble labels for weakly-supervised learning. In addition, I also provide a notebook showing how to load and visualize the dataset. Please upvote my dataset and notebook, and leave a comment for me if you liked it.

  1. Dataset:
    https://www.kaggle.com/datasets/anhoangvo/chaos-t1-and-t2
  2. Notebook for loading and visualizing the dataset:
    https://www.kaggle.com/code/anhoangvo/chaos-dataset-loading-and-visualization
still veldt
glad kayak
modern glen
digital hedge
#

🚀 Attention Kagglers! Two new medical datasets are now available for your machine learning projects. Dive into the 🩺📊 Cancer Prediction Dataset to develop models for predicting cancer, and explore the 📊 Predict Liver Disease: 1700 Records Dataset to tackle liver disease prediction. Start experimenting and drive impactful healthcare solutions!

digital hedge
#

Attention Kagglers! We are excited to announce the addition of two new datasets for your machine learning projects. These medical datasets are invaluable for health-related data analysis and predictive modeling:

  1. 🩸 Diabetes Health Dataset Analysis 🩸
    Dive into comprehensive diabetes health data to uncover patterns and insights. This dataset is perfect for those looking to explore the factors influencing diabetes and develop predictive models.

  2. 🩺 Chronic Kidney Disease Dataset Analysis 🩺
    Analyze extensive data related to chronic kidney disease. This dataset offers a rich source of information for understanding the complexities of kidney health and creating impactful machine learning solutions.

Don't miss out on these valuable resources to enhance your data science projects and contribute to the medical field with innovative solutions. Happy analyzing!

harsh cairn
digital hedge
#

I'm excited to announce that two new datasets are now available for you to explore and use in your projects!

🌍 Air Quality and Health Impact Dataset 🌍
Dive into the intricate relationship between air quality and public health. This dataset provides detailed information on air pollutants and their impact on health outcomes. Perfect for those interested in environmental science, public health, and data analysis.
🔗 Explore the Air Quality and Health Impact Dataset

📚 Students Performance Dataset 📚
Uncover the factors influencing students' academic performance. This dataset includes variables such as socioeconomic status, parental education levels, and more. Ideal for education analysts, data scientists, and anyone passionate about improving educational outcomes.
🔗 Explore the Students Performance Dataset

Feel free to dive in, analyze, and share your insights. Happy Kaggle-ing!

Best regards,
Rabie El Kharoua

autumn kite
#

Hey, can anyone help me with how to make my own new dataset?
I've completed an ML course and I'm familiar with the basics, but when making a dataset, how do we decide on the columns and rows, and especially the features and their data values?
I am confused, so can anyone who makes datasets HELP me please

#

You can follow me guys so we can work together

boreal vault
rustic mango
opaque birch
#

🌟 Unlock the Power of MNIST: Comprehensive Analysis of Multiple Datasets!

Discover the ultimate resource for machine learning enthusiasts and data scientists! Dive into MNIST Multiple Dataset Comprehensive Analysis on Kaggle. This dataset provides a detailed comparison and in-depth analysis of various MNIST datasets, making it an invaluable tool for your next project.

💡 Helps us continue to provide high-quality, valuable data to the Kaggle community.

Explore the Dataset at https://www.kaggle.com/datasets/agungpambudi/mnist-multiple-dataset-comprehensive-analysis

rustic mango
worn plinth
opaque birch
indigo cosmos
rustic mango
#

Please do also check out my datasets; many of them are US datasets

#
worn dock
#

Hi guys working on my first DE project and I need some advice-

Scenario-

I have a pipeline loading data from Postgres to BigQuery using Python in a GCP Cloud Function. It loads into a staging table and merges into a production table for further analysis. I would like to accommodate:

  • Incremental loading
  • Changes in the source database, such as (UPDATE, DELETE, INSERT), should replicate in the destination warehouse.
  • If a column is renamed or added, that should also replicate.

From your experience, what’s the most robust and scalable way to approach this?

Open to suggestions 🙏🏻🙏🏻🙏🏻
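
To make the requirements above concrete, here is a minimal in-memory sketch of watermark-based incremental loading with upserts. It assumes the source table has an `updated_at` column and a soft-delete flag `is_deleted`; both column names are hypothetical, and hard deletes in the source would not be visible to this approach without some form of change data capture. In the warehouse itself the upsert step would typically be a `MERGE` from the staging table into the production table.

```python
def incremental_merge(target, source_rows, last_watermark):
    """Apply rows changed since last_watermark to target (a dict keyed by id).

    Returns the new watermark. Assumes each source row carries `id`,
    `updated_at` (comparable, e.g. an ISO-8601 string) and `is_deleted`;
    these column names are hypothetical.
    """
    new_watermark = last_watermark
    for row in source_rows:
        if row["updated_at"] <= last_watermark:
            continue  # already loaded in an earlier run
        if row["is_deleted"]:
            target.pop(row["id"], None)   # replicate DELETE
        else:
            target[row["id"]] = dict(row)  # INSERT or UPDATE (upsert)
        new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark

# Current state of the production table
warehouse = {
    1: {"id": 1, "name": "old", "updated_at": "2024-01-01T00:00:00", "is_deleted": False},
    2: {"id": 2, "name": "gone", "updated_at": "2024-01-01T00:00:00", "is_deleted": False},
}

# Rows read from the source, including one with a newly added column
changes = [
    {"id": 1, "name": "new", "updated_at": "2024-01-02T00:00:00", "is_deleted": False},
    {"id": 2, "name": "gone", "updated_at": "2024-01-03T00:00:00", "is_deleted": True},
    {"id": 3, "name": "added", "email": "x@example.com",
     "updated_at": "2024-01-02T12:00:00", "is_deleted": False},
]

watermark = incremental_merge(warehouse, changes, "2024-01-01T00:00:00")
```

Because whole rows are stored, a column added in the source (here `email`) flows through in this sketch; in a real warehouse the staging step would need to detect and apply the schema change before merging.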

rustic mango
full kayak
#

Hey everyone
Looking for a dataset of some educational institution .. wanna analyze marketing trends ..
any suggestions!!!!

whole quiver
pearl drum
wicked needle
slim rapids
#

Hello everyone
Does anyone know where I can find a dataset about Type II supernovae?

elder seal
#

Hi, I am working on a prototype of a motion sensor with an API that extracts already-labelled information over Wi-Fi in real time: like a data collector but smaller, so the additional weight doesn't bias the movement. Based on your experience, do you have any suggestions for additional features that would be most useful?

zinc dew
zinc dew
zinc dew
bitter island
#

Hello everyone, I'm looking for a disaster tweet dataset from the last 3-5 years. Any suggestions? Thanks

zinc dew
rugged cosmos
#

Hello everyone, I'm looking for metadata on Alzheimer's disease ( MRI and PET).

brisk raft
#

**New dataset alert!**

Global Ease of Doing Business Dataset (2010-2019)

🔗 Check it out here!

This dataset, sourced from the World Bank, encapsulates key indicators related to the ease of doing business in various countries. Covering metrics such as construction permits, costs, and regulatory compliance from 2010 to 2019, it offers a comprehensive view of how countries have evolved in their business environments.

zinc dew
#

I want a dataset for genetic disorder classification for my project, where the user can type in their symptoms and the AI can use the dataset to identify which genetic disorder it is

#

so if anyone can help me I'd be thankful

opal garden
slate wigeon
#

Hey everyone! 👋

I've just published a new dataset on Kaggle: Resume Dataset.

This dataset contains a collection of resumes in both PDF and text formats, ideal for projects involving data extraction, natural language processing, and machine learning.

If you're interested in exploring or contributing to this dataset, check it out here: Dataset Link

Looking forward to your feedback and seeing the amazing projects you build with it!

#DataScience #MachineLearning #nlp #Kaggle #ResumeDataset

cerulean matrix
#

Hi, everybody.
We're building news analysis models and need to collect 20 years of news data.
Does anybody know a good news data service?
Please let me know.

scarlet bobcat
#

'ello people CH_PikaWave
I am looking for a dataset or API which has every (or most) fictional characters and their pic in it like characters from

  • anime (required)
  • movies & novels (required)
  • games (optional)
    If you can find said dataset or API, then ping please. Thanks!
vocal grove
#

Hey everyone, This is my Conversational English-Malayalam Dataset, designed for transformer-based models! Unlike other datasets, this one is error-free and not generated using tools like Google Translate. It’s crafted with real, natural conversations to provide authentic, high-quality data for tasks like machine translation, multilingual NLP, and sentiment analysis. Perfectly tailored for models like BERT, GPT, and others, it’s a game-changer for those seeking context-rich dialogues for training. Check it out on Kaggle and see the difference it can make in your projects. Let’s build smarter models together! 🎉 https://www.kaggle.com/datasets/nihalthomas15/lang-trans-eng-malayalam

zinc dew
#

Hi everyone 😀 !
I’m currently working on building a hybrid LSTM-XGBoost model to predict the CEDCOS score (an overcrowding score for emergency departments) with an hourly prediction horizon of 10 hours.
To enhance the model's accuracy, I’m looking for a reliable retrograde dataset and an open-source API that provides real-time data for flu, RSV, and COVID-19, specifically for Europe (Belgium)🧐.
Currently, my model integrates internal hospital variables along with external factors like weather, traffic, and events, which have already yielded reliable results. However, I believe incorporating infection data could significantly improve the model's performance.
I’ve explored several sources, including:
The WHO
Respicast Forecaster (respicast.ecdc.europa.eu/forecasts) – they provide some data through their GitHub.
But I’m still on the lookout for other options. Has anyone worked with such data or APIs before? Any recommendations, sources, or suggestions for reliable datasets or live feeds would be greatly appreciated!

opaque birch
small sable
kindred kiln
#

I'm looking for credible datasets for my project on cardiac arrhythmia. If you have data or know of any sources, let me know, thanks. By the way, I'll use it for research purposes, and the data should be recent.

languid anchor
#

How do people publish their datasets?
I'm curious to understand how people go about publishing datasets. Do they generate the data themselves, or do they collect it from somewhere else? If it's the latter, where do they usually get their data from?

harsh quartz
#

Hi everyone,

I’ve just published a dataset of Turkey’s postal codes, and I wanted to share it here in case it’s useful for your geospatial, NLP, or logistics-related projects.

What’s inside:
• Covers 81 provinces, 973 districts, and 73,000+ rows
• Organized by province, district, sub-region, and neighborhood
• Available in CSV and Excel formats
• UTF-8-sig encoded, ready for use with pandas, geopandas, map visualizations, and more

🔗 Dataset link: https://www.kaggle.com/datasets/erogluegemen/turkey-postal-codes-dataset-2025

robust ginkgo
#

any social media data sets?

queen steeple
#

Hi guys, I am working on a RAG LLM project and am unable to find a small dataset.
The datasets I am finding have 2M or 45k rows.
If anyone has Stack Overflow question data with fewer than 30k rows, please share the link.
Thank you
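
If no small dataset turns up, one option is to draw a uniform random subset from a larger one. Below is a minimal sketch using reservoir sampling, which reads the rows once and never holds more than `k` of them in memory. The function name is hypothetical, and `range(100_000)` stands in for rows read from a real file.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = row       # replace with probability k / (i + 1)
    return sample

# Stand-in for iterating over the rows of a large file
subset = reservoir_sample(range(100_000), k=1_000)
```

Fixing the seed keeps the subset reproducible between runs.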

queen steeple
sharp root
#

Does anyone know where the brand new "Template for Transparency in AI Model Training Data" can be found? It was supposedly published today... but when I look I can only find a different CCIA document from January.

outer stone
#

Good day to you. I need help, please: I have a data analysis project where I have to analyze 12 datasets covering a year. This is my first time taking on a project by myself and I don't know where to get started. Should I group the datasets quarterly, or do the analysis for each month and then combine everything? Please help 🙏🏻. The tool I will be using is Excel

keen zodiac
#

hey guys
I just created a small webapp which takes a PDF as input, and you can prompt it with "extract (some data) out of this" and it extracts that data and creates a dataset downloadable as CSV, Excel, and JSON
I just created it today. I would love to have people try it out and give their opinions on how it could be better. DM me and I'll send you the repository link. It's completely free, just try it out and tell me

keen zodiac
#

👋 I just built a free tool that turns any PDF, image, or Word doc into a clean dataset using just a prompt — kinda like ChatGPT but for messy files.

Want to give it a quick try and tell me what’s broken or missing? Takes 2 mins. Would love your feedback 🙏
👉 https://pdf2dataset.streamlit.app

pseudo flax
crude obsidian
#

Hey guys, I just placed my custom made FRIDAY from Marvel Conversation dataset for LLMs on Kaggle.
It's in ChatML format, so most models can be fine-tuned on the dataset
👉 Kaggle: https://www.kaggle.com/datasets/prakhar231/friday-from-marvel-conversations-for-llms
👉 Hugging Face : https://huggingface.co/datasets/git-prakhar/FRIDAY-from-Marvel-Conversations

silver wyvern
#

I am Kishor J K

I just published a dataset on "Vodafone customer churn data"
This data was provided to us as part of a hackathon and I thought it would be a good idea to share it.

I have also published my notebook where I did EDA, visualization and prediction using this dataset.

Dataset link: https://t.co/1CGAgATDCF
Notebook link: https://t.co/VYBEoX5lUa

slate magnet
#

Hi, @everybody
I have one question. I'm training ML models for a 3-class classification problem where the classes have similar numbers of samples, but the predictions are skewed.
The first and second classes are predicted, though with low precision; the third class is never predicted. What's the reason? I can't find it.
Earlier, when I applied reinforcement learning with the three classes assigned to three actions, one action was never selected either.
This is a prediction model for forex EUR/USD.

distant nest
# slate magnet Hi, @everybody I have one question, I'm training ml models for the prediction, w...

I have had this same issue with forex. If it’s the hold that is never selected, try using one gate for long and another for short, and NOT one for hold. That way you can put a threshold on your gates and you're good.

If it's a buy or sell that's not happening, I would try a different pair that’s obviously trending the other direction, and make sure your rewards and penalties are the same for buys and sells, so there is no bias.
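One possible reading of the two-gate idea, as a minimal sketch: the model outputs two independent scores in [0, 1], one for long and one for short, and "hold" is simply what happens when neither gate clears its threshold. The function name `decide`, the 0.6 threshold, and the rule that two open gates also mean "hold" are assumptions for illustration, not something stated in the message above.

```python
def decide(long_score: float, short_score: float, threshold: float = 0.6) -> str:
    """Map two independent gate scores in [0, 1] to a trading action.

    There is no explicit "hold" output: hold is the fallback when no single
    gate is clearly open. The threshold value is a hypothetical default.
    """
    long_open = long_score >= threshold
    short_open = short_score >= threshold
    if long_open and not short_open:
        return "long"
    if short_open and not long_open:
        return "short"
    # Neither gate open, or both open (conflicting signal): stay out.
    return "hold"


if __name__ == "__main__":
    for scores in [(0.8, 0.1), (0.2, 0.9), (0.3, 0.4), (0.9, 0.9)]:
        print(scores, decide(*scores))
```

Raising or lowering the threshold changes how often the fallback is taken, which is one way to tune how frequently the model stays out of the market.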

#

Oh wow that’s an old post, lol my bad

slate magnet
distant nest
slate magnet
stuck gust
#

I need to build a dataset of Roblox game player counts. What are the best sites?

glass hamlet
#

Hey, what is the best practice to deploy ML models for free? Should I go with Hugging Face or Render?

brisk condor
thorn grotto
hybrid vortex
#

Hello All ! 👋
I just published the Cassandra Employee Dataset — a massive 50,000-row dataset perfect for Regression, Classification, Clustering, and EDA.
Super clean, ML-ready, and has a 10/10 usability score. Great for building real-world ML projects. 🚀
Do hit an upvote on the dataset 😁

https://www.kaggle.com/datasets/rockyt07/cassandra-employee-dataset

tawdry hedge
#

Hello, where can I find Cebuano text corpus/audio datasets?

snow lion
hybrid vortex
sour arch
#

Hello everyone! 👋

I’m excited to share my capstone project:

🛡️ SENTINELS – Multimodal Disaster Intelligence System
An AI-powered system for real-time disaster detection, severity analysis, risk prediction & interactive mapping.

🔗 Kaggle Notebook: https://www.kaggle.com/code/mukthanjalibonala/sentinels-multimodal-disaster-intelligence-agent

Connect with me on LinkedIn 👉 https://www.linkedin.com/in/mukthanjalibonala/

Would love feedback, suggestions, and support 🙏

Thank you! 💙

ebon girder
#

Hey everyone! 👋
I’m conducting a short academic survey for my Research Methodology internal assessment on “The Impact of ChatGPT in Education.”
It takes less than 3 minutes to complete and all responses will remain anonymous.
Your input will really help me with my project — please fill it out below 👇

🔗Survey Link

Thanks a lot for your time and support! 🙏

velvet karma
blissful hemlock
#

Can anyone tell me where I can find a tumor segmentation mask dataset for 2D image segmentation using UNet?

burnt cobalt
scarlet sparrow
subtle grotto
#

Hi @everyone
📘 Python Loops & Strings – Kaggle Notebook 🐍
This notebook explains Python loops (for, while) and strings in a detailed and easy-to-understand way, with clear examples.
It’s especially helpful for beginners 🚀

Please check it out and leave a vote ⭐ and a comment 💬 — your feedback is highly appreciated! 🙌
https://www.kaggle.com/code/dastgeerjutt/3-loops-and-strings-detailed

scarlet sparrow
lost crest
#

🚗⚡ New Dataset on Kaggle: Electric Vehicle Population (Geospatial Insights)

I’ve just published an Electric Vehicle Population dataset on Kaggle, designed for EDA, machine learning, and geospatial analysis.

📊 What you can explore with this dataset:
• EV adoption patterns across regions
• Urban vs. rural penetration gaps
• Trends over time by vehicle type and location
• Opportunities for clustering, forecasting, and policy analysis

🔗 Explore & upvote the dataset:
https://www.kaggle.com/datasets/hammadansari7/electric-vehicle-population

💬 Your take?
Is EV adoption still driven by urban infrastructure and incentives, or are we approaching broader mainstream adoption?
Is the rural lag a data reality—or just a temporary phase?

I’d love to see notebooks, visualizations, and insights built on top of this dataset. Let’s learn from the data.

#DataScience #KaggleDatasets #MachineLearning #GeospatialAnalysis #ElectricVehicles #Sustainability #Python #EDA

@Kaggle @Tesla @robikscube @TowardsDataScience

lost crest
lost crest
little fog
#

https://www.kaggle.com/datasets/mabubakrsiddiq/developer-stress-simulation-dataset
This dataset simulates the stress levels of software developers under various real-world conditions. It includes a mix of workload 💼, personal habits 🛌☕, project deadlines ⏳, code complexity 💻, and interruptions 📞 that influence stress. The data is intentionally non-linear and realistic 🔄, reflecting how stress does not grow uniformly but depends on interactions between multiple factors.

little fog
#

New Dataset Just published!

View: https://www.kaggle.com/datasets/mabubakrsiddiq/clear-bg-ocr-dataset-eng-and-zh-22k-images

🔹 Overview

This dataset contains synthetic OCR images of English and Chinese sentences. Each language is organized in a separate folder with corresponding metadata. The images have clear backgrounds, random fonts and font sizes, and optional blur for variability.

The dataset is designed for OCR research, machine learning, and computer vision tasks. Perfect for training models to recognize text in multiple languages and fonts.

🎨 Features

  • Bilingual dataset: English & Chinese
  • Random fonts: Multiple font options for diversity
  • Random font sizes: Increases model generalization
  • Optional Gaussian blur: Simulates real-world imaging
  • Clear backgrounds: Good for clean OCR training
  • Metadata included: Easy for preprocessing and analysis

💡 Possible Use Cases

  • 🖋️ OCR Model Training: Train models like Tesseract, PaddleOCR, or deep learning OCR pipelines
  • 🤖 Computer Vision Research: Use metadata for font/style classification
  • 🏫 Language Learning Tools: Visual recognition for English or Chinese sentences
  • 🔧 Augmentation Testing: Benchmark text recognition under blur and font variations
  • 🧠 Multi-Lingual OCR Experiments: Test cross-lingual recognition models

⚡ Notes

  • The Chinese text is rendered using Microsoft YaHei and NSimSun fonts for proper character display.
  • The English text uses a variety of fonts for diversity.

Please consider giving an upvote!

little fog
little fog
lost crest
#

@everyone
Assalam o alikum!
I posted a new dataset on Kaggle: "Pakistan Air Quality & Weather (10 Cities)."
https://www.kaggle.com/datasets/hammadansari7/pakistan-air-quality-and-weather-10-cities

Overview
This dataset contains 3 months of hourly air quality and weather measurements for 10 major Pakistani cities, covering November 2025 to February 2026. With 21,840 complete records, it provides comprehensive data for pollution analysis and prediction modeling.

Cities Covered
  • Lahore
  • Karachi
  • Islamabad
  • Rawalpindi
  • Faisalabad
  • Multan
  • Peshawar
  • Quetta
  • Rahim Yar Khan
  • Sialkot

Data Source
Air quality and weather data collected from Open-Meteo API, an open-source weather and environmental data provider.

Dataset Statistics
Total Records: 21,840
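
As a quick sanity check on the record count, the arithmetic can be sketched as follows. It assumes every city contributes the same number of rows and that there is exactly one row per city per hour; neither assumption is confirmed in the post.

```python
# Figures taken from the post above.
total_records = 21_840
cities = 10
hours_per_day = 24

# Assumption: rows are split evenly across cities, one row per hour.
records_per_city = total_records // cities
days_per_city = records_per_city / hours_per_day

print(records_per_city, days_per_city)  # 2184 91.0
```

Under those assumptions the total corresponds to 91 days of hourly data per city, which is consistent with the stated "3 months".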

little fog
lost crest
slate marlin
scarlet sparrow
little fog
lost crest
little fog
#

New Dataset published!

https://www.kaggle.com/datasets/mabubakrsiddiq/language-identification-dataset-20-languages/data/data/data/data/data
The Language Identification Dataset is a curated collection of approximately 68978 text samples, each paired with a corresponding language label. The dataset was constructed by gathering multilingual text passages from three major sources: the Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi-MT. These sources provide a diverse mix of domains, writing styles, and sentence structures, making the dataset suitable for research and machine learning tasks involving language detection, multilingual NLP, and text classification.

lost crest
little fog
#

New Dataset Published!

https://www.kaggle.com/datasets/mabubakrsiddiq/competition-math-problems-dataset
Please upvote...
This dataset contains over 12,000 math competition problems covering topics like Algebra and others. Each entry includes the problem statement, its difficulty level (Level 1–5), problem type, and a detailed step-by-step solution. It is ideal for training or evaluating AI models in problem-solving, explanation generation, and mathematical reasoning. The problems range from simple calculations to complex multi-step competition-level questions.

lost crest
prisma narwhal
#

Just as a reminder: server rules prohibit asking for upvotes. We will be enforcing that more assertively going forward.

little fog
lost crest
prisma narwhal
lost crest
prisma narwhal
#

It is against server rules to request upvotes for your work, and it can lead to moderator action.

slate marlin
spark hollow
normal ore
spark hollow
spark hollow
tidal trench
#

I need an Urdu or English sentiment analysis dataset.

lost crest
little fog
#

https://www.kaggle.com/datasets/mabubakrsiddiq/urdu-ghazal-dataset-32-poets-and-their-ghazals

The dataset contains poetry by 30 of the greatest Urdu poets. Here they are:

'mirza-ghalib','allama-iqbal','faiz-ahmad-faiz','sahir-ludhianvi','meer-taqi-meer', 'dagh-dehlvi','kaifi-azmi','gulzar','bahadur-shah-zafar','parveen-shakir', 'jaan-nisar-akhtar','javed-akhtar','jigar-moradabadi','jaun-eliya', 'ahmad-faraz','meer-anees','mohsin-naqvi','firaq-gorakhpuri','fahmida-riaz','wali-mohammad-wali', 'waseem-barelvi','akbar-allahabadi','altaf-hussain-hali','ameer-khusrau','naji-shakir','naseer-turabi', 'nazm-tabatabai','nida-fazli','noon-meem-rashid', 'habib-jalib'
Every ghazal is given in three writing systems:

  • Urdu (Arabic script)
  • Hindi (Devanagari script)
  • English (Latin script)

The files are divided into three folders: ur, en and hi.

Potential use cases:

  • NLP
  • Meter detection
  • Training a model to predict the poet given the ghazal or couplet

Have fun with data!

lost crest
normal ore
#

About This Dataset https://www.kaggle.com/datasets/suhanigupta04/gold-futures-5-year-dataset

  • 5 years daily gold futures (GC=F) data from Yahoo Finance with complete OHLCV
  • Clean, ready-to-use for LSTM/GRU, ARIMA, Prophet time-series forecasting models
  • 11 pre-computed technical indicators: MA7/30/90, RSI, MACD, Bollinger Bands, volatility
  • No missing values, properly scaled features for immediate ML experimentation
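
For readers unfamiliar with the indicators listed above, here is a minimal sketch of how a simple moving average and Bollinger Bands are commonly computed from a list of closing prices. The function names, the 20-period window, the 2-standard-deviation band width, and the use of population standard deviation are all assumptions based on common convention; the dataset's own parameters are not stated in the post.

```python
from statistics import mean, pstdev


def rolling_mean(values, window):
    """Simple moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(mean(values[i + 1 - window : i + 1]))
    return out


def bollinger_bands(values, window=20, num_std=2.0):
    """Return (lower, middle, upper) per position; None until a full window.

    middle is the moving average; the bands sit num_std population
    standard deviations below and above it (hypothetical defaults).
    """
    bands = []
    for i in range(len(values)):
        if i + 1 < window:
            bands.append(None)
            continue
        chunk = values[i + 1 - window : i + 1]
        mid = mean(chunk)
        spread = num_std * pstdev(chunk)
        bands.append((mid - spread, mid, mid + spread))
    return bands


if __name__ == "__main__":
    closes = [10.0, 11.0, 12.0, 11.0, 10.0, 12.0]  # made-up prices
    print(rolling_mean(closes, 3))
    print(bollinger_bands(closes, window=3))
```

With a library such as pandas the same calculation is usually a one-liner over a rolling window; the pure-Python version here is only meant to make the definition explicit.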

🔗 [Starter Notebook created] — EDA, technical plots, LSTM baseline with RMSE evaluation

prisma narwhal
little fog
lost crest
buoyant stone
#

Hello hackers,

I need some help. I’m training a conversation disentanglement model using this repo: https://github.com/jkkummerfeld/irc-disentanglement
It will be used to prepare a conversation dataset for a project.

I don’t have access to compute resources that can run continuously for five days. I’m using Google Colab, but sessions eventually stop when the tab closes or times out. I also can’t afford a cloud provider right now.

If anyone has a home setup that can run uninterrupted for several days and is willing to help, I would really appreciate it. Thanks!

normal ore
#

About This Dataset
🏆 2,000+ downloads and counting — synthetic placement dataset on Kaggle!
https://www.kaggle.com/datasets/suhanigupta04/student-placement-prediction-dataset

  • 100,000 synthetic student records simulating real Indian campus recruitment patterns

  • Features cover the full placement pipeline — academics (CGPA, backlogs), technical skills (DSA, coding, ML), and activities (internships, projects, hackathons)

  • Two target variables: placement_status (classification) and salary_package_lpa (regression)

  • Ideal for placement prediction, salary estimation, feature importance analysis, and fairness auditing across branches and tiers

🔗 Starter Notebook available — EDA, baseline ML models, feature importance. Great starting point for your own experiments!

normal ore
#

🧠 Just published a new dataset on Kaggle!

🔗 Mental Health & Burnout in Tech – https://www.kaggle.com/datasets/suhanigupta04/employee-mental-health-and-burnout-dataset

  • 150,000 synthetic tech employee records across roles, company sizes & work modes
  • Covers work stress, sleep, lifestyle, therapy access & social support
  • Three correlated mental health scores: stress, anxiety & depression
  • Two targets: burnout_level (Low/Moderate/High) + seeks_professional_help (binary)

📓 Starter Notebook available — EDA, correlation heatmaps & Random Forest baseline

hallow warren
normal ore
#

🏏 Just published my IPL Dataset (2008–2024) on Kaggle!
https://www.kaggle.com/datasets/suhanigupta04/ipl-dataset-20082024-with-match-features
17 seasons of IPL data with innings-level features engineered
from official ball-by-ball records.

  • ⚡ Powerplay & death over stats per innings
  • 📊 Run rate, dot ball %, boundary counts
  • 🏆 Match outcomes, toss impact & player of match
  • 🤖 Ready for EDA, win prediction & team analysis
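
A rough sketch of how innings-level features like these might be derived from ball-by-ball records. The input layout (a list of `(over_index, runs)` tuples), the function name, and the over ranges (powerplay as overs 1 to 6, death overs as 16 to 20, following the usual T20 convention) are assumptions for illustration; the dataset's actual schema and definitions may differ. Wides and no-balls are ignored here for simplicity.

```python
def innings_features(balls):
    """Compute simple innings-level stats from ball-by-ball data.

    balls: list of (over_index, runs) tuples, with over_index 0-based
    (0..19 for a T20 innings). This layout is hypothetical.
    """
    n = len(balls)
    if n == 0:
        return {"run_rate": 0.0, "dot_ball_pct": 0.0, "boundaries": 0,
                "powerplay_runs": 0, "death_runs": 0}
    total_runs = sum(runs for _, runs in balls)
    overs_bowled = n / 6  # assumes every ball is a legal delivery
    return {
        "run_rate": total_runs / overs_bowled,
        "dot_ball_pct": 100 * sum(1 for _, runs in balls if runs == 0) / n,
        "boundaries": sum(1 for _, runs in balls if runs in (4, 6)),
        "powerplay_runs": sum(runs for over, runs in balls if over < 6),
        "death_runs": sum(runs for over, runs in balls if over >= 15),
    }


if __name__ == "__main__":
    # Two made-up overs: one in the powerplay, one at the death.
    sample = [(0, 0), (0, 4), (0, 1), (0, 6), (0, 0), (0, 1),
              (15, 6), (15, 0), (15, 2), (15, 0), (15, 4), (15, 0)]
    print(innings_features(sample))
```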
dire olive
#

I want news data. Where can I get it?

cinder oar
#

Hey everyone,

I’m not sure if you’ve been following the discussions over the past two weeks, but I recently completed a challenge called "14 Days, 14 Datasets." The challenge is now over, but it resulted in several high-quality datasets covering highly relevant topics.

The final topic is very personal to me: my home country, Sudan. As many of you may not know, Sudan has been experiencing conflict since the '90s, though it was previously concentrated in the Darfur region rather than the capital, Khartoum. Since 2019, Sudan has faced widespread demonstrations and government crackdowns that deeply affected Khartoum. Then, in 2023, a full-scale war broke out in the capital.

This conflict began as an attempt by the Rapid Support Forces (RSF) to seize authority from the National Army. Backed by the UAE, which has funded the militia to gain control over Sudan’s gold resources, this war has cost civilians everything: their homes, their cars, their life savings, and their lives.

Because of this, I decided to curate a high-quality dataset to provide information on the reality of what is happening in my country.

Dataset Link: https://www.kaggle.com/datasets/waddahali/sudan-conflict-2023-2026

The dataset is fully documented, and the description provides extensive context. I hope you take a look, and please keep Sudan in your prayers.

Thank you all!

normal ore
prisma anchor
celest ether
#

Can anyone provide the best download link for a deepfake detection video dataset with good-quality videos of many different varieties? It would be a great help to me.

normal ore
#

🎬 New Dataset Live on Kaggle! 🚀
https://www.kaggle.com/datasets/suhanigupta04/global-movies-dataset-19502026
• 100K synthetic movies (1950–2026) with IMDb-style ratings, genres, budgets & revenue
• Director rankings, decade trends, blockbuster prediction targets included
• Perfect for EDA dashboards, rating prediction & recommendation systems
• ML-ready: top_100_prob, blockbuster_flag, franchise_flag targets

civic lagoon
untold jackal
#

🚀New Dataset on Kaggle! (Liver Patient data)

https://www.kaggle.com/datasets/shauryasrivastava01/liver-patient-dataset
• 583 patient records with real clinical biomarkers
• Binary classification (Liver Disease vs Healthy)
• Fully cleaned + preprocessed (no messy columns)
• Includes enzymes, bilirubin, proteins & demographic data
• Perfect for ML projects, EDA, and healthcare modeling

normal ore
#

Explore dataset for time series: About This Dataset https://www.kaggle.com/datasets/suhanigupta04/gold-futures-5-year-dataset
5 years of daily gold futures (GC=F) data from Yahoo Finance
Clean, ready-to-use for LSTM/GRU, ARIMA, Prophet time-series forecasting models
11 pre-computed technical indicators
No missing values, properly scaled features for immediate ML experimentation

🔗 [Starter Notebook created] — EDA, technical plots, LSTM baseline with RMSE evaluation

untold jackal
spice sun
#

https://www.kaggle.com/datasets/izzarsulynashrudin/brugada-huca

Brugada-HUCA: 12-Lead ECG Recordings for the Study of Brugada Syndrome

Summary
Brugada-HUCA is a dataset of 12-lead electrocardiogram (ECG) recordings developed to support the study and classification of Brugada syndrome, a rare but potentially fatal cardiac arrhythmia. The data were collected retrospectively from patients evaluated at the Cardiology Department of the Hospital Universitario Central de Asturias (HUCA) and were reviewed by clinical experts. Diagnostic labels were assigned according to established international criteria.

The dataset includes 363 subjects, comprising 76 patients diagnosed with Brugada syndrome and 287 healthy control subjects. Each recording is accompanied by diagnostic metadata.

native coral
#

Hey @everyone, use this dataset for the new EDA + model build:

https://www.kaggle.com/datasets/vedantbhavsar43/ipl-2007-to-2026-complete-ball-by-ball-dataset

This is a better base than the older IPL datasets because it already has:

  • latest available IPL 2026 data
  • full ball-by-ball coverage
  • cleaner ML-ready structure
  • better feature engineering scope

The main advantage is that we can skip a lot of cleaning and directly focus on:

  • EDA
  • feature engineering
  • stronger model building

It should be much better for winner prediction, score prediction, and live match modeling.

last kettle
#

New dataset drop: 69k Japanese names with gender, sourced from real Wikipedia people
I believe this is the only large-scale dataset of real Japanese names with gender labels.
Most Japanese name-gender datasets come from dictionaries or frequency surveys — not real individuals. I scraped Japanese Wikipedia's gender-segregated occupational categories to get a dataset of actual public figures (actors, athletes, politicians, musicians, etc.) with inferred gender labels.

  • 69k entries | 87.1% include birth year
  • Kanji + hiragana for each name
  • Crawler code included

Kaggle Dataset: https://x.gd/MffYV

I'll release a model for gender prediction from names, and a 450k meda dataset of Japanese names with gender, soon.

polar star
#

Just released: AI Hiring Bias & Fairness Benchmark

A realistic synthetic recruitment dataset with:
• 5,000 candidate profiles
• Embedded hiring bias patterns
• Fairness auditing & SHAP explainability
• XGBoost + XAI analysis notebook
• Enterprise-style hiring simulation

Perfect for:
MachineLearning FairnessAI ExplainableAI XGBoost DataScience EDA Kaggle

Built for bias detection, hiring prediction, and ethical AI research.
https://www.kaggle.com/datasets/sridipbasu/ai-hiring-bias-and-fairness-benchmark

last kettle
polar star
primal raven
agile sable
#

Hey everyone! 👋

I just published my latest notebook on Kaggle: "Behind the Screens: Indian Developer Burnout & Layoff Anxiety Analysis".

I focused on feature engineering to create a custom "Vulnerability Matrix" to visualize burnout risks in 2026. I'd love to get some feedback on my visualization choices and the analytical approach.

Check it out here: https://www.kaggle.com/code/abdallahahmed701/behind-the-screen-indian-developer-burnout-eda

Any feedback or upvotes would be greatly appreciated! 🙏✨

craggy owl
#

@agile sable This is really cool, and the topic is really relevant right now in the tech industry. The idea of building a Vulnerability Matrix through feature engineering is a creative approach. I like the Existential Dread Checklist visual. What stands out immediately is how close the two bars are for almost every role. Do these scores shift when you control for years of experience or company size? A junior dev at a startup probably feels this very differently than a senior one at a big tech firm. The grouped bar format works really well here though; it makes the comparison clean and easy to read at a glance.

shut quartz
#

The World Has a Data Problem. We Fix It.
Every AI team hits the same wall eventually.
You have the model. You have the architecture. You have the engineers. But you don't have the data, and everything stops.
Maybe your dataset is too small to train on. Maybe it carries sensitive patient records, financial transactions, or personal identifiers that legal won't let you touch. Maybe you've been waiting months for a vendor to deliver labeled data that still isn't ready. Maybe your edge cases are so rare in real life that your model keeps failing exactly where it matters most.
This is not a skill problem. This is a data problem. And it is quietly killing more AI projects than any other single reason.
We generate synthetic data.
Not as a workaround. Not as a compromise. As a legitimate, statistically rigorous alternative that lets your team move again. We produce tabular, text, image, and time-series synthetic datasets that mirror the distributions, correlations, and behavioral patterns of real-world data without exposing a single real record.
We have solved this for teams in healthcare who couldn't share patient data across departments. For fintech companies building fraud detection models with almost no real fraud examples to train on. For startups that needed 10x their dataset size before a funding deadline. For enterprises blocked by GDPR, HIPAA, and compliance teams that said no to everything.
The problem you are sitting with right now, whether it is a privacy blocker, a data scarcity issue, a class imbalance, a regulatory wall, or a timeline that real data collection simply cannot meet, has a solution. We will tell you exactly what it is within 24 hours of hearing from you.
No long sales cycles. No vague proposals. You describe your data problem in plain language, and we come back with a concrete plan.
Send us your situation: [synthox.ai@gmail.com]
The only thing worse than a data problem is spending another month pretending it will resolve itself.

obsidian cloak