#data-science-and-ml
read the documentation
you need to edit the code
as is, it will just make a generic dataset
if you are interested just look into the project
Sorry, but I had to ask because I didn't think the documentation was clear.
I don't really see how it's unclear. It says: ```
"The Synthetic Conversations dataset is a set made up of inputs and outputs that was completely automated and generated by AI language models. I used AI models such as DeepSeek R1 Llama 70B Distil, Google's Gemini 2.0 Flash, Microsoft's Phi 3, and Qwen3-0.6B."```
and: ```
DeepSeek R1 Llama 70B Distil
Gemini 2.0 Flash
Phi 4 Reasoning
Qwen3 0.6B
Only the best responses are selected and added to the dataset. This is done by having all of the AI models voting on which output they think is the best without being able to vote for their own output."```
I asked a specific question about generating math data, and you told me to read the docs but the docs don’t explain how to do that. If your project depends on users editing the code to guide the output, that should be clearly explained. Saying “it’s in the README” doesn’t work if it isn’t.
^
"You can modify the script to ask the cluster to only generate data that will help train an AI on python debugging or math or whatever you want."
Right — but saying “you can modify the script” isn’t the same as explaining how to do it. That sentence is a claim, not documentation. If customizing the prompts is essential, the README should walk through it clearly. Otherwise, pointing to it doesn’t help.
I don’t know what to tell you man. This is a Python discord server I posted thinking that you would understand basic python. Beyond that this is the AI channel within the server. If you don’t know how to modify a prompt in a script learn how to do that first. As stated in the documentation I used the OpenAI and Google-GenAI SDKs so maybe look into that
I joined that kaggle comp for Stanford RNA Folding. I followed their outline but I added the UCF to extract features and it actually makes the predictions way more accurate by finding hidden patterns in the RNA that regular models miss. The UCF lets us see how "mathematically complex" different parts of the RNA are, which helps guide the 3D folding. Here are some images. Notice how the data points sit in the Complex/Chaotic region: it's telling us RNA has intricate folding patterns. We know that, but the AI does too. And the 3D visuals show the actual predicted structures with each nucleotide color-coded (A=green, U=red, C=blue, G=yellow). Here are a few samples
Guys I need help with Python Pandas
```py
from subprocess import call
import pandas as pd
import time

#func for opening files on command
def openfile(x: str):
    call(["python", x])

#setting up the dataframe and email variables
df = pd.read_csv("CSV_Files/logindata.csv")
df.set_index('Email', inplace=True)
emp_email_end = "@can.emp"
adm_email_end = "@can.adm"

#signup and login
print("***** Welcome To Login Page *****")
choice_signin = input("Would you like to login or signup?: ")
if choice_signin == "login":
    mail = input("Enter your email:- ")
    pwd = input("Enter your password: ")
    act_pwd = str(df.loc[mail][0])
    if pwd == act_pwd:
        if str(mail).endswith(emp_email_end):
            print("Welcome Employee!")
            time.sleep(3)
            openfile("Python_Code/employee.py")
        elif str(mail).endswith(adm_email_end):
            print("Welcome, Admin")
            time.sleep(3)
            openfile("Python_Code/admin.py")
        else:
            print("Welcome To The Canopy!!")
            time.sleep(3)
            openfile("Python_Code/customer.py")
    elif pwd != act_pwd:
        print("Password is incorrect")
elif choice_signin == "signup":
    new_mail = input("Enter Your Email ID:- ")
    new_pwd = input("Enter Your Password:- ")
    df.loc[new_mail] = [new_pwd]
    if str(new_mail).endswith(emp_email_end):
        print("You cannot register with the company email!")
    elif str(new_mail).endswith(adm_email_end):
        print("You cannot register with the company email!")
    else:
        print("Welcome To The Canopy!!")
        time.sleep(3)
        openfile("Python_Code/customer.py")

#updating the csv file
df.to_csv("CSV_Files/logindata.csv")
```
I'm getting this exception when I use the login choice as ''login''
FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` act_pwd = str(df.loc[mail][0])
For reference, I set the index of the csv file to the email names
not an integer index
I want to get rid of the exception, since I need to show this as a school project, and I'm trying to use the try, except method, but it's not working
Please help
not entirely sure on your question since you say you are trying to use the try/except method but I dont see a try anywhere in that code. Can you clarify what you mean by I'm trying to use the try, except method, but it's not working ? It looks to me like it is interpreting your column name as not a string. are you sure [mail][0] is actually coming back as a single column label string?
```py
mail = input("Enter your email:- ")
pwd = input("Enter your password:- ")
act_pwd = str(df.loc[str(mail)][0])
try:
    act_pwd = str(df.loc[str(mail)][0])
except FutureWarning:
    print("test")
if pwd == act_pwd:
    if str(mail).endswith(emp_email_end):
        print("Welcome Employee!")
        time.sleep(3)
        openfile("Python_Code/employee.py")
    elif str(mail).endswith(adm_email_end):
        print("Welcome, Admin")
        time.sleep(3)
        openfile("Python_Code/admin.py")
    else:
        print("Welcome To The Canopy!!")
        time.sleep(3)
        openfile("Python_Code/customer.py")
elif pwd != act_pwd:
    print("Password is incorrect")
```
here
This isn't working either. I'm still getting the same exception, but twice now
FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` act_pwd = str(df.loc[str(mail)][0]) /home/jon/Desktop/IP_Project_Hotel-Management/Python_Code/logins.py:30: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` act_pwd = str(df.loc[str(mail)][0]) Welcome To The Canopy!!
and yes, the code is working as it's supposed to, it's only the warning I want to get rid of
How do I do that, using try and except, or sys module?
Or any other fix that you know of
So my assumption is still the same in that the argument you are passing to df.loc() is in fact not seen as a string but an integer or otherwise. you could try setting the value outside of the function like
```py
mail_id = str(mail)[0]
act_pwd = str(df.loc[mail_id])
```
or try using the df.iloc like the message suggests.
Also the reason your try/except technically did not work is because you are calling the same function outside of the try statement (5th line) which is what actually throws the exception.
What should I do to fix the try thing
try changing this:
```py
act_pwd = str(df.loc[str(mail)][0])
try:
    act_pwd = str(df.loc[str(mail)][0])
except FutureWarning:
    print("test")
```
to this:
```py
mail_id = str(mail)[0]
try:
    act_pwd = str(df.loc[mail_id])
except FutureWarning:
    print("test")
```
not working lol
same error or doesnt work at all?
same error
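For what it's worth, the message in question is a warning rather than an exception, so `except FutureWarning` does not catch it under Python's default settings. It comes from the `[0]` applied to the row that `df.loc[mail]` returns, where `0` is treated as a position. A minimal sketch of one way around it, assuming the password column is called `Password` (the real column name is not shown in the thread) and using made-up rows in place of `CSV_Files/logindata.csv`:

```python
import io
import warnings

import pandas as pd

# Stand-in for CSV_Files/logindata.csv; the "Password" column name and the rows are assumptions.
csv_text = "Email,Password\nalice@example.com,hunter2\nbob@can.emp,1234\n"
df = pd.read_csv(io.StringIO(csv_text)).set_index("Email")


def lookup_password(frame, mail):
    """Return the stored password for `mail`, or None if the email is unknown."""
    if mail not in frame.index:
        return None
    # .iloc[0] asks for the first value by position explicitly,
    # which is what the FutureWarning text recommends instead of [0].
    return str(frame.loc[mail].iloc[0])


# Turn warnings into errors here to confirm that no FutureWarning is emitted.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    act_pwd = lookup_password(df, "alice@example.com")

print(act_pwd)
```

If the column name is known, `df.at[mail, "Password"]` is another label-based way to read a single cell. The `mail not in frame.index` check also avoids a `KeyError` for emails that are not registered.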
newbie here, anyone know how to delineate (draw a red line on) echogram data? x = depth, y = time
https://www.kaggle.com/code/robikscube/all-python-data-visualization-libraries-in-2022/notebook#Seaborn after seeing these, I don't think it can be done with them because I can't find a way to add the red line
Need to ask, should I use statsmodel to help refine my model whilst using sklearn as my main library for linear regression? Also is there any difference in how sklearn and statsmodel handles linear regression? If i can get some clarification, that should help remove the frustration on what library to use for my model
in statsmodel you have OLS (ordinary least squares)
difference in writing
also you have statistical summary (p-values, confidence intervals etc)
would it make sense to set the index to be the target variable in a pandas dataframe or should i just leave it as a column?
like that
im guessing just leave it as a column
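A small sketch of the "leave it as a column" approach, with made-up column names: the target stays in the dataframe and is split off only when it is time to fit a model.

```python
import pandas as pd

# Toy data; the column names are illustrative, not from the original question.
df = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [0.5, 0.1, 0.9, 0.3],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# Keep the target as a regular column and separate it from the features when modelling.
X = df.drop(columns=["target"])
y = df["target"]

print(X.shape, y.shape)
```

Keeping the default index means rows stay uniquely addressable even when several rows share the same target value.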
Its a work in progress but Ive updated my paper on the UCF if anyone is interested . Ive never done anything like this so feedback is a plus https://docs.google.com/document/d/1Bey9Qt6dcif0r4--rE3BP3GupnAw306EJGRgkZG7N3s/edit?usp=sharing
Still working on incorporating all the visuals though
dont be scared just say it. Im putting myself out there. dont be shy
lol brooo just checked it out, and honestly?? not bad at all for ur first time 😮💨👏 like fr I wasn’t expectin it to be that detailed lmao. Some parts were a lil dense hhhmmmm maybe simplify a bit? but overall it lowkey makes sense 😭 visuals gonna help a lot once u throw em in tho frfr. Keep grindin, this got potential 🤙 let me kno when u update it again lol I’ll def take another look!
need help in #1373301621253210244
whats the difference between sklearn and statsmodel? Do I use either one or do I use a mixture of both modules?
Hi all, I am Aakash and I want help creating a SQL agent that uses a local LLM through a service like Ollama, together with LangChain. Currently I am not able to create an efficient agent with these tools. Can anyone please suggest how I can create a good SQL agent that can talk to a database and answer the user's query accordingly ❓
I have tried the llama3, deepseekr1 and llama3.2 models but I am getting some OutputParse exception.
The last image is the RNA structure of Covid 19. Even predicted the pseudoknot. The first one is 363 nt long. All these tests ran in under a minute. The speed at which this thing performs is bonkers
Its a straight web of connections hahaha
Look how all the RNA structure line up in the 135 degree region. It's detecting something for sure
In all the initial UCF tests, the RNA tests, the Tade bot tests, all integrated with parts of the UCF.. it seems they all point to the same conclusion. There's something underlying in data we've been overlooking for a while. The consistent ~135 degree phase angle appearing across completely different types of data suggests a fundamental mathematical principle that seems universal.
There are different complexity spaces from crypto coins too.
Statsmodel has models sklearn doesn’t have, specifically things related to time series. When there is an overlap I’d say that sklearn is about prediction and statsmodels is … about statistics, inference, interpretation etc.
Hi,
I am having a hard time getting bipedalwalker v3 with PPO agent to walk. The reward seems to be stuck around 10 to 20. I am trying to get at least 200.
I have tried changing the parameters and the architectures nothing worked
I want to know if the issue lies in the architecture or in the training parameters
the script uses
ActorNet (actor policy)
Criticnet to compute state value and
ActorCriticNet combining both networks and adds helper methods to act and evaluate samples
Does anyone have experience or know something about deep reinforcement learning and can help?
When should I use either modules? Im getting confused which to use
you use both
In what way? I was thinking at first, use sklearn as my main library for the actual ML part with statsmodel to refine it. Im getting confused with this shit
the actor makes action choices and the critic evaluates those actions until the action is good enough so that the agent walks
teaching a kid how to walk
basically
Is there any significant difference in how statsmodel uses linear regression techniques compared to sklearn?
I can't really answer that for you, what are you trying to do
Are you trying to just do predictions or are you doing data analysis with linear regression?
If you're "just" trying to predict a value --> sklearn
If you're doing data analysis, statistics, are interested in interpreting coefficients etc. --> statsmodels
linear regression is linear regression.
also it depends on what you're trying to accomplish; if sklearn has what you need, then use sklearn. if it doesnt, then see if statsmodel does
in my experience, sklearn's strength is more machine learning while statsmodel is used primarily for classical & descriptive statistics
That puts it into perspective. I guess what Im trying to do is first do some data analysis on how the price of gold is affected by certain factors and then make a simple price prediction program based on said factors
Hi can someone explain to me this diagram? It’s in regards to L1 vs L2, I don’t understand the circle/diamond and the ellipses
those are the points which are less than 1 away from the origin
using those norms
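A quick numeric check of that statement, as a sketch: the diamond is the set of points whose L1 norm is at most 1, and the circle is the set whose L2 norm is at most 1. (In the usual version of this diagram, which is not shown here, the ellipses are contour lines of the unregularized loss.)

```python
import numpy as np


def in_unit_ball(point, order):
    """True if `point` is at most 1 away from the origin under the given norm."""
    return np.linalg.norm(point, ord=order) <= 1.0


p = np.array([0.7, 0.7])
print(in_unit_ball(p, 1))  # L1 norm is |0.7| + |0.7| = 1.4
print(in_unit_ball(p, 2))  # L2 norm is sqrt(0.49 + 0.49), about 0.99
```

The same point is inside the circle but outside the diamond, which is why the diamond's corners sit on the axes: under L1 the points on the boundary that lie furthest out in any one direction are those with a single non-zero coordinate.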
It predicts RNA binding sites with pretty good accuracy, i built it with a validation dataset using PDB structures with experimentally verified binding sites, measuring distances between RNA residues and bound ligands to identify ground truth
75% precision / 60% recall on the biotin aptamer (1F27_A) and 41% precision / 54% recall on the FMN riboswitch (1FMN_A)
For SARS-CoV-2 RNA frameshifting element, my algorithm identified a key binding pocket at positions 10-16 (sequence GGGUUU) with a phase angle of 133.7° - precisely matching the universal pattern.
RNA structures consistently align near and around 135°, while cryptocurrency price data appears to align near 90°. Both show strong phase alignment, just at different characteristic angles.
yes as I said statsmodel has OLS
and statsmodel has summary statistics related to hypothesis testing etc
Guys, I want to ask suppose I want to train my object detection model with resnet fpn backbone on 640 x 640 images but no augmentations whatsoever, I use 80:10:10 split so I use 40 images for training and 5 for validation, which resnet backbone is the best
I know the dataset size is not enough but I can only work with the available data for now because my project manager told me not to make any augmentation/s first
if im using scikit learn do i need to always use TimeSeriesSplit if my data has a date column?
or is TimeSeriesSplit only for if your trying to predict what will happen in the future?
versus like categorising something that has a date column?
correct
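A small sketch of what `TimeSeriesSplit` actually does, assuming the rows are already sorted by date: in every fold the training indices come strictly before the test indices, which is only what you want when the goal is to predict later rows from earlier ones.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten rows assumed to be already sorted by date (oldest first).
X = np.arange(10).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # Every training index comes before every test index.
    print("train:", train_idx, "test:", test_idx)
```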
is it possible to make some ai that is well optimized for deployments and be very good at it?
i am finding it hard to optimize
You're talking about a neural network?
yes
What hardware are you using?
raspberry pi
There's probably nothing you can do to get good performance.
great then
Raspberry Pis aren't intended to be very powerful. If you're trying to run a neural network, you'd need to run the neural network on a different machine that can communicate with the pi
🙂↕️ tysm
is there sth like resource of popular papers from arxiv and other sources?
similar to arxiv sanity vanity
any idea when to use minmax scaler vs standard scaler in scikit learn?
sth like
stable diffusion model, attention is all you need, variational bayes (actually dont remember title some related to vae), ...
or I must compile it?
I need just most popular or popular dont need all of them
theres the AI hat you can buy that allows you to run ai applications on it but Ill be honest, either buy an expensive ass pc or, use google colab notebooks and run everything off the cloud
🤧 okay 👍 custom PCBs, do they work? An expensive ass pc is too heavy and bulky for my project, I need something small. Cloud is a second priority since it's an offline-based system that I designed
Pi is useless even with that usb coral tpu thing
first time learning and practicing neural networks and ai, if any of yall could help that'd be great #1373696202398240909
Any suggestions on the fastest way to convert csv to txt files ? I was thinking of just using pandas but i think it might be slower than the base csv to txt converter. Any suggestions?
CSV files are already plain text. What's the issue?
Hey guys.
Could someone please help me and look at a python code I'm working on? I'm not a programmer, nor do I have a degree in IT, so I'm not a pro. I posted the code two weeks ago and nobody answered, so I figured I'd ask here. Any help is much appreciated.
You always have to post the code before anyone can look at it, so it's always best to post it right away.
Yes, you're correct. I posted it and waited for a couple of days, however, nobody commented or anything. I just wanted to ask here first if someone would have a look.
People won't commit to looking at the code and providing feedback if they don't know anything about the code, or how long it is, or what it's intended to do. So it saves work for everyone, including yourself, if you always give people the information they need to do what you're asking of them right away.
It might be that people won't look at your code even if you do post it, and that would be unfortunate, but they certainly won't if you don't post it.
And if people look at it after telling you that they will look at it, they would have done that even if you hadn't made them ask you to post it.
Post it again and ask.
Hello! Can someone please help me with ONNX exporting? I'm trying to export an ELM custom model into ONNX format, but keep running into this mysterious error:
```
Cell In[1], line 4
1 import numpy as np
3 from onnx import helper
----> 4 from skl2onnx import convert_sklearn
5 from skl2onnx.common.data_types import FloatTensorType
6 from skl2onnx.common.utils import check_input_and_output_numbers
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\skl2onnx\__init__.py:16
12 __model_version__ = 0
13 __max_supported_opset__ = 21 # Converters are tested up to this version.
---> 16 from .convert import convert_sklearn, to_onnx, wrap_as_onnx_mixin
17 from ._supported_operators import update_registered_converter, get_model_alias
18 from ._parse import update_registered_parser
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\skl2onnx\convert.py:8
6 import numpy as np
7 import sklearn.base
----> 8 from .proto import get_latest_tested_opset_version
9 from .common._topology import convert_topology
10 from .common.utils_sklearn import _process_options
File ~\AppData\Local\Programs\Python\Python313\Lib\site-packages\skl2onnx\proto\__init__.py:22
18 except ImportError:
19 # onnx is too old.
20 pass
---> 22 from onnx.helper import split_complex_to_pairs
25 def make_tensor_fixed(name, data_type, dims, vals, raw=False):
26 """
27 Make a TensorProto with specified arguments. If raw is False, this
28 function will choose the corresponding proto field to store the
(...) 31 this case.
32 """
ImportError: cannot import name 'split_complex_to_pairs' from 'onnx.helper' (C:\Users\Admin\AppData\Local\Programs\Python\Python313\Lib\site-packages\onnx\helper.py)```
I'm using Python 3.13.2.
I plugged the UCF into a three body problem simulation.
I have the latest version of ONNX though
Also that's not what the error says...
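An `ImportError` of the form "cannot import name X from package Y" often means one installed package expects a function that the installed version of another package does not provide. Whether that is the cause here is an assumption, but a cheap first step is to print the installed versions of the packages named in the traceback and compare them against each project's release notes. A minimal sketch using only the standard library:

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the traceback above.
print(installed_versions(["onnx", "skl2onnx"]))
```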
This was all matplotlib
For the RNA stuff?
generally speaking
300 lines for the three body problem stuff
oh dear 
just for visual
didn't use chatgpt to get it faster?
i dont use chatgpt, but i do utilize AI
way I see it, my times limited on this planet. I got things todo
fair enough
This one's plotly and matplotlib
R1136 makes my computer lag lol
you dont even wanna see the 700 nt one
so if im using scikit learns gridsearch and ive got unbalanced categories, which score function should i use?
im currently using f1_macro
but idk if i should be using roc_auc_ovo or one of the other ones
ok so not issue with onnx
any idea what it means if a model has like 99% accuracy on both test and train data?
like is that overfitting or is it just really accurate?
i dont think theres any data leakage or anything
It is really accurate OR your test dataset is included in your training dataset
i dont see any way it could be
im using a pipeline in scikit learn
Depends on what you're doing
f1_macro and so on all make the assumption that the cost of misclassification is the same
All the classification problems I've worked on in the past month had asymmetric costs. I really needed to optimize for precision or recall
People have probably gotten tired of me asking "Do we care more about false positives or false negatives"
But that's the reflex you need 🙂
hmmm
(even if your dataset is balanced)
ignore number 5, its not in the version im using
tho i could recreate number 5 from number 6
its for a uni assignment and they removed a column
So you're predicting #6
yeah
Each record belongs to just 1 attack
yeah
Exactly 1, not 0 not 1+?
If it's a school assignment and not a "real life" problem then f1_macro or similar is probably fine
how did you split the train and test data?
tho im guessing it would probably be better if it miscategorised something thats normal as an attack than an attack as normal
Yeah in the wild a false negative is worse
You'd want to flag more things and have that as a starting point to investigate
i used train_test_split from scikit learn?
And I'd sell it to "business people" as possible attacks
do i need to use TimeSeriesSplit?
Hence recall > precision here
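To see how "recall > precision" can be expressed as a score, here is a sketch with toy labels (1 = attack, 0 = normal; the numbers are made up). An F-beta score with `beta > 1` weights recall more heavily than F1 does, and `make_scorer` turns it into something that could be handed to `GridSearchCV` through its `scoring` argument.

```python
from sklearn.metrics import f1_score, fbeta_score, make_scorer, precision_score, recall_score

# Toy labels: 1 = attack, 0 = normal. The model over-flags but misses no attack.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # beta > 1 weights recall more than precision
print(precision, recall, f1, f2)

# A scorer like this can be passed as GridSearchCV(..., scoring=recall_weighted_scorer).
recall_weighted_scorer = make_scorer(fbeta_score, beta=2, average="macro")
```

A model that flags too much but misses nothing is rewarded more by F2 than by F1, which matches the "flag more and investigate" preference described above.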
let me rephrase: Did you shuffle it before splitting or take the tail as the test data?
interpolating is a lot easier (and arguably less useful) than extrapolating
if you included the records for 18:33:31 and 18:33:41 for a given day, then it should be easy for the model to guess that everything in between those two timestamps has the same label
not sure if you need a time series split
But maybe yes
Look at the data and see if you have correlations along the time axis yeah
in contrast, if you ask for the model to predict a label for a day that was not present in your data the chances for it to get it wrong are much, much higher
what do you mean by interpolating?
It's a very strange case the more that I look at it
yeah i have no idea how the dataset actually works
If you random split your accuracy will be near 100 %
its like gps data or something
Due to what Etrotta is talking about
You have N data points from each attack
understanding your data is the very first step you should take before trying to do anything with it whatsoever
If you drop all features except time and do a random split you have near 100 % accuracy
"Oh it's around 18:30, what attack happened there? I see, that's when we had the ddos"
if I tell you that something costs 10$ on the day 1, 20$ on the day 3, then 10$ on the day 5, what would you guess it costs on the days 2, 4 and 6?
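The effect being described can be reproduced on synthetic data. In this sketch (all data made up) each label occupies one contiguous block of time, and the only feature is the timestamp. A shuffled split lets the model interpolate between neighbouring rows; a chronological split asks it about a period it has never seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic log: 1000 timestamps, each label occupying one contiguous block of time.
timestamps = np.arange(1000).reshape(-1, 1)
labels = timestamps.ravel() // 200  # 5 "attack types", 200 consecutive rows each


def accuracy_with_split(shuffle):
    """Fit a tree on the timestamp alone and return accuracy on the held-out 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        timestamps,
        labels,
        test_size=0.2,
        shuffle=shuffle,
        random_state=0 if shuffle else None,
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return model.score(X_test, y_test)


acc_shuffled = accuracy_with_split(shuffle=True)        # interpolating between known rows
acc_chronological = accuracy_with_split(shuffle=False)  # test set is the unseen tail
print(acc_shuffled, acc_chronological)
```

If the real dataset behaves anything like this, a near-perfect score after a random split says more about the split than about the model.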
like i get what the data is
i dont get how the date and latitude and longitude works to tell what the attack type is
Could be that they're using a specific data centre for ddos
Hey! I've narrowed down my error to being unable to install onnxconverter_common for some weird reason. I have CMake installed, I have Visual Studio installed, the PATH variables are updated, long file names in Windows are enabled. Whenever I try to install that module, latest version for Python 3.13.2, it tries to build something called "wheels", waits for like 5 minutes, and then gives me this monstrosity of an error many thousands of lines long that ends with this:
Does anyone know what could possibly be causing this?
lol what, what are those points in the ocean? islands or it's normalized/scaled in some way
or mock data
it goes from like 0 to 500 lat long
uhhh usually long goes -180 +180 and lat goes -90 +90
there are some different scales and other special ways of measuring, but still seems very weird
there might be a more detailed error message further up
personally I would probably just try installing it via conda instead
So, conda wouldn't give the same error message?
I've just never used conda so I don't understand how it'd be different as to what I'm doing here
it might be that going around the world multiple times just keeps going higher?
I mean that functionally isnt how long/lat works. Unless in the codes' case it is using some other coordinates to denote it like rotational. But that wouldnt really make sense as a data list of source locations. Which reading that table you posted doesnt seem to be the case so perhaps an error
the code from this worked https://gis.stackexchange.com/questions/303300/calculating-correct-longitude-when-its-over-180
longitude = (longitude % 360 + 540) % 360 - 180 turns it into -180 to 180
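As a sketch, that formula wrapped in a function with a few sample values. It only normalizes the number into the -180 to 180 range; whether that is the right interpretation of this particular dataset is a separate question.

```python
def wrap_longitude(longitude):
    """Wrap a longitude in degrees into the range [-180, 180)."""
    return (longitude % 360 + 540) % 360 - 180


for value in (0, 190, -190, 360, 500):
    print(value, "->", wrap_longitude(value))
```

In Python, `%` already returns a non-negative result for a positive divisor, so `(longitude + 180) % 360 - 180` gives the same values; the longer form comes from the linked answer, which was written for JavaScript.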
unless you find some documentation explicitly saying that this is indeed how they constructed the dataset, there is no guarantee it is correct
that question is specifically adjusting it given the way vue-leaflet works, a dataset created using different tool may have a different logic
unless you plot it and see the coordinates make perfect sense (e.g. all points are in cities with datacenters) I wouldn't rely on it
i mean its literally the exact same on the map
well i guess that would make sense
there is a chance the dataset is just completely senseless I guess
(random mocked data)
i emailed my lecturer to ask about it
the website for the dataset is here if it helps https://research.unsw.edu.au/projects/toniot-datasets
in first place, did you include any rows with no ongoing attack or you always predict some kind of attack?
theres a category called normal if thats what you mean?
oh
right i checked the original source, everything that isnt normal is an attack
becuase i wasnt sure if password was a type of attack
but it is apparently
which type of data in this were you using? Also, possible theory. is your globe view backwards (mirrored) by chance? since so many end up in the ocean I wonder if it's inverted.
my guess is in the processed_network_dataset based on the contents
which dataset in that site are you using for this? Trying to look at its formatting but lots of files here. certainly one of the processed ones it seems
the iot gps tracker one
except its not actually the one from that site
its like a small section of it
basically my uni said heres 6 datasets we modified and links to where the originals and information about them is
so its this one but no column 5
also yeah i think its the preprocessed folder
Yea I dont think the column labels for this sheet are correct
Well I should state that I dont know why the lat/lon numbers are so high but simply doing the calculation you posted should result in correct data. Though no idea why the plotting doesnt land on actual, well, land
well gps works if your not on land too i guess
could be weather balloons or something i guess
Well yea I do know that it works regardless generally speaking.
I mean weather balloons or similar would certainly align with the inherent issue in IoT device security in general
that looks neat
Hey there, I ran into hardware constraints while trying to finetune 3B and 8B variants of qwen2.5 with fp16 and bf16 precision (Bzzt, OOM errors). I have access to a total of 48(24+24) GB of VRAM but this is clearly not enough to train them in full precision so I have reverted to using 8-bit quantized models for the same. For some reason on the internet, everyone seems to be training their quantized models with LoRA and I wished to know if it will be possible to train these quants with SFT/RL without relying on LoRA as I do want to change the base model's weights.
Tie-dye Bowties
huh? A custom PCB for what exactly
Running a multi model yolo cc and then ocr together with a base ml models
nevermind, I think I was thinking of a different definition of 'custom pcb'
8bit is only for inference right , not training and even if you train using it , i doubt any changes will be made to the model (updates)
Have you used QAT?
It's kinda complex so my suggestion would be LoRA unless you have massive power
came across it while researching more training methods but QAT seems to be useful primarily for training models that have to be quantized by the end of the training process.
gotcha, I honestly wasn't aware that quants were inference only to be honest. i guess i have to stick with LoRA/QLoRA since i am on a deadline lol
Tbh LoRA gets the job done in most cases , unless you working specifically for some task-specific cases
yeah it's a coding task for a particular language
hope it's going to be enough- my earlier misunderstanding of ignoring that loading into the memory =/= VRAM consumed during training will cost me some days of progress welp
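The gap between "fits in memory" and "fits during training" can be sketched with back-of-the-envelope arithmetic. The 16 bytes per parameter below is a commonly quoted rule of thumb for full fine-tuning with Adam in mixed precision, not a measurement of this setup, and it ignores activations entirely.

```python
def full_finetune_bytes(num_params, bytes_per_param=16):
    """Rough memory for weights, gradients and Adam state, excluding activations.

    16 bytes/param is a common rule of thumb for mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights and two Adam moments).
    """
    return num_params * bytes_per_param


for name, params in (("3B", 3_000_000_000), ("8B", 8_000_000_000)):
    print(name, full_finetune_bytes(params) / 1e9, "GB")
```

Under that assumption a 3B model already needs about 48 GB before any activations are stored, which is consistent with the OOM errors described above. LoRA-style methods sidestep most of this by training only a small set of added parameters.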
am i not supposed to do dataFiltered.latitude = (dataFiltered.latitude % 360 + 540) % 360 - 180 to overwrite the latitude in the dataframe?
i got a chained assignment warning
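That warning usually shows up when the dataframe being written to is itself a filtered slice of another dataframe. A minimal sketch of one common fix, assuming `dataFiltered` was created by filtering a larger frame (the toy data and the filter are made up): take an explicit `.copy()` and assign the column with bracket syntax. Note that the formula being applied was posted for longitude; whether it is also right for the latitude column is not settled in the thread.

```python
import pandas as pd

# Toy frame; the values and the filter below are assumptions, not the real dataset.
data = pd.DataFrame({"latitude": [10.0, 200.0, 370.0], "type": ["normal", "ddos", "normal"]})

# Take an explicit copy of the filtered rows so later writes target a new frame,
# not a slice of `data`.
dataFiltered = data[data["type"] == "normal"].copy()

# Assign the whole column on the copy.
dataFiltered["latitude"] = (dataFiltered["latitude"] % 360 + 540) % 360 - 180

print(dataFiltered)
```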
what should the sequence length for a LSTM be over a very long period of time?
Trueeee
Depends try different sequence length
what's your use case?
honestly, I just do that until the MSE loss is as small as possible
from the 40's until now
LSTMs can handle long sequences up to a certain limit , try using bidirectional
Helps understanding context better
I did it in TensorFlow, and I have not touched an RNN in forever. I used them for natural language processing until I realized they were useless compared to the transformer and everything else
RNNs were like the 1st milestone, then LSTMs, then Encoder-Decoder, then Attention and lastly Transformers
and nowadays most tasks, like in NLP, are done by transformers, so there are very few use cases for the previous networks
but it's good to know
I know, I am using it for time series
Yea, LSTMs can also be used there
i actually meant to ask what kind of data you are working with here. Anyways, a good suggestion would be to begin with 7 as that can cover a lot of time(week) and should serve as a good starting point. Besides this, you can experiment with various values and pick the one which fits your loss expectations the best! You might have some problems if its the first time working with LSTMs directly, but that's also how i started and i'm sure gpt/gemini/claude can help a lot here!
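For reference, "sequence length" here is the number of past steps fed to the model per prediction. A minimal sketch of turning a 1-D series into fixed-length windows with next-step targets, using 7 as the window size per the suggestion above (the function name and the toy series are made up):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(series, seq_len):
    """Split a 1-D series into (inputs, next-value targets) for sequence models."""
    series = np.asarray(series)
    windows = sliding_window_view(series, seq_len)  # shape: (n - seq_len + 1, seq_len)
    X = windows[:-1]      # drop the last window, which has no following value
    y = series[seq_len:]  # the value right after each remaining window
    return X, y


X, y = make_windows(np.arange(10), seq_len=7)
print(X.shape, y)
```

Most LSTM layers expect a trailing feature axis, so for a univariate series `X[..., np.newaxis]` gives the usual (samples, seq_len, features) shape. Trying a few values of `seq_len` and comparing validation loss is then a matter of calling this with different arguments.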
is it possible to train llm on laptop instead of on cloud and it just will take much time?
so money saved but time not?
how it goes?
1-2 weeks of training on some A100 or the like
so it would take few months, estimated, not on A100, but some pc gpu
I have iris xe
Fine Tuning an LLM requires more compute power than can fit in a laptop.
yes but with some weights checkpointing
And if you spent a few months trying it anyway, you'd fry the laptop
ah finetuning not training read wrongly sorry
You can't train an LLM from scratch on any consumer hardware
And if you mean "training but not from scratch", that's what fine tuning is
so tldr gen ai must be done only on cloud?
sorry I thought this way train llm on some compute powerful, cost few million of $
but training llm on laptop would be free (not considering power consumption)
but would just takes longer
but this not working like this
because then companies would train for months for free
so in short I thought can just split compute
but training some language model is possible on laptop? (not llm)
I remember I used colab for resnet50 and vgg16 so they too are not possible to train on laptop?
so question is from what number of parameters its not possible to train from scratch on laptop?
ok also I remember resnet and vgg were pretrained and it was about transfer learning
it costs them a few USD per hour per GPU, with each GPU being many times more powerful than a laptop's
taking https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E for example, it required millions of hours worth of compute to train from scratch (divided in parallel amongst a lot of GPUs)
fine tuning is possible on high end consumer hardware, but not laptop level hardware
You can use Google Colab or Kaggle to borrow GPUs from google for free
yes but for noncommercial use, for learning
if it would be for commercial, then meta would use just colab 😂
I don't think that you understand the sheer scale of the data and number of GPUs they are using to train LLMs 
a single training run (*from scratch) costs millions of dollars worth of computing power
its more like the point is, commercial LLM cost money to train. There are not free versions at that scale even for "smaller" sets of data.
Because of the sheer amount of GPU power required to actually do it in an amount of time that is not absurd
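To put rough numbers on the "free but slower" idea, here is a back-of-envelope sketch using the common rule of thumb that training takes about 6 × parameters × tokens floating-point operations. The model size, token count, and both throughput figures are illustrative assumptions, not measurements of any real laptop or cluster.

```python
# Back-of-envelope training time using the ~6 * N * D FLOPs rule of thumb.
# Every number below is an illustrative assumption, not a measurement.

def training_days(params: float, tokens: float, flops_per_sec: float) -> float:
    """Days needed to train `params` parameters on `tokens` tokens."""
    total_flops = 6 * params * tokens
    return total_flops / flops_per_sec / 86_400

PARAMS = 7e9             # hypothetical 7B-parameter model
TOKENS = 1e12            # hypothetical 1T training tokens
LAPTOP = 1e12            # assumed ~1 TFLOP/s sustained on a laptop GPU
CLUSTER = 1000 * 1e14    # assumed 1000 GPUs at ~100 TFLOP/s sustained each

laptop_years = training_days(PARAMS, TOKENS, LAPTOP) / 365
cluster_days = training_days(PARAMS, TOKENS, CLUSTER)
print(f"laptop: ~{laptop_years:,.0f} years, cluster: ~{cluster_days:.1f} days")
```

Under these assumptions the gap is years versus days, and this ignores memory: the model would not fit on the laptop in the first place.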
could someone help me in #1374233576706408458 ? I've been tweaking it for hours and I can't seem to fix it
why are you using neuralnetwork package?
Use tensorflow or torch
Any advice for imputing missing categorical data ? None of my variables appear to cluster well or have relationships with the categorical variable
Hi guys, so i have an assignment at school that requires an expert in the field of artificial intelligence to be interviewed for my scientific article assignment. I really hope someone could help me here
can you direct people to your thread on this and post the list of questions?
Can someone help me answer some questions for my scientific article assignment here?
https://discord.com/channels/267624335836053506/1374415687870845049
Is this chat a good place to ask about data engineering stuff?
I'm a data engineer with 2 years of experience. Currently, I'm looking to start an AWS certification, but after studying through AWS Skill Builder, it seems more like a marketing stunt than a real certification. Based on my experience, most AWS services feel like auto-managed versions of open-source tools. At my startup, cost is a huge concern, so aside from Redshift, Lambda, and RDS, we avoid other AWS services. Am I wrong for sticking with hosting everything on EC2 (e.g., Kafka, Airflow, dbt for ETL) and using Lambda for code execution? This is how I’m handling things now. Any advice would be much appreciated!
Basically, all my problems are solved with SQL on RedShift and relatively simple Python scripts in Lambda (serverless). This setup handles everything we need right now!
what does "AI" mean to you?
Artificial intelligence, something that can talk to you, like a person
So you wouldn't consider a self-driving car to be AI?
No, I would consider it, but a different type of AI
Okay, so if you say that you want to "create AI", you have to be specific about what kind you're talking about
What you're describing is probably an interactive language model. You can't create those from scratch.
They cost millions of dollars to create
There are other things you can do with AI that are attainable
how can I get know even a little about llamaindex,langchain,crewai?
what is best option official docs?
I'm only really familiar with the LangChain part of it, and dang there is a lot of surface area to cover. I used the official docs myself, which seem pretty nice.. it's just a lot to take in before you might understand the "idiomatic" way to do something with it.
Does anyone know how to run the langchain repository locally on Windows?
I get lots of errors and a whole mess when I run the make test command in both paths
https://python.langchain.com/docs/contributing/how_to/code/setup/
I'll have to give it a shot, but not every project's actual test suite works on Windows sadly.. lemme see if it's obvious whether that's the case here
Not a great sign, they do not appear to have automated Windows builds in their GitHub action setup.
wsl is a option
I have tried multiple things and it's going crazy
yeah but that means I will work on Linux! right? but besides windows. I want to work on Windows
Sorry for the delay, this is what I get in my Windows env:
=================== 2 failed, 564 passed, 87 skipped, 1 xfailed, 172 warnings, 63 errors in 20.06s ====================
mingw32-make: *** [Makefile:25: test] Error 1
That's after uv sync etc like their docs suggest.
I guess all I can say is that this is an obvious place where a new contributor could make a positive impact on the project.
It just needs some stuff set up, like cross-platform in their CI config instead of just Linux
I'm sure these are just tests that aren't perfect yet etc rather than the lib being massively broken on Windows.
I actually do not love the style of this test suite implementation
you could use both
all your work can be done on windows and you pop up a wsl terminal to use lang chain
@viscid urchin Thanks for trying, and it's okay, don't worry about the response time. yeah, I got this many errors as well before. is it okay to ignore them then? like, is it safe to ignore them and do the work?
definitely not optimal but it works
If you plan to contribute to langchain, it's probably worth it to at least also set up a WSL environment so you can have an 'all green' test run to compare against. If you just plan to use it, I'd say simply using it on Windows and expecting it to work is fine.. If you find something that doesn't work on Windows, you can open a github issue etc.
thanks, it's quite weird that a big thing like them didn't sort such a thing
anyway thanks to all of you @viscid urchin @regal bane
Honestly you might consider filing an issue for "please add Windows to your CI build"
Somebody might come along and do it
(I might even do it)
If I used LC "in anger" I 100% would.
okay mate, I will see thanks
(I did just look pretty hard on their Issues list and there are a lot of things mentioning Windows, but nothing that seems to be asking to enhance the automated tests that get run.)
sorry for being late, probably because most of people just use wsl or linux on VBox or Linux as a main OS
who contribute a lot, idk. this is just my guess
Yeah, I'm one of those weirdos who runs a "Windows native zsh" env
did you find an issue about this case or it's better to open one?
I'd open one; didn't find one that looked good to jump on.
I am a windows lover tbh, I used linux for quite good time but didn't like it. although I studied it and so on
Just be super clear/polite/etc and describe the problem + proposed next step etc.
Yeah I've never come to love Linux.. (I do love FreeBSD though)
okay, so I am not going to open one as long as you will
Go ahead if you've got the inclination; I'm feeling lazy, just catching up on MotoGP 🙂
I'll gladly star/react/etc it if you do though 🍹
for now, I am feeling lazy too. lol maybe another time
Do you contribute in Langchain?
No, but I've been toying with the idea to learn it better
and honestly you've found some low-hanging fruit that I might work on
nice, good luck
any idea why my MLPClassifier from scikit learn performs better when i do a gridsearch cv but worse when i just fit it with the pipe?
i think it might be something to do with the cross validation?
im using skf = StratifiedKFold(n_splits=5, shuffle=False) because ive got time series data and i figured it would be best to keep it in order
!!!!
You watch MotoGP?
My homie
🫂
so i split it once at the beginning to get a test and train split
skf = StratifiedKFold(n_splits=5, shuffle=False)
skf.get_n_splits(X, y)
groups = dataFiltered[target].values
for train_index, val_index in skf.split(X, y):
    train_set = dataFiltered.iloc[train_index]
    test_set = dataFiltered.iloc[val_index]
    X_train, y_train = train_set.drop(columns=[target]), train_set[target]
    X_test, y_test = test_set.drop(columns=[target]), test_set[target]
then i ran
gridSearch = GridSearchCV(pipe, param_grid=param_grid, scoring='f1_macro', cv=skf, n_jobs=-1)
gridSearch.fit(X_train, y_train)
is it because i set cv=skf?
and somehow its matching the gridsearch results now
no idea whats going on
it was like 100% train accuracy and 85% test accuracy after the grid search
and then it was like 20% for both when i just fitted the pipe
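Two things may be worth checking here, offered as guesses rather than a diagnosis. First, fitting the bare `pipe` uses whatever hyperparameters it was constructed with, while `gridSearch.best_estimator_` is refit with the winning combination. Second, for ordered data a forward-chaining splitter such as scikit-learn's `TimeSeriesSplit` keeps every validation fold strictly after its training fold, which `StratifiedKFold(shuffle=False)` does not guarantee. A minimal sketch on a toy array (the data is made up):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for an ordered dataset: 20 rows in time order.
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # Every training row comes before every validation row.
    print(f"train up to {train_idx.max():2d} -> validate {val_idx.min()}..{val_idx.max()}")
```

The same `tscv` object can be passed as `cv=tscv` to `GridSearchCV` in place of `skf`.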
Yeah, I watch MotoGP, WEC, WRC, and F1 currently. I miss WRX but don't have an easy way to get it it seems 😦
people familiar with sktime: how do I use parallel processing with transformations like Catch22?
I think I've tracked it down to
```py
c22 = Catch22().set_config( ... )
```
but nothing I put in `set_config` seems to do anything,
```py
cfg = { "backend:parallel": "loky" }
cfg = { "backend": "joblib" }
```
etc, cpu usage is about the same
https://discord.com/channels/267624335836053506/1374609166747828305 Hello, can someone help me answer some questions here for my scientific article assignment?
Hi, I'm trying to train a yolov11n model (to run on mobile devices) and I'm trying to train it using the entire COCO dataset (for real-time object detection). Problem is I vastly underestimated how long it was going to take to train and I wanted to know if there's anything I'm doing wrong or anything I can do to speed up the process.
Here's my code below (I haven't even changed much, it's mostly just straight from the ultralytics documentation except the dropout, patience and device (because I'm using an M1 Pro Macbook))
from ultralytics import YOLO
# Load a model
model = YOLO("yolo11n.pt") # load a pretrained model (recommended for training)
# Train the model
results = model.train(
    data="coco.yaml",
    epochs=100,
    imgsz=640,
    patience=10,
    device="mps",
    dropout=0.01
)
The dataset is already installed and I had left it to train overnight but it didn't even complete one epoch
I estimated that it would complete at least two but I think the time per iteration increased significantly overnight
I played around with the batch size, and it started taking upwards of 40 GB of Memory at one point (I only have 16 GB of RAM so the rest was SWAP), so I just left it back to the default.
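A quick way to sanity-check the expectation is to work out iterations per epoch and multiply by the seconds per iteration shown in the progress bar. The sketch below assumes COCO train2017 (118,287 images) and a batch size of 16, which I believe is the Ultralytics default; the 5 seconds per iteration is a hypothetical placeholder, not a measurement.

```python
import math

COCO_TRAIN_IMAGES = 118_287  # size of COCO train2017

def epoch_hours(batch_size: int, seconds_per_iter: float) -> float:
    """Wall-clock hours for one pass over the training set."""
    iters = math.ceil(COCO_TRAIN_IMAGES / batch_size)
    return iters * seconds_per_iter / 3600

# seconds_per_iter is hypothetical; read the real value off the progress bar.
hours = epoch_hours(batch_size=16, seconds_per_iter=5.0)
print(f"~{hours:.1f} h per epoch, ~{hours * 100 / 24:.0f} days for 100 epochs")
```

If the real figure is anywhere near this, training on a subset of the data or on a borrowed GPU is probably more practical than waiting it out.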
any idea if i should drop day of the week or just leave it?
What's the label you're training for?
type
I'm no expert, but I'd leave it in probably. What's the difference between day and day of the week?
I say so it could just be a non-linear relationship
and it's not like the other features have a high correlation with respect to day of the week either, which makes it less significant
can i ask about data analytics, big data, data lakes and data warehouse here?
I assume this is the correct channel but just wanna be sure.
Basically, I am deciding between building a data warehouse project or a project that involves big data concepts, a data lake, machine learning and basically data analytics for real-time recommendations. I'm unsure which to go for. Is there anyone who worked on either and can share what their experience was like while working on their work/project?
I am asking this because as soon as I choose my final year project then that is likely the field I will be going into as a junior developer (whatever u call it) since this would be the biggest project I ever produced (when I complete it).
day is what the date is
day of the week is like, its a monday
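For anyone following along, the two features can be pulled from a date like this. The date is arbitrary, and the sine/cosine encoding is an optional extra I am adding as a suggestion (it puts Sunday next to Monday), not something from the discussion above.

```python
from datetime import date
import math

d = date(2025, 5, 19)        # an arbitrary example date
day_of_month = d.day         # "day": what the date is
day_of_week = d.weekday()    # "day of the week": 0 = Monday ... 6 = Sunday

# Optional cyclical encoding so that day 6 sits next to day 0.
dow_sin = math.sin(2 * math.pi * day_of_week / 7)
dow_cos = math.cos(2 * math.pi * day_of_week / 7)
print(day_of_month, day_of_week, round(dow_sin, 3), round(dow_cos, 3))
```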
Is this like an end-of-year project for a 4 year degree or something else? Unless this is some kind of guided schooling where you go right from school>internship>employment, the project is probably not going to have as massive an impact in the sense of forcing you into one side or the other in your career. What is your degree in and what kind of projects have you done so far? What level of interest do you have in either category?
update: seems to be some strangeness with pipelines in sktime, using it directly does seem to employ parallelism now (cpu high):
c22 = Catch22().set_config({
    'backend:parallel': 'loky',
    'backend:parallel:params': {
        'n_jobs': -1  # technically not needed because -1 is the default
    }
})
c22.fit_transform(time_series)
though I don't really see a difference in run time
in contrast, Catch22Wrapper requires pycatch22 but is like a bjillion times faster
for reference: I've a multivariate (6) time series, about 2500 in length
Catch22 takes ~1min to fit_transform
Catch22Wrapper takes ~0.08sec to fit_transform
my first time experience with sktime definitely isn't the best
another example: I can't seem to get something as simple as chopping / padding all time series to a length of 2500 to work
```py
preprocess_pipeline = (
    TruncationTransformer(2500)
    * PaddingTransformer(2500)
)
preprocess_pipeline = (
    PaddingTransformer(2500)
    * TruncationTransformer(2500)
)
```
these 2 both don't work, throwing out some error I'm not sure how to fix
eventually I just did the truncation part manually through some `polars` `filter`ing on the index, leaving me with only `PaddingTransformer`
then there's another performance issue, as it takes several minutes *just* to do what should be a simple pad (granted I do have a lot of data)
eventually I ditched it as well and tried only `polars`, the resulting code again only takes a few seconds
```py
# something like this for padding
(
    df
    .filter(c('time_series_id').is_first_distinct())
    .select(
        'time_series_id',
        pl.lit(list(range(pad_len))).alias('index')
    )
    .explode('index')
    .join( ... )
)
```
maybe pl.int_ranges(pad_len).alias('index') instead of pl.lit(list(range(...)))
I think I tried that but polars thinks what you're trying to do is create a column where each value is 1 int
int_range or int_ranges?
or maybe I haven't tried that idk, my brain is frying from debugging
ah right, I think I did int_range
good catch and ty
but yeah common polars W 
I haven't messed much with its time series related features, but it has a lot of methods specifically for it too
unfortunate that the integration is still lagging behind 😔
if I use sktime again, some of the stuff supports polars while others don't, so it's probably easier to just stick to pandas (or at least, before you pass into the transforms)
also there's no tutorials explaining how you'd use a polars dataframe with the transforms, I figured it out by code digging: columns starting with __index__ will be recognized as the time series id / time index / etc
so actually I had to have column names like __index__time_series_id or __index__time, then down the line find that some don't work and .to_pandas() anyway
I think shuffle=True
.
ah keep it in order
Would it be appropriate to post in here an academic website I made detailing my Neural Network that runs on a TI 84 Plus Silver Edition capable of autocorrecting words?
you may post it here once.
Understood, thank you. I hope it will be of interest to anyone who happens across it.
https://hermesoptimus.vercel.app/
A neural network implementation for the TI-84 Plus Silver Edition calculator capable of autocorrecting words.
I feel like this is the smoking gun. What do you guys think? It's domain distribution of 62 datasets across 4 domains. I found in my research all of them follow the same mathematical law when ranked. Information itself has a universal structure...
Information organizes itself different in complexity space. But even in chaos within the constraints of physical laws, there is structure.
Makes you really wonder about the universe itself..
Guys how do you train an RCNN model that generates 2000 proposals on colab? I tried and it just crashed because the RAM isn't enough
So, I modified the original RCNN’s selective search, to generate 1/4 of the original proposal size
Also is it normal for RCNN to start with insane loss like say 200 or 100
Also how do I remove a tensor from GPU RAM? I tried del tensor and clearing the cuda cache but it doesn't work
detach?
hi can anyone instruct how to start with data science while u have no knowledge whatsoever
anyone aware of a open source lib that facilitates agents, data retrieval, memory and memory usage
It is for the final year of my BSc Computer Science degree. I'm gonna be entering final year this coming September. Basically, because last year's final year students performed badly on the final year project, the teachers decided the project should be started in the summer (for those who are going to be in the final year of their degree in September). I have an interest in data analytics and data warehouses. In particular I love machine learning with data analytics. In fact, I was going to do that project instead (data analytics with machine learning). I started learning the basics of machine learning. I have beginner knowledge of Pandas. I am good with Python. Right now I am trying to look for beginner friendly projects I can work on. I want to do this because I will need a teacher to act as my supervisor for my final year project. Some teachers may ask for my CV and experience with machine learning and data analytics. I hope to do 1 or 2 beginner friendly projects so I can convince a teacher that I am able to learn the required concepts in order to do the project I choose.
memory is generally a bit awkward, imo there isn't a good one-fits-all solution
maybe take a look at Llama Index though
will do - i specifically want to index source code and then have the llm search for "bad" things and try to replace them with "good" things - as part of some kind of tech debt sweeping tool
Are websites like these a good start for someone wanting to do beginner friendly machine learning projects? Advice on what projects to do as i progress would be helpful (so I can get a feel for machine learning and get practical experience/improve my knowledge) https://www.freecodecamp.org/news/how-to-build-a-house-price-prediction-model/
that would probably end up crazy expensive if you show the entire repository each time, so you might want to start by creating a tool that will allow for the agent to search more effectively
could be something simple like a ctrl+shift+F equivalent, or maybe something more complex like creating linter rules
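A minimal sketch of what such a "ctrl+shift+F equivalent" could look like as a plain function that an agent calls instead of being shown the whole repository. The name `search_repo` and its signature are hypothetical, and how it gets registered as a tool depends on whichever agent framework is used.

```python
import re
import tempfile
from pathlib import Path

def search_repo(root: str, pattern: str, glob: str = "*.py") -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for every regex match under root."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, 1):
            if regex.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits

# Tiny demo on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.py").write_text("import os\nos.system('ls')\n")
    Path(tmp, "b.py").write_text("print('ok')\n")
    demo_hits = search_repo(tmp, r"os\.system\(")
print(demo_hits)
```

Returning only matching lines with their locations keeps each prompt small, and the agent can ask for more context around a hit when it needs it.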
kaggle?
my current mess of an idea is to try and create memories the llms use to compare and see how i can wire them up - its entirely ok if a run on 100 repos takes a night on controlled hardware as long as i get the single starting points working
what i'm trying to do is find cargo cult-ed instances of initial iterations of ideas that were adopted across many repos - and then reporting and/or trying to suggest a fix
just to check, do you understand the difference between RAG ""memories"" and in-context "memory"? specially how much the model knows about each
indeed - i believe i have to sort that out - i may run into a situation where i have to run hundreds of prompts + do memory/storage to ensure in context memory first - rag memory may end up just being something that keeps track so i can split the problem into more chunks
it would be so nice if there was a way to make context fragments and combinations of them instead of always streaming the tokens
seems like someone came up with something for that unfortunately i only found a youtube vid discussing it - if permitted i'll post a link
pretty sure it is fine (posting relevant yt links)
I'd love an explanation of this if you have some link on hand
https://www.youtube.com/watch?v=YNQKq1YfBAI discusses a paper that has llms make and use memories from the tokens they take - it claims to be better than infinite context
unfortunately there doesnt seem to be a implementation linked
HUMAN-LIKE EPISODIC MEMORY FOR INFINITE CONTEXT LLMS
ArXiv: https://arxiv.org/abs/2407.09450
Bytez: https://bytez.com/docs/arxiv/2407.09450
AlphaXiv: https://alphaxiv.org/abs/2407.09450
models can only really see what is in the context window (aka the prompt itself)
most abstract "memory" techniques use RAG to control what is included in the prompt - but the model does not have full access to all memories this way, it only sees subsets of it determined by the retrieval strategy
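A toy illustration of that point: the retrieval step decides which stored notes make it into the prompt, and the model never sees the rest. The word-overlap scoring below is a deliberately crude stand-in for an embedding search, and the function names and notes are made up for the example.

```python
def retrieve(query: str, memories: list[str], k: int = 2) -> list[str]:
    """Rank stored memories by word overlap with the query and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(memories, key=lambda m: len(q & set(m.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query: str, memories: list[str], k: int = 2) -> str:
    """Only the retrieved subset is placed in the context window."""
    context = "\n".join(f"- {m}" for m in retrieve(query, memories, k))
    return f"Relevant notes:\n{context}\n\nQuestion: {query}"

memories = [
    "repo alpha uses a hand-rolled retry loop",
    "repo beta pins requests to an old version",
    "repo gamma has no tests",
]
prompt = build_prompt("which repo uses a retry loop", memories, k=1)
print(prompt)
```

With `k=1` the notes about beta and gamma exist in storage but are invisible to the model for this question.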
And the context window is ultimately limited by your VRAM right?
yes but not only your VRAM, also how well the model can work with it and identify relevant information
Yeah that makes sense, thank you
first time I hear about that paper, yeah idk
there is also prefix caching which might help a little
vllm looks like something i'd run instead of ollama
hmm - oh - i just learned about the model context protocol - that may be neat to put agents together
Hi,
anyone with speech to speech realtime LLM experience in python?
ping me we need to develop the llm with function calling ability.
there's also a "hard limit" of sorts of what the base model was trained on which some techniques can get around while degrading quality
- e.g. llama 3 was trained on text that was 8192 tokens at the longest, so that's the native limit you'll see being thrown around
- during inference you can rescale the Rotary Positional Embeddings (RoPE scaling) to extend that while degrading the quality of responses a bit
- I believe that's the technique used in tuning llama 3.1 so it can "have 128k context" even though it's based on llama 3
and as mentioned by etrotta, there's a much-easier-to-hit soft limit of things being in context, but the llm being unable to utilize them
see RULER and the newer NoLiMa benchmarks
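To make the RoPE idea concrete, here is a sketch of linear position interpolation, one of several RoPE scaling schemes and not necessarily the one used for Llama 3.1. Positions beyond the trained length are squeezed back into the trained range before the rotation angles are computed. The base of 10000 and the tiny head dimension are assumptions chosen for illustration.

```python
import math

def rope_angles(position: float, head_dim: int, base: float = 10000.0) -> list[float]:
    """Rotation angle for each pair of dimensions at a given position."""
    return [position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]

def interpolated_angles(position: int, head_dim: int, trained_len: int, target_len: int) -> list[float]:
    """Linear position interpolation: map target_len positions onto the trained range."""
    scale = trained_len / target_len
    return rope_angles(position * scale, head_dim)

TRAINED, TARGET, DIM = 8192, 32768, 8

# Position 32764 lies far outside the trained range, but after scaling it
# yields the same angles the model saw at position 8191 during training.
scaled = interpolated_angles(32764, DIM, TRAINED, TARGET)
seen = rope_angles(8191, DIM)
print(all(math.isclose(a, b) for a, b in zip(scaled, seen)))
```

The trade-off is that neighbouring tokens now sit closer together in angle than they did during training, which is one intuition for the quality loss mentioned above.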
Thank you thank you!
last I checked open source models suck at it, your options are pretty much either gemini live or openai realtime, both of which are very expensive
```py
dict_1 = {'Ideal':5, 'Premium':4, 'Very Good':3, 'Good':2, 'Fair':1}
diamonds_df.cut = diamonds_df.cut.replace(dict_1)
dict_2 = {'D':7, 'E':6, 'F':5, 'G':4, 'H':3, 'I':2, 'J':1}
diamonds_df.color = diamonds_df.color.replace(dict_2)
dict_3 = {'IF':8, 'VVS1':7, 'VVS2':6, 'VS1':5, 'VS2':4, 'SI1':3, 'SI2':2, 'I1':1}
diamonds_df.clarity = diamonds_df.clarity.replace(dict_3)
# renaming the 'x','y','z' columns to more descriptive names
diamonds_df = diamonds_df.rename(columns={'x':'length_mm', 'y':'width_mm', 'z':'depth_mm'})
# removing dimensionless diamonds
diamonds_df = (diamonds_df.drop(diamonds_df[diamonds_df['length_mm']==0].index))
diamonds_df = (diamonds_df.drop(diamonds_df[diamonds_df['width_mm']==0].index))
diamonds_df = (diamonds_df.drop(diamonds_df[diamonds_df['depth_mm']==0].index))
# dropping duplicated rows in the DataFrame if there are any
diamonds_df = diamonds_df.drop_duplicates()
```
i am getting this message: "FutureWarning: Downcasting behavior in replace is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result". Is replace deprecated? Will it lose support in future? Im new to Pandas. I just want to know if I should use .replace() or is it not good to use?
Im following this tutorial: https://medium.com/@idowuadamo2904/machine-learning-for-price-prediction-a-step-by-step-guide-ad5913b5cec7
Yes, you can use pandas still with .replace
The warning is just about a future change to how pandas automatically changes data types after a replacement. It's basically telling you to follow the .replace(...) calls on diamonds_df.cut/color/clarity with .infer_objects(copy=False), which keeps the old behavior so your code won't be affected
You can also run pd.set_option('future.no_silent_downcasting', True) to opt in to the future behavior, which stops the warning
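Another option, sketched here on a small made-up frame rather than the real diamonds data, is to use `Series.map` for this kind of label-to-number encoding. It builds the numeric column directly, so the downcasting question should not come up at all. One difference from `.replace` to be aware of: categories missing from the dict become NaN instead of being left unchanged.

```python
import pandas as pd

# Small stand-in for the diamonds data; the rows are made up for illustration.
df = pd.DataFrame({"cut": ["Ideal", "Good", "Fair", "Premium"]})

cut_order = {"Ideal": 5, "Premium": 4, "Very Good": 3, "Good": 2, "Fair": 1}

# map() looks each label up in the dict and returns a new numeric Series.
df["cut"] = df["cut"].map(cut_order)
print(df["cut"].tolist())
```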
https://docs.google.com/document/d/1Bey9Qt6dcif0r4--rE3BP3GupnAw306EJGRgkZG7N3s/edit?usp=sharing
I updated my paper quite a bit, i think im toward the end of my journey.
The Unified Complexity Framework: A Novel Paradigm for Quantifying Data Complexity and Optimizing Curriculum Learning Andrew Scott Gracey Independent Researcher bigpunk2@gmail.com Abstract This paper introduces the Unified Complexity Framework (UCF), a novel and potentially transformative parad...
Please let me know if visuals come in
I had about diamond price prediction but it was in excel
or sth not about coding in python
Good Morning
I'm running LLama 3.2 1B with ONNX and DirectML because my AMD card is old. Loading it consumes 5.3GB of VRAM out of 8GB, which is okay, as long as it doesn't take it all.
1 initial prompt + 3 follow ups is enough to consume the rest of the 8GB total VRAM. From the 5th prompt onwards it gets really slow. Still better than CPU, but worrying.
- Is this normal?
- Are sessions stored in VRAM?
- Is there a fix or a way to reduce VRAM usage?
I ran DeepSeek R1 using a converted model I pulled from HuggingFace and it was capable of prompting again and again just fine. Probably I didn't test it enough because the lack of an openai-compatible API convinced me to delete it. But I wonder if I'm doing something wrong or am ignorant about how this works.
This is my first time doing this but I'm a junior/mid level python dev
Tried Phi3.5-mini but there was a leak that doubled the VRAM usage on the first prompt and the model kept appending the answer over and over until it ran out of tokens and returned HTTP code 500.
Using Lemonade SDK as runtime+REST API
Maybe use Hybrid models that do integer calculus on the CPU to kinda split the data between RAMs? Idk, just brainstorming
PACE methodology or CRISP-DM ?
talking with me?
I have a dataset of 1.5 million users anime lists, and I want to build an anime recommendation website. But I have no idea how much a project of this scale would cost. Is there anyone who can give me rough estimate and maybe break down the expenses?
it depends, if you run everything locally and do not host anything on the cloud it's only going to cost electricity and time
you could also train some models in Google Colab and host in Hugging Face Spaces free of charge
it could get pretty expensive if you were to rent enough compute to handle thousands of users accessing it daily though
i feel like this is something they could accomplish locally
it depends on what methodology they use tbh
depends on the model hyperparameter. in theory you could have as large a context window as you want
gpt2 has a context window of 1024 tokens
Very interesting! Could one store the context info locally in a way? Even at detriment of performance. Or maybe disable context windows all together?
Thanks for tagging that message. Ended up being very relevant to my issue.
what do you mean?
The context's data. Idk how this work so pardon me. Could it be stored somewhere else at the cost of latency so that it doesn't keep consuming more and more VRAM?
Because a single very small context in amount of follow up prompts (4 prompts) is enough to take the remaining ~2.6GB of VRAM
I can still use it but it becomes very slow as it tries to free memory or use regular RAM, which is what I want at the end of the day: share the load. But I assume there's a more formal way of implementing this behavior?
context windows are usually relatively small compared to the model
in a model with millions of trainable parameters, a 1024 token or even a 100000 token context window takes up negligible space
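For a rough sense of scale, the memory that grows with context length is mostly the key/value cache, and its size can be estimated from the model's shape. The sketch below assumes a Llama-3.2-1B-like configuration (16 layers, 8 key/value heads, head dimension 64) and a 16-bit cache. Those values are my assumption about the architecture, and a given runtime may allocate memory quite differently.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values (hence the factor of 2) cached for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed Llama-3.2-1B-like shape with an fp16 cache.
CONFIG = dict(layers=16, kv_heads=8, head_dim=64, bytes_per_value=2)

for n in (1_024, 8_192, 100_000):
    mib = kv_cache_bytes(n, **CONFIG) / 2**20
    print(f"{n:>7} tokens -> {mib:,.0f} MiB")
```

Under these assumptions a short conversation needs only tens of MiB of cache, so a few prompts eating gigabytes points at something other than the context itself, while very long contexts do start to add up.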
what are you running?
what model?
LLama 3.2 1B version with ONNX runtime under DirectML. Hosted using Lemonade SDK because of the OpenAI-compatible wrapper
Either this, OpenCL or CPU as far as runtimes go
I explain it here
maybe a memory leak
but i don’t think it’s an issue with the context window necessarily
Could it be something in how the model is converted to ONNX? I converted models before and the output suggests that there are losses in precision at least in my case. Those are official models, though. Converted, configured and fine tuned by AMD. They're hosted in the official organization at HuggingFace.
Maybe a leak in the runtime version that Lemonade depends on. Because of C extensions.
What if I download Lemonade's source and keep bumping versions of dependencies to see if it stops? Could it possibly work? xD
guys how do i get into making ai because im stuck all i know is to learn python rn im watching bro code idk if i should switch to freecodecamp
What does AI mean to you?
wdym
Please define AI for me without looking it up
aight
its artificial intelligence for me i wanna make like and programme where someone gives me info it can give info back and thats all, i know and i wanna make money by solving problem and wanna keep improving and not do a 9-5
.
Learning about AI won't help you "escape the 9-5"
i still wanna just make money idc how little
You can't do that with AI.
u can
Okay, good luck.
I'm not going to help you with something that I think is misguided and a waste of your time. If you're interested in actually learning about and understanding AI and preparing for a career in that space, I'm happy to help.
should i try cause im 14 so i wanted to make ai
If you want to do the thing I said, there are worthwhile things you can start doing at 14
dont i gotta do college for cs or smth i dont really know
Hi I was thinking of making a CNN model to track real-time deforestation using satellite imagery, what dataset should I be using?
Yes
O
Don't listen to people telling you that you can't do something. If you wanna do it, go Nike on it and just do it. The worst you will do is fail and possibly learn something. This isn't rock climbing. But you need a better plan or idea, and to start researching how you want to work on it. You have all the tools at your fingertips; I recommend getting better with those first.
Hey guys I wanted to know How to train model on huge data
My Features are of shape for training 584,1536,1392,7 and targets 584,1536,1392
I kept to train a model at night and It has not even completed 1 epoch yet
All data is about 100gb
so i stored both features and targets in separate npy files and then I am training on them in batches so all data is not loaded in ram
any other way I can train little faster?
Or it does seems unusual actually to training this much time for 1 epoch
Are these images? What are they?
yes images we can say
Downscale if applicable.
Okie
I have one more doubt so as my model is training my ram usage is increasing
all data is about 100gb
so at max it should take 100gb and Im training it in batches
and still its taking 133gb ram
on idle its around10-15gb
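One thing to check is how the `.npy` files are opened. A plain `np.load(path)` reads the whole array into RAM, while `np.load(path, mmap_mode="r")` maps the file and only pulls in the slices that are actually touched. A minimal sketch with a tiny stand-in file (the file name and shapes are made up):

```python
import os
import tempfile
import numpy as np

# Write a small array to disk as a stand-in for the large feature file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.npy")
np.save(path, np.arange(24, dtype=np.float32).reshape(6, 2, 2))

# mmap_mode="r" maps the file instead of reading it all into RAM.
features = np.load(path, mmap_mode="r")

def batches(array, batch_size):
    """Yield consecutive batches, copying only the current slice into memory."""
    for start in range(0, len(array), batch_size):
        yield np.array(array[start:start + batch_size])

shapes = [b.shape for b in batches(features, 4)]
print(shapes)
```

Even with memory mapping, the operating system's file cache can make reported RAM usage climb as more of the file is read, so a rising number is not necessarily a leak.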
Dude I am creating most powerful research tool x model
ty imma keep striving
What are the best courses/certificates for Data Science in 2025? 🤔
There are no certificates for data science that have any value.
Really? So what is the best way otherwise?
Getting a degree in computer science with data science related coursework
Yes, I see that a course can't replace an entire degree. However, wouldn't an online course demonstrate practical knowledge?
Every position you apply to will have many applicants with relevant degrees, so if you don't have one, your resume won't even be considered
I'm currently enrolled in a degree, but I also want to do something outside of university, you know? 😅
Talk to the professors for the data science courses and ask if you can participate in their research.
That's what I did, and it's the main reason I got a job.
Where did you study at?
Virginia Commonwealth University
Interesting... I think that would make more sense. Thanks for the recommendation! 👍
is this a good neural network model?
network size?
depends;for the most part it looks decent
if that's the loss & train for something like MNIST, then you might be able to do better
can i ask you if this one is better? i am new to this stuff so i cant compare to other graphs
if it is loss & train for something like NLP, then ur doing really well
chat gpt told me the first one is better but i am worried about those spikes
the first one is probably better
alr thanks!
fluctuation is normal, as long as its not huge fluctuation all the time
this one is a bit concerning because it starts off at 80 something % accuracy
model = models.Sequential()
model.add(layers.Conv2D(32, (3,3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(64, (3,3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D(2,2))
model.add(layers.Conv2D(64, (3,3), activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.GlobalAveragePooling2D())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.BatchNormalization())
model.add(layers.Dense(10, activation='softmax'))
this is my model
what are you trying to do?
i am learning image classification
also you could experiment with filter sizes as well
maybe 3x3, 5x5, 7x7
oh for Mnist?
what mnist?
like handwritten number detection?
no i am using a dataset with 10 image classes and trying to classify the test data
ah okay
this one:
(training_images, training_labels), (testing_images, testing_labels) = datasets.cifar10.load_data()
ill just say that my network for a similar kind of thing was:
Conv2d --> BatchNorm --> ReLU --> Conv2d --> BatchNorm --> ReLU --> MaxPool 2d --> Conv2d --> BatchNorm --> ReLU --> Max Pool 2d --> Dropout(0.5) --> Dense --> dropout(0.5) --> Dense --> Dropout(0.5) --> Linear Output layer
ill try that one and compare! thanks
i probably used too many batch norms 💀
idk what dimensions ur images are, but my conv filters were 3x3
also do u run stuff on your cpu or gpu? because i saw that using the nvidia gpu the training is much faster
shouldnt really matter for such a small network
if ur on mac, use mps
if you mean pixel wise i have scaled them to the 0-1 range
if ur on a cuda supporting machine, use Cuda obviously
otherwise CPU is probably fine
training_images, testing_images = training_images / 255, testing_images / 255
layers.Conv2D(32, (3,3), activation='relu', input_shape=(32, 32, 3)) The filters (kernels) are the (3x3) thing
you mean gray scale?
should be fine
just means you get to use less feature maps in your conv layers
ok thanks for the help. so i guess when doing the model its always best to test many different things and see whats better
yeah.
for context, i ran on 24 epochs with a batchsize of 32
is there like a logic behind or just brute force it ahah
with 60000 (i think?) images
this is how ive done it history = model.fit(
training_images,
training_labels,
epochs=30,
validation_data=(testing_images, testing_labels),
callbacks=[early_stop, reduce_lr]
)
ill try your model now
dynamic lr is probably overkill but yeah, looks good
it is kind of bruteforce
im sure there is a deterministic way to do things 💀
yeah little by little ill learn this dark magic ahaha
there is a rule of thumb to help avoid overfitting: keep the number of hidden neurons below N_h = N_s/(alpha * (N_i + N_o)), where N_i = # inputs, N_o = # outputs, N_s = # samples in the training set and alpha = some scaling factor between 5 and 10
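A quick sketch of that bound, in case it helps to see numbers. The function name and the example sizes are made up for illustration, and the result is only a loose guide (it was meant for dense layers, not conv filters):

```python
def max_hidden_neurons(n_samples, n_inputs, n_outputs, alpha=5.0):
    """Rule-of-thumb upper bound: N_h = N_s / (alpha * (N_i + N_o))."""
    return n_samples / (alpha * (n_inputs + n_outputs))

# Hypothetical example: 20,000 samples, 64 input features, 10 outputs
for alpha in (5, 10):
    print(alpha, round(max_hidden_neurons(20_000, 64, 10, alpha), 1))
```

A larger alpha gives a stricter (smaller) bound.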
oh wait is this keras
or Tensorflow or whatever
yeah
sweet
you used torch? i have never done that yet
i use torch
and also burn.dev, which is a rust framework
it really doesnt matter imo
whats the difference?
^^ no difference
ah ok so ill just learn one of the 2
i think TF might be more performant (????) but it really doesnt matter
next project ill try that gpu stuff to speed thing up. when i read that i got very interested
🔥
ill probably have to increase the epochs?
likely not
its overfitting?
idk what channels you used
definitely not
oh i used 32, 64, 128
also early stopping of 2 epochs
Adam with weightdecay of 1e-5
lr 1e-4
wait if ur gray scaling, why is your input 32x32x3
oh i also used a 1x1 padding, (idk if thats the same as padding="same" in keras)
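On the padding question: with a 3x3 kernel and stride 1, a padding of 1 keeps the spatial size unchanged, which is the same result Keras gives for padding="same" in that case. A small sketch of the usual output-size formula (the helper name is made up here):

```python
def conv_out_size(size, kernel, padding=0, stride=1):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

print(conv_out_size(32, kernel=3, padding=1))  # 32, size preserved
print(conv_out_size(32, kernel=3, padding=0))  # 30, no padding ("valid")
print(conv_out_size(32, kernel=5, padding=1))  # 30, padding=1 is not enough for 5x5
```

So the two only match for 3x3 kernels; a 5x5 kernel would need a padding of 2 to keep the size.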
no i am not grayscaling. i am just normalizing the images
the data set has images of 32x32x3
right, likely maps for r, g and b
its basically image of planes, trucks, deer and other stuff. i have limited the training to 20k images and testing to 4k
sounds about right
yeah no way ur overfitting
so over fitting is when the lines go far apart because it memorises the training images?
this is the 3rd model
hmm, val > train for accuracy is usually not good
nor is train > val for loss
maybe simplify your model?
overfitting is the point when the training objective keeps improving but validation objective starts getting worse
and also maybe weight decay
in this image the model is not overfitting cause the val loss is still decreasing
oh wait this might just be because of the dropout layers
training objective being worse than validation objective is also not a total disaster as it could occur naturally
like for example, having dropout layers (which you do) actively hurts model performance in training to seek a better generalized model
or say an image augmentation step like affine is a part of your training, then your training dataset keeps changing so the model can't really overfit it ever (unless it's highly overparameterized), so you might see that training loss stops decreasing at one point yet the validation loss keeps improving
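To make the "validation starts getting worse while training keeps improving" check concrete, here is a minimal sketch that works on per-epoch loss lists (for example the ones stored in a Keras `history.history` dict). The function name and the curves are invented for illustration:

```python
def summarize_curves(train_loss, val_loss):
    """Return (best_epoch, overfitting) from per-epoch loss lists.

    best_epoch is the 0-based epoch with the lowest validation loss.
    overfitting is True when, after that epoch, training loss kept
    improving while validation loss ended up worse.
    """
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    is_last = best_epoch == len(val_loss) - 1
    overfitting = (not is_last
                   and train_loss[-1] < train_loss[best_epoch]
                   and val_loss[-1] > val_loss[best_epoch])
    return best_epoch, overfitting

# Made-up curves for illustration
train = [1.8, 1.3, 1.0, 0.8, 0.6, 0.5]
still_improving = [1.7, 1.4, 1.2, 1.1, 1.0, 0.95]
turning_up = [1.7, 1.4, 1.2, 1.25, 1.3, 1.4]
print(summarize_curves(train, still_improving))  # (5, False)
print(summarize_curves(train, turning_up))       # (2, True)
```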
ok thanks for all the help. i have learned a lot. I will put this project to rest now and come back once i learn new things
so in my other models i shouldve let it train longer as the values didnt plateau?
well your early stopping condition is 5 epochs with no performance increase, so its probably not a problem
I mean if you did your data separation right, and validation loss is still decreasing, that just means that your model is still getting better at classifying unseen data (which is good)
In the first image you shared it looks like val loss kinda plateaued tho, so maybe not much to gain from further training that one
the one I replied to is still improving
and it's also not that difficult to compare your different models; just compare the val loss of them
e.g. in your post that said:
is this a good neural network model? (img)
you can see that the val accuracy ended up at about 0.72
can i ask you if this one is better? i am new to this stuff so i cant compare to other graphs
In this post the val accuracy is also about 0.72
this is your model
in this post the val accuracy is only 0.5
so comparing the 3 in terms of performance, model 1 = model 2 > model 3 (roughly)
but model 3 can still be trained cause val loss is still improving
obviously you have to be a bit careful when you only compare on the same validation data cause in a way you're now just fitting to seen data
that's when cross validation comes in if you want to look that up
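For reference, the core of k-fold cross validation is just rotating which slice of the data is held out. A minimal NumPy sketch (the helper name is hypothetical; scikit-learn's `KFold` does the same job):

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

for train_idx, val_idx in kfold_indices(20, k=5):
    print(len(train_idx), len(val_idx))
```

You would train one model per fold and average the validation scores, so no single validation split decides which model looks best.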
although 20000 is admittedly a limited amount of data; would increasing epochs help? I'd tend to think not
i feel like a better solution for model 3 might be to downsize the model
also batch size could make a big difference
but so a 73% accuracy means that there is still a lot of room for improvement?
not necessarily
see how testing accuracy is plateauing?
aah ok. ok ill do my last test where i train my first model to the whole dataset instead of just 20k images
well if the val loss starts plateauing then at that point more epochs probably doesn't help
(* though there was research a while ago that suggests training beyond that point for a long while will reach "model grokking" which is when after a long period of no improvement suddenly it improves again)
you could also downsize your model and decrease (?) batch size
the * part would require your optimizer to be using weight decay no?
potentially
obviously, the perfect model would get 100% accuracy so in a sense yeah there's always room for improvement until you reach that
however you have to consider other problems
example: maybe your current training data inherently can't make such a model - heck, maybe your current training data can only ever make a model of 72% accuracy
interesting; ill give it a read this summer
looks better
whatever you changed made it so the accuracy plateaus at about 0.75
(tho again be careful about overfitting yourself on the validation set)
i just increased the training data from 20k to around 60k
nice!
unet 64-128-256-512-1024-512-256-128-64
its weather data so there are two conditions lightning and no lightning
i calculated manually so lightning events are only 3% of the dataset
should i apply weighted loss calculation?
my current run I believe will be completed till morning
currently loss is reducing but I am pretty sure Its because of no lightning cases
either use a huge model or randomly sample from your dataset
more dense model?
im not sure what your goal is, but 100gb of data will certainly overfit
my goal is to predict lightning
given 7 channels
but naturally no lightning events are much more higher than
lightning events
still think you have way too much data
i did some calculation
You can do Focal Loss
It will automatically handle class imbalance
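For anyone who has not seen it, a minimal NumPy sketch of binary focal loss follows. The gamma=2 and alpha=0.25 values are commonly used defaults, not something tuned for this data, and focal loss reduces the influence of easy examples rather than removing the imbalance itself:

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    p_t = np.where(targets == 1, probs, 1.0 - probs)       # prob. of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)   # per-class weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

targets = np.array([0, 0, 0, 1])
easy = np.array([0.05, 0.05, 0.05, 0.95])   # confident and correct
hard = np.array([0.05, 0.05, 0.05, 0.10])   # the lightning pixel is missed
print(binary_focal_loss(easy, targets), binary_focal_loss(hard, targets))
```

The missed positive dominates the second loss value, while the well-classified negatives contribute almost nothing.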
total points in data = 8,052,129,792
trainable parameters?
Yeah if the model is too large it will also take forever
I didn't print it
when I am using pytorch I always forget to do that
poly loss would also work
Poly loss? Like MSE?
lightning events = 42,560,000
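Taking those two counts at face value (and assuming both are counted over the same grid points), the class ratio and the positive-class weight a weighted BCE would use can be worked out like this. Note the fraction comes out well below the 3% mentioned earlier, so the two figures may be measured differently:

```python
total_points = 8_052_129_792
lightning_points = 42_560_000

positive_fraction = lightning_points / total_points
pos_weight = (total_points - lightning_points) / lightning_points

print(f"lightning fraction: {positive_fraction:.4%}")
print(f"pos_weight for a weighted BCE: {pos_weight:.1f}")
```

A pos_weight like this is what you would pass as the weight on the positive term of a weighted binary cross entropy; it is a starting point, not a tuned value.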
Oh didn't know about polyloss...awesome...learn something every day
yeah so maybe artificially drop out some of the non-lightning cases
just because you have a full dataset doesnt mean you should use the full dataset
then will it create bias in model? predicting lightning cases more
not necessarily
your goal is to predict lightning; your model will learn the behavior/features it should expect before lightning vs not before lightning, and it shouldnt necessarily matter that no lightning occurs more frequently
so maybe randomly sample 42 million non-lightning events
and use that as your dataset
ok
just one more thing Im passing 2d arrays so how will random sampling will work?
It will create patches in data then right?
in fact your model will probably perform worse if ur doing a binary classification, and one of your cases only consists of 3% of the data
im not sure how your data is structured so i cant say
consider it as an image
and target is No of samples 834
Shape of one sample 1536,1392
and each time stamp got 8 images 7 features and 1 target
an lstm might actually be a good tool for this 💀
convlstm?
I am planning to use it but I am first trying to predict normally
after this, what I will do is add lag to the data
well how do you plan to encode time series?
oh by concatting the images into one matrix?
well yes
we can say ndim array ?
What's ndim
n-dimensional tensor i assume?
Oh, I see
Looking at polyloss it seems like it needs class weights to prevent class imbalance.
Wouldn't Focal Loss be more appropriate if you don't want to compute class weights?
yes I was reading about it
maybe, but looking at Kaboom's goals and model holistically, i dont actually think itll be a problem
just use cross entropy or something
and randomly sample an equal number of non-lightning cases as lightning cases
Well 100gb class weights computation seems a bit much
it says it modifies cross entropy loss that down-weights the loss for easily classified examples
i mean class imbalance shouldnt be an issue at all
just dont use all the non-lightning data
but how we can remove it from 2d grid?
one thing I think is to clip 128 x 128 or 256 x256 snaps
over lightning events
Is this a binary classification problem of lightning or no lightning?
essentially yes
yes
feeding in a time series, we want to find out if the next step is lightning or no lightning, is what im interpreting this as
exactly
that's a later step: I will give time t features, and the targets will be t+2
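A rough sketch of that pairing, assuming the features and targets are stacked along a leading time axis (the array shapes and function name are stand-ins, not the real data layout):

```python
import numpy as np

def make_lagged_pairs(features, targets, lag=2):
    """Pair features at time t with targets at time t + lag."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return features[:-lag], targets[lag:]

# Tiny synthetic stand-in: 10 time steps, 7 channels, 4x4 grid
features = np.random.default_rng(0).normal(size=(10, 7, 4, 4))
targets = np.arange(10)            # use the time index as a fake target
x, y = make_lagged_pairs(features, targets, lag=2)
print(x.shape, y[:3])              # y starts at t = 2
```

The last `lag` feature steps are dropped because they have no target yet.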
so really, just cross entropy loss or BCE or something and just sample for a smaller subset of non-lightning data
Ahhh the events leading up to lightning will be used to predict lightning...yeah an LSTM is a good way for this with BCE
padding a bunch of series would be pretty funny though
jank as hell
i mean itd work probably, but its just conceptually hilarious
how will resampling work? Any small example?
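One possible way to resample on a 2D grid, following the patch idea mentioned above: cut each target image into fixed-size patches, keep every patch that contains lightning, and randomly keep only as many empty patches. This is a sketch on a synthetic grid, and the function name, patch size and fake lightning pixels are all assumptions:

```python
import numpy as np

def sample_balanced_patches(target, patch=128, seed=0):
    """Return (positive, negative) lists of top-left patch corners.

    A patch is positive if it contains any lightning pixel (target == 1).
    Negatives are randomly undersampled to match the number of positives.
    Edge strips narrower than one patch are ignored.
    """
    rng = np.random.default_rng(seed)
    h, w = target.shape
    pos, neg = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            has_lightning = target[r:r + patch, c:c + patch].any()
            (pos if has_lightning else neg).append((r, c))
    keep = rng.choice(len(neg), size=min(len(pos), len(neg)), replace=False)
    return pos, [neg[i] for i in keep]

# Synthetic target grid with the sample shape mentioned earlier and two fake lightning pixels
target = np.zeros((1536, 1392), dtype=np.uint8)
target[100, 200] = 1
target[900, 1000] = 1
pos, neg = sample_balanced_patches(target)
print(len(pos), len(neg))
```

The same corner coordinates can then be used to crop the 7 feature channels, so features and target stay aligned.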
yes conv lstm conv for spatial features and lstm for temporal features
It would be yeah...maybe a consistent window of time before lightning? That way no need of padding in that dim
okie I Will try this focal loss,and resampling and conv lstm I will update you guys
is it ok?I mean can I update?
Yeah if somethings off or some error just message here yeah
Okie Thank you
Testing a Universal Complexity Framework (UCF) across different data types
I've been working on a mathematical framework that measures how information organizes itself, and got some interesting cross-domain results I wanted to share.
What I tested: UCF assigns a "phase angle" (θ) to different types of data based on their complexity patterns. The theory predicts certain ranges for different domains:
Financial markets: ~90° ("controlled uncertainty")
Mathematical sequences: ~0° ("pure order")
Physical systems: ~180° ("conservation")
Financial validation results:
Tested 4 major cryptocurrencies, all landed in the predicted 70-110° range:
BTC: 86.2°
ETH: 102.6°
ADA: 91.4°
XRP: 91.5°
Unexpected discoveries:
Prime numbers → 116.7° (closer to biological optimization than pure order)
Natural language → 180.5° (shows conservation-like patterns)
Chaos systems → 98.1° (confirmed controlled uncertainty)
What's interesting: UCF seems to detect consistent mathematical signatures across completely different types of information - financial data, language, mathematics, physics all show distinct but predictable patterns.
The financial predictions working so consistently was unexpected.
multiple runs show consistent results
do you actually have images or not? you can "consider it as an image" doesn't mean that'll be the best way to solve it
can a person with low IQ or low problem solving skills become a good data scientist by doing practice/hardwork??
Yea
Richard Feynman:
I was an ordinary person who studied hard. There are no miracle people. It happens they get interested in this thing and they learn all this stuff, but they’re just people.
Please elaborate what the phase angle is and what it indicates
hey @bleak rampart thanks for the question. In the UCF the 'structural phase angle θ' is designed to capture the nature or character of the internal organization and structure within a data sample. Think of it like this, while the Magnitude ∣Φ∣ tells us how much complexity or energy there is, the Phase θ tries to tell us what kind of structure is present
Ohh! Thanks
For a given domain the angle wouldn't be a constant value, it would differ...So the angles you provided will be the average value during recent time ?
Yeah, for any data I throw at the UCF, the structural phase isn't going to be some static, one-size-fits-all number. Every individual chunk of data – whether it's a window of an RNA sequence or a snapshot of market indicators – gets its own θ based on its unique internal structure at that moment. That's why those polar plots above show a scatter of points; each one is a distinct UCF signature.
hey i have a hard time doing my project can some one look at my git maybe give me some suggestions
no? ( = . = ) its oky
I dont mind, I cant promise ill be able to help you but I'll try
Anything is better than none
Just feedback will do
TTS is not going great, that's why. idk how to fix it, AI is feeding me nonsense and yt doesn't help
damn i should visit this channel more often, TIL abt polyloss
Hello does anyone know about guard rails
In what context?
it looks good and organised but its definitely beyond my skill level. I would have liked some images but it seems you are in the process of adding them.
can anyone help me for the computer vision + ocr problem
I am trying to use yolo and tesseract for this project
Always ask your whole question and give the information people would need to start answering it. Never ask to ask
You can insert extra information after the user's prompt
what's the hiring process like for computer vision internships?
They'll ask you to tell them more about items on your resume that stood out to them, and they'll ask you "trivia questions" about computer vision to figure out if you're fake.
What are some misconceptions about A.I.? I understand that it is more useful in analyzing data than it is at writing novels or creating art, but is there anything else about A.I. that I have missed?
Also, what is it like being a data analyst or data scientist? Is it not that bad of a career path to go into? Is it a growing career path due to the development of A.I. or is there something else to being a data scientist outside of A.I. development?
People in 2025 think that AI is only generative language models.
i see
so no interview problems
like no on the spot coding
It is not? What is it then?
just computer vision theory questions and then questions about my past experience/projects that i have on my resume
Just think about what was considered AI before 2022. Those things still exist.
i don't get how people are able to get computer vision internships the summer after freshman year of college if recruiting starts in the fall
Like, self driving cars are not generative language models.
i guess they just start learning really early
You're right, that usually doesn't happen.
not much time left before recruiting season so i should probably get to work lol
oh
i saw some guy from the uni i'll be going to who is a camera perceptions intern at aptiv
i don't know the guy but i saw his linkedin profile, all he had was an mnist classifier project
and some research, which i'm not too sure if it's related to computer vision or not as it's not too clear (and i don't know anything lol)
but the interesting thing was, none of it was before september 2024
the project date was december 2024
so either aptiv has a really late recruiting cycle or they just don't expect much/very little competition
I only know what A.I. was thought up as in popular fiction. Such as Data from Star Trek, or SkyNet from the Terminator Franchise. The only other terminology I know the term A.I. was used for was for bots that simulated human players in computer games. Was any of that what you were referring to?
Generative language models like ChatGPT feel like the AI entities from science fiction, but they aren't actually very similar.
typically what's the expectation for computer vision interns
or better question, what would be considered a competitive profile
that would probably be closer to the concept of agi right
It seems like AGI because language generation feels more intrinsically human than driving a car. But LLMs aren't self aware
In what way? What is the difference in generative language models compared to any other form of A.I.? What are the other versions of A.I.? I do not know what makes something considered A.I. outside of what I had listed, and even then I understand that the A.I. mentioned in Popular culture is not possible with modern technology right now.
Decision making systems are often AI.
What are some known decision making systems then? What makes them different from Operating systems?
They're not related. An operating system is how applications interact with the hardware of a computer. A decision making system, in this context, is an application.
If you have something that decides/predicts how much a house should cost based on its properties, that's the kind of thing that I'm talking about
Apologies if I am asking basic questions, I am mostly unfamiliar with how software and coding works as I am a beginner at this moment of time. I am genuinely trying to understand you, but it is mostly going over my head as I have no experience hearing these terms before.
Oh, so like an Excel sheet?
Oh ok. Thanks for trying to explain.
Models are often trained on tabular data which might be an excel spreadsheet.
Oh ok.
But if you had an excel spreadsheet that calculates what you or someone else thinks the cost should be, you have to write a formula/function that calculates it in terms of the columns, right?
With machine learning, you have all those columns, and the actual price of the home, and the model figures out what function of the columns consistently arrives at the expected price
So automation comes with machine learning?
You can check my repo silver vi it's already 90% done
That feedback helps
well, any computer program is designed to automate something.
Oh ok
Hí
idk if it will work, because my usecase is a bit different
I wanted to learn about how to detect changes in the image data , as in if any bill has a name and someone changed it. The bot should recognise that it is altered and flag it as fraud. Are there any pre built models to do this. Also what all should I know to achieve this ik the basics but any good research paper would help.
If it's just YOLO (v8) used for identification and then Tesseract to do some text extraction, then whatever the use case is, keep in mind that the dataset you use has to be relevant to your thing. If you are planning to deploy, try nano and small first, then move to other models. If you feel this is too heavy for your deployment, consider going back a few versions, like YOLOv5
Tesseract the python one
Amazon also provides one that is amazing
Currently I am working on a project where I have to extract all the content from ppt slides and pass it to an LLM for further functionality
You also need internet; if it's offline then he's done for. It's not great: i have tested a lot of things and it's only 70ish, and that's on a clear image, not blurred ones
Now I am using pdf plumber to extract tables from ppt and easyocr to extract text from images
Great thought
But with text from images I want to extract the meaning of images
Assuming that you have internet, I recommend trying out the google ai studio api key for summary
It's free really good and has rate limits be careful with that part
If i want to extract the context of image , how this can be done
How this can be done
i trained my images on yolov11 medium
Good did you get what you desired?
Yes that is perfect, got 99% accuracy
but now i am trying to use tesseract to extract the text
i managed to get most of it while trying different preprocessing, but it still cant extract a few things like the digit 5
If the images are text based then you can send that directly to gemini or extract the text which may not be efficient
Well is that 5 in a weird font
Or some colour
Did it give you a s instead of 5? Or completely ignored it?
Thought so
This usually happens
need to use image preprocessing
u have any good ideas?
for thresholding and preprocessing
like before it detected nothing, i tried different preprocessing and got pretty good results
Did you classify each image like this folder has images of 5
wdym
Keeping the threshold around 7 is a good try 5 or 4.5 too if
Is your data set a mix of all the characters?
```python
import cv2

# cropped_img is the BGR crop of the scoreboard region (defined elsewhere)
gray = cv2.cvtColor(cropped_img, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_LINEAR)  # upscale 2x
blur = cv2.GaussianBlur(gray, (5, 5), 0)
_, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```
this is my preprocessing
yes then how did it find 4 and 9
there is problem in ocr
not my dataset
You turned them into a grey scale which is a good choice to avoid the colours u blurred it a bit and resized it
OCR is not 100 percent accurate, no model i tried had that capability. it's best to use stuff by Google or Oracle, basically big companies
is that paid?
They have free tiers too do your own research first to find the best model as of now
My system was hybrid online + offline
2 different softwares used
but having issue
ik its good but results are not the way i want
is there a way to tell it that the label will always be a number and do ocr that way
I see for the data processing part try labeling and then training on that
Try improving your data set specifically images of 5 and §
Yes you can
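One way to do that with Tesseract is a character whitelist plus a single-line page segmentation mode, then a strict parse of the result. This is a sketch assuming the KDA is shown as "K/D/A" on the scoreboard and that pytesseract is the wrapper in use; whether the whitelist is honored depends on the Tesseract version and engine mode, so treat it as something to try rather than a guaranteed fix:

```python
import re

# Treat the crop as a single text line and allow only digits and "/"
TESSERACT_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789/"

def parse_kda(text):
    """Turn OCR text such as '3/1/2' into [3, 1, 2]; return None if it does not fit."""
    numbers = re.findall(r"\d+", text)
    return [int(n) for n in numbers] if len(numbers) == 3 else None

def read_kda(image):
    """OCR one preprocessed KDA crop (requires pytesseract and the tesseract binary)."""
    import pytesseract  # imported here so parse_kda works without it
    return parse_kda(pytesseract.image_to_string(image, config=TESSERACT_CONFIG))

print(parse_kda("3/1/2\n"))
print(parse_kda("3/1"))
```

Returning None for anything that is not exactly three numbers makes misreads easy to flag instead of silently writing bad values into the JSON.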
So you are not doing text only numbers?
no
both text and number, but that label only consists of numbers
wont be anything else except of numbers
Oh wait the labels are in numbers?
no bro
the label is kda
kda is the label name, and inside it there will be numbers
like this, so there will be numbers this way
I have a question: you are running OCR on completely digital images, extracting the text and then using the extract, no physical ones? And the issue is that it's sometimes going to misread 5 as §, and your preprocessing method is turning the images black and white and then blurring them?
Am i clear on this part?
Kda is a label that consists of pure numbers and nothing else?
the images are digital, no handwritten images will be there
exactly, pure numbers nth else
see i am trying to get the data from a scoreboard like this into json format like given below:
{
"radiant_score": 15,
"dire_score": 10,
"teams": {
"radiant": [
{ "player_name": "Ani", "hero": "Night Stalker", "level": 9, "gold": 489, "kda": [3, 1, 2], "ultimate": true },
{ "player_name": "Alvyy", "hero": "Templar Assassin", "level": 10, "gold": 1136, "kda": [5, 1, 5], "ultimate": true },
{ "player_name": "REDRUM", "hero": "Crystal Maiden", "level": 7, "gold": 234, "kda": [1, 4, 3], "ultimate": false },
{ "player_name": "Big Doodle", "hero": "Earthshaker", "level": 7, "gold": 2138, "kda": [1, 1, 3], "ultimate": true },
{ "player_name": "pick weak=punishment", "hero": "Ember Spirit", "level": 8, "gold": 601, "kda": [0, 4, 4], "ultimate": true }
],
"dire": [
{ "player_name": "Stleip", "hero": "Tusk", "level": 6, "gold": 514, "kda": [0, 3, 6], "ultimate": false },
{ "player_name": "红双喜", "hero": "Morphling", "level": 8, "gold": 1301, "kda": [2, 1, 1], "ultimate": true },
{ "player_name": "hy not listening", "hero": "Mars", "level": 7, "gold": 788, "kda": [2, 2, 4], "ultimate": false },
{ "player_name": "xin", "hero": "Lina", "level": 6, "gold": 249, "kda": [3, 2, 2], "ultimate": false },
{ "player_name": "Love is patient, lo...", "hero": "Lion", "level": 7, "gold": 1301, "kda": [1, 6, 5], "ultimate": true }
]
}
}
and the kda is the issue since the numbers are misread
hello im learning ml with scikit-learn, but the tutorials i saw use the scikit-learn default datasets, how can i make my own and save it to use after?
in fact i managed to train one just don't know what to do after to store for later use
If it does not exist in a digital format yet: Create your own dataset by hand in Excel then export as a CSV file and load using pandas or polars
If it exists in a digital format, it may vary a lot but generally speaking find it online and/or write a script to format it in a way the models can understand
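A minimal end-to-end sketch of both halves of the question: a hand-made CSV as the dataset, and storing the trained model for later. The file names and the tiny table are invented, and NumPy is used for loading here only to keep the example short (pandas or polars work the same way):

```python
import csv, os, pickle, tempfile
import numpy as np
from sklearn.linear_model import LinearRegression

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "my_data.csv")     # hypothetical file names
model_path = os.path.join(workdir, "model.pkl")

# 1. A hand-made dataset saved as CSV (what an Excel export would give you)
rows = [("hours", "score"), (1, 52), (2, 54), (3, 56), (4, 58)]
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 2. Load it back and train
data = np.loadtxt(csv_path, delimiter=",", skiprows=1)
X, y = data[:, :1], data[:, 1]
model = LinearRegression().fit(X, y)

# 3. Store the trained model, then load it later (even in another script)
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[5]]))
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that saved them.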
Does anyone know any open source vision model that takes image input and tell what basically is image about.
that is extremely generic
there are thousands of classifier models that can do that for different topics depending on what you consider "being about something" to mean, or you can just throw it at any multimodal LLM
random example: https://docs.ultralytics.com/datasets/classify/imagenet/
Hi am I allowed to send a survey for some data collection? It’s for a project
No, we don't allow that
what project to build? I want practice ml
dont know price prediction of electricity?
sth where there is data
I watched ml from scratch type of videos from vizuara
hi, i'm thinking about making a tictactoe with a neural network without probabilities, just linear algebra, this is possible, right?
i'll use ReLU and Softmax function
neural network
without probabilities
with Softmax
what exactly do you mean by "(without) probabilities"?..
I mean that I'm using softmax just to highlight the best move, the one with the highest score
I'll create two hidden layers with 9 neurons each. I'll use ReLU as the activation function in the hidden layers, and then a softmax layer at the end to generate a vector of scores, one for each possible move. Then I'll use argmax to pick the position with the highest score and place the X there. So the softmax helps highlight the best move, but I'm not using it for real probabilities
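The forward pass described there could look roughly like this in NumPy. The weights are random (untrained), the board encoding is an assumption, and masking occupied cells is an extra step not mentioned above. Since softmax keeps the ordering of the scores, argmax over the raw outputs would pick the same cell:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def choose_move(board, params):
    """board: 9 values (1 = X, -1 = O, 0 = empty). Returns the chosen cell index."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(w1 @ board + b1)
    h2 = relu(w2 @ h1 + b2)
    scores = softmax(w3 @ h2 + b3)
    scores = np.where(board == 0, scores, -np.inf)   # never pick an occupied cell
    return int(np.argmax(scores))

# Untrained random weights, only to show the shapes (9 -> 9 -> 9 -> 9)
rng = np.random.default_rng(0)
params = [(rng.normal(size=(9, 9)), np.zeros(9)) for _ in range(3)]
board = np.array([1, -1, 0, 0, 1, 0, -1, 0, 0], dtype=float)
print(choose_move(board, params))
```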
Lets see your link.
I need some information in regards to data normalization
is it better to normalize the data before or after splitting
I don't get the logic of: by exposing the data before training, you may cause data leakage,
```python
# Check if normalization is requested
# if yes:
#     normalize X_train and X_test, y_train and y_test
#
print(X_train.head())
model = LinearRegression()
model.fit(X_train, y_train)
```
Let's assume that pseudo code doesn't exist. Now, the test set is split off and the data is not normalized. The variable model has never been exposed to the normalization, right?
Hi guys, I have been learning ML, eda and data engineer, nlp and a bit of deep learning for 2 years
I am 16 yo
Do you think that I can get summer work in a company with that?
Nothing is impossible. It'll be easier if you have built a couple of solid projects as well
It's advised to normalize after splitting, and to compute the normalization statistics on the training set only, to avoid leaking information from the test data into training
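The leakage concern is about where the mean and standard deviation come from: if they are computed on the full dataset, the test rows have already influenced what the model sees. A minimal sketch on synthetic data (scikit-learn's `StandardScaler` with `fit` on train and `transform` on both does the same thing):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # synthetic features

# Split first
idx = rng.permutation(len(X))
X_train, X_test = X[idx[:80]], X[idx[80:]]

# Learn the scaling from the training rows only...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# ...then apply the same numbers to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(3))   # ~0 by construction
print(X_test_scaled.mean(axis=0).round(3))    # close to 0, but not exactly
```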
Sure, but the hard part isn't training the model it's showing you can find a suitable usecase (that isn't some kaggle stuff), understand your data, do the right preprocessing, ...
ah ok so other parts related to data science
Sure, but that makes sense right? 😄
It's as you say, the code is so easy (.fit / .predict) they'd hire no one to do just that
makes sense, thats why high salary
so you must know crisp-dm or similar project cycle
but honestly still not too much of coding compared to some web app development where there are modules, components, rather big systems or game engine development
this is what attracts me to do some data science
not high salary but much less code
Made something with python
Hi guys, I want to break into data science and have an internship at my first year at university in summer
I am currently learning python through cs50p and I was wondering if anybody has resources I can use to be able to make relevant projects during university
I have looked up at kaggle but I am not sure
I also watched this video : https://www.youtube.com/watch?v=9R3X0JoCLyU
It helps me in direction but not necessarily in the process of learning
what about simplilearn data science course?
I will look into that thanks
its new 6 days ago
🙏
Alright