#data-science-and-ml
1 messages · Page 163 of 1
Grab a base nvidia gtx 1000 series to start. That is what I am starting with. Most of my training models is on a 10 year old intel Xeon cpu and 10 year old nvidia quadro m1000m gpu which pretty much has like 20 CUDA cores
Just barely bought it
It’s like $60 @ Walmart ~750 cuda cores give or take
Hi, I need some help. I have to build an invoice categorizer. I’ve already created one, but it breaks when I upload a PDF with a different template. Is there something I can do about that? I’m thinking I should start from scratch.
I think a lot of the BCI stuff is done with SVMs and other algos running on CPU
You might actually be fine with a raspberry pi in that case
Cause the (typically) tiny sample sizes and extremely high dimensionality are where SVMs still reign supreme
A bigger concern is whether or not all the libraries you want to use will support ARM
Yeah, KNN is an algorithm you probably should never use. You should be able to train the net on google collab using their GPUs and do inference on your CPU. Especially for image processing, and the fact you have 1000 images you should go down that route imo
Why is KNN something you should never use? Is it because there are better modern algorithms or something else? Just curious
It’s very slow and prone to overfitting
Amongst other problems
By default it also doesn’t consider 1 feature more important than others
I see, thank you!
Sure you can probably assign weights but the default implementations just do the dot product of 2 normalised vectors and that’s it. In reality the distance of 1 feature should contribute more to the overall distance than others, we also want to learn this scoring and not
hard code it. If you add this requirement then you’ve arrived at how most real ML algos works
The idea that we want to learn this scoring is exactly what the coefficients of logistic regression mean btw
"we want to learn this scoring" here means that we want the model during training to learn how much weight to assign to this 1 feature vs others?
Or does scoring mean something else?
Correct
Is it possible to train custom ai model on virtual machines for a short period of time or the training should be going on continuous?
if it's a neural network, you usually want to keep training until the loss stops decreasing. but you can pause training if you have to.
you don't just keep training a model forever, though.
If I want to train it for a long time, is raspberry pi ok or not?
you probably can't train at all, for any amount of time, on a raspberry pi.
I got no spare pc to do that
How can I train if I want to
google colab, maybe
There's also https://vast.ai/ if you just need a short-term rental
What model from https://pypi.org/project/g4f/ can I use to create creative&scientific writings of around 6000-8000 words? Considering I won't have to pay anything
I've tried the paid deepseek api and it can only return 1500-2000 word texts at a time
And I've only tried that because is it the cheapest
All the others are too expensive, considering I'll be needing about 400k-500k tokens a day of output, my max monthly budget is 20$, that's why I'm looking at G4F models
That package is literally a collection of exploits targeting vulnerable websites to hijack their APIs and steal from them, don't use it
If one of the first things you see when you open a package's description is a Legal Notice, you probably should notice there is something weird about it
I don't care about ethics considering I'm poor and those companies are rich...
If I were to have had the money I would've just paid for the o1 api but I don't have it
5. Do not provide or request help on projects that may violate terms of service, or that may be deemed inappropriate, malicious, or illegal.
You can try using some of the "free" models under https://openrouter.ai instead, it's mostly companies voluntarily offering spare compute to test their infrastructure or trying to put their name out there to attract costumers
Thanks!
Yes and no
It's definitley possible but it may not be practical
Many ML algorithms are online by default. They're trained iteratively, think everything that uses gradient descent
That means you can always persist the weights to disk and resume training whenever. It depends on the library you're using to make this process nice or very annoying
people who have been in a team in kaggle and tried to get actual work done,how?
We're suferring.
the versioning locks you out while things are running, there is no meaningful merge screen that I can find, ...
is this kind of progression normal for a neural network or did i do something wrong
Yeah I think most people don't use their notebook until submission and bring their own tools instead.
Can someone here familiar with BCI and AI works help me?
I’m a masters student and I’m currently looking into research ideas for my thesis topic.
Initially I wanted to use brainwaves as input data to be used for controlling the mouse cursor in a PC, using AI as the interpreter of the brain wave data and further develop it to try to form texts from imagined words
But once I saw the prices of sensors that are available especially for the high requirements I have, I realised I have no chance of doing anything.
OpenBCI Daisy sensors or any good sensors with 14-16 channels costs ~$2000 if not more and as a master’s student whose monthly allowance is that (rent and food and utilities) I can’t afford to work with sensors.
So I’ve elected to instead work with existing datasets.
Is it really not possible to buy a bunch of cheap EEG sensors and then feed them into an Arduino to accomplish what OpenBCI Cyton+Daisy can do? Essentially a DIY version that would be cheaper but of course lots of wire and more stuff to program.
Or is dataset my only option?
Currently I have a good PC with a Ryzen 7600X3D cpu and 32GB RAM and a RTX 4070 Ti Super with 16GB VRAM to be used for AI training and coding
Yeah I guess they are pretty damn expensive even at the entry level https://imotions.com/blog/learning/product-guides/eeg-headset-prices/
and I guess 5 channels isn't enough for what you want to do? https://www.emotiv.com/products/insight
is it practical to implement a cnn with just numpy?
also is an ocr scanner a good resume project
5 channel I don’t know if I can do much. Especially since I’d prefer to do data to text conversion but I’ll look into them
makes sense
not really unless you're making something very small
any changes to the architecture (especially number of layers) require appropriate management of the derivatives, and the optimizers would also have to be implemented by you
there is certainly a time and place for that, but if you like numpy and wanna do a nitty gritty low level design of a network, i'd suggest jax
comes with autodiff and also some optimizers (e.g. in the optax module) if you wanna make custom architectures
Should i learn python or java
either or
Is it hard to learn java
Hello @quick igloo, this is the data science channel on Python Discord.
Java is not used in data science.
Really?
First python, Because it's easier
But Java is not used for AI
genetic algorithms with a random entry, is considered reinforcement learning too?
?
because G.A use error loss to select the better setup for the next iteration
and it's not supervised
Java is used for AI but too bad
bro im not fan of java but yeah its used for AI
Is there a roadmap I can follow for learning data science?
Sorry if it's asked too many times
Would you learn java for AI?
absolutely not
No problem
Do you speak spanish?
No I know english and hindi
Ask on reddit, maybe you'll find something about it
you can search one up on roadmap.sh
question about polars, so i've purely been using it over pandas for coming on a year now, an i notice 1-2 issues like tuples/list not being the best supported compared to pandas etc
my question is how does polars performance compare to pandas with rapids?
also also to build on that question what's the viability if any of polars with rapids
@me when u reply i dont look at my laptop much
Rapids is its own thing. Pandas, Polars, and Rapids use the same memory model (Apache Arrow), so they should all be able to transfer between each other.
ahh thx for informing me will look more into that, i use the tool rather than question much lol
by this logic using rapids polars is faster than pandas?
If the operations being done are supported by Rapids (probably numeric stuff), and your GPU is faster than your CPU (probably), then yes.
In addition, this is better performance in terms of throughput, but worse latency (GPUs have higher latency).
I don't know whether it matters if you use Pandas or Polars, probably not.
Since both are just sending their tables to Rapids.
However, Polars has that integration to make it easier to use.
According to that page, it makes use of Polars' query optimizer, which AFAIK Pandas does not have.
in pandas, lists and tuples are just stored as generic python objects
in polars, there are dedicated Array, List and Struct data types for nested data you should use instead of generic Object
Hello, remember to always always ask your actual question and never wait for someone to commit to answering before you actually ask it.
ok sorry
No problem, but be sure to always remember that every time forever
alright so this is to calculate atmic mass from molecular formula
but when i compile it
struct is a bit cumbersome for my usecase so i chose to do object conversions,
basically storing the tuple (6, 7, 'ed')
polars list/array require linear types
its idle and doesnt prompt me for input
I cant debug it because I get this error Terminal exits with code 3221225786 (or similar)
send a code snippet
loop_value = 0 # for infinite loop
while loop_value == 0: # indentation for loop
atomic_number_mass = { # dictionary
use backticks
If you're asking a basic python question, open a thread. The instructions are in #❓|how-to-get-help . Read every word of the instructions.
i.e
auto_review = load_dataset(
"McAuley-Lab/Amazon-Reviews-2023",
"raw_review_Automotive",
streaming=True,
trust_remote_code=True
)
alright thank you
That's the Huggingface API/lib right? I should probably try it sometime.
yea lol feel free
https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023
I have the following code:
# frame = cv2.resize(frame, (0,0), fx=scaleFactor, fy=scaleFactor)
overlay = frame.copy()
for idx, detection in enumerate(extracted_boxes, 1):
bbox, text, confidence = detection
cv2.rectangle(overlay, (self.scoreBoxCoords['x']+int(bbox[0][0]), self.scoreBoxCoords['y']+int(bbox[0][1])),
(self.scoreBoxCoords['x']+int(bbox[2][0]), self.scoreBoxCoords['y']+int(bbox[2][1])), (255, 0, 0), thickness=2)
cv2.putText(overlay, str(text), (int(self.scoreBoxCoords['x']+int(bbox[0][0])), int(self.scoreBoxCoords['y']+int(bbox[0][1])-8)),
cv2.FONT_HERSHEY_PLAIN, 1.5, (0, 0, 255), 1, cv2.LINE_AA)
# Blend the overlay with the original frame
alpha = 0.6 # Transparency factor
frame = cv2.addWeighted(overlay, alpha, frame, 1 - alpha, 0)
# Show the frame with text overlay
cv2.imshow("Video Text Overlay", frame)
It works fine with one caveat: the results are for a scaled image. However, when removing the comment in the first line only the image is shown and no overlay is visible. What is the reason for this? Thanks for your help!
anyone on kaggle
Why do you ask?
I'm new to Kaggle and I'm unable to select the GPU (P100) option. Not sure what I'm missing—any guidance would be great!
You've probably used 30 hours a week already
no , i just login it today
🤷
see
Try make another notebook? I have no idea tbh
done , tbh i had stuck on that for 2 hr , and exploring internet for solution
Did you find a solution?
Yea , I didn't verify my phone number
you gotta get verified getting the gpu access
is anyone familiar with fbProphet time series forecasting library
Hello, remember to always ask your actual question. Don't ask to ask
if i made a project where i did all the under the hood stuff with cython instead of using libraries, would that look better to recruiters or would they not care
If the job involves using cython then sure.
Generally speaking, writing code with self-imposed constraints isn't impressive to employers. They don't care if their developers are "hard core"--they want people who will write maintainable code that works
how often is it that mle roles involve usage of cython
would it be good to do if i don't have a gpu though?
i'm trying to implement my own cnn and am using numpy but just setting up the architecture and getting the cost takes super long to run, and so i'm not sure how astronomically long the gradient descent part would take
and also i think it would speed up my first ann implementation, which i've never been able to test cause it never stops running
AI i been working on for a year and a half training constantly
I used cython exactly once to make an open source contribution to spaCy. Users of spaCy do not need to know or use cython.
Cython is not a viable alternative to GPU-accelerated code.
you can use a GPU for free on google colab.
when i do they get ignored
if you don't ask your actual question, it's guaranteed that it won't get answered.
as if nobody understands the data science questions i have
when i do ask them they still do whats the point
i assume maybe people are busy, seeing if someone may have knowledge and wanna answer later
you have to be willing to front the effort of actually asking your question if you want to get free help.
i did
is anyone familiar with fbProphet time series forecasting
unless you're doing a survey of who knows about what, this isn't your actual question.
its a data science library
why do you care if anyone knows about it
because asking the question will get it overlooked and pushed out of chat
someone that knows is more likely to answer starting with the subject
That is false.
You need to always ask your actual question. If you're not willing to do that, for whatever reason, I have to ask that you refrain from asking at all.
ok ill ask it in python discussion
Okay, but the same rule applies. If you ask "does anyone know about X", I'm going to tell you again that you need to ask your actual question.
not if its a discussion
but it isn't. there's a question you want help with. If you're not willing to say what it is, you need to not waste the time of our volunteers.
how am i wasting time im not forcing anyone to do anything
my real questions already get ignored how can their time be wasted
because people are going to tell you to ask your actual question, and this whole conversation is going to happen over again.
then copy/paste the real question that you already wrote.
so u rather me spam then see if i should even continue asking the ignored question
I would rather that you repost your actual question than ask to ask, yes.
this is lawyering. please stop.
bro what 😭 you are mall copping
I mean, I'm one of the directors of this community.
But I'm genuinely trying to maximize the number of questions that get answered and minimize the amount of volunteer time that gets spent not answering questions (by, for example, coercing people into asking their question).
tell volunteers to ignore asking-to-ask questions the same way normal questions get ignored
We don't ignore asking to ask because we genuinely care that questions get answered, and we want you to maximize your chance of getting one.
well then what's the point of asking what you asked if no one's gonna answer that either
it happens to all of us sometimes, people are busy or maybe no one who knows the answer sees the question
and if we go by your logic which is that either one will get ignored, at least if you ask the full question there is a greater chance of it being answered
helps not give the brain the micro relaxation of expecting help after sending something ur working on in the chat
it does
if u cant figure something out and u send it in chat asking for help its like 200% easier to start working on something else and waiting for a response
i agree, but what extra are you doing by asking the full question
wasting ur own time
i do that too a lot
?
you're spending an extra 5-10 seconds typing out the entire question
if u gonna ask a question u should provide as much context as u can
otherwise the person will be complaining about not seeing the examples/data/code, etc
i mean do you want help or not
so if asking for help is a waste of time for you, why ask in the first place
its the type of question
i somewhat get what you mean, but if you want help then you gotta ask properly
im assuming nobody active in chat uses the prophet library so if i ask why its forecasting strange results itll waste my time showing the results and data
i know
but if you are asking your colleagues for help, how would you ask
"have you done time forecasting im getting strange results from this model, expecting x but getting y and not sure if its just the model or the parameters"
a simple y or n then u show them
yall acting like i started pinging people bothering them
i'm not gonna lie i don't care much about it either, in another server i had this same issue with the people there
but in all the programming servers i'm in, it seems to be a common norm and so i just follow it cause it's not really causing any inconvenience for me
i suggest you do the same just cause it would probably be more helpful for you
Anyone want to practice Pytorch with me?
I am trying to convert torch.FloatTensor into cuda's type. But it's not converting. What to do?
path = os.path.join("train", "masks/**")
path_lst = glob.glob(path)
y = torch.tensor([], dtype=torch.float32)
out = torch.tensor([], dtype=torch.float32)
for path in path_lst:
with rio.open(path) as dataset:
img = dataset.read(1)
flat = torch.from_numpy(img).float().flatten()
y = torch.cat((y, flat), dim=0)
out = y
np_out = out.numpy()
weights = compute_class_weight(class_weight="balanced", classes=np.unique(np_out), y=np_out)
class_weights = torch.from_numpy(weights)
c_weights = class_weights.to(device)
I think you need to move it back to the cpu to do the numpy stuff?
e.g. np_out = out.cpu().numpy()?
And maybe specify device=your_device when initializing the tensors?
Should a laplace mechanism be applied to every value in a dataset or just to the result of a query?
Don’t "presume" before actually asking. If you’re looking for commitment and a quick response, avoid asking a question just for the sake of it. Instead, provide clear and detailed context upfront. This allows those who know the answer to immediately understand your problem and offer help without needing to dig for more information before deciding whether they want to assist.
does anyone know the best course on Data analytics online ? kindly give some suggestions
maybe Google or DeepLearning.ai
The Data Analytics Certificate, developed by Google, can help you learn how to use AI to process, analyze, and visualize data.
how about the python c api, how often is it used
not asking as a replacement for gpu usage, i mean how much is it used for performance optimization on top of gpu usage
many of the tools that DS/AI people use are written in C, but users of those libraries don't need to use C.
I've never created a python tool in C.
so the only reason i would need to worry about using c/cpp for anything in ml is if i'm implementing a model to be used for an embedded system
you would only need to worry about that when it comes time to deploy the model. you'd still be implementing it in python code.
oh how would deployment work then
and in that case why do many people (i know no where near the majority but still a decent number) use cpp for computer vision
you can export models to something that can be used in cpp code. idk the specifics.
oh i see
that's cool
so efficiency of c/cpp while not having to go through all the extras complications of the language
the important parts of a model, the weights, can be exported to a common format that different languages can read
If you are doing computer vision on a small device you need to be as efficient as possible. Technically you are also using CPP when doing computer vision in Python with something like OpenCV, just indirectly, and not written by you.
However, there will often still be some Python that is just there to basically connect things. Like feed the camera data to your CPP library, read config files, maybe do some networking, etc. Python shows up almost all the time as at least some general scripting tool that ties it all together.
This is because Python is simple to use, easy to get packages for, and has a lot of packages for everything imaginable.
These packages are usually all each implemented in something like C, with a small Python layer on top (sometimes that Python layer is large).
However, there are some cases where you may need to cut out Python entirely, these are not super common (for most ML devs).
so i'm working with a fairly large dataset (200gb) so i moved it to parquet, then read it in using polars, however when merging/cleaning my kernel keeps dying I assume due to ram so i went from 32->48gb of ram but similar issues, haven't really been seeing success with the streaming engine
Is there any advice on dealing with datasets this large? I'm considering getting WSL to try using rapids with polars but idk the viablility of that if its a ram issue
code for context
lf_review: pl.LazyFrame = pl.scan_parquet("amazon_review_auto.parquet")
lf_meta: pl.LazyFrame = pl.scan_parquet("amazon_meta_auto.parquet")
lf_review: pl.LazyFrame = lf_review.filter(pl.col("rating").is_in([1, 2, 3, 4, 5]))
lf_review = lf_review.filter(pl.col("text").str.strip_chars().str.len_chars() > 0)
lf_review = lf_review.with_columns([
pl.col("text").str.count_matches(r"\b\w+\b").alias("review_length"),
(pl.col("timestamp").cast(pl.Datetime("ms")).dt.year()).alias("year")
])
lf: pl.LazyFrame = lf_review.join(lf_meta, on="parent_asin", how="left")
lf = lf.with_columns([extract_brand()])
lf = lf.unique(subset=["user_id", "text", "asin"], keep="first")
df: pl.DataFrame = lf.collect(streaming=True)
df: pl.DataFrame = lf.collect(streaming=True) is the problem.. Yes it's streaming, but you're asking it to collect everything into a single dataframe
so how should i approach it instead? appreciate the advice btw
I'm a total Polars noob, but looking at their docs, this is maybe what you want? https://docs.pola.rs/api/python/stable/reference/api/polars.LazyFrame.sink_parquet.html
The example they give at the bottom is:
lf = pl.scan_csv("/path/to/my_larger_than_ram_file.csv")
lf.sink_parquet("out.parquet")
yea i was considering that
the only thing is what happens when i need to merge multiple parquets about 34 into 1
do i just sink it into a parquet again? (Asking btw not being condescending)
Yeah, you should just be able to feed it more
It looks like there's also for batch in lf.collect_streaming_batches(): if you need to operate on the data before writing it to disk
Edit: I might be wrong about the function name, double-check me
ohh smart
It looks like you can also ask for compression and stuff, that's useful:
lf.sink_parquet(
"processed_results.parquet",
compression="zstd",
maintain_order=False, # Might improve performance?
streaming=True
)
i'll read into it
some quick very very dumb questions, I'm new to ML if i was to train after reading in a large parquet would that affect my ram or mainly cpu?
also would using rapids improve this?
This looks maybe helpful https://www.rhosignal.com/posts/streaming-in-polars/
One major advantage of Polars over Pandas is that working with larger-than-memory datasets can be as easy as adding a single argument to a function call. However, streaming doesn’t work in all cases. In this post I introduce how streaming works and how to work around some challenges you may face.
Rapids I guess would presumably let you split it up across machines more readily, but I think polars can do this on its own if I'm understanding the docs correctly
ahh yea just one machine
was just wondering if the gpu methods wud help with the ram/performance issues
I guess it depends on how intense the operations you want to perform on the data are. If they are 'lightweight' you will be memory limited on the GPU.
pl.Config.set_streaming_chunk_size() seems to be how you can manually adjust how big the chunks it works on are.
I guess it's measured in rows
ahh will have to check this out later tn hopefully i dont get cooked
Oh, aha, my earlier thing should probably be phrased as for batch in lf.collect(streaming=True, streaming_chunk_size=batch_size) where you pick the batch size. The function I mentioned earlier is something I found on Google but seems to have been a user wondering about something
appreciate it 🙏
anyone know why plotly express boxplot appearing like scatter of data points instead of like the seaborn box
tryna use the dash app with it but its only pltexp compatible
check if total_price is properly treated as a float, not as string or something else like decimal
(as in, check the dataframe dtypes)
They have decimal places like a float but I'll check if it's decimal type in a bit
anyone getting this recently in vscode and aware how to turn it off? (i dont use copilot)
Generate button in notebook cells
nvm found it idk if anyone is dumb as me so i'll leave this here
its float64
i nvr seen that before i personally just disabled copilot
a string wouldnt even be plottable
its like it gave a box to every value or somethin
if i wanted to implement an rnn for predicting the next word in a sequence, would it be a good idea to implement a graph database with the words for better word embeddings?
I don't think so.
it was because of the colors lolz cant do it with pltexp unless theres a way to lessen the bins
oh specifically because i did colors on the y instead of x makes sense
oh should i just randomly assign an index to each word
i thought having organization with respect to semantic relationships was helpful
how would you decide which words are semantically related without making the whole graph by hand?
yeah i would do it by hand
though that would probably be an issue for larger datasets
would take too long.
how do people typically do it
typically do what?
I've never heard of anyone doing this.
so would we just disregard semantic relationships and randomly assign an index to each word
the two parts of this sentence are unrelated
you want to create a model that generates text, right?
not generates, but predicts what the next word will be based on what's typed by the user, and i'll be doing it using a rnn
that's ultimately the same thing
Can I ask, why u r called stelercus papabilissimus?
oh i thought they were different tasks
so it's essentially not necessary?
that is the greatest name one can find in a discordian
I see…
Similar
also with the graph dbms tool, wouldn't it help by allowing us to create more data with less
because with the same set of words, we can find new sequences
I don't think you'll get anything from the graph database step that you wouldn't have gotten implicitly from other language model training techniques.
oh what technique should i use
never enabled it so ig it just kinda did its thing not a fan myself
im making my first CNN project and im just after some advice on what statistics i should have. So far i have:
- accuracy vs validation accuracy before data augmentation
- loss vs validation loss before data augmentation
- accuracy vs validation accuracy after data augmentation
- loss vs validation loss after data augmentation
- time taken for epochs
- multiclass confusion matrix
- f1 score
is there anything else people would recommend me adding?
you don't need to consider lot of things actually
the main goal should be your validation loss is decreasing over the period of time ( along with training loss )
and then you will test the model on different images ( but of same type ) to check if model has overfitted or not
perfect, thank you!
guys could anyone clear out my confusion , what does LSTM(64) mean?
is it like an lstm layer with 64 units?
and could some one clarify between lstm layer,lstm cell ,lstm unit
Yes
if object != None:
filt = {"_id": index, object: {"$exists": True}}
update = {"$set": {f"{object}.{to_edit}": value}}
#* check if the field exists
check_exist = file.find_one(filt)
if check_exist == None:
#* create new field
file.update_one({"_id": index}, {"$set": {object: {}}})
elif object == None:
filt = {}
update = {"$set": {to_edit: value}}
file.update_one(filt, update)```
guess the library
im a little lost on what i have done wrong with my CNN. Im using Efficientnetb0 model with CIFAR10 database.
i ran the cnn for 50 epochs without data augmentation and 50 epochs with data augmentation
after data augmentation my validation loss is slightly increasing and my validation accuracy is sitting around 0.85.
tbh that accuracy is good
yeah the accuracy is good but my validation accuracy hasnt really changed which im confused about
well if you consider train accuracy is 92 and test accuracy is 85 , there is a slight overfitting issue here
how would that be solved then, saying this is after the data augmentation?
because from my understanding, data augmentation is one way to solve an overfitting problem?
did you use early stopping?
i have not, no. What would that do?
basically you stop training model once its goes above a certain patience
check it out
makes sense. Anything else you'd recommend as well as the early stopping?
batch normalization , dropout , using different optimizers , i think regularization also not sure
Sweet, I'll check them out and try again. Thanks for your time
btw, do you know any github repos or kaggle notebooks that do this?
most things I see are always confusing
Hosting a data science workshop in a few hours
any ideas on how one would load a large file/parquet using polars or alternatives? The files range from 15gb-100gb parquets
context: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023
my compute 32gb ram 256ssd 2tb hdd doesnt seem enough
I'm debating if to try ssh into amazon ec2 and just ripping it once/twice
stream it in chunks with collect() or similar, operate on it as needed, and use sink_parquet to stream it out to disk, seems to be the way with Polars
where do you see the option to stream it in using chunks with collect?
https://docs.pola.rs/api/python/stable/reference/lazyframe/api/polars.LazyFrame.collect.html
Oh damn I thought there was a chunk_size arg on collect, lemme look for where that is
aha polars.Config.set_streaming_chunk_size on that page
ahh i saw this will try it albeit based on the docs its believed that this may lead to memory errors
Aha, here's one approach it seems https://github.com/pola-rs/polars/issues/18820
to get all the ids and then scan over them in chunks
damm lil cooked
i wonder if pyspark could help curb this issue not sure of the config hopefully someone discussed it in chat since imma search
It really seems like collect should support a 'how much to collect' arg to me, but I guess I'm not a Polars expert, maybe there's a good reason not to.
yea really curious how people workwith this dataset cant find anyone except the publisher
there's also read_csv_batched https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv_batched.html
thats really nice however
wouldnt i be exponentially increasing my storage requirements on the csv side choosing csv over parquet?
not trying to be difficult btw if going parquet->csv means i can actually use the data it might be a common sense trade off
still no idea how this is going into plotly and sklearn lol
ah right, you have a parquet... well yeah, ig parquet -> csv if you want to use this
the problem with streaming is it's still be actively worked on, so you might see some rough edges
slice to get chunks of a lazyframe if you need ig
how this is going into plotly and sklearn
ngl, I don't think it's going to
you're gonna have to sample it before plotting if you don't want plotly to combust
sklearn... well no way all of that data's fitting in your memory, so look for estimators that can incrementally learn through.partial_fitso you can give it chunks of data and don't need to fit the entire thing in memory
you might also just consider neural nets, so pytorch
if you're willing to consider other libraries, other than spark, also check dask for dataframes and datashader for plotting (+ hvplot if you want a higher level API)
Dask is very good, IMO.
Used it at work for a major thing; we ended up mostly rewriting it in Databricks due to management pressure, but Dask worked great.
sorry for the late reply was researching hard lol
i just need two columns for the sklearn classifiers so hopefully it fits lol
initially a year ago when u guys introduced me to polars (Forever grateful), i had seen dask, would dask be able to load such large datasets though?
will look into these (the plotting libs)
With dask the approach is usually to divide the work up into chunks that each get processed by a dask worker, and individually fit in RAM.
You could pair it with a streaming thing though if you wanted to do it differently
ohh so with some research i could compute any large dataset i.e 200gb on dask regardless of my PCs capability?
Yeah, assuming the task can be 'chunked' in the first place, some things are really hard to break up.
ahh i see dask has some cloud compute on their website (cud be wrong) as my next step was making a pipeline, buying an amazon ec2 instance with 512gb ram and running it once
https://www.dask.org/
Yeah, "dask cloud provider" is the cloud back-end. It's optional but pretty handy.
Could you recommend a suitable free and open-source model for generating embeddings to populate a vector database?
BERT
Hey wait, what if we're making this way harder than it needs to be.. do you HAVE to call collect() yourself up-front, or will sink_parquet() just do it for you? Hmm.
Polars now allows you to write Parquet files even when the file is too large to fit in memory. It does this by using streaming to process data in batches and then writing these batches to a Parquet file with a method called sink_parquet.
Unlike a normal lazy query we evaluate the query and write the output by calling sink_parquet instead of collect.
polars also improved their streaming engine a lot recently, try updating to the latest version if you aren't using it yet
Hello, all,
Please be transparent about what this project is by posting a link to the open-source repository.
The project is currently private so I can't post a link to it at this time.
I removed your message, as soliciting contributions to closed-source projects is not allowed.
My mistake, didn't realize that was against policy. I just made the repo public. You all can find it at https://github.com/gkerr708/D2DraftNet.
what would be the best approach to encoding text data into vectors without using any libraries except numpy
I guess you'll need to choose a word encoding and write it from scratch; people typically use a second library to do that and then jam the encoded text into numpy, but you can do it yourself also
"Word2Vec" is the/a classic
SciKit has a ton of choices that make it popular for this https://scikit-learn.org/stable/auto_examples/text/plot_hashing_vs_dict_vectorizer.html#sphx-glr-auto-examples-text-plot-hashing-vs-dict-vectorizer-py
yeah i usually do that with images, but that's only cause i have no clue how i would manually convert images to array format
with text though it's probably worth trying on my own if it's not too much data right
Yeah definitely
i'd probably just use python's default file i/o to parse each text file and get them all into an array, and then from there it would be pretty easy to create a vector for each word
also one thing, i don't know if this is some variation of imposter syndrome or something, but i get this feeling that i need to only use numpy unless there's something i really need to use a library for (like using PIL to convert images to arrays), otherwise i'm skipping learning the concepts
I mean, you're not wrong.. that's a great way to make sure you learn it.
Not sure in modern times.. it probably varies a lot
I could also see the "get it working first, then go and explain all the parts" approach being used
as in use any libraries i want and then afterwards learn the theory behind each step?
i think part of the reason why i feel a need to stick to numpy is cause i take a very implementation based approach to learning
so typically i'll just watch a quick theory video, and then i'll start trying to implement it myself
i usually never look at code examples or even pseudo code
Does anyone know if this is how you are supposed to set the temperature??
Temperature iirc controls the reliability of the model. It also controls exploration. Sometimes it can generate sth new. Usually it just increases hallucinations. Play around with it and see.
I know what it does but I don’t thing I changed it correctly
Lower means more reliable.
Higher means less reliable, more hallucinations.
I believe 1.0 is the default for gemini-2.0 so 0.1 is pretty low I guess
Yes I am aware but is what I did the correct way to change it in python
I know I need it to analyze data so I want a reliable accurate responce
What makes you think the setting isn't working?
Aah, the README shows setting it slightly differently https://github.com/googleapis/python-genai?tab=readme-ov-file#system-instructions-and-other-configs
with config=types.GenerateContentConfig(...)
Got this.
There was an error uploading your paste.
Cool. Are you already using any relevant libraries, or are you just starting out?
I'm just starting out on the cleaning process. Just got pandas
OK. One thing people seem to use a lot for this is scikit, because it has https://scikit-learn.org/stable/modules/impute.html
but it looks like Pandas has some stuff we could try to use directly https://pandas.pydata.org/docs/user_guide/missing_data.html
What about external contextual data which can not be averaged, such as geographical locations/coordinates?
Hmm, it looks like all the available 'scipy' interpolation algorithms are for data that is smoother than yours
Yep. It's super rough
So what's an example missing attribute in your data that we need to impute?
There's a lot.
The main critical are the latitude, longitude, bird species, and the municipality
these are missing a lot
diagnosis date is another too
I asked for Copilot to calculate and summarize the quantity of data present in the columns, and he came back with;
Column Missing %
focos_de_dnc 94.5%
focos_de_iaap 94.5%
doença 86.3%
número_da_investigação 86.3%
longitude 86.3%
data_do_laudo 86.3%
latitude 86.3%
espécie 86.3%
municipio 86.3%
ocorrência 19.2%
some of these will just have to be dropped
I'm fine
However, I'd like to be able to recover what we can
Hmm. I guess those each kinda need a different approach. For example if the municipality is set but not lat/lon, we could just use the lat/lon of the center of that municipality. For the bird species, we might need a classifier that can figure out the most-likely bird for a location?
yes
we got contextual data about common bird migratory patterns and also possibly domestic birds such as chicken
how can we do something with that?
I can also mix in environmental and biome data
such as wetlands, forests, etc
Hmm, isn't latitude/longitude totally missing here, or am I reading the columns wrong?
but to be honest, I have no clue how to do that.
yep 🥴
we need to infer those somehow
OK that's fine, we just need to calculate it from the municipality. I wonder what we can know about it when THAT isn't set though?
I have no clue 😩
From the bird type, perhaps?
They have some set migration patterns that should narrow down the possible location
Try to mean it
I guess let's just work on it one piece at a time, and maybe the rest will fill itself in
for geocoding, we can use from geopy.geocoders import Nominatim
and then like
geocoder = Nominatim(user_agent="avian_influenza_analysis")
geocoder.geocode(f"{municipio}, {uf}, Brazil")
there's a rate-limiter thing built into geopy you might need to wrap that Nominatim() instance with, I guess
like maybe
locator = Nominatim(user_agent="avian_influenza_analysis")
geocoder = RateLimiter(locator.geocode, min_delay_seconds=1)
Do you think we could possibly enrich the data with contextual information before cleaning?
Would that make it simpler, perhaps?
Since we would have more things to infer from
Maybe, yeah. What else do you have to join with? You mentioned bird migratory patterns, I guess that could be cross-linked via the lat/lon you determine...
So far, I've thought about;
Weather data
Environmental/geographical data
Bird migratory patterns
Just these three
Best I can think of I guess for determining municipality when given only a state is to have a list of towns in the state, and go by whichever one has the most of these rows associated with it?
I don't know/haven't studied in depth what else more could I plug in
We could train a little classifier on the municipalities in the dataset, but the data is so small, hmm.
Sure
I think a random forest might make sense to train the municipality-guesser, but we're getting into the limits of my experience now
Ensemble methods combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. Two very famous ...
Just a second. I accidentally cooked my notebook...
I've got a little implementation I'm working on, let's see if I can get it to successfully make up anything plausible
Please react with ✅ to upload your file(s) to our paste bin, which is more accessible for some users.
1309 rows
Imputed 1130 missing values in doença
Imputed 72 missing values in situação
Imputed 72 missing values in tipo_de_exploração
Imputed 1130 missing values in espécie
Imputed 72 missing values in espécie_principal
hmm, I guess that's something, let's see what it looks like
Successfully imputed coordinates for 179/1309 records hmm that's way fewer than I expected, I guess I have something to fix.
I wonder what the right play is in situations like this, where so much of the key data is missing. Seems really challenging to get right.
Oh I guess there are two columns in this for municipality? 'municipio', 'município'
Yes
I guess it's just a duplicated from the combining.
I'm combining three spreadsheets into one here.
🥴
Yeah.
This one going to take a while.
And worse, I have to turn this into a pipeline.
I sorta have that aspect of it working, but the imputations I've got are still far from ideal
@weak oxide
Guys... I'm curious and got a perhaps stupid idea while coding a MLP library on what will happen if you initialize the network with 0 weights and biases or any real values
How can I start AI/ML .What could you suggest for beginners?
Not a dumb question at all, actually a great way to run into a fundamental property that's worth understanding.
If all the neurons in a layer start with identical weights and biases, they will:
A) all calculate the same output
B) receive the same gradient during backpropagation
C) all make the same updates to their weights
So basically instead of a multi-neuron network, you just have one neuron per layer now
You specifically are trying to avoid symmetry
a neural network is a bunch of numbers (the weights and biases) that you compute with in a specified order (the computation graph). if you don't have any weights or biases, you have nothing.
I never said no baisas and weights
I mean all parameter will be initialized on a same value
let's say zero's
am I doing it right???
look on the first attempt
You have weigth spelled two different ways between __call__ and dif etc
huh
where??
lines 13, 16, and 18 in 'nuralnet.py'
also here https://github.com/hitoyaCute/making-new-AI/blob/main/first_attempt/FUNC.py#L42
that should probably be m.tanh() not just tanh()
first_attempt/FUNC.py line 42
return 1 - (tanh(x) ^ 2)```
I saw it thanks
ill Change that one's I finished making a simple neural network that can defeat me on tic tac toe
@viscid urchin on summary am i doing it right????
I'm heading to something???
^ isn't power btw
It's not nothing, but you've got a number of things left to conquer as well.
The current situation is kinda:
No actual weight initialization
No backpropagation implementation
No loss function
No training loop
Usually you put your weights in a matrix instead of having an explicit Edge concept but I'm not enough of an expert to say having what you have is wrong, just less-likely to be fast on modern hardware
Also I think your forward method is backwards, you raise an error when values are compatible?
To summarize before I crash..
a neuron is a weighted sum of inputs + bias, followed by activation
forward propagation is matrix multiplication between inputs and weights
backward propagation is computing gradients and updating weights
Any actual-expert feel free to correct anything I've said, glad to learn.
well obviously it's still work in progress and I haven't done any testing but yeah
it will take a value from a list EDGEs that will be processed by network.forward then it will make sure that network.forward and layer.forward is heading on same thing
and for debugging
Sure, but when your modulo test returns 0, that's when things have the same shape and are compatible, right?
Surely you want to throw an error when that's non-zero?
it will make sure if we divide the amount of the layer to the amount of value that each neuron will have same amount of input values
YES
I wanna make sure the shape of the input is compatible to the layer
Right, so I'm saying you have it backwards, but feel free to test it
first_attempt/nuralnet.py line 18
return self.weight```
oh nvm
??
I think I don't get your point...
I didn't scroll down enough but according to that repo (I assume that's yours) it shows 2 different spellings
first_attempt/nuralnet.py lines 13 to 18
self.weigth = weigth
def __call__(self, value:float) -> float:
"""takes a value, apply to the parent node, the multiply that output to the weigth"""
return value * self.weigth
def dif(self) -> float:
return self.weight```
Pick one: weight or weigth
alr fixed btw
I know
You have if len(values)%len(self.nodes) == 0:, which will be True whenever the two things fit together evenly.. but inside the if block, you raise an error saying they don't fit together.
hmmm matrix mul of weights and inputs will work to
math lib has it???
I know numpy has it but numpy takes so damn long t get imported
You can do it, it's just nested loops. I don't think anything in the stdlib implements it for the general case? Maybe I'm wrong https://medium.com/@vtalladin06/matrix-multiplication-in-python-without-libraries-a83b68819477
uhhhhh just in case...I wanna learn how to make my code use gpu to process stuff... how to do it???
I don't have tensor compatible gpu so no tenserflow
i don't have cuda compatible gpu to
what I have is Intel dual core graphics
I mean, technically you can do it on that platform, but it's all very experimental and complex, not something I can really walk you through. On Intel the right path is to use a thing called SYCL
PyTorch has some Intel support now but I think it's only for their datacenter GPUs https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
Oh maybe I'm wrong, I see Intel Core Ultra on the page that links to.. but nothing earlier.
Honestly I wouldn't think about it at all until you're comfortable building your neural network on the CPU
that's fair
thanks for advice anyways
I also wanna learn attention block just in case I want to make transformers._. is there resource you would suggest?
3blue1brown on YouTube has a series on transformers, I don't remember it being super in-depth but it does explain attention blocks at some point and I think the explanation is pretty good (and visual)
This site has some good stuff https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Translations: Chinese (Simplified), French, Japanese, Korean, Persian, Russian, Turkish, Uzbek
Watch: MIT’s Deep Learning State of the Art lecture referencing this post
May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example.
Note: The animations below are videos. Touch or...
Linked to from the very-good https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
Hello guys, I want to learn about ML and AI, how LLMs work and stuff like that, can someone recommend any good resource/books that englobe AI and its entirety please (please bip me if anyone has something to recommend :c)
https://kaggle.com/learn is a good one.
Or a more in-depth ML with Tensorflow: https://www.tensorflow.org/learn
Or if you wanna go into Deep Learning, Keras is pretty good: https://keras.io/getting_started/
but hmm I wanted to learn a bit of the theoretical parts first, like how things work behind the scenes
The intro sections in Kaggle and Tensorflow explains it quite well
Ok, will have a look at them ,ty !
search 3blue1brown on YouTube
and look for the machine learning list
6
guys what will happen to a MLP if you straight up initialized it's parameters to 0?
before I tell you, what do you think?
what happens when you multiply by zero?
it will be zero
can you imagine what consequences that will have for the neural network?
I barely has experience on machine learning and on that I doubt I have any idea
also I didn't just stated "zero"
wait
....
@late lichen I'm not following. Can you restate what you're current question is, from the top?
I want to know what will be the behavior of a neural network if you initialized its parameters with zero or any value
"with zero or any value"
do you mean "with zero or no value"?
because "zero or any value" is just "any value", and the "or zero" is meaningless.
infact it's rather opposite
I was emphasizing the situation where it's all zeroes
if all the weights in the neural network start as 0, they'll never be able to stop being zero.
why
the updated values are determined through multiplication, but x * 0 is always 0.
yeah?? then??
so if you have a neural network where all the weights are 0, they will stay as 0 no matter how much you try to train it.
and the network won't learn anything.
uhh is there some resources where it clearly shows that??
look into gradient descent
if all the weights are the same, you get a symmetry problem
I think its better you draw simple neural network and calculate the weights and biases in forward propagation ,it will give you a clear understanding
I have learned some basics of the ml, what fun projects i should work on to get started??
I also answered this yesterday
classifier? seniment-analysis ?
Artificial Intelligence: A Modern Approach (4th edition). For about AI in general.
Also has stuff on ML, deep learning, language, etc.
A bit of everything.
ok, ty !
Here's a really good video (more stuff from the same person, also worth checking afterward)
https://www.youtube.com/watch?v=SmZmBKc7Lrs
Shortform link:
https://shortform.com/artem
In this video we will talk about backpropagation – an algorithm powering the entire field of machine learning and try to derive it from first principles.
OUTLINE:
00:00 Introduction
01:28 Historical background
02:50 Curve Fitting problem
06:26 Random vs guided adjustments
09:43 Derivatives
14:34 ...
what activation function you are using matters here such that so long as at 0 it has a nonzero gradient, learning is possible, e.g., with the standard logistic function
Do you have to use "pickle" format? It's not really the fastest way to serialize/deserialize data
There are some things that claim to be faster at pickle stuff, but I'm not sure how much they manage to beat dill by
There are lots of alternatives, if you're in control of the format of the incoming data https://github.com/jcrist/msgspec etc
pyarrow is sweeping the world too https://pypi.org/project/pyarrow/
If you can just use JSON also that's way faster
OK, so you can't just json.dumps(your_root_object)?
Aha, yeah, you may have better luck with PyArrow then.
can someone pls help me understand why the residual plot D is a problem. Ive read the solution and I still cant seem to get it
Man those are close, to me, but I guess what it's saying is that you're looking to find a purely random pattern
whereas the one on the right has kind of a curve shape to it, where the residuals are positive as you get closer to 0 or 1, and negative as you get closer to the middle values (0.4 to 0.6)
but ugh I do have to stare at it to see that
I dunno, maybe I'm even reading it wrong, it's so subtle
Guys
In a recommendation algorithm, the system analyzes the items that the user has already viewed and tries to predict what they might like. This type of AI is closest to:
A) Supervised learning.
D) Unsupervised learning.
What's the best answer here? He didn't say that the user rated the movies
Collaborative filtering (which is what I get out of what you describe) is considered "unsupervised"
If you had explicit ratings it would/could be supervised
Arguably though there's a spectrum between unsupervised and supervised, and it's not a binary thing
Because what happens if you take view-counts into account.. that's suddenly "kinda" supervised...
Got it
That's just my take at least; Google search seems to back me up but I guess it's subjective.
I might be in over my head tonight, team:
It's making sense, but slower than I hoped. Oof.
Hello, very new to all of this after years away from any programming. More of a system setup to take advantage of GPU, running a laptop with 4060 and when searching how there was one method that adds Visual Studio Code to advanced graphic settings and selecting high performance GPU usage and then there is the Nvidia CUDA … are these doing the same thing or are they apples and oranges. Mainly for class project so it isn’t a must but getting into this so I figured I should learn. Any advice for the rookie would be great
Those are apples and oranges, yeah.
CUDA is a 'programming toolkit' from nVidia for running code on GPUs, whereas the other thing is just telling Visual Studio Code to use hardware accelerated drawing techniques etc.
If you can say more about what kind of projects interest you, we can give better advice about what you should look at next.
advanced graphic settings -> high performance
this is telling windows that it should use GPU to draw that program instead of the cpu; if said program (in your example, VSCode) is graphics intensive, it'll boost the performance, making it look smooth, etc.
one quick example off the top of my head is RPG MV games; by default on my pc it draws using CPU, lagging it a lot; setting it to high performance makes it use the GPU which gets me way higher fps
nvidia CUDA
this is probably what you're looking for in the context of programming, but honestly you may not even have to worry about it; for example, if you want to use the popular deep learning librarypytorch, you can just install the correct version ofpytorchand it'll automatically install CUDA for you during the process
Be the change you want to see in the world
<@&831776746206265384> spam ad
!ban 1360871168776867991 giveaway spam
:incoming_envelope: :ok_hand: applied ban to @teal stump permanently.
Any good books i can get from amazon on data science?
Thanks, later today I’ll throw some more details!
Thanks, I’ll follow up with a couple questions on this later today.!
How do I make my models usable(integrated in a system). My friend asked me to design a project priority level prediction model I finished it but i tried deploying it using fastapi so as for him to access through api but am failing miserably. am developing in colab notebook
you cannot really use google colab to host it long term
generally you should export/download the model after training, then host the API on a computer/server/virtual machine you own or rent
Am developing it to be used in a friend's system how would he go about it coz basically I develop them and then github keeps them
I haven't done this
what do you mean by "github keeps them"?
How and where is your friend planning to run/use that system?
I normally deploy them on github and as of my friend his system is a java springboot project(project priority level ) kinda of a system plans of where he wanna use it am not sure but he just want to use it as part of his system
what exactly do you mean by 'deploy them on github'? how are you "deploying" it?
you can upload your code to GitHub, but when doing that it only stores the files, it does not runs anything
or do you mean GitHub Pages? It only supports static websites (i.e. you cannot run python in the server side, at most embed in the browser)
they will need to host their system in some machine for users to be able to access it
usually you'll want to host your API in the same system, or something connected to it (same cloud provider if you're hosting on the cloud, or in the same network if self-hosting)
In simple terms yeah I was uploading them. I was tryin to host using fastapi and ngrok
your api you'll create using fastapi is a computer program
you need to have a computer to run your program in first place
you could host it on your own computer, but if you do so it'll only be available while you are running it yourself
neither Google Colab nor GitHub offers compute for you to host (run) it, are you planning to run it yourself? in some cloud server? in a machine your friend owns?
not really obviously I wanna host it globally not locally
you will need to decide where to host it in first place then
I was using ngrok idk if am allowed to share a link in here, I wanted to show you where am at at the moment coz you know how we used to host web locally and get like a responsive page and you are the only seeing the page unless you hosted on like heroku.
I tried to host it globaly using ngrok but I can only access the root endpoint no other endpoint is accessible
root is a get request but I can't post
kidly reach back to me plz
Was there any recording of this marimo presentation?
There is, @spiral peak is currently editing the recording and will share it at a later date.
ok 🙂
@safe agate you were on TalkPython recently, right?
It was the creator of marimo, Akshay, not myself.
https://talkpython.fm/episodes/show/501/marimo-reactive-notebooks-for-python
@jaunty helm hitting both of your responses at once, thanks again for the feedback. In the short all i am working on now is class work for a graduate class on data analytics, so basic ML dealing with Logistics Regression, SVM, model comparison ... im sure I am not doing the summary justice. However following this semester I want to begin take some of what I have learned and begin slowing seeing what I can do in my current role within distribution center planning and design (order management, inventory management.....). In the next couple of weeks I will be wrapping up this class and I noticed some of the datasets are taking longer to run and the impatient person I am began to look into these things. Example some exercises we will use for loops with 3-4 kernel values, 3-4 C values - generally speaking "linear" always takes the longest. Like I said not catastrophic but looking down the road more than anything.
Cool, hopefully you're making good progress with your class. Linear kernels often scale poorly, it's kinda a core problem in ML. If you end up trying scikit-learn, it has some built-in parallelization that might help (n_jobs etc)
For SVM specifically, there's a thing called LinearSVC that is supposedly zippy, but I'm not an expert: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Do you have a sample dataset for sales/orders/etc you want to play with? You could make a model to predict the right inventory levels for a given product based on historical data or something?
Actually it looks like there's one on Kaggle you could use https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
This is also an interesting list https://www.interviewquery.com/p/supply-chain-datasets
Ooh, this one has 308,000 rows https://data.montgomerycountymd.gov/Community-Recreation/Warehouse-and-Retail-Sales/v76h-r7br/about_data
what would be a good way to get into nlp, and what models should i be focusing on to start with
How comfortable are you with Python?
Here's one option that uses a particular PyTorch-based library: https://guide.allennlp.org/your-first-model
and there's this, but yikes it covers a lot of topics very briefly, you'll want to follow some links and do some reading https://www.projectpro.io/article/how-to-build-an-nlp-model-step-by-step-using-python/915
-
Prof. Jurafsky's book is really good and beginner friendly. https://web.stanford.edu/~jurafsky/slp3/
-
🤗 Intro to NLP course
Speech and Language Processing
i'm not too good at programming and i think i should definitely get better, but i do know the basics and have tried implementing a neural network with numpy
would a good approach be to read through that textbook and implement models/algorithms on my own along the way after reading that model's/algorithm's respective section
Soooooooooooooooooooo
I cooked a schema.
Also why do we have a slowmode here? Anyway,
!pastebin
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
It took a lot of Github copy, documentation diving, and just asking around.
Came up with this.
You remember that CSV file that we were cleaning? So, yeah. I managed to get it... somewhat working and clean
(Clean as in; I manually took each column and filled it out just to test the schema)
Thoughts and review? This is still a lot of magic as I've never built a schema for this purpose.
Cool, let me look
Honestly this looks better than I expected it to. Only a few things come to mind.
One is that you might want to add a confidence_score column to track how confident the prediction was that filled it in
If you have multiple people working on it, it might make sense to add modified_by but that's really about auditing not quality
the postgis stuff looks correct to me also, nice
ST_SetSRID(ST_MakePoint(NEW.longitude, NEW.latitude), 4326); is nardog but I believe that is a fully correct invocation for WGS84
It totally depends on how you'd wanna learn tbh.
Learning from 1st principle, that is, implementing stuff from scratch is what makes you cracked, however, it requires a clear plan, serious dedication and consistency.
In summary, you know yourself better than I do. Just do what works best for you.
when making a GAN, does the image size have to be pretty large? Should the image be normalized to be larger?
no, that would make it too big
Actually the early/influential papers used sizes like 32x32, 64x64, so no
Some approaches I guess start out super small, like 4x4 and 8x8, and then progressively scale up
e.g. https://github.com/NVlabs/stylegan does fancy stuff
So, if it is RGB, then the image size is 32 * 32 * 3?
Yeah, I guess width x height x color-channels
what would be a good latent_size if it 32?
Apparently the rule of thumb is that the latent space is 10x to 30x smaller than the output space, so I guess somewhere betwen 3072 / 30 and 3072 / 10?
Split the difference and call it 200 to start with maybe?
This paper just uses 100 https://arxiv.org/abs/1511.06434
I want to ask which order for learning data analysis with python from Jose portilla courses on udemy he has many courses on python data analysis and i feel there are the same libraries in each course I don't know if they complete each other but which order should I take them to master these libraries??
Never used Udemy, but there's not some 'landing page' for his courses that has some order to it? That surprises me.
I guess it looks like these two, in this order?
https://www.udemy.com/course/complete-python-bootcamp/
https://www.udemy.com/course/learning-python-for-data-analysis-and-visualization/
Bizarrely, Udemy doesn't even let you filter by instructor.
there has to be some order to it, I just do not know if the laten size and hidden size depending on if it is RGB
The latent space depends on the size of the 'output space', so RGB matters in that you've got three color channels to care about.
yes
Anybody tried to do anything with HiDream yet?
hey guys anyone aware of any open-source embedding models that works just as fine as OpenAIEmbeddings?
Hello, can anybody advice me a discord channel with topic of AI development (preferably on python) ?
I am new to AI, need to create one for my game. Just researching.
This is the one.
Well, then is there any articles about AI usage in gamedev. I just want to implement smart NPC enemies. I want to understand is it feasable in my project.
you can have AI-driven NPCs, yes
Can I get Resources to learn data science
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
Thank you
No matching results
what settings did you do?
Paid course sql
python c++ AI ML?
AR VR XR PYHON C++
I have to listen still to this episode and I am looking forward to test it. Out of curiosity, what alternatives besides marimo and jupyter are still available? Did i understand correctly, that marimo is pure Python, easier to track changes with git? I am also looking forward to this new option, to keep order of cells
Yeah that's correct, marimo is pure Python and git friendly.
however looking at the python code in editor it lights up quite a bit because of warnings from ruff linter 😛
p1 - low leverage + small residual (good leverage point)
p2 - low leverage + large (?) residual (outlier (?))
p3 - high leverage + large residual (bad leverage point) (influential)
is this correct?
I asked chatgpt and it gave me something completely diff
it said p1 was an outlier but I cant see why
and it said p2 had high leverage
Also am I understanding this right?
low res + high lev = influential
high res + low lev = outlier
both high = could be influential
Are GANS just nonsense hard and unpredictable and just fry your gpu? Like, they are fighting to get what they want. They are in chaos. Does anyone casually just make GANS? At least in RL, they is some sort of sanity. You know?
They are hard to train and resource-intensive, yeah.. but arguably with modern frameworks people are out there "casually" making them. It's just not easy to get to that level from scratch.
Are you doing your own by hand or are you using a library?
I'm feeling super dumb suddenly, what's a use-case for np.sum(array, axis=-1)? I'm trying to think of when I want to work backwards vs forwards.
Try making an array and taking the sum of it with different values for axis
See what happens
I mean, I understand what it's doing mechanically, I'm just trying to come up with an algorithm where I'm going to want that
I feel like I should be able to think of four really easily and it's not happening haha
Oh duh, image processing where you want to sum across the three color channels or something, I suppose.
I guess axis=0 is like "reduce rows", axis=1 is like "reduce columns", and axis=-1 is like "reduce innermost dimension"
Note that this isn't just part of np.sum -- it's shared behavior for most vectorized numpy operations ("ufuncs")
This is precisely the idea
It's a very flexible system
I'm trying to re-program my brain to think of the 'matrix approach' to things first; it's slightly painful. Thanks for the confirmation.
ye -1, is p nice for that
generator, discriminator, leaky relu, I do not know what you mean by scratch. Honestly, it is resource intensive no matter what.
off the top of my head, I think prior to diffusion models GAN was pretty competitive in imagegen
tho I'll admit I don't know the specifics
fry your gpu
I mean all neural networks do that once you get big enough
llama 8b? casually eats 20gbs of vram (if you do no quantization)
i'm guessing it's the same as setting the axis to the final dimension
i could be wrong though
gan isn't too bad tbf in the grand scheme of things, like compared to diffusion models inference and training is p cheap for hte most part
ye it's similar to how do u -i
as in if there's n dimensions, then axis = -1 would be the same as axis = n - 1
for lists, its select the ith list strating from the inner d imension
What do you thin of Maven Data analysis course with python is it good or I will just repeat the process and when I find a real world problem I will be stuck like stupid? And What About Alice Zhao is she good she made an advanced sql course but I couldn't download it because it was uploaded to rapidgator?
I do not know, two NN's just figting for nash equillibrium. Bert and stuff, you are pretraining text data on a genius, not two little angry neural nets trying to win a war and come to terms.
I mean, clearly there's merit to it if it works well
probably will have to read a bunch to really understand why it works
I am just stressed, it is hard. I get how it works. you are trying to get the discriminator and generator to pretty much come to terms and agree on where they are like "ok, I can pump fakes and you can pump real data", and they are like "ok, that works". Pretty much. Just take so long to train and it is never optimal even with insane epochs and hyper parameters. The GAN game.
The history of captchas is the longest running training session
Ah soon I will be active In this group as I am taking data analytic, ai, data science/ml path later on
So excited to discuss with you guys n learn some cool hacks n tips
I wonder if anybody's tried to make a meme model that just uses the mantissa bits of FP16 NaNs, and ignores all real floats
Hey guys give me the road map to data base administration
does huggingface allow to generate embeddings through api without downloading the model locally ?
or there is any other free service ?
Thank you so much agentQ
underrated comment. that was amazing
Yo! What's up mates I am back after a LOT of work! What's goin' on?
I learnt using Pandas and learned how to Clean Data
Now what??
Hi, are there any pages good for checking for data science jobs, scientific software dev in Europe? Linkedin shows me only promoted jobs first, very annoying
umm HI mate!
You can do Freelancing! Its easy and good for DS and ML.
Go to Fiverr or Upwork
sign in, create Gig and give out your sample projects. and BAM! you are done!
u get it?
bro, like, when most people talk of RL who are not in robotics or optimal control theory EE stuff, are they just talking Q-learing? With Q-tables? Like, what is up?
Not that I'm aware of
They're more of a hosting thing than a service thing
nvm, you absolutely can: https://huggingface.co/docs/inference-providers/en/index
Doesn't seem to support every model type though
Here's a snippet on how to generate embeddings:
from huggingface_hub import InferenceClient
client = InferenceClient(
provider="hf-inference",
api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)
result = client.feature_extraction(
inputs="Today is a sunny day and I will get some ice cream.",
model="intfloat/multilingual-e5-large-instruct",
)
Here's a bunch of other embedding models: https://python.langchain.com/docs/integrations/text_embedding/
A lot of things can cause this.
From this learning curve we could infer your model is overfitting. Your model keeps fitting better to the training data, but it’s no longer generalizing better to unseen validation data after 70th-80th epoch.
- What's the model you're training?
- Are you using sufficient amount of data to train the model?
- Did you apply any regularization?
- BatchNorm, Dropout, Weight Decay... are you using any of those?
- Have you done hyperparameter sweep on your learning rate?
- Tried using different batch size and it's still not improving?
I'd like to hear what you've tried so far.
Hello I've made a Hcaptcha solver in python as university project, useing Ml.
Don't need anything but a feedback.
That's the repo -> https://github.com/Irodavlas/HCaptchaSolver .
If anyone wants to give it a try lmk it should take few minutes since I've put tests on it.
Contribute to Irodavlas/HCaptchaSolver development by creating an account on GitHub.
hey guys i have just used the function calling in openai. and I dont understand where could it be useful? I understand that we define a function description and if the prompt matches the description , the model would extract the parameters . Now these params could be use for an api function to extract the real-time info , but then to pass that real-time info , we again need to send the same prompt again , to get the final outcome . So its a two step process.
I have a question
What's your opinion on Neural Prophet?
Because I saw normal Facebook prophet get completely bashed
guys bge-m3 or openAI, which embeddings sare better?
#1362805794004799698 message
anyone can hele me?
Hello guys Anyone here who master using RAG framework with chatbots ?
Hello, remember to never ask to ask or ask for an expert. Always ask your actual question right away.
How do I train my AI model with my own scratch datasets? I'm planning to use pytorch and pandas for this.
This is a big question, you may want a whole course https://course.fast.ai/
In a hypothetical situation, if I have a project that uses AI and Machine Learning, I should probably learn the basics first to understand the better logic and such?
Yep. There's a lot to learn, and it's daunting at first, but you should learn it from the ground up.
Feel free to have a big goal in mind to motivate you of course.
For me that's meant re-learning a bunch of calculus I hadn't paid enough attention to the first time around.
Well damn, that's a tough one. but ig I'll read the needed docs for that. hopefully I can show some progress for my project that involes machine learning and ai.
Supposedly, my only focus are on web-dev, but I got shifted to learning ai and such.
That's a big shift of scope
Like, I don't want to belittle webdev, but ML is a bigger problem to tackle
Check out the 'pinned' stuff at the top of the channel, it seems pretty good
Welcome to the server.
Are you familiar with any basic ML stuff and/or Python, or is this your first outing?
Have you looked at this? https://docs.ultralytics.com/
Do you have a Kaggle notebook going already?
Do you have a dataset to work with?
yes i have a notebook and i also create my own custom dataset for my project
This might just be your font that it's chosen to use, because I think the MongoDB console is utf-8 by default.
Has anyone played around with the HALO Hat for the Raspberry Pi 5?
It adds 26 TOPs and I was wondeing if I should get one or save to build a dedicated rig with a 3060
Is there a fundamental difference between using an embedding layer and one‑hot encoding into a fully connected layer?
Imagine I want to create a program where that will guess your facial expression and based on your facial expression place the song on Spotify
I don't have any Spotify premium
Are people with knowledge in Deep Learning here? If yes, please write me a DM
What is it about. We can all learn on the channel here
I am kind of confused by ResNets. I've seen the provided picture and know that the architecture on the right represents them. But I don't get how the input is provided since it is combined with more prior output as of what I've read, so how the "Residuum" really is calculated. And how the Skip Connections work
From my experience, the most simplified explanation of everything in data science and ai is - input goes in, output comes out. It is really that simple.
I know that one haha but that doesnt help me 😄
This looks like a convulational neural network for image processing. With a bunch of processing layers
@versed bloom did you pull the equations? That is the how. Or what you may be seeking to understand
The skipping is the dotted lines i gues, ResNet is a translated form of CNNs. I dont need a equation, I dont know how the input of a layer comes up since it is combined with some other prior output and the initial input?
The ‘skipping lines’ have equations. That is how most of these ‘complex’ models work
Just googled it
Did you try this? https://stackoverflow.com/a/61034368
On a Windows 10 PC with an NVidia GeForce 820M
I installed CUDA 9.2 and cudnn 7.1 successfully,
and then installed PyTorch using the instructions at pytorch.org:
pip install torch==1.4.0+cu92 torch...
I am trying to create a bot that extracts energy prices in de EU, per country and want to have live updates that are relatively up to date. I found one website that is both free and updates their dat frequently. But I can't find out how to use their API as they seem to be transitioning websites.
Does anyone know an alternative database for this or how to actually access their api to extract the data live and semi-continuous?
I made a GAN that did not mode collapse, I have goosebumps this is amazing and magical. GANS are magical. They really are.
I did not think this was possible. I love this!
wdym "transitioning websites"? I can't find any API docs on their site anyway
Check to see what government agencies have available for this category of data. Leading energy companies. But I wouldn’t think they would freely have that data available for their competitors to gain a competitive advantage
Hey does anyone know if PyTorch could take advantage of 2 GPUs? Was planning on getting x2 3050’s with 6gb of VRAM each. I know I can’t train a massive model but I want to try training something small from scratch or fine tuning a 500M - 1B model
Do I need 1 GPU with 12GB vram or will x2 with 6gb do the trick?
Yes. You can adjust the device map settings
Yeah it's got a thing called DataParallel. Here's a slightly old tutorial but I don't immediately see anything wildly out of date https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
Thank you 😄
Hey everyone I have basic knowledge of python mainly I am a web dev but want to learn ai since I am not getting anyone who wants a website :-:
Can anyone guide how to start in ai field and how to improve further
For context i ald have basic knowledge of python can use APIs have a foundation of pandas too
Oh hey I do the same stuff! If you want to get started first learn prompt engineering. I made a flask website and started using the Gemini 2.0 flash API since it’s free and easy to get started with . I recommend learning more about how AI works before you start making your own.
My Gemini site is cryptoknightai.com
Building it definitely helped me learn the basics and now I’m learning PyTorch
Great tool to start learning
import numpy as np
from PIL import Image, ImageOps
const_x_mean = 33.318421449829934
const_x_std = 78.56748998339798
epsilon = 1e-10
img_path = 'one.png'
img = Image.open(img_path).convert('L')
pixel_mean = np.mean(img)
if pixel_mean > 127:
img = ImageOps.invert(img)
img = img.resize((28, 28))
img_array = np.array(img).reshape(1, 784)
img_array = (img_array - const_x_mean) / (const_x_std + epsilon)
prediction = model.predict(img_array)
predicted_class = np.argmax(prediction)
print("Predicted class:", predicted_class)
I need help it load wrong image data
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential(
[
Dense(128,activation='relu', input_shape=(784,)),
Dense(128,activation='relu', ),
Dense(10,activation='softmax')
]
)
any boday can help me how i can load image correctly when i predict
model work with the test data correctly
hey
any of you remember your first GAN?
I remember mine like yesterday, even tho it was today. What am I going to? Generate fake celebs? It just sounds so dumb to master gans. I mean another tool in the arsenal. Oh, boys,(and girls) I am proud.
tried everything. nothing seems to work.
What does torch.zeros(1).cuda() do for you? Do you get an error message?
yes.
Did you install CUDA toolkit?
How did you install pytorch?
normally. pip install torch.
thats not right.
yea, just noticed this.
done.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
this too returns cpu only.
still doesn't work.
Try uninstalling pytorch completely and then run your test code. If it throws "Module Not Found" error, then its definitely in the right environment. If it still runs, then you've installed it in the wrong environment.
You might consider setting up a virtual environment
Virtual environments are isolated Python environments, which make it easier to keep your system clean and manage dependencies. By default, when activated, only libraries and scripts installed in the virtual environment are accessible, preventing cross-project dependency conflicts, and allowing easy isolation of requirements.
To create a new virtual environment, you can use the standard library venv module: python3 -m venv .venv (replace python3 with python or py on Windows)
Then, to activate the new virtual environment:
Windows (PowerShell): .venv\Scripts\Activate.ps1
or (Command Prompt): .venv\Scripts\activate.bat
MacOS / Linux (Bash): source .venv/bin/activate
Packages can then be installed to the virtual environment using pip, as normal.
For more information, take a read of the documentation. If you run code through your editor, check its documentation on how to make it use your virtual environment. For example, see the VSCode or PyCharm docs.
Tools such as poetry and pipenv can manage the creation of virtual environments as well as project dependencies, making packaging and installing your project easier.
Note: When using PowerShell in Windows, you may need to change the execution policy first. This is only required once per user:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
i've already done it.
the same thing's happening. how do i uninstall it from the "wrong" environment too?
How do you know its in the "wrong" environment?
How did you uninstall it?
If you have done it, then you would have used it and installed directly in it. Then you won't have this "right" or "wrong" environment
still runs.
"postCreateCommand": "cd detectron2-0.6 && python3 -m pip install -e ."
this is in my docker file
I got
Running the postCreateCommand from devcontainer.json...
[5223 ms] Start: Run in container: /bin/sh -c cd detectron2-0.6 && python3 -m pip install -e .
Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///workspaces/FFS-main/detectron2-0.6
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
running egg_info
creating /tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info
writing /tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info/PKG-INFO
writing dependency_links to /tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info/dependency_links.txt
writing requirements to /tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info/requires.txt
writing top-level names to /tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info/top_level.txt
writing manifest file '/tmp/pip-pip-egg-info-qpqnconv/detectron2.egg-info/SOURCES.txt'
error: package directory 'detectron2-0.6/projects/PointRend/point_rend' does not exist
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
[7480 ms] postCreateCommand from devcontainer.json failed with exit code 1. Skipping any further user-provided commands.
Done. Press any key to close the terminal.
However
vscode ➜ /workspaces/FFS-main $ ls -l detectron2-0.6/projects/PointRend/
total 24
-rw-rw-r-- 1 vscode vscode 7467 Oct 26 2021 README.md
drwxrwxr-x 4 vscode vscode 4096 Oct 26 2021 configs
drwxrwxr-x 2 vscode vscode 4096 Oct 26 2021 point_rend
-rwxr-xr-x 1 vscode vscode 5160 Oct 26 2021 train_net.py
I do have this folder, any ideas?
strange to see pip3 in a windows. What happens if you do pip instead of pip3?
can you share what you do have installed? pip list I think
Package Version
------------------ -----------
certifi 2025.1.31
charset-normalizer 3.4.1
filelock 3.18.0
fsspec 2025.3.2
idna 3.10
Jinja2 3.1.6
MarkupSafe 3.0.2
mpmath 1.3.0
networkx 3.4.2
numpy 2.2.4
pandas 2.2.3
pillow 11.2.1
pip 25.0.1
python-dateutil 2.9.0.post0
pytz 2025.2
regex 2024.11.6
requests 2.32.3
setuptools 78.1.0
six 1.17.0
sympy 1.13.1
tiktoken 0.9.0
torchaudio 2.6.0
torchvision 0.21.0
typing_extensions 4.13.2
tzdata 2025.2
urllib3 2.4.0
why the hell is there no "torch"
but the code still runs?
how did you run your notebook?
What does your "kernel" show?
"Python 3.12.6"
'night.
"Run All" and yes, i restarted the kernel many times.
I use CoPilot and ChatGPT to find resources and to give me an idea on how to go about starting a particular project in mind. Is this an effective way of using these tools? I never use AI for helping me complete any of my programming projects; any issues that come up in my code, either i figure out myself, ask the discord here or seaching on google (stackoverflow, reddit etc)
crazy pfp
Is there a preferred Python version to use when it comes to machine learning and AI? I used to work with Python 3.13 but had many issues with PyTorch and had to roll back to an older version, now i use 3.10 mainly and 3.12.6 occasionally.
I'm building an ai voice agent in python that streams audio from twilio to assmebly ai stt for transcription and vad, but it takes 4 seconds to reply back with the final transcript. I want it to be less than 1 sec.
Can anyone help with this?
3.11 is sort of a sweet spot at the moment for many libraries.
should I only use sklearn correlation matrix before start feature scaling?
Anyone have experience with embedding? Trying to understand how it difference form one hot encoding at a low level.
Take a look at VAPI if you havn't yet.
I use the same. Helps me with research but I combine them together with my Google-fu searches. Once I lay everything into a document and review what I want and how I want, I can then start properly.
There's no preference except look at whether the libraries you wanna use is supported for that Python version. For instance, spaCy is supported just before 3.13.
When doing PCA, increasing number of components shouldn't affect previous features are selected should it? i.e. picking N features should lead to the same list of features N-1 but just with an extra feature added
Increasing number of components should affect previous features, because you added a new principal component.
maybe i need to rewatch how it works
i thought it was like tuning a polynomial fit like a taylor series
hey guys, i just implemented a byte-pair encoding algorithm for a corpus of text, however i'm not exactly sure why this issue is happening
basically the issue is that after a certain number of iterations, rather than creating pairs it adds empty strings to the corpus, which originally started out as an array of each char from the original text used
here is my code and the corpus used, if anyone knows about this and can see the cause of the issue can you please help? thank you
import os
command = 'cat text_data/forms/abc/AbcPoems2AbcHkAndChinaV2Cauchy3Poembycheungshunsang.txt'
executable = os.popen(command)
corpus = list(executable.read())
executable.close()
vocabulary = []
for i in corpus:
if i not in vocabulary:
vocabulary.append(i)
for i in range(1000):
pairs = dict()
for j in range(len(corpus) - 1):
key_list = list(pairs.keys())
pair = ''.join(corpus[i:i+2])
if pair in key_list:
pairs[pair] += 1
else:
pairs.update({pair: 1})
pairs = dict(sorted(pairs.items(), key = lambda item: item[1]))
vocabulary.append(list(pairs)[-1])
for j in range(len(corpus) - 1):
chars = ''.join(corpus[i:i+2])
if chars == vocabulary[-1]:
corpus[i:i+2] = [chars]
and here's the corpus i used, which i then converted to an array of chars
2 ABC of H.k. and China revised vision.
Barrels tears are wines and salts.
With a whisk on goody tails!
Wiggle maces to fix the heads.
Heads in jack on boxes are ceased.
Cry to paranoid truly bosses.
Bosses are jokers take your boys.
Studs are bogs with fire apples.
True predicates worth cases.’
Descents wash in badly bands.
Wholly sales are smart with cats.
Who got tenth honors in China?
Homage grand to play and plays!
Trim the times of hearts then cry.
Tanks in steels but voice wail.
Bossy dragged by tails that whisked.
Go very timid and love the wise.
Hands are lent but laws are ends.
Cases on courts are borrowed lands.
Length long with treads to retch!
Straps on times and watch here.
Arrays tanks but all are men.
Cross all suctions steal the ends.
Cave on minds are cages on objects.
Rouser rockets powers holes.
Confine curses to stop our wounds.
Whirl your bodies and jump on grounds.
Crouch of soldiers after kicks with flings.
Block one leg and hit the middle.
Cauchy3 know the tricks to kill.
Threaten weak oppressed ill.
Surpass scores are bad in honors.
Wash to think that build the homes.
Angel sins but cauchy3 has funs.
Make ones tools when hats are found.
Worlds are drawers on bottom noses.
Singular ugly piece is rose.
Wily mores are teeth of sharks.
Saw with tooth is laws in arts.
Artful men power with grids.
Bodies stamped and wills are ridden.
Sign in forth with battles conquered.
Triumphs on candles whip the stands.
Soups are soaps and faiths not come.
We are meats in balls and rice to constants.
---Cheung Shun Sang=Cauchy3---
i know the poem is a little weird lol, was from a random dataset i found on kaggle
here's an example result of what i mean with what's going on with the corpus
Click here to see this code in our pastebin.
I might be thinking of a cosine transform actually
I'm pretty sure that works that way
Exactly what I do. Tbf I think using AI in this regard is so much more useful than just letting it do everything for you
Hello everyone, I implement some optimizers using TensorFlow. I hope this project can help you.
This project implements optimizers for TensorFlow and Keras, which can be used in the same way as Keras optimizers. Machine learning, Deep learning - NoteDance/optimizers
I’m building a new rig and I am getting into AI training and running LLMs locally. Are there any good AMD GPUs for AI devs or is it just really an NVIDIA thing? I’m finding a lot of AMD GPUs with a decent amount of VRAM are much less than NVIDIA.
I'm not aware of any non-NVIDIA hardware that's anywhere nearly as widely supported as NVIDIA hardware. You can look to see if PyTorch runs on any non-NVIDIA devices, and with what caveats.
You will find that the amount of compute resources needed to fine-tune or deploy LLMs varies by orders of magnitude. you might consider not buying any AI-specific hardware at all, and using the savings to rent cloud compute.
You can't train an LLM from scratch on consumer-grade hardware--you can only maybe fine-tune an existing one.
Modern AMD works mostly fine on PyTorch now, this is in large part due to AMD directly supporting Pytorch.
Actually you can from scratch. I know a guy who trained a 2B and a 4B model off data sets he got off hugging face
The 4B may have been cloud but I know he did the 2B himself
Yea I was gonna say that too
It’s all compatible just asking wether or not it runs well
You would need something like this to run it locally though: https://tinygrad.org/#tinybox
One GPU is not enough.
(15k USD)
Damn 😭
Not tryna train the next chat gpt, just learning how everything works. I trained a 100M on my M1 MacBook Air
It’s more important to learn the skills IMO and the you can judge if a device like that is worth it
This is not to train something like ChatGPT, that costs millions.
Ok, if you just want to learn some ML, any modern consumer GPU will do.
Except Intel or whatever.
Yea but from what I’m finding most AMD cards come no where near NVIDIA
No idea of the status of Intel, seems like no one cares about it.
Imagine running trying to fine tune DeepSeek on an Arc 😭
The most recent is not too far off. And way cheaper.
AMD is chosen for price.
There not great for much. They started with laptop GPUs and then tried to do desktop but it didn’t really work out
iirc there are some programs that support inference on AMD, but for training you'll really want NVIDIA
Kind like snapdragon is now trying to make non phone chips
There all so expensive tho 😭
You can train on AMD.
Ima email Jenson and js be like hey you gotta have an extra H100 laying around somewhere right
Can and should are very different
Nvidia is the typical option, and probably what you want. If you can get one...
there is also the option of just renting cloud compute instead of purchasing a GPU though, specially if you want to try training/fine tuning larger models
Your goal is not to make some giant model or anything anyhow, so why not.
I have a $200 digital ocean credit from the GitHub student program so that’s what I’ve actually been doing
It’s $3.39 an hour for an H100 rig
Because I also don’t wanna make a micro model. Target is like 2B-5B-7B
By not huge I mean not like DeepSeeks full 164B or whatever it is
Anyway any AMD cards that you think would be semi fast for training a 5B?
No, nor would any Nvidia I think. Not enough memory even. IIRC you would need like 60-80GB of VRAM.
3090 24gb actually could and would only take a few days-a week
Have you done this?
Same friend that I was talking about before. He trained a 2B on a 3090
From scratch with torch
And I ment it’s the guy I was talking about that did the 2 and 5B
I asked if he did both locally or the 5 in the cloud and he said 3090 did both
Well it seems like you have your answer then already.
Nah not rly, I’m asking about AMD cards that have similar performance
The memory is 24 GB, and it has about 36 (rounded up) TFLOPS at half precision.
It goes for about $1,700-1,800.
The Radeon RX 7900 XT has 20 GB, and about 103 TFLOPS at half precision. It goes for about $1,000-1,300.
The Radeon RX 6950 XT has 16 GB, and about 47 TFLOPS at half precision. It goes for about $500.
So the 3090 is clearly optimized around memory, likely to be able to hold a lot of texture data.
So games can load once and hold it all in there.
The conclusion here is the Nvidia prices are absurd, especially for a GPU that old.
Nvidia 5090 and such are way faster in terms of half precision FLOPs, but nobody can get a hold of them.
(And also have 32 GB)
Doing a bit of math to check this. It seems like with 24GB and some tricks you can just barely fit the 5B in 24 GB (during training).
So, important to keep that in mind if you want to go AMD, since it has less VRAM (unless you are willing to increase the price, then you can get 32 or 48 GB).
But on the other hand more FLOPs. So if you go smaller, you can go faster than the 3090 (e.g. 3-4B).
Note that the tricks used also degrade the quality, but since this is just for learning / messing around, that does not really matter.
hi i saw this right one, first of check which cuda toolkit you have and if its compactible with exisiting torch version
and also are you on windows or linux?
windows.
ok, ill get back to you in a while.
sometimes the versions of cuda , cudnn might cause such problems ,
This is using the detectron2 framework
from detectron2.engine import launch
...
def main2(args):
...
print(outputs)
return outputs
def launch_main():
# Create arg parser
arg_parser = setup_arg_parser()
# args = arg_parser.parse_args()
args = arg_parser.parse_args(["--dataset-dir", "/workspaces/FFS-main/data",\
"--test-dataset","E2E_Robotics_ood_val",\
"--num-gpus", "1",\
"--config-file", "/workspaces/FFS-main/Flow_Feature_Synthesis/detection/configs/AD-Detection/regnetx.yaml",\
"--inference-config","/workspaces/FFS-main/Flow_Feature_Synthesis/detection/configs/Inference/standard_nms.yaml",\
"--random-seed", "8",\
"--image-corruption-level","0",\
"--visualize","1"
])
# Support single gpu inference only.
args.num_gpus = 1
# args.num_machines = 8
print("Command Line Args:", args)
outputs = launch(
main2,
args.num_gpus,
num_machines=args.num_machines,
machine_rank=args.machine_rank,
dist_url=args.dist_url,
args=(args,),
)
print("outputs in launch main are:")
print(outputs)
the outputs printed inside main2 are correct
but when I try to get the result in launch_main, it shows
outputs in launch main are:
None
any ideas?
You’re not returning anything from “launch”
does the autograd/backprop engine in PyTorch first build a topologically sorted graph and then just runs backprop or do they somehow "merge" the two?
Sounds like it does a topological sort first https://pytorch.org/blog/how-computational-graphs-are-executed-in-pytorch/
What type of projects can I make with CNN classification?
mnist is a classic
cnn works wonders in image recognition
Our teacher gave us a dataset to classify cotton leaf diseases and the images are about 100kb on avg. Would I be able to train the model with a good F1 score using these low scaled images ?
use a pretrained model
Ok
hey guys I'm interested in data science is there any specific website in the pythondiscord.com/resources for data science? or should i just learn python for now?
Do you have any budget for courses/websites? Some of the nicer-seeming options cost a little something.
(Plenty of free stuff too, but there are some nicely-structured paid things)
Are you past the basics of Python? If not this is a pretty good course https://pll.harvard.edu/course/cs50s-introduction-programming-python