#data-science-and-ml
1 messages ยท Page 164 of 1
and then there's https://course.fast.ai/
down the line what should it be? should i learn like advanced math and data and shit?
Yeah, I see it as a repeated loop of learning a new ML idea, then going back and re-learning its math, and then going again
for now tho just python? as a beginner of course
Yeah just Python is fine. You can learn Triton or something to add to what you can do later.
any other languages along python later on?
A lot of people like to learn C++ or Rust, but to me this might be the power move second language for ML https://triton-lang.org/main/index.html
can you tell me why?
(This is what most of OpenAI's models are written in)
This lets you write code that runs on the GPU very efficiently, and do it cross-platform without needing a specific nVidia card.
ah nice
This would be an alternative to, say, learning CUDA directly
what jobs can one land with data science?
At this point, whatever you can dream almost; everybody wants data science.
highly competitve? evne more than software engineering?
Well, it IS software engineering, so it's hard to compare
It's super hot so yeah, you could say so.
But how competitive depends on your level really
super beginner
You'll have a lot of competition, but if you know stuff, the market is thirsty.
never back down never give up
If you're actually interested in the topic, you should be able to stand out, in my opinion
There are a lot of people just doing it because it's supposed to be hot
nah i kidna like it and interested in it
Is it normal to get a career in data science without a degree?
No, it pretty much never happens.
Data science is even more degree-requiring than software development.
woah
Thank you, thought that would be relevant to @umbral hatch
thanks. yeah it is, but i do want to have some experience and background, alongside some real projects
What kind of projects are you thinking of?
for now, im barely learning loops. For a future project, it'd be something like asking the user for inputs what's your name, age...etc and then in the same project something interactive, multiplying the age, scrambling names. this is far down the line tho
Very cool! Also smart to keep it simple at first.
yeah and for fun maybe start coding games and learning other languages. main reason i went into coding
How do you lot use databases with your ML programs? I'm trying to make my first ML program to see the likelihood of someone developing heart failure. I cleaned the data and then transferred it into an SQLite database
ML program?
@serene scaffold do you know what the issue here is
sorry to ping, you're the only one here that i know is into nlp
Machine Learning program. Just something simple; using linear regression. Ill spend this week on it, send it to this channel for everyone to critique and then I'll attempt to make a proper project (which will have mini projects within it as I try to expand my knowledge and skills in ML programming)
OH! I should probably know that considering thats what Im going for ๐
when you say "pairs", do you mean "two adjacent words"?
remember that lists are not arrays.
corpus[i:i+2] -- if you try to do a string slice that is out of range, you'll get an empty string.
!e
print("hello world!!!"[100:100000])
:warning: Your 3.12 eval job has completed with return code 0.
[No output]
note that it did print("") rather than erroring.
to adjacent items within the list, which starts out to be all chars but then they get grouped together
i increment until one less than the length though
for i in range(1000):
pairs = dict()
for j in range(len(corpus) - 1):
key_list = list(pairs.keys())
pair = ''.join(corpus[i:i+2])
do you see the problem?
also, look at the number of lines: https://paste.pythondiscord.com/NMA23TQ64ACSOJMZ4K5GC4OREA
i end at one before the last char though and i'm joining two chars
i don't understand how it's going past the last character
look at what variables you're using for what.
for x in range(1000):
pairs = dict()
for y in range(len(corpus) - 1):
key_list = list(pairs.keys())
pair = ''.join(corpus[x:x+2])
this might make it easier to see.
you are not
no these kinds of mistakes happen far too frequently for me
i should have known to check a few more times before asking for help
It's good to figure out what you can on your own, but don't be ashamed to ask for help.
nah that's not the part i'm worried about
sometimes i just make so many mistakes like this
it's scary what's gonna happen in the future
You're being too hard on yourself. Everyone starts out like this.
hopefully this issue goes away as i practice more
I think it will.
Hi all, in machine learning, in particular feature update or (rollbacks?) how do you solve the "Any change can break everything" problem?
Basically I made some feature changes to my model inputs in jupyter notebook and it performed worse than an hour ago. how do we solve this?
Has anyone tried using Helicone? I was set on using it to track the cost of tokens and requests per conversation, etc but the dashboard just kept showing the default sample data metrics ๐ญ
I've verified that I am currently sending through their gateway tho... Does it lag?
If i have a rag from which I want to pass in context to an llm (to use relevent data)
should i send this in system prompt or user prompt?
how can it effect the outputs
how much it affects the output will vary depending on the model, but in general you'll want to put "trusted" data in the system prompt, "untrusted" data in the user prompt
you shouldn't rely on the model to distinguish between it, but it might help it understand how official or reliable that data is
thanks that makes sense,
Hello! I am trying to dive deep into tokenization and understand the need of the shift towards subword based tokenization.
One of the main points I see is how hard it might be to define splitting rules, especially for complex languages. I keep seeing turkish as an example but I don't know the language so I can't tell that much, but I think I can see ot even in english of a word is a somewhat complex combination of prefixes and suffixes (but no specific examples come to mind unfortunately)
Another main point I see is ambiguity, in the sense of which representation to use. I see example like "don't", should be considered one token or be split into "do" and "n't" or "not". And "U.S.A" or "New York-based" for example, how should it be split? And I'm wondering if it's that hard to agree on one common way of doing or if it is that hard to define rules for doing so. I see arguments saying that it depends on the use case, so for some cases one way of splitting is better than the other, but I can't think of any examples.
Can you shed some light maybe? Or give some clear examples? Thank you!
I feel like that is a complex question with not at all the same set of examples. realistically an LLM would understand "don't" as a single token and breaking that up would be model dependant. (and also would it even be useful?) The same can be said for splitting prefixes and suffixes, at least in English where you would fundamentally change the meaning of a word by doing so which I say would just cause more harm than good. A stronger argument could be made for your last example but by that same degree I feel like splitting the tokens just by some set standard would yield unreliable outcomes. given that New York-based would have a different meaning than New York, based or similar depending on the model in question.
The greater support or capability for natural language interaction makes the argument even harder to justify IMO.
is it possible to use fuzzyword to generate and map such subwords? as a part of tokenization. Because how long can we count and create such subword lists?
Hello, I'm here to seek guidance. I have a project in which I have to train an AI model on segmenting tumors from breast mammograms. This is my first AI project so I'm kind of lost at the moment.
The dataset that I'm working with is the INbreast 2012 dataset on kaggle. I have managed to load the DICOMs and their corresponding tumor masks and train a UNet model on them, but I did not get any promising results. All metrics like dice and IoU are very small (less than 1%).
I'd really like any help if possible. Thanks in advance.
I've been learning a lot of the statistics for ML. As I'm still at the beginning of my ML journey, how much of a focus should I have put in learning the linear algebra side?
A lot. It will also enhance your understanding in others parts, like statistics. It's foundational as a building block due to modern computers being designed around it for performance.
Ill finish off learning the stats. Tbf a lot of the stats I still can recall from high school several years ago. Just need to revise them. Then Ill go onto learning all the linear algebra stuff
gans make me feel dirty about myself, they waste time, I thought they were harder than obj detection. Any of you young and good lads have any resources at your disposal for object detection?
you think rlhf is not really RL?
I wouldn't call it fake, but it's clearly different when humans are involved
yes, the reward are still through actual tangible data
As in computer vision object detection?
We find no clear correlation between failures and the point at which the modelโs context window becomes full, suggesting that these breakdowns do not stem from memory limits
Interesting. Any thoughts about it?
Also, arxiv entering the 21st century with HTML! ๐
Nice work, science! ๐
Yeah that part is super interesting, I'm still trying to decide what my mental model for it is. It's not like the model can get 'bored'.. is it just the 'game of telephone' it is playing with only being given the last 30,000 tokens etc? Not sure.
I haven't read the full thing so don't have a good mental model of it. But I've got some decent mental models in general. I like Tim Scarfe's take on it, that they tend towards the mean while we push towards chaos: https://www.mlst.ai/p/agentialism-and-the-free-energy-principle
Does anybody that has worked with VAE's know what ways i can increase model performance?
yes
https://www.youtube.com/watch?v=8jXIAWg_yHU&list=PLjMXczUzEYcHvw5YYSU92WrY8IwhTuq7p&index=1
^ Full university course on computer vision from the creator of YOLO. Thi sis the best resource you're likely to find anywhere
guys
could anyone suggest me a machine learning documentation:
i am a kid and i am intrested in ML/DEEP LEARNING.
i am trying to learn linear algebra and stuff
could anyone please suggest me a doc, cause the docs i find is very complicated and not well explained
Have you tried a more hands-on learning course like https://kaggle.com/learn ?
Can anyone help me try to decipher what a model matrix is and how to create one? Third fucking time I'm asking. If you cant be asked to help, at least guide me to a decent resource that can help me understand what a model/design matrix is because I cannot find anything remotely useful on the internet that tells me what I need.
Ah yes be condescending and rude while asking for help. Excellent tactic
I'm sorry you're frustrated. I've never heard of a model matrix or a design matrix.
We do our best in this server to connect knowledgeable people to newcomers, but everything is ultimately voluntary and no one is required to help or entitled to receive it.
This raises the question: why do you think you need to know about something for which scant resources exist?
Is pandas is equivalent of Excel
I mean pandas is a library for data queries and other stuff. Excel is a spreadsheet program. Yes both can be used for data query and analysis but they are very different approaches
For data analysis what's basic skills required ??
I feel like that heavily depends on what data you may be analyzing, what your goal of the analysis is and what exactly are you wanting to do.
keep in mind that you almost certainly need a degree--employers are going to be selective about who they trust to help them make consequential business decisions.
pandas is one of the most popular tools for data analysis, but you also need to understand statistics and have some domain knowledge in what you're analyzing.
Just find job as data analysis
Ah then very much what Stelercus said. A degree most likely as a minimum
Hello guys. I am little bit confused on data sharing
I provide Instagram data but i can't extend number of clients
I search lot of website skill required they said Excel ,sql,tableau ,numpy ,pandas,matplotlib ,seaborn
I mean that is basically a catchall wordsalad but still something to start with in what of those things do you know? You are certainly going to need at least familiarity and understanding of those tools and libraries. Knowledge of usage and experience beyond that
Going to need more details like what you created your system with, what errors you are getting or basically just more details in order for someone to be able to help
Doesn't look like that there's a pre-requisite to learning data analytics. Why don't you give it a try? https://www.coursera.org/google-certificates/data-analytics-certificate
Hey guys. I'm new to here (but not entirely new to python). I just wanted your opinions on this
I'm an advanced beginer in python ( I know the basics. Loops, ifs, whiles, input data types... all that kind of stuff) and i've worked on natural language processing using the NLTK library in python. so it's been quite a lot though ofcourse not being familiar with it makes me forget a lot of syntax in the library
I'm presently 17 and will be starting college pursuing a CS and Cognitive Science degree and I'll be working on a research paper on AI before resumption (September) with top level profs and graduates. This paper would be submitted to arXiv and top AI conferences like NeurIPs. I'd be aiming to pursue a FAANG internship though I'd settle for whatever I'll get but my main goal is to master Python Programming by the end of this year up to a given extent.
I'd love anyone that has inputs or advice they are willing to share so I begin working.
^ I responded to this in #career-advice
Apologies if I was being brash. I'm going through a book (Introduction to Statistical Learning for Python or ISLP). It mentions something about a model matrix or design matrix which I believe is to set the template for your model (i.e. defining what X and y will be). ISLP uses its own custom module to create a design matrix. I'll continue on with my project and see how it goes. I'll post it up on here when I'm done with it.
Does this help at all? https://en.wikipedia.org/wiki/Design_matrix
Each row in the matrix is an 'observation', at least as far as I understand them currently.
I guess this is what you're doing? https://intro-stat-learning.github.io/ISLP/models/spec.html
and I guess this is where the rubber meets the road
Never used this API, looks fancy
Yes. That book I'm going through. It feels like a decent book to go through the basics of machine learning and the stastical maths behind (not so much the linear algebra side). I've made a design matrix from the book 'manually' but I dont understand what makes their custom design matrix module better if that makes any sense
Not sure; my guess is that it helps you generate the matrix from the dataset, but I guess I'd have to look at their docs to know why it's cool
Perhaps the most common use is to extract some columns from a pd.DataFrame and produce a design matrix there we go I suppose
I'll continue on with my current project and see how it goes from there. Thanks for trying to help. I have this weird obsession where I must know how everything works or else i cant use it to its fullest potential. I'm slowly getting better at just needing to learn the surface details then applying it straight way and after, go deeper into the subject
yea this is about all I can glean from the terminology in that book as well. Its more of a term used to describe the matrix created with extracted data vs a 'concept' of its own
For making my own image network convelution
Do I make a file with each reference
Image
yo, RAG stuff, any good tid-bits or links?
no, it should be labeled in each folder with what the image is.
hi im new just trying to find my way around this server. im trying to learn python with data science as one of the goals
hello and welcome to our wonderful data science channel
hi thanks took me a while to find this channel its a big server
a "model matrix" is a particular flavor of linear statistical model, so what you're looking for is a "statistical model" https://en.wikipedia.org/wiki/Statistical_model#Formal_definition
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process. When referring specifically to probabilities, the corresponding term is...
after some discretization/sampling and/or a choice of basis in a finite dimensional vector space, a "model matrix" is roughly equivalent to the assumption that your data is described by a statistical model with deterministic but unknown parameters, and those parameters are related to the observed data via a linear transformation
So like fox image one fox image 2 etc?
what do you have? Is the data in files? and in those files, is the data labeled?
Not yet and I'm going over foxes and cats and dogs
Hi everyone, please, I need your help. I currently use Google colab on a mobile device to run datasets. I only just started. I need to analyze datasets from kaggle. How can I use these datasets on Google colab without downloading it?
im doing a project for my uni and everything looks good except that the graph isnt showing anything
i cant fix it and i dont know where's the issue
can someone help? i can provide the code and other files and stuff but it's difficult to just upload everything here lol
Easiest way to download kaggle data in Google Colab
Thank you so much. Will check it out.
probably going to at least need some snippets of code and output screenshyot
can i dm you?
You can but I can't promise I alone could help. And probably won't have a chance to look today ๐
yea all good i will appreciate just trying :D ^^
im busy atm so i will send it later but appreciate you
The basics
i need some pandas help
def in_prop(formula):
parsed = _parse_formula(formula)
bools = [el in elements for el in propeties.keys()]
return all(data)
filtered = df.loc[lambda x: ~in_prop(x['formula'])]
df['en diff'] = filtered.map(en_diff)
i want to filter elements in the formula column of the dataframe based on the result of the in_prop function, but i cant figure out how to do it
this code doesnt work
Sorry if I'm being dumb, what is en_diff?
also propeties is spelled wrong
and where is data coming from?
im creating a new column for the filtered data which i appply a fucntion to
a json file
Sure, but I mean, it's not in the code you show; is that variable in scope?
yeah
im doing this in jupyter notebook
heres the error im getting btw
TypeError: expected string or bytes-like object, got 'Series'
for this line filtered = df.loc[lambda x: ~in_prop(x['formula'])]
i just dont know how to do the filter
Don't you want to say return all(bools)?
in in_prop?
Like..
def in_prop(formula):
parsed = _parse_formula(formula)
bools = [el in properties.keys() for el in parsed]
return all(bools)
``` ?
ohh right
Or is that not what 'parsed' is all about?
ty
If I understand you correctly I THINK the way to say it is:
filtered = df.loc[~df['formula'].apply(in_prop)]
``` but I'm not a pandas wizard.
yeah data wasnt a variable. but the error still persists, i dont think its even calling the function yet
oh awesome that works
wait
nvm
Remove the ~ if you want to invert it
and then you'd do df['en diff'] = filtered['formula'].map(en_diff)
Anyone know if there is a difference between embedding and one hot encoding?
One-hot encoding is simple and is used for categorical data when relationships between categories donโt matter.
Embeddings are advanced and crucial for tasks like natural language understanding in LLMs, where capturing meaning, context, and relationships is essential. Because it stores the info as vectors.
But is the first layer of a DNN not encoding the one hot encoded vector into some latent space in the same way as imbedding algorithm?
oh was that a chat gpt response?
note that this is the best channel in this server, you've come to a wonderful place โผ๏ธ
This is the funniest thing I saw all day
like, they are very different categories, so IDK how to answer that.
1hot can be considered pre-processing, not part of the model, and there are no learnable weights
but ins't enbedding just relating an index to a vector?
embedding is just a concept for what's hapening on the 1st layer of DNN usually
the idea is it is projecting from data space to a latent space.
there are many ways to do embedding. usually, it is not.
Although it can be done that way.
you got an example where this is the case and we can take a look?
Cross Beat (xbe.at) - Your hub for python, machine learning and AI tutorials. Explore Python tutorials, AI insights, and more. - xbeat/Machine-Learning
Fundamentally, isn't embedding just a mapping from a set of binary features to an arbitrary vector?
No. In many cases, the features are non-binary.
take an image or a sound file.
Thank you that's helpful.
or like, stock prices
Oh sick, GItHub has notebook support these days? https://github.com/google-gemini/cookbook/blob/main/quickstarts/Embeddings.ipynb
Itโs had it for a long time
It got a few improvements recently though
Maybe a couple months back? You could search the changelog
Neat
Does the scaling of the feature correspond to a proportional scaling of the embedded vector?
"learned vectors that place semantically or structurally similar items close together in high-dimensional space" is a definition I just found that I kinda like.
I like that as a high level explanation
It's interesting I guess that this sense of 'embedding' is different from the broader mathematical term
Does have a bit of an LLM slant to it though
Like, I was just thinking about whether a 'consistent hash ring' like you'd find in a distributed system is an 'embedding' of its nodes.. and I guess it is in the mathematical sense but not in the machine-learning sense.
I'm interesting in embedding dota2 and league heros/champions
I'm afraid that's a bit over my head.
Imagine you have a bunch of servers and you want each of them to store an even chunk of your data.. You might use an algorithm that assigns them to 'positions' on a 'ring' or 'clock face'
really depends on your embedding.
If it is just a matrix multiplication, then yes.
But in general, no.
Could you point me to an example which doesn't correspond to a simple mapping of a feature to an vector (+ some scaling if the feature is nonbinary)
i didint knew that kaggle had this types of content, any way thank you for helping me out buddy.
classic VGG? https://arxiv.org/pdf/1409.1556v6
Don't think I see anything about embedding in this paper but perhaps I missed in on my first skim through. Cook paper though!
I mean, it is cited 140k times
Well, embedding is just a concept right? Usually the 1st layer we call it emebdding
in this paper they don't use the embedding concept, but you can think of the 1st layer as embedding
Unless I'm misunderstanding you this qualifies right? https://jesusleal.io/2021/01/13/node2vec-tutorial-with-capitol-bikeshare-data/
@limpid dew Or you could say https://medium.com/%40eddiewctan/collaborative-filtering-and-embeddings-3d6a49034965 ?
Maybe I don't understand what you mean by 'simple mapping'
In the second one the embedding represents the 'latent factors' you are trying to optimize for
Yeah, that's what I mean. The main point of my question is to say ins't one hot encoding -> the first layer the same as embedding.
and then there's like BERT where the embeddings are totally contextual https://medium.com/%40davidlfliang/intro-getting-started-with-text-embeddings-using-bert-9f8c3b98dee6
I think the difference might be in how the vectors are treated after the first layer.
Perhaps the only difference is that, in embedding, the vectors are assumed to have the same basis, and therefore can be added together without increasing the dimensionality.
multiplying a one-hot by the weight matrix selects that row I guess
But presumably in practice your embeddings would have a much more efficient way of being looked up
In Lesson 5, Jeremy talks about how he converted user and movie IDs to one hot encoded vectors and then multiplied it with the weight matrices. I just missed the point of this. The one hot encoded matrix is just an identity matrix right? Multiplying it with the weight matrix just gives you the weight matrix again. What was the point of that? A...
Perhaps it's more efficient but mathmatically is would be the same then.
that stack overflow article is helpful, thanks
I take that answer to mean that one hot encoding would be isomorphic to embedding.
hi guys
Hi ccccccp
can you say where general
were general
yeah
nice
?
I can't think of a rule against it as long as it's not against any terms of service, but I guess be careful?
Maybe send a ModMail asking? Dunno.
I'm trying to build a model to beat betting odds for esports like dota or league.
looking for someone to help with the leage side of things as I only know dota.
each dota hero -> an embedding vector hense the questions earlier.
Holy smokes guys! I just made a univerisal complexity scoring tool that works across every digital domain I could throw at it, tabular, time series, images, text, you name it.
In a nut shell it quantifies how hard each example is, automatically sorts training data, from easy to hard, hard to easy, or random. On time series its boosted it to 192% on time series data.
Can someone explain what this means?
Sorrry to flood the channel, I just felt this was too awesome to not share ๐
Could you give a little more detail?
What is a complexity scoring tool
do you mean Kolmogorov complexity?
Sure! It bascailly quantifies how "difficult" or "complex" different data examples are for ML.
It repesents compleixty as a complex number ฮฆ(x), which provides both Magnitude and Phase arg.
Yes but how do you define complexity?
With a mathmathically formula I made
Would you share that with us?
Not at this moment in time, no.
Interesting, why do you chose to represent the complexity as a complex number?
Is it because complex is in the word complexity?
๐ฏ ๐
A better way would be to say ฮฆ(x) gives us a single score that tells us both how hard a data example is and what kind of challenge it poses, so we can train models more efficiently.
You should check out Kolmogorov complexity. I think a similar version of what you describe has been done before.
Heck no I call bs ๐ so it can sort categorical data automatically 
Still fire tho ๐ฅ
idk about that
Well yea it still has to be tested
joking. It's a cool idea
appreciate your guys feedback, bedtime now. but we can dig deeper tomorrow
Naturally makes sense text is the hardest for ML.
Languages, dialects, slang. ,translations. I mean the list goes, its chaotic.
Ok now bed time
You may have to define it more. I was looking over @limpid dew โs conjecture of the Kolmogorov complexity. That conjecture is from the 1960โs for example
I did a brief online search and went into reading about Alan Turingโs research
Bro wakey wakey eggs and bakey ๐ have a good day chat or night
Time series is underrated. Very underrated. It requires patience.
It's not hard mate ๐คฃ
No one said โhardโ, I did not at least. I said patience.
For agglutinative languages where words boundaries are not defined by spaces, and where word segmentation is needed, what kind of definition should a "word" have? For instance, there are such things as compound verbs, these are conjugation of lone verbs that is in practice used as a single word, but in terms of meaning have their own spots in the dictionary. Should these be segmented or kept together? What is the benefit of keeping them together vs segmenting them? Practically speaking, for the usage of tokenization, would keeping them together be better?
my understanding is that since this is word segmentation, the segmented pieces doesn't need to be morphemes, and so keeping these compound verbs together would make more sense, but then there is also the fact of variation, the lone verbs in the compound verbs can take other forms as well, which doesn't betray its own part in other variaties. Would training on more segmented be better this way?
I would tokenize them separately and let the model figure out what it means when they appear sequentially
What language is this?
It would depend on the language but I would say the tokenization should be done at the level where the entire word actually makes sense.
Even for languages like English, it's already a thing to have "sub tokens", which is really just when you tokenize at the morpheme level.
Why? If I say "misunderstand", the "mis" has discrete meaning that can be applied in other words, even if it can't stand on its own
Ah I see
this is Burmese
แแฎแธแแผแฐแแฑแฌแแบแธ แแแบแแฑแฌแบแแแบแแแบ แแฝแแบแธแแฑแฌแบ แแแแแแแแ แแ แ
e.g.
Is there a case where segmenting compound words together would make sense?
mm
I would just always tokenize them separately and let the model figure it out.
well what would be the point of taking prefixes and suffixes off to re-use them when they have no meaning (in those cases) by themselves?
They do have meaning by themselves, they just can't be used by themselves. You want the model to "know" what the prefix itself means
I see I misread the end of their question about training and was only thinking of interraction. yes I do agree that breaking the words apart into their repeated components is worthwhile. It could even be worth doing both?
I notice that in some tokenizers, when this happens, the token appears not as "mis", "understand" but something like "mis##", "##understand" Is this a standard to mark that mis doesn't stand alone, or just specific to the tokenizer i saw?
I've seen that notation used in BERT tokenizers.
I was thinking. Maybe one case where it would be useful to put those together is in something like an aspect based sentiment analyzer? Because this is something reader-facing for business insights and whatnot. But then training the classifier is separately perhaps
Gotcha. So it isn't really a standard across tokenizers.
iirc openai uses tiktoken and they don't do that
I was training a pytorch mobilenetv2 model on limited dataset - only have 1 image for a class so I used data augmentations to make it 10, even with training accuracy around 0.96 it cannot recognize images outside of training dataset their are not in the first 10 predictions . only difference I see in new image and training image is the background and size, and I applied gradient background/resize to training images hoping to resolve this issue. is it worth to use data augmentation on same source image and train model?
https://drive.google.com/drive/folders/1XwHbWh9OEAnCoRS6_glW98BYxwZqtQYM?usp=sharing
A link to the visual results.
Are you planning to monetize this Plunder
if itโs the real deal Holyfield 
Because this cuts across a ton of tech stacks. Like a butt-ton
Even our beloved Python stacks
Python is one of if not the goat ๐
That's the plan. I'm still working through all the possibilities for monetization. The cross domain applicability is what makes this kinda exciting The python language is just the begining, since the formula is language agnostic
Yep. Be ready for the C community. A lot are boomers ๐
Laughing more about dynamics in the community. The C community is the old guard
Most donโt believe any of this data sciencey ai mumbo jumbo
nah but c is genuinely useful
in embedded systems applications c/cpp are pretty much necessary
I believe it. Conversations get very rigid very fast when I converse with the C community
They could easily claim that C drives Python
that's probably cause people using c use it for very specific use cases, unlike python which is becoming the go to for most
Why are there people who talk about LLMs all of the time, but have never ever mentioned a transformer?
it's not necessary to understand transformers to use LLMs, or design new use cases for them, or to evaluate their performance thereon.
Yes. Why do you ask? You'll find that you get better and more answers when you're more forthcoming.
@serene scaffold I would like to know what technologias should dominate to work with Azure, I have the pcap and it is Ml,DL
sorry, but I do not understand your question.
What's a pcap?
@viscid urchin The certificate
@serene scaffold knowing python and AI
Can I get a job related to Azure?
what do you think Azure is?
Ok
I'm asking you a question.
@karmic pond I recommend you join this: https://hablemospython.dev/
So if I wanted to make my own AI data set it would go like this?:
Folders
Eggplant:
Eggplant_image1
Eggplant_image2
Eggplant_image3
Eggplant_image4```
@serene scaffold Thanks bro
For anyone who uses LLM, how do you currently track your tokens for input and output?
Some people track em' using specific tokenizers for that LLM. Autotokenizer from HF loads the correct and taliorred tokenzier for whatever model you're using
So anyways, after the tokenzier is loaded you can use .encode() method. You give it a your text string (prompt) and it gives you back lost of numbers which are your token IDS . To finally get the actual token count you use len() function on that list.
So basically... There's no active library/app that I can integrate to get the cost of tokens used per user?
Well, not to my knowledge, but I never bothered to search ๐
Alright thanks
I mean theres more than one way todo it thats for sure.
Maybe addtional feedback would be helpful
Wanna see what real feature engineering in machine learning looks like ๐
Did I write this example right?
figsize(8,8),
subplots=True,
layout=(2,2,),
sharey=True,
legend=False,
)
plt.show()
``` in plot method what does mean layout (2,2) ``
(rows, columns) for the layout of subplots.```
Yes according to docs https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
in this example layout=(2,2) what does 2,2 mean it create 4 subplot ??
That is 2 rows and 2 columns of subplots. Which is 4 total, yes
Do you guys know of services that host zero shot classification models and serve them as REST APIs?
zero-shot classification models don't require special considerations. you should be able to deploy it from any cloud VM.
Oh yeah we currently serve the model from an AWS EBS, we were looking to cut costs because we use NVIDIA GPU instances to be able to leverage CUDA but those are too expensive
so the model requires a GPU to run in a reasonable amount of time?
My boss wants the reasonable amount of time to be a bit unreasonable so yeah lol
I'm not aware of a cloud compute service that offers GPUs at a lower price than AWS, but I run all my stuff on my company's own hardware.
I was thinking of ML platforms that offer the models themselves since we use a default huggingface one
what model is it?
valhalla/distilbart-mnli-12-9
Is ONNX runtime something that would make sense here to try and lower GPU costs?
Hi all,
Hope you are all doing well.
I wanted to see if someone could help me out on something that's been cracking my skull for 3 FULL days now in matplotlib/numpy
I want to use spectogram data to plot ontop of the spectogram, the plotting part is easy, but I cant wrap my head around the data.
spectrum, frequencies, time, image = spectogram()
Is there any way I can transform this data to get x, y, z points?
example of a before and after attached
Any help is appreciated.
I know
time = x axis
frequencies = y axis
But how do I transform spectrum, which is a 2d array, into z values in order to decide whether to plot a point there or not?
the 2d array already contains the z values. the plot shows them as color, but the values are numeric
the row and column index of the 2d array correspond to a frequency and a time, so you can then use that to e.g. plt.scatter(time_val, freq_val) whenever the spectrum value exceeds a threshold
Wow, thanks, I cant believe how simple that was.
I think my mistake was I was trying to implement all the logic within a the view_lim of the graph, to only calculate whats visible. And i kept mixing up the arrays.
Thanks so much @wooden sail
Now that I can get the Z data I'll try to clip spectrum to fit into the visible axis x and y limits. If you have any suggestions they are more than welcome.
Else I will post results later on.
Thanks!
i'm not sure i understood the clipping part, can you give an example of what you wanna do?
You know how you can zoom and pan around the graph?
Well I only want to calculate on whats visible in the current image.
I know that with the following I can get the limits of the graph that are currently visible
y0, y1 = self.ax1.get_ylim()
x0, x1 = self.ax1.get_xlim()
That way I can plot the line only on whats visible.
So i would have to find a way to clip spectrum to fit within x0, x1 and y0, y1
hello!
i used a short python script to concatonate a few json files intoa csv and set the Timestamp column as index:
df['ts']= pd.to_datetime(df['ts'])
df=df.set_index(df['ts'])
df=df.drop(columns=['ts'])
df.index = df.index.tz_convert('Europe/Berlin')
df.to_csv('C:\\Privat\\Python_VSC\\Spotify\\MyData_2025\\Data_concat.csv')
ye i importated the new saved Data_concat into a jupyter notebook
unfortunatle the Dtype of the idnex is
dytpe("O")
so i cannot use commands like df.index.hours
Why are you dropping the ts column after making it be the index?
Don't you just want?
df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index('ts')
df.index = df.index.tz_convert('Europe/Berlin')
df.to_csv(r'C:\Privat\Python_VSC\Spotify\MyData_2025\Data_concat.csv')
``` ?
this is waht the ts column looks like in the json itself:
2021-01-06T19:04:34Z```
no need, unneccary column if i have it as the index
Hmm, is that how that works? Interesting.
i mean i can use the index as basis for new columsn like Hours weekdays and so on
for x axsises
anyway ye, the json timestamp is UTC
i need it in Berlin Time so i converted it in the python script
yet as already mentioned when opening it in jupyter notebook i am loosing the dtype of that index
Well, it's a CSV, everything starts out as a string, right? How are you asking pandas to cast the values on load?
.read_csv
hm
thats true
Sure but you have to pass the dtype argument
(like, a dict mapping column names to types)
but trying to run df.index= pd.to_datetime(df.index) returns the following
ValueError: Array must be all same time zone
and thats true, cause some conversions are made to +2 and +1
i assume because of szummertiem change
anyway i have to analyze the difference and the value_counts of the different +02:00 and +01:00 that exist
probably a string splitter?
Oh that's weird; so it won't let you tz_convert the ISO8601 strings?
yup
but i know it worked before
cause it was an old project ijust picked up again
i mean i casted tz.convert in the pythjon script, and that works
trying to run it in the jupyter just returns me that i cannot use that command on the index
is it a trading bot? just curious. felt like ive dealt with this before
cause it thinks its not a datetiem object
nah, its csv spotify streaming history
ahh ok
my top 1 song is still from kanye west
ims o cooked
307 plays of Stronger
2.3k hours is actually fine for 7 years
a more condensed version is found in #1365417667573583943
gen ai course, video or book?
Hello, I have a little problem, got some embeddings done in clap (512vectors) and want to cluster them using HDBSCAN, I get OOM pretty quick because I've got the embeddings on 50k files. How can I fix this, it's kinda out of my league.
Tried some LLM answers was:
use k-NN to build a sparse distance matrix and metric='precomputed'
Dimensionality Reduction with PCA
pca is compression technique
dont sure but can stem or lemmatize words
to make fewer or have shorter words
also chunk files to not process at once 50k
I mean split
but for clustering specifically I have to run them all at once for what I know
ah ok
maybe im mistaken and maybe there is a way to split them on disk, but honestly I dont have that much expertise so im looking for a solution
and Id like to avoid taking smaller samples because then ill have to reorder all the files I have in the correct clusters
I know you say you want to use HDBSCAN, but this is incremental and might be worth a look? https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
You could start with that and refine it with HDBSCAN maybe?
Gallery examples: Compare BIRCH and MiniBatchKMeans Comparing different clustering algorithms on toy datasets
I see a paper about streaming DBSCAN but I don't, sadly, understand it yet.
I don't want to necessarily use HDBSCAN but I was trying with DBSCAN initially and moved to HDBSCAN which was the first that worked, I kind of want some granularity between clusters where for example it can detect if something is a car or a chainsaw, it should be distinct (the embeddings are made from audio files)
Ill take a look
hey guys
do you suggest tensorflow over pytorch for machine learning
general purpose machine learning
no, over the last few years pytorch has greatly overtaken tensorflow in popularity
for some things you don't need of either of them though, just sklearn could be enough if you don't need of neural networks
At the moment I donโt need complex neural networks but Iโm thinking of taking an advanced machine learning class and you know, better be prepared
from flask import Flask, render_template, request, jsonify
import requests
app = Flask(name)
Replace this with your Hugging Face token
HUGGINGFACE_API_KEY = "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"
Inference API URL
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceH4/zephyr-7b-beta"
headers = {
"Authorization": f"Bearer {HUGGINGFACE_API_KEY}",
"Content-Type": "application/json"
}
@app.route('/')
def home():
return render_template("index.html")
@app.route("/api/chat", methods=["POST"])
def chat():
try:
user_message = request.json.get("message")
print("User Message:", user_message)
prompt = f"You are a helpful medical assistant named Oxy. {user_message}"
# Generate a reply using a local model pipeline (if installed)
# Or you can use a simple hardcoded reply for now
# Here's a dummy response for testing:
response_text = ""
if "fever" in user_message.lower():
response_text = "It sounds like you have a fever. Stay hydrated, rest, and monitor your temperature regularly."
elif "cold" in user_message.lower():
response_text = "Symptoms of a cold include a runny nose, sore throat, and mild fatigue. Get rest and drink fluids!"
else:
response_text = "Sorry, I couldn't understand. Try rephrasing your question."
return jsonify({"reply": response_text})
except Exception as e:
import traceback
traceback.print_exc()
return jsonify({"reply": "โ ๏ธ Something went wrong. Please try again."}), 500
Please react with โ
to upload your file(s) to our paste bin, which is more accessible for some users.
i need help why this is not workig
the HUGGINGFACE_API_KEY (and anything with "key" or "secret" in the name overall) is supposed to be kept secret, API keys are used to identify who is making the request, may provide access to confidential information owned by the account that created them, and any operations that have a cost will be billed to whoever owns the API key
Be very careful not to share them.
Ideally shouldn't include it in the code in first place, but rather use environment variables or other ways of managing secrets
(go delete/revoke/regenerate it in your HuggingFace settings ASAP)
I remember seeing something about storing your api keys in an .env file. What exactly is an .env file and what makes them useful?
it's a file with lines of text like:
NAME=value
OTHER_NAME="some other value"
you can source .env on it, and if you pip install envfile it'll load them up for you I think. at least it does in vs code
you can put .env into .gitignore so it doesn't get shared with anyone
But what about the overall usage of an .env file? As in what would you use an env file aside from storing and preventing others from using your API keys?
env means "environment", as in environment variables. that's all it does
Hello everyone, this next weekend Iโm going to have a coding challenge and Iโm going to need to tackle docker, aws s3, lambda and ec2, flask/fast and restapi and pytest. Does anyone have a comprehensive kaggle notebook or GitHub repository link in which I can get some practical experience. Thanks!
there's not going to be a kaggle notebook that covers all of these.
or possibly any, since you don't really use flask, fastapi, or pytest in a notebook.
and docker and aws aren't part of python.
Hello! Anybody here with experience with the Awpy library?
Remember to always ask your whole question so that someone who knows the answer can start answering it. Never ask to ask.
Of course but this is integrated into your work environment framework
Hey guys
How do you find relation between continous values and categorical values?
i checked the distribution of data what are the ways i can see if there is linear or non linear relation between them
pearson coorelation will not work as we need mean as well in it
but we cant find mean of categorical values
its not used for coorelation I think
Have you tried an ANOVA
annova has assumptions that data is normal distributed
Normalize it then
GMP is your target right?
and TIR!,WV,MIR are continous vars
yes
so i plotted the distribution of each TIr1,wv,mir whenever flag is 0 ,1,2
Have you done modelling yet
no I am performing eda
trying to find if there is any relation between these vars with GPM Flag
after this I will move to Modelling
Itโs used for not normal distributions
yes I checked it
How were the results
doing it now
I was watching a vid about it
still watching I have to watch mann witney test before this
I will update you after going through them
plot all 3 in one plot
use histtype='step'
https://matplotlib.org/stable/gallery/statistics/histogram_histtypes.html
okie just a min
so it looks like this, so you can compare https://stackoverflow.com/questions/26691836/multiple-step-histograms-in-matplotlib
another alternative is by adjusting alpha. this is a matter of preference
depending on the number of sample, if you only have 3 continous variable, you can also plot 3 scatters, each scatter, will plot 2
(Actually seaborn pairplot is the way to go haha) gives you everything you need
for categorical values?
I bin the data based on the categories and then I create plot Like I am doing i nthese
yes, like this
sns.pairplot(penguins, hue="species")
no, I mean the other way around.
for each figure, you plot the same variable e.g. TIR1
but you plot 3 different histogram for differeng GPM_Flag
Oh ok got it
so we will find difference
I excluded Flag = 0 as it represents no rain
and its in dominance and because of that flag 1 and flag 2 were not visible properly
desnity=True
coz you have imbalance data
and you can plot all 3 if you use density=True
it feels so weird to do back seat EDA but also fun
lol
Why we are not considering frequency ?
well, we are trying to compare the distribution no?
like if they have different mean/std / peak etc2?
if they have different shape?
yeap
yes we can say that I want to see whats the relation between them
You can say that flag0 causes TIR1_Temp to go down?
increasing flag increases the skew of WV_Temp?
sorry yes, lol
i.e. negative skewness
yes
maybe im blind lol, or just preference, but I prefer to look at these 3 plots, instead of that grid of 9.
its easier to comapre when its three in one
do you mean statistical test, or just visualisation?
just visulisation
I mean, is this good enough, or you want more?
Good enough
it looks like Tir1 will be the major predictor
for flag
Im sending the pairplot
is it ok If i send it in some time? I have to go to mess for dinner or it will close
lol hahaha, you don't have to send it to me, it is for yourself to dechiper
enjoy dinner, I have to sleep too
Haha Thank you
hahaha, you need to ajdust alpha, maybe use KDE, and turn on density
Might be easier to redo it in pyplot manualy
but this might give you interesting insight, and I hope you get the gist
hello ๐
had a question, how to approach multivariate time series anomaly detection with or without ML? Actually the inputs are signals in a steel manufacturing plant -- so signals of equipment at latter stage correspond to the same metal from earlier region.
There's data drift as well i.e. whenever the plant turns on after a config change, value range of many signals change. We have over 100 signals corresponding to many regions
would really appreciate some guidance and thoughts!
oh yes i used kde but didnt turned on density I will go that got it I got the idea now
you can start with checking the trends of signals over a period of time
@gritty vessel yes we do this for some signals with rule based methods.
How do I create and train my own AI Model in scratch for a voice recognition with text translation using PyTorch?
https://discord.com/channels/267624335836053506/1365761594411450378 hey guys can you discuss with me what I can do on this
you do ml with python or c++?
python
dont be biased
somearray.dtype
i remembered type(smth) exists
in normal python
can someone recommend a course/video series/webpage/book to learn about some matplotlib fundamentels and how i responsible setup plots with stuff like .subplot, .ax and that stuff.... and especially when and when to use the fig,ax setup method
for a liscence plate recognizer would using easyocr be considered cheating
Depends on the "rules"; it's certainly not cheating if you just need to get it done
my goal is to learn as much as possible, and for that doing it from scratch would probably be best but also it would be super inefficent
so i think i'll stick to just doing implementations from scratch and using some libraries for projects
but it seems like easyocr takes off a little too much of the work
easyocr turns it into a text classification problem vs. an image recognition one, yeah
i see
my reason for doing this project is to use both
so the end goal is for it to be able to recognize the location of the liscence plate on an image and then print the plate number elsewhere
though i have a lot to learn before i can do this if i want to truly understand the concepts behind it
I'm building a KNN using sklearn and using cross-validation to find the best value for k.
Is it overkill to test different numbers of folds and pick the most common k across them?
if you're doing k nearest neighbors and doing m fold cross validation, you can try several combinations of k and m, yes
Yes forgot to mention, different k values as well
(it's usually called k fold cross validation, but we're already using k for k nearest neighbors, so I picked a different letter)
pick a different letter for each variable
nvm it was implied in my first question
but anyways, thanks
I did it "manually" with the help of GridSearchCV from sklearn because I want to plot the results for each fold / k later
sure
getting really wild fluctuations while testing with the valid set, but the train set seems to be doing fine... anybody know why this might be?
increasing batch size seems to help a little bit, so i guess im over fitting? it's a binary classifier and my average validation epoch accuracy is hovering at around 50-60% (so it looks like pretty much random classification)
i dont want to get into web development i want to becoma a data analyst or fo in the field of fintech, ai ml what should i do i know some basic python and currently learning flask and web scraping etc
i m a student by the way just cleared of high school
print(my_array)
print(my_array.dtype)
my_series = pd.Series([1,np.nan,2,3])
print(my_series)```
what is difference numpy data type and pandas numpy and pandas also same handle missing value
they're basically the same, pandas uses numpy
Hello everyone!
Can anybody suggest some good resources to learn AIML? I am already good with basic data structures and have done some web scraping, so I think I am eligible to dive directly into concepts.
Oh... This chat seems to be really dead
Did you check the pinned message?
Oh sorry forgot to mention that. And yes I did check the pinned messages, but I am still confused to find resources from https://jgreenemi.github.io/MLPleaseHelp/
It doesn't look that confusing. Its literally a resource list
guys is there any good course available on statistics ? for ml and dl enthusiasts
Have you looked at the pinned messages?
Sorry but, there are 104 of them... And I can't seem to decide the best one for my cause.
ohhh will check.thanks
Try the first one
Ok
pretty much
same same. under the hood theyโre implemented the same way
numpy has been the backend for pandas, but that's largely an implementation detail.
so it looks like I must sample github for code?
because massive amounts of data is beyond my storage
looks like biggest roadblock is data collection when I have data I just use some llm model train on data and predict as usual
Data:
AI code generation models are trained on massive datasets of public code, including open-source projects.
maybe there is some code dataset?
it will be quicker than collecting
just want to see how it works assuming I have already data
I assume its not like
prompt: write hello
code: print("hello")
result: print("hello")
because it looks like pairs or rule based
check kaggle or huggingface
ok
natural language generation is disntincly not rule based
How likely is that two distinct k values perform exactly the same in a knn algorithm ?
Precision scores (and acc) are exactly the same (15+ floating point precision)
its possible, but exceedingly unlikely
basically unless you have an incredibly noisy dataset or a very easily separable data set, it probably won't happen
ุงุฑุญุจ
can it be caused by a mistake on my side ?
I am using sklearn, the algorithm is not my own
looks like bad param/params
likely, yeah
shi
compare your to some example knn
might be your dataset as well
I got a really quick question
Is it worth it to get the PCPP-32-1?
I got pcep and pcap and i feel like pcpp practices arent that important
like not scaled, normalized?
I mean, mine is pretty straighfarward as well
def feed_data(self, train_data):
self.X = train_data.drop(self.response_column, axis=1)
self.y = train_data[self.response_column]
categorical_cols = self.X.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_cols = self.X.select_dtypes(include=["int64", "float64"]).columns.tolist()
self.preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), numeric_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
]
)
self.X_train, self.X_valid, self.y_train, self.y_valid = train_test_split(
self.X, self.y, test_size=self.test_size, random_state=self.random_state, stratify=self.y
)
def fit(self):
if self.best_k is None:
raise ValueError("k value has not been set yet. Call find_best_k() or define it manually when creating the KNN")
if self.X is None:
raise ValueError("Training data has not been fed to network yet. Call feed_data() first.")
self.final_model = Pipeline(
[
("preprocessor", self.preprocessor),
("classifier", KNeighborsClassifier(n_neighbors=self.best_k)),
]
)
self.final_model.fit(self.X_train, self.y_train)
validation_predictions = self.final_model.predict(self.X_valid)
validation_accuracy = accuracy_score(self.y_valid, validation_predictions)
validation_precision = precision_score(self.y_valid, validation_predictions, average="macro", zero_division=0) # type: ignore
self.validation_metrics = {
"Accuracy": validation_accuracy,
"Precision": validation_precision,
}
precision_yes = precision_score(self.y_valid, validation_predictions, pos_label="yes", zero_division=0) # type: ignore
precision_no = precision_score(self.y_valid, validation_predictions, pos_label="no", zero_division=0) # type: ignore
accuracy_yes = (self.y_valid[self.y_valid == "yes"] == validation_predictions[self.y_valid == "yes"]).mean() # type: ignore
accuracy_no = (self.y_valid[self.y_valid == "no"] == validation_predictions[self.y_valid == "no"]).mean() # type: ignore
self.final_model.fit(self.X, self.y)
return self
ok so you have standardscaler
I don't remember why I added it, I think I was getting perfect results without it
lemme check something
also shuffle your set for training
yes to avoid model of remembering
actually idk if that impact will be huge if ur tanking your weights between each run
definitely not
those certifications and stuff are pretty much entirely meaningless
removing StandardScaler my acc and pre goes down, also my network seems to prefer the smallest k value
but scaling is best practice
(I am testing different number of folds and k's)
reduce your model size?
that might help a bit
wdym ?
also it might just be your dataset
seems like youโre under fitting
This is with StandardScaler(), seems better I think but its weird that 3 and 7 have the exact same performance idk
First time tackling with this stuff so yea
wdym by "size" though
like reduce dimensionality
linear discriminate analysis or whatever
what this is about classifier of what?
you really just gotta tune all your parameters until it works out
and ofc it might be your dataset
so looks like noisy dataset and or need still some preprocessing
yea, its binary one, just yes / no
maybe noisy, or also could be wayyyy too well separated
for k = 3 the results are good no ?
just weird but idk if I want to bother more with it
pretty crazy results
why ?
theyโre good!
ah aright
overfitting?
feel like pushing for more accuracy would beโฆ pushing it
Exactly the same with k = 7 but ill leave it like that
oh possibly
but usually if overfitting youโd see wild fluctuations in the validation
and thereโd be a huge gap between accuracy during train and test
Damn, I gotta learn so much yet
My syllabus in AIML diploma hasn't even truly started ig
Lmao
also often for over fitting, test accuracy will be a non negligible percentage higher than train accuracy
i donโt think itโs over fitted
better seen on plot
I am cross validating for multiple num of folds and k values and picking the most common k among them
wouldn't that prevent it
i donโt think ur overfitting
^
idk what spam is
I am trying to predict the if a person is a responder on a campaign based on some past data
ok maybe now it will be easier to help
64 female urban free never 1.00 1.00 0.00 0.00 0.00 no
This is what the data looks like
first number is age and the rest are logins the last 4 weeks, 6 months, purchases in last 4 weeks, 6 months and total purchases
they arent between 0-1 it just happend to be to this example i copied
and last is no/yes i see
I don't even know if there is something wrong, the only odd thing is the similarity between 3 and 7 k values
thats all
I suggest to compare your code with some tutorial knn but watch out there will be different problem and dataset, but many things have in common
This article covers how and when to use k-nearest neighbors classification with scikit-learn. Focusing on concepts, workflow, and examples. We also cover distance metrics and how to select the best value for k using cross-validation.
aright ill see what I can do
(pic from website)
Seems like the acc for k = [10, 14] is the same
so I guess is an "ok" phenomenon
yes but be a bit wary about these kinds of comparisons
Ill try to copy his code see what i get
different model, different data, so accuracy could mean different things
i saw weird jumps
I think i did an opsie
perhaps
wait iโm not seeing where the problem is
fluctuations are p normal
just not huge percentages in fluctuations
How do I implement a stop so the network doesn't over fit to the data
if you're doing the training in a loop, you can use an if statement to decide if the change in loss has flatlined, and break from the loop.
Thank you
self.final_model.fit(self.X_train, self.y_train)
...
self.final_model.fit(self.X, self.y)
is it ok?
not just one .fit?
just find the derivative of the accuracy graph or the loss graph and see if itโs close to 0
another way to check is to just see if your epochs are seeing any improvements
most of the data processing or ai/ml related libararies are implemented with C under the hood. python is a wrapper, and the overhead is not worth talking about
basically hot loops go a language that's slow to write and fast to execute, cool ones go in one that's easy to read and fast to write
i write most of my models in rust o_o
hi
how would I do this question? ๐ญ have an exam tmr
would u1 be from bottom left to top right
and u2 perpendicular to that?
both passing through that black circle in the middle
:incoming_envelope: :ok_hand: applied timeout to @rich moth until <t:1745818246:f> (10 minutes) (reason: attachments spam - sent 8 attachments).
The <@&831776746206265384> have been alerted for review.
Oh snap. Plunder why are you spamming the channel ๐ post your plots in a single directory file or something
Are you supposed to draw vectors here ๐
Looks like you just have to draw 2 arrows here u1 and u2 in the direction of the PCA. Not sure about this. What do you think chat
https://drive.google.com/drive/folders/1s9G7Db1IVL1JiPmEX4lRTexWLkBcevKA?usp=sharing
Thanks @limber spear . Here yall go
This channel is too fire.
all the data I've been looking at about this makes me wonder something about data's overall structure, almost like its "dimensional constraints". so I made this formula. but in it I have this phase and it seems this phase angle is basically acting like almost some type of trace for that idea.
im calling it like a "structural DNA", but it seems inherent it in all data types based on the complexity measurement tool which gives you ฮฆ(x)
anyone? what do you guys make of this ? honestly
Iโm learning set theory this semester. Worth a look into. It provides a framework to develop a mathematical theory of, get thisโinfinity.
well at first i though just the different curriculm method you used determined by the domain was the key, but then i saw something different in all the data . something about the inherent complexity on data thats based off magnitude and phase paint a realy important picture
can I transfer learn text to code I mean ready made text model which works on code?
because code is some kind of text
so use some text generating model to adapt to code
Text to code. Like text to speech
TTS
I see there are code gen models but text gen models are more
dont know yet
Text to code.
I donโt think even ChatGPT can do that at the moment. Or the trending DeepSeek model tbh
Claude. Pretty much every โcutting edge AIโ today ๐
yes claude can
Explain explain 
for example can generate some snippet of sentiment analysis
Does it have speech to code
dont know
But my point is. Your idea is innovative ๐
I didnt tested it i just got some code from someone one year ago and I thought he wrotes but said its claude output
If big tech steals our ideas know it came from the Python community ๐
hi
https://claude.ai/login?returnTo=%2F%3F
its already there text to code
fine
This is neat
I wrote
write sentiment analysis
got text
then
write code for it and above code as in screenshot
I dont think my idea is innovati e text to code is already
what is the difference between torch.compile and torch.export?
compile is same here as in keras?
oh can also share link i stead of screenshot https://claude.ai/share/e1bbc591-3719-4fa1-898e-970d4fe3a733
still dont know how it works I mean input text output code
I saw somewhere it tokenizes code but what next?
You have to think text to code <-> code to text <-> text to speech <-> speech to text. When YouTube videos or Zoom and Teams meetings are transcribed to text(subtitles), they are slow and 100% not accurate.
When you apply this idea to applications of text to code, there is nothing that comes to mind that is cutting edge โstableโ even with sentiment analysis
Mind you these are โproduction gradeโ products ๐
Ok I have to backtrack. People worked hard on these products.
so this is scam which need developer to fix ๐ ?
I dont trust these tools I just want to learn
I dont care about it I just want to make money if its legal
personally for me I dont use it
I have mindset just do dont care
why I dont trust because code on which is trained is not shared
its like lets train model on leetcode
but better to train it on gh repos but still different people have different style of writing code
for me its little controversial
and also I dont trust text generation I dont just see proof it works
if it makes mistakes or make mistakes sometimes
the tokenized output is fed thru a transformer and then passed through a dense layer. the output is then decoded to plain text
ok explained thanks
i suppose stacked LSTMs could also work but itโd be slow and very inaccurate
you need the attention layers of transformers
youโd use different methodologies for TTS and code generation
so just look at process of generating text and generalize it to code?
well basically
different but similar
thatโs a pretty big oversimplification tho
just implementation differs
well no not really 0_o
wait no iโm stupid
i though you said speech to text ๐
I was about to catch some Zzzโs ๐
The only difference between what some of us do in this space is 2 words, custom and proprietary. That is how I look at it
no speech to text no
Im talking about text generation and text to code
and similarities about it which I didnt talk
It can get very complex very fast mq. For example have you looked into abstract syntax trees
claude gave me lstm based instead of transformer text generation
for text to code I reached limit :sweatsmile:
on nlp course I had sth about formal grammars etc
Ah ast right because it parses code
Data science is a very new field, but in my opinion the foundation of it also includes foundations in computer science
yes better explored with context of it
My hopes are that a lot in this field study rigorously. There is a lot to explore ๐ซก
yes for example feeling semantic web but why i need it and then you meet topic of nlp where you see usage of semantic web and ontology
yes btw wordnet is for text is there sth for code?
Maybe lspโs
data science has been around forever
wait this is actually pretty smart
you can feed your context as an AST and have the model predict the next node. then, deterministically convert the ast to the language in question
lsp as in language server or other meaning?
yes I think it was about language server protocol
which provides language intelligence tools
of course server provides not lsp
maybe lsp's
that sounds good.
why are you not sure?
LSPโs drive every programming language
This is the backend of every programming language
You can dig further into compilers but that goes more to foundations in computer science
Computer science is a more robust field. I know because I dig into kernel code lol
What yโall cooking chat
Iโm handwriting a cart decision tree build this week
This is probably where you want to research. Language intelligence tools
Iโm not sure what do you think chat. Where are the tokenizers 
LLVM is the backend for many programming languages
Iโm well aware of llvm. Chris is in my LinkedIn network lol
all lsp does is interface with the IDE to provide syntax highlighting and completions and so on
nothing to do with the backend of compilers
about to make some python code generation based on some gh repos, some prototype
?
shouldnโt really matter. just use bert_cased or smth
might be useful to train a tokenizer on programming languages tho
Letโs help make mq the next Steve Jobs chat
get rid of unemployment would be enough ๐
And ethqnol his Steve Wozniak ๐
I can be some rando founding dev 
โNo one cares about him. He is just a founding employeeโ
so frontend of app is done with streamlit or flask?
instead of just showing in console which is not user friendly
Hello, i tried to implement a W-GAN but ran into a stubborn problem. The loss for both the generator and the critic starts at 0, and slowly the generators loss rise to 1.5 and the critics loss falls to -2.8, and after that the losses stay very close to those values. I tried everything but couldnt get it fixed. Here is the full code: import t...
please help me with this im desperate i tried everything ๐ข๐ข๐ข
try decreasing batch size
ah thanks. I havent seen a question like this before thats why I wasnt sure
and maybe tune up lr a bit
this is hugging face specific train and evaluate instead of fit and predict?
trainer.train()
I suppose. I think that with trainer objects, you specify the training and test data in advance, whereas with sklearn style models, you pass some or all of the training data (and not the test data) to fit.
correct me if iโm wrong but learn/train is usually associated with supervised ml, while fit is more apt for unsupervised
Terminology is not used consistently in the Ai/ML world. sklearn used fit and predict for everything, and the way they use terminology is pretty influential in the python ecosystem.
sklearn is an unsupervised ml library no?
No, it has lots of supervised models
what is KNN?
I'm guessing K Nearest Neighbors in this context
yes, K Nearest Neighbors is a machine learning algorithm used for classification and regression
Is this a safe-looking tutorial to send an ML learner re: perceptrons? It looks fine to me at first glance, but maybe someone knows of a nicer one? Buddy of mine is asking for a good reference.
https://medium.com/@becaye-balde/perceptron-building-it-from-scratch-in-python-15716806ef64
Maybe this one is better because it summarizes the back-story more? https://pyimagesearch.com/2021/05/06/implementing-the-perceptron-neural-network-with-python/
Why do they want to learn about perceptron? It's very niche and not directly relevant, even if you want to move over to neural nets later
I haven't exactly asked, but my guess is just that it's not that huge, plausible to code using just basic tools, etc?
They are sorta "my first supervised learning", right?
I may just be out of date though, hence the question I guess.
It's fairly straightforward but I'd recommend just covering linear and logistic regression
I understand how trees work but could anyone explain to me how models which use trees like random forests learn the binary operators at each node? Is there some process which is analogous to back propagation?
There isn't any part of random forests that is similar to backpropogation.
with a random forest, you use all the decision trees that you made, and take all their predictions. you can either use the most frequent prediction as the system prediction, or weigh the prediction of each tree differently, or whatever you want.
just tel him to do mnist or smth
hi is anyone here familiar with google colab? having trouble with enabling t4 gpu runtime, need some help if possible. doing unsupervised learning here and my code seems stuck, found out that Im running with cpu so i changed to gpu t4 and theres a warning that says im not utilizing gpu
You did Runtime > Change runtime type. Did you reset the notebook (namely the kernel?)
yes I changed to gpu t4 and i got this message
I tried googling it and someone mentioned i shud install pytorch and fastai to run it, when i did this came out instead
What library is your code using? Torch?
Oh, if you've maxed out your free GPU limit, you'll have to pay or wait.
Please always always share code as text
!code
cant I use my own gpu to run it? instead of using google colab's
@opaque sphinx please permanently remember this ^^^^
alright sorry, new here so forgot the rules here, very sorry
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import (
classification_report,
precision_score,
recall_score,
f1_score,
make_scorer
)
from sklearn.inspection import permutation_importance
warnings.filterwarnings('ignore')
When you execute code on google colab, it's running on their server, not your computer.
What kind of GPU do you have?
abit of context is I am trying to run to detect anomalies in a prescription dataset to detect errors, so i been running isolation forest, SVM and localoutlier factor, but the recall is too low, hence ive been doing tuning and using SMOTE for oversampling, but bcs of this I cant do so
rtx3070
I also have a macbook M3 pro
oof so is there no way to run it now on gpu unless i pay the 9.99$ a month?
been doing some reading online saying I need to run the model on pytorch, the documentation says i need to run tensorflow, but when I did the error came out again
If they told you that you used up your free compute, they're not kidding. I think it resets every day
Pytorch and tensorflow are for neural networks. You use one or the other
And it should be pytorch.
What did you actually do to "run it on tensorflow"? Just importing tensorflow has no effect.
I havent tried tensorflow, I chose pytorch but now I cant even change the runtime to gpu, it just shows this error
Don't try tensorflow if you haven't already.
ok i wont
!ltt install torch torchvision >> /.tmp
!pip install fastai --upgrade >> /.tmp
import torch
assert torch.cuda.is_available(), "GPU not available"
i ran these, but the output is also gpu not available, then i try change the runtime, same error occurs
You might not have installed the version of torch that has the cuda driver
lemme google how to install cuda into google colab
You don't install it into google colab.
Well if colab isn't letting you use the GPU, it doesn't matter
It will always say that cuda isn't available until it lets you use the GPU again
Check out that 'start locally' page, it has a thing at the top where you can click on the versions you want and it will show you the install command.
ah gotcha, only when the google colab resets again only then i can start doing
Yes, or you can move all this to your computer
ok sir i will check it out thanks @limpid dew too
Try to run the code in a .py file locally on your pc (not using colab)
idk man Keras is pretty darn good
also itโs so speedy
the only issue i have with tensorflow is that tf code is more difficult to understand
people also say itโs less โpythonicโ which is not something i really care about but if you do, wellโฆ thatโs something to consider
My understanding as someone who only uses pytorch, is that keras is more user friendly but pytorch gives the user more control and is therefore better for research applications.
i think pytorch is actually more userfriendly
thats just a personal opinion tho. ultimately i think the difference is minimal enough to not be worth talking about.
People like tensorflow better for production right?
TF is supposed to be super versatile in a production environment, and pytorch has better tools for regular users
tbh i dont think it actually matters which you use. if you learn one, learning the other is trivial. personally i use pytorch tho
That seems right. I doubt there's much you can do with one library which you CAN'T do in the other.
I've never heard of anyone use tensorflow outside of academia
yo isn't that against the server's terms/conditions
!cleanban 1266449020306587688 Asking for jobs after being told not to.
:incoming_envelope: :ok_hand: applied ban to @spiral locust permanently.
i heard currently pandas and bs4 etc are outdated and replaced by more powerful libraries like polar, crawl4ai etc so anyone who knows or is a data analyst can you pls guide me on which libraries to learn as well as i am completely open to learn under someone
please only ask your question in one place. you can cross-post your question in relevant channels, but everything should point back to one place to get the answer.
ok
[component_container-1] 2025-04-29 14:52:42.409273636 [W:onnxruntime:, graph.cc:1348 Graph] Initializer onnx::Conv_2881 appears in graph inputs and will not be treated as constant value/weight. This may prevent some of the graph optimizations, like const folding. Move it out of graph inputs if there is no need to override it, by either re-generating the model with latest exporter/converter or with the tool onnxruntime/tools/python/remove_initializer_from_input.py.
[component_container-1] 2025-04-29 14:52:42.461690793 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
[component_container-1] 2025-04-29 14:52:42.462035032 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 12 Memcpy nodes are added to the graph sub_graph4 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
[component_container-1] 2025-04-29 14:52:42.464001375 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
[component_container-1] 2025-04-29 14:52:42.464005845 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
has anyone ever used ONNX for exporting models? Im not sure how to solve this
import onnx
import torch
from perception.src.mask_rcnn_model import MaskRcnnModel
# Function to Convert to ONNX
def Convert_ONNX(model, target_path):
# set the model to inference mode
model.eval()
target_height = 800
target_width = 1100
dummy_input = torch.randn(1, 4, target_height, target_width, requires_grad=True)
dummy_input = dummy_input.cuda()
# Export the model
torch.onnx.export(
model, # model being run
dummy_input, # model input (or a tuple for multiple inputs)
target_path, # where to save the model
export_params=True, # store the trained parameter weights inside the model file
opset_version=12, # the ONNX version to export the model to
do_constant_folding=True, # whether to execute constant folding for optimization
input_names=["modelInput"], # the model's input names
output_names=["modelOutput"], # the model's output names
dynamic_axes={
"modelInput": {0: "batch_size"}, # variable length axes
"modelOutput": {0: "batch_size"},
},
keep_initializers_as_inputs=True,
)
if __name__ == "__main__":
print(torch.cuda.is_available())
maskrcnn_path = "maskrcnn_rgbd_2025-01-28_epoch_526.pth"
maskrcnn_model = MaskRcnnModel(maskrcnn_path, 0)._model
target_path = "MaskRCNNModel.onnx"
Convert_ONNX(maskrcnn_model, target_path)
onnx_model = onnx.load(target_path)
onnx.checker.check_model(onnx_model)
print(onnx.helper.printable_graph(onnx_model.graph))
Hello. I didn't know where else to ask this. I have experience as a game dev and a dell bhoomi dev. I am trying to switch career to data science. But idk where to start. Can anyone provide me a roadmap? Like what learn in a format? And where to learn?
the docker image i created for my document assistant RAG tool is as big as the virtual env i created for it ~ 10 gb - i tried multi stage build, minimal requirements.txt etc but nothing is reducing it's size
is it ok to deploy if the image is 10 gb
changing from keep_initializers_as_inputs=True to False solves all warnings regarding This may prevent some of the graph optimizations, like const folding
but I still havent figured out how to solve these warnings
[component_container-1] 2025-04-29 17:14:53.086686347 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 2 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
[component_container-1] 2025-04-29 17:14:53.087215064 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 14 Memcpy nodes are added to the graph sub_graph4 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
[component_container-1] 2025-04-29 17:14:53.089272371 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
[component_container-1] 2025-04-29 17:14:53.089276591 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
so youre sure there isnt anything wrong with the model or the loss calculation
?
I used netron to visualize the graph, maybe I cannot find which nodes go wrong and just assume it won't affect the performance and ignore it
Anyone?
https://claude.ai/share/7a73e89a-54c8-49eb-8f2a-4347d484e901
right for code generation code is long
I see some usage of tree sitter
temperature is related to simulated annealing or its in different context?
generally temperature is for controlling randomness as from description of open ai
but honestly im confused
Hey guys. I created a kind of street fighter game using pygame and i want to have a ai model for my oponnent. Can someone give some insight on how to do it with reinforcement ml? I know just a little about ml. Not much.
doesnt look like it.
you could aslo try switching to gradient penalty instead
and if you're getting a bunch of fluctuations, maybe add a layer norm or batch norm somewhere in the middle
and also like tune your parameters, like maybe try doubling n_critic
check the pinned messages
Hey folks looking for advise on preferred UI
For years, I have been using anaconda spyder. I love the setup of that program. However, I need to find an alternative that has GIT repo integrations.
Which do you guys love
- I hate Jupiter notebooks
- Needs to have 3 windows at least- code, output (for print data check) , variables (click df and see the frame etc).
- Has highlight code and run options not run whole file from command line
maybe take a look at marimo
pretty sure that spyder should have a git integration though?
I use pycharm, which does all the things you mentioned wanting.
correct
I use vim, it has these things (except not clicking, keyboard based) as plugins (e.g. https://github.com/luk400/vim-jukit ).
Variable value check is included here in it being the Python REPL, so you can just type the variable name.
I dont like light theme of jupyter notebook, what else dark or other theme?
too green text boring background
I also discovered I can use jupyter notebook inside vs code
pytorch's docs actually have a tutorial about it, have you read through this yet? https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
If you have I'm guessing you want something more fundamental?
understanding which aspect of it?
I drew this with matplotlib like so. While it is kind of fine, is there some library that would be better suited to this kind of drawing?
fig, ax = plt.subplots()
for a, b in relevant_edges:
ax.plot(*zip(a, b), color='grey')
for a, b in all_edges:
ax.plot(*zip(a, b), color='blue')
for a, b in path:
ax.plot(*zip(a, b), color='green')
ax.set_aspect('equal', 'box')
ax.axis('equal')
ax.scatter(*zip(*all_pts))
ax.scatter(*start, color='red')
ax.scatter(*end, color='green')
plt.show()
maybe something like networkx? https://networkx.org/documentation/latest/tutorial.html
the biggest reason to go with networkx is usually that it can lay out the graph nodes comfortably, but here, I know the exact positions of each node.
There's Plotly https://plotly.com/python/
Something like...
import plotly.graph_objects as go
figure = go.Figure()
for a, b in relevant_edges:
figure.add_trace(go.Scatter(x=[a[0], b[0]], y=[a[1], b[1]],
mode='lines', line=dict(color='grey'), showlegend=False))
# draw the rest of the stuff
# ..draw the points etc
x_pts, y_pts = zip(*all_pts)
figure.add_trace(go.Scatter(x=x_pts, y=y_pts, mode='markers',
marker=dict(color='black'), showlegend=False))
# then some kind of figure.update_layout(...) call
# and finally figure.show()
at least that's what I get from their docs, take it with a grain of salt.
Plotly's
ooh actually maybe seaborn would be the slickest here?
It's based on matplotlib but has cooler tricks https://github.com/mwaskom/seaborn
It's got like..
x_pts, y_pts = zip(*all_pts)
seaborn.scatterplot(x=x_pts, y=y_pts, s=50)
I've never used it but I've heard it mentioned a number of times.
Plotly doesn't really seem less-verbose than your original code, it's just different.
Hah, I had to look it up, and apparently it's an obscure joke relating to https://en.wikipedia.org/wiki/Sam_Seaborn
Samuel Norman Seaborn is a fictional character played by Rob Lowe on the television serial drama The West Wing. From the beginning of the series in 1999 until the middle of the fourth season in 2003, he is deputy White House Communications Director in the administration of President Josiah Bartlet played by Martin Sheen. The character departed f...
Hence the common import alias of import seaborn as sns :\
so similar to python is not from snake but from monty python
Hi, I'm new to Python and want to start AI/ML, but I don't know how to get started. Please help me with some recommended courses and tutorials.
Check the pinned message
what kinds of things are graph neural networks used for