#data-science-and-ml
and the idea is that, since the readings are just values of E with some small randomization applied, i hoped to make a NN that can predict values of E somewhat accurately
(i tried expanding the generated readings range, but the same problem remains, as soon as i go out of the training range, error rates just start gradually increasing)
that's generally not how neural networks work, sadly 😛 extrapolation and generalization are large challenges, and the more you black box a problem, the more difficult it is to handle those things
so yeah, getting worse estimates of E the farther you move from the training data sounds just about right
unless you know ahead of time of a function that closely approximates the desired behavior, and you estimate its parameters
here, if you know for a fact your stuff behaves as some sort of power series, you can try to set up a model something like ax^-3 + bx^-2 + ... cx^3, and try to find all the coefficients
if you really in practice expect to know nothing about E though, there's no easy solution 😛
🤣 yeah, that makes sense, and if i think about it from another side, the value of the real function keeps getting closer to zero, so, i guess at some point there will be precision errors to deal with xD
on top of all the other hardships you mentioned
can you give me a reference how to do this solution?
ax^-3 + bx^-2 + ... cx^3
i am honestly still learning about ML, and i guess i could just try out this as another solution?
you could try with gradient descent
same as with ML, but with a well-motivated model instead of an arbitrary black box
using deep learning doesn't always make sense
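A minimal sketch of the "well-motivated model + gradient descent" idea, assuming a toy target with only two inverse-power terms. The helper names `fit_power_series` and `predict`, the chosen powers, the learning rate and the step count are all illustrative assumptions, not anything from the original problem:

```python
def fit_power_series(xs, ys, powers, lr=0.5, steps=10_000):
    """Fit y ~ sum(c_p * x**p for p in powers) by gradient descent on the MSE."""
    n = len(xs)
    # Precompute the features x**p once; the model is linear in the coefficients.
    feats = [[x ** p for p in powers] for x in xs]
    coeffs = [0.0] * len(powers)
    for _ in range(steps):
        grads = [0.0] * len(powers)
        for f, y in zip(feats, ys):
            err = sum(c * fi for c, fi in zip(coeffs, f)) - y
            for j, fi in enumerate(f):
                grads[j] += 2.0 * err * fi / n  # d(MSE)/d(c_j)
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs


def predict(coeffs, powers, x):
    return sum(c * x ** p for c, p in zip(coeffs, powers))


# Hypothetical "true" law: E(x) = 3/x^2 - 2/x, sampled on the training range [1, 2].
powers = (-2, -1)
xs = [1.0 + 0.05 * i for i in range(21)]
ys = [3.0 * x ** -2 - 2.0 * x ** -1 for x in xs]

coeffs = fit_power_series(xs, ys, powers)
print(coeffs)                        # close to [3.0, -2.0]
print(predict(coeffs, powers, 5.0))  # close to 3/25 - 2/5 = -0.28, far outside [1, 2]
```

Because the model has the right functional form, the fitted coefficients also give sensible values outside the training range, which is what the black-box network could not do. Since this model is linear in its coefficients, ordinary least squares would find the same minimum directly; with more terms (say x^-3 through x^3) plain gradient descent gets slow because the features are strongly correlated.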
interesting, thanks a lot for all of the tips ❤️
@wooden sail are you really pro in ai?
i just want to ask what is the difference between ml and dl
i would say no. the difference is rather arbitrary, and neither of the two terms is well defined. they're kinda buzz words
i have a little research about ai
machine learning refers very generally to many kinds of optimization problems solved through computing. deep learning refers to doing so with deep neural networks, but what "deep" is isn't well defined
and i was wondering if someone can help me understand what ai really is
it's a made up word. everywhere you look you'll find some different definition
so the main role of it is a part of machine learning?
what it originally referred to was what is now called "artificial general intelligence", which is the ability to solve and interact with a large variety of problems
ai is the broadest of these umbrella terms, it includes too much stuff to list
i think my teacher gave me the hardest things,
you could make a case for very nested if-else clauses being AI 😛
definition, models, materials, what is ann, what is dl, what is ml
i understood the definition and ml
materials and dl i cant figure them out?
dl is a specific kind of ann. "materials" is too generic a word without further context
materials = principal components
so dl is part of ann?
and dl is part of ml?
i don't think making these distinctions is important at all, but sure
no im just trying to understand
this seems about right, but the precise definitions are arbitrary
you could set ANNs between ML and DL
so its like this?
yeah
nice
though technically all deep networks are artificial neural networks
so maybe DL nested in ANN instead
ohh alright
trying to draw such hard lines is a fruitless endeavor imo
ANN is a neural network, which is the basis for DL and most of ML, they are mathematical constructs that mimic the brain's neuron activation. this is the definition?
leads to more confusion rather than helping
i would say ANN being interpreted that way is also a historical artefact and hurts you more than it helps
ai is confusing
that's not how the brain works anyway
like a lot
guess imma head to chat gpt and copy everything from there 🙂
but thank you anyway
that's a horrible idea too
why?
oh boy. oh well, good luck
u really helped btw
this is the real definition?
of ann?
that's not a definition, but a pictorial representation of a common architecture (not the only possible one)
"deep learning refers to doing so with deep neural networks..."
not-useful-fun-discussion time!
There's a bit of contention there. Many parties assert that the Deep in Deep Learning refers to the learning principle of extracting information (learning) from deep hierarchical representations of data or learning techniques that involve multiple levels of composition, where more "complex" representations can be derived from a composition of simpler representations.
Per this view, deep learning is not necessarily restricted to neural networks. Any machine learning technique that uses the principle of deep hierarchies might qualify, and need not be neurally inspired.
For example decision trees, Gradient boosting, random forests can be deep, and not be neurally inspired, while taking advantage of the deep learning principles of hierarchies of compositions.
That said, it is also argued that per the Universal Approximation Theorem all these other methods (decision trees, grad boost etc.) can also be formulated as forms of artificial neural nets. and so are subsumed under that category.
Which would mean that deep learning is after all restricted to neural networks xD
Also, these points are not very helpful. For all practical purposes in industry and most of academia, today, deep learning is taken to mean learning with artificial neural networks.
Edd is right on all of their points, I just wanted this discussion here and to see if anyone would like to add something 👀
i would just add from my side that i generally think of a "layer" as any function to be freely composed with others 😛
I didn't get the exact context :o
"Any machine learning technique that uses the principle of deep hierarchies might qualify, and need not be neurally inspired."
just saying i also had something somewhat more general in mind when i was talking about deep learning above
ohh yes yes makes sense. I just wanted to put this out here and see what interesting things people come up with xd
do you think there's much merit to trying to cleanly sort things out into categories?
just from your personal standpoint
Not immediately.
"Also, these points are not very helpful. For all practical purposes in industry and most of academia, today, deep learning is taken to mean learning with artificial neural networks."
But I do think it's useful in a sense to know the discussion around these contentions, because it might help broaden the horizons of our thinking process / research etc.
I mean if I just thought that deep learning = deep neural networks, I might not stumble upon some interesting ideas that I might with
"deep learning is not necessarily deep neural networks" somewhere in the back of my mind.
I think it adds useful perspective to be aware of the discussion somewhere in there, rather than some standardized definition of clean categories
I'm not sure how clear I was able to make it lol xd
tldr I think the discussion can be valuable, more than an "answer" of what is the "right" categorization
i get the feel, yeah
we stand on the shoulders of giants to advance....maybe learning is high, not deep xD
high-learning
Lol yeah this took me way too long to figure out. Wasn't even aware that the forward passes in the Siamese wrapper were independent till recently, as in like after asking about the training thing
deep learning is just buzzword
I mean term
not as subfield
should be named as other name
hmm and also this
machine learning is subfield of AI but in artificial intelligence there is no learning, rather rules or sth like that
but in machine learning there is no intelligence but learning
AI, machine learning are buzzwords too
can be named as misnomers
but people just accepted as it is
so why machine learning is subfield of AI
this is rather philosophical topic
or maybe its because intelligence contains learning
ok reinforcement learning is like intelligence and learning
but in supervised and unsupervised learning there is no intelligence
and why rl is intelligence and learning because it has intelligent agents, multiagent systems etc
but this is only my opinion
additionally expert systems has intelligence but they not learn
I paraphrase little some book
about intelligence and learning in case of ai and machine learning I found it in some rust projects book from apress
and there he is right I think
in rust projects there is chapter about ai, machine learning
I read only this chapter
and deep learning is because it is related to neural networks with deep layers
so deep neural networks
by machine learning I consider only classic machine learning
maybe shallow networks too
Hahahaha yes and afterward you feel stupid don't you? 🤣 At least I did, idk why but they're a very confusing thing to get for the first time.
I think the authors chose to name it Siamese and make it opaque so their work sounded cooler.
I would suggest you look into the history of the terms Artificial General Intelligence (AGI), Artificial Narrow Intelligence (ANI) and AI
which reinforcement learning algorithm is recommended
I'm making a Google snake AI but plan to make AI for more complex mobile games in the future
hello does anyone here tried to deploy a lambda function aws that need PyTorch? Because my issue is the size limit.
that sort of thing really doesn't work well in lambda functions, try a different deployment method like EC2 or even a different provider like HuggingFace
AI is any program where an agent makes decisions based on some input. ML is any time an agent can get better at a task by itself. AI is a broader category because it includes all of ML as well as things that don't qualify as ML like graph search algorithms.
I need to train an AI model? Finetune GPT3.5, Train opensource models or use GPT4?
I am trying to build an AI assistant app that can tailor its responses to users in different countries. It needs to sound very natural. I first ask some questions to get user data and generate responses about a specific topic for that user's country.
Now I want to train the AI on new conversational data, but when I fine-tune a model, it generates nonsense. I tried fine-tuning GPT-3.5, but it gave inaccurate info (GPT-4 is better). GPT-3 (4096 tokens) can't generate long enough responses, so I plan to call it multiple times (it's context-aware). I want the model to be smart - ask for country info and make decisions. It should output in different formats based on user needs. Each user's needs are different, so the model must analyze and generate tailored responses.
What is the best approach? How can I make a smart AI that knows each country's info and is accurate? I need a model that is smart, accurate, long context and properly trained.
... smart AI ... knows .. info and is accurate ...
Even ignoring all other factors, just that bit itself is already impossible currently.
The state of the art AI gives inaccurate information all of the time ("hallucinations"), the best you can do to avoid it is not using text generation and limiting it to tasks like document retrieval instead of using conversational agents
No, but I found it interesting to see what critics have to say about movies.
"manually"
ELI5: Haw do?
Get the indices of the rows with something like pandas and print them 🙂
Good idea. Thank you
How do I force all rows to show instead of the default "collapsed view"?
https://cdn.discordapp.com/attachments/366673247892275221/1162928327225507950/image.png?ex=653db867&is=652b4367&hm=7a46cf9aaf0d6afcc4cc6e7ce31e8eb92efa7a75377ce9e099b7d866b862e229&
Something ... in here: https://pandas.pydata.org/docs/user_guide/options.html
display.max_colwidth? If it's not that CTRL+F display.
No, that's column width. I want all 1.1 million rows to be displayed.
That way I can hunt for any finicky formatting edgecases.
display.max_rows then?
I'd read that docs page carefully and see if there's anything in there that solves your problem. Odds are that there is but personally I'd need to "interactively" do it myself so it makes most sense that you run it 🙂
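For reference, a small sketch of the `display.max_rows` route discussed above. The 200-row frame and the `review` column are made up for illustration; setting the option to `None` means "no limit":

```python
import pandas as pd

# Stand-in data; the real frame in the question has about 1.1 million rows.
df = pd.DataFrame({"review": [f"row {i}" for i in range(200)]})

# Lift the row limit only inside this block, so the global setting is untouched.
with pd.option_context("display.max_rows", None):
    full_text = repr(df)

print(full_text)

# Alternative: to_string() renders every row regardless of the display options.
also_full = df.to_string()
```

`pd.set_option("display.max_rows", None)` makes the change global instead. With over a million rows the output will be huge, so writing `df.to_string()` to a file and searching that may be more practical than scrolling a console.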
Impostor syndrome strikes again 😂
How about lambda with container image?
hi people
i need some help with fastapi
ig because there are not good resources for fastapi can anyone help
Is there an efficient way to get slices of a series so that I can iterate over s[:1], s[:2], ..., s[:n]?
how do i decide if i should buy one expensive gpu or two cheaper ones? its for pytorch kaggle
First thing to check would be if your current motherboard even supports two GPUs
Even if it has two x16 PCIE slots, some boards simply don't support it
What do you have now?
If you currently have no gpu I suggest you only get a single cheap gpu.
Only once you have tested your model can you evaluate if faster training will make any significant impact on performance metrics, or if there are other factors holding your model back
Is it ok/a good idea for undergrad to dive into the field of speech recognition?
obviously anything that seems interesting to you is your best choice
speech recognition, art style recognition, a writing style recognition system: anything that is interesting to you, anything you care about, is your best bet to get good at it.
i started with password generator in Python
I'm just afraid the topic I'm jumping into is too big for my own good @~@, not to mention it's supposed to be my undergrad final thesis.
I think it's fair game for a final thesis
Plus you should get a better idea in the exploration stage, as to what level of research you want to pursue for this thesis
But yes, speech recognition is an enormous field in itself, and certainly there are areas that would be good for an undergrad thesis
Of course make sure it's something you're genuinely interested in
Well I've been listening to a lot of podcasts from my local podcast service provider, and I was wondering if it's possible to transcribe the podcast audio into text which can then be used as timestamp labels for the podcast discussion. It's a far-fetched idea since I don't actually have many ties with them (the local podcast service provider), but I'm hoping that recreating speaker diarisation to construct a transcription structured in a dialogue format would be a stepping stone toward this goal. I already found quite a few plugins to support developing this project, I just need to tweak them to the data that I have. wish me luck in this endeavour. ^^
why is it such a pain to install pytorch3d
could someone help me install pytorch3d on windows please ?
after installing cuda 11.8, pytorch, I'm trying to install pytorch3d from the source
I have lots of errors
first I had to download ninja, then I had to set up my dev env so pytorch3d finds cl.exe
and now I have a cryptic error : "file not found" in build_ext
hey! sorry about the ping. 1st of all, thank you for suggesting the book ISL, it was way easier to read the book with the videos. i wanna ask about Pattern Recognition and Machine Learning: 1st, is the book worth reading? 2nd, if yes, do you have any idea if i can find video explanations of that book? 3rd, which of the 2 books would be more beneficial, Statistical Rethinking or Pattern Recognition and Machine Learning? and do you have any suggestions for a book for deep learning too? ofc with video explanations, i really need video explanations, they make reading a whole lot easier lol
I never read PRML directly but a lot of our course work was based on it.
I think after reading something like ISL you need to "do" something with it before jumping to the next book.
On the topic of books with videos - I've mentioned it a few times. I don't like videos, they give you a false sense of "hey I understand this" when it's not the case. They're risky business. Reading does that as well, but less. In terms of "understanding" it definitely goes like this: videos => reading => implementation.
If you're going to do PRML I suggest you don't actually watch any videos and you implement (some of) the algorithms.
Finally, statistical rethinking is a good book but it covers a lot of stuff where the cutting edge is only available in R and not Python. Unless you want to go down the PyStan route ofc 🙂
Can anyone help me with easy project guidelines? And is there a way to calculate it for me using the Python language?
hi all, just wondering if anyone has run into /lib/python3.8/multiprocessing/connection.py", in _recv raise EOFError while trying to do inference using a language model?
@kindred isle
The image you sent shows a Python script that is counting the number of records, missing values, and unique values in each column of a Pandas DataFrame. The script uses both the ravel and values properties of the DataFrame for different purposes.
The script uses the ravel property to count the number of missing values in each column. This is because the ravel property returns a flattened view of the underlying data, which makes it easy to count the number of NaN values.
The script uses the values property to count the number of unique values in each column. This is because the values property preserves the original shape of the data, which is necessary for finding the unique values in each column.
Overall, the ravel and values properties are both useful for different purposes. The ravel property is useful for flattening data, while the values property is useful for preserving the original shape of the data
I hope it helps you. by Bard
Is this answer generated by GPT? EDIT: it's by bard It's right there
alright, so statistical rethinking it is. tho i tried reading it .. i got stuck somewhere so i had to ;-; pause the read. would you mind if i ask you about the places where i get stuck?
also what are your views on pattern recognition and machine learning
Task about lakes
Generate 20 random values from 1 to 100
Draw the plot of the sequence
Let's assume each point represents a height, so the whole plot is a 2D mountain range.
Then consider the unlimited rain from above - cavities become lakes full of water.
Determine the deepest lake
The question is, how can I see this zone on the chart? (now on the screenshot it is bright purple, hand-drawn)
How can I do this on a chart?
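One way to read the task, sketched in plain Python: water above each point rises to the lower of the highest peak on its left and the highest peak on its right, and the "deepest lake" is taken to be the largest gap between that water level and the ground. That interpretation, and the helper names `water_levels` and `deepest_lake`, are assumptions rather than part of the task statement:

```python
import random
from itertools import accumulate


def water_levels(heights):
    """Water surface height above each point after unlimited rain."""
    highest_left = list(accumulate(heights, max))
    highest_right = list(accumulate(reversed(heights), max))[::-1]
    # Water rises to the lower of the two enclosing peaks; on a peak it equals the ground.
    return [min(l, r) for l, r in zip(highest_left, highest_right)]


def deepest_lake(heights):
    """Return (depth, index) of the deepest water column."""
    levels = water_levels(heights)
    depths = [level - h for level, h in zip(levels, heights)]
    depth = max(depths)
    return depth, depths.index(depth)


small = [5, 1, 3, 7, 2, 8]
print(water_levels(small))  # [5, 5, 5, 7, 7, 8]
print(deepest_lake(small))  # (5, 4): the dip of height 2 between peaks 7 and 8

heights = [random.randint(1, 100) for _ in range(20)]  # the 20 random values from the task
print(deepest_lake(heights))
```

To see the zones on the chart, the two sequences can be handed to matplotlib: after plotting `heights`, a call like `plt.fill_between(range(len(heights)), heights, water_levels(heights))` should shade roughly the region that was drawn by hand.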
I already said so above 😄
You can always ask me or others but there's no guarantee me or anyone else will know the answer ofc
ooh sorry saw it now my bad
yeaa thats the path i follow and any ideas on how do i implement what i learnt in isl ?
Kaggle, personal projects, implementing some of the algos, ... whichever you prefer 🙂
okaa okaaa
and any other books which you can recommend ? with vid explanation ofc or just some course material ?
http://www.mmds.org/ but I'd skip some chapters
I made a video for the first time: https://youtu.be/6WTXu95KrCg
Do suggest anything else I can do next time.
What would you do given a text file of measurements and you need to know statistics about that data? NumPy to the rescue!
alright thnx mate !
which ones to skip 👀 ?
Frequent Itemsets for sure, the rest you'll have to decide based on interest 😄
I have a V100 GPU, I've set up cudatoolkit in an anaconda environment, however whenever i run some kind of inference, irrespective of the model, I get the following: CUDA error 209 no kernel image is available for execution on the device
any clue why this might be happening?
why does hypergeom.pmf give me nan if I try to calculate the pmf there, but i get a value if I use the raw formula?
!d scipy.stats.hypergeom
scipy.stats.hypergeom = <scipy.stats._discrete_distns.hypergeom_gen object>
A hypergeometric discrete random variable.
The hypergeometric distribution models drawing objects from a bin. *M* is the total number of objects, *n* is total number of Type I objects. The random variate represents the number of Type I objects in *N* drawn without replacement from the total population.
As an instance of the [`rv_discrete`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.rv_discrete.html#scipy.stats.rv_discrete) class, [`hypergeom`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom) object inherits from it a collection of generic methods (see below for the full list), and completes them with details specific for this particular distribution.
See also
[`nhypergeom`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.nhypergeom.html#scipy.stats.nhypergeom), [`binom`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.binom.html#scipy.stats.binom), [`nbinom`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.nbinom.html#scipy.stats.nbinom) Notes
The symbols used to denote the shape parameters (*M*, *n*, and *N*) are not universally accepted. See the Examples for a clarification of the definitions used here.
the argument form for hypergeom.pmf is, in word form, pmf(expected, total_objects, valid_objects, draws). Using your variables, you're attempting to draw 1696 objects from a population sized at only 64, which is clearly impossible.
ahh that makes sense
my prob theory textbook and wikipedia both used n = draws, M = success objects, N = total so i figured it was standard
!e
```py
from scipy.stats import hypergeom
print(hypergeom.pmf(3, 1696, 64, 10))
```
@warped osprey :white_check_mark: Your 3.12 eval job has completed with return code 0.
0.004762390442527913
i think it's that
pmf(expected, total_objects, valid_objects, draws)
total -> valid -> draws looks like it goes biggest to smallest
right, i ran the equation as if there were 1696 good objects out of 64 total, which also doesn't make sense
ye ye
this is one of the many cases where non-descriptive variable names are not a good idea
bad scipy 
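To make the argument-order mixup concrete, here is the raw formula written with scipy's parameter order. `hypergeom_pmf` is a hypothetical helper for illustration, not part of scipy, and the descriptive variable names are just one way to map the textbook notation onto it:

```python
from math import comb


def hypergeom_pmf(k, M, n, N):
    """P(k successes), scipy order: k, M (total objects), n (success objects), N (draws)."""
    return comb(n, k) * comb(M - n, N - k) / comb(M, N)


# Textbook notation from the chat: N = total, M = success objects, n = draws.
# Descriptive names avoid the clash with scipy's meaning of M, n and N.
total, successes, draws = 1696, 64, 10
print(hypergeom_pmf(3, total, successes, draws))
```

This should agree with the 0.00476... value the eval printed above for `hypergeom.pmf(3, 1696, 64, 10)`.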
Can anyone tell me how to calculate heuristic value in an A* search algorithm
Hello! I am quite "new" to using AIs in python, but my current project uses a bunch.
Quick rundown: on my laptop i have a Speech To Text code, and id like its output to be sent to an NLP that will filter out any words that arent names of cities, basically only letting city names continue.
After that, the text the NLP gives, which will hopefully only be city names, will be sent to a Raspberry Pi with a code that gives weather info about any city. It sends the text there using an HTTP POST, which i still dont really know if it works 100% despite having no errors.
Then, after the Pi gives the weather info, it sends it back to my laptop into a Text To Speech code to read it out loud.
I have a working STT, Pi and TTS code, but no proper way to link them, plus i still need to make the NLP code, which is what im trying to focus on right now, but i dont know much about how it works, or which NLPs i can use etc. If anyone can help, let me know
any help appreciated
update, looking online, maybe something like NER would work better? i guess its like a branch of NLPs
okkaa thnx a lot !!
you're looking for "named entity recognition"
Yeah, just found out, still dont know which to use, or how it works, basically same as before
spacy is probably a good place to start https://spacy.io/
Will research, thanks
Is sliding window and frame segmentation a similar audio segmentation method?
I'm trying to understand how VAD and Speaker Embedding (d-vector extraction) works.
Hey guys im trying to make a histogram but it doesnt show correctly
Here's my Series
and I want to make it like this
but it only shows frequency
What kinda books would you guys recommend for understanding machine learning and data science, I know that there are a bunch out there already downloaded a few but I thought I'd ask here as well.
hi folks, here's something for any pandas expert/insider (let me know if there's a more focussed community for that): i've been flummoxed by something with pandas, in particular its df.replace() method for to_replace=<dict>. from the pandas documentation for replace(regex=<bool>):
Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.
Which means to_replace cannot be a dict when regex is True. However, in all my tests, something like df.replace(to_replace=nested_dict, regex=True) always works. Is the definition/documentation then wrong?
Is anyone that is very good in XAI?
Remember what I told you before: don't ask to ask. ask your actual question right away.
Bro, i don't get you? i am basically asking a question here na
what is the question?
Can't you see i am asking question?
you said "Is anyone that is very good at XAI" and posted a screenshot, but the screenshot doesn't tell us what you need help with
i don't know if that's clear enough...
Now you're getting somewhere. But none of this information can be gathered from the screenshot.
Don't know if anyone in this channel ever done this before
Whatup? I am new to using AI and CV tools! I have an OpenVINO program for facial reidentification, based on an OpenVINO sample. How would you store the identified faces? Some options:
- Save the face images; this is how it's set up in the demo by default. Then OpenVINO loads it into description vectors.
- Save the description vectors (in memory as numpy arrays) in a custom binary format, with x-endian 32-bit floats
- Save the description vectors in an SQLite database, serialized as text; there are some popular base-64 serialization formats IIRC
- Something else
Thoughts? TY!
Does anyone know of any "smart" VAD (voice activity detection) modules that are able to differentiate between:
[i] a pause in speech where the person still wants to talk, to finish a thought or what have you
[ii] a pause in speech that indicates the person is done talking
If there is no such thing, are there modules or APIs I can use in combination with a VAD module like webrtcvad to get what I'm looking for?
Hello, I am a beginner to data analysis, and am wondering, how should I handle missing values when cleaning data?
That depends on a lot of factors
If the amount of missing values is very little compared to the whole dataset, just drop them
If they’re more than a very small portion but not more than like 30% you could get away with filling them using means/medians/etc
If there are a LOT of missing values and the feature isn’t too important to the data then just drop the feature entirely
And if there are a lot of missing values and they are important to the data, get better data lol
There’s a lot of discretion involved with that as well
There can always be better data, and more data is better 99% of the time, but you have to figure out what’s “good enough”
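A rough sketch of the rules of thumb above in pandas. The column names, the toy data and the 15% cutoff for "very little" are assumptions for illustration; the 30% cutoff is the one mentioned above, and whether a sparse feature is "important" still has to be judged by hand:

```python
import pandas as pd

nan = float("nan")
df = pd.DataFrame({
    "age":    [25, 31, nan, 47, 52, 38, 29, 44, 36, 41],       # 10% missing
    "income": [40, 52, 61, nan, 75, nan, 48, 66, 58, 63],      # 20% missing
    "fax":    [nan, nan, nan, 1, nan, nan, nan, nan, 2, nan],  # 80% missing
})

missing_share = df.isna().mean()  # fraction of missing values per column

# A LOT missing and (we assume) not important: drop the feature entirely.
sparse_cols = missing_share[missing_share > 0.30].index
cleaned = df.drop(columns=sparse_cols)

# Very little missing: just drop the affected rows.
rare_cols = missing_share[(missing_share > 0) & (missing_share <= 0.15)].index
cleaned = cleaned.dropna(subset=list(rare_cols))

# In between: fill the remaining gaps with the column median.
cleaned = cleaned.fillna(cleaned.median())

print(missing_share)
print(cleaned)
```

The order matters a little here: rows are dropped before the median is computed, so the imputed value reflects only the rows that are kept.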
hello i want try making stable diffusion models but give users the ability to train their models with select images.
i can get keras stable diffusion to work and generate images but fail to train it. how do i do that?
i would like it to be similar to dreambooth/lora for efficiency.
the problem is i am not able to find this being done or idk how to go about this.
i would like to be able build on top of existing weights instead of retraining from scratch due to the price of training(150k training hours).
hello guys i have a dataset i want to work on it
could someone help me
it's a real dataset a banking one
There are null values how can I deal with them in this case
Like doing imputation doesn’t work
trying to use spaCy, but i keep getting this error:
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
Looking online, i already did the steps i saw, and im still getting the error. any help?
you have to do python -m spacy download en_core_web_sm at the terminal to get that model
When asking for help, force yourself to describe the problem without saying "doesn't work". Because saying that something "doesn't work" isn't very informative. Is there a reason you can't use imputation? If you can use imputation, and you tried it, what happened that "didn't work"?
What would be a good way to train a word2vec model to learn synonyms = words that are often seen in the same context (around the same words) but rarely together?
I am new to this but as far as I understand, a skipgram model will make words seen together closer, but not necessarily synonyms
for instance "chocolate is good" and "candies are good" => "good" will be close to "chocolate" and "candies" but not "chocolate" to "candies"
they will probably be relatively close but you'd expect synonyms to be very close
if the model never sees "candies are chocolate" it won't be as good as it should
@crude pilot if the model knows that two words are synonyms, how should that knowledge be represented in the model? by the vectors for synonyms a and b having a low cosine distance to each other?
yes
a very low distance I think
but as far as I understand skipgram, "chocolate candies" could be a negative example
so the model could even learn to make them not too close cause they are never in the same sentence
like "chocolate" will be closer to "good" than to "candies"
I feel like I could somehow solve that by augmenting the dataset in a certain way, or doing 2 passes
like trying to detect words that have the same neighbours but are not close together
and make positive samples out of that
I know there are models around specific to synonyms but I specifically seek an unsupervised approach + conceptually I'd like to get word2vec right
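The "same neighbours but never together" detection step could look roughly like this. Note this is a plain count-based sketch, not word2vec: each word gets a vector of co-occurrence counts, and pairs with similar vectors that never share a sentence become candidate positive samples. The tiny corpus, the whole-sentence window and the 0.6 threshold are all assumptions:

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

sentences = [
    "i eat chocolate daily",
    "i eat candies daily",
    "chocolate tastes good",
    "candies taste good",
]

contexts = defaultdict(Counter)  # word -> counts of the words it appears with
together = set()                 # pairs of words seen in the same sentence
for sentence in sentences:
    for a, b in combinations(sentence.split(), 2):
        contexts[a][b] += 1
        contexts[b][a] += 1
        together.add(frozenset((a, b)))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm


# Candidate "replaceable" pairs: similar neighbours, never seen together.
candidates = [
    (a, b)
    for a, b in combinations(sorted(contexts), 2)
    if frozenset((a, b)) not in together and cosine(contexts[a], contexts[b]) > 0.6
]
print(candidates)  # [('candies', 'chocolate')]
```

Pairs found this way could then be fed back as extra positive samples in a second pass, which is the augmentation idea floated above; whether that actually helps the embeddings would need to be checked on real data.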
@crude pilot I don't think word2vec can account for a word (as in, an exact string) having more than one sense, which means that this problem can't really be solved, I don't think. For example, "begin" and "start" are synonyms as verbs, but not as nouns.
and then sometimes words with the same part of speech are only synonyms in certain context. Like "help" and "aid".
all this is to say that "true synonyms" aren't that much of a thing.
makes sense, in my context I don't seek a very precise solution, I won't actually binary classify synonyms
this unsupervised work then feeds a few-shot supervised binary classifier
that's a good example, but I'd be eager to get them close if my datasets only contains "start a race" and "begin a race"
basically if I rephrase it's like seeking a kind of transitivity
I feel like word2vec will make close words that are seen together but not necessarily words that are used as replacement from one another
(I see a lot of models at work but only started practicing recently so it's kinda new to me, perhaps there are other known models that do that or maybe even bag of words is better)
i did it in the cmd? did it have to be in the terminal?
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
used terminal now, should it work?
cmd is a terminal. do you still get an error when you run spacy.load('en_core_web_sm') in python?
import spacy
spacy.load('en_core_web_sm')
that no? id need the import
and yes, still same error
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
maybe because im using Pycharm? but that doesnt make much sense, that it would be PyCharm's fault
@serene scaffold still gives error
you've fallen into the "pycharm is using a different python environment than the one I installed my package into" trap
which is a pretty common trap. fear not.
do you know about virtual environments or no?
not much, no
have you been using the terminal window in pycharm to get to cmd?
erm no, just win r and enter
try using the terminal window in pycharm
and then do the spacy download command in there
yes
if that doesn't work, run this as python code
from spacy.cli.download import download
download('en_core_web_sm')
cli stands for "command line interface"--I think this just runs the same code that python -m spacy download runs
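A quick way to check for the environment mismatch from inside the script itself, using only the standard library. The idea is to print which interpreter is running and build the download command around that same interpreter; actually running it is left commented out:

```python
import sys

# The interpreter actually running this script (PyCharm often uses a project venv).
print(sys.executable)

# Installing through that same interpreter puts the model where the script can see it.
install_cmd = [sys.executable, "-m", "spacy", "download", "en_core_web_sm"]
print(" ".join(install_cmd))

# Uncomment to run the download from inside the script:
# import subprocess
# subprocess.run(install_cmd, check=True)
```

If the printed path differs from the Python that a plain `cmd` window uses, the model was installed into the other environment.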
previous code still gives the same error, gonna try the new one
C:\Users\danim\PycharmProjects\EXAM\venv\Scripts\python.exe C:\Users\danim\OneDrive\Desktop\programming\NLP3.py
Collecting en-core-web-sm==3.7.0
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
new one, is that good? ima guess so
looks like it's working
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
gonna try the code from before again
you can delete those two lines and try doing what you were planning to do
OK! no error this time
dope, now i just need to figure out how to have the code filter text and let only city names pass
you'd need to use the entity recognition tools
import spacy
# Load the language model
nlp = spacy.load("en_core_web_sm")
# Your input text
text = "I traveled to New York last summer."
# Process the text with spaCy
doc = nlp(text)
# Extract city names
city_names = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
# Print the extracted city names
print(city_names)
this code works, but i only want it to print the city names, as in (imagine it was in the console):
New York
But i get
['New York'], and i dont want ['']
unless i need to change the code idk
ok ok
you want to print out each item individually?
no, just list out the city name
it will then go into my Raspberry Pi, etc....
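the brackets and quotes are just how python prints a list. a minimal sketch of getting plain text instead; the list below is a stand-in for the one built from doc.ents, and "Madrid" is added only to show what happens with more than one name. also worth knowing: the GPE label covers countries and states as well as cities.

```python
# Stand-in for the list built from doc.ents in the snippet above.
city_names = ["New York", "Madrid"]

# print(city_names) shows the list's repr, brackets and quotes included.
# Joining the strings gives one plain piece of text instead.
as_text = ", ".join(city_names)
print(as_text)

# Or print one name per line by unpacking the list.
print(*city_names, sep="\n")
```

if the text is going on to another program (like the Pi), the joined string is probably the easier thing to send.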
ok, now i have "almost" all 4 codes, idk if the Pi one works, i mean it makes the HTTP POST but idk if it also does the other part of the code
well, thats for a separate discord
new question, I have this STT code, how can i have the text that it prints out go as the text the NLP code uses?
import speech_recognition as sr
def transform_audio_to_text():
    # store the recognizer in a variable
    r = sr.Recognizer()
    # set up the microphone
    with sr.Microphone() as origin:
        # wait time (length of pause that ends a phrase)
        r.pause_threshold = 0.8
        # signal that recording has started
        print('City?')
        # save what it hears as audio
        audio = r.listen(origin)
        try:
            # look it up with Google's recognizer
            request = r.recognize_google(audio, language='es-ES')
            # show that the audio was recognized
            print(request)
            # return the request
            return request
        except Exception:
            # the audio could not be understood ("oops, something went wrong")
            print('UPS, ALGO HA SALIDO MAL')
            # return a fallback value ("still waiting")
            return 'sigo esperando'
transform_audio_to_text()
nvm, figuring it out myself
okkkk, so now im working on getting the response from the Pi to be sent to my laptop, but i keep getting SyntaxError in line 40:
response = requests.post(laptop_url, json=tts_data)
idk why, i feel like there ISNT a syntax error
i just need to have the response from the Pi go into the TTS and ill be done!
check previous line
or run your code in python 3.11 for better error messages 😛
or use an IDE for that matter; they generally also have better error messages.
previous line.... maybe yes, hold on
Yeah
it was
thanks!, forgot to check, and the interpreter just didnt highlight it, thanks 😅
hey, can i attach an excel file here?
or is that not allowed
idk what the policies are
this is the file btw
my question is would it be a bad idea if i used some of the variables here to predict the "IPAnnualReimbursementAmt" with a regression?
idk what else to do because of how complicated the kaggle notebooks are on this dataset
my prof wants me to come up with some kind of hypothesis and i wrote a 14 page paper about it already from the provider side... like using machine learning to see if a medicare provider is issuing fraudulent claims, but it was mostly a literature review
so i was wondering if anyone could help me out
also
there is no data dictionary for this dataset which really sucks
oh and here is the kaggle link i got everything from
@serene scaffold Do you have any idea how to decode abbreviations/acronyms? I have a database of acronyms and their full-text counterparts and I'd like to crosswalk it with the corresponding text. I've been searching the spaCy documentation for how to modify the tokenizer for this and have had no luck.
I guess I'm trying to avoid having to reconstruct the doc object.
But it appears that doing that might be required.
please ping me y'all if you can help
@tidal bough you can probably add acronym expansions as a token attribute, perhaps by adding a pipe
Oh no, I pinged the wrong person
Please don't forgive me as I do not deserve it.
@waxen girder this
I'll need to look into this, I'm new to spacy and don't know what a pipe is.
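to make the "add a pipe" idea concrete: a pipe is just a function that receives the Doc after tokenization and returns it, so it can annotate tokens without rebuilding anything. a minimal sketch, assuming spaCy v3; the ACRONYMS table, the component name acronym_expander and the attribute name expansion are all made up for illustration, and it only handles acronyms that are a single token.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Token

# Hypothetical lookup table; in practice this would come from your database.
ACRONYMS = {"NLP": "natural language processing", "CV": "computer vision"}

# Register a custom attribute, available as token._.expansion.
Token.set_extension("expansion", default=None, force=True)


@Language.component("acronym_expander")
def acronym_expander(doc):
    # Annotate tokens in place; the Doc and its tokenization are unchanged.
    for token in doc:
        token._.expansion = ACRONYMS.get(token.text)
    return doc


# A blank English pipeline is enough here: tokenizer only, no model download.
nlp = spacy.blank("en")
nlp.add_pipe("acronym_expander")

doc = nlp("I use NLP and CV at work")
expanded = [token._.expansion or token.text for token in doc]
print(" ".join(expanded))
```

the original text stays untouched in token.text, so you can crosswalk in either direction later.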
I think I understand this now, thank you!
is there a channel or person I can talk to for a cv related task
Hey guys, a very quick question.
Could you suggest some good baseline models for a forecasting problem?
this channel
It's generally better to just ask your question away if possible and whoever is familiar with the concept and free can pick it up and help you with it.
People usually don't want to engage in back and forth just to get to the question and then discover whether they're familiar / free enough to help with it or not
(it's ok to share a scan of part of a book right?)
so im currently learning NLU and found this flowchart. would you guys say it's a decent way to find out if what i'm doing can be solved with it?
i personally find this and all the explanation before make sense, but i'd like a second opinion
getting this error:
[2023-10-18 20:55:44,506] ERROR in app: Exception on /generate_tts [POST]
Traceback (most recent call last):
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 1455, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 869, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 867, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 852, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "C:\Users\danim\OneDrive\Desktop\programming\TTSPOST.py", line 28, in generate_tts
stream(audio_stream_bytes)
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\elevenlabs\utils.py", line 76, in stream
mpv_process.stdin.write(chunk) # type: ignore
TypeError: a bytes-like object is required, not 'int'
<pi's ip> - - [18/Oct/2023 20:55:44] "POST /generate_tts HTTP/1.1" 500 -
when running the TTS code. Quick run down, i have a STT and NLP code that sends text into a raspberry pi, and then the Pi sends the response it generates back into this TTS code but i get that error, any help?
??
???
It is probably a simple task but I am not able to adjust it.
I want the values below the line (see picture) not to be influenced when I change the second x-axis. https://paste.pythondiscord.com/HHEQ
Looks like the issue is about the "mpv_process.stdin.write(chunk)". Check and make sure the variable is located in the right place
wdym?
i dont have that variable....?
is the code wrong?
if so, let me know
or any fixes, whatever it is, i just want to finish this project by sunday
anyone?
I don't understand what the issue is. Explain the error again. But I guess you did change the API key to your actual ElevenLabs key when you asked GPT to generate the code for you?
yes, but the error message is the same traceback as before, ending in:
TypeError: a bytes-like object is required, not 'int'
idk why
when running the TTS code before, it actually said text out loud, still gave the message of the bytes, but it wasnt marked or highlighted as an error
Have you checked whether the file you're using has the correct type?
Can you give me the output when you run print(type(audio_stream))
like in the same code?
You can use it in the same code and see the output
ok
sure
also, when i add the print, it doesnt print anything rn, as the Pi hasnt sent anything rn cuz its not on
like i dont have it with me either so
@app.route('/generate_tts', methods=['POST'])
def generate_tts():
data = request.get_json()
weather_info = data.get('weather_info')
audio_stream = generate(
text=weather_info,
model="eleven_multilingual_v2",
voice=Voice(
voice_id='HefIefabApyvTp4BVxGv',
settings=VoiceSettings(stability=0.24, similarity_boost=1, style=0.5, use_speaker_boost=True)
)
)
print(type(audio_stream))
audio_stream_bytes = bytes(audio_stream)
stream(audio_stream_bytes)
return 'TTS audio generated and streamed', 200
else:
return 'Invalid weather information', 400
a bunch of code errors, hold on
Try to run your Flask app and request it to /generate_tts endpoint
wait because it has a whole bunch of errors
just to know, what did you change from my code and yours?
plus, it needs to recieve the response from the Pi with the HTTP
The only change I made was inserting the print(type(audio_stream)) statement, which prints the type of the audio_stream object to the console, so we can see what it actually is
so this?
from flask import Flask, request, jsonify
from elevenlabs import set_api_key, generate, Voice, VoiceSettings, stream
# Set your API key here
set_api_key("<your_api_key>")
app = Flask(__name__)
@app.route('/generate_tts', methods=['POST'])
def generate_tts():
    data = request.get_json()
    weather_info = data.get('weather_info')
    if weather_info:
        # Generate TTS audio for the weather information
        audio_stream = generate(
            text=weather_info,
            model="eleven_multilingual_v2",
            voice=Voice(
                voice_id='HefIefabApyvTp4BVxGv',
                settings=VoiceSettings(stability=0.24, similarity_boost=1, style=0.5, use_speaker_boost=True)
            )
        )
        print(type(audio_stream))
        # Convert the audio stream to bytes
        audio_stream_bytes = bytes(audio_stream)
        # Stream the audio
        stream(audio_stream_bytes)
        return 'TTS audio generated and streamed', 200
    else:
        return 'Invalid weather information', 400
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
I am not professional in this but have you tried to correct the weather_info function?
correct? wdym
obviously yes
Haha, gottu check
nvm
Alright alright. Well, make sure it works fine
Try this
yes i know, when i put the code i naturally use the key
i didnt put the key in the code
the only difference in the code is the print no??
yes
print(type(audio_stream))
how will that fix the error tho
It won't solve the issue by itself. But knowing the type of audio_stream helps you understand why you're getting the error.
not really no
Well, we've got to find out the actual cause of the audio error first.
yes
Maybe the source youre using isnt the best. Have you tried another approach?
idk
Does anyone know?
what do you mean by "not being influenced"?
you want the top axis to apply to the values above the line, and the bottom axis to apply to values below the line?
i see a mix of units here and you'll probably need to normalize them all somehow
Exactly what I would like. If I set one x-axis too low, it messes up how the other one looks
Why would the unit matter? Wouldnt it be possible to make some kind of blocker where the line is?
I tried some kind of blocker function but nope
you need to use ax2 to plot things that will use that "twin" axis
if you plot everything with ax1 then it will all be connected to whatever you set for ax1 and ax2 won't be relevant
Why would the unit matter?
Because conceptually it makes no sense to compare a pH value to a mg/L value
hahaha. I get you. Well, I guess this is not for comparing but rather visualization.
can anyone help me out?
Well, I have tried to adjust it without adjusting the other. Doesn't seem to change
this is a question about the particular problem domain, not really about statistical modeling, data science, etc.
this is why it's a good idea to pick project topics where you already have some understanding of the field
otherwise you're left wondering if anything makes sense at all
for the sake of the exercise i think it can't hurt to try it
but as with all modeling of unknown data, you need to prepare for the disappointment of your model being useless or worse
that, or it works great. who knows?
i'd suggest choosing another dataset if you have time. working without any domain knowledge and without any data dictionary is the hardest possible scenario
here's a great example of why data dictionaries matter: https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8
(should be required reading for anyone working with data, including if it's from kaggle)
i can’t seem to find one with a medicare dataset with a data dictionary
i’ll keep looking though
and an interesting followup discussion: https://fairlearn.org/v0.9/user_guide/datasets/boston_housing_data.html -- this one is mostly phrased in terms of "fairness" and "harm", but the more generally-applicable point here is that the data is bad and fucked up for a variety of reasons, and might have been tampered with or fraudulent
why does it have to be a medicare dataset?
bc that’s the topic my profs and i agreed on
and i already wrote 14 pages about providers scamming medicare ppl
actually since you're looking to work with health data, i think you should be very concerned with the topics of fairness and harm
as you're probably discovering, medical decisions can be literally life or death, or otherwise can have huge effects on people's quality of life. there's no small moral imperative to proceed with the utmost discretion and discipline.
my profs are just like whatever about it
they couldn’t care less
i’ll continue looking for data dictionaries
why did you agree on a topic that you had no experience with? i'm trying to understand the context here so i can make a useful recommendation
if you're already deep into the project, don't waste your time. if you can figure out the origin of this data, maybe you can find some documentation or other useful info (background information, where the data came from, etc)
otherwise you'll have to just make judgement calls and do the best you can
if you make a decision, document it clearly. nothing else you can do
but hopefully there are some important lessons here for the future
it looks like this particular dataset is connected with some larger project
you could also try to reach out to the author to get some info
it looks like some people have tried to get that information and the author has not responded
i would treat this data very very very cautiously
it might be good for the sake of your machine learning project though
but i would make it very clear in your project that this data has unknown provenance and therefore unknown quality and relevance to any real-world scenario
i agreed to it bc there was a ton of literature review on the topic
i did find this tho
just today
fair enough. in the future, make sure the data exists too 🙂
looks useful, if you can get access to the actual data. i suggest discussing this with your advisor to agree on a plan that makes sense within your assignment deadline
i just had a meeting with them yesterday too
send them an email with your concerns
alr
i suspect that for the sake of the project the best course of action will be to proceed with this mystery dataset just so you aren't wasting time requesting access to datasets from the US govt, trying to get university funding for the fees, etc.
ok
but keep in mind the caveats: unknown data will have unknown problems. do the best you can formulating hypotheses, but be very cautious about applying results to the real world.
i think the CMS datasets are downloadable, not sure though
for actual help formulating hypotheses (which i think was your original and good question) -- either do the best with the literature you've read so far, or try to find someone at your university who might be able to help point you in the right direction
(or keep asking online and maybe you'll get lucky with someone who has experience in this particular field)
https://www.cms.gov/data-research/files-for-order/data-disclosures-and-data-use-agreements-duas/limited-data-set-lds this looks a little more complicated than just downloading a ZIP of excel books
in the future if you want lots of interesting economically-oriented data, check out the US BLS, USDA, and Federal Reserve
hmmm
i did a fun school project years ago looking at corn and wheat prices from public USDA data (although i think some of it i had to manually copy from a PDF)
maybe it’s a better idea to just use the kaggle data then
right, that's what i'm saying
if you were earlier in the process, i'd stand by my original recommendation to change your topic. but it seems like it's too late now
yeah. thanks for the help!
I didn't get your meaning here.
theres a lot of caveats with that dataset btw. especially during covid. just so youre aware ahead of time
yikes
i'm trying to find the datasets this guy uses... not to a lot of success tho
i have no idea where on the website he got the data
hello is someone good at fine tuning stable diffusion
i am trying to run lora but cant seem to get it working
the link is broken
yeah i did a whitepaper over this
damnit. the data this .ipynb uses is old enough that it's been archived by medicare
oof
you are doing all your plotting with ax1, so when you adjust ax1 it will affect everything you plotted using that axes object. use ax2.barh to plot the things you want connected to ax2
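a minimal sketch of what that looks like, assuming the second x-axis comes from ax1.twiny() (shared y, separate x on top). the numbers are made up: two pH readings below the dividing line, two mg/L concentrations above it.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax2 = ax1.twiny()  # shares the y-axis, gets its own x-axis on top

# Bars drawn with ax1 follow the bottom x-axis only.
ax1.barh([0, 1], [7.2, 6.8], color="tab:blue")
# Bars drawn with ax2 follow the top x-axis only.
ax2.barh([3, 4], [120.0, 95.0], color="tab:orange")

ax1.set_xlim(0, 14)
ax2.set_xlim(0, 150)  # changing this does not rescale the ax1 bars
ax1.axhline(2, color="black")  # the dividing line between the two groups
```

call plt.show() or fig.savefig(...) afterwards to look at it. the point is only that each bar belongs to the axes object it was drawn with.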
yikes? this is 100% normal looking, welcome to real world data 🙂
Right 🤣
these guys changed their entire data dictionary
also wtf, it's glitching my google chrome from the amount of data
ok. here's the idea. we use a json get request
just download the data? don't overthink it
it crashes everything, because of how large it is
i decided to use the medicare dataset because it has a data dictionary
Im getting some depth problems with my plots. I don't use matplotlib much so no clue what the problem is... any ideas? edit: ignore this, I just rendered them in reverse order, i'm surprised matplotlib doesn't do any depth sorting for plots
The main thing is to understand the math behind AI. But also get comfortable with the Python libraries, and read books and documentation
Coursera Machine Learning by Andrew Ng
https://developers.google.com/machine-learning/crash-course/ google's crash course
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
3b1b's playlist covering neural networks
andrew Ng's courses are very highly acclaimed, here is a link to what I believe is a completely free one https://see.stanford.edu/Course/CS229 and he has many others on coursera
there are other resources in the pins for the channel as well
edX - Introduction to Artificial Intelligence (AI) by MIT
"Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron
"Python for Data Analysis" by Wes McKinney
we have people of all ages here
Just say your ages
Please don't ask people to do that.
Why not ?
Often, people who want to know other people's ages are predators
?
You just want to know how experienced people are with AI, do you not?
I just want to compare the other ages with mine
Well, you can't do that here.
I don't have any suggestions for resources in Portuguese, unfortunately
I didn't ask you for this
alr, so i got the json data, created a pandas dataframe from it, and then exported it to an excel file
now i have to do the same thing for the part B dataset as well
still getting this error, any help?
C:\Users\danim\PycharmProjects\EXAM\venv\Scripts\python.exe C:\Users\danim\OneDrive\Desktop\programming\TTSPOST.py
* Serving Flask app 'TTSPOST'
* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://<laptop ip>:5000
Press CTRL+C to quit
<class 'bytes'>
[2023-10-19 19:44:55,431] ERROR in app: Exception on /generate_tts [POST]
Traceback (most recent call last):
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 1455, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 869, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 867, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\flask\app.py", line 852, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "C:\Users\danim\OneDrive\Desktop\programming\TTSPOST.py", line 28, in generate_tts
stream(audio_stream_bytes)
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\elevenlabs\utils.py", line 76, in stream
mpv_process.stdin.write(chunk) # type: ignore
TypeError: a bytes-like object is required, not 'int'
<pi's Ip> - - [19/Oct/2023 19:44:55] "POST /generate_tts HTTP/1.1" 500 -
i have no clue what to do to fix the audio not working
cuz when i run this code
from elevenlabs import set_api_key, generate, Voice, VoiceSettings, play, save, stream
set_api_key("<not gonna put my key, obviously>")
audio_stream = generate(
    text="test",
    stream=True,
    model="eleven_multilingual_v2",
    voice=Voice(
        voice_id='HefIefabApyvTp4BVxGv',
        settings=VoiceSettings(stability=0.24, similarity_boost=1, style=0.5, use_speaker_boost=True)
    )
)
stream(audio_stream)
save(audio=audio_stream, filename="1.mp3")
it works, i mean i do get this
Traceback (most recent call last):
File "C:\Users\danim\OneDrive\Desktop\programming\APItests.py", line 17, in <module>
save(audio=audio_stream, filename="1.mp3")
File "C:\Users\danim\PycharmProjects\EXAM\venv\lib\site-packages\elevenlabs\utils.py", line 52, in save
f.write(audio)
TypeError: a bytes-like object is required, not 'generator'
Process finished with exit code 1
but the audio plays
please help
anyone?
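judging only from the two tracebacks, they look like mirror images of the same mix-up. in the Flask code generate() is called without stream=True and the print shows <class 'bytes'>, but stream() loops over what it is given, and looping over a bytes object yields ints. in the test script generate(stream=True) returns a generator, which stream() can play but save() cannot write. a minimal sketch of the first half with a stand-in player (fake_stream is hypothetical, not the elevenlabs function):

```python
def fake_stream(chunks):
    """Stand-in for a player that writes each chunk to a binary pipe."""
    written = bytearray()
    for chunk in chunks:
        if not isinstance(chunk, (bytes, bytearray)):
            raise TypeError(
                f"a bytes-like object is required, not '{type(chunk).__name__}'"
            )
        written += chunk
    return bytes(written)


audio = b"\x49\x44\x33"  # pretend this is the MP3 data

# Iterating over a bytes object yields ints, one per byte.
print([type(item).__name__ for item in audio])

try:
    fake_stream(audio)  # same shape as calling stream() on a bytes object
except TypeError as err:
    print("fails:", err)

# An iterator of bytes chunks is what a streaming player expects.
print(fake_stream(iter([audio[:2], audio[2:]])))
```

so in the Flask handler, one option is to pass stream=True to generate() and hand the result straight to stream(), dropping the bytes(...) conversion. the other is to keep the plain bytes and use play() instead, assuming play() takes the whole audio at once, which is what its import next to save() suggests.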
Too complicated for my puny mind
https://discord.com/channels/267624335836053506/1164708090046853190
idk if anyone here has experience with label-studio but I am absolutely stuck with it, figured I'd link it here in the hopes someone will have experience and can help
Hey, does anyone here use PythonAnywhere often? I have code that runs perfectly in pycharm, but will not run in PythonAnywhere. I was wondering what the issue was?
Sounds like this isn't a data science question. Try asking in a help thread, but be very clear about what the code is and how you know that it doesn't run properly on PythonAnywhere
I like how "is training data available" doesn't even have a 'yes' option xd
gm
im looking for some recommendations on textbooks for learning some ML + neural networks. Hoping to try some deep learning reinforcement in the future
i have made a CLI game only with numpy (Tic Tac Toe)
it includes a neural network to play with a robot
https://github.com/shijan-wow/TicTacToe
But it doesn't use a training trick (not a problem, I wanted to make it like this)
I just wanna share it with all you guys
-
Prof. Sebastian Raschka's Machine Learning with PyTorch & Scikit-Learn https://sebastianraschka.com/books/#machine-learning-with-pytorch-and-scikit-learn
@modern storm
Hello guys, I need some help in understanding my questions
don't ask to ask--if you have a question, ask your whole question all at once
no one can know for sure if they can help unless they know what the actual question is.
!pastebin
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
so i have this code over here and i'm trying to access the Rfrg_NPI key
it's there in the excel file, but i can't see it in the dataframe columns
anyone have any ideas on how i can fix this?
i swear i'm not going crazy
No idea if this can be asked in this channel, but, is there a way to make stable diffusion install/load asynchronously? As whenever i install a different version/start over, it uses ages to download and install.
i'm not sure what mistake i'm making here
censoring some what i think is private info here
but yeah the column is there
is it because of invisible characters?
i did auto stretch the excel file column to see all the numbers
nvm, figured it out.
Trying to make a pumpkin?
lmao
i have no idea what i'm doing
what does not in index mean?
idk how to fix this
can anyone help?
it means those column names do not exist in the df
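a minimal sketch of how to check that, assuming pandas. the frame below is a toy with a trailing space in one header (the kind of thing that is invisible in Excel), and the values are placeholders.

```python
import pandas as pd

# Toy frame; note the trailing space in the first column name.
df = pd.DataFrame({"Rfrg_NPI ": [1], "EXCLTYPE": ["x"]})

wanted = ["Rfrg_NPI", "EXCLTYPE", "EXCLDATE"]

# repr() makes stray whitespace in the real names visible.
print([repr(name) for name in df.columns])

# Normalise the headers, then list whatever is still missing.
df.columns = df.columns.str.strip()
missing = [name for name in wanted if name not in df.columns]
print("missing:", missing)

# Select only the columns that actually exist, so no KeyError is raised.
subset = df[[name for name in wanted if name in df.columns]]
print(subset.columns.tolist())
```

in a notebook the same error also shows up when a cell that drops columns gets run twice: the second run asks for columns the first run already removed.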
so how do i fix the problem?
do i remove those columns from the brackets?
So this is my first time trying to label data myself. I've only ever used pre-labelled data before in tutorials, so bear with my noobness.
I am trying to create a model that can identify characters in Guilty Gear Strive (it's a fighting game, that's irrelevant though). When you pick a character you can choose roughly 12 different colour palettes for each character so my question is how should I approach this? When I'm labelling do I label each thing like SolColour1, SolColour2, KyColour1, KyColour2, etc? or is there a better way to do it than that?
the weirdest thing is my code was working before
ugh
The problem is that you're trying to reference a key that isn't there
specifically "['npi', 'EXCLTYPE', 'EXCLDATE', 'REINDATE', 'WAIVERDATE', 'WVRSTATE', 'MIN_EXCLUSION_PERIOD', 'END_EXCLDATE', 'DATA_YEAR']", which sounds like it's supposed to be a value within the key rather than the key itself
Well that'll do it 😛
the problem was that i'm coding on a notebook and when you try to run cells all at the same time with drop statements shit gets complicated
no it's not, now i'm really confused
damnit
my code consumes too much ram for google colab now
why is my pycharm having issues
Where are the projects for this discord server? Can someone share the GitHub repo?
this seems like a reasonable way to encode the data as a starting point
don't do this. you should treat your notebook like a python script. if it doesn't run cleanly from top to bottom, it's basically impossible to debug, and it will be impossible to reproduce your own results.
if your data is huge, develop and debug everything on a small subset of it first
How do i run multiline code like this in git bash? I forgot how to go about it any tips?
this code is meant to be run in the python console, not in git bash itself
(assuming you're on windows?)
when you run the program "Git Bash" on the start menu, what you actually are doing is opening a program called a terminal, which is basically a box that relays text back and forth to another program. in the case of Git Bash, that "another program" is a special kind of program called a shell, which is meant for running other programs and generally interacting with your computer. Bash is the name of the shell program running inside the terminal.
so when you open Git Bash you need to start a different program, in this case a Python interpreter
it should be as simple as running python or py, but that kind of depends on how you installed python
note that it's usually not a great idea to try developing code by pasting things directly into a console. you will almost certainly want to use a code editor to help you, i recommend IDLE or Thonny if you're a beginner
(IDLE should be included with Python if you installed from python.org)
So I also realized that dustloop.com/w/GGST actually has all of the colors and images of the characters for all of their moves but I'm not sure if I can use those images to train? I'm not sure if since there is no background on them if that's going to make it more difficult when I'm trying to detect the characters in a live environment? Lol I'm just at a bit of an impasse as to what to do because it's such a big commitment to collect and annotate a bunch of screenshots especially when I don't know if it's going to actually work. I mean I'll do it if that's what it takes but I'm trying to get some input before I dive head first into something that won't work xD
Okay great thanks.
Which is more user friendly, Thonny or IDLE?
idle has a simpler interface, thonny has more features like a debugger
people actually use idle?
Hello, I have 167 text files to read and I want to extract the data from all of them and put it in one big list in order to plot it. Can you give me advice on how to do this?
i used it when i first learned python
i don't think many people use it for serious professional work
write some code to extract the data from one file, then loop over files. put the data in a list or data frame or whatever, and go from there. don't over-think things like this.
lol when I first learned python I was coming from C++, I launched IDLE and went "okay so this sucks, what's a good IDE to use" found Visual Studio Code and haven't looked back since
assuming you're using a CNN, maybe you can pre-train the model on this data, and then fine-tune it on your data with the backgrounds? i have no idea if that will work
this is getting into territory where i don't have first-hand experience
i actually tried something which doesn't work, may i send it here
fair, but learning programming for the first time is a bit different. how many people wrote their first C++ program in notepad?
that being said IDLE is still leagues above where I started learning C++ back in the 90s in notepad, at least IDLE has colours xD
rofl
beat me to it.
thanks
The joys of spending hours looking through on notepad to find that bracket or semi-colon you missed with no indication of where it might be xD
You gotta send the pastebin link
I mean there is actually a problem when i'm testing this for two files, i don't know why, but when i extend the list of dates, a hash character (#) pops up
this is the regular shape of all my files "3_......txt"
Station : BREST
Longitude : -4.49503994
Latitude : 48.38290024
Organisme fournisseur de données :
Fuseau horaire : UTC
Référence verticale : zero_hydrographique
Unité : m
Source 1 : Données brutes temps réel
Source 2 : Données brutes temps différé
Source 3 : Données validées temps différé
Source 4 : Données horaires validées
Source 5 : Données horaires brutes
Source 6 : Pleines et basses mers
Date;Valeur;Source
04/01/1846 00:00:00;3.48;4
04/01/1846 01:00:00;2.7;4
04/01/1846 02:00:00;1.99;4
04/01/1846 03:00:00;1.7;4
04/01/1846 04:00:00;2.15;4
what is the error you're getting? I mean just looking at the code based on what you wanted to do I think you might have over-engineered this 😛
I don't get any error x)
it seems like my code doesn't get to line 180
the pickle files don't actually generate
I get a list for my dates which looks like : [01/01/1846,.............................................................,#,01/01/1847] something like this for the dates
not sure why the pickle files aren't working
Ohhhhh you're referring to the # that is there shouldn't be there?
yes !
Ahhhh I see
if anyone can help i'd greatly appreciate it
are you sure it's actually there and it's not your IDE's way of displaying that it's truncating the output to console?
are you asking me?
no
oh ok
i don't think so
what happens if you try to print out the index just before that contains 01/01/1847?
sorry i have a new error rn, it will take a moment to correct it
thank you for your help though
Lol I mean let's wait to see if I can actually figure out the problem before you thank me 😛
im gonna figure this out tomorrow
do we agree that making a boucle with a "with open" in it to read each file won't work for 167 files .
I mean that depends on what a boucie is 😛
not a single boucie no. but with 2 you could
boucle not boucie 😉
since all of your files follow a linear naming convention you could use a loop to iterate through all of the files
ok i'll try to figure this out
So you want it to read through each file and put the contents into a list iirc?
yes
so can it read 1 file put that into the list then read the next file and put it into the list?
no
i get this error : with open(filename, "r") as file:
TypeError: expected str, bytes or os.PathLike object, not list
one sec
what does your file structure look like?
I'm just a bit curious as to why you've done things as you've done them instead of just iterating through all txt files in a directory and instead put them in a list. Idk if that's because it's required for your task or because you didn't know how to do it that way.
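A minimal sketch of the "one file first, then loop" approach, assuming every file follows the layout pasted above: a metadata block, then a `Date;Valeur;Source` header, then `;`-separated rows. The function names, the `*.txt` pattern, and the UTF-8 encoding are assumptions for illustration and are not taken from the original code. Note that `open()` accepts one path at a time, which is why passing a whole list of names raises the `TypeError` quoted above.

```python
from datetime import datetime
from pathlib import Path


def read_tide_file(path):
    """Parse one file and return (dates, values)."""
    dates, values = [], []
    in_data = False
    # `path` is a single file name, never a list of names.
    with open(path, encoding="utf-8") as handle:  # encoding is an assumption
        for line in handle:
            line = line.strip()
            if line.startswith("Date;"):
                in_data = True  # the header row ends the metadata block
                continue
            parts = line.split(";")
            if not in_data or len(parts) != 3:
                continue  # skip metadata, blank lines, and malformed rows
            date_text, value_text, _source = parts
            dates.append(datetime.strptime(date_text, "%d/%m/%Y %H:%M:%S"))
            values.append(float(value_text))
    return dates, values


def read_all(folder, pattern="*.txt"):
    """Loop over every matching file and collect everything in two big lists."""
    all_dates, all_values = [], []
    for path in sorted(Path(folder).glob(pattern)):
        dates, values = read_tide_file(path)
        all_dates.extend(dates)
        all_values.extend(values)
    return all_dates, all_values
```

Calling `read_all("some_folder")` (a placeholder folder name) would return two parallel lists that can be passed straight to a plotting call.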
ok i'll send you an example
import json

def read_urls_from_text_file(file_path):
    with open(file_path, 'r') as file:
        urls = [line.strip() for line in file.readlines()]
    return urls

def create_label_studio_json(urls):
    tasks = []
    for index, url in enumerate(urls, start=1):
        task = {
            "url": url,
            "metadata": {
                "id": index,
                "description": f"Image {index}"
            }
        }
        tasks.append(task)
    return tasks

def save_json_to_file(json_data, output_file):
    with open(output_file, 'w') as file:
        json.dump(json_data, file, indent=4)

if __name__ == '__main__':
    input_file = 'E:/GGSTAI/files.txt'  # Replace with your input text file
    output_file = 'E:/GGSTAI/tasks.json'  # Replace with your desired output JSON file
    urls = read_urls_from_text_file(input_file)
    tasks = create_label_studio_json(urls)
    save_json_to_file(tasks, output_file)
    print(f'Converted {len(urls)} URLs from {input_file} to {output_file}')
because I use this code to do something similar, only instead of putting them into a list I get it to write it into a json file
ok no it's not what i'm used to do
What does sub word tokenization mean
hello
I am new to ML
and I have been assigned a project in which ML is used..and am supposed to use the transformers model
basically the problem statement is :
Design a mathematical model for an improved sentiment understanding based on emotion-annotated corpora on Plutchik’s wheel of emotion.
so if anyone could show me way like how to proceed..I would be grateful
like I have done the text preprocessing and all
Thank you
Interesting! will listen to this keynote
interesting
Guys, I heard that vector calculus is a part of machine learning, this might be a stupid question but how exactly is it being used and implemented?
Well, it's used in many places but for one, the forward-propagation of neural networks involves multiplying matrices by vectors, and so backpropagation (which is used in gradient descent, to train neural networks by tweaking their parameters) involves taking the derivative of the loss by the neural network's (many many) coefficients, which is multivariate ("vector") calculus.
e.g. if you have a very simple one-layer neural network, that means you get a vector of inputs x and you have a matrix of parameters A and also a bias vector b, and to calculate the output you do:
y = f(A@x + b)
where f is some activation function. This output is going to be different from the correct output, target_y, and you could quantify that difference via a loss function like mean squared error:
loss = ((y-target_y)**2).mean()
To train the network via gradient descent, you calculate the derivative of (y-target_y)**2 by A and by b, and slightly change A and b in the direction which lowers loss, and repeat this process many times until it stops helping.
(this is a pretty rushed explanation, in any beginner ML course they'd go over it in more detail)
(and of course, not all ML is neural networks and there's plenty other places where you need linear algebra and calculus - Support Vector Machines come to mind)
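The training loop described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: `f` is the identity, `A` has a single row (so the output is one number), and the toy data, the learning rate, and the name `train` are made up for the example.

```python
def train(inputs, targets, learning_rate=0.1, steps=2000):
    """Fit y = A@x + b (one output, identity activation) by gradient descent."""
    weights = [0.0] * len(inputs[0])  # the single row of A
    bias = 0.0                        # b
    n = len(inputs)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(inputs, targets):
            y = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = y - target
            # d/dw_i of mean((y - target)**2) is mean(2 * error * x_i)
            for i, xi in enumerate(x):
                grad_w[i] += 2 * error * xi / n
            # d/db of mean((y - target)**2) is mean(2 * error)
            grad_b += 2 * error / n
        # Step against the gradient, i.e. in the direction that lowers the loss.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy data generated from y = 2*x1 - 1*x2 + 0.5 (made-up coefficients).
inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
targets = [2.0 * x1 - 1.0 * x2 + 0.5 for x1, x2 in inputs]
weights, bias = train(inputs, targets)
```

With a nonlinear `f` the only change is an extra chain-rule factor `f'(A@x + b)` in each gradient term, which is the part backpropagation automates for deeper networks.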
I think all methods except vanilla decision trees involve some level of linear algebra specifically
Ahh I see, I have a relatively good math foundation. May I ask, for example, how did you go about learning machine learning and mastering it? What did you start with?
At its heart "learning" is an optimisation problem which can be done through methods coming out of linear algebra or methods coming out of calculus typically.
everyone in here atm learned it in uni by taking mathematics courses and reading maths books, i think
at levels ranging from bsc to phd
the place to start with would be calculus, linalg, and statistics, the basics of which are more or less independent of each other
as calculus starts going into multivar and statistics starts dealing with multiple variables, linalg leaks into both
I actually started doing ML with Andrew Ng's ML course on coursera: https://www.coursera.org/collections/machine-learning back when I was in highschool, maybe year 10 or so?
It was legendary (something like the top-1 course on coursera, I believe) back then and I think still is. Note that this was actually before the course had a massive rework - back then it was taught in Octave, now it's taught in Python. Probably it's still good, though.
Wow guys I appreciate all of your responses, thanks very much
i would also say statistics is something you have to revisit several times. basic stats doesn't require much else. multivar stats requires linalg. continuous stats requires calculus. estimation theory, which is where you want to arrive at for machine learning, requires both calc and linalg
Yeah I'm on precalc now, there are load of resources and pdfs and books and everything
gotcha
I started learning this stuff in university personally. There they give you the courses in a "logical" order 🙂
I see, I'm currently going thru cs50 machine learning and artificial intelligence course, what do you think of that?
i'd recommend the book "statistical signal processing" by louis scharf, as well as "fundamentals of statistical signal processing: estimation theory" by steven kay. these already require the 3 prerequisites we've been discussing. there are also other books that more directly address ML, but the key concepts are the same
In my first year I got linear algebra first, then calculus. In my second year I had statistics and an optimization course. In my third year I got econometrics which teaches you the "finesse" of working with data.
May I ask what course did you take?
gotcha, I'll note this down
Yeah, I think what I studied is too country specific. Any program that offers you sufficiently good courses in some of the things I listed is fine 😄
I don't really have an opinion on any of the numbered courses from specific universities. Skimming the projects, it seems alright.
Yeah that's fair enough
yeah thanks
Thank you guys all you are helping a lot
Once you're "ready" this is the book to read in my opinion: https://www.statlearning.com/. It assumes you have a working knowledge of lin alg, calculus and probability/statistics
I have this cv mask I've extracted. With cv2 i want to detect how many white areas there are in this black and white mask. What would be the best, most efficient and accurate way of doing this? Below i have some of the masks i've computed:
And also these white areas are going around in a circle since i'm trying to make a bot for a game
I've tried asking chatgpt and copilot, this is the best i have so far, but it is still a little inaccurate
img = cv2.cvtColor(img, cv2.COLOR_RGBA2RGB)
img = cv2.inRange(img, blue_lower_bound, blue_upper_bound)
ret, thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(thresh, connectivity=8)
print(num_labels)  # label 0 is the background, so num_labels > 2 means there is more than one white area
Would you guys recommend learning math before learning machine learning or as you go and learn it in parallel?
In my opinion there's different types of people working in ML, applying methods to problems or making new ones. They require different levels of math. If you want to be in the former category doing it somewhat in parallel is fine.
hello guys, i'm making a face recognition project, using FaceNet and SVC. It performs well in identifying trained faces, but i need it to say if the face is not identified/trained too (what not happens cause classifier only predict based on the most likely to be the label), what can i do to solve it?
I'd say it depends on what you wanna be at the end of the day. If you're like me who's more into ML Research, then try to cover the basic Math and Statistics fundamentals. It'll be nice to also understand SVD in linear algebra.
So in summary, if you're still in high school or undergraduate program, you're lucky 'cos most of these fundamentals would be covered extensively therein. Just give it more attention in school.
Other than that, you can get into ML without being the best mathematician or statistician. Whatever you don't know now, you'll learn it as you keep progressing in the field.
got it thanks 😊
got it 👍
partb = pickle.dumps("partb")
partd = pickle.dumps("partd")
dmepos = pickle.dumps("dmepos")
combined = pickle.dumps("combined")
partd = pickle.loads(partb)
partb = pickle.loads(partd)
dmepos = pickle.loads(dmepos)
combined = pickle.loads(combined)
is there something wrong with how i'm using pickle here?
i was following a youtube tutorial that used .dumps, but now i see videos saying to use .dump?
apparently they both do different things?
pd.to_pickle(partd, '/Volumes/ML_projects/Medicare_Fraud_Datasets/processed_data/partd.pkl')
pd.to_pickle(partb, '/Volumes/ML_projects/Medicare_Fraud_Datasets/processed_data/partb.pkl')
pd.to_pickle(dmepos, '/Volumes/ML_projects/Medicare_Fraud_Datasets/processed_data/dmepos.pkl')
pd.to_pickle(combined, '/Volumes/ML_projects/Medicare_Fraud_Datasets/processed_data/combined.pkl')
the ones with "s" take or return a bytes object, the ones without write to/read from a file.
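A small sketch of that difference using only the standard library. The `records` dict and the file name are placeholders standing in for a real object such as a DataFrame. It also shows why `pickle.dumps("partb")` is probably not what was intended: it pickles the five-character string, not the variable named `partb`.

```python
import pickle
import tempfile
from pathlib import Path

records = {"partb": [1, 2, 3]}  # placeholder for a real object

# dumps / loads: object <-> bytes, nothing is written to disk.
blob = pickle.dumps(records)
restored_from_bytes = pickle.loads(blob)

# dump / load: object <-> an open binary file.
target = Path(tempfile.gettempdir()) / "records_example.pkl"  # placeholder path
with open(target, "wb") as handle:
    pickle.dump(records, handle)
with open(target, "rb") as handle:
    restored_from_file = pickle.load(handle)

# Pickling the *string* "partb" round-trips to the string, not to any data.
name_only = pickle.loads(pickle.dumps("partb"))
```

Only the `dump` variant leaves a `.pkl` file behind, so a script that calls `dumps` alone will never create one.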
this is how the guy on github did it
shouldn't it create a pickle file in my directory?
bc i see nothing
i copied my folder name in where his path is
i have no idea what's not working here
which snippet?
do macs block pickled files?
pd.to_pickle(partd, "/Users/rahuldas/Desktop/medicare fraud data/partd.pkl")
pd.to_pickle(partb, "/Users/rahuldas/Desktop/medicare fraud data/partb.pkl")
pd.to_pickle(dmepos, "/Users/rahuldas/Desktop/medicare fraud data/dmepos.pkl")
pd.to_pickle(combined, "/Users/rahuldas/Desktop/medicare fraud data/combined.pkl")
basically doxxing myself but whatevs
this should write some files unless it crashes
actually, hmm
I can't find docs about pd.to_pickle existing, actually? there's .to_pickle on dataframes.
that's a method on dataframes, not a function in pandas.
well, that explains why it doesn't do anything.
here's all the code
idk what's causing everything to hang
i don't get a warning or anything
actually i used to but i did ignore warnings
i'm beyond stuck
so i printed partb, partd, and dmepos types on lines 128, 129, and 130
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
so it is a dataframe
now i'm really confused
if it is a dataframe why is it not working?
the problem is the lines below it, but idk what
partb['START_EXCLDATE'] = partb['EXCLDATE'].dt.year
partd['START_EXCLDATE'] = partd['EXCLDATE'].dt.year
dmepos['START_EXCLDATE'] = dmepos['EXCLDATE'].dt.year
it stops around these three lines of code
not sure why, they look fine
at least i was able to figure out the lines causing the problem
i don’t understand what the issue is
.latex This is supposed to be a squared error loss function for fitting a linear function. But doesn't $(\text{h}(x) - y)^2$ have a different derivative than $(y - \text{h}(x))^2$ wrt $w$ or $b$? How can we guarantee that the gradient will be positive if we derive $\frac{d}{dw} (y - wx + b)^2$ and get $-2$ as a coefficient?
(h(x) - y)^2 is exactly the same as (y - h(x))^2 and so has the same derivatives, too.
If it's not obvious, note that when calculating the derivative you'll get 2(y-h(x)) * -dh(x)/dw in the second case and 2(h(x)-y)dh(x)/dw in the first, which are the same thing (minus sign is just in a different place)
Oh, I think I see how you got confused - note that if h(x) = wx+b, then y-h(x) is y-wx-b (not +b).
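Written out for the derivative with respect to $w$, using the corrected sign on $b$, the two forms agree term by term:

```latex
\begin{align*}
h(x) &= wx + b \\
\frac{\partial}{\partial w}\,\bigl(h(x) - y\bigr)^2 &= 2\,(wx + b - y)\cdot x \\
\frac{\partial}{\partial w}\,\bigl(y - h(x)\bigr)^2 &= 2\,(y - wx - b)\cdot(-x) = 2\,(wx + b - y)\cdot x
\end{align*}
```

The $-2$ coefficient from the second form multiplies $(y - wx - b)$, which is the negative of $(wx + b - y)$, so the two minus signs cancel.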
Have you tried SHAP? Would you say SHAP is more "customer-friendly"?
I think the learning curve of Lime & SHAP are similar though.
yeah, but it's a bit difficult using the same transformer model of SHAP and trying to get the values of SHAP to work with Lime
@tidal bough thanks, I'll work it again when I get home 🐱
Does anyone have recommendations for reading material on hyperbolic manifolds?
Or even videos, if they're rigorous
these three lines are not the problem
i have no idea what the problem is
the code should be fine
like what am i missing here
What is the result or error you’re getting?
hello
hi
I was wondering, how do you determine how many input nodes you need to have for a regression model? For a basic Keras Neural networks?
I don't understand. That would be determined by the dimension of your input data right?
The number of features in your data = number of input nodes
if anyone can guide me..i would be grateful
there is no error
Could you share more then? I see the code snippet you ran, but I don’t see what you mean by it not working.
yeah, i’ll show the entire code. just finishing my breakfast now
1st Position Winners of the DataCamp "Create the Perfect Party Playlist" Competition 2023 (Umar & Faizan)
https://paste.pythondiscord.com/CNGA here's all the code
Ok, and what’s the problem? What’s ‘not working’?
the logistic regression confusion matrix isn't showing
not like it does here
is it because i'm missing the savefig line?
I'm just missing the context here. Your code relies on an external file which I don't have.
But, at the very end, you have plt.show() when you should probably have fig.show(), or just fig (which will autodisplay if it's the last line of the cell). plt.show might be fine, though; I use plotly more than matplotlib nowadays.
ah i can't have datasets here, it's against the rules to attach xlsx files
I don't actually want the dataset tho... but I do want to help you debug it
Your first question was about a pandas issue. Do you still have that?
yeah, i don't see anything wrong with those lines
So, from a debugging perspective, the question should be: does the dataframe "look" correct? Not the code, but the actual data.
I usually throw a "display(df)" in, so I can inspect the data
display throws an error
oh you have to import it
LASTNAME FIRSTNAME MIDNAME ... DATA_YEAR_max TARGET START_EXCLDATE
0 NaN NaN NaN ... NaN 0 2020.0
1 NaN NaN NaN ... NaN 0 1988.0
2 NaN NaN NaN ... NaN 0 1997.0
3 NaN NaN NaN ... NaN 0 1994.0
4 NaN NaN NaN ... NaN 0 2011.0
... ... ... ... ... ... ... ...
69646 NaN NaN NaN ... 2021.0 0 NaN
69647 NaN NaN NaN ... 2021.0 0 NaN
69648 NaN NaN NaN ... 2021.0 0 NaN
69649 NaN NaN NaN ... 2021.0 0 NaN
69650 NaN NaN NaN ... 2021.0 0 NaN
Yah, so is that what you expected? I doubt it. Add a few more displays and figure out the earliest place the data looks wrong.
it seems correct to me
Why so many null values?
i have no idea honestly
later on in the above github notebook, he fills the NaNs with 0s
Where did you insert that display? And which df was it?
part b
So, first thing: Open part_b_data.xlsx and compare it to your dataframe. Are the first names really null in the spreadsheet?
Are the first 5 firstnames null?
(or last 5? .. the display you showed displays the first 5 and last 5 lines)
nope, they're all actual names
So, at line 69 you have a print('hi'). Can you change this to display(part_b)?
(line 69 according to https://paste.pythondiscord.com/CNGA)
Unnamed: 0 Rndrng_NPI ... Bene_CC_Strok_Pct Bene_Avg_Risk_Scre
0 0 1003000126 ... 0.14 1.8026
1 1 1003000134 ... 0.03 1.0785
2 2 1003000142 ... 0.06 1.4920
3 3 1003000423 ... NaN 0.6362
4 4 1003000480 ... NaN 1.8233
.. ... ... ... ... ...
995 995 1003059544 ... 0.00 1.1879
996 996 1003059676 ... 0.05 1.3646
997 997 1003059684 ... NaN 1.3586
998 998 1003059783 ... 0.12 2.6992
999 999 1003059866 ... 0.17 2.9189
it's like it was never read correctly
or maybe it was
because there are a lot of columns in between
there's the NPI, then the first name, last name
Yah, so then look at line 128.. at that point, you've just renamed and reordered the columns
so that's what's causing NaN values?
No, I don't know yet. We're just trying to isolate it by displaying the df at various points.
i see
Just basic debugging: start at the beginning, and display/print the variables to make sure the data looks correct
hi how u doing guys, i took a class in AI programming and i need some resources to learn. prof suggested some but they're kinda 1980 style, a lot of theory and kinda not beginner friendly, and im not really big into ai type stuff
Ai is a lot of theory, and the 1980 stuff is probably the more basic models that you want to learn and understand
thank you though
What kind of resources do you expect then? Don't mean it in a harsh way, but making a model in Python is not the same as understanding it.
So going through the theory is just an important, sometimes tedious step of the process
d/dw((w x + b - y)^2) = 2 x (b + w x - y)
d/dw((y - w x - b)^2) = -2 x (-b - w x + y)
So you're saying that
2 x (b + w x - y) = -2 x (-b - w x + y)
?
I don't find that hard to believe, even if I'm not entirely sure why
yes, because -(-b - w x + y) = b + wx - y
That is, if ca = cb, then a = b, where in this case c is -1?
Sure, I suppose? It's just the distributive property: -(-b - w x + y) = (-1)*(-b - w x + y) = (-1)(-b) + (-1)(-wx) + (-1)(y) = b + wx - y
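For anyone who prefers checking this numerically rather than symbolically, here is a rough sketch that compares both forms of the loss against the analytic result using a central finite difference. The sample values and function names are arbitrary choices for the example.

```python
def loss_h_minus_y(w, x, b, y):
    return (w * x + b - y) ** 2


def loss_y_minus_h(w, x, b, y):
    return (y - w * x - b) ** 2


def numeric_dw(loss, w, x, b, y, step=1e-6):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss(w + step, x, b, y) - loss(w - step, x, b, y)) / (2 * step)


def analytic_dw(w, x, b, y):
    return 2 * x * (b + w * x - y)


w, x, b, y = 1.5, 2.0, -0.5, 4.0  # arbitrary sample point
grad_first = numeric_dw(loss_h_minus_y, w, x, b, y)
grad_second = numeric_dw(loss_y_minus_h, w, x, b, y)
grad_exact = analytic_dw(w, x, b, y)
```

All three numbers should agree up to floating-point noise, because the two losses are the same function of `w`.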
ah
okay
let's say you have two features
but you make a model with 10 input nodes
what does the machine learning model do then? what does it use for the other 8 input nodes
Hey yall i have a quick question about writing OLS formulas in python. Im kinda new and Im trying to see whether 3 variable columns affect the outcome of another variable. But i noticed that the dependent variable is categorical or boolean, since its a column that shows whether it received an award or not. I got an error when i tried ols.fit
ols_formula = "Award ~ imdb_rating + C(genres) + tomato_rating"
I noticed that award is boolean. Is there a way i can get around this?
you get an error because it expects an input array with 10 columns and you gave it 2
an input "node" is just another name for what you might call a "feature" of each input item
fit a logistic regression instead of a linear regression
Hi guys Im wanting to start on my first machine learning project is there any website I can use for this I know of kaggle just any tips would be really appreciated
Hey guys, how would I go about making a model that uses live video feed to describe objects as well as its distance relative to the camera? how would I then be able to take those generated descriptions and convert it to audio?
you can make a detection model (the one that classifies objects and draws bounding boxes around them)
then based on the coordinates of the bounding box for an object you can calculate the area and tell how far away it is from the camera
each object would have a different size bounding box areas close to and far from the camera, so you would have to account for that
those are just some ideas :)
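One way to turn a bounding box into a distance, sketched under explicit assumptions: it uses the pinhole-camera relation with box height instead of box area, and it needs the object's real-world height and the camera's focal length in pixels, both of which would have to come from a per-class lookup and a calibration step that are not part of the discussion above. The function name and the sample numbers are hypothetical.

```python
def estimate_distance_m(real_height_m, focal_length_px, bbox_height_px):
    """Pinhole-camera estimate: distance = real height * focal length / pixel height."""
    if bbox_height_px <= 0:
        raise ValueError("bounding box height must be positive")
    return real_height_m * focal_length_px / bbox_height_px


# A 1.7 m tall person whose box is 400 px tall, seen by a camera
# with an (assumed) focal length of 800 px.
person_distance = estimate_distance_m(1.7, 800.0, 400.0)
```

Because each object class has a different real size, the lookup of `real_height_m` per detected label is what accounts for the "different size bounding boxes" caveat.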
not sure about the audio part though
(detector model below)
Also, unrelated, my question is this:
should I stop training at around epoch 18? This is a linear regression classification model
or wait
my interpretation is that I shouldn't because the model seems to be overfitting later...
but please explain why I should let it train later for longer if I should
Ohhhh ok I see thank you so much that sounds like a great idea thank you so much!!
Hi.
Just a simple question for the pandas dataframe library + sns plotting.
I have a music dataset. among other columns, I have danceability (float from 0-1) that describes how "danceable" a song is
genre and
tempo_category tempo category is just a categorical variable of 8 degrees ranging from "Extremely slow" to "Extremely Fast"
I am able to plot the danceability of each tempo category as shown with
sns.set(style="whitegrid")
ax = sns.boxplot(y=music["Danceability"],x=music["Tempo Cat"],showmeans=True,
meanprops={"marker":"o",
"markerfacecolor":"white",
"markeredgecolor":"none",
"markersize":"3"})
ax.get_figure().autofmt_xdate()
What I don't know how to do: How do I plot it but grouped by the genre?
Meaning I have the same x and y axis, but now I have e.g. ~18 plots (1 for each genre)
Beyond individually saving each category to a new dataframe variable to replot, i am not sure how to do this.
I thought about doing a for loop with this enclosed but when i tried the newer plot just overrides the older one
curGen = "HIPHOP"
sns.set(style="whitegrid")
ax = sns.boxplot(y=music.loc[music["Genre"] == curGen]["Danceability"],x=music.loc[music["Genre"]==curGen]["Tempo Cat"],showmeans=True,
meanprops={"marker":"o",
"markerfacecolor":"white",
"markeredgecolor":"none",
"markersize":"3"}).set_title(curGen)
ax.get_figure().autofmt_xdate()
i got it!
Explained here:
https://discord.com/channels/267624335836053506/1165963881936592957
Any links where I can read about data quality plans for python?
what's a data quality plan?
Train_Allpatientdata=pd.merge(Train_Outpatientdata,Train_Inpatientdata,
left_on=['BeneID', 'ClaimID', 'ClaimStartDt', 'ClaimEndDt', 'Provider',
'InscClaimAmtReimbursed', 'AttendingPhysician', 'OperatingPhysician',
'OtherPhysician', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2',
'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5',
'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8',
'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10', 'ClmProcedureCode_1',
'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4',
'ClmProcedureCode_5', 'ClmProcedureCode_6', 'DeductibleAmtPaid',
'ClmAdmitDiagnosisCode'],
right_on=['BeneID', 'ClaimID', 'ClaimStartDt', 'ClaimEndDt', 'Provider',
'InscClaimAmtReimbursed', 'AttendingPhysician', 'OperatingPhysician',
'OtherPhysician', 'ClmDiagnosisCode_1', 'ClmDiagnosisCode_2',
'ClmDiagnosisCode_3', 'ClmDiagnosisCode_4', 'ClmDiagnosisCode_5',
'ClmDiagnosisCode_6', 'ClmDiagnosisCode_7', 'ClmDiagnosisCode_8',
'ClmDiagnosisCode_9', 'ClmDiagnosisCode_10', 'ClmProcedureCode_1',
'ClmProcedureCode_2', 'ClmProcedureCode_3', 'ClmProcedureCode_4',
'ClmProcedureCode_5', 'ClmProcedureCode_6', 'DeductibleAmtPaid',
'ClmAdmitDiagnosisCode']
,how='outer')
i don't quite understand left_on and right_on
it looks like the tables are the same, so why all the extra arguments?
at least for once all the code works on my machine
Is this the right chat for a question I have about reinforcement learning?
yeah, go ahead
Ok cool thank you. I am trying to train an AI to play Pokemon Red with the pyboy emulator's API for python. My model currently has these parameters:
num_cpu = 16
ep_length_multiplier = 2
ep_length = 2048*ep_length_multiplier
model = PPO('CnnPolicy', env, verbose=1, n_steps=ep_length // ep_length_multiplier, batch_size=128, n_epochs=5, gamma=0.999, learning_rate=0.003, gae_lambda=.98)
What happens is it starts out strong, using my exploration reward system to find its way outside the starting house and eventually to the first real reward event. It is rewarded heavily for this and then continues to get rewarded if it continues down the right path. However, the furthest it gets after a couple iterations is making it through to where you pick a pokemon, and beat your rival, then starts walking toward route 1.
The problem then arises that it does this only sometimes, no matter how many iterations/episodes I run. It seems to learn and get better and more efficient at making its way here (which is by far the most rewarding path) but then proceeds to search other paths infinitely. If I run 16 cpus, about 4 on average will make their way to the rewarded path, but never improve in efficiency past a certain point, and the rest all get lost and bang their heads against a wall.
My thought was that the reason was exploration/exploitation was too imbalanced, but I've tried decreasing exploration rewards, decreasing entropy coefficient (to a negative number because it was already 0), tried raising and lowering gamma and gae-lambda.
Nothing has worked and I think I'm misunderstanding something because if I'm to get the AI to actually eventually beat the game, it needs to very heavily favor going down the same very rewarding path almost every time and I'm not sure how to make it do that any more strictly than I already am. Obviously you need a bit of exploration so it doesn't get stuck in a local maximum, but this still seems too random or high exploration/entropy
What do you call it when you compare all the columns to each other in a grid of scatterplots? I want to see something like this image.
Thank you.
insurance charges, age, and bmi vs. each other. It seems like they charge considering what age bracket someone falls in.
I might be interpreting that incorrectly.
Yeah I am.
I wonder what would cause those three distinct lines to show up.
the diagonal? that is plotting the feature against itself
or did you mean age X charges
I mean age x charges
Which model can I use to determine how similar two short (around 5-10 words) pieces of text are?
my impression is pretty much that there are three tiers of charges, each of them scaling mostly linear with the age, but the number of points is a bit absurd - it's very possible that 99% of the points are in the lowest "line"
try using some transparency (alpha)
Are you looking for something like this: https://spacy.io/usage/linguistic-features#vectors-similarity?
alpha = 0.25
It looks like you were right.
personally I'd use even more
What do we use these days to display tables in jupyter-notebook?
I have tried QGrid, but can't get it to work in the latest versions of python.
Is there anything similar?
I want to be able to sort columns and search
if you just want to spam your outputs, offtopic
for discussion of the technical aspects, here
you may want to look for servers more specifically focused on it instead though
My use case is I have some bullet points (let's say around 5) in text 2 and there is prose text (text 1)
i want to determine which of the points in text 2 was expressed in the text 1
would spacy vectors similarity work for this use case?
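To make the matching step concrete, here is a rough stand-in that scores each bullet point against each sentence of the prose with cosine similarity over plain word counts. This is a swapped-in technique: it is not spaCy and it only sees shared words, whereas word vectors would also pick up paraphrases. The function names and the sample texts are invented for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_matches(points, sentences):
    """For each bullet point, return (index of best sentence, similarity score)."""
    results = []
    for point in points:
        scores = [cosine(word_counts(point), word_counts(s)) for s in sentences]
        best = max(range(len(sentences)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results


points = ["the model overfits the data"]
sentences = ["We saw that the model overfits the training data.", "The weather was nice."]
matches = best_matches(points, sentences)
```

A bullet point would then count as "expressed" when its best score clears some threshold, and that threshold has to be tuned on real examples.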
https://github.com/widgetti/ipyaggrid is my go-to.
If someone is well-versed here in PPO models in python and why my model is doing this, I would be open to a voice chat to show what's happening if text seems laborious
How do we interpret Root Mean Squared Error of log-transformed responses/targets? Is it possible to get back to the original units like USD instead of log USD?
I found
https://stats.stackexchange.com/questions/371529/interpreting-rmse-of-log-values#:~:text=As the RMSE is in,values and the true values.
but it's hard for me to understand and they don't talk about the inverse.
The top answer was pretty clear to me, what part didn't you understand?
you can't obtain the RMSE on the original scale from only the RMSE on the log scale, because many different sets of predictions and errors can produce the same RMSE on the log scale but different RMSEs on the original scale
consider y_pred = [20, 10] and y_true = [10, 5]
then consider y_pred = [200, 100] and y_true = [100, 50]. same RMSE as above in log scale, totally different in original scale
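To make the two examples above concrete, here is a minimal sketch using only the standard library. The helper names `rmse` and `log_rmse` are made up for illustration. It shows both cases giving the same log-scale RMSE while the original-scale RMSE differs by a factor of ten.

```python
import math

def rmse(pred, true):
    """Root mean squared error on whatever scale the inputs are in."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def log_rmse(pred, true):
    """RMSE computed after taking the natural log of predictions and targets."""
    return rmse([math.log(p) for p in pred], [math.log(t) for t in true])

small_pred, small_true = [20, 10], [10, 5]
large_pred, large_true = [200, 100], [100, 50]

log_small = log_rmse(small_pred, small_true)
log_large = log_rmse(large_pred, large_true)
usd_small = rmse(small_pred, small_true)
usd_large = rmse(large_pred, large_true)

print(log_small, log_large)  # both are log(2), up to floating-point rounding
print(usd_small, usd_large)  # the second is ten times the first
print(math.exp(log_small))   # about 2: every prediction here is off by a factor of 2
```

So the log-scale RMSE is better read as a relative error: exponentiating it gives a rough "typical multiplicative factor" between prediction and truth, not an amount in USD.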
Hi! I'm using the CNN layers of a pretrained VGG16 model and adding some dense layers to predict the 100 classes in CIFAR-100. What's an effective way to change all input sizes from 32x32 to 224x224?
You can use PyTorch's torchvision Resize transform as part of your data loader.
It's a pretty seamless way to do this and the PyTorch docs have lots of examples on this.
Guys I would like to learn to program ai, but it seems like something unattainable, where can I start?
do you know python?
Sorry, I should've mentioned this in the query itself, but I'm using TF and Keras due to assignment constraints... The way they showed in the example was ImageDataGenerator().flow_from_directory (it had a parameter for size conversion)... But now I have the arrays from tf.keras.datasets.cifar100.load_data as numpy arrays
Hi folks,
Which machine learning library do you use the most: TensorFlow, PyTorch, or Keras?
youtube, its that simple : 10 min Ai tutorial python
is there any Python code available to measure vessel radius values in 3D images?
hey guys, not a Python-related question, but can anyone come on VC and guide me? I want to understand how MongoDB is used for analytics, or how a person works with NoSQL databases
Hello I want someone to review with me a notebook please
y
I've looked at a few, but they always leave something unfinished, I'd like a more complete and structured guide
Anyone have tips on fine-tuning GPT-2? I'm fine-tuning a 124M model using the gpt-2-simple module, but I don't really know when I start overfitting. I know you can check the validation loss, but I don't think that's possible with this module, unless there's some code for it
I am using Pinecone in my app. It retrieves information based on the latest text, but I want to retrieve information based on the whole conversation context, not just a similaritySearch. How should I do this?
the information retrieved is still going to be focused on a particular query right? with the whole conversation as additional context?
In that case you could use something like ConversationSummary based RAG
that makes the entry more accessible. There are machine learning libraries with very simple Python APIs, so you could get your own models running without much effort
but to get good learning value, I would suggest reading up on the math and theory behind how these models work and how they're implemented
For that you could pick up a book
O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow is generally a good book with a mix of both hands-on coding and some accompanying theory
for more solid theory, you could try deeplearningbook.org, mml book, statlearning.com
these are all free
I know you can't read anything on that I'm sorry
import logging
import sklearn.linear_model
import sklearn.model_selection

def cross_validate(df, predictors, response):
    estimator = sklearn.linear_model.LinearRegression()
    scores = sklearn.model_selection.cross_validate(estimator, df[predictors], df[response])
    logging.info(scores["test_score"])
Output
[1. 1. 1. 1. 1.]
Is this telling me that I got an r-squared of 1.0 for all five (5) folds of cross validation?
seems like so
(side note; cross_validate uses the default scoring metric of the model if you don't specify one, and the default scoring metric of the LinearRegression is r-squared)
there probably is some sort of data leak if I had to guess? unless it's a borderline trivial problem
Oh yeah there was a data leak. Thanks.
hello
i was watching a video on intro to huggingface and sentiment analysis
and followed each step side by side
but the prediction i am getting is not expected
any insights would be appreciated
I want to learn matplotlib. I currently have a horizontal bar graph but the names on the y axis are cut off. Does anyone have any resources for styling charts and how to save them as images (I want to import them into a PDF file)?
Use the savefig method to save your charts as images; passing bbox_inches="tight" usually keeps long axis labels from being cut off in the saved file
hey guys, if I have results like this, does that mean my models are overfitting?
my data has 12 features and I took 50,000 observations to train my models
this is my main function
the official documentation has a lot of examples and information in the user guides, it's a little bit scattered but there's a lot of information in there
also stackoverflow can have a lot of useful advice, but it can be hard to search for things. you might need to try several combinations of search terms
Where should I start to learn genetic algorithms in Python and implement them in practice?
Hello, i'm new here. I would like to get some help with a land price prediction algorithm i am currently working on
I have a string of binary values, 0 and 1, that I converted to this string of dark and light circles so that it could visually aid me with identifying patterns (if any) in this string. Here, the dark circle represents a 0, and the light one represents a 1. I cannot find any patterns just by looking at this string. Are there any tools out there that can help me identify patterns in this string? I'm ready to learn anything to be able to identify any patterns in this string. For more context, the aforementioned string of 0s and 1s represents the outcomes of an experiment that I conducted where each outcome could take on a binary value.
Hmm... an idea might be to try to compress the string and see if it gets any smaller, and if it does there probably exists a pattern somewhere in there
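A rough sketch of that compression check, using only `zlib` from the standard library. The helper names `pack_bits` and `compression_ratio` and the two sample strings are invented for illustration. The bits are packed eight per byte first, because a text string of "0" and "1" characters compresses a lot no matter what (each character carries only one bit), and a random string of the same length serves as the baseline.

```python
import random
import zlib

def pack_bits(bits: str) -> bytes:
    """Pack a '0'/'1' string into bytes, eight outcomes per byte."""
    padded = bits + "0" * (-len(bits) % 8)  # pad the tail to a whole byte
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def compression_ratio(bits: str) -> float:
    """Compressed size divided by packed size; well below 1 hints at structure."""
    packed = pack_bits(bits)
    return len(zlib.compress(packed, 9)) / len(packed)

rng = random.Random(0)
random_bits = "".join(rng.choice("01") for _ in range(4096))  # baseline with no pattern
periodic_bits = "0110" * 1024                                  # obvious repeating pattern

print(compression_ratio(random_bits))    # close to (or slightly above) 1
print(compression_ratio(periodic_bits))  # far below 1
```

A ratio near the random baseline doesn't prove there is no pattern, since zlib mostly picks up repeated substrings and skewed byte frequencies, but a ratio clearly below the baseline is a decent hint that something is there.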
actually I think sequential pattern mining is the name of the topic you're looking for
Sequential pattern mining is a topic of data mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. It is usually presumed that the values are discrete, and thus time series mining is closely related, but usually considered a different activity. Sequential pattern mining...
whats the appropriate data-structure for avoiding a full-scan while doing this operation:
def search_name(prefix: str):
    return next((x for x in my_names if x.startswith(prefix)), None)
the first thing that comes to mind is how a DB would index that column: a b-tree, but whats the equivalent in Python?
A "trie"?
AKA a prefix tree.
A trick that I've seen is to use a defaultdict:
```py
from collections import defaultdict

# Each node is a defaultdict that creates child nodes on first access
Trie = lambda: defaultdict(Trie)
STOP = object()  # sentinel key marking the end of a complete word

def add(trie, word):
    for letter in word:
        trie = trie[letter]
    trie[STOP]  # touching the key is enough to create it

def contains(trie, word, just_prefix=False):
    for letter in word:
        if letter not in trie:
            return False
        trie = trie[letter]
    return just_prefix or STOP in trie
```
Or you could try this library: https://pypi.org/project/pygtrie/
A more pythonic way might be to just have a dictionary mapping prefixes to sets of words that start with that prefix.
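A minimal sketch of that prefix-dictionary idea, assuming the names fit comfortably in memory. `build_prefix_index` and the sample names are hypothetical, and `search_name` here takes the index as an extra argument. The index is built once and each lookup is a single dict access instead of a full scan.

```python
from collections import defaultdict

def build_prefix_index(names):
    """Map every prefix of every name to the set of names starting with it."""
    index = defaultdict(set)
    for name in names:
        for end in range(1, len(name) + 1):
            index[name[:end]].add(name)
    return index

def search_name(index, prefix):
    """Return one matching name, or None if nothing starts with the prefix."""
    matches = index.get(prefix)  # .get avoids creating empty entries on misses
    # Sets are unordered, so pick the alphabetically smallest match to be deterministic
    return min(matches) if matches else None

my_names = ["alice", "albert", "bob"]
index = build_prefix_index(my_names)
print(search_name(index, "al"))
print(search_name(index, "z"))
```

The trade-off is memory: every name is stored once per prefix, so this works best for short strings or modest collections. Unlike the generator version in the question, it does not return the first match in list order.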
I'm using spaCy and I want to fine-tune a named entity recognizer (basically I want to fine-tune en_core_web_trf)
how do I do that?
I'd rather not train my own NER because that would be too expensive on the servers
Thanks for pointing me to this topic. It's going to be helpful!
thats a very neat implementation, thanks!
Anyone here any good at data cleaning?
I'm drowning here and I need someone to throw me a life vest
All I need to know is how to decide what to do with columns that have missing data:
When do I remove the whole column?
When do I Impute the data?
When do I just drop the rows?
If the column is mostly missing you might as well drop the entire column
If there's only a few missing then dropping the rows won't hurt too much
How you impute the column depends a lot on what it represents
Right now I'm going with
- >60% missing I'm dropping the whole column
- 5-60% missing I'm using KNN imputer
- <5% missing I'm just dropping the rows
Does that sound logical?
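For reference, the cutoffs from the message above (drop the column past 60% missing, impute between 5% and 60%, drop rows below 5%) can be sketched as a small triage step. This is plain Python over a list of dicts, with made-up rows and the hypothetical helpers `missing_fraction` and `triage`. The thresholds are just that poster's heuristic, not a general rule.

```python
def missing_fraction(rows, column):
    """Share of rows where `column` is absent or None."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)

def triage(fraction, drop_above=0.60, impute_above=0.05):
    """Turn a missing-data fraction into a suggested action (heuristic only)."""
    if fraction > drop_above:
        return "drop column"
    if fraction >= impute_above:
        return "impute"
    if fraction > 0:
        return "drop rows"
    return "keep"

# Invented sample: OWN_CAR_AGE is 70% missing, INCOME 20% missing, AGE complete
rows = [
    {
        "OWN_CAR_AGE": 5 if i >= 7 else None,
        "INCOME": None if i < 2 else 1000 + i,
        "AGE": 30 + i,
    }
    for i in range(10)
]

plan = {column: triage(missing_fraction(rows, column)) for column in rows[0]}
print(plan)
```

With a pandas DataFrame, `df.isna().mean()` gives the same per-column fractions. Either way the output is only a first pass to review column by column, since what a missing value means differs between features.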
Would It help if I showed you a visualization of my data I've produced to help me decide which columns to drop?
You might need to download it to read it but it basically summarises my issue right now. I've got columns with varying correlation strengths to my target variable but they all have varying levels of missing data too
I'm using spearman for the correlations on that chart btw
unfortunately I don't think there's a magical range that works well for all features
Yea that is what is driving me nuts
I just don't know what to do other than use the force 😆
You might want to first think about what each column represents before deciding what to do
Do you mean the distributions?
Like for example (if I'm reading correctly), there's lots of missings for OWN_CAR_AGE, but missing in this case may or may not mean they don't own a car
In that case it might make sense to set all missings to 0
I went through and did this to the dataset already earlier
# Replacing all the values found to represent 'nan' in the dataset with np.nan
raw_applications.replace(['Unknown', 'XNA', 'not specified'], np.nan, inplace=True)
I'm not sure if it was a good idea though
I do apologise for the state of that graph, I know it's almost impossible to read
It's kinda the nightmare dataset. I've only just realized recently that some of the column headers are in Bulgarian