#data-science-and-ml
anyone know the efficient way?
just need the date time portion
I've tried df["created_at"] = df.created_at.replace(tzinfo=None)
to the best of my knowledge, some people did use boxplots in their time series plots. Although, in my opinion, it's not really "pretty".
I personally used boxplots to identify the existence of outliers.
in the end, it's a matter of preference
Not talking about boxplots, but rather Box-Pierce test or Ljung-Box test
oh my god, I misread it haha
Time series is not a plot but rather a data format for predictive analytics
No worries 😛
Okay GPT-3 is brilliant
```
Sad :(
Crying :'(
Laughing :D
Surprise :O
Angry :@
```
It generated the smilies for Laughing surprise and angry itself
Should Malay Peninsula Standard Time be abbreviated as MST in the standard time zone format (like PST, GMT, CEST)?
to remove it i use a lambda map with a function:
```
from datetime import datetime

def funct(your_str):
    splitted = your_str.split(" ")
    day = splitted[0] + " " + splitted[1]
    return datetime.strptime(day, '%Y-%m-%d %H:%M:%S')
```
and do df[your_ts] = df[your_ts].map(lambda x: funct(x))
it will automatically update to datetime type
or afaik MST is UTC+8, so add 8 hours before returning the time
dunno if there is another elegant way to do this
So I’m dumb and have little to no idea how to do this, but how would I make an AI that can recognize my pet rabbit among other rabbits and tell that it’s him? I know basic Python. When neural networks do stuff like this, do they plot points on the images and look for patterns? Any help would be greatly appreciated. I really only know basic Python, so please explain things.
You could do this without apply
But if you want to use apply, you don't need to use a lambda when the function takes exactly one argument. .apply(funct) would suffice.
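For the original question (keep only the date-time portion and drop the timezone), a vectorized version might look like the sketch below. It assumes `created_at` holds timestamp strings that carry a UTC offset; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: timestamp strings carrying a UTC offset.
df = pd.DataFrame({"created_at": ["2021-08-18 10:15:00+08:00",
                                  "2021-08-18 23:59:59+08:00"]})

# Parse the whole column at once; the result is timezone-aware.
aware = pd.to_datetime(df["created_at"])

# tz_localize(None) drops the timezone and keeps the local wall-clock time.
df["created_at"] = aware.dt.tz_localize(None)

print(df["created_at"])
```

If the time should be shifted to UTC before the timezone is dropped, `aware.dt.tz_convert(None)` does that instead.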
that's golden 💰 imma keep it till I die 🤣
someone make it into a copypasta
ooh, spicy stuff going on here 🌶️ https://www.reddit.com/r/MachineLearning/comments/p6hsoh/p_appleneuralhash2onnx_reverseengineered_apple/
we already have the collider projects on github. Can't wait to spam all those images to get them viral and half the population arrested 🤣
I'm looking for a good book that covers time series analysis that's more focused on the mathematics. Does anyone have some suggestions? Ideally free, but I won't mind paying for a solid book.
i forgot this
sorry, got busy at work yesterday. where is the part in this code that does the "don't buy if i already bought it" logic?
also, i don't know much about trading strategies, but it sounds like your strategy doesn't allow for the possibility of buying more of $FOO after you've already bought $FOO. is that right?
no problem, I appreciate you remembering 🙂
yes, that's correct
SMA = simple moving average? moving average of what?
and RSI is this? https://www.investopedia.com/terms/r/rsi.asp
so this python code is your whole trading strategy? good on you for being willing to share it, instead of being under the delusion that if you share it you're going to leak your genius secrets to the world and lose out on your gains 😛
since my question yesterday I've made it a lot faster. I used vectorization to remove as much data from the dataframe as possible (areas where it will not buy/sell), then I used list comprehension for the rest of the data I needed to manually iterate through
lmao
this was from my testing yesterday
```
Dictionary: 10.56 mins
To_list: 4.16 mins
zip list comprehension: 4.04 mins
vectorization & zip list comprehension: 3.36 mins
pre-vectorization & zip list comprehension: 1.13 mins
pre-vectorization & better list comprehension: 0.86 mins
```
Does anyone know where the idea behind word embeddings comes from, especially the kind of embeddings the TensorFlow Embedding layer produces? I would like to cite the idea. I already figured out that a lot of work happened over decades, but I'm still not sure who I can cite as the author of the idea. If anyone has an idea, feel free to @ me. Thanks 🙂
that sounds great. i'd love to see your updated version
you can cite the original word2vec papers as one of the early popular implementations, but i don't think the idea has a single "originator"
https://arxiv.org/abs/1301.3781
https://arxiv.org/abs/1310.4546
I'm working on the next step in the project right now. When it's done the program will have to run for many hours most likely, so in that time ima clean up the code and i'll send you the updated code
sure, would be happy to see what you did
if your code is all "numeric" (i.e. no strings, dicts, etc), you can probably get significant speedups by running with numba in "nopython" mode
Thanks, will have a look 👍
yeah I looked at that yesterday but if I'm being honest it looked very difficult to install all the stuff for it and get it running properly. I know I'll have to do it eventually but I thought I'd just wait for now xD
hmmm... OpenAI didn't look like they did much filtering
it's not at all. usually you can just do pip install numba and get going with it
yeah that's what I did, but to put it simply, it didn't work as well as I was hoping xD
But i'll probably have to use it soon
hey, does anyone know where I can find a good written explanation of why normalisation doesn't work on SMOTE data? I know it is because we are already normalising an unbalanced class, but I was wondering if there is a better explanation out there that I could use.
what kind of normalization?
Does anyone have experience with creating a Monte Carlo simulation? I need to create a 2D model of a blood vessel in the brain (simple: a central circle enclosed by 2 barriers with some permeability, i.e. either a chance for a molecule to get through based on a dice roll, or trapping and releasing after some time). In the middle I would have a stream of new particles (simulating blood passing through), and I would need to do a random walk while recording the number of particles in each of the 3 compartments (inside, between barriers, outside). I was thinking about using something like pygame to do the simulation, but I would prefer doing a LOT of particles, and efficiency is key since I'm running it on a laptop.
use pypy
store your classes with __slots__ (although this is mostly a no-op in pypy)
that said, if you can implement this as a loop over a numpy array, you can probably do even better with numba, than with pypy
that, or maybe you can repurpose BUGS or Stan to do the heavy lifting for you; i've only used those for bayesian probability modeling so i wouldn't know how
e.g. in numba it might look something like this:
import numba

@numba.njit
def run_simulation(n_samples):
    n_inside = 0
    n_between = 0
    n_outside = 0
    for i in range(n_samples):
        # Do your complicated stuff here
        ...
but it sounds like maybe you can "vectorize" this simulation? e.g. pre-generating a big list of random values with np.random and then doing cumsum-type calculations thereupon. if you can post the actual algorithm for the simulation i can probably help more
depends of course on your performance requirements too
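To make the "pre-generate random values and cumsum them" idea concrete, here is a minimal sketch for a free 2D random walk. The particle count, step count and step size are arbitrary assumptions, and it deliberately ignores the barriers: once a step depends on where the particle currently is, a plain cumsum no longer applies directly.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so the run is repeatable
n_particles, n_steps, step_size = 1000, 500, 1.0

# One random direction per particle per step.
angles = rng.uniform(0.0, 2.0 * np.pi, size=(n_steps, n_particles))
dx = step_size * np.cos(angles)
dy = step_size * np.sin(angles)

# Cumulative sums turn the per-step displacements into full trajectories.
x = np.cumsum(dx, axis=0)
y = np.cumsum(dy, axis=0)

# Distance of every particle from the origin at every step.
r = np.hypot(x, y)
print(r.shape)  # (500, 1000)
```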
Before pypy I would recommend just having anything working first
@desert oar lots of info here, unfortunately I got the task like 30 minutes ago and I don't even know how to set the model up such that the particles interact with the barrier yet 😦 once I figure that out I can start working on optimisation
Then you can think about jit/compilers/etc. later
Yeah I recommend having a minimum working example before optimisation
Although the faster you get a minimum working example, the faster you can optimise
any ideas on how to do random walk simulation with a barrier that has some permeability? all the random walk tutorials I found do it in free space and only a few particles, I need constrained space with a ton of particles
ofc getting it done with just one is fine for now, I just don't know how to do the particle-wall interaction because I feel like that is game design and I have no experience in that
which is why I thought about using pygame at first...
hm... if the barrier is circular, could I make the particles do the classic random walk and, after each step, check whether the starting position was less than r from the centre and the new position more than r? If so, the particle would only move with some probability, and otherwise not move at all?
does that sound good as barrier simulation?
https://mathematica.stackexchange.com/q/57561/16075
https://mathematica.stackexchange.com/q/49063/16075
here are mathematica demos of such a thing
in your case, what's the logic? "particle hits boundary, then p% particle goes through it vs bounces off"?
? Just have a probability of it passing/not passing?
ye that's probably the easiest in terms of theory
I think that's going to be my initial try, I am simulating a particle going from blood through a barrier into a different area and then again into another area so probabilities like that make sense to me
I might eventually have to add in a bias in the walk that would make it more likely to walk away from the centre but that's down the road
this is just 2d though right? circle inside, infinite area outside, particle has p chance to pass outside the circle when it hits the boundary?
yeah I think 2D will be enough in this case just for simplicity
I can try drawing it
Hey @flat hollow!
It looks like you tried to attach file type(s) that we do not allow (.heic). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
no one likes apple's .heic 😦
@desert oar something like that? There could be a central starting point, then 2 circular barriers with p1 and p2 probabilities of particles passing through them and infinite space outside (may need to constrain it eventually, have to talk to supervisor first)
Seems possible
Actually thinking about it, it might be possible to solve without simulation assuming circles, but don't quote me on this
Im not doing the math for that 😄 I just want a nice simulation that is repeatable, people can see the evolution and I can plot the results nicely for my (hopefully) future paper 🙂
can the particles bump into each other?
and do the particles have spatial extent or are they just points?
for the simplicity of getting at least something done I would do point-like and no collisions, but eventually I would probably have to add collisions and Im not sure about dimensions, would have to check the tables
I have molecules inside a blood vessel, do point-like particles work with collisions? 😄
doesn't sound like they would...
https://colab.research.google.com/drive/1Lof26snWVk6wm3Y-sRtKCYrGZvFdHCwP?usp=sharing I'm having error how to correct this?
@flat hollow do the whole thing in polar coordinates
a particle isn't an (x,y) pair, it's an (angle,radius) pair
then a "collision" with the barrier is just particle.radius >= barrier_radius
the "particles" can be a 2d numpy array, and you can also use an array to track which particle is in which circle, so you don't have to recompute it at every step
particle_positions = np.array([
    [radius0, angle0],
    [radius1, angle1],
    ...
])
particle_circles = np.array([
    0,  # inside the inner circle
    1,  # between the circles
    2,  # outside the outer circle
    ...
])
a "step" could be something as simple as a fixed-size step in a random direction
just need to do some trig to figure out the new angle and radius after a step
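One way the step could be sketched in NumPy is shown below. Everything here is an assumption for illustration: the function name `step`, a single pass probability shared by both barriers, a step size smaller than the gap between barriers, and the rule that a blocked particle simply stays where it was instead of bouncing.

```python
import numpy as np

def step(radius, angle, barriers, p_pass, step_size, rng):
    """Advance every particle by one fixed-size step in a random direction.

    radius, angle : 1-D arrays with the polar coordinates of each particle
    barriers      : increasing barrier radii, e.g. [1.0, 2.0]
    p_pass        : probability that a barrier crossing is allowed
    """
    # Go through Cartesian coordinates to apply the step, then back to polar.
    direction = rng.uniform(0.0, 2.0 * np.pi, size=radius.shape)
    x = radius * np.cos(angle) + step_size * np.cos(direction)
    y = radius * np.sin(angle) + step_size * np.sin(direction)
    new_radius = np.hypot(x, y)
    new_angle = np.arctan2(y, x)

    # Compartment index: 0 inside, 1 between the barriers, 2 outside.
    old_zone = np.searchsorted(barriers, radius)
    new_zone = np.searchsorted(barriers, new_radius)

    # A particle that would change compartment only moves with probability p_pass.
    blocked = (new_zone != old_zone) & (rng.random(radius.shape) >= p_pass)
    new_radius = np.where(blocked, radius, new_radius)
    new_angle = np.where(blocked, angle, new_angle)
    zone = np.where(blocked, old_zone, new_zone)
    return new_radius, new_angle, zone

rng = np.random.default_rng(0)
radius = np.zeros(10_000)            # all particles start at the centre
angle = np.zeros(10_000)
barriers = np.array([1.0, 2.0])
for _ in range(200):
    radius, angle, zone = step(radius, angle, barriers, 0.001, 0.1, rng)
print(np.bincount(zone, minlength=3))   # particles per compartment
```

Separate probabilities p1 and p2 for the two barriers would need a per-barrier lookup on top of this.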
this sounds fun, simplifies even particle-particle collision
though i often use manhattan distance when doing that stuff
idk how well it works for non-point particles
i also can't think of how to calculate the next step without going back to cartesian coordinates 🤦
i'll have to write it out on paper
oh i think that's actually how you do it
yeah, but you can vectorize that stuff conveniently --- while particle-particle collision is probably some python loop, dunno
ah, my terrible math skills are catching up to me, the one thing that Im not sure about right now is the particle tracking...
unfortunately in 7 hours Im driving 1,5 hours and then walking for 8 more hours so I need some sleep, thanks a lot for your time @desert oar
of course, i'm happy to procrastinate on my own work with this 😛
that would be my guess, but if that gets slow you can drop down to numba and do it
adding this to a long todo list
Minmax, standard etc
@pine wolf @flat hollow https://paste.pythondiscord.com/bihipocuri messing around a bit
need to figure out how to animate
i haven't heard that before, but i can imagine that there is a problem with using simulated data to estimate things like the maximum, mean, etc. i would guess that you should do those things before oversampling
Are you looking for this? https://en.wikipedia.org/wiki/Brownian_motion
Or just really simple, bunch of particles bouncing around and when they hit the border they have some chance of being repelled or passing through.
💩 sorry, was doing some scaling while typing, i mean sqrt, cbrt, log etc
ok, this is with a .001 probability of passing the barrier
so you take a initial set of points and simulate random forces?
ohh, brownian motion
but why? 🤔
Brain tickler - I have a set of 2 angles (in rads) can anyone think up a way to represent two angles with a single number?
complex(angle_1, angle_2)
the function has to be reversible too 👈
A real number, that is
unless there's some constraint on your angles, i don't think there's a reversible method for arbitrary angles
usually those numbers would have a high decimal accuracy too
yea, I was thinking that way too
guess I will just do multi-output regression ¯_(ツ)_/¯ returning the 2d vector in the end...
wouldn't the banach tarski theorem suggest that such a function exists, albeit probably not one that we can comprehend or even write the definition of?
gonna share that code? 👀
matplotlib animations are being a pain
bless you
let's see how bad mine is compared to yours
i haven't done the random diffusion collision part yet, got sidetracked w/ animations
i just hacked apart your code
i hacked apart my code too 😛
might require some knowledge of nurses_2 particles
i mean the size of the sets RxR and R are the same, so there's definitely a mapping somehow --- i'm not gonna dream one up
it's pretty python-specific -- there's better in the c-domain
fair enough
if notcurses python bindings get improved and they add support for windows terminal, i'll probably make a nurses_3
depends
we're not talking arbitrarily, right
we are
then how are you storing them
the angles have a pretty high degree of accuracy
you can take every digit of one angle and every digit of another angle and zip them together
but some numpy number type
thought so too
or rather...
angle_1 = .01010101010...
angle_2 = .5959595959...
compressed = .051905190519...
this would work
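A tiny sketch of the digit-zipping idea, working on digit strings so no floating-point rounding gets in the way. The helper names are made up, and it assumes both inputs have the same number of digits:

```python
def interleave(a: str, b: str) -> str:
    """Zip the digits of two equal-length digit strings into one."""
    if len(a) != len(b):
        raise ValueError("both digit strings must have the same length")
    return "".join(x + y for x, y in zip(a, b))

def split(c: str) -> tuple[str, str]:
    """Undo interleave(): even positions belong to a, odd positions to b."""
    return c[0::2], c[1::2]

angle_1 = "010101"   # digits after the decimal point of 0.010101
angle_2 = "595959"   # digits after the decimal point of 0.595959
compressed = interleave(angle_1, angle_2)
print(compressed)            # 051905190519
print(split(compressed))     # ('010101', '595959')
```

Note that the combined number has twice as many digits as either input, so it needs twice the precision to store.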
oh, and the number has to be real and function reversible
it wouldn't work if both numbers are @ max double precision
it would still work, you would halve their precision before compressing
as in they have different lengths?
which means you lose information
therefore not reversible
just use a higher precision float for the final
well, they are both angles so the function should be sensitive to both and actually retain their information
and you don't lose information
A simpler method could be just to output a 2D vector
but I wanna see what you guys come up with
if your initial values are float64 you're out of luck
then use float128
why is that my fault
it's not, but this is not possible
the constraints were stated @ the start
@grave frost realistically speaking though
are you going to need all the digits?
you can create a new type that stored 128 bits, and then you can represent your 2 64 bit numbers as a single 128 bit number
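In plain Python the "two 64-bit numbers as one 128-bit number" idea could be sketched with `struct` and an arbitrary-size `int`. The function names are hypothetical, and the result is an integer holding the raw bit patterns, not a float you can do arithmetic on:

```python
import struct

def pack_pair(a: float, b: float) -> int:
    """Store the raw 64-bit patterns of two floats in one 128-bit integer."""
    return int.from_bytes(struct.pack(">dd", a, b), "big")

def unpack_pair(n: int) -> tuple[float, float]:
    """Recover both floats exactly from the 128-bit integer."""
    return struct.unpack(">dd", n.to_bytes(16, "big"))

packed = pack_pair(0.0101010101, 0.5959595959)
print(packed.bit_length() <= 128)   # True
print(unpack_pair(packed))          # the original pair, bit for bit
```

This is reversible and keeps all the information, but only because the output has 128 bits to work with.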
yep,
- Function has to be reversible
- output should be Real
- Information from both angles should be preserved in the output
15 is generally a lot
you can't store 8 bits of information in 4 bits though
maybe not - but approximations would lose on error
you could do this if you're willing to give up easy vectorisation
this seems much more practical
i mean, i'm not recommending this
this seems like an XY problem, but i don't address this issue
commission a custom processor that can natively handle quad precision floats
its not really a huge problem per se - I can get by without solving it
nice. anyone up for some funding?
like, what problem do you solve by using one float to represent two?
I've got a couple of thoughts and prayers
convenience, apparently
but I also think it's not the right way to do it
nothing really, as I said thats much more to tickle your brains
especially when you have to use resources to go to and from the representation
nah resources dont matter
just asking mathematically here. 2 angles, 3 conditions
you can do it with infinite precision floats
yeah
maybe there's something with some transformation?
mathematically, it is defo possible
In C you can just create 128bit floats directly (GCC x86, x86-64). Idk if numpy supports it.
not as of stable, AFAIK
would torch even work with those, if torch's own tensor is not available at 128b
but the point here is that floats, more than many other things, are a leaky abstraction
Can of course just wrap the 128 bit floats with cython.
I don't think that its actually fully 64b
Doing it manually without the hardware support would be super slow.
its def in middle
like a little more than float32
technically, isn't it a 2D array - so a vector that could, with a transformation, be transformed in such a way as to get a 1D line?
so we could reverse the transformation and theoretically get the same thing back, and the transforming matrix has no eigenvectors
This using one number to represent two thing seems like something one would find in some old commodore 64 code or something so maybe look in that area if you really want an answer.
But it does sound useless even if it's doable, much like the xor variable swap.
it was bugging me, so i vectorized the conversion back to cartesian coordinates
I've got a question in Chocolate if someone can help out.
@finite wasp idk what chocolate is, but you should always just ask your question. Asking if someone knows about the topic of an unasked question is less helpful than putting the real question out there.
Looks like I may have misread what you had said 🤷‍♂️
I got mine humming along with polar, i kind of like the idea of not going to cartesian except for plotting. Figuring out the angle of reflection off the barrier will be interesting though
i just didn't change the radius at all when crossing barrier, instead of reflecting
I saw
the lazy way
i don't think you have to add too much to reflect
or maybe you do
could just give the particles a velocity and then all you have to do is negate it
To reflect i think you'd have to find the distance from point to barrier, compute the tangent of the circle at that point, get the perpendicular line to that, then compute the angle of reflection around it. Then reposition the particle accordingly
Or yeah don't use reflection rules and just reverse direction lol
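If the particles do carry a velocity, a proper reflection is short to write in Cartesian form: subtract twice the component along the barrier normal. This is only a sketch; the function name is made up and it assumes the barrier is a circle centred on the origin and that `position` is the contact point.

```python
import numpy as np

def reflect(velocity, position):
    """Reflect velocity vectors off a circle centred on the origin.

    velocity, position : arrays of shape (n, 2); position is the point of
    contact on the barrier, so the outward normal is position / |position|.
    """
    normal = position / np.linalg.norm(position, axis=1, keepdims=True)
    # Remove twice the normal component: v' = v - 2 (v . n) n
    v_dot_n = np.sum(velocity * normal, axis=1, keepdims=True)
    return velocity - 2.0 * v_dot_n * normal

v = np.array([[1.0, 1.0]])
p = np.array([[2.0, 0.0]])      # contact point on a barrier of radius 2
print(reflect(v, p))            # [[-1.  1.]]
```

Simply negating the velocity, as suggested above, sends the particle straight back the way it came instead.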
All this probably goes out the window if the blood cells have spatial extent anyway
Has anyone ever deployed a ML model on a chrome extension, looking to do that but haven't found many resources 
Hey @errant flare!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
Quick question: I have a .csv data file structured kinda like this, with 150 responses (https://docs.google.com/spreadsheets/d/12z4FBN8_mW3T7I4WPb2JbLx9WGxVnelfNspRwVMKF9Q/edit?usp=sharing). Is this in any way worthy of a linear regression or multivariate regression model? And if so, on the basis of what independent and dependent factors? Kinda new to AI and data science, so yeah, a little confused here
And if it isn't, are there any other models or ways through which I could build a predictive model with this kind of dataset?
I'd appreciate any and all direction / help, thanks once again!
Sample_Survey
Timestamp,Which Country are you from?,Age,Gender,Name a random source of information,On a Scale of 1 to 10, how much blah blah,On a Scale of 1 to 10, how much blah blah,On a Scale of 1 to 10, how much blah blah,On a Scale of 1 to 10, how much blah blah,On a Scale of 1 to 10, how mu...
in order of increasing complexity (and usually increasing quality), you can look at regression trees, random forest, adaboost, gradient boost, xgboost
statquest on youtube has a good explanation of all of them, and usually he or others have implementation examples in python
could i though uh try to have a regression model with the age variable in comparison to the 1 to 10 number being the dependent variable?
cuz for some reason they were a little specific on having a regression model
i'm not sure why
you can have whatever you want be your inputs and whatever you want be your output
hmmm
I've never really done multivariable linear regression but I think that's also a thing
there are advantages and disadvantages to every method
yep i don't have that many independents though for multivariable
anyways thanks for the help, gotta go now so yeah!
gl!
Do any of you know where the best place is to learn data science, and will age be a barrier?
are you old enough to work in your country?
I'm talking about learning it.
I got a while before I can work.
if you are old enough to understand language and abstractions
then you're old enough to learn
I've done a lot of java so I'm familiar with programming languages
Java
🥴
ew
sorry...not a Java fan
but yeah
you can defo start learning
how're you @ mathematics?
in particular, statistics
linear algebra?
statistics knowledge is important for data science
it depends on what you wanna do, though
it's a wide field
depending on your specialisation
graph theory
linear algebra
calculus
all might be relevant
Do you know where I can learn data science?
nope
there's tons of stuff online
gpt-j is nice
what the hekk is KeyError
in pandas
god its driving me nuts
nvm i got it but still why
you probably were looking for a column that wasn’t there
yeah deleted a piece of code in my notebook by mistake
and sometimes there's apparently a space in the csv's name which I didn't notice
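If the stray space is in a column header, a quick check and fix could look like the sketch below. The CSV content and the column name `price` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV whose second header has a stray leading space.
csv_text = "date, price\n2021-01,1.2\n2021-02,-0.4\n"
data = pd.read_csv(io.StringIO(csv_text))

print(list(data.columns))      # ['date', ' price']  -> data["price"] raises KeyError

# Strip whitespace from every column name, then the lookup works.
data.columns = data.columns.str.strip()
print(data["price"].tolist())  # [1.2, -0.4]
```

Printing `list(data.columns)` is usually the fastest way to see what pandas thinks the names are.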
Hey there, i'm really interested in Data Science, but how should I begin?
do you already know Python or just getting started overall?
Well, I already know python
mostly this channel's pins then
Sorry, english is not my first language, could you try to elaborate?
Check the Pinned messages 📌
Thanks!
uh quick question
i'm doing this in python with pandas
```
soi = data["SOI"]
```
and the output i get has the index column in it, any way i can get rid of it?
Try out
soi = data.SOI.values
why do you want to do that?
data.reset_index()['SOI'] should do
i wanna use the index for other parts of the program but i wanna isolate only the soi column without the indexes so yeah
and soi = data["SOI"].values works in that regards i think so yeah thanks!
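A small sketch of the same thing with `.to_numpy()`, which the pandas docs recommend over `.values`. The frame here is a made-up stand-in for the real `data`.

```python
import pandas as pd

# Hypothetical stand-in for the real `data` frame.
data = pd.DataFrame({"SOI": [1.2, -0.4, 0.7]}, index=[2019, 2020, 2021])

soi = data["SOI"].to_numpy()   # plain NumPy array, no index attached
print(soi.tolist())            # [1.2, -0.4, 0.7]

# The original frame still carries its index for the rest of the program.
print(data.index.tolist())     # [2019, 2020, 2021]
```

If the goal is only to hide the index when printing, `data["SOI"].to_string(index=False)` leaves the Series itself untouched.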
can I ask you a question
if you train a transformer on the bible
what would be the result?
wait what
would it come up with new data or would it take input and put it into context
a transformer?
a transformer is an NLP model.
have no experience with nlp
but from a couple of google searches i'm assuming it would come up with new data
that would resemble somethings that the bible has
because apparently transformer models are used for text summarization
would it be an accurate representation of the bible
so that's taking the idea of the text but coming up with technically new data
yeah no i highly doubt that it would be
or would it be a new idea.
yea... I was wondering what kind of model I would need to have to answer questions directly from say the bible, torah or the Quran.
where it has to be onpoint in the interpretation of questions, verses and choice of answers.
hmmm maybe then
transformers are all the rage these days to be honest, so I thought it would be a good choice
have you tried gpt-j?
EleutherAI web app testing for language models
try this
i have no clue how to help but uh here? https://towardsdatascience.com/nlp-building-a-question-answering-model-ed0529a68c54
yep just gave it a try hehe
the bible i'm assuming is quite vast so it's not like only 5 or 6 questions that you have to answer
```
For, it seems that the church has no plans for a Christmas celebration this year. Instead, the Vatican is proposing a celebration of the end of the world.
Last week a Vatican statement said that the Pope is planning to address the United Nations on December 21st, in order to address the "global environmental crisis".
It said that the Pope will urge the world's leaders to work for a "dramatic reduction" in carbon emissions.
In his address, the Pope is
```
the prompt was In god we trust, said the Pope to lol
in addition it isn't like the questions from a bible are like "accurate" as in when someone asks a question they aren't looking for a specific value it could be varied
probably very tough to be honest seeing how there might not be one answer to a question asked in regards to religion
yea it's interpretations and rulings etc. The same model would essentially be able to derive legal rulings and supporting verses from say the constitution or penal code.
and some that could be quite wrong
yea.. that's also a problem
hm...
yeah if you're building an NLP model to interpret something and not just spit the same thing out, that automatically throws accuracy to the root text out the window I think
but there should be general consensus.
it should interpret based on the root text.
and a supporting corpus perhaps.
based on what i know from law there's cases based on cases based on the constitution.
ahh it's alright.. I'll look into it through a search
take care
Hey guys, I have outliers in my covid dataset and I don't know how to deal with them. Take active cases: among regions like Usa, Nw, the Usa value is above 100k or so, which affects the mean too. The outlier is actually a valid value, since you really can have a number of cases in the 100k range, so how do I deal with this type of outlier?
hey need some help with a simple numpy question
i asked this some day ago as well but couldnt figure it out
yes
what have you tried so far
well i have been taught stack, concatenate, arange and some other basic numpy stuff
and i havent been able to do anything lol
idk how
because this isnt a hard one
i had someone help me with slicing and stuff a few days ago
actually i got the first function now
#!/usr/bin/env python3
import numpy as np

def get_row_vectors(a):
    return [i[np.newaxis, :] for i in a]

def get_column_vectors(a):
    s = [a[:, i] for i in range(len(a))]
    return [i[:, np.newaxis] for i in s]

def main():
    np.random.seed(0)
    a = np.random.randint(0, 10, (4, 4))
    #a = [[5, 0 ,3], [3, 7, 9]]
    print("a:", a)
    print("Row vectors:", get_row_vectors(a))
    print("Column vectors:", get_column_vectors(a))

if __name__ == "__main__":
    main()
this is what i got atm
this almost passes the tests. it says this though:
"FAIL:
RowsAndColumns: test_column_count
3 != 5 : Wrong number of columns"
ur talking about (n, m)
yes
kk
if u use that list that is hidden with "#"
my code fails
it says list indices must be integers or slices, not tuple
what is [:,np.newaxis] in numbers?
how do i fix my code
fixed it god damn this one sucked
if in doubt print everything haha
yep that helped it
i have another one but i have to try it first
it looks pretty hard tho
that doesnt work
oh?
I didn't remove range if that's what you mean
just fine-tune a model on the bible
fine-tune a pre existing model?
yes
did u copy my code
or create it from scratch purely based on the bible?
but won't it make assumptions from outside the scope of the bible?
yes just added three chars to your get_column_vectors function
show what u did
def get_column_vectors(a):
    s = [a[:, i] for i in range(len(a[0]))]
    return [i[:, np.newaxis] for i in s]
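For reference, a variant that also accepts the commented-out nested list might look like this. It is a sketch, not the exercise's official solution: `np.asarray` converts list input first, and slicing with `i:i + 1` keeps the extra axis that `np.newaxis` was adding.

```python
import numpy as np

def get_row_vectors(a):
    a = np.asarray(a)                      # also accepts nested lists
    return [a[i:i + 1, :] for i in range(a.shape[0])]      # each is (1, m)

def get_column_vectors(a):
    a = np.asarray(a)
    return [a[:, j:j + 1] for j in range(a.shape[1])]      # each is (n, 1)

a = [[5, 0, 3], [3, 7, 9]]
print(len(get_row_vectors(a)), len(get_column_vectors(a)))   # 2 3
print(get_column_vectors(a)[0].shape)                        # (2, 1)
```

Using `a.shape[1]` for the column count is what makes non-square inputs work, which is what the failing test was checking.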
I couldn't find any channel for machine learning doubts, so should I post them here?
yes
Topic - Decision Trees in ML in Python
Given two different datasets, a training dataset and a testing dataset, the instruction was to model a decision tree on the training dataset, make predictions, select the best or the most ideal value of max_depth for the tree and then compare the results with the testing dataset.
I thought of splitting the training dataset, writing the training algorithm inside a loop over an arbitrary range, then select the result with the best accuracy and the corresponding max_depth. Is this a good way to get the best value of max_depth?
I would be happy to get suggestions.
And the similar process for other machine learning algorithms too.
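The loop described above is essentially what scikit-learn's `GridSearchCV` does, with cross-validation in place of a single split. A minimal sketch, with the iris data standing in for the real training and testing files and an arbitrary depth range of 1 to 10:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris stands in for the real data; the split mimics separate train/test files.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# 5-fold cross-validation on the training data only, one fit per candidate depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best max_depth:", search.best_params_["max_depth"])
print("test accuracy:", search.score(X_test, y_test))
```

The test set is only touched once at the end, so the chosen `max_depth` is not tuned against it. The same pattern works for other estimators by swapping the model and the parameter grid.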
how to change the legend when i plot this way
Is there a written source or any book where I can learn python image processing?
I'm so bored watching videos
arXiv
Classify text with small data set
Hello, I am looking for ideas and knowledge. My task is to classify very particular legal text sentences, and my training set has 1200 classified sentences. I have to classify them into 4 or 5 classes, I mean 4 or 5 because I know what is the problem.
My vocabulary is around 20k unique words (filtered by min_df=10), and I have tried classifying with BERT, CNN and SVM+TF-IDF.
The length of my sentences is close to 512 words, although I can change it.
My scores on the test part of 300 sentences are close to 65% (precision, recall, F1, etc.).
I don't know what else to try; please help me with links, papers or anything on text classification with a small data set.
what about it are you trying to classify?
the topic of the sentences? something else?
1200 isn't a lot. i strongly recommend pursuing dimension reduction, so that your model has fewer parameters to learn
not easily
plotting with pandas is convenient but you lose customisability
hello all. i am facing problem related to installation of pandas. please help
@fallow prism some options:
- use word2vec, glove, fasttext, bert, etc. to generate sentence vectors, then logistic regression
- PCA on the count-vectorized sentences, then logistic regression
- Factorization machine on the count-vectorized sentences
Logistic regression with L2/"ridge" regularization and linear SVM are somewhat interchangeable; mathematically they amount to the same model with a different loss function. There's also L1/"lasso" regularization and elastic-net regularization which is a blend of ridge and lasso. The differences should be fairly minor among all of these, although you can efficiently compute the entire "regularization path" for elastic-net and lasso, and you can efficiently compute "generalized cross validation" for ridge. Generally I tend to prefer logistic over linear SVM anyway because you also get a decent probability model out of it.
basically all 4 models are the same, minimizing the difference between y and w*x , but with different loss functions
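As a rough sketch of the "reduce dimensions, then logistic regression" option: the pipeline below uses TF-IDF features and `TruncatedSVD`, which is swapped in here as the sparse-friendly stand-in for PCA. The corpus, labels and `n_components=3` are made up for illustration; on the real 1200 sentences the number of components would need tuning.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; the real input would be the 1200 labelled sentences.
texts = [
    "the tenant shall pay rent monthly",
    "the landlord may terminate the lease",
    "the buyer shall pay the purchase price",
    "the seller shall deliver the goods",
    "rent is due on the first day of the month",
    "goods are delivered at the buyer's risk",
]
labels = ["lease", "lease", "sale", "sale", "lease", "sale"]

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),   # PCA-like reduction for sparse input
    LogisticRegression(C=1.0, max_iter=1000),       # L2 ("ridge") regularization by default
)
model.fit(texts, labels)
print(model.predict(["the tenant shall pay the landlord"]))
```

`model.predict_proba(...)` gives the class probabilities mentioned above as the reason to prefer logistic regression over a linear SVM.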
can anyone help?
                crf        bilstm     bert       crf        bilstm     bert       crf        bilstm     bert
                precision  precision  precision  recall     recall     recall     f1         f1         f1
micro animal    0.930068   0.890407   0.843404   0.702475   0.781577   0.854810   0.800407   0.832450   0.849069
      dose      0.711624   0.668271   0.636848   0.445432   0.617051   0.763121   0.547908   0.641640   0.694290
      exposure  0.853591   0.809184   0.642202   0.542343   0.695919   0.675735   0.663268   0.748290   0.658542
      endpoint  0.685054   0.705040   0.650032   0.367747   0.512205   0.617647   0.478584   0.593348   0.633426
macro animal    0.930068   0.890407   0.843404   0.702475   0.781577   0.854810   0.800407   0.832450   0.849069
      dose      0.711624   0.668271   0.636848   0.445432   0.617051   0.763121   0.547908   0.641640   0.694290
      exposure  0.853591   0.809184   0.642202   0.542343   0.695919   0.675735   0.663268   0.748290   0.658542
      endpoint  0.685054   0.705040   0.650032   0.367747   0.512205   0.617647   0.478584   0.593348   0.633426
I want the columns to be ordered by (crf, bilstm, bert) and then (precision, recall, f1) within those three groups.
calling sortlevel on the mulitindex for the columns worked.
and reindexing from there.
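For anyone with the same column-ordering problem, here is a small sketch of one way to do it, assuming the column MultiIndex has the model on level 0 and the metric on level 1. The frame below is filled with placeholder numbers, not the scores from the table.

```python
import numpy as np
import pandas as pd

models = ["crf", "bilstm", "bert"]
metrics = ["precision", "recall", "f1"]

# Columns arrive grouped by metric: (crf, precision), (bilstm, precision), ...
cols = pd.MultiIndex.from_tuples([(m, k) for k in metrics for m in models])
df = pd.DataFrame(np.arange(18).reshape(2, 9), columns=cols)

# Spell out the wanted order explicitly and reindex the columns to it.
wanted = pd.MultiIndex.from_product([models, metrics])
ordered = df.reindex(columns=wanted)
```

Building the target order with `from_product` avoids relying on alphabetical sorting, which would put `bert` before `crf`.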
@desert oar @pine wolf you guys are wizards 😄 I was worried the whole day about coding in polar coords with numpy (havent done it before) and after a long day I come back to what seems like a working model? I will try to download nurses_2 and run both codes tomorrow, I am completely exhausted after today's hike.
@desert oar thanks for including links to explanations ❤️
so in a couple of weeks im supposed to start an ai with python course
which i have been invited to
and i dont know anything about how those two work together and how to work with them as is
I have finished a couple of python courses and somewhat decently know python but i have no idea about its use in ai
can i get some help with that?
Aren't you supposed to learn all that on the course? Or is it a non-beginner course?
I just want to know what to look for
since this is the place i reached out to when i was beginning my initial python course
Just give me a little info about it so i can understand more of it when it comes to it
i hope u all are okay
i'm wondering
does anyone know a platform where i can do data analysis interviews ? like leetcode but for data analysis
i know about kaggle
but im asking is something like leetcode
Is this a good channel for asking help with Telegram bots?
what is a telegram bot?
Need a bit of help with an ordinal regression. Dv= 10 point scale or 0-10 and IV= are a 4 point scale of 1-4. I have two independent variables that are a 2 point scale, yes/no. I am wondering if I can keep these variables because the basis of my analysis assumes that there is 1 order of magnitude between 1-2,2-3,3-4 etc but with binary it's more like 0-100.
Is there a book, I can read up a bit on to understand how to best set up an ordinal regression?
unless you're super interested in terminal graphics, i might recommend using pygame or kivy to render this, i just used something that was convenient for me -- but i did update it with a nice background circle:
thank bro its just what i was looking for
very cool, I would also need to keep track of the number of particles in each of the 3 areas and a 2nd barrier... so much to do 😦 but now at least I have an example to work from 🙂
does the diffusion rate increase as the diameter decreases?
Hey guys, I want to create a new column for a dataset, but I'm having a small issue here. I've used the Close column to get the EMA9 and EMA21 columns. However, I've noticed that those EMAs aren't properly aligned (I want the EMA9 for the day 09/19/2014, index 2, to be at index 1, something that can be done in Excel).
I've tried doing this by removing the first row of EMA9 with
data['EMA9'] = data['EMA9'].drop([0])
However, this only makes the index 0 in EMA9 to become NaN. I've also tried using the argument inplace = True, but this results in the entire column being replaced by NaN.
Can someone lend me a hand?
🤷 that's probably one of the outcomes of the simulation? I just need a randomwalk with the barriers with some permeability
my guess is it must --- as diameter decreases there should be more collisions with it, increasing the odds that a cell passes through
hm... wouldnt that be solved by adding particle collisions?
and perhaps some momentum calculations?
i don't think it matters
i think that's just what happens in simulation or real life, probably
probably good, so our fingers don't asphyxiate
oh sorry, I misunderstood your question, I think yes, for the same number of particles there should be an increase
so you just want to remove the 2nd row in column EMA9?
Just the 1st index and move everything below it up in EMA9 column.
data['EMA9'].shift(1)
can't remember if it shifts forward or backwards, so shift(1) or shift(-1)
I think they want to keep index 0 intact
For the other columns, yes. But for EMA9 and EMA21 I want the index 0 to be removed.
ah, then shift should work
Yes, it worked. Thank you!
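A quick sketch of which direction `shift` moves things, using a made-up four-row `Close` column (the numbers and the `EMA9_up` column name are assumptions for illustration):

```python
import pandas as pd

data = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})
data["EMA9"] = data["Close"].ewm(span=9).mean()

# shift(-1) moves every value up one row; the last row becomes NaN.
data["EMA9_up"] = data["EMA9"].shift(-1)
# shift(1) moves every value down one row; the first row becomes NaN.
data["EMA9_down"] = data["EMA9"].shift(1)
```

The other columns are untouched, so only the shifted column loses a value at one end.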
Keep in mind that if you're working on a price prediction model or any trading model, you're now looking at the next timeframe's data in that same row, so you're basically looking at a future value - so practically you would not be able to use that value in a real scenario to trade
Yes, thanks for the advice. I was trying to get my dataset lined up with the charts where I got the data from. But, now that you've mentioned it...I should probably rethink.
At least now I know how to modify datasets rows like this. I had to do this once with another dataset and ended up just opening the DataFrame in an excel file to do this.
how to data science?
try reading "data science from scratch"
u mean...scratch pl?
No, it's a book
maybe a hard question to answer but... does anyone know the major difference between k-modes clustering and multiple correspondence analysis? they both seem to have similar results and methods, but I'm not sure how to interpret them
can someone help me with numpy and specifically concatenate and how to use it
it combines arrays into one but what else can it do
That's the point of it
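A tiny sketch of what `np.concatenate` does along each axis, using an identity matrix from `np.eye` and a block of zeros as example inputs:

```python
import numpy as np

a = np.eye(2)          # 2x2 identity matrix
b = np.zeros((2, 2))   # 2x2 block of zeros

# axis=0 stacks the arrays on top of each other (more rows).
stacked_rows = np.concatenate([a, b], axis=0)
# axis=1 places them side by side (more columns).
stacked_cols = np.concatenate([a, b], axis=1)

print(stacked_rows.shape, stacked_cols.shape)
```

All inputs must agree on every dimension except the one being joined.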
well i got a question thats hard for me and i gotta use concatenate in it
and some other things
What is it?
Do you know how to use the eye function?
yea i know what it does
i havent used it though
yeah cant do this without some help seems a bit too hard
data['EMA9'] = data['Close'].ewm(span=9).mean().shift(-11)
like this?
Yes, thanks!
out of curiosity why do you want it aligned like that?
Just so it matches the chart where I took the data from.
Hey @timid grove!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
oh, if you want it to match without shift use this:
data['EMA9'] = data['Close'].ewm(span=9, adjust=False).mean()
Nah, what I wanted was indeed shift.
with adjust enabled, each value is a weighted average whose weights are normalised over the observations seen so far; with it disabled it uses the recursive form y_t = (1 - alpha) * y_{t-1} + alpha * x_t, as a technical indicator on a chart would
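To see what `adjust` changes numerically, here is a small sketch that compares pandas' output with a hand-written recursion on a made-up price series. `alpha = 2 / (span + 1)` is the relationship pandas documents for `span`.

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0])
span = 9
alpha = 2 / (span + 1)

# Hand-written recursion: y_t = (1 - alpha) * y_{t-1} + alpha * x_t
recursive = [close.iloc[0]]
for x in close.iloc[1:]:
    recursive.append((1 - alpha) * recursive[-1] + alpha * x)

ema_recursive = close.ewm(span=span, adjust=False).mean()
ema_adjusted = close.ewm(span=span, adjust=True).mean()

print(np.allclose(ema_recursive, recursive))   # the two agree
print(ema_adjusted.iloc[1], ema_recursive.iloc[1])  # these differ early on
```

The two variants converge as more observations accumulate, so the difference mostly matters near the start of the series.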
It's just that the EMA9 for day X in the chart was registered as the EMA9 for day X+1
I'd consider shift a workaround since you'd have to change the value depending on the window you're using
in this i downloaded a dataset of approx 800 images , but this algorithm giving only 102 images as output other images are taken into consideration...
Hm...I tried using adjust = False, but it didn't work.
That's interesting, currently slowly working on a technical analysis library and didn't run into that issue for the EMA
anonymous
in their case they literally needed a 1-day shift i think, no?
are you keen on anonymous group and data science?
no , could you please help me resolve this
From original question yes, they confirmed your solution of shifting by -11 though (probably a typo?), so thought it might have meant something different
i'm a noob, sorry
np!
oh hah yes, the 11 was a typo
it was supposed to be -1
Yea, that's why it just confused me a bit, thought they might have the centring issue. Weird that the chart they're pulling from has it offset of 1 though
i suspect there's something missing in their explanation, but 🤷♂️
perhaps they meant
data['EMA9'] = data['Close'].shift(-1).ewm(span=9).mean()
help
Hey. Using matplotlib.pyplot as plt how can I change the appearance of the date tick labels in my graph x-axis without re-writing my code in the way demonstrated in the docs: https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html
?
I'm not using fig, ax = plt.subplots() and then using fig or ax to control my graph. I'm just using plt so the ax methods are not available to me in the way demonstrated.
ax = plt.gca()
then go from there
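A minimal sketch of that route, assuming the plot was drawn with plain `plt` calls. The dates, the every-second-day locator, and the `"%d %b"` format are arbitrary choices for illustration.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

days = [dt.date(2021, 8, d) for d in range(1, 11)]
plt.plot(days, range(10))

# Grab the axes that plt has been drawing on, then use the ax methods.
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.DayLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%d %b"))
plt.gcf().autofmt_xdate()  # rotate the labels so they do not overlap
```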
in general, though
I would advise against using plt methods
it makes it a lot harder to do stuff
Thanks, and yes, I'm beginning to see this.
i generally use plt for quick things, and switch/refactor to fig, ax for more involved tasks
or use seaborn
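$$\mathrm{Huber}(x) = \begin{cases} \tfrac{1}{2}x^2 & \text{for } |x| \le \delta, \\ \delta|x| - \tfrac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$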
what do i need to learn to understand this lol
math notation
i wouldnt know
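Reading the piecewise definition as code may help: quadratic for small inputs, linear for large ones. A short NumPy sketch, where the function name and the default `delta=1.0` are my own choices:

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber function: 0.5*x^2 when |x| <= delta, delta*|x| - 0.5*delta^2 otherwise."""
    x = np.asarray(x, dtype=float)
    quadratic = 0.5 * x**2
    linear = delta * np.abs(x) - 0.5 * delta**2
    return np.where(np.abs(x) <= delta, quadratic, linear)

print(huber([0.5, 3.0]))
```

The two branches meet at |x| = delta, which is why the curve has no jump there.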
Hello 😄 I have a questions about Pandas and trying to figure something out (and I'm having very little luck finding out how).
Question: How do I aggregate data within a CSV (Add 2 cells together from different rows based on a common value)
Example:
Employee ID:, Box Type:, Count Per Order
0001, Large, 4
0001, Small, 2
0001, Large, 2
0001, Small, 3
0020, Small, 1
0020, Small, 2
I want to be able to calculate
Employee 0001 - Large - 6
Employee 0001 - Small - 5
etc
How would I go about doing this? or would I use something besides Pandas?
😀 Idk why but I also prefer Seaborn to Matplotlib. The syntax is more direct and more 'customer friendly'
you can use the groupby method
df.groupby('Box Type').sum()
Would this keep them separated by employee ID too?
if u want that as well do
df.groupby(['Employee ID', 'Box Type']).sum()
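Here is that groupby run on the example rows from the question. The CSV is inlined as a string, the trailing colons are dropped from the header names, and the ID column is read as text so the leading zeros survive; those are my assumptions.

```python
import io

import pandas as pd

csv_text = """Employee ID,Box Type,Count Per Order
0001,Large,4
0001,Small,2
0001,Large,2
0001,Small,3
0020,Small,1
0020,Small,2
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={"Employee ID": str})

# One row per (employee, box type) pair with the summed counts.
totals = (
    df.groupby(["Employee ID", "Box Type"], as_index=False)["Count Per Order"]
    .sum()
)
print(totals)
```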
Guys I need help.. I want to learn data science and ai I really want to learn but the problem is i don't have anyone around me interested in programming at all.. When I make new project or solve the problem i faced for two weeks there is no one who can celebrate with me.. At least I want someone I can make projects with.. I know it's silly problem I have.. But I just can't do it all by myself..
What do you guys think? Have you ever faced my situation? Any advice?
Thank you so MUCH! I had gotten "Employee ID" but i didn't even think about adding "Box Type" to the groupby #TrueHero lol thank you!
lmaoo np
theres plenty of passionate people in this disc u can celebrate with
i have a question about tree based models like random forest and gradient boosting
do any of these methods have the same kind of persistence that ANNs have?
like an indefinitely long window for training
anyone know why tensorflow recommends installing from pip? it seems like a lot of pain to install it. i am wondering why not use a conda install?
use a conda install if you have it.
Pip is the standard package installer so the default instructions are going to be for installing stuff using it.
Have you used conda? I never have, but I don't really see the advantage. It might be because virtual environments have existed the entire time I've been using python.
i use conda extensively. it's been absolutely fantastic, however at the same time, it's not "necessary" to use. You can easily do "most" of what you need without conda as well.
Where conda truly shines however, is two things: 1. it's not just a python package installer. it's a binaries installer. for any binary that's built for it. That means that conda makes trivial some installs that would be absolutely miserable without it. (actually, if im being completely honest, conda has a branding issue. this tool is so insanely good at what it does, but doesn't get enough emphasis on this aspect). the 2: closely related - because conda is a binaries installer, it can control python installations itself. This means your environments made in conda encompass multiple python versions too, and make it trivial to have different versions of python in different environments with zero friction.
Of course, this is all on top of also supporting pip installs. So there's genuinely zero downsides in that sense that i can think of.
to keep it simple though, if someone is already using conda, they should use conda installs before pip installs.
Like what darr said conda is able to install some lower level binaries that support some of python packages. I dont think i have needed to install manually although i dont know if this is optimal for your system. I would assume if you take the time and find the most optimal libraries for your architecture you could get better performance
I see. Thanks!
specially when working with different versions of cuda & cudnn
I just babble about stuff with my family, even if they don't care or understand.
I second Guitar's response, even if people around you normally don't care about your field, the ability to explain what you're doing to a layman audience and get them excited about it is a great skill to have. If you talk to them about the project as you're slowly building it, they will also be ready to understand the more involved things you needed to have done.
if you want to do team projects, it's not necessary that the other person knows python. Try to exploit the domain knowledge of the other person and as a bonus you end up learning a lot from them.
Does anyone know how I can scale the x-axis labels with matplotlib.pyplot.hist()? I want the histogram to look exactly the way it does now, except that I want to display my x-axis ticks in % rather than fraction, so I would need to scale by a factor of 100
Got it. Since it seems common enough, for anyone else:
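import matplotlib.ticker as ticker
scale_x = 100
ticks_x = ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x*scale_x))
ax = plt.gca()  # assumes pyplot is imported as plt and the histogram is already drawn
ax.xaxis.set_major_formatter(ticks_x)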
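An alternative sketch for the same goal uses `matplotlib.ticker.PercentFormatter`, which also appends the percent sign. The random data and `decimals=0` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(0)
fractions = rng.uniform(0, 1, size=200)  # stand-in for the real fractions

fig, ax = plt.subplots()
ax.hist(fractions, bins=10)

# xmax=1.0 tells the formatter that a data value of 1.0 means 100%.
formatter = PercentFormatter(xmax=1.0, decimals=0)
ax.xaxis.set_major_formatter(formatter)
```

Only the tick labels change; the bins and bar heights are the same as before.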
can someone experienced in machine learning help me in #help-bagel pls, I'm very new to machine learning
Hey, I made a Telegram bot with Python on VSC, but now I have only one question: does it work 24h/7?
Smart Project Shrink wrap - Surface normal version
I made this math up from scratch
😄
if you host it yes
the bot will work as long as the program is running, so just set it to run on a system that's on 24/7
you can use a VPS, a raspberry pi, or any system that you're willing to always leave on
can someone link me pandas vectorisation with numpy docs or something like that
I'm not sure I understand the dilemma. Each DataFrame is a layer on top of a numpy array, so pretty much all of its operations are vectorised.
no I am asking like the numpy methods which we use to vectorise data
can you give me an example of some data and what transformation you're thinking of?
are you sure you're not talking about encoding?
like imagine if i have a.csv
string
aaaa
aaaabbbb
cccccdddd
dddddeeee
now i want i have some substrings like ddee, aa i want to get a output like this
ddee aa
dddddeeee aaaa
NaN aaaabbbb
like how will i vectorise this type of data
the first row is the column name
I don't see the connection between these two.
alright here what i would do to approach this problem is basically iterate through the columns and check if the condition matches, but as i have read iterating is bad practice and we should use vectorisation
What is the condition you are trying to check? I can't tell you how got from the first example to the second example. What is ddee and why is it the first value?
it is the column name
like you see
name
bob
john
here name is the column name
I believe they want to specify column names, and then rows in the column has to have the column name as a substring?
can you explain what is happening here? what do these two have to do with each other?
this looks meaningless to me.
ddee | aa -> these are column names
---------|-----
dddddeeee| aaaa
NaN | aaaabbbb
what is it that happens that you go from
aaaa
aaaabbbb
cccccdddd
dddddeeee
to
dddddeeee aaaa
NaN aaaabbbb
why are you going from four rows to two? what does any of this mean?
why is one of them NaN?
Imagine I have two sub-strings ddee,aa right and I have this csv
now I want a output like this where the sub-string are the column names and the strings are in the column and then
so we will get two match for aa but only one for ddee
so we will populate the ddee column with a NaN value
Alright, give me a moment
no i dont want the code, like can just you explain how we will approach this with numpy
numpy is an implementation detail.
you don't have to think about how numpy is involved.
what did you read that made you think to ask this question? I think something is confusing you.
In either case, when you're working with numpy and pandas, you should avoid using them with for loops or calling methods like .apply and .map as much as possible, as the other methods are optimized and do all the looping internally.
Yes, that is right. You should avoid iterating over rows as much as you possibly can.
You just have to look in the docs for what method does what you are trying to do.
The methods that you'd be calling are usually vectorised. you don't have to do extra work to make them vectorised.
those things they list: arithmetic, comparisons, reductions, etc. Those are already implemented in pandas. You just have to use them.
Have always been able to use .apply, .map, and to a smaller extent, .rolling instead of ever having to iterate over rows
.apply and .map is the same as iterating over them as far as optimization is concerned. Those are not vectorised.
so how i will approach that question that i asked without like iterating
Is there a real example of something you're trying to do? The substring thing that you mentioned seems very obscure.
umm not really just trying to learn this
In [45]: s
Out[45]:
0 aaaa
1 aaaabbbb
2 cccccdddd
3 dddddeeee
dtype: object
In [46]: s.str.contains('aaaa')
Out[46]:
0 True
1 True
2 False
3 False
dtype: bool
In [47]: s[s.str.contains('aaaa')]
Out[47]:
0 aaaa
1 aaaabbbb
dtype: object
see how s.str.contains('aaaa') gives you True or False for each row as one operation. No looping required.
s[s.str.contains('aaaa')] then selects only those rows for which the condition is True.
ooh
so, "vectorised" isn't something you have to do. it's a design concept where a given operation is applied to all the data.
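Putting that together for the substring question: a sketch that builds one column per substring, padded with NaN where a substring has fewer matches. The names `subs` and `out` are made up, and `regex=False` is used so the substrings are matched literally.

```python
import pandas as pd

s = pd.Series(["aaaa", "aaaabbbb", "cccccdddd", "dddddeeee"])
subs = ["ddee", "aa"]

# For each substring, keep the matching rows and renumber them from 0.
# Building a DataFrame from Series of different lengths fills gaps with NaN.
out = pd.DataFrame(
    {sub: s[s.str.contains(sub, regex=False)].reset_index(drop=True) for sub in subs}
)
print(out)
```

The only Python-level loop is over the handful of substrings; the work over the rows happens inside `str.contains`.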
Here's a similar concept with arrays
In [52]: a
Out[52]:
array([[3, 2, 0],
[4, 0, 2]])
In [53]: b
Out[53]:
array([[3, 3, 1],
[4, 3, 2]])
In [54]: a + b
Out[54]:
array([[6, 5, 1],
[8, 3, 4]])
In [55]: a * b
Out[55]:
array([[ 9, 6, 0],
[16, 0, 4]])
@dawn crown the different operations are applied element-wise, but syntactically, it looks like you're just adding two things. This is also vectorised.
alright
Works with regular numbers, too.
In [56]: a / 2
Out[56]:
array([[1.5, 1. , 0. ],
[2. , 0. , 1. ]])
In [57]: 2 / a
Out[57]:
array([[0.66666667, 1. , inf],
[0.5 , inf, 1. ]])
thanks i will try some pandas general problems
I recommend this: https://www.kaggle.com/learn/pandas
Solve short hands-on challenges to perfect your data manipulation skills.
Anyone familiar with R and blogdown? Not necessarily a datasci question at this point.
R questions are out of scope for this server.
Hi guys, I understand that in SVM the regularization term C controls how complex a model is. For example, a high C tolerates fewer misclassified data points
But how does this apply to support vector regression? For example, epsilon controls the width of the tube. As such, a wider tube will fit more data points and minimize the slack variables. But how about C here? How does it balance this, because now we are trying to fit data points inside the tube
I.e regression
in SV regression, C can be interpreted as controlling the number of points that can fall outside of a pre-defined ±ε error bound, instead of the number of points that can be misclassified. see http://www2.cs.uh.edu/~ceick/ML/SVM-Regression.pdf
in case the link dies, the citation is:
"A Tutorial on Support Vector Regression"
Alex J. Smola and Bernhard Schölkopf
September 30, 2003
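For reference, the primal problem for linear ε-insensitive support vector regression, as I recall it from that tutorial (notation may differ slightly from the PDF), is:

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right) \\
\text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i,\ \xi_i^* \ge 0 .
\end{aligned}
```

Points inside the tube have zero slack and cost nothing. C sets the price of each unit of slack outside the tube, so it trades flatness of the function (small norm of w) against deviations larger than ε.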
hey guys, got a small problem with pandas
fifa["Weight"] = fifa["Weight"].astype(str).apply(lambda x: x.replace("lbs", "")).astype(float)
light = fifa.loc[fifa["Weight"] < 140].count()[0]
light_medium = fifa.loc[fifa["Weight"] >= 140] & fifa.loc[fifa["Weight"] >= 155]
error code:
unsupported operand type(s) for &: 'float' and 'float
anyone know how to solve this please? 😦
@lyric ermine you only want one call to loc in that last part
You shouldn't be using the ampersand in between two calls to loc
Though I'm not really sure what the intended logic is
I'm going to bed soon, but even if I weren't, it's better to put your question where everyone can get to it
This channel is specifically for this kind of question
light_medium = fifa.loc[fifa["Weight"] >= 140] & fifa["Weight"] >= 155
you mean like this?
Yes. Can you state in English what this is intended to do?
i have a series of values with weights
i wanna get the amount of weights between 140 and 155
So both of those comparisons need to be inside the call to loc
Look where you have your closing ] for loc
Also one of the comparison operators is wrong. I think you can figure out which one is wrong.
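For reference, a sketch of the pattern being hinted at, on a made-up `Weight` column. Treating the range as "at least 140 and below 155" is my assumption about the intended boundaries.

```python
import pandas as pd

fifa = pd.DataFrame({"Weight": ["130lbs", "140lbs", "150lbs", "154lbs", "155lbs", "170lbs"]})
fifa["Weight"] = fifa["Weight"].str.replace("lbs", "", regex=False).astype(float)

light = int((fifa["Weight"] < 140).sum())

# Both comparisons go inside ONE boolean mask; each needs its own parentheses
# because & binds more tightly than >= and <.
mask = (fifa["Weight"] >= 140) & (fifa["Weight"] < 155)
light_medium = fifa.loc[mask]
print(light, len(light_medium))
```

`Series.between(140, 155, inclusive="left")` would express the same range in newer pandas versions.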

trying this
from numpy import arange
b = round(-5/2,1)
c = round(5/2,1)
a = list(arange(b,c,0.1))
print(a)
outputting this
[-2.5, -2.4, -2.3, -2.1999999999999997, -2.0999999999999996, -1.9999999999999996, -1.8999999999999995, -1.7999999999999994, -1.6999999999999993, -1.5999999999999992, -1.4999999999999991, -1.399999999999999, -1.299999999999999, -1.1999999999999988, -1.0999999999999988, -0.9999999999999987, -0.8999999999999986, -0.7999999999999985, -0.6999999999999984, -0.5999999999999983, -0.4999999999999982, -0.39999999999999813, -0.29999999999999805, -0.19999999999999796, -0.09999999999999787, 2.220446049250313e-15, 0.10000000000000231, 0.2000000000000024, 0.3000000000000025, 0.4000000000000026, 0.5000000000000027, 0.6000000000000028, 0.7000000000000028, 0.8000000000000029, 0.900000000000003, 1.000000000000003, 1.1000000000000032, 1.2000000000000033, 1.3000000000000034, 1.4000000000000035, 1.5000000000000036, 1.6000000000000032, 1.7000000000000037, 1.8000000000000043, 1.900000000000004, 2.0000000000000036, 2.100000000000004, 2.2000000000000046, 2.3000000000000043, 2.400000000000004]
what do I do
floating point error
even after rounding?
anything I can do?
since 0.1 isn't exactly 0.1000000000000000000
i was going to suggest rounding the arange to whatever decimal points you want
I can't figure out that module
like this? a = list(round(arange(b,c,0.1),1))
right
how to use it
so, basically, every mathematical operation I want to do with those numbers, I'll have to use decimal?
what are you doing?
is it an operation that really requires precision?
if not, I wouldn't bother
this amount of imprecision is like...eyeballing it, 10^-15 or something?
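If the tidy values do matter (for printing, say), here is a sketch of two common workarounds: round the whole array afterwards, or step through integers and divide once at the end.

```python
import numpy as np

raw = np.arange(-2.5, 2.5, 0.1)        # accumulates tiny floating point errors

rounded = np.round(raw, 1)             # option 1: round the finished array
from_ints = np.arange(-25, 25) / 10    # option 2: integer steps, one division

print(rounded[:4])
print(from_ints[:4])
```

Both give the nearest representable float to each tenth, so they print cleanly; the underlying values are still binary floats, not exact decimals.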
Hi all, I've created an sns map of my p-values and all works as intended. I am trying to modify the heatmap coloring to be based around my alpha, as right now 1 is being colored the most, and 0 the least, whereas I really want to highlight significance. Any suggestions on a more significance based coloring method?
Could I host it by myself?
Hello
Hi!
You can append _r to the end of any colormap to reverse the colors
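Another option is to centre the colour scale on alpha itself, so values below and above the threshold land on opposite sides of a diverging colormap. The sketch below uses plain matplotlib rather than seaborn; `alpha = 0.05`, the random p-values, and the `RdBu` colormap are all assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

alpha = 0.05
rng = np.random.default_rng(0)
pvals = rng.uniform(0, 1, size=(4, 4))  # stand-in for the real p-value matrix

# Map [0, alpha] to the lower half of the colormap and [alpha, 1] to the upper half.
norm = TwoSlopeNorm(vmin=0.0, vcenter=alpha, vmax=1.0)

fig, ax = plt.subplots()
im = ax.imshow(pvals, cmap="RdBu", norm=norm)
fig.colorbar(im, ax=ax, label="p-value")
```

With this mapping the narrow significant range gets half of the available colours instead of a twentieth of them.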
hello!
I have an error: index 13 is out of bounds for axis 1 with size 1
and i dont understand, because i already ran it before and it did work. I removed my new lines and restored the old version
### PLSR ###
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import (cross_val_score, cross_validate,
                                     train_test_split, validation_curve)
# OPEN THE CSV
data=pd.read_csv(r'C:\path\donnees_grece.csv')
datalist_x=data.values.tolist()
data=np.array(datalist_x)
print(data.shape)
data=np.random.permutation(data) # shuffle the rows
print(data.shape)
data_x= data[1:,65: ].astype(float)
data_y=data[1:, 13 ].astype(float) # 13: target column (clay)
# DEFINE FEATURES X AND TARGET y
X=data_x
y= data_y
# SPLIT INTO test set AND train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# CHOOSE THE PLSR MODEL AND SET ITS PARAMETERS
pls = PLSRegression(n_components=15, max_iter=500)
# CROSS VALIDATION
scores = cross_validate(pls, X_train, y_train, cv=5, scoring="r2", return_train_score=True)
print(scores)
print (scores["train_score"].mean()) # array with the scores for each fold
# results of each fold for test
scores1=cross_val_score(pls, X_train, y_train, cv=5, scoring='r2')
# mean of the CV r²
print (scores1)
# TUNE THE HYPERPARAMETER: NUMBER OF LATENT VARIABLES (components?)
i= np.arange(1,20)
train_score, val_score = validation_curve(pls, X_train, y_train,
    param_name='n_components', param_range=i, cv=5)
print(val_score.mean()) # mean CV score over all iterations up to 50 :S
plt.plot(i, val_score.mean(axis=1), label='validation')
plt.plot(i, train_score.mean(axis=1), label='train')
plt.ylabel('score')
plt.xlabel('n_components')
plt.legend()
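One possible cause, offered as a guess since the file itself is not shown: "axis 1 with size 1" means the array has a single column, which is what `pd.read_csv` produces when the file uses a separator other than a comma (semicolons are common in French-locale exports). The sketch below reproduces that with a made-up three-column file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content standing in for donnees_grece.csv.
raw = "id;clay;band1\n1;12.5;0.31\n2;8.0;0.27\n"

wrong = pd.read_csv(io.StringIO(raw))            # default sep="," -> one wide column
right = pd.read_csv(io.StringIO(raw), sep=";")   # explicit separator -> three columns

print(wrong.shape, right.shape)
```

Checking the first `print(data.shape)` in the script is the quickest way to confirm or rule this out.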
hello, could anyone refer to me a pytorch specific discord server?
Hey everyone, I hope this is the right channel to ask this data visualization question:
I'm looking for a library that can produce a graph similar to this Flourish line graph race: https://app.flourish.studio/@flourish/horserace/8
Ideally, I'd like to make it interactive in my react ts frontend, alternatively I'd like to display it as a gif or video.
So far I've only found bar graph races, for example made with plotly express or matplotlib (like those: https://pypi.org/project/bar-chart-race/, https://towardsdatascience.com/bar-chart-race-in-python-with-matplotlib-8e687a5c8a41)
I'm very thankful for any advice and hints, also let me know if I should move this question elsewhere! 🙂
thanks
hey guys
i want to know if the track of data science in python at datacamp is worth it or not
hey guys just a general question about hardware for text generation task. So if i have a basic LSTM with attention layers and beam search algorithm which i want to train and evaluate on multiple datasets ranging between size 500mb to 4gb size (before pre-processing) whats the hardware i would need? For example within cloud how much ram and what kind of gpus i would need for quick training (ideally within 4/5 hours)
2) For fine tuning GPT Models (124M layer) on a 2.6gb dataset (before pre-processing) what kind of gpu + ram i would need. Goal is to finetune and evaluate within 10 hrs.
3) GPT NEO Model. For this how much ram and computational power i would need considering dataset size is 8GB. Goal is to fine tune within 24 hrs
am I to understand, then that you'll have up to 4gb worth of tensors in memory?
yes for 1) LSTM's
I'll defer to someone else as I don't want to lead you astray.
no worries. Thanks tho
Question about grouping data by datetime64[ns].. I want to know can I group by day/hour/min from? I have been looking for an example but haven't made any progress
*using pandas
!d pandas.DataFrame.resample
DataFrame.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)
Resample time-series data.
Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the `on`/`level` keyword parameter.
Would this be what you're looking for?
that might be it, thanks. I am trying groupby, will look into the above as well
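A short sketch of both routes on a made-up frame with a `datetime64[ns]` column; the column names `when` and `value` are assumptions. `resample` produces a row for every period in the range, while `groupby` on a floored timestamp only keeps periods that contain data.

```python
import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime([
        "2021-08-20 10:05", "2021-08-20 10:45",
        "2021-08-20 11:10", "2021-08-21 09:00",
    ]),
    "value": [1, 2, 3, 4],
})

# Per day, via resample on the datetime column.
per_day = df.resample("D", on="when")["value"].sum()

# Per hour, via groupby on the timestamp rounded down to the hour.
# Use "min" instead of "h" to group by minute.
per_hour = df.groupby(df["when"].dt.floor("h"))["value"].sum()

print(per_day)
print(per_hour)
```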
alpha = 0.01
b0, b1, b2 = 0, 0, 0
x = [4, 7, 5, 7, 27]
y = [4, 3, 7, 7, 4]
z = [8, 10.5, 17.5, 24.5, 54]
error = []
for i in range(50000):
    idx = i % 5
    p = b0 + (b1 * x[idx]) + (b2 * y[idx])
    err = p - z[idx]
    b0 = b0 - alpha * err
    b1 = b1 - alpha * err * x[idx]
    b2 = b2 - alpha * err * y[idx]
    error.append(err)
error = list(map(abs, error))
error.sort()
print(error[:1])
test = float(input())
test1 = float(input())
pred = b0 + (b1 * test) + (b2 * test1)
print(pred)
I'm trying to use machine learning to predict this problem,
it is trying to calculate the area of a triangle without the equation
what do i do
please ping me
@vestal agate if I understand correctly, you're trying to make a model that can predict the area of a triangle given the length of its sides?
You shouldn't use machine learning for things when there's a simple, known solution. But if this is just for practice, I guess you could do it with regression.
yes
this is linear regression and gradient descent
but maybe im doing more than 2 variables wrong
i only really know how x and y works with b0 and b1
not up to b2 or infinity
I would put your x, y, and z data into a matrix (ie a numpy array) and look into the regression tools in sklearn
yeah but its practice
i want to do the equation myself
in that case I'd use numpy but not sklearn.
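A sketch of what a NumPy version could look like for any number of inputs. One caveat drives the design: the z values in the snippet equal x*y/2, which is not a linear function of x and y, so this sketch fits the logarithms instead (log z = log x + log y - log 2 is linear). That transformation, the learning rate, and the iteration count are my own choices, not something from the original code.

```python
import numpy as np

x = np.array([4, 7, 5, 7, 27], dtype=float)
y = np.array([4, 3, 7, 7, 4], dtype=float)
z = np.array([8, 10.5, 17.5, 24.5, 54], dtype=float)

# Design matrix: a column of ones for b0, then one column per input.
X = np.column_stack([np.ones_like(x), np.log(x), np.log(y)])
target = np.log(z)

b = np.zeros(X.shape[1])   # b[0], b[1], b[2] play the roles of b0, b1, b2
alpha = 0.1
for _ in range(50_000):
    err = X @ b - target                     # prediction error for every row at once
    b -= alpha * (X.T @ err) / len(target)   # batch gradient step on mean squared error

def predict(base, height):
    """Predicted area for new inputs, undoing the log transform."""
    features = np.array([1.0, np.log(base), np.log(height)])
    return float(np.exp(b @ features))

print(b, predict(6, 10))
```

Adding a fourth or fifth input only means adding another column to `X`; the update line stays the same.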
any1 here know how to create histograms?
yes; is the data in a DataFrame or an array or what?
yes
which one?
lemme send u the code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
import pandas as pd
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
url = "https://covid19.who.int/WHO-COVID-19-global-data.csv"
df = pd.read_csv(url)
filt = (df['Date_reported'] == '2021-08-20')
df1 = df[filt]
filt = (df1['Cumulative_cases'] >= 2000000)
df2 = df1[filt]
df2
Great, now do print(df.head().to_csv()) and paste that text into this chat the same way.
,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
5363,2021-08-20,AR,Argentina,AMRO,9764,5106207,247,109652
17283,2021-08-20,BR,Brazil,AMRO,41714,20457897,1064,571662
26223,2021-08-20,CO,Colombia,AMRO,3154,4877323,93,123781
43507,2021-08-20,FR,France,EURO,23102,6384773,105,111839
47083,2021-08-20,DE,Germany,EURO,9280,3853055,13,91956
Great, this makes sense. What are you trying to convey with your histogram?
countries over 2m cases, the new and recovered per country
these are two histograms, then?
i was thinking can do in one but wouldnt look good so yes 2 would be the better approach i guess
I did this
df.loc[df['Cumulative_cases'] >= 2_000_000, ['Country', 'New_cases', 'Date_reported']].plot.hist('New_cases', 10)
And I got this
you can probably mess with it from there
!docs pandas.DataFrame.plot.hist
DataFrame.plot.hist(by=None, bins=10, **kwargs)
Draw one histogram of the DataFrame’s columns.
A histogram is a representation of the distribution of data. This function groups the values of all given Series in the DataFrame into bins and draws all bins in one [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes "(in Matplotlib v3.4.3)"). This is useful when the DataFrame’s Series are in a similar scale.
ahh, let me think
@grim orbit df.loc[df['Cumulative_cases'] >= 2_000_000].plot.barh('Country', 'New_cases')
I see
!docs pandas.DataFrame.plot.barh
DataFrame.plot.barh(x=None, y=None, **kwargs)
Make a horizontal bar plot.
A horizontal bar plot is a plot that presents quantitative data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.
I'm not sure how you'd stack the two types of cases or change the colors.
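For stacking and colours, one sketch that seems to work is passing a list of columns as `y` together with `stacked=True` and `color`. The rows are the five pasted above; pairing `New_cases` with `New_deaths` and the two colour names are assumptions, since the file has no recovered column.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import pandas as pd

selected = pd.DataFrame({
    "Country": ["Argentina", "Brazil", "Colombia", "France", "Germany"],
    "New_cases": [9764, 41714, 3154, 23102, 9280],
    "New_deaths": [247, 1064, 93, 105, 13],
})

# One bar per country, with the two columns stacked end to end.
ax = selected.plot.barh(
    x="Country",
    y=["New_cases", "New_deaths"],
    stacked=True,
    color=["tab:blue", "tab:red"],
)
ax.set_xlabel("count on 2021-08-20")
```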
yes, but it's calling matplotlib under the hood
wym?
these are just dataframe methods, but those methods are calling matplotlib functions
you don't have to import matplotlib but you do have to have it installed.
is there a way to stop nan and inf from showing in pycharm
and limit the numbers shown
comparison operations don't work that way for NaN
i see
for my example
filt = (df2['Country'] != 'France, Spain, United States of America, England,');
how do I write this properly i see no difference
You would do ~df2['Country'].isin(['France', 'Spain', 'United States of America', 'England'])
where would it be inserted?
when i define?
the expression I gave you evaluates to a boolean series, you can use it as an indexer in loc
df2.loc[df['Cumulative_cases'] >= 2_000_000].plot.barh('Country', 'Cumulative_cases',df2['Country'].isin(['France', 'Spain', 'United States of American', 'England'])))
not like this
selected = df2.loc[(df2['Cumulative_cases'] >= 2_000_000) & ~df2['Country'].isin(['France', 'Spain', 'United States of America', 'England'])]
selected.plot.barh('Country', 'Cumulative_cases')
You also deleted the ~, which was important.
~ is a negator. You can read (df2['Cumulative_cases'] >= 2_000_000) & ~df2['Country'].isin(['France', 'Spain', 'United States of America', 'England']) as "where Cumulative_cases is >= 2 million and Country is not in France, Spain, etc."
It flips true and false values.
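A small self-contained sketch of the same idea, with invented numbers, in case it helps to see each piece separately. The parentheses around the comparison matter because & binds more tightly than >= in Python.

```python
import pandas as pd

# Invented figures, only to show how the mask behaves.
df2 = pd.DataFrame({
    "Country": ["France", "Brazil", "Spain", "India", "Peru"],
    "Cumulative_cases": [6_000_000, 20_000_000, 4_500_000, 32_000_000, 1_900_000],
})

excluded = ["France", "Spain", "United States of America", "England"]

# True for rows whose country is in the list.
in_list = df2["Country"].isin(excluded)

# ~ flips the mask, so this keeps rows that are NOT in the list
# and that also have at least 2 million cumulative cases.
mask = (df2["Cumulative_cases"] >= 2_000_000) & ~in_list
selected = df2.loc[mask]
```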
I see thanks
is it possible to retrieve info from a enjin forum? (without having to emulate a web browser)
Hi! I have a dataframe with three columns (df1):
column1: unique Identifier, column 2: 0 or 1, column 3: 0 or 1. I have a second dataframe (df2) with the same three columns, but with 1s and 0s in different rows. I want to join the two, so that if df1 has a 0 for a unique ID in the relevant column where df2 has a 1, df1 gets updated to be a 1. But if df2 has a 0, df1 stays as 1 for that ID, and nothing is done to df1 at all.
Importantly, the lengths of the two dataframes are not the same, and the IDs are not in the same rows, though df2 will always be a subset of IDs in df1.
In reality, my actual databases are around 30 columns as opposed to the 3 in the above example though.
does anyone have a code for the face recognition (opencv)
This isn’t open cv but it is still a really high level with keras
@glossy moth it would be easier to follow what you're trying to do if you provided a minimal example (potentially with mock data), but it sounds like you need to merge
Also if the "IDs are not in the same rows" then you need to make sure you've set an appropriate index for each frame.
what library visualizes data like this
@vestal agate matplotlib
Hey @near aspen!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
Different lengths but df2 indexes are a subset of df1, so if I'm understanding your problem correctly, you want to update column1 to the maximum value of the ID from either df1 or df2 for column1.
idx = df2.index.intersection(df1.index)
df1.loc[idx, 'column1'] = np.where(df2['column1'] > df1.loc[idx, 'column1'], df2['column1'], df1.loc[idx, 'column1'])
I doubt it's the best way, but that's what I likely would have done
As Stelercus mentioned, you'll want to make sure that you set your index as the ID column
Thank you for the help! Sure, see mock example below:
So in this case, I would want Identifier 3, Group 1 and 2 in the first table to be updated to 1. For identifier 1, I would want nothing to be changed in the first table at all
Still not sure, do you want the column to change to the max of the two columns, or does it update based on something else? If it's something else could you explain when it should be updated and when it should not be updated?
Sure sorry. All columns contain binary values of either 1 or 0. If the df2 which is a subset of df1 has a 1 for an ID where df1 has a 0, I want df1 for that ID to change to 1. If df1 has a 1 and df2 has a 0, I want it to do nothing
So I want to update df1 just in cases where the ID in df2 has a value of 1 and the ID in df1 has a value of 0, in all other cases, do nothing
Yea, then the code snippet I provided should work
Thank you! Just so I understand your code:
idx = df2.index.intersection(df1.index)
df1.loc[idx, 'column1'] = np.where(df2['column1'] > df1.loc[idx, 'column1'], df2['column1'], df1.loc[idx, 'column1'])
line 1 sets idx as the Identifier values that are shared between the two sets. Then line 2 looks at those IDs in column 1 where df2 is greater than those IDs in column 1, and if greater, sets df1 to that value, otherwise, leaves as is?
hey can someone teach me how to code multi variable linear regression
no tutorials that make sense
100% know how linear regression and gradient descent works
but without multiple variables
Yea, that's correct
Thank you! I'll give it a shot
So in my real situation, I have ~30 columns I need to do this for. Can I just loop through the snippet you provided updating the column as I go, or will this apply all at once to every column?
Same number of columns in df1 and df2 and they are named the same and everything if that matters
Snippet I gave needs to loop over the columns, so
for i in range(1, 30):
    col = f'Group{i}'
    idx = df2.index.intersection(df1.index)
    df1.loc[idx, col] = np.where(df2[col] > df1.loc[idx, col], df2[col], df1.loc[idx, col])
I think it should be possible to change it to just apply once, but my 1am brain has stopped functioning at 100%
Thank you! 🙂
@mortal dove it looks like you're recomputing idx every time for no particular reason
Oh yea, can probably take that out of the loop
Though I wonder if this can be accomplished without any loops
As I said, 1am brain is running on its last fumes
I'll probably have a look tomorrow
Thank you both! If you know of a way to do it without loops, I'd definitely be super curious to hear! 🙂
For your reference, @glossy moth, you want to avoid loops and apply as much as possible in the context of numpy and pandas.
If they don't make sense you probably don't understand the fundamentals. Have a look at Introduction to Statistical Learning (available for free), Section 3.2, p. 71
Goes much more in depth than any tutorial will, but it explains the math/stats behind Multiple Regression, not how to code it
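For the coding side, here is a minimal sketch of multiple linear regression fitted by batch gradient descent in plain Python. The toy data, learning rate and epoch count are all invented for illustration; it is just the single-variable update repeated once per feature, not a reference implementation.

```python
# Toy data generated from y = 1 + 2*x1 + 3*x2 (noise-free, invented for illustration).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]


def predict(weights, bias, row):
    """Linear model: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, row))


def fit(X, y, lr=0.05, epochs=5000):
    """Batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict(weights, bias, row) - target for row, target in zip(X, y)]
        # One partial derivative per weight, plus one for the bias.
        grad_w = [2.0 / n * sum(e * row[j] for e, row in zip(errors, X)) for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


weights, bias = fit(X, y)
```

On this toy data the fitted values should land very close to the weights 2 and 3 and the bias 1 used to generate it.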
I think this can be done with one statement if both dataframes have overlapping indices.
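One way the single-statement version might look, as a sketch only: it assumes both frames use the identifier as their index and hold nothing but 0/1 integers. The frames and the names aligned and updated are made up to mirror the mock example described above.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"Group1": [1, 0, 0, 1], "Group2": [0, 0, 0, 1]},
    index=pd.Index([1, 2, 3, 4], name="Identifier"),
)
# df2 covers only some identifiers, and in a different order.
df2 = pd.DataFrame(
    {"Group1": [1, 0], "Group2": [1, 0]},
    index=pd.Index([3, 1], name="Identifier"),
)

# Line df2 up with df1's rows and columns; identifiers missing from df2 count as 0.
aligned = df2.reindex(index=df1.index, columns=df1.columns, fill_value=0)

# Wherever df2 has a 1, force a 1; everywhere else df1 keeps its own value.
updated = df1.mask(aligned == 1, 1)
```

This covers every column at once, so no loop over Group1, Group2, and so on is needed.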
Hey guys, how can I turn this final model so that I get cross-validation score of MAE and RMSE with scikit-learns cross_val_score
```
svr = SVR(kernel = 'rbf',C=100, epsilon=0.1, gamma = 100)
svr.fit(X_train, y_train)
y_pred_train = svr.predict(X_train)
y_pred_test = svr.predict(X_test)
#Metrics - if squared = True returns MSE value, if squared = False returns RMSE value.
#Performance on training set
mae_train = mean_absolute_error(y_train,y_pred_train)
rmse_train = mean_squared_error(y_train,y_pred_train, squared = False)
#Performance on testing set
mae_test = mean_absolute_error(y_test, y_pred_test)
rmse_test = mean_squared_error(y_test, y_pred_test, squared = False)
print(f'SVR completed in : {round((time.time() - start_time), 2)} seconds...')
```
Yeah the identifiers for df2 are a subset of the df1 identifiers, though order is not the same
if there is a way to .apply() and .where() to do this, that would be awesome to learn
thanks again for the help!
The order doesn't matter, the identity of the index does.
yes all identities of df2 are found in df1 somewhere
Problem is I'm on mobile so I can't experiment
You don't want to use apply if you can get away with it. And you usually can.
Ah ok. I'm just not sure how to get this to apply to every column vs index:
idx = df2.index.intersection(df1.index)
df1.loc[idx, col] = np.where(df2[col] > df1.loc[idx, col], df2[col], df1.loc[idx, col])
doesn’t really seem like the right approach to me
I feel like you should do a join
on the index
that’s my gut feel anyway
@velvet thorn they just need a boolean series from the second dataframe to index the first, but there are indices missing in the second dataframe
Beyond that it's just df.loc[...] = 1
yeah, so Series.where over the columns on the joined dataframe
no?
🥴 I read a bit up there
but I could be wrong I just woke up
I've never used Series.where tbh
!d pandas.Series.where
Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=<no_default>)
Replace values where the condition is False.
Thanks for helping as well! It sounds like where will indiscriminately take the two dataframes I've joined and replace 0 values with 1s? I only want replacement when index position of df2 is 1 and index position of df1 is 0, no other times
Sorry super new to python in general, I'm sure im misunderstanding!
nope
that’s the point of the condition
what it does is basically
if condition is true, take from left, otherwise take from right
here left is df1 and right is df2
But haven't I already joined them at this point?
yes, but join means
both columns are present
so more accurately
left is the relevant column from df1
and right from df2
okay it’s a bit hard to visualise
if your current solution works
just go with it
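To make the join plus Series.where idea easier to picture, here is a sketch for a single column. The frames, the "_df2" suffix and the variable names are invented; it assumes the identifier is the index of both frames.

```python
import pandas as pd

df1 = pd.DataFrame({"Group1": [1, 0, 0]}, index=pd.Index([1, 2, 3], name="Identifier"))
df2 = pd.DataFrame({"Group1": [0, 1]}, index=pd.Index([1, 3], name="Identifier"))

# Left join on the index: df2's column arrives with a suffix,
# and is NaN for identifiers that df2 does not contain.
joined = df1.join(df2, rsuffix="_df2")

left = joined["Group1"]
right = joined["Group1_df2"].fillna(0).astype(int)

# where() keeps `left` where the condition is True and takes `right` elsewhere.
result = left.where(left >= right, right)
```

Only identifier 3 changes here, because that is the only row where df2 has a 1 and df1 has a 0.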
Can you clarify the question
Same as stelercus im not at a computer but maybe i can answer later
So I have done gridsearch to find the best hyperparameters for my SVR but I want a cross-validated score of MAE and RMSE because right now the model is overfitting. The complete thing looks like this:
```
'epsilon': [0.001,0.01,0.05,0.1,1,10],
'gamma': [0.01, 0.1, 1, 10, 100]},cv=5, verbose=0, n_jobs=-1)
gsc = gsc.fit(X_train, y_train)
```
```
svr = SVR(kernel = 'rbf',C=100, epsilon=0.1, gamma = 100)
svr.fit(X_train, y_train)
y_pred_train = svr.predict(X_train)
y_pred_test = svr.predict(X_test)
#Metrics - if squared = True returns MSE value, if squared = False returns RMSE value.
#Performance on training set
mae_train = mean_absolute_error(y_train,y_pred_train)
rmse_train = mean_squared_error(y_train,y_pred_train, squared = False)
#Performance on testing set
mae_test = mean_absolute_error(y_test, y_pred_test)
rmse_test = mean_squared_error(y_test, y_pred_test, squared = False)
```
@velvet thorn you'd be surprised at how many code blocks I've written on my phone hah
Ok let me see if I understand this correctly:
So basically I'm joining them based on df1, which I've stated df2 is a subset of. So I'll now get a dataframe that goes from 30 columns to one with 60 columns, with a lot of the new 30 columns having a bunch of NaN rows. Then I do series.where() and say if columndf1 < corresponding columndf2, replace with columndf2 value, otherwise leave alone?
kind of
this gives you a bunch of Series
then you pd.concat them
I occasionally help here on my phone
@lapis sequoia look up "nested cross validation" perhaps, sklearn has it. Although the grid search should already be using cross val
it's a bit 🥴
I definitely don't have patience for it anymore
Yes but I need the score in MAE or RMSE, cross validation score does not say anything about the errors between actual and predicted value.
You can tell it which scorers to use
You can even tell it to compute multiple different scores (although it will only use one for selection)
The cross validation score is whatever score you request
The default i think is the estimator's own score method, so R² for regression and accuracy for classification, but you can change it
This is described in the reference docs for the various CV classes
Yes but I don't understand how to implement it in the code block
I think the parameter is scoring=
you should read the docs again
I have generally found sklearn docs quite clear + comprehensive
I have done that. I get "'mean_absolute_error' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options." although they say that is what to use in scoring
where do "they" say that?
because...
if you want the MAE scorer...
you need to use 'neg_mean_absolute_error'
and that is because
higher error is worse.
this information is available here
so my question is
did you read that?
and where was this said?
because if it was, then that's a documentation bug, so I would like to see it
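For reference, a sketch of how the negated scorers could be used to get cross-validated MAE and RMSE for the SVR above. The data is synthetic, standing in for X_train and y_train, and it assumes a scikit-learn version recent enough to ship the 'neg_root_mean_squared_error' scorer.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic stand-in for X_train / y_train.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

svr = SVR(kernel="rbf", C=100, epsilon=0.1, gamma=100)

# cross_validate accepts several scorer names at once.
scores = cross_validate(
    svr, X_train, y_train, cv=5,
    scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
)

# Scorers follow "higher is better", so the errors come back negated.
mae_per_fold = -scores["test_neg_mean_absolute_error"]
rmse_per_fold = -scores["test_neg_root_mean_squared_error"]
```

Each array holds one value per fold; taking the mean gives a single cross-validated figure.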
hello guys ~ can you help me with using grid search from scikit-learn on a PLS regression? here is my code:
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV
from sklearn.cross_decomposition import PLSRegression
pls = PLSRegression(n_components=15, max_iter=500)
n_components= np.arange(1, 100)
max_iter=[500]
#gridsearch : create a dictionary with hyperparameters
param_grid = {'n_components':n_components, 'max_iter':max_iter ,
'metric': ['r2', 'neg_root_mean_squared_error']}
grid = GridSearchCV(PLSRegression(), param_grid, cv=5)
#train the grid
grid.fit(X_train, y_train)
it keeps saying Invalid parameter metric for estimator PLSRegression(). Check the list of available parameters with `estimator.get_params().keys()`.
Yes, please have a look, nan values
The grid parameters only apply to the model "inside", not the grid search process itself. It also makes no sense to "search" over different metrics. Are you trying to calculate multiple metrics for each step of the search?
There is a separate parameter for that, not part of the parameter grid
As with the other person, this sounds like a case of needing to read the user guide and reference documentation more carefully
Are you sure those are the right arguments to cross_val_score?
That's the distance metric for the KNN model that I assume is being searched over, there is no "metric" parameter for PLS Regression as per the error and as per the docs
Unrelated to the scoring metric
i see, but in my case im interested by two metrics, does it mean i should apply grid search twice?
rmse and r²
in the doc you can see the different metrics for pls
why is doing cp on individual files sooo slow 😦
i dont really get the problem . is the line before metric correct tho?
Do you understand what the "parameter grid" is for?
You can do that, but I am pretty sure there is an option to add additional scorers or use a list of several scorers
yes its to select the best parameters to get the best score
Check the GridSearchCV docs
i think i got the idea?
So do you understand why it doesn't make sense to try to use it to pass multiple parameters to GridSearchCV itself?
i know that pls has few parameters : ones of them the number of iteration that needs the machine to learn, and the number of components
Because it's literally copying the data byte for byte, it only goes as fast as it goes. Use rsync maybe?
by parameter you mean score?
No, I mean parameter. As in, parameters that control how the model behaves. The score/loss function is unrelated
oh.. so i can use multiple metrics but only one parameter?
I guess that might be due to other latencies ¯_(ツ)_/¯
for row in tqdm(train_df['Image_ID']):
    tgt_img = row + '.jpg'
    !cp ../input/dataset/Train_Images/Train_Images/$tgt_img ./FiftyOneDataset/data/
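One possible contributor to the slowness is that every `!cp` in a notebook starts a separate shell process. A sketch of doing the same copy in-process with the standard library follows; the helper name copy_images and the throwaway directories are invented for the demo, and whether this is actually faster on a given disk is untested here.

```python
import shutil
import tempfile
from pathlib import Path


def copy_images(image_ids, src_dir, dst_dir):
    """Copy <id>.jpg for each id from src_dir to dst_dir; return how many were copied."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for image_id in image_ids:
        src = Path(src_dir) / f"{image_id}.jpg"
        shutil.copy2(src, dst / src.name)  # stays in-process, no new shell per file
        copied += 1
    return copied


# Tiny self-contained demo with throwaway directories and fake image files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "Train_Images"
    dst_dir = Path(tmp) / "FiftyOneDataset" / "data"
    src_dir.mkdir()
    ids = ["img_001", "img_002", "img_003"]
    for image_id in ids:
        (src_dir / f"{image_id}.jpg").write_bytes(b"fake")
    n_copied = copy_images(ids, src_dir, dst_dir)
    copied_names = sorted(p.name for p in dst_dir.iterdir())
```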
if is that so it doesnt bother me to make 2 grid search because i honestly think it is interesting to know for both parameters
or maybe i just plot a validation curve
btw it doesnt change, i have the same error Invalid parameter metric for estimator PLSRegression(). Check the list of available parameters with `estimator.get_params().keys()`.
the error is somewhere else
param_grid = {'n_components':n_components,
              'metric': ['metrics.r2_score',
                         'metrics.mean_squared_error']}
@desert oar no difference with rsync 😦
On first run there won't be, but subsequent runs should be much faster
I recommend using DVC + rsync for this kind of thing
Or a Makefile with a wildcard
./bar/%: ../input/foo/%
    cp $< $@
I used 4 spaces instead of a tab because mobile, but that's the idea
./bar/%: ../input/foo/%
    dvc run -d $< -o $@ cp $< $@
    git add $@.dvc $(dir $@)/.gitignore
Rsync I think does more intelligent file diffing or deduplicating or something
Because you're still misusing it in the same exact way
mh...
Grid search lets you search over a grid of parameters for the model using one or more scorers. Choosing which scorers to use is not part of the "parameter grid", it's a completely different setting
The "metric" in that one example screenshot was unrelated to the scoring
In that very specific example, the model happened to have a parameter called "metric"
nooo i know i want both score, its not which score?
you said using one or more scores?
This is a different parameter in GridSearchCV, do not use the parameter grid for this
Please read the docs and the user guide
okay.. maybe you give me a simple example maybe?
Stop guessing based on examples
i already did
im not fluent in english so i read but sometimes it tooks time to understand
why yall mad
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html read the "scoring" parameter doc
okay
But go to sleep first
no i have a meeting tomorrow
im stuck . I take too much time to correct my errors
lemme read
I see. I understand it's probably not easy to read these docs if you don't read English well
I apologize for sounding annoyed. We get a lot of people in here who don't want to learn and just want people to do their homework for them
yes i know, i really wanna learn i think i m improving my english skills progressivly
Yes, unfortunately programming tends to be very English-centered
to have an idea about how i read the documentation: my eyes look to key words because i know them, but sometimes the verbs or the synthax makes me confuse
yes but its a good training !
it just needs time
okay this part is about multiple scores i guess? "If scoring represents multiple scores, one can use:"
isnt that what i tried to do? with 'metric': [...]
okay so i tried to read but i dont think i got more than what i previously thought :x
Hi, I was asking a bit about this earlier but I realized some info that simplifies things:
I have two dataframes of equal size. All cells other than an identifier column contain 0 or 1. The only difference between the two is that cells differ on which rows have 0 and which have 1. I want to compare equivalent cells in the two, and if there are any occurrences of a 0 in one df where the other has a 1, replace it with a 1. If both have 0 leave 0. If both have 1, leave 1. Is there a quick way to do this without going column by column?
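For the equal-size case, one possible approach is a bitwise OR over the whole frame. This is a sketch with made-up frames; it assumes the identifier column is called "id" (a hypothetical name), that both frames contain exactly the same ids, and that every other cell is an integer 0 or 1.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [10, 11, 12], "a": [0, 1, 0], "b": [0, 0, 1]})
df2 = pd.DataFrame({"id": [12, 10, 11], "a": [1, 0, 1], "b": [0, 0, 0]})

left = df1.set_index("id")
# Same ids, put into df1's row order so the cells line up.
right = df2.set_index("id").reindex(left.index)

# For 0/1 integers, bitwise OR gives 1 if either side is 1,
# so every column is handled in one expression.
combined = (left | right).reset_index()
```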
Please I need help
Why do I keep getting this error
This is jupyter notebook
I keep getting this error "AxesSubplot:
this might be useful
I imported them already @rough mountain
@rough mountain it worked. I typed %matplotlib inline and the graph showed. Thanks
welcome
I'm just going to use this as an image processing channel as it's the closest thing.
When I floodfill this image with cv2
kitty!!!!!!!!
I get this strange result
