#data-science-and-ml


glad aspen
#

it's a DataFrame from pandas

#

anyone know the efficient way?

#

just need the date time portion

#

I've tried df["created_at"] = df.created_at.replace(tzinfo=None)
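
For the timezone question above, a minimal sketch, assuming `created_at` is a timezone-aware datetime column (the sample values below are made up). `Series.replace` substitutes values and has no `tzinfo` argument, whereas `tz_localize(None)` on the `.dt` accessor drops the timezone and keeps the wall-clock time:

```python
import pandas as pd

# Hypothetical sample data: timezone-aware timestamps at UTC+8.
df = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2021-06-01 12:00:00+08:00", "2021-06-02 08:30:00+08:00"]
    )
})

# tz_localize(None) strips the timezone but keeps the local wall-clock time.
df["created_at"] = df["created_at"].dt.tz_localize(None)
print(df["created_at"])
```

If the column holds strings rather than datetimes, it would need `pd.to_datetime` first.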

weary echo
#

to the best of my knowledge, some people did use boxplots in their time series plots. Although, in my opinion, it's not really "pretty".

I personally used boxplots to identify the existence of outliers.

#

in the end, it's a matter of preference

acoustic forge
weary echo
#

oh my god, I misread it haha

acoustic forge
#

Time series is not a plot but rather a data format for predictive analytics

#

No worries 😛

acoustic halo
#

Okay GPT-3 is brilliant

#
```
Sad :(
Crying :'(
Laughing :D
Surprise :O
Angry :@
```
#

It generated the smileys for Laughing, Surprise and Angry itself

inland zephyr
#

Should Malay Peninsula Standard Time be written as MST in standard time-zone format (like PST, GMT, CEST)?
to remove it I use a lambda map with a function:

```
from datetime import datetime

def funct(your_str):
    splitted = your_str.split(" ")
    day = splitted[0] + " " + splitted[1]  # keep only the date and time parts
    return datetime.strptime(day, '%Y-%m-%d %H:%M:%S')
```
#

and do df[your_ts] = df[your_ts].map(funct)

#

it will automatically update to datetime type

#

or afaik MST is +8 hours, so add 8 hours before returning the time

#

dunno if there is a more elegant way to do this

woven rivet
#

So I'm dumb and have little to no idea how to do this, but how would I make an AI that can recognize my pet rabbit from other rabbits and tell that it's him? I know basic Python. When neural networks do stuff like this, do they plot points on the images and look for patterns? Any help would be greatly appreciated. I really only know basic Python, so please explain things

serene scaffold
serene scaffold
acoustic halo
#

Accidentally created the AI from I have no mouth and I must scream

grave frost
#

someone make it into a copypasta

grave frost
#

we already have the collider projects on github. Can't wait to spam all those images to get them viral and half the population arrested 🤣

mortal dove
#

I'm looking for a good book that covers time series analysis that's more focused on the mathematics. Does anyone have some suggestions? Ideally free, but I won't mind paying for a solid book.

desert oar
#

sorry, got busy at work yesterday. where is the part in this code that does the "don't buy if i already bought it" logic?

#

also, i don't know much about trading strategies, but it sounds like your strategy doesn't allow for the possibility of buying more of $FOO after you've already bought $FOO. is that right?

chilly skiff
#

no problem, I appreciate you remembering 🙂

desert oar
#

SMA = simple moving average? moving average of what?

chilly skiff
#

the prices

#

yep

desert oar
#

so this python code is your whole trading strategy? good on you for being willing to share it, instead of being under the delusion that if you share it you're going to leak your genius secrets to the world and lose out on your gains 😛

chilly skiff
#

since my question yesterday I've made it a lot faster. I used vectorization to remove as much data from the dataframe as possible (areas where it will not buy/sell), then I used a list comprehension for the rest of the data I needed to manually iterate through

#

lmao

#

this was from my testing yesterday

#
```
Dictionary: 10.56 mins
To_list: 4.16 mins
zip list comprehension: 4.04 mins
vectorization & zip list comprehension: 3.36 mins
pre-vectorization & zip list comprehension: 1.13 mins
pre-vectorization & better list comprehension: 0.86 mins
```
red hound
#

Does anyone know where the idea behind Word Embeddings, especially the kind of embeddings the TensorFlow Embedding layer produces, comes from? I would like to cite the idea, and I already figured out that a lot of work happened over decades, but I'm still not sure who I can cite as an author of the idea. If anyone has an idea, feel free to @ me. Thanks 🙂

desert oar
desert oar
# red hound Does anyone know, where the idea behind Word Embeddings, especially these kind o...

you can cite the original word2vec papers as one of the early popular implementations, but i don't think the idea has a single "originator"
https://arxiv.org/abs/1301.3781
https://arxiv.org/abs/1310.4546

chilly skiff
#

I'm working on the next step in the project right now. When it's done the program will have to run for many hours most likely, so in that time ima clean up the code and i'll send you the updated code

desert oar
#

sure, would be happy to see what you did

#

if your code is all "numeric" (i.e. no strings, dicts, etc), you can probably get significant speedups by running with numba in "nopython" mode

chilly skiff
#

yeah I looked at that yesterday but if I'm being honest it looked very difficult to install all the stuff for it and get it running properly. I know I'll have to do it eventually but I thought I'd just wait for now xD

grave frost
#

hmmm... OpenAI didn't look like they did much filtering

desert oar
chilly skiff
#

yeah that's what I did, but to put it simply, it didn't work as well as I was hoping xD

#

But i'll probably have to use it soon

lapis sequoia
#

hey, does anyone know where I can find a good written explanation of why normalisation doesn't work on SMOTE data? I know it is because we are already normalising the unbalanced class, but I was wondering if there is a better explanation out there that I could use.

flat hollow
#

Does anyone have experience with creating a Monte Carlo simulation? I need to create a 2D model of a blood vessel in the brain (simple - central circle enclosed by 2 barriers with some permeability, i.e. either a chance for molecule to get through based on dice roll or trapping and releasing after some time) where in the middle I would have a stream of new particles (simulating blood passing through) and I would need to do random walk while recording the number of particles in each of the 3 compartments (inside, between barriers, outside). I was thinking about using something like pygame to do the simulation, but I would prefer doing a LOT of particles and efficiency is the key since Im running it on a laptop.

desert oar
#

use pypy

#

store your classes with __slots__ (although this is mostly a no-op in pypy)

#

that said, if you can implement this as a loop over a numpy array, you can probably do even better with numba, than with pypy

#

that, or maybe you can repurpose BUGS or Stan to do the heavy lifting for you; i've only used those for bayesian probability modeling so i wouldn't know how

#

e.g. in numba it might look something like this:

```
import numba

@numba.njit
def run_simulation(n_samples):
    n_inside = 0
    n_between = 0
    n_outside = 0
    for i in range(n_samples):
        # Do your complicated stuff here
        ...
```
#

but it sounds like maybe you can "vectorize" this simulation? e.g. pre-generating a big list of random values with np.random and then doing cumsum-type calculations thereupon. if you can post the actual algorithm for the simulation i can probably help more
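
A minimal sketch of the "pre-generate random values, then cumsum" idea, assuming a plain unbiased 2D random walk with Gaussian steps and no barriers. The function name and parameters are illustrative, not something from the discussion:

```python
import numpy as np

def random_walks(n_particles, n_steps, step_scale=1.0, seed=0):
    """Return positions of shape (n_steps + 1, n_particles, 2), starting at the origin."""
    rng = np.random.default_rng(seed)
    # Draw every step for every particle up front.
    steps = rng.normal(scale=step_scale, size=(n_steps, n_particles, 2))
    # A cumulative sum over the time axis turns steps into positions.
    positions = np.cumsum(steps, axis=0)
    # Prepend the common starting point (0, 0).
    start = np.zeros((1, n_particles, 2))
    return np.concatenate([start, positions], axis=0)

paths = random_walks(n_particles=1000, n_steps=500)
# Distance of each particle from the centre at every time step.
radii = np.linalg.norm(paths, axis=-1)
print(paths.shape, radii.shape)
```

This only works as-is while each step is independent of the particle's state. Once a barrier makes the next move depend on the current position, a step-by-step loop over the arrays is probably needed instead.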

#

depends of course on your performance requirements too

chilly geyser
#

Before pypy I would recommend just having anything working first

flat hollow
#

@desert oar lots of info here, unfortunately I got the task like 30 minutes ago and I don't even know how to set the model up such that the particles interact with the barrier yet 😦 once I figure that out I can start working on optimisation

chilly geyser
#

Then you can think about jit/compilers/etc. later

#

Yeah I recommend having a minimum working example before optimisation

#

Although the faster you get a minimum working example, the faster you can optimise

flat hollow
#

ofc getting it done with just one is fine for now, I just don't know how to do the particle-wall interaction because I feel like that is game design and I have no experience in that

#

which is why I thought about using pygame at first...

#

hm... if the barrier is circular, would I make the particles do the classic random walk and after each step, check if their initial position was less than r away from the centre and final step more than r and if it is, add a random probability of them not moving at all?

#

does that sound good as barrier simulation?

desert oar
#
#

in your case, what's the logic? "particle hits boundary, then p% particle goes through it vs bounces off"?

chilly geyser
flat hollow
flat hollow
#

I might eventually have to add in a bias in the walk that would make it more likely to walk away from the centre but that's down the road

desert oar
#

this is just 2d though right? circle inside, infinite area outside, particle has p chance to pass outside the circle when it hits the boundary?

flat hollow
#

yeah I think 2D will be enough in this case just for simplicity

#

I can try drawing it

arctic wedgeBOT
#

Hey @flat hollow!

It looks like you tried to attach file type(s) that we do not allow (.heic). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

flat hollow
#

no one likes apple's .heic 😦

#

@desert oar something like that? There could be a central starting point, then 2 circular barriers with p1 and p2 probabilities of particles passing through them and infinite space outside (may need to constrain it eventually, have to talk to supervisor first)

chilly geyser
#

Seems possible

chilly geyser
flat hollow
desert oar
#

can the particles bump into each other?

#

and do the particles have spatial extent or are they just points?

flat hollow
#

for the simplicity of getting at least something done I would do point-like and no collisions, but eventually I would probably have to add collisions and Im not sure about dimensions, would have to check the tables

#

I have molecules inside a blood vessel, do point-like particles work with collisions? 😄

#

doesn't sound like they would...

waxen sinew
desert oar
#

@flat hollow do the whole thing in polar coordinates

#

a particle isn't an (x,y) pair, it's an (angle,radius) pair

#

then a "collision" with the barrier is just particle.radius >= barrier_radius

#

the "particles" can be a 2d numpy array, and you can also use an array to track which particle is in which circle, so you don't have to recompute it at every step

```
particle_positions = np.array([
    [radius0, angle0],
    [radius1, angle1],
    ...
])

particle_circles = np.array([
    0,  # inside the inner circle
    1,  # between the circles
    2,  # outside the outer circle
    ...
])
```
#

a "step" could be something as simple as a fixed-size step in a random direction

#

just need to do some trig to figure out the new angle and radius after a step
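
A rough sketch of one such step, assuming point particles, a single circular barrier centred at the origin, and the simplification that a blocked particle just stays where it was (no reflection). The function and parameter names are made up for illustration:

```python
import numpy as np

def step(radius, angle, barrier_radius, p_pass, step_size, rng):
    """Advance every particle by one fixed-size step in a random direction."""
    n = radius.shape[0]
    # Convert to Cartesian, take the step, convert back to polar.
    x = radius * np.cos(angle)
    y = radius * np.sin(angle)
    direction = rng.uniform(0.0, 2.0 * np.pi, size=n)
    new_x = x + step_size * np.cos(direction)
    new_y = y + step_size * np.sin(direction)
    new_radius = np.hypot(new_x, new_y)
    new_angle = np.arctan2(new_y, new_x)

    # A crossing means the particle starts on one side of the barrier and ends on the other.
    crossing = (radius < barrier_radius) != (new_radius < barrier_radius)
    # Each crossing attempt succeeds with probability p_pass.
    blocked = crossing & (rng.random(n) >= p_pass)
    # Blocked particles keep their old position (assumption: no reflection).
    return np.where(blocked, radius, new_radius), np.where(blocked, angle, new_angle)

rng = np.random.default_rng(0)
radius = np.zeros(10_000)  # all particles start at the centre
angle = np.zeros(10_000)
for _ in range(200):
    radius, angle = step(radius, angle, barrier_radius=5.0, p_pass=0.001,
                         step_size=1.0, rng=rng)
print("particles outside the barrier:", int(np.sum(radius >= 5.0)))
```

A second barrier would be another `crossing` mask with its own probability, and the per-compartment counts come from comparing `radius` against the two barrier radii.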

pine wolf
#

though i often use manhattan distance when doing that stuff

desert oar
#

idk how well it works for non-point particles

#

i also can't think of how to calculate the next step without going back to cartesian coordinates 🤦

#

i'll have to write it out on paper

#

oh i think that's actually how you do it

pine wolf
#

yeah, but you can vectorize that stuff conveniently --- while particle-particle collision is probably some python loop, dunno

flat hollow
#

ah, my terrible math skills are catching up to me, the one thing that Im not sure about right now is the particle tracking...

#

unfortunately in 7 hours Im driving 1,5 hours and then walking for 8 more hours so I need some sleep, thanks a lot for your time @desert oar

desert oar
#

of course, i'm happy to procrastinate on my own work with this 😛

desert oar
desert oar
pine wolf
#

adding this to a long todo list

lapis sequoia
desert oar
#

need to figure out how to animate

desert oar
# lapis sequoia Minmax, standard etc

i haven't heard that before, but i can imagine that there is a problem with using simulated data to estimate things like the maximum, mean, etc. i would guess that you should do those things before oversampling

iron basalt
# flat hollow Does anyone have experience with creating a Monte Carlo simulation? I need to cr...

Brownian motion, or pedesis (from Ancient Greek: πήδησις /pɛ̌ːdɛːsis/ "leaping"), is the random motion of particles suspended in a medium (a liquid or a gas). This pattern of motion typically consists of random fluctuations in a particle's position inside a fluid sub-domain, followed by a relocation to another sub-domain. Each relocation is follo...

#

Or just really simple, bunch of particles bouncing around and when they hit the border they have some chance of being repelled or passing through.

pine wolf
#

gif is choppier than the program itself

lapis sequoia
pine wolf
#

ok, this is with a .001 probability of passing the barrier

grave frost
pine wolf
#

this is only a single barrier, which i didn't draw

grave frost
#

ohh, brownian motion

#

but why? 🤔

pine wolf
#

it's red blood cell diffusion apparently

#

which is brownian motion with a barrier

grave frost
#

Brain tickler - I have a set of 2 angles (in rads) can anyone think up a way to represent two angles with a single number?

pine wolf
#

complex(angle_1, angle_2)

grave frost
#

the function has to be reversible too 👈

grave frost
pine wolf
#

unless there's some constraint on your angles, i don't think there's a reversible method for arbitrary angles

grave frost
#

usually those numbers would have a high decimal accuracy too

grave frost
#

guess I will just do multi-output regression ¯_(ツ)_/¯ returning the 2d vector in the end...

desert oar
#

wouldn't the banach tarski theorem suggest that such a function exists, albeit probably not one that we can comprehend or even write the definition of?

desert oar
#

matplotlib animations are being a pain

desert oar
#

bless you

#

let's see how bad mine is compared to yours

#

i haven't done the random diffusion collision part yet, got sidetracked w/ animations

pine wolf
#

i just hacked apart your code

desert oar
#

i hacked apart my code too 😛

pine wolf
#

might require some knowledge of nurses_2 particles

desert oar
#

nurses 2?

#

is that your own library?

pine wolf
#

yes

#

terminal graphics library

#

the README has animated examples

pine wolf
desert oar
#

unixporn people might like that kind of thing

pine wolf
#

it's pretty python-specific -- there's better in the c-domain

desert oar
#

fair enough

pine wolf
#

if notcurses python bindings get improved and they add support for windows terminal, i'll probably make a nurses_3

velvet thorn
#

we're not talking arbitrarily, right

grave frost
velvet thorn
grave frost
#

the angles have a pretty high degree of accuracy

velvet thorn
#

like

#

when I say "arbitrary"

#

I mean, not real numbers

pine wolf
#

you can take every digit of one angle and every digit of another angle and zip them together

velvet thorn
#

but some numpy number type

grave frost
#

like 3.245145416426537532

#

np.float64

velvet thorn
#

at least, not in the general case

grave frost
velvet thorn
#

or rather...

pine wolf
#
```
angle_1 = .01010101010...
angle_2 = .5959595959...

compressed = .051905190519...
```
#

this would work

grave frost
#

oh, and the number has to be real and function reversible

velvet thorn
pine wolf
#

it would still work, you would halve their precision before compressing

grave frost
#

as in they have different lengths?

velvet thorn
#

therefore not reversible

pine wolf
#

just use a higher precision float for the final

grave frost
#

well, they are both angles so the function should be sensitive to both and actually retain their information

pine wolf
#

and you don't lose information

grave frost
#

A simpler method could be just to output a 2D vector, but I wanna see what you guys come up with

velvet thorn
pine wolf
#

then use float128

velvet thorn
#

numpy's float128 is usually just 80-bit extended precision

#

with extra padding

pine wolf
#

why is that my fault

velvet thorn
#

the constraints were stated @ the start

#

@grave frost realistically speaking though

#

are you going to need all the digits?

pine wolf
#

you can create a new type that stored 128 bits, and then you can represent your 2 64 bit numbers as a single 128 bit number
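
A minimal sketch of that 128-bit idea, using an ordinary Python int (which has arbitrary precision) as the container instead of a custom type. The helper names are hypothetical. It is reversible bit for bit, but the result is an integer code, not a number that behaves like an angle:

```python
import struct

def pack_angles(angle_1, angle_2):
    """Pack two float64 values into a single 128-bit integer."""
    raw = struct.pack(">dd", angle_1, angle_2)  # 16 bytes, big-endian
    return int.from_bytes(raw, "big")

def unpack_angles(packed):
    """Recover the two float64 values from the 128-bit integer."""
    raw = packed.to_bytes(16, "big")
    return struct.unpack(">dd", raw)

packed = pack_angles(3.245145416426537532, 0.5959595959)
print(packed)
print(unpack_angles(packed))
```

Nearby angles do not map to nearby codes in any useful way, so this is likely a poor regression target compared with simply predicting two outputs.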

grave frost
#

yep,

  1. Function has to be reversible
  2. output should be Real
  3. Information from both angles should be preserved in the output
velvet thorn
#

15 is generally a lot

pine wolf
#

you can't store 8 bits of information in 4 bits though

grave frost
velvet thorn
velvet thorn
pine wolf
#

i mean, i'm not recommending this

velvet thorn
#

okay

#

how about this

pine wolf
#

this seems like an XY problem, but i don't address this issue

velvet thorn
#

commission a custom processor that can natively handle quad precision floats

grave frost
velvet thorn
#

along with the attendant firmware etc.

#

problem solved

#

🙏

grave frost
#

nice. anyone up for some funding?

pine wolf
#

like, what problem do you solve by using one float to represent two?

velvet thorn
#

I've got a couple of thoughts and prayers

velvet thorn
#

but I also think it's not the right way to do it

grave frost
pine wolf
#

especially when you have to use resources to go to and from the representation

grave frost
#

I can just get a 2D array

#

and treat it as a multi-output regression problem 🤷

grave frost
#

just asking mathematically here. 2 angles, 3 conditions

pine wolf
#

you can do it with infinite precision floats

velvet thorn
#

yeah

grave frost
#

maybe there's something with some transformation?

velvet thorn
#

mathematically, it is defo possible

iron basalt
#

In C you can just create 128bit floats directly (GCC x86, x86-64). Idk if numpy supports it.

grave frost
velvet thorn
iron basalt
#

Can of course just wrap the 128 bit floats with cython.

grave frost
#

I don't think that its actually fully 64b

iron basalt
#

Doing it manually without the hardware support would be super slow.

grave frost
#

its def in middle

#

like a little more than float32

#

technically, isn't it a 2D array, so a vector that could be transformed in such a way as to map onto a 1D line?

#

so we could reverse the transformation and get theoretically the same thing back and the transforming matrix has no eigenvectors

iron basalt
#

This using one number to represent two things seems like something one would find in some old commodore 64 code or something so maybe look in that area if you really want an answer.

#

But it does sound useless even if it's doable, much like the xor variable swap.

pine wolf
finite wasp
#

I've got a question in Chocolate if someone can help out.

serene scaffold
#

@finite wasp idk what chocolate is, but you should always just ask your question. Asking if someone knows about the topic of an unasked question is less helpful than putting the real question out there.

#

Looks like I may have misread what you had said 🤷‍♂️

desert oar
pine wolf
#

i just didn't change the radius at all when crossing barrier, instead of reflecting

desert oar
#

I saw

pine wolf
#

the lazy way

#

i don't think you have to add too much to reflect

#

or maybe you do

#

could just give the particles a velocity and then all you have to do is negate it

desert oar
#

To reflect i think you'd have to find the distance from point to barrier, compute the tangent of the circle at that point, get the perpendicular line to that, then compute the angle of reflection around it. Then reposition the particle accordingly

#

Or yeah don't use reflection rules and just reverse direction lol

#

All this probably goes out the window if the blood cells have spatial extent anyway
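
For a circle centred at the origin, the reflection described above can be sketched with vectors instead of explicit angles, assuming point particles that carry a velocity. The normal at the hit point is the radial direction, and the reflected velocity is `v - 2 (v · n) n`. The function name is illustrative:

```python
import numpy as np

def reflect(velocity, hit_point):
    """Reflect a velocity off a circle centred at the origin at the given hit point."""
    normal = hit_point / np.linalg.norm(hit_point)  # radial direction = circle normal
    return velocity - 2.0 * np.dot(velocity, normal) * normal

# A particle moving straight outwards along x bounces straight back.
print(reflect(np.array([1.0, 0.0]), np.array([5.0, 0.0])))
```

Only the component of the velocity along the normal flips sign; the tangential component is left unchanged.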

old thorn
#

Has anyone ever deployed a ML model on a chrome extension, looking to do that but haven't found many resources sadge

arctic wedgeBOT
#

Hey @errant flare!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

errant flare
#

Quick question: I have a .csv data file that is structured kinda like this, with 150 responses (https://docs.google.com/spreadsheets/d/12z4FBN8_mW3T7I4WPb2JbLx9WGxVnelfNspRwVMKF9Q/edit?usp=sharing). Is this in any way worthy of a linear regression or multivariate regression model? And if so, on the basis of what independent and dependent factors? Kinda new to AI and data science, so yeah, a little confused here

And if it isn't, are there any other models or ways through which I could build a predictive model with this kind of dataset?

I'd appreciate any and all direction / help, thanks once again!

umbral ferry
#

statquest on youtube has a good explanation of all of them, and usually he or others have implementation examples in python

errant flare
#

thanks a lot!

#

i'll have a look at them!

errant flare
#

cuz for some reason they were a little specific on having a regression model

#

i'm not sure why

umbral ferry
#

you can have whatever you want be your inputs and whatever you want be your output

errant flare
#

hmmm

umbral ferry
#

I've never really done multivariable linear regression but I think that's also a thing

#

there are advantages and disadvantages to every method

errant flare
#

yep i don't have that many independents though for multivariable

#

anyways thanks for the help, gotta go now so yeah!

umbral ferry
#

gl!

gusty frost
#

Do any of you know where the best place is to learn data science, and will age be a barrier?

velvet thorn
gusty frost
#

I got a while before I can work.

velvet thorn
#

then you're old enough to learn

gusty frost
velvet thorn
#

🥴

#

ew

#

sorry...not a Java fan

#

but yeah

#

you can defo start learning

#

how're you @ mathematics?

#

in particular, statistics

gusty frost
#

in algebra

velvet thorn
#

linear algebra?

gusty frost
#

I'll look into it.

velvet thorn
#

statistics knowledge is important for data science

#

it depends on what you wanna do, though

#

it's a wide field

#

depending on your specialisation

#

graph theory

#

linear algebra

#

calculus

#

all might be relevant

gusty frost
#

Do you know where I can learn data science?

velvet thorn
#

there's tons of stuff online

thorn bobcat
#

gpt-j is nice

errant flare
#

what the hekk is KeyError

#

in pandas

#

god its driving me nuts

#

nvm i got it but still why

velvet thorn
errant flare
#

yeah deleted a piece of code in my notebook by mistake

#

and sometimes there's apparently a space in the csv's name which I didn't notice

maiden bluff
#

Hey there, I'm really interested in Data Science, but how should I begin?

agile cobalt
#

do you already know Python or just getting started overall?

maiden bluff
#

Well, I already know python

agile cobalt
#

mostly this channel's pins then

maiden bluff
#

Sorry, english is not my first language, could you try to elaborate?

agile cobalt
#

Check the Pinned messages 📌

maiden bluff
#

Thanks!

errant flare
#

uh quick question

i'm doing this in python with pandas

```
soi = data["SOI"]
```
and the output i get has the index column in it, any way i can get rid of it?
vague stratus
velvet thorn
#

data.reset_index()['SOI'] should do

errant flare
#

and soi = data["SOI"].values works in that regards i think so yeah thanks!
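
A small sketch of the options, using a made-up frame in place of the real CSV. Note that a Series always carries an index, so `reset_index()` only swaps it for a fresh 0..n-1 one. Dropping it means converting to an array or list, or hiding it when printing:

```python
import pandas as pd

# Hypothetical data standing in for the CSV from the question.
data = pd.DataFrame({"SOI": [0.3, -1.2, 0.8]}, index=[10, 11, 12])

soi = data["SOI"]                # Series: values plus the index labels
values = data["SOI"].to_numpy()  # plain NumPy array, no index
as_list = data["SOI"].tolist()   # plain Python list

# When only the printed output matters, hide the index instead.
print(data["SOI"].to_string(index=False))
```

`to_numpy()` is the currently recommended spelling of `.values` for this purpose.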

thorn bobcat
#

can I ask you a question

#

if you train a transformer on the bible

#

what would be the result?

errant flare
#

wait what

thorn bobcat
#

would it come up with new data or would it take input and put it into context

errant flare
#

a transformer?

thorn bobcat
#

a transformer is an NLP model.

errant flare
#

have no experience with nlp

thorn bobcat
#

ahh yea same here tbh..

#

I wanna get into it. I just worked with image processing.

errant flare
#

but from a couple of google searches i'm assuming it would come up with new data

#

that would resemble some things that the bible has

#

because apparently transformer models are used for text summarization

thorn bobcat
#

would it be an accurate representation of the bible

errant flare
#

so that's taking the idea of the text but coming up with technically new data

errant flare
thorn bobcat
#

yea... I was wondering what kind of model I would need to have to answer questions directly from say the bible, torah or the Quran.

errant flare
#

ahhhhhh

#

interesting

thorn bobcat
#

where it has to be onpoint in the interpretation of questions, verses and choice of answers.

errant flare
#

hmmm maybe then

thorn bobcat
#

transformers are all the rage these days, to be honest, thought it would be a good choice

#

have you tried gpt-j?

#

try this

errant flare
#

the bible i'm assuming is quite vast so it's not like only 5 or 6 questions that you have to answer

thorn bobcat
#

```
For, it seems that the church has no plans for a Christmas celebration this year. Instead, the Vatican is proposing a celebration of the end of the world.

Last week a Vatican statement said that the Pope is planning to address the United Nations on December 21st, in order to address the "global environmental crisis".

It said that the Pope will urge the world's leaders to work for a "dramatic reduction" in carbon emissions.

In his address, the Pope is```
#

the prompt was "In god we trust, said the Pope to" lol

errant flare
#

in addition it isn't like the questions from a bible are like "accurate" as in when someone asks a question they aren't looking for a specific value it could be varied

#

probably very tough to be honest seeing how there might not be one answer to a question asked in regards to religion

thorn bobcat
errant flare
#

and some that could be quite wrong

thorn bobcat
#

hm...

errant flare
thorn bobcat
#

but there should be general consensus.

thorn bobcat
#

and a supporting corpus perhaps.

errant flare
#

hmmm i don't know how to help but yeah lol

#

anyways man gotta run bye!

thorn bobcat
#

based on what i know from law there's cases based on cases based on the constitution.

thorn bobcat
thorn bobcat
old grove
#

Hey guys, I have outliers in my covid dataset and I'm not sure how to deal with them. Take active cases: among entries like Usa, Nw, the Usa value is above 100k or something, which impacts the mean too. The outlier is actually a valid value, since the number of cases really can be in the 100k range, so how do I deal with this type of outlier?

wheat yew
#

hey need some help with a simple numpy question

#

i asked this some days ago as well but couldnt figure it out

royal crest
#

have you tried the help channels

#

ah beat me to it

wheat yew
#

yes

royal crest
#

what have you tried so far

wheat yew
#

well i have been taught stack, concatenate, arange and some other basic numpy stuff

#

and i havent been able to do anything lol

#

idk how

#

because this isnt a hard one

#

i had someone help me with slicing and stuff a few days ago

#

actually i got the first function now

#
```
#!/usr/bin/env python3

import numpy as np

def get_row_vectors(a):
    return [i[np.newaxis,:] for i in a]


def get_column_vectors(a):
    s = [a[:,i] for i in range(len(a))]
    return [i[:,np.newaxis] for i in s]

def main():
    np.random.seed(0)
    a=np.random.randint(0,10, (4,4))
    #a = [[5, 0 ,3], [3, 7, 9]]

    print("a:", a)
    print("Row vectors:", get_row_vectors(a))
    print("Column vectors:", get_column_vectors(a))

if __name__ == "__main__":
    main()
```
#

this is what i got atm

#

this almost passes the tests. it says this though:

"FAIL:
RowsAndColumns: test_column_count
3 != 5 : Wrong number of columns"

royal crest
#

ahh yea

#

it works when n == m but not when n != m

wheat yew
#

ur talking about (n, m)

royal crest
#

yes

wheat yew
#

kk

#

if u use that list that is hidden with "#"

#

my code fails

#

says list indices are tuples

#

what is [:,np.newaxis] in numbers?

#

how do i fix my code

#

fixed it god damn this one sucked

royal crest
#

one problem i found was the range(len(a))

#

ah good

#

👍

wheat yew
#

yep that was it

#

i had to put range(a.shape[1])

royal crest
#

if in doubt print everything haha

wheat yew
#

yep that helped it

#

i have another one but i have to try it first

#

it looks pretty hard tho

royal crest
#

i did len(a[0])

#

same thing i guess

wheat yew
#

that doesnt work

royal crest
#

oh?

wheat yew
#

if u call with (5,4)

#

its gonna go up to 4

#

and while the list only has 0,1,2,3

royal crest
wheat yew
#

"IndexError: index 4 is out of bounds for axis 1 with size 4"

#

this is what i get

royal crest
#

I didn't remove range if that's what you mean

grave frost
thorn bobcat
grave frost
wheat yew
thorn bobcat
#

or create it from scratch purely based on the bible?

thorn bobcat
royal crest
wheat yew
#

show what u did

royal crest
#
```
def get_column_vectors(a):
  s = [a[:, i] for i in range(len(a[0]))]
  return [i[:, np.newaxis] for i in s]
```
wheat yew
#

ah that works yeah

#

i thought u meant u did

#

range(a.shape[0])

#

this doesnt work
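
To make the row/column mix-up concrete, a short sketch on a non-square array. The `np.asarray` call is an addition here so that the nested-list input from the commented-out line also works:

```python
import numpy as np

def get_column_vectors(a):
    """Return each column of a 2D array as an (n_rows, 1) array."""
    a = np.asarray(a)  # also accepts nested lists
    # a.shape[1] is the number of columns; len(a) would give the number of rows.
    return [a[:, i][:, np.newaxis] for i in range(a.shape[1])]

a = np.arange(15).reshape(3, 5)  # 3 rows, 5 columns
cols = get_column_vectors(a)
print(len(cols), cols[0].shape)  # prints: 5 (3, 1)
```

`x[:, np.newaxis]` turns a 1D array of length n into a 2D array of shape (n, 1), which is what makes each column a column vector.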

royal crest
#

anyways yay!

thorn bobcat
#

I'll have to look into it

cedar void
#

I couldn't find any channel for machine learning doubts, so should I post them here?

thorn bobcat
#

yes

cedar void
#

Topic - Decision Trees in ML in Python

Given two different datasets, a training dataset and a testing dataset, the instruction was to model a decision tree on the training dataset, make predictions, select the best or the most ideal value of max_depth for the tree and then compare the results with the testing dataset.

I thought of splitting the training dataset, writing the training algorithm inside a loop over an arbitrary range, then select the result with the best accuracy and the corresponding max_depth. Is this a good way to get the best value of max_depth?

I would be happy to get suggestions.
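
One common way to do what is described above is k-fold cross-validation over a grid of depths, with the test set held back until the end. A minimal sketch, assuming scikit-learn is available and using a synthetic dataset in place of the real training and testing data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training and testing datasets.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation on the training data only, one search over candidate depths.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best max_depth:", search.best_params_["max_depth"])
# The test set is touched once, after the depth has been chosen.
print("test accuracy:", search.score(X_test, y_test))
```

Compared with a single train/validation split inside a loop, averaging over folds makes the chosen depth less sensitive to how the split happened to fall.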

cedar void
hollow falcon
#

how to change the legend when i plot this way

lapis sequoia
#

Is there a written source or any book where I can learn python image processing?
I'm so bored watching videos

royal crest
#

arXiv

fallow prism
#

Classify text with small data set

Hello, I am looking for ideas and knowledge. My task is to classify very particular legal text sentences, and my training data set is 1200 classified sentences. I have to classify them into 4 or 5 classes; I say 4 or 5 because I know what the problem is.

My vocabulary is around 20k unique words (filtered by min_df=10), and I tried classifying with BERT, CNN and SVM+TF-IDF.

The length of my sentences is close to 512 words, although I can change it.

My scores on the test part of 300 sentences are close to 65% (precision, recall, F1, etc.).

I don't know what to try next. Please help me with links, papers or anything else on text classification with a small data set.

serene scaffold
#

the topic of the sentences? something else?

desert oar
velvet thorn
#

plotting with pandas is convenient but you lose customisability

random prairie
#

hello all. i am facing a problem related to the installation of pandas. please help

desert oar
#

@fallow prism some options:

  1. use word2vec, glove, fasttext, bert, etc. to generate sentence vectors, then logistic regression
  2. PCA on the count-vectorized sentences, then logistic regression
  3. Factorization machine on the count-vectorized sentences

Logistic regression with L2/"ridge" regularization and linear SVM are somewhat interchangeable; mathematically they amount to the same model with a different loss function. There's also L1/"lasso" regularization and elastic-net regularization which is a blend of ridge and lasso. The differences should be fairly minor among all of these, although you can efficiently compute the entire "regularization path" for elastic-net and lasso, and you can efficiently compute "generalized cross validation" for ridge. Generally I tend to prefer logistic over linear SVM anyway because you also get a decent probability model out of it.

#

basically all 4 models are the same, minimizing the difference between y and w*x , but with different loss functions

serene scaffold
#
                     crf    bilstm      bert       crf    bilstm      bert       crf    bilstm      bert
               precision precision precision    recall    recall    recall        f1        f1        f1
micro animal    0.930068  0.890407  0.843404  0.702475  0.781577  0.854810  0.800407  0.832450  0.849069
      dose      0.711624  0.668271  0.636848  0.445432  0.617051  0.763121  0.547908  0.641640  0.694290
      exposure  0.853591  0.809184  0.642202  0.542343  0.695919  0.675735  0.663268  0.748290  0.658542
      endpoint  0.685054  0.705040  0.650032  0.367747  0.512205  0.617647  0.478584  0.593348  0.633426
macro animal    0.930068  0.890407  0.843404  0.702475  0.781577  0.854810  0.800407  0.832450  0.849069
      dose      0.711624  0.668271  0.636848  0.445432  0.617051  0.763121  0.547908  0.641640  0.694290
      exposure  0.853591  0.809184  0.642202  0.542343  0.695919  0.675735  0.663268  0.748290  0.658542
      endpoint  0.685054  0.705040  0.650032  0.367747  0.512205  0.617647  0.478584  0.593348  0.633426

I want the columns to be ordered by (crf, bilstm, bert) and then (precision, recall, f1) within those three groups.

#

calling sortlevel on the MultiIndex for the columns worked.

#

and reindexing from there.
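For anyone wanting the column reordering spelled out, here is a minimal sketch. It assumes the columns form a two-level MultiIndex with the model on the first level and the metric on the second, as in the table above; the frame itself is filled with placeholder numbers rather than the real scores.

```python
import numpy as np
import pandas as pd

models = ["crf", "bilstm", "bert"]
metrics = ["precision", "recall", "f1"]

# Columns laid out like the table above: grouped by metric, model varying fastest.
cols = pd.MultiIndex.from_tuples([(m, k) for k in metrics for m in models])
df = pd.DataFrame(np.arange(18).reshape(2, 9), index=["animal", "dose"], columns=cols)

# Build the wanted order explicitly: model first, then metric within each model.
target = pd.MultiIndex.from_product([models, metrics])
reordered = df.reindex(columns=target)
print(reordered.columns.tolist())
```

Building the target order by hand avoids relying on alphabetical sorting, which would put bert before crf and f1 before precision.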

flat hollow
#

@desert oar @pine wolf you guys are wizards 😄 I was worried the whole day about coding in polar coords with numpy (havent done it before) and after a long day I come back to what seems like a working model? I will try to download nurses_2 and run both codes tomorrow, I am completely exhausted after today's hike.

#

@desert oar thanks for including links to explanations ❤️

exotic palm
#

so in a couple of weeks im supposed to start an ai with python course

#

which i have been invited to

#

and i dont know anything about how those two work together and how to work with them as is

#

I have finished a couple of python courses and somewhat decently know python but i have no idea about its use in ai

#

can i get some help with that?

digital sandal
#

Aren't you supposed to learn all that on the course? Or is it a non-beginner course?

exotic palm
#

I just want to know what to look for

#

since this is the place i reached out to when i was beginning my initial python course

#

Just give me a little info about it so i can understand more of it when it comes to it

trail horizon
#

i hope u all are okay
i'm wondering
does anyone know a platform where i can do data analysis interviews ? like leetcode but for data analysis
i know about kaggle
but what i'm asking for is something like leetcode

shadow gate
#

Is this a good channel for asking help with Telegram bots?

serene scaffold
civic summit
#

Need a bit of help with an ordinal regression. The DV is a 10-point scale (0-10) and the IVs are on a 4-point scale (1-4). I also have two independent variables on a 2-point scale (yes/no). I am wondering if I can keep these variables, because the basis of my analysis assumes that there is 1 order of magnitude between 1-2, 2-3, 3-4, etc., but with binary it's more like 0-100.

#

Is there a book I can read to understand how best to set up an ordinal regression?

pine wolf
trail horizon
flat hollow
#

very cool, I would also need to keep track of the number of particles in each of the 3 areas and a 2nd barrier... so much to do 😦 but now at least I have an example to work from 🙂

pine wolf
#

does the diffusion rate increase as the diameter decreases?

hasty mountain
#

Hey guys, I want to create a new column for a dataset, but I'm having a small issue here. I've used the Close column to get the EMA9 and EMA21 columns. However, I've noticed that those EMAs aren't properly aligned (I want the EMA9 for the day 09/19/2014, index 2, to be at index 1, something that can be done in Excel).

I've tried doing this by removing the first row of EMA9 with

data['EMA9'] = data['EMA9'].drop([0])

However, this only makes the index 0 in EMA9 to become NaN. I've also tried using the argument inplace = True, but this results in the entire column being replaced by NaN.
Can someone lend me a hand?

flat hollow
pine wolf
#

my guess is it must --- as diameter decreases there should be more collisions with it, increasing the odds that a cell passes through

flat hollow
#

hm... wouldnt that be solved by adding particle collisions?

#

and perhaps some momentum calculations?

pine wolf
#

i don't think it matters

#

i think that's just what happens in simulation or real life, probably

#

probably good, so our fingers don't asphyxiate

flat hollow
#

oh sorry, I misunderstood your question, I think yes, for the same number of particles there should be an increase

flat hollow
hasty mountain
mortal dove
#

data['EMA9'].shift(1)

#

can't remember if it shifts forward or backwards, so shift(1) or shift(-1)
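To settle the direction question, a quick sketch on a made-up Close column (the numbers are placeholders, not the real data): a positive shift moves values to later rows, a negative shift moves them to earlier rows.

```python
import pandas as pd

data = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# shift(1) pushes values down: row i receives the value from row i-1.
data["prev"] = data["Close"].shift(1)
# shift(-1) pulls values up: row i receives the value from row i+1.
data["next"] = data["Close"].shift(-1)
print(data)
```

So moving the value at index 2 up to index 1 calls for shift(-1), and the last row becomes NaN because nothing follows it.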

flat hollow
hasty mountain
flat hollow
#

ah, then shift should work

hasty mountain
#

Yes, it worked. Thank you!

mortal dove
#

Keep in mind that if you're working on a price prediction model or any trading model, you're now looking at the next timeframe's data in that same row, so you're basically looking at a future value - so practically you would not be able to use that value in a real scenario to trade

hasty mountain
#

At least now I know how to modify datasets rows like this. I had to do this once with another dataset and ended up just opening the DataFrame in an excel file to do this.

lapis sequoia
#

how to data science?

serene scaffold
lapis sequoia
serene scaffold
lapis sequoia
#

thx

umbral ferry
#

maybe a hard question to answer but... does anyone know the major difference between k-modes clustering and multiple correspondence analysis? they both seem to have similar results and methods, but I'm not sure how to interpret them

wheat yew
#

can someone help me with numpy, specifically concatenate and how to use it

#

it combines arrays into one but what else can it do

serene scaffold
wheat yew
#

well i got a question thats hard for me and i gotta use concatenate in it

#

and some other things

serene scaffold
#

What is it?

wheat yew
serene scaffold
#

Do you know how to use the eye function?

wheat yew
#

yea i know what it does

#

i havent used it though

#

yeah cant do this without some help seems a bit too hard

desert oar
desert oar
#

out of curiosity why do you want it aligned like that?

hasty mountain
#

Just so it matches the chart where I took the data from.

arctic wedgeBOT
#

Hey @timid grove!

It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

mortal dove
hasty mountain
#

Nah, what I wanted was indeed shift.

mortal dove
#

with adjust enabled, it does the calculation around the centre value, with it disabled it does it around the last value, as it would in a technical indicator on a chart

hasty mountain
#

It's just that the EMA9 for day X in the chart was registered as the EMA9 for day X+1

mortal dove
#

I'd consider shift a workaround since you'd have to change the value depending on the window you're using

timid grove
#

for this i downloaded a dataset of approx 800 images, but the algorithm is giving only 102 images as output; the other images are not taken into consideration...

hasty mountain
mortal dove
#

That's interesting, currently slowly working on a technical analysis library and didn't run into that issue for the EMA

timid grove
#

this is the error..

#

plz help!!

lapis sequoia
#

anonymous

desert oar
lapis sequoia
timid grove
mortal dove
lapis sequoia
timid grove
desert oar
#

it was supposed to be -1

mortal dove
#

Yea, that's why it just confused me a bit, thought they might have the centring issue. Weird that the chart they're pulling from has it offset of 1 though

desert oar
#

i suspect there's something missing in their explanation, but 🤷‍♂️

#

perhaps they meant

data['EMA9'] = data['Close'].shift(-1).ewm(span=9).mean()
lapis sequoia
#

Hey. Using matplotlib.pyplot as plt how can I change the appearance of the date tick labels in my graph x-axis without re-writing my code in the way demonstrated in the docs: https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html

?

I'm not using fig, ax = plt.subplots() and then using fig or ax to control my graph. I'm just using plt so the ax methods are not available to me in the way demonstrated.

velvet thorn
#

then go from there

#

in general, though

#

I would advise against using plt methods

#

it makes it a lot harder to do stuff

lapis sequoia
desert oar
#

i generally use plt for quick things, and switch/refactor to fig, ax for more involved tasks

#

or use seaborn

vestal agate
#

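$$\mathrm{Huber}(x) = \begin{cases} \tfrac{1}{2}x^2 & \text{for } |x| \le \delta, \\ \delta|x| - \tfrac{1}{2}\delta^2 & \text{otherwise.} \end{cases}$$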

what do i need to learn to understand this lol
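In case code reads more easily than the notation: the formula is a piecewise function, quadratic for small inputs and linear for large ones, with delta marking the switch. A minimal numpy sketch, where the function name and the default delta=1.0 are my own choices:

```python
import numpy as np

def huber(x, delta=1.0):
    """Huber loss: quadratic where |x| <= delta, linear beyond that."""
    x = np.asarray(x, dtype=float)
    quadratic = 0.5 * x**2
    linear = delta * np.abs(x) - 0.5 * delta**2
    # np.where picks the quadratic branch where the condition holds.
    return np.where(np.abs(x) <= delta, quadratic, linear)

print(huber([-3.0, -0.5, 0.0, 0.5, 3.0]))
```

The two branches agree at |x| = delta (both give delta**2 / 2), so the curve has no jump there. Background-wise, absolute values, piecewise functions and basic loss functions are the things to read up on.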

desert oar
vestal agate
tender hawk
#

Hello 😄 I have a question about Pandas and am trying to figure something out (and I'm having very little luck finding out how).

Question: How do I aggregate data within a CSV (Add 2 cells together from different rows based on a common value)
Example:

Employee ID:, Box Type:, Count Per Order
0001, Large, 4
0001, Small, 2
0001, Large, 2
0001, Small, 3
0020, Small, 1
0020, Small, 2

I want to be able to calculate
Employee 0001 - Large - 6
Employee 0001 - Small - 5
etc

#

How would I go about doing this? or would I use something besides Pandas?
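Pandas handles this with a groupby. A minimal sketch, assuming the header names are cleaned up to "Employee ID", "Box Type" and "Count Per Order" (the example above has colons and spaces in them), and reading from an inline string instead of a real file:

```python
import io
import pandas as pd

csv_text = """Employee ID,Box Type,Count Per Order
0001,Large,4
0001,Small,2
0001,Large,2
0001,Small,3
0020,Small,1
0020,Small,2
"""

# Reading the ID as a string keeps the leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Employee ID": str})

# One row per (employee, box type) pair, with the counts summed.
totals = df.groupby(["Employee ID", "Box Type"], as_index=False)["Count Per Order"].sum()
print(totals)
```

With a real file, pd.read_csv would take the path instead of the StringIO object.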

odd meteor
random solar
tender hawk
random solar
lapis sequoia
#

Guys I need help.. I want to learn data science and ai I really want to learn but the problem is i don't have anyone around me interested in programming at all.. When I make new project or solve the problem i faced for two weeks there is no one who can celebrate with me.. At least I want someone I can make projects with.. I know it's silly problem I have.. But I just can't do it all by myself..
What do you guys think? Have you ever faced my situation? Any advice?

tender hawk
random solar
lapis sequoia
#

i have a question about tree based models like random forest and gradient boosting

#

do any of these methods have the same kind of persistence that ANNs have?

#

like an indefinitely long window for training

vale hedge
#

anyone know why tensorflow recommends installing from pip? it seems like a lot of pain to install it. i am wondering why not use a conda install?

ripe forge
#

use a conda install if you have it.

serene scaffold
serene scaffold
ripe forge
#

Where conda truly shines however, is two things: 1. it's not just a python package installer. it's a binaries installer. for any binary that's built for it. That means that conda makes trivial some installs that would be absolutely miserable without it. (actually, if im being completely honest, conda has a branding issue. this tool is so insanely good at what it does, but doesn't get enough emphasis on this aspect). the 2: closely related - because conda is a binaries installer, it can control python installations itself. This means your environments made in conda encompass multiple python versions too, and make it trivial to have different versions of python in different environments with zero friction.

#

Of course, this is all on top of also supporting pip installs. So there are genuinely zero downsides in that sense that i can think of.

#

to keep it simple though, if someone is already using conda, they should use conda installs before pip installs.

vale hedge
#

Like what darr said conda is able to install some lower level binaries that support some of python packages. I dont think i have needed to install manually although i dont know if this is optimal for your system. I would assume if you take the time and find the most optimal libraries for your architecture you could get better performance

novel elbow
#

specially when working with different versions of cuda & cudnn

mortal dove
flat hollow
novel elbow
onyx drum
#

Does anyone know how I can scale the x-axis labels with matplotlib.pyplot.hist()? I want the histogram to look exactly the way it does now, except that I want to display my x-axis ticks in % rather than fraction, so I would need to scale by a factor of 100

#

Got it. Since it seems common enough, for anyone else:
scale_x = 100
ticks_x = ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x*scale_x))
ax.xaxis.set_major_formatter(ticks_x)

opal loom
#

can someone experienced in machine learning help me in #help-bagel pls, I'm very new to machine learning

shadow gate
#

Hey, I made a Telegram bot with Python on VSC, but now I have only one question: does it work 24h/7?

surreal elm
#

😄

austere swift
#

the bot will work as long as the program is running, so just set it to run on a system that's on 24/7

#

you can use a VPS, a raspberry pi, or any system that you're willing to always leave on

dawn crown
#

can someone link me pandas vectorisation with numpy docs or something like that

serene scaffold
dawn crown
serene scaffold
#

are you sure you're not talking about encoding?

dawn crown
#

the first row is the column name

serene scaffold
#

I don't see the connection between these two.

dawn crown
serene scaffold
dawn crown
#

like you see

name
bob
john
#

here name is the column name

mortal dove
#

I believe they want to specify column names, and then rows in the column has to have the column name as a substring?

serene scaffold
#

can you explain what is happening here? what do these two have to do with each other?

#

this looks meaningless to me.

dawn crown
serene scaffold
#

why are you going from four rows to two? what does any of this mean?

#

why is one of them NaN?

dawn crown
serene scaffold
#

Alright, give me a moment

dawn crown
serene scaffold
#

you don't have to think about how numpy is involved.

serene scaffold
#

what did you read that made you think to ask this question? I think something is confusing you.

serene scaffold
#

In either case, when you're working with numpy and pandas, you should avoid using them with for loops or calling methods like .apply and .map as much as possible, as the other methods are optimized and do all the looping internally.

serene scaffold
# dawn crown

Yes, that is right. You should avoid iterating over rows as much as you possibly can.

#

You just have to look in the docs for what method does what you are trying to do.

dawn crown
serene scaffold
#

The methods that you'd be calling are usually vectorised. you don't have to do extra work to make them vectorised.

#

those things they list: arithmetic, comparisons, reductions, etc. Those are already implemented in pandas. You just have to use them.

mortal dove
#

Have always been able to use .apply, .map, and to a smaller extent, .rolling instead of ever having to iterate over rows

serene scaffold
dawn crown
serene scaffold
dawn crown
serene scaffold
#
In [45]: s
Out[45]: 
0         aaaa
1     aaaabbbb
2    cccccdddd
3    dddddeeee
dtype: object

In [46]: s.str.contains('aaaa')
Out[46]: 
0     True
1     True
2    False
3    False
dtype: bool

In [47]: s[s.str.contains('aaaa')]
Out[47]: 
0        aaaa
1    aaaabbbb
dtype: object
#

see how s.str.contains('aaaa') gives you True or False for each row as one operation. No looping required.

#

s[s.str.contains('aaaa')] then selects only those rows for which the condition is True.

dawn crown
#

ooh

serene scaffold
# dawn crown ooh

so, "vectorised" isn't something you have to do. it's a design concept where a given operation is applied to all the data.

#

Here's a similar concept with arrays

#
In [52]: a
Out[52]: 
array([[3, 2, 0],
       [4, 0, 2]])

In [53]: b
Out[53]: 
array([[3, 3, 1],
       [4, 3, 2]])

In [54]: a + b
Out[54]: 
array([[6, 5, 1],
       [8, 3, 4]])

In [55]: a * b
Out[55]: 
array([[ 9,  6,  0],
       [16,  0,  4]])
#

@dawn crown the different operations are applied element-wise, but syntactically, it looks like you're just adding two things. This is also vectorised.

dawn crown
#

alright

serene scaffold
#

Works with regular numbers, too.

In [56]: a / 2
Out[56]: 
array([[1.5, 1. , 0. ],
       [2. , 0. , 1. ]])

In [57]: 2 / a
Out[57]: 
array([[0.66666667, 1.        ,        inf],
       [0.5       ,        inf, 1.        ]])
dawn crown
#

thanks i will try some pandas general problems

serene scaffold
lapis sequoia
#

Anyone familiar with R and blogdown? Not necessarily a datasci question at this point.

serene scaffold
lapis sequoia
#

Hi guys, I understand that in SVM the regularization term C controls how complex a model is. For example a high C will tolerate misclassified data points

#

But how does this apply to support vector regression? For example, epsilon controls the width of the tube, so a wider tube will fit more data points and minimize the slack variables. But what about C here? How does it balance this, given that now we are trying to fit data points inside the tube?

#

I.e regression

desert oar
lyric ermine
#

hey guys, got a small problem with pandas

fifa["Weight"] = fifa["Weight"].astype(str).apply(lambda x: x.replace("lbs", "")).astype(float)

light = fifa.loc[fifa["Weight"] < 140].count()[0]
light_medium = fifa.loc[fifa["Weight"] >= 140] & fifa.loc[fifa["Weight"] >= 155]

error code:

unsupported operand type(s) for &: 'float' and 'float

anyone know how to solve this please? 😦

serene scaffold
#

@lyric ermine you only want one call to loc in that last part

#

You shouldn't be using the ampersand in between two calls to loc

#

Though I'm not really sure what the intended logic is

lyric ermine
#

can i pm you?

#

i wanna get values of weight between 140 and 155

serene scaffold
#

I'm going to bed soon, but even if I weren't, it's better to put your question where everyone can get to it

#

This channel is specifically for this kind of question

lyric ermine
#
light_medium = fifa.loc[fifa["Weight"] >= 140] & fifa["Weight"] >= 155
#

you mean like this?

serene scaffold
#

Yes. Can you state in English what this is intended to do?

lyric ermine
#

i have a series of values with weights

i wanna get the amount of weights between 140 and 155

serene scaffold
#

So both of those comparisons need to be inside the call to loc

#

Look where you have your closing ] for loc

#

Also one of the comparison operators is wrong. I think you can figure out which one is wrong.
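For reference, a sketch of the pattern being described, on a made-up Weight column. It assumes the middle bucket means "at least 140 and below 155"; adjust the bounds if the intended range is different.

```python
import pandas as pd

fifa = pd.DataFrame({"Weight": [130.0, 140.0, 150.0, 155.0, 160.0]})

# Both comparisons go inside a single .loc[...]. Each one is wrapped in
# parentheses because & binds more tightly than >= and <.
mask = (fifa["Weight"] >= 140) & (fifa["Weight"] < 155)
light_medium_rows = fifa.loc[mask]

# The boolean mask can also be summed directly to count the matching rows.
light_medium = int(mask.sum())
print(light_medium_rows, light_medium)
```

The mask is a single boolean Series, which is what .loc expects; combining two already-filtered frames with & is what triggers the operand error.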

lyric ermine
#

yeah

#

okay ill try some more

#

ty for tips 🙂

serene scaffold
lyric ermine
#

i never ran into this error before so iam kinda confused haha

#

have a good night

gentle epoch
#

trying this

#
from numpy import arange

b = round(-5/2,1)
c = round(5/2,1)

a = list(arange(b,c,0.1))
print(a)
#

outputting this

#
[-2.5, -2.4, -2.3, -2.1999999999999997, -2.0999999999999996, -1.9999999999999996, -1.8999999999999995, -1.7999999999999994, -1.6999999999999993, -1.5999999999999992, -1.4999999999999991, -1.399999999999999, -1.299999999999999, -1.1999999999999988, -1.0999999999999988, -0.9999999999999987, -0.8999999999999986, -0.7999999999999985, -0.6999999999999984, -0.5999999999999983, -0.4999999999999982, -0.39999999999999813, -0.29999999999999805, -0.19999999999999796, -0.09999999999999787, 2.220446049250313e-15, 0.10000000000000231, 0.2000000000000024, 0.3000000000000025, 0.4000000000000026, 0.5000000000000027, 0.6000000000000028, 0.7000000000000028, 0.8000000000000029, 0.900000000000003, 1.000000000000003, 1.1000000000000032, 1.2000000000000033, 1.3000000000000034, 1.4000000000000035, 1.5000000000000036, 1.6000000000000032, 1.7000000000000037, 1.8000000000000043, 1.900000000000004, 2.0000000000000036, 2.100000000000004, 2.2000000000000046, 2.3000000000000043, 2.400000000000004]
#

what do I do

royal crest
#

floating point error

gentle epoch
#

even after rounding?

royal crest
#

the problem isn't the rounding

#

it's the arange

gentle epoch
#

anything I can do?

royal crest
#

since 0.1 isn't exactly 0.1000000000000000000

#

i was going to suggest rounding the arange to whatever decimal points you want
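A sketch of that rounding suggestion, plus one alternative that sidesteps the accumulated error by stepping over integers and dividing once at the end. The variable names are mine.

```python
import numpy as np

raw = np.arange(-2.5, 2.5, 0.1)

# Option 1: round the finished array to one decimal place.
rounded = np.round(raw, 1)

# Option 2: step over integers (exact) and scale once at the end.
from_ints = np.arange(-25, 25) / 10

print(rounded.tolist())
```

The values are still binary floats, so they are the closest representable numbers to -2.2, -2.1 and so on rather than exact decimals, but they print cleanly and compare equal to the literals.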

gentle epoch
#

I can't figure out that module

gentle epoch
#

right

#

how to use it

#

so, basically, every mathematical operation I want to do with those numbers, I'll have to use decimal?

velvet thorn
#

is it an operation that really requires precision?

#

if not, I wouldn't bother

velvet thorn
glossy moth
#

Hi all, I've created an sns map of my p-values and all works as intended. I am trying to modify the heatmap coloring to be based around my alpha, as right now 1 is being colored the most, and 0 the least, whereas I really want to highlight significance. Any suggestions on a more significance based coloring method?

shadow gate
pearl heart
#

Hello

shadow gate
mortal dove
stuck karma
#

hello!
I have an error index 13 is out of bounds for axis 1 with size 1
and i dont understand, because i already ran it before and it did work. I removed my new lines and restored the old version

### PLSR ####
#OPEN THE CSV
data=pd.read_csv(r'C:\path\donnees_grece.csv')
datalist_x=data.values.tolist()
data=np.array(datalist_x)
print(data.shape)
data=np.random.permutation (data) #shuffle the rows
print(data.shape)

data_x= data[1:,65: ].astype(float)
data_y=data[1:, 13 ].astype(float)      #13: target column (clay)

#DEFINE FEATURES x AND TARGET y
X=data_x
y= data_y

#SPLIT INTO test set AND train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#CHOOSE THE PLSR MODEL AND SET ITS PARAMETERS
pls = PLSRegression(n_components=15,  max_iter=500)


#CROSS VALIDATION
scores = cross_validate(pls, X_train, y_train, cv=5, scoring="r2", return_train_score="true")
print(scores)  
print (scores["train_score"].mean())       #array of scores for each fold

#results for each test fold
scores1=cross_val_score(pls, X_train, y_train, cv=5, scoring='r2')
#mean of the CV r² scores
print (scores1)


#DEFINE HYPERPARAMETER: NUMBER OF LATENT VARIABLES (components?)
i= np.arange(1,20)
train_score, val_score = validation_curve(pls, X_train, y_train,
                                          'n_components', i, cv=5)


print(val_score.mean())         #mean CV score over all iterations up to 50 :S

plt.plot(i, val_score.mean(axis=1), label='validation')
plt.plot(i, train_score.mean(axis=1), label='train')
plt.ylabel('score')
plt.xlabel('n_components')
plt.legend()
devout zodiac
#

hello, could anyone refer to me a pytorch specific discord server?

neat cedar
#

Hey everyone, I hope this is the right channel to ask this data visualization question:
I'm looking for a library that can produce a graph similar to this Flourish line graph race: https://app.flourish.studio/@flourish/horserace/8
Ideally, I'd like to make it interactive in my react ts frontend, alternatively I'd like to display it as a gif or video.
So far I've only found bar graph races, for example made with plotly express or matplotlib (like those: https://pypi.org/project/bar-chart-race/, https://towardsdatascience.com/bar-chart-race-in-python-with-matplotlib-8e687a5c8a41)
I'm very thankful for any advice and hints, also let me know if I should move this question elsewhere! 🙂

rigid fable
#

hey guys

#

i want to know if the track of data science in python at datacamp is worth it or not

indigo skiff
#

hey guys just a general question about hardware for text generation task. So if i have a basic LSTM with attention layers and beam search algorithm which i want to train and evaluate on multiple datasets ranging between size 500mb to 4gb size (before pre-processing) whats the hardware i would need? For example within cloud how much ram and what kind of gpus i would need for quick training (ideally within 4/5 hours)
2) For fine tuning GPT Models (124M layer) on a 2.6gb dataset (before pre-processing) what kind of gpu + ram i would need. Goal is to finetune and evaluate within 10 hrs.
3) GPT NEO Model. For this how much ram and computational power i would need considering dataset size is 8GB. Goal is to fine tune within 24 hrs

serene scaffold
indigo skiff
#

yes for 1) LSTM's

serene scaffold
#

I'll defer to someone else as I don't want to lead you astray.

indigo skiff
#

no worries. Thanks tho

lapis sequoia
#

Question about grouping data by a datetime64[ns] column: can I group by day/hour/minute? I have been looking for an example but haven't made any progress

#

*using pandas

mortal dove
#

!d pandas.DataFrame.resample

arctic wedgeBOT
#

DataFrame.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)
Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the `on`/`level` keyword parameter.
mortal dove
#

Would this be what you're looking for?

lapis sequoia
#

that might be it, thanks. I am trying groupby, will look into the above as well
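Both routes work. A small sketch with invented timestamps, where the column names "ts" and "value" are placeholders, showing groupby on a floored timestamp next to resample with on=:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            ["2021-08-20 10:05", "2021-08-20 10:45", "2021-08-20 11:10", "2021-08-21 09:00"]
        ),
        "value": [1, 2, 3, 4],
    }
)

# groupby: floor each timestamp to the hour (or day) and group on that.
per_hour = df.groupby(df["ts"].dt.floor("h"))["value"].sum()
per_day = df.groupby(df["ts"].dt.floor("D"))["value"].sum()

# resample: same idea, with on= naming the datetime column to bucket by.
per_day_resampled = df.resample("D", on="ts")["value"].sum()
print(per_hour, per_day, per_day_resampled, sep="\n")
```

One difference worth knowing: resample also emits rows for empty buckets that fall inside the covered range, while groupby only returns buckets that actually contain data.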

vestal agate
#
alpha = 0.01
b0, b1, b2 = 0, 0, 0
x = [4, 7, 5, 7, 27]
y = [4, 3, 7, 7, 4]
z = [8, 10.5, 17.5, 24.5, 54]
error = []
for i in range(50000):
    idx = i % 5
    p = b0 + (b1 * x[idx]) + (b2 * y[idx])
    err = p - z[idx]
    b0 = b0 - alpha * err
    b1 = b1 - alpha * err * x[idx]
    b2 = b2 - alpha * err * y[idx]
    error.append(err)
error = list(map(abs, error))
error.sort()
print(error[:1])
test = float(input())
test1 = float(input())
pred = b0 + (b1 * test) + (b2 * test1)
print(pred)

I'm trying to use machine learning on this problem:
it is trying to calculate the area of a triangle without the equation

#

what do i do

#

please ping me

serene scaffold
#

@vestal agate if I understand correctly, you're trying to make a model that can predict the area of a triangle given the length of its sides?

#

You shouldn't use machine learning for things when there's a simple, known solution. But if this is just for practice, I guess you could do it with regression.

vestal agate
#

this is linear regression and gradient descent

#

but maybe im doing more than 2 variables wrong

#

i only really know how x and y works with b0 and b1

#

not up to b2 or infinity

serene scaffold
vestal agate
#

i want to do the equation myself

serene scaffold
#

in that case I'd use numpy but not sklearn.
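A rough numpy-only sketch of how the same idea extends past b0 and b1: stack the inputs into a matrix so that one coefficient vector covers any number of variables. The choices here are mine, not from the code above: the inputs are standardised first, the step size is 0.1, and the update uses the whole batch instead of one sample at a time. With unscaled inputs as large as 27, a step of 0.01 is likely too big and the coefficients can blow up to inf or nan.

```python
import numpy as np

# Same toy data as in the snippet above: two inputs (x, y) and a target z.
x = np.array([4, 7, 5, 7, 27], dtype=float)
y = np.array([4, 3, 7, 7, 4], dtype=float)
z = np.array([8, 10.5, 17.5, 24.5, 54])

# Standardise the inputs so one learning rate suits every coefficient.
features = np.column_stack([x, y])
mean, std = features.mean(axis=0), features.std(axis=0)
# Design matrix: a column of ones for b0, then one column per input.
X = np.column_stack([np.ones(len(z)), (features - mean) / std])

b = np.zeros(X.shape[1])  # [b0, b1, b2]; more inputs just add more entries
alpha = 0.1               # assumed step size, chosen for the scaled inputs
for _ in range(5000):
    err = X @ b - z                    # errors for all rows at once
    b -= alpha * (X.T @ err) / len(z)  # gradient step on half the mean squared error

# Closed-form least squares on the same matrix, as a sanity check.
b_exact, *_ = np.linalg.lstsq(X, z, rcond=None)

def predict(x_new, y_new):
    """Predict z for raw (unscaled) inputs."""
    scaled = (np.array([x_new, y_new]) - mean) / std
    return b[0] + scaled @ b[1:]

print(b, b_exact, predict(6, 4))
```

Keep in mind that a plane in x and y cannot reproduce a product like x * y / 2 exactly, so the fit will only ever be approximate on this kind of data.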

grim orbit
#

any1 here know how to create histograms?

serene scaffold
grim orbit
#

yes

serene scaffold
#

which one?

grim orbit
#

lemme send u the code

serene scaffold
#

Post it in this channel as text

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

grim orbit
#
import pandas as pd
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
url = "https://covid19.who.int/WHO-COVID-19-global-data.csv"
df = pd.read_csv(url)
filt = (df['Date_reported'] == '2021-08-20')
df1 = df[filt] 
filt = (df1['Cumulative_cases'] >= 2000000)
df2 = df1[filt] 
df2
serene scaffold
#

Great, now do print(df.head().to_csv()) and paste that text into this chat the same way.

grim orbit
#
,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
5363,2021-08-20,AR,Argentina,AMRO,9764,5106207,247,109652
17283,2021-08-20,BR,Brazil,AMRO,41714,20457897,1064,571662
26223,2021-08-20,CO,Colombia,AMRO,3154,4877323,93,123781
43507,2021-08-20,FR,France,EURO,23102,6384773,105,111839
47083,2021-08-20,DE,Germany,EURO,9280,3853055,13,91956
serene scaffold
#

Great, this makes sense. What are you trying to convey with your histogram?

grim orbit
#

countries over 2m cases, the new and recovered per country

serene scaffold
#

these are two histograms, then?

grim orbit
#

i was thinking can do in one but wouldnt look good so yes 2 would be the better approach i guess

serene scaffold
#

And I got this

#

you can probably mess with it from there

#

!docs pandas.DataFrame.plot.hist

arctic wedgeBOT
#

DataFrame.plot.hist(by=None, bins=10, **kwargs)
Draw one histogram of the DataFrame’s columns.

A histogram is a representation of the distribution of data. This function groups the values of all given Series in the DataFrame into bins and draws all bins in one [`matplotlib.axes.Axes`](https://matplotlib.org/stable/api/axes_api.html#matplotlib.axes.Axes "(in Matplotlib v3.4.3)"). This is useful when the DataFrame’s Series are in a similar scale.
grim orbit
#

i wanted it like this

serene scaffold
#

ahh, let me think

#

@grim orbit df.loc[df['Cumulative_cases'] >= 2_000_000].plot.barh('Country', 'New_cases')

grim orbit
#

I see

serene scaffold
#

!docs pandas.DataFrame.plot.barh

arctic wedgeBOT
#

DataFrame.plot.barh(x=None, y=None, **kwargs)
Make a horizontal bar plot.

A horizontal bar plot is a plot that presents quantitative data with rectangular bars with lengths proportional to the values that they represent. A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value.
serene scaffold
#

I'm not sure how you'd stack the two types of cases or change the colors.

grim orbit
#

what package did u use?

#

just pandas?

serene scaffold
#

yes, but it's calling matplotlib under the hood

grim orbit
#

wym?

serene scaffold
#

these are just dataframe methods, but those methods are calling matplotlib functions

#

you don't have to import matplotlib but you do have to have it installed.

grim orbit
#

I understand

#

why does it look like this

vestal agate
#

is there a way to stop nan and inf from showing in pycharm

#

and limit the numbers shown

grim orbit
#

u can filter them out?

#

=!

serene scaffold
grim orbit
#

i see

#

for my example

#
filt = (df2['Country'] != 'France, Spain, United States of America, England,');
#

how do I write this properly i see no difference

serene scaffold
grim orbit
#

when i define?

serene scaffold
grim orbit
#
 df2.loc[df['Cumulative_cases'] >= 2_000_000].plot.barh('Country', 'Cumulative_cases',df2['Country'].isin(['France', 'Spain', 'United States of American', 'England'])))

#

not like this

serene scaffold
# grim orbit not like this
selected = df2.loc[(df2['Cumulative_cases'] >= 2_000_000) & ~df2['Country'].isin(['France', 'Spain', 'United States of America', 'England'])]

selected.plot.barh('Country', 'Cumulative_cases')

You also deleted the ~, which was important.

#

~ is a negator. You can read (df2['Cumulative_cases'] >= 2_000_000) & ~df2['Country'].isin(['France', 'Spain', 'United States of America', 'England']) as "where Cumulative_cases is >= 2 million and Country is not in France, Spain, etc."
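A tiny sketch of what each piece evaluates to, on a made-up frame (the case counts are invented for illustration, not WHO figures):

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "Country": ["France", "Brazil", "Spain", "Colombia"],
        "Cumulative_cases": [3_000_000, 5_000_000, 4_000_000, 1_500_000],
    }
)

in_list = df2["Country"].isin(["France", "Spain"])  # True where the country is listed
big = df2["Cumulative_cases"] >= 2_000_000          # True where cases reach 2 million

# ~in_list flips every True/False, so this keeps big countries NOT in the list.
selected = df2.loc[big & ~in_list]
print(selected)
```

Printing in_list and ~in_list side by side is a quick way to see the flip.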

grim orbit
#

I see

#

what does the ~ do?

serene scaffold
#

It flips true and false values.

grim orbit
#

I see thanks

forest arrow
#

is it possible to retrieve info from a enjin forum? (without having to emulate a web browser)

glossy moth
#

Hi! I have a dataframe with three columns (df1):
column1: unique Identifier, column 2: 0 or 1, column 3: 0 or 1. I have a second dataframe (df2) with the same three columns, but with 1s and 0s in different rows. I want to join the two, so that if df1 has a 0 for a unique ID in the relevant column where df2 has a 1, df1 gets updated to be a 1. But if df2 has a 0, df1 stays as 1 for that ID, and nothing is done to df1 at all.

Importantly, the lengths of the two dataframes are not the same, and the IDs are not in the same rows, though df2 will always be a subset of IDs in df1.

In reality, my actual databases are around 30 columns as opposed to the 3 in the above example though.

lapis sequoia
#

does anyone have a code for the face recognition (opencv)

quiet vault
#

This isn’t open cv but it is still a really high level with keras

serene scaffold
#

@glossy moth it would be easier to follow what you're trying to do if you provided a minimal example (potentially with mock data), but it sounds like you need to merge

#

Also if the "IDs are not in the same rows" then you need to make sure you've set an appropriate index for each frame.

vestal agate
#

what library visualizes data like this

serene scaffold
#

@vestal agate matplotlib

arctic wedgeBOT
mortal dove
#

I doubt it's the best way, but that's what I likely would have done

#

As Stelercus mentioned, you'll want to make sure that you set your index as the ID column

glossy moth
glossy moth
mortal dove
#

Still not sure, do you want the column to change to the max of the two columns, or does it update based on something else? If it's something else could you explain when it should be updated and when it should not be updated?

glossy moth
#

Sure sorry. All columns contain binary values of either 1 or 0. If the df2 which is a subset of df1 has a 1 for an ID where df1 has a 0, I want df1 for that ID to change to 1. If df1 has a 1 and df2 has a 0, I want it to do nothing

#

So I want to update df1 just in cases where the ID in df2 has a value of 1 and the ID in df1 has a value of 0, in all other cases, do nothing

mortal dove
#

Yea, then the code snippet I provided should work

glossy moth
# mortal dove Yea, then the code snippet I provided should work

Thank you! Just so I understand your code:
idx = df2.index.intersection(df1.index)
df1.loc[idx, 'column1'] = np.where(df2['column1'] > df1.loc[idx, 'column1'], df2['column1'], df1.loc[idx, 'column1'])

line 1 sets idx as the Identifier values that are shared between the two sets. Then line 2 looks at those IDs in column 1, and where df2 is greater than df1, sets df1 to the df2 value, otherwise leaves it as is?

vestal agate
#

hey can someone teach me how to code multi variable linear regression

#

no tutorials that make sense

#

100% know how linear regression and gradient descent works

#

but without multiple variables

glossy moth
glossy moth
# mortal dove Yea, that's correct

So in my real situation, I have ~30 columns I need to do this for. Can I just loop through the snippet you provided updating the column as I go, or will this apply all at once to every column?

#

Same number of columns in df1 and df2 and they are named the same and everything if that matters

mortal dove
#

Snippet I gave needs to loop over the columns, so

for i in range(1, 30):
        col = f'Group{i}'
        idx = df2.index.intersection(df1.index)
        df1.loc[idx, col] = np.where(df2[col] > df1.loc[idx, col], df2[col], df1.loc[idx, col])
#

I think it should be possible to change it to just apply once, but my 1am brain has stopped functioning at 100%

glossy moth
#

Thank you! 🙂

serene scaffold
#

@mortal dove it looks like you're recomputing idx every time for no particular reason

mortal dove
#

Oh yea, can probably take that out of the loop

serene scaffold
#

Though I wonder if this can be accomplished without any loops

mortal dove
#

As I said, 1am brain is running on its last fumes

#

I'll probably have a look tomorrow

glossy moth
serene scaffold
#

For your reference, @glossy moth, you want to avoid loops and apply as much as possible in the context of numpy and pandas.
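
One loop-free possibility, sketched under the assumption that both frames are indexed by the ID column and hold only 0/1 values (the IDs, the `Group1`/`Group2` data and the variable names `aligned` and `updated` are invented for illustration):

```python
import pandas as pd

# df1 holds every ID; df2 holds a subset of them, in a different order.
df1 = pd.DataFrame(
    {"ID": ["a", "b", "c", "d"], "Group1": [0, 1, 0, 0], "Group2": [1, 0, 0, 1]}
).set_index("ID")
df2 = pd.DataFrame(
    {"ID": ["c", "a"], "Group1": [1, 0], "Group2": [0, 0]}
).set_index("ID")

# Give df2 the same shape as df1. IDs missing from df2 are filled with 0,
# so they can never turn a value in df1 into a 1.
aligned = df2.reindex(index=df1.index, columns=df1.columns, fill_value=0)

# Keep df1 where it is already >= df2, otherwise take the value from df2.
# This handles every column at once, with no Python-level loop.
updated = df1.where(df1 >= aligned, aligned)
print(updated)
```

Because the values are only 0 and 1, this amounts to an element-wise maximum of the two frames.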

mortal dove
#

Goes much more in depth than any tutorial will, but it explains the math/stats behind Multiple Regression, not how to code it
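
For the coding side, a minimal sketch of multi-variable linear regression with batch gradient descent in NumPy. The function name `fit_linear_regression`, the learning rate and the synthetic data are all assumptions made for the example, not something from a particular tutorial:

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by gradient descent on the mean squared error."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # one weight per input variable
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y                     # shape (n_samples,)
        grad_w = (2.0 / n_samples) * (X.T @ error)  # d(MSE)/dw, one entry per feature
        grad_b = 2.0 * error.mean()                 # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data with three input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b

w, b = fit_linear_regression(X, y)
print(w, b)
```

The only change from the single-variable case is that the weight becomes a vector and the products become matrix products; features on very different scales usually need standardising first for a fixed learning rate to behave.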

serene scaffold
lapis sequoia
#

Hey guys, how can I turn this final model so that I get cross-validation score of MAE and RMSE with scikit-learns cross_val_score

```
svr = SVR(kernel='rbf', C=100, epsilon=0.1, gamma=100)
svr.fit(X_train, y_train)

y_pred_train = svr.predict(X_train)
y_pred_test = svr.predict(X_test)

# Metrics: squared=True returns the MSE value, squared=False returns the RMSE value.

# Performance on training set
mae_train = mean_absolute_error(y_train, y_pred_train)
rmse_train = mean_squared_error(y_train, y_pred_train, squared=False)

# Performance on testing set
mae_test = mean_absolute_error(y_test, y_pred_test)
rmse_test = mean_squared_error(y_test, y_pred_test, squared=False)

print(f'SVR completed in : {round((time.time() - start_time), 2)} seconds...')
```
glossy moth
#

if there is a way to .apply() and .where() to do this, that would be awesome to learn

#

thanks again for the help!

serene scaffold
glossy moth
#

yes all identities of df2 are found in df1 somewhere

serene scaffold
#

Problem is I'm on mobile so I can't experiment

serene scaffold
glossy moth
velvet thorn
#

I feel like you should do a join

#

on the index

#

that’s my gut feel anyway

serene scaffold
#

@velvet thorn they just need a boolean series from the second dataframe to index the first, but there are indices missing in the second dataframe

#

Beyond that it's just df.loc[...] = 1

velvet thorn
#

no?

#

🥴 I read a bit up there

#

but I could be wrong I just woke up

serene scaffold
#

I've never used Series.where tbh

velvet thorn
#

is the same as the numpy version

#

with a default left

mortal dove
#

!d pandas.Series.where

arctic wedgeBOT
#

```
Series.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=<no_default>)
```
Replace values where the condition is False.
glossy moth
# velvet thorn is the same as the `numpy` version

Thanks for helping as well! It sounds like where will indiscriminately take the two dataframes I've joined and replace 0 values with 1s? I only want replacement when the index position of df2 is 1 and the index position of df1 is 0, no other times

#

Sorry super new to python in general, I'm sure im misunderstanding!

velvet thorn
#

that’s the point of the condition

#

what it does is basically

#

if condition is true, take from left, otherwise take from right

#

here left is df1 and right is df2
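
A tiny sketch of that behaviour, with two made-up Series standing in for the same column of df1 (`left`) and df2 (`right`):

```python
import pandas as pd

left = pd.Series([0, 1, 0, 1], index=list("abcd"))   # column from df1
right = pd.Series([1, 0, 0, 1], index=list("abcd"))  # same column from df2

# Where the condition is True the value from `left` is kept;
# where it is False the value from `right` is used instead.
result = left.where(left >= right, right)
print(result.tolist())
```

Only row `a` changes here (0 in `left`, 1 in `right`); every other row keeps its `left` value.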

glossy moth
#

But haven't I already joined them at this point?

velvet thorn
#

yes, but join means

#

both columns are present

#

so more accurately

#

left is the relevant column from df1

#

and right from df2

#

okay it’s a bit hard to visualise

#

if your current solution works

#

just go with it

desert oar
#

Same as stelercus im not at a computer but maybe i can answer later

velvet thorn
#

you know what really sucks

#

backticks on a phone

#

😔

lapis sequoia
# desert oar Can you clarify the question

So I have done a grid search to find the best hyperparameters for my SVR, but I want a cross-validated score of MAE and RMSE because right now the model is overfitting. The complete thing looks like this:

```
'epsilon': [0.001,0.01,0.05,0.1,1,10],
'gamma': [0.01, 0.1, 1, 10, 100]},cv=5, verbose=0, n_jobs=-1)

gsc = gsc.fit(X_train, y_train)
```

```svr = SVR(kernel = 'rbf',C=100, epsilon=0.1, gamma = 100)
svr.fit(X_train, y_train)

y_pred_train = svr.predict(X_train)

y_pred_test = svr.predict(X_test)

#Metrics - if squared = True returns MSE value, if squared = False returns RMSE value.

#Performance on training set
mae_train = mean_absolute_error(y_train,y_pred_train)
rmse_train = mean_squared_error(y_train,y_pred_train, squared = False)

#Performance on testing set 
mae_test = mean_absolute_error(y_test, y_pred_test)
rmse_test = mean_squared_error(y_test, y_pred_test, squared = False)```
desert oar
#

@velvet thorn you'd be surprised at how many code blocks I've written on my phone hah

glossy moth
# velvet thorn yes, but join means

Ok let me see if I understand this correctly:
So basically I'm joining them based on df1, which I've stated df2 is a subset of. So I'll now get a dataframe that goes from 30 columns to one with 60 columns, with a lot of the new 30 columns having a bunch of NaN rows. Then I do series.where() and say if columndf1 < corresponding columndf2, replace with columndf2 value, otherwise leave alone?

velvet thorn
#

this gives you a bunch of Series

#

then you pd.concat them

velvet thorn
desert oar
#

@lapis sequoia look up "nested cross validation" perhaps, sklearn has it. Although the grid search should already be using cross val

velvet thorn
#

it's a bit 🥴

desert oar
lapis sequoia
desert oar
#

You can tell it which scorers to use

#

You can even tell it to compute multiple different scores (although it will only use one for selection)

#

The cross validation score is whatever score you request

#

The default i think is the estimator's own score method (R² for regression and accuracy for classification) but you can change it

#

This is described in the reference docs for the various CV classes

lapis sequoia
desert oar
#

I think the parameter is scoring=

velvet thorn
#

I have generally found sklearn docs quite clear + comprehensive

lapis sequoia
# velvet thorn you should read the docs again

I have done that. I get "'mean_absolute_error' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options.", although they say that that is what to use in scoring

velvet thorn
#

because...

#

if you want the MAE scorer...

#

you need to use 'neg_mean_absolute_error'

#

and that is because

#

higher error is worse.

#

this information is available here

#

so my question is

#

did you read that?

velvet thorn
#

because if it was, then that's a documentation bug, so I would like to see it
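
To make the `neg_` scorers concrete, here is a rough sketch of getting cross-validated MAE and RMSE in one call with `cross_validate`. The data is synthetic and `gamma="scale"` is an arbitrary choice for the example, not the tuned values from the messages above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic regression data standing in for X_train / y_train.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

svr = SVR(kernel="rbf", C=100, epsilon=0.1, gamma="scale")

# Scorers follow "higher is better", so errors are returned negated.
scores = cross_validate(
    svr, X, y, cv=5,
    scoring=("neg_mean_absolute_error", "neg_root_mean_squared_error"),
)

# Flip the sign back to get ordinary (positive) errors, one per fold.
mae = -scores["test_neg_mean_absolute_error"]
rmse = -scores["test_neg_root_mean_squared_error"]
print(f"MAE:  {mae.mean():.2f} +/- {mae.std():.2f}")
print(f"RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
```

`cross_val_score` takes the same `scoring` strings but only one at a time.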

stuck karma
#

hello guys ~ can you help me with using grid search from scikit-learn on a PLS regression? here is my code:

#
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=15,  max_iter=500)

n_components= np.arange(1, 100) 
max_iter=[500]   
# grid search: create a dictionary of hyperparameters
param_grid = {'n_components':n_components, 'max_iter':max_iter ,
              'metric': ['r2', 'neg_root_mean_squared_error']}
grid = GridSearchCV(PLSRegression(), param_grid, cv=5)

#train the grid
grid.fit(X_train, y_train)

#

it keeps saying Invalid parameter metric for estimator PLSRegression(). Check the list of available parameters with `estimator.get_params().keys()`.

lapis sequoia
desert oar
#

There is a separate parameter for that, not part of the parameter grid

#

As with the other person, this sounds like a case of needing to read the user guide and reference documentation more carefully

stuck karma
#

mh, i already saw code with different metrics

desert oar
stuck karma
#

in this tuto for example

#

which line? cross_val_score is before

desert oar
# stuck karma in this tuto for example

That's the distance metric for the KNN model that I assume is being searched over, there is no "metric" parameter for PLS Regression as per the error and as per the docs

#

Unrelated to the scoring metric

stuck karma
#

i see, but in my case im interested by two metrics, does it mean i should apply grid search twice?

#

rmse and r²

#

in the doc you can see the different metrics for pls

grave frost
#

why is doing cp on individual files sooo slow 😦

stuck karma
#

i dont really get the problem . is the line before metric correct tho?

desert oar
desert oar
stuck karma
#

yes its to select the best parameters to get the best score

desert oar
#

Check the GridSearchCV docs

stuck karma
#

i think i got the idea?

desert oar
stuck karma
#

i know that PLS has a few parameters: among them the number of iterations the machine needs to learn, and the number of components

desert oar
stuck karma
#

by parameter you mean score?

desert oar
stuck karma
#

oh.. so i can use multiple metrics but only one parameter?

grave frost
#
for row in tqdm(train_df['Image_ID']):
    tgt_img = row + '.jpg'
    !cp ../input/dataset/Train_Images/Train_Images/$tgt_img ./FiftyOneDataset/data/
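
One likely reason the loop above is slow is that every `!cp` starts a new shell process. A sketch of doing the copies inside Python instead; the helper name `copy_images` is hypothetical, and the temporary folders below only stand in for the real input and output directories:

```python
import shutil
import tempfile
from pathlib import Path

def copy_images(image_ids, src_dir, dst_dir, suffix=".jpg"):
    """Copy <id><suffix> for each id from src_dir to dst_dir; return the number copied."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for image_id in image_ids:
        src = src_dir / f"{image_id}{suffix}"
        if src.is_file():  # skip ids whose file is missing
            shutil.copy2(src, dst_dir / src.name)
            copied += 1
    return copied

# Small self-contained demo using temporary folders.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Train_Images"
    dst = Path(tmp) / "FiftyOneDataset" / "data"
    src.mkdir()
    for name in ["img_1", "img_2"]:
        (src / f"{name}.jpg").write_bytes(b"fake image bytes")

    n_copied = copy_images(["img_1", "img_2", "img_missing"], src, dst)
    copied_names = sorted(p.name for p in dst.iterdir())

print(n_copied, copied_names)
```

In the notebook this would be called as something like `copy_images(train_df['Image_ID'], ...)` with the real paths.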
stuck karma
#

or maybe i just plot a validation curve

btw it doesnt change, i have the same error Invalid parameter metric for estimator PLSRegression(). Check the list of available parameters with `estimator.get_params().keys()`.
the error is somewhere else

```
param_grid = {'n_components':n_components,
              'metric': ['metrics.r2_score', 
                         'metrics.mean_squared_error']}
```
grave frost
#

@desert oar no difference with rsync 😦

desert oar
#

I recommend using DVC + rsync for this kind of thing

#

Or a Makefile with a wildcard

#
./bar/%: ../input/foo/%
    cp $< $@

I used 4 spaces instead of a tab because mobile, but that's the idea

desert oar
#

Rsync I think does more intelligent file diffing or deduplicating or something

desert oar
desert oar
#

The "metric" in that one example screenshot was unrelated to the scoring

#

In that very specific example, the model happened to have a parameter called "metric"

stuck karma
#

nooo i know i want both score, its not which score?

#

you said using one or more scores?

desert oar
#

Please read the docs and the user guide

stuck karma
#

okay.. maybe you give me a simple example maybe?

desert oar
#

Stop guessing based on examples

stuck karma
#

i already did

#

im not fluent in english so i read but sometimes it takes time to understand

vestal agate
#

why yall mad

stuck karma
#

lmao

#

its 2 am im soooo tired

stuck karma
#

okay

desert oar
#

But go to sleep first

stuck karma
#

no i have a meeting tomorrow

#

im stuck . I take too much time to correct my errors

#

lemme read

desert oar
#

I see. I understand it's probably not easy to read these docs if you don't read English well

stuck karma
#

yes, also im sorry if i look mad lmao

#

i'm happy you answered tbh

desert oar
#

I apologize for sounding annoyed. We get a lot of people in here who don't want to learn and just want people to do their homework for them

stuck karma
#

yes i know, i really wanna learn i think i m improving my english skills progressively

desert oar
#

Yes, unfortunately programming tends to be very English-centered

stuck karma
#

to give an idea of how i read the documentation: my eyes look for key words because i know them, but sometimes the verbs or the syntax confuse me

#

yes but its a good training !

#

it just needs time

#

okay this part is about multiple scores i guess? "If scoring represents multiple scores, one can use:"

#

isn't that what i tried to do? with 'metric': [...]

#

okay so i tried to read but i dont think i got more than what i previously thought :x

glossy moth
# velvet thorn then you `pd.concat` them

Hi, I was asking a bit about this earlier but I realized some info that simplifies things:
I have two dataframes of equal size. All cells other than an identifier column contain 0 or 1. The only difference between the two is which rows have 0 and which have 1. I want to compare equivalent cells in the two, and if there are any occurrences of a 0 in one df where the other has a 1, replace it with a 1. If both have 0, leave 0. If both have 1, leave 1. Is there a quick way to do this without going column by column?
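
One way this might be done in a single step, assuming the identifier column is called `ID` and every other column is 0/1 (the frames `a` and `b` and their values are made up): stack the two frames and take the per-ID maximum of every column.

```python
import pandas as pd

a = pd.DataFrame({"ID": [1, 2, 3], "Group1": [0, 1, 0], "Group2": [1, 0, 0]})
b = pd.DataFrame({"ID": [3, 1, 2], "Group1": [1, 0, 0], "Group2": [0, 0, 1]})

# Stack the rows of both frames, then for each ID keep the largest value
# in every column: 1 if either frame had a 1, otherwise 0.
combined = pd.concat([a, b]).groupby("ID").max()
print(combined)
```

The row order of the two frames does not matter here, since rows are matched by `ID` rather than by position.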

near aspen
#

anyone here familiar with pandas?

#

what method is used for something like that

fossil idol
#

Please I need help

#

Why do I keep getting this error

#

This is jupyter notebook

rough mountain
#

this might be useful

fossil idol
#

I imported them already @rough mountain

#

@rough mountain it worked. I typed %matplotlib inline and the graph showed. Thanks

rough mountain
#

welcome

#

I'm just going to use this as an image processing channel as it's the closest thing.

#

When I floodfill this image with cv2

serene scaffold
#

kitty!!!!!!!!

rough mountain
#

I get this strange result