#data-science-and-ml

tidal bough
#

ooh, maybe...

#

oh my fucking god

#

it was right all along, since I was using apply

#

but it didn't look right

#

because I didn't know that pandas has special goddamn support for lists

#

and if you have a column of lists, it will visually "unwrap" these lists, each element into a row

#

This is one row 😩

untold bloom
#

no, there's no such automatic thing for lists

#

and no, that's not a one row...

#

you need explicit .explode to roll lists to rows

tidal bough
#

huh, you're right

#

indeed it's of len 2

untold bloom
#

that screenshot implies a MultiIndex frame

tidal bough
#

so I guess the issue then is that apply explodes a column of mine without asking

#

(or however it's called, here)

untold bloom
#

well, without an MRE, i don't know what to write

#

because not sure what function you're applying to what kind of dataframe :|

tidal bough
#

fair enough

#

lemme see if I can hack something together

tidal bough
# untold bloom well, without an MRE, i don't know what to write
import pandas as pd

test_df = pd.DataFrame.from_dict(dict(name=["A", "A", "A", "B", "B"], thing=["1", "2", "3", "1", "2"]))


def test_f(group):
    lst = [row.thing for row in group.itertuples()]
    # Note: group.iloc[0].name is the row's index label (Series.name), not the "name" column.
    return pd.DataFrame.from_dict(dict(
        name=group.iloc[0].name,
        lst=lst
    ))


test_df.groupby("name").apply(test_f)
#

here's a simplified example. Each group is collapsed into a 1-row dataframe with a list-type column.

#

The result becomes multiindex:

MultiIndex([('A', 0),
            ('A', 1),
            ('A', 2),
            ('B', 0),
            ('B', 1)],
           names=['name', None])
#

oh hey, I got it

#

the way to do it is pretty counterintuitive to me, though:

return pd.Series(dict(
        name=group.iloc[0].name,
        lst=lst
    ))
#

if apply returns a Series, it's not unwrapped into a multiindex. If it returns a dataframe, it is.

#

I wonder if there's even a mention of that in the docs.

untold bloom
#

pandas will try to be helpful by putting another level of index consisting of the grouper keys so you can identify which group led to which new indexing scheme.

#

if you want, you can disable this via passing group_keys=False to .groupby(...).

#

in your case, the 2 groups had the indices [0, 1, 2] and [3, 4]; but what you returned from the function per group did not fully preserve the corresponding indexes, e.g., you returned [0, 1, 2] for "A" (fine) but [0, 1] for "B" (not so fine), hence the MultiIndex appearing.
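
A minimal sketch of why the per-group frame in the example above is not one row, assuming a reasonably recent pandas. A scalar next to a list is broadcast, so the frame gets one row per list element; wrapping the list in another list keeps it in a single cell. The names `broadcast`, `one_row` and `collapsed` are made up for illustration, and `agg(list)` is just one way to collapse each group into a list.

```python
import pandas as pd

test_df = pd.DataFrame({
    "name": ["A", "A", "A", "B", "B"],
    "thing": ["1", "2", "3", "1", "2"],
})

# Scalar + list: the scalar is repeated, giving one row per list element.
broadcast = pd.DataFrame({"name": "A", "lst": ["1", "2", "3"]})

# List wrapped in a list: a single row whose "lst" cell holds the whole list.
one_row = pd.DataFrame({"name": ["A"], "lst": [["1", "2", "3"]]})

# Collapsing each group into a list without a custom apply function.
collapsed = test_df.groupby("name")["thing"].agg(list)

print(len(broadcast), len(one_row))
print(collapsed)
```

With this shape of data, `collapsed` is a Series indexed by `name`, so no extra index level appears.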

tidal bough
#

Hmm, interesting

solid quail
#

Hey everyone, I have encountered a unique issue and I could really use some suggestions on how to resolve it as I have been stuck for a few days. I am currently trying to iterate through a df and create a new column that contains the value of the difference between two columns. The issue is that I need to find the difference between two separate columns on different rows. (so the difference between column A on row 1, and Column B on row 2, and store the difference value in column C on row 1. ) Does anyone have any experience doing this?

untold bloom
#

hi, from that explanation, it seems like df["new"] = df["A"] - df["B"].shift(-1) should work?

#

.shift(-1) will "pull up" the column B one row above; then, when subtracted from df["A"] it is as if you're subtracting row k + 1 of B from row k of A

solid quail
#

Thank you so much! I will give it a try

desert oar
#

you can also .shift() or .shift(1) to shift forward instead

#

!e ```python
import pandas as pd

data = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [33, 22, 11],
}, index=list('abc'))

print(data['x'])
print(data['x'].shift(-1))
print(data['x'].shift())  # default value is 1
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your 3.10 eval job has completed with return code 0.

001 | a    1
002 | b    2
003 | c    3
004 | Name: x, dtype: int64
005 | a    2.0
006 | b    3.0
007 | c    NaN
008 | Name: x, dtype: float64
009 | a    NaN
010 | b    1.0
011 | c    2.0
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/ecegevatub.txt?noredirect

desert oar
#

note that shift shifts the data while keeping the index in place. this is perfect for doing computations exactly like what @solid quail described
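
A small sketch of the row-offset subtraction discussed above, with made-up column values. Column `C` on row k holds `A` on row k minus `B` on row k + 1; the last row has no following row, so it ends up as NaN.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [1, 2, 3]})

# shift(-1) moves B up by one row while the index stays in place,
# so row k of A lines up with row k + 1 of B.
df["C"] = df["A"] - df["B"].shift(-1)

print(df)
```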

tropic matrix
#

If you have a regression problem, where the output can range from 1000 to 10 billion, is it a good idea to scale the output down? (atm i'm using the sklearn minmaxscaler)

modest onyx
#

And then you just did one hot?

#

That's so weird cuz I did just that and my model is just getting stuck at a local minimum where it outputs the most frequent character in the text

#

In my case it's either e or spaces

#

And when I researched on Google, it looks like people use more fancy encoding/decoding methods to avoid this issue

dusty valve
#

even our computers agree

lapis sequoia
#
import numpy as np
from nnfs.datasets import spiral_data


class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.rand(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))

    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

X, y = spiral_data(samples=100, classes=3)
dense1 = Layer_Dense(2, 3)
dense1.forward(X)
print(dense1.output.shape)

Any idea why dense1 output is in the shape (300, 3)? Shouldn't it be 200 since I'm inputting 100 different x and y values for each dot?

#

Where does the 300 come from?

rich trail
#

any recommendations on courses with projects or a course that goes over practical parts to get me set up creating projects?

#

im currently waiting on financial aid for the 2nd Andrew Ng course in the ml specialization but since it takes 2 weeks im looking for something while i wait

desert oar
rich trail
#

and just google and watch videos when i need?

desert oar
rich trail
#

so books? i got acces to O'reilly learning for free rn

desert oar
#

practice reading actual software docs, and get a textbook if you don't already have one

rich trail
#

they got all their books and some videos associated with the books

#

u recommend skipping that too?

desert oar
#

no, those are good books. the videos are probably fine

#

OpenIntro has a statistics textbook too

rich trail
#

ye im doing the math as im preparing for a masters in cs and focus on ml

wooden sail
desert oar
#

there is just so much "blogspam" out there on beginner-level ML that it's very difficult to avoid as a newbie

rich trail
#

just wanna get into the practical parts beforehand

desert oar
#

yeah, then just spend your time messing with data

#

if you are going to read blog junk, at least make sure it's from towardsdatascience.com, their blog junk is still somewhat junky but at least the minimum quality is somewhat high

#

rando youtube videos are unlikely to be useful

#

fast.ai is also a free online course and collection of resources

rich trail
#

aside from learning ml, any recommendations on learning how to actually work with data @desert oar

#

just learn by practise?

desert oar
wooden sail
#

because matrix products between matrices of sizes m x n and n x k are of size m x k, so naturally your X is of size 300 x 2

lapis sequoia
wooden sail
#

indeed. so the question is what did you expect the shape to be given the parameters you passed to spiral data

lapis sequoia
#

Ah... I thought it would be 100 since I specified samples=100

#

maybe I misunderstood what samples=100 means

wooden sail
#

what does spiral data return?

lapis sequoia
wooden sail
#

... what it does return lol

#

what is the data supposed to look like?

#

you specified 3 classes, too

#

how are those supposed to be returned

lapis sequoia
#

I just told you the shape of the X values, the y values(labels) are either 0, 1, or 2.

#

an example of the X value would be:

[-8.67685974e-01 -3.85566771e-01]
 [-9.11101043e-01  3.01196396e-01]
steady basalt
#

opinions on balancing the test set in binary classification of medical record data?

wooden sail
#

yes dude, that's not what i'm asking. you were expecting it to return something of shape 100 x 2. why?

#

what do the classes mean?

#

how are the classes being concatenated?

lapis sequoia
#

what are you even talking about

#

you've lost me bro

wooden sail
#

i'm asking you what your code means

#

X, y = spiral_data(samples=100, classes=3)

#

you wrote this, yes? what does this mean? what do the classes do there?

lapis sequoia
#

I was expecting spiral_data to return 100, 2 because I thought samples=100 would mean 100 different instances of X

wooden sail
#

what does spiral_data return? did you write this function? is it from another lib?

lapis sequoia
#

it's from library called nnfs, they created spiral_data where X returns (x, y) values which are points, and y returns classes(0, 1, or 2) which show what spiral the dot(X) belongs to

#

And I thought there would be 100 rows because I specified spiral_data(samples=100) and not 300 rows

#

isn't that what samples=100 should mean?

wooden sail
#

i'm looking for the docs for this func but can't find anything useful

tropic matrix
modest onyx
#

one thing you can do is just tell us the shape and type of the thing it returns

#

using print

#
print(X.shape, y.shape)
print(type(X), type(y))
lapis sequoia
#
X, y = spiral_data(samples=100, classes=3)
print(X.shape, y.shape)

returns

(300, 2) (300,)
lapis sequoia
modest onyx
#

oh so the input shape is 300?

lapis sequoia
#

Yeah

modest onyx
#

wait this doesn't make sense

lapis sequoia
#

how so?

modest onyx
#

if you have 100 samples and the input shape is a vector of size 300

#

then I would expect the shape to be (100, 300)

wooden sail
#

the only question is why is it 300 x 2 instead of 100 x 2

modest onyx
#

oh

lapis sequoia
#

let me show a different output with different sample and class values

wooden sail
#

i know, i'm just getting abdulhaleem up to speed. yeah, change classes to 1

#

maybe it generates n samples per class

#

i can't find the docs anywhere

lapis sequoia
#
X, y = spiral_data(samples=10, classes=2)
print(X.shape, y.shape)

returns

(20, 2) (20,)
#

and ```py
X, y = spiral_data(samples=100, classes=1)
print(X.shape, y.shape)
```

prints

(100, 2) (100,)

wooden sail
#

ok, so it generates n samples per class indeed

modest onyx
#

actually yeah that makes sense

#

I'm guessing that the spiral_data returns x,y coordinates representing a spiral

#

hence why it's 100, 2

modest onyx
lapis sequoia
wooden sail
#

100 samples for each of the classes

#

just like in the examples you did rn
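
A NumPy-only sketch of where the 300 comes from, assuming (as the experiments above suggest) that `samples` is counted per class. The zero array is only a stand-in for the `X` that `spiral_data` returns; nothing from nnfs is used here.

```python
import numpy as np

samples, classes = 100, 3

# Stand-in for spiral_data's X: samples * classes points with 2 coordinates each.
X = np.zeros((samples * classes, 2))

# Same shapes as Layer_Dense(2, 3): 2 inputs, 3 neurons.
weights = 0.01 * np.random.rand(2, 3)
biases = np.zeros((1, 3))

# (300 x 2) times (2 x 3) gives (300 x 3); the bias row is broadcast over all rows.
output = np.dot(X, weights) + biases

print(X.shape, output.shape)
```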

modest onyx
#

oh interesting

#

well then I'd expect the matrix multiply to return shape n_samples * num_classes

steady basalt
#

opinions in balancing test set?

lapis sequoia
wooden sail
#

yes

lapis sequoia
#

what I was expecting it to return was 100 samples, with 33% of them in class 1, 33% in class 2, and 34% in class 3

#

idk why I was expecting it to act like that but yeah it doesn't act like that

wooden sail
#

i would've expected that too

modest onyx
#

yeah that's pretty confusing

lapis sequoia
#

yeah, thanks for your help Edd and Abdulhaleem

modest onyx
#

bad design choice if you ask me

steady basalt
#

anyone know why a neural net would do this if the training data was oversampled to balance?

#

random forest manages just fine to reach 0.6 recall and 0.1 prec for class 1

cold smelt
#

I'd like to use NLP to generate answers in a simple oracle bot, but I haven't dealt with ML much yet. What are my options?

modest onyx
#

I haven't done any NLP but I've heard of GPT 2 which could be a good option

modest onyx
# dusty valve **E** truly is the most magnificent

actually although using categorical sampling at inference time significantly improved my results, it turns out my biggest oopsie was thinking that torch.nn.functional.cross_entropy accepts a probability distribution as input when it actually accepts logits

#

now my model actually works
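
A rough NumPy sketch of the logits mix-up described above. It re-implements the usual softmax followed by negative log-likelihood by hand (it is not PyTorch's code, and the function names are made up) to show what happens when probabilities are fed to a loss that expects logits: softmax is effectively applied twice and the loss can no longer get close to zero.

```python
import numpy as np


def softmax(z):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy_from_logits(logits, targets):
    # Softmax, then mean negative log-probability of the target class.
    log_probs = np.log(softmax(logits))
    return -log_probs[np.arange(len(targets)), targets].mean()


logits = np.array([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
targets = np.array([0, 1])

# Intended use: raw logits go in.
correct = cross_entropy_from_logits(logits, targets)

# The mistake: probabilities go in, so softmax is applied a second time.
mistaken = cross_entropy_from_logits(softmax(logits), targets)

print(correct, mistaken)
```

Even though these predictions are almost perfectly confident, the mistaken variant stays above 0.5 because inputs squeezed into [0, 1] cannot produce a sharp distribution after a second softmax.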

lapis sequoia
#

Does anyone have some tips, or know a good tutorial, for someone who wants to create a classification model on images on my own PC? A lot of tutorials just use the built-in datasets.

modest onyx
#

so you want to do classification on your own dataset?

#

that isn't that different from using built in datasets

lapis sequoia
modest onyx
#

In pytorch for example, the built in datasets are built using the DataLoader and Dataset classes. So all you have to do is get your images and wrap them in those classes

lapis sequoia
#

Ok, I'm gonna try that

modest onyx
#

oh

#

wait so you're still not able to turn your images into tensors?

lapis sequoia
#

No, I'm not

modest onyx
#

well there's a long way to do it and an easy way to do it

#

but if you're already working within a framework then just using DataLoader and Dataset can make your life pretty easy

#

they can do all that under the hood

lapis sequoia
#

Do Dataloaders handle turning images into tensors?

modest onyx
#

yeah I think so

#

but you might need to load the images as PIL

lapis sequoia
#

Ok, trying that now

modest onyx
#

Yeah I think once you are able to turn your images into PIL, then the rest should be easy

lapis sequoia
#

Ok, I converted an image to a tensor

modest onyx
#

but I'm not sure if that's the most efficient way

lapis sequoia
#

I don't know about efficiency either. I just loaded the image as a PIL image and used a ToTensor() transform and it seems to work

modest onyx
#

probably shouldn't matter since you only need to load the dataset once and you're done

lapis sequoia
#

it returns 3 RGB values like expected

lapis sequoia
modest onyx
#

rather than going one by one

#

but you probably don't need to worry about that

lapis sequoia
#

Yeah

modest onyx
#

are you using dataloader and datasets?

lapis sequoia
#

not yet

#

I'm working with 1 image right now

#

Do you know how to convert a tensor to a numpy array?

modest onyx
#

wait so you're not using either pytorch or tensorflow?

lapis sequoia
#

I'm using pytorch

#

my tensor is currently in the form of a ToTensor object

modest onyx
#

if you're only dealing with one image then you don't need a dataloader

#

but why would you want to turn it into a numpy array if you're using pytorch?

#

you want to show it using matplotlib?

#

in that case it's just .numpy()

lapis sequoia
#

well the ToTensor object isn't the same thing as a Tensor object right?

#

Trying to figure out how to convert it to a tensor object or a numpy array

#

unless they're the same thing

#

Ok, not the same because when I run "my_tensor.shape" I get

AttributeError: 'ToTensor' object has no attribute 'shape'
modest onyx
#

ToTensor is a transform class, not a tensor

#

calling ToTensor() gives you a callable object that can be used to convert images into tensors

#

if you want to use the function right away, theres a pil_to_tensor function

#

but ToTensor is good if you want to compose it with other transformations

#

torchvision.transforms.functional.pil_to_tensor might be what you're looking for and you can import it

#

also I checked one of my recent projects and it looks pretty simple to load an entire folder of images as a dataloader

lapis sequoia
#

I used "PILToTensor" and it worked as well

lapis sequoia
#

wait... I could've just kept using ToTensor and set a variable to the function and it would've worked

modest onyx
#
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

image_size = 64  # placeholder value; not given in the original message

transform = transforms.Compose([
    transforms.Resize(image_size),
    transforms.ToTensor(),
    # other transformations
])

# the folder data/fiftyk has a single folder in it where all images are
train_dataset = datasets.ImageFolder(root="data/fiftyk", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
modest onyx
#

maybe that's just me though

#
“Shame I want to get us coming from his s1elling table!” said Stan. “Harry yelled after 
chance of glittening of Otter, while Kuttic tells me.” 

They all thowed his broom below at the bit past twelve 
Prifet consulets, had said, “I always middle shop I told your common room, the 
statue and next way. “So you’ve eaten to do, Harry ... it’s dead ideas had had 
exams to frightens, and she — what was a rust. 
#

Trained for 2 epochs and already this good 😵‍💫

modest onyx
#

it literally returns F.pil_to_tensor

tropic matrix
#

If you have a regression problem, where the output can range from 1000 to 10 billion, is it a good idea to scale the output down? (atm i'm using the sklearn minmaxscaler, which scales it to be between 0 and 1)

on top of that, does having the values range from 0-1 affect how mse is calculated? (the square of a decimal between 0 and 1 will be less than the value itself)

wooden sail
#

yes, it's a good idea because the gradients will depend on the size of the values the function takes. rescaling prevents exploding gradients

#

the MSE will behave as usual

#

what does matter about the MSE, independent of what you mentioned, is that small error values are effectively "ignored". this happens regardless of the error dynamic range, and is one of the reasons regularization is helpful

#

the TL;DR is "yes it's a good idea" and "no, MSE still works the same"
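
A small NumPy sketch of the "MSE still works the same" point, using made-up target values. The `scale` helper applies the same min-max formula that sklearn's MinMaxScaler uses with its default (0, 1) range. Because the mapping is linear, the scaled MSE is just the raw MSE divided by the squared range, so the ranking of models by MSE does not change.

```python
import numpy as np

# Made-up targets and predictions spanning several orders of magnitude.
y_true = np.array([1e3, 5e6, 2e9, 1e10])
y_pred = np.array([2e3, 4e6, 3e9, 9e9])

lo, hi = y_true.min(), y_true.max()


def scale(y):
    # Min-max scaling to [0, 1] using the range of the true targets.
    return (y - lo) / (hi - lo)


mse_raw = np.mean((y_true - y_pred) ** 2)
mse_scaled = np.mean((scale(y_true) - scale(y_pred)) ** 2)

# The two differ only by the constant factor (hi - lo) ** 2.
print(mse_raw, mse_scaled, mse_raw / (hi - lo) ** 2)
```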

modest onyx
#

as long as there's no problem of outliers then minmax feature scaling is good

worthy phoenix
#

im getting a memory error even tho htop says i have almost 3gb memory at my hand idk why

wooden sail
#

what are you trying to do?

primal shuttle
#

@worthy phoenix you can double check that with psutil

#
import psutil
psutil.virtual_memory()
worthy phoenix
#

got the reason for the error

steady basalt
#

thesis results complete, not great but at least we can say 'u DEF dont have cancer'

#

what would you do

worthy phoenix
#

its reading the whole model into memory and the model is about 10gb

steady basalt
#

is it worth balancing the test set to get a better understanding of the model

primal shuttle
#

If your set is overall unbalanced, then no

#

Preprocess, split with stratification, balance the training data, train the models, and then test on the imbalanced data with appropriate metrics
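
A hedged sketch of that workflow with synthetic data, assuming scikit-learn is available. Plain random oversampling is swapped in here instead of SMOTE to keep the example dependency-free; the point is only that balancing touches the training split while the test split keeps the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced data: 4% positives.
X = rng.normal(size=(1000, 4))
y = np.array([1] * 40 + [0] * 960)

# Stratified split: train and test keep (roughly) the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Random oversampling of the minority class, applied to the training split only.
minority = np.flatnonzero(y_train == 1)
majority = np.flatnonzero(y_train == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
balanced_idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_train[balanced_idx], y_train[balanced_idx]

print("train positive rate:", y_bal.mean())
print("test positive rate:", y_test.mean())
```

Metrics such as precision and recall on `X_test` then reflect the class balance the model would actually face.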

steady basalt
#

i trained this model on balanced data

#

but its hard to get a clear understanding of the model when 98% of the test data is a single class

#

also i used smote on the train set

primal shuttle
#

If your set is balanced to begin with, you don't want to overengineer

steady basalt
#

it was extremely un balanced

#

like 100k vs 4k samples

primal shuttle
#

Oh ok

steady basalt
#

so u can say im training on lets say 4000 of each

#

then testing on 2000 and 500

#

BUT, if i tested on 500 and 500 i might better see how it determines class

#

from those results, would you say its over predicting class 0 because the data available doesnt allow for much class 1 predictions

primal shuttle
#

For the train/test splitting you can additionally apply the K-fold, and then to the remainders for each fold

steady basalt
#

wdym

primal shuttle
steady basalt
#

i did oversample

#

i used smote

#

maybe i shud try non-informed over sampling

#

because the data is v noisy?

#

AND its quite high dimensional, i one hot encoded a couple features

lapis sequoia
#

Hi there, any suggestions on data science interview practice sites? I've used Hackerrank, Leetcode, and AceAI so far

steady basalt
#

my experience so far with coding data science interviews is so negative

#

idk how u can give someone a bunch of tables in an env and with code theyve never seen before and expect them in 10 mins to return a table exactly how you like it when it takes ages to understand whats even going on

#

especially when hacker rank error output is bugged and invisible

gloomy anvil
#

Hello, does any of you have experience with using statsmodels? I used it to plot acf plots like such:

steady basalt
#

its good but i prefer stata

gloomy anvil
gloomy anvil
steady basalt
#

matplotlib lets u do that yes

#

im sure statsmodels isnt the graph itself, pyplot is

#

just add a bunch of plots and then draw it theyll all go in the space

#

on that axis

gloomy anvil
#

how would I do this? Should I first sum the 10 Timeseries up to one timeseries and then run acfplot?

steady basalt
#

well are you allowed to do that?

#

10 time series cant be represented by 1 time series

#

ud need to plot them as separate lines surely

gloomy anvil
#

I have 10 different time series that are kind of correlated, but I want to see if there is generally some autocorrelation on a meta level.

steady basalt
#

cant u do them alongside each other then as separate values

#

wouldnt that add to the correlation

gloomy anvil
#

or is there a way to get the autocorrelations per lag per timeseries from the plot? And calculate a mean autocorrelation per lag?

steady basalt
#

like multiple variables

gloomy anvil
#

Well I already have created an acf plot per dataset and per predictor (all in all 80 plots), so on a detailed level I can already assess the autocorrelations. I just want to have an aggregated view of autocorrelations
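
One possible sketch of the "mean autocorrelation per lag" idea, in plain NumPy with synthetic series. `sample_acf` is a hand-written helper using the usual biased sample estimator, not a statsmodels function; the per-series values are computed directly rather than read back from a plot, then averaged lag by lag.

```python
import numpy as np


def sample_acf(x, nlags):
    # Sample autocorrelation for lags 0..nlags (biased estimator).
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    acf = [1.0]
    for k in range(1, nlags + 1):
        acf.append(np.dot(x[:-k], x[k:]) / denom)
    return np.array(acf)


rng = np.random.default_rng(0)

# 10 synthetic series of length 200 (random walks, purely illustrative).
series = rng.normal(size=(10, 200)).cumsum(axis=1)

nlags = 20
acfs = np.array([sample_acf(s, nlags) for s in series])  # shape (10, nlags + 1)

# Aggregated view: average autocorrelation across series for each lag.
mean_acf = acfs.mean(axis=0)

print(mean_acf.round(3))
```

Whether a plain mean is the right aggregate depends on how comparable the series are; a median or a per-lag spread might be worth looking at as well.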

unique flame
#

Been training this yolo algorithm for 6.5 hours now, through google colab. I watched 2 movies, died a few times in playing ps4 games all while keeping the session active...

storm sigil
#
y_1 = lambda x: x**2
plt.scatter(list(range(1000)),[y_1(a) for a in range(1000)], s=20, edgecolor='none', cmap=plt.cm.Blues)

The cmap doesn't work here. Why is that?

storm sigil
#

got it

meager crater
#

Hey is the structure of make_column_transformer right?

# First would need to deal with Binary Labeling

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

bin_cols = ["gender", "ever_married", "Residence_type"]
ohe_cols = ["work_type", "smoking_status"]

ct = make_column_transformer(
    (LabelBinarizer(), bin_cols),
    (OneHotEncoder(), ohe_cols),
    remainder="passthrough"
)

ct.fit(df)

Error:
TypeError: LabelBinarizer.fit_transform() takes 2 positional arguments but 3 were given
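
A possible way around this error, sketched with a made-up four-row frame. LabelBinarizer is meant for target labels and its `fit_transform` only accepts `y`, while the column transformer passes both `X` and `y`, hence the "3 were given". Here OrdinalEncoder is swapped in for the binary feature columns; that is one option among several, not necessarily what the asker ended up using.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows, only to have something to fit on.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural"],
    "work_type": ["Private", "Self-employed", "Private", "Govt_job"],
    "smoking_status": ["smokes", "never smoked", "formerly smoked", "never smoked"],
    "age": [67, 61, 80, 49],
})

bin_cols = ["gender", "ever_married", "Residence_type"]
ohe_cols = ["work_type", "smoking_status"]

ct = make_column_transformer(
    (OrdinalEncoder(), bin_cols),   # two-category features become 0/1
    (OneHotEncoder(), ohe_cols),    # multi-category features become indicator columns
    remainder="passthrough",        # numeric columns are kept as they are
)

encoded = ct.fit_transform(df)
print(encoded.shape)
```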

wooden sail
# storm sigil

cmap is for 2D and 3D images. for 1d plots, you can directly specify the color of each curve with a letter or by making your own colors
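
For the scatter call in the question, a likely explanation is that `cmap` only has an effect when `c` is given as an array of numbers to map to colours. A minimal sketch, assuming matplotlib with a non-interactive backend; colouring by the y value is an arbitrary choice for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

x = list(range(1000))
y = [a ** 2 for a in x]

fig, ax = plt.subplots()

# c supplies the numbers that cmap turns into colours; without c, cmap is unused.
points = ax.scatter(x, y, c=y, s=20, edgecolor="none", cmap=plt.cm.Blues)
fig.colorbar(points, ax=ax)

print(points.get_cmap().name)
```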

real oyster
#

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(64, kernel_size=4, activation="relu", input_shape = (256,256,3)))
model.add(MaxPooling2D(4,4))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))

model.add(Conv2D(32, kernel_size=3, activation="relu", padding="same"))
model.add(MaxPooling2D(3,3))

model.add(Flatten())
model.add(Dense(32))
model.add(Dense(5, activation='softmax'))

#

I trained this model and got these results

barren snow
#

Hi! I want to calculate the array of sound envelope of a signal(in each music note), and I see a tool called https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.hilbert.html. However, there are two questions I want to check from the reference.

The first one is what are they doing in

signal = chirp(t, 20.0, t[-1], 100.0)
signal *= (1.0 + 0.5 * np.sin(2.0*np.pi*3.0*t))

I am not sure what the numbers 20 and 100 in the first line mean?

The second question is, I want to calculate the envelope of a signal(onset), not the whole sound. I already have onset and offset. Then, how to calculate it?

Thanks!

real oyster
# real oyster

I was wondering if this is overfitting or non-ideal and also wondering how you would change the model to increase accuracy

wooden sail
barren snow
wooden sail
#

i don't know what you mean by onset and offset here, you'd have to clarify a bit more

barren snow
wooden sail
#

those two things are the same here

#

well. i lie. if you look at the spectrum, depending on the extent of the signal w.r.t. the total time duration, you'll get higher frequency harmonics.

#

let's go with starting and ending value of the modulation frequency
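
A sketch of both questions, assuming SciPy is installed. In `chirp(t, f0, t1, f1)` the 20.0 is the frequency at time 0 and the 100.0 is the frequency reached at time `t1`. For a single sound event, one simple option is to slice the signal between its onset and offset and take the magnitude of the analytic signal of that slice. The onset and offset times below are hypothetical.

```python
import numpy as np
from scipy.signal import chirp, hilbert

fs = 400.0                      # sampling rate in Hz
t = np.arange(int(fs)) / fs     # one second of samples

# Sweep from 20 Hz at t=0 to 100 Hz at t=t[-1], with a 3 Hz amplitude modulation.
signal = chirp(t, 20.0, t[-1], 100.0)
signal *= 1.0 + 0.5 * np.sin(2.0 * np.pi * 3.0 * t)

# Hypothetical onset and offset of one event, in seconds.
onset, offset = 0.25, 0.75
start, stop = int(onset * fs), int(offset * fs)
segment = signal[start:stop]

# Envelope of just that segment: magnitude of its analytic signal.
envelope = np.abs(hilbert(segment))

print(segment.shape, envelope.shape)
```

Expect some distortion near the two ends of the slice, since the transform treats the segment as if it were the whole signal.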

barren snow
wooden sail
#

and by note you mean, pitch?

barren snow
#

What is b note?

wooden sail
#

a typo 😛

barren snow
wooden sail
#

this is kind of a difficult problem, depending on how complicated you want to make it

wooden sail
#

the easiest answer is "you don't need a hilbert transform, but rather a fourier one", but that's probably not what you're looking for

barren snow
#

hahaa

wooden sail
#

there's ongoing research in this stuff

barren snow
#

or other methods, maybe?

wooden sail
#

do you know which notes you are looking for?

barren snow
#

each note in this music piece

barren snow
#

But the same problem remains: I don't know how to calculate it per sound event

wooden sail
#

i'm reading a paper right now and they suggest using a filterbank, which is a generalization of the fourier approach i mentioned just now

barren snow
#

Oh, cool. Let me check for a sec

wooden sail
#

not only that, they seem to use a time-windowed approach, so it's really more like a short time fourier transform

#

i think that's what you're looking for

#

that'll give you a time-varying magnitude (well, really amplitude and phase, but the phase presumably won't matter much unless you want a super sophisticated method) for each frequency bin

#

that's how you make these "spectrograms"

#

each horizontal line is the time-varying magnitude of a bin or "note"
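
A minimal STFT sketch along those lines, assuming SciPy and using a synthetic 440 Hz tone instead of real audio. Each row of `magnitude` is the time-varying magnitude of one frequency bin, which is what a spectrogram displays.

```python
import numpy as np
from scipy.signal import stft

fs = 8000                               # sampling rate in Hz
t = np.arange(fs) / fs                  # one second
tone = np.sin(2 * np.pi * 440.0 * t)    # synthetic stand-in for a recorded note

# Short-time Fourier transform over 512-sample windows.
freqs, times, Zxx = stft(tone, fs=fs, nperseg=512)

# Rows are frequency bins, columns are time frames.
magnitude = np.abs(Zxx)

# The bin with the most energy on average should sit next to 440 Hz.
peak_bin = magnitude.mean(axis=1).argmax()
print(magnitude.shape, freqs[peak_bin])
```

The bin spacing here is `fs / nperseg`, about 15.6 Hz, so neighbouring low notes would need a longer window to be told apart.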

barren snow
#

Great! Could you send me the link to the paper?

wooden sail
#

if you're doing this for fun, i'd recommend to start with an STFT. if you're doing research, then it's time to read about polyphase filters

barren snow
#

Also, Librosa has one, like this. Do you think it's the suitable way to calculate?

barren snow
#

😭

wooden sail
#

i'm checking the docs and it makes something called an "onset strength envelope", which is going to be some variation of what i mentioned right now

#

idk how it finds the peaks, but it's probably something like a thresholded smoothed derivative

barren snow
#

Find the peak is not a big problem for me, because there is a function tool called get_peak 🙂

wooden sail
#

that's ok, but those have very low resolution

#

what kind of research are we talking 😛 super resolution parameter estimation?

barren snow
wooden sail
#

i'm gonna check the pick_peak code really quick, let's see

barren snow
wooden sail
#

ah it's even simpler

#

it's just a simple heuristic, pick the max value in an interval if it's above a threshold

barren snow
#

Cool

#

Is it this one?

wooden sail
#

that will 'work' but you won't get state of the art results. depending on what your research is in, this is not good enough

barren snow
#

scipy.signal.find_peaks(x, height=None, threshold=None, distance=None, prominence=None, width=None, wlen=None, rel_height=0.5, plateau_size=None)

wooden sail
#

they actually coded their own in librosa

#

lemme read how scipy does it

#

eh pretty similar

barren snow
#

What do you "feel" about this? About calculate onset strength envelope.

wooden sail
#

in fairness, this type of peak finder is indeed a maximum likelihood estimator of peak locations, but only if the underlying parametric model is "easily resolved"

barren snow
#

Yeah, u'r right...

wooden sail
#

i'm not sure i fully understand what their onset strength computation is doing, i don't have enough time to go through all the details right now. they reference this mel spectrogram though, so this is a good place to start
[#] Böck, Sebastian, and Gerhard Widmer.
"Maximum filter vibrato suppression for onset detection."
16th International Conference on Digital Audio Effects,
Maynooth, Ireland. 2013.

#

at any rate, it's a filterbank, they're applying some bandpass filter to the signal to split it into bands over different time windows

#

this is a good place to start if your research is in onset detection itself. if not, and you just need this as an intermediate result, i'd say this is probably good enough. if this IS exactly what you're researching... the next question is whether you want to develop a new, better method or just make a survey of what is out there

barren snow
#

Sure! Thanks for giving me those information and suggestion! I appreciate it!

lapis sequoia
#

Does anyone know how I would proceed making my own image generator from words app?

serene scaffold
#

Not to say that you'll never be able to do it. But it would be a supremely disappointing first project.

steady basalt
languid stratus
#

Anyone know if it's possible to pass a spacy object through an SKlearn pipeline

#

(I need the info in the object at different stages)

lapis sequoia
#

Anyone know any open source ai image generator that generates purely based on word descriptions?

ebon hazel
#

I think Dream too

tropic matrix
#

I'm solving a regression problem, but when i calculate the MSE for my loss it becomes "inf". I believe this is because my data ranges from 1,000 to 2,147,000,000, but what should i do to solve it?

#

(using keras)

#

i have rmse as a metric, and that displays a valid number, but when the loss is initially high it is impossible for it to be displayed as mse

#

so is it a viable solution to set objectives (like for early stopping, reducing lr, and hyper parameter tuning) to be the validation rmse?

wooden sail
#

a quick fix is to change the dynamic range of the data

tropic matrix
rapid cedar
#

whats the best modules to learn b4 starting doing ML

desert oar
rapid cedar
#

ok

modest onyx
#

you should learn a bit of ML then start learning these modules

#

or at the very least be doing them in parallel

merry wadi
#

Are there any ML models that have sequential/ordered splits?

I’d like the model to take into account specific columns first then others

crisp axle
modest onyx
tropic matrix
modest onyx
#

that doesn't make sense

#

I suspect your implementation could have a bug

#

could you show me how you normalized your targets?

elfin whale
#

which is the best course of tableau for a beginner

#

any one?

quaint leaf
old grove
#

Hello 👋, can anyone tell me what exactly generalization error is and how it differs from train or test error?

elfin whale
elfin whale
quaint leaf
elfin whale
#

tell me the easiest one

quaint leaf
gray pasture
#

Hi,
I need help with this question about clustering and finding optimal number of clusters. I would appreciate for any help.

https://datascience.stackexchange.com/questions/113303/finding-suitable-measure-for-optimal-number-of-clusters-for-the-specified-cluste

thick marlin
#

Hello, I'm trying to remove the sky from this tree picture. I have used kmeans to cluster the colors. And this is the output. Now I need to remove the sky. What would be the best way about it.

#

original picture

glacial wadi
steady basalt
#

Looks like regression at first glance

serene scaffold
#

do you know what an activation function is?

#

each image goes through the whole network. what changes with each image is the output, not whether or not it goes all the way through.

#

and keep in mind that we're talking about mathematical functions, which (unlike Python functions) always have an output.

#

anyway, activation functions are non-linear functions

#

here's a great comment from Emyrs about what activation functions are for

wooden sail
#

as for the 128, this determines how many weights and biases there are. how to pick this "well" is difficult to answer and it is often the case one has to try a couple different configurations to see which one works best

#

in estimation theory this is called the "model order" and one tries to strike a balance between having too few and too many parameters. too many means you have quite a bit of "descriptive power", but it is both difficult to tune the parameters correctly and it is easy to overfit. if you have too few parameters, you'll simply lose predictive power

#

a more in depth discussion requires quite some information theory and statistics

#

the number stands for the size of the output of that layer. that means you have an input of 28*28 and an output of 128. one dense layer is an affine transformation with a matrix of size (layer input x layer output) and biases of size (layer output), so you're telling the layer to grab the input of size 28*28, multiply it by a weight matrix W of size 28*28 x 128, and add a bias vector b of size 128. then, the relu activation function is applied
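
a minimal numpy sketch of that one dense layer, assuming a random stand-in for the image and random weights (the names x, W, b follow the message above; nothing here comes from the actual keras model):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 28 * 28, 128           # flattened image size, layer width

x = rng.random(n_in)                 # one flattened 28x28 image (random stand-in)
W = rng.normal(size=(n_in, n_out))   # weight matrix, shape (784, 128)
b = np.zeros(n_out)                  # bias vector, shape (128,)

z = x @ W + b                        # affine transformation
out = np.maximum(z, 0.0)             # relu: negative entries become 0

n_params = W.size + b.size           # 784 * 128 + 128 trainable numbers
print(out.shape, n_params)
```

so picking 128 fixes both the output size and the number of weights and biases the layer has to learn.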

#

google happened to quickly grace me with this illustration 😛 this is exactly the same as the network you have

meager crater
#

Hey I had a quick question about coefficient and p-value.
Based on the results I have noticed Radio is the correct answer for the first question and billboard is the answer for second; however, I am struggling to grasp the reasons for it. Could someone help out?

wooden sail
#

that depends on which definition of p-value you are using in your course

meager crater
#

no prior information was given or constructed

#

this is completely uncorrelated question to the previous ones

wooden sail
#

i doubt that

#

at any rate, the standard usage of p-values is the probability of observing data at least as extreme as the measurement data under a base model, and the null hypothesis is that that base model explains the data. the smaller the value p, the more unlikely it is to observe the data under the base model, meaning that the parameters you derived are more significant and a better explanation of the data

#

that'd make p values of 0 have the interpretation of "the base model cannot explain the data at all, these new parameters should be accepted". then TV would be the most effective channel, as it has the strongest positive correl and highest significance

#

my recommendation is to review the content, it looks like you skipped something either during the course, or one of its prerequisites. be it the interpretation of p-values, or the definition they decided to use here

meager crater
#

See that's the thing I'm confused about: the fact that the p-value was so insignificant in both cases kind of cancels it out and falls under the idea of <0.05 rejection, but what I am confused about is that Radio was more effective than TV in the answer

wooden sail
#

so the question is, what definition of p-value did they give during the course

meager crater
#

None, it is practice tests I'm doing with no prior information given

wooden sail
#

well, then we can't answer the question, can we 😛 you don't have enough info

#

no way to know if the quiz is wrong, or if they were working with a different def

meager crater
#

put in the TV at first and was given a feedback that Radio was the right answer

steady basalt
#

Why tf me network crashing after 3 epochs

potent flame
#

Wdum crashing

worthy phoenix
#

hi there

#

i wanna deploy a ml inference program in rpi

#

but the thing is it seems to take too much time to give a single output

#

as we all know ml requires alot of computing power

#

so should i buy a jetson instead? or what updates can i make to the rpi to get it working

lapis sequoia
#

I wanted to become a data scientist. However I wanted to know what modules and things would I need to master in Python to be a competent one.

serene scaffold
lapis sequoia
#

Hi everyone. I have a corpus of text with a continuous dependent variable, and I would like to create a model that predicts this variable based on text. Which types of ML/DL models could be used for such a task?

serene scaffold
#

!resources data science

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

serene scaffold
#

^ you can go to that page and filter for books

lapis sequoia
#

Ok

meager crater
#

Hey another question, this is easier 😄 I've tried to understand Bayes theorem, however, struggling. The answer appears to be 16.7%, but don't know how it was calculated and https://www.youtube.com/watch?v=HZGCoVF3YvM didn't help sadly.

wooden sail
#

you gave some info but didn't show the question

meager crater
#

sorry, added

wooden sail
#

so bayes' theorem says that

#

.latex $P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)}$

strange elbowBOT
meager crater
#

yup where P(A) is the probability of something happening?

wooden sail
#

here, they are asking you for P(A|B) where A is the event that the person has cancer, and B is the event your code gave the output 1

meager crater
#

okay so in this case the probability of the person having cancer is 0.01? since the total population is 0.01?

wooden sail
#

i don't get what you mean by "the total population is 0.01", but yes, the probability of any person having cancer is 0.01

#

let me walk you through it because this has several steps

#

let's say P(A) is the probability a person has cancer. this is 0.01

#

next, P(B) is the probability your model outputs a 1. we will deal with this later

#

then, P(B|A) is the probability your model outputs 1 when a person has cancer. we are told the model is 0.99 correct when a person has cancer, so if a person is known to have cancer, the output 1 is 99% of the time. P(B|A) = 0.99

#

we want P(A|B): if the output of the model is 1, how likely is it they have cancer? we can compute that with bayes' theorem, but we need that missing value P(B)

meager crater
meager crater
wooden sail
#

correct

#

now, P(B) is the gotcha. your model can output 1 in two ways. it can be a true positive, or a false positive. a true positive happens when a person has cancer AND you detect it correctly. this is the probability of A and B happening at the same time. the probability of this happening is P(output 1 | person has cancer) * P(person has cancer).

#

we need something else here

meager crater
#

the probability of false positive?

wooden sail
#

.latex $P(A \cap B) = P(A \vert B) P(B)$

strange elbowBOT
meager crater
#

so that will be 0.99 * 0.01?

wooden sail
#

this is the probability of two dependent events happening together

#

right. so P(model gives 1 | person has cancer) = 0.99, as we were told. that means the probability of getting a TRUE positive is 0.99 * 0.01

#

but for the false positive, we need P(model gives 1 AND person does NOT have cancer)

#

this would be P(model gives 1 | person does not have cancer) * P(person does not have cancer)

#

we know the model is correct 95% of the time when a person does not have cancer

meager crater
#

okay so that would be .99 -> prob of getting 1

wooden sail
#

since it's correct 95% of the time, that means it's wrong 5% of the time

#

so the missing term for a false positive is 0.05*0.99

meager crater
#

yup that makes sense

wooden sail
#

so we have

meager crater
#

0.99 * 0.01 / (0.99 * 0.01 + 0.05 * 0.99)

wooden sail
#

.latex $P(cancer \vert test = 1) = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.05 \cdot 0.99} = 16.6 \cdots$

strange elbowBOT
wooden sail
#

i missed the % mark

meager crater
#

ahhh okay so that will be division of the population

#

then we just extract the probability of model being incorrect in the false positive

#

let me try to flip the question and see what comes out

wooden sail
#

idk what you mean by division of the population

meager crater
#

so You run the model and predicted 0. What is the probability that this person does not have Cancer?

#

that would be:

0.95 * 0.99 / (0.95 * 0.99 + 0.01 * 0.01) 
wooden sail
#

looks aight
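
the two calculations above can be checked with a short sketch. the variable names are made up for illustration; the numbers are the ones from the question (1% prevalence, 99% correct when the person has cancer, 95% correct when they don't):

```python
p_cancer = 0.01              # P(A): prior probability of cancer
p_pos_given_cancer = 0.99    # P(B|A): model outputs 1 when the person has cancer
p_neg_given_healthy = 0.95   # model outputs 0 when the person does not have cancer

p_healthy = 1 - p_cancer
p_pos_given_healthy = 1 - p_neg_given_healthy   # false positive rate
p_neg_given_cancer = 1 - p_pos_given_cancer     # false negative rate

# P(B): true positives plus false positives
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * p_healthy
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

# flipped question: model outputs 0, person does not have cancer
p_neg = p_neg_given_healthy * p_healthy + p_neg_given_cancer * p_cancer
p_healthy_given_neg = p_neg_given_healthy * p_healthy / p_neg

print(f"{p_cancer_given_pos:.1%}")
print(f"{p_healthy_given_neg:.2%}")
```

the first value comes out at about 16.7%, matching the quiz answer; the second is very close to 100% because false negatives are rare and cancer itself is rare.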

meager crater
#

okay great that makes a lot of sense, thanks Edd!

odd meteor
#

Edd how do you invoke Sir Lancelot to display latex? 😊

mild dirge
#

you just type .latex <latex stuff> I think @odd meteor

#

.latex $\sqrt{5}$

strange elbowBOT
odd meteor
#

.latex VIF_{Weight} = \frac{1}{1-R_{Weight}^{2}}

strange elbowBOT
wooden sail
#

you need .latex at the beginning of the message and $ to start and end an in-line equation environment

#

.latex so presumable this is in-line $x = 3$ but this other one makes a separate environment
\begin{align}
\phi(x) = \int f(\tau - x) d\tau
\end{align}
but i'm not sure if it'll work

strange elbowBOT
wooden sail
#

oh sweet, all good

#

there's a typo but you get the idea

#

@odd meteor

odd meteor
strange elbowBOT
odd meteor
wooden sail
#

you didn't put the $$ is my guess

#

.latex let's see

strange elbowBOT
wooden sail
#

huh

#

.latex $VIF_{Weight} = \frac{1}{1-R_{Weight}^{2}}$

strange elbowBOT
wooden sail
#

yeah latex doesn't like unexplained back slashes. \frac is not expected outside of math mode, so you need either $$ or another math environment

serene scaffold
odd meteor
#

.latex $VIF = \frac{1}{1 - R^{2}} = \frac{1}{Tolerance}$

strange elbowBOT
steady basalt
#

It’s easier to just use Microsoft word equation writer

serene scaffold
#

Just learn latex.

#

I can't stand wysiwyg

wooden sail
#

word also accepts latex for typesetting equations

serene scaffold
odd meteor
wooden sail
#

this is the worst conversation ever, jupyter and word brought together

serene scaffold
odd meteor
wooden sail
#

i can't deny that, it took me months to warm up to latex

misty flint
#

notion allows for inline latex

#

and it changed my note-taking habits

mild dirge
#

We have to use it to write reports at uni, but it's pretty nice for scientific reporting with a lot of formulas

#

Still get annoyed by latex placing figures 2 pages further than you want though :/

steady basalt
#

Yeah. Not a fan of equations in jupyter, I keep them to reports written in word

#

I don’t think many people read jupyter for anything method related right? That stuffs all reported on

grizzled verge
#

Hey guys for Gensim's most_similar function how do I get a list of just the most similar words without the float value next to them

#

I was looking through documentations trying to do it and it wasn’t working

#

Bottom image is my code

#

Sorry if this is a noob question bye

#

Btw *
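
one possible way, sketched without gensim installed: most_similar returns a list of (word, similarity) tuples, so a list comprehension can keep only the words. the list below is a hypothetical stand-in for that return value, not real model output:

```python
# Hypothetical stand-in for what something like
# model.wv.most_similar("king", topn=3) returns: (word, similarity) pairs.
similar = [("queen", 0.71), ("prince", 0.65), ("monarch", 0.63)]

# Keep the word from each pair and drop the float score.
words = [word for word, _score in similar]
print(words)  # ['queen', 'prince', 'monarch']
```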

serene scaffold
arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

edgy whale
#

hey, for some reason tensorflow is not properly installed and I tried uninstalling and reinstalling but I get this when I try to verify the install (in the cmd)
```
2022-08-07 21:34:41.553613: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
```
lapis sequoia
#

What are some classification algorithms which work with categorical features.

#

Hmm

#

Like how

#

Wouldn't that be incorrect

#

Or as long as you encode them properly, categorical features can be used As numeric?

#

Last time someone taught me one-hot encoding.

#

Does that now allow me to use any algorithm to use, even if it's only supposed to work for numerical features?

#

Like knn

#

@lapis sequoia, @serene scaffold

odd meteor
# lapis sequoia What are some classification algorithms which work with categorical features.

If I get your question correctly, you mean the ML algorithms that can pretty much handle categorical data explicitly without having to preprocess or encode them to a numeric feature....

LightGBM and CatBoost do that with ease! But because I kinda love CatBoost more than LightGBM, I'll focus on 🙀-Boost

You only need to identify which feature is a categorical feature in your dataset.

You could do something like this with CatBoost


# treat every non-float column of the training frame as categorical
categorical_features_indices = np.where(X_train.dtypes != float)[0]

model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_val, y_val), plot=True)
steady basalt
#

if not all

#

use dummy features and yes you can with 0s and 1s
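
a small sketch of the dummy-feature (one-hot) idea with pandas, on a made-up frame. the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"size": [1.0, 2.5, 3.0], "color": ["red", "blue", "red"]})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())  # ['size', 'color_blue', 'color_red']
```

the resulting columns are numeric, so distance-based models such as knn can consume them, though whether the distances are meaningful depends on the data.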

odd meteor
quick eagle
#

Quick question...
df['timestamp'].diff() returns a series of timestamps(?); I have a data stream and trying to identify different 'epoch' (gaps in data). Most .diff() values are 1 sec; but every so often I get a gap of minutes/hours/days. How can I find the index of these 'jumps'?

untold bloom
#
is_large_gap    = df.timestamp.diff().abs().gt(pd.Timedelta("1 sec"))
inds_large_gaps = is_large_gap[is_large_gap].index

Checking whether the (absolute value of) the difference is greater than 1 second; this gives a True/False Series. Then index it with itself to let only True's through. The .index will then give the indexes of the (end point of) jumps.

#

.diff on a datetime type Series gives a Series of type timedelta.
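
the same two lines on a made-up stream, as a runnable sketch (the timestamps are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2022-08-01 00:00:00", "2022-08-01 00:00:01", "2022-08-01 00:00:02",
    "2022-08-01 00:05:00", "2022-08-01 00:05:01", "2022-08-03 12:00:00",
])})

gaps = df["timestamp"].diff()                          # timedelta Series, first entry is NaT
is_large_gap = gaps.abs().gt(pd.Timedelta("1 sec"))    # True where the jump exceeds 1 second
inds_large_gaps = is_large_gap[is_large_gap].index     # labels of the rows that end a jump
print(inds_large_gaps.tolist())  # [3, 5]
```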

quick eagle
#

side note: is df['timestamp'] same as calling df.timestamp? how do you distinguish between methods and columns?

serene beacon
#

using pandas to get the duration between two dates gives me a weird output, what can I do to get the hours that passed between those dates?
```python
import pandas as pd

df = pd.read_csv('./data/essential_info_dashborad.csv')

df['Last session begin'] = pd.to_datetime(df['Last session begin'], errors='coerce')
df['Last session end'] = pd.to_datetime(df['Last session end'], errors='coerce')

df['test'] = df['Last session begin'] - df['Last session end']

df.to_csv('./data/testingdate.csv', index=False)
```

quick eagle
#

df['diff'] = (df['end']-df['start']).dt.total_seconds()

#

df['diff'] = (df['end']-df['start']).dt.total_seconds()/3600
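
a runnable sketch of that conversion on made-up data (the start/end column names follow the snippet above, the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(["2022-08-07 10:00", "2022-08-07 22:30"]),
    "end": pd.to_datetime(["2022-08-07 12:30", "2022-08-08 01:00"]),
})

# Subtracting datetimes gives timedeltas; total_seconds() / 3600 turns them into hours.
df["hours"] = (df["end"] - df["start"]).dt.total_seconds() / 3600
print(df["hours"].tolist())  # [2.5, 2.5]
```

note the order: end minus start gives positive durations, while start minus end gives negative timedeltas, which may be part of why the original output looked odd.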

quick eagle
serene beacon
lapis sequoia
#

does anyone know how to retrieve and add the rest of a dataframe(df1) based on two columns worth of data in another dataframe(df2) that have the same column names in both df? I tried merging(inner and left) but since some of the values in the dfs are duplicated it messes the whole thing up. im coming from excel so im trying to fundamentally do a vlookup based on two conditions - thanks!

steady basalt
#

❤️ Check out Weights & Biases here and sign up for a free demo: https://www.wandb.com/papers
❤️ Their blog post is available here: https://www.wandb.com/articles/better-paths-through-idea-space

📝 The paper "Emergent Tool Use from Multi-Agent Interaction" is available here:
https://openai.com/blog/emergent-tool-use/

worthy phoenix
#

is there any opensource implementation of gpt3?

serene scaffold
worthy phoenix
#

xDDD, forgot to add that

serene scaffold
worthy phoenix
#

sadge

worthy phoenix
serene scaffold
worthy phoenix
#

couldnt find the gpt2 one either

serene scaffold
worthy phoenix
#

aight trying again, btw if u get free time to look for it , pls drop it to me will ya?

serene scaffold
#

Did you try "gpt2 GitHub"

worthy phoenix
#

oh ok

worthy phoenix
#

nice

#

thanks

tidal bough
#

Is there a way to make pandas's fancy integrated dataframe display break very long cells into lines?
E.g. consider

df = pd.DataFrame.from_dict(dict(a=[["hello"*30]]))

If you do pd.options.display.max_colwidth = 150, this row will be shown as a giant line. But can you make it shown fully, but broken into more than one line?

#

visual aid

#

I want this cell shown in full, but broken into two lines

lapis sequoia
#

I'm using Pytorch to try to train a model and I'm getting this error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (10x25568 and 400x120)

any tips on how to fix?

#

my code:

transform = transforms.Compose([transforms.Resize((150, 200)), transforms.ToTensor()])
-----------------
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(train_dler, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0
#

I can share more code if needed

serene scaffold
lapis sequoia
#

I think that my images are sized incorrectly

serene scaffold
lapis sequoia
#

I understand that it can't multiply the two matrices(my image's pixels and the weights I presume)

#

but I might be wrong

serene scaffold
#

not even as it relates to your code. does "mat1 and mat2 shapes cannot be multiplied (10x25568 and 400x120)" mean anything to you?

serene scaffold
lapis sequoia
#

because the row in the first matrix needs to be the same size as the column of the 2nd matrix

#

and it isn't so

serene scaffold
#

so the question is, how are mat1 and mat2 created from your code?

#

we would need to see the whole traceback (ie, the whole error message) to begin to guess.

lapis sequoia
#

sure

serene scaffold
#

what is net?

lapis sequoia
#

so Net.forward() is:

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
#

net is an instance of this ^

serene scaffold
#

great. the error message shows you that the problem starts here: ---> 19 x = F.relu(self.fc1(x))

#

and then return F.linear(input, self.weight, self.bias)

lapis sequoia
#

yes

serene scaffold
#

so figure out what x is when you get the error

lapis sequoia
#

I think x is the matrix of my images,

#

I don't understand what the second matrix (400x120) is

#

and I'm not certain if the first matrix is even the image

proper ingot
#

How do I include the head of the dataframe (first column) in the csv

#

The code for that would be under

#

#CENTRAL DATAFRAME

lapis sequoia
modest onyx
#

the error is not even in the code you gave us

#

it's probably inside net

#

oh you also gave that nvm

modest onyx
#

in other words, it's expecting inputs that are 400 dimensional vectors

#

convolutional layers don't care about the spatial size of the image, but fully connected layers do

#

so you have to make sure that your input image is the right size such that the dimensions end up matching (you can do that by cropping/resizing the image)
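
the sizes can be traced by hand. a sketch in plain Python (no torch needed), assuming the Net shown above and the standard output-size formula for Conv2d/MaxPool2d without padding or dilation; the helper names are made up:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of one spatial dimension after Conv2d/MaxPool2d (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_size(height, width):
    """Features reaching fc1 for the Net above: conv(5) -> pool(2) -> conv(5) -> pool(2)."""
    for kernel, stride in [(5, 1), (2, 2), (5, 1), (2, 2)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return 16 * height * width  # conv2 has 16 output channels


print(flattened_size(150, 200))  # 25568, the second number in "10x25568"
print(flattened_size(32, 32))    # 400, what fc1 = nn.Linear(16 * 5 * 5, 120) expects
```

so with Resize((150, 200)) the flattened batch is 10x25568 while fc1's weight is 400x120. two options that would make the shapes agree: resize to 32x32, or define fc1 as nn.Linear(16 * 34 * 47, 120).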

rapid cedar
#

how do i make an ai that predicts typos
like if i type "helo"
list = ["hello", "goodbye"] it will predict that "helo" is the closest to "hello"

soft lotus
#

You can make a tree of the letters in your list and find words that match by using the letters of the typo until the typo doesn’t have a letter in a path on your tree then suggest the word at that node

#

@rapid cedar
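
for a short word list, a simpler alternative to building a tree is a similarity score from the standard library. this is a different technique from the one described above (difflib's sequence-matching ratio rather than a letter tree), shown only as a sketch:

```python
import difflib

words = ["hello", "goodbye"]

# Return the single best candidate whose similarity ratio passes the default cutoff.
print(difflib.get_close_matches("helo", words, n=1))  # ['hello']
```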

brisk apex
#

hey guys I hope this is right place to ask. I've been having hard time trying to look for real life example for usage of global temp view in spark. Has anyone used this feature, and if so, could you share your experience, like why use that over temp view.

I'm well aware of definition for global temp view which exists across all spark sessions compared to temp view which expires when session that was created ends; I just can't imagine real life case to use global temp view from the first place. What would require you to create separate session and use that global temp view instead of just using a session you already have opened with existing temp view? Thanks in advance

untold bloom
#

yes df["col_name"] and df.col_name are equivalent, except in 2 cases:
- if "col_name" isn't a valid Python identifier, it will fail
  - e.g., spaces in it: "current value", starting with a number: "20th quantile", containing apostrophe: "today's"
- if it clashes with a method/attribute of a Series/DataFrame as you said
  - e.g., df.sum will go to the method even if you have a column named "sum"

so, df["col name"] always works, df.col_name sometimes works. But the latter is easier to type, so if it is fine to do, i prefer it due to laziness :| It also makes code more readable IMHO when chaining things like we did above (.diff().abs()...)
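
a quick sketch of both cases on a made-up frame (the column names are chosen to trigger them):

```python
import pandas as pd

df = pd.DataFrame({"sum": [1, 2], "current value": [3, 4], "price": [5, 6]})

print(df.price.equals(df["price"]))   # True: same column either way
print(callable(df.sum))               # True: attribute access finds the method, not the column
print(df["sum"].tolist())             # [1, 2]: brackets still reach the column
print(df["current value"].tolist())   # [3, 4]: only brackets work with a space in the name
```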

untold bloom
#

sample data

In [20]: df
Out[20]:
        date  sales
0 2021-12-29    300
1 2021-12-27    100
2 2021-12-30    100
3 2021-12-31    300
4 2021-12-28    200
5 2022-01-03    500
6 2022-01-02      0
7 2022-01-01    200
#

above code applied to get is_large_gap, except the threshold being 1 day here

In [21]: is_large_gap = df.date.diff().abs().gt(pd.Timedelta("1 day"))

In [22]: is_large_gap
Out[22]:
0    False
1     True
2     True
3    False
4     True
5     True
6    False
7    False
Name: date, dtype: bool
#

so what I understood is, you want to start with some letter, say "A", and keep it going as long as "no large gap"; once it hits a large gap, change it to "B", i.e., the successive letter. And do this until the end.

rapid cedar
untold bloom
#

to that end, we can use .cumsum()... This takes the cumulative sum: let's see what it would do to a Boolean series: if it sees a True, the sum accumulated so far is increased by 1 (True is 1 in numeric context). So this is good: when it hits a large gap point, it will change its value. If, on the other hand, it sees a False, the sum accumulated so far won't change! (because False is 0 in numeric context). This too is good: when it hits a not-large gap, it won't change its value.

#

well this gives this:

In [23]: is_large_gap.cumsum()
Out[23]:
0    0
1    1
2    2
3    2
4    3
5    4
6    4
7    4
Name: date, dtype: int32
#

see how it stays the same when it hits False's?

#

now all we need is a mapper: 0 -> A, 1 -> B, ...

#

assuming it won't exceed 25, we can use the ASCII alphabet :p

#

so here we go map:

In [24]: import string

In [25]: mapper = dict(enumerate(string.ascii_uppercase))

In [26]: mapper
Out[26]:
{0: 'A',
 1: 'B',
 2: 'C',
 3: 'D',
 4: 'E',
 ...
 23: 'X',
 24: 'Y',
 25: 'Z'}

In [27]: is_large_gap.cumsum().map(mapper)
Out[27]:
0    A
1    B
2    C
3    C
4    D
5    E
6    E
7    E
Name: date, dtype: object
#

note that you'll get NaNs instead of letters in the output if the value you're mapping is not in the mapper

#

e.g., if there's 47 to map, since it's not in mapper, it will put NaN in the result.

#

range(is_large_gap.sum() + 1) will give you the range of numbers to cover in your mapping ('s keys): the cumulative sum starts at 0 and ends at the total number of True's.
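
the whole walkthrough as one runnable sketch, using the sample data shown earlier (the "epoch" column name is an assumption, not something from the question):

```python
import string

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-12-29", "2021-12-27", "2021-12-30", "2021-12-31",
        "2021-12-28", "2022-01-03", "2022-01-02", "2022-01-01",
    ]),
    "sales": [300, 100, 100, 300, 200, 500, 0, 200],
})

is_large_gap = df["date"].diff().abs().gt(pd.Timedelta("1 day"))
group_id = is_large_gap.cumsum()                    # increases by 1 at every large gap
mapper = dict(enumerate(string.ascii_uppercase))    # 0 -> 'A', 1 -> 'B', ...
df["epoch"] = group_id.map(mapper)                  # ids missing from mapper would become NaN

print(df["epoch"].tolist())  # ['A', 'B', 'C', 'C', 'D', 'E', 'E', 'E']
```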

untold cradle
#

Hi guys, im not sure where should i ask this,
but im looking for a project idea (something like AlgoTrading).
i feel like im intermediate - advanced in python

Any idea, or help would be appreciated!

strong sedge
#

I am taking machine learning classes (high level, only application), the teacher said that rather than learning about the working of the algorithm (gradient descent and OLS) I should just learn to use it
is this correct ? should I attempt to understand/learn it myself ?

wooden sail
#

you should at least have some notion of when and how it makes sense to use it

#

OLS is not always the best estimator, and gradient descent doesn't always make sense to use. even when it DOES make sense to use grad des, it only converges under special conditions

#

and WHAT it converges to is another matter
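
a toy illustration of the convergence point, assuming the simplest possible objective f(x) = x**2 and a fixed step size (both chosen here for illustration only):

```python
def gradient_descent(step_size, x0=1.0, n_steps=50):
    """Minimize f(x) = x**2 (gradient 2*x) with a fixed step size."""
    x = x0
    for _ in range(n_steps):
        x = x - step_size * 2 * x
    return x


print(abs(gradient_descent(0.1)))  # shrinks toward the minimum at 0
print(abs(gradient_descent(1.1)))  # grows: the iteration diverges
```

same algorithm, same function, and the only difference is the step size. knowing why that happens is the kind of "notion" that helps when a library call misbehaves.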

serene steeple
#

what is the google service that provides computing power for machine learning called

steady basalt
#

anyone know how to deal with such noisy data that my minority class precision is almost 0?

young ridge
#

hi guys im a new data science student and i was wondering what do i call those values that are above 600

#

do i call them anomalies or outliers?

wooden sail
#

anomaly usually refers to a behavior that is caused by something external and is not part of the underlying distribution. extreme values are part of the original distribution, just very unlikely

young ridge
#

Ohhhhhh

wooden sail
#

to be able to tell which of the two it is, some kind of anomaly detection is needed

#

in some cases the difference doesn't matter and they're treated as synonyms, so more context is needed 😛

young ridge
#

is it logical to remove any values above 400 or 500?

serene scaffold
# wooden sail anomaly usually refers to a behavior that is caused by something external and is...

one time, a non-technical person asked me to do anomaly detection. and they weren't very clear on what an anomaly was in the context of their data. so I asked a senior coworker, and he said

"anomaly detection. they don't know what an anomaly is--wouldn't know if it bit them. but it's a buzzword. everyone and their grandma is doing it--that's probably who told them about it."

"sure, but what would an anomaly be in this context?"

"you know, you're asking the right questions."

wooden sail
#

sounds about right

young ridge
#

if i were to do a .describe() on my departure delay in minutes column, the 3rd quartile (75%) only has a value of 12 minutes

#

does this count as anomaly detection?

tropic matrix
#

I'm solving a regression problem, but when i calculate the MSE for my loss it becomes "inf". I believe this is because my output data ranges from 1 to 2.147 billion, but what should i do to solve it?
(using keras)
i have rmse as a metric, and that displays a valid number, but when the loss is initially high it's impossible for it to be displayed as mse,
so is it a viable solution to set objectives (like for early stopping, reducing lr, and hyper parameter tuning) to be the validation rmse?

wooden sail
#

as a buzzword, sure. but since you're likely not trying to find how many distributions are needed to correctly describe your data without overfitting (a model order estimation problem), it sounds like you're just looking to remove outliers. you can make a histogram and see what probability distribution fits it best, then remove extreme values based on that

young ridge
#

so im kind of stuck on whether i should remove them or not

wooden sail
#

probably so if the entries skew the distribution

young ridge
#

ohhhhh

wooden sail
#

for example OLS regression is only the optimal (maximum likelihood) estimator if the errors truly are normally distributed

#

so check the histogram

young ridge
#

thank you for the help by the way

wooden sail
#

if you're willing to use maximum likelihood based on the sample covariance instead of assuming normality, you'll get something often called "mahalanobis distance" instead of the ordinary least squares

#

would be nice to compare if you have some time to spend

young ridge
#

ah ill try

#

i just learnt OLS regression so ill have to give that a try first

tropic matrix
young ridge
wooden sail
wooden sail
tropic matrix
#

i already normalize my x data btw

wooden sail
#

the y as well. it also sounds like you have some overfitting
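
one way to rescale a target with that kind of range, sketched with numpy. the log transform is an assumption here (standardizing with mean and std is another common choice), and the values are made up:

```python
import numpy as np

y = np.array([1.0, 250.0, 1.0e6, 2.147e9])   # made-up targets spanning the reported range

y_log = np.log1p(y)        # train on this compressed target instead of y
y_back = np.expm1(y_log)   # apply the inverse to predictions to get the original scale back

print(y_log.round(2))
```

the squared errors then stay in a range the loss can represent comfortably, and metrics can still be reported on the original scale after inverting.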

young ridge
tropic matrix
young ridge
#

after changing the bins there wasnt much difference and it remained skewed

wooden sail
young ridge
#

101695 rows and 24 columns

wooden sail
#

try 500 bins for one last look at the data

young ridge
#

i think 100 is the max it can go 😆

#

if not the bins would be very very thin

#

lemme try removing some outliers

wooden sail
#

well but that's already ok

#

now the question is, which distribution does this look like

young ridge
#

usually for OLS regression it would be logical for the column to be normally distributed right?

#

technically this is right skewed

wooden sail
#

that's what you normally assume, but that evidently isn't the case

young ridge
#

Ohhhhhhhh

wooden sail
#

so your options are: DON'T use OLS and compute the sample covariance

#

or manipulate the data and use OLS

#

that was the point of this exercise

#

but it may be the case that the data never looks normally distributed no matter where you draw the cutoff point

#

poisson, gamma, and chi-squared distributions all kinda look like what you have. and indeed, these have limiting cases in which they approach the gaussian distribution

young ridge
#

alright thank you

#

i know what to do from here

#

thanks

wooden sail
#

you can try and make a box and whiskers plot and see where the mean and median are, and use this info to make a cutoff decision
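
a sketch of a cutoff based on the box plot's whiskers, assuming the common 1.5 * IQR convention (that convention and the delay values are assumptions, not something taken from the dataset):

```python
import numpy as np

delays = np.array([0, 0, 1, 2, 3, 5, 8, 12, 20, 600])   # made-up delays in minutes

q1, q3 = np.percentile(delays, [25, 75])   # edges of the box
iqr = q3 - q1
upper = q3 + 1.5 * iqr                     # upper whisker under the 1.5 * IQR convention

kept = delays[delays <= upper]             # drop only the points beyond the whisker
print(upper, kept)
```

on strongly right-skewed data this rule can flag a lot of legitimate values, so it is worth checking how many rows it would remove before applying it.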

#

aight, best of luck

frail dune
#

Hey guys not sure if this is the right place to ask but could anybody here help me generalize this Matrix in a loop im kinda stuck

#

if for example here K1 = {0, 2}, K2 = {0}, K3 = {3}, K4 = {1, 3}

#

not sure how to build the Matrix so the rows equal the amount of Elements inside each K[i]

wooden sail
#

as i understand it, the first column contains the elements of the K_i. the second column is just the value of i?

frail dune
#

the first column(first row) is the Value of the first Element in K1, the first column(row 2) is the second Value of K1

#

the second column stands for the the index after K

#

and the third column holds the constraints (a(x) if its K1, b(x) if its K2, M(x) if its K3 and Q(x) if its K4)

wooden sail
#

this is my artistic interpretation. you can add the bit about a,b,M and Q yourself, i think

#
In [1]: import numpy as np

In [2]: K = [[0,2], [0], [3], [1,3]]

In [3]: N = 0

In [4]: for k in K:
   ...:     N += len(k)
   ...: 

In [5]: B = np.zeros((N,3))

In [6]: row = 0

In [7]: for i,k in enumerate(K):
   ...:     B[row:row+len(k),0] = k
   ...:     B[row:row+len(k),1] = i
   ...:     row += len(k)
   ...: 

In [8]: B
Out[8]: 
array([[0., 0., 0.],
       [2., 0., 0.],
       [0., 1., 0.],
       [3., 2., 0.],
       [1., 3., 0.],
       [3., 3., 0.]])

#

ah, it was supposed to be i+1

frail dune
#

ty very much this line 7 was killing me

frail dune
#

if its K = [[0,2], [0], [n], [1,3]]

wooden sail
#

if n is already defined, yes

#

do you need the matrix to be variable?

#

your options are to make this into a function and pass in specific values, or use sympy/symengine

frail dune
#

okay i guess i'll just leave that then 😄 not sure if I even need that was just a thought

#

thank you

wooden sail
#

the code i wrote there works regardless of what is inside K, as long as it is defined and K is a list of lists of floats
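
the same loop wrapped in a function, as a sketch. the function name is made up, and it uses the i+1 numbering mentioned above so the second column reads 1..4 instead of 0..3:

```python
import numpy as np


def build_matrix(K):
    """Stack the elements of each K_i into rows of [element, i, 0], with i starting at 1."""
    n_rows = sum(len(k) for k in K)
    B = np.zeros((n_rows, 3))
    row = 0
    for i, k in enumerate(K, start=1):
        B[row:row + len(k), 0] = k    # first column: the elements of K_i
        B[row:row + len(k), 1] = i    # second column: the index i
        row += len(k)
    return B                          # third column is left at 0 for the constraints


B = build_matrix([[0, 2], [0], [3], [1, 3]])
print(B)
```

calling it again with different numbers inside K gives the "variable" matrix without needing symbolic math.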

serene scaffold
#

there are libraries that do this. have you googled "python speech recognition library"?

#

being a liturgical language, if you had to create your own speech recognition model, how much training data would be available?

#

you would need lots of examples of Sanskrit speech audio and transcripts. and you would probably need to align those in the time dimension.

steady basalt
#

Tfw the data won’t allow for a working model for my thesis, fuck

#

What a disaster

abstract sentinel
#

Hey, I might be in a little wrong section, but I think some of you guys could give me ideas.
I currently work in a high school as a physics teacher, and I'm considering replacing slides with colab notebooks. One of the main reasons, is use of interactive demos with widgets. I learned how to make interactive graphs which is great, however I was wondering if somebody could suggest any other cool demos/libraries. I would kill for a library where I could create simple interactive animations, as I'm already using manimCE to create short animations.
Any directions/suggestions are welcome!

desert bear
#

is there already a name for a ML program that has a model with lots of sentences and entities, and classifies an input sentence with the data in the model and returns the entity?

limber token
#

Hey guys, how can I transpose one column to another df while matching another column? Visual example:

#

Without having to iterate through the df using df.at, that is

serene scaffold
limber token
#

Sorry if I wasn't clear

serene scaffold
#

!docs pandas.DataFrame.merge

arctic wedgeBOT
#

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
serene scaffold
#

it might be that you need to merge on both SKU and ATTR1.

by the way, what pandas calls a "merge" is a join in SQL terminology.
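
A minimal sketch of the suggestion above. The real frames from the screenshot are not shown, so `left`, `right` and the `PRICE` column are hypothetical stand-ins; only the `SKU` and `ATTR1` key names come from the conversation.

```python
import pandas as pd

# Hypothetical data: the frames in the original screenshot are not known.
left = pd.DataFrame({
    "SKU": ["a1", "a1", "b2"],
    "ATTR1": ["red", "blue", "red"],
})
right = pd.DataFrame({
    "SKU": ["a1", "a1", "b2"],
    "ATTR1": ["red", "blue", "red"],
    "PRICE": [10.0, 12.0, 7.5],
})

# Left join on both key columns: every row of `left` is kept, and PRICE is
# taken from the row of `right` that has the same (SKU, ATTR1) pair.
merged = left.merge(right, on=["SKU", "ATTR1"], how="left")
print(merged)
```

Merging on `SKU` alone would pair every "a1" row on the left with every "a1" row on the right, which is why the second key matters here.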

abstract sentinel
limber token
wooden sail
lapis sequoia
#

Is it fine to use clustering algorithms even if you have labels?

#

But instead to use the labels for validation

abstract sentinel
misty flint
#

manim

wooden forge
#

Hi there, I have a small issue with importing a data set. The CSV file contains: Team 1; Team 2; Score Team 1; Score Team 2, so str; str; int; int. But I can't manage to import it in numpy.
I used the command np.genfromtxt(filepathEast,skip_header=1,usecols = (0, 1)) but I get an error message for each line: Line #2 (got 1 columns instead of 2). So yeah, struggling a bit; I haven't touched Python for a month and a half now.
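
One likely cause, assuming the file really is semicolon-separated as described: `np.genfromtxt` splits on whitespace unless a `delimiter` is given, so a line without spaces counts as a single column. A small sketch with made-up rows (the team names and scores below are invented; `io.StringIO` stands in for the real file path):

```python
import io
import numpy as np

# Invented sample with the same layout as described in the question.
csv_text = (
    "Team 1;Team 2;Score Team 1;Score Team 2\n"
    "Lakers;Celtics;101;99\n"
    "Bulls;Heat;88;95\n"
)

# Read the two text columns and the two integer columns separately,
# telling genfromtxt that fields are separated by ';'.
teams = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1,
                      usecols=(0, 1), dtype=str, encoding="utf-8")
scores = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1,
                       usecols=(2, 3), dtype=int, encoding="utf-8")

print(teams)
print(scores)
```

For mixed text and numbers, `pandas.read_csv(path, sep=";")` is often more convenient than numpy.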

lapis sequoia
#

Is this a really bad model?

lapis sequoia
#

does anyone know how to recursively merge/concat a temp df into a master_df? I have a temp df that is created for every id, and it calculates the difference between row values in a column. I'm trying to bring that column into the master_df so I can track the differences there. <- for python
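
One possible approach, sketched under the assumption that the per-id difference is a plain row-to-row difference: `groupby(...).diff()` computes it directly on the master frame, so no temporary frame or merge is needed. The column names `id` and `value` are hypothetical.

```python
import pandas as pd

# Hypothetical master frame; the real column names are not known.
master_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "value": [10, 13, 19, 5, 4],
})

# Difference to the previous row within each id. The result is aligned
# to master_df's index; the first row of each id has no predecessor (NaN).
master_df["diff"] = master_df.groupby("id")["value"].diff()
print(master_df)
```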

desert oar
lapis sequoia
#

To make it better

#

Because my train data also has tons of negative labels

#

No churn labels I mean

thick marlin
#

Can someone help explain the tensorboard training logs generated for GAN network

#

Are these good or bad? How do I compare ?

mellow vapor
#

does treating a column dtype as categorical instead of object dtype make any difference?
apart from the space optimization does it have any other impact?

wooden forge
#

Would any of you have good resources about score prediction with neural networks? I can't find anything satisfying online

warped osprey
#

I have a simple pandas question... I want to fill in the NaN values for the code column by referring to the dict below... how can I do that?
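
A hedged sketch of one way to do this. The screenshot and dict are not shown, so the `name` column, the `codes` dict and the values are hypothetical; only the `code` column name comes from the question.

```python
import pandas as pd

# Hypothetical frame and lookup dict standing in for the ones in the question.
df = pd.DataFrame({
    "name": ["France", "Japan", "Peru"],
    "code": ["FR", None, None],
})
codes = {"France": "FR", "Japan": "JP", "Peru": "PE"}

# Look up a code for every row, then use it only where `code` is missing.
df["code"] = df["code"].fillna(df["name"].map(codes))
print(df)
```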

steady basalt
steady basalt
#

I opt to use class weight atm

#

Easier one liner code

#

Does the same shit but you don’t lose info or gain noise
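
For reference, a minimal sketch of what "balanced" class weights usually mean: each class gets weight `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The helper name and the 90/10 label split below are illustrative assumptions, not the actual thesis data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Every sample is kept and none is synthesised; the rare class simply contributes more to the loss.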

#

I just got my thesis results and the model has 0.06 precision in predicting the positive class. Do u think there’s any redeeming thing I can write so to not get bad grade? It’s the datas fault

mint palm
#

is it possible to overfit to data while using naive bayes?

wooden sail
#

for completeness: naive bayes is a probabilistic model. you pair this with a "rule" to make an estimator: a statistic with which you evaluate whether a model is good. maximum posterior likelihood is common here (also called maximum a posteriori)

#

you then bundle this up with 2 other things: an actual model, which is your network, and an optimizer, which optimizes the model parameters based on the estimator

#

the model, optimizer, and/or available data might not be good enough to correctly implement the estimator

mint palm
#

aren't sigma and mean all fixed? also in naive bayes we consider the variables to be independent and disjoint.

wooden sail
#

so you're specifically using a gaussian model? also no, the means and variances of the gaussian model are computed per class

#

and yes, you consider each mean and variance per vector in a class to be disjoint and dependent only on the class

#

so each class has its own mean vector and covariance matrix (diagonal, under the naive independence assumption)

mint palm
#

i dont get, what are we optimising

wooden sail
#

they are fixed, but they are computed from the data. and i guess a gaussian is about as easy as it gets

mint palm
#

can we further nudge sigma and mean?

wooden sail
#

wdym nudge

mint palm
#

optimize

wooden sail
#

they're already computed from the data

#

you solve an optimization problem to find them

#

and once those are learned, what the estimator does is optimize for the class of each sample you feed it by maximizing the posterior likelihood

#

so it's optimizing 2 things

#

the parameters of the gaussian distribution of each class first, as parameters of the model during training

#

then those params are used to infer the classes of examples you feed it when using the trained model
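
The two steps described above can be sketched in plain Python. This is an illustrative toy, not the asker's code: the function names and the six data points are made up, and the per-class parameters are simple maximum-likelihood estimates.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Step 1: estimate prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variance; the floor avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(columns, means)]
        params[label] = (n / len(X), means, variances)
    return params


def predict(params, x):
    """Step 2: pick the class with the largest (unnormalised) log posterior."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, var in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(params, key=lambda label: log_posterior(*params[label]))


# Made-up two-feature data with two well separated classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [4.0, 5.2], [4.2, 4.8], [3.9, 5.0]]
y = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_nb(X, y)
print(predict(params, [1.1, 2.0]), predict(params, [4.1, 5.1]))
```

Here the means and variances have closed-form solutions, so "training" is just computing them once from the data; there is nothing left to nudge afterwards.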

#

if you're using neural networks for the inference, then you can consider learning the parameters of the gaussian distributions a "pre-learning", and then you feed labeled examples so that the network trains the weights of the inference network that predicts the class

#

i formulate it this way because all you said was naive bayes with gaussian model. that's barely enough to describe the estimator, not how you're implementing it

mint palm
#

i sort of get it, but it's sort of difficult to understand how the optimisation will take place.

wooden sail
#

optimization of what?

mint palm
#

i will look at code, it will clear things

wooden sail
#

i would think it's usually backwards, but hopefully that helps you

trail adder
#

Hi!! Do any of you know about data scrubbing?

merry wadi
#

Is there anything wrong with using a model to make predictions then incorporating the predictions into another model?

magic dune
#

why do people use -1 and 1 for perceptron?

weary crown
barren snow
#

Hey guys, I couldn't understand the meaning and the calculation in this section. Could sb explain to me, I appreciate it!!!

iron basalt
#

.latex $\frac{1}{N}\sum_{i=1}^{N}...$ is often an averaging.

strange elbowBOT
barren snow
#

Thanks! So what's the meaning of bins here?

wooden sail
#

bins usually refers to frequency bins, but it's hard to say without further context

celest patrol
#

Why are pd.join/merge/concat/combine so cursed

wooden sail
#

in general bins refer to reference values for the discretization of a domain, so do make sure you read the explanations given before that equation to ascertain whether it truly referred to the spectrum

lapis sequoia
#

And can you tell me how to use class weights

lapis sequoia
#
Name: City, Length: 200000, dtype: category
Categories (7489, object): ['Abbeville', 'Abbotsford', 'Abbottstown', 'Aberdeen', ..., 'Zumbro Falls', 'Zumbrota', 'Zuni', 'Zwingle']
#

7489 sounds like a lot of Categories to use for a categorical variable, but given the length of 200k it doesn't seem large enough, so it seems fine?

bold timber
#

Hi, I have a question: What is the meaning of "lambda: input_fn(train, train_y, training=True)"?

Why is there no variable after lambda, as in "lambda x: x**2"? Why is it only "lambda: "?

#

"input_fn" is input function

untold bloom
wooden sail
#

if train, train_y and training are defined and visible in the same scope as the lambda function, then it will call the function on those parameters

unique flame
#

So I'm trying to print a pandas table by saving it to a csv file and opening it in Excel. At first everything seemed fine, but now the decimal separator of the values has changed. If a value is 50.650 cm, it now just shows 50650. Anyone know a way to fix this? I already tried Options > Advanced.
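
A possible fix, assuming the cause is that Excel is set to a locale where "," is the decimal separator and "." the thousands separator (this is a guess from the symptom, not confirmed in the chat). `DataFrame.to_csv` accepts `sep` and `decimal` arguments, so the file can be written in that convention. The frame below is a made-up example and the CSV is written to memory instead of a file.

```python
import io
import pandas as pd

# Made-up example frame.
df = pd.DataFrame({"item": ["shelf"], "length_cm": [50.65]})

# Write with ';' between fields and ',' as the decimal mark, which is what
# Excel expects under many European regional settings.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", decimal=",", index=False)
csv_text = buffer.getvalue()
print(csv_text)
```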

bold timber
#

ok thank you so much! @untold bloom @wooden sail

bold timber
untold bloom
bold timber
#

Sorry, my bad. I misunderstood @untold bloom

bold timber
#

this is my function of 'input_fn'

untold bloom
#

in the first one you're calling the function with parens ()

#

in the second one you're not calling it; only referring to the function itself

#

if you do (lambda: input_fn(train_y, training=True))(), they will do the same thing

#

f is a function; f() is calling it

#

lambda: ... is a function; (lambda: ...)() is calling it
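
A small runnable illustration of the points above. The `input_fn` here is a stand-in that just returns its arguments, since the real one was posted as an image; `train` and `train_y` are dummy values.

```python
def input_fn(features, labels, training=True):
    # Stand-in for the real input function, which is not shown in the chat.
    return (features, labels, training)


train = [[1.0], [2.0]]
train_y = [0, 1]

# A zero-argument function: nothing runs yet, the call is only packaged up.
deferred = lambda: input_fn(train, train_y, training=True)
print(callable(deferred))

# Calling it with () runs input_fn with the values captured from this scope.
result = deferred()
print(result == input_fn(train, train_y, training=True))
```

This is why APIs that want "a function to call later" are handed `lambda: ...` and not the result of `input_fn(...)`.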

steady basalt
#

Have you thought about how models make predictions and what this means for confusion matrix

bold timber
mellow vapor
#

how to check for cross validation using a model like isolation forest?

young ridge
#

Hi, I recently tried creating a linear regression model using statsmodels.api and a scatter plot using matplotlib. After using the train_test_split function I ran into this error

#

is there any way i can make both x and y train the same size?

vast goblet
#

Hello there, I have a transaction dataset, and my goal is to find rules. So I decided to use the FP-growth algorithm, which is an association rule algorithm, but my minimum support is like 0.01%, which is very low for 55k transactions.

What can I do to fix this?

bold timber
#

What is the meaning of ** in .format(**eval_result)?

#

why did it calculate the power of eval_result?

steady basalt
#

if df_concat_noid['42006-0.0'][i] != np.datetime64('1970-01-01T00:00:00.000000000'):

#

anyone know why i cannot do this

steady basalt
#

hey guys... does anyone know whether it is fair practice to drop NA from ONLY the test set? while I can conditionally impute in the training data, the massive data imbalance means that imputing at all in the test set will mislead models into predicting the majority class

#

so, train test split, impute training data, and dropping any rows with NA from testing data to only test on complete samples

serene scaffold
serene scaffold
untold bloom
untold bloom
# bold timber What is the meaning of ** in .format(**eval_result)?

one use of ** is indeed exponentiation; but that's when it's an infix operator (i.e., takes 2 operands like 5 ** 7); here it's a prefix operator (i.e., unary operator; cares about only what's after) and it does some other thing. (another example is -: when used like 12 - 9, it subtracts; when used as -7, though, it negates.)

What unary ** does is to "unpack" its mapping operand. Presumably eval_result is a dictionary. Then that dictionary's key-value pairs are passed as keyword arguments to the .format function. Say it were the dict {"accuracy": 0.77}. Then with .format(**eval_result), it is as if we wrote .format(accuracy=0.77).
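
A tiny runnable version of that example. The template string is an assumption for illustration; the original one is not shown in the chat.

```python
eval_result = {"accuracy": 0.77}

# Hypothetical template; any string with an {accuracy} field behaves the same.
template = "Test set accuracy: {accuracy:0.3f}"

unpacked = template.format(**eval_result)   # keys become keyword arguments
explicit = template.format(accuracy=0.77)   # the same call written by hand

print(unpacked)
print(unpacked == explicit)
```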

slender wren
#

Hi, I am not sure which channel I should send this message to, but I wanted to know if there are any good websites for challenges in Python. By challenges, I don't mean like a competition, but something which will help practise basic concepts like loops, lists, functions, etc., using libraries like Pandas, Matplotlib, Numpy etc. If you know of any such website, please let me know.

timid hollow
lapis sequoia
bold timber
untold bloom
#

you're welcome!

delicate tendon
#

Hi there I had a Q on imbalanced data

#

I am predicting if an animal gets rehomed from a shelter. In general my data is only slightly imbalanced (60:40 split), however by animal type it is extremely imbalanced (like 5% of birds get adopted, 90% of dogs get adopted). Is this an issue? What are ways I should solve it?

#

These are the average proportions of animal types that get rehomed

lapis sequoia
#

Guys, if Boolean variables can be handled by the algorithms which only work with numerical distance, then what's the point of calculating a Jaccard similarity for Boolean features?

#

In my last assignment I dropped the Boolean features before feeding them to my knn classifier and converted them into a feature named jaccard.

#

But now people say that all of the Algorithms can handle categorical features with one hot encoding. And a Boolean column is already sort of one hot encoded.

steady basalt
#

What do u mean by that

lapis sequoia
#

Like

#

High BP:
1
0
0
0
1
1

#

Can I put it in directly to knn?

#

I last time transformed 5 of such Boolean features to a jaccard feature.

#

Which basically calculated the percentage of 1's for each row

#

So if it's 5 features.
1,1,0,0,1
Jaccard was 60% for that row.

steady basalt
#

What is a jaccard “feature”

#

That’s a binary feature ?

#

Jaccard index measured between values

lapis sequoia
lapis sequoia
#

5 features having values 1,1,0,0,1 gets transformed into 60% jaccard index.

#

As a feature.
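
To make the distinction in this exchange concrete, here is a sketch of both quantities. The per-row value described above is the fraction of ones in a single row, while the Jaccard index in its usual definition compares two binary vectors. The second row `other` is invented for the comparison.

```python
def fraction_of_ones(row):
    """Share of features that are 1 within a single row."""
    return sum(row) / len(row)


def jaccard_index(a, b):
    """|positions where both are 1| / |positions where at least one is 1|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero vectors count as identical.
    return both / either if either else 1.0


row = [1, 1, 0, 0, 1]
other = [1, 0, 0, 1, 1]   # invented second row

print(fraction_of_ones(row))      # the "60%" feature from the discussion
print(jaccard_index(row, other))  # similarity between two rows
```

A distance-based model such as k-NN can also take the 0/1 columns directly; collapsing them into one fraction discards which features were set.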

lapis sequoia
#

5% of birds are adopted and 90% of the dogs are, so the animal being a bird or a dog is something that should be taken into consideration while fitting the model.

#

So no, you should not balance for each animal type.

delicate tendon
#

Fair point, ty

haughty pewter
#

while doing clustering, if a scatterplot ends up looking like this, it's safe to say that both columns have no correlation and thus should be disregarded, is that correct?

serene scaffold
steady basalt
steady basalt
lapis sequoia
#

Leave it. I just have to write a report. I will write silly stuff in it

#

Is this channel pretty active, or are there better discords for AI questions?

rough mountain
#

Any good datasets of free to use images? They don't have to be labeled or anything, just images.

serene scaffold
serene scaffold
lapis sequoia
rough mountain
serene scaffold
rough mountain
#

So not like ct scan data or the large scale fish dataset

lapis sequoia
lapis sequoia
rough mountain
#

not what I originally had in mind, but yep that's perfect.

serene scaffold
lapis sequoia
lapis sequoia
trim zephyr
#

Hi anyone familiar with Tweepy

wooden sail
lapis sequoia
wooden sail
#

from mult and division? no

lapis sequoia
#

Okay so I’m lost, like what does the sigmoid function do besides multiplication and division?

wooden sail
#

exponentiation

lapis sequoia
#

Which is a special case of multiplication.

wooden sail
#

not at all, not for non integer arguments

#

if you're only working with integers, i'd be willing to let that slide, but then you can't use division

#

exponentiation is a nonlinear transformation

lapis sequoia
#

Okay. So would a NN with exponentiation, multiplication, and division be sufficient?

wooden sail
#

yes

#

though sigmoids are known not to yield the best results in general

lapis sequoia
#

Do you know of any approach that does that, IE avoiding addition and subtraction?

wooden sail
#

nope

lapis sequoia
#

Would be fun to try then

wooden sail
#

there's a reason no one does it, but try it out by all means

#

multiplication, division, and exponentiation can only do so much

lapis sequoia
#

Like what’s it missing?

wooden sail
#

in particular, you can't do any translations/shift up, down, left, or right

#

which means no matter how hard you try, you cannot change activation thresholds with them
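
A small numeric illustration of the threshold point, looking only at the pre-activation fed into a sigmoid (the sigmoid formula itself contains an addition, which is left aside here). The weights and bias are arbitrary example values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Scaling only: whatever the weight, the output crosses 0.5 at x = 0.
weights = [0.1, 1.0, 10.0]
outputs_at_zero = [sigmoid(w * 0.0) for w in weights]
print(outputs_at_zero)

# With an additive bias the crossing point moves to x = -b / w.
w, b = 2.0, -6.0
threshold = -b / w
print(threshold, sigmoid(w * threshold + b))
```

Changing `w` alters how steep the curve is, but only the added `b` moves where it switches.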

lapis sequoia
#

Hrm true.

wooden sail
#

anyway you need addition in back propagation and grad desc, but idk how strict you were being with the "no addition/subtraction"

#

you could do gradient-free techniques like simulated annealing, but again, depends how strict you are in allowing addition in the cost function itself

lapis sequoia
#

Well I’m kinda intrigued now.

#

What if you were completely strict, no addition/subtraction whatsoever

wooden sail
#

go find out and let me know 😛 but that limits which functions you can work with quite a bit

#

for one you are immediately kinda limited to work with min and max of functions that are bounded from below or above

#

since you can't even subtract the target value

#

can't take norms of anything other than scalars either

#

i would call it like "diet optimization", more like what you do the first time you learn about opt with local minima and maxima looking at the first and second derivative tests, or using probabilistic approaches

#

though now that you think about it, you can be a pain in the ass and work out loopholes like logs and exponents of products and divisions to achieve the same as addition and subtraction. up to you if you consider that fair or not

wooden sail
#

stuff like log(ab) = log(a) + log(b)
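
The loophole can be written out directly: since exp(a) * exp(b) = exp(a + b), a sum can be recovered from a product, and a difference from a quotient. The function names are made up for this sketch, and it only works while `exp` does not overflow.

```python
import math


def add_via_product(a, b):
    # log(exp(a) * exp(b)) == a + b, using only exp, log and multiplication.
    return math.log(math.exp(a) * math.exp(b))


def subtract_via_quotient(a, b):
    # log(exp(a) / exp(b)) == a - b, using only exp, log and division.
    return math.log(math.exp(a) / math.exp(b))


print(add_via_product(1.5, 2.25))
print(subtract_via_quotient(1.5, 2.25))
```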

lapis sequoia
#

Hrm that’s probably fair

wooden sail
#

in that case you're not all that constrained regarding cost funcs. working with independent random variables already lets you create log-likelihood expressions and maximum likelihood estimators only out of products of probability density functions, and their log expressions are equivalent to common cost funcs like least squares
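
As a worked version of that last claim, assume (for illustration) observations $y_i = f(x_i;\theta) + \varepsilon_i$ with independent noise $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; the symbols $f$, $\theta$ and $\sigma$ are introduced here and do not appear in the chat.

```latex
% Likelihood: a pure product of Gaussian densities
L(\theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
            \exp\!\left( -\frac{\bigl(y_i - f(x_i;\theta)\bigr)^2}{2\sigma^2} \right)

% Taking the negative logarithm turns the product into a sum
-\log L(\theta) = \frac{N}{2}\log\!\left(2\pi\sigma^2\right)
                  + \frac{1}{2\sigma^2} \sum_{i=1}^{N} \bigl(y_i - f(x_i;\theta)\bigr)^2

% The first term does not depend on \theta, so
\arg\max_{\theta} L(\theta) = \arg\min_{\theta} \sum_{i=1}^{N} \bigl(y_i - f(x_i;\theta)\bigr)^2
```

So maximizing a product of densities and minimizing a least-squares cost pick out the same parameters under this noise model.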

lapis sequoia
#

Do you know how to add something that affects the positive part of a function only (without being piecewise)?

lapis sequoia
wooden sail
#

1.) i like explainable/hybrid AI where networks are made out of classical models, but the hyperparams are optimally learned in a data driven fashion. 2.) any that need it:P i work with multidimensional data, and it's certainly needed there. that's stuff like anything with sensor arrays, composition of different sensors, hyperspectral imaging, spatial audio, etc

wooden sail
lapis sequoia
wooden sail
#

most commonly multi channel ultrasound stuff, but some colleagues work with stuff like mimo radar and satellites

#

depending how the ultrasound data is collected, you have data with axes like tx angle, rx angle, tx element, rx element, time/freq

#

i try to image stuff with it. tomographic inversion

lapis sequoia
#

Oh cool.

steady basalt
#

The wait is over. Finally found physicals and the calculus grind begins

#

All I need now is a trig book first

wooden sail
#

heh gil's book has an illustration of the "fundamental theorem of linear algebra" on the cover

steady basalt
#

I spotted a really good one which was called foundations of mathematics and it holds your hand through the very basics which I may borrow and recap, especially for stuff like sine and angles

#

may help to go back over that before trying this because i bet after a few dozen pages they'll ask you to use it, tho i'm not so sure the methodology cares whether it's a sine function or not

#

oh, and logarithms

dusty valve
#

what does WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 71742 vs previous value: 71742. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize. mean?

mint palm
#

wouldnt it be much better to just initially feed generator with real images but the ones which we are not feeding in discriminator?
Instead of random input??

#

probably not cuz when deploying, we would need similar images again to get novel output images.
We should input something that is actually input-able when deployed.

spare briar
mint palm
#

Also our generator generates noisy labels while our discriminator, being relatively robust to noise, cleans these labels which is not the case in TS framework.
I am new to generative models, but how is what that bold thing says possible?

brisk apex
#

I have a question regarding the concept of ELT: after transformation, once you have created the optimized dataframes and tables, what happens to them? Do they go back to the data warehouse for data analysts to work on? If not, do data analysts need to run the optimization every time they want to work on the optimized dataframes/tables?

misty flint
#

different architectures for different use cases

#

different resources as well

lapis sequoia
#

How would the steepness of an activation function impact its utility?

misty flint
#

the other question is also: are you the software/data engineer serving the data analyst/scientist? are you on the cloud -- if so, this will allow for more flexible architecture.

misty flint
lapis sequoia
#

What is Data Science and AI even for?

#

Isn't it just analysing data?

hazy saddle
#

Hello, I'm using pandas to filter some data, I'm using the next code:
first_week_data = data_relevant.loc[(data_relevant["FechaEncuesta"] >= first_day_week1) &
(data_relevant["FechaEncuesta"] <= last_day_week1)]

The problem is that I'm getting only the first and last day's data, not the days in between. Any ideas where the problem is?

serene scaffold
#

please show the result as text of print(data_relevant.dtypes)

#

@hazy saddle ^

serene scaffold
hazy saddle
lapis sequoia
#

Got it

serene scaffold
#

!docs pandas.Series.between

arctic wedgeBOT
#

Series.between(left, right, inclusive='both')
Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
serene scaffold
#

@hazy saddle use this instead ^
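
A sketch of how that could look. The actual cause of the original problem was never confirmed in the chat; this assumes the date column might be stored as strings, so it is converted first. The sample rows, the `valor` column and the week boundaries are invented; `FechaEncuesta`, `first_day_week1` and `last_day_week1` are the names from the question.

```python
import pandas as pd

# Invented sample data using the column name from the question.
data_relevant = pd.DataFrame({
    "FechaEncuesta": ["2022-03-07", "2022-03-09", "2022-03-13", "2022-03-14"],
    "valor": [1, 2, 3, 4],
})

# Make sure the column holds real datetimes, not strings, before comparing.
data_relevant["FechaEncuesta"] = pd.to_datetime(data_relevant["FechaEncuesta"])

first_day_week1 = pd.Timestamp("2022-03-07")
last_day_week1 = pd.Timestamp("2022-03-13")

# between() includes both end points by default.
mask = data_relevant["FechaEncuesta"].between(first_day_week1, last_day_week1)
first_week_data = data_relevant.loc[mask]
print(first_week_data)
```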

hazy saddle
serene scaffold
misty flint
hazy saddle
serene scaffold