#data-science-and-ml


verbal venture
#

ok, I want to create 20 classes. So I need 20 subfolders within train and test, correct? How would the class names get created within the code after?

small wedge
#

uh you can segment images however you like, they could all be in the same file and have the label class in the file name for example

#

cat_0001.png for example if cat was a class. Note that this is for single-label classification (i.e. only one of the classes can be chosen); it will be a little more complicated for multi-label classification
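A minimal sketch of the filename idea, assuming every file is named `<class>_<number>.<ext>`. The helper name `label_from_filename` and the file list are made up for illustration:

```python
from pathlib import Path


def label_from_filename(path: str) -> str:
    """Return the class name stored before the last underscore of the file stem."""
    stem = Path(path).stem                    # "cat_0001.png" -> "cat_0001"
    label, _, _number = stem.rpartition("_")  # split on the LAST underscore
    if not label:
        raise ValueError(f"no class prefix in {path!r}")
    return label


# Hypothetical file names; in practice these would come from os.listdir or glob.
files = ["cat_0001.png", "cat_0002.png", "dog_0001.png", "sports_car_0001.png"]

# Class names are discovered from the data, then mapped to integer ids.
classes = sorted({label_from_filename(f) for f in files})
class_to_idx = {name: i for i, name in enumerate(classes)}
labels = [class_to_idx[label_from_filename(f)] for f in files]

print(classes)
print(labels)
```

Splitting on the last underscore keeps class names that contain underscores themselves (like `sports_car`) intact.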

verbal venture
#

okay awesome, and just wondering, there's a subclass of male + female

#

so train/test -> male/female -> 0..20 for each folder

#

how would that work having the male + female subclass?

#

I can make some UI thing that just calls 2 separate models if it's too complicated to incorporate that subdivision of folders @small wedge

small wedge
#

that wasn't a yes/no question XD

#

I'm trying to think of it mathematically for the model's input/output

#

you have an input vector of pixels in the image (/ whatever data you want in addition to that)

#

and you have an output vector of your 20 classes

#

does this subclass fall into one of those categories? or is it just an organizational thing?

verbal venture
#

The model has to work on both males + females but it has to detect whether the user uploaded photo is male or female

#

It’s age prediction but I can’t compare the ages of males + females

small wedge
#

I see

#

well you can either have 2 models, the user inputs whether the picture is male or female, and pass that to the appropriate model

#

or you can use multi-label classification, with 21 classes

#

and have the model predict male/female as one of the classes
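A rough sketch of what the 21-output target could look like, assuming the first 20 entries are a one-hot age class and the last entry is a male/female flag. The names and the 0.5 cut-off are illustrative assumptions, not something fixed by any library:

```python
NUM_AGE_CLASSES = 20  # assumption: 20 age buckets, as discussed above


def encode_target(age_class: int, is_male: bool) -> list:
    """Build a 21-long target: one-hot age class plus a trailing gender flag."""
    if not 0 <= age_class < NUM_AGE_CLASSES:
        raise ValueError("age_class out of range")
    target = [0.0] * (NUM_AGE_CLASSES + 1)
    target[age_class] = 1.0
    target[NUM_AGE_CLASSES] = 1.0 if is_male else 0.0
    return target


def decode_output(scores: list) -> tuple:
    """Pick the best age class from the first 20 scores; threshold the last one."""
    age_scores = scores[:NUM_AGE_CLASSES]
    age_class = max(range(NUM_AGE_CLASSES), key=lambda i: age_scores[i])
    is_male = scores[NUM_AGE_CLASSES] >= 0.5
    return age_class, is_male


target = encode_target(age_class=7, is_male=True)
print(target)
print(decode_output(target))
```

With a target like this the age part and the gender part are scored separately, so a per-output loss (for example a sigmoid with binary cross-entropy on each entry) is the usual fit rather than a single softmax over all 21 outputs.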

cold osprey
#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

cold osprey
#

Huh is this python

#

Why are there ;

#

And std::cout

plucky bolt
# cold osprey Huh is this python

C++ but I am using matplotlib with it
The syntax is a bit different but that shouldn't matter all that much. I'm trying to figure out why there is that additional straight blue line going from the origin to the current data point being plotted in every update.

stable remnant
#

I have projects based on image processing and computer vision that I need help with. If there's any professional ML engineer willing to help, please let me know. Thank you!

magic dune
#

!paste

magic dune
#

line 85

#

help

#

It is a nn

magic dune
#

Works binary but not with multiple classes

supple prawn
#

is it better to watch tutorials than read books

simple tapir
#

@mild dirge @left tartan , thanks a lot guys! 🙏

sick ember
#

Hey everyone I have a quick question

sick ember
#

Is the loss in training in any way related to the number of GPUs my laptop has?

#

For some reason the loss in my model is extremely high with extremely low accuracy

#

Even though I seem to have done everything right

mild dirge
#

The number of gpu your laptop has?

#

Why do you make a relation between gpu and loss?

sick ember
#

Umm I was watching a tutorial on CNNs, and I was doing the exact same thing as the tutorial

#

But I’m getting different outcome

mild dirge
#

Probably not the exact same, or you got a very unlucky run

sick ember
#

Here is the tutorial

#

Something just felt wrong

potent sky
#

How much of a difference is there

mild dirge
#

Have you learned about linear regression and perceptrons yet?

mild dirge
#

You should have a solid understanding of those before you even begin looking at CNN

#

Start with the basics first would be my advice, there's many things that could make a cnn give bad results

glossy aspen
# sick ember Weights?

When you create NN you have connections between the neurons. In the beginning of the code you are giving some random numbers to them. So try several times to see if the loss decreases by chance

harsh bane
#

Hoi, for stable diffusion, how do i start off making a script that loads last of everything on boot, and has "if not detecting X extensions/models locally, install extension through extensions, install from url tab", restart webui, then read from new extension to fetch model", then reads it all to confirm it's there.

left tartan
#

Even if you know it, it’s a fun watch

verbal venture
#

does anyone know of a link/anywhere to learn what neural network architecture to create based on the problem

#

say if I wanted to classify 196 classes, how would I know what NN to create

#

is there a formal process or any resource cheat sheet

west oyster
#

I want to write a python program for a typical data analytics workload: collect data, clean it, do some prediction, and display insights/dashboard on a website. I want to write it in a modular way so folks can replace a component with a different one, not even in Go. What's a good approach to do that? Write a python program that makes calls to other python executables (out of process)?

#

And if someone has a codebase that I can get inspiration from, please share it

timid kiln
#

So I get a warning if I do this:

# code #1
output_df.dropna(subset=['flow'], inplace=True)
# compiler returns this message:
# See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# But no warning here, code #2:
output_df = output_df.dropna(subset=['flow'])

What I'm trying to do is remove all rows where flow has a value of None.

I looked up that link and I don't see where it applies to what I'm (doing).

I know I can import warnings and filter out the warning but... is that the right thing to do? Seems like I could just do #2 above and move along but... what's the right thing to do?
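For context, that documentation link is the one pandas usually attaches to its `SettingWithCopyWarning`, which tends to show up when the frame being modified in place was itself produced by selecting from another frame. A small sketch under that assumption (the `raw` frame and its columns are invented for illustration):

```python
import pandas as pd

# Hypothetical source data; "flow" has missing values.
raw = pd.DataFrame(
    {"site": ["a", "b", "c", "d"], "flow": [1.5, None, 3.0, None]}
)

# output_df is derived from raw by a selection. Modifying such a result
# in place is the situation the "view versus copy" caveat is about.
output_df = raw[raw["site"] != "d"]

# Option 2 from above: build a new frame and rebind the name.
# No in-place mutation happens, so there is nothing ambiguous to warn about.
output_df = output_df.dropna(subset=["flow"])

print(output_df)
print(len(raw))  # the source frame is untouched
```

If an independent frame is wanted from the start, calling `.copy()` on the selection is another common way to make the intent explicit, rather than filtering the warning out.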

timid kiln
#

"a pox on thee..." 😄 love it

serene scaffold
#

I never use inplace with pandas

#

if I ever make koalas, that won't have it

tidal bough
#

it's a bit annoying in pandas to do things nicely and immutably
like, plenty of assigns

timid kiln
#

Most times for what I'm doing I want to modify the underlying data. Otherwise i run the risk of making more and more copies of a dataframe, making the code more confusing. I think. Probably.

timid kiln
#

I just used .melt for the first time today so seems timely.

left tartan
#

also, my real answer is: don't use pandas, but I wont go there (and sometimes you have to)

timid kiln
#

So, essentially database tables/queries?

left tartan
#

Polars is fast(er) than pandas, but I don't want to learn yet another pythonic dataframe api that's not as good as sql.

timid kiln
#

I avoided pandas for a while and just used sqlite because I'm more comfortable with sql.

serene scaffold
#

there's a lot that I dislike about pandas. but I'm too familiar with it to want to learn something else. and it's so well documented.

timid kiln
#

"avoid" being relative... I have plenty of dataframes around the code things.

left tartan
#

pyarrow starts to change the game a bit tho.

worn stratus
# left tartan Polars is fast(er) than pandas, but I don't want to learn yet another pythonic d...

polars has a super intuitive API, if you know how to do something in SQL, you pretty much know how to do it in Polars. But you get the advantages of being able to do a ton of stuff in polars that is much harder in SQL - complex aggregation, pivoting etc - alongside composability which makes it workable as part of actual software rather than one off analysis/ETL

I was never super familiar with Pandas, but I really like polars

left tartan
#

I wish polars and pandas would get together and make a love child.

worn stratus
#

I don't really know what Pandas has to offer

#

that's a lie - it has more IO options

tidal bough
#

indexes

worn stratus
#

but the main reason it's so popular is that it was the first mover

tidal bough
#

what would be a messy join in polars is often something trivial like df1 + df2 in pandas, thanks to indexes.
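A tiny illustration of that index alignment, with made-up frames. The rows are deliberately stored in a different order to show that `+` matches on index labels rather than on position:

```python
import pandas as pd

df1 = pd.DataFrame({"sales": [10, 20, 30]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"sales": [1, 2, 3]}, index=["c", "a", "b"])

# Arithmetic aligns on the index (and column) labels first, then adds.
total = df1 + df2

print(total)
```

Labels that exist in only one of the frames would come out as NaN, which is the same result an outer join followed by an add would give.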

left tartan
#

(something something duckdb .... )

tidal bough
#

but yeah, polars is very nice

iron basalt
#

I would recommend using Polars instead of Pandas unless you have to or if Pandas is fast enough and you are already familiar with it.

left tartan
#

pandas indices are helpful for that specific case: for (in sql terms) natural joins across two df's

pure hinge
#

Since there is no finance sub here, thought I'd leave this here:

https://github.com/IlyaKipnis/PythonBacktesting/blob/main/edhec_perfa.py
https://github.com/IlyaKipnis/PythonBacktesting/blob/main/Return_portfolio.py

Some utility functions to rebuild some of R's financial ecosystem for portfolio allocation backtesting in Python considering that Zipline/Pyfolio are prone to breaking. More functions will be upcoming in the edhec_perfa file.

GitHub

Backtest asset allocation strategies in Python with only a background in pandas necessary - PythonBacktesting/edhec_perfa.py at main · IlyaKipnis/PythonBacktesting

GitHub

Backtest asset allocation strategies in Python with only a background in pandas necessary - PythonBacktesting/Return_portfolio.py at main · IlyaKipnis/PythonBacktesting

left tartan
#

(i mean, not exactly of course, but yah)

pure hinge
#

Heh--yeah, used a lot of chatGPT for this translating from R

left tartan
#

zipline was frustrating

pure hinge
#

But so many jobs say "must have Python, must have Python", though IDK which packages people use for testing signal based trading systems and limit orders

#

Market orders can be done just using pandas, but all of quantstrat's depth from R is just...nope nope nope

#

I looked at Zipline's code once and said "I am not touching that incomprehensible mess"

left tartan
#

Yah, I opted for (i'm a huge duckdb stan) duckdb over pandas, but started from probably the same place

verbal venture
#

hey guys weird question, but how do I find out the input dimensions of my model layers

pure hinge
#

Have you tried asking chatGPT yet?

verbal venture
#

yeah, just returns so many errors

#

plus I actually need to learn this

pure hinge
#

Seems you're not specific enough in your query then.

verbal venture
#

I should learn it anyway. It's a cnn. 50x256x256x196

pure hinge
#

not familiar with them, but doesn't the cnn library itself have a way of outputting that data?

verbal venture
#

do my model layers matter with the exception of the output (classes)

#

if you use transfer learning yeah

#

if you want to make your own model you set the input feature/outputs
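One way to work out layer dimensions by hand is the standard output-size formula for convolution and pooling layers, `out = floor((in + 2*padding - kernel) / stride) + 1` (with dilation 1). The sketch below applies it to a hypothetical layer stack on a 256x256 input. The stack and the channel count are assumptions for illustration, not anyone's actual model:

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer (dilation assumed to be 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for a hypothetical small CNN
layers = [
    ("conv 3x3, pad 1", 3, 1, 1),
    ("conv 3x3, pad 1", 3, 1, 1),
    ("maxpool 2x2", 2, 2, 0),
    ("conv 3x3, pad 1", 3, 1, 1),
    ("maxpool 2x2", 2, 2, 0),
]

size = 256  # assumed square input, 256x256
for name, kernel, stride, padding in layers:
    size = conv_output_size(size, kernel, stride, padding)
    print(f"{name}: {size}x{size}")

out_channels = 128  # assumed channel count of the last conv layer
flat_features = out_channels * size * size
print("in_features for the first linear layer:", flat_features)
```

The last number is what the first fully connected layer needs as its input size. If it does not match the real flattened tensor, the framework raises a shape error on the first forward pass.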

tidal bough
#

yeah, they are generally not nice

#

it's also horrible that pandas makes you use them. like, if you want to do a join, usually you have to do set_index (it can only join on column in one of the dfs, not both)

agile cobalt
#

you can specify left_on and right_on for pandas.merge iirc?

tidal bough
#

yup, that one you can I think

#

also today I tried to do a cross join in pandas and it just. doesn't work right. the good way according to google is, I shit you not,

pd.merge(
    df1.assign(_tmp=0),
    df2.assign(_tmp=0),
    on="_tmp",
).drop(columns="_tmp")
agile cobalt
#

I've never had to do a cross merge before, but does how="cross" just not work?
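For reference, a small sketch comparing the two, assuming pandas 1.2 or newer (where `how="cross"` was added to `merge`). The frames are invented:

```python
import pandas as pd

df1 = pd.DataFrame({"color": ["red", "blue"]})
df2 = pd.DataFrame({"size": ["S", "M", "L"]})

# Built-in cross join: every row of df1 paired with every row of df2.
crossed = pd.merge(df1, df2, how="cross")

# The temporary-key workaround quoted above, for comparison.
workaround = pd.merge(
    df1.assign(_tmp=0),
    df2.assign(_tmp=0),
    on="_tmp",
).drop(columns="_tmp")

print(crossed)
print(crossed.equals(workaround))
```

Note that `how="cross"` does not accept `on`, `left_on` or `right_on`, so passing a key alongside it is one way it can appear to "not work".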

worn stratus
#

@left tartan I hadn't ever looked at it till now, but DuckDB is very interesting

DuckDB looks like a solid api, and from a glance has good Polars support. I think I actually have a usecase where it will save me a ton of effort

left tartan
#

Yah, polars & pyarrow... feel free to dm me, I'm super into it right now.

worn stratus
#

Arrow has been great for getting everything to be able to talk to everything else.

left tartan
#

So, numpy is still important and will be for a long time, but what we're really talking about is vectorized operations and there are multiple ways to get there. Dataframes are just containers to make those operations convenient (perhaps too much of a simplification?)

tidal bough
#

Or do you use structured arrays?

tidal bough
#

No, I mean, when you have several columns of wildly different types.

left tartan
#

Yah, the machine learning libraries tend to tell us what we must use. scikit-learn wants numpy, so we end up with pandas+numpy data types.

tidal bough
#

anyway, a pandas dataframe is pretty much a bunch of equal-length numpy arrays (one per column) collected into a table-like structure

#

with nice methods to work on single columns, multiple columns, selecting rows, etc.

#

when working on a single column it's often easier to do it the numpy way, but few datasets have one column.

tidal bough
#

yeah, it's very connected to numpy

#

whereas e.g. polars, not so much (you can easily convert columns to numpy arrays but internally they're actually arrow I believe)

left tartan
#

Pandas is headed in that direction too, but in baby steps.

tidal bough
#

first thing that comes to my mind is that I have numba functions working on pandas dataframes, and I have some doubts numba works with arrow

pale hemlock
#

hmmm

verbal venture
#

is anyone able to tell me why my model has 0.00045% accuracy

verbal venture
#

ok, what's the difference between _, predicted = torch.max(output, 1) and _, predicted = torch.max(output.data, 1)

serene scaffold
#

I don't know what output is.

verbal venture
#

for data in train_loader: images, labels = data. output = model(images)

#

CNN model

#

I can send you the full model actually to see if you see anything wrong with it.

serene scaffold
#

what does print(type(output)) show?

verbal venture
#

side question I was also considering doing a masters in AI. how much did that prep you for your job

serene scaffold
#

I'm pursuing a masters currently, I got a job with just a bachelors in CS, but only because I had a publication under my belt.

#

in general, a masters is basically a requirement for entry level ML jobs.

verbal venture
#

how'd you get a publication

serene scaffold
#

one of my professors wanted to publish with me. and I thank god (I am an atheist) for this every day.

verbal venture
#

cool

#

so what's the diff between output and outputs.data. I see it used in different models

#

mainly outputs with NN, and .data with CNns

serene scaffold
#

I still need the answer to the most recent question that I asked you.

verbal venture
#

yeah just waiting for my model to be done

verbal venture
#

outputs.data is also torch.tensor

serene scaffold
#

I would just ignore it (and not use it)

verbal venture
#

ok so my model is still fucked

#

are you able to take a look?

#

but it shouldn't be fucked. it's like pretty dense

#
```py
class Model(nn.Module):
    def __init__(self, num_classes=num_classes):
        super(Model, self).__init__()

        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 56 * 56, 512),
            nn.ReLU(inplace=False),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.conv_layers(x)
        print("conv layers", x.shape)
        x = torch.flatten(x, 1)
        print("after flattening", x.shape)
        x = self.fc_layers(x)
        print("After FC layers", x.shape)

        return x


model = Model()

model.parameters

def training(model, train_loader, loss_fn, optimizer, num_epochs):
    model.train()
    model.to(device) # using GPU if available

    for epoch in range(1):
        epoch_train_loss = 0.0
        correct = 0

        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()

            outputs = model(images)

            train_loss = loss_fn(outputs, labels)

            train_loss.backward()
            optimizer.step()

            epoch_train_loss += train_loss.item()

            _, predicted = torch.max(outputs.data, 1)

            correct += (predicted == labels).sum().item()
        accuracy = 100/batch_size * correct / len(train_loader)

        print(f"Accuracy: {accuracy:.4f}")
```
serene scaffold
#

replicating all your work locally is more than I'm willing to commit to--sorry

#

though in general, please always use markdown blocks for pasting code

#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

verbal venture
#

don't replicate. just take a look and see why the accuracy might be .5%. just in general: epochs were 30, batch size 64, pretty dense network, nn.CrossEntropyLoss() etc. accuracy shouldn't be half a percent

serene scaffold
#

I don't know

#

though it looks like you only have one epoch

#

for epoch in range(1):

verbal venture
#

it should still hit like 30% from that tho

#

I can't trust chatgpt either. spews so much nonsense

serene scaffold
#

why should it reach 30% after one epoch?

verbal venture
#

cuz if it's a good model it can't start at .5% can it?

#

I've trained like 3 models, but basically all the good ones started relatively good

#

but what do I know

serene scaffold
#

do enough epochs to get a noticeable diminishing rate of return on the loss

#

and then if it's still performing poorly, you can reevaluate

verbal venture
#

its either a bug or I made something insanely shitty

#

cuz I think even like a single MLP could have gotten more of a prediction

#

should I do correct / len(train_loader.dataset) or / train_loader

serene scaffold
#

you should do more epochs

#

you even said "just in general: epochs were 30"

#

but then it was secretly actually 1.

verbal venture
#

I should do more than that?

#

ah 😂

#

are product level networks run on thousands of epochs?

serene scaffold
#

no

#

for epoch in range(1):
this means that regardless of how many epochs you said you did, you did one.

verbal venture
#

right

#

so if a dataset is like a million images

serene scaffold
#

did you even actually do 30?

verbal venture
#

the network would just be even more dense and the epochs would be low yeah?

#

no I did 1 for compute

#

but running 30 rn

#

kaggle notebook gonna explode

serene scaffold
#

what did you say this model does?

verbal venture
#

there's 196 classes of cars. a CNN to classify them

#

I gpt prompted the best network to do so, which seems to be a VGG imitation but from scratch and much lighter

#

my accuracy is steadily at 0.5%

#

does anyone know why this accuracy is so low? 😂
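For reference, a quick sanity check on that number: with 196 classes, pure guessing already lands at about 0.5%, so an accuracy stuck there suggests the model has not learned anything yet rather than that the metric is badly off. The sketch below also compares the two accuracy denominators. The dataset size and the count of correct predictions are made-up values for illustration:

```python
num_classes = 196
chance_accuracy = 100 / num_classes  # percent correct expected from random guessing
print(f"chance level: {chance_accuracy:.2f}%")


def accuracy_percent(correct: int, num_samples: int) -> float:
    """Accuracy as a percentage of all samples seen."""
    return 100 * correct / num_samples


# Hypothetical numbers: 8144 training images, batch size 64.
num_samples = 8144
batch_size = 64
num_batches = -(-num_samples // batch_size)  # ceiling division; last batch is partial
correct = 42

# Dividing by the number of samples (len(train_loader.dataset)) is exact.
exact = accuracy_percent(correct, num_samples)

# Dividing by batch_size * number of batches slightly undercounts
# whenever the last batch is smaller than batch_size.
approx = 100 / batch_size * correct / num_batches

print(f"exact: {exact:.4f}%  approx: {approx:.4f}%")
```

The two formulas differ only slightly, so the denominator alone would not explain an accuracy this low.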

proven sigil
#

I'm building a multi-class classifier that classifies conversation text into one of 8 sentiments.


Within each message, there is a conversation id, which is basically which conversation the message takes place in. Each message is either the start of a conversation or a reply to the previous message. There is also a sentiment, which represents the emotion that the person who sent the message is feeling. There are 8 sentiments: Angry, Curious to Dive Deeper, Disgusted, Fearful, Happy, Sad, and Surprised.

Sentiment Analysis: Build a multi-class sentiment analysis model based on this dataset. 

I'm using sentence_transformer to transform text into embeddings, then using OneVsRestClassifier using a simple estimator.

But it's taking too long to train on my machine.
How can I speed this up?

serene scaffold
#

@proven sigil how many cores does your machine have, and what value did you set for n_jobs, if any?

queen cradle
# glossy aspen I use numpy almost for everything. Do I really need something like pandas? Or is...

NumPy and Pandas have different use cases. Pandas is designed for columnar data, much like a SQL database. Many of its operations work with one column or a small number of columns, paying no attention to the others: Summary statistics like means, standard deviations, and counts; grouping; joins; and so on. Pandas uses one-dimensional arrays almost exclusively. Polars is so tightly focused on columnar data that, as far as I'm aware, it supports only one-dimensional arrays, and I think it even requires that items at adjacent indices are placed in adjacent memory locations. NumPy, on the other hand, is designed for scientific computing: Linear algebra, numerical integration, signal processing, solving PDEs, that sort of thing. Multidimensional arrays are fundamental to these applications. You can't even make sense of linear algebra without two-dimensional arrays! NumPy also needs to work with arrays where adjacent indices may refer to non-adjacent memory locations (i.e., the arrays may have arbitrary strides). For example, this allows NumPy to extract a column from a matrix without copying: It creates a new array which points to the same memory as the original matrix but has a stride that makes each successive array index skip over a whole row of the matrix. This sort of operation happens constantly in scientific computing applications, but it would be highly unusual for the columnar data that Pandas and Polars target.

You can get away with using NumPy alone, but it's not designed for data manipulation and is a bad tool for that. You can get away with Pandas or Polars alone if you want to manipulate your data (grouping, filtering, joining, and so on) but don't want to do prediction or statistical inference. However, something as simple as linear regression requires linear algebra and hence NumPy or equivalent. PyTorch and Tensorflow are more akin to NumPy than to Pandas because the use cases they're intended for are more like scientific computing applications.
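A short sketch of the stride behaviour described above, using a small made-up matrix. The stride values shown assume 8-byte integers, which is why the dtype is set explicitly:

```python
import numpy as np

# 3 rows x 4 columns of 8-byte integers, stored row by row.
matrix = np.arange(12, dtype=np.int64).reshape(3, 4)

# Taking a column creates a view: same memory, different stride.
column = matrix[:, 1]

print(matrix.strides)  # bytes to step per (row, column)
print(column.strides)  # one step along the column skips a whole row
print(np.shares_memory(matrix, column))

# Writing through the view changes the original matrix: no copy was made.
column[0] = 99
print(matrix[0, 1])
```

Each step along `column` jumps 32 bytes (one full row of four 8-byte values), which is the "skip over a whole row" behaviour.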

#

I wouldn't mind. But I don't have much time this week and next, so I have no idea when I'll get around to it.

thin geyser
#

Is it possible to use grad cams for gans? I'm trying to overlay a heatmap on an input image to localise the area which is causing a change in the output image.

errant spear
#

I am curious, what is the most appropriate algorithm for predicting stock prices?

hoary wigeon
#

Hi there!,

I'm trying to use shap[all] and I'm facing an error saying that

cuda extension was not built during install!
ImportError: cannot import name '_cext_gpu' from partially initialized module 'shap' (most likely due to a circular import) (/home/cosmix/.local/lib/python3.8/site-packages/shap/__init__.py)

Has anyone been through this error and know how to fix it?

errant spear
#

As in, there is no most appropriate one?

storm valve
#

as in there's no general agreement about the "most appropriate one"

errant spear
#

Intriguing.

#

Any idea about what you believe the most appropriate ones would be?

errant remnant
#

Is this good channel to ask machine learning questions?

Anyway, how do we retrain our model on the images it failed to recognise, without training on the whole dataset again?

grand spindle
#

give it more data until it gets them all right

young granite
#

@storm valve how was this one method called which gives different forecastings something with "M.." but i cant recall it

hasty mountain
#

Hey guys, about how to calculate ROC-AUC curve, can someone help me with thresholding for Binary Classification?
If I'm using a Neural Network for Binary Classification which uses a Log Sigmoid activation function to make the classification...how should I proceed with thresholding selection?

I was thinking about simply using my output argmax, since it's the most obvious way and it's how my model is optimized(I suppose it's more or less how the BCE Loss works, even with Logits...), but when checking the source code of the model I'm using(TrimNet for drug toxicity prediction), I've found that they used scikit-learn's precision_recall_curve, which applies thresholding automatically.

I've seen that the threshold is used to get the True Positive Rate and the False Positive Rate. However, I'm simply using True Positive Predictions/Predictions and False Positive Predictions/Predictions, where the predictions are provided by my model.

I don't know how I should proceed with threshold selection, especially since my model outputs are all in log scale. It appears to me that trying to use a threshold in this situation would be a bit arbitrary and prone to cherry picking...
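For what it's worth, a ROC curve does not need one chosen threshold: it sweeps every distinct score as a threshold and records TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each one. Because only the ordering of the scores matters, log-scale outputs give the same curve as the underlying probabilities. A minimal pure-Python sketch with made-up labels and scores (the function names are illustrative):

```python
import math


def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
log_probabilities = [math.log(p) for p in probabilities]

auc_prob = auc_trapezoid(roc_points(labels, probabilities))
auc_log = auc_trapezoid(roc_points(labels, log_probabilities))
print(auc_prob, auc_log)
```

An argmax (or a fixed cut-off) collapses the scores into hard labels, which leaves only a single point on this curve.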

hasty mountain
#

I only got time to check this properly now. The thing is, no matter how my model would make correct predictions, (pred == label) would always return a mask of False booleans.

The funny thing is...I had converted my pred to numpy arrays, but kept my label as Pytorch tensor, which caused them to be interpreted as different elements.

#

Hours studying, reviewing and burning neurons on the math of ROC-AUC, and the solution was just a matter of .cpu().numpy()

#

Thanks for the help. I hope I can now implement my ROC-AUC calculation properly.

high stump
#

I hope this is the right channel for this question. 🙏 I'm new to ML and I am looking for open-source algorithms that have been built to predict the mechanical and or chemical properties of materials, any materials. Is there a place where I can start looking? Thanks.

stone glacier
#

Question:

Can anyone suggest a better text corpus model than Word2Vec?
My recommender system used Word2Vec but it's not that good

#

I should be getting the LOTR titles and the 3 hobbit titles in top 5..but I got this

#

(I did try TF-IDF, but it was even worse...)

left tartan
# high stump I hope this is the right channel for this question. 🙏 I'm new to ML and I am l...

Huh, that’s a fascinating question I’ve never thought about. No idea but would love to know if there is one. I know pharma does a lot of ‘similar’ modeling for, well, pharma reasons. Example: https://medium.com/geekculture/drug-target-interaction-prediction-through-python-4af9e76fc90 Eager to hear if there’s anything here.

Medium

In this post, I present Python code snippets to predict drug-target interaction using SVD (Singular Value Decomposition) and Matrix…

hoary jay
#

just nervous because it's the first time, Just wish to know if the style of writing is all good and the information in the abstract is interesting enough to catch an eye, basically just need criticism

high stump
#

@left tartan , @tidal bough , and @glossy aspen . Thank you very much you lot for the leads.

hasty mountain
#

Will it fix it if I use abs()?

||I'm joking...I guess...||

#

I know that the most appropriate way would be to use integrals of TPR(x) and FPR(x) to calculate the area (instead of simply decomposing the ROC grid into triangles and squares). But I don't really know how would I define those functions...
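One observation that may help here: the empirical ROC curve is built from a finite set of points, so summing trapezoids over those points (sorted by increasing FPR) already gives its exact area, and no closed-form TPR(x) or FPR(x) is required. The same area can also be computed without any curve, as the fraction of (positive, negative) pairs in which the positive example gets the higher score. A sketch with made-up data (`auc_by_ranking` is an illustrative name):

```python
def auc_by_ranking(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_ranking(labels, scores)
print(auc)
```

A negative area from a trapezoid sum usually just means the points were traversed in decreasing FPR order, so sorting them is a cleaner fix than `abs()`.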

frail quarry
#

maybe not the best place, but can't think of which other room would be better suited. I'm attempting to clip a LAS point cloud file using polygons in a geopandas GeoDataFrame. it is currently taking quite a long time to do each one, specifically at this line in my code (it takes about 5 seconds to execute) within_polygon = np.array([polygon["geometry"].intersects(Point(point[0], point[1])) for point in coords])
any ideas on how to better do this?
full function:

def clip_las():
    start = datetime.datetime.now()
    
    # Iterate through the prepped_segments dataframe
    for index, polygon in prepped_segments.iterrows():
        print("Processing " + polygon["NAME"])
        poly_start = datetime.datetime.now()
        
        box_path = os.path.join(PROJECT_DIR, polygon[BOX_ID_FIELD] + BOX_SUFFIX)

        ## Read in the LAS file
        las = laspy.read(os.path.join(box_path, polygon[BOX_ID_FIELD] + ".las"))
        
        # Get coordinates of points
        coords = np.vstack((las.x, las.y, las.z)).transpose()
        
        # Get boolean array of points within the polygon
        within_polygon = np.array([polygon["geometry"].contains(Point(point[0], point[1])) for point in coords])
        print("Filtered points in", str(datetime.datetime.now() - poly_start) + " seconds")

        # Get the points within the polygon
        clipped_points = las.points[within_polygon]

        # Create a new laspy file
        new_las = laspy.LasData(las.header)
        
        # Add the clipped points to the new laspy file
        new_las.points = clipped_points
        new_las.write(os.path.join(box_path, "Final", "las", polygon["NAME"] + ".las"))
        
        print("Clipped " + polygon["NAME"] + " in " + str(datetime.datetime.now() - poly_start) + " seconds")
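The slow part is most likely building one `Point` object per LAS point in a Python loop. One possible direction is to test all points at once with array operations. Below is a hedged sketch of a vectorized even-odd (ray casting) test in plain NumPy. `points_in_polygon` is an illustrative helper, it only handles a single ring without holes, and points exactly on the boundary may land on either side:

```python
import numpy as np


def points_in_polygon(x, y, vertices):
    """Boolean mask of points inside a simple polygon (even-odd rule, no holes)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    vx, vy = vertices[:, 0], vertices[:, 1]
    inside = np.zeros(x.shape, dtype=bool)
    n = len(vx)
    for i in range(n):  # loop over edges (few), vectorized over points (many)
        x0, y0 = vx[i], vy[i]
        x1, y1 = vx[(i + 1) % n], vy[(i + 1) % n]
        if y0 == y1:
            continue  # horizontal edges never cross a horizontal ray
        # Does this edge straddle the horizontal line through each point?
        straddles = (y0 > y) != (y1 > y)
        # x position where the edge crosses that horizontal line
        x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
        # A crossing to the right of the point flips its inside/outside state.
        inside ^= straddles & (x < x_cross)
    return inside


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
xs = np.array([1.0, 3.0, -1.0, 1.0])
ys = np.array([1.0, 1.0, 1.0, 3.0])
mask = points_in_polygon(xs, ys, square)
print(mask)
```

In the function above this would replace the list comprehension with something like `points_in_polygon(las.x, las.y, np.asarray(polygon["geometry"].exterior.coords)[:, :2])`. Shapely 2.0 also ships vectorized predicates such as `shapely.contains_xy`, which would cover holes and multipolygons as well, if upgrading is an option.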
clever egret
#

Hello all. I just joined the server and am interested in hanging out with other devs related to this channel's topic.

#

Is this place really active?

mild dirge
#

It's more so for asking questions about DS and AI, not many people use it as a social hub atm

clever egret
#

I see. Thank you for your response.

sleek harbor
#

How does sklearn calculate mutual information and why are the results different each time (has random_state)? What's the randomness about? I thought the formula for mutual information was just like this.. I see no reason for something random to go on under the hood..

left tartan
arctic wedgeBOT
#

sklearn/feature_selection/_mutual_info.py line 391

```py
def mutual_info_classif(
```
left tartan
#

I don't know more than that, but that could explain why you see differences.

sleek harbor
#

hmm.. 3 neighbors.. I gotta read that link tomorrow. Brain shutting down :/

left tartan
#

yah, i dunno, I'd have to read through the paper to get this. _compute_mi_cc doesn't seem to bin, tho

verbal venture
#

someone give me a billion dollar idea using CNNs

sleek harbor
# verbal venture what

make a bot that goes through all the anime in the world, analyses it, then takes requests, and generates new anime

hasty mountain
#

Someone probably also did it with Stable Diffusion...

||but mine will be better||

serene scaffold
#

(Source: I just spent all day fighting BERT to classify some shit)

nimble shale
#

I was looking around and wondering if I could get any help to be pointed in the right direction. In general I was curious if it would be possible to determine a video game's internal resolution based off an image, since the output resolution can be different from its internal resolution. Right now the most straightforward way I can think of is to count the pixels on a diagonal line. Though I'm wondering if there's any good opportunity here to get more into a machine learning method or some more computer vision methods. Mostly just looking for resources that would be very applicable to this or really any guidance. I'm decently experienced with python but not so much with ML, computer vision, data science, etc

cold osprey
#

idk, seems like theres a definitive way to do it right

#

How would an ML approach be better? speed, accuracy etc?

sleek harbor
#

How do you encode categorical variables before doing feature selection with mutual information/chi2/etc and are mutual information scores calculated between categorical/categorical x continuous/continuous features comparable?
If the categorical feature is ordinal, then no problem.. u just encode it as such. But what if it's nominal? Do you OHE? That'd be a bit weird, cus u get a bunch of scores for each value of the category, instead of just a score per category.. However, that just might be a good thing.. maybe in Blue, Red, Green, Yellow, Orange - yellow and orange aren't as important as the rest and could be safely discarded. But then again, that would (in my understanding) strongly affect comparability to other MI scores. And since MI doesn't really care about the ordering, would it make more sense to just ordinally encode the feature to get its MI score, even if it's nominal, and after that reencode it with OHE (after deciding whether or not to keep the entire feature)?
Are discrete numeric features treated exactly the same way as ordinally encoded categorical ones? Are the scores comparable? What about compared to continuous features?

nimble shale
# cold osprey idk, seems like theres a definitive way to do it right

Yeah I was trying to see if there's a more direct way to do it. Edge detection seemed like it could help, but wouldn't be consistent. And with different forms of AA and upscaling, well that just makes it a more difficult problem. I tried to see if there were any definitive ways to do it but haven't found much in my search

pseudo moon
#

I am training a denoising autoencoder, but often the model outputs black images (values that are extremely close to 0). Sometimes, it can output good results but the next time I try training it again, it will give pitch black image. I am using 3 conv2d layers for the encoder and 3 conv2dtranspose layers for the decoder, with relu activation except sigmoid for the last layer. Strangely enough, when I switch to Dense layers instead of convolutional layers, the model will always result in some output although not as good as when conv layers are used. Does anyone know what might be the problem here that always give me black outputs?

hasty mountain
#

If you're using RGB images, your VAE probably works with a Normal Distribution, so you have to de-Normalize the output. If you don't, you may get images that don't really correspond to the training data.

#

Also...it seems that, depending on the dataset, some outputs are more prone to generate dark images... I have a VAE trained on CIFAR100 that has this likelihood. But when I use a custom dataset(and on a simpler VAE), this doesn't happen.

hasty mountain
# sleek harbor How do you encode categorical variables before doing feature selection with mutu...

I don't know about that, but looks like you may get interested in how ROC-AUC score (a metric that is essentially categorical) is applied to regression tasks

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

#

Though...it seems to be a bit annoying... and prone to cherry picking

sleek harbor
hasty mountain
#

Maybe not for feature selection, but it could inspire you somehow.
Maybe adopting a threshold for certain continuous variables to be considered "classes"... At least this is what seems to be done in ROC-AUC. Maybe it could also work for feature selection.

past meteor
#

What's stopping you from encoding your variable with 15 colors to Red, Blue, Green and Rest if the others are unimportant

#

If you're using K-1 dummies you can even just drop the final category

sleek harbor
sleek harbor
past meteor
#

With whatever feature importance algorithm that your model offers

#

Your background was in economics right? Then I'll just say three words you probably heard a million times: beware of multicollinearity

sleek harbor
# past meteor Your background was in economics right? Then I'll just say three words you proba...

that was next on my list.. how do I deal with that? The amount of approaches.. Like, do I use PCA? If yes, then do I replace all existing features with the principal components? If yes then I potentially lose some information and relationships the model might pick up, especially if I use a polynomial transform.. If instead of replacing I just add them, then that's basically adding to the problem by adding the solution to the problem. So my plan was to just weed out the less "important" features with mutual information, and then do some sequential feature selection with what's left to determine what works best (with cv ofc), thus kinda bypassing the need to deal with multicollinearity.. kinda

wooden sail
sleek harbor
latent shore
#

Hey @past meteor can i ask you about K-means clustering in dms?

past meteor
latent shore
#

It doesnt let me put pdfs and py files

#

But can you check my python help thread

past meteor
#

I'm not going to open PDFs and py files, the discord has a paste functionality

latent shore
#

It deletes the code i copy and paste too idk why

#

But i put screenshots on python-help

past meteor
#

Considering you're mentioning a polynomial transform I can assume you're using a linear model? Why not use Lasso / elastic net and solve multiple problems at once

sleek harbor
past meteor
#

Do you know L1 regularisation and specifically Lasso?

sleek harbor
sleek harbor
past meteor
#

And we're back to multicollinearity, you need to carefully define what it is you mean by unnecessary feature and say it in the mirror 25 times haha

sleek harbor
#

I haven't decided on the model yet, btw. I'm just messing with features so far

past meteor
#

For decision trees you can just add 2 features that are 100 % noise and remove all features that have a lower feature importance than those 2

sleek harbor
past meteor
sleek harbor
past meteor
#

Like in general you'll do the same thing right, you'll remove the variable. The nuance is in how you present the results etc etc

sleek harbor
past meteor
#

Oh you're going with stepwise 💀

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

sleek harbor
past meteor
#

Don't do it, this is such a well studied thing

#

It's not just me

sleek harbor
#

pfff.. time to read

latent shore
sleek harbor
# past meteor It's not just me

ok, I'll read later.. :3 pause that for a minute
Lets say I don't go with step wise, but use permutation importance (feature importance) and cross val to select the cutoff threshold.. that'd work right?

past meteor
#

Yeah, do it by adding a few features that are 100 % noise (say 2) because that automatically tells you where your cutoff could be
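
A minimal sketch of the noise-feature trick, in plain Python so it runs anywhere. Assumptions: the importance score here is just absolute Pearson correlation as a stand-in (in practice you'd plug in your model's feature importance or permutation importance), and all feature names and data are made up.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here only as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5

rng = random.Random(0)
n = 500
features = {
    "strong": [rng.gauss(0, 1) for _ in range(n)],
    "irrelevant": [rng.gauss(0, 1) for _ in range(n)],
}
# The target depends only on the "strong" feature.
target = [x + rng.gauss(0, 0.5) for x in features["strong"]]

# Add two columns of pure noise as a reference point.
features["noise_1"] = [rng.gauss(0, 1) for _ in range(n)]
features["noise_2"] = [rng.gauss(0, 1) for _ in range(n)]

scores = {name: abs_corr(col, target) for name, col in features.items()}

# Anything that does not beat the best noise column is a candidate for removal.
cutoff = max(scores["noise_1"], scores["noise_2"])
kept = [name for name, s in scores.items()
        if s > cutoff and not name.startswith("noise_")]
```

Note that a truly irrelevant feature can still beat the noise columns by luck, so it's worth repeating this over several seeds or CV folds before trusting the cutoff.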

latent shore
#

runfile('C:/Users/Ayla/Desktop/sonson.py', wdir='C:/Users/Ayla/Desktop')
C:\Users\Ayla\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
c:\users\ayla\desktop\sonson.py:71: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap(obj) instead.
colors = plt.cm.get_cmap('rainbow', num_labels)
Traceback (most recent call last):

File ~\anaconda3\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
exec(code, globals, locals)

File c:\users\ayla\desktop\sonson.py:107
plot_clusters(X_transformed, kmeans.centroids, kmeans.labels)

File c:\users\ayla\desktop\sonson.py:77 in plot_clusters
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='+', color='red', label='Centroids')

IndexError: index 0 is out of bounds for axis 1 with size 0

latent shore
lapis sequoia
sleek harbor
lapis sequoia
latent shore
#

These are codes and data

past meteor
sleek harbor
past meteor
#

That's a question for you not me

#

I would not ordinal encode colours because you'll get something like 1, 2, 3, 4, 5

#

And Red (1) - Green (2) being -1 makes no sense

#

I might ordinal encode high cardinality categorical variables if I'm working with a tree algorithm because they have a degree of invariance to this problem

lapis sequoia
# latent shore

thanks. looking at the code, I think PCA and KMeans classes seems fine. please make sure your "kmeans.centroids" is a 2D array so that you could perform indexing on 2nd axis.

past meteor
#

A degree, but it's definitely not perfect

#

That leaves you with one - hot vs target encoding. Target loses a lot of information but killing your model by expanding high cardinality categories is worse.

sleek harbor
# past meteor And Red (1) - Green (2) being -1 makes no sense

but for feature selection purposes, forget the model. Obv if I ordinally encode a feature for selection I'll reencode as one hot before passing it to the model, but before that. Before we get to the model, it makes a big difference: ord vs one hot
If I select a certain threshold, I'll get vastly different results afterwards

past meteor
#

TL;DR

  1. Low cardinality: One hot
  2. High cardinality and tree: ordinal encoder
  3. High cardinality and not a tree: target encoding
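
To make the three options concrete, here's a small sketch in plain Python on a made-up colour column and binary target (all names and values are illustrative, not from anyone's data):

```python
colors = ["red", "green", "blue", "green", "red", "red"]
target = [1, 0, 0, 1, 1, 0]

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# One-hot: one 0/1 column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Ordinal: one integer per category (the order here is arbitrary, alphabetical).
ordinal_map = {cat: i for i, cat in enumerate(categories)}
ordinal = [ordinal_map[c] for c in colors]

# Target encoding: replace each category with the mean target for that category.
# (In practice this is computed on training folds only, to avoid target leakage.)
sums, counts = {}, {}
for c, t in zip(colors, target):
    sums[c] = sums.get(c, 0) + t
    counts[c] = counts.get(c, 0) + 1
target_map = {c: sums[c] / counts[c] for c in categories}
target_encoded = [target_map[c] for c in colors]
```

One-hot grows by one column per level, while ordinal and target encoding stay at a single column, which is why they're the fallback for high cardinality.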

past meteor
#

If it treats it like something continuous then ordinal encoding would not make sense

sleek harbor
#

*actually implementations differ, but that's the easiest explanation

past meteor
#

How is the binning done, Is it done exactly at the cutoffs of your ordinal values or do some get grouped?

latent shore
#

Thank you

#

If you can help with editing the code let me know please @lapis sequoia

past meteor
#

Even if it does it at exactly the right granularity, the issue is that, imo, you still want to OHE it at the feature selection level so you can see what specific levels are relevant.

sleek harbor
past meteor
sleek harbor
past meteor
#

Are you using sci-kit learn's implementation? I'll just read the docs

lapis sequoia
sleek harbor
sleek harbor
lapis sequoia
#

What is your doubt exactly? I couldn't catchup with the convo

past meteor
#

Based on the docs it looks like if you make them ordinal and then specify that it's discrete you should be fine but then indeed you're computing the MI for the entire feature and not the levels

#

So if you want to drop a feature in its totality that works

#

If you want to know what specific levels have a high MI, then you should OHE

past meteor
past meteor
sleek harbor
# past meteor If you want to know what specific levels have a high MI, then you should OHE

if I OHE, will the resulting scores be comparable to scores of other features? As far as I understand - no, and that would kinda defeat the purpose, because I wouldn't be able to set a threshold for elimination. Unless, that is, I set a separate threshold for each group of features created by the one hot encoding.. but then I'd end up with many thresholds.

And the other unrelated question: are MI scores between categorical features comparable to MI scores between continuous features?

past meteor
lapis sequoia
past meteor
#

(I still believe you should just use proper regularization and then you can leave this be)

#

Plot your data vs. the target etc

lapis sequoia
# sleek harbor How do you encode categorical variables before doing feature selection with mutu...

I think @past meteor already answered most part of it. As, its more about experimentation and results could still vary irrespective of MI scores. maybe you can calculate MI individually, eliminate few categories after OHE. Also keep the original feature and concatenate that with selected OHE features, that way probably some of the information about eliminated categories will still exist for model to learn.

sleek harbor
# past meteor It will be comparable I believe

U sure OHE will be comparable to others? I mean, all the information that they contain is just on/off for one category of the initial feature.. that doesn't seem like a lot, tho it can be useful.. Say you have a high cardinality feature, with a bunch of categories, but OHE is basically dividing their importance by the cardinality, making their MI scores very low, meaning they might just all get eliminated, if you use a threshold comparable to the others.. Idk if I explained this well, if not I can try clarifying what I mean

sleek harbor
wooden sail
#

makes sense after normalization, considering that misclassification is a full mistake

#

since the classes are orthogonal to each other, mistakes there automatically yield a large distance

sleek harbor
wooden sail
#

when you use one hot

lapis sequoia
sleek harbor
latent shore
#

@lapis sequoia the code worked but how can i make it do iterations?

wooden sail
#

one hot is normalized by default. it makes an orthonormal basis for the labels

#

if you normalize the other scores, you can compare them to ones derived from one hot vectors

lapis sequoia
latent shore
sleek harbor
#

that kinda clicked now, thx

lapis sequoia
sleek harbor
#

@past meteor @lapis sequoia thx for ur time, that was really helpful. I think I kinda get it now, tho will likely wake up tomorrow and have to revisit the entire conversation again.. :p

latent shore
#

The output i get is the first iteration plot only

#

It doesnt give me 2nd 3rd and 4th

lapis sequoia
#

maybe the if condition in your Kmeans class (fit function) is already satisfied at 1st iteration (and it break the iteration loop), you may try removing it once?

latent shore
#

Remove the condition?

#

I will try

lapis sequoia
#

just keep self.iteration>200 for once

past meteor
#

Like nis says, no substitute for empirically validating if your model is better with or without the feature 🙂

past meteor
lapis sequoia
past meteor
#

Great to hear. I'm a kaggler myself but I do it strictly for fun 🙂

#

Tend to do the tabular playground series. Sometimes with people from my cohort over the weekend or so. Never won anything though 🤣

lapis sequoia
past meteor
#

Yup, Kaggle + Lurking on Reddit + university (in no particular order) were what taught me data science. It's a great community.

lapis sequoia
#

Hahah quite the same for me, I recently joined full time, but the community still interests me to be active in competitions and also contribute.

past meteor
#

It's funny because recently I had an idea to make something where you have an LLM that has the right answers to projects (could be data, could be software) where people submit their answers and get feedback from the LLM on how to improve based on 1 or more model solutions.

past meteor
#

Several technical challenges in making it but overall the idea is to get "senior dev" tier advice based on model solutions to help people improve their skills - for free. A blue sky idea, I know

lapis sequoia
#

haha I won't claim myself as an expert in LLMs, but yeah this sounds interesting. So, you are planning to finetune the LLM models or do some prompt engineering and build an app/software upon it?

past meteor
#

I'm in a research lab. I'll let the idea "simmer" with our LLM gurus first. Initially my idea was to just prompt engineer and build an app on top of it.

#

Scope isn't big enough to be done at my work I think so my guess is that I'll try it as a hobby project 🙂

lapis sequoia
unique flame
#

My labelimg programs closes when labelling. I labelled yesterday and had three classes. I then continued today and it just keeps crashing. When looking at the classes.txt it gets changed to the first class I draw. Anyone had a similar thing and know how to fix?

unique flame
#

Nvm found a workaround

potent sky
#

@past meteor if ydm, what did you end up doing for that synthetic tabular data generation?

past meteor
#

Haven't done it yet but my approach is clear - I'll make a graphical model and generate it from there. If I want to make it more noisy I'll add a VAE in there.

#

But it should be possible to make it pretty noisy at the PGM level already

potent sky
past meteor
#

Simple but it needs to feel realistic. We might use the data for internal / external training so making it yourself means you get to inject whatever issues you want to cover.

potent sky
#

ah right, fair enough

pine escarp
#

Best site to practice pandas.

sleek harbor
past meteor
#

I'd say, read part of the Pandas user guide on their website and then do some Kaggle @pine escarp

boreal gale
pine escarp
#

Thank you all for your suggestions, I'll definitely try them all. emoji_71

#

I have been looking for a site that offers exercises to do.

sleek harbor
lapis sequoia
sleek harbor
pine escarp
#

Thank you, I'll try them!

sleek harbor
pine escarp
#

I'm still a beginner, so I'm starting with pandas.

#

I'll learn numpy and matplotlib next.

#

And SQL simultaneously.

#

Pandas so far have been interesting.

pine escarp
lapis sequoia
#

Can't reveal the secret sauce, but getting addicted to kaggle was the key for me ;)) might be something else for you.

past meteor
sleek harbor
# pine escarp I'll learn numpy and matplotlib next.

well once you're done with the theory for those, if you want some beginner friendly exercises, I'd recommend the 5 ones at the end of this course: https://www.freecodecamp.org/learn/data-analysis-with-python/#data-analysis-with-python-projects
I don't remember much about them honestly, but I remember I enjoyed them. That said, they aren't exactly very good from the perspective of "high quality material", there are a few bugs in the exercises themselves (which you may or may not encounter, depending on your approach). But I think that's a good thing, cus you get to try figuring out exactly what is wrong. Long story short, try it if u feel like it. Don't have to go through the material if u already know the theory, can just do the exercises


past meteor
#

The strength of your regularisation would be a teeny tiny bit stronger on your OHE'd variables

#

I think either way it might not make a difference. I mostly don't do it I think

sleek harbor
#

but there's no harm if I feel like it, right?)

pine escarp
hasty mountain
# pseudo moon denormalize ?

Yes. I don't know if you're using the MSE Loss for your VAE, but, in reality, the Decoder doesn't generate any image, it just generates parameters for a distribution(for RGB images, usually Normal distribution, and for grayscale, usually Bernoulli).

So, in order to convert each value in the Decoder output from a Normal Distribution to a proper image, you have to denormalize it by adding the dataset mean(which would be the mean of the Normal Distribution) and multiply by the standard deviation.

with torch.no_grad():
    input_noise = torch.randn_like(z).unsqueeze(-1).unsqueeze(-1)

    saving_image = decoder(input_noise)
                
saving_image = saving_image.view(saving_image.size(0), saving_image.size(2), saving_image.size(3), saving_image.size(1))
saving_image = saving_image.cpu().numpy()
saving_image = (saving_image * STD) + MEAN
#

||Yes, I'm the kind of guy who reject the use of .permute() in favor of .view() to convert torch tensors to numpy arrays||
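
One caveat worth spelling out, sketched in plain Python so it runs without PyTorch: reinterpreting a buffer with a new shape (which is what `.view()` does) keeps the elements in the same memory order, while actually moving an axis (what `.permute()` does) reorders them. For a channels-first image the two generally give different results. The helper names below are made up for illustration.

```python
# A tiny "image" in channels-first layout: 2 channels, 3 pixels each.
chw = [[10, 11, 12],   # channel 0
       [20, 21, 22]]   # channel 1

flat = [v for row in chw for v in row]  # row-major buffer: 10 11 12 20 21 22

def reinterpret(flat, rows, cols):
    """Same buffer, new shape (analogous to a view/reshape)."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def swap_axes(matrix):
    """Actually move the channel axis last (analogous to a permute/transpose)."""
    return [list(col) for col in zip(*matrix)]

viewed = reinterpret(flat, 3, 2)    # [[10, 11], [12, 20], [21, 22]]
permuted = swap_axes(chw)           # [[10, 20], [11, 21], [12, 22]]
```

Each row of `permuted` is one pixel with its two channel values, which is what a channels-last image needs; `viewed` mixes values from different pixels.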

#

Using this approach, your output goes from something like this:

#

To something like this:

pseudo moon
#

Interesting, but I was talking about DAE/denoising autoencoder which I believe is different from VAE?

hasty mountain
#

It is? pithink

#

I thought Denoising AutoEncoder was a Variational AutoEncoder optimized in a way that the generative factor is severely decreased

#

Ok, sorry then. Your model probably has nothing to do with what I said. yert

pseudo moon
past meteor
#

Regular autoencoders can be used for denoising as well

pseudo moon
#

That is true

past meteor
#

The general idea would be to add noise to your input and try to reconstruct the original

pseudo moon
#

Do you think autoencoders should include batchnorm?

hasty mountain
#

The term "autoencoder" became very confusing to me after I learned about diffusion models...
Every paper on latent diffusion uses the term "autoencoder" to actually refer to "variational autoencoder", but they never use the "variational" term.

past meteor
#

If the bottleneck is small enough you could also just denoise naturally without adding noise.

#

Been a while since I used any autoencoder. I did a fun project some time ago where I used an AE to do an adversarial attack on resnet-50

#

Can also be used to self-supervised train nearly everything, cool stuff, cool stuff, ...

potent sky
# pine escarp Best site to practice pandas.

there are a few GitHub repos that basically go through a series of mini projects that involve you heavily using pandas
and they're designed in such a way that you get exposed to the different features
that could be helpful
but definitely kaggle comps + pandas user guide 💯

potent sky
grave summit
#

hy guys

#

I have a Pandas dataframe containing a time series, one column contains every day of the year 2022 with every hour, so we get 24 rows per day and the other column is a price

#

I would like to calculate the mean price of each month, I built a DIY solution with counters variables etc but it somehow fucks up at the end and includes more rows than wanted for each month from february on

#

i would like to know if there is a quick method to get the prices for the days of a given month only by selecting only these rows to calculate the mean

#

i was able to do it by comparing the date with operators to each beginning of a new month

tidal bough
#

and the month column you can calculate by... well, depends on what way your date is represented. either something from pd.Series.dt, or something from pd.Series.str.
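
The idea is just "derive a month key, then group on it". With pandas, and assuming a datetime column named `date` and a price column named `price` (both names are guesses), the usual one-liner would be `df.groupby(df["date"].dt.month)["price"].mean()`. Here is the same logic sketched with only the standard library, on made-up rows:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly rows: (timestamp string in "Y-m-d H" form, price).
rows = [
    ("2022-01-01 00", 10.0),
    ("2022-01-01 01", 20.0),
    ("2022-01-31 23", 30.0),
    ("2022-02-01 00", 40.0),
    ("2022-02-28 23", 60.0),
]

# Bucket every price under the month of its timestamp.
prices_by_month = defaultdict(list)
for stamp, price in rows:
    month = datetime.strptime(stamp, "%Y-%m-%d %H").month
    prices_by_month[month].append(price)

# Mean price per month.
monthly_mean = {m: sum(p) / len(p) for m, p in prices_by_month.items()}
```

Grouping on the parsed month avoids hand-counting rows per month, which is where the off-by-a-few-rows errors tend to creep in.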

grave summit
#

i did it by pd.to_datetime

#

with a format

#

Y-m-d H

tidal bough
grave summit
#

ok im going to work with this

#

thanks alot

#

I got this

#

@tidal bough

tidal bough
#

what am I supposed to see here? I don't see a month column.

hasty mountain
#

They have such a nice theory and idea, but it's so difficult to find anyone explaining it correctly. It's just "Make neural network, output smaller than input, then make another neural network, output equal desired image"

wooden sail
#

for a regular autoencoder, that's really mostly it lol

#

what else do you think is missing?

hasty mountain
#

It just feels a bit... too simple...and boring.

wooden sail
#

the easiest way to think of it is that you generate a pair of functions, and they are inverses of each other

#

that's... all, really

hasty mountain
wooden sail
#

you don't need the "encoded" part to be lower dimensional

#

it doesn't have to be images

#

the architecture of each of the networks doesn't matter, varies by application

#

the cost func also

#

"autoencoder" is just a very general framework. that's why it's "boring"

#

you can use it in conjunction with any concrete application and details

hasty mountain
#

I'm not used to simple things in neural networks...not since I discarded the use of keras

wooden sail
#

i work a lot with things that are technically autoencoders, and you wouldn't recognize them at all

#

stuff like task based machine learning also falls under autoencoders a lot of the time

#

same with self supervised learning

#

in some sense, it's even just an interpretation of a network

#

cuz it doesn't have to be 2 in the first place

hasty mountain
wooden sail
#

(it also doesn't have to be "networks")

hasty mountain
#

So...if I have a function that receives a 32x32x3 image...and outputs a single value (0 or 1)...can I call it an autoencoder?

wooden sail
#

no because it doesn't satisfy the main condition stargazer just referenced again

hasty mountain
#

Oh, yes...indeed...

wooden sail
#

if you took the image and did whatever you want with it, spitting whatever out

#

then you take that whatever and remake something very close to the original image

#

you have yourself an autoencoder

hasty mountain
#

Data augmentation = autoencoder?

wooden sail
#

you could generate 10 million petabytes of data as that whatever, and it's still an autoencoder

hasty mountain
lapis sequoia
#

any network performing compression and reconstruction is basically AE ig

wooden sail
#

it doesn't even have to be compression

potent sky
#

Any function

#

And then inversing

wooden sail
#

that's one of the most common cases because you use them in parameter estimation, but it does not have to be the case

#

literally just a forward-inverse pair

#

the key idea that makes it so powerful is that, if you put the pieces together in a special way, you can get the forward function to give you something useful, and the training is done based on the inverse function's output

#

essentially making the pair of functions "train themselves"

#

you only need input data, not even labeled pairs

#

but that's on the application side and varies greatly depending on what you want to do

#

the main idea is: forward-inverse pair

hasty mountain
#

I think I'm an autoencoder fan and didn't even know that...because I enjoy using self-learning and unsupervised learning models...

wooden sail
#

not always. but often times

#

in the very classical sense, like clustering, that's not an autoenc

hasty mountain
#

Maybe I should review the maths on a loss function I've been using to extract the minimum entropy... It uses the argmax of a softmax function as label pithink

wooden sail
#

let your mind be flexible

#

a lot of problems can be solved by noticing two things are actually the same thing if you close one eye and tilt your head

#

that's a big part of research and problem solving

#

so yeah, review your maths and get your intuition rolling

past meteor
#

There's kernel PCA that has the best of both worlds. I like kernel methods in general but you can barely apply them in reality

silent spire
#

Can anyone explain siamese neural networks like i'm a 5 year old

past meteor
# silent spire Can anyone explain siamese neural networks like i'm a 5 year old

Take a neural network. You have 3 inputs, 2 are of the same class and 1 isn't.

Give it input 1 and predict vector V1 (positive class)
Give it input 2 and predict vector V2 (positive class)
Give it input 3 and predict vector V3 (negative class)

You compute the Euclidean distance between all points. You reward the model for having a low distance between V1 and V2 but also being far from V3. (triplet loss)

Is that clear enough?
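
As a rough sketch of that loss (the common formulation with V1 as the anchor; the margin value and the example vectors are arbitrary choices for illustration):

```python
from math import dist

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(dist(anchor, positive) - dist(anchor, negative) + margin, 0.0)

v1 = [0.0, 0.0]   # embedding of input 1 (positive class)
v2 = [0.0, 1.0]   # embedding of input 2 (positive class)
v3 = [3.0, 4.0]   # embedding of input 3 (negative class)

good = triplet_loss(v1, v2, v3)   # negative is already far away
bad = triplet_loss(v1, v3, v2)    # roles swapped: "positive" is far, "negative" is close
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, and grows when same-class embeddings drift apart or the other class gets too close.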

past meteor
#

Important thing to remember next to all the math is that you want to get that sweet sweet property back that you had in PCA where each component (~ neuron in the information bottleneck) carries different information (is orthogonal)

#

Bit of a handwavy explanation but Edd can fill in the blanks haha

hasty mountain
#

Each neuron -> a different component? pithink

past meteor
#

From my slides of uni. That information bottleneck exists in both PCA and autoencoders

#

Hence why we covered them in the same lecture

#

Does this make sense or is it confusing?

wooden sail
#

you can set constraints either on G or on z to get (near) orthogonality in the representing basis, like in pca

#

you don't necessarily need that part though. really depends on the problem at hand

#

i can give you an example of one we deal with at work

past meteor
#

A position I applied for was using beta-VAE's specifically in physics

#

Because for them it was important and interesting that they were (near) orthogonal

wooden sail
#

we often want to solve a so-called "inverse problem", where we measure data, we have some idea of the physical process, and we want to extract relevant parameters. so what we can do is take the parameters and let them be x in this image. use a physically-motivated forward model, some approximate solution of a differential equation, and let it be G. this part may or may not have trainable parameters. this spits out a z, which is technically "encoded" if you choose to interpret this as an encoder, but usually z is much higher dimensional than x in this application. then let a deep network be F, and have it estimate x. once all is said and done, F can be used in stand-alone fashion on real measurement data, provided that our G was good enough, and it inverts the problem. e.g. like locating objects using x-rays

#

the properties that G and z satisfy depend entirely on the application in general though, so yeah. that's where your domain expertise comes in

past meteor
#

That's really interesting

#

Gonna shill my own stuff: like I mentioned, the last time I used an autoencoder it was to make image-dependent noise that works on out-of-sample examples and turns every example into an airplane

#

So essentially, it's a cheap and easy way to "attack" any model, provided you have access to its gradients which is totally unrealistic except in toy problems haha

wooden sail
#

ooh nice

iron basalt
#

The encoding can be whatever.

iron basalt
barren fable
#

i'm confused a bit in these pics, as i can understand the first column in has heart disease (True Positive) represents the actual data of training data am i right or wrong? and if im wrong what r 4 corners refer to?

serene scaffold
#

whereas if the algorithm says they have it, and they actually do, that's a true positive. and you can probably figure out what a true negative is.

barren fable
serene scaffold
#

k-fold cross validation is where you divide the data into k partitions, and then you do the algorithm k times, and each partition "takes a turn" being the test data.
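
A minimal sketch of the splitting part, assuming the data is just addressed by index (the function name is made up, and there is no shuffling here, which you'd normally add first):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
```

Every index lands in the test set exactly once across the k rounds, and in the training set the other k-1 times.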

#

though it sounds like you might understand that much well enough

#

the actual column of "has heart disease" & "does not have heart disease" it represents the real data which is training data is that right?
all the data--both the training data and the test data--is "real".

serene scaffold
#

for each data point, there's what it actually is, and what the model predicts that it is.
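
In code terms, the four cells are just counts over (actual, predicted) pairs. A small sketch with made-up labels, where 1 means "has heart disease":

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = has the disease)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # what each person really is
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # what the model said about the same person
counts = confusion_counts(actual, predicted)
```

Both lists describe the same eight people; the matrix only tallies where the model agreed or disagreed with the truth.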

#

here it is where it says which is true positive (TP), false positive (FP), etc.

barren fable
barren fable
serene scaffold
barren fable
#

forget about this one, let's look at other example

serene scaffold
#

the algorithm does not produce new data.

barren fable
#

so here, we got actual 142 ppl had heart disease, and our prediction misclassified 29 ppl that's right?

serene scaffold
barren fable
barren fable
# serene scaffold right

ok so i got here the last 2 question
the first one is, the 142 is that the all real data? like we took the all data which is 142 ppl had heart disease and we tested it or it's a sample like 75% and we tested on 25%?

serene scaffold
#

@barren fable make sense?

barren fable
barren fable
# serene scaffold <@1120843210604949536> make sense?

look at these 2
the one on the right said the columns r actual data and the rows r the predicted data
the one on the left said the columns r the predicted (which means as he explained the values coming out of the model) and rows r the expected (which means as he said the actual data the model is supposed to predict)
so im confused which one is right? or he transposed the row and columns?

serene scaffold
#

there's not actual data and predicted data. there's just the data. this is very important.

barren fable
serene scaffold
#

there's just the data. and there's what the data actually is (actual), and what the model says the data is (predicted)

#

no new data is created.

#

there's no standard for whether rows should represent actual, or if columns should represent actual.

#

but the point is what it represents.

barren fable
#

damn that's the answer i was looking for

#

i got it rn

serene scaffold
#

great

barren fable
# serene scaffold great

so r we using cross validation in this process or no? like dividing the data for training and testing?

serene scaffold
#

cross validation is where you do the whole process multiple times, but you divide the data into train and test differently each time. and you see how much of an impact that makes on the performance

#

if there's a big difference between the best time and the worst time, then something is probably wrong

#

(for other people reading this, I'm trying to explain this as simply as possible, using terms that the asker has already used)

barren fable
serene scaffold
barren fable
carmine nest
#

CP1

royal crest
#

What's CP1?

young pewter
#

What are some potential problems with linear regression?

#

and what exactly does linear regression do?

#

i googled the comparison between linear and logistic regressions and linear solves regression problems (not too sure what that means) while logistic regression solves classification problems, which i get more but an explanation would still help a lot

slender kestrel
slender kestrel
young pewter
slender kestrel
#

i cant send images in here so the sigmoid function is sort of hard to explain ;-;

young pewter
#

u can dm?

slender kestrel
#

sure

#

hello i need some advice regarding the machine learning career that i am pursuing so please let me know if anyone can help

wooden sail
#

you can fit a polynomial with linear regression

slender kestrel
#

please correct me if i am wrong

wooden sail
#

splitting them into categories like that does you a disservice

#

in both cases, you set up a matrix problem of the form y = Ax + b, and you solve it as A^-1 (y-b)

simple tapir
#

Hi guys, why is the graphics of cost function in gradient descent algorithm shown parabolically? There's no x^2 or something in linear regression

wooden sail
#

the cost function usually does have x^2s in it though

simple tapir
#

Isn't the cost function (theta.X)/m though?

wooden sail
#

.latex you usually use something of the form
\[
J(\bm{x}) = \Vert f(\bm{x}) - y \Vert_2^2
\]

strange elbowBOT
simple tapir
#

in a linear regression model

#

oh

#

MSE

wooden sail
#

oof i forgot to make the y bold. but anyway, MSE has the squares in the name 😛

#

also when you set up y = Ax, it often does not actually have a solution

past meteor
#

The intuition is that the word linear means that one unit of increase in your variable means a certain fixed increase in your target, given by your coefficient

wooden sail
#

you instead minimize the error using some metric, and MSE is a common metric

#

not all metrics involve squares, but most of them are nonlinear and so you get curves
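A quick sketch of why the cost curve looks like a parabola even though the model is linear. It assumes a one-parameter model y ≈ theta * x and toy numbers invented for illustration; the cost is plotted against theta, not against x.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

def mse_cost(theta):
    """Mean squared error of the one-parameter model y ≈ theta * x."""
    return float(np.mean((theta * x - y) ** 2))

thetas = np.linspace(0.0, 4.0, 9)          # evenly spaced, step 0.5
costs = np.array([mse_cost(t) for t in thetas])

# For a quadratic sampled on an even grid, second differences are constant.
second_diff = np.diff(costs, n=2)
print(second_diff)
```

Expanding the square gives J(theta) = mean(x^2) * theta^2 - 2 * mean(x*y) * theta + mean(y^2), which is where the theta^2 term comes from.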

slender kestrel
simple tapir
#

^

wooden sail
#

a linear function is a function such that f(u + cv) = f(u) + cf(v)

past meteor
#

And that's a big assumption. Some things are great in the beginning but they start sucking. For example temperature vs happiness. You keep getting happier the warmer it gets but when it's 50 ° c you get sadder

#

That's a typical non linear relationship

slender kestrel
simple tapir
#

So, we don't mean Ax + b with "linear function" then?

wooden sail
#

Ax + b is an affine transformation, that's actually not even linear

#

unless you use some tricks
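A small numerical check of the definition above. The matrices are random toy values, and the augmentation at the end is one common trick (an assumption here, since the chat does not say which tricks were meant): append a constant 1 to x so that b becomes an extra column of the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
b = np.array([1.0, -2.0])
u, v, c = rng.normal(size=2), rng.normal(size=2), 3.0

def linear(x):
    return A @ x

def affine(x):
    return A @ x + b

# Test f(u + c*v) == f(u) + c*f(v) for both maps.
linear_ok = np.allclose(linear(u + c * v), linear(u) + c * linear(v))
affine_ok = np.allclose(affine(u + c * v), affine(u) + c * affine(v))

# Augmented form: [A | b] applied to [x; 1] reproduces A @ x + b.
A_aug = np.column_stack([A, b])

def augmented(x):
    return A_aug @ np.append(x, 1.0)

trick_ok = np.allclose(augmented(u), affine(u))
print(linear_ok, affine_ok, trick_ok)
```

The affine map fails the test because the right-hand side picks up an extra c*b.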

simple tapir
#

Otherwise, it wouldn't really work like how gradient descent works

past meteor
#

You can solve the problem by taking other models or new features, maybe you make a variable called temperature below 30 and temperature above 30

wooden sail
#

gradient descent has NOTHING to do with linear functions

#

at all

#

gradient descent itself does a linearization in a neighborhood of a point, the function you minimize does not have to be linear. only differentiable
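To illustrate that point, here is a sketch of gradient descent on a function that is neither linear nor quadratic, only differentiable. The function, starting point, and learning rate are arbitrary choices for the example.

```python
def f(x):
    return (x - 1.0) ** 4 + x ** 2

def grad_f(x):
    return 4.0 * (x - 1.0) ** 3 + 2.0 * x

x = 3.0
learning_rate = 0.05
for _ in range(500):
    # Each step only uses the local slope (a linearization around x).
    x -= learning_rate * grad_f(x)

print(x, f(x), grad_f(x))
```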

slender kestrel
simple tapir
#

So that parabolic graphic depends on the problem. In some cases, Ax + b is used which doesn't make it a parabolic and in some cases where Ax+b doesn't work, MSE is preferred

#

right

slender kestrel
simple tapir
wooden sail
simple tapir
#

it'd be probably x = 0

simple tapir
wooden sail
#

say we have a polynomial of degree 2. you can easily generalize what i'm about to do. we let y = ax^2 + bx + c. to do regression, we need pairs of observations (x_n, y_n). then we can write N equations of the form y_n = ax_n^2 + bx_n + c

#

we can write that in matrix form as follows

wooden sail
#

.latex
\[
\begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1^2 && x_1 && 1 \\
x_2^2 && x_2 && 1 \\
x_3^2 && x_3 && 1 \\
\vdots
\end{bmatrix} \cdot \bm{x}
\]

strange elbowBOT
wooden sail
#

sigh one second

grave summit
#

hello guys, I am trying to detect seasonalities in a financial time series that has a price for each hour of the day for an entire year. I would like to do this by using FFT, and for that I need a window function for tapering the time series. any of you got some advice on how to choose the window function ?

#

I'm trying to highlight two types of seasonalities, macro (between months and weeks) and micro (between days and hours)

strange elbowBOT
wooden sail
#

ooof man, i hate this bot

slender kestrel
slender kestrel
wooden sail
#

.latex
\[
\begin{bmatrix}
y_1
y_2
y_3
\vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1^2 && x_1 && 1 \\
x_2^2 && x_2 && 1 \\
x_3^2 && x_3 && 1 \\
&& \vdots &&
\end{bmatrix} \cdot \bm{x}
\]

strange elbowBOT
wooden sail
#

ok there we go

#

i guess those should've been horizontal dots in the y vector, but nevermind

grave summit
#

@slender kestrel In the end i aim at getting a column vector containing one number for each hourly price by which i will multiply to apply my seasonality

#

fourier might be a good way of doing so i thought

wooden sail
#

this effectively turns the polynomial fitting problem into one of the form y = Ax with a Vandermonde matrix A, and we find x via linear regression
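A minimal sketch of that setup in NumPy. The matrix with rows [x_n^2, x_n, 1] is what `np.vander` builds; the observations and the coefficients a, b, c are toy values chosen for the example, and least squares stands in for "linear regression".

```python
import numpy as np

x_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
a, b, c = 0.5, -1.0, 2.0
y_obs = a * x_obs**2 + b * x_obs + c       # noise-free toy observations

A = np.vander(x_obs, N=3)                  # columns: x^2, x, 1
coeffs, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
print(coeffs)                              # estimates of a, b, c
```

The model is nonlinear in x but linear in the unknowns (a, b, c), which is why a linear solver is enough. With noisy data, A is tall and y = Ax generally has no exact solution, so `lstsq` returns the coefficients that minimize the squared error.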

#

also, seasonality is indeed found via FFTs

slender kestrel
# strange elbow

ooh i see how you are trying to explain linear regression with polynomial regression you trynna say that polynomial regression is an extension of linear if we assume x1^2 as a dimension right

wooden sail
#

that decomposes your data into sinusoids of predefined frequencies, letting you find what the periodicity of the data is

grave summit
#

yeah that's why i want to use fft

wooden sail
slender kestrel
slender kestrel
wooden sail
#

you would usually still do an FFT of the autocorrelation

grave summit
#

i can do both since i want to learn about seasonalities

wooden sail
#

as for the window functions, using no window function is equivalent to using a rectangular function. this convolves the spectrum with a sinc, which has good and bad properties

grave summit
#

blackman might be good i heard

wooden sail
#

it has the highest resolution in the sense that the peaks are the narrowest, but in exchange you get side lobes that might make it seem like there are other frequency components

#

blackman and blackman harris are alternatives. those try to remove the side lobes but make the main lobe wider
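A rough sketch of the side-lobe trade-off using NumPy's built-in windows. The test tone sits halfway between two FFT bins (a deliberately bad case for leakage), and "far leakage" is an ad hoc measure made up for this example: the largest magnitude well away from the tone, relative to the peak.

```python
import numpy as np

N = 256
n = np.arange(N)
signal = np.sin(2 * np.pi * 20.5 * n / N)   # 20.5 cycles: between bins 20 and 21

def far_leakage(window):
    """Largest spectral magnitude from bin 60 upward, relative to the peak."""
    spectrum = np.abs(np.fft.rfft(signal * window))
    return spectrum[60:].max() / spectrum.max()

leak_rect = far_leakage(np.ones(N))         # no window = rectangular window
leak_blackman = far_leakage(np.blackman(N))
print(leak_rect, leak_blackman)
```

The Blackman window leaks far less energy into distant bins, at the cost of a wider main lobe around the tone itself.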

grave summit
#

I mean the size of my window function will affect the type of seasonalities i get from my analysis right?

slender kestrel
wooden sail
#

it depends which things your data is sensitive to: false positives in the trends, or high resolution (closely spaced frequency components)

grave summit
#

how can i test for this ? @wooden sail

hexed ibex
#

@pine escarp '

slender kestrel
wooden sail
grave summit
#

ok i do a few fft with different windows an check the results

wooden sail
grave summit
#

which major did you do ?

slender kestrel
wooden sail
#

i did telecomm in bsc, comms and sig proc in msc, and doing more sig proc in phd

hexed ibex
#

@pine escarp are you there

slender kestrel
wooden sail
#

i'd rather not 😛

slender kestrel
#

i too wanted to learn more about time series data and seasonality coz the articles available on medium don't teach you much

#

so wanted to ask you more about that stuff

wooden sail
#

i think maybe zestar is a better person to ask. i probably know the stuff with different names, but my approaches are "unorthodox" compared to what you usually see in data science

grave summit
#

i also have one last question Edd

wooden sail
#

yeah

grave summit
#

in the end as i said i would like to get a different coefficient for each hourly price of my series by which i multiply it to take into account seasonalities

slender kestrel
grave summit
#

will i be able to get this out of my fft analysis ?

#

as I'm not only aiming at a graphical spectrum analysis

past meteor
#

What's the question?

slender kestrel
# past meteor What's the question?

i too wanted to learn more about time series data and seasonality coz the articles available on medium don't teach you much
so wanted to ask you more about that stuff

wooden sail
# grave summit will i be able to get this out of my fft analysis ?

hmm not with a single fft. a single fft will assign one coefficient to each frequency/period. so say something repeats every 30 minutes, you will get one coefficient for the whole data, telling you how strong the 30 minute repetition component is. if you instead want to see how this coefficient changes every x hours, you would instead use a "spectrogram". this splits the data into sub windows, and then FFTs each of them
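A bare-bones sketch of the spectrogram idea in NumPy. The function name `simple_spectrogram`, the Hann taper, and the toy series (a daily cycle whose amplitude doubles after 30 days) are all assumptions made for the example, not part of the original discussion.

```python
import numpy as np

def simple_spectrogram(x, window_len, hop):
    """Split x into windows, taper each one, and FFT it."""
    taper = np.hanning(window_len)
    starts = range(0, len(x) - window_len + 1, hop)
    frames = np.array([x[s:s + window_len] * taper for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_windows, n_freqs)

# Toy hourly series: 60 days, daily cycle, amplitude doubles after day 30.
hours = np.arange(24 * 60)
amplitude = np.where(hours < 24 * 30, 1.0, 2.0)
series = amplitude * np.sin(2 * np.pi * hours / 24)

# One-week windows with no overlap.
spec = simple_spectrogram(series, window_len=24 * 7, hop=24 * 7)
daily_bin = 7                                     # 7 daily cycles per 168-hour window
print(spec.shape, spec[:, daily_bin])
```

Each row of `spec` is the spectrum of one week, so reading down the `daily_bin` column shows how the strength of the daily component changes over time.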

past meteor
#

Forecasting is more business related time series analysis but I think this book is a great place to start, afterwards you can go for a more advanced text (this one is also rigorous, but quite practical)

slender kestrel
wooden sail
#

the best advice is going to uni if you wanna learn it well 😛

pine escarp
grave summit
#

yeah agree, STEM degrees are best learned at uni unless you are a genius urself

past meteor
#

Yeah, university is key

hexed ibex
#

from starting only

pine escarp
slender kestrel
wooden sail
grave summit
#

thanks edd gonna try some stuff and be back

wooden sail
#

read the papers where they discuss implementation details

slender kestrel
pine escarp
slender kestrel
wooden sail
#

there are internships at all levels

pine escarp
slender kestrel
wooden sail
#

read extra books and papers, not just what they ask of you in uni. if you find any tasks/projects/topics you're interested in, go ahead and play around with them

slender kestrel
wooden sail
#

gilbert strang's linalg and axler's linalg done right

#

boyd's convex optimization

past meteor
wooden sail
#

louis scharf's statistical signal processing too

pine escarp
hexed ibex
pine escarp
slender kestrel
past meteor
#

I did internships, I was active in my city's data science community as a student, ...

#

Kaggle was a big help too

slender kestrel
# past meteor Kaggle was a big help too

yup i make all my models on kaggle itself so far i have worked on a toxicity detector a flower classification project lip net did some basic eda projects and worked on analyzing data for my professors also completed various courses about the theory part

#

so now i am looking for internships but i want to do research internships in universities but i always have this feeling that i dont know much so thats why i was asking you guys if its ok to apply or not ?

past meteor
#

Apply

lapis sequoia
slender kestrel
slender kestrel
hexed ibex
#

@pine escarp join voice chat

past meteor
pine escarp
slender kestrel
# past meteor Your own university?

ooh see the problem with my university is there aren't many opportunities there, that's why i am looking for some foreign universities to apply to

hexed ibex
#

i cant share the screen

slender kestrel
past meteor
#

To do an internship? Do you have internship credits in your program?

pine escarp
slender kestrel
pine escarp
past meteor
#

How are you going to do an internship abroad when you still have class in your home university?

slender kestrel
#

down side to this is

#

i gotta study in the 8th semester

#

instead of doing internship at that time

#

so i am more inclined towards finding a remote internship

past meteor
#

Idk how easy to find remote internships at foreign universities are. The remote part also defeats the purpose of an internship imo.

slender kestrel
past meteor
#

Are you in Europe, if so can you do an Erasmus?

slender kestrel
past meteor
#

Totally fine as well, many indian students here doing their master's

slender kestrel
#

i guess i should wait till graduation ;-; then only i can be free smh

past meteor
#

You'll be fine, don't worry

pine escarp
#

I'm from India as well.

slender kestrel
past meteor
#

If you already know you want to be in data science / ML you can already start the practical side of things.

slender kestrel
slender kestrel
pine escarp
slender kestrel
lapis sequoia
# slender kestrel naah bro india .-.

Mitacs global link - Canada, DAAD - Eu uni, DSSG for UK uni, NTHU for taiwanese uni... there are tons of research intern / summer intern programs. you can apply for them

pine escarp
lapis sequoia
#

and not to mention you can do research intern at IITs with simple cold mailing any professor

#

given you got skills ofc

slender kestrel
silent spire
#

Hey guys, I'm new here

shadow viper
#

I hope you improve and impact here also

barren fable
#

@serene scaffold yo hru?

serene scaffold
sleek harbor
#

am I correct in assuming:

  1. you should not standardize principal components after PCA

  2. when using RandomState for reproducing results and comparing CV scores for different models you should initialize a new RandomState instance for each estimator (not declare a rng variable at the top of the program, and pass it down to any object that accepts a random_state parameter) in order to prevent them from influencing each other by consuming the RNG?

pls confirm or correct me if u don't mind 🤗

wooden sail
#

what are you calling "principal component" here? the vectors or the coefficients?

#

ah i had misread that as normalize. standardize as in making them have mean 0 and var 1 would indeed ruin them. PCA gives an orthonormal basis, so normalization is already taken care of there. due to orthogonality you can also straightforwardly conclude that all the coefficients of the vectors are between -1 and 1 if the input vectors to be PCA'd have magnitude 1. the distribution of the coefficients per input vector is arbitrary though and you can't change it, otherwise they don't synthesize the data back.

shadow viper
sleek harbor
past meteor
wooden sail
wooden sail
#

PCA decomposes into orthogonal vectors and their coefficients, sklearn should give you both

past meteor
#

But that's mostly because I'm unsure if you'll have it fully reproducible if you set the seed on the top

#

Personally I'm mostly worried about the reproducibility of my data splitting and not more than that (specifically each estimator)
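A sketch of the NumPy mechanics behind the question, nothing more: a shared `RandomState` is consumed by whoever draws first, while fresh instances built from the same seed each start from the same state. Whether this matters for comparing CV scores is a judgment call that the chat leaves open.

```python
import numpy as np

# One shared generator: the second consumer's numbers depend on how many
# draws the first consumer already made.
shared = np.random.RandomState(42)
first_draw = shared.rand(3)
second_draw = shared.rand(3)

# Fresh generators with the same seed: every consumer sees the same stream.
fresh_a = np.random.RandomState(42).rand(3)
fresh_b = np.random.RandomState(42).rand(3)

print(np.array_equal(first_draw, fresh_a))    # same starting state
print(np.array_equal(second_draw, fresh_b))   # shared state has moved on
```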

shadow viper
shadow viper
sleek harbor
lapis sequoia
past meteor
#

Just code up PCA with numpy

#

It's ~5-10 lines of code. Do it once and the algorithm will make sense forever

wooden sail
#

closer to 3 😛

#

anyway, the principal vectors are orthonormal, and you want them that way

#

they have unit norm already

sleek harbor
#

sounds like "rewrite tensorflow in 100 lines of C++" to me..

past meteor
#

No, PCA is very simple

sleek harbor
#

I'm not that good at numpy

lapis sequoia
past meteor
#
1) calculate covariance matrix 2) calculate eigenvalues, eigenvectors 3) sort by eigenvalues 4) do matrix multiplication
#

Each of these things have a numpy "verb" so it's just chaining stuff that exists off the shelf together 🙂

sleek harbor
#

maybe when I buy some more IQ

past meteor
#

No, you're 100 % smart enough to do this @sleek harbor don't underestimate yourself

#

And once you do it, you'll have something like "wow, was that all that there was to it?"

sleek harbor
#

hmm.. sounds fishy 🐟

wooden sail
#
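import numpy as np

data = ...  # size n x m; let n be the data length, m the number of samples (one per column)
n, m = data.shape
# keepdims=True keeps the mean as an (n, 1) column so it broadcasts across the m samples
centered_data = data - np.mean(data, axis=1, keepdims=True)
cov = centered_data @ centered_data.T / m  # size n x n covariance matrix
principal_components, _, _ = np.linalg.svd(cov)  # columns are sorted by singular value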
#

that's a PCA for you

shadow viper
wooden sail
#

using an SVD has the advantage of canonically being ordered by the size of the singular values, so it saves you the sorting. it's also equivalent to the EVD for symmetric positive semidefinite matrices, which all covariance matrices are

lapis sequoia
#

Reminded me of funny incident, When I was doing titanic and other tutorial problems to learn, I extracted features from the indexing column provided(random string values), and I got little boost in cv scores, I got so happy. It was probably a boost from randomness or param tweaks haha

sleek harbor
# wooden sail ```py data = ... # size n x m; let n be the data length, m the number of samples...

that last line feels like cheating. Also I don't quite know the math of SVD. I understand PCA via the visualization in this vid :p https://youtu.be/FgakZw6K1QQ

wooden sail
#

but idk what you find to be "cheating" about it, it's identical to the EVD

#

no one ever computes eigenvalues and eigenvectors by hand for anything larger than a 3x3 matrix, if that's what bothers you

sleek harbor
wooden sail
#

i told you it was like 4 lines

#

so did zestar 😛

#

if you use the EVD instead, you need 1 more line to sort by eigenvalue
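A sketch of the EVD route next to the SVD route, on random toy data with samples stored as columns (an assumption made for the example). The last two lines also check the claim that keeping every component reconstructs the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 200                                  # n features, m samples (columns)
scales = np.array([[3.0], [2.0], [1.0], [0.5]])
data = rng.normal(size=(n, m)) * scales

centered = data - data.mean(axis=1, keepdims=True)
cov = centered @ centered.T / m                # (n, n) covariance matrix

# EVD route: eigh returns ascending eigenvalues, so one extra line to sort.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# SVD route: singular values already come out in descending order.
U, singular_values, _ = np.linalg.svd(cov)

# Same directions, up to the sign of each vector.
same_directions = np.allclose(np.abs(np.sum(U * eigvecs, axis=0)), 1.0)

# With all n components kept, projecting and re-synthesizing loses nothing.
coefficients = eigvecs.T @ centered
reconstruction_error = np.max(np.abs(eigvecs @ coefficients - centered))
print(same_directions, reconstruction_error)
```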

past meteor
#
means = mean(threes);
threes_cent = threes - means;
covariance_matrix = cov(threes);
[V, d] = eigs(covariance_matrix , i);
decomposed = threes_cent * V * V';
decomposed = decomposed + means;
#

This is PCA in full, in matlab

abstract rune
#

can someone help with this, why it is not renamed ?

past meteor
#

Matlab and numpy are twins so it should be readable

sleek harbor
abstract rune
#

Thanks a lot @sleek harbor

#

this silly error

past meteor
#

Most courses "forced" us to do algorithms by hand, which was a good thing in hindsight because then you know what it's doing

abstract rune
#
past meteor
#

Fundamentally the building blocks aren't hard, just having a high level understanding is fine. Then later you can ask yourself questions like "why the covariance matrix", "why the eigenvalues", "how do eigenvalues relate to the cov matrix", "why is the reconstruction error ~ 0 if num_components == num_features"

abstract rune
#

this is the table of contents of this book

abstract rune
past meteor
#

it's fine

sleek harbor
past meteor
#

Just go with the R version for now, code is a relatively small part of the book

wooden sail
#

you can learn all the stuff detached from coding

#

then the lang doesn't matter

past meteor
#

Like if you see lm(income ~ age + years_of_experience + level_of_education) it's trivial to map that to Python's linear regression

#

I feel like the Python version will use statsmodels or something cursed anyway so that's a wrap

sleek harbor
#

as long as u can guess that lm stands for lin model..

past meteor
#

The book definitely mentions that lm stands for linear model 🙂

sleek harbor
past meteor
#

income is a function of all those variables

sleek harbor
#

now how would I guess that?

past meteor
#

I just dropped that here out of context, in the book there's a logical flow so when you'd read it, it'd make sense from the context

#

lm(income ~ log(age) + years_of_experience + level_of_education + (years_of_experience * level_of_education)) is possible as well, very flexible stuff

sleek harbor
#

i was debating on going for a masters in financial algorithms for analysis (with R).. maybe I should've went for it..

past meteor
#

R as a programming language is so horrible

sleek harbor
#

good thing I didn't go for it then :3

past meteor
#

But it has a few nice ideas for specifically statistics

#

I'm talking about the language itself, not what can be done in it

#

I'm generally not a fan of languages that are dynamically typed and have less strong typing than Python. You know, languages that do a lot of casting like JS, PHP and R

#

You gotta be really awake because they'll do stuff that is imo silently failing, sort(c("1", 2, "3", "four", 5, 6)) is equivalent to sorted(["1", 2, "3", "four", 5, 6]) . In Python the latter (luckily) errors out while in those langs it does not
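The Python side of that comparison can be checked directly. The `key=str` variant at the end is just one possible explicit choice, added here for illustration.

```python
mixed = ["1", 2, "3", "four", 5, 6]

try:
    sorted(mixed)
    outcome = "sorted without complaint"
except TypeError as exc:
    # Python 3 refuses to compare str and int with "<".
    outcome = f"TypeError: {exc}"

print(outcome)

# Stating the comparison explicitly avoids the error, e.g. compare as strings.
as_strings = sorted(mixed, key=str)
print(as_strings)
```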

wooden sail
#

python still does plenty of that though, and it's my biggest beef with it

#

😩

past meteor
#

As dynamically typed languages go it's well designed imo but ig there's a limit to what you can do

#

Hence why I use mypy judiciously. For stuff that's not strictly data I might someday look for an alternative with more type safety but doesn't feel as verbose as idk Java.

weary warren
#

could anyone help me with a whatsapp gpt trying to integrate stripe into it. having difficulties creating the cancel subscription

lavish lily
#

If i run the OpenAI CLI fine_tunes.follow command and my stream disconnects is it still being trained and processed in the backend?

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i <model id>
iron basalt
#

Then I spend an hour wondering why the variable has the wrong value.

serene scaffold
native umbra
#

Hello, guys should I start taking "world quant data science lab" course, or should I focus on something else?

serene scaffold
lapis sequoia
#

Is there a really good guide to kaggle that has like the top 5-10 or so challenges that ramp up in difficulty so you can learn as you go?

somber panther
#

matplotlib, what are left and bottom in add_axes?

simple tapir
#

hey

#

I've learned from a tutorial that ridge regression basically sets the theta of useless features to closer to 0. But I didn't get how it knows whether a feature is useless

agile cobalt
#

it draws all features closer to zero
the usual gradient descent process just outweights that effect for the actually useful ones

simple tapir
#

How does gradient descent know which one is useful?

agile cobalt
#

the same way it knows what to update for normal linear regression models

simple tapir
#

ah

#

so instead of going forward everytime it updates itself, it starts from 0, right?

agile cobalt
#

it isn't "totally useless" / "totally useful", more of "at some point it is useful enough to resist the pull towards 0"

simple tapir
#

Yeah and I wanted to know the criterias that computer utilises to determine whether it's useful

agile cobalt
#

that'd be the loss function and back-propagation mechanism (aka calculating gradients)

simple tapir
#

Ah I see

agile cobalt
#

what Ridge regression changes compared to linear regression in the end of the day is just adding a term to the loss function that increases the loss based on the weights
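A sketch of that single change, written as plain gradient descent in NumPy. The data is synthetic (only feature 0 actually drives y), and the penalty form `lam * ||w||^2`, the learning rate, and the step count are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)   # features 1 and 2 are noise

def fit(lam, steps=2000, lr=0.05):
    """Gradient descent on MSE + lam * ||w||^2 (lam = 0 is plain linear regression)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad_mse = (2.0 / len(X)) * (X.T @ (X @ w - y))
        grad_penalty = 2.0 * lam * w               # the extra ridge term
        w -= lr * (grad_mse + grad_penalty)
    return w

w_plain = fit(lam=0.0)
w_ridge = fit(lam=1.0)
print(w_plain)
print(w_ridge)
```

The penalty pulls every weight toward 0, but the weight on feature 0 stays clearly away from it because shrinking it would cost too much in prediction error.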

grave summit
#

Hello guys, i need some advices

simple tapir
#

Ridge regression draws all the features to closer to 0 where lasso regression draws them to 0 but I don't think that it'd affect the model much, since 0.0001 can be assumed as 0. Why would you choose ridge over lasso, though?

grave summit
#

Let's set some background here, I am studying a financial time series which is representing the hourly price of electricity for each day of the year 2022 so I have a Pandas DataFrame containing two columns:

The first one is a TimeIndex containing the date and hour in a datetime format of Pandas. The second one contains the price associated to each hour.

I am studying this sample to make some predictions on the hourly price of let's say 2024. For this purpose I would like to do an in depth study of the seasonality patterns in this time series, on multiple granularity levels (hours, days, weeks, months and quarters).

In the end I would like to obtain a column vector of 8760 scalars corresponding to the 8760 hours of the year that are priced. Those scalars would represent a seasonality coefficient that I will use to make my predictions for the year coming.

Now comes the reason that i am here, I thought of doing this search for seasonality using FFT and an appropriate window function. I would like to know from you guys which window function should I use for this purpose, I know each one has its advantages and disadvantages. I would also like to know how should I choose the width of my window as obviously this will have an effect on the FFT performed. I am also open to advices on how to complete this seasonality study, would you do it another way? Which tool would you use?

This is a general question, I am looking for other people's opinion on how to do this research, I am not encountering any particular coding problem for the moment

agile cobalt
simple tapir
#

hmm I see

#

thanks a lot!

left tartan
past meteor
#

I never use it because it's a bit janky but I ... like the fact I can do it

past meteor
#

So you can just remember, for now, that it finds a way to get the best value for money performance wise, which means setting some values closer to 0.

#

As for why the coefficients don't hit exactly 0 with ridge regression and they do with lasso, you can look at the equations for that.

Write out the partial derivative of an arbitrary coefficient for L2 loss and see under what conditions beta gets to be exactly 0. It's at lambda (regularization strength) -> inf. For lasso this is not the case. (frequentist pov)

There's also the Bayesian statistics way to look at it. No regularization == uniform prior, ridge == gaussian prior and lasso == laplacian prior. If you look at the laplacian, you see a nice big peak at 0 with a big drop off. There's a high probability to be exactly 0 while the gaussian has a lot more "mass" around 0.

Honestly, idk how much value knowing this is if you're a "practitioner" and not someone making methods 🙂
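A sketch of the frequentist picture in the simplest possible setting: a single coefficient whose unregularized estimate is b (this corresponds to an orthonormal design, which is an assumption made to keep the algebra short). The two function names are made up for the example.

```python
import numpy as np

def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: pure rescaling."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft-thresholding."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
lam = 0.5

ridge = ridge_shrink(b, lam)
lasso = lasso_shrink(b, lam)
print(ridge)
print(lasso)
```

Ridge divides by (1 + lam), so a nonzero estimate only reaches 0 in the limit of infinite lam. Lasso subtracts lam from the magnitude and clips at 0, so any estimate with |b| <= lam becomes exactly 0.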

iron basalt
past meteor
queen cradle
# grave summit Let's set some background here, I am studying a financial time series which is r...

First of all, you can do an FFT on the whole time series if you like. An FFT length of 8760 should be easy.

Second, the FFT only gives you the results you want if everything lines up perfectly in time, and for calendars they don't. Suppose, for example, that there is a monthly effect (for example, perhaps something happens on the first of the month). Unfortunately, months don't all have the same number of days: January has 31 days, February has 28, and so on. So these effects will be unevenly spaced and hence not clearly visible if you do an FFT. Or suppose that you think there's a weekly component (a pretty reasonable guess, since people's activity is different on weekdays and weekends). The year is not a round number of weeks: It's 365 days, and 365 = 52 * 7 + 1. Consequently there is no frequency corresponding to weekly effects.

Third, if you only have a single year of data then you will need to smooth your results pretty heavily. You say you want 8760 scalars, one for each hour of the year. Well, you started with 8760 scalars, one for each hour of the year. If the price in each hour of the year was independent of the price in every other hour of the year, then the maximum likelihood estimate of next year's price would just be last year's price. The only reason why your problem is complicated is because you expect that the prices are not independent. Really your question about seasonality is about how to measure certain possible kinds of non-independence. And what you have to hope is that the interdependence of the different variables is strong enough and discoverable enough that you can actually predict 8760 scalars reliably.
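The second point, about weekly effects not landing on an FFT frequency, can be checked with a few lines of NumPy. The helper `nearest_bin` is a made-up name for this sketch; it assumes one non-leap year of hourly samples.

```python
import numpy as np

n_hours = 8760                               # one non-leap year of hourly prices
freqs = np.fft.rfftfreq(n_hours, d=1.0)      # cycles per hour; bin k is k / 8760

def nearest_bin(period_hours):
    """Return (closest FFT bin index, whether the period hits that bin exactly)."""
    cycles_per_year = n_hours / period_hours
    k = int(round(cycles_per_year))
    return k, bool(np.isclose(cycles_per_year, k))

daily = nearest_bin(24)      # 8760 / 24 = 365 cycles: an exact bin
weekly = nearest_bin(168)    # 8760 / 168 is about 52.14 cycles: between bins
print(daily, weekly, freqs[daily[0]])
```

A daily cycle falls exactly on bin 365, while a weekly cycle sits between bins 52 and 53, so its energy is smeared across neighbouring bins.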

dire violet
#

hi, im new to ml and im trying to build a food recommendation system. i'm not sure what type of model to use though. i did a little bit of research and found that a hybrid of collaborative filtering and content-based filtering could be ideal for what i want. for hybrid, i'm not sure where to find these models to use. in addition, how does one train a model? I understand you feed it data but what type of data do i feed

serene scaffold
dire violet
#

what do you mean by user experience? like how what data the user will have?

serene scaffold
dire violet
odd relic
#

what are your thoughts on this? It seems to have dropped real quick which concerns me. When training on a smaller dataset it took 80 epochs to go from 10K to 14.2, now when I moved to a dataset of 10K images it has this behavior

serene scaffold
#

do you expect that that will inform their food preferences in some way?

odd relic
#

oh man this is weird lol

cursive crown
#

Hi everyone, I think I found a bug in pandas

import pandas as pd

date_1 = pd.to_datetime("2012-02-05")
print(date_1 - pd.offsets.MonthBegin())
# prints 2012-02-01 (prints a timestamp but this is the date)

date_2 = pd.to_datetime("2012-02-01")
print(date_2 - pd.offsets.MonthBegin())
# prints 2012-01-01

I want date_2 to remain as is because that's the beginning of the month. How do I do this if both kinds of dates are in the same column?

agile cobalt
#

never mind, not sure

#

maybe .replace(day=1)

cursive crown
dire violet
cursive crown
agile cobalt
#

no clue, my last guess would be trying something like se - pd.to_timedelta(se.dt.day - 1, unit='D') but idk if that's the best kind of timedelta for it

#

something like resample might work depending on what exactly you're doing
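A sketch of three ways to snap dates to the start of their month while leaving dates already on the 1st untouched. The subtraction in the question behaves as documented (subtracting an offset always moves to the previous anchor), so these are alternatives rather than a bug workaround. The variable names are made up for the example.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2012-02-05", "2012-02-01"]))

# Option 1: go through monthly periods and back to timestamps.
via_period = dates.dt.to_period("M").dt.to_timestamp()

# Option 2: subtract (day of month - 1) days from each date.
via_timedelta = dates - pd.to_timedelta(dates.dt.day - 1, unit="D")

# Option 3 (per element): rollback only moves a date that is not
# already on the offset.
rolled = [pd.offsets.MonthBegin().rollback(d) for d in dates]

print(via_period.tolist())
print(via_timedelta.tolist())
print(rolled)
```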

odd relic
#

AI seems overkill

#

just do some research

#

like what does gender have anything to do with it

#

food preferences yes

#

but again, you dont need AI for that

cold osprey
#

AI everything innit

dire violet
cursive crown
young pewter
#

anybody doing the kaggle project?

#

the spaceship titanic

lusty lotus
simple tapir
young pilot
#

We need a freelancer who can build prediction model based on the dataset according to requirement.

hybrid mica
#

I have trained a self organising map. Do SOMs have a stochastic nature? Why am I getting the same SOM after re-running the code?

slender kestrel
slender kestrel
lapis sequoia
#

Anyone help plz ?

latent tundra
#

I programmed a simple 2d topdown shooter and want to apply a tensorflow agent on it. Should I use the absolute coordinates of the enemy or the coordinates relative to the player as input?

bronze jacinth
past meteor
#

I'd say relative - when I was doing Nethack RL stuff I expressed everything in relative coordinates

latent tundra
#

Yea, now that I really think about it there is probably no advantage to giving absolute coordinates and they would probably internally be converted to relative coords or even just the distance

past meteor
#

If you don't encode your agent's position and you're using absolute coordinates you'd have a bad agent I think

dusk tide
#

I am working on a movies dataset. These 2 images contain the collection names with the sum and mean of revenue and budget for each collection.
When I did sum of revenue , I found Harry Potter collection stood in 1st place .
But when I did mean of revenue I found the Avengers collection stood in 1 st place.
So truly which one is a success between the two??