#data-science-and-ml
uh you can segment images however you like, they could all be in the same file and have the label class in the file name for example
cat_0001.png for example, if cat was a class. Note that this is for single-label classification (i.e. only one of the classes can be chosen); it will be a little more complicated for multi-label classification
your model's output for class prediction will be a vector of length 20, so converting from index of the vector to class name and back is quite simple
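A minimal sketch of the filename convention and the index-to-class lookup described above. The class list and the helper names are made up for illustration; a real project would have 20 entries.

```python
from pathlib import Path

# Hypothetical class list; the real project would have 20 entries.
CLASS_NAMES = ["cat", "dog", "horse"]
CLASS_TO_INDEX = {name: i for i, name in enumerate(CLASS_NAMES)}


def label_from_filename(path):
    """Return the class index encoded before the last underscore, e.g. cat_0001.png -> 0."""
    class_name = Path(path).stem.rsplit("_", 1)[0]
    return CLASS_TO_INDEX[class_name]


def class_from_output(scores):
    """Map a model output vector (one score per class) back to a class name."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return CLASS_NAMES[best_index]


example_label = label_from_filename("images/cat_0001.png")
example_class = class_from_output([0.1, 0.7, 0.2])
```

Splitting on the last underscore keeps class names that themselves contain underscores intact.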
okay awesome, and just wondering, there's a subclass of male + female
so train/test -> male/female -> 0..20 for each folder
how would that work having the male + female subclass?
I can make some UI thing that just calls 2 separate models if it's too complicated to incorporate that subdivision of folders @small wedge
idk what you mean by subclass, is this some data you would pass to the input of the model or expect it to classify images as male/female?
Yeah
that wasn't a yes/no question XD
I'm trying to think of it mathematically for the model's input/output
you have an input vector of pixels in the image (/ whatever data you want in addition to that)
and you have an output vector of your 20 classes
does this subclass fall into one of those categories? or is it just an organizational thing?
The model has to work on both males + females but it has to detect whether the user uploaded photo is male or female
It’s age prediction but I can’t compare the ages of males + females
I see
well you can either have 2 models, the user inputs whether the picture is male or female, and pass that to the appropriate model
or you can use multiple classification, with 21 classes
and have the model predict male/female as one of the classes
!code
C++ but I am using matplotlib with it
The syntax is a bit different but that shouldn't matter all that much. I'm trying to figure out why there is that additional straight blue line going from the origin to the current data point being plotted in every update.
I have a project based on image processing and computer vision that I need help with. If there's any professional ML engineer willing to help, please let me know. Thank you!
What sort of projects?
Pardon me, It's only 1 project!
You need to include some details to explain what it is you are trying to do and to what extent and maybe why.
self.weights[layer_idx] -= self.learning_rate * np.dot(self.deltas[layer_idx + 1],
File "<__array_function__ internals>", line 180, in dot
ValueError: shapes (5,3) and (1,4) not aligned: 3 (dim 1) != 1 (dim 0)
line 85
help
It is a nn
Well, it's a freelance project for a business they have a requirement of image processing and computer vision, Looking for someone who has expertise in this field for support!
Works binary but not with multiple classes
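For reference, `np.dot(a, b)` on 2-D arrays needs `a.shape[1] == b.shape[0]`, which is exactly what the traceback complains about. The sketch below reproduces the mismatch and then shows one aligned combination for a dense layer. The layer sizes and the `(n_in, n_out)` weight layout are assumptions, not taken from the poster's code.

```python
import numpy as np

# Reproduce the mismatch from the traceback: inner dimensions 3 and 1 differ.
try:
    np.dot(np.ones((5, 3)), np.ones((1, 4)))
    raised = False
except ValueError:
    raised = True

# Hypothetical dense layer: batch of 5 samples, 4 inputs, 3 output classes.
batch, n_in, n_out = 5, 4, 3
activations = np.ones((batch, n_in))  # inputs to the layer
deltas = np.ones((batch, n_out))      # error terms at the layer's outputs

# (n_in, batch) @ (batch, n_out) -> (n_in, n_out), matching a weight matrix of that shape.
weight_grad = np.dot(activations.T, deltas)
```

Printing the shape of every operand just before the failing line is usually the quickest way to see which array lost its batch dimension or needs a transpose.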
you'd have better luck just describing your problem here or in the help forum
people generally are less eager to get on a DM with someone
plus by putting it here you get the combined expertise of the community rather than one person
is it better to watch tutorials than read books
also, is this https://github.com/ossu/data-science a good curriculum to follow for data science
@mild dirge @left tartan , thanks a lot guys! 🙏
Too broad and why there's a java section. Maybe better to look for more specific courses
can you suggest any which i can do as a beginner
I don't remember a specific curriculum now but I would suggest zoomcamps like this:
https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
Hey everyone I have a quick question
Also khan academy courses would be easier to follow for statistics rather than the MIT calculus courses
Is the training loss in any way related to the number of GPUs my laptop has?
For some reason the loss in my model is extremely high with extremely low accuracy
Even though I seem to have done everything right
Umm I was watching a tutorial on CNN, and I was doing the exact same thing as the tutorial
But I’m getting different outcome
Probably not the exact same, or you got a very unlucky run
#CNN #ConvolutionalNerualNetwork #Keras #Python #DeepLearning #MachineLearning
In this tutorial we learn to implement a convnet or Convolutional Neural Network or CNN in python using keras library with Tensor flow backend.
Convolutional Neural Networks are a varient of neural network specially used in feature extraction from images. In this v...
Here is the tutorial
Something just felt wrong
The initial weights would be different
Weights?
How much of a difference is there
Have you learned about linear regression and perceptrons yet?
I have watched some videos on it
You should have a solid understanding of those before you even begin looking at CNN
Start with the basics first would be my advice, there's many things that could make a cnn give bad results
When you create NN you have connections between the neurons. In the beginning of the code you are giving some random numbers to them. So try several times to see if the loss decreases by chance
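To illustrate the point about random starting weights: a small NumPy sketch showing that two runs with different seeds start from different weights, while fixing the seed reproduces them. The `init_weights` helper and the layer size are hypothetical; a framework such as Keras has its own seeding utilities.

```python
import numpy as np


def init_weights(seed, n_in=4, n_out=3):
    """Draw small random starting weights for one dense layer."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, 0.1, size=(n_in, n_out))


run_a = init_weights(seed=0)
run_b = init_weights(seed=1)
run_a_again = init_weights(seed=0)
```

Here `run_a` and `run_b` differ, so training starts from different points, while `run_a_again` matches `run_a` exactly.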
I have taken linear algebra so I do understand
Thank you I will try that!
Hoi, for stable diffusion, how do i start off making a script that loads last of everything on boot, and has "if not detecting X extensions/models locally, install extension through extensions, install from url tab", restart webui, then read from new extension to fetch model", then reads it all to confirm it's there.
If you haven’t watched, this series is excellent visualization of the topic: https://www.3blue1brown.com/lessons/neural-networks
Even if you know it, it’s a fun watch
He’s my favorite math YouTuber I watch that entire series❤️👍
does anyone know of a link/anywhere to learn which neural network architecture to create based on the problem
say if I wanted to classify 196 classes, how would I know what NN to create
is there a formal process or any resource cheat sheet
I want to write a python program for a typical data analytics workload: collect data, clean it, do some prediction, and display insights/dashboard on a website. I want to write it in a modular way so folks can replace a component with a different one, not even in Go. What's a good approach to do that? Write a python program that makes calls to other python executables (out of process)?
And if someone has a codebase that I can get inspiration from, please share it
I found this one but I am not sure how to run it yet: https://github.com/tdpetrou/Build-an-Interactive-Data-Analytics-Dashboard-with-Python-Oreilly
So I get a warning if I do this:
# code #1
output_df.dropna(subset=['flow'], inplace=True)
# compiler returns this message:
# See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# But no warning here, code #2:
output_df = output_df.dropna(subset=['flow'])
What I'm trying to do is remove all rows where flow has a value of None.
I looked up that link and I don't see where it applies to what I'm (doing).
I know I can import warnings and filter out the warning but... is that the right thing to do? Seems like I could just do #2 above and move along but... what's the right thing to do?
In general, avoid inplace. https://towardsdatascience.com/why-you-should-probably-never-use-pandas-inplace-true-9f9f211849e4?gi=611f8dfd661e
Truly it is a curse on the library and a pox on thee if you use it
"a pox on thee..." 😄 love it
it's a bit annoying in pandas to do things nicely and immutably
like, plenty of assigns
Most times for what I'm doing I want to modify the underlying data. Otherwise i run the risk of making more and more copies of a dataframe, making the code more confusing. I think. Probably.
Read the whole article.
yessir, will do rn 🙂 Thank you for the link.
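A small sketch of the reassignment style ("code #2"). The warning in question (SettingWithCopyWarning) typically shows up when the frame being modified in place was itself produced by slicing another frame, so the example takes an explicit `.copy()` first. The column values and the filter step are invented for illustration.

```python
import pandas as pd

raw_df = pd.DataFrame({"site": ["a", "b", "c"], "flow": [1.0, None, 3.0]})

# Hypothetical filter step; the explicit .copy() makes output_df independent of raw_df.
output_df = raw_df[raw_df["site"] != "z"].copy()

# Reassign instead of using inplace=True.
output_df = output_df.dropna(subset=["flow"])
```

`raw_df` keeps all three rows, and `output_df` holds only the rows where `flow` is present.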
I just used .melt for the first time today so seems timely.
also, my real answer is: don't use pandas, but I wont go there (and sometimes you have to)
OOOoooo tell me more please. Always looking for alternatives/options and such.
you prefer polars, or what?
I do a lot with pycompute + duckdb now
So, essentially database tables/queries?
Polars is fast(er) than pandas, but I don't want to learn yet another pythonic dataframe api that's not as good as sql.
I avoided pandas for a while and just used sqlite because I'm more comfortable with sql.
there's a lot that I dislike about pandas. but I'm too familiar with it to want to learn something else. and it's so well documented.
"avoid" being relative... I have plenty of dataframes around the code things.
Yah, everyone needs to know pandas, but eventually you hit the wall.
pyarrow starts to change the game a bit tho.
polars has a super intuitive API, if you know how to do something in SQL, you pretty much know how to do it in Polars. But you get the advantages of being able to do a ton of stuff in polars that is much harder in SQL - complex aggregation, pivoting etc - alongside composability which makes it workable as part of actual software rather than one off analysis/ETL
I was never super familiar with Pandas, but I really like polars
I wish polars and pandas would get together and make a love child.
indexes
but the main reason it is so popular is that it was the first mover
what would be a messy join in polars is often something trivial like df1 + df2 in pandas, thanks to indexes.
(something something duckdb .... )
but yeah, polars is very nice
I would recommend using Polars instead of Pandas unless you have to or if Pandas is fast enough and you are already familiar with it.
do you have an example? Ime with Pandas, indexes were largely an annoyance rather than helpful
pandas indices are helpful for that specific case: for (in sql terms) natural joins across two df's
familiarity and company tooling built on top of Pandas are the things that are hamstringing polars over Pandas
this section in Modern Pandas has some impressive examples: https://tomaugspurger.net/posts/modern-3-indexes/#indexes-for-alignment
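A tiny example of the alignment behaviour being discussed: adding two Series matches values by index label, not by position. The data is made up.

```python
import pandas as pd

jan = pd.Series([100, 200, 300], index=["apples", "pears", "plums"])
feb = pd.Series([30, 10, 20], index=["plums", "apples", "pears"])

# Values are paired by label, so the different ordering does not matter.
total = jan + feb
```

Without an index this would need an explicit join on a key column before the addition.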
Since there is no finance sub here, thought I'd leave this here:
https://github.com/IlyaKipnis/PythonBacktesting/blob/main/edhec_perfa.py
https://github.com/IlyaKipnis/PythonBacktesting/blob/main/Return_portfolio.py
Some utility functions to rebuild some of R's financial ecosystem for portfolio allocation backtesting in Python considering that Zipline/Pyfolio are prone to breaking. More functions will be upcoming in the edhec_perfa file.
thanks, that's actually very timely. That's pretty much what I just finished writing.
(i mean, not exactly of course, but yah)
Heh--yeah, used a lot of chatGPT for this translating from R
zipline was frustrating
But so many jobs say "must have Python, must have Python", though IDK which packages people use for testing signal based trading systems and limit orders
Market orders can be done just using pandas, but all of quantstrat's depth from R is just...nope nope nope
I looked at Zipline's code once and said "I am not touching that incomprehensible mess"
Yah, I opted for (i'm a huge duckdb stan) duckdb over pandas, but started from probably the same place
hey guys weird question, but how do I find out the input dimensions of my model layers
Have you tried asking chatGPT yet?
Seems you're not specific enough in your query then.
I should learn it anyway. It's a cnn. 50x256x256x196
not familiar with them, but doesn't the cnn library itself have a way of outputting that data?
do my model layers matter with the exception of the output (classes)
if you use transfer learning yeah
if you want to make your own model you set the input feature/outputs
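For working out layer dimensions by hand, the usual formula for a convolution or pooling layer is `out = (in + 2*padding - kernel) // stride + 1`. The sketch below applies it to a hypothetical stack of 3x3 convolutions with padding 1 and two 2x2 max-pools; the stack and the 128 output channels are assumptions for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def side_after_stack(size):
    """Side length after: conv3x3(p=1), conv3x3(p=1), pool2, conv3x3(p=1), pool2."""
    size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=2, stride=2)
    size = conv_out(size, kernel=3, padding=1)
    size = conv_out(size, kernel=2, stride=2)
    return size


side_224 = side_after_stack(224)
side_256 = side_after_stack(256)
flat_224 = 128 * side_224 * side_224  # features entering the first linear layer
flat_256 = 128 * side_256 * side_256
```

A 224x224 input ends up 56x56 and a 256x256 input ends up 64x64, so the first fully connected layer has to be sized for whichever input resolution is actually used.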
I see what you mean - it also seems like a pretty uncommon use, nicer with the Pandas way, but also less explicit and more magical - which often causes more pain than it saves
Yah, I really hate them (pandas indices)
yeah, they are generally not nice
it's also horrible that pandas makes you use them. like, if you want to do a join, usually you have to do set_index (it can only join on column in one of the dfs, not both)
you can specify left_on and right_on for pandas.merge iirc?
yup, that one you can I think
also today I tried to do a cross join in pandas and it just. doesn't work right. the good way according to google is, I shit you not,
pd.merge(
df1.assign(_tmp=0),
df2.assign(_tmp=0),
on="_tmp",
).drop(columns="_tmp")
I've never had to do a cross merge before, but does how="cross" just not work?
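For comparison, a short sketch of both spellings. `how="cross"` is available in pandas 1.2 and later; the frames are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"colour": ["red", "blue"]})
df2 = pd.DataFrame({"size": ["S", "M", "L"]})

# Built-in cross join (pandas 1.2 and later).
cross = pd.merge(df1, df2, how="cross")

# The temporary-key workaround, for comparison.
workaround = pd.merge(
    df1.assign(_tmp=0),
    df2.assign(_tmp=0),
    on="_tmp",
).drop(columns="_tmp")
```

Both produce the same 2 x 3 = 6 rows.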
@left tartan I hadn't ever looked at it till now, but DuckDB is very interesting
DuckDB looks like a solid api, and from a glance has good Polars support. I think I actually have a usecase where it will save me a ton of effort
Yah, polars & pyarrow... feel free to dm me, I'm super into it right now.
Arrow has been great for getting everything to be able to talk to everything else.
I use numpy almost for everything. Do I really need something like pandas? Or is it necessary for production level processing?
You're really asking about dataframes (or tables): I only know my world, but everything in my world ends up in a table/dataframe of some type. Secondly, there's a trend towards Arrow away from Numpy.
So, numpy is still important and will be for a long time, but what we're really talking about is vectorized operations and there are multiple ways to get there. Dataframes are just containers to make those operations convenient (perhaps too much of a simplification?)
How do you manage to use numpy when you have heterogenous data?
Or do you use structured arrays?
I put np.nan values - if I understand the question correctly
No, I mean, when you have several columns of wildly different types.
I am not an industry guy so trying to understand/learn thanks
Yah, the machine learning libraries tend to tell us what we must use. scikit-learn wants numpy, so we end up with pandas+numpy data types.
I usually don’t have them but I use dictionaries for it
anyway, a pandas dataframe is pretty much a bunch of equal-length numpy arrays (one per column) collected into a table-like structure
with nice methods to work on single columns, multiple columns, selecting rows, etc.
when working on a single column it's often easier to do it the numpy way, but few datasets have one column.
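A rough illustration of "a dataframe is a bunch of equal-length arrays": each column can have its own type, and a numeric column can be pulled out as a NumPy array for vectorised work. The column names and values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],   # text column
        "price": [10.0, 20.5, 30.25],      # float column
        "volume": [100, 250, 75],          # integer column
    }
)

prices = df["price"].to_numpy()
volumes = df["volume"].to_numpy()

# Plain NumPy arithmetic on the extracted columns.
notional = prices * volumes
```

The text column never has to be squeezed into the same array as the numbers, which is the awkward part of doing this with NumPy alone.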
I think I understand the problem thanks. As I remember pandas is also numpy based and it won’t make so much difference
yeah, it's very connected to numpy
whereas e.g. polars, not so much (you can easily convert columns to numpy arrays but internally they're actually arrow I believe)
Pandas is headed in that direction too, but in baby steps.
Never used them but I’ll check them (numpy vs arrow articles etc.)
this would break so much stuff
first thing that comes to my mind is that I have numba functions working on pandas dataframes, and I have some doubts numba works with arrow
hmmm
Yah, that’s why it’s opt in right now.
That's what Pandas is for.
is anyone able to tell me why my model has 0.00045%
what model? 0.00045% what?
if you ask a question, ask yourself if you've given enough information for anyone to answer it.
ok, what's the difference between _, predicted = torch.max(output, 1) and _, predicted = torch.max(output.data, 1)
I don't know what output is.
for data in train_loader: images, labels = data. output = model(images)
CNN model
I can send you the full model actually to see if you see anything wrong with it.
what does print(type(output)) show?
side question I was also considering doing a masters in AI. how much did that prep you for your job
I'm pursuing a masters currently, I got a job with just a bachelors in CS, but only because I had a publication under my belt.
in general, a masters is basically a requirement for entry level ML jobs.
how'd you get a publication
one of my professors wanted to publish with me. and I thank god (I am an atheist) for this every day.
cool
so what's the diff between output and outputs.data. I see it used in different models
mainly outputs with NNs, and .data with CNNs
I still need the answer to the most recent question that I asked you.
yeah just waiting for my model to be done
output is torch.tensor
outputs.data is also torch.tensor
looks like they're basically the same, and .data is there for historic reasons https://stackoverflow.com/questions/51743214/is-data-still-useful-in-pytorch
I would just ignore it (and not use it)
ok so my model is still fucked
are you able to take a look?
but it shouldn't be fucked. it's like pretty dense
```python
class Model(nn.Module):
    def __init__(self, num_classes=num_classes):
        super(Model, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=False),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 56 * 56, 512),
            nn.ReLU(inplace=False),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes))

    def forward(self, x):
        x = self.conv_layers(x)
        print("conv layers", x.shape)
        x = torch.flatten(x, 1)
        print("after flattening", x.shape)
        x = self.fc_layers(x)
        print("After FC layers", x.shape)
        return x

model = Model()
model.parameters

def training(model, train_loader, loss_fn, optimizer, num_epochs):
    model.train()
    model.to(device)  # using GPU if available
    for epoch in range(1):
        epoch_train_loss = 0.0
        correct = 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            train_loss = loss_fn(outputs, labels)
            train_loss.backward()
            optimizer.step()
            epoch_train_loss += train_loss.item()
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()
        accuracy = 100/batch_size * correct / len(train_loader)
        print(f"Accuracy: {accuracy:.4f}")
```
replicating all your work locally is more than I'm willing to commit to--sorry
though in general, please always use markdown blocks for pasting code
!code
don't replicate. just take a look and see why the accuracy might be .5%. just in general: epochs were 30, batch size 64, pretty dense network, nn.CrossEntropyLoss() etc. accuracy shouldn't be half a percent
I don't know
though it looks like you only have one epoch
for epoch in range(1):
it should still hit like 30% from that tho
I can't trust chatgpt either. spews so much nonsense
why should it reach 30% after one epoch?
cuz if it's a good model it can't start at .5% can it?
I've trained like 3 models, but basically all the good ones started relatively good
but what do I know
do enough epochs to get a noticeable diminishing rate of return on the loss
and then if it's still performing poorly, you can reevaluate
its either a bug or I made something insanely shitty
cuz I think even like a single MLP could have gotten more of a prediction
should I do correct / len(train_loader.dataset) or / train_loader
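On the denominator question: with a PyTorch DataLoader, `len(train_loader)` is the number of batches and `len(train_loader.dataset)` is the number of samples, so dividing the count of correct predictions by the number of samples is the more direct choice. The plain-Python sketch below uses invented per-batch numbers to show the difference when the last batch is smaller.

```python
# Hypothetical per-batch results: (number correct, batch size). The last batch is smaller.
batches = [(40, 64), (38, 64), (10, 20)]

correct = sum(c for c, _ in batches)
total = sum(n for _, n in batches)

# Percent of samples classified correctly.
accuracy = 100.0 * correct / total

# The formula from the posted loop assumes every batch holds exactly batch_size samples.
batch_size = 64
loop_accuracy = 100 / batch_size * correct / len(batches)
```

The two numbers only differ through the partial last batch, so this on its own would not explain an accuracy stuck near half a percent.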
you should do more epochs
you even said "just in general: epochs were 30"
but then it was secretly actually 1.
I should do more than that?
ah 😂
are product level networks run on thousands of epochs?
no
for epoch in range(1):
this means that regardless of how many epochs you said you did, you did one.
did you even actually do 30?
the network would just be even more dense and the epochs would be low yeah?
no I did 1 for compute
but running 30 rn
kaggle notebook gonna explode
what did you say this model does?
there's 196 classes of cars. a CNN to classify them
I gpt prompted the best network to do so, which seems to be a VGG imitation but from scratch and much lighter
my accuracy is steadily at 0.5%
does anyone know why this accuracy is so low? 😂
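One sanity check worth doing here: with 196 classes, guessing uniformly at random gives about 0.5% accuracy, and a cross-entropy loss of about ln(196). The sketch assumes the classes are roughly balanced.

```python
import math

num_classes = 196

# Accuracy (in percent) of uniform random guessing, assuming balanced classes.
chance_accuracy = 100.0 / num_classes

# Cross-entropy loss of a model that outputs a uniform distribution over the classes.
chance_loss = math.log(num_classes)
```

An accuracy that sits at roughly 0.51% with a loss near 5.28 would suggest the network is not learning at all yet (for example a learning-rate or data-pipeline problem), which is a different situation from a model that learns slowly.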
I'm building a multi-class classifier that classifies conversation text into one of 8 sentiments.
Within each message, there is a conversation id, which is basically which conversation the message takes place in. Each message is either the start of a conversation or a reply from the previous message. There is also a sentiment, which represents the emotion that the person who sent the message is feeling. There are 8 sentiments: Angry, Curious to Dive Deeper, Disgusted, Fearful, Happy, Sad, and Surprised.
Sentiment Analysis: Build a multi-class sentiment analysis model based on this dataset.
I'm using sentence_transformer to transform text into embeddings, then using OneVsRestClassifier using a simple estimator.
But it's taking too long to train on my machine.
How can I speed this up?
@proven sigil how many cores does your machine have, and what value did you set for n_jobs, if any?
NumPy and Pandas have different use cases. Pandas is designed for columnar data, much like a SQL database. Many of its operations work with one column or a small number of columns, paying no attention to the others: Summary statistics like means, standard deviations, and counts; grouping; joins; and so on. Pandas uses one-dimensional arrays almost exclusively. Polars is so tightly focused on columnar data that, as far as I'm aware, it supports only one-dimensional arrays, and I think it even requires that items at adjacent indices are placed in adjacent memory locations. NumPy, on the other hand, is designed for scientific computing: Linear algebra, numerical integration, signal processing, solving PDEs, that sort of thing. Multidimensional arrays are fundamental to these applications. You can't even make sense of linear algebra without two-dimensional arrays! NumPy also needs to work with arrays where adjacent indices may refer to non-adjacent memory locations (i.e., the arrays may have arbitrary strides). For example, this allows NumPy to extract a column from a matrix without copying: It creates a new array which points to the same memory as the original matrix but has a stride that makes each successive array index skip over a whole row of the matrix. This sort of operation happens constantly in scientific computing applications, but it would be highly unusual for the columnar data that Pandas and Polars target.
You can get away with using NumPy alone, but it's not designed for data manipulation and is a bad tool for that. You can get away with Pandas or Polars alone if you want to manipulate your data (grouping, filtering, joining, and so on) but don't want to do prediction or statistical inference. However, something as simple as linear regression requires linear algebra and hence NumPy or equivalent. PyTorch and Tensorflow are more akin to NumPy than to Pandas because the use cases they're intended for are more like scientific computing applications.
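The column-extraction point can be seen directly in NumPy: slicing a column out of a row-major matrix gives a view with a larger stride instead of a copy. The 4x3 matrix is just an example.

```python
import numpy as np

matrix = np.arange(12, dtype=np.float64).reshape(4, 3)

# Column 1 as a view: same memory, stepping over one whole row (3 * 8 bytes) per element.
column = matrix[:, 1]

matrix_strides = matrix.strides   # bytes to step per row, per column
column_strides = column.strides
shares = np.shares_memory(matrix, column)

# Writing through the view changes the original matrix.
column[0] = 99.0
```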
I wouldn't mind. But I don't have much time this week and next, so I have no idea when I'll get around to it.
Probably I'll continue to use only numpy for my processing algorithms but thanks
Is it possible to use grad cams for gans? I'm trying to overlay a heatmap on an input image to localise the area which is causing a change in the output image.
I am curious, what is the most appropriate algorithm for predicting stock prices?
Hi there!,
I'm trying to use shap[all] and I'm facing an error saying that
cuda extension was not built during install!
ImportError: cannot import name '_cext_gpu' from partially initialized module 'shap' (most likely due to a circular import) (/home/cosmix/.local/lib/python3.8/site-packages/shap/__init__.py)
Has anyone been through this error and know how to fix it?
i dont think anyone generally agrees with this
I do not understand.
As in, there is no most appropriate one?
as in there's no general agreement about the "most appropriate one"
Is this good channel to ask machine learning questions?
Anyways how to train our model for images it failed to recognise again without training whole dataset again
give it more data until it gets them all right
None 🙅♂️
My machine has 8 cores. I did not set n_jobs :/. I'm using SGDClassifier
@storm valve how was this one method called which gives different forecastings something with "M.." but i cant recall it
Hey guys, about how to calculate ROC-AUC curve, can someone help me with thresholding for Binary Classification?
If I'm using a Neural Network for Binary Classification which uses a Log Sigmoid activation function to make the classification...how should I proceed with thresholding selection?
I was thinking about simply using my output argmax, since it's the most obvious way and it's how my model is optimized(I suppose it's more or less how the BCE Loss works, even with Logits...), but when checking the source code of the model I'm using(TrimNet for drug toxicity prediction), I've found that they used scikit-learn's precision_recall_curve, which applies thresholding automatically.
I've seen that the threshold is used to get the True Positive Rate and the False Positive Rate. However, I'm simply using True Positive Predictions/Predictions and False Positive Predictions/Predictions, where the predictions are provided by my model.
I don't know how I should proceed with threshold selection, especially since my model outputs are all in log scale. It appears to me that trying to use a threshold in this situation would be a bit arbitrary and prone to cherry picking...
Hm... I've double checked the TrimNet's source code. They actually used scikit-learn's metrics.precision_recall_curve to calculate the precision through scikit-learn's metrics.auc. For ROC-AUC, they used the model predictions and the labels themselves.
Well, this solves my problem, but I don't really get the difference... Shouldn't ROC-AUC calculate both the precision and recall of the model?
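On the ROC versus precision/recall question: a ROC curve plots the true positive rate (which is the same thing as recall) against the false positive rate, while a precision-recall curve plots precision against recall. The three quantities have different denominators, as the sketch with invented confusion-matrix counts shows.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, precision) for one threshold's confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate, also called recall
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)  # fraction of positive predictions that are right
    return tpr, fpr, precision


# Hypothetical counts at a single threshold.
tpr, fpr, precision = rates(tp=8, fp=2, tn=88, fn=2)
```

Note that TPR divides by the actual positives and FPR by the actual negatives, not by the number of predictions.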
I only got time to check this properly now. The thing is, no matter how my model would make correct predictions, (pred == label) would always return a mask of False booleans.
The funny thing is...I had converted my pred to numpy arrays, but kept my label as Pytorch tensor, which caused them to be interpreted as different elements.
Hours studying, reviewing and burning neurons on the math of ROC-AUC, and the solution was just a matter of .cpu().numpy()

Thanks for the help. I hope I can now implement my ROC-AUC calculation properly.
I hope this is the right channel for this question. 🙏 I'm new to ML and I am looking for open-source algorithms that have been built to predict the mechanical and or chemical properties of materials, any materials. Is there a place where I can start looking? Thanks.
Question:
Can anyone suggest a better text corpus model than Word2Vec?
My recommender system used Word2Vec but it's not that good
I should be getting the LOTR titles and the 3 hobbit titles in top 5..but I got this
(I did try TF-IDF, but it was even worse...)
Huh, that’s a fascinating question I’ve never thought about. No idea but would love to know if there is one. I know pharma does a lot of ‘similar’ modeling for, well, pharma reasons. Example: https://medium.com/geekculture/drug-target-interaction-prediction-through-python-4af9e76fc90 Eager to hear if there’s anything here.
I recently attended a presentation on the topic of predicting mechanical properties from composition, and according to it you pretty much want USPEX.
It's not opensource, though.
just a paragraph haven't completed the whole paper yet, but tomorrow is the abstract submission only, do you mind if i DM
just nervous because it's the first time, Just wish to know if the style of writing is all good and the information in the abstract is interesting enough to catch an eye, basically just need criticism
They use graph neural networks and pytorch geometric library to model them: https://pytorch-geometric.readthedocs.io/en/latest/get_started/colabs.html
no clue
@left tartan , @tidal bough , and @glossy aspen . Thank you very much you lot for the leads.
I got ROC-AUC with negative values 
Will it fix it if I use abs()?
||I'm joking...I guess...||
I know that the most appropriate way would be to use integrals of TPR(x) and FPR(x) to calculate the area (instead of simply decomposing the ROC grid into triangles and squares). But I don't really know how would I define those functions...
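A NumPy sketch of computing ROC-AUC without picking a threshold: sort by score, accumulate true and false positives to get the curve, then integrate with the trapezoidal rule. It assumes binary 0/1 labels and no tied scores, and the helper names are made up. Because only the ordering of the scores matters, log-scale outputs can be used as they are. It also shows one way a negative area can appear: integrating with the FPR axis in decreasing order flips the sign.

```python
import numpy as np


def roc_points(labels, scores):
    """Return (fpr, tpr) arrays, starting at (0, 0), for binary 0/1 labels."""
    labels = np.asarray(labels)
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    sorted_labels = labels[order]
    tps = np.cumsum(sorted_labels == 1)
    fps = np.cumsum(sorted_labels == 0)
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr


def trapezoid_area(x, y):
    """Trapezoidal-rule integral of y over x."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


fpr, tpr = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc = trapezoid_area(fpr, tpr)

# Same curve with the points in reverse order: the x steps are negative, so is the area.
reversed_auc = trapezoid_area(fpr[::-1], tpr[::-1])
```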
maybe not the best place, but can't think of which other room would be better suited. I'm attempting to clip a LAS point cloud file using polygons in a geopandas GeoDataFrame. it is currently taking quite a long time to do each one, specifically at this line in my code (it takes about 5 seconds to execute) within_polygon = np.array([polygon["geometry"].intersects(Point(point[0], point[1])) for point in coords])
any ideas on how to better do this?
full function:
def clip_las():
    start = datetime.datetime.now()
    # Iterate through the prepped_segments dataframe
    for index, polygon in prepped_segments.iterrows():
        print("Processing " + polygon["NAME"])
        poly_start = datetime.datetime.now()
        box_path = os.path.join(PROJECT_DIR, polygon[BOX_ID_FIELD] + BOX_SUFFIX)
        # Read in the LAS file
        las = laspy.read(os.path.join(box_path, polygon[BOX_ID_FIELD] + ".las"))
        # Get coordinates of points
        coords = np.vstack((las.x, las.y, las.z)).transpose()
        # Get boolean array of points within the polygon
        within_polygon = np.array([polygon["geometry"].contains(Point(point[0], point[1])) for point in coords])
        print("Filtered points in", str(datetime.datetime.now() - poly_start) + " seconds")
        # Get the points within the polygon
        clipped_points = las.points[within_polygon]
        # Create a new laspy file
        new_las = laspy.LasData(las.header)
        # Add the clipped points to the new laspy file
        new_las.points = clipped_points
        new_las.write(os.path.join(box_path, "Final", "las", polygon["NAME"] + ".las"))
        print("Clipped " + polygon["NAME"] + " in " + str(datetime.datetime.now() - poly_start) + " seconds")
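The slow part is building one Point object per coordinate in a Python loop. One option is a vectorised point-in-polygon test; I believe Shapely 2.0 ships vectorised predicates such as `shapely.contains_xy`, which would be the first thing to try. As a dependency-free alternative, here is a NumPy sketch of even-odd ray casting. It is a swapped-in technique, not what the code above does: it handles a single exterior ring only (no holes), and the function name is made up.

```python
import numpy as np


def points_in_polygon(xs, ys, poly_x, poly_y):
    """Even-odd ray casting for many points against one simple polygon.

    poly_x / poly_y hold the exterior ring vertices in order; a repeated
    closing vertex is harmless. Points exactly on the boundary may land
    on either side.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    inside = np.zeros(xs.shape, dtype=bool)
    j = len(poly_x) - 1
    for i in range(len(poly_x)):
        xi, yi = float(poly_x[i]), float(poly_y[i])
        xj, yj = float(poly_x[j]), float(poly_y[j])
        if yi != yj:  # horizontal edges never cross a horizontal ray
            crosses = (yi > ys) != (yj > ys)
            x_at_y = (xj - xi) * (ys - yi) / (yj - yi) + xi
            inside ^= crosses & (xs < x_at_y)
        j = i
    return inside


# A 2 x 2 square and four test points: one inside, three outside.
mask = points_in_polygon(
    xs=[1.0, 3.0, -1.0, 1.0],
    ys=[1.0, 1.0, 1.0, 3.0],
    poly_x=[0.0, 2.0, 2.0, 0.0],
    poly_y=[0.0, 0.0, 2.0, 2.0],
)
```

The loop runs once per polygon edge rather than once per point, so the cost per LAS file is a handful of array operations. Discarding points outside the polygon's bounding box first would cut the work further.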
Hello all. I just joined the server and am interested in hanging out with other devs related to this channel's topic.
Is this place really active?
It's more so for asking questions about DS and AI, not many people use it as a social hub atm
I see. Thank you for your response.
How does sklearn calculate mutual information and why are the results different each time (has random_state)? What's the randomness about? I thought the formula for mutual information was just like this.. I see no reason for something random to go on under the hood..
The source contains the note: " # Add small noise to continuous features as advised in Kraskov et. al." https://github.com/scikit-learn/scikit-learn/blob/364c77e04/sklearn/feature_selection/_mutual_info.py#L391
sklearn/feature_selection/_mutual_info.py line 391: `def mutual_info_classif(`
I don't know more than that, but that could explain why you see differences.
hmmmmm.. ok, but how are continuous features binned? there's no option to chose bin parameters..
hmm.. 3 neighbors.. I gotta read that link tomorrow. Brain shutting down :/
yah, i dunno, I'd have to read through the paper to get this. _compute_mi_cc doesn't seem to bin, tho
someone give me a billion dollar idea using CNNs
anime
what
make a bot that goes through all the anime in the world, analyses it, then takes requests, and generates new anime
Already done. Anime-GAN, I think there's also Anime-DCGAN
Someone probably also did it with Stable Diffusion...
||but mine will be better||
Make a cnn that analyzes cnn (the news)
BERT? I know BERT is a classifier, but I don't know if it could be seen as "corpus model"
(I don't really know what a "text corpus model" would be specifically, but...well, usually Transformers are a jack-of-all-trades in NLP...and in most tasks...)
BERT can be fine-tuned for classification, but it's a language model.
(Source: I just spent all day fighting BERT to classify some shit)
I was looking around and wondering if I could get any help to be pointed in the right direction. In general I was curious if it would be possible to determine a video game's internal resolution based off an image, since the output resolution can be different from its internal resolution. Right now the most straightforward way I can think of is to count the pixels on a diagonal line. Though I'm wondering if there's any good opportunity here to get more into a machine learning method or some more computer vision methods. Mostly just looking for resources that would be very applicable to this or really any guidance. I'm decently experienced with python but not so much with ML, computer vision, data science, etc
idk, seems like theres a definitive way to do it right
How would a ML approach be better? speed, accuracy etc.?
How do you encode categorical variables before doing feature selection with mutual information/chi2/etc and are mutual information scores calculated between categorical/categorical x continuous/continuous features comparable?
If the categorical feature is ordinal, then no problem.. u just encode it as such. But what if it's nominal? Do you OHE? That'd be a bit weird, cus u get a bunch of scores for each value of the category, instead of just a score per category.. However, that just might be a good thing.. maybe in Blue, Red, Green, Yellow, Orange - yellow and orange aren't as important as the rest and could be safely discarded. But then again, that would (in my understanding) strongly affect comparability to other MI scores. And since MI doesn't really care about the ordering, would it make more sense to just ordinally encode the feature to get its MI score, even if it's nominal, and after that reencode it with OHE (after deciding whether or not to keep the entire feature)?
Are discrete numeric features treated exactly the same way as ordinally encoded categorical ones? Are the scores comparable? What about compared to continuous features?
Yeah I was trying to see if there's a more direct way to do it. Edge detection seemed like it could help, but wouldn't be consistent. And with different forms of AA and upscaling, well that just makes it a more difficult problem. I tried to see if there were any definitive ways to do it but haven't found much in my search
I am training a denoising autoencoder, but often the model outputs black images (values that are extremely close to 0). Sometimes it can output good results, but the next time I try training it again, it will give a pitch black image. I am using 3 conv2d layers for the encoder and 3 conv2dtranspose layers for the decoder, with relu activation except sigmoid for the last layer. Strangely enough, when I switch to Dense layers instead of convolutional layers, the model will always result in some output although not as good as when conv layers are used. Does anyone know what might be the problem here that always gives me black outputs?
When I started playing with VAEs, I had many problems regarding the fact that they work around distributions...so, when you sample your images, did you remember to denormalize them?
If you're using RGB images, your VAE probably works with a Normal Distribution, so you have to de-Normalize the output. If you don't, you may get images that don't really correspond to the training data.
Also...it seems that, depending on the dataset, some outputs are more prone to generate dark images... I have a VAE trained on CIFAR100 that has this likelihood. But when I use a custom dataset(and on a simpler VAE), this doesn't happen.
I don't know about that, but looks like you may get interested in how ROC-AUC score (a metric that is essentially categorical) is applied to regression tasks
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
Though...it seems to be a bit annoying... and prone to cherry picking
I don't see how you would use roc auc for feature importance/selection, which is what I'm currently concerned with
Maybe not for feature selection, but it could inspire you somehow.
Maybe adopting a threshold for certain continuous variables to be considered "classes"... At least this is what seems to be done in ROC-AUC. Maybe it could also work for feature selection.
Is your problem that you encode color to 3 levels for example red green and blue and you want to know if you need to throw out color in its entirety?
What's stopping you from encoding your variable with 15 colors to Red, Blue, Green and Rest if the others are unimportant
If you're using K-1 dummies you can even just drop the final category
I want to know what the correct approach is in general. Do I throw out the entire feature, do I throw out certain categories of the feature? How do I do so?
the question is, how do I find out if they're unimportant?
With whatever feature importance algorithm that your model offers
Your background was in economics right? Then I'll just say three words you probably heard a million times: beware of multicollinearity
that was next on my list.. how do I deal with that? The amount of approaches.. Like, do I use PCA? If yes, then do I replace all existing features with the principal components? If yes then I potentially lose some information and relationships the model might pick up, especially if I use a polynomial transform.. If instead of replacing I just add them, then that's basically adding to the problem by adding the solution to the problem. So my plan was to just weed out the less "important" features with mutual information, and then do some sequential feature selection with what's left to determine what works best (with cv ofc), thus kinda bypassing the need to deal with multicollinearity.. kinda
denormalize ?
this IS dealing with multicollinearity :p
ok, but I still don't know the answer to this #data-science-and-ml message to do that correctly :'/
Hey @past meteor can i ask you about K-means clustering in dms?
I'd prefer you just ask here
I'm not going to open PDFs and py files, the discord has a paste functionality
It deletes the code i copy and paste too idk why
But i put screenshots on python-help
What's preventing you from just using regularisation?
Considering you're mentioning a polynomial transform I can assume you're using a linear model? Why not use Lasso / elastic net and solve multiple problems at once
well.. regularisation doesn't exactly remove the problem.. it just smooths it out, so to speak. And I do intend to use regularization, on top of whatever else I do :3
Do you know L1 regularisation and specifically Lasso?
not necessarily. I'll be using a polynomial transform largely to just catch integrations between features
I do know about it, yes. But can you say with certainty that using it will be the best move in removing unnecessary feature for, say, a tree based model? Probably not, so I'll probably stick to inbuilt regularization methods for whatever model I go with
And we're back to multicollinearity, you need to carefully define what it is you mean by unnecessary feature and say it in the mirror 25 times haha
I haven't decided on the model yet, btw. I'm just messing with features so far
For decision trees you can just add 2 features that are 100 % noise and remove all features that have a lower feature importance than those 2
by unnecessary I mean those that contain little to no useful information about the target variable. In this case, I decided to go with mutual information to choose what is and isn't necessary
They might contain information about the target but 2 other features might contain exactly the same info
🤔 I've never heard about that.. I probably won't be trying that now, but if u got a link I could add to my read list, wouldn't say no
Like in general you'll do the same thing right, you'll remove the variable. The nuance is in how you present the results etc etc
that I'll solve with cross validating sequential feature selection - those that harm the model will automatically be removed
Oh you're going with stepwise 💀
Background Stepwise regression is a popular data-mining tool that uses statistical significance to select the explanatory variables to be used in a multiple-regression model. Findings A fundamental problem with stepwise regression is that some real explanatory variables that have causal effects on the dependent variable may happen to not be stat...
!code
this time, lets say yes, I go with that 💀
pfff.. time to read
Thank you
ok, I'll read later.. :3 pause that for a minute
Lets say I don't go with step wise, but use permutation importance (feature importance) and cross val to select the cutoff threshold.. that'd work right?
Yeah, do it by adding a few features that are 100 % noise (say 2) because that automatically tells you where your cutoff could be
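A rough sketch of the noise-feature cutoff described above, assuming scikit-learn with a random forest and permutation importance. The synthetic data, the feature names, and the choice of two noise columns are all illustrative assumptions, not the dataset from this conversation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic data (assumption): feature 0 drives the target, feature 1 is irrelevant.
X_real = rng.normal(size=(n, 2))
y = (X_real[:, 0] > 0).astype(int)

# Append two columns of pure noise to act as the importance baseline.
X = np.hstack([X_real, rng.normal(size=(n, 2))])
names = ["informative", "irrelevant", "noise_a", "noise_b"]

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: the score drop when a column is shuffled.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
importances = dict(zip(names, result.importances_mean))

# Cutoff = the best-scoring noise column; keep only real features that beat it.
cutoff = max(importances["noise_a"], importances["noise_b"])
kept = [name for name in names[:2] if importances[name] > cutoff]
print(importances, kept)
```

Because the noise columns are random, the cutoff itself is noisy; repeating this over several seeds or CV folds before dropping anything is probably wise.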
runfile('C:/Users/Ayla/Desktop/sonson.py', wdir='C:/Users/Ayla/Desktop')
C:\Users\Ayla\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
return _methods._mean(a, axis=axis, dtype=dtype,
c:\users\ayla\desktop\sonson.py:71: MatplotlibDeprecationWarning: The get_cmap function was deprecated in Matplotlib 3.7 and will be removed two minor releases later. Use matplotlib.colormaps[name] or matplotlib.colormaps.get_cmap(obj) instead.
colors = plt.cm.get_cmap('rainbow', num_labels)
Traceback (most recent call last):
File ~\anaconda3\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
exec(code, globals, locals)
File c:\users\ayla\desktop\sonson.py:107
plot_clusters(X_transformed, kmeans.centroids, kmeans.labels)
File c:\users\ayla\desktop\sonson.py:77 in plot_clusters
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='+', color='red', label='Centroids')
IndexError: index 0 is out of bounds for axis 1 with size 0
Do you know what this error means?
transformed_centroids[:, 0] --> transformed_centroids is probably empty: the error says axis 1 has size 0, so the array has no columns to index
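For reference, that exact IndexError can be reproduced with an array that has zero columns, which would also fit the "Mean of empty slice" warning earlier in the log. A minimal NumPy sketch: the `(3, 0)` array is a made-up stand-in for the real centroids, and `check_2d_points` is a hypothetical helper, not part of the original script:

```python
import numpy as np

# Hypothetical stand-in for centroids that came out with zero columns.
transformed_centroids = np.empty((3, 0))
print(transformed_centroids.shape)  # (3, 0)

try:
    transformed_centroids[:, 0]
except IndexError as err:
    message = str(err)
    print(message)

# A cheap guard before plotting: fail early with a readable message.
def check_2d_points(arr, name="points"):
    arr = np.asarray(arr)
    if arr.ndim != 2 or arr.shape[1] < 2:
        raise ValueError(f"{name} must have shape (n, 2+), got {arr.shape}")
    return arr
```

Calling `check_2d_points(kmeans.centroids, "centroids")` right after fitting would point at the step that produced the empty array instead of failing later inside the plotting call.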
ok, but that "solves" the correlation problem. But that still doesn't answer this #data-science-and-ml message how to encode cat features before calculating mut info and are results comparable between feature types?
So what can i do
I am not sure how you defined that variable, probably share the snippet of code where you did.
These are codes and data
You'd just one-hot or target encode it
so which is it??? do I one hot, or do I ordinal?
That's a question for you not me
I would not ordinal encode colours because you'll get something like 1, 2, 3, 4, 5
And Red (1) - Green (2) being -1 makes no sense
I might ordinal encode high cardinality categorical variables if I'm working with a tree algorithm because they have a degree of invariance to this problem
thanks. looking at the code, I think the PCA and KMeans classes seem fine. please make sure your "kmeans.centroids" is a 2D array so that you can index on the 2nd axis.
A degree, but it's definitely not perfect
That leaves you with one - hot vs target encoding. Target loses a lot of information but killing your model by expanding high cardinality categories is worse.
but for feature selection purposes, forget the model. Obv if I ordinally encode a feature for selection I'll reencode as one hot before passing it to the model, but before that. Before we get to the model, it makes a big difference: ord vs one hot
If I select a certain threshold, I'll get vastly different results afterwards
TL;DR
- Low cardinality: One hot
- High cardinality and tree: ordinal encoder
- High cardinality and not a tree: target encoding
It depends on how MI is specifically calculated
If it treats it like something continuous then ordinal encoding would not make sense
it treats it as discrete categories (continuous gets binned)
*actually implementations differ, but that's the easiest explanation
How is the binning done, Is it done exactly at the cutoffs of your ordinal values or do some get grouped?
Okay
Thank you
If you can help with editing the code let me know please @lapis sequoia
Even if it does it at exactly the right granularity, the issue is that, imo, you still want to OHE it at the feature selection level so you can see what specific levels are relevant.
binning is for continuous variable, they just get grouped into bins, like for a histogram, and those bins are used as categories to calculate the score. But that's not the main question. What do I do with nominal categorical variables is what I'm asking, not continuous
... yes but does your implementation treat nominal categorical as continuous
as discrete* so either ohe or ordinal encoding works
but like this
Are you using sci-kit learn's implementation? I'll just read the docs
is it kaggle titanic problem?
yeah. The docs aren't very clear on anything tho
yeah
What is your doubt exactly? I couldn't catchup with the convo
this is the original question: #data-science-and-ml message
Based on the docs it looks like if you make them ordinal and then specify that it's discrete you should be fine but then indeed you're computing the MI for the entire feature and not the levels
So if you want to drop a feature in its totality that works
If you want to know what specific levels have a high MI, then you should OHE
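A small sketch of the two options, assuming scikit-learn's `mutual_info_classif` with `discrete_features=True`. The data is synthetic (a made-up 4-level nominal feature where only one level is informative), so the numbers only illustrate the difference between a per-feature and a per-level score:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic nominal feature with 4 levels, stored as integer codes (assumption).
codes = rng.integers(0, 4, size=n)
# Only level 2 carries signal about the target; the other levels behave alike.
p = np.where(codes == 2, 0.9, 0.2)
y = (rng.random(n) < p).astype(int)

# (a) One score for the whole feature: integer codes, flagged as discrete.
mi_feature = mutual_info_classif(
    codes.reshape(-1, 1), y, discrete_features=True, random_state=0
)

# (b) One score per level: one-hot columns, also flagged as discrete.
one_hot = np.eye(4, dtype=int)[codes]
mi_levels = mutual_info_classif(one_hot, y, discrete_features=True, random_state=0)

print(mi_feature, mi_levels)
```

With the discrete flag set, the integer codes act as unordered labels, so their ordering should not affect the score. Each one-hot column is a coarser view of the same feature, so a per-level score cannot exceed the whole-feature score, which is worth keeping in mind when applying a single threshold to both.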
Does this answer your question or did I miss something?
For the titanic dataset specifically the second plot makes more sense, not all decks are important so you could drop those levels once you one hot encode. The first plot doesn't give you that information
if I OHE, will the resulting scores be comparable to scores of other features? As far as I understand - no, and that would kinda defeat the purpose, because I wouldn't be able to set a threshold for elimination. Unless, that is, I set a separate threshold for each group of features created by the one hot encoding.. but then I'd end up with many thresholds.
And the other unrelated question: are MI scores between categorical features comparable to MI scores between continuous features?
It will be comparable I believe
for continuous, there could be a little loss of information as we will be binning the values, so it depends. but still comparable in most cases.
(I still believe you should just use proper regularization and then you can leave this be)
Plot your data vs. the target etc
I think @past meteor already answered most part of it. As, its more about experimentation and results could still vary irrespective of MI scores. maybe you can calculate MI individually, eliminate few categories after OHE. Also keep the original feature and concatenate that with selected OHE features, that way probably some of the information about eliminated categories will still exist for model to learn.
U sure OHE will be comparable to others? I mean, all the information that they contain is just on/off for one category of the initial feature.. that doesn't seem like a lot, tho it can be useful.. Say you have a high cardinality feature, with a bunch of categories, but OHE is basically dividing their importance by the cardinality, making their MI scores very low, meaning they might just all get eliminated, if you use a threshold comparable to the others.. Idk if I explained this well, if not I can try clarifying what I mean
hmm.. ok, that kinda makes sense, maybe, I think 🤔 or not? I mean, discrete scores are deterministic, where continuous ones aren't
makes sense after normalization, considering that misclassification is a full mistake
since the classes are orthogonal to each other, mistakes there automatically yield a large distance
this kinda went over my head.. 😅 where r classes orthogonal?
when you use one hot
yes right, it's a bit tricky for continuous as it depends on number of bins. high bins --> overfit, too low bins --> we miss useful patterns. I try to plot the features to judge what could be the ideal bins to be used.
so.. normalizing after one hot makes the scores comparable?
@lapis sequoia the code worked but how can i make it do iterations?
one hot is normalized by default. it makes an orthonormal basis for the labels
if you normalize the other scores, you can compare them to ones derived from one hot vectors
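To make the orthonormal-basis point concrete, here is a tiny NumPy check. The three colours and the ordinal codes 1, 2, 3 are made-up examples:

```python
import numpy as np

colours = ["red", "green", "blue"]
one_hot = np.eye(len(colours))  # row i is the one-hot vector for colours[i]

# Orthonormal: the Gram matrix of the rows is the identity.
gram = one_hot @ one_hot.T
is_orthonormal = bool(np.allclose(gram, np.eye(len(colours))))

pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)]

# Every pair of distinct classes is the same distance apart (sqrt(2)).
one_hot_dists = [float(np.linalg.norm(one_hot[i] - one_hot[j])) for i, j in pairs]

# Ordinal codes impose an order: red-blue ends up twice as far as red-green.
ordinal = np.array([1, 2, 3])
ordinal_dists = [int(abs(ordinal[i] - ordinal[j])) for i, j in pairs]

print(is_orthonormal, one_hot_dists, ordinal_dists)
```

This is the same reason ordinal codes look odd for colours: the distances between codes carry an ordering that the categories themselves do not have.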
iterations for? I think you are already doing Kmeans iterations while kmeans.fit in your code.
The professor wants us to produce outputs for 4 iterations, idk how. I will send screenshots of what output she expects
huh.. never thought about that. looks like sklearn standardizes all features by default when calculating MI, so I guess the scores are comparable
that kinda clicked now, thx
probably you can use k means elbow method and show the plot of clustering results based on different number of clusters, or simply print out output vectors shape.
@past meteor @lapis sequoia thx for ur time, that was really helpful. I think I kinda get it now, tho will likely wake up tomorrow and have to revisit the entire conversation again.. :p
The output i get is the first iteration plot only
It doesnt give me 2nd 3rd and 4th
maybe the if condition in your Kmeans class (fit function) is already satisfied at the 1st iteration (and it breaks the iteration loop), you may try removing it once?
just keep self.iteration>200 for once
Like nis says, no substitute for empirically validating if your model is better with or without the feature 🙂
I wonder why factor analysis isn't more common in CS-aligned data science communities: https://github.com/MaxHalford/prince
I use FA often, at least when I want to compare results with PCA. We even used it in our winning solution for one of the featured Kaggle competitions.
Great to hear. I'm a kaggler myself but I do it strictly for fun 🙂
Tend to do the tabular playground series. Sometimes with people from my cohort over the weekend or so. Never won anything though 🤣
aaha that's nice, winning is not important, it's the learning which interests me. I am competing for 3-4 years now haha.
Yup, Kaggle + Lurking on Reddit + university (in no particular order) were what taught me data science. It's a great community.
Hahah quite the same for me, I recently joined full time, but the community still interests me to be active in competitions and also contribute.
It's funny because recently I had an idea to make something where you have an LLM that has the right answers to projects (could be data, could be software) where people submit their answers and get feedback from the LLM on how to improve based on 1 or more model solutions.
That's a nice use case.
Several technical challenges in making it but overall the idea is to get "senior dev" tier advice based on model solutions to help people improve their skills - for free. A blue sky idea, I know
haha I won't claim myself as an expert in LLMs, but yeah this sounds interesting. So, you are planning to finetune the LLM models or do some prompt engineering and build an app/software upon it?
I'm in a research lab. I'll let the idea "simmer" with our LLM gurus first. Initially my idea was to just prompt engineer and build an app on top of it.
Scope isn't big enough to be done at my work I think so my guess is that I'll try it as a hobby project 🙂

My labelimg program closes when labelling. I labelled yesterday and had three classes. I then continued today and it just keeps crashing. When looking at the classes.txt it gets changed to the first class I draw. Anyone had a similar thing and know how to fix?
Nvm found a workaround
@past meteor if ydm, what did you end up doing for that synthetic tabular data generation?
Haven't done it yet but my approach is clear - I'll make a graphical model and generate it from there. If I want to make it more noisy I'll add a VAE in there.
But it should be possible to make it pretty noisy at the PGM level already
makes sense. you were looking for a simpler approach right? none convincing?
Simple but it needs to feel realistic. We might use the data for internal / external training so making it yourself means you get to inject whatever issues you want to cover.
ah right, fair enough
Best site to practice pandas.
I think just sticking to ur bio is a good idea
I'd say, read part of the Pandas user guide on their website and then do some Kaggle @pine escarp
try to help in this discord 😉 people come with interesting (explicitly related to pandas or otherwise) problem, solving them is good practice!
also maybe https://github.com/ajcr/100-pandas-puzzles
and https://pandas.pydata.org/docs/getting_started/tutorials.html
Thank you all for your suggestions, I'll definitely try them all. 
I have been looking for a site that offers exercises to do.
should you standardize one hot encoded features if you standardize all your other features? Specifically asking about standardizing (mean 0, std 1), not normalizing. I think you should, but.. for some reason I couldn't find any confirmation on google.. :/
kaggle mini courses are good as well.
If you pay, you can do some exercises on datacamp, but.. the exercises are really bad, I wouldn't recommend. Just kaggle and practice
Thank you, I'll try them!
Ohh, I see. Thank you.

are you only interested in pandas, or numpy and visualization too?
Yes!
I'm still a beginner, so I'm starting with pandas.
I'll learn numpy and matplotlib next.
And SQL simultaneously.
Pandas so far have been interesting.
Train me to smoke data as well. xD
😂😂
Can't reveal the secret sauce, but getting addicted to kaggle was the key for me ;)) might be something else for you.
Yes if you're using L1 or L2, it's critical that your variables are on the same scale then
well once you're done with the theory for those, if you want some beginner friendly exercises, I'd recommend the 5 ones at the end of this course: https://www.freecodecamp.org/learn/data-analysis-with-python/#data-analysis-with-python-projects
I don't remember much about them honestly, but I remember I enjoyed them. That said, they aren't exactly very good from the perspective of "high quality material", there are a few bugs in the exercises themselves (which you may or may not encounter, depending on your approach). But I think that's a good thing, cus you get to try figuring out exactly what is wrong. Long story short, try it if u feel like it. Don't have to go through the material if u already know the theory, can just do the exercises
The strength of your regularisation would be a teeny tiny bit stronger on your OHE'd variables
I think either way it might not make a difference. I mostly don't do it I think
but there's no harm if I feel like it, right?)
Thank youuu! I'll try them. 
Kaggle, I'll definitely try it.
Yes. I don't know if you're using the MSE Loss for your VAE, but, in reality, the Decoder doesn't generate any image, it just generates parameters for a distribution(for RGB images, usually Normal distribution, and for grayscale, usually Bernoulli).
So, in order to convert each value in the Decoder output from a Normal Distribution to a proper image, you have to denormalize it by multiplying by the standard deviation and then adding the dataset mean (which would be the mean of the Normal Distribution).
with torch.no_grad():
    input_noise = torch.randn_like(z).unsqueeze(-1).unsqueeze(-1)
    saving_image = decoder(input_noise)
    saving_image = saving_image.view(saving_image.size(0), saving_image.size(2), saving_image.size(3), saving_image.size(1))
    saving_image = saving_image.cpu().numpy()
    saving_image = (saving_image * STD) + MEAN
||Yes, I'm the kind of guy who reject the use of .permute() in favor of .view() to convert torch tensors to numpy arrays||
Using this approach, your output goes from something like this:
To something like this:
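One caveat on the conversion snippet above: `.view()` only relabels the shape over the same memory order, while `.permute()` actually reorders axes, so the two generally give different pixel arrangements when going from channels-first to channels-last. The sketch below uses NumPy's `reshape` and `transpose`, which behave analogously for this purpose, on a made-up 1x3x2x2 array so it runs without PyTorch:

```python
import numpy as np

# Made-up "image batch" in channels-first layout: (N, C, H, W) = (1, 3, 2, 2).
x = np.arange(12).reshape(1, 3, 2, 2)

# Analogue of .view(N, H, W, C): same flat order, only the shape label changes.
viewed = x.reshape(1, 2, 2, 3)

# Analogue of .permute(0, 2, 3, 1): axes are genuinely reordered.
permuted = x.transpose(0, 2, 3, 1)

# Pixel (0, 0) should hold one value from each channel plane: 0, 4, 8.
print(viewed[0, 0, 0])    # [0 1 2]
print(permuted[0, 0, 0])  # [0 4 8]
```

If saved images come out with scrambled colours or tiling, swapping the `.view(...)` call for `.permute(0, 2, 3, 1)` is the first thing to try.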
Interesting, but I was talking about DAE/denoising autoencoder which I believe is different from VAE?
It is? 
I thought Denoising AutoEncoder was a Variational AutoEncoder optimized in a way that the generative factor is severely decreased
Ok, sorry then. Your model probably has nothing to do with what I said. 
I've never thought of it that way 
Regular autoencoders can be used for denoising as well
That is true
The general idea would be to add noise to your input and try to reconstruct the original
Do you think autoencoders should include batchnorm?
The term "autoencoder" became very confusing to me after I learned about diffusion models...
Every paper on latent diffusion uses the term "autoencoder" to actually refer to "variational autoencoder", but they never use the "variational" term.
If the bottleneck is small enough you could also just denoise naturally without adding noise.
Been a while since I used any autoencoder. I did a fun project some time ago where I used an AE to do an adversarial attack on resnet-50
Can also be used to self-supervised train nearly everything, cool stuff, cool stuff, ...
there are a few GitHub repos that basically go through a series of mini projects that have you use pandas heavily
and they're designed in such a way that you get exposed to the different features
that could be helpful
but definitely kaggle comps + pandas user guide 💯
yeah VAEs
ig they kinda laid it out in the stable diffusion paper so it became shorthand
hy guys
I have a Pandas dataframe containing a time series, one column contains every day of the year 2022 with every hour, so we get 24 rows per day and the other column is a price
I would like to calculate the mean price of each month, I built a DIY solution with counters variables etc but it somehow fucks up at the end and includes more rows than wanted for each month from february on
i would like to know if there is a quick method to get the prices for the days of a given month only by selecting only these rows to calculate the mean
i was able to do it by comparing the date with operators to each beginning of a new month
I'd add a month column and groupby by it.
and the month column you can calculate by... well, depends on what way your date is represented. either something from pd.Series.dt, or something from pd.Series.str.
pandas has a lot of stuff for manipulating datetime columns, such as https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html.
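A minimal sketch of the month-column-plus-groupby idea, assuming the date column can be parsed by `pd.to_datetime`. The frame below is a made-up stand-in for the real data (one row per hour of 2022, with placeholder prices), and the column names `timestamp` and `price` are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data: 24 rows per day for all of 2022.
n_hours = 365 * 24
df = pd.DataFrame({
    "timestamp": pd.Timestamp("2022-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h"),
    "price": np.arange(n_hours, dtype=float),
})

# If the column is stored as strings, parse it first (a no-op for datetimes).
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Group by the month number and average the price within each month.
monthly_mean = df.groupby(df["timestamp"].dt.month)["price"].mean()

# Rows for a single month can also be selected directly with a boolean mask.
february = df[df["timestamp"].dt.month == 2]

print(monthly_mean)
print(len(february))
```

This avoids manual counters entirely, which is usually where off-by-one month boundaries creep in.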
what am I supposed to see here? I don't see a month column.
AutoEncoders are really forsaken.
They have such a nice theory and idea, but it's so difficult to find anyone explaining it correctly. It's just "Make neural network, output smaller than input, then make another neural network, output equal desired image"
for a regular autoencoder, that's really mostly it lol
what else do you think is missing?
It just feels a bit... too simple...and boring.
the easiest way to think of it is that you generate a pair of functions, and they are inverses of each other
that's... all, really

you don't need the "encoded" part to be lower dimensional
it doesn't have to be images
the architecture of each of the networks doesn't matter, varies by application
the cost func also
"autoencoder" is just a very general framework. that's why it's "boring"
you can use it in conjunction with any concrete application and details
I'm not used to simple things in neural networks...not since I discarded the use of keras
i work a lot with things that are technically autoencoders, and you wouldn't recognize them at all
stuff like task based machine learning also falls under autoencoders a lot of the time
same with self supervised learning
in some sense, it's even just an interpretation of a network
cuz it doesn't have to be 2 in the first place

(it also doesn't have to be "networks")
This ^
So...if I have a function that receives a 32x32x3 image...and outputs a single value (0 or 1)...can I call it an autoencoder?
no because it doesn't satisfy the main condition stargazer just referenced again
Oh, yes...indeed...
if you took the image and did whatever you want with it, spitting whatever out
then you take that whatever and remake something very close to the original image
you have yourself an autoencoder
Data augmentation = autoencoder?
you could generate 10 million petabytes of data as that whatever, and it's still an autoencoder

any network performing compression and reconstruction is basically AE ig
it doesn't even have to be compression
that's one of the most common cases because you use them in parameter estimation, but it does not have to be the case
literally just a forward-inverse pair
the key idea that makes it so powerful is that, if you put the pieces together in a special way, you can get the forward function to give you something useful, and the training is done based on the inverse function's output
essentially making the pair of functions "train themselves"
you only need input data, not even labeled pairs
but that's on the application side and varies greatly depending on what you want to do
the main idea is: forward-inverse pair
I think I'm an autoencoder fan and didn't even know that...because I enjoy using self-learning and unsupervised learning models...
these are often autoencoders
not always. but often times
in the very classical sense, like clustering, that's not an autoenc
Maybe I should review the maths on a loss function I've been using to extract the minimum entropy... It uses the argmax of a softmax function as label 
let your mind be flexible
a lot of problems can be solved by noticing two things are actually the same thing if you close one eye and tilt your head
that's a big part of research and problem solving
so yeah, review your maths and get your intuition rolling
Small bits I wanna add:
AE's are a bit similar to PCA in their application domain. PCA explicitly has orthogonal eigenvectors. AE's do not (think of what cons this specifically has). PCA is a linear method (obvious cons). Solution(s): people are looking towards beta-VAEs as well.
There's kernel PCA that has the best of both worlds. I like kernel methods in general but you can barely apply them in reality
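To illustrate the PCA comparison above, PCA can be written as a linear encoder/decoder pair whose components are orthonormal by construction. A NumPy sketch on made-up data; the shapes and `k = 2` are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumption): 200 samples in 5 dimensions with correlated columns.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_centered = X - X.mean(axis=0)

# Principal directions are the right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
W = Vt[:k].T  # (5, k) matrix whose columns are the top-k components

def encode(x):
    return x @ W      # linear "encoder": project onto the components

def decode(z):
    return z @ W.T    # linear "decoder": map back to input space

Z = encode(X_centered)
X_hat = decode(Z)

# The components are orthonormal, which a plain autoencoder does not guarantee.
components_orthonormal = bool(np.allclose(W.T @ W, np.eye(k)))
reconstruction_error = float(np.mean((X_centered - X_hat) ** 2))
print(components_orthonormal, reconstruction_error)
```

A trained autoencoder replaces `encode` and `decode` with nonlinear networks, which removes the linearity limitation but also drops the orthogonality guarantee unless extra constraints are added.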
Can anyone explain siamese neural networks like i'm a 5 year old
Take a neural network. You have 3 inputs, 2 are of the same class and 1 isn't.
Give it input 1 and predict vector V1 (positive class)
Give it input 2 and predict vector V2 (positive class)
Give it input 3 and predict vector V3 (negative class)
You compute the Euclidean distance between all points. You reward the model for having a low distance between V1 and V2 but also being far from V3. (triplet loss)
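A minimal NumPy sketch of a triplet loss in its common hinge form, treating V1 as the anchor. The margin value and the example vectors are assumptions for illustration; in a real Siamese setup the three embeddings come from the same network with shared weights:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances between embeddings."""
    d_pos = np.linalg.norm(anchor - positive)  # should be small
    d_neg = np.linalg.norm(anchor - negative)  # should be large
    return float(max(d_pos - d_neg + margin, 0.0))

# Made-up embeddings standing in for V1, V2 (same class) and V3 (different class).
v1 = np.array([0.0, 0.0])
v2 = np.array([0.0, 1.0])
v3_far = np.array([3.0, 4.0])
v3_near = np.array([0.0, 0.5])

print(triplet_loss(v1, v2, v3_far))   # 0.0 -> already well separated
print(triplet_loss(v1, v2, v3_near))  # 1.5 -> negative is closer than the positive
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training only pushes on triplets that are still "too close".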
Is that clear enough?
Hm...interesting...
https://lilianweng.github.io/posts/2018-08-12-vae/#beta-vae
Important thing to remember next to all the math is that you want to get that sweet sweet property back that you had in PCA where each component (~ neuron in the information bottleneck) carries different information (is orthogonal)
Bit of a handwavy explanation but Edd can fill in the blanks haha
Each neuron -> a different component? 
From my slides of uni. That information bottleneck exists in both PCA and autoencoders
Hence why we covered them in the same lecture
Does this make sense or is it confusing?
you can set constraints either on G or on z to get (near) orthogonality in the representing basis, like in pca
you don't necessarily need that part though. really depends on the problem at hand
i can give you an example of one we deal with at work
A position I applied for was using beta-VAE's specifically in physics
Because for them it was important and interesting that they were (near) orthogonal
we often want to solve a so-called "inverse problem", where we measure data, we have some idea of the physical process, and we want to extract relevant parameters. so what we can do is take the parameters and let them be x in this image. use a physically-motivated forward model, some approximate solution of a differential equation, and let it be G. this part may or may not have trainable parameters. this spits out a z, which is technically "encoded" if you choose to interpret this as an encoder, but usually z is much higher dimensional than x in this application. then let a deep network be F, and have it estimate x. once all is said and done, F can be used in stand-alone fashion on real measurement data, provided that our G was good enough, and it inverts the problem. e.g. like locating objects using x-rays
the properties that G and z satisfy depend entirely on the application in general though, so yeah. that's where your domain expertise comes in
That's really interesting
Gonna shill my own stuff: like I mentioned, last time I used an autoencoder it was to make image-dependent noise that works on out-of-sample examples and turns every example into an airplane
So essentially, it's a cheap and easy way to "attack" any model, provided you have access to its gradients which is totally unrealistic except in toy problems haha
ooh nice
Autoencoders encode, automatically.
The encoding can be whatever.
^
i'm confused a bit in these pics, as i can understand the first column in has heart disease (True Positive) represents the actual data of training data am i right or wrong? and if im wrong what r 4 corners refer to?
so the point of the algorithm is to figure out if someone has heart disease or not. and there are two different ways the algorithm can be incorrect: it can say that someone without heart disease does have it (a false positive), or it can say that someone with heart disease doesn't have it (a false negative)
whereas if the algorithm says they have it, and they actually do, that's a true positive. and you can probably figure out what a true negative is.
i will explain to u what i understood and if i got smt wrong tell me the right thing, first
so we divide the data whatever into 4 fold cross validation or whatever.. so we divide the data let's say into 75% of training data and 25% for testing data, the actual column of "has heart disease" & "does not have heart disease" it represents the real data which is training data is that right?
sounds like you have a couple different concepts mixed up.
k-fold cross validation is where you divide the data into k partitions, and then you do the algorithm k times, and each partition "takes a turn" being the test data.
though it sounds like you might understand that much well enough
the actual column of "has heart disease" & "does not have heart disease" it represents the real data which is training data is that right?
all the data--both the training data and the test data--is "real".
yep u r right
for each data point, there's what it actually is, and what the model predicts that it is.
here it is where it says which is true positive (TP), false positive (FP), etc.
i mean the output, ok what i meant that we took like 75% of data and we took 25% for testing to c yk which method will gave us lowest errors..
cool so what does true positive false positive etc.. represent? training or testing?
you do those counts when you test the algorithm
@serene scaffold the point that made me truly confused in this pic that he said the columns r the actual data and the rows r the predicted data
forget about this one, let's look at other example
there isn't separate predicted and actual data. there's predictions about the data. this is a key distinction
the algorithm does not produce new data.
so here, we got actual 142 ppl had heart disease, and our prediction misclassified 29 ppl that's right?
the algorithm misclassified 29 + 22 people.
u r right, but exactly it misclassified 29 ppl said they does not have heart disease but they have right?
right
ok so i got here the last 2 question
the first one is, the 142 is that the all real data? like we took the all data which is 142 ppl had heart disease and we tested it or it's a sample like 75% and we tested on 25%?
There's no fake data. It's all the real data.
The number of data instances in the test data is the sum of the four squares. 142 + 22 + 29 + 100
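A small sketch of the arithmetic, using the four counts quoted above. Treating 22 as the false positives and 100 as the true negatives is my reading of the picture being discussed (an assumption); the variable names are made up for illustration.

```python
# Counts taken from the confusion matrix discussed above.
tp = 142  # has heart disease, model says they have it
fn = 29   # has heart disease, model says they do NOT have it
fp = 22   # no heart disease, model says they have it (assumed cell)
tn = 100  # no heart disease, model says they do not have it (assumed cell)

total = tp + fn + fp + tn        # number of instances in the test data
misclassified = fp + fn          # every wrong prediction, of either kind
accuracy = (tp + tn) / total
recall = tp / (tp + fn)          # share of actual positives that were found
precision = tp / (tp + fp)       # share of positive predictions that were right

print(total, misclassified)
print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```

Swapping rows and columns of the matrix changes none of these numbers, which is why the orientation of the table does not matter.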
@barren fable make sense?
i was watching a vid to explain it so i'll tell u what i understood step by step
look at these 2
the one on the right said the columns r actual data and the rows r the predicted data
the one on the left said the columns r the predicted (which means as he explained the values coming out of the model) and rows r the expected (which means as he said the actual data the model is supposed to predict)
so im confused which one is right? or he transposed the row and columns?
actual data and the rows r the predicted data
Banish this from your mind
there's not actual data and predicted data. there's just the data. this is very important.
well um that what the video said not me 😂
then find a better video 
there's just the data. and there's what the data actually is (actual), and what the model says the data is (predicted)
no new data is created.
there's no standard for whether rows should represent actual, or if columns should represent actual.
but the point is what it represents.
great
so r we using cross validation in this process or no? like dividing the data for training and testing?
this is a confusion matrix. you can have a confusion matrix whether you're doing cross validation or not.
cross validation is where you do the whole process multiple times, but you divide the data into train and test differently each time. and you see how much of an impact that makes on the performance
if there's a big difference between the best time and the worst time, then something is probably wrong
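Roughly what "each partition takes a turn being the test data" looks like in code. This is only a sketch on made-up data with a plain least-squares model standing in for "the algorithm"; in practice a library helper would normally do the splitting.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # made-up features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 4
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)                   # k roughly equal partitions

scores = []
for i in range(k):
    test_idx = folds[i]                              # this fold takes its turn as test data
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    mse = np.mean((X[test_idx] @ coef - y[test_idx]) ** 2)
    scores.append(mse)

# A large gap between the best and worst fold is a warning sign.
print(scores, max(scores) - min(scores))
```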
(for other people reading this, I'm trying to explain this as simply as possible, using terms that the asker has already used)
yo Stel thx bro u r a lifesaver ❤️
yw
are you taking a class or something?
rn nope
CP1
What's CP1?
What are some potential problems with linear regression?
and what exactly does linear regression do?
i googled the comparison between linear and logistic regressions and linear solves regression problems (not too sure what that means) while logistic regression solves classification problems, which i get more but an explanation would still help a lot
it works best mostly for linear data it cant handle complex data like a polynomial type data
find the best fitting line through your data thats it
wb logistic regression?
well it fits a sigmoid function through the data i can send you a video link that explains it very clearly
i cant send images in here so the sigmoid function is sort of hard to explain ;-;
u can dm?
sure
hello i need some advice regarding machine learning career that i am pursuing so please let me know if anyone can help
you can fit a polynomial with linear regression
umm iirc its polynomial regression right ?
please correct me if i am wrong
splitting them into categories like that does you a disservice
in both cases, you set up a matrix problem of the form y = Ax + b, and you solve it as A^-1 (y-b)
Hi guys, why is the graphics of cost function in gradient descent algorithm shown parabolically? There's no x^2 or something in linear regression
the cost function usually does have x^2s in it though
Isn't the cost function (theta.X)/m though?
.latex you usually use something of the form
\[
J(\bm{x}) = \Vert f(\bm{x}) - y \Vert_2^2
\]
oof i forgot to make the y bold. but anyway, MSE has the squares in the name 😛
also when you set up y = Ax, it often does not actually have a solution
The intuition is that the word linear means that one unit of increase in your variable means a certain increase in your target, given by your coefficient
you instead minimize the error using some metric, and MSE is a common metric
not all metrics involve squares, but most of them are nonlinear and so you get curves
but isn't the meaning of linear regression supposed to be fitting a linear function aka a line ?
^
a linear function is not a line, that's the problem 😛
a linear function is a function such that f(u + cv) = f(u) + cf(v)
And that's a big assumption. Some things are great in the beginning but they start sucking. For example temperature vs happiness. You keep getting happier the warmer it gets but when it's 50 ° c you get sadder
That's a typical non linear relationship
bro wait you just shook my entire math foundation i thought linear function = line lol
So, we don't mean Ax + b with "linear function" then?
Ax + b is an affine transformation, that's actually not even linear
unless you use some tricks
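A quick numerical check of the definition f(u + cv) = f(u) + cf(v). The matrices and vectors are random made-up values. The last part shows one common trick, appending a constant 1 to the input so that b becomes part of the matrix; whether that is the trick meant here is my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
b = rng.normal(size=2)
u, v, c = rng.normal(size=3), rng.normal(size=3), 2.5

def linear(x):
    return A @ x

def affine(x):
    return A @ x + b

linear_ok = np.allclose(linear(u + c * v), linear(u) + c * linear(v))
affine_ok = np.allclose(affine(u + c * v), affine(u) + c * affine(v))
print(linear_ok, affine_ok)   # Ax passes the test, Ax + b does not

# The trick: append a constant 1 to x and fold b into the matrix.
A_aug = np.hstack([A, b[:, None]])

def augment(x):
    return np.append(x, 1.0)

same = np.allclose(A_aug @ augment(u), affine(u))
print(same)
```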
Otherwise, it wouldn't really work like how gradient descent works
You can solve the problem by taking other models or new features, maybe you make a variable called temperature below 30 and temperature above 30
gradient descent has NOTHING to do with linear functions
at all
gradient descent itself does a linearization in a neighborhood of a point, the function you minimize does not have to be linear. only differentiable
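To make the parabola question concrete, here is a sketch of gradient descent on an MSE cost for a one-parameter model y ≈ theta * x. The model is linear, but the cost is quadratic in theta, which is where the bowl shape comes from. The data, learning rate, and step count are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3.0 * x + rng.normal(scale=0.1, size=50)   # made-up data, true slope 3

def cost(theta):
    return np.mean((theta * x - y) ** 2)       # MSE: quadratic in theta

def grad(theta):
    return np.mean(2 * (theta * x - y) * x)    # derivative of the cost w.r.t. theta

theta, lr = 0.0, 0.5
for _ in range(200):
    theta -= lr * grad(theta)                  # step against the local slope

print(theta, cost(theta))
```

Nothing in the update step requires the cost to come from a linear model; it only needs a gradient.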
yea i meant its just finding the minima of the damn loss function
So that parabolic graphic depends on the problem. In some cases, Ax + b is used which doesn't make it a parabolic and in some cases where Ax+b doesn't work, MSE is preferred
right
so you trynna say that i can do polynomical regression with linear regression care to explain how please ?
Doesn't it though? If it's a line, how would you find the local minimum with learning rate?
none of those things are related to linearity
it'd be probably x = 0
I mean in the cases where the loss function is defined as Ax's
say we have a polynomial of degree 2. you can easily generalize what i'm about to do. we let y = ax^2 + bx + c. to do regression, we need pairs of observations (x_n, y_n). then we can write N equations of the form y_n = ax_n^2 + bx_n + c
we can write that in matrix form as follows
right
.latex
\[
\begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
\vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1^2 & x_1 & 1 \\
x_2^2 & x_2 & 1 \\
x_3^2 & x_3 & 1 \\
\vdots & \vdots & \vdots
\end{bmatrix} \cdot \bm{x}
\]
sigh one second
hello guys, I am trying to detect seasonalities in a financial time series that has a price for each hour of the day for an entire year. I would like to do this by using FFT, in order to do this i need to use a window function for tapering the time series, any of you got some advice on how to choose the window function ?
I'm trying to highlight two types of seasonalities, macro (between months and weeks) and micro (between days and hours)
ooof man, i hate this bot
umm why you wanna find seasonality using fft there are other ways to do it right
lol thnx a lot man for trying
.latex
\[
\begin{bmatrix}
y_1
y_2
y_3
\vdots
\end{bmatrix}
=
\begin{bmatrix}
x_1^2 & x_1 & 1 \\
x_2^2 & x_2 & 1 \\
x_3^2 & x_3 & 1 \\
& \vdots &
\end{bmatrix} \cdot \bm{x}
\]
ok there we go
i guess those should've been horizontal dots in the y vector, but nevermind
@slender kestrel In the end i aim at getting a column vector containing one number for each hourly price by which i will multiply to apply my seasonality
fourier might be a good way of doing so i thought
this effectively turns the polynomial fitting problem into one of the form y = Ax with a Vandermonde matrix A, and we find x via linear regression
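A minimal sketch of that idea in NumPy: build the matrix with columns x^2, x, 1 (np.vander does exactly this) and solve the resulting linear least-squares problem. The data and the "true" coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
# Made-up observations from y = 1.5 x^2 - 0.7 x + 2 plus a little noise.
y = 1.5 * x**2 - 0.7 * x + 2.0 + rng.normal(scale=0.05, size=x.size)

A = np.vander(x, 3)                                # columns: x^2, x, 1
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary linear least squares
print(coeffs)                                      # estimates of a, b, c
```

The fit is nonlinear in x but linear in the unknowns a, b, c, which is all that "linear regression" needs.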
also, seasonality is indeed found via FFTs
ooh i see how you are trying to explain linear regression with polynomial regression you trynna say that polynomial regression is an extension of linear if we assume x1^2 as a dimension right
that decomposes your data into sinusoids of predefined frequencies, letting you find what the periodicity of the data is
@wooden sail
yeah that's why i want to use fft
the two problems are the same thing, you can write your polynomials as vectors
very true but seasonality can be found via auto correlation too
got it got it what you were trynna say
you would usually still do an FFT of the autocorrelation
i can do both since i want to learn about seasonalities
as for the window functions, using no window function is equivalent to using a rectangular function. this convolves the spectrum with a sinc, which has good and bad properties
blackman might be good i heard
it has the highest resolution in the sense that the peaks are the narrowest, but in exchange you get side lobes that might make it seem like there are other frequency components
blackman and blackman harris are alternatives. those try to remove the side lobes but make the main lobe wider
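The trade-off can be seen on a toy signal. This sketch compares no window (rectangular) against np.blackman: one sinusoid sits exactly on an FFT bin to show main-lobe width, the other falls between bins to show leakage. The signal length, the frequencies, and the "far away" bin range are arbitrary choices of mine.

```python
import numpy as np

n = 1024
t = np.arange(n)
on_bin = np.sin(2 * np.pi * 50 * t / n)      # exactly 50 cycles: sits on an FFT bin
off_bin = np.sin(2 * np.pi * 50.5 * t / n)   # 50.5 cycles: falls between two bins

def spectrum_db(signal, window):
    mag = np.abs(np.fft.rfft(signal * window))
    return 20 * np.log10(mag / mag.max() + 1e-12)   # dB relative to the peak

windows = {"rectangular": np.ones(n), "blackman": np.blackman(n)}
results = {}
for name, w in windows.items():
    main_lobe_bins = int((spectrum_db(on_bin, w) > -6).sum())        # bins within 6 dB of the peak
    far_leakage_db = float(spectrum_db(off_bin, w)[150:400].max())   # strongest bin far from the peak
    results[name] = (main_lobe_bins, far_leakage_db)
    print(name, main_lobe_bins, round(far_leakage_db, 1))
```

The rectangular window gives the narrower peak but much more energy far from it; Blackman does the opposite.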
I mean the size of my window function will affect the type of seasonalities i get from my analysis right?
if the auto correlation graph of a function is sinusoidal (sine or cosine shaped) it wont require fft iirc
it depends which things your data is sensitive to: false positives in the trends, or high resolution (closely spaced frequency components)
how can i test for this ? @wooden sail
i am not usually impressed but you have impressed me today by your immense amount of knowledge
the easiest way is to just try a couple of them tbh, this is part of the exploratory part
ok i do a few fft with different windows and check the results
given how long i've been in uni, fourier and linear algebra are the types of things you could wake me up at 3 am to ask me questions about, and i should be able to answer
which major did you do ?
i mean linear algebra i am okish at that but fourier naah i i didnt like that a bit in uni
i did telecomm in bsc, comms and sig proc in msc, and doing more sig proc in phd
@pine escarp are you there
ooh can i dm you sometime for advice ? if you dont mind it ?
i'd rather not 😛
lol fine
i too wanted to learn more about time series data and sesaonality coz the articles avalible on medium teach you not too much
so wanted to ask you more about that stuff
i think maybe zestar is a better person to ask. i probably know the stuff with different names, but my approaches are "unorthodox" compared to what you usually see in data science
i also have one last question Edd
zestar75 that person ?
yeah
in the end as i said i would like to get a different coefficient for each hourly price of my series by which i multiply it to take into account seasonalities
i shall try pinging them @past meteor
will i be able to get this out of my fft analyisis ?
as I'm not only aiming at a graphical spectrum analysis
What's the question?
i too wanted to learn more about time series data and sesaonality coz the articles avalible on medium teach you not too much
so wanted to ask you more about that stuff
hmm not with a single fft. a single fft will assign one coefficient to each frequency/period. so say something repeats every 30 minutes, you will get one coefficient for the whole data, telling you how strong the 30 minute repetition component is. if you instead want to see how this coefficient changes every x hours, you would instead use a "spectrogram". this splits the data into sub windows, and then FFTs each of them
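A bare-bones spectrogram along those lines, written with plain NumPy so every step is visible (scipy.signal.spectrogram is a ready-made alternative). The series is made up: a daily cycle whose strength grows over eight weeks. The one-week sub-window and the Hann taper are my choices, not something prescribed above.

```python
import numpy as np

hours = np.arange(24 * 7 * 8)                      # 8 weeks of made-up hourly data
amplitude = np.linspace(1.0, 3.0, hours.size)      # daily effect grows over time
series = amplitude * np.sin(2 * np.pi * hours / 24)

window_len = 24 * 7                                # one week per sub-window
segments = series.reshape(-1, window_len)          # non-overlapping sub-windows
spectra = np.abs(np.fft.rfft(segments * np.hanning(window_len), axis=1))

daily_bin = window_len // 24                       # 7 cycles per week = 24 h period
daily_strength = spectra[:, daily_bin]             # one coefficient per week
print(daily_strength)
```

A single FFT over all eight weeks would collapse this into one number for the 24-hour component; the per-window version shows it increasing week by week.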
Forecasting is more business related time series analysis but I think this book is a great place to start, afterwards you can go for a more advanced text (this one is also rigorous, but quite practical)
alright thnx mate ! also i am looking for advice for my machine learning career can you help with that ?
the best advice is going to uni if you wanna learn it well 😛
Since when do you get this error?
yeah agree, STEM degrees are best learned at uni unless you are a genius urself
Yeah, university is key
from starting only
So ever since you installed anaconda, you get this error for jupyter notebook?
i am in uni 😭 they dont teach a crap they teach theoretical stuff not the pratical thing ... i have completed specializations like the one by andrew ng completed the statquest play list for stat and machine learning but i still think i am missing stuff
yes
if you understand the theoretical stuff, the practical part follows immediately
thanks edd gonna try some stuff and be back
read the papers where they discuss implementation details
yup i recently worked on implementing lipnet myself was able to do it but took me a hell lot of time
i am looking for research internships in universities you think my level is ok to apply for them ?
there are internships at all levels
Did you try opening the notebook using the anaconda prompt?
any other advice you can give me (other than going to uni) to improve myself in datascience more ?
read extra books and papers, not just what they ask of you in uni. if you find any tasks/projects/topics you're interested in, go ahead and play around with them
any book you want to recommend ?
My uni was very theoretical as well but the trick is to do practical work on the side
louis scharf's statistical signal processing too
Try it, it might work.
can you come in voice chat i will share the screen
Oh, sure.
alright thnx imma look into them !
I did internships, I was active in my city's data science community as a student, ...
Kaggle was a big help too
yup i make all my models on kaggle itself so far i have worked on a toxicity detector a flower classification project lip net did some basic eda projects and worked on analyzing data for my professors also completed various courses about the theory part
so now i am looking for internships but i want to do research internships in universities but i always have this feeling that i dont know much so thats why i was asking you guys if its ok to apply or not ?
Apply
Should be easy to grab research internship in unis, not many students do ML/ data science. Get your hands dirty with coding part, you already have theoretical knowledge I presume, gather few research ideas, and reach out to professors directly or ask your professor to refer you.
alright any specific universities you would like to recommend ?
thnx mate imma try reaching out to some professors in universities
@pine escarp join voice chat
Your own university?
Done.
ooh see the problem with my university is there arent many opportunities in there thats i am looking for some foreign universities to apply to
i cant share the screen
everything i have learned is on my own basically no help from uni
To do an internship? Do you have internship credits in your program?
Ohh.
yup thats for the last semester like the 8 th semester and i dont want to wait till the last semester
I'll tell you what to do.
Search anaconda prompt in search and open it.
Type jupyter notebook once it's ready to use.
How are you going to do an internship abroad when you still have class in your home university 
remote internships program ? and my university will only let me convert any other semester to internship semester if am doing a research project in some renowned university smh
down side to this is
i gotta study in the 8th semester
instead of doing internship at that time
so i am more inclined towards finding a remote internship
Idk how easy to find remote internships at foreign universities are. The remote part also defeats the purpose of an internship imo.
well so i am in a mess
basically
Are you in Europe, if so can you do an Erasmus?
naah bro india .-.
Totally fine as well, many indian students here doing their master's
i guess i should wait till graduation ;-; then only i can be free smh
You'll be fine, don't worry
I'm from India as well.
ayo fellow indian lol hello
If you already know you want to be in data science / ML you can already start the practical side of things.
🫂 thnx mate
yup trying to implement research paper on my own these days
Are you from north?
yup
Mitacs global link - Canada, DAAD - Eu uni, DSSG for UK uni, NTHU for taiwanese uni... there are tons of research intern / summer intern programs. you can apply for them
That's cool!
and not to mention you can do research intern at IITs with simple cold mailing any professor
given you got skills ofc
ooh imma try that planning to spam their dms lol jk
Data smoker.
you too indian my guy ?
Hey guys, I'm new here
Hey, welcome
Hope you enjoy your stay
I hope you improve and impact here also
@serene scaffold yo hru?
I'm just fabulous as always.
am I correct in assuming:
-
you should not standardize principal components after PCA
-
when using RandomState for reproducing results and comparing CV scores for different models you should initialize a new RandomState instance for each estimator (not declare a rng variable at the top of the program, and pass it down to any object that accepts a random_state parameter) in order to prevent them from influencing each other by consuming the RNG?
pls confirm or correct me if u don't mind 🤗
what are you calling "principal component" here? the vectors or the coefficients?
ah i had misread that as normalize. standardize as in making them have mean 0 and var 1 would indeed ruin them. PCA gives an orthonormal basis, so normalization is already taken care of there. due to orthogonality you can also straightforwardly conclude that all the coefficients of the vectors are between -1 and 1 if the input vectors to be PCA'd have magnitude 1. the distribution of the coefficients per input vector is arbitrary though and you can't change it, otherwise they don't synthesize the data back.
I'm really sorry but I gotta ask... Did y'all learn all this PCA stuff from school or self taught?
I'm new to this that's why I'm asking
I meant what you get when you call fit_transform (sklearn.decomposition.PCA), could be using the wrong terminology here..
You need to rescale your data to unit variance before doing PCA. Afterwards it should still be unit variance, if in doubt just plot your data
learned matrix decompositions in uni
idk what that gives you
PCA decomposes into orthogonal vectors and their coefficients, sklearn should give you both
fully self taught
I'd set the seed on every single instance if I'm really bothered about having complete reproducibility
But that's mostly because I'm unsure if you'll have it fully reproducible if you set the seed on the top
Personally I'm mostly worried about the reproducibility of my data splitting and not more than that (specifically each estimator)
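A sketch of why a shared RandomState instance can make results depend on call order. fake_estimator is a hypothetical stand-in for anything that draws random numbers while fitting; it is not a scikit-learn object.

```python
import numpy as np

def fake_estimator(random_state):
    # Hypothetical stand-in for an estimator that consumes random numbers.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)
    return rng.rand(3)

shared = np.random.RandomState(0)
a_first = fake_estimator(shared)
b_after_a = fake_estimator(shared)                    # depends on how much A consumed

b_alone = fake_estimator(np.random.RandomState(0))    # fresh instance, independent of A
order_matters = not np.allclose(b_after_a, b_alone)
seeds_repeat = np.allclose(fake_estimator(0), fake_estimator(0))
print(order_matters, seeds_repeat)
```

Passing an integer seed (or a fresh instance) to each object keeps them from influencing each other through the shared stream.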
Cool...
Same here... YouTube is my best friend right now
Thanks, I will get there
I'm guessing it's the vectors then, cus it's not the things I call "loadings", which I think is the coefficients. Getting the terminology right is the hardest part..
it basically returns a numpy array with each col representing a Principal component, and they are ordered by variance.
Just code up PCA with numpy
It's ~5-10 lines of code. Do it once and the algorithm will make sense forever
closer to 3 😛
anyway, the principal vectors are orthonormal, and you want them that way
they have unit norm already
sounds like "rewrite tensorflow in 100 lines of C++" to me..
No, PCA is very simple
I'm not that good at numpy
doesn't need to worry about seed or random states in ml based algorithms, it gets harder to reproduce results when dealing with NNs specifically tensorflow, where it is almost impossible to reproduce same results even if we keep everything same :))
1) calculate covariance matrix 2) calculate eigenvalues, eigenvectors 3) sort by eigenvalues 4) do matrix multiplication
Each of these things have a numpy "verb" so it's just chaining stuff that exists off the shelf together 🙂
maybe when I buy some more IQ
No, you're 100 % smart enough to do this @sleek harbor don't underestimate yourself
And once you do it, you'll have something like "wow, was that all that there was to it?"
hmm.. sounds fishy 🐟
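import numpy as np
data = ... # size n x m; let n be the data length, m the number of samples
m = data.shape[1]
centered_data = data - np.mean(data, axis=1, keepdims=True) # keepdims so the (n, 1) mean broadcasts across the m columns
cov = centered_data @ centered_data.T / m # size n x n covariance matrix
principal_components, _, _ = np.linalg.svd(cov) # columns are the principal vectors, largest variance first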
that's a PCA for you
No lol... He's right
using an SVD has the advantage of canonically being ordered by the size of the singular values, so it saves you the sorting. it's also equivalent to the EVD for symmetric matrices, which all covariance matrices are
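A quick check of that equivalence on made-up data: the SVD of the covariance matrix and its eigendecomposition give the same values and the same vectors up to sign, once the eigenvalues are sorted. np.linalg.eigh is used because the matrix is symmetric; it returns eigenvalues in ascending order, hence the extra sorting line.

```python
import numpy as np

rng = np.random.default_rng(0)
scales = np.array([[3.0], [2.0], [1.0], [0.5]])
data = rng.normal(size=(4, 200)) * scales            # n x m made-up data
centered = data - data.mean(axis=1, keepdims=True)
cov = centered @ centered.T / data.shape[1]

u, s, _ = np.linalg.svd(cov)                         # already sorted, largest first

eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
order = np.argsort(eigvals)[::-1]                    # the one extra line for the EVD route
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

values_match = np.allclose(s, eigvals)
vectors_match = np.allclose(np.abs(u), np.abs(eigvecs))   # equal up to sign
print(values_match, vectors_match)
```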
Reminded me of funny incident, When I was doing titanic and other tutorial problems to learn, I extracted features from the indexing column provided(random string values), and I got little boost in cv scores, I got so happy. It was probably a boost from randomness or param tweaks haha
that last line feels like cheating. Also I don't quite know the math of SVD. I understand PCA via the visualization in this vid :p https://youtu.be/FgakZw6K1QQ
then do EVD instead
but idk what you find to be "cheating" about it, it's identical to the EVD
no one ever computes eigenvalues and eigenvectors by hand for anything larger than a 3x3 matrix, if that's what bothers you
I mean the part were there's just one function call. That's cool, but I was expecting.. more code 🤣
i told you it was like 4 lines
so did zestar 😛
if you use the EVD instead, you need 1 more line to sort by eigenvalue
means = mean(threes);
threes_cent = threes - means;
covariance_matrix = cov(threes);
[V, d] = eigs(covariance_matrix , i);
decomposed = threes_cent * V * V';
decomposed = decomposed + means;
This is PCA in full, in matlab
can someone help with this, why it is not renamed ?
Matlab and numpy are twins so it should be readable
your X is capitalized, when it shouldn't be :3
Most courses "forced" us to do algorithms by hand, which was a good thing in hindsight because then you know what it's doing
guys I am using this book to get started with ML
This book is written to provide a strong foundation in machine learning using Python libraries by providing real-life case studies and examples. It covers topics such as foundations of machine learning, introduction to Python, descriptive analytics and predictive analytics. Advanced machine learn...
Fundamentally the building blocks aren't hard, just having a high level understanding is fine. Then later you can ask yourself questions like "why the covariance matrix", "why the eigenvalues", "how do eigenvalues relate to the cov matrix", "why is the reconstruction error ~ 0 if num_components == num_features"
@abstract rune https://www.statlearning.com/ is the best course text for ML in my humble opinion. Yes the examples are in R but there are Python equivalents for nearly every algo.
this is the table of contents of this book
I think I will stick to this book because continuously changing books will be a hassle, can someone who has learned ML confirm the table of contents of this book
it's fine
I'm waiting for the python version.. should come out this summer
Just go with the R version for now, code is a relatively small part of the book
Like if you see lm(income ~ age + year_of_experience + level_of_education) it's trivial to map that to Python's linear regression
I feel like the Python version will use statsmodels or something cursed anyway so that's a wrap
as long as u can guess that lm stands for lin model..
The book definitely mentions that lm stands for linear model 🙂
what does income ~ age mean? cus i read that as "income is not age"
income is a function of all those variables
now how would I guess that?
I just dropped that here out of context, in the book there's a logical flow so when you'd read it, it'd make sense from the context
lm(income ~ log(age) + years_of_experience + level_of_education + (years_of_experience * level_of_education)) is possible as well, very flexible stuff
i was debating on going for a masters in financial algorithms for analysis (with R).. maybe I should've went for it..
R as a programming language is so horrible
good thing I didn't go for it then :3
But it has a few nice ideas for specifically statistics
I'm talking about the language itself, not what can be done in it
I'm generally not a fan of languages that are dynamically typed and have less strong typing than Python. You know, languages that do a lot of casting like JS, PHP and R
You gotta be really awake because they'll do stuff that is imo silently failing, sort(c("1", 2, "3", "four", 5, 6)) is equivalent to sorted(["1", 2, "3", "four", 5, 6]) . In Python the latter (luckily) errors out while in those langs it does not
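The Python side of that comparison, as a small sketch. Sorting the mixed list raises a TypeError because int and str cannot be compared; you only get the silent "everything is a string" ordering if you ask for it explicitly with a key.

```python
mixed = ["1", 2, "3", "four", 5, 6]

error_message = None
try:
    sorted(mixed)                       # int vs str comparison is not defined
except TypeError as err:
    error_message = str(err)
print("TypeError:", error_message)

as_strings = sorted(mixed, key=str)     # explicit opt-in to string ordering
print(as_strings)
```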
As dynamically typed languages go it's well designed imo but ig there's a limit to what you can do
Hence why I use mypy judiciously. For stuff that's not strictly data I might someday look for an alternative with more type safety but doesn't feel as verbose as idk Java.
could anyone help me with a whatsapp gpt trying to integrate stripe into it. having difficulties creating the cancel subscription
If i run the OpenAI CLI fine_tunes.follow command and my stream disconnects is it still being trained and processed in the backend?
Stream interrupted (client disconnected).
To resume the stream, run:
openai api fine_tunes.follow -i <model id>
My favorite Python related bug is setting a member variable on a object, but there was a typo and it just silently creates a new member with that name.
Then I spend an hour wondering why the variable has the wrong value.
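A sketch of that bug and one way to guard against it. The class and attribute names are made up; __slots__ is a standard Python feature that restricts which attribute names an instance may have, so a typo fails loudly instead of creating a new member.

```python
class Loose:
    def __init__(self):
        self.threshold = 0.5

class Strict:
    __slots__ = ("threshold",)          # only these attribute names are allowed

    def __init__(self):
        self.threshold = 0.5

loose = Loose()
loose.treshold = 0.9                    # typo: silently creates a new attribute
print(loose.threshold)                  # the real attribute is unchanged

strict = Strict()
strict_error = None
try:
    strict.treshold = 0.9               # same typo, now rejected
except AttributeError as err:
    strict_error = str(err)
print("AttributeError:", strict_error)
```

A static checker such as mypy would also flag the typo before the code runs.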
everyone should just use getters and setters /s
Hello, guys should I start taking "world quant data science lab" course, or should I focus on something else?
who's teaching that course, what does it cover, what does the course expect you to know before taking it, and how much of that material do you already know?
Is there a really good guide to kaggle that has like the top 5-10 or so challenges that ramp up in difficulty so you can learn as you go?
matplotlib, what are left and bottom in add_axes?
hey
I've learned from a tutorial that ridge regression basically sets the theta of useless features to closer to 0. But I didn't get how it knows whether a feature is useless
it draws all features closer to zero
the usual gradient descent process just outweights that effect for the actually useful ones
How does gradient descent know which one is useful?
the same way it knows what to update for normal linear regression models
ah
so instead of going forward everytime it updates itself, it starts from 0, right?
it isn't "totally useless" / "totally useful", more of "at some point it is useful enough to resist the pull towards 0"
Yeah and I wanted to know the criterias that computer utilises to determine whether it's useful
that'd be the loss function and back-propagation mechanism (aka calculating gradients)
Ah I see
what Ridge regression changes compared to linear regression in the end of the day is just adding a term to the loss function that increases the loss based on the weights
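In code, the extra term looks like this. A sketch on made-up data where only the first feature matters; lam is the regularization strength, and the closed-form solution is used instead of gradient descent just to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # only feature 0 is useful (made-up)

def ridge_loss(w, lam):
    # Ordinary squared error plus a penalty that grows with the weights.
    return np.sum((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def ridge_fit(lam):
    # Closed-form minimizer of ridge_loss.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 10.0, 1000.0):
    print(lam, ridge_fit(lam))
```

All weights get pulled toward zero as lam grows, but the useful one resists the pull far longer because shrinking it costs a lot of squared error.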
Hello guys, i need some advices
Ridge regression draws all the features to closer to 0 where lasso regression draws them to 0 but I don't think that it'd affect the model much, since 0.0001 can be assumed as 0. Why would you choose ridge over lasso, though?
Let's set some background here, I am studying a financial time series which is representing the hourly price of electricity for each day of the year 2022 so I have a Pandas DataFrame containing two columns:
The first one is a TimeIndex containing the date and hour in a datetime format of Pandas. The second one contains the price associated to each hour.
I am studying this sample to make some predictions on the hourly price of let's say 2024. For this purpose I would like to do an in depth study of the seasonality patterns in this time series, on multiple granularity levels (hours, days, weeks, months and quarters).
In the end I would like to obtain a column vector of 8760 scalars corresponding to the 8760 hours of the year that are priced. Those scalars would represent a seasonality coefficient that I will use to make my predictions for the year coming.
Now comes the reason that i am here, I thought of doing this search for seasonality using FFT and an appropriate window function. I would like to know from you guys which window function should I use for this purpose, I know each one has its advantages and disadvantages. I would also like to know how should I choose the width of my window as obviously this will have an effect on the FFT performed. I am also open to advices on how to complete this seasonality study, would you do it another way? Which tool would you use?
This is a general question, I am looking for other people's opinion on how to do this research, I am not encountering any particular coding problem for the moment
copy paste?
and this btw
iirc Ridge and Lasso do the same thing?
the difference is that Ridge operates on the square of the weights, while Lasso operates on their absolute value
https://scikit-learn.org/stable/modules/linear_model.html#regression
https://scikit-learn.org/stable/modules/linear_model.html#lasso
Are you familiar with arima? Since I see seasonality, just wanted to make sure
Big oof, I do kind of like being able to do monkey patching though
I never use it because it's a bit janky but I ... like the fact I can do it
The intuition is that your model has a "budget" it can spend to get a certain performance because you punish it for increasing the weights
So you can just remember, for now, that it finds a way to get the best value for money performance-wise, which means setting some values closer to 0.
As for why they don't hit exactly 0 with ridge regression and they do with lasso, you can look at the equations for that.
Write out the partial derivative of the L2-penalized loss with respect to an arbitrary coefficient and see under what conditions beta gets to be exactly 0. It only happens in the limit lambda (regularization strength) -> inf. For lasso this is not the case: a finite lambda is enough. (frequentist pov)
There's also the Bayesian statistics way to look at it. No regularization == uniform prior, ridge == gaussian prior and lasso == laplacian prior. If you look at the laplacian, you see a sharp peak at 0 with a big drop off, so the most probable (MAP) value of a coefficient can land exactly on 0, while the gaussian is smooth and rounded around 0, so coefficients only get pulled toward 0.
Honestly, idk how much value knowing this is if you're a "practitioner" and not someone making methods 🙂
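To make the ridge vs. lasso point concrete, here is a minimal sketch for the simplest possible case: a single coefficient (equivalently, orthonormal features), where both penalized problems have a closed-form solution. The function names `ridge_coef` and `lasso_coef` and the value of `b_ols` are made up for illustration; this is not how scikit-learn solves the general problem.

```python
def ridge_coef(b_ols, lam):
    # Minimizer of 0.5 * (b - b_ols)**2 + 0.5 * lam * b**2.
    # Setting the derivative to zero gives b = b_ols / (1 + lam),
    # which only reaches 0 in the limit lam -> infinity.
    return b_ols / (1.0 + lam)


def lasso_coef(b_ols, lam):
    # Minimizer of 0.5 * (b - b_ols)**2 + lam * abs(b).
    # This is soft-thresholding: the coefficient is exactly 0
    # as soon as lam >= abs(b_ols).
    shrunk = max(abs(b_ols) - lam, 0.0)
    return shrunk if b_ols >= 0 else -shrunk


b_ols = 0.8  # hypothetical unregularized (least-squares) coefficient
for lam in (0.1, 0.5, 1.0, 10.0):
    print(lam, ridge_coef(b_ols, lam), lasso_coef(b_ols, lam))
```

In this toy setting the ridge value keeps shrinking but stays nonzero for any finite `lam`, while the lasso value becomes exactly 0 once `lam` reaches 0.8.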
Could still have that, but require manually calling setattr or something.
Stop being sane and very reasonable 🤣
First of all, you can do an FFT on the whole time series if you like. An FFT length of 8760 should be easy.
Second, the FFT only gives you the results you want if everything lines up perfectly in time, and for calendars it doesn't. Suppose, for example, that there is a monthly effect (for example, perhaps something happens on the first of the month). Unfortunately, months don't all have the same number of days: January has 31 days, February has 28, and so on. So these effects will be unevenly spaced and hence not clearly visible if you do an FFT. Or suppose that you think there's a weekly component (a pretty reasonable guess, since people's activity is different on weekdays and weekends). The year is not a round number of weeks: It's 365 days, and 365 = 52 * 7 + 1. Consequently there is no FFT frequency bin that corresponds exactly to a weekly period, and a weekly effect gets smeared across neighbouring bins.
Third, if you only have a single year of data then you will need to smooth your results pretty heavily. You say you want 8760 scalars, one for each hour of the year. Well, you started with 8760 scalars, one for each hour of the year. If the price in each hour of the year was independent of the price in every other hour of the year, then the maximum likelihood estimate of next year's price would just be last year's price. The only reason why your problem is complicated is that you expect the prices are not independent. Really your question about seasonality is about how to measure certain possible kinds of non-independence. And what you have to hope is that the interdependence of the different variables is strong enough and discoverable enough that you can actually predict 8760 scalars reliably.
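The point about calendar periods not lining up with FFT frequencies can be checked numerically. The sketch below uses synthetic pure sinusoids (not the price data from the question) over 8760 hourly samples, and the helper `peak_share` is a made-up name: a 24-hour cycle fits the year exactly 365 times, while a 168-hour (weekly) cycle fits about 52.14 times and therefore falls between two bins.

```python
import numpy as np

N = 8760  # hours in a non-leap year
t = np.arange(N)

daily = np.sin(2 * np.pi * t / 24)    # synthetic signal with a 24 h period
weekly = np.sin(2 * np.pi * t / 168)  # synthetic signal with a 168 h period


def peak_share(x):
    """Return (index of the strongest bin, share of spectral power in it)."""
    power = np.abs(np.fft.rfft(x)) ** 2
    k = int(np.argmax(power))
    return k, float(power[k] / power.sum())


daily_bin, daily_share = peak_share(daily)
weekly_bin, weekly_share = peak_share(weekly)

print("cycles per year:", N / 24, "peak bin:", daily_bin, "share:", daily_share)
print("cycles per year:", N / 168, "peak bin:", weekly_bin, "share:", weekly_share)
```

The daily signal should put essentially all of its power into a single bin, while the weekly one leaks part of its power into neighbouring bins. A window function reduces the far-off leakage but cannot create a bin at exactly one cycle per week.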
In addition to Kyle's answer, you might find some value in this year's ClimateAI workshop at ICLR in Rwanda https://colab.research.google.com/github/bitstoenergy/iclr-tutorial/blob/main/SmartMeterDataAnalytics_Tutorial.ipynb#scrollTo=qcgRU09V3LMh
hi, im new to ml and im trying to build a food recommendation system. i'm not sure what type of model to use though. i did a little bit of research and found that a hybrid of collaborative filtering and content-based filtering could be ideal for what i want. for hybrid, i'm not sure where to find these models to use. in addition, how does one train a model? I understand you feed it data but what type of data do i feed
How one trains a model depends on a lot of things. For the moment, I'm not clear on what you want the user experience to be
what do you mean by user experience? like what data the user will have?
What information are they expected to give to the model
i was planning like
gender, age, ethnicity, food type preferences (spicy etc) and as they interact with the app, it'd gather more info. clicks, ratings, reviews etc
what are your thoughts on this? It seems to have dropped real quick which concerns me. When training on a smaller dataset it took 80 epochs to go from 10K to 14.2, now when I moved to a dataset of 10K images it has this behavior
what do you plan to do with the gender, age, and ethnicity information?
do you expect that that will inform their food preferences in some way?
oh man this is weird lol
Hi everyone, I think I found a bug in pandas
import pandas as pd
date_1 = pd.to_datetime("2012-02-05")
print(date_1 - pd.offsets.MonthBegin())
# prints 2012-02-01 (prints a timestamp but this is the date)
date_2 = pd.to_datetime("2012-02-01")
print(date_2 - pd.offsets.MonthBegin())
# prints 2012-01-01
I want date_2 to remain as is because that's the beginning of the month. How do I do this if both kinds of dates are in the same column?
just floor instead of using an offset?
never mind, not sure
maybe .replace(day=1)
That works👍. Never knew there was a replace for Timestamp. Thanks!!
well age and ethnicity i feel like have some impact on food preferences so to answer your question, yes
@agile cobalt my mistake, the series dtype is datetime64[ns] and it has no replace method. I tried accessing .dt and then doing replace, which also doesn't work
no clue, my last guess would be trying something like se - pd.to_timedelta(se.dt.day - 1, unit='D') but idk which kind of timedelta would fit
something like resample might work depending on what exactly you're doing
I feel like this can just be a non-AI app
AI seems overkill
just do some research
like what does gender have anything to do with it
food preferences yes
but again, you dont need AI for that
AI everything innit
how bout for dietary preferences? cuisines and stuff. if not, how would i accomplish this
Thanks! But I resolved it by doing date_series.dt.to_period('M').dt.to_timestamp()
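For reference, a small sketch of the month-start options discussed above, run on a made-up three-row Series. The behaviour in the original question is by design rather than a bug: subtracting `MonthBegin()` from a date that is already a month start moves it to the previous month start.

```python
import pandas as pd

# Made-up sample: a mid-month date, a month start, and a month end.
dates = pd.Series(pd.to_datetime(["2012-02-05", "2012-02-01", "2012-03-31"]))

# Approach from the chat: round-trip through a monthly Period.
via_period = dates.dt.to_period("M").dt.to_timestamp()

# Alternative: subtract (day - 1) days. Unlike the Period round-trip,
# this keeps any time-of-day component (the sample here is all midnight).
via_timedelta = dates - pd.to_timedelta(dates.dt.day - 1, unit="D")

# For a single Timestamp, rollback() leaves a date that is already
# on the offset unchanged.
single = pd.offsets.MonthBegin().rollback(pd.Timestamp("2012-02-01"))

print(via_period.tolist())
print(via_timedelta.tolist())
print(single)
```

All three rows should map to the first day of their own month, with 2012-02-01 staying where it is.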
have i implemented my training loop correctly? the LR goes really high after I implemented L1 and L2 for my regression problem but the shape of the graph is still kinda the same, it's just that the numerical value of the error itself got really high
https://pastebin.com/G5hngfd6
ooh, great explanation, thanks a lot! 🙏
We need a freelancer who can build prediction model based on the dataset according to requirement.
I have trained a self organising map. Do SOMs have a stochastic nature? Why am I getting the same SOM after re-running the code?
lol it did go down real quick i have made a couple of models it never really goes down that quick imo
someone had the same issue on stack overflow i can link you the question if you want
Anyone help plz ?
I programmed a simple 2d topdown shooter and want to apply a tensorflow agent on it. Should I use the absolute coordinates of the enemy or the coordinates relative to the player as input?
whats the command you used for this?
Depends on how you want your agent to act
I'd say relative - when I was doing Nethack RL stuff I expressed everything in relative coordinates
Yea, now that I really think about it there is probably no advantage to giving absolute coordinates and they would probably internally be converted to relative coords or even just the distance
If you don't encode your agent's position and you're using absolute coordinates you'd have a bad agent I think
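A rough sketch of what "relative coordinates" could look like as an observation, assuming 2D positions stored as `(x, y)` tuples. The function name `relative_observation` and the sample positions are invented for illustration and are not tied to any TensorFlow Agents API.

```python
import math


def relative_observation(player_xy, enemy_xy):
    """Enemy position expressed relative to the player, plus the distance."""
    dx = enemy_xy[0] - player_xy[0]
    dy = enemy_xy[1] - player_xy[1]
    return dx, dy, math.hypot(dx, dy)


# Two different absolute situations with the same relative geometry.
obs_a = relative_observation((10.0, 5.0), (13.0, 9.0))
obs_b = relative_observation((110.0, 205.0), (113.0, 209.0))
print(obs_a, obs_b)
```

Both situations produce the same observation, so the agent does not have to learn separately that they call for the same action. If walls or map edges matter, some absolute information may still be needed.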
I am working on a movies dataset. These 2 images contain the collection names with the sum and mean of revenue and budget for each collection.
When I took the sum of revenue, I found the Harry Potter collection in 1st place.
But when I took the mean of revenue, I found the Avengers collection in 1st place.
So which of the two is truly the bigger success?
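One way to see why the two rankings can disagree is to put the film count next to the sum and the mean. The sketch below uses made-up numbers for two hypothetical collections "A" and "B" (not real box-office figures, and not the dataset from the screenshots); the column names are assumptions too.

```python
import pandas as pd

# Made-up revenue figures in arbitrary units, NOT real box-office data.
movies = pd.DataFrame({
    "collection": ["A", "A", "A", "A", "B", "B"],
    "revenue": [300, 320, 280, 300, 500, 460],
})

# Total, per-film average, and number of films for each collection.
summary = (
    movies.groupby("collection")["revenue"]
    .agg(total="sum", mean="mean", films="count")
)
print(summary)
```

Here "A" wins on total because it has more films, while "B" wins on the per-film mean. Which one counts as the bigger success depends on whether the question is about the franchise as a whole or about the typical film in it.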