#data-science-and-ml


potent sky
#

Using latest_checkpoint() method? I think it takes as arg the directory where you have those checkpoints

past meteor
#

I'm not using it for anything particularly important downstream hence why I was worried about the overengineering part. At best it's just something we might use if we're doing workshops and we want to quickly generate data on a topic if we want to talk about it.

brave sand
# potent sky Using latest_checkpoint() method? I think it takes as arg the directory where yo...

I used this to generate the .ckpt file:

import tensorflow as tf

# Path to the .meta file
meta_path = '/home/ethan/UAS4STEM/models/checkpoints/model.ckpt.meta'

# Create a session
with tf.compat.v1.Session() as sess:
    # Restore the graph structure from the .meta file
    saver = tf.compat.v1.train.import_meta_graph(meta_path)
    # Load variable values from the existing checkpoint (restore() reads a checkpoint; it does not write one)
    saver.restore(sess, '/home/ethan/UAS4STEM/models/checkpoints/model.ckpt')
potent sky
potent sky
brave sand
potent sky
brave sand
#

oh

#

what should it be then? also, what is a good value for num_steps?

past meteor
brave sand
#

and batchsize

potent sky
brave sand
#

my batch_size is 64 and my num_steps is 1000

potent sky
brave sand
potent sky
potent sky
# brave sand and batchsize

same for this but you don't want to keep it too small as the updates to the gradient might be "volatile" then

past meteor
#

It's truly a case of drawing up a PGM for any topic you care about, picking values and then just simulating 😛

brave sand
#

yeah, i think the problem is that the .ckpt file is 0 bytes

#

which means it'll be stuck on the loading thing

#

i have no idea why it's 0 bytes though

potent sky
brave sand
#

the other 2 .ckpt files aren't 0 bytes

past meteor
#

I haven't heard of rule based generation, I'll look it up

past meteor
#

probabilistic graphical model, sorry

potent sky
#

that should also make it a bit non-boring xd

#

I think there was a paper on smtg similar but I'm not sure

potent sky
brave sand
#

is that bad?

potent sky
potent sky
brave sand
past meteor
potent sky
potent sky
past meteor
#

Yeah, I just took that one but it could be any distribution

#

It's been a while since I've done anything related to graphical models but I think that idea might be subsumed by them

potent sky
#

By model here I mean anything which can replicate the desired distribution

past meteor
#

I think I'll just go over my old slides again but in the meantime I'll also think of conceptually simpler ways to do this 😛

potent sky
past meteor
#

One of the many things of university that are squarely under YAGNI

potent sky
past meteor
#

Then you'll enjoy spending time with graphical models. They're really at the intersection of CS, stats and domain knowledge

potent sky
#

Yeah I was collaborating on a graphical models x federated learning project but I was mostly handling the fl part
Gotta dive into graphical models properly

potent sky
potent sky
#

Should work

past meteor
#

Not a lot of value - many of my former profs peaked in the late 90s early 00s so a lot of time was spent on esoteric things like that or restricted boltzmann machines when more relevant techniques existed. Took "offense" with the RBMs because I'm a relatively recent grad and covering other generative methods in more detail would've been more relevant but I digress

potent sky
#

But relevance is important yeah. Maybe a good overview of RBMs and then they could've moved on xd

past meteor
potent sky
#

Looks interesting. And that too goes onto the impossible pile of "I'll check it out"

past meteor
#

Put it at the bottom of that pile. Either way, enjoy your day 🙂

potent sky
#

Haha sure, you too!

brave sand
#

@potent sky hey do you mind explaining batch size?

crimson summit
#

thank you !!!!!

#

I wouldn't worry about whether models have progressed or not. I would learn the basics inside and out, because that deep understanding will let you use and build with the current models much better. I followed along building the neural network in a book called Make Your Own Neural Network by Tariq Rashid. It was good and got the job done, but there are definitely better tutorials out there. I would also watch and understand all of the videos 3blue1brown has put out on YouTube about neural networks.

potent sky
brave sand
#

oh ok, so not something i should be worrying about

potent sky
#

You'd select your batch size depending on how your data is distributed and how many instances you want to process before taking a learning step
In case of larger models, it also depends on how much memory you have
Popular choices for batch size are 16, 32, 64 etc
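
To make the link between batch size and learning steps concrete, here is a minimal pure-Python sketch. The dataset size of 2000 is a made-up number; the batch size of 64 and the 1000 steps are the values mentioned earlier in the chat. It only counts batches and does not train anything.

```python
import math


def iter_minibatches(n_examples, batch_size):
    """Yield (start, stop) index pairs covering the dataset once (one epoch)."""
    for start in range(0, n_examples, batch_size):
        yield start, min(start + batch_size, n_examples)


def steps_per_epoch(n_examples, batch_size):
    """One learning step is taken per batch, so this is the batch count."""
    return math.ceil(n_examples / batch_size)


n_examples = 2000   # hypothetical dataset size
batch_size = 64     # value mentioned in the chat
num_steps = 1000    # value mentioned in the chat

batches = list(iter_minibatches(n_examples, batch_size))
print(len(batches), steps_per_epoch(n_examples, batch_size))
print(num_steps * batch_size)  # examples processed over the whole run
print(num_steps / steps_per_epoch(n_examples, batch_size))  # approx. epochs
```

A smaller batch size means more steps per epoch, each based on fewer examples, which is why the individual updates get noisier.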

potent sky
brave sand
#

https://www.youtube.com/watch?v=amURyS6CAaY
around 20:22, he sets the checkpoint to ckpt-0, I don't get how though, what does that represent

brave sand
potent sky
#

Cool enough

past meteor
brave sand
#

I am training on GPU

#

8 GB vram

#

the process is always killed though:
2023-06-07 16:05:21.075792: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_24' with dtype resource [[{{node Placeholder/_24}}]] Killed

vernal ore
#

Is there a xgboost-gpu version for Windows 10? When I do pip install xgboost-gpu i get the error: ERROR: Could not find a version that satisfies the requirement xgboost-gpu (from versions: none)
ERROR: No matching distribution found for xgboost-gpu

past meteor
brave sand
#

Can someone please help with this error:

         [[{{node Placeholder/_0}}]]
2023-06-07 21:29:23.493493: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_29' with dtype int64
         [[{{node Placeholder/_29}}]]
Killed
hollow meadow
#

I need help with the code for paraphrasing text. Can someone here help, or should I ask in another channel?

hollow meadow
alpine temple
#

Way too ambiguous if you're after anything reliable.

#

Hey room, I'm trying to think of a way in which I can analyse the results of two different classification models that I made with 37 classes.

#

Here is a confusion matrix which is a little misleading, because the scales are not the same.

#

EfficientNet:

#

This is SqueezeNet

#

I also have the confusion matrix for each of these stored in a DataFrame.

past meteor
#

top 5 accuracy and top 1 accuracy?

alpine temple
#

1.) I will be adjusting the scales on the confusion matrix images.
2.) Do you have any ideas on how I should review why certain breeds of animal might have been misclassified?
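
One way to approach both points, sketched under assumptions: normalise each row of the confusion matrix so the two models share a 0 to 1 scale, then list the largest off-diagonal cells to see which classes get mistaken for which. The helper names and the tiny 3-class matrix below are made up for illustration; the real matrices have 37 classes.

```python
import numpy as np
import pandas as pd


def normalize_rows(cm):
    """Convert counts to per-class rates so models and classes are comparable."""
    return cm.div(cm.sum(axis=1), axis=0)


def top_confusions(cm, k=5):
    """Return the k largest off-diagonal cells as (true, predicted, rate)."""
    rates = normalize_rows(cm)
    off_diag = rates.mask(np.eye(len(rates), dtype=bool))  # hide correct hits
    pairs = off_diag.stack().dropna().sort_values(ascending=False).head(k)
    return [(t, p, float(r)) for (t, p), r in pairs.items()]


# Hypothetical 3-class example (rows = true class, columns = predicted class).
labels = ["beagle", "pug", "siamese"]
cm = pd.DataFrame([[8, 2, 0], [1, 9, 0], [0, 3, 7]], index=labels, columns=labels)
print(top_confusions(cm, k=2))
```

Running the same function on both models' DataFrames gives two shortlists of confused pairs that can be compared directly, and the images behind those pairs can then be inspected by hand.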

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

alpine temple
#

And by shortlist, I mean I'll do that.

hollow meadow
#

thank you

potent sky
hollow meadow
#

https://paste.pythondiscord.com/ucokovugib

I am not so good at this topic yet but maybe someone here is more knowledgeable. I am working with NLTK and other NLP tools; my goal is to paraphrase text by using grammatical rules to identify synonyms, conjugate them, and then substitute them into the text. But the conjugation rules don't work properly: the words in the text stay in their base form and don't even get replaced when a synonym is found. That's quite a big problem, but maybe someone can at least help me work out why the grammatical rules aren't working; maybe I went wrong somewhere when writing the algorithm.

#

or advice where I can ask about this so someone can help

alpine temple
alpine temple
#

I would also specifically be looking at something like BERT.

#

Which is really good at understanding context.

vital widget
#

Anyone working with GLCM, LBP and GABOR filters? I need to ask a few questions about my homework

rose dagger
#

Are learning curves like these a sign of overfitting? Or does it look fine and i should just add more epochs?

mild dirge
#

It looks like the accuracy is all over the place, did you test it on a very small number of test examples?

#

I doubt you can draw very many conclusions from this

rose dagger
cold osprey
#

hmm the acc graph looks weird

rose dagger
#

indeed it looks weird lol

mild dirge
#

Yeah either your model changes very chaotically, or the measurement isn't very accurate

#

80 images (duplicated 3 times each) isn't that much data

#

Try running the model like 10 times, and then average over those 10 runs

#

Also think it's a bit sketchy to augment your test data, normally you want to keep it as is

rose dagger
mild dirge
#

You shouldn't generally augment your test data

rose dagger
#

Sorry i meant training data

mild dirge
#

How have you split it now into train/validate/test?

#

Could also use k-fold cross validation

rose dagger
rose dagger
mild dirge
#

Split it 80/20 into train and test. Perform k-fold cross-validation on the train data

#

And you could even run k-fold more times to average it over more runs
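
A minimal pure-Python sketch of that recipe, assuming 80 original images as mentioned above (function names are made up for illustration): hold out 20% as a test set, then run k-fold cross-validation on the remaining indices.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each item validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, val


train_idx, test_idx = train_test_split_indices(80)  # 80 images, as in the chat
for fold_train, fold_val in k_fold(train_idx, k=5):
    # Fit on fold_train, score on fold_val, then average the k scores.
    # Augment only fold_train; leave fold_val and test_idx untouched.
    print(len(fold_train), len(fold_val))
```

Splitting on the original image indices before augmenting keeps augmented copies of one image from landing on both sides of a split. Repeating the loop with different seeds and averaging gives the "run k-fold more times" variant.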

rose dagger
#

Thank you so much for the quick help!

uncut iris
quick bay
#

Does anyone have any experience with classification of time series data?

past meteor
quick bay
#

What I mean is like this.
I have a time series data set of sea level anomaly.
The feature in the dataset is 'sea level anomaly' and it has 3 classes (so the data is labeled) .
Class 0 : normal
Class 1: anomaly-nontsunami
Class 2: anomaly-tsunami
So basically, i would like to build a model to predict the classes based on the 'sea level anomaly'

past meteor
#

Yes but it's still not clear.

Do you have a sequence of data points for which you want to make 1 prediction? For example, a full day's worth of measurements and you want to predict if an anomaly will occur the next day.

Or is your case one where you want to predict something for each subsequent input in the sequence?

quick bay
#

Each subsequent input in the sequence

past meteor
#

123456789 is our series

123 -> label(4) | 234 -> label(5) ...

That's how your problem is structured right?

#

What you could do is making rolling features. You take the last n features and you compute some stuff on that, for example summary statistics and that's your feature vector
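
A small sketch of the rolling-feature idea, using only the standard library. The window length, the choice of summary statistics, and the labels are assumptions made for illustration; only the toy series 1..9 comes from the message above.

```python
from statistics import mean, pstdev


def rolling_features(series, labels, window=3):
    """Build one feature row per position from the previous `window` values.

    Row i summarises series[i-window:i] and is paired with labels[i],
    mirroring the "123 -> label(4)" layout. Requires window >= 2.
    """
    rows, targets = [], []
    for i in range(window, len(series)):
        past = series[i - window:i]
        rows.append({
            "mean": mean(past),
            "std": pstdev(past),
            "min": min(past),
            "max": max(past),
            "last_diff": past[-1] - past[-2],
        })
        targets.append(labels[i])
    return rows, targets


series = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # the toy series from the chat
labels = [0, 0, 0, 0, 1, 1, 2, 0, 0]   # made-up class labels
X, y = rolling_features(series, labels, window=3)
print(X[0], y[0])
```

Each row only looks at values before the position it is predicting, so no information from the labelled time step leaks into its features.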

quick bay
#

Would you like to discuss it in private chat? I'll send the dataset and what I've been working on, maybe you can evaluate whether what I'm doing makes sense or not?

past meteor
#

Better that we discuss it here. I might not be available later and there's smarter people in this channel that can correct me if I make a mistake so it's in your best interest as well 🙂

quick bay
#

I have a note that shows me on what time the normal, anomaly tsunami and anomaly non tsunami occured

#

So i labeled it in the code

#

Do you think what i've been doing is right?

#

Please give me some enlightenment

#

the sea level anomaly is from
sea level - tide

#

it is in meters

past meteor
#

So your features are the last 35 lags?

quick bay
#

including the sea level anomaly 0, which is the original sea level anomaly value, so there are 37 in total

past meteor
#

The notebook is quite long, looking at it from afar it seems good

quick bay
#

sea level anomaly 0 - 35 and the minute

past meteor
#

It's very close to what I proposed, the extra I proposed was essentially making more / different features on the basis of your lags. Whether or not those make sense is problem dependent

#

I'm not a fan of class weighting to solve imbalance, better to reason about the precision-recall tradeoff by yourself and pick an operating point

#

Finally, the only xgboost hyperparameter I'd tune is the number of estimators. It's the one that has the most impact

hollow meadow
quick bay
#

can you explain what is 'the precision-recall tradeoff by yourself and pick an operating point'?

#

sorry, i'm a newbie

#

and english is not my first language

#

you're right, the estimators gave big impact to the model

past meteor
#

Don't worry - the output of your model is a probability right? Under normal circumstances (in the 0 vs 1 case) above 0.5 => 1 and under 0.5 => 0. You used class weights to solve this. Instead of using class weights you could for instance look at the distribution of the scores your model is giving. You can plot the precision and recall as a function of you increasing or decreasing the minimum score to be 0 or 1 from its default 0.5 (the operating point)

#

So on the basis of finding false positives or false negatives worse you can pick your own threshold that is different
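
To illustrate the operating-point idea, here is a minimal sketch for the binary (0 vs 1) case. The scores and labels are made up, and returning a precision of 1.0 when nothing is predicted positive is just one common convention.

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, y_true) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, y_true) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model scores and true labels for a binary (0 vs 1) problem.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
y_true = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(scores, y_true, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold catches more true positives (higher recall) at the cost of more false alarms (lower precision), and raising it does the opposite. If missing a tsunami is worse than a false alarm, a threshold below 0.5 would be the natural choice.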

brave sand
#

does the xml file path matter for the data?

quick bay
#

thank you so much

karmic void
#

Sklearn or tensorflow?

reef lantern
#

hey is anyone here using R or learning it?

potent sky
# karmic void Sklearn or tensorflow?

Both have their uses. Tensorflow is more of a deep learning framework, besides having an entire ecosystem associated with it (model formats, dataset file formats, serving, tflite, tfx, tfhub etc.)
sklearn has its own uses

past meteor
serene scaffold
dusty valve
#

I have a bunch of data of x and y coordinates, the shape is (n, 5000, 2), so is there any library that can render 5000 points and updates 30 times a second? Ive been told pygame will not be suitable, but maybe matplotlib will be or pyglet

past meteor
serene scaffold
past meteor
#

Write unmaintainable spaghetti

serene scaffold
#

but like what are they trying to do

past meteor
#

Data science

#

But the R people don't do deep learning at all afaik

#

There are "core" non-DL models that are better supported in R than in Python like ARIMA and GAMs to name 2. The Python version(s) are respectively poorer ports and unmaintained. Is it worth the effort? Largely depends on what you do I guess. Especially considering the overall experience in R is way worse than Python, both syntax and semantics.

agile cobalt
lapis sequoia
#

Anyone here think they could recreate this within matplotlib or another python plotting library?

potent sky
#

I tried to write capsule Networks in R sometime ago (I'm a masochist that's why) and even tho the code was all correct it ended up not working due to some bug in keras R

#

Tried lots of stuff to get it working but didn't. It was like a deadlock situation but with errors. Annoying

#

But as much as I dislike the code style and practices and paradigms of R, ig I can see how it'd be useful to people doing pure Data Analytics

crystal obsidian
#

Is this script good for correcting error prone text

night kernel
#

what do you guys think are the best models available on hugging face?

#

theres many ways to filter - curious to hear what you guys think

crystal obsidian
lapis sequoia
tidal bough
brave sand
#

how do I change this:

```py
from keras import backend
from keras import initializers
from keras.optimizers import utils as optimizer_utils
from keras.optimizers.schedules import learning_rate_schedule
from keras.utils import tf_utils
```
to
```py
from tensorflow ...
```
hasty mountain
#

It's possible that you may have to do some adjustments, but in general it's basically just adding tensorflow. before keras

night kernel
verbal venture
#

how much harder is generative AI compared to vision?

brave sand
hasty mountain
# verbal venture how much harder is generative AI compared to vision?

Depends on which model you want to use for generation.
If you want to use a Variational AutoEncoder...well, it's quite simple, though the theory is complicated...and it's also quite hard to find a decent tutorial.
If you want to use a Generative Adversarial Network, then the theory is easy and it's easy to find a tutorial, but it's too hard to make it work as it involves too much trial and error.
If you want to use a Diffusion Model, then you'll have the middle ground between those two above: the theory is complicated, it's easy to find a tutorial, and it's not that hard to make it work, but it also requires some trial and error.

#

PS: Flow-models are an aberration

verbal venture
#

ok, what about transformers

hasty mountain
#

I have never used a Transformer for generating images, but there might be a version of it for that... I know that there's a GAN model that uses Self-Attention, which is a mechanism from Transformer...

But I suppose the problem could be pretty much the same as for texts: teacher forcing bias, gradients biased due to residual blocks, crazy gradients due to how the layers' weights behave...

brave sand
#

my cpu load is 99%

verbal venture
#

how long do you think it would take to get a firm understanding of GANs, Transformers, VAEs and diffusion models?

hasty mountain
# verbal venture how long do you think it would tak to get a firm understanding of GANs, Transfor...

I think that, for a GAN...maybe 3 days? 1 week, at most, I think. Really, the idea behind them is quite simple: a Generator trying to fool a Discriminator with fake images, and beware of Nash Equilibrium and exploding/vanishing gradients. Follow a tutorial and you may be able to make a DCGAN work easily with the MNIST or CelebA dataset (which are the most common ones). The problem is if you try to go beyond that.

Transformers...if you already have some knowledge of NLP, maybe also 3 or 4 days. If you know nothing about NLP, vectorizing, etc., it may require some weeks. There's a course on Coursera taught by Andrew Ng that covers Transformers, which can help.

VAEs...around a week. Diffusion Models may take a bit more.

verbal venture
#

no.. to make production models

#

create my own models* similar to stabilityAI

hasty mountain
verbal venture
#

I have 0 deep learning background

hasty mountain
#

Oh, then you may need to add some weeks to those estimations

verbal venture
#

so I can literally make my own production-ready GAN in 3 weeks?

#

that doesn't seem possible

hasty mountain
#

Well, it depends on how you want to do it.
Tutorials on how to do a GAN are quite easy to find, so you may be able to make one quite fast. The only thing that may slow you down a bit will be debugging your code. But if you follow everything correctly, you may be able to have a working DCGAN on MNIST/CelebA dataset relatively fast.
However...as I said, if you want to modify the architecture of your DCGAN to generate different results, then it'll take quite some time to make it work.

#

In fact, it can take months of trial and error and studying.

brave sand
#

Does anyone know of a way to test if my system is lacking ram?

hasty mountain
#

VAEs and Diffusion Models are a bit more tough to understand, but making them work is quite easy. And that's why they've been getting the favor of most people nowadays.

Transformer follows the same idea, plus the fact that you may need to learn Natural Language Processing...which can take a while to understand...especially the idea around vectorization and embedding matrices.

verbal venture
hasty mountain
#

Some days were more...some less...

wanton sentinel
#

Quick one and I'm feeling real dumb for having to ask it... but selecting data based on the value of a combined index in Pandas... Say I have ['FIRST','SECOND'] as my index labels, how would I select based on a filter on the values in SECOND, ignoring any matches in FIRST? This seems like what I want, but not sure how to restrict it to just the second part of the index: https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label

agile cobalt
#

explained in #python-discussion but leaving here too for others to know it's been answered
```pycon
>>> df
     0
a a  1
  b  2
b a  3
>>> df.loc[(slice(None), 'a'), :]
     0
a a  1
b a  3
```

wanton sentinel
#

Works perfectly. Thanks muchly!

empty furnace
#

data scientists who worked for companies where SQL was an optional requirement, how much do you use it? And if you use it, how technical is your knowledge?

broken arch
#

_\

glad glade
#

bro I'm just following some yt tutorial for python but for some reason it won't detect mss idk how to fix this, I thought the file might be corrupted so I deleted it and reinstalled it but as shown it did not work

lapis sequoia
#

I need some help in visualising the counts of present vs absent per person

#

P means present and A is absent, which are stored in the respective dataset

#

And there are 79 unique IDs

past meteor
royal dagger
#

has anyone finetuned any model using QLoRA?

past meteor
#

Although I think if you're not exclusively doing DL it's worth to learn it. I treat it like a statistics DSL.

potent sky
glacial rampart
# lapis sequoia I need some help in visualising the counts of present vs absent per person

Can you provide a bit more information on what the df's look like?
Based on the image I now assume that every [x, y] in df_present contains an id of someone present and every [x, y] in df_absent contains an id of someone absent. Is that correct?
If so, you would first have to aggregate the dataframes into a [1, 79] df and then you can easily visualize it:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dummy data standing in for df_present: ids in 0..78
df = pd.DataFrame(np.random.randint(0,79,size=(2000, 12)), columns=list('ABCDEFGHIJKL')) # [2000 x 12]
df_counts = df.apply(pd.Series.value_counts) # [79 x 12], NaN where an id never occurs in a column
df_sum = df_counts.sum(axis=1) # [79 x 1], total count per id

plt.rcParams["figure.figsize"] = (100, 5)
df_sum.plot.bar() # one bar per id
# Repeat the same for absent and plot in same image

Please make sure to give more details on how you want to visualize the data, if I misunderstood

#

Separate post:
So I joined this community mainly because I'm looking for a specific kind of Python course. I think this channel is probably best suited for the question:
I'm a Data Engineer with basic Python knowledge, but I'm looking to expand that through project-based learning. I'm looking for a kind of course or training that offers such kind of learning.
Many courses seem to offer basic explanations etc. whereas I really would like to learn more advanced Python tips and tricks.
Could anyone recommend a person or company for such trainings? (Can be online, classroom based, multi-day, thats all fine)

little vector
#

I just finished an applied machine learning course, but I want to learn more about the subject. Any tips for documentations/tutorials I could study?
I want to expand my knowledge and perhaps do some ML on my own.

karmic void
#

For beginners, sklearn or tensorflow or any other module?

serene scaffold
#

but tensorflow is for neural networks, and you definitely shouldn't start with that.

karmic void
serene scaffold
#

except maybe keras for neural networks.

karmic void
serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

little vector
#

I am familiar with most of the libraries. We did a lot of work with sklearn

subtle knot
#

how do I change the font size of a single markdown cell in jupyter?

potent sky
little vector
little vector
errant lake
#

Actually, the Cloud providers DE certifications are always good to have imo. So you can check what your favorite Cloud has to offer

glacial rampart
#

again though, I'd definitely be interested in hands-on courses for that, not just going through tons of theory 🙂

brave sand
#

what does this mean:

```
2023-06-09 15:56:31.123032: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 297 of 2048
2023-06-09 15:56:41.180448: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 622 of 2048
2023-06-09 15:56:51.125136: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 907 of 2048
```
boreal gale
quick fable
#

Hello guys can somebody suggest me should I go for 3060 12gb or 3070ti 8gb for my pc , I will be doing dl/ml work and running models most of the time. Plus pls suggest me other components around it .

brave sand
boreal gale
brave sand
#

what are these files?
events.out.tfevents...

boreal gale
quick fable
# brave sand 3060

Brotha any specific reason ? Cuz of vram ? So more vram > Cuda cores?
If more vram should I go amd 6900xt 16 gb ?

past meteor
#

You might have some hiccups here and there, if you're fine with that then you can go ahead with AMD

junior rain
#

This may be a dumb question but is it bad practice to make a nested list a value in a pandas data frame? I'm analyzing EEG data and the module will output an indeterminate number of peaks (one participant could have 2 and the other 3) for each participant and each peak is a list of some values. Right now I'm thinking of storing each list in one list for the participants peaks, but is there a better way to do this?

serene scaffold
#

I assume when you say "a value in a dataframe", you mean the value for a single cell. not a whole row or a whole column.

junior rain
#

yes

quick fable
# past meteor I'm not sure how well DL libraries play with ROCm and AMD

Not good, amd still ain't a choice for deep learning in most cases
I don't know if I should go for the 3060 12gb. I could be fine-tuning a BERT model, and BERT requires at least 12gb vram; the 3060 is slow for it but it's in my budget range.
The only proper gpu could be the 3080 12gb, but I can only afford a used one in this range.
Should I go for a used one? Obviously non-refurbished and non-mined

serene scaffold
# junior rain yes

whatever the column is that is going to have these nested values, you might want to make it a separate dataframe with multiple levels of indexing.

junior rain
# serene scaffold whatever the column is that is going to have these nested values, you might want...

Each peak has 3 characteristics that are stored in those lists, and the number of peaks for each individual can vary from zero to usually a max of 3. So you're saying I should represent those as a new DF with columns for the characteristics, correct? If so, how could I account for the indeterminate number of peaks while making it clear that those peaks belong to one person, since in the original method they'd just be in the row for that person?

potent sky
#

Yep that's what it's doing
The shuffling operation is already done when it's placed into the buffer. From my understanding it samples from the original tensor and places elements into the buffer, so they are "shuffled"
This buffer can then be used for further processing, i.e. it can be sampled from (which is what the comment refers to)
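
For intuition, here is a simplified pure-Python sketch of buffer-based shuffling, in the spirit of how the `tf.data.Dataset.shuffle` documentation describes it: the buffer is first filled with the next `buffer_size` elements, then a random element is emitted and its slot is refilled from the input. This is not TensorFlow's actual implementation, and `buffered_shuffle` is a made-up name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # "Filling up shuffle buffer ..."
            continue
        pos = rng.randrange(buffer_size)
        yield buffer[pos]                # emit a random buffered element
        buffer[pos] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # drain what is left at the end
    yield from buffer


out = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(out)
```

In this sketch nothing is emitted until the buffer is full, which matches the "297 of 2048" progress messages: with a slow input pipeline, filling 2048 slots can take a while. A buffer size of 1 gives no shuffling at all.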

potent sky
#

I really thought I had it on reply ;-;

junior rain
#

lol I was very confused for a minute

potent sky
#

Yeah mb idk why that happens on my phone

past meteor
#

If it's for personal use there's nothing wrong with not getting a SoTA card and just using the cloud whenever your needs exceed your card's capacity. You can probably do some back of napkin calculations and see that this is likely the cheaper option

potent sky
junior rain
#

single number

potent sky
#

You can build a multi index then

junior rain
potent sky
#

One index for each patient/subject
And a lower index for each peak
With 3 columns in the df, one column for each of the "characteristics" of a peak

#

Yep should work I think

#

!d pandas.MultiIndex

arctic wedgeBOT
#

```py
class pandas.MultiIndex(levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, verify_integrity=True)
```
A multi-level, or hierarchical, index object for pandas objects.
junior rain
# potent sky Yep should work I think

Let me read into it more but it seems like that would work. My only concern is if I could output this to a csv file and what that would look like.

#

Yea based on the documentation it seems like I'd have a 3 dimensional df, so how could I export this into a simpler/readable database format?

potent sky
#

I think the CSV would have a column for each index

junior rain
potent sky
#

No...the peak would be described by one level of index
This would translate to one column

#

So you'd have 3 rows for participants with 3 peaks, 2 rows for those with 2 etc.

#

And your first level index would presumably be some sort of participant id
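
The layout described above could look roughly like this. The participant ids, the column names (`center`, `power`, `bandwidth`) and all numbers are hypothetical stand-ins for the three peak characteristics.

```python
import io

import pandas as pd

# Hypothetical peaks: (participant id, peak number, three made-up characteristics).
records = [
    ("p01", 0, 9.8, 0.61, 2.1),
    ("p01", 1, 19.5, 0.35, 3.0),
    ("p02", 0, 10.4, 0.72, 1.8),
    ("p02", 1, 21.0, 0.28, 2.6),
    ("p02", 2, 33.1, 0.15, 4.2),
]
peaks = pd.DataFrame(
    records, columns=["participant", "peak", "center", "power", "bandwidth"]
).set_index(["participant", "peak"])

# All peaks that belong to one participant.
print(peaks.loc["p02"])

# Number of peaks per participant. A participant with zero peaks has no rows
# here, so they would need to be tracked in a separate participants table.
print(peaks.groupby(level="participant").size())

# Written to CSV, each index level becomes an ordinary column.
buffer = io.StringIO()
peaks.to_csv(buffer)
print(buffer.getvalue())

# Reading it back restores the two-level index.
restored = pd.read_csv(io.StringIO(buffer.getvalue()), index_col=[0, 1])
```

The CSV ends up as a plain long-format table with one row per peak, so it stays readable in a spreadsheet and is not three-dimensional in any practical sense.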

junior rain
junior rain
#

Thanks @potent sky and @serene scaffold

potent sky
#

oops wrong tag again ;-;

wooden forge
#

Hey there, I am kinda confused with np.gradient. I would like to get the derivative of a 2D array (an image) but it returns two arrays x and y and I don't know what to do with these

wooden sail
wooden forge
#

nvm I found it

#

I didn't see the x and y were 2D arrays as well

#

so it's like the dx and dy
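
A quick sketch of what those two arrays are, using a made-up 4x5 "image" with a known slope. Note the order: for a 2D input `np.gradient` returns the derivative along axis 0 (rows, i.e. the vertical direction) first and along axis 1 (columns, horizontal) second.

```python
import numpy as np

# Hypothetical 4x5 "image" whose value is 10*row + 2*col.
rows, cols = np.mgrid[0:4, 0:5]
image = (10 * rows + 2 * cols).astype(float)

# For a 2D input, np.gradient returns one array per axis, in axis order:
# first the derivative along axis 0 (rows, vertical), then along axis 1
# (columns, horizontal). Both have the same shape as the input.
d_rows, d_cols = np.gradient(image)

magnitude = np.hypot(d_rows, d_cols)  # per-pixel gradient magnitude
print(d_rows.shape, d_cols.shape)
```

If a single "derivative image" is wanted, the gradient magnitude computed at the end is a common way to combine the two.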

wooden sail
#

yep

wooden forge
#

debugging saves the day once again

wooden sail
#

if you know what behavior you expect though, it might be easier to use a 2d convolution from scipy to produce the derivative you want

wooden forge
#

gradient does exactly what I want so I'll stick with this thanks !

wooden sail
#

all righty

subtle knot
#

Is the audit only version of the machine learning course by Andrew ng good or should I buy the course(labs etc)?

glacial rampart
# past meteor How is your SQL?

It's fine, I currently write jinja-sql with the dbt package. I don't like SQL as much, but I think I know enough to be able to use it for most projects
For the current project we write all transformations in jinja-sql, so that's some nice hands-on experience I'm already getting

past meteor
#

My issue with cloud certs is that a lot of it is just about their product and not necessarily concepts.

glacial rampart
#

Yeah well regardless those cloud certs are becoming industry standard :/ I'll need to have one from any cloud provider at some point.
And yeah I know SQL is like the must-know for all older systems/ companies. I think I'll manage as long as I understand the idea behind SQL, which I do learn with jinja. It's just that without jinja you need to write a lot of additional sql yourself

past meteor
#

The irony is that SQL is cutting-edge

#

As in, "modern" SQL is just a language/specification that is sent to query an S3-like datastore in a distributed way. See: Snowflake, Databricks spark-SQL, ...

That being said, it kinda sucks because queries can get long and unreadable ofc.

#

Pretty sure there's DBT adapters for Snowflake and the Spark runtime.

Re cloud certs: are they asking for that for an entry level job? Idk if cloud certificates have value before working with the stack either way. I have a couple of Azure certs because where I live they have 80 % of the market share and there's ways to get them for free. Personally wouldn't have grokked anything without having used Azure before though.

potent sky
glacial rampart
glacial rampart
past meteor
hoary jay
#

So say I have floating-point data ranging from -1 to 1 and I basically want to separate it into 3 categories: positive values, negative values, and a third one for data that is too close to 0, depending on a threshold. The problem is, how do I mathematically define and justify this threshold? I tried numpy's quantiles, but that just divides the data equally, whereas in my case I can have 70% negative values and 10% neutral and so on, so I can't just divide in equal ratios. I'm writing a research paper, so I can't just assume and say I divide the data at -0.05 and +0.05 because that would be arbitrary. What can I do?

wooden sail
hoary jay
#

I'm working on an NLP problem and i have some cosine similarity scores, floating values ranging from -1 to 1, and i basically want to separate them into 3 categories: values that are closer to -1, values that are closer to 1, and values near 0, which are neutral

these values are basically the result of a cosine similarity b/w a sentence vector and a target vector (such as "I am a man": i calculate its sentence vector and then use cos_sim with say a target word "men" or "women" to identify which target word it is more related to)

#

its kind of like i just wanna cluster in 1d

wooden sail
#

one way you could do this is to treat the thresholds as learnable parameters

#

or have them be hyperparams and test which one perform best. but there is no unique way of choosing them here

#

are you doing anything with the neutral values?

hoary jay
hoary jay
hoary jay
wooden sail
#

you could try a grid search for that and keep the threshold that works best

past meteor
past meteor
#

Do you do anything with it that is quantitative or do you just qualitatively analyse what's in the groups?

wooden sail
#

from what they said, i understood they use this value to remove samples from the training data

#

i might've misunderstood

hoary jay
#

I'm separating different sentences actually, on the basis of the similarity values

wooden sail
#

how do you do that separation?

past meteor
#

Okay but can we phrase it under qualitative and quantitative for a second

#

Do you have a downstream task that computes something?

hoary jay
hoary jay
#

like wdym by that can u elaborate?

past meteor
#

Like if you have a downstream task that computes something (quantitative) it's a hyperparameter you can tune like Edd says

wooden sail
#

what do you do with the data after separating into these 3 groups?

past meteor
#

If these are 3 groups and you'll kind of just look at what is inside and analysing it without a clear measure of performance then it's qualitative

hoary jay
#

the data is useless after separation, once i separate the sentences based on their similarity values i don't use that data anywhere else

wooden sail
#

what do you do with the data you DO use

past meteor
#

Why do you want to separate the data?

#

Like if there's nothing you're doing with it why separate at all and not just make a histogram?

#

I'm missing something here, don't fully understand

hoary jay
#

ok so let me explain

we have s (sentence vector) and t1 and t2 two target sets

now i calculate cos(s, t1) - cos(s, t2)

if it's -ve it tells me s is more related to t2, then i can do more statistical tests on "s" by analyzing its word vectors

wooden sail
#

what kind of statistical tests?

hoary jay
# wooden sail what kind of statistical tests?

like i'm not sure if i'm allowed to tell, it's for a research paper i'm contributing to under a prof. sorry if it sounds stupid, but I can tell you it has nothing to do with the data i'm talking about, I just use it for initial classification

wooden sail
#

then we can't help you 😛 cuz the answer depends on that

#

some statistical tests turn this into a tunable hyperparameter you can optimize

hoary jay
#

ah

wooden sail
#

descriptive statistics do not, and all this threshold gives is a family of descriptors

#

in the former case you can optimize it. in the latter, all you can do is graph the family and the choice is kinda arbitrary

#

what you do with this info is up to you now 😛

past meteor
#

Graphing the data, picking a threshold and then doing certain tests is a bit suspect because you can pick a cutoff to select the conclusion you want to reach, no?

wooden sail
#

in the descriptive case you would NEED to plot the whole family, otherwise the results are not meaningful

#

this is something you need to discuss with your supervisor or something, we can't help much more with the amount of info you can share

hoary jay
#

I'll confirm with him whether I'm allowed to talk about our work on online help forums. until then, thanks for the help tho

wooden sail
#

i'm not asking you exactly which tests on which data btw

#

just what kind of statistics

past meteor
#

Can a sentence not be related to T1 and T2 at the same time and produce a ve (whatever this is) that is near 0

wooden sail
#

if it's descriptive statistics that's one thing

#

if you're fitting a model or comparing to some reference, that's a different kind of statistics

hoary jay
#

not fitting a model, it's totally unsupervised learning. I'm doing hypothesis testing later after this

wooden sail
#

btw subtracting cosine similarities sounds a little weird

#

related to zestar's example, you can get a t1 similarity of 0, and -1 with t2

hoary jay
wooden sail
#

this gives you a positive 1 for sim(x, t1) - sim(x, t2), but the correlation is negative with t2 and 0 with t1

#

even for nonzero similarity with t1, if the negative similarity with t2 is larger you get a positive. same in the opposite direction

#

that metric does not distinguish between positive correlation with t1 and negative correlation with t2
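
A minimal sketch of that ambiguity, assuming toy 2-D vectors made up purely for illustration (`t1`, `t2`, `x_a`, `x_b` are not from the actual data). Both inputs get the same difference score even though one is unrelated to t1 and opposite to t2, while the other is identical to t1:

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity (u . v) / (|u| |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy targets, chosen only for illustration.
t1 = np.array([1.0, 0.0])
t2 = np.array([0.0, 1.0])

x_a = np.array([0.0, -1.0])  # similarity 0 with t1, -1 with t2
x_b = np.array([1.0, 0.0])   # similarity 1 with t1, 0 with t2

def score(x):
    # The difference-of-cosines score discussed above.
    return cos_sim(x, t1) - cos_sim(x, t2)

score_a = score(x_a)  # 0 - (-1)
score_b = score(x_b)  # 1 - 0
print(score_a, score_b)
```

Keeping the two cosines as separate features, rather than only their difference, would preserve the distinction.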

past meteor
#

I think you're probably making something simple hard

#

Define a set of rules with arbitrary thresholds and work with that group of data

#

Do it before looking at your data and motivate the choices. Statistics is full of default parameters anyway

#

You can also perform your analysis on a different set of splits and with different significance levels and put it in a big table

hoary jay
wooden sail
#

that too. for example if you have a reference paper that you're gonna compare to or build up on, you can just steal their hyperparams

hoary jay
wooden sail
#

all righty

hoary jay
#

thanks for the help both of ya

wooden sail
#

i think zestar and i just dropped a lot of stuff on you, so at this point i suggest you step back and mull it over for a bit and discuss with your supervisor. otherwise we're gonna confuse or derail you because we also lack some context

wooden sail
#

that categorization sounds like it should be a supervised task to me

#

let's see what zestar has to say

past meteor
#

Before I respond, without knowing fully what they do, I think this is something @serene scaffold might be able to help with

serene scaffold
#

I have to read all that?
Fuck

night prawn
#

i want to use my gpu so i followed this tutorial https://learn.microsoft.com/en-us/windows/wsl/tutorials/gpu-compute#setting-up-tensorflow-directml-or-pytorch-directml (the "Setting up TensorFlow-DirectML or PyTorch-DirectML" part), but the third command returns "conda: command not found". thank you in advance for your help

wooden sail
#

you probably didn't do the conda init at the end of the miniconda installation

#

navigate to where the miniconda executable is located, and in that directory run the command conda init. this will modify your bashrc to allow you to use the conda command from anywhere

#

then close the terminal and open a new one, or do source .bashrc from your home dir

serene scaffold
past meteor
#

Does bert even give sentence embeddings or are those gotten by averaging all word embeddings or doing a max at each dimension?

#

Okay apparently I forgot about CLS pooling. I kind of don't want to answer as this is quite NLP specific and that's not my jam aside from the high level ideas 😛

#

But it does look like classification tbh.

hoary jay
hoary jay
# wooden sail that categorization sounds like it should be a supervised task to me

the idea is to analyze the conversation of a community such as reddit and then attempt to classify those comments. So we can never have a model trained on some labelled dataset that can be applied to every conversation on social media, because in different communities different words are used in different contexts. for example slang that is used in a particular subreddit may not be used with the same context in say a YouTube comment section conversation

wooden sail
#

the idea of a training set is being representative, not to contain all text

#

how will you verify that your classification is working?

hoary jay
wooden sail
#

so you have labelled data. my approach would actually be to add this as an extra layer or two in the network and spit out the class instead of that function of the cosine similarity

hoary jay
#

but we are not implying that a labelled dataset is important for this process to work

wooden sail
#

in that case though, you'd have to do something like plotting a family of curves as a function of this threshold param

#

still, let's see what stelercus says. i'm in the same boat as zestar, maybe there's something else at play that i can't see

wooden sail
#

to know if your classification is working, you compare it to something. in this case you have this labelled dataset you mentioned. the goodness of the classification will be a function of where you draw the threshold of which data is neutral

#

this alone should make you think of false negatives and false positives being affected by where you set the threshold

hoary jay
wooden sail
#

very naively from my side, i would think that tuning this parameter as a hyperparam based on your labelled data and including the tuning alg and making the dataset public both justifies the approach and makes the result reproducible, which is great for a paper

#

that immediately makes it clear that the choice of the parameter depends on you having good data to go off of

#

i'm not an nlp person though

hoary jay
wooden sail
#

idk how complicated your inference process is. the bert part you mentioned is already trained yeah?

#

cuz if that's the case you can technically differentiate through that and use your labelled data to optimize the param

#

otherwise you can do a grid search

#

but still, do wait til someone who actually knows what they're doing lends a comment 😛

#

i'm just doing armchair nlp here lol

crimson summit
#

i am doing another course on machine learning by andrew ng and in it he divides the cost function by 2*M, where M is the number of training examples. Do yall have any idea why he does this? It seems like an additional learning rate that makes the error smaller.

wooden sail
#

notice that scalars factor out of derivatives, so scaling the cost by 1/2M scales the gradient by the same amount

#

it does not, however, change where minima are located (remember we're looking for the 0s of the gradient)

#

it's important on your computer, however, because floating point numbers can only be so large. this scaling factor is useful in keeping the gradient from exploding

#

if you have several terms added together in the cost function, the relative scaling of each one IS important. for a single cost term, this scaling is only here for numerical purposes on your computer

crimson summit
#

is it kind of like making the step size smaller ?

past meteor
#

The irony of me and NLP is that I took a course on multimodal information retrieval before I took computer vision and NLP

#

I did do computer vision proper afterwards but anything I know about NLP is from an IR perspective, not "complete" whatsoever but also not nothing

wooden sail
crimson summit
wooden sail
#

let's put it this way

#

imagine my house is 10 meters away, and each step i take is 1 meter long. it takes me 10 steps to get there

#

what if my house is now 1 meter away, but my steps are 0.1 meters long?

#

this is the same thing that is happening here, since the gradient is computed from the cost and scalars factor out of derivatives

crimson summit
#

doesn't the cost in this case tell you "how far away you are from your house", and you would scale other things in the formula that adjust the weights to make your steps smaller?

wooden sail
#

you could if you wanted, i'm just explaining to you that the scaling of the cost function does not have an impact

#

the gradient scales by the same amount as the cost automatically

#

if you want to change the gradient size in a meaningful way, you'd have to change the step size, which is a scalar multiplied ONLY to the gradient, not to the cost
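
A small numerical sketch of the point above, assuming a one-parameter least-squares problem on synthetic data (the data, `descend`, and the learning rates are illustrative choices, not taken from the course). Dividing the cost by 2M and multiplying the step size by 2M cancel exactly, and both runs land on the same minimiser. As a side note, the 2 in 2M cancels the factor 2 that comes from differentiating the square, and the M turns the sum into a mean so the gradient size does not grow with the number of examples:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 200
x = rng.uniform(-1.0, 1.0, size=M)
y = 3.0 * x + rng.normal(0.0, 0.1, size=M)

def grad(w, scale):
    # d/dw of scale * sum((w*x - y)^2) = scale * 2 * sum((w*x - y) * x)
    return scale * 2.0 * np.sum((w * x - y) * x)

def descend(scale, lr, steps=500):
    # Plain gradient descent on the scaled cost, starting from w = 0.
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w, scale)
    return w

# Closed-form minimiser; it is the same for any positive scaling of the cost.
w_closed = np.sum(x * y) / np.sum(x * x)

# Cost divided by 2M with step size 0.5 ...
w_scaled = descend(scale=1.0 / (2 * M), lr=0.5)
# ... versus the raw sum of squares with the step size divided by 2M instead.
w_raw = descend(scale=1.0, lr=0.5 / (2 * M))

print(w_closed, w_scaled, w_raw)
```

In practice the scaled form is convenient because one step size keeps working when the dataset size changes.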

crimson summit
#

house analogy was 👍

broken thistle
#

Hey everyone, I need help with outlier handling in each of the train folds when using StratifiedKFold with GridSearchCV. I'm currently using a strategy which might be causing data leakage. The full description and current code are given in my post here. Any insight would be greatly appreciated. Thanks!

https://stackoverflow.com/q/76446540/18559120

night prawn
wooden sail
#

wdym?

night prawn
#

in the tutorial

wooden sail
#

what are you calling "first" and "second" parts?

night prawn
#

the first : Setting up NVIDIA CUDA with Docker the second : Setting up TensorFlow-DirectML or PyTorch-DirectML

wooden sail
#

if you want to use an nvidia gpu, yes

queen cradle
#

If I understand your setup correctly, t1 and t2 are vectors, and cos(v, w) is the cosine of the angle between them. So cos(word, t1) - cos(word, t2) or cos(sentence, t1) - cos(sentence, t2) is just telling you whether you're closer to t1 or to t2. This can also be phrased in terms of the perpendicular bisector of t1 and t2. This perpendicular bisector is a hyperplane; vectors on one side of the hyperplane are closer to t1 while vectors on the other side are closer to t2. The usual reason for taking the cosine is to get rid of the effect of document length, which is great if you want to compare two arbitrary vectors (e.g., determine if one sentence is similar to another). But if you're trying to determine whether you're closer to one vector or another, then you get the same information by computing the distance to their perpendicular bisector, which can be done by taking the dot product with (t1 - t2)/||t1 - t2|| and looking at the sign. In your case, since you want a neutral category, it's the same as looking at whether the dot product is near 0, large positive, or large negative.

#

That doesn't solve your question, of course. You were asking about how to define thresholds for the categories, and all I did was tell you about a different way to compute something for distinguishing the categories. But I think it clarifies the situation; earlier the question of why you were taking the difference of cosines was raised, and I think the resolution is that it's equivalent to something more mathematically motivated.
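
A quick numerical check of that equivalence, as a sketch with random stand-in vectors (`s`, `t1`, `t2` here are synthetic, not real embeddings). Note the assumption: the identity is exact once all three vectors are normalised to unit length, in which case the difference of cosines is a single dot product with the difference of the unit targets:

```python
import numpy as np

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
s, t1, t2 = rng.normal(size=(3, 8))  # random stand-ins for real embeddings

# Difference of cosine similarities, as in the original setup.
cos_diff = np.dot(unit(s), unit(t1)) - np.dot(unit(s), unit(t2))

# The same number as one dot product with the difference of the unit targets.
direction = unit(t1) - unit(t2)
dot_form = np.dot(unit(s), direction)

# Dividing by |direction| gives the signed distance of unit(s) from the
# hyperplane that bisects unit(t1) and unit(t2).
signed_distance = dot_form / np.linalg.norm(direction)

print(cos_diff, dot_form, signed_distance)
```

The sign says which target is closer, and the magnitude is what a neutral band around 0 would be thresholding.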

night prawn
acoustic fjord
#

hello there! anyone here ever tried animating their python graphs? if so, how was your experience? thanks!

lapis sequoia
#

Anyone know if there's a tag in PyTorch for easy tasks like cleanup or good first task?

vale idol
#

Hi I have a quick question regarding assigning values to columns using pandas

#

df.column1[df.column2 <= 3] = 'LScores'

serene scaffold
vale idol
#

Whenever I try to assign values using a function I get : SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

serene scaffold
#

df.loc[df['column2'].le(3), 'column1'] = 'LScores'

vale idol
#

Isn't it the same?

serene scaffold
#

no

vale idol
#

plus I am trying to assign the value back to the original dataframe so thought loc wouldn't make much sense, especially since I'm running this command in a for loop

agile cobalt
serene scaffold
#

but you should probably store the string value in a different column. or you'll have a heterogenous column

#

columns need to be homogenous

#

oh, you are

#

fixed my code example.
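
A minimal runnable sketch of the `.loc` form on a toy frame (the data and the 'HScores' default label are made up for illustration). The row mask and the column label go into one indexing call, so pandas writes into the original dataframe rather than into an intermediate object:

```python
import pandas as pd

df = pd.DataFrame({"column2": [1, 5, 3, 8]})
df["column1"] = "HScores"  # hypothetical default label for the toy example

# One indexing call: boolean row mask plus target column.
df.loc[df["column2"].le(3), "column1"] = "LScores"

print(df)
```

The same line works unchanged inside a for loop, since each call assigns directly on `df`.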

vale idol
hoary jay
#

actually the formula for cosine similarity used inside scipy and pretty much everywhere else is in fact (u•v)/|u||v|, where u and v are vectors, so essentially we are already doing w • ((t1-t2)/|t1-t2|) like you said anyways.

The problem is how to set thresholds and also how to introduce fuzzy logic in the process (or anything else that can help) because if i have a score of 0.052 and the neutral threshold is <= 0.05 then this will put it in the t1 partition which is wrong because 0.002 is not a significant difference and the sentence is actually neutral not biased.

agile cobalt
vale idol
queen cradle
vale idol
#

Specifically, I have a multi index df where I ran the following command
df['env_measure_sorts_10'] = df.groupby(['year'])[score_type[1]].transform(lambda x: pd.qcut(x, q=10, labels=labels_dec))

queen cradle
#

As far as setting thresholds, I think it depends. You said you have some labeled data. You could use that: There may be a good threshold which distinguishes "sentence about males" versus "everything else" and a second good threshold that distinguishes "sentences about females" versus "everything else".

vale idol
queen cradle
#

And (perhaps more importantly) you can find that threshold using your labeled data.

vale idol
#

Also thanks for the help @agile cobalt @serene scaffold, appreciate it 🙂

queen cradle
#

My suggestion is to plot histograms of "sentences about males" and "everything else" and see if it looks like there is a threshold that distinguishes them. And similarly for "sentences about females" and "everything else". If there's a threshold, great! If not, then you'll need a different embedding.

hoary jay
#

the case of 0.052

queen cradle
#

You'll always have false positives. That's how classification problems are.

#

Well, realistic ones, at least.

hoary jay
queen cradle
#

The problem is likely to be either your embedding or your choice of the vectors t1 and t2. If your embedding isn't very good at describing masculine-topic sentences, feminine-topic sentences, and other sentences, or if t1 and t2 are not very representative of masculine- and feminine-topic sentences, then you'll get poor results no matter what you do.

vale idol
#

Also if anyone knows, is there a difference between:
df.loc[df.column1 <= 3, 'column 2'] = 'LScores'
and
df.loc[df['column1'] <= 3, 'column 2'] = 'LScores'

#

Since they appear to give the same output but not sure if there are caveats to this

serene scaffold
#

and if you have a column with the same name as a method, you can't get to it with dot notation.

#

or if the name of the column has a space
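
Both caveats can be seen on a toy frame (the column names here are picked only to trigger them). A column called `count` collides with the `DataFrame.count` method, and a name with a space cannot be written with dot notation at all:

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "column 2": ["a", "b", "c"]})

by_bracket = df["count"]  # the column, as a Series
by_dot = df.count         # the DataFrame.count method, not the column

spaced = df["column 2"]   # bracket notation is the only way to reach this one

print(type(by_bracket).__name__, callable(by_dot), spaced.tolist())
```

Bracket notation behaves the same for every column name, which is why it tends to be the safer habit.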

vale idol
queen cradle
#

@hoary jay If you haven't tried plotting histograms of the scores of your sentences, then I would do that before anything else. Those plots will tell you whether or not you have a problem with your thresholds.

wooden sail
hoary jay
queen cradle
#

For which class?

#

And centered where, and with what kind of variance?

hoary jay
queen cradle
#

That's saying that most of your data is not especially similar to either of the vectors you're comparing to. Is most of your labeled data neutral?

wooden sail
#

that'd more depend on the choice of the threshold, no?

queen cradle
#

If the whole distribution looks normalish, though, then there isn't clear separation between the classes of interest.

queen cradle
#

Okay. What if you plot KDEs of just the other classes?

#

Maybe a little separation is visible?

hoary jay
#

ok so i just select a random threshold and then plot the different classes after separating with it?

queen cradle
#

No, I'm saying take all the data with label 1, and plot a histogram (or KDE) of that data and only that data. Then plot a separate histogram for the data with label 2, etc.
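
A sketch of that per-label breakdown, assuming synthetic scores and integer labels 0 (neutral), 1 and 2 as stand-ins for the real labelled data. No threshold is involved: the scores are grouped by their given label and each group is binned separately, with shared bin edges so the histograms can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins: one score in [-1, 1] and one label per sentence.
labels = rng.integers(0, 3, size=600)
centres = np.array([0.0, 0.3, -0.3])  # made-up class centres
scores = np.clip(rng.normal(centres[labels], 0.15), -1.0, 1.0)

# Shared bin edges for all labels.
bins = np.linspace(-1.0, 1.0, 21)

# One histogram per label, using only the scores that carry that label.
hist_by_label = {
    int(lab): np.histogram(scores[labels == lab], bins=bins)[0]
    for lab in np.unique(labels)
}

for lab, counts in hist_by_label.items():
    print(lab, counts)
```

The same grouping can be passed to a plotting call such as matplotlib's `plt.hist`, one call per label, to get the pictures.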

wooden sail
#

i just got that you were talking about the a priori distribution, i'm tripping today

hoary jay
#

dist for the two labels

queen cradle
#

Wait. This is the labeled data that you began with? It doesn't seem right.

#

It seems extremely unlikely to me that there are sudden cutoffs at zero. Especially for the first picture since it seems to have only just started to peak.

hoary jay
#

so u see when i plot the whole data i get a normal dist

wooden sail
#

this is the labelled data set right? not the output of your classification for a given threshold

hoary jay
wooden sail
#

that's a bimodal normal at best, btw

hoary jay
hoary jay
wooden sail
#

is this the same dataset from the reference paper? out of curiosity

hoary jay
wooden sail
#

ok

#

i also find it a little "too nice" that there don't appear to be any misclassified sentences

queen cradle
#

It looks to me like (barring a bug) your embedding isn't capturing enough information about individual sentences for you to reliably draw the distinctions you want.

#

But I also think it's suspicious that there are no misclassified sentences.

hoary jay
queen cradle
#

One distribution is negative only. The other is positive only. No misclassification.

wooden sail
#

i would've worded it as, especially in the first figure, one of the classes being super steeply affected by the choice of the threshold. similar to what kyle said, the classes are kinda hard to separate

hoary jay
queen cradle
#

If the plots are showing what I thought they were, there's one plot for all the labeled data whose labels are masculine-topic and one plot for all the labeled data whose labels are feminine-topic.

#

Is that not the case?

wooden sail
#

what labels does your labelled dataset have?

#

or what did you put in the plots rn

queen cradle
#

"most of the sentences u see on the left side are actually neutral and hence wrongly classified"
This sounds to me like the plots are showing how the data was classified, not how the data was labeled.

hoary jay
wooden sail
#

ok

#

and you plotted the t1 - t2 score here, yeah? separately for the labels t1 and t2

hoary jay
#

that is wrong classification right?

queen cradle
#

If the plots are showing what I think they are, then no, I can't see that.

queen cradle
wooden sail
#

the plots kinda say that the t1 vs t2 classification is trivial, and that there is not much info at all regarding the neutral class

#

can you show the histogram for the 0 labelled data?

hoary jay
wooden sail
#

i think something weird is going on

#

in how this was labelled

#

the more i think about the plots the less sense they make lol

hoary jay
wooden sail
#

ok

#

my interpretation wasn't that far off

hoary jay
wooden sail
#

according to your labelled data, the 1 vs 2 classification is trivial

#

and fishing out the neutrals comes at a terrible price

#

your data set is such that the threshold of false positives means you will get a ton of false negatives

#

it could be true of the language, that idk. it could be an artifact of the choice of embedding or labelling, idk

hoary jay
wooden sail
#

the actual interpretation of this requires domain knowledge that i don't have, this really needs knowledge of the statistics of a language

hoary jay
#

same

wooden sail
#

but if we put that aside, and assume these are correct

#

my suggestion would be the same as kyle's. as i said before, we can treat the threshold as a hyperparam and given this info, we should be able to optimize for it

hoary jay
#

hmm ill try that

wooden sail
#

or pick it for a choice of false negatives per class, or something of the sort

#

immediately you notice that this thing isn't symmetric, so i would recommend to pick a different negative and positive thresh
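
A rough sketch of that tuning as a grid search, under stated assumptions: the labelled data is synthetic, the class coding (0 neutral, 1 for the positive side, 2 for the negative side) is invented for the example, and macro-averaged recall is just one possible objective. The negative and positive thresholds are searched independently so they are free to come out asymmetric:

```python
import numpy as np

def classify(scores, neg_thresh, pos_thresh):
    """0 = neutral, 1 = above pos_thresh, 2 = below neg_thresh."""
    out = np.zeros(scores.shape, dtype=int)
    out[scores > pos_thresh] = 1
    out[scores < neg_thresh] = 2
    return out

def macro_recall(y_true, y_pred, classes=(0, 1, 2)):
    # Mean of the per-class recalls, so small classes count as much as big ones.
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in classes if np.any(y_true == c)]
    return float(np.mean(recalls))

def grid_search(scores, y_true, grid):
    # Try every (negative, positive) threshold pair and keep the best one.
    best = (-1.0, None, None)
    for neg in grid[grid <= 0]:
        for pos in grid[grid >= 0]:
            m = macro_recall(y_true, classify(scores, neg, pos))
            if m > best[0]:
                best = (m, float(neg), float(pos))
    return best

# Synthetic labelled data with deliberately asymmetric class centres.
rng = np.random.default_rng(3)
y_true = rng.integers(0, 3, size=900)
centres = np.array([0.0, 0.4, -0.2])
scores = np.clip(rng.normal(centres[y_true], 0.08), -1.0, 1.0)

grid = np.linspace(-0.5, 0.5, 41)
best_score, neg_thresh, pos_thresh = grid_search(scores, y_true, grid)
print(best_score, neg_thresh, pos_thresh)
```

Thresholds tuned this way should be checked on a held-out split, since scoring them on the same data they were tuned on overstates how well they work.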

#

about the correctness of the histogram, you'd need to check papers or contact an expert

#

but those histograms look very weird to me

queen cradle
#

I agree that the histograms look weird. So I'm suspicious that there's a bug. But assuming there isn't, you may not be able to extract enough data from a single sentence to do what you want.

#

You might have better luck if you embed whole paragraphs.

wooden sail
#

ah, can you change the dimensionality of the embedding?

#

increase it a little?

hoary jay
#

wait i think i can

#

there is another big pretrained model, i believe i can try that

wooden sail
#

i can understand technical constraints 😛

#

give it a shot, or what kyle said too. depends on how much time you have available for this

hoary jay
hoary jay
#

trying to submit it to a conference

wooden sail
#

ah, pretty tight on time

hoary jay
#

yeep

#

lets see

queen cradle
#

Oof, good luck!

wooden sail
#

but if it's just an abstract, all you need is 1 or 2 pretty pictures 😛

hoary jay
#

this is my first ever paper so i actually don't know what to submit either 😂

queen cradle
#

Depends on the venue. Your supervisor should be able to explain.

hoary jay
#

ill ask the prof yea

wooden sail
#

check the conference guidelines. many conferences that request an abstract have a condition like 1 or 2 paragraphs, 1 or 2 images, a word limit, and other specs. check that ahead of time

#

in some cases you can get away with promising the world

#

in others you need to show the results up front, ideally with pretty pictures

wooden sail
#

and peer reviewed ones need the paper for review months in advance (not the case if they only request an abstract)

queen cradle
#

For questions like that, there are often conventions that are specific to your field. Outsiders aren't going to have much luck making recommendations.

wooden sail
#

the conference website has the explicit guidelines, go check it out!

#

btw your stuff looks kinda like 2 beta distributions and a gaussian, if you like doing parametric estimation instead of KDE. not much of a difference here tbh since it's anyway blind, but...

hoary jay
#

sorry not that good in stats but i like stats

wooden sail
#

in KDE one takes a basic shape and uses it to build up the observed shape

#

in parametric estimation, one knows the correct parametric family ahead of time and fits the parameters of that directly

hoary jay
wooden sail
#

but differently

hoary jay
wooden sail
#

for KDE the "width" or "variance" of a kernel is chosen a priori, as well as a "model order" (number of kernels). then one finds where to shift the kernels to

#

on the other hand, if you correctly know the parametric family ahead of time, you can explicitly do model order and parameter estimation

#

the difference is that KDE in general has no physical interpretation, it just gives you a parametric representation you can evaluate anywhere

#

in parameter estimation, the estimated parameters actually represent the properties of the data

#

(assuming your choice of parametric family was correct... that's a big IF)
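
A side-by-side sketch of the two approaches in plain numpy, on synthetic data and with an assumed normal family for the parametric side. One caveat on the KDE side: in the usual formulation a kernel is centred on every sample, so the main thing chosen up front is the bandwidth. The bandwidth below uses the normal-reference rule of thumb, which is only one common default:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(0.1, 0.2, size=500)  # synthetic stand-in for the scores

# Parametric: assume a normal family and estimate its two parameters.
mu_hat = data.mean()
sigma_hat = data.std(ddof=1)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Non-parametric: Gaussian KDE, the average of one kernel per sample.
def gaussian_kde(x, samples, bandwidth):
    z = (x[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

# Normal-reference rule-of-thumb bandwidth (an assumed default).
bandwidth = 1.06 * sigma_hat * len(data) ** (-1 / 5)

xs = np.linspace(-1.0, 1.0, 401)
pdf_parametric = normal_pdf(xs, mu_hat, sigma_hat)
pdf_kde = gaussian_kde(xs, data, bandwidth)

print(mu_hat, sigma_hat, bandwidth)
```

The parametric fit is summarised by two interpretable numbers, while the KDE is a curve that can be evaluated anywhere but has no parameters with a direct meaning.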

queen cradle
#

Parametric estimation, like the normal and beta distributions Edd is describing, is particularly useful when you have limited data, when you're trying to find a simple approximation, and when theory predicts that a distribution should be close to parametric.

wooden sail
#

that's the final kicker. the lower bound for parameter estimation requires as many samples as unknown parameters

queen cradle
#

For example, the central limit theorem essentially says that, given enough data (and some mild technical hypotheses), most things look like they have a normal distribution.

#

At least if you choose the distribution's parameters right.

wooden sail
#

kyle's practical implications are probably more relevant to you than my ramblings 😛

hoary jay
#

ig you both are insightful

wooden sail
#

you guys are doing the work, i'm just a resident rubber ducky

brave sand
#

how are you guys using Colab without running out of memory?

hoary jay
#

i still dont know how u eyeballed that its a beta distribution, when i can hardly remember its formula and graph

queen cradle
#

I guess I should add that real data is never exactly normal (despite the central limit theorem). Often it's not even hard to see the difference, particularly in the tails.

wooden sail
#

you'd technically have to try a few and then do something like a kolmogorov-smirnov test to pick which one fits best

hoary jay
#

is it like another hypothesis test?

wooden sail
brave sand
#

does anyone know what this error means?

AttributeError: module 'object_detection.protos.square_box_coder_pb2' has no attribute 'DESCRIPTOR'
queen cradle
#

I would say that finding an appropriate parametric distribution is a bit of an art. It tends to work best when there's some underlying reason (physical, experimental, etc.) why that distribution is close to what you're observing.

queen cradle
#

So, for example, if you say, "Oh, I got this quantity by taking a sum of a lot of other quantities," then that's maybe suggestive of something normal or normal-ish. There are a lot of distributions that look similar to a normal distribution (t-distributions, chi^2 distributions, etc.) but if you told me that you wanted to fit a big sum to something parametric, my first guess would be a normal distribution.

wooden sail
#

one of the things i considered when i said "beta" is that it's bounded, for example, unlike other similar-looking dists. that at least fits the behavior of the values computed by your similarity measure

queen cradle
#

If you have enough data, though, non-parametric methods will give you better results in the sense that they will come closer to describing reality.

#

Oh, and there are loads of places where people perform a normality test, see that it fails to reject the null hypothesis, and use that as justification for a hypothesis test that requires normality (e.g., a t-test). You should never do this: Failing to reject the null hypothesis does not mean your data is normal. Your data is never perfectly normal; you just may not have enough to reject normality with the test you're using.

hoary jay
#

what if it doesn't? then is ttest valid?

queen cradle
#

The usual justification for a t-test requires normality.

#

If the data isn't normal, then it may still be close enough to normal that the t-test is okay.

hoary jay
#

ok so how can i make sure the data is close to normal? im asking because I'm using T test later

#

something like Shapiro test?

queen cradle
#

Well, if you want to be really careful, then there's no a priori guarantee.

hoary jay
#

ok

queen cradle
#

Usually we apply the t-test to something like sample means. The central limit theorem says that if we have enough data, then those are approximately normally distributed.

#

But how close you get to normal depends on other parameters of the distribution that you probably don't know.

#

This is quantified by the Berry–Esseen theorem, which shows that the rate of convergence depends on the third moment (unnormalized skewness).

#

In practice, you don't know the third moment, so you don't know how close you are to normal, so you can't really justify a normal approximation and a t-test.

#

But in practice, this rarely matters. In practice, with a decent amount of data, the skewness is almost never so extreme that it messes you up.
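
A small simulation of that convergence, assuming an exponential distribution as the skewed example (the sample sizes and repeat count are arbitrary choices). The skewness of the sample mean shrinks as n grows, roughly like 2/sqrt(n) for this distribution, which is why the normal approximation behind the t-test tends to be acceptable at moderate n:

```python
import numpy as np

rng = np.random.default_rng(5)

def skewness(a):
    """Sample skewness: third central moment over variance^(3/2)."""
    a = np.asarray(a, dtype=float)
    d = a - a.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)

def sample_mean_skewness(n, n_repeats=20000):
    # Draw n_repeats samples of size n from an exponential (skewness 2)
    # and measure how skewed the distribution of their means is.
    draws = rng.exponential(scale=1.0, size=(n_repeats, n))
    return skewness(draws.mean(axis=1))

skew_by_n = {n: sample_mean_skewness(n) for n in (5, 30, 200)}

for n, sk in skew_by_n.items():
    print(n, round(sk, 3))
```

A heavier-tailed source distribution would need a larger n before the skewness of the mean becomes negligible.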

hoary jay
#

my idea was a 2nd classification: after classifying a sentence as t1- or t2-related, i calculate scores of the word embeddings in that sentence and take them as a sample, then perform a t-test where the population is the scores of words that are actually biased towards t1 (say), obtained from the same dataset. then perhaps the null hypothesis can be "sentence is biased because it contains biased words" and the alternative hypothesis could be "sentence is not biased because the mean of the sample is not similar to the mean of the biased words".. does that make sense?

queen cradle
#

There are various rules of thumb, like n > 30 or n > 50 or whatever. None of these are entirely reliable; you can always concoct a really bad example where they won't work. But most data isn't really bad in that way, so these rules of thumb usually work.

queen cradle
hoary jay
#

there's a difference b/w related to t1 and t2 and biased towards t1 and t2... because if a sentence is biased towards t1 and t2 then there must be a use of some stereotypical word or anything that contributes to biases. In my very first text, i mentioned that our ref paper can actually use these word embedding scores to find out the top most biased words against t1 and t2.

So i thought if a sentence is related to t1 then it can be a normal sentence that just revolves around t1 and doesn't have any stereotypical or gender bias in it... But a sentence that is related to t1 and has biased words in it probably has a higher chance of being a stereotypical or biased sentence against t1, right? So that's why i thought maybe we could look at the sentence from the perspective of words too

queen cradle
# hoary jay there's a difference b/w related to t1 and t2 and biased towards t1 and t2... be...

It sounds like you want to use both sentence and word embeddings. The approach you describe is a kind of hierarchical model. Those are fine; a good hierarchical model can be quite powerful. Another approach you could try (well, with enough time; I remember that you have a deadline coming up!) is making a big, combined embedding by using both a sentence and a word embedding. Maybe that would give you more information than the word embeddings alone. It has the same information as the hierarchical model, but it might also be harder to fit.

hoary jay
plain jungle
hoary jay
plain jungle
hoary jay
plain jungle
#

Thank you! It’s a standard population density graph! My goal has been lately to show people that AI is a very broad range and not everything needs NNs to be optimal

hoary jay
#

love that

tidal bough
plain jungle
tacit knot
#

Ok, I'm really hoping someone can point me in a direction here... I'm building an open source AI based thing (called AutoVR.ai, details if ya want, but doesn't matter for this question)

Under the hood I'm using ZoeDepth that is based on MiDaS and those build off of torch.nn.Module. The point of this model is to take in an image and it puts out a depthmap. There is an ability to adjust the "precision" to use for the actual inference portion and even though the output resolution will end up matching the input image, it does seem to make a dramatic difference on the quality and details clearly present in the depthmap itself. So there is a desire to crank up that "precision" as much as possible if trying to produce high quality outputs. Now, the problem is, the max precision someone can use is going to be directly associated to the VRAM available. If that "precision" is set too high, it will just throw an out of VRAM error and I'm trying to deal with that a bit more gracefully than just letting the thing crash.

I first attempted to catch that error, automatically adjust the precision down a bit, reattempt, and keep track of the working precision combinations so they don't need to be re-determined multiple times. That process generally works, but there are some issues. I've stumbled onto some insane memory leaks that I might have introduced, but they sure seem to be inherent in this ZoeDepth thing, which I'd rather not rip apart and/or update to do proper garbage collection.

#

That all said, I simply don't understand python memory management well enough at the moment. For example, without that "dynamic" functionality I'm getting this as an example error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.97 GiB (GPU 0; 11.69 GiB total capacity; 7.51 GiB already allocated; 2.06 GiB free; 7.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The part I'm really not understanding is why it is telling me 11.69 total, 7.55 reserved, but that would imply 4.14 free, although it is saying only 2.06. Just understanding this a bit more might help a lot and honestly, my googling/chatgpting has sent me down more rabbit holes than anything that seems actually helpful.

iron basalt
#

There is also memory being used by the rest of the system (other processes).

tacit knot
#

Gotcha, is there any way to see/find that? I can see some using nvidia-smi, but the numbers are still way different. And I'm very aware that I just don't fully understand this stuff, just need the right thing/name/concept to look into more.

iron basalt
#

2109.44 MiB (2.06 GiB free) + 7731.2 MiB (7.55 GiB reserved) + 1251 MiB (Pytorch base amount) = 11091.64 MiB = ~10.83 GiB If you include other processes you can see how it's close to 11.69.
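The accounting above can be reproduced with a few lines of arithmetic. The figures are the ones quoted in the error message and in the message above; the variable names are just for illustration.

```python
MIB_PER_GIB = 1024

free_mib = 2.06 * MIB_PER_GIB      # "2.06 GiB free" from the error message
reserved_mib = 7.55 * MIB_PER_GIB  # "7.55 GiB reserved in total by PyTorch"
base_mib = 1251                    # PyTorch base amount quoted above
total_gib = 11.69                  # "11.69 GiB total capacity"

accounted_mib = free_mib + reserved_mib + base_mib
accounted_gib = accounted_mib / MIB_PER_GIB
unaccounted_gib = total_gib - accounted_gib  # other processes, fragmentation, ...

print(f"accounted for: {accounted_mib:.2f} MiB = {accounted_gib:.2f} GiB")
print(f"not accounted for: {unaccounted_gib:.2f} GiB")
```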

#

In addition, there can be memory fragmentation and other issues.

tacit knot
#

gotcha, that helps a bit. Honestly, I'm just a bit surprised how easy it is to blow this sort of thing up. The only way I was able to get the memory leak even semi under control was to tear down the entire model and re-build it if it encounters an out of memory error. Kinda rough, but I can make do with some caching of the "determined" maximums.

#

Yea, tried the environment variable at 64mb, 128mb, 512mb, and 1024mb with no significant differences.
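For reference, the environment variable named in the error message is `PYTORCH_CUDA_ALLOC_CONF`, and the option goes into it as `max_split_size_mb:<value>`. A minimal sketch of setting it from Python, assuming it is done before PyTorch makes its first CUDA allocation (the value 128 is just one of the sizes tried above):

```python
import os

# Set the allocator option before the first CUDA allocation;
# doing it before importing torch is the safest place.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # assumed to happen after this point in the real script
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```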

iron basalt
#

PyTorch is not super memory efficient, it's set up for easy experimentation.

tacit knot
#

Ah, gotcha, good to know and keep in mind.

iron basalt
#

Making full use of memory requires custom implementations that are not even close to being worth the effort (a lot of work to implement all the manual allocation and custom kernels for the specific model) when you can just buy a bigger GPU or more GPUs.

#

The only applications that will try to squeeze out all of the memory of a GPU are more or less video games (and only some of them (especially if they have to run on consoles)).

tacit knot
#

Ugh, I've done a little bit of PS1/PS2 game dev long long ago. Not a fan of writing custom memory managers lol

iron basalt
#

In the world of ML we just throw more hardware at it, because we usually can / are running in the cloud.

tacit knot
#

ugh, if i re-run this same thing a handful of times, the max resolution I can run at floats up/down by 10-15% without making any changes

#

Yea, I'm seeing that lol...i'm trying to build this thing to run on consumer level hardware AND i'm trying to make it pretty much auto configured, at least for a reasonable set of defaults

iron basalt
#

Consumer level hardware ML is very rough, because of things like this.

#

The ecosystem is not really set up for it, at least not in an easy way.

tacit knot
#

Finding that out lol...the early version of this memory leak was so bad that it would crash OTHER applications and occasionally my entire user session (Kubuntu, flavor of Ubuntu) and that kinda surprised me to say the least.

iron basalt
#

Yeah, and it will crash some drivers too, can brick some ppl's PCs in some cases like some games do.

#

When I experiment on the GPU locally I often have to turn it off and back on again.

tacit knot
#

I haven't worked on the infrastructure/DevOps side of things in a long time, but would a Docker container even provide any protection/sandboxing?

iron basalt
#

We use a whole virtual machine now, and set an upper limit on memory usage, lower by a decent amount than the total. Can just restart the instance then.

tacit knot
#

It seems like some of this stuff is almost what I used to call "bare metal" code back in my C/C++ days.

iron basalt
#

Requires some annoying GPU pass-through work though, which your motherboard needs to support.

tacit knot
#

hmm my understanding was that VMs don't expose the GPU very well from a performance/overhead point of view

iron basalt
tacit knot
#

Ahh that makes sense. Kinda a limiting factor if i'm trying to make this broadly available/usable.

iron basalt
#

Yeah, support is spreading, but it will take a few years to be everywhere.

tacit knot
#

Also trying to avoid the insane install/setup process that I've had to go through with some of these experimental AI/ML things, standing up instant-ngp (NeRF) was insane for me.

iron basalt
#

These kinds of issues are one of the reasons we don't see ML on consumer hardware used in applications everywhere yet. GPU work is very buggy / annoying, and it's kind of a miracle that people manage to get video games to run on all hardware configurations (spoilers, they don't, it's endless bug reports (working with consoles is very nice by comparison / fixed hardware)).

tacit knot
#

Fixed hardware, but at least back when I was in school (degree in game design/development from a place called Full Sail) that meant you were also going to have to write your own low level libraries and such like memory managers or basic data structures. lol I shouldn't have to write my own doubly linked list damnit...

#

Damn, that was 20 years ago, just realizing that lol

iron basalt
#

And then to add to injury, the OS is really buggy (and getting worse over time (e.g. Windows 11 can barely manage to make a window fullscreen now without 10%+ of users crashing)).

tacit knot
#

ha, yea, the unknown combinations makes stuff a lot trickier

low delta
#

i'm new to AI. Does anyone know the GODEL NLP model (based on T5) who can give me some pointers? stuck on two things.

serene scaffold
low delta
# low delta i'm new to AI. Anyone knows the GODEL NLP model (based on T5) taht can give me s...

The big thing is I'm training the model IAW GODEL data structure of Context / Knowledge / Response using Huggingface Trainer. The code executed completely, but there doesn't seem to be any effect. I tried to query the model and it wouldn't give me any hint of the training data. The second thing is I'm working on a SQUAD metric to monitor the training process, but ran into some data structure conflict between the Trainer's evaluation output eval_pred and the SQUAD metric's input metric.add_batch.

warm copper
#

guys

#

what do you prefer? over or undersampling?

serene scaffold
weak lagoon
#

I have a big data problem. I have csv files with over 300 gb. Data is in multiple files. I want to carry out EDA. Data frames memory limit is only 100gb so that is not sufficient plus it is taking too long to process a single file even after chunking. What is the solution?

weak lagoon
glacial rampart
past meteor
#

Gonna become an evangelist for this topic.

  1. train classifier
  2. Make your ROC curves, precision recall plots, DET-curves, ...
  3. Select your decision boundary based on where you want to land on these curves
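The three steps above can be sketched without any library to show what moving the decision boundary does. This is a toy illustration with made-up scores and labels; `precision_recall_at` is a hypothetical helper, and in practice something like scikit-learn's `precision_recall_curve` computes the whole curve from the classifier's scores.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up classifier scores for an imbalanced problem: 2 positives, 8 negatives.
scores = [0.02, 0.05, 0.07, 0.10, 0.12, 0.20, 0.30, 0.35, 0.40, 0.80]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Lowering the threshold trades precision for recall.
for threshold in (0.5, 0.3, 0.1):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```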
#

Under/oversampling, class weights, ... are an opaque way to solve this problem. If there is a signal in your data you will see a difference between the negative and positive class

#

I think the curve-based approach is better because it's easier to reason about and you use your data, which is hopefully a representative sample of the population, as is. Of the alternatives, the one I could get most behind is class weights because at least you're not messing with your sample.

median leaf
#

do any of you guys understand AIC and BIC scores in regression models?

past meteor
median leaf
#

im tryna figure out why my AIC and BIC scores are in the 600s

past meteor
wooden sail
#

the number itself is also not that important, only that you pick the smallest one w.r.t. the model order

#

much like in most other cost functions
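To make the "pick the smallest one" point concrete, here is a small worked example using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n the number of observations, and L the maximized likelihood. The log-likelihood values and n are made up for illustration.

```python
import math


def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood


n = 200  # number of observations (made up)
# Candidate models: number of parameters -> maximized log-likelihood (made up).
candidates = {2: -310.0, 3: -300.0, 4: -299.5}

scores = {k: (aic(ll, k), bic(ll, k, n)) for k, ll in candidates.items()}
for k, (a, b) in scores.items():
    print(f"k={k}: AIC={a:.1f}  BIC={b:.1f}")

# Only the comparison matters: the model with the lowest score wins.
best_by_aic = min(scores, key=lambda k: scores[k][0])
best_by_bic = min(scores, key=lambda k: scores[k][1])
print(best_by_aic, best_by_bic)
```

All of these scores land in the 600s, which says nothing by itself; the scale depends on the data and the likelihood.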

lapis sequoia
#

how would you guys implement an api into a chatbot

crimson summit
#

Does anyone know why the picture on the right says it was much, much faster, if in that image it goes through 900 iterations and in the image on the left it only goes through 9?

glacial rampart
#

It doesn't say it's faster, it says it converges faster. In the left where you did 10 runs (which was your max iteration input), you didn't get to the point of convergence yet. And since the learning rate is small, it would take more iterations. Why don't you compare them both on 2000 iterations?

cosmic harbor
#

I cannot uninstall a package. Can anyone help?

$ vcs
bash: vcs: command not found...
Install package 'python3-vcstool' to provide command 'vcs'? [N/y] y

$ vcs
usage: vcs <command>

Most commands take directory arguments, recursively searching for repositories
in these directories.  If no arguments are supplied to a command, it recurses
on the current directory (inclusive) by default.

The available commands are:
   branch     Show the branches
   custom     Run a custom command
   diff       Show changes in the working tree
   export     Export the list of repositories
   import     Import the list of repositories
   log        Show commit logs
   pull       Bring changes from the repository into the working copy
   push       Push changes from the working copy to the repository
   remotes    Show the URL of the repository
   status     Show the working tree status
   validate   Validate the repository list file

See 'vcs <command> --help' for more information on a specific command.

$ pip uninstall vcs
WARNING: Skipping vcs as it is not installed.

$ pip uninstall python3-vcstool
WARNING: Skipping python3-vcstool as it is not installed.
tidal bough
glacial rampart
#

Ye, I also thought it's probably apt, though I'm not familiar with Bash. 'apt remove python3-vcstool' should do it. If it really is pip, you should be able to see your packages with 'pip list'

past meteor
#

pip list | grep vcs

dense crane
#

how am i supposed to convert each sentence to tensors when making a seq2seq model? The sentences don't all contain the same number of tokens, so that might cause a problem when i train the model, don't you agree?

hasty mountain
#

Just add the pad token to all your sentences until all of them have the same length as the longest one.

dense crane
#

is this like adding zeros?

keen gust
#

just came across datalore by jetbrains, does anyone use this? I'm working on data reports/insights for some of our staff (non-technical) and found streamlit to be limited. While looking up Dash documentation I came across this and it looked interesting. Curious if anyone has tried it or uses it currently

dense crane
#

so i am supposed to build the vocab where each token has its own number (including <pad>, <eos> and <sos>), then i convert each sentence to a vector of those integers, and from there i convert that to embeddings, right?

#

@hasty mountain

past meteor
hasty mountain
#

So a sentence <how><are><you><doing><?><pad> will be something like [1, 2, 3, 4, 5, 0]

dense crane
hasty mountain
#

Yes
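The vocab, integer encoding, and padding steps discussed above can be sketched in plain Python. The helper names and the choice of ids 0, 1, 2 for the special tokens are assumptions for illustration; if you are using PyTorch, `torch.nn.utils.rnn.pad_sequence` handles the padding step and `nn.Embedding` takes a `padding_idx` argument.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(sentences):
    """Give every token its own integer id; special tokens get 0, 1, 2."""
    vocab = {PAD: 0, SOS: 1, EOS: 2}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(sentences, vocab):
    """Turn token lists into equal-length lists of ids, padded with <pad>."""
    encoded = [[vocab[SOS]] + [vocab[t] for t in s] + [vocab[EOS]]
               for s in sentences]
    max_len = max(len(ids) for ids in encoded)
    return [ids + [vocab[PAD]] * (max_len - len(ids)) for ids in encoded]


sentences = [["how", "are", "you", "doing", "?"], ["hello", "!"]]
vocab = build_vocab(sentences)
batch = encode_and_pad(sentences, vocab)
for row in batch:
    print(row)
```

Every row in `batch` now has the same length, so the batch can be turned into a single tensor and fed to an embedding layer.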

past meteor
#

I'm asking because it's important context to answer your question correctly 🙂

dense crane
#

so ok thx!

keen gust
# past meteor What type of analysis do you do?

it'll mostly focus on financial data like revenue (MoM,YoY,etc.) for 70+ locations and then stats related to our main product offering so stuff like # of players, bookings, maybe customer demographics. The end goal is to replace our current Google Studio reporting with something a bit more automated while maintaining a very user friendly ui

past meteor
#

Okay the big brain answer is to focus on data prep and give business a BI tool and let them make the reports themselves

#

If that's a step too far in your org, you should still use a BI tool (Power BI, Tableau, Looker) to make the reports for them

keen gust
# past meteor If that's a step too far in your org, you should still use a BI tool (Power BI, ...

yeah it's still a very small op so there's no dedicated team for that unfortunately - it's likely to fall on myself to push out or at least give upper mgmt key stats at a glance. I was able to put together a basic dashboard on streamlit but it got a bit limiting as I wanted to do more so I've just been searching for alternatives. Our data isn't overwhelming and luckily a lot is self reported by our locations, it's just cleaning it up and trying to put together meaningful reporting. Not opposed to using a BI tool though

past meteor
#

What's limiting about streamlit for your case?

keen gust
# past meteor What's limiting about streamlit for your case?

at the moment I have a working db for our corporate locations w/ basic stats our controller uses for her work - this likely can be scaled & adjusted for franchise locations so it's not an issue. On the other hand, staff in our corp. locations are asking if something can also be done for them so that they can skip exporting csvs, making tables, etc. for their daily reporting. I'd ideally like to keep these as private apps and right off the bat not being able to deploy more than 1 on SL is an issue. Granted, I've been mostly self teaching Python for a few months so if any workarounds or more appropriate methods exist it's completely plausible I just don't have the 'know' right now. Streamlit also re-running on a user action isn't ideal if I expect a number of users to be on it at once.

#

also appreciate the responses & help fyi

past meteor
#

I think there's several issues here right?

  1. You want a way to onboard the data in a better way

  2. you want people to see only the data they're authorized to see?

#

if that's the case, you need to find out what is generating the data for those CSVs and either have it upload data to your DB directly or at least automate exporting the CSVs and then parsing it and putting it in your DB (less robust).

zinc briar
#

Is algebraic geometry used in machine learning

past meteor
#

BI tools at least have the second part baked into it.

cerulean kayak
#

Anybody got a website with a list of plugins that improve productivity/are just good to have for Data Science in Python notebooks via VSCode?
@ me if you got anything.

rose dagger
#

If i load my data by a Tensorflow DataGenerator from a directory like this, what file formats will the datagenerator accept? I'm trying to load grayscale images represented as 2D numpy arrays, but the datagenerator seems to recognize "0 images" in the directory.

crimson summit
#

i got another question real quick

#

Why is the graph of the cost vs iteration steeper for the picture on the right, which has a smaller learning rate than the picture on the left? I would think that since the picture on the left has a bigger learning rate the graphs would be the opposite

glacial rampart
#

Whether a learning rate is (too) small or (too) big depends on the data. If the learning rate is too small, a larger value will lead to faster convergence. However, if the learning rate is already good or too large, increasing it further leads to slower convergence or no convergence at all. The answer lies in how the w-values develop over iterations.
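A toy example makes the three regimes visible. This is a minimal sketch on the one-dimensional cost w**2 rather than the regression problem from the screenshots, so the specific learning rates are only illustrative.

```python
def gradient_descent(lr, steps, w0=10.0):
    """Minimize cost(w) = w**2 (gradient 2*w); return the cost after each step."""
    w, costs = w0, []
    for _ in range(steps):
        w -= lr * 2 * w
        costs.append(w ** 2)
    return costs


# Too small: slow progress. Well chosen: fast convergence. Too large: diverges.
for lr in (0.01, 0.4, 1.1):
    costs = gradient_descent(lr, steps=20)
    print(f"lr={lr}: cost after 20 steps = {costs[-1]:.3g}")
```

With the too-large rate each update overshoots the minimum by more than the previous error, so the cost grows instead of shrinking.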

crimson summit
#

the pic on the right has a much steeper cost vs iteration graph

safe lintel
#

yo anyone here know about R programming?

past meteor
safe lintel
#

have any idea about creating a subset using 2 diff datasets?

past meteor
#

I don't understand your question

safe lintel
#

like i have 2 diff data files and not all samples are present in both files. so we need to first subset the samples that are present in both files.

glacial rampart
glacial rampart
safe lintel
#

yes i did that , but the issue is my one data file does not have any header name and that is creating problem for me

#

if u could allow i can send u the file to have a look.

glacial rampart
#

I'm watching a competition atm 😛 so sorry not going to do too much. Are the files comparable? E.g. should the 2nd file have the same header as file 1?

safe lintel
#

these are headers in 1st file ("" "type" "tissue_source_site" "disease_type") but for the second file ("" "TCGA.E9.A1NI.01A" "TCGA.A1.A0SP.01A" "TCGA.E2.A14T.01A" "TCGA.AR.A24O.01A"........) it just start like this no header or something at all

past meteor
#

Are the columns the same?

#

Why not just append them to each other

safe lintel
#

there are no columns in 2nd file.

#

1st file

#

2nd file

#

...

glacial rampart
#

So how would you like to subset them?

safe lintel
#

i have no idea how subsetting works, if u can shed some light on it

glacial rampart
#

Well, if you want to merge or use those files together, there should be a way to match the records, right?

#

Otherwise you just put random data together which is meaningless

safe lintel
#

yes the header "type" has data which is also present in 2nd file

glacial rampart
#

I'm not sure how it works in R, but in that case I'd remove both file headers and manually create header arrays and load the files with those

#

make sure the "type" header is on the correct location for both files and name other columns whatever you want

safe lintel
#

if u can explain in python that would work too

glacial rampart
#

In Pandas you can specify column names when you read_csv

#

Just like this: ['name_a', 'name_b', 'yolo', 'type']

safe lintel
#

but as i told the 2nd file looks like this , how im gonna create header ?

#

gene_expression_df = pd.read_csv('tumor_gene_expression_data.csv', 'name_a') like this?

glacial rampart
#

Hmm that file will be a problem since it all seems to be on 1 row and only whitespace separated

safe lintel
#

yes, i've been banging my head on that since yday, tried a lot of solutions, still nothing works 😦

#

any fix for this?

agile cobalt
#

hard to tell without having the full file to look at, but worst case scenario just parse it with normal string manipulation then pass it to pandas as a dictionary of lists

glacial rampart
#

Yes it can be fixed, but it will involve some work. Think of 'readlines()', regular expressions and writing to csv. Those are the 3 things you will need to do is my guess.

#

If you can get it to csv format, all you need to do is specify headers yourself and you can use it however you want

safe lintel
#

i tried this code
import pandas as pd

# Read the TSV file
df = pd.read_csv('TCGA-BRCA.htseq_fpkm-uq_gene_name.tsv', sep='\t')

# Convert to CSV
df.to_csv('output_file.csv', index=False)

#

now it looks like this

glacial rampart
#

Ahah, that is a lot different than I expected. That makes it a lot easier though. There's just a lot of column_names but the file quality seems fine.

safe lintel
#

what should i do next?

glacial rampart
#

So are you now able to find the column_name you want?

safe lintel
#

no, i wasn't able to find it

glacial rampart
#

Oh I guess I understand what you want to do now

#

You need to pivot your data

safe lintel
#

how exactly?

warm copper
#

@weak lagoon then how do you solve imbalanced target variable

#

@serene scaffold I was trying to offset imbalanced target variable

#

also, an imbalanced target variable creates an incorrect model and you need to offset it

glacial rampart
safe lintel
glacial rampart
safe lintel
glacial rampart
#

Haven't tested it myself, but try: df_unpivot = pd.melt(df, id_vars='unnamed', value_vars=df.columns[1:-1])

glacial rampart
#

Didn't know how to exactly write it without testing

#

df_result = df.melt(id_vars=['unnamed'], value_vars=df.loc[:, df.columns != "unnamed"])
This should work. Replace unnamed with what u called it

#

(and ofc replace df with gene_expression_df)

safe lintel
#

well after fixing a bit got this

brave sand
#

for this model ssd_resnet50_v1_fpn_640x640_coco17_tpu-8

#

what is the recommended image size?

#

my current image size from the xml files is this:

<size>
<width>4056</width>
<height>3040</height>
<depth>3</depth>
#

is that too much?

#

i'm still dealing with the kernel memory filling up on colab

agile cobalt
agile cobalt
brave sand
#

yeah i thought that would be the culprit

brave sand
agile cobalt
arctic wedgeBOT
#

research/object_detection/configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config lines 46 to 51

image_resizer {
  fixed_shape_resizer {
    height: 640
    width: 640
  }
}
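To put the two sizes side by side, here is the raw memory per decoded image, assuming 3 channels at one byte per value (uint8). The helper name is hypothetical, and this only counts the pixel buffer, not anything the training pipeline adds on top.

```python
def image_bytes(width, height, channels=3, bytes_per_value=1):
    """Size of one decoded image held as a dense array."""
    return width * height * channels * bytes_per_value


original = image_bytes(4056, 3040)  # size from the XML annotation above
resized = image_bytes(640, 640)     # size set by fixed_shape_resizer

print(f"original: {original / 2**20:.1f} MiB per uint8 image")
print(f"resized:  {resized / 2**20:.1f} MiB per uint8 image")
print(f"ratio:    {original / resized:.1f}x")
```

Since the model resizes to 640x640 anyway, the full-resolution images mostly cost memory and decoding time before that step.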
brave sand
#

this memory issue is really bugging me now

agile cobalt
coral field
#

for subwords that replace out of vocabulary words, how does the model interpret the new definition?

glacial rampart
# safe lintel

I'm convinced my code should work. Probably something went wrong with renaming that column
You can indeed remove the brackets though. So you end up with:
gene_expression_df.melt(id_vars='identifier', value_vars=gene_expression_df.loc[:, gene_expression_df.columns != 'identifier'])

brave sand
#

im not sure what is wrong then

#

should I still resize it bc the image is still massive?

#

i reduced the dataset to 100 images only

coral bloom
#

is it true that gpt-4 training data was only 45GB?

#

anyone?

agile cobalt
#

almost definitely not

#

maybe the fine tuning dataset, but not the full training data (including pre-training)

warm copper
#

@agile cobalt how do you deal with unbalanced target values

agile cobalt
warm copper
#

Very unbalanced

#

Oversampling or undersampling is not suitable?

#

Or SMOTE @agile cobalt

#

see @agile cobalt

#
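from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# class_weight='balanced' weights classes inversely to their frequency
rf_model_balanced = RandomForestClassifier(random_state=0, class_weight='balanced')
rf_model_balanced.fit(train_features, train_labels)
rf_model_balanced_pred = rf_model_balanced.predict(test_features)

print(classification_report(test_labels, rf_model_balanced_pred))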
#

this is what I did

dusty plaza
#

ChessDotComResponse(stats=Collection(chess_blitz=Collection(best=Collection(date=1673051280, game='', rating=310), last=Collection(date=1686273194, rating=207, rd=119), record=Collection(draw=1, loss=8, win=3)), chess_bullet=Collection(best=Collection(date=1679884176, game='', rating=451), last=Collection(date=1686027007, rating=309, rd=113), record=Collection(draw=0, loss=5, win=4)), chess_daily=Collection(last=Collection(date=1673799161, rating=400, rd=350), record=Collection(draw=0, loss=1, time_per_move=66341, timeout_percent=0, win=0)), chess_rapid=Collection(best=Collection(date=1686345903, game='', rating=651), last=Collection(date=1686362671, rating=651, rd=31), record=Collection(draw=35, loss=187, win=169)), puzzle_rush=Collection(best=Collection(score=6, total_attempts=9)), tactics=Collection(highest=Collection(date=1685588227, rating=1349), lowest=Collection(date=1675015529, rating=412))))

Anyone know how I can use this data to pull the wins, draws, losses of a specific game type? I am planning on figuring out how to differentiate between them
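Judging only by the printed repr, the nested `Collection` objects look like they expose their fields as attributes. Under that assumption, a minimal sketch for pulling the record of one game type could look like this; `SimpleNamespace` is used as a stand-in for the real response object, and `record_for` is a hypothetical helper.

```python
from types import SimpleNamespace as NS

# Stand-in built from a few fields of the response printed above. The real
# object comes from the chess.com client and is assumed to allow the same
# attribute access.
stats = NS(
    chess_blitz=NS(record=NS(draw=1, loss=8, win=3)),
    chess_rapid=NS(record=NS(draw=35, loss=187, win=169)),
)


def record_for(stats, game_type):
    """Return (wins, draws, losses) for a game type such as 'chess_rapid'."""
    record = getattr(stats, game_type).record
    return record.win, record.draw, record.loss


wins, draws, losses = record_for(stats, "chess_rapid")
print(wins, draws, losses)
```

Note that some entries in the response (for example `puzzle_rush` and `tactics`) have no `record` field, so the lookup only makes sense for the game types that do.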

plucky bolt
#

Anyone here also work with C++?

zinc briar
# plucky bolt Anyone here also work with C++?

Of course! In the realm of C++, I find myself immersed in the intricacies of its profound abstractions and the boundless tapestry of algorithmic finesse it offers. Metaprogramming serves as a key instrument, allowing us to transcend the limitations of traditional runtime execution by deftly manipulating compile-time computation. Memory management becomes a virtuosic endeavor as I orchestrate intricate ballets of optimization, minimizing runtime overhead through the adept utilization of smart pointers, move semantics, and refined memory layouts. Within the expansive dominion of the C++ standard library, I harness an array of opulent algorithms, containers, and utilities, evoking an exquisite finesse within my code. Together, we embark on a collective pursuit of enlightenment, unraveling the enigmatic tapestry of C++ as we push the boundaries of innovation and amplify the crescendo of our software engineering prowess.

serene scaffold
serene scaffold
plucky bolt
serene scaffold
queen cradle
coral field
#

why can tensorflow models take in both tensors and numpy arrays, but not regular lists? why even convert to numpy?

wooden sail
#

because you'd realistically do math, e.g. preprocessing of some sort, in numpy, and its arrays are easy to convert for many reasons: having a single dtype, being contiguous in memory, and having an efficient interface, like being able to take in or give out their buffer. pretty much none of those are true for python lists

coral field
#

ah

wooden sail
#

you should be able to use a list of numpy arrays though

#

convert_to_tensor should be able to grab lists and make tf tensors out of them

zinc briar
#

Cant numpy arrays be tensors

wooden sail
#

sure they can, but i think they meant specifically tf tensors

lapis sequoia
#

simple uncens chatgpt

#

source code in my github

worn mango
#

anybody know why vector addition using numpy might produce different results across two machines?

#

numpy and python versions are the same ^

wooden sail
brazen lichen
#

Hi, Could someone please share data science - AI learning path and topics to be covered ?

tidal bough
#

like, the default integer dtype (e.g. from np.arange(100)) may be int32 or int64 depending on the system and numpy version

coral bloom
#

does anyone know

#

how to render template in a subfolder of templates in flask

potent sky
tidal bough
#

I thought basic CPU operations were guaranteed to produce consistent results

potent sky
#

That may be, but from what I understand, the vectorization process can potentially change the order of operations. Considering this manipulation is at very low level of memory management, this could lead to slightly different results, especially across different architectures as the simd implementations will also be different
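The order-of-operations point is easy to demonstrate with plain Python floats, since floating-point addition is not associative. This is a minimal illustration of the effect, not a reproduction of what any particular SIMD implementation does.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, summed in two different groupings.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right, right_to_left)
print(left_to_right == right_to_left)              # exact comparison
print(math.isclose(left_to_right, right_to_left))  # tolerance-based comparison
```

When comparing results across machines, a tolerance-based check (`math.isclose` here, or `np.allclose` for arrays) is usually the more meaningful test.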

wooden sail
tidal bough
dense umbra
brazen lichen
iron basalt
#

It's why some multiplayer games (e.g. Starcraft) use fixed point.

#

(For deterministic results across different devices)

#

There are some CPU flags that can be enabled to get the same results, but they come at a performance cost. Some physics engines support that.

#

Then there are bugs, found in many math libraries, and the hardware itself.

warm copper
#

Does anyone here know how to deal with imbalanced response variable?

#

I have a very imbalanced response variable for categorical prediction

#

I am trying to predict the possibility of a bankruptcy of a company

#

I am using RandomForestClassifier

#

I used class_weight='balanced' parameter

#

but Im not sure that is good enough

#

I read some people do under or oversampling

#

or SMOTE

#

but some places tell me not to do those

#

so Im a bit confused

past meteor
#

I've answered this question like 3x

warm copper
#

you did?

#

you should have tagged me @past meteor

past meteor
#

Instead of saying everything above 0.5 is bankrupt and under not you should look at precision recall, ROC, DET, ... curves and determine a cut-off yourself @warm copper

warm copper
#

so like using under and oversampling? @past meteor

past meteor
#

no, just your data as-is

warm copper
#

how do I determine a cut off tho?

#

Ive never heard that before

past meteor
#

For example you can simply plot the distribution (a histogram) of the scores for your positive and negative class

#

And then you eyeball your data and decide where to put it

warm copper
#

how would that even work tho when 98 percent of the data has bankrupted companies

#

lol

#

I already did that @past meteor

past meteor
#

This is an example of what I mean

warm copper
#

this is either 0 or 1 tho

past meteor
#

So here you can see that if I choose the score at 0.08 there would be no more false negatives

warm copper
#

I dont know what li score is tho

past meteor
#

In your case li score will be whatever probability you have

#

I was working on fingerprint recognition, so li = left index; for you it'll be "bankruptcy score" or whatever

warm copper
#

I have the count bars

#

😄

past meteor
#

your model has a .predict() and a .score() method, use the latter

#

Sorry, it's .predict_proba()

#

I'm actually giving a pretty shitty explanation I know @warm copper , part of it is me not wanting to type out what I've done x3 over the past few days but that's my fault and not yours 😅

worn mango
warm copper
#

lol

#

can I dm you?

#

@past meteor just to show

past meteor
#

I prefer if we keep it here because then other people can add stuff as well 🙂

warm copper
#
pred_prob = rf_model.predict_proba(test_features)
print(pred_prob)

plt.figure(figsize=(16, 10))
plt.hist(pred_prob[test_labels == 0], bins=50, label='Negatives', alpha=0.5, color='b')
plt.hist(pred_prob[test_labels == 1], bins=50, label='Positives', alpha=0.7, color='r')
plt.xlabel('Probability of being Positive Class', fontsize=25)
plt.ylabel('Number of records in each bucket', fontsize=25)
plt.legend(fontsize=15)
plt.tick_params(axis='both', labelsize=25, pad=5)
plt.show()
#

am I doing something wrong here?

#

my test_features = X_test

past meteor
#

You need to normalize the scores because I can imagine test_labels == 0 is a lot more than test_labels == 1

warm copper
#

my test_labels = Y_test

#

I keep getting this error tho

#

ValueError: The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors were provided

#

I fixed it

past meteor
#

Now you still have to normalize the scores and then you're nearly there

warm copper
#

they are already normalized tho

past meteor
#

Then how is your y-axis going up to 5000?

warm copper
#
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)
past meteor
#

I mean normalize your histogram

warm copper
#

I used standard scaler

#

how do I normalize it?

#

density=True?

past meteor
#

yes
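
To illustrate what `density=True` changes, here is a small sketch with `np.histogram` on hypothetical scores (the beta-distributed arrays are made up). Raw counts let the big negative class dominate the y-axis; with `density=True` each histogram is rescaled to area 1, so the two classes become comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: many negatives with low scores, few positives with high scores.
neg_scores = rng.beta(2, 8, size=5000)
pos_scores = rng.beta(6, 3, size=200)

bins = np.linspace(0.0, 1.0, 51)  # 50 equal-width bins on [0, 1]

# Raw counts: the negative class dwarfs the positive class.
neg_counts, _ = np.histogram(neg_scores, bins=bins)
pos_counts, _ = np.histogram(pos_scores, bins=bins)

# density=True rescales each histogram so its area is 1,
# which puts both classes on a comparable y-axis.
neg_density, _ = np.histogram(neg_scores, bins=bins, density=True)
pos_density, _ = np.histogram(pos_scores, bins=bins, density=True)

bin_width = bins[1] - bins[0]
print(neg_counts.max(), pos_counts.max())
print((neg_density * bin_width).sum(), (pos_density * bin_width).sum())
```

`plt.hist` accepts the same `density=True` argument, so the plotting call only needs that one extra keyword.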

warm copper
#

question tho

#

why do my negatives and positives legends have different colors

past meteor
#

No idea. The general idea btw is that these plots can help you select a sensible probability to decide if something is in the negative or positive class.

#

Ties in well with the ideas behind ROC-curves, precision - recall, etc.

warm copper
#

yeah

#

😄

past meteor
#

Undersampling, oversampling, class weights take part of this out of your hands when it should probably be a decision you make

warm copper
#

I'm going to try to fix the graph first

#

interesting @past meteor

past meteor
#

Yeah they're not calibrated probabilities

#

They're over and/or underestimated but for this example that doesn't matter too much
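
One way to check calibration, sketched with scikit-learn's `calibration_curve` on simulated data (the "overconfident model" below is an assumption made up for illustration): bin the scores and compare the mean predicted probability in each bin with the fraction of positives actually observed there.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
# Hypothetical overconfident model: the true positive rate is score**2,
# so every reported score overestimates the real probability.
y_prob = rng.uniform(0.0, 1.0, size=20000)
y_true = (rng.uniform(0.0, 1.0, size=20000) < y_prob**2).astype(int)

# Per bin: observed fraction of positives vs mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

for observed, predicted in zip(frac_pos, mean_pred):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```

For a well-calibrated model the two columns would roughly match; here the observed fraction sits below the predicted one in every bin.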

warm copper
#

okay

#

I don't know why my graph is messing up

#

did you use seaborn for yours?

past meteor
#

Yes, sns.histplot()

warm copper
#

I think the probalem is that predict_proba returns a 2d array

#

problem**

#

maybe I need to convert it to one dimensional array?

#

ValueError: The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors

#

hence this error
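
A small sketch of why that happens, using a hypothetical stand-in array instead of the real `rf_model.predict_proba(test_features)` output: masking the 2D array keeps both class columns, and `plt.hist` treats each column of a 2D input as its own dataset, so a single `color` no longer matches. Slicing out the positive-class column first gives the 1D array the histogram expects.

```python
import numpy as np

# Stand-in for rf_model.predict_proba(test_features): one row per sample,
# one column per class (hypothetical values).
pred_prob = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.1, 0.9]])
test_labels = np.array([0, 0, 1, 0, 1])

# Boolean-masking the 2D array keeps both columns, and plt.hist treats
# each column of a 2D input as a separate dataset.
print(pred_prob[test_labels == 0].shape)   # still 2D

# Keep only the positive-class column first, then mask.
pos_prob = pred_prob[:, 1]
neg_class_scores = pos_prob[test_labels == 0]
pos_class_scores = pos_prob[test_labels == 1]
print(neg_class_scores.shape, pos_class_scores.shape)
```

With that change the two `plt.hist` calls would take `pos_prob[test_labels == 0]` and `pos_prob[test_labels == 1]`, each with a single color.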

zealous badger
#

hi, i have a scatterplot in plotly and i want to display labels for the bubbles, but there are a lot of them and it's not really neat. is there a good way to selectively display the labels?

warm copper
#

colors are fixed @past meteor

#

increased the alpha too

#

oh wait

warm copper
#

I found the best threshold using predict_proba @past meteor

#

Best Threshold=0.230000, F-Score=0.457
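
For anyone following along, a threshold sweep like this can be sketched as below. This is not necessarily how the numbers above were produced: the scores are simulated, `f1_at` is a hypothetical helper, and only the print format is borrowed from the message.

```python
import numpy as np

def f1_at(threshold, scores, labels):
    """F1 score when predicting positive for scores >= threshold."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

rng = np.random.default_rng(0)
# Hypothetical imbalanced scores (assumption: not the real model output).
labels = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.beta(2, 8, size=2000), rng.beta(4, 4, size=100)])

# Try every threshold on a 0.01 grid and keep the one with the highest F1.
thresholds = np.arange(0.0, 1.0, 0.01)
f_scores = np.array([f1_at(t, scores, labels) for t in thresholds])
best = int(np.argmax(f_scores))
print("Best Threshold=%f, F-Score=%.3f" % (thresholds[best], f_scores[best]))
```

Ideally the sweep runs on a validation split rather than the test set, otherwise the chosen threshold is tuned to the data it is later evaluated on.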

#

😄

past meteor
#

Selecting the right one is more a business concern than a data science one. Do you find false positives or false negatives worse? 🙂
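
To make that concrete, here is a sketch that picks the threshold by minimizing a total error cost instead of maximizing F-score. The scores, labels, and cost values are all made up, and `total_cost` / `best_threshold` are hypothetical helpers.

```python
import numpy as np

def total_cost(threshold, scores, labels, cost_fp, cost_fn):
    """Total cost of errors when predicting positive for scores >= threshold."""
    predicted = scores >= threshold
    fp = np.sum(predicted & (labels == 0))
    fn = np.sum(~predicted & (labels == 1))
    return cost_fp * fp + cost_fn * fn

# Hypothetical scores and labels, for illustration only.
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95])
labels = np.array([0,    0,    1,    0,    1,    0,    1,    1])
thresholds = np.linspace(0.0, 1.0, 101)

def best_threshold(cost_fp, cost_fn):
    """Lowest-cost threshold on the grid (first one in case of ties)."""
    costs = [total_cost(t, scores, labels, cost_fp, cost_fn) for t in thresholds]
    return thresholds[int(np.argmin(costs))]

# Made-up costs: a missed positive is 10x worse than a false alarm, and vice versa.
print("FN expensive:", best_threshold(cost_fp=1, cost_fn=10))
print("FP expensive:", best_threshold(cost_fp=10, cost_fn=1))
```

When false negatives are the expensive error the chosen threshold drops, and when false positives are expensive it rises, which is why the cost ratio has to come from the business side.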

warm copper
#

so this is for false positive

#

should I also perform for false negative?

#

i did it for positive outcome

#

😄

past meteor
#

It's fine like this tbh

warm copper
#

Best Threshold=0.230000, F-Score=0.457
Best Threshold=0.110000, F-Score=0.125

past meteor
#

There's a million other plots like this you can make: FPR-FRR (false positive vs false rejection rate), DET (detection error), ROC, PR curve, ...
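
A sketch of where the numbers for a couple of those plots come from, using scikit-learn on simulated scores (`y_true` and `y_score` are hypothetical stand-ins for `Y_test` and the positive-class column of `predict_proba`). Each function returns the curve points over all thresholds, which can then go straight into `plt.plot`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Hypothetical labels and scores (assumption: simulated, not real model output).
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
y_score = np.concatenate([rng.beta(2, 6, size=900), rng.beta(5, 3, size=100)])

# ROC: false positive rate vs true positive rate over all thresholds.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
# FRR (false rejection rate) is the miss rate, i.e. 1 - TPR.
frr = 1.0 - tpr

# Precision-recall: usually more informative when positives are rare.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

auc = roc_auc_score(y_true, y_score)
print("ROC AUC: %.3f" % auc)
```

Plotting `fpr` against `frr` gives the FPR-FRR trade-off, and `recall` against `precision` gives the PR curve.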

warm copper
#

false negative looks horrible

#

😄

#

this is for negative outcome