#data-science-and-ml
I'm not using it for anything particularly important downstream hence why I was worried about the overengineering part. At best it's just something we might use if we're doing workshops and we want to quickly generate data on a topic if we want to talk about it.
I used this to generate the .ckpt file:
import tensorflow as tf
# Path to the .meta file
meta_path = '/home/ethan/UAS4STEM/models/checkpoints/model.ckpt.meta'
# Create a session
with tf.compat.v1.Session() as sess:
    # Restore the graph structure from the .meta file
    saver = tf.compat.v1.train.import_meta_graph(meta_path)
    # Save the variables to a .ckpt file
    saver.restore(sess, '/home/ethan/UAS4STEM/models/checkpoints/model.ckpt')
are there privacy protection concerns? for the real data you're using to condition this model
ngl these days I want to give up as soon as I see tf.compat.v1
where from are you following this tutorial
I googled it, not a tutorial. the tutorial doesn't go into depth enough
this loads the ckpt file, not generates it right?
No, it's really something like where we want to quickly generate high-quality looking BS data on say health and someone draws up the PGM with some realistic looking values / relationships and we spit out the output from the VAE (or directly from the PGM)
and batchsize
check out the tf docs for tf.train.latest_checkpoint()
I think it takes as arg the directory where you have those checkpoints
my batch_size is 64 and my num_steps is 1000
then maybe you can use the original data with either some augmentation, rule based generation, or purely a VAE on top
your idea is exciting tho
I used this guide from zestar
https://stackoverflow.com/questions/41265035/tensorflow-why-there-are-3-files-after-saving-the-model
There is no original data
that really depends on your data, your learning problem, and other hparams
there is no universally good number
same for this but you don't want to keep it too small as the updates to the gradient might be "volatile" then
It's truly a case of drawing up a PGM for any topic you care about, picking values and then just simulating 😛
oh alright
yeah, i think the problem is that the .ckpt file is 0 bytes
which means it'll be stuck on the loading thing
i have no idea why it's 0 bytes though
ah then graphical model looks like a good idea to capture the relationships between the variables
what about rule based generation
the other 2 .ckpt files aren't 0 bytes
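A quick way to see which checkpoint files are empty, as a minimal sketch using only the standard library. The helper names are made up for illustration, and the `*ckpt*` glob pattern is an assumption about how the files are named. As far as I know, in the TF1-style format `model.ckpt` is a prefix shared by the `.index`, `.data-...` and `.meta` files rather than a file of its own, so a 0-byte file with exactly that name is suspicious.

```python
from pathlib import Path


def checkpoint_file_sizes(checkpoint_dir):
    """Return {file name: size in bytes} for every file whose name contains 'ckpt'."""
    directory = Path(checkpoint_dir)
    return {
        path.name: path.stat().st_size
        for path in sorted(directory.glob("*ckpt*"))
        if path.is_file()
    }


def empty_checkpoint_files(checkpoint_dir):
    """List the checkpoint-related files that are 0 bytes."""
    sizes = checkpoint_file_sizes(checkpoint_dir)
    return [name for name, size in sizes.items() if size == 0]
```

Calling `empty_checkpoint_files('/home/ethan/UAS4STEM/models/checkpoints')` would list the 0-byte files in the directory from the snippet above.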
I haven't heard of rule based generation, I'll look it up
also wdym by PGM
probabilistic graphical model, sorry
it won't be very robust, I think you must be aware of it under a different name
but what you could do otherwise is use a domain specific Generative model and use rule-based conditioning instead of explicit rule based generation
that should also make it a bit non-boring xd
I think there was a paper on smtg similar but I'm not sure
how big is the .data-00000-of-00001 file
also how did you save these checkpoint files
no, you should be good
https://towardsdatascience.com/custom-object-detection-using-tensorflow-from-scratch-e61da2e10087
step 8
Custom Dataset Training for Object Detection using TensorFlow | Dog Detection in Real time Videos | Perfect Guide for Object Detection
What is rule based conditioning? p(X|c_1) ~ N(a_1,b_1) and p(X|c_2) ~ N(a_2, b_2)? (sorry for the many questions)
Yep yep
Wait I'm not sure about representing it as a normal distribution
And np with the questions lol I learn a lot too
Yeah, I just took that one but it could be any distribution
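To make the idea concrete, here is a minimal sketch of rule-based conditioning as "one distribution per condition". The conditions, means and standard deviations in `RULES` are made-up placeholders, and the normal distribution is only one possible choice.

```python
import random

# Hypothetical rules: each condition maps to the (mean, std) of a normal distribution.
RULES = {
    "c_1": (120.0, 10.0),
    "c_2": (150.0, 15.0),
}


def sample_given_condition(condition, n, seed=None):
    """Draw n values of X from the distribution attached to the given condition."""
    rng = random.Random(seed)
    mean, std = RULES[condition]
    return [rng.gauss(mean, std) for _ in range(n)]
```

Swapping `rng.gauss` for another sampler (for example `rng.expovariate` or `rng.betavariate`) changes the family without changing the structure.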
It's been a while since I've done anything related to graphical models but I think that idea might be subsumed by them
I was thinking if you have a good p(c_1|X) model and a good p(X) model then you could get p(X|c_1) as k * p(c_1|X) * p(X), where k = 1/p(c_1) is just the normalizing constant (Bayes' rule)
By model here I mean anything which can replicate the desired distribution
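A small numeric sketch of that Bayes' rule idea for a discrete X. The categories and probabilities are invented for illustration; the only real content is that the normalizing constant is one over the sum of p(c_1|X) * p(X) across all values of X.

```python
def posterior_over_x(prior_x, likelihood_c_given_x):
    """Return p(X | c) from p(X) and p(c | X) for a discrete X."""
    unnormalized = {x: likelihood_c_given_x[x] * p for x, p in prior_x.items()}
    evidence = sum(unnormalized.values())  # this is p(c)
    k = 1.0 / evidence                     # the normalizing constant
    return {x: k * value for x, value in unnormalized.items()}


# Made-up numbers: p(X) and p(c_1 | X) for three values of X.
prior_x = {"low": 0.5, "medium": 0.3, "high": 0.2}
likelihood = {"low": 0.1, "medium": 0.4, "high": 0.8}
posterior = posterior_over_x(prior_x, likelihood)
```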
I think I'll just go over my old slides again but in the meantime I'll also think of conceptually simpler ways to do this 😛
Maybe, my experience with graphical models is also limited
One of the many things of university that are squarely under YAGNI
I always like an elegant statistical solution xd
Then you'll enjoy spending time with graphical models. They're really at the intersection of CS, stats and domain knowledge
https://dtai.cs.kuleuven.be/problog/ was peak YAGNI 😢
Yeah I was collaborating on a graphical models x federated learning project but I was mostly handling the fl part
Gotta dive into graphical models properly
I hadn't come across this so far will have to check it out sometime xd
Yep this seems cool
Should work
Not a lot of value - many of my former profs peaked in the late 90s early 00s so a lot of time was spent on esoteric things like that or restricted boltzmann machines when more relevant techniques existed. Took "offense" with the RBMs because I'm a relatively recent grad and covering other generative methods in more detail would've been more relevant but I digress
Hmm fair enough
RBMs really kicked off the whole hierarchical representation thing tho, very impressive
But relevance is important yeah. Maybe a good overview of RBMs and then they could've moved on xd
RBMs are somewhat relevant but Hebbian learning and hopfield networks probably less so: https://arxiv.org/abs/2008.02217
Okay I'm not at all familiar with hopfield networks
Looks interesting. And that too goes onto the impossible pile of "I'll check it out"
Put it at the bottom of that pile. Either way, enjoy your day 🙂
Haha sure, you too!
@potent sky hey do you mind explaining batch size?
thank you !!!!!
I would not worry about whether models have progressed or not. I would learn the basics inside and out, because then you will have a deep understanding that will allow you to use and build with the current models much better. I followed along building the neural network in a book called Make Your Own Neural Network by Tariq Rashid. It was good and got the job done but there's def way better tutorials out there. I would also watch and understand all of 3blue1brown's videos on youtube that he has put out on neural networks
It's how many data instances are processed together (in batches)
It has several advantages like making better use of parallel processing, stabilizing gradient updates, introducing controlled stochasticity, etc.
oh ok, so not something i should be worrying about
You'd select your batch size depending on how your data is distributed and how many instances you want to process before taking a learning step
In case of larger models, it also depends on how much memory you have
Popular choices for batch size are 16, 32, 64 etc
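For a feel of what "processed together in batches" means, a minimal framework-free sketch. The function name and the list standing in for a dataset are illustrative only; real frameworks do this for you.

```python
import random


def iterate_minibatches(data, batch_size, shuffle=True, seed=None):
    """Yield consecutive batches of at most batch_size items from data."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


dataset = list(range(1000))  # stand-in for 1000 training instances
batches = list(iterate_minibatches(dataset, batch_size=64, seed=0))
# 1000 instances in batches of 64 give 15 full batches plus one batch of 40.
```

Each batch would normally produce one gradient update, so a smaller batch size means more, noisier updates per pass over the data.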
Keep a decent batch size and no
https://www.youtube.com/watch?v=amURyS6CAaY
around 20:22, he sets the checkpoint to ckpt-0, I don't get how though, what does that represent
This video shows step by step tutorial on how to train an object detection model for a custom dataset using TensorFlow 2.x. The custom object trained here is a face mask.
my batch_size right now is 64 with 16 gb ram
Cool enough
How much VRAM or are you not training on GPU?
I am training on GPU
8 GB vram
the process is always killed though:
2023-06-07 16:05:21.075792: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_24' with dtype resource
[[{{node Placeholder/_24}}]]
Killed
Is there a xgboost-gpu version for Windows 10? When I do pip install xgboost-gpu i get the error: ERROR: Could not find a version that satisfies the requirement xgboost-gpu (from versions: none)
ERROR: No matching distribution found for xgboost-gpu
you should install regular xgboost, one of the parameters is running it on GPU
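A hedged sketch of what that parameter looks like. To my knowledge XGBoost 1.x selects the GPU with `tree_method="gpu_hist"`, while 2.x uses `tree_method="hist"` together with `device="cuda"`. The helper `gpu_params` is made up here, and whether the installed build actually has GPU support is an assumption to check against the XGBoost docs.

```python
def gpu_params(xgboost_major_version):
    """Return XGBoost parameters that request GPU training for the given major version."""
    if xgboost_major_version >= 2:
        # XGBoost 2.x selects the device separately from the tree method.
        return {"tree_method": "hist", "device": "cuda"}
    # XGBoost 1.x uses a dedicated GPU tree method.
    return {"tree_method": "gpu_hist"}


# Usage (needs xgboost and a CUDA-capable GPU, so it is left commented out):
# import xgboost as xgb
# major = int(xgb.__version__.split(".")[0])
# model = xgb.XGBClassifier(**gpu_params(major))
```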
Can someone please help with this error:
[[{{node Placeholder/_0}}]]
2023-06-07 21:29:23.493493: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_29' with dtype int64
[[{{node Placeholder/_29}}]]
Killed
I need help with the code for paraphrasing the text, can someone here help or should I ask in another channel?
You can ask here
I need help with the code I can't input the text in the text area anymore
Way too ambiguous if you're after anything reliable.
Hey room, I'm trying to think of a way in which I can analyse the results of two different classification models that I made with 37 classes.
Here is a confusion matrix which is a little misleading, because the scales are not the same.
EfficientNet:
This is SqueezeNet
I also have the confusion matrix for each of these stored in a DataFrame.
top 5 accuracy and top 1 accuracy?
1.) I will be adjusting the scales on the confusion matrix images.
2.) Do you have any ideas on how I should review why certain breeds of animal might have been misclassified?
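One possible starting point, as a sketch: normalize each row of the confusion matrix so both models are on the same 0 to 1 scale, then list the largest off-diagonal entries to see which classes get mixed up most. This assumes rows are true classes and columns are predictions; the breed names and counts below are invented.

```python
def row_normalize(matrix):
    """Turn counts into per-true-class rates so every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def top_confusions(matrix, labels, k=5):
    """Return the k largest off-diagonal rates as (true label, predicted label, rate)."""
    rates = row_normalize(matrix)
    pairs = [
        (labels[i], labels[j], rates[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and rates[i][j] > 0
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)[:k]


# Toy 3-class example (rows = true class, columns = predicted class).
labels = ["beagle", "basset", "siamese"]
counts = [
    [40, 9, 1],
    [12, 28, 0],
    [0, 1, 49],
]
worst = top_confusions(counts, labels, k=2)
```

With the matrix in a DataFrame, `df.values.tolist()` and `list(df.index)` should give the two inputs. Running it for both models and comparing the lists shows whether they struggle with the same breed pairs.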
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
I'll def add that to the shortlist.
And by shortlist, I mean I'll do that.
yep and then u can share the link here
https://paste.pythondiscord.com/ucokovugib
I am not so good at this topic yet but maybe someone here is more knowledgeable. I am working with NLTK libraries and other tools for NLP, my goal is to paraphrase the text using grammatical rules to identify synonyms and conjugate them and then replace them in the text. But those conjugating rules don't work properly and the words in the text stay in their base form and don't even get replaced if there was a synonym found. That's quite a lot of a problem but at least maybe someone can help me resolve the problem with grammatical rules not working, maybe I went wrong in the algorithm when writing it or else
or advise where I can ask about this so someone can help
I would be recommending that you look into some of the latest research into Deep Learning and Transformers with Attention.
Okay
I would also specifically be looking at something like BERT.
Which is really good at understanding context.
Anyone working with GLCM, LBP and GABOR filters? I need to ask a few questions about my homework
Are learning curves like these a sign of overfitting? Or does it look fine and i should just add more epochs?
It looks like the accuracy is all over the place, did you test it on a very small number of test examples?
I doubt you can draw very many conclusions from this
Training data consists of 1280 images (320 images, each image rotated 4 times by 90°). Validation data consists of 320 images (80 images, each image rotated 4 times by 90°)
hmm the acc graph looks weird
indeed it looks weird lol
Yeah either your model changes very chaotically, or the measurement isn't very accurate
80 images (duplicated 3 times each) isn't that much data
Try running the model like 10 times, and then average over those 10 runs
Also think it's a bit sketchy to augment your test data, normally you want to keep it as is
Good idea.
My entire dataset consists of roughly 1600 images, so max i can do is ~1300 images as test data, then augment extra test data by rotating to get ~5000 images as train data.
You shouldn't generally augment your test data
Sorry i meant training data
How have you split it now into train/validate/test?
Could also use k-fold cross validation
I chose a random subset of size 400 of the dataset, augmented extra data, then split 80% into training data, 20% into validation data, i.e. no test data so far. The plots only show accuracy on training + validation data.
Will do
split it 80/20 into train and test. Perform k-fold cross-validation on the train data
And you could even run k-fold more times to average it over more runs
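A sketch of how the split and the augmentation could be ordered: split the original images first, then rotate only the training side, so rotated copies of one image never sit on both sides of a fold. The helper names and the use of integer ids as stand-ins for images are assumptions for illustration; `sklearn.model_selection.KFold` does the index splitting for real projects.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def augment_with_rotations(image_ids):
    """Hypothetical augmentation: pair every image id with the four 90-degree rotations."""
    return [(image_id, angle) for image_id in image_ids for angle in (0, 90, 180, 270)]


original_ids = list(range(400))  # the 400 un-augmented images
for train_ids, val_ids in k_fold_indices(len(original_ids), k=5):
    train_set = augment_with_rotations(train_ids)      # augment training images only
    val_set = [(image_id, 0) for image_id in val_ids]  # validation stays as-is
    # ... train on train_set, evaluate on val_set, collect the score ...
```

Averaging the per-fold scores (and repeating with different seeds) gives a less jumpy estimate than a single 80/20 split.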
Thank you so much for the quick help!
does anyone know?
https://colab.research.google.com/drive/1Z7f2nackDDo6vmBrTT6qdWyjGekZAh-6?usp=sharing
does anyone know why the result is so bad when i try it on a new image?
Does anyone have any experience with classification of time series data?
What is your setting, you have an entire time series and you want to classify it?
What I mean is like this.
I have a time series data set of sea level anomaly.
The feature in the dataset is 'sea level anomaly' and it has 3 classes (so the data is labeled) .
Class 0 : normal
Class 1: anomaly-nontsunami
Class 2: anomaly-tsunami
So basically, i would like to build a model to predict the classes based on the 'sea level anomaly'
Yes but it's still not clear.
Do you have a sequence of data points for which you want to make 1 prediction? For example, a full day's worth of measurements and you want to predict if an anomaly will occur the next day.
Or is your case one where you want to predict something for each subsequent input in the sequence?
Each subsequent input in the sequence
123456789 is our series
123 -> label(4) | 234 -> label(5) ...
That's how your problem is structured right?
What you could do is make rolling features. You take the last n values and you compute some stuff on them, for example summary statistics, and that's your feature vector
What you have now is a standard multi-class classification problem. Packages like tsfresh do this out of the box: https://tsfresh.readthedocs.io/en/latest/text/forecasting.html I'm sure sktime and others have similar functionality
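A minimal sketch of the rolling-feature idea on the 1..9 toy series from above, using only the standard library. The particular statistics and the labels are made up, and the window must be at least 2 for the last-difference feature to exist.

```python
from statistics import mean, pstdev


def rolling_features(series, labels, window):
    """Build (features, label) pairs from the `window` values before each time step."""
    rows = []
    for t in range(window, len(series)):
        past = series[t - window:t]
        features = {
            "mean": mean(past),
            "std": pstdev(past),
            "min": min(past),
            "max": max(past),
            "last_diff": past[-1] - past[-2],  # needs window >= 2
        }
        rows.append((features, labels[t]))
    return rows


series = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = [0, 0, 0, 0, 1, 1, 0, 0, 2]  # made-up class per time step
rows = rolling_features(series, labels, window=3)
# First row: features of [1, 2, 3] paired with the label of the 4th point.
```

Each row is then an ordinary sample for any multi-class classifier.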
Would you like to discuss it in private chat? I'll send the dataset and i'll send what i've been working on, maybe you can evaluate whether what i'm doing makes sense or not?
Better that we discuss it here. I might not be available later and there's smarter people in this channel that can correct me if I make a mistake so it's in your best interest as well 🙂
this is the data set
I have a note that shows me at what time the normal, anomaly tsunami and anomaly non tsunami occurred
So i labeled it in the code
Do you think what i've been doing is right?
Please give me some enlightenment
the sea level anomaly is from
sea level - tide
it is in meters
So your features are the last 35 lags?
including the sea level anomaly 0, which is the original sea level anomaly value, so there are 37 in total
The notebook is quite long, looking at it from afar it seems good
sea level anomaly 0 - 35 and the minute
It's very close to what I proposed, the extra I proposed was essentially making more / different features on the basis of your lags. Whether or not those make sense is problem dependent
I'm not a fan of class weighting to solve imbalance, better to reason about the precision-recall tradeoff by yourself and pick an operating point
Finally, the only xgboost hyperparameter I'd tune is the number of estimators. It's the one that has the most impact
Thank you, will look into it
can you explain what is 'the precision-recall tradeoff by yourself and pick an operating point'?
sorry, i'm a newbie
and english is not my first language
you're right, the estimators gave big impact to the model
Don't worry - the output of your model is a probability right? Under normal circumstances (in the 0 vs 1 case) above 0.5 => 1 and under 0.5 => 0. You used class weights to deal with the imbalance. Instead of using class weights you could for instance look at the distribution of the scores your model is giving. You can plot precision and recall as a function of the threshold (the operating point), i.e. the minimum score needed to predict 1 rather than 0, as you move it up or down from its default of 0.5
So depending on whether you find false positives or false negatives worse, you can pick your own threshold instead of the default
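A small sketch of what moving the operating point does, for the binary case. The scores and labels are invented; for three classes the same thing can be done one class at a time (that class vs the rest). `sklearn.metrics.precision_recall_curve` computes the full curve if sklearn is available.

```python
def precision_recall_at(scores, y_true, threshold):
    """Precision and recall when predicting 1 for every score >= threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    # Convention used here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model scores and true labels, purely for illustration.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.80, 0.95]
y_true = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(scores, y_true, threshold)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold catches every positive at the cost of more false alarms, and raising it does the opposite. Missing a tsunami is presumably worse than a false alarm, which would argue for a lower threshold on that class.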
does the xml file path matter for the data?
yes it is. It seems like i need to read about precision-recall tradeoff so that i would have better understanding
thank you so much
Sklearn or tensorflow?
hey, is anyone here using the R language or learning it?
Both have their uses. Tensorflow is more of a deep learning framework, besides having an entire ecosystem associated with it (model formats, dataset file formats, serving, tflite, tfx, tfhub etc.)
sklearn has its own uses
Sure what questions do you have?
R is mainly for scientists (by which I include economists and mathematicians) who wouldn't necessarily consider themselves programmers. I don't think there's any way to effectively do deep learning in R. Why do you ask?
I have a bunch of data of x and y coordinates, the shape is (n, 5000, 2), so is there any library that can render 5000 points and updates 30 times a second? Ive been told pygame will not be suitable, but maybe matplotlib will be or pyglet
RIP, my colleagues like to use R 😩
what do they do
Write unmaintainable spaghetti
but like what are they trying to do
Data science
But the R people don't do deep learning at all afaik
There are "core" non-DL models that are better supported in R than in Python like ARIMA and GAMs to name 2. The Python version(s) are respectively poorer ports and unmaintained. Is it worth the effort? Largely depends on what you do I guess. Especially considering the overall experience in R is way worse than Python, both syntax and semantics.
matplotlib can sort of do it, not sure if I'd really recommend it though
Anyone here think they could recreate this within matplotlib or another python plotting library?
Yep. There is tensorflow for R and keras for R but they're community maintained only. And not that great tbh
I tried to write capsule Networks in R sometime ago (I'm a masochist that's why) and even tho the code was all correct it ended up not working due to some bug in keras R
Tried lots of stuff to get it working but didn't. It was like a deadlock situation but with errors. Annoying
But as much as I dislike the code style and practices and paradigms of R, ig I can see how it'd be useful to people doing pure Data Analytics
Is this script good for correcting error prone text
what do you guys think are the best models available on hugging face?
theres many ways to filter - curious to hear what you guys think
gpt2, grammarly coedit
Would it be possible or are you too limited with the layout/design to replicate this
Placing the legend beneath the plot and using circles rather than lines in its markers might take a bit of doc-reading, but nothing else I see here looks unusual for matplotlib
how do I change this:
```py
from keras import backend
from keras import initializers
from keras.optimizers import utils as optimizer_utils
from keras.optimizers.schedules import learning_rate_schedule
from keras.utils import tf_utils
```
to
```py
from tensorflow ...```
Install tensorflow in a 2.X version. It'll include keras, so you can just use
from tensorflow.keras import backend
from tensorflow.keras.optimizers import utils as optimizer_utils
...
It's possible that you may have to do some adjustments, but in general it's basically just adding tensorflow. before keras
perhaps a novice question but is there not a gpt3.5 or gpt4, basically an updated version, available?
how much harder is generative AI compared to vision?
oh ok, thank you so much. do you know what this means:
Killed
I followed this guide exactly:
https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
Hm... No. I don't. Sorry.
Depends on which model you want to use for generation.
If you want to use a Variational AutoEncoder...well, it's quite simple, though the theory is complicated...and it's also quite hard to find a decent tutorial.
If you want to use a Generative Adversarial Network, then the theory is easy and it's easy to find a tutorial, but it's too hard to make it work as it involves too much trial and error.
If you want to use a Diffusion Model, then you'll have the mid-term between those two above: theory complicated, easy to find a tutorial, and not that hard to make it work, but also requires some trial and error.
PS: Flow-models are an aberration
ok, what about transformers
I have never used a Transformer for generating images, but there might be a version of it for that... I know that there's a GAN model that uses Self-Attention, which is a mechanism from Transformer...
But I suppose the problem could be pretty much the same as for texts: teacher forcing bias, gradients biased due to residual blocks, crazy gradients due to how the layers weights behave...
does anyone know why https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
this guide uses cpu even though it's supposed to use GPU?
my cpu load is 99%
how long do you think it would take to get a firm understanding of GANs, Transformers, VAEs and diffusion models?
I think that, for a GAN...maybe 3 days? 1 week, at most, I think. Really, the idea behind them is quite simple: a Generator trying to fool a Discriminator with fake images, and beware of Nash equilibrium and exploding/vanishing gradients. Follow a tutorial and you may be able to make a DCGAN work easily with the MNIST or CelebA datasets (which are the most common ones). The problem is if you try to go beyond that.
Transformers...if you already have some knowledge of NLP, maybe also 3 or 4 days. If you know nothing about NLP, vectorizing, etc, it may require some weeks. There's a course covering Transformers on Coursera from Andrew Ng's DeepLearning.AI which can help.
VAEs...around a week. Diffusion Models may take a bit more.
About the same time.
I have 0 deep learning background
Oh, then you may need to add some weeks to those estimations
so I can literally make my own production-ready GAN in 3 weeks?
that doesn't seem possible
Well, it depends on how you want to do it.
Tutorials on how to do a GAN are quite easy to find, so you may be able to make one quite fast. The only thing that may slow you down a bit will be debugging your code. But if you follow everything correctly, you may be able to have a working DCGAN on MNIST/CelebA dataset relatively fast.
However...as I said, if you want to modify the architecture of your DCGAN to generate different results, then it'll take quite some time to make it work.
In fact, it can take months of trial and error and studying.
Does anyone know of a way to test if my system is lacking ram?
And when I say months of trial and error and studying, I mean literally. I've studied GANs for more than a year in order to make my own GANs work
VAEs and Diffusion Models are a bit more tough to understand, but making them work is quite easy. And that's why they've been getting the favor of most people nowadays.
Transformer follows the same idea, plus the fact that you may need to learn Natural Language Processing...which can take a while to understand...especially the idea around vectorization and embedding matrices.
How many hours per day were you studying GANs
Good question... I'd say...an average of 1~2 hours? 
Some days were more...some less...
Quick one and I'm feeling real dumb for having to ask it... but selecting data based on the value of a combined index in Pandas... Say I have ['FIRST','SECOND'] as my index labels, how would I select based on a filter on the values in SECOND, ignoring any matches in FIRST? This seems like what I want, but not sure how to restrict it to just the second part of the index: https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label
explained in #python-discussion but leaving here too for others to know it's been answered
```pycon
>>> df
     0
a a  1
  b  2
b a  3
>>> df.loc[(slice(None), 'a'), :]
     0
a a  1
b a  3
```
Works perfectly. Thanks muchly!
data scientists who worked for companies where SQL was an optional requirement, how much do you use it? And if you use it how technical is your knowledge?
_\
bro I'm just following some yt tutorial for python but for some reason it won't detect mss idk how to fix this, I thought the file might be corrupted so I deleted it and reinstalled it but as shown it did not work
I need some help in visualising the counts of present vs absent per person
P means present and A is absent, which are stored in the respective dataset
And there are 79 unique IDs
Beyond just style, R is lower on the strong typing spectrum than Python, bit like PhP and JS.
If your language is dynamically typed you need to throw a lot of runtime errors like Python and Ruby. R, JS, PhP sometimes produce silently wrong results 🥴. In R it's just worse because the odds you inherit a shitty codebase is higher than in JS, PhP.
has anyone finetuned any model using QLoRA?
Although I think if you're not exclusively doing DL it's worth learning. I treat it like a statistics DSL.
Fair ig. I've taken a few uni courses and a few mini projects, but the prospect of actively coding in R wouldn't thrill me so to speak
Can you provide a bit more information on what the df's look like?
Based on the image I now assume that every [x, y] in df_present contains an id of someone present and every [x, y] in df_absent contains an id of someone absent. Is that correct?
If so, you would first have to aggregate the dataframes into a [1, 79] df and then you can easily visualize it:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data: 2000 rows x 12 columns of ids in 0..78
df = pd.DataFrame(np.random.randint(0, 79, size=(2000, 12)), columns=list('ABCDEFGHIJKL'))  # [2000 x 12]
df_counts = df.apply(pd.Series.value_counts)  # [79 x 12], NaN where an id never appears in a column
df_sum = df_counts.sum(axis=1)  # [79 x 1], total count per id
plt.rcParams["figure.figsize"] = (100, 5)
df_sum.plot.bar()  # one bar per id
# Repeat the same for absent and plot in same image
Please make sure to give more details on how you want to visualize the data, if I misunderstood
Separate post:
So I joined this community mainly because I'm looking for a specific kind of Python course. I think this channel is probably best suited for the question:
I'm a Data Engineer with basic Python knowledge, but I'm looking to expand that through project-based learning. I'm looking for a kind of course or training that offers such kind of learning.
Many courses seem to offer basic explanations etc. whereas I really would like to learn more advanced Python tips and tricks.
Could anyone recommend a person or company for such trainings? (Can be online, classroom based, multi-day, thats all fine)
I just finished an applied machine learning course, but I want to learn more about the subject. Any tips for documentations/tutorials I could study?
I want to expand my knowledge and perhaps do some ML on my own.
For beginners, sklearn or tensorflow or any other module?
you don't learn ML by "learning libraries". you have to learn ML itself.
but tensorflow is for neural networks, and you definitely shouldn't start with that.
Yeah, I gotta learn the terms and theory, right?
yes. and none of the python ML libraries are intended to help you learn them.
except maybe keras for neural networks.
So, is there any tutorial I can follow?
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
I am familiar with most of the libraries. We did a lot of work with sklearn
how do I change the font size of a single markdown cell in jupyter?
Try to work on and get solid with the theory and the math
Nice I think this will help a lot
Can't you use a HTML div within the cell?
<div style="font-family:verdana">Lorem Ipsum.</div>
I don't know specific trainings but we can certainly discuss for which topics to aim for as a DE today
Actually, the Cloud providers DE certifications are always good to have imo. So you can check what your favorite Cloud has to offer
Yeah I'm also interested in that, the Cloud certifications are on my radar. However I prefer to gain more experience before taking them (we are setting up cloud in our company rn)
again though, I'd definitely be interested in hands-on courses for that, not just going through tons of theory 🙂
what does this mean:
```
2023-06-09 15:56:31.123032: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 297 of 2048
2023-06-09 15:56:41.180448: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 622 of 2048
2023-06-09 15:56:51.125136: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 907 of 2048
```
if i were to guess, it's a log entry that says
"i am moving data into a special zone of memory (or just an allocated and contiguous region of memory really), this is a prerequisite to do a shuffling operation (but according to the source code it says sampling??), current progress is XXX out of the maximum capacity of YYY"
Hello guys can somebody suggest me should I go for 3060 12gb or 3070ti 8gb for my pc , I will be doing dl/ml work and running models most of the time. Plus pls suggest me other components around it .
3060
I am using Colab and my batch_size is 8, around 150 images. how could I be out of memory?
please post more details, or link to your previous posted detail if you already did.
while i probably can't answer this, someone else might be able to!
How is your SQL?
what are these files?
events.out.tfevents...
https://stackoverflow.com/questions/53517857/can-i-delete-events-out-tfevents-xxxxxxxxxx-computer-name-files-from-training-fo
contains a pretty good summary of what those are, it's mostly for debugging/monitoring purposes.
Brotha any specific reason ? Cuz of vram ? So more vram > Cuda cores?
If more vram should I go amd 6900xt 16 gb ?
I'm not sure how well DL libraries play with ROCm and AMD
You might have some hiccups here and there, if you're fine with that then you can go ahead with AMD
This may be a dumb question but is it bad practice to make a nested list a value in a pandas data frame? I'm analyzing EEG data and the module will output an indeterminate number of peaks (one participant could have 2 and the other 3) for each participant and each peak is a list of some values. Right now I'm thinking of storing each list in one list for the participants peaks, but is there a better way to do this?
even just having dataframe elements that are flat lists is a bit problematic--nested lists are a no-go.
I assume when you say "a value in a dataframe", you mean the value for a single cell. not a whole row or a whole column.
yes
Not good , amd still ain't a choice for deep learning in most of the cases
I don't know if I should go for 3060 12gb , I could be fine tuning model bert model , and bert requires at least 12gb vram and 3060 is slow for it but it's in a budget range.
Only proper gpu could be 3080 12gb but I can support used one only in this range.
Should I go for used one's ? Obviously non refurbished and non mined
whatever the column is that is going to have these nested values, you might want to make it a separate dataframe with multiple levels of indexing.
Each peak has 3 characteristics that are stored in those lists and then the amount of peaks for each individual can vary from zero to usually a max of 3. So you're saying I should represent those as a new DF with columns for the characteristics correct? If so, how could I account for the indeterminate number of peaks while making it clear that those peaks belong to one person, since in the original method they'd just be in the row for that person.
Yep that's what it's doing
The shuffling operation is already done when it's placed into the buffer. From my understanding it samples from the original tensor and places elements into the buffer, so they are "shuffled"
This buffer can then be used for further processing, i.e. it can be sampled from (which is what the comment refers to)
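For intuition, a simplified pure-Python sketch of a shuffle buffer. This is not TensorFlow's implementation, just the general idea as I understand it from the `tf.data.Dataset.shuffle` docs: fill a fixed-size buffer, then repeatedly hand out a random buffered element and refill its slot from the input. The function name is made up.

```python
import random


def shuffle_with_buffer(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # "Filling up shuffle buffer": nothing is emitted until the buffer is full.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the incoming item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

In this picture the log lines just report how many of the 2048 slots are filled so far, so slow progress there points at slow data loading rather than at the shuffle itself.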
Replying to this^^
I really thought I had it on reply ;-;
lol I was very confused for a minute
Yeah mb idk why that happens on my phone
Is this for personal or professional use?
If it's for personal use there's nothing wrong with not getting a SoTA card and just using the cloud whenever your needs exceed your card's capacity. You can probably do some back of napkin calculations and see that this is likely the cheaper option
Each of these 3 characteristics will be a single number or a list?
single number
You can build a multi index then
they output like this: peak params: [[ 9.81511894 0.72768575 3.00442113]
[13.0427614 0.25957048 2.54345948]
[18.12826738 0.15093637 4.62563198]] with each list being one peak
One index for each patient/subject
And a lower index for each peak
With 3 columns in the df, one column for each of the "characteristics" of a peak
Yep should work I think
!d pandas.MultiIndex
class pandas.MultiIndex(levels=None, codes=None, sortorder=None, names=None, dtype=None, copy=False, name=None, verify_integrity=True)
A multi-level, or hierarchical, index object for pandas objects.
Let me read into it more but it seems like that would work. My only concern is if I could output this to a csv file and what that would look like.
Yea based on the documentation it seems like I'd have a 3 dimensional df, so how could I export this into a simpler/readable database format?
I think the CSV would have a column for each index
Interesting...How would those columns look for participants who only had 2 peaks instead of 3? Would the extra column just be blank?
No...the peak would be described by one level of index
This would translate to one column
So you'd have 3 rows for participants with 3 peaks, 2 rows for those with 2 etc.
And your first level index would presumably be some sort of participant id
oh I get it, that makes sense. This would work pretty well then.
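A minimal sketch of the layout described above, assuming each participant's peaks arrive as an (n_peaks, 3) array. The participant ids, the column names (`center`, `power`, `bandwidth`), and the values for "P02" are hypothetical; only the "P01" numbers come from the output posted earlier.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical per-participant output: one (n_peaks, 3) array each.
peak_params = {
    "P01": np.array([[9.81511894, 0.72768575, 3.00442113],
                     [13.0427614, 0.25957048, 2.54345948],
                     [18.12826738, 0.15093637, 4.62563198]]),
    "P02": np.array([[10.2, 0.5, 2.0],
                     [20.1, 0.3, 3.5]]),   # made-up values
    "P03": np.empty((0, 3)),               # participant with no peaks
}

# Column names are assumptions; use whatever the three characteristics are.
columns = ["center", "power", "bandwidth"]

# One small frame per participant; participants without peaks add no rows.
frames = {
    pid: pd.DataFrame(params, columns=columns)
    for pid, params in peak_params.items()
    if len(params) > 0
}

# The dict keys become the outer index level, the row number the inner one.
peaks = pd.concat(frames, names=["participant", "peak"])

# The CSV gets one column per index level plus the three value columns.
buffer = io.StringIO()
peaks.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back restores the two-level index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=["participant", "peak"])
```

Note that in this sketch a participant with zero peaks simply has no rows, so if you need to keep track of them you'd store the participant list separately.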
that's exactly what it is, perfect
Thanks @potent sky and @serene scaffold
No problem! Stel gave the right idea I just explained it lol
oops wrong tag again ;-;
Hey there, I am kinda confused with np.gradient. I would like to get the derivative of a 2D array (an image) but it returns two arrays x and y and I don't know what to do with these
what are you expecting gradient to do?
nvm I found it
I didn't see the x and y were 2D arrays as well
so it's like the dx and dy
yep
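For anyone else tripping over this, a small sketch of what np.gradient returns for a 2D array. The toy "image" is made up so that both derivatives are constant and easy to check.

```python
import numpy as np

# Toy "image": intensity = 2*row + 3*col, so both derivatives are constant.
rows, cols = np.mgrid[0:4, 0:5]
image = 2.0 * rows + 3.0 * cols

# One array per axis, each with the same shape as the input.
# Axis 0 runs down the rows (y), axis 1 runs along the columns (x).
d_dy, d_dx = np.gradient(image)

# If a single "edge strength" image is wanted, combine the two.
magnitude = np.hypot(d_dx, d_dy)
```

The order of the returned arrays follows the axis order, so the row-direction derivative comes first.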
debugging saves the day once again
if you know what behavior you expect though, it might be easier to use a 2d convolution from scipy to produce the derivative you want
gradient does exactly what I want so I'll stick with this thanks !
all righty
Is the audit only version of the machine learning course by Andrew ng good or should I buy the course(labs etc)?
It´s fine, I currently write jinja-sql with the dbt package. I don´t like SQL as much, but I think I know enough to be able to use it for most projects
For the current project we write all transformations in jinja-sql so that some nice hands-on experience I'll already get
For data engineering first jobs having a great understanding of SQL is a must because I can tell you out of experience many of them don't know Python 🥴
My issue with cloud certs is that a lot of it is just about their product and not necessarily concepts.
Yeah well regardless those cloud certs are becoming industry standard :/ I'll need to have one from any cloud provider at some point.
And yeah I know SQL is like the must-know for all older systems/ companies. I think I'll manage as long as I understand the idea behind SQL, which I do learn with jinja. It's just that without jinja you need to write a lot of additional sql yourself
The irony is that SQL is cutting-edge
As in, "modern" SQL is just a language/specification that is sent to query an S3-like datastore in a distributed way. See: Snowflake, Databricks spark-SQL, ...
That being said, it kinda sucks because queries can get long and unreadable ofc.
Pretty sure there's DBT adapters for Snowflake and the Spark runtime.
Re cloud certs: are they asking for that for an entry level job? Idk if cloud certificates have value before working with the stack either way. I have a couple of Azure certs because where I live they have 80 % of the market share and there's ways to get them for free. Personally wouldn't have grokked anything without having used Azure before though.
The labs bring a significant fraction of the course's value. But you don't need to buy it. Apply for financial aid and explain your reasons, they're usually pretty generous
No, not for entry level. But they start expecting juniors to take them and expect them from mediors and up
That's true. But at least with modern tools like Snowflake, Databricks you can use Python adapters (or Go, I think go is also promising, what do you think?). And on databricks you can schedule Python notebooks
They need to let you use python first. As I said, many data engineers don't know Python, just SQL. It's not because those features exist that you can use them on every team.
This discussion is getting a bit circular.
So say I have floating-point data ranging from -1 to 1 and I basically want to separate it into 3 categories: positive values, negative values, and a 3rd one for data that is too close to 0, depending on a threshold. The problem is, how do I mathematically define and justify this threshold? I tried numpy's quantiles, but that just divides the data into equal parts, and in my case I can have 70% negative values, 10% neutral, and so on, so I can't just divide in equal ratios. I'm writing a research paper, so I can't just say I divide the data at -0.05 and +0.05, because that would be arbitrary. So what can I do?
well, why are you splitting it into 3 groups?
I'm working on an NLP problem and i have some cosine similarity scores, floating values ranging from -1 to 1, and i basically want to separate them into 3 categories: values that are closer to -1, values that are closer to 1, and values near 0, which are neutral
these values are basically the result of a cosine similarity b/w a sentence vector and a target vector (such as "I am a man": i calculate its sentence vector and then use cos_sim with, say, a target word "men" or "women" to identify which target word it is more related to)
its kind of like i just wanna cluster in 1d
one way you could do this is to treat the thresholds as learnable parameters
or have them be hyperparams and test which one perform best. but there is no unique way of choosing them here
are you doing anything with the neutral values?
this probably wouldnt be possible because im doing unsupervised learning so we will never have labelled data
actually identifying them is important because we dont want to include them in our further experiments so removing them is kinda important
so like just manually observe and adjust everytime?
you could try a grid search for that and keep the thresold that works best
Do you want to find 3 groups in your data and you want to know the optimal cut off points to decide the groupings?
yep
Do you do anything with it that is quantitative or do you just qualitatively analyse what's in the groups?
.
from what they said, i understood they use this value to remove samples from the training data
i might've misunderstood
I'm separating different sentences actually on the basis of the similarity values
yeah that too
how do you do that separation?
Okay but can we phrase it under qualitative and quantitative for a second
Do you have a downstream task that computes something?
thats what i asked, if i can group this data that will be a seperation then i can work on it further
umm no i guess
like wdym by that can u elaborate?
Like if you have a downstream task that computes something (quantitative) it's a hyperparameter you can tune like Edd says
what do you do with the data after separating into these 3 groups?
If these are 3 groups and you'll kind of just look at what is inside and analysing it without a clear measure of performance then it's qualitative
the data is useless after separation, once i separate the sentences based on their similarity values i dont use that data anywhere else
what do you do with the data you DO use
Why do you want to separate the data?
Like if there's nothing you're doing with it why separate at all and not just make a histogram?
I'm missing something here, don't fully understand
ok so let me explain
we have s (sentence vector) and t1 and t2 two target sets
now i calculate cos(s, t1) - cos(s, t2)
if its -ve it will tell me s is more related to t2, then i can do more statistical tests on the "s" by analyzing its word vectors
what kind of statistical tests?
like im not sure if im allowed to tell its for a research paper, im contributing to under a prof, like sorry if it sounds stupid, but I can tell you it has nothing to do with the data im talking about I just use it for initial classification
then we can't help you 😛 cuz the answer depends on that
some statistical tests turn this into a tunable hyperparameter you can optimize
ah
descriptive statistics do not, and all this threshold gives is a family of descriptors
in the former case you can optimize it. in the latter, all you can do is graph the family and the choice is kinda arbitrary
what you do with this info is up to you now 😛
Graphing the data, picking a threshold and then doing certain tests is a bit suspect because you can pick a cutoff to select the conclusion you want to reach, no?
agreed
in the descriptive case you would NEED to plot the whole family, otherwise the results are not meaningful
this is something you need to discuss with your supervisor or something, we can't help much more with the amount of info you can share
I'll confirm with him once, whether I'm allowed to talk about our work on online help forums untill then , thanks for the help tho
i'm not asking you exactly which tests on which data btw
just what kind of statistics
Can a sentence not be related to T1 and T2 at the same time and produce a ve (whatever this is) that is near 0
if it's descriptive statistics that ones thing
if you're fitting a model or comparing to some reference, that's a different kind of statistics
not fitting a model totally unsupervised learning, I'm doing hypothesis testing later after this
it was short of negative
btw subtracting cosine similarities sounds a little weird
related to zestar's example, you can get a t1 similarity of 0, and -1 with t2
yes that is possible, that is a problem we have and thats why separating those cases is also important
this gives you a positive 1 for sim(x, t1) - sim(x, t2), but the correlation is negative with t2 and 0 with t1
even for nonzero similarity with t1, if the negative similarity with t2 is larger you get a positive. same in the opposite direction
that metric does not distinguish between positive correlation with t1 and negative correlation with t2
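A tiny sketch of that ambiguity, using made-up 3D vectors in place of real embeddings. `cos_sim` is a helper defined here, not a library call.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

t1 = np.array([1.0, 0.0, 0.0])
t2 = np.array([0.0, 1.0, 0.0])

s_a = np.array([0.0, -1.0, 0.0])   # unrelated to t1, opposite to t2
s_b = np.array([1.0, 0.0, 0.0])    # identical to t1, unrelated to t2

score_a = cos_sim(s_a, t1) - cos_sim(s_a, t2)   # 0 - (-1)
score_b = cos_sim(s_b, t1) - cos_sim(s_b, t2)   # 1 - 0
```

Both scores come out the same even though only `s_b` has anything to do with t1, which is the point being made above.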
I think you're probably making something simple hard
Define a set of rules with arbitrary thresholds and work with that group of data
Do it before looking at your data and motivate the choices. Statistics is full of default parameters anyway
You can also perform your analysis on a different set of splits and with different significance levels and put it in a big table
ah actually it was my bad actually the data will never range from -1 to 1, cossim will always give value b/w 0-1 for t1 and t2
that too. for example if you have a reference paper that you're gonna compare to or build up on, you can just steal their hyperparams
okay..
we do have a ref paper although they never discussed the problem of classification so we are trying to address this, I'll see what i can do now
all righty
thanks for the help both of ya
i think zestar and i just dropped a lot of stuff on you, so at this point i suggest you step back and mull it over for a bit and discuss with your supervisor. otherwise we're gonna confuse or derail you because we also lack some context
okay I'll do that!
that categorization sounds like it should be a supervised task to me
let's see what zestar has to say
Before I respond, without knowing fully what they do, I think this is something @serene scaffold might be able to help at
I have to read all that?
Fuck
i want to use my gpu so i followed this tutorial https://learn.microsoft.com/en-us/windows/wsl/tutorials/gpu-compute#setting-up-tensorflow-directml-or-pytorch-directml( the Setting up TensorFlow-DirectML or PyTorch-DirectML part) but for the third command it returns me conda: command not found. thank you in advance for your help
you probably didn't do the conda init at the end of the miniconda installation
navigate to where the miniconda executable is located, and in that directory run the command conda init. this will modify your bashrc to allow you to use the conda command from anywhere
then close the terminal and open a new one, or do source .bashrc from your home dir
I'll need more time to think
Does bert even give sentence embeddings or are those gotten by averaging all word embeddings or doing a max at each dimension?
Okay apparently I forgot about CLS pooling. I kind of don't want to answer as this is quite NLP specific and that's not my jam aside from the high level ideas 😛
But it does look like classification tbh.
not exactly, I'm using Sentence Transformers or popularly known as sbert, i think it works differently rather than taking average of all word embeddings
the idea is to analyze the conversation of a community such as reddit and then attempting to classify those comments. So we can never have a trained model on some labelled dataset that can be applied to every conversation on social media , because in different communities different words are used in different context, for example slangs that are used in a particular subreddit may not be getting used with the same context in say a YouTube comment section conversation
the idea of a training set is being representative, not to contain all text
how will you verify that your classification is working?
we are trying to prove our results via experimentation, So we'll try to work on 100k rows labelled dataset and then try to classify we can test on datasets taken from different social media communities to prove that this process is consistent
so you have labelled data. my approach would actually be to add this as an extra layer or two in the network and spit out the class instead of that function of the cosine similarity
but we are not implying that a labelled dataset is important for this process to work
in that case though, you'd have to do something like plotting a family of curves as a function of this threshold param
still, let's see what stelercus says. i'm on the same boat as zestar, maybe there's something else at play that i can't see
can you elaborate?
to know if your classification is working, you compare it to something. in this case you have this labelled dataset you mentioned. the goodness of the classification will be a function of where you draw the threshold of which data is neutral
this alone should make you think of false negatives and false positives being affected by where you set the threshold
ok makes sense, the results do vary quite a bit on the basis of what threshold i set, like the accuracy of the model was ranging from all the way 56% to 78% on the basis of the initial threshold used for separation.
very naively from my side, i would think that tuning this parameter as a hyperparam based on your labelled data and including the tuning alg and making the dataset public both justifies the approach and makes the result reproducible, which is great for a paper
that immediately makes it clear that the choice of the parameter depends on you having good data to go off of
i'm not an nlp person though
yeahh i think this should be ok for now I hope? If i could just come up with a tuning algorithm or any process to calculate and justify what value of the threshold i can select then i think this will be more than enough for us. i Just need to find a way to tune this threshold
idk how complicated your inference process is. the bert part you mentioned is already trained yeah?
cuz if that's the case you can technically differentiate through that and use your labelled data to optimize the param
otherwise you can do a grid search
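A rough sketch of the grid-search option, assuming a single symmetric threshold and the 0/1/2 label convention (0 = neutral, 1 = t1, 2 = t2). The scores and labels below are synthetic stand-ins, not real results, and `classify` and `grid_search_threshold` are hypothetical helper names.

```python
import numpy as np

def classify(scores, threshold):
    """1 = closer to t1, 2 = closer to t2, 0 = neutral band around zero."""
    labels = np.zeros(scores.shape, dtype=int)
    labels[scores > threshold] = 1
    labels[scores < -threshold] = 2
    return labels

def grid_search_threshold(scores, true_labels, candidates):
    """Return (best_threshold, best_accuracy) over the candidate grid."""
    accuracies = [np.mean(classify(scores, t) == true_labels) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(candidates[best]), float(accuracies[best])

# Synthetic stand-in for the labelled set (made-up numbers).
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 3, size=600)
centers = np.array([0.0, 0.3, -0.3])          # neutral, t1, t2
scores = centers[true_labels] + rng.normal(0.0, 0.1, size=600)

candidates = np.linspace(0.0, 0.5, 51)
best_threshold, best_accuracy = grid_search_threshold(scores, true_labels, candidates)
```

In a paper you'd probably want to pick the threshold on one split and report accuracy on another, so the reported number isn't tuned on the data it's measured on.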
but still, do wait til someone who actually knows what they're doing lends a comment 😛
i'm just doing armchair nlp here lol
yeah pre trained
i am doing another course on machine learning by andrew ng and in it he divides the cost function by 2*M, where M is the number of training examples. Do yall have any idea why he does this? It seems like an additional learning rate that makes the error smaller.
quite technically, this term is not needed
notice that scalars factor out of derivatives, so scaling the cost by 1/2M scales the gradient by the same amount
it does not, however, change where minima are located (remember we're looking for the 0s of the gradient)
it's important on your computer, however, because floating point numbers can only be so large. this scaling factor is useful in keeping the gradient from exploding
if you have several terms added together in the cost function, the relative scaling of each one IS important. for a single cost term, this scaling is only here for numerical purposes on your computer
is it kind of like making the step size smaller ?
The irony of me and NLP is that I took a course on multimodal information retrieval before I took computer vision and NLP
I did do computer vision proper afterwards but anything I know about NLP is from an IR perspective, not "complete" whatsoever but also not nothing
sure, but the cost was also made proportionally smaller, so those two factors cancel out
the cost is divided by 2m, which part are you referring to that cancels out?
let's put it this way
imagine my house is 10 meters away, and each step i take is 1 meter long. it takes me 10 steps to get there
what if my house is now 1 meter away, but my steps are 0.1 meters long?
this is the same thing that is happening here, since the gradient is computed from the cost and scalars factor out of derivatives
doesnt the cost in this case tell you "how far away you are from your house" and you would scale other things in the formula that adjust the weights to make your steps smaller
you could if you wanted, i'm just explaining to you that the scaling of the cost function does not have an impact
the gradient scales by the same amount as the cost automatically
if you want to change the gradient size in a meaningful way, you'd have to change the step size, which is a scalar multiplied ONLY to the gradient, not to the cost
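A small numerical sketch of the scaling argument, using a made-up one-parameter regression. It only illustrates two things: the minimum sits at the same place with or without the 1/(2m) factor, and for plain gradient descent the factor can be absorbed into the step size. The function names are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # made-up data with true slope 2
m = len(x)

def grad_sum(w):
    """d/dw of sum((w*x - y)**2), the unscaled cost."""
    return float(np.sum(2.0 * (w * x - y) * x))

def grad_scaled(w):
    """d/dw of (1/(2m)) * sum((w*x - y)**2), the course's cost."""
    return grad_sum(w) / (2 * m)

def descend(grad, step, w=0.0, n_steps=50):
    """Plain gradient descent on a single parameter."""
    for _ in range(n_steps):
        w = w - step * grad(w)
    return w

# Same minimum: both gradients vanish at w = 2.
# Same path too, once the step size absorbs the factor of 2m.
w_scaled = descend(grad_scaled, step=0.1)
w_sum = descend(grad_sum, step=0.1 / (2 * m))
```

Put differently, with the unscaled cost you'd need a step size that shrinks as the dataset grows, which is one practical reason to average over the m examples.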
it makes sense now thx bro
house analogy was 👍
Hey everyone, I need help with outlier handling in each of the train folds, when using StratifiedKFold with GridSearchCV. Im currently using a strategy which might be causing data leakage. The full description and current code is given in my post here. Any insight would be greatly appreciated. Thanks!
do you have to do the first part to do the second?
wdym?
in the tutorial
what are you calling "first" and "second" parts?
the first : Setting up NVIDIA CUDA with Docker the second : Setting up TensorFlow-DirectML or PyTorch-DirectML
if you want to use an nvidia gpu, yes
If I understand your setup correctly, t1 and t2 are vectors, and cos(v, w) is the cosine of the angle between them. So cos(word, t1) - cos(word, t2) or cos(sentence, t1) - cos(sentence, t2) is just telling you whether you're closer to t1 or to t2. This can also be phrased in terms of the perpendicular bisector of t1 and t2. This perpendicular bisector is a hyperplane; vectors on one side of the hyperplane are closer to t1 while vectors on the other side are closer to t2. The usual reason for taking the cosine is to get rid of the effect of document length, which is great if you want to compare two arbitrary vectors (e.g., determine if one sentence is similar to another). But if you're trying to determine whether you're closer to one vector or another, then you get the same information by computing the distance to their perpendicular bisector, which can be done by taking the dot product with (t1 - t2)/||t1 - t2|| and looking at the sign. In your case, since you want a neutral category, it's the same as looking at whether the dot product is near 0, large positive, or large negative.
That doesn't solve your question, of course. You were asking about how to define thresholds for the categories, and all I did was tell you about a different way to compute something for distinguishing the categories. But I think it clarifies the situation; earlier the question of why you were taking the difference of cosines was raised, and I think the resolution is that it's equivalent to something more mathematically motivated.
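A quick numerical check of the equivalence described above, under the assumption that t1 and t2 are normalized to unit length (with unequal lengths, the hyperplane normal would be built from the normalized vectors instead). All vectors are random stand-ins for embeddings, and `unit` is a helper defined here.

```python
import numpy as np

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
dim = 8                                   # small stand-in for the embedding size
t1 = unit(rng.normal(size=dim))
t2 = unit(rng.normal(size=dim))
sentences = rng.normal(size=(1000, dim))  # random stand-ins for sentence vectors

# Difference of cosine similarities, as in the discussion.
s_unit = sentences / np.linalg.norm(sentences, axis=1, keepdims=True)
cos_diff = s_unit @ t1 - s_unit @ t2

# Signed distance to the hyperplane through the origin with normal t1 - t2.
normal = unit(t1 - t2)
signed_dist = sentences @ normal

# Both quantities put every sentence on the same side.
same_side = np.sign(cos_diff) == np.sign(signed_dist)
```

The two scores differ only by positive factors (the sentence norm and the length of t1 - t2), so the signs always agree, but a fixed neutral threshold means different things for each of them.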
ok thank you
hello there! anyone here ever tried animating their python graphs? if so, how was your experience? thanks!
Anyone know if theres a tag in PyTorch for easy tasks like cleanup or good first task
Hi I have a quick question regarding assigning values to columns using pandas
df.column1[df.column2 <= 3] = 'LScores'
you should be using loc for this
Whenever I try to assign values using a function I get : SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
df.loc[df['column2'].le(3), 'column1'] = 'LScores'
Isn't it the same?
no
plus I am trying to assign the value back to the original dataframe so thought loc wouldn't make much sense, especially since I'm running this command in a for loop
that's what loc is for.
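A minimal sketch of the difference, on a made-up frame that reuses the column names from the question. The chained form first fetches `df.column1` and then assigns into that intermediate object, which may be a copy; `.loc` does the row selection and the column assignment in one call on `df` itself.

```python
import pandas as pd

# Made-up frame with a predefined empty string column.
df = pd.DataFrame({"column2": [1, 3, 5, 2], "column1": ["", "", "", ""]})

# One indexing call: row mask and column label together.
df.loc[df["column2"] <= 3, "column1"] = "LScores"
```

This writes to the original dataframe, so it works the same way inside a for loop.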
you might want to enable Copy on Write by the way
https://pandas.pydata.org/docs/dev/user_guide/copy_on_write.html
but you should probably store the string value in a different column. or you'll have a heterogenous column
columns need to be homogenous
oh, you are
fixed my code example.
Yeah its just a predefined empty column I'm assigning to
actually the formula for cosine similarity used inside scipy and pretty much every where else is infact (u•v)/|u||v| ,where u and v are vectors so essentially we are already doing w • ((t1-t2)/|t1-t2|) like you said anyways.
The problem is how to set thresholds and also how to introduce fuzzy logic in the process (or anything else that can help) because if i have a score of 0.052 and the neutral threshold is <= 0.05 then this will put it in the t1 partition which is wrong because 0.002 is not a significant difference and the sentence is actually neutral not biased.
.
if you just want "a if x > something else b", you could use np.where instead of assigning yourself
if you have something like -inf...0 = a, 0...3 = b, 3...inf = c you could use pd.Series.map with a dictionary but it gets a bit uglier
Yeah I'm dealing with the second option
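A short sketch of both cases on made-up numbers. For the three-range case this uses pd.cut rather than the Series.map-with-a-dictionary idea mentioned above; that's a swapped-in alternative, not what was suggested.

```python
import numpy as np
import pandas as pd

scores = pd.Series([-2.0, 0.0, 1.5, 3.0, 7.0])

# Two outcomes: np.where picks between two values element-wise.
two_way = np.where(scores > 3, "high", "low")

# Three ranges: pd.cut bins into (-inf, 0], (0, 3], (3, inf).
three_way = pd.cut(scores, bins=[-np.inf, 0, 3, np.inf], labels=["a", "b", "c"])
```

By default the bins include their right edge, so 0 lands in "a" and 3 lands in "b"; pass `right=False` to pd.cut if you want the edges to go the other way.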
That's the formula for the cosine of the angle, which is in fact the same as distance from the hyperplane. Either way of looking at it is fine; I just think that looking at it as distance from a hyperplane clarifies why what you're doing is okay.
Specifically, I have a multi index df where I ran the following command
df['env_measure_sorts_10'] = df.groupby(['year'])[score_type[1]].transform(lambda x: pd.qcut(x, q=10, labels=labels_dec))
okay i see
As far as setting thresholds, I think it depends. You said you have some labeled data. You could use that: There may be a good threshold which distinguishes "sentence about males" versus "everything else" and a second good threshold that distinguishes "sentences about females" versus "everything else".
So the function I asked about is just assigning labels to the deciles made here
And (perhaps more importantly) you can find that threshold using your labeled data.
Also thanks for the help @agile cobalt @serene scaffold, appreciate it 🙂
My suggestion is to plot histograms of "sentences about males" and "everything else" and see if it looks like there is a threshold that distinguishes them. And similarly for "sentences about females" and "everything else". If there's a threshold, great! If not, then you'll need a different embedding.
Oh geez sweet thank you
what about the problem i discussed? basically if you test it out on a huge data then you always end up having a continuous range of data between -n to n and a strict threshold limit always end up leading to false positives all the time, like the one i mentioned above
the case of 0.052
You'll always have false positives. That's how classification problems are.
Well, realistic ones, at least.
in my last run, i set the threshold to 0.05 and about 200 sentences were ranging from 0.05 to 0.055 that's a huge amount which lowered the accuracy drastically
The problem is likely to be either your embedding or your choice of the vectors t1 and t2. If your embedding isn't very good at describing masculine-topic sentences, feminine-topic sentences, and other sentences, or if t1 and t2 are not very representative of masculine- and feminine-topic sentences, then you'll get poor results no matter what you do.
Also if anyone knows, is there a difference between:
df.loc[df.column1 <= 3, 'column 2'] = 'LScores'
and
df.loc[df['column1'] <= 3, 'column 2'] = 'LScores'
Since they appear to give the same output but not sure if there are caveats to this
they're the same, but some people dislike using dot notation because it puts columns and dataframe methods in the same namespace
and if you have a column with the same name as a method, you can't get to it with dot notation.
or if the name of the column has a space
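A tiny sketch of those two caveats, with a made-up frame whose columns are deliberately named to cause trouble.

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "column 2": ["x", "y", "z"]})

by_attribute = df.count        # the DataFrame.count method, not the column
by_bracket = df["count"]       # the column
with_space = df["column 2"]    # no dot-notation equivalent exists
```

Bracket notation works for every column name, which is why a lot of people just use it everywhere.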
Good to know. Thanks a lot for the info!
@hoary jay If you haven't tried plotting histograms of the scores of your sentences, then I would do that before anything else. Those plots will tell you whether or not you have a problem with your thresholds.
ah good observation, i somehow didn't see that immediately
plotted a kde, it's a normal distribution curve...
nope plotted the complete data, centred almost at 0 shifted towards the left a little almost -0.15 ig
That's saying that most of your data is not especially similar to either of the vectors you're comparing to. Is most of your labeled data neutral?
that'd more depend on the choice of the threshold, no?
If the whole distribution looks normalish, though, then there isn't clear separation between the classes of interest.
yep
makes sense
Okay. What if you plot KDEs of just the other classes?
Maybe a little separation is visible?
ok so i just select a random threshold and then plot the different classes on seperating with it?
No, I'm saying take all the data with label 1, and plot a histogram (or KDE) of that data and only that data. Then plot a separate histogram for the data with label 2, etc.
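A sketch of the per-label histograms being described, assuming labels 0/1/2 and one similarity score per sentence. The data is made up, and only the histogram values are computed here so that any plotting library can draw them.

```python
import numpy as np

# Made-up scores and labels standing in for the labelled set
# (0 = neutral, 1 = t1, 2 = t2).
rng = np.random.default_rng(2)
labels = rng.integers(0, 3, size=900)
scores = np.array([0.0, 0.25, -0.25])[labels] + rng.normal(0.0, 0.1, size=900)

# Shared bin edges so the per-label histograms are directly comparable.
edges = np.linspace(-1.0, 1.0, 41)
per_label = {
    int(label): np.histogram(scores[labels == label], bins=edges, density=True)[0]
    for label in np.unique(labels)
}
# per_label[k] can be drawn with any plotting library, e.g. as bars over `edges`.
```

The important part is that the split is by the given label, not by the output of any threshold, so overlap between the curves shows how separable the classes really are.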
oh okay
i just got that you were talking about the a priori distribution, i'm tripping today
dist for the two labels
Wait. This is the labeled data that you began with? It doesn't seem right.
It seems extremely unlikely to me that there are sudden cutoffs at zero. Especially for the first picture since it seems to have only just started to peak.
correct, basically i only considerd the two labels here there is another label which is the neutrals that is where you will mostly find similarity values closer to 0
so u see when i plot the whole data i get a normal dist
this is the labelled data set right? not the output of your classification for a given threshold
that's a bimodal normal at best, btw
yeah i took the labelled dataset then computed their sentence embeddings and the similarity score and that is the output, similarity score of every sentence with t1 and t2 in the plot
oh yeah right.. sorry the word embeddings and their similarity scores had a normal dist not the sentences, sorry im really confused since morning lol
is this the same dataset from the reference paper? out of curiosity
nope the ref paper was not about classification so they didn't have any labelled data
ok
i also find it a little "too nice" that there don't appear to be any misclassified sentences
It looks to me like (barring a bug) your embedding isn't capturing enough information about individual sentences for you to reliably draw the distinctions you want.
But I also think it's suspicious that there are no misclassified sentences.
wdym? there are misclassified sentences...
Sure doesn't look like it.
One distribution is negative only. The other is positive only. No misclassification.
i would've worded it as, especially in the first figure, one of the classes being super steeply affected by the choice of the threshold. similar to what kyle said, the classes are kinda hard to separate
how do you know the sentence that is in the negative is actually biased towards t1? the only way to check is manually or by going through the labels, and on doing that we find that most of the sentences u see on the left side are actually neutral and hence wrongly classified
If the plots are showing what I thought they were, there's one plot for all the labeled data whose labels are masculine-topic and one plot for all the labeled data whose labels are feminine-topic.
Is that not the case?
most of the sentences u see on the left side are actually neutral and hence wrongly classified
This sounds to me like the plots are showing how the data was classified, not how the data was labeled.
0 for neutral 1 and 2 for bias against t1 and t2
ok
and you plotted the t1 - t2 score here, yeah? separately for the labels t1 and t2
in the above 2 plots u can see right, that neutral statements with value closer to 0 are in the majority .. of a population of t1 and t2
that is wrong classification right?
yep
If the plots are showing what I think they are, then no, I can't see that.
Are you saying that there are neutral statements in these plots?
the plots kinda say that the t1 vs t2 classification is trivial, and that there is not much info at all regarding the neutral class
can you show the histogram for the 0 labelled data?
no im saying that biased statements have a similarity score of closer to 0 when in reality they should not..
i think something weird is going on
in how this was labelled
the more i think about the plots the less sense they make lol
similarity score of only Neutral statements (0 labelled)
perhaps i did something wrong in the code i should re calculate the embeddings and check all the var names, i make a lot of mistakes when im fatigued lol
according to your labelled data, the 1 vs 2 classification is trivial
and fishing out the neutrals comes at a terrible price
your data set is such that the threshold of false positives means you will get a ton of false negatives
it could be true of the language, that idk. it could be an artifact of the choice of embedding or labelling, idk
looks like it that is exactly what is happening lol i end up classifying the t1 and t2 class but i also end up assigning a lot of neutrals with them
the actual interpretation of this requires domain knowledge that i don't have, this really needs knowledge of the statistics of a language
same
but if we put that aside, and assume these are correct
my suggestion would be the same as kyles. as i said before, we can treat the threshold as a hyperparam and given this info, we should be able to optimize for it
hmm ill try that
or pick it for a choice of false negatives per class, or something of the sort
immediately you notice that this thing isn't symmetric, so i would recommend to pick a different negative and positive thresh
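One way the "pick it for a choice of false negatives per class" idea could look, as a sketch under assumptions: labels 1 and 2 mark the t1 and t2 classes, and each threshold is set as a quantile of that class's own scores. The function name, the 5% target, and the data are all made up.

```python
import numpy as np

def thresholds_for_miss_rate(scores, labels, miss_rate=0.05):
    """Pick (neg_thresh, pos_thresh) so roughly `miss_rate` of each biased
    class falls inside the neutral band. Labels: 1 = t1 (positive side),
    2 = t2 (negative side)."""
    pos_thresh = float(np.quantile(scores[labels == 1], miss_rate))
    neg_thresh = float(np.quantile(scores[labels == 2], 1.0 - miss_rate))
    return neg_thresh, pos_thresh

rng = np.random.default_rng(3)
labels = rng.integers(0, 3, size=3000)
# Deliberately asymmetric made-up data: the t2 class sits closer to zero.
centers = np.array([0.0, 0.4, -0.2])
spreads = np.array([0.05, 0.1, 0.05])
scores = centers[labels] + rng.normal(size=3000) * spreads[labels]

neg_thresh, pos_thresh = thresholds_for_miss_rate(scores, labels)
```

This only controls how many t1/t2 sentences get thrown into the neutral bin; how many neutrals leak out the other way still has to be checked separately.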
about the correctness of the histogram, you'd need to check papers or contact an expert
but those histograms look very weird to me
I agree that the histograms look weird. So I'm suspicious that there's a bug. But assuming there isn't, you may not be able to extract enough data from a single sentence to do what you want.
You might have better luck if you embed whole paragraphs.
no can't its 384 dim now
wait i think i can
there is another big pretrained model i belive i can try that
i can understand technical constraints 😛
give it a shot, or what kyle said too. depends on how much time you have available for this
that would be a slight problem but i think colab might help
abstract deadline is on 23rd of June
trying to submit it in a conference
ah, pretty tight on time
Oof, good luck!
but if it's just an abstract, all you need is 1 or 2 pretty pictures 😛

this is my first ever paper so i actually dont know what to submit either 😂
Depends on the venue. Your supervisor should be able to explain.
ill ask the prof yea
check the conference guidelines. many conferences that request an abstract have a condition like 1 or 2 paragraphs, 1 or 2 images, a word limit, and other specs. check that ahead of time
in some cases you can get away with promising the world
in others you need to show the results up front, ideally with pretty pictures
ok will do
and peer reviewed ones need the paper for review months in advance (not the case if they only request an abstract)
For questions like that, there are often conventions that are specific to your field. Outsiders aren't going to have much luck making recommendations.
the conference website has the explicit guidelines, go check it out!
btw your stuff looks kinda like 2 beta distributions and a gaussian, if you like doing parametric estimation instead of KDE. not much of a difference here tbh since it's anyway blind, but...
what does this mean actually tho?
sorry not that good in stats but i like stats
in KDE one takes a basic shape and uses it to build up the observed shape
in parametric estimation, one knows the correct parametric family ahead of time and fits the parameters of that directly
yeah kind of like curve fitting
but differently
ok so can that info be useful? like knowing the distribution family? how can i use this info
for KDE the "width" or "variance" of a kernel is chosen a priori, as well as a "model order" (number of kernels). then one finds where to shift the kernels to
on the other hand, if you correctly know the parametric family ahead of time, you can explicitly do model order and parameter estimation
the difference is that KDE in general has no physical interpretation, it just gives you a parametric representation you can evaluate anywhere
in parameter estimation, the estimated parameters actually represent the properties of the data
(assuming your choice of parametric family was correct... that's a big IF)
Parametric estimation, like the normal and beta distributions Edd is describing, is particularly useful when you have limited data, when you're trying to find a simple approximation, and when theory predicts that a distribution should be close to parametric.
that's the final kicker. the lower bound for parameter estimation requires as many samples as unknown parameters
For example, the central limit theorem essentially says that, given enough data (and some mild technical hypotheses), most things look like they have a normal distribution.
At least if you choose the distribution's parameters right.
kyle's practical implications are probably more relevant to you than my ramblings 😛
ig you both are insightful
you guys are doing the work, i'm just a resident rubber ducky
how are you guys using colab without running out of memory?
lol dont say that im the one writing a paper and getting stuck on freshman statistics lol😭
i still dont know how u eyeballed that its a beta distribution, when i can hardly remember its formula and graph
I guess I should add that real data is never exactly normal (despite the central limit theorem). Often it's not even hard to see the difference, particularly in the tails.
through the magic of google. there are other curves that look like that too, i just picked one i recognized
you'd technically have to try a few and then do something like a kolmogorov-smirnov test to pick which one fits best
okay i should try that
is it like another hypothesis test?
you can check it out here https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test it's a measure of distance between distributions. scipy appears to have an implementation
does anyone know what this error means?
```
AttributeError: module 'object_detection.protos.square_box_coder_pb2' has no attribute 'DESCRIPTOR'
```
It's a hypothesis test for equality of distributions. If you're trying to determine whether one distribution looks like another, though, then you can just use the KS statistic.
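A minimal sketch of "try a few families and compare KS statistics" with scipy. The data is synthetic (drawn from a beta distribution as a stand-in for the real similarity scores), and fixing the beta support to [0, 1] via `floc`/`fscale` is an assumption about how the scores are bounded. Since the parameters are fitted on the same data, treat the statistic as a distance for ranking candidates rather than reading the p-value at face value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for bounded similarity scores; replace with the real values.
scores = rng.beta(2.0, 5.0, size=5000)

# Fit each candidate family by maximum likelihood.
a, b, loc, scale = stats.beta.fit(scores, floc=0.0, fscale=1.0)
mu, sigma = stats.norm.fit(scores)

# KS statistic = largest gap between the empirical CDF and the fitted CDF.
ks_beta = stats.kstest(scores, "beta", args=(a, b, loc, scale)).statistic
ks_norm = stats.kstest(scores, "norm", args=(mu, sigma)).statistic

print(f"beta fit:   KS = {ks_beta:.4f}")
print(f"normal fit: KS = {ks_norm:.4f}")
```

The smaller statistic marks the family whose CDF stays closer to the data.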
I would say that finding an appropriate parametric distribution is a bit of an art. It tends to work best when there's some underlying reason (physical, experimental, etc.) why that distribution is close to what you're observing.
oh i see
So, for example, if you say, "Oh, I got this quantity by taking a sum of a lot of other quantities," then that's maybe suggestive of something normal or normal-ish. There are a lot of distributions that look similar to a normal distribution (t-distributions, chi^2 distributions, etc.) but if you told me that you wanted to fit a big sum to something parametric, my first guess would be a normal distribution.
one of the things i considered when i said "beta" is that it's bounded, for example, unlike other similar-looking dists. that at least fits the behavior of the values computed by your similarity measure
If you have enough data, though, non-parametric methods will give you better results in the sense that they will come closer to describing reality.
Oh, and there are loads of places where people perform a normality test, see that it fails to reject the null hypothesis, and use that as justification for a hypothesis test that requires normality (e.g., a t-test). You should never do this: Failing to reject the null hypothesis does not mean your data is normal. Your data is never perfectly normal; you just may not have enough to reject normality with the test you're using.
wait t-test requires the population to follow normal distribution?
what if it doesn't? then is ttest valid?
The usual justification for a t-test requires normality.
If the data isn't normal, then it may still be close enough to normal that the t-test is okay.
ok so how can i make sure the data is close to normal? im asking because I'm using T test later
something like Shapiro test?
Well, if you want to be really careful, then there's no a priori guarantee.
ok
Usually we apply the t-test to something like sample means. The central limit theorem says that if we have enough data, then those are approximately normally distributed.
But how close you get to normal depends on other parameters of the distribution that you probably don't know.
This is quantified by the Berry–Esseen theorem, which shows that the rate of convergence depends on the third moment (unnormalized skewness).
In practice, you don't know the third moment, so you don't know how close you are to normal, so you can't really justify a normal approximation and a t-test.
But in practice, this rarely matters. In practice, with a decent amount of data, the skewness is almost never so extreme that it messes you up.
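For reference, the Berry–Esseen bound for i.i.d. data with mean $\mu$, variance $\sigma^2$ and finite third absolute central moment $\rho$ can be written as below. Here $F_n$ is the CDF of the standardized sample mean $\sqrt{n}(\bar{X}_n - \mu)/\sigma$, $\Phi$ is the standard normal CDF, and $C$ is an absolute constant (the best known bounds put it below 0.5).

```latex
\sup_{x \in \mathbb{R}} \left| F_n(x) - \Phi(x) \right|
  \le \frac{C\,\rho}{\sigma^{3}\sqrt{n}},
\qquad
\rho = \mathbb{E}\,\lvert X_1 - \mu \rvert^{3}
```

So the worst-case error shrinks like $1/\sqrt{n}$, and the ratio $\rho/\sigma^{3}$ is the part you usually don't know.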
my idea was a 2nd classification step: after classifying a sentence as t1 or t2 related, i could calculate scores for the word embeddings in that sentence and take them as a sample, then perform a t-test where the population is the scores of words that are actually biased towards t1 (say), obtained from the same dataset. then perhaps the null hypothesis could be that the sentence is biased because it contains biased words, and the alternative hypothesis could be that the sentence is not biased because the sample mean isn't similar to the mean of the biased words.. does that make sense?
There are various rules of thumb, like n > 30 or n > 50 or whatever. None of these are entirely reliable; you can always concoct a really bad example where they won't work. But most data isn't really bad in that way, so these rules of thumb usually work.
I don't think I quite follow. Say you have a sentence. You embed it, and you classify it into t1 or t2 related. Great. It has a lot of words. You embed those and then score them. But I'm not sure what you propose to do with those scores.
there's a difference b/w related to t1 and t2 and biased towards t1 and t2... because if a sentence is biased towards t1 and t2 then there must be a use of some Stereotypical word or anything that contributes to biases. In my very first text, i mentioned that our ref paper can actually use this word embedding scores to find out the top most biased words against t1 and t2
.
So i thought if a sentence is related to t1 then it can either be a normal sentence that just revolves around t1 and doesn't have any Stereotypical or gender bias in it...But then if a sentence that is related to t1 and has biased words getting used in it probably has a higher chance of being a Stereotypical or biased sentence against t1 right? So that's why i thought maybe we could look at the sentence from the perspective of words too
It sounds like you want to use both sentence and word embeddings. The approach you describe is a kind of hierarchical model. Those are fine; a good hierarchical model can be quite powerful. Another approach you could try (well, with enough time; I remember that you have a deadline coming up!) is making a big, combined embedding by using both a sentence and a word embedding. Maybe that would give you more information than the word embeddings alone. It has the same information as the hierarchical model, but it might also be harder to fit.
perhaps i could, if i could make time anyways thanks for your help tho!
An AI I made to play battleship. Am very excited about how well it turned out
did u use pygame for engine?
Tkinter , PIL, Matplotlib
cool i like the confusion matrix on the right
Thank you! It’s a standard population density graph! My goal has been lately to show people that AI is a very broad range and not everything needs NNs to be optimal
love that
Heat Map Colorbar Label 😉
Lmao, thank you for the catch! Was just too excited that it was working, I forgot
Ok, I'm really hoping someone can point me in a direction here... I'm building an open source AI based thing (called AutoVR.ai, details if ya want, but doesn't matter for this question)
Under the hood I'm using ZoeDepth that is based on MiDaS and those build off of torch.nn.Module. The point of this model is to take in an image and it puts out a depthmap. There is an ability to adjust the "precision" to use for the actual inference portion and even though the output resolution will end up matching the input image, it does seem to make a dramatic difference on the quality and details clearly present in the depthmap itself. So there is a desire to crank up that "precision" as much as possible if trying to produce high quality outputs. Now, the problem is, the max precision someone can use is going to be directly associated to the VRAM available. If that "precision" is set too high, it will just throw an out of VRAM error and I'm trying to deal with that a bit more gracefully than just letting the thing crash.
I first attempted to catch that error, automatically adjust the precision down a bit, reattempt, and keep track of the working precision combinations so it doesn't need re-determined multiple times. That process generally works, but there are some issues. I've stumbled onto some insane memory leaks that I might have introduced, but they sure seem to be inherent in this ZoeDepth thing that I'd rather not rip apart and/or update to do proper garbage collection.
That all said, I simply don't understand python memory management well enough at the moment. For example, without that "dynamic" functionality I'm getting this as an example error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.97 GiB (GPU 0; 11.69 GiB total capacity; 7.51 GiB already allocated; 2.06 GiB free; 7.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The part I'm really not understanding is why it is telling me 11.69 total, 7.55 reserved, but that would imply 4.14 free, although it is saying only 2.06. Just understanding this a bit more might help a lot and honestly, my googling/chatgpting has sent me down more rabbit holes than anything that seems actually helpful.
There is a bunch of other memory used by Pytorch that is not included in this out of memory error.
There is also memory being used by the rest of the system (other processes).
Gotcha, is there any way to see/find that? I can see some using nvidia-smi, but the numbers are still way different. And I'm very aware that I just don't fully understand this stuff, just need the right thing/name/concept to look into more.
Nvidia-smi gives you some of the missing info, you also need the base amount of memory Pytorch needs.
2109.44 MiB (2.06 GiB free) + 7731.2 MiB (7.55 GiB reserved) + 1251 MiB (Pytorch base amount) = 11091.64 MiB = ~10.83 GiB If you include other processes you can see how it's close to 11.69.
In addition, there can be memory fragmentation and other issues.
gotcha, that helps a bit. Honestly, I'm just a bit surprised how easy it is to blow this sort of thing up. The only way I was able to get the memory leak even semi under control was to tear down the entire model and re-build it if it encounters an out of memory error. Kinda rough, but I can make do with some caching of the "determined" maximums.
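A minimal sketch of the "catch, shrink, retry, cache the working value" loop described above. It is deliberately library-agnostic: `run_with_backoff`, `fake_infer` and the 0.85 shrink factor are hypothetical, and `MemoryError` stands in for the real exception. With PyTorch one would presumably pass `torch.cuda.OutOfMemoryError` (the exception in the traceback above) as `oom_error`, and do the model teardown/rebuild plus something like `torch.cuda.empty_cache()` inside `on_oom`.

```python
def run_with_backoff(infer, precision, cache_key, cache, oom_error=MemoryError,
                     shrink=0.85, min_precision=64, on_oom=None):
    """Retry infer(precision) with smaller precision until it fits in memory."""
    # Start from the last known-good value for this kind of input, if any.
    precision = min(precision, cache.get(cache_key, precision))
    while precision >= min_precision:
        try:
            result = infer(precision)
        except oom_error:
            if on_oom is not None:
                on_oom()  # e.g. tear down and rebuild the model
            precision = int(precision * shrink)
            continue
        cache[cache_key] = precision  # remember what worked
        return result, precision
    raise RuntimeError("no precision at or above min_precision fits in memory")


def fake_infer(precision):
    # Pretend anything above 700 does not fit on the GPU.
    if precision > 700:
        raise MemoryError("pretend the GPU ran out of memory")
    return f"depthmap@{precision}"


known_good = {}
result, used = run_with_backoff(fake_infer, 1024, (1080, 1920), known_good)
print(result, used, known_good)
```

Because the loop still retries on failure, a cached value that stops working later (the usable maximum can drift between runs) only costs a few extra attempts.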
Yea, tried the environment variable at 64mb, 128mb, 512mb, and 1024mb with no significant differences.
Pytorch is not super memory efficient, it's set up for easy experimentation.
Ah, gotcha, good to know and keep in mind.
Making full use of memory requires custom implementations that are not even close to being worth the effort (a lot of work to implement all the manual allocation and custom kernels for the specific model) when you can just buy a bigger GPU or more GPUs.
The only applications that will try to squeeze out all of the memory of a GPU is more or less video games (and only some of them (especially if they have to run on consoles)).
Ugh, I've done a little bit of PS1/PS2 game dev long long ago. Not a fan of writing custom memory managers lol
In the world of ML we just throw more hardware at it, because we usually can / are running in the cloud.
ugh, if i re-run this same thing a handful of times, the max resolution I can run at floats up/down by 10-15% without making any changes
Yea, I'm seeing that lol...i'm trying to build this thing to run on consumer level hardware AND i'm trying to make it pretty much auto configured, at least for a reasonable set of defaults
Consumer level hardware ML is very rough, because of things like this.
The ecosystem is not really set up for it, at least not in an easy way.
Finding that out lol...the early version of this memory leak was so bad that it would crash OTHER applications and occasionally my entire user session (Kubuntu, flavor of Ubuntu) and that kinda surprised me to say the least.
Yeah, and it will crash some drivers too, can brick some ppl's PCs in some cases like some games do.
When I experiment on the GPU locally I often have to turn it off and back on again.
I haven't worked on the infrastructure/DevOps side of things in a long time, but would a Docker container even provide any protection/sandboxing?
We use a whole virtual machine now, and set an upper limit on memory usage, lower by a decent amount than the total. Can just restart the instance then.
It seems like some of this stuff is almost what I used to call "bare metal" code back in my C/C++ days.
Requires some annoying GPU pass-through work though, which your motherboard needs to support.
hmm my understanding was that VMs don't expose the GPU very well from a performance/overhead point of view
It depends on the hardware, they need to support virtualization, for the GPU.
Ahh that makes sense. Kinda a limiting factor if i'm trying to make this broadly available/usable.
Yeah, support is spreading, but it will take a few years to be everywhere.
Also trying to avoid the insane install/setup process that I've had to go through with some of these experimental AI/ML things, standing up instant-ngp (NeRF) was insane for me.
These kinds of issues are one of the reasons we don't see ML on consumer hardware used in applications everywhere yet. GPU work is very buggy / annoying, and it's kind of a miracle that people manage to get video games to run on all hardware configurations (spoilers, they don't, it's endless bug reports (working with consoles is very nice by comparison / fixed hardware)).
Fixed hardware, but at least back when I was in school (degree in game design/development from a place called Full Sail) that meant you were also going to have to write your own low level libraries and such like memory managers or basic data structures. lol I shouldn't have to write my own doubly linked list damnit...
Damn, that was 20 years ago, just realizing that lol
Yeah, but at least someone could write it once, and it works. With random hardware combinations on PCs you have weird stuff like specific GPU with specific driver version bricks the PC when they plug in a specific keyboard, and you get a bug report that makes no sense, all you can do is random trial and error.
And then to add to injury, the OS is really buggy (and getting worse over time (e.g. Windows 11 can barely manage to make a window fullscreen now without 10%+ of users crashing)).
ha, yea, the unknown combinations makes stuff a lot trickier
i'm new to AI. Anyone who knows the GODEL NLP model (based on T5) that can give me some pointers? stuck on two things.
Be sure to always ask a complete question that someone who knows the answer can start answering
The big thing is I'm training the model IAW GODEL data structure of Context / Knowledge / Response using Huggingface Trainer. The code executed completely, but there doesn't seem to be any effect. I tried to query the model and it wouldn't give me any hint of the training data. The second thing is I'm working on a SQUAD metric to monitor the training process, but ran into some data structure conflict between the Trainer's evaluation output eval_pred and the SQUAD metric's input metric.add_batch.
for what? this sounds like something where you have to consider what the model is intended to do and what the tradeoffs are for its use case.
I have a big data problem. I have csv files with over 300 gb. Data is in multiple files. I want to carry out EDA. Data frames memory limit is only 100gb so that is not sufficient plus it is taking too long to process a single file even after chunking. What is the solution?
Both are wrong and will create an incorrect model. Undersampling won't have enough training data, and oversampling will go into too much detail and get the prediction wrong
Well that's the problem with Big Data and why cloud processing is a thing right.. if you really want to do it locally, it simply WILL cost a lot of time. But I can add using the Threading library if you havent yet. Then you can at least process multiple chunks at the same time
@weak lagoon I actually did just stumble on https://docs.dask.org/en/stable/scheduling.html while looking for examples. Might be interesting for you
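A rough sketch of doing basic EDA statistics without ever holding a full file in memory, using pandas chunked reads. The column name `amount`, the tiny in-memory "files" and the helper `streaming_describe` are all made up for illustration; with real data `sources` would be a list of CSV paths. Dask's dataframe API automates a similar chunk-and-combine pattern.

```python
import io

import pandas as pd


def streaming_describe(sources, column, chunksize=100_000):
    """Accumulate count/mean/min/max for one numeric column across many CSVs."""
    count, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for src in sources:
        # usecols keeps only the needed column in memory for each chunk.
        for chunk in pd.read_csv(src, usecols=[column], chunksize=chunksize):
            values = chunk[column].dropna()
            if values.empty:
                continue
            count += len(values)
            total += float(values.sum())
            lo = min(lo, float(values.min()))
            hi = max(hi, float(values.max()))
    mean = total / count if count else float("nan")
    return {"count": count, "mean": mean, "min": lo, "max": hi}


# Two tiny in-memory "files" standing in for the real CSVs.
file_a = io.StringIO("id,amount\n1,10\n2,20\n3,30\n")
file_b = io.StringIO("id,amount\n4,40\n5,\n")

summary = streaming_describe([file_a, file_b], "amount", chunksize=2)
print(summary)
```

Only running totals are kept, so memory use depends on `chunksize`, not on the 300 GB total.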
Gonna become an evangelist for this topic.
- train classifier
- Make your ROC curves, precision recall plots, DET-curves, ...
- Select your decision boundary based on where you want to land on these curves
Under/oversampling, class weights, ... are an opaque way to solve this problem. If there is a signal in your data you will see a difference between the negative and positive class
I think these are better because they're easier to reason about and you use your data, which is hopefully a representative sample of the population, as is. The one I could get most behind is class weights because at least you're not messing with your sample.
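The three steps above can be sketched with scikit-learn as follows. The data is synthetic (`make_classification` with a 95/5 split), logistic regression is just a stand-in classifier, and the 0.80 recall target is an arbitrary example of "where you want to land" on the precision-recall curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 5% positives.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# 1. Train the classifier on the data as is (no resampling).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_val)[:, 1]

# 2. Compute the precision-recall curve on held-out data.
precision, recall, thresholds = precision_recall_curve(y_val, scores)

# 3. Pick the highest threshold that still reaches the target recall.
#    recall[:-1] lines up with thresholds; recall falls as thresholds rise.
target_recall = 0.80
idx = np.where(recall[:-1] >= target_recall)[0][-1]
chosen_threshold = thresholds[idx]

y_pred = (scores >= chosen_threshold).astype(int)
achieved_recall = y_pred[y_val == 1].mean()
print(f"threshold={chosen_threshold:.3f} "
      f"precision={precision[idx]:.3f} recall={achieved_recall:.3f}")
```

The threshold should be chosen on validation data and then checked on a separate test set.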
do any of you guys understand AIC and BIC scores in regression models?
What is your specific question?
im tryna figure out why my AIC and BIC scores are in the 600s
Have you looked at the formulas of AIC and BIC? They're pretty intuitive imo
the number itself is also not that important, only that you pick the smallest one w.r.t. the model order
much like in most other cost functions
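For reference, the standard definitions, with $k$ the number of estimated parameters, $n$ the number of observations and $\hat{L}$ the maximized likelihood:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

Both scale with the log-likelihood, which itself grows with the amount of data, so a value in the 600s says little on its own. Only the difference between candidate models fitted to the same data is meaningful.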
how would you guys implement an api into a chatbot
Does anyone know why in the picture on the right it says it was much much faster if in that image it goes through 900 iterations and in the image on the left it only goes through 9?
It doesn't say it's faster, it says it converges faster. In the left where you did 10 runs (which was your max iteration input), you didn't get to the point of convergence yet. And since the learning rate is small, it would take more iterations. Why don't you compare them both on 2000 iterations?
I cannot uninstall a package. Can anyone help?
```
$ vcs
bash: vcs: command not found...
Install package 'python3-vcstool' to provide command 'vcs'? [N/y] y
$ vcs
usage: vcs <command>
Most commands take directory arguments, recursively searching for repositories
in these directories. If no arguments are supplied to a command, it recurses
on the current directory (inclusive) by default.
The available commands are:
branch Show the branches
custom Run a custom command
diff Show changes in the working tree
export Export the list of repositories
import Import the list of repositories
log Show commit logs
pull Bring changes from the repository into the working copy
push Push changes from the working copy to the repository
remotes Show the URL of the repository
status Show the working tree status
validate Validate the repository list file
See 'vcs <command> --help' for more information on a specific command.
$ pip uninstall vcs
WARNING: Skipping vcs as it is not installed.
$ pip uninstall python3-vcstool
WARNING: Skipping python3-vcstool as it is not installed.```
that's not a python package i think, that's an apt-get one, or whatever package manager you're using.
Ye also thought it's probably apt, though I'm not familiar with Bash. 'apt remove python3-vcstool' should do it. If it really is pip you should be able to see your packages with 'pip list'
pip list | grep vcs
how am i supposed to convert each sentence to tensors when making a seq2seq model? each sentence doesn't have to contain the same number of tokens, so that might cause a problem while training the model, don't you agree?
That's why you add a <pad> token to your sentences, so you can make all your sentences have the same size and convert them to tensors.
Just add the pad token to all your sentences until all of them have the same size as the largest one.
is this like adding zeros?
just came across datalore by jetbrains, does anyone use this? I'm working on data reports/insights for some of our staff (non-technical) and found streamlit to be limited. While looking up Dash documentation I came across this and it looked interesting. Curious if anyone has tried it or uses it currently
so i am supposed to build the vocab where each token will have its own number (including <pad>, <eos> and <sos>), then i convert each sentence to a vector of those integers, and from there i convert this to embeddings, right?
@hasty mountain
What type of analysis do you do?
Yes. And it's like adding zeros. The difference is that the zeros are the values in the vector for those <pad> tokens
So a sentence <how><are><you><doing><?><pad> will be something like [1, 2, 3, 4, 5, 0]
yeah and then creating the embedding from that right?
Yes
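A small plain-Python sketch of the vocab, encode and pad steps discussed above. The index assignments (`<pad>` = 0, `<sos>` = 1, `<eos>` = 2) and the helper names are assumptions for illustration. The padded integer lists are what would then be turned into a tensor and fed to an embedding layer; `torch.nn.Embedding`, for instance, takes a `padding_idx` argument for the pad index.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(sentences):
    """Map every token to an integer id; special tokens get the first ids."""
    vocab = {PAD: 0, SOS: 1, EOS: 2}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_batch(sentences, vocab):
    """Wrap each sentence in <sos>/<eos> and right-pad to the longest one."""
    encoded = [[vocab[SOS]] + [vocab[t] for t in s] + [vocab[EOS]]
               for s in sentences]
    max_len = max(len(ids) for ids in encoded)
    return [ids + [vocab[PAD]] * (max_len - len(ids)) for ids in encoded]


sentences = [["how", "are", "you", "doing", "?"], ["hi"]]
vocab = build_vocab(sentences)
batch = encode_batch(sentences, vocab)
for row in batch:
    print(row)
```

Every row now has the same length, so the batch converts cleanly to a 2D tensor.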
I'm asking because it's important context to answer your question correctly 🙂
so ok thx!
it'll mostly focus on financial data like revenue (MoM,YoY,etc.) for 70+ locations and then stats related to our main product offering so stuff like # of players, bookings, maybe customer demographics. The end goal is to replace our current Google Studio reporting with something a bit more automated while maintaining a very user friendly ui
Okay the big brain answer is to focus on data prep and give business a BI tool and let them make the reports themselves
If that's a step too far in your org, you should still use a BI tool (Power BI, Tableau, Looker) to make the reports for them
yeah it's still a very small op so there's no dedicated team for that unfortunately - it's likely to fall on myself to push out or at least give upper mgmt key stats at a glance. I was able to put together a basic dashboard on streamlit but it got a bit limiting as I wanted to do more so I've just been searching for alternatives. Our data isn't overwhelming and luckily a lot is self reported by our locations, it's just cleaning it up and trying to put together meaningful reporting. Not opposed to using a BI tool though
What's limiting about streamlit for your case?
at the moment I have a working db for our corporate locations w/ basic stats our controller uses for her work - this likely can be scaled & adjusted for franchise locations so it's not an issue. On the other hand, staff in our corp. locations are asking if something can also be done for them so that they can skip exporting csvs, making tables, etc. for their daily reporting. I'd ideally like to keep these as private apps and right off the bat not being able to deploy more than 1 on SL is an issue. Granted, I've been mostly self teaching Python for a few months so if any workarounds or more appropriate methods exist it's completely plausible I just don't have the 'know' right now. Streamlit also re-running on a user action isn't ideal if I expect a number of users to be on it at once.
also appreciate the responses & help fyi
I think there's several issues here right?
- You want a way to onboard the data in a better way
- you want people to see only the data they're authorized to see?
if that's the case, you need to find out what is generating the data for those CSVs and either have it upload data to your DB directly or at least automate exporting the CSVs and then parsing it and putting it in your DB (less robust).
Is algebraic geometry used in machine learning
For point 2, Yeah, that's an issue with streamlit. There's an authentication package for streamlit. As soon as a user is authenticated you could filter the dataset to show only what they're allowed to see. That's called row level security. Means you only need to make a single app.
BI tools at least have the second part baked into it.
thank you 🙂
Anybody got a website with a list of plugins that improve productivity or are just good to have for data science in Python notebooks via VSCode?
At me if you got anything.
If i load my data by a Tensorflow DataGenerator from a directory like this, what file formats will the datagenerator accept? I'm trying to load grayscale images represented as 2D numpy arrays, but the datagenerator seems to recognize "0 images" in the directory.
makes sense thx bro
i got another question real quick
Why is the graph of the cost vs iteration steeper for the picture on the right which has a smaller learning rate than the picture on the left? I would think that since the picture on the left has a bigger learning rate the graphs would be opposite
Whether a learning rate is (too) small or (too) big depends on the data. If the learning rate is too small, a larger learning rate will lead to faster convergence. However, if the learning rate is already good or too large, increasing it further means you converge later or not at all. The answer lies in how the w-values develop over iterations.
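A toy illustration of that point, assuming a one-parameter cost `w**2` (not the model from the screenshots). The function `iterations_to_converge` is a hypothetical helper that counts gradient descent steps until `w` is close to the minimum.

```python
def iterations_to_converge(learning_rate, w0=5.0, tol=1e-6, max_iter=10_000):
    """Minimise cost(w) = w**2; return steps until |w| < tol, or None."""
    w = w0
    for step in range(1, max_iter + 1):
        w -= learning_rate * 2.0 * w  # gradient of w**2 is 2w
        if abs(w) < tol:
            return step
        if abs(w) > 1e12:
            return None  # the updates overshoot further each step: diverged
    return None


slow = iterations_to_converge(0.01)    # too small: converges, but slowly
fast = iterations_to_converge(0.1)     # reasonable: converges quickly
too_big = iterations_to_converge(1.1)  # too large: never converges

print("lr=0.01 ->", slow, "steps")
print("lr=0.1  ->", fast, "steps")
print("lr=1.1  ->", too_big)
```

Which learning rates count as "too small" or "too large" depends on the curvature of the cost, which in a real model depends on the data and its scaling.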
on the picture on the right at the bottom why does it say that it is converging slower than the picture on the left
the pic on the right has a much steeper cost vs iteration graph
yo anyone here know about R programming?
Ask away
have any idea about creating a subset using 2 diff datasets?
I don't understand your question
like i have 2 diff data files and not all samples are present in both files. so we need to first subset the samples that are present in both files.
Dplyr inner join is probably what you're looking for
I'm not sure what they mean by that tbh
yes i did that , but the issue is my one data file does not have any header name and that is creating problem for me
if u could allow i can send u the file to have a look.
I'm watching a competition atm 😛 so sorry not going to do too much. Are the files comparable? E.g. should the 2nd file have the same header as file 1?
these are headers in 1st file ("" "type" "tissue_source_site" "disease_type") but for the second file ("" "TCGA.E9.A1NI.01A" "TCGA.A1.A0SP.01A" "TCGA.E2.A14T.01A" "TCGA.AR.A24O.01A"........) it just start like this no header or something at all
So how would you like to subset them?
i have no idea how subsetting works, if u can shed some light on it
Well, if you want to merge or use those files together, there should be a way to match the records, right?
Otherwise you just put random data together which is meaningless
yes the header "type" has data which is also present in 2nd file
I'm not sure how it works in R, but in that case I'd remove both file headers and manually create header arrays and load the files with those
make sure the "type" header is on the correct location for both files and name other columns whatever you want
if u can explain in python that works too
In Pandas you can specify column names when you read_csv
Just like this: ['name_a', 'name_b', 'yolo', 'type']
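A minimal example of supplying your own header with `pd.read_csv`. The file contents here are invented and read from an in-memory string; the column names are the placeholder ones from the message above. `header=None` tells pandas the file has no header row, and `names=` supplies one.

```python
import io

import pandas as pd

# Stand-in for a header-less CSV file on disk.
raw = io.StringIO("s1,0.5,1.2,tumor\ns2,0.1,0.9,normal\n")

column_names = ["name_a", "name_b", "yolo", "type"]
df = pd.read_csv(raw, header=None, names=column_names)
print(df)
```

If the file does have a header row you want to replace, pass `header=0` together with `names=` so the old header is not read as data.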
but as i told the 2nd file looks like this , how im gonna create header ?
gene_expression_df = pd.read_csv('tumor_gene_expression_data.csv', 'name_a') like this?
Hmm that file will be a problem since it all seems to be on 1 row and only whitespace separated
yes, i've been banging my head on that since yday, tried a lot of solutions and still nothing works 😦
any fix for this?
hard to tell without having the full file to look at, but worst case scenario just parse it with normal string manipulation then pass it to pandas as a dictionary of lists
Yes it can be fixed, but it will involve some work. Think of 'readlines()', regular expressions and writing to csv. Those are the 3 things you will need to do is my guess.
If you can get it to csv format, all you need to do is specify headers yourself and you can use it however you want
i tried this code
import pandas as pd
# Read the TSV file
df = pd.read_csv('TCGA-BRCA.htseq_fpkm-uq_gene_name.tsv', sep='\t')
# Convert to CSV
df.to_csv('output_file.csv', index=False)
now it looks like this
Ahah, that is a lot different than I expected. That makes it a lot easier though. There's just a lot of column_names but the file quality seems fine.
what should i do next?
So are you now able to find the column_name you want?
no, i wasn't able to get it
how exactly?
@weak lagoon then how do you solve imbalanced target variable
@serene scaffold I was trying to offset imbalanced target variable
also imbalanced target variable also creates incorrect model and you need to offset it
Can you rename your first column to 'identifier' or whatever?
the unnamed thing?
Yes
done next
Haven't tested it myself, but try: df_unpivot = pd.melt(df, id_vars='unnamed', value_vars=df.columns[1:])
Didn't know how to exactly write it without testing
df_result = df.melt(id_vars=['unnamed'], value_vars=df.loc[:, df.columns != "unnamed"])
This should work. Replace unnamed with what u called it
(and ofc replace df with gene_expression_df)
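A small end-to-end sketch of the melt-then-join idea, assuming the expression file has one row per gene and one column per sample. The sample IDs in `expr` are taken from the header quoted earlier; the gene names, the `type` values and the extra sample ID in `meta` are made up for illustration.

```python
import pandas as pd

# Wide expression table: one row per gene, one column per sample.
expr = pd.DataFrame({
    "identifier": ["GENE1", "GENE2"],  # hypothetical gene names
    "TCGA.E9.A1NI.01A": [1.5, 0.2],
    "TCGA.A1.A0SP.01A": [2.5, 0.4],
})

# Sample metadata; only one sample overlaps with the expression table.
meta = pd.DataFrame({
    "sample": ["TCGA.E9.A1NI.01A", "TCGA.XX.0000.01A"],
    "type": ["tumor", "normal"],
})

# Wide -> long: every (gene, sample) pair becomes its own row.
long_expr = expr.melt(id_vars="identifier", var_name="sample",
                      value_name="expression")

# Inner join keeps only samples present in both tables.
merged = long_expr.merge(meta, on="sample", how="inner")
print(merged)
```

Leaving out `value_vars` makes `melt` use every column that is not listed in `id_vars`.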
well after fixing a bit got this
for this model ssd_resnet50_v1_fpn_640x640_coco17_tpu-8
what is the recommended image size?
my current image size from the xml files is this:
```
<size>
  <width>4056</width>
  <height>3040</height>
  <depth>3</depth>
```
is that too much?
i'm still dealing with a kernel filling up issue on colab
those are some really large images
if that 640x640 means it expects 640x640...
yeah i thought that would be the culprit
so i compress them?
I don't know how tensorflow works very well, but it sounds like it might resize automatically: https://github.com/tensorflow/models/blob/5f0f949de9667552d85f0922191a05a0c9d0a99c/research/object_detection/configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config#L46-L51
research/object_detection/configs/tf2/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.config lines 46 to 51
```
image_resizer {
  fixed_shape_resizer {
    height: 640
    width: 640
  }
}
```
lol we googled the same thing, yeah the tutorial never said anything about image size, i thought it would resize on its own
this memory issue is really bugging me now
in https://www.tensorflow.org/api_docs/python/tf/image/resize it says the default method is
bilinear: Bilinear interpolation. If antialias is true, becomes a hat/tent filter function with radius 1 when downsampling.
but I do not know if that applies to that config
for subwords that replace out of vocabulary words, how does the model interpret the new definition?
I'm convinced my code should work. Probably something went wrong with renaming that column
You can indeed remove the brackets though. So you end up with:
gene_expression_df.melt(id_vars='identifier', value_vars=gene_expression_df.columns.drop('identifier'))
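For anyone following along, a minimal sketch of the unpivot on a toy frame. The frame below is a made-up stand-in for gene_expression_df, and the 'sample' and 'expression' names are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for gene_expression_df: one identifier column
# plus one column per sample.
gene_expression_df = pd.DataFrame({
    "identifier": ["gene_a", "gene_b"],
    "sample_1": [1.0, 2.0],
    "sample_2": [3.0, 4.0],
})

# Every column except the identifier becomes a (sample, expression) row.
value_columns = [c for c in gene_expression_df.columns if c != "identifier"]
long_df = gene_expression_df.melt(
    id_vars="identifier",
    value_vars=value_columns,
    var_name="sample",
    value_name="expression",
)
print(long_df)
```

If the rename of the first column did not stick, checking gene_expression_df.columns before melting is a quick way to see what the identifier column is actually called.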
yeah it does resize it
im not sure what is wrong then
should I still resize it bc the image is still massive?
i reduced the dataset to 100 images only
almost definitely not
maybe the fine tuning dataset, but the full training data (including pre-training)
@agile cobalt how do you deal with unbalanced target values
depends on how unbalanced it is
for some things you can just use a "balanced" function for the loss (which applies weights based on the number of elements in that group)
Very unbalanced
Oversampling or undersampling is not suitable?
Or SMOTE @agile cobalt
see @agile cobalt
rf_model_balanced = RandomForestClassifier(random_state=0, class_weight='balanced')
rf_model_balanced.fit(train_features, train_labels)
rf_model_balanced_pred = rf_model_balanced.predict(test_features)
print(classification_report(test_labels, rf_model_balanced_pred))
this is what I did
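As a sketch of what class_weight='balanced' does: the scikit-learn docs give the weight of each class as n_samples / (n_classes * count_of_that_class). The helper below reimplements that formula in plain Python for illustration; the 98/2 split is a made-up example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up imbalanced target: 98 negatives, 2 positives.
labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)
```

The rare class gets a much larger weight, so each of its samples counts for more during fitting.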
ChessDotComResponse(stats=Collection(chess_blitz=Collection(best=Collection(date=1673051280, game='', rating=310), last=Collection(date=1686273194, rating=207, rd=119), record=Collection(draw=1, loss=8, win=3)), chess_bullet=Collection(best=Collection(date=1679884176, game='', rating=451), last=Collection(date=1686027007, rating=309, rd=113), record=Collection(draw=0, loss=5, win=4)), chess_daily=Collection(last=Collection(date=1673799161, rating=400, rd=350), record=Collection(draw=0, loss=1, time_per_move=66341, timeout_percent=0, win=0)), chess_rapid=Collection(best=Collection(date=1686345903, game='', rating=651), last=Collection(date=1686362671, rating=651, rd=31), record=Collection(draw=35, loss=187, win=169)), puzzle_rush=Collection(best=Collection(score=6, total_attempts=9)), tactics=Collection(highest=Collection(date=1685588227, rating=1349), lowest=Collection(date=1675015529, rating=412))))
Anyone know how I can use this data to pull the wins, draws, losses of a specific game type? I am planning on figuring out how to differentiate between them
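One possible approach, sketched with a stand-in object: the repr suggests nested attribute access, so getattr with the game type name may be enough. Whether the real response object behaves this way is an assumption based only on the printed output; `record_for` and the SimpleNamespace stand-in are made up for illustration.

```python
from types import SimpleNamespace as NS

# Stand-in that mimics part of the printed structure (values copied from the repr).
stats = NS(
    chess_blitz=NS(record=NS(draw=1, loss=8, win=3)),
    chess_bullet=NS(record=NS(draw=0, loss=5, win=4)),
    chess_rapid=NS(record=NS(draw=35, loss=187, win=169)),
)


def record_for(stats, game_type):
    """Return (wins, draws, losses) for one game type, e.g. 'chess_rapid'."""
    record = getattr(stats, game_type).record
    return record.win, record.draw, record.loss


for game_type in ("chess_blitz", "chess_bullet", "chess_rapid"):
    wins, draws, losses = record_for(stats, game_type)
    print(f"{game_type}: {wins}W / {draws}D / {losses}L")
```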
Anyone here also work with C++?
Of course! In the realm of C++, I find myself immersed in the intricacies of its profound abstractions and the boundless tapestry of algorithmic finesse it offers. Metaprogramming serves as a key instrument, allowing us to transcend the limitations of traditional runtime execution by deftly manipulating compile-time computation. Memory management becomes a virtuosic endeavor as I orchestrate intricate ballets of optimization, minimizing runtime overhead through the adept utilization of smart pointers, move semantics, and refined memory layouts. Within the expansive dominion of the C++ standard library, I harness an array of opulent algorithms, containers, and utilities, evoking an exquisite finesse within my code. Together, we embark on a collective pursuit of enlightenment, unraveling the enigmatic tapestry of C++ as we push the boundaries of innovation and amplify the crescendo of our software engineering prowess.
This looks like a ChatGPT response
This is the data science channel. Why do you ask about C++?
The reason I asked about that is because this channel, according to the description, is also about matplotlib. I have used that and loved it. I tried to use it with C++ for easy quick plots but ran into some issues. I was hoping there might be folks on here who have similarly tried to exploit matplotlib.
While matplotlib is on topic here, that's only if you're using it in python.
Well. I tried! 😛
I wish! I will celebrate the day that someone finds a non-trivial use of ample line bundles in machine learning.
why can tensorflow models take in both tensors and numpy arrays, but not regular lists? why even convert to numpy?
because you'd realistically do math, e.g. preprocessing of some sort, in numpy, and its arrays are easy to convert for several reasons: they have a single element type, they are contiguous in memory, and they have an efficient interface, like being able to take in or hand out their buffer. pretty much none of those are true for python lists
ah
you should be able to use a list of numpy arrays though
convert_to_tensor should be able to grab lists and make tf tensors out of them
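A small sketch of the properties mentioned above, using only numpy. The nested list is an arbitrary example.

```python
import numpy as np

rows = [[1, 2, 3], [4, 5, 6]]              # plain nested Python list
arr = np.asarray(rows, dtype=np.float32)   # one typed block of memory

print(arr.dtype)                   # a single element type for the whole array
print(arr.flags["C_CONTIGUOUS"])   # elements sit in one contiguous block
buf = memoryview(arr)              # raw buffer exposed through the buffer protocol
print(buf.nbytes)                  # 6 values * 4 bytes each
```

A list gives none of these guarantees, so a library has to walk it element by element and guess a type before it can build a tensor.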
Can't numpy arrays be tensors?
sure they can, but i think they meant specifically tf tensors
anybody know why vector addition using numpy might produce different results across two machines?
numpy and python versions are the same ^
is the backend for numpy also the same?
Hi, Could someone please share data science - AI learning path and topics to be covered ?
one thing that comes to mind is that the default integer type depends on the platform
like, np.arange(100) may be int32 or int64 depending on the system (np.zeros(100) is float64 everywhere, as far as I know)
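A quick way to check the defaults on each machine. As far as I know the float default is float64 on every platform, while the integer default follows np.int_, which was 32-bit on Windows before NumPy 2.0.

```python
import numpy as np

float_default = np.zeros(3).dtype        # default float type
int_default = np.array([1, 2, 3]).dtype  # default integer type, platform dependent

print("float default:", float_default)
print("int default:  ", int_default, f"({int_default.itemsize * 8}-bit)")
```

Running this on both machines and comparing the output would rule the dtype explanation in or out.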
Also numpy is simd optimised which has different implementations for each architecture. So maybe differences there
I thought basic CPU operations were guaranteed to produce consistent results
That may be, but from what I understand, the vectorization process can potentially change the order of operations. Considering this manipulation is at very low level of memory management, this could lead to slightly different results, especially across different architectures as the simd implementations will also be different
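A minimal sketch of why the order matters: floating point addition is not associative, so regrouping the same numbers can change the last bits of the result. This uses plain Python floats and explicit left-to-right loops; the values are arbitrary.

```python
from functools import reduce
import operator

# Same three numbers, different grouping.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b, a, b)

# Same four numbers, summed left to right in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
forward = reduce(operator.add, values)
backward = reduce(operator.add, reversed(values))
print(forward, backward)
```

A vectorized sum that splits the data into lanes effectively picks yet another grouping, which is one way two machines can disagree slightly.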
as long as you do them in the same order. depending on how you install numpy tho, don't you end up with different BLAS/MKL flavors?
ah, right, that makes sense
yeah, the pypi build uses OpenBLAS, anaconda's is a fancy mkl one
Hi, are you still looking for this? I can dm you some recommendations if you like.
Yes please that would help.
Not for floating point operations across different devices (SIMD or not).
It's why some multiplayer games (e.g. Starcraft) use fixed point.
(For deterministic results across different devices)
There are some CPU flags that can be enabled to get the same results, but they come at a performance cost. Some physics engines support that.
Then there are bugs, found in many math libraries, and the hardware itself.
Does anyone here know how to deal with imbalanced response variable?
I have a very imbalanced response variable for categorical prediction
I am trying to predict the possibility of a bankruptcy of a company
I am using RandomForestClassifier
I used class_weight='balanced' parameter
but Im not sure that is good enough
I read some people do under or oversampling
or SMOTE
but some places tell me not to do those
so Im a bit confused
I've answered this question like 3x
Instead of saying everything above 0.5 is bankrupt and under not you should look at precision recall, ROC, DET, ... curves and determine a cut-off yourself @warm copper
so like using under and oversampling? @past meteor
no, just your data as-is
For example you can simply plot the distribution (a histogram) of the scores for your positive and negative class
And then you eyeball your data and decide where to put it
how would that even work tho when 98 percent of the data has bankrupted companies
lol
I already did that @past meteor
This is an example of what I mean
this is either 0 or 1 tho
So here you can see if I choose the score at 0.08 there would be no more false negatives
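The same idea as a sketch in code: the largest cut-off that still catches every positive is simply the lowest score among the positives. The scores and labels below are made up, and a record counts as positive when its score is at or above the cut-off.

```python
def zero_false_negative_threshold(scores, labels):
    """Largest cut-off that still flags every positive (score >= cut-off)."""
    return min(s for s, y in zip(scores, labels) if y == 1)


# Made-up scores with their true labels.
scores = [0.02, 0.05, 0.08, 0.10, 0.30, 0.60]
labels = [0, 0, 1, 0, 1, 1]

cutoff = zero_false_negative_threshold(scores, labels)
false_positives = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
print(cutoff, false_positives)
```

The price of zero false negatives is whatever false positives sit above that cut-off, which is the trade-off the plot makes visible.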
I dont know what li score is tho
In your case li score will be whatever probability you have
I was working on finger print recognition so the li = left index, for you it'll be "bankruptcy score" or whatever
your model has a .predict() and a .score() method, use the latter
Sorry, it's .predict_proba()
I'm actually giving a pretty shitty explanation I know @warm copper , part of it is me not wanting to type out what I've done x3 over the past few days but that's my fault and not yours 😅
Think i could send a bit of code and you could take a look? I can tell you which line the difference occurs
I prefer if we keep it here because then other people can add stuff as well 🙂
pred_prob = rf_model.predict_proba(test_features)
print(pred_prob)
plt.figure(figsize=(16, 10))
plt.hist(pred_prob[test_labels == 0], bins=50, label='Negatives', alpha=0.5, color='b')
plt.hist(pred_prob[test_labels == 1], bins=50, label='Positives', alpha=0.7, color='r')
plt.xlabel('Probability of being Positive Class', fontsize=25)
plt.ylabel('Number of records in each bucket', fontsize=25)
plt.legend(fontsize=15)
plt.tick_params(axis='both', labelsize=25, pad=5)
plt.show()
am I doing something wrong here?
my test_features = X_test
You need to normalize the scores because I can imagine test_labels == 0 is a lot more than test_labels == 1
my test_labels = Y_test
I keep getting this error tho
ValueError: The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors were provided
I fixed it
Now you still have to normalize the scores and then you're nearly there
they are already normalized tho
How is your y-axis then going to 5000
```
train_features_scaled = scaler.fit_transform(train_features)
test_features_scaled = scaler.transform(test_features)
```
I mean normalize your histogram
yes
No idea. The general idea btw is that these plots can help you select a sensible probability to decide if something is in the negative or positive class.
Ties in well with the ideas behind ROC-curves, precision - recall, etc.
Undersampling, oversampling, class weights take part of this out of your hands when it should probably be a decision you make
Im going to try to fix the graph first
Data scientists typically evaluate their predictive models in terms of accuracy or precision, but hardly ever ask themselves: However, an accurate estimate of probability is extremely valuable from a…
interesting @past meteor
Yeah they're not calibrated probabilities
They're over and/or underestimated but for this example that doesn't matter too much
Yes, sns.histplot()
I think the probalem is that predict_proba returns a 2d array
problem**
maybe I need to convert it to one dimensional array?
ValueError: The 'color' keyword argument must have one color per dataset, but 2 datasets and 1 colors
hence this error
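A sketch of the likely fix, with made-up data standing in for rf_model.predict_proba(test_features) and test_labels. predict_proba returns one column per class (ordered like model.classes_), so for a 0/1 target column 1 holds the positive-class probability. Normalizing each histogram to a density keeps the small class visible next to the large one.

```python
import numpy as np

# Hypothetical stand-ins for the real model output and labels.
rng = np.random.default_rng(0)
test_labels = np.array([0] * 95 + [1] * 5)
p_positive = np.clip(rng.normal(0.1 + 0.4 * test_labels, 0.05), 0.0, 1.0)
pred_prob = np.column_stack([1.0 - p_positive, p_positive])  # shape (100, 2)

# Keep only the positive-class column so each group is a 1-D array.
pos_scores = pred_prob[:, 1]

# density=True rescales each histogram so its area is 1,
# which makes the two classes comparable despite the imbalance.
bins = np.linspace(0.0, 1.0, 21)
neg_density, _ = np.histogram(pos_scores[test_labels == 0], bins=bins, density=True)
pos_density, _ = np.histogram(pos_scores[test_labels == 1], bins=bins, density=True)
print(neg_density.round(2))
print(pos_density.round(2))
```

With matplotlib the equivalent would be passing pos_scores[test_labels == 0] and pos_scores[test_labels == 1] to plt.hist with density=True, which also avoids the color error because each call then gets a single dataset.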
hi, i have a scatterplot in plotly and i want to display labels for the bubbles. but there's a lot of them and its not really neat. is there a good way to selectively display the labels?
I found the best threshold using predict_proba @past meteor
Best Threshold=0.230000, F-Score=0.457
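For reference, a minimal sketch of how such a threshold search can work: try each cut-off, compute the F-score, and keep the best. The scores, labels, and the 0.01 step are made up and are not the data behind the numbers above.

```python
def f1_at(scores, labels, threshold):
    """F1 score when everything with score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up positive-class probabilities and true labels.
scores = [0.05, 0.10, 0.20, 0.25, 0.40, 0.70]
labels = [0, 0, 1, 0, 1, 1]

thresholds = [i / 100 for i in range(1, 100)]
best = max(thresholds, key=lambda t: f1_at(scores, labels, t))
print(f"Best Threshold={best:.2f}, F-Score={f1_at(scores, labels, best):.3f}")
```

F1 weighs precision and recall equally, so if false negatives are costlier than false positives a different criterion may suit better.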
😄
Selecting the right one is more a business concern than a data science one. Do you find false positives or false negatives worse? 🙂
so this is for false positive
should I also perform for false negative?
i did it for positive outcome
😄
It's fine like this tbh
There's a million other plots like this you can make: FPR-FRR (false positive vs false rejection rate), DET (detection error), ROC, PR curve, ...