#data-science-and-ml
should i send my file ?? or error pic is enough ??
no bother - just keep learning as you like 🙂
Okey then
if your question doesn't get addressed here, you can open a help channel. See #❓|how-to-get-help
just make sure your grades in maths are tip-top - I usually don't compare with grades but many colleges do see math grades so keeping them at a good level helps a lot
you can't post files here
why
We disabled it because we don't want people to have to download files. We do have a pastebin
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
You can if you'd like. People are more likely to help with text.
Hi, can someone help me? Suppose I have a CNN model (binary classification) and I give it an image that's completely irrelevant. Is there any way to restrict it?
does your architecture currently do anything to account for the possibility that a given instance might not belong to any of your classes?
Can I somehow save a Keras LSTM model so I don't have to train it every time I use it
into a file
Keras models support saving them on disk. Does any of this make sense to you? https://keras.io/api/models/model_saving_apis/#save-method
yeah
ty
Is there a way to take a mean value of several prediction attempts. I feel like theres a "proper" way to do it
Are you trying to evaluate the performance of your model?
yea
and this is for image classification? you might look into precision and recall
model.evaluate
i think you got me wrong
for example im making 100 predictions with the same input
i want to get mean
so your model, once trained, predicts the same input differently?
yea
that doesn't sound right to me
its lstm
just, is there a proper way to get mean of it
for example i can get predictions looking like that
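One way to average repeated predictions, as a minimal sketch: `predict` here is a stand-in for whatever returns one prediction vector per call (for a Keras model it might wrap `model.predict`), and `noisy_predict` is a made-up stochastic function that exists only so the example runs. Returning the spread next to the mean shows how much the runs actually disagree.

```python
import random
import statistics


def mean_prediction(predict, x, n_runs=100):
    """Call predict(x) n_runs times; return per-output means and std devs."""
    runs = [predict(x) for _ in range(n_runs)]
    columns = list(zip(*runs))  # one tuple per output position
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return means, stdevs


# Hypothetical stand-in for a model whose output varies between calls.
rng = random.Random(0)


def noisy_predict(x):
    return [2.0 * v + rng.gauss(0.0, 0.1) for v in x]


means, stdevs = mean_prediction(noisy_predict, [1.0, 2.0, 3.0])
```

If the spread is large, that is worth investigating before averaging it away, since a trained model usually returns the same output for the same input.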
I'm currently running R studio, but I've installed reticulate. I have a project to complete, and my first step is to define which of the variables are Dependent and Independent.
Anyone?
Can anyone help me?
R studio? Is this an R question or a Python question?
I'm running reticulate (Python interface in R)
Both basically
The easiest way to define dependent and independent
With R or Python
that looks very wrong - have you double-checked your model?
check your model, input_data, predictions, increase RNN units, check masking, etc.
then make your model deeper
wdym
inception - we need to go deeper
more layers?
yea
gonna try it
please do, and don't overfit
I'm running reticulate (Python interface in R)
Both basically
The easiest way to define dependent and independent
With R or Python
So I have a bunch of data in the following layout: Id, [x,y][x,y][x,y], and I want to rearrange the x,y pairs so that the one with the largest x value is on the left, then the next largest, then the smallest. I am stuck on how to do this. (I already have the data in an SQL database, with each x and y in its own column)
You mean for each row, sort the 3 pairs in descending order of x?
that was probably a mistake
i ran ran it 5 times with same input and its completely same in each answer
almost same i think
still, how can i add more data to lstm
Can anyone help me please?
i have several datasets but again, uniting them is wrong i think
hmm, are you having trouble doing this smartly using vectorized operations (I have no idea about databases, so can't really help with that), or at all?
Hey @tidal bronze!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
if you mean in SQL, then probz ask in #databases
Or write a user-defined function that slices it? Or simply reorder the columns so you see the ones you want in the first 10?
the SQL solution would be different from the numpy one
https://paste.pythondiscord.com/ratajovuga.http
let's say we have this dataset and I want to do kmeans clustering, what would be a better input:
- feed the algo averages for petal_width
- feed the algo a list of observed values for each row, so [0.2, 0.3, 0.2, 0.4] for example
What would intuitively make more sense?
Sounds like you’re just passing simple problems to neural nets XD
Can anyone help me?
I could do it outside of sql and write it to a new database if that is easier.
I'm currently running R studio, but I've installed reticulate. I have a project to complete, and my first step is to define which of the variables are Dependent and Independent.
So, it is both an R question and Python.
Please help
if it's a numpy array (which I assume would be 3D), that would be a trivial problem
example:
>>> a
array([[[1, 9],
        [2, 3],
        [0, 4]],

       [[2, 4],
        [1, 2],
        [3, 4]]])
>>> np.sort(a, axis=1)
array([[[0, 3],
        [1, 4],
        [2, 9]],

       [[1, 2],
        [2, 4],
        [3, 4]]])
that's the concept
Am I invisible?
less trivial if it's a pandas dataframe (you'd need to sort only some columns row-wise)
reversing the sort is trivial (look into np.flip)
nope, we see you
I'm currently running R studio, but I've installed reticulate. I have a project to complete, and my first step is to define which of the variables are Dependent and Independent.
So, it is both an R question and Python.
Can you help me with this?
Your question is just so vague that it sounds more like a first week intro to stats thing and completely irrelevant to R or Python.
Ok
Thanks. Looks like I need to load the data into a numpy array.
how is the data normally stored?
It is not sorted. The coordinates are from a text file that was created by a cad export.
Your example does not appear to keep the x,y pairs together, which is an issue for my use case
hm actually
just use vanilla Python sorted
I actually got it to work with Array.sort.
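For reference, a small sketch of the "vanilla Python sorted" route that keeps each x,y pair together. The `rows` dict and the `sort_pairs_by_x` helper are made up for illustration; the real data would come from the database or the CAD export.

```python
def sort_pairs_by_x(pairs, descending=True):
    """Sort (x, y) pairs by x only, so each pair stays intact."""
    return sorted(pairs, key=lambda pair: pair[0], reverse=descending)


# Hypothetical rows keyed by Id, each holding three (x, y) pairs.
rows = {
    1: [(1, 9), (2, 3), (0, 4)],
    2: [(2, 4), (1, 2), (3, 4)],
}

sorted_rows = {row_id: sort_pairs_by_x(pairs) for row_id, pairs in rows.items()}
# sorted_rows[1] is [(2, 3), (1, 9), (0, 4)]
```

Unlike `np.sort(a, axis=1)`, which sorts each column independently, the key function only decides the order and never separates an x from its y.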
hello, I want to get better at data science. are there any good resources (videos, PDFs, etc.)? you can reach me
check what's pinned
ouch
if a network can map foo-bar, then I don't see how it is simple
but yes, that advice is consistent based on my research so its probably correct
but still, I don't get it 🤷
I don't know your data or your problem statement, but it sounds more like you're driving a Ferrari down an alley, or cutting a string with a chef knife
@grave frost Neural networks solve complex non-linear problems thanks to their ensemble-on-steroids nature, so if there isn't much data, or the problem itself can be solved with linear or logistic regression, there isn't much point in using a neural network
In fact in this case, linear models are preferable
How can I improve lstm model
hmmm....but linear models get about 60% accuracy (the highest ever on that was 68%)
I tried with autokeras and that got me about 70~ish - but I can't use it since I don't know the code for the custom layers they use
so if autokeras can do that, I guess an NN can get SOTA 🤷
What other ML algorithms have you tried?
And how many data points do you have?
Anything less than 5000 is almost not worth it
But if that's the bench mark accuracy, then I guess that's it
But I'd be wary of trying NNs before even trying other ML models
SVM, NB, random forest, xgboost classification etc.
I tried autokeras to see whether an NN works and that damn thing did manage to cross SOTA.
Hi everyone. I would like to ask your opinion about courses on coursera. Is it worth it or it's better to learn ML by reading books? Plus I'd like to know your opinion about their certificates as well.
Im trying to generate images with an opencv cascade sheet i made a while back, I cant find any info online on this, how would i do it?
@oak tusk hi, i would restate your problem as this:
- i have a dataset of stars, and there is a function/curve/time-series associated with each star
- i want to be able to input the attributes of any star and obtain a hypothetical time series for that star, so that i can interpolate between stars and watch the shape of the time series change
is that accurate?
I think yes
they are evolution tracks for different masses and I'd like to interpolate between to get tracks for any mass
I heard about something called a VARIMA model but I'm not sure how to use it
So idk the best way to do this
Wow no one is here lol
time series analysis has a gazillion models. ARMA, ARIMA, SARIMA; most of them are based on moving averages with some adjustments
How can I do VARIMA in python?
Is there any tutorial
also how will it help me to get tracks for any mass
How should I solve this problem?
I'm just not very sure at all atm
Can anyone help?
sorry man I havent done time-series in ages and I cant catch-up right now 😦
don't even know if I should do time-series here
though, i did find arima here : https://medium.com/analytics-vidhya/arima-model-from-scratch-in-python-489e961603ce
So what should I do
Commenting about handled incidents just prolongs them--please return to the channel topic
right
is there any way i can get the articles from medium about data science without having a subscription?
i keep on using my articles up
and i dont get how an mdp works
incognito, VPN, etc lol
does anyone know how to asynchronously load image data in PyTorch while training a model so that my hard drive doesn't bottleneck it
use a generator
is there a package for that or do i need to code it from scratch? I know about pytorch dataloaders but dont know how to make it load in parallel
Dunno about pytorch, but TF has that - so I bet there must be some similar function there too
even then, I don't see how hard writing your own can be
Ok I think PyTorch dataloaders class does handle this automatically so i'll try to implement it and if it's to slow i'll just write my own. Thanks for the help though 👍
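For the "write your own" route, here is a rough sketch of a generator that reads ahead on a background thread, so loading overlaps with whatever consumes the items. `prefetch` and `load_batches` are made-up names, and `load_batches` only fakes the disk read. In PyTorch itself, `torch.utils.data.DataLoader` takes a `num_workers` argument for parallel loading, so this is mostly useful for seeing the idea.

```python
import queue
import threading


def prefetch(iterable, buffer_size=4):
    """Yield items from iterable while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def worker():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)  # always unblock the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def load_batches(n):
    """Stand-in for reading and decoding image batches from disk."""
    for i in range(n):
        yield [i, i + 1]


batches = list(prefetch(load_batches(5), buffer_size=2))
```

A thread is enough when the time goes into waiting on the disk; CPU-heavy decoding would need processes instead, which is what the worker option in a data loader is for.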
but wait - why would your HDD bottleneck model training?
I have a really old 3200rpm hard drive so I'd imagine it would slow down training a lot though i should probably test it before trying to solve a problem that might not even exist
no it won't
Guys, is the ML course from Stanford University worth it?
The one featured on Coursera, I think
do you know the math required for that course?
Is that math not part of it?
I don’t have probs learning math though that’s why I’m going into it
math is part of it
how do i change the marker size of sm.graphics.plot_regress_exog
I’m trying to say that they expect you to know that math going into the course
my dots are way too thick
Has anyone ever implemented the gumbel softmax trick thing (using tf/keras) ? Any suggestions on how to approach it?
if it is implemented in a research paper, the author sometimes provides a GitHub repo for it. you can check at https://www.paperswithcode.com
yeah, good point
oh i made a video on that once
let's try to keep things on topic in this channel; memes aren't appropriate here
👩💻🐍 Does anyone else find #Matplotlib's API hard to remember? I have friends who just export their #Pandas data and plot with #Excel. I spend a lot of time googling #Matplotlib help. How do you make your #Python #Data #Plots?
what's with the hashtags BTW?
Well, my model can overfit with a small piece of sample data. but with 80M+ parameters, it shows no sign of overfitting at all - which is pretty confusing. does anyone have any idea as to what the problem might be?
where are all the intellectuals when you need 'em
what model are you using?
tried all - CNN, LSTM, Transformer
you stacked them?
no, single
like which model did you use for your data that has 80+mil params
LSTM + fuckton of dense layers
has anyone worked with pretrained tf hub models? Im trying to work with my camera and get 224x224 images into their pretrained imagenet, but its not accepting the format. How do I convert from the numpy array to tf input?
Tensorflow tensors pretty much are numpy arrays with some extra bells and whistles, casting them to tf.tensor should be all that's needed.
Though I'd expect even that not to be necessary (the model should do it on its own)
ok well then my problem lies in a different area... I'm trying to build a quick classifier without training, and I've seen that this model can already classify the things I need.
hub.KerasLayer("https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4",
trainable=False)
from the tutorial I know this is the way to load the model into a keras layer, but after this do I have to put it into a sequential model to make it usable?
You mean 80m samples? what in the f?
Also, the model at some point (way lower than 80m) must have converged, that's why no overfit. The loss function isnt being reduced anymore with any training data
I'm trying to train an RNN using a module called textgenrnn, but while I was previously able to get understandable results, after trying to improve them, I've just made them pretty much non-existent. I was making loads of changes at once since it takes a while to train so I couldn't really just make one change and then see if it produced better results or what it did, so I have no idea what caused it. Any ideas? The 'biggest' change I think I made was switching to word_level, but I don't see how that could be causing this- surely it would actually be the opposite effect?
Code:```py
from os import environ,listdir
environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
for filename in listdir("inputMessages"):
    if filename[:-4]+".hdf5" not in listdir(".") and filename != "all.txt":
        print(filename[:-4])
        from textgenrnn import textgenrnn
        textgen = textgenrnn(config_path="config.json", name=filename[:-4])
        textgen.train_from_file('inputMessages/'+filename, num_epochs=5)
        del textgen
        del textgenrnn
```
Sample text (inputMessages/wuulfy.txt, contains 26k lines):
```
i like that colour on the left
ew anime
i like the logo's
this isnt even close to done cameron
or even red screen if that's a things
yoooooooo
except when i started i was set 3 for maths
now that i have instagram
you cant break anything
i aint judging tho
```
Sample results (from wuulfy.txt, temperature 1, other temperatures produce literally nothing):
```
a.
'1.
```
config.json:
```json
{"rnn_layers": 2, "rnn_size": 128, "rnn_bidirectional": false, "max_length": 32, "max_words": 16384, "dim_embeddings": 100, "word_level": true, "single_text": false, "name": "textgenrnn"}
```
👩💻🐍 Does anyone else find Matplotlib's API hard to remember? I have friends who just export their Pandas data and plot with Excel. I spend a lot of time googling Matplotlib help. How do you make your Python Data Plots?
seaborn or pandas plot
matplotlib is af
suppose let say model classifies dogs vs cats (binary) even if i upload a image of an ant i know it wld predict either one of them, since its binary but is there any easy hack that i can do to prevent this
i know we can do binary classification two times, or go for multi classification, im asking if there's any easy hack doing it
also im doing transfer learning using vgg16 if that helps
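One cheap hack, sketched with heavy caveats: treat low-confidence outputs as "unknown". The function name, labels, and threshold are made up, and `p_dog` is assumed to be the single sigmoid output of the binary model. Networks can be very confident on unrelated images, so this catches only some of them; an explicit "other" class is usually more reliable.

```python
def classify_with_reject(p_dog, threshold=0.9):
    """Map a sigmoid output in [0, 1] to 'dog', 'cat', or 'unknown'."""
    confidence = max(p_dog, 1.0 - p_dog)  # confidence in the winning class
    if confidence < threshold:
        return "unknown"
    return "dog" if p_dog >= 0.5 else "cat"


labels = [classify_with_reject(p) for p in (0.97, 0.02, 0.60)]
# labels is ['dog', 'cat', 'unknown']
```

The threshold would need tuning on held-out images, ideally including some that belong to neither class.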
I can't guess unless I know what model you're using and how it's designed.
python is better than R when it comes to machine learning, correct?
"better" is a matter of what you're trying to do, but Python has a larger ecosystem. I'm in the process of applying to data scientist/data engineer positions, and almost every listing requires Python proficiency. I've seen maybe two that imply that R will be the primary language.
has anyone ever used
detector = hub.load
if you're trying to learn about machine learning, I would focus on general programming skills, and the theory behind machine learning
if you have general programming skills and you understand what you're trying to do, you'll be able to figure out the data science libraries when you need them.
ok, im currently taking a python course on codecademy, then i plan on reading a book on machine learning, and finally i plan on doing the codecademy courses on machine learning before starting any projects.
are you taking the python 2 or 3 course?
i also know very basic java and javascript
3
I have no idea what their python 3 course is like, but my one experience with codecademy like 7 years ago was pretty bad
I have no idea if the way they present the information now is the same
the machine learning courses seem pretty great
I did the code academy python course in high school and it made me frustrated
ive taken the python 3 course, it was pretty informative and they got me through the basics but majority of your expertise will come from working on projects
ive been taking the python course, its nice
I've made an Alexa program and it has nothing to do with AI
Amazon abstracts away all the AI related considerations.
This code academy stuff looks different than what I did back then
I’m not even sure the course still exists lmao
Is it free?
oh I don’t pay to learn how to code 🥴
Just based on the course titles for the beginner friendly courses, I would try to learn enough general Python knowledge to start with the intermediate content.
That’s my policy
this is what a normal lesson looks like now... not sure if this is any different
ok
@fickle surge are you aware of our resources page?
yes
Did you like automate the boring stuff?
huh?
automate the boring stuff w python
which of us are you asking?
Xeos
I don't recall them saying they had read that
read what?
right, but knowing that we have a resources page doesn't mean that they've read every book that's on it
understood
also i think i still might take the beginner courses just because they are interesting
should i look at the ones about learning r or just avoid them?
I wouldn't think of it as avoiding R, but rather focusing on a more cohesive set of learning goals
(which probably won't include R)
I wouldn't
ok
and i think the beginner lessons are mainly python oriented and kind of a way of dipping your toes in the water of machine learning
R was created specifically for scientific computing, whereas Python was intended to be general-purpose and obtained a comprehensive data science stack after the fact.
ok
but Python, even when used for scientific computing, benefits from having a large ecosystem that isn't specifically part of the data science stack.
yeah
What does ecosystem refer to?
python is great
Ecosystem of libraries?
the libraries, the community, the resources, everything
Oh
i think theyre referring to everything you can do with it
even Python Discord is part of "the ecosystem"
not just machine learning
yeah
oh I just only see stelercus referring to the ecosystem w python so I didn’t know what it meant
oh
Now I do tho
I'm not the only one who says it 🤷♂️
alright well ill finish working on this course and then ima read my book on machine learning. after that ill get started on the beginner lessons
mainly going to read about the concept of machine learning in the book rather than how to use it bc ill learn that in the course
cya!
comparing the ecosystem of python vs the ecosystem of R is analogous to the market share of windows vs linux; not even comparable
at this point R is "legacy" for DS. and tbh, if you know python, learning R should be trivial
ok
and the best thing about learning python over R is, even if you later don't care about DS / ML / AI, etc
you can always program in python
true
I have pet project of a text-based game. I'm trying to build it using no dependencies and no libraries
yes but i like the R libraries for statistics more
i mostly use python for most things tho

this is the best advantage tbh
rip R
🕯️
Hi guys, I have an issue that would need some help
so currently I have 2 csv files, 1 is the prices, the other is the amount of money invested
what I want to achieve is for it to be on a rolling basis
so like based on the money invested, I want to be able to get the number of stocks
and it keeps adding up
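A rough sketch of one reading of this: on each date, shares bought = money invested / price, and the holding is the running total. The column names and the inline CSV text are invented for the example; real files would be opened with `open(...)` instead of `io.StringIO`.

```python
import csv
import io
from itertools import accumulate

# Hypothetical file contents; the column names are assumptions.
prices_csv = "date,price\n2021-01-04,10.0\n2021-01-05,12.5\n2021-01-06,8.0\n"
invested_csv = "date,invested\n2021-01-04,100.0\n2021-01-05,50.0\n2021-01-06,40.0\n"


def read_column(text, column):
    """Return {date: value} for one numeric column of a CSV."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["date"]: float(row[column]) for row in reader}


prices = read_column(prices_csv, "price")
invested = read_column(invested_csv, "invested")

dates = sorted(prices.keys() & invested.keys())  # dates present in both files
bought = [invested[d] / prices[d] for d in dates]  # shares bought per date
holdings = dict(zip(dates, accumulate(bought)))  # running total of shares
```

With pandas the same idea is a division of the two aligned columns followed by `cumsum()`.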
@misty flint any idea sir?
In aitextgen when using line_by_line, the <|endoftext|> seems to be ignored, meaning I have to split at it manually and the min_length is inaccurate (I'm trying to force it to be a certain length, but it just shoves in an <|endoftext|> and then gets a bit spammy, instead of properly reaching that length). Any ideas of how to solve this? I see a similar looking issue https://github.com/minimaxir/aitextgen/issues/88 but apparently this was patched early last month. I don't have the <|startoftext|> though
Hi everyone, my model can overfit with a small piece of sample data. but with 80M+ parameters, it shows no sign of overfitting at all - which is pretty confusing. does anyone have any idea as to what the problem might be?
even if the data might be too "simple" I think overfitting should still be possible, no?
Hi, I am looking for how to implement the minimum pooling 2d layer such as
`class MinPooling2D(MaxPooling2D):
    def __init__(self, pool_size=(2, 2), strides=None, padding='valid', data_format=None, **kwargs):
        super(MaxPooling2D, self).__init__(pool_size=pool_size, strides=strides, padding=padding,
                                           data_format=data_format, **kwargs)

    def pooling_function(self, inputs, pool_size, strides, padding, data_format):
        return -K.pool2d(x=-inputs, pool_size=pool_size, strides=strides, padding=padding, data_format=data_format,
                         pool_mode='max')`
I got error from __init__() missing 1 required positional argument: 'pool_function'
Could anyone help this problem?
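A likely cause, shown with toy classes: `super(MaxPooling2D, self)` starts the method lookup after `MaxPooling2D`, so the base class `__init__` runs without the `pool_function` that `MaxPooling2D` would normally pass in. The classes below are plain-Python stand-ins modelled on what the error message suggests, not the real Keras classes. In an actual Keras layer, one option might be to wrap a `MaxPooling2D` and negate both its input and its output.

```python
class Pooling2D:
    """Stand-in for the library base class: pool_function is required."""

    def __init__(self, pool_function, pool_size=(2, 2)):
        self.pool_function = pool_function
        self.pool_size = pool_size


class MaxPooling2D(Pooling2D):
    """Supplies pool_function on the caller's behalf."""

    def __init__(self, pool_size=(2, 2)):
        super().__init__(max, pool_size=pool_size)


class BrokenMinPooling2D(MaxPooling2D):
    def __init__(self, pool_size=(2, 2)):
        # Lookup starts after MaxPooling2D, so Pooling2D.__init__ is called
        # directly and never receives pool_function.
        super(MaxPooling2D, self).__init__(pool_size=pool_size)


class MinPooling2D(MaxPooling2D):
    def __init__(self, pool_size=(2, 2)):
        super().__init__(pool_size=pool_size)  # runs MaxPooling2D.__init__
        max_fn = self.pool_function
        # min(window) == -max(-v for v in window)
        self.pool_function = lambda window: -max_fn(-v for v in window)


try:
    BrokenMinPooling2D()
    error = None
except TypeError as exc:
    error = str(exc)

smallest = MinPooling2D().pool_function([3, 1, 2])
```

Here `error` mentions the missing `pool_function` argument, while the second class builds fine and returns the minimum of the window.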
I think that is because doing from keras.models import Sequential tries to treat models as a sub package, not a file, so you can do from keras import models then use models.Sequential for example
I'm not sure how that would fix it. If models is a submodule then the first part of the first import statement should work.
I can try to replicate it in a moment.
thanks guys<3
solution:
ImportError: Keras requires TensorFlow 2.2 or higher. Install TensorFlow via pip install tensorflow
Keras is actually just a wrapper around TensorFlow. I suppose you hadn't installed it yet?
yeah, installed it just now
upd:
tensorflow have two different backend versions: cpu(pip install tensorflow) and gpu(pip install tensorflow-gpu).
nice 💥
what I am saying is that I don't think models is a sub module, isn't models a file, then Sequential a class?
what distinction are you drawing between a module and a file?
well originally I used package and meant a folder that contains files but my terminology may be wrong there
Generally speaking, a module is a representation of a .py file in your file system.
what I was getting at was the import error:
ImportError: cannot import name 'SGD' from 'tensorflow.keras.models' (C:\Users\jamie\AppData\Local\Programs\Python\Python38\lib\site-packages\tensorflow\python\keras\api\_v2\keras\models\__init__.py)
from
from tensorflow.keras.models import SGD
it's looking for an __init__.py when you try to import SGD from keras.models, so doesn't it expect it to be a folder, or am I getting confused?
if a folder in your file system named thing has a file named __init__.py, then given from thing import x, x can refer to either a file in that folder named x.py or some object defined in the global scope of __init__.py
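That rule can be checked with a throwaway package. This sketch builds a folder named `thing` in a temporary directory (the file contents are made up), then imports a name defined in `__init__.py`, a submodule file, and a name that exists as neither.

```python
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "thing"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("x = 'object defined in __init__.py'\n")
    (pkg / "y.py").write_text("value = 'module thing/y.py'\n")

    sys.path.insert(0, tmp)  # make the temporary package importable
    try:
        from thing import x  # a name from thing/__init__.py
        from thing import y  # the submodule thing/y.py
        try:
            from thing import z  # neither a name nor a submodule
        except ImportError as exc:
            message = str(exc)
    finally:
        sys.path.remove(tmp)
```

The failed import reports the path of `__init__.py` because that is the file Python treats as the package itself, which matches the shape of the error quoted above.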
yeah, playing around with file structure that seems to work here, i just don't get why it doesn't work with the keras example in this case
was trying to import SGD from models 😢
ok now that error makes sense, that was just me being stupid, pycharm still doesn't like that way of importing it though so i'm not sure why it comes up underlined in red
Hi folks! I could use some help with a pandas query in #help-cookie , and I figure that if anyone could figure it out, it would be the data science gurus 🙂
Ive had an awesome model trained yesterday, but today im training the completely same code, nothing changed and model is far worse. How can I fix it?
Lstm keras.
you have to ensure the reproducibility of a model - you can set a 'seed' to accomplish that
hmm now looks working i think
how do i set a seed
for keras
im new to keras
google it
Is anyone familiar with an NER algorithm that can learn a single class when instances of that class (from a human perspective) is really two or more completely unrelated classes?
I don't think it would be possible to do it with a statistical approach because the differences between the underlying classes would throw everything off.
yeah this would be the place for that
list.pop()?
numpy array? I think np has np.nan, which you can use for masking and deleting the NaNs
so I have two arrays that look like this: array1: [[7810.0 5710.0], [7910.0 5710.0]] array2: [[7860.0 5710.0]] and I want to check if array two falls in between elements one and two of array one.
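A minimal sketch, assuming "in between" means every coordinate of the point lies between the matching coordinates of the two endpoints (inclusive). `is_between` is a made-up helper; if the points must also lie on the same line, an extra check would be needed.

```python
def is_between(point, first, second):
    """True if each coordinate of point lies between first and second."""
    return all(
        min(a, b) <= p <= max(a, b)
        for p, a, b in zip(point, first, second)
    )


array1 = [[7810.0, 5710.0], [7910.0, 5710.0]]
array2 = [[7860.0, 5710.0]]

result = is_between(array2[0], array1[0], array1[1])
outside = is_between([8000.0, 5710.0], array1[0], array1[1])
```

Using `min` and `max` per coordinate means the two endpoints can be given in either order.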
I hope that this is the right channel,
so this happens when I convert a json file to csv
how could I possibly fix this?
also, https://data.gov.lv/dati/dataset/cb8d1838-4edc-465c-865c-0f6574f74029/resource/cf97a41b-c45a-4089-b269-f86aad8a7da7/download/activities.json, that is the url, if that helps
what do you want to happen
It looks like each cell is a json file.
How does random seed work? For example if I pick 0.9, is the trained model gonna perform close to 1, or will it be a completely different performance
I would recommend you research it rather than asking everything here - it would be much faster that way
depends on the dataset
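On what a seed actually is: an arbitrary starting value (normally an integer) for the pseudo-random number generator, not a target score. The toy below uses a made-up `fake_training_run` that just draws "initial weights" to show that the same seed reproduces the same draws and a different seed gives different ones. Recent TensorFlow versions also have `tf.keras.utils.set_random_seed` for seeding Python, NumPy, and TensorFlow together.

```python
import random


def fake_training_run(seed):
    """Stand-in for training: draw 'initial weights' from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(3)]


run_a = fake_training_run(42)
run_b = fake_training_run(42)  # same seed, same draws
run_c = fake_training_run(7)   # different seed, different draws
```

Whether one seed happens to give a better model than another is down to luck and the dataset, so the value itself carries no meaning.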
try using tensorflow.keras.(whatever you need). I might be wrong about this, but I believe it's much more stable than regular keras
and thats what stackoverflow recommended me as well haha
also try with google colab. I found myself so much easier coding in there. (I dont need to install any libraries haha).
Keras automatically uses Tensorflow backend
even if you run import keras it would show "Using TensorFLow backend"
so there is no difference between stability
ok, thnks
Training the agent with 400 iterations in OpenAI Gym library LunarLander-v2 environment
Code: https://github.com/DevHunterYZ/Deep-Q-learning/blob/master/LunarLander.py
I'm still having the above issue, but I have another question. currently, I'm only training on a VPS (Hetzner CX11) and I want everything to be centralised there, but I was wondering if I could have my PC provide resources for modelling like a blockchain would or the likes, to speed up modeling. Would there be a way to have a script on the VPS and a script on my PC communicate and train separate datasets together (maybe my PC could do some of the larger ones while the VPS sticks to the smaller ones, but the files remaining on the VPS)? Or would it just be best to download the large models, whitelist the smaller ones for the VPS, and train on the larger ones and upload manually? Or maybe even collaborate on the same model at the same time? Preferably, I could also out-source training to some of the people involved in this project or even publicly, but I worry this could pose some security risks. The problem is, I'm using the models live as I train them so I would prefer to be able to also upload live. Any ideas?
Found out what I'm looking for is distributed training. Any ideas how I would accomplish this with aitextgen? I know aitextgen uses PyTorch, so I assume if distributed training is possible in PyTorch then I can use it with aitextgen
just read some docs on distributed data parallel
distributed training isn't difficult to do, unless you want something more fault tolerant
which is... painful
I already have (and still am) but I don't think it's going to help since there's a reason I'm using aitextgen- it's a lot more simple than the packages it covers. I'm usually good at learning new packages such as discord.py and I'd consider myself pretty sufficient in Python but I've never been able to get my head around things like PyTorch, Tensorflow, Keras, transformers...
And I especially don't want to spend ages doing this to find out using something like SFTP would be just as effective and easier 😅
then yeah, you might be a bit out of your depth with distributed training
again, the point of distributed training is to speed up training times over clusters of resources.. most of the ways people do it are pretty low level
So, what would be the best option that doesn't involve the likes of Keras, Tensorflow, PyTorch etc?
for what? training a neural net over multiple nodes?
Specifically for distributing training in aitextgen using mine (or multiple) computers connected to an external server (VPS)
i mean, without knowing the internals of say, pytorch or lightning? probably isn't
And aitextgen uses PyTorch and Transformers, so pretty much something that would be efficient with those models
luckily it seems it's built on lightning, which is more high level
but you still need to know how to configure your gpus/multinode cpus with the right comm protocols
so nccl for nvidia gpus and mpi for cpus
Surely all the multinode stuff is only for if I'm doing distributed data parallel?
you do know what distributed training is, right
you're using multiple resources, like multi gpu or multinode cpus
if you are just training on one gpu or your local cpu node, then you don't need any of this
Yes, training a model over multiple nodes at the same time. But since that requires knowledge of things like PyTorch, I'd instead be training separate models per computer, which surely would mean I wouldn't need to use the multinode stuff, I just connect with something like SFTP?
okay-- but then its not distributed training
I'm training on a CPU on the VPS, and would prefer to be able to use the GPU I have on my PC but isn't necessarily needed, they could both use CPU if this would be more efficient
now you're just training... n independent models
I know, I asked for an alternative to distributed training
I'm already training multiple models- currently, my VPS trains a bit on one model, then starts training another, and cycles through them all training them a bit more
i mean, if you just want n independent models that never talk to each other, then its not like distributed training at all
then at which point you can forgo all this
Yes, like I said, I'm asking for an alternative to it, not for how to do it without PyTorch since you said yourself it's likely not possible
I understand they're completely different, but in my case they accomplish the same thing. Surely if they're accomplishing the same thing, just in a different way, it would be classed as an alternative? 🤨
I'm just looking for any way to make use of resources over multiple computers to train a set of PyTorch models made in aitextgen. Distributed training was one of the options, but you said that (as I kinda expected) it would require quite a bit of knowledge in PyTorch or the likes to accomplish. So I'm asking, out of the other methods I could use to accomplish what I'm looking for, which would be easiest or most efficient for my use-case. In other words, the best alternative to distributed training for me
Sorry for the confusion 😅
okay, then this isn't distributed training
```py
from os import environ,listdir
environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from aitextgen.TokenDataset import TokenDataset
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen
config = GPT2ConfigCPU()
models = []
steps = 512
for model in listdir("GPT2-models"):
    print("\n\n\n"+model)
    loc = "GPT2-models/"+model+"/"
    vocab,merges=loc+model+"-vocab.json",loc+model+"-merges.txt"
    models.append([loc,TokenDataset("messages/"+model+".txt", vocab, merges, block_size=64, line_by_line=True, progress_bar_refresh_rate=32, bos_token="", eos_token="", unk_token="", pad_token=""),aitextgen(model=loc+"pytorch_model.bin", vocab_file=vocab, merges_file=merges, config=config, cache_dir=loc, bos_token="", eos_token="", unk_token="")])
for model in models:
    model[2].train(model[1], model[0], batch_size=32, num_steps=steps, progress_bar_refresh_rate=32, save_every=steps, generate_every=steps)
```
That's my current code, in case it helps. I want to allow other computers to assist in the training process, but preferably while avoiding coding in things like PyTorch
which is why i asked if you know the difference
Yes, I've already explained multiple times now I fully understand this 😭
aha. thanks for the info!
Yes. Distributed training works on the same model at once over nodes connected in a special kind of network I forgot the name of. I can't exactly say the difference between that and the other option since, well, there is no other option yet, I'm asking you for the other option. But I'm assuming the other option would involve each 'node' (computer connected to the server, including the server itself) working on its own set of models (where if one node is disconnected, the models it was working on would then be allocated to one of the other nodes)
Also, just in case, here's the code used to make the original models in the first place, before being further developed
```py
for filename in listdir("messages"):
    if filename != "all.txt":
        if filename[:-4] in listdir("GPT2-models"):
            rmtree("GPT2-models/"+filename[:-4])
        print("\n\n\n"+filename[:-4])
        loc = "GPT2-models/"+filename[:-4]+"/"
        vocab,merges=loc+filename[:-4]+"-vocab.json",loc+filename[:-4]+"-merges.txt"
        mkdir(loc)
        train_tokenizer("messages/"+filename, prefix=filename[:-4], save_path=loc, serialize=False, min_frequency=5, bos_token="", eos_token="", unk_token="")
        ai = aitextgen(vocab_file=vocab, merges_file=merges, config=config, cache_dir=loc, bos_token="", eos_token="", unk_token="")
        ai.train(TokenDataset("messages/"+filename, vocab, merges, block_size=64, line_by_line=True, progress_bar_refresh_rate=32, bos_token="", eos_token="", unk_token="", pad_token=""), loc, batch_size=32, num_steps=8192, progress_bar_refresh_rate=32, save_every=8192, generate_every=1024)
        with open(loc+"sample.txt",'w+') as txtfile:
            txtfile.write("\n".join([result.split("\n")[0] for result in ai.generate(100, min_length=3, max_length=2000, return_as_list=True)]))
```
(I'm not needing help with the code shown above, it's just there in case you need to know specifics of my model in order to give advice for my original question, but any feedback would still be appreciated)
I have a friend who's new to coding and is wondering if he could add AI to a snake game.
-
Is this even reasonable? He would have another friend and myself helping him, but none of us have any experience with AI
-
Do you have any suggestions of places to start if this doesn't seem too far-fetched?
this is reasonable
but this is reinforcement learning
you might want to start at the basics
which is regression techniques
Okay thanks. I'll do some research
Hi! Sorry for not asking strictly about Machine Learning..
So I found a Dataset on Kaggle that I find really fun and cool to work with, but I noticed a strange License, and I have no idea about Licenses.
Can someone help me out a bit?
data science, AI, and machine learning are all valid topics for this channel.
did you click through to read what the license says?
Yes I did, but it took me to the Legal Notice page of the site of the European Union(or something like that) but it didn't say anything about the license.
I mean, that's alright but I want to train a model using the dataset, then put it on my website so that I can reach it easily without having to run it using a notebook service.
I know that it's basically impossible to tell what kind of dataset I used, but I still want to know! 🙂
....what?
how would putting a model on your website help?
Not my model.. but an application, using Ipywidgets, and Voila you can create interactive applications.
uh-huh, but the instances are gonna be pretty expensive if your model is big
They probably will be, but the model I'm intending to create is meant to handle a basic task, so it won't be that big.
quick question
im using cross_val_score to test my ML model, but I'm getting very unexpected values
I do the following:
```python
model = tree.DecisionTreeRegressor()
clf = Pipeline([
    ('feature_selection', SelectKBest(chi2, k = 10)),
    ('classification', model)
])
clf.fit(minmax_train, labels)
scores = cross_val_score(clf, minmax_train, labels, cv=5)
```
and then scores is this:
```
array([ -0.35824288, -1.72116347, -7.53874271, -1.00218319,-259.19346663])
```
I tried a DecisionTreeClassifier earlier which got more expected scores, but this is a regression problem
any ideas why this might occur?
should i try a different scoring metric besides default?
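For context on the negative values: with no `scoring` argument, `cross_val_score` falls back to the estimator's own `score` method, which for scikit-learn regressors is R². R² has no lower bound, so a fold where the tree predicts worse than simply guessing the mean scores below zero. A minimal pure-Python sketch of that formula (the `r2` helper and the toy numbers are made up for illustration, not taken from the data above):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
good_pred = [1.1, 1.9, 3.2, 3.8]   # close to the targets
mean_pred = [2.5, 2.5, 2.5, 2.5]   # always guess the mean
bad_pred = [4.0, 1.0, 5.0, 0.0]    # worse than guessing the mean

print(r2(y_true, good_pred))  # close to 1
print(r2(y_true, mean_pred))  # exactly 0
print(r2(y_true, bad_pred))   # negative
```

So a different metric is worth trying: passing something like `scoring='neg_mean_absolute_error'` reports errors in the units of the labels, which is often easier to judge. Separately, `chi2` is a feature-selection score meant for classification targets; `f_regression` is the regression counterpart in `sklearn.feature_selection`.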
is there a way to flip or separate the labels on the x axis vertically with seaborn?
cause either way it's unreadable currently
you can change the x variable to the y variable on the plot and vice versa
if you want it more readable you could make the graph higher (but that might be too high to be readable)
angle the labels
or make it interactive so the label only shows up when you hover over the bar on the graph (plotly does this)
I'm pretty sure it has to be an image but still it's unreadable
that's why im wondering if you can flip them vertically
what's really weird is im filtering out openings that have <50 games present but there's still values under 50 here
oh yeah
matplotlib is under the hood of seaborn
you can rotate the xlabels by 90deg to do this
ax.set_xticklabels(ax.get_xticks(), rotation = 90)
where ax is defined with plt earlier in the code
huh that didn't seem to work I had to do for tick in ax.get_xticklabels(): tick.set_rotation(90)
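A likely reason the first attempt misfired: `ax.get_xticks()` returns tick positions (numbers), so passing it to `set_xticklabels` overwrites the category names with those numbers. A small sketch that only changes the rotation and leaves the label text alone, assuming made-up opening names and counts and the off-screen Agg backend (with seaborn, `ax` would be the Axes returned by the plotting call):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data standing in for the chess-opening counts
openings = ["Sicilian Defense", "French Defense", "Queen's Gambit"]
games = [120, 85, 60]

fig, ax = plt.subplots()
ax.bar(openings, games)

# Rotate the existing tick labels without replacing their text
ax.tick_params(axis="x", labelrotation=90)

fig.tight_layout()
fig.canvas.draw()  # force a draw so the label objects are populated

labels = [t.get_text() for t in ax.get_xticklabels()]
rotations = [t.get_rotation() for t in ax.get_xticklabels()]
print(labels, rotations)
```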
sounds good haha
this is, frankly, horrific
sounds like wrong visualisation
it's not even sorted 😢
ya I need to figure out how to filter it properly what i've got obviously isn't working
it's much worse now tho 
unless you wanna call it abstract art, then it's perfect
This is the error i get when i run this code
I'm simply computing the moving averages and graphing them for the user's input
I'm really not sure what I'm doing wrong and really need help
Hey, just a curiosity
How capable are the best AIs at asking themselves questions?
like philosophical questions and such
oh
thats a very good question
i dont know
i don't think we've built something like that
there's space for it tho
i mean, in my mind the only reason it wouldnt be built is because it wouldnt work properly ?
cz it would be funny af
- you can literally study so many things by analyzing the behaviour of the machine in regards to that
its more or less trying to simulate similar scenarios humans have been in
theoretical scenarios
like trying to teach what is ethics to a computer, or making it figure out by itself

if i had data science nightmares, this would be it

what do you mean by chess engines
im using transfer learning with vgg16
i'm not sure how to use elastic net
What would you want the outcome to be in that case? For it to predict ant or to predict not dog or cat?
How should I choose features when they all have very heavy correlation with other ones?
I have almost 5000 features to choose from after generating quadratic features from a preexisting set of features
is there a way to agglomerate features by clustering them based on which are heavily correlated with each other?
depends, I have a very surface level understanding of it, but some engines are calculators and some engines are neural nets
Alphazero is a neural net iirc
the very first chess engines like m20 were of course calculators
alphazero is on a quantum computer tho isnt it
no idea about the specifics, but you're asking about how the engines select moves right
I think stockfish you would consider a calculator
they're currently working on a neural network version of stockfish, stockfish NNUE
wouldn't it be fun to put a quantum computer on every desk on the world
just imagine the processing speeds going up
no i have crypto
please no
then no for you right lmao
kinda like bill gates
put a computer on every desk
i think what i just said is cringe worthy
what does l2 do?
intervals between duckies so far (minutes)
data is [11.0, 15.0, 13.0, 12.0, 12.0, 16.0, 15.0, 11.0, 12.0, 11.0, 16.0, 13.0, 14.0, 16.0, 11.0, 12.0, 16.0, 12.0, 12.0, 12.0, 13.0, 14.0]
Is there a way to mask features in a neural network. I'm giving it a full set of training data, but if the user cannot supply the full list of data itd obviously be good to train it for missing inputs. Just randomly inject zeros for some features in idk 15% of the sample, or will that bias it in an undesirable way?
If I have a feature set that looks like
0 1 2 3 ... 4491 4492 4493 4494
0 0.0 0.075377 1.70 1.707317 ... 0.797852 0.239469 0.033743 0.004755
1 0.0 0.065327 1.70 1.707317 ... 0.797852 0.433821 0.110739 0.028268
2 0.0 1.768844 1.65 1.658537 ... 0.797852 0.180470 0.019164 0.002035
3 0.0 0.438243 1.75 1.756098 ... 0.797852 0.086764 0.004430 0.000226
4 0.0 1.281407 1.65 1.658537 ... 0.797852 0.520586 0.159464 0.048846
.. ... ... ... ... ... ... ... ... ...
973 0.0 0.438243 1.55 1.560976 ... 0.735229 0.845391 0.437238 0.226141
974 0.0 0.438243 1.25 1.219512 ... 0.407545 0.379220 0.111079 0.032537
975 0.0 0.438243 1.55 1.560976 ... 0.681844 0.224723 0.031979 0.004551
976 0.0 0.438243 1.60 1.609756 ... 0.297484 0.285813 0.069200 0.016754
977 0.0 0.582915 1.50 1.463415 ... 0.620324 0.270069 0.048132 0.008578
and I also have clustered these feature columns, with each element in this list representing which cluster the corresponding column is in: [4 7 9 ... 1 1 1], how would I combine these features?
so like all features in cluster 4 are somehow agglomerated together, and so on for all of them
thanks and cheers!
How do I feed several datasets into lstm
Anyone managed to make a clone similar to Deep Nostalgia?
not dog or cat
@stiff barn right now im doing this like ```python
if image_classify.predict(image) == 'Other':
return 'Other Image'
return cat_dog_classifier(image)
ml is just giant if then statements
maybe if it's rule based... (and "AI" but not ML)
more accurately its differentiable if then statements
hello, I'm fairly new to ML, and I can't understand whether decision trees can handle categorical variables or not. Some stackoverflow answers say they can, some say they can't. What's the deal here? Suppose I am predicting the % effectiveness of a drug based on 3 features ['Dosage', 'Age', 'Sex']. Sex is a categorical variable with values M & F. So how will the decision tree algorithm split based on the categorical data?
what...?
if you think about it, a decision tree turns everything into categorical data
for each candidate splitter, it judges its validity based on what is on what side of the split
so if you have a categorical var like sex, if it decides to split on sex=m, then it'll bucket everything into two based on that split and compute the predictive gain function on it
then it tries another split, until it decides by whatever criterion to stop and pick the one with best gain
whether the package you use will support doing things this way is the better question
sklearn doesn't do it, but that's mostly because sklearn blows and you should use a better package
yes I'm using sklearn. so what does sklearn do in this case? And what other ml libraries do you recommend? Thanks for the prompt reply.
sklearn treats everything as numeric vars, so if you have categorical ones, one hot encode them
though this presents its own litany of issues
unfortunate, in python this is the best you've got
i recommend picking up some r
Alright, thankyou very much 🙌
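To make the "bucket everything into two and compute the gain" step concrete, here is a rough pure-Python sketch. It uses variance reduction as the gain (a common choice for regression trees) and four invented patients; the `variance_reduction` helper is hypothetical, not a library function. Note that testing `Sex == "M"` is the same split a tree would find on a one-hot column `Sex_M` at threshold 0.5.

```python
from statistics import pvariance

# Invented toy data: effectiveness (%) for four patients
rows = [
    {"Age": 30, "Sex": "M", "effect": 40.0},
    {"Age": 60, "Sex": "M", "effect": 50.0},
    {"Age": 30, "Sex": "F", "effect": 80.0},
    {"Age": 60, "Sex": "F", "effect": 90.0},
]


def variance_reduction(rows, predicate):
    """Gain of a binary split: parent variance minus weighted child variances."""
    left = [r["effect"] for r in rows if predicate(r)]
    right = [r["effect"] for r in rows if not predicate(r)]
    if not left or not right:
        return 0.0  # a split that leaves one side empty is useless
    n = len(rows)
    parent = pvariance([r["effect"] for r in rows])
    return parent - len(left) / n * pvariance(left) - len(right) / n * pvariance(right)


gain_sex = variance_reduction(rows, lambda r: r["Sex"] == "M")  # categorical split
gain_age = variance_reduction(rows, lambda r: r["Age"] <= 45)   # numeric threshold split
print(gain_sex, gain_age)  # the split on Sex wins for this toy data
```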
```python
df_comp.ftse.plot(figsize = (20,5) ,title = "FTSE Prices")
plt.title("S&P vs FTSE")
plt.show()
```
Hey I'd like to know how this code really works^. Like how can we enter plt.title after plotting?
Shouldn't there be a separate function that takes all parameters together?
I read on SO that it's something related to "layers" in ds programming but didn't really understand that. Could anyone provide a primer?
Hey, folks, I'm looking for help writing a function. I am working with a pandas df that includes clinical trials that are classified as either "low_accruing" (marked with a 1) or "normal_accruing" (marked with a 0). The df is a document-term matrix containing lemmas that show up in the clinical trials and is thousands of columns wide. To identify the most salient terms, I am trying to drop all columns for terms that appear in "low_accruing" trials and "normal_accruing" trials the same number of times. To do this I tried writing a function that would multiply the values in rows with "1" in the "low_accrual" column by -1 and then drop the columns with a sum = 0. Here's what I've written: data3 = data1.apply(lambda row: [i for i in row if data1.loc[row,'low_accrual'] == 0] else i*-1) This returns an "invalid syntax" error. Can you help me see what I'm writing wrong?
[i for i in row if data1.loc[row,'low_accrual'] == 0] else i*-1) <- your else is outside of the list comp
that or you're missing a closing bracket for the list comp
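For reference, a trailing `if` in a comprehension only filters items; to pick between two values, the `x if cond else y` expression has to sit before the `for`. A toy sketch with a made-up row and flag (not the real DataFrame):

```python
row = [3, 0, 2]
low_accrual = 1  # hypothetical flag for this row

# Trailing `if` filters: items failing the test are dropped
filtered = [i for i in row if i > 0]

# Conditional expression chooses a value for every item
negated = [-i if low_accrual == 1 else i for i in row]

print(filtered)
print(negated)
```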
what is that apply intended to do?
@serene scaffold I was trying to apply the function to each row
so for a given row, if the value in the low_accrual column is zero, you want to negate all the other values in that row?
I want to negate the values in a row if the value in the low_accrual column is 1.
Can you provide me with a sample csv string so that I can replicate this DataFrame on my own computer? You can either copy and paste the first few lines of the csv file, if there is one, or run print(data1.iloc[:5].to_csv()).
Hey @split eagle!
It looks like you tried to attach file type(s) that we do not allow (.csv). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
This is a Python statement rather than data
Hey @split eagle!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
You have to copy and paste the text into the chat or use the paste bin
it should look like this
0,1,5,6
1,1,7,3
2,5,9,2
@serene scaffold It is still too large. Give me a sec to trim it down.
A few lines will do.
Hey @split eagle!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
I won't be able to help with this. Hopefully someone else can take a look.
though this may solve your problem.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [1, 2, 5], [0, 5, 6]], columns='a b c'.split())
>>> df
a b c
0 1 2 3
1 1 2 5
2 0 5 6
# Save what the a column is for later
>>> original_a = df['a'].copy()
>>> df[df['a'] == 1] *= -1
>>> df
a b c
0 -1 -2 -3
1 -1 -2 -5
2 0 5 6
# This did what we wanted, except it also negated the a column
>>> df['a'] = original_a
>>> df
a b c
0 1 -2 -3
1 1 -2 -5
2 0 5 6
I suppose you could also do
df[df['a'] == 1] *= -1
df['a'] *= -1
@split eagle see if you can apply that to your particular dataframe
@serene scaffold Thank you for your help. Sorry for the technical difficulties on my end.
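Tying that back to the original goal (dropping terms that show up equally often in both groups), one possible sketch is to flip the sign of only the term columns with `.loc` and then keep the columns whose signed sum is non-zero. The frame and the column names `term_b` and `term_c` are invented here; only `low_accrual` comes from the question.

```python
import pandas as pd

# Invented stand-in for the document-term matrix
data1 = pd.DataFrame(
    {"low_accrual": [1, 1, 0], "term_b": [2, 0, 2], "term_c": [3, 5, 6]}
)

term_cols = data1.columns.difference(["low_accrual"])

# Negate term counts only in the low-accrual rows; the flag column is untouched
signed = data1.copy()
signed.loc[signed["low_accrual"] == 1, term_cols] *= -1

# A signed sum of zero means equal counts in both groups, so drop those terms
sums = signed[term_cols].sum()
keep = sums[sums != 0].index.tolist()
data3 = data1[["low_accrual", *keep]]
print(data3)
```

Working on a copy keeps the original counts intact, so nothing has to be restored afterwards.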
look up lazy or delayed execution, its quite a useful paradigm
hello everyone. I am trying to create leading, coincident, and lagging composite indexes but i need to determine the optimal lag for each variable. should i use autoregressive model to identify optimal lag?
no
you're not creating, say, a lag indicator for a pair of indices? you normally would just use a cross correlation model for determining optimal intervals
i'm not sure what it means to create a lead-lag indicator for an entire index
unless it was relative to another base index
oh im trying to create a composite index of a business cycle which composes of leading, lagging, and coincident indicators
oh, so like a self-lag lead indicator?
but i want to determine optimal lags for each variable
then yeah, you would base it off an autoregressive model
have you used the granger test?
look at that as a way to test for significance of intervals
so most of these variables have already been tested historically
for example, the conference board leading economic index has various indicators within it
im essentially trying to determine optimal lag period for each variable
i think i might be conflating lag-lead indicators with what you're working with
i don't think i have expertise in this in that case, sorry
all good
Hello I'm trying to use scipy's differential_evolution with a constraint of the type Bounds, but I'm getting an error. It works fine when I use a LinearConstraint
example code
import numpy as np
from scipy.optimize import (
rosen,
differential_evolution,
Bounds,
LinearConstraint,
)
bounds = np.array([
[-5, 5],
[-5, 5],
])
bc = Bounds(
*np.array([
[0, 2],
[0, 2],
]).T
)
lc = LinearConstraint(np.eye(2), [0, 0], [2, 2])
result = differential_evolution(
rosen,
bounds=bounds,
constraints=bc,
)
print(result)
the error:
Traceback (most recent call last):
File "diff_evo.py", line 37, in <module>
result = differential_evolution(
File "C:\Users\Snaptraks\anaconda3\lib\site-packages\scipy\optimize\_differentialevolution.py", line 308, in differential_evolution
ret = solver.solve()
File "C:\Users\Snaptraks\anaconda3\lib\site-packages\scipy\optimize\_differentialevolution.py", line 810, in solve
result = minimize(self.func,
File "C:\Users\Snaptraks\anaconda3\lib\site-packages\scipy\optimize\_minimize.py", line 605, in minimize
constraints = standardize_constraints(constraints, x0, meth)
File "C:\Users\Snaptraks\anaconda3\lib\site-packages\scipy\optimize\_minimize.py", line 825, in standardize_constraints
constraints = list(constraints) # ensure it's a mutable sequence
TypeError: 'Bounds' object is not iterable
when I change the constraints=bc to constraints=lc (the linear constraint object) it works fine, which I find weird because the documentation mentions the type Bounds is accepted
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.differential_evolution.html
are you using ver 1.4.0+?
yes I'm on scipy 1.6.2
yeah scipy.optimize is shipped with scipy
Im currently reading up on scikit-learn and running through some different examples, as there are some tasks at work i think could be solved nicely by machine learning.
I was wondering, when running something continually, how would you keep it learning? I have a tf-idf classifier rn, which reads categories from folders and uses the files in those folders as training data. Would I just add to the training data and retrain every time it has to run, or is there a better way to do this?
any chance someone could explain to me how I know if my neural network is over-fitting based on the loss functions graph and the mean absolute error graph?
compare training and validation performance
that’s a good start
I am a complete beginner to LaTeX, is there a good way to represent the concept of mode with a formula?
what do you mean by represent the concept of mode?
okay thanks
I am working on homework, I have to write the formula / equation for mean, median and mode, but I can't find a formula for mode, and I can't think of how to write that in mathematical notation
For mean I did:
so you are doing this on a set of values?
Yes, an ungrouped set of values
I found a formula for grouped data but that doesn't really make sense to use here
The mode is the value that occurs most often in the data, but is there a way to write that in a formula with LaTeX as above? Not looking for a full answer, just a little guidance.
i don't think it can be written as a formula in general, if anything you could write a definition using the
\usepackage{amsthm}
\newtheorem{definition}{Definition}
\begin{definition}
\end{definition}
thank you, I think I will just include the formula for grouped values, since the assignment specifically says to include a formula...
Hello fellows. I was working on an object detection problem and had a question about PCA.
In short, I'm trying to build a model that looks at a picture and says if there's a car in it. One of the things I'm doing involves blowing up the dimensionality of the image, then cutting it back down to a reasonable size with PCA.
My question is: Is this at all reasonable when I'm trying to discern between cars and everything else? If I fit the PCA on cars and everything else, I'd think it would just drown in excess variance. On the other hand, if I fit the PCA on just the cars, wouldn't the PCA transform be terrible at representing the variance of the not cars during testing?
hm
you could do it with the indicator function
and the max function
the mode(s) of a multiset are the elements whose multiplicity equals the maximum multiplicity
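One way to write that idea down, as a sketch: count how often each value occurs with an indicator function, then take the values whose count reaches the maximum. The symbols $x_1,\dots,x_n$ for the data and $m(v)$ for the multiplicity are notation introduced here, not from the assignment, and `\operatorname` assumes the `amsmath` package is loaded.

```latex
\[
  m(v) = \sum_{i=1}^{n} \mathbf{1}[x_i = v],
  \qquad
  \operatorname{mode}(x_1,\dots,x_n)
    = \Bigl\{\, v : m(v) = \max_{u} m(u) \,\Bigr\}
\]
```

The result is a set because ties are possible: a sample like 1, 1, 2, 2 has two modes.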
What does the learning rate in AdaBoost do?
it shrinks the contribution of each successive model (each weak learner's weight gets scaled by the learning rate)
what libraries/apis/datasets/etc. do you guys use to get crypto price information?
Hey I wanted to get into machine learning and I wanted to start with tensorflow but from all the tutorials I have seen it was really confusing (mostly the math) so I was wondering if any has any good recommendations to where I should start
the pinned message has some pretty good resources
kaggle
or google datasets or smth
type in what you want there
you should be able to find it i think
crypto price right?
lemme check
yup
Anyone here good with pandas?
I have a column where I am trying to get a count for every type of item in the column
However, some of the items in the column are like this : "cars", "racercars", "super cars", etc...
I was wondering how I can combine all items that have "cars" in them and get the total count
I'm currently grabbing counts like this pd.DataFrame(df['Type'].value_counts().to_frame())
you can do dataframe.info()
you'll get everything
adding columns is a bit different
like wdym by add
I'm just trying to get a count of all item types
Right but how do I combine items that have the word "car" in it?
This is what i get ```
Type Count
Airplane 23
car 2
race car 2
super car 2
This is what I want
Type Count
Airplane 23
Car 6
i think there's a way
i forgot
OHHHH
wait
nvm
i keep on forgetting
df.type()
no
not that
here's what im getting from my GeeksforGeeks search
```python
# importing pandas module
import pandas as pd

# reading csv file from url
data = pd.read_csv("https://media.geeksforgeeks.org/wp-content/uploads/nba.csv")

# dropping null value columns to avoid errors
data.dropna(inplace = True)

# substring to be searched
sub ='er'

# start var
start = 2

# creating and passing series to new column
data["Indexes"]= data["Name"].str.find(sub, start)

# display
data
```
Hmm that will find all items that have "car" in it
probably
I suppose I could get a count that way and add a new row the original dataframe and then remove all rows that have "car" in it
It's a little tedious to do that but I suppose that's all I got
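A less tedious route might be to relabel first and count second: flag every row whose `Type` contains "car" with `str.contains`, overwrite those with a single label, then call `value_counts` as before. A sketch with an invented frame (three airplanes instead of 23, to keep it short):

```python
import pandas as pd

# Invented stand-in for the real data
df = pd.DataFrame(
    {"Type": ["Airplane"] * 3 + ["car", "car", "race car", "race car", "super car", "super car"]}
)

# True wherever the type mentions "car", ignoring case
is_car = df["Type"].str.contains("car", case=False)

# Replace the flagged entries with one shared label, leave the rest alone
grouped = df["Type"].mask(is_car, "Car")

counts = grouped.value_counts()
print(counts)
```

One caveat: a plain substring match would also catch unrelated words such as "cart" or "scarf", so a stricter pattern may be needed on real data.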
im doing a project on something that hasnt been done before i think, if i get overfitting and somewhat good looking prediction for training data, does that mean thats its possible to predict something on test data?
cause i didnt have anything giving hope for a few days on test data
hey, I wanted to share a python library I've been working on, it boosts the T5 model speed up to 5x & also reduces the model size. https://github.com/Ki6an/fastT5
good one 👍
thank you 🙂
```python
import datetime

import pyttsx3
import speech_recognition as sr

listener = sr.Recognizer()
engine = pyttsx3.init()

def talk(text):
    engine.say(text)
    engine.runAndWait()

def take_command():
    try:
        with sr.Microphone() as source:
            print('listening...')
            voice = listener.listen(source)
            command = listener.recognize_google(voice)
            command = command.lower()
            if 'bob' in command:
                command = command.replace('bob', '')
                print(command)
    except:
        pass
    return command

def run_bob():
    command = take_command()
    print(command)
    if 'time' in command:
        time = datetime.datetime.now().strftime('%I:%M %p')
        talk('Current time is ' + time)
    else:
        talk('Please say the command again.')

while True:
    run_bob()
```
I'm having an error in this code
```
Traceback (most recent call last):
  File "d:/Shashi's stuff/everything/Coding related lol/CodeCreations/bob/bob.py", line 44, in <module>
    run_alexa()
  File "d:/Shashi's stuff/everything/Coding related lol/CodeCreations/bob/bob.py", line 34, in run_alexa
    command = take_command()
  File "d:/Shashi's stuff/everything/Coding related lol/CodeCreations/bob/bob.py", line 30, in take_command
    return command
UnboundLocalError: local variable 'command' referenced before assignment
```
please help
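What the traceback is saying: something inside the `try` raised before `command` was ever assigned, the bare `except: pass` swallowed it silently, and `return command` then had nothing to return. A stripped-down sketch of the usual fix, with the microphone replaced by a hypothetical `listen` callable and `RuntimeError` standing in for the recognizer's real exceptions (with `speech_recognition` those would be things like `sr.UnknownValueError`):

```python
def take_command(listen):
    command = ""  # defined up front so the return below always has a value
    try:
        command = listen().lower()
        if "bob" in command:
            command = command.replace("bob", "").strip()
    except RuntimeError as exc:  # catch something specific and report it
        print(f"could not get a command: {exc}")
    return command


def failing_listen():
    # Simulates the recognizer failing to understand anything
    raise RuntimeError("no speech detected")


heard = take_command(lambda: "Bob what TIME is it")
silent = take_command(failing_listen)
print(repr(heard), repr(silent))
```

With an empty string as the fallback, the `'time' in command` check in the caller simply comes out false and the assistant asks again instead of crashing.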
I am running a transfer learning model for convolutional neural nets, and my validation accuracy is actually ahead of training accuracy due to image augmentation (distorting the input).
However, eventually my training accuracy starts to catch up ~60 epochs in. I feel like the model starts to overfit slightly afterward, but increasing dropout or augmentation only decreases overall accuracy
Any idea how I can get the val_accuracy closer to the train accuracy?
As in, early epochs, val_acc >> train_acc
Deeper in, the model starts overfitting, it seems. But dropout is not really working
What will be the equivalent code for annual data (1989, 1990, ... and not 01/01/1989 , 01/02/1990)? I know that asfreq would require 'a' but am unsure about the to_datetime() method and the set_index() method
```python
df_comp.set_index("date", inplace = True)
df_comp = df_comp.asfreq('b')
```
This is the plot by the way. Is this clear overfitting?
if the train accuracy gets to high 90s, then its def overfitting
another symptom of overfitting is when the train accuracy keeps climbing and the val accuracy also starts steadily decreasing.
using this, it automatically adds the 01-01 to each year, my format is just "1961, 1962, 1963 ...". How do I fix it?
that's because datetimes store the full date (at least)
Is it a problem that they are assumed to be Jan 1 each?
not really
just that only the year would look neater
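If year-only labels are the goal, one option is to convert the dates to annual periods rather than timestamps, since a period with yearly frequency prints as just the year. A sketch assuming a `date` column holding plain years; the frame and its `value` column are invented:

```python
import pandas as pd

# Invented annual data; only the shape of the "date" column matters here
df_comp = pd.DataFrame({"date": [1961, 1962, 1963], "value": [1.0, 1.5, 1.2]})

# Parse the year, then turn each timestamp into an annual Period
years = pd.to_datetime(df_comp["date"].astype(str), format="%Y")
df_comp["date"] = years.dt.to_period("Y")
df_comp = df_comp.set_index("date")

index_labels = [str(p) for p in df_comp.index]
print(index_labels)
```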
I am confused about why on certain problems Neural Networks perform abysmally and traditional algos catch up? why does this happen and what kind of NN's are capable of mapping any function/data?
I'm trying to calculate the number of days between dates in a groupby object but getting stuck. I'm trying to figure out the number of 'hospital days', i.e. 0-indexed days for each patient admission based on a datetime column. I'm managing to do it easily with dplyr in R but having major difficulties in pandas..
I have long-format data with columns with
patientid | datum| variablex | variabely
11 2020-01-01 xx yy
11 2020-01-01 xz yz
11 2020-01-01 xo yf
11 2020-01-02 xx xf
12 2020-01-04 xx yz
12 2020-01-05 xx yf
12 2020-01-04 xx zz
df['hospital_days'] = df.groupby('patientid')['datum'].diff() / np.timedelta64(1, 'D')
df['hospital_days'] = df['hospital_days'].fillna(0)
Getting negative days, so I figure I have to sort the subgroups somehow before running the above line
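A note on why the numbers go negative: `diff()` measures the gap to the previous row, so unsorted rows produce negative gaps, and even when sorted it gives days since the last row rather than days since admission. Subtracting each patient's earliest date avoids sorting altogether. A sketch using the sample rows above, under the assumption that each `patientid` has a single admission:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "patientid": [11, 11, 11, 11, 12, 12, 12],
        "datum": pd.to_datetime(
            [
                "2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02",
                "2020-01-04", "2020-01-05", "2020-01-04",
            ]
        ),
    }
)

# Earliest date per patient, broadcast back onto every row of that patient
first_day = df.groupby("patientid")["datum"].transform("min")

# Whole days elapsed since that first date; row order does not matter
df["hospital_days"] = (df["datum"] - first_day).dt.days
print(df)
```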
Hello to everyone, is anyone familiar using django and plotly dash together?
Go ahead and ask your question so that people who might know the answer will know what the question is.
Perhaps the NN is overfitting by learning too much? i.e, finding non existant relationships?
how do i counter overfitting in keras
Hi I want to learn ml and ai
can you reccomend some books on the above mentioned subject
hello
inductive biases, its all inductive biases
bishop's pattern recognition and machine learning
supplement with hastie et al's elements of statistical learning, and after you read the above transition to the canonical text by goodfellow et al on deep learning
that should be a good starting point
what are some good reinforcement learning library?
can i ask a basic backpropagation math question here?
Can I ignore the PACF coefficients after lag 1 as being random?
I need to know how to use the same dashboards with different users and different datasets (django + plotly dash), any idea?
Is this a good description of Chebyshev's Theorem?
I’m using LSTM to predict 1 value by 60 values before it, will it be ok if I just concat several databases for training
I would probably rephrase it like this: "...the minimum amount of values that lie _below_ or at K standard deviations"
Thanks, that is more clear
? chebyshev is a probabilistic statement, how are you turning it into a deterministic statement about minima?
Doesn't it tell you the minimum amount of values that must be within k standard deviations? It's my first day of statistics so I might be misunderstanding. @bronze skiff
I already submitted the assignment though, so we'll see XD
for a continuous random variable, also k > 0 not k > 1
pretty much - it's usually written as P(|x-mu| >= k*sigma) <= 1/k^2, which implies P(|x-mu| <= k*sigma) >= 1 - 1/k^2.
I only don't really like talking about "amount of values" here - that's only valid in case you have a uniform discrete distribution (so the probability of each possible value is constant, and the total probability of falling in an interval is just proportional to the number of values inside it). In general, it's a statement about the probability of the value being in the interval.
I may flop with this but: If we omit the len(x) we would be talking about the cumulative density of values above (or below?) those k standard deviations
as in: the proportion of values that lie below k standard deviations, accumulated. I.e. -> below k standard devs lie at least 0.x of the values, approx
correct me pls @tidal bough :x
yup, precisely. Because P(a<x<=b) = CDF(b) - CDF(a), the theorem is also saying that CDF(mu+k*sigma)-CDF(mu-k*sigma) >= 1 - 1/k^2.
among other things, it can be thought of as a constraint on how thick the distribution's tails can be (how slowly the probability falls off) before the distribution's variance becomes infinite.
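The bound also holds exactly for any finite dataset if the mean and the population standard deviation of that dataset are used, which makes it easy to check by hand. A small standard-library sketch on a deliberately skewed, made-up sample (the `fraction_within` helper is hypothetical):

```python
from statistics import fmean, pstdev


def fraction_within(values, k):
    """Share of values strictly closer than k standard deviations to the mean."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation of the sample itself
    inside = sum(1 for v in values if abs(v - mu) < k * sigma)
    return inside / len(values)


sample = [1, 1, 1, 2, 2, 3, 4, 50]  # made-up data with one large outlier
k = 2
observed = fraction_within(sample, k)
bound = 1 - 1 / k**2

print(observed, bound)  # observed share is at least the Chebyshev bound
```

Here only the outlier falls outside two standard deviations, so the observed share is 7/8 against a guaranteed minimum of 3/4.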
def softargmax(values):
    beta = 1000.0
    tensor = tf.squeeze(tf.matmul(
        tf.nn.softmax((beta * values), axis=1),
        tf.reshape(tf.range(values.shape[2], dtype=tf.float32), [-1, 1])
    ))
    return tensor
Does anyone have an idea what's "non differentiable" in this code?
It throws me:
ValueError: Variable <tf.Variable 'generator_lstm/kernel:0' shape=(128, 512) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
I already tried with tf.keras.backend ops and also with tf.raw_ops. All of them are differentiable on their own.
isn't calling values.shape[2] actually pulling values into that part of the computational graph? and then calling tf.range(..) on it is def nondifferentiable
i don't know tf, but in pytorch you can call values.detach() first to pull something off the gradient tape so it's separate from it
so you might have to use something to pull it off the tape
have you tried commenting out the body of your function and just returned a flat tensor?
maybe the issue isn't even in this function
i'll try. But as it is the only Lambda layer I use, its pretty likely that it causes the issue
I already tried like 100 things today. Frustrating
i'm also certain there's tf implementations for gumbel-softmax out there to begin with 😛
You are absolutely right. But as I need the Max-Index, the Gumbel-Softmax just transforms my normal Softmax Output into One-Hot. Would not change much for me. I need this kind of SoftArgmax implementation. For Torch there are many implementations. TF/Keras Implementations seem to be pretty rare
for sure-- this shows up a lot in differentiable sorting procedures (soft argmax)
which is why i thought the issue was a non-detach from the gradient tape
but if you already tried replacing values.shape[2] with like, a fixed number 5 and it didn't work then i'm not sure 😛
yeah thats what i did. I also replaced the tf.range thing by hardcoding the range array/tensor
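For reference, here is a minimal pure-Python sketch of the soft-argmax idea being discussed. It is not the TF code from the question: the function name, the default `beta`, and the sample scores are illustrative assumptions. The softmax of `beta * values` is used as weights over the index positions, so the result moves toward the hard argmax as `beta` grows.

```python
import math

def soft_argmax(values, beta=10.0):
    """Softmax-weighted average of index positions (a smooth stand-in for argmax)."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(values)
    weights = [math.exp(beta * (v - peak)) for v in values]
    total = sum(weights)
    # Expected index under the softmax distribution.
    return sum(i * w for i, w in enumerate(weights)) / total

scores = [0.1, 2.0, 0.3, 0.5]
print(soft_argmax(scores, beta=50.0))  # very close to 1, the index of the largest score
print(soft_argmax(scores, beta=0.5))   # a blend of indices when beta is small
```

With a very large `beta` the softmax saturates toward one-hot, so in an autodiff framework the gradients can become vanishingly small. That is a separate issue from the `None` gradient reported in the error above.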
well, my brain literally hurts
sometimes miracles happen after you have taken a little break
if you don't mind sharing a longer gist i wouldn't mind looking if it's not too huge
Thank you for the offer, unfortunately I can't share much more due to confidentiality reasons 😦
I'm going to take a break for now. I'll come back when my head is working again. Thanks for the help and gn8 🙂
🕯️
i have a dataframe that has duplicate values but they vary from being first or last. however, one of them is always a NaN value. any ideas on how to remove the NaN duplicate
what kind of duplicates?
like, duplicate rows?
duplicate index that has 2 values within the same column
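One way this could look in pandas, as a sketch under assumptions: the column name `value` and the toy data are made up, and the idea is to drop a NaN row only when its index label appears more than once.

```python
import numpy as np
import pandas as pd

# Toy frame: index labels "a" and "b" each appear twice, once with NaN.
df = pd.DataFrame(
    {"value": [1.0, np.nan, np.nan, 4.0, 5.0]},
    index=["a", "a", "b", "b", "c"],
)

# True for every row whose index label occurs more than once.
is_dup = df.index.duplicated(keep=False)
# True for rows that are both duplicated and NaN in the column of interest.
mask = is_dup & df["value"].isna().to_numpy()

# Keep everything except the NaN halves of the duplicated labels.
cleaned = df[~mask]
print(cleaned)
```

If NaN never carries meaning in that column, `df.dropna(subset=["value"])` would be the shorter route, but it would also drop NaN rows whose index is not duplicated.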
Extremely new to all of this. What metrics do you guys always look at for the performance of your models? Specifically for assessing different scaling/normalization processes?
Following that, what would indicate a need for a more complex architecture versus varying the preprocessing routine?
Generally classifiers are evaluated in terms of precision, recall, and f1
depending on your use case
you can look at things like lift
lift?
group your dataset
by predicted probability
calculate the proportion of positives for each group
it’s somewhat related to calibration
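The steps above can be sketched in plain Python. The function name, the number of groups, and the toy labels are assumptions for illustration, and the sketch assumes there are at least `n_groups` examples. Each group's positive rate is divided by the overall positive rate to give its lift.

```python
def lift_table(y_true, y_prob, n_groups=5):
    """Positive rate per score bucket, relative to the overall positive rate."""
    # Sort examples from highest to lowest predicted probability.
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    base_rate = sum(y_true) / len(y_true)
    size = len(ranked) // n_groups
    rows = []
    for g in range(n_groups):
        # The last bucket absorbs any remainder.
        end = (g + 1) * size if g < n_groups - 1 else len(ranked)
        bucket = ranked[g * size:end]
        rate = sum(label for _, label in bucket) / len(bucket)
        rows.append((g + 1, rate, rate / base_rate))
    return rows

y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
table = lift_table(y_true, y_prob, n_groups=5)
for group, rate, lift in table:
    print(group, rate, lift)
```

A model that ranks well shows lift well above 1 in the top groups and below 1 in the bottom ones.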
Hello, can anyone explain the input_shape=() argument for an LSTM in Keras?
i have 250 features plus a label, and 99488 samples
i have used (250,1)
but it shows an error
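Keras recurrent layers expect 3D input of shape `(samples, timesteps, features)`, and `input_shape` leaves out the samples axis. A sketch with NumPy only, assuming each of the 250 columns is treated as one timestep with a single feature (that is a guess about the data, and the sample count is shrunk to keep it small):

```python
import numpy as np

# Stand-in for the real data: 8 samples instead of 99488, 250 values each.
n_samples, n_steps = 8, 250
X = np.random.rand(n_samples, n_steps)

# LSTM layers take (samples, timesteps, features); add a trailing feature axis.
X_seq = X.reshape(n_samples, n_steps, 1)

# input_shape omits the samples axis, so it would be (250, 1) here.
input_shape = X_seq.shape[1:]
print(X_seq.shape, input_shape)
```

If `input_shape=(250, 1)` is already set, one common cause of a shape error is passing the 2D array to `fit` without this reshape. Whether that is the error here is not clear without the message.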
look up stuff like the lorenz curve or gini metric, that's usually where you see stuff like this
Can you guys tell me why this error occurs when I use grid.fit(X_train, y_train) in my SVM project?
quantitative econ is full of interesting classifier metrics that aren't auc-based
quantitative econ amazes me dude
the error message tells you exactly what's wrong
the internships they offer give you a limited list of like T20 colleges and if you don't belong to them you have to say you're from another college
you most likely instantiated the grid search estimator wrong
But it is installed
it has nothing to do with being installed?
why don't you post a code fragment instead of just the error
don't dm me, post it here
So how to solve this
post the code that is failing
yeah, that means nothing
whats the surrounding context-- how did you define grid, etc
grid = GridSearchCV(SVC,param_grid,verbose=3,refit=True)
okay, now notice that you are just passing the class SVC itself, you're not instantiating an object
you need to instantiate an SVC object first and use that in the constructor
the same way you instantiate any other object from a class?
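A minimal sketch of the fix described above, assuming scikit-learn is installed. The synthetic data and the `param_grid` values are made up, since the original ones were not posted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for X_train / y_train.
X_train, y_train = make_classification(n_samples=60, n_features=4, random_state=0)

# Illustrative grid; the original param_grid was not shown.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Note SVC() with parentheses: an estimator instance, not the class.
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

The only change from the failing line is `SVC()` in place of `SVC`; `cv=3` is there just to keep the toy run small.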
i seriously suggest you do a refresher on python itself
????
OK I Got it
Anyone here use streamlit?
@misty flint Yeah its made my life a lot easier. Do you know how to return data by using date ranges?
Hello, can anyone explain the input_shape=() argument for an LSTM in Keras?
i have 250 features plus a label, and 99488 samples
i have used (250,1)
but it shows an error
thats less of a streamlit question and more of a datetime module question no?
however, if its loaded onto the df, you should be able to regardless
Anyone here working on self supervised learning on images? BYOL to be precise
@misty flint I suppose you're right. I've got 2 st.date_input() boxes that take a start date and end date. And I was thinking I would update the date if the user clicks a button.
st.dataframe() is the function youre looking for for streamlit if you need to load the dataframe onto streamlit
@misty flint So I've got my dataframe and then I've created multiple new dataframes from the original and then I've created plotly charts out of each of the new dataframes which are based off the original.
I was thinking if I add parameters to filter the original dataframe with those 2 dates it should change the subsequent dataframes too right?
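A sketch of that filtering step in pandas alone. The column names and dates are hypothetical, and `start` / `end` stand in for whatever the two `st.date_input()` widgets return.

```python
import datetime as dt
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "price": range(10),
})

def filter_by_dates(frame, start, end):
    """Return rows whose 'date' falls within [start, end], inclusive."""
    start_ts, end_ts = pd.Timestamp(start), pd.Timestamp(end)
    mask = (frame["date"] >= start_ts) & (frame["date"] <= end_ts)
    return frame.loc[mask]

# Plain date objects, like the ones a date picker would hand back.
start, end = dt.date(2021, 1, 3), dt.date(2021, 1, 5)
filtered = filter_by_dates(df, start, end)
print(filtered)
```

Any derived dataframes and charts built from `filtered` instead of `df` should pick up the date range, as long as they are recomputed after the filter runs.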
anyone?
Streamlit mostly creates dashboards right?
depends how you do it
like plotly, but easier and prettier?
streamlit is the framework
it accepts plotly
think of it like flask
-shivers-
i'll have to check it out. im still a bit not-certain about what it does
anyone can help in LSTM keras ??
idk what your code looks like but i did something similar with crypto prices
and f strings helped me accomplish what i needed
outside of looking at your code logic, cant offer much more advice
@misty flint I think I've got it. Thank you for the help.
Any idea how you could potentially save an instance of your dashboard? Like if I create a report of a date range, could I share like a URL of that state?
I'm wondering if I'd have to integrate Flask to generate short URLs with parameters
Nonetheless streamlit is actually lit
np
and saving an instance so that others can access it? youd need to host it with something like a cloud provider
or something like heroku or digital ocean
what does adding _mirror to the name of a layer in tensorflow do?
i've never seen that syntax
layer.name = layer.name + str("_mirror")
How do you report spam?
@sonic vapor
!warn 730519909221531648 Do not post referral links in the server.
:incoming_envelope: :ok_hand: applied warning to @lapis sequoia.
```
           State    Capital          State      Capital
0    Maharashtra     Mumbai     California   Sacramento
1    West Bengal    Kolkata        Florida  Tallahassee
2  Uttar Pradesh    Lucknow        Georgia      Atlanta
3          Bihar      Patna  Massachusetts       Boston
4      Karnataka  Bengaluru       New York       Albany
```
I have this data frame and I want to delete the country column. How do I do that?
!d pandas.DataFrame.drop
```py
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
```
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
Parameters
**labels**: single label or list-like. Index or column labels to drop.
**axis**: {0 or ‘index’, 1 or ‘columns’}, default 0. Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
**index**: single label or list-like. Alternative to specifying axis (`labels, axis=0` is equivalent to `index=labels`).
**columns**: single label or list-like. Alternative to specifying axis (`labels, axis=1` is equivalent to `columns=labels`).
**level**: int or level name, optional. For MultiIndex, level from which the labels will be removed.
**inplace**: bool, default False. If False, return a copy. Otherwise, do operation inplace and return None.... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html#pandas.DataFrame.drop)
@austere swift I have named the column levels using df.columns.names=['Country',None]. Now I want to remove India and America using the Country name
And Dataframe is multiindex
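A sketch of two options for MultiIndex columns, using a rebuilt two-row version of the frame (the construction is an assumption about how the original was made). `droplevel` removes the whole `Country` level, while `drop(..., level=...)` removes the columns under one country.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["India", "America"], ["State", "Capital"]], names=["Country", None]
)
df = pd.DataFrame(
    [["Maharashtra", "Mumbai", "California", "Sacramento"],
     ["West Bengal", "Kolkata", "Florida", "Tallahassee"]],
    columns=columns,
)

# Remove the whole "Country" level, keeping the State/Capital labels.
flat = df.droplevel("Country", axis=1)

# Or drop every column under one country by its label on that level.
india_only = df.drop(columns="America", level="Country")
print(flat.columns.tolist())
print(india_only.columns.tolist())
```

Both calls return new frames and leave `df` unchanged.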
is there a way to do a rolling window on an entire pandas dataframe, and not just a single column?
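For what it's worth, `rolling` can be called on the dataframe itself, and the aggregation is then applied to each numeric column separately. A small sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# rolling() on the frame applies the window to every numeric column.
rolled = df.rolling(window=3).mean()
print(rolled)
```

The first `window - 1` rows come out as NaN. This works column by column, so a function that needs all columns of the window at once would need a different approach.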
Ive got a question about lstm
whats better: if I put 100 dense units inside my model, or if I put 1 dense unit, predict 1 value, add it to input and do prediction again?
1 dense unit
100 dense units
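The two options in the question are often called direct multi-step forecasting (one output unit per future step) and recursive or autoregressive forecasting (one output, fed back in). Here is a plain-Python sketch of the recursive loop. `predict_next` and `naive_model` are hypothetical stand-ins for a trained model's one-step prediction.

```python
def recursive_forecast(predict_next, history, steps):
    """Predict one value at a time, feeding each prediction back in."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(window)
        forecasts.append(nxt)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [nxt]
    return forecasts

# Hypothetical one-step "model": continue the last observed difference.
def naive_model(window):
    return window[-1] + (window[-1] - window[-2])

forecast = recursive_forecast(naive_model, [1.0, 2.0, 3.0], steps=4)
print(forecast)
```

Because each prediction becomes an input to the next one, errors can build up over long horizons. The direct approach avoids that feedback but has to learn all the future steps at once.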
How can I boost the learning process of keras LSTM?
thats because your range is from -1000 to 1000... at those scales, the small bumps near -1 and 1 become invisible...
try restricting your range to -2 and 2 and see what happens
I believe this to be correct.
i can't wait for the day when this stuff is sold in art galleries
OHHHHHHH
now i understand how apple's background thingy works
the one where they blur it
I am a high school student that is mostly new to programming and completely new to python, ai, dl, and ml. Where should i begin to start learning to code ai?
Look up some of the existing python ai/ml libraries.
hm?
looking at the existing libraries alone won't be enough bc ML/AI is also pretty math heavy
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
check this out to see the math you need to know
get a copy of bishop's pattern recognition and machine learning-- basic prereq for ml practitioners
how would i go about finding these libraries
well when I had my brief endeavor into DS/ML I started with Numpy, Matplotlib, and Seaborn
and then I "learned" Pandas and sklearn
ah ok thanks!
Will highschool AP teach me the concepts for Mathematics For Machine Learning on its own?
How can I boost the learning process of keras LSTM?
I have a massive dataset with over 20000 values (should be approximately over 100000 if I use everything) and my model takes forever to learn
Increasing batch size changes performance significantly
Hello
high school AP what?
AP CS?
AP Calc?
appropriate lag is 0 or 7 or 11? How do I decide?
Oh kk, will do more research
I’d say almost. Maybe a little Linear Algebra and you’re good to go 🙂
Plus it’s free on Microsoft’s website
asked my sis, apparently they do teach linear algebra in IB, so in AP they most likely will teach it
Imagine lstm model thats predicting values. Will it break if I give it values that are .rolling(3).mean() (for example)?
These are pretty fundamental things. Machine Learning uses surprisingly low level maths.
^^
You wanna be careful giving means if something is timeseries-esque
Means assume that the distribution is IID (Independent and Identically Distributed)
Won't break, but it might not make a lot of sense. If the temperature is rising, the "things aren't changing"-scenario would refer to increase in temperature staying constant. This is because current temperature is dependent on previous temperature. And the mean wouldn't capture this 🙂 —— in this case the mean would undershoot.
Actually the mean can capture it, that's the basis behind models like ARIMA, SARIMA.
Now, if it's a good extrapolation or not, that's a whole different thing.
If there's a sequential dependence be careful with means 🙂
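To make the undershoot point concrete, here is a small plain-Python sketch with made-up temperatures. It computes a trailing mean over the last three values, the same window layout as `.rolling(3).mean()` with its default settings.

```python
def rolling_mean(series, window):
    """Trailing mean over the last `window` values (None until enough data)."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = series[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

temps = [10.0, 12.0, 14.0, 16.0, 18.0]  # steadily rising
smoothed = rolling_mean(temps, 3)
print(smoothed)
```

On a steadily rising series every smoothed value sits below the current reading, so a model fed the smoothed series sees the trend with a lag.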
Im not sure there is
Not sure if its really a time series since im not passing it any datetimes
It just learns to predict values, knowing 120 behind it
Since it just learns to predict values from the 120 before it, does that mean giving it mean values might work?
need some help here, i have a df consisting of 500 np.float64 values and
```py
hist = sns.histplot(data=df, y='values', hue='Color', kde=True, palette='Dark2', ax=ax[1])
```
is returning an error:
```py
No loop matching the specified signature and casting was found for ufunc add
```
Any ideas?
stops working as soon as i include kde=True
no idea why
Yes 🙂 — try it on a subset of the data so it trains faster, and see what kind of metrics you're getting.
Hmmm, yeah I got a dataset that should take like 30 minutes to learn
Another question
A histogram of floating values? That sounds weird. that makes no sense
A histogram is literally a count of discrete values. passing a continuous value doesnt make much sense to me
When I made a model that learned from a not so big dataset, it (somehow) made predictions far into the future (around 2/3 weeks), but when I trained another one on several huge datasets, it predicts for only few days, even though very accurately
Let me get screenies
for continuous values you can compute KDE as well, but im not sure if its done in the same way.
@exotic maple i agree, they are floats simply because they vary very slightly around integer values, casting the array to np.int before using kde=True still raises the same error
Does the normal histogram without KDE work?
Try using the KDE class directly, without hist
The one with huge dataset (small prediction) was a 1 epoch model, but for 100 epochs its the same, just the prediction goes a little further
hist works without kde=true and kde plot raises exactly the same error. Is it time for a reinstall?
Whats better: autoregression or straight up prediction?
conda update --all and jupyter restart fixed it @exotic maple i have absolutely no idea lmao, thank you for your help though
eh
i added more units to my model and decreased batch size but performance only seems to drop
before i started adding more data to my model, performance was better too
and now...
what even is this mess??? is this some new kind of overfitting
cause im testing literally on train data here
also the same thing now but with autoregression
why does this happen? should i increase batch size and decrease units?
generally you should decrease batch size and increase model capacity
large batch sizes also require some learning rate tuning as well, which can be tricky
the problem is
i decreased it
and it got worse
i added units and it got worse more
i added more DATA and though it made accurate predictions, it died too soon
i have literally no idea as to why thats happening
hello, i have a large list of full names, i would like to be able to search by any part of that name for a match. basically name fuzzy search does anyone have any library recommendations for this?
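As a starting point, the standard library's `difflib` can do this without extra installs. The sketch below is one possible approach, not a recommendation of a specific library: the function name, the `cutoff` value, and the sample names are assumptions. It accepts a name if the query is a substring of it, or if any single part of the name is close to the query.

```python
import difflib

def search_names(query, names, cutoff=0.6):
    """Return names containing the query or having a part close to it."""
    q = query.lower()
    hits = []
    for name in names:
        lowered = name.lower()
        if q in lowered:
            hits.append(name)  # plain substring match
        elif difflib.get_close_matches(q, lowered.split(), n=1, cutoff=cutoff):
            hits.append(name)  # approximate match on one name part
    return hits

names = ["Maria Gonzalez", "John Smith", "Jon Snow", "Mary Johnson"]
print(search_names("smith", names))
print(search_names("jhon", names))  # misspelled query still finds similar parts
```

This scans the whole list on every query, which may be slow for a very large list. Third-party libraries such as RapidFuzz are a common choice when speed matters.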
im just gonna try and teach it with 1 epoch instead of 100, maybe some kind of magics gonna happen and do everything
to be honest 1 epoch made better prediction than 100
it at least died after a small while but not instantly
Are there any other useful preprocessing routines for regression tasks? Have tried minmax, Zscore, robustScaler, yeo-johnson, quantile. Looking for more, but pretty much any document I find lists roughly the same ones.
consult your neighborhood witch
thats work for tomorrow
I wish you the best of luck and hopefully no hexing. Dont piss em off
Anybody done the machine learning AWS certification?
I don't think certifications are usually taken very seriously by employers, but I'm not sure what your goal is.
The AWS is amazon certified though, like the google ones
yes, though my statement applies to certifications in general
I mean it's hardly my main qualification, but I imagine it would help if I have little AWS experience in my job
what job?
Machine learning specialist
Atm I'm trying to get some more experience with the stuff I've found being looked for by employers but I don't have much direct exposure to
Like AWS
if your underlying question is "is the information presented in the AWS certification course useful for what I want to do?" (rather than "will anyone care if I get it?"), then I'm not the one to answer.
I was just curious if anybody had done it, since the enrollment process seems awkward and I wanna know if anybody else had studied for it
I'm sure the actual material will be dumbed down and boring like most of the certs I've looked at
But they namedrop amazon and google so...
Let's see what others might say. You can also ask in #career-advice, though you should provide context about why you're asking
I'm only a college student I'm not sure if my opinion matters
if you weren't already speaking to someone and your only comment boils down to "no comment", you don't really need to say anything.
though if you want to say "I'm in college, though I've heard xyz about the certification", that's more substantial.
but my friend did an AWS certification and it didn't really help him
he just said it was like any other certificate you could pick up
I have
Was it worth the effort?
yeah, why not
it's pretty simple
okay I took it a couple of months ago for fun?
since I didn't really have much to do @ my job
so I thought
yeah it's similar for me
might as well use some of that training budget



