#data-science-and-ml
i remember asking that question. then our last team project was on making a chatbot

then i figured it out
rasa 11/10
if you ever need to make a chatbot for deployment
how can i calculate the p-value of a test statistic in python?
as of right now i only plan on doing t stat, z, chi square and slope of regression line
this is what i have as of right now
manually?
or using a library
ideally manually
hm
you know the mathematics behind these calculations, yes?
i can find the correct standardized test statistic, such as z-score, t stat, chi square etc. but i'm clueless on how to find the p value from that. i know that hardcoding in a table of z score probability values is not the way to go either
okay so
you can basically
think of the p-value as the area under the PDF, right
(which corresponds to the CDF)
that's all calculus
from a normal curve the p-val would be the area to the right of it correct?
it depends on your test
oh right if you're doing like a 2 sided z stat it's P(X > z) and P(X < -z) iirc
uhh
ohhhhhhh right i think it's the area covered to the corresponding side of the test statistic right?
just finding the % of the area to the right/left of the statistic
if it was 1 sided with a null hypothesis of p = 0 and alternate hypothesis of p > 0, the bounds of the integral are from the z-score to infinity, right?
i think
Do people actually do integrals to get those from the PDF? I've never, even in college, integrated the PDF over a range to get the p-value. sounds a bit overkill unless explicitly for teaching. Most distributions have their standard tables, no?
no, but
they said
they want to do it manually
that does sound correct.
High respects to him lol. I'd just pop a z-table or t-table haha
I found a library called scipy which can do integrals for you, but all this integral stuff is completely new to me. I messed around with it back in middle school, but it's surprising that you can use an integral to find a CDF for a normal curve. Ended up just using scipy in the program, because manually configuring tables and, god forbid, degrees of freedom didn't look like the best option programmatically
no
you can calculate the integral manually
well
it depends on what you want to do I guess
whether it's for learning
but anyway
the CDF is the integral of the PDF
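A minimal sketch of the "manual" route for a z statistic, assuming a standard normal null distribution. The function names are made up for illustration. The upper tail is computed twice: once with the closed form via `math.erfc`, and once by numerically integrating the PDF from z upward, to show they agree. For t and chi-square statistics the same idea applies with their own PDFs (or `scipy.stats` does it for you, e.g. `scipy.stats.norm.sf(z)` for this case).

```python
import math


def normal_pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def upper_tail(z):
    """P(Z > z) for a standard normal, using the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def upper_tail_by_integration(z, upper=10.0, steps=100_000):
    """Approximate P(Z > z) with the trapezoid rule on [z, upper].

    `upper` stands in for infinity; the density beyond 10 is negligible.
    """
    if z >= upper:
        return 0.0
    h = (upper - z) / steps
    total = 0.5 * (normal_pdf(z) + normal_pdf(upper))
    for k in range(1, steps):
        total += normal_pdf(z + k * h)
    return total * h


def z_p_value(z, tail="two-sided"):
    """p-value of a z statistic for the chosen alternative hypothesis."""
    if tail == "greater":      # H1: parameter > null value, area to the right
        return upper_tail(z)
    if tail == "less":         # H1: parameter < null value, area to the left
        return upper_tail(-z)
    return 2.0 * upper_tail(abs(z))   # both tails


p_two_sided = z_p_value(1.96)
p_right = z_p_value(1.96, tail="greater")
p_right_numeric = upper_tail_by_integration(1.96)
print(p_two_sided, p_right, p_right_numeric)
```

Which tail you integrate over depends on the alternative hypothesis, as discussed above.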
ok i need help really quickly
if i have two lists of different lengths but they're within the same range, how would i plot them over each other so they start and end at the same locations?
ok well im scratching that completely now
how do i find correlation between two lists with different lengths but which are within the same time series?
as in i measured one variable 5 times in an hour, and another 423 times in an hour, and need to see if they're correlated
thanks!
i have like 30 minutes btw, been working on this for HOURS
does someone know how to change all the symbols of a graph legend? I'm using seaborn and matplotlib, but even trying to use legend_handler of matplotlib I haven't had success
item_xy.legend(legend, fontsize=legendsize, bbox_to_anchor=(0.87, 1.15), loc=2, handler_map={item_xy: HandlerLine2D(numpoints=1)})
i'm talking about this
resample
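One way to read "resample", as a rough sketch: bring the dense series onto the timestamps of the sparse one, then correlate the paired values. This assumes both series come with timestamps (the arrays below are made-up data, and `correlate_on_common_times` is a hypothetical helper). With only 5 paired points the resulting correlation is a very noisy estimate.

```python
import numpy as np


def correlate_on_common_times(t_sparse, y_sparse, t_dense, y_dense):
    """Interpolate the dense series at the sparse timestamps, then correlate."""
    t_sparse = np.asarray(t_sparse, dtype=float)
    y_sparse = np.asarray(y_sparse, dtype=float)
    t_dense = np.asarray(t_dense, dtype=float)
    y_dense = np.asarray(y_dense, dtype=float)

    # np.interp expects the known x-coordinates in increasing order
    order = np.argsort(t_dense)
    y_dense_at_sparse = np.interp(t_sparse, t_dense[order], y_dense[order])

    # Pearson correlation of the paired samples
    return np.corrcoef(y_sparse, y_dense_at_sparse)[0, 1]


# made-up data: one variable measured 423 times in an hour, the other 5 times
t_dense = np.linspace(0.0, 60.0, 423)
y_dense = 2.0 * t_dense + 1.0
t_sparse = np.array([0.0, 15.0, 30.0, 45.0, 60.0])
y_sparse = 3.0 * t_sparse - 4.0

r = correlate_on_common_times(t_sparse, y_sparse, t_dense, y_dense)
print(r)
```

Averaging the dense series in a window around each sparse timestamp would be an alternative to plain interpolation if the dense signal is noisy.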
A 28×28 numpy array can be interpreted as a matrix with order 28×28 right? For reference it was mentioned in the tensorflow docs
Can anyone explain me how polynomial regression works with the use of linear regression?
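A small sketch of the idea, for anyone wondering the same thing: polynomial regression is ordinary linear regression on extra columns x, x^2, x^3, ... because the model stays linear in the coefficients. The helper name and the data are made up for illustration.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares fit of y = c0 + c1*x + ... + c_degree*x**degree."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    # Plain linear least squares on the expanded features
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs


x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2          # noise-free quadratic
coeffs = fit_polynomial(x, y, degree=2)
print(coeffs)
```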
What does that line do?
Hi there! My question might be dumb, but I am looking for the most efficient way to select N rows in a matrix such that their distances to one another are:
- ideally, maximum
- at least, larger than a threshold
I thought I could cluster my data on N cluster and pick in each one (because that is my underlying idea of selecting rows of different classes), but I wonder if there is really a need in clustering.
Thanks 🙂
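One option that avoids clustering is greedy farthest-point sampling. The sketch below is a heuristic, not a guaranteed optimum: it repeatedly picks the row farthest from everything chosen so far. Euclidean distance and the starting row are assumptions, and `farthest_point_sampling` is an illustrative name.

```python
import numpy as np


def farthest_point_sampling(X, n, start=0):
    """Greedily pick n row indices of X that are far apart from each other."""
    X = np.asarray(X, dtype=float)
    chosen = [start]
    # distance from every row to its nearest already-chosen row
    min_dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(n - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen


X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
picked = farthest_point_sampling(X, n=3)
print(picked)
```

To enforce the "at least larger than a threshold" condition, you could stop as soon as `min_dist.max()` drops below the threshold.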
so there is nothing to explore there?
if you can give an example of how visualised attention maps would know what NN focus on binary classification of cats, that would be great?
Can I get a little help on counting precision and recall of a search engine ?
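For a single query, precision and recall come down to set arithmetic, assuming you know which documents are actually relevant. A minimal sketch with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one result list against a known relevant set."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)
    # precision: share of returned results that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    # recall: share of relevant documents that were returned
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
print(precision, recall)
```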
Anyone who has heard about AIML files?
Hey everybody!
I just uploaded a new video "Differentiable augmentation for GANs (using Kornia)"
GANs are known to be very data hungry. Are there ways how to make them more data efficient? As it turns out applying augmentations is not that straightforward. In this video, I explain a recent method called differentiable augmentation (DiffAugment) and use it to train the DCGAN.
In this video, I discuss the paper "Differentiable Augmentation for Data-Efficient GAN Training". Additionally, I take a few ideas from it and try to code up an experiment to investigate whether differentiable augmentation has any effect on GAN training. I use the open-source package Kornia to perform the augmentations. To make our lives simpler...
line 1, in <module>
from PyQt5 import QtCore, QtGui, QtWidgets
ModuleNotFoundError: No module named 'PyQt5'
help
what will be better for a data scientist future at microsoft?
- python and SQL
- python and TSQL
- R and SQL
- R and TSQL
Does anyone have good resources to learn about TensorFlow graph debugging? I have a GAN whose graph isn't entirely or correctly connected. Maybe there are issues caused by discrete or non-differentiable parts, so that no gradients can be calculated. I have no idea how to look deeper inside and find out what's wrong. From the outside everything looks fine and the model compiles and runs the forward pass
are you running from terminal?
if so, make sure your terminal is on the same python version as the one you installed the modules on
you can switch version in terminal by using
py -3.x filename.py
https://stackoverflow.com/questions/67400936/using-aiml-files-predicate-and-sessions-with-discord-py
Any person willing to answer my question?
I was learning about AIML files with Python. I know I need to use aiml module of Python, but I want to use it with discord.py.
I want to make it so that, suppose I am talking with the bot, and I tell
yes
I made a chatbot api with it
It's open source?
I mean it's still in development
I saw it
and I read it
basically AIML files are like HTML
but made for chatbots
what you have to do is
Ok?
you need to create a file called
std-startup.xml
I just need some help with predicates and stuff
about aiml and pandorabots
Uhhh, what's pandorabots?
pandorabots is chatbot application platform that uses AIML
it's very good platform to use chatbots from
Ohhh
especially Kuki_ai
The leading platform for building and deploying chatbots.
for python three you need to install
python-aiml
not just aiml
got those from GitHub
Yes I know that
And is there a getting started guide? Or I need to see that link u sent?
Oh, ok
of how would you make one
aiml is very easy and very good chatbot system
but it is rule based
And can I use old AIML files?
so it has its disadvantages
Aiml files are similar to xml files
so that wont matter
the only thing that will matter is your library
https://github.com/sohelamin/chatbot
Like of these...
Well does it matter?
use the official resources wait
Am just getting the AIML files
wait a sec
Sure
this and
I can't use this. It's using an API module and thinks am making a website, but am making a bot...
Uhhh, I have also heard about std-65.aiml thingy
Oh ok cool
it doesn't matter
the aiml files are same
And also, if I want some aiml files, can I get those from any GitHub repo?
i used these in my chatbot api
yes
just download using git
And I also need to change the version?
clone it
Oh ok
aiml 2.0 will also work
Can't I download manually?
you can ofc
What if the version is 1.0.* Those would also work?
- means anything
ye it will work
Oh cool!
just try it and see lol
yea
BTW, if it's ok for u, may I ping/DM u, regarding them?
sure
And also, have u used sessions and predicates, like I asked in that question above?
this one has everything
all the files you will need
it has alice chatbot files
mitsuki and
standard aiml files
is it possible to use TextBlob in the dutch language?
yep
Damn, this person did do a looot of work
yea
they are not actually in aiml
you have tags
like learn
random
and say use
tags like that
look at the aiml website
Appreciate it dude
np
also
you are using it for your discord bot right?
is it better to use matplotlib or Charts js to show on web
For my data analysis unit I need to clean up a database to create charts from the data for the report i'm creating. I'm trying to figure out which columns won't be useful to drop them.
The scenario brief I'm set is this:
You are working in a small Data Analytics firm. A small Insurance Broker is looking to add another insurer on to their portfolio. The new insurer wants to see the claims performance of their current business (a “bordereau”).
Would I need to drop some columns?
Hey anyone faced the similar issue where pandas converts ints,floats, and etc into objects?
```python
import numpy as np
import pandas as pd

example_array = np.array([
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan]
])
df = pd.DataFrame(example_array, index=['int', 'string', 'float', 'nan'])
# df.select_dtypes(include = ['float'])
df.dtypes
```
output:
```
0    object
1    object
2    object
dtype: object
```
@haughty tree Do you have an high school math background?
Great
So if you want to start with neural networks I heavily suggest nnfs.io by sentdex (he's also making a youtube series of the book, but it might take a long time to complete)
That seems somewhat understandable, because the original array example_array already has a single dtype for all of its elements (numpy arrays are homogeneous, so the mixed values get coerced, here to strings), and pandas doesn't recalculate the dtypes when creating a dataframe from a single numpy array.
Maybe you can avoid using an array here?
yeah I did some research found it to be the way, just the issue I'm facing is if I construct multiple independent lists they will have the same problem while passing through DataFrame.
Instead of doing whole pd.Series I was wondering if there is a way to work around that issue
.
```python
example_array = [
    [1, 2, 3],
    ['one', 'two', 'three'],
    [4.01, 5.01, 6.01],
    [np.nan, np.nan, np.nan],
]
dtypes = ['int', 'string', 'float', 'nan']
df = pd.DataFrame()
# Assign each list as its own column so pandas infers one dtype per column.
for index in range(len(example_array)):
    df[dtypes[index]] = example_array[index]
df.dtypes
```
this seems to solve the issue
if I construct multiple independent lists they will have the same problem while passing through DataFrame.
Not sure what you mean by that. You get the same problem even when passing each column as a list or a numpy array with the right dtype?
TIL: pandas associates dtypes on a per-column basis
Basically, I'm saying that if all that you have is a numpy array of dtype object, you need to somehow make pandas recalculate the dtypes of each column. Ideally, you just wouldn't create such an array.
Yup! In fact, secretly each column of a dataframe, a Series, is essentially a 1d numpy array with a dtype of its own.
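A short sketch of the per-column point, assuming you can build the frame from separate lists instead of one mixed array. The column names mirror the example above; passing a dict lets pandas infer a dtype for each column independently.

```python
import numpy as np
import pandas as pd

# One list per column, so each column gets its own dtype
columns = {
    "int": [1, 2, 3],
    "string": ["one", "two", "three"],
    "float": [4.01, 5.01, 6.01],
    "nan": [np.nan, np.nan, np.nan],
}
df = pd.DataFrame(columns)
print(df.dtypes)

# Numeric columns can now be selected by dtype
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))
```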
yeah that wouldn't have been optimal, but the second version seems to work like a charm
Hey guys, I am using resnet18 from pytorch and I'm having trouble optimizing for recall for my binary classifier. I think I need to make a change to my fc layer. Any thoughts on how to do this?
99% of things you would propose would be done, unless you research them yourself 🤷
it's all just a google search away
thats true
```python
import time
import numpy as np

# Creation of histograms (features)
# `kmeans` (a fitted clustering model) and `imagesarray` are defined earlier.
temps1 = time.time()

def build_histogram(kmeans, des, image_num):
    res = kmeans.predict(des)
    hist = np.zeros(len(kmeans.cluster_centers_))
    nb_des = len(des)
    if nb_des == 0:
        print("histogram problem for image: ", image_num)
    for i in res:
        hist[i] += 1.0 / nb_des
    return hist

# Creation of a matrix of histograms
hist_vectors = []
for i, image_desc in enumerate(imagesarray):
    if i % 100 == 0:
        print(i)
    hist = build_histogram(kmeans, image_desc.reshape(-1, 1), i)  # calculates the histogram
    hist_vectors.append(hist)  # histogram is the feature vector
im_features = np.asarray(hist_vectors)

duration1 = time.time() - temps1
print("histogram creation time: ", "%15.2f" % duration1, "seconds")
```
Hello guys, I don't know why sometimes this code works and why sometimes it's looping again and again.... can anyone help?
Thanks
Hello, anyone knows about courses involving machine learning and trading?
do you want to use ML with trading/finance?
yes
I would highly appreciate it if anyone has some info about it. There is a book by Stefan Jansen named Machine Learning for Algorithmic Trading but idk if it is useful,
And some online courses, but nowadays many people sell bullshit specially trading related
that's cuz it's not a great idea in general
you need advanced models to turn a good profit, since humans rely mostly on luck and crypto
which can't be taught by a course
i know a course that has machine learning and involves some projects that are for frauds in finances
but its in R lang
u interested
yeah sure, maybe i can adapt it for python
Yeah i know that but trading requires lots of attention and maybe with a bot, less info may slip by
you sound like....do you know the basics of ML?
The other day i was 8 hs in front of the pc and the minute i go to the supermarket things happened haha
Yeah im learning, just started but i know the basics
because there was a guy here the other day who asked the same question
and you can guess, he is a shitposter
anyways, that's not how trading works
trading won't require a lot of "attention" per se unless you are doing day trading- which you shouldn't at all since it's very risky
its data science for r by harvardx on edx
but you should have a very specific usecase for such a "bot" for it to be actually helpful to you
in the end, it depends on what exactly you are trying to automate and to what extent
@stiff drift
I do day trading and I rely solely on myself. But it is tiring, and I want something like a machine backup just in case I miss something, you know
I will try finding some courses on udemy or the book i mentioned before. Thxs everyone
@stiff drift maybe you can try to write up some heuristic rules for what counts as "something happening" (e.g. price movement above a certain threshold) and then encode those in a simple program, just following rules, no fancy AI stuff
after all, machine learning often amounts to trying to capture human intuition and reasoning in a machine
starting with simple rules is often the best way to go
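As a rough sketch of what such a rule could look like: flag every step where the price moved more than some threshold relative to the previous price. The 2% threshold, the function name and the prices are all made up for illustration.

```python
def flag_moves(prices, threshold=0.02):
    """Return (index, relative_change) for steps that moved more than threshold."""
    flagged = []
    for i in range(1, len(prices)):
        prev = prices[i - 1]
        if prev == 0:
            continue  # relative change is undefined, skip this step
        change = (prices[i] - prev) / prev
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged


prices = [100.0, 101.0, 104.5, 104.0, 99.0]
alerts = flag_moves(prices)
for index, change in alerts:
    print(f"step {index}: moved {change:+.2%}")
```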
Yeah, i ve done that. But maybe applying ML i build the most profitable bot ever hahah, just let me dream
thanks !
i will research if anyone is interested dm
start with trying to match what you, a human, currently do
then worry about making it better than what a human can do
the 1st one is already very hard, the 2nd is exponentially harder
hahah yeah, will try my best
i am an amateur. can anybody help me?
On what?
what is the performance impact of copying a value to a variable
example:
```python
for x in list1:
    y = self.data["Pallets"].get((t, i, f), 0) + 1
```
is what I am currently doing
but what if I did this instead:
```python
for x in list1:
    d = self.data["Pallets"].get((t, i, f), 0)
    y = d + 1
```
I think it makes the code more readable but my code should also be performant, is the impact of assigning the value to d negligible?
It is negligible but it's not exactly the right channel
Hello, A beginner question on ML, how do I know which model needs feature scaling & which doesn't?
normalisation you mean?
yeah, although normalization is just one feature scaling technique, right? there a few more
you can research if that particular variant handles scaling (or doesn't need it) like NN's always need normalization
like with N.B, you don't normalize the probs
Is there a way of knowing which model requires it and which don't? like a thought process through which I can figure out for myself, or some kind of general rule of thumb?
just test on test set
if its too slow or too biased then it will require normalisation
you know it on your own most times (like when you know your algo in-depth) but for some, there reasons are based mostly on practical observations, later supplemented by theory
and testing will tell you and normalisation is overall very much used......saves time too
what has testing got to do with normalization?
does anybody know how to use this pandas method? https://pandas.pydata.org/docs/reference/api/pandas.Series.get.html
it always returns my default value, I am using a multi-index with a tuple like this:
self.data["Pallets"].get((t, i, f), 0)
alright, thanks a ton @grave frost and @mint palm.
isnt it that some normalisation optimize how much portion of activation function we use
that do affect learning process
no, from my limited knowledge, like in NN's large values lead to larger gradients. no need to boost a grad's worth if it's not actually contributing. it would just cause unnecessary bias
i too have just started it i dont know for sure.....so wont debate much😆
but for time i can say for sure it would fasten learning
it might a little bit, but that is very insignificant. the primary purpose is to not create undue biases
see this, it causes insufficient learning due to improperly scaled params
that's not insufficient learning, it's when due to lack of normalization, your gradient keeps on increasing blowing up to infinity and giving nans. that's why we normalize
ya that cause some of details in input do unnoticed
that leads to more error when testing model
???
when some param vanish
😒 i know its for analogy......then just become soo tiny they are too small to be affected by further function application
I think you are using the terminology wrongly
ya maybe
The parameters of a neural network are typically the weights of the connections. In this case, these parameters are learned during the training stage. So, the algorithm itself (and the input data) tunes these parameters. The hyper parameters are typically the learning rate, the batch size or the number of epochs.
simple definition
bruh, stop spamming the same message
lol
bro once in 15min is hardly spamming
ya take help
I already have a channel
then it's cross-posting, which is even worse
🤷
bro quit spamming me
now my message has lost visibility, thanks a lot...
aight then, if you get muted due to cross-posting, don't blame me
🤣
alright if you get muted for going off-topic, don't blame me
just delete previous ones and post one last time.....we are stopping the chat
does anybody know how to use this pandas method? https://pandas.pydata.org/docs/reference/api/pandas.Series.get.html
it always returns my default value, I am using a multi-index with a tuple like this:
self.data["Pallets"].get((t, i, f), 0)
and when I print(self.data["Pallets"].keys()) I get the expected output
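A small reproduction sketch of `Series.get` on a MultiIndex, with a made-up three-level index named after the `(t, i, f)` tuple above. `get` returns the default whenever the lookup raises a `KeyError`, so one thing worth checking is whether the tuple elements have exactly the same types as the index levels (for example an int level looked up with a string).

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 2, 3), (1, 2, 4), (2, 0, 1)],
    names=["t", "i", "f"],
)
pallets = pd.Series([5, 7, 9], index=index)

found = pallets.get((1, 2, 3), 0)          # key exists in the index
missing = pallets.get((9, 9, 9), 0)        # key absent, falls back to the default
wrong_type = pallets.get(("1", 2, 3), 0)   # "1" is a str, the level holds ints

print(found, missing, wrong_type)
print(pallets.index.dtypes)                # dtype of each level, useful for comparing
```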
QQ: If both the training set and eval set have the same ratio for the unbalanced classes, should I deal with the imbalance? (I am too lazy)
Funny, never even thought about this 🙂 Any ideas?
BTW My aim is just to get a good acc on the test set. Generalization be damned
feel free to @ me or dm if you see this and can help
i recommend standardization (centering + scaling to unit variance) for all your unbounded numerical features in pretty much any model
for bounded features, normalize to [0,1]
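Both transforms sketched with plain numpy, column by column. This assumes no column is constant (otherwise the division is by zero) and, in a real pipeline, the statistics would be computed on the training set only and reused for the test set. The function names are illustrative.

```python
import numpy as np


def standardize(x):
    """Center each column and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)


def min_max_scale(x):
    """Rescale each column linearly to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)


X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z = standardize(X)
M = min_max_scale(X)
print(Z.mean(axis=0), Z.std(axis=0))
print(M.min(axis=0), M.max(axis=0))
```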
How can i resize image to bounding box? in pytorch
i mean dynamically set (xmin, ymin, xmax, ymax) values to all images. I think transforms.Resize() can help me, but Resize() only takes two arguments and its not accurate to bounding box
seems like there are no way to crop it with pytorch, i'll use Pillow lib
So I am doing fine-tuning with images, and I had a quick question.
The keras docs suggest that I should freeze my base model (not importing its top part), and train the classifier placed at the end of the model. Then they would fine-tune the whole model with SGD and a slow LR.
But intuitively, I was thinking that I would freeze the classifier 'layers' at the end of the model, allow the base model to train and learn the features from my specific dataset; then I would freeze the base model (which would perform feature extraction) and fine-tune with the same recipe on my custom/target dataset.
Why don't we do the second method, as opposed to the first?
How to add machine learning model to discord bot?
For example, i have a model to predict cats, and i want to implement this predict model to bot
because if you have frozen model weights, those model weights will still propagate information backwards through the network; in this case, erroneous information
if you want to train them separately, maybe use an autoencoder or something first and then put logistic regression or something on top of the low rank representation
but thats no better (and probably worse) than just doing it the way keras recommends
wait, so if I set trainable=False the model weights still get updated?
no, but the frozen model weights still affect the gradient, which affects the weight updates for the trainable weights
hmm...so if I want to use the base model for feature extraction, what do I do?
should I train it seperately, then load that checkpoint for the base model?
I had the base model for feature extraction, with a small CNN as a classifier. now, is there somehow a way to remove that CNN classifier at the bottom, only train the base model and store it?
because if I include the CNN classifier, I won't be able to load it since keras doesn't recognize why there are weights for layers not in the base model - so it errors out
can you do "everything but the last layer"?
like, include the convolutional layers from your model, but exclude the fully connected stuff at the end
you mean train the base model only, with no other layers? on my source dataset
what is the base model
and can you link to the keras doc recommendation? im curious what their wording is
are you using the imagenet version or training from scratch
Im not including the FC classifier at top, so that flag is False.
yeah, imagenet
gives a better initialization 🤷
so imagenet weights as a starting point, then somehow modify the base model to learn features from my own source dataset. freeze it up, add FC layers and train those FC Layers on my target dataset
yeah. so train efficientb0 + fully-connected on your data, then: 1) freeze the efficientb0 and re-train the fully-connected layer at the end for better accuracy, and 2) go use the refined efficientb0 network for feature extraction elsewhere
step 0 and 1 are on the same "big" dataset, right? and step 2 I can use it on my target dataset?
that seems right, but what are you doing with the target dataset?
classification - it's composition should be slightly different than my main dataset, and it's small too
in that case, can you zero out and/or fine-tune the fully connected weights?
i think that should be roughly equivalent to taking the features from the efficientb0 part and stacking a separate model on top
yea
can you zero out and/or fine-tune the fully connected weights?
the only thing I can think of, is to freeze those layers.
yeah, that's the guide I was referring to
and yes either method is valid
so, to summarize. I would have efficientnetb0 + F.C connected initially. I freeze the F.C Layers, and train on the base model alone......?
I think what I want is to somehow train the whole effnet+F.C on my source dataset, but export weights only for the effnet. then load it somewhere else, and train on target dataset
So after reading up, it does seem keras provides two handy functions get_weights and set_weights to get weights of individual layer, and save them as numpy arrays to be loaded later (from an instantiated layer). Hopefully, with a bit of a luck I might be able try it.
Yes, you want this:
I think what I want is to somehow train the whole effnet+F.C on my source dataset, but export weights only for the effnet. then load it somewhere else, and train on target dataset
Wait
No
You want the opposite
Yes
All the deep stuff requires a lot of data
So you train that part on the big data set
Are you saying that you think the extracted features should be different between your big data set and the target data set?
Which is why you would want to retrain the deep layers on the target?
I am saying that the CNN layers would be able to extract the features from the smaller dataset better, if they knew what to look for from the big dataset
It just doesn't make sense to change what the deep stuff emits but freeze the classifier on top
Imagine you don't retrain the model but randomly rescale the final hidden layer outputs
Then your classifier weights will all be meaningless and your classifier will produce garbage
hmmm...so you are saying, that I keep the effnet freezed up and train my F.C on it?
Effnet + F.C gives me a decent accuracy. --> if I keep the Effnet there with the same weights, then it would extract the same features, right?
Then I just need to re-train the F.C on the new dataset (to learn to make sense of features from new dataset and use it to predict slightly different classes [which would be accomodated by the activation function]) shouldn't that theoretically work?
yes
yes
Hello. I have a question. I want to make bar plot from values of data frame, as you can see of screen. And the value I want to showed on the plot is on the bottom of picture. So I want percentage for every kind of education level. Is there any looping idea to make it happen without manually typing the differences? If its wrong chat for such a question, I'm sorry in advance.
that's literally the description of transfer learning @grave frost
aight. Then there is another slightly different method I can do. train the whole effnet + F.C shebang. then just use SGD with lower Lr on the new dataset?
you're fine-tuning the effnet model on your big data set, then using that fine-tuned model for transfer learning on the target data set
im not sure how thats different
but that's why I said earlier 😖
no. you were saying the opposite... or so i thought
where you freeze the fc layer at the end and update only the effnet parts, which makes no sense
doing LR slowly on the whole model, taking extra assumption that the weights used in source dataset won't differ too much in the target one - whatever will, would get slowly updated
that's what they do in the guide link above tho
```python
from tensorflow import keras

base_model = keras.applications.Xception(
    weights="imagenet",  # Load weights pre-trained on ImageNet.
    input_shape=(150, 150, 3),
    include_top=False,
)  # Do not include the ImageNet classifier at the top.
# Freeze the base_model
base_model.trainable = False
# Create new model on top
inputs = keras.Input(shape=(150, 150, 3))
# The base model contains batchnorm layers. We want to keep them in inference mode
# when we unfreeze the base model for fine-tuning, so we make sure that the
# base_model is running in inference mode here.
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dropout(0.2)(x)  # Regularize with dropout
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
model.summary()
```
yeah i just saw that
huh... so they are using a very low learning rate to update (not re-train) the base model, freezing the fully connected output model?
then how is that supposed to work? if they freeze their feature extractor...then wouldn't it all just break down due to lack of features
i dont think thats whats happening in this code
i think its the opposite
it looks like they are freezing the base model, and only training the output layer (as well as their other stuff on top)
oh, so you are saying they are taking imagenet as a good point, and slowly updating to fit their dataset
hold on
back up
there are 2 things happening here:
- transfer learning: freeze the base model, train a new model on top
- fine-tuning: after step (1), un-freezing the entire model and running a few more epochs with a very low learning rate
at no point are they freezing the new layers and training only the base layers
yeah, that's cause the imagenet initialization is in the domain of their problem. I have to re-train the feature extractor to fit my own problem, which is nothing like imagenet
but you have 2 datasets right? a big one and the target one?
and those are at least similar in domain?
however, it seems imagenet is lucky for this dataset. so I just start it as a pseudo random initialization to learn features from the biggie dataset
df3.groupby(level='parental_level_of_education')['lunch'] \
.apply(lambda y: y / y.sum())
maybe something like this?
y
- (primary training) train base+new on data A, as if from scratch, until convergence
- (transfer learning) freeze base, re-train new on data B until convergence
- (fine-tuning) unfreeze base, update base+new on data B with very low learning rate
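The three stages above can be sketched with Keras roughly like this. Everything here is a stand-in: the tiny conv `base` replaces a real pretrained network such as EfficientNetB0 with `include_top=False`, the learning rates are placeholders, and the `fit` calls are left as comments because the datasets are not part of this log. Batch-norm handling (keeping the base in inference mode) is left out to keep the sketch short.

```python
from tensorflow import keras


def build(num_classes, input_shape=(32, 32, 3)):
    """Return (base, model): a feature extractor plus a new head on top."""
    # Tiny stand-in for a pretrained base network
    base = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            keras.layers.Conv2D(8, 3, activation="relu"),
            keras.layers.GlobalAveragePooling2D(),
        ],
        name="base",
    )
    inputs = keras.Input(shape=input_shape)
    features = base(inputs)
    outputs = keras.layers.Dense(num_classes, name="head")(features)
    return base, keras.Model(inputs, outputs)


def compile_stage(model, learning_rate):
    """Recompile after changing `trainable`, so the change takes effect."""
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )


base, model = build(num_classes=5)

# 1) primary training: base + head both trainable, on data A
base.trainable = True
compile_stage(model, 1e-3)
# model.fit(data_a, epochs=...)

# 2) transfer learning: freeze the base, train only the head on data B
base.trainable = False
compile_stage(model, 1e-3)
frozen_count = len(model.trainable_weights)
# model.fit(data_b, epochs=...)

# 3) fine-tuning: unfreeze everything, very low learning rate on data B
base.trainable = True
compile_stage(model, 1e-5)
unfrozen_count = len(model.trainable_weights)
# model.fit(data_b, epochs=...)

print(frozen_count, unfrozen_count)
```

If data B has different classes than data A, the head would be replaced with a fresh `Dense` layer before stage 2 rather than reused.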
I wanna do 2 & 3
just with training base again
to recognize features from my domain, not imagenet
think of it as initializing base with no weights
but you are only using imagenet as initialization
yes, as a pseudo-random init
so im not sure what your hangup is
just coz it lets me converge to my needed features better and faster
but why arent you doing 1?
very low accuracy
then maybe imagenet isnt good initialization after all?
why do you expect better accuracy if you dont train on the big dataset?
none initialization gives 6% less
don't ask me why 🤷
If I have a large number of files with the same number of values that I want to average, like I want to average a large number of curves, and some of those files have NaN as their value in some data points, can I still use (X+Y+Z)/N
I do train on the big dataset, but just to get my feature extractor to start extracting features relevant to my problem.
then using the trained F.T, I extract features from small one, and train another F.C from scratch to classify my small dataset
i still dont see how this is different from the 1-3 steps i outlined
all i am saying is, dont freeze the fc network at the top and unfreeze the cnn at the base, and expect useful results
(#1 + #2 + #3 + #4 + ...) / n, where n is the number of files. say those files have arrays [#,#,#,#,#,#, ...] and some of them have [#,#,#,#,NaN,#,#,NaN]. Can i still obtain an average curve out of those
or do I have to not use the NaN files
alright, it's kinda similar now that you point it out
you have to remove the missing values, yes
there is something called "imputation" for missing data in more advanced applications, but that will not help you here
ah damn
you can't produce new information where no information exists
Thanx a ton for the guidance @desert oar 👍 🚀
you're welcome, good luck
i was hoping there was like normalizing function to zero out that certain position in an array and somehow still do the whole array
but that makes sense
well you can tell numpy to omit the missing values for you
but thats just a convenience, its still removing them
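A minimal sketch of that convenience with `np.nanmean`, on made-up curves. Each position is averaged over only the files that have a value there, so the effective N differs per position, which is worth keeping an eye on.

```python
import numpy as np

# One row per file, one column per data point (made-up values)
curves = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, np.nan, 5.0, 8.0],
    [5.0, 6.0, np.nan, 6.0],
])

avg = np.nanmean(curves, axis=0)                 # NaNs are left out of each column's mean
counts = np.sum(~np.isnan(curves), axis=0)       # how many files contributed per position
print(avg)
print(counts)
```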
I was thinking in a way like the scatter plot, you can scatter plot an array with NaN
values
but visually seeing something and averaging are different
yeah
same thing here
Exactly what I wanted, thank you very much and sorry for the trouble.
```python
for file in files:
    # for x in set(ids):
    #     if file.startswith(str(x)):
    if file.endswith("_MID-R1-ECG.1D_hrv.txt"):
        full_name = pathlib.Path(root) / file
        try:
            read_fname = full_name
            data = np.loadtxt(read_fname)
            avg = sum(data)/float(len(data))
            np.savetxt("Average-MID-hrv.txt", np.array(data))
        except Exception as e:
            print(e)
```
it doesn't look like the Average file it writes out is an average; rather it is just the same values as the second MID-R1 file. I only have two files with that name in the folder, to see if the averaging works for now
in my avg=sum(data)/float(len(data)) line, do I need to put something else other than data so it grabs all of the files that match the data requirements (currently only 2 just to test)
what is data? If it's an array, why are you writing your own formula to get the average instead of using the array's methods?
actually it's np.mean rather than an array method.
Data is the input from np.load txt which is input from read_fname
so what type is data?
And I made the for loop to be looking a number of them.
An array
1 d array with like 10 ish values
so you'll want to use avg = np.mean(data).
So that will automatically take the multiple data files that I'm reading?
you want to take multiple files, and do what?
concatenate the arrays from each one?
what does it mean to take the average of all those arrays? do you want the output of that operation to be an array, or a single number?
To create an average file of one array with the 10 values, so each value is average of all files
Hi, In the help forum does anyone know Numpy / pandas
This is the right channel. Go ahead and ask your question.
>>> a = np.array([1, 2, 3])
>>> b = np.array([4, 5, 6])
>>> np.mean([a, b], axis=0)
array([2.5, 3.5, 4.5])
Or this--the effect is the same
>>> a = array([[1, 2, 3], [4, 5, 6]])
>>> np.mean(a, axis=0)
array([2.5, 3.5, 4.5])
axis=0 is the key.
Yeah, so np.mean(data) will know to take all of the files that match My requirements?
no, np.mean assumes that you pass what you want to it.
I'd have to use append then somehow
?
Like data =data.append under the original data
you'd have to make a 2d array with one row per file and then use np.mean along an axis
with that layout, axis=0 averages across files at each position (what you described), while axis=1 gives one average per file
Wouldnt utilizing append on data make it appends each new input in data
Since I'm looking at multiple files with the for loop
So would data becomes a 2d array after I append it
keep in mind that append operations on arrays create a new array. You can append to lists if you want continuity.
It looks like Numpy might even handle it the same way
[#,#,#,#,#,#] would be an array not a list though right
depends on the context. Lists are a data structure that come with Python. Arrays come from numpy.
anyone know how to graph an equation (3d) with matplotlib?
!e
import numpy as np
a = [1, 2, 3]
b = [4, 5, 6]
print(np.mean([a, b], axis=1))
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[2. 5.]
Looks like numpy can give you the expected behavior using lists.
So doing np.mean(data) will be doing np.mean(datafile1,datafile2,datafile3,etc) with the way I have it reading in? I feel like I have to use append still for it to do that
if you mean np.mean(data, axis=1) where data is a list of lists of ints and you do axis=1, then yes.
you would need to append each sub-list to data before np.mean(data, axis=1) is calculated.
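To make the axis question concrete, here is a small sketch of "collect one array per file, then average". The two arrays are hypothetical stand-ins for what `np.loadtxt` would return, and `average_curves` is just a name made up for this example. With one file per row, `axis=0` gives one value per position (the "average file" described above), while `axis=1` gives one value per file.

```python
import numpy as np

def average_curves(arrays):
    """Stack equal-length 1D arrays (one per file) and average position-wise."""
    stacked = np.vstack(arrays)    # shape: (n_files, n_values)
    return stacked.mean(axis=0)    # shape: (n_values,)

# Hypothetical stand-ins for np.loadtxt(...) results from two files.
file_a = np.array([1.0, 2.0, 3.0])
file_b = np.array([4.0, 5.0, 6.0])

per_position = average_curves([file_a, file_b])
per_file = np.vstack([file_a, file_b]).mean(axis=1)
print(per_position)  # one value per position in the files
print(per_file)      # one value per file
```

In the loop from earlier, that would mean appending each loaded `data` to a plain list inside the loop and calling the averaging (and `np.savetxt`) once, after the loop has finished.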
Hi guys, does anyone knows a great tutorial/course about Recurrent Neural Networks with LSTM, I did an udemy course about it, but I stuck at the predictions part
This is so succinctly simple and I've never known about it omfg...
import matplotlib.pyplot as plt
import numpy as np
import sympy
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
x = np.linspace(-5,5,500)
y = np.linspace(-5,5,500)
x,y = np.meshgrid(x,y)
e = sympy.solve('x**2/4+y**2/9+z**2/16-1',sympy.Symbol('z'))
z = -2*np.sqrt(-9*x**2 - 4*y**2 + 36)/3
ax.plot_surface(x, y, z)
plt.show()
this isn't right
not sure why
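One possible reason the plot above looks wrong: the `z` expression is only the negative square root, so it draws just the bottom half, and on a -5..5 grid the square root is NaN wherever 9x² + 4y² > 36. A sketch of one way around that is to parametrize the surface with two angles instead of solving for z. The function names here are made up for the example, and the semi-axes 2, 3, 4 come from the equation x²/4 + y²/9 + z²/16 = 1.

```python
import numpy as np

def ellipsoid_surface(a=2.0, b=3.0, c=4.0, n=60):
    """Return X, Y, Z grids on the surface x^2/a^2 + y^2/b^2 + z^2/c^2 = 1."""
    u = np.linspace(0.0, 2.0 * np.pi, n)   # angle around the z axis
    v = np.linspace(0.0, np.pi, n)         # angle down from the +z pole
    u, v = np.meshgrid(u, v)
    x = a * np.sin(v) * np.cos(u)
    y = b * np.sin(v) * np.sin(u)
    z = c * np.cos(v)
    return x, y, z

def plot_ellipsoid():
    # Imported here so the surface helper can be used without matplotlib.
    import matplotlib.pyplot as plt
    x, y, z = ellipsoid_surface()
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(x, y, z)
    plt.show()
```

Calling `plot_ellipsoid()` should show the whole closed surface, since every grid point lies on the ellipsoid by construction.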
Anyone have experience with model selection for binary independent variables and binary dependent variables? Thinking logit regression, but I'm actually trying to find which independent variables are most important and together have >70 accuracy, chi
12:00pm can't sleep thinking about projects.
Can anyone help me with kernel not found Problem ?
When i try to open jupyter-notebook it doesnt start,
- Something Error 500 is thrown back
- when i open jupyter-lab and create new book, The kernel doesnt respond.
- when i open a existing notebook, It works
Now the problem is im not able to use jupyter in anaconda for creating new notebooks
First Screen on launching Jupyter Notebook
On creating new NOTEBOOK
Can anyone help me with fitting data to my model?
This is the data
```py
le = sklearn.preprocessing.LabelEncoder()
date = le.fit_transform(list(data["Date"]))
_open = le.fit_transform(list(data["Open"]))
high = le.fit_transform(list(data["High"]))
low = le.fit_transform(list(data["Low"]))
adj_close = le.fit_transform(list(data["Adj Close"]))
volume = le.fit_transform(list(data["Volume"]))
X = list(date)
y = list(zip(high, low, _open, adj_close, volume))
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
```But when I try to fit the data into the model as displayed below```py
linear = sklearn.linear_model.LinearRegression()
linear.fit(x_train, y_train)``` I get this error ```powershell
ValueError: Expected 2D array, got 1D array instead:
array=[2088 311 1839 ... 2422 64 1705].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.``` Thanks
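A small sketch of what the error message is asking for, assuming the single feature really is meant to be the one flat list `X`. The values below are taken from the error output plus a few made-up ones. Estimators like `LinearRegression` expect `X` with shape (n_samples, n_features), so a flat list of n values needs to become an n x 1 array before the train/test split and `fit`.

```python
import numpy as np

# Hypothetical encoded feature values, as a flat list like X in the snippet.
X = [2088, 311, 1839, 2422, 64, 1705]

# Turn the 1D list into a 2D column: one row per sample, one column per feature.
X_2d = np.asarray(X).reshape(-1, 1)
print(X_2d.shape)
```

Passing `X_2d` instead of `X` to `train_test_split` should then give `x_train` and `x_test` in the 2D shape that `fit` accepts.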
hello, is there any chance there is someone here who knows Littlewood's rule? I am absolutely desperate to understand how to apply this rule to a question I have been given for an assignment
any help is extremely appreciated ❤️
curious if using the GPU for tensorflow makes it more "even" compared to the CPU.
And if that's what makes it faster?
need help
I have df which generate from pqsql, I can download it on my local system
but I want to create one url so when user visit that link script should run and output can be saved on user side
Anyone know about Spectral Clustering? Just want to list the steps I think should happen and have someone be like: "yeah that's right" or "that bit's wrong"
RuntimeError: stack expects each tensor to be equal size, but got [3, 47, 47] at entry 0 and [1, 47, 47] at entry 5
Does this mean that my image isn't RGB?
```py
# Creation of histograms (features)
temps1=time.time()
def build_histogram(kmeans, des, image_num):
    res = kmeans.predict(des)
    hist = np.zeros(len(kmeans.cluster_centers_))
    nb_des=len(des)
    if nb_des==0 : print("problème histogramme image : ", image_num)
    for i in res:
        hist[i] += 1.0/nb_des
    return hist
# Creation of a matrix of histograms
hist_vectors=[]
for i, image_desc in enumerate(imagesarray) :
    if i%100 == 0 : print(i)
    hist = build_histogram(kmeans, image_desc.reshape(-1, 1), i) #calculates the histogram
    hist_vectors.append(hist) #histogram is the feature vector
im_features = np.asarray(hist_vectors)
duration1=time.time()-temps1
print("temps de création histogrammes : ", "%15.2f" % duration1, "secondes")```
Hello guys, I don't know why this code sometimes works and sometimes keeps looping again and again.... can anyone help?
Thanks
so the for i, image_desc in enumerate(imagesarray) : loop is iterating more times than you expected?
Yes exactly
I just ran it again and it just continues to loop; it has only worked one time
It worked one time and not the second time! Only the number of pictures is different, but it's not even a question of too many pictures, because sometimes it doesn't work with fewer pictures either...
```py
# Creation of histograms (features)
temps1=time.time()
def build_histogram(kmeans, des, image_num):
    res = kmeans.predict(des)
    hist = np.zeros(len(kmeans.cluster_centers_))
    nb_des=len(des)
    if nb_des==0 : print("problème histogramme image : ", image_num)
    for i in res:
        hist[i] += 1.0/nb_des
    return hist
# Creation of a matrix of histograms
hist_vectors=[]
for i, image_desc in enumerate(imagesarray) :
    if i%100 == 0 : print(i)
    hist = build_histogram(kmeans, image_desc.reshape(-1, 1), i) #calculates the histogram
    hist_vectors.append(hist) #histogram is the feature vector
im_features = np.asarray(hist_vectors)
duration1=time.time()-temps1
print("temps de création histogrammes : ", "%15.2f" % duration1, "secondes")```
try this ones @hushed wasp
you are getting the error because you are returning hist outside of the for loop
that's why you are getting the hist of the first input data
no, check for loop
but I ve got the same looping over and over
show full error and script
I don't really get an error, it just keeps running and then crashes
however when it's working it calculates the histograms in just a few seconds
I adapted the code from a SIFT extraction that I try to use with some CNN
Working I only get this :
I just reran the exact same code and now I just get iterations again and again, changing absolutely nothing... (in code and data)
in my opinion you should check line 11
I will!
Thanks for giving me some of your time @limpid oak
@limpid oak, do you have a sec to advise on the below? I have survey data that has independent variables that are all binary, as well as dependent variables that are binary. I am thinking logistic regression for my model of choice, but I am wondering what would be the best way of finding which independent variables are the most important predictors?
I have this matplotlib chart, how can I get two y axes, one for each line?
import matplotlib.pyplot as plt
import matplotlib.dates
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b'))
plt.plot(df.when, df.total, ".-", label="# Commits", color="black", linewidth=.5)
plt.plot(df.when, df.pctcon, label="% Conventional", color="green", linewidth=4)
plt.legend()
plt.show()
ax2 = ax.twinx()
Plot the second one with ax2
can you tell me more about how to do that?
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b'))
ax.plot(df.when, df.total, ".-", label="# Commits", color="black", linewidth=.5)
ax2 = ax.twinx()
ax2.plot(df.when, df.pctcon, label="% Conventional", color="green", linewidth=4)
fig.legend()
fig.show()
like this?
Yes
Beautiful, thanks!
Hmm, the legend is a little borked like this. It appears, but i get a warning also?
UserWarning: Matplotlib is currently using module://ipykernel.pylab.backend_inline, which is a non-GUI backend, so cannot show the figure. fig.show()
this is in a jupyter notebook
oh, use plt.legend and plt.show i guess. maybe theres some subtle difference
%matplotlib inline in the jupyter notebook should tell matplotlib to plot in the notebook and not elsewhere, but maybe plt.show does that automatically while fig.show doesn't
plt.legend removes the warning, but now only one line is mentioned in the legend.
matplotlib is weird....
ah yep, fig.show is lower level and you should use plt.show https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.show
not sure why the legend wouldnt detect this automatically, but you can manually specify the lines to be used in the legend
import matplotlib.pyplot as plt
import matplotlib.dates
fig, ax = plt.subplots()
fig.set_size_inches(12, 8)
ax.xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%b'))
line1, = ax.plot(df.when, df.total, ".-", label="# Commits", color="black", linewidth=.5)
ax2 = ax.twinx()
line2, = ax2.plot(df.when, df.pctcon, label="% Conventional", color="green", linewidth=4)
plt.legend([line1, line2], [line1.label, line2.label])
plt.show()
im going based off this here https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html#matplotlib.pyplot.legend but ive never had to do this before
can't find .label, but i can repeat the strings.
sorry, use .get_label()
Nice! Now I want to force ax2 to go 0-100
YAY! thanks 🙂
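For reference, a sketch of how the pieces from this thread could fit together: two y axes, one legend listing both lines, and the right-hand axis pinned to 0-100 with `set_ylim`. The three lists are made-up stand-ins for `df.when`, `df.total` and `df.pctcon`.

```python
import matplotlib.pyplot as plt

# Made-up data standing in for df.when, df.total and df.pctcon.
when = [1, 2, 3, 4]
total = [120, 340, 210, 400]
pctcon = [10.0, 35.0, 60.0, 85.0]

fig, ax = plt.subplots()
# ax.plot returns a list of lines, so unpack the single line it contains.
(line1,) = ax.plot(when, total, ".-", label="# Commits", color="black", linewidth=0.5)

ax2 = ax.twinx()
(line2,) = ax2.plot(when, pctcon, label="% Conventional", color="green", linewidth=4)
ax2.set_ylim(0, 100)  # pin the right-hand axis to 0-100

# Each Axes only knows its own lines, so pass both handles explicitly.
lines = [line1, line2]
legend = ax.legend(lines, [line.get_label() for line in lines])
```

Finish with `plt.show()` (or just let the notebook cell render the figure) to display it.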
navigating the matplotlib docs is not easy... these ax things are instances of matplotlib.axes.Axes https://matplotlib.org/stable/api/axes_api.html#the-axes-class and the fig thing is an instance of matplotlib.figure.Figure https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure
hey guys - when getting into algorithms I found a book intended for absolute beginners (Grokking Algorithms) which simplified concepts simply and built up. Are there any books similar to this (for beginners) which you guys would recommend?
alternatively, do you guys have any positive sentiment to platforms like dataquest?
"the dotted function is noisy in the diagram" is what i understood earlier but now the instructor says mini batch norm makes the z(tilde) more noisy due to using mean / variance for each mini batch seperatly......what does he mean here.....arent we actually just scalling z so should actually stable mini batches
This is driving me nuts, I'm trying to do a simple rolling average of some data. Blue line and points is the data, yellow line is supposed to be the moving average:
I don't understand why the yellow line would be equal to where the blue dot is on the right side of the graph. It makes no sense
This is my code:
sma10 = data['PX']/data['WOS'].iloc[::-1].rolling(10).mean()
shouldn't that be reading in ten datapoints and finding the average? I had to throw in a .iloc[::-1] because it was starting calcs from the left side
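A guess at what might be going on, with a sketch. In the one-liner, `.rolling(10).mean()` is a method call on `data['WOS'].iloc[::-1]`, so it runs before the division: the result is PX divided by a rolling mean of WOS, not a rolling mean of PX/WOS. Division between two Series also lines rows up by index label, which can undo the effect of the reversal. Assuming the intent is a 10-point moving average of the ratio, and using a made-up frame with the same column names:

```python
import pandas as pd

# Made-up frame with the same column names as the snippet above.
data = pd.DataFrame({"PX": range(10, 130, 10), "WOS": [2] * 12})

ratio = data["PX"] / data["WOS"]   # divide first ...
sma10 = ratio.rolling(10).mean()   # ... then smooth the ratio
print(sma10)
```

A rolling window looks backwards, so the first 9 results are NaN by default. If the rows are stored newest-first, sorting the frame into chronological order before computing the ratio is probably cleaner than reversing one column.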
how do i get my current working directory on the left like this on spyder
how do i convert this to decimal value 3,2 to 32.0
highlight the whole column
ctrl+r
go to the replace tab
replace , with nothing
press ok
import matplotlib.pyplot as plt
import pandas as pd
ra = pd.read_csv("ramen-ratings.csv")
new_ra5 = ra.loc[(ra["Stars"] != "Unrated")]
new_ra5["Stars"] = ra["Stars"].astype(float)
new_ra5 = ra.groupby("Country","Stars").mean()
new_ra1 = ra.groupby("Country").Stars.count()
for x in new_ra:
print(new_ra[x] / new_ra1)
for x in new_ra1:
print(x)``` I have this code, how do I fix the string to float: 'Unrated' error?
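For the 'Unrated' error in the snippet above: the float cast is applied to the unfiltered `ra`, which still contains the "Unrated" rows. One hedged way around it is `pd.to_numeric` with `errors="coerce"`, then dropping the rows that could not be parsed. The frame below is made up and only stands in for ramen-ratings.csv.

```python
import pandas as pd

# Made-up rows standing in for ramen-ratings.csv.
ra = pd.DataFrame({
    "Country": ["Japan", "Japan", "USA", "USA"],
    "Stars": ["5", "Unrated", "3.5", "4.5"],
})

# errors="coerce" turns anything non-numeric (e.g. "Unrated") into NaN.
ra["Stars"] = pd.to_numeric(ra["Stars"], errors="coerce")
rated = ra.dropna(subset=["Stars"])

mean_stars = rated.groupby("Country")["Stars"].mean()
n_ratings = rated.groupby("Country")["Stars"].count()
print(mean_stars)
print(n_ratings)
```

Grouping by "Country" alone and then selecting "Stars" gives one mean and one count per country, which looks like what the loop at the end of the snippet was after.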
but there are some instances of decimal values like 460,6 which should be 460.6, but removing the , would give 4606
you would have to identify those specific ones then
maybe put a 1 in the column next to those ones, then sort them
@fervent zenith excel can't distinguish 50,8 as 508 or 50.8 unless it has some more information
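If the values end up in Python rather than Excel, here is a tiny sketch of the conversion. It only works under the assumption that the comma is always a decimal separator and never a thousands separator, which is exactly the extra information mentioned above. `parse_decimal_comma` is a name made up for this example.

```python
def parse_decimal_comma(text):
    """Parse a number written with a decimal comma, e.g. "460,6" -> 460.6."""
    # Assumes the comma is the decimal mark and there are no thousands separators.
    return float(text.strip().replace(",", "."))

print(parse_decimal_comma("460,6"))
print(parse_decimal_comma("3,2"))
```

When loading a whole CSV with pandas, `pd.read_csv` also has a `decimal` parameter (e.g. `decimal=","`) that applies the same assumption to every numeric column.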
Hey guys
Does anyone know how to read the cluster centers of KMeans?
Like this is what I have
What does this mean exactly?
@shut slate your matrix had 10 columns originally?
each cluster center is a vector of 10 coordinates
Thank you. But how do I make sense of what the clusters are?
I have the clusters and dont know what to do with it
so why are you doing k-means clustering?
you want a price that corresponds to each cluster?
let's say you have N rows and P features, and you perform K means clustering. then the cluster_centers_ is a K x P array, where each row is a cluster center, and each column corresponds to one of your original data features.
so if price is the 2nd feature in your data, then price will be the 2nd element of each row of the cluster_centers_ array
Actually yeah, I don't know why I am doing the clustering. Here is why, now that I think about it: I want to know how the combinations of each feature correspond to price, I guess?
however note that you need all 10 elements to "describe" the cluster. you could have 2 clusters with similar mean prices, but very different values of the other features
So I guess my first problem is, Ok I clustered it into 3 clusters
Now what does that mean and how do I use it
lol
cluster analysis is fine as an exploratory technique, just keep in mind that k-means in particular tends to try to find equal-sized roughly-spherical clusters and won't necessarily give intelligent results unless you do more work to choose a suitable K
e.g. if you have completely random data it will still find K clusters for you, but those clusters will amount to basically segmenting the data into equal sizes and aren't so much "clusters" as they are "segments" with hard boundaries
I did the elbow analysis and it showed to do 3 clusters
ok, thats a reasonable place to start then
its still better to think of k-means output as "segments" rather than "clusters"
the two things you can do with k-means are:
- look at the cluster/segment centers
- determine which segment a data point belongs to, which amounts to finding the closest cluster center
the hue is the cluster
Yes, that should be the cluster
and as I can see is that from 1880 to 130s its one cluster
1930*
1940 to 1980 is 2nd
and the 3rd is 1980 to 2020
or am I talking non sense here
Yes, that is correct
So I just solved how it clustered?
And can you get Python to tell you what features mattered the most?
It depends on the classifier
in this case it's obvious from the plot that year built is mainly driving the clustering
that's possibly because the scales of the numbers are all off
for feature importance you can do anova, i think its a nice idea https://stats.stackexchange.com/a/77693/36229
or you can do the distance between cluster centers feature-wise, that's a nice idea too
you should probably standardize your data before doing k-means
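A rough numpy-only sketch of two points from this thread: standardizing the columns so a large-valued feature like year built doesn't dominate the distances, and assigning a point to a segment by finding the closest center. The data and the two centers are made up (the centers are just group means, not the output of a real k-means fit), and both function names are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def assign_segments(X, centers):
    """Return, for each row of X, the index of the closest center."""
    # distances[i, k] = Euclidean distance from row i to center k
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Made-up data: N=4 rows, P=2 features (think year built, price).
X = np.array([[1880.0, 100.0], [1890.0, 120.0], [2000.0, 300.0], [2010.0, 320.0]])
Z = standardize(X)

# K=2 hypothetical centers, one per row, one column per feature: shape (K, P).
centers = np.array([Z[:2].mean(axis=0), Z[2:].mean(axis=0)])
labels = assign_segments(Z, centers)
print(centers.shape)
print(labels)
```

The centers here live in standardized units, so to read them as years or prices they would need to be multiplied by the column standard deviations and shifted back by the column means.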
You're welcome
Most likely cudatoolkit
Since installing pytorch through conda means you're downloading the associated CUDA version, that means everything comes prepackaged, even CuDNN
but there hasn't been that major of a revamp to add 1.2GB to the Cuda toolkit
10 to 11 implies there's been one. I haven't checked the CUDA changelog however
Is there any obvious pattern on how to design a deep learning model?
Like how do you know what to set the parameters to?
nah, it's mostly guesswork and intuition. mostly, you try a set, see the result and adjust accordingly
how do you go about guessing though
there are so many parameters
because you don't need to tune them all to get a decent accuracy?
I see
idk if this is the place to ask but why does heroku install scipy? i had it in requirements.txt but i removed it and all but it still installs it
is it a dependency of another library you are using?
i narrowed it down to this:
discord.py==1.7.2
pafy==0.5.5
praw==7.1.0
prawcore==1.5.0
premailer==3.7.0
protobuf==3.15.1
pycparser==2.20
pylint==2.6.0
python-dateutil==2.8.1
requests==2.25.1
yagmail==0.14.245
youtube-dl==2020.12.31
youtubepy==6.0.2```
i deleted all the scientific stuff but it still downloaded it
who thought this was a good idea
ayy
hmm...is it a problem if it downloads scipy?
hey i would like to copy a dataset based on another but not directly. e.g. per column the distribution should remain and also conditional probabilities.
My approach so far is to see which column can be described exactly with the fewest conditions (Prob(A=1|B) = 1) and then set this column. Iteratively repeat until all columns are described.
The problem is that the column distribution may be destroyed. Does anyone have a better idea? The goal is to create a more anonymous dataset, which still has the best possible quality.
Sound source separation challenge by Sony with an exclusive dataset created just for the challenge.
Pretty neat baselines to start with. 10,000 Swiss Francs prize.
Any one interested to participate?
https://www.aicrowd.com/challenges/music-demixing-challenge-ismir-2021?utm_source=discord&utm_medium=python&utm_campaign=sony
I'm not sure which channel is appropriate for that, but has anybody managed to parse the insides of a PDF document using nothing but raw Python?
Hey is anyone here good at PyTorch... I need help changing my code from Theano to PyTorch
I have some PyTorch knowledge but I never touched Theano
Yesterday (on the 6th of May) was our UbiOps Release Update Webinar.
During the session, Anouk Dutrée, Product Owner at UbiOps, gave a demo of several new features, which we have recently added to our platform.
She demonstrated how to deploy R code in UbiOps, set up monitoring emails and more. In case you missed the live session, we got you cov...
This doesn't sound like a bad idea. how many features are in the dataset? Maybe you could try to approximate the joint distribution directly
I believe a VAE can do that
However if you ask on https://stats.stackexchange.com you might get more interesting and helpful answers
this might be an xy but i'm going to explain this as best i can
i am using pytorch and trying to create an lstm that takes a character and maps it to another, but i'm struggling a little with the representation of characters
everything i've seen encodes characters with onehot vectors, however i'm wondering why class labels aren't used instead? i.e. an integer 0-25 each one representing a letter, possibly a 26th representing padding
another issue is that i am trying to use Cross Entropy Loss and the target seems to be required to be class label encoded instead of onehot, so why not just use class labels in the first place?
i'm a little lost :P
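One way to look at the one-hot vs. class-label question, sketched in plain numpy rather than PyTorch. Feeding the raw integer in as a single number would tell the model that "b" sits between "a" and "c" in some meaningful way. A one-hot vector avoids that, and multiplying a one-hot vector by a weight matrix is the same as looking up one row of that matrix by its label. As far as I know this row lookup is what an embedding layer does with integer inputs, so class labels can be used on the input side too, just as indices rather than as magnitudes. The sizes and the random matrix below are assumptions for illustration.

```python
import numpy as np

vocab_size, hidden = 27, 4   # 26 letters + 1 padding index (assumed sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, hidden))   # stand-in for a learned input weight matrix

idx = np.array([7, 4, 11, 11, 14])   # "hello" as class labels 0-25

one_hot = np.eye(vocab_size)[idx]    # shape (5, 27), one row per character
via_one_hot = one_hot @ W            # matrix multiply with one-hot rows
via_lookup = W[idx]                  # plain row lookup by class label
print(np.allclose(via_one_hot, via_lookup))
```

On the output side the labels are only used to pick which predicted probability to score, which fits with the loss taking class indices as targets.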
For anyone that used jupyter notebook. Does jupyter notebook run better if your pc is more powerful since jupyter notebook runs on your browser
it runs in your browser but it still uses local resources, it just runs on a web server on your pc
so yes
its not using any external server to run the code or anything
The computation power of the machine you're running on is what will determine performance, regardless of whether you're using jupyter or a repl or what have you.
Alright thank you
for the record, I would strongly discourage you from using jupyter notebooks unless you have a lot of experience programming Python without them: https://datapastry.com/blog/why-i-dont-use-jupyter-notebooks-and-you-shouldnt-either/
Ohh ok will look into that
I'm starting to see notebooks show up in intro python classes (like, programming fundamentals... not even data science courses) and I just cringe. It's unnecessary and it gets in the way.
In my experience helping people who are enrolled in university Python courses, they are very very bad.
If they don't teach them to use notebooks, they teach them to write getters and setters. You can't win.
Agreed. I teach a python intro course at a US university and I'm sticking with replit.com.
Is the thinking there that teaching them to setup Python on their machine would be too cumbersome?
(Because I wouldn't blame you if that's the thinking.)
We have one class for both of these students: "I can't even navigate a file system from the command line" and "I actually know how to code already"
I think the reason for that would be to just stick to the fundamentals of the course so it can cater to both spectrums
One legitimate use is for "literate programming" homework assignments that mix code and written solutions
So we introduce VS Code eventually, but it starts with the replit.com "ide"
I feel like a shill for repl.it 😄
oddly enough, explicit use of the terminal rarely came up in my curriculum.
Repl.it is great. That said I don't see why notebooks are so much worse than anything else.
but yeah, that's always the dilemma with those courses.
I do think in a university setting there should be a mandatory one or two credit "how to use the unix flavored cli" course
I like MIT's "Missing Semester"
I think the reason for not teaching notebooks early on is because students don't learn the fundamentals of scripting and seeing output in the terminal.
they encourage you to break the problem down in terms of how you can display stuff at the end of each cell, not in terms of code reusability. And then you have to have the entire state of your notebook in your live human memory if you re-execute cells for some situation-specific reason.
That being said, if you understand the problems with notebooks and are quite specifically trying to do exploratory analysis, I guess you can have at it.
Is Spyder really used out there in the professional DS world?
occasionally. when i used it, it felt like a cheap rstudio imitation.
Yeah... RStudio is probably the only thing I miss from before I switched to Python
Does anyone here use Chatterbot, and if so, do you know of any corpuses I can use to train my bot to be nice, and not ask people when they're gonna die?
https://paste.pythondiscord.com/azunajadih.json what is wrong, I'm getting a schema validation error. what's wrong with this?
@burnt bronze need more context. what are you doing? what schema? what code is doing it? etc etc
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
df = pd.read_csv('ramen-ratings.csv')
countries = df['Country']
ls = {}
for x in countries:
if ls.get(x) is None:
ls[x] = 1
else:
ls[x] += 1
countries = ls.keys()
df = pd.DataFrame.from_dict(ls.items())
df.index = countries
df.plot.bar(figsize = (15,4.5))
plt.title("Number of Ratings per Country", fontdict = {'fontsize': 15})
plt.xlabel("Countries", fontdict = {'fontsize': 15})
plt.xticks(rotation = 90)
plt.ylabel("Number of Ratings", fontdict = {'fontsize': 15})
plt.show()``` Anyone know how I'd remove the '1' legend?
Hey I don't get the hate - what's wrong with notebooks? 😛
do you guys have any recommendations for applying knowledge from andrew ng's coursera into projects
does anyone know what's better to use for data visualization matplotlib or charts js
I found chartsJs is more customizable but matplotlib is a wonderful library
matplotlib
why
what does matplotlib have that chartsJs doesn't
and isnt charts Js a javascript library so i think it works better on web though I love matplotlib and used it alot
Hello, I want to use MAFA Dataset(https://www.kaggle.com/rahulmangalampalli/mafa-data) for mask-detection but the files are labeled in .mat format. Can anyone tell me how can I use this dataset in python?
I've used mafa extractor( https://pypi.org/project/mafaextractor/) but i am not sure how to implement this.
TIA
anything you find interesting 🤷@visual umbra
@upbeat topaz we don't have a general "learning" channel. we do have a list of resources, and a lot of help channels for targeted help, see #❓|how-to-get-help
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
thank you
also don't forget to read channel topics 🙂 it's the text up at the top of your discord window, to the right of the channel name and to the left of the search bar. you can click on it to read the whole thing
okay
Am I to understand that the a and b arguments can be array-likes of labels? https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
I'm not sure what is meant by "samples of scores" in "Calculate the T-test for the means of two independent samples of scores.".
i don't think it means anything here
this is just a t-test for 2 independent samples
the "scores" thing i think is just bad wording and/or someone lazily copying from their stats-for-engineers textbook
!e
from scipy.stats import ttest_ind as tt
result = tt(['a', 'a', 'b'], ['a', 'b', 'b'])
print(result)
@serene scaffold :x: Your eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 2, in <module>
003 | File "/snekbox/user_base/lib/python3.9/site-packages/scipy/stats/stats.py", line 5771, in ttest_ind
004 | v1 = np.var(a, axis, ddof=1)
005 | File "<__array_function__ internals>", line 5, in var
006 | File "/snekbox/user_base/lib/python3.9/site-packages/numpy/core/fromnumeric.py", line 3702, in var
007 | return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
008 | File "/snekbox/user_base/lib/python3.9/site-packages/numpy/core/_methods.py", line 211, in _var
009 | arrmean = umr_sum(arr, axis, dtype, keepdims=True, where=where)
010 | TypeError: cannot perform reduce with flexible type
@desert oar does one have to assign arbitrary integers to each label? that seems odd.
wait, what are you trying to do here
determine the statistical significance of two sets of predictions
banish that sentence from your mind
(that is, whether changes in the design of the model changed the predictions in the second as compared to the first in a way that can't be accounted for by random chance)
it sounds like you want to test whether 2 samples are from the same bernoulli distribution
I've never heard of bernoulli
bernoulli is yes/no with some probability p of "yes"
(meaning that "no" has probability 1-p)
the probability p does happen to be the mean of the bernoulli distribution
so yes you can use a 2-sample t-test to test the hypothesis of whether p1 and p2 are equal, if the samples are big enough for the central limit theorem to kick in. but you should calculate it directly, don't use this function (which tries to calculate the means from the data which it expects to be numeric)
note that you fundamentally can't assume equal variances unless the null hypothesis is true, because the variance of a bernoulli is p * (1-p). so if p1 != p2 then obviously p1 * (1-p1) != p2 * (1-p2)
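A sketch of "calculate it directly" from counts of yes/no outcomes, under some assumptions: the two samples are independent, both are large enough for the normal approximation, and a z statistic is used in place of the t statistic (for large samples the two are very close). The variances are kept separate, as per the point above about p * (1-p). The function name and the counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes1, n1, successes2, n2):
    """Two-sided z-test for p1 == p2 using unpooled (unequal) variances."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    # Variance of a Bernoulli sample mean is p * (1 - p) / n.
    # Note: se is 0 (division by zero below) if both samples are all-yes or all-no.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts: 60 "yes" out of 100 vs. 45 "yes" out of 100.
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(z, p_value)
```

String labels like 'a' and 'b' would first be turned into counts of one chosen label per sample, which is all this function needs.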
Enroll today at Penn State World Campus to earn an accredited degree or certificate in Statistics.
what's this?
Bookmark command from Lancebot.
okay 
hey, I'm trying to plot times in the format %H:%M on x_axis, but it is only returning 00:00.
When using dates, it seems to work fine, but not with the times.
here's the code:
plt.plot(time_x,players_y)
plt.gcf().autofmt_xdate()
date_format = mpl_dates.DateFormatter('%H:%M')
plt.gca().xaxis.set_major_formatter(date_format)
plt.tight_layout()
plt.show()```
`time_x` is just regular datetime format eg. `2021-05-07 18:08:38`
Any ideas what's happening? (new to mpl)
@obsidian quail can you provide some sample data to reproduce with
1 sec
['2021-05-07 18:07:52', '2021-05-07 18:07:54', '2021-05-07 18:07:56', '2021-05-07 18:07:59', '2021-05-07 18:08:01', '2021-05-07 18:08:05', '2021-05-07 18:08:07', '2021-05-07 18:08:22', '2021-05-07 18:08:24', '2021-05-07 18:08:26', '2021-05-07 18:08:36', '2021-05-07 18:08:38', '2021-05-07 18:08:41', '2021-05-07 18:09:24', '2021-05-07 18:09:38', '2021-05-07 18:09:40', '2021-05-07 18:09:42', '2021-05-07 18:09:44', '2021-05-07 18:09:46', '2021-05-07 18:09:49', '2021-05-07 18:18:28', '2021-05-07 18:19:21', '2021-05-07 18:20:21', '2021-05-07 18:21:21']
[31, 35, 37, 31, 21, 24, 28, 44, 45, 39, 33, 32, 29, 19, 21, 24, 27, 29, 34, 55, 32, 27, 29, 25]
@obsidian quail it's because your time_x is all strings
i don't think matplotlib is smart enough to do that conversion for you
ah right, I'm not too familiar with datetime, I'm storing the values in an sqlite table, how would I convert into datetime whilst in the list?
pd.to_datetime would be the easiest option
you're storing them as these timestamp strings in sqlite?
as integers
like unix timestamps?
c.execute("SELECT * FROM last_24")
players_y = []
time_x = []
for x in c:
players_y.append(x[1])
time_x.append(x[0])```
I'm then appending the values to each list?
it sounds like you have an int type on the column but are writing strings to the db
sqlite will let you write the wrong datatype to a column
(i think its a non-feature but they have their reasons for doing it)
hmm, it's a default value time integer DEFAULT (datetime('now', 'localtime'))
Would it be due to this?
(btw let me know if we should move to #databases if its getting offtopic here)
from datetime import datetime
c.execute("SELECT * FROM last_24")
players_y = []
time_x = []
for x in c:
    players_y.append(x[1])
    time_x.append(datetime.strptime(x[0], '%Y-%m-%d %H:%M:%S'))
this should work, or something like it anyway
and yes let's un-fuck your database in #databases
there is a "more right" way to do this
👍
Hello, I'm trying to find the combination of one entry of Column B with the other entries in the column B while trying to rank the top 5 entries it repeatedly occurs with reference to column A. Can anyone help me how to do this with pandas?
how would u attempt to train a nn with a lot of classes and low images? or at least, not the same amount of images per class
this description is very unclear. can you try to clarify and provide examples?
pre-training
this is a problem for any kind of model, not just a neural network. these are also 2 different problems to some extent. the 2nd problem is "imbalanced data" and the first one is just "not having enough data". you can use transfer learning for "not enough data" (as long as you can find a pre-trained model that's relevant) but imbalanced data is harder imo
imagenet to the rescue
for the not-enough-data case, if you can gather a large amount of unlabeled data but only a small amount of labeled data, train an unsupervised model on the big unlabeled dataset then use it to create features for the small labeled dataset
for imbalanced data, honestly even when i was a professional data scientist working with other professional data scientists we still struggled with this
it is a hard and unsolved problem
what solutions you solved it with BTW apart from artificially weighting and Data aug??
sometimes (eg with images) you can get some traction with data augmentation and/or generation. you can also try oversampling and/or undersampling but i dont know of anyone who gets great results with that.
spending a fuckton of money and time to acquire more labeled data
i.e. we didnt solve it...
well, sometimes the simplest solutions work the best 😉
at least, that alleviated the worst of the problem. we still ended up with severely unbalanced data, and we kind of just accepted that our accuracy on those classes would be really bad
hmmm....what was it on tho?
so we adjusted our performance metrics and set expectations with the business stakeholders accordingly
like the task/dataset?
we never got improvements by using any "fancy" methods. it only ever added noise.
yeah, our business had a huge amount of hand-constructed "categories" for different types of businesses, and we had to figure out the type of a business based on whatever we could find about it
tried DAGAN? it works for quite many use cases. I was thinking of using it, but didn't want to spend so much compute power/$ on it.
ahh, tabular.
we could get its address (e.g. for zoning information), name, scrape the web for its facebook page, etc
so as you can imagine we got great results when distinguishing photographers from nightclubs, but distinguishing bars from nightclubs was a lot harder
(made up example but you get the idea)
and the imbalance was because we only had like 3 nightclubs and 6,000 bars (again made up but not far from what we saw in some cases)
don't companies tell what they do on the website? just scrape it all, filter and BERT is up
and there were 1000+ of these classes

you would think so, and yes we used something based on bert ensembled with another model using a bunch of tabular metadata
uh-huh. tried weighting?
yep it helped a bit
but we ran up against the lower bound of "almost 0 data" at some points
weighting is great when you have 15 observations vs 150 observations
but when you have 5 observations youre kind of at the mercy of the dungeon generation algorithm, so to speak
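for reference, a minimal sketch of what "weighting" means here, assuming inverse-frequency weights and the made-up bar/nightclub counts from above. the helper name is hypothetical; the formula is the same heuristic scikit-learn uses for class_weight="balanced"

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up counts from the discussion: 6,000 bars vs 3 nightclubs
labels = ["bar"] * 6000 + ["nightclub"] * 3
weights = balanced_class_weights(labels)
print(weights)
```

each nightclub ends up weighted about 2000x more than each bar, which shows the problem: the weight is large, but it is still resting on only 3 examples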
it sucked and it sucked the life out of our team and i think theyre still working on it long after i quit
but now im just ranting 😛
hmmm....
i havent used DAGAN
Ah, I'm sorry. Column A has order details and column B has the products purchased, column c has the client details. I wanna group the products that occur together in different orders and rank them based on the frequency with which they occur with other products. Then I want to assign the top 5 product occurences for each client according to their history
Data Augmentation Generative Adversarial Networks
heh i did actually consider doing something like this
seemed like a rabbit hole though
doesn't work too well on numerical? atleast I haven't read any papers that do explore smthin like that
i wouldnt know. also a lot of the data was categorical anyway
"annual revenue < 10k, 10-100k, 100k+"
giant clusterfuck
anyways, if those companies are in US, doesn't some org keep track of companies and their type?
you would think so!
several do
none of them do it reliably
its not like humans who have a social security number and 3 well curated credit scores available from 3 well known agencies
its a dumpster fire and there isn't even a good "ripe for innovation" solution
its probably better in other countries where the government actually does things
so each row is a single product, and several rows can be part of the same order details?
I mean - you could do the reverse. if data on bar clubs is less, then reverse search and scrape
atleast you can get that 5 up to 10, if not 15
wdym reverse search
Yes, that's right.
and you want to count the frequency with which any pair of two products occurs? or something more complicated?
what, like on google? its an interesting idea, although labeling would get very messy since the org had some pretty specific and weird labels
like i said we ended up spending a lot of money to do pretty much that: find more businesses in these categories
the problem was that the data you can buy from 3rd parties doesnt have the special in-house labels
Frequency with which every product occurs within different order details. Then group the products that occur together and assign them to the client for each product they bought. One client can have multiple rows of order details and which in turn can have multiple rows of product details
can you give an example? it sounds like you might want to use .groupby
branched model then
2 inputs - for both dataset categories
some might overlap, so you can hand remove them
but overall, would be some work to ensure proper dtypes and no NAN - but I guess would be easy for data scientists
got a reference on this? i ended up separately creating vector embeddings on the "big" unlabeled dataset then applied those to the "small" labeled data
mmm
depends on the data type tho. if 3rd party is givin tabular, and you are using NLP then you have to have 2 models 🤷 but again, you are actual scientists so..
is there a way to search with a py script images on google images and download them?
not even contextual?
so since i have the labels names, i can search the label on google and download a few
would be easy - google it up
but the real pain would be the data cleaning
i dont wanna do it manually lol
im not entirely sure what a model with 2 inputs would look like here. lets say you have 1 million records from dataset A and 5000 records from dataset B, and B might or might not contain some records from A.
I meant google up on how to write one
ah
scripting google searches is probably against their terms of service, unless you use an official google image search api. discussing ToS violations is against our server's rules and therefore we can't help you with that aspect of your project.
!rules 5
5. Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious or inappropriate. Do not help with ongoing exams. Do not provide or request solutions for graded assignments, although general guidance is okay.
those are the rules
model with 2 inputs would require concatenation. you can directly ref with keras - I personally haven't implemented so take my ideas with a grain of salt
i repeat
unless you use an official google image search api
maybe u dont even read what u write
Im pretty sure it was just a continuation of his messages, not a reply to you
don't get too triggered
i dont think being rude or sarcastic to volunteer helpers on the internet is a good idea either
for tf/keras, a functional model can easily accomplish that
very high model complexity, but less time mucking about data
i guess im still not sure what this would mean. if i have 2 different "images" that together constitute a single record (maybe 2 different rows from 2 different databases that refer to the same entity), it makes sense. but how would i "concatenate" anything about 2 completely different entities and get any sensible results that i can use on a single entity at prediction time?
what would such a model be learning?
i have to step out but @ me so i dont miss your messages
Well, yeah. Column A has say 10 orders from 1,2,3 (clients) with each order having a combo from the list of products a, b, c, d, e. I want to know in 10 orders how many times 'a' occurs along with 'b' and 'd' and with which orders so I can assign them based on the client. I want to do the same with every product. That is to find the combinations within product column but order specific. Then I want to find the aggregate of the most occurred combinations for each product
Do I make sense?
Hello, we are all volunteers here, meaning everyone here takes personal time to do this. No one is obliged to answer your questions, and we will never provide help when it breaks our rules. Please be mindful of this in the future.
@desert oar basically from what I know, you have two input layers to take data from those 2 datasets. (tbh visually would be better) so when you build your networks for those 2 inputs - you basically have a multi-branch network (smthing like image segmentation and bounding box together, I believe)
Anyways, you can have initial layers for each input (say conv's) and then at some point you would need a bottleneck - to merge both inputs together. you would have to create multiple other "branches" too to capture complex hierarchical relations from both datasets.
This is where concat comes in - it would provide a bottleneck. I have attached a visual image that kinda gives an example. While a single LSTM layer may not be well served for most datasets - this is where multi-branched networks come in.
nets like inception are a pain in the <> to work with due to their branched structure, but ofc complex model helps so much better than spending months on data processing and alignment.
again, I haven't implemented so there may be some important aspect I might have missed out - so take my ideas with a grain of salt 👍
Another extremely basic one - doesn't seem to be too much a problem as long as you pool and flatten the conv outputs eh?
I think you may have misunderstood what I said. obliged here is meant to mean that no user here is required to answer a question. This is not really the correct place to discuss this. You can contact modmail if you feel you would like to discuss this.
i dont have to discuss anything. It seems u didnt read the whole conversation
theoretically, sounds good. multiple branches constituting different networks architectures working on different data types. but I guess you can always ping up your buddies to suggest and see if they might research a bit bout it
oh wait - you can also do transfer learning 🤣 it would be hell, but if you train different branches separately and use set_weights/get_weights to reconstruct separate tf.keras.Model with it, you can also fine-tune the whole damn thing
@grave frost i think this still doesn't apply to the particular case i described, whereas transfer learning would. this is just segmenting the input features for a single record. it's a great idea, kind of like doing what we did with building 2 models and ensembling them, but in a single network
but it doesn't solve the "use data from the big dataset to inform model on small dataset"
whereas transfer learning is meant for this (re: our discussion from a while ago)
nice graphics though
very helpful
so the algorithm would be something like this:
for each client:
for each product:
count each combination with other products
Yes. But if a client has multiple orders, it has to be filtered by order too
it'd help if you gave some actual example data and example outputs
im not sure what "filtered by order" means, i thought that's what a combination w/ other products meant
Okay. Sorry about that.
```
Order  Product  Client
100    A        1
       B
101    A        2
       C
101a   B
       C
102    D        3
       A
       B
       C
```
If that's my data, I'd like each client to be recommended other products based on other purchases. Here client 2 has 2 orders, so while grouping, each order is treated as a separate entity when aggregating product suggestions
That's to know what the combination is per order
def formatCurrency (balance):
if balance == savBal:
return str(updateBalance(balance, savIR))
def updateBalance (balance, rate):
savBal = balance + balance * rate/100
else:
chBal = balance + balance * rate/100
return balance + balance * rate/100
savBal = float(input("Enter your savings balance:"))
chBal = float(input("Enter your checking balance:"))
savIR = float(input("Enter your savings interest rate %:"))
chIR = float(input("Enter your checking interest rate %"))
print("Your updated savings balance is", formatCurrency(savBal))
print("Your updated checking balance is", formatCurrency(chBal))
can someone plese help me
please*
You can think of it as a single architecture trying to coordinate towards optimising weights for multiple other architectures all for the aim of increasing accuracy.
@fervent turret this is not a data science question. See #❓|how-to-get-help
ofc, I wouldn't know the immediate advantages of that as opposed to a ensemble, but I don't see in what case ever unified architectures would not be appropriate.
Plus you are missing the key point @desert oar with multiple input nodes, the network can optimize what features to use from each branch and using them in various degrees (dropping features with no significance) and allows it to pool the information gained from each branch far more accurately than a naive ensemble.
So, I have an interesting thing with GPT-2 and discord.
I wanna use GPT-2 to make a chatbot, but I don't know how to filter out previous messages from its output so it only makes 1 message in response to other messages and stuff.
Currently, I append every message into a list called convo, and I feed it the previous 3 messages.
It generates the message and sends it to the discord channel, which triggers another message, adding its message to the conversation. The convo list stores everything, but when I pass the previous 3 messages in, its output also contains those previous messages, which leads to messages getting huge, like over 2000 characters, in a matter of seconds. My current code is:
if message.channel.id == focus_channel:
messages = chatlog[-3:]
convo = ""
for x in chatlog:
convo += f"{x}\n"
inputs = tokenizer.encode(convo, return_tensors='pt')
outputs = model.generate(inputs, max_length=50, do_sample=True)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("\n"))
await message.channel.send(text,reference=message,mention_author=False)
can't you just take the last message from the convo list?
i might be misunderstanding your question
I want to take the last few messages so that it continues the conversation instead of completing the message.
Currently it's output is
message1
message2
message3
bot_response
with
newlines
@austere swift
But what about all the rest of it's response? (It's responses have newlines in them)
wait
I can probably just do text.split("\n")[3:]
Nope, that breaks horribly, just like the other times lol
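One way to approach this, sketched under assumptions: for a decoder-only model like GPT-2, the ids returned by `generate` start with the prompt ids, so the prompt can be cut off by length before decoding rather than by splitting the decoded text on newlines. The two helpers below are hypothetical names and the ids are stand-in values, not real GPT-2 tokens.

```python
def new_tokens_only(output_ids, prompt_len):
    """Drop the echoed prompt from a generated id sequence."""
    return output_ids[prompt_len:]

def first_message(text):
    """Keep only the first non-empty line of the decoded reply."""
    for line in text.split("\n"):
        if line.strip():
            return line.strip()
    return ""

# Stand-in values for illustration only
prompt_ids = [11, 12, 13, 14]
output_ids = prompt_ids + [21, 22, 23]  # generate() echoes the prompt first
reply_ids = new_tokens_only(output_ids, len(prompt_ids))
print(reply_ids)
print(first_message("\nsure, sounds good\nmessage that should be dropped"))
```

In the code above that would probably look like `outputs[0][inputs.shape[-1]:]` before calling `tokenizer.decode`. Two other things worth checking: the loop iterates over `chatlog` rather than the 3-message `messages` slice, and `max_length=50` counts the prompt tokens too, so `max_new_tokens` may be the better knob.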
the code of a course using tf.train.GradientDescentOptimizer(0.01).minimize(cost)
but this GradientDescentOptimizer is in version 1 of tensorflow
should i be doing this course?
or is it outdated
Tf2, specifically its keras api, will be a lot nicer.
Unfortunately you'll probably run into this issue with a lot of courses, since TensorFlow 2 is relatively new.
So perhaps as long as you're willing to port the codes over or at least get a sense of how the same code could be written in Tf2 you could proceed.
Ultimately the specifics of writing code don't really take away from the learnings offered by a course itself. So decide based on whether the course is good or not.
ok i believe its good
OK cool, then in that case just keep the caveat in mind, you might have to modify the codes presented
In most cases Tf2 keras looks very similar to keras so it's an easy port if they use keras.
yeah they say they will use keras later
Hi all, got a question about overfitting. Can this be assumed as overfitting? or the model is good?
LR scores
0.9452887537993921
              precision    recall  f1-score   support
           0       0.95      1.00      0.97       933
           1       0.00      0.00      0.00        54
    accuracy                           0.95       987
   macro avg       0.47      0.50      0.49       987
weighted avg       0.89      0.95      0.92       987
[[933   0]
 [ 54   0]]
I see the accuracy is good but the 0's at the negative side makes me question the model
also the precision, recall and f1 scores of 1( people had a stroke) is 0
Is this score on train or test.
Also, the answer is neither.
This should hopefully indicate to you that accuracy is a terrible metric for this. Clearly this model is useless
This model can be entirely replaced by a single line of code print("NO stroke")
To state it more explicitly, the problem here is class imbalance. Your dataset is going to have more cases without stroke than with. Using accuracy as a metric then, would favour a model that just gets the no stroke cases correct.
I'm going to go out on a limb and assume that you agree that would make for a fairly bad model. So this isn't a good model, and our metric choice of accuracy isn't appropriate
As for overfitting, you only get a sense of that if you compare the fit on train vs the fit on test, after choosing a good metric.
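To make that concrete, here is a small sketch that recomputes the metrics straight from the confusion matrix posted above. It assumes rows are actual classes and columns are predicted classes, which is the scikit-learn convention.

```python
# Confusion matrix from the report: rows = actual, columns = predicted
tn, fp = 933, 0   # actual "no stroke"
fn, tp = 54, 0    # actual "stroke"

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Guard against division by zero when a class is never predicted
recall_pos = tp / (tp + fn) if (tp + fn) else 0.0
recall_neg = tn / (tn + fp) if (tn + fp) else 0.0
precision_pos = tp / (tp + fp) if (tp + fp) else 0.0
balanced_accuracy = (recall_pos + recall_neg) / 2

# Score of a "model" that always answers "no stroke"
baseline_accuracy = (tn + fp) / total

print(accuracy, baseline_accuracy, recall_pos, balanced_accuracy)
```

The accuracy is identical to the always-"no stroke" baseline, while recall on the stroke class is 0 and balanced accuracy is 0.5, i.e. no better than guessing. Metrics like recall, F1 on the positive class, or balanced accuracy are the ones to watch here.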
This score is based on 'test'
Exactly, the data has lot more 'no stroke' than 'stroke' output.
Thank you, will try that now.
Hello, I am writing a code, which scrapes data from a website. Now, I want to put it in an excel file. What will be the best - csv or pandas or if there is something else? Also, I want it to append the new fields to the existing file and not make a new one when it is run again
Pandas should be easy to work with once you're used to it.
Okay, Thank you so much!
anyone know how to fill in the specified value based on the query for the df
so for eg if theres a 0 in anyone of the column in pandas df , i want to change those values to some specified val
how do i do it without using for or while loop
i know about fillna but this isnt for NaN values
No, I got this point. And I do see how that it could propagate information more effectively than a naive ensemble. And I will definitely want to try it in a project at some point. But conceptually it's the same thing as ensembling two models, it just lets the learning process be smarter about the ensembling.
But I still don't understand how you want to calculate these recommendations. Can you give an example of the frequencies you calculate for this data?
Also, there are known algorithms that sound like what you're trying to do. You might want to look into "association rules" https://towardsdatascience.com/association-rules-2-aa9a77241654, and "collaborative filtering" https://towardsdatascience.com/intro-to-recommender-system-collaborative-filtering-64a238194a26.
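As a starting point before reaching for those algorithms, here is a rough sketch of plain pair counting per order, using the sample rows posted earlier. It is written in plain Python for clarity; it assumes order "101a" belongs to client 2, and `recommend` is a hypothetical helper, not an established method.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (order, product, client) rows mirroring the sample data;
# order "101a" is assumed to belong to client 2
rows = [
    ("100", "A", 1), ("100", "B", 1),
    ("101", "A", 2), ("101", "C", 2),
    ("101a", "B", 2), ("101a", "C", 2),
    ("102", "D", 3), ("102", "A", 3), ("102", "B", 3), ("102", "C", 3),
]

# 1. Collect the set of products in each order
baskets = defaultdict(set)
for order, product, _client in rows:
    baskets[order].add(product)

# 2. Count how often each unordered pair shares an order
pair_counts = Counter()
for products in baskets.values():
    pair_counts.update(combinations(sorted(products), 2))

# 3. For each product, tally the products it co-occurs with
partners = defaultdict(Counter)
for (p, q), n in pair_counts.items():
    partners[p][q] += n
    partners[q][p] += n

def recommend(client, top_n=5):
    """Rank products that co-occur with the client's purchases, minus ones they already bought."""
    owned = {p for _o, p, c in rows if c == client}
    scores = Counter()
    for p in owned:
        scores.update(partners[p])
    ranked = [(q, n) for q, n in scores.most_common() if q not in owned]
    return ranked[:top_n]

print(pair_counts)
print(recommend(1))
```

With a DataFrame, step 1 would likely be a `groupby` on the order column collecting the products, and the rest stays the same.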
!e ```python
import pandas as pd
x = pd.Series([1,1,0,0,2,2])
print(x.tolist())
# Option 1: boolean-mask assignment on a copy
x1 = x.copy()
x1.loc[x1 == 0] = 9
print(x1.tolist())
# Option 2: element-wise apply
x2 = x.apply(lambda val: 9 if val == 0 else val)
print(x2.tolist())
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | [1, 1, 0, 0, 2, 2]
002 | [1, 1, 9, 9, 2, 2]
003 | [1, 1, 9, 9, 2, 2]
thanks
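The same idea extends to a whole DataFrame without loops. A short sketch, assuming a made-up frame with hypothetical columns `a` and `b` and 9 as the replacement value:

```python
import pandas as pd

# Hypothetical example frame
df = pd.DataFrame({"a": [1, 0, 2], "b": [0, 5, 0]})

# replace() swaps every exact match of 0 across all columns
replaced = df.replace(0, 9)

# mask() replaces values wherever the condition is True
masked = df.mask(df == 0, 9)

# Restrict the change to one column
only_a = df.copy()
only_a["a"] = only_a["a"].replace(0, 9)

print(replaced)
print(only_a)
```

`replace` is the most direct when matching an exact value; `mask` is handy when the condition is more than equality, e.g. `df < 0`.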
writing one line at a time is easier with the csv module. once you have created the file, you should use pandas to work with it, convert to xlsx if needed, etc.
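a minimal sketch of the append-on-each-run part with the csv module. the field names and the `append_rows` helper are hypothetical; the header is written only when the file is new or empty

```python
import csv
from pathlib import Path

FIELDS = ["name", "price"]  # hypothetical column names

def append_rows(path, rows):
    """Append dict rows to a CSV, writing the header only for a new or empty file."""
    path = Path(path)
    write_header = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# Usage: each scraper run appends to the same file
# append_rows("scraped.csv", [{"name": "widget", "price": 3}])
```

pandas can then read the file with `pd.read_csv` and write it out as xlsx if an actual excel file is needed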
@lapis sequoia GPT-2 can't hold a convo very effectively though (not as good as its better counterpart)
maybe it might be 🤷 I will open a question about this, cuz I can't find any good resources
I wanted to know how many images are required to make a decent quality gan
ofc thats a very general question
but i have 300 1000x1000 images of mountains
and i plan on training a wasserstein gan on these images
would that be enough
or should i collect more images?
i have trash computer, so i dont want to waste a lot of time on a gan that wont work on 300 images
thanks!
Even though GANs can work with very little data, I don't think 300 is enough
Also, I think in this case lowering the quality could help a lot
@winged stratus
thanks