#data-science-and-ml
yes
make sure this uses the correct api, this one https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md
how did you save your model before ?
if you used tf.saved_model.save(model, path) then this should work
perfect
yw
That was so much effort (I have like beginner to intermediate python) so like it was a big challenge haha XD
But I think I'm going to look into YoloV3 or V4 framework next time..
personal opinion here, but tensorflow is a big mess
thinking of switching to torch
they are algorithms, you can implement them in tensorflow too
oh yeah pytorch was recommended to me too
i mean fun weekend project touching into something i wasnt comfortable with
to be fair, TF2 is orders of magnitude better than TF1
they made the api more "torch-like"
ugh finding documentation was the big pain
it was completely unintuitive before
lots of tf1 documentation mixed with tf2 XD
I think eager execution made it choppy
yeah it's officially stable but only as of recently
Ok, something went wrong with my training lol
It's recognizing my object twice..? One with a longer bounding box lmao
so, that is a common thing, but usually models have techniques to suppress overlapping bounding boxes
eg, faster-rcnn uses what's called "non-maximum suppression" at the end of the model
that only works if the boxes are extremely similar tho, if that's your case here
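A rough sketch of what non-maximum suppression does, in case it helps. This is a plain NumPy toy, not what faster-rcnn or the TF object detection API literally runs; the box coordinates, scores and the 0.5 IoU threshold are made up for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        # Drop every remaining box that overlaps the kept one too much
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([
    [10, 10, 50, 50],    # the object
    [12, 11, 52, 49],    # near-duplicate of the first box
    [10, 10, 100, 50],   # much longer box over the same object
], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

kept = nms(boxes, scores)
```

With these numbers the near-duplicate is suppressed but the longer box survives, because its IoU with the best box is below the threshold. That matches the "only works if the boxes are extremely similar" point.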
I'm guessing this is normal? I needed to pick a random object so I picked a toy goat from my fav robotics part supplier (i just like the goat)
I only trained it for 20 minutes with about 55 images
hard to know just like that, but it's safe to expect it's due to lack of training
or overfitting
can't really make any conclusion just like that
it was based on one of the ssd resnet 50 models
maybe I should try a faster rcnn model
if you don't need real-time object detection, it's the current SOTA
I stopped training early because I thought something crashed, I was checking my task manager and for some reason my GPU and CPU activity dropped to like near 0
(like idle, not 0)
weird, you should log metrics during training to have a better idea of what's going on
Is it a config on how intensive the gpu/cpu goes? Cause like all the youtube videos I was watching said it would make my computer go sicko mode
I think I still have my tensorboard, let me check
guys would you recommend a good book for code-approach deeplearning, preferably keras
documentations have me lost tbh
Not sure, how to make sense of this, documentation told me total loss of less than 1 was good, while another youtube video was like less than 0.05 consistently
I don't know about keras, but pytorch recently distributed their free book called "Deep Learning with Pytorch", and it seemed pretty good when I skimmed through it
You should try to log your validation set too
yes
also, your learning rate seems to be gradually rising, is that a warm-up that you stopped too early, or is it a mistake?
https://pytorch.org/deep-learning-with-pytorch @lapis sequoia
I have no clue, I should probably try running it again but for a bit longer and see how the graph changes over time
Man, I got so many warning messages so I was just scared something would have gone wrong regardless
lol
it contains some code, like here, but also charts, more theoretical details, intuitions to have etc
But again, it's torch, not keras
voxels?
3d pixels
So it doesn't use tensors?
Any recommendations for interactive charting for time series data on local machine, preferably not in a browser?
well, N-d pixels (brainfart)
wait no in this case it's 3d
they have volumetric images of ct scans or something, voxels are how you refer to the pixels in a 3d image
@odd yoke seems pretty cool skimmed through the contents
hmm interesting
also yeah, my model is having issues with doing two bounding boxes lols
but its working so im happy ^_^
thank you again @odd yoke
i can sleep in peace tonight knowing that at least i didnt go away empty handed with diving into this mess of documentation lol
hmm
do you think the fact that I didn't have multiples of the goat
led to this maybe?
If I trained it better with pictures with multiples, that might have helped the training maybe..? not sure but random theory
multiples ? as in many instances of the goat in the same image ?
yeah
it can help, but it shouldn't be needed
Yeah hmm
(cute goat btw)
thx, thats why I did it haha. we had a more practical idea of detecting different types of physical ports (USB) for the tech unsavvy but i guess i just wanted to see how hard it would be to use tensorflow
Anyone familiar with text-to-spectrogram?
Specifically interested in what spectrogram features make certain vocal characteristics (e.g. "sad", "happy")
hey friends, im making a GAN atm and im having a bit of trouble with the input pipeline and training step, would anyone be able to help me out?
https://pastebin.com/91FW6kGU for reference
im using the tensorflow pix2pix function as the basis for the generate images and train function
but im having problems with the images. i can show how i was storing my data as a h5py file
im not sure if the issue is that i need to load my images using a tf.data.Dataset object
but then i dont know how to go about that
and im not too sure what to do with the generate images functions and stuff because the way they do it in the pix2pix documentation on the tensorflow website seems really efficient
Hi all, I have some doubt about web scraping. Any one have experience about product image get in big basket.
I got other data for example Product name, quality, price...
pretty basic question, i am trying to decide between 2 projects to put on my resume
one project is where i built an OCR system from scratch - involved a lot of image processing to cluster and extract text patches from images and then pre-process them to look as close as possible to the actual training set
but the network was fairly simple and the data set was just EMNIST
and another was point cloud segmentation using a multi-class SVM
i had a dense point cloud data set and i trained an SVM to classify different regions in it. any suggestions?
@lapis sequoia https://stackoverflow.com/questions/48309631/tensorflow-tf-data-dataset-reading-large-hdf5-files
Take a look at this link, also by chance can you send me the code used to convert image arrays into HDF5 files. I am currently needing to do so would like some help.
for sure man
i was using sentdex as a basis
im thinking i might switch to numpy arrays but ill send in a sec
Hmm Sentdex has an HDF5 video?
nah not on hdf5
he uses pickle but i used hdf5 because i had issues with pickle
pickle seemed to mess up my files
i was using an image classifier and running a test with it
and i got significantly worse results with pickle than i did with hdf5
Ah what did you use to learn HDF5
it was a while ago but i looked up a bunch of stuff on like medium or towardsdatascience
i didnt do anything special other than store the image arrays in them
but i dont think its too complex from memory
this site's pretty good
Ah ok I see what you did, I mean it helps that you are using the same code base as I am
You just stored X and y in a respective dataset, cool cool, I want to do the same thing but without having to loop through all the images and store them in a var, as I am looping through 40,000 images 277x277. I was wanting to append to the X dataset and y dataset as I looped through the images so that I would not have to store the arrays in memory all at once.
Any idea on how to do this?
hmm
I grabbed a image dataset from Google and am working on a ANN, CNN, RNN and checking the differences
so you mean appending them into one single dataset rather than what i did where i have images and labels as separate
No I want to have them separate as you have them, just that I want to be able to append my arrays to that dataset as if I was appending them to a list
So that I would not have to load them all into the list thus having loaded them into memory
ah right
Not have to loop through and add all images into a list then to transfer that list into a HDF5 file
right
@lapis sequoia Nice the link you provided has the same questions I did in the comment sections
It directed me to these two links, will have to read them through, thanks man. Is it fine if I ping you with any questions if they arise?
ah wait i think i get what youre trying to do
so you mean you have like different datasets that youre appending into one h5 file
and nah im on google's servers
Just appending in batches into one dataset
So append like 1 or 100 image arrays at a time so that I do not have to append all 40,000 at once
yeye
so like
the img arrays are np arrays right
hmm
that stackoverflow link seems to do everything you need i think
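For the "append in batches" part, one way this is commonly done with h5py is a resizable dataset: create it with zero rows and `maxshape=(None, ...)`, then grow it per batch. A minimal sketch, assuming h5py is installed; the file name, the tiny 8x8 image size and the `append_rows` helper are made up for illustration (swap in 277x277 and real image arrays).

```python
import os
import tempfile

import h5py
import numpy as np

def append_rows(dataset, rows):
    """Grow a resizable HDF5 dataset along axis 0 and write rows at the end."""
    start = dataset.shape[0]
    dataset.resize(start + rows.shape[0], axis=0)
    dataset[start:] = rows

path = os.path.join(tempfile.mkdtemp(), "images.h5")  # hypothetical file name
height, width, batch_size = 8, 8, 4  # tiny stand-ins for the real image size

with h5py.File(path, "w") as f:
    # maxshape=None on axis 0 is what makes the dataset appendable
    X = f.create_dataset("X", shape=(0, height, width),
                         maxshape=(None, height, width), dtype="uint8")
    y = f.create_dataset("y", shape=(0,), maxshape=(None,), dtype="int64")
    for batch_index in range(3):
        # Stand-in for "load the next few images from disk"
        images = np.full((batch_size, height, width), batch_index, dtype="uint8")
        labels = np.full(batch_size, batch_index, dtype="int64")
        append_rows(X, images)
        append_rows(y, labels)

with h5py.File(path, "r") as f:
    total_images = f["X"].shape[0]
    first_labels = f["y"][:5].tolist()
```

Only one batch is in memory at a time, and rows come back in the order they were written.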
@fervent bridge Did you not manage to get memmaps working?
Hi!
I'm plotting with pyplot / mpld3 and I notice that whenever I do
plt.show()
I see the correct X tick labels, which are normal strings from a list.
However, when I do
mpld3.show()
These labels don't show correctly and I just get numbers. It seems to be a known issue but I only found a fix for it for some guy using dates, not strings.
What's the usage of this: plt.figure() Because when I got rid of it the plotting worked fine
I do fig, ax = plt.subplots(figsize=(20, 10))
What's it used for ?
For plotting.
I create a bar chart comparing the stats of different features.
I meant the fig variable
which is plt.figure() if I'm not mistaken
Plotting seems to work fine without it
Not sure how to accomplish that.
I must create a figure to have the subplots in, no?
Also if I didn't have that fig I couldn't save it to html later. \
With pyplot it works for me too.
But I am aiming for mpld3. It's much more interactive when used as html.
Soo.. any idea on how to get the labels to mpld3 plot?
ax.set_xticks(x)
ax.set_xticklabels(labels)
This does the trick for pyplot but not mpld3
I just googled, I'm not sure if this is gonna help but here
I am using matplotlib to make scatter plots. Each point on the scatter plot is associated with a named object. I would like to be able to see the name of an object when I hover my cursor over the ...
Hm, it's a bit different from what I'm going for but it might help. Thanks
excuse me, do u guys know why i keep getting the same accuracy on my SVMClassifier? but i got a variant accuracy when i test on my DecisionTree(DT) Classifier?
SVMs don't work in epochs
well, internally they do. because the fitting algorithm is usually iterative
in fact the same is true for DecisionTree
i assume these are sklearn models?
the whole idea of an "epoch" is really an implementation detail of gradient descent
basically anywhere other than a neural network, the software tries its best to hide the optimizer from the user
yeah, it's sklearn models.
but sir, what should i do if i need some accuracy testing on SVM (when i know svm don't work with epochs)
you need bootstrapping or cross validation for that
you should do that with other models too btw
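A small sketch of the cross validation idea with sklearn, on synthetic data from `make_classification` (the dataset and the model settings are placeholders, not the 800-image set from the question). It also shows why refitting an SVM on the same data keeps giving the same accuracy: the fit is deterministic, so the score only changes when the data split changes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 samples, 10 features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold cross validation: five different train/test splits, five scores
svm_scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# Fitting the same SVM on the same folds again reproduces the same scores
svm_scores_again = cross_val_score(SVC(kernel="rbf"), X, y, cv=5)
svm_is_repeatable = np.allclose(svm_scores, svm_scores_again)

print("SVM:  mean %.3f, std %.3f" % (svm_scores.mean(), svm_scores.std()))
print("Tree: mean %.3f, std %.3f" % (tree_scores.mean(), tree_scores.std()))
```

sklearn's decision tree shuffles the candidate features at each split, so without a fixed `random_state` its score can wobble between runs even on identical data.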
like cnn?
yes although if the model is big and complicated and slow to fit, then sometimes it doesnt make sense because it would take too long to run
actually my dataset only 800+ images
i recommend you step back and consider why all this is happening
look at how SVMs and decision trees are fitted
and why epochs are used in fitting NNs but not other models
ok sir, thanks btw
Is anyone familiar with the text-to-audio generation process? I'm interested in realistic / emotional voices.
Specifically:
- Are there any datasets that contain labeled emotional audio data (e.g. "sad", "happy", "surprised")
- Is there any intuition for what spectrograms would look like which emotions?
Hey all! I had posted a question earlier on what model would be good for cipher applications like Input: welcome Output; njoigfr. I had been recommended to use models like BERT with powerful word embeddings but it seems that NLP models study tokens with respect to other words in a sentence. My intention is not to have it take it as a sentence. My intention is for the model to find out a relationship between the input and output and on the basis of the relationship predict the output accordingly.
MY training data is a .csv file which looks like this:-
inp1, out1
inp2,out2
Here inp stands for input and out stands for output. So can anyone confirm whether a BERT-like NLP model can find the relationship b/w input and output data considering 1 row not to consider the whole dataset?
I recommended bert to you, it was before i knew you were doing ciphers though and would definitely be a bad idea
bert is used for getting the contextual information between words, which a hash has no use for
So would you happen to know what model would be able to handle that?
Throw out any model that relies on word embeddings, unless you know for a fact that they are used in the hashing function, do you have any idea how the hashes are actually generated?
Yeah, They are made from a cryptographic function though I don't know how exactly they accomplish that. For me, that function is like a black box...
and are the output hashes always the same length?
Well, does the output always have the same number of characters as the input?
I think so. I haven't decided on a cipher yet but probably the input characters would be equal to the output ones...
Well, if its going to be some kind of substitution cipher like enigma, perhaps a RNN would be best
Learning about learning.
Hmm... What if I use a cryptographic hash? Then RNNs won't be very suitable. The timesteps will have no relations whatsoever....
Then you probably wont be able to solve it with a NN faster than brute force
Because you probably have a salt to figure out as well as an unknown algorithm
The whole point is to determine whether there do exist any arbitrary relations between the hashes and the inputs. There is always that bias in there. Even though the chances are pretty slim, but I wanna experiment on them
Just try a dense network for a start
is the entire message encrypted? or are we talking about individual encrypted words
because if you encrypt each word in a message then you're basically solving a "fill in the blank" problem where you probabilistically infer the most likely word in each slot
Initially I am considering hashes as input and the numbers as output..
but to decrypt an entire encrypted message is basically saying "i dont care about the theoretical results im going to try anyway" which seems like its likely to result in failure but i guess it doesnt hurt to try
The numbers correspond to the initial word?
Of course, but doesn't hurt to try. The hashes are pretty complex, but I want to start delving into some pre-college research about it and maybe brainstorm some ideas in the later years...
@acoustic halo the hashes correspond to the numbers
And how do you link a number to a hash?
By encrypting it
Basically encrypting the number as output and the hash as input for the model
Well, i would start with just a densely connected network as a start
My model should be able to derive more than the statistical relationship and move towards complex ones. That's why I am struggling to choose the right model. High dimensional vector representations seem like a weak start, but would probably do.
But Dense layers cannot absorb abstract relations
Considering you don't know if any relation exists anyway, it would be a start
And several stacked dense layers can learn complex and abstract patterns
depending on how you define abstract
Well, good for a start I guess. May look for 1024 Dense layers just for starting but I guess it will do for experimentation....
i mean stacked dense layers are basically what most neural nets are :/. And they work pretty well in a lot of cases
ehhh, that's a bit of a stretch
fully connected layers really aren't that common anymore
*networks, not layers
well networks are different. There's this concept of sparse layers that is used, but numerically the operation is pretty much the same in most cases. We're still running computations over those 0's, it's just a lot more efficient.
Yeah I understand that one can be used to represent the other (not that it should be done), but with that definition you can basically go down to like addition and multiplication, and while it may be true, it's not exactly a useful definition
true true, but at the same time a lot of people think these various layers are completely different things since they never look at the actual mathematical operation behind it.
Its good to know where their similarities lie, and why they work.
are there any coherent resources on neural network architectures for more "traditional" problems? im specifically not interested in the typical deep learning domains like images, audio, video, nlp/text, or even time series. im wondering about more mundane problems like autoencoders and prediction on "social science" datasets, more akin to titanic, boston housing, etc. than mnist.
id be interested in any research comparing training times & prediction/inference performance with other methods like xgboost
i ask because i was recently playing around with some NNs that gave me huge increases in accuracy (like 10+ percentage points) on a problem at my company, using just 1 hidden layer with parameters that just sounded like nice round numbers and weren't hyper-optimized at all. so it got me thinking that there was a lot of untapped potential for neural networks in domains where they aren't necessarily popular or dominant. trying to educate myself a bit.
What is Data Science. I dont really have a clear understanding of what it is and what it is used for.
@lapis sequoia it's a broad term that encompasses statistics, machine learning, and data analysis. usually someone with a "data scientist" job title works on some combination of those things.
hello anyone here?
have a little doubt here.
so i trained a model and the accuracy shows to be around 90%.
but when i submit the results, my AUC-ROC Score comes out to be very low.(in the range of 0.5)
so what i am doing wrong?
@past maple this is binary classification? are your classes very imbalanced?
yes its binary classification.
also yes imbalanced classes.
but when i use random forest the accuracy is quite less like 18% but the AUC-ROC Score score improves. (in the range of 0.8)
@desert oar
if your classes are 90% "A" and 10% "B" your model can get 90% accuracy by predicting "A" for any input
random forest might be doing a better job at not overfitting to the baseline class distribution
so how do i overcome this thing?
If the data is highly imbalanced, computing the accuracy is perhaps not the best way to evaluate your model
then what should this poor soul do?
I think your first model always predict 0 or always predict 1, hence the AUC score close to 0.5
yes right.
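A quick way to see the "90% accuracy but AUC around 0.5" effect for yourself. The 90/10 split and the constant-score "model" below are made up; the metric functions are from sklearn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 90 samples of class 0, 10 samples of class 1
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that gives every input the same score, i.e. always predicts class 0
constant_scores = np.zeros(100)
y_pred = (constant_scores >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, constant_scores)

print("accuracy:", accuracy)  # high, because class 0 dominates
print("ROC AUC:", auc)        # chance level, the scores carry no ranking information
```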
Consider using other metrics such as precision, recall or F1 score. You can also visualize the confusion matrix to see where you make most of your errors
okay noted, will check with that.
depending on your model you can improve the outcome by adjusting hyperparameters
you might also have success with oversampling or undersampling, those dont always work well though
yes, i have tried adjusting the hyperparameters for random forests. but then it slightly improves the model.
I'll second the use of other evaluation metrics. Consider those which are more adequate for your scenario of a binary classification with an unbalanced label distribution, like weighted accuracy or class balance accuracy. For binary classification, I'm quite partial to geometric mean of sensitivity and specificity.
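To make those metrics concrete, here is a small NumPy sketch that computes them from the confusion matrix counts. The `binary_report` helper and both sets of predictions are hypothetical: one always predicts the majority class, the other catches 8 of 10 positives at the cost of 9 false alarms.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Confusion-matrix based metrics for labels in {0, 1}, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # a.k.a. sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": (recall * specificity) ** 0.5,  # geometric mean of sensitivity and specificity
    }

y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100
useful_model = [0] * 81 + [1] * 9 + [1] * 8 + [0] * 2

report_majority = binary_report(y_true, always_majority)
report_useful = binary_report(y_true, useful_model)
```

The second model has slightly lower accuracy than the always-majority one, but its recall, F1 and G-mean are far better, which is the whole argument against judging imbalanced problems by accuracy.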
okay okay, thank you tho. i will see what i can do.
honestly i am just getting started so figuring out these things.
ok
Hey, so i've been doing Machine Learning for a few days and have this code
```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
from matplotlib import style
import pickle

style.use("ggplot")

data = pd.read_csv("student-mat.csv", sep=";")
predict = "G3"
data = data[["G1", "G2", "absences", "failures", "studytime", "G3"]]
data = shuffle(data)  # Optional - shuffle the data

x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# Refit on 20 random splits and keep the model with the best test score
best = 0
for _ in range(20):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)  # R^2 on the held-out split
    print("Accuracy: " + str(acc))
    if acc > best:
        best = acc
        with open("studentgrades.pickle", "wb") as f:
            pickle.dump(linear, f)

with open("studentgrades.pickle", "rb") as pickle_in:
    linear = pickle.load(pickle_in)

print("-------------------------")
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
print("-------------------------")

predicted = linear.predict(x_test)
for i in range(len(predicted)):
    print(predicted[i], x_test[i], y_test[i])

plot = "failures"
plt.scatter(data[plot], data["G3"])
plt.legend(loc=4)
plt.xlabel(plot)
plt.ylabel("Final Grade")
plt.show()
```
But i still just don't get it. The results are still the same. It looks like it won't learn anything
wdym?
So the result's are the same, i can not see if it learns from it
It looks like you're recreating the model every epoch. No wonder it doesn't learn.
I'm sorry my english isnt that great haha and i've just begun so if i'm being honest, I don't know what to do, and how to not recreate it every epoch
this is the same problem that someone else had
you dont use epochs with sklearn models
So i should just remove the line of code?
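To show what "no epochs" means in practice: `LinearRegression` solves the least-squares problem in one go, so fitting it again on the same split gives exactly the same score. The only thing that changes the number in a loop like the one above is the random train/test split. A small sketch on synthetic data (the data and coefficients are made up, this is not the student-mat.csv file):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 features, a linear target plus some noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Same split, fitted twice: the score does not move
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
first = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
second = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Different splits: the score varies, but only because the test rows changed
scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=seed)
    scores.append(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
```

Also note that `score` on a regressor returns R^2, not a classification accuracy, so "Accuracy" in the printout is a bit of a misnomer.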
Specializations on Coursera can help you
@acoustic halo Just saw your message now, just figured that HDF5 would be of more convenience later down the road just in case I have to move around between libraries, better to learn it now then later
@lapis sequoia Did the link help you out
actually a lot of things bro ml deep learning and computer vision engineers and data scientist which to choose
You have $10 to spare @quartz crow ?
no bro i am a poor at the moment
ML/DL, computer vision, data scientist all kind of fall under the same umbrella
with computer vision you utilize ML/DL
yeah i know
just you are working with images
as a data scientist if someone gives you images as data you as a data scientist are supposed to extract valuable information from that data and make it work with ML/AI
Hmm if you had $10 I would have recommended a nice updated tensorflow 2 course that covered all those topics
but take a look at Sentdex on youtube
he has a lot of tuts some may be outdated though
hmm ok . i need help for cv
An updated deep learning introduction using Python, TensorFlow, and Keras.
Text-tutorial and notes: https://pythonprogramming.net/introduction-deep-learning-python-tensorflow-keras/
TensorFlow Docs: https://www.tensorflow.org/api_docs/python/
Keras Docs: https://keras.io/lay...
Take a look at this one its updated
coursera also has plenty of nice courses
Most of them are paid, but you can access any paid course in audit mode, which as far am I'm aware literally only disallows you from doing quizzes. All the materials, and most importantly programming assignments (including their automatic grading) are available.
tbh there's enough free material out there that paid courses aren't entirely necessary.
The advanced stuff you can just learn through medium and reading through papers.
@flat quest Yeaup but I wouldn't recommend it if it wasn't a good course, medium and reading through papers requires that the reader most of the time build their own structure in order to learn what to do next. I mean not many have an A-Z fully structured medium article, but yes a lot of free material
Woot woot got HDF5 to work in appending mode, currently looping through my 40k images
yeah you have to figure out how to get the information, and which one will be most useful for you
But its something everyone has to do eventually
nice! ;D. Guess it didn't take too long to learn?
Nope, I mean internet was getting installed today, was out, took about 2 hours of research
ah i see. Yeah its worth learning it, can use it with a number of different libs/packages.
Yeah always better to get the tough part out early rather then later.
Yeah I see it's taking care of the reshaping in itself per batch. So I don't have to reshape through Numpy.
HDF5 maintains order right? @flat quest
hey guys, I've got a work problem
I'm scraping media news articles and I need to get the article's text that's not a script or some other random sh*t you can find in html data. I'm using BeautifulSoup, for example soup.select('article p')
it is 1 am and I'm asking questions on a discord channel I just joined, so bear with me
You know if their ToS allows it?
if you mean the source, yes, though good point to ask
I think so yeah @fervent bridge. Not entirely sure on that one
@fervent bridge could you send the link to it
I'm not too familiar with soup but would changing the .select class to children work?
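One common approach with BeautifulSoup is to delete the `script`/`style` tags from the tree first and only then pull the paragraph text. A minimal sketch, assuming bs4 is installed; the HTML snippet and the `trackPageView` call are invented, and real news sites will need their own selectors.

```python
from bs4 import BeautifulSoup

# Invented page: nav and footer noise, plus a script tag inside a paragraph
html = """
<html><body>
<nav><p>Subscribe now</p></nav>
<article>
  <h1>Headline</h1>
  <p>First paragraph of the story.</p>
  <p>Second paragraph with <a href="#">a link</a>.<script>trackPageView();</script></p>
  <p>   </p>
</article>
<footer><p>Cookie notice</p></footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove tags whose contents are never article text
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()

# Only paragraphs inside <article>, with whitespace collapsed and empties dropped
paragraphs = [" ".join(p.get_text().split()) for p in soup.select("article p")]
paragraphs = [text for text in paragraphs if text]
article_text = "\n\n".join(paragraphs)
```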
Hey guys, do you know a good project example in data science where I can train OOP ?
@blazing bridge Its a great course for High Level knowledge, I mean I consider it a must have. Covers a wide range of NN in TensorFLow 2
Again its all High Level focused around TensorFlow 2 but great to work with.
ty
@fervent bridge thank you
yeaup going to bed, but This course and NNFS got me on the right track, NNFS providing more lower level knowledge and the Udemy course complementing it.
```python
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

inputs = np.array([[313, 1], [323, 1], [333, 1], [343, 1]], dtype='float32')
target = np.array([[14.76], [16.42], [18.08], [23.41]], dtype='float32')
inputs = torch.from_numpy(inputs)
target = torch.from_numpy(target)

model = nn.Linear(2, 1)
preds = model(inputs)

train_ds = TensorDataset(inputs, target)
train_dl = DataLoader(train_ds, batch_size=5, shuffle=True)

loss_fn = F.mse_loss
loss = loss_fn(preds, target)
opt = torch.optim.Adam(model.parameters())

def fit(num_epochs, model, loss_fn, opt):
    with torch.autograd.set_detect_anomaly(True):
        for epoch in range(num_epochs):
            for xb, yb in train_dl:
                pred = model(xb)
                loss = loss_fn(preds, yb)
                loss.backward(retain_graph=True)
                opt.step()
                opt.zero_grad()
            if (epoch+1) % 10 == 0:
                print(f"Epoch: {epoch+1}, loss: {loss.item()}")

fit(50, model, loss_fn, opt)
```
Output:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
```
I followed the hints
But have no idea what that means
So I guess I tried to follow the hint
oh boy that's a fun one
what is loss_fn? @desert parcel
can you also show some of said backtrace?
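A likely culprit in the snippet above: the loss inside the loop is computed from `preds` (the global forward pass done once before training) instead of `pred` (the fresh one for the current batch). After `opt.step()` changes the weights in place, backpropagating through that old graph again is what autograd complains about, and `retain_graph=True` only hides the problem for the first step. A sketch of the loop with that fixed, assuming the same data; the seed, the `batch_size=4` and returning the losses are my additions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # only so the run is repeatable

inputs = torch.tensor([[313.0, 1.0], [323.0, 1.0], [333.0, 1.0], [343.0, 1.0]])
targets = torch.tensor([[14.76], [16.42], [18.08], [23.41]])

model = nn.Linear(2, 1)
train_dl = DataLoader(TensorDataset(inputs, targets), batch_size=4, shuffle=True)
opt = torch.optim.Adam(model.parameters())

def fit(num_epochs, model, loss_fn, opt, train_dl):
    losses = []
    for epoch in range(num_epochs):
        for xb, yb in train_dl:
            pred = model(xb)          # fresh forward pass with the current weights
            loss = loss_fn(pred, yb)  # pred, not a tensor computed before training
            opt.zero_grad()
            loss.backward()           # no retain_graph: the graph is rebuilt each step
            opt.step()
        losses.append(loss.item())
        if (epoch + 1) % 10 == 0:
            print(f"Epoch: {epoch + 1}, loss: {loss.item()}")
    return losses

losses = fit(50, model, F.mse_loss, opt, train_dl)
```

With inputs around 300 and Adam's default learning rate it will still converge slowly, so scaling the inputs is probably worth trying too.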
I wish they had an ML practice section CodeWars
just writing short code or "complete this code" for Numpy/R/PyTorch to practice when there is time to kill
is there any website like that?
Don't think so. it would require lots of computational power cuz u can't just check if your code is the same as the correct one
(it assumes u mean deep learning)
But other algorithms might require lots of power too
So, (IMO) pretty unlikely
But ofc u can google and check it yourself
Kaggle is good
actually there is a lot of good content on there
but a lot of the quick practice content is from universities, and they lock access to everyone but the people in that course
@lapis sequoia just download a dataset and play with it
e.g. from the UCI machine learning site
plenty of small clean easy to understand datasets out there
or simulate your own data, which is more advanced but potentially more educational
that's not how I learn
it's the same as giving a student a bunch of problems without teaching him how to solve them and say "just try them"
oh, you are asking for specific tasks to complete
not just data
there is actually an interesting lack of that, out there in the world
obviously students get those kinds of assignments in school
I believe Udacity and Coursera courses have such practice problems
who even writes questions for those kinds of sites?
yeah it could obviously time out computations after a certain point, limit memory usage and processes/threads
charge for premium membership to answer more than N challenges per day
etc
seems like a legit product tbh
@banville#2284 You can use regex maybe?
Using regex to parse html is an absolutely terrible idea

idk , for simple stuff regex can be ok, as a general rule its better to use bs4 or something, but for something simple egrep and sed can be useful
Can anyone explain why Keras Embedding layers doesn't accept strings? It seems to run on numbers fine however For strings it requires one-hot encoding which kinda defeats the purpose of creating the embeddings. In the end, I got it One-hot encoded (a -> 1; b -> 2) But am still curious. Isn't the whole purpose of embeddings to represent data in higher dimensions? Why didn't Keras implicitly understand it and encoded it accordingly??
Would it be just a lack of feature in Keras, or does it make sense not to have the embeddings accept strings and have the dev one-hot encode them?
@grave frost when one-hot encoding, you turn each value (in your case, characters) into an array containing a single truthy value (a one), with zeros everywhere else. if keras tried to one-hot encode data as you feed it in, it wouldn't know how long to make each array. e.g. one-hot encoding a sequence like 0, 3, 1, 2, 0, 3 gives a matrix of shape 6x4 (6 items, 4 classes). if the model is fed data a bit at a time, it doesn't know how many classes there will be, so it would end up with inconsistently shaped arrays representing each class. it could one-hot encode all the data at once, but then i think every possible class has to appear in that input so it knows how many classes there are, as this shouldn't be changeable
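It may help to separate two things that get mixed up here: mapping `a -> 1; b -> 2` is integer (index) encoding, and that is what an embedding layer expects as input. One-hot is the 0/1 vector version. An embedding lookup by index gives the same result as multiplying the one-hot vector by the weight matrix, just without building the big sparse array. A NumPy sketch of that equivalence; the random matrix stands in for learned weights and the 4-dimensional size is arbitrary.

```python
import numpy as np

text = "welcome"

# Build the vocabulary up front, so the number of classes is known and fixed
vocab = sorted(set(text))                       # ['c', 'e', 'l', 'm', 'o', 'w']
char_to_index = {ch: i for i, ch in enumerate(vocab)}

# Integer (index) encoding: one integer per character
indices = np.array([char_to_index[ch] for ch in text])

# One-hot encoding: one row per character, one column per vocabulary entry
one_hot = np.eye(len(vocab))[indices]           # shape (7, 6)

# Stand-in for a learned embedding matrix: vocab_size x embedding_dim
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))

looked_up = embedding_matrix[indices]           # what an embedding lookup does
via_one_hot = one_hot @ embedding_matrix        # the same thing, the expensive way
```

As for strings: the layer needs the vocabulary size when it is created, so something has to build the string-to-index mapping beforehand.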
hello i have a simple dataframe with 2 colums including minutes and points. i just would like to see by curiosity what a sklearn model would predict for this dataset. how can i do that please. i dont really know sklearn and would like to see thank you
look like that
@spark stag What about compromising on the inconsistency of input dim by padding after doing analysis of the data pipeline OR more practical, having the user specify the custom dims of the batches if data is like that and handle it accordingly. Like if I had encoded something like [0. 0. 0. 1. .....] for each character, it would be overkill and too memory intensive (Like used by scikit-learn lib for 1-hot). I just think that the whole way it works could be improved manifold and is a bit too complex....
@fringe cove look up Linear Regression...
ok thank you
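A minimal sketch of what that could look like with sklearn, assuming a dataframe with `minutes` and `points` columns. The numbers are invented, since the real data was only shown as an image.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented stand-in for the two-column dataframe
df = pd.DataFrame({
    "minutes": [10, 15, 20, 25, 30, 35],
    "points": [4, 7, 9, 12, 14, 17],
})

# sklearn wants a 2-D feature table, hence the double brackets
model = LinearRegression().fit(df[["minutes"]], df["points"])

# Predict points for a new number of minutes
new_rows = pd.DataFrame({"minutes": [40]})
predicted_points = model.predict(new_rows)[0]

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted points for 40 minutes:", predicted_points)
```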
hey folks, anyone knows how to drop multiple columns from a dataframe using slices of index locations?. I've been stuck on this for a couple of hrs and can't find anything on Stackoverflow , wanting to drop columns with indexes 1:31 and all columns after column index 67 stage_metrics.drop(stage_metrics.iloc[:, [1:31, 67:], axis = 1, inplace = True)
Also tried this one stage_metrics.drop(stage_metrics.columns[1:31, 67:], axis = 1, inplace = True)
Why don't you use a Pandas DataFrame? It has all these functionalities and would arguably be a better, more feature-rich tool for any dataset...
yes, I'm on pandas, sorry trying to get the code to display in color
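One way this could be done, as far as I can tell: a single indexer can't take two slices, but np.r_ can concatenate several slices into one array of positions. A minimal sketch, assuming a made-up 80-column frame standing in for stage_metrics, and assuming "after column index 67" means 67 onwards as in the attempt above. np.r_ needs an explicit stop, so the column count is passed in.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for stage_metrics: 80 columns named col0..col79
stage_metrics = pd.DataFrame(
    np.zeros((2, 80)), columns=[f"col{i}" for i in range(80)]
)

n_cols = stage_metrics.shape[1]
# np.r_ joins the two slices into one array of column positions
to_drop = np.r_[1:31, 67:n_cols]

# Look up the labels at those positions and drop them
trimmed = stage_metrics.drop(columns=stage_metrics.columns[to_drop])
print(list(trimmed.columns))
```

This returns a new frame instead of using inplace=True, which tends to be easier to debug.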
@grave frost there probably are ways of doing it but its quite a lot of overhead for it to process as its being fed data, especially as, unless they process the data every time they see it each epoch, a new copy of the data needs to be made that is one hot encoded so now you have 2 copies in memory
Hmm... that makes sense
idk, there may be an easy way for it to be implemented, but in my experience at least it's not too much effort to do manually, especially considering how much easier keras makes setting up a neural network in general
Of course, But I had to spend an hour or something just to make that one-hot encoding work (I don't like coding) And I couldn't use Sk-Learn at all for my use case..
A dedicated lib for that would be so much better and smooth..
@desert oar
A lot of beginners might use that kind of site. Don't think it'll be really that helpful in pursuing a DS career, but yeah there's a good chance beginners might buy
@flat quest yep, about as useful as pursuing a programming career
good for younger people i think
or real real novices who dont know enough to make toy problems for themselves
xd
Maybe as an introductory. They're gonna have to learn to ask their own questions on the data and make their own problems. Not many ppl get to that stage :/.
But if they buy it -> its a selling product
With all the content on YouTube and other resources, it seems hard to believe that any beginner would buy something like that. When I was starting out, I saw plenty of these paid resources, but the free ones don't have any problems. The real factor is that these paid resources usually just bunch the topics in the right order in one place, so people don't have to look up complex eqs on Wiki or hunt YT for an explanation of k-means clustering. That said, few people do them for learning. Mostly they are for boosting credentials for newbies who think they matter....
@grave frost ok i manage to do this following a tutorial
i suppose these scores are y = ax+B ?
you'd be surprised how many noobies do that
Yes you can learn DS through reading online articles, books, yt, working on your own datasets, etc. But very few people are actually willing to go through all that. They'd rather complete a course or a set of problems that would certify them as job-ready.
Tho algorithmic competition sites are quite widely used. So some people might do it just for the fun of it
Is the plot for the whole data, and is it correctly represented? double check all your code because it doesn't seem like a Linear problem but rather a regression one.
i think i messed up in my head
@fringe cove no, it's the R2 coeff
@flat quest But to be honest, they really aren't much use even for getting "job-ready". I have read the experiences of many Data Scientists who have analysed people who put MOOCs on their CV and whether they got the job or not. The numbers aren't very pretty....
it's an indicator that represents how well your model fits the data
oh they're not useful at all @grave frost
But noobies will always fall for it.
Newbies will fall for anything, as long as it looks professional and is affiliated to a big company..
so this is the data i have, just to start over because i think i'm overplaying it. this is the nba scoreboard for a season of a player
Saying they're not useful at all definitely wrong, sure, doing a MOOC doesn't mean you're fit for a job yet, but it doesn't mean you didn't learn anything doing said MOOC
if i know this player will have a minutes > 30 minutes in next game
Also, saying "MOOC are bad just watch youtube" is laughable
is it possible to have a model from all these data ? and make a prediction for points ?
@odd yoke Of course, but YT would still be free anyways...
Yes, that's true, but that's also the case for some coursera courses for example
(I agree the """"certification"""" they give you at the end is basically digital toilet paper)
sure you can gain exp and knowledge from an MOOC
but an MOOC doesn't really mean all that much to a job recruiter
And when you see frauds like Siraj Raval on yt having such big communities, I find it hard to say that using yt is a better idea
his videos are cool ( as a complete newb)
He is the embodiment of the ML hype taken to the extreme
He doesn't know much about it, but pretends he does, because it draws people in
@fringe cove Take my advice- stop watching him now
^^
i'm trying to get practical with data by making some scripts with nba data for player performance
i have mastered the scraping and now have a complete data for the season for every player
as u can see i can plot things etc and make some deductions with my brain
but i d love to see what a mathematical model could do
@flat quest What you guys recommend then to use to learn instead of MOOC's if you are beginner?
I think MOOCs are fine
@odd yoke Did you see some of the videos that gave evidence that he:-
- Plagiarised a paper on neural qubits and claimed that he wrote it
- Copied tons of Code from Github without citing the author, made minimal changes, and called it his own code
- Filed a YT copyright infringement on another YTber who unearthed all his black activities
- scammed newbies in a $200 course titled "How to earn money with ML" from which he made approx. 200,000$
Yeah I did
I'd rather not talk about him or I'll get angry and spam this channel when someone is asking questions
Ok ๐
yeah stop fight and just model my shit xd
they're not bad for new people. Just don't list them on your resume or depend on them too much.
But at some point you should transition into just reading through other people's work and suggestions (articles, papers, books, other resources, yt maybe), rather than following a predefined course @fringe cove
Yeah, just like "regular" programming really, you can't stick to tutorials indefinitely
Agree, you have to put that knowledge into practice. for someone that doesn't have a Comp Sc education i think it might help to redirect attention to another area
what should i look for when what i want to do is like feeding lots of data and make the model find the best fit for predictions ? i dont know if i make sense at all
but in my case
tonight orlando plays against indiana
i would like to know what a model would say about one player points for tonight game
You're defining machine learning here, it's kinda hard to give a useful answer with such a broad question
according to all previous data
yeah i want ml haha
only experience i had was mechanical arm movement training while in internship
but nothing else
There is all sort of recurrent networks you could use if you want to preserve knowledge from past matches, I don't have experience in that area so perhaps there are better fitting algorithms but maybe start looking there
but can u do this in like 2 lines of code with sk learn just to see a basic view ?
hello guys
i realise it looks candid
but rn i'm just curious af
and have no knowledge at all in ml
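For a first look, a straight-line fit of points against minutes is probably the simplest thing to try. A minimal sketch, with two caveats: the numbers below are invented for illustration (they are not the real scoreboard), and it uses numpy's polyfit rather than sklearn. sklearn's LinearRegression would fit the same least-squares line on one feature. The R2 value mentioned earlier is computed by hand so it's clear what it measures.

```python
import numpy as np

# Invented example games: minutes played and points scored
minutes = np.array([18, 22, 25, 28, 30, 33, 35, 36], dtype=float)
points = np.array([6, 9, 11, 12, 15, 17, 18, 21], dtype=float)

# Least-squares line: points ~ slope * minutes + intercept
slope, intercept = np.polyfit(minutes, points, deg=1)

# R2: share of the variance in points that the line explains
predicted = slope * minutes + intercept
ss_res = np.sum((points - predicted) ** 2)
ss_tot = np.sum((points - points.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Prediction for a game where the player gets 32 minutes
points_at_32 = slope * 32 + intercept
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
print(f"predicted points at 32 minutes: {points_at_32:.1f}")
```

So the a and B in y = ax+B are the slope and intercept, and R2 is a separate number that says how well that line fits.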
i just wanted to know if the book i bought "datascience from scratch" is a good book to start data science with
if u bought it i think thats it is worth trying it lol
its like 450 pages xd
i am trying to finish a python tutorial book first before i start
never heard of it
sorry if i killed your convo there
dw i'm just a noob like depending on them to tell me what to do so ^^
if u know the cell u can find the column name with the indice no ?
How can I get the repeated names ?
.duplicated
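A small sketch of how .duplicated could be used here, assuming a column called name (the column name and values are made up). keep=False marks every occurrence of a repeated value, not only the later ones.

```python
import pandas as pd

# Hypothetical data with a few repeated names
df = pd.DataFrame({"name": ["ana", "bob", "ana", "cy", "bob", "dee"]})

# Rows whose name appears more than once
repeated_rows = df[df["name"].duplicated(keep=False)]

# The distinct names that are repeated
repeated_names = repeated_rows["name"].unique().tolist()
print(repeated_names)
```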
Hi i am a student doing web development with django.
I am thinking to start moving toward AI and deep learning!
Can any of you show me the right path to start?
I usually recommend this coursera course, it's a very nice overview of the field with programming assignments:
https://www.coursera.org/learn/machine-learning
It uses Octave for them, mind, not Python.
Hi all! Is this a good place to ask a question about Pandas?
Yeah, pretty good.
Stupid question incoming. How can I append a row to a dataframe? I've been using append with no luck. I have the row in a list.
@lapis sequoia show us sample data & the code you're running, which reproduces the error or problem you have
right away
```py
def my_fun(i):
    return [i**2, i**3, i**4]

df = pd.DataFrame(columns=['i','a','b','c'])
for i in range(100):
    [a,b,c] = my_fun(i)
    df.append([i, a,b,c])
display(df)
```
ah
DataFrame.append doesn't work like list.append
it doesn't modify the dataframe, it creates a new one with the row appended
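def my_fun(i):
    return [i**2, i**3, i**4]

df = pd.DataFrame(columns=['i','a','b','c'])
for i in range(100):
    [a,b,c] = my_fun(i)
    # pass the row as a dict so the values line up with the columns
    # (DataFrame.append was removed in pandas 2.0, so this needs an older version)
    df = df.append(dict(zip(df.columns, [i, a, b, c])), ignore_index=True)
display(df)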
however i don't really recommend constructing dataframes this way. it's quite inefficient
it's much faster if you do it like this:
def my_fun(i):
    return [i**2, i**3, i**4]

colnames = ['i','a','b','c']
data = []
for i in range(100):
    a, b, c = my_fun(i)
    record = {'i': i, 'a': a, 'b': b, 'c': c}
    data.append(record)
df = pd.DataFrame(data)
or maybe better still:
def my_fun(i):
    return [i**2, i**3, i**4]

data = []
for i in range(100):
    a, b, c = my_fun(i)
    data.append([i, a, b, c])
df = pd.DataFrame(data, columns=['i','a','b','c'])
yeah list-of-dicts is one way
list-of-lists is another
yeah adding rows to a dataframe is really slow
only call stuff from it
adding columns isn't that bad
actually adding columns is pretty efficient
but adding rows is slow and you should avoid it if possible
reading the docs and trying things
I am a Matlab refugee with 0 pandas experience
there are a lot of docs pages, not all of them are well-written or easy to understand
ah
well, you should feel comfortable with numpy
which is basically modeled after matlab
pandas is more like R
or more like Excel if you've never used R
yeah, you have to suffer through it
one of my many "todos" is to contribute better user guide content for pandas
for machine learning engineer what skills are necessary
@desert oar Thanks a lot for your time! Have a nice day/evening!
youre welcome
Has anyone here used async for their data science stuff?
only for webscraping or otherwise hitting APIs. not much value in it otherwise
I've used joblib to parallelize stuff but that's not the same thing.
async =/= parallel
right
stick with joblib
yeah its not meant for that
then what's async for for?
asyncio lets you run stuff in a separate process w/ run_in_executor
thats a complicated question
you know how __iter__ works?
ye
concurrency != parallelism
now imagine you await before you yield
that's what async for is
but yeah, async/await isn't even a good programming model for computational parallelism
let alone a good way to implement it in python
stick with joblib or concurrent.futures.ProcessPoolExecutor or multiprocessing.Pool
or dask or ray et al
oh you said what i posted above already
i went to take my food and i saw this convo, should have scrolled up if i wanted to be useful
lol it happens
anyway async/await can make life easier if you're hitting APIs and you want the freedom to ctrl+c without doing a bunch of extra work
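To make the "await before you yield" idea concrete, here is a minimal sketch of an async generator consumed with async for. The function fetch_pages is hypothetical, and asyncio.sleep stands in for a real network request. Nothing here runs in parallel: the event loop is just free to do other work while each await is pending.

```python
import asyncio

async def fetch_pages(n_pages):
    # Async generator: awaits (e.g. on I/O) before yielding each item
    for page in range(n_pages):
        await asyncio.sleep(0.01)  # stand-in for a network request
        yield page

async def main():
    results = []
    # async for awaits the generator's next item on each iteration
    async for page in fetch_pages(3):
        results.append(page)
    return results

results = asyncio.run(main())
print(results)
```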
I think replace will only do the first occurrence
while currency_symbol in my_str:
my_str = my_str.replace(currency_symbol, '')
could work
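For what it's worth, plain Python str.replace replaces every occurrence unless you pass a count as the third argument, so the loop shouldn't be needed. A quick check with a made-up price string:

```python
price = "$1,000 and $2,000"

# No count: every "$" is removed
cleaned = price.replace("$", "")

# count=1: only the first "$" is removed
first_only = price.replace("$", "", 1)

print(cleaned)
print(first_only)
```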
There's only one symbol in every price
After the sum
Because they are strings/objects
I assume
wait
Price object
Yeah
I have to head out but maybe rock salt lamp can help.
you have a lot of issues here
- what are you actually trying to do
- what does the source data look like
- I'm trying to sum every price that has a specific same year date so I can get the earnings of every year
so the data is like pd.Series(['Free', 'Free', '₹ 1,000', '₹ 530,000']) etc.
right?
Yeah
ok
well those are strings
python has no idea that the text contains numbers
so you can't just add them and expect them to be added like numbers
python doesn't know that "Free" means ₹0
so you need to parse the strings, to extract numbers
i can give you a solution, but you've been in this server long enough to start developing your own solutions
once you know the basics, "how do i do X" is a matter of putting together what you already know. maybe 80-90% of the time.
hey guys i just wanted to share a great opportunity: https://ignition-hacks-2020.devpost.com/?ref_content=default&ref_feature=challenge&ref_medium=discover
Its a very beginner friendly hackathon and offers a prize pool of $4700
@desert oar It was kinda a nightmare not gonna lie
How ?
just clean up the prices first
make a new column of "price" that contains numbers
you can use regex to remove all the non-numerical characters:
df['Price_num'] = df['Price'].str.replace(r'[,₹ ]', '').map(float)
Oh ..
df['Price_num'] = (
    df['Price']
    .str.replace(r'[,₹ ]', '')
    .mask(lambda x: x == 'Free', 0.0)
    .map(float))
forgot to handle the "Free" case
now you can do whatever you need to do with df['Price_num']
str.replace doesn't use regex btw
oh right
kind of poor choice imo
mask ?
should have made it not regex, then given regex=True or something as a parameter
mask is a bit of a weird function
Is it a python thing ? Or is it related to Pandas ?
pandas
pd.Series.mask
there is also pd.Series.where which does almost the same thing, but "reverse"
the first argument to pd.Series.mask is a condition: a Series of bool (True/False), or a function that returns one
ah you know what
do this instead, easier to understand
df['Price_num'] = (
    df['Price']
    # regex=True is needed for the character class to be treated as a pattern
    .str.replace(r'[,₹ ]', '', regex=True)
    .replace('Free', 0.0)
    .map(float))
<@&267629731250176001>
Here it treats Price values one by one?
Because I had to loop to make changes to everyone of them
yes, pandas methods let you make changes without looping
they can be significantly faster than looping
and a lot less code
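Putting it together for the "earnings per year" goal, a minimal sketch might look like this. The Date column name and all the values are assumptions for illustration; only the Price format comes from the conversation. regex=True is passed explicitly so the character class is treated as a pattern.

```python
import pandas as pd

# Hypothetical data: the real frame presumably has more rows and columns
df = pd.DataFrame({
    "Date": ["2019-03-01", "2019-07-15", "2020-01-10", "2020-06-30"],
    "Price": ["Free", "₹ 1,000", "₹ 530,000", "₹ 250"],
})

# Strip the currency symbol, commas and spaces, then treat "Free" as zero
df["Price_num"] = (
    df["Price"]
    .str.replace(r"[,₹ ]", "", regex=True)
    .replace("Free", "0")
    .astype(float)
)

# Sum the numeric prices per calendar year
df["Year"] = pd.to_datetime(df["Date"]).dt.year
earnings_per_year = df.groupby("Year")["Price_num"].sum()
print(earnings_per_year)
```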
I see ..
I will start making columns instead from now on
By the way
I know it's too early to ask but I'm just so excited
When should I start learning AI things?
you can start learning concepts now, or at least some math
I also know I'm not ready yet
I just wanna know when will I be
it's good to learn programming concurrently with the math and the concepts
you start putting ideas together
Oh?
Do you suggest a specific source ?
for ai? no, i have no idea
For the math of AI
what's your academic level and background?
Highschool
start learning pre-calculus and calculus. logarithms, exponential functions, quadratic functions, derivatives
maybe you can start looking at intro probability & statistics
and very simple linear algebra, concepts like understanding what vectors and matrixes are
once you know a little bit on each of those areas, you will start to learn important terminology and concepts
the more you learn, the easier it will be to learn more
I see !
Does anyone know how to fix a y axis so that it does not change when more traces are added to a waterfall/scatter plot chart?
Chart looks like this with one trace but the moment i add more, the y axis changes and it distorts the graph
Each trace index is following a list, however not every trace has all the elements of the list
@mellow spruce you can use ax.autoscale(False) to disable changing the axes
@desert oar is that set on the trace or on the fig.update_layout()?
```py
sorterIndex=dict(zip(routing_list,range(len(routing_list))))
group['Route']=group['ope_no'].map(sorterIndex)
group.sort_values(['Route'], ascending=True, inplace=True)
# axis passed by keyword; the positional form is not accepted by newer pandas
group.drop('Route', axis=1, inplace=True)
fig.add_trace(go.Scatter(
    name=k,
    mode='lines+markers',
    y=group['ope_no'],
    x=group['processstart'],
))
fig.update_layout(title="Title",
                  yaxis={'autorange':"reversed"})
fig.show()
```
the first part is the order that I want each trace to follow
no, it's plotly
oh
i have absolutely no idea
in the future, clarify what library you're using
i assumed it was matplotlib, i should have asked
Sorry, my bad. Thanks anyway
my b if this is the wrong channel but heres a simplified part of my code
def foo(bar):
    bar = bar + 1
    return bar

play = True
while play:
    baz = foo(0)
    print(baz)
how do i get it so that it prints numbers increasing instead of just 1's
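The loop always passes 0 into foo, so it always gets 1 back. One way to get increasing numbers is to feed the previous result back in. A minimal sketch, assuming it's fine to stop after five iterations (the original loop runs forever), with a list added only so the values can be inspected:

```python
def foo(bar):
    return bar + 1

baz = 0      # starting value, kept outside the loop
seen = []    # only here to record what was printed
while baz < 5:
    baz = foo(baz)   # pass the previous result back in instead of 0
    seen.append(baz)
    print(baz)
```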
Hi. does someone know of a website where I can get info, data, statistics, like a repository for covid 19? I would like to get data for analyzing.
@lapis sequoia were you able to load the HDF5 file into TensorFlow?
@lapis sequoia https://github.com/nytimes/covid-19-data
@gray scaffold thank you
no problem, enjoy
@desert oar
Warning: Error detected in AddmmBackward. No forward pass information available. Enable detect anomaly during forward pass for more information. (print_stack at ..\torch\csrc\autograd\python_anomaly_mode.cpp:42)
Traceback (most recent call last):
  File "d:/python/ML/Corrosion test/test.py", line 37, in <module>
    fit(50, model, loss_fn, opt)
  File "d:/python/ML/Corrosion test/test.py", line 30, in fit
    loss.backward(retain_graph=True)
  File "D:\python\ML\lib\site-packages\torch\tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "D:\python\ML\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2, 1]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Here is the entire error output.
There are versions where I tried to enable the anomaly detection
but have no idea how to
I did search it online but for some reason I can't find it
changing the optimizer to other options didn't work either
here is the code
I got it working
I copied another piece of sample code and that worked for some reason
After googling/stack overflowing/githubing for a while, I believe Facebook's Prophet modeling package does not include a feature importance method. Does anyone know a workaround to use here so I can see which predictors are most important in forecasting the target?
I'm trying to plot the earnings increasing but It's not working, What am I doing wrong ?
@arctic cliff you shouldnt iterate on the plot method. If you just feed that method a dataframe with the time data as the index and the y values as your single column you'll get the result you want.
Could someone compare the code? The one on the left works but the on the right doesn't. I tried to find the difference but so far has seen no difference.
@willow karma Oh! Thanks a bunch
@arctic cliff if you have a dataframe df with a date index and one column 'y_value'.. you would just need to run df.plot()
how many other methods are there to improv epredictions
improve predictions*
other than the number of iterations and messing around with the learning rate
Predictions:
tensor([[ 5.7500, 7.2500, 8.0000],
[ 5.7500, 7.2500, 8.0000],
[ 5.7500, 7.2500, 8.0000],
[15.0000, 14.0000, 15.0000],
[ 5.7500, 7.2500, 8.0000]], grad_fn=<AddmmBackward>)
----------------------------------------
Originals:
tensor([[ 5., 6., 6.],
[ 5., 5., 6.],
[ 7., 8., 10.],
[15., 14., 15.],
[ 6., 10., 10.]])
Because right now it's not the most precise
some are exactly on point
not all of them
Ohh maybe I can add more like stuff in the inputs
yes adding more data in the inputs worked
I added enough stuff until it became very precise
it finally figured it out
the best way to improve prediction is to use input data that's strongly related to your target, and to represent that input data in such a way that the relationship is easy to learn
@fervent bridge hey sorry missed your message yesterday, been busy. Nah haven't been able to, might try as an npz or npy file
Hmm did you want to go over it ? @lapis sequoia I am almost done getting it into TS
For sure
Will you be free in like half an hour
I'm just doing smth at the moment
Yeaup
@desert oar yeah that makes sense. I had like 5 extra rows of input data that's why the predictions were so close.
@lapis sequoia ready?
Apologies, gimme a bit more
How can I get the row of that value ?
There is an error with TensorDataset
I'm not sure how to fix it
there is an assertion error
assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
@arctic cliff
I put the error into google but everything is in chinese
Thanks !
@fervent bridge yo
@lapis sequoia ready?, been asking some questions trying to move along, shall we continue through DM?
yeah no prob
@arctic cliff use Series.idxmax
That's what I was looking for !
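A small sketch of how idxmax gives you the row: it returns the index label of the largest value, and .loc then pulls the whole row for that label. The frame and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical earnings table
earnings = pd.DataFrame({
    "Year": [2018, 2019, 2020],
    "Earnings": [1200.0, 5300.0, 4100.0],
})

# Index label of the row with the largest earnings
best_label = earnings["Earnings"].idxmax()

# Full row for that label
best_row = earnings.loc[best_label]
print(best_row)
```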
Hi All, new to the group. Is there a Data Science FAQ area?
I have a basic question for Pandas
I am new to it. Let's say I only want to consider values after a certain index. My index is integers. I have 50 rows and I only want to use data from row 26 onwards in my new calculation. There are three columns which are not in ascending or descending order
vaex is on my perpetual todo list
@granite light data.loc[26:] if you want to use the index value 26, or data.iloc[26:] if you want to use the row number 26
@desert oar thank you. Now how can I use it in a condition? Let us say I am doing data[data.column1 > something & index> 26]
I am not sure how to write that condition to ensure that first condition is only checked on the rows 26:50
you can save the subset of the data to a variable first
then apply your other conditions
True
But I am kinda trying to learn, so would like to know if it can be done without copying
data_sub = data.iloc[26:]
data_final = data_sub.loc[data_sub['column1'] > something]
you aren't copying data
Also what about computation time in the two approaches?
in fact if you try to modify sub pandas will give you a warning
@desert oar Okay. I hadn't considered this
.loc and .iloc try to avoid making copies when possible
in most cases they return different "views" to the same underlying data
the pandas documentation avoids saying that they never make copies
but i cant think of a time i used it where it did make a copy
Thanks a lot, that makes it a lot easier
it's a nice feature
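If you do want it as a single condition, something like the following should work. The parentheses matter because & binds more tightly than > in Python. A minimal sketch, with an invented frame, assuming a default integer index and that "row 26 onwards" includes row 26:

```python
import pandas as pd

# Hypothetical data: 50 rows, column1 deliberately not sorted
data = pd.DataFrame({"column1": [i % 10 for i in range(50)]})
something = 6

# Each comparison is wrapped in parentheses before combining with &
mask = (data["column1"] > something) & (data.index >= 26)
data_final = data[mask]

# Same result as the two-step version: subset first, then filter
data_sub = data.iloc[26:]
two_step = data_sub.loc[data_sub["column1"] > something]
print(data_final.equals(two_step))
```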
hey guys why does this line break my column header
and how can i merge rows with the same year column into a single row without breaking this
I have a weird question. Can anyone suggest any techniques/approaches for something like how Google returns a value for a parameter that was asked about?
For Example
parameter value
- "Temperate Today" - 20 F
- "Rating Dark Night" - 4.4
basically I've run into a problem mapping parameters to their values from a large set of Word documents. The Word documents have complex table structures, paragraphs and essays.
Parameters and value are in the document but not structured.
I'm looking for some help in cracking this
With Keyword search, NER Models I was able to get the parameters. But not able to find a solution to pull the relevant value of the parameters in a set of word documents.
Please tag me if someone could help
I have a basic question
np.random.permutation(n)
This just randomly chooses a few values from n right?
so it just changes the order of the thing?
if you pass it an iterable it will shuffle those values instead of np.arange()
```py
>>> np.random.permutation((3, 5, 4, 2, 3))
array([3, 5, 4, 3, 2])
>>> np.random.permutation((3, 5, 4, 2, 3))
array([5, 2, 3, 3, 4])
```
ah gotcha
also
this part wasn't explained clearly in the yt tut
import numpy as np

def split_indices(n, eval):
    eval = int(eval*n)
    index = np.random.permutation(n)
    return index[eval:], index[:eval]

train_index, eval_index = split_indices(len(dataset), eval=0.2)
@spark stag it will make a copy with shuffled values
So here it shuffles the array then takes 20% of it and puts it inside train_index and eval_index?
uh
it shuffles an array that represents the index
uhuh
then it takes the first x% of the shuffled array containing random indices
and uses that to form the evaluation set
@velvet thorn ah ye thats what i meant, it will use those values when creating the array but i guess i wasn't really clear on that
and the rest for the training set
so it shuffles it once, takes 20%, then shuffles it again?
although I don't like that code
no, it shuffles once only
eval as a parameter name is Bad
So it shuffles once, takes 20% and put it into the variables?
Lol I didn't know what else to put it
the parameter set by the yt tut
@desert parcel yes.
well I'm not sure if you have the same understanding as me
Oh yeah lol
so just to be clear
shuffle a sequential array (0, 1, 2...n - 1) representing indices
use the first x% for the evaluation set and the remaining (1 - x)% for the train set
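Here is a sketch of the same split with the parameter renamed so it no longer shadows the built-in eval. The names eval_fraction and seed are my own choices, not from the tutorial, and it uses numpy's Generator API so the shuffle can be reproduced.

```python
import numpy as np

def split_indices(n, eval_fraction, seed=None):
    """Shuffle the indices 0..n-1 once and split them into train/eval parts."""
    rng = np.random.default_rng(seed)
    n_eval = int(eval_fraction * n)
    shuffled = rng.permutation(n)  # one shuffle of 0, 1, ..., n - 1
    # first n_eval shuffled indices -> evaluation, the rest -> training
    return shuffled[n_eval:], shuffled[:n_eval]

train_index, eval_index = split_indices(100, eval_fraction=0.2, seed=0)
print(len(train_index), len(eval_index))
```

Because both parts come from the same shuffled array, no index can end up in both sets.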
Excuse me guys.
so, im trying to run 5 random state. where the result of accuracy each random_state i want to save it into csv file, do u guys know how to do it?
What kind of package do you use to generate a PDF report of your data analysis?
Hi! I am pretty new to Data Science. I was wondering how you would do, say, regression on real-time data? Would you have to train the model on the whole dataset again every time new data is available? Would you be able to pickle the model and then just do model.fit(x, y)
over and over again for every new data?
I am working on a little project which deals with realtime weather data and I want to predict the weather. I want to Implement it on my website and maybe a Discord Bot.
What you want is called "online learning" - when new data becomes available in batches and the algorithm should ideally be able to quickly update on the new data without being refit on the entire updated dataset.
https://en.wikipedia.org/wiki/Online_machine_learning
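To give a feel for what "updating without refitting" means, here is a minimal sketch of online linear regression with gradient steps in plain numpy. The class OnlineLinearRegression is hypothetical and written only for illustration; the data is synthetic (y = 2x + 1) and the learning rate is an arbitrary choice. Some scikit-learn estimators, such as SGDRegressor, expose a partial_fit method that plays the same role.

```python
import numpy as np

class OnlineLinearRegression:
    """Toy linear model that is updated one batch at a time."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, X, y):
        # One gradient step on the mean squared error of this batch only
        error = X @ self.weights + self.bias - y
        self.weights -= self.learning_rate * (X.T @ error) / len(y)
        self.bias -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        return X @ self.weights + self.bias

rng = np.random.default_rng(0)
model = OnlineLinearRegression(n_features=1)

# Simulate new data arriving in small batches; old batches are never revisited
for _ in range(500):
    X_new = rng.uniform(0.0, 1.0, size=(32, 1))
    y_new = 2.0 * X_new[:, 0] + 1.0
    model.partial_fit(X_new, y_new)

print(model.weights, model.bias)
```

Calling a regular fit(x, y) repeatedly is different: in scikit-learn that typically restarts training from scratch on whatever data you pass in.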
@tidal bough Thankyou for the answer. I will check it out. Any idea on how I could go about implementing it in a Discord Bot or a Website? Like should I make an API which can be accessed by the Bot?
what should I put inside kmeans.fit() if my data comes from PDF files that have already been pre-processed and vectorized with TF-IDF? or is that function not the right one? I already looked on Stack Overflow and other websites but I can't find the answer. My program does k-means clustering on PDF files, and I want to add the elbow method to it.
@warm moth Might be a good idea if you will need to access it from different places (your website and the bot).
Alrighty! Thanks for the Answer.
What are the best statistics books, you have seen at univ???
Would preprocessing in Python (well any language, just using Python as an example) mean simply taking a look at the source code and copying ONLY used functions in this code from the imported modules that contain them? Here is an example;
module that will be imported:
def add(a, b):
    return a+b
def subtract(a, b):
    return a-b
source code:
x = 5
y = 4
print(add(x,y))
After preprocessing:
def add(a, b):
    return a+b
x = 5
y = 4
print(add(x,y))
preprocessing has a variety of different meanings. what you put is an example of preprocessing, but it comes down to the task at hand and what you want to achieve
For example, I preprocessed a bunch of c++ files, and for me that meant removing all comments and undoing all the #define preprocessor directives
and how would one go about removing all the comments and undoing all the #define directives?
wouldn't it be essentially having a function in a library perhaps that goes over your code and does this - same thing as described above?
Comments largely with regex, undoing directives is a massive effort so unless you need to, i wouldnt recommend it
not trying to replicate it, simply trying to understand the other variant of preprocessing more clearly.
I have a program where i feed in the source code text and it spits out the processed code
so basically preprocessing means editing/preparing the source code before it goes through it all?
whether it's importing functions from used modules or doing any else kind of formatting
Yeah, in my case it was, I was building abstract syntax trees for each source code file, before that, each file had to be preprocessed in that way
alright, thank you!
Ultimately though, you have to know how you want each file preprocessed for whatever task it is you want to complete, and depending on that you might find that another way of preprocessing your code is better
What sort of augmentation should I be applying to a dataset of skin cancer images? It's well segmented but doesn't contain many images (size 2 GB approx), and I'm going to try a few Transfer Learning architectures first. Also, what metrics/score would be best to evaluate my model?
is there easy way to assign function that generates batches of data into data generator and feed model while training? in keras
def get_batch():
    # example
    yield X, Y
Where is a good starting point to start learning data science with python
What exactly are you interested in learning about?
I desperately need help. is it useful to use/learn matplotlib when you can just export the data to excel or some similar program?
Yes
My goal is to develop a stock options backtester. I'm 2 months into learning programming(python specifically). With so much information and so many fields of study, what areas should I focus on in order to develop this backtesting program?
I've started learning pandas but not sure where to go from here. Should I focus on understanding classes and objects? What will I need to focus on in order for the backtester to make the correct selection of orders to buy and sell amongst so many rows of data as well as calculate the necessary statistics such as the profit/loss per strategy? Any guidance on this will help me a lot. I don't know where to look.
@odd yoke no, but I solved it, fit supports generators since 2.0 I, think I can just pass generator function
well the stock market is a really odd thing, especially rn. Breaking all the standard rules, so backtesting strats might not work as well as before.
But anyways, if you want to make a backtester, I would say learn the basics of classes and objects before jumping into pandas. As for pandas, there's lots of tutorials online and documentation is pretty good imo @fickle rampart
Yes I agree pandas is well documented, and since it's so widely used I've been able to find how to do things with it with some searching. The dilemma I'm facing is that making an options backtester seems to be much more difficult than a stock backtester. While in stocks there is only one stock which never changes, in options there are hundreds of options that change every week. What would be useful for me to focus on in order to understand how to make the selection of the correct options with my code?
It is so hard to read such large paragraphs, keep it short 
@desert oar you helped me with this yesterday but i had a followup question -- do you know why this fill_value is replacing everything in my dataframe with 0? this is the code
```py
import pandas as pd

data = pd.read_csv('my-data.csv')
data['MONTH'] = pd.to_datetime(data['MONTH'])
new_index = pd.date_range(data['MONTH'].min(), data['MONTH'].max(), freq='MS', name='MONTH')

def fill_monthly(df):
    return df.set_index('MONTH').drop('APP', axis='columns').reindex(new_index, fill_value=0)

data_filled = data.groupby('APP').apply(fill_monthly)
```
it shouldnt be. can you also provide some small test data?
yea so
its easier than me constructing some tiny data set
ID | MONTH | INCIDENTS
AP00094 | 2017-11 | 1
AP00094 | 2018-03 | 1
AP00095 | 2019-05 | 3
it worked with some other dataframes but for this one im getting 0 replaced for everything
is ID equivalent to APP?
yea
wait
i think it might be because my month columns are in string format right now
i didnt even notice. let me change that and see
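A string-typed month column is one plausible way to end up with zeros everywhere: reindex matches labels exactly, so string labels like '2017-11' never match the Timestamp labels in new_index, and every row gets fill_value. A minimal sketch, assuming the three sample rows above as the data and looking at a single APP (the variable names `bad` and `good` are just for illustration):

```python
import pandas as pd

# Sample rows from the table above; MONTH is still a plain string here.
data = pd.DataFrame({
    "APP": ["AP00094", "AP00094", "AP00095"],
    "MONTH": ["2017-11", "2018-03", "2019-05"],
    "INCIDENTS": [1, 1, 3],
})

# Same frame with MONTH converted to real datetimes.
as_dates = data.assign(MONTH=pd.to_datetime(data["MONTH"]))
new_index = pd.date_range(
    as_dates["MONTH"].min(), as_dates["MONTH"].max(), freq="MS", name="MONTH"
)

# String index: no label matches a Timestamp, so every row is filled with 0.
one_app_str = data[data["APP"] == "AP00094"]
bad = one_app_str.set_index("MONTH")[["INCIDENTS"]].reindex(new_index, fill_value=0)

# Datetime index: existing months keep their values, only the gaps become 0.
one_app_dt = as_dates[as_dates["APP"] == "AP00094"]
good = one_app_dt.set_index("MONTH")[["INCIDENTS"]].reindex(new_index, fill_value=0)

print(bad["INCIDENTS"].sum(), good["INCIDENTS"].sum())
```

Checking `df.dtypes` right before the reindex is a quick way to confirm which case you are in.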
I am pretty new to ML and DS so I might have misunderstood the concepts, but I hope someone can clarify it for me.
What is the point of having multiple kernels in a CNN's convolution layer if the Maxpool in the next layer performs a max operation? Since all the kernel outputs will give the same max values per pool window.
It's not a python specific question, so I posted it here. Hope that's alright
(i'm assuming conv2d for this example)
if you have C convolution kernels, the output will have the dimensions HWC (or CHW based on what data layout you use). pooling operations are used to downsample the spatial dimensions (HW), the C dimension still keeps its size
and the kernels are not initialized with the same values, so the values won't be the same
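To make the shapes concrete, here is a small sketch assuming an HWC layout, a 2x2 pool window, and a made-up 4x4x3 array standing in for the output of 3 kernels (the helper name `maxpool_hwc` is hypothetical). Pooling shrinks H and W, and each channel is pooled on its own:

```python
import numpy as np

def maxpool_hwc(x, k):
    """Max-pool an (H, W, C) array with a k x k window and stride k."""
    h, w, c = x.shape[0] // k, x.shape[1] // k, x.shape[2]
    # Split H and W into (blocks, k) pairs, then take the max inside each block.
    return x[: h * k, : w * k].reshape(h, k, w, k, c).max(axis=(1, 3))

# Stand-in for the conv output of 3 different kernels on a 4x4 image.
conv_out = np.arange(48).reshape(4, 4, 3)
pooled = maxpool_hwc(conv_out, 2)

print(conv_out.shape, "->", pooled.shape)  # spatial dims halve, C stays 3
print(pooled[0, 0])                        # one max per channel
```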
@lapis sequoia ping in case you left
I'm here, reading it, thanks
Nah, I'm working with the grayscale images for now. I understood the downsizing the spatial dimensions part.
Wait, lemme try to use an example
Example output after convolving with a kernel:
[1 2 3 4]
[2 1 3 4]
[4 2 1 3]
[2 2 4 1]
Now if I do a max in axis 1, won't all of them become 4?
class Layer_Maxpool:
    def __init__(self, pool_scale):
        # Initializing attributes
        self.pool_scale = pool_scale
    def maxpool(self, img, maxpool_out):
        maxpool_out = np.zeros((conv_out.shape[0] // self.pool_scale, conv_out.shape[1] // self.pool_scale, conv_out.shape[-1]))
        for ix in range(img.shape[-1]):
            new_img = conv_out[:,:,ix]
            for i in range(maxpool_out.shape[0]):
                for j in range(maxpool_out.shape[1]):
                    segment = new_img[i * self.pool_scale:(i+1) * self.pool_scale, j * self.pool_scale:(j+1) * self.pool_scale]
                    maxpool_out[i,j] = np.amax(segment, axis=(0, 1))
        return maxpool_out
    def forward(self, inputs, training=False):
        self.inputs = inputs
        self.output = np.zeros(
            (
                inputs.shape[0] // self.pool_scale,
                inputs.shape[1] // self.pool_scale,
                inputs.shape[2]
            )
        )
        # Calculate output values from input ones, weights and biases
        self.output = self.maxpool(inputs, self.output)
This is the code I'm using. Might've made a mistake in it somewhere.
@odd yoke
You don't directly apply the max pool on the convolution kernel of the previous layer, you apply it on the output of said convolution
omg, I figured it out
I'm sorry
Instead of broadcasting, I was looping over the images
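For anyone hitting the same thing: a likely cause in the class above is that `maxpool_out[i,j] = ...` assigns a single number to every channel at position (i, j), so the last channel processed overwrites all the others and every kernel appears to give the same pooled values. A minimal sketch of one way around it, assuming an (H, W, C) input (the function name `maxpool_loop` is illustrative, not the original poster's final code):

```python
import numpy as np

def maxpool_loop(conv_out, pool_scale):
    """Loop over output positions only; all channels are handled at once."""
    h = conv_out.shape[0] // pool_scale
    w = conv_out.shape[1] // pool_scale
    out = np.zeros((h, w, conv_out.shape[-1]))
    for i in range(h):
        for j in range(w):
            segment = conv_out[
                i * pool_scale:(i + 1) * pool_scale,
                j * pool_scale:(j + 1) * pool_scale,
                :,
            ]
            # Max over the two spatial axes leaves one value per channel.
            out[i, j, :] = segment.max(axis=(0, 1))
    return out

demo = np.arange(32).reshape(4, 4, 2)  # made-up two-channel conv output
result = maxpool_loop(demo, 2)
print(result[0, 0], result[1, 1])
```

If you prefer to keep the per-channel loop, writing to `maxpool_out[i, j, ix]` instead of `maxpool_out[i, j]` should have the same effect.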
@odd yoke yeah, aware of that
the inputs here is the conv out
So the correct architecture of the model is:
Conv -> Maxpool -> ReLu -> Dense Layer -> Softmax
correct?
That looks good yep
Thanks a lot
You may see ReLu -> Maxpool instead sometimes, but it's the same result
yeah, was reading about that just now
Mostly for optimization purposes
I see
wouldn't subsampling it first reduce the overhead on Relu?
Not sure which one would be costlier as both strive to reduce the computation in their own way
intuition says maxpool does a tougher job
Relu is a really simple function to execute. So, it doesn't matter much.
I think that's a reasonable assumption, I'm not knowledgeable enough in GPGPU to know exactly what they may do to make it faster with Conv -> ReLu -> Maxpool
I see
relu = max(0, x)
yeah, aware of that tonus
Not trying to talk down, but just typing as I think.
ah lol, okay
Yeah, but when we're talking about millions of weights, that relu operation that is run several times per iteration can make a non-negligible difference
Yeah....I don't disagree. But, I feel like that is a level of optimization that isn't necessary in my opinion to think about at this point.
https://github.com/tensorflow/tensorflow/issues/3180 Here some people briefly discuss the idea of automatically reversing relu -> pooling
But from maxpool's p.o.v, will max([1 2 3 4]) and max([-1 2 3 4]) make any difference?
ty, will check it out
When your program takes hours or days to train, even an improvement of 1% is important
Had a typo. I don't disagree with you.
Ah, my bad
Nah. It's mine.
So apparently, tensorflow doesn't optimize for it (yet?)
I am not sure how I feel about TF doing that on its own.
Not saying it isn't an effective optimization. But, I think the dev should handle that. And, there should be better documentation on similar operations.
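On the ordering question above: ReLU is monotonic, so taking the max of a window and then clamping at 0 gives the same numbers as clamping first and then taking the max. A small sketch, assuming 1D windows of size 2 and a made-up input (the helper names `relu` and `maxpool1d` are illustrative). Pooling first just means ReLU touches fewer elements:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def maxpool1d(x, k):
    """Max over consecutive, non-overlapping windows of length k."""
    return x[: len(x) // k * k].reshape(-1, k).max(axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, -1.0, 2.0, -3.0, -4.0])

pool_then_relu = relu(maxpool1d(x, 2))  # ReLU runs on 4 values
relu_then_pool = maxpool1d(relu(x), 2)  # ReLU runs on 8 values

print(pool_then_relu, relu_then_pool)
print(np.array_equal(pool_then_relu, relu_then_pool))
```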
hi, im a noob in neural networks and i was trying to make a very simple perceptron that simply tries to guess the slope.
so you give it a x, it needs to spit out the correct y (so curve fitting?)
the cost function is (a-y)²
i thought this is the way to calculate the new weight:
W1 = W0- learning_rate*i*2*(a-y)
is this right?
a is the network's prediction
y is the desired output
i is the input (so x)
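For what it's worth, that update matches the gradient of the squared error if the network is just a = W * i with no bias, since d/dW (a - y)² = 2 * (a - y) * i. A minimal sketch under that assumption, with made-up data on the line y = 3x (the function name `train_slope` and the hyperparameters are illustrative):

```python
def train_slope(xs, ys, learning_rate=0.01, epochs=200, w=0.0):
    """Fit a single weight w so that w * x approximates y."""
    for _ in range(epochs):
        for i, y in zip(xs, ys):
            a = w * i  # the network's prediction
            # Gradient of (a - y)**2 with respect to w is 2 * (a - y) * i.
            w = w - learning_rate * i * 2 * (a - y)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x for x in xs]  # made-up targets with slope 3
w = train_slope(xs, ys)
print(w)
```

If the learning rate is too large relative to the size of the inputs, the same update can overshoot and diverge, so scaling x or lowering learning_rate is the usual first thing to try.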
Are there any good benchmarks for Flax?
anybody knows whats the best way to get the output of a particular hidden layer in a NN using pytorch?
you can create a list out of a model where each element is a layer
alternatively, when you define your model, store a reference to the layer that interests you and retrieve it using a method
this seems to be a very common question, there are multiple other solutions you can find online
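One of those other solutions is a forward hook, which runs a callback every time a given layer produces its output. A minimal sketch, assuming a small made-up nn.Sequential model (the layer sizes, the `captured` dict, and the `save_output` name are illustrative, not from the question):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

captured = {}

def save_output(module, inputs, output):
    # Called after the hooked layer runs; keep a detached copy of its output.
    captured["hidden"] = output.detach()

# Hook the ReLU, i.e. the activations of the hidden layer.
handle = model[1].register_forward_hook(save_output)

x = torch.randn(3, 4)
out = model(x)
handle.remove()  # remove the hook once it is no longer needed

hidden = captured["hidden"]
print(hidden.shape, out.shape)
```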
I don't get df.groupby()
@arctic cliff what about it
Could someone take a look at the tensor shapes, it's not getting the output I wanted
this is something I drew to help myself
The first two parts of this work, but the final part i'm not sure how to get
this is the output
I tried to do a .t() at targets to try and fix it but there are errors so I'm not sure what to do.
model = nn.Linear(13, 1) here you define your model as a linear model that takes an input of size 13, and has an output of size 1
I'm confused as to why it doesn't crash directly in your training loop
It didn't crash
@odd yoke Alright, but I changed it to (13, 13) and doing that just gives an error about singleton dimensions
I changed it to (13, 2) and that also crashed it
so it only works with (13, 1) I tried transposing the tensor but it didn't work either
which line crashes when you set it to 13, 13
what's the exact stack trace ?
let me get it again
```
d:/Coding/python/ML/winrate.py:37: UserWarning: Using a target size (torch.Size([5])) that is different to the input size (torch.Size([13])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  loss = loss_fn(preds, yb)
Traceback (most recent call last):
  File "d:/Coding/python/ML/winrate.py", line 46, in <module>
    fit(250, model, loss_fn, opt)
  File "d:/Coding/python/ML/winrate.py", line 37, in fit
    loss = loss_fn(preds, yb)
  File "D:\Coding\python\ML\lib\site-packages\torch\nn\functional.py", line 2542, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "D:\Coding\python\ML\lib\site-packages\torch\functional.py", line 62, in broadcast_tensors
    return _VF.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (13) must match the size of tensor b (5) at non-singleton dimension 0```
oh it's the batch size
now as to why the shapes don't fit, can you print the shapes right before the loss in the loop ?
wait wait
you're using inputs
instead of xb
I'm not exactly familiar with pytorch, but that doesn't seem right
Also, is your dataset supposed to be one parameter and one label ?
In which case you want to set model to Linear(1, 1)
@velvet thorn What's it used for ?
If you really don't understand it at all, I suggest you look at the documentation directly @arctic cliff
It's used for "grouping" values together based on some arbitrary criteria
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
It even has some examples
@odd yoke sorry what.
When I do .shape
13 being the number of examples in your dataset
1, 1
Guess I got it, Thanks !
so I just put in 1,1
Yes, and in the training loop, you're using inputs but I'm p sure you want to use xb
yeah inputs is xb targets is yb
It says there is a size mismatch
RuntimeError: size mismatch, m1: [1 x 13], m2: [1 x 1] at C:\w\b\windows\pytorch\aten\src\TH/generic/THTensorMath.cpp:41
After changing it to model = nn.Linear(1, 1)
m2 being model
which line causes this ?
@arctic cliff many things, but the most common one is to apply aggregations over subsets of data
you shouldn't get 13 as input anywhere
for example, say you have a dataset that contains three columns: department, name, and age
don't forget to remove ```py
preds = model(inputs)
print(preds.shape)
loss_fn = F.mse_loss # except this line
loss = loss_fn(preds, targets)```
if you wanted the average age of the whole company, you would do df['age'].mean()
but if you wanted the average age of each department, you would do df.groupby('department').mean()
Can't I do: df.department.mean() ?
no, that would be the mean of the column department
which doesn't make sense because it contains strings.
>>> df
    department name  age
0   Accounting    A   36
1   Accounting    B   29
2  Engineering    C   24
3  Engineering    D   37
4  Engineering    E   33
>>> df['age'].mean()
31.8
>>> df.groupby('department').mean()
                   age
department
Accounting   32.500000
Engineering  31.333333
this is the simplest and (I think) the most common use case for groupby
but the general principle is split-apply-combine
split into subsets based on the value of a specified column, apply some operation, combine the results back into a DataFrame
in this case the operation is the mean aggregation.
however, you can do stuff like transform and filter, in particular
also, as you get more advanced you'll find that you don't have to group on only a single column, or even on columns at all
an easy example of the first case is...imagine you also had a "sex" column
you could do df.groupby(['department', 'sex']).mean() to get the average age by department and sex
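The same example as a runnable sketch, using the five rows shown above. One assumption to flag: as far as I know, pandas 2.0 and later no longer silently drop non-numeric columns such as name when you call .mean() on the whole group, so selecting the age column first is the safer spelling. The transform line shows the "combine back" step keeping the original row count:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Accounting", "Accounting", "Engineering", "Engineering", "Engineering"],
    "name": ["A", "B", "C", "D", "E"],
    "age": [36, 29, 24, 37, 33],
})

# Average age of the whole company.
overall = df["age"].mean()

# Split by department, apply mean to age, combine into one row per department.
by_dept = df.groupby("department")["age"].mean()

# transform keeps one row per employee, each holding their department's mean.
dept_mean_per_row = df.groupby("department")["age"].transform("mean")

print(overall)
print(by_dept)
print(dept_mean_per_row.tolist())
```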
lst = eval(input("Enter list :"))
length = len(lst)
#List to hold unique elements
uniq = [ ]
#List to hold duplicate elements
dupl = [ ]
count = i = 0
while i < length :
    element = lst[i]
    #Count as 1 for the element at lst[i]
    count = 1
    if element not in uniq and element not in dupl:
        i+=1
        for j in range(i,length):
            if element==lst[j]:
                count+=1
        #when inner llop - for loop ends
        else:
            print("Element",element,"frequency:",count)
            if count==1:
                uniq.append(element)
            else:
                depl.append(element)
    #When element is found in uniq or dupl lists
    else:
        i+=1
print("Original list",lst)
print("Unique elemts list",uniq)
print("Duplicates elements list",dupl)
why I'm getting error?
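Reading the code as posted, the most likely culprit is `depl.append(element)`: the list is named dupl, so the first repeated element would raise a NameError. Separately, the whole counting loop can be replaced with collections.Counter. A minimal sketch, assuming the list is passed in directly instead of read with eval(input()) (the function name `split_unique_and_duplicates` is made up; ast.literal_eval would be a safer way to parse typed input than eval):

```python
from collections import Counter

def split_unique_and_duplicates(lst):
    """Return (frequencies, elements seen once, elements seen more than once)."""
    counts = Counter(lst)  # keeps first-seen order of the elements
    uniq = [x for x in counts if counts[x] == 1]
    dupl = [x for x in counts if counts[x] > 1]
    return counts, uniq, dupl

counts, uniq, dupl = split_unique_and_duplicates([1, 2, 2, 3, 3, 3, 4])
for element, count in counts.items():
    print("Element", element, "frequency:", count)
print("Unique elements list", uniq)
print("Duplicate elements list", dupl)
```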
@odd yoke wydm
remove that code, it's not part of your model
it may be what's causing the error with the shape 13
because you should really only have shapes 1 and 5
like this?
yes
What types of career paths are you all wanting to do with data science?
Just curious
New to this
Does anyone here use notebooks.ai?
@lapis sequoia I am a machine learning engineer, it is a growing area.
Could someone explain this line?
import numpy as np
def split_indices(n, eval):
    eval = int(eval*n)
    index = np.random.permutation(n)
    return index[eval:], index[:eval]
train_index, eval_index = split_indices(len(dataset), eval=0.2)
Here is the full code
So does it split the 20% between train_index and eval_index?
Here is the output I don't really understand it
mostly because they're different
didn't we go through this yesterday
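In case it helps anyone else reading: the function shuffles the numbers 0..n-1 and cuts the shuffled array once. The first 20% of the shuffled positions become eval_index and the remaining 80% become train_index, so the two never overlap. A small sketch of the same idea, assuming n = 100 in place of len(dataset); the parameter is renamed to `eval_frac` here only to avoid shadowing Python's built-in eval, and the `seed` argument is an addition for reproducibility:

```python
import numpy as np

def split_indices(n, eval_frac, seed=None):
    n_eval = int(eval_frac * n)      # how many indices go to evaluation
    rng = np.random.default_rng(seed)
    index = rng.permutation(n)       # 0..n-1 in random order
    # Everything after the cut is training, everything before it is evaluation.
    return index[n_eval:], index[:n_eval]

train_index, eval_index = split_indices(100, eval_frac=0.2, seed=0)
print(len(train_index), len(eval_index))
```

The output differs between runs without a seed because the permutation is random each time.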
please, do you have a good "google API's" tutorial ?
what
Series.filter(regex="..")
need to filter out strings ending with -org in the series
what will the regular expression be like
no idea cuz u didn't show any example of data
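Without sample data this is only a guess, but note that Series.filter matches index labels, not the values in the series. To filter on the string values themselves, a boolean mask from the .str accessor is the usual route. A minimal sketch with made-up values (the regex `-org$` means "ends with -org"):

```python
import pandas as pd

s = pd.Series(["alpha-org", "beta", "gamma-org", "delta-orgx"])  # made-up data

ends_with_org = s.str.contains(r"-org$")  # True where the value ends with -org

without_org = s[~ends_with_org]  # drop the -org entries
only_org = s[ends_with_org]      # or keep only the -org entries

print(without_org.tolist())
print(only_org.tolist())
```

`s.str.endswith("-org")` gives the same mask without a regex.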
Hello guys, I have a question regarding Deep Learning frameworks. I know how to make simple neural network architectures, but I have some difficulties implementing custom architectures even though I'm quite familiar with the theory behind the implementation. Do you have any resources or ideas about how to practice the "coding" part of implementing custom neural networks using Tensorflow and/or PyTorch ?
Look up architectures and try to implement a broken down version of it
the last few chaps of Hands on ML with scikit learn and tensorflow are helpful
Will look into that, is the book adapted for Tensorflow 2.0+ ?
I have training data for object detection and I want to use tensorflow for this purpose. I'm labeling the pictures, but the problem is that most of them are horizontal. Will that cause any problem after my model is trained? Sorry for my English.
I need to do a calculation over a list where, for each item, I need to find the number of items appearing before it that are smaller than it. I have written this function using a numpy array for this:
L = np.array([5,8,2,77,34,67,....,56,342,567])
num_lower = []
for i, j in enumerate(L):
    cur_L = L[:i+1]
    lt = np.sum(cur_L < j)
    num_lower.append(lt)
Is there a way to vectorize this loop using Numpy?
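One way to drop the Python loop is to compare every element against every other with broadcasting and keep only the pairs where the other element comes earlier. A minimal sketch, assuming the series fits in memory as an n x n boolean matrix (so it suits a few thousand points, not millions); the sample values are the first six from the list above and `count_smaller_before` is a made-up name:

```python
import numpy as np

def count_smaller_before(values):
    values = np.asarray(values)
    # smaller[i, k] is True when values[k] < values[i].
    smaller = values[None, :] < values[:, None]
    # Keep only k < i (strictly below the diagonal), then count per row.
    return np.tril(smaller, k=-1).sum(axis=1)

L = np.array([5, 8, 2, 77, 34, 67])
num_lower = count_smaller_before(L)
print(num_lower.tolist())
```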
honestly, this sounds like it can be solved in O(n) by dynamic programming from the right
but to speed this solution up, using numba should work too if the contents are homogeneous.
basically it's a time series data and for each number, I want to know where it stands with respect to historical data.
I'm not familiar with numba
I'll look it up
Pretty much just make this part into a function and apply the @numba.njit decorator to it.
It'll lag the first time you call it because it'll be compiled to machine code, but then it'll be much faster.
of course, not all functions can be compiled this way, but this looks like something that can - just some math and loops.
OK. I'll try it out. Thanks
@tidal bough What did you mean by this?
so... for each number in the list, you need to count the number of elements that are to the right of that number and smaller than it?
@tidal bough really...?
I'd say to the left of the number. As in, the numbers are on a timeline, starting from left and moving to the right. So I need to consider all numbers appearing before that
I can't see it but maybe you're right
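For the "left of the number" reading, one approach that avoids rescanning the history is to keep the values seen so far in sorted order and binary-search each new value. This is not the O(n) dynamic programming idea mentioned above, just a sketch of an alternative using the standard library; the search is O(log n) per item, though inserting into a Python list is still linear in the worst case. The function name `rank_against_history` is made up:

```python
from bisect import bisect_left, insort

def rank_against_history(values):
    """For each value, count how many earlier values are strictly smaller."""
    seen = []    # earlier values, kept sorted
    counts = []
    for x in values:
        # bisect_left returns how many items in `seen` are strictly less than x.
        counts.append(bisect_left(seen, x))
        insort(seen, x)
    return counts

history_rank = rank_against_history([5, 8, 2, 77, 34, 67])
print(history_rank)
```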