#data-science-and-ml
if i used this i should also indicate the labels for my validation set?
wym indicate the labels ?
Examples using sklearn.model_selection.train_test_split: Release Highlights for scikit-learn 0.23, Release Highlights for scikit-learn 0.24, ...
this means the x ,y right? image and class?
yes this is how you do it
```py
train_test_split(
    x, y, test_size=0.2, stratify=y  # test_size: any float between 0 and 1
)
```
only if you specify stratify will it keep the class distribution the same in train and validation
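A minimal sketch of what `stratify=y` does, assuming scikit-learn is installed. The toy arrays and the 80/20 class imbalance are made up for illustration; they are not the asker's image data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 samples, 80 of class 0 and 20 of class 1.
x = np.arange(100).reshape(100, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both splits.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0
)

train_counts = np.bincount(y_train)  # samples per class in the train split
val_counts = np.bincount(y_val)      # samples per class in the validation split
print(train_counts, val_counts)
```

Without `stratify`, the split is purely random, so the validation set can end up with a different class mix than the training set.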
this is sample of the split i did top is train bottom is validation but its possible that the validation set is unbalanced ? like maybe its composed of 50% class_a right?
based on imagedatagenerator its what it does
it did not specify if its balance or not
is it necessary to have a balanced validation set?
oh i see what you mean here is the validation data of fit()
but that part about regularization layer drop out meaning during prediction and validation my dropout layer is ignored?
drop out layers are just used during training?
typically dropout layers are only used during training, yes. That doesn't necessarily have to be true but that is the convention.
hello, i would appreciate any feedback on this article:
https://medium.com/@alexm5492/linear-regression-from-scratch-3-methods-2e803d82137c
so in my validation if i understand it correctly then dropout are ignored? is it only on fit or also on predict?
I cannot speak to your particular codebase. You'll have to refer to the documentation and/or explore the source code.
The typical situation would be that dropout layers are essentially only actually "in" the model during the training process (the mask is applied in the forward pass and carried through back-propagation)
and are not used at any other time
its ignored based on this if i understand it correctly
back prop is the time where the weights are being updated right?
so only half of the units on a layer will be updated?
at first i thought the dropout happens during forward prop?
yes. I'd highly consider taking this course if you have any confusion on any of these details we've been discussing. https://www.coursera.org/learn/machine-learning
consider taking some time to write a basic neural network "from scratch", meaning only using numpy functions and data structures. It'll really build up a lot of these complicated concepts of the internals of deep learning in your mind
why things are done a certain way
instead of just having to memorize "people only apply dropout during training, not during validation or prediction, b/c that's the way it's done"
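On the forward-vs-backward question above: a rough NumPy sketch of one common formulation ("inverted dropout"), in the "from scratch" spirit suggested here. The function names and the `training` flag are illustrative assumptions, not any library's API. The mask is drawn in the forward pass while training, the same mask is reused in the backward pass, and nothing is dropped at validation or prediction time.

```python
import numpy as np

def dropout_forward(x, rate, training, rng):
    """Inverted dropout. Returns (output, mask)."""
    if not training:
        # Validation / prediction: the layer is a no-op.
        return x, np.ones_like(x)
    keep = 1.0 - rate
    # Zero each unit with probability `rate`, scale survivors by 1/keep.
    mask = (rng.random(x.shape) < keep) / keep
    return x * mask, mask

def dropout_backward(grad_out, mask):
    # Dropped units get zero gradient; kept units get the same scaling.
    return grad_out * mask

rng = np.random.default_rng(0)
x = np.ones((2, 4))
train_out, mask = dropout_forward(x, rate=0.5, training=True, rng=rng)
eval_out, _ = dropout_forward(x, rate=0.5, training=False, rng=rng)
grad_in = dropout_backward(np.ones_like(x), mask)
```

So units that were dropped in the forward pass also receive no weight update for that batch.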
i only read the blog version of it hahaha this series i think https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/
it mentioned normalization and dropout also but i forgot and its probably not too detailed compared to the course
thank you for the reference 😅 👍
hello, i'm trying to display my data frame and i don't know why, when i try to show the elements from my json array, it doesn't...
```py
import json
import pandas as pd
import matplotlib.pyplot as plt
from pandas.io.json import json_normalize
```
```py
with open('C:/Users/PC/Desktop/desktop/git/projets/python/Data-Analysis-Velib/station_status.json', 'r') as f:
    velos = pd.DataFrame(json.loads(f.read()))
#df_velos = pd.json_normalize(velos['data'], record_path=['stations'])
df_velos = pd.json_normalize(velos['data'])
```
@mighty agate use this instead of json.loads or any other explicit file IO: https://pandas.pydata.org/docs/reference/api/pandas.read_json.html
all right, like that?
with open('C:/Users/PC/Desktop/desktop/git/projets/python/Data-Analysis-Velib/station_status.json', 'r') as f:
    velos = json.loads(f.read())
#df_velos = pd.json_normalize(velos['data'], record_path=['stations'])
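A small sketch of flattening a nested "stations" list with `pd.json_normalize`. The payload below is an assumption about the file's layout (a top-level `data` object holding a `stations` list), and the field names are invented; the real `station_status.json` is not shown in the chat.

```python
import pandas as pd

# Hypothetical payload shaped like {"data": {"stations": [...]}}.
payload = {
    "ttl": 3600,
    "data": {
        "stations": [
            {"station_id": 1, "num_bikes_available": 3},
            {"station_id": 2, "num_bikes_available": 0},
        ]
    },
}

# record_path points at the list that should become the rows.
df_velos = pd.json_normalize(payload["data"], record_path=["stations"])
print(df_velos)
```

The key point is to pass the parsed dict (not a DataFrame built from it) and tell `json_normalize` which nested list holds the records.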
yo
a little help please?
i am not able to install ecapture module
pip install ecapture
it says scikit-image wheels cannot build
using python 3.10
please do pip install -U pip setuptools wheel and then try the other install again. If that doesn't work, copy and paste the whole console output starting from the command in the paste bin
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
thanks...
It doesn't look like this question is on topic for this channel.
whoops
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
i have the following table
and i want to add a 'Games out of Position' column.
it will detail how many games he has played in a position that is
NOT the position listed in the 'position' column
for example
adrian it would be 2 and then 9,
and for gomez it would be 17, 5, 14(read from top to bottom)
how can i accomplish this?
it seems like your data is organized as "triples" of (player, position_preferred, position_played) or something like that. is that right?
yes, thats the index
it sounds like you are seeking to aggregate the data to just 1 data point per player
so you are looking to create a new table with just player as the index
and your dataframe is using a multiindex?
no i dont think so. just looking to add a column to the existing table. aggregation might be involved in achieving this however im not sure
yes multi index
well that number seems like a number per player only
so you could then re-join that back into the original table, but the number will be repeated for each row for each player
no let me just send a picture of what the solution column will look like
ok. i think i am missing some context
what do you want to do with this quantity that you calculate?
essentially add up the 'appeared' column for each player (the first index level) for rows that are not the current row
hm... so you want the sum of all the other rows for that given player?
so for gomez, he played 17 games outside the RB position, 5 games outside the RBP position and 14 games outside the SUB position
yes
i think i understand that. what i recommend is writing a function that takes a dataframe as an input and returns this new column, that works on one single player. then you can do a .groupby(level='player name').apply(your_new_function) to get your desired column
ok i think i got it
a window function totaling the appearances for that player
and subtract that number from the appeared column
im not sure you need a window function, but that is one way to do it
you would probably need to define a custom "window" which i think is pretty complicated
i tried to do it once and gave up, the docs weren't clear and the examples that ship with pandas are kind of convoluted internally
that's why i suggested groupby
honestly i'd just loop over rows inside the function
these individual per-player dataframes are so small that the performance isn't important
hmm im not sure how to do this, but i can bypass this by resetting index right
i wouldn't even bother
ok, ill try both way none the less and learn something new.
i personally need to get better at debugging with pandas bc what i do understand abt how pandas works is pretty loosy goosey in my head
the super-naive way is something like this:
def compute_player_oop(player_df):
    result = []
    for row in player_df.itertuples():
        # itertuples() exposes the index label as row.Index (capital I)
        other_rows = player_df.loc[player_df.index != row.Index]
        other_total = other_rows['appeared'].sum()
        result.append(other_total)
    return pd.Series(result, index=player_df.index)
# group_keys=False keeps the original index so the result lines up with data
data['games_out_of_position'] = data.groupby(level='player name', group_keys=False).apply(compute_player_oop)
something like that anyway
untested code written by volunteer strangers etc.
ahhh nice. thanks by the way
this is pretty inefficient but easy
the best type of code
there are probably faster ways using set operations on indexes or something with window functions
i always hesitate before using loops
yeah for small dataframes it's fine
if you are looping over 5 rows who cares
if you are looping over 5000 rows even who cares
the reason to avoid loops on small datasets is more for readability + concision than for performance
you're welcome
you're welcome to who? I was going to say that sometimes the only reason I encourage people to avoid loops is to force them to learn the API 
for sure, that's a good reason too
idk if there's a tidy pandas solution for this problem though
"un-selecting" one row at a time
also I see that you're speaking to Ahmad. I was momentarily confused because you wrote that message quite quickly while I was typing, so I thought that was related.
basically a "leave-one-out" operation
haha no but i was curious if you were going to chime in with a nicer approach
it'd be nice to have a general efficient idiom for "leave-one-out" operations with pandas
or even "leave-n-out", like an inverse window function
could be a good SO question (if it isn't already)
if you ever figure it out, you can ask it and answer it
if there's a way to make a boolean series where n are True, you can use that to get the retained rows and invert it with ~ to get those that aren't.
that's pretty much what i did, other_rows = data.loc[this_row.index != data.index]
(assuming that the indexes are unique, which they really should be)
non-unique indexes in pandas are just a Bad Idea, like non-string column names
u could sum all of them and then subtract the row
why there is no data in due_date and a few other columns when the dataset originally has it
please if anyone can help me asap, I have to turn in my project
that's a good solution for this particular case
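The "sum all of them and then subtract the row" idea can be sketched without a loop. The table below is a made-up stand-in for the screenshot: the `appeared` values are chosen so the results match the numbers quoted earlier (2 and 9 for adrian, 17, 5, 14 for gomez), and adrian's position labels are invented.

```python
import pandas as pd

# Hypothetical stand-in for the table in the screenshot.
index = pd.MultiIndex.from_tuples(
    [("adrian", "GK"), ("adrian", "SUB"),
     ("gomez", "RB"), ("gomez", "RBP"), ("gomez", "SUB")],
    names=["player name", "position"],
)
data = pd.DataFrame({"appeared": [9, 2, 1, 13, 4]}, index=index)

# Per-player total, broadcast back onto every row of that player...
total = data.groupby(level="player name")["appeared"].transform("sum")
# ...minus the row's own value gives the "leave-one-out" sum.
data["games_out_of_position"] = total - data["appeared"]
print(data)
```

This only works because the aggregate is a sum; a general "leave-one-out" for other aggregates would still need something like the loop version.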
does anyone know how to use kronecker delta or levi civita symbols with autograd / jax?
Anyone have an idea how to stack 3d matrices in numpy?
```py
corgi_1 = np.asarray(cv2.imread(corgi_jpgs[0]))
corgi_2 = np.asarray(cv2.imread(corgi_jpgs[1]))
corgi_3 = np.asarray(cv2.imread(corgi_jpgs[2]))
corgis = np.stack((corgi_1, corgi_2))
corgis.shape
```
(2,100,100,3)
when I try to add the third one to "master" array I get a value error
corgi_3.shape
(100,100,3)
do I just need to add an extra dimension along the fourth axis
@wide rose see if this or one of the other functions listed has the desired effect: https://numpy.org/doc/stable/reference/generated/numpy.vstack.html
tried it same error
tried literally all of them hahaha
before coming here
```py
corgi_3 = np.expand_dims(corgi_3, axis=0)
corgis = np.stack((corgis, corgi_3))
```
doesnt work
vstack gets it with the extra dim added
cool thanks
yea I just had to add the extra dim for vstack to work
wasnt expecting that behavior but in retrospect makes perfect sense
realized as soon as I typed here :/
yes
vstack and hstack don't add dimensions for you
you can use .reshape to "wrap" with an extra dimension
yea I used expand dim
ah i see, yep
I get so used to numpy being so smart sometimes i do dumb shit like that lmao
it'd be nice to have a "concat with extra dimension" function
there are column_stack and row_stack but i think those are 2d-only
yea it would
wonder how hard that would be
>>> def combine(x, y):
...     return np.concatenate((np.expand_dims(x, axis=0), np.expand_dims(y, axis=0)))
Make it assert that x and y have the same dims though*
Or use stack*
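To tie the stacking thread together, a small sketch with zero-filled arrays standing in for the `cv2.imread` results (the real images are not available here). `np.stack` creates a new leading axis, while `np.concatenate` / `np.vstack` join along an existing one, which is why the third image needs the extra dimension first.

```python
import numpy as np

# Stand-ins for three 100x100 colour images.
corgi_1 = np.zeros((100, 100, 3), dtype=np.uint8)
corgi_2 = np.ones((100, 100, 3), dtype=np.uint8)
corgi_3 = np.full((100, 100, 3), 2, dtype=np.uint8)

# stack adds a new axis: two (100, 100, 3) arrays -> (2, 100, 100, 3)
corgis = np.stack((corgi_1, corgi_2))

# To append one more image to the existing batch, give it a leading
# axis of length 1 and concatenate along axis 0.
corgis = np.concatenate((corgis, corgi_3[np.newaxis]), axis=0)

# Or skip the two-step dance and stack all three at once.
all_at_once = np.stack([corgi_1, corgi_2, corgi_3])
print(corgis.shape, all_at_once.shape)
```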
i have the following table
and i want to apply something like this
np.where(dfx['space'], dfx['full_name'].str.extract('.*\s(.*)'), dfx['full_name'])
its raises an error and i understand why
any ideas on how to execute this?
basically i want to get the last name, which is anything after the first space, if a space exists
i found a round about way of doing this(i created multiple columns and then removed them) but i would still like to know if its possible with one line of code
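One possible one-liner for the last-name question, sketched on made-up names (the real `dfx` is only in the screenshot). Splitting once on the first space and taking the last piece returns everything after the first space, or the whole name when there is no space, so no `np.where` or helper column is needed.

```python
import pandas as pd

# Made-up names; "full_name" is the column name used in the question.
dfx = pd.DataFrame({"full_name": ["Adrian", "Joe Gomez", "Virgil van Dijk"]})

# n=1 splits at most once, so the last piece is "everything after the
# first space"; with no space the only piece is the full name itself.
dfx["last_name"] = dfx["full_name"].str.split(" ", n=1).str[-1]
print(dfx)
```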
I meant like actually add it to the numpy library
I think you can with hugging face iirc
u can use np.newaxis/None as a new dim eg np.ones((3,4))[:,:,np.newaxis].shape
if i dont plan to validate or change my model do i just use train and test set? and use the test set in validation_data ?
instead of using validation set in validation_data for fit() method then i can just use the test set to evaluate it? or it is necessary to just evaluate the model on test set after the training?
hello, i would like to please ask, im currently self teaching machine learning. The algorithms I learned mostly are about linear regression and neural networks and logistic regression. I learned a bit about feature engineering like how to handle missing values or imbalanced datasets and do hyper parameter tuning a bit. I am focusing deeper now in NLP
am i on the right path to landing first internship?
im also making a machine learning project with NLP that uses full stack and Kubernetes
I have an array with shape (2, 3) and an array with shape (2, )
I want to do an element-wise division
a = [[1,2,3], [4,5,6]]
b = [3, 3]
# Want:
c = [[0.333, 0.666, 1], [1.333, 1.666, 2]]
It says "operands could not be broadcast together with shapes (2,3) (2,)"
yeah because the shapes must match or be broadcastable
since b has same element
can just use first index
a / b[:,np.newaxis]
you can use the validation data as a test set yes. the idea of the validation set is to keep some of your data from any train/test hyperparameter optimization process for the final model evaluation. if ur not doing any parameter or model changes, you can use the test set for more training data points and the validation set as the test set
Thankssss
It works. Do you mind explain what exactly the code does?
it is as the other commenter mentioned they need to be broadcastable. there is a good explanation here https://numpy.org/doc/stable/user/basics.broadcasting.html
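A short sketch of what `b[:, np.newaxis]` does, using the arrays from the question. Broadcasting compares shapes from the right: (2, 3) against (2,) fails because 3 and 2 differ, but (2, 3) against (2, 1) works because a length-1 axis is stretched to match.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
b = np.array([3, 3])                  # shape (2,)

# np.newaxis turns b into a column of shape (2, 1), so each row of `a`
# is divided by the matching element of `b`.
b_col = b[:, np.newaxis]
c = a / b_col

# b.reshape(-1, 1) builds the same (2, 1) column.
same = a / b.reshape(-1, 1)
print(b_col.shape, c)
```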
will it be the same as this?
or maybe at first i split the training data into train and validation sets and after some changes i use all training data for the training set and the test set for validation?
ok theres 2 things here
lets say u just have a train and test set
you build your model on the train set, and it seems to go ok, then you run it on your test set but it does not do as well.
so you change something in your model, do the training again on the train set, test it again with the test set and it does better
at this point u have no idea how your model will perform on unseen/new data
like it should be ok, but you dont actually know, because you have reused ur test set and have potentially implicitly introduced overfitting
so how do u know? the validation set
if u go and make changes after evaluating performance on some new data set, that new data set ceases to be a useful indicator of how the model will perform on unseen data
you explicitly said in your question you would not be making any changes to the model
so u need to be clear if you are going to tweak the model, or accept its performance as is
i am trying to compare image augmentation techniques to see which method gives the best raw performance. to do so without bias, i created a model from scratch and compare their performance without optimizing the model, because the models should always be identical and optimizing one would just introduce biases between them
if i explained it right so i dont need to use validation i just create 2 identical models train with different augmented techniques and test which one gives the best raw performance
by raw performance i mean no optimizations or tweak to be applied to the models so they will stay the same
yeah i think if i understand u, just a train and test set is needed and you can report the results for each augmentation type
i even go to the lengths of copying the initial weights of models so they really have the same starting point
yeah you should be able to set the random seed which should give the same initialization
and if i implement it using keras do i use the test set in validation_data during training or on evaluate() after training? is the the same?
btw am using tensorflow
i am not familiar enough with those libraries to be able to say
this is what i did to copy the weights is there a correct procedure?
cst_model.set_weights(gt_model.get_weights())
i prefer setting the seed unless you are really sure there are no other random things happening in other layers
but im not really a dl person
i dont know enough about keras/tf to be able to say either way
this is what is says on validation data of sequential.fit()
if i understand it correctly the models is seeing the validation every end of epoch but its not being trained for it right? so its the same as testing the models on new data did i understand it correct?
what dl?
deep learning
yeah the validation_data wont be used to train, so u could use your test set there and the results of your experiment would be whatever the final metrics are from the final epoch
nice nice
btw this is the example i trained without validation data
the metrics showed their are the blind guess of the model on training right?
im not 100% but if you didnt provide any validation data then yeah they would be the training data metrics
its good to track the metrics with a validation set
im always really suspicious of a model that learns really well such as that one seems to be doing
its often a sign of the model overfitting the training data
but it depends on the type of data you are working with, sometimes things do work very well
overfitting is when the model is too good on training but garbage on new data right?
yeah
if the model predict something like this what does it mean?
thats just the class
also softmax is probability distribution right?
so if am getting 1 is the model too confident of the answer?
softmax is a probability-like distribution
so the middle one is the predicted class?
yeah
what does it mean?
the overall values of the classes should be 1 right?
the values should all sum to 1 yes, thats what makes it probability-like
and they are all in the range [0,1]
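For reference, a from-scratch softmax in NumPy with made-up logits (not the model's actual outputs). It shows the two properties mentioned above: every value lies in [0, 1] and each row sums to 1. An output that prints as 1 with tiny values like `1e-05` next to it just means one logit is much larger than the others.

```python
import numpy as np

def softmax(logits):
    # Subtracting the row max avoids overflow and leaves the result unchanged.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up raw scores for one sample and three classes.
logits = np.array([[-3.0, 8.0, 0.5]])
probs = softmax(logits)
predicted_class = int(probs.argmax(axis=-1)[0])
print(probs, predicted_class)
```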
hyper parameter optimization, regularisation, different types of models, stuff like that
its kinda a how long is a piece of string type thing
depends on ur data, model, approach, desired outcome etc
in my case if i am comparing if the other model is overfitting while the other one is not then i can say the dataset is the one responsible ?
overfitting is fundamentally a model parameter problem
but it may also be due to not having enough data
in which case, you can still change model parameters to prevent overfitting, however it likely wont be a very good model
Hello
I am trying df.to_csv()
But it is not creating CSV file
It has created folder name as file name but not file
Ping me when replying
I am not getting output as CSV file in specified path
is state space search heuristic method??
can you say which state space search method?
like best first search uses heuristic, while bfs/dfs/uniform search doesnt.
Please show the exact code
simple state space that acts like BFS....and its one variant that tracks visited nodes that prevent cycles
both are not heuristic, right??
yes. bfs does not use heuristic, and even while we track nodes, its not heuristic.
I am not aware of the source of the given slide, but as far as I've studied in this field, I never used heuristics in bfs and dfs while writing programs.
hill climber needs heuristic but not bfs or dfs.
yeah i agree its wrong...it even specifies blind bfs...which has no heuristic
may be the person put it to explain that we dont do that in bfs and dfs by putting blind because well, it does not have any info about environment, except what environment serves as next state.
hmm may be
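To make the "blind" point concrete, a minimal BFS sketch with a visited set. The graph is invented for illustration. Note there is no scoring function anywhere: nodes are expanded purely in the order they were discovered, and the visited set only prevents cycles.

```python
from collections import deque

def bfs(graph, start, goal):
    """Return a path with the fewest edges from start to goal, or None."""
    visited = {start}           # cycle prevention, not a heuristic
    queue = deque([[start]])    # FIFO queue of partial paths
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Made-up graph with a cycle between A and B.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["E"], "E": []}
path = bfs(graph, "A", "E")
no_path = bfs(graph, "E", "A")
print(path, no_path)
```

A heuristic search such as best-first would replace the FIFO queue with a priority queue ordered by an estimate of distance to the goal.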
how can i get the the true positives etc from keras after training so that i can create confusion matrix? multiclass
you have y_test and y_pred both are enough.
Y_test = your target variable's true value (the portion which has not been seen by your model.) Y_train on the other hand is also the true value of your target variable but the portion used to train your model.
Y_pred = The prediction made by your model
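A sketch of turning `y_test` and predicted probabilities into a multiclass confusion matrix with plain NumPy. The labels and probabilities are made up; in practice `probs` would come from the model's predict call. `sklearn.metrics.confusion_matrix(y_true, y_pred)` produces the same table.

```python
import numpy as np

# Made-up true labels and softmax outputs for 6 samples, 3 classes.
y_test = np.array([0, 1, 2, 2, 1, 0])
probs = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.30, 0.60],
    [0.50, 0.20, 0.30],
    [0.10, 0.20, 0.70],
    [0.90, 0.05, 0.05],
])
y_pred = probs.argmax(axis=1)  # predicted class = highest probability

n_classes = probs.shape[1]
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_test, y_pred), 1)  # rows: true class, columns: predicted

true_positives = np.diag(cm)  # per-class correct predictions
print(cm, true_positives)
```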
anyone saw any difference between normalising whole data vs using only batch normalisation, or using both at the same time, what is a more optimal solution?
oh so the y test is the one i will provide manually?
during prediction the row here are the test images and the column are the classes right?
but the order is not the same from the generator to me trying to manually input a single image
i tried to predict using the very first test image and the output is different from the 1st row of the predictions using test_generator
i did not use the shuffle parameter so my test set probably reading the test images in order?
oh it default shuffles hahaha my bad
Normalising your whole neural network inputs improves your model no doubt. But remember that deeper layers are trained based on the output of the previous layer. And since the weights get updated via gradient descent, the later layers unfortunately no longer benefit from the earlier normalisation: they have to keep adapting to the previous layer's weight changes, which makes it much harder for them to learn their own weights!
With Batch Normalisation, we can avoid this! Batch Normalization makes sure that, independently of those changes, the input to the next layer is normalized. And above all, it does this in a smart way, with trainable parameters that also learn how much of this normalization to keep by scaling or shifting it.
I hope you understand it better now. ✌️
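A rough NumPy sketch of the training-time batch norm forward pass described above. This is a simplified version under stated assumptions: it omits the running statistics a real layer keeps for inference, and `gamma` / `beta` stand for the trainable scale and shift.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalise each feature over the batch dimension...
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # ...then let the trainable scale and shift undo as much as needed.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=10.0, size=(64, 3))  # un-normalised activations

plain = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))
shifted = batch_norm(x, gamma=np.full(3, 2.0), beta=np.full(3, 3.0))
print(plain.mean(axis=0), plain.std(axis=0))
```

With `gamma=1, beta=0` every feature comes out with roughly zero mean and unit variance, whatever scale the previous layer produced.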
thanks, is there any sense at all to normalise whole dataset and use it with batchnormalization compared to using only batchnormalization?
I'm not sure I understand what you mean by 'providing Y_test manually' but both Y_test and Y_train are gotten from the original Y when you split your whole data into train set and holdout set.
Y_test is very important because that's what you'll use to evaluate the accuracy/lapses/difference in the prediction made by your model (Y_pred).
i used flow from directory to get my test images and this is what it returns
I guess maybe if you aren't using NN. Again, it also might depend on the individual, the task, and how they wanna approach the problem. But as always, Batch Normalization >>>>>
if i try to get the y
its said too much to unpack
how do i get my y_test from the test_generator?
I don't see any error here though. I'm more of an NLP guy (because that's what I'm learning at the moment) So I don't have enough experience in Computer Vision yet.
So if the problem is actually beyond what's in the pics then, other people here can help out
Consider increasing your number of batches, perhaps it'll help.
i got it now this helped me https://stackoverflow.com/questions/45413712/keras-get-true-labels-y-test-from-imagedatagenerator-or-predict-generator
i now get the true values of my test set
btw is this normal predictions for the model? .
a lot of negatives, some dont even have a positive prediction, that means the model cant predict that input?
it's e powers again, they are just small fractions
its negative values right?
no
aww
here is the example haaha my bad
oh you guys back at it again haha
I'm working with some data and its ballooned into a massive amount of if statements that need to include or exclude certain key words. Looking like this
if ((routes[0] == 'IN' or routes[0] == 'BANG') and (routes[1] == 'IN' or routes[1] == 'BANG')):
    return 'DBL DIG'
if (((routes[0] == 'IN' or routes[0] == 'BANG') and routes[1] == 'UNDER') or (routes[0] == 'UNDER' and (routes[1] == 'IN' or routes[1] == 'BANG'))):
    return 'DRIVE 6'
if ((routes[0] == 'UNDER' and (routes[1] == 'SHORT OUT') and (routes[2] != 'SHORT OUT' and routes[2] != 'RETURN' and routes[2] != 'RETURN'))):
    return 'DRIVE 7'
I think to simplify this I'd be able to use a dictionary structure but I'm unsure how to proceed/ how it would work to exclude certain values as well. Any help would be appreciated!
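One way to express those rules as data instead of nested ifs, sketched with the three rules from the snippet. The helper names (`any_of`, `none_of`, `RULES`, `classify`) are invented for illustration. Each rule is a name plus one matcher per route position; a matcher either accepts a set of values or rejects a set of values, which covers the "exclude" case.

```python
def any_of(*names):
    allowed = frozenset(names)
    return lambda route: route in allowed

def none_of(*names):
    banned = frozenset(names)
    return lambda route: route not in banned

DIG = any_of('IN', 'BANG')

# (label, one matcher per position); the first matching rule wins.
RULES = [
    ('DBL DIG', (DIG, DIG)),
    ('DRIVE 6', (DIG, any_of('UNDER'))),
    ('DRIVE 6', (any_of('UNDER'), DIG)),
    ('DRIVE 7', (any_of('UNDER'), any_of('SHORT OUT'),
                 none_of('SHORT OUT', 'RETURN'))),
]

def classify(routes):
    for label, matchers in RULES:
        if len(routes) >= len(matchers) and all(
            match(route) for match, route in zip(matchers, routes)
        ):
            return label
    return None  # no rule matched

print(classify(['IN', 'BANG']), classify(['UNDER', 'SHORT OUT', 'GO']))
```

Adding a new concept then means adding one line to `RULES` rather than another if block.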
Would it overfit if I were to run a model, save it and then run it again with the same data?
Hello, I want to create a dataset regarding heart rate and oxygen saturation that will determine whether a person has fainted 0 or 1. My question is how will I get the fainted value for training?
Im creating my own dataset because I couldn't find any dataset that have these features.
you have to already know if the person fainted or not. if you can figure out if the person fainted using some function of their heart rate and oxygen saturation, then you don't need ML.
hello
hello, do you have anything to say?
I've used the wrong word, It should predict whether the person will faint or have a chance of fainting.
so you already have whether or not they fainted in the training data?
it might be easiest to just show the data that you have. like copy/pasting the CSV into the chat, if you have that.
not yet, I don't have any data yet because I'm having trouble with the value for the faint column. In short, I want to create my own dataset (because I can't find anything on kaggle nor google) but I don't know how to fill the column for faint.
Here's an example of the dataset that I would want.
sadly, I don't have any dataset (I couldn't find any on google nor kaggle). But I am willing to create one. I have bought sensors MAX30102) to be used in data gathering (HR and SPO2)
For the rest of this conversation, I will only look at text (no screenshots).
So, like I said, you have to already know if they fainted or not. The point of machine learning is that it learns from real examples of the inputs (gender, age, HR, SPO2) and outputs (FAINT). So, you don't have access to a dataset with this information, you will have to conduct a study.
The alternative is to make up fake answers and see what happens, if this is just for educational purposes.
my bad. won't happen again
this is my last resort of doing it. But is it possible to use ML without dataset and only parameters?
So, there's X data and y data. X are the pieces of information that supposedly cause y. Your Xs are (gender, age, HR, SPO2), and your y is (FAINT). You're asking if you can predict y given only X, but you don't actually know what y is.
yes
if you make a model that returns a y value for a given X, but you don't know what y is, you'll never be able to confirm that the model is correct.
well, that makes sense. it's really hard to find data that have similar attributes as mine
that's my only problem
you can use unsupervised learning, which could tell you which X instances are more similar, and you might discover that there's two discernable subsets. But you'd have no way of knowing which is "faint" and which is "did not faint".
you might look for datasets with similar features (gender, age, HR, and SPO2 are all features) and see if you can derive these features from that.
for example, if you had a dataset of people that gave their birthday (as a timestamp), but you want their age (as a number of years), you could calculate that based on what you know about how age works.
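The birthday-to-age example can be sketched in a few lines. The function name and dates are made up; the point is only that a feature you need (age) can sometimes be derived from one a dataset already has (birthday).

```python
from datetime import date

def age_in_years(birthday, on_date):
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (on_date.month, on_date.day) >= (birthday.month, birthday.day)
    return on_date.year - birthday.year - (0 if had_birthday else 1)

day_before = age_in_years(date(2000, 6, 15), date(2022, 6, 14))
on_the_day = age_in_years(date(2000, 6, 15), date(2022, 6, 15))
print(day_before, on_the_day)
```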
I get this one and it's kinda making sense to me now
thanks!!
Hello everyone, I've been learning ML, in neural networks now [Andrew Ng course]...was wondering if I should go ahead and try out neural network implementation in pytorch tutorial docs or try and implement it from scratch first, from course material and other resources. I do have a basic knowledge of the working, feed forwards, backprop and stuff [sentdex, 3b1b] BUT I wanted to try it out on a dataset and pytorch docs seems fun. What should I go for first?
I'm coding (manually) an MNIST categorizer using no hidden layer. So it's just input -> output -> softmax -> cross entropy loss. I was trying to calculate the loss's derivative wrt the output, and stumbled upon this formula:
So... I don't even need to calculate the loss to do back prop??
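The formula itself was an image and is not in the log. Assuming it is the standard result for softmax followed by cross-entropy, the gradient of the loss with respect to the pre-softmax outputs is `softmax(z) - y`. A quick numerical check of that identity with made-up values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])  # made-up pre-softmax outputs
y = np.array([0.0, 0.0, 1.0])   # one-hot target

# Analytic gradient: needs only the softmax output and the target.
analytic = softmax(z) - y

# Central finite differences of the loss, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y)
     - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
print(analytic, numeric)
```

So the loss value is not needed to compute the gradient; it is still worth computing to monitor training.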

hello
i was wondering if you could call a function from within .assign() that returns a dataframe instead of a series
so something like this
...assign(new_col = lambda dfx : get_df(dfx)[0])
would the '[0]' part be sufficient to turn it back into a series?
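A small sketch of the `.assign()` question. `get_df` here is a hypothetical helper invented for the example. On a DataFrame, `[0]` selects the column labelled `0`, so it only returns a Series if such a column exists; `.iloc[:, 0]` selects the first column by position regardless of its name.

```python
import pandas as pd

def get_df(dfx):
    # Hypothetical helper that returns a DataFrame aligned with dfx.
    return pd.DataFrame(
        {"double": dfx["a"] * 2, "square": dfx["a"] ** 2}, index=dfx.index
    )

df = pd.DataFrame({"a": [1, 2, 3]})

# .iloc[:, 0] picks the first column as a Series, whatever its label.
out = df.assign(new_col=lambda dfx: get_df(dfx).iloc[:, 0])
print(out)
```

With the helper above, `get_df(dfx)[0]` would raise a `KeyError` because no column is labelled `0`.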
how large is difficulty spike for starting to learn how to make machine learning? ngl, I still feel like I'm shaking off rust, and I have no experience with interacting with large datasets...
Doing a little digging, it sounds kind of necessary to learn SQL to a certain degree at very least.
i made ai tictactoe guys
chk out this link
First move is played by computer & for now it plays it's first move as the middle box only
You can see how the AI wins / makes the game a tie in the video
Thanks for watching
Consider sharing&subscribing
ML is a large space. you can do some stuff by taking datasets, cleaning the data, and plugging it to a model with minimal configuration. if you're just doing it for interest's sake, that might be enough. but a career in ML would require SQL, as well as linear algebra, probability/statistics, and sometimes calculus. among other things.
okay thanks
Would it be possible (or a good idea) for someone with geometry level knowledge to learn the algebra and calculus that is needed for ml?
do you know any amount of trigonometry? derivative calculus isn't that big of a jump from algebra. integral calculus is more difficult.
Yeah I know a good amount of trigonometry, and I've looked at some of the math behind it and it doesn't seem that hard, I just don't know if it's that big of a deal that I don't know much algebra 2
algebra 1 and algebra 2 are just two courses for one subject. they aren't universally recognized ways of splitting up algebra.
We didnt split algebra like that in my uni lol
Right
Do you think it would be a good idea or should I wait a few years until I take all the math courses?
that's really a matter of how much free time you have and how you want to spend it.
though if you're learning algebra, it might not be a bad time to learn array/matrix arithmetic.
Alright
other branches of math you could look into are set theory and graph theory. they differ from the kind of math you learn in high school in that they're a lot more conceptual. there isn't a whole lot of calculating.
set theory is just about having things in groups, and graph theory is just about things and relationships between things. they're used to model real-world phenomena in precise terms.
Interesting. Would a lot of ml algorithms like knn or logistic regression require less math than neural networks?
not really. they all involve a lot of math, if you want to understand how they work. though you can understand what they do and when you might use them without knowing their exact formulae.
Ok, thank you for the advice
My kaggle notebook gives memory error seemingly at random (can run smoothly one time and return an error another) what should I do?
Training on MNIST using a 1 layer NN (left, no hidden layer) and a 2 layer NN (right, 1 hidden layer).
So, in the 1 layer case, the accuracy reaches maximum after just 1 epoch and basically just fluctuates around there. Is it normal?
Hello, I am totally lost.
I am trying to learn machine learning ( but I am 14 which means I don’t have a lot of experience)
So I bought the famous and recommended book “Hands-On Machine learning with Scikit-Learn, Keras & Tensorflow”.
I then saw in the Prerequisites that this book assumes that I am familiar with Python’s main scientific libraries in particular Numpy, Pandas and Matplotlib.
I then learned these libraries but I don’t really understand the code part of the book still because it uses scikit learn, keras and Tensorflow without explaining what the syntax means so I told myself that i should start learning ML on youtube first but the explanations are too simplified so the reason I wrote this big message is to ask you please tell me where to start
Or for anyone who has already read this book does it explain later on the syntax of Tensorflow, Keras and Scikit-Learn?
probably python? my guess is the book uses python so if you dont understand the syntax maybe its python?
or you already knew python?
I already know python
I am talking about the syntax of the machine learning libraries like tensor flow and Keras
it is python objects and such i think
maybe what you want is the documentation for those libraries, to see what those methods and objects do?
Yeah what I am saying is I don’t understand the methods of these ML libraries but I am asking if (for anyone who read the book) The book will explain the methods later on and I am also asking where should I start to learn ML
there is a course recommended here its machine learning by andrew ng iirc
Ok
You just answered my question / helped me so thank you a lot
I quickly watched it and although math is extremely important for ML it only talks about math and not the libraries
that course is for theories i think and for libraries documentations is the way i think
Sorry?
i mean the course is for ml fundamentals and if you want to learn about the libraries its the documentations you should read i think
.
If you Google them they will come up, e.g. "pytorch docs": https://pytorch.org/docs/stable/index.html
Ok thanks
Also these libraries have their entire own web sites which may have tutorials on them: https://pytorch.org/tutorials/
Yes this is where I just went
Thank you a lot then @pastel valley and thank you @iron basalt have a great day
Both of you
its a bad idea to learn that way
my advice: get up to scratch with your math background first, then do ML side-by-side
you get an error at the output layer and back prop is a way of propagating that error to earlier layers, so if u have no earlier layer theres nothing to propagate the error to
yes it's normal: with only an output layer the model is literally a logistic regression, so it has little capacity and plateaus quickly
you can think of a neural network as a weighted set of linear/logistic regressions, the "learning" part is finding the weights
if you have no hidden layer, there are no hidden-layer weights to learn, only the output layer's
if there are no activation functions, your network is just a linear operator (regardless of the number of layers it has, actually), and so equivalent to linear regression, or to logistic regression once you put a sigmoid/softmax on the output. Since without a hidden layer there are no hidden activations either, the same happens here.
(if we are being pedantic, it's affine (from linear algebra POV, from calculus POV it's "linear" (deg. 0 or 1)), the activation function is linear)
("linear layer" then refers to the activation function)
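The point about stacked layers without activations can be checked numerically. A minimal NumPy sketch, with made-up layer sizes and random weights (all names here are hypothetical), showing that two activation-free layers compute the same thing as one affine layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with no activation function (hypothetical sizes).
W1, b1 = rng.normal(size=(784, 32)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 10)), rng.normal(size=10)

x = rng.normal(size=(5, 784))  # a batch of 5 fake inputs

# Forward pass through both layers.
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single affine layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))
```

So extra depth only adds expressive power once a nonlinearity sits between the layers.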
Ok now at least it’s an option but the reason I am not directly doing it this way is because I am only 14 so I tell my self I have time but If I see I seriously need the math I’ll learn it
I do agree - I'm 17 myself, and my approach was to do it both sides - learn things bottom up as well as top down; I really don't think that's the best or the most efficient approach, but use what feels best for you
Don't let your age bring you down. I'm 18 and got my first ML Engineer job recently -which tbh I'm doing terrible at-
Is there an explanation for my kaggle notebook to terminate itself with a memory allocation error every now and then?
It works fine one time and then gives me the error on another run
can you show one such error as an example? the whole thing please, as text.
It normally just this:
Your notebook tried to allocate more memory than is available. It has restarted. but the notebook isn't consistent with the error.
Also I can send the part(s) where I think may be causing the errors but it will take time to reproduce
After I run
img_height =256
img_width = 256
num_channels = 3
unet = unet_model((img_height, img_width, num_channels))
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
which isn't an issue, I don't think
@serene scaffold Tried to give all the information needed, you can ask anything
Kind of desperate, been on this for hours now
@urban prism sorry I've been afk. Though generally speaking, I don't know what guarantees Colab makes about how much compute power they'll give you
It's alright. Thanks
this happens to me sometimes with colab with bigger models/datasets
the recommendation is to reduce one or the other
or get colab pro
even then sometimes you run out of memory

is there any way to predict when it's going to happen? idk if the memory allocation per user is dynamic
it really seems random at times tbh. the info at the top right is sometimes useful, but i think sometimes youre sharing with others
its like some cloud providers
I see. Thanks for the info! @misty flint
got this while training a custom model
does changing random in cfg file to 0 solve this?
im using google colab
Hi
how can i tackle this problem (which data structure i should use)
Basicaly i have 2 columns that represent each the white and black player in chess
I need to find out which player has played the most with another player
I was suggested to use this view
I am just not sure how is it called
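One possible way to approach the chess question, sketched under assumptions: treat every game as an unordered pair of players and count the pairs. The lists `white` and `black` and the player names are made up for illustration.

```python
from collections import Counter

# Hypothetical data: one entry per game.
white = ["ann", "bob", "ann", "cat", "bob"]
black = ["bob", "ann", "cat", "ann", "ann"]

# frozenset makes (ann, bob) and (bob, ann) count as the same pairing.
pair_counts = Counter(frozenset(pair) for pair in zip(white, black))

# The pairing with the most games between them.
most_common_pair, games = pair_counts.most_common(1)[0]
print(sorted(most_common_pair), games)
```

If the data lives in a pandas DataFrame, the same idea applies after pulling the two columns out as lists.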
Hello, I am trying to learn the main ML libraries like Tensorflow and scikit learn but when I watch courses on YouTube like the ones made by “FreeCodeCamp.com” people (in the comments) say it’s a great tutorial but personally I don’t understand what’s going on as they just write the scikit learn methods without explaining what they do I am lost.
The documentation and tutorials on the tensorflow and sklearn websites is a great resource
As well, if you don’t understand the fundamental mathematics/logic of the ML algorithms, you may need to study up on that background knowledge before things start to make sense
anyone want to start a data science podcast with me? i literally just started learning and have tons of questions so the podcast would just be me asking questions and my partner answering them
i think it could potentially be entertaining
or it could be some else that is also new and we report on our progress together
well, you shouldn't try to "learn libraries". you should learn to solve problems. come up with an ML project that may involve sklearn or tensorflow and do it.
yo this is overfitting right?
btw in those metrics i can also see that around 30 epochs the model started to learn almost nothing?
could be. how did the model perform on the test data?
or did you predict on the test data after every epoch...?
yeah its test data already
but the accuracy is 80% above mostly is it still good?
or do models should naturally get higher?
depends on your use case
if 80% is better than human performance for that task, then yes. if it's for something that's not very important, maybe
based on the margins between the train and test metrics, is it somewhat acceptable? or does that also depend on the task?
it always depends. for all I know, you want a model that's always wrong.
there is not right or wrong models its just about tweaking it to how you want it to perform? its bugging me if i created a right model or a wrong one but if i understand you correctly then if this performance is acceptable to me then the model is right ?
So there’s a quote often used when making new theories in science in general- “there’s no such thing as a model that is right, only models that are useful”
This quote is doubly true for machine learning where you’re essentially trying to skip the “actually understand what is happening” part of science right to the “have a model that can predict future events” part of science
If you think 90% test accuracy on your data is good enough to be useful, congrats! you’re done fine-tuning that model
i still dont get it
the individual predictions that a model make can be right or wrong. beyond that, whether or not the model itself is good enough for a certain situation depends.
We can’t make that decision for you, there’s no arbitrary “model is good enough now” cutoff that applies to all models
i did not do any fine tuning i just want to compare identical models(models with same initialized weights and layers ) performance based in the data they are trained with so maybe my questions are out of place hahaha
but is there ever a model that has 98%+ accuracy? like is there ever someone capable of creating a very good model?
And yea, like you observed it seemed your model may have stopped learning anything new around 20-30 epochs of training. At least in terms of achieving better statistical metrics. That’s quite normal, google “neural network training early stopping” and you’ll find resources in that
there are models that are basically 100% for problems that don't have ambiguous cases.
What a good accuracy is completely depends on the prior distribution of targets and features in the dataset. There’s no generic answer
about neural network training early stopping: are there cases, called overtraining, where the model's performance will decrease? like after some epoch it will just curve down? is this what it means?
One model being more “overtrained” than another means the difference between test metrics and train metrics is higher
This may or may not have anything to do with how many epochs of training the model undergoes
so if i plan to compare performances of models ill just use the same epochs for all models?
Unfortunately it’s not that simple. Models with different architecture may take different numbers of epochs to train to equivalent performance. Usually this type of thing is addressed by using a consistent validation set and consistent early stopping criteria between the models you are trying to compare.
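A consistent early stopping criterion can be as simple as "stop once the validation loss has not improved for `patience` epochs in a row". A framework-free sketch of that rule; the function name, the `patience` and `min_delta` parameters and the loss values are all made up for illustration:

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

Applying the same rule and the same validation set to every model being compared keeps the comparison fair even when the models need different numbers of epochs.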
in my case i am comparing identical models same architecture
i just create 3 models and train them to classify same classes but the data they will be trained is different per model
IMO a good way to conceptualize this is that a model is not separate from the data is trained on. Identical architecture + identical training procedure + different data -> different models , not identical models
they are originally have the same data but applied with different augmentations techniques, for example the data set of cats, dogs, birds they will give their own training to the model
now if i apply augmentationA to the original dataset, how much does the same model improve, or does it perform worse than the same model trained on augmentationB with the original dataset? then is A better, or B, or the base model without augmentation applied to the cats, dogs, birds dataset?
does it make sense? 😅
its like experiment, does applying this kind of augmentation to this type of classes will be better or not or how about this type of augmentation or this one
like that
sorry my English is not that good 😅
Yes, it’s a sensible question, but since you’re introducing different augmentation data into the models you cannot just give them the same architecture and same training procedure and compare their performance and use this as a way to understand which augmentation is always “best”
How would you know if perhaps a slightly different model architecture trained on augmentation B data wouldn’t outperform the initial model architecture on un-augmented data?
yes this is also the one thing that will negate this experiment because there are cases where what if its a different architecture used then maybe the output will not be the same
but if i say like the class features
for example if i apply this experiment on classifying cars and say like augA is rotating etc, and augB is color casting etc, then it wouldnt makes sense because there seems to be nothing wrong with those augmentations
but if for example my model should classify something like the color of a ball, then there is a chance that augB will perform worse, because applying color casting to samples where color is the key feature would be a problem. example: the classes are blue ball and red ball, i have an original sample of a blue ball, and when i apply augB it produces a red-ball-like augmented image. then the model will learn that it's a red ball, but in fact it's just an augmented image of a blue ball
of course it is all only a "what if"
by not focusing on the model instead of the data. models learn what we give to them but if we un intentionally gave them a wrong data like the ball augmentation sample maybe just maybe i can conclude that augB is bad for classses that has color as a important feature
if i said it correct
this is the biggest question does it make sense? or i am wasting my time?
Hi can someone help me
So your experiment captures “given a fixed model architecture, which data augmentation performs best”. That’s fine. The disconnect that confuses me is that there is nothing stopping anyone from changing your model architecture, thus invalidating the test results. You can define a more generic test that includes more model architectures
oh right right that one can be said i havent figure out the possible reason for that so its a big flaw
btw in terms of cnn models what differences are there on their architectures the amount of layers and types of layers?
or there are more?
Many, many more
if i say that given this architecture and this classes which type of augmentation is the best and worst then its back to why not use different model? hahaha wew
Indeed
Hey, I want to learn more about AI and implement it into python scripts, but I don't know where to start. Any suggestions ? (I'm a total beginner when it comes to AI)
including something in a program that already exists is not "implementing it". for some reason the word "implement" gets thrown around a lot.
anyway, what is your understanding of what AI is, at a high level?
I'm trying to implement this
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
def dice_metric(inputs, target):
    intersection = 2.0 * (target * inputs).sum()
    union = target.sum() + inputs.sum()
    if target.sum() == 0 and inputs.sum() == 0:
        return 1.0
    return intersection / union
def dice_loss(inputs, target):
    num = target.size(0)
    inputs = inputs.reshape(num, -1)
    target = target.reshape(num, -1)
    smooth = 1.0
    intersection = (inputs * target)
    dice = (2. * intersection.sum(1) + smooth) / (inputs.sum(1) + target.sum(1) + smooth)
    dice = 1 - dice.sum() / num
    return dice
def bce_dice_loss(inputs, target):
    dicescore = dice_loss(inputs, target)
    bcescore = tf.keras.losses.BinaryCrossentropy()
    bceloss = bcescore(inputs, target)
    return bceloss + dicescore
Though it returns:
/opt/conda/lib/python3.7/site-packages/keras/engine/training.py:853 train_function *
return step_function(self, iterator)
/tmp/ipykernel_35/144329947.py:17 bce_dice_loss *
dicescore = dice_loss(inputs, target)
/tmp/ipykernel_35/144329947.py:8 dice_loss *
num = target.size(0)
TypeError: 'NoneType' object is not callable
I'm a bit lost. Any ideas?
unet.compile(optimizer=Adam(learning_rate=1e-4), loss=[bce_dice_loss], metrics=[dice_metric])
All I know is that AI's are a bunch of algorithms that do specific tasks depending on factors you have set. so yeah, nothing above average knowledge x) I might even be wrong here.
That's a lot better than the general public's understanding. Yay!
@urban prism if you get an error about NoneType, it usually means something returned none when you thought it returned something
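The `TypeError` above most likely comes from `target.size(0)`. That is the PyTorch spelling; NumPy-style arrays (and TensorFlow tensors with NumPy behavior enabled) expose `size` as an attribute rather than a method, and the batch dimension is read from `shape[0]`. A minimal NumPy sketch of the same Dice loss, as an illustration only and with made-up masks (inside a Keras loss the TensorFlow equivalents would be needed, e.g. `tf.shape(target)[0]`):

```python
import numpy as np


def dice_loss(inputs, target, smooth=1.0):
    """Dice loss averaged over the batch; inputs/target are (batch, ...) arrays."""
    num = target.shape[0]                 # batch size, not target.size(0)
    inputs = inputs.reshape(num, -1)      # flatten each sample
    target = target.reshape(num, -1)
    intersection = (inputs * target).sum(axis=1)
    dice = (2.0 * intersection + smooth) / (
        inputs.sum(axis=1) + target.sum(axis=1) + smooth
    )
    return 1.0 - dice.sum() / num


# Hypothetical masks: a batch of two 4x4 single-channel images.
perfect = dice_loss(np.ones((2, 4, 4, 1)), np.ones((2, 4, 4, 1)))
disjoint = dice_loss(np.zeros((2, 4, 4, 1)), np.ones((2, 4, 4, 1)))
print(perfect, disjoint)
```

A perfect prediction gives a loss of 0, and a fully wrong one approaches 1, which is the behavior the original function was aiming for.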
I have no idea how to use one in python and get it to do what I want it to tho, and where can I learn more about it. is there some kind of full course on the internet that you "validate" ?
We're a large, friendly community focused around the Python programming language. Our community is open to those who wish to learn the language, as well as those looking to help others.
there's also an online course by Andrew Ng, but it hasn't been reviewed by our staff yet. I also recommend 3blue1brown on YouTube for the math stuff.
Thanks !
whats the best way to code a lot of if statements?
not sure I'm following. lots of if statements are usually regarded as poor design. is this a data science question or a #software-architecture question?
might be a software design. But i have code that needs to check if a list contains specific words but excludes specific ones as well
if you want refactoring advice, try there instead.
if you need general help, see #❓|how-to-get-help
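For the "contains specific words but excludes others" check, set operations often replace a long chain of if statements. A minimal sketch, assuming "contains" means every required word must be present; the function name and word lists are made up:

```python
def matches(words, required, forbidden):
    """True if `words` contains every required word and no forbidden word."""
    present = set(words)
    # required <= present is the subset test; isdisjoint checks for no overlap.
    return required <= present and present.isdisjoint(forbidden)


words = ["fast", "red", "car"]
print(matches(words, required={"red", "car"}, forbidden={"truck"}))
print(matches(words, required={"red"}, forbidden={"car"}))
```

If any one of the required words is enough, `not present.isdisjoint(required)` would replace the subset test.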
sigh
i hate minitorch
if anyone asks you to do it for fun to "learn ML from scratch", you should heavily reconsider
unless thats something youre passionate about, then feel free
doesn't make sense re-implementing literally everything from scratch
I'd say just learn how to implement stuff, implement papers then. Writing your own autograd engine is pretty useless IMO since its just heuristic/rule-based anyways
yeah too bad its assignment for deep learning class
so no choice but to drag my feet
even tho i will never use this again
atleast you'll get a better understanding of tensor manipulation
use einops if you aren't already - might save you a ton of time
Autodiff systems are actually pretty straightforward, the real difficulty with these kinds of things (libraries) is having ALL the features. It's having to implement N different things. And if they can interact with each other it starts to become an N^2 problem. For example when implementing a database and wanting all the different selections, HTTPS interface, maybe a GUI, etc vs just having the base systems (the record storage system / file format, some indexing).
Even if the different features are not hard to implement (especially if someone already did it before), it just takes a lot of time, especially for debugging it all together.
On the other hand, if you know you only need a few of the features, then it wont take too long and you can probably optimize it further because it's more specific and more specific / less general tends to be faster on computers (and less code, so easier to debug and maintain and browse, etc).
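To illustrate how small the core of an autodiff system can be when only a few features are needed, here is a toy reverse-mode sketch supporting just scalar `+` and `*`. The `Value` class and its design are made up for this example and are not taken from minitorch:

```python
class Value:
    """A scalar that remembers how it was computed (toy design)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # tuple of (parent Value, local gradient)

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Topologically order the graph so each node's gradient is complete
        # before it is pushed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local_grad in node._parents:
                parent.grad += local_grad * node.grad  # chain rule


x, y = Value(2.0), Value(3.0)
z = x * y + x  # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

Everything beyond this (tensors, broadcasting, more operators, GPU kernels) is where the feature count, and the time, starts to add up.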
agreed. in the end, its more of a programming exercise than conceptual one
Yeah, since it's been done before (it's called minitorch after all, so it probably has nothing new in it).
oh well. I suppose I'd have to do it too one day
Ofc, being able to do that grind is super important if you want to make something new, it's a huge initial hill to climb before you see any results.
Or in RL terms, a very very delayed reward.
doesn't seem like a particularly useful grind - but I suppose my programming skills needs a lot of work
yea. still...
If you do want to make something new that requires implementing a new library / system, then I would recommend making it very specific, don't let the feature count get too high because the time needed is not a linear function of the number of features, so adding even one more might make it way more work than expected.
It does seem a bit much to learn ML from scratch from browsing it a bit. This is more of what you might do later to package it into a nice generic library.
yes im glad the guy who does this for work agrees with me
i now feel 200% validated

When I execute this code, trainx = input_df.loc[train_index], I get this error KeyError: "None of [DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',\n '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',\n '2020-01-09', '2020-01-10',\n ...\n '2021-05-17', '2021-05-18', '2021-05-19', '2021-05-20',\n '2021-05-21', '2021-05-22', '2021-05-23', '2021-05-24',\n '2021-05-25', '2021-05-26'],\n dtype='datetime64[ns]', name='Date', length=512, freq=None)] are in the [index]"
Im not sure why
How should I adjust the learning rate wrt the batch size?
Sorry if this is a stupid question but what are some actual uses of svm, knn, logistic or linear regression, and other similar algorithms? Wouldn't a human be able to figure out patterns like this and accurately predict outcomes?
Some of them are taught because they were relevant historically and so the school thinks you should know about them
Some of them are used as components of larger and more complex systems
Hello, does anyone know how to deploy and easily share notebooks from Jupyter without having to spin up a separate streamlit server? In other words, I want a way to share notebooks like they were separate dashboard pages. I don’t want to pay for plotly enterprise.
i would argue that those models are more widespread than deep learning
I basically want to create an on-prem instance with some way to publish the output of Jupyter notebook plots in a way that any non-software engineer can go to and view. I don’t want to get in the business of trying to design separate pages using html. There has to be a better way.
linear and logistic regression is also valuable as a tool for interpretation, they are widely used in medical research and other sciences
also its not a question of if humans can do it, humans can drive cars great too but ppl are spending a lot of time on models for that
ML is best suited when u have relatively simple tasks you need to do a gagillion times very quickly
also the different model types do better with different types of data/problems, its not necessarily possible to tell in advance what is the best type of model to use for a given problem
its a rookie mistake i see over and over to pick deep learning models for everything
Actually I think a more common mistake is thinking there's a model for everything

good point
You could host a Jupyter lab instance with read-only notebooks? https://stackoverflow.com/questions/58944458/read-only-python-notebook-in-jupyter-lab
Interesting…. I’m just curious why there is not a mainstream approach. I feel like there are so many disparate systems.
Thanks for the link though
does anyone have a really simple example of multivariate regression with tensor flow?
I have done regression with a csv with just an x and a y, but if i had two independent variables what would i do
the data that i am using is 100 random floats from -10 to 10 for all 3 columns
but idk how to find any relationship btw them
i understand 2 variable linear regression, but i am completly lost with 3 variable regression
guys is there any AI assisted image labelling tool thats actually free to use? like all we need to do is draw a bounding box for some images and rest will be taken care by the ai
i have like 278k images but i want to label that 😭
this might help?
thanks ill look into it
with tensorflow, I found this: https://www.tensorflow.org/tutorials/keras/regression
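With two independent variables the model is just y = w1*x1 + w2*x2 + b, so the only change from the one-variable case is an extra column. A minimal sketch using NumPy least squares rather than TensorFlow, with made-up data where the true relationship is known:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends on two independent variables plus noise.
x1 = rng.uniform(-10, 10, size=100)
x2 = rng.uniform(-10, 10, size=100)
y = 3.0 * x1 - 2.0 * x2 + 5.0 + rng.normal(scale=0.1, size=100)

# Design matrix: one column per variable plus a column of ones for the intercept.
X = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

w1, w2, intercept = coef
print(w1, w2, intercept)
```

Note that if all three columns are independent random floats, as described above, there is no relationship to recover and the fitted weights should come out close to zero.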
---
hello, I was learning pytorch from the docs and following some sentdex neural networks examples. Was wondering if it is necessary to transform the target labels into one-hot encoded vectors. If I transform I'm having problem with nll_loss()
hey guys! i am new here, i am having trouble choosing a project for my final sem in uni, i would like to make a project on machine learning, any idea where can i get help from and get the project done? are there any good courses in udemy which can help?
Hey i need to know how to use stegnatography to hide a code inside an image and when omage opened by someone it pastes a code inside the browser console , pls this is the imformation i required for my project if u know pls help
Nobody help regarding ai and datascience in this server
Hello, I want to start my AI/Ml journey can anyone please share some roadmap or structure which I can follow for learning
can i train a neural network with untrainable params in them?
How do I make it save according to the increase of metric?
312/312 [==============================] - 74s 197ms/step - loss: 0.4691 - dice_metric: 0.7883 - val_loss: 0.5346 - val_dice_metric: 0.7840
Epoch 00001: dice_metric improved from inf to 0.78828, saving model to model_unet.h5
Epoch 2/32
312/312 [==============================] - 67s 207ms/step - loss: 0.2512 - dice_metric: 0.8916 - val_loss: 0.2132 - val_dice_metric: 0.9031
Epoch 00002: dice_metric did not improve from 0.78828
Epoch 3/32
312/312 [==============================] - 71s 217ms/step - loss: 0.1809 - dice_metric: 0.9258 - val_loss: 0.1983 - val_dice_metric: 0.9246
Start with the area that interests you. Like medicine, agriculture, art etc.
Could you explain the untrainable parameters?
What's ai or ds in this task? Seems potentially harmful
What the project is about?
CNN for traffic sign classification using LeNet
As i can see it saves model on metric increase
Yes it's possible. It depends on your NN architecture.
Do you have data to train the model. Will be first task.
312/312 [==============================] - 74s 197ms/step - loss: 0.4691 - dice_metric: 0.7883 - val_loss: 0.5346 - val_dice_metric: 0.7840
Epoch 00001: dice_metric improved from inf to 0.78828, saving model to model_unet.h5
Epoch 2/32
312/312 [==============================] - 67s 207ms/step - loss: 0.2512 - dice_metric: 0.8916 - val_loss: 0.2132 - val_dice_metric: 0.9031
Epoch 00002: dice_metric did not improve from 0.78828
But didn't it from 0.78828 to 0.8916 ?
taking help from udemy for my project the course should have a data model
i just dont know if its viable for resume and project
Can you set a difference which triggers model save?
What do you mean?
freezing layers?
can u help me with it
i have an autoencoder
with 5 layers initially trained and now i have frozen the first 3 for transfer learning
and have created new layers as well
hi i am working on a fun project. it is a sort of alexa and it is trained with tensorflow. the text is classified with an NLU model. now i was wondering if there is a character that i can put in my training data that can mean any word? or is this not necessary and do i need to train it with, for instance, city names?
@tacit basin let it be btw knowledge not a bad thing
In practice LeNet is not used much these days. ResNets are more common for image classification.
Like if you set difference level at 1, a difference of 0.5 will not be registered as improvement
I solved it. Apparently I wrote mode="min" to the wrong kernel (two of the same kaggle notebooks were open)
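The log above ("improved from inf") is what a checkpoint in "min" mode looks like: the best value starts at infinity and only a smaller value counts as an improvement, so a rising Dice score never triggers a save after the first epoch. A simplified, framework-free sketch of that comparison (the function names are made up; in Keras the relevant `ModelCheckpoint` arguments are `monitor`, `mode` and `save_best_only`):

```python
def improved(current, best, mode):
    """Simplified version of the comparison a checkpoint callback makes."""
    if mode == "min":
        return current < best
    if mode == "max":
        return current > best
    raise ValueError(f"unknown mode: {mode!r}")


def epochs_saved(history, mode):
    """Return the 1-based epochs at which a 'best only' checkpoint would save."""
    best = float("inf") if mode == "min" else float("-inf")
    saved = []
    for epoch, value in enumerate(history, start=1):
        if improved(value, best, mode):
            best = value
            saved.append(epoch)
    return saved


dice_history = [0.7883, 0.8916, 0.9258]  # dice_metric values from the log above
print(epochs_saved(dice_history, mode="min"))
print(epochs_saved(dice_history, mode="max"))
```

For a metric where higher is better, such as Dice, "max" is the mode that saves on every improvement.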
hello, why is this returning an error?
for x in range(len(y_pred)):
    print(X_test[x], y_test[x], y_pred[x])
I've checked the length of the 3 variables and they are all the same, but somehow the `X_test` and `y_test` throws an error
i would try using zip instead
it would look like this:
for x, y, z in zip(X_test, y_test, y_pred):
    print(x, y, z)
np
Hello,how can i plot the training and validation curve after kfold cross validation??
I am also searching for something similar but haven't found something that works so far. Tell us if you find something that works.
Sci-kit learn example
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve

# regressor, data, target and cv are assumed to be defined beforehand, as in the linked tutorial
max_depth = [1, 5, 10, 15, 20, 25]
train_scores, test_scores = validation_curve(
    regressor, data, target, param_name="max_depth", param_range=max_depth,
    cv=cv, scoring="neg_mean_absolute_error", n_jobs=2)
# the scores are negated errors, so flip the sign back
train_errors, test_errors = -train_scores, -test_scores
plt.plot(max_depth, train_errors.mean(axis=1), label="Training error")
plt.plot(max_depth, test_errors.mean(axis=1), label="Testing error")
plt.legend()
plt.xlabel("Maximum depth of decision tree")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Validation curve for decision tree")
https://inria.github.io/scikit-learn-mooc/python_scripts/cross_validation_validation_curve.html
Thank you so much
You could use binder and voila. Or you could share your notebook with --collaborative flag https://jupyterlab.readthedocs.io/en/stable/user/rtc.html
Jupyterlite is also an option https://jupyterlite.readthedocs.io/en/latest/
do you just want to publish the plots somehow? it's pretty easy to save a plot to PNG and then you can host it on a fileserver somewhere
otherwise there is nbviewer which hosts notebooks in read-only "view" mode https://nbviewer.org/
if you just want to share notebooks as finished products in some on-prem fashion, your options are:
- use nbconvert to convert a notebook to plain html, which can then be shared however you want to share plain html
- nbviewer, which just does (1) automatically as a server application
@tacit basin fyi since you were interested too ☝️
Google colab is an option too. Or deep note...
any idea on how to add new layers to a model in torch?
this delightfully simple example from there
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        x = self.fc(x)
        return x

model = MyModel()
x = torch.randn(1, 10)
print(model(x))
> tensor([[-0.2403, 0.8158]], grad_fn=<ThAddmmBackward>)

# wrap the existing model to append a new layer
model = nn.Sequential(
    model,
    nn.Softmax(1)
)
print(model(x))
> tensor([[0.2581, 0.7419]], grad_fn=<SoftmaxBackward>)
i tried this actually with this because i my neural network is nothing but an autoencoder
and i need to add my encoding function in it and idk how
What's up Python gang, when building a project is it bad to have all our data cleaning, encoding, and data engineering all in one file?
yes, seems like it would be hard to maintain that
well actually.. not if it's small
it's not bad to start that way
actually i take back what i said because i misread
yes, all of your data processing stuff can be in one file
however consider that if the file is really big you might benefit from writing some separate modules
:incoming_envelope: :ok_hand: applied mute to @topaz gale until <t:1646075445:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
I'd highly recommend taking a look at DAG execution tools like https://www.nextflow.io/blog.html, https://airflow.apache.org/, https://www.prefect.io/, or https://dvc.org/. Which one works best for you will depend on your specific needs. They have many benefits beyond encouraging you to adopt the better practice of separating your python code into separate modules
Or ploomber https://ploomber.io/
This is example on how to rewrite one huge notebook into a pipeline
https://ploomber.io/blog/refactor-nb-i/
There are many tools to choose from as usual
https://ploomber.io/blog/survey/
Yep
If you want nice visual DAGs (which can also run Python, but it's its own language too (visual scripting)), then check out Enso: https://enso.org/
it's silly but i think the big problem is that they all do slightly different things and have slightly different benefits
this ploomber article is useful but it's frustrating that they are also clearly (somewhat) biased
what is the best way to train an ai to recognize cities and counties
can you be more specific? are you asking about a model that can identify which city a given image is of?
not really. my problem is that i want to train a model like alexa or siri and i don't really know how to begin with this. i use an nlu model and tensorflow
can you give a specific use case? "recognizing cities and counties" could mean a lot of things
i have a start, but i want it so that when i ask where new amsterdam lies, it will answer "in the usa", so i need to train it to know the sentence and to know that new amsterdam is a city
Yes. Is there a less biased comparison?
you probably don't need to train a model for this. you can start by just getting a list of cities and looking for them!
otherwise you are looking for a general category called "named entity recognition"
so what you're trying to do is called named entity recognition. you want something that can identify geopolitical locations.
and since that's a popular problem, there are lots of off-the-shelf solutions. you can use spaCy.
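The "just get a list of cities and look for them" idea can be sketched in a few lines before reaching for a trained model. The dictionary below is a made-up mini-gazetteer following the example in the chat; a real one would be loaded from a dataset:

```python
# Hypothetical mini-gazetteer mapping lowercase city names to countries.
CITY_TO_COUNTRY = {
    "new amsterdam": "USA",
    "amsterdam": "Netherlands",
    "paris": "France",
}


def find_cities(sentence):
    """Return (city, country) pairs whose names appear in the sentence."""
    text = sentence.lower()
    found = []
    # Check longer names first so "new amsterdam" wins over "amsterdam".
    for city in sorted(CITY_TO_COUNTRY, key=len, reverse=True):
        if city in text:
            found.append((city, CITY_TO_COUNTRY[city]))
            text = text.replace(city, " ")  # avoid matching a shorter name inside it
    return found


print(find_cities("Where is New Amsterdam located?"))
```

Plain substring matching will also fire inside longer words, which is one of the reasons a named entity recognition library becomes attractive as the list grows.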
alright i think i get it thanks.
spaCy is great. 100% recommend
how fast is it
very
Some more model ideas for question answering task https://paperswithcode.com/task/question-answering
yes i did look into this and i have some ideas for how to get the context that is needed for the answer
Python gang I have a question for making a Class for a "Trainer" to train my machine learning model. If I want to add a Scaler, where would I add this within my code? I'm trying to understand how I'd incorporate a scaler into this:
import joblib
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from termcolor import colored  # assumed source of colored()

class Trainer:
    def __init__(self, X, y):
        # X: pandas DataFrame
        # y: pandas Series
        self.X = X
        self.y = y
        self.knn_model = None

    def set_model(self):
        """ defines our model as a class attribute"""
        self.knn_model = KNeighborsClassifier(n_neighbors=10)

    def run(self):
        self.set_model()
        self.knn_model.fit(self.X,self.y)

    def evaluate(self, X_test, y_test):
        # for a classifier, score() is the mean accuracy, not R^2
        r2_test = self.knn_model.score(X_test, y_test)
        y_pred = self.knn_model.predict(X_test)
        # Confusion Matrix
        print(confusion_matrix(y_test, y_pred))
        # Accuracy
        print(accuracy_score(y_test, y_pred))
        # Recall
        print(recall_score(y_test, y_pred, average=None))
        # Precision
        print(precision_score(y_test, y_pred, average=None))

    def save_model(self):
        joblib.dump(self.knn_model, 'model.joblib')
        print(colored("model.joblib saved locally", "green"))
(note: self.knn_model is an instance attribute, not a class attribute)
you would add any transformers as additional instance attributes
alternatively, use a Pipeline and set that as self.knn_model
i think scikit-learn pipelines are very useful
they are useful but I'm being lazy by not incorporating a pipeline haha
it's easier to use them
people say that, but since I've already begun transferring notebooks to packaging I don't want to go back into notebooks
what does a pipeline have to do with a notebook?
I'd have to go back and test my pipeline in notebooks before I move into production
was told to always experiment in notebooks, then go to Visual Studio
seems as if I need to review my classes
true I do find it easier testing on notebooks
i also question the value of this set_model method
maybe you're right with the Pipelines being easier
class Trainer:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.model = None

    def build_model(self):
        return make_pipeline(
            StandardScaler(),
            KNeighborsClassifier(n_neighbors=10),
        )

    def run(self):
        self.model = self.build_model()
        self.model.fit(self.X,self.y)
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
love it.
thank you, although I feel like my code does the same thing yours is easier to understand
Sorry, sometimes I find myself asking questions that only confuse myself... in my notebooks I decided to scale once, after we have our X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3, random_state=0)
how do I incorporate this in my class or do I need to do it in my class?
Is there a way I can calculate my model's gpu memory and ram usage? The methods I've found seem to focus on the usage while training. I want to see how it is while using it
I just use htop for stuff like that, but I'm sure there's a better way
Yeah
you could put this in your run method
Anybody have a rec for a course or website to learn python data science best practices? I'm an experienced data scientist in R and I have familiarity with python but I'm interested in learning the Pydata stack (particularly Dask)
If you're already experienced in data science generally, the tutorials & user guides for the big libraries would probably be the best place to turn IMO. Tensorflow, pytorch, sklearn, pandas, all have extensive and well-written guides/tutorials.
Hi, so I was recently trying to predict stock prices using an LSTM (seems really overdone i know) but when I was predicting on data that is outside the dataset, I get a really weird graph that I do not think is correct, am I doing something wrong?
Prediction: https://github.com/Alpheron/StockPred/blob/master/predictions/MSFT-5-Year-LSTM.ipynb
Training:
https://github.com/Alpheron/StockPred/blob/master/MSFT-5-Year-LSTM.ipynb
In order to process the data, I used a lookback window of 60 points, so when trying to predict on data that is outside of the dataset, I would need the last 60 points as well. Am I doing something wrong with the way I am predicting?
hello! i got redirected from #discord-bots ... was wondering if anyone had a plt.style built for outputting to discord?
I can't help, but you should probably clarify what you mean by "outputting to discord". like in an embed?
I have two lists of States. is it possible to use Pandas to compare the two lists, and create a new series?
I'm going to try set intersection
I'm not being very clear. I want a list of the items from list A that are NOT in list B. Forming list C. or subtracting list B from list A.
yes, sounds like you should use sets and then convert the result back to a Series @pallid bramble
yes but I don't know what you're going to ask. you should always just ask your actual question, not if someone knows about something you haven't asked.
Nope
@stuck storm all you have to do is ask a question that someone can look at and see what the question is. then whoever's available can try to answer it.
thank you. that works great. its the difference method, not intersection
!e you can also use the minus operator.
result = {1, 2, 3, 4} - {1, 4}
print(result)
I find it easier to read.
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
{2, 3}
why do you keep deleting stuff that you say
thank you. i like that better too.
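If the result should stay a pandas Series and keep the order of list A, here is a sketch with `isin` instead of sets. The state names are made-up sample data:

```python
import pandas as pd

list_a = ["Ohio", "Texas", "Utah", "Maine"]  # made-up sample data
list_b = ["Texas", "Maine"]

series_a = pd.Series(list_a)
# Keep the entries of A that do not appear in B; the order of A is preserved.
series_c = series_a[~series_a.isin(list_b)].reset_index(drop=True)
print(series_c.tolist())
```

Unlike the set version, this keeps duplicates in A and does not reorder anything.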
it was better before when you had the function header. but thanks for saying what the variables are. what are you confused by? (by the way, your question is really about neural network theory, not so much numpy. that's another reason why it's better to ask your actual question, not say what you think the topic is.)
def loss_and_gradient(X, A, y):
    """Compute the loss"""
    loss = np.sum((X @ A - y) ** 2) / 2
    #compute the gradient
    grad = X.T @ (X @ A - y) # this is an array
    # return the loss and gradient
    return loss, grad
is this a tuple of (X, A, y)? also, what does it do that is different than what you expected?
Hi guys, I am trying to debug a hand me down code base.
The script ingests an input excel form and generates an output excel form with an added query column, but it isn't working: it keeps complaining that [Column A, Column B, Column C] is not in index.
I added Column A and B to the input excel form, and that cleared the not in index error for those two.
However, Column C, which is not supposed to be in the input excel form, is still flagged as not in index. It is expected to be in the output excel form, but it is not there after the output is generated.
Looking for anyone that can assist with this debugging
not entirely sure. I rewrote it as this:
def loss_and_grad(X, A, y):
    foo = (X @ A) - y
    return (
        np.sum(foo ** 2) / 2,
        X.T @ foo
    )
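One way to convince yourself that the gradient matches the loss is a finite-difference check. This is only a sketch: the random `X`, `A`, `y` and their shapes are assumptions, since the real data was never posted:

```python
import numpy as np

def loss_and_grad(X, A, y):
    residual = X @ A - y
    return np.sum(residual ** 2) / 2, X.T @ residual

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # assumed shape: 5 samples, 3 features
A = rng.normal(size=3)
y = rng.normal(size=5)

loss, grad = loss_and_grad(X, A, y)

# Central finite differences, one coordinate of A at a time.
eps = 1e-6
numeric = np.zeros_like(A)
for i in range(A.size):
    step = np.zeros_like(A)
    step[i] = eps
    loss_plus = loss_and_grad(X, A + step, y)[0]
    loss_minus = loss_and_grad(X, A - step, y)[0]
    numeric[i] = (loss_plus - loss_minus) / (2 * eps)

print(np.allclose(grad, numeric, atol=1e-5))
```

If the analytic and numeric gradients disagree, the bug is in the formula rather than in the training loop.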
please show the code and an example of what data is in the excel. if you can't show the data, make an example that captures what the data is like.
@stuck storm do you not want me to help? I was looking at what you had written right as you deleted it
I need some time to prep those then
for gpu ram and other stats nvidia-smi tool would help. it will show the gpu stats
i haven't read the book or taken the course, but i saw it the other day: https://www.udemy.com/course/clean-machine-learning-code/
Add column C as well to input to debug?
it will flag sql query error and says 'Unknown column' in the list of index (which is not supposed to be there)
I will update with details later on, gotta get some work done in office now
I'm not sure if this is the right channel for questions about clustering, but since it's about AI, here I go. I'm interested in applying agglomerative clustering to something I'm working on, specifically the Ward method, but I can't find useful information on how to fill out the connectivity parameter. I know what it is, but I don't know exactly what I should put in there. What's the format, basically? Is it some np.shape() type thing I'm supposed to pass? I've already looked here: https://scikit-learn.org/0.15/modules/generated/sklearn.cluster.Ward.html AgglomerativeClustering(n_clusters=7,. It doesn't say what the format is.
oke it is not just very fast it is insane
Holaa guys I am working on deepfake detection system I have to submit it within one week can anyone help me with it?
```
Traceback (most recent call last):
  File "D:\college_project\modules\model_train.py", line 21, in <module>
    model.add(MaxPooling2D(pool_size = (2,2)))
  File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\training\tracking\base.py", line 629, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "C:\Users\shubh\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2013, in _create_c_op
    raise ValueError(e.message)
ValueError: Exception encountered when calling layer "max_pooling2d_7" (type MaxPooling2D).
Negative dimension size caused by subtracting 2 from 1 for '{{node max_pooling2d_7/MaxPool}} = MaxPool[T=DT_FLOAT, data_format="NHWC", explicit_paddings=[], ksize=[1, 2, 2, 1], padding="VALID", strides=[1, 2, 2, 1]](Placeholder)' with input shapes: [?,1,1,16].
Call arguments received:
  • inputs=tf.Tensor(shape=(None, 1, 1, 16), dtype=float32)
```
how to fix this error? ping me when replying
my complete code here https://paste.pythondiscord.com/roqehapeze
you're trying to max-pool an image of size 1x1, it seems, which is naturally impossible.
probably something went wrong at an earlier stage, since 1x1 is a weird size to have.
earlier code ```python
model = Sequential()
model.add(Convolution2D(16, 3, 3, input_shape = (32, 32, 3), activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Convolution2D(16, 3, 3, activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))```
Can we plot the accuracy/loss curve for each iteration of an ML algorithm? If yes, can somebody suggest a resource I can refer to?
What ML library are you using? Many of them have a builtin tool for that.
sklearn
Oh, I see. And what kind of model are you training? Is it a neural network or something more classical?
I will be training four models Logistic Regression, SVM, DecisionTrees and Naive Bayes. And I want to see how does each model perform in that dataset.
It looks like it has this tutorial:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
which uses this function, which is supposed to work for every fit-predict capable model :
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve
hii, can u please help me to find the issue in my code ?
What's the size of your images? 32x32, right?
yes
Hi, I have table with properties in first row and object in first column and I must put an X in each intersection of object with its property like this: https://github.com/xflr6/concepts/blob/master/examples/relations.csv, after that I need to do some manipulations like deleting one column , deleting one row etc... my question is what's the best way to do that ? Pandas ?
Thanks
It looks like you convolve with a 3x3 kernel, which reduces the size to 30x30. Then you do a 2x2 maxpool2d, which reduces it to 15x15. Then another conv2d, for 13x13 and another maxpool2d, for 6x6 or so. Hmm. strange, it really should work
oh hold on
@lone drumah, your convolution layers aren't just 3x3, they have a kernel_size of 3 (so 3x3) and a strides of 3 (so 3x3).
yes
so then:
- input: 32x32
- after first convolution: something like 10x10
- after max pooling: 5x5
- after second convolution: 1x1
- and the second max pooling fails, since the input size is too small for a 2x2 pooling
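The sizes above can be checked with a bit of arithmetic. This sketch assumes "valid" padding (the default) and a pooling stride equal to the pool size; `out_size` and `trace` are helper names invented here:

```python
def out_size(n, kernel, stride):
    """Output length along one axis when no padding is used."""
    return (n - kernel) // stride + 1

def trace(n, conv_stride):
    """Sizes after conv, pool, conv, pool for a square n x n input."""
    sizes = [n]
    for _ in range(2):
        n = out_size(n, kernel=3, stride=conv_stride)  # 3x3 convolution
        sizes.append(n)
        n = out_size(n, kernel=2, stride=2)            # 2x2 max pooling
        sizes.append(n)
    return sizes

print(trace(32, conv_stride=3))  # strides of 3, as in the pasted code
print(trace(32, conv_stride=1))  # strides of 1
```

With strides of 3 the last entry drops to 0, which is the "negative dimension" error. With strides of 1 the image is still 6x6 after the second pooling, so writing the kernel as a tuple, `Convolution2D(16, (3, 3), ...)`, is one likely fix.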
can u guide me here how i can fix this ?
Alter your layers in some way so that they don't squeeze the image too much. Can't really recommend how exactly, I haven't worked with CNNs.
I'm not sure how much of a mouthful this is but could someone please explain linear regression and cost functions to me like I'm 5
The linear regression algorithm approximates the observations with a straight line. The line chosen is the one that minimizes a cost function. The cost function could be, for example, the average absolute distance from the actual observations to the fitted line.
I am working on deepfake detection system I have to submit it within one week can anyone help me with it?
Hello, I'm using K-NN (SVM may be) classifier to detect defective part on image.
After training the classifier, I have the confusion matrix, etc.
Now I want to put a label on the original image to show which part is defective or not.
How can I retrieve data after training the data ?
So is linear regression drawing a line of best fit through a set of data? And cost function is drawing lines from your data points and recording the distance from those points to the line of best fit?
I'm struggling to understand how you would calculate all of that, and apply that to an IRL example
what's the best beginner course for ML?
That's exactly what it is.
Just draw a set of points. Then draw a straight line. Then measure distance from each point to the line. Add absolute values of all distances together and divide by number of points. You will get one value. Which you want to minimize. So you draw a new line and calculate again.
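The same recipe as code, in case that helps. The points are made up, and "distance" is taken to mean the vertical distance from each point to the line, which is an assumption about what was meant:

```python
points = [(1, 2), (2, 4), (3, 5), (4, 8)]  # made-up (x, y) observations

def mean_abs_error(slope, intercept):
    """Average vertical distance between the points and y = slope * x + intercept."""
    total = sum(abs(y - (slope * x + intercept)) for x, y in points)
    return total / len(points)

print(mean_abs_error(0, 0))  # flat line y = 0
print(mean_abs_error(2, 0))  # line y = 2x
```

The second line gets a much smaller value, so it is the better fit of the two. Fitting means searching for the slope and intercept that make this number as small as possible.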
Where would I draw the line?
def hello():
    return 'Goodbye, Mars!'
need helps please
help how to print "hello world!"
@tranquil oak can try this article https://medium.com/@alexm5492/linear-regression-from-scratch-3-methods-2e803d82137c
I heard the basics in ML are very large, so a course that would cover them would be wonderful. thanks for the article, very useful as well!
Yeah i agree, this article I found to be helpful as it really is targeted for beginners and explains all the math concepts easily with examples and data science terms
But there are many other articles out there that are also great for linear regression
Hi
Actually this seems like a good starting point for me as well
can anyone recommend a good, math-heavy book on AI and data science?
there's "the deep learning book" https://www.deeplearningbook.org/
Thank you for your solutions. A slightly different question: I am not familiar with interactive notebooks, e.g. using ipywidgets, but do we know if such a notebook could somehow be packaged into an application? Has anyone had any success doing that with pyinstaller?
I am just thinking of giving my users a set-solution for them to browse and tinker with their data
@maiden shore yeah, so draw any random line at first
It's the same when implementing linear regression: we initialize the parameters (weights and bias) to random values or 0
So initially the line will be flat if everything is initialized to 0
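A small sketch of that process: start from the flat line and nudge it downhill step by step. It uses squared error instead of absolute distance because the gradient is simpler; the data, learning rate, and step count are all made up:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data lying exactly on y = 2x + 1

w, b = 0.0, 0.0            # start with the flat line y = 0
learning_rate = 0.05
n = len(xs)

for _ in range(5000):
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2 * sum(errors) / n
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

After enough steps the slope and intercept settle very close to 2 and 1, the line the data was generated from.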
Hi, I need to add data from 3 CSV files to an Excel file as it is, but the catch is that all the files have a series of dates, like 5th feb to 1st mar, and all rows with the same date need to be in the same section of the master file. e.g. all the 3rd feb rows will be consecutive, then all the 4th feb rows are added.
Should I copy and paste and then sort by date only? Will that work, or is there another way? Should I use pandas or openpyxl? Pls help me guys, I'm a noob in this field.
We can figure this out; I just need to understand what the data looks like. can you show examples of the three CSVs as text? use our pastebin: paste.pythondiscord.com
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
the steps will go something like this:
- open all three CSVs as DataFrames (with pandas)
- Normalize the date representation for each DataFrame
- Concatenate all the DataFrames into one
- Sort the rows by date
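The steps above can be sketched roughly like this. The three CSV strings, the column names `date` and `value`, and the date formats are all made up, since the real files haven't been shared yet:

```python
import io
import pandas as pd

# Made-up stand-ins for the three CSV files, each with its own date format.
csv_a = "date,value\n2022-02-05,10\n2022-02-03,11\n"
csv_b = "date,value\n03/02/2022,20\n04/02/2022,21\n"
csv_c = "date,value\n2022.02.05,30\n2022.02.04,31\n"

def load(text, date_format):
    df = pd.read_csv(io.StringIO(text))
    # Normalize the date column to real datetime values.
    df["date"] = pd.to_datetime(df["date"], format=date_format)
    return df

frames = [
    load(csv_a, "%Y-%m-%d"),
    load(csv_b, "%d/%m/%Y"),
    load(csv_c, "%Y.%m.%d"),
]

combined = (
    pd.concat(frames, ignore_index=True)
    .sort_values("date", kind="mergesort")  # stable: keeps file order within a date
    .reset_index(drop=True)
)
print(combined)
# combined.to_excel("master.xlsx", index=False)  # writing .xlsx needs openpyxl
```

With real files, `pd.read_csv("some_file.csv")` replaces the `io.StringIO` part.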
I'm looking for general relativity levels of math heavy
the book is only about the math. there is no code.
Wait let me share
This is what the data in the csv files looks like. I have to paste it in so that all the same dates are together, then apply a formula to the new column.
Please re-read my instructions for how to share the data. I will not look at screenshots.
part of the problem w/ interactive notebooks is that both the notebook frontend rendering engine (jupyter notebook, jupyterlab, nbconvert, etc.) and the kernel running the notebook both need to have the widget-related stuff installed. so you will need to include both the notebook frontend as well as the notebook kernel in your pyinstaller distribution
in addition to The Deep Learning Book, you can look into The Hundred Page Machine Learning Book by Burkov and Probabilistic Machine Learning by Murphy
the latter has a new 2022 edition which is free to read (as a draft) online
the 2012 version is also free to read online, and is more polished, but might seem a bit out of date
Burkov: http://themlbook.com/
Murphy: https://probml.github.io/pml-book/book1.html
what version of numpy runs with tf
hi all, i have a bunch of numbers that i want to average, atm i do it like this:
G = countsdataG1.at['G']
RG = (countsdataG1.at['RG']+countsdataR1.at['RG'])/2
RGFr = (countsdataG1.at['RGFr']+countsdataR1.at['RGFr']+countsdataFr1.at['RGFr'])/3
GFr = (countsdataG1.at['GFr']+countsdataFr1.at['GFr'])/2
R = countsdataR1.at['R']
Fr = countsdataFr1.at['Fr']
RFr = (countsdataR1.at['RFr']+countsdataFr1.at['RFr'])/2
sometimes one of the values will = 0
in that case i want to take one of the numbers only and ignore the 0
there isn't a specific version equivalence, any recent version of numpy should be okay
if both are 0 i want to return 0
don't overthink it. these are scalar values, i.e. just plain numbers, because you are using at. so just use if and all the other usual python stuff
it's easy to get so lost in all the pandas stuff that you forget to use the basic tools that are part of python!
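For example, a plain-Python helper along these lines would cover all three cases. `mean_of_nonzero` is just a name made up here:

```python
def mean_of_nonzero(*values):
    """Average the values that are not 0; return 0 if every value is 0."""
    nonzero = [v for v in values if v != 0]
    if not nonzero:
        return 0
    return sum(nonzero) / len(nonzero)

print(mean_of_nonzero(4, 6))     # both values present
print(mean_of_nonzero(0, 6))     # the 0 is ignored
print(mean_of_nonzero(0, 0, 0))  # everything is 0
```

It would then be used as `RG = mean_of_nonzero(countsdataG1.at['RG'], countsdataR1.at['RG'])`, and the same way for the three-value case.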
Thanks for the response. sorry, can you elaborate on that?
oh so it is like
if x > 0: type thing?
what does it mean if TypeError: 'NoneType' object is not callable
"callable" means "use it like a function", e.g. input is the name of a function and input() is calling the input function. so this error means you tried to use something like a function, but the thing is None (which has type NoneType), and of course None cannot be "called" like a function
usually this is a mistake in your program logic somewhere
perhaps you overwrote the name of a function (e.g. sum = None; sum([1,2,3]))
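Another common way to end up there is a function that forgets to return something. A made-up example (not your code, which we haven't seen):

```python
def make_greeter(name):
    def greet():
        return "hello " + name
    # Bug on purpose: the inner function is never returned,
    # so make_greeter() returns None.

greeter = make_greeter("world")

try:
    greeter()  # same as calling None()
except TypeError as exc:
    message = str(exc)
    print(message)
```

The fix in this toy case is to add `return greet` at the end of `make_greeter`.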
you should show your code and the full error traceback
it's unlikely that this is a datascience-specific problem
```
validation_split=0.2, shuffle=True, verbose=0, batch_size=batch_size, epochs=200)
```
i executed this code, but says the error came from the mat_X_train mat_y_train line
Anyone here knows how to use R script?
instead of being sorry for the ping, just don't ping. I am busy.
in general it's a good idea not to "ask to ask" like this. you can ask if something is on topic or not, but "does anyone know how to X" is not a good way to get help, because it forces people to "interview" you before getting to any useful work
Fricking
never forced anyone but ok lol i just had a question its not that deep
it's just a general principle about asking things online
ok?
@real flame salt rock lamp is giving you advice to help you get help in any online context. he is helping you, not criticizing you.
you do realize I asked a question on a python discord server to see if anyone could help, again it’s not that deep and you don’t have to be here defending them
jeez stop attacking me I asked one question and I don’t need anyone’s advice I never asked for it nor do I care
No one is attacking you. We answer a lot of questions in this server, so we're giving you suggestions that will increase the likelihood that you will receive help in the future.
‘It forces people to interview you’ where is the correlation between the question I asked and this response
I appreciate your help, but it was never my intention make it seem that way
because someone would have to interview you about what your R question is before they could attempt to answer it. people usually don't want to do that--they want to see a question and start answering it.
I wanted to see if anyone knew R in the first place since this isn’t a channel exactly for it, nothing wrong with seeing if someone understood the subject
nothing wrong with seeing if someone understood the subject
the best way for someone to know if they understand the subject matter of your question is to see the actual question.
how much R do they have to know to be able to help? they'll never know until the question is there.
okay then whatever this was should’ve been worded properly
Even though R is out-of-scope, if you decide to ask your question, I'll allow it as a gesture of goodwill.
because I had no idea what they meant and it sounded like I meant something way different than what it actually was, don’t assume I was forcing anyone to interview me
they could have said this
anyways I understand what you meant, next time I’ll ask the question instead of asking who knows how to help
I was told that I should join this community if I want to get into actual python
that was probably a decent suggestion, yes. are you trying to get into data science specifically?
Is there a channel for creating ui's?
you can make UIs with python, yes
Nvmd found it
where was it?
that's not the UI channel; #user-interfaces is
Oh okay
be sure to read the channel description before using a channel for the first time
Am I allowed to post a reddit post here that I made instead of rewriting my question out?
yes, but you should probably restate enough of the question in the chat to attract those who know about the question.
I'm having a problem running code from the machine learning course on freecodecamp. I am getting this when I run it: Process finished with exit code -1073740791 (0xC0000409). I'm just wondering if this is because my computer can't handle the bigger data or if I am doing something wrong. I explained it a little better here https://www.reddit.com/r/MLQuestions/comments/t3vsj7/code_never_runs_and_getting_similar_error_message/
I see thanks. I am already looking for non-jupyter solutions 
i think the additional problem is that these widgets only work with jupyter
maybe you can use nbconvert to convert the notebook to plain html with the widgets intact
please help what is a tensor?
noted thanks
a generalization of an array. An individual number, like 5 is a 0-dimensional tensor. [7, 9, 10] is a one-dimensional tensor of shape (3,). [[4, 9, 7], [1, 0, 6]] is a two-dimensional tensor of shape (2, 3). and so on.
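The same three examples in numpy, if you want to poke at them. numpy calls them arrays, but the idea of dimensions and shape is the same:

```python
import numpy as np

scalar = np.array(5)                        # 0-dimensional
vector = np.array([7, 9, 10])               # 1-dimensional, shape (3,)
matrix = np.array([[4, 9, 7], [1, 0, 6]])   # 2-dimensional, shape (2, 3)

for t in (scalar, vector, matrix):
    print(t.ndim, t.shape)
```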
What do the numbers represent?
the numbers in my example are just random, for the example
it ran for me without error
Hey @shut trail!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
Interesting. So it is my computer that just can't handle it? Is there some way I can improve performance?
Feels like a big task tbh I don't know this stuff 😛 I just want to write code Dx
you could be missing pyqt
https://github.com/tensorflow/tensorflow/issues/47934
might be your issue (found it by your exit code)
i have the following table
its per company per month
how can i create a 'last month sales' column?
@graceful glacier try using a HAVING block in your query
what are you using to manipulate the table? SQL, Pandas, or something else?
also, what is "last month sales"?
is it the sum of each sales value grouped by company and calendar month?
are March 2022 and March 2021 the same month?
yea sure, im using pandas and last month sales would indicate the sales from the month previous in the current row. the sales are summed already. and this particular dataset has only two months in it for the same year(2020)
so the last month sales would look something like this
please do print(df.sample(10).to_dict('list')) and copy and paste the text (no screenshots) into this chat as text.
i can print the whole table, its not big
you can do print(df.to_dict('list')) if you want.
{'Company': ['British Soaps', 'British Soaps', 'Chin & Beard Suds Co', 'Chin & Beard Suds Co', 'Soap and Splendour', 'Soap and Splendour', 'Squeaky Cleanies', 'Squeaky Cleanies', 'Sudsie Malone', 'Sudsie Malone'], 'Date': [Timestamp('2020-03-01 00:00:00'), Timestamp('2020-04-01 00:00:00'), Timestamp('2020-03-01 00:00:00'), Timestamp('2020-04-01 00:00:00'), Timestamp('2020-03-01 00:00:00'), Timestamp('2020-04-01 00:00:00'), Timestamp('2020-03-01 00:00:00'), Timestamp('2020-04-01 00:00:00'), Timestamp('2020-03-01 00:00:00'), Timestamp('2020-04-01 00:00:00')], 'Sales': [671772175.7995872, 687222935.8429785, 483505038.56760347, 508107776.28321755, 1896382984.9812155, 1933258732.721503, 1790308563.639309, 1818003538.774453, 1747533597.012284, 1794467930.657847]}
thanks, let me see
df.groupby('Company')['Sales'].diff()
Out[8]:
0 NaN
1 1.545076e+07
2 NaN
3 2.460274e+07
4 NaN
5 3.687575e+07
6 NaN
7 2.769498e+07
8 NaN
9 4.693433e+07
Name: Sales, dtype: float64
Here's one way to do it
this works because there's only one row per (company, month) and they're already in chronological order.
if one company had more than one row for a given month, you'd have to aggregate them to get unique (company, month) rows.
also this is the change in sales. I think that's what you meant
yes this would be change in sales
and thanks for the solution
i think another solution might also be a rolling sum of 2 rows subtracted by the value in the sales column
similar solution:
In [12]: df.set_index(['Company', 'Date']).groupby(level=0).diff()
Out[12]:
                                        Sales
Company              Date
British Soaps        2020-03-01           NaN
                     2020-04-01  1.545076e+07
Chin & Beard Suds Co 2020-03-01           NaN
                     2020-04-01  2.460274e+07
Soap and Splendour   2020-03-01           NaN
                     2020-04-01  3.687575e+07
Squeaky Cleanies     2020-03-01           NaN
                     2020-04-01  2.769498e+07
Sudsie Malone        2020-03-01           NaN
                     2020-04-01  4.693433e+07
sounds overly complicated when diff exists
one of the most important considerations when you write data science code is that people can follow along with what you wrote.
oh wait i just realized i dont want change in sales but rather the previous months sales
so sort of the step before you get change in sales
sounds like you're just adding redundancy
In [13]: df.set_index(['Company', 'Date']).groupby(level=0).shift()
Out[13]:
                                        Sales
Company              Date
British Soaps        2020-03-01           NaN
                     2020-04-01  6.717722e+08
Chin & Beard Suds Co 2020-03-01           NaN
                     2020-04-01  4.835050e+08
Soap and Splendour   2020-03-01           NaN
                     2020-04-01  1.896383e+09
Squeaky Cleanies     2020-03-01           NaN
                     2020-04-01  1.790309e+09
Sudsie Malone        2020-03-01           NaN
                     2020-04-01  1.747534e+09
In [14]: df.set_index(['Company', 'Date']).groupby(level=0).shift().fillna(0)
Out[14]:
                                        Sales
Company              Date
British Soaps        2020-03-01  0.000000e+00
                     2020-04-01  6.717722e+08
Chin & Beard Suds Co 2020-03-01  0.000000e+00
                     2020-04-01  4.835050e+08
Soap and Splendour   2020-03-01  0.000000e+00
                     2020-04-01  1.896383e+09
Squeaky Cleanies     2020-03-01  0.000000e+00
                     2020-04-01  1.790309e+09
Sudsie Malone        2020-03-01  0.000000e+00
                     2020-04-01  1.747534e+09
if you must
In [15]: df.groupby('Company')['Sales'].shift()
Out[15]:
0 NaN
1 6.717722e+08
2 NaN
3 4.835050e+08
4 NaN
5 1.896383e+09
6 NaN
7 1.790309e+09
8 NaN
9 1.747534e+09
Name: Sales, dtype: float64
this one is better if you want to attach it to the original df, since it's indexed the same way.
ok great, i dont know why i didnt think of this before when i was playing around with shift
🇸 🇭 🇮 🇫 🇹
wrote it out as such
df_output_1 = df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']\
    .sum()\
    .assign(Last_Month_Sales =
        lambda dfx : dfx.groupby('Company')['Sales'].shift())
is this supposed to do the same thing as what I wrote
yess, it does
😂 the first two lines are to get the table into the form i screenshotted
df_output_1 = (
    df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']
    .sum()
    .assign(
        Last_Month_Sales=lambda dfx: dfx.groupby('Company')['Sales'].shift()
    )
)
what about this
Sorry, can you explain what you mean by get cpu working? Also I checked and I have cuda 11.0 and in the github thread confusedreptile posted, one of the comments mentioned it not working with 11.0 but he got it with 10.1 so I guess I just need to go to a lower version. I remember when I was first doing this I had a reason I was going for 11.0 but I really can't remember now... I will try this fix in a bit 🙂 thanks for the help.
if you're doing something that would benefit from CUDA, you might be able to do limited prototyping on the CPU.
whether or not you should solve the CUDA issues first is up to you, I guess.
Ya I'm sure I don't even need my gpu involved right now but I jumped the gun when I started and wanted to do it the best way :/ Maybe I am better off forgetting all the extras for now while I learn the basics.
what are you trying to do? GPU only helps if you're doing deep learning.
following this https://www.freecodecamp.org/learn/machine-learning-with-python/ to learn
I would like to model global warming from temperature records recorded daily from June 1920 to October 2019 in Montélimar on Python. For this, I would like to model these seasonal variations with a sinusoidal fit. However, such a model fitted to the data set will not give any average temperature increase. I therefore try to apply a sinusoidal fit for each decade.
I first plotted the data from the data file and then I wanted to create a time variable to be able to do my ten-year average.
I would like to apply the sine fit for all the decades in the data file (not just the 1950s) and then plot the entire graph with the fit. However I have no idea how to do this in code. Does anyone have any suggestions?
As the code I wrote so far is quite long, I put it here: https://paste.pythondiscord.com/holaxaxawa
Here is my data file (it is normally in .dat format but I converted it to CSV so I could send it) :
This is an example of model I must have
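One possible approach, sketched on synthetic data because the real file isn't reproduced here: fit a constant plus a one-year sine and cosine to each decade with linear least squares, then compare the fitted means across decades. The period is fixed at one year and the helper `fit_sine` is a name made up here; this swaps in `np.linalg.lstsq` rather than a nonlinear fitter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real file: 30 years of daily values with a
# yearly cycle, a slow warming trend, and noise. Not real measurements.
t = np.arange(30 * 365) / 365.0  # time in years
temp = 12 + 0.03 * t + 8 * np.sin(2 * np.pi * t) + rng.normal(0, 2, t.size)

def fit_sine(t, y):
    """Least-squares fit of y ~ mean + a*sin(2*pi*t) + b*cos(2*pi*t)."""
    design = np.column_stack(
        [np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)]
    )
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

decade = (t // 10).astype(int)
means = []
for d in np.unique(decade):
    mask = decade == d
    mean, a, b = fit_sine(t[mask], temp[mask])
    means.append(mean)
    print(f"decade {d}: mean={mean:.2f}, amplitude={np.hypot(a, b):.2f}")
```

The fitted mean per decade is what carries the warming signal; plotting `mean + a*sin + b*cos` over each decade's time range gives the piecewise curve over the data.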
Did anyone here did a project on stock trend prediction?
hello
is it possible to .merge() a df to itself during chaining?
so somthing like this
df_output_1 = df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']\
    .sum()\
    .assign(Last_Month_Sales =
        lambda dfx : dfx.groupby('Company')['Sales'].shift())\
    .merge(...)\
you might be able to do some walrus operator fuckery
i ask bc the .assign() method is able to take the most recent version of the df
also you didn't use my refactor from before.
i dont know what that is but ill assume its not conventional lol
do you know the difference between a statement and an expression?
not off the top of my head
i remember what walrus operators are now, it came with one of the most recent updates of python. didnt know it was used with pandas
an expression is something that evaluates to another value, like df_sales.groupby(['Company', 'Date'], as_index=False). in this expression, the ['Company', 'Date'] is also an expression on its own.
but a statement is not an expression. df_output_1 = df_sales.groupby(['Company', 'Date'], as_index=False) is an assignment statement. even though df_sales.groupby(['Company', 'Date'], as_index=False) by itself is an expression, when you have the df_output_1 = part, that part isn't an expression.
you can't do (a = 1 + 1) + 3 to assign a the value of 2, and then use it as an expression
except you can, if you put a : in front of the =
🤯
so youve taken a statement and used it in an expression simultaneously
correct?
the walrus operator lets you assign a variable in the middle of an expression
but then again this stands, so one wouldnt want to use walrus expressions for the sake of readability right
!e
x = (y := 2 + 3) + 4
print(y)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
5
yes, that's why I called it walrus operator fuckery 😄
but there are times where it saves you from having to do an expensive function call twice
and there are other times that it can make your code more elegant
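A minimal sketch of the "expensive call twice" case, assuming a made-up helper named `expensive` (it is hypothetical and only stands in for something slow). Without the walrus, the comprehension would have to call it once to filter and once more to produce the value:

```python
import math

def expensive(x):
    # Hypothetical stand-in for a costly computation.
    return math.sqrt(x) * 10

values = [1, 4, 9, 16]

# (y := expensive(v)) binds the result inside the condition,
# so the same value can be reused as the output element.
results = [y for v in values if (y := expensive(v)) > 25]
print(results)
```

This needs Python 3.8 or newer, which is when assignment expressions were added.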
i see, me personally im just trying to chain assign as much as i can
⛓️
Is there someone to steer me please?
yeah this is no problem. As long as you're not using inplace=True, every one of those operations returns a new dataframe. AFAIK python just hides that reality from you, but you can interrupt chains like this with a step debugger and play around with mid-chain objects in my experience
sorry your question hasn't been answered yet. it looks good--unfortunately I can't dive into it rn
other minor recommendation, if you nest everything after the = inside a parentheses pair, you don't need a \ at the end of every line @graceful glacier
ahh ok then this makes sense then
^
No problem!
hmm so we CAN self-join during assignment? bc the following isnt working
df_output_1 = (
df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']\
.sum()\
.assign(Last_Month_Sales = lambda dfx : dfx.groupby('Company')['Sales'].shift())\
.merge(df_output_1)
)
okay so since you know a sinusoidal fit won't work, is not the simplest option to add a linear term to the equation you are fitting? Is there a scientific reason you are fitting a particular simple equation (like a sinusoid) at all?
lol dude not trying to throw shade but your use of parens and escapes is gonna make me sick
yea im working on it lol
I want to do a sinusoidal fit because by plotting the data I notice that it has the shape of a sinusoid
df_output_1 = (
df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']
.sum()
.assign(Last_Month_Sales = lambda dfx : dfx.groupby('Company')['Sales'].shift())
.merge(df_output_1)
)
is better
Here is the graph I get when I plot over a short period
O i see the question - I misunderstood. Why not just do this? trying to do some code-golf or something?
df_output_1 = (
df_sales.groupby(['Company', 'Date'], as_index=False)['Sales']
.sum()
.assign(Last_Month_Sales = lambda dfx : dfx.groupby('Company')['Sales'].shift())
)
df_output_1 = df_output_1.merge(df_output_1)
yea i could have done this, i was curious if it was possible during chain assignment
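For the chaining question: `.merge(df_output_1)` inside the chain cannot work, because the name `df_output_1` is only bound after the whole right-hand side has been evaluated. One possible way to stay in a single chain is `DataFrame.pipe`, which passes the intermediate frame into a function. The sketch below uses a small made-up `df_sales`; the `on="Company"` key and the `_other` suffix are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy data standing in for the real df_sales
df_sales = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B"],
    "Date": ["2021-01", "2021-01", "2021-02", "2021-01", "2021-02"],
    "Sales": [10, 5, 20, 7, 9],
})

df_output_1 = (
    df_sales.groupby(["Company", "Date"], as_index=False)["Sales"]
    .sum()
    .assign(Last_Month_Sales=lambda dfx: dfx.groupby("Company")["Sales"].shift())
    # pipe hands the mid-chain frame to the lambda, so it can be merged with itself
    .pipe(lambda dfx: dfx.merge(dfx, on="Company", suffixes=("", "_other")))
)
print(df_output_1)
```

Whether this reads better than the two-statement version is a matter of taste.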
indeed. Though as you have already noted, it is evident in the data that over long periods it does not look sinusoidal. So you could make your function more complex, such as y = a*sin(m1*x) + m2*x + b
I don't know if it would make much sense to do that. I would like to plot the average temperatures as a function of time and get something like this
applying a sinusoidal fit to each decade separately would be unable to predict temperatures in any future decade, so that wouldn't be very useful as a predictive model
No, I'm not trying to predict the temperatures of the next decade, but to see if temperature measurements taken over the last century show significant warming
anyway, as to how to fit a sine wave to your data, the function you're looking for is probably in scipy.optimize. See this very thorough answer on stackoverflow https://stackoverflow.com/a/42322656
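A rough sketch of fitting a seasonal term plus a linear trend in one go with `scipy.optimize.curve_fit`. Everything here is an assumption for illustration: the data is synthetic (not the Montélimar record), time is measured in years, the seasonal period is fixed at exactly one year, and the sine is written as a sin/cos pair so the model stays linear in its parameters.

```python
import numpy as np
from scipy.optimize import curve_fit

def seasonal_with_trend(t, a, b, slope, offset):
    # t is in years; the seasonal period is assumed to be exactly one year
    return a * np.sin(2 * np.pi * t) + b * np.cos(2 * np.pi * t) + slope * t + offset

# Synthetic daily data standing in for the real temperature record
rng = np.random.default_rng(0)
t = np.arange(0, 100, 1 / 365.25)
temps = seasonal_with_trend(t, 6.0, -5.0, 0.02, 12.0) + rng.normal(0, 2.0, t.size)

params, cov = curve_fit(seasonal_with_trend, t, temps, p0=(1.0, 1.0, 0.0, temps.mean()))
a, b, slope, offset = params
slope_err = np.sqrt(cov[2, 2])  # 1-sigma uncertainty, assuming independent noise

print(f"seasonal amplitude = {np.hypot(a, b):.2f}")
print(f"slope = {slope:.4f} +/- {slope_err:.4f} per year")
```

Comparing `slope` against `slope_err` gives a first impression of whether the trend is more than a fluctuation, though real daily temperatures are autocorrelated, so the reported error is likely too optimistic.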
O. What does a sine wave do for you to prove that a simple mean and standard deviation do not?
I do use scipy.optimize in my code but I would like to do it on the average temperatures of each decade.
I just want to check that the temperature increase does not correspond to a statistical fluctuation
@thick acorn you should consider decomposing this time series into a "trend" and "seasonal" component
it sounds like your technique is to break up the time series into 10-year chunks and try to fit a separate function in every chunk
seems crude but you should definitely be able to see the trend that way
that said, you could get a similar effect by just fitting a straight line!
you'll see visually that it goes up over time, in a way that is much bigger than random fluctuations
This is exactly what I want to do
my suggestion for how to do this with code: put the data into a pandas series, use date/time indexing to "step through" the series 10 years at a time, and then just loop
I planned to do a linear regression as a second step. The goal is to observe both effects
however i will warn you that it won't look as good as the "expected" output you showed
wouldn't you want to do the linear regression first, then subtract off the trend that you estimated in the regression model?
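A small sketch of that ordering, assuming synthetic data and a plain straight-line fit via `np.polyfit`: estimate the trend first, subtract it, and only then look at the seasonal shape in what is left.

```python
import numpy as np

# Synthetic data: linear trend + yearly cycle + noise (not the real record)
rng = np.random.default_rng(2)
t = np.arange(0, 100, 1 / 365.25)  # time in years
temps = 12 + 0.02 * t + 8 * np.sin(2 * np.pi * t) + rng.normal(0, 1.0, t.size)

# Step 1: linear regression for the long-term trend
slope, intercept = np.polyfit(t, temps, 1)

# Step 2: remove the trend; the residual holds the seasonal cycle and noise
detrended = temps - (slope * t + intercept)

print(f"estimated slope = {slope:.4f} per year")
print(f"mean of detrended series = {detrended.mean():.2e}")
```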
Would it be easier to use numpy.timedelta64?
not really imo
you have to do a lot more "bookkeeping" of indexes if you use plain numpy for this
Yes I know, it was just an example
pandas makes it pretty easy to just select a date range
you can even use .resample and iterate over the result
I prefer to plot the sinusoidal model first because it shows the temperature variations better
so go forth and produce statistics, but mathematically the approach you described is just a discrete version of fitting the continuous function y = m*x + b + c*sin(d*x)
plotting is one thing. but mathematically it makes no sense to compute the sinusoidal part first
very good point
anyway this is a tidy way to iterate over 10 year chunks in a pandas series that has a datetime index:
for start, chunk_10year in my_series.resample('10YS'):
    # iterating a resampler yields (bin label, sub-series) pairs
    ...
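A fuller sketch of that loop, assuming a synthetic daily series in place of the real data. Iterating over a resampler yields `(label, chunk)` pairs, where the label is the start of the bin. Here each chunk is reduced to its mean, but a per-decade fit could go in the same place.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily temperature record
idx = pd.date_range("1920-01-01", "2019-12-31", freq="D")
years = np.asarray((idx - idx[0]).days) / 365.25
rng = np.random.default_rng(1)
my_series = pd.Series(
    12 + 0.02 * years + 8 * np.sin(2 * np.pi * years) + rng.normal(0, 1.0, len(idx)),
    index=idx,
)

decade_means = {}
for start, chunk_10year in my_series.resample("10YS"):
    # start is the bin label (a Timestamp); chunk_10year is the Series for that bin
    decade_means[start.year] = chunk_10year.mean()

for year, mean in decade_means.items():
    print(year, round(mean, 2))
```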
I have never used resample but I will look into it
further reading
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook-resample
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html#pandas.Series.resample
This is an interesting point of view, thank you for this suggestion
Okay thanks a lot
What is the variable m supposed to represent in this function?
the slope of the trend line
Okay thank you
hey ,so i tried to import pyPDF2 and it always produces an error
even tho , i have it installed
(using linux btw -ubuntu) so i tried to reinstall it and it didn't work
any ideas ?
What error do you get?
module error
How do you import it?
