#data-science-and-ml
Also is that sympy? I'm not familiar with sympy.
it's not supposed to be sympy, i think i passed it as numpy with lambdify
Oh I mean I'm not familiar with Sympy syntax.
ahh
What's lambdify supposed to do?
sorry haha
it does something like lambda (it converts the expression to an anonymous function i think) and evaluates it with numpy
I would double check there if lambdify is returning what you're expecting it to.
I'm googling the error right now and someone is having similar problem too. Where they mix sympy with numpy.
But to be more specific, I think your error might be coming from np.linalg.inv.
You should print Jr and fi's dtypes.
And check if they're compatible with np.linalg.inv.
f=
[5*(x + 1)**2 + (y + 1)**2 - 25, 2.71828182845905**x - y]
J=
[[10*x + 10 2*y + 2]
[1.0*2.71828182845905**x -1]]
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
they are both arrays
how can i see that?
You can check them via Jr.dtype and fi.dtype.
It's dtype for NumPy arrays; dtypes is the pandas DataFrame attribute.
yes
and you're passing in matrix that haven't been evaluated yet.
So it's trying to do inverse on functions instead of matrix with numerical values.
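A minimal sketch of the dtype check suggested above. It assumes the problem array still holds unevaluated objects; that is mimicked here with a plain object-dtype array of strings rather than real SymPy expressions, so the values are illustrative only.

```python
import numpy as np

# Stand-in for an array of unevaluated expressions (assumption: the real
# array would hold SymPy objects; strings avoid the extra dependency).
unevaluated = np.array([["10*x + 10", "2*y + 2"], ["exp(x)", "-1"]], dtype=object)
evaluated = np.array([[10.0, 6.0], [1.0, -1.0]])

print(unevaluated.dtype)  # object
print(evaluated.dtype)    # float64


def try_inverse(matrix):
    """Return the inverse, or the error text if NumPy cannot invert it."""
    try:
        return np.linalg.inv(matrix)
    except TypeError as err:
        return f"TypeError: {err}"


inverse_ok = try_inverse(evaluated)
inverse_bad = try_inverse(unevaluated)
print(inverse_bad)
```

Both arrays report `<class 'numpy.ndarray'>` from `type()`, so the class alone does not show the difference; the `dtype` does.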
@plush zenith
If this is what f and J are, then I'm assuming the x and y aren't evaluated.
yes i understand that
but i passed the data
and when i write the J outside the loop
it shows a numerical result
the fi doesnt
and dont understand why
I think this is where your investigation might have to start. hahaha
sorry fi too appears as a numerical
Run this in your loop.
sure!!
for i in range(20):
    Jr=np.array(J(v[0], v[1]))
    fi=np.array(f(v[0],v[1]))
    print(Jr)
    print(fi)
    break
Print it on first iteration and see if it's evaluated.
Hmm that's interesting.
I wonder if it breaks somewhere.
I guess add the inverse operation to that?
the J_inv
run the first iteration where it calculates Jr, fi, and J_inv.
Wait what module object is not callable?
TypeError Traceback (most recent call last)
<ipython-input-36-3485713534db> in <module>
37
38 #print(v0- np.linalg.inv(np.array(J(v0[0],v0[1])) @ np.array(f(v0[0],v0[1]))))
---> 39 print(sistema_newton(funcion1, funcion2,[0,2] ))
<ipython-input-36-3485713534db> in sistema_newton(funcion1, funcion2, v0)
16 Jr=np.array(J(v[0], v[1]))
17 fi=np.array(f(v[0],v[1]))
---> 18 J_inv= np.linalg(Jr)
19 print(Jr)
20 print(fi)
TypeError: 'module' object is not callable
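For reference, a small sketch of what that traceback is saying: `np.linalg` is a submodule, and the inverse is the function `np.linalg.inv` inside it. The matrix values below are picked for illustration.

```python
import types

import numpy as np

# np.linalg is a module, so np.linalg(Jr) raises
# "TypeError: 'module' object is not callable".
print(isinstance(np.linalg, types.ModuleType))  # True
print(callable(np.linalg.inv))                  # True

Jr = np.array([[10.0, 6.0], [1.0, -1.0]])  # example values for illustration
J_inv = np.linalg.inv(Jr)
print(J_inv @ Jr)  # approximately the 2x2 identity
```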
i dont understand
it was passed as an array and it was evaluated
You might be tired.
by now it works
Okay so it works on first iteration?
If that's the case then do this.
J=sp.lambdify([x, y],[dp1,dp2], "numpy")
f=sp.lambdify([x, y],[dp1,dp2], "numpy")
v = v0
print(v)
for i in range(20):
    print(f'on {i} it')
    Jr=np.array(J(v[0], v[1]))
    fi=np.array(f(v[0],v[1]))
    J_inv=np.linalg.inv(Jr)
    #print(J_inv)
    print("")
    v = v - J_inv @ fi
    print("v")
    print(v)
    print("")
return
So you can see on which iteration it breaks.
on 1 it
the first one
TypeError: No loop matching the specified signature and casting was found for ufunc inv
Oh, then by process of elimination, if Jr, fi, and J_inv all process.
I'm assuming v is your issue.
Is v a matrix?
Could you print what v is?
Hm interesting.
This should be?
yes
You're giving me two values. hahaha I'm assuming one of them is v.
im tryng to make thing in an automatic way
ohh
sorry hahah
the matrix is what my programme is returning
the vector is what it should be
Hm and it's able to print v in the loop?
v = v - J_inv @ fi
It's able to evaluate this and print this?
Ahh okay so
v = v - J_inv @ fi is your issue then.
[0, 2] this is the values i passed
[[-1. 2.]
[ 0. 1.]]
this is the matrix it returns
but it should be a vector not a matrix
Quick question is
v = v - J_inv @ fi supposed to be v - (J_inv @ fi) or (v - J_inv) @ fi.
I'm really lost. hahaha
Because you have multiple v's and I'm only focusing on v = v - J_inv @ fi
and I'm not sure what you're referring to now.
Focus on the loop.
well
Unless you're saying v = v0 has a bug
can i show you the original code
Sure.
and the one im trynig to do?
Sure?
def f1(v):
    return(5*(v[0]+1)**2 + (v[1]+1)**2 - 25)
def f2(v):
    return(np.e**(v[0]) - v[1])
def f(v):
    return(np.array([f1(v), f2(v)]))
def J(v):
    M = np.array([[10*v[0]+10, 2*v[1]+2], [np.e**v[0], -1]])
    print(M)
    return(M)
def newtonSistemas(f, J, v0, n):
    v = v0
    for i in range(n):
        print("V")
        v = v - np.linalg.inv(J(v)) @ f(v)
        print(v)
    return
print(newtonSistemas(f, J, np.array([0,2]), 20))
this is the one i know it works
the thing with this is that you have to replace the x and y for v[0] and v[1], manually
and this is my poor son who doesnt work correctly hahah
funcion1=5*(x+1)**2 +(y+1)**2-25
funcion2=(math.e**x)-y
sorry i paste other
def sistema_newton(funcion1, funcion2, v0):
    x, y, z = sp.symbols('x y z')
    f=[funcion1, funcion2]
    print("f=")
    print(f)
    print("")
    dp1=[sp.diff(funcion1,x),sp.diff(funcion1,y)]#sp.diff(funcion1,z)]
    dp2=[sp.diff(funcion2,x),sp.diff(funcion2,y)]#sp.diff(funcion1,z)]
    print("J=")
    print(np.array([dp1,dp2]))
    print("")
    J=sp.lambdify([x, y],[dp1,dp2], "numpy")
    f=sp.lambdify([x, y],[dp1,dp2], "numpy")
    v = v0
    print(v)
    for i in range(20):
        print(f'on {i} it')
        Jr=np.array(J(v[0], v[1]))
        fi=np.array(f(v[0],v[1]))
        J_inv=np.linalg.inv(Jr)
        print(J_inv)
        print("")
        v = v -(J_inv @ fi)
        print("v")
        print(v)
        print("")
    return
#print(v0- np.linalg.inv(np.array(J(v0[0],v0[1])) @ np.array(f(v0[0],v0[1]))))
print(sistema_newton(funcion1, funcion2,[0,2] ))
this is the one you were helping me with
i just tried to make the computer do the replacements of x and y
Hmm I'm trying to understand what do you mean by you have to replace the x and y for v[0] and v[1] manually.
Because it seems like both are inserting [0, 2] from what I'm seeing.
ohh
yes yes
but
in the first one
funcion1=5*(x+1)**2 +(y+1)**2-25
funcion2=(math.e**x)-y
i have to pass
the x and y
manually
i have to copy in the computer
typing
5*(v[0]+1)**2 +(v[1]+1)**2-25
Can't you call f1(v) and f2(v)?
Oh why not?
so it can replace it
is like you are passing v as the parameter of a function
so unless you call a variable v (in this case a list) it wont work
i mean you could pass f1
v=[0, 1]
and it will replace all v for 0 and 1
Okay I'm going to try to focus on what you're trying to do now.
sure
try to see if J_inv @ fi works.
print that
see if it evaluates
and then add v - to that dot product/matrix multiplication.
See where the error might be coming up.
What do you mean by rest the v?
So J_inv @ fi gives you
[[1. 0.]
[0. 1.]]?
yes
and v - J_inv @ fi gives you
[[-1. 2.]
[ 0. 1.]]
yes
well that is better than anything hahah
Where v[0] = [-1, 2] and v[1] = [0, 1]
but your x and y are supposed to be single values, right?
To clarify you want v[0] to be 0 and v[1] to be 2?
in the first iteration
try print(sistema_newton(funcion1, funcion2, np.array([0,2])))
is the same
still the same error?
it must be something about dimensions
yes
what dimension is Jr and fi supposed to be?
but jr and fi, what about their dimensions?
what shape is the matrix?
2x2
Okay that makes sense.
2x1 - (2x2 @ 2x1) => 2x1
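The shape bookkeeping can be checked directly. The sketch below is an illustration, not the poster's exact code: it uses the Jacobian and function values at v = [0, 2] and shows one way a 2x2 result can appear, namely when the right-hand operand of `@` is itself a 2x2 array, so that NumPy broadcasts `v` against a matrix.

```python
import numpy as np

v = np.array([0.0, 2.0])
Jr = np.array([[10.0, 6.0], [1.0, -1.0]])  # Jacobian evaluated at v
J_inv = np.linalg.inv(Jr)

fi_vector = np.array([-11.0, -1.0])  # f(v): shape (2,)
fi_matrix = Jr.copy()                # a 2x2 array in place of f(v)

step_vector = v - J_inv @ fi_vector  # (2,) - (2,2)@(2,)   -> shape (2,)
step_matrix = v - J_inv @ fi_matrix  # (2,) - (2,2)@(2,2) -> broadcasts to (2, 2)

print(step_vector.shape)  # (2,)
print(step_matrix.shape)  # (2, 2)
print(step_matrix)
```

Printing `fi.shape` inside the loop is usually the quickest way to tell the two cases apart.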
Oh what's the error?
It's okay. We've learned from this. But what was the error?
do you remember f?
yes
hahaha
Ahh.
f=sp.lambdify([x, y],[dp1,dp2], "numpy")
and it should be as
f=sp.lambdify([x, y],f, "numpy")
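Putting the fix together, a hedged sketch of the corrected function. It keeps the names from the chat (`sistema_newton`, `funcion1`, `funcion2`), but two things are substitutions of mine rather than the original code: `sp.exp(x)` instead of `math.e**x`, and `np.linalg.solve` instead of an explicit inverse. The `n_iter` parameter is also an addition.

```python
import numpy as np
import sympy as sp


def sistema_newton(funcion1, funcion2, v0, n_iter=20):
    """Newton's method for two equations in the symbols x and y (sketch)."""
    x, y = sp.symbols("x y")
    exprs = [funcion1, funcion2]
    jac = [[sp.diff(e, x), sp.diff(e, y)] for e in exprs]

    # f is built from the expressions, J from their partial derivatives.
    f = sp.lambdify([x, y], exprs, "numpy")
    J = sp.lambdify([x, y], jac, "numpy")

    v = np.asarray(v0, dtype=float)
    for _ in range(n_iter):
        Jr = np.array(J(v[0], v[1]), dtype=float)  # shape (2, 2)
        fi = np.array(f(v[0], v[1]), dtype=float)  # shape (2,)
        # Solve Jr @ step = fi rather than forming the inverse (a substitution).
        v = v - np.linalg.solve(Jr, fi)
    return v


x, y = sp.symbols("x y")
funcion1 = 5 * (x + 1) ** 2 + (y + 1) ** 2 - 25
funcion2 = sp.exp(x) - y
root = sistema_newton(funcion1, funcion2, [0, 2])
print(root)
```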
im so sorr
i cant believe it
Congratulations on solving the issue.
hahahaha
of what all the things it could
be
it was the silliest one
hahahaahah
Oftentimes it's the tiny details.
i think i copy paste
the lambdify
and forgot about
i didnt replace
you helped me to see that the vector wasnt correct haha
i have no words
thanks a lot
and sorry for wasting your time with this
thank you very much
It's not a waste of time if we learn something.
Happy to help, hope you have a wonderful mathematical journey from now.
can someone give me a brief explanation on enums?
i mean more general then that
like i have a list of 4 bit enum's
so i can store a values in the index between 0-15
right?
Hello, I would like to apply this line only the numeric variables of my dataset. Can anyone can help please?
(np.abs(stats.zscore(df)) < 3).all(axis=1)
@hushed wasp df.select_dtypes(np.number)
for index, value in enumerate(iterable)
@fallow prism that's enumerate, not enum
two different things
@chrome orbit what do you want to do?
@marsh chasm nice. thanks for telling me. gridsearch does the validation curve stuff for the whole set of parameters and plots with correct axis labels for the parameter set? sounds way easier.
@remote valley it doesnt actually do the validation curve stuff it just stores the validation score and training score both (which is what i needed)
Does amount of data I feed into keras model affect the speed of this model?
I mean, I want the model to predict ASAP, and I don't know if I should decrease the amount of data slightly, to get faster predictions
- does it affect size of model?
@oblique vine are you trying to develop a neural network?
It does not affect the size of the model. The number of layers in the model is specified by you. If your model already performs well, you could decrease the amount of data slightly. You could also use a regularization parameter like L2 or L1 which will "ignore" some features. Be careful about the model not converging @oblique vine
@velvet thorn can i pm u
@chrome orbit nope, sorry
@cerulean spindle L2 doesn’t perform feature selection
oh my bad
@velvet thorn ok i have a 4 bit message, enum, and just want to know how reference the data?
@chrome orbit that's still not clear TBH
how does the enum relate to the message
what do you mean "reference the data"
or are you saying the 4 bits are the enum
how about you give an example
0: System - Power Saving
1: System - ON, no hand detected
2: System - ON, hand detected
3: System - ERROR
4-15: Not used
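One way to model that table in Python is the standard library's `enum.IntEnum`. This is only a sketch: the member names, the `decode_state` helper and its `shift` parameter are hypothetical, since the actual position of the 4-bit field in the CAN frame is not given.

```python
from enum import IntEnum


class SystemState(IntEnum):
    POWER_SAVING = 0
    ON_NO_HAND = 1
    ON_HAND_DETECTED = 2
    ERROR = 3


def decode_state(raw_byte, shift=0):
    """Extract a 4-bit field from raw_byte and map it to SystemState.

    shift is a hypothetical parameter; the real bit position depends on
    the message layout, which is not specified here.
    """
    value = (raw_byte >> shift) & 0x0F
    try:
        return SystemState(value)
    except ValueError:
        return None  # values 4-15 are listed as "Not used"


print(decode_state(0b0001).name)                     # ON_NO_HAND
print(format(SystemState.ON_NO_HAND.value, "04b"))   # 0001
print(decode_state(0b1010))                          # None
```

With this numbering, "ON, no hand detected" is value 1, which is 0001 in four bits; 0010 is value 2.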
okay, and what do you want to do with this
so for example, if i have System - ON, no hand detected, the value would be 0010?
@chrome orbit ...in bits?
yes, but are you trying to convert a string to a number or what
no im just trying to read the message on the CAN bus
...so is that a microcontroller question or what
Hello! How would I go about call the 'i' in a loop? I am trying to create a series of pandas dataframes by splitting apart a big dataframe
for i in range(13):
    dfi = df[df.index < 60*i]
I want to make it so dfi is actually changing with the i of the loop so I would end up with 13 dataframes
It really feels like I should be able to call that i somehow but I don't know how
dfs = []
for i in range(13):
    dfs.append(df[df.index < 60*i])
or, using a list comprehension,
dfs = [df[df.index < 60*i] for i in range(13)]
Thank you ❤️
hey can anyone tell me is deep q network able to solve gym's mountain car environment?
I created a dqn which beats the cartpole env, but it fails on mountain-car env
I am using just a simple dqn without any target networks or anything
if you answer ping me
How can I get specific data from a website and then put it into my program?
web scraping
@velvet thorn Thanks gm but I don't know to apply this line on the numeric variables : (np.abs(stats.zscore(df)) < 3).all(axis=1)
df.select_dtypes(np.number) and (np.abs(stats.zscore(df)) < 3).all(axis=1) doesn't work together
Does anybody here know anything about automated stock trading?
@hushed wasp do you understand what that line does...?
I want to get out of the outliers but i can apply it only on the numeric variables
not sure to understand you point
okay
a different question
@hushed wasp why are you using numpy and scipy.stats there?
do you understand?
in the sense that
you can do that
entirely within pandas
without referencing numpy or scipy directly
and I think
if you understand that
you will see how to solve the problem.
it's the code I found to be able to get rid of the outliers on the web using the zscore
yeah.
that's what I thought
I think you need to understand pandas fundamentals first
before doing such things
your foundation is the most important
sure but I don't see why I can't use numpy and scipy here
you can
but
it would actually be really simple to combine what I gave you
and what you have.
so if you don't know how, that suggests that you are doing something that is a bit too advanced for you
which is fine, that's how we learn
but again...can you write that piece of code purely using pandas?
because if you can, then how to integrate what I gave you will be clear.
(np.abs(stats.zscore(df)) < 3).all(axis=1) this.
actually, just the condition.
it raises me an error if I apply it on the whole dataset that's why I try to apply only on numerical data
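For what it's worth, here is a sketch of the pandas-only version being hinted at, combining `select_dtypes` with the same "|z| < 3" condition. The DataFrame and its column names are made up for illustration; `ddof=0` is used because that is the default of `scipy.stats.zscore`.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a"] * 21,                # non-numeric column, left out of the z-score
    "value": list(range(20)) + [1000],  # the last row is an obvious outlier
})

numeric = df.select_dtypes("number")
# Population z-score (ddof=0), matching scipy.stats.zscore's default.
z = (numeric - numeric.mean()) / numeric.std(ddof=0)
mask = (z.abs() < 3).all(axis=1)  # keep rows where every numeric column is within 3 sigma
filtered = df[mask]

print(len(df), len(filtered))  # 21 20
```

The mask is computed on the numeric columns only but applied to the full DataFrame, so the non-numeric columns are kept in the result.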
does np.argmax() have issues parsing nan observations? I have a list of lists and the lists towards the start of my bigger list have a lot of nans
@teal vine what kind of issues are you thinking of
Well the first say 30 lists end up returning 0 for np.argmax() or nan for np.max() when there are actual floats in there that should be returned. As the lists start having less nans the 2 functions start returning the actual max
@teal vine yup.
that's because of how comparisons work with nan
Oof
!e
import numpy as np
nan = np.nan
print(np.max([0, np.nan]))
print(np.min([0, np.nan]))
see?
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | nan
002 | nan
however, there is a body of functions that ignore nans
that would be the subject of a Google search 😉
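For anyone landing here later, a short sketch of the nan-ignoring counterparts in NumPy (`np.nanmax`, `np.nanargmax`). The array is made-up data.

```python
import numpy as np

row = np.array([np.nan, 1.5, np.nan, 3.0])

print(np.max(row))        # nan, because nan propagates through the comparison
print(np.argmax(row))     # 0, the position of the first nan
print(np.nanmax(row))     # 3.0
print(np.nanargmax(row))  # 3
```

One caveat: `np.nanargmax` raises a `ValueError` when a slice contains only nans, so lists that are entirely nan need to be handled separately.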
yw
Why is argmax and max not deprecated when nanargmax exists?
Seems like the only use would be detecting if there's a nan in your data but there's other functions for that
Am I missing something you could use argmax for over nanargmax?
@teal vine efficiency.
Gotta save those microseconds. But yeah that makes sense thank you again.
You're very quick 
Hello everyone, I'm trying to make a real-time dataset that came from a MySQL database, to be consumed on powerbi, so I'm using apache Kafka, and I'm asking myself if that is the best way to do that?
Can anybody recommend any paid training on data science/ML?
Anybody have any suggestions on how to cluster this? Granted there's some noise visible there looks to be at least four clusters. I've tried DBSCAN and Gaussian mixtures which don't seem to work well.
@bronze barn You look experienced. I have a small problem. Can you please help me?
This code fetches a year old data. I need it to take the latest data. How do I do this
@bronze barn ?
I'm not familiar with quandl, is that a web scraping package or do you have the raw data for the latest stock prices?
How much data does it give you, a month's worth?
Aside from that I can't be much help to you then but I would guess that the issues has to do with calling quandl so I'd read up on their documentation, maybe there's an argument you can pass for that. Good luck though!
@crude marsh
@bronze barn take a look at HDBSCAN, it uses hierarchical clustering with dbscan to improve clustering => https://github.com/scikit-learn-contrib/hdbscan
When you finally get your code to run but it runs 3 times as slow as you thought it would 
Turns out that iterating over lists of DataFrames was not a good idea, who knew
Linear regression models the output, or target variable 𝑦 ∈ ℝ, as a linear combination of the 𝑃-dimensional input x ∈ ℝ^𝑃. Let X be the 𝑁 × 𝑃 matrix with each row an input vector (with a 1 in the first position), and similarly let y be the 𝑁-dimensional vector of outputs in the training set. The linear model will predict y given x using the parameter vector, or weight vector, w ∈ ℝ^𝑃 according to
lol what
can someone translate into english
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import sklearn.metrics as metrics
# Fit Ordinary Least Squares: OLS
csv = pd.read_csv('https://raw.githubusercontent.com/neurospin/pystatsml/master/datasets/Advertising.csv', index_col=0)
X = csv[['TV', 'Radio']]
y = csv['Sales']
lr = lm.LinearRegression().fit(X, y)
y_pred = lr.predict(X)
print("R-squared =", metrics.r2_score(y, y_pred))
print("Coefficients =", lr.coef_)
# Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(csv['TV'], csv['Radio'], csv['Sales'], c='r', marker='o')
xx1, xx2 = np.meshgrid(
np.linspace(csv['TV'].min(), csv['TV'].max(), num=10),
np.linspace(csv['Radio'].min(), csv['Radio'].max(), num=10))
XX = np.column_stack([xx1.ravel(), xx2.ravel()])
yy = lr.predict(XX)
ax.plot_surface(xx1, xx2, yy.reshape(xx1.shape), color='None')
ax.set_xlabel('TV')
ax.set_ylabel('Radio')
_ = ax.set_zlabel('Sales')
these people don't even use train_test_split
Who are 'these people' ?
the author of the book
Statistics and Machine Learning in Python by Edouard Duchesnay, Tommy Löfstedt, Feki Younes
seems pretty sketchy to not use train and test splits but I'm not an expert on the topic.
I used "An Introduction to Statistical Learning" by Gareth James et al. when I was doing some ML stuff
It's focused on R but the concepts and explanations are pretty solid
oh yeah I have that too
I found most python implementations are just a google search away sooo I just focus on concepts
it's just that these books put me to sleep
there's also Data Science from Scratch from O'Reilly
O'Reilly has another book about ML
hands on machine learning
I haven't really looked that deeply into it
Just did some exercises that involved classifying defaulters, its interesting enough.
I guess this is a better place the post the question
I have to classify some DNA/RNA/etc sequences, and I have to come up with the features myself. I figure if I can find some of the longest substrings that appear frequently among all the sequences, the presence of that substring might be a good feature.
Not sure how to do "top k longest most common substrings"
@serene scaffold As in something like this? https://www.geeksforgeeks.org/find-the-longest-substring-with-k-unique-characters-in-a-given-string/
@heady hatch not quite, because there aren't going to be any reasonably long substrings shared between all sequences
I'm thinking "what's the most common n-character substring in the set of strings" even if it's not a substring of all the strings.
@serene scaffold I was thinking about your problem, and would bow work here?
From my understanding of DNA (and maybe RNA), they’re in sequences of 4, right?
Naive feature would be the count of different sequences of 4.
Otherwise, you can look for the longest substring shared, and keep decreasing it to create more features.
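A rough sketch of the "most common n-character substring across the set" idea, using only the standard library. The function name `top_kmers` and the sequences are made up; each substring is counted once per sequence, so the count is the number of sequences that contain it.

```python
from collections import Counter


def top_kmers(sequences, k, top=3):
    """Count in how many sequences each length-k substring occurs."""
    counts = Counter()
    for seq in sequences:
        # A set, so that each k-mer is counted once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts.most_common(top)


sequences = ["ACGTACGT", "TTACGTAA", "GGGACGTC", "CCCCCCCC"]  # made-up example data
print(top_kmers(sequences, k=4, top=1))  # [('ACGT', 3)]
```

Running this for several values of k and keeping the top few substrings per k gives candidate features of the form "sequence contains this substring".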
Hey. I’m new to Deep Learning and I was following Sentdex’s video on the handwritten digit recognition with Keras. Great video. Just a question, how did he now how many layers he should use for the neural network? And how many neurons should he put in each layer?
Please ping me if you can help 👍
Not only in that example but in any - how do I know how many layers should I use on a neural network and how many neurons should I use in each one?
@median dove
I hope others will provide input as well.
In terms of how many layers and units per model, that's where research comes in. Often built on other research.
I'm not familiar with shallow NN, but in terms of deep NN it's usually some kind of architecture that was experimented and found to be working then scale from there.
The first and last layer is determined by your input and output. The hidden layers are meant to capture relationship between the input and the output.
In computer vision, depending on the problem, the hidden layers do so by extracting features which can be used to calculate the logits.
Or have its features extracted to have it decoded into something else.
The number of units and layers can also be set as a hyperparameter. To answer your question of how many should you use, experiment and see what works for the problem you're trying to solve.
Awesome, thank you very much @heady hatch
@median dove I'm pretty late to this answer, but hyperparameter tuning is arguably the hardest part about making neural networks, since there is no way to "calculate" exactly how many neurons to put in each layer, what kind of layers to put, how many layers to put, etc. It's all experimentation. There are automated hyperparameter tuning methods, such as HyperBand, Bayesian Optimization, and Grid Search (the most basic of them), which basically test out different parameters and see how they do, but even then it takes a while to find the optimal set of parameters. Different people will tell you different ways of how to go about it, like scaling up from a simple model, scaling down from a large model, using "blocks" of layers (basically the same set of layers that you repeat a few times), but like nine said it's usually best to start out with something that was experimentally proven to work on some problem, and then to adapt it to yours and make changes accordingly.
Thanks you @austere swift 😊
Hey! Anyone free to help?
i just wanted to ask how to make a scatter plot that has budget, revenue and profits columns from the dataset and shows how they change over time. i.e. years on the bottom axis
assuming a cleaned dataset i have
@haughty hinge are you sure you wouldn't like a barplot or a line plot
was thinking a scatterplot with each of those mentioned in different colours
but a line plot makes sense
@haughty hinge that is something I could help you with if you are still in need of help.
@hollow gull Thanks 🙂 im trying it myself after reading stuff online ill message on the channel if anything
okay, I will start putting an answer together and you can ignore it if you want 🙂
legend!!
make sure your index is the date column for this to work: df.index = df['date']
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# draw each column as its own line on the same axes
for column_name in ['budget', 'revenue', 'profits']:
    df.plot(y=column_name, label=column_name, ax=ax)
ax.legend()
ax.grid()
plt.show()
I made a lot of assumptions about your dataframe, but maybe that is enough to push you forward a bit.
You don't have to make the date column your index, but if you do pandas does a pretty good job of making your life easier from my point of view.
Hey! I have a dictionary where the key is a string which I would like an AI to predict based on another string. The value is a list of strings as 'examples' of what should point my AI to to the key as a result. I want to be able to provide an AI a string similar to one that would be found as a value in the dictionary and it to return the probabilities of it being a part of each key. I'm very new to machine learning so I'm unsure which AI method (RNN, GAN, Q-learning) to use for this, but do wish to have a good go by myself. Could someone please point me in the right direction for this project? I assume not RNN since it's not a step by step process- it's just a single calculation- and I assume not a GAN since I am not wanting to generate values, so I assume Q-learning (the one that is most confusing to me, sadly) which I feel would make sense since this is something where it is given a question and it needs to determine the correct answer and be rewarded, which is mostly all I know about Q-learning.
@hollow gull thanks alot !!
@mortal pendant Could you give us an example of what the keys and values are like? How long are the strings?
bow chika bow bow
Can anyone tell me the relationship between all ML algorithms (SVM, Neural Net...)? I vaguely remember that all of them sorta acted like a neural net, but I'm not quite sure. Can anyone confirm this?
neuralnet is like... a huge nest of log regs imo
Hi, I got a question. I'm using a dataset in my python code which shows symptoms of diabetes, they're in true and false meaning some people have them and some do not. I am making a yes or no type survey and wanted to know how I can compare the answers to the dataset so it gives me a percentage of similarity. For example there are 14 questions, if more than half is given the answer yes, then I want it to compare the "yes" or "true" to the dataset and give me a percentage of similarity. Sorry if this is a very specific question. I have been researching this and currently been stuck on it for a while now
Usually for those types of datasets, you'd actually want some type of data (ex: measuring blood levels) instead of yes or no questions. If so, I think you'd have to reformat your data so that it can understand these yes or no questions (each datapoint's value can only be 0 or 1)
I think you should reformat your data though if you want to stick to the survey
Yes I have reformatted them in my python code to identify "true" and "false"
The way you're structuring it, "if more than half is given the answer yes, then I want it to compare the 'yes' or 'true' to the dataset and give me a percentage of similarity", I would approach this by using this survey as a sample and feeding the sample into an ML model. If it's an sklearn classifier, it should support predict_proba() and/or decision_function(). I don't remember entirely how they differ, but predict_proba() gives you a probability for how likely it is for a sample to belong to each class. For example, an output of [0.25, 0.50, 0.25] means there is a 25% chance it is in class1, 50% in class2, and so on.
Ah ok, that's very helpful thanks
yeah sure. I'm pretty sure predict_proba() gives you the probabilities as values between 0 and 1 (multiply by 100 for %), while decision_function() gives raw scores that are harder to interpret, so I prefer predict_proba()
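To make the difference concrete, a small sketch with scikit-learn. The data is synthetic (`make_classification`) and the choice of `LogisticRegression` is an assumption for illustration; it is used because it exposes both methods.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a 14-question yes/no survey dataset (assumption).
X, y = make_classification(n_samples=200, n_features=14, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

sample = X[:1]
proba = model.predict_proba(sample)      # shape (1, 2); each row sums to 1
score = model.decision_function(sample)  # shape (1,); a raw signed score, not a probability

print(proba, proba.sum(axis=1))
print(score)
```

Not every estimator has both: `RandomForestClassifier`, for example, provides `predict_proba` but no `decision_function`.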
Gotcha ty
what kind of algorithm?
you can go for neural networks as they are universal function approximators
@crude marsh
What do you mean by ML to train data science algorithms?
What's your definition of ML and what's your definition of data science along with algorithms?
he might mean K-means clustering, binary tree classification stuff..
although I am not into those
Ahh as in those are the algorithms?
My Random Forest Algorithm got a perfect prediction out of 30 tries, is this wrong?
Depends. What did it get a perfect prediction on?
it perfectly predicted 520 people with diabetes
of which it had 16 different symptoms
Right so did you just train it on the data and predicted it on what it trained on?
70% training 30% Test data. It also used bootstrapping
Sounds like you have a great model.
When you say 30 tries, do you mean you trained it on the training set and predicted on the test 30 times?
Or did you mean something else?
Ye 30 tries
Try cross validation.
But since you're using rf, you can also check oob errors.
Could you clarify what you mean by you used bootstrapping?
So split your data, train test split.
do cv on your training data. See how it goes, then train it on your training data and test it on your testing data if your cv looks okay.
bootstrap: bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
Thats how i learned it atleast xd
Grab your dataset, split your data into training and testing.
Then do your cross validation on your training data.
I'll give it a shot. Thanks man.
Yup. See if it still gives it the same result as your previous findings.
gotcha
hmm, sorry to ask how do I would do that? Only recently began this project and this whole thing is quite new to me. xD researching confuses me a lot since theres like 10 different examples and idk what they all do
Oh yea no worries.
So
from sklearn.model_selection import cross_val_score
cross_val_score(model, x, y)
and if you want to read more on it.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
Because there are more arguments.
Ah ok. Ill read through it ty
Your x and y should be your training set.
So you can leave your testing set out as hold out.
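A sketch of the whole workflow described above: split, cross-validate on the training part, then check the held-out test set once. The dataset is synthetic and the sizes (520 samples, 16 features, 70/30 split) only echo the numbers mentioned in the chat; `oob_score=True` is added to show the out-of-bag estimate as well.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the diabetes data (assumption).
X, y = make_classification(n_samples=520, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)

# 5-fold cross-validation on the training data only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(cv_scores.mean(), cv_scores.std())

# Fit once on the full training set, then look at OOB and hold-out accuracy.
model.fit(X_train, y_train)
print(model.oob_score_)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

If the cross-validation scores, the OOB score and the test accuracy all agree, a near-perfect result is more believable; a large gap between them points to leakage or an unlucky split.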
Thanks!
Hi guys,i have never tried machine learning and whenever i look into it it seems very very complicated.I guess my foundations of python are not enough yet but the jump towards ml still seems overwhelming and i guess i will need 1-2 years with my current tempo.But today i found out about the libra library that creates neural networks and other stuff in 1-2 lines of code.I am pretty sure that pros are laughing at the abilities of libra but would you consider it to be a good starting point for someone who struggles with python and want to learn a bit about ml,especially for trading (i am aware that for trading it needs more but for experimenting purposes)?
@indigo steppe
Depending on what your learning style is.
Being honest with you, Libra seems like they only cover the basic cases which can be done via automl as well.
If you want some direction moving forward, learning your Python foundation is useful whether you go into ds/ml or not.
You don't really need to master python to play with data science and machine learning, but you should be comfortable at least.
Libra might be able to get you started and allow you to play with small ml projects, but it doesn't seem to help you understand what the model is doing which I think might be a core part to working with them.
Hi! I'm new to discord. How does one use this chat room to start a discussion. How does one track replies, threads...it looks confusing. Thank you for helping me understand!
@heady hatch thx for your honest answer
I'm attempting to create a notebook for exploratory data analysis. I have an outline for the project that I'd like to bounce off someone experienced. I hope this is the right way to ask for help!
@heady hatch sorry for tagging you once again but i just found out that automl isn't a library but a subcategory of ml...you gave your opinion about libra and i respect that. would you still advise a beginner to use some automl libraries or would you advise to jump directly into the non automated stuff once i am "ready"?
When you say subcategory of ml, are you referring to something like this?
I mean not just the one that google made, but pretty much this whole section where ml is automated.
btw please don't worry about offending me in any way, my advice aren't gold or absolute truth. hahaha
no need to offend you,i just learned about automl which i consider usefull information.even if you were a "douchebag",i learned something from that "douchebag" which is progress in my book.so thank you for that "douchebag" (pls i am just joking,you are helping me so thx).never heard about this cloudml stuff,i was reading about auto sklearn,TPOT and hyperopt and found out there are a few libraries that do automl so i am not sure if i should go play with them or if it is a waste of time for learning about ml in trading.i guess they won't help me make money but at least i could learn MAYBE something
oh,autokeras is one of them too
Again depending on your learning style. But ml is just as wide as it is deep.
Can you use ml without any understanding of how things work just via calling libraries? Sure. In fact automl was made for things like this. Where you can use ml solutions without knowing how anything works other than the libraries or the api themselves.
Depending on your future job responsibility, it's hard to implement a model if you cannot train, test, debug or monitor it.
And automl will probably help ease with that. I don't know the degree of customization they allow you. And for most problems, it seems like a cookie cutter solution would do.
If you want to jump into deep learning as quickly as possible, learn your python first and you can try fastai's deep learning course.
it's very well made and they teach from a top down perspective.
Meaning that if you just want to stop at calling libraries, sure.
And if you want to go farther, then they offer that too.
Nonetheless, Python is still needed.
To answer your question, whether these ml solutions will help you learn actual ml, I don't know since I've never used them.
You can gauge it for yourself. Use the library and look up papers, conversation, and notebooks and see if you can understand what others are talking about after using them.
I was taught under the discipline of implementing things myself and take things apart.
And this is how I learn. People are different, maybe these solutions will help you like training wheels on a bike.
have you gone through the fastai course? i just read a bit about it and it sounds interesting. the thing that i am interested in is, is it only a course/tutorial or does it have assignments and "homework" too..? i went through many, many tutorials and tbh i forgot 80-90% of it and with time it will be even more i guess. so i just recently started with the foundations again so i can turn the tutorial knowledge into assignments and exercises. so does fastai offer some kind of exercises?
I didn’t take the in person class, but just watched the videos.
I don’t think there are exercises or homework for the online videos but they did release their notebooks for people to study from.
thank you very much,i appreciate your time
I'm looking for a dataset for classification with p>=40 and n>max(500,p*5). If anyone happens to know of one please dm me!
yo man
wtf
ive never seen this error in my life
```
AssertionError: in user code:

    <ipython-input-10-f20008174a6b>:13 train_step  *
        disc_real_output = discriminator(target_images, training=True)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/base_layer.py:985 __call__  **
        outputs = call_fn(inputs, *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:386 call
        inputs, training=training, mask=mask)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/functional.py:517 _run_internal_graph
        assert x_id in tensor_dict, 'Could not compute output ' + str(x)

    AssertionError: Could not compute output Tensor("activation_27/Sigmoid:0", shape=(None, 1), dtype=float32)
```
@cobalt jetty i tried the keras preprocessing thing seems to work fine in terms of using an input so thanks heaps for that suggestion man
but i cant seem to figure out this error lol
im not sure
maybe its because im trying to downsample from 32 to 28
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.
does strides work with non integers
would anyone know how i could downsample the inputs to 28x28 using conv2d? or by another means?
downsampling from 32,32,1, to 28,28,1
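On the strides question: Conv2D strides have to be positive integers, so the usual way to reason about this is the output-size formula for an unpadded ("valid") convolution, out = floor((n - k) / s) + 1. A rough sketch in plain Python (the helper name `conv_output_size` is made up for illustration, it is not a Keras function) suggests that a 5x5 kernel with stride 1 takes 32 down to 28:

```python
def conv_output_size(n, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    if not isinstance(stride, int) or stride < 1:
        raise ValueError("stride must be a positive integer")
    return (n - kernel) // stride + 1


# 32 -> 28 with a 5x5 kernel and stride 1
print(conv_output_size(32, kernel=5))  # 28

# Which kernel sizes give exactly 28 at stride 1?
kernels = [k for k in range(1, 33) if conv_output_size(32, k) == 28]
print(kernels)  # [5]
```

In Keras terms that would correspond to something like `Conv2D(filters, kernel_size=5, strides=1, padding="valid")`; the same helper can be used to try other kernel/stride combinations before building the model.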
but i realised it doesnt work just now lol
ill send my discriminator cos thats where the issue is happening
one sec
x = Dense(1,activation='sigmoid')(x) seems 2 be the issue
@mortal pendant Could you give us an example of what the keys and values are like? How long are the strings?
@heady hatch I’m on mobile so can’t get an example but hope this helps: there are 5 6-12 character keys and the values all contain atleast 300 strings, where these strings’ lengths vary a lot- anywhere from 1 character to 2000 characters. Most of the time, they’re around 300 characters though. The dataset is flexible for the vocab, so if the AI would train better with just alphanumeric characters then it would be fine to just filter it down to that, but it can also contain a wide range of symbols including emojis if that would help. I can also filter out strings that are too long or too short if that would help, though this isn’t as flexible and would greatly decrease the amount of data
If this helps anyone else, here’s the context #data-science-and-ml message
Would you guys help me running opencv functions in GPU
btw for my issue it’s also worth noting I’m using Google Colab if that makes a difference
ok i fixed my activation problem the issue was me concatenating my inputs but i dont think i actually need to do that
but im having compatibility issues getting my input and the mnist dataset to work, 256x256 images wont work
@cobalt jetty sorry to bother you again, do you think i should change my images to like 224x224 and then progressively downsample to 28x28
using keras preprocessing
your issue, I think is that you're using a sigmoid activation function, which is used for binary classification. It won't help to get a matrix output imo.
Look into softmax activation
oh nah i figured that out it should be all good
im getting this issue because im trying to downsample a 256x256 image and get a 28x28 but i cant do that mathematically
a new issue i mean
downsampling is just a tool, you should instead look into how your shapes change across the learning process, maybe add a 784 dense layer at the end where each neuron is a grayscale value output
.WARNING:tensorflow:Model was constructed with shape (None, 28, 28, 1) for input Tensor("input_2:0", shape=(None, 28, 28, 1), dtype=float32), but it was called on an input with incompatible shape (1, 32, 32, 1).
so you can reconstruct your 28x28 picture.
yeah, you're feeding a wrongly shaped input into one of your layers
yeah thats what i mean
so in my generator model do you think i should add a dense layer that then connects to another layer to output a 28x28 image
You could .squeeze() your output labels (the MNIST image matrices) so it is represented as a 1d vector. Each point would represent a target for your neural network, so you don't have to worry about the end shape of the model beyond that it's a 784-element vector.
afterward you can just create a function that converts that 784-element vector into a 28x28 matrix.
iirc, a grayscale image is a single channel matrix compared to a RGB, so its shape for MNIST is technically 1x28x28, while if the picture was RGB, it would be 3x28x28 (due to the three channels R, G and B).
ye thats right
so, squeeze your output targets as 1x784 and create an end layer with 784 neurons that should output a value between 0 and 255.
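A small sketch of the flatten-then-restore idea, using NumPy and random stand-in data instead of real MNIST. One caveat: `.squeeze()` only drops axes of length 1, so it is `reshape` that actually turns a 28x28 image into a 784-element vector.

```python
import numpy as np

# Stand-in for a batch of MNIST-style targets: 4 grayscale images, 28x28.
images = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Flatten each image into a 784-element vector (one row per image).
flat = images.reshape(len(images), -1)
print(flat.shape)  # (4, 784)

# Convert the 784-element vectors back into 28x28 images.
restored = flat.reshape(-1, 28, 28)
print(np.array_equal(restored, images))  # True
```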
The difficulty of your model, imo, is that images have some margin of error when you think about their interpretability. It's not because one image of a "3" is slightly lighter that it will look like a 4. So thinking about how you will measure the quality of your model is something to think about too. Which is an interesting problem. Your model matching your input to the output with the exact hue or saturation isn't necessary for it to be valid.
You should also look into a dimensionality reduction technique like TruncatedSVD since MNIST data is 80% zeros. TruncatedSVD can compress those mostly-zero features into far fewer components so your model trains faster
If you train your model on data that’s (60000, 784) it’ll take three hours to train. I shrank the data to (6000, 149) and got the same results (finished in < 1 min). If time is a factor for you of course
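For reference, a minimal sketch of what that reduction step could look like with scikit-learn. The data here is synthetic (random values, roughly 80% zeros) and the 50 components are an arbitrary choice for the example, not the 149 mentioned above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened MNIST: 500 rows, 784 features, ~80% zeros.
X = rng.random((500, 784)) * (rng.random((500, 784)) > 0.8)

# Project the 784 features onto 50 components.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (500, 50)
# Fraction of the variance the kept components explain.
print(svd.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` on the real data is one way to pick the number of components instead of guessing.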
MNIST data is 80% zeros
technically the MNIST data is made of roughly an even number of objects of each label (see picture). It's better to say that each object's features are mostly zeros, imo.
(6000, 149)
How long did it take to run the TruncatedSVD on the data?
Hi, I have some questions about Google Colab. Is it easy to run a neural network on Colab with a TPU?
I've not used Colab often, but you have a TPU option for free. It doesn't mean it will be a lot of TPU of course.
But perf should be improved somewhat compared to a 'normal' CPU instance.
Do you have a link that explains how to set up the model to train on a TPU?
tbh, if you're using Gcolab it should be seamless. It's basically a dropdown select in the options.
otherwise, look up the blogs Towards Data Science
there must be one looking over the concept.
I'll check thanks
I don't know if this is the right channel, but do you know if there is a server about data science for trading in Python, algorithmic trading, etc.?
I don't, sorry. You might want to hit up reddit and find related servers on specialized subreddit. Otherwise, look into data science applied to time series. That's what you want to look at.
hey guys
has anyone done this course on edX: Machine Learning with Python: from Linear Models to Deep Learning
it's an MIT course that uses python for machine learning
sike i saw the reviews for it it's pretty bad
For my question, I've now finished writing some code that tries to just provide each key a score based on simply how many times each word used in the dataset occurs in each key in the dict, but it's only 30.0% accurate which is just 1.5x better than if it were to just choose completely at random since there are only 5 keys atm. This should also give more clarity about the format of the dataset https://colab.research.google.com/drive/1EuMSz-Dcgulphjs8bRv1XO_lr7M5Doko?usp=sharing
So, any ideas how I could get started improving this accuracy using AI? Thanks so much in advance!
@mellow saffron I think you are going to have to go into more detail about what the problem is for people to be able to help.
Maybe this is similar to a small part of that problem, but I don't think it uses opencv.
https://aws.amazon.com/deepracer/
@mortal pendant Maybe look into TFIDF
https://en.wikipedia.org/wiki/Tf–idf#:~:text=In information retrieval%2C tf–idf,in a collection or corpus.
@re
any idea which one would be best suitable for this?
@mellow saffron Maybe someone else has better ideas than I do, but I am imagining that as a multi-step problem with a couple of challenging parts. I don't work much with images, so I am a little out of my depth. Building a CNN that is able to classify different plant types based on images might be a useful early step. Image segmentation might make the later tasks easier, but I haven't seen implementations of image segmentation outside of screenshots on aws sagemaker. Then there will also be the problem of identifying where to move based on the image, that seems more in line with the AWS DeepRacer that I posted. I have seen blogs where people showed code to solve toy problems that didn't have problems like tall weeds.
@mortal pendant This feels as much like a natural language processing (NLP) problem as a classification problem to me, but maybe I am missing the point. I am imagining using NLP to build features that you could then put into a logistic regression or other classification algorithm to obtain higher accuracy. It sounds sort of like the goal is to use ML in the solution, not necessarily to achieve the best accuracy by any method.
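To make the "NLP features into a classifier" idea concrete, here is a rough sketch with scikit-learn: TF-IDF features feeding a logistic regression. The two keys and their strings are made up for the example; the real dict would have 5 keys with hundreds of strings each.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the dict of key -> list of strings (made-up data).
data = {
    "cooking": ["boil the pasta", "chop the onions", "bake the bread"],
    "sports": ["kick the ball", "run the race", "score a goal"],
}

# Flatten the dict into parallel lists of texts and labels.
texts = [s for strings in data.values() for s in strings]
labels = [key for key, strings in data.items() for _ in strings]

# TF-IDF turns each string into a weighted word-count vector,
# and the logistic regression learns which words point to which key.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

prediction = model.predict(["bake the pasta"])[0]
print(prediction)
```

With the real data it would make sense to hold out part of the strings for testing, so the accuracy can be compared fairly against the 30% word-count baseline.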
but i take it pulling images from that stream is what i'd do in the end anyways?
@mellow saffron It is the simple hacky solution instead of the more elegant possibly more complicated solution. I am not saying it is the better way, but it feels easier to implement.
Short answer: No, I have no idea. Longer answer: That seems like it would be really dependent on what your end solution ends up looking like. Image processing is pretty computationally expensive though, which I would imagine might be challenging for a raspberry pi. I would probably search for projects on google and see if anyone has done something similar with some reasonable degree of success.
Has anyone had trouble with Virtualenv inside of pycharm? I used to use conda environments, but it felt like Virtualenv was preferred (I don't remember what caused me to think that.) I feel like since I switched I am having many more issues with my virtual environments failing for whatever reason (not finding a package that used to be there). Anyone have any experience similar to this / suggestions on one type of environment over another?
@hollow gull
I can't say much on the pycharm part, but in terms of virtual environments, I used to use pipenv which managed the virtual environments and packages for me.
Then I went to python's native venv since I wanted to switch package manager.
@heady hatch can you provide any context on why you wanted to switch or why you switched to python's native venv?
@hollow gull
I think I just wanted to try different package managers.
It was either poetry or pip-tools.
poetry is like pipenv, comes with its own virtual environment manager. But pip-tools did not, which means I had to manage my own virtual env.
I ended up going with pip-tools because it's more robust plus easier to transition across projects.
Though to clarify, I don't use any IDE.
So I'm not quite sure how the interaction with IDE is over there.
with vscode, it's just setting the venv's python as the interpreter.
and all its packages and environment stuff will be there.
Being honest, people in my circle don't really use Virtualenv anymore. Just because there's a native solution or they use the package manager for it.
Often in the latter.
Cool, thanks for the info.
Yeah, it should be that easy inside an IDE as well, I assume I just didn't do a good job of cleaning things up before transferring or there is some other historical issue that is confusing some part of the process.
Ahh fair fair. Would package reinstall be suitable?
Yeah, I tried that with no luck, then I deleted the virtual environment and started over with no luck, then I downgraded the package that was causing issues to an earlier version which seems to have 'fixed' it. Maybe I am assigning the blame to virtualenv and pycharm when it really was an issue with the library and I didn't remember upgrading to the most recent/broken? version.
It isn't like it is some unsupported package, it was numpy.
Oh man.
But hey, it is working again, so back to working on things that are interesting.
Hey guys, side question that's kind of related to data science.
Let's say your team and you share a development server on the cloud. How do you guys manage your ssh keys?
or do you just have a server for each dev?
I would like to get a bit of assistance regarding anaconda.
@lapis sequoia what assistance do you need?
@hollow gull I've been having trouble with virtual environments lately, this is my 3rd day not overcoming the issues so I've decided to try anaconda, which I've been told is a lot friendlier with virtual environments. 1. when I create the environment in anaconda navigator and then launch visual studio I don't get the README.md page, 2. if you look at the vscode terminal, the top right says bash, shouldn't it say conda or something similar?
I haven't used visual studio, and it seems like visual studio knowledge is an important part of answering your questions.
I would expect there to be some indicator in your terminal that you are in a virtual environment.
I am skimming this have you looked through it/ does it look potentially helpful? https://code.visualstudio.com/docs/python/environments
for some reason the visual studio sites are not working.
they are loading now.
thanks
"""However, launching VS Code from a shell in which a certain Python environment is activated does not automatically activate that environment in the default Integrated Terminal.""" seems like it might be an issue for you depending on how you are trying to get into the conda environment.
Later on they give examples of how to select the interpreter. Maybe you are still in the base environment and that is why it isn't indicating an environment.
if you're in a bash terminal & activate a venv, you'll see the name of the root folder of your venv in the terminal, which signifies that your venv is activated
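If the terminal prompt is ambiguous, one way to check from inside Python itself is sketched below. It only uses the standard library; `in_virtualenv` is a helper name made up for this example. Note that this check covers venv/virtualenv, while an activated conda environment is usually visible through the `CONDA_DEFAULT_ENV` environment variable instead.

```python
import os
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv/virtualenv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print(sys.executable)                        # which python is actually running
print(in_virtualenv())                       # venv / virtualenv check
print(os.environ.get("CONDA_DEFAULT_ENV"))   # conda env name, or None
```

Running this in the VS Code integrated terminal and in the editor's "Run" button shows whether both are using the interpreter you selected.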
anyone around here who is well versed in the Pandas package
@late jackal don’t gatekeep, just ask
not familiar with the term but thank you. i am trying to convert a column in minutes to datetime but i cant seem to find how to specify HH:MM:SS without including the month or days
pd.to_datetime
i have that but it seems like it wants like MM DD HH:MM:SS
@late jackal Are you sure that makes sense, can you have a datetime = 'date' + 'time' without a date? You could use today's date information to build a datetime with whatever time you want to specify.
Even a timestamp I think will always have a date as part of it, I think. Otherwise all of the benefits that I see of datetimes go away. What is subtraction on time hours and minutes without knowing what days they correspond to?
Mehn, I just tried to install kite on my PC based on someone's recommendation here. It couldn't install on my PC for some reason.
So sad. ☹️
I use a HP EliteBook 8440p
You'll just have to make do with vscode and such.
It will be great if anyone can help me with this question
Need some help with confusion matrices
pls lmk if anyone can help and ill send source code and dataset
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
```python
def draw_cm(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=[0, 1], yticklabels=[0, 1])
    plt.ylabel("Predicted_Values")
    plt.xlabel("Actual_Values")
    plt.show()

draw_cm(y_predict, z['Observed Loan Status'])
```
Error I'm getting: Classification metrics can't handle a mix of binary and unknown targets
predicted is usually a float
and what about z['Observed Loan Status']
but the z["Observed Loan Status"] is outputing: LogisticRegression(fit_intercept=False, random...
its a logistic regression
idk why its not also in binary
they both need to be binary for the confusion matrix to work lol
so that means something sup with my logistic regression im assuming
@austere swift logreg = LogisticRegression(random_state=42,fit_intercept=False) logreg.fit(X_train, Y_train)
this is my logistic regression code but its working fine
it returns: LogisticRegression(fit_intercept=False, random_state=42)
so i dont know where im going wrong
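One possible reading of the output above: `fit()` returns the estimator itself, so if that return value ended up in `z['Observed Loan Status']`, the column holds a `LogisticRegression` object rather than 0/1 labels. This is a guess from the printed repr, not something visible in the posted code. A minimal sketch on synthetic data of what the confusion matrix expects:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary target

logreg = LogisticRegression(random_state=42, fit_intercept=False)
fitted = logreg.fit(X, y)    # fit() returns the estimator, not predictions
y_pred = logreg.predict(X)   # this is the array of 0/1 labels

# Both arguments must be label arrays: (true labels, predicted labels).
cm = confusion_matrix(y, y_pred)  # rows: actual, columns: predicted
print(cm)
```

In the real code the second argument would be `logreg.predict(X_test)` and the first the matching true labels.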
fi = pd.DataFrame()
a = logreg.__dict__["coef_"]
#fi['Col'] = np.arange(0, 14)
fi['Coeff'] = a
fi.sort_values(by='Coeff',ascending=False)
fi
need some help with pandas dataframe functions
im getting an error that the data must be 1d
but i dont know how to proceed with that
any help would be appreciated
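A likely cause of the "must be 1d" error, sketched with a stand-in array: for a binary `LogisticRegression`, `coef_` has shape `(1, n_features)`, which is 2-D, and a single DataFrame column wants 1-D data. Flattening it first is one way around that. The numbers below are made up; only the shape matters.

```python
import numpy as np
import pandas as pd

# Stand-in for logreg.coef_ from a binary model with 3 features: shape (1, 3).
coef = np.array([[0.8, -1.5, 0.2]])

fi = pd.DataFrame()
fi["Coeff"] = coef.ravel()  # ravel() flattens (1, 3) into (3,)

# sort_values returns a new frame, so the result has to be assigned back.
fi = fi.sort_values(by="Coeff", ascending=False)
print(fi)
```

With the fitted model this would be `logreg.coef_.ravel()` (or `logreg.coef_[0]`) in place of the stand-in array.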
☝️What do you think of these books? Should I read them all?
Yes You Should
```python
MSE = tf.keras.losses.MeanSquaredError()
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
alpha = tf.random_normal_initializer(mean=0.0, stddev=0.05, seed=None)

def discriminator_loss(disc_real_output, disc_fake_output):
    real_loss = cross_entropy(tf.ones_like(disc_real_output), disc_real_output)
    fake_loss = cross_entropy(tf.zeros_like(disc_fake_output), disc_fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

def generator_loss(disc_fake_output, gen_output, target):
    total_loss = MSE(target, gen_output) + (alpha * cross_entropy(tf.ones_like(disc_fake_output), disc_fake_output))
    return total_loss
```
do yall see anything wrong with my loss functions? i think it's trying to convert datasets to tensors again but i dont see where im making the mistake of putting a dataset as an input?
cos im getting the same TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> to Tensor. Contents: <BatchDataset shapes: (None, 28, 28, 1), types: tf.float32>. Consider casting elements to a supported type.
i defined everything there in the train step
```python
with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
    gen_output = generator(image_batch, training=True)
    disc_real_output = discriminator(target_images, training=True)
    disc_fake_output = discriminator(gen_output, training=True)
    gen_total_loss, gen_gan_loss, gen_l1_loss = generator_loss(disc_fake_output, gen_output, target_dataset)
    disc_loss = discriminator_loss(disc_real_output, disc_fake_output)

generator_gradients = gen_tape.gradient(gen_total_loss,
                                        generator.trainable_variables)
discriminator_gradients = disc_tape.gradient(disc_loss,
                                             discriminator.trainable_variables)
```
and the error traceback saying the problem happening with mean squared error function
but i cant figure out what goin on
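Three things stand out when comparing the two snippets, though this is only a reading of the posted code, not a confirmed diagnosis: `generator_loss` is called with `target_dataset` (the whole `BatchDataset`) where a batch tensor such as `target_images` is probably intended; `alpha` is a `random_normal_initializer` object rather than a number; and the function returns one value while the train step unpacks three. A hedged sketch of a version that lines up with the call site, assuming a fixed scalar weight `ALPHA = 0.05` (that value is a placeholder):

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
ALPHA = 0.05  # assumed scalar weight for the adversarial term


def generator_loss(disc_fake_output, gen_output, target_batch):
    """target_batch must be a tensor batch, not a tf.data dataset."""
    # Adversarial term: the generator wants fakes to be scored as real.
    gan_loss = cross_entropy(tf.ones_like(disc_fake_output), disc_fake_output)
    # Reconstruction term between generated and target images.
    mse_loss = mse(target_batch, gen_output)
    total_loss = mse_loss + ALPHA * gan_loss
    # Three return values, matching the three names unpacked in train_step.
    return total_loss, gan_loss, mse_loss


# Tiny smoke test with random tensors in place of real model outputs.
fake_scores = tf.random.normal((4, 1))
generated = tf.random.normal((4, 28, 28, 1))
targets = tf.random.normal((4, 28, 28, 1))
total, gan, rec = generator_loss(fake_scores, generated, targets)
print(float(total), float(gan), float(rec))
```

In the train step the call would then pass the current batch, e.g. `generator_loss(disc_fake_output, gen_output, target_images)`.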
How do I split my database three times instead of two? currently using the train_test_split
@brave crest use train_test_split again on the dataset you want.
Oh! Thanks will do!
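A short sketch of the "call it twice" approach, with a list of integers standing in for the real dataset. The 60/20/20 proportions are just an example.

```python
from sklearn.model_selection import train_test_split

data = list(range(100))  # stand-in for the real dataset

# First split: hold out 20% as the test set.
train_val, test = train_test_split(data, test_size=0.2, random_state=0)

# Second split: 25% of the remaining 80% becomes validation (20% overall).
train, val = train_test_split(train_val, test_size=0.25, random_state=0)

print(len(train), len(val), len(test))  # 60 20 20
```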
@lapis sequoia What's your gen_output and disc_output, both disc output?
no
gen output is the output of the generator
and outputs an image
discriminator is a network that is binary classifier, saying either true or false
I apologize, to clarify what are they outputting.
Are they output tensors, datasets, etc etc.
gen should output a tensor that can be read as an image
disc is just binary yes or no to see if the image is fake or not
ill post my model
This is tf2, right?
2.3
yea
generator is an image generator literally and discriminator is a binary classification network
its 2am for me so im gonna sleep
okie dokes.
if you find anything in the model that i havent seen that might be cooking the whole thing and messing it up, lmk ping me/dm me im fine with either ive been looking at this trying to figure out whats wrong for hours on end now and i just cant pick it
appreciate any and all help : )
Yes You Should
@lapis sequoia
Okay Thank you, It will take a few 2month
i need help with clustering in python, im not sure which columns are suitable to perform clustering over in this dataset https://www.kaggle.com/aramacus/electricity-demand-in-victoria-australia
Why not all?
@glad mulch you can use pd.merge, df.merge, or df.join (there is no top-level pd.join). join uses the index by default, while with merge you pass the column you want to join on (which would probably be easier in your case). In your case it looks like you want to join on the ticker column.
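A minimal sketch of both options on two made-up frames that share a `ticker` column (the column names other than `ticker` are invented for the example):

```python
import pandas as pd

prices = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [150.0, 300.0]})
sectors = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "sector": ["Tech", "Tech"]})

# merge: name the column to join on.
merged = prices.merge(sectors, on="ticker", how="left")
print(merged)

# join: aligns on the index, so move ticker into the index first.
joined = prices.set_index("ticker").join(sectors.set_index("ticker"))
print(joined)
```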
@lapis sequoia what is the type of logreg.__dict__["coef_"]?
doesn't one column need to be the outcome? @heady hatch
For clustering? What do you mean and what do you have in mind?
i think i will use k-means clustering, doesnt one column have to be the true label
so i was thinking maybe sorting the demand into high, medium, low for the true label @heady hatch
I don't think I understand exactly what you mean yet, can you elaborate further? To clarify what I mean, it seems like you are telling me how you want to do something but not what you want to do. If you can go into what you are trying to do, it might help me understand what you are asking.
@glad mulch you want to iterate through unique values of the date index?
df.index.get_level_values('Date').unique() iirc
Or df.index.unique(level='Date') maybe
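Both spellings should work on a MultiIndex. Here is a small sketch on a made-up frame with `Date` and `Ticker` levels (the `Ticker` level and the prices are assumptions for the example):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-01", "2020-01-02"]), ["AAPL", "MSFT"]],
    names=["Date", "Ticker"],
)
df = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Two equivalent ways to get the unique values of the Date level.
dates_a = df.index.get_level_values("Date").unique()
dates_b = df.index.unique(level="Date")

# Iterate over the dates and pull out the rows for each one.
for date in dates_a:
    day = df.xs(date, level="Date")
    print(date.date(), len(day))
```

If the goal is just per-date processing, `df.groupby(level="Date")` is another option that avoids the explicit loop over index values.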
Hey guys, I am currently trying to scrape a website for the sake of studying data science but I'm currently stuck on the following:
<ul>
  <li>
    <h3 class="match">
      John
      <span class="number">
        123
      </span>
    </h3>
  </li>
  <li>
    <h3 class="match">
      Bob
      <span class="number">
        619
      </span>
    </h3>
  </li>
</ul>
...
I am using BS4 and I want to run a for loop which will iterate over all those h3s and return the name and number.
I was trying to use the following syntax:
for name, number in current.find_all('h3').text.split():
    list.append(MyObject(name, number))
But unfortunately I cannot do that since text cannot be run on the return value of find_all, any ideas on how to solve it in an elegant way?
I did that, used a temp variable and stored the returned array, then I just took indices 0 and 1 and put them in that object
But want something more elegant
P.S. thanks for the quick response 🙂
items = []
for elem in curr.find_all('h3'):
    name, num = elem.text.split(maxsplit=1)
    items.append(MyObject(name, num))
This is perfectly idiomatic python
That said there is a lot of inefficiency here
Let me get to a real computer and i will show you
TYTY
@twin moth the first problem is that find_all constructs a full list in memory, which is slow and wasteful if you're just iterating over it
ah you know what
beautifulsoup doesn't support lazy iteration
but you can write this as a list comprehension at least, which might or might not feel more elegant to you
items = [
    MyObject(*elem.text.split(maxsplit=1))
    for elem in curr.find_all('h3')
]
perhaps even more "fancy" is doing it like this:
def process_h3(elem):
    name, num = elem.text.split(maxsplit=1)
    return MyObject(name, num)

items = list(map(process_h3, curr.find_all('h3')))
@desert oar That's nice!
what would be more efficient is, if you only need h3 tags and don't need other stuff from the page, you can use SoupStrainer: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document
that lets you specify only specific parts of the page to parse, which could be faster
Nah, I'm using a shit-ton of other things
@desert oar is the last line in that code supposed to be in the function?
Oh, nvm, just noticed that's it called from it
Okay, so I've tried it all but seems like it doesn't work
I get a myriad of errors when I try it
Either TypeError: 'NoneType' object is not callable or TypeError: __init__() takes 2 positional arguments but 3 were given
I think that I'd just go back to using the one way that works
for li in soup.find('ul').find_all('li', class_=True, recursive=False):
    each = li.find('h3', class_='match')
    each = each.text.split()
    curr_poke.lister.append(MyObject(each[0], each[1]))
Hey... Can somebody tell me how to load a file from my PC to Tensorflow?
Oh, sorry, that's an older gen
for h3 in soup.find('ul', class_='bla').find_all('h3', class_='match'):
    each = h3.text.split()
    obj.lister.append(MyObject(each[0], each[1]))
Thanks for the help @desert oar , if you have any other ideas on how to fix it I'd be delighted to hear
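For what it's worth, here is a self-contained sketch of the loop against the sample HTML from above. `Match` is a hypothetical stand-in for `MyObject`, whose real definition was not posted. The `__init__() takes 2 positional arguments but 3 were given` error means the class's `__init__` accepts only one argument besides `self`, so the constructor signature is worth checking. Also, with whitespace-heavy markup `text.split(maxsplit=1)` leaves trailing whitespace on the second piece; `get_text(" ", strip=True)` avoids that.

```python
from dataclasses import dataclass

from bs4 import BeautifulSoup

HTML = """
<ul>
  <li><h3 class="match">
    John
    <span class="number"> 123 </span>
  </h3></li>
  <li><h3 class="match">
    Bob
    <span class="number"> 619 </span>
  </h3></li>
</ul>
"""


@dataclass
class Match:  # hypothetical stand-in for MyObject
    name: str
    number: str


soup = BeautifulSoup(HTML, "html.parser")
items = []
for h3 in soup.find_all("h3", class_="match"):
    # get_text(" ", strip=True) strips each text piece and joins them: "John 123"
    name, number = h3.get_text(" ", strip=True).split(maxsplit=1)
    items.append(Match(name, number))

print(items)
```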
Guys, would someone please help me understand this page of the documentation of "matplotlib" please?
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html
It regards the creating a bar chart. In which parameter do we place the height of each column? Where do we place the labels for which column?
@proud iron Never used it, but seems like there are examples in the bottom of the page, maybe you could use them
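A tiny sketch that maps the question onto the parameters of `bar`: the first argument `x` is where each bar sits, `height` is how tall it is, and `tick_label` sets the label shown under each bar. The fruit data is made up, and the Agg backend is only selected so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

labels = ["apples", "pears", "plums"]  # made-up categories
heights = [5, 3, 8]                    # one height per bar
x = range(len(labels))                 # bar positions along the x axis

fig, ax = plt.subplots()
bars = ax.bar(x, heights, tick_label=labels)
ax.set_ylabel("count")
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` at the end displays the chart.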
i posted this yesterday and some very nice people gave suggestions but I had to go before I could try to troubleshoot it with them
self.pframe1 = sql.ExcelWriter1()
self.pframe1['Time'] = self.pframe1['Time'].astype(float)
self.pframe1['Time'] = pd.to_datetime(self.pframe1['Time'], format='%H:%M:%S')
print(self.pframe1)
I have this code where i am trying to take a column of elapsed time in minutes and convert it to HH:MM:SS
yet im given this error
Traceback (most recent call last):
File "c:\Users\Nicks\Desktop\Deethanizer V4\HYSYSapp\source1\tankui.py", line 580, in onSaveData
self.darray.writer(file_dir)
File "c:\Users\Nicks\Desktop\Deethanizer V4\HYSYSapp\source1\tankui.py", line 802, in writer
self.pframe1['Time'] = pd.to_datetime(self.pframe1['Time'], format='%H:%M:%S')
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 803, in to_datetime
values = convert_listlike(arg._values, format)
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 454, in _convert_listlike_datetimes
raise e
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 418, in _convert_listlike_datetimes
arg, format, exact=exact, errors=errors
File "pandas\_libs\tslibs\strptime.pyx", line 144, in pandas._libs.tslibs.strptime.array_strptime
ValueError: time data '25' does not match format '%H:%M:%S' (match)
it should already be all floats but i added that just in case there was some issue with it not being floats
@twin moth cheers! 🙂
@twin moth I have frequently gotten errors like: TypeError: __init__() takes 2 positional arguments but 3 were given when I get confused and don't properly instantiate my objects. For example, if you are supposed to do MyObject().run() and you do MyObject.run(). I don't know if that is what is going on in your code, but error messages like that give me painful flashbacks.
@late jackal that seems like a really instructive error message. Your value seems to be the string '25' and pd.to_datetime doesn't know how to convert that to a '%H:%M:%S' format. Maybe you can use datetime.timedelta objects instead of datetimes? I would add a print(self.pframe1) before the line that is failing.
i figured i am converting the entire column to floats first though
im trying it with the print moved now. unfortunately i have to reload the whole app anytime i make a change to the code
C:\Users\Nicks\Desktop\datasheet.xlsx
Time
0 24.169001
1 25.581001
2 27.978501
3 27.978501
4 29.105001
5 30.398001
6 31.770001
7 33.011001
8 34.184501
9 36.915001
10 38.162002
11 40.650002
12 41.817002
Traceback (most recent call last):
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 450, in _convert_listlike_datetimes
values, tz = conversion.datetime_to_datetime64(arg)
File "pandas\_libs\tslibs\conversion.pyx", line 350, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'int'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\Nicks\Desktop\Deethanizer V4\HYSYSapp\source1\tankui.py", line 580, in onSaveData
self.darray.writer(file_dir)
File "c:\Users\Nicks\Desktop\Deethanizer V4\HYSYSapp\source1\tankui.py", line 803, in writer
self.pframe1['Time'] = pd.to_datetime(self.pframe1['Time'], format='%H:%M:%S')
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 803, in to_datetime
values = convert_listlike(arg._values, format)
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 454, in _convert_listlike_datetimes
raise e
File "C:\Users\Nicks\anaconda3\lib\site-packages\pandas\core\tools\datetimes.py", line 418, in _convert_listlike_datetimes
arg, format, exact=exact, errors=errors
File "pandas\_libs\tslibs\strptime.pyx", line 144, in pandas._libs.tslibs.strptime.array_strptime
ValueError: time data '24' does not match format '%H:%M:%S' (match)
im very confused at this it seems the value is 24.16 not 24
@hollow gull
is there a way to get the "type" of that column?
I would make a toy example to test out the logic outside of your entire app to make troubleshooting faster.
df.dtypes
yeah ive been thinking i should do that just didnt wanna have to look how to make a dataframe
i guess it isnt hard
df = pd.DataFrame([24.169001, 25.581001], index=[0, 1])
import pandas as pd
d = {'Time': [24.169001, 25.581001,27.565457]}
df = pd.DataFrame(data=d, index=[0, 1, 2])
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
print(df)
this is the test app
gives the same error
I have a dataframe where each row is a list of arbitrary strings (the columns are not meaningful) and I want to transform that to a table of string counts for each row.
can you use timedeltas instead?
thats the time passed between two points
i don't think I can because i have a bunch of data tied to those exact times
@hollow gull i can try but if its the difference between two time intervals i don't think that will work
e
@late jackal you can create a timedelta by subtracting two datetimes, but you can also create them directly like the example I sent. I don't know how you plan to use this later, so maybe it doesn't solve your use, but it makes more sense to me to use a timedelta if you are talking about a duration than a datetime, which really isn't meant to handle durations (in my opinion)
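A minimal sketch of building the timedeltas directly, assuming the `Time` column holds elapsed minutes as floats. The `df` and the `Elapsed` column name here are made up for illustration, not taken from the app:

```python
import pandas as pd

# Toy frame; assumes Time is elapsed minutes stored as floats.
df = pd.DataFrame({"Time": [24.5, 25.75, 90.0]})

# Interpret each float as a number of minutes and build a timedelta column.
df["Elapsed"] = pd.to_timedelta(df["Time"], unit="min")

print(df)
```

No subtraction of two datetimes is needed; the floats are converted to durations in one call.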
@glad mulch you can specify the suffixes so the left version doesn't have an added '_x', then you can drop the '_y' versions. Otherwise, look at the arguments; there might be one to prevent duplicating the columns, I don't recall off the top of my head.
@late jackal look up the documentation. google: python datetime.timedelta
doing so now lol
Or if you are in a good IDE it will show you the arguments. In pycharm, if you are inside the parentheses you can hit Ctrl+p and it will show you the argument names or you can hit Ctrl+B to navigate to the definition.
@hollow gull this seems to have worked and ill try implimenting it in a sec however it gives it to me in this format
Time
0 0 days 00:24:10.140060
1 0 days 00:25:34.860060
2 0 days 00:27:33.927420
i dont see a format arguement anywhere in the docs
any idea if i can get rid of Days
What are you going to do with these durations?
feed them into a program that calculates tuning parameters for valve controlers
that program asks for hh:MM:SS
sure it can be done manually in excel but kind of defeats the purpose of automating it
If you are going to do any arithmetic on them later, I would leave them as timedelta objects because it will handle those operations for you. If you really just want a number of hours, then why convert it out of a float?
because the program only accepts a .txt file with a column in hh:mm:ss
Yeah, I don't see a elegant solution, I can think of a couple hacky ones.
ok, thanks though ill keep hitting my head against the wall for a bit xD
could i split it into two columns? like one for days, the other for hhmmss
You can add the timedelta to a midnight datetime and then take the hours, minutes, and seconds from the constructed datetime.
timedelta.total_seconds() will return a float that you can then format as HH:MM:SS using a custom function.
that may work xD
import datetime
dt = datetime.datetime(2020, 1, 1)  # any midnight datetime works as a base
dt += datetime.timedelta(minutes=24.16)
dt.time().isoformat()
or
dt = datetime.datetime(2020, 1, 1)
dt += datetime.timedelta(minutes=24.16)
dt.time().strftime('%H:%M:%S')
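The `total_seconds()` route mentioned above could look roughly like this. `format_hms` is a hypothetical helper name, and the sketch assumes whole seconds are enough and that hours may run past 24 rather than rolling over into days:

```python
import datetime


def format_hms(td: datetime.timedelta) -> str:
    """Format a timedelta as HH:MM:SS, dropping fractional seconds."""
    total = int(td.total_seconds())
    hours, rem = divmod(total, 3600)   # whole hours, leftover seconds
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_hms(datetime.timedelta(minutes=24.5)))
```

Unlike the midnight-datetime trick, this never shows a "days" part, so durations over 24 hours still come out as a single HH:MM:SS string.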
in pandas i have a dataset. one column is the month and the other is the rainfall
there are multiple rows with the same month
i want to calculate the total rainfall for each month
how do i do it
because i want to plot a violin plot with that dara
data*
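One way to get a total per month is a groupby followed by a sum. This is only a sketch; the column names `month` and `rainfall` and the numbers are assumptions, since the real dataset was not posted:

```python
import pandas as pd

# Toy data; column names are assumed.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "rainfall": [10.0, 5.0, 2.0, 3.0, 4.0],
})

# One row per month, rainfall summed across all rows of that month.
monthly_total = df.groupby("month", sort=False)["rainfall"].sum()

print(monthly_total)
```

Note that a violin plot shows a distribution, so it would normally be drawn from the un-summed rows per month rather than from the totals.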
@hollow gull thought I'd let you know i did get it working so TYVM here is the code
self.pframe1 = sql.ExcelWriter1()
self.pframe1['Time'] = 60*self.pframe1['Time']
self.pframe1['Time'] = pd.to_datetime(self.pframe1["Time"], unit='s').dt.strftime("%H:%M:%S.%f")
print(self.pframe1)
I just created a RegEx that works perfectly but I'm wondering if it is disgusting 😆
Anyone care to review it?
I've opened this up as a question in #help-chili 🧐
can anyone help me in k-means clustering? i am really stuck, dont even know where to begin
@lapis sequoia ask your question
@velvet thorn i dont even know where to start. what columns columns can i do it over https://www.kaggle.com/aramacus/electricity-demand-in-victoria-australia
I got a question, is there any way I can get the 'food' value of username 'jack'?
username food
0 jack none
1 2 none
2 hishi 123123
3 ghfdsa 23
4 12 rfgvfbc
5 hihihiihi jack
6 g2 none23
7 g2qd 65s4
8 12 44
9 12 44
10 12 44ssss
using pandas
@velvet thorn any ideas?
Can someone recommend some good tutorials/resources for intent classification with pytorch?
good morning data hunters
I have a question, do we really decide the features to keep with heat map correlation? I mean what if it's like heart failure prediction and the heat map says that only 2-3 features of 10 are needed
would that calculate accurate?
no, do not use pairwise correlation to choose features in your model when you only have 10 features
unless you have a good reason to believe that only those 2-3 features are useful
e.g. if one feature is "is it raining today?" then obviously you can remove that feature
Hi! Can i ask here something about sympy?
Hey, I'm trying to write a script for looping through a VCF file (638 million lines) and an SNP list (15K lines) to match SNPs with chromosome and position and make an outfile with 3 columns, one for each piece of info. My code makes sense to me and I think it should work, and I understand that it will take a while; I'm just wondering if people have any experience with this or know about how long files of this size (~114 GB) usually take to loop through on a home PC
hey can someone tell me how to get statred with data science like i know python basic programming and the math but i dont know how to get started
@ripe field If you want a simple introduction to what ML is, checkout https://www.youtube.com/watch?v=IpGxLWOIZy4
Yo, do you guys learn pandas, matplotlib, numpy together or one after the other? Currently I'm learning pandas, can't do anything with it as I don't know matplotlib and numpy.
Do I learn matplotlib and numpy alongside pandas?
You can use plot directly from pandas, which uses matplotlib by default to make these plots
but you wont have to work with the matplotlib API directly
I dont know how you 'learn' that stuff one by one
but why not use one of the hundreds of toy projects which use these libraries
OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
u guys know what this is?
Anyone here know BERT?
@ruby wyvern
I’ve used a bit.
I’m trying compressive transformers now. It seems cool, but it’s not pretrained
@tawdry sage any way I can find a missing word in a sentence?
Without pytorch
too lazy to type it all out again
lol
I searched through GitHub
And internet
All uses pytorch
Sure. I haven’t tried though.
I’m not sure, but I think you have to fine tune it with the data format you want.
I found much easier to use it with pytorch
Pytorch is like numpy right?
But there is some documentation explaining how to fine tune in tensor flow
@ruby wyvern
Little different, it has some things that looks odd for a regular python syntax, but it’s not that hard
@tawdry sage what should I do?
@somber bane
I hope you’ve found out already.
df.loc[df.username == 'jack'].food
This one does that
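A small runnable version of the same lookup, using only the first few rows of the posted table as toy data:

```python
import pandas as pd

# First rows of the table from the question.
df = pd.DataFrame({
    "username": ["jack", "2", "hishi"],
    "food": ["none", "none", "123123"],
})

# Boolean mask selects the rows, the second argument picks the column.
jack_food = df.loc[df["username"] == "jack", "food"]

print(jack_food.tolist())
```

The result is a Series, because more than one row could match; `.iloc[0]` would pull out a single value if exactly one match is expected.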
Actually apparently you don’t have to re-train the model to fill the blanks.
But anyway, this seems to do the trick
You want to do it in tensorflow?
But why you don’t want to install pytorch? I don’t get it
Repl.it doesn’t work
@ruby wyvern
I don’t know that. Maybe you can use colab instead?
I’m google drive. Do you know?
Well. If you have Bert installed there, you can try to use it to predict the blanks.
But I really think it’s much easier to use torch or Albert
Hi everyone! My team and I are working on a collaborative python notebook called Deepnote. We have just opened the platform for public beta, so I'd love to hear your thoughts and feedback! Here's a sample Deepnote notebook showcasing Python 3.9: https://deepnote.com/project/09e2609b-986b-40fa-9f56-fcbbc60eb61d#%2Fnotebook.ipynb
A bit more context on the product: We've built Deepnote on top of Jupyter so it has all the features you'd expect - it's Jupyter-compatible, supports Python, R and Julia and it runs in the cloud. We improve the notebooks experience with real-time collaborative editing (just like Google Docs), shared datasets and a powerful interface with features like a command palette, variable explorer and autocomplete. We want Deepnote to be an interface that empowers programmers to collaborate and experiment easily. Looking forward to your feedback!
Hi. Anyone know how to drop rows in a dataframe based on the data type?
I have 1000 rows with str and list mixed. I want to drop all the rows that has a list in them.
df2 = df2[(type(df2['text']) != 'list')] isn't working.
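@vital marten you need to .map(type), since type(df2['text']) is the type of the whole column, and compare against list itself rather than the string 'list'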
Ah! It's working now. Thanks.
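For reference, a sketch of a per-row type check on a toy `df2`. It uses `isinstance` inside `.map` instead of comparing `.map(type)` to `list`, which is a slight variation on the suggestion above, not the exact code that was run:

```python
import pandas as pd

# Toy column mixing str and list values.
df2 = pd.DataFrame({"text": ["a", ["b", "c"], "d"]})

# True for every row whose value is a list.
is_list = df2["text"].map(lambda value: isinstance(value, list))

# Keep only the rows that are not lists.
df2 = df2[~is_list]

print(df2)
```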
What are some toy projects I could do to learn pandas
Isn't there the pandas cookbook?
and least there was something like this some time ago
with a bike dataset iirc
I am using the train_test_split function to create training, validation, and test sets. How do I split them 60/20/20? It keeps giving me "Found input variables with inconsistent numbers of samples"
So test_size 0.4 and then 0.5?
I got that correctly didnt I
you want to split the whole data into three groups, right?
so first split everything 60/40 and then split the 40% again in half
Yeah but i think im doing something wrong
so you end up with 60/20/20
Yeah but i get a inconsistent number error
I think I might be comparing it to a bigger dataset somewhere
Not sure, what your code looks like and when exactly there is an error
I fixed it. Well i think i did! I trained the model using my validation sets then predicted against the train one instead of the test one
I am quite new to this so I definitely sound stupid right now xD
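The two-step split described above could be sketched like this. The arrays are toy data and the variable names are assumptions; the "inconsistent numbers of samples" error typically appears when `X` and `y` passed to the same call have different lengths:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features. X and y must have the same length.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Step 2: split the held-out 40% in half -> 20% validation, 20% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The usual pattern is then to fit on the training set, tune against the validation set, and score once on the test set at the end.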
hi, idk if this is the right channel, but how can i save a dynamic file using matplotlib.savefig() ? right now im doing this
```py
plt.savefig('path/image_name.png')
```
i want something like
```py
plt.savefig('path/image_name_(id).png')
```
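One common way to do this is to build the filename with an f-string. A minimal sketch, assuming `image_id` is whatever identifier varies between plots and writing into a temporary directory instead of the real `path`:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()  # stand-in for the real output folder
saved_paths = []

for image_id in range(3):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [image_id, image_id + 1, image_id + 2])

    # The id is interpolated into the filename.
    path = os.path.join(out_dir, f"image_name_{image_id}.png")
    fig.savefig(path)
    plt.close(fig)  # free the figure once it is written
    saved_paths.append(path)

print(saved_paths)
```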
Do you guys know if it's possible to show graphs or pandas dataframes in PyCharm like one would do in Jupyter Notebooks?
https://www.jetbrains.com/help/pycharm/matplotlib-support.html#sm
This might help @twin moth
Thanks Thanos!
it works! thanks @quiet breach
Now that I have three sets how do I validate the trained data now?
Hi everyone, I don't know if it is the correct channel, but how can I remove vertical grids from a graph with matplotlib?
Hi, does anyone know of simple code on the web to do a LinearRegression but with a correlation coefficient as the loss function?
I got pyplot ax, and want to get handles or save this matplotlib drawing somehow
gradient_ax.savefig("Pendulum path.jpg")
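An Axes object has no `savefig` of its own; saving goes through the Figure it belongs to. A hedged sketch, where the plotted data and the output filename are placeholders:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

fig, gradient_ax = plt.subplots()
gradient_ax.plot([0, 1, 2], [0, 1, 0], label="path")

# Handles and labels of everything on the axes that has a label.
handles, labels = gradient_ax.get_legend_handles_labels()

# Save via the parent figure of the axes.
out_path = os.path.join(tempfile.mkdtemp(), "pendulum_path.png")
gradient_ax.figure.savefig(out_path)

print(labels, out_path)
```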
Create a student table and insert data. Implement the following SQL commands on the student table: ALTER table to add new attributes / modify data type / drop attribute UPDATE table to modify data ORDER By to display data in ascending / descending order DELETE to remove tuple(s) GROUP BY and find the min, max, sum, count and average.
Help me!
@undone crown https://stackoverflow.com/questions/16074392/getting-vertical-gridlines-to-appear-in-line-plot-in-matplotlib
This isn't technically what you asked about, but the solution to this problem also covers your problem. Hopefully it helps.
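For the opposite case of hiding the vertical lines, the grid can be switched per axis. A minimal sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [1, 3, 2, 4])

# Horizontal grid lines only: they belong to the y axis.
ax.grid(True, axis="y")

# Vertical grid lines belong to the x axis; turn them off explicitly.
ax.xaxis.grid(False)
```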
@true nacelle Thanks!!
Hi. I'm doing some data analysis on a csv, and I need to group together by region, the top 3 most expensive houses. I basically want "Region A: house1, house2, house3" and so on, for each region, as a df.
This is my dataset (note that this is edited/sub-df. The original df has over 20 columns, but they are not relevant in this case.)
So far, I've tried a few things, but I haven't gotten the solution in the way I'd like: grouped by region.
df2 = df.sort_values(by = ['Price'], ascending = False)
df2.groupby(['Regionname']).head(3) #this shows df sorted high-low price, but not categorised by region
This right here produces the original df with the prices sorted, but they are not group together by "Regionname". I'm at a bit of a loss as to how to approach it, even conceptually.
Has anyone got an example of how to use the "golden features" parameter in Catboost?
@true nacelle Hi Thanos, when one applies .groupby () it must be related to a summary measure so that it can group. Otherwise, you will not have to group.
I see.
@true nacelle you can use df.sort_values(['Regionname','Price'],ascending=[True, False]).groupby('Regionname',sort=True).head(3)
and wherever possible, use nlargest() instead of head(), if you have numerical values
Hello everyone, I'm practicing with the housing dataset and using vs code. I was trying to fit my data into a linear regression model after some tuning but I'm getting an error. It says unable to allocate 464. MiB for an array with shape (16512, 3683) and data type float64. Anyone know what this means?
Title of the error is Memory error.
also quoting from quora:
The order of rows WITHIN A SINGLE GROUP are preserved, however groupby has a sort=True statement by default which means the groups themselves may have been sorted on the key. In other words if my dataframe has keys (on input) 3 2 2 1,.. the group by object will shows the 3 groups in the order 1 2 3 (sorted). Use sort=False to make sure group order and row order are preserved
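A toy version of the top-3-per-region idea, assuming only the `Regionname` and `Price` columns matter. Sorting price descending within each region is what makes `head(3)` return the most expensive houses:

```python
import pandas as pd

# Toy data with two regions.
df = pd.DataFrame({
    "Regionname": ["A", "B", "A", "A", "B", "A"],
    "Price": [100, 500, 400, 300, 200, 250],
})

# Sort by region (ascending) and price (descending), then keep the
# first three rows of each region.
top3 = (
    df.sort_values(["Regionname", "Price"], ascending=[True, False])
      .groupby("Regionname")
      .head(3)
)

print(top3)
```

Because `head` keeps the row order of the sorted frame, the output stays grouped by region.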
Thanks a lot @green hemlock! I'll be sure to look into this! edit: This was really useful. It's so simple when you see it, but I struggle to "find" the solution.
@boreal summit do those dimensions look correct to you? It seems like your data got very wide, to the extent that it might be an error. I am assuming that is why it is taking up so much memory and you are getting the memory error.
Hi everyone, I am with the following problem. I want to convert a 'Date' column of my dataframe that has the format '09/21/2020' to the 'Sep-20' format. Can anyone help me on how I could do it?
You can parse it to datetime and then do strftime on it
@boreal summit it seems pretty self-explanatory to me. you don’t have enough memory.
what feature engineering did you do before that?
@undone crown oh.... why????
google datetime formats python
Okay, I am realizing that they way I find this isn't very linear... but then search for strftime or strptime. There will be a link named: strftime() and strptime() Behavior. to here: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior
Although never is often better than right now.
That has a table with all of the shortcuts when you are doing formatting of datetimes in python. So it looks like your format string will be '%b-%y' (abbreviated month name, then two-digit year)
you are crazy
putting it all together, I think this will work:
df['evenstrangerdateformat'] = df['weirdamericandatetime'].apply(lambda s: datetime.datetime.strptime(s, '%m/%d/%Y').strftime('%b-%y'))
@velvet thorn from what I garnered online, seems it's cause I'm using Python 32 bit, i want to upgrade to 64 bit and see if that will help.
I read that 32 bit Python can't run some complex stuffs that need memory.
@boreal summit I would really recommend looking at the shape of what you are putting into the LR and make sure it is what you are expecting... that seems really wide.
@boreal summit this might be true, but besides the point.
why do you have so many columns?
@hollow gull nope
prefer pd.to_datetime to .apply
and then .dt.strftime
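Spelled out, the vectorized version might look like this. The column names `Date` and `MonthYear` are assumptions, and the month abbreviation depends on the locale (English here):

```python
import pandas as pd

# Toy frame with US-style month/day/year strings.
df = pd.DataFrame({"Date": ["09/21/2020", "01/05/2021"]})

# Parse with an explicit format, then render as abbreviated month + 2-digit year.
df["MonthYear"] = (
    pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.strftime("%b-%y")
)

print(df)
```

This avoids a Python-level call per row, which is the main reason to prefer it over `.apply`.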
Might be my processing and stuff. I used standard scaler to scale the Input data.
@boreal summit doesn't add columns
what are all the steps you took?
did you OHE a continuous column or something
It's the housing dataset I was learning with.
No, I only did label encoder.
Hold on. Lemme screenshot it.
Pardon me please, I'm typing with my phone and doing stuff on my PC.
Okay, lemme login on my PC.
but yeah
I do not see anything particularly strange about that
actually
can you just tell me what data.shape is
I used to use jupyter with VScode. It was really slow and sometimes made really weird errors. Did anyone have the same problem? Does anyone have alternatives?
it was showing something like (323020, 1342)
...what dataset is this?
@boreal summit are you sure?
I already cleared all the output, and uninstalled python 3.8 but I can remember the R was in 6 digits while the C was in 4 digits
the housing dataset
Don't know where things went wrong
yea I'm sure
@boreal summit ...which housing dataset
do you mean the Boston housing dataset?
yea
I think that the columns are blowing up because you are doing a LogisticRegression with multi_class='auto'
the Boston housing dataset has shape (506, 14).
this one has 20,641 rows
I'm trying to predict the median house income according to the exercise
should I leave the Hyper parameters in default?
If it is a continuous variable you are trying to predict a regression model typically fits the problem better.
So LinearRegression instead of LogisticRegression
LogisticRegression is designed for a target variable with 2 or a few unique values, not something continuous.
okay, I will try it out and see what happens. Thanks alot, I appreciate.
I actually intended to use LinearRegression, didn't know it was Logistic I used.
Thanks everyone, @hollow gull it ran successfully this time.
The dataset successfully fitted in. I guess it's cause I was using logistic regression.
The dataset wasn't fit for the model.
Hey guys, i have a question around dataset and what actions to take. I have 3 variables that are not at all on the same scale. I need to normalize or standardize them so they result in something i can weight and then pull into a single score as a result.
You can use either StandardScaler or MinMaxScaler. Each has its own advantages and disadvantages.
My question being, can the ranges be set for scaling outside of the max/min of the data set
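If the bounds should come from domain knowledge instead of the observed min/max, the scaling can be done by hand. A sketch under assumed bounds and weights; `scale_with_bounds` is a hypothetical helper, and all the numbers are invented for illustration:

```python
import numpy as np


def scale_with_bounds(values, lower, upper):
    """Map values linearly so lower -> 0 and upper -> 1, clipping outliers."""
    values = np.asarray(values, dtype=float)
    return np.clip((values - lower) / (upper - lower), 0.0, 1.0)


# Three variables on very different scales, each with chosen bounds.
a = scale_with_bounds([20, 50, 80], lower=0, upper=100)
b = scale_with_bounds([1.0, 2.0, 3.0], lower=0, upper=5)
c = scale_with_bounds([300, 600, 900], lower=0, upper=1000)

# Assumed weights; they sum to 1 so the score also stays in [0, 1].
weights = np.array([0.5, 0.3, 0.2])
score = weights[0] * a + weights[1] * b + weights[2] * c

print(score)
```

Fixed bounds keep the scores comparable when new data arrives, whereas bounds fitted to the current data shift whenever the min or max changes.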
