#data-science-and-ml
why not just fill them up with the upcoming day values?
you mean the missing days is a pretty wide gap?
ah, then it wont work
generally speaking, in my weather prediction model there was some missing data, and pandas can autofill it, so it filled the gaps with the previous day's records.
Instead of me having to drop all those columns altogether.
yeah, but I am assuming that here entire rows are missing
so it would be pretty impossible to just fill them up for a while
would this interest you? https://stackoverflow.com/questions/16787038/insert-rows-for-missing-dates-times
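A minimal sketch of both ideas discussed above, assuming a daily series indexed by date (the dates and values are made up): `reindex` inserts the missing days as rows, and `ffill` copies the previous day's record into them. For a wide gap this just repeats one stale value many times, which is why it stops being useful there.

```python
import pandas as pd

# Hypothetical daily readings with 3 and 4 January missing entirely.
temps = pd.Series(
    [10.0, 11.0, 14.0],
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-05"]),
)

# Build the full daily calendar and insert the missing rows as NaN.
full_index = pd.date_range(temps.index.min(), temps.index.max(), freq="D")
with_gaps = temps.reindex(full_index)

# Forward fill: each missing day takes the previous day's value.
filled = with_gaps.ffill()
print(filled)
```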
hmmmm.....
okay, i am very very new to data science
i have used augmented dickey fuller tests and all to see if the multivariate time series is stationary or not and all
but i didnt understand what autolag meant
and i cant seem to find its definition on google either
could you tell me, what on earth is autolag?
adds lag to the variables?
you mean it repeats it?
so, it just fills up the breaks between the time periods, by repeating the variable
so.... what is it then?
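For what it's worth: in statsmodels' `adfuller`, `autolag` is the rule for choosing how many lagged differences go into the test regression. With `autolag="AIC"` (or `"BIC"`, `"t-stat"`) it tries lag counts from 0 up to `maxlag` and keeps the best one; with `autolag=None` it just uses `maxlag`. It does not fill gaps or repeat values. The sketch below is not the statsmodels code, only the idea in plain NumPy; `pick_lag_by_aic` and the random-walk input are made up for illustration.

```python
import numpy as np

def pick_lag_by_aic(y, maxlag):
    """Return the number of lagged differences (0..maxlag) with the lowest AIC."""
    dy = np.diff(y)
    # Use the same rows for every candidate so the AIC values are comparable.
    n = len(dy) - maxlag
    target = dy[maxlag:]
    level = y[maxlag:-1]                      # y_{t-1} for each target row
    best_lag, best_aic = 0, np.inf
    for p in range(maxlag + 1):
        cols = [np.ones(n), level]
        # Lagged differences dy_{t-1} ... dy_{t-p}
        cols += [dy[maxlag - i:len(dy) - i] for i in range(1, p + 1)]
        X = np.column_stack(cols)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        rss = np.sum((target - X @ beta) ** 2)
        aic = n * np.log(rss / n) + 2 * X.shape[1]
        if aic < best_aic:
            best_lag, best_aic = p, aic
    return best_lag

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=300))   # a non-stationary toy series
best = pick_lag_by_aic(random_walk, maxlag=8)
print("lag chosen by AIC:", best)
```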
```
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten
imgs = np.load("/Users/Kushi/PycharmProjects/"
"Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/X.npy")
targets = np.load("/Users/Kushi/PycharmProjects/"
"Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/Y.npy")
imgs = imgs.reshape((2062, 64, 64, 1))
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(imgs, targets, epochs=5)
```
ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]
why do i get this error even when i reshape the array?
You should only specify input_shape for the first Conv2D. The rest will figure it out themselves. If your data has shape (2062, 64, 64), then input_shape should be (64, 64).
If you're only reshaping in order to add the final 1 dimension to the data, there's no need to do that. Just remove the reshape, and remove the input_shape to the second Conv2D
@lapis sequoia
i get this error : ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]
when i removed the reshape and input_shape
oops
didnt notice the input_Shape in 2nd conv
i still get the same error though :(
@feral lodge
sry for ping
no worries. Gimme a sec
shall i send the current code?
No need! Just tell me, what does targets.shape give you?
Oh, I might be wrong actually. Go ahead and reshape imgs to (2062, 64, 64, 1). But set input_shape of the first Conv2D to be (64, 64, 1)
This works for me:
```
imgs = np.random.rand(2062, 64, 64, 1) # fake data
targets = np.random.rand(2062, 10)
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64, 1), activation="relu"))
...
```
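The shape bookkeeping behind that fix can be checked with NumPy alone. This is only a sketch with a zero-filled stand-in for X.npy: Conv2D wants a 4-D batch (samples, height, width, channels), and `input_shape` describes a single sample, so it drops the batch axis but keeps the channel axis.

```python
import numpy as np

# Stand-in for X.npy: grayscale images stored without a channel axis.
imgs = np.zeros((2062, 64, 64), dtype=np.float32)
print(imgs.ndim)                      # number of axes before the reshape

# Add the trailing channel axis; equivalent to imgs.reshape((2062, 64, 64, 1)).
imgs_4d = imgs[..., np.newaxis]
print(imgs_4d.shape)

# input_shape is the shape of ONE sample: everything after the batch axis.
input_shape = imgs_4d.shape[1:]
print(input_shape)
```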
@void anvil for time series?
to figure out if there's any AR structure you should look at the ACF and PACF plots
nah you dont actually look at the charts
you just check the ACF at p lags is > threshold
for what purpose
you mean you wanna use this as a feature to predict something, and you want to include multiple lags as predictors?
like Y ~ X + lag(X, 1) + lag(X, 2)
oh i see youre actually doing the AIC of the regression model
what is the point of the lags, because you think there's a cyclical component?
idea being, N lags should smooth those out?
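A small sketch of the `Y ~ X + lag(X, 1) + lag(X, 2)` idea, assuming pandas for the lags and plain least squares for the fit. The data is synthetic and the coefficients (1.0, 0.5, -0.3) are made up so the result can be checked.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200), name="x")

# Synthetic target that depends on x and its first two lags (no noise added).
y = 1.0 * x + 0.5 * x.shift(1) - 0.3 * x.shift(2)

features = pd.DataFrame({"x": x, "x_lag1": x.shift(1), "x_lag2": x.shift(2)})
data = features.assign(y=y).dropna()      # the first two rows have no lag values

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(len(data)), data[["x", "x_lag1", "x_lag2"]].to_numpy()])
coef, *_ = np.linalg.lstsq(X, data["y"].to_numpy(), rcond=None)
print(coef.round(3))                       # intercept, x, lag 1, lag 2
```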
Could you with ml predict if your country is going to enter a recession. You do have enough data
How can I extract only specific text from an image with OpenCV and Tesseract?
I want to try to extract only specific numbers from a lottery ticket.
We can do so by matching the aspect ratio of the text
But what if the numbers are pretty much the same font size?
I'm sure macroeconomists do use data science methods to make predictions. Whether they are successful, and to what degree, is a different matter.
1443/1443 [==============================] - 22s 15ms/sample - loss: 0.4330 - acc: 0.8614 - val_loss: 17.3330 - val_acc: 0.0000e+00
how can i fix this sorta thing
```
x_train = x_train.reshape((-1, 64, 64, 1))
model = Sequential()
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.3)
```
is the code
@lapis sequoia is the validation accuracy like this for all epochs?
yes
then yes it is a very extreme case of overfitting
shall i add 3 dropout layers?
you shall add dropout layers yes
You'll want to penalize the Dense weights as well, with weight decay. They call it regularizers.l2 here: https://keras.io/regularizers/
You may not have enough data also; you have ~2000 training points right? For 64x64x1 images and a network with so many kernels that may be a bit on the low side. The network might generalize better if you downsample the images and reduce the size of the network.
That's not a good idea if the small details of your images are important though. If you can squint your eyes and still easily tell what each image is, then it might work better
For contrast, the cifar10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) has 32x32 images, and 5000 training points for each class
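One more thing worth ruling out, and this is an assumption about the data rather than something established above: Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. If the arrays are sorted by class, the last 30% holds classes the model never trained on, which would give a validation accuracy of exactly 0. A sketch of shuffling images and labels together first, with small fake arrays standing in for the real ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake stand-ins: 20 samples sorted by class, labels one-hot over 10 classes.
labels = np.repeat(np.arange(10), 2)
x_train = rng.random((20, 64, 64, 1)).astype(np.float32)
y_train = np.eye(10)[labels]

# One permutation applied to both arrays keeps each image paired with its label.
order = rng.permutation(len(x_train))
x_shuffled, y_shuffled = x_train[order], y_train[order]

# Classes that land in the last 30% (what validation_split=0.3 would hold out).
tail = int(len(labels) * 0.3)
print("sorted data:  ", sorted(set(labels[-tail:].tolist())))
print("shuffled data:", sorted(set(labels[order][-tail:].tolist())))
```

The shuffled arrays would then be the ones passed to `model.fit(..., validation_split=0.3)`.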
Hi could somebody help me out with implementing batch normalization to this implementation? https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py?utm_source=share&utm_medium=ios_app I am not sure what to put in the learn function. @ me please
i am new to the field of machine learning. i just want to get started... can anyone please guide me on how to start... what is the correct sequence to follow for machine learning and AI
also can anyone tell me from where i can learn this any channel or youtube playlist for ML and AI
ill try using the cifar10 dataset
Just wanted to say
After months of initialising my dicts like a peasant, I discovered defaultdict
results_dict = defaultdict(lambda : defaultdict(int))
Ty Mr Python
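For anyone who hasn't met it yet, a short sketch of what that nested `defaultdict` buys you. The `(model, outcome)` pairs are invented just to have something to count.

```python
from collections import defaultdict

# Outer key -> inner key -> count; both levels are created on first access.
results_dict = defaultdict(lambda: defaultdict(int))

# Hypothetical (model, outcome) pairs.
observations = [("cnn", "hit"), ("cnn", "miss"), ("cnn", "hit"), ("svm", "hit")]
for model_name, outcome in observations:
    results_dict[model_name][outcome] += 1   # no "if key not in dict" needed

print({name: dict(counts) for name, counts in results_dict.items()})
```

One thing to keep in mind: merely reading a missing key also creates it, so `results_dict["svm"]["miss"]` returns 0 and leaves that entry behind.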
Hello! Why is a hand gesture detection neural network better than a Haar cascade for a typical gesture, please?
Hello, I've followed this tutorial to get a simple intent matching neural network:
https://towardsdatascience.com/build-it-yourself-chatbot-api-with-keras-tensorflow-model-f6d75ce957a5
The results are good, it's relatively fast and produces good results, except for one thing.
It always matches an intent, even if a query is nonsense it will match one intent. The probability is always high...
Is there anything I can do to fix this? I'm relatively new to neural networks and I'd appreciate if you add an explanation to whatever's going on! Thanks in advance
@upper ginkgo the reason this always outputs something is that you are having a softmax layer as output, softmax layers output a probability distribution, meaning that all output values of the layer will add up to 1. How to fix this I am not sure actually, I guess you could add another output for "nonsense" and add a few training examples which are nonsense so it knows what it has to qualify like that? But thats just my naive approach really
Thanks, @earnest prawn
If there's something else I can do I'd really like to hear it
I'd like to hear as well, my solution is as I said just a really naive one
This is a big issue in deep learning; neural networks are currently very poor at expressing uncertainty for previously unseen input. They tend to have a false sense of overconfidence, like you saw
The simple answer is that your network is overfit, so you'll need to regularize it harder and hope for the best
Consider this example! All data is vectors right, and in this example the data vectors are 2-dimensional
It's a binary classification problem. The little clusters of data is our labeled training set. The points are labeled 0 or 1. Our neural network outputs a classification digit; a number ranging from 0 to 1
What you see in the plot is the learned classification scheme of an overfit neural network. It's very black-and-white -- the network always produces a 0 or a 1. No room for uncertainty
But what if our network sees a new point, say at the coordinate (-2, -2) or (-2, 2)? That's very far away from the training data. Reasonably, such points probably don't belong to either of the two classes. But the network will happily misclassify them, expressing 100% certainty while doing so
Here we can also see the problem of adding fake nonsense data to the training set. How would we define such points? Those points would literally have to be everywhere surrounding these points
That's a lot of work, even in 2 dimensions
Language and image processing don't work in 2 dimensions, they work in 1000s of dimensions
The solution is to regularize the network
Here's the same problem, except the network is properly regularized:
Voilà! The network expresses uncertainty for weird data
Then the programmer will have to decide what levels of uncertainty are acceptable for their application
It's a very complicated problem though, and an area of active research. There is no way to guarantee that regularizing your net will work, unfortunately
The exact same issue has caused crashes in Teslas and other weird stuff you wouldn't expect
Like this poor Chinese guy who got a ticket for scratching his face. https://www.bbc.com/news/blogs-news-from-elsewhere-48401901 The convolutional neural network misclassified him as talking on a cell phone while driving
Sure! But each word is represented as a vector, right? Like an embedding https://en.wikipedia.org/wiki/Word_embedding or even simpler
Somehow, all data we work with is translated to vectors before we toss them through the network
So regardless of whether it's a text, an image, or just numbers we can always imagine them as points in a very high-dimensional vector space
And we always see the same kinds of problems
If you enter a text in Japanese, it might be classified with a 100% certainty, even though the network doesn't know what it's talking about
Likewise, if we make a network that can classify images of rotten apples from good apples and accidentally give it a picture of a shoe, the network might produce a very confident misclassification
"This shoe is definitely a good apple!"
Don't be discouraged though, regularization is the key. It might work well for you
Oh well, thanks for the information anyways
Is there somewhere I can learn about regularization?
What deep learning library are you using?
Keras/Tensorflow
I find many blogs and stuff if I just google "keras regularization". They're probably all good!
Thanks a lot
@feral lodge just wondering, I did some research before asking and found other "solutions" such as using a Bayesian neural network or Monte Carlo dropout. Would those also fix my issue?
The "good" plot I showed above is actually a Bayesian neural network! They're experimental, don't always work properly and usually take a looong time to train. When they do work, they tend to have good uncertainty estimates, better than usual regularization. If Keras supports them, you can give it a try if you want! Monte Carlo dropout is actually a Bayesian method in disguise, but I've never used it
If you're a beginner, it may be wiser to just stick to plain old networks with plain old regularization though. The classic regularization method for nns is weight decay. They call it regularizers.l2 here: https://keras.io/regularizers/
It works better now, thanks @feral lodge! But the nonsense is still getting through, nonetheless..
Is there something I can do here?
This is what I added to my code:
model = Sequential()
model.add(Dense(128, input_shape=(len(x[0]),), kernel_regularizer=regularizers.l2(0.01), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))
model.add(Dense(len(y[0]), kernel_regularizer=regularizers.l2(0.01), activation='softmax'))
Or do I just increase the value in l2?
Good job! It's a tricky problem, it'll probably never work flawlessly, particularly with this network architecture. But yes, you should play around with the l2 value
Raising it regularizes the network harder, so if it's too high the network will not be able to fit the data properly
Perhaps 0.5 would do the job?
Give it a try! I've never gone above 0.1 personally
But it's good to play around and test
I'm gonna keep that in mind, thanks again
If you want a non neural network solution you could also do something simple. Just from the top of my head, check if any of the top 1000 most used english words are in the comments and if yes send it to the network if not label it nonsense? Neural networks are nice but some things can also be solved by simple heuristics.
Although there are probably better heuristics than the one I suggested
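A rough sketch of that heuristic, assuming a whitespace tokenizer and a tiny stand-in word set. `COMMON_WORDS` and `looks_like_english` are hypothetical names, and a real list of the 1000 most common words would replace the set.

```python
# Tiny stand-in for a real "top 1000 English words" list (hypothetical contents).
COMMON_WORDS = {"the", "is", "what", "how", "can", "you", "i", "my", "a", "to", "hello"}

def looks_like_english(query, min_hits=1):
    """Return True if the query contains at least min_hits common words."""
    tokens = query.lower().split()
    hits = sum(token.strip(".,!?") in COMMON_WORDS for token in tokens)
    return hits >= min_hits

for query in ["how can I reset my password?", "asdf qwerty zxcv"]:
    label = "send to network" if looks_like_english(query) else "nonsense"
    print(query, "->", label)
```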
I agree, that's a good idea!
0.5 was a bit too high, nothing matched
and I guess it isn't necessary, the nonsense isn't getting through now:
@feral lodge thanks again, it works like a charm
I got a bit scared by bayesian stuff and those methods that required changing a lot of the structure, I'm glad I just had to change a few lines!
Great job, glad it helped
Bayesian neural networks are actually not as scary as you might think! Simply by using l2 regularization you've actually performed Bayesian inference over your network, in disguise. Your Bayesian prior distribution is a Gaussian, and your posterior distribution is a Delta
a good thing to remember is that softmax is a hack
"huh, we need to convert some real-valued outputs into a probability distribution, what do we do?"
"idk, take the ratio of exponents?"
they can be interpreted as probabilities, and our training losses treat them as such, but we should be cautious about actually treating them as "predicted probabilities"
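To make the "ratio of exponents" concrete, here is a NumPy sketch of softmax plus the simplest possible rejection rule: refuse to answer when the top score is below a cutoff. The 0.9 threshold and the `classify` helper are assumptions for illustration, and as discussed above a confident score on nonsense input can still slip through.

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalise; subtracting the max avoids overflow."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def classify(logits, threshold=0.9):
    """Return the winning class index, or None if the top score is below threshold."""
    probs = softmax(np.asarray(logits, dtype=float))
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

# Whatever the logits are, the outputs always add up to 1.
print(softmax(np.array([2.0, 1.0, 0.1])).sum())

print(classify([8.0, 0.5, 0.2]))   # one logit dominates
print(classify([1.0, 0.9, 1.1]))   # nearly flat scores
```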
Hey guys, has anyone used VSCode for ML/Data Analysis purposes? I installed the basic extensions like Python and Intellicode, but autocompletion for packages like pandas is very very lacking. Coming from a language where VSCode is a first class citizen, it's night and day how much more difficult Python is with it. Has anybody had better luck than me?
I need a Human Speech sentiment classification (e.g. Happy, Sad, Fearful, Surprised, Angry) data-set, please help.
anyone here ?
Hi, sorry but i need some help - I am trying to write a script to loop through data and identify people that fit the definition. - i am having a hard time getting started - the data is loaded with pandas and the dataframe is reduced to columns of interest. But writing the loop with the if and or statements is a problem. - anyone here that could set aside 15min ??
An increase in plasma creatinine by >0.3 mg/dl or a relative increase of 1.5-fold above baseline
together with severely elevated plasma MTX concentrations at one or more of the following time-points
after initiation of the MTX infusion:
36 hour >20 μM
42 hour >10 μM
48 hour >5 μM
- If someone would like to help i can post the dataset
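No loop is needed for a rule like this; boolean masks over the columns do it in one pass. The sketch below assumes one row per patient and invents every column name and value, and it reads "1.5-fold above baseline" as peak >= 1.5 x baseline, which should be checked against the actual definition.

```python
import pandas as pd

# Hypothetical column names and values; one row per patient/infusion.
df = pd.DataFrame({
    "crea_baseline": [0.5, 0.6, 0.4, 0.8],     # mg/dl
    "crea_peak":     [0.9, 0.7, 0.65, 0.9],    # mg/dl
    "mtx_36h":       [25.0, 30.0, 5.0, 1.0],   # μM
    "mtx_42h":       [8.0, 12.0, 2.0, 0.5],    # μM
    "mtx_48h":       [3.0, 6.0, 1.0, 0.2],     # μM
})

# Creatinine criterion: absolute rise > 0.3 mg/dl OR at least 1.5x baseline.
crea_rise = (df["crea_peak"] - df["crea_baseline"]) > 0.3
crea_fold = df["crea_peak"] >= 1.5 * df["crea_baseline"]

# MTX criterion: above the limit at one or more of the three time points.
mtx_high = (df["mtx_36h"] > 20) | (df["mtx_42h"] > 10) | (df["mtx_48h"] > 5)

# "Together with" means both criteria must hold.
df["meets_definition"] = (crea_rise | crea_fold) & mtx_high
print(df[df["meets_definition"]])
```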
Hey, I'm looking to apply some of the stuff I learned on unsupervised learning from the book Hands-on machine learning with (...) and I was wondering if you guys had good entry level unsupervised learning project ideas to recommend. Thanks
@surreal nacelle , you can start with market segmentation for example. Or any use case with clustering algorithms
Gonna look into it, thank you
someone recommended me mnist without label
which sounds pretty good tbh
Hey, can someone explain to me what's an epoch?
An epoch means a single training iteration over your full training set
So if you have 10000 training points, your computer might not be able to handle everything at once. Then you'll divide your training data into smaller chunks called mini-batches and train your model on those instead. If you divide your data into mini-batches of size 1000 for example, then 10 training iterations equals 1 epoch
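The same arithmetic as a tiny snippet, using the numbers from the example above (the 5 epochs are just an illustrative choice):

```python
import math

n_samples = 10_000      # training points
batch_size = 1_000      # size of each mini-batch

# ceil covers the case where the last mini-batch is smaller than the rest.
iterations_per_epoch = math.ceil(n_samples / batch_size)
print(iterations_per_epoch)

epochs = 5
total_updates = epochs * iterations_per_epoch
print(total_updates)
```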
Ohhh I see
Hey, trying to clusterize the mnist dataset with KMeans, (by applying PCA first), and I'm getting pretty bad results, which is not surprising, however, I'm not sure what the next step should be.
I tried using the PCA.components_ as centroid, but it didn't perform as well as 10 random init and 1000 iters.
What would you do in that situation ?
do your principal components look reasonable?
Hey, I'm using the Kobe Bryant dataset from kaggle.
I've been tasked to predict the shot_made_flag and to avoid data leakage by training on data prior to the date of the test data.
I've also been told to find the best k using 10 K-FOLD CV.
I think these 2 requirements contradict each other because if I use K-FOLD to split my data into train and test then the model will eventually train on data that occurred after the test data, so it makes no sense in this context to use K-FOLD but rather sort by date and manually split at this point for train and test and find the best k.
Can someone correct me if I'm wrong here?
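The concern sounds right for ordinary shuffled K-fold. One common way to reconcile the two requirements, which may or may not be what the assignment intended, is forward-chaining CV: sort by date, cut the data into consecutive blocks, and always validate on the block that comes after the training data. scikit-learn's `TimeSeriesSplit` follows the same idea; the sketch below is a plain NumPy version with a hypothetical helper name.

```python
import numpy as np

def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs where training rows always precede test rows.

    Assumes the rows are already sorted by date.
    """
    # n_folds + 1 contiguous blocks; block 0 is only ever used for training.
    blocks = np.array_split(np.arange(n_samples), n_folds + 1)
    for i in range(1, n_folds + 1):
        train_idx = np.concatenate(blocks[:i])
        test_idx = blocks[i]
        yield train_idx, test_idx

for train_idx, test_idx in forward_chaining_splits(n_samples=22, n_folds=10):
    print(len(train_idx), "train rows ->", test_idx)
```

The candidate values of k would then be scored by averaging over these 10 splits instead of over shuffled folds.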
hello guys I managed to put my data into a DataFrame and added target value ( which is just 3 prices in the future if its higher than current price its a 1 otherwise a 0). I have no clue which step is next to make this NN usable and how I would need to do it with my data.
alot of examples and stuff ive seen consist of 2 things, 1 row of samples and 1 row of labels
I have multiple rows of samples and 1 row of labels
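On the data layout question: the usual convention is one row per sample, several feature columns, and one label per row, so a DataFrame with many columns and a single target column already has the right form. A sketch of building the "price 3 steps ahead is higher" label with `shift`, using made-up prices and a single hypothetical feature column:

```python
import pandas as pd

# Hypothetical prices; in practice this is the price column of the DataFrame.
df = pd.DataFrame({"price": [100, 101, 99, 103, 102, 104, 98, 97]})

horizon = 3
future_price = df["price"].shift(-horizon)          # price 3 rows ahead
df["target"] = (future_price > df["price"]).astype(int)

# The last `horizon` rows have no future price, so drop them instead of labelling them 0.
labelled = df.iloc[:-horizon]

X = labelled[["price"]].to_numpy()    # shape (n_samples, n_features)
y = labelled["target"].to_numpy()     # shape (n_samples,)
print(X.shape, y.shape)
```

`X` and `y` in this form are what `model.fit` expects; more feature columns simply widen `X`.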
how seriously do you want to do this financial prediction?
because my usual prescription is: don't use ML for finance
with the further addendum: absolutely don't use DL for finance
if this is just for learning that's fine, then I can give tips
well im working towards using it for real trading
im trying to recreate something
This strategy robot I wrote in C# is using genetically evolved neural nets. It learns trading rules by itself. It's trained on 40% of data sample and validat...
I saw this around 2 months ago on youtube and started learning python and for the last 2 weeks ive been doing this ML stuff
with NNs
aha, well the bitcoin market is illiquid enough for some strats to work. still my recommendation is not to use DL for this. ML algos can work somewhat
unlike most problems, finance is a system where the whole market is actively trying to unlearn its own patterns
unless it's something fundamental to the market structure
I also emailed that guy and he said I need to look at reinforcement learning but as I just first wanna have something that works I figured maybe just make a NN that throws out something first
I can feed in anything
I mean I can do fundamental analysis too. The whole bitcoin market crashes on tweets and youtube videos
I also dont expect to get rich with it but if it can beat the market like in his video even if its by 1 % im on the right track
I only did 1 ML algo and that was linear regression, it sucked badly
I needed to feed it 4 of the 5 things
then I moved on to NN
are there any good machine learning tutorials i can learn about it?
depends, hows your math?
Does anyone have a ETA on the first tensorflow 2.0 release? The roadmap says Q2 2019.
Any idea on how to save a tensorflow model
predictions = estimator.predict(input_fn=predict_input_fn)
I wanna save the estimator thing so that next time I dont need to do the 10000 iterations
i believe that is what pickles are for
isn't there like tf.saver?
If it's just Tensorflow, i use tf.train.Saver yeah:
```
saver = tf.train.Saver() # create once, after the graph is built
saver.save(sess, saved_variables_path) # save
saver.restore(sess, saved_variables_path) # load
```
If it's keras, I use ```
from keras.models import load_model
model.save(saved_model_path) # save
model = load_model(saved_model_path) # load ```
I think these also save internal stuff like Adam's momentum values, but I'm not completely sure
linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])
linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])
linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)
For this code what does the slicing do? Specifically the 2 and 4
Posted in help but maybe someone here can help
looks like it's only using the first 2 / first 4 features
Yep that is what it's doing but i dont understand where the features are coming from, hang on let me grab the code
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=10)
X = poly.fit_transform(x)
obs_nums = np.arange(0, num_points)
np.random.shuffle(obs_nums)
top_70 = int(num_points * .7)
rand_train = np.sort(obs_nums[:top_70])
rand_test = np.sort(obs_nums[top_70:])
X_train = X[rand_train]
X_test = X[rand_test]
y_train = y[rand_train]
y_test = y[rand_test]
linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])
linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])
linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)
errors_train= np.array([np.mean((y_train - y_train_2) ** 2),
np.mean((y_train - y_train_4) ** 2),
np.mean((y_train - y_train_10) ** 2)])
errors_train = np.column_stack(([2, 4, 10], errors_train))
errors_test = np.array([np.mean((y_test - y_test_2) ** 2),
np.mean((y_test - y_test_4) ** 2),
np.mean((y_test - y_test_10) ** 2)])
errors_test = np.column_stack(([2, 4, 10], errors_test))
looks like you're taking the first 2/4 polynomial terms as features
I'm not sure where the feature values are coming from
Sorry in advance if im being super dumb
Yeah i follow that
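On where the features come from: they are powers of `x` generated by `PolynomialFeatures`. Assuming `x` is a single input column, `degree=10` produces the 11 columns 1, x, x^2, ..., x^10 in that order, so `[:, :2]` keeps 1 and x, and `[:, :4]` keeps terms up to x^3. The sketch uses `np.vander` to build the same layout without scikit-learn; the three input values are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Same column layout as PolynomialFeatures(degree=10) on one input column:
# 1, x, x^2, ..., x^10.
X = np.vander(x, N=11, increasing=True)
print(X.shape)

print(X[:, :2])   # columns 1 and x           -> a straight-line fit
print(X[:, :4])   # columns 1, x, x^2, x^3    -> a cubic fit
```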
Anyone here have experience working with the WebAgg backend of matplotlib
What would this kind of plot be called in matplotlib?
Its like a scatterplot, but how to connect each grouping of numbers based on value
@lilac reef contour?
Seems about right. Im just afraid contour is continuous values, but I'll look into it!
Thank you!
Yeah, def look right
Thanks :)
Update: Not quite. I think contour is working with curves.
I have a bunch of X, Y values that each have a category Z they fall into.
I want to draw lines around each Z group like they did in the pictured graph above
I think I'm going to plot a bunch of different scatter plots on top of eachother with different colors. Not sure how I'll connect them
@lilac reef what do you mean "working with curves"?
Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-dimensional plane. It graphs two predictor variables X and Y on the x- and y-axes and a response variable Z as contours. These contours are sometimes called the z-slices or the iso-response values.
Hmm, nevermind
I read something to the tune of contour plots plot slices of a plane
You can explicitly pass the levels (categories) you want contours at
So I have pretty much the same data as the picture I posted.
For each X and Y value, I have a resultant percent number (75, 76.5, etc.)
I have the list of X, Y (both at regular intervals) and want to find the high point for the percentage.
Would I pass in my X, Y as
plt.contour([graph_x, graph_y,], graph_results, [.6, .65, .7, .75])
for example?
Thank you for the help btw man :) This is a really tricky plot I'm trying to pull off but I think it will look good
I believe it looks like plt.contour(X, Y, Z, levels), so no need to bundle up X and Y as a list
For a regular grid, you might need to meshgrid X and Y together, something like xx, yy = np.meshgrid(graph_x, graph_y)
Yeah, I kept seeing meshgrid come up
For just throwing in my stuff as X,Y,Z I got Input z must be a 2D array
So each Z value must be paired with a given contour 'height'?
Context
What shape is Z? It expects X to be M long, and Y to be N long, so you pass a 2D array of values
It'll figure out the contours for your data automatically
Z is 100x1 for me. I have X, Y and Z in different arrays. It is a height value that is in-order connected with each X Y combo
Seems like I should have a different format
What do you mean by "in-order connected"?
Plot the height Z[1] at (X[1], Y[1])
and so on
I cant quite seem to wrap my head around what contour(X,Y,[N,M]) is expecting
X and Y must both be 2-D with the same shape as Z (e.g. created via numpy.meshgrid), or they must both be 1-D such that len(X) == M is the number of columns in Z and len(Y) == N is the number of rows in Z. hmmm
You say X and Y are regular intervals, but if Z[2] is (X[2], Y[2]), haven't you got heights along a diagonal?
Yes, you could just reshape it if you meant Z[2] is (X[2], Y[1]), for example
Each height is a 'result' of the value X and Y being fed into an algorithm.
For context: X and Y are two parameters I am doing a grid search over a learning algorithm.
Z is how accurate of a predictor it was for each X,Y parameter combo (in cross product form)
Ok, so you have a grid...Then len(X) * len(Y) = 100 and you can just Z.reshape((len(Y), len(X))) and it'll probably give you at least something
I'll give it a go
oof, bigger than I thought
I think this might be more of a headache than its worth >_>
Just trying to maximize my predictive power. I can manually sift through it.
Just kinda wanted a flashy graph for presentation
Thank you for help switchy! If its not as hard as I'm thinking let me know
It's not... or at least shouldn't be
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
y = np.arange(10)
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])
x and y are 10-element arrays, Z is (10, 10)
So what are the elements in Z?
z[j,i] = sin(i) + cos(j), for i,j ∈ [0, 1, ..., 9]
So Z[0,1] is mapped at location X[0], Y[1]?
Ok, I'll try to work off that for a bit
So how can my Z be indexed at Z[j, i] if my j and i are logarithmic and not integer?
Or do I need to just make a work-around
sorry, I guess I wasn't clear...
z[j,i] = z(x[i], y[j])
You're really super overthinking this
lmao. Im trying my best
import matplotlib.pyplot as plt
import numpy as np
x = 10**(np.arange(10) / 2)
y = 10**(np.arange(10) / 2)
plt.xscale('log')
plt.yscale('log')
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])
I would really appreciate if you held my hand and really spelled it out for me so I could do some big boy data science lmao.
All I have are these 3 lists.
I need to reshape my results (Z) to be 10x10 or something
Im really sorry mate, I just dont think I'm going to get it to work tonight
My outputs are just whack that Im working with
My X and Y are 100 long, but are the same 10 values repeating. X goes up 1,2,3, vs Y that is ten 1's, ten 2's in a row, etc.
I just cant late night code. My bad mate. I really appreciate it though
@lilac reef just do .reshape((10,10)) on all of them
Then you'll have 2D arrays for X, Y, and Z (which is also acceptable input to contour)
If you wanted to turn them back into 1D arrays, with X cycling fastest you could take the first 10 X values (graph_x[:10]) and every 10th Y value (graph_y[::10]) -- but you shouldn't need to do that in this case
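A sketch of that reshape on fake data laid out the way it was described (X cycling through its 10 values, Y holding each value for 10 rows). The scores are just 0..99 so the indexing is easy to verify; `x_values` and `y_values` are made-up parameter grids.

```python
import numpy as np

# Flat grid-search output: 100 entries each, values are made up.
x_values = 10.0 ** np.arange(10)
y_values = 10.0 ** np.arange(10)
graph_x = np.tile(x_values, 10)        # x0, x1, ..., x9, x0, x1, ...
graph_y = np.repeat(y_values, 10)      # y0 ten times, then y1 ten times, ...
graph_results = np.arange(100)         # fake scores

# Row j of each 2D array is one Y value, column i is one X value.
X2d = graph_x.reshape((10, 10))
Y2d = graph_y.reshape((10, 10))
Z2d = np.asarray(graph_results).reshape((10, 10))

# Z2d[j, i] is the score for (x_values[i], y_values[j]).
j, i = 3, 7
print(Z2d[j, i], X2d[j, i] == x_values[i], Y2d[j, i] == y_values[j])
```

The three 2D arrays can then go straight into `plt.contour(X2d, Y2d, Z2d, levels)` as described above.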
np.array(graph_x)
gotcha
I just got a half-chub
Thanks switchy <3
I think I can play with it from here
Great
God that's a thing of beauty
The data is so abstract it would take me like 10 minutes to explain and I love it
I'll pretty it up once I stop throwing graphs at the wall and seeing what sticks
Trying to get my axes right
Why use fancy matplotlib when your numpy array spyder auto-coloring does the job
what does numpy.argmax( )actually do?
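Short version: `np.argmax` returns the index of the largest value, not the value itself. That is why it shows up after softmax outputs, where the index is the predicted class. A small sketch with made-up numbers:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])
best_index = np.argmax(probs)             # position of the largest value
print(best_index, probs[best_index])

batch = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])
per_row = np.argmax(batch, axis=1)        # one index per row
flat = np.argmax(batch)                   # no axis: index into the flattened array
print(per_row, flat)
```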
any idea how to delete a row in a dataframe where the value of one column is false
I wanna scan the whole df and find the value false in a particular column and then delete that row
drop it @quartz stream
yea that should work
send ur code
code of ?
df = df.drop(['name of the colum'])
scan a column find false and delete that
indices = df[~df['myColumn']].index
df.drop(indices, inplace=True)
@quartz stream Pretty sure my snippet should do what you want.
lemme try I also think so @wicked flare
not working
I dont wanna delete a column
i'll scan a column called valid
find index where value is false
and delete all of em
so you only have certain blocks which are false?
or a whole row
@quartz stream This is smth im working on, do you want this? See it dropped row 1 and 2
like i said above u just drop it
@quartz stream My snippet doesn't delete columns, it deletes rows.
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
col1 col2
0 1 3
1 2 4
>>> indices = df[df['col1'] == 2].index
>>> indices
Int64Index([1], dtype='int64')
>>> df.drop(index=indices, inplace=True)
>>> df
col1 col2
0 1 3
>>>
It finds rows where the given column matches the condition and drops them.
Wow
You Sir are amazing
@wicked flare
is there any way I can get count of this thing
count?
You can just check the length of indices
No worries.
1 indices = data[data['Valid'] == 'False'].index
----> 2 df.drop(index=indices, inplace=True)
TypeError: drop() got an unexpected keyword argument 'index'
@wicked flare
Your dataframe isn't called df.
<ipython-input-24-22441e5dbdc2> in <module>()
1 indices = data[data['Valid'] == 'False'].index
----> 2 data.drop(index=indices, inplace=True)
TypeError: drop() got an unexpected keyword argument 'index'
What version of pandas are you using?
The index parameter to drop was added in 0.21 so presumably your version is older than that
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (0.24.2)
Anyway, this should also work: df.drop(indices, axis=0, inplace=True)
it's not good practice to use inplace drop, it doesn't work properly most of the time
What's wrong with it?
It's always worked for me but it's supposedly supposed to be deprecated.
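If `inplace` is a worry, the same result is available without `drop` at all: filter with a boolean mask and assign the result. This sketch assumes a `Valid` column like the one in the traceback above, with made-up rows, and shows both the boolean and the string case since the traceback compares against the string `'False'`.

```python
import pandas as pd

data = pd.DataFrame({"name": ["a", "b", "c"], "Valid": [True, False, True]})

# Keep only the rows where Valid is True; the original frame is left untouched.
kept = data[data["Valid"]]
n_dropped = len(data) - len(kept)         # count of removed rows
print(kept)
print(n_dropped)

# If the column holds the strings "True"/"False" instead of booleans,
# compare against the string instead.
as_text = pd.DataFrame({"Valid": ["True", "False", "True"]})
kept_text = as_text[as_text["Valid"] != "False"]
print(len(kept_text))
```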
guys any ideas for data science related projects for good practice with numpy, pandas as well as general python stuff ??
btw I've a rpi so maybe something with that? I've really no clue what I can do. I've done several data analysis and webscraping ones and also some db ones.
How would I determine the amount of states with Qlearning
When my environment is trading
I have 3 actions, buy, sell , hold
Thanks that's a great idea
@void anvil The q values mustn't be like that right?
I have this piece of code for the q_table
im just not sure what the [50] should be
as the 3 is the amount of actions
and the [50] should be the states
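The original snippet isn't shown here, so this is only a guess at its shape: a Q-table is usually a (states x actions) array, and for a continuous market signal the number of states is however many buckets you decide to discretize it into. In the sketch, the 50 bins, the percent-change feature, and the `to_state` helper are all assumptions.

```python
import numpy as np

n_states = 50     # assumption: the market feature is bucketed into 50 bins
n_actions = 3     # buy, sell, hold

# One row per state, one column per action.
q_table = np.zeros((n_states, n_actions))

# Hypothetical state: a price change in percent, bucketed between -5% and +5%.
# 49 edges produce 50 buckets (indices 0 .. 49).
bin_edges = np.linspace(-5.0, 5.0, n_states - 1)

def to_state(price_change_pct):
    """Map a continuous value to a row index in the Q-table (0 .. n_states - 1)."""
    return int(np.digitize(price_change_pct, bin_edges))

state = to_state(0.4)
action = int(np.argmax(q_table[state]))   # greedy action for that state
print(state, action)
```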
hello
I need help understanding what's happening at line 195 in this code: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L175
they are concatenating two data sets, training and values (I think)
why are they doing this?
that's what path.join does right, concatenate directories?
If by concatenating directories you mean concatenating the paths of two directories, you are correct. os.path.join is needed because Windows paths are separated by \ while Unix paths are separated by /, so you can't just write 'dir1' + '/' + 'dir2' on all systems and expect that to work. So in this case it seems you provide the path to your data and the script expects your data directory to contain two subdirectories named train and val. The path variables traindir and valdir just hold the paths to those subdirectories.
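A tiny sketch of what those two lines amount to. `data_dir` is a placeholder for whatever data path is passed to the script, and `val` is short for validation, not values.

```python
import os

data_dir = "/path/to/imagenet"   # placeholder for the data argument

# Nothing is merged or loaded here; these are just two path strings.
traindir = os.path.join(data_dir, "train")
valdir = os.path.join(data_dir, "val")
print(traindir)
print(valdir)
```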
Do I need to train-test-split if doing cross validation?
Is it possible to overfit with cross validation baked into the learning algorithm?
Hmm.
So I fit my model with the training set, and then cross validate on the whole thing? Or pure score it on test?
@void anvil
Like I'm using GridSearchCV. Isnt that cross validating the score for every model? Isnt that purely scoring based on whatever data you pass it
So I guess test the best model from the grid against the test set afterwards then?
Oh
:(
I've been overfitting pretty harshly
Shit
@lilac reef you can't usually rely on the CV scores from hyperparameter tuning to correctly estimate out-of-sample score
CV for hyperparameter tuning and CV for performance estimation are different steps
Gotcha
you can do a "nested" CV, or do the CV on your training set while keeping a holdout set for validation
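Both options can be sketched with scikit-learn. The iris data, the SVC model and the small grid are placeholders, not anything from the discussion.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10]}  # illustrative grid

# Option 1: tune with CV on the training part only, then score once on the holdout.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
holdout_score = search.score(X_hold, y_hold)

# Option 2: nested CV. The inner loop tunes, the outer loop estimates performance.
nested_scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=3), X, y, cv=5)

print(search.best_params_, search.best_score_, holdout_score, nested_scores.mean())
```

`search.best_score_` is the tuning score and tends to be optimistic; the holdout score and the nested scores are the ones meant as out-of-sample estimates.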
So if the goal is to maximize the score on the holdout set (That means the model is generalizable right?),
how should I go about hyperparameter tuning?
Or is whatever parameters GridSearch returns probably still the best, even if it overfits somewhat?
Wait, whats a validation vs testing set?
If this is just an easy google and I'm being really lazy dont reply lmao
eh
arbitrary
you might use "testing" for iterating on your model
then when you need to get some kind of final assessment before you start sharing this w/ your company's CTO, you run it on the validation set to get a more "pure" estimate of accuracy
in theory you should only ever use your test set exactly once
@desert oar I think you swapped validation and test?
@silent swan no, thats how i use the terms. but they arent formal terms
i usually use "test" as the "innermost" holdout set
i.e. inside the CV loop
ah
the general convention is that validation/dev sets are used for tuning
test is used for pure eval
i use it the opposite way
in fact ive never seen it used the way you describe
i usually call eval "eval"
i avoid using the term "test" except in code
Interesting. I've seen the notation @silent swan describes plenty of times and heard it described as the canonical way. But in my head i too switch around test and validation so that I tune on test and I can't quite remember where i learnt it, maybe it's just more intuitive.
maybe i just dont use kaggle enough
it's the standard in research as well
would be better to stick to the conventional naming imo
(also NLP is weird where they sometimes call the val set the "dev set")
at least that makes sense, its what you develop against
I'm guessing that's the origin
Would you consider Auto Encoders for feature selection a form of clustering?
So what would you call the new form the data takes at the middle of the Auto Encoder?
latent code
Is there a GridSearchCV in Sklearn api that can take the best model (i.e. RF and SVC)
instead of the model with the best params?
basically just a separate grid search for both, possibly using same list of parameters..
or.. try an ensemble
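One way to search over model types in a single GridSearchCV, as a sketch: wrap the classifier in a one-step Pipeline and treat the step itself as a parameter. The step name `clf`, the data and the grids are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A one-step pipeline whose "clf" step is itself treated as a hyperparameter.
pipe = Pipeline([("clf", SVC())])
param_grid = [
    {"clf": [SVC()], "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [20, 50]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_model = search.best_estimator_.named_steps["clf"]
print(type(best_model).__name__, search.best_params_)
```

Each dict in the list is its own sub-grid, so the SVC parameters are never combined with the random forest ones.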
sometimes i just send data back and say "i can't use this"
sometimes it's ambiguous and you literally can't know
its honestly pretty satisfying
but seriously though, MM-DD-YYYY was a mistake
anything other than YYYY-MM-DD for a data set is pretty bad
have fun sorting on DD-MM-YYYY
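A quick illustration of the sorting problem and the usual fix, assuming the dates arrive as DD-MM-YYYY text. The three dates are made up.

```python
import pandas as pd

raw = pd.Series(["03-02-2019", "01-12-2018", "02-01-2020"])  # DD-MM-YYYY text

# Sorting the raw strings orders by day first, which is not chronological.
print(sorted(raw))

# Parsing with an explicit format removes the DD/MM vs MM/DD ambiguity.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
iso = parsed.dt.strftime("%Y-%m-%d")

# ISO strings sort chronologically even as plain text.
print(sorted(iso))
```

The explicit `format` only helps when the convention of the source is actually known; it cannot resolve a file that mixes both.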
right
unless it's stored in a database like that...
nothing like a MM/DD/YYYY VARCHAR(255) timestamp!
oh lol excel
hmm doesn't excel represent all that consistently internally
i think pandas knows how to handle that
are they formatted as text or date?
if it's text you're fucked
if it's formatted as a date you might be safe
because you can reformat consistently
yeesh
yeah
uh
"this is your lesson not to be stupid"
"i can't work with this"
do people still use vba macros..
I would like a model that tries to predict the winner of a tennis match. It should focus on putting different weights or importances, varying from 1-100, on various aspects of a tennis player's game. As many options as possible should be put in there, but ideally the variables that can be measured between 1-100 are things like: win streak (amount of games won so far); won last match; lost last match; previous encounters with the same opponent; amount of points won in a game, plus the same over the last 5 games and the last 3 games; double faults in the last game, plus double faults over the last 5 games and the last 3 games; how they fare in similar situations, e.g. 3rd deuce as a server whilst 1 set down.
- Is this a good use of ML?
- Could anyone recommend some videos on how to accomplish this
- Is there a better approach?
@pine yoke you can use machine learning to predict match winners yes. look into "logistic regression"
the "weights" produced by the logistic regression can give you some sense of "importance" but they aren't directly comparable
Do you have recommendations for libraries to use?
scikit-learn
pandas for data processing
numpy for general matrix and vector math
and scipy for various other math
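A minimal sketch of that stack on synthetic data. The two features and the way the outcome is generated are invented for illustration; standardizing the inputs is one common way to make the coefficients a bit easier to compare, though it still does not make them a true importance score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Hypothetical per-match features (player A minus player B), on different scales.
win_streak_diff = rng.integers(-10, 11, size=n).astype(float)
double_fault_diff = rng.normal(0.0, 3.0, size=n)
X = np.column_stack([win_streak_diff, double_fault_diff])

# Synthetic outcome: longer streaks help, more double faults hurt.
logit = 0.3 * win_streak_diff - 0.4 * double_fault_diff
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Standardizing first puts the coefficients on a per-standard-deviation scale.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_[0]
print(dict(zip(["win_streak_diff", "double_fault_diff"], coefs.round(2))))
print(model.predict_proba(X[:3])[:, 1])  # P(player A wins) for three matches
```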
i am wondering, for data science jobs, isn't a masters degree required?
data analyst is also a data science job.. you can get a business degree and go into that
ohh business degrees can do data analysis?
i am not a business degree i am afraid
for data analytics or data science?
data science
ohh, then this program i know irl is kind of trying to scam people into thinking they can do data science without a masters
there is no standards organization that dictates what is required
it is hard to get a data science job without a masters because there are many people with masters degrees applying, and also because a masters degree signals a basic level of competence
without the masters degree you need to signal competence some other way
oh i see
but i think it's kind of misleading of the program though to tell people they can get a job in data science without a masters or letting them know of the full picture
or not letting them know*
eh, they don't
i don't want to reveal my location
Why is my NN performing so bad?
What could I change
This is the model its a very simple one
it could be that your thing doesnt have enough weights to optimize, the optimizer isnt good for your use case or that your data is polluted
@earnest prawn How do you mean my data is polluted?
im using adam optimizer
I have been experimenting a bit and it just cant get past 55% accuracy for some reason
and when I train does it make sense that those accuracies are all the same
Mathematically speaking it does make a lot of sense that these accuracies are the same if your optimizer has found a minimum
@onyx moth what happens if you increase the amount of neurons in your hidden layers
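import tensorflow as tf
from sklearn.model_selection import train_test_split
# x holds the 4 indicator values per sample, y the 0/1 labels (built elsewhere)
# note: train_size=0.1 trains on only 10% of the data and tests on the other 90%
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.1)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(4, input_dim=4, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
# single sigmoid unit: outputs the probability of label 1
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
scores = model.evaluate(x_test, y_test)  # [loss, accuracy]
print(scores)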
Im a bit new to Keras but what im doing is pulling BTC data from binance, putting it into 4 indicators and inputting the values of those 4 indicators into the NN (with labels: 1 if the price 5 prices further is higher, 0 if lower)
This model I just sorta got off the internet and have been playing around with it
it was meant for something else, not bitcoin
if I change the amount of neurons I dont see much improvement, but its very random, when I change the pair to for example ETHBTC and put it to forecast 50 days in the future accuracy jumps to 77%
but the moment I add volume to what I input accuracy dumps to 21^%
21%*
And you are sure that these 4 indicators have a relation to whether the prices will rise or not?
yes, but its not like if this happens on the indicator it will rise 100%, more like if this happens the probability of it rising is higher
Could make it better yes
And you are (very very very) probably falling into a bias of some kind (lookahead, overfitting, etc.)
Or are trying to predict the wrong thing, for instance, the price of most well-known financial assets can be predicted day-to-day with a very good accuracy using this model:
predicted_price_for_tomorrow = price_today
https://aitrader.ai/ I want to recreate something like these guys, but then for myself
@dim beacon Yes there are some good algos, its not so hard to program a bot which trades on a fixed algo, if u just do what the previous day's Heikin-Ashi candle says you will also catch all the big swings, but having a NN bot which can adapt to the situation looks cool to me but idk if its even something thats possible to make
@onyx moth algorithmic trading is, really, MUCH harder than most people think
Yea I couldnt come up with an algo that doesnt have a spot where it loses money
Like, there-exists-companies-paying-the-smartest-people-on-earth-millions-a-year-each-just-to-be-a-little-part-of-their-algos-strategies-development hard
Thats why I thought if I can implement some NN and add that as another if it would maybe cancel out some noise
But it must be possible for us right?
Possible? Yes. But do not expect being able to do it profitably before years of deep experience and very high-level knowledge in a lot of fields
lol idk if this is gonna work xD
Like, yeah it is possible, like it is possible you get a Nobel Prize one day, but you'll agree that it is not very likely to happen
true
ive only been messing with NN like 1.5 weeks or so
@dim beacon Do you have experience with Reinforcement learning?
Very few
okay now I have something that has a 64% accuracy and would like to try it out how do I now use this model to predict the next candles price?
?
model.predict(current_candle_data)
Also, best model, by far, to predict next candle price:
next_candle_price = current_candle_price
so u saying the best way to predict is to say that tomorrows price is the same as todays?
Yes
They do, but on average they don't on a day-to-day basis
well if u look at BTC its moved up 400% + in the past half year
Yeah, still
Most of the time it did not move that significantly from a day to the next one
well year before it moved down by 85% or so
Yeah, still, most of the time it did not move that significantly from a day to the next one
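To see why the "tomorrow equals today" baseline is so hard to beat on price error, here is a sketch on a synthetic random walk. This is simulated data, not BTC, and the comparison predictor (the overall average) is just a deliberately naive reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic random-walk "prices", not real market data.
prices = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=1000))

actual = prices[1:]
persistence = prices[:-1]                          # predict tomorrow = today
mean_guess = np.full_like(actual, prices.mean())   # predict the overall average

mae_persistence = np.mean(np.abs(actual - persistence))
mae_mean = np.mean(np.abs(actual - mean_guess))

# Price error looks great, yet the baseline says nothing about direction.
print(f"persistence MAE: {mae_persistence:.2f}, mean-guess MAE: {mae_mean:.2f}")
```

A model that reports low price error may simply have learned this baseline, which is why comparing against it is a useful sanity check.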
can anyone teach me how to make a neural network for hand written digits. ik theres hundreds of tutorials online but i find none of them really helpful
Why aren't they helpful ?
well i follow them. have about 20% clue of whats going on. and then it finishes without explaining how to save it or implement it into any other coding projects
the only one ive found useful was a Coursera one i did a while back but it was using Octave. It was more explaining the math behind it rather than explaining model structures
sounds like you need to learn tensorflow specifically
i feel like youd benefit more from a hands on machine learning book than from tutorials
could u recommend one?
and yes i think id like to learn the ins and outs of tensorflow
i havent used one so i dont know sorry
@turbid bay it is just an introduction, but I found it good https://www.youtube.com/watch?v=vq2nnJ4g6N0
what does it mean if I need to discretize values?
that you have to convert a continuous space into a discrete one
so for example map every real number to one of [-1, 0, 1]
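Two simple ways to do that with numpy, as a sketch. The return values and the 1% thresholds are arbitrary examples.

```python
import numpy as np

returns = np.array([-0.031, -0.002, 0.0, 0.004, 0.027])  # made-up daily returns

# Three classes by sign: -1 (down), 0 (flat), 1 (up).
signs = np.sign(returns).astype(int)

# Or bucket by thresholds; the cut points here are arbitrary.
edges = np.array([-0.01, 0.01])
buckets = np.digitize(returns, edges)  # 0: below -1%, 1: in between, 2: above 1%

print(signs)    # [-1 -1  0  1  1]
print(buckets)  # [0 1 1 1 2]
```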
the comment above literally explains why
I dont understand it I added it and it does nothing to my accuracy
it is not supposed to
it is literally supposed to "fix the random seed so the results are reproducible"
so if I run it with seed 7 again it will produce the same results?
yes
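A tiny demonstration with numpy's global seed, which is presumably what the line in question sets. Note that a Keras run can involve other sources of randomness too, so the numpy seed alone may not make training fully repeatable.

```python
import numpy as np

np.random.seed(7)
first = np.random.rand(3)

np.random.seed(7)   # same seed again
second = np.random.rand(3)

np.random.seed(8)   # different seed
third = np.random.rand(3)

print(np.array_equal(first, second))  # same seed, same numbers
print(np.array_equal(first, third))   # different seed, different numbers
```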
aaah okay
there should be a Rule #1
Don't do anything finance-related for your first data science project
i get stuck with bug
@void anvil , do you have exp with two stage modeling ?
like choice modeling or a simple classification where you run the first stage to predict some feature a which you will add to your 2nd stage variables to predict some class
https://hackernoon.com/dont-be-fooled-deceptive-cryptocurrency-price-predictions-using-deep-learning-bf27e4837151 I think this article popped my bubble of how good NN and that stuff is for trading. I think its for the average person not possible to come up with a good strategy unless u very good at it
Ill just keep it with algo trading then
But the idea of shifting the price is only to create a label and u should remove the shifted price right?
So it learns well here I should buy and memorise the situation of the other data u pass in
@void anvil . what i am trying to achieve is to group my options into clusters, which is done. Now as first stage I want to predict the cluster which user will choose the options from and then try to predict the option chosen which is in that cluster
how would you approach to such a problem ?
kind of
yes
instead of shape i have clusters
instead of colors i have options
exactly
it limits
i have done these two
we can switch to dm if it is fit for you
makes sense
ah okay
okay
let me now give more info
exactly
thats why i am now writing more info
so, lets assume, I have 3 users who were shown some items to buy.
first 2 got 50 items shown and the third one only 10
what I was doing previously, I was using features of items which are shown to every user and cluster them into several clusters. then i would create artificial cluster related features and combine them with item features and run a logit on them to find the item user has chosen
and now the idea is to split this task into a two stage choice model. first build a model which will predict the cluster from which user will choose and 2nd stage will be from that cluster find which item user will choose
i have not found much papers/works on this on internet
okay i found LIME
will look into that
i see
which model will fit for the first stage?
i think i have mistaken when i explained the use case
i already have lets say 6 clusters for 50 items that are shown to a user 1
yes
which item he will buy
the second one
yes
we have historical items that are shown to users and whether or not that item is bought
users are anonymized
we have around 10k users
around 1m items that were shown to them, some 40, some 50, some even 200 items
and for each user we know which items from those 40 or 50 they have chosen
no, we assume that users are kinda "unique"
I have done clustering on user level.
I took all items in my dataset which are shown to user A and clustered them into clusters, then user B, and so on for all 10k users. As a result, the clusters are "per user" only
what was the initial task here?
initial task is to predict the item which user will choose from the item pool that we have for that user ( i have no control over that pool)
while having historical "unique" item pools and also the item which was chosen
there always was a choice
that's what an economist would call a "discrete choice" model
for clustering, you said you've already done it, but my immediate instinct would be to use NMF on the buyer-product matrix
NMF gets you clustered users and items at the same time
and non-negativity is just nice i guess
you could use SVD instead if you didnt need non-negativity
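A rough sketch of the NMF idea with scikit-learn. The tiny random buyer-product matrix, the number of components and the hard argmax assignment are all assumptions for illustration; a real matrix here would be large and sparse.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Hypothetical buyer-product matrix: rows are users, columns are products,
# entries are purchase counts.
purchases = rng.poisson(0.3, size=(30, 12)).astype(float)

model = NMF(n_components=4, init="nndsvda", max_iter=500, random_state=0)
user_factors = model.fit_transform(purchases)   # shape (30, 4)
product_factors = model.components_             # shape (4, 12)

# A crude hard assignment: each row/column goes to its strongest component.
user_cluster = user_factors.argmax(axis=1)
product_cluster = product_factors.argmax(axis=0)
print(user_cluster[:5], product_cluster[:5])
```

The same components index both factor matrices, which is what gives user and product groupings at the same time.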
because of the economics side, i was advised to " use techniques which are more scientifically explainable and interpretable to management people"
first thing was to cluster items in every pool and derive some pool-cluster-itemspecific characteristics
bleh
what do you mean pool-cluster-item specific
like... you want to perform clustering within each pool?
that sounds totally mad
i derived additional features for every item which takes into account pool meta characteristics and + the cluster it belongs to within that pool
yes, that is exactly what i am saying
and youre feeding those into the decision model?
how many products do you have btw
because only one item in every pool is chosen
1m "unique" products + 10k users (10k pools)
also it looks like your original questions was about two-stage modeling. what exactly did you want to know about it? like how to propagate variance?
my question was about finding another approach to this problem by treating this as two stage problem
first stage is can we predict the cluster of items within a pool which user will be interested in
i agree with ragepope. i dont think that kind of "traditional" approach is going to work
like what you described
second stage is from the result of the first stage can we predict the item that user will choose
@desert oar computational will not be a problem
users only select from a subset of products, right?
they search for a product and are given a subset of products (which i have no control over)
you can see what they were shown, right?
you know what i'd do because i'm a lazy hack? i'd use NMF to get a low-rank Users cluster matrix and Products cluster matrix, then throw the user and product clusters into Vowpal Wabbit w/ negative-sampling loss
i have no idea if that would work, but it would be easy
the structure is:
- pool id,
- meta features , around 26
- was chosen or not
@desert oar I understood around 60 % of what you wrote
vowpal wabbit is a command line tool for fitting linear models
but you have 1m labels so you really need negative sampling loss or something like that -- it's what makes word2vec possible with even huge vocabulary sizes
vowpal wabbit will do pairwise/quadratic interactions and negative sampling loss
so it's a very fancy logistic regression for really lazy people
Hello, very basic user here.
I have a set of data in a table, and a function of 2 parameters of the type :
y = f(x1, x2)
I would like to have some kind of linear interpolation for any point on the graph based on the data I have. How would it be possible?
in "1D", np.interp does it
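For two inputs, one option is `scipy.interpolate.griddata` with `method="linear"`. This is a sketch on a made-up table; the linear stand-in for y is only there so the result is easy to check by hand.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical table: one row per measurement (x1, x2) with its y value.
x1, x2 = np.meshgrid(np.linspace(0, 4, 5), np.linspace(0, 2, 3))
points = np.column_stack([x1.ravel(), x2.ravel()])
values = 2.0 * points[:, 0] + 3.0 * points[:, 1]   # stand-in for the tabulated y

# Points where an interpolated value is wanted.
query = np.array([[1.5, 0.5], [3.2, 1.7]])
y_interp = griddata(points, values, query, method="linear")

print(y_interp)
```

Query points outside the region covered by the data come back as NaN with this method, so extrapolation needs separate handling.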
@void anvil i actually disagree, how to model this stuff well is all locked down as trade secrets
doing a thesis on purchase prediction and publishing it openly is valuable work imo
you think amazon doesnt?
sure. but in the real world youre almost never flying totally blind, its more just a matter of not knowing wtf to do with your metadata
i dont know what that even means
most users will have a unique combination of features, but that's just statistics
low variance X gives shitty predictions anyway
this is a project in collaboration with a firm. firm is interested in a predictive model which will tell them which items from the pool they get to show first and minimize the time on website.
for the academia it is to understand the decoy effect and make one item more likely to be chosen by "putting" the decoy item near it
sort of
amazon knows a lot more than that
they also know each users unique purchase and ad click history
which they do have
no? i thought that was the whole pool/search thing
they know what people searched for and clicked on
...right?
hm
i also dont like the term "pool" here
its misleading
its a search id
for a single search
@supple ferry do you really only have 1 search per user? or a whole history
and what are some examples of the other features?
and how big are the search pools anyway? first 10 results?
@desert oar so it is a search yes. and we have one search per "user"
what we have is search ids. and in the dataset we also have user ids, but they are all unique, it does not give us much info
so what is have is, user comes to website, searches for something -- as result he gets 1 - 200 results
and then chooses one
ahhhh
ok
if anything you only have one user
and your entire model will be effectively marginalized over user features
sorry if i explained in unclear way
i think he needs to hire me as a consultant
so i can quit my job and work on an interesting problem
our goal is to order those results in a way that user finds what he searches for and buys it and leaves the site
your model is: P( Y | U, Z ), where Y is the purchase decision, U is user metadata, and Z is other metadata (time of day, location, etc)
but i dont have any user metadata
right
not even age gender and etc
without knowing individual users, you are marginalizing over U
so you are fitting the model SUM_u P( Y | U=u, Z ) * P( U=u )
im abusing notation but hopefully you see what i mean
unless you start trying to disambiguate users using e.g. location, your whole thing needs to be interpreted from the perspective of "this is averaged across all users"
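Written out, the marginalization being described looks roughly like this. It assumes user traits U are independent of the context Z, which is a simplification not stated in the chat.

```latex
P(Y \mid Z) \;=\; \sum_{u} P(Y \mid U = u,\, Z)\, P(U = u)
```

The fitted model only ever sees the left-hand side, which is why its predictions describe an "average" user.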
yes
so no, do not treat "users" as unique
you have unique searches
that said you can probably recover some user-level metadata
yes
e.g. region, users usually don't leave their country or continent
so how big are these search pools
you mean result pool ?
which user sees ?
because it does not come from the firm side, it is not normally distributed and ranges from just 1 result up to 200
i can not do so because i do not have control on what results the user is getting, i have control on ordering of those results
from academia point of view i am interested in ordering of the results because of the asymmetric dominance effect
ok that's a start at least
i'd seriously consider returning to first principles
e.g. the only reason we can use logistic regression for this is because of random utility maximization
yes
i don't know if this is the right answer. but maybe write out the user's utility function
im not sure you even can write one out
because you have this weird varying budget set
yes
so the user's decision is going to depend not just on their utility but also on the budget set
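The link between random utility maximization and logistic regression can be sketched as a softmax over whatever items a search returned. The taste weights and item features below are invented; the point is only how the same item's probability changes with the set it is shown in.

```python
import numpy as np

beta = np.array([-0.8, 1.2])  # hypothetical taste weights, e.g. price and rating

def choice_probabilities(item_features, beta):
    """Softmax of utilities over the items shown in one search."""
    utility = item_features @ beta
    expu = np.exp(utility - utility.max())   # subtract max for numerical stability
    return expu / expu.sum()

# Two searches with different numbers of results (different "budget sets").
pool_small = np.array([[1.0, 0.5], [2.0, 0.9]])
pool_large = np.vstack([pool_small, [[0.5, 0.8], [1.5, 0.2]]])

p_small = choice_probabilities(pool_small, beta)
p_large = choice_probabilities(pool_large, beta)

# The same first item gets a lower probability in the larger pool,
# but the ratio between items 0 and 1 is unchanged (the IIA property).
print(p_small[0], p_large[0])
print(p_small[0] / p_small[1], p_large[0] / p_large[1])
```

That fixed ratio is exactly what a decoy effect would violate, so a plain logit of this form cannot capture it on its own.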
yes like i said, this is all in the sense of an "average" user
yes they choose from what they are given
this problem is interesting as it is hard
exactly
or actually get hard problems at work
instead of problems that are conceptually easy but just require fuck tons of programming
are you an economist btw
@void anvil im working on it lol
i wish
im not smart enough
@supple ferry no but i thought i was going to get an econ phd at one point
@void anvil thats interesting, i do like framing this as a search relevance problem
ok i think ive convinced myself that RUM still holds up here
but you totally do
preferences are a partial ordering
you wont reconstruct the complete order
sure there is, you see a collection of results and you know that chosen_item was preferred
preferred given the search results
thats the whole thrust of their investigation
yeah youll never do it all
no no
but i dont think thats the goal
that said @supple ferry do you actually have the search terms?
i want to order only the results a particular user sees
because i do think rage is on to something
im not sure what that achieves
the idea isn't to find a preference over all 1m items
oh i see
circumvent the problem entirely
w/ respect to the business "spend as little time on the page" criterion
yeah no thats totally reasonable
eh
not even shitty
if anything i think the question is a little misguided after this conversation
like... of course order matters a lot
and no, swapping the 198th and 197th items does not matter at all
nor does increasing the set from 201 to 202
if you really want to prove that, sure, you have a thesis
but if you want to investigate IIA violations in the wild you'll want to restrict your data set significantly
and you probably need more baseline data
e.g. how often does anyone ever click the nth search result?
i assume that after like 10-20 you get near-zero clicks
so not only "should" you use search term relevance like ragepope is saying, but your data is going to be very heavily dependent on the existing algorithm
@void anvil insurance, believe it or not
nope
btw actuarial would be really interesting if only regulators werent so strict about pricing
you know how everyone is clamoring for regulating AI and open/transparent algorithms and stuff? we've had that in insurance for years, and it holds back the industry
and it makes a worse experience for everyone
"why should i be penalized just because im under 25??"
"because state regulators dont understand XGBoost son"
right lol
but no i literally havent touched pricing at all
it sucks
ive basically spent the last year classifying businesses
yeah i love lime
its weird w/ text models though
yeah i know i cant remember the name either
i saw you mention it
hang on its in my zotero collection
SHAP
zotero to the fuckin rescue
i also like partial dependence plots a lot
oh shap has them too, nice
oooh nice python PDP lib https://github.com/SauceCat/PDPbox
i actually wrote my own partial dependence plot library for R a few years ago
i used it all the time
at my previous company they wanted to know how much users were willing to pay for a product upgrade
so i did exactly what QWERTY is doing, i fitted a discrete choice model w/ price as one of the inputs
then i got the partial dependence of price
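A hand-rolled sketch of a one-dimensional partial dependence curve for price, on synthetic data. The model, the features and the `partial_dependence_1d` helper are illustrative, not the code used at that company.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Synthetic data: column 0 is price, column 1 is some other feature.
X = np.column_stack([rng.uniform(5, 50, n), rng.normal(0, 1, n)])
p_buy = 1.0 / (1.0 + np.exp(-(2.0 - 0.1 * X[:, 0] + 0.5 * X[:, 1])))
y = (rng.random(n) < p_buy).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

def partial_dependence_1d(model, X, column, grid):
    """Average predicted probability with one column forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, column] = value
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

price_grid = np.linspace(5, 50, 10)
pd_price = partial_dependence_1d(model, X, column=0, grid=price_grid)
print(np.round(pd_price, 3))
```

Plotting `pd_price` against `price_grid` gives the "how does purchase probability move with price" curve that tends to be easy to explain to a business audience.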
the business loved it, it was the perfect combination of "i understand this" and "ai magic" and "we are domain experts"
i will dig into the directions you mentioned
@void anvil and @desert oar
wanna thank you both
for help and interesting discussion
put us in your thesis acknowledgements ;P
"and all those people with waifu and rick and morty avatars"
i have time till 2021
oh yeah you got time. im actually curious to see what comes out of this
i will share it with you when the time comes
holy shit
you have been through some serious stuff
yeah poor guy
4 weeks with kidney stones? that must have been some serious kidney stones
@void anvil i meant about notes one day before presentation
i think mine had a max page limit, i cant remember
max limit including charts
i remember it was brutal trying to pare it down
i was like tweaking line spacing and page borders in latex
heh
i also never had to present or anything, i just dropped the paper off at my advisor's office, i dont think i even met with him when i got the final marked up version back, i might have just picked it up from a drop box or something
like a fuckin term paper lol
that sounds way more intelligent
i basically turned a "fast track" 1 year masters into a full 2 year program
i did everything so weird and wrong
heh
my problem was partly that i got sick during my 3rd semester
fucked up my whole life
not to mention my schedule
lol right
is regression or forecasting used to predict rainfall?
I can't speak for all agencies, but I believe it's usually forecasting
@grizzled folio thanks
operational forecasting is pretty interesting though
It's not really that complicated
Regression and forecasting are two different things, you can use one or neither or both if you want
oh, then is regression used in estimating whether or not a person will default on a loan?
Hello, can u suggest big projects about data science(not includes deep learning) to examine which coded by experienced data scientist? I mean , complete project, using project structure etc. Not consist of jupyter notebooks? I want to see how big data science projects should be?
@lapis sequoia those are usually done professionally and not likely to be something you can find publicly
You might find some of the source code, or you might find the result of the project open source, but you're not likely to find a project where all of the intermediate work has been published
@unborn drum regression is a type of model, forecasting is a task to be achieved with a model
Thanks @desert oar
@void anvil paperswithcode is biggest discovery i made this year
Can anyone suggest a good method of object detection for finding specific images (although they may not be seen straight on) in a picture? I'm trying to detect street signs in real time
Ooh ok I'll check it out
Is it easy to train my own images on it?
Or would there already be street sign models
Thanks so much! My latest attempt was to run feature matches over circles detected on the frame. felt very janked
The code if you want
Paper 2 : https://arxiv.org/pdf/1707.09700v2.pdf
Is it resource intensive? I'm planning to run it on a raspberry pi but I could stream the video to my pc for processing if it was necessary
and github code https://github.com/yikang-li/MSDN
you can always train the model
then save the model to a path
that file (joblib) can be used
on raspberry
and it wont be resource intensive
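The save-then-load flow, sketched with joblib and a small scikit-learn model. The model and file name are placeholders; joblib is the usual route for scikit-learn estimators, while Keras models are normally stored with their own `model.save` instead.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Train once on the bigger machine.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save to a path (a temp directory here; the filename is arbitrary).
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# Later, e.g. on the Raspberry Pi: load the file and predict without retraining.
restored = joblib.load(path)
print(restored.predict(X[:5]))
```

Keeping the library versions the same on both machines is the safer assumption when moving such a file around.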
Okay, that's perfect then!
There are alternative for keras as well as tensorflow online
but if you are not into complex models try yolo, it is a single-shot detector ("you only look once")
Okay sounds good
hi! So I've built my first regression model in Keras and it doesn't look good. I'm not getting very accurate results, can anyone help?
@proud iris what data do you have, what kind of model are you using, how are you training the model, and how are you evaluating accuracy
Can anyone help me bounce ideas off of? I need project ideas that have business applications for a class. I've already done a stock bot but last semester. I need some fresh ideas.
I just created a data set, x*sin(x) is my evaluating function. I have three columns, x, sin(x) and their product. I have 3000 data entries, I'm using 2995 for training, last 5 as test cases
I'm very very new to this so please be patient with me if some of my questions appear very trivial
you're trying to learn the function x*sin(x)?
5 test cases isn't much
what kind of model is this?
1 output w/ some fully connected hidden layers?
yeah. Dense layers stacked
are you doing any hyperparameter tuning
can I show you my code
i guess. i dont really use keras and i'm not really a neural network aficionado
that would answer my questions though
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
no, I don't have any idea about hyperparameter tuning so I guess I'm not doing it?
yeah probably not
You could use talos for hp tuning
maybe instead of just the last 5, what if you randomly drop out points for train/test
This should be a very simple implementation? I shouldn't be needing hypertuning and other stuff?
data_train = data.sample(frac=0.9)
data_test = data.loc[data.index.difference(data_train.index)]
walk me through this, data_train gets 75 percent of the generated data
what is going on in data_test?
i changed it to 90 but yeah
pandas has this notion of an "index"
basically a label for each row
yeah like excel
sorta yeah
data.index.difference(data_train.index) gives you the index values from data that aren't in data_train
then you use .loc to get the corresponding rows
alternatively you can do
is_train = np.random.choice([True, False], p=[0.9, 0.1], size=data.shape[0])
data_train = data.loc[is_train]
data_test = data.loc[~is_train]
random
(in both cases)
that doesn't make your model better but at least evaluating it will be more representative
Also take a look at this https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
whats wrong with 300 values @proud iris ?
if your model is bad, going from 700 to 990 values isn't going to make it better
but it will improve your evaluation because you're going from 10 to 300 in your test set
Also, like srl said before you should really be doing at least a 20% test
How will I check the values of 300 entries?
or the mean squared error
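Putting the split and the error metric together, as a sketch. The x range and the "predict the training mean" placeholder are assumptions; in practice the prediction would come from the trained network.

```python
import numpy as np
import pandas as pd

# Recreate the kind of data described above: x, sin(x), and their product.
x = np.linspace(0, 10, 3000)
data = pd.DataFrame({"x": x, "sin_x": np.sin(x), "target": x * np.sin(x)})

# Random 90/10 split using the index difference.
data_train = data.sample(frac=0.9, random_state=0)
data_test = data.loc[data.index.difference(data_train.index)]

# Placeholder "model": always predict the training mean of the target.
prediction = np.full(len(data_test), data_train["target"].mean())

# Mean squared error summarizes all 300 test rows in one number.
mse = np.mean((data_test["target"].to_numpy() - prediction) ** 2)
print(len(data_train), len(data_test), round(mse, 3))
```

A trained model should come in well below this mean-only baseline; if it does not, the problem is in the model or the training rather than in the evaluation.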
but why is my model not predicting well? Is the code correct?
probably because you have no hp tuning and you're using the same type of neurons for like 4 layers
throw a sig layer in there see what happens.
im not even sure what the value of stacking relus like that is. wouldnt you just use like 1 big hidden linear layer for this kind of thing
or like a hidden layer and a sigmoid cause sin
like i said im not a NN guy
so correct me if im wrong
Also, you need a validation set.
then do a confusion matrix on the validation set with prediction set
Do 75, 15,10 sets
Literally just get 10% of your data and use that as your validation set.
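A 75/15/10 split can be done by shuffling row positions once and slicing. This is a sketch under assumptions: the helper name `three_way_split` is hypothetical, and which of the two smaller parts is called "validation" versus "test" is a choice made here.

```python
import numpy as np

def three_way_split(n_rows, fractions=(0.75, 0.15, 0.10), seed=0):
    """Return shuffled, non-overlapping index arrays for train/val/test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(fractions[0] * n_rows))
    n_val = int(round(fractions[1] * n_rows))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]  # whatever is left over
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas frame, `data.iloc[train_idx]` would select the training rows. Tune on the validation part and touch the test part only once at the end.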
I don't know any of those things @void anvil so please explain a bit. Also the diagram
Also, Rik, make that last layer a linear layer. Idk if it does that by default but I do it for good measure
@dusty latch I am already using 10 percent as my validation set. But what did you say about the confusion matrix?
a confusion matrix basically compares the true labels of your validation set against your predicted values.
okay. So just subtracting one from the other?
You can get a better measure of model usefulness from sensitivity and specificity. A cm will give that to you.
It does a lot more than just subtracting one from the other.
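To show what a confusion matrix gives beyond a subtraction, here is a minimal sketch for the binary case. It assumes class labels (0/1); the arrays are made-up examples, and a regression output would first need to be turned into classes for this to apply.

```python
import numpy as np

def binary_confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # predicted 1, actually 1
    tn = int(np.sum(~y_true & ~y_pred))  # predicted 0, actually 0
    fp = int(np.sum(~y_true & y_pred))   # predicted 1, actually 0
    fn = int(np.sum(y_true & ~y_pred))   # predicted 0, actually 1
    return tp, fp, fn, tn

# Made-up validation labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, fn, tn = binary_confusion_matrix(y_true, y_pred)
sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives left alone

print(tp, fp, fn, tn)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```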
the one thing I don't get is that there are examples and tutorials where this code is enough to produce accurate-ish results. And my function isn't that difficult to learn is it?
@dusty latch will look into it then.
You're just not using enough data points and tuned hp
just look up hyperparameter tuning in python keras and you'll have loads of things.
What was that bit about linear layers? Mine is a dense one, isn't that how it should be? every neuron connected to all other neurons in the previous layer?
yeah I will definitely look into it
You know how you had layers before as relu? Add something like that to the last layer but use linear
yeah but why?
okay not the layer but the activation function
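One way to see why the output activation matters: each activation limits the values the last layer can produce. A small NumPy sketch, assuming the target looks like `sin(x)` and so takes values in [-1, 1]; the function names are illustrative, not Keras API. In Keras, a `Dense` layer with no `activation` argument already applies no activation, which is the linear case.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def linear(z):
    return z

# Pre-activation values the last layer might compute.
z = np.linspace(-3, 3, 7)

# relu can never output a negative number, and sigmoid stays strictly
# between 0 and 1, so neither can reach the negative half of sin(x).
# A linear output is unrestricted.
for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("linear", linear)]:
    out = fn(z)
    print(f"{name:8s} min={out.min():.3f} max={out.max():.3f}")
```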
crap I need to learn quite a bit
how many neurons does one use in the hidden layers? Is it trial and error?
Won't that create a bias for the validation data?
If you keep tuning your hyperparams too much on your validation set then yes you're going to overfit
Guys I'm doing some capstone project about Anomaly Detection(Outlier Detection) on Credit card fraud dataset on Kaggle.
The dataset has labels (whether it's fraud or not), so initially I was going to do some supervised learning, but in a real-life scenario we don't get to work with labels, so I was going to use unsupervised learning methods to detect outliers.
Yes I'm pretending I have no information about labels but will use it later to evaluate my unsupervised learning methods' performance.
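The workflow described here (score without labels, evaluate with them afterwards) can be sketched as follows. Everything in it is an assumption for illustration: the data is synthetic rather than the Kaggle set, and the scaled distance from the median is a simple stand-in for whichever outlier detectors get compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 980 ordinary points and 20 shifted "fraud" points.
normal = rng.normal(0, 1, size=(980, 2))
fraud = rng.normal(6, 1, size=(20, 2))
X = np.vstack([normal, fraud])
labels = np.r_[np.zeros(980, dtype=bool), np.ones(20, dtype=bool)]  # held back

# Unsupervised step: score each row by its scaled distance from the
# per-feature median. No labels are used here.
center = np.median(X, axis=0)
scale = X.std(axis=0)
scores = np.sqrt((((X - center) / scale) ** 2).sum(axis=1))

# Flag the top 2% of scores as outliers.
threshold = np.quantile(scores, 0.98)
flagged = scores > threshold

# Evaluation step: only now are the labels consulted.
hits = (flagged & labels).sum()
precision = hits / flagged.sum()
recall = hits / labels.sum()
print(f"flagged={flagged.sum()}, precision={precision:.2f}, recall={recall:.2f}")
```

Running several detectors through the same evaluation step is what makes the comparison fair.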
The problem I'm facing here is that maybe it is too ...simple? I'm basically trying out many different outlier detection models and simply comparing them.
I'm wondering if you guys have any brilliant idea to make this part fancy or impactful enough?
hmm.. if you don't have labels, how do you plan to discriminate
there is a distance measure we typically use in banking.. for the life of me, i'm not able to recall it right now
but it's what's usually used when trying to find who is at risk of defaulting.. or committing credit card fraud
the name of the distance measure ends in 'ov' , I'll ping you if I remember it
@lapis sequoia Hey Tron, yea there are multiple algorithms I can use for outlier detection. Like you said, I have a couple of options based on Euclidean distance, ensembles, probability, or clustering.
I'm generally concerned about the depth of the project. I feel like I'm just laying out options and simply telling them what worked the best.
And I don't feel like it's complicated enough
ok.. you want to tell them you tried a bunch of methods and compare
so do that
seems like a comparative study of classifiers for this application.. hmm..
Yea... still it's just not deep enough. I could probably do some hyperparameter optimization for better performance but then that's totally going against the concept of unsupervised learning
check out current papers and see if there's a gap you could fill
Yea I'll do that on the side. Thanks man
Hi guys
I own a small bakery
How can I use data science skills to boost my sale or help my business?
Can someone help me
@native rivet is this an actual question or is this a homework question
First get some data from your business and try to answer questions you have with it and explore it in general, see if something stands out
@native rivet
Later you can optimize your business using it and get a data strategy
But remember only to collect the data you really need, because data without a purpose is worth nothing
Questions like:
When do customers come to my shop
What do they buy
Are the customers somehow related
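The first two questions can be answered with a couple of `groupby` calls once sales are logged. A minimal sketch: the `sales` table and its column names (`timestamp`, `item`, `price`) are hypothetical, standing in for whatever the till or a spreadsheet exports.

```python
import pandas as pd

# Hypothetical sales log: one row per item sold.
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-03-01 08:05", "2024-03-01 08:20", "2024-03-01 08:45",
        "2024-03-01 12:10", "2024-03-01 12:30", "2024-03-01 16:00",
    ]),
    "item": ["croissant", "coffee", "croissant", "sandwich", "coffee", "cake"],
    "price": [2.5, 3.0, 2.5, 6.0, 3.0, 4.5],
})

# When do customers come? Count sales per hour of the day.
sales["hour"] = sales["timestamp"].dt.hour
sales_per_hour = sales.groupby("hour").size()

# What do they buy? Total revenue per item, largest first.
revenue_per_item = (
    sales.groupby("item")["price"].sum().sort_values(ascending=False)
)

print(sales_per_hour)
print(revenue_per_item)
```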
I suggest you entice the customers with some deep baking. Use densely connected flavours and many hidden layers in your pastries. Also look into batch normalisation if you're baking large batches of cup cakes. LSTM (Luxury Sweet Tasting Macarons) are in high demand, though some prefer TCN (Trdelník Containing Nutella). Best of luck
It was an attempt at humor 
Lads, I'm struggling to think of ideas for my python data analytics class. Last semester I had the same class but in R and I did stock market direction prediction. Any ideas on something cool I can do? I don't want to do something just plain like the iris dataset or whatever
Thanks but I forgot to add that it needs to have a business case