#data-science-and-ml

1 messages · Page 207 of 1

sand reef
#

missing days?

#

why not just fill them up with the upcoming day values?

#

you mean the missing days is a pretty wide gap?

#

ah, then it wont work

#

Generally speaking, in my weather prediction model there was some missing data. Apparently pandas can forward-fill gaps, so it filled them with the previous day's records.

#

Instead of me having to drop all those columns altogether.
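
For reference, a minimal sketch of that fill idea, assuming a daily series with a short gap (the dates and temperatures are made up). `ffill` copies the previous day's value forward, `bfill` uses the upcoming day's value, and `limit` caps how many consecutive missing days get filled, which matters when the gap is wide:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with two consecutive missing days.
dates = pd.date_range("2019-01-01", periods=5, freq="D")
temps = pd.Series([10.0, np.nan, np.nan, 13.0, 14.0], index=dates)

filled_prev = temps.ffill()            # carry the previous day's value forward
filled_next = temps.bfill()            # use the upcoming day's value instead
filled_limited = temps.ffill(limit=1)  # fill at most one day per gap

print(filled_prev.tolist())
print(filled_next.tolist())
print(int(filled_limited.isna().sum()), "value(s) still missing with limit=1")
```

With a wide gap, a small `limit` leaves the rest as NaN so the remaining hole can be handled explicitly instead of being papered over.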

#

yeah, but I am assuming, here, entire rows are missng

#

*missing

#

so it would be pretty impossible to just fill them up for a while

#

hmmmm.....

#

okay, i am very very new to data science

#

i have used augmented dickey fuller tests and all to see if the multivariate time series is stationary or not and all

#

but i didnt understand what autolag meant

#

and i cant seem to find its definition on google either

#

could you tell me, what on earth is autolag?

#

adds lag to the variables?

#

you mean it repeats it?

#

so, it just fills up the breaks between the time periods, by repeating the variable

#

so.... what is it then?

lapis sequoia
#
```
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten

imgs = np.load("/Users/Kushi/PycharmProjects/"
               "Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/X.npy")
targets = np.load("/Users/Kushi/PycharmProjects/"
                  "Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/Y.npy")

imgs = imgs.reshape((2062, 64, 64, 1))
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(imgs, targets, epochs=5)```
#

ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]

#

why do i get this error even when i reshape the array?

feral lodge
#

You should only specify input_shape for the first Conv2D. The rest will figure it out themselves. If your data has shape (2062, 64, 64), then input_shape should be (64, 64).

#

If you're only reshaping in order to add the final 1 dimension to the data, there's no need to do that. Just remove the reshape, and remove the input_shape to the second Conv2D

#

@lapis sequoia

lapis sequoia
#

i get this error : ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]

#

when i removed the reshape and input_shape

#

oops

#

didnt notice the input_Shape in 2nd conv

#

i still get the same error though :(

#

@feral lodge

#

sry for ping

feral lodge
#

no worries 👌 Gimme a sec

lapis sequoia
#

shall i send the current code?

feral lodge
#

No need! Just tell me, what does targets.shape give you?

lapis sequoia
#

okay

#

(2062, 10)

feral lodge
#

Oh, I might be wrong actually. Go ahead and reshape imgs to (2062, 64, 64, 1). But set input_shape of the first Conv2D to be (64, 64, 1)
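
To make the shapes concrete, here is a small NumPy-only sketch (fake data, smaller batch) of what that reshape does. Conv2D wants batches shaped (samples, height, width, channels), so a grayscale stack needs a trailing channel axis of size 1, and `input_shape` is everything after the sample axis:

```python
import numpy as np

# Fake grayscale images: 8 samples of 64x64, no channel axis yet.
imgs = np.random.rand(8, 64, 64)

# Add a trailing channel axis; equivalent to imgs.reshape((-1, 64, 64, 1)).
imgs_4d = imgs[..., np.newaxis]

# What the first Conv2D would take as input_shape: drop the sample axis.
input_shape = imgs_4d.shape[1:]

print(imgs.shape, "->", imgs_4d.shape, "input_shape =", input_shape)
```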

#

This works for me: ```
imgs = np.random.rand(2062, 64, 64, 1) # fake data
targets = np.random.rand(2062, 10)

model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64, 1), activation="relu"))
...
```

lapis sequoia
#

okay

#

thx ill try it

desert oar
#

@void anvil for time series?

#

to figure out if there's any AR structure you should look at the ACF and PACF plots

#

nah you dont actually look at the charts

#

you just check the ACF at p lags is > threshold

#

for what purpose

#

you mean you wanna use this as a feature to predict something, and you want to include multiple lags as predictors?

#

like Y ~ X + lag(X, 1) + lag(X, 2)
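
A rough sketch of what such lagged predictors could look like in pandas, assuming a frame with columns named `X` and `Y` (made-up values). `shift(k)` moves a column down by k rows, so row t sees the value from t-k; the first rows have no history and are dropped:

```python
import pandas as pd

# Made-up series; the column names X and Y follow the formula above.
df = pd.DataFrame({"X": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "Y": [2.0, 4.0, 6.0, 8.0, 10.0]})

# Build lag(X, 1) and lag(X, 2) as extra predictor columns.
for lag in (1, 2):
    df[f"X_lag{lag}"] = df["X"].shift(lag)

# Rows without a full lag history contain NaN and cannot be used for fitting.
model_frame = df.dropna()
print(model_frame)
```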

#

oh i see youre actually doing the AIC of the regression model

#

what is the point of the lags, because you think heres a cyclical component?

#

idea being, N lags should smooth those out?

onyx moth
#

Could you with ml predict if your country is going to enter a recession. You do have enough data

lapis sequoia
#

How can I extract only specific text from an image with OpenCV and Tesseract?

#

I want to try to extract only specific numbers from a lottery ticket.

#

We can do so by matching the aspect ratio of the text

#

But what if the numbers are pretty much the same font size?

wicked flare
#

I'm sure macroeconomists do use data science methods to make predictions. Whether they are successful, or to what degree, is a different matter.

lapis sequoia
#

1443/1443 [==============================] - 22s 15ms/sample - loss: 0.4330 - acc: 0.8614 - val_loss: 17.3330 - val_acc: 0.0000e+00

#

how can i fix this sorta thing

#
```
x_train = x_train.reshape((-1, 64, 64, 1))
model = Sequential()
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.3)```
#

is the code

earnest prawn
#

@lapis sequoia is the validation accuracy like this for all epochs?

lapis sequoia
#

yes

earnest prawn
#

then yes it is a very extreme case of overfitting

lapis sequoia
#

shall i add 3 dropout layers?

earnest prawn
#

you shall add dropout layers yes

lapis sequoia
#

still the same

#

even after adding 3 dropouts

#

yep

feral lodge
#

You may not have enough data also; you have ~2000 training points right? For 64x64x1 images and a network with so many kernels that may be a bit on the low side. The network might generalize better if you downsample the images and reduce the size of the network.

#

That's not a good idea if the small details of your images are important though. If you can squint your eyes and still easily tell what each image is, then it might work better
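
One more thing that may be worth checking. This is an assumption about how the arrays are ordered, not something confirmed above. Keras takes the `validation_split` fraction from the end of the arrays before any shuffling, so if the samples are stored sorted by class, the validation set can consist of classes the network never trained on, which would also give a validation accuracy of exactly 0. A NumPy sketch with made-up sorted labels, plus a shuffle that keeps images and labels aligned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in data: 20 samples, labels sorted by class (0,0,1,1,...,9,9).
x = np.arange(20, dtype=float).reshape(20, 1)
y = np.repeat(np.arange(10), 2)

# The last 30% of a class-sorted array only covers the last few classes.
n_val = len(x) * 3 // 10
tail_classes = set(y[-n_val:].tolist())
print("classes in an unshuffled validation tail:", sorted(tail_classes))

# Shuffle x and y with the same permutation before calling fit().
perm = rng.permutation(len(x))
x_shuf, y_shuf = x[perm], y[perm]
```

If the tail really does hold only a few classes, shuffling once before `fit` is a cheaper first experiment than adding more dropout.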

lapis sequoia
#

Hi could somebody help me out with implementing batch normalization to this implementation? https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py?utm_source=share&utm_medium=ios_app I am not sure What to put in the learn function. @ me svp

calm tundra
#

I am new to the field of machine learning. I just want to get started... can anyone please guide me on how to start? What would be the right sequence to follow for machine learning and AI?

#

Also, can anyone tell me where I can learn this? Any channel or YouTube playlist for ML and AI?

lapis sequoia
#

ill try using the cifar10 dataset

gilded dagger
#

Just wanted to say

#

After months of initialising my dicts like a peasant, I discovered defaultdict

#

results_dict = defaultdict(lambda : defaultdict(int))

#

Ty Mr Python
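
For anyone who has not met it yet, a tiny sketch of what that nested `defaultdict` buys you (the model names and outcomes are made up). Missing keys are created on first access, so counters can be incremented without any setup:

```python
from collections import defaultdict

# Outer dict creates an inner counter dict on demand; inner dict defaults to 0.
results_dict = defaultdict(lambda: defaultdict(int))

# Hypothetical (model, outcome) pairs to tally.
observations = [("cnn", "hit"), ("cnn", "hit"), ("svm", "miss")]
for model_name, outcome in observations:
    results_dict[model_name][outcome] += 1  # no "if key not in dict" needed

print({name: dict(counts) for name, counts in results_dict.items()})
```

Note that merely reading a missing key also creates it, which is worth remembering before iterating over the result.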

lapis sequoia
#

Hello! Why is a hand gesture detection neural network better than a Haar cascade for a typical gesture, please? 🙂

upper ginkgo
#

Hello, I've followed this tutorial to get a simple intent matching neural network:
https://towardsdatascience.com/build-it-yourself-chatbot-api-with-keras-tensorflow-model-f6d75ce957a5
The results are good, it's relatively fast and produces good results, except for one thing.
It always matches an intent, even if a query is nonsense it will match one intent. The probability is always high...
Is there anything I can do to fix this? I'm relatively new to neural networks and I'd appreciate if you add an explanation to whatever's going on! Thanks in advance


earnest prawn
#

@upper ginkgo the reason this always outputs something is that your output layer is a softmax. Softmax layers output a probability distribution, meaning that all output values of the layer add up to 1. How to fix this, I am not sure actually. I guess you could add another output for "nonsense" and add a few nonsense training examples so it learns what qualifies as that? But that's just my naive approach really
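
A small NumPy sketch of that property (the logits are made up): whatever raw scores come out of the last layer, softmax turns them into positive numbers that sum to 1, so some intent always ends up looking like the winner:

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalise so the outputs sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw scores for three intents on a nonsense query.
probs = softmax(np.array([2.0, -1.0, 0.5]))
print(probs, "sum =", probs.sum())
```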

upper ginkgo
#

Thanks, @earnest prawn
If there's something else I can do I'd really like to hear it

earnest prawn
#

I'd like to hear as well, my solution is as I said just a really naive one

feral lodge
#

This is a big issue in deep learning; neural networks are currently very poor at expressing uncertainty for previously unseen input. They tend to have a false sense of overconfidence, like you saw

#

The simple answer is that your network is overfit, so you'll need to regularize it harder and hope for the best

#

Consider this example! All data is vectors right, and in this example the data vectors are 2-dimensional

#

It's a binary classification problem. The little clusters of data are our labeled training set. The points are labeled 0 or 1. Our neural network outputs a classification digit; a number ranging from 0 to 1

#

What you see in the plot is the learned classification scheme of an overfit neural network. It's very black-and-white -- the network always produces a 0 or a 1. No room for uncertainty

#

But what if our network sees a new point, say at the coordinate (-2, -2) or (-2, 2)? That's very far away from the training data. Reasonably, such points probably don't belong to either of the two classes. But the network will happily misclassify them, expressing 100% certainty while doing so

#

Here we can also see the problem of adding fake nonsense data to the training set. How would we define such points? Those points would literally have to be everywhere surrounding these points

#

That's a lot of work, even in 2 dimensions

#

Language and image processing don't work in 2 dimensions, they work in 1000s of dimensions

#

The solution is to regularize the network

#

Here's the same problem, except the network is properly regularized:

#

Voilà! The network expresses uncertainty for weird data

#

Then the programmer will have to decide what levels of uncertainty are acceptable for their application
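
As a rough illustration of that decision step, a sketch that rejects a prediction when the top probability is below a cutoff. The function name, the labels and the 0.75 threshold are all hypothetical, and the cutoff would need tuning per application:

```python
import numpy as np

def pick_intent(probs, labels, threshold=0.75):
    """Return the best label, or None when the model is not confident enough."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None  # treat as "no intent matched"
    return labels[best]

labels = ["greeting", "goodbye", "thanks"]  # hypothetical intents
print(pick_intent(np.array([0.90, 0.05, 0.05]), labels))
print(pick_intent(np.array([0.40, 0.35, 0.25]), labels))
```

This only helps if the network actually produces lower probabilities on unfamiliar input, which is what the regularization is for.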

#

It's a very complicated problem though, and an area of active research. There is no way to guarantee that regularizing your net will work, unfortunately

#

The exact same issue has caused crashes in Teslas and other weird stuff you wouldn't expect

upper ginkgo
#

oh...

#

My neural network uses text only though, not images

feral lodge
#

Somehow, all data we work with is translated to vectors before we toss them through the network

#

So regardless of whether it's a text, an image, or just numbers we can always imagine them as points in a very high-dimensional vector space

#

And we always see the same kinds of problems

#

If you enter a text in Japanese, it might be classified with a 100% certainty, even though the network doesn't know what it's talking about

#

Likewise, if we make a network that can classify images of rotten apples from good apples and accidentally give it a picture of a shoe, the network might produce a very confident misclassification

#

"This shoe is definitely a good apple!"

#

Don't be discouraged though, regularization is the key. It might work well for you

upper ginkgo
#

Oh well, thanks for the information anyways

#

Is there somewhere I can learn about regularization?

feral lodge
#

What deep learning library are you using?

upper ginkgo
#

Keras/Tensorflow

feral lodge
#

I find many blogs and stuff if I just google "keras regularization". They're probably all good!

upper ginkgo
#

Thanks a lot

#

@feral lodge just wondering, I did some research before asking and I found other ‘solutions’ such as using a Bayesian neural network or Monte Carlo dropout. Would those also fix my issue?

feral lodge
#

The "good" plot I showed above is actually a Bayesian neural network! They're experimental, don't always work properly and usually take a looong time to train. When they do work, they tend to have good uncertainty estimates, better than usual regularization. If Keras supports them, you can give it a try if you want! Monte Carlo dropout is actually a Bayesian method in disguise, but I've never used it

#

If you're a beginner, it may be wiser to just stick to plain old networks with plain old regularization though. The classic regularization method for nns is weight decay. They call it regularizers.l2 here: https://keras.io/regularizers/

upper ginkgo
#

It works better now, thanks @feral lodge! But the nonsense is still getting through, nonetheless..

#

Is there something I can do here?
This is what I added to my code:

model = Sequential()
model.add(Dense(128, input_shape=(len(x[0]),), kernel_regularizer=regularizers.l2(0.01), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))
model.add(Dense(len(y[0]), kernel_regularizer=regularizers.l2(0.01), activation='softmax'))
#

Or do I just increase the value in l2?

feral lodge
#

Good job! It's a tricky problem, it'll probably never work flawlessly, particularly with this network architecture. But yes, you should play around with the l2 value

#

Raising it regularizes the network harder, so if it's too high the network will not be able to fit the data properly
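
To see why a larger value pulls harder, here is a NumPy sketch of the penalty that an L2 regularizer adds to the loss: the coefficient times the sum of squared weights. The weight values are made up, and the function is a stand-in for illustration, not the Keras implementation itself:

```python
import numpy as np

def l2_penalty(weights, l2=0.01):
    """Extra loss term: coefficient times the sum of squared weights."""
    return l2 * np.sum(np.square(weights))

weights = np.array([1.0, 2.0])  # made-up layer weights

print(l2_penalty(weights, l2=0.01))  # small nudge towards zero
print(l2_penalty(weights, l2=0.5))   # same weights, 50x the penalty
```

With a large coefficient the optimizer can lower the total loss more by shrinking weights than by fitting the data, which is where the underfitting comes from.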

upper ginkgo
#

Perhaps 0.5 would do the job?

feral lodge
#

Give it a try! I've never gone above 0.1 personally

#

But it's good to play around and test

upper ginkgo
#

I'm gonna keep that in mind, thanks again

polar acorn
#

If you want a non neural network solution you could also do something simple. Just from the top of my head, check if any of the top 1000 most used english words are in the comments and if yes send it to the network if not label it nonsense? Neural networks are nice but some things can also be solved by simple heuristics.

#

Although there are probably better heuristics than the one I suggested 🙂
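
A minimal sketch of that kind of pre-filter, assuming a tiny hand-written word set as a stand-in for a real top-1000 list (the set, the function name and the `min_hits` parameter are all hypothetical):

```python
# Stand-in for a real list of the most common English words.
COMMON_WORDS = {"the", "is", "what", "how", "you", "can", "i", "a", "to", "hello"}

def looks_like_english(text, min_hits=1):
    """True if enough tokens are common English words."""
    tokens = text.lower().split()
    hits = sum(token.strip(".,!?") in COMMON_WORDS for token in tokens)
    return hits >= min_hits

for query in ["What is the weather?", "asdf qwer zxcv"]:
    verdict = "send to network" if looks_like_english(query) else "label as nonsense"
    print(repr(query), "->", verdict)
```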

feral lodge
#

I agree, that's a good idea!

upper ginkgo
#

0.5 was a bit too high, nothing matched

#

and I guess it isn't necessary, the nonsense isn't getting through now:

#

@feral lodge thanks again, it works like a charm 😄

#

I got a bit scared by bayesian stuff and those methods that required changing a lot of the structure, I'm glad I just had to change a few lines! 😅

feral lodge
#

Great job, glad it helped 😄

#

Bayesian neural networks are actually not as scary as you might think! Simply by using L2 regularization you've actually performed Bayesian inference over your network, in disguise. Your Bayesian prior distribution is a Gaussian, and your posterior distribution is a Delta (a single point estimate of the weights)

silent swan
#

a good thing to remember is that softmax is a hack

#

"huh, we need to convert some real-valued outputs into a probability distribution, what do we do?"
"idk, take the ratio of exponents?"

#

they can be interpreted as probabilities, and our training losses treat them as such, but we should be cautious about actually treating them as "predicted probabilities"

languid oar
#

Hey guys, has anyone used VSCode for ML/Data Analysis purposes? I installed the basic extensions like Python and Intellicode, but autocompletion for packages like pandas is very very lacking. Coming from a language where VSCode is a first class citizen, it's night and day how much more difficult Python is with it. Has anybody had better luck than me?

still harness
#

I need a Human Speech sentiment classification (e.g. Happy, Sad, Fearful, Surprised, Angry) data-set, please help. 🙏

quasi nacelle
#

anyone here ?

#

Hi, sorry, but I need some help. I am trying to write a script to loop through data and identify people that fit the definition below. I am having a hard time getting started. The data is loaded with pandas and the dataframe is reduced to the columns of interest, but writing the loop with the if/and/or statements is a problem. Is there anyone here who could set aside 15 min?
An increase in plasma creatinine by >0.3 mg/dl or a relative increase of 1.5-fold above baseline
together with severely elevated plasma MTX concentrations at one or more of the following time-points
after initiation of the MTX infusion:

36 hour >20 μM
42 hour >10 μM
48 hour >5 μM

  • If someone would like to help I can post the dataset
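
A sketch of how that definition could be expressed without an explicit loop, using boolean masks in pandas. The column names and values are hypothetical since the dataset was not posted, and "1.5-fold above baseline" is read here as peak >= 1.5 x baseline, which is an assumption:

```python
import pandas as pd

# Hypothetical patients; creatinine in mg/dl, MTX concentrations in μM.
df = pd.DataFrame({
    "crea_baseline": [0.6, 0.8, 0.5],
    "crea_peak":     [1.0, 0.9, 0.7],
    "mtx_36h":       [25.0, 30.0, 5.0],
    "mtx_42h":       [8.0, 12.0, 2.0],
    "mtx_48h":       [3.0, 6.0, 1.0],
})

# Creatinine criterion: absolute rise > 0.3 mg/dl OR at least 1.5x baseline.
crea_rise = ((df["crea_peak"] - df["crea_baseline"]) > 0.3) | (
    df["crea_peak"] >= 1.5 * df["crea_baseline"]
)

# MTX criterion: above the cutoff at one or more of the time points.
mtx_high = (df["mtx_36h"] > 20) | (df["mtx_42h"] > 10) | (df["mtx_48h"] > 5)

# "Together with" means both criteria must hold.
df["fits_definition"] = crea_rise & mtx_high
print(df[["fits_definition"]])
```

Using `|` and `&` on whole columns (with parentheses around each comparison) replaces the if/and/or logic row by row.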
surreal nacelle
#

Hey, I'm looking to apply some of the stuff I learned on unsupervised learning from the book Hands-on machine learning with (...) and I was wondering if you guys had good entry level unsupervised learning project ideas to recommend 🙂 Thanks

supple ferry
#

@surreal nacelle , you can start with market segmentation for example. Or any use case with clustering algorithms

surreal nacelle
#

Gonna look into it, thank you

#

someone recommended me mnist without label

#

which sounds pretty good tbh 😄

upper ginkgo
#

Hey, can someone explain to me what's an epoch?

feral lodge
#

An epoch means a single training iteration over your full training set

#

So if you have 10000 training points, your computer might not be able to handle everything at once. Then you'll divide your training data into smaller chunks called mini-batches and train your model on those instead. If you divide your data into mini-batches of size 1000 for example, then 10 training iterations equals 1 epoch
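
The arithmetic from that example, as a short sketch (`minibatch_slices` is a hypothetical helper name): one epoch is however many mini-batches it takes to cover the training set once.

```python
import math

n_samples = 10_000
batch_size = 1_000

# Number of training iterations (mini-batches) that make up one epoch.
iterations_per_epoch = math.ceil(n_samples / batch_size)

def minibatch_slices(n_samples, batch_size):
    """Yield index slices that together cover the training set exactly once."""
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))

batches = list(minibatch_slices(n_samples, batch_size))
print(iterations_per_epoch, "iterations per epoch; last batch:", batches[-1])
```

The `ceil` matters when the batch size does not divide the dataset evenly: the last mini-batch is simply smaller.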

upper ginkgo
#

Ohhh I see

surreal nacelle
#

Hey, trying to clusterize the mnist dataset with KMeans, (by applying PCA first), and I'm getting pretty bad results, which is not surprising, however, I'm not sure what the next step should be.
I tried using the PCA.components_ as centroid, but it didn't perform as well as 10 random init and 1000 iters.
What would you do in that situation ?

silent swan
#

do your principal components look reasonable?

quaint ruin
#

Hey, I'm using the Kobe Bryant dataset from kaggle.
I've been tasked to predict the shot_made_flag and to avoid data leakage by training on data prior to the date of the test data.
I've also been told to find the best k using 10 K-FOLD CV.
I think these 2 requirements contradict each other because if I use K-FOLD to split my data into train and test then the model will eventually train on data that occurred after the test data, so it makes no sense in this context to use K-FOLD but rather sort by date and manually split at this point for train and test and find the best k.
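
One way the two requirements might be reconciled, offered as a sketch and not as the intended solution to the assignment: sort by date and use forward-chaining folds, where every fold trains only on rows that come before its test rows. The function below is a hypothetical plain-Python helper:

```python
def forward_chaining_splits(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs where training always precedes testing.

    Assumes the rows are already sorted by date.
    """
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows.
        test_end = train_end + fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(forward_chaining_splits(n_samples=12, n_folds=3))
for train_idx, test_idx in splits:
    print("train up to", train_idx[-1], "| test", test_idx)
```

scikit-learn ships a similar splitter, `TimeSeriesSplit`, which can be passed as the `cv` argument when searching for the best k.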
Can someone correct me if I'm wrong here?

onyx moth
#

Hello guys, I managed to put my data into a DataFrame and added a target value (it looks 3 prices into the future: if that price is higher than the current price it's a 1, otherwise a 0). I have no clue which step is next to make this NN usable and how I would need to do it with my data.

#

A lot of examples and stuff I've seen consist of 2 things: 1 row of samples and 1 row of labels

#

I have multiple rows of samples and 1 row of labels
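
A sketch of how that layout usually maps onto samples and labels, assuming a single `close` column with made-up prices: every row is one sample, and the label for a row comes from the price 3 rows later via `shift(-3)`.

```python
import pandas as pd

# Made-up prices; "close" is a hypothetical column name.
prices = pd.DataFrame({"close": [10, 11, 9, 12, 13, 8, 14]})

# Price 3 steps into the future, aligned to the current row.
future_close = prices["close"].shift(-3)

# 1 if the future price is higher than the current one, otherwise 0.
prices["target"] = (future_close > prices["close"]).astype(int)

# The last 3 rows have no future price yet, so their labels are meaningless.
prices = prices.iloc[:-3]

features = prices[["close"]].to_numpy()  # shape (n_samples, n_features)
labels = prices["target"].to_numpy()     # shape (n_samples,)
print(features.shape, labels.tolist())
```

With more feature columns, `features` just gets wider; the labels stay one value per row.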

silent swan
#

how seriously do you want to do this financial prediction?

#

because my usual prescription is: don't use ML for finance

#

with the further addendum: absolutely don't use DL for finance

#

if this is just for learning that's fine, then I can give tips

onyx moth
#

well im working towards using it for real trading 😬

#

im trying to recreate something

#

I saw this around 2 months ago on YouTube and started learning Python, and for the last 2 weeks I've been doing this ML stuff

#

with NNs

silent swan
#

aha, well the bitcoin market is illiquid enough for some strats to work. still my recommendation is not to use DL for this. ML algos can work somewhat

#

unlike most problems, finance is a system where the whole market is actively trying to unlearn its own patterns

#

unless it's something fundamental to the market structure

onyx moth
#

I also emailed that guy and he said I need to look at reinforcement learning, but as I first just wanna have something that works, I figured maybe I'd just make a NN that outputs something first

#

I can feed in anything

#

I mean I can do fundamental analysis too. The whole bitcoin market crashes on tweets and youtube videos

#

I also don't expect to get rich with it, but if it can beat the market like in his video, even if it's by 1 %, I'm on the right tack

#

track*

#

I only did 1 ML algo and that was linear regression, it sucked badly

#

I needed to feed it 4 of the 5 things

#

then I moved on to NN

prime trail
#

are there any good machine learning tutorials i can learn about it?

mossy dragon
#

depends, hows your math?

polar acorn
#

Does anyone have a ETA on the first tensorflow 2.0 release? The roadmap says Q2 2019.

quartz stream
#

Any idea on how to save a tensorflow model

#
predictions = estimator.predict(input_fn=predict_input_fn)
#

I wanna save the estimator thing so that next time I dont need to do the 10000 iterations

light cloud
#

i believe that is what pickles are for

silent swan
#

isn't there like tf.saver?

feral lodge
#

If it's just TensorFlow, I use tf.train.Saver, yeah:
```
saver = tf.train.Saver()  # TF 1.x API; create it after building the graph
saver.save(sess, saved_variables_path)  # save
saver.restore(sess, saved_variables_path)  # load
```
If it's Keras, I use:
```
from keras.models import load_model
model.save(saved_model_path)  # save
model = load_model(saved_model_path)  # load
```
#

I think these also save internal stuff like Adam's momentum values, but I'm not completely sure

quartz stream
#

Okay

#

Thanks @feral lodge

void spade
#

linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])

linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])

linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)

#

For this code what does the slicing do? Specifically the 2 and 4

#

Posted in help but maybe someone here can help

silent swan
#

looks like it's only using the first 2 / first 4 features

void spade
#

Yep that is what it's doing but i dont understand where the features are coming from, hang on let me grab the code

#

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=10)
X = poly.fit_transform(x)

obs_nums = np.arange(0, num_points)
np.random.shuffle(obs_nums)

top_70 = int(num_points * .7)
rand_train = np.sort(obs_nums[:top_70])
rand_test = np.sort(obs_nums[top_70:])

X_train = X[rand_train]
X_test = X[rand_test]
y_train = y[rand_train]
y_test = y[rand_test]

linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])

linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])

linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)

errors_train= np.array([np.mean((y_train - y_train_2) ** 2),
                        np.mean((y_train - y_train_4) ** 2),
                        np.mean((y_train - y_train_10) ** 2)])
errors_train = np.column_stack(([2, 4, 10], errors_train))

errors_test = np.array([np.mean((y_test - y_test_2) ** 2),
                        np.mean((y_test - y_test_4) ** 2),
                        np.mean((y_test - y_test_10) ** 2)])
errors_test = np.column_stack(([2, 4, 10], errors_test))

silent swan
#

looks like you're taking the first 2/4 polynomial terms as features

void spade
#

I'm not sure where the feature values are coming from

#

Sorry in advance if im being super dumb

silent swan
#

there's the initial x that needs to come from somewhere

#

in
X = poly.fit_transform(x)

void spade
#

Yeah i follow that
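
To make the column slicing concrete without needing scikit-learn, here is a NumPy sketch that builds the same kind of matrix for a single input feature: columns 1, x, x**2, ..., x**10. `np.vander` is used as a stand-in for `PolynomialFeatures(degree=10)` on 1-D input, and the x values are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])  # made-up 1-D inputs

# Columns are x**0, x**1, ..., x**10 (11 columns in total).
X = np.vander(x, N=11, increasing=True)

first_two = X[:, :2]   # all rows, columns 0-1: the constant and x
first_four = X[:, :4]  # all rows, columns 0-3: up to x**3

print(X.shape)
print(first_two[1], first_four[1])
```

So `[:, :2]` fits a straight line, `[:, :4]` fits a cubic, and the full matrix fits the degree-10 polynomial.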

storm void
#

Anyone here have experience working with the WebAgg backend of matplotlib

lilac reef
#

What would this kind of plot be called in matplotlib?
Its like a scatterplot, but how to connect each grouping of numbers based on value

grizzled folio
#

@lilac reef contour?

lilac reef
#

Seems about right. Im just afraid contour is continuous values, but I'll look into it!
Thank you!

#

Yeah, def look right

#

Thanks :)

lilac reef
#

Update: Not quite. I think contour is working with curves.
I have a bunch of X, Y values that each have a category Z they fall into.
I want to draw lines around each Z group like they did in the pictured graph above

#

I think I'm going to plot a bunch of different scatter plots on top of eachother with different colors. Not sure how I'll connect them

grizzled folio
#

@lilac reef what do you mean "working with curves"?

lilac reef
#

Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-dimensional plane. It graphs two predictor variables X Y on the y-axis and a response variable Z as contours. These contours are sometimes called the z-slices or the iso-response values.

#

Hmm, nevermind

#

I read something to the tune of contour plots plot slices of a plane

grizzled folio
#

You can explicitly pass the levels (categories) you want contours at

lilac reef
#

So I have pretty much the same data as the picture I posted.
For each X and Y value, I have a resultant percent number (75, 76.5, etc.)
I have the list of X, Y (both at regular intervals) and want to find the high point for the percentage.
Would I pass in my X, Y as
plt.contour([graph_x, graph_y,], graph_results, [.6, .65, .7, .75])
for example?

#

Thank you for the help btw man :) This is a really tricky plot I'm trying to pull off but I think it will look good

grizzled folio
#

I believe it looks like plt.contour(X, Y, Z, levels), so no need to bundle up X and Y as a list

#

For a regular grid, you might need to meshgrid X and Y together, something like xx, yy = np.meshgrid(graph_x, graph_y)

lilac reef
#

Yeah, I kept seeing meshgrid come up

#

For just throwing in my stuff as X,Y,Z I got Input z must be a 2D array
So each Z value must be paired with a given contour 'height'?

grizzled folio
#

What shape is Z? It expects X to be M long and Y to be N long, so Z needs to be a 2D array of values with shape (N, M)

#

It'll figure out the contours for your data automatically

lilac reef
#

Z is 100x1 for me. I have X, Y and Z in different arrays. It is a height value that is in-order connected with each X Y combo

#

Seems like I should have a different format

grizzled folio
#

What do you mean by "in-order connected"?

lilac reef
#

Plot the height Z[1] at (X[1], Y[1])

#

and so on

#

I cant quite seem to wrap my head around what contour(X,Y,[N,M]) is expecting

#

X and Y must both be 2-D with the same shape as Z (e.g. created via numpy.meshgrid), or they must both be 1-D such that len(X) == M is the number of columns in Z and len(Y) == N is the number of rows in Z. hmmm

grizzled folio
#

You say X and Y are regular intervals, but if Z[2] is (X[2], Y[2]), haven't you got heights along a diagonal?

#

Yes, you could just reshape it if you meant Z[2] is (X[2], Y[1]), for example

lilac reef
#

Each height is a 'result' of the value X and Y being fed into an algorithm.
For context: X and Y are two parameters I am doing a grid search over a learning algorithm.
Z is how accurate of a predictor it was for each X,Y parameter combo (in cross product form)

grizzled folio
#

Ok, so you have a grid...Then len(X) * len(Y) = 100 and you can just Z.reshape((len(Y), len(X))) and it'll probably give you at least something

lilac reef
#

I'll give it a go

#

oof, bigger than I thought

#

I think this might be more of a headache than its worth >_>

#

Just trying to maximize my predictive power. I can manually sift through it.
Just kinda wanted a flashy graph for presentation

#

Thank you for help switchy! If its not as hard as I'm thinking let me know

grizzled folio
#

It's not... or at least shouldn't be

#
import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
y = np.arange(10)
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])
#

x and y are 10-element arrays, Z is (10, 10)

lilac reef
#

So what are the elements in Z?

grizzled folio
#

z[j,i] = sin(i) + cos(j), for i,j ∈ [0, 1, ..., 9]

lilac reef
#

So Z[0,1] is mapped at location X[0], Y[1]?

grizzled folio
#

Other way, Z[1,0] is X[0], Y[1]

#

You can just transpose it

lilac reef
#

Ok, I'll try to work off that for a bit

#

So how can my Z be indexed at Z[j, i] if my j and i are logarithmic and not integer?

#

Or do I need to just make a work-around

grizzled folio
#

sorry, I guess I wasn't clear...

#

z[j,i] = z(x[i], y[j])

#

You're really super overthinking this

lilac reef
#

lmao. Im trying my best

grizzled folio
#
import matplotlib.pyplot as plt
import numpy as np
x = 10**(np.arange(10) / 2)
y = 10**(np.arange(10) / 2)
plt.xscale('log')
plt.yscale('log')
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])
lilac reef
#

I would really appreciate if you held my hand and really spelled it out for me so I could do some big boy data science lmao.
All I have are these 3 lists.
I need to reshape my results (Z) to be 10x10 or something

#

Im really sorry mate, I just dont think I'm going to get it to work tonight

#

My outputs are just whack that Im working with

#

My X and Y are 100 long, but are the same 10 values repeating. X goes up 1, 2, 3, ..., whereas Y is ten 1's, then ten 2's in a row, etc.

#

I just cant late night code. My bad mate. I really appreciate it though

grizzled folio
#

@lilac reef just do .reshape((10,10)) on all of them

#

Then you'll have 2D arrays for X, Y, and Z (which is also acceptable input to contour)

lilac reef
#

👀

grizzled folio
#

If you wanted to turn them back into 1D arrays, you could take every 10th X value (graph_x[::10]), and the first 10 Y values (graph_y[:10]) -- but you shouldn't need to do that in this case

lilac reef
#

'list' object has no attribute 'reshape'

#

DataFrame(graph_x).reshape(10,10)?

grizzled folio
#

np.array(graph_x)
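
Putting those steps together, a NumPy sketch on a smaller made-up grid (3 x-values by 2 y-values) with the same layout as described: X cycles fastest and Y is held constant for a run. The names `graph_x`, `graph_y` and `graph_results` follow the chat; the values are invented:

```python
import numpy as np

x_vals = np.array([0.1, 1.0, 10.0])
y_vals = np.array([1.0, 2.0])

# Flat lists as they come out of a grid search.
graph_x = np.tile(x_vals, len(y_vals)).tolist()    # 0.1, 1, 10, 0.1, 1, 10
graph_y = np.repeat(y_vals, len(x_vals)).tolist()  # 1, 1, 1, 2, 2, 2
graph_results = [0.60, 0.65, 0.70, 0.62, 0.75, 0.68]

# Rows follow Y (slow axis), columns follow X (fast axis).
n_y, n_x = len(y_vals), len(x_vals)
X = np.array(graph_x).reshape(n_y, n_x)
Y = np.array(graph_y).reshape(n_y, n_x)
Z = np.array(graph_results).reshape(n_y, n_x)

# Location of the best score on the grid.
best_row, best_col = np.unravel_index(np.argmax(Z), Z.shape)
print("best score", Z[best_row, best_col],
      "at x =", X[best_row, best_col], "y =", Y[best_row, best_col])
```

The three 2D arrays can then go straight into `plt.contour(X, Y, Z, levels)`, and the `argmax` lookup answers the "where is the high point" question without a plot.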

lilac reef
#

gotcha

#

Thanks switchy <3

#

I think I can play with it from here

grizzled folio
#

Great 🙂

lilac reef
#

The data is so abstract it would take me like 10 minutes to explain and I love it

grizzled folio
#

Cool, glad to see it worked

#

don't forget your colourbar and axis labels 😉

lilac reef
#

I'll pretty it up once I stop throwing graphs at the wall and seeing what sticks

#

Trying to get my axes right

#

Why use fancy matplotlib when your numpy array spyder auto-coloring does the job 🤔

onyx moth
#

what does numpy.argmax( )actually do?
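
Since this one went unanswered: `np.argmax` returns the index of the largest value, not the value itself. A short sketch with made-up numbers, including the `axis` argument that is typically used to turn softmax outputs into class labels:

```python
import numpy as np

probs = np.array([0.1, 0.7, 0.2])
best_index = np.argmax(probs)  # position of the largest value

# One row per sample, one column per class (made-up predictions).
batch = np.array([[0.1, 0.9],
                  [0.8, 0.2]])
per_row = np.argmax(batch, axis=1)  # best class index for each row

print(best_index, probs[best_index], per_row)
```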

quartz stream
#

any idea how to delete a row in a dataframe where the value of one column is false

#

I wanna scan the whole df and find the value false in a particular column and then delete that row

onyx moth
#

drop it @quartz stream

quartz stream
#

how

#

@onyx moth

#

drop takes column value

#

i wanna delete a particular row

onyx moth
#

df = df.drop(['name of the colum'])

#

ow a row?

quartz stream
#

yeah

#

a row

#

that too not any row

#

row with a column value false

onyx moth
#

yea that should work

quartz stream
#

how ?

#

can you send a snippet

onyx moth
#

send ur code

quartz stream
#

code of ?

onyx moth
#
df = df.drop(['name of the colum'])
quartz stream
#

I dont wanna delete a column

#

i wanna delete a row

onyx moth
#

yea

#

thats what that does

quartz stream
#

scan a column find false and delete that

wicked flare
#
indices = df[~df['myColumn']].index  # ~ negates a boolean column in pandas
df.drop(indices, inplace=True)
onyx moth
#

u wanna do this right?

wicked flare
#

@quartz stream Pretty sure my snippet should do what you want.

quartz stream
#

lemme try I also think so @wicked flare

#

not working

#

I dont wanna delete a column

#

i'll scan a column called valid

#

find index where value is false

#

and delete all of em

onyx moth
#

so you only have certain blocks which are false?

#

or a whole row

#

@quartz stream This is smth im working on, do you want this? See it dropped row 1 and 2

quartz stream
#

yes

#

there is a certain row with a column value false

#

how did you do it again

onyx moth
#

like i said above u just drop it

wicked flare
#

@quartz stream My snippet doesn't delete columns, it deletes rows.

onyx moth
#

row*

wicked flare
#
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
>>> indices = df[df['col1'] == 2].index
>>> indices
Int64Index([1], dtype='int64')
>>> df.drop(index=indices, inplace=True)
>>> df
   col1  col2
0     1     3
>>> 
#

It finds rows where the given column matches the condition and drops them.

quartz stream
#

Wow

#

You Sir are amazing

#

@wicked flare

#

is there any way I can get count of this thing

wicked flare
#

count?

quartz stream
#

yes

#

how many did it drop

wicked flare
#

You can just check the length of indices
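Roughly like this, as a sketch that assumes a boolean column named `Valid` in a made-up frame:

```python
import pandas as pd

# Hypothetical frame with a boolean "Valid" column.
data = pd.DataFrame({"value": [10, 20, 30, 40],
                     "Valid": [True, False, True, False]})

# Index labels of the rows where Valid is False.
indices = data[~data["Valid"]].index
dropped_count = len(indices)

# Drop those rows; assigning the result back avoids inplace=True.
data = data.drop(indices)
print(dropped_count)
```

If the column holds the strings "True"/"False" rather than booleans, the mask would be `data["Valid"] == "False"` instead of `~data["Valid"]`.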

quartz stream
#

ohh

#

yea

#

sorry

#

pretty new here so

wicked flare
#

No worries.

quartz stream
#
1 indices = data[data['Valid'] == 'False'].index
----> 2 df.drop(index=indices, inplace=True)

TypeError: drop() got an unexpected keyword argument 'index'
#

@wicked flare

wicked flare
#

Your dataframe isn't called df.

quartz stream
#
<ipython-input-24-22441e5dbdc2> in <module>()
      1 indices = data[data['Valid'] == 'False'].index
----> 2 data.drop(index=indices, inplace=True)

TypeError: drop() got an unexpected keyword argument 'index'
wicked flare
#

What version of pandas are you using?

#

The index parameter to drop was added in 0.21 so presumably your version is older than that

quartz stream
#
Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (0.24.2)
wicked flare
#

Anyway, this should also work: df.drop(indices, axis=0, inplace=True)

lapis sequoia
#

it's not good practice to use inplace drop, it doesn't work properly most of the time

wicked flare
#

What's wrong with it?

polar acorn
#

It's always worked for me, but it's supposedly going to be deprecated.

olive willow
#

guys any ideas for data science related projects for good practice with numpy, pandas as well as general python stuff ??

#

btw I've a rpi so maybe something with that? I've really no clue what I can do. I've done several data analysis and webscraping ones and also some db ones.

onyx moth
#

How would I determine the number of states with Q-learning

#

When my environment is trading

#

I have 3 actions, buy, sell , hold

olive willow
#

Thanks that's a great idea

onyx moth
#

@void anvil The q values mustn't be like that right?

#

im just not sure what the [50] should be

#

as the 3 is the amount of actions

#

and the [50] should be the states

olive robin
#

hello

#

they are concatenating two data sets, training and values (I think)

#

why are they doing this?

#

that's what path.join does right, concatenate directories?

polar acorn
#

If by concatenating directories you mean concatenating the paths of two directories, you are correct. os.path.join is needed because Windows paths are separated by \ while Unix paths are separated by /, so you can't just write 'dir1' + '/' + 'dir2' on all systems and expect that to work. So in this case it seems you provide the path to your data and the script expects your data directory to contain two subdirectories named train and val. The path variables traindir and valdir just hold the paths to those subdirectories.
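A small sketch of what that looks like, assuming a data directory called `data` (the name is made up; `traindir` and `valdir` are the variables from the script being discussed):

```python
import os

# Hypothetical data directory; the script expects "train" and "val" inside it.
data_dir = "data"
traindir = os.path.join(data_dir, "train")
valdir = os.path.join(data_dir, "val")

print(traindir)  # the separator depends on the operating system
print(valdir)
```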

lilac reef
#

Do I need to train-test-split if doing cross validation?
Is it possible to overfit with cross validation baked into the learning algorithm?

lilac reef
#

Hmm.
So I fit my model with the training set, and then cross validate on the whole thing? Or pure score it on test?
@void anvil

#

Like I'm using GridSearchCV. Isnt that cross validating the score for every model? Isnt that purely scoring based on whatever data you pass it

#

So I guess test the best model from the grid against the test set afterwards then?

#

Oh

#

:(

#

I've been overfitting pretty harshly

#

Shit

desert oar
#

@lilac reef you can't usually rely on the CV scores from hyperparameter tuning to correctly estimate out-of-sample score

#

CV for hyperparameter tuning and CV for performance estimation are different steps

lilac reef
#

Gotcha

desert oar
#

you can do a "nested" CV, or do the CV on your training set while keeping a holdout set for validation
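A rough sketch of the second option (tune with CV on the training portion, then score once on data the search never saw). The data is synthetic and the random forest and parameter grid are only placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out rows that the grid search never sees.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning: cross-validation on the training portion only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, None], "n_estimators": [25, 50]},
    cv=5,
)
search.fit(X_train, y_train)

# Performance estimation: score the refit best model once on the holdout.
tuning_score = search.best_score_
holdout_score = search.score(X_hold, y_hold)
print(search.best_params_, tuning_score, holdout_score)
```

`tuning_score` is the number that was optimized, so it tends to be optimistic; `holdout_score` is the one to report.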

lilac reef
#

So if the goal is to maximize the score on the holdout set (That means the model is generalizable right?),
how should I go about hyperparameter tuning?

#

Or is whatever parameters GridSearch returns probably still the best, even if it overfits somewhat?

lilac reef
#

Wait, whats a validation vs testing set?
If this is just an easy google and I'm being really lazy dont reply lmao

desert oar
#

eh

#

arbitrary

#

you might use "testing" for iterating on your model

#

then when you need to get some kind of final assessment before you start sharing this w/ your company's CTO, you run it on the validation set to get a more "pure" estimate of accuracy

lilac reef
#

Awesome, that was kinda what I was thinking

#

Thank you salt rock lamp :)

silent swan
#

in theory you should only ever use your test set exactly once

silent swan
#

@desert oar I think you swapped validation and test?

desert oar
#

@silent swan no, thats how i use the terms. but they arent formal terms

#

i usually use "test" as the "innermost" holdout set

#

i.e. inside the CV loop

silent swan
#

ah

#

the general convention is that validation/dev sets are used for tuning

#

test is used for pure eval

desert oar
#

i use it the opposite way

#

in fact ive never seen it used the way you describe

#

i usually call eval "eval"

#

i avoid using the term "test" except in code

polar acorn
#

Interesting. I've seen the notation @silent swan describes plenty of times and heard it described as the canonical way. But in my head i too switch around test and validation so that I tune on test and I can't quite remember where i learnt it, maybe it's just more intuitive.

desert oar
#

maybe i just dont use kaggle enough

silent swan
#

it's the standard in research as well

#

would be better to stick to the conventional naming imo

#

(also NLP is weird where they sometimes call the val set the "dev set")

desert oar
#

at least that makes sense, its what you develop against

silent swan
#

I'm guessing that's the origin

lilac reef
#

Would you consider Auto Encoders for feature selection a form of clustering?

lilac reef
#

So what would you call the new form the data takes at the middle of the Auto Encoder?

silent swan
#

latent code

lapis sequoia
#

Is there a GridSearchCV in Sklearn api that can take the best model (i.e. RF and SVC)
instead of the model with the best params?

#

basically just a separate grid search for both, possibly using same list of parameters..

#

or.. try an ensemble

desert oar
#

sometimes i just send data back and say "i can't use this"

#

sometimes it's ambiguous and you literally can't know

silent swan
#

04-05-2019

#

fun stuff

desert oar
#

its honestly pretty satisfying

silent swan
#

but seriously though, MM-DD-YYYY was a mistake

desert oar
#

anything other than YYYY-MM-DD for a data set is pretty bad

#

have fun sorting on DD-MM-YYYY
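To make the sorting point concrete, a tiny sketch with three made-up dates stored as strings:

```python
from datetime import datetime

# Three made-up dates, listed in chronological order.
dates = [datetime(2018, 12, 25), datetime(2019, 4, 5), datetime(2019, 5, 4)]

iso = [d.strftime("%Y-%m-%d") for d in dates]
dmy = [d.strftime("%d-%m-%Y") for d in dates]

# Plain string sorting keeps YYYY-MM-DD in chronological order...
iso_sorted = sorted(iso)
# ...but sorts DD-MM-YYYY strings by day of month first.
dmy_sorted = sorted(dmy)

print(iso_sorted)
print(dmy_sorted)
```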

#

right

#

unless it's stored in a database like that...

#

nothing like a MM/DD/YYYY VARCHAR(255) timestamp!

#

oh lol excel

#

hmm doesn't excel represent all that consistently internally

#

i think pandas knows how to handle that

#

are they formatted as text or date?

#

if it's text you're fucked

#

if it's formatted as a date you might be safe

#

because you can reformat consistently

#

yeesh

#

yeah

#

uh

#

"this is your lesson not to be stupid"

#

"i can't work with this"

lapis sequoia
#

do people still use vba macros..

pine yoke
#

I would like a model that tries to predict the winner of a tennis match. It should put different weights or importances, varying from 1-100, on various aspects of a tennis player's game. As many options as possible should be included, but ideally the variables measured between 1-100 are things like:

  • Win streak (number of games won so far)
  • Won last match
  • Lost last match
  • Previous encounters with the same opponent
  • Points won in a game, in the last 5 games, and in the last 3 games
  • Double faults in the last game, in the last 5 games, and in the last 3 games
  • How they fare in similar situations, e.g. 3rd deuce as a server whilst 1 set down

  1. Is this a good use of ML?
  2. Could anyone recommend some videos on how to accomplish this
  3. Is there a better approach?
desert oar
#

@pine yoke you can use machine learning to predict match winners yes. look into "logistic regression"

#

the "weights" produced by the logistic regression can give you some sense of "importance" but they aren't directly comparable
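A minimal sketch of that setup with scikit-learn. The two features and the data are invented; standardizing the inputs first is one common way to put the weights on a similar scale, though they still should not be read as a strict importance ranking:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-match features: player A's value minus player B's.
win_streak_diff = rng.normal(0, 3, n)
double_fault_diff = rng.normal(0, 2, n)
X = np.column_stack([win_streak_diff, double_fault_diff])

# Synthetic outcome: longer streaks help, double faults hurt.
true_logit = 0.4 * win_streak_diff - 0.3 * double_fault_diff
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# One coefficient per standardized feature.
weights = model.named_steps["logisticregression"].coef_[0]
# Predicted probability that player A wins the first match in the data.
win_probability = model.predict_proba(X[:1])[0, 1]
print(weights, win_probability)
```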

pine yoke
#

Do you have recommendations for libraries to use?

desert oar
#

scikit-learn

#

pandas for data processing

#

numpy for general matrix and vector math

#

and scipy for various other math

lapis sequoia
#

and Tron for inspiration

#

(the movie)

lapis sequoia
#

i am wondering, for data science jobs, isn't a masters degree required?

#

data analyst is also a data science job.. you can get a business degree and go into that

#

ohh business degrees can do data analysis?

#

i am not a business degree i am afraid

desert oar
#

its difficult without a masters

#

possible but difficult

lapis sequoia
#

for data analytics or data science?

desert oar
#

data science

lapis sequoia
#

ohh, then this program i know irl is kind of trying to scam people into thinking they can do data science without a masters

desert oar
#

its probably not a scam

#

if it's a specialized training program, that's different

lapis sequoia
#

ohh

#

i thought masters is required

desert oar
#

there is no standards organization that dictates what is required

#

it is hard to get a data science job without a masters because there are many people with masters degrees applying, and also because a masters degree signals a basic level of competence

#

without the masters degree you need to signal competence some other way

lapis sequoia
#

oh i see

#

but i think it's kind of misleading of the program though to tell people they can get a job in data science without a masters or letting them know of the full picture

#

or not letting them know*

desert oar
#

maybe, maybe not

#

they might have hiring connections

lapis sequoia
#

eh, they don't

desert oar
#

maybe share the program?

#

if you have a link or a name

lapis sequoia
#

i don't want to reveal my location

onyx moth
#

What could I change

#

This is the model its a very simple one

earnest prawn
#

it could be that your model doesn't have enough weights to optimize, that the optimizer isn't good for your use case, or that your data is polluted

onyx moth
#

@earnest prawn How do you mean my data is polluted?

#

im using adam optimizer

#

I have been experimenting a bit and it just cant get past 55% accuracy for some reason

earnest prawn
#

Mathematically speaking it does make a lot of sense that these accuracies are the same if your optimizer has found a minimum

#

@onyx moth what happens if you increase the amount of neurons in your hidden layers

onyx moth
#
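import tensorflow as tf
from sklearn.model_selection import train_test_split

# x: the 4 indicator values per sample, y: the 0/1 labels (built earlier)
# train_size=0.1 means only 10% of the rows are used for training
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.1)

# Small fully connected network with a sigmoid output for binary labels
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(4, input_dim=4, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
scores = model.evaluate(x_test, y_test)  # [loss, accuracy]
print(scores)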

Im a bit new to Keras but what im doing is pulling BTC data from binance, putting it into 4 indicators and inputting the values of those 4 indicators into the NN (the label is 1 if the price 5 prices later is higher and 0 if it is lower)

#

This model I just sorta got off the internet and have been playing around with it

#

it was meant for something else, not bitcoin

#

if I change the amount of neurons I dont see much improvement, but its very random, when I change the pair to for example ETHBTC and put it to forecast 50 days in the future accuracy jumps to 77%

#

but the moment I add volume to what I input accuracy dumps to 21^%

#

21%*

earnest prawn
#

And you are sure that these 4 indicators have a relation to whether the prices will rise or not?

onyx moth
#

yes, but it's not like if this happens on the indicator it will rise 100%, more like if this happens the probability of it rising is higher

onyx moth
#

must I normalize data ?

#

will that make it better do u think?

earnest prawn
#

Could make it better yes
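If you try it, one common approach is to standardize each input column. A sketch with made-up indicator values; the scaler is fitted on the training rows only so nothing from the test rows leaks into it:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical indicator values on very different scales.
x = np.column_stack([
    rng.normal(50, 10, 200),      # e.g. an oscillator in the 0-100 range
    rng.normal(9000, 500, 200),   # e.g. a moving average of the price
])
y = rng.integers(0, 2, 200)

# shuffle=False keeps the rows in time order.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=False)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
```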

dim beacon
#

And you are (very very very) probably falling into a bias of some kind (lookahead, overfitting, etc.)

#

Or are trying to predict the wrong thing, for instance, the price of most well-known financial assets can be predicted day-to-day with a very good accuracy using this model:

predicted_price_for_tomorrow = price_today
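A quick sketch of how well that baseline does on price level, using a synthetic random walk with 2% daily moves instead of real market data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic random-walk prices standing in for real daily closes.
returns = rng.normal(0, 0.02, 1000)
prices = 100 * np.exp(np.cumsum(returns))

# Persistence forecast: tomorrow's price equals today's.
predicted = prices[:-1]
actual = prices[1:]

mean_abs_pct_error = np.mean(np.abs(predicted - actual) / actual)
print(f"mean absolute percentage error: {mean_abs_pct_error:.3%}")
```

The error on the price level is small, but this forecast says nothing about direction, which is the part a trading signal needs.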
onyx moth
#

@dim beacon Yes there are some good algos, its not so hard to program a bot which trades on a fixed algo, if u just do what the previous day's Heikin-Ashi candle says you will also catch all the big swings, but having a NN bot which can adapt to the situation looks cool to me but idk if it's even something that's possible to make

dim beacon
#

@onyx moth algorithmic trading is, really, MUCH harder than most people think

onyx moth
#

Yea I couldnt come up with an algo that doesnt have a spot where it loses money

dim beacon
#

Like, there-exists-companies-paying-the-smartest-people-on-earth-millions-a-year-each-just-to-be-a-little-part-of-their-algos-strategies-development hard

onyx moth
#

Thats why I thought if I can implement some NN and add that as another if it would maybe cancel out some noise

#

But it must be possible for us right?

dim beacon
#

Possible? Yes. But do not expect being able to do it profitably before years of deep experience and very high-level knowledge in a lot of fields

onyx moth
dim beacon
#

Like, yeah it is possible, like it is possible you get a Nobel Prize one day, but you'll agree that it is not very likely to happen

onyx moth
#

true

#

ive only been messing with NN like 1.5 weeks or so

#

@dim beacon Do you have experience with Reinforcement learning?

dim beacon
#

Very few

onyx moth
#

okay now I have something that has a 64% accuracy and would like to try it out how do I now use this model to predict the next candles price?

dim beacon
#

?

model.predict(current_candle_data)
#

Also, best model, by far, to predict next candle price:

next_candle_price = current_candle_price
onyx moth
#

so u saying the best way to predict is to say that tomorrows price is the same as todays?

dim beacon
#

Yes

onyx moth
#

That means markets wont move

#

and it will be a flat line

dim beacon
#

They do, but on average they don't on a day-to-day basis

onyx moth
#

well if u look at BTC its moved up 400% + in the past half year

dim beacon
#

Yeah, still

#

Most of the time it did not move that significantly from a day to the next one

onyx moth
#

well year before it moved down by 85% or so

dim beacon
#

Yeah, still, most of the time it did not move that significantly from a day to the next one

turbid bay
#

can anyone teach me how to make a neural network for hand written digits. ik theres hundreds of tutorials online but i find none of them really helpful

dim beacon
#

Why aren't they helpful ?

turbid bay
#

well i follow them. have about 20% clue of whats going on. and then it finishes without explaining how to save it or implement it into any other coding projects

#

the only one ive found useful was a Coursera one i did a while back but it was using Octave. It was more explaining the math behind it rather than explaining model structures

desert oar
#

sounds like you need to learn tensorflow specifically

#

i feel like youd benefit more from a hands on machine learning book than from tutorials

turbid bay
#

could u recommend one?

#

and yes i think id like to learn the ins and outs of tensorflow

desert oar
#

i havent used one so i dont know sorry

turbid bay
#

oh o

#

k

#

if anyone does know of a good way to learn tensorflow please let me know

dim beacon
turbid bay
#

thankyou

#

will watch it in a minute

onyx moth
#

what does it mean if I need to discretize values?

earnest prawn
#

that you have to convert a continuous space into a discrete one

#

so for example map every real number to one of [-1, 0, 1]
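Two simple ways to do that with NumPy, as a sketch (the input values and the bin edges are arbitrary):

```python
import numpy as np

values = np.array([-2.7, -0.2, 0.0, 0.05, 3.1])

# Option 1: keep only the sign, giving -1, 0 or 1.
signs = np.sign(values).astype(int)

# Option 2: bin with explicit thresholds.
edges = np.array([-0.1, 0.1])
bins = np.digitize(values, edges) - 1   # -1 below, 0 inside, 1 above

print(signs.tolist())
print(bins.tolist())
```

A tabular method such as Q-learning needs a finite set of states, which is one common reason for binning continuous inputs like this.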

onyx moth
#

but I have no clue why

earnest prawn
#

the comment above literally explains why

onyx moth
#

I dont understand it I added it and it does nothing to my accuracy

earnest prawn
#

it is not supposed to

#

it is literally supposed to "fix the random seed so the results are reproducible"

onyx moth
#

so if I run it with seed 7 again it will produce the same results?

earnest prawn
#

yes
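A small NumPy-only sketch of what a fixed seed buys you (`noisy_run` is a made-up stand-in for any code that draws random numbers):

```python
import numpy as np

def noisy_run(seed):
    # Stand-in for any computation that draws random numbers.
    rng = np.random.RandomState(seed)
    return rng.rand(3)

first = noisy_run(7)
second = noisy_run(7)
other = noisy_run(8)

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

`np.random.seed(7)` fixes NumPy's global generator in the same way. A Keras model can still vary between runs if other sources of randomness (for example TensorFlow's own generator) are not seeded too.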

onyx moth
#

aaah okay

silent swan
#

there should be a Rule #1
Don't do anything finance-related for your first data science project

torn musk
#

i get stuck with bug

supple ferry
#

@void anvil , do you have exp with two stage modeling ?
like choice modeling or a simple classification where you run the first stage to predict some feature a which you will add to your 2nd stage variables to predict some class

onyx moth
#

Ill just keep it with algo trading then

#

But the idea of shifting the price is only to create a label and u should remove the shifted price right?

#

So it learns well here I should buy and memorise the situation of the other data u pass in

supple ferry
#

@void anvil . what i am trying to achieve is to group my options into clusters, which is done. Now as first stage I want to predict the cluster which user will choose the options from and then try to predict the option chosen which is in that cluster

#

how would you approach to such a problem ?

#

kind of

#

yes

#

instead of shape i have clusters

#

instead of colors i have options

#

exactly

#

it limits

#

i have done these two

#

we can switch to dm if it is fit for you

#

makes sense

#

ah okay

#

okay

#

let me now give more info

#

exactly

#

thats why i am now writing more info

#

so, lets assume, I have 3 users who were shown some items to buy.
first 2 got 50 items shown and the third one only 10
what I was doing previously, I was using features of items which are shown to every user and cluster them into several clusters. then i would create artificial cluster related features and combine them with item features and run a logit on them to find the item user has chosen

#

and now the idea is to split this task into a two stage choice model. first build a model which will predict the cluster from which user will choose and 2nd stage will be from that cluster find which item user will choose

#

i have not found many papers/works on this on the internet

#

okay i found LIME

#

will look into that

#

i see

#

which model will fit for the first stage?

#

i think i made a mistake when i explained the use case

#

i already have lets say 6 clusters for 50 items that are shown to a user 1

#

yes

#

which item he will buy

#

the second one

#

yes

#

we have historical items that are shown to users and whether or not that item is bought

#

users are anonymized

#

we have around 10k users

#

around 1m items that were shown to them, some 40, some 50, some even 200 items

#

and for each user we know which items from those 40 or 50 they have chosen

#

no, we assume that users are kinda "unique"

#

I have done clustering on user level.
I took all items in my dataset which are shown to a user A and clustered them into clusters, then user B, and so on for all 10k users. As a result, the clusters are "per user" only

desert oar
#

what was the initial task here?

supple ferry
#

initial task is to predict the item which user will choose from the item pool that we have for that user ( i have no control over that pool)

#

while having historical "unique" item pools and also the item which was chosen

#

there always was a choice

desert oar
#

that's what an economist would call a "discrete choice" model

supple ferry
#

yes

#

it is a discrete choice model 🙂

#

topic of my thesis

desert oar
#

for clustering, you said you've already done it, but my immediate instinct would be to use NMF on the buyer-product matrix

#

NMF gets you clustered users and items at the same time

#

and non-negativity is just nice i guess

#

you could use SVD instead if you didnt need non-negativity
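A toy sketch of the NMF idea with scikit-learn. The 4x4 buyer-product matrix is invented and has two obvious groups built in; a real one would be large and sparse:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical buyer-product matrix: rows are buyers, columns are products,
# entries are purchase counts.
purchases = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 0, 4, 5],
    [0, 1, 5, 4],
], dtype=float)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
buyer_factors = nmf.fit_transform(purchases)   # shape (buyers, components)
product_factors = nmf.components_              # shape (components, products)

# Hard cluster assignment: the component with the largest loading.
buyer_clusters = buyer_factors.argmax(axis=1)
product_clusters = product_factors.argmax(axis=0)
print(buyer_clusters, product_clusters)
```

Both factor matrices come out of the same fit, which is the "users and items at the same time" part.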

supple ferry
#

because of the economics side, i was advised to " use techniques which are more scientifically explainable and interpretable to management people"

#

first thing was to cluster items in every pool and derive some pool-cluster-itemspecific characteristics

desert oar
#

bleh

#

what do you mean pool-cluster-item specific

#

like... you want to perform clustering within each pool?

#

that sounds totally mad

supple ferry
#

i derived additional features for every item which takes into account pool meta characteristics and + the cluster it belongs to within that pool

#

yes, that is exactly what i am saying 🙂

desert oar
#

and youre feeding those into the decision model?

supple ferry
#

yes

#

Logit for now

desert oar
#

how many products do you have btw

supple ferry
#

because only one item in every pool is chosen
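For readers following along, a rough sketch of that setup on synthetic data: an item-level logit on "was chosen or not", followed by picking the highest-scoring item inside each pool. The features `price` and `rating` and all numbers are invented, and this is a simplification rather than the actual thesis model:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
rows = []
for pool_id in range(200):
    n_items = rng.integers(2, 8)            # pools differ in size
    price = rng.uniform(10, 100, n_items)   # hypothetical item feature
    rating = rng.uniform(1, 5, n_items)     # hypothetical item feature
    utility = -0.05 * price + 1.0 * rating + rng.gumbel(size=n_items)
    chosen = np.zeros(n_items, dtype=int)
    chosen[utility.argmax()] = 1            # exactly one choice per pool
    rows += [(pool_id, p, r, c) for p, r, c in zip(price, rating, chosen)]
df = pd.DataFrame(rows, columns=["pool_id", "price", "rating", "chosen"])

# Item-level logit on "was chosen or not".
features = df[["price", "rating"]]
logit = LogisticRegression(max_iter=1000).fit(features, df["chosen"])
df["score"] = logit.decision_function(features)

# Within each pool, predict the single item with the highest score.
predicted_idx = df.groupby("pool_id")["score"].idxmax()
hit_rate = df.loc[predicted_idx, "chosen"].mean()
print(hit_rate)
```

Here the model is scored on the same rows it was fitted on, so `hit_rate` is only a sanity check, not an out-of-sample estimate.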

#

1m "unique" products + 10k users (10k pools)

desert oar
#

also it looks like your original question was about two-stage modeling. what exactly did you want to know about it? like how to propagate variance?

supple ferry
#

my question was about finding another approach to this problem by treating this as two stage problem

#

first stage is can we predict the cluster of items within a pool which user will be interested in

desert oar
#

i agree with ragepope. i dont think that kind of "traditional" approach is going to work

#

like what you described

supple ferry
#

second stage is from the result of the first stage can we predict the item that user will choose

desert oar
#

youre also gonna need to get "computational" here

#

of course you do

supple ferry
#

@desert oar computational will not be a problem

desert oar
#

users only select from a subset of products, right?

supple ferry
#

they search for a product and are given a subset of products (which i have no control over)

desert oar
#

you can see what they were shown, right?

supple ferry
#

yes i have the items they saw

#

and the item they chose

desert oar
#

you know what i'd do because i'm a lazy hack? i'd use NMF to get a low-rank Users cluster matrix and Products cluster matrix, then throw the user and product clusters into Vowpal Wabbit w/ negative-sampling loss

#

i have no idea if that would work, but it would be easy

supple ferry
#

the structure is:

  • pool id,
  • meta features , around 26
  • was chosen or not
#

@desert oar I understood around 60% of what you wrote 😄

desert oar
#

vowpal wabbit is a command line tool for fitting linear models

#

but you have 1m labels so you really need negative sampling loss or something like that -- it's what makes word2vec possible with even huge vocabulary sizes

#

vowpal wabbit will do pairwise/quadratic interactions and negative sampling loss

#

so it's a very fancy logistic regression for really lazy people

drowsy marsh
#

Hello, very basic user here.
I have a set of data in a table, and a function of 2 parameters of the type :
y = f(x1, x2)
I would like to have some kind of linear interpolation for any point on the graph based on the data I have. How would it be possible?
in "1D", np.interp does it
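One possible approach, assuming the table sits on a regular grid of x1 and x2 values (the numbers below are made up), is SciPy's `RegularGridInterpolator`:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical table: y measured on a regular grid of x1 and x2 values.
x1 = np.array([0.0, 1.0, 2.0])
x2 = np.array([0.0, 10.0])
y_table = np.array([[0.0, 10.0],
                    [1.0, 11.0],
                    [2.0, 12.0]])   # y_table[i, j] = f(x1[i], x2[j])

f = RegularGridInterpolator((x1, x2), y_table, method="linear")

# Interpolated value at x1 = 0.5, x2 = 5.0.
y_mid = f([[0.5, 5.0]])[0]
print(y_mid)
```

If the points are scattered rather than on a grid, `scipy.interpolate.griddata` with `method="linear"` covers that case.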

desert oar
#

@void anvil i actually disagree, how to model this stuff well is all locked down as trade secrets

#

doing a thesis on purchase prediction and publishing it openly is valuable work imo

#

you think amazon doesnt?

#

sure. but in the real world youre almost never flying totally blind, its more just a matter of not knowing wtf to do with your metadata

#

i dont know what that even means

#

most users will have a unique combination of features, but that's just statistics

#

low variance X gives shitty predictions anyway

supple ferry
#

this is a project in collaboration with a firm. firm is interested in a predictive model which will tell them which items from the pool they get to show first and minimize the time on website.
for the academia it is to understand the decoy effect and make one item more likely to be chosen by "putting" the decoy item near it

desert oar
#

sort of

#

amazon knows a lot more than that

#

they also know each users unique purchase and ad click history

#

which they do have

#

no? i thought that was the whole pool/search thing

#

they know what people searched for and clicked on

#

...right?

#

hm

#

i also dont like the term "pool" here

#

its misleading

#

its a search id

#

for a single search

#

@supple ferry do you really only have 1 search per user? or a whole history

#

and what are some examples of the other features?

#

and how big are the search pools anyway? first 10 results?

supple ferry
#

@desert oar so it is a search yes. and we have one search per "user"

desert oar
#

why do you say "user"

#

is this not real user data?

supple ferry
#

what we have is search ids. and in the dataset we also have user ids, but they are all unique, it does not give us much info

#

so what i have is, user comes to website, searches for something -- as result he gets 1 - 200 results

#

and then chooses one

desert oar
#

ahhhh

#

ok

#

if anything you only have one user

#

and your entire model will be effectively marginalized over user features

supple ferry
#

sorry if i explained in unclear way

desert oar
#

i think he needs to hire me as a consultant

#

so i can quit my job and work on an interesting problem

supple ferry
#

our goal is to order those results in a way that user finds what he searches for and buys it and leaves the site

desert oar
#

your model is: P( Y | U, Z ), where Y is the purchase decision, U is user metadata, and Z is other metadata (time of day, location, etc)

supple ferry
#

but i dont have any user metadata

desert oar
#

right

supple ferry
#

not even age gender and etc

desert oar
#

without knowing individual users, you are marginalizing over U

#

so you are fitting the model SUM_u P( Y | U=u, Z ) * P( U=u )

#

im abusing notation but hopefully you see what i mean

#

unless you start trying to disambiguate users using e.g. location, your whole thing needs to be interpreted from the perspective of "this is averaged across all users"

supple ferry
#

yes

desert oar
#

so no, do not treat "users" as unique

#

you have unique searches

#

that said you can probably recover some user-level metadata

supple ferry
#

yes

desert oar
#

e.g. region, users usually don't leave their country or continent

#

so how big are these search pools

supple ferry
#

you mean result pool ?

#

which user sees ?

#

because it does not come from the firm side, it is not normally distributed and ranges from just 1 result up to 200

#

i can not do so because i do not have control on what results the user is getting, i have control on ordering of those results

#

from academia point of view i am interested in ordering of the results because of the asymmetric dominance effect

desert oar
#

ok that's a start at least

#

i'd seriously consider returning to first principles

#

e.g. the only reason we can use logistic regression for this is because of random utility maximization
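A small simulation of that link, as a sketch: if each option's utility is a fixed part plus independent Gumbel noise and people pick the maximum, the choice shares match the softmax (multinomial logit) probabilities. The three utilities are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical deterministic utilities of three items in one result set.
v = np.array([1.0, 0.5, -0.5])

# Random utility: each simulated user adds independent Gumbel noise
# and picks the item with the highest total utility.
noise = rng.gumbel(size=(200_000, v.size))
choices = (v + noise).argmax(axis=1)
simulated_shares = np.bincount(choices, minlength=v.size) / len(choices)

# Multinomial logit (softmax) choice probabilities for the same utilities.
logit_shares = np.exp(v) / np.exp(v).sum()

print(simulated_shares)
print(logit_shares)
```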

supple ferry
#

yes

desert oar
#

i don't know if this is the right answer. but maybe write out the user's utility function

#

im not sure you even can write one out

#

because you have this weird varying budget set

supple ferry
#

yes

desert oar
#

so the user's decision is going to depend not just on their utility but also on the budget set

#

yes like i said, this is all in the sense of an "average" user

supple ferry
#

yes they choose from what they are given

desert oar
#

right

#

ok so

#

RUM framework

supple ferry
#

this problem is interesting as it is hard

desert oar
#

yeah dude can i please go back to school

#

lol

supple ferry
#

exactly

desert oar
#

or actually get hard problems at work

#

instead of problems that are conceptually easy but just require fuck tons of programming

supple ferry
#

are you an economist btw

desert oar
#

@void anvil im working on it lol

#

i wish

#

im not smart enough

#

@supple ferry no but i thought i was going to get an econ phd at one point

#

@void anvil thats interesting, i do like framing this as a search relevance problem

#

ok i think ive convinced myself that RUM still holds up here

#

but you totally do

#

preferences are a partial ordering

#

you wont reconstruct the complete order

#

sure there is, you see a collection of results and you know that chosen_item was preferred

#

preferred given the search results

#

thats the whole thrust of their investigation

#

yeah youll never do it all

supple ferry
#

no no

desert oar
#

but i dont think thats the goal

#

that said @supple ferry do you actually have the search terms?

supple ferry
#

i want to order only the results a particular user sees

desert oar
#

because i do think rage is on to something

#

im not sure what that achieves

#

the idea isn't to find a preference over all 1m items

#

oh i see

#

circumvent the problem entirely

#

w/ respect to the business "spend as little time on the page" criterion

#

yeah no thats totally reasonable

#

eh

#

not even shitty

#

if anything i think the question is a little misguided after this conversation

#

like... of course order matters a lot

#

and no, swapping the 198th and 197th items does not matter at all

#

nor does increasing the set from 201 to 202

#

if you really want to prove that, sure, you have a thesis

#

but if you want to investigate IIA violations in the wild you'll want to restrict your data set significantly

#

and you probably need more baseline data

#

e.g. how often does anyone ever click the nth search result?

#

i assume that after like 10-20 you get near-zero clicks

#

so not only "should" you use search term relevance like ragepope is saying, but your data is going to be very heavily dependent on the existing algorithm

#

@void anvil insurance, believe it or not

#

nope

#

btw actuarial would be really interesting if only regulators werent so strict about pricing

#

you know how everyone is clamoring for regulating AI and open/transparent algorithms and stuff? we've had that in insurance for years, and it holds back the industry

#

and it makes a worse experience for everyone

#

"why should i be penalized just because im under 25??"

#

"because state regulators dont understand XGBoost son"

#

right lol

#

but no i literally havent touched pricing at all

#

it sucks

#

ive basically spent the last year classifying businesses

#

yeah i love lime

#

its weird w/ text models though

#

yeah i know i cant remember the name either

#

i saw you mention it

#

hang on its in my zotero collection

#

SHAP

#

zotero to the fuckin rescue

#

i also like partial dependence plots a lot

#

oh shap has them too, nice

#

i actually wrote my own partial dependence plot library for R a few years ago

#

i used it all the time

#

at my previous company they wanted to know how much users were willing to pay for a product upgrade

#

so i did exactly what QWERTY is doing, i fitted a discrete choice model w/ price as one of the inputs

#

then i got the partial dependence of price

#

the business loved it, it was the perfect combination of "i understand this" and "ai magic" and "we are domain experts"
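A rough sketch of what "partial dependence of price" means, not the library mentioned above. The function `predict_upgrade_prob` is a hypothetical stand-in for a fitted choice model, and its coefficients are invented.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value          # overwrite the feature in every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

# Hypothetical stand-in for a fitted model: the probability of choosing the
# upgrade falls with price (column 0) and rises with usage (column 1).
def predict_upgrade_prob(X):
    utility = 2.0 - 0.5 * X[:, 0] + 0.3 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-utility))

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(1, 10, 500), rng.uniform(0, 5, 500)])

price_grid = np.linspace(1, 10, 10)
pd_price = partial_dependence(predict_upgrade_prob, X, feature=0, grid=price_grid)
print(np.round(pd_price, 3))
```

Plotting `pd_price` against `price_grid` gives the average predicted uptake at each price with the other inputs left as observed.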

supple ferry
#

i will dig into the directions you mentioned

#

@void anvil and @desert oar

#

wanna thank you both

#

for help and interesting discussion

desert oar
#

put us in your thesis acknowledgements ;P

#

"and all those people with waifu and rick and morty avatars"

supple ferry
#

😄

desert oar
#

same. i wish i talked to people about mine

#

i didnt even have an advisor for a while

supple ferry
#

i have time till 2021

desert oar
#

oh yeah you got time. im actually curious to see what comes out of this

supple ferry
#

i will share it with you when the time comes 🙂

desert oar
#

holy shit

supple ferry
#

you have been through some serious stuff

desert oar
#

yeah poor guy

#

4 weeks with kidney stones? that must have been some serious kidney stones

supple ferry
#

@void anvil i meant about notes one day before presentation 🙂

desert oar
#

i think mine had a max page limit, i cant remember

#

max limit including charts

#

i remember it was brutal trying to pare it down

#

i was like tweaking line spacing and page borders in latex

#

heh

#

i also never had to present or anything, i just dropped the paper off at my advisor's office, i dont think i even met with him when i got the final marked up version back, i might have just picked it up from a drop box or something

#

like a fuckin term paper lol

#

that sounds way more intelligent

#

i basically turned a "fast track" 1 year masters into a full 2 year program

#

i did everything so weird and wrong

#

heh

#

my problem was partly that i got sick during my 3rd semester

#

fucked up my whole life

#

not to mention my schedule

#

lol right

unborn drum
#

is regression or forecasting used to predict rainfall?

grizzled folio
#

I can't speak for all agencies, but I believe it's usually forecasting

unborn drum
#

@grizzled folio thanks

grizzled folio
#

operational forecasting is pretty interesting though

grizzled folio
#

It's not really that complicated

desert oar
#

Regression and forecasting are two different things, you can use one or neither or both if you want

unborn drum
#

oh, then is regression used in estimating whether or not a person will default on a loan?

lapis sequoia
#

Hello, can you suggest some large data science projects (not involving deep learning) written by experienced data scientists that I could study? I mean complete projects with a proper project structure, not just Jupyter notebooks. I want to see what a big data science project should look like.

desert oar
#

@lapis sequoia those are usually done professionally and not likely to be something you can find publicly

#

You might find some of the source code, or you might find the result of the project open source, but you're not likely to find a project where all of the intermediate work has been published

#

@unborn drum regression is a type of model, forecasting is a task to be achieved with a model
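One way to picture the distinction, as a sketch on a synthetic series: the model is an ordinary least-squares regression, and the task it is used for is forecasting the next value. The series and its coefficients are invented.

```python
import numpy as np

# Synthetic series for illustration only: y[t] = 0.8 * y[t-1] + 1.0.
y = [1.0]
for _ in range(49):
    y.append(0.8 * y[-1] + 1.0)
y = np.array(y)

# Model: least-squares regression of y[t] on y[t-1] plus an intercept.
X = np.column_stack([y[:-1], np.ones(len(y) - 1)])
(slope, intercept), *_ = np.linalg.lstsq(X, y[1:], rcond=None)

# Task: forecast the next, not yet observed value with the fitted model.
next_value = slope * y[-1] + intercept
print(slope, intercept, next_value)
```

The same regression could just as well be used for a non-forecasting task, such as estimating the probability of a loan default from applicant features.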

lapis sequoia
#

Thanks @desert oar 🙂

supple ferry
#

@void anvil paperswithcode is biggest discovery i made this year

west walrus
#

Can anyone suggest a good method of object detection for finding specific images (although they may not be seen straight on) in a picture? I'm trying to detect street signs in real time

quartz stream
#

YOLO

#

is good for object detection

west walrus
#

Ooh ok I'll check it out

quartz stream
#

with fewer objects

#

there are pre-trained models available

west walrus
#

Is it easy to train my own images on it?

#

Or would there already be street sign models

quartz stream
#

it is easy to train

#

read this paper

west walrus
#

Thanks so much! My latest attempt was to run feature matches over circles detected on the frame. felt very janked

quartz stream
#

this is sota for object detection

#

yeah

west walrus
#

Ok, I'll have a read

#

Thanks so much 👌

quartz stream
#

The code if you want 😛

west walrus
#

Is it resource intensive? I'm planning to run it on a raspberry pi but I could stream the video to my pc for processing if it was necessary

quartz stream
#

you can always train the model

#

then save the model to a path

#

that file (joblib) can be used

#

on raspberry

#

and it wont be resource intensive

west walrus
#

Okay, that's perfect then!

quartz stream
#

There are alternatives for keras as well as tensorflow online

#

but if you are not into complex models try yolo, it is a single-shot detector

west walrus
#

Okay sounds good

proud iris
#

hi! So I've built my first regression model in Keras and it doesn't look good. I'm not getting very accurate results, can anyone help?

desert oar
#

@proud iris what data do you have, what kind of model are you using, how are you training the model, and how are you evaluating accuracy

dusty latch
#

Can I bounce some ideas off anyone? I need project ideas that have business applications for a class. I already did a stock bot last semester. I need some fresh ideas.

proud iris
#

I just created a data set, x*sin(x) is my evaluating function. I have three columns, x, sin(x) and their product. I have 3000 data entries, I'm using 2995 for training, last 5 as test cases

#

I'm very very new to this so please be patient with me if some of my questions appear very trivial

desert oar
#

you're trying to learn the function x*sin(x)?

proud iris
#

accuracy is being measured by mean squared error

#

yeah

desert oar
#

5 test cases isn't much

#

what kind of model is this?

#

1 output w/ some fully connected hidden layers?

proud iris
#

yeah. Dense layers stacked

desert oar
#

are you doing any hyperparameter tuning

proud iris
#

can I show you my code

desert oar
#

i guess. i dont really use keras and i'm not really a neural network aficionado

#

that would answer my questions though

#

!paste

arctic wedgeBOT
#
paste

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

proud iris
#

no, I don't have any idea about hyperparameter tuning so I guess I'm not doing it?

desert oar
#

yeah probably not

proud iris
dusty latch
#

You could use talos for hp tuning

desert oar
#

maybe instead of just the last 5, what if you randomly drop out points for train/test

proud iris
#

This should be a very simple implementation? I shouldn't be needing hypertuning and other stuff?

desert oar
#
data_train = data.sample(frac=0.9)
data_test = data.loc[data.index.difference(data_train.index)]
proud iris
#

walk me through this, data_train gets 75 percent of the generated data

#

what is going on in data_test?

desert oar
#

i changed it to 90 but yeah

#

pandas has this notion of an "index"

#

basically a label for each row

proud iris
#

yeah like excel

desert oar
#

sorta yeah

#

data.index.difference(data_train.index) gives you the index values from data that aren't in data_train

#

then you use .loc to get the corresponding rows

proud iris
#

okay got it

#

is that 90 percent random? or the first 90 percent

desert oar
#

alternatively you can do

is_train = np.random.choice([True, False], p=[0.9, 0.1], size=data.shape[0])
data_train = data.loc[is_train]
data_test = data.loc[~is_train]
#

random

#

(in both cases)

#

that doesn't make your model better but at least evaluating it will be more representative
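Put together as a runnable sketch. The data set is rebuilt from the description above (x, sin(x) and their product, 3000 rows); the range of x and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild a data set like the one described: x, sin(x) and x*sin(x).
x = np.linspace(0, 10, 3000)
data = pd.DataFrame({"x": x, "sin_x": np.sin(x), "target": x * np.sin(x)})

# A random 90% of the rows for training; random_state makes it repeatable.
data_train = data.sample(frac=0.9, random_state=0)

# Every row whose index label is not in the training set goes to the test set.
data_test = data.loc[data.index.difference(data_train.index)]

print(len(data_train), len(data_test))
```

Because `sample` picks rows at random, the test rows are spread over the whole range of x instead of being the last few values.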

proud iris
#

still 300 values 😦

#

Gonna try with 99

dusty latch
desert oar
#

whats wrong with 300 values @proud iris ?

#

if your model is bad, going from 700 to 990 values isn't going to make it better

#

but it will improve your evaluation because you're going from 10 to 300 in your test set

dusty latch
#

Also, like srl said before you should really be doing at least a 20% test

proud iris
#

How will I check the values of 300 entries?

dusty latch
#

With code

#

for loop over all 300

proud iris
#

or the mean squared error
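A small sketch of checking all 300 rows at once instead of reading them one by one. The arrays are stand-ins: in practice `y_pred` would come from the model's predictions on the test inputs, and here it is just the true value plus invented noise.

```python
import numpy as np

# Stand-in test data: 300 inputs and the true values of x*sin(x).
rng = np.random.default_rng(0)
x_test = rng.uniform(0, 10, 300)
y_true = x_test * np.sin(x_test)

# Fake "predictions" for illustration: the truth plus some noise.
y_pred = y_true + rng.normal(0, 0.1, 300)

errors = y_pred - y_true
mse = np.mean(errors ** 2)                 # one number summarising all 300 rows
worst = np.argsort(np.abs(errors))[-5:]    # indices of the five largest misses

print(f"MSE = {mse:.4f}")
print("x values with the largest errors:", x_test[worst])
```

Looking at where the largest errors fall can be more informative than the single MSE figure, for example if the model only fails on part of the range.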

#

but why is my model not predicting well? Is the code correct?

dusty latch
#

probably because you have no hp tuning and you're using the same type of neurons for like 4 layers

desert oar
#

it's hard to say when you're only predicting on 5 values

#

and what landshark said

dusty latch
#

throw a sig layer in there see what happens.

desert oar
#

im not even sure what the value of stacking relus like that is. wouldnt you just use like 1 big hidden linear layer for this kind of thing

#

or like a hidden layer and a sigmoid cause sin

#

like i said im not a NN guy

#

so correct me if im wrong

dusty latch
#

Also, you need a validation set.

#

then do a confusion matrix on the validation set with prediction set

#

Do 75, 15,10 sets
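A sketch of such a three-way split on row positions, assuming the proportions mean 75% train, 15% validation and 10% test (the order of the last two is a guess).

```python
import numpy as np

n_rows = 3000
rng = np.random.default_rng(0)
shuffled = rng.permutation(n_rows)      # random order of the row positions

# Integer arithmetic avoids floating-point surprises in the set sizes.
n_train = n_rows * 75 // 100
n_val = n_rows * 15 // 100

train_idx = shuffled[:n_train]
val_idx = shuffled[n_train:n_train + n_val]
test_idx = shuffled[n_train + n_val:]   # whatever is left, here 10%

print(len(train_idx), len(val_idx), len(test_idx))
```

The validation rows are for comparing model settings during development; the test rows are kept aside and only used once at the end.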

proud iris
#

confusion matrix?

#

validation set, by that you mean a test set whose output I know

dusty latch
#

Literally just get 10% of your data and use that as your validation set.

proud iris
#

I don't know any of those things @void anvil so please explain a bit. Also the diagram

dusty latch
#

Also, Rik, make that last layer a linear layer. Idk if it does that by default but I do it for good measure

proud iris
#

@dusty latch I am already using 10 percent as my validation set. But what did you say about the confusion matrix?

dusty latch
#

a confusion matrix basically tests your validation set against your predicted values.

proud iris
#

okay. So just subtracting one from the other?

dusty latch
#

You can get a better measure of model usefulness from sensitivity and specificity. A cm will give that to you.

#

It does a lot more than just subtracting one from the other.

proud iris
#

the one thing I don't get is that there are examples and tutorials where this code is enough to produce accurate-ish results. And my function isn't that difficult to learn is it?

#

@dusty latch will look into it then.

dusty latch
#

You're just not using enough data points and tuned hp

#

just look up hyperparameter tuning in python keras and you'll have loads of things.

proud iris
#

What was that bit about linear layers? Mine is a dense one, isn't that how it should be? every neuron connected to all other neurons in the previous layer?

#

yeah I will definitely look into it

dusty latch
#

You know how you had layers before as relu? Add something like that to the last layer but use linear

proud iris
#

yeah but why?

#

okay not the layer but the activation function
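A sketch of why the output activation matters for this target, using plain NumPy rather than the actual model (which is not shown here). The range of x is an assumption. For reference, Keras's `Dense` applies no activation when `activation` is left unset, which is the same as `activation="linear"`.

```python
import numpy as np

x = np.linspace(0, 10, 1000)
target = x * np.sin(x)                    # the function being learned

# Stand-in for whatever the last Dense layer computes before its activation.
z = np.linspace(-10, 10, 1000)

relu_out = np.maximum(z, 0.0)             # never below 0
sigmoid_out = 1.0 / (1.0 + np.exp(-z))    # always strictly between 0 and 1
linear_out = z                            # unchanged, any real value is reachable

print("target range: ", target.min(), target.max())
print("relu range:   ", relu_out.min(), relu_out.max())
print("sigmoid range:", sigmoid_out.min(), sigmoid_out.max())
print("linear range: ", linear_out.min(), linear_out.max())
```

Since x*sin(x) goes below zero and above one, an output layer ending in ReLU or sigmoid cannot reach part of the target range no matter how the hidden layers are tuned.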

#

crap I need to learn quite a bit

#

how many neurons does one use in the hidden layers? Is it trial and error?

#

Won't that create a bias for the validation data?

dusty latch
rancid totem
#

If you keep tuning your hyperparams too much on your validation set then yes you're going to overfit

compact thistle
#

Guys I'm doing some capstone project about Anomaly Detection(Outlier Detection) on Credit card fraud dataset on Kaggle.
The dataset has labels(Whether it's fraud or not) so initially I was going to do some supervised learning stuff but then in real life scenario we don't get to work with labels so I was going to use unsupervised learning methods to detect outliers.
Yes I'm pretending I have no information about labels but will use it later to evaluate my unsupervised learning methods' performance.

The problem I'm facing here is that maybe it is too ...simple? I'm basically trying out many different outlier detection models and simply comparing them.
I'm wondering if you guys have any brilliant idea to make this part fancy or impactful enough?
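A minimal sketch of the setup described above: score points without labels, then bring the labels back only to evaluate. The data is synthetic, and the robust distance score is just one simple choice, not a recommendation for the Kaggle data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the transactions: 990 normal points, 10 anomalies.
normal = rng.normal(0.0, 1.0, size=(990, 2))
fraud = rng.normal(6.0, 1.0, size=(10, 2))
X = np.vstack([normal, fraud])
labels = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# Unsupervised score: distance from the per-feature median, scaled by the
# median absolute deviation. The labels are not used in this step.
median = np.median(X, axis=0)
mad = np.median(np.abs(X - median), axis=0)
scores = np.linalg.norm((X - median) / mad, axis=1)

# Evaluation only: how many of the k highest-scoring rows are labelled fraud?
k = 10
top_k = np.argsort(scores)[-k:]
precision_at_k = labels[top_k].mean()
recall_at_k = labels[top_k].sum() / labels.sum()
print(precision_at_k, recall_at_k)
```

Running several detectors through the same evaluation step gives a like-for-like comparison, since each one only has to produce a score per row.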

lapis sequoia
#

hmm.. if you don't have labels, how do you plan to discriminate

#

there is a distance measure we typically use in banking.. for the life of me, i'm not able to recall it right now

#

but it's what's usually used when trying to find who's is at a risk of defaulting.. or committing credit card fraud

#

the name of the distance measure ends in 'ov' , I'll ping you if I remember it

compact thistle
#

@lapis sequoia Hey Tron, yea there are multiple algorithms i can use for outlier detection. Like you said, i have couple of stuff based on euclidean methods or ensemble or probability or cluster.
I'm generally concerned about the depth of the project. I feel like I'm just laying out options and simply telling them what worked the best.
And I don't feel like it's complicated enough

lapis sequoia
#

ok.. you want to tell them you tried a bunch of methods and compare

#

so do that

#

seems like a comparative study of classifiers for this application.. hmm..

compact thistle
#

Yea... still it's just not deep enough. I could probably do some hyperparameter optimization for better performance but then that's totally going against the concept of unsupervised learning

lapis sequoia
#

check out current papers and see if there's a gap you could fill

compact thistle
#

Yea I'll do that on the side. Thanks man

native rivet
#

Hi guys

#

I own a small bakery

#

How can I use data science skills to boost my sale or help my business?

#

Can someone help me

desert oar
#

@native rivet is this an actual question or is this a homework question

native rivet
#

😆😆😆

#

Actual Question

#

@desert oar

olive willow
#

First get some data from your business and try to answer questions you have with it and explore it in general, see if something stands out

#

@native rivet

#

Later you can optimize your business using it and get a data strategy

#

But remember only to collect the data you really need, because data without a purpose is worth nothing

#

Questions like:

#

When do customers come to my shop

#

What do they buy

#

Are the customers somehow related
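The first two questions can be sketched with pandas on a hypothetical till export. The column names and every row below are made up for illustration.

```python
import pandas as pd

# Hypothetical till export: one row per item line on a receipt.
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-02 08:15", "2020-03-02 08:40", "2020-03-02 12:05",
        "2020-03-03 08:20", "2020-03-03 16:30", "2020-03-03 08:55",
    ]),
    "item": ["croissant", "coffee", "sandwich", "croissant", "cake", "croissant"],
    "quantity": [2, 1, 1, 3, 1, 1],
})

# When do customers come? Count rows per hour of the day.
visits_by_hour = sales.groupby(sales["timestamp"].dt.hour).size()

# What do they buy? Total units per item, best sellers first.
units_by_item = sales.groupby("item")["quantity"].sum().sort_values(ascending=False)

print(visits_by_hour)
print(units_by_item)
```

With a few weeks of real receipts, the same two tables would show the busy hours and the best sellers, which is a reasonable place to start before collecting anything more.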

native rivet
#

Thanks bro

#

I'll see how it goes

polar acorn
#

I suggest you entice the customers with some deep baking. Use densely connected flavours and many hidden layers in your pastries. Also look into batch normalisation if you're baking large batches of cup cakes. LSTM (Luxury Sweet Tasting Macarons) are in high demand, though some prefer TCN (Trdelník Containing Nutella). Best of luck

earnest prawn
#

🙄

native rivet
#

It's a franchise

#

@polar acorn

polar acorn
#

It was an attempt at humor :pensive_snake_cowboy:

native rivet
#

@polar acorn 😆

#

Let's see ill let you know. Thanks @void anvil

dusty latch
#

Lads, I'm struggling to think of ideas to do for my python data analytics class. Last semester I had the same class but in R and I did stock market direction prediction. Any ideas on something cool I can do? I dont want to do something just plain like the iris dataset or whatever

dusty latch
#

Thanks but I forgot to add that it needs to have a business case