#data-science-and-ml | Python | Page 207

sand reef Aug 15, 2019, 2:20 PM

#

missing days?

#

why not just fill them up with the upcoming day values?

#

you mean the missing days is a pretty wide gap?

#

ah, then it wont work

#

generally speaking, in my weather prediction model there was some missing data, and apparently pandas can do this autofill thing, so it filled the gaps with the previous day's records.

#

Instead of me having to drop all those columns altogether.

#

yeah, but I am assuming, here, entire rows are missng

#

*missing

#

so it would be pretty impossible to just fill them up for a while
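
#

A minimal sketch of both ideas, assuming a daily series indexed by date (the dates and the `temp` column are made up): reindexing onto a full date range inserts the missing rows as NaN, and `ffill` copies the previous day's values into them. `limit` caps how many consecutive days get filled, so a wide gap stays visibly missing instead of being papered over. Filling from the upcoming day would be `bfill` instead.

```python
import pandas as pd

# Hypothetical daily readings with 3 Jan, 4 Jan and 5 Jan missing entirely.
df = pd.DataFrame(
    {"temp": [20.0, 21.5, 19.0]},
    index=pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-06"]),
)

# Insert a row for every calendar day; the new rows hold NaN.
full_index = pd.date_range(df.index.min(), df.index.max(), freq="D")
with_gaps = df.reindex(full_index)

# Fill each gap from the previous day, but at most 2 days in a row.
filled = with_gaps.ffill(limit=2)
print(filled)
```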

#

would this interest you? https://stackoverflow.com/questions/16787038/insert-rows-for-missing-dates-times


#

hmmmm.....

#

okay, i am very very new to data science

#

i have used augmented dickey fuller tests and all to see if the multivariate time series is stationary or not and all

#

but i didnt understand what autolag meant

#

and i cant seem to find its definition on google either

#

could you tell me, what on earth is autolag?

#

adds lag to the variables?

#

you mean it repeats it?

#

so, it just fills up the breaks between the time periods, by repeating the variable

#

so.... what is it then?
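
#

For what it's worth: in statsmodels' `adfuller`, `autolag` is the rule for choosing how many lagged differences go into the test regression. With `autolag="AIC"` (the default) it tries lag counts from 0 up to `maxlag` and keeps the one with the lowest information criterion; it does not fill or repeat anything in the data. Below is a rough NumPy sketch of that selection idea, not the library's actual code (`select_lag_by_aic` is a made-up helper and the random walk is toy data).

```python
import numpy as np

def select_lag_by_aic(y, maxlag):
    """Pick the number of lagged differences for an ADF-style regression by AIC."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    target = dy[maxlag:]                # same sample for every candidate lag count
    n = len(target)
    aics = []
    for p in range(maxlag + 1):
        cols = [np.ones(n), y[maxlag:-1]]              # constant and lagged level
        for i in range(1, p + 1):
            cols.append(dy[maxlag - i:len(dy) - i])    # i-th lagged difference
        X = np.column_stack(cols)
        beta = np.linalg.lstsq(X, target, rcond=None)[0]
        rss = np.sum((target - X @ beta) ** 2)
        aics.append(n * np.log(rss / n) + 2 * X.shape[1])
    return int(np.argmin(aics)), aics

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=300))    # a random walk as toy data
best_lag, aics = select_lag_by_aic(series, maxlag=5)
print(best_lag)
```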

lapis sequoia Aug 15, 2019, 2:58 PM

#

```
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten

imgs = np.load("/Users/Kushi/PycharmProjects/"
               "Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/X.npy")
targets = np.load("/Users/Kushi/PycharmProjects/"
                  "Artificial_Intelligence/MachineLearning/Sign-language-digits-dataset 2/Y.npy")

imgs = imgs.reshape((2062, 64, 64, 1))
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Conv2D(64, (3, 3), input_shape=(64, 64), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(32, activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(imgs, targets, epochs=5)```

#

ValueError: Input 0 of layer conv2d is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]

#

why do i get this error even when i reshape the array?

feral lodge Aug 15, 2019, 3:16 PM

#

You should only specify input_shape for the first Conv2D. The rest will figure it out themselves. If your data has shape (2062, 64, 64), then input_shape should be (64, 64).

#

If you're only reshaping in order to add the final 1 dimension to the data, there's no need to do that. Just remove the reshape, and remove the input_shape to the second Conv2D

#

@lapis sequoia

lapis sequoia Aug 15, 2019, 3:24 PM

#

i get this error : ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]

#

when i removed the reshape and input_shape

#

oops

#

didnt notice the input_Shape in 2nd conv

#

i still get the same error though :(

#

@feral lodge

#

sry for ping

feral lodge Aug 15, 2019, 3:27 PM

#

no worries 👌 Gimme a sec

lapis sequoia Aug 15, 2019, 3:27 PM

#

shall i send the current code?

feral lodge Aug 15, 2019, 3:31 PM

#

No need! Just tell me, what does targets.shape give you?

lapis sequoia Aug 15, 2019, 3:31 PM

#

okay

#

(2062, 10)

feral lodge Aug 15, 2019, 3:39 PM

#

Oh, I might be wrong actually. Go ahead and reshape imgs to (2062, 64, 64, 1). But set input_shape of the first Conv2D to be (64, 64, 1)

#

This works for me: ```
imgs = np.random.rand(2062, 64, 64, 1) # fake data
targets = np.random.rand(2062, 10)

model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape=(64, 64, 1), activation="relu"))
...
```

lapis sequoia Aug 15, 2019, 4:20 PM

#

okay

#

thx ill try it
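
#

The shape bookkeeping behind that fix, sketched with NumPy only and zero-filled stand-in data: `Conv2D` wants each sample as (height, width, channels), so grayscale images need an explicit channel axis of size 1, and `input_shape` describes a single sample without the batch dimension.

```python
import numpy as np

# Stand-in for X.npy: (samples, height, width), no channel axis yet.
imgs = np.zeros((2062, 64, 64), dtype=np.float32)

# Add a trailing channel axis; same result as imgs.reshape((-1, 64, 64, 1)).
imgs_4d = imgs[..., np.newaxis]

# input_shape for the first Conv2D is the per-sample shape, batch axis dropped.
input_shape = imgs_4d.shape[1:]
print(imgs_4d.shape, input_shape)   # (2062, 64, 64, 1) (64, 64, 1)
```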

desert oar Aug 15, 2019, 7:20 PM

#

@void anvil for time series?

#

to figure out if there's any AR structure you should look at the ACF and PACF plots

#

nah you dont actually look at the charts

#

you just check the ACF at p lags is > threshold

#

for what purpose

#

you mean you wanna use this as a feature to predict something, and you want to include multiple lags as predictors?

#

like Y ~ X + lag(X, 1) + lag(X, 2)

#

oh i see youre actually doing the AIC of the regression model

#

what is the point of the lags, because you think heres a cyclical component?

#

idea being, N lags should smooth those out?

onyx moth Aug 16, 2019, 6:08 AM

#

Could you use ML to predict whether your country is going to enter a recession? You do have enough data

lapis sequoia Aug 16, 2019, 9:49 AM

#

How can I extract only specific text from an image with OpenCV and Tesseract?

#

I want to try to extract only specific numbers from a lottery ticket.

#

We can do so by matching the aspect ratio of the text

#

But what if the numbers are pretty much the same font size?

wicked flare Aug 16, 2019, 12:09 PM

#

I'm sure macroeconomists do use data science methods to make predictions. Whether they are successful, or to what degree, is a different matter.

lapis sequoia Aug 16, 2019, 1:30 PM

#

1443/1443 [==============================] - 22s 15ms/sample - loss: 0.4330 - acc: 0.8614 - val_loss: 17.3330 - val_acc: 0.0000e+00

#

how can i fix this sorta thing

#

```
x_train = x_train.reshape((-1, 64, 64, 1))
model = Sequential()
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64, activation="relu"))
model.add(Dense(10, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.3)```

#

is the code

earnest prawn Aug 16, 2019, 2:09 PM

#

@lapis sequoia is the validation accuracy like this for all epochs?

lapis sequoia Aug 16, 2019, 2:09 PM

#

yes

earnest prawn Aug 16, 2019, 2:09 PM

#

then yes it is a very extreme case of overfitting

lapis sequoia Aug 16, 2019, 2:10 PM

#

shall i add 3 dropout layers?

earnest prawn Aug 16, 2019, 2:10 PM

#

you shall add dropout layers yes

lapis sequoia Aug 16, 2019, 2:13 PM

#

still the same

#

even after adding 3 dropouts

#

yep

feral lodge Aug 16, 2019, 3:10 PM

#

You'll want to penalize the Dense weights as well, with weight decay. They call it regularizers.l2 here: https://keras.io/regularizers/

#

You may not have enough data also; you have ~2000 training points right? For 64x64x1 images and a network with so many kernels that may be a bit on the low side. The network might generalize better if you downsample the images and reduce the size of the network.

#

That's not a good idea if the small details of your images are important though. If you can squint your eyes and still easily tell what each image is, then it might work better

#

For contrast, the cifar10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html) has 32x32 images, and 5000 training points for each class
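
#

One more thing that might be worth ruling out, since a validation accuracy of exactly 0 is suspicious: Keras takes `validation_split` from the last samples of the arrays, before any shuffling. If the `.npy` files happen to be ordered by class (an assumption, not something checked for this dataset), the last 30% would contain digits the network never trained on. A NumPy sketch with made-up sorted labels, showing the problem and a shuffle that avoids it:

```python
import numpy as np

labels = np.repeat(np.arange(10), 100)   # hypothetical: 1000 labels sorted by class
split = int(len(labels) * 0.7)

# What validation_split=0.3 would do on sorted data: hold out the tail.
train_classes = set(labels[:split])
val_classes = set(labels[split:])
print(train_classes & val_classes)       # set() -> no class is in both parts

# Shuffle with one permutation, then split; the tail now mixes all classes.
rng = np.random.default_rng(0)
perm = rng.permutation(len(labels))
shuffled = labels[perm]
val_classes_shuffled = set(shuffled[split:])
print(len(val_classes_shuffled))
```

In the training script that would mean indexing both arrays with the same permutation before `fit`, for example `x_train, y_train = x_train[perm], y_train[perm]`.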

lapis sequoia Aug 16, 2019, 5:55 PM

#

Hi could somebody help me out with implementing batch normalization to this implementation? https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow/blob/master/contents/9_Deep_Deterministic_Policy_Gradient_DDPG/DDPG_update2.py?utm_source=share&utm_medium=ios_app I am not sure What to put in the learn function. @ me svp


calm tundra Aug 16, 2019, 6:38 PM

#

i am new to the field of machine learning and just want to get started... can anyone please guide me on how to begin, and what the right sequence is to follow for machine learning and AI?

#

also can anyone tell me from where i can learn this any channel or youtube playlist for ML and AI

lapis sequoia Aug 17, 2019, 5:37 AM

#

ill try using the cifar10 dataset

gilded dagger Aug 17, 2019, 8:59 AM

#

Just wanted to say

#

After months of initialising my dicts like a peasant, I discovered defaultdict

#

results_dict = defaultdict(lambda : defaultdict(int))

#

Ty Mr Python
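
#

For anyone who hasn't met it yet, a small sketch of what that nested `defaultdict` buys you; the (model, outcome) pairs are made up for illustration. Missing keys are created on first access, so there is no need to check or initialise anything before incrementing.

```python
from collections import defaultdict

# Outer keys map to inner dicts whose missing values default to 0.
results_dict = defaultdict(lambda: defaultdict(int))

# Hypothetical (model, outcome) pairs to tally.
for model, outcome in [("cnn", "hit"), ("cnn", "miss"), ("svm", "hit"), ("cnn", "hit")]:
    results_dict[model][outcome] += 1

print(results_dict["cnn"]["hit"])   # 2
print(results_dict["knn"]["hit"])   # 0, and the "knn" entry now exists
```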

lapis sequoia Aug 17, 2019, 3:04 PM

#

Hello! Why is a hand gesture detection neural network better than a Haar cascade for a typical gesture, please? 🙂

upper ginkgo Aug 17, 2019, 4:32 PM

#

Hello, I've followed this tutorial to get a simple intent matching neural network:
https://towardsdatascience.com/build-it-yourself-chatbot-api-with-keras-tensorflow-model-f6d75ce957a5
The results are good, it's relatively fast and produces good results, except for one thing.
It always matches an intent, even if a query is nonsense it will match one intent. The probability is always high...
Is there anything I can do to fix this? I'm relatively new to neural networks and I'd appreciate if you add an explanation to whatever's going on! Thanks in advance


earnest prawn Aug 17, 2019, 4:38 PM

#

@upper ginkgo the reason this always outputs something is that you are having a softmax layer as output, softmax layers output a probability distribution, meaning that all output values of the layer will add up to 1. How to fix this I am not sure actually, I guess you could add another output for "nonsense" and add a few training examples which are nonsense so it knows what it has to qualify like that? But thats just my naive approach really
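
#

A quick NumPy sketch of why that happens, with made-up raw scores for three intents: whatever the inputs are, softmax rescales them into positive numbers that sum to 1, so one intent always comes out on top.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into values that are positive and sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up raw outputs for three intents on a nonsense query.
probs = softmax(np.array([2.0, -1.0, 0.5]))
print(probs.sum())      # sums to 1, up to floating-point rounding
print(probs.argmax())   # 0: some intent always wins
```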

upper ginkgo Aug 17, 2019, 4:45 PM

#

Thanks, @earnest prawn
If there's something else I can do I'd really like to hear it

earnest prawn Aug 17, 2019, 4:45 PM

#

I'd like to hear as well, my solution is as I said just a really naive one

feral lodge Aug 17, 2019, 5:58 PM

#

This is a big issue in deep learning; neural networks are currently very poor at expressing uncertainty for previously unseen input. They tend to have a false sense of overconfidence, like you saw

#

The simple answer is that your network is overfit, so you'll need to regularize it harder and hope for the best

#

Consider this example! All data is vectors right, and in this example the data vectors are 2-dimensional

#

📎 unknown.png

#

It's a binary classification problem. The little clusters of data are our labeled training set. The points are labeled 0 or 1. Our neural network outputs a classification digit; a number ranging from 0 to 1

#

What you see in the plot is the learned classification scheme of an overfit neural network. It's very black-and-white -- the network always produces a 0 or a 1. No room for uncertainty

#

But what if our network sees a new point, say at the coordinate (-2, -2) or (-2, 2)? That's very far away from the training data. Reasonably, such points probably don't belong to either of the two classes. But the network will happily misclassify them, expressing 100% certainty while doing so

#

Here we can also see the problem of adding fake nonsense data to the training set. How would we define such points? Those points would literally have to be everywhere surrounding these points

#

That's a lot of work, even in 2 dimensions

#

Language and image processing don't work in 2 dimensions, they work in 1000s of dimensions

#

The solution is to regularize the network

#

Here's the same problem, except the network is properly regularized:

#

📎 unknown.png

#

Voilà! The network expresses uncertainty for weird data

#

Then the programmer will have to decide what levels of uncertainty are acceptable for their application
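
#

Deciding on an acceptable level could look roughly like this: a sketch with a hypothetical helper, made-up intent labels and an arbitrary threshold of 0.8. It only helps if the network really does output lower confidence on odd inputs, which is the part regularization is supposed to take care of.

```python
import numpy as np

def classify_with_reject(probs, labels, threshold=0.8):
    """Return the top label, or None when the top probability is below threshold."""
    best = int(np.argmax(probs))
    return labels[best] if probs[best] >= threshold else None

labels = ["greeting", "goodbye", "thanks"]   # hypothetical intents
print(classify_with_reject(np.array([0.95, 0.03, 0.02]), labels))   # greeting
print(classify_with_reject(np.array([0.40, 0.35, 0.25]), labels))   # None
```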

#

It's a very complicated problem though, and an area of active research. There is no way to guarantee that regularizing your net will work, unfortunately

#

The exact same issue has caused crashes in Teslas and other weird stuff you wouldn't expect

#

Like this poor Chinese guy who got a ticket for scratching his face. https://www.bbc.com/news/blogs-news-from-elsewhere-48401901 The convolutional neural network misclassified him as speaking on a cell phone while driving

upper ginkgo Aug 17, 2019, 6:21 PM

#

oh...

#

My neural network uses text only though, not images

feral lodge Aug 17, 2019, 6:23 PM

#

Sure! But each word is represented as a vector, right? Like an embedding https://en.wikipedia.org/wiki/Word_embedding or even simpler

#

Somehow, all data we work with is translated to vectors before we toss them through the network

#

So regardless of whether it's a text, an image, or just numbers we can always imagine them as points in a very high-dimensional vector space

#

And we always see the same kinds of problems

#

If you enter a text in Japanese, it might be classified with a 100% certainty, even though the network doesn't know what it's talking about

#

Likewise, if we make a network that can classify images of rotten apples from good apples and accidentally give it a picture of a shoe, the network might produce a very confident misclassification

#

"This shoe is definitely a good apple!"

#

Don't be discouraged though, regularization is the key. It might work well for you

upper ginkgo Aug 17, 2019, 6:26 PM

#

Oh well, thanks for the information anyways

#

Is there somewhere I can learn about regularization?

feral lodge Aug 17, 2019, 6:27 PM

#

What deep learning library are you using?

upper ginkgo Aug 17, 2019, 6:27 PM

#

Keras/Tensorflow

feral lodge Aug 17, 2019, 6:29 PM

#

I find many blogs and stuff if I just google "keras regularization". They're probably all good!

upper ginkgo Aug 17, 2019, 6:46 PM

#

Thanks a lot

#

@feral lodge just wondering, I did some research before asking and I found other ‘solutions’ such as using a Bayesian neural network or Monte Carlo dropout. Would those also fix my issue?

feral lodge Aug 17, 2019, 6:55 PM

#

The "good" plot I showed above is actually a Bayesian neural network! They're experimental, don't always work properly and usually take a looong time to train. When they do work, they tend to have good uncertainty estimates, better than usual regularization. If Keras supports them, you can give it a try if you want! Monte Carlo dropout is actually a Bayesian method in disguise, but I've never used it

#

If you're a beginner, it may be wiser to just stick to plain old networks with plain old regularization though. The classic regularization method for nns is weight decay. They call it regularizers.l2 here: https://keras.io/regularizers/

upper ginkgo Aug 17, 2019, 7:40 PM

#

It works better now, thanks @feral lodge! But the nonsense is still getting through, nonetheless..

📎 unknown.png

#

Is there something I can do here?
This is what I added to my code:

model = Sequential()
model.add(Dense(128, input_shape=(len(x[0]),), kernel_regularizer=regularizers.l2(0.01), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))
model.add(Dense(len(y[0]), kernel_regularizer=regularizers.l2(0.01), activation='softmax'))

#

Or do I just increase the value in l2?

feral lodge Aug 17, 2019, 7:46 PM

#

Good job! It's a tricky problem, it'll probably never work flawlessly, particularly with this network architecture. But yes, you should play around with the l2 value

#

Raising it regularizes the network harder, so if it's too high the network will not be able to fit the data properly

upper ginkgo Aug 17, 2019, 7:48 PM

#

Perhaps 0.5 would do the job?

feral lodge Aug 17, 2019, 7:48 PM

#

Give it a try! I've never gone above 0.1 personally

#

But it's good to play around and test

upper ginkgo Aug 17, 2019, 7:49 PM

#

I'm gonna keep that in mind, thanks again

polar acorn Aug 17, 2019, 7:49 PM

#

If you want a non neural network solution you could also do something simple. Just from the top of my head, check if any of the top 1000 most used english words are in the comments and if yes send it to the network if not label it nonsense? Neural networks are nice but some things can also be solved by simple heuristics.

#

Although there are probably better heuristics than the one I suggested 🙂
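
#

A sketch of that kind of pre-filter, assuming a tiny stand-in word set (a real one would be loaded from a word-frequency list) and a hypothetical helper name. Queries that fail the check would be labelled nonsense and never reach the network.

```python
import re

# Tiny stand-in for a "top 1000 English words" list.
COMMON_WORDS = {"the", "is", "you", "what", "how", "can", "hello", "thanks", "help"}

def looks_like_english(text, min_hits=1):
    """True if at least min_hits words of the text are in the common-word set."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(word in COMMON_WORDS for word in words) >= min_hits

print(looks_like_english("hello, can you help me?"))   # True
print(looks_like_english("asdf qwer zxcv"))            # False
```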

feral lodge Aug 17, 2019, 7:51 PM

#

I agree, that's a good idea!

upper ginkgo Aug 17, 2019, 7:59 PM

#

0.5 was a bit too high, nothing matched

#

and I guess it isn't necessary, the nonsense isn't getting through now:

#

📎 unknown.png

#

@feral lodge thanks again, it works like a charm 😄

#

I got a bit scared by bayesian stuff and those methods that required changing a lot of the structure, I'm glad I just had to change a few lines! 😅

feral lodge Aug 17, 2019, 8:09 PM

#

Great job, glad it helped 😄

#

Bayesian neural networks are actually not as scary as you might think! Simply by using L2 regularization you've actually performed Bayesian inference over your network, in disguise. Your Bayesian prior distribution is a Gaussian, and your posterior distribution is a Delta 🐸👍
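
#

Spelled out under the usual assumptions (a zero-mean Gaussian prior with variance sigma^2 on the weights w, training data D; the symbols are chosen here for illustration): maximizing the posterior is the same as minimizing the training loss plus an L2 penalty. Keeping only that single best w, rather than a whole distribution over weights, is what the "Delta" posterior refers to.

```latex
\begin{aligned}
w_{\text{MAP}} &= \arg\max_w \; \log p(D \mid w) + \log p(w) \\
               &= \arg\max_w \; \log p(D \mid w) - \frac{1}{2\sigma^2}\lVert w \rVert_2^2 \\
               &= \arg\min_w \; \underbrace{-\log p(D \mid w)}_{\text{training loss}} + \lambda \lVert w \rVert_2^2,
\qquad \lambda = \frac{1}{2\sigma^2}
\end{aligned}
```

So a larger l2 value corresponds to a narrower prior around zero, which matches the observation that 0.5 was too strong for anything to match.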

silent swan Aug 17, 2019, 9:48 PM

#

a good thing to remember is that softmax is a hack

#

"huh, we need to convert some real-valued outputs into a probability distribution, what do we do?"
"idk, take the ratio of exponents?"

#

they can be interpreted as probabilities, and our training losses treat them as such, but we should be cautious about actually treating them as "predicted probabilities"

languid oar Aug 18, 2019, 12:57 AM

#

Hey guys, has anyone used VSCode for ML/Data Analysis purposes? I installed the basic extensions like Python and Intellicode, but autocompletion for packages like pandas is very very lacking. Coming from a language where VSCode is a first class citizen, it's night and day how much more difficult Python is with it. Has anybody had better luck than me?

still harness Aug 18, 2019, 4:22 AM

#

I need a Human Speech sentiment classification (e.g. Happy, Sad, Fearful, Surprised, Angry) data-set, please help.🙏

quasi nacelle Aug 18, 2019, 12:19 PM

#

anyone here ?

#

Hi, sorry but i need some help - I am trying to write a script to loop through data and identify people that fit the definition. - i am having a hard time getting started - the data is loaded with pandas and the dataframe is reduced to columns of interest. But writing the loop with the if and or statements is a problem. - anyone here that could set aside 15min ??
An increase in plasma creatinine by >0.3 mg/dl or a relative increase of 1.5-fold above baseline
together with severely elevated plasma MTX concentrations at one or more of the following time-points
after initiation of the MTX infusion:

36 hour >20 μM
42 hour >10 μM
48 hour >5 μM

If someone would like to help i can post the dataset
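
#

Without seeing the dataset, this can probably be done without a loop, using boolean masks in pandas. A sketch under loud assumptions: one row per patient, and every column name and value below is hypothetical. "1.5-fold above baseline" is read here as peak >= 1.5 x baseline, which is an interpretation that should be checked against the study protocol.

```python
import pandas as pd

# Hypothetical column names and values; one row per patient.
df = pd.DataFrame({
    "crea_baseline": [0.5, 0.6, 0.4],     # mg/dl
    "crea_peak":     [0.9, 0.7, 0.8],     # mg/dl
    "mtx_36h":       [25.0, 30.0, 5.0],   # uM
    "mtx_42h":       [8.0, 12.0, 2.0],    # uM
    "mtx_48h":       [3.0, 6.0, 1.0],     # uM
})

# Creatinine rise: absolute increase > 0.3 mg/dl OR at least 1.5x baseline.
crea_rise = (
    (df["crea_peak"] - df["crea_baseline"] > 0.3)
    | (df["crea_peak"] >= 1.5 * df["crea_baseline"])
)

# Severely elevated MTX at one or more of the three time points.
mtx_high = (df["mtx_36h"] > 20) | (df["mtx_42h"] > 10) | (df["mtx_48h"] > 5)

df["meets_definition"] = crea_rise & mtx_high
print(df["meets_definition"].tolist())    # [True, False, False]
```

Comparisons against missing values come out False, so patients with a missing measurement are not flagged on that criterion; whether that is the right behaviour depends on the study.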

surreal nacelle Aug 18, 2019, 1:30 PM

#

Hey, I'm looking to apply some of the stuff I learned on unsupervised learning from the book Hands-on machine learning with (...) and I was wondering if you guys had good entry level unsupervised learning project ideas to recommend 🙂 Thanks

supple ferry Aug 18, 2019, 1:40 PM

#

@surreal nacelle , you can start with market segmentation for example. Or any use case with clustering algorithms

surreal nacelle Aug 18, 2019, 1:49 PM

#

Gonna look into it, thank you

#

someone recommended me mnist without label

#

which sounds pretty good tbh 😄

upper ginkgo Aug 18, 2019, 4:35 PM

#

Hey, can someone explain to me what's an epoch?

feral lodge Aug 18, 2019, 4:39 PM

#

An epoch means a single training iteration over your full training set

#

So if you have 10000 training points, your computer might not be able to handle everything at once. Then you'll divide your training data into smaller chunks called mini-batches and train your model on those instead. If you divide your data into mini-batches of size 1000 for example, then 10 training iterations equals 1 epoch
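
#

The same arithmetic as a tiny sketch, using the numbers from the example above and an arbitrary epoch count:

```python
import math

n_samples = 10_000
batch_size = 1_000
epochs = 5

# One epoch = enough mini-batches to see every training point once.
iterations_per_epoch = math.ceil(n_samples / batch_size)   # last batch may be smaller
total_iterations = iterations_per_epoch * epochs
print(iterations_per_epoch, total_iterations)              # 10 50
```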

upper ginkgo Aug 18, 2019, 4:44 PM

#

Ohhh I see

surreal nacelle Aug 18, 2019, 4:54 PM

#

Hey, I'm trying to cluster the mnist dataset with KMeans (applying PCA first), and I'm getting pretty bad results, which is not surprising. However, I'm not sure what the next step should be.
I tried using the PCA.components_ as centroid, but it didn't perform as well as 10 random init and 1000 iters.
What would you do in that situation ?

📎 clusters_mnist.png

silent swan Aug 18, 2019, 5:52 PM

#

do your principal components look reasonable?

quaint ruin Aug 18, 2019, 7:00 PM

#

Hey, I'm using the Kobe Bryant dataset from kaggle.
I've been tasked to predict the shot_made_flag and to avoid data leakage by training on data prior to the date of the test data.
I've also been told to find the best k using 10 K-FOLD CV.
I think these 2 requirements contradict each other because if I use K-FOLD to split my data into train and test then the model will eventually train on data that occurred after the test data, so it makes no sense in this context to use K-FOLD but rather sort by date and manually split at this point for train and test and find the best k.
Can someone correct me if I'm wrong here?
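
#

That reading sounds right for plain shuffled K-fold. A common compromise is forward-chaining cross-validation, where each fold trains only on rows that come before its test fold; scikit-learn ships a `TimeSeriesSplit` class built on this idea. Below is a NumPy-only sketch with a made-up helper, assuming the rows are already sorted by game date.

```python
import numpy as np

def forward_chaining_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where every train index precedes the test fold."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.arange(0, k * fold)
        test_idx = np.arange(k * fold, min((k + 1) * fold, n_samples))
        yield train_idx, test_idx

# 110 rows and 10 folds are arbitrary numbers for the demonstration.
splits = list(forward_chaining_splits(n_samples=110, n_splits=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each candidate k would then be scored as the average over these folds, and the best one picked as in ordinary cross-validation.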

onyx moth Aug 18, 2019, 8:07 PM

#

hello guys I managed to put my data into a DataFrame and added target value ( which is just 3 prices in the future if its higher than current price its a 1 otherwise a 0). I have no clue which step is next to make this NN usable and how I would need to do it with my data.

📎 unknown.png

#

a lot of examples and stuff I've seen consist of 2 things, 1 row of samples and 1 row of labels

#

I have multiple rows of samples and 1 row of labels
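
#

If the "rows" here are the DataFrame columns, then several feature columns plus one label column is the normal layout: each DataFrame row is one sample. A sketch with made-up prices and hypothetical column names, building the target described above and splitting into the two arrays a network would train on:

```python
import pandas as pd

# Hypothetical price data; the real frame would have more columns and rows.
df = pd.DataFrame({
    "close":  [10.0, 11.0, 10.5, 9.0, 11.5, 10.0, 12.5],
    "volume": [100, 120, 90, 150, 110, 160, 130],
})

# 1 if the price 3 steps ahead is higher than the current one, else 0.
future = df["close"].shift(-3)
df["target"] = (future > df["close"]).astype(int)

# The last 3 rows have no future price to compare against, so drop them.
df = df.iloc[:-3]

X = df[["close", "volume"]].to_numpy()   # shape (samples, features)
y = df["target"].to_numpy()              # shape (samples,)
print(X.shape, y.tolist())               # (4, 2) [0, 1, 0, 1]
```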

silent swan Aug 18, 2019, 8:14 PM

#

how seriously do you want to do this financial prediction?

#

because my usual prescription is: don't use ML for finance

#

with the further addendum: absolutely don't use DL for finance

#

if this is just for learning that's fine, then I can give tips

onyx moth Aug 18, 2019, 8:15 PM

#

well im working towards using it for real trading 😬

#

im trying to recreate something

#

https://www.youtube.com/watch?v=R1snBs5tyY8 This

YouTube

Rob The Quant

Neural Nets Trading System - Artificial Intelligence Strategy Robo...

This strategy robot I wrote in C# is using genetically evolved neural nets. It learns trading rules by itself. It's trained on 40% of data sample and validat...


#

I saw this around 2 months ago on youtube and started learning python, and for the last 2 weeks I've been doing this ML stuff

#

with NNs

silent swan Aug 18, 2019, 8:17 PM

#

aha, well the bitcoin market is illiquid enough for some strats to work. still my recommendation is not to use DL for this. ML algos can work somewhat

#

unlike most problems, finance is a system where the whole market is actively trying to unlearn its own patterns

#

unless it's something fundamental to the market structure

onyx moth Aug 18, 2019, 8:17 PM

#

I also emailed that guy and he said I need to look at reinforcement learning but as I just first wanna have something that works I discovered maybe just make a NN that throws out somthing first

#

I can feed in anything

#

I mean I can do fundamental analysis too. The whole bitcoin market crashes on tweets and youtube videos

#

I also dont expect to get rich with it but if it can beat the market like in his video even if its by 1 % im on the right tack

#

track*

#

I only did 1 ML algo and that was linear regression, it sucked badly

#

I needed to feed it 4 of the 5 things

#

then I moved on to NN

prime trail Aug 19, 2019, 1:07 AM

#

are there any good machine learning tutorials i can learn about it?

mossy dragon Aug 19, 2019, 6:31 AM

#

depends, hows your math?

polar acorn Aug 19, 2019, 8:38 AM

#

Does anyone have an ETA on the first tensorflow 2.0 release? The roadmap says Q2 2019.

quartz stream Aug 19, 2019, 9:25 AM

#

Any idea on how to save a tensorflow model

#

predictions = estimator.predict(input_fn=predict_input_fn)

#

I wanna save the estimator thing so that next time I dont need to do the 10000 iterations

light cloud Aug 19, 2019, 12:26 PM

#

i believe that is what pickles are for

silent swan Aug 19, 2019, 2:21 PM

#

isn't there like tf.saver?

feral lodge Aug 19, 2019, 2:39 PM

#

If it's just Tensorflow, I use tf.train.Saver, yeah: ```
saver = tf.train.Saver()  # TF 1.x; create it after the graph is built
saver.save(sess, saved_variables_path)  # save
saver.restore(sess, saved_variables_path)  # load ```
If it's keras, I use ```
from keras.models import load_model
model.save(saved_model_path)  # save
model = load_model(saved_model_path)  # load ```

#

I think these also save internal stuff like Adam's momentum values, but I'm not completely sure

quartz stream Aug 19, 2019, 2:58 PM

#

Okay

#

Thanks @feral lodge

void spade Aug 19, 2019, 5:51 PM

#


linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])

linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])

linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)

#

For this code what does the splicing do? Specifically the 2 and 4

#

Posted in help but maybe someone here can help

silent swan Aug 19, 2019, 5:53 PM

#

looks like it's only using the first 2 / first 4 features

void spade Aug 19, 2019, 5:53 PM

#

Yep that is what it's doing but i dont understand where the features are coming from, hang on let me grab the code

#


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=10)
X = poly.fit_transform(x)

obs_nums = np.arange(0, num_points)
np.random.shuffle(obs_nums)

top_70 = int(num_points * .7)
rand_train = np.sort(obs_nums[:top_70])
rand_test = np.sort(obs_nums[top_70:])

X_train = X[rand_train]
X_test = X[rand_test]
y_train = y[rand_train]
y_test = y[rand_test]

linreg = LinearRegression()
linreg.fit(X_train[:, :2], y_train)
y_train_2 = linreg.predict(X_train[:, :2])
y_test_2 = linreg.predict(X_test[:, :2])

linreg.fit(X_train[:, :4], y_train)
y_train_4 = linreg.predict(X_train[:, :4])
y_test_4 = linreg.predict(X_test[:, :4])

linreg.fit(X_train, y_train)
y_train_10 = linreg.predict(X_train)
y_test_10 = linreg.predict(X_test)

errors_train= np.array([np.mean((y_train - y_train_2) ** 2),
                        np.mean((y_train - y_train_4) ** 2),
                        np.mean((y_train - y_train_10) ** 2)])
errors_train = np.column_stack(([2, 4, 10], errors_train))

errors_test = np.array([np.mean((y_test - y_test_2) ** 2),
                        np.mean((y_test - y_test_4) ** 2),
                        np.mean((y_test - y_test_10) ** 2)])
errors_test = np.column_stack(([2, 4, 10], errors_test))

silent swan Aug 19, 2019, 5:56 PM

#

looks like you're taking the first 2/4 polynomial terms as features

void spade Aug 19, 2019, 5:57 PM

#

I'm not sure where the feature values are coming from

#

Sorry in advance if im being super dumb

silent swan Aug 19, 2019, 5:58 PM

#

there's the initial x that needs to come from somewhere

#

in
X = poly.fit_transform(x)

void spade Aug 19, 2019, 5:59 PM

#

Yeah i follow that
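
#

To make the column slicing concrete, here is a sketch assuming `x` holds a single input feature. In that case `PolynomialFeatures(degree=10)` produces the columns 1, x, x^2, ..., x^10, and `np.vander` is used below as a stand-in that builds the same matrix without scikit-learn. The feature values are computed from `x`; they are not stored anywhere else.

```python
import numpy as np

x = np.array([2.0, 3.0])   # two made-up observations of one feature

# Same columns PolynomialFeatures(degree=10) gives for a single feature:
# 1, x, x**2, ..., x**10.
X = np.vander(x, N=11, increasing=True)

print(X[:, :2])   # columns 1 and x          -> a straight-line fit
print(X[:, :4])   # columns 1, x, x^2, x^3   -> a cubic fit
print(X.shape)    # (2, 11): the full matrix is the degree-10 fit
```

So `[:, :2]` and `[:, :4]` keep every row and only the first 2 or 4 columns, which means the three fits compare a line, a cubic and a degree-10 polynomial.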

storm void Aug 19, 2019, 6:13 PM

#

Anyone here have experience working with the WebAgg backend of matplotlib

lilac reef Aug 20, 2019, 12:39 AM

#

What would this kind of plot be called in matplotlib?
Its like a scatterplot, but how to connect each grouping of numbers based on value

📎 unknown.png

grizzled folio Aug 20, 2019, 12:40 AM

#

@lilac reef contour?

lilac reef Aug 20, 2019, 12:41 AM

#

Seems about right. Im just afraid contour is continuous values, but I'll look into it!
Thank you!

#

Yeah, def look right

#

Thanks :)

lilac reef Aug 20, 2019, 12:59 AM

#

Update: Not quite. I think contour is working with curves.
I have a bunch of X, Y values that each have a category Z they fall into.
I want to draw lines around each Z group like they did in the pictured graph above

#

I think I'm going to plot a bunch of different scatter plots on top of eachother with different colors. Not sure how I'll connect them

grizzled folio Aug 20, 2019, 1:00 AM

#

@lilac reef what do you mean "working with curves"?

lilac reef Aug 20, 2019, 1:01 AM

#

Contour plots (sometimes called Level Plots) are a way to show a three-dimensional surface on a two-dimensional plane. It graphs two predictor variables, X and Y, on the x- and y-axes and a response variable Z as contours. These contours are sometimes called the z-slices or the iso-response values.

#

Hmm, nevermind

#

I read something to the tune of contour plots plot slices of a plane

grizzled folio Aug 20, 2019, 1:02 AM

#

You can explicitly pass the levels (categories) you want contours at

lilac reef Aug 20, 2019, 1:06 AM

#

So I have pretty much the same data as the picture I posted.
For each X and Y value, I have a resultant percent number (75, 76.5, etc.)
I have the list of X, Y (both at regular intervals) and want to find the high point for the percentage.
Would I pass in my X, Y as
plt.contour([graph_x, graph_y,], graph_results, [.6, .65, .7, .75])
for example?

#

Thank you for the help btw man :) This is a really tricky plot I'm trying to pull off but I think it will look good

grizzled folio Aug 20, 2019, 1:07 AM

#

I believe it looks like plt.contour(X, Y, Z, levels), so no need to bundle up X and Y as a list

#

For a regular grid, you might need to meshgrid X and Y together, something like xx, yy = np.meshgrid(graph_x, graph_y)

lilac reef Aug 20, 2019, 1:08 AM

#

Yeah, I kept seeing meshgrid come up

#

For just throwing in my stuff as X,Y,Z I got Input z must be a 2D array
So each Z value must be paired with a given contour 'height'?

#

Context

📎 unknown.png

grizzled folio Aug 20, 2019, 1:10 AM

#

What shape is Z? If X is M long and Y is N long, contour expects Z to be a 2D array of shape (N, M)

#

It'll figure out the contours for your data automatically

lilac reef Aug 20, 2019, 1:19 AM

#

Z is 100x1 for me. I have X, Y and Z in different arrays. It is a height value that is in-order connected with each X Y combo

#

Seems like I should have a different format

grizzled folio Aug 20, 2019, 1:21 AM

#

What do you mean by "in-order connected"?

lilac reef Aug 20, 2019, 1:21 AM

#

Plot the height Z[1] at (X[1], Y[1])

#

and so on

#

I cant quite seem to wrap my head around what contour(X,Y,[N,M]) is expecting

#

X and Y must both be 2-D with the same shape as Z (e.g. created via numpy.meshgrid), or they must both be 1-D such that len(X) == M is the number of columns in Z and len(Y) == N is the number of rows in Z. hmmm

grizzled folio Aug 20, 2019, 1:23 AM

#

You say X and Y are regular intervals, but if Z[2] is (X[2], Y[2]), haven't you got heights along a diagonal?

#

Yes, you could just reshape it if you meant Z[2] is (X[2], Y[1]), for example

lilac reef Aug 20, 2019, 1:25 AM

#

Each height is a 'result' of the value X and Y being fed into an algorithm.
For context: X and Y are two parameters I am doing a grid search over a learning algorithm.
Z is how accurate of a predictor it was for each X,Y parameter combo (in cross-product form)

grizzled folio Aug 20, 2019, 1:26 AM

#

Ok, so you have a grid...Then len(X) * len(Y) = 100 and you can just Z.reshape((len(Y), len(X))) and it'll probably give you at least something
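
#

If the ordering of the three flat lists is in doubt, one way to sidestep it is to let NumPy work out each point's grid position. This is a sketch with a made-up 3 x 2 grid-search result; all variable names are hypothetical. It also picks out the best-scoring parameter pair, which was the original goal.

```python
import numpy as np

# Hypothetical grid-search output: three flat lists, one entry per (x, y) combo.
xs = np.array([0.01, 0.1, 1.0])
ys = np.array([1.0, 10.0])
flat_x = np.tile(xs, len(ys))       # 0.01, 0.1, 1.0, 0.01, 0.1, 1.0
flat_y = np.repeat(ys, len(xs))     # 1, 1, 1, 10, 10, 10
flat_z = np.array([0.60, 0.70, 0.65, 0.62, 0.75, 0.68])

# Recover the axes and each point's grid position, whatever order the lists are in.
grid_x, col = np.unique(flat_x, return_inverse=True)
grid_y, row = np.unique(flat_y, return_inverse=True)

Z = np.full((len(grid_y), len(grid_x)), np.nan)
Z[row, col] = flat_z                # Z[j, i] is the result at (grid_x[i], grid_y[j])

# Location of the best score.
j, i = np.unravel_index(np.nanargmax(Z), Z.shape)
print(grid_x[i], grid_y[j], Z[j, i])   # 0.1 10.0 0.75
```

With `Z` in that shape, `plt.contour(grid_x, grid_y, Z, levels=[.6, .65, .7, .75])` should accept it directly, plus `plt.xscale('log')` and `plt.yscale('log')` if the parameters are log-spaced.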

lilac reef Aug 20, 2019, 1:27 AM

#

I'll give it a go

#

oof, bigger than I thought

#

I think this might be more of a headache than its worth >_>

#

Just trying to maximize my predictive power. I can manually sift through it.
Just kinda wanted a flashy graph for presentation

#

Thank you for help switchy! If its not as hard as I'm thinking let me know

grizzled folio Aug 20, 2019, 1:32 AM

#

It's not... or at least shouldn't be

#

import matplotlib.pyplot as plt
import numpy as np
x = np.arange(10)
y = np.arange(10)
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])

#

📎 cRcSQDcDyz2OkwHgSyT9QDgjdHFphgthYAVf1GRN4BtuIJQWwjOFLaV4hIZ8ABTFLVipYaWmnsFhYWFq0IKyPSwsLCohVhGW0LCw.png

#

x and y are 10-element arrays, Z is (10, 10)

lilac reef Aug 20, 2019, 1:35 AM

#

So what are the elements in Z?

grizzled folio Aug 20, 2019, 1:36 AM

#

z[j,i] = sin(i) + cos(j), for i,j ∈ [0, 1, ..., 9]

lilac reef Aug 20, 2019, 1:36 AM

#

So Z[0,1] is mapped at location X[0], Y[1]?

grizzled folio Aug 20, 2019, 1:36 AM

#

Other way, Z[1,0] is X[0], Y[1]

#

You can just transpose it

lilac reef Aug 20, 2019, 1:37 AM

#

Ok, I'll try to work off that for a bit

#

So how can my Z be indexed at Z[j, i] if my j and i are logarithmic and not integer?

#

Or do I need to just make a work-around

grizzled folio Aug 20, 2019, 1:45 AM

#

sorry, I guess I wasn't clear...

#

z[j,i] = z(x[i], y[j])

#

You're really super overthinking this

lilac reef Aug 20, 2019, 1:46 AM

#

lmao. Im trying my best

grizzled folio Aug 20, 2019, 1:49 AM

#

import matplotlib.pyplot as plt
import numpy as np
x = 10**(np.arange(10) / 2)
y = 10**(np.arange(10) / 2)
plt.xscale('log')
plt.yscale('log')
plt.contour(x, y, np.sin(x)[None,:] + np.cos(y)[:,None])

#

📎 ucMxmw5tJywAAAABJRU5ErkJggg.png

lilac reef Aug 20, 2019, 1:57 AM

#

I would really appreciate if you held my hand and really spelled it out for me so I could do some big boy data science lmao.
All I have are these 3 lists.
I need to reshape my results (Z) to be 10x10 or something

📎 unknown.png

#

Im really sorry mate, I just dont think I'm going to get it to work tonight

#

My outputs are just whack that Im working with

#

My X and Y are 100 long, but are the same 10 values repeating. X goes up 1,2,3, vs Y that is ten 1's, ten 2's in a row, etc.

#

I just cant late night code. My bad mate. I really appreciate it though

grizzled folio Aug 20, 2019, 2:07 AM

#

@lilac reef just do .reshape((10,10)) on all of them

#

Then you'll have 2D arrays for X, Y, and Z (which is also acceptable input to contour)

lilac reef Aug 20, 2019, 2:07 AM

#

👀

grizzled folio Aug 20, 2019, 2:09 AM

#

If you wanted to turn them back into 1D arrays, you could take the first 10 X values (graph_x[:10]), and every 10th Y value (graph_y[::10]) -- but you shouldn't need to do that in this case

lilac reef Aug 20, 2019, 2:09 AM

#

'list' object has no attribute 'reshape'

#

DataFrame(graph_x).reshape(10,10)?

grizzled folio Aug 20, 2019, 2:09 AM

#

np.array(graph_x)
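A minimal sketch of the reshape approach described above, assuming the layout from the chat (X cycles through its 10 values, Y repeats each value ten times). The lists graph_x, graph_y and graph_z here are made-up stand-ins, not the real grid search results:

```python
import numpy as np

# Hypothetical stand-ins for the three 100-element lists.
x_values = [10.0 ** k for k in range(10)]            # 10 distinct X settings (assumed)
y_values = [10.0 ** k for k in range(10)]            # 10 distinct Y settings (assumed)
graph_x = x_values * 10                              # X cycles: x0..x9, x0..x9, ...
graph_y = [y for y in y_values for _ in range(10)]   # Y repeats: ten y0, ten y1, ...
graph_z = [float(np.sin(np.log10(x)) + np.cos(np.log10(y)))
           for x, y in zip(graph_x, graph_y)]        # placeholder scores

# Lists have no .reshape, so convert to arrays first.
X = np.array(graph_x).reshape((10, 10))
Y = np.array(graph_y).reshape((10, 10))
Z = np.array(graph_z).reshape((10, 10))

# Row j holds a single Y value, column i a single X value,
# so Z[j, i] is the score for (x_values[i], y_values[j]).
# These 2D arrays can be passed straight to plt.contour(X, Y, Z).
```

With this layout no transpose is needed; if X were the slowly varying list instead, the reshaped arrays would just come out transposed.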

lilac reef Aug 20, 2019, 2:09 AM

#

gotcha

#

I just got a half-chub

📎 unknown.png

#

Thanks switchy <3

#

I think I can play with it from here

grizzled folio Aug 20, 2019, 2:13 AM

#

Great 🙂

lilac reef Aug 20, 2019, 2:47 AM

#

God, that's a thing of beauty

📎 unknown.png

#

The data is so abstract it would take me like 10 minutes to explain and I love it

grizzled folio Aug 20, 2019, 2:47 AM

#

Cool, glad to see it worked

#

don't forget your colourbar and axis labels 😉

lilac reef Aug 20, 2019, 2:48 AM

#

I'll pretty it up once I stop throwing graphs at the wall and seeing what sticks

#

Trying to get my axes right

#

Why use fancy matplotlib when your numpy array spyder auto-coloring does the job 🤔

📎 unknown.png

onyx moth Aug 20, 2019, 5:46 AM

#

what does numpy.argmax( )actually do?
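A short sketch of what numpy.argmax does: it returns the index of the largest value, not the value itself. The arrays below are made up for illustration:

```python
import numpy as np

scores = np.array([0.1, 0.7, 0.2])
best = np.argmax(scores)             # index of the largest entry -> 1

grid = np.array([[1, 9, 3],
                 [4, 2, 8]])
flat_idx = np.argmax(grid)           # no axis: index into the flattened array -> 1
per_row = np.argmax(grid, axis=1)    # column index of the max in each row -> [1, 2]
per_col = np.argmax(grid, axis=0)    # row index of the max in each column -> [1, 0, 1]
```

A common use is turning a vector of class probabilities into a predicted class label.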

quartz stream Aug 20, 2019, 6:55 AM

#

any idea how to delete a row in a dataframe where the value of one column is false

#

I wanna scan the whole df and find the value false in a particular column and then delete that row

onyx moth Aug 20, 2019, 7:07 AM

#

drop it @quartz stream

quartz stream Aug 20, 2019, 7:07 AM

#

how

#

@onyx moth

#

drop takes column value

#

i wanna delete a particular row

onyx moth Aug 20, 2019, 7:08 AM

#

df = df.drop(['name of the column'], axis=1)

#

oh, a row?

quartz stream Aug 20, 2019, 7:09 AM

#

yeah

#

a row

#

that too not any row

#

row with a column value false

onyx moth Aug 20, 2019, 7:10 AM

#

yea that should work

quartz stream Aug 20, 2019, 7:10 AM

#

how ?

#

can you send a snippet

onyx moth Aug 20, 2019, 7:10 AM

#

send ur code

quartz stream Aug 20, 2019, 7:10 AM

#

code of ?

onyx moth Aug 20, 2019, 7:11 AM

#

df = df.drop(['name of the column'], axis=1)

quartz stream Aug 20, 2019, 7:11 AM

#

I dont wanna delete a column

#

i wanna delete a row

onyx moth Aug 20, 2019, 7:11 AM

#

yea

#

thats what that does

quartz stream Aug 20, 2019, 7:12 AM

#

scan a column find false and delete that

wicked flare Aug 20, 2019, 7:12 AM

#

indices = df[~df['myColumn']].index
df.drop(indices, inplace=True)

onyx moth Aug 20, 2019, 7:12 AM

#

📎 unknown.png

#

u wanna do this right?

wicked flare Aug 20, 2019, 7:12 AM

#

@quartz stream Pretty sure my snippet should do what you want.

quartz stream Aug 20, 2019, 7:13 AM

#

lemme try I also think so @wicked flare

#

not working

#

I dont wanna delete a column

#

i'll scan a column called valid

#

find index where value is false

#

and delete all of em

onyx moth Aug 20, 2019, 7:16 AM

#

so you only have certain blocks which are false?

#

or a whole row

#

@quartz stream This is smth im working on, do you want this? See it dropped row 1 and 2

#

📎 unknown.png

quartz stream Aug 20, 2019, 7:22 AM

#

yes

#

there is a certain row with a column value false

#

how did you do it again

onyx moth Aug 20, 2019, 7:22 AM

#

like i said above u just drop it

wicked flare Aug 20, 2019, 7:23 AM

#

@quartz stream My snippet doesn't delete columns, it deletes rows.

onyx moth Aug 20, 2019, 7:23 AM

#

📎 unknown.png

#

row*

wicked flare Aug 20, 2019, 7:23 AM

#

>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
>>> indices = df[df['col1'] == 2].index
>>> indices
Int64Index([1], dtype='int64')
>>> df.drop(index=indices, inplace=True)
>>> df
   col1  col2
0     1     3
>>>

#

It finds rows where the given column matches the condition and drops them.

quartz stream Aug 20, 2019, 7:24 AM

#

Wow

#

You Sir are amazing

#

@wicked flare

#

is there any way I can get count of this thing

wicked flare Aug 20, 2019, 7:24 AM

#

count?

quartz stream Aug 20, 2019, 7:25 AM

#

yes

#

how many did it drop

wicked flare Aug 20, 2019, 7:25 AM

#

You can just check the length of indices
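A small sketch of the same idea with a boolean mask, which also gives the count of dropped rows. The frame below is made up, and it assumes the Valid column holds real booleans; if it holds the strings 'True'/'False', the mask would be data['Valid'] != 'False' instead:

```python
import pandas as pd

# Made-up example data; 'Valid' is assumed to be a boolean column.
data = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "Valid": [True, False, True, False],
})

mask = data["Valid"]              # True for rows to keep
n_dropped = int((~mask).sum())    # number of rows where Valid is False
data = data[mask]                 # keep only the valid rows

print(n_dropped, len(data))
```

Filtering with a mask returns a new frame, so it avoids drop(..., inplace=True) and the index= keyword entirely.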

quartz stream Aug 20, 2019, 7:25 AM

#

ohh

#

yea

#

sorry

#

pretty new here so

wicked flare Aug 20, 2019, 7:25 AM

#

No worries.

quartz stream Aug 20, 2019, 7:26 AM

#

1 indices = data[data['Valid'] == 'False'].index
----> 2 df.drop(index=indices, inplace=True)

TypeError: drop() got an unexpected keyword argument 'index'

#

@wicked flare

wicked flare Aug 20, 2019, 7:27 AM

#

Your dataframe isn't called df.

quartz stream Aug 20, 2019, 7:27 AM

#

<ipython-input-24-22441e5dbdc2> in <module>()
      1 indices = data[data['Valid'] == 'False'].index
----> 2 data.drop(index=indices, inplace=True)

TypeError: drop() got an unexpected keyword argument 'index'

wicked flare Aug 20, 2019, 7:29 AM

#

What version of pandas are you using?

#

The index parameter to drop was added in 0.21 so presumably your version is older than that

quartz stream Aug 20, 2019, 7:31 AM

#

Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (0.24.2)

wicked flare Aug 20, 2019, 7:31 AM

#

Anyway, this should also work: df.drop(indices, axis=0, inplace=True)

lapis sequoia Aug 20, 2019, 8:28 AM

#

it's not good practice to use inplace drop, it doesn't work properly most of the time

wicked flare Aug 20, 2019, 8:48 AM

#

What's wrong with it?

polar acorn Aug 20, 2019, 11:38 AM

#

It's always worked for me, but it's supposedly going to be deprecated.

olive willow Aug 20, 2019, 11:39 AM

#

guys any ideas for data science related projects for good practice with numpy, pandas as well as general python stuff ??

#

btw I've a rpi so maybe something with that? I've really no clue what I can do. I've done several data analysis and webscraping ones and also some db ones.

onyx moth Aug 20, 2019, 12:10 PM

#

How would I determine the number of states with Q-learning

#

When my environment is trading

#

I have 3 actions, buy, sell , hold

olive willow Aug 20, 2019, 2:19 PM

#

Thanks that's a great idea

onyx moth Aug 20, 2019, 8:56 PM

#

@void anvil The Q values mustn't be like that, right?

#

I have this piece of code for the q_table

📎 unknown.png

#

im just not sure what the [50] should be

#

as the 3 is the amount of actions

#

and the [50] should be the states

olive robin Aug 21, 2019, 2:37 PM

#

hello

#

I need help understanding what's happening at line 195 in this code: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L175

GitHub

pytorch/examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc. - pytorch/examples

#

they are concatenating two data sets, training and values (I think)

#

why are they doing this?

#

that's what path.join does right, concatenate directories?

polar acorn Aug 21, 2019, 3:57 PM

#

If by concatenating directories you mean concatenating the paths of two directories, you are correct. os.path.join is needed because Windows paths are separated by \ while Unix paths are separated by /, so you can't just write 'dir1' + '/' + 'dir2' on all systems and expect that to work. So in this case it seems you provide the path to your data and the script expects your data directory to contain two subdirectories named train and val. The path variables traindir and valdir just hold the paths to those subdirectories.
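A minimal sketch of what those two lines amount to. The data_root value is a hypothetical placeholder; in the linked script the path comes from a command-line argument:

```python
import os

data_root = "/path/to/imagenet"               # hypothetical data directory
traindir = os.path.join(data_root, "train")   # <data_root>/train
valdir = os.path.join(data_root, "val")       # <data_root>/val

print(traindir)
print(valdir)
```

Nothing is read or merged here; os.path.join only builds path strings using the separator of the current operating system.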

lilac reef Aug 21, 2019, 4:12 PM

#

Do I need to train-test-split if doing cross validation?
Is it possible to overfit with cross validation baked into the learning algorithm?

lilac reef Aug 21, 2019, 4:32 PM

#

Hmm.
So I fit my model with the training set, and then cross validate on the whole thing? Or pure score it on test?
@void anvil

#

Like I'm using GridSearchCV. Isnt that cross validating the score for every model? Isnt that purely scoring based on whatever data you pass it

#

So I guess test the best model from the grid against the test set afterwards then?

#

📎 unknown.png

#

Oh

#

:(

#

I've been overfitting pretty harshly

#

Shit

desert oar Aug 21, 2019, 4:39 PM

#

@lilac reef you can't usually rely on the CV scores from hyperparameter tuning to correctly estimate out-of-sample score

#

CV for hyperparameter tuning and CV for performance estimation are different steps

lilac reef Aug 21, 2019, 4:39 PM

#

Gotcha

desert oar Aug 21, 2019, 4:40 PM

#

you can do a "nested" CV, or do the CV on your training set while keeping a holdout set for validation
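A rough sketch of the second option (CV on the training set, plus a holdout that the search never sees). The synthetic data, the logistic regression model and the parameter grid are all assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Set the holdout aside before any tuning happens.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Cross-validated grid search on the training portion only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

cv_score = search.best_score_                   # tends to be optimistic: it was used to pick C
holdout_score = search.score(X_hold, y_hold)    # best model refit on X_train, scored on unseen data
print(search.best_params_, cv_score, holdout_score)
```

The holdout score is the number to report; the CV score only ranks the candidate parameter settings.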

lilac reef Aug 21, 2019, 4:41 PM

#

So if the goal is to maximize the score on the holdout set (That means the model is generalizable right?),
how should I go about hyperparameter tuning?

#

Or is whatever parameters GridSearch returns probably still the best, even if it overfits somewhat?

lilac reef Aug 21, 2019, 4:59 PM

#

Wait, whats a validation vs testing set?
If this is just an easy google and I'm being really lazy dont reply lmao

📎 unknown.png

desert oar Aug 21, 2019, 5:06 PM

#

eh

#

arbitrary

#

you might use "testing" for iterating on your model

#

then when you need to get some kind of final assessment before you start sharing this w/ your company's CTO, you run it on the validation set to get a more "pure" estimate of accuracy

lilac reef Aug 21, 2019, 5:08 PM

#

Awesome, that was kinda what I was thinking

#

Thank you salt rock lamp :)

silent swan Aug 21, 2019, 6:03 PM

#

in theory you should only ever use your test set exactly once

silent swan Aug 21, 2019, 7:00 PM

#

@desert oar I think you swapped validation and test?

desert oar Aug 21, 2019, 7:00 PM

#

@silent swan no, thats how i use the terms. but they arent formal terms

#

i usually use "test" as the "innermost" holdout set

#

i.e. inside the CV loop

silent swan Aug 21, 2019, 7:03 PM

#

ah

#

the general convention is that validation/dev sets are used for tuning

#

test is used for pure eval

desert oar Aug 21, 2019, 7:28 PM

#

i use it the opposite way

#

in fact ive never seen it used the way you describe

#

i usually call eval "eval"

#

i avoid using the term "test" except in code

polar acorn Aug 21, 2019, 7:49 PM

#

Interesting. I've seen the notation @silent swan describes plenty of times and heard it described as the canonical way. But in my head i too switch around test and validation so that I tune on test and I can't quite remember where i learnt it, maybe it's just more intuitive.

desert oar Aug 21, 2019, 7:55 PM

#

maybe i just dont use kaggle enough

silent swan Aug 21, 2019, 8:13 PM

#

it's the standard in research as well

#

would be better to stick to the conventional naming imo

#

(also NLP is weird where they sometimes call the val set the "dev set")

desert oar Aug 21, 2019, 8:16 PM

#

at least that makes sense, its what you develop against

silent swan Aug 21, 2019, 8:16 PM

#

I'm guessing that's the origin

lilac reef Aug 22, 2019, 2:22 AM

#

Would you consider Auto Encoders for feature selection a form of clustering?

lilac reef Aug 22, 2019, 3:13 AM

#

So what would you call the new form the data takes at the middle of the Auto Encoder?

silent swan Aug 22, 2019, 3:24 AM

#

latent code

lapis sequoia Aug 22, 2019, 8:08 AM

#

Is there a GridSearchCV in Sklearn api that can take the best model (i.e. RF and SVC)
instead of the model with the best params?

#

basically just a separate grid search for both, possibly using same list of parameters..

#

or.. try an ensemble

desert oar Aug 22, 2019, 5:49 PM

#

sometimes i just send data back and say "i can't use this"

#

sometimes it's ambiguous and you literally can't know

silent swan Aug 22, 2019, 5:50 PM

#

04-05-2019

#

fun stuff

desert oar Aug 22, 2019, 5:50 PM

#

its honestly pretty satisfying

silent swan Aug 22, 2019, 5:50 PM

#

but seriously though, MM-DD-YYYY was a mistake

desert oar Aug 22, 2019, 5:55 PM

#

anything other than YYYY-MM-DD for a data set is pretty bad

#

have fun sorting on DD-MM-YYYY

#

right

#

unless it's stored in a database like that...

#

nothing like a MM/DD/YYYY VARCHAR(255) timestamp!

#

oh lol excel

#

hmm doesn't excel represent all that consistently internally

#

i think pandas knows how to handle that

#

are they formatted as text or date?

#

if it's text you're fucked

#

if it's formatted as a date you might be safe

#

because you can reformat consistently

#

yeesh

#

yeah

#

uh

#

"this is your lesson not to be stupid"

#

"i can't work with this"

lapis sequoia Aug 23, 2019, 12:44 AM

#

do people still use vba macros..

pine yoke Aug 23, 2019, 12:54 AM

#

I would like a model that tries to predict the winner of a tennis match. It should focus on putting different weights or importances varying from 1-100 on various aspects of a tennis players game. As many possible options as possible should be put in there but ideally the variables that will be able to be measured between 1-100 are things like: Win streak - amount of games won so far Won last match Lost last match Previous encounters with same opponent Amount of points won in a game + ‘’ ‘’ ‘’ in last 5 games + ‘’ ‘’ ‘’ ‘’ last 3 games Double faults in last game + double faults in last 5 games + 3 games How they fare in similar situations for eg. 3rd deuce as a server whilst 1 set down.

Is this a good use of ML?
Could anyone recommend some videos on how to accomplish this
Is there a better approach?

desert oar Aug 23, 2019, 12:55 AM

#

@pine yoke you can use machine learning to predict match winners yes. look into "logistic regression"

#

the "weights" produced by the logistic regression can give you some sense of "importance" but they aren't directly comparable
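A hedged sketch of what looking at logistic regression weights could look like. The two features and the labels are synthetic, invented purely for illustration; standardizing the features first is one common way to put the weights on a similar scale, though they are still not strict importances:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n = 200

# Invented features: current win streak and double faults in the last match.
win_streak = rng.randint(0, 10, size=n)
double_faults = rng.randint(0, 8, size=n)

# Invented outcome: streaks help, double faults hurt, plus noise.
signal = 0.4 * win_streak - 0.5 * double_faults
won = (signal + rng.normal(size=n) > 0).astype(int)

features = np.column_stack([win_streak, double_faults])
scaled = StandardScaler().fit_transform(features)   # zero mean, unit variance per column

model = LogisticRegression().fit(scaled, won)
weights = model.coef_[0]    # one weight per feature, in column order
print(dict(zip(["win_streak", "double_faults"], weights)))
```

The sign of each weight shows the direction of the effect; the magnitudes only compare loosely, and only because the inputs were scaled.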

pine yoke Aug 23, 2019, 12:58 AM

#

Do you have recommendations for libraries to use?

desert oar Aug 23, 2019, 1:00 AM

#

scikit-learn

#

pandas for data processing

#

numpy for general matrix and vector math

#

and scipy for various other math

lapis sequoia Aug 23, 2019, 1:00 AM

#

and Tron for inspiration

#

(the movie)

lapis sequoia Aug 23, 2019, 2:59 AM

#

i am wondering, for data science jobs, isn't a masters degree required?

#

data analyst is also a data science job.. you can get a business degree and go into that

#

ohh business degrees can do data analysis?

#

i am not a business degree i am afraid

desert oar Aug 23, 2019, 3:01 AM

#

its difficult without a masters

#

possible but difficult

lapis sequoia Aug 23, 2019, 3:02 AM

#

for data analytics or data science?

desert oar Aug 23, 2019, 3:02 AM

#

data science

lapis sequoia Aug 23, 2019, 3:03 AM

#

ohh, then this program i know irl is kind of trying to scam people into thinking they can do data science without a masters

desert oar Aug 23, 2019, 3:04 AM

#

its probably not a scam

#

if it's a specialized training program, that's different

lapis sequoia Aug 23, 2019, 3:04 AM

#

ohh

#

i thought masters is required

desert oar Aug 23, 2019, 3:04 AM

#

there is no standards organization that dictates what is required

#

it is hard to get a data science job without a masters because there are many people with masters degrees applying, and also because a masters degree signals a basic level of competence

#

without the masters degree you need to signal competence some other way

lapis sequoia Aug 23, 2019, 3:05 AM

#

oh i see

#

but i think it's kind of misleading of the program though to tell people they can get a job in data science without a masters or letting them know of the full picture

#

or not letting them know*

desert oar Aug 23, 2019, 3:06 AM

#

maybe, maybe not

#

they might have hiring connections

lapis sequoia Aug 23, 2019, 3:07 AM

#

eh, they don't

desert oar Aug 23, 2019, 3:07 AM

#

maybe share the program?

#

if you have a link or a name

lapis sequoia Aug 23, 2019, 3:09 AM

#

i don't want to reveal my location

onyx moth Aug 23, 2019, 8:54 AM

#

Why is my NN performing so bad?

📎 unknown.png

#

What could I change

#

This is the model its a very simple one

#

📎 unknown.png

earnest prawn Aug 23, 2019, 10:49 AM

#

it could be that your model doesn't have enough weights to optimize, the optimizer isn't good for your use case, or that your data is polluted

onyx moth Aug 23, 2019, 11:48 AM

#

@earnest prawn How do you mean my data is polluted?

#

im using adam optimizer

#

I have been experimenting a bit and it just cant get past 55% accuracy for some reason

#

and when I train does it make sense that those accuracies are all the same

📎 unknown.png

earnest prawn Aug 23, 2019, 12:04 PM

#

Mathematically speaking it does make a lot of sense that these accuracies are the same if your optimizer has found a minimum

#

@onyx moth what happens if you increase the amount of neurons in your hidden layers

onyx moth Aug 23, 2019, 12:06 PM

#

import tensorflow as tf
from sklearn.model_selection import train_test_split

# x (the 4 indicator columns) and y (the 0/1 labels) are assumed to be defined already.
# Note: train_size=0.1 trains on only 10% of the data and tests on the other 90%.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.1)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(4, input_dim=4, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
scores = model.evaluate(x_test, y_test)  # [loss, accuracy] on the test split
print(scores)

Im a bit new to Keras but what im doing is pulling BTC data from Binance, putting it into 4 indicators and inputting the values of those 4 indicators into the NN (with labels: 1 if the price 5 prices further is higher, 0 if lower)

#

This model I just sorta got off the internet and have been playing around with it

#

it was meant for something else, not bitcoin

#

if I change the amount of neurons I dont see much improvement, but its very random, when I change the pair to for example ETHBTC and put it to forecast 50 days in the future accuracy jumps to 77%

#

but the moment I add volume to what I input accuracy dumps to 21^%

#

21%*

earnest prawn Aug 23, 2019, 12:13 PM

#

And you are sure that these 4 indicators have a relation to whether the prices will rise or not?

onyx moth Aug 23, 2019, 12:14 PM

#

yes, but its not like if this happens on the indicator it will rise 100%, more like if this happens the probability of it rising is higher

onyx moth Aug 23, 2019, 12:33 PM

#

must I normalize data ?

#

will that make it better do u think?

earnest prawn Aug 23, 2019, 12:33 PM

#

Could make it better yes

dim beacon Aug 23, 2019, 1:15 PM

#

And you are (very very very) probably falling into a bias of some kind (lookahead, overfitting, etc.)

#

Or are trying to predict the wrong thing, for instance, the price of most well-known financial assets can be predicted day-to-day with a very good accuracy using this model:

predicted_price_for_tomorrow = price_today
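A tiny sketch of that persistence baseline, using made-up prices. Any model that predicts prices should be compared against this error before its accuracy means anything:

```python
import numpy as np

# Made-up daily closing prices.
prices = np.array([100.0, 101.0, 100.5, 102.0, 101.5, 103.0])

actual_next = prices[1:]     # what tomorrow's price turned out to be
naive_pred = prices[:-1]     # the baseline: tomorrow's price = today's price

mae = np.mean(np.abs(actual_next - naive_pred))   # mean absolute error of the baseline
relative_mae = mae / prices.mean()                # same error as a fraction of the price level
print(mae, relative_mae)
```

On these numbers the baseline is off by about 1% of the price level on average, which looks impressive while telling you nothing about direction.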

onyx moth Aug 23, 2019, 2:02 PM

#

https://aitrader.ai/ I want to recreate something like these guys, but then for myself

#

@dim beacon Yes there are some good algos, its not so hard to program a bot which trades on a fixed algo, if u just do what the previous day's Heikin Ashi candle says you will also catch all the big swings, but having a NN bot which can adapt to the situation looks cool to me but idk if its even something thats possible to make

dim beacon Aug 23, 2019, 2:05 PM

#

@onyx moth algorithmic trading is, really, MUCH harder than most people think

onyx moth Aug 23, 2019, 2:05 PM

#

Yea I couldnt come up with an algo that doesnt have a spot where it loses money

dim beacon Aug 23, 2019, 2:06 PM

#

Like, there-exists-companies-paying-the-smartest-people-on-earth-millions-a-year-each-just-to-be-a-little-part-of-their-algos-strategies-development hard

onyx moth Aug 23, 2019, 2:06 PM

#

Thats why I thought if I can implement some NN and add that as another if it would maybe cancel out some noise

#

But it must be possible for us right?

dim beacon Aug 23, 2019, 2:07 PM

#

Possible? Yes. But do not expect being able to do it profitably before years of deep experience and very high-level knowledge in a lot of fields

onyx moth Aug 23, 2019, 2:07 PM

#

lol idk if this is gonna work xD

📎 unknown.png

dim beacon Aug 23, 2019, 2:08 PM

#

Like, yeah it is possible, like it is possible you get a Nobel Prize one day, but you'll agree that it is not very likely to happen

onyx moth Aug 23, 2019, 2:08 PM

#

true

#

ive only been messing with NN like 1.5 weeks or so

#

@dim beacon Do you have experience with Reinforcement learning?

dim beacon Aug 23, 2019, 2:22 PM

#

Very few

onyx moth Aug 23, 2019, 2:52 PM

#

okay now I have something that has a 64% accuracy and would like to try it out how do I now use this model to predict the next candles price?

📎 unknown.png

dim beacon Aug 23, 2019, 2:58 PM

#

?

model.predict(current_candle_data)

#

Also, best model, by far, to predict next candle price:

next_candle_price = current_candle_price

onyx moth Aug 23, 2019, 2:59 PM

#

so u saying the best way to predict is to say that tomorrows price is the same as todays?

dim beacon Aug 23, 2019, 2:59 PM

#

Yes

onyx moth Aug 23, 2019, 3:00 PM

#

That means markets wont move

#

and it will be a flat line

dim beacon Aug 23, 2019, 3:00 PM

#

They do, but on average they don't on a day-to-day basis

onyx moth Aug 23, 2019, 3:00 PM

#

well if u look at BTC its moved up 400% + in the past half year

dim beacon Aug 23, 2019, 3:00 PM

#

Yeah, still

#

Most of the time it did not move that significantly from a day to the next one

onyx moth Aug 23, 2019, 3:01 PM

#

well year before it moved down by 85% or so

dim beacon Aug 23, 2019, 3:02 PM

#

Yeah, still, most of the time it did not move that significantly from a day to the next one

turbid bay Aug 23, 2019, 3:40 PM

#

can anyone teach me how to make a neural network for hand written digits. ik theres hundreds of tutorials online but i find none of them really helpful

dim beacon Aug 23, 2019, 3:42 PM

#

Why aren't they helpful ?

turbid bay Aug 23, 2019, 3:55 PM

#

well i follow them. have about 20% clue of whats going on. and then it finishes without explaining how to save it or implement it into any other coding projects

#

the only one ive found useful was a Coursera one i did a while back but it was using Octave. It was more explaining the math behind it rather than explaining model structures

desert oar Aug 23, 2019, 4:01 PM

#

sounds like you need to learn tensorflow specifically

#

i feel like youd benefit more from a hands on machine learning book than from tutorials

turbid bay Aug 23, 2019, 4:06 PM

#

could u recommend one?

#

and yes i think id like to learn the ins and outs of tensorflow

desert oar Aug 23, 2019, 4:06 PM

#

i havent used one so i dont know sorry

turbid bay Aug 23, 2019, 4:14 PM

#

oh o

#

k

#

if anyone does know of a good way to learn tensorflow please let me know

dim beacon Aug 23, 2019, 4:22 PM

#

@turbid bay it is just an introduction, but I found it good https://www.youtube.com/watch?v=vq2nnJ4g6N0

YouTube

Devoxx

Tensorflow and deep learning - without a PhD by Martin Görner

Subscribe to Devoxx on YouTube @ https://bit.ly/devoxx-youtube Like Devoxx on Facebook @ https://www.facebook.com/devoxxcom Follow Devoxx on Twitter @ https:...

▶ Play video

turbid bay Aug 23, 2019, 4:36 PM

#

thankyou

#

will watch it in a minute

onyx moth Aug 23, 2019, 4:41 PM

#

what does it mean if I need to discretize values?

earnest prawn Aug 23, 2019, 4:42 PM

#

that you have to convert a continuous space into a discrete one

#

so for example map every real number to one of [-1, 0, 1]
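Two quick ways this can look in NumPy, sketched on made-up price changes. The bin edges of -1 and 1 are arbitrary choices for the example:

```python
import numpy as np

changes = np.array([-2.5, -0.01, 0.0, 0.3, 4.0])   # made-up continuous values

# Option 1: keep only the direction (down / flat / up).
signs = np.sign(changes).astype(int)    # -> [-1, -1, 0, 1, 1]

# Option 2: bucket by thresholds. With edges [-1, 1]:
#   bucket 0: x < -1, bucket 1: -1 <= x < 1, bucket 2: x >= 1
edges = np.array([-1.0, 1.0])
buckets = np.digitize(changes, edges)   # -> [0, 1, 1, 1, 2]
```

For Q-learning, a bucket index like this can serve as the state (row) index into the Q-table.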

onyx moth Aug 23, 2019, 5:54 PM

#

I see a lot of people do this

📎 unknown.png

#

but I have no clue why

earnest prawn Aug 23, 2019, 6:02 PM

#

the comment above literally explains why

onyx moth Aug 23, 2019, 6:03 PM

#

I dont understand it I added it and it does nothing to my accuracy

earnest prawn Aug 23, 2019, 6:03 PM

#

it is not supposed to

#

it is literally supposed to "fix the random seed so the results are reproducable"

onyx moth Aug 23, 2019, 6:04 PM

#

so if I run it with seed 7 again it will produce the same results?

earnest prawn Aug 23, 2019, 6:04 PM

#

yes
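A minimal demonstration with NumPy's global generator. The seed value 7 is taken from the chat; deep learning frameworks keep their own random state, so seeding NumPy alone may not make a whole training run repeatable:

```python
import numpy as np

np.random.seed(7)
first = np.random.rand(3)     # three "random" numbers

np.random.seed(7)             # reset to the same seed
second = np.random.rand(3)    # exactly the same three numbers again

print(np.array_equal(first, second))   # True
```

The seed changes which random numbers you get, not how good the model is, which is why accuracy does not improve.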

onyx moth Aug 23, 2019, 6:04 PM

#

aaah okay

silent swan Aug 23, 2019, 6:31 PM

#

there should be a Rule #1
Don't do anything finance-related for your first data science project

torn musk Aug 24, 2019, 4:55 AM

#

i get stuck with bug

supple ferry Aug 24, 2019, 6:28 AM

#

@void anvil , do you have exp with two stage modeling ?
like choice modeling or a simple classification where you run the first stage to predict some feature a which you will add to your 2nd stage variables to predict some class

onyx moth Aug 24, 2019, 6:43 AM

#

https://hackernoon.com/dont-be-fooled-deceptive-cryptocurrency-price-predictions-using-deep-learning-bf27e4837151 I think this article popped my bubble of how good NN and that stuff is for trading. I think its for the average person not possible to come up with a good strategy unless u very good at it

Don’t be fooled — Deceptive Cryptocurrency Price Predictions U...

#

Ill just keep it with algo trading then

#

But the idea of shifting the price is only to create a label and u should remove the shifted price right?

#

So it learns well here I should buy and memorise the situation of the other data u pass in

supple ferry Aug 24, 2019, 1:06 PM

#

@void anvil . what i am trying to achieve is to group my options into clusters, which is done. Now as first stage I want to predict the cluster which user will choose the options from and then try to predict the option chosen which is in that cluster

#

how would you approach to such a problem ?

#

kind of

#

yes

#

instead of shape i have clusters

#

instead of colors i have options

#

exactly

#

it limits

#

i have done these two

#

we can switch to dm if it is fit for you

#

makes sense

#

ah okay

#

okay

#

let me now give more info

#

exactly

#

thats why i am now writing more info

#

so, lets assume, I have 3 users who were shown some items to buy.
first 2 got 50 items shown and the third one only 10
what I was doing previously, I was using features of items which are shown to every user and cluster them into several clusters. then i would create artificial cluster related features and combine them with item features and run a logit on them to find the item user has chosen

#

and now the idea is to split this task into a two stage choice model. first build a model which will predict the cluster from which user will choose and 2nd stage will be from that cluster find which item user will choose

#

i have not found much papers/works on this on internet

#

okay i found LIME

#

will look into that

#

i see

#

which model will fit for the first stage?

#

i think i made a mistake when i explained the use case

#

i already have lets say 6 clusters for 50 items that are shown to a user 1

#

yes

#

which item he will buy

#

the second one

#

yes

#

we have historical items that are shown to users and whether or not that item is bought

#

users are anonymized

#

we have around 10k users

#

around 1m items that were shown to them, some 40, some 50, some even 200 items

#

and for each user we know which items from those 40 or 50 they have chosen

#

no, we assume that users are kinda "unique"

#

I have done clustering on user level.
I took all items in my dataset which are shown to user A and clustered them into clusters, then user B, and so on for all 10k users. As a result, the clusters are "per user" only

desert oar Aug 24, 2019, 1:54 PM

#

what was the initial task here?

supple ferry Aug 24, 2019, 1:55 PM

#

initial task is to predict the item which user will choose from the item pool that we have for that user ( i have no control over that pool)

#

while having historical "unique" item pools and also the item which was chosen

#

there always was a choice

desert oar Aug 24, 2019, 2:05 PM

#

that's what an economist would call a "discrete choice" model

supple ferry Aug 24, 2019, 2:05 PM

#

yes

#

it is a discrete choice model 🙂

#

topic of my thesis

desert oar Aug 24, 2019, 2:06 PM

#

for clustering, you said you've already done it, but my immediate instinct would be to use NMF on the buyer-product matrix

#

NMF gets you clustered users and items at the same time

#

and non-negativity is just nice i guess

#

you could use SVD instead if you didnt need non-negativity

supple ferry Aug 24, 2019, 2:07 PM

#

because of the economics side, i was advised to " use techniques which are more scientifically explainable and interpretable to management people"

#

first thing was to cluster items in every pool and derive some pool-cluster-itemspecific characteristics

desert oar Aug 24, 2019, 2:08 PM

#

bleh

#

what do you mean pool-cluster-item specific

#

like... you want to perform clustering within each pool?

#

that sounds totally mad

supple ferry Aug 24, 2019, 2:09 PM

#

i derived additional features for every item which takes into account pool meta characteristics and + the cluster it belongs to within that pool

#

yes, that is exactly what i am saying 🙂

desert oar Aug 24, 2019, 2:10 PM

#

and youre feeding those into the decision model?

supple ferry Aug 24, 2019, 2:10 PM

#

yes

#

Logit for now

desert oar Aug 24, 2019, 2:11 PM

#

how many products do you have btw

supple ferry Aug 24, 2019, 2:11 PM

#

because only one item in every pool is chosen

#

1m "unique" products + 10k users (10k pools)

desert oar Aug 24, 2019, 2:12 PM

#

also it looks like your original question was about two-stage modeling. what exactly did you want to know about it? like how to propagate variance?

supple ferry Aug 24, 2019, 2:13 PM

#

my question was about finding another approach to this problem by treating this as two stage problem

#

first stage is can we predict the cluster of items within a pool which user will be interested in

desert oar Aug 24, 2019, 2:13 PM

#

i agree with ragepope. i dont think that kind of "traditional" approach is going to work

#

like what you described

supple ferry Aug 24, 2019, 2:13 PM

#

second stage is from the result of the first stage can we predict the item that user will choose

desert oar Aug 24, 2019, 2:14 PM

#

youre also gonna need to get "computational" here

#

of course you do

supple ferry Aug 24, 2019, 2:15 PM

#

@desert oar computational will not be a problem

desert oar Aug 24, 2019, 2:15 PM

#

users only select from a subset of products, right?

supple ferry Aug 24, 2019, 2:15 PM

#

they search for a product and are given a subset of products (which i have no control over)

desert oar Aug 24, 2019, 2:16 PM

#

you can see what they were shown, right?

supple ferry Aug 24, 2019, 2:17 PM

#

yes i have the items they saw

#

and the item they chose

desert oar Aug 24, 2019, 2:18 PM

#

you know what i'd do because i'm a lazy hack? i'd use NMF to get a low-rank Users cluster matrix and Products cluster matrix, then throw the user and product clusters into Vowpal Wabbit w/ negative-sampling loss

#

i have no idea if that would work, but it would be easy

supple ferry Aug 24, 2019, 2:19 PM

#

the structure is:

pool id,
meta features , around 26
was chosen or not

#

@desert oar I understood around 60 % of what you wrote 😄

desert oar Aug 24, 2019, 2:20 PM

#

vowpal wabbit is a command line tool for fitting linear models

#

but you have 1m labels so you really need negative sampling loss or something like that -- it's what makes word2vec possible with even huge vocabulary sizes

#

vowpal wabbit will do pairwise/quadratic interactions and negative sampling loss

#

so it's a very fancy logistic regression for really lazy people

drowsy marsh Aug 24, 2019, 2:21 PM

#

Hello, very basic user here.
I have a set of data in a table, and a function of 2 parameters of the type :
y = f(x1, x2)
I would like to have some kind of linear interpolation for any point on the graph based on the data I have. How would it be possible?
in "1D", np.interp does it

📎 unknown.png
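
One possible answer to the interpolation question, as a sketch: if the table is sampled on a regular grid of x1 and x2 values (an assumption, since the attached image is not available), bilinear interpolation can be built from two passes of `np.interp`. The helper name `interp2d_linear` and the sample table are made up for illustration. For points that are scattered rather than on a grid, `scipy.interpolate.griddata` is the usual tool instead.

```python
import numpy as np

def interp2d_linear(x1_grid, x2_grid, table, x1, x2):
    """Bilinear interpolation of table[i, j] = f(x1_grid[i], x2_grid[j])."""
    table = np.asarray(table, dtype=float)
    # Step 1: interpolate along x2 inside every row, giving f(x1_grid[i], x2)
    along_x2 = np.array([np.interp(x2, x2_grid, row) for row in table])
    # Step 2: interpolate those values along x1
    return float(np.interp(x1, x1_grid, along_x2))

# Hypothetical table: f(x1, x2) = 2*x1 + 3*x2 sampled on a small grid
x1_grid = np.array([0.0, 1.0, 2.0])
x2_grid = np.array([0.0, 10.0, 20.0])
table = 2 * x1_grid[:, None] + 3 * x2_grid[None, :]

y = interp2d_linear(x1_grid, x2_grid, table, x1=0.5, x2=5.0)
```

Both grids must be sorted in increasing order, and `np.interp` clamps to the edge values outside the grid rather than extrapolating.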

desert oar Aug 24, 2019, 2:22 PM

#

@void anvil i actually disagree, how to model this stuff well is all locked down as trade secrets

#

doing a thesis on purchase prediction and publishing it openly is valuable work imo

#

you think amazon doesnt?

#

sure. but in the real world youre almost never flying totally blind, its more just a matter of not knowing wtf to do with your metadata

#

i dont know what that even means

#

most users will have a unique combination of features, but that's just statistics

#

low variance X gives shitty predictions anyway

supple ferry Aug 24, 2019, 2:24 PM

#

this is a project in collaboration with a firm. firm is interested in a predictive model which will tell them which items from the pool they get to show first and minimize the time on website.
for the academia it is to understand the decoy effect and make one item more likely to be chosen by "putting" the decoy item near it

desert oar Aug 24, 2019, 2:24 PM

#

sort of

#

amazon knows a lot more than that

#

they also know each users unique purchase and ad click history

#

which they do have

#

no? i thought that was the whole pool/search thing

#

they know what people searched for and clicked on

#

...right?

#

hm

#

i also dont like the term "pool" here

#

its misleading

#

its a search id

#

for a single search

#

@supple ferry do you really only have 1 search per user? or a whole history

#

and what are some examples of the other features?

#

and how big are the search pools anyway? first 10 results?

supple ferry Aug 24, 2019, 2:27 PM

#

@desert oar so it is a search yes. and we have one search per "user"

desert oar Aug 24, 2019, 2:27 PM

#

why do you say "user"

#

is this not real user data?

supple ferry Aug 24, 2019, 2:29 PM

#

what we have is search ids. and in the dataset we also have user ids, but they are all unique, it does not give us much info

#

so what i have is, user comes to website, searches for something -- as result he gets 1 - 200 results

#

and then chooses one

desert oar Aug 24, 2019, 2:30 PM

#

ahhhh

#

ok

#

if anything you only have one user

#

and your entire model will be effectively marginalized over user features

supple ferry Aug 24, 2019, 2:30 PM

#

sorry if i explained in unclear way

desert oar Aug 24, 2019, 2:31 PM

#

i think he needs to hire me as a consultant

#

so i can quit my job and work on an interesting problem

supple ferry Aug 24, 2019, 2:31 PM

#

our goal is to order those results in a way that user finds what he searches for and buys it and leaves the site

desert oar Aug 24, 2019, 2:31 PM

#

your model is: P( Y | U, Z ), where Y is the purchase decision, U is user metadata, and Z is other metadata (time of day, location, etc)

supple ferry Aug 24, 2019, 2:31 PM

#

but i dont have any user metadata

desert oar Aug 24, 2019, 2:31 PM

#

right

supple ferry Aug 24, 2019, 2:31 PM

#

not even age gender and etc

desert oar Aug 24, 2019, 2:31 PM

#

without knowing individual users, you are marginalizing over U

#

so you are fitting the model SUM_u P( Y | U=u, Z ) * P( U=u )

#

im abusing notation but hopefully you see what i mean

#

unless you start trying to disambiguate users using e.g. location, your whole thing needs to be interpreted from the perspective of "this is averaged across all users"
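
Written out, the marginalization being described looks roughly like this, under the simplifying assumption (not stated in the chat) that the user type U is independent of the search context Z:

```latex
P(Y \mid Z) \;=\; \sum_{u} P(Y \mid U = u,\, Z)\, P(U = u)
```

A model fitted without any user features estimates the left-hand side, so its coefficients describe an average over the user population rather than any individual user.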

supple ferry Aug 24, 2019, 2:33 PM

#

yes

desert oar Aug 24, 2019, 2:33 PM

#

so no, do not treat "users" as unique

#

you have unique searches

#

that said you can probably recover some user-level metadata

supple ferry Aug 24, 2019, 2:34 PM

#

yes

desert oar Aug 24, 2019, 2:34 PM

#

e.g. region, users usually don't leave their country or continent

#

so how big are these search pools

supple ferry Aug 24, 2019, 2:35 PM

#

you mean result pool ?

#

which user sees ?

#

because it does not come from the firm side, it is not normally distributed and ranges from just 1 result up to 200

#

i can not do so because i do not have control on what results the user is getting, i have control on ordering of those results

#

from academia point of view i am interested in ordering of the results because of the asymmetric dominance effect

desert oar Aug 24, 2019, 2:45 PM

#

ok that's a start at least

#

i'd seriously consider returning to first principles

#

e.g. the only reason we can use logistic regression for this is because of random utility maximization
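
A sketch of what random utility maximization gives in this setting, assuming the standard conditional logit form (i.i.d. extreme-value errors): within one search, the probability of choosing item i is exp(v_i) divided by the sum of exp(v_j) over the items shown. The feature names and coefficients below are made up. The same function handles pools of any size, which matters here because pools range from 1 to 200 results.

```python
import numpy as np

def choice_probabilities(features, beta):
    """Conditional-logit probabilities for the items shown in one search."""
    utility = features @ beta            # systematic utility v_i = x_i . beta
    utility = utility - utility.max()    # shift for numerical stability
    expu = np.exp(utility)
    return expu / expu.sum()

# Hypothetical coefficients for two made-up item features (price, rating)
beta = np.array([-0.5, 1.0])

# Two searches with different pool sizes (3 items and 5 items);
# the first three items are the same in both.
search_a = np.array([[10.0, 4.0], [12.0, 4.5], [8.0, 3.0]])
search_b = np.array([[10.0, 4.0], [12.0, 4.5], [8.0, 3.0],
                     [9.0, 4.8], [15.0, 5.0]])

p_a = choice_probabilities(search_a, beta)
p_b = choice_probabilities(search_b, beta)

# Under this model the odds between two items ignore the rest of the pool
ratio_a = p_a[0] / p_a[1]
ratio_b = p_b[0] / p_b[1]
```

The two ratios come out equal, which is the independence of irrelevant alternatives (IIA) property. A decoy effect is a violation of exactly that property, so a plain logit cannot represent it without extra pool-level features.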

supple ferry Aug 24, 2019, 2:45 PM

#

yes

desert oar Aug 24, 2019, 2:46 PM

#

i don't know if this is the right answer. but maybe write out the user's utility function

#

im not sure you even can write one out

#

because you have this weird varying budget set

supple ferry Aug 24, 2019, 2:47 PM

#

yes

desert oar Aug 24, 2019, 2:47 PM

#

so the user's decision is going to depend not just on their utility but also on the budget set

#

yes like i said, this is all in the sense of an "average" user

supple ferry Aug 24, 2019, 2:48 PM

#

yes they choose from what they are given

desert oar Aug 24, 2019, 2:48 PM

#

right

#

ok so

#

RUM framework

supple ferry Aug 24, 2019, 2:48 PM

#

this problem is interesting as it is hard

desert oar Aug 24, 2019, 2:48 PM

#

yeah dude can i please go back to school

#

lol

supple ferry Aug 24, 2019, 2:48 PM

#

exactly

desert oar Aug 24, 2019, 2:48 PM

#

or actually get hard problems at work

#

instead of problems that are conceptually easy but just require fuck tons of programming

supple ferry Aug 24, 2019, 2:49 PM

#

are you an economist btw

desert oar Aug 24, 2019, 2:49 PM

#

@void anvil im working on it lol

#

i wish

#

im not smart enough

#

@supple ferry no but i thought i was going to get an econ phd at one point

#

@void anvil thats interesting, i do like framing this as a search relevance problem

#

ok i think ive convinced myself that RUM still holds up here

#

but you totally do

#

preferences are a partial ordering

#

you wont reconstruct the complete order

#

sure there is, you see a collection of results and you know that chosen_item was preferred

#

preferred given the search results

#

thats the whole thrust of their investigation

#

yeah youll never do it all

supple ferry Aug 24, 2019, 2:53 PM

#

no no

desert oar Aug 24, 2019, 2:53 PM

#

but i dont think thats the goal

#

that said @supple ferry do you actually have the search terms?

supple ferry Aug 24, 2019, 2:53 PM

#

i want to order only the results a particular user sees

desert oar Aug 24, 2019, 2:53 PM

#

because i do think rage is on to something

#

im not sure what that achieves

#

the idea isn't to find a preference over all 1m items

#

oh i see

#

circumvent the problem entirely

#

w/ respect to the business "spend as little time on the page" criterion

#

yeah no thats totally reasonable

#

eh

#

not even shitty

#

if anything i think the question is a little misguided after this conversation

#

like... of course order matters a lot

#

and no, swapping the 198th and 197th items does not matter at all

#

nor does increasing the set from 201 to 202

#

if you really want to prove that, sure, you have a thesis

#

but if you want to investigate IIA violations in the wild you'll want to restrict your data set significantly

#

and you probably need more baseline data

#

e.g. how often does anyone ever click the nth search result?

#

i assume that after like 10-20 you get near-zero clicks

#

so not only "should" you use search term relevance like ragepope is saying, but your data is going to be very heavily dependent on the existing algorithm
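
The baseline question above (how often the nth result gets chosen) can be checked with a small groupby. This is a sketch on a made-up log; the column names `search_id`, `position`, and `chosen` are assumptions about how the data might be laid out, with one row per item shown in a search.

```python
import pandas as pd

# Hypothetical log: one row per (search, shown item); column names are made up
log = pd.DataFrame({
    "search_id": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "position":  [1, 2, 3, 1, 2, 3, 4, 1, 2],
    "chosen":    [1, 0, 0, 0, 1, 0, 0, 1, 0],
})

# For each rank position: how often it was shown, chosen, and the choice rate
by_position = (
    log.groupby("position")["chosen"]
       .agg(shown="size", clicks="sum", rate="mean")
)
```

If the rate drops to near zero after the first 10 to 20 positions, as guessed above, restricting the analysis to those positions would be one way to cut the data set down.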

#

@void anvil insurance, believe it or not

#

nope

#

btw actuarial would be really interesting if only regulators werent so strict about pricing

#

you know how everyone is clamoring for regulating AI and open/transparent algorithms and stuff? we've had that in insurance for years, and it holds back the industry

#

and it makes a worse experience for everyone

#

"why should i be penalized just because im under 25??"

#

"because state regulators dont understand XGBoost son"

#

right lol

#

but no i literally havent touched pricing at all

#

it sucks

#

ive basically spent the last year classifying businesses

#

yeah i love lime

#

its weird w/ text models though

#

yeah i know i cant remember the name either

#

i saw you mention it

#

hang on its in my zotero collection

#

SHAP

#

https://github.com/slundberg/shap

GitHub

slundberg/shap

A unified approach to explain the output of any machine learning model. - slundberg/shap

#

zotero to the fuckin rescue

#

https://raw.githubusercontent.com/slundberg/shap/master/docs/artwork/boston_instance.png
wew

#

i also like partial dependence plots a lot

#

oh shap has them too, nice

#

oooh nice python PDP lib https://github.com/SauceCat/PDPbox

GitHub

SauceCat/PDPbox

python partial dependence plot toolbox. Contribute to SauceCat/PDPbox development by creating an account on GitHub.

#

i actually wrote my own partial dependence plot library for R a few years ago

#

i used it all the time

#

at my previous company they wanted to know how much users were willing to pay for a product upgrade

#

so i did exactly what QWERTY is doing, i fitted a discrete choice model w/ price as one of the inputs

#

then i got the partial dependence of price

#

the business loved it, it was the perfect combination of "i understand this" and "ai magic" and "we are domain experts"
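
A sketch of the partial dependence idea described above, written by hand in NumPy rather than with any of the libraries linked earlier. The "model" is a stand-in logistic curve with made-up coefficients, not the model from the anecdote. For each candidate price, every row gets that price while keeping its other features, and the predictions are averaged.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value        # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

# Stand-in for a fitted choice model: a logistic curve with made-up
# coefficients. Column 0 is price, column 1 is some other feature.
def predict_upgrade(X):
    z = 2.0 - 0.3 * X[:, 0] + 0.8 * X[:, 1]
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(5, 15, 200), rng.normal(0, 1, 200)])

price_grid = np.linspace(5, 15, 6)
pd_price = partial_dependence(predict_upgrade, X, feature=0, grid=price_grid)
```

Plotting `pd_price` against `price_grid` gives the curve of average predicted uptake as price changes, which is the kind of picture that reads as a willingness-to-pay curve.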

supple ferry Aug 24, 2019, 3:09 PM

#

i will dig into the directions you mentioned

#

@void anvil and @desert oar

#

wanna thank you both

#

for help and interesting discussion

desert oar Aug 24, 2019, 3:10 PM

#

put us in your thesis acknowledgements ;P

#

"and all those people with waifu and rick and morty avatars"

supple ferry Aug 24, 2019, 3:12 PM

#

😄

desert oar Aug 24, 2019, 3:12 PM

#

same. i wish i talked to people about mine

#

i didnt even have an advisor for a while

supple ferry Aug 24, 2019, 3:13 PM

#

i have time till 2021

desert oar Aug 24, 2019, 3:13 PM

#

oh yeah you got time. im actually curious to see what comes out of this

supple ferry Aug 24, 2019, 3:13 PM

#

i will share it with you when the time comes 🙂

desert oar Aug 24, 2019, 3:14 PM

#

holy shit

supple ferry Aug 24, 2019, 3:15 PM

#

you have been through some serious stuff

desert oar Aug 24, 2019, 3:16 PM

#

yeah poor guy

#

4 weeks with kidney stones? that must have been some serious kidney stones

supple ferry Aug 24, 2019, 3:17 PM

#

@void anvil i meant about notes one day before presentation 🙂

desert oar Aug 24, 2019, 3:19 PM

#

i think mine had a max page limit, i cant remember

#

max limit including charts

#

i remember it was brutal trying to pare it down

#

i was like tweaking line spacing and page borders in latex

#

heh

#

i also never had to present or anything, i just dropped the paper off at my advisor's office, i dont think i even met with him when i got the final marked up version back, i might have just picked it up from a drop box or something

#

like a fuckin term paper lol

#

that sounds way more intelligent

#

i basically turned a "fast track" 1 year masters into a full 2 year program

#

i did everything so weird and wrong

#

heh

#

my problem was partly that i got sick during my 3rd semester

#

fucked up my whole life

#

not to mention my schedule

#

lol right

unborn drum Aug 25, 2019, 1:13 AM

#

is regression or forecasting used to predict rainfall?

grizzled folio Aug 25, 2019, 1:42 AM

#

I can't speak for all agencies, but I believe it's usually forecasting

unborn drum Aug 25, 2019, 2:08 AM

#

@grizzled folio thanks

grizzled folio Aug 25, 2019, 2:08 AM

#

operational forecasting is pretty interesting though

grizzled folio Aug 25, 2019, 3:13 AM

#

It's not really that complicated

desert oar Aug 25, 2019, 10:21 AM

#

Regression and forecasting are two different things, you can use one or neither or both if you want

unborn drum Aug 25, 2019, 4:07 PM

#

oh, then is regression used in estimating whether or not a person will default on a loan?

lapis sequoia Aug 25, 2019, 5:06 PM

#

Hello, can you suggest big data science projects (not including deep learning) to examine, coded by experienced data scientists? I mean complete projects with a proper project structure etc., not just Jupyter notebooks. I want to see what big data science projects should look like.

desert oar Aug 25, 2019, 5:34 PM

#

@lapis sequoia those are usually done professionally and not likely to be something you can find publicly

#

You might find some of the source code, or you might find the result of the project open source, but you're not likely to find a project where all of the intermediate work has been published

#

@unborn drum regression is a type of model, forecasting is a task to be achieved with a model

lapis sequoia Aug 25, 2019, 6:27 PM

#

Thanks @desert oar 🙂

supple ferry Aug 26, 2019, 10:50 AM

#

@void anvil paperswithcode is biggest discovery i made this year

west walrus Aug 26, 2019, 12:57 PM

#

Can anyone suggest a good method of object detection for finding specific images (although they may not be seen straight on) in a picture? I'm trying to detect street signs in real time

quartz stream Aug 26, 2019, 12:58 PM

#

YOLO

#

is good for object detection

west walrus Aug 26, 2019, 12:58 PM

#

Ooh ok I'll check it out

quartz stream Aug 26, 2019, 12:58 PM

#

with less objecys

#

objects

#

there are pre-trained models available

west walrus Aug 26, 2019, 12:58 PM

#

Is it easy to train my own images on it?

#

Or would there already be street sign models

quartz stream Aug 26, 2019, 1:00 PM

#

it is easy to train

#

https://arxiv.org/pdf/1901.07518v2.pdf

#

read this paper

west walrus Aug 26, 2019, 1:00 PM

#

Thanks so much! My latest attempt was to run feature matches over circles detected on the frame. felt very janked

quartz stream Aug 26, 2019, 1:00 PM

#

this is sota for object detection

#

yeah

west walrus Aug 26, 2019, 1:00 PM

#

Ok, I'll have a read

#

Thanks so much 👌

quartz stream Aug 26, 2019, 1:00 PM

#

https://github.com/open-mmlab/mmdetection

GitHub

open-mmlab/mmdetection

Open MMLab Detection Toolbox and Benchmark. Contribute to open-mmlab/mmdetection development by creating an account on GitHub.

#

The code if you want 😛

#

Paper 2 : https://arxiv.org/pdf/1707.09700v2.pdf

west walrus Aug 26, 2019, 1:01 PM

#

Is it resource intensive? I'm planning to run it on a raspberry pi but I could stream the video to my pc for processing if it was necessary

quartz stream Aug 26, 2019, 1:01 PM

#

and github code https://github.com/yikang-li/MSDN

GitHub

yikang-li/MSDN

This is our PyTorch implementation of Multi-level Scene Description Network (MSDN) proposed in our ICCV 2017 paper. - yikang-li/MSDN

#

you can always train the model

#

then save the model to a path

#

that file (joblib) can be used

#

on raspberry

#

and it wont be resource intensive

west walrus Aug 26, 2019, 1:02 PM

#

Okay, that's perfect then!

quartz stream Aug 26, 2019, 1:02 PM

#

https://www.geeksforgeeks.org/saving-a-machine-learning-model/

GeeksforGeeks

Saving a machine learning Model - GeeksforGeeks

In machine learning, while working with scikit learn library, we need to save the trained models in a file and restore them in order to… Read More »

#

There are alternatives for keras as well as tensorflow online

#

but if you are not into complex models try yolo, it is a single-shot detector

west walrus Aug 26, 2019, 1:03 PM

#

Okay sounds good

proud iris Aug 26, 2019, 5:46 PM

#

hi! So I've built my first regression model in Keras and it doesn't look good. I'm not getting very accurate results, can anyone help?

desert oar Aug 26, 2019, 5:47 PM

#

@proud iris what data do you have, what kind of model are you using, how are you training the model, and how are you evaluating accuracy

dusty latch Aug 26, 2019, 5:48 PM

#

Can I bounce some ideas off anyone? I need project ideas that have business applications for a class. I already did a stock bot last semester. I need some fresh ideas.

proud iris Aug 26, 2019, 5:49 PM

#

I just created a data set, x*sin(x) is my evaluating function. I have three columns, x, sin(x) and their product. I have 3000 data entries, I'm using 2995 for training, last 5 as test cases

#

I'm very very new to this so please be patient with me if some of my questions appear very trivial

desert oar Aug 26, 2019, 5:50 PM

#

you're trying to learn the function x*sin(x)?

proud iris Aug 26, 2019, 5:50 PM

#

accuracy is being measured by mean squared error

#

yeah

desert oar Aug 26, 2019, 5:50 PM

#

5 test cases isn't much

#

what kind of model is this?

#

1 output w/ some fully connected hidden layers?

proud iris Aug 26, 2019, 5:51 PM

#

yeah. Dense layers stacked

desert oar Aug 26, 2019, 5:51 PM

#

are you doing any hyperparameter tuning

proud iris Aug 26, 2019, 5:52 PM

#

can I show you my code

desert oar Aug 26, 2019, 5:52 PM

#

i guess. i dont really use keras and i'm not really a neural network aficionado

#

that would answer my questions though

#

!paste

arctic wedgeBOT Aug 26, 2019, 5:52 PM

#

paste

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

proud iris Aug 26, 2019, 5:52 PM

#

no, I don't have any idea about hyperparameter tuning so I guess I'm not doing it?

desert oar Aug 26, 2019, 5:52 PM

#

yeah probably not

proud iris Aug 26, 2019, 5:53 PM

#

https://paste.pydis.com/ewajacipuj.py

dusty latch Aug 26, 2019, 5:53 PM

#

You could use talos for hp tuning

desert oar Aug 26, 2019, 5:54 PM

#

maybe instead of just the last 5, what if you randomly drop out points for train/test

proud iris Aug 26, 2019, 5:55 PM

#

This should be a very simple implementation? I shouldn't be needing hypertuning and other stuff?

desert oar Aug 26, 2019, 5:55 PM

#

data_train = data.sample(frac=0.9)
data_test = data.loc[data.index.difference(data_train.index)]

proud iris Aug 26, 2019, 5:56 PM

#

walk me through this, data_train gets 75 percent of the generated data

#

what is going on in data_test?

desert oar Aug 26, 2019, 5:56 PM

#

i changed it to 90 but yeah

#

pandas has this notion of an "index"

#

basically a label for each row

proud iris Aug 26, 2019, 5:56 PM

#

yeah like excel

desert oar Aug 26, 2019, 5:56 PM

#

sorta yeah

#

data.index.difference(data_train.index) gives you the index values from data that aren't in data_train

#

then you use .loc to get the corresponding rows

proud iris Aug 26, 2019, 5:57 PM

#

okay got it

#

is that 90 percent random? or the first 90 percent

desert oar Aug 26, 2019, 5:58 PM

#

alternatively you can do

is_train = np.random.choice([True, False], p=[0.9, 0.1], size=data.shape[0])
data_train = data.loc[is_train]
data_test = data.loc[~is_train]

#

random

#

(in both cases)

#

that doesn't make your model better but at least evaluating it will be more representative
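
Both split snippets above, put together as a runnable sketch. The dataset is rebuilt from the description (x, sin(x), and their product, 3000 rows); the x range of 0 to 30 and the fixed seeds are assumptions added so the example is reproducible.

```python
import numpy as np
import pandas as pd

# Rebuild a dataset shaped like the one described: x, sin(x), and x*sin(x)
x = np.linspace(0, 30, 3000)
data = pd.DataFrame({"x": x, "sin_x": np.sin(x), "y": x * np.sin(x)})

# Approach 1: sample 90% of rows at random, the rest become the test set
data_train = data.sample(frac=0.9, random_state=0)
data_test = data.loc[data.index.difference(data_train.index)]

# Approach 2: random boolean mask; split sizes vary a little from seed to seed
rng = np.random.default_rng(0)
is_train = rng.random(len(data)) < 0.9
mask_train = data.loc[is_train]
mask_test = data.loc[~is_train]
```

The first approach gives exactly 2700 and 300 rows. The second gives roughly that, because each row is assigned independently.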

proud iris Aug 26, 2019, 6:00 PM

#

still 300 values 😦

#

Gonna try with 99

dusty latch Aug 26, 2019, 6:02 PM

#

Also take a look at this https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53

Medium

Hyperparameter Optimization with Keras

Finding the right hyperparameters for your deep learning model can be a tedious process. It doesn’t have to.

desert oar Aug 26, 2019, 6:02 PM

#

whats wrong with 300 values @proud iris ?

#

if your model is bad, going from 700 to 990 values isn't going to make it better

#

but it will improve your evaluation because you're going from 10 to 300 in your test set

dusty latch Aug 26, 2019, 6:04 PM

#

Also, like srl said before you should really be doing at least a 20% test

proud iris Aug 26, 2019, 6:05 PM

#

How will I check the values of 300 entries?

dusty latch Aug 26, 2019, 6:05 PM

#

With code

#

for loop over all 300
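
A loop works, but with arrays the whole test set can be checked at once. This is a sketch with made-up stand-ins: `y_pred` here is the true value plus noise, whereas in practice it would come from something like `model.predict(x_test)`.

```python
import numpy as np

# Made-up stand-ins: true targets and pretend predictions on 300 test rows
rng = np.random.default_rng(0)
x_test = rng.uniform(0, 30, 300)
y_true = x_test * np.sin(x_test)
y_pred = y_true + rng.normal(0, 0.5, 300)

errors = y_pred - y_true
mse = np.mean(errors ** 2)                 # the metric already in use
mae = np.mean(np.abs(errors))              # same units as y, easier to read
worst = np.argsort(np.abs(errors))[-5:]    # indices of the 5 largest misses
```

Looking at `x_test[worst]` shows where on the curve the model struggles most, which a single averaged number hides.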

proud iris Aug 26, 2019, 6:05 PM

#

or the mean squared error

#

but why is my model not predicting well? Is the code correct?

dusty latch Aug 26, 2019, 6:08 PM

#

probably because you have no hp tuning and you're using the same type of neurons for like 4 layers

desert oar Aug 26, 2019, 6:08 PM

#

it's hard to say when you're only predicting on 5 values

#

and what landshark said

dusty latch Aug 26, 2019, 6:08 PM

#

throw a sig layer in there see what happens.

desert oar Aug 26, 2019, 6:08 PM

#

im not even sure what the value of stacking relus like that is. wouldnt you just use like 1 big hidden linear layer for this kind of thing

#

or like a hidden layer and a sigmoid cause sin

#

like i said im not a NN guy

#

so correct me if im wrong

dusty latch Aug 26, 2019, 6:09 PM

#

Also, you need a validation set.

#

then do a confusion matrix on the validation set with prediction set

#

Do 75, 15,10 sets
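
A sketch of a 75/15/10 split on 3000 rows, assuming the three numbers mean train, validation, and test in that order (the message does not say which is which). Shuffling the row indices once and slicing keeps the three sets disjoint.

```python
import numpy as np

n_rows = 3000
rng = np.random.default_rng(0)
order = rng.permutation(n_rows)          # shuffled row indices

n_train = n_rows * 75 // 100             # 2250 rows
n_val = n_rows * 15 // 100               # 450 rows

train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]       # the remaining 300 rows
```

The index arrays can then be used with `data.iloc[train_idx]` and so on. The validation set is for choosing settings during training, and the test set is touched only once at the end.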

proud iris Aug 26, 2019, 6:11 PM

#

confusion matrix?

#

validation set, by that you mean a test set whose output I know

dusty latch Aug 26, 2019, 6:12 PM

#

Literally just get 10% of your data and use that as your validation set.

proud iris Aug 26, 2019, 6:14 PM

#

I don't know any of those things @void anvil so please explain a bit. Also the diagram

dusty latch Aug 26, 2019, 6:14 PM

#

Also, Rik, make that last layer a linear layer. Idk if it does that by default but I do it for good measure

proud iris Aug 26, 2019, 6:15 PM

#

@dusty latch I am already using 10 percent as my validation set. But what did you say about the confusion matrix?

dusty latch Aug 26, 2019, 6:15 PM

#

confusion matrix basically tests your validation set against your predicted values.

proud iris Aug 26, 2019, 6:15 PM

#

okay. So just subtracting one from the other?

dusty latch Aug 26, 2019, 6:16 PM

#

You can get a better measure of model usefulness from sensitivity and specificity. A cm will give that to you.

#

It does a lot more than just subtracting one from the other.

proud iris Aug 26, 2019, 6:17 PM

#

the one thing I don't get is that there are examples and tutorials where this code is enough to produce accurate-ish results. And my function isn't that difficult to learn is it?

#

@dusty latch will look into it then.

dusty latch Aug 26, 2019, 6:18 PM

#

You're just not using enough data points and tuned hp

#

just look up hyperparameter tuning in python keras and you'll have loads of things.

proud iris Aug 26, 2019, 6:19 PM

#

What was that bit about linear layers? Mine is a dense one, isn't that how it should be? every neuron connected to all other neurons in the previous layer?

#

yeah I will definitely look into it

dusty latch Aug 26, 2019, 6:20 PM

#

You know how you had layers before as relu? Add something like that to the last layer but use linear

proud iris Aug 26, 2019, 6:20 PM

#

yeah but why?

#

okay not the layer but the activation function
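
One reason the output activation matters for this target, shown as a NumPy sketch with random made-up weights: x*sin(x) goes negative, and a ReLU on the final layer can never output a negative number, while a linear (identity) output can. The pasted model is not reproduced in the log, so whether its last layer actually used ReLU is an assumption here. In Keras, `Dense` applies no activation unless one is passed, so a bare `Dense(1)` is already linear.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, W1, b1, W2, b2, output_activation):
    hidden = relu(x @ W1 + b1)                 # hidden layer keeps ReLU
    return output_activation(hidden @ W2 + b2)

# Random, untrained weights for a tiny 1 -> 8 -> 1 network
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)

x = np.linspace(0, 30, 200).reshape(-1, 1)
target = x * np.sin(x)                         # dips well below zero

out_relu = forward(x, W1, b1, W2, b2, relu)
out_linear = forward(x, W1, b1, W2, b2, lambda z: z)
```

Whatever the weights are, `out_relu` is clipped at zero, so the negative half of the target is unreachable for that network no matter how long it trains.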

#

crap I need to learn quite a bit

#

how many neurons does one use in the hidden layers? Is it trial and error?

#

Won't that create a bias for the validation data?

dusty latch Aug 26, 2019, 8:10 PM

#

https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

Cross Validated

How to choose the number of hidden layers and nodes in a feedforwa...

Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a feed-forward neural network? I'm interested in automated ways of building neu...

rancid totem Aug 26, 2019, 8:13 PM

#

If you keep tuning your hyperparams too much on your validation set then yes you're going to overfit

compact thistle Aug 27, 2019, 5:18 AM

#

Guys I'm doing some capstone project about Anomaly Detection(Outlier Detection) on Credit card fraud dataset on Kaggle.
The dataset has labels(Whether it's fraud or not) so initially I was going to do some supervised learning stuff but then in real life scenario we don't get to work with labels so I was going to use unsupervised learning methods to detect outliers.
Yes I'm pretending I have no information about labels but will use it later to evaluate my unsupervised learning methods' performance.

The problem I'm facing here is that maybe it is too ...simple? I'm basically trying out many different outlier detection models and simply comparing them.
I'm wondering if you guys have any brilliant idea to make this part fancy or impactful enough?
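
A sketch of the workflow described above: score rows without labels, then bring the labels back only to evaluate. The data is synthetic (not the Kaggle set), and Mahalanobis distance is just one example of an unsupervised score picked for illustration. Precision among the top-k flagged rows is one way to compare detectors when fraud is rare.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the transactions: 500 normal rows, 5 planted outliers
normal = rng.normal(0.0, 1.0, size=(500, 3))
fraud = 8.0 + rng.normal(0.0, 0.1, size=(5, 3))
X = np.vstack([normal, fraud])
labels = np.r_[np.zeros(500, dtype=int), np.ones(5, dtype=int)]  # held back

# Unsupervised score: squared Mahalanobis distance from the overall mean
centered = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
scores = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Only now bring the labels back, purely for evaluation
k = 5
flagged = np.argsort(scores)[-k:]
precision_at_k = labels[flagged].mean()
```

Running the same evaluation over several detectors and several values of k would turn "which one worked best" into a comparison of how each method trades off missed fraud against false alarms.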

lapis sequoia Aug 27, 2019, 5:21 AM

#

hmm.. if you don't have labels, how do you plan to discriminate

#

there is a distance measure we typically use in banking.. for the life of me, i'm not able to recall it right now

#

but it's what's usually used when trying to find who is at risk of defaulting.. or committing credit card fraud

#

the name of the distance measure ends in 'ov' , I'll ping you if I remember it

compact thistle Aug 27, 2019, 5:35 AM

#

@lapis sequoia Hey Tron, yea there are multiple algorithms i can use for outlier detection. Like you said, i have a couple of things based on euclidean methods or ensemble or probability or cluster.
I'm generally concerned about the depth of the project. I feel like I'm just laying out options and simply telling them what worked the best.
And I don't feel like it's complicated enough

lapis sequoia Aug 27, 2019, 5:36 AM

#

ok.. you want to tell them you tried a bunch of methods and compare

#

so do that

#

seems like a comparative study of classifiers for this application.. hmm..

compact thistle Aug 27, 2019, 5:39 AM

#

Yea... still it's just not deep enough. I could probably do some hyperparameter optimization for better performance but then that's totally going against the concept of unsupervised learning

lapis sequoia Aug 27, 2019, 5:40 AM

#

check out current papers and see if there's a gap you could fill

compact thistle Aug 27, 2019, 5:41 AM

#

Yea I'll do that on the side. Thanks man

native rivet Aug 27, 2019, 6:19 AM

#

Hi guys

#

I own a small bakery

#

How can I use data science skills to boost my sale or help my business?

#

Can someone help me

desert oar Aug 27, 2019, 6:55 AM

#

@native rivet is this an actual question or is this a homework question

native rivet Aug 27, 2019, 7:36 AM

#

😆😆😆

#

Actual Question

#

@desert oar

olive willow Aug 27, 2019, 9:13 AM

#

First get some data from your business and try to answer questions you have with it and explore it in general, see if something stands out

#

@native rivet

#

Later you can optimize your business using it and get a data strategy

#

But remember only to collect the data you really need, because data without a purpose is worth nothing

#

Questions like:

#

When do customers come to my shop

#

What do they buy

#

Are the customers somehow related
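
The first two questions above can be answered with a couple of groupbys once sales are recorded. This is a sketch on made-up till data; the column names `timestamp`, `item`, and `price` are assumptions about what a point-of-sale export might contain.

```python
import pandas as pd

# Made-up till receipts: one row per item sold
sales = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-08-26 08:05", "2019-08-26 08:40", "2019-08-26 12:15",
        "2019-08-26 12:30", "2019-08-27 08:10", "2019-08-27 16:45",
    ]),
    "item": ["croissant", "coffee", "sandwich", "coffee", "croissant", "cake"],
    "price": [2.5, 3.0, 6.0, 3.0, 2.5, 4.0],
})

# When do customers come? Count sales per hour of day
sales_per_hour = sales.groupby(sales["timestamp"].dt.hour).size()

# What do they buy? Units sold and revenue per item
per_item = sales.groupby("item")["price"].agg(units="size", revenue="sum")
```

With a few weeks of real data, the hourly counts show when to have stock ready and the per-item table shows which products carry the revenue.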

native rivet Aug 27, 2019, 9:20 AM

#

Thanks bro

#

I'll see how it goes

polar acorn Aug 27, 2019, 9:22 AM

#

I suggest you entice the customers with some deep baking. Use densely connected flavours and many hidden layers in your pastries. Also look into batch normalisation if you're baking large batches of cup cakes. LSTM (Luxury Sweet Tasting Macarons) are in high demand, though some prefer TCN (Trdelník Containing Nutella). Best of luck

earnest prawn Aug 27, 2019, 9:28 AM

#

🙄

native rivet Aug 27, 2019, 10:44 AM

#

It's a franchise

#

@polar acorn

polar acorn Aug 27, 2019, 10:47 AM

#

It was an attempt at humor pensive_snake_cowboy

native rivet Aug 27, 2019, 12:17 PM

#

@polar acorn 😆

#

Let's see ill let you know. Thanks @void anvil

dusty latch Aug 27, 2019, 12:19 PM

#

Lads, I'm struggling to think of ideas to do for my python data analytics class. Last semester I had the same class but in R and I did stock market direction prediction. Any ideas on something cool I can do? I dont want to do something just plain like the iris dataset or whatever

dusty latch Aug 27, 2019, 2:11 PM

#

Thanks but I forgot to add that it needs to have a business case