#data-science-and-ml | Python | Page 205

surreal nacelle Jul 20, 2019, 12:09 PM

#

and I read the mean of the results from there

desert oar Jul 20, 2019, 12:09 PM

#

Ok

surreal nacelle Jul 20, 2019, 12:09 PM

#

when not using cross val, I fit/predict and read the accuracy from the prediction

#

I split before tho

desert oar Jul 20, 2019, 12:10 PM

#

Ok good. You should usually stick to one evaluation procedure
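
For reference, the two procedures above differ only in how the held-out rows are chosen: one fixed split versus several rotating ones whose scores are averaged. A minimal library-free sketch of the rotating version follows. The helper names `k_fold_indices` and `cross_val_mean` are made up for illustration; in practice scikit-learn's `cross_val_score` handles this.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_val_mean(score_fn, n_samples, k=5):
    """Average a per-fold score; score_fn(train_idx, test_idx) returns a float."""
    return mean(score_fn(tr, te) for tr, te in k_fold_indices(n_samples, k))


# Ten rows split into five folds of two rows each.
folds = list(k_fold_indices(10, 5))
```

Every row lands in exactly one test fold, so the averaged score uses all the data without ever scoring a row the model was trained on.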

surreal nacelle Jul 20, 2019, 12:10 PM

#

cross val seems to be solid

desert oar Jul 20, 2019, 12:10 PM

#

Yeah it's a reasonable way to go

#

So you have two different models that have different accuracies?

surreal nacelle Jul 20, 2019, 12:10 PM

#

I only have 1 model for now, but I was thinking about combining 2 yes

desert oar Jul 20, 2019, 12:11 PM

#

Yes, that's a technique called ensembling
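
One of the simplest forms of ensembling is a majority vote over the class predictions of several models. The sketch below assumes each model has already produced a list of labels for the same rows; `majority_vote` and the three prediction lists are invented for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote.

    predictions_per_model: list of equal-length lists, one per model.
    On a tie, the label seen first in model order wins.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Made-up predictions from three models on four rows.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
ensemble = majority_vote([model_a, model_b, model_c])
```

The vote only helps when the models make different mistakes; three copies of the same model would just reproduce its errors.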

surreal nacelle Jul 20, 2019, 12:11 PM

#

Ok, gtk

desert oar Jul 20, 2019, 12:11 PM

#

It's very common and very popular in these types of prediction competitions

#

There are some great blog posts on it, let me find one

#

https://mlwave.com/kaggle-ensembling-guide/ @surreal nacelle

MLWave

mladmin

Kaggle Ensembling Guide

surreal nacelle Jul 20, 2019, 12:13 PM

#

thank you 😃

quartz monolith Jul 20, 2019, 3:09 PM

#

Decision Tree / CART with Label Encoder?

#

Dummies makes 0 sense

#

Balanced accuracy: 0.21309668192963746
Hamming loss: 0.2958520739630185
Accuracy: 0.7041479260369815

silk forge Jul 20, 2019, 4:12 PM

#

hey

#

"""Weather in Szeged 2006-2016: Is there a relationship between humidity and temperature? What about
 between humidity and apparent temperature?
  Can you predict the apparent temperature given the humidity?"""

import sklearn.linear_model as lin
import numpy as np
import pandas as pd
from sklearn.model_selection import  train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

dataw = pd.read_csv(filepath_or_buffer='C:/Users/admin/Desktop/Artifcial intelligence/ML/data/WEATHER/weatherHis.csv')


df_x = pd.DataFrame(dataw.Humidity)
df_y = pd.DataFrame(dataw['Temperature (C)'])

trainx,testx,trainy,testy = train_test_split(df_x,df_y,test_size=0.2,random_state=4)

regr = lin.LinearRegression()

regr.fit(trainx,trainy)
slope = regr.coef_
intercept = regr.intercept_



n = regr.predict(testx)

print(n)
print(testy)

print(mean_squared_error(y_true=testy , y_pred=n))

#

this is my

#

code

#

i get an MSE value of 55.248754390894184?

#

is that somewhat accurate

#

?

earnest prawn Jul 20, 2019, 4:37 PM

#

Ideally you'd always want your MSE to get as close to 0 as possible. Whether 55 is a good value really depends on what you're predicting: MSE is simply the average of the squared differences between actual and predicted values, so you have to decide for yourself whether sqrt(MSE) is good enough for you or not
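
To make the sqrt(MSE) point concrete: taking the square root puts the error back into the units of the target, here degrees Celsius. A small sketch using the MSE quoted above; the helper `mse_of` is only an illustration of the definition, not the scikit-learn function.

```python
import math


def mse_of(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


mse = 55.248754390894184   # value reported in the chat
rmse = math.sqrt(mse)      # roughly 7.4, in the same units as the temperature
```

An error of roughly 7.4 °C on a typical prediction is easier to judge than "55".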

void anvil Jul 20, 2019, 4:49 PM

#

if temperature is in celsius and humidity is 0-100, then probably not

#

if you did temperature, humidity vs apparent temp you'd probably get a lot better results

surreal nacelle Jul 20, 2019, 5:09 PM

#

Your Best Entry

You advanced 2,422 places on the leaderboard!
Your submission scored 0.78468, which is an improvement of your previous score of 0.77033. Great job!

I really don't see how to improve on that. Currently ranked 4200 out of 11600; is it time to read some notebooks written by people at the top of the leaderboard?

#

actually jumped 1000 place by running the model again 😄

#

and again 2000 places lol

#

now ranked 1000 out of 11600

#

feeling pretty good about it

quartz monolith Jul 20, 2019, 5:52 PM

#

nice good job

silk forge Jul 20, 2019, 5:58 PM

#

what did i do wrong?

#

@void anvil

surreal nacelle Jul 20, 2019, 5:59 PM

#

Thanks 😄

lapis sequoia Jul 20, 2019, 6:14 PM

#

What is the best free way to learn data science

earnest prawn Jul 20, 2019, 6:16 PM

#

check the second pin

void anvil Jul 20, 2019, 6:21 PM

#

x = temp + humidity, y = feels like temp

sweet socket Jul 20, 2019, 6:50 PM

#

guys i'm trying to get into a bit of algo trading and thinking of using backtrader in python, anyone know of anything better than this or is backtrader a good starting point? I'm not the greatest coder but i can fumble my way through most things

stuck obsidian Jul 20, 2019, 9:27 PM

#

I believe you would want to look at quantopian @sweet socket

void anvil Jul 21, 2019, 1:12 AM

#

lol good luck

surreal nacelle Jul 21, 2019, 7:40 AM

#

Alright I'm done with the titanic 😃

📎 Screen_Shot_2019-07-21_at_9.39.22_AM.png

quartz monolith Jul 21, 2019, 11:02 AM

#

Has anybody worked with SMOTENC and make_classification to fix the class imbalance in a dataset?

void anvil Jul 21, 2019, 3:55 PM

#

Looking at some code for reinforcement learning on time series (below). Why is it scaling each observation 0-1 instead of the entire observation space to 0-1?

What happens if you have divergent values (e.g. values that go above what are seen in the train set and, thus, will go above 1 if scaled)? Will everything break if values go < 0 or >1 or is everything fine? Would it be better to scale to a tighter range (e.g. instead of 0-1 scaling, 0.2-0.8 scaling)? Want to see if anyone knows / has resources before I start experimenting.

```
  # Get the data points for the last 5 days and scale to between 0-1
  frame = np.array([
    self.df.loc[self.current_step: self.current_step +
                5, 'Open'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'High'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Low'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Close'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Volume'].values / MAX_NUM_SHARES,
  ])
  # Append additional data and scale each value to between 0-1
  obs = np.append(frame, [[
    self.balance / MAX_ACCOUNT_BALANCE,
    self.max_net_worth / MAX_ACCOUNT_BALANCE,
    self.shares_held / MAX_NUM_SHARES,
    self.cost_basis / MAX_SHARE_PRICE,
    self.total_shares_sold / MAX_NUM_SHARES,
    self.total_sales_value / (MAX_NUM_SHARES * MAX_SHARE_PRICE),
  ]], axis=0)
  return obs
```

desert oar Jul 21, 2019, 5:58 PM

#

@void anvil i dont fully understand your question. it's scaling each feature separately

#

those MAX_* variables are defined outside the function

#

practically, i think normally you'd just clip the value to 0 or 1

void anvil Jul 21, 2019, 6:00 PM

#

ah yeah you're right, but it's still scaling the max observed to 1

#

whereas it could go to 1.5 or w/e

#

in the unobserved test set (or in real time application)

#

this example came from stock trading, so amazon is trading at ~2k now. If we were to let it go, amazon could go to 3k or 50k because it's unbounded

#

so then you'd end up feeding a value > 1

desert oar Jul 21, 2019, 6:28 PM

#

yeah that makes sense too

#

depends on the data
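
The clipping idea mentioned above can be sketched in a few lines. This assumes a fixed maximum chosen ahead of time; the value of `MAX_SHARE_PRICE` here is hypothetical, not the one from the quoted code.

```python
def scale_and_clip(value, max_value, low=0.0, high=1.0):
    """Scale by a fixed maximum, then clip the result into [low, high]."""
    scaled = value / max_value
    return max(low, min(high, scaled))


MAX_SHARE_PRICE = 2000.0  # hypothetical constant for illustration

in_range = scale_and_clip(1500.0, MAX_SHARE_PRICE)      # inside the range
out_of_range = scale_and_clip(3000.0, MAX_SHARE_PRICE)  # above the assumed maximum
```

The trade-off is that clipping throws information away: after clipping, a price of 3k and a price of 50k look identical to the model.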

quartz monolith Jul 21, 2019, 8:24 PM

#

I have 7 categorical features and i want to use smotenc

from collections import Counter
from numpy.random import RandomState
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)

Does somebody have an idea what the error means?
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6

hallow wave Jul 21, 2019, 8:36 PM

#

expected neighbors to be less than or equal to 1

quartz monolith Jul 21, 2019, 8:36 PM

#

I googled the problem, seems that my data set is too small

desert oar Jul 21, 2019, 8:38 PM

#

@quartz monolith that seems wrong. you shouldnt have only 1 sample

#

what is X_train.shape?

#

and y_train.shape?

#

unless that means 1 sample in a particular category
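
A quick way to test the "1 sample in a particular category" theory is to count the labels. The sketch assumes the sampler looks for `k_neighbors + 1` neighbours within each class (5 + 1 = 6 would match the error message, but that reading is an assumption). The helper name and the labels are made up.

```python
from collections import Counter


def classes_too_small_for_smote(y, k_neighbors=5):
    """Return {label: count} for classes with fewer than k_neighbors + 1 samples."""
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < k_neighbors + 1}


# Made-up labels: class "c" has a single sample.
y_example = ["a"] * 10 + ["b"] * 6 + ["c"] * 1
too_small = classes_too_small_for_smote(y_example)
```

Running the same check on `y_train` would show whether any of the classes is too rare to oversample this way.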

quartz monolith Jul 21, 2019, 8:38 PM

#

y_train = 4287 shape

#

X_train = (4287, 8)

#

Here is someone with a similar problem
https://stackoverflow.com/a/48820222/11811575
but I dont understand it...

Stack Overflow

SMOTE Value Error

I'm using SMOTE function for oversampling my sparse data set which contains around 98% 0s & 2% 1s.I used following code

from imblearn.over_sampling import SMOTE
import os
import pandas as pd

desert oar Jul 21, 2019, 8:41 PM

#

how many classes do you have

quartz monolith Jul 21, 2019, 8:42 PM

#

Label classes are 31

#

And feature 7

#

```
sm = SMOTENC(random_state=42, categorical_features=[X])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
```

got now `ValueError: cannot copy sequence with size 5717 to array axis with dimension 8`

desert oar Jul 21, 2019, 9:00 PM

#

how many of each class do you have @quartz monolith

#

i dont think categorical_features=[X] is right. is X a matrix?

#

i think you would need to use the column numbers instead

#

im not 100% sure

#

what library is this?

quartz monolith Jul 21, 2019, 9:01 PM

#

https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.SMOTENC.html

#

X is wrong i need to use array

#

```
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
x_res, y_res = sm.fit_resample(X, y)
```
0,7 should be right

#

My X_train.shape = (4287, 8)

#

By class you mean the number of my features?
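
On the column numbers: a list like `[0, 7]` names two columns (index 0 and index 7), not a range from 0 to 7. With seven categorical features, the list would need seven indices. Which columns are actually categorical is not stated in the chat, so the choice below is purely hypothetical.

```python
n_features = 8             # from X_train.shape == (4287, 8)
two_columns = [0, 7]       # what the snippet passes: only columns 0 and 7

# Hypothetical: suppose the first seven columns are the categorical ones.
seven_columns = list(range(7))
numeric_columns = [i for i in range(n_features) if i not in seven_columns]
```

"Classes" in the earlier question refers to the distinct values of `y_train` (the 31 labels), not to the feature columns.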

floral lodge Jul 22, 2019, 12:39 AM

#

@desert oar it does but I would have to buy a very expensive license to get the full scripting functionality which is why I wanted to create a gui bot in python for it

#

I actually found some corner detection and edge detection stuff in opencv that I'm going to look into

#

thanks!

void anvil Jul 22, 2019, 2:37 AM

#

Got another question on the take_action portion. Again, using the stock example code is coming from:

```
    # Buy amount % of balance in shares
    total_possible = self.balance / current_price
    shares_bought = total_possible * amount
    prev_cost = self.cost_basis * self.shares_held
    additional_cost = shares_bought * current_price
    self.balance -= additional_cost
    self.cost_basis = (prev_cost + additional_cost) / (
        self.shares_held + shares_bought)
    self.shares_held += shares_bought
```

So right here it's calculating the max amount it can buy in the "total_possible" line.

If we wanted to arbitrarily limit it to a max amount, say 10000 shares, we could change it to:
```total_possible = min(self.balance / current_price, 10000)```

Hopefully, over the course of time, the algorithm will "learn" that it can't place a buy bigger than 10000 at a time (and we wouldn't expect a larger output than that when the algo is finished training).

But what if we want to limit it to some amount dependent on the next time step, which the algorithm SHOULD NOT have access to at time t, because the assumption of an unlimited purchase of a stock or w/e doesn't make a lot of sense? For this example, we'll limit it to an arbitrary 10% of the volume of the next time period, vol_t+1.

```total_possible = min(self.balance / current_price, 0.1 * next_period_vol)```

By placing this restriction in the take_action space, is it actually being fed a bit of information from the future, or is this restriction put in the right place? Should the RL agent try to place a larger buy / sell than possible and have the action be restricted elsewhere so as not to pass "cheating" information back?
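
One possible way to frame it, offered as a sketch rather than an answer: the agent picks `amount` from the observation at time t only, and the environment applies the volume cap when it fills the order. Under that assumption the next-period volume never appears in the observation, and the agent only feels it indirectly through partial fills. The function name, the default `cap_fraction` and the numbers are all made up.

```python
def execute_buy(balance, current_price, amount, next_period_volume, cap_fraction=0.1):
    """Fill a buy order inside the environment's step logic.

    amount is the fraction of balance the agent asked to spend (0..1),
    chosen from the observation at time t only. The cap uses volume from
    t+1, which the environment knows but never puts into the observation.
    Returns (shares_filled, shares_requested).
    """
    shares_requested = (balance / current_price) * amount
    max_fill = cap_fraction * next_period_volume
    shares_filled = min(shares_requested, max_fill)
    return shares_filled, shares_requested


filled, requested = execute_buy(
    balance=100_000.0, current_price=50.0, amount=1.0, next_period_volume=5_000.0
)
```

Whether this counts as leaking future information depends on what else the agent sees; returning both values makes it easy to log how often orders get cut down.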

silent swan Jul 22, 2019, 2:41 AM

#

porting between pytorch and TF code is LOL

lapis sequoia Jul 22, 2019, 5:07 AM

#

Hey does anyone have a recommended textbook for Machine Learning/Data Science?

exotic cedar Jul 22, 2019, 6:32 AM

#

suggest starting with this one

lapis sequoia Jul 22, 2019, 6:32 AM

#

that's a very good pick

#

skip the tf, go straight to keras

quartz stream Jul 22, 2019, 7:13 AM

#

@exotic cedar Thanks for the Useful Link !

#

Really Appreciated !

#

💯

lean ledge Jul 22, 2019, 11:38 AM

#

@exotic cedar Piracy isn't allowed here

#

(@lyric canopy)

lyric canopy Jul 22, 2019, 11:40 AM

#

No, it's not

#

@exotic cedar Please don't share pirated works or discuss piracy on our server.

#

!rule 5

arctic wedgeBOT Jul 22, 2019, 11:40 AM

#

Rules

5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.

exotic cedar Jul 22, 2019, 1:18 PM

#

lol aight

lofty girder Jul 22, 2019, 1:23 PM

#

Question about pandas, does it count as a multi index if I use set_index([column1, column2]) or is that just a caveman mimicry of a multiindex?

silk forge Jul 22, 2019, 3:42 PM

#

okay so the red points are my y_pred and blue points are y_true

📎 unknown.png

#

this simple linear regr model come out well?

dense rose Jul 22, 2019, 3:50 PM

#

Is there good library for displaying graphs in Jupyter that you can modify and live update?

#

I mean like actual graph with nodes and edges.

silk forge Jul 22, 2019, 3:58 PM

#

matplotlib

earnest prawn Jul 22, 2019, 4:20 PM

#

fwiz that is not a graph with nodes and edges

#

he is talking about a graph in the cs definition

#

my goto for that would always be graphviz but idk about that in jupyter

#

@silk forge @dense rose

haughty wind Jul 22, 2019, 5:54 PM

#

Is there a function within Keras/TF that lets you add weights to the training data? Some of my data is higher quality and I'd like to give it more weight when fitting my NN. Currently I'm using ImageDataGenerator and flow_from_directory, if that helps any.

#

Ideally I'd want to assign different specific directories with higher/lower weights

crude bloom Jul 22, 2019, 6:42 PM

#

off the top of my head I don't know but you could always add repeats of the data during training @haughty wind

#

basically just copy the data you value more so that the model sees it multiple times per epoch
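
The repetition trick can be sketched without any framework: give each sample an integer weight and repeat it that many times in the list used for an epoch. The file names and weights below are hypothetical. Keras' `Model.fit` also takes a `sample_weight` argument, though how that combines with `flow_from_directory` is not something this sketch covers.

```python
def repeat_by_weight(samples, weights):
    """Repeat each sample weights[i] times (integer weights >= 0)."""
    repeated = []
    for sample, weight in zip(samples, weights):
        repeated.extend([sample] * weight)
    return repeated


# Hypothetical file names: the "clean" images count three times as much.
files = ["clean_01.png", "clean_02.png", "noisy_01.png"]
weights = [3, 3, 1]
epoch_files = repeat_by_weight(files, weights)
```

Shuffling `epoch_files` before batching would keep the repeated copies from ending up in the same batch.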

quartz monolith Jul 22, 2019, 7:13 PM

#

@silk forge have a look at ridge regression to understand the graph
https://www.youtube.com/watch?v=Q81RR3yKn30&t=785s

YouTube

StatQuest with Josh Starmer

Regularization Part 1: Ridge Regression

Ridge Regression is a neat little way to ensure you don't overfit your training data - essentially, you are desensitizing your model to the training data. It...


quartz monolith Jul 22, 2019, 7:54 PM

#

still cant figure out how to use smotenc on my data set 🤔 @desert oar

serene veldt Jul 22, 2019, 11:47 PM

#

Has anyone had success installing tensorflow_datasets on tf 2.0?
Always get the same error when importing
AttributeError: module 'tensorflow._api.v2.autograph.experimental' has no attribute 'do_not_convert'

lapis sequoia Jul 23, 2019, 5:56 AM

#

can you compare versions

#

where that attribute is originally from and whether it was taken out

silent swan Jul 23, 2019, 6:29 AM

#

I recommend PyTorch

#

it gud

mossy dragon Jul 23, 2019, 7:17 AM

#

yo where raggy

lean ledge Jul 23, 2019, 9:07 AM

#

Hm? @mossy dragon

mossy dragon Jul 23, 2019, 9:07 AM

#

yo

#

im going over my calc

#

do i need to review implicit differentiation?

lean ledge Jul 23, 2019, 9:09 AM

#

Doesn't really come up but good to have some confidence with manipulating differentials

supple ferry Jul 23, 2019, 11:31 AM

#

@lofty girder yes it creates a multi index

surreal nacelle Jul 23, 2019, 11:43 AM

#

Hey, do you think that this is good enough ?
I'm trying to do some 'data augmentation' by rotating each element of the dataset by -5 degree. The rotated image is a little noisy tho. Should I take the time to denoise it ? (and should I apply a stronger rotation to the images ?)

📎 rotation_test.png

lethal spade Jul 23, 2019, 12:19 PM

#

anyone here using pandas? I'm trying to run this code

#

📎 unknown.png

#

but it is not adding the extra columns (path, dist, init, control, meas_interm):

📎 unknown.png

#

I just have the ones I already had

lapis sequoia Jul 23, 2019, 12:48 PM

#

looks like a bunch of pickles

#

in a dataframe.. hands down the weirdest thing I've seen yet

lethal spade Jul 23, 2019, 12:52 PM

#

ahah, yeah, still have to update the name

quartz stream Jul 23, 2019, 1:18 PM

#

!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
!pip install googletrans

import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Output printing out first 5 rows
df.head()

from googletrans import Translator
translator = Translator()
df2 = pd.DataFrame()
df2 = pd.DataFrame(columns=['label', 'sms_message'])
for i,j in zip(df['sms_message'],df['label']):
  text = translator.translate(i, dest='hi')
  df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)

#

can anyone help me

#

the above code throws this error

#

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-26-059f1e8ecbf9> in <module>()
      4 df2 = pd.DataFrame(columns=['label', 'sms_message'])
      5 for i,j in zip(df['sms_message'],df['label']):
----> 6   text = translator.translate(i, dest='hi')
      7   df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)

6 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

#

The output of df2 is only till 530 value but the original file has more than 6000 values

lapis sequoia Jul 23, 2019, 1:36 PM

#

what are you trying to do

#

if it's a tab separated file, why don't you open it with builtins in pandas..

#

oh.. I just saw read_table..

#

not sure what you're trying to do here man..

#

what is df2

desert oar Jul 23, 2019, 2:17 PM

#

@lapis sequoia it looks like they're building up df2 by translating the contents of df

#

@quartz stream something is wrong with whatever is in i

#

you can try printing the data, or using %debug in ipython to investigate

quartz stream Jul 23, 2019, 3:08 PM

#

@desert oar

#

I tried printing i

#

its a standard dataset

#

it shows all the value

#

@desert oar Yes you guessed it correct I am trying to translate df into df2

#

Why dont you try the link I have given the data file also

surreal nacelle Jul 23, 2019, 3:09 PM

#

Any ideas why model.predict takes an abnormal amount of time to finish ? The model.fit takes a minute or so, and predict on 3% of the dataset takes 10

desert oar Jul 23, 2019, 3:10 PM

#

@surreal nacelle depends on the model. seems weird though

surreal nacelle Jul 23, 2019, 3:10 PM

#

KNN on mnist

desert oar Jul 23, 2019, 3:10 PM

#

oh

quartz stream Jul 23, 2019, 3:10 PM

#

@desert oar Any idea on my question

desert oar Jul 23, 2019, 3:11 PM

#

@surreal nacelle sklearn? KNeighborsClassifier?

surreal nacelle Jul 23, 2019, 3:11 PM

#

Yep

desert oar Jul 23, 2019, 3:11 PM

#

@quartz stream "normal dataset" doesn't really help. i'm not in a position to start downloading data files and debugging right now

#

the translator is expecting something different from what you gave it.. thats the best i can offer

#

im not familiar w/ that library

quartz stream Jul 23, 2019, 3:12 PM

#

Its 200kb

#

dataset

#

and the translator is working fine

#

for half the dataset

#

it is just not completing

#

so the code is fine

#

I translate and print every value

#

it does

#

I guess there is something wrong with adding values in database

#

i mean pandas*

desert oar Jul 23, 2019, 3:13 PM

#

so what is that error message from then

#

is it working, or is it not working?

quartz stream Jul 23, 2019, 3:13 PM

#

it is not

desert oar Jul 23, 2019, 3:13 PM

#

well look at the error

#

it's clearly related to translate and not pandas

#

there is a bad element in your data somewhere

#

@surreal nacelle what distance are you using?

surreal nacelle Jul 23, 2019, 3:15 PM

#

I'm using the default values for now, which is n_neighbors=5

desert oar Jul 23, 2019, 3:16 PM

#

so algorithm='auto'?

surreal nacelle Jul 23, 2019, 3:16 PM

#

Yes

desert oar Jul 23, 2019, 3:16 PM

#

and metric='minkowski'?

surreal nacelle Jul 23, 2019, 3:16 PM

#

Yes

#

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

desert oar Jul 23, 2019, 3:16 PM

#

can you print the ._fit_method attribute on the model

#

tree lookups should be fast

surreal nacelle Jul 23, 2019, 3:17 PM

#

Sure, gimme a second, need to refit

#

model._fit_method = kd_tree

desert oar Jul 23, 2019, 3:26 PM

#

hm. im not that well versed in the details of kd-trees, but the whole point is that tree lookups are supposed to be fast

#

are the results correct?

surreal nacelle Jul 23, 2019, 3:28 PM

#

It seems so

#

0.985

desert oar Jul 23, 2019, 3:30 PM

#

hm. could just be the way it is

surreal nacelle Jul 23, 2019, 3:31 PM

#

I guess

#

Thanks for the help anyway 😃

surreal nacelle Jul 23, 2019, 5:54 PM

#

@desert oar Some guy explained to me that knn doesn't train; the .fit simply gives it data so that it can use it to compare during the .predict phase. So it does make sense that it takes much longer to predict than to 'train' (no training)

silent swan Jul 23, 2019, 6:26 PM

#

KNNs are a nonparametric method, so it doesn't learn anything, so there's nothing to fit

desert oar Jul 23, 2019, 6:47 PM

#

@silent swan sklearn doesnt build a tree when you call fit()?

silent swan Jul 23, 2019, 6:57 PM

#

actually you're right it probably does do it then
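
A brute-force version shows why prediction is the expensive step: `fit` only stores the data, while `predict` compares each query against every stored point. This is an illustration only; `TinyKNN` is a made-up class, and scikit-learn's version may build a tree in `fit`, as discussed above, though tree lookups tend to lose their advantage with many features (MNIST has 784).

```python
from collections import Counter
import math


class TinyKNN:
    """Brute-force k-nearest-neighbours classifier (illustration only)."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" just stores the data; no parameters are estimated.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        # Every query is compared against every stored point.
        predictions = []
        for query in X:
            by_distance = sorted(
                zip(self.X_, self.y_), key=lambda pair: math.dist(query, pair[0])
            )
            nearest_labels = [label for _, label in by_distance[: self.n_neighbors]]
            predictions.append(Counter(nearest_labels).most_common(1)[0][0])
        return predictions


knn = TinyKNN(n_neighbors=3).fit(
    [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)], ["a", "a", "a", "b", "b", "b"]
)
labels = knn.predict([(0.2, 0.1), (5.5, 5.5)])
```

The work in `predict` grows with the number of queries times the number of stored points, which is why a fast `fit` followed by a slow `predict` is the expected pattern.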

lapis sequoia Jul 24, 2019, 7:24 AM

#

I need some help

#

I have a column that contains a list.. I want to split it and add them to new columns in the dataframe.. but, not all rows have equal number of items in the list

#

how should I approach this
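
One common approach, assuming the column really holds Python lists, is to pad every list to the longest length so each row ends up with the same number of columns. A plain-Python sketch with made-up data follows. As far as I know, `pd.DataFrame(df["col"].tolist(), index=df.index)` pads in a similar way, with `"col"` standing in for the real column name.

```python
def lists_to_columns(rows, fill=None):
    """Pad each list to the longest length so every row has the same columns."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


# Made-up column of lists with unequal lengths.
tags = [["a", "b", "c"], ["d"], []]
padded = lists_to_columns(tags)
```

Leaving the gaps as `None` keeps missing items distinguishable from real values.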

surreal nacelle Jul 24, 2019, 1:11 PM

#

Hey, how could I see the word instead of seeing the dictionary index?

```
vect = CountVectorizer(analyzer='word')
bag_of_words = vect.fit_transform(emails)
test_output = vect.transform(['email', 'test', 'hello'])
print(test_output)
```

```
(0, 15725) 1
(1, 51148) 1
(2, 55302) 1
```

desert oar Jul 24, 2019, 1:34 PM

#

@lapis sequoia is it already a list, or is it a string? and what's the max number of columns?

#

@void anvil i'd just use None personally

#

@surreal nacelle that's odd, my .transform returns a sparse matrix

#

but you would use vect.vocabulary_ to get a mapping from words to indices

#

oh weird. i didnt know sparse matrices acquired a fancy print method

#

yes that is a sparse matrix

#

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun'])
print(type(transf))
print(transf.shape)
print(len(v.vocabulary_))

#

id rather leave null values as null, and fill in later

#

just my style tho

surreal nacelle Jul 24, 2019, 1:58 PM

#

I got the matrix that way, but as you can see it contains the value of the word index in the dictionary instead of 0 and 1.

📎 test_mail_dic.png

desert oar Jul 24, 2019, 2:04 PM

#

the vocabulary is telling you that "greetings" is in column 21152

#

you can't really get the words back

#

that doesn't make sense

#

that's the whole point of vectorizing

#

words go in, numbers come out

surreal nacelle Jul 24, 2019, 2:05 PM

#

I understand that, but shouldn't the matrix contain the number of occurrences of each word instead of their indices in the vocabulary ?

desert oar Jul 24, 2019, 2:05 PM

#

it doesnt contain their indices

#

it contains 0s and 1s

#

well it contains more than that

#

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun to email with email', 'turtles dont use email but turtles like lettuce'])
print(transf.toarray())
print(v.vocabulary_)

surreal nacelle Jul 24, 2019, 2:06 PM

#

oh, so it does exactly what I want 😄

#

ohhh

#

the np.argmax returns the index

#

in the array

desert oar Jul 24, 2019, 2:07 PM

#

no

surreal nacelle Jul 24, 2019, 2:07 PM

#

not its content

desert oar Jul 24, 2019, 2:07 PM

#

err

#

yeah

#

but

#

email_counts = transf.toarray()[:, v.vocabulary_['email']]

#

you want to know the most frequent word in each document?

surreal nacelle Jul 24, 2019, 2:08 PM

#

For example yes

desert oar Jul 24, 2019, 2:10 PM

#

the vocabulary should have unique values as well as indices, so you can "invert" it

vocab_inverse = {val: key for key, val in v.vocabulary_.items()}
most_freq_word_per_document = [vocab_inverse[i] for i in transf.toarray().argmax(axis=1)]
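
The split between "vocabulary" (word to column index) and "matrix" (counts per column) is easier to see in a stripped-down version. This is a simplified stand-in, not how `CountVectorizer` is implemented; for example, the real default tokenizer drops one-letter words such as "i", and the function names here are made up.

```python
def fit_vocabulary(documents):
    """Map each distinct lowercase word to a column index (alphabetical order)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: column for column, word in enumerate(words)}


def count_vector(document, vocabulary):
    """Return one row of counts; words outside the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:
            row[vocabulary[word]] += 1
    return row


vocabulary = fit_vocabulary(["email is okay", "i like turtles"])
row = count_vector("email is fun to email with email", vocabulary)

# Invert the mapping to go from a column index back to its word.
inverse = {column: word for word, column in vocabulary.items()}
most_frequent = inverse[row.index(max(row))]
```

The indices live in the vocabulary and the counts live in the row, which is why the matrix itself never contains a word's index as a value.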

surreal nacelle Jul 24, 2019, 2:11 PM

#

thank you, once again 😄

crimson trellis Jul 24, 2019, 2:23 PM

#

Hey everyone, I'm new to ML, and got a task to predict sales for dataset1 using dataset2. Dataset2 has data for 4 weeks in advance, how would I go about doing this? Currently built a linear regression model for dataset2 with .94 rsquared

desert oar Jul 24, 2019, 2:26 PM

#

@crimson trellis seems like a good start. how many data points do you have?

crimson trellis Jul 24, 2019, 2:26 PM

#

2.7mil from dataset2

desert oar Jul 24, 2019, 2:26 PM

#

oh thats a lot

crimson trellis Jul 24, 2019, 2:26 PM

#

running it on bigquery

desert oar Jul 24, 2019, 2:27 PM

#

you can use a train/test split or cross validation, and measure the accuracy of your model

crimson trellis Jul 24, 2019, 2:27 PM

#

I graphed the linear regression predictions against the actual data and it was pretty spot on

desert oar Jul 24, 2019, 2:27 PM

#

yeah thats probably good enough

#

but you can use the holdout set to be sure

#

implementing CV in bigquery would be annoying

#

but you can reserve eg 1/4 of your data and not train on it

#

then do the prediction on it and measure accuracy

crimson trellis Jul 24, 2019, 2:29 PM

#

got it. I think I saw something like that

#

I'm stuck on figuring out how to use the model for a forecast though...

desert oar Jul 24, 2019, 2:29 PM

#

that's a bigquery specific question i'm afraid

#

and i wouldnt know the answer. maybe someone else does

crimson trellis Jul 24, 2019, 2:30 PM

#

gotcha. All the examples I've seen don't really have a date forecast, and instead do something like 'airline delays, taxi fares, etc'

desert oar Jul 24, 2019, 2:30 PM

#

if time is involved things are a bit more complicated

#

can you describe datasets1 and 2 in more detail

#

how they related to each other etc

crimson trellis Jul 24, 2019, 2:31 PM

#

it's just date/product/sales/units/store1 - dataset1
same thing for dataset2, but different store

#

I was thinking of using a coefficient store1/store2 and applying it to the model to predict store1 using store2 performance

desert oar Jul 24, 2019, 2:32 PM

#

what do you mean

#

what is store1? some kind of performance number?

crimson trellis Jul 24, 2019, 2:32 PM

#

no that's walmart

#

store2 is target

desert oar Jul 24, 2019, 2:34 PM

#

so you're predicting, e.g. walmart sales using store2 sales?

crimson trellis Jul 24, 2019, 2:36 PM

#

yep

#

because store2 has data coming in daily, and walmart has data every month instead

desert oar Jul 24, 2019, 2:37 PM

#

how are you running that regression then

crimson trellis Jul 24, 2019, 2:37 PM

#

so the model is trained using target only

#

and not entirely sure if this is the right way to do it, but I want to use it to predict walmart data based on (walmart sales last month/target sales last month)

desert oar Jul 24, 2019, 2:43 PM

#

so you basically took the monthly average of target sales, then predicted this month's walmart sales using last months' walmart-target ratio

crimson trellis Jul 24, 2019, 2:43 PM

#

that's what I'm thinking of doing

#

does it make sense? 🤔

desert oar Jul 24, 2019, 2:46 PM

#

hm. the math of linear regression won't like that, you're going to have non-independently distributed data

#

as for testing it, you can only forecast 1 month ahead at best

#

unless you start forecasting target as well

crimson trellis Jul 24, 2019, 2:48 PM

#

I do have about 3 years of data in that 2.7mil records

desert oar Jul 24, 2019, 2:49 PM

#

unfortunately that doesnt help

crimson trellis Jul 24, 2019, 2:49 PM

#

ah

desert oar Jul 24, 2019, 2:49 PM

#

it means your model can learn more, but it doesnt fix the fundamental issues

crimson trellis Jul 24, 2019, 2:49 PM

#

how would you approach this? I'd like to get it right without just throwing something up quickly

#

and that's how my coefficient thing feels

#

just a quick solution

desert oar Jul 24, 2019, 2:50 PM

#

model being valid or not, the way you would test it is by "sliding" over the data. say you have 24 months of data and you reserve the last 6 months for testing. then you train on the first 18 months and evaluate on the 19th month. then you train on the first 19 months and evaluate on the 20th. and so on until you run out of months. and then you can do mean square error of all the evaluation points
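
The "sliding" evaluation described above can be sketched as follows, using 0-based month indices. The function names are made up, and `forecast` stands for whatever model is being tested: it receives the training history and returns a prediction for the next month.

```python
def walk_forward_splits(n_periods, n_test):
    """Yield (train_periods, test_period) with an expanding training window."""
    first_test = n_periods - n_test
    for test_period in range(first_test, n_periods):
        yield list(range(test_period)), test_period


def walk_forward_mse(series, n_test, forecast):
    """Mean squared error over all one-step-ahead evaluation points."""
    errors = []
    for train_periods, test_period in walk_forward_splits(len(series), n_test):
        prediction = forecast([series[i] for i in train_periods])
        errors.append((series[test_period] - prediction) ** 2)
    return sum(errors) / len(errors)


# 24 months with the last 6 reserved: the first split trains on months 0..17
# and evaluates on month 18 (the 19th month).
splits = list(walk_forward_splits(n_periods=24, n_test=6))
```

Because each training window ends before its test month, no prediction can use data from its own future.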

crimson trellis Jul 24, 2019, 2:59 PM

#

Ok, I get that. Then using the model, I would use it to predict future target sales

desert oar Jul 24, 2019, 3:00 PM

#

wait what

#

you would be predicting walmart sales

crimson trellis Jul 24, 2019, 3:00 PM

#

ah ok, I lost track

desert oar Jul 24, 2019, 3:00 PM

#

but again you can only predict 1 month in advance

#

because you need last month's target sales

crimson trellis Jul 24, 2019, 3:00 PM

#

yes

#

ok

desert oar Jul 24, 2019, 3:00 PM

#

and also this model is likely to have other issues

#

due to the fact that you're violating the iid assumption

crimson trellis Jul 24, 2019, 3:01 PM

#

I mean, it is what it is at the moment

#

I can use things like # of walmart stores, and # of different products sold at each location

desert oar Jul 24, 2019, 3:01 PM

#

actually wait. it might be okay w/ least squares actually

#

yeah, you know what? this should be fine

#

just make sure youre using the testing strategy i described

#

otherwise you will be "cheating" and using future data to predict past data

#

which inflates your accuracy

crimson trellis Jul 24, 2019, 3:03 PM

#

yep

#

time to figure out how to do this now. thank you 😃

desert oar Jul 24, 2019, 3:04 PM

#

good luck

lapis sequoia Jul 24, 2019, 3:33 PM

#

yo i need help with data cleaning

#

@desert oar are you available?

desert oar Jul 24, 2019, 3:34 PM

#

for a little bit longer yeah

lapis sequoia Jul 24, 2019, 3:36 PM

#

im cleaning lyrics scraped from genius

#

I wanna remove text like "verse 1" "intro" "chorus" and brackets/punctuations
fortunately I found this python script that seems to do the job well

#

but

#

well here's the original data

#

📎 Screen_Shot_2019-07-24_at_10.29.12_PM.png

#

my codes: https://hastebin.com/azuwexadop.py

#

data output after applying the cleaning function:

desert oar Jul 24, 2019, 3:37 PM

#

sorry, i cant help with that

lapis sequoia Jul 24, 2019, 3:37 PM

#

📎 Screen_Shot_2019-07-24_at_10.31.21_PM.png

desert oar Jul 24, 2019, 3:37 PM

#

scraping is against their TOS

#

and its against the rules for us to help with TOS violations

lapis sequoia Jul 24, 2019, 3:37 PM

#

whose TOS?

desert oar Jul 24, 2019, 3:37 PM

#

!rule 5

arctic wedgeBOT Jul 24, 2019, 3:37 PM

#

Rules

5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.

desert oar Jul 24, 2019, 3:37 PM

#

genius

lapis sequoia Jul 24, 2019, 3:37 PM

#

oh really?

desert oar Jul 24, 2019, 3:38 PM

#

https://genius.com/static/terms

Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods. Any use of the Service or the Genius Content other than as specifically authorized herein is strictly prohibited. As between you and Genius, the technology and software underlying the Service or distributed in connection therewith is the exclusive property of Genius, our affiliates and our partners (the "Software"). You agree not to copy, modify, create a derivative work of, reverse engineer, reverse assemble or otherwise attempt to discover any source code, sell, assign, sublicense, or otherwise transfer any right in the Software. Any rights not expressly granted herein are reserved by Genius.

#

sorry

lapis sequoia Jul 24, 2019, 3:38 PM

#

aight thank you

surreal nacelle Jul 24, 2019, 3:45 PM

#

I'm trying to figure out if having a dataframe with 60k columns (1 for each word in the dictionary) is ok ? @desert oar

#

It seems a little much

desert oar Jul 24, 2019, 3:45 PM

#

Why do you want that

#

I mean, I get why you might want that for convenience?

#

Can put sparse data into a data frame

surreal nacelle Jul 24, 2019, 3:46 PM

#

It's the flattened sparse matrix to feed the algorithm

desert oar Jul 24, 2019, 3:46 PM

#

Why do you need a data frame

#

Most sklearn models accept sparse data

surreal nacelle Jul 24, 2019, 3:47 PM

#

I see

#

I'll try that then

desert oar Jul 24, 2019, 3:47 PM

#

What model are you trying to use in this particular case

surreal nacelle Jul 24, 2019, 3:47 PM

#

no ideas yet

desert oar Jul 24, 2019, 3:47 PM

#

Usually you need to convert from data frame to matrix anyway
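
A small sketch of the point about sparse input, assuming a bag-of-words setup like the one described. The toy documents, the labels and the choice of `LogisticRegression` are made up for illustration; no model was named in the conversation.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills buy now", "meeting at noon tomorrow",
        "buy cheap watches now", "lunch meeting tomorrow"]
labels = [1, 0, 1, 0]

# fit_transform returns a scipy sparse matrix with one column per vocabulary word
X = CountVectorizer().fit_transform(docs)

# the estimator is fitted on the sparse matrix directly, no DataFrame needed
clf = LogisticRegression().fit(X, labels)
print(type(X), X.shape, clf.predict(X[:1]))
```

With a 60k-word vocabulary most entries are zero, so the sparse form stores only the non-zero counts.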

surreal nacelle Jul 24, 2019, 3:47 PM

#

gonna try a bunch

silent swan Jul 24, 2019, 5:09 PM

#

man tensorflow is black magic built on black magic

desert oar Jul 24, 2019, 5:10 PM

#

yes

#

i dont know why they dont just give me a damn api to construct a graph manually

#

instead of all this as_default stuff

earnest prawn Jul 24, 2019, 5:11 PM

#

at that point you might as well just manually play with numpy stuff

desert oar Jul 24, 2019, 5:12 PM

#

except not at all? the whole point of TF is that you're constructing a differentiable graph

#

and that gets sent back down to the C++ framework for processing

#

its just the python API is extremely confusing and the documentation is unclear

#

(to me)

silent swan Jul 24, 2019, 5:14 PM

#

my experience so far with TF is

#

there're tons of ways to do the same thing

#

so it's actually easy to write code

#

but it's hell to read others' because you have no idea what their workflow is

#

the documentation problem is compounded by how often the API / "best practice" changes

#

like even running the official model code gives you a ton of deprecation warnings

earnest prawn Jul 24, 2019, 5:16 PM

#

thats just the nature of huge code bases really

silent swan Jul 24, 2019, 5:17 PM

#

pytorch is great though, everybody get on the pytorch train

earnest prawn Jul 24, 2019, 5:17 PM

#

I have never actually done useful things with pytorch but what Ive seen looks good

desert oar Jul 24, 2019, 5:19 PM

#

im hoping things stabilize in TF after 2.0

silent swan Jul 24, 2019, 5:19 PM

#

it's very pythonic. about the only "surprising"/obscured thing is that gradients are stored in state, and sometimes the dataloaders hide crazy stuff from you

#

otherwise the code does about exactly what you think it does when you read it

earnest prawn Jul 24, 2019, 5:20 PM

#

nah its just gonna be like the internal APIs of linux which are never guaranteed to be stable and allowed to be subject of change every commit @desert oar

desert oar Jul 24, 2019, 5:32 PM

#

i hope not

#

every time i load a model from a checkpoint i feel like im doing something wrong

lunar leaf Jul 24, 2019, 6:13 PM

#

embrace keras

sterile remnant Jul 24, 2019, 6:44 PM

#

import pandas as pd

#

import matplotlab.pyplot as plt

#

import numpy as np

#

data = pd.read_csv('pornhub.csv')

#

print(data.dtypes)

#

print(data.index())

#

print(data['pornstar'].unique)

#

data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']

#

plt.show()

desert oar Jul 24, 2019, 6:51 PM

#

if you want to enter code, @sterile remnant , it will be easier to read if you use code block formatting

#

!codeblock

arctic wedgeBOT Jul 24, 2019, 6:51 PM

#

codeblock

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')

sterile remnant Jul 24, 2019, 6:54 PM

#

thanks for your help bro @desert oar it really help

#

'''import pandas as pd
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']
plt.show()
'''

desert oar Jul 24, 2019, 6:54 PM

#

no problem. did you have a question you wanted to ask?

#

use ` not '

#

on an american keyboard it's on the same key as ~, not sure what keyboard you have

sterile remnant Jul 24, 2019, 6:54 PM

#

ok gotcha bro

#

import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y =  ''satisfaction" , color = '' fapability '']
plt.show()

#

thanks

desert oar Jul 24, 2019, 6:55 PM

#

👍

#

also i assume you mean to write

data.plot(x="di**k_size", y = "satisfaction", color = "fapability")

#

right now you have doubled ' and [ instead of (

#

great variable name btw

sterile remnant Jul 24, 2019, 6:56 PM

#

bro i wanna do some scikit stuff but i am finding a really hard time to do so help me out please

#

yeah i was focusing more on my variables instead of syntax

#

lol

desert oar Jul 24, 2019, 6:58 PM

#

hard to say what help you need... do you have a specific objective in mind?

sterile remnant Jul 24, 2019, 7:01 PM

#

yep like i got some project on supervised learning, i have seen its tut but it all went over my head and i want to learn it, do u hv any suggestions hw i can wrap my head around that?

#

like some link or something else anyone

desert oar Jul 24, 2019, 7:16 PM

#

@sterile remnant what is your level of programming and math knowledge?

#

i'd start by maybe working through a beginner book or online course

sterile remnant Jul 24, 2019, 7:19 PM

#

@desert oar bro i am also noob to this one and i have been handed over with that project i was looking out for some stuff to get some knowledge about it .

desert oar Jul 24, 2019, 7:20 PM

#

its hard to help without more context

sterile remnant Jul 24, 2019, 7:20 PM

#

yep so it is but i have looked into it

#

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print(knn.score(X_test, y_test))

#

like this code gives u prediction about what class the given data belongs to

#

and like this bit and pieces i been handling my shit

#

though thanks @desert oar for ur consideration

quartz monolith Jul 24, 2019, 8:35 PM

#

we want to see the graph

desert oar Jul 24, 2019, 8:37 PM

#

i wanna know where this data came from tbh

#

i really regret not applying for a job at pornhub when i had the chance

unkempt helm Jul 24, 2019, 8:37 PM

#

👀

quartz monolith Jul 24, 2019, 8:38 PM

#

i see you mean data hub 🤔

desert oar Jul 24, 2019, 8:38 PM

#

i told my mother about it and she was not happy with the idea. it was in montreal too

#

would have loved to have an excuse to move to montreal

quartz monolith Jul 24, 2019, 8:40 PM

#

🍆 data wrangle

desert oar Jul 24, 2019, 8:43 PM

#

i wondered about that

#

ive heard mixed things about "porn tech"

#

maybe they still have data science jobs

quartz monolith Jul 24, 2019, 8:49 PM

#

yeah technology magazines makes me stimulated

#

especially MIT Technology Review

#

somebody ever worked with word2vec?

desert oar Jul 24, 2019, 8:51 PM

#

i have yeah

#

did you ever figure out your sampling issue btw

quartz monolith Jul 24, 2019, 8:56 PM

#

no 😦 i dont know why but we figured out something important: some errors have more labels. The classification model cant work. I want to build a corpus and interview the experts on how the text vector predicts words and use it as a sentence classifier later to create new keywords which will be the new label of the classification

#

my word2vec corpus with skipgram

📎 5ijHmiFwrw8Y2xEXjmcIUgEAAAAAAOAk3DIJAAAAAAAARyEQAwAAAAAAgKMQiAEAAAAAAMBRCMQAAAAAAADgKARiAAAAAAAAcBQC.png

#

tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=5000, random_state=2300)

#

salt whats the best way to predict a sentence with w2v?

desert oar Jul 24, 2019, 9:07 PM

#

what do you mean the classification model cant work?

#

the way you'd use a word embedding like w2v is you'd generate a word vector for each word in the sentence, then average them to get a vector for the whole sentence, then feed that into a classifier as your feature vector
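
The averaging step described above can be sketched like this. The three-dimensional `embeddings` table is a hypothetical stand-in for a trained word2vec model, and `sentence_vector` is an illustrative helper name, not a gensim function.

```python
import numpy as np

# hypothetical toy embedding table standing in for a trained word2vec model
embeddings = {
    "printer": np.array([1.0, 0.0, 0.0]),
    "paper":   np.array([0.0, 1.0, 0.0]),
    "jam":     np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence, table, dim=3):
    """Average the vectors of the known words; zero vector if none are known."""
    vectors = [table[w] for w in sentence.lower().split() if w in table]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# "again" is not in the table, so it is skipped
vec = sentence_vector("Printer paper jam again", embeddings)
print(vec)
```

The resulting fixed-length vector is what would be passed to a classifier as the feature row for that sentence.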

quartz monolith Jul 24, 2019, 9:09 PM

#

yes i want to build feature vectors for the troubleshooting text classification and after that use them as a label

#

the errors are connected with too many of the same label. The model can't predict the right outcome or has too many outcomes, which makes it really hard

desert oar Jul 24, 2019, 9:10 PM

#

have you done a cross tab of errors and labels

#

like with pd.crosstab(errors, labels) and then plotted it using plt.imshow
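
A small sketch of the cross tab suggested here. The `errors` and `labels` values are invented, since the real data is not shown.

```python
import pandas as pd

# hypothetical error codes and the labels attached to them
errors = pd.Series(["E1", "E1", "E2", "E2", "E2", "E3"], name="error")
labels = pd.Series(["net", "disk", "net", "net", "cpu", "disk"], name="label")

# rows are error codes, columns are labels, cells count co-occurrences
table = pd.crosstab(errors, labels)
print(table)
```

Passing `table.values` to `plt.imshow` would then show as a heat map which errors spread over several labels.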

quartz monolith Jul 24, 2019, 9:10 PM

#

I did it with confusion matrix

desert oar Jul 24, 2019, 9:10 PM

#

yeah

#

that would work i guess

quartz monolith Jul 24, 2019, 9:11 PM

#

i had a 130k data set and 51%. I used a decision tree to understand how it splits the data and why the model is misleading the prediction, to optimize the knowledge base

#

another problem was whether certain rows were NMAR (laziness) or just not-relevant missing values

desert oar Jul 24, 2019, 9:15 PM

#

ah

#

there are probably ways around that, but

quartz monolith Jul 24, 2019, 9:29 PM

#

hm

vale arrow Jul 24, 2019, 9:30 PM

#

could use some assistance in sklearn if this is the applicable channel

quartz monolith Jul 24, 2019, 9:32 PM

#

Yes?

vale arrow Jul 24, 2019, 9:34 PM

#

so I'm trying to do kfold cross validation with cross_val_score

#

and it was working just fine until today

#

and now when i run it I get like a hundred lines of traceback

#

and I have no idea what any of it means

#

"Fatal Python error: initfsencoding: unable to load the file system codec" is the first line

quartz monolith Jul 24, 2019, 9:36 PM

#

what does your data look like?

vale arrow Jul 24, 2019, 9:36 PM

#

as in shape?

quartz monolith Jul 24, 2019, 9:38 PM

#

i found something about pyinstaller with sklearn

#

or this
https://stackoverflow.com/questions/55357451/fatal-python-error-initfsencoding-unable-to-get-the-locale-encoding-file-cm

Stack Overflow

Fatal Python error: initfsencoding: Unable to get the locale encod...

I am writing a job submission script for SLURM workload manager. First, I have loaded anaconda2/4.5.12 (including python 2.7) module. Then, I have created conda environment with Python3.7 version. ...

vale arrow Jul 24, 2019, 9:40 PM

#

next to nothing on that page makes any kind of sense to me

desert oar Jul 24, 2019, 10:03 PM

#

@vale arrow what code are you running?

#

and how are you running it?

vale arrow Jul 24, 2019, 10:03 PM

#

Give me a second. Im reinstalling my ide just to see if that fixes something

#

now i literally just can't install packages

#

it's like every time i touch something relatively new that i need to install the entire program shatters

quartz monolith Jul 24, 2019, 10:10 PM

#

what about your interpreter?

vale arrow Jul 24, 2019, 10:16 PM

#

yea this just didn't do anything

#

so here is my code

#

i just did

#

now i just can't send my code

#

just fucking kill me

#

kFoldScores = scoreModel(xTrain = XTrain, yTrain = yTrain)
this is the line that messes everything up^

#

This is the function it's calling:

def scoreModel(xTrain, yTrain):
    checkingNetwork = KerasRegressor(build_fn = buildNetwork, batch_size = 10, epochs = 100)
    accuracies = cross_val_score(estimator = checkingNetwork, X = xTrain, y = yTrain, cv = 5, n_jobs = -1)
    #print("Standardized: %.2f (%.2f) MSE" % (accuracies.mean(), accuracies.std()))
    return accuracies

#

Data I'm using

📎 housing.csv

#

and it's like i get different errors everytime

#

if i take that function out and just do k fold manually everything is fine

#

but i don't understand what's wrong with that function

#

okay I think I've fixed it. it works if i just specify one cpu to use instead of all and i don't get it but holy crap that was a headache
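
For reference, a self-contained sketch of the same call pattern with `n_jobs=1`. It uses synthetic data and a plain `Ridge` regressor in place of the Keras network and the housing file, so it shows only the `cross_val_score` usage, not the cause of the original crash.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# n_jobs=1 runs the folds one after another in the current process;
# n_jobs=-1 would run them in parallel worker processes instead
scores = cross_val_score(Ridge(), X, y, cv=5,
                         scoring="neg_mean_squared_error", n_jobs=1)
print("MSE: %.3f (%.3f)" % (-scores.mean(), scores.std()))
```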

quartz monolith Jul 24, 2019, 10:41 PM

#

good to know

quartz monolith Jul 24, 2019, 10:59 PM

#

fastTEXT 🚀

desert oar Jul 24, 2019, 11:13 PM

#

@void anvil yeah ive heard ray and modin are not anywhere near ready for production

#

its kind of baffling that companies arent willing to pick up and put money towards these kinds of projects

grizzled folio Jul 24, 2019, 11:46 PM

#

Hey all, I'm kind of overthinking myself into a hole here... I have for example hourly (2.8e-4 Hz) data for a week, and I want to filter out sub-inertial frequencies (anything below 5e-5 Hz). There's a lot of stuff about IIR and FIR, windows, filter order, how to apply the filter, but this all seems like overkill? I'm just looking for something fairly simple!

lean ledge Jul 24, 2019, 11:47 PM

#

@grizzled folio apply a scipy Butterworth filter with that cutoff frequency

#

That's as simple as it gets

grizzled folio Jul 24, 2019, 11:52 PM

#

@lean ledge cool, that's a handy pointer! I still need to provide the order of the filter though, I don't know how to choose that (unless I use buttord?)

lean ledge Jul 24, 2019, 11:55 PM

#

As a bit of nuance, you can't remove the filtered frequency completely. Higher order removes things more. Look at the Bode plot for the filter you construct to see the gain at different frequencies

#

📎 Screenshot_20190725-095643_Chrome.jpg

#

Gain is in dB which is a log scale

grizzled folio Jul 24, 2019, 11:58 PM

#

Aha, that's handy. So why wouldn't you just crank the order way up? Increased computation? Artifacts?

lean ledge Jul 25, 2019, 12:02 AM

#

Instability/Artifacts. Computation scales linearly too. Importantly, people working on signal processing are generally also often doing it on hardware. Higher order = more components, more cost, more board space etc

#

You can increase the order to an extent. After that you should come up with different strategies if you need sharper rolloff

#

Eg you can trade the smooth region of a Butterworth filter for one with more ripples to get a sharper rolloff (eg an elliptic or Chebyshev filter)
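
A minimal sketch of the Butterworth suggestion for hourly data. The order of 4, the synthetic two-tone signal and the zero-phase `sosfiltfilt` application are assumptions chosen for illustration; only the sampling rate and the 5e-5 Hz cutoff come from the question.

```python
import numpy as np
from scipy import signal

fs_hz = 1.0 / 3600.0            # one sample per hour
cutoff_hz = 5e-5                # remove everything slower than this
t = np.arange(7 * 24) * 3600.0  # one week of hourly timestamps, in seconds

slow = np.sin(2 * np.pi * 1e-5 * t)        # below the cutoff, should be removed
fast = 0.5 * np.sin(2 * np.pi * 1e-4 * t)  # above the cutoff, should be kept
x = slow + fast

# 4th-order high-pass Butterworth, returned as second-order sections
sos = signal.butter(4, cutoff_hz, btype="highpass", fs=fs_hz, output="sos")

# forward-backward filtering gives zero phase shift
filtered = signal.sosfiltfilt(sos, x)
print(np.std(x), np.std(filtered))
```

Raising the first argument of `butter` sharpens the rolloff at the cost discussed above; plotting the response from `signal.sosfreqz(sos, fs=fs_hz)` is one way to check the gain before settling on an order.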

grizzled folio Jul 25, 2019, 12:05 AM

#

Cool, this is at least a starting point. Computational cost may be a factor since I'm filtering a few tens of millions of timeseries. I'll see how things look with Butterworth and if that's working for me. Thanks!

#

I appreciate the real practical condensation of different types into their effects, I was really struggling to find any literature that didn't immediately get super technical about it

lean ledge Jul 25, 2019, 12:08 AM

#

Yeah signal processing is a rough subject to learn for someone who didn't learn it as part of formal electrical engineering education

grizzled folio Jul 25, 2019, 12:08 AM

#

I got some of it in applied math courses, but never got the chance to apply it

lean ledge Jul 25, 2019, 12:08 AM

#

There's a lot of nuance to signal processing so it's hard to simplify stuff down and ignore some very technical aspects

#

I don't think maths generally goes over signal processing

#

Except maybe at a grad level

grizzled folio Jul 25, 2019, 12:10 AM

#

Well, I think it was that course! It was very much on the applied/computational side of things

lean ledge Jul 25, 2019, 12:11 AM

#

This is something you'd learn in electrical engineering. Would be very surprised if other people are learning it much

grizzled folio Jul 25, 2019, 12:11 AM

#

I definitely remember talking about IIR systems, and probably designing them... So I could probably understand the nuance if I wanted to get into it, but for the moment I just need something high-level that works and I can build upon

lean ledge Jul 25, 2019, 12:11 AM

#

Huh

grizzled folio Jul 25, 2019, 12:13 AM

#

Aha, I think it was something like "Scientific and Industrial Modelling"

#

Anyway, that was quite a while ago 😉

earnest prawn Jul 25, 2019, 2:20 AM

#

@void anvil
https://github.com/danaugrs/huskarl
https://github.com/heronsystems/adeptRL
https://github.com/tensorflow/agents

there are certainly some

GitHub

danaugrs/huskarl

Deep Reinforcement Learning Framework + Algorithms - danaugrs/huskarl

GitHub

heronsystems/adeptRL

Reinforcement learning framework to accelerate research - heronsystems/adeptRL

GitHub

tensorflow/agents

TF-Agents is a library for Reinforcement Learning in TensorFlow - tensorflow/agents

#

maturity is eh

#

arguable

#

i think its more about the distributed approach for adept

#

why is the name schulman so familiar to me

#

if you want something in oss contribute it lol

lapis sequoia Jul 25, 2019, 2:39 AM

#

@desert oar it's a list of objects, I have to get an item from each object.. the max number of columns can be 6

desert oar Jul 25, 2019, 3:10 AM

#

@void anvil for future reference i know 0 about RL

#

@lapis sequoia what was that in reference to again? ping me in the AM

silent swan Jul 25, 2019, 3:34 AM

#

great, a tensorflow version from less than a year ago breaks on 3.7

lapis sequoia Jul 25, 2019, 3:51 AM

#

oops.. sorry.. @desert oar I have this column in my dataframe, each row has a list..

#

the list is a bunch of objects I can pass through a function... the number of items in that list can be max of 6..

#

or the list can be empty.. I want to split this column into multiple columns based on this
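
One possible way to do this split, sketched on a made-up frame. The column names `id` and `items` and the `item_` prefix are hypothetical, and the lists hold plain strings here rather than the objects described.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "items": [["a", "b"], [], ["c", "d", "e"]]})
max_items = 6

# one column per list position; shorter or empty lists are padded with missing values
wide = pd.DataFrame(df["items"].tolist(), index=df.index)

# force exactly max_items columns even if no row uses all of them
wide = wide.reindex(columns=range(max_items)).add_prefix("item_")

result = df.drop(columns="items").join(wide)
print(result)
```

If each object first has to go through a function, applying it inside the lists before this step (for example with a list comprehension per row) would keep the rest unchanged.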

#

thanks.. let me try:)

#

btw if you need any help on RL you can ask.. but I can't really point you to an implementation

silent swan Jul 25, 2019, 4:54 AM

#

I wonder if I could do a freelance gig where I replicate papers or port things between TF and pytorch

lean ledge Jul 25, 2019, 6:00 AM

#

Actually sounds like possible freelancing

sterile remnant Jul 25, 2019, 7:35 AM

#

arr = np.array([10,20,30,40,50])
for i in array:
    print(arr[i])

#

help me this code i wanna print this array out?

#

but its flashisng an error

grizzled folio Jul 25, 2019, 7:39 AM

#

@sterile remnant for i in array... you don't have any variables called that

#

And secondly, each i will be an element of the array, so you'll be trying to index a 5-element array with 10, 20, etc., which aren't valid

sterile remnant Jul 25, 2019, 7:48 AM

#

so hw shd i do it?

supple ferry Jul 25, 2019, 8:23 AM

#

@void anvil , you can search for algos + papers + codes here:
https://paperswithcode.com/search?q=DQN

Papers With Code : Search for DQN

10 search results

#

@sterile remnant , if your array is 1D, e.g a vector, you can loop through it element by element. If your array has a shape of 2D or greater, by default looping will happen on the first dimension. for a 2D array it will be on rows.
I dont know your reason behind printing the array the way you do it above, but you can do:

a = np.array([10,20,30,40,50])

for el in a:
    print(el)
# this will give you every element of the array because it is a vector

b = np.random.random((2, 5))

for row in b:
    print(row)

# this will give you every row of that array because it is a 2D

surreal nacelle Jul 25, 2019, 8:43 AM

#

Hey, what are my options for removing non-words in my 'corpus'?
Is there a way to keep all the words that are somewhat similar to something from the english dictionary and remove words that aren't ?
Example : 'redhat' is not a word, but I still want to keep it, however 'asdjhasgdja' must go.

#

also, would removing all 1 occurrence words be a bad idea ?
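
A quick sketch of what dropping single-occurrence words looks like, using only the standard library. The tokenised `docs` and the `min_count` threshold are invented for illustration.

```python
from collections import Counter

docs = [["redhat", "server", "crash"],
        ["server", "asdjhasgdja", "crash"],
        ["redhat", "update"]]

# count every token across the whole corpus
counts = Counter(token for doc in docs for token in doc)

min_count = 2
filtered = [[t for t in doc if counts[t] >= min_count] for doc in docs]
print(filtered)
```

The keyboard-mash token is removed, but so is "update", a real word that happened to appear once. Whether that loss is acceptable depends on the corpus size.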

supple ferry Jul 25, 2019, 8:47 AM

#

redhat should be considered as a company name by language models

surreal nacelle Jul 25, 2019, 8:48 AM

#

What do you mean ?

supple ferry Jul 25, 2019, 8:50 AM

#

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")

#

http://www.velvetcache.org/2010/03/01/looking-up-words-in-a-dictionary-using-python

John Hobbs on coding, Omaha, and life in general

John Hobbs

Looking up words in a Dictionary using Python

surreal nacelle Jul 25, 2019, 8:52 AM

#

Oh nice, so this would keep 'sexyredhead' and remove 'sff' ?

#

📎 randomwords.png

supple ferry Jul 25, 2019, 8:54 AM

#

I havent tried it unfortunately, can not tell how well it works :)

surreal nacelle Jul 25, 2019, 8:54 AM

#

Well, I'll give it a shot and keep you updated

granite sierra Jul 25, 2019, 8:59 AM

#

Is room free?

#

might be a bit far fetched question.

Is there a way to store the loaded matlab file data in variables?

I already know about scipy.io.loadmat(), but I mean after it's been loaded

supple ferry Jul 25, 2019, 9:04 AM

#

@granite sierra , according to the docs for scipy.io.loadmat():

Returns
mat_dict : dict
    dictionary with variable names as keys, and loaded matrices as values.

#

you can just assign a variable by indexing that dictionary

granite sierra Jul 25, 2019, 9:05 AM

#

ok

#

by indexing you mean blah.get(key)

supple ferry Jul 25, 2019, 9:06 AM

#

or blah["key"]

#

you will probably have to transform matrices to numpy arrays if it is not done automatically

#

otherwise, you are safe

granite sierra Jul 25, 2019, 9:08 AM

#

converting it to an array is literally just this?

a = scipy.io.loadmat('test.mat')
np.array(a)

supple ferry Jul 25, 2019, 9:08 AM

#

no no

#

a is now a dictionary with variables as keys and matrices as values

#

lets say you have a variable foo in that matlab file

granite sierra Jul 25, 2019, 9:09 AM

#

sure

supple ferry Jul 25, 2019, 9:09 AM

#

you can access it now a["foo"] which will return you the matrix

#

np.asanyarray(a["foo"]) i think

#

or np.array

#

depending on use case

granite sierra Jul 25, 2019, 9:11 AM

#

I'll be honest

#

wait

#

this is how it's returning it

#

{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
      dtype=[('test', 'O'), ('temp_data', 'O')])}

#

a = scipy.io.loadmat('test.mat')
print(a)

supple ferry Jul 25, 2019, 9:14 AM

#

yes, you can now access your matrices. a["testfile"] will return you array([[(array([[ 5, 10, 15, 20]])

#

I have never used matlab, so, i may be mistaken 😃

granite sierra Jul 25, 2019, 9:14 AM

#

ok let me test haha

#

well it did exactly that

#

how do I store the variables now?

#

this is what it returned

[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]

#

also why does it return it with double [[]]

#

like how do I access the first list, the 5, 10, 15, 20

supple ferry Jul 25, 2019, 9:24 AM

#

you can now do c = np.asanyarray(a["foo"])

#

something like this

#

https://stackoverflow.com/questions/15788290/converting-a-matrix-created-with-matlab-to-numpy-array-with-a-similar-syntax

Stack Overflow

Converting a matrix created with MATLAB to Numpy array with a simi...

I'm playing with the code snippets of the course I'm taking which is originally written in MATLAB. I use Python and convert these matrices to Python for the toy examples. For example, for the follo...

granite sierra Jul 25, 2019, 9:25 AM

#

huh?

#

holy thats going to get messy with lots of data

supple ferry Jul 25, 2019, 9:25 AM

#

matlab is messy 😄

granite sierra Jul 25, 2019, 9:27 AM

#

also I think that code is outdated, numpy has no strip function now

#

ok so I did this

#

b = np.squeeze(np.asarray(a['testfile']))

it returned this

(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))

which is obviously just a tuple with a tuple of lists inside, is there anyway to access the first nested tuple?

#

tuples aren't mutable, are they, so there is no way to access by 'index'
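
A sketch of how the nested values can be reached. What `loadmat` returned above is a NumPy structured array (note the `dtype=[('test', 'O'), ('temp_data', 'O')]`), so the fields can be read by name. The array is rebuilt by hand here so the example runs without the `.mat` file; that reconstruction is an assumption based on the printed output.

```python
import numpy as np

# rebuild the structure shown in the printed loadmat output
record_dtype = [("test", "O"), ("temp_data", "O")]
testfile = np.empty((1, 1), dtype=record_dtype)
testfile["test"][0, 0] = np.array([[5, 10, 15, 20]])
testfile["temp_data"][0, 0] = np.array([[1, 2, 3, 4]])

record = testfile[0, 0]                  # the single struct element

test = record["test"].ravel()            # access by field name, flattened to 1-D
temp_data = record["temp_data"].ravel()
print(test, temp_data)

# positional access also works: being immutable does not prevent indexing
first_by_position = record[0]
```

`scipy.io.loadmat` also accepts `squeeze_me` and `struct_as_record` arguments that change how structs are returned. They are not used here and may be worth trying on the real file.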

granite sierra Jul 25, 2019, 11:32 AM

#

anybody?

hollow latch Jul 25, 2019, 1:21 PM

#

Hi, I have a neural network saved in ONNX format (made with matlab), do you know how to run it in python? Keras doesn't seem to support it and I failed to install caffe2 on my windows 7...

desert oar Jul 25, 2019, 2:25 PM

#

@hollow latch theres this... https://github.com/onnx/onnx-tensorflow

GitHub

onnx/onnx-tensorflow

Tensorflow Backend and Frontend for ONNX. Contribute to onnx/onnx-tensorflow development by creating an account on GitHub.

#

Ive actually been wondering how to use onnx and tf myself. So good thing i found this

supple ferry Jul 25, 2019, 2:35 PM

#

@granite sierra have you tried np.asanyarray?

#

It works with nested structures better

granite sierra Jul 25, 2019, 2:36 PM

#

let me try

supple ferry Jul 25, 2019, 2:37 PM

#

Also you can reduce the tuple until you get only vectors

#

You can even use numpys own reduce

granite sierra Jul 25, 2019, 2:39 PM

#

hmm let me see

#

but what ufunc would I do to the vector?

#

nah I dont think that works, unless I'm doing it wrong

supple ferry Jul 25, 2019, 2:50 PM

#

How did you do it?

#

Show pls

hollow latch Jul 25, 2019, 2:54 PM

#

@desert oar Thank you, it seems to work !

granite sierra Jul 25, 2019, 3:01 PM

#

import scipy.io as sci
import numpy as np

a = sci.loadmat('test.mat')

print(a)

b = a['testfile']

print(b)


c = np.squeeze(np.array(b))
d = np.reduce(c)

print(d)

supple ferry Jul 25, 2019, 3:20 PM

#

Maybe show the outputs too 😁

granite sierra Jul 25, 2019, 3:22 PM

#

oh sorry

#

runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')
{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
      dtype=[('test', 'O'), ('temp_data', 'O')])}
[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]
Traceback (most recent call last):

  File "<ipython-input-530-607ea6f6830d>", line 1, in <module>
    runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')

  File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/danilov_d/.spyder-py3/understandfile.py", line 21, in <module>
    d = np.reduce(c)

AttributeError: module 'numpy' has no attribute 'reduce'

silk forge Jul 25, 2019, 3:28 PM

#

how does the train_test_split function work in cases of multiple linear regression

earnest prawn Jul 25, 2019, 3:39 PM

#

not different from any other model

#

it simply splits your data into test and training data so you can see how your model performs on data it has never seen during training

#

so you can for example diagnose overfitting etc

silk forge Jul 25, 2019, 3:42 PM

#

x = data.ENGINESIZE , data.CYLINDERS , data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS

dfx = pd.DataFrame(x)
dfy = pd.DataFrame(y)



trainx , testx ,trainy, testy =  train_test_split(dfx , dfy, test_size=0.2 , random_state=8)

#

when im using more than 1 xvalues , can i just do this?

#

@

#

@earnest prawn

earnest prawn Jul 25, 2019, 3:43 PM

#

I mean I am not 100 percent sure but I dont see any reason why it shouldnt. You could just try I guess ¯_(ツ)_/¯

silk forge Jul 25, 2019, 3:50 PM

#

won't work

earnest prawn Jul 25, 2019, 4:08 PM

#

whats the error?

desert oar Jul 25, 2019, 4:10 PM

#

@silk forge x is a tuple

#

why are you trying to take columns out of a dataframe then make a new dataframe out of it?

#

i assume data is a dataframe right? if so

dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]   # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS']  # series, equivalent to 1-d array
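
A self-contained sketch of that column selection feeding `train_test_split` and a multi-feature linear regression. The data is synthetic (the real CSV is not available here); only the column names are taken from the conversation, and the coefficients used to generate it are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 200
data = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, n),
    "CYLINDERS": rng.integers(3, 13, n),
    "FUELCONSUMPTION_COMB": rng.uniform(5.0, 20.0, n),
})
# synthetic target: a linear mix of the three features plus noise
data["CO2EMISSIONS"] = (10 * data["ENGINESIZE"] + 7 * data["CYLINDERS"]
                        + 9 * data["FUELCONSUMPTION_COMB"] + rng.normal(0, 5, n))

X = data[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]  # 2-D: one column per feature
y = data["CO2EMISSIONS"]                                       # 1-D target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

regr = LinearRegression().fit(X_train, y_train)
print(regr.coef_, regr.intercept_, regr.score(X_test, y_test))
```

With several features, `coef_` holds one slope per column, in the same order as the column list.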

silk forge Jul 25, 2019, 4:19 PM

#

yuh

#

oh imma try your thing

#

import numpy
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lin
from sklearn.metrics import r2_score , mean_absolute_error , mean_squared_error
from sklearn.model_selection import train_test_split

data = pd.read_csv(filepath_or_buffer="C:/Users/admin/Downloads/FuelConsumptionCo2.csv")

# so the x values are gonna be ENGINESIZE , CYLINDERS  AND FUELCONSUMPTION_COMB

x = data.iloc[: , 4:6]
x["FUELCONSUM"] = data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS


dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]   # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS']  # series, equivalent to 1-d array



trainx , testx ,trainy, testy =  train_test_split(dfx,dfy, test_size=0.2 , random_state=8)

regr = lin.LinearRegression()
regr.fit(trainx,trainy)

slope = regr.coef_
inter = regr.intercept_

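# three features cannot be drawn as one line, so plot predicted against actual values instead
predy = regr.predict(testx)
plt.scatter(testy, predy, color = "red")
plt.plot([testy.min(), testy.max()], [testy.min(), testy.max()], color = "blue")
plt.show()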

desert oar Jul 25, 2019, 4:36 PM

#

@silk forge why are you using the iloc there?

#

you can delete that line entirely unless you really need to reduce memory usage by dropping columns

quartz monolith Jul 25, 2019, 4:50 PM

#

how to get a model info about the parameter of the model in gensim?

#

model = Word2Vec.load('fastTEXT_big_sg_w30_min12_iter15.model')
print("model ready")

something like this
`model.info()`

round jay Jul 25, 2019, 7:39 PM

#

Hi, I'm applying to an entry level data analyst position, and I submitted a python technical challenge last week and got invited back for an interview including some code review (and SQL whiteboarding too). Would anyone be able to look through my notebook and offer advice prior to my code review?
Deleted

#

(posted in a help and the career challenges as well, this is my last time, apologies)

quartz monolith Jul 25, 2019, 7:52 PM

#

link doesn't work

round jay Jul 25, 2019, 7:53 PM

#

apologies, wrong link

quartz monolith Jul 25, 2019, 7:54 PM

#

whats the goal of the work now?

#

data wrangling?

round jay Jul 25, 2019, 7:55 PM

#

the outlined task: was given a couple of separated data files, combine them into one spreadsheet and also point out any outliers I find, within a 30-45 min period

#

combined spreadsheet was supposed to be formatted to compare a brand's product across 4 different regions

quartz monolith Jul 25, 2019, 8:25 PM

#

you can use some scatter graphs for the outliers maybe?

#

@round jay

#

maybe for outliers you can use scipy z-score

#

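from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
z = np.abs(stats.zscore(df))  # per-column z-scores
cleaned = df[(z < 3).all(axis=1)]    # rows where every |z| < 3 (outliers removed)
outliers = df[(z >= 3).any(axis=1)]  # rows with at least one |z| >= 3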

something like this

here more info:
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

Medium

Ways to Detect and Remove the Outliers

While working on a Data Science project, what is it, that you look for? What is the most important part of the EDA phase? There are certain…

round jay Jul 25, 2019, 8:32 PM

#

Thank you for the suggestions! will definitely consider and go over the medium post

quartz monolith Jul 25, 2019, 8:32 PM

#

maybe @desert oar has some more ideas

#

np!

round jay Jul 25, 2019, 8:55 PM

#

No worries, think I have a starting point to talk about just in case

#

Appreciate it

crimson trellis Jul 26, 2019, 2:59 AM

#

hey, has anyone used statsmodels to do stepwise regression?

#

trying to follow this here; but it's complaining about my inputs
https://stackoverflow.com/questions/22341271/get-list-from-pandas-dataframe-column

Stack Overflow

get list from pandas dataframe column

I have an excel document which looks like this..

cluster load_date budget actual fixed_price
A 1/1/2014 1000 4000 Y
A 2/1/2014 12000 10000 Y
A 3/1/2014 36000 2000 ...

#

my X is a dataframe with input variables, and my y is the list of target values

abstract zodiac Jul 26, 2019, 4:47 AM

#

Has anybody used Dataquest, and would you recommend it? I have 3 hours a day to spend on studying, and as a beginner, what's the best online course material?

hot orbit Jul 26, 2019, 5:31 AM

#

Is this a good channel to ask about AI?

#

Just wanted to know if anybody was working/is working on AI

#

~~ok I read channel topic nevermind~~

#

btw the question is, what were you working at? just curious 😄

desert oar Jul 26, 2019, 11:10 AM

#

@crimson trellis show your code

surreal nacelle Jul 26, 2019, 11:22 AM

#

Spam filter finished !

📎 Screen_Shot_2019-07-26_at_1.19.41_PM.png

#

📎 Screen_Shot_2019-07-26_at_1.22.15_PM.png

granite sierra Jul 26, 2019, 11:31 AM

#

that's awesome @surreal nacelle

surreal nacelle Jul 26, 2019, 11:31 AM

#

Thanks 😄 First "practical" application of ml, really fun project tbh

granite sierra Jul 26, 2019, 11:31 AM

#

how long did it take you

surreal nacelle Jul 26, 2019, 11:31 AM

#

2 days

granite sierra Jul 26, 2019, 11:31 AM

#

did you do any maths for it?

surreal nacelle Jul 26, 2019, 11:32 AM

#

I used xgboost for the model

#

The part that required work was the preprocessing

#

so not much maths

#

next step is to code most of the basic algo from scratch tho

desert oar Jul 26, 2019, 11:37 AM

#

@surreal nacelle nice job!

#

What training data did you use

surreal nacelle Jul 26, 2019, 11:38 AM

#

spamassassin

desert oar Jul 26, 2019, 11:38 AM

#

Good stuff

surreal nacelle Jul 26, 2019, 11:38 AM

#

and thanks!

#

you helped 😄

desert oar Jul 26, 2019, 11:38 AM

#

👍

surreal nacelle Jul 26, 2019, 11:38 AM

#

https://spamassassin.apache.org/old/publiccorpus/

desert oar Jul 26, 2019, 11:39 AM

#

That's a great project idea by the way, do you think it is something you would recommend for other beginners

surreal nacelle Jul 26, 2019, 11:39 AM

#

Absolutely

#

Learned a lot

#

actually it is one of the assignments from the Hands on machine learning with sklearn, tensorflow, and keras book.

#

This book is really great

#

📎 Screen_Shot_2019-07-26_at_1.41.47_PM.png

desert oar Jul 26, 2019, 12:18 PM

#

ahh

#

good to know

rigid haven Jul 26, 2019, 12:20 PM

#

Hey y'all!
Sorry if this isn't the right category!
I'm not sure if this is totally a "python" question, but I figured I'd ask since I'm coding it in Python. I'm using numpy and matplotlib, and in this graph, I want to calculate a value for "when the graph is shooting up". Is there a word for that? Is there a function I could use in either numpy or matplotlib (or scipy) that'll help me calculate it?

#

📎 unknown.png

#

It's for a school assignment, and the professor is just expecting us to look at it and write down what it looks like, but that feels super gross

desert oar Jul 26, 2019, 12:22 PM

#

you want to know the time it happens?

#

or the steepness? or the amount it increases?

rigid haven Jul 26, 2019, 12:22 PM

#

The time it happens, sorry

desert oar Jul 26, 2019, 12:23 PM

#

so you want to know when the increase "starts"?

rigid haven Jul 26, 2019, 12:23 PM

#

"Happens" being a confusing word since it's over a time interval, but knowing how crazy mathematicians are there's probably a definition for it :p

#

Oops, see above

desert oar Jul 26, 2019, 12:23 PM

#

so theres no exact definition for it

rigid haven Jul 26, 2019, 12:23 PM

#

Aww
Math let me down ;~;

desert oar Jul 26, 2019, 12:23 PM

#

how many data points do you have?

#

there are alternatives dont worry

rigid haven Jul 26, 2019, 12:23 PM

#

A bajillion
(I think technically like a thousand but I can generate more)

zenith nova Jul 26, 2019, 12:23 PM

#

You could monitor the derivative with a threshold

desert oar Jul 26, 2019, 12:23 PM

#

ok thats perfect

zenith nova Jul 26, 2019, 12:24 PM

#

( If that's what you were thinking of salt rock lamp? )

rigid haven Jul 26, 2019, 12:24 PM

#

OOO Bast that's a cool idea!!!

desert oar Jul 26, 2019, 12:24 PM

#

they are evenly spaced points?

#

yep @zenith nova exactly

#

take the successive differences of points

rigid haven Jul 26, 2019, 12:24 PM

#

They are mostly evenly spaced; some specific points will generate errors, but that's usually only one in a hundred

desert oar Jul 26, 2019, 12:24 PM

#

and when it goes above some number say "the increase has started"

#

ok. i would linearly interpolate the missing points

#

so you don't accidentally register a 2x increase
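A rough sketch of the idea above (interpolate the gaps, take successive differences, flag the first one over a threshold). It assumes missing samples show up as NaN; `increase_start`, the sample data, and the 0.5 threshold are all made up for illustration:

```python
import numpy as np

def increase_start(t, y, threshold):
    """Return the time where the successive difference first exceeds threshold, or None."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    # linearly interpolate the missing samples so a gap does not look like a jump
    y_filled = y.copy()
    y_filled[missing] = np.interp(t[missing], t[~missing], y[~missing])
    diffs = np.diff(y_filled)                 # discrete stand-in for the derivative
    idx = np.flatnonzero(diffs > threshold)   # every step that rises faster than threshold
    return t[idx[0]] if idx.size else None

# hypothetical data: flat, one missing sample, then a sharp rise
t = np.arange(10.0)
y = np.array([0, 0, 0, np.nan, 0, 0, 1, 3, 6, 6], dtype=float)
start = increase_start(t, y, threshold=0.5)
print(start)
```

The threshold is the part you have to pick by eye; if the spacing between points is uneven, dividing `diffs` by `np.diff(t)` gives a proper slope instead.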

rigid haven Jul 26, 2019, 12:25 PM

#

Oh man that sounds advanced
Changes np.loadtxt to np.genfromtxt

desert oar Jul 26, 2019, 12:25 PM

#

nah. https://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html

#

better yet, https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html#d-interpolation-interp1d

rigid haven Jul 26, 2019, 12:26 PM

#

Oh I was kidding; doesn't genfromtxt do that automatically?

desert oar Jul 26, 2019, 12:26 PM

#

yeah 😛

rigid haven Jul 26, 2019, 12:26 PM

#

lol

desert oar Jul 26, 2019, 12:26 PM

#

well sorta

#

i dont know if it can do interpolation

rigid haven Jul 26, 2019, 12:27 PM

#

Hmm
I know it does something like interpolation, I don't know what it's technically doing

#

O well, I can just use interp ^_^

#

(Making edits now)

desert oar Jul 26, 2019, 12:27 PM

#

afaik genfromtxt only fills with a fixed value

rigid haven Jul 26, 2019, 12:27 PM

#

oh ew

desert oar Jul 26, 2019, 12:27 PM

#

so you can fill all the nulls with 0.0

rigid haven Jul 26, 2019, 12:32 PM

#

Hmm
I want to mess with the interp more, but I'll leave that for later.
Since we're taking "the successive differences of points", that's exactly what the derivative is, so imma actually try that :3
Thanks guys! <3

desert oar Jul 26, 2019, 12:32 PM

#

well... it's what the derivative is with infinite points, so kinda. but yeah that should work

#

either you can manually set a threshold

#

or use a change point detection algorithm

rigid haven Jul 26, 2019, 12:34 PM

#

A what

#

that sounds sciency I like it

#

I'll look it up; thanks for your help! ^_^

lapis sequoia Jul 26, 2019, 12:38 PM

#

Does anyone know how to prepare and load a local dataset for use in keras?

supple ferry Jul 26, 2019, 12:50 PM

#

@lapis sequoia through pandas. If it is in a database, you can use a database client for loading and then pandas for transforming

rigid haven Jul 26, 2019, 12:55 PM

#

Love ya @desert oar and @zenith nova ! ^_^

📎 unknown.png

desert oar Jul 26, 2019, 1:02 PM

#

👍

crimson trellis Jul 26, 2019, 1:08 PM

#

@desert oar wasn't able to use the walmart data to predict target sales, the differences were too big.

I'm trying to use VAR now just for target

#

I'm getting a little confused with all these tutorials though. They show me how to build a model, but not how to use it to predict future values

#

am I supposed to pass in fake X values for the future?

#

to predict Y

desert oar Jul 26, 2019, 1:22 PM

#

VAR has the same problem kinda

#

where you need to know current month in everything in order to predict next month

#

what you can do is "chain" predictions

#

like predict one period from the data you have, then use that prediction as the input to predict the next period
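A minimal sketch of what chaining looks like, using a hand-rolled VAR(1) fit with plain numpy least squares rather than statsmodels. `fit_var1`, `chain_forecast`, and the toy series are all hypothetical, not anything from the actual sales data:

```python
import numpy as np

def fit_var1(series):
    """Least-squares fit of x[t+1] ~ c + A @ x[t]; series has shape (T, k)."""
    X = np.hstack([np.ones((len(series) - 1, 1)), series[:-1]])
    coef, *_ = np.linalg.lstsq(X, series[1:], rcond=None)
    return coef[0], coef[1:].T  # intercept c with shape (k,), matrix A with shape (k, k)

def chain_forecast(last, c, A, steps):
    """Feed each prediction back in as the input for the next period."""
    x = np.asarray(last, dtype=float)
    out = []
    for _ in range(steps):
        x = c + A @ x
        out.append(x)
    return np.array(out)

# toy noise-free series generated from known (made-up) coefficients
A_true = np.array([[0.5, 0.1], [0.0, 0.8]])
c_true = np.array([1.0, 2.0])
series = np.zeros((20, 2))
for step in range(19):
    series[step + 1] = c_true + A_true @ series[step]

c, A = fit_var1(series)
forecast = chain_forecast(series[-1], c, A, steps=3)
print(forecast)
```

No future X values are needed: the model only looks at the previous period, so the last observed month seeds the first forecast and each forecast seeds the next. Errors do compound the further out you go.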

lapis sequoia Jul 26, 2019, 1:45 PM

#

@supple ferry Should i import the data and labels separately?

#

i've tried using pandas read_csv function previously, but it didnt really like that i had 2 different data types

desert oar Jul 26, 2019, 1:53 PM

#

@lapis sequoia what does your data look like? you can exercise a lot of control over data types with pandas

lapis sequoia Jul 26, 2019, 1:54 PM

#

Its basically sequences of 5000 time measurements

#

sometimes a little bit less than 5000

#

and i want to add a label to the series as a whole to classify it

desert oar Jul 26, 2019, 1:58 PM

#

what was your concern about data types

crimson trellis Jul 26, 2019, 1:59 PM

#

@desert oar when you say " you need to know everything" what do you mean?

I have entire data from 2018 to this June, but nothing for July, is it still possible to predict?

lapis sequoia Jul 26, 2019, 2:00 PM

#

Well when i import it with data labels added it struggles to determine the data type, and gives me an error

#

i read some documentation, and it says i should specify the dtype

#

however if i specify it being float, it reacts to the label not being convertible to float

desert oar Jul 26, 2019, 2:02 PM

#

can you give us a sample of the data

lapis sequoia Jul 26, 2019, 2:03 PM

#

1 sec

#

📎 example_labeled.csv

#

That is 3 series of 5000 time measurements

desert oar Jul 26, 2019, 2:09 PM

#

so in each row you have 5000 time measurements, and a label at the end

lapis sequoia Jul 26, 2019, 2:09 PM

#

Yes

desert oar Jul 26, 2019, 2:09 PM

#

is that right?

#

ok. so you dont need pandas for this because you can trust that the commas arent going to be messed up

#

pandas is good for mixed data types

#

"spreadsheet" type of stuff

#

just open the line and split it on commas

lapis sequoia Jul 26, 2019, 2:10 PM

#

Yea the commas wont be messed up

#

The thing I'm struggling with is passing it into Keras the correct way

desert oar Jul 26, 2019, 2:12 PM

#

with open('example_labeled.csv') as f:
    rows = (line.strip().split(',') for line in f if line.strip())  # one list of fields per non-empty line
    data = [(vals[-1], np.array(vals[:-1], dtype=np.float32)) for vals in rows]

#

that gives you a list of tuples, the 1st element being the label and the 2nd element being a numpy array of the numbers

#

i mean, yeah

#

lol

#

for pandas, what you would do is this

#

i think theyre constructing this data

#

look at the example

lapis sequoia Jul 26, 2019, 2:14 PM

#

i am gathering the data myself yea

#

I can easily remove the label from the csv file

desert oar Jul 26, 2019, 2:14 PM

#

that said

data = pd.read_csv('example_labeled.csv', header=None)
y = data.iloc[:, -1]
x = data.iloc[:, :-1]

#

the first 5000 columns should already be float type

#

and the last column should already be 'O' type which is basically "string"

#

wait what the heck how do you disable reading headers

#

yeah its header=None

lapis sequoia Jul 26, 2019, 2:16 PM

#

if some of the series are a bit shorter than 5000 measurements, will it still work?

desert oar Jul 26, 2019, 2:16 PM

#

no

#

thats why you were having issues

#

either pad the series when you're creating the data, or read it line by line as i started describing above
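Something like this would cover both options at once: read line by line, then pad the short rows. It assumes each line is `v1,v2,...,label` as in the sample file; `load_padded` and the three sample lines are made up for illustration:

```python
import numpy as np

def load_padded(lines, length, fill=0.0):
    """Parse 'v1,v2,...,label' lines into a padded float array and integer class ids."""
    labels, rows = [], []
    for line in lines:
        vals = line.strip().split(',')
        if len(vals) < 2:
            continue  # skip blank or label-only lines
        labels.append(vals[-1])
        rows.append(np.array(vals[:-1], dtype=np.float32)[:length])
    x = np.full((len(rows), length), fill, dtype=np.float32)
    for i, row in enumerate(rows):
        x[i, :len(row)] = row  # short rows keep the fill value at the end
    classes, y = np.unique(labels, return_inverse=True)  # string labels -> integer ids
    return x, y, classes

# hypothetical stand-in for the real file, with rows of different lengths
sample = ["0.1,0.2,0.3,slow", "0.5,fast", "0.7,0.8,slow"]
x, y, classes = load_padded(sample, length=3)
print(x.shape, y, classes)
```

With the real file it would be `with open('example_labeled.csv') as f: x, y, classes = load_padded(f, 5000)`. Keras also has a `pad_sequences` utility that handles the padding part if you would rather not roll your own.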

lapis sequoia Jul 26, 2019, 2:18 PM

#

Ok, I think I have my work cut out for me now.

desert oar Jul 26, 2019, 2:18 PM

#

how are you creating the data?

lapis sequoia Jul 26, 2019, 2:18 PM

#

I'm measuring delay over a network

#

I'm more of a network engineer, but my masters thesis requires me to touch on machine learning, which I've never really done before

#

Can I message you if I have a question in the future?

desert oar Jul 26, 2019, 2:28 PM

#

you can ping me here

lapis sequoia Jul 26, 2019, 2:34 PM

#

finger_gun

#

Thanks for the help!

quartz monolith Jul 26, 2019, 3:29 PM

#

@desert oar fastTEXT works really well on sentence/text classification

desert oar Jul 26, 2019, 3:29 PM

#

yup, i love it

quartz monolith Jul 26, 2019, 3:30 PM

#

i showed the experts 7 different models with different parameters we decided one

#

they cant believe it lol

#

its really an odd feeling when they help you to create a new system with ai and maybe they will not be needed in future

desert oar Jul 26, 2019, 4:52 PM

#

wow

#

yeah it is

#

congrats on making it work

#

its hard.. you know that overall you are making the world more efficient. but its scary knowing that we as a society havent figured out how to handle people who lose their jobs due to automation

surreal nacelle Jul 26, 2019, 5:00 PM

#

https://github.com/ageron/handson-ml2/blob/master/math_linear_algebra.ipynb
This is pretty good to learn/relearn linear algebra

GitHub

ageron/handson-ml2

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2. - ageron/handson-ml2

#

Using python

#

and matplotlib 😃

#

that's from the repository associated with the hands on with machine learning (...) book, it's full of valuable resources 😃

silent swan Jul 26, 2019, 6:29 PM

#

fastText is underrated

#

it's pretty good and very easy to use

desert oar Jul 26, 2019, 6:31 PM

#

and you have to fiddle with it less than VW

#

its become my baseline go-to

#

instead of liblinear

#

for text, that is

silent swan Jul 26, 2019, 6:33 PM

#

and less hassle than bert lol

desert oar Jul 26, 2019, 6:36 PM

#

we just started using bert here

#

weve had a big classification project running for almost a year now

#

1450 classes

#

very rare classes in some cases, < 5

#

lots of mislabelings

#

VW fell on its face

#

fasttext barely beat out liblinear

#

its a really messy project

vital plume Jul 26, 2019, 7:34 PM

#

I have a question which isn't really python related but more generally data-science. I can't seem to google my way to a solution through key words in different combinations as I don't know precisely how to articulate my need. I have a 2d data set which is really weirdly distributed - as in, a lot of the data is clumped in one area. Is there a method by which I can 'redistribute' that data and how can I go about searching for methods to do this?

desert oar Jul 26, 2019, 7:36 PM

#

@vital plume what do you mean redistribute? what would be the desired result?

vital plume Jul 26, 2019, 7:36 PM

#

Like

#

for a very dumb example

#

lets say we have data points that produce a completely diagonal relationship

#

(0, 0)(0.3, 0.5)(1, 1)

#

I only really care that the points themselves demonstrate that 1,1 is at the far right hand corner of the coordinate space and 0,0 is at the bottom left

#

the middle value 0.3, 0.5 however could be at 0.5, 0.5 and it would still express its position in relationship to both those points relatively

desert oar Jul 26, 2019, 7:43 PM

#

sort of? the distance to both points changed

#

the angle changed

#

its a totally different point imo, except for the fact that 0 < 0.3 < 1 and 0 < 0.5 < 1

#

so it depends on your meaning

#

eg you can identify a bounding box or bounding circle, and evenly space points within that bound

vital plume Jul 26, 2019, 7:49 PM

#

I guess I want to respace my data space

#

redistribute?

desert oar Jul 26, 2019, 7:53 PM

#

you just need some kind of criterion

#

some rule

vital plume Jul 26, 2019, 8:15 PM

#

Is there not some unsupervised processing of data that you can do to make a distribution more normal?

desert oar Jul 26, 2019, 8:27 PM

#

sure, but you havent described an actual criterion until now

#

you want it to be more gaussian?

#

heck you can take the mean and variance of the same data, and randomly generate N new points from a gaussian distribution
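A quick sketch of that last suggestion, assuming 2-d data in a numpy array. The clumped input here is randomly generated as a stand-in, and it uses the full covariance matrix rather than just per-axis variance so any correlation between the two axes is kept:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(size=(500, 2))  # made-up clumped 2-d data

mean = data.mean(axis=0)          # per-axis mean
cov = np.cov(data, rowvar=False)  # 2x2 covariance of the two axes

# draw N new points from a gaussian with the same mean and covariance
resampled = rng.multivariate_normal(mean, cov, size=len(data))
print(resampled.mean(axis=0), mean)
```

Note that this throws away the original points entirely; the new ones only share the first two moments with the old ones.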

grizzled folio Jul 27, 2019, 12:29 AM

#

anybody know off-hand whether I can get a speedup by opening a compressed, chunked netCDF file in parallel? I vaguely recall benchmarking the uncompressed file and getting a huge speedup in single-thread reading...

desert oar Jul 27, 2019, 2:04 AM

#

@grizzled folio i dont know netcdf specifically, but "probably"

#

that would be my guess

grizzled folio Jul 27, 2019, 2:24 AM

#

looks like HDF5 is not thread safe, even for reads?!

desert oar Jul 27, 2019, 2:30 AM

#

oof

#

ive never used it. never had a need. i always used parquet for "big" tabular stuff and gzipped json for non-tabular structured

grizzled folio Jul 27, 2019, 2:33 AM

#

interesting, I'm working with climate/ocean data so netCDF is the way to go (though zarr is making its way in)

desert oar Jul 27, 2019, 2:35 AM

#

yeah im not familiar w/ the more complex data formats from the natural sciences

#

in social science everything is tabular or text

grizzled folio Jul 27, 2019, 2:36 AM

#

that'd make things much simpler!

desert oar Jul 27, 2019, 2:40 AM

#

whats the advantage of all these complex formats

#

i know for example GIS data it's just a really old format, so it's really messy

#

ok sometimes in social sciences we use GIS data too, but thats not so bad because there are a lot of established tools for it

grizzled folio Jul 27, 2019, 2:41 AM

#

netCDF isn't particularly complex, it's self-describing (so you can pick one up and have all the dimensions/attributes), and handles multiple dimensions, record dimensions (so you can write them as your model runs), things like that

#

I've never used GIS, but it definitely sounds messy 😉

silent swan Jul 27, 2019, 4:26 AM

#

hmmmm, might need to find a way to efficiently compute pair-wise dot products of 1million 128d vectors

grizzled folio Jul 27, 2019, 4:29 AM

#

10^12 dot products? ouch

#

at least it should vectorise nicely

silent swan Jul 27, 2019, 4:29 AM

#

ya as it turns out, nonparametric methods are... compute heavy

#

yea this is the sort of problem that could get close to the 100% theoretical efficiency of a GPU lol
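On CPU, a sketch of the vectorised version would be a plain matrix product done in row blocks, since the full 10^6 x 10^6 result would not fit in memory. `pairwise_dot_blocks` and the chunk size are hypothetical, and the demo uses a tiny array so it actually runs:

```python
import numpy as np

def pairwise_dot_blocks(vectors, chunk=1024):
    """Yield (start, block) where block[i, j] = vectors[start + i] @ vectors[j]."""
    for start in range(0, len(vectors), chunk):
        # one matrix product computes chunk * len(vectors) dot products at once
        yield start, vectors[start:start + chunk] @ vectors.T

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10, 4)).astype(np.float32)  # stand-in for 1M x 128
full = np.vstack([block for _, block in pairwise_dot_blocks(vectors, chunk=3)])
print(full.shape)
```

At real scale each block would be reduced (top-k, a sum, whatever the method needs) as it comes out instead of being stacked.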

lapis sequoia Jul 27, 2019, 9:22 AM

#

what did I miss

#

parquet is great.. have you tried capacitor? @desert oar

muted garden Jul 27, 2019, 9:52 AM

#

Hello, may I ask, data science is part of ML, right?

silk forge Jul 27, 2019, 9:56 AM

#

yes and no

#

📎 unknown.png

#

@muted garden

#

📎 unknown.png

#

btw

#

i dont understand feature scaling and data normalization

#

📎 unknown.png

muted garden Jul 27, 2019, 10:00 AM

#

🧐

#

Okay thank you bro

idle cedar Jul 27, 2019, 11:23 AM

#

@muted garden - there are two areas. Data Analytics encompasses Data Science and Machine Learning. Inside Data Science there are traditional methods (logistic, linear, cluster, factor analysis etc) and Machine Learning

#

Here's a helpful infographic I use in work:

#

📎 365-Data-Science-Infographic.jpg

quartz monolith Jul 27, 2019, 12:34 PM

#

the most confusing is the difference between data mining and machine learning 😄

desert oar Jul 27, 2019, 1:45 PM

#

bleh

#

ML is a research discipline, a problem domain, and a loose collection of techniques

#

data science is a job title and a career path

#

data science subsumed a number of jobs that previously had different names, eg. "quant", "statistician", "machine learning researcher", etc.

#

any time you're automating something in a way that requires "learning" (anything beyond hard-coded rules) and making inferences from it, imo you're doing machine learning

#

basically any kind of automated prediction task

idle cedar Jul 27, 2019, 1:49 PM

#

Salt hit the nail on the head, 20 years ago a Machine Learning Engineer undoubtedly would be a statistician or Quant

lean ledge Jul 27, 2019, 1:51 PM

#

I hate that infographic

idle cedar Jul 27, 2019, 1:52 PM

#

How come Raggy?

lean ledge Jul 27, 2019, 1:56 PM

#

It's just so bad. Clustering, regression and time series isn't ML, because ML is just supervised, unsupervised, and RL, which are clearly different things. Apparently the major distinguishing factor between supervised and RL is that you're maximising reward instead of minimising cost. Somehow ML is so different from traditional methods that it uses 5 more languages or something

idle cedar Jul 27, 2019, 1:57 PM

#

You clearly didn't read the infographic

#

Clustering, regression, factor etc are all under traditional not machine learning

lean ledge Jul 27, 2019, 1:57 PM

#

Regression, clustering etc are also generally ML

idle cedar Jul 27, 2019, 1:58 PM

#

Yes you can use them but they sit under traditional

#

📎 unknown.png

lean ledge Jul 27, 2019, 1:58 PM

#

How is clustering ever not ML

idle cedar Jul 27, 2019, 1:59 PM

#

Because you can do a 1 line in Python without any ML packages to find clusters within observations

#

Same with Factor Analysis

lean ledge Jul 27, 2019, 1:59 PM

#

...number of lines isn't an indicator of whether something is ML

idle cedar Jul 27, 2019, 2:00 PM

#

No, ML is the indicator of a machine learning from previous optimisation attempts to improve the accuracy of a model

#

Cluster does not do that unless specifically indicated

#

and thus k-means is more appropriate than Cluster for machine learning

#

than traditional cluster*

#

Let's take powerBI for example

#

You can use m query language to get a cluster of data

#

that didn't use ML at all

#

but you're right as well because

#

what if you have a constant stream of data that it needs to cluster

lean ledge Jul 27, 2019, 2:01 PM

#

K means isn't ML?

idle cedar Jul 27, 2019, 2:01 PM

#

k-means is

#

it's a very common model for machine learning

#

K-means clustering is a type of unsupervised learning

lean ledge Jul 27, 2019, 2:02 PM

#

I am aware what it is

#

Very well aware

idle cedar Jul 27, 2019, 2:02 PM

#

Great, so I think we are on the same page

lean ledge Jul 27, 2019, 2:02 PM

#

I'm just having a hard time grasping what you're saying

idle cedar Jul 27, 2019, 2:02 PM

#

I'm saying that the infographic is accurate for what it is trying to portray

#

But you're also right in some regards, it is hard to draw a hard line between the two

#

but there are more appropriate complex models that would separate traditional data science and machine learning

lean ledge Jul 27, 2019, 2:03 PM

#

Is it though? It's honestly just confusing for everyone

#

It can't even properly distinguish between RL and supervised learning

idle cedar Jul 27, 2019, 2:03 PM

#

It's usually better explained with the video

lean ledge Jul 27, 2019, 2:03 PM

#

It's just one of the million bad infographics made in the field

idle cedar Jul 27, 2019, 2:04 PM

#

But let's face it, if you dont know the difference between reward based learning and supervised learning then you shouldn't really be looking at it in the first instance un-aided

lean ledge Jul 27, 2019, 2:04 PM

#

🙄😒 "this resource for beginners is bad"
"If you're a beginner you shouldn't be looking at it anyway"

idle cedar Jul 27, 2019, 2:05 PM

#

I think you misinterpreted what I said because I said it is better explained with the video

#

But the infographic, for all intents and purposes, is the best I have ever had at trying to help people explain the differences

#

but the entire Data Analytics field is so vast of buzz words

#

that it has become subjective in what sits where

#

everyone must make their own mind up these days

#

But on another topic, Matplotlib vs plotly 😄

lean ledge Jul 27, 2019, 2:06 PM

#

The infographic shouldn't try to give info when there isn't a concrete answer

idle cedar Jul 27, 2019, 2:06 PM

#

Problem is Raggy, I dont think anyone has a concrete anaswer yet

lean ledge Jul 27, 2019, 2:06 PM

#

Clustering, regression, and time series are problems to tackle not methods on their own

idle cedar Jul 27, 2019, 2:06 PM

#

answer* it just attempts to put some structure on a wildly unstructured field

lean ledge Jul 27, 2019, 2:07 PM

#

You can't classify them as traditional

#

And then list a bunch of ML methods to do those tasks and say they're separate

#

You shouldn't misinterpret the difference between RL and supervised

idle cedar Jul 27, 2019, 2:07 PM

#

I dont think that was the purpose because Cluster and K means are in both

lean ledge Jul 27, 2019, 2:07 PM

#

You shouldn't make claims on the languages used when there is no standard or meaningful distinction

idle cedar Jul 27, 2019, 2:07 PM

#

I think you're arguing for the sake of it

lean ledge Jul 27, 2019, 2:08 PM

#

Left doesn't mention K means

idle cedar Jul 27, 2019, 2:08 PM

#

No but it mentions cluster analysis

#

which is a traditional non learning method

#

k means is

#

That infographic isn't set is stone because it isn't explicitly mentioned

#

means that's it ti cannot be used in either regards

#

Let me rephrase that as that was terrible explanation

#

Because the infographic doesn't explicitly say what belongs to which, doesn't mean that it is the rule of law, it's just giving some methods to help people realise what the difference is

#

one doesn't learn from itself, here are some methods

#

one learns from itself, here are some methods

#

not an exhaustive list

#

I agree with you they can be used for both, absolutely

#

but someone has to make the attempt at giving a few examples for each

lean ledge Jul 27, 2019, 2:10 PM

#

You're incomprehensible to me

#

I'm not saying it's not exhaustive enough

#

I'm not saying something can be used for both

#

I'm saying the infographic is horrible because it's putting a problem statement and a solution next to each other and pretending the problem statement is a traditional method and the solution is an ML method

#

And that's just one of the many misleading things about it

idle cedar Jul 27, 2019, 2:11 PM

#

It's just listing some examples - not a definitive list

#

jesus christ

lean ledge Jul 27, 2019, 2:11 PM

#

ITS NOT LISTING EXAMPLES AT ALL

idle cedar Jul 27, 2019, 2:11 PM

#

Yeah it is

lean ledge Jul 27, 2019, 2:11 PM

#

Regression is not a traditional technique

#

It's a problem statement

#

Traditional technique would be the normal equation

idle cedar Jul 27, 2019, 2:12 PM

#

📎 unknown.png

#

Do you see the fucking section that says example usage

lean ledge Jul 27, 2019, 2:12 PM

#

...I am afraid you can't read

idle cedar Jul 27, 2019, 2:12 PM

#

Linear regression, logistic regression, cluster analysis and factor analysis are all traditional method

lean ledge Jul 27, 2019, 2:12 PM

#

THEY ARE NOT METHODS

#

THEY ARE PROBLEM STATEMENTS

idle cedar Jul 27, 2019, 2:12 PM

#

Sigh

lean ledge Jul 27, 2019, 2:13 PM

#

A traditional method for doing linear regression is analytically calculating weights using the normal equation

#

The ML way is gradient descent over weights

#

Both are the same tasks

#

linear regression

idle cedar Jul 27, 2019, 2:14 PM

#

I pray for you if you get angry over the difference between Method and problem statements. I'm going to duck out of this conversation now because as with any debate, everyone gets ingrained in their original opinion anyway.

lean ledge Jul 27, 2019, 2:14 PM

#

Jesus Christ

simple crag Jul 27, 2019, 2:15 PM

#

Let's not continue this

lean ledge Jul 27, 2019, 2:16 PM

#

Why does this server have such a high concentration of people who keep insisting they're right when they are clueless

#

It is so frustrating

simple crag Jul 27, 2019, 2:16 PM

#

All capping at people is really going to help with that

#

Vent your frustrations somewhere else

lean ledge Jul 27, 2019, 2:18 PM

#

🙄 What else do I do when someone isn't taking in what I'm saying over and over. Not like mods are willing to tell people to stop being wrong

simple crag Jul 27, 2019, 2:18 PM

#

Be an adult and move on

lean ledge Jul 27, 2019, 2:18 PM

#

Whatever

simple crag Jul 27, 2019, 2:18 PM

#

You're not the arbiter of truth on the internet

lean ledge Jul 27, 2019, 2:20 PM

#

When someone is wrong, they are wrong. I'm not the arbiter of truth but it's hopefully the policy of moderators to ensure the server is both polite and filled with intelligent discussion, not just polite and filled with crap.

simple crag Jul 27, 2019, 2:20 PM

#

You consider yelling at people intelligent discussion?

#

Perhaps come up with a way of discussing topics with people without being a child

lean ledge Jul 27, 2019, 2:21 PM

#

I did try telling the same thing multiple times without yelling. All capsing is just another form of emphasis, not childish shouting.

simple crag Jul 27, 2019, 2:21 PM

#

uh huh

#

I see no point talking this in circles

quartz monolith Jul 27, 2019, 3:44 PM

#

right, to switch the topic. Has someone used azure machine learning studio or any deep learning instances?

muted garden Jul 27, 2019, 4:00 PM

#

hmm

#

That is interesting infographic

#

@idle cedar thank you 💛

idle cedar Jul 27, 2019, 4:06 PM

#

Hope it helps 😃

desert oar Jul 27, 2019, 4:47 PM

#

Fwiw i dont like the infographic much either, but not for the same reasons 😛

silent swan Jul 27, 2019, 4:50 PM

#

silly bois, the method is (X'X)^{-1} X'Y
easy stuff
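For what it's worth, the two routes mentioned earlier (normal equation and gradient descent) land on the same weights for linear regression. A small sketch on made-up data; the learning rate and step count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])  # intercept + 2 features
true_w = np.array([1.0, 2.0, -3.0])                                 # made-up weights
y = X @ true_w + 0.01 * rng.standard_normal(200)

# closed form: solve (X'X) w = X'y rather than forming the inverse explicitly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on the mean squared error
w_gd = np.zeros(3)
lr = 0.1
for _ in range(2000):
    grad = 2 / len(y) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

print(w_normal, w_gd)
```

Same problem, same answer; the difference is only in how the weights are found, which matters once X'X gets too big or too ill-conditioned to solve directly.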

#

next discussion: there is not such thing as unsupervised learning 😄

desert oar Jul 27, 2019, 4:54 PM

#

It might not be a good term but it has a specific meaning and it definitely exists

muted garden Jul 27, 2019, 5:07 PM

#

So someone who is an expert in AI should be an expert in data science too, right? I am sorry if my question is silly, but I am really wondering

desert oar Jul 27, 2019, 5:12 PM

#

sort of?

#

i dont think you can be an expert with data science

#

it's too broad

#

i think you often need data science to do AI

#

and some tools from AI are useful in data science

idle cedar Jul 27, 2019, 8:13 PM

#

Yeah, if you want to hit both Fof, I guess it's statistics

serene scaffold Jul 27, 2019, 8:45 PM

#

Has anyone used scikit learn crf?

desert oar Jul 27, 2019, 9:30 PM

#

scikit-crfsuite?

surreal nacelle Jul 28, 2019, 11:13 AM

#

Hey do any of you guys have experience with this : https://www.coursera.org/specializations/mathematics-machine-learning
Worth it ?

Coursera

Mathematics for Machine Learning | Coursera

Learn Mathematics for Machine Learning from Imperial College London. For a lot of higher level courses in Machine Learning and Data Science, you find you need to freshen up on the basics in mathematics - stuff you may have studied before in ...

wheat egret Jul 28, 2019, 11:17 AM

#

can anyone here provide an accurate, yet basic method of explaining going from two images into a 2d depth map?
i've played around with it a lot using opencv throughout the past couple days, but i think i'm missing some fundamental concepts
(not sure where to put this question, so i'll drop it here)

lapis sequoia Jul 28, 2019, 11:22 AM

#

I suggest you take a look at this https://en.wikipedia.org/wiki/Triangulation_(computer_vision)

Triangulation (computer vision)

In computer vision triangulation refers to the process of determining a point in 3D space given its projections onto two, or more, images. In order to solve this problem it is necessary to know the parameters of the camera projection function from 3D to 2D for the cameras in...
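For the common special case of a rectified stereo pair, triangulation reduces to depth = focal length x baseline / disparity. A minimal sketch of that last step, assuming you already have a disparity map in pixels (such as the one OpenCV's block matchers produce); `disparity_to_depth` and the camera numbers are made up:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)  # zero disparity means "infinitely far" or no match
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# hypothetical camera: 700 px focal length, 10 cm between the two lenses
depth = disparity_to_depth([10.0, 0.0], focal_px=700.0, baseline_m=0.1)
print(depth)
```

Nearby objects shift a lot between the two images (large disparity, small depth) and distant ones barely move, which is the whole intuition behind the depth map.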

wheat egret Jul 28, 2019, 11:29 AM

#

Thanks, i'll look at it.

idle cedar Jul 28, 2019, 5:17 PM

#

@surreal nacelle I did a course on Coursera many moons ago from a guy who taught it in Octave

#

the courses on there are really good

surreal nacelle Jul 28, 2019, 5:17 PM

#

that's andrew ng course

#

but it's not this one 😄