#data-science-and-ml


surreal nacelle
#

and I read the mean of the results from there

desert oar
#

Ok

surreal nacelle
#

when not using cross val, I fit/predict and read the accuracy from the prediction

#

I split before tho

desert oar
#

Ok good. You should usually stick to one evaluation procedure

surreal nacelle
#

cross val seems to be solid

desert oar
#

Yeah it's a reasonable way to go

#

So you have two different models that have different accuracies?

surreal nacelle
#

I only have 1 model for now, but I was thinking about combining 2 yes

desert oar
#

Yes, that's a technique called ensembling

surreal nacelle
#

Ok, gtk

desert oar
#

It's very common and very popular in these types of prediction competitions

#

There are some great blog posts on it, let me find one

surreal nacelle
#

thank you 😃

quartz monolith
#

Decision Tree / CART with Label Encoder?

#

Dummies makes 0 sense

#
Balanced accuracy: 0.21309668192963746
Hamming loss: 0.2958520739630185
Accuracy: 0.7041479260369815
silk forge
#

hey

#
"""Weather in Szeged 2006-2016: Is there a relationship between humidity and temperature? What about
 between humidity and apparent temperature?
  Can you predict the apparent temperature given the humidity?"""

import sklearn.linear_model as lin
import numpy as np
import pandas as pd
from sklearn.model_selection import  train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

dataw = pd.read_csv(filepath_or_buffer='C:/Users/admin/Desktop/Artifcial intelligence/ML/data/WEATHER/weatherHis.csv')


df_x = pd.DataFrame(dataw.Humidity)
df_y = pd.DataFrame(dataw['Temperature (C)'])

trainx,testx,trainy,testy = train_test_split(df_x,df_y,test_size=0.2,random_state=4)

regr = lin.LinearRegression()

regr.fit(trainx,trainy)
slope = regr.coef_
intercept = regr.intercept_



n = regr.predict(testx)

print(n)
print(testy)

print(mean_squared_error(y_true=testy , y_pred=n))
#

this is my

#

code

#

i get an MSE value of 55.248754390894184?

#

is that somewhat accurate

#

?

earnest prawn
#

Ideally you'd always want your MSE to get as close to 0 as possible. Whether 55 is a good value really depends on what you're predicting: MSE is simply the average of the squared differences between actual and predicted values, so you have to decide for yourself whether sqrt(MSE) is good enough for you or not
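
To make the sqrt(MSE) point concrete, here is a minimal sketch that turns the MSE reported above into an RMSE, which is in the same units as the target (degrees Celsius here). The small `y_true` / `y_pred` lists are made up for illustration and are not from the weather dataset.

```python
import math

# MSE reported earlier in the chat for the humidity -> temperature model
mse = 55.248754390894184

# RMSE is in the same units as the target (degrees Celsius here)
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.2f} C")

# The same calculation from scratch on made-up values
y_true = [10.0, 15.0, 20.0]
y_pred = [12.0, 14.0, 17.0]
toy_mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
toy_rmse = math.sqrt(toy_mse)
print(f"toy MSE: {toy_mse:.3f}, toy RMSE: {toy_rmse:.3f}")
```

An RMSE of roughly 7.4 °C means predictions are typically off by several degrees; whether that is acceptable depends on what the prediction is used for.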

void anvil
#

if temperature is in celsius and humidity is 0-100, then probably not

#

if you did temperature, humidity vs apparent temp you'd probably get a lot better results

surreal nacelle
#
```
Your Best Entry

You advanced 2,422 places on the leaderboard!
Your submission scored 0.78468, which is an improvement of your previous score of 0.77033. Great job!
```

I really don't see how to improve on that, currently ranked 4200 out of 11600. Is it time to read some notebooks written by people at the top of the leaderboard?
#

actually jumped 1000 places by running the model again 😄

#

and again 2000 places lol

#

now ranked 1000 out of 11600

#

feeling pretty good about it

quartz monolith
#

nice good job

silk forge
#

what did i do wrong?

#

@void anvil

surreal nacelle
#

Thanks 😄

lapis sequoia
#

What is the best free way to learn data science

earnest prawn
#

check the second pin

void anvil
#

x = temp + humidity, y = feels like temp
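
A rough sketch of that idea: two input features (temperature and humidity) and one target (apparent temperature), fit with ordinary least squares. The data below is synthetic and the relationship between the columns is invented for illustration; it is not the Szeged dataset, and plain NumPy is used here instead of the `LinearRegression` model from the earlier code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption), not the real weather file
temperature = rng.uniform(-5, 35, size=200)
humidity = rng.uniform(0.2, 1.0, size=200)
apparent = temperature - 2.0 * (1.0 - humidity) + rng.normal(0, 0.5, size=200)

# Design matrix: two features plus a column of ones for the intercept
X = np.column_stack([temperature, humidity, np.ones_like(temperature)])
coef, *_ = np.linalg.lstsq(X, apparent, rcond=None)

pred = X @ coef
mse = np.mean((apparent - pred) ** 2)
print("coefficients (temp, humidity, intercept):", coef)
print("MSE:", mse)
```

With both features in `X`, the fit can use temperature directly, which is why the error is expected to be far lower than when predicting from humidity alone.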

sweet socket
#

guys i'm trying to get into a bit of algo trading and thinking of using backtrader in python, anyone know of anything better than this or is backtrader a good starting point? I'm not the greatest coder but i can fumble my way through most things

stuck obsidian
#

I believe you would want to look at quantopian @sweet socket

void anvil
#

lol good luck

surreal nacelle
quartz monolith
#

Has anybody worked with SMOTENC and make_classification to fix class imbalance in a dataset?

void anvil
#

Looking at some code for reinforcement learning on time series (below). Why is it scaling each observation 0-1 instead of the entire observation space to 0-1?

What happens if you have divergent values (e.g. values that go above what are seen in the train set and, thus, will go above 1 if scaled)? Will everything break if values go < 0 or >1 or is everything fine? Would it be better to scale to a tighter range (e.g. instead of 0-1 scaling, 0.2-0.8 scaling)? Want to see if anyone knows / has resources before I start experimenting.

```python
  # Get the data points for the last 5 days and scale to between 0-1
  frame = np.array([
    self.df.loc[self.current_step: self.current_step +
                5, 'Open'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'High'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Low'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Close'].values / MAX_SHARE_PRICE,
    self.df.loc[self.current_step: self.current_step +
                5, 'Volume'].values / MAX_NUM_SHARES,
  ])

  # Append additional data and scale each value to between 0-1
  obs = np.append(frame, [[
    self.balance / MAX_ACCOUNT_BALANCE,
    self.max_net_worth / MAX_ACCOUNT_BALANCE,
    self.shares_held / MAX_NUM_SHARES,
    self.cost_basis / MAX_SHARE_PRICE,
    self.total_shares_sold / MAX_NUM_SHARES,
    self.total_sales_value / (MAX_NUM_SHARES * MAX_SHARE_PRICE),
  ]], axis=0)

  return obs
```
desert oar
#

@void anvil i dont fully understand your question. it's scaling each feature separately

#

those MAX_* variables are defined outside the function

#

practically, i think normally you'd just clip the value to 0 or 1

void anvil
#

ah yeah you're right, but it's still scaling the max observed to 1

#

whereas it could go to 1.5 or w/e

#

in the unobserved test set (or in real time application)

#

this example came from stock trading, so amazon is trading at ~2k now. If we were to let it go, amazon could go to 3k or 50k because it's unbounded

#

so then you'd end up feeding a value > 1

desert oar
#

yeah that makes sense too

#

depends on the data
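
The clipping idea from this exchange can be sketched as follows. The value of `MAX_SHARE_PRICE` and the price array are made-up assumptions; the point is only that dividing by a fixed maximum can produce values outside 0-1 on unseen data, and `np.clip` forces them back into range.

```python
import numpy as np

MAX_SHARE_PRICE = 5000.0  # assumed bound, e.g. picked from the training data

# Made-up prices: 7500 is above the assumed max, -10 is a bad data point
prices = np.array([1800.0, 2000.0, 7500.0, -10.0])

scaled = prices / MAX_SHARE_PRICE     # can leave [0, 1] on unseen data
clipped = np.clip(scaled, 0.0, 1.0)   # force everything back into [0, 1]

print(scaled)
print(clipped)
```

Clipping keeps the inputs in the range the model saw during training, at the cost of hiding how far past the bound a value actually went, so whether it is the right call depends on the data.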

quartz monolith
#

I have 7 categorical features and i want to use smotenc

from collections import Counter
from numpy.random import RandomState
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)

Does somebody have an idea what the error means?
ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6

hallow wave
#

it expected n_neighbors (6) to be less than or equal to n_samples, and n_samples is 1

quartz monolith
#

I googled the problem, seems that my data set is too small

desert oar
#

@quartz monolith that seems wrong. you shouldnt have only 1 sample

#

what is X_train.shape?

#

and y_train.shape?

#

unless that means 1 sample in a particular category
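
One way to check the "1 sample in a particular category" theory is to count samples per class. This is a minimal sketch with made-up labels; it assumes the oversampler uses the usual SMOTE-style default of `k_neighbors=5`, which would match the `n_neighbors = 6` in the error (k plus the sample itself).

```python
from collections import Counter

# Hypothetical labels: class "c" has a single sample
y_train = ["a"] * 10 + ["b"] * 7 + ["c"] * 1

k_neighbors = 5  # assumed default; the error's n_neighbors = 6 is k_neighbors + 1
counts = Counter(y_train)

# Classes with too few samples to find k_neighbors other neighbours
too_small = {label: n for label, n in counts.items() if n < k_neighbors + 1}
print(too_small)
```

If any class shows up in `too_small`, the options are roughly to lower `k_neighbors`, merge or drop the rare classes, or collect more data for them.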

quartz monolith
#

y_train = 4287 shape

#

X_train = (4287, 8)

#

Here is someone with a similar problem
https://stackoverflow.com/a/48820222/11811575
but I dont understand it...

desert oar
#

how many classes do you have

quartz monolith
#

Label classes are 31

#

And feature 7

#
```python
sm = SMOTENC(random_state=42, categorical_features=[X])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
```

got now `ValueError: cannot copy sequence with size 5717 to array axis with dimension 8`
desert oar
#

how many of each class do you have @quartz monolith

#

i dont think categorical_features=[X] is right. is X a matrix?

#

i think you would need to use the column numbers instead

#

im not 100% sure

#

what library is this?

quartz monolith
#

X is wrong i need to use array

#
```python
sm = SMOTENC(random_state=42, categorical_features=[0,7])
X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
x_res, y_res = sm.fit_resample(X, y)
```
0,7 should be right
#

My X_train.shape = (4287, 8)

#

By class you mean the number of my features?

floral lodge
#

@desert oar it does but I would have to buy a very expensive license to get the full scripting functionality which is why I wanted to create a gui bot in python for it

#

I actually found some corner detection and edge detection stuff in opencv that I'm going to look into

#

thanks!

void anvil
#

Got another question on the take_action portion. Again, using the stock example code is coming from:

```python
    # Buy amount % of balance in shares
    total_possible = self.balance / current_price
    shares_bought = total_possible * amount
    prev_cost = self.cost_basis * self.shares_held
    additional_cost = shares_bought * current_price

    self.balance -= additional_cost
    self.cost_basis = ((prev_cost + additional_cost) /
                       (self.shares_held + shares_bought))
    self.shares_held += shares_bought
```

So right here it's calculating the max amount it can buy in the "total_possible" line.

If we wanted to arbitrarily limit it to a max amount, say 10000 shares, we could change it to:
```total_possible = min(self.balance / current_price, 10000)```

Hopefully, over the course of time, the algorithm will "learn" that it can't place a buy bigger than 10000 at a time (and we wouldn't expect a larger output than that when the algo is finished training).

But what if we want to limit it to some amount dependent on the next time step, which the algorithm SHOULD NOT have access to at time t, because the assumption of an unlimited purchase of a stock or w/e doesn't make a lot of sense? For this example, we'll limit it to an arbitrary 10% of the volume of the next time period, vol_t+1.

```total_possible = min(self.balance / current_price, 0.1 * next_period_vol)```

By placing this restriction in the take_action space, is it actually being fed a bit of information from the future, or is this restriction put in the right place? Should the RL agent try to place a larger buy / sell than possible and have the action be restricted elsewhere, so that no "cheating" information is passed back?
silent swan
#

porting between pytorch and TF code is LOL

lapis sequoia
#

Hey does anyone have a recommended textbook for Machine Learning/Data Science?

exotic cedar
#

suggest starting with this one

lapis sequoia
#

that's a very good pick

#

skip the tf, go straight to keras

quartz stream
#

@exotic cedar Thanks for the Useful Link !

#

Really Appreciated !

#

💯

lean ledge
#

@exotic cedar Piracy isn't allowed here

#

(@lyric canopy)

lyric canopy
#

No, it's not

#

@exotic cedar Please don't share pirated works or discuss piracy on our server.

#

!rule 5

arctic wedgeBOT
#

5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.

exotic cedar
#

lol aight

lofty girder
#

Question about pandas, does it count as a multi index if I use set_index([column1, column2]) or is that just a caveman mimicry of a multiindex?

silk forge
#

this simple linear regr model come out well?

dense rose
#

Is there good library for displaying graphs in Jupyter that you can modify and live update?

#

I mean like actual graph with nodes and edges.

silk forge
#

matplotlib

earnest prawn
#

fwiz that is not a graph with nodes and edges

#

he is talking about a graph in the cs definition

#

my goto for that would always be graphviz but idk about that in jupyter

#

@silk forge @dense rose

haughty wind
#

Is there a function within Keras/TF that lets you add weights to the training data? Some of my data is higher quality and I'd like to give it more weight when fitting my NN. Currently I'm using ImageDataGenerator and flow_from_directory, if that helps any.

#

Ideally I'd want to assign different specific directories with higher/lower weights

crude bloom
#

off the top of my head I don't know but you could always add repeats of the data during training @haughty wind

#

basically just copy the data you value more so that the model sees it multiple times per epoch
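
The "copy the data you value more" suggestion can be sketched with index repetition. The arrays and repeat counts below are hypothetical; with a directory-based image pipeline the same effect would come from duplicating files, which this sketch does not cover. Keras `Model.fit` also accepts a `sample_weight` argument for array inputs, though the chat does not settle whether that fits a `flow_from_directory` setup.

```python
import numpy as np

# Hypothetical dataset: 6 samples, the first 2 are the "high quality" ones
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
repeats = np.array([3, 3, 1, 1, 1, 1])  # how many copies of each sample to keep

# Repeat each row index according to its count, then index into the data
idx = np.repeat(np.arange(len(X)), repeats)
X_weighted, y_weighted = X[idx], y[idx]

print(idx)
print(X_weighted.shape)
```

Shuffling `idx` before training keeps the duplicated samples from landing in the same batch.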

quartz monolith
#

still cant figure out how to use smotenc on my data set 🤔 @desert oar

serene veldt
#

Has anyone had success installing tensorflow_datasets on tf 2.0?
Always get the same error when importing
AttributeError: module 'tensorflow._api.v2.autograph.experimental' has no attribute 'do_not_convert'

lapis sequoia
#

can you compare versions

#

where that attribute is originally from and whether it was taken out

silent swan
#

I recommend PyTorch

#

it gud

mossy dragon
#

yo where raggy

lean ledge
#

Hm? @mossy dragon

mossy dragon
#

yo

#

im going over my calc

#

do i need to review implicit differentiation?

lean ledge
#

Doesn't really come up but good to have some confidence with manipulating differentials

supple ferry
#

@lofty girder yes it creates a multi index
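
A quick way to confirm this, as a small sketch with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "b"],
    "product": ["x", "y", "x"],
    "sales": [10, 20, 30],
})

# Passing a list of columns to set_index builds a MultiIndex
indexed = df.set_index(["store", "product"])

print(type(indexed.index).__name__)
print(indexed.index.nlevels)
print(indexed.loc[("a", "y"), "sales"])  # look up one row by both index levels
```

Tuple lookups such as `("a", "y")` work because the two columns became the two levels of the index.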

surreal nacelle
#

Hey, do you think that this is good enough ?
I'm trying to do some 'data augmentation' by rotating each element of the dataset by -5 degree. The rotated image is a little noisy tho. Should I take the time to denoise it ? (and should I apply a stronger rotation to the images ?)

lethal spade
#

anyone here using pandas? I'm trying to run this code

#

I just have the ones I already had

lapis sequoia
#

looks like a bunch of pickles

#

in a dataframe.. hands down the weirdest thing I've seen yet

lethal spade
#

ahah, yeah, still have to update the name

quartz stream
#
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip
!pip install googletrans

import pandas as pd
# Dataset from - https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
df = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Output printing out first 5 rows
df.head()

from googletrans import Translator
translator = Translator()
df2 = pd.DataFrame()
df2 = pd.DataFrame(columns=['label', 'sms_message'])
for i,j in zip(df['sms_message'],df['label']):
  text = translator.translate(i, dest='hi')
  df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)

#

can anyone help me

#

the above code throws this error

#
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-26-059f1e8ecbf9> in <module>()
      4 df2 = pd.DataFrame(columns=['label', 'sms_message'])
      5 for i,j in zip(df['sms_message'],df['label']):
----> 6   text = translator.translate(i, dest='hi')
      7   df2 = df2.append({'sms_message': text.text,'label' : j},ignore_index = True)

6 frames
/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
#

The output of df2 is only till 530 value but the original file has more than 6000 values

lapis sequoia
#

what are you trying to do

#

if it's a tab separated file, why don't you open it with builtins in pandas..

#

oh.. I just saw read_table..

#

not sure what you're trying to do here man..

#

what is df2

desert oar
#

@lapis sequoia it looks like they're building up df2 by translating the contents of df

#

@quartz stream something is wrong with whatever is in i

#

you can try printing the data, or using %debug in ipython to investigate

quartz stream
#

@desert oar

#

I tried printing i

#

its a standard dataset

#

it shows all the value

#

@desert oar Yes you guessed it correct I am trying to translate df into df2

#

Why dont you try the link I have given the data file also

surreal nacelle
#

Any ideas why model.predict takes an abnormal amount of time to finish ? The model.fit takes a minute or so, and predict on 3% of the dataset takes 10

desert oar
#

@surreal nacelle depends on the model. seems weird though

surreal nacelle
#

KNN on mnist

desert oar
#

oh

quartz stream
#

@desert oar Any idea on my question

desert oar
#

@surreal nacelle sklearn? KNeighborsClassifier?

surreal nacelle
#

Yep

desert oar
#

@quartz stream "normal dataset" doesn't really help. i'm not in a position to start downloading data files and debugging right now

#

the translator is expecting something different from what you gave it.. thats the best i can offer

#

im not familiar w/ that library

quartz stream
#

Its 200kb

#

dataset

#

and the translator is working fine

#

for half the dataset

#

it is just not completing

#

so the code is fine

#

I translate and print every value

#

it does

#

I guess there is something wrong with adding values in database

#

i mean pandas*

desert oar
#

so what is that error message from then

#

is it working, or is it not working?

quartz stream
#

it is not

desert oar
#

well look at the error

#

it's clearly related to translate and not pandas

#

there is a bad element in your data somewhere
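
A sketch of how the bad element could be tracked down: wrap each call in `try`/`except`, record which item failed, and keep going. The `translate` function here is a hypothetical stand-in for the real translator call (it just upper-cases text and fails on blank input), so no network access is needed. `json.JSONDecodeError` is a subclass of `ValueError`, so catching `ValueError` would also catch the error in the traceback above.

```python
def translate(text):
    """Hypothetical stand-in for translator.translate(text, dest='hi').text."""
    if not text.strip():
        raise ValueError("Expecting value")  # mimic a failure on bad input
    return text.upper()

# Made-up messages and labels for illustration
messages = ["hello", "free prize", "   ", "see you soon"]
labels = ["ham", "spam", "ham", "ham"]

rows, failures = [], []
for pos, (msg, label) in enumerate(zip(messages, labels)):
    try:
        rows.append({"label": label, "sms_message": translate(msg)})
    except ValueError as err:
        # Record which element failed instead of stopping the whole loop
        failures.append((pos, msg, str(err)))

print(len(rows), failures)
```

Collecting plain dicts in `rows` and building the DataFrame once at the end (for example with `pd.DataFrame(rows)`) also avoids appending to a DataFrame inside the loop.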

#

@surreal nacelle what distance are you using?

surreal nacelle
#

I'm using the default values for now, which is n_neighbors=5

desert oar
#

so algorithm='auto'?

surreal nacelle
#

Yes

desert oar
#

and metric='minkowski'?

surreal nacelle
#

Yes

#
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
desert oar
#

can you print the ._fit_method attribute on the model

#

tree lookups should be fast

surreal nacelle
#

Sure, gimme a second, need to refit

#

model._fit_method = kd_tree

desert oar
#

hm. im not that well versed in the details of kd-trees, but the whole point is that tree lookups are supposed to be fast

#

are the results correct?

surreal nacelle
#

It seems so

#

0.985

desert oar
#

hm. could just be the way it is

surreal nacelle
#

I guess

#

Thanks for the help anyway 😃

surreal nacelle
#

@desert oar Some guy explained to me that knn doesn't train, the .fit simply gives it data so that it can use it to compare during the .predict phase. So it does make sense that it takes much longer to predict than to 'train' (no training)

silent swan
#

KNNs are a nonparametric method, so it doesn't learn anything, so there's nothing to fit

desert oar
#

@silent swan sklearn doesnt build a tree when you call fit()?

silent swan
#

actually you're right it probably does do it then
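
To illustrate why predict is the expensive step, here is a toy brute-force nearest-neighbour classifier. It is a sketch for intuition only, not how sklearn implements `KNeighborsClassifier`: as noted above, sklearn can build a tree during `fit`, whereas this version only stores the data. The class name and the tiny dataset are made up.

```python
import numpy as np

class BruteForceKNN:
    def fit(self, X, y):
        # "Training" only stores the data
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X, k=3):
        X = np.asarray(X, dtype=float)
        # All the work happens here: distance from every query to every stored point
        dists = np.linalg.norm(X[:, None, :] - self.X_[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, :k]
        votes = self.y_[nearest]
        # Majority vote among the k nearest labels
        return np.array([np.bincount(row).argmax() for row in votes])

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = BruteForceKNN().fit(X_train, y_train)
pred = model.predict(np.array([[0.2, 0.1], [5.5, 5.5]]))
print(pred)
```

The distance matrix has one entry per (query, training point) pair, so prediction cost grows with the size of the training set, which is consistent with predict being slower than fit.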

lapis sequoia
#

I need some help

#

I have a column that contains a list.. I want to split it and add them to new columns in the dataframe.. but, not all rows have equal number of items in the list

#

how should I approach this
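
One possible approach, assuming the column already holds Python lists (not strings): build a new DataFrame from the lists and let pandas pad the shorter rows with missing values. The column names `id` and `tags` and the `tag_` prefix are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "tags": [["a", "b", "c"], ["d"], ["e", "f"]],
})

# One new column per list position; shorter lists are padded with missing values
expanded = pd.DataFrame(df["tags"].tolist(), index=df.index).add_prefix("tag_")
result = df.drop(columns="tags").join(expanded)

print(result)
```

Rows with fewer items simply get nulls in the trailing columns, which can be filled in later if needed.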

surreal nacelle
#

Hey, how could I see the word instead of seeing the dictionary index?

```python
vect = CountVectorizer(analyzer='word')
bag_of_words = vect.fit_transform(emails)
test_output = vect.transform(['email', 'test', 'hello'])
print(test_output)
```

```
(0, 15725) 1
(1, 51148) 1
(2, 55302) 1
```

desert oar
#

@lapis sequoia is it already a list, or is it a string? and what's the max number of columns?

#

@void anvil i'd just use None personally

#

@surreal nacelle that's odd, my .transform returns a sparse matrix

#

but you would use vect.vocabulary_ to get a mapping from words to indices

#

oh weird. i didnt know sparse matrices acquired a fancy print method

#

yes that is a sparse matrix

#
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun'])
print(type(transf))
print(transf.shape)
print(len(v.vocabulary_))
#

id rather leave null values as null, and fill in later

#

just my style tho

surreal nacelle
#

I got the matrix that way, but as you can see it contains the value of the word index in the dictionary instead of 0 and 1.

desert oar
#

the vocabulary is telling you that "greetings" is in column 21152

#

you can't really get the words back

#

that doesn't make sense

#

that's the whole point of vectorizing

#

words go in, numbers come out

surreal nacelle
#

I understand that, but shouldn't the matrix contain the number of occurrences of each word instead of their indices in the vocabulary?

desert oar
#

it doesnt contain their indices

#

it contains 0s and 1s

#

well it contains more than that

#
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(analyzer='word')
v.fit(['email is okay', 'i like turtles', 'i email my email every turtle'])
transf = v.transform(['email is fun to email with email', 'turtles dont use email but turtles like lettuce'])
print(transf.toarray())
print(v.vocabulary_)
surreal nacelle
#

oh, so it does exactly what I want 😄

#

ohhh

#

the np.argmax returns the index

#

in the array

desert oar
#

no

surreal nacelle
#

not its content

desert oar
#

err

#

yeah

#

but

#
email_counts = transf.toarray()[:, v.vocabulary_['email']]
#

you want to know the most frequent word in each document?

surreal nacelle
#

For example yes

desert oar
#

the vocabulary should have unique values as well as indices, so you can "invert" it

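import numpy as np
vocab_inverse = {val: key for key, val in v.vocabulary_.items()}
# argmax on a sparse matrix returns a 2-D result, so flatten it to plain integers first
most_freq_idx = np.asarray(transf.argmax(axis=1)).ravel()
most_freq_word_per_document = [vocab_inverse[i] for i in most_freq_idx]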
surreal nacelle
#

thank you, once again 😄

crimson trellis
#

Hey everyone, I'm new to ML, and got a task to predict sales for dataset1 using dataset2. Dataset2 has data for 4 weeks in advance, how would I go about doing this? Currently built a linear regression model for dataset2 with .94 rsquared

desert oar
#

@crimson trellis seems like a good start. how many data points do you have?

crimson trellis
#

2.7mil from dataset2

desert oar
#

oh thats a lot

crimson trellis
#

running it on bigquery

desert oar
#

you can use a train/test split or cross validation, and measure the accuracy of your model

crimson trellis
#

I graphed the linear regression predictions against the actual data and it was pretty spot on

desert oar
#

yeah thats probably good enough

#

but you can use the holdout set to be sure

#

implementing CV in bigquery would be annoying

#

but you can reserve eg 1/4 of your data and not train on it

#

then do the prediction on it and measure accuracy

crimson trellis
#

got it. I think I saw something like that

#

I'm stuck on figuring out how to use the model for a forecast though...

desert oar
#

that's a bigquery specific question i'm afraid

#

and i wouldnt know the answer. maybe someone else does

crimson trellis
#

gotcha. All the examples I've seen don't really have a date forecast, and instead do something like 'airline delays, taxi fares, etc'

desert oar
#

if time is involved things are a bit more complicated

#

can you describe datasets1 and 2 in more detail

#

how they related to each other etc

crimson trellis
#

it's just date/product/sales/units/store1 - dataset1
same thing for dataset2, but different store

#

I was thinking of using a coefficient store1/store2 and apply it to the model to predict store1 using store2 performance

desert oar
#

what do you mean

#

what is store1? some kind of performance number?

crimson trellis
#

no that's walmart

#

store2 is target

desert oar
#

so you're predicting, e.g. walmart sales using store2 sales?

crimson trellis
#

yep

#

because store2 has data coming in daily, and walmart has data every month instead

desert oar
#

how are you running that regression then

crimson trellis
#

so the model is trained using target only

#

and not entirely sure if this is the right way to do it, but I want to use it to predict walmart data based on (walmart sales last month/target sales last month)

desert oar
#

so you basically took the monthly average of target sales, then predicted this month's walmart sales using last months' walmart-target ratio

crimson trellis
#

that's what I'm thinking of doing

#

does it make sense? 🤔

desert oar
#

hm. the math of linear regression won't like that, you're going to have non-independently distributed data

#

as for testing it, you can only forecast 1 month ahead at best

#

unless you start forecasting target as well

crimson trellis
#

I do have about 3 years of data in that 2.7mil records

desert oar
#

unfortunately that doesnt help

crimson trellis
#

ah

desert oar
#

it means your model can learn more ,but it doesnt fix the fundamental issues

crimson trellis
#

how would you approach this? I'd like to get it right without just throwing something up quickly

#

and that's how my coefficient thing feels

#

just a quick solution

desert oar
#

model being valid or not, the way you would test it is by "sliding" over the data. say you have 24 months of data and you reserve the last 6 months for testing. then you train on the first 18 months and evaluate on the 19th month. then you train on the first 19 months and evaluate on the 20th. and so on until you run out of months. and then you can do mean squared error of all the evaluation points
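
The sliding scheme described above can be sketched as a small generator. The function name `walk_forward_splits` is made up for this example; months are represented by 0-based indices, so index 18 is the 19th month. sklearn's `TimeSeriesSplit` offers a similar expanding-window split if a library version is preferred.

```python
def walk_forward_splits(n_periods, n_test):
    """Yield (train_indices, test_index) pairs with an expanding training window."""
    first_test = n_periods - n_test
    for test_idx in range(first_test, n_periods):
        # Train on everything strictly before the evaluated period
        yield list(range(test_idx)), test_idx

splits = list(walk_forward_splits(n_periods=24, n_test=6))
for train_idx, test_idx in splits:
    print(f"train on months 1-{len(train_idx)}, evaluate on month {test_idx + 1}")
```

Each evaluation only ever uses data from earlier months, which is what keeps future information out of the test score.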

crimson trellis
#

Ok, I get that. Then using the model, I would use it to predict future target sales

desert oar
#

wait what

#

you would be predicting walmart sales

crimson trellis
#

ah ok, I lost track

desert oar
#

but again you can only predict 1 month in advance

#

because you need last month's target sales

crimson trellis
#

yes

#

ok

desert oar
#

and also this model is likely to have other issues

#

due to the fact that you're violating the iid assumption

crimson trellis
#

I mean, it is what it is at the moment

#

I can use things like # of walmart stores, and # of different products sold at each location

desert oar
#

actually wait. it might be okay w/ least squares actually

#

yeah, you know what? this should be fine

#

just make sure youre using the testing strategy i described

#

otherwise you will be "cheating" and using future data to predict past data

#

which inflates your accuracy

crimson trellis
#

yep

#

time to figure out how to do this now. thank you 😃

desert oar
#

good luck

lapis sequoia
#

yo i need help with data cleaning

#

@desert oar are you available?

desert oar
#

for a little bit longer, yeah

lapis sequoia
#

im cleaning lyrics scraped from genius

#

I wanna remove text like "verse 1" "intro" "chorus" and brackets/punctuations
fortunately I found this python script that seems to do the job well

#

but

#

well here's the original data

#

data output after applying the cleaning function:

desert oar
#

sorry, i cant help with that

lapis sequoia
desert oar
#

scraping is against their TOS

#

and its against the rules for us to help with TOS violations

lapis sequoia
#

whose TOS?

desert oar
#

!rule 5

arctic wedgeBOT
#

5. We will not help you with anything that might break a law or the terms of service of any other community, site, service, or otherwise - No piracy, brute-forcing, captcha circumvention, sneaker bots, or anything else of that nature.

desert oar
#

genius

lapis sequoia
#

oh really?

desert oar
#

https://genius.com/static/terms

Except as expressly authorized by Genius in writing, you agree not to modify, copy, frame, scrape, rent, lease, loan, sell, distribute or create derivative works based on the Service or the Genius Content, in whole or in part, except that the foregoing does not apply to your own User Content (as defined above) that you legally upload to the Service. In connection with your use of the Service you shall not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods. Any use of the Service or the Genius Content other than as specifically authorized herein is strictly prohibited. As between you and Genius, the technology and software underlying the Service or distributed in connection therewith is the exclusive property of Genius, our affiliates and our partners (the "Software"). You agree not to copy, modify, create a derivative work of, reverse engineer, reverse assemble or otherwise attempt to discover any source code, sell, assign, sublicense, or otherwise transfer any right in the Software. Any rights not expressly granted herein are reserved by Genius.

#

sorry

lapis sequoia
#

aight thank you

surreal nacelle
#

I'm trying to figure out if having a dataframe with 60k columns (1 for each words in the dictionary) is ok ? @desert oar

#

It seems a little much

desert oar
#

Why do you want that

#

I mean, I get why you might want that for convenience?

#

Can put sparse data into a data frame

surreal nacelle
#

It's the flattened sparse matrix to feed the algorithm

desert oar
#

Why do you need a data frame

#

Most sklearn models accept sparse data
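
As a rough illustration of feeding the sparse matrix straight into a model, with no DataFrame and no dense conversion: the emails and labels below are made up, and `MultinomialNB` is just one example of an sklearn estimator that accepts sparse input, not a recommendation for this dataset.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up data: 1 = spam, 0 = not spam
emails = ["win a free prize now", "meeting at noon", "free prize inside", "lunch at noon"]
labels = [1, 0, 1, 0]

vect = CountVectorizer(analyzer="word")
X = vect.fit_transform(emails)   # scipy sparse matrix, never densified
print(sparse.issparse(X), X.shape)

# The estimator is fit on the sparse matrix directly
model = MultinomialNB().fit(X, labels)
pred = model.predict(vect.transform(["free prize at noon"]))
print(pred)
```

With tens of thousands of vocabulary columns, keeping the data sparse avoids allocating a mostly-zero dense table.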

surreal nacelle
#

I see

#

I'll try that then

desert oar
#

What model are you trying to use in this particular case

surreal nacelle
#

no ideas yet

desert oar
#

Usually you need to convert from data frame to matrix anyway

surreal nacelle
#

gonna try a bunch

silent swan
#

man tensorflow is black magic built on black magic

desert oar
#

yes

#

i dont know why they dont just give me a damn api to construct a graph manually

#

instead of all this as_default stuff

earnest prawn
#

at that point you might as well just manually play with numpy stuff

desert oar
#

except not at all? the whole point of TF is that you're constructing a differentiable graph

#

and that gets sent back down to the C++ framework for processing

#

its just the python API is extremely confusing and the documentation is unclear

#

(to me)

silent swan
#

my experience so far with TF is

#

there're tons of ways to do the same thing

#

so it's actually easy to write code

#

but it's hell to read other's because you have no idea what their workflow is

#

the documentation problem is compounded by how often the API / "best practice" changes

#

like even running the official model code gives you a ton of deprecation warnings

earnest prawn
#

thats just the nature of huge code bases really

silent swan
#

pytorch is great though, everybody get on the pytorch train

earnest prawn
#

I have never actually done useful things with pytorch but what Ive seen looks good

desert oar
#

im hoping things stabilize in TF after 2.0

silent swan
#

it's very pythonic. about the only "surprising"/obscured thing is that gradients are stored in state, and sometimes the dataloaders hide crazy stuff from you

#

otherwise the code does about exactly what you think it does when you read it

earnest prawn
#

nah its just gonna be like the internal APIs of linux which are never guaranteed to be stable and allowed to be subject of change every commit @desert oar

desert oar
#

i hope not

#

every time i load a model from a checkpoint i feel like im doing something wrong

lunar leaf
#

embrace keras

sterile remnant
#

import pandas as pd

#

import matplotlab.pyplot as plt

#

import numpy as np

#

data = pd.read_csv('pornhub.csv')

#

print(data.dtypes)

#

print(data.index())

#

print(data['pornstar'].unique)

#

data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']

#

plt.show()

desert oar
#

if you want to enter code, @sterile remnant , it will be easier to read if you use code block formatting

#

!codeblock

arctic wedgeBOT
#
codeblock

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')
sterile remnant
#

thanks for your help bro @desert oar it really helped

#

'''import pandas as pd
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y = ''satisfaction" , color = '' fapability '']
plt.show()
'''

desert oar
#

no problem. did you have a question you wanted to ask?

#

use ` not '

#

on an american keyboard it's on the same key as ~, not sure what keyboard you have

sterile remnant
#

ok gotcha bro

#
import matplotlab.pyplot as plt
import numpy as np
data = pd.read_csv('pornhub.csv')
print(data.dtypes)
print(data.index())
print(data['pornstar'].unique)
data.plot[x=''di**k_size , y =  ''satisfaction" , color = '' fapability '']
plt.show()
#

thanks

desert oar
#

👍

#

also i assume you mean to write

data.plot(x="di**k_size", y = "satisfaction", color = "fapability")
#

right now you have doubled ' and [ instead of (

#

great variable name btw

sterile remnant
#

bro i wanna do some scikit stuff but i am finding a really hard time to do so help me out please

#

yeah i was focusing more on my variables instead of syntax

#

lol

desert oar
#

hard to say what help you need... do you have a specific objective in mind?

sterile remnant
#

yep, i've got a project on supervised learning. i've seen a tutorial on it but it all went over my head, and i want to learn it. do you have any suggestions for how i can wrap my head around it?

#

like some link or something else anyone

desert oar
#

@sterile remnant what is your level of programming and math knowledge?

#

i'd start by maybe working through a beginner book or online course

sterile remnant
#

@desert oar bro i am a noob at this and i have been handed this project, so i was looking for some material to learn about it.

desert oar
#

its hard to help without more context

sterile remnant
#

yep so it is but i have looked into it

#
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
print(knn.score(X_test, y_test))
#

like this code gives you a prediction of which class the given data belongs to

#

and like this bit and pieces i been handling my shit

#

though thanks @desert oar for your consideration

quartz monolith
#

we want to see the graph

desert oar
#

i wanna know where this data came from tbh

#

i really regret not applying for a job at pornhub when i had the chance

unkempt helm
#

👀

quartz monolith
#

i see you mean data hub 🤔

desert oar
#

i told my mother about it and she was not happy with the idea. it was in montreal too

#

would have loved to have an excuse to move to montreal

quartz monolith
#

๐Ÿ† data wrangle

desert oar
#

i wondered about that

#

ive heard mixed things about "porn tech"

#

maybe they still have data science jobs

quartz monolith
#

yeah technology magazines makes me stimulated

#

especially MIT Technology Review

#

somebody ever worked with word2vec?

desert oar
#

i have yeah

#

did you ever figure out your sampling issue btw

quartz monolith
#

no 😦 i dont know why, but we figured out something important: some errors have more than one label. The classification model cant work. I want to build a corpus and interview the experts on how the text vector predicts words, and use it as a sentence classifier later to create new keywords which will be the new labels of the classification

#

tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=5000, random_state=2300)

#

salt whats the best way to predict a sentence with w2v?

desert oar
#

what do you mean the classification model cant work?

#

the way you'd use a word embedding like w2v is you'd generate a word vector for each word in the sentence, then average them to get a vector for the whole sentence, then feed that into a classifier as your feature vector
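
A minimal sketch of that averaging step, assuming a toy lookup table in place of a trained w2v model. The `toy_vectors` dict and the `sentence_vector` helper are made up for illustration; with a real model you would look each word up in the trained embedding instead.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings"; a real w2v model would supply these.
toy_vectors = {
    "printer": np.array([1.0, 0.0, 0.0]),
    "jam": np.array([0.0, 1.0, 0.0]),
    "paper": np.array([0.0, 0.0, 1.0]),
}


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        # no known word at all: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)


features = sentence_vector(["printer", "jam", "unknown"], toy_vectors, dim=3)
print(features)
```

The resulting fixed-length vector is what would be passed to the classifier as one row of the feature matrix.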

quartz monolith
#

yes i want to build feature vectors for the troubleshooting text classification and after that use them as labels

#

the errors are connected with too many of the same label. The model can't predict the right outcome or has too many outcomes, which makes it really hard

desert oar
#

have you done a cross tab of errors and labels

#

like with pd.crosstab(errors, labels) and then plotted it using plt.imshow
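
A small sketch of that cross tab, using made-up `errors` and `labels` values (the real columns are not shown in the chat). The plotting step is left as a comment so the snippet runs without a display.

```python
import pandas as pd

# Hypothetical data: one error code and one assigned label per ticket.
errors = pd.Series(["E1", "E1", "E1", "E2", "E2", "E3"], name="error")
labels = pd.Series(["reset", "reset", "replace", "replace", "clean", "clean"], name="label")

# Rows are error codes, columns are labels, cells are counts.
table = pd.crosstab(errors, labels)
print(table)

# Row-normalised version: the share of each label within an error code.
shares = pd.crosstab(errors, labels, normalize="index")
print(shares)

# To view it as a heat map: plt.imshow(table.values) with matplotlib.pyplot as plt.
```

Rows that spread their counts over many columns are the errors that map to several labels.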

quartz monolith
#

I did it with confusion matrix

desert oar
#

yeah

#

that would work i guess

quartz monolith
#

i had a 130k data set and 51%. I used a decision tree to understand how it splits the data and why the model is misleading the prediction, to optimize the knowledge base

#

another problem was whether certain rows had values missing not at random (NMAR, e.g. laziness) or missing values that were just not relevant

desert oar
#

ah

#

there are probably ways around that, but

quartz monolith
#

hm

vale arrow
#

could use some assistance in sklearn if this is the applicable channel

quartz monolith
#

Yes?

vale arrow
#

so I'm trying to do kfold cross validation with cross_val_score

#

and it was working just fine until today

#

and now when i run it I get like a hundred lines of traceback

#

and I have no idea what any of it means

#

"Fatal Python error: initfsencoding: unable to load the file system codec" is the first line

quartz monolith
#

what does your data look like?

vale arrow
#

as in shape?

quartz monolith
#

i found something about pyinstaller with sklearn

vale arrow
#

next to nothing on that page makes any kind of sense to me

desert oar
#

@vale arrow what code are you running?

#

and how are you running it?

vale arrow
#

Give me a second. Im reinstalling my ide just to see if that fixes something

#

now i literally just can't install packages

#

it's like every time i touch something relatively new that i need to install the entire program shatters

quartz monolith
#

what about your interpreter?

vale arrow
#

yea this just didn't do anything

#

so here is my code

#

i just did

#

now i just can't send my code

#

just fucking kill me

#

kFoldScores = scoreModel(xTrain = XTrain, yTrain = yTrain)
this is the line that messes everything up^

#

This is the function it's calling:

def scoreModel(xTrain, yTrain):
    checkingNetwork = KerasRegressor(build_fn = buildNetwork, batch_size = 10, epochs = 100)
    accuracies = cross_val_score(estimator = checkingNetwork, X = xTrain, y = yTrain, cv = 5, n_jobs = -1)
    #print("Standardized: %.2f (%.2f) MSE" % (results.mean(), results.std()))
    return accuracies
#

and it's like i get different errors everytime

#

if i take that function out and just do k fold manually everything is fine

#

but i don't understand what's wrong with that function

#

okay I think I've fixed it. it works if i just specify one cpu to use instead of all and i don't get it but holy crap that was a headache

quartz monolith
#

good to know

quartz monolith
#

fastTEXT 🚀

desert oar
#

@void anvil yeah ive heard ray and modin are not anywhere near ready for production

#

its kind of baffling that companies arent willing to pick up and put money towards these kinds of projects

grizzled folio
#

Hey all, I'm kind of overthinking myself into a hole here... I have for example hourly (2.8e-4 Hz) data for a week, and I want to filter out sub-inertial frequencies (anything below 5e-5 Hz). There's a lot of stuff about IIR and FIR, windows, filter order, how to apply the filter, but this all seems like overkill? I'm just looking for something fairly simple!

lean ledge
#

@grizzled folio apply a scipy Butterworth filter with that cutoff frequency

#

That's as simple as it gets

grizzled folio
#

@lean ledge cool, that's a handy pointer! I still need to provide the order of the filter though, I don't know how to choose that (unless I use buttord?)

lean ledge
#

As a bit of nuance, you can't remove the filtered frequency completely. Higher order removes things more. Look at the Bode plot for the filter you construct to see how the gain varies at different frequencies

#

Gain is in dB which is a log scale
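
A hedged sketch of the suggestion above, assuming a 4th-order high-pass Butterworth (the order is a guess, and the test signal is synthetic). It uses `scipy.signal.butter` with second-order sections, checks the gain in dB at two probe frequencies, and applies the filter forwards and backwards so there is no phase shift.

```python
import numpy as np
from scipy import signal

fs = 1.0 / 3600.0   # hourly sampling, about 2.8e-4 Hz
cutoff = 5e-5       # Hz; everything below this should be suppressed
order = 4           # assumed; higher orders give a sharper rolloff

# Second-order sections are numerically safer than (b, a) coefficients.
sos = signal.butter(order, cutoff, btype="highpass", fs=fs, output="sos")

# Gain at a slow (48 h period) and a fast (3 h period) probe frequency.
probe = np.array([5.8e-6, 9.26e-5])
_, response = signal.sosfreqz(sos, worN=probe, fs=fs)
gain_db = 20 * np.log10(np.abs(response))
print(gain_db)

# Synthetic four-week hourly series: a slow oscillation plus a fast one.
t = np.arange(24 * 28) * 3600.0
slow = 2.0 * np.sin(2 * np.pi * probe[0] * t)
fast = np.sin(2 * np.pi * probe[1] * t)

# Zero-phase filtering; the slow component should mostly disappear.
filtered = signal.sosfiltfilt(sos, slow + fast)
```

Raising `order` pushes the slow component down further, at the cost of longer edge transients on short series.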

grizzled folio
#

Aha, that's handy. So why wouldn't you just crank the order way up? Increased computation? Artifacts?

lean ledge
#

Instability/Artifacts. Computation scales linearly too. Importantly, people working on signal processing are generally also often doing it on hardware. Higher order = more components, more cost, more board space etc

#

You can increase the order to an extent. After that you should come up with different strategies if you need sharper rolloff

#

Eg you can trade the smooth region of a Butterworth filter for one with more ripples to get a sharper rolloff (eg an elliptic or Chebyshev filter)

grizzled folio
#

Cool, this is at least a starting point. Computational cost may be a factor since I'm filtering a few tens of millions of timeseries. I'll see how things look with Butterworth and if that's working for me. Thanks!

#

I appreciate the real practical condensation of different types into their effects, I was really struggling to find any literature that didn't immediately get super technical about it

lean ledge
#

Yeah signal processing is a rough subject to learn for someone who didn't learn it as part of formal electrical engineering education

grizzled folio
#

I got some of it in applied math courses, but never got the chance to apply it

lean ledge
#

There's a lot of nuance to signal processing so it's hard to simplify stuff down and ignore some very technical aspects

#

I don't think maths generally goes over signal processing

#

Except maybe at a grad level

grizzled folio
#

Well, I think it was that course! It was very much on the applied/computational side of things

lean ledge
#

This is something you'd learn in electrical engineering. Would be very surprised if other people are learning it much

grizzled folio
#

I definitely remember talking about IIR systems, and probably designing them... So I could probably understand the nuance if I wanted to get into it, but for the moment I just need something high-level that works and I can build upon

lean ledge
#

Huh

grizzled folio
#

Aha, I think it was something like "Scientific and Industrial Modelling"

#

Anyway, that was quite a while ago 😉

earnest prawn
#

maturity is eh

#

arguable

#

i think its more about the distributed approach for adept

#

why is the name schulman so familiar to me

#

if you want something in oss contribute it lol

lapis sequoia
#

@desert oar it's a list of objects, I have to get an item from each object.. the max number of columns can be 6

desert oar
#

@void anvil for future reference i know 0 about RL

#

@lapis sequoia what was that in reference to again? ping me in the AM

silent swan
#

great, a tensorflow version from less than a year ago breaks on 3.7

lapis sequoia
#

oops.. sorry.. @desert oar I have this column in my dataframe, each row has a list..

#

the list is a bunch of objects I can pass through a function... the number of items in that list can be max of 6..

#

or the list can be empty.. I want to split this column into multiple columns based on this
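
One possible way to do that split, sketched with a made-up `items` column holding strings (the real objects and the function they are passed through are not shown here). It assumes pandas pads short lists with missing values and then forces exactly 6 columns.

```python
import pandas as pd

# Hypothetical frame: each row holds a list with 0 to 6 entries.
df = pd.DataFrame({"items": [["a", "b", "c"], [], ["d"]]})

# One column per list position; shorter lists are padded with missing values.
wide = pd.DataFrame(df["items"].tolist(), index=df.index)

# Always end up with 6 columns, even if no list in this batch is that long.
wide = wide.reindex(columns=range(6))
wide.columns = [f"item_{i}" for i in range(6)]

result = df.join(wide)
print(result)
```

Any per-object function can then be applied column by column, skipping the missing entries.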

#

thanks.. let me try:)

#

btw if you need any help on RL you can ask.. but I can't really point you to an implementation

silent swan
#

I wonder if I could do a freelance gig where I replicate papers or port things between TF and pytorch

lean ledge
#

Actually sounds like possible freelancing

sterile remnant
#
arr = np.array([10,20,30,40,50])
for i in array:
    print(arr[i])
#

help me with this code, i wanna print this array out

#

but its flashing an error

grizzled folio
#

@sterile remnant for i in array... you don't have any variables called that

#

And secondly, each i will be an element of the array, so you'll be trying to index a 5-element array with 10, 20, etc., which aren't valid

sterile remnant
#

so how should i do it?

supple ferry
#

@sterile remnant , if your array is 1D, e.g. a vector, you can loop through it element by element. If your array is 2D or higher, looping happens over the first dimension by default; for a 2D array that means over rows.
I dont know your reason behind printing the array the way you do above, but you can do:

a = np.array([10,20,30,40,50])

for el in a:
    print(el)
# this will give you every element of the array because it is a vector

b = np.random.random((2, 5))

for row in b:
    print(row)

# this will give you every row of that array because it is a 2D
surreal nacelle
#

Hey, what are my options for removing non-words in my 'corpus'?
Is there a way to keep all the words that are somewhat similar to something from the english dictionary and remove words that aren't ?
Example : 'redhat' is not a word, but I still want to keep it, however 'asdjhasgdja' must go.

#

also, would removing all single-occurrence words be a bad idea ?

supple ferry
#

redhat should be considered as a company name by language models

surreal nacelle
#

What do you mean ?

supple ferry
#
from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    pass  # Not an English word (no WordNet entry found)
else:
    pass  # English word
surreal nacelle
#

Oh nice, so this would keep 'sexyredhead' and remove 'sff' ?

supple ferry
#

I havent tried it unfortunately, can not tell how well it works :)

surreal nacelle
#

Well, I'll give it a shot and keep you updated

granite sierra
#

Is room free?

#

might be a bit far fetched question.

Is there a way to store the loaded matlab file data in variables?

I already know about scipy.io.loadmat(), but I mean after it's been loaded

supple ferry
#

@granite sierra , according to docs scipy.io.loadmat():

Returns
mat_dict : dict
dictionary with variable names as keys, and loaded matrices as values.
#

you can just assign a variable by indexing that dictionary

granite sierra
#

ok

#

by indexing you mean blah.get(key)

supple ferry
#

or blah["key"]

#

you will probably have to transform matrices to numpy arrays if it is not done automatically

#

otherwise, you are safe

granite sierra
#

converting it to an array is literally just this?

a = scipy.io.loadmat('test.mat')
np.array(a)
supple ferry
#

no no

#

a is now a dictionary with variable names as keys and matrices as values

#

lets say you have a variable foo in that matlab file

granite sierra
#

sure

supple ferry
#

you can access it now a["foo"] which will return you the matrix

#

np.asanyarray(a["foo"]) i think

#

or np.array

#

depending on use case

granite sierra
#

I'll be honest

#

wait

#

this is how it's returning it

#
{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
      dtype=[('test', 'O'), ('temp_data', 'O')])}
#
a = scipy.io.loadmat('test.mat')
print(a)
supple ferry
#

yes, you can now access your matrices. a["testfile"] will return you array([[(array([[ 5, 10, 15, 20]])

#

I have never used matlab, so, i may be mistaken 😃

granite sierra
#

ok let me test haha

#

well it did exactly that

#

how do I store the variables now?

#

this is what it returned

[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]
#

also why does it return it with double [[]]

#

like how do I access the first list, the 5, 10, 15, 20

supple ferry
#

you can now do c = np.asanyarray(a["foo"])

#

something like this

granite sierra
#

huh?

#

holy thats going to get messy with lots of data

supple ferry
#

matlab is messy 😄

granite sierra
#

also I think that code is outdated, numpy has no strip function now

#

ok so I did this

#
b = np.squeeze(np.asarray(a['testfile']))

it returned this

(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))

which is obviously just a tuple with a tuple of lists inside, is there anyway to access the first nested tuple?

#

tuples aren't mutable, are they, so there is no way to access by 'index'

granite sierra
#

anybody?
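
For what it's worth, tuples can be indexed (immutable only means the contents can't be reassigned), and what `loadmat` returned here is a NumPy structured array whose fields are named `test` and `temp_data`. The double `[[ ]]` is there because MATLAB treats everything as at least 2-D. A sketch of pulling the fields out, using a hand-built stand-in for `a['testfile']` so it runs without `test.mat`:

```python
import numpy as np

# Stand-in for a["testfile"] as printed above: a 1x1 structured array
# with two object fields. With the real file this would come from loadmat.
dt = np.dtype([("test", "O"), ("temp_data", "O")])
testfile = np.empty((1, 1), dtype=dt)
testfile["test"][0, 0] = np.array([[5, 10, 15, 20]])
testfile["temp_data"][0, 0] = np.array([[1, 2, 3, 4]])

record = testfile[0, 0]                  # the single MATLAB struct
test = record["test"].ravel()            # field by name, flattened to 1-D
temp_data = record["temp_data"].ravel()

print(test)
print(temp_data)
print(test[0])                           # ordinary indexing works from here
```

Passing `squeeze_me=True` to `scipy.io.loadmat` should also remove most of the extra singleton dimensions at load time.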

hollow latch
#

Hi, I have a neural network saved in ONNX format (made with matlab), do you know how to run it in python? Keras doesn't seem to support it and I failed to install caffe2 on my windows 7...

desert oar
#

Ive actually been wondering how to use onnx and tf myself. So good thing i found this

supple ferry
#

@granite sierra have you tried np.asanyarray?

#

It works with nested structures better

granite sierra
#

let me try

supple ferry
#

Also you can reduce the tuple until you get only vectors

#

You can even use numpys own reduce

granite sierra
#

hmm let me see

#

but what ufunc would I do to the vector?

#

nah I dont think that works, unless I'm doing it wrong

supple ferry
#

How did you do it?

#

Show pls

hollow latch
#

@desert oar Thank you, it seems to work!

granite sierra
#
import scipy.io as sci
import numpy as np

a = sci.loadmat('test.mat')

print(a)

b = a['testfile']

print(b)


c = np.squeeze(np.array(b))
d = np.reduce(c)

print(d)
supple ferry
#

Maybe show the outputs too

granite sierra
#

oh sorr

#
runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')
{'__header__': b'MATLAB 5.0 MAT-file Platform: nt, Created on: Thu Jul 25 10:48:49 2019', '__version__': '1.0', '__globals__': [], 'testfile': array([[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]],
      dtype=[('test', 'O'), ('temp_data', 'O')])}
[[(array([[ 5, 10, 15, 20]]), array([[1, 2, 3, 4]]))]]
Traceback (most recent call last):

  File "<ipython-input-530-607ea6f6830d>", line 1, in <module>
    runfile('C:/Users/danilov_d/.spyder-py3/understandfile.py', wdir='C:/Users/danilov_d/.spyder-py3')

  File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/danilov_d/.spyder-py3/understandfile.py", line 21, in <module>
    d = np.reduce(c)

AttributeError: module 'numpy' has no attribute 'reduce'
silk forge
#

how does the train_test_split function work in cases of multiple linear regression

earnest prawn
#

not different from any other model

#

it simply splits your data into test and training data so you can see how your model performs on data it has never seen during training

#

so you can for example diagnose overfitting etc

silk forge
#
x = data.ENGINESIZE , data.CYLINDERS , data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS

dfx = pd.DataFrame(x)
dfy = pd.DataFrame(y)



trainx , testx ,trainy, testy =  train_test_split(dfx , dfy, test_size=0.2 , random_state=8)
#

when im using more than 1 xvalues , can i just do this?

#

@

#

@earnest prawn

earnest prawn
#

I mean I am not 100 percent sure but I dont see any reason why it shouldnt. You could just try I guess ¯\_(ツ)_/¯

silk forge
#

won't work

earnest prawn
#

whats the error?

desert oar
#

@silk forge x is a tuple

#

why are you trying to take columns out of a dataframe then make a new dataframe out of it?

#

i assume data is a dataframe right? if so

dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]   # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS']  # series, equivalent to 1-d array
silk forge
#

yuh

#

oh imma try your thing

#
import numpy
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model as lin
from sklearn.metrics import r2_score , mean_absolute_error , mean_squared_error
from sklearn.model_selection import train_test_split

data = pd.read_csv(filepath_or_buffer="C:/Users/admin/Downloads/FuelConsumptionCo2.csv")

# so the x values are gonna be ENGINESIZE , CYLINDERS  AND FUELCONSUMPTION_COMB

x = data.iloc[: , 4:6]
x["FUELCONSUM"] = data.FUELCONSUMPTION_COMB
y = data.CO2EMISSIONS


dfx = data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']]   # dataframe, equivalent to 2-d array
dfy = data['CO2EMISSIONS']  # series, equivalent to 1-d array



trainx , testx ,trainy, testy =  train_test_split(dfx,dfy, test_size=0.2 , random_state=8)

regr = lin.LinearRegression()
regr.fit(trainx,trainy)

slope = regr.coef_
inter = regr.intercept_

plt.scatter(trainx,trainy,color = "red")
plt.plot(trainx,trainx*slope + inter, color = "blue")
plt.show()

desert oar
#

@silk forge why are you using the iloc there?

#

you can delete that line entirely unless you really need to reduce memory usage by dropping columns
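
A sketch of the multi-feature version end to end, on synthetic data standing in for FuelConsumptionCo2.csv (the numbers are made up; only the column names come from the chat). With three features a single fitted line can't be drawn against `trainx`, so this scores the model on the held-out rows instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CSV; the coefficients below are invented.
rng = np.random.default_rng(8)
n = 200
data = pd.DataFrame({
    "ENGINESIZE": rng.uniform(1.0, 6.0, n),
    "CYLINDERS": rng.integers(3, 13, n),
    "FUELCONSUMPTION_COMB": rng.uniform(5.0, 20.0, n),
})
data["CO2EMISSIONS"] = (
    10 * data["ENGINESIZE"]
    + 7 * data["CYLINDERS"]
    + 9 * data["FUELCONSUMPTION_COMB"]
    + rng.normal(0, 5, n)
)

dfx = data[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]  # 2-d features
dfy = data["CO2EMISSIONS"]                                        # 1-d target

# The split works the same way no matter how many feature columns there are.
trainx, testx, trainy, testy = train_test_split(dfx, dfy, test_size=0.2, random_state=8)

regr = LinearRegression().fit(trainx, trainy)
pred = regr.predict(testx)

print(regr.coef_)              # one coefficient per feature
print(r2_score(testy, pred))   # fit quality on unseen rows
```

A scatter of `testy` against `pred` is the usual picture for this case, since it stays 2-d however many features go in.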

quartz monolith
#

how to get a model info about the parameter of the model in gensim?

#
model = Word2Vec.load('fastTEXT_big_sg_w30_min12_iter15.model')
print("model ready")

something like this
`model.info()`
round jay
#

Hi, I'm applying to an entry level data analyst position, and I submitted a python technical challenge last week and got invited back for an interview including some code review (and SQL whiteboarding too). Would anyone be able to look through my notebook and offer advice prior to my code review?
Deleted

#

(posted in a help and the career challenges as well, this is my last time, apologies)

quartz monolith
#

link doesn't work

round jay
#

apologies, wrong link

quartz monolith
#

whats the goal of the work now?

#

data wrangling?

round jay
#

the outlined task: was given a couple of separated data files, combine them into one spreadsheet and also point out any outliers I find, within a 30-45 min period

#

combined spreadsheet was supposed to be formatted to compare a brand's product across 4 different regions

quartz monolith
#

you can use some scatter graphs for the outliers maybe?

#

@round jay

#

maybe for outliers you can use scipy z-score

#
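import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.randn(100, 3))
# keeps rows where every column is within 3 standard deviations, i.e. drops the outliers
no_outliers = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]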

something like this

here more info:
https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba

round jay
#

Thank you for the suggestions! will definitely consider and go over the medium post

quartz monolith
#

maybe @desert oar has some more ideas

#

np!

round jay
#

No worries, think I have a starting point to talk about just in case

#

Appreciate it

crimson trellis
#

hey, has anyone used statsmodel to do stepwise regression?

#

my X is a dataframe with input variables, and my y is the list of target values

abstract zodiac
#

Has anybody used dataquest, and would you recommend it? I have 3 hours a day to spend studying, and as a beginner what's the best online course material?

hot orbit
#

Is this a good channel to ask about AI?

#

Just wanted to know if anybody was working/is working on AI

#

ok I read channel topic nevermind

#

btw the question is, what were you working on? just curious 😄

desert oar
#

@crimson trellis show your code

granite sierra
#

that's awesome @surreal nacelle

surreal nacelle
#

Thanks 😄 First ""practical"" application of ml, really fun project tbh

granite sierra
#

how long did it take you

surreal nacelle
#

2 days

granite sierra
#

did you do any maths for it?

surreal nacelle
#

I used xgboost for the model

#

The part that required work was the preprocessing

#

so not much maths

#

next step is to code most of the basic algo from scratch tho

desert oar
#

@surreal nacelle nice job!

#

What training data did you use

surreal nacelle
#

spamassassin

desert oar
#

Good stuff

surreal nacelle
#

and thanks!

#

you helped 😄

desert oar
#

👍

surreal nacelle
desert oar
#

That's a great project idea by the way, do you think it is something you would recommend for other beginners

surreal nacelle
#

Absolutely

#

Learned a lot

#

actually it is one of the assignments from the Hands on machine learning with sklearn, tensorflow, and keras book.

#

This book is really great

desert oar
#

ahh

#

good to know

rigid haven
#

Hey y'all!
Sorry if this isn't the right category!
I'm not sure if this is totally a "python" question, but I figured I'd ask since I'm coding it in Python. I'm using numpy and matplotlib, and in this graph, I want to calculate a value for "when the graph is shooting up". Is there a word for that? Is there a function I could use in either numpy or matplotlib (or scipy) that'll help me calculate it?

#

It's for a school assignment, and the professor is just expecting us to look at it and write down what it looks like, but that feels super gross

desert oar
#

you want to know the time it happens?

#

or the steepness? or the amount it increases?

rigid haven
#

The time it happens, sorry

desert oar
#

so you want to know when the increase "starts"?

rigid haven
#

"Happens" being a confusing word since it's over a time interval, but knowing how crazy mathematicians are there's probably a definition for it :p

#

Oops, see above

desert oar
#

so theres no exact definition for it

rigid haven
#

Aww
Math let me down ;~;

desert oar
#

how many data points do you have?

#

there are alternatives dont worry

rigid haven
#

A bajillion
(I think technically like a thousand but I can generate more)

zenith nova
#

You could monitor the derivative with a threshold

desert oar
#

ok thats perfect

zenith nova
#

( If that's what you were thinking of salt rock lamp? )

rigid haven
#

OOO Bast that's a cool idea!!!

desert oar
#

they are evenly spaced points?

#

yep @zenith nova exactly

#

take the successive differences of points

rigid haven
#

They are mostly evenly spaced; some specific points will generate errors, but that's usually only one in a hundred

desert oar
#

and when it goes above some number say "the increase has started"

#

ok. i would linearly interpolate the missing points

#

so you don't accidentally register a 2x increase
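
A rough sketch of that recipe on a synthetic curve (the real data isn't shown, and the `threshold` value is an arbitrary assumption that would need tuning): fill the gaps with `np.interp`, take successive differences with `np.diff`, and report the first time the slope crosses the threshold.

```python
import numpy as np

# Synthetic, evenly spaced samples: flat at first, then a sharp rise.
t = np.linspace(0.0, 10.0, 1001)
y = 1.0 / (1.0 + np.exp(-4.0 * (t - 6.0)))
y[[100, 350, 700]] = np.nan        # pretend a few points failed to generate

# Linearly interpolate the missing samples from their neighbours.
ok = ~np.isnan(y)
y_filled = np.interp(t, t[ok], y[ok])

# Successive differences over the spacing approximate the derivative.
slope = np.diff(y_filled) / np.diff(t)

threshold = 0.1                    # assumed; pick what "shooting up" means for you
above = slope > threshold

# np.argmax on a boolean array returns the index of the first True.
start_time = t[np.argmax(above)] if above.any() else None
print(start_time)
```

On noisy data the differences get jumpy, so smoothing first (or requiring several consecutive samples above the threshold) is probably worth it.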

rigid haven
#

Oh man that sounds advanced
Changes np.loadtxt to np.genfromtxt

rigid haven
#

Oh I was kidding; doesn't genfromtxt do that automatically?

desert oar
#

yeah 😛

rigid haven
#

lol

desert oar
#

well sorta

#

i dont know if it can do interpolation

rigid haven
#

Hmm
I know it does something like interpolation, I don't know what it's technically doing

#

O well, I can just use interp ^_^

#

(Making edits now)

desert oar
#

afaik genfromtxt only fills with a fixed value

rigid haven
#

oh ew

desert oar
#

so you can fill all the nulls with 0.0

rigid haven
#

Hmm
I want to mess with the interp more, but I'll leave that for later.
Since we're taking "the successive differences of points", that's exactly what the derivative is, so imma actually try that :3
Thanks guys! <3

desert oar
#

well... it's what the derivative is with infinite points, so kinda. but yeah that should work

#

either you can manually set a threshold

#

or use a change point detection algorithm

rigid haven
#

A what

#

that sounds sciency I like it

#

I'll look it up; thanks for your help! ^_^

lapis sequoia
#

Does anyone know how to prepare and load a local dataset for use in keras?

supple ferry
#

@lapis sequoia through pandas. If it is in a database, you can use a database client for loading and then pandas for transforming

rigid haven
desert oar
#

👍

crimson trellis
#

@desert oar wasn't able to use the walmart data to predict target sales, the differences were too big.

I'm trying to use VAR now just for target

#

I'm getting a little confused with all these tutorials though. They show me how to build a model, but not how to use it to predict future values

#

am I supposed to pass in fake X values for the future?

#

to predict Y

desert oar
#

VAR has the same problem kinda

#

where you need to know everything for the current month in order to predict next month

#

what you can do is "chain" predictions

#

like predict one period with X then use it to predict the next period with Y
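
To make the chaining idea concrete, here is a toy sketch with a hand-rolled VAR(1) fitted by least squares on simulated data. This is an illustration of the idea only, not the statsmodels workflow; `A_true`, `chain_forecast` and the two-variable series are invented. As far as I know, a fitted statsmodels VAR exposes a `forecast` method that does this chaining for you.

```python
import numpy as np

# Simulate two related monthly series from made-up dynamics.
rng = np.random.default_rng(0)
A_true = np.array([[0.6, 0.2], [0.1, 0.5]])
series = np.zeros((200, 2))
for step in range(1, 200):
    series[step] = series[step - 1] @ A_true.T + rng.normal(scale=0.1, size=2)

# Fit VAR(1) without intercept: series[t] is approximated by series[t-1] @ B.
B, *_ = np.linalg.lstsq(series[:-1], series[1:], rcond=None)


def chain_forecast(last_obs, coef, steps):
    """Feed each prediction back in as the input for the next period."""
    preds = []
    current = last_obs
    for _ in range(steps):
        current = current @ coef
        preds.append(current)
    return np.array(preds)


# No future X values are needed: only the last observed month.
forecast = chain_forecast(series[-1], B, steps=3)
print(forecast)
```

Errors compound with each chained step, so forecasts a long way out tend to drift toward the series mean.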

lapis sequoia
#

@supple ferry Should i import the data and labels separately?

#

i've tried using pandas read_csv function previously, but it didnt really like that i had 2 different data types

desert oar
#

@lapis sequoia what does your data look like? you can exercise a lot of control over data types with pandas

lapis sequoia
#

Its basically sequences of 5000 time measurements

#

sometimes a little bit less than 5000

#

and i want to add a label to the series as a whole to classify it

desert oar
#

what was your concern about data types

crimson trellis
#

@desert oar when you say " you need to know everything" what do you mean?

I have entire data from 2018 to this June, but nothing for July, is it still possible to predict?

lapis sequoia
#

Well when i import it with data labels added it struggles to determine the data type, and gives me an error

#

i read some documentation, and it says i should specify the dtype

#

however if i specify it as float, it complains that the label is not convertible to float

desert oar
#

can you give us a sample of the data

lapis sequoia
#

1 sec

#

That is 3 series of 5000 time measurements

desert oar
#

so in each row you have 5000 time measurements, and a label at the end

lapis sequoia
#

Yes

desert oar
#

is that right?

#

ok. so you dont need pandas for this because you can trust that the commas arent going to be messed up

#

pandas is good for mixed data types

#

"spreadsheet" type of stuff

#

just open the line and split it on commas

lapis sequoia
#

Yea the commas wont be messed up

#

The thing I'm struggling with is passing it into Keras the correct way

desert oar
#
import numpy as np
with open('example_labeled.csv') as f:
    rows = (line.strip().split(',') for line in f if line.strip())  # one list of fields per line
    data = [(vals[-1], np.array(vals[:-1], dtype=np.float32)) for vals in rows]
#

that gives you a list of tuples, the 1st element being the label and the 2nd element being a numpy array of the numbers

#

i mean, yeah

#

lol

#

for pandas, what you would do is this

#

i think theyre constructing this data

#

look at the example

lapis sequoia
#

i am gathering the data myself yea

#

I can easily remove the label from the csv file

desert oar
#

that said

data = pd.read_csv('example_labeled.csv', header=None)
y = data.iloc[:, -1]
x = data.iloc[:, :-1]
#

the first 5000 columns should already be float type

#

and the last column should already be 'O' type which is basically "string"

#

wait what the heck how do you disable reading headers

#

yeah its header=None

lapis sequoia
#

if some of the series are a bit shorter than 5000 measurements, will it still work?

desert oar
#

no

#

thats why you were having issues

#

either pad the series when you're creating the data, or read it line by line as i started describing above
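
A sketch of the line-by-line route with padding, assuming zero-padding at the end is acceptable for the model (that is a modelling choice, not a given). The inline text stands in for the real file, `max_len` would be 5000 in practice, and the label names are invented.

```python
import io
import numpy as np

# Stand-in for open('example_labeled.csv'); the second row is one value short.
raw = io.StringIO(
    "0.11,0.12,0.15,normal\n"
    "0.20,0.21,congested\n"
    "0.09,0.10,0.11,normal\n"
)
max_len = 3  # 5000 for the real measurements

labels, rows = [], []
for line in raw:
    parts = line.strip().split(",")
    if len(parts) < 2:
        continue                       # skip blank or label-only lines
    labels.append(parts[-1])
    values = np.array(parts[:-1], dtype=np.float32)
    padded = np.zeros(max_len, dtype=np.float32)
    padded[:len(values)] = values[:max_len]   # short rows keep trailing zeros
    rows.append(padded)

X = np.stack(rows)                                       # shape (n_series, max_len)
classes, y = np.unique(labels, return_inverse=True)      # string labels to integer ids
print(X.shape, classes, y)
```

`X` and `y` are plain NumPy arrays at this point, which is the form Keras models generally accept in `fit`.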

lapis sequoia
#

Ok, I think I have my work cut out for me now.

desert oar
#

how are you creating the data?

lapis sequoia
#

I'm measuring delay over a network

#

I'm more of a network engineer, but my masters thesis requires me to touch on machine learning, which I've never really done before

#

Can I message you if I have a question in the future?

desert oar
#

you can ping me here

lapis sequoia
#

Thanks for the help!

quartz monolith
#

@desert oar fastTEXT works really well on sentence/text classification

desert oar
#

yup, i love it

quartz monolith
#

i showed the experts 7 different models with different parameters and we decided on one

#

they cant believe it lol

#

its really an odd feeling when they help you to create a new system with ai and maybe they will not be needed in future

desert oar
#

wow

#

yeah it is

#

congrats on making it work

#

its hard.. you know that overall you are making the world more efficient. but its scary knowing that we as a society havent figured out how to handle people who lose their jobs due to automation

surreal nacelle
#

Using python

#

and matplotlib 😃

#

that's from the repository associated with the hands on with machine learning (...) book, it's full of valuable resources 😃

silent swan
#

fastText is underrated

#

it's pretty good and very easy to use

desert oar
#

and you have to fiddle with it less than VW

#

its become my baseline go-to

#

instead of liblinear

#

for text, that is

silent swan
#

and less hassle than bert lol

desert oar
#

we just started using bert here

#

weve had a big classification project running for almost a year now

#

1450 classes

#

very rare classes in some cases, < 5

#

lots of mislabelings

#

VW fell on its face

#

fasttext barely beat out liblinear

#

its a really messy project

vital plume
#

I have a question which isn't really python related but more generally data-science. I can't seem to google my way to a solution through key words in different combinations as I don't know precisely how to articulate my need. I have a 2d data set which is really weirdly distributed - as in, a lot of the data is clumped in one area. Is there a method by which I can 'redistribute' that data and how can I go about searching for methods to do this?

desert oar
#

@vital plume what do you mean redistribute? what would be the desired result?

vital plume
#

Like

#

for a very dumb example

#

lets say we have data points that produce a completely diagonal relationship

#

(0, 0)(0.3, 0.5)(1, 1)

#

I only really care that the points themselves demonstrate that 1,1 is at the far right hand corner of the coordinate space and 0,0 is at the bottom left

#

the middle value 0.3, 0.5 however could be at 0.5, 0.5 and it would still express its position in relationship to both those points relatively

desert oar
#

sort of? the distance to both points changed

#

the angle changed

#

its a totally different point imo, except for the fact that 0 < 0.3 < 1 and 0 < 0.5 < 1

#

so it depends on your meaning

#

eg you can identify a bounding box or bounding circle, and evenly space points within that bound
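
#

one possible reading of "evenly space points within that bound", as a sketch only: replace each coordinate by its rank, scaled to the unit box. this keeps the ordering along each axis and throws away the distances. `respace_by_rank` is a made-up name and ties are broken by input order, which may or may not be what you want

```python
def respace_by_rank(points):
    """Spread each coordinate evenly over [0, 1] while keeping its ordering."""

    def even(values):
        # indices of the values, sorted from smallest to largest value
        order = sorted(range(len(values)), key=lambda i: values[i])
        out = [0.0] * len(values)
        for rank, i in enumerate(order):
            out[i] = rank / (len(values) - 1) if len(values) > 1 else 0.0
        return out

    xs = even([p[0] for p in points])
    ys = even([p[1] for p in points])
    return list(zip(xs, ys))


print(respace_by_rank([(0, 0), (0.3, 0.5), (1, 1)]))
# [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```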

vital plume
#

I guess I want to respace my data space

#

redistribute?

desert oar
#

you just need some kind of criterion

#

some rule

vital plume
#

Is there not some unsupervised processing of data that you can do to make a distribution more normal?

desert oar
#

sure, but you havent described an actual criterion until now

#

you want it to be more gaussian?

#

heck you can take the mean and variance of the same data, and randomly generate N new points from a gaussian distribution
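
#

sketch of that last idea, with assumptions: it treats the two axes as independent (so any correlation between x and y is lost), and the new points have no one-to-one link to the original ones. `gaussian_resample` and the seed are just for illustration

```python
import random
import statistics


def gaussian_resample(points, n, seed=0):
    """Draw n new 2d points from per-axis gaussians fitted to `points`."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return [(rng.gauss(mean_x, std_x), rng.gauss(mean_y, std_y)) for _ in range(n)]


new_points = gaussian_resample([(0, 0), (0.3, 0.5), (1, 1)], n=1000)
print(len(new_points))
```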

grizzled folio
#

anybody know off-hand whether I can get a speedup by opening a compressed, chunked netCDF file in parallel? I vaguely recall benchmarking the uncompressed file and getting a huge speedup in single-thread reading...

desert oar
#

@grizzled folio i dont know netcdf specifically, but "probably"

#

that would be my guess

grizzled folio
#

looks like HDF5 is not thread safe, even for reads?!

desert oar
#

oof

#

ive never used it. never had a need. i always used parquet for "big" tabular stuff and gzipped json for non-tabular structured

grizzled folio
#

interesting, I'm working with climate/ocean data so netCDF is the way to go (though zarr is making its way in)

desert oar
#

yeah im not familiar w/ the more complex data formats from the natural sciences

#

in social science everything is tabular or text

grizzled folio
#

that'd make things much simpler!

desert oar
#

whats the advantage of all these complex formats

#

i know for example GIS data it's just a really old format, so it's really messy

#

ok sometimes in social sciences we use GIS data too, but thats not so bad because there are a lot of established tools for it

grizzled folio
#

netCDF isn't particularly complex, it's self-describing (so you can pick one up and have all the dimensions/attributes), and handles multiple dimensions, record dimensions (so you can write them as your model runs), things like that

#

I've never used GIS, but it definitely sounds messy 😉

silent swan
#

hmmmm, might need to find a way to efficiently compute pair-wise dot products of 1 million 128d vectors

grizzled folio
#

10^12 dot products? ouch

#

at least it should vectorise nicely

silent swan
#

ya as it turns out, nonparametric methods are... compute heavy

#

yea this is the sort of problem that could get close to the 100% theoretical efficiency of a GPU lol
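
#

cpu-side sketch of the chunked version, assuming numpy and a small random stand-in for the real vectors. the point is to reduce each block right away (here: best match per row, which is just an example reduction) so the full 10^12 matrix never has to exist. `pairwise_dots_chunked` is a made-up helper, and at 1 million rows you would likely have to chunk the columns too

```python
import numpy as np


def pairwise_dots_chunked(x, chunk=1024):
    """Yield (start, block) where block[i, j] = dot(x[start + i], x[j])."""
    for start in range(0, x.shape[0], chunk):
        yield start, x[start:start + chunk] @ x.T


rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 128)).astype(np.float32)  # stand-in for 1M x 128

best = np.empty(x.shape[0], dtype=np.float32)
for start, block in pairwise_dots_chunked(x, chunk=512):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = -np.inf  # ignore each vector's product with itself
    best[start:start + block.shape[0]] = block.max(axis=1)

print(best[:5])
```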

lapis sequoia
#

what did I miss

#

parquet is great.. have you tried capacitor? @desert oar

muted garden
#

Hello, may i ask, data science is part of ML right?

silk forge
#

yes and no

#

@muted garden

#

btw

#

i dont understand feature scaling and data normalization

muted garden
#

🧐

#

Okay thank you bro

idle cedar
#

@muted garden - there are two areas. Data Analytics encompasses Data Science and Machine Learning. Inside Data Science there are traditional methods (logistic, linear, cluster, factor analysis etc) and Machine Learning

#

Here's a helpful infographic I use in work:

quartz monolith
#

the most confusing is the difference between data mining and machine learning 😄

desert oar
#

bleh

#

ML is a research discipline, a problem domain, and a loose collection of techniques

#

data science is a job title and a career path

#

data science subsumed a number of jobs that previously had different names, eg. "quant", "statistician", "machine learning researcher", etc.

#

any time you're automating something in a way that requires "learning" (anything beyond hard-coded rules) and making inferences from it, imo you're doing machine learning

#

basically any kind of automated prediction task

idle cedar
#

Salt hit the nail on the head, 20 years ago a Machine Learning Engineer undoubtedly would be a statistician or Quant

lean ledge
#

I hate that infographic

idle cedar
#

How come Raggy?

lean ledge
#

It's just so bad. Clustering, regression and time series isn't ML, because ML is just supervised, unsupervised, and RL, which are clearly different things. Apparently the major distinguishing factor between supervised and RL is that you're maximising reward instead of minimising cost. Somehow ML is so different from traditional methods that it uses 5 more languages or something

idle cedar
#

You clearly didn't read the infographic

#

Clustering, regression, factor etc are all under traditional not machine learning

lean ledge
#

Regression, clustering etc are also generally ML

idle cedar
#

Yes you can use them but they sit under traditional

lean ledge
#

How is clustering ever not ML

idle cedar
#

Because you can do a 1 line in Python without any ML packages to find clusters within observations

#

Same with Factor Analysis

lean ledge
#

...number of lines isn't an indicator of whether something is ML

idle cedar
#

No, ML is the indicator of a machine learning from previous optimisation attempts to improve the accuracy of a model

#

Cluster does not do that unless specifically indicated

#

and thus k-means is more appropriate than Cluster for machine learning

#

than traditional cluster*

#

Let's take powerBI for example

#

You can use m query language to get a cluster of data

#

that didn't use ML at all

#

but you're right as well because

#

what if you have a constant stream of data that it needs to cluster

lean ledge
#

K means isn't ML?

idle cedar
#

k-means is

#

it's a very common model for machine learning

#

K-means clustering is a type of unsupervised learning

lean ledge
#

I am aware what it is

#

Very well aware

idle cedar
#

Great, so I think we are on the same page

lean ledge
#

I'm just having a hard time grasping what you're saying

idle cedar
#

I'm saying that the infographic is accurate for what it is trying to portray

#

But you're also right in some regards, it is hard to draw a hard line between the two

#

but there are more appropriate complex models that would separate traditional data science and machine learning

lean ledge
#

Is it though? It's honestly just confusing for everyone

#

It can't even properly distinguish between RL and supervised learning

idle cedar
#

It's usually better explained with the video

lean ledge
#

It's just one of the million bad infographics made in the field

idle cedar
#

But let's face it, if you dont know the difference between reward based learning and supervised learning then you shouldn't really be looking at it in the first instance un-aided

lean ledge
#

🙄😒 "this resource for beginners is bad"
"If you're a beginner you shouldn't be looking at it anyway"

idle cedar
#

I think you misinterpreted what I said because I said it is better explained with the video

#

But the infographic, for all intents and purposes, is the best I have ever had at trying to help people explain the differences

#

but the entire Data Analytics field is so full of buzzwords

#

that it has become subjective in what sits where

#

everyone must make their own mind up these days

#

But on another topic, Matplotlib vs plotly 😄

lean ledge
#

The infographic shouldn't try to give info when there isn't a concrete answer

idle cedar
#

Problem is Raggy, I dont think anyone has a concrete anaswer yet

lean ledge
#

Clustering, regression, and time series are problems to tackle not methods on their own

idle cedar
#

answer* it just attempts to put some structure on a wildly unstructured field

lean ledge
#

You can't classify them as traditional

#

And then list a bunch of ML methods to do those tasks and say they're separate

#

You shouldn't misinterpret the difference between RL and supervised

idle cedar
#

I dont think that was the purpose because Cluster and K means are in both

lean ledge
#

You shouldn't make claims on the languages used when there is no standard or meaningful distinction

idle cedar
#

I think you're arguing for the sake of it

lean ledge
#

Left doesn't mention K means

idle cedar
#

No but it mentions cluster analysis

#

which is a traditional non learning method

#

k means is

#

That infographic isn't set is stone because it isn't explicitly mentioned

#

means that's it ti cannot be used in either regards

#

Let me rephrase that as that was a terrible explanation

#

Because the infographic doesn't explicitly say what belongs to which, doesn't mean that it is the rule of law, it's just giving some methods to help people realise what the difference is

#

one doesn't learn from itself, here are some methods

#

one learns from itself, here are some methods

#

not an exhaustive list

#

I agree with you they can be used for both, absolutely

#

but someone has to make the attempt at giving a few examples for each

lean ledge
#

You're incomprehensible to me

#

I'm not saying it's not exhaustive enough

#

I'm not saying something can be used for both

#

I'm saying the infographic is horrible because it's putting a problem statement and a solution next to each other and pretending the problem statement is a traditional method and the solution is an ML method

#

And that's just one of the many misleading things about it

idle cedar
#

It's just listing some examples - not a definitive list

#

jesus christ

lean ledge
#

ITS NOT LISTING EXAMPLES AT ALL

idle cedar
#

Yeah it is

lean ledge
#

Regression is not a traditional technique

#

It's a problem statement

#

Traditional technique would be the normal equation

idle cedar
#

Do you see the fucking section that says example usage

lean ledge
#

...I am afraid you can't read

idle cedar
#

Linear regression, logistic regression, cluster analysis and factor analysis are all traditional method

lean ledge
#

THEY ARE NOT METHODS

#

THEY ARE PROBLEM STATEMENTS

idle cedar
#

Sigh

lean ledge
#

A traditional method for doing linear regression is analytically calculating weights using the normal equation

#

The ML way is gradient descent over weights

#

Both are the same tasks

#

- linear regression

idle cedar
#

I pray for you if you get angry over the difference between Method and problem statements. I'm going to duck out of this conversation now because as with any debate, everyone gets ingrained in their original opinion anyway.

lean ledge
#

Jesus Christ

simple crag
#

Let's not continue this

lean ledge
#

Why does this server have such a high concentration of people who keep insisting they're right when they are clueless

#

It is so frustrating

simple crag
#

All capping at people is really going to help with that

#

Vent your frustrations somewhere else

lean ledge
#

🙄 What else do I do when someone isn't taking in what I'm saying over and over. Not like mods are willing to tell people to stop being wrong

simple crag
#

Be an adult and move on

lean ledge
#

Whatever

simple crag
#

You're not the arbiter of truth on the internet

lean ledge
#

When someone is wrong, they are wrong. I'm not the arbiter of truth but it's hopefully the policy of moderators to ensure the server is both polite and filled with intelligent discussion, not just polite and filled with crap.

simple crag
#

You consider yelling at people intelligent discussion?

#

Perhaps come up with a way of discussing topics with people without being a child

lean ledge
#

I did try telling the same thing multiple times without yelling. All capsing is just another form of emphasis, not childish shouting.

simple crag
#

uh huh

#

I see no point talking this in circles

quartz monolith
#

right, to switch the topic. Has someone used azure machine learning studio or any deep learning instances?

muted garden
#

hmm

#

That is an interesting infographic

#

@idle cedar thank you 💛

idle cedar
#

Hope it helps 😃

desert oar
#

Fwiw i dont like the infographic much either, but not for the same reasons 😛

silent swan
#

silly bois, the method is (X'X)^{-1} X'Y
easy stuff
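
#

for anyone following along, a small sketch (assuming numpy and made-up toy data with true intercept 2 and slope 3) showing that the closed form and gradient descent land on the same weights for the same linear regression. the learning rate and step count are picked by hand for this toy problem, not general advice

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(0, 1, 200)])  # intercept column + one feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# closed form: solve (X'X) w = X'y rather than forming the inverse explicitly
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# gradient descent on the mean squared error of the same model
w_gd = np.zeros(2)
lr = 0.5
for _ in range(5000):
    grad = 2.0 / len(y) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

print(w_normal, w_gd)
```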

#

next discussion: there is no such thing as unsupervised learning 😄

desert oar
#

It might not be a good term but it has a specific meaning and it definitely exists

muted garden
#

So someone who is an expert in AI should be an expert in data science too, right? i am sorry if my question is silly but really i am wondering

desert oar
#

sort of?

#

i dont think you can be an expert with data science

#

it's too broad

#

i think you often need data science to do AI

#

and some tools from AI are useful in data science

idle cedar
#

Yeah, if you want to hit both Fof, I guess it's statistics

serene scaffold
#

Has anyone used scikit learn crf?

desert oar
#

sklearn-crfsuite?

surreal nacelle
wheat egret
#

can anyone here provide an accurate, yet basic method of explaining going from two images into a 2d depth map?
i've played around with it a lot using opencv throughout the past couple days, but i think i'm missing some fundamental concepts
(not sure where to put this question, so i'll drop it here)

lapis sequoia
wheat egret
#

Thanks, i'll look at it.

idle cedar
#

@surreal nacelle I did a course on Coursera many moons ago from a guy who taught it in Octave

#

the courses on there are really good

surreal nacelle
#

that's andrew ng's course

#

but it's not this one 😄