#data-science-and-ml


untold tundra
#

what you're loading with en_cor... is just the vocab vectors, so if you want to store that on s3, i imagine you can just save a version from the github where i imagine its downloaded from

rotund basin
#

this is what I had been assuming, but have not made it work. maybe just need to keep digging there

untold tundra
#

well i think nlp.vocab.to_disk() works in any case, so you can use that

rotund basin
#

yes, I just might. Sometimes it seems like there may be an easier way when there actually isn't :)

#

thanks for your insights

sand loom
#

Is there an efficient way to, for example, remove columns from a numpy ndarray? Ive got an (n,85) ndarray and am trying to efficiently trim out the last 80 entries

#

In reality selecting relevant data from the last 80 entries and placing it into index 5 of the original array, rendering the last 80 no longer useful

#

Ive looked into masking, deletion, copying to new array... but I'm curious if anyone has any insight into an efficient method, or if leaving as-is is the best? I'm just a bit memory constrained

serene scaffold
#

!e

import numpy as np
arr = np.random.random((43, 85))
new = arr[:-10, :]  # chop off the last ten rows
print(new.shape)
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

(33, 85)
sand loom
#

Ah yeah, it seems that slicing just provides a view. Alrighty

serene scaffold
#

@sand loom how memory constrained are we talking, anyway?

#

If you can prevent the array from being assigned to a variable before you slice it, it might not stay in memory as long.

sand loom
serene scaffold
sand loom
#

the device is yocto-based (stripped linux) so just standard python

#

at least, afaik

serene scaffold
#

@sand loom I've never worked on an embedded system. I'm not aware of how numpy has been used with them

sand loom
#

I'm doing some more research on this (hadnt considered that yocto could use non-standard python which would change all profilings of things I have looked at). However, afaik it should be just the same python installation and implementation as a standard python install on ubuntu or other equivalent (python version 3.8.11 for that matter)

#

Well, at least at the top level I am not using cpython. The implementation on the system might be using it, though. Was not able to find any definite answer on that matter. But for now I am comfortable assuming it works nearly identical to a standard linux install

untold tundra
#

an (n,85) ndarray and am trying to efficiently trim out the last 80 entries

#

so you're looking to go from (n, 85) to (n, 5) ?

#

how efficient is data = data[:, -5:] ?

#

or otherwise, e.g., more_data = data[:, -5:].copy(); del data

#

the array built-in python module may also be helpful
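A minimal sketch of the view-versus-copy point from this exchange, assuming the goal is to keep the first six columns (the original five plus whatever gets written to index 5). The `argmax` step is only a stand-in for "selecting relevant data", and the array size and dtype are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 85)).astype(np.float32)  # made-up (n, 85) input

# Stand-in for "selecting relevant data": position of the largest of the last 80 values.
data[:, 5] = data[:, 5:].argmax(axis=1)

view = data[:, :6]       # basic slicing returns a view; the 85-column buffer stays alive
trimmed = view.copy()    # compact (n, 6) array that owns its own memory

view_shares = np.shares_memory(view, data)     # True
copy_shares = np.shares_memory(trimmed, data)  # False
full_bytes, trimmed_bytes = data.nbytes, trimmed.nbytes

# Drop every reference to the big buffer so it can be reclaimed.
del view, data
print(trimmed.shape, full_bytes, trimmed_bytes)
```

The copy briefly needs room for both arrays, so on a memory-constrained device it may be worth trimming as early as possible. NumPy arrays do not support `del arr[...]` on elements, and `np.delete` returns a new array rather than shrinking the old one.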

pure pumice
#

@serene scaffold Hey, is it possible if you can help me with a few more things?

serene scaffold
#

@pure pumice idk what it is

pure pumice
# serene scaffold @pure pumice idk what it is

{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}

serene scaffold
#

What are you trying to do

pure pumice
#

this is what ive done for netflix but i cant just do the same thing for all 4

#

is it possible to do it in one filter?

serene scaffold
#

@pure pumice so you need to add the four streaming platform columns

#

And then get those

#

That are >= 2

pure pumice
serene scaffold
#

@pure pumice you didn't take the sum of them

#

.sum()

#

Also I'm on my phone

#

At my parents house

#

I wanna go home

#

Help me

pure pumice
#

lol

#

send ur address

#

ill pick you up and we can work on this

pure pumice
#

would be after the []

serene scaffold
#

@pure pumice try it

#

@pure pumice also you might need to set the axis

#

!docs pandas.DataFrame.sum

arctic wedgeBOT
#

`DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)`
Return the sum of the values over the requested axis.

This is equivalent to the method `numpy.sum`.
pure pumice
#

wtffff

serene scaffold
#

What the fuckkkkk

pure pumice
#

so my column names

#

are gonna be inside the sum()

serene scaffold
#

No

fierce patio
#

hi

pure pumice
serene scaffold
fierce patio
#

plz i still can t understand whats the role of cross_val_score

serene scaffold
#

@fierce patio it's the score from cross validation
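To make "the score from cross validation" concrete, here is a small sketch using scikit-learn's `cross_val_score`. The synthetic dataset and the choice of `LogisticRegression` are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000)

# cv=5: fit on 4/5 of the rows, score on the held-out 1/5, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy value per fold
print(scores.mean())   # a common single-number summary
```

Each value estimates how the model does on rows it was not trained on, which is usually more honest than scoring on the training data.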

pure pumice
serene scaffold
#

@pure pumice ya

pure pumice
#

filt1 = df['Netflix','Hulu','Prime Video', 'Disney+'].sum(axis=1) >= 2

serene scaffold
#

@pure pumice that's going to be a series of bools

pure pumice
serene scaffold
#

@pure pumice remember what I said about saying you got an error.

pure pumice
#

Traceback (most recent call last)
/cloud/lib/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:

/cloud/lib/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/cloud/lib/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('Netflix', 'Hulu', 'Prime Video', 'Disney+')

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last)
<ipython-input-12-bbb026620058> in <module>
1 # Create a new dataframe that only contains values that are on two or more streaming platforms
2 # HINT: This is a great place to use filters!
----> 3 filt1 = df['Netflix','Hulu','Prime Video', 'Disney+'].sum(axis=1) >= 2
4 df.loc[filt1]
5 #1 filter, total number line

/cloud/lib/lib/python3.9/site-packages/pandas/core/frame.py in getitem(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]

/cloud/lib/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: ('Netflix', 'Hulu', 'Prime Video', 'Disney+')

#

whoops sorry

serene scaffold
#

It has to be a list

#

Not a tuple

pure pumice
#

ah so i dont need quotations on each one of thems

#

them

serene scaffold
#

No that's not what I said

fierce patio
pure pumice
#

not parenthesis

serene scaffold
#

@pure pumice yes but you still need quotes

#

For each string that is a column name

pure pumice
#

df['Netflix','Hulu','Prime Video', 'Disney+']

#

like that right?

serene scaffold
#

No

#

You need an extra pair of []

pure pumice
#

for each

#

column

serene scaffold
#

No

#

For the whole thing

#

Df[[.,.,.,.,]]
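Spelled out, the double-bracket version might look like the sketch below. It uses only the platform columns from the sample data posted earlier; the variable names `platforms` and `on_two_or_more` are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "Avengers: Infinity War",
              "Back to the Future", "The Good, the Bad and the Ugly"],
    "Netflix": [1, 1, 1, 1, 1],
    "Hulu": [0, 0, 0, 0, 0],
    "Prime Video": [0, 0, 0, 0, 1],
    "Disney+": [0, 0, 0, 0, 0],
})

platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# Outer [] indexes the frame, inner [] is the list of column names.
filt1 = df[platforms].sum(axis=1) >= 2   # boolean Series, one value per row
on_two_or_more = df.loc[filt1]
print(on_two_or_more["Title"].tolist())
```

With a single pair of brackets, pandas treats the comma-separated names as one tuple key, which is why the `KeyError` shows a tuple.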

fierce patio
pure pumice
serene scaffold
pure pumice
#

thank you python master

serene scaffold
#

Yw python disciple

pure pumice
#

i feel like shooting my computer screen i hate this python assignment 😦

serene scaffold
#

Or make it go away

pure pumice
#

also @serene scaffold if i am trying to create a pie chart for a specific column is that possible, or would i have to put it into a pivot table first

pure pumice
serene scaffold
#

@pure pumice I've never made a pie chart

pure pumice
#

damn

serene scaffold
#

I like pie

pure pumice
#

apple pie

serene scaffold
#

Pecan pie

#

Pumpkin pie

pure pumice
#

do u eat ice cream?

#

if u like pecans u should eat some pralines and cream ice cream

#

@serene scaffold

#

okay so over here

#

{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}

#

i have these genres and i need to slice it so only the first genre shows on each row

#

but what about the rows with only one genre. wont they get sliced as well?

serene scaffold
pure pumice
serene scaffold
#

@pure pumice the way with the fewest steps involves regular expressions

#

You might use apply and a lambda

pure pumice
#

never heard of 😦

serene scaffold
#

Lambda

#

It's just where you make a one statement function

pure pumice
#

so what im doing rn is just longer

serene scaffold
#

I can show you
When I get home

pure pumice
#

Create a pivot table where the average runtime of the movie is examined

Make the rows Year and the columns Genres

#

gp_pivot = df.pivot_table(values='Runtime', index="Genres",
columns = 'Year', aggfunc='mean')
gp_pivot.tail()
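The assignment text asks for Year as rows and Genres as columns, which in `pivot_table` terms would be `index="Year"` and `columns="Genres"`. A sketch on the five sample rows, assuming the first comma-separated genre stands in for the whole list:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2010, 1999, 2018, 1985, 1966],
    "Genres": ["Action,Adventure,Sci-Fi,Thriller", "Action,Sci-Fi",
               "Action,Adventure,Sci-Fi", "Adventure,Comedy,Sci-Fi", "Western"],
    "Runtime": [148.0, 136.0, 149.0, 116.0, 161.0],
})

# Assumption: keep only the first genre of each row.
df["Genres"] = df["Genres"].str.split(",").str[0]

# index= controls the rows, columns= controls the columns.
gp_pivot = df.pivot_table(values="Runtime", index="Year",
                          columns="Genres", aggfunc="mean")
print(gp_pivot)
```

Cells with no movie for that year and genre come out as `NaN`.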

sand loom
# untold tundra how efficient is `data = data[:, -5:]` ?

I assume this would just be another view. However, that's probably fine as long as I'm not making more copies of my slightly modified data. Will need to profile it and also profile your deletion suggestion. I appreciate the help!

quiet vault
#

Has anyone worked with colab when using a runtime of a vm from gcp

serene scaffold
#

@pure pumice

In [7]: df['Language']
Out[7]:
0    English,Japanese,French
1                    English
2                    English
3                    English
4                    Italian
Name: Language, dtype: object

In [6]: df['Language'].str.extract(r'^([A-z]+),?')
Out[6]:
         0
0  English
1  English
2  English
3  English
4  Italian
#

I guess it could just be df['Language'].str.extract(r'([A-z]+)')

pure pumice
#

@serene scaffold (r'^([A-z]+),?')

#

what does this mean exactly

serene scaffold
#

it's my demon summoning spell

pure pumice
#

does that mean its mine now

serene scaffold
#

no

pure pumice
#

dammit

serene scaffold
#

anyway a regular expression is a pattern that strings can match

#

^([A-z]+),? means "from the start of the string (^), extract (()) one or more (+) consecutive characters from A to z ([A-z]), possibly (?) followed by a comma (,)."
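One caveat on the pattern, sketched below: the range `A-z` runs over ASCII codes 65 to 122, so it also admits a few punctuation characters that sit between `Z` and `a`. The letters-only class `[A-Za-z]` and the `str.split` alternative are suggestions added here, not something from the chat.

```python
import re
import pandas as pd

languages = pd.Series(["English,Japanese,French", "English", "English",
                       "English", "Italian"], name="Language")

# [A-z] spans ASCII 65..122, which also includes [ \ ] ^ _ and the backtick.
underscore_matches = re.fullmatch(r"[A-z]", "_") is not None

# [A-Za-z] restricts the class to letters only.
first_by_regex = languages.str.extract(r"^([A-Za-z]+)", expand=False)

# Regex-free alternative: split on commas and keep the first piece.
first_by_split = languages.str.split(",").str[0]

print(first_by_regex.tolist())
```

The split version also handles values with spaces, such as "United States" in the Country column, which a letters-only pattern would cut short.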

pure pumice
#

and the reason we cant use str[]

serene scaffold
pure pumice
#

is because the languages are not in a list

#

ahhhhh okay

serene scaffold
#

str[1:2] would be a string slice

pure pumice
#

ya i was trying that

#

and it was just

#

messing everything up

serene scaffold
#

that would work if every language name had the same number of characters.

#

I'm gonna play a game before I go back to work in the morning

pure pumice
#

have a great night

#

thanks

#

again

serene scaffold
#

👍

sullen tinsel
#

Hey, does anyone mind taking a look at a code I am working on? I'm struggling with one aspect of it that has to do with user input and the help chat said this chat is also helpful!

serene scaffold
brisk moth
#

how do i use CUDA toolkit?

serene scaffold
brisk moth
#

do cuda things

serene scaffold
brisk moth
#

i have a 1060 i tried installing cuda toolkit and it failed

serene scaffold
#

show error message

brisk moth
#

uhh

serene scaffold
brisk moth
#

true

serene scaffold
#

that's a different question from "how do I use CUDA toolkit?". afaik, CUDA toolkit is just a compatibility layer for installing pytorch and stuff (and therefore can't be "used"), but I try not to rule out the possibility that people know something I don't, since they usually do.

#

but if you can't show the error message, idk what to do.

royal crest
#

XY problem

serene scaffold
stoic musk
#

Is anybody familiar with basic TensorFlow/Keras?

#

for t in range(Tx):

    # Step 2.A: select the "t"th time step vector from X. 
    x = X[:,t,:](X)

Trying to figure out how to iterate over the above tensor X
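A possible fix, sketched with NumPy as a stand-in: `X[:, t, :]` is already the slice for time step `t`, so the trailing `(X)` tries to call that slice as if it were a function. As far as I know the same indexing works on TensorFlow tensors, but that is not verified here, and the sizes below are made up.

```python
import numpy as np

m, Tx, n = 4, 10, 3          # batch size, time steps, features (made-up sizes)
X = np.zeros((m, Tx, n))

steps = []
for t in range(Tx):
    x = X[:, t, :]           # index the array directly; there is nothing to call
    steps.append(x)

print(len(steps), steps[0].shape)
```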

final scaffold
#

Hi, ive installed anaconda and kept these checked while installing:
a) install for All Users (not current)
b) Add PATH
Installed location is c:/ProgramData.

Now, when i open cmd (both as user and administrator) and type: python
i get this warning message:-
This python interpreter is in conda environment, but the environment has not been activated. Libraries may fail to load.

gilded copper
#

Any professionals over here who can help me a bit

hard shuttle
#

Hi everyone

junior lintel
#

If I need help with an AI script do I simply go into a help channel or do I go here?

serene scaffold
#

@gilded copper always ask your actual question. Don't rule out everyone except "professionals" before you've put an answerable question out there

marsh yacht
gilded copper
serene scaffold
acoustic forge
#

Guys - How would you "rank" these algorithms in terms of complexity (NOT Big O, but rather complexity in terms of explainability to stakeholders)
XGBoost
Random Forest
Logistic Regression
K-Nearest Neighbours
Decision tree
Perceptron

shut valve
#

regression, decision trees, forests, k-nearest, xgboost, perceptron

#

but like k nearest could be higher

acoustic forge
#

K nearest is the easiest in my opinion

shut valve
#

i think between k-nearest and decision trees it's a toss-up, the order to introduce them in is up to you

acoustic forge
#

I see. I have this stupid assignment for my final exam in applied data science, and they want basically a powerpoint. I need to explain it to stakeholders who know nothing about data science

shut valve
#

i feel that decision trees are easier to show to non tech people

#

yeah then i feel k-nearest is a bit more algorithmic to understand, whereas with decision trees it's easier to explain the bigger picture without getting too technical

acoustic forge
#

Right, yeah. Makes sense

shut valve
#

just a bit of advice i didn't appreciate as much in school was to talk more about results and consequences than specific technical aspects.

#

Hey anyone have any interest in taking https://cds.nyu.edu/deep-learning/ ? It looks real cool. I have taken some other deep learning classes and projects so I'm not a total noob, but idk if i would even call myself intermediate yet. I took linear in college and did ok, but that was a few years ago, and I haven't taken a derivative or anything in years, which I'm concerned about. I'm not going to be sprinting through it. I'm going to try to stick to the weekly schedule, but if something takes me an additional week for whatever reason then so be it. I was planning on starting it after the new year, just asking to see if there was any interest

trail horizon
#

guys i would to learn data engineering but dont know from where to start, I already know python and SQL

#

can u pls give me like a career track or course ?

#

pls

serene scaffold
serene scaffold
trail horizon
serene scaffold
trail horizon
#

data science = machine learning, deep learning, etc etc

#

data engineering = spark, pipelines, cloud

shut valve
#

umm then you're asking the wrong channel, try devops?

serene scaffold
#

afaik the fundamentals are the same.

#

one of my classmates got a job titled "data engineer" whereas my title is "computational linguist", but we both did the data science program.

#

(though I also took linguistics classes.)

shut valve
#

yeah but if your more interested in dataops and mlops there is a great course on coursera https://www.coursera.org/learn/introduction-to-machine-learning-in-production/home/welcome I took it and it was really interesting got to see a side that you dont get in more technical classes

normal radish
#

Hey guys can any of you help me with Convolutional Neural Networks? It is for a school project

shut valve
#

umm maybe whats your problem?

normal radish
#

I have a code I have trouble understanding

#

Posted it in the brocoli help channel

shut valve
#

well it looks like it's just making the network. do you have a specific layer that you don't understand? i honestly never worked with separable, i don't know what that is

normal radish
#

The make model part is my problem. How does it work?

#

Do you have time for voice chat?

shut valve
#

i cant talk rn but it looks like you make stacks of convolutions with larger and larger filter sizes

normal radish
#

Ye I stole the code from the Keras creator.

#

Need to analyze and understand it

#

But the filter applying is confusing me. The difference between SeparableConv2D and just Conv2D
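One way to see the difference is to count weights. The sketch below assumes the usual depthwise-separable formulation (a k x k depthwise filter per input channel, then a 1x1 pointwise convolution, with a single bias vector); the helper names and channel sizes are made up for illustration.

```python
def conv2d_params(k, c_in, c_out):
    """Weights in a regular k x k convolution, plus one bias per output channel."""
    return k * k * c_in * c_out + c_out


def separable_conv2d_params(k, c_in, c_out, depth_multiplier=1):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise mix, plus bias."""
    depthwise = k * k * c_in * depth_multiplier
    pointwise = c_in * depth_multiplier * c_out
    return depthwise + pointwise + c_out


regular = conv2d_params(3, 128, 256)
separable = separable_conv2d_params(3, 128, 256)
print(regular, separable)
```

For a 3x3 layer going from 128 to 256 channels this gives 295168 versus 34176 weights, so the separable layer produces the same output shape with far fewer parameters.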

shut valve
normal radish
#

Shit yeah I read that but it didnt help me

#

Give me 2 sec I can make a model for you

shut valve
#

i think you can just use regular conv2d. yeah try it with regular conv2d

normal radish
#

Can you give me an example code private?

shut valve
#

it says it has The depth_multiplier argument but i dont see it used in the code

#

i would just ctrl-R SeparableConv2D to Conv2D

shut valve
#

ah i see the residual now

#

ok so where the network splits into separable and regular conv is because of this part:

# Project residual
residual = layers.Conv2D(size, 1, strides=2, padding="same")(
    previous_block_activation
)
x = layers.add([x, residual])  # Add back residual
previous_block_activation = x  # Set aside next residual

#

I think the network would be the same if you changed separable to regular conv2d

#

so what that piece of code is doing is applying the conv2d layer on the right side of the split, adding the two branches back together, and saving the result as previous_block_activation

normal radish
#

Ehhh?

#

Can you maybe draw a layered model of it?

shut valve
#

like i just changed it to regular conv if you issue is with the network splitting after the activation its bc of the code above

#

if you re post the make model in a help chat i can show you my comments

brisk moth
#

can anyone help me figure out why im getting training and validation accuracy over 3.0 lol

normal radish
#

But it was closed again

#

Can you still use it?

bleak kiln
#

anyone has some experience with pandas library ?

fierce patio
#

hi guys, i work on data about testosterone. i want to create an ML model for classification and my target is testosterone. do i have to drop it from my data if i want to use the kmeans algorithm?

desert oar
bold timber
#

Hi, I am so confused about this. I have a 5000 feature in the dataset, but I only get around 2500 components in the plot. What happened in this case?

untold tundra
#

print X_train.shape

#

either way, do you need to see all 5k components? it's >0.95 at like 200

bold timber
# untold tundra print `X_train.shape`

Before this, I used a different dataset to visualize like this, and I get a whole feature as an axis n_components. But, when I use the new dataset that have 5000 features, I only get some n_components from the whole feature. What happened?

untold tundra
#

not sure, are you certain that X_train has 5000?

desert oar
bold timber
desert oar
bold timber
desert oar
#

a matrix with 100 rows but rank 1 really only has 1 piece of data in it

#

(this is a very very non-mathematical explanation)

#

(the actual explanation has to do with "linear transformations")

#

i strongly encourage you to learn the fundamentals of linear algebra, it's an essential tool for building mathematical models in statistics and machine learning

#

linear algebra and calculus

bold timber
undone heron
#

Hello everyone... I have a stacking ensemble with the current config

estimators = [
        ('decision_tree', dtm),
        ('linear_regression',LinearRegressionModel),
    ]

    stack = StackingRegressor(estimators=estimators, final_estimator=RandomForestModel, cv= 7, passthrough = True)

Why does the one above perform better than this one below?

estimators = [
        ('decision_tree', dtm),
        ('rf', RandomForestModel)
    ]

    stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegressionModel, cv= 7, passthrough = True)
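For anyone wanting to reproduce the comparison, here is a self-contained sketch of the two configurations. The synthetic hourly data, the default-parameter base models, and the last-24-hours hold-out are all assumptions standing in for the real dataset and for `dtm`, `LinearRegressionModel` and `RandomForestModel`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for hourly usage: a daily cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
X = np.column_stack([hours % 24, (hours // 24) % 7])   # hour of day, day of week
y = 100 + 50 * np.sin(2 * np.pi * X[:, 0] / 24) + rng.normal(0, 10, len(hours))
X_train, X_test, y_train, y_test = X[:-24], X[-24:], y[:-24], y[-24:]


def make_stack(base, final):
    return StackingRegressor(estimators=base, final_estimator=final,
                             cv=7, passthrough=True)


stack_a = make_stack(
    [("decision_tree", DecisionTreeRegressor(random_state=0)),
     ("linear_regression", LinearRegression())],
    RandomForestRegressor(n_estimators=50, random_state=0))
stack_b = make_stack(
    [("decision_tree", DecisionTreeRegressor(random_state=0)),
     ("rf", RandomForestRegressor(n_estimators=50, random_state=0))],
    LinearRegression())

mae = {}
for name, stack in [("rf_final", stack_a), ("lr_final", stack_b)]:
    stack.fit(X_train, y_train)
    mae[name] = mean_absolute_error(y_test, stack.predict(X_test))
print(mae)
```

Which stack wins on this toy data says nothing about the real dataset; the point is only that both orderings can be scored on the same held-out day with the same metric.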
bold timber
desert oar
#

i recommend an introductory book or course on linear algebra. MIT 18.06 is excellent, the lectures are all on YouTube and the lecturer Gilbert Strang is very entertaining and passionate

#

i actually think a new online version is starting soon or has already started. but the old lectures are also easily available

undone heron
desert oar
#

i don't think there's any compelling theory here

bold timber
undone heron
robust jungle
#

would anyone mind helping in #help-dumpling ? I'm having trouble getting model_main_tf2.py to work

desert oar
bold timber
desert oar
#

there is no 2873'th component to plot a variance ratio for

#

also i'm not sure there's value in 2000+ components that explain < 1% of variance...

bold timber
desert oar
#

i.e. do the plot without cumsum

#

you will see that the components at the end only explain a tiny fraction of overall variance

#

so you can probably just ignore them

#

and if you look at the plot you currently have, you will see that ~90% of the variance is explained by the first ~250 components
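One possible reason for seeing fewer components than features, consistent with the rank remark above, is that PCA cannot return more components than min(n_samples, n_features). Whether that is what happened with the 5000-feature dataset is an assumption; checking `X_train.shape` would confirm it. The sizes below are made up.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))   # made-up: 30 samples, 100 features

pca = PCA().fit(X)               # n_components left at its default
ratios = pca.explained_variance_ratio_

print(X.shape, pca.n_components_)   # number of components is min(n_samples, n_features)
print(np.cumsum(ratios)[-1])        # cumulative curve still reaches ~1.0
```

Here 100 features yield only 30 components, and the cumulative explained variance still reaches 1, so nothing is missing from the plot.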

undone heron
#

That is basically 24 hours of predictions (blue lines containing the real values) and that one above is the Linear Regression model trying to solve it

#

This one is the decision tree

#

Random Forest below

desert oar
#

@undone heron you just did linear regression of bus usage vs time?

#

this is average hourly usage over 2 months?

#

i don't think either model is a great idea tbh

#

you surely have seasonal effects to consider

#

as well as year-over-year trends

undone heron
#

Model trained with the complete months of Jan, Feb and March and is predicting April 1st (24 predictions = 24hrs of the day)

#

I only have 2015 as data (it is mentioned on the limitations of the paper)

desert oar
#

i see, maybe you have to assume that there is no change throughout the year although that is very very risky

#

you will at least need weekly seasonality

#

surely transit usage is very different on saturday and sunday vs mon-fri

undone heron
#

Indeed, my features make sure that this is accounted for

#

Day of week (Mon - Sund), hour, month (1-12), day of year (1-365), day of month (1-31)

#

My big Q is just why the f* is the Linear Regression as a weak learner improving performance if it is so bad?

bold timber
undone heron
#

Even removing it from the stack makes the thing worse lol

desert oar
#

see how it's nearly 0 for most of the components?

desert oar
#

it's great actually, it predicts the average hourly trend throughout the day

#

high bias low variance

#

the decision tree takes care of overfitting to all the little fluctuations

#

and it gets smeared out by the linear regression being very _under_fitted

#

and the predictions aren't highly correlated

undone heron
#

Hmmmmmm that is an interesting take....

desert oar
#

and putting them together with the random forest i guess makes sense?

#

maybe you should do it the other way

#

stage 1: linear regression + random forest
stage 2: decision tree

#

that would be intuitive to me at least

#

or maybe not

#

since you want lower variance in stage 2

#

either way, i can see how the random forest being nonlinear is essential in correctly "re-combining" the 2 models

#

what is the predicted output from the first stacked model? using that same plot

undone heron
#

Well, to give more context... The final stage is -> the best ensemble for that social domain. Let me plot the two stackings I mentioned in the first question. 1 sec

desert oar
#

tldr my guess is that your stacking model is doing what time series decomposition does, splitting apart "trend" and "noise", and then recombining them with the "noise" turned down and/or filtered with some kind of low-pass filter

undone heron
#

Random Forest as final estimator and Decision Tree + Linear Regression as weak learners

#

Linear Reg as final estimator DT + RF as weak learners

#

wtf

#

the plot is better but the performance is not

#

MAE on second one goes up from 50 to 55

desert oar
#

this is "in-sample" prediction performance, right?

undone heron
#

nope

#

unseen data

desert oar
#

i see

#

you are saying that the 1st one has a slightly lower median absolute error than the 2nd one?

#

conceptually i don't think DT + RF makes much sense

#

RF is definitely not a "weak" learner

#

and RF already is constructed from a bunch of trees

#

that said, i am surprised that the first plot has lower median abs error

#

maybe it has to do with median vs mean

#

since the "tails" of the error distribution are essentially discarded with medians

undone heron
#

Wait, technical question... How do I measure performance from a .predict() run?

desert oar
#

try mean abs error or mean squared error instead!
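A small sketch of scoring a `.predict()` run with scikit-learn's metric functions. The numbers are made up so that one prediction misses badly, which shows how mean and median absolute error can disagree.

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)

# Made-up values; in practice y_pred would come from model.predict(X_test).
y_true = [100, 150, 200, 250, 300]
y_pred = [110, 140, 205, 250, 200]   # the last prediction misses badly

mean_ae = mean_absolute_error(y_true, y_pred)
median_ae = median_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(mean_ae, median_ae, mse)
```

The absolute errors are 10, 10, 5, 0 and 100, so the mean absolute error is 25 while the median is only 10, and the squared error (2045) is dominated by the single large miss.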

untold tundra
#

what's the thing here?

is it that lr(dt(rf(X, y))) has different characteristics vs. rf(dt(lr(X,y ))) ?

untold tundra
desert oar
#

i suppose it makes sense to take 1 very deep tree and take a weighted average of it with a forest of many shallow trees

#

that is what the 1st one does

#

im not sure a random forest makes much sense on 2 features either

#

i am almost tempted to do this:

stage1:

  • random forest
  • linear regression

stage 2:

  • fully connected neural network with 1 hidden layer and ~5 hidden layer units
untold tundra
#

dt is basically a sort of modal learning, lr is a mean learning, and rf a mean of mode learners

desert oar
#

that's a great way to put it

#

in which case, yeah. i guess smashing a mode and mean together kinda makes sense

#

that said, i think maybe this entire problem would benefit from probability calibration and estimated error bounds 🙂

untold tundra
#

yeah, it all boils down to some form of average

desert oar
#

i really want to see confidence bands around that predicted line

#

or better yet, a probability density surface

#

i'm not sure what a hardcore bayesian machine learning person would do here

undone heron
#

Oh Jesus now I'm seeing things said here that I have no idea what they are about lol

#

I'm just trying to get a Bachelor Degree in CS folks

#

I have the feeling something somewhere is wrong

untold tundra
#

a neural network is a "mean of means" learner, the first mean() phase projects X into a compressed space; the second mean() is basically a distance from your input x to the nearby points in the compressed space

desert oar
#

it doesn't have to be compressed though, right? that's the whole magic of having more hidden units than inputs

#

like kernel methods that used to be fashionable

#

this data isn't public, is it? i might be curious to mess with it, if it's public

untold tundra
#

approx,
```py
W, b = mean(historical data)
layer = mean(WX + b)
predictions = mean(layers)
```

desert oar
#

oh, i see

untold tundra
#

i think if you just replace mean with "taking an expectation", you could probably mess around abit and get the definition precise

undone heron
untold tundra
#

a layer's activations are just a weighted mean of the previous layer's, where the weights are W

and W are just basically compressions of the original data

undone heron
#

If you want to jump on a voice chat I can share my screen and we can chat about it, I just need to compare ensemble methods on that domain

untold tundra
#

so the "intuitive formula" above, i think is largely correct

#

so a NN is basically an RF where the core stat part is a mean() rather than a mode

as you can just see routes from X to Y through the layers as independent regressions (basically, means), and the final layers as ensembling/mean'ing those

#

but i'll let you get back to helping, if that's what's going on

bold timber
desert oar
#

otherwise i really like this explanation

#

and you're definitely not in the way of helping at all!

untold tundra
#

sure, its a higher-d space, but its linear in that space

#

the intuition is that the weights are basically templates of the original dataset

#

so you're projecting a new x into space where templates are the axis

#

that space isnt a compression of your new-x, its a compression of the historical data

desert oar
#

oh, i see what you're saying

#

yeah, interesting way to think about it

#

i tend to think of it as a "recombining" or "mixing" rather than "compressing"

untold tundra
#

well its compression just if len(W) << len(X)

#

in the sense that if len(W) == len(X), under forced interpolation, W == X

desert oar
#

you're talking about len() as in the entire data matrix? like len(x) being the number of data points in the training set?

untold tundra
#

yeah, well W, b = someop(X, y) right?

#

my claim is the basic heart of someop is mean

#

so really, W, b = means(X, y)

desert oar
#

yeah i am on board there

#

linear transformations are linear combinations are means

untold tundra
#

if len(W,b) == len(X, y), and if loss(training) == 0, then more-or-less W,b should just be X,y

oblique vine
#

Hey, can someone explain me how do I obtain model "score" as in sklearn GridSearchCV?
I have made a model, now I want to compare the score to external test set, and I fail to get the "score" in normal range (gridsearch gives something below 1, i get -60 or sth like that
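A sketch of getting comparable numbers, assuming a regressor and no `scoring` argument, in which case both `best_score_` and `.score()` report R². R² is capped at 1 but has no lower bound, so a strongly negative value on an external set is possible when predictions are worse than always guessing the mean; it could also mean two different metrics are being compared. The dataset and the `Ridge` grid below are made up.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

cv_score = grid.best_score_               # mean cross-validated score of the best params
test_score = grid.score(X_test, y_test)   # same scorer, applied to the external set
manual = r2_score(y_test, grid.predict(X_test))
print(cv_score, test_score, manual)
```

If the external test set went through different preprocessing than the training data (scaling, column order), that is another common source of a wildly different score.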

desert oar
#

ok i see what you mean, it's an average over all data points within the possibly very-high-dimensional feature space

untold tundra
#

yeah, it's mean**s**(X, y)

#

so if those means are just the same number as the original data points, and if you can predict all of those points without error, those means are just the data points

simple ivy
#

hey everyone, is anyone here familiar with ONNX models?

desert oar
#

do you think in that sense, a neural network is fundamentally different from e.g. an svm or linear regression (maybe with polynomial or other hand-transformed features)?

#

or even a general additive model for that matter

untold tundra
#

there's a paper which provies all grad desc learners are eqv. to svm

#

*proves

desert oar
#

i remember hearing something like that

untold tundra
#

in my mind, i see the NN alg as basically a dial from: knn -> ensemble of lr

desert oar
#

interesting idea

untold tundra
#

if len(W) <<< len(X) is ensemble(LR), if len(W) ~= len(X), its knn

desert oar
#

knn as in nearest neighbors?

#

i haven't heard that idea before

untold tundra
#

maybe ensemble knn might be more accurate

#

yeah, to me its kinda obvious, but a lot of the marketing BS requires people basically ignore the weights

#

once you sub W = mean(historical data) into all the formulas, it isnt that much of a mystery

#

W = means(historical) , so W = historical if len(W) = len(historical) and loss(historical) = 0

#

which makes it knn

#

should be kinda obvious from autoencoders and the like too

#

an autoencoder just shows that the weights are basically "local approximations" / compressions of the original data

and the main mechanism of a NN is just to put your new point into the spaces of those approximation points, and take a mean

desert oar
#

oh see, basically if you have 1 weight per data point you're just taking local approximation around each data point?

#

fair enough

#

i have to remember that you're talking about a time series here

#

and not a "flat" dataset of rows and columns

#

i think i was hung up on that point

untold tundra
#

me?

desert oar
#

are you?

untold tundra
#

am i?

#

i'm speaking generically

desert oar
#

i guess not then!

untold tundra
#

i'm not the time series person

desert oar
#

i guess i'm not sure if you're speaking abstractly about the number of parameters or about the actual shape of the weight matrix/tensor

untold tundra
#

oh i'm speaking very approximately

#

= here means, "is, at its heart, "

desert oar
#

because in the basic 1-layer feedforward case you have a "1xH" weight vector where H is the number of hidden units

#

i know that in general the closer you get to 1 parameter : 1 data point, the closer you get to just memorizing the original data. but i'm not sure how well that intuition generalizes

untold tundra
#

if you do KNN(k=1).fit(X,y).predict(X) you get exactly y, ie., 0 training loss ... why? because W = (X, y) by design

if you do repeat: NN(num_weights=len(X)).fit(X) until ==y, then you've got 0 training loss, ... why?

well it isnt literally that W = (X, y) , but W "is basically" shuffled(X, y)
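
A minimal sketch of the 1-nearest-neighbour half of that claim, using only NumPy. The data is random and purely illustrative, and `one_nn_predict` is a hypothetical helper, not a library function. Since the "parameters" of 1-NN are literally the stored `(X, y)`, predicting the training set returns `y` exactly:

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """Predict by copying the label of the closest training point (k=1)."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # made-up features, illustrative only
y = rng.integers(0, 2, size=20)     # made-up binary labels

train_preds = one_nn_predict(X, y, X)
train_error = float((train_preds != y).mean())
print(train_error)                  # each point is its own nearest neighbour
```

This only demonstrates the KNN side; whether a trained network's weights end up as a "rotated" copy of the data is the conjecture being discussed, not something this sketch shows.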

desert oar
#

sure, but what do you mean by "repeat" in this case? are you talking about stacking more layers? adding more weights? running more epochs?

untold tundra
#

running more epochs

desert oar
#

i don't disagree btw, but i want to make sure i understand your point if i am to borrow the idea 😉

untold tundra
#

you can see it as a probabilistic condition on W

#

like, what happens if len(W) >> len(X)

#

then it is certainly never the case that the entries of W would be the entries of X

desert oar
#

is running more epochs really increasing the size of W though?

untold tundra
#

no, its about permuting W until its just a rotation of X into a new space

desert oar
#

ok, sure. or iterating as close as possible thereto

untold tundra
#

yeah

#

i mean, i think it is pretty exact

#

if a PCA is basically just "rotate X by its means"

#

then a NN under these conditions is just the same thing

desert oar
#

oh you are saying specifically if you have at least 1 weight per data point

untold tundra
#

ie., W = rotate (X,y) by its means

desert oar
#

yes, that makes sense

untold tundra
#

yeah, if len(W) < len(X) you get more compressive

#

and end up closer to an ensemble of linear regressions

#

it's always just: mean(means...( x rotated-by means...(history)))

desert oar
#

i guess i still don't have a great sense for what the len(W) is. if you have two one-hidden-layer networks, but one has 5 hidden units and the other has 10, the second one has greater len(W) in your eyes, right?

untold tundra
#

i just mean all the parameters

#

as in, every single thing under optimization

desert oar
#

yeah, ok

#

really interesting idea

#

makes sense intuitively but i might have to simulate and convince myself 🙂

#

and maybe write out some equations

#

certainly i agree with the idea that if you have enough parameters you end up memorizing and reshuffling the data rather than compressing it

#

that such a thing conceptually is similar to k-nearest-neighbors sounds logical but somehow isn't fitting right into my head. will have to tinker with it

untold tundra
#

well a NN is just a prediction fn, f =A W1 X...A W2 X... A W3 X

desert oar
#

sure

untold tundra
#

so maybe, if we say, entries of W* = entries of W1, W2, W3

#

my claim amounts to something like, AW* is just a pca-like rotation of X

#

when len(W) == len(X) and when loss(training) == 0

#

ie., when the network is predicting its historical data, and when the number of parameters = the number of data points

desert oar
#

i follow you that far

untold tundra
#

right, so maybe the idea is something like

#

"roughly", AX on W* == AW* on X

#

ie., W* and X are basically just the same

#

this isnt how i arrived at the conclusion though

#

i arrived at my general view, by:
(1) dropout on NNs basically ensembles them.. wait, softmax/last-layer is just a mean() anyway, so they're coming into last layer as an ensemble

(2) all ML reduces to mean(), mode(), etc

(3) algs which predict their training data exactly are overfit, ie., their parameters are closer to the original data than they should be

(4) if you're perfectly interpolated and have sufficient parameters to play with, it is extremely likely your parameters are just your original points ("in a rotated space")

#

and also, if you think about the two branches of alg, either you force distributional assumptions on your historical data.. in which case you fit to a model

or you dont, in which case you fit to the data

#

a NN is just a dial between those

desert oar
#

yeah, that much i totally follow

#

i really like that line of reasoning

#

so i can definitely see how that would lead to something knn-ish

#

i wouldn't really describe it as nearest neighbors, but certainly an increasingly local approximation

#

i also really like the mode vs mean thing

#

going to borrow that one

#

ty for the insights!

lapis sequoia
#

to classify 1000 labels how many imgs per label do i need?

#

gonna use albumentations

untold tundra
#

there's no universal answer to that question, if each label is a basic shape, then a handful

lapis sequoia
#

wdym with basic shape

untold tundra
#

it depends on what the images are of

lapis sequoia
#

cartoon

#

anime

untold tundra
#

at a guess, 1k/label is a minimum

lapis sequoia
#

HAHAHAAH

#

i barely can get 100

#

xD

untold tundra
#

well just use that and see what happens

#

how big are the images?

lapis sequoia
#

160x160

untold tundra
#

c. 25,000 pixels/image, 100 images/label, 1000 labels

lapis sequoia
#

what is c.?

untold tundra
#

"circa", it means approx.

lapis sequoia
#

yeah, but u are forgeting albumentations

untold tundra
#

yeah, you can augment

lapis sequoia
#

well, gotta scrape the images first. I did but some images do not correspond to the label so i had to clean them manually and after cleaning 170 labels i got bored. I guess it will be easier with better scraping

untold tundra
#

yip, you could probably pay on mechanical turk

#

to have them labelled for you

lapis sequoia
#

uff not gonna pay for a hobbie xD

#

thanks tho

quiet vault
#

Does anyone know how to use google cloud storage with colab

#

I want to know how to access a folder

lapis sequoia
#

click on the drive folder inside colab

#

it will give u a link and request for a token

#

just click on the link

quiet vault
#

ok thanks

pure pumice
#

does anyone know how to filter out items from a dataframe?

#

to only show those specific items in that column

calm thicket
#

df['column']

pure pumice
calm thicket
#

sure

pure pumice
# calm thicket sure

{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}

#

so this is the data in my dataframe

#

i need to select 4 genres of my choice from the genres column and filter the dataframe so that only those 4 are left

calm thicket
#

ok

pure pumice
#

just ignore everything i just said

#

the first step i had to do was to take the genres column and only keep the first genre in it, like if the genres column has comedy,drama,romance. I had to turn it into just comedy

#

df['Genres'] = df['Genres'].str.extract(r'^([^,]+)', expand=False)  # keep everything before the first comma, so 'Sci-Fi' stays whole
df.head()

#

i used that^

#

now i am being asked in the instructions to select 4 genres of my choice from the genres column and filter the dataframe so that only those 4 are left

calm thicket
#

i c

#

you can use

#

!d pandas.Series.isin

arctic wedgeBOT
#

Series.isin(values)
Whether elements in Series are contained in values.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
calm thicket
#

so, df['Genres'].isin(genres) is a series with bools you can index with

pure pumice
#

so using the series.isin

#

id have each genre in the values area

pure pumice
calm thicket
#

yeah

#

in a list

#

or tuple, whatever

pure pumice
pure pumice
calm thicket
#

i think you need a list, not separate arguments

pure pumice
#

i used df['Genres'] = df['Genres'].str.extract(r'^([^,]+)', expand=False)
df.head()

#

to only display one genre in the column

#

meaning the genres arnt in a list anymore

quiet vault
#

I am trying to access a directory with a ton of images stored in google cloud drive using colab. I type this in
```py
!gcloud config set project {project_id}
!gsutil cp -r dir gs://digits
```
And get the following error:
```
Updated property [core/project].
CommandException: No URLs matched: dir
```
Can someone tell me what I have done wrong

old plinth
#

Hey guys so i have a doubt regarding tensorflow. I have been working with pytorch for a long time and felt like i needed to give tensorflow a shot. So right now I am able to understand custom training loops and all of those things. Only doubt is like pytorch where there is a custom dataset and dataloader class from torch.utils.data is there anything flexible like that in tensorflow that is easier to use for custom pre processing of data. Like what is most commonly used for creating custom dataset like we do in pytorch?

pure pumice
#

@serene scaffold hey can I please get a little bit more help before i hand this project in?

#

{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}

serene scaffold
#

@pure pumice I'm busy RN but I guess ask your question

pure pumice
pure pumice
#

well now i need to

#

Select 4 Genres of your choice. Filter your dataframe so that only those 4 Genres are left

#

after that first step which we did

tribal oracle
#

any data scientist can tell me:

#

you store the data on an excel, for example, when you need to work with it, you transform it ALL from excel to dict(for example)?

#

or you work directly on the excel?

serene scaffold
serene scaffold
#

it's basically effortless.

tribal oracle
serene scaffold
#

handle with it?

tribal oracle
#

with the data

#

i'll send an example, one sec

serene scaffold
#

it depends on what you're trying to do

serene scaffold
tribal oracle
#
def load_excel(arquivo_excel, index_coluna):
    df = pd.read_excel(arquivo_excel)
    df.set_index(index_coluna).T.to_dict('list')
    return df.to_dict(orient='list')

i'm doing that to load from the .xlsx to (for example) that:

dict_machine = {'ignore-2': [],
                'ignore-1': [],
                "ignore": [],
                "ignore2": [],
                "ignore4": []}
serene scaffold
tribal oracle
#

wait, so i can totally remove it?

serene scaffold
#

but at that point you might as well have

def load_excel(arquivo_excel, index_coluna):
    return pd.read_excel(arquivo_excel).set_index(index_coluna).T.to_dict('list')
tribal oracle
#

hmmmm, let me try

#

nah, its returning a key error

serene scaffold
tribal oracle
#

maybe something below is conflicting

#

but anyways, nevermind

pure pumice
serene scaffold
pure pumice
#

because

#

the genres arnt listed as that

pure pumice
serene scaffold
#

@pure pumice but also you're comparing them to one string

#

Which is ordered

pure pumice
#

ya because I have them all under one ""

serene scaffold
pure pumice
#

ya

pure pumice
serene scaffold
pure pumice
#

before we removed

#

everything but the first one

serene scaffold
pure pumice
#

this wouldnt work either eh?

serene scaffold
#

@pure pumice you don't want to be using == with the whole column

#

Because it will check if the value in that column matches the string exactly, from beginning to end.

pure pumice
#

cuz just one = sign doesnt work either

serene scaffold
pure pumice
serene scaffold
#

Not at the moment, if I'm being perfectly honest. Your goal is to get those rows where at least one of the genres belongs to one of four that you pick, yes?

pure pumice
#

exact instructions^

serene scaffold
#

@pure pumice so you need to pick those rows where the genres are a subset of the four that you pick

pure pumice
#

yes

#

like when i do df.head()

#

only movies in those 4 genres should display

serene scaffold
#

@pure pumice "only movies in those four genres" needs a more robust definition, since a movie can belong to more than one genre

#

Does it need to belong to all four? Exactly one? At least one (but possibly others that aren't?)?

pure pumice
#

just one

#

because before this

#

we deleted all the genres from the column and kept one

serene scaffold
#

Exactly one?

pure pumice
#

so only one genre will show

#

if uk what i mean

serene scaffold
pure pumice
# serene scaffold Are you sure you were supposed to delete the others?

Using the original dataframe, take the genres column, and only keep the first genre.

For example, if the value was previously Comedy,Drama,Romance, then it would become Comedy

Select 4 Genres of your choice. Filter your dataframe so that only those 4 Genres are left

Create a pivot table of the average runtime of movies over time. The rows are therefore the year

The columns will be the 4 Genres you filtered for

serene scaffold
#

@pure pumice alright

#

!docs pandas.Series.isin

arctic wedgeBOT
#

Series.isin(values)
Whether elements in Series are contained in values.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
pure pumice
#

okay so where it says values

#

id put in the genres

#

in a list?

#

@serene scaffold

serene scaffold
#

@pure pumice try it and see

pure pumice
#

be df?

serene scaffold
#

@pure pumice a series is basically a stand alone column

#

I must now sleep

pure pumice
#

okay goodnight

serene scaffold
#

The genres column is a series.

pure pumice
serene scaffold
#

That won't work

pure pumice
#

ya series not defined

serene scaffold
#

The genre column is the series

#

Also you passed four strings individually as four arguments

#

Not a list

#

Good luck!

pure pumice
#

so they should all be under one " "

serene scaffold
#

No.

#

You should have passed one list with four strings.
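
A small sketch of the "one list with four strings" point, assuming pandas is available. The three-row frame is a cut-down stand-in for the real data with the Genres column already reduced to the first genre, and `wanted` is just an illustrative name:

```python
import pandas as pd

# Cut-down stand-in for the real dataframe, first genre only
df = pd.DataFrame({
    "Title": ["Inception", "Back to the Future", "The Good, the Bad and the Ugly"],
    "Genres": ["Action", "Adventure", "Western"],
})

wanted = ["Action", "Adventure", "Comedy", "Animation"]  # one list, four strings

mask = df["Genres"].isin(wanted)   # boolean Series, one entry per row
filtered = df[mask]                # keep only the rows where the mask is True
print(filtered)
```

Calling `isin("Action", "Adventure", ...)` with separate arguments fails because `isin` takes a single `values` argument.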

pure pumice
#

okay ya i cant do this

#

i will try again

#

tomorrow

serene scaffold
cold skiff
#

oh both the argument and the parameters have to be set like objects

torpid elk
#

I'm trying to learn data science for python. Any good resources you guys can recommend? Or site where I can practice my skills?

magic edge
torpid elk
magic edge
torpid elk
frank torrent
#

Hi, does anyone know how I could create an orientation histogram using SIFT? I have found the key points and now would like to make an orientation histogram for some of them

next lance
#

Why can we not make fully auto driving cars when we already have technology to do that

untold tundra
#

we dont have the technology

rigid dawn
#

What's the difference between MYSQL, POWER BI , AND TABLEAU

untold tundra
#

mysql = database for storing data in tables; powerbi = microsoft data visualization & reporting tool which gets data from a database; tableau = non-microsoft "alternative" to powerbi, maybe a bit harder to use

rigid dawn
#

So, basically powerbi extracts data from MySQL

#

??

untold tundra
#

it's capable of doing that, yes

rigid dawn
#

What should I learn first?

#

I am on the data analysis road

#

Done with python basics( i still get stuck a lot of times)

#

Modules also

#

Numpy pandas matplotlib seaborn plotly

#

Now what should I do?

untold tundra
#

err, it is very important to learn SQL

#

the easiest way of doing that first is using sqlite3 in python, as you dont have to install anything

#

so go learn SQL & sqlite3, and when you've done that, then maybe look at powerbi

untold tundra
#

import sqlite3
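
A minimal sketch of what that looks like in practice, using only the standard library. The `movies` table and its rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, runtime REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",  # ? placeholders keep values out of the SQL string
    [
        ("Inception", 2010, 148.0),
        ("The Matrix", 1999, 136.0),
        ("Back to the Future", 1985, 116.0),
    ],
)

# Plain SQL: filter and sort
rows = conn.execute(
    "SELECT title, year FROM movies WHERE year < ? ORDER BY year", (2005,)
).fetchall()
print(rows)
conn.close()
```

The same `SELECT ... WHERE ... ORDER BY` statement carries over to MySQL more or less unchanged, which is why sqlite3 is a convenient place to practice.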

rigid dawn
#

So, I can use all python modules in MySQL?

untold tundra
#

sqlite3 is a simpler database, that is a bit like mysql

rigid dawn
#

Sure, I would try that first. And after that move to powerbi

#

Bro, can I send you friend request? If that's not an issue

#

If you comfortable with that

untold tundra
#

you can send one, i wont accept immediately; i might accept at some point

acoustic forge
#

What's the best way to deploy a machine learning model on Azure? Which service should I look into?

untold tundra
#

depends what you mean by "deploy"

#

the obvious service is AzureML

acoustic forge
#

Perfect - Yeah, I think that's what I was looking for

untold tundra
#

microsoft has lots of free courseware in this area, on their microsoft learning github

acoustic forge
#

It's primarily cause we have to design a deployment model for a fictional company as part of one of our courses

#

Basically I need to make an architecture that scales well

#

I'll send a picture of the architecture in a bit, maybe something stands out to you as being comically wrong

untold tundra
#

i suspect if you just followed the instructions on that course above, maybe first 6 or 7 labs

#

you'd basically have the solution

buoyant nebula
#

Can any body help me on this query

acoustic forge
# untold tundra you'd basically have the solution

Anything here that (to you) looks very out of place? Any suggestions? Basically we need to present a scalable architecture to people who don't know anything about architecture/machine learning for a machine learning model in steel plate fault detection

desert oar
#

this is for people to actually train models?

#

i am generally skeptical of letting people who know 0 machine learning do machine learning

#

also the databricks notebook interface is hot steaming garbage

#

not worth it imo, use databricks-connect or just run non-interactive jobs

acoustic forge
#

I understand your skepticism 😛 This is a completely fictional report. And we are forced to use Databricks

#

Well, the report is not fictional, the case is

desert oar
#

jesus, are you working for my previous employer?

acoustic forge
#

Hahahaha our teacher is a consultant, so if you work in consulting I might very well be

desert oar
#

i don't, but fuck the microsoft salespeople who convinced upper mgmt to force databricks on everyone

#

dont get me wrong, i really liked having a managed and optimized spark cluster

#

but there was a moment when they actually believed we could do actual work using that interface

acoustic forge
#

It's honestly crazy. Its such a hindrance for actual work

desert oar
#

like not just "big data" work -- they thought we could do all of our work either on our laptops or databricks

acoustic forge
#

Can I send you an updated architecture diagram?

desert oar
#

sure

#

and im so relieved that you agree

#

serious tip: set up a local spark cluster on a server, so people can actually develop and test their spark code without burning your very very expensive databricks cluster time

acoustic forge
desert oar
#

or even set it up on data scientists' / ml devs' machines

acoustic forge
#

We basically had to make up the case ourself

desert oar
#

i guess my point is that dev affordances must at least be part of the plan somewhere

#

at least in my opinion

#

maybe consultants dont care

acoustic forge
desert oar
#

if the plan is "make pretty diagram, tell contributors to f off" then your contributors will all quit for better and better-paying jobs, and your project will fail

#

yeah this seems kind of rough

#

so if it makes you feel better: yes, that architecture was good enough for a global 500

#

so it's good enough for your course

#

and actually on the "production" side it was pretty damn good

acoustic forge
#

Super nice! I am very happy about that, thanks a lot 🙂

desert oar
#

and better diagrams than we had too 😛

#

ours were literally low res jpg's someone had downloaded off the azure site

acoustic forge
#

Yeah, we're not super into making these diagrams, we're data scientists, not necessarily cloud architects. He was super savage to us last time (cause we kept it more technical than he wanted). So this time we really tried to reduce it something anyone could understand (more or less)

#

But god damn. Soon I will never have to touch databricks again (hopefully)

desert oar
#

this looks good

#

and honestly if you do need spark, databricks beats the hell out of running your own yarn cluster

#

but yes avoid that accursed notebook interface

wicked grove
#

I want to build a movie recommendation system using ml / can you suggest any good ml projects that i can do

untold tundra
#

seems ok

queen crag
#

Are courses on free code camp useful?

serene scaffold
robust jungle
#

does anyone know how to fix this? absl.flags._exceptions.IllegalFlagValueError: flag --sample_1_of_n_eval_examples=: invalid literal for int() with base 10: ''

#

coming from:

#

python model_main_tf2.py
--model_dir=$/tmp/model_outputs --num_train_steps=$10000
--sample_1_of_n_eval_examples=$1
--pipeline_config_path=$/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2/pipeline.config
--alsologtostderr

desert oar
#

@robust jungle did you write this program? model_main_tf2.py

#

it seems like someone gave you incorrect usage instructions

#

that, or you wrote $10000 when you meant 10000

#

the $ is not a "number", it introduces a shell variable

#

so $1 is the first argument of a shell script

#

so it's not clear what $1 or $10000 are supposed to be

#

when you try to use a nonexistent variable in a shell script, the value is an empty string

#

so that's probably what's causing this error
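
A hedged way to see this for yourself, assuming a POSIX `sh` is on the PATH. `shell_echo` is a hypothetical helper that just lets the shell expand a string and hands back the result:

```python
import subprocess

def shell_echo(text):
    """Let `sh` expand `text` inside double quotes and return what it produced."""
    # No positional arguments are passed to sh, so $1 is unset in this shell.
    result = subprocess.run(
        ["sh", "-c", f'printf "%s" "{text}"'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# $1 is unset, so it expands to an empty string
print(repr(shell_echo("--sample_1_of_n_eval_examples=$1")))

# $10000 is read as $1 followed by the literal characters 0000
print(repr(shell_echo("--num_train_steps=$10000")))
```

The first call leaves nothing after the `=`, which matches the `invalid literal for int() with base 10: ''` in the error message.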

#

if you were given usage instructions, can you post those here?

robust jungle
#

ngl I probably added that to experiment since I saw it on each line

#

but I just removed it

#

still doesnt work

#

just removed it from the other line too

#

new error (placeholder thing was still there), seems to have fixed it

#

thanks

#

side note: the thing im running is commented at the top of model_main_tf2.py, which can be downloaded on the tensorflow github

#

nevermind

#

just looked at it again

#

that one was on me

#

I did a goof

#

new issue

#
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. tensorflow.python.lib.io._pywrap_file_io.BufferedInputStream(filename: str, buffer_size: int, token: tensorflow.python.lib.io._pywrap_file_io.TransactionToken = None)

Invoked with: None, 524288
#

comes from:

PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
  --model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
  --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
  --pipeline_config_path=$PIPELINE_CONFIG_PATH \
  --alsologtostderr
#

ignore the difference in the last bit, thought the top part was a quickstart and the bottom was placeholders

queen crag
pure pumice
#

@serene scaffold I figured it out df[df["Genres"].isin(['Action','Animation','Comedy','Adventure'])]
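
For anyone following along, a sketch of that filter plus the pivot-table step from the instructions, assuming pandas. The frame below is a cut-down stand-in (first genre only, a few columns) rather than the full dataset:

```python
import pandas as pd

# Cut-down stand-in for the real dataframe, Genres already reduced to the first genre
df = pd.DataFrame({
    "Year":    [2010, 1999, 2018, 1985, 1966],
    "Genres":  ["Action", "Action", "Action", "Adventure", "Western"],
    "Runtime": [148.0, 136.0, 149.0, 116.0, 161.0],
})

genres = ["Action", "Animation", "Comedy", "Adventure"]
df = df[df["Genres"].isin(genres)]   # overwrite df with only the matching rows

# Average runtime per year (rows) and genre (columns)
pivot = df.pivot_table(index="Year", columns="Genres", values="Runtime", aggfunc="mean")
print(pivot)
```

Genres from the list that never occur in the filtered data (Animation and Comedy here) simply do not get a column, and year/genre combinations with no movies show up as NaN.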

pure pumice
#

like we did with rotten tomatoes

serene scaffold
#

@pure pumice you can write over the existing one

pure pumice
pure pumice
#

thanks

normal radish
#

Hey guys I am in serious need of help on a CNN. Do any of you have some time to spare?

normal radish
#

Please anyone with any knowledge of convolutional neural networks! You will save my day!

serene scaffold
#

@normal radish try asking your CNN question. It's not likely that anyone will commit to a question you haven't asked yet, even if they know about CNNs.

#

(This goes for any time you want to ask a question on the internet: just ask the question.)

normal radish
#

Ok so I have a CNN where I have a 180x180x3 (RGB) image being convolved with 32 filters. Is it correct to say that this gives me 32 new images?

#

@serene scaffold

untold tundra
#

no

#

there will be 32 activation maps when the 32 filters are applied, but the activation maps arent new images as such

normal radish
#

But it will change from a 180x180x3 to a 180x180x32 right?

untold tundra
#

if i recall correctly, yes -- you can just build it in keras and then print a summary

normal radish
#

Yeah I did that but not sure I understand how

untold tundra
#
from keras.applications.vgg16 import VGG16
model = VGG16()
model.summary()  # summary() prints the layer table itself and returns None
#

compare the VGG16 diagram (google images) with the summary

#

convolving layers are often compressive

normal radish
#

Thing is each of the filters will be applied to a 180x180x3 input but turn it into a 180x180x1. Does it just add the values down?

untold tundra
#

so you can go from 180x180x3 -> lots of things

#

yes

#

the "convolution product" is a dot product thru' the image

normal radish
#

Do you have 2 seconds to talk?

untold tundra
#

i have two seconds to type

#

are you aware of the stanford course on this? i'm sure there's a video which does parameter counting

normal radish
#

I'm not sure if I understand. I understand that the filter is passing over every pixel with a kernel, and there it dots and takes the sum for the new pixel. But… there are 3 layers. How does this turn into 1?

normal radish
#

Do you know the answer to my previous question? I will watch this tomorrow

untold tundra
#

its like a volumetric dot product

#

you take the pixels of all three layers, and you write them in a single row

#

then the dot product is just the filter's weights in a single row

#

dot'd with those

#

visually, the filter "punches thru'" the layers

normal radish
#

So you dot the first pixel in every layer and then the next

untold tundra
#

i cant recall the exact formula, but iirc, basically each channel has its own slice of the filter, so you take the first weight of each slice, eg., w00red, w00green, w00blue, and you get x0red*w00red + x0green*w00green + x0blue*w00blue + ...

#

the idea is that w00 is the weight for "that part of the image"

#

and that there are three color channels is a bit of a distraction, it's duplicated information

normal radish
#

So if the first pixel in the red layer is 2, blue is 3 and green is 1 and the weights are all 1, the first pixel in the new feature map is 6?

untold tundra
#

well the output, ie., activation map, is a scanning of the filter across the image

normal radish
#

But you say it punches through basically adding the values of the pixels in each layer?

#

Resulting in an "image" and not 3

untold tundra
#

yeah, but the detail is subtle, and i wouldnt want to get it wrong

quiet vault
normal radish
#

I am

#

Padding=same

#

And strides=1

untold tundra
#

the 180x180x32 activation maps are "images" made by weights, but the weights can be 3x3x3, say

#

or 3x3x1

normal radish
#

So it could give me 3 "images" per filter again?

untold tundra
#

each cell in the "image" ( activation map ) corresponds to the "same region" in the input picture

#

so the top left of each activation map is the top left of the input

#

and the value is the sum of the filters applied to that region

normal radish
#

Yes that's the scan it makes with the filter

untold tundra
#

if you see a filter as a volume: 3x3x3

#

then it's dot'd with the same volume in the input, 3x3x3

#

and all i can remember about that, is the math is basically, take those 27 numbers in the top corner (3x3 pixels, 3 channels), and dot them with the 27 numbers in the filter

#

basically in quite a naive linear order

normal radish
#

I see a filter as 3 values in top 3 in middle and 3 in bottom

#

Yes okay

#

What confuses me was how 180x180x3 turned into 180x180x32

untold tundra
#

well the "volume in weights" dot "volume in the image" is a single number

normal radish
#

Since in my case I convolve more than once resulting in 180x180x32 turning into 180x180x64 after and I thought it would be 32*64 activation maps

#

So 180x180x(32*64)

untold tundra
#

the underlying operation between each filter and each region of the image is just a dot product

#

so it projects just to a single number for each application of the filter
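
A rough NumPy sketch of that shape bookkeeping, under a few assumptions: stride 1, zero "same" padding, no bias or activation, and a tiny 8x8 image standing in for 180x180 so the naive loops stay fast. `conv2d_same` is a hypothetical helper written for illustration, not the Keras implementation:

```python
import numpy as np

def conv2d_same(image, filters):
    """Naive stride-1, zero-padded ('same') convolution as used in CNN layers.

    image:   (H, W, C_in)
    filters: (k, k, C_in, C_out), with k odd
    returns: (H, W, C_out)
    """
    H, W, C_in = image.shape
    k, _, _, C_out = filters.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zeros around the border
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k, :]  # k x k x C_in volume under the filter
            # Dot the whole volume with every filter: one number per filter
            out[i, j, :] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))            # small stand-in for a 180x180x3 image
w1 = rng.normal(size=(3, 3, 3, 32))      # 32 filters, each a 3x3x3 volume
w2 = rng.normal(size=(3, 3, 32, 64))     # 64 filters, each a 3x3x32 volume

a1 = conv2d_same(image, w1)
a2 = conv2d_same(a1, w2)
print(a1.shape, a2.shape)

# The "all weights are 1" case: one pixel with R=2, G=1, B=3
pixel = np.array([[[2.0, 1.0, 3.0]]])
ones = np.ones((1, 1, 3, 1))             # a single 1x1x3 filter
channel_sum = conv2d_same(pixel, ones)[0, 0, 0]
print(channel_sum)
```

Each second-layer filter spans all 32 incoming maps and collapses them into one number per position, which is why the depth becomes 64 rather than 32*64.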

normal radish
#

I think I get it. Thanks for the answers @untold tundra and @quiet vault

untold tundra
#

sure, i'd strongly recommend watching the Stanford course

#

all of these questions are answered very clearly and logically

#

good night

normal radish
#

Goodnight

wooden cosmos
#

Hello, i have a question regarding the implementation of a particular neural network. I understand the model but i can't figure out how this guy does backpropagation. Could someone explain it to me? https://teddykoker.com/2019/12/beating-the-odds-machine-learning-for-horse-racing/

Teddy Koker

Inspired by the story of Bill Benter, a gambler who developed a computer model that made him close to a billion dollars betting on horse races in the Hong Kong Jockey Club (HKJC), I set out to see if I could use machine learning to identify inefficiencies in horse racing wagering.

distant trout
#

if anyone could help with minimax ai come to kiwi channel 😂

lusty valley
#

Can somebody explain a classification report done on an SGD classifier to me. I have 79% precision and 100% recall on 0s but I have 0% precision and recall on 1s

tidal bough
#

That means your classifier classifies everything as 0, I believe
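
A quick sanity check of that reading in plain Python, with made-up labels (79 zeros, 21 ones) and a "classifier" that always answers 0. `precision_recall` is a hypothetical helper, not the sklearn function:

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class; 0.0 when the denominator is zero."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # how often the label was predicted
    actual = sum(t == label for t in y_true)      # how often the label really occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = [0] * 79 + [1] * 21   # made-up labels: 79% zeros
y_pred = [0] * 100             # always predict 0

print(precision_recall(y_true, y_pred, 0))
print(precision_recall(y_true, y_pred, 1))
```

Class 0 gets precision 0.79 and recall 1.0, class 1 gets 0 for both, which is the same pattern as in the report. With imbalanced classes this can happen regardless of how many features there are.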

lusty valley
#

I see. So it's useless then

#

Not enough features

slender kestrel
#

yo anyone who is learning data science or working in field of data science ? be kind enough to hit me up on dms or ping me

acoustic forge
#

Hey guys - So I am creating a real estate regression model based on historical sales of real estate in the Copenhagen (Denmark) area. I was curious if anyone has any cool articles regarding the performance and real-world viability of these types of models?

#

I know how to create it, I am just curious how performant these things are, especially cause I don't know much about real estate, but will be buying an apartment soon

quasi ether
#
def prepare(filepath):
    size=50
    img=cv2.imread(filepath)
    img=cv2.resize(img,(size,size))
    return img.reshape(-1,size,size,1)
#

i need help

#

return img.reshape(-1,size,size,1)

warm jungle
#

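ndarray.reshape accepts the new shape either as separate ints or as a single tuple

quasi ether
warm jungle
#

so img.reshape((-1, size, size, 1)) is equivalent to what you have. cv2.imread returns 3 colour channels by default though, so a trailing dimension of 1 only lines up if you load the image as grayscale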
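A small check of the shapes involved, using zero arrays as stand-ins for what cv2.imread would return (an assumption, so the sketch runs without OpenCV or an image file). A 3-channel colour image reshaped to (-1, size, size, 1) silently becomes a batch of three, whereas a grayscale image gives the single sample the function presumably intends.

```python
import numpy as np

size = 50

# Stand-ins for resized images: colour (BGR) vs. grayscale
color = np.zeros((size, size, 3), dtype=np.uint8)
gray = np.zeros((size, size), dtype=np.uint8)

print(color.reshape(-1, size, size, 1).shape)  # (3, 50, 50, 1)
print(gray.reshape(-1, size, size, 1).shape)   # (1, 50, 50, 1)
```

With OpenCV itself, passing cv2.IMREAD_GRAYSCALE as the second argument to cv2.imread is the usual way to get the single-channel array.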

normal radish
#

Hey, does anyone know how to make a visualisation like this for my ConvNet?

orchid kayak
#

What does it mean when my evaluation loss is magnitudes higher than my training loss?

odd meteor
# orchid kayak What does it mean when my evaluation loss is magnitudes higher than my training ...

Your model is most likely overfitting, i.e. it is suffering from a high variance problem.

In layman's terms, it means your model did well at minimizing your loss function on the train set, but not so well at replicating that success on your validation set.

The aim is to get the loss on the validation set (RMSE, categorical cross-entropy, or whatever your exact loss function is) as close as possible to the same loss on your train set.
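A toy illustration of that gap, under made-up assumptions: 8 noisy samples of a sine curve, fitted with a degree-7 polynomial that has enough coefficients to pass through every training point. The training loss collapses towards zero while the loss on fresh samples does not.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Noisy samples of sin(3x) on [-1, 1]; an assumed toy problem."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.1, size=n)
    return x, y


x_train, y_train = make_data(8)
x_val, y_val = make_data(200)

# 8 coefficients for 8 points: the polynomial can memorize the training set
coeffs = np.polyfit(x_train, y_train, deg=7)


def mse(x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


train_mse = mse(x_train, y_train)
val_mse = mse(x_val, y_val)
print(f"train MSE: {train_mse:.2e}, validation MSE: {val_mse:.2e}")
```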

old grove
#

Hello all. In the spearmanr correlation test, what is our null hypothesis: that there is a correlation, or that there is no correlation?

warm valley
#

Hello, I have a small question. I have a pandas column which has negative values, and I want to convert them to positive.

I tried
data[data['Quantity']] = abs(data['Quantity'])
but it's giving an error

dusk zephyr
#

which error? are there nan/null values in the column?

warm valley
#

By error, I mean, it was taking years.
But upon searching a bit, I found the code
data[data.columns[data.dtypes != object]] = data[data.columns[data.dtypes != object]].abs()
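For a single column, a shorter route is to index by the column name and call .abs() on it. In the first attempt, data[data['Quantity']] is interpreted as selecting columns whose labels are the values in Quantity, which is probably not what was intended. A minimal sketch with a made-up frame:

```python
import pandas as pd

# Made-up example data; only the column name "Quantity" comes from the question
data = pd.DataFrame({"Quantity": [3, -2, -7, 5], "Item": ["a", "b", "c", "d"]})

# Index by the column label, then take absolute values element-wise
data["Quantity"] = data["Quantity"].abs()

print(data["Quantity"].tolist())  # [3, 2, 7, 5]
```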

orchid kayak
odd meteor
# orchid kayak I understand, I've got a few followups if you don't mind: 1. My training accurac...

These ML terminologies might be kinda confusing, but there are two things I'd like you to understand first.

Remember that one of the main goals of almost every model architecture in ML or Deep Learning projects is to either:

  1. Minimize the loss function
    Or
  2. Maximize the objective function

Now you, as an ML Engineer, want to reach the optimum point: a model that minimizes your loss function (or maximizes your objective function) on the training data while also achieving the lowest generalisation error on unseen data. You can think of this roughly as finding an equilibrium point; in ML this balancing act is called the bias-variance tradeoff.

Usually, achieving #1 in most scenarios leads to achieving #2 as well.

__What is a Loss Function or an Objective Function?__

In layman's terms, a loss function is what we use during training to measure how well our model explains the data.

NB: When we're minimizing a function, it's called a loss function or cost function.

Examples: RMSE, MAE, MSE, Huber, log loss (essentially the same thing as categorical cross-entropy), etc.

An objective function is the function we want to either minimize or maximize. In general terms, it's any function we want to optimize during training, so a loss function is a type of objective function.

Examples: coefficient of determination a.k.a. R^2, explained variance score, F1 score, ROC AUC score, accuracy_score, precision_score, and all the examples of loss functions above, etc.

With the above explanation I believe you can easily understand what I'm about to say next.

a) High Error leads to Low Accuracy score
b) Low error leads to High Accuracy

So to answer your questions now

  1. I'm fairly sure everything I've explained up to this point has answered your first question.

  2. You'd have to reduce your model complexity, or gather more data. By reducing model complexity I mean, for example, reducing your max_depth, increasing your min_samples_leaf, etc.

Still confused? I hope not 😀

orchid kayak
#

Not confused anymore, thanks a lot!

bold timber
#

Hi, I am so confused about this code. What does argsort()[-6:] do in this code?

serene scaffold
#

!e

import numpy as np
arr = np.array([1, 7, 3, 4])
print(arr.argsort())
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

[0 2 3 1]
bold timber
serene scaffold
#

!docs numpy.ndarray.argsort

arctic wedgeBOT
#

ndarray.argsort(axis=-1, kind=None, order=None)
Returns the indices that would sort this array.

Refer to [`numpy.argsort`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort "numpy.argsort") for full documentation.

See also

[`numpy.argsort`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort "numpy.argsort"): equivalent function
bold timber
serene scaffold
lapis sequoia
#

To explain the example stel gave above:
argsort gives the index of each element, rather than the element itself, in sorted order.

serene scaffold
#

and the ints are in the order that the elements would be if you had actually sorted the array
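Putting that together for the original question: since argsort sorts ascending, slicing the result with [-6:] keeps the indices of the six largest values (smallest of the six first). A short sketch with made-up scores:

```python
import numpy as np

# Made-up values; any 1-D array works the same way
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4])

order = scores.argsort()  # indices that would sort the array ascending
top6 = order[-6:]         # indices of the six largest values

print(top6.tolist())          # [2, 7, 4, 3, 6, 1]
print(scores[top6].tolist())  # [0.3, 0.4, 0.5, 0.7, 0.8, 0.9]
```

Reversing the slice with [::-1] would list the largest first.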

bold timber
bold timber
lapis sequoia
#

Means?

bold timber
lapis sequoia
tough bolt
#

Yo

#

I have the task to run this

#

But have no idea where to start

#

Could somebody give me some guidance or nudge me in the right direction on how to use those pretrained models

chilly flame
#

Hey everyone, I have a character recognition task. I've already finished the preprocessing and got some binary images, but I have no idea how to extract the features and store them in the database. Can anyone who is learning image processing and knows about zoning-based feature extraction with an SVM classifier explain it to me or give me a link to the documentation? You can hit me up in DMs

mighty spoke
#

Hi how would I create different plots from different data frames using a loop?

normal radish
serene scaffold
robust jungle
#

I'm running this in terminal (from model_main_tf2.py) and I'm getting an error:

#
PIPELINE_CONFIG_PATH=/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2/pipeline.config
MODEL_DIR=/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
  --model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
  --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
  --pipeline_config_path=$PIPELINE_CONFIG_PATH \
  --alsologtostderr
#

error:

#
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. tensorflow.python.lib.io._pywrap_file_io.BufferedInputStream(filename: str, buffer_size: int, token: tensorflow.python.lib.io._pywrap_file_io.TransactionToken = None)

Invoked with: None, 524288
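One thing that may be worth checking, although the error message alone does not confirm it as the cause: the directory name in the command contains a space and is not quoted. POSIX shells split unquoted words at whitespace, so the assignment would not capture the full path. The sketch below uses Python's shlex module as an approximation of that splitting; the shortened path is a hypothetical stand-in.

```python
import shlex

# Hypothetical, shortened path with a space in the directory name
unquoted = "MODEL_DIR=/tmp/models/effdet 2"
quoted = 'MODEL_DIR="/tmp/models/effdet 2"'

# Unquoted: the line falls apart into two words
print(shlex.split(unquoted))  # ['MODEL_DIR=/tmp/models/effdet', '2']

# Quoted: the whole path stays in one word
print(shlex.split(quoted))    # ['MODEL_DIR=/tmp/models/effdet 2']
```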
serene scaffold
#

!traceback

robust jungle
serene scaffold
arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

robust jungle
#

alright thanks

serene scaffold
mighty spoke
# serene scaffold Why did you start with the assumption that there has to be a loop? Can you give...

Hi @serene scaffold my data is like this

import pandas as pd  # import pandas package to read data more easily
import matplotlib.pyplot as plt  # imported pyplot to plot graphs
import datetime as dt  # date time to read first column of csv file
import numpy as np
from datetime import datetime



df = pd.read_csv('TSLA.csv')
df2 = pd.read_csv('NBM.V.csv')
df3 = pd.read_csv('TSLA.csv')

df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
x1=(df['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400
x2=(df2['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400

y1=df['Close']
y2=df2['Close']

t0=[]
d0=[]

def udcf(df,df2,t0,d0):
    y1_mean = np.mean(y1)
    y2_mean = np.mean(y2)
    y1_stdv = np.std(y1)
    y2_stdv = np.std(y2)
    for i in range(len(df)):
        for j in range(len(df2)):
            t=x2[j]-x1[i]
            t0.append(t)
            d = (y1[i]- y1_mean)*(y2[j] - y2_mean)/(y1_stdv*y2_stdv)
            d0.append(d)
    return udcf,t0,d0
x, y = zip(*sorted(zip(t0, d0)))  # ensures x and y values correspond to each others in pairs when sorted
plt.scatter(x, y, ls='-', lw='1', color='red', marker='.')
#

I want to use other data frames(df2,df3) and calculate t0 and d0 for each then plot them in different graphs rather than doing it manually

serene scaffold
pale thunder
#

anyone aware of a jupyter frontend that is capable of accepting user input and connecting to an already running kernel, like jupyter console can with --existing?

serene scaffold
#

What does one call it when they do an "outer" operation on two vectors, other than multiplication (ie outer product)?

#

looks like it might not matter, in this case

mighty spoke
#

one of the csv files (df)

serene scaffold
#

Remember what I said before: print(df.head().to_dict('list'))

#

if your next message doesn't have the data in that format, I'm afraid I'll have to stop helping.

#

That was right except that it was the same df twice

mighty spoke
# serene scaffold if your next message doesn't have the data in that format, I'm afraid I'll have ...

print(df.head().to_dict('list'))
{'Date': [Timestamp('2010-08-10 00:00:00'), Timestamp('2010-11-10 00:00:00'), Timestamp('2010-12-10 00:00:00'), Timestamp('2010-10-13 00:00:00'), Timestamp('2010-10-14 00:00:00')], 'Open': [10.25, 10.19, 11.05, 12.25, 12.9], 'High': [10.57, 12.0, 12.75, 12.8, 14.79], 'Low': [10.1, 9.85, 10.96, 11.86, 12.75], 'Close': [10.13, 11.13, 12.05, 12.71, 13.94], 'Adj Close': [10.13, 11.13, 12.05, 12.71, 13.94], 'Volume': [1135300, 712500, 777000, 1413100, 1895200]}

#

is that ok?

serene scaffold
#

yes, this is what I wanted

mighty spoke
#

ah kl

serene scaffold
#

It's a good format because I can copy and paste it directly and use it

mighty spoke
#

ohh i see

serene scaffold
#
def udcf(df,df2,t0,d0):
    y1_mean = np.mean(y1)
    y2_mean = np.mean(y2)
    y1_stdv = np.std(y1)
    y2_stdv = np.std(y2)
    for i in range(len(df)):
        for j in range(len(df2)):
            t=x2[j]-x1[i]
            t0.append(t)
            d = (y1[i]- y1_mean)*(y2[j] - y2_mean)/(y1_stdv*y2_stdv)
            d0.append(d)
    return udcf,t0,d0  

This can be greatly simplified

def udcf(y1, y2):
    d = np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())
    t = (y1.reshape(-1, 1) - y2.reshape(1, -1)).reshape(-1)
    return t, d

or something like that.
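For comparison with the nested loops, here is one way the broadcasting version could look if the lag is taken as x2[j] - x1[i], as in the loop. This is a sketch under assumptions: the four-argument signature is this example's choice rather than the code posted above, the inputs are NumPy arrays (or anything np.asarray accepts) rather than pandas Series, and the sample values are made up.

```python
import numpy as np


def udcf(x1, y1, x2, y2):
    """Return flattened lags and correlation terms for every (i, j) pair."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))
    # lags[i, j] = x2[j] - x1[i], flattened in the same order as the loops
    lags = (x2[None, :] - x1[:, None]).ravel()
    # d[i, j] = (y1[i] - mean1) * (y2[j] - mean2) / (std1 * std2)
    d = np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())
    return lags, d.ravel()


# Made-up sample data: times in days and closing prices
x1 = np.array([0.0, 1.0, 3.0])
y1 = np.array([10.0, 11.0, 13.0])
x2 = np.array([0.5, 2.0])
y2 = np.array([0.2, 0.4])

lags, dcf = udcf(x1, y1, x2, y2)
print(lags.shape, dcf.shape)  # (6,) (6,)
```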

#

Anyway @mighty spoke what are you trying to plot? Which two columns are x and y?

mighty spoke
#

@serene scaffold I'm trying to plot the lag values (x) and dcf values (y)

#

@serene scaffold this is my other data frame df2

#

print(df2.head().to_dict('list'))
{'Date': [Timestamp('2020-11-27 00:00:00'), Timestamp('2020-11-30 00:00:00'), Timestamp('2020-01-12 00:00:00'), Timestamp('2020-02-12 00:00:00'), Timestamp('2020-03-12 00:00:00')], 'Open': [0.09, 0.09, 0.09, 0.09, 0.09], 'High': [0.09, 0.09, 0.09, 0.09, 0.09], 'Low': [0.09, 0.09, 0.09, 0.09, 0.09], 'Close': [0.09, 0.09, 0.09, 0.09, 0.09], 'Adj Close': [0.09, 0.09, 0.09, 0.09, 0.09], 'Volume': [0, 0, 0, 0, 0]}

arctic wedgeBOT
#

Hey @mighty spoke!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

mighty spoke
#

my full code:

serene scaffold
#

@mighty spoke idk if I can do a full code exploration rn. Can you make it so that your dataframes have columns that represent the x and y data?

#

how is lag calculated?

mighty spoke
mighty spoke
#

also I tried binning the x values but i'm not sure its done the most efficient way

#

x=t0, y=d0

#

i'm trying to compare different data frames like df and df3 or df2 and df3 then plot them in different graphs but i don't want to make a udcf function for each

serene scaffold
#

though it looks like your function is weirdly defined

#

it depends on variables defined in the global scope and doesn't use its parameters. it also returns itself

serene scaffold
# mighty spoke yes thats what i want
def udcf(y1, y2):
    return np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())

assuming that they are both vectors (one-dimensional arrays)
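Once the function takes its inputs as parameters, comparing every pair of series is a loop over combinations. A sketch with made-up arrays standing in for the Close columns; the dictionary keys are just labels:

```python
from itertools import combinations

import numpy as np


def udcf(y1, y2):
    return np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())


# Made-up stand-ins for df['Close'], df2['Close'], df3['Close'] as arrays
series = {
    "df": np.array([10.13, 11.13, 12.05, 12.71]),
    "df2": np.array([0.09, 0.11, 0.10]),
    "df3": np.array([5.0, 5.5, 4.8, 5.1, 5.3]),
}

# One result per unordered pair: (df, df2), (df, df3), (df2, df3)
results = {}
for (name_a, a), (name_b, b) in combinations(series.items(), 2):
    results[(name_a, name_b)] = udcf(a, b)

for pair, values in results.items():
    print(pair, values.shape)
```

Each entry in results could then be drawn in its own figure inside the same loop, for example by calling plt.figure() before plt.scatter.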

mighty spoke
mighty spoke
#

when you did this t = (y1.reshape(-1, 1) - y2.reshape(1, -1)).reshape(-1)
return t, d

#

what does reshape do ?
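In short, reshape changes the shape of an array without changing its data, and -1 means "work this dimension out from the others". A small sketch with made-up numbers of what the three reshape calls in that line do:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

col = a.reshape(-1, 1)   # column vector, shape (3, 1)
row = b.reshape(1, -1)   # row vector, shape (1, 2)

# Broadcasting subtracts every element of b from every element of a
diff = col - row         # shape (3, 2)
flat = diff.reshape(-1)  # back to one dimension, shape (6,)

print(diff.shape)     # (3, 2)
print(flat.tolist())  # [-9, -19, -8, -18, -7, -17]
```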

serene scaffold
sleek sentinel
#

Hello, I don't have a powerful PC to train a resource-intensive model, do you know a software to make clusters that works on both linux and windows?

serene scaffold
autumn delta
#

Hello everyone !

I was wondering if anyone is able to help guide me in a direction for a project ?

I've been looking into it and I've been seeing a lot of Ai.

Not 100 percent sure if this is the place.

sleek sentinel
serene scaffold
sleek sentinel
#

uh yes but how to launch on several machines on the same job?

serene scaffold
#

do you mean several machines or several CPUs?

sleek sentinel
#

several machines :p

serene scaffold
serene scaffold
odd meteor
#

Just thinking out loud...

Does anyone here really use TensorFlow's high-level Estimator API to train a model? If so, how often, cos... 🤔

I'm well aware of its many advantages over the low-level algebraic method and the Keras Sequential method, but I think it can be stressful when we have many features in our dataset.

Let's say we have 52+ features in our data, do we really have to define each of the 52+ feature columns manually? 😩

Is there no way to avoid manually defining each feature column?

desert oar
#

that said, you can always write a for loop or list comprehension if you need to programmatically build up lists of features
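As a rough illustration of the list-comprehension idea, the sketch below builds one column definition per feature name. The helper build_feature_columns and the plain-dict stand-in for a column are hypothetical, so the example runs without TensorFlow; in real code one might pass something like tf.feature_column.numeric_column as make_column, which is an assumption of this sketch rather than something shown in the chat.

```python
def build_feature_columns(names, make_column):
    """Create one column definition per feature name."""
    return [make_column(name) for name in names]


# Hypothetical feature names; in practice these could come from df.columns
feature_names = [f"feature_{i}" for i in range(52)]

# Stand-in factory that just records the name and dtype
columns = build_feature_columns(
    feature_names,
    make_column=lambda name: {"key": name, "dtype": "float32"},
)

print(len(columns))  # 52
print(columns[0])    # {'key': 'feature_0', 'dtype': 'float32'}
```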

final scaffold
#

Hey guys,
i have installed anaconda, which comes with almost all the packages i need. Is it still necessary to create an environment?
Can i not just do:

  • create a project in a location of my choice and select the default base environment. And then finally run the scripts since most of my packages are there in the conda installed location (base env)?
desert oar
#

there are a lot of reasons for this, but it's going to save you a lot of pain in the future if you just create one env per project

#

so yes of course you can do what you are asking about, but you shouldn't

#

personally i think anaconda made a very poor decision by shipping everything in one big base environment

compact parrot
#

Hi guys
How to implement roc auc for multiclass?
Tried various variants from google and nothing worked (perhaps cause i am not as smart as i want)

desert oar
# compact parrot Hi guys How to implement roc auc for multiclass? Tryed various variants from goo...

you can compute it separately for each pair of classes, and combine those results using this formula: https://stats.stackexchange.com/q/76830/36229

compact parrot
#

Compute separately like
for class in multiclass:
code
?

desert oar
#

yes, it can be a for loop over pairs of classes

compact parrot
#

Thanks!

odd meteor
desert oar
#

actually.. i'm not sure if that's how you do it

#

let me look into this a bit more @compact parrot

desert oar
#

you might need to do something like fit a different model for every pair of classes

compact parrot
#

Oh

desert oar
#

it's not generally used for multi-class problems
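For what it's worth, one common way to extend ROC AUC to several classes is a one-vs-rest macro average: treat each class in turn as the positive class, compute a binary AUC from the predicted probability for that class, and average. The sketch below does that with NumPy only. Note that it is a one-vs-rest average, not the pairwise formula linked earlier, and the labels and probabilities are made up. scikit-learn's roc_auc_score also has a multi_class option for this.

```python
import numpy as np


def binary_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def macro_ovr_auc(y_true, proba, classes):
    """Average the one-vs-rest AUC over all classes (column k of proba belongs to classes[k])."""
    aucs = [
        binary_auc((y_true == c).astype(int), proba[:, k])
        for k, c in enumerate(classes)
    ]
    return float(np.mean(aucs))


# Made-up labels and predicted class probabilities for three classes
y_true = np.array([0, 1, 2, 2, 1, 0])
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])

print(macro_ovr_auc(y_true, proba, classes=[0, 1, 2]))  # 1.0 (every class is ranked perfectly here)
```

With a fitted classifier, proba would typically come from predict_proba on the test set.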

desert oar
# odd meteor Thanks 😊. Could you send the link to the doc you referenced here? I'd love to r...

https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

Warning: Estimators are not recommended for new code. Estimators run v1.Session-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our compatibility guarantees, but will receive no fixes other than security vulnerabilities. See the migration guide for details.

compact parrot
#

I am quite new to DT 👉 👈

#

I could share my code for better understanding

desert oar
#

although feel free to share if you can

compact parrot
desert oar
compact parrot
#
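#%%

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

# df is assumed to be loaded already, with integer column labels 0..64
x = df.loc[:,0:63]  # features: columns 0 through 63 (.loc slices are inclusive)
y = df[64]  # class label

n_classes = y.nunique()  # number of distinct classes (y[0] would only be the first label)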

#%%

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

#%%

sc = StandardScaler()
x_train = pd.DataFrame(sc.fit_transform(x_train))
x_test = pd.DataFrame(sc.transform(x_test))

#%%

# Naive Bayes
gnb = GaussianNB()
fit = gnb.fit(x_train, y_train)
y_train_pred = fit.predict(x_train)
y_test_pred = fit.predict(x_test)

result = {'y_train': y_train, 'y_test': y_test, 'y_train_pred': y_train_pred, 'y_test_pred':y_test_pred}
show_info('Naive Bayes', gnb, result)
I am using this dataset
https://www.kaggle.com/kyr7plus/emg-4
desert oar
compact parrot
#

Thanks, will read it