#data-science-and-ml
this is what I had been assuming, but have not made it work. maybe just need to keep digging there
well i think nlp.vocab.to_disk() works in anycase, so you can use that
yes, I just might. Sometimes it seems like there may be an easier way when there actually isn't :)
thanks for your insights
Is there an efficient way to, for example, remove columns from a numpy ndarray? Ive got an (n,85) ndarray and am trying to efficiently trim out the last 80 entries
In reality selecting relevant data from the last 80 entries and placing it into index 5 of the original array, rendering the last 80 no longer useful
Ive looked into masking, deletion, copying to new array... but I'm curious if anyone has any insight into an efficient method, or if leaving as-is is the best? I'm just a bit memory constrained
you can't change the shape of an array in-place, but you can still use slicing
!e
import numpy as np
arr = np.random.random((43, 85))
new = arr[:-10, :] # chop off the last ten rows
print(new.shape)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
(33, 85)
And does slicing still use the original memory in place, i.e. a shallow copy?
Ah yeah, it seems that slicing just provides a view. Alrighty
No, I don't believe there's an in-place solution.
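For reference, a minimal sketch of the view-versus-copy behaviour discussed above. The shape and the uint8 dtype come from the conversation; the variable names are only illustrative.

```python
import numpy as np

arr = np.zeros((25200, 85), dtype=np.uint8)

# Basic slicing returns a view: no element data is copied.
view = arr[:, :5]
print(np.shares_memory(arr, view))   # True
print(view.base is arr)              # True

# A view keeps the whole original buffer alive. An explicit copy owns
# its own, much smaller buffer, so the original can then be released.
small = arr[:, :5].copy()
print(np.shares_memory(arr, small))  # False
print(arr.nbytes, small.nbytes)      # 2142000 126000
```

So a slice costs almost nothing up front, but it only saves memory once a copy has been made and every reference to the big array is gone.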
@sand loom how memory constrained are we talking, anyway?
If you can prevent the array from being assigned to a variable before you slice it, it might not stay in memory as long.
I don't have an exact number on that as the rest of the system is still under construction, but it's an embedded application. Doing non-max suppression on (25200x85) unsigned int8, so that gets a bit messy
Are you using cpython, or a different implementation?
@sand loom I've never worked on an embedded system. I'm not aware of how numpy has been used with them
I'm doing some more research on this (hadnt considered that yocto could use non-standard python which would change all profilings of things I have looked at). However, afaik it should be just the same python installation and implementation as a standard python install on ubuntu or other equivalent (python version 3.8.11 for that matter)
Well, at least at the top level I am not using cpython. The implementation on the system might be using it, though. Was not able to find any definite answer on that matter. But for now I am comfortable assuming it works nearly identical to a standard linux install
an (n,85) ndarray and am trying to efficiently trim out the last 80 entries
so you're looking to go from (n, 85) to (n, 5) ?
how efficient is data = data[:, :5] ?
or otherwise, e.g., more_data = data[:, :5].copy(); del data
the array built-in python module may also be helpful
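A rough sketch of the "summarise the last 80 columns, then trim" idea. The column layout is an assumption (columns 0-4 for box and objectness values, columns 5-84 for per-class scores), and taking the argmax is only one guess at what "selecting relevant data" means here.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed layout: 5 leading columns, then 80 per-class scores.
preds = rng.integers(0, 256, size=(25200, 85), dtype=np.uint8)

# Summarise the last 80 columns before overwriting anything.
best_class = preds[:, 5:].argmax(axis=1).astype(np.uint8)

# Write the summary into column 5 in place; no second (n, 85) array is made.
preds[:, 5] = best_class

# Keep only the first six columns. The copy owns a small buffer, so dropping
# the last reference to the big array lets it be garbage-collected.
trimmed = preds[:, :6].copy()
del preds

print(trimmed.shape, trimmed.dtype)  # (25200, 6) uint8
```

Peak memory is the original array plus the small copy, which is about as low as it gets without changing how the array is produced in the first place.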
@serene scaffold Hey, is it possible if you can help me with a few more things?
@pure pumice idk what it is
{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}
What are you trying to do
i need to Create a new dataframe that only contains values that are on two or more streaming platforms
HINT: This is a great place to use filters!
this is what ive done for netflix but i cant just do the same this for all 4
is it possible to do it in one filter?
@pure pumice so you need to add the four streaming platform columns
And then get those
That are >= 2
filt1 = df['Netflix','Hulu','Prime Video', 'Disney+'] >= 2
df.loc[filt1]
@pure pumice you didn't take the sum of them
.sum()
Also I'm on my phone
At my parents house
I wanna go home
Help me
@pure pumice try it
@pure pumice also you might need to set the axis
!docs pandas.DataFrame.sum
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Return the sum of the values over the requested axis.
This is equivalent to the method `numpy.sum`.
wtffff
What the fuckkkkk
No
hi
how do i set an axis?
It's 1 or 0
plz i still can t understand whats the role of cross_val_score
@fierce patio it's the score from cross validation
so i only need to set an axis, none of that other skipna, level... stuff
@pure pumice ya
filt1 = df['Netflix','Hulu','Prime Video', 'Disney+'].sum(axis=1) >= 2
@pure pumice that's going to be a series of bools
it just gives an error
@pure pumice remember what I said about saying you got an error.
Traceback (most recent call last)
/cloud/lib/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
/cloud/lib/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
/cloud/lib/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: ('Netflix', 'Hulu', 'Prime Video', 'Disney+')
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-12-bbb026620058> in <module>
1 # Create a new dataframe that only contains values that are on two or more streaming platforms
2 # HINT: This is a great place to use filters!
----> 3 filt1 = df['Netflix','Hulu','Prime Video', 'Disney+'].sum(axis=1) >= 2
4 df.loc[filt1]
5 #1 filter, total number line
/cloud/lib/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
/cloud/lib/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: ('Netflix', 'Hulu', 'Prime Video', 'Disney+')
whoops sorry
No that's not what I said
what u mean by score plz
@pure pumice yes but you still need quotes
For each string that is a column name
do u mean the best hyperparametr
omg it worked
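The line that finally worked is not shown in the log. A sketch of what likely fixed it, using a cut-down copy of the sample data: the column names go in a list (double brackets), because `df['Netflix', 'Hulu', ...]` is looked up as one tuple key, which is what the KeyError above reports.

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "Avengers: Infinity War",
              "Back to the Future", "The Good, the Bad and the Ugly"],
    "Netflix": [1, 1, 1, 1, 1],
    "Hulu": [0, 0, 0, 0, 0],
    "Prime Video": [0, 0, 0, 0, 1],
    "Disney+": [0, 0, 0, 0, 0],
})

platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# axis=1 sums across the columns, giving one platform count per row.
on_two_or_more = df[platforms].sum(axis=1) >= 2  # Series of bools

multi = df.loc[on_two_or_more]
print(multi["Title"].tolist())  # ['The Good, the Bad and the Ugly']
```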

thank you python master
Yw python disciple
i feel like shooting my computer screen i hate this python assignment
That won't fix it
Or make it go away
i have the same feeling
also @serene scaffold if i am trying to create a pie chart for a specific column is that possible, or would i have to put it into a pivot table first
pain
@pure pumice I've never made a pie chart
damn
I like pie
apple pie
never tried
do u eat ice cream?
if u like pecans u should eat some pralines and cream ice cream
@serene scaffold
okay so over here
{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}
i have these genres and i need to slice it so only the first genre shows on each row
but what about the rows with only one genre. wont they get sliced as well?
Reading screenshots of text is annoying, especially on mobile. Which I'm on.
Using the original dataframe, take the genres column, and only keep the first genre.
For example, if the value was previously Comedy,Drama,Romance, then it would become Comedy
@pure pumice the way with the fewest steps involves regular expressions
You might use apply and a lambda
omg lambda
never heard of
so what im doing rn is just longer
I can show you
When I get home
thank you
what is an aggfunc?
Create a pivot table where the average runtime of the movie is examined
Make the rows Year and the columns Genres
gp_pivot = df.pivot_table(values='Runtime', index='Year',
                          columns='Genres', aggfunc='mean')
gp_pivot.tail()
I assume this would just be another view. However, that's probably fine as long as I'm not making more copies of my slightly modified data. Will need to profile it and also profile your deletion suggestion. I appreciate the help!
Has anyone worked with colab when using a runtime of a vm from gcp
A function that you use to reduce aggregated data, such as sum or mean
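A small sketch of what `aggfunc` does, on made-up rows (not the assignment data): every (Year, Genres) cell collects the matching Runtime values, and `aggfunc` reduces them to one number.

```python
import pandas as pd

# Made-up rows, purely for illustration.
df = pd.DataFrame({
    "Year": [2010, 2010, 1999, 1999],
    "Genres": ["Action", "Action", "Action", "Western"],
    "Runtime": [148.0, 152.0, 136.0, 161.0],
})

# Rows are Year, columns are Genres, each cell is the mean Runtime.
pivot = df.pivot_table(values="Runtime", index="Year",
                       columns="Genres", aggfunc="mean")
print(pivot)
```

The two 2010 Action rows collapse into their mean, and combinations with no rows (2010 Western here) come out as NaN.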
@pure pumice
In [7]: df['Language']
Out[7]:
0 English,Japanese,French
1 English
2 English
3 English
4 Italian
Name: Language, dtype: object
In [6]: df['Language'].str.extract(r'^([A-z]+),?')
Out[6]:
0
0 English
1 English
2 English
3 English
4 Italian
I guess it could just be df['Language'].str.extract(r'([A-z]+)')
@serene scaffold (r'^([A-z]+),?')
what does this mean exactly
it's my demon summoning spell
does that mean its mine now
no
dammit
anyway a regular expression is a pattern that strings can match
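^([A-z]+),? means "from the start of the string (^), extract (()) one or more (+) consecutive characters from A to z ([A-z]), possibly (?) followed by a comma (,)".
strictly, [A-z] also matches the few punctuation characters that sit between Z and a in ASCII, so [A-Za-z] is the letters-only version.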
and the reason we cant use str[]
it's a method call, not str slicing
str[1:2] would be a string slice
that would work if every language name had the same number of characters.
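Applied to the Genres column from the earlier question, a sketch of two ways to keep only the first genre. The pattern `[^,]+` is a substitution for the `[A-z]+` above, chosen so that hyphenated names such as Sci-Fi are not cut short.

```python
import pandas as pd

genres = pd.Series([
    "Action,Adventure,Sci-Fi,Thriller",
    "Action,Sci-Fi",
    "Action,Adventure,Sci-Fi",
    "Adventure,Comedy,Sci-Fi",
    "Western",
])

# Option 1: regex. [^,]+ means "one or more characters that are not a comma".
# expand=False returns a Series instead of a one-column DataFrame.
first_by_regex = genres.str.extract(r"^([^,]+)", expand=False)

# Option 2: no regex. Split on commas and take element 0 of each list.
# A row with a single genre splits into a one-element list, so it is kept whole.
first_by_split = genres.str.split(",").str[0]

print(first_by_split.tolist())
# ['Action', 'Action', 'Action', 'Adventure', 'Western']
```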
I'm gonna play a game before I go back to work in the morning

Hey, does anyone mind taking a look at a code I am working on? I'm struggling with one aspect of it that has to do with user input and the help chat said this chat is also helpful!
this chat is for data science, so questions about "user input" probably fall under a different domain. If it's a data science question, go ahead.
Any time you want help in this Discord, or anywhere on the internet, just say your question right away. Putting extra work between people and knowing what your question is just slows things down and makes it less likely that you'll get help.
how do i use CUDA toolkit?
to do what?
do cuda things

i have a 1060 i tried installing cuda toolkit and it failed
show error message
uhh
it sounds like your question is really "how can I install CUDA toolkit on my system despite it not working when I tried x"
true
that's a different question from "how do I use CUDA toolkit?". afaik, most people only run into CUDA toolkit as a dependency for GPU builds of pytorch and stuff rather than "using" it directly (though it does ship a compiler and libraries for writing your own GPU code), but I try not to rule out the possibility that people know something I don't, since they usually do.
but if you can't show the error message, idk what to do.
XY problem
which problem are you saying is an XY problem?
Is anybody familiar with basic TensorFlow/Keras?
for t in range(Tx):
# Step 2.A: select the "t"th time step vector from X.
x = X[:,t,:](X)
Trying to figure out how to iterate over the above tensor X
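This one went unanswered in the log. As a sketch of the indexing involved, here is the same loop on a NumPy array standing in for a (batch, Tx, features) tensor; the shapes are made up. The slice on its own, `X[:, t, :]`, already selects time step t, so the trailing `(X)` call is not needed. The same slice expression should also work on TensorFlow tensors, but that is an assumption and is not tested here.

```python
import numpy as np

batch, Tx, n_features = 4, 10, 3
X = np.arange(batch * Tx * n_features).reshape(batch, Tx, n_features)

steps = []
for t in range(Tx):
    # Select the t-th time step for every example in the batch.
    x = X[:, t, :]          # shape (batch, n_features)
    steps.append(x)

print(len(steps), steps[0].shape)  # 10 (4, 3)
```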
Hi, ive installed anaconda and kept these checked while installing:
a) install for All Users (not current)
b) Add PATH
Installed location is c:/ProgramData.
Now, when i open cmd (both as user and administrator) and type: python
i get this warning message:-
This python interpreter is in conda environment, but the environment has not been activated. Libraries may fail to load.
Any professionals over here who can help me a bit
Hi everyone
If I need help with an AI script do I simply go into a help channel or do I go here?
@gilded copper always ask your actual question. Don't rule out everyone except "professionals" before you've put an answerable question out there
You can ask here.
ill pm you if you need help installing anaconda
in the cmd type in conda instead of python see if that work
have you created a new anaconda environment yet?
Am confused to choose my career. Which to choose : IOT or Cloud. That's my qn
sounds like a question for #career-advice, but I suspect that it will ultimately be up to you
Guys - How would you "rank" these algorithms in terms of complexity (NOT Big O, but rather complexity in terms of explainability to stakeholders)
XGBoost
Random Forest
Logistic Regression
K-Nearest Neighbours
Decision tree
Perceptron
regression, decision trees, forests, k-nearest, xgboost, perceptron
but like k nearest could be higher
K nearest is the easiest in my opinion
i think between knearest and decision trees its a toss up, the order to introduce them is up to you
I see. I have this stupid assignment for my final exam in applied data science, and they want basically a powerpoint. I need to explain it to stakeholders who know nothing about data science
i feel that decision trees are easier to show to non tech people
yeah then i feel k-nearest is a bit more algorithmic to understand, while with decision trees its easier to explain the bigger picture without getting too technical
Right, yeah. Makes sense
just a bit of advice i didn't appreciate as much in school was to talk more about results and consequences than specific technical aspects.
Hey, does anyone have any interest in taking https://cds.nyu.edu/deep-learning/ ? It looks real cool. I have taken some other deep learning classes and done projects, so I'm not a total noob, but idk if I would even call myself intermediate yet. I took linear algebra in college and did ok, but that was a few years ago, and I haven't taken a derivative or anything in years, which I'm concerned about. I'm not going to be sprinting through it. I'm going to try to stick to the weekly schedule, but if something takes me an additional week for whatever reason then so be it. I was planning on starting it after the new year, just asking to see if there was any interest
guys i would to learn data engineering but dont know from where to start, I already know python and SQL
can u pls give me like a career track or course ?
pls
K nearest would be easiest, followed by decision tree. Random forest can't be explained without first explaining decision tree. You might be able to reduce "logistic regression" to "best-curve fitting".
"Data Science from Scratch" is a good book imo.
thank bro but im more interested on data engineering
what definitions of "data engineering" and "data science" are you working with?
data science = machine learning, deep learning, etc etc
data engineering = spark, pipelines, cloud
umm then you're asking in the wrong channel, try devops?
afaik the fundamentals are the same.
one of my classmates got a job titled "data engineer" whereas my title is "computational linguist", but we both did the data science program.
(though I also took linguistics classes.)
yeah but if your more interested in dataops and mlops there is a great course on coursera https://www.coursera.org/learn/introduction-to-machine-learning-in-production/home/welcome I took it and it was really interesting got to see a side that you dont get in more technical classes
Hey guys can any of you help me with Convolutional Neural Networks? It is for a school project
umm maybe whats your problem?
I have a code I have trouble understanding
Posted it in the brocoli help channel
well it looks like its just making the network. do you have a specific layer that you dont understand? i honestly never worked with separable, i dont know what that is
The make model part is my problem. How does it work?
Do you have time for voice chat?
i cant talk rn but it looks like you make stacks of convolutions with larger and larger filter sizes
Ye I stole the code from the Keras creator.
Need to analyze and understand it
But the filter applying is confusing me. The difference between SeparableConv2D and just Conv2D
yeah I never used separable so i dont know if its make or break or if you can just swap it with regular 2d and get similar performance
https://keras.io/api/layers/convolution_layers/separable_convolution2d/
Shit yeah I read that but it didnt help me
Give me 2 sec I can make a model for you
i think you can just use regular conv2d. yeah try it with regular conv2d
Can you give me an example code private?
it says it has The depth_multiplier argument but i dont see it used in the code
i would just ctrl-R SeparableConv2D to Conv2D
ah i see the residual now
ok so where the network splits into separable and regular conv is because of this part:
# Project residual
residual = layers.Conv2D(size, 1, strides=2, padding="same")(
    previous_block_activation
)
x = layers.add([x, residual])  # Add back residual
previous_block_activation = x  # Set aside next residual
I think the network would be the same if you changed separable to regular conv2d
so what that piece of code is doing is applying the conv2d layer on the right of the split, adding the two branches back together, and saving the result as previous_block_activation
like i just changed it to regular conv. if your issue is with the network splitting after the activation, its bc of the code above
if you re post the make model in a help chat i can show you my comments
can anyone help me figure out why im getting training and validation accuracy over 3.0 lol
does anyone have some experience with the pandas library?
hi guys, i'm working with data about testosterone and want to create an ML model for classification. my target is testosterone: do i have to drop it from my data if i want to use the kmeans algorithm?
don't "ask to ask". state your question, and someone will answer if they are willing and able
see here for a guide on asking good questions online https://www.codementor.io/learn-programming/how-to-get-programming-help-online
Hi, I am so confused about this. I have 5000 features in the dataset, but I only get around 2500 components in the plot. What happened in this case?
print(X_train.shape)
either way, do you need to see all 5k components? it's >0.95 at like 200
Before this, I used a different dataset for the same visualization, and the n_components axis covered all of the features. But with the new dataset, which has 5000 features, I only get some of them as n_components. What happened?
not sure, are you certain that X_train has 5000?
what is pca.explained_variance_ratio_.shape?
yeah, pca.explained_variance_ratio_.shape is 2418. But why don't I get the full number of features on the n_components axis?
because you can't have more components than the matrix rank, and matrix rank can't be greater than the minimum of the number of rows and number of columns
What is the matrix rank? Can you explain it to me?
it's a concept from linear algebra. you can think of it as the number of individual "components" in a matrix
a matrix with 100 rows but rank 1 really only has 1 piece of data in it
(this is a very very non-mathematical explanation)
(the actual explanation has to do with "linear transformations")
i strongly encourage you to learn the fundamentals of linear algebra, it's an essential tool for building mathematical models in statistics and machine learning
linear algebra and calculus
but why, when I have 30 features in the dataset, do I get the full number of features in the plot?
Hello everyone... I have a stacking ensemble with the current config
estimators = [
('decision_tree', dtm),
('linear_regression',LinearRegressionModel),
]
stack = StackingRegressor(estimators=estimators, final_estimator=RandomForestModel, cv= 7, passthrough = True)
Why does the one above perform better than this one below?
estimators = [
('decision_tree', dtm),
('rf', RandomForestModel)
]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegressionModel, cv= 7, passthrough = True)
can you give me an article recommendation?
if you have 30 features and >=30 rows, then the data matrix has rank up to 30 (exactly 30 unless some features are linear combinations of others), so you can have up to 30 components
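A sketch of why the component count is capped, using plain NumPy and random data as a stand-in for the real dataset. PCA is essentially an SVD of the centered data matrix, and a thin SVD can only return min(n_rows, n_columns) singular values. A result of 2418 components for 5000 features would be consistent with X_train having 2418 rows, though the log never confirms that.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50           # fewer rows than columns
X = rng.normal(size=(n_samples, n_features))

Xc = X - X.mean(axis=0)                  # PCA works on centered data
# Singular values only: there are min(n_samples, n_features) of them.
s = np.linalg.svd(Xc, compute_uv=False)

explained_variance_ratio = s**2 / np.sum(s**2)
rank = np.linalg.matrix_rank(Xc)

print(explained_variance_ratio.shape)    # (20,)
print(rank)                              # 19, centering removes one more
```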
i recommend an introductory book or course on linear algebra. MIT 18.06 is excellent, the lectures are all on YouTube and the lecturer Gilbert Strang is very entertaining and passionate
i actually think a new online version is starting soon or has already started. but the old lectures are also easily available
I know there is not much information about the data here but I'm looking for a more theoretical response as to how/why this would be the case
that's a great question. my guess is that the linear regression and decision tree offset each other in a "bias-variance" sense. it might also be the case that the random forest model on just 2 features (the two first-stage model outputs) is finding a good bias-variance balance
i don't think there's any compelling theory here
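For anyone wanting to reproduce the comparison, a sketch of the two stacking layouts on synthetic data. The data, the hyperparameters, and the helper name `build` are all made up; only the `StackingRegressor` arguments mirror the configs quoted above. Results on this toy data say nothing about the bus dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic "hourly demand": a daily cycle plus noise.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=600)
X = np.column_stack([hour, rng.integers(0, 7, size=600)])  # hour, day of week
y = 100 + 80 * np.sin(hour / 24 * 2 * np.pi) + rng.normal(0, 10, size=600)

def build(first_stage, final):
    # Hypothetical helper: same settings as in the question.
    return StackingRegressor(estimators=first_stage, final_estimator=final,
                             cv=7, passthrough=True)

stack_a = build([("decision_tree", DecisionTreeRegressor(random_state=0)),
                 ("linear_regression", LinearRegression())],
                RandomForestRegressor(n_estimators=50, random_state=0))
stack_b = build([("decision_tree", DecisionTreeRegressor(random_state=0)),
                 ("rf", RandomForestRegressor(n_estimators=50, random_state=0))],
                LinearRegression())

mae = {}
for name, model in [("RF on top", stack_a), ("LR on top", stack_b)]:
    # The scorer returns negative MAE, so flip the sign to report it.
    scores = cross_val_score(model, X, y, cv=3,
                             scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
    print(name, mae[name])
```

Scoring both layouts with the same out-of-sample metric makes the comparison less dependent on how any single plot happens to look.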
Does that mean the full number of features can't be plotted if the dataset has fewer rows than features?
Got it, so the actual nature of how these models operate could explain why pairing them like that would make it better/worse? The data I'm working with is basically a time series of passenger flow... So a non-linear problem in that matter and a Linear Regression would have no business in solving it
would anyone mind helping in #help-dumpling ? I'm having trouble getting model_main_tf2.py to work
plotting what exactly?
in that case the linear regression is possibly fitting some kind of trend, and the decision tree is possibly fitting some kind of higher-order variation around the trend. if you can be more specific about the nature of the data, we might be able to provide more specific advice
plotting pca.explained_variance_ratio_, i mean
well yeah, if there are only 2418 components then you can only plot the explained variance ratios for those 2418 components
there is no 2873'th component to plot a variance ratio for
also i'm not sure there's value in 2000+ components that explain < 1% of variance...
What do you mean by "I'm not sure"? Why can 2000+ components only explain <1% of the variance? Can you explain the relationship between components and variance to me?
especially in my case
perhaps it would be more illustrative to look at the explained variance of each component, instead of the cumulative explained variances
i.e. do the plot without cumsum
you will see that the components at the end only explain a tiny fraction of overall variance
so you can probably just ignore them
and if you look at the plot you currently have, you will see that ~90% of the variance is explained by the first ~250 components
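A sketch of the cumulative versus per-component view, with hypothetical ratios standing in for `pca.explained_variance_ratio_`:

```python
import numpy as np

# Hypothetical ratios that decay quickly; real values would come from
# pca.explained_variance_ratio_.
ratios = 0.5 ** np.arange(1, 11)
ratios = ratios / ratios.sum()

# Running total: this is what the cumsum plot shows.
cumulative = np.cumsum(ratios)

# First index where the running total reaches 90%, plus one to get a count.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_for_90)  # 4
```

Plotting `ratios` directly shows how little each late component adds, while `cumulative` answers "how many components do I need for a given share of the variance".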
Sure, it is the passenger onboarding of buses in my city (for the year of 2015 in increments 1 hour). Here a quick graph
That is basically 24 hours of predictions (blue lines containing the real values) and that one above is the Linear Regression model trying to solve it
This one is the decision tree
Random Forest below
@undone heron you just did linear regression of bus usage vs time?
this is average hourly usage over 2 months?
i don't think either model is a great idea tbh
you surely have seasonal effects to consider
as well as year-over-year trends
Model trained with the complete months of Jan, Feb and March and is predicting April 1st (24 predictions = 24hrs of the day)
I only have 2015 as data (it is mentioned on the limitations of the paper)
i see, maybe you have to assume that there is no change throughout the year although that is very very risky
you will at least need weekly seasonality
surely transit usage is very different on saturday and sunday vs mon-fri
Indeed, my features make sure that this is accounted for
Day of week (Mon - Sund), hour, month (1-12), day of year (1-365), day of month (1-31)
My big Q is just why the f* is the Linear Regression as a weak learner improving performance if it is so bad?
Sorry, I still don't understand about this plot. Can you explain to me what information I get from this plot?
Even removing it from the stack makes the thing worse lol
the % of variance explained by each component
see how it's nearly 0 for most of the components?
it's not bad, look at your plot
it's great actually, it predicts the average hourly trend throughout the day
high bias low variance
the decision tree takes care of overfitting to all the little fluctuations
and it gets smeared out by the linear regression being very _under_fitted
and the predictions aren't highly correlated
Hmmmmmm that is an interesting take....
and putting them together with the random forest i guess makes sense?
maybe you should do it the other way
stage 1: linear regression + random forest
stage 2: decision tree
that would be intuitive to me at least
or maybe not
since you want lower variance in stage 2
either way, i can see how the random forest being nonlinear is essential in correctly "re-combining" the 2 models
what is the predicted output from the first stacked model? using that same plot
Well, to give more context... The final stage is -> the best ensemble for that social domain. Let me plot the two stackings I mentioned in the first question. 1 sec
tldr my guess is that your stacking model is doing what time series decomposition does, splitting apart "trend" and "noise", and then recombining them with the "noise" turned down and/or filtered with some kind of low-pass filter
Random Forest as final estimator and Decision Tree + Linear Regression as weak learners
Linear Reg as final estimator DT + RF as weak learners
wtf
the plot is better but the performance is not
MAE on second one goes up from 50 to 55
this is "in-sample" prediction performance, right?
i see
you are saying that the 1st one has a slightly lower median absolute error than the 2nd one?
conceptually i don't think DT + RF makes much sense
RF is definitely not a "weak" learner
and RF already is constructed from a bunch of trees
that said, i am surprised that the first plot has lower median abs error
maybe it has to do with median vs mean
since the "tails" of the error distribution are essentially discarded with medians
Wait, technical question... How do I measure performance from a .predict() run?
try mean abs error or mean squared error instead!
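A sketch of why the choice of metric matters, on made-up numbers with one large miss. The median ignores the tail that the mean and the squared error both pick up.

```python
import numpy as np

# Made-up values: four close predictions and one large miss.
y_true = np.array([100.0, 120.0, 90.0, 110.0, 105.0])
y_pred = np.array([98.0, 123.0, 91.0, 108.0, 205.0])

abs_err = np.abs(y_true - y_pred)        # [2, 3, 1, 2, 100]

mean_ae = abs_err.mean()
median_ae = np.median(abs_err)
mse = np.mean((y_true - y_pred) ** 2)

print(mean_ae)    # 21.6
print(median_ae)  # 2.0
print(mse)        # 2003.6
```

Two models can therefore swap places depending on whether "MAE" is read as mean or median absolute error.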
what's the thing here?
is it that lr(dt(rf(X, y))) has different characteristics vs. rf(dt(lr(X,y ))) ?
cross_val_score(...).mean()
lr(dt(x,y), rf(x,y)) vs rf(dt(x,y), lr(x,y))
i suppose it makes sense to take 1 very deep tree and take a weighted average of it with a forest of many shallow trees
that is what the 1st one does
im not sure a random forest makes much sense on 2 features either
i am almost tempted to do this:
stage1:
- random forest
- linear regression
stage 2:
- fully connected neural network with 1 hidden layer and ~5 hidden layer units
dt is basically a sort of modal learning, lr is a mean learning, and rf a mean of mode learners
that's a great way to put it
in which case, yeah. i guess smashing a mode and mean together kinda makes sense
that said, i think maybe this entire problem would benefit from probability calibration and estimated error bounds
yeah, it all boils down to some form of average
i really want to see confidence bands around that predicted line
or better yet, a probability density surface
i'm not sure what a hardcore bayesian machine learning person would do here
Oh Jesus now I'm seeing things said here that I have no idea what they are about lol
I'm just trying to get a Bachelor Degree in CS folks
I have the feeling something somewhere is wrong
a neural network is a "mean of means" learner, the first mean() phase projects X into a compressed space; the second mean() is basically a distance from your input x to the nearby points in the compressed space
it doesn't have to be compressed though, right? that's the whole magic of having more hidden units than inputs
like kernel methods that used to be fashionable
this data isn't public, is it? i might be curious to mess with it, if it's public
approx,
```py
# pseudocode, not runnable
W, b = mean(historical_data)
layer = mean(W @ X + b)
predictions = mean(layers)
```
oh, i see
i think if you just replace mean with "taking an expectation", you could probably mess around a bit and get the definition precise
aaaa it isnt public per se... It is public but it would literally be a crime if I was the one to give it to you, you'd have to request it from my local Gov.
a layer's activations are just a weighted mean of the previous layer's, where the weights are W
and W are just basically compressions of the original data
If you want to jump on a voice chat I can share my screen and we can chat about it, I just need to compare ensemble methods on that domain
so the "intuitive formula" above, i think is largely correct
so a NN is basically an RF where the core stat part is a mean() rather than a mode
as you can just see routes from X to Y through the layers as independent regressions (basically, means), and the final layers as ensembling/mean'ing those
but i'll let you get back to helping, if that's what's going on
Ok I understand now. Thank you
don't worry about it then
i guess i only take issue with the "compression" part - it might be a projection into a higher-dimensional space than it started in
otherwise i really like this explanation
and you're definitely not in the way of helping at all!
sure, its a higher-d space, but its linear in that space
the intuition is that the weights are basically templates of the original dataset
so you're projecting a new x into space where templates are the axis
that space isnt a compression of your new-x, its a compression of the historical data
oh, i see what you're saying
yeah, interesting way to think about it
i tend to think of it as a "recombining" or "mixing" rather than "compressing"
well its compression just if len(W) << len(X)
in the sense that if len(W) == len(X), under forced interpolation, W == X
you're talking about len() as in the entire data matrix? like len(x) being the number of data points in the training set?
yeah, well W, b = someop(X, y) right?
my claim is the basic heart of someop is mean
so really, W, b = means(X, y)
if len(W,b) == len(X, y), and if loss(training) == 0, then more-or-less W,b should just be X,y
Hey, can someone explain me how do I obtain model "score" as in sklearn GridSearchCV?
I have made a model, now I want to compare the score to external test set, and I fail to get the "score" in normal range (gridsearch gives something below 1, i get -60 or sth like that
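One possible explanation, assuming the model is a regressor scored with the default R² (which is what `GridSearchCV` falls back to when no `scoring` is given): R² is at most 1 but has no lower bound, so a value like -60 on an external set is legal and just means the predictions are far worse than always predicting the mean. A sketch with made-up numbers:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up test targets.
y_test = np.array([10.0, 11.0, 12.0, 13.0])

perfect = r2(y_test, y_test)
mean_only = r2(y_test, np.full(4, y_test.mean()))
shifted = r2(y_test, y_test + 10.0)   # right shape, wrong level

print(perfect)    # 1.0
print(mean_only)  # 0.0
print(shifted)    # -79.0
```

A constant offset between training and test targets (different units or scaling, for example) is enough to produce a large negative score. Another thing to check is whether the search used a `neg_*` scorer, which is negative by design.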
ok i see what you mean, it's an average over all data points within the possibly very-high-dimensional feature space
yeah, it's mean**s**(X, y)
so if those means are just the same number as the original data points, and if you can predict all of those points without error, those means are just the data points
hey everyone, is anyone here familiar with ONNX models?
do you think in that sense, a neural network is fundamentally different from e.g. an svm or linear regression (maybe with polynomial or other hand-transformed features)?
or even a general additive model for that matter
i remember hearing something like that
in my mind, i see the NN alg as basically a dial from: knn -> ensemble of lr
interesting idea
if len(W) <<< len(X) it's ensemble(LR); if len(W) ~= len(X), it's knn
maybe ensemble knn might be more accurate
yeah, to me its kinda obvious, but a lot of the marketing BS requires people to basically ignore the weights
once you sub W = mean(historical data) into all the formulas, it isnt that much of a mystery
W = means(historical) , so W = historical if len(W) = len(historical) and loss(historical) = 0
which makes it knn
should be kinda obvious from autoencoders and the like too
an autoencoder just shows that the weights are basically "local approximations" / compressions of the original data
and the main mechanism of a NN is just to put your new point into the spaces of those approximation points, and take a mean
oh see, basically if you have 1 weight per data point you're just taking local approximation around each data point?
fair enough
i have to remember that you're talking about a time series here
and not a "flat" dataset of rows and columns
i think i was hung up on that point
me?
are you?
i guess not then!
i'm not the time series person
i guess i'm not sure if you're speaking abstractly about the number of parameters or about the actual shape of the weight matrix/tensor
because in the basic 1-layer feedforward case you have a "1xH" weight vector where H is the number of hidden units
i know that in general the closer you get to 1 parameter : 1 data point, the closer you get to just memorizing the original data. but i'm not sure how well that intuition generalizes
if you do KNN(k=1).fit(X,y).predict(X) you get exactly y, ie., 0 training loss ... why? because W = (X, y) by design
if you do repeat: NN(num_weights=len(X)).fit(X) until ==y, then you've got 0 training loss, ... why?
well it isnt literally that W = (X, y) , but W "is basically" shuffled(X, y)
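The k=1 case above is easy to check by hand. A minimal sketch in plain NumPy, using a hand-rolled 1-nearest-neighbour rather than sklearn's KNeighborsClassifier; `predict_1nn` and the toy data are made up for illustration:

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its closest training row."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    # Index of the nearest training row for each query row.
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # toy "historical" data
y = rng.integers(0, 3, size=50)     # toy labels

# The "parameters" of 1-NN are literally the stored (X, y) pairs,
# so predicting the training rows hands the training labels back.
train_pred = predict_1nn(X, y, X)
train_error = float((train_pred != y).mean())
print(train_error)
```

Every training row is at distance 0 from itself, so the lookup returns its own label. That is the "W = (X, y) by design" point in its most literal form.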
sure, but what do you mean by "repeat" in this case? are you talking about stacking more layers? adding more weights? running more epochs?
running more epochs
i don't disagree btw, but i want to make sure i understand your point if i am to borrow the idea
you can see it as a probabilistic condition on W
like, what happens if len(W) >> len(X)
then it is certainly never the case that the entries of W would be the entries of X
is running more epochs really increasing the size of W though?
no, its about permuting W until its just a rotation of X into a new space
ok, sure. or iterating as close as possible thereto
yeah
i mean, i think it is pretty exact
if a PCA is basically just "rotate X by its means"
then a NN under these conditions is just the same thing
oh you are saying specifically if you have at least 1 weight per data point
ie., W = rotate (X,y) by its means
yes, that makes sense
yeah, if len(W) < len(X) you get more compressive
and end up closer to an ensemble of linear regressions
it's always just: mean(means...( x rotated-by means...(history)))
i guess i still don't have a great sense for what the len(W) is. if you have two one-hidden-layer networks, but one has 5 hidden units and the other has 10, the second one has greater len(W) in your eyes, right?
yeah, ok
really interesting idea
makes sense intuitively but i might have to simulate and convince myself
and maybe write out some equations
certainly i agree with the idea that if you have enough parameters you end up memorizing and reshuffling the data rather than compressing it
that such a thing conceptually is similar to k-nearest-neighbors sounds logical but somehow isn't fitting right into my head. will have to tinker with it
well a NN is just a prediction fn, f(X) = A(W3 A(W2 A(W1 X)))
sure
so maybe, if we say, entries of W* = entries of W1, W2, W3
my claim amounts to something like, AW* is just a pca-like rotation of X
when len(W) == len(X) and when loss(training) == 0
ie., when the network is predicting its historical data, and when the number of parameters = the number of data points
i follow you that far
right, so maybe the idea is something like
"roughly", AX on W* == AW* on X
ie., W* and X are basically just the same
this isnt how i arrived at the conclusion though
i arrived at my general view, by:
(1) dropout on NNs basically ensembles them.. wait, softmax/last-layer is just a mean() anyway, so they're coming into last layer as an ensemble
(2) all ML reduces to mean(), mode(), etc
(3) algs which predict their training data exactly are overfit, ie., their parameters are closer to the original data than they should be
(4) if you're perfectly interpolated and have sufficient parameters to play with, it is extremely likely your parameters are just your original points ("in a rotated space")
and also, if you think about the two branches of alg, either you force distributional assumptions on your historical data.. in which case you fit to a model
or you dont, in which case you fit to the data
a NN is just a dial between those
yeah, that much i totally follow
i really like that line of reasoning
so i can definitely see how that would lead to something knn-ish
i wouldn't really describe it as nearest neighbors, but certainly an increasingly local approximation
i also really like the mode vs mean thing
going to borrow that one
ty for the insights!
to classify 1000 labels how many imgs per label do i need?
gonna use albumentations
there's no universal answer to that question, if each label is a basic shape, then a handful
wdym with basic shape
it depends on what the images are of
at a guess, 1k/label is a minimum
160x160
c. 25,000 pixels/image, 100 images/label, 1000 labels
what is c.?
"circa", it means approx.
yeah, but u are forgetting albumentations
yeah, you can augment
well, gotta scrape the images first. I did but some images do not correspond to the label so i had to clean them manually and after cleaning 170 labels i got bored. I guess it will be easier with better scraping
Does anyone know how to use google cloud storage with colab
I want to know how to access a folder
click on the drive folder inside colab
it will give u a link and request for a token
just click on the link
ok thanks
does anyone know how to filter out items from a dataframe?
to only show those specific items in that column
df.loc[df['column'] == value]
can i show you what i mean sorry
sure
{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}
so this is the data in my dataframe
i need to select 4 genres of my choice from the genres column and filter the dataframe so that only those 4 are left
ok
okay wait sorry
just ignore everything i just said
the first step i had to do was to take the genres column and only keep the first genre in it, like if the genres column has comedy,drama,romance. I had to turn it into just comedy
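df['Genres'] = df['Genres'].str.extract(r'([^,]+)')  # keep everything before the first comma; [A-z] also matches some punctuation and stops at the hyphen in Sci-Fi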
df.head()
i used that^
now i am being asked in the instructions to select 4 genres of my choice from the genres column and filter the dataframe so that only those 4 are left
Series.isin(values)
Whether elements in Series are contained in values.
Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
so, df['Genres'].isin(genres) is a series with bools you can index with
each genre i include in my filter
okay thank you, gonna try rn
do u mean like this df['Genres'].isin('Action', 'Adventure', 'Sci-fi', 'Thriller')
i think you need a list, not separate arguments
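To make the list point concrete, here is a small sketch. The four rows and the `wanted` list are example values, and it assumes the Genres column was already reduced to one genre per movie in the earlier step:

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "Back to the Future",
              "The Good, the Bad and the Ugly"],
    # Assumes Genres already holds only the first genre of each movie.
    "Genres": ["Action", "Action", "Adventure", "Western"],
})

wanted = ["Action", "Adventure", "Sci-Fi", "Thriller"]  # example picks

# isin takes ONE list-like argument and returns a boolean Series.
mask = df["Genres"].isin(wanted)

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]
print(filtered["Title"].tolist())
```

Passing the four strings as separate arguments raises a TypeError, because isin only accepts a single `values` argument.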
but you know how before this step
i used df['Genres'] = df['Genres'].str.extract(r'([^,]+)')
df.head()
to only display one genre in the column
meaning the genres arnt in a list anymore
I am trying to access a directory with a ton of images stored in google cloud drive using colab. I type this in ```py
!gcloud config set project {project_id}
!gsutil cp -r dir gs://digits
And get the following error:
Updated property [core/project].
CommandException: No URLs matched: dir```
Can someone tell me what I have done wrong
Hey guys so i have a doubt regarding tensorflow. I have been working with pytorch for a long time and felt like i needed to give tensorflow a shot. So right now I am able to understand custom training loops and all of those things. Only doubt is like pytorch where there is a custom dataset and dataloader class from torch.utils.data is there anything flexible like that in tensorflow that is easier to use for custom pre processing of data. Like what is most commonly used for creating custom dataset like we do in pytorch?
@serene scaffold hey can I please get a little bit more help before i hand this project in?
{'Index': [0, 1, 2, 3, 4],
'ID': [1, 2, 3, 4, 5],
'Title': ['Inception',
'The Matrix',
'Avengers: Infinity War',
'Back to the Future',
'The Good, the Bad and the Ugly'],
'Year': [2010, 1999, 2018, 1985, 1966],
'Age': ['13+', '18+', '13+', '7+', '18+'],
'IMDb': [8.8, 8.7, 8.5, 8.5, 8.8],
'Rotten Tomatoes': [87, 87, 84, 96, 97],
'Netflix': [1, 1, 1, 1, 1],
'Hulu': [0, 0, 0, 0, 0],
'Prime Video': [0, 0, 0, 0, 1],
'Disney+': [0, 0, 0, 0, 0],
'Type': [0, 0, 0, 0, 0],
'Directors': ['Christopher Nolan',
'Lana Wachowski,Lilly Wachowski',
'Anthony Russo,Joe Russo',
'Robert Zemeckis',
'Sergio Leone'],
'Genres': ['Action,Adventure,Sci-Fi,Thriller',
'Action,Sci-Fi',
'Action,Adventure,Sci-Fi',
'Adventure,Comedy,Sci-Fi',
'Western'],
'Country': ['United States,United Kingdom',
'United States',
'United States',
'United States',
'Italy,Spain,West Germany'],
'Language': ['English,Japanese,French',
'English',
'English',
'English',
'Italian'],
'Runtime': [148.0, 136.0, 149.0, 116.0, 161.0]}
@pure pumice I'm busy RN but I guess ask your question
thanks, remember when yesterday we only kept the first genre in the genre column
using df['Genres'] = df['Genres'].str.extract(r'([^,]+)')
df.head()
well now i need to
Select 4 Genres of your choice. Filter your dataframe so that only those 4 Genres are left
after that first step which we did
any data scientist can tell me:
you store the data on an excel, for example, when you need to work with it, you transform it ALL from excel to dict(for example)?
or you work directly on the excel?
well, what have you tried?
there are libraries for reading excel files into Python
it's basically effortless.
i know, that was not my question, i wanted to know how you handle with it
handle with it?
it depends on what you're trying to do 
don't post any screenshots.
def load_excel(arquivo_excel, index_coluna):
    df = pd.read_excel(arquivo_excel)
    df.set_index(index_coluna).T.to_dict('list')
    return df.to_dict(orient='list')
i'm doing that to load from the .xlsx to (for example) that:
dict_machine = {'ignore-2': [],
'ignore-1': [],
"ignore": [],
"ignore2": [],
"ignore4": []}
so it's a dict of lists. do you want that? Also df.set_index(index_coluna).T.to_dict('list') does not modify the DataFrame in-place, so that statement has no effect.
wait, so i can totally remove it?
yes. or that can be the statement that you return
but at that point you might as well have
def load_excel(arquivo_excel, index_coluna):
    return pd.read_excel(arquivo_excel).set_index(index_coluna).T.to_dict('list')
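To see what the two `to_dict` variants return, here is a sketch on a tiny in-memory frame standing in for the result of `pd.read_excel`. The column names and values are hypothetical:

```python
import pandas as pd

# Stand-in for pd.read_excel(...): a tiny frame built in memory.
df = pd.DataFrame({
    "machine": ["m1", "m2"],   # hypothetical index column
    "speed": [10, 20],
    "load": [0.5, 0.7],
})

# set_index returns a NEW frame; df itself is unchanged unless reassigned.
by_machine = df.set_index("machine").T.to_dict("list")
by_column = df.to_dict(orient="list")

print(by_machine)  # one key per machine, values are that row's cells
print(by_column)   # one key per column, values are that column's cells
```

Which one you return depends on whether you want to look things up by row label or by column name.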

SORRRY im late filter1 = df['Genres'] == "Action,Adventure,Sci-Fi,Thriller"
Can you think of why that doesn't work
after we did the df['Genres'] = df['Genres'].str.extract(r'([^,]+)')
df.head() yesterday
ya because I have them all under one ""
Also that's going to remove all but the first genre
ya
so this filter1 = df['Genres'] == "Action,Adventure,Sci-Fi,Thriller" only works before we removed all but first
it would only work for rows where the Genre is literally "Action,Adventure,Sci-Fi,Thriller"
ya it only worked with that
before we removed
everything but the first one
but you don't really want that either, do you?
nah
filter1 = df['Genres'] == "Action"
filter2 = df["Genres"] == 'Adventure'
filter3 = df['Genres'] == 'Sci-Fi'
filter4 = df['Genres'] == 'Thriller'
this wouldnt work either eh?
@pure pumice you don't want to be using == with the whole column
Because it will check if the value in that column matches the string exactly, from beginning to end.
What would u suggest doing then?
cuz just one = sign doesnt work either
Well of course not. That's for assignment
am i on the right track at least?
Not at the moment, if I'm being perfectly honest. Your goal is to get those rows where at least one of the genres belongs to one of four that you pick, yes?
Select 4 Genres of your choice. Filter your dataframe so that only those 4 Genres are left
exact instructions^
@pure pumice so you need to pick those rows where the genres are a subset of the four that you pick
@pure pumice "only movies in those four genres" needs a more robust definition, since a movie can belong to more than one genre
Does it need to belong to all four? Exactly one? At least one (but possibly others that aren't?)?
just one
because before this
we deleted all the genres from the column and kept one
Exactly one?
Are you sure you were supposed to delete the others?
Using the original dataframe, take the genres column, and only keep the first genre.
For example, if the value was previously Comedy,Drama,Romance, then it would become Comedy
Select 4 Genres of your choice. Filter your dataframe so that only those 4 Genres are left
Create a pivot table of the average runtime of movies over time. The rows are therefore the year
The columns will be the 4 Genres you filtered for
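The last two steps of those instructions could look roughly like this. It is a sketch on made-up rows, with an example choice of four genres, not the assignment's expected answer:

```python
import pandas as pd

df = pd.DataFrame({
    "Year":    [2010, 2010, 1999, 1999, 1985],
    "Genres":  ["Action", "Action", "Action", "Comedy", "Adventure"],
    "Runtime": [148.0, 152.0, 136.0, 100.0, 116.0],
})  # made-up rows, already reduced to one genre per movie

picked = ["Action", "Adventure", "Comedy", "Western"]  # example choice
df = df[df["Genres"].isin(picked)]  # reassign to keep the filtered frame

# Rows = Year, columns = Genres, cells = mean Runtime (NaN where no movies).
table = df.pivot_table(index="Year", columns="Genres",
                       values="Runtime", aggfunc="mean")
print(table)
```

A genre with no rows left after filtering (Western here) simply does not get a column.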
Series.isin(values)
Whether elements in Series are contained in values.
Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
@pure pumice try it and see
so i would put "Genres" instead of series
okay goodnight
The genres column is a series.
Series.isin('Action','Adventure','Sci-Fi','Thriller')
That won't work
ya series not defined
The genre column is the series
Also you passed four strings individually as four arguments
Not a list
Good luck!
so they should all be under one " "

oh both the argument and the parameters have to be set like objects
I'm trying to learn data science for python. Any good resources you guys can recommend? Or site where I can practice my skills?
How about an AI that learns by talking
Too Advanced for me right now I think. I'm not that skilled yet.
How about a game using variables. like memory game
Sounds doable. I'll explore some existing projects. Thanks
Hi, does anyone know how I could create an orientation histogram using SIFT? I have found the key points and now would like to make an orientation histogram for some of them
Why can we not make fully auto driving cars when we already have technology to do that
we dont have the technology
What's the difference between MYSQL, POWER BI , AND TABLEAU
mysql = database for storing data in tables; powerbi = microsoft data visualization & reporting tool which gets data from a database; tableau = non-microsoft "alternative" to powerbi, maybe a bit harder to use
it's capable of doing that, yes
What should I learn first?
I am on the data analysis road
Done with python basics( i still get stuck a lot of times)
Modules also
Numpy pandas matplotlib seaborn plotly
Now what should I do?
err, it is very important to learn SQL
the easiest way of doing that first is using sqlite3 in python, as you dont have to install anything
so go learn SQL & sqlite3, and when you've done that, then maybe look at powerbi
Oh ,is that a module?
import sqlite3
So, I can use all python modules in MySQL?
sqlite3 is a simpler database, that is a bit like mysql
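For a feel of what that looks like, here is a minimal sketch using only the standard library. The table name and columns are made up for the example:

```python
import sqlite3

# ":memory:" creates a throwaway database in RAM, nothing to install.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, imdb REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("Inception", 2010, 8.8), ("The Matrix", 1999, 8.7),
     ("Back to the Future", 1985, 8.5)],
)

# Plain SQL: filter and sort, the same idea as in MySQL for a query this simple.
rows = conn.execute(
    "SELECT title FROM movies WHERE year < 2005 ORDER BY year"
).fetchall()
print(rows)
conn.close()
```

The SQL itself (CREATE, INSERT, SELECT, WHERE, ORDER BY) is the part that carries over to other databases; sqlite3 is just the Python module that talks to SQLite.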
Sure, I would try that first. And after that move to powerbi
Bro, can I send you friend request? If that's not an issue
If you're comfortable with that
you can send one, i wont accept immediately; i might accept at some point
What's the best way to deploy a machine learning model on Azure? Which service should I look into?
Perfect - Yeah, I think that's what I was looking for
microsoft has lots of free courseware in this area, on their microsoft learning github
have a look at: https://microsoftlearning.github.io/mslearn-dp100/
It's primarily cause we have to design a deployment model for a fictional company as part of one of our courses
Basically I need to make an architecture that scales well
I'll send a picture of the architecture in a bit, maybe something stands out to you as being comically wrong
i suspect if you just followed the instructions on that course above, maybe first 6 or 7 labs
you'd basically have the solution
Can any body help me on this query
Anything here that (to you) looks very out of place? Any suggestions? Basically we need to present a scalable architecture to people who don't know anything about architecture/machine learning for a machine learning model in steel plate fault detection
hah this looks a lot like my setup at a previous job
this is for people to actually train models?
i am generally skeptical of letting people who know 0 machine learning do machine learning
also the databricks notebook interface is hot steaming garbage
not worth it imo, use databricks-connect or just run non-interactive jobs
I understand your skepticism. This is a completely fictional report. And we are forced to use Databricks
Well, the report is not fictional, the case is
jesus, are you working for my previous employer?
Hahahaha our teacher is a consultant, so if you work in consulting I might very well be
i don't, but fuck the microsoft salespeople who convinced upper mgmt to force databricks on everyone
dont get me wrong, i really liked having a managed and optimized spark cluster
but there was a moment when they actually believed we could do actual work using that interface
It's honestly crazy. Its such a hindrance for actual work
like not just "big data" work -- they thought we could do all of our work either on our laptops or databricks
Can I send you an updated architecture diagram?
sure
and im so relieved that you agree
serious tip: set up a local spark cluster on a server, so people can actually develop and test their spark code without burning your very very expensive databricks cluster time
or even set it up on data scientists' / ml devs' machines
Here's the thing, this will never see the light of day. There's no factory, no nothing
We basically had to make up the case ourself
i guess my point is that dev affordances must at least be part of the plan somewhere
at least in my opinion
maybe consultants dont care
I would agree, but I am honestly convinced that our teacher knows next to nothing about software development. (As you might be able to tell, I am NOT a fan of this course)
if the plan is "make pretty diagram, tell contributors to f off" then your contributors will all quit for better and better-paying jobs, and your project will fail
yeah this seems kind of rough
so if it makes you feel better: yes, that architecture was good enough for a global 500
so it's good enough for your course
and actually on the "production" side it was pretty damn good
Super nice! I am very happy about that, thanks a lot
and better diagrams than we had too
ours were literally low res jpg's someone had downloaded off the azure site
Yeah, we're not super into making these diagrams, we're data scientists, not necessarily cloud architects. He was super savage to us last time (cause we kept it more technical than he wanted). So this time we really tried to reduce it something anyone could understand (more or less)
But god damn. Soon I will never have to touch databricks again (hopefully)
this looks good
and honestly if you do need spark, databricks beats the hell out of running your own yarn cluster
but yes avoid that accursed notebook interface
Hello, can you please tell me if this is a good tutorial from which i can follow and learn
https://towardsdatascience.com/how-to-build-a-movie-recommender-system-in-python-using-lightfm-8fa49d7cbe3b
I want to build a movie recommendation system using ml / can you suggest any good ml projects that i can do
seems ok
Are courses on free code camp useful?
useful for what goal?
does anyone know how to fix this? absl.flags._exceptions.IllegalFlagValueError: flag --sample_1_of_n_eval_examples=: invalid literal for int() with base 10: ''
coming from:
python model_main_tf2.py
--model_dir=$/tmp/model_outputs --num_train_steps=$10000
--sample_1_of_n_eval_examples=$1
--pipeline_config_path=$/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2/pipeline.config
--alsologtostderr
@robust jungle did you write this program? model_main_tf2.py
it seems like someone gave you incorrect usage instructions
that, or you wrote $10000 when you meant 10000
the $ is not a "number", it introduces a shell variable
so $1 is the first argument of a shell script
so it's not clear what $1 or $10000 are supposed to be
when you try to use a nonexistent variable in a shell script, the value is an empty string
so that's probably what's causing this error
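You can watch the shell do this from Python. A rough sketch, assuming a POSIX `sh` is on the PATH; `shell_expand` is a helper made up for the demo:

```python
import subprocess

def shell_expand(text):
    """Let a POSIX sh expand `text` inside double quotes and return the result."""
    # No positional arguments are passed, so $1 is unset inside this shell.
    # `text` is spliced into the command, so only use this with trusted input.
    result = subprocess.run(
        ["sh", "-c", f'printf "%s" "{text}"'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

print(repr(shell_expand("--sample_1_of_n_eval_examples=$1")))
print(repr(shell_expand("--num_train_steps=$10000")))
```

The first flag ends up with an empty value, which matches the `invalid literal for int() with base 10: ''` message. The second is read as `$1` followed by the literal text `0000`.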
if you were given usage instructions, can you post those here?
no
if you were given usage instructions, can you post those here?
ngl I probably added that to experiment since I saw it on each line
but I just removed it
still doesnt work
just removed it from the other line too
new error (placeholder thing was still there), seems to have fixed it
thanks
side note: the thing im running is commented at the top of model_main_tf2.py, which can be downloaded on the tensorflow github
nevermind
just looked at it again
that one was on me
I did a goof
new issue
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. tensorflow.python.lib.io._pywrap_file_io.BufferedInputStream(filename: str, buffer_size: int, token: tensorflow.python.lib.io._pywrap_file_io.TransactionToken = None)
Invoked with: None, 524288
comes from:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
--model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--pipeline_config_path=$PIPELINE_CONFIG_PATH \
--alsologtostderr
ignore the difference in the last bit, thought the top part was a quickstart and the bottom was placeholders
For now to get Introduced to the subject and be able to execute few projects ..I am planning to integrate it with my current skillsets
@serene scaffold I figured it out df[df["Genres"].isin(['Action','Animation','Comedy','Adventure'])]

is there a way to permanently change my df to show only those genres when i do df.head()??
like we did with rotten tomatoes
@pure pumice you can write over the existing one
what do you mean by over?
nvm figured it out
thanks
Hey guys I am in serious need of help on a CNN. Do any of you have some time to spare?
Please anyone with any knowledge of convolutional neural networks! You will save my day!
@normal radish try asking your CNN question. It's not likely that anyone will commit to a question you haven't asked yet, even if they know about CNNs.
(This goes for any time you want to ask a question on the internet: just ask the question.)
Ok so I have a CNN where I have a 180x180x3 (RGB) image being convolved with 32 filters. Is it correct to say that this gives me 32 new images?
@serene scaffold
no
there will be 32 activation maps when the 32 filters are applied, but the activation maps arent new images as such
But it will change from a 180x180x3 to a 180x180x32 right?
if i recall correctly, yes -- you can just build it in keras and then print a summary
Yeah I did that but not sure I understand how
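from keras.applications.vgg16 import VGG16
model = VGG16()  # default weights are the ImageNet ones, downloaded on first use
model.summary()  # prints the layer table itself; wrapping it in print() also prints "None"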
compare the VGG16 diagram (google images) with the summary
convolving layers are often compressive
Thing is each of the filters will take a 180x180x3 but turn it into a 180x180x1. Does it just add the values down?
so you can go from 180x180x3 -> lots of things
yes
the "convolution product" is a dot product thru' the image
Do you have 2 seconds to talk?
i have two seconds to type
are you aware of the stanford course on this? i'm sure there's a video which does parameter counting
I'm not sure if I understand. I understand that the filter is passing over every pixel in a kernel and there it dots and takes the sum for the new pixel. But... there are 3 layers. How does this turn into 1?
I am not aware no
6 is where CNN itself starts: https://www.youtube.com/watch?v=bNb2fEVKeEo&list=PLf7L7Kg8_FNxHATtLwDceyh72QQL9pvpQ&index=6
the courseware is: http://cs231n.stanford.edu/
Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.
Do you know the answer to my previous question? I will watch this tomorrow
its like a volumetric dot product
you take the pixels of all three layers, and you write them in a single row
then the dot product is just the filter's weights in a single row
dot'd with those
visually, the filter "punches thru'" the layers
So you dot the first pixel in every layer and then the next
i cant recall the exact formula, but iirc, basically each channel has its own weight at each kernel position, so for the first position it's x0red*w00red + x0green*w00green + x0blue*w00blue + ..., and you sum that over the whole kernel
the idea is that w00 is the weight for "that part of the image"
and that there are three color channels is a bit of a distraction, it's duplicated information
it's halfway down the page of notes i sent: https://cs231n.github.io/convolutional-networks/
So if the first pixel in the red layer is 2, blue is 3 and green is 1 and the weights are all 1, the first pixel in the new feature map is 6?
well the output, ie., activation map, is a scanning of the filter across the image
But you say it punches through basically adding the values of the pixels in each layer?
Resulting in one "image" and not 3
yeah, but the detail is subtle, and i wouldnt want to get it wrong
If you are using padding, yes
the 180x180x32 activation maps are "images" made by weights, but the weights can be 3x3x3, say
or 3x3x1
So it could give me 3 "images" per filter again?
each cell in the "image" ( activation map ) corresponds to the "same region" in the input picture
so the top left of each activation map is the top left of the input
and the value is the sum of the filters applied to that region
Yes that's the scan it makes with the filter
if you see a filter as a volume: 3x3x3
then it's dot'd with the same volume in the input, 3x3x3
and all i can remember about that, is the math is basically, take those 27 numbers (3x3x3) in the top corner, and dot them with the 27 numbers in the filter
basically in quite a naive linear order
I see a filter as 3 values in top 3 in middle and 3 in bottom
Yes okay
What confuses me was how 180x180x3 turned into 180x180x32
well the "volume in weights" dot "volume in the image" is a single number
Since in my case I convolve more than once resulting in 180x180x32 turning into 180x180x64 after and I thought it would be 32*64 activation maps
So 180x180x(32*64)
the underlying operation between each filter and each region of the image is just a dot product
so it projects just to a single number for each application of the filter
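The shape bookkeeping is easier to see on a tiny example. Below is a naive sketch in NumPy with stride 1, no padding, and no bias or activation, using a 6x6x3 stand-in for the 180x180x3 image and 4 filters instead of 32. `conv2d_valid` is a made-up helper, not how Keras implements it:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # small stand-in for 180x180x3
filters = rng.random((4, 3, 3, 3))   # 4 filters (stand-in for 32), each 3x3x3

def conv2d_valid(x, w):
    """Naive convolution layer: stride 1, no padding, no bias, no activation."""
    n_filters, kh, kw, _ = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, n_filters))
    for f in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i:i + kh, j:j + kw, :]      # kh x kw x channels volume
                out[i, j, f] = np.sum(patch * w[f])   # dot product -> one number
    return out

maps1 = conv2d_valid(image, filters)   # one activation map per filter
# A second layer's filters span ALL channels of its input (depth 4 here),
# so 8 filters give 8 maps, not 4 * 8.
filters2 = rng.random((8, 3, 3, 4))
maps2 = conv2d_valid(maps1, filters2)
print(maps1.shape, maps2.shape)
```

Without padding the maps shrink by 2 pixels per 3x3 layer; with "same" padding the height and width would stay fixed, which is how 180x180 is preserved in the case discussed above.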
I think I get it. Thanks for the answers @untold tundra and @quiet vault
sure, i'd strongly recommend watching the stanford course
all of these questions are answered very clearly and logically
good night
Goodnight
Hello, i have a question regarding the implementation of a particular neural network. I understand the model but i can't figure out how this guy does backpropagation. Could someone explain it to me? https://teddykoker.com/2019/12/beating-the-odds-machine-learning-for-horse-racing/
Inspired by the story of Bill Benter, a gambler who developed a computer model that made him close to a billion dollars betting on horse races in the Hong Kong Jockey Club (HKJC), I set out to see if I could use machine learning to identify inefficiencies in horse racing wagering. (Footnote: Kit Chellel, The Gambler Who Cracked the Horse-Racing Code.)
if anyone could help with minimax ai come to kiwi channel
Can anybody help me on this..
Can somebody explain a classification report done on an SGD classifier to me. I have 79% precision and 100% recall on 0s but I have 0% precision and recall on 1s
That means your classifier classifies everything as 0, I believe
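Those exact numbers fall out of an always-0 classifier on data that is 79% zeros. A small pure-Python sketch with made-up labels; `precision_recall` is a helper written for the example, and it reports 0.0 where the ratio is undefined:

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, with 0.0 when the ratio is undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up labels with the same class balance as the report: 79% zeros.
y_true = [0] * 79 + [1] * 21
y_pred = [0] * 100            # a classifier that always answers 0

print(precision_recall(y_true, y_pred, 0))  # class 0
print(precision_recall(y_true, y_pred, 1))  # class 1
```

Class 0 gets 100% recall because every real 0 is caught, and 79% precision because 21 of the 100 "0" predictions are wrong. Class 1 is never predicted, so both of its numbers are 0.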
yo anyone who is learning data science or working in field of data science ? be kind enough to hit me up on dms or ping me
Hey guys - So I am creating a real estate regression model based on historical sales of real estate in the Copenhagen (Denmark) area. I was curious if anyone has any cool articles regarding the performance and real-world viability of these types of models?
I know how to create it, I am just curious how performant these things are, especially cause I don't know much about real estate, but will be buying an apartment soon
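import cv2

def prepare(filepath):
    size = 50
    # Read as grayscale (assumes the model was trained on 1-channel input, as the trailing 1 suggests).
    img = cv2.imread(filepath, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (size, size))
    # Add batch and channel axes: (1, size, size, 1).
    return img.reshape(-1, size, size, 1)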
i need help
return img.reshape(-1,size,size,1)
np.reshape(img, shape) takes the shape as a single argument (normally a tuple); the img.reshape(...) method accepts either a tuple or separate ints
i know
In this part, we're going to cover how to actually use your model. We will us our cats vs dogs neural network that we've been perfecting.
Text tutorial and sample code: https://pythonprogramming.net/using-trained-model-deep-learning-python-tensorflow-keras/
Dog example: https://pythonprogramming.net/static/images/machine-learning/dog.jpg
Cat ...
so img.reshape((-1, size, size, 1)) is the tuple form of what you have
Hey do anyone know how to make a visualisation like this on my ConvNet?
What does it mean when my evaluation loss is magnitudes higher than my training loss?
Your model is overfitting. Your model is suffering from high variance problem.
In a layman terms, it means your model performed well in minimizing your loss function on train set, but not so well in replicating the achieved success on your validation set.
The aim is to get the RMSE/Categorical_crossentropy/ exact loss function obtained on Validation set to be exactly same or somewhat closer to the RMSE/categorical cross entropy /your exact loss function obtained on your train set.
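A toy illustration of that gap, as a sketch rather than anything tied to the model in question: fit polynomials of two different degrees to a handful of noisy points and compare the mean squared error on the training points with the error on held-out points. The data, noise level and degrees are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    """Toy ground truth plus noise; stands in for real data."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0.0, 1.0, 8)
x_val = np.linspace(0.05, 0.95, 8)
y_train, y_val = noisy_sine(x_train), noisy_sine(x_val)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial `coeffs` on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Degree 7 on 8 points has enough parameters to pass through every
# training point (memorisation); degree 3 cannot.
for degree in (3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(degree, mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
```

The degree-7 fit drives the training error to essentially zero while its validation error stays far larger, which is the "evaluation loss magnitudes higher than training loss" pattern.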
Hello all.... In the Spearman correlation test (spearmanr), what is our null hypothesis? That there is correlation, or that there is no correlation?
Hello, I have a small question. I have a pandas column which have negative values, I want to convert them into positive .
I tried
data[data['Quantity']] = abs(data['Quantity'])
but its giving error
which error? are there nan/null values in the column?
By error, I mean, it was taking years.
But upon searching a bit, I found the code
data[data.columns[data.dtypes != object]] = data[data.columns[data.dtypes != object]].abs()
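If only the one column needs fixing, a narrower version is probably enough. A minimal sketch on made-up data:

```python
import pandas as pd

data = pd.DataFrame({
    "Quantity": [3, -2, -7, 5],        # made-up values
    "Item": ["a", "b", "c", "d"],
})

# Select the column by NAME on the left-hand side. The earlier attempt,
# data[data['Quantity']], used the column's values as column labels instead.
data["Quantity"] = data["Quantity"].abs()

print(data["Quantity"].tolist())
```

The broader one-liner above does the same thing for every non-object column at once, which is fine as long as you really want all numeric columns made positive.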
I understand, I've got a few followups if you don't mind:
- My training accuracy itself wasn't that high anyway, so what does it mean that it minimized my loss function?
- Is there a method to fix the issue, or is it just randomly changing my hyper-parameters?
These ML terminologies might kinda be confusing but there are two things I'd like you to understand first.
Remember, one of the main goals of almost every model in ML or Deep Learning projects is to either:
- Minimize the loss function
Or - Maximize the objective function
Now, as an ML engineer, your goal is to find the optimum point: a model that minimizes your loss function and at the same time maximizes (or minimizes) your objective function, so that you achieve the lowest generalisation error (you can think of this roughly as finding an equilibrium point). Balancing underfitting against overfitting in this way is what ML calls the bias-variance tradeoff.
Usually, achieving #1 in most scenarios leads to achieving #2 as well.
__What is a Loss Function or Objective Function?__
In layman's terms, a loss function is what we use while training our model to measure how well the model explains the data.
NB: When we're minimizing a function, it's called a loss function or cost function
Examples: RMSE, MAE, log loss (closely related to categorical cross-entropy), MSE, Huber, etc.
An objective function is the function we want to either minimize or maximize. In general terms, it's any function we want to optimize during training. So a loss function is a type of objective function.
Examples: coefficient of determination (a.k.a. R^2), explained variance score, F1 score, ROC/AUC score, accuracy_score, precision_score, and all the examples of loss functions above, etc.
With the above explanation I believe you can easily understand what I'm about to say next.
a) High Error leads to Low Accuracy score
b) Low error leads to High Accuracy
So to answer your questions now:
- I'm pretty sure everything I've explained up to this point has answered your first question.
- You'd have to reduce your model complexity, or gather more data. By reducing model complexity I mean, for example, lowering your max_depth or raising your min_samples_leaf, etc.
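To make the complexity point concrete, here is a rough sketch using a scikit-learn decision tree on synthetic data. The dataset, the choice of model, and the parameter values are all assumptions for illustration, not the asker's setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y randomizes a share of the labels).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)

def train_val_scores(**params):
    """Fit a tree with the given constraints; return (train, validation) accuracy."""
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)

deep = train_val_scores()                                    # unconstrained tree
shallow = train_val_scores(max_depth=3, min_samples_leaf=10)  # simpler tree
print("deep    (train, val):", deep)
print("shallow (train, val):", shallow)
```

The unconstrained tree memorizes the training set, so its training score says little about validation performance; the constrained tree typically shows a much smaller gap between the two.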
Still confused? I hope not
Not confused anymore, thanks a lot!
Hi, I am so confused about this code. What does argsort()[-6:] do in this code?
do you know what argsort does, and do you know what [-6:] does?
!e
import numpy as np
arr = np.array([1, 7, 3, 4])
print(arr.argsort())
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[0 2 3 1]
I only know argsort as a function used to get similarity values between items
That is not what argsort does.
!docs numpy.ndarray.argsort
`ndarray.argsort(axis=-1, kind=None, order=None)`
Returns the indices that would sort this array.
Refer to [`numpy.argsort`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort "numpy.argsort") for full documentation.
See also
[`numpy.argsort`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort "numpy.argsort"): equivalent function
can you explain more to me?
when you argsort an array, you get an array of ints where each int represents an index in the original array
If we explain the above example stel gave,
argsort will give index of element instead of actual element in the sorted array.
and the ints are in the order that the elements would be if you had actually sorted the array
Does argsort give an ascending array of values?
Means?
Does argsort give the values from smallest to largest?
It is giving indices of values from smaller to higher.
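Putting the two pieces together, a small sketch of what `argsort()[-6:]` selects. The `scores` array is made up; in the original code it is presumably an array of similarity scores:

```python
import numpy as np

scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4])  # made-up scores

order = scores.argsort()  # indices from the smallest score to the largest
top6 = order[-6:]         # last six entries: indices of the six largest scores
top6_desc = top6[::-1]    # reversed, so the best-scoring index comes first

print(order)              # [0 5 2 7 4 3 6 1]
print(top6)               # [2 7 4 3 6 1]
print(scores[top6_desc])  # [0.9 0.8 0.7 0.5 0.4 0.3]
```

So `argsort()[-6:]` is a common idiom for "positions of the six highest values", still ordered from lowest to highest among those six.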
This is an official implementation of our CVPR 2020 paper "HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation" (https://arxiv.org/abs/1908.10357)...
Yo
I have the task to run this
But have no idea where to start
Could somebody give me some guidance or nudge me into the right direction how I use those pretrained models
Hey everyone, I have a character recognition task. I've already finished the preprocessing and got some binary images, but I have no idea how to extract the features and store them in the database. Can anyone who is learning image processing and knows about zoning-based feature extraction with an SVM classifier explain it to me, or give me a link to the documentation? You can hit me up on DM.
Hi how would I create different plots from different data frames using a loop?
Hey everyone, I need some help! How does this ConvNet return a negative value when the sigmoid function is applied and there are no negative values, as far as I can see, after the model: https://colab.research.google.com/drive/1F4prYqhvItrD9xaHlZs46-EE7f5MxOEF?usp=sharing
Why did you start with the assumption that there has to be a loop?
Can you give an example of the data you're trying to plot? do print(df.head().to_dict('list')) and show it as text.
Im running this in terminal (from model_main_tf2.py) and im getting an error:
PIPELINE_CONFIG_PATH=/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2/pipeline.config
MODEL_DIR=/Users/admin/PycharmProjects/Imageclassifier/model/object_detection/efficientdet_d7_coco17_tpu-32 2
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
--model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--pipeline_config_path=$PIPELINE_CONFIG_PATH \
--alsologtostderr
error:
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
1. tensorflow.python.lib.io._pywrap_file_io.BufferedInputStream(filename: str, buffer_size: int, token: tensorflow.python.lib.io._pywrap_file_io.TransactionToken = None)
Invoked with: None, 524288
Can you show what comes before that part of the error message?
!traceback
sure, but it is quite long
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Looks like this isn't something I know enough about to jump into rn, but you can save this paste and ask again later or in a help channel. Or wait here.
alright
Hi @serene scaffold my data is like this
```python
import pandas as pd  # import pandas package to read data more easily
import matplotlib.pyplot as plt  # imported pyplot to plot graphs
import datetime as dt  # date time to read first column of csv file
import numpy as np
from datetime import datetime
df = pd.read_csv('TSLA.csv')
df2 = pd.read_csv('NBM.V.csv')
df3 = pd.read_csv('TSLA.csv')
df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
x1=(df['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400
x2=(df2['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400
y1=df['Close']
y2=df2['Close']
t0=[]
d0=[]
def udcf(df,df2,t0,d0):
    y1_mean = np.mean(y1)
    y2_mean = np.mean(y2)
    y1_stdv = np.std(y1)
    y2_stdv = np.std(y2)
    for i in range(len(df)):
        for j in range(len(df2)):
            t=x2[j]-x1[i]
            t0.append(t)
            d = (y1[i]- y1_mean)*(y2[j] - y2_mean)/(y1_stdv*y2_stdv)
            d0.append(d)
    return udcf,t0,d0
x, y = zip(*sorted(zip(t0, d0)))  # ensures x and y values correspond to each other in pairs when sorted
plt.scatter(x, y, ls='-', lw='1', color='red', marker='.')
```
I want to use other data frames(df2,df3) and calculate t0 and d0 for each then plot them in different graphs rather than doing it manually
Please don't show code when asked to show data. I don't have any of those three CSVs on my computer, so this isn't helpful for me.
anyone aware of a jupyter frontend that is capable of accepting user input and connecting to an already running kernel, like jupyter console can with --existing?
What does one call it when they do an "outer" operation on two vectors, other than multiplication (ie outer product)?
looks like it might not matter, in this case
one of the csv files (df)
Remember what I said before: print(df.head().to_dict('list'))
if your next message doesn't have the data in that format, I'm afraid I'll have to stop helping.
That was right except that it was the same df twice
print(df.head().to_dict('list'))
{'Date': [Timestamp('2010-08-10 00:00:00'), Timestamp('2010-11-10 00:00:00'), Timestamp('2010-12-10 00:00:00'), Timestamp('2010-10-13 00:00:00'), Timestamp('2010-10-14 00:00:00')], 'Open': [10.25, 10.19, 11.05, 12.25, 12.9], 'High': [10.57, 12.0, 12.75, 12.8, 14.79], 'Low': [10.1, 9.85, 10.96, 11.86, 12.75], 'Close': [10.13, 11.13, 12.05, 12.71, 13.94], 'Adj Close': [10.13, 11.13, 12.05, 12.71, 13.94], 'Volume': [1135300, 712500, 777000, 1413100, 1895200]}
is that ok?
yes, this is what I wanted
ah kl
It's a good format because I can copy and paste it directly and use it
ohh i see
def udcf(df,df2,t0,d0):
    y1_mean = np.mean(y1)
    y2_mean = np.mean(y2)
    y1_stdv = np.std(y1)
    y2_stdv = np.std(y2)
    for i in range(len(df)):
        for j in range(len(df2)):
            t=x2[j]-x1[i]
            t0.append(t)
            d = (y1[i]- y1_mean)*(y2[j] - y2_mean)/(y1_stdv*y2_stdv)
            d0.append(d)
    return udcf,t0,d0
This can be greatly simplified
def udcf(y1, y2):
    d = np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())
    t = (y1.reshape(-1, 1) - y2.reshape(1, -1)).reshape(-1)
    return t, d
or something like that.
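One way to flesh that out, as a sketch rather than a drop-in replacement: take the lags from the time arrays (as in `t = x2[j] - x1[i]`) and the products from the value arrays, then flatten both in the same order as the nested loops. The four-argument signature is an assumption made here, and `np.subtract.outer` is the ufunc `outer` method, i.e. an "outer" operation using subtraction instead of multiplication:

```python
import numpy as np

def udcf(x1, y1, x2, y2):
    """Return (lags, dcf) as flat arrays, ordered like the nested i/j loops."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))

    # np.subtract.outer(x2, x1)[j, i] == x2[j] - x1[i]; transpose so rows follow i.
    lags = np.subtract.outer(x2, x1).T

    # dcf[i, j] == (y1[i] - mean(y1)) * (y2[j] - mean(y2)) / (std(y1) * std(y2)).
    # ndarray.std() uses ddof=0, the same default as np.std in the loop version.
    dcf = np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())

    return lags.ravel(), dcf.ravel()

# Tiny made-up example.
t, d = udcf([0, 1, 2], [1.0, 2.0, 4.0], [0, 5], [3.0, 7.0])
print(t)  # [ 0.  5. -1.  4. -2.  3.]
```

Converting to plain arrays up front avoids relying on `reshape` being available on the inputs.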
Anyway @mighty spoke what are you trying to plot? Which two columns are x and y?
@serene scaffold I'm trying to plot the lag values (x) and dcf values (y)
@serene scaffold this is my other data frame df2
print(df2.head().to_dict('list'))
{'Date': [Timestamp('2020-11-27 00:00:00'), Timestamp('2020-11-30 00:00:00'), Timestamp('2020-01-12 00:00:00'), Timestamp('2020-02-12 00:00:00'), Timestamp('2020-03-12 00:00:00')], 'Open': [0.09, 0.09, 0.09, 0.09, 0.09], 'High': [0.09, 0.09, 0.09, 0.09, 0.09], 'Low': [0.09, 0.09, 0.09, 0.09, 0.09], 'Close': [0.09, 0.09, 0.09, 0.09, 0.09], 'Adj Close': [0.09, 0.09, 0.09, 0.09, 0.09], 'Volume': [0, 0, 0, 0, 0]}
Hey @mighty spoke!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
my full code:
@mighty spoke idk if I can do a full code exploration rn. Can you make it so that your dataframes have columns that represent the x and y data?
how is lag calculated?
lag is calculated using t=x2[j]-x1[i]
sure i did this df4 = pd.DataFrame({'X' : x, 'Y' : y})
also I tried binning the x values but i'm not sure its done the most efficient way
x=t0, y=d0
i'm trying to compare different data frames like df and df3 or df2 and df3 then plot them in different graphs but i don't want to make a a udcf function for each
you should only have to make one udcf function that you can call whenever
though it looks like your function is weirdly defined
it depends on variables defined in the global scope and doesn't use its parameters. it also returns itself
yes thats what i want
def udcf(y1, y2):
    return np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())
assuming that they are both vectors (one-dimensional arrays)
then i could call it outside like y=udcf(y1, y2) ?
ye
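With one reusable function, plotting every pair of data frames in its own figure can be a short loop. The sketch below makes several assumptions: the three frames are tiny made-up stand-ins with `Date` and `Close` columns, `lags_in_days` is a hypothetical helper, and the `Agg` backend line is only there so it runs without a display:

```python
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def udcf(y1, y2):
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    return np.outer(y1 - y1.mean(), y2 - y2.mean()) / (y1.std() * y2.std())

def lags_in_days(d1, d2):
    """Hypothetical helper: lags[i, j] = d2[j] - d1[i], expressed in days."""
    d1 = pd.to_datetime(d1).to_numpy()
    d2 = pd.to_datetime(d2).to_numpy()
    return (d2[np.newaxis, :] - d1[:, np.newaxis]) / np.timedelta64(1, "D")

# Made-up frames standing in for the CSV data.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
frames = {
    "A": pd.DataFrame({"Date": dates, "Close": [1.0, 2.0, 4.0]}),
    "B": pd.DataFrame({"Date": dates, "Close": [3.0, 1.0, 2.0]}),
    "C": pd.DataFrame({"Date": dates, "Close": [5.0, 6.0, 4.0]}),
}

figures = {}
for (name1, f1), (name2, f2) in combinations(frames.items(), 2):
    x = lags_in_days(f1["Date"], f2["Date"]).ravel()
    y = udcf(f1["Close"], f2["Close"]).ravel()
    fig, ax = plt.subplots()  # one new figure per pair
    ax.scatter(x, y, color="red", marker=".")
    ax.set_title(f"{name1} vs {name2}")
    ax.set_xlabel("lag (days)")
    ax.set_ylabel("UDCF")
    figures[(name1, name2)] = fig
```

Keeping the frames in a dict means adding another CSV is one more entry rather than another copy of the function.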
when you did this t = (y1.reshape(-1, 1) - y2.reshape(1, -1)).reshape(-1)
return t, d
what does reshape do ?
change the shape of the array
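A small illustration of why the reshapes matter in that line, with made-up arrays: turning one vector into a column and the other into a row lets broadcasting produce every pairwise difference, and the final `reshape(-1)` flattens the grid back to one dimension.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

col = a.reshape(-1, 1)   # shape (3, 1): a column
row = b.reshape(1, -1)   # shape (1, 2): a row
diff = col - row         # broadcasts to (3, 2): diff[i, j] = a[i] - b[j]
flat = diff.reshape(-1)  # back to 1-D, shape (6,)

print(diff.shape)  # (3, 2)
print(flat)        # [ -9 -19  -8 -18  -7 -17]
```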
Hello, I don't have a powerful PC to train a resource-intensive model, do you know a software to make clusters that works on both linux and windows?
so you want to do clustering? you can use sklearn for that.
Hello everyone !
I was wondering if anyone is able to help guide me in a direction for a project ?
I've been looking into it and I've been seeing a lot of AI.
Not 100 percent sure if this is the place.
I use Hugging Face transformers for training
in either case, all the deep learning libraries I know about (sklearn is not a deep learning library, for the record) can run on Windows and Linux, but might require extra work to get running on Windows.
uh yes but how to launch on several machines on the same job?
do you mean several machines or several CPUs?
several machines :p
This is the channel for asking AI questions, yes.
you might need to use Hadoop or something for that.
Just thinking out loud...
Does anyone here really use TensorFlow's high-level Estimator API to train a model? If so, how often, cos...
I'm well aware of its many advantages over the low-level algebraic method and the Keras Sequential method, but I think it can be stressful when we have many features in our dataset.
Let's say we have 52+ features in our data; do we really have to define each of the 52+ feature columns manually?
Is there no way to avoid manually defining each feature column?
if it makes you feel better, the docs recommend against Estimator because it doesn't support the v2 api, and suggest using the Keras api instead
that said, you can always write a for loop or list comprehension if you need to programmatically build up lists of features
Hey guys,
i have installed anaconda which almost comes with all packages i need. Is it still necessary to create an environment?
Can i not just do:
- create a project in a location of my choice and select the default base environment. And then finally run the scripts since most of my packages are there in the conda installed location (base env)?
creating one environment per project helps make sure your anaconda installation itself doesn't get messed up
there are a lot of reasons for this, but it's going to save you a lot of pain in the future if you just create one env per project
so yes of course you can do what you are asking about, but you shouldn't
personally i think anaconda made a very poor decision by shipping everything in one big base environment
Hi guys
How to implement roc auc for multiclass?
Tried various variants from Google and nothing worked (perhaps because I am not as smart as I want)
you can compute it separately for each pair of classes, and combine those results using this formula: https://stats.stackexchange.com/q/76830/36229
Thanks!
Compute separately like
for class in multiclass:
code
?
yes, it can be a for loop over pairs of classes
Thanks!
Thanks. Could you send the link to the doc you referenced here? I'd love to read it as well
actually.. i'm not sure if that's how you do it
let me look into this a bit more @compact parrot
Ok!
you might need to do something like fit a different model for every pair of classes
Oh
it's not generally used for multi-class problems
https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator
Warning: Estimators are not recommended for new code. Estimators run v1.Session-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our compatibility guarantees, but will receive no fixes other than security vulnerabilities. See the migration guide for details.
Estimator class to train and evaluate TensorFlow models.
Thanks man
you don't need to share the code, but if you explain more about what you are doing, people can provide more useful advice
although feel free to share if you can
My scientific director at uni says that I should implement that
let me skim the paper that the stackexchange answer linked
```python
#%%
x = df.loc[:,0:63]
y = df[64]
n_classes = y[0]
#%%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
#%%
sc = StandardScaler()
x_train = pd.DataFrame(sc.fit_transform(x_train))
x_test = pd.DataFrame(sc.transform(x_test))
#%%
# Naive Bayes
gnb = GaussianNB()
fit = gnb.fit(x_train, y_train)
y_train_pred = fit.predict(x_train)
y_test_pred = fit.predict(x_test)
result = {'y_train': y_train, 'y_test': y_test, 'y_train_pred': y_train_pred, 'y_test_pred':y_test_pred}
show_info('Naive Bayes', gnb, result)
```
I am using this dataset
https://www.kaggle.com/kyr7plus/emg-4
https://www.hpl.hp.com/techreports/2003/HPL-2003-4.pdf yeah here, see section 8.2 for an explanation of this.
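For what it's worth, scikit-learn's `roc_auc_score` has a `multi_class` option that does this averaging from a single model's class probabilities, so it may not be necessary to fit one model per pair. A sketch on synthetic data (the dataset is made up; only the Gaussian Naive Bayes model mirrors the code above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic four-class problem standing in for the real dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=4, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

gnb = GaussianNB().fit(x_train, y_train)

# Multiclass AUC needs per-class probabilities, not hard predictions.
proba = gnb.predict_proba(x_test)  # shape (n_samples, n_classes)

# "ovo": average AUC over all pairs of classes; "ovr": each class vs the rest.
auc_ovo = roc_auc_score(y_test, proba, multi_class="ovo", average="macro")
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(auc_ovo, auc_ovr)
```

Whether the pairwise ("ovo") or one-vs-rest ("ovr") average is the one your supervisor has in mind is worth checking before reporting a number.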
Thanks, will read it