#data-science-and-ml
@frail locust Put the ` character once on each side of a word to format it inline, like `this`.
For full code snippets, use it three times on each side, with the code in between.
Wow
Use the ` key
ty
Hi guys! Could please somebody help me with a computer vision task?
Hey guys hows the mit course on linear algebra on youtube for data science ?
hey guys
@frail locust for visual display, or actually in the number?
@arctic cliff are those strings "NaN"?
oh hm
thats actually a bit weird
I see
Hi, I’m using Tensorflow 1.15.3 / Ludwig with Google Colab and have followed this guide to the letter: https://www.searchenginejournal.com/automated-intent-classification-using-deep-learning-part-2/318691/#close
When I run !ludwig experiment --data_csv Question_Classification_Dataset.csv --model_definition_file model_definition.yaml
I get an error File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py", line 491, in __getattr__ raise _exceptions.UnparsedFlagAccessError(error_message) absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed.
Here’s the link to Colab if you want to take a peek: https://colab.research.google.com/drive/1LZ9aA06B3wXysgGc8tfVHi8FZqZFx2xZ?usp=sharing
Thanks for your time. Been Googling all afternoon and don't want to give up! Hopefully not too much of a noobish question.
@lapis sequoia maybe something like this
```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('data.csv', parse_dates=['timestamp'])

# count rows per hour and per directive
violations_hourly = (
    data
    .groupby([pd.Grouper(key='timestamp', freq='60min'), 'VIOLATED_DIRECTIVE'])
    .size()
    .to_frame('count')
    .reset_index()
)

fig, ax = plt.subplots()
for lab, grp in violations_hourly.groupby('VIOLATED_DIRECTIVE'):
    ax.plot(grp['timestamp'], grp['count'], label=lab)
fig.legend()
plt.show()
```
the "Grouper" thing i didnt know about before. got it from here: https://stackoverflow.com/a/32012129/2954547
using .count() itself for some reason resulted in an empty dataframe
not sure why
@desert oar
The blue df contains 2 int numbers
Can I split the x number like the y ?
@arctic cliff
ax = plt.gca()
ax.set_xticklabels(blue['blueDragons'].unique())
maybe try something like this
AttributeError: 'numpy.int64' object has no attribute 'unique'
@desert oar
It's only one value
But I want to automatically split it into 7 pieces so the plot can be more logical
I still don't know if I should let it like that ..
oh
how does that make sense
how did you even plot a single value
ohhh i see what you did
ax = plt.gca()
ax.set_xticklabels(blue.columns)
@arctic cliff
anyway ax.set_xticklabels is what you want
plt.gca() is "Get Current Axes"
an "Axes" (in matplotlib terminology) is the area that you plot in
a "figure" is a grid of one or more Axes
In scipy, if I understand everything correctly, scipy.stats.norm.cdf(x) returns the probability that a normally distributed value is <= x. Thus x = 0 gives 0.5, x = infinity gives 1.0 and x = -infinity gives 0.0.
But, can someone explain what scipy.stats.norm.pdf(x) does?
the pdf is the probability density
for a continuous distribution its not really interpretable in and of itself, but
its analogous to density in physics/chemistry
integral of density within a certain domain = mass in that domain
@desert oar
That is what is confusing me. Because, scipy seems to let you do this.
https://en.wikipedia.org/wiki/Probability_density_function
Basically, the PDF is the derivative of the function you mentioned.
^
Ok, so PDF(0) is equal to 0
no
they are both used, because well, you can easily get one from the other.
PDF(x) is derivative of CDF at x
PDF(x) = (d CDF / dx)(x)
it makes more sense intuitively if you look at discrete distributions
for which the PMF (the discrete counterpart of the PDF) is PMF(x) = P(X = x)
whereas for a continuous distribution P(X = x) is always 0 because measure theory
ok, that makes sense. So, if I have a function... like the black scholes model that has one or more CDFs in it. And I want to find the derivative of that function, all CDFs turn into PDFs.
so you can only talk about P(X <= x) in a continuous distribution, which is the CDF
correct
https://www.desmos.com/calculator/njtytzquvt
Here, I made a graph.
excellent, you all were great help 🙂
The red is the PDF of a Gaussian (normal) distribution, the blue is its CDF. For the normal distribution, the CDF is usually written in terms of the error function.
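The relationship discussed above can be checked numerically. This is only a sketch using the standard library's `statistics.NormalDist` (with scipy the equivalent calls would be `scipy.stats.norm.cdf` and `scipy.stats.norm.pdf`); the helper names, the step size `h` and the sample points are arbitrary choices made for illustration.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def cdf_via_erf(x):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def numeric_derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    slope = numeric_derivative(std_normal.cdf, x)
    print(
        f"x={x:+.1f}  cdf={std_normal.cdf(x):.4f}  "
        f"pdf={std_normal.pdf(x):.4f}  d(cdf)/dx={slope:.4f}"
    )
```

The last two columns should agree to the printed precision, and PDF(0) comes out as 1/sqrt(2*pi) (about 0.3989) rather than 0.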
I guess my plotting itself is wrong .. @desert oar
My columns only contain 0 or 1
So I'm summing
That's why I end up with only one value
I ran into this while learning how to backsolve the implied volatility of a stock option from its current market price. The process required the usage of the derivative of the black scholes model.
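For anyone curious what that backsolving looks like, here is a minimal sketch, assuming a European call without dividends and Newton's method as the root finder. The function names, the starting guess and the example numbers are made up for illustration; the derivative with respect to volatility (vega) is where the CDF terms turn into a PDF.

```python
import math
from statistics import NormalDist

N = NormalDist()  # standard normal distribution

def bs_call(S, K, T, r, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * N.cdf(d1) - K * math.exp(-r * T) * N.cdf(d2)

def bs_vega(S, K, T, r, sigma):
    """Derivative of the call price with respect to sigma."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    return S * N.pdf(d1) * math.sqrt(T)

def implied_vol(price, S, K, T, r, sigma0=0.2, tol=1e-8, max_iter=100):
    """Newton iteration: adjust sigma until the model price matches `price`."""
    sigma = sigma0
    for _ in range(max_iter):
        diff = bs_call(S, K, T, r, sigma) - price
        if abs(diff) < tol:
            return sigma
        sigma -= diff / bs_vega(S, K, T, r, sigma)
    raise RuntimeError("implied volatility search did not converge")

# Round trip: price an option at 30% volatility, then recover the 30%.
market_price = bs_call(S=100.0, K=105.0, T=0.5, r=0.01, sigma=0.3)
print(implied_vol(market_price, S=100.0, K=105.0, T=0.5, r=0.01))
```

Newton's method can misbehave for options far from the money, where vega is close to zero, so a bracketing method such as bisection is a common fallback.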
maybe something like this
blue = df[['blueDragons', 'blueWins']].sum()
plt.plot([0, 1], blue)
plt.xticks([0, 1], blue.index)  # set tick positions and labels together
plt.xlabel('Dragon Effect')
plt.ylabel('Winnings')
@arctic cliff
Oh
forgot something..
It's the same as before ..
Does that plot makes sense to you ?
Can you get any useful info from it ?
Maybe it's right I don't know
the x-axis is missing on the new ones
Guess I don't even need a visualization for that kind of comparing ?
Well, They're actually not
X and Y are only 2 values
1 for x
1 for y
the idea of my code was to try and control the X axes more. you can keep experimenting
but yeah you should just print those values imo
no purpose in graphing 2 points
wait, you have two points?
@tidal bough they did .sum() on 2 columns
Then yeah, lol, maybe don't connect them with a line, that's very misleading 😅
df[['a', 'b']].sum() returns a Series with 2 numbers and index values 'a' and 'b'
it's not really something you see in How To Lie With Statistics. Not even there do people get 2 points of data and pretend it's a straight line.
xD Gotcha !
they are
or marker = "d",linestyle="", I think
I tried the scatter thing on both the series and the original df columns
I got some weird outputs
Not weird if they make sense ..
Something you can expect from 0 and 1
df.plot.scatter('blueDragons', 'blueWins')
did you try this?
oh yeah just 0 and 1
how about a cross table
What's a cross table ?
pd.crosstab(df['blueDragons'], df['blueWins'])
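A small illustration of what that call returns, using made-up 0/1 values in place of the real Kaggle columns (the numbers below are invented, only the column names come from the chat):

```python
import pandas as pd

# Invented 0/1 data standing in for the real columns
df = pd.DataFrame({
    'blueDragons': [0, 1, 1, 0, 1, 0, 1, 1],
    'blueWins':    [0, 1, 1, 0, 0, 1, 1, 1],
})

# Rows: blueDragons value, columns: blueWins value, cells: number of games
counts = pd.crosstab(df['blueDragons'], df['blueWins'])

# normalize='index' turns each row into proportions, i.e. the win rate
# with and without the dragon
rates = pd.crosstab(df['blueDragons'], df['blueWins'], normalize='index')

print(counts)
print(rates)
```

For two 0/1 columns this 2x2 table carries all the information there is, which is why it tends to be more readable than a scatter plot of the same data.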
@arctic cliff is the issue resolved? If not please provide me some background I may help you in this
Give me a second
@bleak fox Take a look at this:
https://www.kaggle.com/potatomanduh/league-of-legends-dragons-effect-on-winning
@arctic cliff thanks, i have gone through with this... Now please share what is the exact problem which you are facing?
I tried to make a plot about the relationship between blue/redWins and blue/redDragons
@arctic cliff what is the point in plotting these 4 points...?
To show the relationship (correlation) between the two of them
To show correlation we generally use a scatter plot, hence you can plot df.blueWins vs df.redWins (all values)
Also for correlation, you can use df.corr() and print the result
Give me a second
df[['blueWins','blueDragons', 'redWins','redDragons']].corr()
Use all the values in those columns. Does your main df have only 4 rows?
Can you share access of your notebook with me?
does anyone know why this sql query is not working in the 'WHERE' clause
Sure thing, Wait
@lapis sequoia share query
select extract(month from tstamp) as mon, extract(year from tstamp) as yyyy, count(number)
FROM table
WHERE mon != 8 and yyyy != 2020
GROUP BY 1,2
ORDER BY 2,1
getting column 'mon' does not exist
in the where clause
using psql
it works fine when i exclude the where clause but the columns are named mon and yyyy
Put "mon" in double quotes, and the same for "yyyy", and try once
in the select or where clause?
I guess I need to search for you to add you to the Collaborators ?
@lapis sequoia in where clause
@arctic cliff kapil.task.pro@gmail.com
im still getting a column "mon" does not exist
Do you have a kaggle account ?
WHERE "mon" != 8 and "yyyy" != 2020
@arctic cliff https://www.kaggle.com/kapilpanwar
Done
@lapis sequoia nav, the names mon and yyyy are only labels for the output columns; you can repeat the same expression in the WHERE clause as well, like WHERE extract(year from tstamp) != x
oh yeah i understand that
i just wanted to know why it doesnt work for an alias
so i cant use an alias in the where clause?
@lapis sequoia the WHERE clause is evaluated on the db side against the actual columns, before the SELECT list is computed, whereas an alias is just a new name given to the result after extraction. Hence WHERE always requires the actual column name or expression
@arctic cliff added heatmap and correlation matrix for your data in notebook cell 6 and 7
also last question
@lapis sequoia I'll try
```sql
select extract(month from tstamp) as mon, extract(year from tstamp) as yyyy, count(number)
FROM table
WHERE extract(month from tstamp) != 8 and extract(year from tstamp) != 2020
GROUP BY 1,2
ORDER BY 2,1
```
why does this result in all the rows containing 8 in the month AND all the rows containing 2020 in year to be lost?
What can I understand from a heatmap? It looks unfamiliar to me
@lapis sequoia this is what your filter is doing if month is 8 and year is 2020 don't include them...
yes
but it gets rid of 8-2017, 8-2019, 8-2020, AND all the months of 2020
so it gets rid of 1-2020, 2-2020, 3-2020 as well
i just want only 8-2020 gone
ok wait i changed it to OR and it worked
i dont know why lol
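The reason OR works is De Morgan's law: "not (month is 8 AND year is 2020)" is the same as "month is not 8 OR year is not 2020". A small sketch of this, using Python's built-in SQLite instead of PostgreSQL and an invented table `t` with plain integer columns `mon` and `yyyy`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (mon INTEGER, yyyy INTEGER)")
rows = [(8, 2017), (8, 2019), (8, 2020), (1, 2020), (3, 2020), (5, 2019)]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

def select(where):
    """Return all (mon, yyyy) rows that pass the given WHERE condition."""
    sql = f"SELECT mon, yyyy FROM t WHERE {where} ORDER BY yyyy, mon"
    return conn.execute(sql).fetchall()

# Keeps a row only if BOTH tests pass, so every August and all of 2020 go.
and_rows = select("mon != 8 AND yyyy != 2020")

# Drops a row only if BOTH tests fail, i.e. only August 2020.
or_rows = select("mon != 8 OR yyyy != 2020")

# The same filter written the way it is usually said out loud.
not_rows = select("NOT (mon = 8 AND yyyy = 2020)")

print(and_rows)
print(or_rows)
print(not_rows)
```

Writing the condition as `NOT (... AND ...)` is often easier to read than the equivalent OR form when the goal is to exclude one specific combination.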
@arctic cliff it is giving the correlation between all your columns: values near 1 (or -1) show strong correlation, whereas values near 0 show little or no linear relationship
@lapis sequoia happy for you😀
https://www.youtube.com/channel/UCChfG4FWN6qSPFqZF_E9XvA/videos please join and support 💪
quick question, (apologies if im using the wrong terminology) is there a way to stop jupyter notebooks from automatically moving the notebook every time i click a cell? Its driving me insane
@proper swift can you please elaborate what is moving and where?
sorry, first time using jupyter. Every time I try to click the end of a piece of code, the notebook jaggedly moves downwards. I have to scroll with the mouse to get it back into a more suitable position
I have no additional extensions installed. Im only using vanilla Jupyter on Windows 10 with Python 3.8
@proper swift sorry bro... It seems some issue with your browser/os/jupyter settings.. It is outside my scope... 😩
😦
@proper swift you can use vs code notebooks... I feel they are better than jupyter notebooks
Good to know. Sadly , i'm following a tutorial on Pandas which is using jupyter notebooks
@proper swift look it is just a place where you write code... You will be easily able to do things in vs code notebooks with same commands...
@proper swift check this out https://youtu.be/sHk9PH-9tSs
thanks for the link, will check it out
@proper swift welcome...
@proper swift favour me in supporting my channel too, we have also started data science course from scratch https://www.youtube.com/channel/UCChfG4FWN6qSPFqZF_E9XvA/videos
@bleak fox Are u a data science student?
@lapis sequoia no, i am a professional with 7+ year of experience in this field.
link to where i can find out more about data science?
other than wikipedia?
Or is wikipedia reliable?
@tidal sonnet https://www.youtube.com/channel/UCChfG4FWN6qSPFqZF_E9XvA/videos
@tidal sonnet Data sci is more than ml tho
ik
i picked py cause i wanted to learn ml
but i also want to know more about other parts of data science
well imo, databases are a big part of it
as in, if you learn to use them, they'll be pretty useful
numpy + pandas + matplotlib are some key libraries to learn as well
python is pretty much used in every field nowadays
ngl still haven't used it for ela
ela?
english
oh. people use it in the humanities, albeit more rarely
pandoc (in haskell) was written by a philosophy professor, if i recall right
hm I'll have a look into it
you have cases where people in the humanities write python scripts to manage their reference lists, things like that
not typically used directly in research, but can definitely be used as an automation tool by researchers
So I'm trying to write a kfolding algorithm that maintains class balance (like sklearn's StratifiedKFolds) and doesn't split groups (like sklearn's GroupKFolds)
I'm not sure how to go about doing that though
any ideas for a basic algorithm I could follow?
@solid aurora
Not really any particular algorithm, except for getting all the elements for each class, finding how many elements you need to have an even ratio, and then throwing away the extra elements (this could be problematic tho)
@flat quest I'm not trying to delete elements from my dataset at all
If my class balance is 1:4, StratifiedKFolds will make all folds approximately 1:4 as well
meaning it purposely tries to maintain that ratio rather than leaving it up to probability
well one way to go about it would be lets say your fold has 1000 elements and there's a ratio of 4 cats to 1 dog.
The dataset has 4000 cats to 1000 dogs. So you calculate the number of cats that you'll need, then get all the cats in the ds and randomly select that number of elements you need, then do the same for the dogs.
There's probably a faster way to do it, but that's just one way to do it @solid aurora
@flat quest then I also need to not split groups across folds (like GroupKFolds)
what you described is basically StratifiedKFolds, which is what I would normally use unless I needed groups
going with your example of cats+dogs, let there be owners who each own anywhere from 1 to 100 cats and dogs
owners can have both cats and dogs
then I'm not allowed to split the pets of an owner across two or more different folds
but let's say I have 1000 pets and I want 5 folds, then I still need 160 cats and 40 dogs per fold
@flat quest make sense?
ah gotcha
So not split groups across folds, but you want to balance the overall classes
You can't get perfectly equal class ratios across each fold, but you can make an approximation
Ok so one way would be.
Calculate the number of elements for each class that should be in the fold.
Then select the group that reduces the required number of elements for that fold by the greatest amount (so let's say the fold needs 4000 cats and 1000 dogs, but the group has 80 cats and 20 dogs; the remaining elements required would be 3920 cats and 980 dogs).
Continue doing so until we hit a certain threshold for all classes.
This would require some calculation steps to find the one that can reduce the required elements by the greatest. It might be slow. It'll also congregate all the large groups into the first few folds.
Another way would be to again calculate the number of elements of each class for the fold, but then select a group at random. Continue to do so, until one or all the classes have surpassed a threshold. This one is faster and will distribute the groups better, but it will be more error prone.
For example we might have many groups that are 100 cats 1 dog. This might cause the class distribution for the fold to be like 8,000 cats to 1000 dogs when all the class counts reach their threshold.
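A rough sketch of a greedy variant of the ideas above: groups are never split, the largest groups are placed first, and each group goes to whichever fold keeps the per-class shares closest to even. The function name, the scoring rule and the toy owners/pets data are all assumptions made for illustration, not an established algorithm from the chat. Newer scikit-learn releases also ship a `StratifiedGroupKFold` splitter aimed at this problem.

```python
from collections import Counter, defaultdict

def stratified_group_folds(labels, groups, n_folds=5):
    """Greedily assign whole groups to folds, keeping class shares even.

    Returns a list giving the fold number of every sample.
    """
    classes = sorted(set(labels))
    total = Counter(labels)

    # class counts per group, e.g. {"owner_3": Counter({"cat": 7, "dog": 2})}
    group_counts = defaultdict(Counter)
    for y, g in zip(labels, groups):
        group_counts[g][y] += 1

    fold_counts = [Counter() for _ in range(n_folds)]

    def imbalance_if_added(fold, counts):
        # Total squared deviation from an even class share over all folds,
        # supposing `counts` were added to `fold`.
        score = 0.0
        for f in range(n_folds):
            for c in classes:
                n = fold_counts[f][c] + (counts[c] if f == fold else 0)
                score += (n / total[c] - 1 / n_folds) ** 2
        return score

    group_to_fold = {}
    # Place the biggest groups first; small ones then fill the gaps.
    by_size = sorted(group_counts, key=lambda g: -sum(group_counts[g].values()))
    for g in by_size:
        best = min(range(n_folds),
                   key=lambda f: imbalance_if_added(f, group_counts[g]))
        fold_counts[best].update(group_counts[g])
        group_to_fold[g] = best

    return [group_to_fold[g] for g in groups]


# Toy data: 20 hypothetical owners, each with a few cats and dogs.
labels, groups = [], []
for owner in range(20):
    n_cats, n_dogs = 4 + owner % 5, 1 + owner % 2
    labels += ["cat"] * n_cats + ["dog"] * n_dogs
    groups += [f"owner_{owner}"] * (n_cats + n_dogs)

folds = stratified_group_folds(labels, groups, n_folds=5)
for f in range(5):
    print(f, Counter(y for y, k in zip(labels, folds) if k == f))
```

As noted in the chat, the class ratios per fold can only be approximate, since a group with an unusual mix has to land somewhere in one piece.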
@scarlet wigeon
A good complete blood cell dataset that I could be linked to?
@tidal sonnet Data sci is more than ml tho
@bitter harbor 100% right
@flat quest yea you're right that will probably be close enough
I was vastly overcomplicating it lol
I was trying to liken this to the packing problem and 0/1 knapsack and all
🙂
yeah maybe, but if you still want to try that route, by all means 😉
I don't think it would provide much of an improvement @solid aurora
But you never know right?
hey guys, has anyone seen an example of a model being deployed on online streamlit?
im a little confused on how it works in terms of running the model on a server
like i know ordinarily you load the model from a pickle file in flask
but all the examples ive seen with streamlit run the actual model before rendering the prediction
nvm, guys in the community had a solution
In keras, if im working on a multi-channel input layer, and throw a cnn onto that layer, does the cnn get applied to all channels, and how?
input = Input(shape=(100, 100, 4,))
x = Conv2D(32, kernel_size=(3, 3), activation='relu', padding='same')(input)
is x added to all channels, and if so, how?
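As a sketch of what happens under the hood, written in plain NumPy rather than Keras: each of the 32 filters spans all 4 input channels, so a filter's 3x3x4 block of weights is multiplied with a 3x3 window across every channel and summed into a single number. The random values and the chosen pixel position are arbitrary; the kernel layout (height, width, input channels, filters) matches how Keras stores `Conv2D` weights.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(100, 100, 4))     # height, width, channels
kernel = rng.normal(size=(3, 3, 4, 32))    # kh, kw, in_channels, filters
bias = np.zeros(32)

# One output pixel at row 10, column 20 (away from the padded border):
patch = image[9:12, 19:22, :]              # 3x3 window over ALL 4 channels
pixel = np.tensordot(patch, kernel, axes=3) + bias   # one value per filter
pixel = np.maximum(pixel, 0.0)             # relu

# Same value for filter 0, spelled out as multiply-and-sum over the window
filter0 = max((patch * kernel[..., 0]).sum() + bias[0], 0.0)

n_params = kernel.size + bias.size
print(pixel.shape, n_params)
```

So the convolution is not applied to each channel separately: the channels are mixed, and with `padding='same'` the output for the layer above has shape (100, 100, 32).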
Hi all, Pandas beginner here. Just wondering how it's possible to aggregate the value column as a weighted average?
df.columns = ['material_id', 'Thickness', 'Material', 'Width', 'Quantity', 'Date']
pd.to_datetime(df['Date'])
df = df.pivot_table(index=['Material', 'Thickness', 'Width'],
columns=[],
aggfunc=df.ewm(times='Date', halflife=datetime.timedelta(days=60)).mean(),
values='Quantity')
Here's where I'm up to, but I keep getting `IndexError: list assignment index out of range`
Full trace back? Which line exactly gives that error
I found it was actually caused by another exception, it's trying to convert another column to a float?
I'm now getting a KeyError for 'Date' with the above code: `KeyError: 'Date'`
Also have tried this:
df = df.groupby(by=['Material', 'Thickness', 'Width']).agg({'Quantity': ['mean', df.ewm(times='Date', halflife=datetime.timedelta(days=60)).mean()]})
Which causes the original exception, which is `ValueError: could not convert string to float:` which relates to the Material column
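One possible way to get a time-weighted average per group, sketched under the assumption that "weighted" means newer rows should count more, with a 60-day half-life as in the `ewm` call above. The sample rows are invented. Two things differ from the attempts above: the result of `pd.to_datetime` is assigned back to the column, and the weights are computed as an ordinary column first, because `aggfunc` and `.agg` expect a function or function name rather than an already computed result.

```python
import pandas as pd

# Invented sample rows using the column names from the question
df = pd.DataFrame({
    'Material':  ['steel', 'steel', 'steel', 'oak'],
    'Thickness': [2, 2, 2, 5],
    'Width':     [10, 10, 10, 30],
    'Quantity':  [100.0, 200.0, 400.0, 50.0],
    'Date':      ['2020-01-01', '2020-03-01', '2020-04-30', '2020-04-30'],
})
df['Date'] = pd.to_datetime(df['Date'])   # assign the converted column back

# Weight halves for every 60 days of age relative to the newest row
halflife_days = 60
age_days = (df['Date'].max() - df['Date']).dt.days
df['weight'] = 0.5 ** (age_days / halflife_days)

# Weighted mean per group = sum(weight * value) / sum(weight)
df['weighted_qty'] = df['Quantity'] * df['weight']
sums = df.groupby(['Material', 'Thickness', 'Width'])[['weighted_qty', 'weight']].sum()
weighted_mean = sums['weighted_qty'] / sums['weight']

print(weighted_mean)
```

This measures age against the newest date in the whole table; measuring it against the newest date within each group would be an equally reasonable choice.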
I need some ensembling advice: I have 5 neural nets I want to ensemble, I got good results with weighted averaging their softmax outputs. Now I want to try and apply weights to each model's individual class prediction scores. I tried LR but it seems to be slightly worse than weighted average, plus I have to split my validation set up to train the weights
What else could I try?
In fact, should i train my stacking LR model on the validation or training data?
Never train on validation. Defeats the whole purpose of a validation dataset
The general consensus is that you learn the weights for weighted averaging on the validation data rather than training
Which is why i did it
Uh, I'm not aware of such a general consensus but it seems wrong to me. Maybe I'm out of the loop.
Common sense dictates it's wrong though, so actually no, I'd challenge that statement.
"A smarter way to ensemble classifiers is to do a weighted average, where the weights are learned on the validation data"
Which is in a book written by the author of keras
So idk
I guess the idea is to prevent overfit? It's so leaky though
Why would it not leak info from holdout into our actual ensemble?
I am guessing since whole models are weighted rather than individual outputs, it's not as much of a concern
But yes, there would to some degree
I'll refrain from further comment on this, this is out of my depth and just feels wrong, which may simply be due to a gap in my knowledge.
If theres some material around this that you or someone else encounters and can share, I'd greatly appreciate it
That quote is the only real information I have on the matter
in a frequentist model, your model should never adapt to the validation data. in a Bayesian model, it will because that's how it works. I wonder if the meaning behind the keras book's author's comment was in this spirit
the weights are adapted as you go
The only thing that I can think that makes a bit of sense is that if we treat the average model weights as hyperparameters like we would when optimising the models, testing them on the validation set and picking the best parameters based on validation accuracy
thats the definition of overfitting
Then I have no idea why, but everywhere uses the validation set that i can see
maybe it's a nomenclature thing? ime it's training and test data. your model never sees the test data until you're super convinced that the model is robust and sound.
I have 3 sets, train, validation and test
ok i never heard of validation data then
Validation is basically just for hyperparam optimisation
and is representative of the test set
couldn't that be done with cross validation style chopping up the data?
Yeah, basically, but the competition specifically hands out the validation set for that purpose, with the test set unlabelled until it closes
ok my background would be training data = historical data; test data = paper trading. so you can't have a data set that 'represents' data that doesn't exist yet
I get what you mean, but it's for a competition, so the test set is already defined
It's just that it hasn't been labelled unlike the val set
Effectively the entire dataset is split into the 3 sets. Using the train and validation sets, provide a model that best predicts the unlabelled test set. Each set is drawn randomly from the entire corpus
some people have a train-validation-test split
so instead of something like K-fold cross-validation
you have a fixed validation set
of course, this can lead to overfitting your hyperparameters to the validation set
which cross-validation is more robust to
but in either case you find out on the test set.
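To make the contrast with a fixed validation set concrete, here is a bare-bones sketch of how K-fold cross-validation carves up the data, in plain Python with a hypothetical helper name. Every sample ends up in the validation part exactly once, so the tuning signal does not depend on one particular split.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is validated once."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

splits = list(kfold_indices(10, 3))
for train, val in splits:
    print("train:", train, "val:", val)
```

In practice the indices would be shuffled first, and the held-out test set stays outside of this loop entirely.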
@velvet thorn and "more robust", not "immune to".
so
Yeah, basically, but the competition specifically hands out the validation set for that purpose, with the test set unlabelled until it closes
@acoustic halo and yes, this does happen.
a fair bit.
but ultimately the point is that you perform hyperparameter tuning on a subset of the data you have that is not seen by the model, right?
Okay, so then in my case, where i have the 3 sets, which do I use for weighted averaging?
and that you perform ultimate evaluation of the model upon a set that has not been seen at all, even for hyperparameter tuning
@velvet thorn yes
@acoustic halo context?
weighted averaging of?
Each model is trained on the train set and hyperparam optimisation is done on validation set
Hi everyone,
I have a problem I've been working on for a couple of days and I just can't find a solution to it. I have 4 identically structured dataframes with multi-level columns and a single-level index.
Each dataframe consists of one sample with different dilutions, and for each dilution 3 separate measurements were taken. So all dilutions are grouped under sample and all replicates are grouped under their dilution.
I want to combine these different Dataframes so that the replicates of all 4 samples are grouped next to each other. So it's a kind of nested merge.
I tried merging two Groupby objects the following way:
for group, group2 in zip(df1, df2):
    pd.merge(group, group2, on="Label for level2")
But I get an error saying Grouped Objects cannot be merged. I tried looking for a solution but I'm not even sure how exactly to search exactly what I am looking for. Any help is greatly appreciated.
Thanks a lot
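Since no sample data was posted, the following is only a guess at the layout: column levels named `sample`, `dilution` and `replicate` (all three names, the labels and the values are invented). Under that assumption, no merge is needed. The frames can be concatenated side by side, and the column levels reordered so that the same replicate of every sample ends up adjacent.

```python
import pandas as pd

def make_sample(name, offset):
    """Build one fake dataframe: 2 dilutions x 3 replicates for one sample."""
    cols = pd.MultiIndex.from_product(
        [[name], ['1:10', '1:100'], ['rep1', 'rep2', 'rep3']],
        names=['sample', 'dilution', 'replicate'],
    )
    data = [[offset + i + 10 * r for i in range(len(cols))] for r in range(2)]
    return pd.DataFrame(data, index=['row_a', 'row_b'], columns=cols)

frames = [make_sample(f'S{i}', 100 * i) for i in range(1, 5)]

# Put the four frames next to each other (they share the same index)
combined = pd.concat(frames, axis=1)

# Move 'sample' to the innermost level, then sort, so each
# (dilution, replicate) pair is followed by the same pair of the next sample
combined = combined.reorder_levels(['dilution', 'replicate', 'sample'], axis=1)
combined = combined.sort_index(axis=1)

print(combined.columns[:4].tolist())
```

If the real data is laid out differently, the same two steps (concatenate, then reorder and sort the column levels) may still apply with other level names.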
Now i want to weighted average the outputs
Do I use the train set to find the weights or the validation set
@quick fox oh my god i spent like a week trying to merge two dataframes with multi column indices. I gave up and walked the lists and wrote the join myself because nothing made sense.
This is the problem, I would have thought the train set too
unless
But resources i find online use the validation set
you fit on the validation set and evaluate on the test set and stop there
specifically the author of keras says use validation set
@quick fox if you don't show sample data it's gonna be hard for you to get help
@spark cape Yeah I'm starting to get desperate, too. If it comes to it I'll have to do it by hand which I really don't want to
it doesn't sound like a very simple problem
so if you want someone to be able to work on it, you need a way for them to easily reproduce the initial situation
as well as know what the expected result is
in my time on SO
I've seen a lot of pandas questions go unanswered because it's not clear what people want
and in general, written explanations are bad.
data is good.
code that can be copy-pasted to create initial state is best
Yeah I get that. I thought this might be a problem that's easy to solve for someone more experienced than I
it's not that the problem is difficult to solve
it's that it's difficult to explain
and if I don't know what your problem is, I can't help you with it
just as you have no idea what to search for, I have no idea what your data actually looks like
Alright I'm on the phone right now. I guess a picture won't do?
hey @desert oar thanks for the help yesterday
I probably won't go any further than hazarding a guess
but someone else might
so why not
specifically the author of keras says use validation set
@acoustic halo which book
is this from
by the way
@velvet thorn Deep Learning with Python by François Chollet
265
Also found this:
"Finding the weights using the same training set used to fit the ensemble members will likely result in an overfit model. A more robust approach is to use a holdout validation dataset unseen by the ensemble members during training."
yup
you fit on the validation set and evaluate on the test set and stop there
@velvet thorn should be this
I mean
the thing is
you're basically doing a form of boosting
Except the part where I can't evaluate on the test set
Because it's unlabelled
yeah
so
what I would suggest is
split your data further
the train set
train base learners on t1, train meta-learner on t2, then evaluate on v
then final predictions on test set and submit that
like
hmm okay not a bad idea
I don't think it's wrong to train the meta-learner on the train set
but like I said
you're basically doing hardcore boosting
which is already fairly prone to overfitting
so
and yeah I mean meta-learning is probably pretty high variance already
what models are you using?
simple LR to combine?
NN
the metalearner
no, i'm learning the weights through nelder mead minimisation currently
actually, differential evolution, not nelder mead
I might just be lazy and leave it as is, my reasoning being that the averaging weights are not learnt per se, they are just another hyperparameter optimisation selected on the basis of the performance over the validation set
Just like layer size is selected on validation set performance
I don't even know anymore
isnt the basic stacking method just train a bunch of uncorrelated models, then fit linear regression on their predictions?
so what are you working on? im curious
I want to do average weight ensembling though, i tried stacking but got less than desirable results
fwiw, I actually won the first stage
congrats
what is average weight ensembling? never heard of that
Max Pechyonkin
Literally adding all the softmax outputs from various models together
but also applying a weight to each model output
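A minimal sketch of that idea, with synthetic softmax outputs standing in for the real models and a plain grid search standing in for the differential evolution mentioned earlier. Every name and number below is made up for illustration; `y_holdout` represents whichever held-out split the weights are allowed to be tuned on.

```python
import itertools
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_samples, n_classes, n_models = 200, 5, 3
y_holdout = rng.integers(0, n_classes, size=n_samples)

# Fake per-model softmax outputs: noisy logits with a bump on the true class
probs = []
for noise in (1.0, 1.5, 2.0):
    logits = rng.normal(scale=noise, size=(n_samples, n_classes))
    logits[np.arange(n_samples), y_holdout] += 1.0
    probs.append(softmax(logits))
probs = np.stack(probs)                    # (n_models, n_samples, n_classes)

def accuracy(weights):
    """Accuracy of the weighted average of the models' softmax outputs."""
    blended = np.tensordot(weights, probs, axes=1)   # (n_samples, n_classes)
    return float((blended.argmax(axis=1) == y_holdout).mean())

# All weight vectors in steps of 0.1 that sum to 1
steps = 10
grid = [
    np.array(w) / steps
    for w in itertools.product(range(steps + 1), repeat=n_models)
    if sum(w) == steps
]
best_weights = max(grid, key=accuracy)
print(best_weights, accuracy(best_weights))
```

Because the grid contains the "use only model i" corners, the chosen blend can never score worse than the best single model on the split it was tuned on, which is exactly why its score on that same split is optimistic.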
It's super basic, but I was planning on using the validation set to find the weights
Which is a big no-no apparently, despite being used in every article i look at
well linear regression is a weighted average if you squint
Yeah, I used LR to learn weights for each softmax output from each model individually
also theres this rule:
"Participants are NOT allowed to use the development set or any external dataset (labeled or unlabeled) to train their systems."
So i don't want to train the meta model on the validation set per se
Though I would argue optimising the model weights is learnt in much the same way model hyperparameters are learnt
the validation set
Ah ok
they just call it dev
darn
We actually have this problem at work, people want to use methods like temperature scaling and gold loss correction
All of those require "auxiliary" training sets
So if you get too aggressive using those methods you end up cutting down the size of your main training set significantly
Which can really hurt when you have a highly imbalanced problem or you are already low on data
So, this is my problem, I have to go back and retrain all my models on a smaller training set
Which is a massive pain
In some cases we have just reused the training set, but in those cases we were able to convince ourselves that the training set wasn't significantly different from any other version of that data set we would have now or in the future
And we had to proceed very carefully to avoid overfitting
But, this still doesnt explain why everyone seems to get their weights on the validation set
Are they just breaking the rules? Lol
based on the above link and textbooks
didn't we discuss that earlier
Oh, yeah. We do that too
But it really makes your validation set less useful
Think of it this way, every "external" procedure requires another validation set
So if you only have one validation set you basically need to decide which procedure gets trained on the main training set and which procedure gets trained on the validation
In this case the rules of the contest tell you what your decision is, either you reuse the training set or you split off your own validation sets
This is what I originally thought, but my instructor insisted the rule was mainly in the context of using the validation set to train the neural nets
But what you said makes more sense
what kind of problem is this
regression? classification? how many / what kind of features?
Classification (1000 classes), features vary per model
But mainly n-grams, abstract syntax tree nodes and a special version of BERT
ah very similar to stuff ive worked on
how imbalanced are the classes
and how imbalanced are the features
(why cant i spell imbalanced today)
Classes are evenly split, features are alright
how many records
If you want links for Machine Learning and AI learning courses and files send me a message
50k in the training set, 25k in validation and test
oh yeah
can you slice off like 5k from the training set?
use that to train the ensemble
how many models are you ensembling? like 5?
Yeah it's 5, the main thing I am trying to justify in my mind is whether selecting the weights counts as hyperparameter optimisation
Because I could just grid search
and pick the best
Just like picking layer sizes
are you allowed to use the development set for hyperparameter optimization?
yes
Is that not the point of the validation set anyway?
yes but in real life it's not a strict delineation
its not just "model + hyperparameters"
there are potentially several "layers" of training
as you're seeing here
if you have a model w/ gold loss correction, temperature scaling, and hyperparameter tuning, theoretically you have three nested training procedures
So what i'm hearing is that i can get away with using the validation set to "optimise my hyperparameters" 😆
well... more like "it's not the main model" is your argument
what a stupid rule imo
i think the idea here is that you aren't allowed to do a final training run that includes the validation set, before submitting
Exactly
It's a mess
thats the whole point of a validation set
can you like, clarify the rule w/ a judge
or i guess you can just do it anyway and hope nobody calls you out
I'll do just that, but until I hear otherwise, I'll go on the basis that I'm allowed to use the validation set for hyperparameter optimisation which includes selecting weights
Realistically, I'm not too bothered about the competition, this is for my final project so I'm more concerned with learning the actual concepts than with results
Does anyone have experience working on ml/ai open source projects? If so please reach out to me!
I'm trying to get started with tensorflow and opencv open source but im not sure where to start in terms of how to contribute.
@random perch I've contributed to open source projects before, as I'm sure many here have. Just not specifically tensorflow and opencv, but I don't imagine the process being any/much different. Most decent projects have a Contributing page/document that point you where help is most appreciated by the core devs. For example tensorflow: https://www.tensorflow.org/community/contribute
@Klaouss#9437
@paper niche Thank you very much! I appreciate your help 🙂
Hey everyone, does anyone have experience with data generators?
a little @faint ravine
What kind of data do you usually generate?
And are you aware of any generative algorithms other than the famous GAN?
also stupid question: for an input layer do i have to use keras.layers.Flatten() if im just inputting an array of parameters
i just used generators for a Image Classifier CNN
just modulated CIFAR-10 for more training data
So, just standard generation? Like rotating the image or playing around with the contrast?
yeah nothing complicated at all
Neat
sorry if im not helpful lol
Lol, It's ok.
Probably not
Aim for something that has a bit more purpose. Coding up an algorithm and running it often does not count. You have to "make it do something" and show results.
what does your SVM do anyway?
it might be a good project
if it's "i did a data science project and i happened to use an SVM for my model" that seems like a fine resume item
(as long as you can justify why you used the SVM)
@faint ravine i used SVM for point cloud segmentation
basically i had some point cloud data with different points belonging to different classes
and i implemented a multi class SVM to segment the point cloud into different regions
@desert oar is right.
Yeah, but don't say that on your resume. It sounds like: "I got some data, and I classified it". Something like: "I built a dog/cat recognizer" would be better.
of course
youre describing your project
make your project sound like a project
put the details in the bullet points
what kind of data was in the point clouds? or was it just a toy project w/ simulated data?
they were 3D Coordinates from a LIDAR Scanner
Yeah, don't rehearse the theoretical ideas that you learned. Implement them into something practically useful.
Guys, give me some advice on how to learn data science: how do i start thinking in a data-science way?
and the dataset is public
LIDAR Scanner Data Segmentation
- Used SVM to segment LIDAR scanner data
- etc...
@lapis sequoia Do you wanna plug-and-chug or learn the underlying theory?
how does this sound?
LIDAR Point Cloud Segmentation
-Implemented a soft margin multi-class SVM for point cloud segmentation
-Reduced computation time (by some metric) using efficient vectorized operations
-Achieved so and so accuracy
@faint ravine what u mean by plug-and-chug
That's good
did you implement the svm though?
yes
nice
like do you mean if use scikit learn?
yeah
i wrote an SVM Class using numpy
so no scikit
you might also want to mention where/how you got the data
@desert oar will do
i'm a beginner in numpy
but i love numpy
Hey guys for anyone interested in Data Science please check out my channel and leave a sub, if you want, would be very much appreciated to get my channel off the floor.
https://www.youtube.com/channel/UCiFF3AvbzLWdRyRnQMEttqw?view_as=subscriber
one exercise that i was told to do by my MA thesis advisor was to write an executive summary of my projects
Thank You
a 1 page document w/ maybe 1 plot. basically an extended abstract
i realised that being able to describe your projects is a very important skill
something you should keep a log of while you are building the project
+1
@hollow silo sounds like you'll have no problem getting hired, if that's your mentality 🙂
i wanna build something with numpy
@lapis sequoia write a neural network from scratch
one layer
hard to build or easy?
@desert oar thank u 🥺 its really hard bc i dont have a degree directly related to CS etc
the grind is real in software
@lapis sequoia you can use numpy for pretty much anything actually...if you're interested in data science and ML then yeah a one layer NN is of moderate difficulty.. you can extend that to an autoencoder as well
oh thanks
the power of numpy lies in matrix slicing and dicing operations
yes i'm learning numpy
yeah np.dot etc is cool but a lot of times people just use for loops over their numpy matrices when the same thing can be represented as a matrix product
slicing index, shape,reshape i love them
@lapis sequoia if you are interested in computer vision, i recommend following the cs231n course
wait until you guys learn about np.einsum
you can do their assignments
@desert oar i have read about that 😄 but never used it
i didnt understand it too well
"regex for array math"
@desert oar thats a neat way to put it
sadly, unless I missed something major, it's "only" for any kind of multiplication operations
you can't, say, make it calculate the sum of each element with each.
you can do outer multiplication: "i,j->ij" but not summing
Hey @quick fox!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
hey, does anyone here use kaggle? i'm trying to make a team notebook for a competition and i'm not sure how to do it
sadly, I think you did miss something major 😉 @tidal bough
Its just a summation equation at its core so it can deal with summing each element with each.
All you do is np.einsum('i,i', a, b)
@flat quest nah, that's just the scalar (dot) product. I meant, from 1d arrays a and b, produce a 2d array c, where c[i,j]=a[i]+b[j]
i mean i don't see how that would be multiplication. It's just summation all the way through.
But yeah if u want to do c[i,j] = a[i] + b[j]. You can do explicit mode I believe np.einsum('i,j->ij', a, b). I'm not sure entirely if that works, but based on the docs, it seems like it would. @tidal bough
@flat quest nope, np.einsum("i,j->ij", a, b) would do c[i,j] = a[i]*b[j]
ah right. Yeah not thinking too straight this morning lol.
the scalar product of vectors (1d arrays) is defined as the sum of a[i]*b[i] over all i 🙂
yeah ur right, it's all multiplication, and then summing over those multiplied terms
My bad :/
Guess it's up to the standard addition to deal with those problems then 😉
Well, semi-standard. You do this via the glory of np.ufunc.outer 🙂
In [254]: arr
Out[254]: array([0, 1, 2, 3, 4, 5, 6, 7, 8])
In [255]: np.add.outer(arr,arr)
Out[255]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
       [ 1,  2,  3,  4,  5,  6,  7,  8,  9],
       [ 2,  3,  4,  5,  6,  7,  8,  9, 10],
       [ 3,  4,  5,  6,  7,  8,  9, 10, 11],
       [ 4,  5,  6,  7,  8,  9, 10, 11, 12],
       [ 5,  6,  7,  8,  9, 10, 11, 12, 13],
       [ 6,  7,  8,  9, 10, 11, 12, 13, 14],
       [ 7,  8,  9, 10, 11, 12, 13, 14, 15],
       [ 8,  9, 10, 11, 12, 13, 14, 15, 16]])
huh, this function is actually pretty slow
yeah, this dumb C-loop is 4 times faster:
@numba.njit
def outer_sums_full(arr):
    res = np.zeros((len(arr), len(arr)), dtype=arr.dtype)
    for i in range(len(arr)):
        for j in range(len(arr)):
            res[i, j] = arr[i] + arr[j]
    return res
And this is a bit faster still, and is a very simple broadcasting-based solution:
@numba.njit
def outer_sums_3(arr):
    arr = arr.reshape(-1, 1)
    return arr + arr.transpose()
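To put the einsum point and the broadcasting trick side by side, here is a small sketch in plain numpy (no numba needed). The arrays `a` and `b` are made-up examples, and nothing here says anything about speed.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

# einsum can express the outer *product*: c[i, j] = a[i] * b[j]
outer_prod = np.einsum("i,j->ij", a, b)

# the outer *sum* c[i, j] = a[i] + b[j] needs broadcasting (or np.add.outer)
outer_sum = a[:, None] + b[None, :]

print(outer_prod)
print(outer_sum)
```

`a[:, None]` has shape (3, 1) and `b[None, :]` has shape (1, 2), so the addition broadcasts to (3, 2), the same result `np.add.outer(a, b)` gives.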
Hello guys! I want to create some universal pandas reader for parquet/csv/hdf5/excel/sqlite/whatever depending on file extension.
How do you think, is it better to create it as function with **kwargs to send arguments or maybe use some kind of decorator?
Never tried decorators in practice, maybe it's time to try
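One possible shape for such a reader, as a sketch only: a plain function with `**kwargs` that dispatches on the file extension. `READERS`, `reader_name` and `read_any` are hypothetical names; the pandas functions they point to (`read_csv`, `read_parquet`, `read_hdf`, `read_excel`) are real. SQLite is left out because `pd.read_sql` needs a connection and a query rather than just a path.

```python
from pathlib import Path

# Map file suffix -> name of the pandas reader function to call.
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".h5": "read_hdf",
    ".hdf5": "read_hdf",
    ".xlsx": "read_excel",
    ".xls": "read_excel",
}

def reader_name(path):
    """Look up the reader for a path by its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r}") from None

def read_any(path, **kwargs):
    """Read a file into a DataFrame; kwargs go straight to the pandas reader."""
    import pandas as pd  # imported lazily so the lookup works without pandas
    return getattr(pd, reader_name(path))(path, **kwargs)
```

A dict plus a plain function is probably enough here; a decorator would mainly help if new readers had to register themselves from other modules.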
Hey fam could I please get some help with creating an array in numpy?
what kind of array
hey man its super simple but its giving me a "not callable error"
totalBands = np.array([[180,15], [5,20], [8,16],])
In [296]: totalBands = np.array([[180,15], [5,20], [8,16],])
In [297]: totalBands
Out[297]:
array([[180,  15],
       [  5,  20],
       [  8,  16]])
sorry that last comma shouldnt be there but still
My guess is that you did something bad like redefining np.array.
what's the full error?
yeah, definitely redefined something
check these:
In [300]: type(np)
Out[300]: module
In [301]: type(np.array)
Out[301]: builtin_function_or_method
could you explain a little more? Im a bit confused
do type(np) and see what it gives you, same for np.array
what should i be looking for in those outputs?
the above is what you should get(if they aren't redefined).
I'm not even sure how you did this.
^^
do
import importlib
import numpy as np
np = importlib.reload(np)  # re-runs numpy's init, so np.array is rebound to the real function
well, restarting ipython works too, yes.
could i just close out of spyder and reopen it?
sorry, im switching over from matlab and im still learning
yup, or probably just close and open the console.
or push a button somewhere to stop it.
i think it happen when i tried to define a new array above
above it*
anyways let me try to restart it
can you send that code?
you must have done something really weird like np.array = <something, a float64 to be precise>
hmmm probably lol, could i get your advice on what im doing? maybe you could point me in the right direction
so i have that array right? basically for each element in that array i sent earlier (i.e. [[180,15], [5,20]]) I want to add additional numbers to each element in ascending order, so the element [5,20] would turn into [5,6,7,8,9,...20]
sounds like you just want arange.
the problem is that arrays have a specific size on each dimension.
like, you can't have the second row be length 5 and the first length 10.
you can only pad it with NaNs, I guess.
so for your task, a list of 1d arrays might make more sense.
though it depends on what you're later using that list for.
yeah i was planning on using a 1xN array. The original idea was to iterate through the array and find where each element has a matching value
xy-problem
Asking about your attempted solution rather than your actual problem.
Often programmers will get distracted with a potential solution they've come up with, and will try asking for help getting it to work. However, it's possible this solution either wouldn't work as they expect, or there's a much better solution instead.
For more information and examples: http://xyproblem.info/
What's the actual problem you're trying to solve?
oh my b, basically the problem is I have 12 ranges of data, each somewhere between 0 and 360, and i was trying to write a function (or just code i guess) to see whether there is a value that all 12 ranges contain
does that make sense? sorry, i can try to explain better
if there is a value that the 12 ranges contain
So, whether there's an intersection of these 12 ranges?
yeah:)
I think it's much simpler and doesn't require any numpy.
Consider what the intersection of two ranges might be. It's either:
- Some range. Say 5:10 intersects with 7:12 on 7:10
- Empty set. Say, 5:10 with 20:50 have no intersection
So you need to just write a function determining the intersection of two ranges. Apply it to the first two elements, then to the result of this and the third element, then to the result of that and the fourth...
(this way of applying is, by the way, what functools.reduce does)
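A minimal sketch of that idea, assuming each range is a closed `(lo, hi)` pair with `lo <= hi`. A range that wraps past 360 back to 0 (which `[180, 15]` might be) is not handled here. `intersect` and `intersect_all` are illustrative names.

```python
from functools import reduce

def intersect(r1, r2):
    """Intersection of two closed ranges (lo, hi), or None if they do not overlap."""
    if r1 is None or r2 is None:
        return None  # an empty intersection stays empty
    lo, hi = max(r1[0], r2[0]), min(r1[1], r2[1])
    return (lo, hi) if lo <= hi else None

def intersect_all(ranges):
    """Fold `intersect` over the ranges: ((r0 & r1) & r2) & ..."""
    return reduce(intersect, ranges)

print(intersect((5, 10), (7, 12)))    # (7, 10)
print(intersect((5, 10), (20, 50)))   # None
```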
ohhh i see! okay thank you so much. Im gona work on this and ill see if can handle it from here
I can probably ask a question here and someone help me with a pandas code
Yes, you can't.
I was advised to come to this channel
I'm just kidding, what is it?
I am trying to create a new column based on a condition on another column of string values and am facing weird behavior
please see this
create new variable
I am trying to create the new variable 'flee1' based on the variable 'flee'..it should give True when 'flee' == 'Not fleeing'
any ideas anyone..even kaggle notebooks giving the same prob
nobody knows it seems 🙂 stackexchange for the real geeks
use .map @tame nest
df["flee1"] = df["flee"].map(lambda s: s == "Not fleeing")
(it was a typo for anyone curious)
are libraries like tensorflow or pytorch required to make neural networks?
required ? no
useful ? definitely
and chances are, if you don't use them, your code will very likely be less efficient in many ways
again, yes
but you'll have to do the differentiation yourself, you won't be able to run your code on the gpu, and there are not as many functions commonly used to build networks
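To make "do the differentiation yourself" concrete, here is a sketch of about the smallest possible network in plain numpy: a single sigmoid unit trained with a gradient worked out by hand. The logical-AND data, the learning rate and the step count are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_layer(X, y, lr=1.0, steps=5000):
    """Single sigmoid unit trained by gradient descent on cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)      # forward pass
        grad_z = (p - y) / n        # dLoss/dz, derived by hand
        w -= lr * (X.T @ grad_z)    # dLoss/dw
        b -= lr * grad_z.sum()      # dLoss/db
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)   # logical AND
w, b = train_one_layer(X, y)
pred = (sigmoid(X @ w + b) > 0.5).astype(int)
print(pred)
```

With a framework, the three gradient lines are what autograd would do for you, and the same code could run on a GPU without changes.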
Hey @atomic oxide!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
Hello guys, could I please get help with this issue: why are the ytick labels out of place (not aligned with their ticks)?
@atomic oxide what do you mean out of place
I'm assuming you intend the logarithmic scale?
do you mean like how the major tick labels appear to be misaligned with the major ticks?
hard to say without seeing all your code.
you do?
that's weird
can you create a basic plot and show me?
e.g.
fig, ax = plt.subplots(figsize=(4, 4))
x = np.linspace(0, 2 * np.pi, 200)
y = np.cos(x)
ax.plot(x, y)
yeah, then it's not all your plots
I mean, I guess this is a long shot but
you're not manipulating the ticks and/or tick labels manually, right
or using some custom Locator/Formatter that might cause this
Hey @atomic oxide!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
just a heads up
if your code is that long I don't think many people will want to look through it.
not long but how can i upload it
Hey @atomic oxide!
It looks like you tried to attach file type(s) that we do not allow (). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
read what the bot said
my GOD
why don't you try removing each line that deals with the y-ticks
until the problem stops
so you can figure out which line is causing it
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda y,pos: ('{{:.{:1d}f}}'.format(int(np.maximum(-np.log10(y),0)))).format(y))) this is my guess though
I'm going to try, Thank you soooo much
I would really suggest you clean up your code a little
actually, maybe clean it up a lot?
Ok
hello friends, given a large dataset of [x, y] and assuming these two have some sort of correlation, what methods could I use to measure how closely these two are related?
I am analyzing video game statistics where x is vision of the map, and y is deaths. each set of [x, y] are from a new, unrelated instance of the game.
my data looks like this, so I am losing hope haha
this doesn't look super correlated 😅
maybe you want to train a regression neural net for this.
"how well can my neural net predict y by x" is, technically, a measure of their correlation 🙂
Yeah, sadly it does not, although it should have a fairly close correlation
ok, ty. My current regression model says: "lol"
I would very much like to prove whether or not it is correlated. As if it is not, it means x can be thrown out completely from my other analysis
red is the y predictions LOL
I mean that’s not awful
that's about as good a relationship as I can predict! 😅
by the way, is your x one-dimensional?
because, uhhh, this really doesn't look like enough data to predict y.
It is one dimensional and not enough data to fully predict y, but I would like to measure the correlation before moving forward with some more in-depth ML
This is using linear regression, but I wanted to know if this was a reasonable approach to this specific problem
actually, the red line represents what I assumed, higher x should mean less y
but I dont just want to confirmation bias this whole project
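For measuring the relationship directly rather than through a fitted model, a sketch of three simple checks in numpy: Pearson's r for linear association, a rank correlation (Spearman) for any monotonic trend, and a permutation test to see how often shuffled data looks as correlated as the real data. The function names are made up, and the rank helper ignores ties, which real stats such as death counts will have; `scipy.stats.spearmanr` handles ties properly.

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation coefficient, between -1 and 1."""
    return float(np.corrcoef(x, y)[0, 1])

def _ranks(v):
    # position of each value in sorted order (ties broken arbitrarily)
    return np.argsort(np.argsort(v))

def spearman_rho(x, y):
    """Rank correlation: Pearson r computed on the ranks."""
    return pearson_r(_ranks(x), _ranks(y))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Share of shuffles whose |r| is at least as large as the observed |r|."""
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(x, y))
    hits = sum(
        abs(pearson_r(x, rng.permutation(y))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

A small p-value with a small |r| would mean the link is real but weak, which is different from "x can be thrown out".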
Is anyone familiar with a tool whereby you can enter a string, highlight a portion of that string, and see the index of the first and last character of what you've selected?
If I try to Google something like this I only get results about HTML.
I need it to write unit tests for an nlp project.
@serene scaffold you mean like a frontend thing?
I'm not sure what you mean
when you say a "tool" do you mean like a (very small) webapp?
because when you mention highlighting I'm assuming there's a GUI?
I'd prefer if it had a GUI because I'm not sure how to quickly disambiguate which instance of a substring I'm referring to if it were a CLI.
hm I don't know of anything that can do that offhand but it shouldn't be that difficult to build
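Without a GUI, one way to disambiguate repeated substrings is to list every occurrence with its indices and pick the one you mean. A small sketch; `find_spans` is a made-up helper and the sample sentence is invented.

```python
import re

def find_spans(text, substring):
    """Return (start, end) for every non-overlapping occurrence.

    `end` is exclusive, so text[start:end] == substring.
    """
    return [m.span() for m in re.finditer(re.escape(substring), text)]

text = "the cat sat on the mat"
spans = find_spans(text, "the")
print(spans)   # [(0, 3), (15, 18)]
```

For a unit test, the second occurrence would then be `spans[1]`, and the last character index is `end - 1`.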
if i did help = random.randint(1,250) would it be used as a integer or a string when i do while help != 1: print('no')?
!e
import random
thing = random.randint(1, 250)
print(thing, type(thing))
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
130 <class 'int'>
ok thanks
also is there a way to make an if statement in a loop?
```
whatever = random.randint(1,250)
test = 1
while whatever != test:
    print('no')
    if rsr == UR:
        break
print('UR')
```
can you open a help session and ping me?
can someone test my project? (especially if you're on a mac)
https://github.com/shyam1998/movie-recommendation-system-GUI
Traceback (most recent call last):
File "D:\Coding\python\AI\dropout.py", line 18, in <module>
train_ds = TensorDataset(inputs, targets)
File "D:\Coding\python\AI\lib\site-packages\torch\utils\data\dataset.py", line 158, in __init__
assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
AssertionError
Does anyone know what this error means
I was able to fix it
Hey! How did I write the code blocks? I keep forgetting
def isPower(num, base):
    if base in {0, 1}:
        return num == base
    power = int(math.log(num, base) + 0.5)
    return base ** power == num
I'm using that function that I found on stackoverflow to check if a number is a power of 2, but I'm curious because I just don't understand why power is the logarithm of the number in the base, but +0.5. What is the +0.5 achieving?
I feel like it's basic math but I just don't get it, it's been a long week lol
lol where did you find that function @jovial thorn
return math.log(num, base).is_integer()
this will do it
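On the +0.5: `int()` truncates, so adding 0.5 first turns it into "round to the nearest integer". That matters because the float logarithm can land just under the true value (on typical builds `math.log(1000, 10)` comes out as 2.9999999999999996), in which case `.is_integer()` on the raw log returns False, while rounding and then checking `base ** power == num` still gives the right answer. If the inputs are positive integers, a version that avoids floats entirely is another option; this is a sketch, not the Stack Overflow author's code.

```python
def is_power(num, base):
    """True if num == base ** k for some integer k >= 0 (positive integers only)."""
    if num < 1 or base < 1:
        return False
    if base == 1:
        return num == 1
    # strip factors of `base` until none are left
    while num % base == 0:
        num //= base
    return num == 1

print(is_power(1000, 10))   # True
print(is_power(12, 2))      # False
```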
@jovial thorn @kind granite if this isn't related to data science you might want to solve this in a help channel; see #❓|how-to-get-help
not necessarily
emphasis on necessarily ? I think it's easy to make a case for needing proper logarithms in data science lol
I wrote it here cause I deemed it related to data science
I was only pointing out the possibility that an individual help channel might be better. feel free to carry on.
thanks!
For future reference, It's easy to make a case for random abstract things, doesn't mean it's correct for this specific instance where it's literally talking about an ispower function. Just because data science is built on math doesn't mean anything goes and we start talking about addition or multiplication. You can try #algos-and-data-structs for questions related to algorithms, perhaps that is a better fit.
i have this code right here:
```
import random
t = 1
while (r := random.randint(1,40)) != t:
    print(r)
else:
    print("yes")
```
i want python to print ('amount of times "r" was said')
(it may not look like it but it is data science)
like how do you bold/spoilers, but with 3 `?
yep
k, thanks
but can you help me?
how?
in the while
you would have to increment a counter variable everytime you do print(r)
i didnt fo import random from random import *
your randint call is fine
so i have to do random.randint
i think from random import randint works
import random works just fine
Wrong? Nothing I thought. They just want to do something extra.
yes, that is what that code does. The goal is to also make it write the amount of numbers it wrote
Aye, that's because both r and yes is printed,
random order, though
r is a randint
14
32
16
...
28
6
yes
So yeah, make a counter variable before the loop set to 0. Each time the while loop is satisfied, add to this counter.
that's my output
Print counter variable at the end of everything else outside the loop.
That's your output because that's what the output should be. So the real question is this, what did you expect instead?
There's a mismatch between what the code does and what you think it does in this case, if you think something is unexpected. We can try to address that.
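The counter-variable version spelled out, as a sketch. Wrapping it in a function and passing in `rng` are additions for illustration (they make it easy to try with fixed numbers); the loop itself is the one from the question.

```python
import random

def count_until(target=1, low=1, high=40, rng=random):
    """Draw numbers until `target` comes up; return how many others were printed."""
    count = 0
    while (r := rng.randint(low, high)) != target:
        print(r)
        count += 1          # one more number was printed
    print("yes")
    print(f"r was printed {count} times")
    return count

count_until()
```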
could also do something like
```
import random
i = 0  # stays 0 if the very first draw is already 1
for i, r in enumerate(iter(lambda: random.randint(1, 40), 1), start=1):
    print(r)
else:
    print(f'a number was said {i} times')
```
but a counter variable is probably saner.
I think he's offline...
hello, i have a question
about pytorch
my image batch comes with dimension of 3 instead of 4, its missing the color channel
how is this possible
Each image has a dim of 3? (in which case that's normal) or does the whole batch only have 3 dimensions
In general, When images don't have colour channel it means they are essentially greyscaled images.
Such that the same pixel value is used for all 3 channels at once
Hey. What does Flatten layer do (in e.g. Tensorflow) and what is it for?
Reduce the dimensions of something.
Oh
Say turning a 2d matrix into a 1d array
What's the point? Couldn't u just do this before feeding data to model?
Sure, you could.
Hmmm
batch is supposed have 4 dims no?
[batch_size, in_channel, w, h]
So both ways are possible?
It's probably logically easier to understand data going in normally, say images make sense as 2d for example
Oh, yeah makes sense. Thanks
If there's a shape error eysidi you can freely reshape as needed I'd assume. This is a guess but it shouldn't cause problems
Make sure you keep the correct axes when you reshape though
So perhaps a shape of [batch size, 1, w, h] if your notation is correct.
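Both shape questions above, shown with numpy arrays as a sketch. The 8 x 28 x 28 batch is a made-up example; in PyTorch the channel insert is usually written `x.unsqueeze(1)`.

```python
import numpy as np

batch = np.zeros((8, 28, 28))           # 8 greyscale images, no channel axis

# add a channel axis of size 1 -> [batch, channel, h, w]
with_channel = batch[:, None, :, :]

# what a Flatten layer does: keep the batch axis, merge everything else
flat = batch.reshape(len(batch), -1)

print(with_channel.shape)   # (8, 1, 28, 28)
print(flat.shape)           # (8, 784)
```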
Should I buy the book about tensorflow and keras by O’reilly
Hey
anyone uses Visual Studio notebook here?
I thought of using that but I can't find the equivalent of shift+tab (jupyter notebook) on vscode
it's this question https://stackoverflow.com/questions/63408190/what-is-the-equivalent-of-shift-tab-of-jupyter-notebook-on-visual-studio
anyone here familiar with this?
@stuck oar just hover your mouse over the function
@modern canyon right haha, thanks!
👍
that error shows up when i'm trying to draw a picture inside another image
can u help me
?
So pretty wide and vague query - anyone know a model which uses transformers to be good to be used for seq2seq or NMT purposes?
Would something like FairSeq would be considered good, or maybe some flavors of BERT like models like RoBerta or BART or even GPT-2, Would these be good models for direct sequence to sequence conversion?
I think FairSeq is pretty good in itself, since it is dedicated to seq2seq problem types. Would it then be a good idea to use BART, RoBerta and all the other NLP models out there?
@teal notch tuple object has no load
@molten hamlet yeah i know but how can i make this bot add images to other image
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.
You can find a much more detailed explanation on our website.
what are you using to write your code in
just curious I can't actually help you lol
What r some good ai tutorials/courses
This is a beginner-friendly coding-first online course on PyTorch - one of the most widely used and fastest growing frameworks for machine learning. This video covers the basic concepts in PyTorch viz. tensors & gradients, and walks through the process of implementing linear r...
It's a VOD now
I'm currently on computer vision and logistic regression i'm not progressing as fast as I wanted it to
How can I build a digit recognizer in python?
@faint ravine you can build that using opencv,sklearn,numpy
How do I get the best neural network?
what is the best CNN for handwriting recognition?
SOTA
If you want to learn how to make your own deep learning library feel free to check out my youtube series! 🙂 https://www.youtube.com/watch?v=nNFsHQaD7gQ&t=1182s
Hello!
Today we start a new adventure where we will be expanding on the JoelNet library with the ultimate goal of deploying our own MNIST web classifier (and maybe attacking it using some simple adversarial attacks). The idea is to model the library around the scikit-learn api...
I'm a researcher in machine learning and I make videos on the subject, I'm hoping to create discussions in the comment section to share knowledge, feel free to share them around if you think they are interesting and we can all learn together 🙂
What kind of research do you do?
I have a script that I would like to run daily to get data from websites. If I want to store this locally, would it be best to create a csv file and append the entries there?
another thought - if I wanted to store this online/in the cloud, what would people recommend?
one potential issue I can see with the CSV is I'll need a tab for each thing I'm tracking? Which could become quite high?
ah ok, I'll look into SQLite, thanks
you're welcome.
would it be bad practice to have many tables in a SQL database? as I'll be tracking price over days
and the table name would be the product name, I guess
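A common alternative to one table per product is a single table with the product as a column. Here is a sketch using the standard-library `sqlite3` module; the table name, column names and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to keep the data
conn.execute("""
    CREATE TABLE IF NOT EXISTS prices (
        product TEXT NOT NULL,
        day     TEXT NOT NULL,   -- ISO date, e.g. 2020-08-01
        price   REAL NOT NULL,
        PRIMARY KEY (product, day)
    )
""")

rows = [
    ("widget", "2020-08-01", 9.99),
    ("widget", "2020-08-02", 10.49),
    ("gadget", "2020-08-01", 24.00),
]
# re-running the daily script for the same day overwrites instead of duplicating
conn.executemany("INSERT OR REPLACE INTO prices VALUES (?, ?, ?)", rows)
conn.commit()

widget_history = conn.execute(
    "SELECT day, price FROM prices WHERE product = ? ORDER BY day",
    ("widget",),
).fetchall()
print(widget_history)
```

Adding a product is then just another row, and queries across products need no dynamic table names.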
@faint ravine in machine learning security
That’s what some of my videos are on
Robustness etc
Like, adversarial attacks and such?
Yeah
Neat
So you must be aware of GANs?
Generative adversarial networks?
yeh
They don’t have much to do with adversarial samples
Really?
Yeah a lot of people get that mixed up haha
How would you write a docstring for something like kwargs in a function? The example below demonstrates two arguments from kwargs. The actual function will accept many more keyword arguments. I would like to define in the docstring what the possible keyword arguments are.
def prandtl(**kwargs):
    cp = kwargs.get('cp', None)
    alpha = kwargs.get('alpha', None)
    pr = cp / alpha
    return pr
Can you give a short summary of how you go about doing machine learning security? and some real world applications? I'd like to know more
why not just make those regular kw arguments
```python
def prandtl(*, cp, alpha):
    pr = cp / alpha
    return pr
``` @lapis sequoia
Check out my video on adversarial samples and my other one on Lipschitz continuity
I think that does a good intro
Better than what I could explain here haha
But in short
We look at how we can attack networks
And where they fail (distribution shifts)
That works fine for a few arguments. But what if I have many arguments like 5 or 10 or more?
And we try to make them more robust against this
then you list them in the signature
if you are taking 10 arguments, you write 10 arguments
So what's the point of **kwargs when you can just define all the arguments?
when you do not know all the arguments, for example when extending a class, or things like the dict constructor, types.SimpleNamespace
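On the docstring side of the question, a sketch of both cases: explicit keyword-only arguments documented one by one, and a wrapper that really does need `**kwargs` and says where they go. The numpydoc-style layout is one convention among several, and `prandtl_rounded` is a hypothetical wrapper added only to show the forwarding case.

```python
def prandtl(*, cp, alpha):
    """Return cp / alpha, the same calculation as in the original snippet.

    Parameters
    ----------
    cp : float
        Numerator (keyword-only).
    alpha : float
        Denominator (keyword-only).
    """
    return cp / alpha


def prandtl_rounded(digits=3, **kwargs):
    """Rounded version of `prandtl`.

    Parameters
    ----------
    digits : int
        Number of decimals to keep.
    **kwargs
        Forwarded unchanged to `prandtl` (currently `cp` and `alpha`).
    """
    return round(prandtl(**kwargs), digits)
```

With explicit names, a misspelled argument fails straight away with a TypeError instead of silently becoming None.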
Fail at what?
At the task at hand
A lot of ML is based under the iid assumption
So stuff doesn’t work for ood (out of distribution)
We try to fix that in ML sec
Or we come up with new attacks
Nice
How can I make this work for only (u, d, rho, mu) or (u, d, nu)?
If I invoke reynolds(0.25, 0.102, rho=910, nu=1.4e-6) then the function will still run.
def reynolds(u, d, rho=None, mu=None, nu=None):
    if u and d and rho and mu:
        re = (rho * u * d) / mu
    elif u and d and nu:
        re = (u * d) / nu
    else:
        raise ValueError('Must provide u, d, rho, mu or u, d, nu')
    return re
Why are some well-known packages in python messy like sklearn?
For example, with StandardScaler in sklearn.preprocessing, I have to call the fit method first and then transform?! Why? It is really messy and feels like a side effect
There should be one method like transform that does everything and returns the standardized data. Really simple, but instead I have to remember to call fit first to compute the mean and std of the data and only then transform it
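One likely reason for the split: the statistics learned in fit are meant to be reused on data the scaler has never seen (for example a test set). sklearn's scalers also offer `fit_transform` for the one-call case. A rough sketch of the idea, using a made-up `TinyScaler` class (not sklearn's implementation) and made-up numbers: ```python
from statistics import fmean, pstdev


class TinyScaler:
    """Hypothetical scaler with separate fit and transform steps."""

    def fit(self, values):
        # Learn the statistics once, from the training data only.
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)  # zero for constant data, not handled here
        return self

    def transform(self, values):
        # Reuse the stored statistics on any data, seen or unseen.
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        # Convenience shortcut: both steps in one call.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]

scaler = TinyScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)  # scaled with the *training* mean/std
print(train_scaled, test_scaled)
```
With a single do-everything method, the test set would be scaled with its own mean and std, which is usually not what you want.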
Well this seems to work fine.
def reynolds(u, d, rho=None, mu=None, nu=None):
    if rho and mu and not nu:
        re = (rho * u * d) / mu
    elif nu and not rho and not mu:
        re = (u * d) / nu
    else:
        raise ValueError('Must provide (u, d, rho, mu) or (u, d, nu)')
    return re
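A possible variant of the function above that checks `is None` instead of truthiness, so a value of 0 is not mistaken for a missing argument. This is only a sketch; the `mu` value in the usage line is an arbitrary number picked for illustration: ```python
def reynolds(u, d, rho=None, mu=None, nu=None):
    # Exactly one of the two input sets must be supplied.
    has_rho_mu = rho is not None and mu is not None and nu is None
    has_nu = nu is not None and rho is None and mu is None

    if has_rho_mu:
        return (rho * u * d) / mu
    if has_nu:
        return (u * d) / nu
    raise ValueError('Must provide (u, d, rho, mu) or (u, d, nu)')


print(reynolds(0.25, 0.102, rho=910, mu=0.0013))  # mu is an arbitrary example value
print(reynolds(0.25, 0.102, nu=1.4e-6))
```
Calling `reynolds(0.25, 0.102, rho=910, nu=1.4e-6)` now raises the ValueError, since it matches neither input set.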
When there are different ways to do the same thing, are all of them acceptable?
np.tile
np.matlib.repmat
Which one do you prefer?
np.reshape(arr,[2,5]) --> numpy methods
arr.reshape([2,5]) --> object methods
second one, it's cleaner
what are you using to write your code in
@desert parcel jupyter
jupyter is cool only for prototype and learning, I think
One big problem for me about jupyter is about IntelliSense and code completion, debugging, refactoring and git integration, bla bla. It is awful
you're not alone on that one, 100% agree
I haven’t fully used it yet, but apparently spyder was built for data sci
I use vscode + the inbuilt jupyter feature

Downloaded the Cars 196 dataset and I'm reading through the .mat file (never worked with .mat before). I want to know how to get further details out of the file. Currently I've got [('annotations', (1, 16185), 'struct'), ('class_names', (1, 196), 'cell')]. How do I get further details of 'annotations' and 'class_names'? Tried test['annotations'], nothing
Hey guys does anyone have a cheat sheet or resource on how to predict values given that you have dummy columns?
im really confused on how you would identify the 1's and 0's when using .predict
NVM my question already figured it out a while ago
@drowsy kite you would need to decide on a model
if you have a lot of categorical variables i have found Catboost is faster and more accurate than XGBoost
CatBoost - state-of-the-art open-source gradient boosting library with categorical features support, https://catboost.yandex/ #catboost
let me know if you have questions
nice
hello data science people
I have a question which Im gonna crosspost
since this is probably the right place for it
well, it's not connected with data science. But, what exactly do you mean? You could have just turned it into a string and then added a comma every 3 digits
do you guys know how to return values which appear multiple times in a column using pandas
Hi guys, would anyone know why the output from np.polyfit is different from my own manual calculation through python?
@buoyant cypress probably can be done via just string formatting
@buoyant cypress okay just use {n:,} where n is a number
Hi guys, would anyone know why the output from np.polyfit is different from my own manual calculation through python?
^ this is solved, by the way.
beat me to it

There's also a way to make it locale aware
And in {n:,} you can swap the comma for an underscore ({n:_}); for any other separator, format with the comma and then replace it in the resulting string
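A quick sketch of the thousands-separator formatting discussed above. As far as I know the format spec itself only accepts `,` and `_` for grouping, so any other separator here goes through `str.replace`; the number is just an example: ```python
n = 1234567

with_commas = f"{n:,}"                       # comma as the grouping character
with_underscores = f"{n:_}"                  # underscore is the other built-in option
with_spaces = f"{n:,}".replace(",", " ")     # any other separator via str.replace

print(with_commas)
print(with_underscores)
print(with_spaces)
```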
I want to fit some data in a pandas dataframe using a custom lmfit model, however my output is shuffled around in a weird way
red: fit, blue: data
I'm using matplotlib for the plotting. the essence of the code is: ```python
result = model.fit(y, x=x, method="leastsq", params=params)
plt.scatter(x, y)
plt.scatter(x, result.best_fit)
```
any idea what is happening here?
the fitting model is a linear term with a numpy sin wave on top
So pretty wide and vague query - anyone know a model which uses transformers and is good for seq2seq or NMT purposes?
Would something like FairSeq be considered good, or maybe some flavors of BERT-like models such as RoBERTa or BART, or even GPT-2? Would these be good models for direct sequence-to-sequence conversion?
I think FairSeq is pretty good in itself, since it is dedicated to seq2seq problem types. Would it then be a good idea to use BART, RoBERTa and all the other NLP models out there?
@whole plover Any reason why you are using linear regression for such a data pattern?
@grave frost Not intentionally no, isnt this a nonlinear fit?
I've never done fitting with python before so forgive me if im wrong
Can anyone tell me how you interpret a linear regression graph like the one given here in figures a, b, and c?
I can send more details or figure legends if needed
What is better to use if I need only clear copy of numpy array, .copy() or np.copy()?
I'm considering doing a course on data science. Can someone recommend to me some good reading material that would show me the ropes
i mean both functions do the same thing so u can choose whichever one u like
personally i use np.copy() because i like the word np in my code
So I want to make a perfect AI for a fighting game, how far above my head am I getting? I don’t know anything about AI aside from how it works in theory.
No cheating here either @cosmic lynx just fyi
You can ask general things, but not for us to help you cheat
@cosmic lynx Do you want to use ML or just a generic game AI present in most single-player games??
@cinder sage How is asking for help cheating??
I was thinking ML just to see what insanity happens, who knows, it may find stuff like touch of death combos...
@grave frost they are asking to use AI to perform perfect actions in a game. My guess is that is against the ToS of said game.
ML usually breaks the game because it exploits the game's engine, but yeah it is really powerful if you use it correctly
However, RL does require some expertise. The more advanced your model, the more powerful the actions it can take and the closer it gets to an above-human level of play...
?
okay, now I think I get it. The game I’m planning on doing this in is like street fighter but with more stuff thrown in
np. Model can handle it. If you want to ease into Reinforcement Learning (RL) best way is to use a simple DQN and bump up the complexity as you learn....
DQN?
@cinder sage Why is using a bot against the ToS? As long as you don't "hack" or cheat it is considered fine. OpenAI made a model for Dota 2. Since it got the same input as a human, it was allowed to play in the international tournament too...
@cosmic lynx Deep Q-Network. A very simple yet sometimes effective model. Good for simple games (Atari) and for beginners in RL...
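For context, a DQN is built on the Q-learning update, with a neural network standing in for the lookup table. A minimal tabular sketch of that update is below; the function name, the state and action labels, and the `alpha`/`gamma` values are all made up for illustration, and a real DQN adds a lot on top (network, replay buffer, and so on): ```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step to the table q and return the new value."""
    # Best value reachable from the next state; unseen pairs count as 0.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    old = q.get((state, action), 0.0)
    # Move the old estimate a fraction alpha toward the target.
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


actions = ["punch", "block"]  # hypothetical fighting-game actions
q_table = {}
first = q_update(q_table, "s0", "punch", 1.0, "s1", actions)
print(first, q_table)
```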
Wait, a bot being allowed to play in a tournament? That’s interesting
I’ll have to look into it, thanks
Anytime
@grave frost openAI had permission
How does a bot get an advantage? For many things it's usually a limitation, as it can't take "blazing-fast actions" or use long-term strategy (consumer-level models). I don't see how it is cheating because you are basically limiting your own game score by allowing a bot to play...
also I don’t think I can hook up this bot to anything I can’t run on my potato....
Do you have a GPU??
I have a 5 year old laptop that was middle end then....
Well, you can't realistically train a model without a GPU. I suggest you look up Colab, Google's initiative to provide free GPUs with minimal setup. But it's not easy to do RL on Colab, so I suggest you get some GPU resources. A lappy isn't gonna cut it
F
Either way I was planning on buying a cheap desktop soon...
So much for AI tic-tac-toe bot perfecting the game....
Just make sure it has an Nvidia GPU if you do want to do some ML. You can do ML with an AMD GPU but it won't work perfectly and may lead to a lot of crashes and bugs.
@cosmic lynx I think there are some people who have done RL on CPU only but I guess it will take a hell of a long time then. If you are fine with running your laptop 24hr+ then I think you can get started right away