#data-science-and-ml
-you can accumulate batches and still apply momentum
momentum changes the optimizer step
Can I upload a folder or something to colab?
i know, but you can rewrite momentum as an accumulated gradient with large momentum parameters and resetting the momentum every few steps. you can make them match in degenerate cases
yes, you can upload files to it. the easiest way is through google drive, but you can also just upload stuff directly
more generally, what i mean is that these two things are just a specific choice of descent step size and momentum
this just isnt true
How do I do that then?
you would need to reset momentum on every call to step
that's a valid choice of momentum
if step % acc_steps == 0:
what i'm getting at is that you can write a single mathematical expression that does either of these things with a different parameter schedule. of course you get different results. they're flavors of the same thing though, a stochastic gradient descent schedule
right i agree this is true when you reset momentum in your schedule
mhm
this argument can get pretty pathological though
i could argue that optimization schedule + loss is one object
and we study the equivalence classes of trajectories
when we trade off schedule and loss
I think I found it
not that this is wrong
some interesting results from thinking this way like https://arxiv.org/abs/1910.07454
change this:if epoch_current <= cfg.SOLVER.GENERATOR.INIT_EPOCH: # FP _t.tic() real_images_color = real_images_color.to(device,memory_format=torch.channels_last) generated = model_generator(real_images_color) loss_init = init_loss(model_backbone, real_images_color, generated) INIT_FP_time = _t.toc() loss_dict = {"Init_loss": loss_init} # BP _t.tic() if (iteration+1) % 4 == 0 or (iteration+1) == len(data_loader): optimizer_generator.zero_grad(set_to_none=True) loss_init.backward() optimizer_generator.step() scheduler_generator.step() scheduler_discriminator.step() INIT_BP_time = _t.toc() meters.update(INIT_FP_time=INIT_FP_time, INIT_BP_time=INIT_BP_time)
to?
Ok
although the second part after or is sort of funny
since they don't properly scale their learning rate for that step
(you could make it % 6 and decrease your batch size but make sure to also scale the learning rate to compensate)
Okay, scale by how much?
well think about it
if you accumulate 6 batches instead of 4
you have 6/4 more gradient
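a minimal sketch of the 6/4 point, on a toy least-squares model rather than the trainer above. it assumes equal-sized micro-batches and a mean-reduced loss; `batch_grad` and `accumulated_grad` are made-up helper names:

```python
import numpy as np

def batch_grad(w, x, y):
    """Gradient of the mean squared error 0.5 * mean((x @ w - y) ** 2) w.r.t. w."""
    residual = x @ w - y
    return x.T @ residual / len(y)

def accumulated_grad(w, x, y, acc_steps):
    """Sum the per-micro-batch gradients, the way repeated backward() calls would."""
    total = np.zeros_like(w)
    for xb, yb in zip(np.array_split(x, acc_steps), np.array_split(y, acc_steps)):
        total += batch_grad(w, xb, yb)
    return total

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=(24, 3))  # 24 samples split evenly into 4 or 6 micro-batches
y = rng.normal(size=24)

full = batch_grad(w, x, y)
summed_4 = accumulated_grad(w, x, y, acc_steps=4)
summed_6 = accumulated_grad(w, x, y, acc_steps=6)

# Dividing the sum by the number of accumulated micro-batches recovers the
# full-batch mean gradient, so the summed gradient grows with acc_steps.
print(np.allclose(summed_4 / 4, full), np.allclose(summed_6 / 6, full))
print(np.allclose(summed_6, 1.5 * summed_4))
```

so either divide the loss by the number of accumulated steps, or shrink the learning rate by the same factor (4/6 when going from 4 to 6).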
Actually, I added that myself
It wasn't in the code before, and I read an article and did that, wasn't sure how it works back then
2022-05-19 15:22:50,438 AnimeGan.trainer INFO: eta: 1:22:41 epoch: 1 | 421/6700 batch_time: 0.0062 (0.0049) data_time: 0.0001 (0.0001) Init_loss: 292.7678 (286.7284) lr(G|D): 0.000050|0.000010 max mem: 1857
wut
it's working!
Hey
Anyone here familiar with xarray?
'''
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()
da.groupby('dt').map(calc_max_std)
'''
It gives me a 'TypeError: 'DataArray' object cannot be interpreted as an integer'
If I do
'''
def calc_max_std(gb):
    max_cent = gb.mean(dim='dt').argmax()
    idx_left = max_cent - 30
    idx_right = max_cent + 30
    return gb[:,idx_left:idx_right].max(dim='x').mean()
'''
it works, but that's beside the point, as I need to slice the array differently in every group. Does anyone have any idea?
thanks a lot in advance
if anyone knows how to fix this, ping me: 'AssertionError: Torch not compiled with CUDA enabled'. i don't have an nvidia gpu btw
Does tensorflow 2.9 support cuda 11.6?
Can we ignore words with only one token in dataframe/series?
how to do it?
Found the problem 🙂
had to replace 'max_cent = gb.mean(dim='dt').argmax()' with 'max_cent = gb.mean(dim='dt').argmax().data'
can anyone tell me how i can add a new row to this dataset
but the main point is that i have to add a column whose values mirror score, except that in place of 5 i have to put 1 and in place of 2 i have to put 0
can anyone help me out
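one way this could look in pandas, as a sketch: only the `score` column name comes from the question, the data and the `label` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical data: only the "score" column name comes from the question.
df = pd.DataFrame({"score": [5, 2, 5, 2, 2]})

# Map each score to its binary label; scores missing from the dict become NaN.
df["label"] = df["score"].map({5: 1, 2: 0})

print(df["label"].tolist())
```

if the real column holds other scores besides 5 and 2, they need an entry in the dict too, otherwise they end up as NaN.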
Hi everyone, I have a few questions for those with experience in training neural networks in PyTorch. I've recently been trying to train and evaluate a CNN to classify images from the CIFAR10 dataset. The condition is it has to be three hidden layers only, and I've set it up as per the code attached.
I have a training and validation set, and I've been evaluating the model with them. I'm using SGD with a learning rate of 0.001 and a momentum of 0.9, with cross entropy loss as the criterion. I was hoping to get a graph that looks like an exponential graph, as per most experiments I've seen, but instead it's come out like this, where the validation accuracy is all over the place.
Is there anything I could change to get something more consistent and with higher accuracy?
This is what I was hoping it would look more like
Well the scale of your graph is quite misleading, the accuracy goes from 48 to 60 there
So the training accuracy and validation accuracy aren't as far off as it seems in that graph
@lone yacht
As of now it seems that you are slightly over-fitting as your validation accuracy is lower than your training accuracy. But equally important is that your training accuracy doesn't seem to increase further.
Thanks for the response @mild dirge. Do you know what I could try to get a better accuracy? I also note that the first epoch starts at around 55% which I'm curious about, because I would have expected it to be more like 10% starting off with random weights, given that there are 10 classes, do you know why that might be?
It is likely the accuracy after the first epoch
It is quite normal that it can jump that high after the first epoch*
Ahh, that makes sense
And the fact that the accuracy is not going up can be because of many things
complex dataset, too small network etc.
It's also not that common to only use 1 fc layer I think
hello everyone
just wondering if anyone can help me with my data structure analysis assignment
In general 3 hidden layers probably won't give too good of a performance
you could probably pull a sneaky and use a convolution layer with a stride above 1
that way you reduce the spatial dimensions and get to use another convolutional layer
(instead of the pool)
@lone yacht
Thanks @mild dirge , I'll try swapping out the pooling layer with another convolutional layer
With a stride above 1* that's the important bit here
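to see why the stride is the important bit, here is the usual output-size arithmetic as a small sketch (no dilation assumed; `conv_output_size` is a made-up helper, and the kernel and padding values are only examples):

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution (or pooling) layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

# CIFAR10 images are 32x32.
same = conv_output_size(32, kernel=3, stride=1, padding=1)     # resolution kept
strided = conv_output_size(32, kernel=3, stride=2, padding=1)  # halved by the stride
pooled = conv_output_size(32, kernel=2, stride=2)              # the 2x2 pool it replaces
print(same, strided, pooled)
```

a stride-2 convolution shrinks the feature map just like the 2x2 pool does, but it has learnable weights.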
in pyspark.ml.classification.logisticregressionmodel i can call 'evaluate' to get access to the aucroc and precision recall summary. ... what's the equivalent for randomforestclassifier?
@spring valley
@compact barn can u help me out
Hello guys, I have a question about pyspark. I printed the schema of a dataframe and I get this (check image). However, I want to split the array you see there into columns, how do I do that?
is this what you're looking for?
df.select('session_id', 'session_events.*').show()
I think it is something like that. I just saw that I didn't explain myself well. My idea is to split what is inside of "element" into columns. In the image, those are the values that are inside of the row. And I want to slice it to get columns like this |DatetimeOpenApp| OpenApp| DatetimeViewHome| ViewHome|
@willow jasper please do not ping random people (including staff members) asking for help. This is a warning.
hmm... i'm in the middle of something at work, but off the top of my head i'm thinking something along the route of exploding the session_events and pivoting the results back into named columns. ...
from pyspark.sql import functions as F
df.select('session_id', F.explode(F.col('session_events'))).select('id', F.col('datetime'), F.col('event'))
there may be some syntax error in that, didn't validate, but hope that sets you on the right path.
I will try and thanks!
hey guys, is there an "inverse prediction" function in sklearn??
like what X i need to put for returning an Y
There can be multiple inputs that give the same prediction @flat ridge
Million thanks to dre, u saved my life. But now I need your help again guys xd. I'm finishing up my data and I want to group it in pyspark so that each session ID appears only once. Example: the first session id repeats in 4 rows. I want that value to appear only once, with the other values kept in the same session. If you look at the columns, you will see that I used dummies for that, and I want it to work like a checklist where we can see which steps the client passed through (if they went to OpenApp it is 1, if they went to ViewHome it is a 2)
anyway, I have an open question here and I need your suggestions
let's say I have several evaluation observations (say one hour, thirty minutes, twenty minutes and lastly one minute), and each observation gives its performance (accuracy). My hypothesis is that even as the observation length decreases, the deviation in performance should stay small enough to conclude that my model is stable across observations. Is it okay to use a standard deviation calculation for this case? The model is the same for each observation
i asked the question in the wrong room lmao
does anyone here have any experience with multi agent reinforcement learning?
don't ask to ask
use metal on mac
its how i run errting on the gpu
else, colab is best for smallish projects
guys anyone have a cool machine learning project idea something
maybe an interesting way to collect some dataset or a task thats not so common idk
would this be the appropriate channel to ask about using pandas to visualize my data?
can anyone help me augment my images for a cnn?
if anyone here has xp
desperate!
if you're using keras or pytorch, they should have built-ins for this
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 0)
I'm using sklearn and I'm not sure why I'm not getting the same results each run even though I've set the random state to 0
you can see the random forest and log reg scores change.
the data are articles which have been vectorized and lemmatized.
Any recommendations in finding and understanding outliers in your Pandas dataset?
basic descriptive statistics are a good place to start
split up the data into quartiles and see if you can learn something
Yup, was thinking of box plots of each features to see where things are. .describe() gives good details as well
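a small sketch of the quartile idea, using the same 1.5 * IQR rule that box plot whiskers usually follow. the numbers and the `feature` name are made up:

```python
import pandas as pd

# Hypothetical column; any numeric Series works the same way.
values = pd.Series([48, 52, 50, 49, 51, 53, 47, 120], name="feature")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are the ones a box plot draws individually.
outliers = values[(values < lower) | (values > upper)]
print(values.describe())
print(outliers.tolist())
```

this only flags candidates; whether a flagged value is an error or a real extreme still needs a look at the data.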
Oh they do
I’m just choking on syntax
Do u know it by heart?
sadly not by heart, but i can tell you what the parameters mean if you show me an example of the syntax
so when you use flow
and if you say, rotate at angles 10 degrees
it will add 36 images to ur training set
?
@wooden sail
well of course you save it as a new variable set
example
image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2)
train_dataset = image_generator.flow_from_directory(batch_size=32,
directory='full_dataset',
shuffle=True,
target_size=(280, 280),
subset="training",
class_mode='categorical')
validation_dataset = image_generator.flow_from_directory(batch_size=32,
directory='full_dataset',
shuffle=True,
target_size=(280, 280),
subset="validation",
class_mode='categorical')
bad example
train_datagen = ImageDataGenerator(rescale=1./255,
rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True,
validation_split=0.2) # val 20%
val_datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)
train_data = train_datagen.flow_from_directory(train_path,
target_size=(224, 224),
color_mode='rgb',
batch_size=BS,
class_mode='categorical',
shuffle=True,
subset = 'training')
val_data = val_datagen.flow_from_directory(train_path,
target_size=(224, 224),
color_mode='rgb',
batch_size=BS,
class_mode='categorical',
shuffle=False,
subset = 'validation')```
@wooden sail
looking at this code here
how do you tell how many images train_datagen will create
its potentially thousands right?
per sample
my thought is that this amount of transformation will make many images from one image to go thru epochs
no, it won't generate new ones per se. it'll take the same ones you fed and make a modified copy in memory on the fly, then delete it. it will use as many images as you told it to when you specified the number of epochs and total number of images
the transformations are applied randomly
epoch 1 will use your first version
and epoch 2 will use a rotated one?
epoch 3 will use a scaled one?
randomly
it says its a range... so uh where does it specify how many inside that range are made
they will very likely never see the original
the range is a percentage. you tell it any transformation with a parameter ranging from 0% to 20% shift is valid, for example
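a rough numpy sketch of what "random, on the fly" means. this is not the Keras implementation: `augmented_epoch` is a made-up helper, and it wraps shifted pixels around instead of filling them in like Keras does.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((8, 10, 10))  # 8 tiny fake grayscale images

def augmented_epoch(images, rng, max_shift=0.2):
    """Yield one randomly transformed copy per image; nothing is stored."""
    width = images.shape[2]
    limit = int(max_shift * width)  # 0.2 means "up to 20% of the width"
    for image in images:
        shift = rng.integers(-limit, limit + 1)  # drawn fresh for every image
        out = np.roll(image, shift, axis=1)
        if rng.random() < 0.5:                   # random horizontal flip
            out = out[:, ::-1]
        yield out

epoch_1 = list(augmented_epoch(images, rng))
epoch_2 = list(augmented_epoch(images, rng))
print(len(epoch_1), len(epoch_2))  # same number of images every epoch
```

the dataset never grows: every epoch sees 8 images, each with freshly drawn parameters from inside the allowed range.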
(rescale=1./255, < also what is this?
multiply by 1/255
as far as i recall, yeah
riiight ok
im trying to get the code in here and work but struggling
Actually
I tried adding to a sequential model earlier but i think that was just a singular transform so it was busted
shud i expect a nice validation loss reduction?
with augmentation, yes
the validation loss should be a lot closer to the training one, at least
it's currently almost 98% accurate before augmentation
that's meaningless, all overfitting
oh nvm
its 73
reckon it will break 80?
for X-rays, is there any rule about which specific set of transformations and ranges is optimal? or is it trial and error
it would depend on the type of registration error you expect to find
i'd expect this type of shift and shear and scaling up and down along one or both axes are common. flipping LR and UD, not really
rotate
?
small rotations, yeah. very small though.
do you know how to account for artifacts?
that are biasing my accuracy
e.g. literal text that labels my class
artifacts in what, the images? you'd expect them to average out
or what are you calling artifacts
if you have pictures of cars and pictures of ducks and the cars all have on the image 'car' written on it
somewhere
small
its gona pick up on that
and thats kinda not what you want a model to do when you roll it out
couldn't really say. you'd kinda wanna remove bad data from the data set
Supposedly there’s a workaround for this we’re gona learn but I cannot guess what it’s going to be
i can't think of one off the top of my head. admittedly this isn't the type of ML i look at
What do u look at
so-called "deep unfoldings" are what has my attention atm
Industry or academia?
academia
Which country?
germany
doc
I was considering if a PhD is worth it next but I’m not so sure
I’d rather work for a company than a research uni
if all you wanna do is research, a company is more than ok tbh. you can consider a phd if you think it gives you better job options, you like academia and/or teaching, or something like that
I’m not so much into teaching this because I don’t have a math background
Are opportunities good in Germany for English speakers with masters?
in industry, i'm not sure yet. in academia yes, doctoral programs are often in english and lots of foreigners apply here. masters too, for that matter
PhD definitely improved job options because I notice half of the companies say PhD preferred rn which is annoying
But 3 years of hell….
if my skewness of a certain column came out to a normal distribution, why does my box plot show outliers?
But then I plot a histogram and you can kinda see that bins go above
any idea on what to do with outliers with salaries?
@wooden sail man im getting so many errors apparently my colab directory doesnt exist
if they are too useful to drop theres a scaler that will account for that
what would you do? I personally dont think it makes sense to drop them but ofc scaling would make the most sense
do you have it in the correct directory? 😛 you can see your file structure, check around
there's a tmp folder where some stuff lands when downloaded, but other things land in a different folder whose name i can't recall
is histplot a good idea to get an idea of outliers?
i'll pass 😛
Anyone?
gotcha, we can't have a salary of 0 so maybe it makes sense to drop just that one value
what's yalls thoughts on this?
that's a lot of outliers
keep them
its expected irl theres always a bunch earning more
yeah I'd keep the outliers above the right whisker, but the last one on the left needs to go, because salary can't be $0
is it just one data point
right but there's 5 columns with $0 salaries... imagine a house with negative square foot.. you'd be like that doesn't make sense
apologies, 5 rows
are they all the same company
doesn't say company name, just company ID and they're all different
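a sketch of both suggestions together: drop only the impossible $0 rows, keep the high earners, and scale robustly. the numbers are made up, and the median/IQR scaling is an assumption about what "a scaler that will account for that" means (scikit-learn's RobustScaler does this by default).

```python
import pandas as pd

# Hypothetical salaries: two impossible $0 rows and one legitimately high earner.
df = pd.DataFrame({"company_id": [1, 2, 3, 4, 5, 6],
                   "salary": [0, 52000, 61000, 0, 58000, 250000]})

# Drop only the rows that cannot be real; keep the high earners.
clean = df[df["salary"] > 0].copy()

# Robust scaling: centre on the median and divide by the IQR, so the extreme
# value does not dominate the way it would with mean/std scaling.
median = clean["salary"].median()
iqr = clean["salary"].quantile(0.75) - clean["salary"].quantile(0.25)
clean["salary_scaled"] = (clean["salary"] - median) / iqr
print(clean)
```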
god it takes me like 10 minutes to train a tiny neural network
how fast do u think a cloud service cud make this
80ms per step on m1 pro
anyone know why flow_from_directory says error20 directory doesnt exist
its a npy file
how does this function need data
ah damn u need specific folder structure
@wooden sail any idea how to create class files is it just numpy saving
I have a question, so I have created a model and it gave me accuracy of 85% with train test split, when I used it on the full thing, it gives me 98%.
is this right?
What I did at the end is fit + predict on the full dataset
on the full thing?
yea
well that would be the training accuracy
testing on the data you use for training will give very high accuracy indeed
So that accuracy does not tell you anything about how good the model would perform on new data
yeah makes sense
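a toy illustration of why accuracy on the training data says so little. the labels here are pure noise and the model is a hand-rolled 1-nearest-neighbour in numpy (a stand-in chosen because it memorizes its training set, not the model from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # labels are pure noise: nothing real to learn

x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def predict_1nn(x_fit, y_fit, x_query):
    """Label each query point with the label of its nearest fitted point."""
    dists = np.linalg.norm(x_query[:, None, :] - x_fit[None, :, :], axis=2)
    return y_fit[dists.argmin(axis=1)]

train_acc = (predict_1nn(x_train, y_train, x_train) == y_train).mean()
test_acc = (predict_1nn(x_train, y_train, x_test) == y_test).mean()
print(train_acc, test_acc)
```

the training accuracy is perfect because every point is its own nearest neighbour, while the held-out accuracy is the only number that reflects new data.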
- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).
it is just that these are the requirement of the project, but I'm confused about the second one.
What about that is saying to train/test on the full data?
Or is this another problem?
nothing, but I'm confused about how I would output anything if they only gave me one dataset of around 2500 articles to classify
unless I manually split them it doesn't make sense
the second point is not about classifying, it's about clustering
yeah there is no clustering, it is just 3 columns, article, article name and category (three types)
or am i missing something
Alright, let's take it one problem at a time, what do you mean with this?
This is still about the classification right?
for example, make the training the first 2000 and test last 500
yes it is a classification project
Yes, that is an often used method
Sometimes the prof provides an already split data, but if it is not split yet, you can split it yourself
yeah but there is train test split thing and it does it for me it feels weird.
You do need to put a bit of thought into how you split it
Like you want to test all the classes, so you could f.e. try to get an equal distribution of each class in the testing set
well, it is articles and they don't seem to be ordered, so i was thinking shuffling, then taking the bottom 500 and moving on
Well if you plan on looking at the accuracy, you want the classes to be equally distributed for the test set
Otherwise you'd normally use different measures like f1 score, macro accuracy, confusion matrix etc.
I was mainly looking at f score, but the project info says they mostly care about accuracy for this one.
Did you look at the class distribution?
yes
and?
wel
well, in score I just used weighted
but I honestly don't know much and just getting the project flipped lol
weighted?
for the scores
what function is that?
what I found weird is that engineering (least count) has the highest f score, my theory is that it's due to not much overlap in words
def try_model(model_name):
    mdl=''
    if model_name == 'Logistic Regression':
        mdl = LogisticRegression(max_iter=500) #increasing iterations since default 100 wasn't enough
    elif model_name == 'Random Forest':
        mdl = RandomForestClassifier()
    elif model_name == 'Multinomial Naive Bayes':
        mdl = MultinomialNB()
    elif model_name == 'Support Vector Classifer':
        mdl = SVC()
    elif model_name == 'Decision Tree Classifier':
        mdl = DecisionTreeClassifier()
    elif model_name == 'K Nearest Neighbour':
        mdl = KNeighborsClassifier()
    elif model_name == 'Gaussian Naive Bayes':
        mdl = GaussianNB()
    oneVsRest = OneVsRestClassifier(mdl)
    oneVsRest.fit(x_train, y_train)
    y_pred = oneVsRest.predict(x_test)
    # Performance metrics
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 4)
    # Get the weighted precision, recall, f1 scores
    precision, recall, f1score, support = score(y_test, y_pred, average='weighted')
    print(f'Test Accuracy Score of Basic {model_name}: % {accuracy}')
    print(f'Precision : {precision}')
    print(f'Recall : {recall}')
    print(f'F1-score : {f1score}')
    # Add performance parameters to list
    perform_list.append(dict([
        ('Model', model_name),
        ('Test Accuracy', round(accuracy, 4)),
        ('Precision', round(precision, 4)),
        ('Recall', round(recall, 4)),
        ('F1', round(f1score, 4))
    ]))
it is just a function to try few different models and see the result of each
from sklearn.metrics import precision_recall_fscore_support as score
Okay, so a weighted accuracy would be a solution I suppose
Were there any hyper-parameters to tune for your model?
yeah i used weighted, as you can see
precision, recall, f1score, support = score(y_test, y_pred, average='weighted')
did you do a grid-search?
I kept defaults other than the estimators, but decided against it.
Did you train/test the model and decided to change stuff to get a higher accuracy afterwards?
Hey @loud cove!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
no because it took more time to get less than 1% extra
alright, that's good
So if you didn't change your model and chose all the hyper-parameters beforehand, then just splitting the data into training and testing is fine
you can see the thing here, just change csv to ipynb since bot didn't let me upload
If you wanted to find good parameters for your model, you could have used a validation set to check the model accuracy
You understand why you split it into training and testing right?
I don't really care about the accuracy, I think 80 something is good enough.
yes to prevent overfitting
So why does it seem "weird" to you to split the data like that?
no what I find weird is that they requested clusters, I'm assuming my predictions for each of the three categories, which is weird
I'd only be able to get that if I split them manually
Is there more explanation on clusters?
nope
JSON file for the clusters output (to each category) This is literally everything in the assignment mentioning clusters
?
Feel free to choose your classification algorithm and all the pre-processing needed on the data.
The team shares with you this JSON file (Note: "JSON" text is clickable) for a group of categorized articles as you will divide those articles into 3 groups: training data, validating data, and testing data.
To measure the accuracy of each algorithm, at this level you will measure the accuracy by the percent of matching only.
What we expected:
A GitHub repository includes:
- Your Notebook with the code you used in cleaning and analyzing the data + your model + evaluation.
- JSON file for the clusters output (to each category).
- PDF file for the evaluation report that will be delivered to the leader (Make sure to mention all your steps in visualizing your results).
Avoid plagiarism at all costs. All submissions will undergo plagiarism checks before being graded.
yes
It seems that they wanted you to split into training/validation/test
you did not make a validation dataset?
why bother when u can cross validate training data on itself
nah I just did validation with grid search
It's more time efficient, cross validation requires multiple training cycles
I'm hoping they won't bother with that part lol
but grid search with what set?
I don't really know much and just took a GitHub repo from a somewhat similar project and adapted it to my needs.
training
i miss stats tasks
alright, so you used cross validation, that's fine
image_generator = ImageDataGenerator(rescale=1/255, validation_split=0.2,rotation_range=20,
width_shift_range=0.2,
height_shift_range=0.2,
horizontal_flip=True, )
y_train = keras.utils.to_categorical(y_train, 3)
model.compile(optimizer = opt , loss = 'categorical_crossentropy' , metrics=['categorical_accuracy'])
y_train = keras.utils.np_utils.to_categorical(y_train, 3)
y_test = keras.utils.np_utils.to_categorical(y_test, 3)
datagen = ImageDataGenerator(rescale=1/255, featurewise_center=True, featurewise_std_normalization=True, rotation_range=20,
width_shift_range=0.2,height_shift_range=0.2,
horizontal_flip=True,validation_split=0.2)
# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)
# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
steps_per_epoch=len(x_train) / 32, epochs=100)```
It also uses validation sets, but it splits your training data into multiple folds
its gona work this time
yeah but i didn't split it once into something like 60/10/30 training/validation/test
Concerning the second point of your requirements, I do not know what is meant
I would ask for clarification if I were you
I can only guess at some stuff that could be wanted from you
It also uses validation sets, but it splits your training data into multiple folds
What is it?
the cross validation grid search function
It is just one of those MOOC things, I honestly don't care much and just want to get it over with, they don't have much info but was just checking
yeah that's what i found when i googled
@mild dirge i need help with this augmentation give 5 mins to train first
it was between that and random and i just went with this one.
honestly it is stupid, why ask me for pdf (I know in real life you'd need to summarize outside of your code, but just annoying).
That's just convention
For pretty much every one of my courses I have to write reports and send as pdfs too
Yeah I'm just being lazy, I don't really care much ML career wise but just took it since it is free
fit generator broke my kernel
just gluing together few things here and there and it works
92ms/step 
I mean you can get pretty far gluing stuff together with ML, but when you don't fully understand what you are doing you might hit a brick wall at some point
I have 42
Not that bad right?
per image?
let me check
or batch
yeah I think domain knowledge and a little stats is enough for most cases.
ehhhh
if you didnd't set a batch size it is probably some default value in pytorch
https://github.com/MAmr21/EGYFWD/blob/KO---Articles-Classification/articles classifier V1.ipynb
alright what you think?
or keras*
wait nvm wrong file
32 is meant to be default
anything over 100ms/step would be so horrible for big data
i wonder if a tesla cud reduce to like 20
I don't think cars work very well as image classification models
oh it does that on your computer too? I thought mine just glitched
cruising on 13.5GB ram and 98% GPU usage
just from python3
shudda got 32gb
wow 14gb ram used
yeah same
what causes it
what causes what?
just the calculations?
I'm looking through it btw
Yeah the forward and backwards pass mainly
backprop is the main ram hog?
I don't think 32 would make much difference
that's more than enough
windows/os keeps a backup memory so your apps don't lag
I thought these new macs stayed cool
I don't think any machine will stay cool when you have 98% gpu utilization
thanks, I copied most of it from this guy https://github.com/deepak0437/natural_language_processing/blob/main/News_Article/BBC_News_Classification.ipynb
Did you copy a lot?
hot
Could be an idea to reference it if you think you copied maybe a bit too much
I downloaded his then changed it to fit mine
I would if it was an academia thing, but most of it is just basic stuff
but they might scream plagiarism so I'll just let it go
most of these projects follow same stuff anyways
So one thing that you did is choosing the model that gives the highest test accuracy @loud cove
that's because that's what they wanted
Which means you implicitly use the test data to make choices
the testing data is only for testing your final model
Nothing about choosing your model/training/parameter-search etc. should include anything from your test data
the closest thing to it is the log reg, but I think random forest would make more sense
yeah makes sense
anyone have a good understanding of correlation matrices??
1.) Before plotting and calculating a correlation matrix, do I need to OHE some feature columns first?
2.) Right now my dataset only has 3 numeric columns, the rest are of type object. The 3rd numeric column is the target variable. So if I build my matrix it's going to be small, and how can I determine which feature is most important, has high correlation, etc. with really only 2 features?
3.) If I decide to OHE my object types to become numerical values will this be valuable for my correlation matrix?
but I'd have to split manually right?
It might be that your test data has some weird pattern in it and that random forest might be able to learn that very luckily
Which would make the final accuracy that you find not really a fair indicator on new data
it changes between log and it for the best if i disable the rng
Doesn't really matter which one shows up best, the point is that the test data shouldn't be used for this
validation data should have been used
but random forests seems most reasonable with free text
in a real world scenario I'd statistically separate a bit of the data for testing
yeah It used to be at start with word count, but moved it
to keep all visuals in one place
the eda have the counts anyways
Don't know how deep this project goes, but you could go more in-depth on this
it is the same thing, i just turned off the rng
Checking if there is a lot of overlap between words, more than the other two categories f.e.
that's actually the reason i moved it to bottom
also u shud be getting AUROC its better than accuracy
I could find the correlation, but it is a MOOC project with interest in ML so I didn't really dive deep
The word clouds at the bottom are pretty cool, but you don't say anything about them
thanks, I'll google that
yeah i just kept it to show the "overlap" because everything is related, a distribution chart would have been better, but people eat that shit, it isn't like it matters.
Think all-in-all it looks pretty good, just be careful with using the test data for making decisions about the model. You could have a super interesting model and project with good results, but when you use the test data for training/choosing stuff, it could invalidate any results you get.
how would you split them manually with respect to distribution? I don't think this is needed given that it is free text, but with numbers it might
ValueError: Shapes (None, None, None, None, None) and (None, 3) are incompatible
can someone whos good with TF
help me
b4 i sleep
you'd want the same distribution of categories in the test set as the dataset
18 steps_per_epoch=len(x_train) / 32, epochs=100)```
dont even get how this will work without validation data added
added it and still get value error
I'm guessing there is probably a lib for that, or a function.
I'd probably do something like get 30% of the data, filter for each distribution and get their weight.
Think it's called stratified
im sorry if this is basic, i tried googling but no luck, what do you do with validation dataset exactly? I understand what its use is, but how is it different from cross validation functions?
train_test_split has an argument for it
it is there by default though
yea exactly
oh nvm the shuffle is the one on by default
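for the "how would you split them manually" part, a numpy sketch of a stratified split. `stratified_split` is a made-up helper and the labels are invented; in scikit-learn the same effect comes from passing `stratify=y` to `train_test_split`.

```python
import numpy as np

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Hypothetical imbalanced labels for three categories.
labels = np.array(["a"] * 60 + ["b"] * 30 + ["c"] * 10)
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(np.unique(labels[test_idx], return_counts=True))
```

the test set keeps the 60/30/10 class proportions of the full data, which a plain shuffle only gives you approximately.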
Validation dataset is for checking the accuracy of your model while in the process of making decisions and training etc.
@loud cove where u from G
And cross validation splits training data into train/validation multiple times
Each time having a different slice of the data be validation and the rest train
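A rough sketch of what that slicing looks like, in plain Python. `k_fold_indices` is a hypothetical helper written for this example, not a library function; scikit-learn's `KFold` does the same job with more options.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is in validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)
```

Each fold means one full training run, which is where the extra cost mentioned in the chat comes from.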
Yeah i understand, but if i do cross validation doesn't this mean it isn't needed?
Egypt
correct
they accomplish the same thing
cross validation is probably better as it uses all of your training data for training and for testing at some point
yeah that's what i was confused about, that they wanted a standalone validation
But that also means you need to train/validate multiple times
And when it takes a few hours to train that might be something you are willing to give up on in order to get quicker results
yeah like i did for that 1% accuracy 😛
ValueError: Shapes (None, None, None, None, None, None, None, None, None, None) and (None, 3) are incompatible
save me
I had 5 million for my previous model 😛
local?
yeah
And I don't know how to help sorry
whys my shape liek that
interesting tensor shape though
what for?
that must be wrong
CNN, classifying 400 different bird species
my y train is encoded
cnn the channel?
convolutional neural network
ah another thing.
https://www.mikulskibartosz.name/how-to-set-the-global-random_state-in-scikit-learn/
np.random.seed(31415) this doesn't actually globally set seed status or whatever right?
might been an old thing i think
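As far as I know, `np.random.seed` does set the state of NumPy's legacy global generator, which is what scikit-learn falls back to when `random_state` is left as `None`. It does not touch Python's own `random` module or separately created `Generator` objects. A small sketch, assuming NumPy is installed:

```python
import numpy as np

# Seeding the legacy global generator twice gives the same stream twice.
np.random.seed(31415)
first = np.random.rand(3)
np.random.seed(31415)
second = np.random.rand(3)
same = np.array_equal(first, second)

# A Generator made with default_rng has its own state,
# independent of np.random.seed.
rng = np.random.default_rng(31415)
third = rng.random(3)

print(same)
```

Passing an explicit `random_state` to each estimator or splitter is usually the less surprising option.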
no clue srr
on its own?
no i just reset my variables lol
I'm going to sleep rn, I wish you both best of luck with your projects, gn!
gnnn
me too, thanks for the help.
have a good night guys!
hi all, it's been a while since I touched stats or probability. Can someone tell me what sampling each row in a dataset uniformly at random with probability p means?
is there any possibility that it just means "randomly pick 30% of the rows" (where p is .3)?
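One common reading, sketched below under the assumption that the phrase means independent per-row sampling: each row is kept on its own coin flip with probability p, so the sample size is random and only about p of the rows on average. Picking exactly 30% of the rows is a different (fixed-size) scheme. Both helper names are made up for this example.

```python
import random


def bernoulli_sample(rows, p, seed=None):
    """Keep each row independently with probability p (sample size varies)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < p]


def fixed_fraction_sample(rows, p, seed=None):
    """Pick exactly round(p * len(rows)) rows without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, round(p * len(rows)))


rows = list(range(1000))
per_row = bernoulli_sample(rows, 0.3, seed=1)
exact = fixed_fraction_sample(rows, 0.3, seed=1)
print(len(per_row), len(exact))
```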
hey everyone, when i'm running this:
data_frame["lap"] = data_frame["lap"].astype(int)
I got this error because there is null data, how do i ignore null data?
i tried to add ignore, but it doesn't change dtype float to int
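One option that might fit, assuming a reasonably recent pandas: the nullable integer dtype `"Int64"` (capital I) keeps missing values as `<NA>` instead of failing the way plain `int` does. The sample data is made up.

```python
import pandas as pd

# Made-up column with one missing lap value.
data_frame = pd.DataFrame({"lap": [1.0, 2.0, None, 4.0]})

# "Int64" is pandas' nullable integer dtype; the missing value becomes <NA>.
data_frame["lap"] = data_frame["lap"].astype("Int64")

print(data_frame["lap"].dtype)
print(data_frame)
```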
if i have a list [1,2,3,4] is 1 the head or the tail?
head
beginning
cool haha ty
for example you can use list pop
how do i convert "" to null boolean in pandas?
you can do the whole dataframe with something like df.replace('', np.nan, inplace=True)
or one col like df.loc[df['col'] == '', 'col'] = np.nan
if i convert to np.nan, when i'm using df.to_sql it automatically converts to true
not null
try using None instead of np.NaN then perhaps
nope, that too, it automatically converts to false
You'll probably need to read the documentation for that method then. This thread also seems to have solutions that may help you: https://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values
that's tomori😋
yes!
maybe i will using sql, since i still don't know how
Any ideas of creating correlations and heatmaps of categorical variables?
Anyone tried web scraping with python before
I tried web scraping with the HTML element, but encountered problems
I'm trying to scrape a live updating number(my code executes every 1 min to get the updated number), but my code doesnt work for some reason
What's the error message you got? Are you using Playwright or Selenium or Bs4 or?
"TypeError: nonetype object is not subscriptable"
Good day... Am new here
Please can you help me with web scraping with playwright+scrapy?
I did all set up but am getting error whenever I tried to use it to scrape a Javascript heavy site
There’s a game called 8 ball pool(a billiards game developed by miniclip you all must be knowing about it)
I am developing a program which can predict the trajectories and where the balls will go after the billiard balls collide. Many apps like this exist and they charge a heavy price for it. They simulate the collisions the same way the game does; I am not sure how they got the parameters (mostly trial and error) but it's perfect.
My question is that which physics engine should I use for simulating collision of balls ?
I have fair knowledge in c++ and python.
The link below shows what I am trying to do :
I don't use scrapy tho. What's the error message you're getting? Are you able to see what you're doing? Add headless=False inside your chromium (if that's the browser you're using with Playwright)
So, it's not bringing out any error message right? It's just not working as you want it to?
Can you share your code?
I just started using Playwright.... I normally use Scrapy + Splash for scraping JavaScript sites, but that combo did not work when I tried using it for a heavy JavaScript website
So while trying to look for a solution to my problem, I came across the Playwright integration with Scrapy
I tried it but got nothing
So which tool(s) would you advise me to use when it comes to web scraping?
Especially Dynamic websites
Yeah but am not with my laptop
@odd meteor here you go
import bs4
import requests
from bs4 import BeautifulSoup
#may need to change variable according to previous components
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
url='https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch'
r= requests.get(url, headers=headers)
web_content=bs4.BeautifulSoup(r.text, 'xml')
elements=web_content.find("div", {'class': "D(ib) Mend(20px)"})
print(r.text)
essentially what i'm trying to do is
parse the current stock price(VAXX) from this url
I picked the correct div(correct me if i'm wrong) but for some reason it keeps showing 4.37 despite the current price being something else
so I changed it to print(r.text) and did some digging, apparently the parsed value is hard set to 4.37
i'm thinking of scraping another website instead, looks like yfinance really doesnt like to be scraped
Did anyone know how we open security port open from Airflow web server to GI EMR, camel Ec2 and S3 buckets using python
First, confirm what you're trying to scrape from yfinance isn't disallowed. You can do that by checking the robots.txt file of the website.
https://finance.yahoo.com/robots.txt
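A small sketch of checking rules programmatically with the standard library's `urllib.robotparser`. The rules below are a made-up example, not Yahoo's actual file; for a real site you would call `set_url(...)` and `read()` instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) applies the first rule whose path matches.
allowed_quote = parser.can_fetch("*", "https://example.com/quote/ABC")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_quote, allowed_private)
```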
If what you're trying to scrape isn't prohibited, and you're sure you are selecting the right attributes and tag where the information you wanna scrape is, then perhaps try using another parser. You could try html.parser or lxml
I'm not on pc at the moment, so unfortunately I can't inspect the HTML of the url you just sent at this time.
However, try this
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/VAXX?p=VAXX&.tsrc=fin-srch"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
prices = soup.find_all('You most likely need to tweak this area to include the parent tag before the div tag ')
result = [price.text for price in prices]
result
wait what do you mean by including the parent tag before the div tag?
Try that, yes. Use CSS selector to link them. Then you might wanna use the id attribute instead of the class attribute (that's if the div tag has an id attribute)
Unfortunately, I can’t do much since I'm on my mobile phone. But it should work if you're picking the right tag and attribute where the price data you're trying to scrape is.
thanks a lot
all hail @odd meteor
<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" data-test="qsp-price" data-field="regularMarketPrice" data-trend="none" data-pricehint="4" value="4.09" active="">4.0900</fin-streamer>
it looks something like this
no id tag, nothing
With CSS selector, you can easily select more than one attribute of a tag if you can't easily tell the unique attribute.
Something like this
soup.find('div', class_='write the class here', attrs={'data-test': 'qsp-price', 'value': '4.09'})
ohhhhhhhhh
Also, from the above HTML, it doesn't have a div tag. It has fin-streamer tag. So ensure you're calling the appropriate tag as well
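Putting that together for the snippet pasted above, a hedged sketch assuming `beautifulsoup4` is installed. It parses the pasted HTML string rather than the live page, so it says nothing about whether the server returns a fresh price. Checking for `None` before indexing avoids a "'NoneType' object is not subscriptable" error when the tag is missing.

```python
from bs4 import BeautifulSoup

# The tag pasted in the chat (shortened), used here as a static string.
html = (
    '<fin-streamer class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-symbol="VAXX" '
    'data-test="qsp-price" data-field="regularMarketPrice" value="4.09" '
    'active="">4.0900</fin-streamer>'
)

soup = BeautifulSoup(html, "html.parser")

# Match on the tag name plus a data-* attribute; hyphenated attribute
# names have to go through the attrs dict, not keyword arguments.
tag = soup.find("fin-streamer", attrs={"data-test": "qsp-price"})

# find() returns None when nothing matches, so guard before indexing.
price = float(tag["value"]) if tag is not None else None
print(price)
```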
how to track very slight motions ??
this above image shows how a point was fixed then vector was obtained by comparing frames
i have the frames
but how do i apply this vector approach
has anyone done something similar before
Im currently doing a project for uni that ive chosen to analyze the accuracy of simulated predictions using a range of models for the stock market. Im wondering if arima would even be the right option to go down? Ive so far simulated with GBM, was looking into ARIMA & GARCH as well? Anyone have an opinion on what would be most suitable.
From what I've heard, the stock market is notoriously hard to predict, I'm not sure if there's any actual pattern for any time series methods to learn in general, I'd expect most of them to give similar, mediocre results. ARIMA's probably mediocre as well, you could try something with machine learning like a transformer or LSTM, but again, probably mediocre
import requests
from bs4 import BeautifulSoup
!pip install google
try:
    from googlesearch import search
except ImportError:
    print("No module named 'google' found")
# to search
query = "investing.com aapl"
for j in search(query, tld="co.in", num=1, stop=1, pause=1):
    print(j)
    url=j
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.find('div', {'class': 'instrument-price_instrument-price__3uw25 flex items-end flex-wrap font-bold'}).find_all('span')[0].text
print(prices)
@odd meteor I was able to solve the problem by google searching using python then doing it the traditional way
as I intend the query part to be a concat of investing.com + 'TICKER', with the TICKER being parsed from another excel sheet I programmed for optimal stock picking
now I need to make a for loop to execute this code per minute and to make an array to store this information, any ideas?
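One possible shape for that loop, as a sketch: `poll` and the fake price source are hypothetical names for this example, and in real use `fetch` would wrap the scraping code above with `interval_seconds=60`.

```python
import time


def poll(fetch, n_samples, interval_seconds=60.0):
    """Call fetch() n_samples times, sleeping between calls; return (timestamp, value) pairs."""
    readings = []
    for i in range(n_samples):
        readings.append((time.time(), fetch()))
        if i < n_samples - 1:  # no need to sleep after the last reading
            time.sleep(interval_seconds)
    return readings


# Stand-in for the real scraper so the example runs instantly.
fake_prices = iter([4.09, 4.10, 4.08])
readings = poll(lambda: next(fake_prices), n_samples=3, interval_seconds=0.0)
prices = [price for _, price in readings]
print(prices)
```

Keeping the timestamp next to each value makes it easy to load the list into a DataFrame later.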
depends what you wanna do/where you apply. some jobs will also require you have some cardboard
no wonder i see people pulling off trickshots that easily
Difference between classifying and regression?
classification predicts a discrete label, think whether you have a disease or you don't (0 or 1)... regression estimates a continuous value, think prices of real estate, cars, etc
Any body have an idea on how to make a matrix correlation with mixed categories such as objects and ints?
Ah alr
Yea its called cheto on ios and aim assist on android
Howdy -- trying to bulk change data types in pandas for every column after the 15th to float from object -- is there an easy or preferred way to do this? panda newbie
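A minimal sketch of one way to do it, assuming the columns hold numeric strings. `pd.to_numeric` with `errors="coerce"` turns anything unparseable into NaN instead of raising; the frame and column names are made up.

```python
import pandas as pd

# Made-up frame: 18 columns of strings, one value per column is not numeric.
df = pd.DataFrame({f"c{i}": ["1.5", "2", "x"] for i in range(18)})

# Every column after the 15th, i.e. positions 15 onward (0-based).
cols = df.columns[15:]

# Convert column by column; bad values become NaN rather than raising.
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes.iloc[13:])
```

If the columns are known to be clean, `df[cols].astype(float)` is a stricter alternative that fails loudly on bad values.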
Hey guys... I'm quite new to AI. Like, I know some theory, but I haven't done anything practical. All the projects related to AI seem boring, like I don't want to make AI to predict salaries and stuff. Anyone got any interesting ideas?
delve into Computer Vision
Good Idea... Thanks 👍
i wanna learn nural networks an more ai stuff how can i start?
how comfortable are you with python
😆
is maths necessary to use tensorflow or any other machine learning frameworks in that sense?
they do the math part for you, but you need to understand how neural networks work in order to do anything, so yes.
Have you guys ever seen two different matplotlib plots overlap eachother even though they are on different plots
i mean i kinda know how neural network works, (kinda), so the maths part like linear regression and others are already done in there for u?
in the same way that Python will do 2 + 2 for you, yes. but that doesn't eliminate the need to understand algebra. it just means that the literal calculation is done for you. if you understand linear regression, for example, well enough that you could do it by hand if you had to (or learn how to with minimal time investment), then you'll be fine.
seems ok to me , thanks
Welcome to the gang 💪🏿💪🏿. Now, you can easily convince others that ML isn't so hard as some people perceive it to be.
Thanks to Mr. Sterlerlock 🙌🙌
Hey everyone so can someone tell me from where can I learn opencv for machine learning ? I already know machine learning looking for a small course to get me started with computer vision.
how about this one ? https://www.youtube.com/watch?v=qCR2Weh64h4&t=2s
no bro I don't want a tutorial for just opencv
Pyimage search is a great resource. https://pyimagesearch.com/
Guys, is it possible to download the 'wrangled' excel file after it has been wrangled using python?
I want to use it in Tableau as Tableau doesn't have good wrangling options.
I want to wrangle it in python and then use it in tableau for visualization.
Maybe this would work for you
Read excel into pandas, save to excel, open in Tableau
Do you guys train your networks on a home gpu?
i currently do train them on my own pc but I've used google colab before
what gpu do you have?
i have a 1660super
so cloud gpu is a lot faster but i would have to upload my data there and i don't have the best upload speed (it isn't slow at all but for things like 30gb it will take 2 hours to upload the data)
@brave sand i just recommend for smaller things to try it out on your own device if possible (and it doesn't take way to long for you) and later maybe go to cloud if you think you will need it
yeah, I have a 3070, but the vram is limiting
internship is about reinforcement learning
What kinda network are you running that your vram is limiting on a 3070? @brave sand
oh, I haven’t run any models that vram is a limit but I’m just wondering if my 3070 will ever limit me in the future
Depends on the task (and also the network)
is Multi Reinforcement Learning demanding on vram?
Very.
I'm assuming deep learning is used*
(backprop)
I’ll have to see. I’m running my model on Monday
Yeah, it depends on the task, if it's remotely complicated it gets expensive and memory heavy fast.
Depending on what you are doing you may have one agent per GPU...
Multi-agent methods get amazing results, but their main downside is that you have to run multiple agents.
So either you have really simple agents, or a lot of GPUs.
since the GPU crash, is it worth investing in cheaper gpus?
No idea. Future is pretty uncertain right now regarding chip access.
If you are willing to not rely on pytorch, etc, and can make your own system like it, or use non-deep learning methods, then you can buy the much cheaper AMD GPUs.
But the tradeoff is a lot of work and learning.
AMD Gpus? I thought they were a pain to deal with for ml
They are because the big libraries for deep learning don't support them.
No CUDA.
And also because AMD used to not really care about ML.
Nvidia has more or less a monopoly on deep learning at this time.
(Unless you are willing to take the hard route that most can't be bothered with (someone has to make the pytorch support for AMD GPUs))
Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. Optical flow can also be defined as the distribution of apparent velocities of movement of brightness pattern in an image. The concept of optical flow was introduced by the ...
Demonstrating the optic flow algorithm via "Polynomial Expansion".
A. Original footage
B. Motion represented by color
C. Motion represented by arrows
Original footage owned/uploaded by the slow-mo guys. See their channel here:
Roboticists love this.
Is this still considered a normal distribution?
Reason I ask is because I'm getting ready to OHE a categorical column and I need to define 3 different types, 1 for entry level, 2 for mid level, and 3 for senior. I was hoping to use my distribution graph to get an idea of how many n types we'd need but i'm not sure.
Really doesn't look like one to me. I'm not sure what's up with the oscillation, but the distribution is otherwise pretty much uniform (which is IMO a very unexpected result for "years of experience" in basically any dataset. Maybe it was intentionally normalized that way?..)
Yeah could've been normalized for the sake of this assignment, so in other words safe to say it is? I checked each values of the z score and they all fall between -3 and 3
So if I want to create a One Hot Encoder based off years-of-experience tiers (Entry-level, Intermediate, Mid-level, Senior or executive-level), 4 total. How would I do this? Right now my column is categorical
it sounds like you don't need it to be gaussian
you are talking about binning your data
there are several ways to do this, all of them essentially arbitrary
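For example, a sketch of binning with `pd.cut`, assuming pandas is installed. The bin edges and tier names are arbitrary choices for illustration, which is exactly the point about binning being arbitrary.

```python
import pandas as pd

years = pd.Series([0, 2, 5, 9, 14, 24])

# Arbitrary edges: (-1, 2], (2, 7], (7, 15], (15, 100].
bins = [-1, 2, 7, 15, 100]
labels = ["entry", "intermediate", "mid", "senior"]

tier = pd.cut(years, bins=bins, labels=labels)
print(tier.tolist())
```

`pd.qcut` is the quantile-based variant if equal-sized groups are preferred over fixed edges.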
do you need to use these 4 categories for the assignment?
that density plot is really weird
so no I dont, but here's the issue. there's 100,000 lines and this specific column 'companyId' represents 63 unique values. If I were to OHE this it's gonna add 63 additional columns
this looks like it's more or less uniformly distributed based on that
so my other thought was it's an important column I don't want to drop it
company id doesn't necessarily make sense as a model input anyway - your model won't be able to make predictions for companies outside the training set
you might however want to use attributes of the company as input features
total number of employees, etc.
i don't see any reason to discretize a feature like "years of experience" unless you need to or can think of a specific good reason to
so let me show you the column names first:
jobId companyId jobType degree major industry yearsExperience milesFromMetropolis salary
it's limited here
well the reason why is because the companyId probably represents the company name, and obviously Apple would pay more than other companies.
so that's why I wanted to discretize companyId based off of yearsExperience to split them into some tiers
But I'm having a hard time one hot encoding this based off a condition, not sure what to do?
i dont know what you mean by that
There's 63 unique values
i think you are overthinking whatever this is that you're doing
true but am I making sense first of all haha
no, sorry
you have 63 unique companies in the data, what does that have to do with binning years of experience?
Okay so I have 63 unique values for a categorical feature
i wouldnt consider company id a feature though
so basically you'd drop this
Only reason why I considered it because i know the companyId is an alias for the company name (assumption)
i would use it to incorporate attributes about the company into your model
but if you use company id in the model you can't make predictions about other companies
the goal is to make a prediction on salaries sorry I didn't mention that
the goal is to predict salaries
using the companyID can help identify which companies pay a bit more for salary... I just dont know how to OHE since it has so many features
its totally fine to one-hot encode 63 categories. however i still question if that's what you actually want to do
what if you need to predict salary for an employee at a company that isn't in the training set?
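One way to handle that case if you do one-hot encode the id, sketched with scikit-learn's `OneHotEncoder` (assumed installed). With `handle_unknown="ignore"`, an unseen company encodes as all zeros instead of raising an error. The company ids here are made up.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up training column with three distinct company ids.
train_companies = [["COMP1"], ["COMP2"], ["COMP3"], ["COMP1"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_companies)

# A company seen in training gets a single 1 in its column.
seen = encoder.transform([["COMP2"]]).toarray()

# A company not in the training set becomes an all-zero row.
unseen = encoder.transform([["COMP99"]]).toarray()

print(seen, unseen)
```

An all-zero row means the model has no company information for that employee, so this avoids a crash but does not solve the underlying generalization question.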
Is there any good resource for python data science?
very good question.. no idea. Whats your take?
decide what you want/need first
But this is where I’m struggling, I’m torn apart of either keeping it or not
Do I need to transform features into normal distributions?
I don't think so actually, just the machine learning models that require scaling.
Hey everyone, I have a "tf.contrib" error when testing the object detection API.
so I ran the upgrade scripts on my project directory and there was no issues or error.
I then tested the object detection API installation and I still got the error about "tf.contrib".
I then ran the upgrade scripts on the main directory where the error seems to come from and it was successful. But when I tested the installation, I still got the same error "tf.contrib".
Is Anything else I can do?
it seems the issue is the tensorflow version. you'd need a version of tensorflow starting with 1.x.x
you can read about it here https://www.tensorflow.org/guide/migrate/upgrade
try installing an older TF version in an environment
thank you very much, i didnt knew what to search 4
it looks like most of the pins in this channel are more geared towards ML/AI, if it's other DS subjects that you're after you might find something at https://www.pythondiscord.com/resources/?topics=data-science even if I unfortunately think it was a little bit thin for that as well
@wooden sail please the upgrade scripts kinda confuses me.
To be clear, should I uninstall the TF v2.8.0 and install TFv 1.15 before running the upgrade scripts?
And on what directory should I run the upgrade scripts on?
honestly i would just make a virtual environment with TF v1.x.x instead
Nice resource for image classification model selection from Timm models https://www.kaggle.com/code/jhoward/which-image-models-are-best
Compares speed and accuracy for a lot of models available from Timm library
@wooden sail thanks
I'm curious since you seem to do this for living, what do you do after training a model? how do you use it? I know you do fit predict and all that, but what is the typical workflow for saving it?
I read there is pickling (and another source were talking about a better way than pickling), I'm just curious how it happens in a typical professional environment.
Wouldn't really know. I'm just doing a masters in AI atm, haven't done anything professionally...
It's called machine learning model deployment if you wanted to look it up btw
normally one stores the model and its parameters in a way that it can be used easily, e.g. by containerizing it or simply keeping it around for yourself. the idea is that the network is simply an architecture, and the parameters you trained are specific to the data you used for training. if the training went well, you can now use the parameters only for inference without having to worry too much about the inference being correct
the network itself makes a prediction. the training step tuned the params so that the predictions become accurate. so once it's trained, save the parameters and trust the inference. as an example, once the params are stored, you can trust your network to detect faces in images
but yeah, look into deployment as pccamel suggests. containerization is just one approach
The only time we had to save a model, we indeed just saved the parameters and loaded it into the program that controlled a robot, and then the model was just called every iteration to check if the camera had spotted any objects
Most of the times we just make a model and use some test data to see how well it would work for new data
that sounds about right. since the params are a thing of their own independent of implementation, you can just as easily, say, do the training with tensorflow, then make a standalone implementation of a forward pass of the network in c++ that reads the resulting parameters and infers blazingly fast
so you really have a lot of flexibility in deployment
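A toy sketch of that idea with NumPy only: the "trained" parameters are just random numbers standing in for a real model, and `forward` is a hypothetical two-layer network. The point is that the stored arrays plus any implementation of the same forward pass reproduce the predictions.

```python
import os
import tempfile

import numpy as np


def forward(x, params):
    """Two-layer network: ReLU hidden layer followed by a linear output layer."""
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]


# Random numbers standing in for parameters that training would produce.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(4, 8)),
    "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 3)),
    "b2": np.zeros(3),
}

# Save only the parameters, independent of any training framework.
path = os.path.join(tempfile.mkdtemp(), "params.npz")
np.savez(path, **params)

# Later (or in another program): load the arrays and run inference.
with np.load(path) as stored:
    loaded = {name: stored[name] for name in stored.files}

x = rng.normal(size=(2, 4))
same_output = np.allclose(forward(x, params), forward(x, loaded))
print(same_output)
```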
interesting, what's your undergrad?
thanks, I'm just curious, I have no plan on doing any ML anyways.
AI as well haha
the heck is AI lol
is undergrad the same as bachelor?
yeah
what's that?
artificial intelligence
i know ai but i don't know bsc in ai
it is probably just a new trendy thing unis do now
Well it teaches stuff about AI
is it mostly stats? is it mostly cs?
AI is a buzzword anyways. you can boil it down to a few math competencies and optimization for targetted applications
yeah is it closer to stats or CS though?
and psychology, biology a bit
jack of all trades
Lots of elective courses, I chose mostly cs courses
amen
AI falls both under stats and CS separately. also overlaps with so-called "signal processing"
We could choose what we were interested in, I chose mostly cs courses, and robotics
science and engineering
most unis are just capitalizing on the buzzword
ah
It exists for 20 ish years I believe on my uni
yeah those existed for a while, but it gained a lot of traction in the past few years.
But yeah AI just means its a bit more broad than just machine learning
yea im familiar
but it is mostly just buzzwords
once you have an optimization model or program then that isn't AI, that's just math and algorithms.
hmm
but AI is a cooler way to say it.
AI is just a name for a field of research with many different sub-fields
yea
yea, I know someone from here in Egypt working in AI since the 90s
so I'd imagine it was more common in the west.
Hello guys, hope you guys are having a nice day! I need help with pyspark and I don't know how to solve this. I want to merge duplicated rows but keep the values that were in them. In the first image, you can see the dataframe I'm working on. I'm using this block of code, but it isn't working
why are you using alias?
wouldn't the max return one result only anyways?
oh nvm im dumb
I never used spark, but have you tried adding \ after each comma?
https://sparkbyexamples.com/pyspark/pyspark-groupby-explained-with-example/
looking similar to this to me
have you tried doing max for all the columns and then renaming them after? might be easier and cleaner.
you're getting the max for everything
i think you can max everything instead of writing all this and just rename after
try groupBy(...).max() with no column arguments and see what it gets you.
I just made it work. I swapped the built-in max for the one from pyspark.sql.functions (f.max) and it worked!
thanks anyway !
i want to use D3.js v4 Force Directed Graph with Labels like the below link
https://bl.ocks.org/heybignick/3faf257bbbbc7743bb72310d03b86ee8
Does anybody used this code, it's not working with me
Hi
Does anyone know how to speed up pandas iteration?
I have a column which contains image URLs and I am downloading all the images using requests.
Right now it downloads around 12000 to 13000 images in an hour.
Is there a way to speed this up?
I have 50 Mbps. It is good enough, right?
and I am downloading all the images using requests
Using one thread, or multiple via threading?
what's the source? are you sure you aren't getting limited?
and no it depends on the size of each image and the total. of them.
many small files will take longer than fewer large files of the same total size.
One thread I guess. I have no idea about threading
Optimizing downloads is crucial to gather data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.
Waiting on multiple requests at the same time should boost the speed massively, then. This can be done either by threading requests, or, somewhat more nicely, by using an asynchronous requests library like aiohttp
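A sketch of the threaded version using the standard library's `concurrent.futures`. `download_all` and `fake_fetch` are hypothetical; a real `fetch` might be something like `lambda u: requests.get(u, timeout=10).content`, and `max_workers` should stay small to be polite to the server.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every url on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    """Stand-in for a network call so the example runs offline."""
    return len(url)


urls = [f"https://example.com/img/{i}.jpg" for i in range(20)]
sizes = download_all(urls, fake_fetch)
print(sizes[:3])
```

Threads help here because the work is mostly waiting on the network, not computing.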
Image size is pretty small like 2 KB. But there are over 100k images.
yeah that will just take long
all these images will download in less than one second
Currently, using only requests it's downloading 12k-13k images per hour
you can do multithreading and it will increase them, but be careful not to get black listed
what's the source of these files?
and how many files are there? they might have a better option to do this.
The image URLs are of Twitter profile pics.
Around 150k
try to see if the api will help
Twitter API?
yea
Nope, it's got loads of rate limits.
yeah your best bet is multi processing, but you'd probably be ip banned for spamming requests
you have both of these https://opensourceoptions.com/blog/use-python-to-download-multiple-files-or-urls-in-parallel/
go wild.
is there any reason why activation functions are just the normal sigmoid/relu/etc, and not something like this?
Well one reason is that people prefer their gpus below 200 °C
this looks like a combination of 4 parts, each of them in the form a (x-b)^k, so 3 parameters per part. Calculating gradients with regards to each parameter is going to be annoying.
I'd guess that people studied complex activation functions like that and they weren't any better than the ones normally used while being harder to compute. No sources for that claim, though.
what about neurons that take in two inputs and give one output? you could do something weird like have a functioning xor gate with only 3 neurons
something like this:
Not sure about that specific example, but a pretty major reason why it is as it is now is because we can easily apply matrix multiplication
Which is heavily optimized
that activation function doesn't really look all that complicated, I don't think?
Well compared to something like max(0, x) it would be, but yeah on smaller models maybe it would be okay
not really sure how much of the computation time is dependent on the activation function normally
that depends on how difficult it is to differentiate it, i.e. do the backwards passes
stuff like the derivative being bounded, and being bounded by a small constant at that, provides convergence guarantees and affects how large the learning rate can be while remaining stable
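To make the bounded-derivative point concrete, a small sketch comparing ReLU and sigmoid gradients on a grid of inputs. The sigmoid derivative s(x)(1 - s(x)) peaks at 0.25 at x = 0, and ReLU's is either 0 or 1.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    # Derivative of max(0, x); the value at exactly 0 is a convention.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


xs = [i / 10 for i in range(-50, 51)]
max_sigmoid_grad = max(sigmoid_grad(x) for x in xs)
max_relu_grad = max(relu_grad(x) for x in xs)
print(max_sigmoid_grad, max_relu_grad)
```

Both are cheap to evaluate in the backward pass, which is part of why they stay popular.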
there's also a paper discussing the optimality of relu and similar piecewise spline-like functions, but i can't for the life of me remember the title. it must have been in ICASSP 2021 or 2020. but at any rate, any deep enough network should fall in the scope of the universal approximation theorem, so may as well go with well behaved functions
please let me know if you remember the paper 
i've been looking for like 10 minutes haha, i'll keep trying
oof i can't find it, sorry. assume i'm misremembering, or look around yourself...
Hello, I want to pursue AI and started my journey by learning the Fundamentals of Python, now I'm confused on what should I do next as I tried learning OpenCV but had too much trouble due to the background knowledge required.
Would love to hear your suggestions or journey in AI.
i would say that if you seriously intend on working in/with AI in the long term, you need to learn those things. a minimum background in linear algebra, stats, and multivar calc is necessary if you want to understand what you're doing.
if you only plan on using APIs and networks other people have designed and possibly trained, you don't need that. but that also limits your options
you can alternatively start with classical signal processing/image processing, which is the same stuff i mentioned above but directly looking at its applications
Thanks for your help :)
Many different activation functions are used, depends what you are doing.
LOL
Im shit at maths btw
i just failed a quiz on guassian substitution for matrices
doesnt stop me >:D
knowing how to do gaussian elimination by hand is not a big deal. knowing that elementary row and column operations are rank- and solution-preserving, on the other hand, is important
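A quick numerical check of the rank-preserving part, assuming NumPy. The matrix is made up; the three operations are a row swap, scaling a row by a nonzero constant, and adding a multiple of one row to another.

```python
import numpy as np

# Row 1 is twice row 0, so the rank is 2.
A = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.0, 0.0, 1.0],
])

B = A.copy()
B[[0, 2]] = B[[2, 0]]   # swap rows 0 and 2
B[1] *= 5.0             # scale row 1 by a nonzero constant
B[2] += -3.0 * B[0]     # add a multiple of row 0 to row 2

rank_before = np.linalg.matrix_rank(A)
rank_after = np.linalg.matrix_rank(B)
print(rank_before, rank_after)
```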
I think that as long as u know the omega basics of the math behind ml, only stats matters anymore
I mean, I’m sure everyone knows how to multiply matrices bro
sadly you can't cleanly separate the two in multivariate statistics, since you'll be looking at a LOT of covariance and correlation matrices, their rank, and their related spaces
I think that beyond what’s needed for backpropagation calculus gets less important
And more so general stats
if you just wanna use stuff and not make new theoretical results, yes, for sure
and honestly that is the case for the vast majority of people
I do not think 99% of data scientists are trying to invent the wheel
Yes
Myself included
If I wanted to do that I’d go for a PhD in ml
i might be a little out of touch with the real world 😛
Even the phds I know have not done so but rather applied it for research
The models we have already are fine
they're "fine" in that they work. they're "not fine" in that there aren't all that many results providing performance and convergence guarantees
Theoretical ml math is like for the actual gods and math nerds. We need them and I appreciate their work but it’s definitely only a tiny fraction of engineers
so you go off on a limb trusting your training and validation
small percentage of engineers and (applied) mathematicians, yeah.
I can’t imagine how hard it was to code tensorflow from scratch
Or come up with certain models
It’s way beyond my own iq limit
the core models are old, i would argue the basics are not all that difficult. the stuff is decades old, we just lacked the computational power
so encode the companyId? There's 63 features, but I think I may just do this to save myself the time
Would anybody ordinal encode degree types? Such as High School, Bachelors, Masters, and PHD? A colleague of mine mentioned to OHE instead but I'm unsure
perfect, was going to do this anyway. Thanks mate
Does "None" for degree obtained mean missing values or just mean no degree was obtained? Originally I thought it meant no degree was obtained, but another colleague is saying it's null values
Yes
How about categorical features that say "None" for degree awarded, does this mean missing data or just none obtained? But now looking at major awarded, some of the values have "None" even though a degree was awarded
Maybe I could do a boolean indexing that finds all degrees awarded and make sure that majors are being awarded
because you can't graduate with no Major
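A rough sketch of that boolean-indexing check in pandas. The column names (`degree`, `major`) and the label strings are hypothetical stand-ins, since the real dataset's schema isn't shown here; the idea is just to flag rows where a college-level degree is recorded but the major is "NONE".

```python
import pandas as pd

# Hypothetical toy data; real column names and labels may differ
df = pd.DataFrame({
    "degree": ["NONE", "BACHELORS", "MASTERS", "HIGH_SCHOOL", "DOCTORAL"],
    "major": ["NONE", "MATH", "NONE", "NONE", "PHYSICS"],
})

college_degrees = ["BACHELORS", "MASTERS", "DOCTORAL"]  # assumed label set
has_college_degree = df["degree"].isin(college_degrees)
major_missing = df["major"].eq("NONE")

# Rows that look inconsistent: degree awarded, but no major recorded
suspicious = df[has_college_degree & major_missing]
print(suspicious)
```

If `suspicious` is non-empty, "NONE" in the major column is probably standing in for missing data at least some of the time.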
I'd make dummy variables for each. I don't know how your data is structured, but it's possible for someone to have a PhD and a Masters, and someone else to have a PhD with no Masters for example.
Good looks on this, but it's not structured that way. How about occupation 'CFO, CEO, Janitor, Manager' would you ordinal encode this or One Hot encode instead?
Personally, I tend to not try and ordinal encode. It imposes an assumption of equal distance between points that might not be true.
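For comparison, a minimal pandas sketch of the two encodings on a made-up `degree` column. The integer mapping is an assumption chosen for illustration; it is exactly the "equal distance" assumption mentioned above, since the model sees PhD minus Masters as the same gap as Bachelors minus High School.

```python
import pandas as pd

df = pd.DataFrame({"degree": ["High School", "Bachelors", "Masters", "PhD", "Bachelors"]})

# Ordinal encoding: one column, but it bakes in evenly spaced levels
order = {"High School": 0, "Bachelors": 1, "Masters": 2, "PhD": 3}  # assumed ordering
df["degree_ordinal"] = df["degree"].map(order)

# One-hot encoding: one indicator column per level, no spacing assumption
one_hot = pd.get_dummies(df["degree"], prefix="degree")

print(df)
print(one_hot)
```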
okay cool, thanks for the help man.
You're welcome.
this is a general problem in dealing with data, the solution is generally to ask the people who recorded/created the data, or look at any documentation associated with it
it's for a take home assignment on a job interview.. I think they just would like to see how I tackle certain problems
thank you though!
I hope the job interview goes well!
thanks so much, I really need it haha
any idea on making a correlation matrix for something with 92 features?
haha yeah that's one way, the image comes out too blurry 'cause there's so many features, and yes
may just do a table instead of viz
it's a lot of features
yeah, will have to just use a table
maybe there's a subset you can look at
or you could use some type of dimensionality reduction
like MDS or PCA
Yeah, but I'm not sure on that. I might just make 2 models... one dropping the big features and one with it
is it okay to have a correlation of .384 based off of degree and salary? salary is what we're predicting
plt.figure(figsize=(), dpi=())
sns.heatmap(df.corr())
it's just too big to fit on a figsize
```
corr_df = df.corr().unstack().reset_index()  # assumed first step (cut off in the original message): flatten the matrix into pairs
corr_df.columns = ['feature_1', 'feature_2', 'correlation']  # rename columns
corr_df.sort_values(by="correlation", ascending=False, inplace=True)  # sort by correlation
corr_df = corr_df[corr_df['feature_1'] != corr_df['feature_2']]  # remove self-correlation
# corr_df[(corr_df['correlation'] >= 0.5) | (corr_df['correlation'] <= -0.5)]
corr_df
```
Then yeah drop correlations close to zero I suppose.
But that's the highest is the .384
not sure if that's a good thing or bad
It's relative to the data
ok ok
So then how does the matrix tell us what features we need to drop?? The ones that have high correlation right?
If I have 93 features and 62 of them are fairly independent, does that mean we just keep them?
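One common heuristic for that question, sketched below with made-up data: high correlation *between two predictors* suggests redundancy, so you can drop one of each highly correlated pair. This is separate from a predictor's correlation with the target. The column names and the 0.9 cutoff are arbitrary assumptions, not something anyone here prescribed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=n),  # near-duplicate of "a"
    "b": rng.normal(size=n),                  # independent feature
})

corr = df.corr().abs()
# Keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cutoff, tune for your data
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```

Features that are fairly independent of each other are not redundant by this measure, so this check alone gives no reason to drop them.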
I wouldn't say this is a strong relationship, but given that salary is going to be influenced by a ton of things, I am not surprised
is that r or r^2 ?
I guess it's R if you are saying it's correlation
check this out
again its for job interview so as long as I'm on the right track...
yeah
because there's 63 unique companyID's
years experience + degree is probably pretty good then 🙂
I thought about dropping this to see if it made a difference
I bet years experience + degree + field is even better
but maybe just the two are enough
so based off what I showed you, do you think I can probably drop the comapnyId?
sorry typos
and that dang companyID has forced me to make 63 extra columns... but the reason I kept it is because company12 (could be apple for example) is known to pay better than company55(which could be HP)
does that make sense @void granite ?
yeah just pick a few of the higher ones for the model
Meaning that I create my features with only a few of the selected columns?
I mean, it's -a- way to do it, if you just want something quick and dirty. there's a lot of advanced techniques you can use if you want to be formal about winnowing down predictors, but uh... it's been over five years since I was doing any sort of advanced statistical work 😄
ahh gotcha gotcha, i'll have to play around a bit more! Thank you though
if you were being fancy about it, you would pick some sort of criteria for evaluating models, make a bunch of models, and then rate them by that criteria... maybe you would also have some parsimony there (preferring models with a lesser number of variables)
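A toy sketch of that "make a bunch of models and rate them" idea. AIC is used here as the criterion purely as an assumption (it rewards fit and penalises extra variables, which is one way to get parsimony); the data and feature names are synthetic, and ordinary least squares via NumPy stands in for whatever model you would really use.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n = 300
# Synthetic predictors (names are hypothetical)
X = {
    "years_experience": rng.uniform(0, 25, n),
    "degree_level": rng.integers(0, 4, n).astype(float),
    "noise_feature": rng.normal(size=n),
}
y = 50 + 2.0 * X["years_experience"] + 10.0 * X["degree_level"] + rng.normal(scale=5.0, size=n)

def aic_for(features):
    """AIC (up to an additive constant) of an OLS fit using the given features."""
    A = np.column_stack([np.ones(n)] + [X[f] for f in features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    k = A.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

# Every non-empty subset of the predictors
candidates = [c for r in range(1, len(X) + 1) for c in combinations(X, r)]
scores = {c: aic_for(c) for c in candidates}
best = min(scores, key=scores.get)  # lower AIC is better

for feats, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(round(score, 1), feats)
```

Exhaustive search like this only scales to a handful of predictors; with 60+ columns you would need a stepwise or regularised approach instead.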
there's like entire fields of study about this problem heh
and there is, imho, a sense in which it's an art in addition to a science
actually, thinking about it
you want companyID to be a factor, like a categorical data type
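In pandas terms, a "factor" roughly corresponds to the `category` dtype. A small sketch with made-up company labels and salaries:

```python
import pandas as pd

# Hypothetical data; the real labels and values will differ
df = pd.DataFrame({
    "companyId": ["COMP12", "COMP55", "COMP12", "COMP7"],
    "salary": [130, 90, 125, 100],
})

# Still a single column, but pandas now knows it is a set of discrete levels
df["companyId"] = df["companyId"].astype("category")

n_levels = df["companyId"].cat.categories.size
codes = df["companyId"].cat.codes  # integer code per row, one per level
mean_by_company = df.groupby("companyId", observed=True)["salary"].mean()

print(n_levels, codes.tolist())
print(mean_by_company)
```

Whether a given modelling library can consume a categorical column directly depends on the library; some still need it expanded into dummy columns at fit time.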
"company" should just be one column, in other words
good morning
I am new to pandas, is there a sort of 'mask' functionality?
Let's say I have a data frame as in the image above
would it be possible to have a mask in a Series and run it over the dataframe, returning all rows matching the mask?
what would be the best approach when I have such a task?
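For reference, the usual pandas idiom is boolean indexing: build a boolean Series with the same index as the frame and pass it in square brackets. The frame below is made up, since the image isn't available here.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "cy", "dee"],
    "age": [31, 17, 45, 22],
})

mask = df["age"] >= 21   # boolean Series aligned on the index
adults = df[mask]        # equivalent to df.loc[mask]

# Combine conditions with & and |, wrapping each one in parentheses
combined = df[(df["age"] >= 21) & (df["name"] != "cy")]

print(adults)
print(combined)
```

There are also `DataFrame.where` and `DataFrame.mask`, but those keep the original shape and fill non-matching entries with NaN rather than returning only the matching rows.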
why is the favicon.ico for colab in a different colour, for each of these notebooks?
What's your desired output DF?
Not sure. Maybe one is connected to kernel and the other one not?
Hey guys, so I have this plot, and I'd like to make a Gaussian fit. But I don't really know how, so would anyone know by any chance how to do it properly?
(the pit in p=0 is totally normal btw that's exactly what I was trying to get)
Simple way: calculate the mean and std of the data, and draw a gaussian with these parameters.
In theory you could instead directly minimize the mean squared error, or perhaps maximize the likelihood, via something like scipy.optimize. But I think the result will be very close, and the latter approach is more complicated and computationally expensive.
the opposite
!docs numpy.std can be used, say
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>, *, where=<no value>)
Compute the standard deviation along the specified axis.
Returns the standard deviation, a measure of the spread of a distribution,
of the array elements. The standard deviation is computed for the
flattened array by default, otherwise over the specified axis.
I might be doing something wrong
if you have the data from which you produced that PDF, you could find the mean and std from there. if you only have the plot you showed, you have to find the parameters of a gaussian curve directly
I have a np array, I just use np.mean and np.std on it to get the parameters
an np array of what, though?
but I just decided to look directly on the graph to get the value and then get sigma from that
a numpy array of what? the original data you used to calculate the histogram?
ho it's the squared modulus of a state function psi
if you have only the samples of the plot you showed, you'll have to do a least squares fit or something similar, and finding the mean and variance won't work
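A sketch of why `np.mean`/`np.std` on the array itself gives the wrong numbers when the array holds density values rather than raw samples. `X` and `Y` here are synthetic stand-ins (a clean Gaussian on a uniform grid). The weighted-moment estimate at the end is an alternative not suggested above; it assumes a clean, fully sampled curve and would be thrown off by something like the dip at p=0, which is where a proper fit is the better tool.

```python
import numpy as np

# Synthetic density sampled on a uniform grid (assumed stand-in for |psi|^2)
X = np.linspace(-400.0, 400.0, 801)
true_mu, true_sigma = 10.0, 66.0
Y = np.exp(-0.5 * ((X - true_mu) / true_sigma) ** 2) / (true_sigma * np.sqrt(2 * np.pi))

# Statistics of the curve heights: these say nothing about where p is centred
naive_mean, naive_std = np.mean(Y), np.std(Y)

# Moments of p, weighting each grid point by its density value
w = Y / Y.sum()
mu_est = np.sum(w * X)
sigma_est = np.sqrt(np.sum(w * (X - mu_est) ** 2))

print(naive_mean, naive_std)
print(mu_est, sigma_est)
```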
from the plot sigma should be 50-75 by my eye
yeah I find 66
hmm, how did you plot the gaussian?
I just look at the central value, calculate sigma from that and then plot a gaussian
worked fine lol
idk why the method you gave me didn't
that doesn't look like all that good a fit
the sigma is clearly slightly off (lower than it should be)
Yeah I know
guys i loaded a dataset using tf.data.Dataset.from_tensor_slices and my map function is like this ```
def preprocessing(x, img_path):
    print(dir(x))
    name1 = str(x[0].numpy())
    name2 = str(x[3].numpy())
    num1 = str(x[1].numpy())
    num2 = str(x[2].numpy())
    target = float(x[4])
    img_name1 = f'{name1}_{add_zeros(num1)}.jpg'
    img_name2 = f'{name2}_{add_zeros(num2)}.jpg'
    img1 = plt.imread(os.path.join(img_path, name1, img_name1))
    img2 = plt.imread(os.path.join(img_path, name2, img_name2))
    return tf.convert_to_tensor([img1, img2, target])
```
also I changed the central value to get the maximum divided by 2
that's totally normal
surmised as much
you might in fact want to go the scipy.optimize way, since you only have the PDF
still, you'll wanna do a fit
PDF ?
since you have the parametric model, you can find the gradient and hessian analytically (or use automatic differentiation) to make it converge quickly
PDF = probability density function
Density plot
ooh, fancy ways
basically I have my system in an initial psi state as a ket, then I apply my operator, yada yada, and get a final state
and I'm plotting the probability density
which I have as an array
so you do have the data. you can just do a maximum likelihood estimate of the mean and variance then, both of which have closed form if the distribution is normal (and is assumed to have AWGN noise)
AWGN ?
additive white gaussian noise
here's an example of gaussian fitting:
from scipy import optimize as opt
import scipy.stats as stats
import numpy as np
# X, Y are the points of your PDF
def loss_fun(params):
    mu, sigma = params
    return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()
initial_guess = np.array([0, 60])  # initial guess, just needs to be close enough to the true values
res = opt.minimize(loss_fun, x0=initial_guess)
just do what reptile suggests, it seems we're using different nomenclature anyway so it'd take too long to discuss it
Sure, each Y[i] being the value of |\Psi|^2 at the corresponding p=X[i]
ha nice!
Then you do mu, sigma = res.x and plot a gaussian with these.
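A self-contained variant of the same least-squares fit, using `scipy.optimize.curve_fit` instead of `minimize`. That swap is my own choice, not what was posted above: with sigma around 60 the density values are on the order of 0.006, so the squared-error loss and its gradient are tiny, and a gradient-based `minimize` with default tolerances can stop almost immediately at the initial guess. `X` and `Y` below are synthetic stand-ins for the real curve, and `gaussian_pdf` is a helper written for this sketch.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_pdf(x, mu, sigma):
    """Normalised Gaussian density; abs() keeps it defined if the solver tries sigma < 0."""
    s = np.abs(sigma)
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

# Synthetic stand-in for the measured density (true mu=5, sigma=66, plus slight noise)
rng = np.random.default_rng(0)
X = np.linspace(-300.0, 300.0, 601)
Y = gaussian_pdf(X, 5.0, 66.0) + rng.normal(scale=1e-5, size=X.size)

# p0 plays the same role as initial_guess above
params, _ = curve_fit(gaussian_pdf, X, Y, p0=[0.0, 60.0])
mu_fit, sigma_fit = params[0], abs(params[1])

print(mu_fit, sigma_fit)
```

If you prefer to stay with `opt.minimize`, rescaling the loss or passing tighter tolerances are the usual things to try.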
note my edit just now; it should be return ((stats.norm.pdf(X, mu, sigma) - Y)**2).mean()
ho okay
did you get it to work? i'm curious how close that solution will come, since vanilla least squares isn't optimal here. seems to me that the noise covariance is not just a scaled identity matrix, so the cost function would ideally have an inverse of the covariance matrix somewhere in there
there you go
guess i surmised as much. one thing you can try is to split up the signal into chunks of a handful of samples and compute the variance of each
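A rough sketch of that chunking idea. Two assumptions are mine: the variance is computed on residuals (signal minus some smooth fit) rather than on the raw signal, and the noise is treated as uncorrelated so the covariance is diagonal. The per-chunk variances then give inverse-variance weights for a weighted least-squares cost. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic residuals whose noise level grows along the axis (heteroscedastic)
noise_scale = np.linspace(0.1, 1.0, n)
residuals = rng.normal(size=n) * noise_scale

chunk_size = 10  # "a handful of samples"
n_chunks = n // chunk_size
chunks = residuals[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)

# Sample variance within each chunk
chunk_var = chunks.var(axis=1, ddof=1)

# Inverse-variance weight for every sample, i.e. the diagonal of the inverse covariance
weights = 1.0 / np.repeat(chunk_var, chunk_size)

print(chunk_var[:3], chunk_var[-3:])
```

With a parametric `model(X, mu, sigma)` and data `Y` (both hypothetical names here), the weighted loss would then be something like `np.sum(weights * (model(X, mu, sigma) - Y) ** 2)`.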
