#data-science-and-ml
1 messages ยท Page 372 of 1
Mouadjg, think about what sorts of things you want to do in the field before deciding on a job description. An ML Engineer is a very specific (almost kind of niche) job. :'] You might want to go to like, Indeed, and check out some job descriptions to see which you like.
?
ah i'll try to learn from books
I mean, I code it for myself searching for the stuff I do not know about
yeh sure i'll see now
Please, do not use me as a reference hahaha. I am just a bored student interested in ML, I am sure that @stone marlin can advice you way better than me
In general, in the ML/DS area, things are different job-to-job. I'd say if Python is the language-of-choice, you will be using, at least:
- Pandas (this gives you "dataframes" which are similar to tables in SQL or like in Excel)
- Numpy (this gives you some methods to do numerical stuff in python efficiently)
- Matplotlib (this package lets you plot things)
- Scikit-Learn (this package has a ton of the ds-related things in it: preprocessing, modeling, evaluation, pipelining, etc.)
There are other packages, of course, but these are the big ones.
do any one of u have any idea about these two courses
You asked about this above, no? I don't know, but I'll echo Stel's sentiment.
no no it's not like that because lot of my friend told me to use books rather than video
Excellent, so the stuff presented in Hands on Machine Learning is useful
yeh
I do believe that It is a personal preference, I am just accustomed to it
Yeah, it's legit. I'd say that learning to use all of those things is a great use of time. There are more specific libraries, but those can be learned as you go.
thaank you so muchhh for your explainationn
yeah it's depend on persone but i'll try it ,itcan be better
I'm a big fan of books + online docs, but I'm much more used to it. Ultimately, whatever helps you learn the stuff.
Me too
I use a drawing tablet for annotations so e-reading is part of my daily basis
Thanks a lot @stone marlin for sharing your experience, it is been extremely helpful
i like this habit of being disciplined to read books
No problem, just stick to it.
Hahaha COVID-19 forced me into it
nooo what happened to your green filter pfp
you can do this, but i don't think you'd really retain an understanding from it if you don't know the math behind it
there are github repos w the course in python i believe
Oh, that's good --- still would be good to code it themselves, though.
for some strange reason, df.dt.weekofyear works now??
yeah i just meant they could use it as a reference
i feel the burnout that's for sure
made a typo hang on
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.isocalendar.html this is all i see in the docs for isocalendar
oh dear i think stack overflow is down lmao
do i need just a high school maths?*
yeah i think high school stats can be a good basis to build on
ah thank u i'll start the course that u recommend
https://www.patreon.com/ProfessorLeonard
Statistics Lecture 1.1: The Key Words and Definitions For Elementary Statistics
i recommend you use this guy's playlist
aa thank y smmm
anyone know AI Voice Assistant
I did read more books due to the panndemic but had the habit before it
Hello ,i wanted to know if kfold cross validation is good way to predict the accuracy of the model
Since during the training the model already sees data which is used for validation in the next iteration
@wicked grove each iteration of k fold cross validation, the model is reset
So it doesn't benefit from the evaluation data having been part of the training data in the past.
If it did, you would be right, however.
Good question 
Ohhh thank you so much
Is that automatically set in sklearn?
Or do i have to set that in some way?
im completely new to python itself how do i start coursse pls....
@serene scaffold
models = [] train_errors = np.empty(shape=n_folds) test_errors = np.empty(shape=n_folds) for idx, (train, test) in enumerate(cv_folds): model = Model() model.fit(train) models.append(model) train_errors[:, idx] = model._loss(train) test_errors[:, idx[ = model._loss(test)``` like this?
there is a metric cross_val_score
which automatically does cross validation and prints out a list of scores
default is I think 5
although yes there are methods you can make validation sets
KFold, Stratified , there are more, I just remember these two
https://github.com/krishnaik06/Types-Of-Cross-Validation
Building machine learning models is an important element of predictive modeling. However, without proper model validation, the confidence that the trained model will generalize well on the unseen data can never be high. Model validation helps in ensuring that the model performs well on new...
here
'python for everybody' , its on freecodecamp
k but i want to learn data science
Guys i listen to lex fridman podcast , eye on ai etc often i hear of some kind of online forums, groups where exciting things are being worked and explored apart from this discord community i have been unable to find some good forums /groups /channels/communities , i would really appreciate any opinions or suggestions .
Yes yes but my question is should i reinstantiate the model everytime i do k fold
Or is that automatically done when i do model.fit in the loop
like this is my code
from sklearn.model_selection import KFold
kf = KFold(3,shuffle=True,random_state=42)
fold=0
for train,test in kf.split(x_train):
fold+=1
print('fold',fold)
x_train1 = x_train[train]
x_val = x_train[test]
y_train1=y_train[train]
y_val=y_train[test]
history = model2.fit(x_train1,y_train1,epochs=50,validation_data=(x_val, y_val),callbacks=[callback])```
should i initialize it anywhere in the loop
use cross_val_score and then in the 'cv' argument put kf
Hai, can anyone guide me how to get the speech from a video into a text file?
I tried but i am getting error
import moviepy.editor as mp
import speech_recognition as sr
clip = mp.VideoFileClip(r"sample1.mp4")
clip.audio.write_audiofile(r"Converted_audio.wav")
print("Finished the convertion into audio...")
audio = sr.AudioFile("Converted_audio.wav")
print("Audio file readed...")
r = sr.Recognizer()
with audio as source:
audio_file = r.record(source)
result = r.recognize_google(audio_file)
with open('recognized.txt',mode ='w') as file:
file.write(result)
print("Wooh.. I did it...")
This is the error(last two lines)
raise RequestError("recognition request failed: {}".format(e.reason))
speech_recognition.RequestError: recognition request failed: Bad Request
check this out
my new ai
J.A.R.V.I.S
How to make an ai using python only, with many interesting features . Give credit in the project when using . anmol-123456789 github bit.ly/justacoder - JARVISv-1.0.py
are there any memory leak problems with librosa.load when its put inside a loop?
when this is run on colab it exceeds available RAM (12 GB)
import os
import nltk
import csv
src = '/content/drive/MyDrive/Audio/'
text = '/content/drive/MyDrive/Transcripts/'
for file in os.listdir(src):
#path = os.path.join(src,file)
path = f"{src}/{file}"
#print(path)
temp=file[:-3] +'txt'
tpath =os.path.join(text,temp)
#print(tpath)
#print(path)
#print(tpath)
speech, rate = librosa.load(path,sr=16000)
input_values = tokenizer(speech, return_tensors = 'pt').input_values
logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim =-1)
transcriptions = tokenizer.decode(predicted_ids[0])
print(transcriptions)
f = open(tpath, "r")
for x in f:
print(x)
wer=fastwer.score_sent(x, transcriptions, char_level=True)
print(wer)```
When normalizing data
do you subtract the mean first then divide by the standard deviation?
so like if i am standardizing lots of matrices in a larger matrix
I will find the mean of each individually first
subtract that from each entry in each matrix individually
then find the sd for each matrix individually and do the same, is this right?
but the important thing is i subtract the mean before i find the sd of each matrix right?
anyone got a dataset of different types of common file formats used in our daily life?
hey everyone , can someone help me with this . https://www.kaggle.com/muhammedjaabir/asl-alphabet-classification-irl-testing its an asl alphabet sign classification. I can see that my model is overfitting and so to fix that i collected more image data (289k images ) and trained it for around 5 epochs which gave me ```
training acc (augmented and rescaled) :==> 85%
development acc (rescaled) :==> 95%
testing acc (rescaled) :==> 19%
when visualizing data why do these weird white line occur horizontally in the blue graph?
i plotted the graph using plt.subplots(). when i just plot the blue data with plt.plot() there arent these weird white lines
Hello,
Pycharm or Anaconda for OpenCV?
Anaconda is a distribution of the Python and R programming languages for scientific computing, that aims to simplify package management and deployment.
and pycharm is an IDE. I'm really confused.
sorry, I'm very beginner.
I want to learn Object Detection and i think opencv is good for object detection, what's best way to use that?
change the style maybe
datascience is not web scrapping. please post in related channel.
oh im sorry i will delete, could you address me where?
hm even I'm not sure where they do take place. I'm not sure if it can be put in #web-development
But webdevs may can help you since we usually scrap depending on classes or ids.
so #web-development seems atleast closer to it.
well i tried, i hope it will work and i will get help
regardless you can always ask in help channels ofc. and in #python-discussion too.
How do I make my data frame (on the left) look like the one with lines(on the right)
?
@spring marsh probably a jupyter notebook setting. pandas doesn't actually render the tables like that, jupyter does.
not really. why do you need them to look that way, though? just personal preference?
@spring marsh looks like you can use this: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.html#pandas.io.formats.style.Styler
so it would be something like df.head().style(...)
but I don't know html, and that's what it's based on.
sorry I wasn't more helpful.
Heyy, I'm so sorry to ping you again
You were telling me about how the model is reset in each iteration of kfold
Could you please tell me if my code is correct
from sklearn.model_selection import KFold
kf = KFold(3,shuffle=True,random_state=42)
histories=[]
fold=0
for train,test in kf.split(x_train):
fold+=1
print('fold',fold)
x_train1 = x_train[train]
x_val = x_train[test]
y_train1=y_train[train]
y_val=y_train[test]
history = model2.fit(x_train1,y_train1,epochs=50,validation_data=(x_val, y_val),callbacks=[callback])
histories.append(history)
tf.keras.backend.clear_session()```
Hello, I'm trying to develop a regression model to predict prices for an in-game economy using TF Keras.
I've already managed to properly preprocess the data and such, but my main concern would be the accuracy for a trained model using it (as the game has over 2000 items in the economy, and I'm assuming if i just blindly included item id as one of the parameters to the model it wouldn't perform very accurately).
From doing some research, I heard making a model for each item would be the best choice, and from an experiment with a few items this is likely the case. However, what would be the best way to train and "store" each model for use? Should i train each model back to back, which would likely take a LONG time to complete? Is there a way to speed up said training and maybe use some sort of parallel processing? After training a model, should i just save the model info (with model.save) and write code to use the correct model based on the inputted item id? Is there a way to do this without making so many new files/directories (i.e. is it possible to "merge" all the models into one?)
You could use one hot encoding, since adding the id might "make the model think there's some sequential pattern in the id" (closer ID's are more similiair) but this is likely not the case
So if you think the ID is of large importance, add 2000 nodes to the input of the model, and have each item id give a value of 1 for a certain node, and 0 for all other 1999 nodes
That way the model won't try to learn a sequential pattern from the ID's
@tropic matrix
(This is talking about using 1 model, and not separate model for each item ID)
hey, what's up?
I am just greeting you, I just joined the group
What does it mean that ML pipelines should be composable? If you can provide an example in respond, that would be nice. Please, if you want to respond to this question, use "reply"
Therefore, while exploratory data analysis code can still live in notebooks, the source code for components must be modularized.
What does it mean that source code for components must be modularized?
hey everyone , can someone help me with this . https://www.kaggle.com/muhammedjaabir/asl-alphabet-classification-irl-testing its an asl alphabet sign classification. I can see that my model is overfitting and so to fix that i collected more image data (289k images ) and trained it for around 5 epochs which gave me ```
training acc (augmented and rescaled) :==> 85%
development acc (rescaled) :==> 95%
testing acc (rescaled) :==> 19%
Put in different modules, like have a .py file for each separate component or use classes etc.
not one giant clump of code, but multiple nicely organised modules
My workflow is usually do some eda and functionalize anything that kind'a works nicely, then group that into their own modules.
I don't like to keep everything in a notebook. Even if I use a notebook after that, I can still call my modules from there.
I've also, more recently, been trying to make most of my things pipelines so that I can take advantage of pipeline composition.
So ML pipelines are composable if they are placed in modules and those modules and can be combined together to achieve particular goal?
Pipelines are composable in general, modules are kind of like files that have all these related functions inside of them. It's a good way to structure a python package.
Here's an example pipeline from a blog post (it uses the Penguin dataset). This one is an example, so it's just using a bunch of stuff, but it shows how you can use and compose pipelines. I'm not sure if this is what is meant above, but it's something I do fairly often since, for example, I might want to try a different preprocessing pipeline but keep the same modeling pipeline or something.
# Create pipeline for preprocessing numerics.
pipeline_numeric = Pipeline([
("impute_w_mean", SimpleImputer(strategy="mean")),
("scale_normal", StandardScaler())
])
# Create pipeline for preprocessing categoricals.
pipeline_categorical = Pipeline([
("impute_w_most_frequent", SimpleImputer(strategy="most_frequent")),
("one_hot_encode", OneHotEncoder(handle_unknown='ignore', sparse=False))
])
# Give columns defining desired numeric/categorical cols.
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_cols = ['species', 'island']
# Column Transformer which "runs" the two preprocessing pipelines
# on the correct columns.
preprocessing_transformer = ColumnTransformer([
("numeric", pipeline_numeric, numeric_cols),
("categorical", pipeline_categorical, categorical_cols)
])
# Normal RFC definition.
rf_clf = RandomForestClassifier()
# Entire preprocessing-to-modeling pipeline.
preprocess_model_pipeline = Pipeline([
("preprocessing", preprocessing_transformer),
("random_forest_classifier", rf_clf)
])
# Fitting the pipeline.
pmp = preprocess_model_pipeline.fit(x_train, y_train)
Obv you can have a whole lot more in here, I just wanted to give an example of how this kind of thing works with putting everything in a pipeline.
The perk is that I could totally just make another pipeline_numeric_2 or something with a different SimpleImputer strategy and then all I'd have to do is plug that into the preprocessing transformer.
Source on the docs: https://scikit-learn.org/stable/modules/compose.html#pipeline
Hey does this channel also contains computer vision?
Hi, I have to make a language detector for very short texts which must have 100 languages, when I say very short it is for example messages of 5 characters or more but misspelled or something like "Bot... no..." I'm tryinnnnnnnnnnnnng" etc... Do you know what technology I should use? I also have my own datasets that I form as I go along which can have hundreds of thousands of examples.
the thing about one hot encoding is that including the other features that are used as inputs, i would be getting around 3000-4000 inputs into the model, multiplied by the 23,141,931 rows of data i'm using, makes memory become an issue (also yes ids are actually strings btw but they aren't related)
Well if you have memory issues you can always train in batches (which you probably already do)
and I think training and storing 2000 separate neural networks might also not be such a great idea @tropic matrix
that is also correct, ig the main thing is that when i trained each model individually it performed more accurately overall than when I trained it all at once
right, but you didn't use 1 hot encoding
you probably gave the ID as an integer
which makes the model think there could be a sequential pattern
but the actual value of the ID isn't important, it's just symbolic
that's why one hot encoding should be used
"Pipelines are composable in general..." - I am not sure what does it mean that pipelines are composable. Does it mean that different pipelines are connected together to make one solution?
Can someone answer?
you can ask about computer vision, yes.
Aight ty
it means that you can make a pipeline out of other pipelines
Yeah, that's what I thought. Thanks for response!
Who here is familiarized with the "face_recognition" lib?
whenever you want help, you should just ask your actual question, not if anyone knows about the topic of a question you haven't actually asked.
(context: I told this person to ask their question in this channel.)
Very well then.
The error:
File "/usr/local/lib/python3.10/site-packages/PIL/Image.py", line 2959, in open
fp = io.BytesIO(fp.read())
AttributeError: 'tuple' object has no attribute 'read'
The code:
def LoadImageFile(FilePath:str, Unknown=True, TargetName=None) -> list:
# set_trace()
try:
FFile = face_recognition.load_image_file(FilePath)
Locations = face_recognition.face_locations(FFile)
Encodings = face_recognition.face_encodings(FFile, Locations)
except:
set_trace()```
I've verified prior to the load_image_call that the image is indeed a string rather than a tuple and it's a real path(which is also previously verified with os.path)
I've also made sure that it wasn't face_recognition doing anything wrong because I've used it in various scenarios with nearly the same context.
PDB Output(Before the load_image_call):
LoadImageFile()
-> try:
(Pdb) print(FilePath)
Pic.png
(Pdb)
PDB Output(Invoked by exception)
LoadImageFile()
-> if Unknown:
(Pdb) print(FilePath)
()
(Pdb) ```
Please mention me when one of you are able to help.
This is the result of some txt detection
Using ocr with python
Anything i could possibly imporove?
I just got done with an "interview problem" that I thought y'all might enjoy, so I wanted to share it with you! This was for a Sr. Data Scientist position at a small-to-mid-sized startup where the focus was mainly marketing-type things. So, how would you go about something like this?
Note: The time I was allotted was 1.5 hours, no restrictions on language, methods, etc.
Note further: I am not asking for solutions for myself, I already did this and submitted it. I think it would be fun to see how y'all would approach something like this!
Problem: Divvy is a bike-share program in Chicago. It collects a bunch of data about riders. Using data from, say, Divvy_Trips_2020_Q1.zip at https://divvy-tripdata.s3.amazonaws.com/index.html, try to figure out something about this stuff:
-
Suppose we work for Divvy and we want to give discount rates to commuters. At what stations would we want to advertise the most?
-
Suppose we work for Divvy and we want to give discount rates to "pleasure-cyclists", who mostly just rent a bike to use for a nice trip along the lake a day or two a month. Where would we advertise this?
-
(Here's one I thought of) If we increased the price of Divvy a bit, what stations do we expect to grow / shrink? If we decreased the price of Divvy for a bit, what stations do we expect to grow / shrink?
Have fun!
Thanks I will try this , im surprised this is for senior role though, i though senior data scientist would be masked to build a ML model deploy it on cloud with tableua for visual aids and data cleaning techniques and featue engineering
This is more like data analyst role (not saying it is but this is my first time seeing this),
Any deep learning book for beginners which has everything including maths ??
Have not looked at the data and idk if I can be bothered to do this problem, but does the data contain information about what was advertised where, how, and how much was spent?
Hi!
I am going to make a pretty strange question
Currently I am reading a book called "Introduction to Statistical Learning" by Trevor Hastie. To be honest I am completely in love with it and the stuff it offers.
However, the moment I enjoy the most is when the "math" stuff comes into play, now that I have studied AI I know for sure that this is what I want to do for a living
In the real world. What happens with the "mathematical" profile worker? Is it really taken into account?
I do not know if the question is clear, I am not a native speaker but I tried to do my best in order to express what I want to say
Please let me know if my explanation is not clear. Thanks in advance
In AI? Yes, but that's just part of it, you need to be able to program to demonstrate things (science / engineering), not just prove things (math). And also maybe even more, like robotics (physical / real world demonstration).
Could a mathematician get a job in AI? Yeah, probably, but you have be really good at it then to make it worth hiring you for only that, and not someone else that can do multiple things.
hm ok, I'll try one hot encoding the item names again
on another note, is there anything i can do to help speed up the training process? i'm already training in batches to save on ram usage due to the huge amount of data and using a gpu, but is there anything else I can do?
A lot of people in AI really like math, but the real hard work comes from reality being messy and incredibly complex, plus lots of trial and error (the daily grind). When I get to work on just some plain math I am relieved (unless it's someone else's paper with nonsense notation and hand-waving).
I get the point
Would you consider that a good mathematical background is mandatory? How useful advanced math really is when it comes to AI?
Sorry for asking too much, new questions related to the AI field arise to my mind every day
Yes it's mandatory (the fundamentals such as linear algebra, statistics, calculus). At the very least many of the different fields' intuitions / essence are useful as a guide on what could be done / give ideas of where to go.
Are you currently working in the AI field?
Yes.
Would you please, if you dont mind, describe me what a day in your work is like?
It consists of working on my current project and spending time seeing what others are working on while also improving my overall knowledge in (not necessarily in this order) mathematics (whatever field I currently feel lacking in or think might be useful), computer science, physics, biology, neuroscience, history (of AI/ML/related), psychology, robotics, and others (it can all apply to AI, so at the very least I want to know the general ideas floating around). The current project might be something more math focused, something more programming focused, or something physical like building a robot / test environment / computing cluster.
Hello mates, i am trying to do ML with a small synthetic dataset but i cant get more than 0.64 accuracy
can someone please guide me out, i have tried literally everything, neural networks, various regressions and boosting but idfk why i cant get more than 0.64
i would be very grateful
this is the data, i have tried plotting it in many different ways too
almost all the features have a similar plot like this, with different scales
oof harsh, not even column labels
yeaaaa ;-;, purely synthetic
i did a correlation plot: but i kind of find it overwhelming to start with
do you know what accuracy you're expecting? 0.64 itself is quite meaningless unless you can get a gauge of what's possible
the baseline, which they told me is 0.63 AUC score, many of my classmates have gotten 0.66 or even 0.68
many people are suggesting me to select features/feature engineering but there are so many different combinations, i dont know where to start
i achieved a maximum of 0.638 on the testing dataset, which was predicted by the model which gave 0.641 AUC when i was doing KFold with my training dataset
mm ill give it a try
ohhh thank you so much, i have a deadline till today midnight (after 12 hours) if you can even provide me with a direction to move forward in, i will be very grateful
sure:
- Neural Networks:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import roc_auc_score
from tensorflow.keras.metrics import AUC
import os
model = Sequential()
model.add(Dense(26,input_dim=15, activation='relu'))
#model.add(Dense(14, activation="relu"))
model.add(Dense(13, activation="relu"))
# model.add(Dense(8, activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.compile(loss="binary_crossentropy",optimizer="adam", metrics=['accuracy'])
model.fit(train_X,train_y,epochs=10,batch_size=10,verbose=0)
tried many different combinations of number of nodes and layers, epoch and batch_size values but this was the relatively best situation where i got around .63 auc score
tried many types of correlation plots, like this one:
and the one i sent earlier:
then i tried selectKBest features, with 2 different algos
i got this plot with mutual_info_regression
this with chi2:
and then, finally i have tried LGBMRegressor, TabNetRegressor, XGBRegressor, LogisticRegressor the accuracy doesnt move. Ranges from 0.58 to 0.62
thats all i have till now
Re my problem above: there is no current advertisement AFAIK, it's mostly a pretty open question. It pretty much has trip data: when a person left a station or arrived at a station, that's a big chunk of it. I think they just want to see what you'd do with it, who knows.
It's fine if y'all don't have time or can't be bothered, I wanted to share it in case anyone found it interesting! I can keep them to myself next time, haha, sorry about that!
No, it's fine, it's an interesting problem. Without advertisement data and budget data it's pretty hard to tell, since it's a resource allocation problem. So really it just comes down to personal priors / marketing knowledge in general (i'm not into marketing so my priors are really lame, like try the most densely populated for the most eyes (maybe that's what they actually want for 1 (and for 2 probably they want to see if you can find the pattern of "pleasure-cyclist" / where they are))).
Yeah, I think the lack of advertisement data + pretty much any kind of budgeting data is the "hard" part of this problem. If we had that, it kind of becomes an easier optimization problem. I sorta liked it because with a 1.5 hour limit I think it was trying to see how I thought to EDA data and what stuff I could come up with. I dunno if this is a good interview problem, but it was a fun problem to do.
Like, idk how this actually reflects the job or how well I can do it, but I thought it was sort'a neat to go through.
(with mobile apps being required to use say the bikes, advertisement cost goes near zero (which is why every company wants you to download their app), and your location is given by the phone (GPS))
(which also why mobile apps don't even care, they just spam advertisements)
(answer is everywhere then)
Yeah, I don't think Divvy [the actual company the data is from, which is NOT the company I am applying for] does any advertising. Or, I haven't seen it.
It's not an easy optimization problem even with budget and advertisement result data (because people are dynamic and can feel overly advertised to, budget changes over time, etc))
Yeeeep. I feel this is sort of similar to a lot of marketing problems given to DS, where everything is real loosey-goosey.
With that data I would maybe throw a bandit swarm at it (it's a multi-arm bandit problem).
Yeah, that's a plan. Like, pick out most probable spots and try stuff out. Otherwise it's a bandit with a ton of arms.
The bandit swarm will try things and adapt to the result. It smartly picks what to try out and balance risk/reward/exploration/exploitation.
The feedback loops gives even more useful data, but it's harder than "normal" ML.
Or by hand / manual would be A/B testing.
In this business-case though, they prob wouldn't want MAB, since it'd be sort of giving people discounts in places and then having to take it away. So you're still prob gonna wind up trying to find optimal paths.
Hello, im stuck on a transfer learning problem . Im classifying images using vgg19
I have used kfold to check the accuracy
I'm not quite sure if what im doing is correct tho, can you please help me out
It would not surprise me if the bandit swarm picks up on people getting pissed for the discounts getting taken away and such and tries to exploit the psychology somehow.
I have seen it do things like that before in simulation.
And other stuff ofc, like break the physics engine.
(it's pretty useful for debugging stuff)
Yeah, there's not gonna be a "real solution" to this problem, but in my mind it came down to: finding the most reasonable stations to try this on, and then choosing those to test the discounts or whatever. Which is pretty much the deal here with MAB.
Yeah, anything that's not like advertise where nobody is at.
You could at least know what not to do probably.
Yep. I'd be interested to see what other candidates' solutions looked like, given that even basic EDA takes longer than one thinks.
Yeah speed seems key here. Like are you already comfy with your tools.
Yeah, if I didn't know pandas / sklearn / whatever, it would'a probably taken an hour just to plot all the stuff out, haha.
It's a solid way to weed people out. Even if they don't get amazing conclusions, if they can't pull up anything in time, they did not practice.
I think that's the case. The job is for a Senior Data Scientist (whatever that means) but this isn't all that different from the things I had to do when I was an entry-level DS. It'll prob be the case that the tech interview will be more difficult, and they'll use this as a jumping-off-point.
(and you are not hiring, in this case, for someone to first spend a week getting into it)
Yep! I'll update a little later on what they ask me about it, or if anything harder comes about, or anything like that. Another company I interviewed with had a problem which was literally "Do you know what ARIMA is?" so I didn't share that one here. :''']
y=mx+c is the formula to find the linear regression line, right ?
And here c means the the co ordinate at y-axis from where the regression line starts, right?
Anyone know of any PCA projects I could do? Like I was thinking eigenface type facial recognition but I would rather do something else. Something to do with making useful predictions with data.
line starts seems like a vague way to put it, it's a line, if your data(x) is negative, y will go below c.
but yeah at y axis, value of y will be c since x is 0 there.
well basically y=c when x=0
c is the intercept of the line at y axis
How about something estimating the growth of world GDP in 2022 provided there is a new strain of corona and one case where there is not one like a comparison you can also use geo plotting to visualize the growth
how would that use PCA?
how is that to do with ai at all lol, i am guessing you thought i asked for a DS project?
dont worry about it
yeah
in simple terms can someone explain to me the difference between value_counts and count()?
data["created_at"]
a_new_series_of_not_spamcalls = data[["created_at", "is_blocked"]]
a_new_series_of_not_spamcalls["day"] = data["created_at"].dt.day_name()
#a_new_series_of_not_spamcalls
df_not_blocked_by_week = a_new_series[a_new_series["is_blocked"]==0]
df_not_blocked_by_week
# df_not_blocked_by_week.shape
# df_not_blocked_by_week.groupby(by = df_blocked["day"]).count()
i know for certain that df_not_blocked_by_week is (40027459, 3)
what i do not understand is that when i groupby it, it's an empty dataframe
could it potentially have something to do with the setting by copy warning?
it's not an attribute error
it's not a type error
oh shit
maybe i know what the error is
i didn't even use the var i created from earlier on lol
unfortunately, no that was not the error
um i don't have syntax highlighting in jupyter notebook lmao
shit
i'm a dumbasss
no duh it wouldn't work if you used a dataframe that had all 1s as a groupby
yeah now it works just fine
How to find the best fit line which has the least error
depends on what error metric you are using. If it's just mean squared error, then an exact solution is possible*, or if you have too many points for an exact solution, something like gradient descent.
beta = (X^T @ X)^(-1) @ X^T @ Y, derived here, say: https://en.wikipedia.org/wiki/Linear_regression#Least-squares_estimation_and_related_techniques
I was trying to learn Kmeans clustering of unsupervised data. How can we select number of clusters from Elbow diagram? I think it will be point with minimum angle. Am I right?
for eg, is it 2 on above diagram?
seems like it yeah
as long as the inertia is low enough and the number of clusters is also low
It's is the point in the graph where the curve bends from high slope to low slope, this is what I find everywhere, and it seems a bit subjective
But in your graph it's pretty visibly at 2 clusters
I am confused between 2 and 3
๐
basically just choose the sharpest angle
Hey can anyone explain this error?
epochs = 5
x_train = np.array(x_train, dtype='object')
y_train = np.array(y_train)
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=[keras.metrics.SparseCategoricalAccuracy()])
model.fit(x_train, y_train, epochs=epochs)```
ValueError: could not broadcast input array from shape (400,400,3) into shape (400,400)
@modern cypress what is x_train before?
to be clear, it looks like you're trying to write over x_train as an array version of whatever it is. I'm asking about what you're passing to np.array(...), not the result of that.
x = []
y = []
for features, label in all_data:
img = tf.convert_to_tensor(features, dtype=tf.float32)
x.append(img)
y.append(label)```
Images here, and then I used train_test_split to split it to x_train, x_test
(I'm slightly confused, because this was working earlier)
Is there a way I can find which column contains the value of another column?
For example I have columns 1,2,3,4,5 and I want to see which of the columns A,B,C,D has the same value as the first columns for every row
Could it be due to the .convert_to_tensor ?
is this a dataframe? please do print(df.head().to_dict('list')) and paste the raw text into this chat.
does each image have the same shape?
Yes sir, I resize all the images before they're put here image.resize((400, 400))
alright. tensors are a lot like numpy arrays anyway--why are you trying to convert it?
I was having some errors earlier, I did some googling and saw someone said to include this as the solution so I tried it out and it worked
However, I don't remember the error, I'll try find that out
Oh right: ValueError: Input 0 of layer sequential_1 is incompatible with the layer: : expected min_ndim=4, found ndim=2. Full shape received: (None, 1)
I get this error without converting to tensor
but the error message you posted earlier points to the line where you pass it to np.array
Yes, sorry if I am confusing you. I was just explaining why I converted it to tensor
I'm quite new to this, so there may likely be errors in my thinking
@modern cypress that's okay 
If anyone has any pointers, I'd appreciate it
@opaque oasis you have to ask your actual question about using python with power bi, or no one will volunteer to help.
If I am reading this error correctly, it's saying that I'm trying to create a ndarray from ragged nested sequences, however when I go through x_train one by one and print it's shape I can see that they are all the same (400,400,3)
So surely this is fixed, and not jagged? Especially if I am resizing them earlier to fixed dimensions image.resize((400, 400))
As far as I can tell, no. All the prints have the exact same shape
how many prints?
do:
for x in x_train:
if x.shape != (400, 400, 3):
print('wrong dimensions')
(400, 400)
so it's greyscale or something
The outputs are the model's predictions which you can also monitor and measure. This should include an understanding of the deployment of different model versions to help you understand how different versions perform. You should also consider performing correlation analysis to understand how changes in your inputs affect your model outputs. And again, this should be done on slices of your data, for example, correlation analysis can help you detect how seemingly harmless changes in your inputs cause prediction failures.
Can someone expand on this? For example by giving an example why to do correlation analysis. I am not sure why to do correlation analysis
Let me clear output and try again real quick
Hello Everyone,
i need your help and support
i want to use clustering algorithms for some online labs
so i need a dataset that contains online laboratories description
the dataset size that i need should be more than 100 labs
where i can find it any suggestions please ?
You could add noise to separate dimensions and see how it affects the output, if it classifies it differently (or in case of regression how much the output changes)
{'1': [1, 0, 0], '2': [0, 1, 0], '3': [0, 0, 0], '4': [0, 0, 1], 'A': [0, 0, 0], 'B': [0, 1, 0], 'C': [0, 0, 0], 'D': [1, 0, 1]}
It can give you insight into which variables are important for prediction
What you mean by separate dimensions?
Like if you base your prediction of an animal on it's weight, height, average color etc.
you try add a bit of noise to color and see if that changes the prediction
If it doesn't change anything, that might mean color by itself does not have a lot of impact on the prediction
but that's just a really general example
correlation analysis can help you detect how seemingly harmless changes in your inputs cause prediction failures.
This part about harmless changes isn't clear. Does he think that if input value is disturbed a little bit, because model overfitted, then he will fail to give correct prediction?
@mild dirge
This is a pretty common example
Only a tiny tiny bit of noise is added, and the prediction changes completely
Thank you so much ๐ I used the small loop you sent to find the grayscale images
This is something you want to avoid
awesome haha
correlation analysis can help you detect how seemingly harmless changes in your inputs cause prediction failures.
This part about harmless changes isn't clear. Does he think that if input value is disturbed a little bit, because model overfitted, then he will fail to give correct prediction?
@mild dirge
yeah something like that
Look at this
the noise is very small, but completely changes the outcome
it's called an adversarial attack
your network should be robust against this
Yeah I understand that.
Thanks for explantion
@mild dirge Can I DM you about something?
If it's confidential sure, otherwise feel free to ask here ๐
Want to use python as source for hourly or daily refresh to a rest api and then use power query
To connect to the data frame and
Refresh from the service
My report dashboard
I think it doable using a personal data gate way
Is least squared method and total square method the same ?
anyone can help?
i traced the error and it's due to len(x_roi)==0 that's why i get none but i know it shouldn't be returned as none and i suspect it's due to best_iou < 0.1 and i stuck there. still don't understand why my best_iou is smaller than it should be (got 0.001~0.06 instead)
@low spear your Y1 is None, for some reason.
so you'll have to check why calc_iou returned None for that value.
https://github.com/RockyXu66/Faster_RCNN_for_Open_Images_Dataset_Keras/blob/master/frcnn_train_vgg.ipynb specifically used this code but the only difference is using my own dataset with the size 1137 * 640
@serene scaffold were you able to see the dictionary I sent in
@merry wadi Yeah, sorry. Are you trying to compac column one with column A, 2 with B, etc?
*compare
All good and Iโm trying to find if A matches up with column 1,2,3,4. If B matches up with 1,2,3,4 etc.
And be able to update that column with a new value using np.where or apply
๐ Nobody have any idea ?
@mild dirge (sorry for pinging, I didn't see if you responded to this)
hey um quick question
i can't seem to change the size of a matplotlib grouped bar chart
labels = ['Monday' , 'Tuesday' , 'Wednesday' , 'Thursday' ,
'Friday' , 'Saturday' , 'Sunday']
spam_calls_by_day = [1872864 , 2165318 , 2786778 , 2563185 ,
1930462 , 770719 , 369446]
#Monday - 5843038
#Tuesday - 6128272
#Wednesday - 7769496
#Thursday - 7638595
#Friday - 6662231
#Saturday - 3393077
#Sunday - 2592750
not_spam_calls_by_day = [5843038, 6128272, 7769496, 7638595, 6662231, 3393077, 2592750]
x = np.arange(len(labels)) # the label locations
width = 0.4 # the width of the bars
fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, spam_calls_by_day, width, label='Spam Calls By Day')
rects2 = ax.bar(x + width/2, not_spam_calls_by_day, width, label='Ham Calls By Day')
# Add some text for labels,x-axis tick labels
ax.set_ylabel('Amount of Calls')
ax.set_xlabel('Days of the Week')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
# Display plot
plt.setp(ax.get_xticklabels(), rotation=30, ha='right')
plt.figure(figsize=(20,20))
fig.tight_layout()
plt.show()
lol i feel dumb
figured it out
Hey, i am currently working with time series data and i am following he tensorflow time series forecasting tutorial. In order to use the datetime of the measurements as inputs to the NN they apply sine and cosine transofrm to the dates. It makes sense to me why to do that and also why sine and cosine is needed for that but what i dont really understand is why you need both sine and cosine. Would appreciate if someone could explain the idea and concept of that or maybe knows a resource where i can find some more information about that
Is this book covers everything thing that we need to know in ML and DL??
Any book for DL which covers each and everything from beginning to end??
No. No.
You should at least be familiar with vectors.
vector calculus or just the basic notion of vectors
Just the basic notion.
Vector calculus is one of the chapters.
Also should know some calculus*
hello, I implemented linear regression from scratch with no libraries except numpy and pandas
how could I validate my linear model with my test data?
i was thinking of passing my test data to my new slope and bias and check the cost at each iteratino and which ever is lowest cost i take the 1 - cost to get accuracy for testing data
if I use
torch.save(model, PATH)
and subsequently
model = torch.load(PATH)
is there a way I can infer the architecture of my model from the model object? If not, or if not advised to do so, what is your recommendation on saving the architecture of my model? I'm considering copying the code block in my colab notebook as a text. If I choose to do so - is there a nice way to do that via python code?
The book seems good but the derivation and maths seems difficult
Anyway to make it easy understandable??
youtube
also, there may be more one than way of learning but I would say trying to read a ML book isnt great way to learn as also requires strong math background. Practical experience is needed to see how the math applies, thats why I mentioned youtube because there lots of great tutorials that shows applicatino of whats inside textbook.
And trying to learn vector calclulus and other math for some can become quicly demotivating as it can feel like so much to learn. This is not best way to learn, instead focus on particular alogirhtm of machine learning or deep learning and only then if need to know certain math then can learn about it just the basics needed.
Short of paying someone to explicitly write a course completely tailored to your situation, you will have a hard time something like that.
I would recommend to try reading from these sources and diving deeper as needed. That will require a bit more effort on your end, but it will eventually pay off.
But do keep in mind that while interesting and rewarding, DL is a complex piece of technology
thing about math is, like everything else, you're gonna have to push through
math is a harder push than python, that's for sure
hey guys, ive made an AI Jarvis using some simple modules, and speech recognition. I also designed a GUI for it in Qt Designer but when i convert it from ui file to py file and open in python, all my images are not visible and the window get white... i thought it was hanged but i waited for an hour still nothing happened... pls help me out
i also saw some solutions from GitHub, it says give the picture path to the folder where you have saved your pictures... so how do i paste a picture path in a folder??? shall i make a shortcut and put the path in it??? pls helpp
Hey guys got a question if I wanna make an ai that runs 24h/d do I need do turn my pc 24h/d?
If some code has to run 24h/d, it needs the capability to do so
But that has little to do with AI/ML
You may want to read more on https://stackoverflow.com/help/how-to-ask about how to ask great questions. It will help increase your odds of getting a reasonable answer
https://docs.python.org/3/tutorial/modules.html
But that's the wrong channel for it...
yeah either that or #โ๏ฝhow-to-get-help
Here is the free article that will take you through the basics of data science, BY nm.devโ
Any experts in Transformer models? Specifically Huggingface BERT/BART models
not sure about experts but some people here do know about transformers.
Basically, I am trying to train a transformer model to generate a meta description (meta tag in web development) based on <h1>, <h2> and <p> tags.
So this means that I need the model to take multiple inputs of course. I read up on it, and I would need to concatenate these tags somehow and feed it into the model
Just wondering how to do this smartly
suppose i have a list of np arrays appeneded to a list how can i save that list of np arrays as a single image?
that's a a seq2seq task - do you have any dataset? you can finetune a model from HuggingFace models
Yes - I do have a dataset containing four columns
- "Meta" (The meta description that I want to generate
- "h1"
- "h2"
- "p"
The three last ones are the ones that I want to somehow combine into an input that a model can then generate a meta description for
I have read on HuggingFace, but I cannot find anywhere, where I can input multiple features/columns - Do you have any good resources on that
what do colums 2-4 contain?
All are strings.
h1 will contain html <h1> strings
h2 will contain html <h2> strings
p will contain html <p> strings.
An example of a p instance:
'Special days or every day โ the simple but elegant FรRNUFT 24-piece cutlery set will add extra flavour to your table settings. The cutlery is made of stainless steel that is durable and easy to clean. Cutlery with a clean design goes well with many different kinds of table settings. Includes: Fork, knife, spoon, teaspoon and dessert/salad fork, 4 of each. The material in this product may be recyclable. Please check the recycling rules in your community and if recycling facilities exist in your area. Maria Vinka Stainless steel Stainless steel, Stainless steel Dishwasher-safe. For the flatware to be easy to clean and to reduce the risk of corrosion, always rinse off the remains of any food immediately. Wash, rinse and dry your flatware before using it for the first time. Width: 6 ยพ " Height: 1 ยผ " Length: 8 ยพ " Weight: 2 lb 1 oz Package(s): 1 Stainless steel is found in everything from building structures and cars to sinks and knives. Itโs easy to see why it has so many uses. Stainless steel is hard and durable and has good resistance to corrosion โ namely rust. It generally has a low nickel content, and for IKEA products we mainly use stainless steel thatโs nickel-free. Like many other metals, it can be recycled again and again to become new, hardwearing items โ without losing its valuable properties. All cutlery needs some care to maintain its original finish. Rinse cutlery after use to remove any scraps of food, even if you plan to wash it later. If you use a dishwasher, choose a shorter cycle, and dry the cutlery by hand. That way you avoid rust spots caused by food that sticks to the cutlery, or by leaving cutlery too long in the hot, humid air inside the machine. Good advice doesn\'t have to be expensive.'
Example of a h1 instance:
'FรRNUFT20-piece flatware set, stainless steel'
And example of a h2 instance:
'Product details Measurements Reviews Material Quality'
So I want to somehow include these three strings in a model to generate a meta description
if you crop the para,
then make them into a single sentence - delimited by a special character like <DELIM>
pass that service token into the model when building it (there would be a list, indlcuding <END>, <START> etc. just add your own to that list)
Okay, so it would be like
FรRNUFT20-piece flatware set, stainless steel<DELIM> Product details Measurements Reviews Material Quality <DELIM> {Long Paragraph}
yes
no
it would be something like this in the tokenized form
["FรRNUFT20-piece", "flatware set", "stainless steel", "<DELIM>" "Product details",....]
a list after tokenization
Ah, alright. So I can use an autotokenizer for that, somehow?
dataset can use any delimiter
FรRNUFT20-piece flatware set, stainless steel|#| Product details Measurements Reviews Material Quality |#| {Long Paragraph}
ye, you would have to modify it
or, if you're too lazy - you could just add a delimiter in the dataset and hope the model learns it itself
So I could essentially join the three columns and add |#| between them. And then I can somehow tell the bert model to understand |#| as the delimiter?
it should pick that up on its own
Alright - I'll give that a go!
So I combine it and I have something like this:
'FรRNUFT20-piece flatware set, stainless steel |#| Product details Measurements Reviews Material Quality |#| Special days or every day โ the simple but elegant FรRNUFT 24-piece cutlery set will add extra flavour to your table settings. The cutlery is made of stainless steel that is durable and easy to clean. Cutlery with a clean design goes well with many different kinds of table settings. Includes: Fork, knife, spoon, teaspoon and dessert/salad fork, 4 of each. The material in this product may be recyclable. Please check the recycling rules in your community and if recycling facilities exist in your area. Maria Vinka Stainless steel Stainless steel, Stainless steel Dishwasher-safe. For the flatware to be easy to clean and to reduce the risk of corrosion, always rinse off the remains of any food immediately. Wash, rinse and dry your flatware before using it for the first time. Width: 6 ยพ " Height: 1 ยผ " Length: 8 ยพ " Weight: 2 lb 1 oz Package(s): 1 Stainless steel is found in everything from building structures and cars to sinks and knives. Itโs easy to see why it has so many uses. Stainless steel is hard and durable and has good resistance to corrosion โ namely rust. It generally has a low nickel content, and for IKEA products we mainly use stainless steel thatโs nickel-free. Like many other metals, it can be recycled again and again to become new, hardwearing items โ without losing its valuable properties. All cutlery needs some care to maintain its original finish. Rinse cutlery after use to remove any scraps of food, even if you plan to wash it later. If you use a dishwasher, choose a shorter cycle, and dry the cutlery by hand. That way you avoid rust spots caused by food that sticks to the cutlery, or by leaving cutlery too long in the hot, humid air inside the machine. Good advice doesn\'t have to be expensive.'
How would you recommend tokenizing this? I saw some special tokenizer apis in the huggingface
ecosystem. Is there a special tokenizer I can tell to treat the |#| characters differently?
Actually, I see that different tokenizers have different sequence tokens https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/roberta#transformers.RobertaTokenizer
why my accuracy is not increasing what might be the issue?
Ok thanks
:incoming_envelope: :ok_hand: applied mute to @zenith tide until <t:1643635727:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
So this is how I should do it instead of the |#|
yh, you don't need to add another token
but you would have to do some modifications to instruct the tokenizer to place <SEP> where you want it to be - between the different columns' sequences
Alright - I thought perhaps I could do it with a script
I'll try to look into it, thanks ๐
Hello, i trained my model (vgg19) and evaluated using kfold cross validation
I get an average accuracy of 82
But on when i evaluate on the test set the accuracy is 74
Is there anyway to improve the model
Hey guys - I have the following after I do datasets.load_dataset
DatasetDict({
train: Dataset({
features: ['index', 'url', 'meta', 'title', 'h1', 'h2', 'p'],
num_rows: 23691
})
})
I want to tokenize using the following function
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')
def tokenize_function(example):
return tokenizer(example["h1"], example["h2"], example["p"], truncation=True)
tokenized_dataset = dataset["train"].map(tokenize_function)
But unfortunately, I get the following error text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
@wicked grove if this is for supevised machine learning, then apply regularization, and check which data gave poor accuracy possible outliers or requires feature engineering ( check for correlation and use feature selection techniques)
I can't understand how to find the best fit line with least squared method, can anybody suggest me a good resource for it
This statistics video tutorial explains how to find the equation of the line that best fits the observed data using the least squares method of linear regression.
My Website: https://www.video-tutor.net
Patreon: https://www.patreon.com/MathScienceTutor
Amazon Store: https://www.amazon.com/shop/theorganicchemistrytutor
Disclaimer: Some of t...
Yes it's supervised
I have applied dropout
How do i check which data gave poor accuracy??
Cause my test is just numpy array which i get after train_test_split
You should probably try to visualize the test set
Maybe it is more difficult than the data used for cross validation
Rightt,But how can i visualize it ๐
In the final result can i display val acc and test acc?? And what happens if there is a 10% difference between val and test acc
then you probably overfitted on val data somehow
Or there is not enough data to make an accurate prediction on the accuracy
maybe average accuracy over 100 runs
Hmm
How can i improve that
Also can i use cross_ val_score with a cnn model??
ๆณฃใใชใชใซใใผ โ ๐ โ already gave some good suggestions
cross validation is not bound to a model type
it works on any prediction task
hello i wanna ask you guys should i learn machine learning then neuroscience with robotic or Ai
neuroscience is only losely related to most machine learning topics and alogirthms
you want to maybe start with linear algebra and statistics
yeh i'll start with it
other question do machine learning do only prediction or also do learning it self by retrieve data training
Hey, i am using a simple LSTM NN with a Dense output layer that has one unit (i only want to predict on numerical value). Model fitting and so on works but i am confused with the output of model.predict() when i let it predict on one batch.
You can see the prediction in the picture for some reason it is 3 numbers and not just one. the upper matrix is the prediction at index 0 and the number at the bottom is the actual number it should predict
Linear regression line formula is y=mx+c
Here we have to find m and c, after finding m and c
,we have to put them in the y=mx+c equation, now as we know the values of x, so now we will put the values of x in the the y=mx+c equation and we will get the y values.
Now after this we will plot the y and x co ordinates on the graph (these are the predicted values) and draw the regression line. After that we will use Rยฒ method to check whether our line is best fit or not, if it's not then we will repeat the whole process again, which will give us new values of m and c, so we will put that new m, c and the original x values in the y=mx+c equation and now we got a regression line which fits in a better way than earlier
Am I right ?
Not really, if you watch the video it will show that you will need both the x and y values of every datapoint
with those you can get the line of least squared (so calculate m and c)
It's not an iterative process
After you got the line equation, you can predict the y value of new x values you get
other question do machine learning do only prediction or also do learning it self by retrieve data training
not sure what you mean
hello, would like to please ask, I have a dataframe and would like to train test split it but when I use sklearn library it gives me a dataframe - I would like instead it to be 2d array with the numerical values not the whole dataframe, how to do please?
you can use np.array(dataframe)
nice๐
heyy, i had a doubt...do i have to save my model before i use model.evaluate on the test set
or can i train and then evalueate it
Don't need to save it
as long as you have trained the model you can use it to evaluate data
saving is just so you can use it after terminating the programming so you can load it again
Ohhh okayy, thank you so muc
something is going massivley wrong with training my LSTM NN. for some reason the model predicts always the exact same value no matter the input.
hey
I had another interview problem I wanted to share with y'all. I promise there won't be too many of these! This one was totally different from the previous:
- Given the dataset Labeled Faces in the Wild [http://vis-www.cs.umass.edu/lfw/], create a system [sic] which will identify the person's nose and draw a green box around it. Time limit was one hour.
I didn't get through this one as well (I don't usually work with CV) but it was fun to try it out, and I submitted what little I had, haha.
[Edit: The job title was "Data Scientist, Agriculture".]
How would y'all do something like this?
My solution, which did not ultimately yield any fruit, was really naive. I did a standard NN-type thing, and then tried to do some LIME stuff to see if there was a "nose"-type feature (spoiler: there wasn't) and then I panicked and tried to do it for eyes which worked a bit better, then I made a box from eyes to a set number of pixels down, and then cut that box in half, basically hoping the nose was in that area.
It was a terrible system. I know now that there's the facial_landmark stuff in cv, haha, but it was honestly a fun time. At least y'all can get a little chuckle at my attempt. :']
Yeah not sure how I would have done that without these really useful libraries tbh
You either already need a network that can detect noses or manually annotate them yourself and somehow train a network to extract the location of a nose
Yep. I think that it may have been the point of the problem to have already known the libraries and resources, haha, which I def did not.
Image hash (sliding window) first attempt.
Huh, I never heard of image hashing, that's pretty cool.
Well, a bunch of sliding window stuff will work. No need to go all crazy with a deep NN.
(Since it's to detect a specific thing)
(And only 1 hour)
I've got a very shallow level of understanding for NN stuff --- slowly increasing, but, you know, haha --- so I'll have to check that method out to see how it works and what it does.
Any simple method will work that is hand crafted to detect noses, so OpenCV's built in stuff should work. The reason it should work is because the dataset description basically gives away the answer: "The only constraint on these faces is that they were detected by the Viola-Jones face detector."
Viola-Jones is old school face recognition before everyone started using DNNs.
Hm, lot'sa stuff I dont' know about. I'd prob have given this a glance if I knew it was going to be entirely CV-related, haha.
Only works on nice clear images.
Otherwise it gives way too many false positives.
The label being the person's name is ofc not useful, so they really want a specific purpose / hand crafted thing (for noses in this case).
(Viola-Jones is specifically hand-crafted filters for faces)
Having now read the dataset description, I would just throw OpenCV at it.
Although that would not get the best results possible.
can you send the code for the model?
i've got help on another server already thanks tho
hi everyone.
iam wondering if somebody knows of an easy way to schedule colab notebooks on your google drive. i would like to let some notebooks run on automatically schedule once per day.
You would have to see if there's a colab API, but I think it's pretty unlikely that they support that, as colab is intended to be a learning environment, not a general purpose host for Python programs.
i see, thanks @serene scaffold
why column "one" has float value and column "two" has integer value please help!!
they're not the same length
can someone help me understand the hidden state in RNNs?
I just built an RNN and trained it on a dataset with 5 different words, so the inputs are all one hot vectors like
[0,0,1,0,0]```
but each cell has a hidden state with 256 weights
hidden_size = 256
class MyRNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(MyRNN, self).__init__()
self.hidden_size = hidden_size
self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
self.in2output = nn.Linear(input_size + hidden_size, output_size)
def forward(self, x, hidden_state):
combined = torch.cat((x, hidden_state), 1)
hidden = torch.sigmoid(self.in2hidden(combined))
output = self.in2output(combined)
return output, hidden```
but I'm so confused how it works from here
the forward function concatenates the 5 vector to the 256 vector, creating a 261 vector
then it passes that to nn.Linear with arguments 261, 256
and then nn.Linear does a linear transformation to make the 261 vector into a 256 vector?
what exactly is happening here
how we apply softmax to sklearn neural network ?
Anyone has tried O Reilly , packet publication books for data science or ML or DL ??
In linear regression the m and c values will keep changing unless we get the best fit regression line, right ?
Hie @potent jolt I have got the following data frame and i would like to delete rows where CHILD_DEATHS and WOMEN_DEATHS is equal to zero but when i do that, its deleting all rows with zero in any of the columns
Sorry if I am at a wrong channel
del dataframeName. loc("rowNane")
The problem is I don't know the row name. I want to delete all rows where child deaths and women deaths is equal to zero but it's deleting all rows where women deaths or child deaths is equal to zero
any pyspark expert here??
I am trying to insert a dataframe in a table. the dataframe what i have gives me output like this _
| Field | Value |
|---|---|
| id | 95 |
| name | N04 |
| parentId | 702 |
| parentExternalId | N7 |
| description | WAT.. |
| metadata | {_replir... |
| source | https://github.co... |
| Id | 43636 |
| createdTime | 2021-10-13 11:58:... |
| lastUpdatedTime | |
| rootId | 6978 |
| agregates | null |
| dataset | null |
| labels | null |
And I have multiple rows here. Now I am trying to insert this data in one of the table which has columns as (PointID, Name, PointParentID, desc). I would like to insert Id column in PointID, name in Name, parentID in PointParentId and description in desc. How can I do this using pyspark sql. I tried below command and it works in inserting a single data in a table with a single column.
df = spark.sql("Insert into `Utilities_66_Demo`.`UtilityProduct` values('Hi')")
df = spark.sql("Select * from `Utilities_66_Demo`.`UtilityProduct`")
df.show()
Same way I have table with 4 columns defined above and a dataframe where I need to insert data to.
But how can i insert a data completely in single call from the dataframe whose output I have shown above. the dataframe I got is from this code:
dfSourceAssets = spark.read.format("xyz.spark.v1") \
.option("type", "assets") \
.option("tokenUri", sourceTokenURI) \
.option("clientId", sourceClientID) \
.option("clientSecret", sourceClientSecret) \
.option("baseUrl", sourceBaseURL) \
.option("project", sourceProject) \
.option("scopes", sourceScopes) \
.load()
Whenever I run the huggingface trainer api (trainer.train()) my kernel crashes
I am new to stats in general. I am building a logistic regression model, and a few vairables have a 2.6 or 3.2 correlation coefficient. Is that good since it's a positive relationship?
Also the r-squared score is 26% which is too low, right?
So there are many methods by which we can get the best fit linear regression line, so do we need to learn all the methods or any one ?
I probally need to just learn all of it. Typing it out has made me realize I really have no idea what I am doing. Any great resources for learning from scratch on these models?
Most google articles seem to think I have a great understanding of models already
A correlation coefficient should be between -1 and 1, where 1 is a strong positive relation (if one var is big, the other is too) and -1 a strong negative relation (if one var is big, the other is small) and 0 signifies no relation.
(probably a bit oversimplified but that's the just of it)
If the correlation coefficient for the correlation between an input variable and output is not 0, the input variable could contain useful information for predicting the outcome variable
And if it is 0, sometimes it still might be useful information, but in combination with other variables only (no linear relation with the outcome)
@grizzled stirrup
Does that explain it a bit?
This was actually very helpful @mild dirge ! Thank you a lot. I was a bit thrown off on why one could come back as more than 1 (2.6 for example) as I did think it had to fall between the -1 and 1 line. However, you cleared many things up! Thank you ๐
Also when two input variables have a very strong correlation, that often means that you can basically remove one of the variables, since they contain the similar/same information @grizzled stirrup
Did you perhaps calculate the covariance instead of correlation coefficient? that might give you values above 1 and below -1
Hmm... I am looking at it is saying coefficient estimate and that gives the 2.x number. Perhaps I did do that
There's plenty of detailed explanations of matplotlib, pandas and numpy to find so if you're really interested you should probably look it up on google/youtube. Why do you think they are useless?
@lapis sequoia
Or maybe at what part specifically do you think they are lacking?
I wrote an explanation of what they do in the pins. Numpy does large amounts of math in batches, which is very useful for data science. Pandas does large amounts of tabular data manipulations in batches, which is also very useful. and matplotlib makes data visualizations, which is useful.
(even though the matplotlib api makes me sad)
Corey schafer has some good tutorials on matplotlib and pandas
Just processing a lot of numbers at once
like calculating the sum of a vector
or adding two vectors together
that kinda stuff
like an array
not immutable
but fixed size yes
fixed size, not immutable
and normally also only contains 1 data type
whereas lists often can have mixed strings, numbers, etc.
also, arrays are intended to make it easy to express "larger" math operations. for example
!e
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
result = a + b
print(result)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
[ 6 8 10 12]
in "regular python", you would have to do something like [ax + bx for ax, bx in zip(a, b)]
and then if it were a 2d array, it would have to be like that, but nested.
Which book would be better to learn data science from scratch ??
np.array returns an array, and the __add__ method of it does element-wise addition, returning a new array
can you use scrapy when you need to interact with the webpage to get the data you need?
I know you can use selenium in that case but since selenium is waay slower I was wandering if there is a workaround using scrapy
say for example, you need to input txt into a text field to get the data you need
can you use scrapy for that?
@lapis sequoia making more sense?
I'd like to combine GP with DL to identify regions in the AST that should be mutated more than others, any idea which model could work for that?
Hey, i have a LSTM NN with 11 input features. Is there an easy way to find out what features contribute the most to the output of the NN?
hey is this the right channel to ask a matplotlib specific question or is general better?
this is the channel for matplotlib questions; questions in #python-discussion are usually not answered, since the channel moves so quickly.
great thank you!
I'm trying to decorate the Figure class's add_subplot method.
this is what I have:
class ChildFigure(Figure):
def __init__(self):
self = figure()
def add_subplot(self, *args, **kwargs):
ax = super().add_subplot(*args, **kwargs)
# do something
return ax
When I call the method I get an "'ChildFigure' object has no attribute 'transSubfigure'" exception
Where am i going wrong?
sounds like the exception happens "further down"; try sharing the whole traceback
!traceback
Please provide the full traceback for your exception in order to help us identify your issue.
A full traceback could look like:
Traceback (most recent call last):
File "tiny", line 3, in
do_something()
File "tiny", line 2, in do_something
a = 6 / b
ZeroDivisionError: division by zero
The best way to read your traceback is bottom to top.
โข Identify the exception raised (in this case ZeroDivisionError)
โข Make note of the line number (in this case 2), and navigate there in your program.
โข Try to understand why the error occurred (in this case because b is 0).
To read more about exceptions and errors, please refer to the PyDis Wiki or the official Python tutorial.
hm okay let me figure out how to do that
in the meantime this is my full code:
class ChildFigure(Figure):
def __init__(self):
self = figure()
def add_subplot(self, *args, **kwargs):
ax = super().add_subplot(*args, **kwargs)
# do something
return ax
# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
fig = ChildFigure()
ax = fig.add_subplot(111)
ax.plot(t, s)
show()
it works when i substitute ChildFigure() for figure(). trying the traceback now,
File "c:\python\examples\ChildFigure.py", line 22, in <module>
ax = fig.add_subplot(111)
File "c:\python\examples\ChildFigure.py", line 13, in add_subplot
ax = super().add_subplot(*args, **kwargs)
File "C:\python\anaconda\lib\site-packages\matplotlib\figure.py", line 772, in add_subplot
ax = subplot_class_factory(projection_class)(self, *args, **pkw)
File "C:\python\anaconda\lib\site-packages\matplotlib\axes\_subplots.py", line 34, in __init__
self._axes_class.__init__(self, fig, [0, 0, 1, 1], **kwargs)
File "C:\python\anaconda\lib\site-packages\matplotlib\_api\deprecation.py", line 456, in wrapper
return func(*args, **kwargs)
File "C:\python\anaconda\lib\site-packages\matplotlib\axes\_base.py", line 615, in __init__
self.set_figure(fig)
File "C:\python\anaconda\lib\site-packages\matplotlib\axes\_base.py", line 756, in set_figure
fig.transSubfigure)
AttributeError: 'ChildFigure' object has no attribute 'transSubfigure'
is my problem the self = figure() in consturctor
You want the child to behave just like a regular matplotlib figure right?
yeah
So why not just do
class ChildFigure(plt.Figure):
def __init__(self, **kwargs):
super().__init__(**kwargs)
Would be my first instinct, not really comfortable with this, but since no-one else is replying
This also gives problems further down the line though
yeah i tried that at first but the figure dosn't render. i think figure() does some stuff behind the scenes adding it to a figure manager.
Yeah, so when setting the current figure to your child class instance you can do plt.figure(fig)
but that gives the following error
Traceback (most recent call last):
File "C:\Users\Jory\PycharmProjects\PathOfExile\tool.py", line 19, in <module>
plt.figure(fig)
File "C:\Users\Jory\PycharmProjects\PathOfExile\venv\lib\site-packages\matplotlib\pyplot.py", line 753, in figure
raise ValueError("The passed figure is not managed by pyplot")
ValueError: The passed figure is not managed by pyplot
apparently you can do something like:
plt.figure(FigureClass=ChildFigure)
But this gives me an empty plot
import matplotlib.pyplot as plt
import numpy as np
class ChildFigure(plt.Figure):
def __init__(self, **kwargs):
super().__init__(**kwargs)
def add_subplot(self, *args, **kwargs):
ax = super().add_subplot(*args, **kwargs)
# do something
return ax
# Data for plotting
t = np.arange(0.0, 2.0, 0.01)
s = 1 + np.sin(2 * np.pi * t)
fig = ChildFigure()
ax = fig.add_subplot(111)
ax.plot(t, s)
plt.figure(FigureClass=ChildFigure)
plt.show()
ah yeah i was wrestling with that error too.
oh you may be on to something with the FIgureClass arg.
This actually shows something, but an empty plot :/
They actually got documentation on sub-classing figure ^^
maybe that link can give ya a push in the right direction, I don't think i'm of much help here
awesome thank you very much!
this worked!
class ChildFigure(plt.Figure):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
@staticmethod
def create():
return figure(FigureClass=ChildFigure)
def add_subplot(self, *args, **kwargs):
ax = super().add_subplot(*args, **kwargs)
print("in subplot\n")
return ax
now i can create the figure beforehand with fig = ChildFigure.create()
Awesome!
weirdly, no plot shows up when i put a breakpoint in the child's add_subplot method but its getting called correctly. must be some threading thing with the debugger. thank you very much for your help.
uh awesome this are some interesting ideas to solve this thanks a lot!
Don't spam to get verified
SORRY MY SMALL BROTHER DID THAT
what are you gonna do about it
CAN U STOP AND MIND UR OWN BUISINESS I THINK SO ITS BETTER
what is the significance of the number of epochs? I am unsure how to select and optimize my model
when you ask "what is the significance", are you asking what an epoch is?
I know it is like a "cycle" almost, but I'm not sure how to determine how many cycles my model needs
you can't really know in advance
Basically you divide the training dataset into "batches" or groups of training examples based on the batch size (if batch size = 1,000 in a 100,000 sample dataset, you get 100,000/1,000 = 100 batches), and if you set the number of epochs to 1, you're saying you want to go through the entire dataset once, which means 100 iterations (one for each batch). For 3 epochs, it'd be 3 times 100 = 300.
How many iterations do you need? You can't really know in advance, but there's a trade-off: the amount of time and computing resources you have vs. the possibility of getting better results if you add more epochs. So you have to decide on a case-by-case basis.
You can also risk over-fitting if you use too many epochs
true, but there's tools like early stopping to handle that
I would say a good strategy is going for 1 epoch, seeing how long it takes to run, and based on that, choosing the number of epochs, early stopping and so on. So if it took 20 min for 1 epoch, maybe I'd use 10 epochs if I'm only willing to wait 20 min * 10 epochs = 200 min = a bit over 3 hours.
Log messages are very easy to generate since it is just a string, a blob of Jason or typed key value pairs. Event logs provide valuable insight along with context providing detail that averages and percentiles don't surface.
I don't understand last part "along with context providing detail that averages and percentiles don't surface" - can someone explain that part?
Ahhhhh okay, thank you all for your response
How do I get access to tf.compat.v1 in tensorflow? import statements don't allow me to access it
Using tensorflow 2.3.0
Perfect lol the fake AI GF for valentines
I have this RNN in pytorch
class MyRNN(nn.Module):
def __init__(self, input_size, hidden_size, output_size):
super(MyRNN, self).__init__()
self.hidden_size = hidden_size
self.in2hidden = nn.Linear(input_size + hidden_size, hidden_size)
self.in2output = nn.Linear(input_size + hidden_size, output_size)
def forward(self, x, hidden_state):
combined = torch.cat((x, hidden_state), 1)
hidden = torch.sigmoid(self.in2hidden(combined))
output = self.in2output(combined)
return output, hidden
def init_hidden(self):
return nn.init.kaiming_uniform_(torch.empty(1, self.hidden_size))```
and i'm really confused about how the hidden state works
the internet says that kaiming_uniform is basically random, except that the random values are chosen to best reduce the vanishing gradient problem
but then I'm really confused about this line
hidden = torch.sigmoid(self.in2hidden(combined))```
it passes the hidden state vector to a neural net layer and then puts the output through sigmoid?
Hey @lapis sequoia!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
hist,_,_ = np.histogram2d(x, y, bins=(1920,1080))
dpi = 300
fig = plt.figure(figsize=(1920/dpi, 1080/dpi), dpi=dpi)
plt.imshow(hist.T,cmap='hot')
plt.savefig("graph.png")
How would I need to call the histogram function if I want to start the X,Y coordinates on the top-left? My input coordinates also behave like that
yo what this means?
do i need to set up something first before using tensorflow?
i dont have dedicated gpu but that message is ok even i ignore it?
i thinkyou can ignore it
I have a data from which consists numerical and categorical columns along with date(yyyy-mm-dd)
Date l col_1 l col_2 l col_3 l
2021-01-01 100 0 200
2021-02-01 120 0 300
2021-03-01 140 1 300
2021-04-01 150 0 300
2021-05-01 160 1 300
2021-06-01 180 0 300
2021-07-01 120 1 300
Now I need to pass the dictionary which has percentage change for col_1 and col_3 and for col_ 2 it paas the categorical responses
dicct = {"col_1" : 10, "col_2": [0,1,0,0,1,1,0], "col_3":-5}
So this is my input now it should change my above df in such a way:
- My col_1 increased by 10%
- My col_2 is replaced by the list in dictionary
- My col_3 is decreased by 5%
Hello, i just wanted to know how nany images should a test set and validation set contain
If my total dataset had 3k images
does this suffer from overfitting or underfitting?
If there are 300 images in test set and 100 images for each class is that sufficient??
Hey guys i want to make an AI which can create a sentence for me.
I want to use it for a thing in a game.
So what i want to do is that i want to turn sentences like this:
selling m5 100Kselling house 232 2M
into this:
Im selling a "BMW M5". Price: $100.000Im selling the House Nr.232. Price: $2.000.000
How can i start the project? Does someone have tips and Ideas for me how i could create something like this?
PS: You can ping me when u answer to this.
Yea, atleast i would do that. test set is just to test right, How your model is performing
70/20/10 is the most commonly used division proportions
you can do 60/20/20 also if you want
or 65/20/15
thats your choice
I think you should use NLP (natural language processing) for this cuz it includes language processing
Okay.
So how would i start there? I never did anything with AI.
Is it possible to do it for german?
In order to learn NLP first, you should start learning just machine learning, there must be many tutorials on youtube in German. Then you should learn neural networks nd then you can properly understand NLP. i havent learnt NLP yet so i cant tell you in detail but this is what i have watched from the youtube videos
Hm okay.
Actually i don't want to learn all those things (even tho they are interesting) to do this.
Still thanks for your help.
Okayy thank youu:)) i thought 85/15 ...and im using kfold for validation
Cool
Does the validation and test have to be of similar size?
the greater the size, the better you can estimate the perfomance of your model
But there's no real benefit to having the exact same size no
There even exists something called leave-one-out cross validation where you train the model on all the data but 1 data point, and then test if the model guesses that data point correctly. And then you average that over n folds, where n is the amount of data points.
This is actually one of the more accurate cross validation methods, but takes a long time as you'd have to retrain for every data point
@wicked grove
So I found that for my matrix (np.tril(m).sum() + np.triu(m, 1).sum() + np.diag(m).sum()) (sum of upper and lower triangles and the diagonal) is slightly larger than m.sum() - how could that even be possible?
Hi. Have you ever used Chatterbot?
Is there any way, to change his language, to, for example, polish, korean?
Or I should use other libraries and translate text from polish to english and from english to polish?
What is slightly?
Triangle sum is 22800262 and the full matrix sum is 22800242
the diagonal is 0
you sure, can you print the sum of diagonal rl quick?
yeah that's the first thing I checked
np.diag(m).sum() == 0
also np.tril(m).sum() == np.triu(m, 1).sum() since it's a symmetric matrix
No I just print(np.tril(weights).sum() + np.triu(weights, 1).sum(), weights.sum())
hmm ok I do once thing at the start which is weights = weights.astype(np.float32)
not sure how big the matrix is and if it may round in between
If there are many many values between 0 and 1 that may be a reason (maybe?)
How come it gives an integer result for the sum though?
I just trimmed the final .0
I tried it again without casting and the total sum correctly becomes 22800262, matching the sum of triangles
ah alright
So it seems to be related to casting it to float?
but only affects the total sum?
yeah it seems to be a float32 accuracy issue
I haven't seen one like this before that manifested itself in this way
yeah because the number is big
and the float only has 32 bits
floats are more inaccurate further from 0 so that can make sense ^^
try 64 bit float maybe?
if you need floats that is
I think I'm most confused about the triangle sums being more accurate
What are you basing the accuracy on?
yeah I need to turn this matrix into probabilities
that it matches the integer sum result
ah alright
yeah if the values actually are integers, might as well not convert to floats ๐
If I don't numpy will do it as soon as I divide - but yeah I'll make sure to sum first before dividing
hey can you please help me in #help-cheese
has anyone ever done object detection with a dataset of videos?
What are you having trouble with?
oh not having trouble yet, just about to start collecting data and not sure where to start
i did object detection with images but not sure how it changes for video so wanted to ask here
Well using videos can have some slight benefits
Since you asked that I'm just reading some bits of this https://towardsdatascience.com/ug-vod-the-ultimate-guide-to-video-object-detection-816a76073aef
And you can f.e. use classifications of previous frames to your advantage
Obviously you could simply separate the images and have a regular image prediction problem, but having a video could give some extra info
Hi, nice to meet yall. I want to put a small group together to study machine learning algorithms together. Iโm thinking of having ML journal club and/or discuss various algorithms from general to deep concepts such as reinforced learning and evolutionary algorithms. Please dm me if you are interested.
so i am trying to learn machine learning (tensorflow in particular) and the maths involved in it but i am having some trouble learning it since i am a high schooler i did consider learning concepts like calculus first but dont understand certain calculations in it what should i do is there a particular list of topics that i can learn in order to understand all the concepts?
Cloud providers, including Google, offer managed services such as Google's Vertex Prediction, that help you perform continuous evaluation of your prediction requests. Continuous evaluation regularly sample's prediction input and output from trained machine learning models that you've deployed to Vertex prediction. Vertex data labeling service then assigns human reviewers to provide ground truth labels to your prediction input, or alternatively, you can provide your own ground truth labels. The data labeling service compares your model's predictions with the ground truth labels to provide continual feedback on how well your model is performing over time.
I am not sure what does "Continuous evaluation regularly sample's prediction input and output" mean. Word "sample's" confuses me
(i am aware that i dont really NEED to know the math behind it but kind of want to)
i am a beginner at this btw
@lapis sequoia You can audit this specialization https://www.coursera.org/specializations/mathematics-machine-learning
If you want to access course materials you have to pay for course or ask for financial aid which they give in a lot of scenarios
First two courses are very good but in third course there are not so much visualisations, it's too much theoretical and professor doesn't explains concepts that much clearly as in first two courses
this is awesome- thanks so much for sharing!
do i have to sign up for a trial?
@lapis sequoia
me?
yes
neural networks
Click "audit", don't sign up for a trial
i have learned linear regression and classification
oh okay good
which math particular
that dont understand for that model
is it derivates?
etc
i dont see that
Hm... I think convolution is the core concept of any ML utilizing neural network...
integrals in calculus
oh okaym hmm integrals is not needed though to implement a NN from scratch
or with any libraries
only if want to deep more into NN
not really i can search for them on the internet though
i am talking about calculus
the onlt calculus reallly needed to understand NN for internviews is what partial derivative that is, there are many oyutube videos that show this
No need. I had thought about some tryed ones
and little about maximum and minimum
Convolution...
other than that , there really is no more to calculus- its a huge field that will take you while before you can understand everything about calculus and its connections to ML
only need to know partial derivates for NN
i dont think its convolution nn i tried searching for it and i have heard it be mentioned but i dont think thats what i am trying to learn
Look up convolution... basics of anything that applies any kind of filter lol
Can you try to go to any other course and see if there is "Audit" option so you can know whether it's a case that in this courses there is no option for auditing them? So, if you don't want to pay for these courses, you can ask for financial aid - you need to ask financial aid for each course and you will wait 15 days for response whether you will get financial aid.
ooo
What maximum and minimum, max and min of what?
so can i just jump to that concept directly like is there any titourials?
@lapis sequoia in gradient descent this is what derivaite does and why we use - instead of + for updating weights.
w - learning rate * dw we use - because that gives min. IN addition, for NN you will see "adam" as optimizer that because in neural networks we can have local minumums and global minumums causing problems to find best values for our model while in linear regression we only have global minumums
lol, what you mean by "ooo"? Did you check for other courses whether you can audit them?
local mins and max is just what first derivative does , once you understand why we take derivaites thats reallly only needed for ML neural networks and just understand what each hidden layer or activatio function does
i dont ig i have to apply for financial aid
ig?
i guess
@lapis sequoia @lapis sequoia if learning NN, i recommend that you also try other sources like youtube video guides as they will show visuals that will help you understand the google notebook you are using to learn NN
but there is no need to learn integrals
if you do it will be more confusing
I am also interested about learning how partial derivation is done in neural networks. Can you guys provide any tutorial(s)?
Do you have any specifics videos for this topic?
I see
also on linear algebra in general
Partial derivatives tell you how a multivariable function changes as you tweak just one of the variables in its input.
About Khan Academy: Khan Academy offers practice exercises, instructional videos, and a personalized learning dashboard that empower learners to study at their own pace in and outside of the classroom. We tackle math, science, ...
@lapis sequoia Can you audit other courses?
is this a good tutorial to follow along with?
no i cant
many found that to be great, i personally dont use khan academy because they dont show practical examples which helps me learn more
3b1b normally puts a lot of effort into visualizing and explaining intuition, khan academy normally runs through an example problem in their videos
Weird, so if you go to other courses there is no "Audit"?
nope
@lapis sequoia You could ask Coursera helpdesk about that or google about your problem
Wait, are you logged in in Coursera? @lapis sequoia
yes
Weird
if i could i can just learn partial derivation it would probably be easier
are there any video specific recommendations i could watch?
something that maybe is beginner friendly
@lapis sequoia Or others, if you find any good video about this topic (how partial derivation is done in neural networks) please send me DM or tag me
alright i will
Hello all, I am having an issue with binning in Python using pd.cut()
I am able to apply the pd.cut() bins to my train and test data BUT because it is automatic, when I try to apply it to new data for predictions (after my model is complete) it gives different bins and is not recognized by the model.
Does anyone know of an easy (faster than manually binning) way to apply the same bins to my new data set?
Now that you've detected drift, what can you do about it? Well, first, try to determine which data in your previous training data set is still valid by using unsupervised methods, such as clustering or statistical methods that look at divergence.
I am interested how can clustering here helps? Can somebody provide an example? Or how can statistical methods helps that look at divergence?
Also, what in this context means "divergence"?
If you know about something that I asked about, pleasure use reply so I get notified
The way actual users experience your system is essential to assessing the true impact of its predictions, recommendations and decisions. For example, you should design your features with appropriate disclosures, built in clarity and control is crucial to a good user experience.
"you should design your features with appropriate disclosures," what does that mean?
I not sure as i need to see what site you are using or what context
what do you actually compare between cluster algorithm? Other than run time
Performance; accuracy of model
I'm doing a project where I do autonomous navigation using lidar camera fusion. Is there a single algorithm or library or framework I can use to do navigation and sensor fusion? I'm thinking of something like SLAM where it combined lidar and camera data to navigate. I need it to be able to run on a small device, so not too computationally heavy
i made a model generated its own trading strategy that beat the tesla return on investment by 74x. I would like to see your thoughts on it since I am going to make a free to use website using these generated strategies.
you might want to to add some disclaimers such as "this is not financial advice" or at least "I am not liable" if you share it
and make sure that your testing methods are reliable, not just "train and test on 100% of the data"
individual data frame operations are pretty highly optimized. the problem is that sequential data frame operations are not optimized at all, because there is no "compilation" step. each operation has to be a self-contained operation. frameworks like dask and pyspark sidestep this problem because applying a function to a dask dataframe does not actually immediately apply the function. it just builds up a "graph" of computations yet-to-be-done, and doesn't actually execute them until you request execution. so then the framework can inspect the full graph of operations and apply optimizations thereto.
if something is too good to be true, it probably is. i am not a finance person, but that seems like something you might want to backtest on more than just 1 ticker
also you need to be very careful of data leakage when evaluating "sequential" predictive models
consider that 10 1-step-ahead forecasts is often very very different from a 10-step-ahead forecast
Also the stock went from 5 to 1200 ๐
But yeah, really agree with the too good to be true argment
i dont understand how standard scaler works
standard scaler tries to give mean of 0 and standard deviation of 1. Overall standard scalre performs better than normalization , but we use normlaization for imaages like CNN algorithm or anything between 0-1 values , for regression we use standard scaler
Yo!
can anyone help me get through a project im working on... Im pretty fucked
Im working 7 days a week 10 hour days and I have a DS project im working on academically and its WRECKING my shit
perhaps? you have to ask your actual question to attract potential answerers.
I have a deadline to produce a predictive model
@prime hearth for your reference, editing a message to include a ping does not actually ping the person.
And i have no idea how to construct one. Im just working my ass off and really cannot wrap my mind around this shit in the time constraint i have
heres the objective text
To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business
oh thanks i never knew that;w as doing it this whole time
thanks
this server is not a code-writing service, so if you are not in a position to understand what you've been asked to do, it's not likely that anyone will be able to help you complete the assignment.
I didnt ask for anyone to write my code
I asked for help - to make myself clear. Im underarmed for what im assigned. I know this entirely
im desperate to apply what i DO know to accomplish what im working on. - if no one can or is willing to help i understand entirely
im just trying to make financial ends meet and finish this project
I understand. It's unfortunate that many countries don't adequately invest in the education of their citizens.
Do you understand what linear regression is, at a high level?
when talking about "levels" of knowledge, "high" is less.
just trying to get up to speed with what im assigned with my back against the wall
it just sucks - i got pressured into this course and Im working my ass off and trying to meet the courses deadlines
im already paid into it - following along but Its just kicking my ass - i want to learn but i cant access these resources that they claim to provide
I feel scammed paying 4k for a 6 month AIML course just to be in the dark while im trying to pay bills on my other hand you know
It seems like an easy enough project but i dont know how to build what theyre asking for
either way - if anyone is able to lend a hand i'd be incredibly grateful. Im not confident in my ability to perform where its required at this moment and Ive got family basically telling me im a failure for not being adequate.
ill fuck off i sorry yall
do you know what a "best fit line" is? (in this case, it doesn't need to be a "straight line", even though a line is necessarily straight by some definitions) @lapis sequoia
Im familar with it but
Im not entirely sure how to even apply it
I've got a data set on my hands but ive never built a predictive model or had time due to my work schedule to really attempt it in full.
So, you're familiar with it. This is a good start. Linear regression is a process for calculating the best fit line/curve given a bunch of points.
"There is a huge demand for used cars in the Indian Market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is larger than the new car market now. Cars4U is a budding tech start-up that aims to find footholes in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales and that could mean that the demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturer / except for dealership level discounts which come into play only in the last stage of the customer journey), used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.
As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
"
this is my context statement
I have a dataset to accomodate.
I understand that you have a lot on your plate right now. let's just focus on understanding some topics, so that you can continue as best you can.
Objective
To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
do you know about the scikit-learn library?
well, maybe we'll get to that in a bit. in the mean time, let's look at the data. Whatever format it's in, please put some examples in this chat as text (no screenshots).
the bare minimum objective is to "build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business."
so you have these: ``S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price`
so you want to be able to calculate what the Price value should be, given all the others.
yeah.
"should be" is where im stumped. I have no idea where to start when it comes to predictive analysis
well, it might be possible to do it just using the New_Price and Kilometers
generally speaking, cars lose value the more you drive them
Thins like the engine, power, and seats would probably be accounted for in the original price of the car
make sense?
I think since it's also for data analysis, they might want you to use those for the linear regression too
to see the impact of those on the price
Im also considering
brand names
like maruti as opposed to luxury brands like audi , mercedes
you'd have to encode the brands in some way, since linear regression deals with numbers
I understand the idealogical approach - its the execution and model building where im unknowledgeable.
You could do it yourself with numpy, there's also libraries that are really easy to use
like scikit that stelercus recommended
you can use this: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Examples using sklearn.linear_model.LinearRegression: Principal Component Regression vs Partial Least Squares Regression Principal Component Regression vs Partial Least Squares Regression, Plot ind...
the usage might look like this
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df[['New_Price', 'Kilometers']], df['Price'])
and then you can do model.predict with some array of shape (n, 2) where the first column is the new price and the second column is the number of kilometers driven.
is scikit something i can import from my Anaconda toolkit or am i going to have to download and import from elsewhere
I don't use anaconda, so idk.
the project is out of a Jupityr notebook.
I feel like a fucking idiot whos way in over his head
I understand that this is a frustrating, and somewhat embarrassing situation for you. I would just focus on doing what you can.
Im just trying to get scikit into my environment right now
it would usually be pip install sklearn. idk how it works for anaconda.
Just trying to figure it out
Cant even get a foothold on a starting point
im so fucked im sorry yall.
back up. what do you know?
you said that you have worked with pandas and seaborn. that's fine. but what do you know about modeling, stats, machine learning, etc?
it's fine if the answer is "not much", but given that you are taking a course and are expected to complete this assignment, i highly doubt that you know nothing
if you list the things that you do know how to do, then we can maybe help you work up towards something achievable with your current state of knowledge
in general, python packages need to be installed. you should be able to install it in Anaconda
i am very curious what course you took that cost 4k USD, that's quite a lot of money
imo if you are paying money like that, you deserve some kind of one-on-one assistance if you are struggling
Its a mygreatlearning course partnered with UT texas mccombs school of business
AIML
and none of the instructors are responsive
at all
I dont even think I can.
$4k is like what you pay at a big name university for credits
They structure payments where its 1k each
and they basically rapid fire them at you before youre even halfway done w the course
i'm sorry to hear that. it sounds like a money mill scam
In an attempt to appease external pressures I signed up for the course but
due to work im just behind
external pressures from who?
family
i see
so how behind are you exactly? have you tried reaching out to an administrator who is not a professor?
4k is way more than one should have to pay for a single course in the US.
but Im getting fucking rocked
Its a 6 month post graduate program
basically my family is dissapointed in what i do for a living
and constantly harps on me to do this shit
because the market is "hot" etc etc etc
I know id be capable but. i have bills to pay. i work 7 days a week (voting machine company / way behind bc of covid and supply chain)
so I have to travel , work long hours, and when im finally at the hotel im fuckng exhausted
yeah, even individual courses at ivy league schools don't cost that much, or didn't when i was in school
trying to catch up but im just extremely disoriented