#data-science-and-ml
1 messages · Page 7 of 1
can anyone recomend a python package for excel data?
someone's hungry
pandas
in either case, I think the quote is a facetious remark about the graphic and how it doesn't communicate anything without context.
pandas is like the gold standard?
thats fair.
i used to use pyxl or whatever a few years ago
pandas uses openpyxl to do the actual interfacing with excel
i think it really represents the popularity of the transformer model in various use cases
robots in disguise
Ok, now I have a series of boolean, should I use dataframe.loc??
yes. good job 💚
thx Stelercus, you're tough but wise!
I've just realize I'm having the same problem, only get the limit dates, not the ones in the middle
you should look at the data, sorted by the date column, and make absolutely sure that rows between the two dates are even there
because what between does is pretty clear.
this is a piece of the series of between method:
7166 True
7167 False
this is the result of printing those indexes:
print(data_relevant["FechaEncuesta"][7167]) // 2022-08-07 00:00:00
print(data_relevant["FechaEncuesta"][7166]) // 2022-07-07 00:00:00
is the dataframe even sorted on FechaEncuesta?
because that isn't required to be the case for Series.between to work.
like this?
data_relevant = data_relevant.sort_values("FechaEncuesta")
yes
its sorted and still doesn't work....😟
are you using a jupyter notebook?
nope, vs code
hmm
can you show me the code for when you call between? because I don't even know what your end dates are
What would happen if you did backprop with a slightly modified derivative?
slightly modified how?
Mhm like it had a +1 and you removed it
removed what?
The +1 term
so like, if you're taking the derivative of 2x^2 + 4x, use 2x instead of 2x + 4?
Hey @hazy saddle!
It looks like you tried to attach file type(s) that we do not allow (.zip). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
@hazy saddle you don't have to show all the code and all the data. just an example that replicates the problem.
ok, the data too heavy to attach, but
I have to go, see ya!
it's not that it's "too heavy to attach", it's that our moderator bot deletes messages if they have a zip attached.
thank for your help
Correct.
sh*t, you're right
what are two end dates that will reproduce the problem?
import pandas as pd
from datetime import date, timedelta
from utils import get_Week
file = "InfoAbaste221072022.csv"
data = pd.read_csv(file, sep=";",
encoding="latin-1",
parse_dates=["FechaEncuesta"])
markets = list(set(data["Fuente"]))
columns = ["Fuente", "FechaEncuesta", "Grupo", "Ali", "Cant Kg"]
data_relevant = data[columns]
data_relevant = data_relevant.sort_values("FechaEncuesta")
first_day_week1, last_day_week1, first_day_week2, last_day_week2 = get_Week(data)
first_week_filter = data_relevant["FechaEncuesta"].between(first_day_week1, last_day_week1)
first_week_data = data_relevant.loc[first_week_filter]
I'm lookin between 2022-07-07 and 2022-07-13
In [18]: df.loc[df['FechaEncuesta'].between('2022-07-07', '2022-07-13'), 'FechaEncuesta'].unique()
Out[18]: array(['2022-07-07T00:00:00.000000000', '2022-07-13T00:00:00.000000000'], dtype='datetime64[ns]')
In [19]: df.loc[df['FechaEncuesta'].between('2022-07-07', '2022-07-10'), 'FechaEncuesta'].unique()
Out[19]: array(['2022-07-07T00:00:00.000000000'], dtype='datetime64[ns]')
In [20]: df.loc[df['FechaEncuesta'].between('2022-07-07', '2022-07-09'), 'FechaEncuesta'].unique()
Out[20]: array(['2022-07-07T00:00:00.000000000'], dtype='datetime64[ns]')
there just aren't any days between those two days, except those two.
you can do df.groupby(df['FechaEncuesta'].dt.day).head(), and you will see that it goes from july 7, straight to july 13, with no days in between.
not sure I understand...
look
print(data_relevant["FechaEncuesta"][7167]) // 2022-08-07 00:00:00
that's in august.
i'm confused, i guess the datetime is backwards, if i look the data in text editor find this:
Bogotá, D.C., Corabastos;**08/07/2022;**20:18;TL;WOL099;null;'25;CUNDINAMARCA;'25040;ANOLAIMA;null;null;VIN VND CIDRA;VERDURAS Y HORTALIZAS;Calabaza;800;KILOGRAMO;1;800;LMCORTESR;
if you have 08/07/2022, there's no way to know if "8 July 2022" or "7 August 2022" is intended, unless you know what format they're using a priori.
whereas if you have the year first, like 2022/8/7, then it's known to be year/month/day
in your data, is the day or month first?
ok, thanks for your help! good night
df["Total Average"] = df.iloc[:, 6:19].mean(axis=1) #Calculate average of all rows from column 6 to column 19
I'm trying to calculate the average of all rows from column 6 to column 19, but how do I make it skip any columns with a value of 0?
they can be between 0-5
you can replace those zeros with NaN, and then it will take care of itself.
I’m starting school in the fall and have a lot of questions about the actual math involved. I feel like I’ve come a long way from knowing no data science, to knowing some and the field of AI and data science is enticing, but I’m wondering how much math I’ll be using in practice.
I like, even love this field and want to work hard at it. I just also know I’ve struggled with math in the past. What’s helped you guys? Were you always innately good at math, or did it take work for you to get where you needed to be?
you need to understand the math behind AI to approach some problems intelligently. you'll never actually do any calculations by hand, but you should be able to if you had to.
I think people psych themselves out about math. you can learn math. if you're in the US, chances are, the techniques your teachers used to teach you math were pretty shitty. don't talk yourself into thinking that it's more arduous than it has to be.
This is what I believe too. In high school I wasn’t super motivated with math, and I think it left me with a lot more questions than answers. Math as it’s taught in high school tells you to assume a lot of things, rather than gives you the explanation for why things are. I struggled with that a lot. It’s harder to remember things if you don’t know why they are that way.
Does anyone have an example or a real application of the gym library on factories or something similar?
Okay I got a pretty basic and probably dumb question
Why do we do EDA on the data we have?
it can give you some idea of which tools might be effective for whatever you want to do with the data
Okay so after I do EDA, I have some observations on which columns are useful and which might not be
So in the model training part I can use these?
sure, but it also helps you pick the model in the first place
By this do you mean that one would be able to identify whether they need classification algorithm or regression algorithm or that which exact model they should be using for this job (like RandomForest or SVC)?
that would be the idea, yeah
for example if you discover in a preliminary stage that your data exhibits some sort of 1 or 2D statistical invariance, then convolutional neural networks make sense. if there is a strong temporal correlation, then an LSTM makes sense. or maybe you discover that the problem doesn't require deep learning at all (deep learning doesn't always makes sense and is often not really needed)
So this is the reason why one should know the theory behind a machine learning algorithm?
yeah, at least at a reasonable level of understanding. no need to know all the math if that's not your thing, especially if you're not doing research. if to be able to use a tool well, you need to know when it makes sense to use it
there are scenarios where it makes sense to hit a screw with a hammer, but that's usually not what you wanna do
Ah dude I haven't been doing anything like this
I just make a few graphs to understand the distribution of the data and that's all
Then move onto making a model using a pre-decided algorithm
that's usually fine if you have enough data and computational power. if you lack one one or both of these, then knowing which model to use is vital
or if the data is not nice, too
I need to get here
Like for the past 2-3 months I was learning the theory behind ML algos and never really figured out why it's needed
And that thing has been eating me up since then
That's the whole reason why I got into Kaggle and Discord servers
where can i find tutorial about machine learing and data science and how can i practice them
guys, i'm new to competitions in kaggle . how do i load such big datasets and its taking a lot of time
Whether TensorFlow will automatically encode the categorical data if we have applied the input function to the model?
are you doing it locally?
haha the 'today' bullet point
💀
from Jacopo's "MLOps at a Reasonable Scale" talk
Hello again, is there any way i can change the data types of multiple columns in one go?
I wrote an article comparing Harry Potter and AI on medium(Ik it sounds weird, but check it out): https://medium.com/@hariaakash646/artificial-intelligence-is-basically-harry-potter-3251e5e3b64b
Read it when you get time and tell me how it is. And also give it some 👏 lol
Hey all! I trained a regression model in Google Cloud's Vertex AI the other day. When I set it up I forgot to export the test data (set at a random 10% sample). Now I want to do some additional testing. Do anybody know if it's possible to retrieve that subset of data afterwards?
Don' t know anything about that google cloud vertex, but did you use a seed for generating that sample?
@long perch
is there any reason why my centroids refuse to move to red? also I'm confused on how to interpret clusters
thanks for reply, im doing it on kaggle
i'm unable to read using pd.read_csv
best place to learn student teacher network's working?
i have a presentation to give.
show the code and the whole error message, please.
bros its 30c its too hot to work and code
use a fan and don't train any models locally.
hahahaha, i only train locally on sensitive data
laptops hot
i bought a water sprayer to spray myself
Hey @desert bear!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
man im really enjoying my calculus book so far
really happy
much much nicer than lin alg ive taken previously
do we need to use any library like dask or just pandas is enough
hey i am trying to make a sentensen to entety nlu model and i get this error: text Traceback (most recent call last): File "C:\Users\Sebastiaan\AppData\Local\Programs\Python\Python38\lib\code.py", line 90, in runcode exec(code, self.locals) File "<input>", line 1, in <module> File "D:\PyCharm 2021.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile pydev_imports.execfile(filename, global_vars, local_vars) # execute the script File "D:\PyCharm 2021.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "D:/fun/jarvis/nlu/classifier_test.py", line 61, in <module> for X, Y, in train_loader: File "C:\Users\Sebastiaan\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 681, in __next__ data = self._next_data() File "C:\Users\Sebastiaan\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\dataloader.py", line 721, in _next_data data = self._dataset_fetcher.fetch(index) # may raise StopIteration File "C:\Users\Sebastiaan\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch return self.collate_fn(data) File "D:/fun/jarvis/nlu/classifier_test.py", line 50, in vectorize_batch Y, X = list(zip(*batch)) ValueError: too many values to unpack (expected 2)
how can i fix that?
you can find my code hear: https://paste.pythondiscord.com/raw/ruxuxuleje
line #?
Hey guys for Gensims most_similar function how do I get a list of just the most similar word without the float value next to them
I was looking through documentations trying to do it and it wasn’t working
i found the couse it have to do with the text collate_fn=vectorize_batch
in line 58 and 59
these are not nessesery so i removed them and then i had this error
Traceback (most recent call last):
File "C:\Users\Sebastiaan\AppData\Local\Programs\Python\Python38\lib\code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "D:\PyCharm 2021.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "D:\PyCharm 2021.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "D:/fun/jarvis/nlu/classifier_test.py", line 61, in <module>
for X, Y, in train_loader:
ValueError: too many values to unpack (expected 2)```
@wooden sail currently just on linear function problems and am trying to write the equation for a like through point a,c with slope m but I am confused because that equations intercept depends on m so not sure how to write that
hmm?
Can you show an example of the output?
Like
can you give all the info? show the original problem, preferably
yes but you worded it poorly
I have no choice but to just use a symbol for the intercept ??
the whole thing is in symbols
Oh, it’s a,c extrapolated via m to give the intercept as an extension of ac
?
Doesn’t it depend if m is positive?
a line is of the form y = mx + b. we know when x = a, y = c. so subbing that in we get c = ma + b, or b = c - ma
then the eq is y = mx + c - ma
i need a face shape dataset for men
can i get dataset?
So point c minus m times value x at a u mean? Or what
c is not a point
It is in that question
Sorry yes I meant c as in c from the point not as in intercept as it’s sometimes written
C - ma is the intercept?
mhm
here's an arbitrary example i made up right now, in case you weren't convinced
In [1]: import matplotlib.pyplot as plt
In [2]: import numpy as np
In [3]: x = np.linspace(0,10,100)
In [4]: m = -3.345643
In [5]: a = 6.23234
In [6]: c = -9.039485
In [7]: b = c - m*a
In [8]: y = m*x + b
In [9]: plt.plot(x,y)
Out[9]: [<matplotlib.lines.Line2D at 0x7faaa86e3d90>]
In [10]: plt.scatter(a,c)
Out[10]: <matplotlib.collections.PathCollection at 0x7faaa86fcbb0>
In [11]: plt.scatter(0,b)
Out[11]: <matplotlib.collections.PathCollection at 0x7faac5085370>
In [12]: plt.legend(('line','(a,c)','(0,b)'))
Out[12]: <matplotlib.legend.Legend at 0x7faaa86e3f70>
In [13]: plt.show()
I just did a example but it required c + ma not -
Equation for the line perpendicular to y=5x-3, through point 2,1
Must be -1/5x + 7/5
dude ofc, because they're not asking you the same thing lol
please read the questions carefully
If it’s perpendicular u add not minus?
i think i need to answer you the same way rex did, unfortunately
What’s that
that sadly you don't listen, so i can't help you. good luck with your problems
I did listen and I did it how you said, but for this question the answer required adding
So instead of minus 2/5 you’d add
To get 7/5 not 3/5
not listening to people here is fine, but if you also don't read your book carefully, you're not getting very far
I did read it
It never said that you change method for perpendicular lines, in fact it didn’t rly go over this topic at all
Yeah I just checked it literally didn’t explain any of this
Perpendicular for points a b is y=(-1/m)(x-a)+ b and parallel is m(x-a)+b
Any tips on where to find Machine Learning communities? Just want it to be active.
non-discord, non-reddit would be best
at uni?
My uni is online I'm not sure if that's an option for me
Hey guys can someone help me out. I’m trying to PyTorch to train a model using my gpu but it has an insane memory leak where no matter how much gpu memory I give it it quickly overflows
do they offer some sort of platform to discuss stuff though
Rn i gave it 40 gb of gpu memory (with an a100$ and after 15 or so batches it is full and crashes
This is during inference as well
Not really, and probably no clubs for online students
I was thinking like forums for undergrads/professionals or something similar
i see. if you're a student though, you could try something like applying for coursera's financial aid to participate in their ML courses for free. then you could interact with other people taking the courses there via the coursera forums. other than that, i can't think of any suggestions. forums for undergrads are stuff like stack overflow and some discussions on researchgate. i wouldn't know what else to suggest, maybe someone else has more/better ideas
i tried to use tf.estimator.DNNClassifier to determine whether or not a string has swear words. although for every single string i enter it gives me [0.46180007 0.5381999] the first index is the probability of it being 0 (no swears) while the second is the probability of it having a swear (1). here is my code and csv files
the training data is pretty small
nvm i got it working!
working semi decently too
now i just need butt tons of data and im good to go
i'd be curious of the DNN outperforms a pile of hand-crafted regex. i assume the DNN is more likely to be able to learn obscure formulations like d̷̘͚i̸̹̤c̴̡̞k̵̮̳ ̸͖̂b̸͓͉u̴̡̖ţ̵̩̪t̵͓͕ and 【fuck】
fortunately you can manually construct a huge dataset of this
if anyone has a large dataset of strings that contain swear words, dm me
i was just going to suggest generating your own
yeah just saw that
write an algorithm that can produce basic human sentences, then write another one to obfuscate the swear words and/or sections of the whole text
use unicode lookalikes, etc.
that should be easy
yeah, this is a great use case for data augmentation
thnx
what kind of dnn are you using? convolutions? rnn/lstm? transformers? something else diy?
am I getting this error because each array for both x and y have to all be the same size?
if my training loss is going down but AUC staying roughly the same
does this mean i should increase learning rate
or just wait it out

anyway, hope grad school is going well Edd
just finished myself

How does correlation in sequenced portion of video plays a role in training???
In anomaly detection
I’m finishined in less than a month
A coworker of mine is using matplotlib to draw a wafer plot with a quiver plot on top of it like the one displayed in this image. He told me that it is very slow since he has to draw many shapes to make the wafer plot. He is looking for something faster and even interactive. I mentioned maybe Bokeh or plotly could do it. I don't really know if this is possible in either library or if it would be faster. Does anyone have any suggestions or experience doing something similar?
how can I define error, like for an error bar, if I don't really know what the error is? for instance, in google sheets, you can tick a box called "error bars" and it'll just generate them for you automatically
matplotlib should be able to handle this, it's possible that his code isn't very efficient. but he might also want to consider #pyqtgraph or even gnuplot
usually it's something like 1 or 2 standard deviations. the google sheets documentation https://support.google.com/docs/answer/9085344?hl=en&co=GENIE.Platform%3DDesktop says that you can choose between a constant value, a percentage of some kind, or the standard deviation.
Want to get more out of Google Docs for work or school?
Am currently working on a clustering project using KMeans Clustering on the Mall Customer Segmentation dataset and am wondering on what type of EDA should I do to have the ideal clusters
(https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python)
Currently am thinking of making 2D scatterplots between columns and to look for clusters in the plot, then do the same with a 3D scatterplot. But this doesn't seem like an ideal strategy because in some cases I might be doing 4D clustering and in those cases I won't be able to visualize the clusters in this way.
Would doing stuff like making histograms for different columns even help here?
with just 4 dimensions you could use a grid of pairs of variables
a "scatter matrix" i think it's called sometimes
it would work but it starts getting hard to read at bigger sizes
example: https://i.stack.imgur.com/oREtY.png
So on each axis there are multiple columns? Is that how we are able to visualize these?
you might also want to use some dimension reduction algorithm to be able to plot your data in 2d or 3d. of course there are also quantitative ways to evaluate clustering quality
You mean elbow method and silhoutte score?
those are some valid options yes
Help me
note that silhouette score (and k-means in general) do not perform well on clusters that are not approximately "spherical"
gitattributes is not a python file
Yh heirarchical clustering is helpful here
my go-to for clustering is hdbscan
What type of EDA you do for those?
clear it gitattributes ?
i look at the univariate distributions (density plot, percentiles, mean, etc), then i move up to pairwise bivariate distributions (e.g. scatterplot matrix like i posted above), then i go for dimension reduction to see more of the global structure. sometimes i've even done 3d plots and manually "flew" through the 3d point cloud
don't run it as python code. leave it alone
How do I run the project?
explain what you are trying to do in greater detail
Sorry for just bombarding you with questions
But how does that help in determining which columns to fit in the clustering algorithm?
it doesn't, but it does help me at least understand the shape of the data and decide what kind of clustering might even make sense + how to evaluate it
feature selection is a whole different issue. i rely as much as possible on domain knowledge for that. but i also try to discard features that seem uninformative, e.g. it is mostly all the same value or has a lot of missing values that can't be easily imputed
I guess cause I've already decided that I'll be using KMeans for the job, this part is not of much use for me
This tho can't be observed with EDA alone right?
not doing basic eda is like driving with your eyes closed. just do it on every dataset no matter what.
you can get some idea about it, yes
Yh that's something I've heard a lot
But my main problem rn is that
Even if I do EDA on the data I have
I can't find much use of it
The question "what's the purpose of doing this" always comes in my mind
I've heard a lot that it helps you in understanding the data, which then helps you in choosing the algorithm you want to go with
But I've never been able to implement this in actual projects and that's just eating me up
Do you have any notebooks in mind I can go through?
the main purpose is that you want to understand the distribution of the data
where is there a high density of data points? are there extreme values to consider? what variables are highly related?
I guess for now I should just go through a few datasets on Kaggle and check the code of other people on how they do stuff
Thanks for the suggestions. I did suggest that maybe the code could be optimized. It turns out that he didn't care so much since he still wants to add interactivity.
very simple things to consider, for example, are that if your data is not jointly gaussian with diagonal covariance, using vanilla least squares as your cost function is not optimal
That is a super loaded question for this application. Each vector would represent a single measurement, which is deviation away from the ideal. So in a sense we are measuring the error. We can define an x and y direction std for a single water plot.
Damn such a cheater
hi anyone here are familiar with pandas
currently I have a dataframe that each row contains a substring, I want to locate which sentence the substring at by doing a substring matching, using panda apply function, however, it runs horribly slow, any efficient way to do so ?
I have use selenium with panda apply, because selenium can scrape text around my row element surrounded and minimize collision error (since substring matching might not be reliable, 17 matches with 17000 or 17)
but selenium find element seems to not working concurrently and incredibly slow as well
I am looking for a efficient and reliable method, thanks
the definition of sentence here would be .split('.'), assuming full stop as the sentence where my row element located at
have tried .str.contains, normal matching or regex, but none of them really improve the performance
You should firstly store all the data you need in one place and only then try using pandas on it
That'll be a lot more convenient too
anyone here transitioned to learning python from R?
Is Dataquest.io free for only the first 3 lessons of each path
I think yes
https://www.freecodecamp.org/learn is free forever :)
I've got a project where I want to see the sales of a company based off store locations, store sales, individual product sales, and the date. I was thinking of applying a linear model but I dont think its the best choice given there's a time factor. Thoughts?
have you heard of time series forecasting? also, how are you going to represent nominal features like "location"?
yea i realized it'd be time series, not something i covered in my course, so exploring more about it. I'd have location be region/city based: ie. london, manchester, etc
how many distinct locations are there?
that's a really good question
depends on the level of granularity i'd want
cities will be probably close to 30
regions would be 15
or increased granularity down to neighbourhoods or streets
how many instances would there be per city? also, is there really any point for the "region" feature, if every city belongs to one region?
that would probably result in too many distinct values for that feature, and make it useless.
exactly my thought
there'd be a few hundred per city
I would just have city as a one-hot feature (or something like that), and see how it goes.
so if im comparing manchester, london, and brighton, i'd want the model to understand sales might differ based on location, whether region or city
ow right, have to make each region an individual column
if you know what regions each city belongs to, you don't necessarily need the model to know.
true, but greater london would include parts outside of london that wouldn't exhibit many sales
so i feel the model would fall off aswell if i choose regions rather than cities
you would probably need to break greater london into more than one "city", if the geographical scope of your data is the UK.
Yea that makes sense, maybe just expanding the regions with special cases will reduce the number of hotencoders i'd have
you mind if i dm you at another time I run into any more doubts?
you should just ask here. I check this channel often enough, and it's good to be able to get input from other people
Alright will do! 🙂
Hi All, I have this matrix where I need to figure out a list of species each zoo is missing in the matrix (Original content is 152colXx562rows)
Would I be able to do this in pandas easily? or would I be better off just programming it in Python?
For this Matrix, I would need to show that:
LA Zoo - Monkey, Reptile
NY Zoo - Bird, Bear, Reptile
FL Zoo - Bird, Monkey
I’m starting to think SQL and probability intuition is by far the most important abilities for passing technical tests when interviewing
Which is pretty dumb imo
even if you read each table into memory with pandas, you can figure out what is missing with sets.
!e
animals = {'bird', 'dog', 'elephant', 'donkey'}
la_zoo = {'dog', 'donkey'}
missing_animals = animals - la_zoo
print(missing_animals)
@serene scaffold :white_check_mark: Your 3.11 eval job has completed with return code 0.
{'bird', 'elephant'}
a lot of people don't know about sets. those poor, poor people.
oh wait, what if I needed to group the animals by species? for example, the LA zoo has a Grizzly Bear, so I can just take any bear species out of its missing animals list
In [1]: animals = ['parrot bird', 'panda bear', 'polar bear', 'grizzly bear']
In [4]: pd.MultiIndex.from_tuples(map(str.split, animals))
Out[4]:
MultiIndex([( 'parrot', 'bird'),
( 'panda', 'bear'),
( 'polar', 'bear'),
('grizzly', 'bear')])
you could start with something like this, I guess.
if you have an extra level of column indexing to give you superclasses of animals, you can then figure out which zoos have at least one of each superclass.
Hmm I will give it a go and report back
How come x2>-2x? In this picture it's a perceptron learning algorithm
Hello guys plz someone help me with this problem I don't know how come is that =>x2>-2x
this is a column of a dataframe
0 128
1 111 <- new min
2 116
3 121
4 110 <- new min
...
7 131
8 100 <- new min
9 122
...
50 105
51 93 <- new min
52 129
...
4995 137
4996 139
4997 118
4998 105
4999 100
how can I set the values to be the same until there's a new minimum? like this
0 128
1 111 <- new min
2 111
3 111
4 110 <- new min
...
7 110
8 100 <- new min
9 100
...
50 100
51 93 <- new min
52 93
...
4995 93
4996 93
4997 93
4998 93
4999 93
@wooden sail I have a application test whicih is asking something that appears to be maths, want to look at it?
i can glance at it while i eat, but wdym by "application test"
its for a job, the first screening exam
im on the final question, fared quite well for all of the others but got stuck on a calculus
data science quiz
its basically asking you, in python, to find the gradient at a point along a 3dimentional graph function
so its multivariate
im happy to share all questions after i submit
matplotlib, is there a function to annotate every data point?
all i'll say is you're overthinking it
sorry, that was at supermoon, not you. pyplot has an annotate function you can use
I sent it to you. wasnt rly shure what to do.
i can send u others too as ive submittedf i tnow
if ur interested, there was a few MCQs
i answered all except for the mountai none
there was a few stats, one linalg and a few prob ones
let me show u the lin alg one
it never said to compute the gradient directly in python. the easiest solution was to differentiate by hand and code the resulting function in. numpy.gradient uses finite differences and is therefore not exact. not only that, it won't work if you just give in scalars x and y
for a pandas dataframe like column_1, column_2 hello, 0 there, 1 how would i iterate over every row in column_1 and replace it with a 1D array of integers?
you could've used sympy too, if you really don't wanna do the math yourself. but it had to be done symbolically, either on paper or with a lib
wat do u thnk about that matrix one
hi guys is there a more efficient way of doing this code?
well, im sure youd get that one
the o notation and the run time for this cell is quite bad
that would create a ragged array, it's not good to have columns containing arrays
well, what im trying to do it take that data frame, and do py dataframe = read_csv(...) train_y = dataframe.pop('column_2') dataframe.head() because i need to encode the strings in the first column
what you'd do is add more columns, not put an array inside a column
@wooden sail also, there was one which i had never seen before. it was how many edges does a fully connected graph have
its something liek 2/1-n something or other
You can use the itterrows() method
https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas
lemme think. the first node has n-1 edges. the second node has n-2, since one of its edges is already set in. it looks like (n-1)! to me, but lemme look it up
if there's another option to encode the strings in col_1 i'll take it
I think ur right, but id never encountered such a topic before
I mean, ive never even looked into graph theory so it was a hard one
i just googled it 😉
ah oops i multiplied them but they were supposed to be added added, that was my bad. but the logic was sound. so n-1 + n-2 + ... + 0. then yeah, n(n-1)/2
oh, they also askeda monty hall problem
they also asked what unit is variance in
which i think its just unit squared
one hot is usually done by adding columns (or rows, depending on how the data is oriented)
oh, they also asked if you had two dice and a coin, whats the odds of landing a combined 7 and heads
1/6 i think
they asked what does the valuye of the sigmoid function tend to as x moves towards -inf, which i answered as 0
they asdke dme to write a python function to referse a functions arguments into a new function, as well as a sql query
overall, not too bad
1/12, shouldn't it be? 6 ways to roll a 7 out of 36 ways to roll 2 dice, but also 1/2 to roll a head
to roll a 7 out of 2 die, its 3 possible options but with two arrangements so thats 6./12 ?
wait a second
thers only 3 ways u can roll 7
6, since order matters
You could bin it with pandas cut() method. This should improve the runtime as well.
https://pandas.pydata.org/docs/reference/api/pandas.cut.html
ohhhh aight thanks ill take a look
but im pretty sure i got the sigmoid question right it was the easiest one
pytohn and sql was tricky but its doable
and they also asked if a dude had a stack of cards and said he guessed all ur cards correctly after guessing all cards, he abuses which metric which i answered recall
i mena thats a recall of 1
I believe the metric that was abused was precision (if I don't have an equal number of shapes in the card pack I hold)
it would be recall
if he gussed all cards, his precision wud be rly low. but recall 1.0 as didnt miss any of the picked cards
If he guessed all cards correctly?
Well, I guess it depends on what they actually mean by "abuses" which metric
if he claimed he guessed all of the picked ones correctly, the picked ones isnt the entire set
that claim that u got 1,0 is just abusing recall
his precision will be low like 0.1
I mean to say Accuracy here not Precision.
Any one got a suggestion for the best approach to creating a model to detect video game characters(data would be images of character from various angles, outfits etc) . I was going to do face detection, but that isnt gonna work for the character not facing the camera.
Hi, I would like to know how transfer learning can be done with a custom object detection task, not image classification. Is it possible to use yolo pretrained weights onto our own model? Main reason why I'm asking is because I am trying to get a higher mAP value which is the metric used for object detection instead of normal accuracy.
yo, dumb question I didn't quite find an answer to (more like I wanna confirm a thing):
if I have a low amount of data, an epoch on GPU won't differ too much from an epoch on a CPU, right?
especially if the model is not too complex, i.e. something like 6 Dense layers, output layer has 3 classes and after each Dense layer, I have 0.5 dropout rate with an input layer of at most 208 neurons (depending on the features extracted from the data) and each Dense layer starts has half the number of neurons of the previous layer, starting from 512 on the first active layer
the easiest answer is to try it and see. parallelization is not only done over data samples in a batch, but also over all matrix operations even if there's a single data vector
how do you identify what glove embedding pretrained text file to use? i am currently trying to train my model that is using gru.
yeah, well logic and the intuition from what I learned in parallel computing classes and other info about well data size in parallel computing tell me it is the case
in my case, the computing times were quite similar, but I don't have more data to "fill" the dataset so that I can check my hypothesis that more data would result in epoch duration growing alot faster on CPU than GPU
so it's more like a "does my intuition make sense?" type of question
that sounds about right. the general philosophy is that, although there's a ton of cores, they're each weaker than what your cpu has. for very small matrices, you would indeed expect the cpu to be faster, especially considering the cost of moving stuff from memory to gpu memory. then as the matrices get larger, you start reaping the benefits of massive parallelization
my previous comment was more along the lines of "it's difficult to know where the break-even point is"
thanks
There is the NVIDIA TLT package, but I've never used it. I increased my training set and got an increase of 2% on the mAP
. What is your mAP and IoU if I may know?
Well my mAP value just hits 80-81 and then just stops increasing, my training set is around 6000+ images as well. I don't have the IOU calculated though. Just using the built in TF mAP function with keras. Not using a pretrained model either.
oh you use a public data-set
So you don't train using Darknet?
hmm 6000+ images should have given a good mAP tho
I've seen one with 1000 hit 90%
and mine is way less
I have not used that, just wanted to know how I could get higher results without using a pretrained model directly because 6000+ images is a decent bit of data.
Yeah just a medical dataset with a total of 8000+ images which I then split.
the only thing I've seen is NVIDIA TLT, but never used it. So go knock yourself out
out of curiousity which version of yolo are you using?
Hmm yeah I'll check it out. But do you think that the custom model can be an issue for why it does not increase beyond 80?
I have yolov4 but not used yet cause I can't seem to get it to work on my custom model.
How many classes do you have? my first guess would be that the data has some problems.
Well I mean on which version of Yolo did you train your custom model?
I only have one class given for the images, like the images are all confirmed for having that specific object present in it. I don't have any classification, just detecting the object. F.e. the images are confirmed cases of stenosis so I'm just detecting it with the help of the labels.
I used yolov3.
it looks like you're overwriting the age (a number) with a string that represents a range of ages. don't overwrite data with a less specific version of that data, or with data of a different type.
df['Age'] = df['Age Group']
df['Age Group'] = ''
df.loc[df['Age'].le(14), 'Age Group'] = 'Child'
df.loc[df['Age'].between(15, 24), 'Age Group'] = 'Youth'
something like this.
ow god, completely forgot about le and between existence; do they work for datetime modules?
Yep!
Yeah, your intuition is correct. There's likely no qualitative distinction between the classes as clustered. Also, before downsampling, you could just make the markersize much smaller to see if anything closer to a pattern appears. ALSO, your data shows no one working after the age of 80, but then a few random 84 year olds? Something to check out as well.
you can still do that, if you want to do things only in terms of the day (and not hours, seconds, etc)
Hi where would you guys suggest moving to after learning python basics and doing small projects. I've got a 12 months free of data camp but it doesn't look comprehensive. I've also heard of Kaggle micro courses but I'm not sure how effective they are
what is your goal
Quant
But starting in data science is the best path I can take leading to that
Also have access to all Coursera courses
idk what quant is.
Quantitative analyst it's basically if data science and data analysts had a love child placed it in the finance industry
They are usually employed by large trading and hft firms and require extensive technical knowledge, you deploy that knowledge for trading operations
seems that most people who are employed as "data scientists" these days are data analysts anyway. idk what employers for quantitative analysts look for in applicants, though.
Tbh it doesn't matter at this stage. Rn I need to work on learning the skills of a data scientist
what is "this stage"? what do you do currently?
I only know basic python and am currently studying the required math for data science at school.
by "school", are you referring to mandatory public education (called high school in the US)? university/college?
I study Maths, Further Maths and Computer Science in the UK
Sort of a college
But the maths ain't a problem
can you take courses that are specific to your goals?
Hi everyone, I'm getting this error:
first_week_data['Ciudad'] = first_week_data['Fuente'].apply(lambda element: element.split(',')[0])
/home/carlos/Documentos/Programacion/Python/angela_dane/semanal/data.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I've read the sugested documentation but didn't see the conection with my problem.
can you do print(first_week_Data['Fuente'])?
also please always show the whole error message, starting from Traceback. (and please do that now as well, because I think there might be more to this.)
/home/carlos/Documentos/Programacion/Python/angela_dane/semanal/data.py:58: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
first_week_data['Ciudad'] = first_week_data['Fuente'].apply(lambda element: element.split(',')[0])
/home/carlos/Documentos/Programacion/Python/angela_dane/semanal/data.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
last_week_data['Ciudad'] = last_week_data['Fuente'].apply(lambda element: element.split(',')[0])
it's not an error it's a warning
this
print(first_week_Data['Fuente']) ------>
0 Armenia, Mercar
4785 Medellín, Central Mayorista de Antioquia
4784 Medellín, Central Mayorista de Antioquia
4783 Medellín, Central Mayorista de Antioquia
4782 Medellín, Central Mayorista de Antioquia
...
33485 Bucaramanga, Centroabastos
33458 Bogotá, D.C., Plaza Samper Mendoza
33484 Bucaramanga, Centroabastos
33492 Bucaramanga, Centroabastos
33504 Bucaramanga, Centroabastos
Name: Fuente, Length: 37854, dtype: object
do .str.extract(r'^([^,]+)') instead
like this?
first_week_data['Ciudad'] = first_week_data['Fuente'].apply(lambda element: element.extract(r'^([^,]+)')[0])
no apply
same warning
show the new code exactly
first_week_data['Ciudad'] = first_week_data['Fuente'].str.extract(r'^([^,]+)')
/home/carlos/Documentos/Programacion/Python/angela_dane/semanal/data.py:58: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
first_week_data['Ciudad'] = first_week_data['Fuente'].str.extract(r'^([^,]+)')
/home/carlos/Documentos/Programacion/Python/angela_dane/semanal/data.py:60: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
hi there, im currently trying to do a classification problem where i have an image with spots. im trying to classify each pixel in the image as either spot or not spot. what is the best algorithm to do this?
you need to have a decent finance background as well. see if you can include that as your domain focus
are you going to further education after graduating college? quant positions can be extremely competitive otherwise
how does one resample their data, especially if it's not timeseries but regular numeric data? been trying to look it up, but I can't really find any solid solutions
Yeah, Ik a bit about finance but that is the least important topic for getting into a Quant position. Maths and computing are far more essential.
Yeah I’m tryna get into a top uni here however if not I'm also applying to degree apprenticeships at fintech companies which allow you to gain a degree in a relevant field while working under the company
The degree apprenticeship looks appealing cuz all uni finances are covered with no student debt and you can choose which career pathway you would like at the company such as Software Engineering or Data Science
yes as long as you dont forget.
yes but it can be uber competitive but maybe you can stand out depending on your projects/resume
good luck

What's your favorite resource to study/learn about ML?
Preferably one that helps you get a solid fundamental understanding of the underlying principles.
Hi @hazy saddle, I am not getting the copy of a slice error on Python 3.10. However if you use .loc[:,'Fuente'].apply... it should work. ```In [5]: import pandas as pd
In [6]: last_week_data = pd.read_clipboard()
In [7]: last_week_data
Out[7]:
Index Fuente
0 0 Armenia, Mercar
1 1 Medellín, Central Mayorista de Antioquia
2 2 Medellín, Central Mayorista de Antioquia
3 3 Medellín, Central Mayorista de Antioquia
4 4 Medellín, Central Mayorista de Antioquia
5 5 Bucaramanga, Centroabastos
6 6 Bogotá, D.C., Plaza Samper Mendoza
7 7 Bucaramanga, Centroabastos
8 8 Bucaramanga, Centroabastos
9 9 Bucaramanga, Centroabastos
In [18]: last_week_data['Ciudad'] = last_week_data.loc[:,'Fuente'].apply(lambda element: element.split(',')[0])
In [19]: last_week_data
Out[19]:
Index Fuente Ciudad
0 0 Armenia, Mercar Armenia
1 1 Medellín, Central Mayorista de Antioquia Medellín
2 2 Medellín, Central Mayorista de Antioquia Medellín
3 3 Medellín, Central Mayorista de Antioquia Medellín
4 4 Medellín, Central Mayorista de Antioquia Medellín
5 5 Bucaramanga, Centroabastos Bucaramanga
6 6 Bogotá, D.C., Plaza Samper Mendoza Bogotá
7 7 Bucaramanga, Centroabastos Bucaramanga
8 8 Bucaramanga, Centroabastos Bucaramanga
9 9 Bucaramanga, Centroabastos Bucaramanga```
what are you trying to resample?
I am thinking to just create an arbitrary column with a repeating pattern in it such that you can perform a group-by on the arbitrary value.
If you are trying to get into deep learning - https://d2l.ai/ is probably one of the better introductions.
I also like to recommend the book by Ian Goodfellow and Yoshua Bengio and Aaron Courville
hi, i have sql db, and i want to search for string in "comment" collumn, is that possible to search by string ?
For example, you could do something like this df['resampler'] = np.trunc(np.arange(1+step, len(df), step)).astype(int)[:len(df)] to create a new column which could be used to group by. col1 col2 col3 resampler 0 0.607871 10.075861 20.203499 1 1 0.049092 10.531278 20.696755 1 2 0.832901 10.512815 20.765228 1 3 0.376783 10.583901 20.758072 1 4 0.982780 10.229963 20.051475 1 5 0.739152 10.478775 20.420801 2 6 0.720491 10.644305 20.083453 2 7 0.705236 10.203818 20.870851 2 8 0.783557 10.351655 20.012904 2 9 0.957087 10.882574 20.691543 2 10 0.636897 10.653356 20.954984 3 11 0.306318 10.617002 20.963245 3 12 0.557695 10.704019 20.616715 3 13 0.352175 10.987861 20.704404 3 14 0.132969 10.216441 20.135463 3 15 0.615025 10.387754 20.457027 4 16 0.595251 10.301297 20.603991 4 17 0.819896 10.239930 20.914990 4 18 0.336612 10.016438 20.628703 4 19 0.275393 10.850988 20.743750 4 20 0.384558 10.404489 20.853798 5 And then you could perform the groupby like df.groupby(by='resampler', axis=0).first() Which would yield a "resampled" data frame as such: In [39]: df.groupby(by='resampler', axis=0).first() Out[39]: col1 col2 col3 resampler 1 0.607871 10.075861 20.203499 2 0.739152 10.478775 20.420801 3 0.636897 10.653356 20.954984 4 0.615025 10.387754 20.457027 5 0.384558 10.404489 20.853798 In my example, step = 0.2, but you could use a smaller number for a larger sample interval or a larger step for a 'faster' sample rate.
sorry for the late reply, I was busy with something, I was just trying to perform k-means clustering on these two columns
which ends up creating horizontal clusters
but i'll try that and see how I could modify it, thanks
This for classical stats models otherwise no imo
To tell you the truth, I am unfamiliar with k-means clustering - not sure if my reply will be helpful for that, but I'll be curious if it does.
this is pretty wrong. you need statistics to formulate appropriate cost functions
ehh statistics is a huge field and not really something u can just 'learn' to apply to cost functions its not worth it
maybe if you just focused on specific areas of statistics
i found the same problem with calculus, massive field but in this case you sort of need to wade through the endless foundational stuff before moveing on to differentials
else it makes 0 sense
my statement was meaning that the relevant importance of stats drops off compared to other stuff once you exit those areas of ml
Hello ppl my question is how this (1,1) can be ruled out when it can be (1,0) as it can also fires 1
Or is that inhibitory is 1 then whole unit becomes "0"??
didnt it say x2 must be 0
i disagree entirely, the majority of ML is estimation theory and IS statistics
what are the main applications of probability theory in ML would you say
in a typical classification project
everything, lol. the output of the network is probabilities, to start with
sure, but how would being well versed in probability theory help you produce better results from a random forest for instance
let's just say that all of the issues you've been having with your unbalanced classes are statistics problems
yep. and ive found that categorising age improves random forest performance, whichi believe has been due to reducing noise
@steady basalt which step in the row ua narrating abt
well, you'd know what you're doing instead of "believing" if you knew stats. you're at a point where you're so far removed from the topic that you can't properly assess its usefulness
Even if i 'knew' all of stats, i wouldnt be able to exploit that with python on this data
how would you even go about analysing how the noise is ruining classififiers?
practically
The root of the issue is poor data not imbalance
you would also be able to do something about that, too, but the conversation is pointless
my only point is to tell you not to misguide others by saying stats isn't important, it's key in ML whether you understand that or not
it is important, that isnt what I said at all
im listening, theres no need to hate
So I'm trying to implement a Sequential model to Detect the likelihood of someone getting liver cirrhosis given some readings and I keep running into this error while trying to train the model:
WARNING:tensorflow:Layers in a Sequential model should only have a single input tensor, but we receive a <class 'dict'> input: {'N_Days': <tf.Tensor 'IteratorGetNext:11' shape=(None,) dtype=int64>, 'Status': <tf.Tensor 'IteratorGetNext:18' shape=(None,) dtype=string>, 'Drug': <tf.Tensor 'IteratorGetNext:8' shape=(None,) dtype=string>, 'Age': <tf.Tensor 'IteratorGetNext:0' shape=(None,) dtype=int64>,
I just wanna make sure I'm doing it right or make it work for starters.
FULL CODE @
https://drive.google.com/file/d/101ZQUkUi8ZnYm6g5YPdCIdWZtwwx_nEi/view?usp=sharing
from torch.utils.data import Dataset
class ConfocalDataset(Dataset):
def __init__(self, img_dir, mask_dir, transform=None):
self.img_dir = img_dir
self.mask_dir = mask_dir
self.transform = transform
# list all the files in this folder
self.imgs = os.listdir(img_dir)
self.mask = os.listdir(mask_dir)
def __len__(self):
return len(self.imgs)
def __getitem__(self, index):
img_path = os.path.join(self.img_dir, self.imgs[index])
mask_path = os.path.join(self.mask_dir, self.mask[index])
img = czifile.imread(img_path).reshape(242, 512, 512)
mask = io.imread(mask_path)
if self.transform:
img = self.transform(img)
return (img, mask)
dataset = ConfocalDataset(img_dir = images_path, mask_dir = masks_path,
transform = transforms.ToTensor())
dataset[0][0].shape
gives torch.Size([512, 242, 512]) instead of 242,512,512 which I specified in my Class, does anyone know why, by the way this is a 3D dataset of greyscale images
Hi carry_a_laser, thx for answering, still gettring the warning. I have a question, why did you use [:,'Fuente'], why ;, ?
Hi - basically I was using that to try to avoid the error A value is trying to be set on a copy of a slice from a DataFrame. Try using **.loc[row_indexer,col_indexer]** = value instead. Basically ":" is the row_indexer and 'Fuente' is the col_indexer. Here is a pretty good explanation on stack overflow: https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte
ok, nice
Hello, I have a question to anyone using m1 mac for data science and machine learning. How is the compatibility of python packages? As I see on the internet, many people still face issues. I'm indecisive just because of these compatibility issues
so, the metal framework should quite in theory allow you to have gpu acceleration on m1 both for pytorch and tensorflow, but you have to follow the steps carefully
what about some packages like scikit, numpy, pandas and some database management tools(postgres, sql)
i honestly don't know about those. i would be surprised if they didn't work, but at the same time, they fall back on BLAS and LAPACK builds for x64 normally, so they probably need a special version or have to be built from source. i couldn't say for sure
yeah, catch myself using .replace for that to remove it completely or just .date
I created a scatter matrix for 3 numeric columns I had in my dataframe to identify which columns can be used for clustering
From the above plot I can see that the columns Income and Score are forming 5 clusters and Age and Score are forming 2 clusters.
So I was wondering, that should I use all three columns for the clustering?
@vast spade this is where you can ask your question
Hello, I have a question to anyone using m1 mac for data science and machine learning. How is the compatibility of python packages? As I see on the internet, many people still face issues. I'm indecisive just because of these compatibility issues
They run just fine
The issues have been resulted last winter
Resolved
You will have a good experience on Mac OS if you know what you’re doing with miniforge
Hi, I have a question
I have a dataframe like this:
| col_1 | col_2 | date | money |
| A | B | '2022-06' | 400 |
| A | B | '2022-07' | 500 |
| A | C | '2022-07' | 600 |
| A | C | '2022-06' | 700 |
I need to create as many columns with that date format, to end up like this
| col_1 | col_2 | 2022-06 | 2022-07 |
| A | B | 400 | 500 |
| A | C | 700 | 600 |
For now, I am basically adding the columns (getting a set of that column) and initializing them in 0. Then I just fill the columns one by one in a loop (yes, not very efficient) while filtering the data.
Is there a better (more efficient or pythonic) way to do this?
I think there is a function which exists which does this. Lemme check docs.
I have done this kinda thing before.
Hello everyone, hope you're doing great!
Anyone know how to fix this? I'm using seaborn. I want to get the correct scale, but when I set the ylim, and yticks, it looks like in the image.
Code is as follows:
sns.set_style("whitegrid")
ax = sns.lineplot(data = df,
x = "Tiempo (min )",
y = "Presión (mbar)",
)
ax.set(ylim=(min(df["Presión (mbar)"]), max(df["Presión (mbar)"])))
ax.set_yticks(df["Presión (mbar)"])
Dataset is the 2nd image
I want to make it look similar to this, but using Python
There is pandas.get_dummies
!d pandas.get_dummies
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)```
Convert categorical variable into dummy/indicator variables.
But yeah you'll need to look up to make it work, but it does convert categorical data into each col.
Looks like it is exactly what I needed. Thank you very much. Really appreciated
There may be another step required since it may just make cols of 1s and 0s, but it should not be too hard.
Changed the scale to logarithmic and it's fixed, now I just have to change the ylabels
that's what i was about to suggest. you can swap the y ticks for log y and plot in semilogy
What's semilogy? 😮
it plots your quantities in a logarithmic scale along the y axis
there is also semilogx, and loglog (both axes logarithmic)
if you use semilogy, it'll change the ticks and the plot for you automatically, but it's up to you if you want to show the ticks in linear or log scale
hello guys i need help a project im working on could someone dm if his free
hey guys, i have a dataset with 1380 samples and 1.8 million features, and i need to run supervised learning on it
so important step is feature selection, so i'm trying to find some good methods or libraries for it
i need to keep in mind space and time as well
any suggestions ?
you need to use pivot_table. it's exactly for this
!docs pandas.DataFrame.pivot_table
DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)```
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
hello, im new here. Is anyone good at analyzing excel files with multiple columns and using time series analysis on them?
I have code setup but running into issues with my low coding knowledge
you're most likely to get help when you ask a question that people can read and start answering right away. like by showing the code and the error message.
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
#return datetime.strptime(s, '%m/%d/%Y %H:%M') old dating
return datetime.strptime(s, '%Y-%m-%d %H:%M:%S EDT')
main_data = pd.read_csv('hvac_data.csv', parse_dates=[0], index_col=0, date_parser=parser)
for i in range(2,3):
coltwo = main_data.iloc[:, i]
stl = STL(coltwo, period=15)
stl_data = stl.fit()
seasonal, trend, resid = stl_data.seasonal, stl_data.trend, stl_data.resid
redsidual_mean = resid.mean() #mean of the residual graph
residual_dev = resid.std() #stdev of the residual graph
upper_bounds = redsidual_mean + 2*residual_dev #for anomaly detection
lower_bounds = redsidual_mean - 2*residual_dev
anomalies = x[(resid < lower_bounds) | (resid > upper_bounds)]
print(anomalies)
My main issue is that I want to analyze all the columns in that range, but it only analyzes the last one in the for loop
and the other issue is that the bounds vary with the column I am analyzing. How would I go about making the bound change based on standard deviation or mean?
I'm facing an issue while trying to import tensorflow-ranking module. The error:
AttributeError: module 'tensorflow._api.v2.compat.v2.__internal__' has no attribute 'monitoring'
The TF version is 2.4.1 with Python 3.9+, trying to get my gpu to work as well so that's why I am using these respective versions.
Maybe try upgrading numpy and TF? https://github.com/tensorflow/tensorflow/issues/54286#issuecomment-1031205739
That solved my problem in nearly one line. Thank you very much sir
Hi guys !
I have a couple of questions that teachers never answer in data science master courses:
Since I'm passionated with CS and maths, I'd like to make the plots and graphs look better, sharper, cleaner, with dark mode preferably. and I find those matplotlib so ugly !
So, I'd like to have a better understanding on how the conversion between functions and graphs occur so I get to know what to edit to make all these viz look the way I'd like them to.
If someone has a good experience on this please lmk ! :D
I think seaborn plots are nicer, idk?
I did upgrade to TF 2.9.0 but now my GPU does not show up, I guess there is specific support for certain Python versions with TF.
Well, I'm not really looking for a lib name since I guess this is just about me googling and finding it but, I'd like to know if I can maybe create my own graphics or something? :o
Since matplotlib is mostly used, I'd like to know if there's a way to edit just the graphics part of it if that makes sense?
tbh if you want that granularity, i would recommend you export the data you want to plot into a csv, and then plot it in latex using tikz and/or pgfplots. then you can create vector graphics out of your dataset and format them however you like
mmmh you can't always do this ig
wdym?
how do you make a for loop that loops through csv columns? while applying your anomaly code?
like my previous code
if you use pandas, you can do iteritems(), I guess
!docs pandas.DataFrame.iteritems
DataFrame.iteritems()```
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
but the goal is often to do this as little as possible.
will this work for large data set with 100 columns?
you can iterate over as many columns as will fit in the memory of your computer.
given two images, what algorithm(s) can i use to determine the similarity between the two? I've read up a bit on image similarity but i don't know where to start.
semantic similarity?
(I have no idea, I'm just thinking of how we can unpack your question)
from a quick google:
the answers in https://stackoverflow.com/questions/11541154/checking-images-for-similarity-with-opencv explains some methods
seems like https://www.geeksforgeeks.org/measure-similarity-between-images-using-python-opencv/ implements the 'histogram' one, but idk if that's even the best way to implement it since geeks4geeks isn't very credible tbh
geeks4geeks is an exploration of sadness.
What makes two images similar? Which objects they contain? Which patterns they contain? Difference in pixel values (pixel by pixel)? Overall dominant colors? Total brightness level?
Or maybe a combination of all of the above in some weighted / heuristic fashion?
hi! simple request
how would i search this:
say i have one type of element called an A. i have a list of A's.
i also have an element type B as well as a master list. i know that in the master list, all A's are flanked by B's, in that an A element is surrounded by a B on both sides
now, i want to find the specific B's that surround the A's i have
how would i do this? sorry its not too specific because its a bioinformatics question for currently unpublished research but im blanking out lol
I would transform the sequence into a string and use a regular expression.
you are under the impression that there is a single definition of "distance" and "similarity". all of the things you mentioned are valid metrics, and you're supposed to use the one that suits your application
but I'm one to see a lot of things as a regular expression problem. because I have issues. sort of like how if you're vladimir putin, you see a lot of things as a long white table problem.
There is not a single definition. You have to choose, I am asking them because they probably have some idea of what they want / what their problem is.
you are right. the asker didn't specify.
sorry i pinged the wrong person, i was reading the original message you referenced 😛
Yeah I often reply to the wrong thing.
You have to choose what you think will be good indicators for later for whatever problem you are doing and try to extract those / compare the images by that.
And you can then organize images by those indicators.
you could formulate this as a peak-finding problem. what you have is of the same shape as non-overlapping "pulses" in a time series, and these can be located by doing matched filtering and then peak-finding. this is pretty much equivalent to what stelercus said regarding regex, since the sequences are well-resolved. the only open question is whether BABAB is a valid sequence with 3 Bs or not, i.e. if it should be BABBAB instead. in either case, this can be done efficiently with something like a sliding window, or if you're feeling super fancy, with a discrete fourier transform
If you are doing some NN stuff, you can have the NN find out those for you while it tries to do whatever problem you want it to do. If the NN is big enough / is flexible enough / depending on the type of NN.
(Hard coded approach vs let the NN figure it all out (with enough data / training))
i keep a notebook under my pillow, but i doubt that's what you mean
Do you mean BED files? https://en.wikipedia.org/wiki/BED_(file_format)
The BED (Browser Extensible Data) format is a text file format used to store genomic regions as coordinates and associated annotations. The data are presented in the form of columns separated by spaces or tabs. This format was developed during the Human Genome Project and then adopted by other sequencing projects. As a result of this increasingl...
yes!
What about them? They are really simple.
yep; im just searching entries of these
would the peak file approach be useful for this
im just terribly inexperienced thats all
What Edd suggested sounds good. I would start with a sliding window (most simple approach) on some made up cases to test, then bigger cases and try to account for edge cases. So the usual programming process for exploratory algorithm design / making something up.
If your starting test cases are small enough, and there are not too many computations, you can do it by hand on paper first before worrying about code.
some questions include: do you really only want BAB, or is in some cases something like CAB or AAB allowed? what happens at the beginning and end of the list? e.g. starting with AB or ending with BA? is BABAB two pairs of valid surrounding Bs or just one? you can ignore them when you first set things up, but they have to be dealt with at some point
we could call these "noise", "boundary conditions" and "overlap"
Solve the more simple version of the same problem first (that still captures the main issues / essence) and let that inform you on how to do the more complex version later (the actual problem).
PCA really doing a number on my classifier. does PCA even work properly if its alot of binary features
i was thinking a weighted combination them
!pypi ImageHash
these might be what you're looking for
they detect stuff like cropping/scaling/reencoding (as in, such operations don't affect the perceptual hash much), and generally also overall similarity
@wooden sail
what does it mean by is cont.?
there is only one point -2,1
or is that giving x values?
that's an interval of x values
no
what is a continuous function anyway?
it means at all the infinitely many x values in that interval
oh ok
ty
so yea between those two values f(x) can only be those two functions depending on the value of x
but im not sure what theyre exactly asking by that
i think the accessible definition of continuity for you is in terms of limits
so theres... more advanced stuff that would be beyond this level in that definition?
epsilon delta would be nice, but that's also elementary
a is false
with the limit definition, for points inside the interval, we say f is continuous at the point (c, f(c)) if f(x) = a (the function is defined at x=c and has some value a) and also the limit as x -> c of f(x) is a
that means the function has to be defined at that value of x, and the limit as x approaches that value from the left and from the right has to equal the value of the function
in your example, f(0) = 2, since f(x) = x + 2 in the interval [0,1]. if we approach x = 0 from the right, we get this same value. however, if we approach from the left, then f(x) = x - 1, whose limit as x approaches 0 from the left is -1
then the limit as x -> 0 of f(x) does not exist, and the function has a type of jump discont.
iirc at the boundaries of the interval it suffices to have the corresponding one-sided limit
i'm afraid there's no simpler explanation
while we're at it, for b), note that sin(y) is 2 pi periodic, meaning that it starts over whenever y is an integer multiple of 2 pi, i.e. of the form 2 n pi for integer n. now we do a substitution with y = 2 pi x. then we need y = 2 pi x = 2 n pi, so x = n. that means that every integer value, sin (2 pi x) repeats itself, so it indeed has period 1
so its a gap?
well, it will look like a bunch of vertical lines
If you go from the left to right, what is the furthest y you get to without jumping, and if you go from right to left, what is the furthest y you get to without jumping?
what? how does that translate to vertical lines? and infinitely many, at that
in the interval for example of 0 to 1, f(x) = x+2 right?
mhm
so at any points id have thought you'd just get a line going straight
if x is 1.5, youd get a straight line y=x+5
that's just one point (c, f(c))
sorry y=1.5+2
that's a point, not a line
you need to take a step back
The problem is we are not dealing just with individual points, we are trying to talk about the function as a whole.
review some precalc before you step into calculus, otherwise you will understand nothing
yes im in my first week of this topic xd i just happened across a more advanced problem that what ive learnt
and was curious
take a step back to algebra
im currently going thru the function stuff now which may be similar to 'precalc'
its explaining exponential functions, transformations, trig functions
striaght lines, curves
that's indeed what you should look at, but this looks like you skipped ahead, since it involves limits
ok. take it easy, all in good time
im on page 21 out of 1100
haste makes waste, and all of the stuff you see will build up on itself. if you skip something, it'll come back and bite you
honestly may need to practise more algebra if i ever want to take on the advanced stuff, it looks mind bneding
towards the ned of the book its like some trippy shit, pipes and stuff
stokes law is meant to be from physics
...
no, that's a common application for it though
it will probably take me a literal year to finish this book
It's just more notation, it will make sense when you see each notation introduced one by one.
And have a solid intuition before calc.
'calculus, single and multivariate' by hallett, check it out. starts with the basics and builds into some advanced stuff
it sounds like the book goes all the way from precalc to vector calc from what you're saying. this is like 3 or 4 semesters of math
How functions look when plotted and algebra's relation to geometry.
that is correct, but the precalc is quite short actually, just a couple chapters
probably the first 80 out of 1k pages
you need to be fluent in that stuff to before you move on though
right now its just exersices for log functions and 'compisites' functions within functions liek f(g(x))
it took me ages today of contemplating just to give up and look at the solution but it was k(y) = e^-y&^2 and you had to composite it
If you want the really easy intro to calculus, then Calculus Made Easy by Thompson is pretty good. But the book you currently have seems fine too. Just don't skip stuff. You can't really do that in a math book unless you already know a lot of math.
the reason why it was so hard bcs they never exaplined you have to introduce another 'z'
the answer was f(z) = e^2 and f(g) y^2
that looks all sorts of wrong
so i think its f(z(g)) or something
thats from memory tho i can grab a photo
example 2B
not looking forward to the upcoming sin cos stuff comning up
hated that in school
just excited when i finally get to differentials
side note: how would you go by altering this random forest so that you increase recall by sacrificing a little bit of precision
id be happy for it to overguess 1 a little bit more
Thanks
@wooden sail wow hit some truely incredible information
f^-1(x)
and eulers number
just amazing
log(exp(x)) = x moment
im guessing u know alot about inverse functions
if R = f(T) = 7T-35,
T = f^-1(R), why does that equal R/7 +5?
r/7 + 35 no id have thought
ohhh we divide 35 by 7
oops missed that
So I have a bunch of review data and I wish to extract sentiment toward certain topics. Any idea on how to go about this?
Currently I'm just extracting sentences that contain related keywords and getting sentiment from those.
you forgot the learning rate
Oh yeah true
Good catch
I was planning to finish up this video for the SOME2 thing, but at this point I really just want to make the best video I can make
which means I won't lower the quality just to meet the deadline of 2 days 😪
at the moment I'm using list comprehension to append json data to another list. Overtime this will use alot of processing power, is it better to pass json data to a np array?
Can anyone suggest some good feature selection techniques for clustering?
the vids are looking ok, but you're shooting yourself in the foot with the first one. because you chose a non-convex plot, you now have to explain how and when gradient descent does and does not work, since it doesn't always converge, and even if it does, it doesn't necessarily do so at the global optimum
well yeah because that's the truth
neural networks are non convex so I want to make that clear from the start
I don't have to go into too much detail though, just mention that it's non convex and what non convex means and the problems that causes to training
very briefly
convexity is a kinda large topic to just brush under the rug, but ok 😛
yeah but would you say it's better to mention it briefly to make the viewer aware of it (so that they might look into it further if they like), or not mention it at all?
i guess the former, if those are your only 2 options
well yeah cuz the video is supposed to be an introduction to computer vision with a focus on deep learning
I'm not even planning to go into the details of backprop as that deserves a video for it's own
a "tourist guide" so to speak
also random question
do you think I should refer to mse_error as E or something referring to error in general?
just thought about it now
that might make it more clear that I can put any differentiable error function there and it would work
maybe call it cost function or error/error energy
I have a small doubt regarding numpy save, is it faster to load a saved array of shape (1105, 512, 256, 1) or to compute it on the go. does anyone have any idea?
you can test it yourself with timeit. it'll depend on which operations are used to create the arrays. in your case, the array seems to require loading several other arrays, so i'd say loading a single big array is a lot faster than loading several arrays
@wooden sail surely they don’t mean f divided by g right? U can’t calculate that with ur brain
12C
Surely it just means of either
wdym?
3n^2-2 / n+1
When I put that into the website it makes a graph which certainly has two domains
Weird
a function cannot have two domains
that's still just one domain, but made up of the union of disjoint sets
it should've been, discussing domains of functions requires talking about sets, since the domain of a function is a set
I’ve been taught domains
then that should've been there
But not when it has to be calculated of something like that where there’s a gap
keep in mind division by 0 is undefined
I was thinking instead of a divide sign they just meant and
that means the function does not exist when the denominator is 0
So x can’t be 1?
n cannot be -1
So it’s -inf to excluding -1 then excluding -1 to inf
mhm
They didn’t show how to do two sets of a domain
How does this graph look
Is the syntax that big { for two domains?
there's several ways to describe a set
i think the easiest here would be
.latex $(-\infty, -1) \cup (-1, \infty)$
.latex or $n \in \mathbb{R} \ {-1}$
Must be this thanks
Or with the > and < signs but with a comma between?
that's also valid, yeah
If it CANT be -1 shudnt the bracket be ]?
] includes the value, ) excludes it
So we exclude infinity?
it's not a number, and it's not part of the traditional real numbers
Oh ok
ignore this one btw, i mangled it. i wanted to write a set difference but forgot a few brackets and stuff
x^3 +5x +10 this is invertible and yet x^3 -5x+10 isnt, weird
why does stretch only happen negatives
Hello Guys, I'm into the Perceptron learning algorithm in that I'm stuck on the Cconvergence of Perceptron learning based on deep learning so, is there any best interpretation resource that clarifies some of my doubts if anyone knows abt the concept let me know
how in-depth of an answer are you looking for
Convergence on the minima?
U reading gradient descent?
@wooden sail red flag, my text book has rly bad reviews. people complain it gives problems it h asnt given prior teachings for in terms of methods
do they give concrete examples? tbh i know better than to trust the reviews of students at face value 😂
not for everythign thats asked, which is why i may have upset you previously as It wasnt even covered earlier
but its liek 3/5 stars reviews. maybe i shud use a rly popular one instead?
I had a look at a famous maths book from the 1950s on analysis and it covers all theory but its too hard for me to understand past page 20 because its purely explaining definitions in a quick way
I meant the rudin book
yeah, you're definitely not ready for that
the book starts off with things that I NEED to know though
i can give u an example, in the first few pages it explains SYNTAX that i literally need to read papers such as how to show that something belongs to a set in symbols
you can try if you want, then, but that's a book usually used in mathematics majors that requires a lot of mathematical readiness
for example
you can get to that level early if you're good, but for many people you need to already know calculus before learning analysis
like it'd go after your current book
this i think is foundations that I SHOULD learn
rational number system is something u shud know by default
I mean, take a look at the first 10 pages
trust me, you're not ready for that 😛
if you study this way, you will get stuck at natural numbers before even reaching trig and calculus
yes, you should, but waaaay later on
you don't understand anything there, and you won't for quite a while
YES lol
but this looks like plain logical tinking
😂
there is no "obviously", you need to PROVE it
nothing is taken as obvious. you have to start from the proof that 1 +1 = 2 working with natural numbers before you even reach this
the proof is that something to the power of a positive number equals a positive number must have a positive base?
you should DEFINITELY skip this
ok ill leave this book for now
so this is taking concepts i will learn about soon in my text book but turning them into actual theoretical proofs which is deeper?
if you don't believe me, try the book out and see how far you get with what you know 😛 lemme know how that goes
pretty much. it's a lot more formal
you remember you disliked linalg, yeah? why was that?
but ill learn about |z| in calc?
that's complex variables/complex analysis. that goes after calc
oh okay
after how much calc tho? cause my textbook goes to university level i think
im pretty sure my book ends highschool calc after 70%
university level 😛 complex analysis requires multivariable calculus
and some linear algebra too
my book goes to multivariable but also more whacky stuff that i showed previously
im sure thats first year uni at least
yes, you need that for complex vars too
green's theorem, for example, is used all the time
is that stokes stuff uni level
yes
1st year?
multivar calc level
however long it takes you to get there
so differential calc and integral calc are prerequisites. at least 2nd year, very likely
i feel like over here in high school multivariable is part of a advanced course u take in school
you say that but, did you learn it?
no cause i dropped the subject in my first year of hs
it doesn't matter what level it "should" be taught at, what matters is that you don't know it
i got to basic differential and integrals
so you need to review it from scratch
well, get a different book and compare them as you go along
AbeBooks.com: Calculus: Single and Multivariable, International Student Version (9780470409442) by Hallett, Deborah Hughes and a great selection of similar New, Used and Collectible Books available now at great prices.
not sure what international means
lots of ppl saying dont use for self teach
seems very differnet to the normal one
normal edition looks way harder
How do you guys come up with project ideas?
I have been trying to think of a cool idea for weeks now
And I literally can't
So, how do I detect if a sentence is about a specific topic? I know that lda and stuff exist, but they don't seem that useful in this case.
If you want a challenge, play with the steam review data of your favorite game.
can you elaborate?
Idk Im looking to do something that involves AI and web scraping but I can't think of nothing.
Say I have video game reviews.
"The graphics are very good."
"The game looks good"
"The game is very fun"
In this example I wish to filter for reviews talking about graphics. So I want to get the first two reviews from my dataset.
if you're looking for reviews about graphics, why would you want the first two, but not the last one? because the second two are basically the same.
i think they mean "looks" quite literally
I see
it is a good-looking game
Oops bad example. ^ this is right
that's a better example
well, you don't have to overthink it. can you just look for certain keywords? "looks", "graphics"?
I've tried that. In somewhat works, but going through and looking for every keyword is time consuming and prone to error. And if possible a better method would be able to use context, as I've just shown many words have double meanings.
if checking that a string contains at least one of a few possible substrings is taking a long time, I imagine that so will anything "smarter"
Not processing time, but human time in finding the keywords in the first place
not to mention you'd need labelled data in the first place, though
Honestly I wouldn't mind making a dataset once if could be confident I could re-use the model it makes.
if you have a dataset where reviews about the visual aesthetic of the game are marked, you can write something to figure out what "interesting" words are most frequent in those reviews.
something similar to sentiment analysis seems reasonable off the top of my head, but stelercus is the expert here
Actually that makes sense, instead of sentiment analysis, topic analysis (as in a 0 to 1 chance of it being the topic). The only issue is would would have to train a new model for every topic.
topic modeling is a thing, but I don't think that's what you want.
I know, and it's not.
I think it should be possible to find keywords through embeddings, but I'm not quite sure how that would work.
just to be clear, do you have a labeled data set, or no?
I currently do not. I just have reviews and if the reviews were thumbs up or down. If it is the best way I'm willing to make one, but only if necessary.
I would see if anyone has published on anything having to do with video game reviews, and see what dataset they used, and if it's available. it's a long shot, but it's a good idea to check first.
The only good dataset I can find is the one of half a million steam reviews I'm already using. It has steam data which is just thumbs up or down and review text along with the game information.
This seems like a method that might get close to what I want. https://blog.insightdatascience.com/contextual-topic-identification-4291d256a032
I don't get to choose the topics, but it at least seems decent
guys im trying to make a nn that will detect if a shoe is converse, adidas, or nike
but when i train my model
im getting 41 percent accuracy
i dont know why im a beginner, can someone help me?
this is my code
from pickletools import optimize
from tkinter.tix import ListNoteBook
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
import random
DATADIR_TRAIN = 'shoes_data/train'
DATADIR_TEST = 'shoes_data/test'
IMG_SIZE = 50
CATEGORIES = ['adidas', 'converse', 'nike']
# (0) (1) (2)
training_data = []
def create_training_data():
for category in CATEGORIES:
path = os.path.join(DATADIR_TRAIN, category)
class_num = CATEGORIES.index(category)
for img in os.listdir(path):
try:
img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
# plt.imshow(new_array)
# plt.show()
training_data.append([new_array, class_num])
except Exception as e:
print(e)
create_training_data()
random.shuffle(training_data)
X = []
y = []
for features, labels in training_data:
X.append(features)
y.append(labels)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
X = X/255.0
y = np.array(y)
model = keras.Sequential([
layers.Flatten(),
layers.Dense(256, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dense(3, activation='sigmoid')
])
model.compile(
loss=keras.losses.SparseCategoricalCrossentropy(),
optimizer='adam',
metrics=['accuracy']
)
model.fit(X, y, epochs=10, batch_size=64)```
your model is kinda small, and you might wanna exploit spatial invariance by using a few convolutional layers
what is spatial invariance?
let's say something like the adidas logo showing up anywhere on the image indicates that it's adidas wear. doesn't matter where the logo is
yea
that's the principle behind convolutional neural networks
i added the conv layers and the accuracy went up to 79 percent
cool
i hsould have used conv layers from the start
to be fair, you could've gotten good results with an arbitrarily deep network of only dense layers, but applying all the knowledge you have of the underlying phenomenon into the network structure makes it perform better more easily
that's the difference between bad black box deep learning and knowing what you're doing 😛
I don't know whether I should learn TF or PyTorch
And what's the best source to learn the syntax?
Thanks in advance
languages have syntax. libraries have APIs. you don't "learn TF/pytorch syntax".
it seems that pytorch is becoming more widely used than TF.
That makes sense ^^"
Where can I learn PyTorch?
I am learning Deep learning from Andrew ng so I just need to learn how to use the library ig
I don't think we have a curated resource for that, unfortunately.
Thanks
In case of tf, we have this
https://www.tensorflow.org/tutorials
idk if this is the channel but
given the url of an embed video, can u get a random frame without downloading the whole video?
Do you happen to have some thoughts on why people are using PyTorch?
Recently got into ML/DL for a video for my channel, i genuinely fell in love with it. I learned TensorFlow, but, i hear a lot of people saying PyTorch is better. Should i switch
I would do both. I'm gonna learn pytorch too after a while for the sake of versatility.
learn pytorch please
not really. originally, tensorflow was more "black boxy" than pytorch, but I don't think it's like that anymore.
don't learn tensorflow
perhaps you could answer tsavorite's question 
what's "other"? jax?
even google is switching their products to jax
nah probably all other frameworks whatever they are
I'd assume jax holds most of its percentage tho
Yes he’s it is, and funnily enough google shills keep claiming otherwise. PyTorch is winning for academia except in industry where companies that for some reason don’t want to
Tensorflow really is annoying trying to shove data into models and getting version errors
are there any good tutorials with tensorflow which it reads an image and recognises it?
possibly any with teachable machine
recognizes what about it?
presumably he meant the bot can describe what the image is: https://stackabuse.com/image-recognition-in-python-with-tensorflow-and-keras/
TensorFlow is a well-established Deep Learning framework, and Keras is its official high-level API that simplifies the creation of models. Image recognition/c...
hey guys, so i made a model that can classify weather a show is converse, nike or addidas but when i try to predict i gives me this out put [[0, 1, 1]] and i dont know what to make of this, can someone explain this to me
from tensorflow import keras
from tensorflow.keras import layers
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
import random
DATADIR_TRAIN = 'shoes_data/train'
DATADIR_TEST = 'shoes_data/test'
IMG_SIZE = 50
CATEGORIES = ['adidas', 'converse', 'nike']
training_data = []
def create_training_data():
for category in CATEGORIES:
path = os.path.join(DATADIR_TRAIN, category)
class_num = CATEGORIES.index(category)
for img in os.listdir(path):
try:
img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
training_data.append([new_array, class_num])
except Exception as e:
print(e)
create_training_data()
random.shuffle(training_data)
X = []
y = []
for features, labels in training_data:
X.append(features)
y.append(labels)
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
X = X/255.0
y = np.array(y)
model = keras.Sequential([
layers.Conv2D(64, (3, 3), input_shape=X.shape[1:]),
layers.Activation('relu'),
layers.MaxPooling2D(pool_size=(2, 2)),
layers.Conv2D(64, (3, 3)),
layers.Activation('relu'),
layers.MaxPooling2D(pool_size=(2, 2)),
layers.Flatten(),
layers.Dense(256, activation='relu'),
layers.Dense(128, activation='relu'),
layers.Dense(3, activation='sigmoid')
])
model.compile(
loss=keras.losses.SparseCategoricalCrossentropy(),
optimizer='adam',
metrics=['accuracy']
)
model.fit(X, y, epochs=19, batch_size=64, validation_split=0.1)
model.save("model.h5")
from tensorflow import keras
import cv2
import matplotlib.pyplot as plt
CATEGORIES = ['adidas', 'converse', 'nike']
def prepare(path):
IMG_SIZE = 50
img_array = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
return new_array.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
model = keras.models.load_model("model.h5")
prediction = model.predict([prepare('shoes_data/test/nike/15.jpg')])
print(prediction)
print(CATEGORIES[int(prediction[0][1])])```
Can it only be one of the three? @crystal skiff
yea
Then your model is wrong
you shouldn' t be able to get 2 ones
You are using sigmoid instead of softmax for the last layer
lemme make that change
Then you get something like
[0, 0, 1] -> Converse
[0, 1, 0] -> Nike
[1, 0, 0] -> Addidas
ur right
imm a beginner to this so pls excuse my silly mistaks
Whats the problem? @bold timber
I don't know why the output shape becomes None, 30, 30, 32
can you explain this?
None is the batch size, which may be unknown, and thus None
Then you have a kernel of 3x3, so the output shape is reduced by 2 for both image dimensions
And 32 filters, thus it becomes (32-2, 32-2, 32) thus (30, 30, 32)
Which part confuses you? @bold timber
Why it get reduced by 2? whether it because I use pool_size 2,2?