#data-science-and-ml
you have missed northeast?
is this not what you wanted ?
sorry no, thought that be obvious. its ok ill figure it out
Oops, sorry then..
let's hope you get help from someone here (I just came in to clear one of my doubts and then saw yours on the way and thought I might be able to help in some way)
are you looking for a smooth kernel density estimate instead of histogram?
so you already have code that groups the data into four regions and draws a histogram for each region
perhaps you can spend some time understanding that code, in order to figure out how to modify it appropriately?
alternatively, you could learn the more idiomatic way to do this with Seaborn, which uses a paradigm called "the grammar of graphics"
under that paradigm, this grid of subplots is called "faceting"
each subplot is a "facet"
and usually you create one facet per sub-group in the data, which appears to be precisely what you are looking for
so here is a demonstration of using line plots
even better, here is an example with histograms https://seaborn.pydata.org/examples/faceted_histogram.html
Note: the row and col parameters control faceting
perhaps you might also want to read about displot https://seaborn.pydata.org/generated/seaborn.displot.html
and I suspect you also want to read about how to use col_wrap and col for "wrapped" facet plots
https://seaborn.pydata.org/generated/seaborn.FacetGrid.html this documentation page is long, but the parameter names are the same as in the higher-level plotting functions
so it's a good place to learn about what those individual parameters do
i'm surprised that there isn't a nice document explaining wrapped facet plots for seaborn, at least i couldn't find one
but here is one for a different plotting package: https://www.sharpsightlabs.com/blog/facet_wrap/. obviously you should ignore all the code, but hopefully the explanation and examples make sense
floating point numbers are not "real numbers" in the mathematical sense
it's unrealistic to expect them to sum exactly to 1
If I had used float64, it would give 1
changing the floating-point number representation changes how precisely the numbers are stored, which changes the way errors are propagated through calculations
64-bit floats carry about twice as many significant digits as 32-bit floats (roughly 16 decimal digits versus 7)
so maybe with 32 bits you get enough error that it shows when you print the numbers
i think at some point every practitioner of data analysis will be forced to learn about floating-point arithmetic and numerical stability
is there any way to calculate the probability mass function (pmf) column wise?
i don't quite understand the question, is that a joint probability table? and you are trying to compute marginal or conditional probabilities?
I would like to get an array, where the column sum could be 1..
or the probability sum of elements column wise would be 1, when the datatypes are of numpy float32 values..
if you are just trying to check the correctness of your code, don't be concerned by floating point errors on the order of 1e-6 or whatever they were
that said, why do you need 32 bit specifically?
i am trying to calculate a loss function..
unless you have a very specific need to do otherwise, just use the default float (np.float64), which is 64-bit
ok, and like i said: floating-point arithmetic is never 100% accurate because most floating point numbers are not exact
so expect small errors
in some situations, it might be a problem if those errors start to accumulate throughout the sequence of computations
is there a name for the loss? maybe a reference would help us understand
in your case, it sounds like you are just concerned that your implementation is wrong because the numbers don't add up exactly to 1
i am telling you that your implementation is probably fine because those errors look like what you expect from accumulated floating-point errors, rather than the algorithm being wrong
jensen shannon divergence
if you are still concerned and not convinced, i recommend you spend some time reading about floating point numbers and floating point arithmetic
this is just the reality of using computers, at some point you have to deal with the fact that they are physical computers and not idealized computing machines
I have understood what you are saying..
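A minimal sketch of the point made above, assuming made-up non-negative data in a float32 array (the array name and shape are hypothetical). It normalises each column into a pmf and checks the column sums with a tolerance instead of expecting them to equal 1 exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.random((5, 3)).astype(np.float32)  # made-up non-negative data

# Normalise each column so that it forms a probability mass function.
pmf = counts / counts.sum(axis=0, keepdims=True)

col_sums = pmf.sum(axis=0)
print(col_sums)  # close to 1, possibly off in the last digits

# Compare with a tolerance rather than with ==.
print(np.allclose(col_sums, 1.0, atol=1e-6))
```

If an exact-looking 1.0 matters for display only, rounding the printed values is usually enough; the computation itself does not need to change.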
yo what are those kernel options and verbosity for svm?
Thanks Salt. I look it all up. Exhausted! 👍
are svm kernels like optimizers on neural networks?
Hey guys, anyone here familiar with pandas? I need help making a subset of a very large CSV file into another file to use.
can you be more specific?
what do you mean making a subset??
Hi, we are making a deep learning translator-like program with pytorch and would really appreciate some help with trying to normalize the data and transforming words (different values derived from words) into tensors
look into lemmatization, stemming, and algorithms like word2vec
thanks 🙂
Hey guys, I have a question related to normalization in grayscale images. https://stackoverflow.com/questions/70371050/finding-the-mean-and-std-of-pixel-values-for-grayscale-images-in-pytorch
Would appreciate any help
I'm trying to normalize this grayscale xray images dataset https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
I have a few doubts
1)I looked up some of the projects done using the same d...
x = np.linspace(0 , (2 * np.pi), 200)
def h2d(x):
    fig = plt.figure()
    n=np.arange(1, 12,2)
    print(n, 'n')
    xx,nn = np.meshgrid((x),(n))
    plt.plot(xx,nn)
    happrox = (1*(4/(np.pi)) * ((np.sin(nn*(xx))) / nn))
    happrox = np.cumsum(happrox,1)
    return(happrox)
happrox = h2d(x)
print(happrox)
This is my code but it gives a wack result for cumsum, is it coz i have imaginary numbers after the sin operation?
why'd you have imaginary numbers here?
wdym?
I don't see anything here that'd result in imaginary results
it shouldn't yes
so why is the output wrong?
Why do you think that it's wrong?
because when i get it to check the code jupyter notebook says the desired output is
y: array([[ 0.000000e+00, 8.033524e-02, 1.602705e-01, 2.394092e-01, 3.173611e-01, 3.937456e-01,
4.681952e-01, 5.403578e-01, 6.099000e-01, 6.765098e-01, 7.398989e-01, 7.998050e-01,...
the y array is the desired output, the x array is my output
If I'm reading it right, even the shape isn't right.
yes which is really scary
Hey @agile monolith!
It looks like you tried to attach file type(s) that we do not allow (.docx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
doc isnt allowed damn
this is the task so it doesnt make sense for the shape to be (4,200)
@tidal bough im not trippin right?
Yeah, it's somewhat weird that the shape isn't (6,200)
my shape is (0,6,200) apparently which if we consider it to be (6,200) it makes sense along with the instructions. but the wanted result is in the form (4,200)
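A hedged sketch of the axis question, assuming the task asks for partial sums of the square-wave Fourier series (the task text itself is not shown in the chat). With `np.meshgrid(x, n)` the harmonics run along axis 0, so accumulating along axis 0 adds harmonics together, while `np.cumsum(..., 1)` accumulates along x instead:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
n = np.arange(1, 12, 2)  # odd harmonics 1, 3, ..., 11

xx, nn = np.meshgrid(x, n)  # both have shape (6, 200): one row per harmonic
terms = (4 / np.pi) * np.sin(nn * xx) / nn

# axis=0 sums over harmonics, giving one partial sum of the series per row.
partial_sums = np.cumsum(terms, axis=0)
print(partial_sums.shape)
print(partial_sums[1, :3])  # partial sum using the first two harmonics
```

The second row (harmonics 1 and 3) starts at 0 and then about 0.0803, which appears to agree with the start of the desired output quoted above. How the expected (4, 200) shape arises is not clear from the chat.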
pretty noob question but does anyone know a good way to only view the first few lines of a massive json file in the terminal?
I'm getting these api responses that are so big my terminal won't print the whole thing, but it only prints the end of the file up, so I can't see the data structure
I'm writing them to a file currently, but I was wondering if there's an easy way to see the beginning in the terminal
you can just .splitlines the JSON string and take the first few lines
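A small sketch of that suggestion using only the standard library. The function name `head_of_json` and the sample response are hypothetical stand-ins for the real API data:

```python
import json

def head_of_json(text, n_lines=20):
    """Pretty-print a JSON string and return only its first n_lines lines."""
    pretty = json.dumps(json.loads(text), indent=2)
    return "\n".join(pretty.splitlines()[:n_lines])

# Hypothetical API response, standing in for the real one.
response_text = json.dumps(
    {"items": [{"id": i, "name": f"item{i}"} for i in range(1000)]}
)
print(head_of_json(response_text, n_lines=8))
```

Pretty-printing first matters because many API responses arrive as a single very long line, in which case taking the first few lines of the raw string would return everything.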
is there a place where i can practice Data Science based Python questions?
ive checked Leetcode but its not there 😦
i did it
lowkey very happy
no idea why or how it works
```
<ipython-input-82-182241fd5e53> in <module>
      6                "stride": 2}
      7
----> 8 Z, cache_conv = conv_forward(A_prev, W, b, hparameters)
      9 print("Z's mean =\n", np.mean(Z))
     10 print("Z[0,2,1] =\n", Z[0, 2, 1])

<ipython-input-81-f5cd533d7e29> in conv_forward(A_prev, W, b, hparameters)
     86                     weights = W[:,:,:,c]
     87                     biases = b[:,:,:,c]
---> 88                     Z[i,h,w,c] = conv_single_step(a_slice_prev,weights,biases)
     89
     90

IndexError: index 4 is out of bounds for axis 3 with size 4
```
i can't understand where i am going wrong
I see that there's a slight improvement when it comes to training a contextual chatbot in deep learning with data that fits these criteria:
- Equal distribution of interrogative words, i.e. if one tag contains "how do i water the plants?", adding a question starting with "how" on another tag to balance it makes it more intelligent
- Equal distribution of patterns (i.e. 10 patterns each tag)
I have some idea why this is the case but could anyone explain to me why exactly this is the case? (or if it isn't and it's a matter of chance)
i know this isn't related to data science nor artificial intelligence but i remember you! congrats on getting the helper role!
Thanks for answering. I m trying this
It works just fine @polar acorn I think I ve lost myself in my own code as it seems just easy and logic today.
Thanks for it.
Guys, I have another question.
I think I'm gonna use streamlit to create dashboard apps for my company.
My question (and I've asked it in game dev as well because I think the stakes are the same) is how can I share the built apps with my colleagues.
I can either build the app on the common file server and install a python env in this folder so everyone can launch the app from a shared folder, but then only one user can use it at a time. The other problem is that I need to create a file that launches the python env before the app file.
Sounds like a MacGyver solution.
Or I create an app just like any software and I can share the app with anyone locally. How can I do that?
Should I use docker to create an image?
Seems like a problem Gradio can solve. Check out Gradio
I m checking this out thanks for it
Also, I think HuggingFace has a feature called Spaces, you can host what you've built with Gradio on Spaces and multiple users can access it at the same time.
Well, since HuggingFace just acquired Gradio yesterday, I believe it'll definitely blossom into something more beautiful.
ah yeah, something I didn't mention: since it's working with company data, I want it to stay local or on the intranet
HuggingFace's spaces is a hosting cloud service?
IDK for sure because I've not personally used Spaces. But what I do know is, whatever code / model you put out on HF spaces is still 100% yours.
ok I m gonna check this out as well thanks @odd meteor
hello i am really new tensorflow and i keep getting these errors
```
AttributeError                            Traceback (most recent call last)
<ipython-input-9-797be23c9feb> in <module>
      3 loss = tf.Variable((y-y_hat)**2,name='loss')
      4 #init = tf.initialize_all_variables
----> 5 with tf.Session() as session:
      6     #session.run(init)
      7     print(session.run(loss))

AttributeError: module 'tensorflow' has no attribute 'Session'
```
sorry to ping you, could you please tell me why i am getting the above error
Because tf.Session was removed from the top-level API in TensorFlow 2.0
Ohhh okayy,thank you! So ill just use tf.print()?
Use tf.compat.v1.Session() instead
Anyone knows of deep learning models used for time series classification that I could read about?
Hi I'm getting this error: but not sure how to fix it
```
  File "C:\Users\haris\Documents\Bsc stock project\centroid lags.py", line 68, in xvals
    return [lin_interp(x1, y1, zero_crossings_i[0], percent_y),
IndexError: index 0 is out of bounds for axis 0 with size 0
```
```
import pandas as pd  # import pandas package to read data more easily
import matplotlib.pyplot as plt  # imported pyplot to plot graphs
import datetime as dt  # date time to read first column of csv file
import numpy as np
from datetime import datetime

CL=[]
for i in range(100):
    df = pd.read_csv('TSLA.csv')
    df2 = pd.read_csv('NBM.V.csv')
    df0=pd.read_csv('file1.csv')
    df5=pd.read_csv('file2.csv')
    data1=df0
    data2=df5
    data1['Date'] = pd.to_datetime(data1['Date'])
    data2['Date'] = pd.to_datetime(data2['Date'])
    x1=(data1['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400
    x2=(data2['Date'] - dt.datetime(1970,1,1)).dt.total_seconds()/86400
    y1=data1['Close']
    y2=data2['Close']
    t0=[]
    d0=[]
    y2_mean = np.mean(y2)
    y1_stdv = np.std(y1)
    y2_stdv = np.std(y2)
    for i in range(len(data1)):
        for j in range(len(data2)):
            t=x2[j]-x1[i]
            t0.append(t)
            d = (y1[i]- y1_mean)*(y2[j] - y2_mean)/(y1_stdv*y2_stdv)
            d0.append(d)
    # return udcf
    #data=udcf(data1,data2)
    x, y = zip(*sorted(zip(t0, d0)))  # ensures x and y values correspond to each other in pairs when sorted
    df4 = pd.DataFrame({'X' : x, 'Y' : y})  # we build a dataframe from the data
    #bins = create_bins(lower_bound=-6,width=3,quantity=30)
    bins=np.arange(min(x), max(x)+0.01, step=4.3)
    #bins2 = pd.IntervalIndex.from_tuples(bins, closed="left")
    categorical_object = pd.cut(x, bins)
    count=pd.value_counts(categorical_object)
    grp = df4.groupby(by = categorical_object)  # we group the data by the cut
    ret = grp.aggregate(np.mean)
    data2_new=df2.sample(frac = 0.7)
    data1_new=df.sample(frac = 0.7)
    dict = pd.DataFrame({'Date':data1_new['Date'],'Close': data1_new['Close']})
    kd = pd.DataFrame(dict)
    kd.to_csv('file2.csv', index=False)
    dict2 = pd.DataFrame({'Date':data2_new['Date'],'Close': data2_new['Close']})
    kd = pd.DataFrame(dict2)
    kd.to_csv('file3.csv', index=False)
    x1,y1=zip(*sorted(zip(ret.X,ret.Y)))
    def lin_interp(x, y, i, percent_y):
        return x[i] + (x[i+1] - x[i]) * ((percent_y - y[i]) / (y[i+1] - y[i]))
    def xvals(x, y):
        percent_y = (max(y)*0.8)
        signs = np.sign(np.add(y, -percent_y))
        zero_crossings = (signs[0:-2] != signs[1:-1])
        zero_crossings_i = np.where(zero_crossings)[0]
        return [lin_interp(x1, y1, zero_crossings_i[0], percent_y),
                lin_interp(x1, y1, zero_crossings_i[1], percent_y)]
    hmx = xvals(x1,y1)
    centroid=np.mean(hmx)
    CL.append(np.mean(centroid))
```
Axis 0 size 0? Wth
yhh it also says this: return [lin_interp(x1, y1, zero_crossings_i[0], percent_y),lin_interp(x1, y1, zero_crossings_i[1], percent_y)]
IndexError: index 1 is out of bounds for axis 0 with size 1
but when I take it all out the for loop it runs without errors
what exactly are you trying to do?
I'm trying to find 80% of the max y value in a particular 70% sample, then I'm trying to find the x coordinates where the line y_value=max(ret.Y)*0.8 crosses/intersects at the 2 points either side of the peak. Then I try finding the midpoint of these two x coordinates and append it to a list, then I will try to bin these values and plot N (number of points in each bin) vs the binned data
ret.Y is my y data and ret.X is my x data
the issue starts before the "plot N (number of points in each bin) vs the binned data" step, but after the following: I'm trying to find 80% of the max y value in a particular 70% sample, then I'm trying to find the x coordinates where the line y_value=max(ret.Y)*0.8 crosses/intersects at the 2 points either side of the peak, then I try finding the midpoint of these two x coordinates and append it to a list
ur samples have any nans?
lemme check one sec
one of the operations u do is preventing u i think
coz the size doesnt match
check the size of all the lists
check what they are after the nans are filtered (if u have any)
when I print my sample data frame it doesn't contain any NaN
oh yes
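A possible explanation, offered as a sketch rather than a diagnosis: with a different random 70% sample on every iteration, the curve may cross the 80% level fewer than two times (for example when the peak sits at the edge), so `zero_crossings_i` has size 0 or 1. The posted comparison `signs[0:-2] != signs[1:-1]` also leaves out the last pair of points. The helper names below are hypothetical, and NaN handling is left out:

```python
import numpy as np

def crossings_at_level(x, y, frac=0.8):
    """Return x positions where y crosses frac * max(y), by linear interpolation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    level = frac * y.max()
    signs = np.sign(y - level)
    # Compare every point with its right-hand neighbour (all adjacent pairs).
    idx = np.where(signs[:-1] != signs[1:])[0]
    return [x[i] + (x[i + 1] - x[i]) * (level - y[i]) / (y[i + 1] - y[i]) for i in idx]

def centroid_or_none(x, y, frac=0.8):
    """Midpoint of the first two crossings, or None if there are fewer than two."""
    xs = crossings_at_level(x, y, frac)
    if len(xs) < 2:
        return None  # e.g. the peak sits at the edge of the data
    return (xs[0] + xs[1]) / 2

print(centroid_or_none([0, 1, 2, 3, 4], [0, 5, 10, 5, 0]))
print(centroid_or_none([0, 1, 2, 3, 4], [0, 1, 2, 3, 10]))
```

Inside the loop, iterations that return `None` can be skipped or counted instead of raising an IndexError.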
so how does one pass weight scaling to a loss function in pytorch? I see that BCELoss can take weights when an object is being created but I go through my data in batches and I want the weights batch-wise. do you guys just create a new object for each batch or what?
I just pass a "reference" or whatever python calls it to my training function
no, the svm kernel is more like hidden layers in the NN
thanks!
how to choose parameters with it?
ill use polynomial kernel but what parameters i should put?
what do you mean?
a polynomial kernel turns y = wx into y = f(x) where f is a polynomial. the "parameter" is just the order of the polynomial. you probably don't want more than 2 or 3 imo
for the other kernels, treat them like hyperparameters in any other model, e.g. a neural network
you should familiarize yourself with how those kernels actually work (and what a kernel actually is)
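To make the polynomial kernel concrete, here is a small NumPy sketch of the usual formula K(x, y) = (gamma * <x, y> + coef0) ** degree. The parameter names follow the ones scikit-learn uses for `SVC(kernel="poly")`; the function itself is an illustration, not library code:

```python
import numpy as np

def polynomial_kernel(X, Y, degree=3, gamma=1.0, coef0=1.0):
    """Evaluate (gamma * <x, y> + coef0) ** degree for every pair of rows."""
    return (gamma * X @ Y.T + coef0) ** degree

X = np.array([[1.0, 2.0], [0.0, 1.0]])
K = polynomial_kernel(X, X, degree=2)
print(K)
```

So besides the degree there are usually a scale (`gamma`) and an offset (`coef0`) to choose, and these can be searched over like any other hyperparameter.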
oh maybe thats why haha
thank you ill try to look into it
You're great, man; your links really helped - so easy:
g = sns.FacetGrid(df2, col="region", height=3.5, aspect=.65)
g.map(sns.kdeplot, "charges")
easy!!
(just now need to add mean/median*).
these are still matplotlib plots, so you can get access to the Axes objects and use the usual .plot method to add lines
or better yet, the .axvline method
hey can anyone help me with this. NameError: name 'app' is not defined in my first flask code
"online learning with relative preferences has resulted in a new framework for optimizing over sets of alternatives with only relative, subset-wise observations. This is a general framework that is applicable to many automated recommendation systems that can sequentially elicit only relative preferences from, say, human users, e.g., “Do you like X over Y (and Z)?” This study has yielded state-of-the-art learning algorithms that make optimal subset selection decisions in terms of regret and rank-order estimation error, along with new insights on how to efficiently make comparisons to elicit items’ utilities within a wide range of social choice models."
i came across this in a paper, can someone please explain it to me
sounds like a #web-development question, but the problem is that you referred to something that you haven't defined. There are built-in functions and classes that are already named when your program starts, but everything else you have to import or name in the code.
I would recommend using a code editor that highlights built-in names so that you start remembering which ones they are. It's helpful to know them since they're always available.
what is your previous expertise with ML? for example, does "regret minimization" mean anything to you?
Nothing can minimize my regret.
What of a 4 story building in Silicon Valley and a yacht? 😃
if I can sell them both and buy a condo here, sure.
Hey all! I want to fill the areas in between the lines and put a text
Any idea on how to do it?
I don't know much about model deployment (I'm assuming you're trying to deploy your model using Flask)
Ensure you've created app.py file
Hopefully, people into MLOps can help you.
I know there's a fill_between function in matplotlib, but have no idea on how to implement it in the current state of my code
And it needs to be a different color in each section
Anyone any idea?
this can't be a real plot
also post your code if you're unsure of implementation
Hey guys I have a pandas dataframe and the index is all labels. I need to use the value of the labels individually in another function and I was wondering how to go about getting them
You can do df.index, or if you only need the distinct values df.index.unique()
I need each index value one at a time
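Iterating over `df.index` yields the labels one at a time. A minimal sketch, where the frame and the `process` function are hypothetical placeholders for the real data and the "other function":

```python
import pandas as pd

# Hypothetical frame with labels as the index.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

def process(label):
    """Placeholder for the other function that needs one label at a time."""
    return label.upper()

results = [process(label) for label in df.index]
print(results)
```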
Hi, my code was running like 2 mins ago but now when I run it it says
File "pandas\_libs\parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
EmptyDataError: No columns to parse from file
and in another tab using the same files it says the same thing
well, try printing the DataFrame right before the line that eventually causes that error. Remember to read the traceback from top to bottom to understand how the error happened
thankyou it worked
what I suggested was not a solution--it was only intended to help you figure out why you got the error
but I'm glad it worked, whatever that means
yhh it was reading data from a file with no data
How many types of machine learning are there?
Supervised Learning, Unsupervised Learning, Reinforcement Learning, and more recently Semi-Supervised Learning and Self-Supervised Learning 😫
TENSORFLOW
So this is the output from my model, I have a tokenizer which can map integers to and from words, but as this is my output, I can't use it, is there a way to return this to integer form? I looked into TextVectorisation but my model uses an image and a question input to get an answer (in text) output. So I'm not sure if I can use it
Can anyone give me some tips on how to convert this into integers so I can tokenize it back into english?
I imagine I need some sort of TextVectorisation layer as output from my merged model?
i am so beyond confused
by that bottom arrow
x should be 2, y is 2, and z is 0
but i don't see how exactly y is two?
oh
i'm an idiot
lol
now it makes sense
it's 2 diagonally
in python tensorflow what does .loc() do?
oh no i do not know what it means. i had to read a professor's paper and write a mail
example?
this
I am not aware of a loc method in tensorflow, perhaps you are using the pandas .loc()?
in which case you can refer to this for complete documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
super noob here. I have this pandas dataframe, and i would like to compare the second two columns to the first, outputting true if their signs match, and false otherwise.
```
_df['randforest_result'] = np.where((_df['randforest'] >= 0 and _df['y_test'] >= 0) or (_df['randforest'] < 0 and _df['y_test'] < 0), True, False)
```
I tried something like this but it's giving me issues about the truthiness of series/dataframes
This probably is not the right place to ask this, but anyway, I am trying to do some elementary satellite image analysis with rasterio and earthpy. The NDVI is given as the normalized difference between the near infra-red and red bands. Water should appear yellow or red (negative values) and vegetation should appear green (positive value) in the NDVI result. But I am getting kinda the opposite
The green part is supposed to be water, and yellowish part is land and vegetation
this is the code:
ndvi = es.normalized_diff(stacked[4], stacked[3])
ep.plot_bands(ndvi, cmap="RdYlGn", cols=1, vmin=-1, vmax=1, figsize=(10, 14))
plt.show()
stacked[4] is NIR and stacked[3] is the red band
import numpy as np
a = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
arr = np.array_split(a, 3, axis=1)
print(arr)
[array([[ 1],
[ 4],
[ 7],
[10],
[13],
[16]]), array([[ 2],
[ 5],
[ 8],
[11],
[14],
[17]]), array([[ 3],
[ 6],
[ 9],
[12],
[15],
[18]])]
Why is it split by column and not by row, even though I passed axis=1?
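In NumPy, axis 0 runs down the rows and axis 1 runs across the columns, so splitting along axis=1 cuts the array into column blocks. A short comparison using the same 6x3 array:

```python
import numpy as np

a = np.arange(1, 19).reshape(6, 3)  # same values as the array above

by_columns = np.array_split(a, 3, axis=1)  # three pieces of shape (6, 1)
by_rows = np.array_split(a, 3, axis=0)     # three pieces of shape (2, 3)

print([piece.shape for piece in by_columns])
print([piece.shape for piece in by_rows])
```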
if the difference is 0.5 between mean and median... can we use mean... eg i am analyzing avg no of likes on Instagram
from this code I got an error like this: TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
how to fix that?
Bros, I'm a machine learning engineer with 1.5 years of experience, with devops and automation experience too. I know python and am learning Go too, know SQL well and bash too. I had a rough week at my org and want to prepare asap for a couple of months and get a new job. Can someone tell me what specific sites, topics and courses I should start doing to get a new job. For Python btw. If u know what ML topics I should learn that would be cool.
What are the properties of a stop word without considering its semantics? I want to determine whether a word is a stop word for any random language, one I can think of is that it occurs in a high frequency within a text.
The document frequency would be high too (so the inverse document frequency would be low), but I don't want to work with a group of documents. Rather, for a single document.
For a multi-document approach, I can perhaps create a context vector and train a naive bayes classifier to determine if a word is stop word or not.
However, I need a lightweight way to just determine with some accuracy on a single document.
Stopwords are literally those common words in every language that add zero value to our task, yet they tend to be littered everywhere in our documents most times.
A word can be a stopword and still not appear frequently in a document.
Since you're already a Machine Learning Engineer, I think a demonstrable work experience would be sufficient in your case. Except there's a niche in ML you wanna go into
yo in application of data augmentation for samples there could be a boundary like there will be a optimal amount of augmented data that best give result and there is something like a dataset that compose of too much augmented data to the point that it gives bad observation to the model?
Is there any reason why you'd wanna do this from scratch when there's already an easier option to achieve your task?
Just as we have English stopwords, there are other stopwords in various languages too. Libraries like spaCy, TfidfVectorizer etc already have a parameter that can handle such.
If you're working on a French text, just indicate you'd wanna remove stopwords in French with the appropriate parameter.
Yes, that is true, but as I said, "without considering semantics", the high-frequency part is one of the general trend features and I only want to detect with some accuracy which is expected here. I am looking for some extra features which can help generalize with more certainty.
I have some low resource languages for which such stop word data does not exist. So I have to work with an unsupervised method
Yes but I have to solve the coding questions, and get through the coding round. I haven't been doing any data structures and algorithms or leetcode etc
Plus my main work was data science related with a niche but I want to work in anything that is being offered
I am doing a kaggle compeitition, getting low f1 score (its stuck around 0.5) but high accuracy any suggestions on what might be causing this ? my dataset is balanced and has two classes
Aside from the high frequency, I doubt if there's a way to easily get the kind of accuracy you'd love to see on the project if you're not familiar with the 'new' language or at least a native speaker of that 'new' language you wanna work with.
That's why I kinda feel using SpaCy or TfidfVectorizer or CountVectorizer's stopwords parameter is best.
It's an interesting project regardless 😊
SpaCy and the other packages deal with stopwords using a pre-defined dictionary for some language, right?
If such a dictionary does not exist in them, I don't think they will be able to work for a new low-resource language.
Even going with the high frequency, this could easily mis-classify words that aren't stopwords as one.
Remember, even after removal of stop words, we still had to do Stemming and Lemmatization on a text. And most times, a stem of a word could very much appear frequently in a text too alongside the main culprit (stopwords)
However, certain features which I can think of is, properties of NOT STOP WORDS instead of choosing STOP words. One could be the rank of the sentence a word appears in, given a document, I think the top paragraphs convey way more information and as we go down it saturates and then concludes.
Such words in top sentences have a chance to be "keywords" rather than stop words, so words appearing in top sentences get lower weights.
Also frequency of a word within a sentence.
No. of sentences some word occurs in : More it is, the higher chances to be a stop word
Bigram frequency
Unique surrounding trigrams, if a word is surrounded by more unique words in a window of size 1, the greater chance it has to be a stop word.
And of course the frequency.
Combining these 6 features, I can train a classifier I guess
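Two of the features listed above (overall frequency and the share of sentences a word occurs in) can be sketched with the standard library alone. The tokenisation here is a simplifying assumption (regex word characters, sentences split on `.`, `!`, `?`) and would need adapting for languages that are not whitespace-delimited; the function name is hypothetical:

```python
import re
from collections import Counter

def stopword_features(text):
    """Per-word term frequency and fraction of sentences containing the word."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"\w+", s) for s in sentences]

    term_freq = Counter(w for words in tokenized for w in words)
    sent_freq = Counter(w for words in tokenized for w in set(words))

    total = sum(term_freq.values())
    return {
        w: {
            "tf": term_freq[w] / total,
            "sentence_ratio": sent_freq[w] / len(tokenized),
        }
        for w in term_freq
    }

doc = "The cat sat on the mat. The dog chased the cat. A bird sang."
features = stopword_features(doc)
ranked = sorted(features, key=lambda w: features[w]["tf"], reverse=True)
print(ranked[:3])
```

The remaining features (sentence rank, bigram frequency, unique neighbours) could be added as extra keys in the same dictionary before feeding the rows to a classifier.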
Yeah. Basically, I'd assumed it was majorly worked on by those who are fluent speakers or native speakers of that language.
Lemmatization and stemming will not exist for a low resource language as there is no pre-defined tree/rules for that language.
To my understanding...
We all understand English, so that's why we can from the top of our head sight a stopword in English without sweating it.
It'll be difficult to spot a stopword in Igbo Language if you don't at least understand the language to a reasonable level. Languages differ and can be complex at times... so I'm just thinking hmmm 🤔
Well, what do I know 🤷🏾♂️
@odd meteor Don't you think these features will be unique across any language? Even if its Igbo
Yes, that's why we need something to handle it, more of a syntactic approach where the meaning is not known.
I don't believe it will be tbh. Language can be complex and tend to differ in a lot of ways. Definitely, the only intersection I'd say all language would share when it comes to stopwords is their frequency of occurrence. And we need more than frequency of occurrence to get a reasonable result.
I don't believe those guys that worked on spaCy were able to come up with stopwords for several languages by just looking at frequency of occurrence in a text...
😂😂😂 To me, I believe it starts with grouping people who are native speakers of each language together to manually annotate, label or identify stopwords in that language
I might be wrong tho.... But that's what my brain is telling me they did
Yes, language can differ in a lot of ways, but the writing style for an article is consistent, as the frequency.
Which feature do you think from them will not be consistent across multiple languages? Because they are all term features, just like frequency (which you said will be consistent) , what differs them to not make them consistent? I am asking to gain more insights and better the features.
yo can i generate more than 1000 augmented image per image
let's say i have 10 samples and i generated 1000 augmented images per sample, so my total data set is 10010?
or based on the parameters or techniques that will be applied to the image there will be a limit on how much augmented images i can generate?
i am talking about the imagedatagen by keras
or if there are other library for image augmentation?
spaCy is open source, so I believe some contributors provided dictionaries of words for languages, frequency alone is definitely not the case. I will read the implementation of the module I guess
I saw, they just predefined a list of stop words for all languages, from different contributors.
That's definitely a good way to go about it. I'm not an expert in NLP myself... All ideas are valid at this point 💡
Now, that's more like it. I kinda perceived that's what they'd do.
I am only wondering as you said those features will not be consistent, but the frequency will be. So why is that the case, as implicitly most of them are working with some form of frequency and the rest are about order. If you can explain a bit on why you think they won't be consistent, then I can enhance them or maybe remove them. :)'
Certificates aren't always necessary to break into tech, but if you're already good in Data Science and Machine Learning, then you might wanna consider learning Deep Learning.
Start from TensorFlow + Keras , then take the TensorFlow certification exam if you wanna have a certification.
If you can add PyTorch + TensorFlow + Keras in your repertoire that'd be super 🔥 🔥 🔥
Never once said "certificates"
Coding round isn't remotely related to deep learning
Not sure about the implementation by imagedatagen, but by definition you can augment an image an infinite number of times!
However, the important aspect will be to check the redundancy among the augmented images, what parameters are u setting for augmentation depending upon the objective. because you can have 10 augmented images, and they are not providing sufficient NEW information for the neural network to learn anything significant.
do you think it can be a thesis topic? determine the optimal volume of augmented data for cnn models? like ill experiments using the same augmentation parameters and use 3 configurations
1 train the model with 30% of samples being augmented
2 train the model with 60% of samples being augmented
3 train the model with 90% of samples being augmented
will this topic hold value?
I think doing it from the redundancy and objective aspect with it will add more value, that is
Creating 10 augmented images for an "auto brightness" task will have high redundancy if the augmented images only have varying level of pixel brightness, but could be still optimal given the task.
However, for an object detection task, 10 of such images would be not optimal, in general (depends on the task again)
What I meant was that...
Languages can be fluid and dynamic, and as such, each language will be governed by different set of rules.
So I believe that Identified Features in language A will be consistent for language A but might not necessarily be in Language B and Language C. That kinda explains why spaCy had to use different contributors who supposedly are native speakers of the language they were assigned to work on.
what do you mean by auto brightness? classifying the brightness level? what i said earlier will be applied to a fish classification task could it be useful to determine the optimal amount of data for classifying fish images? or should i try to use a open source dataset and apply the configurations (1-3) and compare the result ?
do you think the result will be useful for others? or the topic itself is subjective on the task of what classification task is being conducted?
spaCy didn't use an unsupervised approach for stop words, that's why they used available stop word data and for the languages they dont have data, there is no rigorous implementation yet.
I am only quoting this line which you said:
Definitely, the only intersection I'd say all language would share when it comes to stopwords is their frequency of occurrence.
The other features I have defined are almost from a frequency perspective only, so I believe they have to be consistent. But I was only discussing which feature among the ones I have said won't be consistent.
Like given an image, which has a bit darker areas, the task is to increase brightness optimally on those areas only.
Like, say image processing part of many cameras in night mode.
there are already existing studies about this or not?
hello can you please tell me why we add (x) at the end of dense? `x = Flatten()(input)` and then `x = Dense(dense_shape, activation='relu')(x)`
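On the `(x)` question: in the Keras functional API, `Dense(...)` first builds a layer object, and the trailing `(x)` then calls that object on a tensor. The pure-Python sketch below shows the same two-step pattern. `Scale` is a hypothetical stand-in class, not Keras code.

```python
class Scale:
    """Hypothetical stand-in for a layer; not part of any library."""

    def __init__(self, factor):
        # Step 1: configure the layer (like Dense(units, activation=...)).
        self.factor = factor

    def __call__(self, inputs):
        # Step 2: apply the layer to its input (the trailing "(x)").
        return [self.factor * value for value in inputs]


x = [1.0, 2.0, 3.0]
layer = Scale(2.0)   # build the layer object
x = layer(x)         # call it on the data
x = Scale(0.5)(x)    # the same two steps written as one expression
```

Reassigning `x` on every line is what chains the layers: each call takes the previous output as its input.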
What I meant is, the task of finding an optimal number of augmentation depends on the objective too!
" determine the optimal volume of augmented data for cnn models" + "Given a task"
If your task is fish classification, then it becomes
"determine the optimal volume of augmented data for cnn models if we are doing fish classification task"
It will depend on the data as well.
A better topic will be
"Given any random task, and data, finding an optimal number of augmentation images for a cnn model", which will be also an interesting contribution
This was an example on how changing the task will change the optimal number for augmentation as well. So it becomes difficult to generalize, but also a good problem.
I think going with a meta-heuristics way is a good start for this.
My response was with the assumption that you're interested in getting into another ML position. However, If you're more interested in coding round, then I'd suggest data structures and algorithm related tests.
Personally, the only coding test I've done during one of my ML job interview stages was a 2 hours Kaggle problem. 😂 About 11 of us were given a real life company data and asked to build a model and make prediction... F1 score was the metric used to rank submission. Top 4 submissions proceeded to the next level of the interview stage...
Again, it depends on your country as well. Other experienced guys here might be able to add one or two....
Okay, I get your point now.
I'm in the US and have done ML-engineering + DS + DE roles --- it's a wildly different process depending on the company and their needs.
should i learn oops in python to understand it better?
So, depends on what you're into 0n3. You've done MLE for 1.5yrs, you prob know a direction you wanna move into. From there, perhaps look at some Indeed / BuiltIn postings to see what technologies and so forth some desired companies use.
I've never been asked to do anything related to deep learning w/rt my work/interviews but I'm also not in an industry which uses deep learning frequently, so YMMV.
oh i see then it will not be a good topic unless i figure out a technique on the better topic you suggested
Regardless of where you move into, if you haven't checked out Advent of Code (which is happening now!) check it out. Solving those problems will teach you a ton about software engineering and the like, and that's always a plus for applicants (at least in the fields I work). There's also a bunch'a channels here for it.
where is this advent of code?
website?
or channel?
It can be found at https://adventofcode.com/ and the associated channel here is #advent-of-code .
Oh, also, I meant this for 0n3, but it is useful for anyone.
@tall loom yo sir i found a paper that classifies fish using cnn also but they did not apply augmentation on their dataset so if i redo their study and apply augmentation would it be a good contribution?
sorry i just dont know how to classify what is a good or nah topic because there i see more on solving low level problems like about the theories which is so hard for me and i see this classification tasks as topic and i dont know why our topic got revised and demanded a modification or alternative to data augmentation or find an issue on a cnn model that we can address which is soo much for me hahaha
oh this like challenges ?
its like school festival but for programmers?
It's a series of programming challenges, yes.
@tall loom yo sir i go back about the image augmentation stuff
what if what i experiment is the parameters?
each configuration will have each parameters and tested
1 zoom, flip
2 color manipulation
3 patch erasing cropping
the best resulting parameters for image augmentation can be my contribution for like
if there are anyone interested in training a model on classifying fish and they have those fish images they can use the result of the experiment on which augmentation techniques they can apply on their model yeah? or nah?
they recommended me to think of an alternative way or better way for image augmentation for cnn models but i think everything is already there i cant figure out new technique
Here is how it will be:
- You are getting better results by using image augmentation, on the methods which THEY have used.
- There already exists other models which can classify fish, but it is not something that THEY have used, and perhaps they can generate better results without any augmentation, but for them baselines might already exist!
However, it can still get accepted, I know of a case personally who worked on some classification (can't disclose) on top of image augmentation, however they also used more dataset.
From a contributions perspective, it's not just better results though, it's more about WHAT is used to get better results, augmentation is one way and it can still be considered as a topic, but if you are looking for novelty then this is not it, but that is just my opinion. As augmentation is standard practice these days for similar tasks.
yeah i see other topics like a classification task with cnn also and they got accepted which is weird because we are pretty much the same almost only the samples are different
also the reason they say that i need revision is because i need to have some originality or something like what i see on cnn that i can change or what alternative method i can replace with image augmentation which is what you say a novel topic right? why are they expecting that kind of idea to me hahahaha
so i am trying to find something like a comparative analysis type of study which maybe acceptable than creating a whole new theory which is impossible for me
Yes, optimal parameter estimation is a good idea, although that is also task-dependent, and if you are restricting yourself to the task of "fish detection" then many things can be explored.
Although not generalized, it will still be a good analysis for this task and then it can be inferred by some empirical experiment on how it can be used further for similar if not all tasks.
But just testing won't be good enough: if you are doing only an iterative approach to find an optimum point, that will be data-dependent. But this can be a baseline for this data. The motive could be "How the standard practice of such and such hyperparameters for the fish classification task can be improved if we modify this hyperparameter in this way". This you would have to show empirically of course, and then recommend something based on that.
i need to present a logical reason on why this configuration is better than that and that? is what you mean?
btw you are pro sir are you one of those who create papers and attend conferences?
Yes, you will start with that logical reason as your hypothesis and then you prove it by showing it from results with that experiment of tuning. You will either have to justify why the previous tasks have NOT explored your configuration OR you will be the first to suggest that configuration, for that task.
And the main theme should be how this idea can be implemented for other similar tasks.
for example in this 3 configuration
1 zoom, flip
2 color manipulation
3 patch erasing cropping
i got 1 as the best result that implies a good parameter for data augmentation of fish images
i need to say why is it the winner
like because the 2nd configuration makes the image pixels change colors thus making the model confused because fish colors are important features(for example this is my observation on the model using this configuration)
and another reason for 3rd configuration
the reasons would be from the observed result of experimentation
The other way.
This is like, you did some experiment randomly, analyzed some configurations, and then found this should work well, and then you are making a theory of why it is working better. This works when there is a generalized data or generalized model. But here, it's one specific task on some data!
So you would have to then explain further on why that "image pixel theory" is correct, not just for this experiment, but for in general for similar tasks.
(You can try running on other datasets, previous and after results, etc)
Or you can also start by proposing that why enhancing some parameters should work better and is something that has not been done in previous works, then you give a theory, and then implement it and show that it is correct for a number of datasets, this is how it should work though, as you start with some hypothesis based on some observation and then you prove it.
oh for example i have the results then i will try the configuration on example imagenet dataset and check the result with and without using the configuration
but if i fail to prove it do i also fail my semester? 😅 or is it still an acceptable discovery?
for example my hypo is the color manipulation is bad configuration and it turns out good dang what do i do hahaha
No, you won't fail, the conclusion would be that hyperparameter tuning for such and such parameters is not acceptable and is something that is not worth implementing, given the increase in efficiency is not sufficient to justify the extra data.
This helps others to NOT explore the part or explore in a different way(some other way which you can also suggest in end to explore further as ideas)
oh so this what a real science topic is right?
Also, the people judging also matter in this case if it's for semester exams (a lot of bias occurs from panel members), you would have to present it in a different way then.
Keep a few hypotheses on side(like different configurations, you would have to read some papers on why people are exploring something), and if all of them fail on your data, then your project will be simply "effects of certain changes in a system" and tbh it will be an interesting study to explain why some tuning didn't work and why it worked for others. If you figure this out, you can definitely suggest some different directions to explore by end.
Yes yes! Just make sure to not just analyze things and say something about them, it should be a comprehensive study and any claim should be justified with experiments+/mathematically or citing other works
yeah there are different panel members by each group and maybe i draw the short straw on this one hahaha
trying to install Tensorflow/PyTorch on my Mac M1. is it possible without installing conda?
conda comes with too many unused packages and junk associated with it.
its like everything should have a legitimate basis? this is the crucial part right?
do you need CUDA?
just pip install torch should work on Mac but getting CUDA requires extra work, apparently.
image augmentation is pretty popular so maybe there are alot of classification task that uses image augmentation that i can compare
No, don't worry. Any study which concludes something with experiments and ideas is accepted. It should be robust though, the part after the results should be rigorous. For example: Just analyzing a dataset and then concluding something about it is not acceptable, you would have to justify the conclusion, its reason with robust evidence, and then further generalize it.
Yes, you cannot say something happened because you think this is why it should happen. Your reason should have evidence.
The only thing to be careful about is this: "You are explaining to the panel what you thought should happen, its reason, and you concluded it didn't happen"
And turns out your reason has flaws OR something didn't happen and the cause of it is something to be studied before even starting. Discuss this with supervisor 🙂
yeah, I do
let me see
looks like Anaconda is the recommended way to do it for Mac.
Hello, im trying to do data augmentation for my dataset of 40 images and am a bit confused
Do i do the augmentation per image and generate 10 images
yeah we got advisers but they are pretty much busy anyways i think i got this topic and maybe give this idea to the panel and start drafting on how should this go
incase it got accepted hahhaha because base on my understanding is i need to have something original to add or modify to existing algorithms well maybe this experimental study could suffice the panel
thank you very much sir hoping to talk to you again with my future struggles 😅 👍
Or should i go about it in batches??
```py
train_datagen = ImageDataGenerator(rescale=1./255, rotation_range=45, horizontal_flip=True)
```
i have been stuck and idk what i should do after this :/
Yes, even if it's not original, a rigorous study is sufficient for a thesis. However, some modifications and suggestions can help in publication as well if that is your goal. And dont call me sir! 😄
Is anyone here from the United Kingdom?
why do you ask?
Has anyone here done A-Level maths?
I wanted to ask if the knowledge for calculus from there is sufficient enough to do stuff in machine learning
Or would I need to learn anything extra?
I can't speak for what the job market is like in the UK, but AI depends on linear algebra and prob/stat, so it depends on whether they will let you learn that theory on the job. I suspect not.
You have to augment every image in your training dataset
And how many images should be generated per image? I think i am actually confused
In that case, I think i should be fine then
I'm starting uni soon but I heard that machine learning is heavy on calculus, which the degree I am taking doesn't expand on further
it's more linear algebra than calculus.
So I was wondering whether I'd need to catch up on or learn anything extra
But there was this other book I found online called mathematics for machine learning
Would you recommend it?
Also I'm augmenting before training the model, is it necessary to augment on the fly or can i augment beforehand, save the images to the drive and use them in the cnn later?
you should learn linear algebra and probability/statistics. My degree (B.Sc. in computer science/data science, in the US) had both.
So you split the data into train, test, and validation
The number of images depends on your motive, you can create a rescaled image, a section of the image, some rotation image, etc
Can i go into data science with limited knowledge in calculus? So just stuff like differentiation and integration and other things
No
What else would I need then? Apart from linear algebra and stats
in terms of math, or other general skills?
Maths
You can augment it anytime, but the 'on the fly' is more like a function you are applying on the image, so you don't have to generate and save all those images, rather function(Image) goes in the network. (I think this is how the library should do, to save space and avoid reading writing all images, or at least that's how I would implement)
But from the learning perspective, generating 'on the fly' or providing input of all augmented images from saved data makes no difference.
All it is doing is, creating new data for the model to learn, and all goes in the network @wicked grove
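To make the "on the fly" idea concrete, here is a minimal NumPy sketch in which the augmentation is just a function applied as each batch is drawn, so nothing is written to disk. The helper names `random_flip` and `augmented_batches` and the toy 8x8 arrays are assumptions for illustration; this is not how any particular library implements it.

```python
import numpy as np


def random_flip(image, rng):
    """Flip the image left-to-right with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image


def augmented_batches(images, batch_size, rng):
    """Yield batches forever, augmenting each image as it is drawn."""
    while True:
        idx = rng.integers(0, len(images), size=batch_size)
        yield np.stack([random_flip(images[i], rng) for i in idx])


rng = np.random.default_rng(0)
images = rng.random((40, 8, 8))      # 40 toy "images", as in the question
batches = augmented_batches(images, batch_size=16, rng=rng)
first = next(batches)                # built in memory, nothing saved
```

Saving the augmented images first and reading them back later gives the model the same kind of input; the generator version only trades disk space for a little compute per batch.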
I've found a book that teaches maths for machine learning
I'll just teach myself the math needed
Thanks for informing me of the math needed
I'm not in the UK but I did A-level in Math, Physics, and Economics. From my experience, my knowledge of A-level in math was sufficient for calculus but poor in linear algebra and statistics. I learned Linear Algebra and Statistics in the University.
Thank you for informing me
I'll learn the stuff needed then
there are optional modules in a level maths for stats, linear algebra is not taught coz u are assumed to know it already
the last time someone was sure I was the only one who could help them, some of their limbs were never found.
I have a separate folder w images for validation and test
I think linear algebra is taught in fm
hahahahahahaha
I don't take fm because I'm massive stoopid
pretty sure linear algebra is heavy in c3 and c4
r u taking physics a lvl?
Comp sci, physics, maths
@serene scaffold come #help-pretzel
what library should i use as a beginner? like tensorflow, pytorch, keras
Thank you soo much for the explanation!! I am doing this for a thesis paper. After getting more images i wanna preprocess it and then feed it into a cnn
I would do something that isn't deep learning first.
So, none of them for now.
then?
sklearn
gg bad combo
Also another question
What should be the batch size for the augmentation
seems like a good combo at first but thats horrible
ah ok got any tutorials ?
not off the top of my head
It's good enough for me
I enjoy physics and I want to take comp sci at uni
So it works imo
how?
if u taking physics u must do FM too coz 50% of the topics are same
is it ur second year?
i just cant find any on yt
I know a lot of people who take physics but not fm
Then again, idk the contents of fm
Also yeah, second year
is sklearn and scikit-learn the same?
yh thats usually the teachers stupidity not knowing how many topics are actually linked. they keep saying no FM is too hard when u literally cover most of that in physics too.
I plan on teaching myself fm once exams end so I'll try and draw comparisons then
how r u getting along with ur a lvls so far?
which one shoulds i watch?

Good, thanks for asking
I submitted my uni app about two weeks ago and I've heard back from 4/5 of my unis
Well, I'm from 🇳🇬. Statistics was the last module in Math curriculum that briefly introduced us to stats. And it only covered introduction to measures of central tendency... nothing too serious.
I proceeded to do my major in Statistics afterwards though.
which ones did u apply for?
UCL, QMUL, Loughborough, City, Notts
there are many optional modules for maths u only did S1 there is also S2,S3,S4
all russel?
City and Loughborough are the odd ones out
@tall loom im so sorry i had another doubt
While doing the augmentation for the entire dataset
What parameters should i set to end up with about 1000 images
Perhaps we have different A-level bodies. The one in my country is probably different from yours. Was yours Cambridge A-level?
they are usually slightly different in other countries
gl fam, u need like 3 As for ucl
Thank you
I'm predicted A* A* A
So i think i should be fine
yh u will be fine probs
I hope so
UCL is the only one I haven't heard back from as of yet
😃 Check all of them first before you settle for one. Drop anyone that doesn't work for you.
PS: I'd also advise learning from a more structured platform. Learning solely on YouTube can be overwhelming and also cause fatigue.
got a friend who went to ucl, everyday was a misery for him afaik
u will if the predicteds are correct
Was it that bad? I've heard good things about it
in terms of difficulty*
I just about meet the requirements so hopefully I should get in
yea ty
you got any structured platforms that i can learn from?
Ah, that makes sense
My first choice is either qmul or ucl
Is it important to learn oops and dsa in python for a career in ml or ds ?
- Andrew Ng Coursera course on ML
- Udemy.com
- DataQuest.io
- DataCamp.com
- Kaggle.com/learn
- Jovian.ai
- Neuromatch Academy
ty
In fact, Neuromatch Academy has one of the best customer-friendly courses on Deep Learning.
https://academy.neuromatch.io/
Check their YouTube channel also
There's a youtube channel called DataSchool, his tutorials were really good
why queen mary?
they arent that good
compared to ucl
Close to where I live and my girlfriend is going there for medicine
I didn't learn OOP before I started ML. However, from my experience it'll be great to know OOP in Python before starting ML.
You'll find it very useful when you start learning PyTorch. Because PyTorch will assume you already have the knowledge of OOP. If you use other DeepLearning frameworks like TensorFlow + Keras then you won't really use OOP.
It's good to know at least two frameworks. Avoid being over-dependent on one DL framework.
PyTorch, TensorFlow + Keras, JAX, MXNet, Sonnet, CNTK, Caffe etc..
Just know at least two.
So I started DL with TensorFlow but I'm currently learning PyTorch now. But I had to learn OOP first before coming back to PyTorch.
@odd meteor
I can't really tell you which one to use. Just check all of them and go for the one that works for you.
Jovian is nice but so are other websites as well.
I started with Andrew Ng course then I dropped it and moved to Udemy.
Ohhh okayy ,thank youu😁 im learning tensorflow right now
And what about dsa? Or do i just practice python on leetcode/hackerrank
DSA? Data Science Africa? 😀
udemy is paid
I've been curious --- most of y'all seem to be doing a lot of NN-type things, but I've seen very little of it in the field in the last ten-or-so years where I've done work [IoT, Travel, Loans...] besides a few ad hoc projects. Do your fields have you working with NNs quite a bit?
I think the most I've used'em is autoencoders for fraud / sensor fluctuations.
Edit: I didn't mean to sound judge-y here! I'm legit interested in what people are doing with NNs.
😂😂no no data structures and algorithms
Ooh... I wasn't a software developer before getting into ML so I don't really know beyond my undergrad Big O notation in csc class 😩 I don't know anything about binary tree either.. Lol
I do pick up new stuff along the way though. I was first a Statistician before ML blew up to become something I could no longer ignore .
Oftentimes much value is gotten from paid services 😁 don't you think so?
There's no amount of free ML course online that could match Courses on platforms like DataCamp and DataQuest. I personally haven't seen one yet.
ohhh alrightt!!i guess ill leave it for now then
@odd meteor i was trying out the data augmentation and this is what i did ... could you please tell me how i can i save the augmented images?
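```py
from tensorflow.keras.preprocessing.image import ImageDataGenerator  # assumed import, not shown in the original message
# Raw string so the backslashes in the Windows path are not read as escapes.
train_path = r'D:\glaucoma_train\ODIR-5K\ODIR-5K\Glaucoma'
train_datagen = ImageDataGenerator(rescale=1./255, rotation_range=45, horizontal_flip=True)
train_generator = train_datagen.flow_from_directory(train_path, target_size=(512, 512), batch_size=16)
```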
I think Research oriented companies tend to use NN the most. DeepMind, Google, etc.
My company (folks at Research department actually) mostly worked on CV + GAN this year. With two research papers on the projects accepted at this year's NeurIPS.
This checks out. I've mainly seen them in academia or on things which are heavy users of image-processing [self-driving cars, satellite + agriculture, etc.] so I was interested in seeing who was doing what with them.
Out of all of the NNs I wanted to dip into, GANs are among the top ones. They're pretty interesting!
I don't know enough about Image Processing or CV yet to give an appropriate response at this time.
I just started my DL journey with NLP
Dang, have fun with NLP. That's one I started up again because it seems to be in demand at more and more places. It's cool, I just don't remember a whole lot of it. And I def don't know the new techniques.
It's mind blowing to be honest 😀. I do see the wonders those guys in my company's Research Dept are making from time to time.
Yeah, I get a little jealous that most of what I'm doing is fittin' GLMs and boostin' a bit! Haha, but NNs, even with LIME, are pretty terrible explainers and it's hard to say to a customer, "Hey, your XYZ is going to break." "What signals lead you to believe that?" "Idk, lol."
Bruhhhhhh you're not the only one that's jealous. Why did I just start learning Deep Learning? Because I can't take it anymore.... I just have to know it too 😀 😂 😂
I know what'cha mean. Maybe I'll jump down this rabbit hole too and try'ta learn some!
is... is matplotlib not sufficient for you?
My teacher asked me to draw a function and then apply transformation on it in opencv
Guys can anyone tell me how statistics ,AI, ML ,DL , Probability are used in Data Science . I am very confused so kindly give me a real life example
your confusion stems from probably not defining data science well
how would you define it?
F
Well thanks for your input 🤠 I don't know how most companies operate, but from what I've seen coding skills seem way more important (think MLops) than ML skills as a MLE, for Data Scientist roles what u say maybe true (more Kaggle less leetcode etc) but again thanks 👍
Right
Same for me no deep learning, more and more of ML quick fix solutions then all about getting them deployed, AWS sagemaker + lambda ▶️
Yes ! That's the kind of stuff I need for a new job
Data Scientists do use a lot of NN, one of my friends has a few papers published and is going to pursue a PhD too soon, got 1 patent too. He works at Uber in the computer vision section so yes he does regularly write purely data science oriented python programs ...
There are more and more jobs becoming like this
Hi i'm trying to plot(scatter plot) with binned data on the x axis and number of points in each bin on the y axis would anyone know how I would do this, any help appreciated, my code: ```x_cl, y_cl=zip(*sorted(zip(CL,peak)))
dat = pd.DataFrame({'CL' : x_cl, 'Y_VAL' : y_cl}) #we build a dataframe from the data
bins=np.arange(min(x_cl), max(x_cl), step=0.005)
categorical_object = pd.cut(x_cl, bins)
count=pd.value_counts(categorical_object)
grp = dat.groupby(by = categorical_object) #we group the data by the cut
plt.scatter()
plt.show()```
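One way to get "bins on the x axis, number of points per bin on the y axis" is to count with `np.histogram` and use the bin midpoints as x positions. This is a sketch on made-up data; the array `cl` stands in for the `CL` values in the question, which are assumed to be a 1-D sequence of floats.

```python
import numpy as np

# Toy stand-in for the CL values in the question (assumption: 1-D floats).
rng = np.random.default_rng(0)
cl = rng.uniform(0.0, 0.1, size=500)

# Same bin width as the question; one extra step so max(cl) is covered.
edges = np.arange(cl.min(), cl.max() + 0.005, 0.005)
counts, edges = np.histogram(cl, bins=edges)

# One x position per bin: the midpoint of its two edges.
centers = (edges[:-1] + edges[1:]) / 2
```

With matplotlib, `plt.scatter(centers, counts)` would then draw one point per bin. Note that `np.arange(min, max, step)` as written in the question generally stops before `max`, so the largest values would fall outside every bin.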
i need help in #help-cheese regarding data science
pd.DataFrame.hist()
i think it has param options to use plt as backend as well
oh scatter
sorry
@mighty spoke https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html this looks like what youre trying to do
there's a lot of data out there that isn't purely tabular, and deep learning is less about neural networks and more about "architectures for creating certain inductive biases for un- or semi-structured data"
and actually, even for tabular data we use it
I usually hear deep learning as using anything with hidden layers, but what sort of things are you doing with tabular data, for example?
i mean, you can build a basic model that takes a tabular dataset for regression and use a transformer-based model on it
there are some conveniences that it gives, one of which is it is significantly easier to write custom loss functions
slightly off-topic, how important are custom loss functions? are the standard ones usually too generalized?
it depends on your problem, of course
and "custom" is relative to the specific package you're using and which functions are pre-installed
I see, so you didn't necessarily mean loss functions you write yourself
like, if you're working on underwriting, you might use tweedie deviance as your loss function
some GBM packages support it, some don't
you will often have to write one yourself if it doesn't
but there are times where you have to write it yourself
and in that case you would wish that you were working with an autodiff library like pytorch
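As a rough illustration of what "writing it yourself" involves, here is a NumPy sketch of the Tweedie deviance for a power between 1 and 2, plus the hand-derived gradient with respect to the prediction. The function names and the toy numbers are assumptions for illustration; packages that accept custom objectives typically ask for derivatives like this one (and usually the second derivative too), which is the part an autodiff library would produce for you.

```python
import numpy as np


def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance; assumes 1 < power < 2, y_true >= 0, y_pred > 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    p = power
    per_sample = 2.0 * (
        np.power(y_true, 2 - p) / ((1 - p) * (2 - p))
        - y_true * np.power(y_pred, 1 - p) / (1 - p)
        + np.power(y_pred, 2 - p) / (2 - p)
    )
    return per_sample.mean()


def tweedie_gradient(y_true, y_pred, power=1.5):
    """Derivative of each sample's deviance with respect to its prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 2.0 * np.power(y_pred, -power) * (y_pred - y_true)


claims = np.array([0.0, 2.0, 10.0])      # toy targets, many zeros are typical
predicted = np.array([0.5, 2.0, 8.0])    # toy predictions, must be positive
loss = tweedie_deviance(claims, predicted)
grad = tweedie_gradient(claims, predicted)
```

The gradient is zero where the prediction equals the target, positive where the model over-predicts and negative where it under-predicts.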
does this also extend to validation methods?
your validation method usually uses similar metrics to those used in model training
and if not, then you would need to check that
like, god forbid you have to write your own implementation of the hyvarinen scoring loss in lightgbm
ml world is a scary place it seems. just took intro to AI this semester and was curious. thanks for humoring me
Sure, but aren't transformer-based models NNs? I do agree that that kind of ensemble might be nice to do --- I don't think I've seen it done outside of CV though.
Oh, maybe I misunderstood what you were saying. It's not just plugging things into NNs, but rather a methodology of model-making around certain types of data. Maybe?
is shaping the same in all frameworks?
Hello, I was working on a binary classification NN with Pytorch and I got this issue: https://stackoverflow.com/q/70405429/13071340 , maybe you have any ideas on how can I fix it.
yes
i mean, deep learning is kinda catch-all term, right?
but a lot of research is really centering around the idea of "inductive biases"
Got'cha. Yeah, but I usually hear it specifically referring to things where you're using hidden layers to feature-find.
I'm not sure I've heard of Inductive Biases --- it seems that https://en.wikipedia.org/wiki/Inductive_bias is fairly general though.
a simple example of an inductive bias is in a conv net
if you have images, you would expect that the correlation between two pixels drops off heavily with distance
so a fully-connected neural net would perform not great on it
Sure, this is also the principle behind k-NN though, no?
mostly because it has parameters for every distance on the image
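A quick back-of-the-envelope count shows what the convolutional inductive bias buys. All sizes below are hypothetical and chosen only for illustration.

```python
# Hypothetical sizes, chosen only for illustration.
height, width, channels = 64, 64, 3
hidden_units = 128          # width of one fully connected layer
filters, kernel = 128, 3    # one 3x3 convolution with 128 filters

# Fully connected: every input value has its own weight to every unit.
dense_params = height * width * channels * hidden_units + hidden_units

# Convolution: one small kernel per filter, shared across all positions.
conv_params = kernel * kernel * channels * filters + filters

print(dense_params, conv_params)
```

The dense layer learns a separate weight for every pixel position, while the convolution reuses the same small kernel everywhere, which encodes the assumption that nearby pixels matter most and that patterns look the same wherever they appear.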
So, I guess the idea is "which inductive biases work in what way for what problem" is a research topic for NN stuff. Makes sense.
yeah, the inductive biases of knn is that "you are defined by what you're close to"
there's a lot of ML research these days on "how can i build my model architecture to exploit these inductive biases that i want to hold"
Yeah. If we are gonna just take deep learning as "not shallow learning" (not given features a priori) then it would probably be good to know what features it can create and what proximity to cluster them around, via inductive biases.
that's a gist, yeah
Yeah, that jives with what you noted about custom loss functions, because it's basically the same kind'a deal.
It would be nice to ensemble something onto a standard alg for tabular data (even for a toy problem), but my hesitance is always that I need interpretability for most of my jobs. But maybe just for fun I'll try something out on a toy data set and see what it picks up.
interpretability is a harder job, but at my job we approach it via a lot of ablation and perturbation testing
also, keeping a bunch of reasonable synthetic data sets around works wonders for debugging weird model predictions
Yeah, LIME + Perturbation was pret much stock and standard when I was doing it, but LIME is sometimes a pretty big nightmare to work with if you're feature-heavy.
agreed
unfortunately, that's pretty state of the art unless you have someone whose full time job is to investigate this stuff
Yeah, the other kin of LIME are pretty much more or less specific versions. I honestly don't know, with the computing power we have now, how we would get anything better than pert testing and maybe well-defined LIME stuff. But maybe someone clever is on it.
Maybe I'll try it on some easy, small-featured dataset and see if I can mess around with different kinds of transformers. Might be worth learning!
This is a result of a HalvingRandomSearchCV(), which looks nothing like the example here https://scikit-learn.org/stable/auto_examples/model_selection/plot_successive_halving_iterations.html#sphx-glr-auto-examples-model-selection-plot-successive-halving-iterations-py
Can it still be considered "correct?"
RandomizedSearchCV took 24.04 seconds for 188 candidates parameter settings.
Model with rank: 1
Mean validation score: 1.000 (std: 0.000)
Parameters: {'eps': 0.0007622902978304772, 'fit_intercept': 1.0, 'normalize': 1.0, 'tol': 0.00015997189211448186}
Model with rank: 2
Mean validation score: 1.000 (std: 0.000)
Parameters: {'eps': 0.0007830958292280829, 'fit_intercept': 1.0, 'normalize': 1.0, 'tol': 0.0002491044511612459}
Model with rank: 3
Mean validation score: 1.000 (std: 0.000)
Parameters: {'eps': 0.0007622902978304772, 'fit_intercept': 1.0, 'normalize': 1.0, 'tol': 0.00015997189211448186}
this is some more output of my search
Can someone give me a hint on how to avoid lack of convergence in DCGAN model? I'm kind of tired of looking at figures of squares full of random colored pixels...
My Adam optimizers have the same learning rate as the one used in Pytorch's tutorial for DCGAN, yet my model doesn't work.
23 for y in range(len(pos)):
24 for k in range(d):
---> 25 angles[y,k] = np.sin(y/(10000**(2*i/d))) if k % 2 == 0 else np.cos(y/(10000**(2*i/d)))
26
27
ValueError: setting an array element with a sequence.
Trying to code position encoding for an RNN Transformer
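That ValueError usually means the right-hand side is an array rather than a single number, for example if `i` is itself an array. One way to sidestep per-element assignment is to build the whole table at once. This is a minimal sketch of the usual sinusoidal encoding, assuming NumPy; the function and argument names are made up for illustration:

```python
import numpy as np

def positional_encoding(n_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: sin on even columns, cos on odd columns."""
    positions = np.arange(n_positions)[:, np.newaxis]     # shape (n_positions, 1)
    pair_index = np.arange(d_model)[np.newaxis, :] // 2   # 0, 0, 1, 1, 2, 2, ...
    angle_rates = 1.0 / np.power(10000.0, 2 * pair_index / d_model)
    angles = positions * angle_rates                      # shape (n_positions, d_model)

    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])
    encoding[:, 1::2] = np.cos(angles[:, 1::2])
    return encoding

pe = positional_encoding(n_positions=50, d_model=16)
print(pe.shape)  # (50, 16)
```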
Hey, I have been having trouble trying to install Bazel, is anybody willing to help me with the process. With bazelisk I'd think this would be really quick and easy I don't know why I'm running into so much trouble
You'll have to provide more context than that, perhaps you should show more of your code
How do i connect a folder to google drive..
has anyone here tried openAI’s codex?
i need a little help with it
does anyone know what is the TensorFlow v2 equivalent for image_dim_ordering()
what does image_dim_ordering() do actually
@serene scaffold hello i have a doubt in pandas
The length of this df is 7820
I want to make only the images with 0s to a total of 1000 and drop the rest
Can you please guide me
Should i split it into separate columns and use df.drop or is there a simpler way?
If you only want 0s why use df? I'm not sure what you mean over here.
how do I connect to a sqlite3 database that's on another computer?
do I have to host it on a server?
No no i want both 0s and 2s but i want to reduce the images having 0s to a 1000
And i do not know how i should do that
You can find all images for 0 and then slice down to 1000?
How do i do that??
Should i split the df such that i have 2 columns one w 0s and another w 2s or is there a better way?
that is the way i can think right now.
I don't help with screenshots of DataFrames. Sorry 
```py
merged_labels = pd.merge(train_labels1, labels)
merged_labels.tail(5)
index_names = merged_labels[ merged_labels['level'] == 1 ].index
merged_labels.drop(index_names, inplace = True)
#print(merged_labels.head(20))
merged_labels_norm=merged_labels[merged_labels['level']==0]
merged_labels_dr=merged_labels[merged_labels['level']==2]
#print(merged_labels_norm.head(6))
print(len(merged_labels_dr))
merged_labels_norm.iloc[0:1000,:]
merged_labels_dr.iloc[0:1000,:]
final_df1 = pd.merge(merged_labels_norm,merged_labels_dr)
print(final_df1.head(5))```
this is what i tried
I need to know what data is in it as text. df.head().to_dict('list')
this is the df
```
0 10003_left 0
1 10003_right 0
2 10007_left 0
3 10007_right 0
4 10009_left 0
... ... ...
8403 19494_right 0
8404 19498_left 0
8405 19498_right 0
8406 194_left 0
8407 194_right 0```
please do df.head().to_dict('list')
that's the only way that I'll read it. Otherwise I have to go back to what I was doing.
oh okayy
if I understand correctly, you want to retain up to 1000 rows for which level is 0, ignoring the rest
try df.query("level == 0").head(1000)
yess exactly!!
alright i will thank youu!!and i need to concat after that
but i get an empty df
concat with what?
```
'10003_right',
'10007_left',
'10007_right',
'10009_left'],
'level': [0, 0, 0, 0, 0]}```
what do you need to concat?
with a df that has level 2
the above df has images with level 2
what are you really trying to do? there's apparently more to it than just getting 1000 rows where level == 0
i split it in the same way
correct there are rows with level==2
what are all the unique values in the level column?
is it 0, 1, and 2 only?
and you need 1000 from each?
because that's just df.groupby('level').head(1000)
i got itt!!i did this, it was a stupid mistake
```py
merged_labels_norm=merged_labels[merged_labels['level']==0]
merged_labels_dr=merged_labels[merged_labels['level']==2]
#print(merged_labels_norm.head(6))
print(len(merged_labels_dr))
merged_labels_norm=merged_labels_norm.iloc[0:1000,:]
merged_labels_dr=merged_labels_dr.iloc[0:1000,:]
final_df1 = pd.concat([merged_labels_norm, merged_labels_dr])
print(final_df1.head(2000))```
my way is probably going to be faster.
i will try it outt
ohhh i did not think about thiss
only 0 and 2
then what I suggested would work. If there were values in the level column that you didn't want, only slightly more would be needed to ignore them.
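A small sketch of the groupby suggestion, including the "slightly more" needed to ignore unwanted levels. The table below is a made-up stand-in for the real labels, and `cap = 3` plays the role of 1000:

```python
import pandas as pd

# Made-up stand-in for the real labels table.
df = pd.DataFrame({
    "image": [f"img_{i}" for i in range(10)],
    "level": [0, 0, 0, 0, 0, 0, 2, 2, 2, 1],
})

keep_levels = [0, 2]   # levels we want to keep
cap = 3                # stands in for 1000

balanced = (
    df[df["level"].isin(keep_levels)]   # drop rows with any other level
    .groupby("level")
    .head(cap)                          # first `cap` rows of each level
)
counts = balanced["level"].value_counts().to_dict()
print(counts)
```

`head` keeps the first rows in their existing order. If a random selection is preferred, shuffling first with `df.sample(frac=1)` should work.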
alrightt!! thank youu😄 ill need slightly more or just 1000 is enough?
idk what you're trying to do, so idk.
Do you control the body
Hey guys, i have these different set of files that are generated when i run my model. I want to organise all the files by putting them to respective folders. is there a smart way to organise them than manually selecting files and putting them to different folders?
I don't read screenshots of text; are you doing it systematically based on their file name?
yes
yes; I am the glasses
I recommend using pathlib
okay!! do you have some examples to follow through?
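A minimal pathlib sketch, assuming the files are named so that the part before the first underscore says which folder they belong in. The file names and the `organise_by_prefix` helper are hypothetical, and the demo runs in a temporary directory so nothing real gets moved:

```python
from pathlib import Path
import shutil
import tempfile

def organise_by_prefix(folder: Path, separator: str = "_") -> None:
    """Move each file into a subfolder named after the text before `separator`."""
    for path in list(folder.iterdir()):
        if not path.is_file():
            continue
        target_dir = folder / path.stem.split(separator)[0]
        if target_dir.exists() and not target_dir.is_dir():
            continue  # a file already has the folder's name; leave it alone
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))

# Demo on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["run1_loss.csv", "run1_model.bin", "run2_loss.csv"]:
        (root / name).touch()
    organise_by_prefix(root)
    moved = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())

print(moved)  # ['run1/run1_loss.csv', 'run1/run1_model.bin', 'run2/run2_loss.csv']
```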
what were you expecting tensorflow to do with that array of objects?
do you understand why an array of objects that are lists is not a valid input?
that's part of it. arrays have to be "rectangular". you can't get around that by having an array of lists that are different lengths
but also, just naively throwing data into the network isn't going to accomplish anything. it looks like you haven't done any kind of pre-processing.
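To illustrate the "rectangular" point: lists of different lengths have to be padded (or truncated) to a common length before they can become a proper 2-D array. A plain-Python sketch with a hypothetical helper name follows; Keras also ships a `pad_sequences` utility for the same job:

```python
def pad_to_rectangle(rows, pad_value=0):
    """Right-pad every row with `pad_value` up to the length of the longest row."""
    width = max(len(row) for row in rows)
    return [list(row) + [pad_value] * (width - len(row)) for row in rows]

ragged = [[1, 2, 3], [4], [5, 6]]
rectangular = pad_to_rectangle(ragged)
print(rectangular)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```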
@serene scaffold ?
no
Hey, I have been having trouble trying to install Bazel. I am installing with Bazelisk, and when installing through command line it seems to install correctly, I will get the text Starting local Bazel server and connecting to it... [bazel release 4.2.1] So I try and test it out by getting the version number but receive the error 'bazel' is not recognized as an internal or external command, operable program or batch file.
do you have pip
yes
upgrade required I think, so: sudo apt-get upgrade bazel
```
'sudo' is not recognized as an internal or external command,
operable program or batch file.
C:\Users\*******\Desktop\bazelisk-master>apt-get upgrade bazel
'apt-get' is not recognized as an internal or external command,
operable program or batch file.```
I'm running windows btw
you need to add bazel (or bazelisk) to your PATH
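A quick way to check whether the shell can actually find the executable, using only the standard library. Note that PATH entries are directories, so the entry to add is the folder containing the exe, not the exe itself. The helper name is made up:

```python
import os
import shutil

def find_on_path(command: str):
    """Return the full path of `command` if it is found on PATH, else None."""
    return shutil.which(command)

for name in ("bazel", "bazelisk"):
    print(name, "->", find_on_path(name))

# PATH is a list of directories separated by os.pathsep (";" on Windows).
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
print(len(path_dirs), "directories on PATH")
```

If this prints `None` in a normal prompt but a real path in an admin prompt, the two sessions are probably seeing different PATH values.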
I'm trying to use https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.least_squares.html but it can't even fit a second degree polynomial
blue are true values
heck, even giving it the optimal values as initial guess, it insist on making it linear
Thank you
will do
welcome
I'm trying to add bazel to path and reading how to download and add to path, I think the problem is I have a space in my path, my friend made my user name on windows "cum face" when he was helping repair my pc, so my file path to my user folder is C:\Users\cum face, and according to the bazel site "None of these paths should contain spaces or non-ASCII characters."
fun
...lmfao
i'm sorry, that's fucking hilarious
friend made my user name on windows "cum face" when he was helping repair my pc
wth, that's not nice at all
that's called friendship
guess I don't know what friendship is then
what are some good beginner resources for ai and data science, especially for someone being self taught?
I know there are good modules like numpy and pandas, but I'd like a resource that helps me apply that to some projects if y'all know of any.
I have a project I'm working on but I need to see what a fleshed out data science project might look like
I found this resource a while back:
https://towardsdatascience.com/a-complete-26-week-course-to-learn-python-for-data-science-in-2022-e95b67551df4
but if y'all have any others I'd love to see it!
If you get any good responses send them my way too
Trying to add bazel to path still, do I directly link to the exe in Path or the directory it is in
You can always change it from "cum face" to a preferred name (that's if you wanna change it) 😂😂
I dont know if I can. So I changed my username but can't change the name of the folder from what I've tried, my original version of windows I bought was lost/corrupted during the repair process so I'm having to run a torrented version of windows at the moment which locks me out of changing some things until I activate windows
Finally I have installed bazel using chocolatey and it seems to work, it still gives me that error about a space in the file path but it seems to work now so fingers crossed.
Out of curiosity: which IDE and how?
new problem, if I type bazel into cmd 'bazel' is not recognized as an internal or external command, operable program or batch file. but if I run cmd as admin it works
I should be able to run bazel commands without running cmd as admin right?
PyCharm
It's an option in the settings. I use it with high-contrast theme
Thanks!

hello, anyone knows how to fix this?
i need to set a fixed scale for both axis, using matplotlib
wanna do it like this
```py
import json

import matplotlib.pyplot as plt
import numpy as np


def Plot():
    # Load the simulation results (JSON stored in a .txt file)
    with open('Simulationn-algo.txt', 'r') as file:
        data = json.loads(file.read())

    size = []
    Insertion = []
    Merge = []
    Heap = []
    Quick = []
    Bubble = []
    Selection = []
    Counting = []
    # Collect the size and the first five entries of each algorithm's result
    for run in data['Simulation Details']:
        size.append(run['Size'])
        Insertion.append(run['Insertion Sort'][0:5])
        Merge.append(run['Merge Sort'][0:5])
        Heap.append(run['Heap Sort'][0:5])
        Quick.append(run['Quick Sort'][0:5])
        Bubble.append(run['Bubble Sort'][0:5])
        Selection.append(run['Selection Sort'][0:5])
        Counting.append(run['Counting Sort'][0:5])

    _Insertion = np.array(Insertion)
    _Merge = np.array(Merge)
    _Heap = np.array(Heap)
    _Quick = np.array(Quick)
    _Bubble = np.array(Bubble)
    _Selection = np.array(Selection)
    _Counting = np.array(Counting)
    _size = np.array(size)

    # One line per sorting algorithm
    plt.plot(_size, _Insertion, label='Insertion')
    plt.plot(_size, _Merge, label='Merge')
    plt.plot(_size, _Heap, label='Heap')
    plt.plot(_size, _Quick, label='Quick')
    plt.plot(_size, _Bubble, label='Bubble')
    plt.plot(_size, _Selection, label='Selection')
    plt.plot(_size, _Counting, label="Counting")
    plt.xlabel('size')
    plt.ylabel("Duration (ms)")
    plt.title("Different Sorting Algorithms")
    # plt.legend()
    plt.show()


Plot()
```
can someone help me? :(
your axes are off the grid
how to fix? i searched everywhere
you can't search it - that's the entire passive "off the grid"
i searched for a fix
you apparently don't use reddit much - smart.
just google how to set scale for axes in matplotlib
didnt work 🙂
ask crypto then 😏
yo how to normalize on a cnn model is there any layer that can do it on tensor? or will it be a preprocessing?
Does anyone know what the newest proposed activation function for deep neural networks is?
does anyone know if it's possible to use Spark Streaming with websockets?
do beginner data science projects need to be perfect
Help
So basically I am working on a machine learning assignment: creating an algorithm that will detect brain tumors based on brain tumor data sets. This is an assignment from my professor, but I don't have experience with coding.
These are the steps I have already completed:
(1) Import the dataset into a fresh Google Colab project
(2) Split the dataset into training / testing / validation sets (thursday)
But i still need help with
(3) Defining my classification model(s). You would probably want to try a few different models here. You can either build your own convolution neural network (layer by layer) in tensorflow and train it from scratch, or you can modify an existing pre-trained network like VGG19, alter it to better suit our binary classification needs, and retrain it on the dataset.
(4) Train your model(s) and evaluate
If you don't have experience with coding, things might get complicated. Try seeing how to use VGG19.
It seems that scikit-learn also has a module especially for creating a neural network automatically, but I don't know how reliable that is.
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
However, my bias is to create your own, which isn't hard if you try doing that through keras/tensorflow.keras.
If your dataset is already preprocessed, then you just need to do something like this:
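from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, Flatten, Dense
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=5, padding='same', strides=2, activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(filters=64, kernel_size=5, padding='same', activation='relu'))
# Flatten the feature maps so the last layer outputs one value per image
model.add(Flatten())
# Sigmoid gives a probability for the binary tumor / no-tumor label
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])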
And your neural network is ready. Just fit it to your dataset and train it.
Ohh ok
However, it's a good idea to learn more about it, how those layers work, so you can enhance this neural network.
omg, I tried to write a network too, but it's incredibly dumb
can you help me with it maybe? @hasty mountain
It depends. If it's a Generative Adversarial Network, I can't. I got problems with those.
Hm...What for?
uhm, nothing
I can't even train it on data, from a polynomial function
I've only managed with linear function
Basically there was this dude that helped me with this. And he wasnt able to continue to help me bc he is busy. Is it ok if I add u onto a group chat so then u can see where we left off? @hasty mountain
I wrote the backpropagation and stuff
but I'm doing something wrong with training and gradient descent I think
Nah, sorry man, I can't. I'll have to sleep soon and I'll be busy for the week.
Hey guys, I'm trying to figure out a way to calculate (efficiently) sets of density based clusters given a set of x,y coordinates -- does anyone have any articles or suggestions for that?
But from what you told there, if you use keras, probably the only thing you'll have to worry is about preprocessing your data, how the neural network works and probably that'll be all. Maybe just organizing the output of your neural network...
Keras is really intuitive and quite useful. Like...to train your neural network you just define it and then do model.fit(X, y) and to predict data, model.predict(X)
Problem would be if you're using Pytorch.
Unfortunately not. Sorry.
I also asked this question in #help-potato
Oh...if you're using Pytorch, things are a bit complicated...
I just started using Pytorch some days ago.
Im not. I've just hacked it myself using a matrix library
things are probably complicated by that
wanted to try writing it all
Uuuh...then I really can't help you, you're going through a quite difficult path.
uhhh, well I was still hoping you could help me
I think my problem is, in my gradient descent
why don't you post it
However, maybe you could learn how to do that if you check the source code for tensorflow...which is a module that requires manually creating every single step in a neural network...and it's the hardest way, already.
Well...if you've gone that far, then you probably know much more than me
oh ok 
well, gl with Pytorch though!
Try checking the source code for some optimizers in pytorch and keras.
Thanks!
post the code you're having difficulties with
that's a terrible way to first learn about gradient descent
since none of the gradients are exposed
Hm, I didn't know about that
okay, it's not python though, so I hope that's alright
we started talking about this, because someone else mentioned it
C code
/*
    Trains the network [pos] on the dataset [points/next] using [steps] with adaptive stride
*/
void adaptive_learn(framework* pos, int steps, void* points, int next(void*, point*)) {
    netw* vel = netw_init(pos->spec);   // The gradient
    point* point = point_init(pos);     // A point which gradient finder caches to
    double error = INFINITY;            // error starts at infinity: i.e. as bad as possible
    int diag_hz = steps / 10;           // frequency of outputs
    print_vbar("DESCENDING");
    for (int i = 0; i < steps; i++) {
        double prev_error = error;      // save the error, before computing next
        error = next_gradient(pos, vel, point, points, next);           // sets vel
        double rate = minimize_diagonal(pos, vel, point, points, next); // computes optimal learning_rate/step_size by minimizing in the direction of gradient
        netw_scale(vel, rate);          // scales the gradient by optimal
        if (i % diag_hz == 0) print_step(pos->net, vel, error, rate);   // print stats
        if (is_error(prev_error, error)) { // gradient descent is wild
            if (i % diag_hz != 0) print_step(pos->net, vel, error, rate); // in case step was skipped, print the final step
            exit(2);
        }
        netw_add(pos->net, vel);        // move
    }
    netw_free(vel);
    point_free(point);
}
params = params - learning_rate * gradient
would be
netw_add(pos->net, vel);
in my code
so netw_add does a minus?
or is netw_scale already scaling by a negative learning rate
wait
i thought your gradient was error
hence the next_gradient function
i don't see that error being used anywhere meaningful
next gradient is stored into vel through side effects
fair enough
are you sure you have your minuses correct?
if you put a - in front of minimize_diagonal and run it
what do you get
yeah! No, it can fit a straight line
it's just way too inefficient at learning
also I know the code is messy, but that's because it's C lol
well, actually...
The training set is pairs of (x,y) coordinates
which lie on a straight line
what's a model?
it's uhh Feed-Forward
I'm surprised this
params = params - learning_rate * gradient
should be enough
doesn't work for me at all
maybe it's too slow
how many steps are needed?
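For comparison, here is the plain `params = params - learning_rate * gradient` rule on a tiny straight-line dataset, in Python rather than C. This is a sketch with a hand-picked learning rate and a made-up target line, not a port of the C code above:

```python
# Training set: points on the line y = 2x + 1.
points = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

w, b = 0.0, 0.0          # parameters of the model y = w*x + b
learning_rate = 0.02     # hand-picked; values above about 0.1 diverge on this data

for step in range(2000):
    # Gradient of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / len(points)
    grad_b = sum(2 * (w * x + b - y) for x, y in points) / len(points)
    # Step AGAINST the gradient: note the minus sign.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # 2.0 1.0
```

If the sign is flipped or the rate is too large, the error grows instead of shrinking, which would look like the "gradient descent is wild" branch.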
hey I've got this massive table (12 sets of a 12x21 table), what would be a good way to display this? any libraries or smthn I should use?
col/row labels are just numbers, and the cells are probabilities
guys I need suggestion about embedding a image. I have review some famous model for embedding (arcface, deepid, vgg) and I want to ask this one. is the embedded vector are normalized by the model or not? and is it normal practice to normalize the vector before further process like store it to vector db or just keep it like that?
and is it normal practice to mapping the cluster made by each vector (for example i use 26 times augmentation per image and want to check whether each vector are clustered perfectly per image or there is a slight mix)?
You can look at the heatmap. Matplotlib and seaborn will do your work.
hmm I was thinking something like that, kinda made one with tabulate and some colors
I like how it actually shows the numbers, but there's basiaclly a bunch of these tables (yellow colored 12-21) and I've also got multiple different methods of calculating that set of tables so I'd like to show the difference in those too. Don't want to have to show 10 different tables 5 times. Any idea of how to compactly show that?
uhhhhhhhhhhhhhhhhhhhhhhhhh?
I don't understand your english
ehe sorry for my bad english
umm i wonder if normalizing the embedding vector from image is a common practice or not
and i wonder if cluster the vector from each augmented image is common to, to analyze if the augmentation works well to separate each entity
since i done some research about the effect of each embedding model and want to know what models and augmentation methods meet my expectation
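On the normalization question: whether a given model already outputs unit-length vectors depends on the model, so it seems safest to check the norm of a few outputs. A plain-Python sketch of L2 normalization follows; after it, the dot product of two vectors equals their cosine similarity. The toy vectors are made up:

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        return list(vector)
    return [v / norm for v in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])   # same direction as a, different length
c = l2_normalize([-4.0, 3.0])  # perpendicular to a

print(dot(a, b))  # close to 1: same direction
print(dot(a, c))  # close to 0: unrelated direction
```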
hi i hope everyone are doing well
i need help with literal_eval and a pandas column
ValueError: malformed node or string: 0 [312020]
this is the error i am getting
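The "0 [312020]" in that message looks like a whole pandas Series (index 0, value [312020]) was passed to `literal_eval`, which expects a single string. That is a guess from the error text alone. A sketch of parsing one value at a time, with a hypothetical helper that returns None for anything unparseable:

```python
import ast

def safe_literal_eval(text):
    """Parse one string as a Python literal; return None if it cannot be parsed."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

raw_values = ["[312020]", "[1, 2, 3]", "not a list"]
parsed = [safe_literal_eval(value) for value in raw_values]
print(parsed)  # [[312020], [1, 2, 3], None]
```

On a DataFrame the same helper could be applied per element, e.g. `df["col"].apply(safe_literal_eval)`, where `"col"` is a placeholder column name.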
returns an array of 100 evenly spaced numbers between 0 and 70
*ordered small to big as well
You can use subplots. So basically all of them will be in one of them.
I may have an example for the same. Gimmi an hour or something.
would you know what kind I should look at? I'm scrolling through matplotlib's 'subplot' docs but not sure which I'd use
What is the best way to retrieve and parse a streaming api in json format? Like if I want to search for a match in name:
Hey, so I installed tensorflow but when I run my code I get this warning: 2021-12-20 15:15:23.864238: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
How do I rebuild tensorflow with the appropriate compiler flags?
Hi, I have a question about NLP. Must we remove a data text when it appears more than once?
like a use drop_duplicates??
this might help-
https://indicodata.ai/blog/should-we-remove-duplicates-ask-slater/
but personally so far, reducing data redundancy worked well in most cases, so i prefer removing duplicates mostly
also check this out^
import cv2
from PIL import Image
cam = cv2.VideoCapture(0)
def draw_rectangle(img, rect):
(x, y, w, h) = rect
cv2.rectangle(img, (x, y), (x+w, y+h), (0, 255, 0), 2)
def draw_text(img, text, x, y):
cv2.putText(img, text, (x, y), cv2.FONT_HERSHEY_PLAIN, 1.5, (0, 255, 0), 2)
def predict(test_img):
face, rect = detect_face(img)
label = face_recognizer.predict(face)
label_text = subjects[label]
#draw a rectangle around face detected
draw_rectangle(img, rect)
#draw name of predicted person
draw_text(img, label_text, rect[0], rect[1]-5)
return img
def predict(test_img):
#make a copy of the image as we don't want to chang original image
img = test_img.copy()
#detect face from the image
face, rect = detect_face(img)
#predict the image using our face recognizer
label, confidence = face_recognizer.predict(face)
#get name of respective label returned by face recognizer
global label_text
label_text = subjects[label]
return img
predicted_persons = []
while True:
ret, frame = cam.read()
if not ret:
print("failed to grab frame")
break
cv2.imshow("Attendence...", frame)
k = cv2.waitKey(1)
if k%256 == 27:
break
elif k%256 == 32:
# SPACE pressed
stimg = cv2.imwrite("Student_Image.jpg", frame)
studentimg = cv2.imread("Student_Image.jpg")
Student_Prediction = predict(studentimg)
#draw a rectangle around face detected
draw_rectangle(Student_Prediction, rect)
#draw name of predicted person
draw_text(Student_Prediction, label_text, rect[0], rect[1]-5)
predicted_persons.append(label_text)
from openpyxl import Workbook
book = Workbook()
sheet = book.active
row = (label_text)
if row not in predicted_persons:
sheet.append(row)
book.save("Today's Attendence.xlsx")
cam.release()
cv2.destroyAllWindows()
While proceeding with the code, I'm getting the following error-
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-14-271c883c8a01> in <module>
47 stimg = cv2.imwrite("Student_Image.jpg", frame)
48 studentimg = cv2.imread("Student_Image.jpg")
---> 49 Student_Prediction = predict(studentimg)
50 #draw a rectangle around face detected
51 draw_rectangle(Student_Prediction, rect)
<ipython-input-14-271c883c8a01> in predict(test_img)
22 def predict(test_img):
23 #make a copy of the image as we don't want to chang original image
---> 24 img = test_img.copy()
25 #detect face from the image
26 face, rect = detect_face(img)
AttributeError: 'NoneType' object has no attribute 'copy'
Why am I getting this error and what can be the possible fixes for this?
Sure. Gimmi some time. I've been quite busy today.
I think that studentimg is an image, but it is still showing up as 'NoneType'.
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Maybe you are passing the wrong input
The function is not able to read the image properly, that's why you are getting the AttributeError
@verbal dock Check the path of the image
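Some background on the suggestions above: `cv2.imread` returns None instead of raising when it cannot read a file, and `cv2.imwrite` returns False when the write fails, so the error only surfaces later at `.copy()`. A small guard makes the failure show up where it happens. The `load_image` helper is hypothetical, and `reader` stands in for `cv2.imread` so the sketch runs without OpenCV:

```python
from pathlib import Path
import sys

def load_image(path, reader):
    """Read an image with `reader` and fail loudly instead of returning None."""
    if not Path(path).is_file():
        raise FileNotFoundError(f"no such image file: {path}")
    image = reader(str(path))
    if image is None:
        raise ValueError(f"file exists but could not be decoded: {path}")
    return image

def describe(path, reader):
    """Report what happened, for demonstration purposes."""
    try:
        load_image(path, reader)
        return "ok"
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__

# Stand-in readers; with OpenCV installed, pass cv2.imread instead.
print(describe("definitely_missing_image_xyz.jpg", lambda p: None))  # FileNotFoundError
print(describe(sys.executable, lambda p: None))                      # ValueError
```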
I WISH I COULD BE THIS GOOD
If I have a dataset where the label doesn't match the feature, what can I do? Remove the data or replace the label value?
share the error
what do you mean by not match with the feature? can you elaborate this?
I have a headline classification dataset and some headlines don't match their category label. What should I do? remove the data or rename the label value?
rename the label values
can we change the label value in the dataset?
yo do normalization and image augmentation have ups and downs on cnn models?
Share the screenshot of the dataset and which algorithm you use for the classification?
Means ?
send me the dataset link
So I will guide you in better way
Hello, where can I start learn ML or deep learning?
I am with 0 knowledge.
yo bro nevermind i got confused i cant even remember why i asked that hahaha
btw this is new question does cnn models naturally recognizes the shapes right? or their colors too?
try "Data Science from Scratch"
yes colors are recognized by the cnn
like convo layers detects edges therefor they can see the shapes
like vehicle color detection
so if i train a model with a dataset that for example 2 balls same shape but one is blue and one is red and i used color space manipulation (image augmentation technique) this will then give disadvantage on the model right?
if the distinct feature that differentiate this 2 class is only color then altering their color in preprocessing(using image augmentation) can confuse the model?
for example applied red casting to blue ball and applied blue casting to red ball?@vivid echo
can this be a good thesis topic?
we use augmentation to generate more training data so our model trains with better accuracy.
Yes
but if the generated data is somewhat questionable(like the color casting thing) then it can then give negative impact?
yes that time model will confuse or give wrong prediction
actually i have researched this one
and yes
sometimes augmentation will give bad result depends on the characteristic of the object
oh is it already done?
published?
oh i can still use this as thesis yeah?
i just need a topic
hahaha my 1st topic got rejected
so my topic would be selecting object to classify then if my experiment proves right then that object or similar objects to that are not good to use augmentation(that generate data with altered feature(color)) if my experiment proves me wrong then at least i did conduct experiment and have a thesis right?
😅
you can try that one
Thx
In Tensorflow, is there a way to use a tokenizer vocabulary as the vocabulary = ... parameter in the TextVectorization layer or should I only use a TextVectorization layer
While using the imagedatagenerator does it also generate original images? I used rotation_range=45 and got a few images which were like the original one
Does anyone have any recommendation for a paid online course to learn Python for Data Science for a complete beginner? I have a coworker that is interested in learning and I already sent them a lot the free material, but I figured a well-structured paid one would be better if it is all being billed back to the company anyway.
buy a book
like grus' data science from scratch
or bishop's pattern recognition and machine learning
(the second is a personal recommendation)
first book has a python tutorial in it-- you can supplement it with beazley's python cookbook
yeah maybe at least bill the book to the company and give them a few hours a week to learn it
Hello,While using the imagedatagenerator in keras does it also generate original images? I used rotation_range=45 and got a few images which were like the original one
Do I need to have a high math level?
it goes over the basics of the math you'll need
Ok
Figured this out but I got no clue how to use my TextVectorization layer lmao x-x
you might want to give examples or elaborate more, I've not touched on image generation but it could just be taking smaller images from the rest of the image to focus on each different part?
I have this dataset called Refuge with 40 images i want to increase it to 120 images
I used imagedatagenerator in keras for it
So for data augmentation i am trying rotation
I set rotation_range to 45 degree but it generates a few images which are very similar to the original
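A likely explanation, as far as I understand the Keras docs: `rotation_range` is a range for random rotations, so each generated image gets an angle somewhere between -45 and 45 degrees, and some of those angles land close to 0. The simulation below assumes the angle is drawn uniformly, which is an assumption about the internals:

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

rotation_range = 45
angles = [random.uniform(-rotation_range, rotation_range) for _ in range(10_000)]

# Share of draws that would leave the image looking almost unrotated.
nearly_unrotated = sum(abs(angle) < 5 for angle in angles) / len(angles)
print(f"{nearly_unrotated:.1%} of sampled angles are within 5 degrees of 0")
```

Under that assumption roughly one image in nine would look very similar to the original, without the generator ever emitting the original itself.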
Have you checked the documentation? Personally I've never used an imagedatagenerator (and I use Tensorflow 2 anyway). Would this be of any assistance?https://stackoverflow.com/questions/34801342/tensorflow-how-to-rotate-an-image-for-data-augmentation
Yess i did check thiss
I actually also wanna know what is better for data augmentation
Imagedatagenerator or openCV
What do you think i should go for??
here are some libraries used for image data augmentation
if u just want to solve a classification problem u can use any data augmentation libraries
for object detections some times after the augmentation one might have to take care of the bounding box coordinates too. I am not sure but I think some of these libraries might have methods of doin that, and maybe there are some APIs which helps with that too. I am also trying to learn about it all atm.
Can you please name the libraries,i cant access the article
Yess mine is just a classification problem
sure let me post screenshots for you-
Thank you soo much!
@uneven flame Sorry, your message got zapped by our filters since it had quite a few attachments.
Would you like me to get them back for you? We have them in logs.
No worries mate, I dm'd @wicked grove the details
👌
Hey guys, about GANs and, specifically, DCGAN: is it possible to use some kind of automated hyperparameter tuning for the optimizer without collapsing the model? Or do I have to be changing the learning rate manually and train for many epochs after each change to see how it goes?
I made a DCGAN that will only stop generating noise and generate something that slightly resembles the real images if I use more than 5000 epochs, so it kinda sucks to have to change the learning rate and wait for 5000 epochs everytime.
Oh, I see. Do you have a suggestion to how many epochs would be reasonable to make the LR change?
that's often a hard hyperparameter choice
some of those schedulers use the validation set to determine good times to lower/raise learning rates
so maybe check out those
Can I simply apply a scheduler for the generator and another one for the discriminator at the same time? DCGANs feels like I'm always walking on eggshells.
yes
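To make the "one scheduler per network" idea concrete, here is a plain-Python sketch of reduce-on-plateau logic. The `PlateauLR` class, its parameters, and the metric values are all hypothetical and only illustrate the idea; real frameworks ship their own scheduler classes, and those are what you would normally use.

```py
class PlateauLR:
    """Hypothetical helper: cut the learning rate when a metric stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, metric):
        # Lower is assumed to be better (e.g. a validation loss).
        if metric < self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# One independent scheduler per network, each with its own patience.
gen_sched = PlateauLR(lr=2e-4, patience=3)
disc_sched = PlateauLR(lr=2e-4, patience=5)

made_up_metric = [1.0, 0.9, 0.9, 0.9, 0.9, 0.8]  # invented numbers
for epoch, metric in enumerate(made_up_metric):
    gen_lr = gen_sched.step(metric)
    disc_lr = disc_sched.step(metric)
    print(epoch, gen_lr, disc_lr)
```

Because each scheduler keeps its own state, the generator's rate can drop while the discriminator's stays put, which is the point of running two of them.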
GANs in general are very unstable, and mode collapse is one of the common failure modes
it might be good to look into modifications that try to get around those issues
like infoGANs and WGANs
my favorite resource to learn about wgans
I see. I think mode collapse isn't exactly the problem for me, since neither the generator nor the discriminator loss goes to infinity. But the generator still only generates noise until around 5000 epochs.
I've added some gaussian noise to the discriminator's conv2d layers and now I'm trying to change the learning rate.
that's usually what WGANs do-- accelerate training
but for now, tweaking learning rates is good to do
Hi! I have a silly question: I am coding a classifier estimator with sklearn, and I have a dataset of diamonds (size, color, clarity and cut).
Based on the first 3 features, i want to classify each sample by cut, this can be: ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'] .
So my target Y is a Series of strings.
My question: what is the difference between OneHotEncoder and simply putting numbers from 0 to 4?
well firstly, a one-hot encoder gives you one column per class, so with your 5 cuts every sample turns into a list of 5 items
with four "0" and one "1"
from my understanding of it
I think some models won't work properly if you use 0 to 4 instead of one-hot encoder.
My head is on neural networks right now, and those usually pair a softmax output with one-hot targets (or with integer class labels and a sparse loss) to classify correctly. Otherwise, I think the model would see the values from 0 to 4 as continuous, something like price prediction, so it would require a different structure.
You probably can use 0~4 values, but I think it would be more complicated to work with.
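A small sketch of the two encodings side by side, using made-up cut labels. `LabelEncoder` and `OneHotEncoder` are real scikit-learn classes; the sample data is invented. Worth knowing too: scikit-learn classifiers accept string targets directly, so for the target `y` you often don't need either encoding.

```py
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up target values for six diamonds.
cuts = np.array(["Ideal", "Premium", "Very Good", "Good", "Fair", "Ideal"])

# Integer labels: one number per sample (classes are sorted alphabetically).
label_encoder = LabelEncoder()
y_int = label_encoder.fit_transform(cuts)

# One-hot: one column per class, exactly one 1 in each row.
one_hot_encoder = OneHotEncoder()
y_onehot = one_hot_encoder.fit_transform(cuts.reshape(-1, 1)).toarray()

print(label_encoder.classes_)
print(y_int)
print(y_onehot.shape)
```

The integers put the classes on a number line (here in alphabetical order, not quality order), while the one-hot columns carry no ordering at all. That matters mostly when the encoded values are used as input features.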
Thanks!
:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1640029708:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
please help
so i am making a personal ai assistant its been over 5 to 6 months now
and now i want to add some ml algorithm for doing this:
let's say I wake up at 7 am and turn on the lights, and I do this every day for a week. when the next week starts, I want my AI to automatically turn on the lights for me. another example: let's say I set an alarm to wake up at 7 am every day; the more I do it, the more it knows, and eventually it does it itself.
which ml algorithm can i use to achieve this?
suppose u set an alarm to ring at 7am for 5 days straight; u want the program to set an alarm at that same time on the 6th day, in case u forget to set it yourself?
Hey, so I have a question.
What could be happening here? according to the shape it is 1,445,477 entries but the index goes from 0 to 523,862, which seems pretty weird
yup
Already checked, seems normal honestly
this is difficult to follow. can you post a snippet of code and some kind of demonstration of what goes wrong?
!code see below for using code formatting:
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
!paste or use our paste site:
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
please avoid posting screenshots unless there's absolutely no other way to show what you are asking about
it's impossible to search, impossible to read for certain people, and generally puts a lot more burden on other people
likewise @novel acorn, can you please post this as a code block
check for extra whitespace around the names in the CSV
hqahahahah it's because I'm using multiple versions and want to see what changes I'm doing
e.g. maybe your list has ['a', 'b', 'c'] but the csv has a , b , c which would be ['a ', ' b ', ' c']
then you need to manually inspect the files and find out what's going on
figure out which particular files are causing trouble, read one of them, and then print the columns with df.columns.tolist()
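A quick way to see the whitespace problem and one possible fix, using an in-memory CSV with made-up contents instead of a real file:

```py
import io

import pandas as pd

# Made-up CSV whose header has stray spaces around the commas.
csv_text = "a , b , c\n1,2,3\n"

df = pd.read_csv(io.StringIO(csv_text))
raw_columns = df.columns.tolist()
print(raw_columns)  # the names still carry the stray spaces

# Strip whitespace from every column name.
df.columns = df.columns.str.strip()
print(df.columns.tolist())
```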
sure, but I'm sending screenshots because of the resulting dataframe, not the code 😄
you can also put the output in a code block? or in the paste site. it doesn't have to be syntactically valid python code
sure, look
yup, but that's the problem
1.41 million rows
default, the file didn't have an index
@olive jackal so what exactly is the problem with this? you are expecting one of these dataframes to have certain columns, but it doesn't have those columns?
show the code you used to create this dataframe. it's probably just not sorted by its index value
i told you, the video isn't useful. at least it's not something i personally can use to help you with
it may be, I actually had to do a join because I had every single year in different datasets
it's very likely that there's nothing wrong with the dataframe, but that the last row does not have the highest index value. first of all .shape should tell you the size of the dataframe, and that number will not lie to you. second, you can do .index.max() or equivalent
indexes are there to help you, but they can get confusing if you aren't used to working with them
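Here's a tiny made-up example of how `.shape` and the largest index label can disagree after a concat. The frames and values are invented; `pd.concat`, `ignore_index`, and `reset_index` are real pandas features.

```py
import pandas as pd

# Two made-up yearly tables, each with its own 0-based index.
year_a = pd.DataFrame({"ID": [1, 2, 3]})
year_b = pd.DataFrame({"ID": [4, 5]})

combined = pd.concat([year_a, year_b])
print(combined.shape)           # total number of rows and columns
print(combined.index.tolist())  # labels restart for the second frame
print(combined.index.max())     # so the max label is smaller than the row count

# Renumber the rows from 0 to n-1.
renumbered = pd.concat([year_a, year_b], ignore_index=True)
# Equivalent on an existing frame: combined.reset_index(drop=True)
print(renumbered.index.tolist())
```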
I'll try that
ff_id = pd.read_csv(path_customer, encoding='unicode_escape')
data_2018 = pd.read_csv(path_2018, encoding='unicode_escape')
data_2019 = pd.read_csv(path_2019, encoding='unicode_escape')
data_2020 = pd.read_csv(path_2020, encoding='unicode_escape')
data_2021 = pd.read_csv(path_2021, encoding='unicode_escape')
filtered_2018 = data_2018.merge(ff_id, on="ID", how="inner")
filtered_2019 = data_2019.merge(ff_id, on="ID", how="inner")
filtered_2020 = data_2020.merge(ff_id, on="ID", how="inner")
filtered_2021 = data_2021.merge(ff_id, on="ID", how="inner")
year_2018 = filtered_2018.drop(columns_to_drop, axis=1)
year_2019 = filtered_2019.drop(columns_to_drop, axis=1)
year_2020 = filtered_2020.drop(columns_to_drop, axis=1)
year_2021 = filtered_2021.drop(columns_to_drop, axis=1)
all_years = pd.concat([year_2018, year_2019, year_2020, year_2021])
that's the code I used to read it and to concat it in a single big dataset
all those merges will likely reorder the data, yes
is the ID unique across rows? if so, consider setting it as the index for each dataframe
yup, but it's a long id for unique customers
is it 1 customer per row? or more than 1?
1 per row, but customers repeat because it's the movements of 4 years
that would be more than 1 row per customer then
indeed
yes, that dataset only has 3k rows
because it's the id of the customers that belong to certain category
and what's in each row of the other tables? some kind of transaction?
yup, not transaction but information of a movement (cargo)
so customer 1234 can appear in any table multiple times? or customer 1234 can only appear in each table once, but they can appear in multiple tables?
I'll try it 😄
@olive jackal i believe you can pass a list of column names to usecols= so you don't have to deal with the column name index business
desired_columns = ['a', 'b', 'c']
for file in files:
    # read only the header row to find out which columns this file has
    tmp = pd.read_csv(file, nrows=0)
    columns_in_file = list(set(tmp.columns) & set(desired_columns))
    data = pd.read_csv(file, usecols=columns_in_file)
    ...
first one, they can appear in any table multiple times and each time it may be a different movement
ok then @novel acorn. this probably won't change your outcome much, but in general i'd do something like this to make it clear what the unique identifiers are (and maybe wrap it up in a function to reduce duplication and the risk of typos):
# Customer data: one row per customer
ff_id = pd.read_csv(path_customer, encoding='unicode_escape').set_index("ID")
# Load cargo data: one row per shipment
def load_cargo_table(path):
    data = pd.read_csv(path, encoding='unicode_escape')
    data = data.join(ff_id, on="ID", how="inner")
    return data.drop(columns_to_drop, axis=1)
paths = [path_2018, path_2019, path_2020, path_2021]
all_years = pd.concat([load_cargo_table(path) for path in paths])
i'm not sure what you mean by that. but you can actually use the fact that it accepts callables to your advantage (this example is even given in the docs):
desired_columns = ['a', 'b', 'c']
for file in files:
    # a callable usecols keeps only the columns it returns True for,
    # so there is no need to read the header first
    data = pd.read_csv(file, usecols=lambda c: c in desired_columns)
    ...
# same thing with a named function instead of a lambda
desired_columns = ['a', 'b', 'c']
def is_desired_column(c):
    return c in desired_columns
for file in files:
    data = pd.read_csv(file, usecols=is_desired_column)
    ...
Thank you!
I'll try that
any opinions here on metaflow?
i don't know how i feel about stuffing everything into a single class
tysm, this helped a lot in cleaning the code, but the solution ended up being resetting the index. probably as you said, due to the merges and the concat (most likely this), the indexes were broken; after resetting it, everything seems fine.
@bronze skiff hey, just a quick question:
I've seen that usually GANs don't use Dropout layers, even though a dropout layer helps to avoid overfitting and thus helps with loss function.
However, since GANs are...kinda special and fragile, is this a bad idea?
(I tried using dropout(0.4) after each ReLU in my discriminator. I think I broke my GAN...)
yup I know, I was given the data in different files because it was one file per year and they wanted me to do a global analysis
I want to learn sql, but due to college and work, I have little to no time, but it's in my to do list 😄
😮
I'll try it then, I'll see if in the following days I have a little time 😄
Thought it was as hard to learn as a new programming language hahahha
I'd recommend messing around in pgexercises.com. They have very common questions, have solutions at the bottom which explain their process, and it's pretty fast to pick up SQL. This is in Postgres dialect, but most of the dialects are very, very similar to one-another.
When I was doing data engineering, I recommended this to all of the interns who wanted to get into DE. Most [not all] were able to get up to speed relatively quickly and commit something to our codebase in a few weeks. :']
To optimize SQL calls takes quite a bit of experience and knowledge of the architecture --- but to do simple calls (which are 99% of the calls people prob will want) the learning curve is fairly low.
in LinearRegression can you predict on a single dimension like x?
As in, you have a list like [1, 4, 6, 1, 7, 8] and you want to regress on it?
They might wanna take the mean in a weirdly convoluted way.
thank you!
I'll keep this in mind
No problemo, I love SQL stuff. It's a really fantastic tool to learn and I'd argue that it's essential for most jobs in DS / DE / Analysis these days.
I don't think that it's a competition, both are very important to know.
Give me an example, I don't think I follow.
The kinds of data which I have been working with were not always able to be pulled with Looker/Tableau/PowerBI --- this was more of the Data Analyst's job. But I understand what you mean here. Nevertheless, if you're using a BI tool, the point of it is to be able to pass it around easily and modify it --- so I'm not sure I get where the "passed around and now it's unusable" thing is coming from, unless you mean that there's some people exporting to Excel, changing it, etc., which is bad practice in general.
wait sorry my question doesn't make sense
I think though that either way, knowing the proper structure to put things in is important as well. For ETL, you're going to need to know the proper ways to Extract (SQL, or whatever BI tool you're using if that's acceptable), Transform (whatever system you use here, Python/Spark/etc.), and Load (which is where the data structures come in).
If the whole df is 90mb, it depends strongly on what you're doing, right? If your ETL is windowing over a whole bunch of stuff, that's more complicated than a single SELECT for the business team.
Moreover, it depends where the load is going to. To a DW for the business team? To a db for the DS team? Etc.
I think we're both saying right things here: it's important to know SQL (or, equally, the THEORY behind how querying works, since it's still the same thing in Tableau / whatever, just simplified), as well as how to produce a product relevant for the team you're handing it off to.
I'd never give my business team a raw Excel file. I give them a Looker view and they can export if they want, but they can't change it on looker and mess up anything.
Haha, okay, see, I'm the person on the data team people buy the coffee for. :']
simplified example:
suppose u set an alarm to ring at 7am for 5 days straight; u want the program to set an alarm at that same time on the 6th day, in case u forget to set it yourself?
But yeah, if one is not going to be a Data Scientist / Data Engineer, then it's probably not AS important to get too deep in the weeds, you're totally right.
Ash, that's still a 2D regression. Your x-axis is day number, your y-axis is the time.
Yes, this is true. I tend to do smaller companies, so this is my bias, certainly.
Yeah, I like to have a "say" in the company, for some definition of "say". Most of my companies have been around 20 - 300 people, so it's a wide range, but def not a big company.
Haha, well, yes. We usually will have a data lake which only DBAs have access to, and they will then push data (with help from DS + Commercial) to a Data Warehouse so that we can attach Looker / Tableau / whatever to that. Then we'll have another smaller data warehouse for DS which is mostly lightly parsed Data Lake stuff.
That way commercial gets what they want and we don't have to do a ton of gross formatting AND we don't have to worry about weird business defs accidentally getting into the DS part. And then for the DS part, we're free to do whatever we want with that, we're the DBAs of that DW.
I don't think it's optimal, but it's definitely served us well! Which goes back to my comment: without SQL, I'd be SOL. Haha. But yeah, in a big company if you're not required to just get your data willy-nilly, SQL might not be a top thing to invest in.
Haha, yeah. For BI stuff, Tableau + Looker have served me very well. I didn't do much with PowerBI but it looks fine.
I've been meaning to learn it, a few companies I'm looking at use it and I don't know much about it.
Yeah, it's kind of weird, since Windows got WSL, a bunch of companies my friends are at have slowly transitioned off macs (because they're VERY expensive) and went over to windows. Since the devs get the WSL stuff w/ Linux for dev work, and everyone else is like, "whatever, windows is fine."
The only thing I don't wanna do is have to learn Azure cloud. I've already had to learn AWS and GCP stuff, I don't wanna learn another one, haha. But either way, that's a good task for me to do, just to look into it.
Yeeeeeep. I'm not sure the direction of the market, but it seems a few companies have been moving over. I'm not sure if it's price-point or what.
We'll see how it all goes. AWS certainly has marketshare and name. GCP is way friendlier to use, imo. I dunno about Azure yet. But in ten years we'll see where the market is, haha.
How can I translate a softmax output (length 16, float dtype) back to an integer? (I tokenized my words, so the output needs to go float -> int -> string)
I'm using Tensorflow but at this point I'm open to anything
My vocabulary is about ~28,000 words so one-hot encoding might work but I'd rather look and see if there's a more dynamic approach first
I'm not sure what Dax is, but I'd prob do most of my modeling in the SQL anyhow regardless, haha.
28,000 words with one-hot? Lawddddd that's a large, sparse dataset.
yeah that's why I'm hoping for a more dynamic approach
I think TextVectorization gives a more dynamic approach but I couldn't wrap my head around it
what does Linear Regression mean?
I know there's a way of doing it but I've not found a guide or documentation on it
I'm not excellent with NLP, so I'm sure someone else can guide you, Tenten. :'[
Yeah I've been asking for days sadly 😦
Ah, got it. Yeah, I have a weird feeling either GCP or Azure is going to pick up Amazon's pieces, but we'll see if they do.
It's making a line-of-best-fit. So, you have a lot of data points on an xy plot (for example) and you draw a line that's kind of in the middle that "fits the data" the best.
But it's fine, I find a lot of AI based subjects don't have too many people talking about it (I think a lot of people are just really good and don't need help or it's more of a test and try method)
yeah
I'm not saying anyone's mean, I'm just saying 1) finding someone for your needs and 2) them seeing your messages are a hard-to-get combo
this server is lovely so every day I put the same question in xD
Sorry, this is done in MSPaint because I'm on my laptop. For linear regression (in 2d) you have these little datapoints. For you, it might be the x-axis being day number, and the y axis being time to wake up. Something like that. Anyhow, these are the blue circles (they should be the same size, but I'm terrible at mspaint). Linear Regression allows you to draw this red line which "approximates" where the dots are kind of headed.
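The same picture as a few lines of code, as a rough sketch: the x-axis is the day number, the y-axis is the alarm time in minutes after midnight, and the fitted line is used to guess day 6. The alarm times are made up, and `np.polyfit` with `deg=1` is just one of several ways to fit a straight line.

```py
import numpy as np

# Made-up history: day number and alarm time in minutes after midnight.
days = np.array([1, 2, 3, 4, 5])
alarm_minutes = np.array([420, 420, 425, 415, 420])

# Fit a straight line: alarm_minutes ~ slope * day + intercept.
slope, intercept = np.polyfit(days, alarm_minutes, deg=1)

# Use the line to guess the alarm time on day 6.
predicted = slope * 6 + intercept
print(f"{int(predicted // 60):02d}:{int(predicted % 60):02d}")
```

For a habit that barely changes from day to day, something simpler such as the most common time may do just as well. The regression only earns its keep if the time drifts.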
Tenten, can you give me a little more info on what you're doing? You're taking a corpus and some kind of bi/tri/whatever-grams stuff and putting that through a NN? And at the end, you want your activation function to kind of classify something so you need an integer?
If you've heard of Visual Question Answering, that's what
I'm mostly guessing here, but that's the kind'a thing I'd like to know about so I can try to help a bit better. Haha.
I take an image and a question, and I try to output an answer
I'm using TensorFlow due to its extremely powerful Keras API (and its own functions) as well as its documentation
but I don't see many examples that allow me to classify text in the capacity I need it - namely from:
String --> Integer tokens (preprocess/tokenize) --> float values (during model training/predicting) --> (output) integer tokens --> string
Can you give me an example using a real string (fake values, obv). I'm not sure what you mean by integer tokens.
so basically a question can be up to length 32:
"What are you doing?"
Would be standardized to lose punctuation and make it all lowercase
to become: "what are you doing"
then I use a tensorflow tokenizer that splits it at each whitespace
["what", "are", "you", "doing"]
then my tokenizer is fitted on my examples to give integer token values to my text
what --> 581, are --> 20, ....
[581, 20, 14, 3414]
Then it's put into my model through an embedding layer and dense layers, activation layers, blah blah blah, until it reaches the output, where it's of the type:
[3.43456e-03, 3.90534e-04, ....]
That's 16 long
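A rough sketch of the round trip described above, under one big assumption: that each position in the softmax output stands for one candidate answer, so the largest probability picks the answer. The `word_to_id` mapping, the `answers` list, and the `encode` and `decode` helpers are all hypothetical; a real vocabulary would come from the fitted tokenizer, and the probabilities from the model.

```py
import string

import numpy as np

# Hypothetical tiny vocabulary; a real one would come from the fitted tokenizer.
word_to_id = {"what": 581, "are": 20, "you": 14, "doing": 3414}


def encode(question):
    # Lowercase, drop punctuation, split on whitespace, look up each word.
    cleaned = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [word_to_id[word] for word in cleaned.split()]


# Hypothetical answer list: one entry per softmax output unit.
answers = ["yes", "no", "cooking", "reading"]


def decode(probabilities):
    # argmax turns the float vector back into an integer index,
    # and the index is looked up in the answer list.
    return answers[int(np.argmax(probabilities))]


tokens = encode("What are you doing?")
answer = decode(np.array([0.05, 0.10, 0.70, 0.15]))
print(tokens)
print(answer)
```

With this setup the output layer needs as many units as there are possible answers, so a 16-long output can only ever choose between 16 answers.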
My tokenizer saved in txt file form is 11 KB lol
whoops wrong one
that's my tensorflow model
my tokenizer is 2.7MB