#data-science-and-ml
1 messages · Page 298 of 1
ai decensorship for japanese culture is a thing
basically-- you can "learn the ass"
what was the training data? hen-
You could create a mask with a preference based on a big enough sample size, right? Then it's goodbye Tinder swiping 😊
On a serious note though, why is this not a thing on porn websites?
they had a bug bounty program; must be hard for the experts to browse the site for vulnerabilities 😏
@spiral trail @misty flint there are other examples of ML being used in the real world that are more appropriate for our server, which is designed for users as young as 13.
we allow anyone who is eligible to use Discord to fully participate in our community. While it's not likely that a 13 year old would have the prerequisite knowledge to succeed in data science right away, that doesn't preclude them from participating.
this is true. let me delete my comment
what about julia 
where is foxxy
maybe i will learn julia next 
im using speech brain for a project rn that uses pytorch first time working with it not doing anything advance gonna end up using the pertained models anyway so its the same layers and stuff. I just kinda started in tf and figured that the cert would help me get a job in ai
apparently its all just straight from the coursera course so only sequential
interesting

my classmates and i were thinking of doing the certificate just to get more familiar with TF not really to use the certificate lol
since ik very few certificates mean anything to companies
which is fair
my friend said their company just hired a guy that had a certificate for a certain tech but didnt actually end up knowing the tech when they brought him onboard

that shit happens all the time lol
he said if it was up to him, that guy def would not be hired
yeah one of the red flags was the guy had a 4 page resume
sadly it means the DS learn certificates i have dont mean shit even thou I know -a bit- of it
that alone would have made him discard his CV lol
Hello everyone, I am new to python programming and would like some help please in building a script!
but my friend said no one consulted him so 
Hey. This is the DS channel. If your question is about DS we can help, otherwise try looking for a more appropiate channl 🙂
Fire away man
well its not bad at all (the course and the exam) I guess it depends on your course load like i had the python skills to do this years ago in school but i was spreading my self thin enough there and didnt have the time to do it. Well I dont plan of cheating my through it i like ai and think it would just be nice to show with my projects
yeah. i think at this point, projects >>> certificates bc you can actually show employers your learning
Then you have a shit mind like mine and can't think of a good project :v
-sad dog face-
good to know 
Eventually i'm just going to run a sentiment analysis of my friends in twitter
see the good thing about group projects is im never the one that has to think of the idea since im also shit at coming up with ideas
or just find a random dataset on kaggle

-proceeds to embarass himself-
I'm not sure, but I thought I could get some help here because it's about data that I want to manage in python (manipulation of data files). Does it fit in the group discussion? or...
since all the public channels have an easy way to download their data
you just go to the top right corner, and literally press export chat data

so if you need ideas, theres that
they even have a telegram api
it should
not that i really use telegram. its not that popular in the states
but it still seems like something nice to put on the resume
classificattion challenge - Is Rex trolling or not? :v

seems too good to be true, right?
thats what i thought too at first
I explain my problem ? or...
the problem comes when you try to download the data, its usually too big so you have to select what you want
I honestly don't know the value of certs to projects like obv a good project is worth the most but like i make shitty little things that are fun for me i dont do medical or business stuff i do stupid shit
yes, but does your stupid shit use dif technologies? dif languages? it doesnt matter then

you can still put it
did you make a docker container for it? its still using docker 

I do everything in python and yeah atm im just trying to get really good a few libs tf, (numpy, pands, the basic data exploratory stuff), I try to put them on plotly's Dash
word its sick
ye fam
what is plotly's dash?
should i save another bookmarket of another tool to learn? -pukes-
like Im not a front end dev and i just do the basic scatter bar colorful graphs stuff its just a way to have stuff on the internet
possibly

im jk idk what you use for data viz
i was thinking of using this
yeah like tableau with flask
since its free and shit xd
and looks pretty
oh that one looks interesting
oh plotly is coded in python as well?
i will also save it
another library?
yeah
-dies buried by libraries-
all the libraries 💀
better than R packages

yeah plotly is for graphs and data viz and dash is for front end deployment
those are endless
lol yeah thats why i like dash real easy one file type shit
we used flask for our last project
and by we, i mean my friend
and by used, i mean we had one page that was a pain to figure out
💀
this is it?
the struggling means your closer* to learning
yeah
what about when youre bashing your head against the keyboard
what does that mean
Dash Enterprise

oh yeah baby look at those animations. 
Have you seen SAO?
yeah
I¿ll get rid of you ala Gun Gale Online :v


skull mask villain 
but yeah
were going to try to use this for our telegram project
the code is v minimal
even less than flask
yes streamlit is also very cool again its just i started in dash gonna build on what i know
bro is this what my CS friend called "tool hell"
im going to try dash if i cant figure out how to make the graphs interactive on streamlit
the engineer's dilemma
too caught up in the tools you never actually build anything

When you have a hammer, everything looks like a nail
-throws a naive bayes classifier in your face-
💀
new tool for that google colab
💀 we did a Random Forest and a Decision Tree the other day and the accuracy was only slightly better than 50-50
0.53
💀
right?
it had to do with poor data
but good thing it was just a throwaway assignment
what was your data?
like super small sample so the machine couldnt learn properly
also
tensorflow emote when

i need one
Just ask, if it's the wrong channel we can tell you which channel to ask the question in.
rps=np.loadtxt('Line_001RPS.txt',dtype = np.str)
sps=np.loadtxt('Line_001SPS.txt',dtype = np.str)
print('File 1 shape', sps.shape) #Pour connaître la structure du fichier en détail
print('File 2 shape', rps.shape)
rps.shape
sps.shape
print(len(sps))
Nlines_sps=sps.shape[0]
Nlines_rps=rps.shape[0]
fichier=open('mise_a_jour.txt','w')
for i in range(Nlines_sps): #boucle for pour gerer le premier fichier(qui fonctionne correctement)
Cf1 = sps[i,0] + ' ' + sps[i,1] + ' ' + sps[i,2] + ' ' + sps[i,3] #ligne du premier fichier
fichier.write(( (Cf1 +'\n')*282 ))
fichier.close()
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
rps=np.loadtxt('Line_001RPS.txt',dtype = np.str)
sps=np.loadtxt('Line_001SPS.txt',dtype = np.str)
print('File 1 shape', sps.shape) #Pour connaître la structure du fichier en détail
print('File 2 shape', rps.shape)
rps.shape
sps.shape
print(len(sps))
Nlines_sps=sps.shape[0]
Nlines_rps=rps.shape[0]
fichier=open('mise_a_jour.txt','w')
for i in range(Nlines_sps): #boucle for pour gerer le premier fichier(qui fonctionne correctement)
Cf1 = sps[i,0] + ' ' + sps[i,1] + ' ' + sps[i,2] + ' ' + sps[i,3] #ligne du premier fichier
fichier.write(( (Cf1 +'\n')*282 ))
fichier.close()
ok, so what is the context and what is the goal?
I have indeed two data files with each: SPS (251 rows 4 columns) and RPS (781 rows and 4 columns)
first I duplicated each line of sps by 282, so I got a new file (mise_a_jour) of 70782 lines and 4 columns. then I want to make a scan in RPS so as to take the first 282 lines to add them in the mise_a_jour file and then continue the operation by starting again not at d but at d+2 (I shift of two(2) lines each time always taking 282 lines.
I know if I have been clear enough but you can ask me other questions!
So your problem is that you want to loop over RPS and take the first 282 lines and add them, then move the cursor down 2 lines and add the next 282 lines and so on. What are these 282 lines being added to?
"add them in the mise_a_jour file" - this needs more explanation.
by "add" do you mean append to the end of the file (such that the number of lines of the file increases by 282 each time)?
Also what is the context? So that we do not waste time on an XY problem. @dim dirge
yes exactly !
the 282 rows are added to the update_file which currently contains 70782 rows and 4 columns. at the end of the job, the update_file will contain 70782 rows and 8 columns
not add to the line, but rather on the same line to make the 8 columns
So you are aligning / matching lines and adding them up, resulting in the same number of lines (element-wise addition)?
exactly... so the first line of SPS will correspond to the first line of RPS in update_a_jour
this will be the beginning of the update_file
does pandas have an append I/O method when producing a CSV output?
In standard python i think we can
with("file", "a")
to add something at the end of a file instead of overwriting.
Ok, but there is a problem, this will not align, RPS has too many lines even when you skip every other line in it (781 / 2 * 282 = 110121).
What do you want to happen when the end of SPS is reached before the end of RPS is reached?
You're talking about with open(file, "a") (which opens the file in append mode, so writes result in the data being added to the end of the file).
It's probably a better idea to just read the file, append your data to it, then dump it back. If you must, though, try passing a file-object opened in "a" mode to to_csv.
No no RPS has 781 lines.
and SPS has 251; so 251*282=70782
If I take the first 282 lines of RPS, and match them to the first line of SPS duplicate 282 times.
I'll leave you the three files if you want, so you can look at them and maybe it will help you to understand better!
the SPS lines are duplicated 282 times, which makes 251*282=70782 lines
if I take 282 lines of RPS and start again by d+2 each time, I will have 70782 lines also for RPS
NB: the numbering of RPS goes from 561 to 1342.
it is an acquisition in which each signal sent by one (1) point of SPS (line) is recorded by 282 points RPS (line). then to send my second signal I shift two (2) points in RPS. that is to say that the first two points of RPS which recorded the first point of SPS will not record the second point of SPS anymore
Thanks, Mr. Confused Reptile 
RPS has 781 lines correct?
So one thing you want for sure is to duplicate each line of sps 282 times right? let's start with that.
sps_duplicated = np.repeat(sps, 282, axis=0)
yes yes I already did
next you want every other line from rps right?
every_other_rps = rps[::2]
oh wait nvm, you want to have a sliding window over rps
I can't believe it, you got it in one order?
It took me 2 days though 😩
yes
# Loop through rps, but with step of 2 (ever other)
for i in range(0, rps.shape[0], 2):
# Do stuff here
Yes, but the first 282 lines must correspond to the other 282 of SPS and then I come back to take the other 282 values that follow, leaving the first two lines
You are looping through every other line of rps while also looping over every line of sps (not every other)? @dim dirge
What's a nice way to embed PyCharm visualizations in a Medium article?
i think this is the right channel...
from PIL import Image
im = Image.open("maze.jpg")
im.show()
output = open('maze.txt', 'a+')
for pixel in iter(im.getdata()):
output.write(str(pixel))
this gives me a lot of tupels in (R,G,B) configuration, however i would like them to be in a [1, 0, 0, 1, 1, 0] sorta configuration
i am fairly very new to doing this kind of thing in python.
so any help would be great
wait I'll leave you the files, so we can understand each other better I think
Hey @dim dirge!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
Hey @dim dirge!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
Hey @dim dirge!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
try putting the code in pastebin.com
Hey @dim dirge!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
Hey @dim dirge!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
I did it but it doesn't work
701 688081.8 3838302.1 46.0 561 684590.2 3837867.6 41.0
this is the first line of the file I need to have
no, certs won't get you a job in ai
considering the material that cert is based off of is really basically, it won't really be an indicator of anything
the only cert that matters for ml jobs is your degree
lets be real
yeah, totally agree
maybe you could get a job without degree, but you would have to be a 200iq genius for that
im stuck
i want to calculate the growth ratio between these dates for each file
file_name date dist
20210314_080621.txt 03-16 0.820328
03-18 0.838098
20210314_080633.txt 03-16 0.755168
03-18 0.784473
20210314_080644.txt 03-16 0.561407
i dont think groupby.agg would work for this
oh wait i misinterpreted your question
If anyone is interested, I just finished up a project experimenting with style GANs. I trained a model that generates 1024x1024 bird images. If you're curious, I post 2 per day on Twitter as @bird_not_exist. Happy to discuss methodology with anyone who is interested.
just took a peek, looks awesome!
would love to ask more but idk anything about GANs but i might come to you later if i do something with it.
like, do you think its possible to do the same with fishies?
🐡
Thanks @misty flint! Happy to answer any questions you have. Feel free to send me a DM whenever as well.
You could for sure do it for fish haha.
Would just need to get enough images. I used something like 35K bird images to train and more would have been better.
Also need a pretty beefy GPU haha
this is good info to know. thanks i might take you up on that offer when i get a bit further along in my ML studies

Good stuff! GANs seem to have a lot of really cool use cases for sure.

bro this crap looks amazing af omg

Seems pretty sweet. Looks like it goes a lot further than plotly
Been a while since I bothered with plotly though. Could be better by now. Really just only use matplotlib and seaborn
I live a simple life lol.
jealous
i got 97 test data, and i willing to do some manual calculation, but when i print my test data, i only got 5 from top and 5 from bottom. im using print(X_test). what syntac do i need to write to show all my 97 data sample?
exploratory data analytics
Hi - I wanted to ask what is the name of the thing I'm trying to do:
Say I have 200 sentences that follow this pattern:
Chevrolet can go 200 km/h
230 km/h is the top speed of this Ford
I wanted to train a model that would identify what's the CarBrand and TopSpeed in that sentence. What is the scientific name for what I'm trying to do? It's not sentiments
EDIT: it's named entity recognition
please suggest the world best github repo / any blog for sentiment analysis with good accuracy?
anyone knows?
depends on the dataset; in most cases it would need a lot of modifications/tweaks to get it working
can you suggest some github repo for sentiment analysis ?
you can google it lol, there are plenty out there
if it is a numpy array, you can convert it to list and use print()
if in pandas, use .values to convert a column to a numpy array
Ok, thanks, I'll try it later
if it still can't be done, you can use magic functions to write the variable to the file (I think its %writefile variable > file_namee.txt) and view the file
i want to display date and time generated in python file (.py) in my django project its a opencv file
i have captured the drowsiness alert data so
I'm totally noob in Python, can someone help me? I have some homework from space data science and this is my notebook:
Hi anyone here familiar with scipy.stats?
@rotund dock https://dontasktoask.com/
Fair enough
I want to calculate the value for a given probability using the st.gumbel_r.ppf(). I'm comparing it with the analytical solution and it is giving me completely different results, anyone knows why?
I obtained the values for the scale and location using the moment of methods
P_class = [0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.999999]
u = 8.590342451210152
alpha = 0.1841827435642898
h_class1 = st.gumbel_r.ppf(P_class, scale = u, loc = alpha)
h_class = u-np.log(-np.log(P_class))/alpha
Results
h_class1 = [ -5.53466431, -1.7516637 , 1.6076281 , 5.17091797, 9.54112426, 16.24661736, 118.86414528]
h_class = [4.97583548, 7.36682128, 9.49000861, 11.74212971, 14.50424958, 18.74235061, 83.60013865]
I want to get the same h_class results when using the scipy function
I'm not sure what your intentions are, but linking people to that website isn't appropriate. It's comes off as dismissive and condescending.
sorry I saw other people do it so I assumed it was ok
I see. Thanks for letting me know. We're see that sort of thing in the same light as "google it" or "rtfm", so be sure to avoid it.
Got it. Won’t happen again
I was just wondering if this was the right place to ask a question about that topic
Which topic
yeah it is
This is the right channel for asking about scipy. The best way to get help is to just dive right in to your question--if someone understands the subject matter, they'll see your message and help.
hmm..but the content in the website seems pretty polite and very understanding. It's just a suggestion - not phrased as a rule. If the tone was bad, I would have understood. but I dont see the harm in anyone getting someone to read the content in the site
if it was me, This link should be in the first message every new user gets
So for example, don't ask if anyone knows about a general library or a type of problem, and hope that that person will know how to answer your specific question. It's easier to start helping if we know exactly what you'd like help with.
the wording on the website isn't necessarily bad, but getting a link like that when wanting help with a problem sends the message that "look, you've made a rookie mistake so common that there's a website dedicated to it"
I believe we cover some of the same material in our question asking guide.
yeah, so? StackOverflow has many such links for everyone. its much easier to link a message than to type it out
typing out everything everytime someone new asks would flood the server. its not efficient
I'm not sure I follow your reasoning. This isn't stack overflow. And Discord hasn't threatened to decrease their resource allocation for us.
its not about the resource allocation, just simply that :
putting a link to a message is more efficient than typing out a huge wall of lines everytime
That's why I have a file of copypastas that I wrote
that's pretty useless and inefficient. either you let someone put a link (because not everyone wants to store copypastas) or you get a bot to do that
a link does not feel harsh or anything. its just a website. how can it convey some negative feelings?
I can appreciate that getting a link like that might not feel harsh to you, but it does for a lot of people, and the reason it does makes complete sense.
How about you DM @sonic vapor if you'd like to discuss this further.
otherwise you only lead to inefficiencies of a question-answering system and help vampirism
Pasting the link without another word is the problem (not necessarily the content of the website), for the reasons I've explained previously.
You're not required to give long-winded responses for each ask-to-ask instance. You can simply say "Go ahead and ask" if you'd like.
Let us know in #community-meta if you'd like to discuss this further.
So I'm using sklearn's MLPClassifier to predict whether a machine breaks based on input data
I've tried a tonne of different paramter settings and 8/10 times the machine's score is terrible
It's not really predicting it at all
Is this indicative that the input data might not actually correlate to whether a break happens or not?
@kindred radish what features are you using?
I've got like 6 features and they're features of the thing that's going into the machine
Such as the thickness of the material and its composition
I don't have any data on the machine itself apart from how many times it breaks a day
Ok so:
-X_train are like 6 features about the film that goes into the machine.
-y_train is an array of 1s and 0s that represent whether or not the machine breaks.
-This is fed into the MLPClassifier that SkLearn provides.
-The aim was to create a model that could correctly predict if the machine breaks or not.
-Despite playing around with the classifier's parameters, the model is unsuccessfully predicting whether the machine breaks: the score and the precision are abysmal.
- my question is: Does this mean that the input data is uncorrelated to whether or not the machine breaks? Can I say that this data doesn't have anything to do with the machine breaking?
Lmk if anything needs clarification, I think that's as clear as I could make it! I standardised the input data as well ^^
yeah, either your model does not have correctly inputted data, or there is no correlation. can you identify whether or not the machie breaks seeing a sample?
ive been stuck for hours on trying to curve fit some data
the curve is relatively exponential
but when i plot the curve fit it gives me a straight line
def curve1(x, a, b, c):
return a * (b ** x) + c
def plot1():
fig, ax = plt.subplots()
graph = ax.scatter(year1, production, c=production, cmap="bone_r", s=8)
popt, _ = curve_fit(curve1, year1, production, p0=[-1000, 1e-6, 1])
a, b, c = popt[0], popt[1], popt[2]
fit = []
for i in year1:
fit.append(curve1(i, a, b, c))
plt.plot(year1, fit)
plt.tight_layout()
plt.savefig("graph1.png")
plt.show()```
I'm sorry, what did you mean by that last bit? Are you asking if I have access to the physical machine itself?
hello, when I run my network, "model.fit(x=train_samples, y=train_labels, batch_size=10, epochs=30, verbose=2)" and train_sample and _label are train_labels =['^GSPC.Adj Close.csv']
train_samples = ['^GSPC.Close.csv']
I get Function call stack:
train_function
i couldnt figure it out either. sorry. i tried all the various methods in that module too
no, like if you were given the training data, could you yourself identify whether it would break or not?
No I couldn't, I didn't even know if there was a correlation to begin with. I was just given the data to play around with and see if I could get anything out of it
But it's fine that it doesn't work, as long as that means I can say that the variables do not affect whether the machine breaks or not
no, it would be more likely that your feature engineering is wrong
How do you mean?
All of the features have been standardised and I've removed outliers
Feature engineering is an informal topic, but one that is absolutely known and agreed to be key to success in applied machine learning. In creating this guide I went wide and deep and synthesized all of the material I could. You will discover what feature engineering is, what problem it solves, why it matters, how […]
Is this more to do with unsupervised learning?
Thank you for your answers so far btw!
uhh no. this is feature engineering - that involves applying certain methodologies to enhance the features you are feeding to a model
I'm unsure how much engineering I can really do with my features
They're measured values of things like Viscosity or amount of chemical
well, its mostly common-sense/logic. What do you think is the best way to represent your data so that the model can understand
Well I made them all standardised so that they were comparable to one another
As some features are of the order of magnitude of 100 whilst some are like 0.1
So I guess that was a form of feature engineering?
My task was exploratory, I wanted to see if this data was responsible for why the machines break
wait, if you are doing EDA then why are you using models?
If I could get my model to reliably predict a break, that would mean that the input data affects the machine
EDA?
exploratory data analysis
EDA involves something like this: https://www.kaggle.com/sohier/structured-eda-for-data-cleaning
a small list of what you have to do (not to solve the task, but to explore the data)
exploration: like what feature appears the most. plots etc.
Ah it's a bit confusing, I don't mean machine as machine learning. Refer to this ^^
you have some kind of machine that youre feeding data to to determine whether it physically breaks?
why arent you measuring performance of the machine?
or does the machine not have any metrics?
i know youre not referring to machine in machine learning
The machine's only metric is that it breaks or that it doesn't break
Yeah exactly, there isn't context for me to problem solve
well, you don't have to train a model anyways, so no feature engineering required
can you not look for more context? i feel like it would go a long way in your data analysis
This is all the data that I have to work with unfortunately

Yeah I know
Despite playing around with the classifier's parameters, the model is unsuccessfully predicting whether the machine breaks: the score and the precision are abysmal.
how much data did you have to work with
you usually need A LOT of instances
After cleaning it, it's around 300

I've only used the sklearn MLPClassifier
why did you use that one
Because i thought it would classify between "break" and "not break"
i think you should not use a neural network
you dont have enough data
use more simple classifiers
its probably overfitting
Oh right, what kinds of classifier's would you recommend from SkLearn?
try a bunch
logistic regression, naive-bayes, SVM, random forest, decision tree, etc.
see how your accuracy looks like afterwards
But If the precision is bad for those as well, maybe that suggests the input data has nothing to do with the output?
yes
Thank god
Okok thank you
technically thats a conclusion too
Yes exactly
and then you can show that by showing all the different classifiers you tried
The worst thing I can say to my supervisor is "I don't know why it doesn't work"
So if I try a bunch of things and it doesn't work then I can say "this could be because the data isn't right for the job"
Which is much much nicer
yep yep
Thank you !!!
Feel so relieved

Would you mind if I @you here if I run into trouble?
I won't do it for petty shit, just advice
yeah sure
whenever i have time, ill end up responding
haha np
we're all here learning together

huh? how does the precision have to do with any correlation between data?
Lemme show you what I mean, one sec
you can look at accuracy scores using sklearn btw
theres a function/method
also this is cool if people are actually trying to figure out CAUSATION, not correlation https://livefreeordichotomize.com/2016/12/15/hill-for-the-data-scientist-an-xkcd-story/
So like, right now it's incorrectly predicting "Break" when it's not a break
Ideally the top left and bottom right elements would be high numbers
aye that's what this ios
It's set to be normalised
ah
whats your F-score?
uhhh how would i find that? Do you just literally mean "model.score"? For this run it was 57%
well, is your data imbalanced?
i think classification_report() tells you a bunch of scores
(and please do not jump in a task without learning the appropriate basics,
you would struggle more and you wouldn't understand anything)
imbalanced how?
This is my final year project and I've been teaching myself everything since my supervisor doesn't understand how any of this works. This is the last stage of it all 😕
do you want to do data analysis or EDA?
I specifically have to do machine learning. I've got a final meeting with a CEO on Tuesday where I have to explain the results of this :))))))))))))
So i wont really have time to implement anything major
how does making a model help in this?
to find correlation between features, there are different techniques
I originally just wanted to predict if the machine would break as I thought the data was correlated
As i wanted to be able to say to the CEO "hey this is why ML is good. Look, i can predict when your machines break"
And my supervisor also thought that would be good
and what exactly is your task- meaning what exact variables have you been given?
The Ripeness of the film, the amount of a chemical in it, the viscosity of it, the thickness and some other stuff as well
and WTH is the machine?
hahahaha
thats what I was asking
💀
people always try to separate the data from the context and it never ends well
The machine essentially rolls the "liquid" into films
and how often does it break per hour?
All of these inputs are for the "liquid". Sorry i said film because ive wanted to avoid saying "liquid" since i dont want to talk too much about specifics of the manufacture since it's a company i sort of working under
I havent been given that data, ive only been given how many times a day it breaks
🤦 if its a number, then why are you trying to predict whether it breaks or not?
I tried to break the problem down into a simpler one
And i thought that predicting whether the machine would break at all would be simpler than trying to predict how many times it breaks
please tell the whole problem next time. your task is not classification, its regression
the number of breaks is very small: around 0-5
and does it break atleast once a day?
No, some days it doesnt break
then that is a regression task - which is much easier than making it a classification task with 5 labels
When i tried to use MLPRegression it didn't work too well

hm...accuracy?
I got a negative score lol
thats not accuracy lol
accuracy is always positive
idk how you can get a negative accuracy
yeah i know thats why i was so confused
And got a negative numner
yes i understand the square of something gives a positive
Yeah that can be negative if the model performs worse than the mean
Id read that a negative score was bad 😅
just leave that and focus on the accuracy 🤷
💀
ok im sorry that i might seem really dumb or whatever, but ive been given literally no direction and im a Physicist, i code in python all day and pray to god that my code doesnt come back to haunt me because it looks monstrous
friendly remember: sometimes, outliers can mean something, so just think about your problem as a whole before cutting them out, etc.
thats alright
No worries man
negative R2 is pretty bad - did you try increasing the number of hidden layers in the MLP?
Sometimes that means that there is something going on with your model specification
It might be worth taking a closer look at your data
Tried a bunch of things, brought on my CS housemate to look at the parameters too
ruler just wants to remind people to brush up on stats basics before diving into ML
which is fair
bc it happens a lot

something like this = hidden_layer=(10,30,30,50)
@kindred radish number of layers, not the amount of neurons in each layer
I did (100,50,25) ive tried just doing 100 or stuff
Might also be worth plotting some of the variables against each other and coloring by whether the machine broke that day to see if there is any pattern
Or doing the same with histograms
oh this might be a problem then? Wait let me show you
in multi-dimensions? 👀 TSNE it
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor.score
the very first parameter "hidden_layer_sizes" is what i've been doing (100,50,25)
I did see improvement
But not much
make it even longer; use solver lbfsg
Yeah ive used that solver because my data set is small
and more hidden layers?
Yeah i tried a bunch of things
make max_iter=1000 or so?
return MLPClassifier(hidden_layer_sizes = (100,50,25),
random_state=0,
activation = 'relu',
learning_rate_init=0.0001,
solver='adam',
max_iter=5000,
verbose=True,
early_stopping=False,
n_iter_no_change=10,
alpha=0.001,
tol=1e-8,
beta_1=0.9,
max_fun=15000)
Like, everything you see here I changed
Spent hours messing around with it trying to create variations on each other
what about a 10-layer network?
also, put validation_fraction=0.1 to test your network on 10% of your train data
so like if i made hidden_layer_sizes = (1000,500,250,125,60,30,15)? I thought the length of this tuple dictated the number of layers??
it does
are you using colab?
Right now it's testing against 20% of the total data. Like i split my data into test and train data
No im not, what's that?
nvm that
Ill try something like this
try leaving the split, and just put the parameter there. be careful to pass all your data variable (before the split)
umm ok, why do it this way round? The test and train data is always jumbled up before hand each time to make sure that it's not just doing exactly the same thing each time
for newbies, integrated is much better
because it reduces the chance of a problem somewhere
Also i still got this despite using the above
I told you, ditch the scores and focus on the accuracy
okok sorry ill go implement that now
your precision/recall may matter, but it depends on the task you are doing (like what does you implementation value - false positives, false negatives etc.) different values are preferrred for different scenarios
like for predicting cancer, you do not want any False Negatives.
I guess for breaks i wouldnt want a False Negative either
you just want a high accuracy for prediction 🤷
for your task, FP's, FN's etc. dont matter
because you want to say to the CEO that your model can predict 90% of the time whether the machine would break or not
not talk to him about FP's or precision/recall
Ah im using the MLPClassifier did you want me to swap to Regression? Sorry if im being slow, i dont mean to be frustrating to help!
yeah, regression
aight lemme swap it to regression real quick
The regressor assumes a linear relationship right?
anyone know of a non-blocking way to implement matplotlib in an async environment: the regular run_in_executor from asyncio doesnt work
@kindred radish no MLP has a nonlinear activation
ooof im getting an error lol
raise ValueError("Classification metrics can't handle a mix of {0} "
ValueError: Classification metrics can't handle a mix of multiclass and continuous targets
Im much too tired to be able to try and handle the error
So i'll probably call it a night and try and do something tomorrow. Is it alright if I could ping you as well tomorrow @grave frost ? Im on GMT timezone
absolutely fine if not, you;ve helped me a lot already! ^^
you've probably not set the columns to be predicted correctly. its basically an error that it can't handle classification and continous regression together; you have to choose one
yeah, thats alright
thank you!
I think @misty flint mentioned this before but since you only have 300 samples I would stay away from a deep learning model and probably stick with a gradient booster if all the data is tabular (no images or any unstructured data). I’m assuming since it’s your first dive into ML your supervisor will want the model to be at least somewhat explainable and not just a black box which a gradient booster will give you to some extent. They’re also probably more of the standard for structured data.
Depending on what problem you’re trying to solve you may have been right in choosing classification over regression. If you just want to predict if a machine will fail that day then I would turn the problem into a binary classification one. That should be a bit more achievable with that little data. Otherwise I’d stick with regression and not do multi-class classification.
Can someone explain in plain english what is the difference between Dense and Sparse matrix?
I feel like i get but i want to be sure i have the right idea...
Sparse matrices are mostly zeroes, so much so that it becomes efficient to create a separate data structure that stores the location and value of the nonzero entries rather than the whole matrix
but for example, why are sparse matrices preferred in some cases? I was reviewing some NLP tasks and it seems sparse matrices are like the daily bread there.
is that an inevitable conclusion of the pain in the ass of text?
Yeah usually with NLP you’re dealing with a giant matrix where each column is a word and each value is the number of times that word appeared in the “document” (which is whatever you’re considering a single observation)
You can imagine how that can get a big fast and how lots of counts will be zero
I said word but usually it’s called something more general e.g token
Lol good question all I know is If you do One hot encoding it’s sparse
But beyond that and the mostly zeros I don’t really know
But no that’s not the end all for nlp most utilize some sort of embedding which turns your tokens to vectors so you don’t have to use one hot or sparse stuff
Yeah I should say that’s the simplest case like bag of words
Lots of techniques are concerned with reducing the dimensionality of the feature space
it depends on what you're doing
like the simplest form of vectorisation
bag of words modelling
there are many unique words
but most of them will not appear in any one document
-> lots of 0s
imagine a matrix where 99.9% of the values are 0
naively storing it in a data structure meant to hold dense data
will take ~100x of the amount of memory a sparse data structure would take
sparse data structures can also optimise for certain operations
for example, say you want the document(s) that contain the most of a certain word
you can clearly ignore all the 0s
there are many ways to store sparse data
each with their drawbacks and advantages
@velvet thorn thanks a lot man.
I see that sklearns CountVectorizer returns a scipy.sparse matrix
is that one of the optimized structures you mentioned?
ay so the ML model i built is in first place in my march madness pool
Yes
Sparse matrix types, unlike normal matrix types, aren't going to be stored like straight arrays
They're going to have more complicated encoding schemes to avoid doing useless operations
🍼
Hi there, I am beginner. I am currently working on EDA of 'Temperature Variation of Countries'. So, if any beginner (like minded) want to work/study with me. just drop a message. It's just like group study, nothing else.
Great I’ll keep trying and see! Thanks for trying
What does it mean for the model (nn) loss to fluctuate?
it's not training properly
decrease your learning rate
ideally have gradient clipping and stuff also
and normalise your data
I think the most common method is storing indices, right?
for sparse arrays with less density
!docs scipy.sparse
This appears to be a generic page not tied to a specific symbol.
there's compressed-sparse-rows, compressed-sparse-columns, dictionary-of-keys, and probably others I don't remember
and that's only for sparse matrices.
Hello there,
I have somehow managed to land a Data Engineer role. I have never been a Data Engineer before.
My background is DevOps (Linux for the last 15 years ,and also AWS, Ansible, systems, bash, python, networking, etc). At the interview they mentioned being in a Data Team and technologies and terms such as ETL, data pipelines, python, pytest, Jupiter Notebooks (with Pandas, numpy, matplotlib), AWS s3, data wrangling, Linux, SQL, Agile and doing code reviews in Github.
I think I have either worked with some of these technologies and/or have played with some of them before but I don't know anything about ETL, data pipelines, pytest, Jupiter Notebooks (with Pandas, numpy, matplotlib), and doing code reviews to any great extent. Could someone point me to some recommend learning guides such as MOOCS, online courses etc about those (ETL, data pipelines, pytest, Jupiter Notebooks and doing code reviews).
Also, what is it like to work in a Data Team, is it different to working in a Software Development Team? How best can I transition from a DevOps (mostly Linux sysadmin type experience) to a Data Engineer mindset? What would be some good things to figure out and ask questions about when you first start in a Data Team?
Any help much appreciated.
Hello everyone!
I am working on a project that uses OCR. I have videos that show boxes from above rolling on a production lane from one side to another. I already have a code provided by someone else that cuts from video frames only the box’s label and my task is to OCR this label.
Videos are recorded by fish eyed camera lens so it is a little deformed, which is not making it easy for me. First issue I was struggling with was to find a proper OCR tool, because it is going to be running on a weak virtual machine with only CPU (4 cores). I found brilliant PaddleOCR which is lightweight and seems to fit my needs, but I can’t apply it properly on my box labels to OCR them. Without doing anything to the images, PaddleOCR has 85% efficiency in detecting and recognizing numbers properly, but I need to make it better. I figured that rotating an image is helping in many cases, but I can’t have one fixed angle by which I rotate every image, so I found that four point transform can help in my case. To properly use this transform I need to find a contour of my label. I tried playing with opencv to pre-process image but I can’t find universal parameters for all images in order to find this contour and transform it. I tried pre-processing images with thresholds, blurs, applying erosion, dilation etc. after which I use canny edge detection and then I try to apply four point transform. I will be thankful for any help. Images with labels I am working with look like this:
@hallow girder please save this question and ask again another time. I'm worried no one with that knowledge saw it.
dont do BS, try to unwarp the images properly
the issue is a fisheye warp, the solution should be an unwarp
if you have access to the camera, run a calibration procedure, or know the camera model, try to find its distortion coefficients online
no OCR software should have any issue with those photos given it's undistorted properly
using pre-trained embeddings here, when I try to generate a subword vector, it has no problem. but when I try to find its most similar word (using an inbuilt method) then it reports that the word is out-of-vocabulary. how does that work?
ahh, nvm
precision recall f1-score support
0 0.58 0.77 0.66 39
1 0.55 0.33 0.42 33
accuracy 0.57 72
macro avg 0.56 0.55 0.54 72
weighted avg 0.56 0.57 0.55 72
@grave frost As per @stiff barn 's suggestion, I've used a Gradient Booster to try and classify breaks or not. I just printed classification_report() to get this table
hmm, why are still doing classification? regression?
Aye I'm gonna try the regression one next, I just thought to print this out since I had everything set up for classification. I guess it was like "let's just see what happens" 😅 Ill go use GradientBoosterRegressor now
cool
Hey @hallow girder. I’ve worked as a data engineer for a few years now and currently work as one for a large company in the US so I will do my best to provide some help and answers.
First thing, congratulations! It is possible that the team wants someone with a lot of dev ops knowledge as we do need to perform a lot in that area.
Since you already have cloud and Python experience that is going to help a lot as well since we generally like to build in Python. If you have IAC experience as well that would be great to show the team.
If you know which cloud provider they are using, I would look into getting a data engineering certification for that cloud. Each of them have one I believe and there are usually courses on coursera for them. There is also a company called DataQuest which has a Data Engineer track which is pretty good. They go through writing memory efficient pipelines, pandas, numpy, ect... You may find their other tracks useful as well.
When you first get to the team you’ll need to figure out what you’ll actually be doing and what technologies they use. Data Engineering can vary to some extent where some companies you focus more on writing SQL procedures and managing databases, while others it’s more writing data pipelines in the cloud, and others it’s a lot of both. On that note, if you’re not strong in SQL that’s definitely a top if not the top skill to brush up on. Figuring out what the team generally actually does will help you target what to learn.
Feel free to reach out if you have any other questions out want to chat about the role.
lol regression is very achievable in little data if there is a simple correlation between the input features. that is no basis for a recommendation to a task
Do you have it set so there are only two labels, 1 and 0?
Yeah, 1 meant there was a break, 0 meant there was no break
That's for the output, y
That is not true in his case. Given that he is reframing a regression problem into a binary classification problem can increase accuracy as you’re essentially condensing a range of values into one. It can be helpful especially in his case where I assume there are cases where the same input values can lead to a different output values due to chance often. At the end of the day though, we don’t know that much about the data so I could be wrong.
all good points, but moot since that would cause bias in the model
@kindred radish how many times a week does it (on average) break?
I am here to learn as well so if there is something I’m missing then please call me on it.
The number of breaks is pretty small, each day (which is the time frame everything is recorded) there is about zero to four breaks
each day?!
yeah ikr lol
yeah, then its a regression hands down
wait why
I thought it was such a small amount of breaks
that regression wouldn't be a good idea that i could simplify the problem using a classifier
yeah, but having a large frequency each day makes it much more suitable to regression. if was like it broke 3-4 times a week with different days, then it would be classification because you could counter-act the bias
If it’s that often then binary classification will probably just always output 1
Can’t really learn much. If you had the data on an hourly basis or something that might be different
Why does it make it more suitable to regression?
because it can handle the fluctuations pretty well and classification would not handle the bias and would just output ones
When i think of regression, as a physicist, i think in terms of some equation being obeyed. I have no idea what equation this data would follow to yield the output we see
That’s what the machine must learn haha.
Could be a simple linear equation
Or a bunch of decision trees working together if a gradient booster/random forest
Or any of the number of regression models
I doubt it is linear, only because the factors are kind of complicated and aren't so simple
Thanks for calling me out there @grave frostz I wasn’t aware it was breaking that often
Like im not even sure that the data is correlated to whether the machine breaks or not, this is just the data ive been given. It could be something like the humidity of the room, which they haven't recorded, which is actually making the machine break
even if there was no correlation, you could find partial correlation which is help enough
You have the data in Pandas right? You can start by running df.corr() and get a look at the correlation between columns.
So even if the data didn't actually affect whether the machine breaks or not, a model could still figure something out?
I'll try that out ig. Is this fine to do one standardised data ?
More like even if the data isn’t directly correlated, there can be some partial correlation that can give the model some signal to make predictions on.
As in scaled data via MinMax or Standard scaling?
Standard scaling
That isnt the raw data, i've standardised it myself
Just wondering which to use. It's just more convenient for me to use the standardised one rn thats why i was asking!
It should preserve the correlation and be fine.
If the scaling didn’t preserve the correlation then that’s a problem haha
How do you have the data stored in the notebook now when training? Numpy, pandas DataFrame?
So i have a Pandas DataFrame of all the features, X, and the output, y.
Then i split the data into training and testing data and randomise their positions according to sklearn's train_test_split() function
So you'll want to run the correlation on the full training dataset with the y included.
Yeah not looking promising
-0.08902379625215213 -0.09753689601363387 -0.17866755803410775 0.02011288489513478
Are the correlations im getting to the total number of breaks
So what are those, are those the correlation of each feature with the y?
Yeah, that's the correlation of each feature with the total number of breaks that happened on that day
There is some signal there. Not much but some. feature 3 has a decent negative correlation.
It's not all bad as there may be some non-linear correlation that you cannot see here
Yeah i would presume this just tells me that there isn't really a linear correlation
You can bring that out using feature engineering techniques like feature crosses
And that the third feature is the "most" linearly correlated
Yeah, basically
The gradient booster should be able to capture some of the non linear correlation.
I would just start with the default.
You can try other options later in your hyperparameter tuning phase and see if others work better.
aight what metric should i use then? I just used classification_report() before but obviously this is regression?
i was referred here from the help channels, could someone help me with some code?
i am trying to make a bar chart, but i am unsure how to call up the x-axis and y-axis variables needed in
sb.factorplot(x=), (y=)
as the data is from several different csv files, and the math i have done is in functions
the code is on here https://paste.pythondiscord.com/uqezixusen.py
https://drive.google.com/drive/folders/1EJtc3R60eAMcMqLcSmGuPAGYPifLVIzz?usp=sharing here is the csv files relevant for the chart
@spark olive sb.factorplot(x=), (y=) is a syntax error
i need to put something in the x= and y=, but i cant figure out what to put
Which columns do you want to plot?
You’ll need a data frame with two columns containing the data you want to plot
x and y are strings with the names of the columns
i want to plot EAB_SUM as the x axis, and postBIODV as the y axis, but i cant figure out how to turn it from a function to a series
@stiff barn So i mentioned yesterday that this happened, but i have somehow got a negative R2 value...
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)
My R^2 value for this is like -0.8 and the MSE is like 1
aiming for something like this
@spark olive the globals and the Xs in the function signature make the code a bit hard to follow
(realized i said barchart earlier when i meant dot graph, my bad there)
Is the data you want for EAB-SUM in the total column of XX_EAB?
the data i want for EAB_SUM would be the result of
XX_EAB['TOTAL'] = XX_EAB.sum()```
but it would differ depending on if it was NY, WI, or TX
the fuctions having the XX was for the purpose of being able to replace the "XX" with the state code (NY, WI, or TX) without having to retype all of it every time
Uh ok might be better to have a positional argument called state
oh? i have never used that before
I’d refactor your code so that it returns the column you want
And set it up so that you can give it the stats and it will get the sum for that state
Then use pd.concat to put them together
how do you do that?
Try writing a function takes the data and the state name and gives you the sum for that state
Then create a loop that puts the sum for each state in a list
Then convert that list to a series
how would you create the loop? sorry i am new to python! 😅
I want to implement a generative adversarial network in native python. not for any practical applications, but for learning
Did you adjust the dataset to regression where you have the full range of values not just 1 and 0?
Yeah im a mug i just realised i did that lmao
lol all good
Seems the regressor doesn't like that the input is continuous whilst the output is discrete?
ValueError: Classification metrics can't handle a mix of multiclass and continuous targets
What metrics are you using?
Lol good, how are they looking?
Annd Im getting the following:
MSE: 1.2018049174929522
R2: -0.2902649077024708
MSE: 0.6243021962953844
R2: -0.016531632695806264
MSE: 0.7926685016933164
R2: -0.3511394915226984
For three runs
So they're still negative. I checked to see the data that im running through and it all looks good
@spark olive no worries theysian might want to go to dedicated help channel for that kind of thing
Afraid I can’t walk you through it right now
@deft ruin i went there but they directed me to here instead
no worries though! any time youre free id really appreciate it. thank you so much for your pointers so far!
Fiddling with the Gradient Booster's parameters doesn't seem to help
Let it run for longer
aight
MSE is decreasing
For the most part
Also, if you can show your loss that would be helpful
Aight i slapped it up to 1000
Iter Train Loss Remaining Time
1 0.6906 0.00s
2 0.6635 0.50s
3 0.6451 0.33s
4 0.6294 0.50s
5 0.6113 0.40s
6 0.5965 0.50s
7 0.5845 0.43s
8 0.5727 0.50s
9 0.5605 0.44s
10 0.5534 0.50s
20 0.4828 0.39s
30 0.4391 0.39s
40 0.3983 0.36s
50 0.3602 0.36s
60 0.3275 0.34s
70 0.2983 0.35s
80 0.2739 0.33s
90 0.2500 0.33s
100 0.2338 0.32s
200 0.1234 0.28s
300 0.0760 0.24s
400 0.0524 0.20s
500 0.0355 0.17s
600 0.0244 0.13s
700 0.0175 0.10s
800 0.0133 0.07s
900 0.0098 0.03s
1000 0.0073 0.00s
I could fiddle with the learning rate?
So i made the learning rate like 0.001 as opposed to the default 0.1. This makes it much less negative, it actually makes it go to around zero.
I've read that having a "high" learning rate is what you do if you have a "small" amount of data?
negative r2 means that the model performs worse than a horizontal straight line. 0 would mean it performs just as good as one
.001 would be a fairly standard rate. It would be lower than the default though
0.1 is the default yeah
Trylowering max depth to 3 or 4
So this is the same as me just drawing a flat straight line through my data? That's pretty crap right? lol
Lol yeah basically
depth is default to 3 lemme try 4
Tried 4. it's looking like it's still 0
like floating around that, both negatively and positively
Is that on the holdout dataset?
holdout?
The dataset you save that doesn't touch the training process
oh the "test" data?
Like i split the data into a 80-20 train-test split. These metrics then compare the predicted output based on what's trained against the test dataset
Yeah, the test dataset
Yeah then it's all done on that
are you worried that it's correlating the training data fine, just not the test data?
If the positions of the training and testing data is randomised each time, doesn't that help?
More that it's over-fitting to the training set.
I'd suggest setting a random seed for both the model and the training test split so you don't get different results each time due to randomization
@spark olive sure thing i might be able to provide some more help later
Ok that sounds like a good idea
What's on the y-axis of this plot though @stiff barn Or did you mean to track how the R2 value varies upon changing the paremeters?
Hi! I need some help with pytorch.
So let's say I've calculated a gradient for a set of parameters( It doesn't matter, but let's say with a MSE loss function)
So, when I'd like to step the parameters
is there any difference between
parameter.data -= parameter.grad * learning_rate
vs
parameter.data -= parameter.grad.data * learning_rate?
Since I've already told Pytorch not to calculate the gradients for this stepping operation with "parameter.data"
Just try changing the parameters they suggested there and see the effects.
@deft ruin thank you so much! feel free to dm me if thats easier
The y axis is just a measure of goodness of fit
HMMMMMMMMMM:
i used exactly their code, and i've just put X_train,X_test,y_train and y_test into it
Wouldn't bother copying the code. I'd just try those parameters like subsample=0.5 in the code you had
haha sorry should have been more clear
Looks like im getting about the same as what I was getting before
Yeah I mean unfortunately there may not be much signal in the data you have. You could try doing some feature engineering or collecting more data.
If you're willing to share the notebook I can look for anything being off
Not sure if that's allowed for you though
I can't share it I'm afraid im under an NDA
it's a real bitch right? hahaha
I want to be able to conclusively say something about the data
Yeah, mine's pretty invasive
I had an idea to push in artificial input data to induce a correlation that the models would learn. For example: create a feature that always results in a break if it has a value of 0.5
And this would show that the data i've been given is likely to not an effect on whether the machine breaks or not
Would this be a good idea to do do you think?
Not a bad idea to validate your process
omg some hope!
If you had the time, you could also try something like this
You could use that to generate more training data that has the same statistical properties as what you have.
That could improve training
Wouldn't recommend it though unless you had a bunch of time
ohhh is this like... In unsupervised learning, say you had pictures of people's faces and you wanted a model to learn how to recognise them. You don't need to take more photos, you can just mirror them and that's the same as new data?
Yeah, I'd say the idea is similar.
Not exactly the same as new data but helps prevent overfitting and such.
ok that's cool, i might check this out after the meeting, unfortunately it's on Tuesday so i don't really have the time to look into it now
Gives more examples to train on
Thank you for helping me, this channel is awesome. Definitely made me realise how little i understood about what I was doing! Wish I'd been more active on here at the beginning of the academic year lmao
No problem, you seem to learn fast and are able to iterate quickly on what we suggested. It's a pretty cool community for sure!
We're all still always learning. You have to be in this field haha
thanks ahaha i've been in a bit of a panic so i was laser focused on whatever you guys said 😅
Yeah it definitely seems like it. When i was younger i saw lots of videos about ML and data science which was more on the meme-y side of things so i think i falsely assumed that the field wasn't as deep as it is. Obviously a terrible false assumption on my part! So much info for such a young field of study
Yeah it's all super cool. It's a very interesting way to solve problems
Unfortunately just takes a lot of data generally haha
Well that seems to be what was holding back the science when it was first come up with right?
Because a lot of this Computer Science was discovered decades ago
But """"Big Data""""" wasn't available back then as it is now. Like the amount of data that Google and Facebook or TikTok has on people is absolutely insane, so there's so much room to explore with ML
Yeah for sure. Both availability of data and processing power. Now you can buy a solid GPU for $700 and be able to build fairly large models in a few hours of training or just hop over the the cloud and rent even larger resources.
that is if you can find a GPU hahaha ;)
So im pretty happy with how everything's turned out, despite it being sloppy on my part. One thing I would like to check with you @stiff barn is why you suggested the GaussianBoost function? That question might be a bit involved so if there's material you know of that I could read about it then that would be awesome! I've looked around a bit on wikipedia and stuff, just unsure on some concepts i guess.
Haha I’ve been very lucky in that area with the 3090
The library XGBoost is what is generally used in the industry for structured data. That mostly contains optimized boosting models. They generally are the go to because they work really well on this kind of data. The most basic explanation I can think of is that you’re training a bunch of decision trees in a row where each subsequent decision tree tries to learn to correct the errors made by the decision tree before it
Thank you! So I should look into XGBoost then and read around that a bit?
@kindred radish Yeah, probably a good idea. This is the book I read to get a good introduction into a wide range of things. https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646/ref=asc_df_1492032646/?tag=hyprod-20&linkCode=df0&hvadid=385599638286&hvpos=&hvnetw=g&hvrand=13529348850306413326&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9004373&hvtargid=pla-523968811896&psc=1&tag=&ref=&adgrpid=79288120515&hvpone=&hvptwo=&hvadid=385599638286&hvpos=&hvnetw=g&hvrand=13529348850306413326&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9004373&hvtargid=pla-523968811896
Hey all!
I recently finished my Master in mechanical engineering and I thought about getting myself a little into ML (not for a job per se, but since i'm interested)
I'm currently playing some browser game, where Players can produce a number of units and fight each other. Combat is fairly simple each Unit picks a random Unit as a target and attacks it but some Units are good against others which means if they hit the Unit they'r good against they have a chance to attack again.
So my idea was to teach a ML with the Unit compositions of all the players on a Server and calculate the best possible "counter" to their Units.
Would you strongly advice me to not try this? If so, why?
Hey @marsh gale, welcome! It sounds like what you're attempting to do there is reinforcement learning. It sounds like it could be doable but I would suggest tackling a simple problem first for your first dive into ML. Something that already has a clear labeled dataset so you can wrap your head around some of the underlying concepts in ML. Reinforcement learning is definitely more on the advanced end and a current area of active research.
you can try naive Q-learning or DQN since your environment seems pretty simple enough with not much complexity
baby bottle?
out of context baby bottle
maybe he/she was trying to tease him?
idk
Hi, I will be happy if someone can guide me.
I am trying to build CNN that finds the most similar fonts to a font in the picture.
The point is, that the font in the picture is in a different language from the font I try to find.
For example, there is a picture with a font in Japanese, and neural networks need to find the most similar font in English.
I don't know a lot of neural network staff, and I don't ask for a step-by-step guide, just need a direction and from where to start.
I already staterd establish a database for it.
Any help would be welcomed!
will a small batch size cause the loss function to fluctuate, my data set is a 2000, 40, 1?
yes
well, your batch size will change the result of your learning rate
a larger batch size tends to work better with a slightly higher learning rate in my experience
have a look at the MNIST dataset. It's kinda the 'hello world' of machine learning and revolves around identifying handwritten digits with CNNs. It'd be a great point to study since you're also analysing images of text
Hey! Tyvm for the warm welcome! 🙂 Yeah thats what I thought, too but on the other hand, I really enjoy solving a problem I've set myself.
Yeah from what I've read it would fit into "reinforced learning".
@grave frost tyvm for the advice! What makes Q-learning/DQN so desireable for my task?
anyone know how to add the ID column to the left and stop it using country/region as index?
this isn't freshly read in data, its the result of transposing the data then setting first row as headers, so cant use read_csv parameters
HI there, I've covered numpy basics, and would like to practice, can you recommend me some sources that I can use to find exercises
Q-learning and DQN are usually preferred for solving simple environments - usually which do not include any multi-agent interactions. if your game is of greater complexity, you might want to look into more complex techniques like NEAT.
Either way, it depends on your environment.
@hollow zephyr try find some interesting kaggle data and look at the tasks tab if you want to find solutions yourself without following a youtube tutorial
Thanx a lot
is there any way to slim down pytorch installation to just torch.jit? It's 170MB as of now even w/ only CPU support
only using it for running traced models in prod
not trivially AFAIK
you want the dates to still be the index, but the index name to be id? You can just rename the index, if that is what you want
I assume you'd have to clone it and figure out what you can safely prune from the code base without breaking what you need?
Quick Question - what can be the best ways to squeeze all perf out of pre-training? (apart from things like hyperparameter tuing)
My guess is that if you try too hard, you'll just overfit, but I could be wildly incorrect.
yeah
yea, that's why I am not using it ¯_(ツ)_/¯ it was just an inquiry whether someone know some method that can enhance its performance
This aged well: https://www.reddit.com/r/MachineLearning/comments/6n97my/what_do_you_guys_think_about_siraj_ravals_videos/
3 Years ago - before the scandal
@grave frost this would be more suited for #ot0-psvm’s-eternal-disapproval
Since it’s more about a person than a DS/ML topic
@velvet thorn I'm trying to think of a reason why a general-purpose tool that prunes unused modules from 3rd party packages isn't out there. Is it because the very dynamic nature of python would make detection of unused modules very hard?
yes
well
actually I need to think about that
but my gut feeling is also that there’s not much need for such a tool?
yeah I don't think there's a conrete need. But I can also see how some people might use it (for example, minimizing attack surface on third party dependencies)
Hey guys, quick question. How can I downgrade to Python 3.8 without using virtual environments? I'm running on Linux Ubuntu.
While it's possible to downgrade, @quasi sparrow you should also check out just using python3.8
instead of python3 (which I'm guessing is linked to 3.9 on your system)
Oh, that's a good idea. Let me try that!
depends on who you are. academia vs. industry. research vs. applied. etc.
speaking of papers

also if you are in academia, it should be less about reading X amount of papers, and more about focusing on your particular field + 3 pass method
maybe those related to your research, you would do the complete 3 passes, while those in other domains, would get 1 pass

as a reference, they mentioned that someone did sexual orientation from algorithms
I want my technocratic dictatorship to have at least some engineered catgirls or sex robots, but instead we have Onlyfan thots
and i was like

i wonder if anyone is ever like: 'maybe we shouldnt do this project'

Seems like I need to use tensorflow ==1.11 to run this pretrained network. I would need to downgrade to tensorflow 3.8 so I can install tensorflow==1.11
bro I laugh at the U,S researchers thinking
have you heard of the 3 pass method?
if not, then reading papers will eat up all your time

"guys we have a very tough political situation at home, what should we do"
"I KNOW, WE CAN CREATE A ML TO IDENTIFY "THEM" -> them is anyone you dont like

I've only read 1 paper, ever, and it was the SAGA paper. It was...tough, but i think i got most of it
fuck implementing that though lol
two separate python installations do not share packages AFAIK. So you can just do python3.8 -m pip install tensorflow==1.11 and then python3.8 -c "import tensorflow" should work
pip3.8 could also possibly work instead of python3.8 -m pip, depends on if you have that linked
and then to test my claim that different python installations don't share packages, python3 -c "import tensorflow" would not work (or if it did, tensorflow would be a different version cause you said 1.11 is incompatible with 3.9)
i come from a background that is heavily focused on research so being able to read papers was part of the skill set
this is the original version: https://web.stanford.edu/class/cs244/papers/HowtoReadPaper.pdf
this is the modern take: https://towardsdatascience.com/how-to-read-scientific-papers-df3afd454179
i also have one "real" publication
so ig that helps

non-cs tho

Thanks! I will try that
I think it's a problem with tensorflow. I can't install version 1.11. It may be rolled out already
I don't think it's avaliable anymore
oh right it's on version 2. forgot about that
@quasi sparrow if you really need 1.11, would you be willing to use py 3.6? or do you need 3.8?
if you can hop down to 3.6, this should work:
pip install https://files.pythonhosted.org/packages/ce/d5/38cd4543401708e64c9ee6afa664b936860f4630dd93a49ab863f9998cd2/tensorflow-1.11.0-cp36-cp36m-manylinux1_x86_64.whl
or you'd have to change to whatever OS you're on
you can find the links here
ig = i guess
my paper is not important anymore
at least not for AI
i still put it on my resume tho

Thanks for all the help! I'm going to try with tensorflow 2.0 and fixing all the error one by one. It should be a good learning exercise, lol.
I'm trying to fine tune a Bert algorithm from hugging faces but damn, they make it seem so easy on the website
your job isn't to read papers, it's to know what's going on in the field
read as many as you need to to keep up, don't go full detail on them if you dont need to
if you know the general methods, their pros and cons, why they are used, and the general trends, you're fine
Has anyone worked with NEAT?
does anyone here know how to code in R?
Done some programming in R before
I'm like super stuck on something basic
I need to make a data frame with 3 elements
i got that part down but it's what each element does that i'm stuck on
here's what i need to do
Sample should contain sample numbers from 1 to 20;
Group should have alternating labels ‘A’ and ‘B’. ( Hint : use the rep() function with one of its arguments being the same number of rows as in column Sample);
Value should contain a sample of numbers between -20 and 20 (after setting the seed at 42)
and here's my code so far
df <- data.frame(sample = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
group = rep(c(A,B)sample),
value = c(13,11,9))```
it showed up strange on here but you get the point
i have no idea what to do for value
@wide oxide do you know how to fix it?
I can give it a try.
thanks let me know here or in dm 🙂
i think youre really close 
this is the bellman optimality equation (for reinforcement learning)
but thats about all i really know about it
can someone explain to me what the summation sign is for? and what max_a is?
@high badge "the optimal value for a state = the maximum of (the reward from the action taken plus a discounted sum of values for all the states gone over with the optimal action)"
it's one of the most basic statements of dynamic programming, although this is specifically written for reinforcement learning and has an added discount factor
Idk wtf T(s, a, s') is doing there, that shouldn't be there
oh
uh
Hi, I'm Hans. I'm new here. Do you guys have any recommendation for beginner who wants to learn data science?
okay, thanks man
Btw, is Kaggle good for learning data science from zero?
especially for a person who don't have IT background
a maths background is more important than anything IT or software related
kaggle is okay for practice, it doesn't teach you anything directly
oh, i see
Hey guys, new to this server (< 1 day). Hello!
I was wondering, is there a good place to learn /recommended learning track for ML for someone who has decent knowledge of multi-variable calc and linear algebra?
Bishop's!
Pattern Recognition and Machine Learning is a great book
Follow it up with Goodfellow's deep learning book
Thanks, I'll look into them
I disagree - In my opinion, I find competing in competitions the biggest motivator to learn new things and try to apply them to some real task. It keeps one motivated to keep learning something new every day and counter the people at the top of the LB!
They were asking about learning from kaggle from scratch. It doesn't teach you anything from scratch, it's a practice platform
Read context before you reply pls
I mean it does have intro courses but they're so basic might as well read wikipedia pages or X in Y minutes
intro courses but they're so basic
That's the point - its meant for beginners absolutely new to data science. and several people I know have benefited a lot from kaggle courses 🤷
It doesn't even teach you linear regression lol
In fact, it doesn't teach you ML at all
It gives you a tutorial thing which literally just calls the Decision Tree method without really explaining what it is
And a more advanced tutorial which swaps out the class name with RandomForests
I dont know why you choose to be contrarian about everything I say
It's objectively a horrible introduction that teaches you nothing about ML
