#data-science-and-ml


spiral trail
#

The ass is missing but it's a starting point?

bronze skiff
#

ai decensorship for japanese culture is a thing

#

basically-- you can "learn the ass"

grave frost
bronze skiff
#

its just gradient descent

#

descending into the

spiral trail
#

You could create a mask with a preference based on a big enough sample size, right? Then it's goodbye Tinder swiping 😊

#

On a serious note though, why is this not a thing on porn websites?

iron bough
#

hmm

#

i think it's because of bbq sauce

grave frost
#

they had a bug bounty program; must be hard for the experts to browse the site for vulnerabilities 😏

serene scaffold
#

@spiral trail @misty flint there are other examples of ML being used in the real world that are more appropriate for our server, which is designed for users as young as 13.

bronze skiff
#

13 year olds learn ds?

#

this is why the rust server is better

serene scaffold
# bronze skiff 13 year olds learn ds?

we allow anyone who is eligible to use Discord to fully participate in our community. While it's not likely that a 13 year old would have the prerequisite knowledge to succeed in data science right away, that doesn't preclude them from participating.

misty flint
misty flint
#

where is foxxy

#

maybe i will learn julia next Clown2

shut valve
#

im using SpeechBrain for a project rn that uses pytorch, first time working with it. not doing anything advanced, gonna end up using the pretrained models anyway so its the same layers and stuff. I just kinda started in tf and figured that the cert would help me get a job in ai

#

apparently its all just straight from the coursera course so only sequential

misty flint
#

my classmates and i were thinking of doing the certificate just to get more familiar with TF not really to use the certificate lol

#

since ik very few certificates mean anything to companies

#

which is fair

#

my friend said their company just hired a guy that had a certificate for a certain tech but didnt actually end up knowing the tech when they brought him onboard

exotic maple
#

that shit happens all the time lol

misty flint
#

he said if it was up to him, that guy def would not be hired

misty flint
exotic maple
#

sadly it means the DS learn certificates i have dont mean shit even though I know -a bit- of it

exotic maple
misty flint
#

and its not like hes a phd and this is academia or anything

exotic maple
#

I have always rejected people with more than 2 pages

#

1 is optimal

dim dirge
#

Hello everyone, I am new to python programming and would like some help please in building a script!

misty flint
#

but my friend said no one consulted him so Oopsies

exotic maple
#

Fire away man

shut valve
#

well its not bad at all (the course and the exam). I guess it depends on your course load, like i had the python skills to do this years ago in school but i was spreading myself thin enough there and didnt have the time to do it. Well I dont plan on cheating my way through it, i like ai and think it would just be nice to show with my projects

misty flint
exotic maple
#

-sad dog face-

exotic maple
#

Eventually i'm just going to run a sentiment analysis of my friends in twitter

misty flint
exotic maple
#

or just find a random dataset on kaggle

misty flint
exotic maple
#

-proceeds to embarass himself-

misty flint
#

one of my projects was doing some light nlp

#

on telegram data

dim dirge
misty flint
#

since all the public channels have an easy way to download their data

#

you just go to the top right corner, and literally press export chat data

#

so if you need ideas, theres that

#

they even have a telegram api

misty flint
#

not that i really use telegram. its not that popular in the states

#

but it still seems like something nice to put on the resume

exotic maple
# misty flint

classification challenge - Is Rex trolling or not? :v

misty flint
#

seems too good to be true, right?

#

thats what i thought too at first

dim dirge
misty flint
#

the problem comes when you try to download the data, its usually too big so you have to select what you want

shut valve
#

I honestly don't know the value of certs to projects like obv a good project is worth the most but like i make shitty little things that are fun for me i dont do medical or business stuff i do stupid shit

misty flint
#

you can still put it

#

did you make a docker container for it? its still using docker DoggoKek

shut valve
#

I do everything in python and yeah atm im just trying to get really good at a few libs: tf, (numpy, pandas, the basic data exploratory stuff), I try to put them on plotly's Dash

misty flint
#

plotly's Dash ID_blurryeyes

#

im also trying to get better at that

#

data viz leggo

shut valve
#

word its sick

misty flint
#

ye fam

exotic maple
#

what is plotly's dash?

#

should i save another bookmark of another tool to learn? -pukes-

shut valve
#

like Im not a front end dev and i just do the basic scatter bar colorful graphs stuff its just a way to have stuff on the internet

misty flint
#

im jk idk what you use for data viz

exotic maple
#

I hate you rex

#

die

misty flint
#

but think of it like a tableau alt

#

but for python

#

and more python-integrated

exotic maple
#

i was thinking of using this

shut valve
#

yeah like tableau with flask

exotic maple
#

since its free and shit xd

#

and looks pretty

misty flint
#

oh that one looks interesting

exotic maple
#

oh plotly is coded in python as well?

misty flint
#

i will also save it

exotic maple
#

another library?

misty flint
#

yeah

exotic maple
#

-dies buried by libraries-

misty flint
#

better than R packages

shut valve
#

yeah plotly is for graphs and data viz and dash is for front end deployment

misty flint
#

those are endless

exotic maple
#

development? NEVER

#

I tried using django once

#

almost killed my friend

shut valve
#

lol yeah thats why i like dash real easy one file type shit

misty flint
#

we used flask for our last project

#

and by we, i mean my friend

#

and by used, i mean we had one page that was a pain to figure out

#

💀

exotic maple
#

this is it?

shut valve
#

the struggling means you're closer* to learning

misty flint
#

what about when youre bashing your head against the keyboard

#

what does that mean

shut valve
misty flint
exotic maple
#

sigh

#

-looks for youtube guides for plotly-

misty flint
#

theres another tool i was going to mention but i think warden will strangle me

exotic maple
#

oh yeah baby look at those animations. py_strong

misty flint
#

yeah

exotic maple
#

I'll get rid of you ala Gun Gale Online :v

misty flint
#

skull mask villain monkaCHRIST

#

but yeah

#

were going to try to use this for our telegram project

#

the code is v minimal

#

even less than flask

shut valve
#

yes streamlit is also very cool again its just i started in dash gonna build on what i know

exotic maple
#

bro is this what my CS friend called "tool hell"

misty flint
exotic maple
#

like 500 good tools

#

to do the same shit

misty flint
#

the engineer's dilemma

exotic maple
#

abandon ML

#

return to carpentry

misty flint
#

too caught up in the tools you never actually build anything

misty flint
exotic maple
#

When you have a hammer, everything looks like a nail

#

-throws a naive bayes classifier in your face-

misty flint
#

we need a ML emote

#

somehow

exotic maple
#

my GPU burning?

#

me cooking meat over my PC?

misty flint
#

💀

shut valve
#

new tool for that google colab

exotic maple
#

or my accuracy being lower than 0.7?

#

ay

misty flint
#

0.53

#

💀

exotic maple
#

might as well guess

#

lmao

misty flint
#

right?

#

it had to do with poor data

#

but good thing it was just a throwaway assignment

shut valve
#

what was your data?

misty flint
#

like super small sample so the machine couldnt learn properly

#

also

#

tensorflow emote when

#

i need one

iron basalt
dim dirge
# iron basalt Just ask, if it's the wrong channel we can tell you which channel to ask the que...

rps=np.loadtxt('Line_001RPS.txt',dtype = str)
sps=np.loadtxt('Line_001SPS.txt',dtype = str)
print('File 1 shape', sps.shape) # To inspect the structure of the file in detail
print('File 2 shape', rps.shape)
rps.shape
sps.shape
print(len(sps))
Nlines_sps=sps.shape[0]
Nlines_rps=rps.shape[0]
fichier=open('mise_a_jour.txt','w')
for i in range(Nlines_sps): # for loop handling the first file (which works correctly)
Cf1 = sps[i,0] + ' ' + sps[i,1] + ' ' + sps[i,2] + ' ' + sps[i,3] # line of the first file
fichier.write(( (Cf1 +'\n')*282 ))
fichier.close()

iron basalt
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

dim dirge
#
import numpy as np

rps=np.loadtxt('Line_001RPS.txt',dtype = str)  # np.str was removed from NumPy; plain str works
sps=np.loadtxt('Line_001SPS.txt',dtype = str)
print('File 1 shape', sps.shape) # To inspect the structure of the file in detail
print('File 2 shape', rps.shape)
rps.shape
sps.shape
print(len(sps))
Nlines_sps=sps.shape[0]
Nlines_rps=rps.shape[0]
fichier=open('mise_a_jour.txt','w')
for i in range(Nlines_sps): # for loop handling the first file (which works correctly)
    Cf1 = sps[i,0] + ' ' + sps[i,1] + ' ' + sps[i,2] + ' ' + sps[i,3] # line of the first file
    fichier.write(( (Cf1 +'\n')*282 ))
fichier.close()
iron basalt
#

ok, so what is the context and what is the goal?

dim dirge
#

I have two data files: SPS (251 rows, 4 columns) and RPS (781 rows, 4 columns).
First I duplicated each line of SPS 282 times, so I got a new file (mise_a_jour) of 70782 lines and 4 columns. Then I want to scan RPS: take the first 282 lines and add them to the mise_a_jour file, then continue by starting again not at line d but at d+2 (I shift by two lines each time, always taking 282 lines).
I don't know if I have been clear enough, but you can ask me other questions!

iron basalt
#

So your problem is that you want to loop over RPS and take the first 282 lines and add them, then move the cursor down 2 lines and add the next 282 lines and so on. What are these 282 lines being added to?

#

"add them in the mise_a_jour file" - this needs more explanation.

#

by "add" do you mean append to the end of the file (such that the number of lines of the file increases by 282 each time)?

#

Also what is the context? So that we do not waste time on an XY problem. @dim dirge

dim dirge
dim dirge
iron basalt
#

So you are aligning / matching lines and adding them up, resulting in the same number of lines (element-wise addition)?

dim dirge
exotic maple
#

does pandas have an append I/O method when producing a CSV output?

In standard python i think we can

with open("file", "a") as f:

to add something at the end of a file instead of overwriting.
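
pandas can append as well: `DataFrame.to_csv` takes a `mode` argument, and `mode="a"` adds rows to the end of the file instead of overwriting it. A minimal sketch, assuming a hypothetical file name `results.csv` and made-up columns; the helper `append_csv` is illustrative and writes the header only when the file does not exist yet.

```python
import os
import tempfile

import pandas as pd


def append_csv(df, path):
    """Append df to path, writing the header only if the file is new."""
    write_header = not os.path.exists(path)
    df.to_csv(path, mode="a", header=write_header, index=False)


# Hypothetical example data and file name, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "results.csv")
append_csv(pd.DataFrame({"x": [1, 2], "y": [3, 4]}), path)
append_csv(pd.DataFrame({"x": [5], "y": [6]}), path)

combined = pd.read_csv(path)
print(combined)
```

Without the header check, every append would write the column names again in the middle of the file.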

iron basalt
#

What do you want to happen when the end of SPS is reached before the end of RPS is reached?

tidal bough
dim dirge
dim dirge
#

if I take 282 lines of RPS and shift the start by 2 each time, I will also have 70782 lines for RPS
NB: the numbering of RPS goes from 561 to 1342.
it is an acquisition in which each signal sent by one SPS point (line) is recorded by 282 RPS points (lines). to send my second signal I shift two points in RPS, that is, the first two RPS points that recorded the first SPS point no longer record the second SPS point

exotic maple
iron basalt
#

So one thing you want for sure is to duplicate each line of sps 282 times right? let's start with that.

#
sps_duplicated = np.repeat(sps, 282, axis=0)
iron basalt
#

next you want every other line from rps right?

#
every_other_rps = rps[::2]
#

oh wait nvm, you want to have a sliding window over rps

dim dirge
iron basalt
#
# Loop through rps, but with a step of 2 (every other line)
for i in range(0, rps.shape[0], 2):
    ...  # Do stuff here
dim dirge
iron basalt
#

You are looping through every other line of rps while also looping over every line of sps (not every other)? @dim dirge
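
The sliding window described above can be sketched without an explicit loop. This assumes the goal is to place each repeated SPS line next to the matching RPS line, which the thread never fully confirms; the function name `pair_sps_with_rps` and the tiny arrays are made up. Note that 251 SPS lines with a window of 282 and a step of 2 need (251 - 1) * 2 + 282 = 782 RPS lines, which matches the 561 to 1342 numbering but is one more than the 781 rows reported, so the sketch checks the bound explicitly.

```python
import numpy as np


def pair_sps_with_rps(sps, rps, window=282, step=2):
    """For SPS row i, pair it with RPS rows [i*step, i*step + window)."""
    n_sps = sps.shape[0]
    last_needed = (n_sps - 1) * step + window
    if last_needed > rps.shape[0]:
        raise ValueError(
            f"rps needs at least {last_needed} rows, got {rps.shape[0]}"
        )
    # Row indices of every window, shape (n_sps, window).
    idx = np.arange(n_sps)[:, None] * step + np.arange(window)[None, :]
    rps_windows = rps[idx].reshape(-1, rps.shape[1])
    sps_repeated = np.repeat(sps, window, axis=0)
    # Side by side: SPS columns first, then the matching RPS columns.
    return np.hstack([sps_repeated, rps_windows])


# Tiny made-up arrays standing in for the real files.
sps = np.array([["s0", "a"], ["s1", "b"], ["s2", "c"]])
rps = np.array([[f"r{k}", str(k)] for k in range(8)])
paired = pair_sps_with_rps(sps, rps, window=4, step=2)
print(paired.shape)
```

The result could then be written out with `np.savetxt("mise_a_jour.txt", paired, fmt="%s")`.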

dark willow
#

What's a nice way to embed PyCharm visualizations in a Medium article?

sharp gate
#

i think this is the right channel...

from PIL import Image
im = Image.open("maze.jpg")
im.show()

output = open('maze.txt', 'a+')
for pixel in iter(im.getdata()):
    output.write(str(pixel))

this gives me a lot of tuples in (R,G,B) configuration, however i would like them to be in a [1, 0, 0, 1, 1, 0] sorta configuration
i am fairly new to doing this kind of thing in python.

#

so any help would be great
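
One way to get from (R, G, B) tuples to a 0/1 list is to threshold the brightness of each pixel. A minimal sketch, assuming dark pixels are walls (1) and light pixels are paths (0); the threshold of 128 and the helper names are illustrative choices, and a threshold is used rather than an exact colour match because JPEG compression rarely leaves pure black and white.

```python
def pixels_to_bits(pixels, threshold=128):
    """Map (R, G, B) tuples to 1 for dark pixels and 0 for light ones."""
    bits = []
    for r, g, b in pixels:
        brightness = (r + g + b) / 3
        bits.append(1 if brightness < threshold else 0)
    return bits


def to_rows(bits, width):
    """Split the flat list into rows of `width` pixels."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


# Made-up 3x2 image: black = wall (1), white = path (0).
pixels = [(0, 0, 0), (255, 255, 255), (0, 0, 0),
          (250, 250, 250), (10, 10, 10), (255, 255, 255)]
grid = to_rows(pixels_to_bits(pixels), width=3)
print(grid)  # [[1, 0, 1], [0, 1, 0]]
```

With the image from the message above, the input would be `im.convert("RGB").getdata()` and the width `im.size[0]`; converting to RGB first keeps the three-value unpacking valid for greyscale or RGBA files.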

dim dirge
arctic wedgeBOT
#

Hey @dim dirge!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com


dim dirge
dim dirge
#

this is the first line of the file I need to have

bronze skiff
#

considering the material that cert is based off of is really basic, it won't really be an indicator of anything

#

the only cert that matters for ml jobs is your degree

#

lets be real

grave frost
#

yeah, totally agree

#

maybe you could get a job without degree, but you would have to be a 200iq genius for that

misty flint
#

your degree is a pretty important cert

tranquil apex
#

im stuck

#

i want to calculate the growth ratio between these dates for each file

file_name           date   dist        
20210314_080621.txt 03-16  0.820328
                    03-18  0.838098
20210314_080633.txt 03-16  0.755168
                    03-18  0.784473
20210314_080644.txt 03-16  0.561407
#

i dont think groupby.agg would work for this
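
A sketch of one way to do this, assuming "growth ratio" means the later `dist` divided by the earlier one for each file. The values are copied from the table above and flattened into ordinary columns; pivoting puts each date in its own column so the ratio becomes a single division.

```python
import pandas as pd

# Values copied from the table above, flattened into ordinary columns.
df = pd.DataFrame({
    "file_name": ["20210314_080621.txt", "20210314_080621.txt",
                  "20210314_080633.txt", "20210314_080633.txt",
                  "20210314_080644.txt"],
    "date": ["03-16", "03-18", "03-16", "03-18", "03-16"],
    "dist": [0.820328, 0.838098, 0.755168, 0.784473, 0.561407],
})

# One row per file, one column per date; missing dates become NaN.
wide = df.pivot(index="file_name", columns="date", values="dist")
growth = wide["03-18"] / wide["03-16"]
print(growth)
```

Files with only one date come out as NaN rather than a misleading ratio. If `file_name` and `date` are index levels rather than columns, `df["dist"].unstack("date")` should give the same wide table.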

misty flint
#

oh wait i misinterpreted your question

stiff barn
#

If anyone is interested, I just finished up a project experimenting with style GANs. I trained a model that generates 1024x1024 bird images. If you're curious, I post 2 per day on Twitter as @bird_not_exist. Happy to discuss methodology with anyone who is interested.

misty flint
#

just took a peek, looks awesome!

#

would love to ask more but idk anything about GANs but i might come to you later if i do something with it.

#

like, do you think its possible to do the same with fishies?

#

🐡

stiff barn
#

Thanks @misty flint! Happy to answer any questions you have. Feel free to send me a DM whenever as well.

#

You could for sure do it for fish haha.

#

Would just need to get enough images. I used something like 35K bird images to train and more would have been better.

#

Also need a pretty beefy GPU haha

misty flint
#

this is good info to know. thanks i might take you up on that offer when i get a bit further along in my ML studies

stiff barn
#

Good stuff! GANs seem to have a lot of really cool use cases for sure.

misty flint
exotic maple
misty flint
#

ah yes, learn all the tools

exotic maple
#

a goddamn

#

man

#

of culture

misty flint
stiff barn
#

Seems pretty sweet. Looks like it goes a lot further than plotly

#

Been a while since I bothered with plotly though. Could be better by now. Really just only use matplotlib and seaborn

misty flint
#

usually good for most cases

#

until you need to show non-technical people

stiff barn
#

I live a simple life lol.

misty flint
#

jealous

autumn veldt
#

i got 97 test data, and i'm willing to do some manual calculation, but when i print my test data, i only get 5 from the top and 5 from the bottom. im using print(X_test). what syntax do i need to write to show all my 97 data samples?
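
Five rows from the top and five from the bottom is how pandas abbreviates long frames, so this sketch assumes `X_test` is a pandas DataFrame; the data here is made up. Either render the whole frame to a string, or lift the row limit temporarily.

```python
import pandas as pd

# Stand-in for X_test: 97 rows of made-up values.
X_test = pd.DataFrame({"feature": range(97)})

# Option 1: render the whole frame as one string.
full_text = X_test.to_string()
print(full_text)

# Option 2: lift the row limit only inside this block.
with pd.option_context("display.max_rows", None):
    print(X_test)
```

`pd.set_option("display.max_rows", None)` makes the same change for the rest of the session.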

agile wing
#

exploratory data analytics

lapis sequoia
#

Hi - I wanted to ask what is the name of the thing I'm trying to do:

Say I have 200 sentences that follow this pattern:

Chevrolet can go 200 km/h
230 km/h is the top speed of this Ford

I wanted to train a model that would identify what's the CarBrand and TopSpeed in that sentence. What is the scientific name for what I'm trying to do? It's not sentiments

EDIT: it's named entity recognition
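
To make the target output concrete, here is a rule-based baseline using regular expressions. It is not a trained named entity recognition model, only an illustration of the labelled spans such a model would be asked to produce; the brand list and the function name are hypothetical.

```python
import re

# Hypothetical brand list; a trained NER model would learn these from labels.
BRANDS = ["Chevrolet", "Ford"]
BRAND_RE = re.compile(r"\b(" + "|".join(map(re.escape, BRANDS)) + r")\b")
SPEED_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*km/h\b")


def extract_entities(sentence):
    """Return the CarBrand and TopSpeed spans found in one sentence."""
    brand = BRAND_RE.search(sentence)
    speed = SPEED_RE.search(sentence)
    return {
        "CarBrand": brand.group(1) if brand else None,
        "TopSpeed": float(speed.group(1)) if speed else None,
    }


for s in ["Chevrolet can go 200 km/h",
          "230 km/h is the top speed of this Ford"]:
    print(extract_entities(s))
```

A baseline like this is also a quick way to pre-label the 200 sentences before training a real model on them.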

buoyant yacht
#

please suggest the world best github repo / any blog for sentiment analysis with good accuracy?

#

anyone knows?

grave frost
buoyant yacht
#

can you suggest some github repo for sentiment analysis ?

grave frost
#

you can google it lol, there are plenty out there

grave frost
#

if in pandas, use .values to convert a column to a numpy array

autumn veldt
grave frost
#

if it still can't be done, you can use magic functions to write the variable to the file (I think its %store variable > file_name.txt) and view the file

remote pumice
#

i want to display date and time generated in python file (.py) in my django project its a opencv file
i have captured the drowsiness alert data so

lapis sequoia
#

I'm totally noob in Python, can someone help me? I have some homework from space data science and this is my notebook:

hollow sentinel
#

We can’t supply code for homework

#

but we can give general ideas

rotund dock
#

Hi anyone here familiar with scipy.stats?

hollow sentinel
rotund dock
#

I want to calculate the value for a given probability using the st.gumbel_r.ppf(). I'm comparing it with the analytical solution and it is giving me completely different results, anyone knows why?
I obtained the values for the scale and location using the method of moments

P_class = [0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.999999]

u = 8.590342451210152
alpha = 0.1841827435642898

h_class1 = st.gumbel_r.ppf(P_class, scale = u, loc = alpha)
h_class = u-np.log(-np.log(P_class))/alpha

Results
h_class1 = [ -5.53466431,  -1.7516637 ,   1.6076281 ,   5.17091797, 9.54112426,  16.24661736, 118.86414528]
h_class = [4.97583548,  7.36682128,  9.49000861, 11.74212971, 14.50424958, 18.74235061, 83.60013865]

I want to get the same h_class results when using the scipy function
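
The two results differ because the arguments are swapped and `alpha` is an inverse scale. The Gumbel quantile function is `loc - scale * ln(-ln p)`, so the analytical formula `u - ln(-ln P) / alpha` corresponds to `loc = u` and `scale = 1 / alpha`. The sketch below re-implements that quantile formula in plain Python (the helper `gumbel_r_ppf` is a stand-in, not the SciPy function) to show that both sets of numbers from the message come out of the same formula.

```python
import math

P_class = [0.14285714, 0.28571429, 0.42857143, 0.57142857,
           0.71428571, 0.85714286, 0.999999]
u = 8.590342451210152
alpha = 0.1841827435642898


def gumbel_r_ppf(p, loc, scale):
    """Quantile of the right-skewed Gumbel: loc - scale * ln(-ln p)."""
    return loc - scale * math.log(-math.log(p))


# Arguments as in the question: reproduces the unexpected h_class1 values.
swapped = [gumbel_r_ppf(p, loc=alpha, scale=u) for p in P_class]
# Arguments matching the analytical formula u - ln(-ln P) / alpha.
matched = [gumbel_r_ppf(p, loc=u, scale=1 / alpha) for p in P_class]

print([round(v, 4) for v in swapped])
print([round(v, 4) for v in matched])
```

With SciPy itself, the call `st.gumbel_r.ppf(P_class, loc=u, scale=1/alpha)` should then agree with `h_class`.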

serene scaffold
hollow sentinel
#

sorry I saw other people do it so I assumed it was ok

serene scaffold
hollow sentinel
#

Got it. Won’t happen again

rotund dock
#

I was just wondering if this was the right place to ask a question about that topic

hollow sentinel
#

Which topic

rotund dock
#

scipy.stats

#

Statistics

hollow sentinel
#

yeah it is

serene scaffold
grave frost
#

if it was me, This link should be in the first message every new user gets

serene scaffold
#

So for example, don't ask if anyone knows about a general library or a type of problem, and hope that that person will know how to answer your specific question. It's easier to start helping if we know exactly what you'd like help with.

serene scaffold
#

I believe we cover some of the same material in our question asking guide.

grave frost
#

typing out everything everytime someone new asks would flood the server. its not efficient

serene scaffold
grave frost
#

its not about the resource allocation, just simply that :
putting a link to a message is more efficient than typing out a huge wall of lines everytime

serene scaffold
#

That's why I have a file of copypastas that I wrote

grave frost
#

a link does not feel harsh or anything. its just a website. how can it convey some negative feelings?

serene scaffold
#

I can appreciate that getting a link like that might not feel harsh to you, but it does for a lot of people, and the reason it does makes complete sense.

#

How about you DM @sonic vapor if you'd like to discuss this further.

bronze skiff
#

otherwise you only lead to inefficiencies of a question-answering system and help vampirism

serene scaffold
#

Pasting the link without another word is the problem (not necessarily the content of the website), for the reasons I've explained previously.

You're not required to give long-winded responses for each ask-to-ask instance. You can simply say "Go ahead and ask" if you'd like.

Let us know in #community-meta if you'd like to discuss this further.

kindred radish
#

So I'm using sklearn's MLPClassifier to predict whether a machine breaks based on input data

#

I've tried a tonne of different parameter settings and 8/10 times the machine's score is terrible

#

It's not really predicting it at all

#

Is this indicative that the input data might not actually correlate to whether a break happens or not?

serene scaffold
#

@kindred radish what features are you using?

kindred radish
#

I've got like 6 features and they're features of the thing that's going into the machine

#

Such as the thickness of the material and its composition

#

I don't have any data on the machine itself apart from how many times it breaks a day

grave frost
#

@kindred radish wut?

#

can you clarify what exactly you want to accomplish?

kindred radish
#

Ok so:
-X_train are like 6 features about the film that goes into the machine.
-y_train is an array of 1s and 0s that represent whether or not the machine breaks.
-This is fed into the MLPClassifier that SkLearn provides.
-The aim was to create a model that could correctly predict if the machine breaks or not.
-Despite playing around with the classifier's parameters, the model is unsuccessfully predicting whether the machine breaks: the score and the precision are abysmal.

  • my question is: Does this mean that the input data is uncorrelated to whether or not the machine breaks? Can I say that this data doesn't have anything to do with the machine breaking?
kindred radish
#

Lmk if anything needs clarification, I think that's as clear as I could make it! I standardised the input data as well ^^

grave frost
#

yeah, either your model does not have correctly inputted data, or there is no correlation. can you identify whether or not the machine breaks by looking at a sample?

ionic sun
#

ive been stuck for hours on trying to curve fit some data

#

the curve is relatively exponential

#

but when i plot the curve fit it gives me a straight line

#
def curve1(x, a, b, c):
    return a * (b ** x) + c

def plot1():
    fig, ax = plt.subplots()
    graph = ax.scatter(year1, production, c=production, cmap="bone_r", s=8)
    popt, _ = curve_fit(curve1, year1, production, p0=[-1000, 1e-6, 1])
    a, b, c = popt[0], popt[1], popt[2]
    fit = []
    for i in year1:
        fit.append(curve1(i, a, b, c))

    plt.plot(year1, fit)
    plt.tight_layout()
    plt.savefig("graph1.png")
    plt.show()
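
A likely cause, assuming `year1` holds calendar years such as 1990 to 2020: with the starting guess `b = 1e-6`, the term `b ** x` underflows to zero for every point, so the model collapses to the constant `c` and the optimizer has no gradient to follow. Shifting the x axis to start at zero and giving a data-scaled starting guess usually avoids this. The sketch below uses synthetic data, since the real arrays are not shown, and an `exp` form of the same curve.

```python
import numpy as np
from scipy.optimize import curve_fit


def exp_curve(x, a, b, c):
    """a * exp(b * x) + c, with x measured from the first year."""
    return a * np.exp(b * x) + c


# Synthetic stand-in data; the real year1/production arrays are not shown.
years = np.arange(1990, 2021, dtype=float)
rng = np.random.default_rng(0)
production = 50.0 * np.exp(0.12 * (years - years[0])) + 200.0
production += rng.normal(scale=5.0, size=years.size)

x = years - years[0]  # shift so the exponent stays small
p0 = [production[0], 0.1, 0.0]  # rough, data-scaled starting guess
popt, _ = curve_fit(exp_curve, x, production, p0=p0, maxfev=10000)
fit = exp_curve(x, *popt)
print(popt)
```

Plot `fit` against the original years, not the shifted `x`, to keep the axis labels readable.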
kindred radish
polar charm
#

hello, when I run my network with "model.fit(x=train_samples, y=train_labels, batch_size=10, epochs=30, verbose=2)", where train_samples and train_labels are:
train_labels = ['^GSPC.Adj Close.csv']
train_samples = ['^GSPC.Close.csv']

I get Function call stack:
train_function

misty flint
grave frost
kindred radish
kindred radish
#

But it's fine that it doesn't work, as long as that means I can say that the variables do not affect whether the machine breaks or not

grave frost
kindred radish
#

How do you mean?

#

All of the features have been standardised and I've removed outliers

grave frost
#
Machine Learning Mastery

Feature engineering is an informal topic, but one that is absolutely known and agreed to be key to success in applied machine learning. In creating this guide I went wide and deep and synthesized all of the material I could. You will discover what feature engineering is, what problem it solves, why it matters, how […]

kindred radish
#

Is this more to do with unsupervised learning?

#

Thank you for your answers so far btw!

grave frost
kindred radish
#

I'm unsure how much engineering I can really do with my features

#

They're measured values of things like Viscosity or amount of chemical

grave frost
#

well, its mostly common-sense/logic. What do you think is the best way to represent your data so that the model can understand

kindred radish
#

Well I made them all standardised so that they were comparable to one another

#

As some features are of the order of magnitude of 100 whilst some are like 0.1

#

So I guess that was a form of feature engineering?

grave frost
#

uh-huh

#

and what is your task as well as your model?

kindred radish
#

My task was exploratory, I wanted to see if this data was responsible for why the machines break

grave frost
#

wait, if you are doing EDA then why are you using models?

kindred radish
#

If I could get my model to reliably predict a break, that would mean that the input data affects the machine

#

EDA?

misty flint
#

exploratory data analysis

grave frost
#

a small list of what you have to do (not to solve the task, but to explore the data)

#

exploration: like what feature appears the most. plots etc.

misty flint
#

whether the machine breaks? what does that even mean

kindred radish
misty flint
#

you have some kind of machine that youre feeding data to to determine whether it physically breaks?

#

why arent you measuring performance of the machine?

#

or does the machine not have any metrics?

#

i know youre not referring to machine in machine learning

kindred radish
#

The machine's only metric is that it breaks or that it doesn't break

misty flint
#

its tough to do any feature engineering then

#

if all you know is that

kindred radish
#

Yeah exactly, there isn't context for me to problem solve

grave frost
#

well, you don't have to train a model anyways, so no feature engineering required

misty flint
#

can you not look for more context? i feel like it would go a long way in your data analysis

kindred radish
#

This is all the data that I have to work with unfortunately

misty flint
kindred radish
#

Yeah I know

misty flint
#

Despite playing around with the classifier's parameters, the model is unsuccessfully predicting whether the machine breaks: the score and the precision are abysmal.
how much data did you have to work with

#

you usually need A LOT of instances

kindred radish
#

After cleaning it, it's around 300

misty flint
kindred radish
#

Yeah not a lot

#

Kind of frustrating, this is my final project as well

misty flint
#

yikes

#

how many classifiers did you try

#

try more

#

best answer might be a combo

kindred radish
#

I've only used the sklearn MLPClassifier

misty flint
#

why did you use that one

kindred radish
#

Because i thought it would classify between "break" and "not break"

misty flint
#

i think you should not use a neural network

#

you dont have enough data

#

use more simple classifiers

#

its probably overfitting

kindred radish
#

Oh right, what kinds of classifiers would you recommend from SkLearn?

misty flint
#

try a bunch

#

logistic regression, naive-bayes, SVM, random forest, decision tree, etc.
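
Trying "a bunch" can be done in one loop with cross-validation, which matters with only around 300 rows because a single train/test split is noisy. A minimal sketch on synthetic data (the real features are not available here); the model settings are illustrative defaults, and the dummy baseline shows what a model that ignores the features would score.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: ~300 rows, 6 features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

models = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy; scaling is refit inside each fold.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:26s} {scores[name]:.3f}")
```

If none of the models beats the baseline on the real data, that is evidence (not proof) that these features carry little signal about breaks.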

kindred radish
#

Ok thank you I'll give it a go

#

I know I keep harping on about it

misty flint
#

see what your accuracy looks like afterwards

kindred radish
#

But If the precision is bad for those as well, maybe that suggests the input data has nothing to do with the output?

misty flint
#

yes

kindred radish
#

Thank god

misty flint
#

but youll have to try to see

kindred radish
#

Okok thank you

misty flint
#

technically thats a conclusion too

kindred radish
#

Yes exactly

misty flint
#

and then you can show that by showing all the different classifiers you tried

kindred radish
#

The worst thing I can say to my supervisor is "I don't know why it doesn't work"

misty flint
#

yep

kindred radish
#

So if I try a bunch of things and it doesn't work then I can say "this could be because the data isn't right for the job"

#

Which is much much nicer

misty flint
#

yep yep

kindred radish
#

Thank you !!!

misty flint
#

youve proven too

#

np

kindred radish
#

Feel so relieved

misty flint
kindred radish
#

Would you mind if I @you here if I run into trouble?

#

I won't do it for petty shit, just advice

misty flint
#

yeah sure

kindred radish
#

And it'd only be for tomorrow + Monday

misty flint
#

whenever i have time, ill end up responding

#

haha np

#

we're all here learning together

grave frost
kindred radish
#

Lemme show you what I mean, one sec

misty flint
#

you can look at accuracy scores using sklearn btw

#

theres a function/method

kindred radish
#

So like, right now it's incorrectly predicting "Break" when it's not a break

#

Ideally the top left and bottom right elements would be high numbers

misty flint
#

did you use the confusion matrix function

#

try that

kindred radish
#

aye that's what this is

misty flint
#

hmm

#

why is it in percentages?

#

or is that how the data is?

kindred radish
#

It's set to be normalised

misty flint
#

ah

grave frost
#

whats your F-score?

kindred radish
#

uhhh how would i find that? Do you just literally mean "model.score"? For this run it was 57%

grave frost
#

well, is your data imbalanced?

misty flint
#

i think classification_report() tells you a bunch of scores
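
For reference, a small sketch of the scikit-learn metrics mentioned here, on made-up labels. `model.score` on a classifier is plain accuracy; the F-score comes from `f1_score` or from the report.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Made-up labels: 1 = break, 0 = no break.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predictions.
cm = confusion_matrix(y_true, y_pred)
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
print(cm_norm)

# Precision, recall and F1 per class, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=["no break", "break"]))
f1 = f1_score(y_true, y_pred)
print(f1)
```

With `normalize="true"` each row sums to 1, which is why the matrix shows fractions instead of counts.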

grave frost
#

(and please do not jump in a task without learning the appropriate basics,
you would struggle more and you wouldn't understand anything)

kindred radish
#

imbalanced how?
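
Imbalanced means one class is much more common than the other, for example far more "no break" days than "break" days. A quick check is to count the labels and compare the model's accuracy with the accuracy of always guessing the most common class. The 210/90 split below is made up for illustration.

```python
from collections import Counter

# Made-up labels: 1 = break, 0 = no break.
y = [0] * 210 + [1] * 90

counts = Counter(y)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"class {label}: {n} samples ({n / total:.0%})")

# Accuracy of always predicting the most common class.
baseline_accuracy = max(counts.values()) / total
print(f"majority-class baseline: {baseline_accuracy:.0%}")
```

With a split like this one, a score of 57% would be worse than never predicting a break at all.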

#

This is my final year project and I've been teaching myself everything since my supervisor doesn't understand how any of this works. This is the last stage of it all 😕

misty flint
#

since my supervisor doesn't understand how any of this works.
memecringeharold

#

thats a big rip

#

🕯️

grave frost
#

do you want to do data analysis or EDA?

kindred radish
#

I specifically have to do machine learning. I've got a final meeting with a CEO on Tuesday where I have to explain the results of this :))))))))))))

#

So i wont really have time to implement anything major

grave frost
#

how does making a model help in this?

#

to find correlation between features, there are different techniques

kindred radish
#

I originally just wanted to predict if the machine would break as I thought the data was correlated

#

As i wanted to be able to say to the CEO "hey this is why ML is good. Look, i can predict when your machines break"

#

And my supervisor also thought that would be good

grave frost
#

and what exactly is your task- meaning what exact variables have you been given?

kindred radish
#

The Ripeness of the film, the amount of a chemical in it, the viscosity of it, the thickness and some other stuff as well

grave frost
#

and WTH is the machine?

misty flint
#

hahahaha

#

thats what I was asking

#

💀

#

people always try to separate the data from the context and it never ends well

kindred radish
#

The machine essentially rolls the "liquid" into films

grave frost
#

and how often does it break per hour?

kindred radish
#

I havent been given that data, ive only been given how many times a day it breaks

grave frost
#

🤦 if its a number, then why are you trying to predict whether it breaks or not?

kindred radish
#

I tried to break the problem down into a simpler one

#

And i thought that predicting whether the machine would break at all would be simpler than trying to predict how many times it breaks

grave frost
#

please tell the whole problem next time. your task is not classification, its regression

kindred radish
#

the number of breaks is very small: around 0-5

grave frost
#

and does it break atleast once a day?

kindred radish
#

No, some days it doesnt break

grave frost
#

then that is a regression task - which is much easier than making it a classification task with 5 labels

kindred radish
#

When i tried to use MLPRegressor it didn't work too well

kindred radish
#

I got a negative score lol

grave frost
#

accuracy is always positive

misty flint
#

idk how you can get a negative accuracy

kindred radish
#

yeah i know thats why i was so confused

misty flint
#

think you have the wrong numbers

kindred radish
#

And got a negative number

#

yes i understand the square of something gives a positive

misty flint
#

you found R^2

#

thats not accuracy

deft ruin
#

Yeah that can be negative if the model performs worse than the mean
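
#

Rough sketch of why that happens, with made-up break counts rather than the real data: R^2 is not literally the square of anything, it is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The name `r2_score_manual` is just for illustration; sklearn's `r2_score` should compute the same quantity.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily break counts, not the real data.
breaks = [0, 1, 2, 1, 0, 4]
mean_prediction = [sum(breaks) / len(breaks)] * len(breaks)
bad_prediction = [4, 0, 0, 3, 2, 0]

r2_mean = r2_score_manual(breaks, mean_prediction)  # always predicting the mean
r2_bad = r2_score_manual(breaks, bad_prediction)    # worse than the mean
print(r2_mean, r2_bad)
```

Predicting the mean every day gives 0, and anything with more squared error than that goes below 0, with no lower limit.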

misty flint
#

anyway

#

heres a stats meme

kindred radish
#

Id read that a negative score was bad 😅

grave frost
#

just leave that and focus on the accuracy 🤷

misty flint
#

💀

kindred radish
#

ok im sorry that i might seem really dumb or whatever, but ive been given literally no direction and im a Physicist, i code in python all day and pray to god that my code doesnt come back to haunt me because it looks monstrous

misty flint
#

friendly reminder: sometimes, outliers can mean something, so just think about your problem as a whole before cutting them out, etc.

deft ruin
#

No worries man

grave frost
#

negative R2 is pretty bad - did you try increasing the number of hidden layers in the MLP?

kindred radish
#

thank you guys

#

Yeah i played around with the parameters for absolutely ages

deft ruin
#

Sometimes that means that there is something going on with your model specification

#

It might be worth taking a closer look at your data

kindred radish
#

Tried a bunch of things, brought on my CS housemate to look at the parameters too

misty flint
#

ruler just wants to remind people to brush up on stats basics before diving into ML

#

which is fair

#

bc it happens a lot

grave frost
#

something like this: hidden_layer_sizes=(10,30,30,50)

#

@kindred radish number of layers, not the amount of neurons in each layer

kindred radish
#

I did (100,50,25) ive tried just doing 100 or stuff

deft ruin
#

Might also be worth plotting some of the variables against each other and coloring by whether the machine broke that day to see if there is any pattern

#

Or doing the same with histograms

misty flint
#

the problem with data analysis is it can take time

kindred radish
#

I did see improvement

#

But not much

grave frost
#

make it even longer; use solver lbfgs

kindred radish
#

Yeah ive used that solver because my data set is small

grave frost
#

and more hidden layers?

kindred radish
#

Yeah i tried a bunch of things

grave frost
#

make max_iter=1000 or so?

kindred radish
#
    return MLPClassifier(hidden_layer_sizes = (100,50,25),
                          random_state=0,
                          activation = 'relu', 
                          learning_rate_init=0.0001,
                          solver='adam',
                          max_iter=5000,
                          verbose=True,
                          early_stopping=False,
                          n_iter_no_change=10,
                          alpha=0.001,
                          tol=1e-8,
                          beta_1=0.9,
                          max_fun=15000)
#

Like, everything you see here I changed

#

Spent hours messing around with it trying to create variations on each other

grave frost
#

what about a 10-layer network?

#

also, set early_stopping=True with validation_fraction=0.1 to validate your network on 10% of your train data (validation_fraction is only used when early_stopping is on)

kindred radish
#

so like if i made hidden_layer_sizes = (1000,500,250,125,60,30,15)? I thought the length of this tuple dictated the number of layers??
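
#

For what it's worth, that reading of the tuple is right: each entry is the width of one hidden layer, so the length is the number of hidden layers. A small sketch of what a given tuple implies (the helper `layer_shapes` and the 6 input features are made up for illustration, and bias terms are ignored):

```python
def layer_shapes(n_features, hidden_layer_sizes, n_outputs=1):
    """Weight-matrix shapes implied by a tuple of hidden layer widths."""
    sizes = [n_features, *hidden_layer_sizes, n_outputs]
    return list(zip(sizes[:-1], sizes[1:]))

# Assumed: 6 input features, one output.
shapes = layer_shapes(n_features=6, hidden_layer_sizes=(100, 50, 25))
n_weights = sum(rows * cols for rows, cols in shapes)

print(shapes)     # one (inputs, outputs) pair per weight matrix
print(n_weights)  # total number of weights, biases not counted
```

Three hidden layers already mean several thousand weights, so a much deeper or wider tuple adds a lot of parameters relative to a small dataset.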

kindred radish
#

umm ok, why do it this way round? The test and train data is always jumbled up before hand each time to make sure that it's not just doing exactly the same thing each time

grave frost
#

for newbies, integrated is much better

#

because it reduces the chance of a problem somewhere

grave frost
#

I told you, ditch the scores and focus on the accuracy

kindred radish
#

okok sorry ill go implement that now

grave frost
#

your precision/recall may matter, but it depends on the task you are doing (like what your implementation cares about: false positives, false negatives etc.). different metrics are preferred for different scenarios

#

like for predicting cancer, you do not want any False Negatives.

kindred radish
#

I guess for breaks i wouldnt want a False Negative either

grave frost
#

you just want a high accuracy for prediction 🤷

#

for your task, FP's, FN's etc. dont matter

#

because you want to say to the CEO that your model can predict 90% of the time whether the machine would break or not

#

not talk to him about FP's or precision/recall

kindred radish
#

So im using accuracy_score()?

#

That the right one?

grave frost
#

thats for classification

#

try max_error

kindred radish
#

aight lemme swap it to regression real quick

#

The regressor assumes a linear relationship right?

cerulean stream
#

anyone know of a non-blocking way to implement matplotlib in an async environment: the regular run_in_executor from asyncio doesnt work

deft ruin
#

@kindred radish no MLP has a nonlinear activation

kindred radish
#

ooof im getting an error lol

#
    raise ValueError("Classification metrics can't handle a mix of {0} "

ValueError: Classification metrics can't handle a mix of multiclass and continuous targets
#

Im much too tired to be able to try and handle the error

#

So i'll probably call it a night and try and do something tomorrow. Is it alright if I could ping you as well tomorrow @grave frost ? Im on GMT timezone

#

absolutely fine if not, you've helped me a lot already! ^^

kindred radish
#

thank you!

stiff barn
# kindred radish thank you!

I think @misty flint mentioned this before but since you only have 300 samples I would stay away from a deep learning model and probably stick with a gradient booster if all the data is tabular (no images or any unstructured data). I’m assuming since it’s your first dive into ML your supervisor will want the model to be at least somewhat explainable and not just a black box which a gradient booster will give you to some extent. They’re also probably more of the standard for structured data.

#

Depending on what problem you’re trying to solve you may have been right in choosing classification over regression. If you just want to predict if a machine will fail that day then I would turn the problem into a binary classification one. That should be a bit more achievable with that little data. Otherwise I’d stick with regression and not do multi-class classification.

exotic maple
#

Can someone explain in plain english what is the difference between Dense and Sparse matrix?

#

I feel like i get but i want to be sure i have the right idea...

deft ruin
#

Sparse matrices are mostly zeroes, so much so that it becomes efficient to create a separate data structure that stores the location and value of the nonzero entries rather than the whole matrix

exotic maple
#

but for example, why are sparse matrices preferred in some cases? I was reviewing some NLP tasks and it seems sparse matrices are like the daily bread there.

#

is that an inevitable conclusion of the pain in the ass of text?

deft ruin
#

Yeah usually with NLP you’re dealing with a giant matrix where each column is a word and each value is the number of times that word appeared in the “document” (which is whatever you’re considering a single observation)

#

You can imagine how that can get big fast and how lots of counts will be zero

#

I said word but usually it’s called something more general e.g token

shut valve
#

Lol good question all I know is If you do One hot encoding it’s sparse

#

But beyond that and the mostly zeros I don’t really know

#

But no that’s not the end all for nlp most utilize some sort of embedding which turns your tokens to vectors so you don’t have to use one hot or sparse stuff

deft ruin
#

Yeah I should say that’s the simplest case like bag of words

#

Lots of techniques are concerned with reducing the dimensionality of the feature space

velvet thorn
#

like the simplest form of vectorisation

#

bag of words modelling

#

there are many unique words

#

but most of them will not appear in any one document

#

-> lots of 0s

#

imagine a matrix where 99.9% of the values are 0

#

naively storing it in a data structure meant to hold dense data

#

will take ~100x of the amount of memory a sparse data structure would take

#

sparse data structures can also optimise for certain operations

#

for example, say you want the document(s) that contain the most of a certain word

#

you can clearly ignore all the 0s

#

there are many ways to store sparse data

#

each with their drawbacks and advantages
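
#

To make the bag-of-words point concrete, here is a toy sketch with three invented "documents". It stores the same counts twice: once as a dense table with all the zeros, and once as a dictionary holding only the nonzero cells, which is roughly the dictionary-of-keys idea.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["the machine broke", "the film is thin", "the machine rolls the film"]
vocab = sorted({word for doc in docs for word in doc.split()})
col = {word: j for j, word in enumerate(vocab)}

# Dense: one row per document, one slot per vocabulary word, zeros included.
dense = [[0] * len(vocab) for _ in docs]
# Sparse (dictionary-of-keys style): only the nonzero counts are stored.
sparse = {}
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        dense[i][col[word]] = count
        sparse[(i, col[word])] = count

dense_cells = len(docs) * len(vocab)
stored_cells = len(sparse)
print(dense_cells, stored_cells)
```

With a real vocabulary of tens of thousands of tokens the gap between the two numbers gets far larger than in this toy case.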

exotic maple
#

@velvet thorn thanks a lot man.

#

I see that sklearns CountVectorizer returns a scipy.sparse matrix

#

is that one of the optimized structures you mentioned?

astral path
#

ay so the ML model i built is in first place in my march madness pool

lean ledge
#

Sparse matrix types, unlike normal matrix types, aren't going to be stored like straight arrays

#

They're going to have more complicated encoding schemes to avoid doing useless operations

spring seal
#

Hi there, I am beginner. I am currently working on EDA of 'Temperature Variation of Countries'. So, if any beginner (like minded) want to work/study with me. just drop a message. It's just like group study, nothing else.

uncut barn
#

What does it mean for the model (nn) loss to fluctuate?

lean ledge
#

it's not training properly

#

decrease your learning rate

#

ideally have gradient clipping and stuff also

#

and normalise your data
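
#

A toy sketch of the learning-rate part, assuming plain gradient descent on a made-up one-parameter loss f(w) = 5 * w**2 (nothing to do with any particular network): when the step is too big the update overshoots the minimum and the loss grows, and with a smaller step it shrinks steadily.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(w) = 5 * w**2 and record the loss after every step."""
    w, losses = start, []
    for _ in range(steps):
        grad = 10.0 * w            # derivative of 5 * w**2
        w -= learning_rate * grad
        losses.append(5.0 * w * w)
    return losses

too_high = gradient_descent(learning_rate=0.21)  # w flips sign and grows each step
lower = gradient_descent(learning_rate=0.05)     # w halves each step

print(too_high[0], too_high[-1])
print(lower[0], lower[-1])
```

Real training loss is noisier than this because of mini-batches, but the overshooting mechanism is the same.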

grave frost
#

for sparse arrays with less density

tidal bough
#

!docs scipy.sparse

arctic wedgeBOT
#

This appears to be a generic page not tied to a specific symbol.

tidal bough
#

there's compressed-sparse-rows, compressed-sparse-columns, dictionary-of-keys, and probably others I don't remember

#

and that's only for sparse matrices.
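
#

A hand-rolled sketch of the compressed-sparse-rows layout, just to show the idea (the helpers `to_csr` and `csr_row` are made up here; `scipy.sparse.csr_matrix` is the real thing). As far as I know this is also the format `CountVectorizer` hands back.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)   # the nonzero value itself
                indices.append(j)    # the column it sits in
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr

def csr_row(data, indices, indptr, i):
    """Nonzero (column, value) pairs of row i, without touching any zeros."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

matrix = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
data, indices, indptr = to_csr(matrix)
print(data, indices, indptr)
print(csr_row(data, indices, indptr, 2))
```

Row lookups are cheap in this layout, while the column-oriented variant makes column lookups cheap instead, which is the kind of trade-off between formats mentioned above.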

hallow girder
#

Hello there,

I have somehow managed to land a Data Engineer role. I have never been a Data Engineer before.
My background is DevOps (Linux for the last 15 years, and also AWS, Ansible, systems, bash, python, networking, etc). At the interview they mentioned being in a Data Team and technologies and terms such as ETL, data pipelines, python, pytest, Jupyter Notebooks (with Pandas, numpy, matplotlib), AWS s3, data wrangling, Linux, SQL, Agile and doing code reviews in GitHub.

I think I have either worked with some of these technologies and/or have played with some of them before but I don't know anything about ETL, data pipelines, pytest, Jupyter Notebooks (with Pandas, numpy, matplotlib), and doing code reviews to any great extent. Could someone point me to some recommended learning guides such as MOOCs, online courses etc about those (ETL, data pipelines, pytest, Jupyter Notebooks and doing code reviews)?

Also, what is it like to work in a Data Team, is it different to working in a Software Development Team? How best can I transition from a DevOps (mostly Linux sysadmin type experience) to a Data Engineer mindset? What would be some good things to figure out and ask questions about when you first start in a Data Team?

Any help much appreciated.

lapis sequoia
#

Hello everyone!
I am working on a project that uses OCR. I have videos that show boxes from above rolling on a production lane from one side to another. I already have a code provided by someone else that cuts from video frames only the box’s label and my task is to OCR this label.
Videos are recorded by fish eyed camera lens so it is a little deformed, which is not making it easy for me. First issue I was struggling with was to find a proper OCR tool, because it is going to be running on a weak virtual machine with only CPU (4 cores). I found brilliant PaddleOCR which is lightweight and seems to fit my needs, but I can’t apply it properly on my box labels to OCR them. Without doing anything to the images, PaddleOCR has 85% efficiency in detecting and recognizing numbers properly, but I need to make it better. I figured that rotating an image is helping in many cases, but I can’t have one fixed angle by which I rotate every image, so I found that four point transform can help in my case. To properly use this transform I need to find a contour of my label. I tried playing with opencv to pre-process image but I can’t find universal parameters for all images in order to find this contour and transform it. I tried pre-processing images with thresholds, blurs, applying erosion, dilation etc. after which I use canny edge detection and then I try to apply four point transform. I will be thankful for any help. Images with labels I am working with look like this:

serene scaffold
#

@hallow girder please save this question and ask again another time. I'm worried no one with that knowledge saw it.

lean ledge
#

the issue is a fisheye warp, the solution should be an unwarp

#

if you have access to the camera, run a calibration procedure, or know the camera model, try to find its distortion coefficients online

#

no OCR software should have any issue with those photos given it's undistorted properly

grave frost
#

using pre-trained embeddings here, when I try to generate a subword vector, it has no problem. but when I try to find its most similar word (using an inbuilt method) then it reports that the word is out-of-vocabulary. how does that work?

#

ahh, nvm

kindred radish
#
              precision    recall  f1-score   support

           0       0.58      0.77      0.66        39
           1       0.55      0.33      0.42        33

    accuracy                           0.57        72
   macro avg       0.56      0.55      0.54        72
weighted avg       0.56      0.57      0.55        72

@grave frost As per @stiff barn 's suggestion, I've used a Gradient Booster to try and classify breaks or not. I just printed classification_report() to get this table
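
#

For reading that table: the sketch below rebuilds every number in it from four confusion counts. The counts are inferred from the rounded figures and the supports (39 and 33), so treat them as a reconstruction rather than something posted here.

```python
# Confusion counts inferred from the rounded report (rows = true, cols = predicted).
# They are a reconstruction, not numbers posted in the chat.
confusion = [[30, 9],    # true 0 (no break): 30 predicted 0, 9 predicted 1
             [22, 11]]   # true 1 (break):    22 predicted 0, 11 predicted 1

def report(confusion):
    """Per-class (precision, recall, f1, support) plus overall accuracy."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    rows = {}
    for k in range(len(confusion)):
        predicted_k = sum(row[k] for row in confusion)  # column total
        support = sum(confusion[k])                     # row total
        precision = confusion[k][k] / predicted_k
        recall = confusion[k][k] / support
        f1 = 2 * precision * recall / (precision + recall)
        rows[k] = (round(precision, 2), round(recall, 2), round(f1, 2), support)
    return rows, round(correct / total, 2)

rows, accuracy = report(confusion)
print(rows, accuracy)
```

Read this way, a recall of 0.33 on class 1 means about 22 of the 33 break days in the test split were predicted as "no break".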

grave frost
#

hmm, why are you still doing classification? what about regression?

kindred radish
#

Aye I'm gonna try the regression one next, I just thought to print this out since I had everything set up for classification. I guess it was like "let's just see what happens" 😅 Ill go use GradientBoostingRegressor now

stiff barn
# hallow girder Hello there, I have somehow managed to land a Data Engineer role. I have never ...

Hey @hallow girder. I’ve worked as a data engineer for a few years now and currently work as one for a large company in the US so I will do my best to provide some help and answers.

First thing, congratulations! It is possible that the team wants someone with a lot of dev ops knowledge as we do need to perform a lot in that area.

Since you already have cloud and Python experience that is going to help a lot as well since we generally like to build in Python. If you have IAC experience as well that would be great to show the team.

If you know which cloud provider they are using, I would look into getting a data engineering certification for that cloud. Each of them has one I believe and there are usually courses on coursera for them. There is also a company called DataQuest which has a Data Engineer track which is pretty good. They go through writing memory efficient pipelines, pandas, numpy, etc. You may find their other tracks useful as well.

When you first get to the team you’ll need to figure out what you’ll actually be doing and what technologies they use. Data Engineering can vary to some extent where some companies you focus more on writing SQL procedures and managing databases, while others it’s more writing data pipelines in the cloud, and others it’s a lot of both. On that note, if you’re not strong in SQL that’s definitely a top if not the top skill to brush up on. Figuring out what the team generally actually does will help you target what to learn.

Feel free to reach out if you have any other questions or want to chat about the role.

kindred radish
#

Yeah, 1 meant there was a break, 0 meant there was no break

#

That's for the output, y

stiff barn
# grave frost lol regression is *very* achievable in little data if there is a simple correlat...

That is not true in his case. Reframing a regression problem as a binary classification problem can increase accuracy, as you’re essentially condensing a range of values into one. It can be helpful especially in his case, where I assume the same input values can often lead to different output values due to chance. At the end of the day though, we don’t know that much about the data so I could be wrong.

grave frost
#

@kindred radish how many times a week does it (on average) break?

stiff barn
#

I am here to learn as well so if there is something I’m missing then please call me on it.

kindred radish
#

The number of breaks is pretty small, each day (which is the time frame everything is recorded) there is about zero to four breaks

kindred radish
#

yeah ikr lol

grave frost
#

yeah, then its a regression hands down

kindred radish
#

wait why

#

I thought it was such a small amount of breaks

#

that regression wouldn't be a good idea that i could simplify the problem using a classifier

grave frost
#

yeah, but having a large frequency each day makes it much more suitable to regression. if it was like it broke 3-4 times a week on different days, then it would be classification because you could counter-act the bias

stiff barn
#

If it’s that often then binary classification will probably just always output 1

#

Can’t really learn much. If you had the data on an hourly basis or something that might be different

kindred radish
#

Why does it make it more suitable to regression?

grave frost
#

because it can handle the fluctuations pretty well and classification would not handle the bias and would just output ones

kindred radish
#

When i think of regression, as a physicist, i think in terms of some equation being obeyed. I have no idea what equation this data would follow to yield the output we see

stiff barn
#

That’s what the machine must learn haha.

#

Could be a simple linear equation

#

Or a bunch of decision trees working together if a gradient booster/random forest

#

Or any of the number of regression models

kindred radish
#

I doubt it is linear, only because the factors are kind of complicated and aren't so simple

grave frost
#

it might be linear with data engineering

#

but you wouldn't know unless you do it.

stiff barn
#

Thanks for calling me out there @grave frost, I wasn’t aware it was breaking that often

kindred radish
#

Like im not even sure that the data is correlated to whether the machine breaks or not, this is just the data ive been given. It could be something like the humidity of the room, which they haven't recorded, which is actually making the machine break

kindred radish
#

So even if the data didn't actually affect whether the machine breaks or not, a model could still figure something out?

stiff barn
#

More like even if the data isn’t directly correlated, there can be some partial correlation that can give the model some signal to make predictions on.

kindred radish
#

Standard scaling

#

That isnt the raw data, i've standardised it myself

#

Just wondering which to use. It's just more convenient for me to use the standardised one rn thats why i was asking!

stiff barn
#

It should preserve the correlation and be fine.

#

If the scaling didn’t preserve the correlation then that’s a problem haha

#

How do you have the data stored in the notebook now when training? Numpy, pandas DataFrame?

kindred radish
#

So i have a Pandas DataFrame of all the features, X, and the output, y.
Then i split the data into training and testing data and randomise their positions according to sklearn's train_test_split() function

stiff barn
#

So you'll want to run the correlation on the full training dataset with the y included.
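
#

A minimal sketch of that check in plain Python, with an invented feature column and invented break counts (in pandas, `DataFrame.corr()` should give the same kind of number). It also shows why standard scaling is fine here: the Pearson correlation comes out the same before and after standardising.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def standardise(xs):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    mean = sum(xs) / len(xs)
    std = sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

# Hypothetical feature column and daily break counts, not the real data.
viscosity = [12.1, 14.3, 11.8, 15.0, 13.2, 12.7]
breaks = [1, 0, 2, 0, 1, 3]

r_raw = pearson(viscosity, breaks)
r_scaled = pearson(standardise(viscosity), breaks)
print(r_raw, r_scaled)
```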

kindred radish
#

Yeah not looking promising

#
-0.08902379625215213    -0.09753689601363387    -0.17866755803410775    0.02011288489513478
#

Are the correlations im getting to the total number of breaks

stiff barn
#

So what are those, are those the correlation of each feature with the y?

kindred radish
#

Yeah, that's the correlation of each feature with the total number of breaks that happened on that day

stiff barn
#

There is some signal there. Not much but some. feature 3 has a decent negative correlation.

#

It's not all bad as there may be some non-linear correlation that you cannot see here

kindred radish
#

Yeah i would presume this just tells me that there isn't really a linear correlation

stiff barn
#

You can bring that out using feature engineering techniques like feature crosses

kindred radish
#

And that the third feature is the "most" linearly correlated

stiff barn
#

Yeah, basically

#

The gradient booster should be able to capture some of the non linear correlation.
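
#

A tiny constructed example of the feature-cross idea, with invented numbers: the target depends only on the product of two features, so each feature on its own shows zero linear correlation while the crossed feature correlates perfectly. The names `x1`, `x2` and `cross` are placeholders.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Constructed data: y depends on the product of the two features only.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [3 * a * b + 2 for a, b in zip(x1, x2)]

cross = [a * b for a, b in zip(x1, x2)]  # the feature cross

print(pearson(x1, y), pearson(x2, y), pearson(cross, y))
```

Real data will never be this clean, but it shows how near-zero correlations per feature do not rule out a relationship.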

kindred radish
#

so i shouldnt use 'LS' since that's for a linear problem?

stiff barn
#

I would just start with the default.

#

You can try other options later in your hyperparameter tuning phase and see if others work better.

kindred radish
#

aight what metric should i use then? I just used classification_report() before but obviously this is regression?

stiff barn
#

MSE most likely. Could also use r2

#

Would probably use both

spark olive
#

i was referred here from the help channels, could someone help me with some code?

i am trying to make a bar chart, but i am unsure how to call up the x-axis and y-axis variables needed in
sb.factorplot(x=), (y=)
as the data is from several different csv files, and the math i have done is in functions

the code is on here https://paste.pythondiscord.com/uqezixusen.py

https://drive.google.com/drive/folders/1EJtc3R60eAMcMqLcSmGuPAGYPifLVIzz?usp=sharing here is the csv files relevant for the chart

serene scaffold
#

@spark olive sb.factorplot(x=), (y=) is a syntax error

spark olive
#

i need to put something in the x= and y=, but i cant figure out what to put

deft ruin
#

Which columns do you want to plot?

#

You’ll need a data frame with two columns containing the data you want to plot

#

x and y are strings with the names of the columns

spark olive
#

i want to plot EAB_SUM as the x axis, and postBIODV as the y axis, but i cant figure out how to turn it from a function to a series

kindred radish
#

@stiff barn So i mentioned yesterday that this happened, but i have somehow got a negative R2 value...

#

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)

#

My R^2 value for this is like -0.8 and the MSE is like 1

spark olive
#

aiming for something like this

deft ruin
#

@spark olive the globals and the Xs in the function signature make the code a bit hard to follow

spark olive
#

(realized i said barchart earlier when i meant dot graph, my bad there)

deft ruin
#

Is the data you want for EAB-SUM in the total column of XX_EAB?

spark olive
#

the data i want for EAB_SUM would be the result of

XX_EAB['TOTAL'] = XX_EAB.sum()
#

but it would differ depending on if it was NY, WI, or TX

#

the fuctions having the XX was for the purpose of being able to replace the "XX" with the state code (NY, WI, or TX) without having to retype all of it every time

deft ruin
#

Uh ok might be better to have a positional argument called state

spark olive
#

oh? i have never used that before

deft ruin
#

I’d refactor your code so that it returns the column you want

#

And set it up so that you can give it the state and it will get the sum for that state

#

Then use pd.concat to put them together

spark olive
#

how do you do that?

deft ruin
#

Try writing a function that takes the data and the state name and gives you the sum for that state

#

Then create a loop that puts the sum for each state in a list

#

Then convert that list to a series
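
#

Very roughly, those three steps could look like the sketch below. The dictionary is a made-up stand-in for the per-state CSV data and `eab_sum` is a hypothetical name, so adapt both to the real tables.

```python
# Hypothetical stand-in for the per-state CSV data: state code -> EAB values.
eab_by_state = {
    "NY": [3, 5, 2],
    "WI": [1, 4],
    "TX": [6, 0, 7],
}

def eab_sum(data, state):
    """Return the EAB total for one state."""
    return sum(data[state])

states = ["NY", "WI", "TX"]
totals = []
for state in states:              # one pass per state code
    totals.append(eab_sum(eab_by_state, state))

print(dict(zip(states, totals)))
```

With pandas loaded, `pd.Series(totals, index=states)` would then turn the list into a series that can be plotted.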

spark olive
#

how would you create the loop? sorry i am new to python! 😅

craggy sundial
#

I want to implement a generative adversarial network in native python. not for any practical applications, but for learning

kindred radish
#

Yeah im a mug i just realised i did that lmao

stiff barn
#

lol all good

kindred radish
#

Seems the regressor doesn't like that the input is continuous whilst the output is discrete?

#

ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

stiff barn
#

What metrics are you using?

kindred radish
#

Ah i see yeah it's the classification report

#

OKKK it works now it works

stiff barn
#

Lol good, how are they looking?

kindred radish
#

Annd Im getting the following:

MSE:  1.2018049174929522
R2:  -0.2902649077024708
MSE:  0.6243021962953844
R2:  -0.016531632695806264
MSE:  0.7926685016933164
R2:  -0.3511394915226984
#

For three runs

#

So they're still negative. I checked to see the data that im running through and it all looks good

deft ruin
#

@spark olive no worries, though you might want to go to a dedicated help channel for that kind of thing

#

Afraid I can’t walk you through it right now

spark olive
#

@deft ruin i went there but they directed me to here instead
no worries though! any time youre free id really appreciate it. thank you so much for your pointers so far!

kindred radish
#

Fiddling with the Gradient Booster's parameters doesn't seem to help

stiff barn
#

Let it run for longer

kindred radish
#

aight

stiff barn
#

MSE is decreasing

#

For the most part

#

Also, if you can show your loss that would be helpful

kindred radish
#

Aight i slapped it up to 1000

#
      Iter       Train Loss   Remaining Time 
         1           0.6906            0.00s
         2           0.6635            0.50s
         3           0.6451            0.33s
         4           0.6294            0.50s
         5           0.6113            0.40s
         6           0.5965            0.50s
         7           0.5845            0.43s
         8           0.5727            0.50s
         9           0.5605            0.44s
        10           0.5534            0.50s
        20           0.4828            0.39s
        30           0.4391            0.39s
        40           0.3983            0.36s
        50           0.3602            0.36s
        60           0.3275            0.34s
        70           0.2983            0.35s
        80           0.2739            0.33s
        90           0.2500            0.33s
       100           0.2338            0.32s
       200           0.1234            0.28s
       300           0.0760            0.24s
       400           0.0524            0.20s
       500           0.0355            0.17s
       600           0.0244            0.13s
       700           0.0175            0.10s
       800           0.0133            0.07s
       900           0.0098            0.03s
      1000           0.0073            0.00s
#

I could fiddle with the learning rate?

#

So i made the learning rate like 0.001 as opposed to the default 0.1. This makes it much less negative, it actually makes it go to around zero.
I've read that having a "high" learning rate is what you do if you have a "small" amount of data?

stiff barn
#

negative r2 means that the model performs worse than a horizontal straight line. 0 would mean it performs just as good as one

#

.001 would be a fairly standard rate. It would be lower than the default though

kindred radish
#

0.1 is the default yeah

stiff barn
#

Try lowering max depth to 3 or 4

kindred radish
#

So this is the same as me just drawing a flat straight line through my data? That's pretty crap right? lol

stiff barn
#

Lol yeah basically

kindred radish
#

depth is default to 3 lemme try 4

#

Tried 4. it's looking like it's still 0

#

like floating around that, both negatively and positively

stiff barn
#

Is that on the holdout dataset?

kindred radish
#

holdout?

stiff barn
#

The dataset you save that doesn't touch the training process

kindred radish
#

oh the "test" data?

#

Like i split the data into a 80-20 train-test split. These metrics then compare the predicted output based on what's trained against the test dataset

stiff barn
#

Yeah, the test dataset

kindred radish
#

Yeah then it's all done on that

kindred radish
#

are you worried that it's correlating the training data fine, just not the test data?

#

If the positions of the training and testing data is randomised each time, doesn't that help?

stiff barn
#

More that it's over-fitting to the training set.
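
#

A toy illustration of what over-fitting looks like, using pure noise and a deliberately silly "model" that memorises the training set (a 1-nearest-neighbour lookup standing in for the booster, so not the actual setup): the training error is zero while the error on held-out points stays large.

```python
import random

rng = random.Random(0)
# Pure-noise toy data: the feature carries no information about the target.
xs = [rng.random() for _ in range(200)]
ys = [rng.choice([0, 1, 2, 3, 4]) for _ in range(200)]
train_x, train_y = xs[:160], ys[:160]
test_x, test_y = xs[160:], ys[160:]

def predict_memorised(x):
    """1-nearest-neighbour: return the target of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def mse(features, targets):
    """Mean squared error of the memorising model on the given points."""
    errors = [(predict_memorised(x) - y) ** 2 for x, y in zip(features, targets)]
    return sum(errors) / len(errors)

train_mse = mse(train_x, train_y)  # every training point is its own neighbour
test_mse = mse(test_x, test_y)     # nothing real was learned
print(train_mse, test_mse)
```

A training loss that keeps falling towards zero while the test score does not improve is the same pattern.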

#

I'd suggest setting a random seed for both the model and the training test split so you don't get different results each time due to randomization
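
#

A hand-written sketch of what fixing the seed buys you, assuming an 80-20 split of 300 stand-in rows (the helper name is made up; in sklearn the equivalent knob is the `random_state` argument on both `train_test_split` and the model):

```python
import random

def train_test_split_manual(rows, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed, then cut off the last test_fraction as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[:-n_test], shuffled[-n_test:]

rows = list(range(300))  # stand-ins for the daily records
train_a, test_a = train_test_split_manual(rows, seed=42)
train_b, test_b = train_test_split_manual(rows, seed=42)
train_c, test_c = train_test_split_manual(rows, seed=7)

print(test_a == test_b)  # same seed, same split
print(test_a == test_c)  # different seed, (almost certainly) a different split
```

With the split pinned down, a change in the score can be attributed to the parameter that was changed rather than to a different shuffle.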

deft ruin
#

@spark olive sure thing i might be able to provide some more help later

kindred radish
#

Ok that sounds like a good idea

#

What's on the y-axis of this plot though @stiff barn Or did you mean to track how the R2 value varies upon changing the parameters?

sonic raft
#

Hi! I need some help with pytorch.
So let's say I've calculated a gradient for a set of parameters( It doesn't matter, but let's say with a MSE loss function)
So, when I'd like to step the parameters
is there any difference between
parameter.data -= parameter.grad * learning_rate
vs
parameter.data -= parameter.grad.data * learning_rate?
Since I've already told Pytorch not to calculate the gradients for this stepping operation with "parameter.data"

spark olive
#

@deft ruin thank you so much! feel free to dm me if thats easier

stiff barn
#

The y axis is just a measure of goodness of fit

kindred radish
#

i used exactly their code, and i've just put X_train,X_test,y_train and y_test into it

stiff barn
#

Wouldn't bother copying the code. I'd just try those parameters like subsample=0.5 in the code you had

kindred radish
#

ohhhhhhhhhhh

#

aight lemme reverse reverse

stiff barn
#

haha sorry should have been more clear

kindred radish
#

Looks like im getting about the same as what I was getting before

stiff barn
#

Yeah I mean unfortunately there may not be much signal in the data you have. You could try doing some feature engineering or collecting more data.

#

If you're willing to share the notebook I can look for anything being off

#

Not sure if that's allowed for you though

kindred radish
#

I can't share it I'm afraid im under an NDA

stiff barn
#

Yeah, makes sense

#

Same lol

kindred radish
#

it's a real bitch right? hahaha

#

I want to be able to conclusively say something about the data

stiff barn
#

Yeah, mine's pretty invasive

kindred radish
#

I had an idea to push in artificial input data to induce a correlation that the models would learn. For example: create a feature that always results in a break if it has a value of 0.5

#

And this would show that the data i've been given is likely to not have an effect on whether the machine breaks or not

#

Would this be a good idea to do do you think?

stiff barn
#

Not a bad idea to validate your process

kindred radish
#

omg some hope!

stiff barn
#

If you had the time, you could also try something like this

#

You could use that to generate more training data that has the same statistical properties as what you have.

#

That could improve training

#

Wouldn't recommend it though unless you had a bunch of time

kindred radish
#

ohhh is this like... In unsupervised learning, say you had pictures of people's faces and you wanted a model to learn how to recognise them. You don't need to take more photos, you can just mirror them and that's the same as new data?

stiff barn
#

Yeah, I'd say the idea is similar.

#

Not exactly the same as new data but helps prevent overfitting and such.
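
To make the mirroring idea concrete, a tiny NumPy sketch. The array layout `(n, height, width)` and the helper name are assumptions, and this only makes sense when a left-right flip does not change the label:

```python
import numpy as np


def add_mirrored_copies(images, labels):
    """Append a left-right mirrored copy of every image, reusing its label."""
    mirrored = images[:, :, ::-1]  # reverse the width axis
    return np.concatenate([images, mirrored]), np.concatenate([labels, labels])


# Two made-up 2x3 "images" so the effect is easy to see.
images = np.arange(2 * 2 * 3).reshape(2, 2, 3)
labels = np.array([0, 1])

aug_images, aug_labels = add_mirrored_copies(images, labels)
print(aug_images.shape, aug_labels)
```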

kindred radish
#

ok that's cool, i might check this out after the meeting, unfortunately it's on Tuesday so i don't really have the time to look into it now

stiff barn
#

Gives more examples to train on

kindred radish
#

Thank you for helping me, this channel is awesome. Definitely made me realise how little i understood about what I was doing! Wish I'd been more active on here at the beginning of the academic year lmao

stiff barn
#

No problem, you seem to learn fast and are able to iterate quickly on what we suggested. It's a pretty cool community for sure!

#

We're all still always learning. You have to be in this field haha

kindred radish
#

thanks ahaha i've been in a bit of a panic so i was laser focused on whatever you guys said 😅

#

Yeah it definitely seems like it. When i was younger i saw lots of videos about ML and data science which was more on the meme-y side of things so i think i falsely assumed that the field wasn't as deep as it is. Obviously a terrible false assumption on my part! So much info for such a young field of study

stiff barn
#

Yeah it's all super cool. It's a very interesting way to solve problems

#

Unfortunately just takes a lot of data generally haha

kindred radish
#

Well that seems to be what was holding back the science when it was first developed, right?

#

Because a lot of this Computer Science was discovered decades ago

#

But """"Big Data""""" wasn't available back then as it is now. Like the amount of data that Google and Facebook or TikTok has on people is absolutely insane, so there's so much room to explore with ML

stiff barn
#

Yeah for sure. Both availability of data and processing power. Now you can buy a solid GPU for $700 and be able to build fairly large models in a few hours of training or just hop over to the cloud and rent even larger resources.

kindred radish
#

that is if you can find a GPU hahaha ;)

#

So im pretty happy with how everything's turned out, despite it being sloppy on my part. One thing I would like to check with you @stiff barn is why you suggested the GaussianBoost function? That question might be a bit involved so if there's material you know of that I could read about it then that would be awesome! I've looked around a bit on wikipedia and stuff, just unsure on some concepts i guess.

stiff barn
#

XGBoost is the library generally used in industry for structured data. It mostly contains optimized boosting models, which are generally the go-to because they work really well on this kind of data. The most basic explanation I can think of is that you're training a bunch of decision trees in a row, where each subsequent tree tries to learn to correct the errors made by the trees before it
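
A stripped-down sketch of that "each tree corrects the previous errors" idea, for squared-error regression. This is not how XGBoost is implemented (it adds regularization and other optimizations); it uses scikit-learn's `DecisionTreeRegressor` as the weak learner, and the function names and toy data are made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    baseline = float(np.mean(y))            # start from a constant prediction
    prediction = np.full(len(y), baseline)
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction           # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)               # next tree learns those errors
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees


def predict_boosted(baseline, trees, X, learning_rate=0.1):
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction = prediction + learning_rate * tree.predict(X)
    return prediction


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

baseline, trees = fit_boosted_trees(X, y)
mse_baseline = float(np.mean((y - baseline) ** 2))
mse_boosted = float(np.mean((y - predict_boosted(baseline, trees, X)) ** 2))
print(mse_baseline, mse_boosted)
```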

kindred radish
#

Thank you! So I should look into XGBoost then and read around that a bit?

marsh gale
#

Hey all!
I recently finished my Master in mechanical engineering and I thought about getting myself a little into ML (not for a job per se, but since i'm interested)
I'm currently playing a browser game where players can produce a number of units and fight each other. Combat is fairly simple: each unit picks a random unit as a target and attacks it, but some units are good against others, which means that if they hit a unit they're good against, they have a chance to attack again.

So my idea was to train an ML model on the unit compositions of all the players on a server and calculate the best possible "counter" to their units.

Would you strongly advise me not to try this? If so, why?

stiff barn
# marsh gale Hey all! I recently finished my Master in mechanical engineering and I thought a...

Hey @marsh gale, welcome! It sounds like what you're attempting to do there is reinforcement learning. It sounds like it could be doable but I would suggest tackling a simple problem first for your first dive into ML. Something that already has a clear labeled dataset so you can wrap your head around some of the underlying concepts in ML. Reinforcement learning is definitely more on the advanced end and a current area of active research.

faint ocean
#

Could somebody maybe take a look at this for me?

grave frost
misty flint
#

what does this even mean

grave frost
hollow sentinel
#

out of context baby bottle

grave frost
#

maybe he/she was trying to tease him?

hollow sentinel
#

idk

fiery frost
#

Hi, I will be happy if someone can guide me.
I am trying to build a CNN that finds the most similar fonts to a font in a picture.
The catch is that the font in the picture is in a different language from the font I'm trying to find.
For example, there is a picture with a font in Japanese, and the neural network needs to find the most similar font in English.
I don't know a lot of neural network stuff, and I'm not asking for a step-by-step guide, I just need a direction and a place to start.
I've already started building a database for it.
Any help would be welcomed!

uncut barn
#

will a small batch size cause the loss function to fluctuate? my data set has shape (2000, 40, 1)

lethal crater
#

well, your batch size will change the result of your learning rate

#

a larger batch size tends to work better with a slightly higher learning rate in my experience
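
A quick NumPy sketch of why small batches tend to make the loss jumpier: the gradient estimated from a small batch varies more from draw to draw. The linear-regression setup, the 2000 x 40 shape, and the helper names are just assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression problem with 2000 samples and 40 features.
X = rng.normal(size=(2000, 40))
true_w = rng.normal(size=40)
y = X @ true_w + rng.normal(scale=0.5, size=2000)
w = np.zeros(40)  # current (untrained) weights


def batch_gradient(batch_size):
    # Gradient of the mean squared error on one random mini-batch.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    err = X[idx] @ w - y[idx]
    return 2.0 * X[idx].T @ err / batch_size


def gradient_noise(batch_size, n_draws=200):
    # Average per-coordinate standard deviation across many batch draws.
    grads = np.stack([batch_gradient(batch_size) for _ in range(n_draws)])
    return float(grads.std(axis=0).mean())


noise_small = gradient_noise(8)
noise_large = gradient_noise(256)
print(noise_small, noise_large)
```

Noisier gradients mean each step can overshoot more, which is one reason a smaller batch usually pairs with a smaller learning rate.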

lethal crater
marsh gale
granite wolf
#

anyone know how to add the ID column to the left and stop it using country/region as index?

#

this isn't freshly read in data, its the result of transposing the data then setting first row as headers, so cant use read_csv parameters
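
One possible approach, sketched with pandas on a made-up frame (the column names, values, and 1-based `ID` numbering are all assumptions): `reset_index()` turns the index back into an ordinary column, and `insert` places a new column at position 0.

```python
import pandas as pd

# Hypothetical frame that ended up indexed by Country/Region after transposing.
df = pd.DataFrame(
    {"2020-01-01": [1, 2], "2020-01-02": [3, 4]},
    index=pd.Index(["France", "Japan"], name="Country/Region"),
)

df = df.reset_index()                            # Country/Region becomes a normal column
df.insert(0, "ID", list(range(1, len(df) + 1)))  # new ID column on the far left
print(df)
```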

hollow zephyr
#

Hi there, I've covered numpy basics and would like to practice. Can you recommend some sources I can use to find exercises?

grave frost
granite wolf
#

@hollow zephyr try to find some interesting kaggle data and look at the tasks tab if you want to find solutions yourself without following a youtube tutorial

zinc lark
#

is there any way to slim down pytorch installation to just torch.jit? It's 170MB as of now even w/ only CPU support

#

only using it for running traced models in prod

still otter
serene scaffold
grave frost
#

Quick question - what are the best ways to squeeze all the perf out of pre-training? (apart from things like hyperparameter tuning)

serene scaffold
#

My guess is that if you try too hard, you'll just overfit, but I could be wildly incorrect.

grave frost
grave frost
#

3 Years ago - before the scandal

hollow sentinel
#

Since it’s more about a person than a DS/ML topic

zinc lark
#

@velvet thorn I'm trying to think of a reason why a general-purpose tool that prunes unused modules from 3rd party packages isn't out there. Is it because the very dynamic nature of python would make detection of unused modules very hard?

misty flint
#

sounds like itll be very easy to break something

velvet thorn
#

well

#

actually I need to think about that

#

but my gut feeling is also that there’s not much need for such a tool?

zinc lark
#

yeah I don't think there's a concrete need. But I can also see how some people might use it (for example, minimizing attack surface on third party dependencies)

quasi sparrow
#

Hey guys, quick question. How can I downgrade to Python 3.8 without using virtual environments? I'm running on Linux Ubuntu.

zinc lark
#

While it's possible to downgrade, @quasi sparrow you should also check out just using python3.8

#

instead of python3 (which I'm guessing is linked to 3.9 on your system)

quasi sparrow
#

Oh, that's a good idea. Let me try that!

misty flint
#

depends on who you are. academia vs. industry. research vs. applied. etc.

#

speaking of papers

#

also if you are in academia, it should be less about reading X amount of papers, and more about focusing on your particular field + 3 pass method

#

maybe those related to your research, you would do the complete 3 passes, while those in other domains, would get 1 pass

exotic maple
#

Ah yes, political inclinations from algorithms

#

what could go wrong?

misty flint
misty flint
exotic maple
#

I want my technocratic dictatorship to have at least some engineered catgirls or sex robots, but instead we have Onlyfan thots

misty flint
#

and i was like

#

i wonder if anyone is ever like: 'maybe we shouldnt do this project'

quasi sparrow
exotic maple
#

bro I laugh at the U.S. researchers thinking

misty flint
#

have you heard of the 3 pass method?

#

if not, then reading papers will eat up all your time

exotic maple
#

"guys we have a very tough political situation at home, what should we do"
"I KNOW, WE CAN CREATE AN ML TO IDENTIFY 'THEM'" -> "them" is anyone you don't like

exotic maple
#

I've only read 1 paper, ever, and it was the SAGA paper. It was...tough, but i think i got most of it

#

fuck implementing that though lol

zinc lark
#

pip3.8 could also possibly work instead of python3.8 -m pip, depends on if you have that linked

#

and then to test my claim that different python installations don't share packages, python3 -c "import tensorflow" would not work (or if it did, tensorflow would be a different version cause you said 1.11 is incompatible with 3.9)

misty flint
#

i also have one "real" publication

#

so ig that helps

#

non-cs tho

quasi sparrow
#

I don't think it's available anymore

zinc lark
#

oh right it's on version 2. forgot about that

#

@quasi sparrow if you really need 1.11, would you be willing to use py 3.6? or do you need 3.8?

#

if you can hop down to 3.6, this should work:

#
pip install https://files.pythonhosted.org/packages/ce/d5/38cd4543401708e64c9ee6afa664b936860f4630dd93a49ab863f9998cd2/tensorflow-1.11.0-cp36-cp36m-manylinux1_x86_64.whl
#

or you'd have to change to whatever OS you're on

#

you can find the links here

misty flint
#

ig = i guess

#

my paper is not important anymore

#

at least not for AI

#

i still put it on my resume tho

quasi sparrow
#

I'm trying to fine-tune a BERT model from Hugging Face but damn, they make it seem so easy on the website

lean ledge
#

your job isn't to read papers, it's to know what's going on in the field

#

read as many as you need to to keep up, don't go full detail on them if you dont need to

#

if you know the general methods, their pros and cons, why they are used, and the general trends, you're fine

wide oxide
#

Has anyone worked with NEAT?

uneven gust
#

does anyone here know how to code in R?

wide oxide
uneven gust
#

I'm like super stuck on something basic

#

I need to make a data frame with 3 elements

#

i got that part down but it's what each element does that i'm stuck on

#

here's what i need to do

#

Sample should contain sample numbers from 1 to 20;
Group should have alternating labels ‘A’ and ‘B’. ( Hint : use the rep() function with one of its arguments being the same number of rows as in column Sample);
Value should contain a sample of numbers between -20 and 20 (after setting the seed at 42)

#

and here's my code so far

#
df <- data.frame(sample = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                 group = rep(c(A,B)sample),
                 value = c(13,11,9))
#

it showed up strange on here but you get the point

#

i have no idea what to do for value

misty flint
#

i need to learn more R sometime. all i know is tidyverse 10/10

uneven gust
#

@wide oxide do you know how to fix it?

wide oxide
uneven gust
#

thanks let me know here or in dm 🙂

misty flint
#

i think youre really close BongoCat

high badge
#

this is the bellman optimality equation (for reinforcement learning)
but thats about all i really know about it
can someone explain to me what the summation sign is for? and what max_a is?

lean ledge
#

@high badge "the optimal value for a state = the maximum of (the reward from the action taken plus a discounted sum of values for all the states gone over with the optimal action)"

#

it's one of the most basic statements of dynamic programming, although this is specifically written for reinforcement learning and has an added discount factor

#

Idk wtf T(s, a, s') is doing there, that shouldn't be there

high badge
#

oh

lean ledge
#

Oh T is the transition dynamics

#

yeah no thats fine
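
To make the two symbols concrete, here is a small value-iteration sketch. It assumes the common form of the equation, V*(s) = max_a Σ_s' T(s, a, s') [R(s, a, s') + γ V*(s')], which may differ slightly from the one in the screenshot. The summation averages over the possible next states s' weighted by their transition probability, and max_a picks the best action. The two-state MDP below is made up:

```python
import numpy as np


def value_iteration(T, R, gamma=0.9, tol=1e-10):
    """T[s, a, s2] = transition probability, R[s, a, s2] = reward."""
    V = np.zeros(T.shape[0])
    while True:
        # Summation: expected (reward + discounted next value) over next states.
        Q = np.einsum("sat,sat->sa", T, R + gamma * V[None, None, :])
        # max_a: keep the value of the best action in each state.
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


# Hypothetical deterministic MDP: action 0 = stay, action 1 = move to the other state.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0
T[0, 1, 1] = 1.0
T[1, 0, 1] = 1.0
T[1, 1, 0] = 1.0

# The only reward: +1 for staying in state 1.
R = np.zeros((2, 2, 2))
R[1, 0, 1] = 1.0

V_star = value_iteration(T, R)
print(V_star)  # approximately [9, 10]
```

Staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and state 0 is worth 0.9 * 10 = 9 because it takes one unrewarded move to get there.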

high badge
#

uh

sturdy heron
#

Hi, I'm Hans. I'm new here. Do you guys have any recommendations for a beginner who wants to learn data science?

lean ledge
sturdy heron
#

Btw, is Kaggle good for learning data science from zero?

#

especially for a person who doesn't have an IT background

lean ledge
#

a maths background is more important than anything IT or software related

#

kaggle is okay for practice, it doesn't teach you anything directly

sturdy heron
#

oh, i see

floral mauve
#

Hey guys, new to this server (< 1 day). Hello!

I was wondering, is there a good place to learn /recommended learning track for ML for someone who has decent knowledge of multi-variable calc and linear algebra?

misty flint
#

one of the links is math heavy so you can start there

lean ledge
#

Pattern Recognition and Machine Learning is a great book

#

Follow it up with Goodfellow's deep learning book

floral mauve
#

Thanks, I'll look into them

grave frost
lean ledge
#

Read context before you reply pls

#

I mean it does have intro courses but they're so basic might as well read wikipedia pages or X in Y minutes

grave frost
#

intro courses but they're so basic
That's the point - it's meant for beginners absolutely new to data science. and several people I know have benefited a lot from kaggle courses 🤷

lean ledge
#

It doesn't even teach you linear regression lol

#

In fact, it doesn't teach you ML at all

#

It gives you a tutorial thing which literally just calls the Decision Tree method without really explaining what it is

#

And a more advanced tutorial which swaps out the class name with RandomForests

#

I dont know why you choose to be contrarian about everything I say

#

It's objectively a horrible introduction that teaches you nothing about ML

sweet ginkgo
#

Hello there. I need someone to help me/give advice about making a Snake game neural network

#

Thanks a lot