#data-science-and-ml | Python | Page 263

dawn vault Oct 25, 2020, 5:03 PM

#

pretty neat

hollow sentinel Oct 25, 2020, 5:03 PM

#

wow that's cool

#

yeah I haven't finished the course yet but I'll get there eventually

#

it's a very long course

dawn vault Oct 25, 2020, 5:05 PM

#

what i did was create my own project along the way... most of the stuff is pretty much hardcoded anyways.. so.. u end up using google/youtube/stack overflow anyhow.. and thats when u learn..

hollow sentinel Oct 25, 2020, 5:06 PM

#

oh yeah I create a project of my own after everything I learn

dawn vault Oct 25, 2020, 5:06 PM

#

what project ?

hollow sentinel Oct 25, 2020, 5:06 PM

#

I did a linear regression on some data I found from Kaggle about Chennai reservoirs

#

I did a logistic regression project w some Kaggle dataset about congenital heart disease

#

I don't have anything ground breaking yet

#

and I'm working on creating models with k nearest neighbors and random forest/decision trees

#

but what I have the most trouble with is data cleaning, building models is literally copying code

dawn vault Oct 25, 2020, 5:11 PM

#

cleaning and wrangling data ..is a huge process prob the most important one when dealing with models of all sorts..

hollow sentinel Oct 25, 2020, 5:12 PM

#

yeah that's why I always read Kaggle notebooks and Pandas Cookbook so I can learn more efficient data cleaning methods like imputation

#

or you can replace the NAN values with the averages of the column
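
A minimal sketch of the mean-fill idea mentioned here, assuming pandas and a small made-up frame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (made-up data, not from any Kaggle set)
df = pd.DataFrame({
    "rainfall": [10.0, np.nan, 30.0, 20.0],
    "level": [1.0, 2.0, np.nan, 3.0],
})

# df.mean() skips NaN by default; fillna then puts each column's
# own mean into that column's missing cells
filled = df.fillna(df.mean())
print(filled)
```

Mean filling is quick, but it shrinks the column's variance, so it is usually a baseline rather than a final answer.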

austere swift Oct 25, 2020, 5:15 PM

#

or you could do what i did a few times and go with multivariate imputing lol

#

takes a while and lots of computational power though

#

I did it on a 2gb dataset and it took weeks lmfao

hollow sentinel Oct 25, 2020, 5:16 PM

#

multivariate imputing? @austere swift

#

nanniiiiiii

austere swift Oct 25, 2020, 5:16 PM

#

basically using like machine learning to fit the data to the column based on the values of the other columns
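
A rough sketch of what that can look like with scikit-learn's `IterativeImputer`, which models each column with missing values as a function of the other columns. The toy array is made up, and the exact filled-in values depend on the estimator and settings, so none are shown:

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data where the second column is roughly twice the first
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)  # observed cells are kept, NaNs are predicted
print(X_filled)
```

Because a regression is refit for every incomplete column on every round, the cost grows quickly with the number of rows and columns, which fits the long runtimes mentioned above.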

hollow sentinel Oct 25, 2020, 5:16 PM

#

oh i see

austere swift Oct 25, 2020, 5:17 PM

#

https://scikit-learn.org/stable/modules/impute.html

hollow sentinel Oct 25, 2020, 5:17 PM

#

yeah there's kaggle notebooks on imputing that i've been reading

#

I try to read scikit learn doc but it gets so boring

#

@austere swift dude what is gridsearch

austere swift Oct 25, 2020, 5:19 PM

#

its basically using a bunch of given values in a "grid" and testing them out

sleek rampart Oct 25, 2020, 5:19 PM

#

does anyone know why I am getting the error "ValueError: Input 0 of layer sequential_109 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: [None, 240, 3]"

austere swift Oct 25, 2020, 5:19 PM

#

and seeing what works best
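
A small sketch of that idea with scikit-learn's `GridSearchCV` and an SVM. The iris dataset and the particular values in the grid are just assumptions for the example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The "grid": every combination of these hyperparameter values is tried
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the combination with the best mean CV score
print(search.best_score_)
```

The search does not change how the SVM works. It only automates picking hyperparameters such as `C` and `kernel`, which would otherwise be guessed by hand.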

#

@sleek rampart code?

hollow sentinel Oct 25, 2020, 5:20 PM

#

^

sleek rampart Oct 25, 2020, 5:20 PM

#

it is a jupyter notebook

hollow sentinel Oct 25, 2020, 5:24 PM

#

so you need to do a grid search before you do a SVM

#

but why

dawn vault Oct 25, 2020, 5:30 PM

#

@hollow sentinel pretty good link on tidy data https://tomaugspurger.github.io/modern-5-tidy

datas-frame – Modern Pandas (Part 5): Tidy Data

Posts and writings by Tom Augspurger

hollow sentinel Oct 25, 2020, 5:39 PM

#

@dawn vault thanks man

#

fillna() is a great method for cleaning data too

sleek rampart Oct 25, 2020, 5:53 PM

#

did you guys find a solution to my problem

austere swift Oct 25, 2020, 6:57 PM

#

@sleek rampart we can't help if we can't see the code lmao

#

!paste just put it here

arctic wedgeBOT Oct 25, 2020, 6:58 PM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

novel flame Oct 25, 2020, 7:17 PM

#

Hello everyone, please i have a general question.
why in Batch Normalization we use batch statistics instead of global statistics? wouldn't that be more accurate?
thanks in advance

heady hatch Oct 25, 2020, 7:35 PM

#

Hey guys quick question about long form text generation.

I'm trying to generate long form text but I don't have that much data, only about 146 points, which I'm using to fine-tune GPT-2.

Is this feasible? I can generate more data by splitting the stories up to per line.

Any advice on how to go about this, the constraint is the data. The amount of data is limited.

sleek rampart Oct 25, 2020, 7:59 PM

#

@austere swift screenshot or the code itself

austere swift Oct 25, 2020, 8:00 PM

#

just paste it here

#

!paste

sleek rampart Oct 25, 2020, 8:02 PM

#

not even 50% finished with the project, so the code is not that big yet

#

the code is not the big file, it is the data what is big

#

it is actual picture data

#

if you need more code, just tell me

#

but that is where the main problem is tho

signal furnace Oct 25, 2020, 8:34 PM

#

📎 unknown.png

#

how do i fix it

finite wasp Oct 25, 2020, 8:38 PM

#

xrange is Python 2 only, in Python 3 it's just range.

#

@signal furnace
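
The screenshot isn't available, so this is only a generic sketch of the fix being described: wherever Python 2 code calls `xrange`, Python 3 uses `range`, which is already lazy.

```python
# Python 3: range() does not build a list up front, just like Python 2's xrange()
total = 0
for i in range(5):
    total += i

print(total)
```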

wild kestrel Oct 25, 2020, 9:00 PM

#

does anyone know how to use dataframes?

hollow sentinel Oct 25, 2020, 9:04 PM

#

@wild kestrel pandas?

wild kestrel Oct 25, 2020, 9:04 PM

#

bingo

hollow sentinel Oct 25, 2020, 9:05 PM

#

yeah that's basically everyone here haha

#

other than me

#

i'm a noob

wild kestrel Oct 25, 2020, 9:06 PM

#

oof

#

anyways

#

i got it

#

but

#

thx for responding 🙂

hollow sentinel Oct 25, 2020, 9:06 PM

#

yeah no problem i'm like always here

sleek rampart Oct 25, 2020, 9:12 PM

#

any solutions yet?

hollow sentinel Oct 25, 2020, 9:13 PM

#

@sleek rampart sorry man idk neural networks yet

sleek rampart Oct 25, 2020, 9:14 PM

#

all good bro, what are you working on?

hollow sentinel Oct 25, 2020, 9:14 PM

#

support vector machines

sleek rampart Oct 25, 2020, 9:14 PM

#

i see

austere swift Oct 25, 2020, 9:14 PM

#

@sleek rampart what line is the error on

hollow sentinel Oct 25, 2020, 9:15 PM

#

so is GridSearchCV used to change the hyperparameters of an algorithm

sleek rampart Oct 25, 2020, 9:15 PM

#

📎 unknown.png

austere swift Oct 25, 2020, 9:16 PM

#

could you just copy paste your full code in this site?

#

!paste

sleek rampart Oct 25, 2020, 9:16 PM

#

it is useless without the data

austere swift Oct 25, 2020, 9:17 PM

#

yeah but i can barely read the screenshots

sleek rampart Oct 25, 2020, 9:19 PM

#

https://paste.pythondiscord.com/gehaxekolu.makefile

austere swift Oct 25, 2020, 9:20 PM

#

also what are the shapes of the input data?

#

oh wait i think i see the issue

#

all the convolution layers have the same input shape, but whenever it goes through a convolution layer the shape changes

#

so the output from the first one is different from the input

sleek rampart Oct 25, 2020, 9:22 PM

#

it is for the training data 192,240,3 and for the test it is 48, 240,3

austere swift Oct 25, 2020, 9:22 PM

#

also each sample in your training and testing data has to be the same shape (the number of samples can differ)

#

you can't train it on one input then test it on an input with a different shape

hollow sentinel Oct 25, 2020, 9:23 PM

#

oof i didn't know that

#

you learn something new every day

sleek rampart Oct 25, 2020, 9:23 PM

#

i see, that's interesting

#

I always thought the training data does not have to be the same shape

#

maybe it does for image

austere swift Oct 25, 2020, 9:23 PM

#

@hollow sentinel well if you think about how neural networks actually work, it doesnt really make sense, because youd need a different number of connections with a different input shape

sleek rampart Oct 25, 2020, 9:24 PM

#

the code right here is not using testing data yet , it is only using training data

#

not done with the code

austere swift Oct 25, 2020, 9:24 PM

#

yeah but the issue I said earlier

#

convolution layers output different shapes than they input
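
To make the shape change concrete, here is a small sketch of the standard output-size formula for a convolution along one axis. The 240x240 input and the 3x3 kernels are assumptions for the example, since the pasted model isn't shown here:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length along one spatial axis of a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


# Three stacked 3x3 convolutions with no padding ("valid") on a 240x240 input
h, w = 240, 240
for layer in range(3):
    h = conv_output_size(h, kernel=3)
    w = conv_output_size(w, kernel=3)
    print(f"after conv {layer + 1}: {h}x{w}")
```

Each unpadded 3x3 layer trims two pixels per axis, so only the first layer ever sees the original input shape.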

sleek rampart Oct 25, 2020, 9:25 PM

#

yea

austere swift Oct 25, 2020, 9:25 PM

#

also, input_shape is not a required argument, you can just remove it

#

if you dont specify it itll just set it's shape to the shape of the output of the previous layer

sleek rampart Oct 25, 2020, 9:27 PM

#

yea, I did that and it was still giving me errors thats why I put it there

austere swift Oct 25, 2020, 9:29 PM

#

what error did you get when you did that

#

did it say that the input shape was too small or that it would be negative or something?

#

if so I know what you're talking about

sleek rampart Oct 25, 2020, 9:30 PM

#

📎 unknown.png

austere swift Oct 25, 2020, 9:30 PM

#

ohhh

#

you didnt do model.compile()

sleek rampart Oct 25, 2020, 9:30 PM

#

owwww

#

completely forgot about that

austere swift Oct 25, 2020, 9:30 PM

#

wait actually i think you did

sleek rampart Oct 25, 2020, 9:31 PM

#

yep, I did

austere swift Oct 25, 2020, 9:31 PM

#

yeah you did

sleek rampart Oct 25, 2020, 9:31 PM

#

ValueError: Input 0 of layer sequential_274 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: [None, 240, 3]

#

yea still the same error

austere swift Oct 25, 2020, 9:31 PM

#

ohhh ik what it is

#

you get that error on model.summary()

sleek rampart Oct 25, 2020, 9:32 PM

#

📎 unknown.png

#

still does the same error

austere swift Oct 25, 2020, 9:33 PM

#

i mean remove the input shape argument

sleek rampart Oct 25, 2020, 9:34 PM

#

same error

austere swift Oct 25, 2020, 9:34 PM

#

what error

sleek rampart Oct 25, 2020, 9:34 PM

#

📎 unknown.png

#

"ValueError: Input 0 of layer sequential_384 is incompatible with the layer: : expected min_ndim=4, found ndim=3. Full shape received: [None, 240, 3]"

austere swift Oct 25, 2020, 9:36 PM

#

try putting the input shape only on the first conv layer

#

not the second or third

sleek rampart Oct 25, 2020, 9:36 PM

#

yea, still does the same error

#

tried that already

bronze barn Oct 25, 2020, 9:37 PM

#

Any suggestions for a very dispersed histogram which also spikes high? I can't visualise this data in any meaningful way atm

📎 Dz9G7QRqIGVfX1JNcC36B3Tdk93EDCxmRGkyHHz8gSQ2az9MykqRJGO6S1CDDXZIaZLhLUoMMd0lqkOEuSQ0y3CWpQf8fL4s3War.png

wild kestrel Oct 25, 2020, 9:37 PM

#

does anyone know any ez

#

panda methods?

#

that i can demonstrate?

austere swift Oct 25, 2020, 9:37 PM

#

@sleek rampart put the input shape as (-1, 240, 240, 1)

sleek rampart Oct 25, 2020, 9:38 PM

#

let's see

#

was thinking that too

#

I think it will give a 5D one

#

"ValueError: Error converting shape to a TensorShape: Dimension -1 must be >= 0"

#

this the error this time

austere swift Oct 25, 2020, 9:39 PM

#

oh i meant reshape the data like that

#

the input data

#

sorry that was confusing shouldve worded that better

hollow sentinel Oct 25, 2020, 9:40 PM

#

@wild kestrel https://www.dataschool.io/python-pandas-tips-and-tricks/

Data School

100 pandas tricks to save you time and energy

Below you'll find 100 tricks that will save you time and energy every time you use pandas! These the best tricks I've learned from 5 years of teaching the pandas library. "Soooo many nifty little tips that will make my life so much easier!" - C.K. "...

#

i worship this dataschool guy he's great with pandas

sleek rampart Oct 25, 2020, 9:40 PM

#

lets see and keep input_shape the same right

#

I think I did try that, was a slack post

#

"Found input variables with inconsistent numbers of samples: [3, 192]"

#

error this time

austere swift Oct 25, 2020, 9:46 PM

#

whats the shape of the input data now

sleek rampart Oct 25, 2020, 9:47 PM

#

After: (3, 192, 240, 1)

austere swift Oct 25, 2020, 9:49 PM

#

oh you have to reshape your labels too btw

#

did you do that?

wild kestrel Oct 25, 2020, 9:50 PM

#

hey can anyone help

austere swift Oct 25, 2020, 9:51 PM

#

ask your question

sleek rampart Oct 25, 2020, 9:51 PM

#

not yet

austere swift Oct 25, 2020, 9:52 PM

#

yeah reshape your labels

wild kestrel Oct 25, 2020, 9:52 PM

#

import pandas as pd
list = []
enter_number = int(input("Enter number of animals you want: "))
for i in range(enter_number):
    animal_input = input("Enter an animal: ")
    list.append(animal_input)

df = pd.DataFrame({'animal': [list]})
print(df)

#

dont have time to make it into python but

#

animal
0 [tiger, lion, mistress, dog]

#

see this output?

#

i wanna make it so that lion is below tiger

#

mistress below lion

#

and etc

#

how can i do that?

#

@austere swift

austere swift Oct 25, 2020, 9:53 PM

#

what do you mean

#

what does the df look like now?

wild kestrel Oct 25, 2020, 9:54 PM

#

ok so

#

animal
0 [tiger, lion, mistress, dog]

austere swift Oct 25, 2020, 9:54 PM

#

oh i think ik why

#

its cus you put brackets around list

wild kestrel Oct 25, 2020, 9:55 PM

#

so

#

how do i make it so

austere swift Oct 25, 2020, 9:55 PM

#

remove the brackets

wild kestrel Oct 25, 2020, 9:55 PM

#

i cant describe it lol

#

like

#

its ascending

#

i think

#

thats the word

#

ok

austere swift Oct 25, 2020, 9:55 PM

#

df = pd.DataFrame({'animal': [list]}) so change this line to this df = pd.DataFrame({'animal': list})

#

because the brackets will make it think you only want one line which contains list, when you want separate lines for each value in the list
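
A quick side-by-side sketch of the two spellings, using a hard-coded list instead of `input()` so it runs on its own (the variable name `animals` is just for the example):

```python
import pandas as pd

animals = ["tiger", "lion", "dog"]

# List wrapped in another list: the whole list lands in a single cell
one_row = pd.DataFrame({"animal": [animals]})

# Plain list: one row per item
many_rows = pd.DataFrame({"animal": animals})

print(one_row.shape)
print(many_rows)
```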

wild kestrel Oct 25, 2020, 9:56 PM

#

oh thank god

#

it worked

#

thx

#

man

#

ah ic

#

makes sense

#

how do i refer to a number input?

sleek rampart Oct 25, 2020, 10:12 PM

#

still getting the " Found input variables with inconsistent numbers of samples: [3, 192]" error
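
One possible reading of the shapes reported above, sketched with NumPy only. It assumes the 192 in `(192, 240, 3)` is the number of samples (which would match 192 labels) and that the real data and model are otherwise as described, so treat it as a guess rather than a confirmed fix:

```python
import numpy as np

X = np.zeros((192, 240, 3), dtype=np.float32)  # shape reported in the chat
y = np.zeros(192)                              # assumed: one label per sample

# Conv2D wants (samples, height, width, channels).
# Adding a trailing channel axis keeps the sample axis first.
X4 = np.expand_dims(X, axis=-1)
print(X4.shape)

# Reshaping with -1 in front moves the sample axis instead,
# which leaves 3 "samples" against 192 labels
wrong = X.reshape(-1, 192, 240, 1)
print(wrong.shape)
```

If that reading is right, the labels need no reshaping. The "inconsistent numbers of samples: [3, 192]" message would come from the first axis of the data no longer matching the number of labels.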

wild kestrel Oct 25, 2020, 10:34 PM

#

hey can anyone help

#

last question

wild kestrel Oct 25, 2020, 11:43 PM

#

nvm

hollow sentinel Oct 26, 2020, 12:26 AM

#

hey guys k means clustering is unsupervised right

#

so that means it has no labels

#

but what does it mean that there's no labels

velvet thorn Oct 26, 2020, 12:28 AM

#

but what does it mean that there's no labels
@hollow sentinel hm.

#

okay, so

#

in supervised learning

#

you're trying to build a relationship between features and target, right?

#

which is a kind of pattern.

#

in (this kind of) unsupervised learning, you're trying to find patterns inherent in the features that distinguish groups of them "enough", for some definition of "enough"
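
A tiny sketch of what "no labels" means in practice, assuming scikit-learn's `KMeans` and six made-up points. Only the features `X` are passed in; the group numbers are something the algorithm invents:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of points; there is no y / target array anywhere
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster ids are arbitrary (0/1 could be swapped); only the grouping matters
print(km.labels_)
```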

hollow sentinel Oct 26, 2020, 12:34 AM

#

ok @velvet thorn that's a pretty good explanation

#

https://youtu.be/rHeaoaiBM6Y

YouTube

Eye on Tech

Supervised vs. Unsupervised Machine Learning: What's the Difference?

The most common approaches to machine learning training are supervised and unsupervised learning -- but which is best for your purposes? Watch to learn more about the differences between supervised and unsupervised machine learning and how each approach is used.

Machine learn...

#

helpful video i found

#

maybe the Andrew Ng course will be more helpful in explaining the theory

nova shuttle Oct 26, 2020, 12:53 AM

#

📎 66166696259366912.png

agile wing Oct 26, 2020, 1:01 AM

#

yeah im actually studying andrew ng now haha

#

funny ur saying that, im even streaming about it now lol

hollow sentinel Oct 26, 2020, 1:05 AM

#

@agile wing hahahahaha yeah I've heard a lot about him hoping he can explain stuff even better than Jose Portilla

agile wing Oct 26, 2020, 1:06 AM

#

oh yes for sure

#

Jose Portilla is more like... here's what you do, but Andrew delves into the how it works

#

in algorithms

hollow sentinel Oct 26, 2020, 1:07 AM

#

yeah but I'll definitely look at Portilla's stuff when I'm confused

agile wing Oct 26, 2020, 1:07 AM

#

literally I have enrolled in jose's udemy course, and he just tells me to read ISLR... and then do the work

#

yeah

hollow sentinel Oct 26, 2020, 1:07 AM

#

hahaha yeah during the machine learning section he always says to read the intro to stats learning

#

I don't find reading the book helpful

#

I like doing projects more

agile wing Oct 26, 2020, 1:08 AM

#

yep

#

andrew ng's was the course that introduced me to gradient descent for optimizing the cost function. ISLR, especially in the first algorithms (the linear ones), mostly talks about iteratively reweighted least squares.... i was so confused at first why they don't talk about one or the other across courses until way later I found out there's more than one optimization technique to get your optimal parameters in supervised learning. I'm still learning.
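
A small sketch of that point: two different optimization routes reaching the same parameters. It uses plain linear regression on made-up data for simplicity, comparing a closed-form least-squares solve with batch gradient descent. Iteratively reweighted least squares itself is not shown, and the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=100)  # made-up line plus noise
X = np.column_stack([np.ones_like(x), x])          # intercept column + slope column

# Route 1: closed-form least squares
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Route 2: batch gradient descent on the mean squared error
theta_gd = np.zeros(2)
lr = 0.5
for _ in range(5000):
    grad = 2 / len(y) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print(theta_ls)
print(theta_gd)
```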

hollow sentinel Oct 26, 2020, 1:11 AM

#

haha i'm always learning

#

stay humble

agile wing Oct 26, 2020, 1:11 AM

#

yea

hollow sentinel Oct 26, 2020, 1:12 AM

#

i still need to refine my pandas skills

#

to clean data

agile wing Oct 26, 2020, 1:12 AM

#

yeah

#

me too

twilit siren Oct 26, 2020, 2:41 AM

#

p2 = p.groupby('colN').get_group('value')
p2 = p2.drop_duplicates(subset='colN1', keep='first')

I'm using the code above to pull out one group of the data and then drop its duplicates, but after removing the duplicates I only get back that group's rows (minus the duplicates), not the rest of the data.

(edit I started a #help-corn so if you have an answer go there)

#

How can I get that ungrouped data back?

#

This is using the pandas module.
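
One way this could be approached, sketched on made-up data: find the duplicates inside the chosen group, then drop only those rows from the full frame. `colN`, `colN1` and `'value'` are the placeholder names from the question, and this assumes the goal is to keep every other group untouched:

```python
import pandas as pd

p = pd.DataFrame({
    "colN":  ["value", "value", "value", "other", "other"],
    "colN1": ["a", "a", "b", "a", "a"],
})

in_group = p["colN"] == "value"

# Flag repeats of colN1 within the selected group only
dupes = p[in_group].duplicated(subset="colN1", keep="first")

# Drop just those rows from the full frame; rows of other groups stay as they were
result = p.drop(index=dupes[dupes].index)
print(result)
```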

serene scaffold Oct 26, 2020, 2:43 AM

#

I have a dataframe that's structured like this

gold  sys_a  sys_b
A     A      B
B     B      B
C     B      A

where gold is the correct class and sys_a and sys_b are predictions from two systems. I need to calculate if the difference between the two systems is statistically significant. As far as I can tell, the solution involves randomly picking sys_a or sys_b for every row and then calculating the binary classification scores, and doing this n times for some n in the tens of thousands.

#

I think I can figure out how to do that, but I'm not sure what to do with those scores once I have them. I assume something to do with how they're distributed.

#

the null hypothesis being that sys_a and sys_b are exactly the same and there's no distribution.
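
The procedure described here resembles a paired approximate randomization test. A rough sketch, with two loud assumptions: it scores plain accuracy instead of F1 to stay short, and the toy data simply repeats the three example rows above:

```python
import numpy as np


def randomization_test(gold, sys_a, sys_b, n_rounds=10_000, seed=0):
    """Paired approximate randomization test on the accuracy difference."""
    gold, sys_a, sys_b = map(np.asarray, (gold, sys_a, sys_b))
    rng = np.random.default_rng(seed)
    correct_a = (sys_a == gold).astype(float)
    correct_b = (sys_b == gold).astype(float)
    observed = abs(correct_a.mean() - correct_b.mean())

    hits = 0
    for _ in range(n_rounds):
        # For every row, flip a coin to decide whether the two systems swap outputs
        swap = rng.random(len(gold)) < 0.5
        shuf_a = np.where(swap, correct_b, correct_a)
        shuf_b = np.where(swap, correct_a, correct_b)
        if abs(shuf_a.mean() - shuf_b.mean()) >= observed - 1e-12:
            hits += 1
    # Share of shuffles at least as extreme as the real difference (with +1 smoothing)
    return (hits + 1) / (n_rounds + 1)


gold = ["A", "B", "C"] * 50
sys_a = ["A", "B", "B"] * 50
sys_b = ["B", "B", "A"] * 50
p_value = randomization_test(gold, sys_a, sys_b, n_rounds=2000)
print(p_value)
```

What happens with the scores afterwards is the last line of the function: the p-value is the fraction of shuffled runs whose difference is at least as large as the one actually observed.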

slate scroll Oct 26, 2020, 2:55 AM

#

I don't think I've ever seen a setup that way since the null hypothesis is not really null. That null hypothesis implies that two models have somehow independently arrived at the same predictions?

Why not just compare model performance (and tradeoffs) vs the gold standard and choose one that way?

#

I mean I guess you could set it up that way but the bootstrapping seems unnecessary unless your sample size is very small. If you're going to bootstrap for additional significance you should bootstrap inputs to the model though if at all possible.

serene scaffold Oct 26, 2020, 2:59 AM

#

@slate scroll this isn't something I know a whole lot about. My advisor just told me that I'm going to be asked if the results are statistically significant.

#

my project is just augmenting the training data. I don't understand the actual ML component very well.

slate scroll Oct 26, 2020, 3:01 AM

#

Ahh ok, I always come at things from a practical perspective as I'm an applied MLE.

#

So I would compare a vs gold and b vs gold too, but it sounds like you're anticipating an a vs b significance proof as well, correct?

serene scaffold Oct 26, 2020, 3:02 AM

#

for the gold, sys_a, and sys_b that I have, sys_b has a significantly higher F1 score, which is what I wanted

slate scroll Oct 26, 2020, 3:02 AM

#

Careful with that word, "significantly" 🙂

serene scaffold Oct 26, 2020, 3:02 AM

#

well

slate scroll Oct 26, 2020, 3:03 AM

#

Unless you can prove it

serene scaffold Oct 26, 2020, 3:03 AM

#

I hope it is

#

I think it's like 7% higher and there were, let's see how many test instances

#

25 thousand

slate scroll Oct 26, 2020, 3:05 AM

#

Looks like for comparing multi-class classifiers the suggested test is a Stuart-Maxwell test https://www.rdocumentation.org/packages/DescTools/versions/0.99.37/topics/StuartMaxwellTest

StuartMaxwellTest function | R Documentation

This function computes the marginal homogeneity test for a (k \times k) matrix of assignments of objects to k categories or an (n \times 2 \times k) matrix of category scores for n data objects by two raters. The statistic is distributed as chi-square with k-1 degrees of ...

serene scaffold Oct 26, 2020, 3:05 AM

#

you tricked me into reading R docs

slate scroll Oct 26, 2020, 3:06 AM

#

Although that may be overkill depending on what you're after, you could do k-fold Wilcoxon rank-sum tests.

#

This is probably a better resource for the first approach: https://en.wikipedia.org/wiki/McNemar's_test

#

The Stuart-Maxwell is an extension of that to >2 classes
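
For the two-system case, an exact McNemar test can be sketched with only the standard library. It looks just at the rows where exactly one system is right. The helper name and the toy predictions are made up for illustration:

```python
from math import comb


def mcnemar_exact(gold, sys_a, sys_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect outcomes."""
    rows = list(zip(gold, sys_a, sys_b))
    b = sum(pa == g and pb != g for g, pa, pb in rows)  # only sys_a is right
    c = sum(pa != g and pb == g for g, pa, pb in rows)  # only sys_b is right
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree on correctness
    # Under the null, each discordant row favours either system with probability 0.5
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gold = ["A", "B"] * 10
sys_a = list(gold)              # right on all 20 rows
sys_b = ["C"] * 10 + gold[10:]  # wrong on the first 10 rows
p = mcnemar_exact(gold, sys_a, sys_b)
print(p)
```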

serene scaffold Oct 26, 2020, 3:07 AM

#

my coworker did something similar a year ago but I couldn't get the code for it to work (hence my reinvention of the wheel). There was something about a binomial test.

slate scroll Oct 26, 2020, 3:08 AM

#

A binomial test will assume independence which is false for k-fold analysis. Therefore, it is invalid. A Wilcoxon rank-sum test is more appropriate.

#

Let me know if you'd like me to explain any topics further, I'm trying to stay succinct.

serene scaffold Oct 26, 2020, 3:09 AM

#

we didn't k-fold cross validate

slate scroll Oct 26, 2020, 3:10 AM

#

To compare the models, you'll need to train multiple models. The second approach I mentioned above is the same as what you suggested: randomly sampling data and training models. That approach is not independent.

serene scaffold Oct 26, 2020, 3:10 AM

#

my project entailed augmenting the training data and not at all modifying the test data, so doing k-fold cv wouldn't quite work

slate scroll Oct 26, 2020, 3:10 AM

#

That is what I'm calling k-fold

#

it's not k-fold cv, but k-fold comparison?

#

It's all just terminology, the point is I don't see any way to get a distribution that's independent, so you can't use any test that assumes independence. The only way would be to maybe split your data without replacement. But even then you're assuming independence between all samples in your data which is almost never true IRL.

serene scaffold Oct 26, 2020, 3:14 AM

#

I'll have to ask my advisor, I guess.

slate scroll Oct 26, 2020, 3:15 AM

#

IMO the Stuart Maxwell test will definitely be the best since it's designed for this exact scenario.

serene scaffold Oct 26, 2020, 3:17 AM

#

Great, I'll start reading about that right now

hollow sentinel Oct 26, 2020, 3:31 AM

#

me not understanding a single thing I just read

oblique socket Oct 26, 2020, 3:33 AM

#

Looking at Titanic Dataset, is this the best way to find average age of survivor by gender? Is it efficient to use slicing notation?

df[df.Survived == 1].groupby('Sex')['Age'].mean()
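
A small sketch on a made-up stand-in for the Titanic columns. The boolean-mask version from the question is a common idiom (it is boolean indexing rather than slicing), and grouping on both columns is one alternative that also reports non-survivors:

```python
import pandas as pd

# Tiny invented stand-in for the Titanic columns used in the question
df = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 1],
    "Sex": ["female", "male", "male", "female", "female", "male"],
    "Age": [20.0, 30.0, 40.0, 40.0, 50.0, 50.0],
})

# Filter first, then group: survivors only
by_mask = df[df.Survived == 1].groupby("Sex")["Age"].mean()

# Group on both columns: one pass, every Survived/Sex combination
all_groups = df.groupby(["Survived", "Sex"])["Age"].mean()

print(by_mask)
print(all_groups)
```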

serene scaffold Oct 26, 2020, 3:36 AM

#

@hollow sentinel "Data Science from Scratch" is a great O'Reilly book to get started. It was written for an older version of Python but I don't think it's noticeable.

hollow sentinel Oct 26, 2020, 3:54 AM

#

@serene scaffold have you tried pandas cookbook

serene scaffold Oct 26, 2020, 3:54 AM

#

no

hollow sentinel Oct 26, 2020, 3:54 AM

#

I'm not a big fan of any book, I like Udemy, Coursera, and edX courses mostly

#

but I will try it out

lapis sequoia Oct 26, 2020, 4:41 AM

#

https://www.kdnuggets.com/2018/09/essential-math-data-science.html#.X5ZTSntcr-o.telegram
Essential Math for Data Science: ‘Why’ and ‘How’ - KDnuggets

KDnuggets

It always pays to know the machinery under the hood (even at a high level) than being just the guy behind the wheel with no knowledge about the car.

molten ridge Oct 26, 2020, 4:51 AM

#

I am a School Student (in 9th Standard as of now)

Can i Do Data Science and then Machine learning if i follow the Guide above?

I can do these topics (i got national rank about 150 out of 100K in an Exam)

#

I am already Familiar with most of the Stuff in the First Section of the Guide

#

📎 unknown.png

lone osprey Oct 26, 2020, 5:03 AM

#

Hlo

molten ridge Oct 26, 2020, 5:03 AM

#

hi

lone osprey Oct 26, 2020, 5:04 AM

#

Which exam?

molten ridge Oct 26, 2020, 5:04 AM

#

it was a Scholarship Exam of a Coaching

#

they ask the best Questions

lone osprey Oct 26, 2020, 5:04 AM

#

Ohh

molten ridge Oct 26, 2020, 5:04 AM

#

higher than the Standard set by Government

lone osprey Oct 26, 2020, 5:04 AM

#

U know python?

molten ridge Oct 26, 2020, 5:04 AM

#

yes

#

quite alot

lone osprey Oct 26, 2020, 5:05 AM

#

Nice

#

Then, u can learn data science and ml

#

But maths part is difficult for u

#

U know calculus?

molten ridge Oct 26, 2020, 5:05 AM

#

I dont Know but I can Do Maths

lone osprey Oct 26, 2020, 5:06 AM

#

There's basic calculus, atleast u can try that

#

But multivariate calculus, u have to come 12th or clg

molten ridge Oct 26, 2020, 5:06 AM

#

Is the Calculus needed of School level Or Higher (Graduation) level

lone osprey Oct 26, 2020, 5:06 AM

#

Graduation lvl is multivariate calculus

#

Actually I have never learnt advanced maths of ml yet

molten ridge Oct 26, 2020, 5:07 AM

#

do Data Science need advanced Maths?

lone osprey Oct 26, 2020, 5:07 AM

#

Actually, I have no idea on which maths to which parts

#

I know we need basic algebra for linear regression, etc..

#

Understanding how it works is advanced maths I think

molten ridge Oct 26, 2020, 5:09 AM

#

I am from india (CBSE Board)

#

ok so I will Try to do the Maths and then do Data Science and ML

#

I guess we should move to Dm.....

shell berry Oct 26, 2020, 5:55 AM

#

How can you use word encodings as a feature in a model?

#

You can't really use a bag of words approach there, right?

#

If I want to do something like intent classification etc

velvet thorn Oct 26, 2020, 6:04 AM

#

How can you use word encodings as a feature in a model?
@shell berry word encodings are a vector

#

what kind of model are you thinking of?

shell berry Oct 26, 2020, 6:05 AM

#

@velvet thorn may I please DM you?

velvet thorn Oct 26, 2020, 6:05 AM

#

nope, sorry

#

just talk here

shell berry Oct 26, 2020, 6:06 AM

#

I have a SVC I'm inputting bag of words vectors into, outputting 40 different classes

#

So each vector is just the size of the vocabulary

#

And I have (# of sentences) of vectors

#

And I'm using tf-idf
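
The setup described so far (tf-idf bag-of-words vectors into an SVC) might look roughly like this in scikit-learn. The sentences, the two intent names and the linear kernel are made up for the sketch, since the real dataset and settings are not shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy data: two intents instead of the ~40 in the real task
sentences = [
    "I want cake", "I want pizza", "order me a burger",
    "I wanna watch the new movie", "play a film tonight", "watch a movie with me",
]
intents = ["food_order"] * 3 + ["movie_watch"] * 3

# Each sentence becomes one tf-idf vector as long as the vocabulary
clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
clf.fit(sentences, intents)

print(clf.predict(["I want a burger", "watch the new movie"]))
```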

velvet thorn Oct 26, 2020, 6:07 AM

#

hm.

shell berry Oct 26, 2020, 6:07 AM

#

So that's working decently

velvet thorn Oct 26, 2020, 6:07 AM

#

go on

shell berry Oct 26, 2020, 6:07 AM

#

Getting ~80% f-score

#

Now if I want to use word embeddings to get better performance

#

How would I "fit" it into my model? An embedding of one word is 1x300

#

So the size of a vector of one sentence is 300x(# of words)

#

I can't really use a constant sized bag of words approach here

#

How do I represent the "feature"? Does that make sense?

velvet thorn Oct 26, 2020, 6:08 AM

#

yup

#

you could pad that

#

or you could look into models that can handle variable length sequences in one way or another

#

for example, a simple RNN

shell berry Oct 26, 2020, 6:09 AM

#

Not allowed to use any DL for my assignment 🙂

#

If I pad them won't I have like insanely huge vectors, possible over 10,000 dimensions

#

And wouldn't that make it really harder for a model to get fit?

#

Is there a way I could group similar words together or something into a model with a fixed size amount? i.e every index is a word, and the index value is "How many similar words are there", or something

velvet thorn Oct 26, 2020, 6:12 AM

#

And wouldn't that make it really harder for a model to get fit?
@shell berry yes

#

Is there a way I could group similar words together or something into a model with a fixed size amount? i.e every index is a word, and the index value is "How many similar words are there", or something
@shell berry let me try to process that

#

a bit sleepy

shell berry Oct 26, 2020, 6:13 AM

#

I only have ~3k examples so I don't want too many dimensions

#

ok, thanks

#

Maybe I could average each sentence? If the word embedding is of size 300, I could simply have each index of the final sentence's embedding be the average of the embedding of the words within it or something, idk

#

I feel like I'd lose a lot of info like that

velvet thorn Oct 26, 2020, 6:15 AM

#

okay, honestly I haven't done NLP in a long time

#

and it was never my specialty

#

and I never did it with SVCs

#

so I don't want to tell you to do something

#

that might end up being wrong

#

Maybe I could average each sentence? If the word embedding is of size 300, I could simply have each index of the final sentence's embedding be the average of the embedding of the words within it or something, idk
@shell berry I feel like

#

this would throw away information

#

(a fair bit)

shell berry Oct 26, 2020, 6:16 AM

#

Yeah hmm

#

And np, happy to experiment with any ideas you have

velvet thorn Oct 26, 2020, 6:16 AM

#

like

#

most of my NLP experience

#

is with DL

#

which is out of the question, right

shell berry Oct 26, 2020, 6:16 AM

#

Yup lol

#

But for rn I think my features are more important than my model

#

Im just using tf-idf (pos tags, bigrams, trigrams didnt help at all)

velvet thorn Oct 26, 2020, 6:19 AM

#

Im just using tf-idf (pos tags, bigrams, trigrams didnt help at all)
@shell berry yeah, those are iffy at best AFAIK

shell berry Oct 26, 2020, 6:19 AM

#

Any ideas for what else to use?

velvet thorn Oct 26, 2020, 6:20 AM

#

not really, sorry

shell berry Oct 26, 2020, 6:20 AM

#

One hot encoding was pretty meh too

#

Np

velvet thorn Oct 26, 2020, 6:20 AM

#

maybe I'll come back in a few hours

#

food coma right now

shell berry Oct 26, 2020, 6:20 AM

#

lol

velvet thorn Oct 26, 2020, 6:20 AM

#

but there are other NLP experts here so hopefully one comes along

shell berry Oct 26, 2020, 6:21 AM

#

awesome, thx again

ripe forge Oct 26, 2020, 6:22 AM

#

What are you trying to predict?

#

(and uh, I'm not an nlp expert either)

shell berry Oct 26, 2020, 6:22 AM

#

@ripe forge Intent classification from sentences

#

"I want cake" -> food_order

#

"I wanna watch the new borat" -> movie_watch

#

Theres ~30-40 classes

ripe forge Oct 26, 2020, 6:23 AM

#

You can average out the word vectors to get sentence vectors of same dimension as word. It loses information but may still retain enough to give good performance.
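
A minimal sketch of that averaging, with made-up 4-dimensional vectors standing in for real 300-dimensional embeddings (the dictionary and the helper name are invented for the example):

```python
import numpy as np

# Invented "embeddings"; real ones would come from a pretrained model
embeddings = {
    "i": np.array([0.1, 0.0, 0.2, 0.0]),
    "want": np.array([0.0, 0.5, 0.1, 0.0]),
    "cake": np.array([0.9, 0.1, 0.0, 0.3]),
}


def sentence_vector(sentence, embeddings, dim=4):
    """Average the vectors of known words; zeros if no word is known."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


vec = sentence_vector("I want cake", embeddings)
print(vec.shape)  # same size as one word vector, whatever the sentence length
```

Every sentence ends up with the same fixed length, so the vectors can go into an SVC directly. Word order is lost in the average, which is the information loss being discussed.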

shell berry Oct 26, 2020, 6:23 AM

#

Yeah but the problem is

velvet thorn Oct 26, 2020, 6:23 AM

#

You can average out the word vectors to get sentence vectors of same dimension as word. It loses information but may still retain enough to give good performance.
@ripe forge yeah, I was iffy about this

#

maybe if you remove stop words?

shell berry Oct 26, 2020, 6:24 AM

#

"I dont wanna watch it" and "I wanna watch the movie"

ripe forge Oct 26, 2020, 6:24 AM

#

And if that doesn't work, you can use only top tf idf words and average them out

shell berry Oct 26, 2020, 6:24 AM

#

youre gonna lose the info of "dont"

#

@velvet thorn my current model gives worse performance if I remove stopwords

#

I was also iffy on removing the more sparse words because again, some rare words could be useful

velvet thorn Oct 26, 2020, 6:24 AM

#

possible

ripe forge Oct 26, 2020, 6:25 AM

#

My suggestion is, don't trust your gut when it comes to word vectors and sentence vectors. Actually try it out. You will often be surprised.

shell berry Oct 26, 2020, 6:25 AM

#

but maybe the dimensionality reduction would improve it

#

hmm ok Ill try thanks 🙂

ripe forge Oct 26, 2020, 6:25 AM

#

If you're allowed to use Bert I'd suggest Bert's sentence transformers. Not sure if that's an option or not

#

No training involved, and their sentence vectors are great.

shell berry Oct 26, 2020, 6:27 AM

#

@ripe forge Can you recommend a specific library or anything for them?

ripe forge Oct 26, 2020, 6:27 AM

#

Yep, one sec

shell berry Oct 26, 2020, 6:28 AM

#

ty

ripe forge Oct 26, 2020, 6:28 AM

#

https://pypi.org/project/sentence-transformers/

PyPI

sentence-transformers

Sentence Embeddings using BERT / RoBERTa / XLM-R

#

It's slightly on the larger side dimension wise by default. 768 iirc. But maybe there's a parameter there for that, I'm not sure.

shell berry Oct 26, 2020, 6:29 AM

#

Thank you!

velvet thorn Oct 26, 2020, 6:29 AM

#

@ripe forge Can you recommend a specific library or anything for them?
@shell berry doesn't this count as DL

#

or can you just not build your own models

ripe forge Oct 26, 2020, 6:30 AM

#

That's my question to you as well Kali, because I'd definitely count it as DL. It's pretrained but it's definitely a DL model. So is pretrained dl for features allowed?

shell berry Oct 26, 2020, 6:30 AM

#

Yeah I was thinking about that but

#

I have plausible deniability for these I think

#

😛

ripe forge Oct 26, 2020, 6:31 AM

#

Also one other thought I have

#

If a lot of your classes are negation of existing classes...

#

You could do your work in two steps. First classifier only predicts the ~20 combined classes

#

And the next predicts a negation

shell berry Oct 26, 2020, 6:32 AM

#

Makes sense

ripe forge Oct 26, 2020, 6:32 AM

#

Reducing the number of classes a single model has to deal with can potentially be a good idea, and allow simpler models or features to work

shell berry Oct 26, 2020, 6:32 AM

#

The class Im mislabeling most is "other"

#

Out of the 30ish classes, I only mislabel most of them a few times

#

90% of mislabels is one class, "other"

ripe forge Oct 26, 2020, 6:32 AM

#

Oh. Other class is a rough one.

shell berry Oct 26, 2020, 6:32 AM

#

which is just so similar to the other classes

#

its like different in grammar and stuff

#

the vocab is so similar

velvet thorn Oct 26, 2020, 6:33 AM

#

you could stack classifiers, too

#

other/non-other

#

and non-other feeds into specific categories

#

other/non-other
@velvet thorn can tweak the decision boundary for this

#

this is actually a problem I worked on like a year ago

#

NLP problem too
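A rough sketch of the stacked other/non-other idea with a tunable decision boundary. The two stages are passed in as plain callables, and the `toy_*` functions are hypothetical stand-ins for trained models, only there to make the sketch runnable:

```python
from typing import Callable

def classify_intent(
    sentence: str,
    p_other: Callable[[str], float],   # stage 1: estimated probability of "other"
    intent_of: Callable[[str], str],   # stage 2: picks one of the specific intents
    other_threshold: float = 0.5,      # the tunable decision boundary
) -> str:
    """Return "other" if stage 1 is confident enough, else defer to stage 2."""
    if p_other(sentence) >= other_threshold:
        return "other"
    return intent_of(sentence)

# Hypothetical stand-ins for trained models.
def toy_p_other(sentence: str) -> float:
    words = sentence.split()
    return 0.8 if "like" in words and "would" not in words else 0.2

def toy_intent_of(sentence: str) -> str:
    return "food_order" if "sandwich" in sentence else "movie_watch"

print(classify_intent("I would like to order a sandwich", toy_p_other, toy_intent_of))
print(classify_intent("I like ordering sandwiches", toy_p_other, toy_intent_of))
```

Raising `other_threshold` makes the first stage more reluctant to say "other", so more sentences fall through to the specific classes.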

shell berry Oct 26, 2020, 6:33 AM

#

Hm thats a good idea yeah but it's still really similar

velvet thorn Oct 26, 2020, 6:33 AM

#

classifying text by national origin

shell berry Oct 26, 2020, 6:33 AM

#

"I would like to order a sandwich" -> class

#

"I like ordering sandwiches" -> other

ripe forge Oct 26, 2020, 6:34 AM

#

On a side note, the question of how to teach a model that none of my relevant classes are present is one I've never really found a satisfactory answer for.

shell berry Oct 26, 2020, 6:34 AM

#

"I want to watch a movie" -> class

ripe forge Oct 26, 2020, 6:34 AM

#

Oh that's a nasty one.

shell berry Oct 26, 2020, 6:34 AM

#

"I like movies" -> other

#

yup lol 😛

#

I could maybe hardcode in tons of words if I go through the data but ehhhhh

velvet thorn Oct 26, 2020, 6:35 AM

#

On a side note, the question of how to teach a model that none of my relevant classes are present is one I've never really found a satisfactory answer for.
@ripe forge it wouldn't work as is, right

#

because the underlying assumption is that one class is present

ripe forge Oct 26, 2020, 6:35 AM

#

If you're willing to somewhat go down that route, you could maybe use a grammar parser. I'm not sure what they're called. But spacy has one. Dependency parser I think

velvet thorn Oct 26, 2020, 6:35 AM

#

so you could either layer it on top of a model that separates out the "not present" instances, or turn it into multiclass classification

shell berry Oct 26, 2020, 6:35 AM

#

yeah but I don't think feeding that into anything other than a DL model would be very good

ripe forge Oct 26, 2020, 6:35 AM

#

And see if you could write some rules to fix some prediction in post.

velvet thorn Oct 26, 2020, 6:35 AM

#

I'm not sure if you can encode the constraint that "at most one class exists"

#

yeah but I don't think feeding that into anything other than a DL model would be very good
@shell berry indeed

#

classical ML for NLP is just 😦

ripe forge Oct 26, 2020, 6:36 AM

#

Ain't that the truth. Vectors are your friends. Use em

shell berry Oct 26, 2020, 6:37 AM

#

Its difficult intuitively to see how vectors would work here

#

Let's say I have 300 dimensions and 90% of it is similar

ripe forge Oct 26, 2020, 6:38 AM

#

You're right in that they'll get stuff wrong.

shell berry Oct 26, 2020, 6:38 AM

#

Even if I have 10 dimensions different

#

They'll be soooo far away

ripe forge Oct 26, 2020, 6:38 AM

#

But so do your current approaches, yes?

shell berry Oct 26, 2020, 6:38 AM

#

Yup 😛 lol

#

Im getting like 70% accuracy 79.998% fscore

ripe forge Oct 26, 2020, 6:38 AM

#

With vectors one lesson I've really learnt.. It's important to try them out. And try different combinations with them

shell berry Oct 26, 2020, 6:39 AM

#

Side note I'm really not enjoying how lots of data science (Im new) seems to be experimenting... Is there a time and place in data science for actual stuff with definite calculable answers like in algorithms, math, physics, etc?

ripe forge Oct 26, 2020, 6:39 AM

#

As long as you know if a problem is tough for a machine, you have reasonable expectations (you know, 100%accuracy ain't happening etc) then you're in a good spot to experiment with options

shell berry Oct 26, 2020, 6:40 AM

#

other than intuition and experience, you can't really methodically know how to get the "right" answer

velvet thorn Oct 26, 2020, 6:40 AM

#

Side note I'm really not enjoying how lots of data science (Im new) seems to be experimenting... Is there a time and place in data science for actual stuff with definite calculable answers like in algorithms, math, physics, etc?
@shell berry it depends on which part of DS you're working on.

#

if you're a more engineery kinda person, like you build DS toolsets...

ripe forge Oct 26, 2020, 6:40 AM

#

Sometimes your intuition can mislead you.

shell berry Oct 26, 2020, 6:41 AM

#

I mean if someone with 20 years of DS experience was working on my same task

#

they'd have more intuition right

velvet thorn Oct 26, 2020, 6:41 AM

#

I would hope so

shell berry Oct 26, 2020, 6:41 AM

#

They wouldnt be able to calculate the "right way" like it was an equation

velvet thorn Oct 26, 2020, 6:41 AM

#

but you never know 🤔

ripe forge Oct 26, 2020, 6:41 AM

#

I don't think so. What they'd know is "I need to try this this and this"

shell berry Oct 26, 2020, 6:42 AM

#

It feels like people try different methods and once one works, they rationalize why it works, when if it were to be another method that worked, they'd rationalize that instead

#

just looking through kaggle notebooks and stuff

ripe forge Oct 26, 2020, 6:42 AM

#

Well, okay it's not always like that. A standardized approach is establishing baselines.

#

You use 3 or 4 pretty standard untuned approaches and get some reasonable expectations of what kind of performance your data gives you upfront

#

Usually that gives you a sense of where you need to focus your attention

#

The truth is, your data directly determines how your models will perform. Data comes first. So, even if the problem statement is similar, but the data isn't, it may completely change the best model or approach

shell berry Oct 26, 2020, 6:45 AM

#

makes sense

ripe forge Oct 26, 2020, 6:45 AM

#

Thats why I'd highly recommend, develop your intuition, but always cross examine it with small tests or experiments.

shell berry Oct 26, 2020, 6:45 AM

#

ty

#

Is there a such thing as like

#

automatic feature extraction/detection

ripe forge Oct 26, 2020, 6:45 AM

#

And of course the right answer is, always use Bert 😛

shell berry Oct 26, 2020, 6:45 AM

#

lol 😛

#

It seems like what features you use is 1000x more important than what model you use

ripe forge Oct 26, 2020, 6:46 AM

#

Hm. There are a lot of automatic feature extractions out there, but it's a lot easier with structured data. And it's usually some kind of brute force

#

It seems like what features you use is 1000x more important than what model you use
@shell berry in nlp, this is sooo important especially

#

You'd always want to segment your approaches based on what domain of work you're doing. So say, automatic feature extraction... Structured data? Yes there's a few options. Unstructured? Eh depends, not really.

#

Do I recommend any of them upfront? Honestly, no.

#

Better to explore the data first, get a sense of what you're dealing with.

shell berry Oct 26, 2020, 6:50 AM

#

So like for Alexa or Siri, did the teams at Amazon and Apple probably have to manually decide what features to use?

#

To train their models?

#

or do they do completely diff things

ripe forge Oct 26, 2020, 6:50 AM

#

Heh.

#

Features are important. But what's really important is data

#

The sheer volume of data there makes it so that you can train really deep models

shell berry Oct 26, 2020, 6:51 AM

#

But doesnt that data have to be

#

labeled

#

They have to choose how to represent it

ripe forge Oct 26, 2020, 6:52 AM

#

Actually, not necessarily. Especially depending on the task

#

I hope I'm not going to write something that's factually incorrect when I say this.. So you should double check.. But Bert is not a supervised model for example. It's considered semi supervised or unsupervised

#

Or to simplify, say, word2vec is unsupervised

shell berry Oct 26, 2020, 6:54 AM

#

Yeah that makes sense, it doesnt necessarily have to know the meanings of words to know how they're used in sentences? is that it?

ripe forge Oct 26, 2020, 6:54 AM

#

The general theme is, essentially deep learning models have this idea that you give them a task that's unsupervised. They, in the process, learn some information about the data in their weights that's useful downstream

#

So you end up getting features perfectly ready for use with simply insane volumes of Data and zero labelling

#

So yep exactly as you said

shell berry Oct 26, 2020, 6:55 AM

#

Ill have to process that idea a bit

#

thanks for the help, Ill update you with my results of using the bert encodings

ripe forge Oct 26, 2020, 6:56 AM

#

All the best!

lapis sequoia Oct 26, 2020, 8:38 AM

#

Hey guys, does anyone know why this curve is so strange?

📎 Screenshot_2020-10-26_at_09.37.34.png

velvet thorn Oct 26, 2020, 9:08 AM

#

@lapis sequoia why do you think it's strange

lapis sequoia Oct 26, 2020, 9:19 AM

#

@lapis sequoia why do you think it's strange
@velvet thorn something with the values?

arctic wedgeBOT Oct 26, 2020, 11:13 AM

#

Hey @lapis sequoia!

It looks like you tried to attach file type(s) that we do not allow (.html). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .flac, .afdesign, .m4a, .csv.

Feel free to ask in #community-meta if you think this is a mistake.

lapis sequoia Oct 26, 2020, 12:00 PM

#

i have an sql question

#

anyone?

velvet thorn Oct 26, 2020, 12:33 PM

#

anyone?
@lapis sequoia just post your question..

glad spindle Oct 26, 2020, 2:17 PM

#

Good morning everyone, I was wondering if Tensorflow could be used for training a ML model for parsing and generating financial documents

earnest forge Oct 26, 2020, 2:27 PM

#

https://i.stack.imgur.com/3MhPr.png
Why do we specify 1/m in this formula?

glad spindle Oct 26, 2020, 2:28 PM

#

I'm sorry I don't really know how to help you with that I was asking about Tensorflow in general

earnest forge Oct 26, 2020, 2:29 PM

#

did you start ML or AI from TensorFlow?

#

Or did you start with something simpler, such as scikit-learn, first?

glad spindle Oct 26, 2020, 2:31 PM

#

Well I haven't selected a framework yet because I'm not sure which one I should use

#

Basically I want to build a model that learns how to generate financial statements for my company based on parsing historical financial statements

velvet thorn Oct 26, 2020, 2:51 PM

#

https://i.stack.imgur.com/3MhPr.png
Why do we specify 1/m in this formula?
@earnest forge because you're taking the partial derivative of the MSE

#

mean squared error

#

hence 1/m
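A worked version of that answer, assuming the linked image shows the usual gradient-descent update for linear regression (the image is not reproduced here) and that the cost uses the common 1/(2m) convention:

```latex
\begin{aligned}
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
\frac{\partial J}{\partial \theta_j} &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{aligned}
```

The 2 from differentiating the square cancels the 1/2, and the 1/m stays because the cost is an average over the m training examples. If the cost were written as a plain mean squared error without the 1/2, the factor would be 2/m instead.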

#

Good morning everyone, I was wondering if Tensorflow could be used for training a ML model for parsing and generating financial documents
@glad spindle possible, but if you're just starting out, probably not.

glad spindle Oct 26, 2020, 2:55 PM

#

Is there a better framework to use? Like PyTorch?

velvet thorn Oct 26, 2020, 2:55 PM

#

well

#

both are usable

#

but different.

#

so you can just pick one

glad spindle Oct 26, 2020, 2:55 PM

#

I built something similar in the past using symbolic AI but we want to switch to using NN

#

That way it only needs to train on data rather than us having to declaratively express rules

velvet thorn Oct 26, 2020, 3:14 PM

#

I built something similar in the past using symbolic AI but we want to switch to using NN
@glad spindle have you ever built a NN?

hollow sentinel Oct 26, 2020, 3:44 PM

#

so recommender systems recommend stuff given data?

#

i don't wanna do it bc it requires linear algebra

pine tide Oct 26, 2020, 3:50 PM

#

Anyone have any resources for learning the basics of ML if you're familiar with Python already?

glad spindle Oct 26, 2020, 3:55 PM

#

No @velvet thorn

austere swift Oct 26, 2020, 3:56 PM

#

@pine tide do you know the math behind it and stuff already and wanna know how to do it or do you just not know any of it

pine tide Oct 26, 2020, 3:58 PM

#

I have a rough idea of the math but let's assume I know nothing

hollow sentinel Oct 26, 2020, 3:58 PM

#

@pine tide Python for Data Science and Machine Learning by Jose Portilla. It's a Udemy course

austere swift Oct 26, 2020, 3:59 PM

#

are you familiar with basic concepts of linear algebra and calculus?

#

yeah theres also a coursera course on it which is pretty nice

hollow sentinel Oct 26, 2020, 3:59 PM

#

Andrew Ng

austere swift Oct 26, 2020, 3:59 PM

#

i didnt really learn from any courses i just read a bunch of papers and stuff so idk which courses are good

hollow sentinel Oct 26, 2020, 4:00 PM

#

@austere swift damn dude I cannot read papers for the life of me like machine learning books make me fall asleep

#

but @pine tide I can strongly recommend Python for Data Science and Machine Learning by Jose Portilla

pine tide Oct 26, 2020, 4:01 PM

#

are you familiar with basic concepts of linear algebra and calculus?
@austere swift Yes

#

but @pine tide I can strongly recommend Python for Data Science and Machine Learning by Jose Portilla
@hollow sentinel Thank you 🙂

austere swift Oct 26, 2020, 4:01 PM

#

then yeah you can probably check out some of those courses

hollow sentinel Oct 26, 2020, 4:02 PM

#

@austere swift have you made a recommender system before

austere swift Oct 26, 2020, 4:03 PM

#

no why

hollow sentinel Oct 26, 2020, 4:03 PM

#

nothing I just find it cool

#

@pine tide whatever you do do not jump right into neural nets and deep learning

pine tide Oct 26, 2020, 4:04 PM

#

Figured as much, I want to start with real basic stuff

hollow sentinel Oct 26, 2020, 4:04 PM

#

great haha bc i know people who did that and they don't know what they're doing

#

unless you're insanely intelligent to pick up neural nets with no basis of machine learning

austere swift Oct 26, 2020, 4:10 PM

#

the only thing thats a really bad way to do it is to just jump into the execution without knowing how to do the math behind it or anything

#

cus then you'll probably make some shitty models because you wont know how the loss algorithms and stuff works

#

neural networks are actually pretty easy to make if you use something like keras

serene scaffold Oct 26, 2020, 4:37 PM

#

I'm digging through the pandas docs but I have a dataframe with three columns, and I need a new dataframe of two columns that's always the first column and a random selection from either the second or third.

heady hatch Oct 26, 2020, 4:39 PM

#

You could do something along the lines of

pd.concat([df['first'], df[np.random.choice(['second', 'third'])]], axis=1).

serene scaffold Oct 26, 2020, 4:40 PM

#

does that randomly pick a column and then only use that column, or is it random for every row?

#

it needs to be the latter.

heady hatch Oct 26, 2020, 4:41 PM

#

Oh wait you want a random choice of every row?

serene scaffold Oct 26, 2020, 4:41 PM

#

yes

heady hatch Oct 26, 2020, 4:42 PM

#

I don't know if there's a builtin solution but I would probably generate 0 and 1, and then index into the two columns.

quick epoch Oct 26, 2020, 4:50 PM

#

Hi, is it possible to create a multicoloured bar without stacking it up?

lone osprey Oct 26, 2020, 4:50 PM

#

??

quick epoch Oct 26, 2020, 4:51 PM

#

So if it sees, let’s say, a “1” value it changes to blue, if it sees “2” then it changes to green, and if it sees 1 again then it changes back to blue

hollow sentinel Oct 26, 2020, 4:52 PM

#

i never realized NLP was so prevalent

quick epoch Oct 26, 2020, 4:52 PM

#

So one bar will have 2 blues and 1 green

hollow sentinel Oct 26, 2020, 4:53 PM

#

https://stackoverflow.com/questions/47138271/how-to-create-a-stacked-bar-chart-for-my-dataframe-using-seaborn

Stack Overflow

How to create a stacked bar chart for my DataFrame using seaborn?

I have a DataFrame df:

df = pd.DataFrame(columns=["App","Feature1", "Feature2","Feature3",
"Feature4","Feature5",
"Feature6","Feature7","Featu...

quick epoch Oct 26, 2020, 4:53 PM

#

It’s not really that

#

Plus I am using matplotlib

#

The colours are stacking up

hollow sentinel Oct 26, 2020, 4:54 PM

#

seeeeeeaaaaboorrnnnnn

quick epoch Oct 26, 2020, 4:54 PM

#

And I don’t want that

hollow sentinel Oct 26, 2020, 4:54 PM

#

sorry

#

maybe someone else knows

mint kelp Oct 26, 2020, 5:46 PM

#

📎 unknown.png

#

Is it possible to ask python to put info into a search bar like this

serene scaffold Oct 26, 2020, 5:49 PM

#

@mint kelp that's a #user-interfaces question.

hollow sentinel Oct 26, 2020, 5:59 PM

#

hey guys what does sep control in the read_csv() method

#

i'm reading the pandas doc rn and idk

bleak fox Oct 26, 2020, 6:05 PM

#

hey guys what does sep control in the read_csv() method
@hollow sentinel sep stands for separator, which is "," for CSV files.

hollow sentinel Oct 26, 2020, 6:07 PM

#

@bleak fox yeah but what does the "," mean

#

i just see Portilla do it in the Udemy course

bleak fox Oct 26, 2020, 6:08 PM

#

CSV files are comma-separated by default, the same way some text files are tab-separated... This parameter tells Python which separator was used when the file was saved.
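A small sketch of what `sep` changes, using in-memory text instead of real files (the column names and values are made up):

```python
import io
import pandas as pd

comma_text = "name,price\napple,1.5\npear,2.0\n"
tab_text = "name\tprice\napple\t1.5\npear\t2.0\n"

# sep defaults to "," so it can be omitted for ordinary CSV files.
df_comma = pd.read_csv(io.StringIO(comma_text))
# For tab-separated data the separator has to be given explicitly.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(df_comma.equals(df_tab))  # True
```

Reading the tab-separated text without `sep="\t"` would put each whole line into a single column, because no commas are found to split on.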

hollow sentinel Oct 26, 2020, 6:09 PM

#

ohhhhhhh

#

like they're separated by a comma

#

ok

bleak fox Oct 26, 2020, 6:09 PM

#

Yup

hollow sentinel Oct 26, 2020, 6:09 PM

#

i'm dumb lmao

hollow sentinel Oct 26, 2020, 6:27 PM

#

omg NLP is hard

#

should I try to find the XBOX live chats and see which ones are inappropriate with NLP

#

that would be a funny project

shell berry Oct 26, 2020, 6:28 PM

#

📎 unknown.png

#

this should only be 18 combinations of hyperparams, right?

#

My program has run like over 70 models and its still going

hollow sentinel Oct 26, 2020, 6:29 PM

#

idek bro

shell berry Oct 26, 2020, 6:29 PM

#

😔

hollow sentinel Oct 26, 2020, 6:30 PM

#

sorry
I'm a noob

#

@shell berry how do you figure out which machine learning algorithm to use given a dataset

plucky zephyr Oct 26, 2020, 6:31 PM

#

newbie question: do you always check whether your model overfits or not?

shell berry Oct 26, 2020, 6:31 PM

#

📎 image.png

hollow sentinel Oct 26, 2020, 6:31 PM

#

thanks

#

I've seen that before

#

oh they call algorithms estimators?

#

i didn't know that

#

learn something new every day i guess

last peak Oct 26, 2020, 6:59 PM

#

There is a link in an outlook email that opens an outlook message with a pre-populated reply message subject and a message. Can I open that message to edit with pythons win32com.client library??

shell berry Oct 26, 2020, 7:03 PM

#

@ripe forge

#

Ran my models with embeddings and got 5-6% worse performance

kind saddle Oct 26, 2020, 7:21 PM

#

if anyone has time please look at help-carbon

hollow sentinel Oct 26, 2020, 7:24 PM

#

@last peak have you tried?

autumn fjord Oct 26, 2020, 8:21 PM

#

Anyone here use Tensorflow?

last peak Oct 26, 2020, 8:28 PM

#

i dont know what to try

short dove Oct 26, 2020, 10:09 PM

#

do you apply feature scaling before or after the split into training and test

heady hatch Oct 26, 2020, 10:30 PM

#

You'd generally want to split first, fit the feature transformation on the training set, and then apply it to the test set.

#

For scaling, you would want to grab your min and max or mean and std from the training set and then apply them to both training and testing.
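A minimal sketch of that order of operations with plain NumPy, assuming standardisation (mean and std) on a made-up feature matrix and a simple 80/20 split:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # made-up feature matrix

# Split first (a plain 80/20 split here, purely for illustration).
X_train, X_test = X[:80], X[80:]

# Statistics come from the training rows only...
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# ...and the same transformation is applied to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

scikit-learn's `StandardScaler` follows the same pattern: `fit` on the training data, then `transform` both sets.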

serene scaffold Oct 26, 2020, 11:32 PM

#

>>> data
1  2  3
4  5  6
7  8  9
>>> data.iloc[:, 1:].apply(lambda x: choice(x), axis=0)
8
9

#

Trying to get it to always pick one cell or the other for each column, not just two cells arbitrarily

#

(but only for the last two columns)

velvet thorn Oct 26, 2020, 11:35 PM

#

>>> data
1  2  3
4  5  6
7  8  9
>>> data.iloc[:, 1:].apply(lambda x: choice(x), axis=0)
8
9

@serene scaffold huh.

#

so let me get this straight

#

for each row, you want to choose one of the last two columns?

serene scaffold Oct 26, 2020, 11:35 PM

#

2
6
9

would be a valid result.

#

so would

3
6
8

velvet thorn Oct 26, 2020, 11:36 PM

#

yup, got it

#

let me think

#

I have one ugly way but I'm sure there's a more elegant one

serene scaffold Oct 26, 2020, 11:37 PM

#

some people were using generator expressions but that eliminates the performance boost from pandas.

velvet thorn Oct 26, 2020, 11:39 PM

#

okay this is my preliminary solution

#

df.values[np.arange(len(df)), np.random.randint(-2, 0, size=len(df))]

#

but I'm sure there's a better way

serene scaffold Oct 26, 2020, 11:39 PM

#

let's see

velvet thorn Oct 26, 2020, 11:39 PM

#

I just woke up so probably I'm not thinking straight

serene scaffold Oct 26, 2020, 11:40 PM

#

@velvet thorn well it works, so you're thinking at least that straight

#

I was hoping it could be done using only pandas so it's one less import/explicit dependency, but I'll deal with that later

#

however the actual use case is with strings and not numbers.

velvet thorn Oct 26, 2020, 11:41 PM

#

however the actual use case is with strings and not numbers.
@serene scaffold what do you mean

serene scaffold Oct 26, 2020, 11:41 PM

#

I was only using ints as an example. The dataframe in the real program is only of strings.

velvet thorn Oct 26, 2020, 11:41 PM

#

sure, but what's the difference

serene scaffold Oct 26, 2020, 11:41 PM

#

idk how that affects the usability of numpy

velvet thorn Oct 26, 2020, 11:42 PM

#

it shouldn't

#

because numpy there is only used to generate the indexes
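A sketch of the same indexing trick on string data, assuming a three-column frame with made-up column names. NumPy only produces the row and column positions, so the cell type does not matter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first": ["a", "b", "c"],
    "second": ["x1", "x2", "x3"],
    "third": ["y1", "y2", "y3"],
})

rng = np.random.default_rng()
# For every row, pick column position 1 or 2 (the last two columns).
col_idx = rng.integers(1, 3, size=len(df))
picked = df.to_numpy()[np.arange(len(df)), col_idx]

# Always the first column, plus the per-row random pick.
result = pd.DataFrame({"first": df["first"], "picked": picked})
print(result)
```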

serene scaffold Oct 26, 2020, 11:42 PM

#

ah I see

velvet thorn Oct 26, 2020, 11:44 PM

#

hm.

#

some Googling has suggested to me that that is the best way to do it

serene scaffold Oct 26, 2020, 11:45 PM

#

alright

velvet thorn Oct 26, 2020, 11:45 PM

#

I don't think there's a pure pandas solution

#

because that would go against its data model...?

#

but that's my guess

serene scaffold Oct 26, 2020, 11:47 PM

#

I'll let my advisor know that a person on the internet named gm figured it out.

velvet thorn Oct 26, 2020, 11:47 PM

#

🥴

lapis sequoia Oct 27, 2020, 12:11 AM

#

I have a few questions that I thought might apply here, so here goes.
1.) Is there a specific name for a program that can look at a graph and variables related to the graph, and predict the future of said graph?
2.) Whatever that type of program is called, what part of the program do I start with, as this is going to be basically my introductory program into python
3.) Can anyone tell me how I could get a graphical representation of the input data and the program's predictions, as opposed to just having to read raw numbers?

velvet thorn Oct 27, 2020, 12:12 AM

#

I have a few questions that I thought might apply here, so here goes.
1.) Is there a specific name for a program that can look at a graph and variables related to the graph, and predict the future of said graph?
2.) Whatever that type of program is called, what part of the program do I start with, as this is going to be basically my introductory program into python
3.) Can anyone tell me how I could get a graphical representation of the input data and the program's predictions, as opposed to just having to read raw numbers?
@lapis sequoia by "graph" you mean a mathematical graph i.e. linking nodes and edges?

#

what do you mean by "future"?

cursive sphinx Oct 27, 2020, 12:12 AM

#

Well, after weeks of testing my web scraper on a website, it finally blocked me with DDoS protection. . .

lapis sequoia Oct 27, 2020, 12:13 AM

#

I mean a simple line or bar graph, and by "future" I mean a program that can recognize patterns in the data and predict where the lines or bars would be based on where they have been in the past

velvet thorn Oct 27, 2020, 12:14 AM

#

I mean a simple line or bar graph, and by "future" I mean a program that can recognize patterns in the data and predict where the lines or bars would be based on where they have been in the past
@lapis sequoia ah, then you should probably say "plot" or "visualisation"

#

graph has a specific meaning.

lapis sequoia Oct 27, 2020, 12:14 AM

#

My apologies, I will remember that

velvet thorn Oct 27, 2020, 12:15 AM

#

what you're saying sounds basically like simple ML (machine learning)

#

do you have any statistics experience?

lapis sequoia Oct 27, 2020, 12:16 AM

#

I remember most of the Statistics section my algebra teacher taught me

#

and I am familiar with a lot of concepts used in Statistics

velvet thorn Oct 27, 2020, 12:17 AM

#

okay

#

so like

#

a linear regression would be the simplest method

#

are you familiar with that?

#

but you'd need the data in tabular, not graphical form

#

after making your predictions you can create an appropriate visualisation

#

the relevant libraries are numpy, pandas, sklearn, and matplotlib.
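A minimal sketch of that workflow on made-up monthly numbers. It uses `np.polyfit` for the straight-line fit to keep the dependencies small; sklearn's `LinearRegression` would be the more usual choice in a larger project:

```python
import numpy as np

# Made-up tabular data: one observation per month.
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([10.0, 12.1, 13.9, 16.2, 18.0, 19.8])

# Fit a straight line: sales is approximately slope * month + intercept.
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the fitted line to the next three months.
future_months = np.array([7, 8, 9])
predicted = slope * future_months + intercept
print(predicted)
```

The observed and predicted values could then be drawn on the same axes with matplotlib to get the visual comparison instead of raw numbers.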

lapis sequoia Oct 27, 2020, 12:19 AM

#

Alright, I will brush up on my Statistics and look into those libraries

#

I had another question but I answered in the process of typing it. Thanks for your help though gm :)

velvet thorn Oct 27, 2020, 12:47 AM

#

yw!

hollow sentinel Oct 27, 2020, 1:21 AM

#

@lapis sequoia data visualization, linear regression + other algorithms is taught really well in Python For Data Science and Machine Learning by Jose Portilla if you want to check it out

#

@cursive sphinx haha time to go on the hacking channel

cursive sphinx Oct 27, 2020, 1:32 AM

#

I think it might just be IP related, if I change the timing on my code to be more human like maybe it will get past.

#

My bot works though, I am happy. was trying to compile a list of names and prices, but then was also trying to figure out how to get urls but guess that's not happening 😄

austere swift Oct 27, 2020, 1:44 AM

#

@cursive sphinx usually with web scrapers people put like a 5s delay between requests to stop that from happening
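A small sketch of such a delay loop, with a bit of random jitter so the timing is less regular. `polite_fetch_all` and its `fetch` argument are hypothetical: `fetch` is whatever function actually downloads one page in your scraper:

```python
import random
import time

def polite_fetch_all(urls, fetch, base_delay=5.0, jitter=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Randomised pause so requests are not perfectly regular.
            time.sleep(base_delay + random.uniform(0, jitter))
        results.append(fetch(url))
    return results
```

Whether a given delay is enough depends on the site, so treat the defaults as a starting point rather than a guarantee.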

random barn Oct 27, 2020, 1:44 AM

#

d

cursive sphinx Oct 27, 2020, 1:45 AM

#

I have 3 seconds at the moment, I'll try higher 😄

hollow sentinel Oct 27, 2020, 1:51 AM

#

hey guys

#

what does bins control in seaborn

#

are bins like the bars on a histogram?

velvet thorn Oct 27, 2020, 2:03 AM

#

are bins like the bars on a histogram?
@hollow sentinel yes
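To make the idea concrete, here is a sketch with `np.histogram`, which computes the same kind of bins a histogram plot draws (the values are made up; seaborn's histogram functions take a `bins` argument with the same meaning):

```python
import numpy as np

values = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# bins=4 splits the range 1..5 into four equal-width intervals.
counts, edges = np.histogram(values, bins=4)
print(edges)   # bin boundaries
print(counts)  # number of values that fall into each bin
```

Each count becomes the height of one bar, so more bins means narrower bars and a finer-grained picture of the distribution.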

mint kelp Oct 27, 2020, 3:43 AM

#

Can anyone who knows how to use mechanize interpret this result for me?

#

import mechanize
br = mechanize.Browser()
br.open('https://www.rentometer.com/')
for form in br.forms():
    print('Form name:', form.name)
    print(form)

📎 unknown.png

#

what exactly is going on?

#

I dont know what textcontrol and selectcontrol is

heady hatch Oct 27, 2020, 5:23 AM

#

You should look into html TextControl and SelectControl.

#

It seems like an html form.

cedar sky Oct 27, 2020, 5:47 AM

#

I am in a kaggle time series competition anyone willing to participate

#

teaming up with me

mossy dragon Oct 27, 2020, 7:44 AM

#

oh that sounds interesting

#

can you link?

heady tide Oct 27, 2020, 10:00 AM

#

it's interesting to see the difference between the results of the tf-idf vectoriser (top) and nltk's frequency distribution using stopwords (bottom).

📎 unknown.png

lapis sequoia Oct 27, 2020, 10:30 AM

#

How can I interpret this result? Are those shaded areas confidence intervals, and what is the default confidence interval?

sns.lineplot(x='age_group',
            y='time_to_death',           
            data=patient,
            hue='sex',
            palette='cividis')
plt.title('Time from infection to death',fontsize=16)
plt.xlabel('Age', fontsize=16)
plt.ylabel('Days', fontsize=16)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

📎 Screenshot_2020-10-27_at_11.29.19.png

velvet thorn Oct 27, 2020, 12:11 PM

#

@lapis sequoia did you check the documentation of sns.lineplot?

lapis sequoia Oct 27, 2020, 12:13 PM

#

@lapis sequoia did you check the documentation of sns.lineplot?
@velvet thorn Yes and it appears to be standard when you add hue

velvet thorn Oct 27, 2020, 12:13 PM

#

@velvet thorn Yes and it appears to be standard when you add hue
@lapis sequoia no, about the default confidence interval.

lapis sequoia Oct 27, 2020, 12:14 PM

#

I didn't find that on their documentation

#

That's why I ask..

velvet thorn Oct 27, 2020, 12:14 PM

#

ci: int or “sd” or None

Size of the confidence interval to draw when aggregating with an estimator. “sd” means to draw the standard deviation of the data. Setting to None will skip bootstrapping.

bronze halo Oct 27, 2020, 1:17 PM

#

Hi, I have some code that doesn't export when using VSCode, only when using Jupyter. Is there something I'm missing in VSCode?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('qtwo.csv', parse_dates=['reservation_status_date'])

ax = df['adults'].value_counts().plot(kind='bar',
figsize=(14,8),
title="people")
ax.set_xlabel('reservation_status_date' <='2015-12')
ax.set_ylabel('adults')

lapis sequoia Oct 27, 2020, 1:32 PM

#

you already have matplotlib in jupyter, not vscode

lapis sequoia Oct 27, 2020, 1:47 PM

#

how to revert word vector back to token in spacy?

#

for some reason the most_similar function doesnt work either

true nacelle Oct 27, 2020, 1:54 PM

#

Hi. How does the .groupby function in pandas work in simple terms? I've used it to calculate the average stats across 7 gens of Pokémons, but I don't really know what it truly does. Thanks!

📎 KVSMLTiCOBAAAAAElFTkSuQmCC.png

lapis sequoia Oct 27, 2020, 1:57 PM

#

its used to combine categorical values to apply aggregate functions in dataframes

#

a groupby object is made that contains the categories of a column(pd.Series) and you can apply functions like mean(), median() to generate stats

true nacelle Oct 27, 2020, 2:09 PM

#

Ooooh. That makes more sense. I read some articles online but they were a bit too advanced for me. Thanks! @lapis sequoia

desert oar Oct 27, 2020, 2:40 PM

#

@true nacelle data.groupby('generation') creates a dataframe for each value of data['generation'], and gives you a handful of convenient methods for operating on those dataframes. usually you use it to "aggregate" data, i.e. produce 1 number for each group. in this case data.groupby('generation').mean() is similar to

rows = {}
for gen in data['generation'].unique():
    # mean of every numeric column for this one generation
    rows[gen] = data.loc[data['generation'] == gen].mean(numeric_only=True)
pd.DataFrame(rows).T  # one row per generation

except it's much less code and much faster to execute

true nacelle Oct 27, 2020, 2:44 PM

#

Ahh, that's a nice way to explain it. Thanks! @desert oar

desert oar Oct 27, 2020, 2:44 PM

#

you're welcome

#

there are lots of other things you can do with groupby too but that's the most common/basic use of it

velvet thorn Oct 27, 2020, 2:46 PM

#

yeah.

#

groupby is based on the idea of split-apply-combine

#

you split a DataFrame based on the value of a certain column into multiple sub-DataFrames, apply a function to each sub-DataFrame, then combine the result back into a single DataFrame

#

most often, the function is some sort of aggregation function, which leads to one row per group

#

but you can also transform and filter (and apply, but that's a bit more advanced) each group
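
A minimal sketch of the three group operations mentioned above (aggregate, transform, filter). The toy `data` frame and its `attack` column are made up for illustration and are not the Pokémon dataset from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the Pokémon stats table.
data = pd.DataFrame({
    "generation": [1, 1, 2, 2, 2],
    "attack": [50, 70, 60, 80, 100],
})

grouped = data.groupby("generation")

# Aggregate: one value per group.
mean_attack = grouped["attack"].mean()

# Transform: same length as the input, each row gets its group's mean.
gen_mean = grouped["attack"].transform("mean")

# Filter: keep only the rows of groups that satisfy a condition.
big_gens = grouped.filter(lambda g: len(g) >= 3)
```

Aggregation shrinks the data to one row per group, transform keeps the original shape, and filter drops whole groups, which is usually enough to decide which of the three a task needs.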

rich reef Oct 27, 2020, 3:09 PM

#

Greetings, I'm running several hundreds of simulations and am plotting one of the KPI's for each one of them in individual figures. I'd like to add a text-box to the right of each plot that prints the parameters of the visualized results.
Any tips on how to approach this? I've been messing around with matplotlib to add text but I didn't really get far for what seems to be a trivial task

#

Simplified example of current plot:

    plt.style.use('seaborn-whitegrid')

    fig = plt.figure()
    ax = fig.add_subplot()  

    x = df['weektime'].apply(lambda x: x.total_seconds())
    y = df['mean'] * 100
    ax.plot(x, y)
    # There's more formatting stuff here like labels and ticks, I removed them for this example

    # Add config as textbox
    config_str = "my beatiful string for the textbox"
    # Add textbox with config_str here - how?

    file_path = Path(f'plots/{sim_name}-storage_plot.png')

    # Save new plot
    plt.savefig(file_path, dpi=200)
    plt.close()

lapis sequoia Oct 27, 2020, 4:54 PM

#

i m having trouble installing sklearn
https://www.pastiebin.com/5f98509abb624

austere swift Oct 27, 2020, 5:06 PM

#

python 3.9 doesnt support it yet

#

thats why its generally suggested to not go to 3.9 yet, there isnt much package support

hollow sentinel Oct 27, 2020, 5:09 PM

#

what does it mean to vectorize data

#

Portilla just throws that term around

austere swift Oct 27, 2020, 5:10 PM

#

its converting the data into arrays, where the processes can be done on the whole array at once

#

its more efficient that way
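
A small sketch of "operating on the whole array at once", assuming NumPy and a made-up list of prices. The loop and the vectorized version compute the same thing.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]

# Loop version: one Python-level operation per element.
looped = [p * 1.1 for p in prices]

# Vectorized version: one operation applied to the whole array at once.
vectorized = np.array(prices) * 1.1
```

On three values the difference is invisible, but on large arrays the vectorized form avoids the per-element Python overhead.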

hollow sentinel Oct 27, 2020, 5:10 PM

#

ok cool

#

yeah maybe Ng explains it a little more in his course

true nacelle Oct 27, 2020, 5:12 PM

#

Can someone please help me understand how the .index functions in pandas work, in simple terms? Thanks.

lapis sequoia Oct 27, 2020, 5:14 PM

#

@austere swift what version then ? 3.8 ?

austere swift Oct 27, 2020, 5:14 PM

#

@lapis sequoia yeah 3.8 works

lapis sequoia Oct 27, 2020, 5:14 PM

#

how can i downgrade ?

#

reinstall ?

austere swift Oct 27, 2020, 5:15 PM

#

just download the installer for 3.8 and install it

#

you can keep both if you want, or just uninstall 3.9

#

up to you

lapis sequoia Oct 27, 2020, 5:16 PM

#

let me try

austere swift Oct 27, 2020, 5:17 PM

#

you can see all those lines where it says the versions and says that it doesnt match your environment lol

📎 unknown.png

hollow sentinel Oct 27, 2020, 5:17 PM

#

https://youtu.be/OYZNk7Z9s6I @true nacelle

YouTube

Data School

What do I need to know about the pandas index? (Part 1)

The DataFrame index is core to the functionality of pandas, yet it's confusing to many users. In this video, I'll explain what the index is used for and why you might want to store your data in the index. I'll also demonstrate how to set and reset the index, and show how that ...


#

this guy is great btw

true nacelle Oct 27, 2020, 5:18 PM

#

Thanks! I'll watch it now 😄

hollow sentinel Oct 27, 2020, 5:18 PM

#

yep no problem

lapis sequoia Oct 27, 2020, 5:19 PM

#

@austere swift should i remove 3.9 from Path ?

hollow sentinel Oct 27, 2020, 5:19 PM

#

dude i don't get why pipelines are used

lapis sequoia Oct 27, 2020, 5:21 PM

#

and yeah sklearn installed successfully with 3.8

austere swift Oct 27, 2020, 5:21 PM

#

@lapis sequoia you dont have to

lapis sequoia Oct 27, 2020, 5:21 PM

#

alright Thx man

#

but can't import sklearn :/

austere swift Oct 27, 2020, 5:23 PM

#

do python --version

lapis sequoia Oct 27, 2020, 5:23 PM

#

3.8.6

austere swift Oct 27, 2020, 5:24 PM

#

when you import is it a module not found error or is it a different error

lapis sequoia Oct 27, 2020, 5:24 PM

#

yes not found

heady hatch Oct 27, 2020, 5:25 PM

#

@hollow sentinel You want to use pipelines for consistency and robustness. They also serve as cognitive offloading.

hollow sentinel Oct 27, 2020, 5:25 PM

#

@heady hatch robustness?

heady hatch Oct 27, 2020, 5:25 PM

#

Right, imagine if your pipeline has 10 parts to it.

#

If you were to manually run all, you're going to introduce human errors into it.

#

You could write a class or a function that does the same, but that's what pipeline is for.

hollow sentinel Oct 27, 2020, 5:26 PM

#

so could you use a data pipeline for a linear regression

lapis sequoia Oct 27, 2020, 5:26 PM

#

📎 unknown.png

heady hatch Oct 27, 2020, 5:26 PM

#

Pipeline also has the advantage of being a subclass of the estimator class in sklearn.

hollow sentinel Oct 27, 2020, 5:26 PM

#

bc Portilla uses a pipeline for NLP

lapis sequoia Oct 27, 2020, 5:26 PM

#

📎 unknown.png

austere swift Oct 27, 2020, 5:27 PM

#

make sure the notebook is using the right version

heady hatch Oct 27, 2020, 5:27 PM

#

you can use the pipeline for anything really.

#

Conceptually you can think of it like an actual pipeline that does feature extractions or transformations and maybe even input into an estimator (sklearn term).
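
To make the "transformations feeding into an estimator" picture concrete, here is a minimal sketch of a pipeline around a linear regression. The toy data and the step names `scale` and `model` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data following y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

pipe = Pipeline([
    ("scale", StandardScaler()),      # transformation step
    ("model", LinearRegression()),    # final estimator
])

# fit() runs every step in order; predict() reuses the fitted scaler.
pipe.fit(X, y)
prediction = pipe.predict(np.array([[5.0]]))
```

Because the pipeline is itself an estimator, the single `pipe` object can be passed anywhere a plain model would go, and the scaling can't be forgotten at prediction time.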

lapis sequoia Oct 27, 2020, 5:28 PM

#

@austere swift yes the notebook is running on 3.9

#

📎 unknown.png

#

how do i change that ?

hollow sentinel Oct 27, 2020, 5:29 PM

#

an estimator is an algorithm right @heady hatch

heady hatch Oct 27, 2020, 5:29 PM

#

Hmm define algorithm.

hollow sentinel Oct 27, 2020, 5:29 PM

#

like linear regression

📎 1F6qno2Txg6vaHBmKnyx09w.png

heady hatch Oct 27, 2020, 5:29 PM

#

Like a regressor/classifier in the sklearn ecosystem?

hollow sentinel Oct 27, 2020, 5:29 PM

#

these are estimators

heady hatch Oct 27, 2020, 5:29 PM

#

Yea.

#

That's what sklearn calls it.

austere swift Oct 27, 2020, 5:30 PM

#

@lapis sequoia click kernel at the top then click change kernel

hollow sentinel Oct 27, 2020, 5:30 PM

#

reading doc is so boring haha

#

sklearn + pandas doc puts me to sleep

heady hatch Oct 27, 2020, 5:30 PM

#

You don't have to read it to learn it.

hollow sentinel Oct 27, 2020, 5:30 PM

#

yeah i try to put it in my projects

lapis sequoia Oct 27, 2020, 5:30 PM

#

📎 unknown.png

#

its only this

#

@austere swift

austere swift Oct 27, 2020, 5:31 PM

#

try that yeah

lapis sequoia Oct 27, 2020, 5:31 PM

#

its 3.9

hollow sentinel Oct 27, 2020, 5:31 PM

#

F

austere swift Oct 27, 2020, 5:32 PM

#

@lapis sequoia try restarting jupyter, maybe it didnt recognize the new installation lol

hollow sentinel Oct 27, 2020, 5:32 PM

#

sometimes I'll leave jupyter open and it just forgets that I imported pandas

lapis sequoia Oct 27, 2020, 5:33 PM

#

yeah i tried that

#

i m trying to reinstall jupyter

hollow sentinel Oct 27, 2020, 5:34 PM

#

i had a lot of trouble setting up VSC if it makes you feel better

#

never got it to actually work

lapis sequoia Oct 27, 2020, 5:35 PM

#

finally it worked

#

oof

heady hatch Oct 27, 2020, 5:42 PM

#

Nice!

hollow sentinel Oct 27, 2020, 6:06 PM

#

nice!

serene scaffold Oct 27, 2020, 8:21 PM

#

I'm looking at these docs:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score
It's not clear how to indicate which label in y_true represents a null value

#

if the system predicts that an instance belongs to a class, and it doesn't belong to any class at all, that should be a false positive for the class it allegedly belongs to.

#

I guess you just set labels to a list of all the labels except whatever your null label is?

bold olive Oct 27, 2020, 8:27 PM

#

Get rid of the null values in label column.

#

That's usually a part of pre-processing anyways.

serene scaffold Oct 27, 2020, 8:27 PM

#

I got rid of all the rows that were both null

#

but I need to know if it should be null but the system made a prediction anyway

bold olive Oct 27, 2020, 8:28 PM

#

So you want to keep the null label values in your dataset?

#

In that case, just specify a class for it.

#

Doesn't have to be null but a numeric value for representation purposes.

#

That way, you'll know if the predictions are correct (and what they are).
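
A sketch of how the two suggestions can be combined: give the "no class" case its own label, then pass only the real classes through the `labels` argument of `precision_score`. The label name `"none"` and the toy predictions are assumptions.

```python
from sklearn.metrics import precision_score

# "none" is a hypothetical placeholder class meaning "belongs to no class".
y_true = ["cat", "none", "dog", "none", "cat"]
y_pred = ["cat", "cat", "dog", "none", "dog"]

# Score only the real classes; a "cat" predicted for a "none" instance
# still counts as a false positive for "cat".
per_class = precision_score(
    y_true, y_pred, labels=["cat", "dog"], average=None
)
```

Here each class has one true positive and one false positive, so both precisions come out at 0.5, and the `"none"` class itself is left out of the reported scores.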

dawn pond Oct 27, 2020, 8:47 PM

#

I'm kind of struggling trying to code a cost function for a simple single layer perceptron network

#

can anyone help me?

#

basically the desired output is matrix of size (5,1) and input (4,1)

#

how do i calculate cost if they arent equal

midnight rain Oct 27, 2020, 10:27 PM

#

I've been trying to optimize a bit of spacy code i got from a colleague.

for testdoc in docs:
    token_list = list()
    if len(testdoc) < 500_000:
        doc = nlp(testdoc, disable=["ner", "entity_linker", "textcat", "entity_ruler", "sentencizer"])
        phrases = [p.text for p in doc._.phrases]
        for doc in nlp.pipe(phrases, disable=["ner", "textcat", "entity_ruler", "sentencizer"]):
            token_list.append(" ".join(
                [token.text.lower() for token in doc if len(token.text) > 3 and token.text.isalpha()]
            ))
Is there a way to create a single `nlp.pipe()` call without having to process the doc and the doc's phrases separately? because the phrases are another generator (or list), im not able to just add a custom component that returns them, since spacy wont automatically flatten the iterator

heady hatch Oct 27, 2020, 11:29 PM

#

I'm not super familiar with SpaCy, but can't you write a custom pipeline?

balmy junco Oct 28, 2020, 12:19 AM

#

I am trying to combine 2 dataframes, but here's how I want to do it - if a column is common between the two, I want to keep the data from the first dataframe. If the column from the first dataframe isn't in the second, I want to keep it. If the column from the second isn't in the first, I want to add it to the end, and fill in a default value for the rest

#

I could do this myself, but I am wondering if a function already exists for doing so

#

?

heady hatch Oct 28, 2020, 12:20 AM

#

Nothing I can think of for something that complex.

But you can do a quick set computation to get all the things you need.

#

Maybe do something like merge with left join.

austere swift Oct 28, 2020, 12:21 AM

#

yeah that sounds like such a niche use that i dont think theyd have a function for that

balmy junco Oct 28, 2020, 12:23 AM

#

Here's what I was hoping might work

#

I mean I could just iterate through the columns in the dataframe and do it that way, but it seems kinda wasteful

#

Might be what I have to do though

velvet thorn Oct 28, 2020, 12:26 AM

#

I am trying to combine 2 dataframes, but here's how I want to do it - if a column is common between the two, I want to keep the data from the first dataframe. If the column from the first dataframe isn't in the second, I want to keep it. If the column from the second isn't in the first, I want to add it to the end, and fill in a default value for the rest
@balmy junco think you need to do it in several steps

#

I presume you're joining on a common column

#

you can use suffixes to drop columns depending on which DF they came from

#

that takes care of the first two requirements

#

If the column from the second isn't in the first, I want to add it to the end, and fill in a default value for the rest

#

what do you mean "fill in a default value for the rest"?

balmy junco Oct 28, 2020, 12:28 AM

#

for example

#

it would be like adding on a column df['hi'] = 'hello world'

#

i can do it in multiple steps easily but it seems like a waste of code lol

heady hatch Oct 28, 2020, 12:33 AM

#

but hmm what's the "rest" you're referring to? Because for any columns not in first but in second, it'll be already added to the new df.

#

Unless you're referring to columns not in first nor second?

velvet thorn Oct 28, 2020, 1:13 AM

#

^

twilit brook Oct 28, 2020, 1:40 AM

#

I am trying to combine 2 dataframes, but here's how I want to do it - if a column is common between the two, I want to keep the data from the first dataframe. If the column from the first dataframe isn't in the second, I want to keep it. If the column from the second isn't in the first, I want to add it to the end, and fill in a default value for the rest
@balmy junco when you say “and fill in a default value for the rest”, are you looking to impute the values?

#

For this, scikit learn has a neat function: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

#

And I assume your number of columns is large enough to not want to use .concat() and then manually .drop() the columns you don’t want. I would write an if-else statement: if df1[‘columnx’] != df2[‘columnx’]:
df = pd.df.concat([df2[‘columnx’]],axis=0

#

(Don’t follow the code I wrote out specifically, just the idea... lol)

#

Fellow pythonistas. Does anyone have any recommendation for plotting very large data sets? I have a data frame with ~67000 rows and matplotlib is not cutting it for plotting. The .ipynb cell gets stuck executing

#

My option b is to change the frequency of data point. The data frame is based on a certain tool running and it records data every 11 seconds. I could change the data frame to be data from every 30 seconds...

heady hatch Oct 28, 2020, 2:19 AM

#

@twilit brook what are you trying to plot?

twilit brook Oct 28, 2020, 2:34 AM

#

@heady hatch it’s the temperature and water flow of different components of an industrial tool

velvet thorn Oct 28, 2020, 2:34 AM

#

And I assume your number of columns is large enough to not want to use .concat() and then manually .drop() the columns you don’t want. I would write an if-else statement: if df1[‘columnx’] != df2[‘columnx’]:
df = pd.df.concat([df2[‘columnx’]],axis=0
@twilit brook so you want to do this once per loop iteration?

#

not a good idea.

#

Fellow pythonistas. Does anyone have any recommendation for plotting very large data sets? I have a data frame with ~67000 rows and matplotlib is not cutting it for plotting. The .ipynb cell gets stuck executing
@twilit brook what kind of plot?

heady hatch Oct 28, 2020, 2:36 AM

#

So it's temperature vs water flow? I'm assuming both are numerical values.

I've never actually come across issues plotting even at 2 million points.

#

matplotlib's scatterplot hangs for you then?

twilit brook Oct 28, 2020, 3:05 AM

#

It’s temperature and water flow over time. But now that I think about it, my time is in date-time format. That’s probably messing it up tons

#

@twilit brook so you want to do this once per loop iteration?
@velvet thorn yeah it’s not the most elegant solution..

heady hatch Oct 28, 2020, 3:11 AM

#

over time? so there are 3 dimensions?

balmy junco Oct 28, 2020, 3:15 AM

#

but hmm what's the "rest" you're referring to? Because for any columns not in first but in second, it'll be already added to the new df.
@heady hatch i am talking about if i am going through and adding row by row to the dataframe. these rows will be created from other dataframes (or i could add dictionaries). if there are fields in the rows that are not part of the dataframe they are added to, then i want to add those fields, and i want to add a default value for every other row.

#

@balmy junco when you say “and fill in a default value for the rest”, are you looking to impute the values?
@twilit brook maybe my most recent message will make it make more sense

heady hatch Oct 28, 2020, 3:16 AM

#

Sorry what's this in relation to the first and second dataframe?

balmy junco Oct 28, 2020, 3:16 AM

#

the first dataframe contains information about a symbol

#

all the data from the first dataframe is computed

#

the second dataframe contains more information about the same symbol

#

theres only one or a few common fields

heady hatch Oct 28, 2020, 3:18 AM

#

I would merge on those same symbols first and then impute.

So you'll have a big dataframe full of these symbols, the additional information and nulls.
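
A rough sketch of the merge-then-fill approach, assuming the common column is called `symbol` and using made-up `price` and `sector` columns. Dropping the overlapping columns from the second frame before merging is one way to make the first frame's data win.

```python
import pandas as pd

# Hypothetical frames sharing the key "symbol" and one overlapping column.
first = pd.DataFrame({"symbol": ["A", "B"], "price": [1.0, 2.0]})
second = pd.DataFrame({
    "symbol": ["A"],
    "price": [9.0],          # overlaps with `first`, should be discarded
    "sector": ["tech"],      # only in `second`, should be appended
})

# Keep only the key plus the columns `first` does not already have.
extra_cols = [c for c in second.columns if c not in first.columns]
merged = first.merge(second[["symbol"] + extra_cols], on="symbol", how="left")

# Rows without a match get NaN; replace it with a default value.
merged["sector"] = merged["sector"].fillna("unknown")
```

The left join keeps every row of the first frame, new columns land at the end, and the default value is applied in a single `fillna` step rather than row by row.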

balmy junco Oct 28, 2020, 3:20 AM

#

so you would actually call the merge i was calling before?

#

and then all of the other values would default to nulls?

#

and then i would impute to replace them?

#

i mean i was thinking i might just iterate through the columns in the second dataframe. i would do a check whether it is in the fields.. if not, i add it with default values, and then set the value at that specific row

#

if it is, then i just set the value of the row

#

the alternative would be to not even worry about the dataframe issue in general

#

and just use a dictionary instead of a dataframe at first

#

and assign key value pairs

#

that way, it wont matter if it's a duplicate or not

#

then i could create a dataframe afterwards

#

tough to decide lol

#

i feel like the latter is the better, logical way to do it, but i am feeling lazy haha

#

if i convert a dictionary that doesnt share all the keys, it should still work, right?
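
For the dictionary route, a quick sketch of what pandas does with records that don't share all keys. The field names and default values are invented for the example.

```python
import pandas as pd

rows = [
    {"symbol": "A", "price": 1.0},
    {"symbol": "B", "sector": "tech"},   # different keys from the first row
]

# Keys missing from a record become NaN in that row.
df = pd.DataFrame(rows)

# Per-column defaults for the gaps.
df = df.fillna({"price": 0.0, "sector": "unknown"})
```

So a list of dictionaries with different keys does convert, and the columns are the union of all keys.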

velvet thorn Oct 28, 2020, 3:52 AM

#

just merge on the common column

#

and figure out the rest later

balmy junco Oct 28, 2020, 4:01 AM

#

lol valid

#

ill do that

fiery cosmos Oct 28, 2020, 5:42 AM

#

anyone here skilled in the ways of 2captcha that might be able to lend some guidance?

lone osprey Oct 28, 2020, 7:29 AM

#

guys a doubt

#

data = pd.read_csv('chats.txt', delimiter = "\n", header = None, names = ['text'])
data[['datetime_str','text_2']] = data["text"].str.split(" - ", 1, expand=True)
data["datetime"] = pd.to_datetime(data["datetime_str"], format="%d/%m/%Y, %I:%M %p", errors='coerce')

#

i did this

#

29/09/20, 5:48 pm - Vinurakav_Sanker: Nice
30/09/20, 1:18 am - Ranga Cs: <Media omitted>
30/09/20, 1:22 am - Ranga Cs: <Media omitted>
01/10/20, 6:48 am - Dhaneesh Cs: Bgm udu bgm udu " Tan tan tan taaan!! ":fire:

#

this was first five lines of chats.txt

#

0 29/09/20, 4:23 pm - Ranga Cs: Ka pe ranasigam ...  29/09/20, 4:23 pm  Ranga Cs: Ka pe ranasigam Vijay sethupathi mov...      NaT  
1        29/09/20, 5:48 pm - Vinurakav_Sanker: Nice  29/09/20, 5:48 pm     Vinurakav_Sanker: Nice   NaT
2     30/09/20, 1:18 am - Ranga Cs: <Media omitted>  30/09/20, 1:18 am  Ranga Cs: <Media omitted>   NaT  
3     30/09/20, 1:22 am - Ranga Cs: <Media omitted>  30/09/20, 1:22 am  Ranga Cs: <Media omitted>   NaT  
4 01/10/20, 6:48 am - Dhaneesh Cs: Bgm udu bgm u...  01/10/20, 6:48 am  Dhaneesh Cs: Bgm udu bgm udu " Tan tan tan taa...      NaT

#

i am getting this

#

NaT

#

how do i not get Nat??

#

please ping me

velvet thorn Oct 28, 2020, 8:34 AM

#

@lone osprey %y, not %Y should work

lone osprey Oct 28, 2020, 8:34 AM

#

i will try, thanks

#

it worked thanks @velvet thorn

velvet thorn Oct 28, 2020, 8:35 AM

#

yw

#

@lone osprey https://strftime.org/ might help you next time

Python strftime reference

A quick reference for Python's strftime formatting directives.

#

be careful with the capitalisation, it can cause problems
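
The difference between `%y` and `%Y` can be checked with the standard library alone, using one timestamp from the chat file above. With `errors='coerce'`, pandas turns the same failure into `NaT` instead of raising.

```python
from datetime import datetime

stamp = "29/09/20, 5:48 pm"

# %y matches a two-digit year, so this parses.
parsed = datetime.strptime(stamp, "%d/%m/%y, %I:%M %p")

# %Y expects a four-digit year, so the same string is rejected.
try:
    datetime.strptime(stamp, "%d/%m/%Y, %I:%M %p")
    four_digit_ok = True
except ValueError:
    four_digit_ok = False
```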

lone osprey Oct 28, 2020, 8:36 AM

#

kk

lapis sequoia Oct 28, 2020, 10:18 AM

#

ad?

#

<@&267628507062992896> <@&267629731250176001>

kindred finch Oct 28, 2020, 10:20 AM

#

!warn 756099540913750086 You have been told not to advertise your guild here before. Please stop doing this

arctic wedgeBOT Oct 28, 2020, 10:20 AM

#

:incoming_envelope: :ok_hand: applied warning to @cedar sky.

lapis sequoia Oct 28, 2020, 10:20 AM

#

Nice

midnight rain Oct 28, 2020, 12:02 PM

#

I'm not super familiar with SpaCy, but can't you write a custom pipeline?
@heady hatch yeah that's what ive been trying to do, but i cant figure out an efficient way to do so.

dim hemlock Oct 28, 2020, 2:17 PM

#

im not sure whether this is the right place to ask the question but is it possible to make a custom spatial plot in matplotlib?

#

such as this. this is a room with boxes representing different items in the room

📎 unknown.png

hollow sentinel Oct 28, 2020, 2:44 PM

#

lemme check the doc lol

#

nah I can't find a spatial plot in the doc @dim hemlock

#

if you're a beginner to Pandas this kernel might be helpful: https://www.kaggle.com/python10pm/pandas-75-exercises-with-solutions

Pandas 75 exercises with solutions

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

lapis sequoia Oct 28, 2020, 3:42 PM

#

mmm

#

ok

hollow sentinel Oct 28, 2020, 3:48 PM

#

I find doing projects and cleaning the data to be more helpful

#

@lapis sequoia wow you're typing a lot

lapis sequoia Oct 28, 2020, 4:01 PM

#

no u

hollow sentinel Oct 28, 2020, 4:01 PM

#

hahahaha

lapis sequoia Oct 28, 2020, 4:02 PM

#

hi, i will go direct to the main point:

i have 100K rows for the file i need to fix, and almost 2K rows for the dictionary (search string and replace by)
i need to keep it very fast, like max 5min (currently it's done in 4sec).

but i wanna improve it on the replacement side, let's say i have this text: "sm crzy txt vry lg one"
and my dict has an order like

"sm > some > crzy > crazy > some crazy > some crazy word > etc"

and im kinda stuck here because i do replace the whole thing but it never replaces it by steps so i end up with

"some crzy txt vry lg one"

and sooo if anyone knows the way, i would appreciate it

sorry if my english is wonky

#

its not a lot, just i took my time to make it clear xd

#

my current way of doing it is like so

self.dataframe["col_name"] = self.dataframe["col_name"].replace(compiled_dict, regex=True)

hollow sentinel Oct 28, 2020, 4:03 PM

#

haha idk i'm a noob

lapis sequoia Oct 28, 2020, 4:03 PM

#

xD

#

npnp

hollow sentinel Oct 28, 2020, 4:07 PM

#

yeah i just picked up machine learning like 2-3 weeks ago

#

haha

lapis sequoia Oct 28, 2020, 4:08 PM

#

i did yesterday

#

but using js

#

to experiment xD

hollow sentinel Oct 28, 2020, 4:08 PM

#

you can do machine learning using js?

#

that's interesting

lapis sequoia Oct 28, 2020, 4:09 PM

#

well, its still python dependent, but you can

#

and not as complicated as python too

#

i mean

#

its js

#

xd

hollow sentinel Oct 28, 2020, 4:09 PM

#

i don't like JS

#

haha

#

it's just an inconsistent language

lapis sequoia Oct 28, 2020, 4:10 PM

#

im a node js developer, but currently working as a python developer xDDD

hollow sentinel Oct 28, 2020, 4:10 PM

#

oh

#

cool

lapis sequoia Oct 28, 2020, 4:10 PM

#

i do like js a lot

#

since i do web dev too, its a good thing xd

#

anyhow i gtg, work time done sweating

hollow sentinel Oct 28, 2020, 4:12 PM

#

peace

midnight rain Oct 28, 2020, 6:44 PM

#

hi, i will go direct to the main point:

i have 100K rows for the file i need to fix, and almost 2K rows for the dictionary (search string and replace by)
i need to keep it very fast, like max 5min (currently it's done in 4sec).

but i wanna improve it on the replacement side, let's say i have this text: "sm crzy txt vry lg one"
and my dict has an order like

"sm > some > crzy > crazy > some crazy > some crazy word > etc"

and im kinda stuck here because i do replace the whole thing but it never replaces it by steps so i end up with

"some crzy txt vry lg one"

and sooo if anyone knows the way, i would appreciate it

sorry if my english is wonky
@lapis sequoia if you have a large enough file sometimes i've had better luck with *nix command line tools like sed and awk.

lapis sequoia Oct 28, 2020, 7:18 PM

#

@midnight rain im using python and pandas

midnight rain Oct 28, 2020, 7:20 PM

#

@lapis sequoia im just saying it can be a good idea to preprocess files using sed or awk sometimes before loading into python and starting your data science work

#

you can absolutely do it in python too though

lapis sequoia Oct 28, 2020, 7:20 PM

#

hmmmm, i see

#

im mostly doing data cleaning at first so dunno, i shall try this

midnight rain Oct 28, 2020, 7:21 PM

#

sed will probably do it faster than what you could write in python

fair girder Oct 28, 2020, 7:39 PM

#

anyone use sage math?

real wigeon Oct 28, 2020, 7:39 PM

#

not sure if this is a good place to ask

fair girder Oct 28, 2020, 7:40 PM

#

this server or this channel?

real wigeon Oct 28, 2020, 7:40 PM

#

i have an sql query that im trying to drop into a pandas df

#

from what I've seen i can pd.read_sql

#

however, I already have the query written out using a connection to mysql

#

this query uses wildcards as placeholders

#

so the query is currently already executed

#

from what I've seen online in the pandas docs, using read_sql would mean that I have to re-query the db using the read_sql call.

#

is this the case?

#

I want to avoid doing that because I'm not sure how read_sql handles place holders (since I need to customize the query based on user input)

#

some code:

#

def survey_results():
    error = None
    connection = db_connection()
    cursor = connection.cursor()
    phone_survey = PhoneResults()
    start_date = request.form.get('start_date')
    end_date = request.form.get('end_date')

    if request.method == "POST":
        get_results = "SELECT upload_timestamp, terminal_number, was_this_a_pandemic_related_call," \
                      "what_was_the_call, was_the_inquiry_resolved FROM phone_survey WHERE upload_timestamp" \
                      "BETWEEN %s AND %s"
        cursor.execute(get_results, (start_date, end_date))
    
        pandas_sql_query = pd.read_sql_query()

    return render_template("survey_reports.xhtml", form=phone_survey, error=error)

#

as you see i already execute the query, how do i just move that information into a df?

#

do i just encapsulate cursor.execute(get_results, (start_date, end_date)) in a variable? and reference that var in pandas?

midnight rain Oct 28, 2020, 8:01 PM

#

@real wigeon which db library are you using?

real wigeon Oct 28, 2020, 8:01 PM

#

mysql

#

im actually getting a syntax error in my query lol

#

i think it's how the lines were broken up

#

but id much rather get your help on the other issue xD

#

@midnight rain

#

well technically im using pymysql

midnight rain Oct 28, 2020, 8:02 PM

#

you should be able to build a DF from the cursor.fetchall()

#

it returns a list of tuples which you can build a DF from. although you'll probably want to name the columns manually

real wigeon Oct 28, 2020, 8:03 PM

#

so i did something like this

#

run_the_query = cursor.execute(get_results, (start_date, end_date))
            
#pandas_sql_query = pd.read_sql_query()
df = pd.DataFrame(run_the_query, columns=['upload_timestamp', 'terminal_number', 'was_this_a_pandemic_related_call',
                                          'what_was_the_call', 'was_the_inquiry_resolved'])
print(df)

#

but it's good to know i was headed in the right direction

#

so i did a cursor.execute

#

get_results was the SQL query

midnight rain Oct 28, 2020, 8:05 PM

#

try something like this:

cursor.execute(get_results, (start_date, end_date))

query_res = cursor.fetchall()
            
#pandas_sql_query = pd.read_sql_query()
df = pd.DataFrame(query_res, columns=['upload_timestamp', 'terminal_number', 'was_this_a_pandemic_related_call',
                                          'what_was_the_call', 'was_the_inquiry_resolved'])
print(df)

real wigeon Oct 28, 2020, 8:05 PM

#

ahh

#

ok

midnight rain Oct 28, 2020, 8:06 PM

#

https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-fetchall.html

#

docs for it

real wigeon Oct 28, 2020, 8:07 PM

#

testing

lapis sequoia Oct 28, 2020, 8:08 PM

#

@midnight rain sed or awk aren't valid for what i do, i need it to be compatible with multiple platforms, and that still doesn't answer my question about replacing a single word by another :p

real wigeon Oct 28, 2020, 8:09 PM

#

bruh that worked @midnight rain

#

thx

#

i guess maybe a small follow up, but on a different topic

#

it seems mysql is like... i forgot the term... the way it defines ranges is not inclusive

#

like if you give a date range of 10-27-2020 - 10-28-2020 it will not include the data for 10-28 since it considers that value as the end of the range

#

and not to be included in the range

#

ohh so fetchall is more for parsing, execute is for grabbing everything

#

why not just grab what you need with execute O.o

midnight rain Oct 28, 2020, 8:18 PM

#

@lapis sequoia

translation = {
    "f00": "foo",
    "b@r": "bar"
}

with open(filename, "r") as file:
    file_data = file.read()

# str.replace returns a new string, so the result has to be reassigned
for old, new in translation.items():
    file_data = file_data.replace(old, new)

with open(filename, "w") as file:
    file.write(file_data)

#

you can do something like that

lapis sequoia Oct 28, 2020, 8:19 PM

#

that won't fix my issue, still the same

midnight rain Oct 28, 2020, 8:19 PM

#

@real wigeon fetchall pulls all rows at once; you can fetch one row at a time or pull everything

lapis sequoia Oct 28, 2020, 8:19 PM

#

i already do

file_data.replace(old, new)

and thats the problem

#

i have multiple steps in my dictionary, by doing so only one of them gets done

real wigeon Oct 28, 2020, 8:20 PM

#

but since execute uses the sql query, you can parse out the data during that step

#

i guess the returned value is not compatible with dataframe format?

midnight rain Oct 28, 2020, 8:33 PM

#

@lapis sequoia I'm replacing them in the file first and im doing it with a loop over the items in the dictionary so it'll do them one at a time until its finished looping over the dict keys
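
One way to sketch the "one rule at a time" idea so that later rules see the output of earlier ones. The rules below are hypothetical and only loosely follow the example text from the question. Word boundaries are an added assumption, so that a short key like "sm" is not replaced inside a longer word.

```python
import re

# Hypothetical rules, applied in insertion order.
rules = {
    "sm": "some",
    "crzy": "crazy",
    "txt": "text",
    "some crazy text": "some crazy word",
}

def apply_rules(text, rules):
    """Apply each replacement in order, feeding the result into the next rule."""
    for old, new in rules.items():
        # \b keeps "sm" from matching inside a longer word such as "small".
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text

fixed = apply_rules("sm crzy txt vry lg one", rules)
```

The last rule only matches because the first three have already run, which is the step-by-step behaviour a single-pass replace does not give. With pandas, the same loop could call `Series.str.replace(..., regex=True)` once per rule, at the cost of one pass over the column per rule.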

#

@real wigeon i dont think the execute actually returns the result of the query, i think it just adds them to the cursor, which you fetch from
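
A self-contained sketch of the execute-then-fetch flow. The standard-library `sqlite3` module stands in for pymysql here (an assumption made so the example runs without a MySQL server). Both follow the Python DB-API, but sqlite3 uses `?` placeholders where pymysql uses `%s`. The table and its two columns are simplified, made-up versions of the survey table.

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for pymysql here; both follow the Python DB-API.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE phone_survey (terminal_number INTEGER, resolved TEXT)")
cursor.executemany(
    "INSERT INTO phone_survey VALUES (?, ?)",
    [(1, "yes"), (2, "no"), (3, "yes")],
)

query = (
    "SELECT terminal_number, resolved FROM phone_survey "
    "WHERE resolved = ? ORDER BY terminal_number"
)

# execute() runs the query; the rows stay on the cursor until fetched.
cursor.execute(query, ("yes",))
rows = cursor.fetchall()  # list of tuples

# Column names are available from cursor.description.
columns = [col[0] for col in cursor.description]
df = pd.DataFrame(rows, columns=columns)

# Alternative: let pandas run the parameterised query itself.
df2 = pd.read_sql_query(query, connection, params=("yes",))
connection.close()
```

Taking the names from `cursor.description` avoids typing the column list by hand, and the `params` argument shows that `read_sql_query` can also handle placeholders if re-running the query through pandas is acceptable.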

real wigeon Oct 28, 2020, 8:34 PM

#

ouhhh

past pewter Oct 28, 2020, 9:27 PM

#

question- I would like to have a model that is able to separate my text
Cardboard Plastic # 1 & # 2 only ( food containers are acceptable only if they have been rinsed clean ) Newspaper Metals Aluminum Glass Electronics
should be
Cardboard
newspaper
plastic # 1 & # 2 only ( food containers are acceptable only if they have been rinsed clean )
metals aluminum
glass
electronics
I currently do this with a MASSIVE rules based system (regexes, etc.), and it still stumbles too much (the data is quite heterogenous, so purely rules based is a struggle)
But I have enough data that an ML model should be able to learn
Any ideas on how to approach this?

#

I'm fairly comfortable with ML and deep learning, though I generally use CNNs for my NLP work.

heady hatch Oct 28, 2020, 9:33 PM

#

@past pewter
Hmm I was thinking you can prepare a dataset where you separate the words via special characters and then predict where the special characters are?

I'm not a researcher though, so my knowledge will be a bit sparse.
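
A minimal sketch of how such a dataset could be prepared, treating the task as predicting, for each token, whether it starts a new item. The hand-split `segments` list and the 0/1 labelling scheme are assumptions for illustration. The model that would consume these labels is not shown.

```python
# Hypothetical training example: the items already split by hand.
segments = [
    "Cardboard",
    "Plastic # 1 & # 2 only",
    "Newspaper",
    "Metals Aluminum",
]

def to_boundary_labels(segments):
    """Flatten segments into tokens, labelling a token 1 if it starts a new item, else 0."""
    tokens, labels = [], []
    for segment in segments:
        for i, token in enumerate(segment.split()):
            tokens.append(token)
            labels.append(1 if i == 0 else 0)
    return tokens, labels

tokens, labels = to_boundary_labels(segments)
```

Each (tokens, labels) pair is then one training sequence for a token-level classifier, and the predicted 1s mark where to cut the raw text.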

#

Though to clarify is there a set amount of keywords like cardboard, plastic, newspaper etc?