#data-science-and-ml | Python | Page 228

steel ravine Jun 15, 2020, 11:35 PM

#

I will take into consideration NNs but don't want to jump to them if it would be an overkill

boreal portal Jun 15, 2020, 11:38 PM

#

You could spend as much time learning NN business as you spend on non NN stuff to solve this issue

#

And you can do so much stuff with NN stuff, more and more every day!

quiet zinc Jun 15, 2020, 11:41 PM

#

Hey folks, can someone help me with a few questions in my university exam in Python, mostly related to NLTK? (Nothing advanced, I believe)

boreal portal Jun 15, 2020, 11:42 PM

#

Don't break the law

desert oar Jun 15, 2020, 11:44 PM

#

@quiet zinc we can't and won't help with exams

#

we can provide limited homework help but that's it

#

!rules 5

arctic wedge BOT Jun 15, 2020, 11:44 PM

#

Rules

5. Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious/inappropriate or be for graded coursework/exams.

quiet zinc Jun 15, 2020, 11:44 PM

#

yea it's mostly task consisted of 4 questions but yea

blazing bridge Jun 16, 2020, 12:41 AM

#

Would coursera courses count as breaking this rule? Like if I don’t understand a lecture can I ask for help?

plain jungle Jun 16, 2020, 12:47 AM

#

so I made a coupling algorithm, and im really excited about it cause it works. What are some fun stuff to test with it

#

at the moment i have just been feeding it random data such as

#

📎 unknown.png

#

to then get data of

#

📎 unknown.png

desert oar Jun 16, 2020, 1:00 AM

#

@plain jungle try some classic datasets like iris, boston housing, and titanic

#

and that one wine tasting dataset

#

https://archive.ics.uci.edu/ml/index.php

plain jungle Jun 16, 2020, 1:02 AM

#

id need two dimensional floats / integer data. I'd love to do the iris' (the flower one right?) but last I remember that was 6 dimensional

#

I should though work on making this scalable, just not sure how to plot anything more than 3d in matplotlib

floral siren Jun 16, 2020, 1:02 AM

#

wow i just did a decision tree with iris

desert oar Jun 16, 2020, 1:03 AM

#

@plain jungle just pick 2 columns from it?

#

all the bivariate relationships in iris are usable

plain jungle Jun 16, 2020, 1:03 AM

#

heck... wow... yo! imma do that

desert oar Jun 16, 2020, 1:03 AM

#

or get really cRaZy and hit it with multidimensional scaling first

plain jungle Jun 16, 2020, 1:03 AM

#

thank you

sonic finch Jun 16, 2020, 2:42 AM

#

Quite basic question...I'm trying to install the BERT Service Client. I've tried following directions here: https://github.com/hanxiao/bert-as-service and tried following the troubleshooting here: https://github.com/hanxiao/bert-as-service/issues/194 for how to handle when it won't start up in the command line. Appreciate any insight.

GitHub

hanxiao/bert-as-service

Mapping a variable-length sentence to a fixed-length vector using BERT model - hanxiao/bert-as-service

GitHub

'bert-serving-start' is not recognized as an internal or external c...

Hi, This is a very silly question..... I have python 3.6.6, tensorflow 1.12.0, doing everything in conda environment, Windows 10. I pip installed bert-serving-server/client and it shows Successfull...

#

For reference, I have downloaded the files and do specify those in the command line calls that I'm making

spare karma Jun 16, 2020, 3:50 AM

#

@sonic finch not sure this will solve your problem, as I'm very new to BERT (like 2 days) but I just completed this pytorch tutorial that used bert..worked flawlessly for me. Message me if you have any questions. I'll try and help the best I can.

#

https://colab.research.google.com/drive/1PHv-IRLPCtv7oTcIGbsgZHqrB5LPvB7S#scrollTo=fMSr7C-F_sey

Google Colaboratory

#

and look at this one..

#

https://www.curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

Curiousily

Sentiment Analysis with BERT and Transformers by Hugging Face using...

Detect sentiment in Google Play app reviews by building a text classifier using BERT

#

^one of the two is the most-updated one. I'll let you find which one is the most recent (one of them he messed up his cross-entropy). And then two vid-ja tutorials on the above:

#

https://www.youtube.com/watch?v=Osj0Z6rwJB4

YouTube

Venelin Valkov

Text Preprocessing | Sentiment Analysis with BERT using huggingface...

#

https://www.youtube.com/watch?time_continue=737&v=8N-nM3QW7O0&feature=emb_logo

YouTube

Venelin Valkov

Text Classification | Sentiment Analysis with BERT using huggingfac...

oblique belfry Jun 16, 2020, 3:52 AM

#

I have a Pandas question. I have data I need to do some time series stuff with. It is a set of events (event sourcing model) that fire at a certain time, for a certain uuid, and for a certain event. So timestamp, uuid, category is the data format. I want to get the average events per uuid per hour. So, I am thinking I need to convert the category data to one-hot encoding. Then, df.groupby("uuid").resample("1T").sum().mean() where the index is the timestamp. Is my reasoning flawed? I haven't done much with groupby.
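
A rough sketch of one way to get there, assuming a DataFrame with `timestamp`, `uuid` and `category` columns (the sample rows and the `hour` helper column are made up for illustration). Counting events doesn't need one-hot encoding, and note that the `"1T"` alias means one minute, not one hour. This version averages only over hours in which a uuid had at least one event; if empty hours should count as zero, a resample-based approach would be needed instead.

```python
import pandas as pd

# Made-up sample in the (timestamp, uuid, category) layout described above.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-06-15 10:05",
                "2020-06-15 10:40",
                "2020-06-15 11:10",
                "2020-06-15 10:15",
                "2020-06-15 12:30",
            ]
        ),
        "uuid": ["a", "a", "a", "b", "b"],
        "category": ["click", "view", "click", "view", "click"],
    }
)

# Bucket each event into the hour it occurred in.
events["hour"] = events["timestamp"].dt.floor("h")

# Number of events per (uuid, hour); hours with no events are simply absent.
per_hour = events.groupby(["uuid", "hour"]).size()

# Average over the hours in which each uuid was active.
avg_per_uuid = per_hour.groupby(level="uuid").mean()
print(avg_per_uuid)
```

To break the counts down by event type as well, adding `"category"` to the first `groupby` list should be enough.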

flat quest Jun 16, 2020, 4:14 AM

#

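@blazing bridge hey. Idk if someone has answered your question yet,

but the two lines of code you sent do the same thing. Accessing a column as a property and with the [] operator do the same thing for pandas objects (the property form only works when the column name is a valid Python identifier that doesn't clash with an existing attribute).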

As for the groupby function. We're grouping each element in the pandas object by their year. So each pandas row with the same year will be grouped together.

Then what you're doing here is getting the totalprod of each grouped year, and getting the average value.

blazing bridge Jun 16, 2020, 4:15 AM

#

as in pandas object do you mean the dataframe

flat quest Jun 16, 2020, 4:15 AM

#

yeah
dataframe or series

#

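groupby works on a series too, but there you group by the index or by a separate set of keys, since a series has no columns to group on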

blazing bridge Jun 16, 2020, 4:15 AM

#

Ok [] and . both are used to access columns

flat quest Jun 16, 2020, 4:15 AM

#

yes

blazing bridge Jun 16, 2020, 4:15 AM

#

and they will be grouped by year

#

Thank you so much

flat quest Jun 16, 2020, 4:16 AM

#

yeah np

blazing bridge Jun 16, 2020, 4:16 AM

#

One more question

flat quest Jun 16, 2020, 4:16 AM

#

mhm?

blazing bridge Jun 16, 2020, 4:16 AM

#

how can we do more than one column

#

like totalprod and priceperlb

flat quest Jun 16, 2020, 4:16 AM

#

you want to average over multiple columns?

blazing bridge Jun 16, 2020, 4:16 AM

#

yeah grouped by year

flat quest Jun 16, 2020, 4:18 AM

#

oh for that you need to use .agg

so something like this.

df.groupby('year').agg({'totalprod': 'mean', 'priceperlb': 'mean'})

Basically you're running the mean function on totalprod and priceperlb for each grouped year. The agg method takes a dict keyed by column name, and each value can be a single aggregation or an array (a plain Python list) of them, so you can use different types of summation or averaging methods and cover more than 1 column in one call.

blazing bridge Jun 16, 2020, 4:19 AM

#

so we have to import numpy as well

#

in order to use it or would it consider it as a list

flat quest Jun 16, 2020, 4:19 AM

#

pandas uses numpy

blazing bridge Jun 16, 2020, 4:20 AM

#

I saw the agg function but I didnt know it accessed multiple columns as well

#

thank you so much

flat quest Jun 16, 2020, 4:20 AM

#

yeah np
the pandas docs have a couple more options for the agg, and prob how to make custom functions to use on the groupby's so make sure to check those out.

blazing bridge Jun 16, 2020, 4:21 AM

#

`df.groupby('year').agg({'mean', 'sum' : ['totalprod', 'priceperlib']}).

#

yeah I will

#

would this work

flat quest Jun 16, 2020, 4:21 AM

#

well you're not passing anything to mean

#

so might as well just take it out

blazing bridge Jun 16, 2020, 4:22 AM

#

oh i thought it would do the mean and the sum of the columns we specified

flat quest Jun 16, 2020, 4:22 AM

#

for that you'll need to list both mean and sum for each column in the object.

So like

{'totalprod': ['mean', 'sum'],
'priceperlb': ['mean', 'sum']}
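
One thing worth double-checking here: as far as pandas' `agg` goes, the dict is keyed by column name and the values are the function name(s), not the other way round. A small sketch with made-up numbers (the `year`, `totalprod` and `priceperlb` names follow the conversation; the values are invented):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "year": [1998, 1998, 1999, 1999],
        "totalprod": [100.0, 300.0, 200.0, 400.0],
        "priceperlb": [1.0, 3.0, 2.0, 4.0],
    }
)

# Keys are column names, values are the aggregation(s) to run on that column.
summary = df.groupby("year").agg(
    {"totalprod": ["mean", "sum"], "priceperlb": ["mean", "sum"]}
)
print(summary)

# When every column gets the same functions, select the columns first
# and pass a plain list of function names.
same = df.groupby("year")[["totalprod", "priceperlb"]].agg(["mean", "sum"])
print(same)
```

No numpy import is needed for this; the lists are ordinary Python lists.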

blazing bridge Jun 16, 2020, 4:23 AM

#

Oh ok thank you so much for your help. I know I've been a pain

flat quest Jun 16, 2020, 4:53 AM

#

nah its all good

blazing bridge Jun 16, 2020, 9:23 AM

#

can someone explain the difference between a validation set and a test set?

#

they seem very similar but i dont understand what the difference is

desert oar Jun 16, 2020, 11:04 AM

#

"Validation set" is usually used while you are tuning hyperparameters and choosing between candidate models

#

"Test set" is saved until the very end of your work, as a final estimate of out of sample performance

#

Personally, I think the names should be reversed, but the terminology has been established for a few years now and it's stuck
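
To make the distinction concrete, here is a minimal sketch of a three-way split using only the standard library. The function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration, not a recommendation from the discussion.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                    # untouched until the very end
    val = rows[n_test:n_test + n_val]       # used while tuning
    train = rows[n_test + n_val:]           # used to fit the model
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

The idea is that you fit on `train`, compare settings on `val` as often as you like, and score on `test` once when you are done.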

wicked flare Jun 16, 2020, 11:09 AM

#

@desert oar This confuses me a little bit. What happens if you develop your parameters and manage to get good performance on your validation set, then you check with your test set and get awful performance. What do you do? Do you go out and get more data? I assume you don't want to reuse the test set, because if so, what's even the difference between the test and validation set?

desert oar Jun 16, 2020, 11:10 AM

#

@wicked flare If that happens, it means that you did a poor job of constructing your data sets, and before you even touch your model again you need to spend some time carefully assessing the differences between your three data sets

#

Which yes, might mean that you either need to get new data entirely or, in a time sensitive situation and with extreme caution, reshuffle your three sets together if you just got very very unlucky

wicked flare Jun 16, 2020, 11:11 AM

#

Ok, because there are some biases built into the validation set, so you overfitted the parameters to those biases or something?

desert oar Jun 16, 2020, 11:11 AM

#

Yes, basically you are now trying to figure out "did I over fit my model, or did I sample my data sets incorrectly"

wicked flare Jun 16, 2020, 11:11 AM

#

(or maybe the biases are in the test set)

slim fox Jun 16, 2020, 11:12 AM

#

which means that your train data is not good, I would say

desert oar Jun 16, 2020, 11:12 AM

#

Sometimes in things like kaggle they deliberately screw you with wonky scoring sets to punish overfitting, but you always need to make sure your data is in order before you try to adjust your training process

wicked flare Jun 16, 2020, 11:12 AM

#

I guess any of the three sets could be problematic.

#

If the training data is bad, the model will be bad. If the validation data is bad, you will tune the parameters incorrectly, and if the test set is bad, you will get a bad result even with a good model.

desert oar Jun 16, 2020, 11:13 AM

#

And yes, sometimes "get more data" is the best solution

slim fox Jun 16, 2020, 11:13 AM

#

the question here is also whether you created train/val/test from one data set or they come from different sources

#

quite often validation set is obtained by the split from the entire train set

lapis sequoia Jun 16, 2020, 11:15 AM

#

I'm just about to start a new job in machine learning in the UK public sector, previously I've only worked in the private sector. Anyone worked in both and can give me some heads up on what I should expect?

slim fox Jun 16, 2020, 11:16 AM

#

here also cross-validation comes in play -> you do it to make sure that the way you build model does not favor overfitting for instance
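
For reference, a bare-bones sketch of how k-fold cross-validation carves up the data, written without any library so the mechanics are visible. `k_fold_indices` is a made-up helper; in practice scikit-learn's `KFold` covers the same ground.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

You would train one model per fold and average the validation scores, which gives a less lucky-split-dependent estimate than a single validation set.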

foggy nebula Jun 16, 2020, 11:22 AM

#

Hello Everyone,
I was wondering if any of you have experience in converting a dash plotly app into an .exe?
I have tried suggestions on the forums of dash but those don't seem to work for me and others as well on the forum

quaint basalt Jun 16, 2020, 1:41 PM

#

Hello world! Anyone familiar with scikit-learn online at the moment and available to answer some questions? I think I want to do something strange (a branching Pipeline with a variable number of features in the intermediate steps), and I don't know how.

desert oar Jun 16, 2020, 1:43 PM

#

@quaint basalt just describe your question in more detail, then somebody who knows the answer can see it and help

quaint basalt Jun 16, 2020, 1:43 PM

#

Sure thing. Let me see if I can format a ascii graph thingy with the pipeline I have in mind

desert oar Jun 16, 2020, 1:43 PM

#

As the saying goes, "don't ask to ask"

quaint basalt Jun 16, 2020, 1:48 PM

#

My main input is a (networkx) graph, and I basically want to cluster the nodes of that graph using a soft clustering algorithm. To do this I need to predict 1) the number of clusters, and 2) for every edge an associated weight. The number of clusters is pretty straight forward since 1 graph results in 1 number. The edge weights are the problem, since 1 graph results in N weights with a different N for every graph. My idea is roughly the following:

      _ <featurize graph> - <normalize etc> - <SVR to predict number of clusters> _
     /                                                                             \
graph                                                                               <cluster> - <score>
     \_ <featurize all edges [1]> - <normalize> - <SVR to predict weights>        -/

#

[1] This results in many features/items/datapoints

desert oar Jun 16, 2020, 2:18 PM

#

I have to take care of something at work, ping me this evening and I can take a look

#

In the meantime, look into FeatureUnion

#

In sklearn

quaint basalt Jun 16, 2020, 2:20 PM

#

Thanks, will do. I also found sklearn-lego, which also seems to be an essential part in this

desert oar Jun 16, 2020, 4:35 PM

#

yeah this is a bit more advanced than a sklearn pipeline is meant for

#

i didnt know about sklearn-lego

#

it looks more like just a collection of various custom transformers people have written

#

what is the last stage meant to represent?

quaint basalt Jun 16, 2020, 4:42 PM

#

The score? Might be just a misrepresentation from my part. What I mean is I will compare the found clustering to a ground truth

#

The main problem I run into is that I have an unknown number of edges per graph

#

sklearn-lego seems to be needed to enable feeding the output of the intermediate predictors to the clustering

#

I can maybe turn the bottom half in a custom predictor of sorts which takes a graph and returns a weights matrix. But then I may still have the problem that its shape is not constant.

desert oar Jun 16, 2020, 5:33 PM

#

honestly i would just write your own class for this

#

or function

#

or whatever

#

because the parameters from one of the models depends on the output from another model (# of clusters)

#

this is just way beyond what sklearn pipelines are able to handle

quaint basalt Jun 16, 2020, 5:35 PM

#

I was afraid of that. Thanks for confirming it 🙂

lapis sequoia Jun 16, 2020, 7:16 PM

#

Hey all, anyone got any links to some good tutorials/info pages for machine learning and data science using python and tensorflow? I just got hired for a data science internship and have absolutely no datascience or machine learning background and feeling a little over my head

solid aurora Jun 16, 2020, 7:17 PM

#

Kind of an odd question, but given a jupyter password, can I get a token for that instance?

#

i.e.

#

I am given a jupyter URL like:

#

jupyter.verylarge.cluster.org:123456/login

#

and am given the password abcd1234

#

How can I get an auto-login link like jupyter.verylarge.cluster.org:123456/?token=396d6da2df034621a8836ab6c0689eae?

devout sail Jun 16, 2020, 7:29 PM

#

That sounds like a terrible idea

floral siren Jun 16, 2020, 7:29 PM

#

Data Science Projects with Python by Stephen Klosterman is the book that I have been using @lapis sequoia . Not an online tutorial, but it is the best book that I have used on the subject

lapis sequoia Jun 16, 2020, 7:30 PM

#

@floral siren awesome thanks ill check it out!

flat quest Jun 16, 2020, 7:37 PM

#

there's a couple good courses on udacity @lapis sequoia
the intro to tensorflow one is pretty good

solid aurora Jun 16, 2020, 7:54 PM

#

That sounds like a terrible idea
@devout sail are you referring to my question?

devout sail Jun 16, 2020, 7:55 PM

#

Yeah. I don't have a definite answer, but finding the token based on password would essentially allow you to enter the notebook by guessing the right password wouldn't it?

solid aurora Jun 16, 2020, 7:55 PM

#

I mean, you can do the same thing by bruteforcing the login page

#

My guess was the token would be stored in localstorage or something like that once the notebook was unlocked

#

@devout sail found this in the webpage source:

#

so the token is in the url but removed

#

how can I get around that?

#

I obviously can't change the source code of a remote jupyter book

#

and I can't see anything in the network tab of dev tools

solid aurora Jun 16, 2020, 8:30 PM

#

@devout sail what about from the cookie that is set?

#

I just noticed that jupyter sets a cookie when logging in via a password

devout sail Jun 16, 2020, 8:33 PM

#

I'm not sure if I know or want to help with that, sorry. You should find a legit way to open the notebook

solid aurora Jun 16, 2020, 8:34 PM

#

@devout sail bruh I have access to the notebook already

#

basically I am given web access with a url+password

#

I want to use it through vs code

#

which requires the token in order to connect to the jupyter daemon

#

there's nothing illegal/illegitimate going on here

devout sail Jun 16, 2020, 8:34 PM

#

Then you should get it from whoever gave the url+password

solid aurora Jun 16, 2020, 8:35 PM

#

I'll try that, but I don't think they'll be able to change the system so easily

#

it's an automated system that I request an instance from

#

and I'm given url+password to connect

oblique grove Jun 16, 2020, 8:37 PM

#

sorry if this isn't the right place to ask, but I don't really know whether I should use google colab or just pycharm while I'm learning machine learning

solid aurora Jun 16, 2020, 8:38 PM

#

colab is great because it has all the libraries set up for you

#

only issue is that while you can save code between runs, you can't save files

#

for most small ML projects that's perfect

oblique grove Jun 16, 2020, 8:42 PM

#

oh ok nice

#

and will keep that in mind

modern canyon Jun 16, 2020, 8:49 PM

#

Hello there folks, I recently finished an introductory data science course and was also shortlisted for an ML internship. I was given an assignment where I have to predict wine variety using various features. I have never worked on an ML project before and also don't have much knowledge on the theoretical side either. I just finished the assignment today and attained about 97% accuracy. I am sure it is dumb luck but just to be sure can you guys review my notebook (https://nbviewer.jupyter.org/github/shyam1998/Wine-Variety-Prediction/blob/master/main.ipynb) and see if I am doing something wrong to attain such accuracy?

#

note: They didn't give out the labels for the test data so I train_test_splitted the training data

south coyote Jun 16, 2020, 9:45 PM

#

@modern canyon I am a noob on these stuff pretty much even i've been taking ml classes for semesters now, i think its good.

#

I assume there aren't multiple entries with very similar parameters in the dataset though

#

If such a thing happened, then you would have the similar items both in the training and test set I mean

flat quest Jun 16, 2020, 10:02 PM

#

@solid aurora actually u can save files. U'll just need to use drive for that.

But yes for larger projects, its recommended to use a standard ide or editor along with a version control system like git

solid aurora Jun 16, 2020, 10:06 PM

#

reminds me of the days when I used to "version control" by uploading code to google drive

steel ravine Jun 16, 2020, 10:21 PM

#

Anyone with some knowledge about Neural Networks and Poker?

boreal portal Jun 16, 2020, 10:26 PM

#

Going with the NN solution i see?

steel ravine Jun 16, 2020, 10:27 PM

#

The goal is to make a bot that will learn to mirror the skill level of the player. Forcing the player to play against him self and learn his mistakes

boreal portal Jun 16, 2020, 10:27 PM

#

In online poker?

steel ravine Jun 16, 2020, 10:28 PM

#

This is for a custom poker video game, it can be played offline. Online features are only to serve ads because the game is free

boreal portal Jun 16, 2020, 10:28 PM

#

That would take forever to gather the data to make you know

#

If you have a single player

#

Doing a single game with this ai it would take hundreds and hundreds of hands to make meaningful progress if you're doing it super efficient like.

steel ravine Jun 16, 2020, 10:29 PM

#

It would not start from 0, a pretrained model will be distributed with the game and will correct it self to mirror the player better

boreal portal Jun 16, 2020, 10:30 PM

#

Now i ain't a super poker player but you would want to gather the data that affects how a player would play

#

~~blackjack would be easier~~

steel ravine Jun 16, 2020, 10:32 PM

#

The problem I'm trying to solve is the input layer of the NN. Should I use one neuron for each card and set it to 1 if you have that card or should I only use a single input neuron and interpolate it between 0 and 1 to show the cards value

boreal portal Jun 16, 2020, 10:33 PM

#

I'd say more the better

#

On that one

#

If you want it to be good and use the least data possible right
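
A small sketch of the one-input-per-card option, for comparison with the single interpolated value. Squeezing a card into one number between 0 and 1 implies an ordering and a distance between cards that has nothing to do with the game, whereas a 52-slot 0/1 vector lets the network treat each card independently. The card notation (`"As"`, `"Kd"`) and the helper name `encode_hand` are assumptions for illustration.

```python
RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [rank + suit for suit in SUITS for rank in RANKS]  # 52 cards
CARD_INDEX = {card: i for i, card in enumerate(DECK)}


def encode_hand(cards):
    """Return a 52-element 0/1 vector with a 1 for every card held."""
    vector = [0.0] * len(DECK)
    for card in cards:
        vector[CARD_INDEX[card]] = 1.0
    return vector


hole_cards = encode_hand(["As", "Kd"])
print(len(hole_cards), sum(hole_cards))
```

Community cards could get a second vector of the same shape, so the network can tell them apart from the hole cards.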

steel ravine Jun 16, 2020, 10:36 PM

#

Ok, thanks for the input

boreal portal Jun 16, 2020, 10:36 PM

#

You gonna use some transfer learning?

#

And hell you can test many different models

#

You know you gotta experiment

#

Don't be afraid to go back and reexamine. Nothing about your project is set in stone.

vernal cypress Jun 16, 2020, 11:30 PM

#

hi, im just starting with pandas. is this a good place to ask for help? or should i go to python help?

paper niche Jun 16, 2020, 11:43 PM

#

@vernal cypress either’s fine

vernal cypress Jun 16, 2020, 11:44 PM

#

A'ight

#

I have an online store

#

each month i pay artists royalties

#

I'm trying to build a simple tool that will count up all the product sold within a month

#

i can export CSVs through my webstore admin panel

#

I sell shirts among other things

#

One design can be 8 or more colors and 6-7 sizes

#

the CSV i export keeps all of that in one column, "Lineitem name"

#

so when i use .groupby function i get this:

📎 unknown.png

#

What would be the best way of aggregating all rows that contain "Aesthetic shirt" to one and creating a new data frame from it?

#

is the .groupby even appropriate?

paper niche Jun 16, 2020, 11:53 PM

#

Probably u just need to split out the Lineitem name into 3 different columns (name, colour, size) then group by name

#

I think something that complicates things is that you have that Heather Prism shirt as well..

#

are all ur item names always before a hyphen?

vernal cypress Jun 16, 2020, 11:54 PM

#

oh damn you're right, they are

#

the csv i get has 72 columns, it's overwhelming lol. is there anyway i can drop all column except the one i need? i've been dropping them each by name

paper niche Jun 16, 2020, 11:57 PM

#

oh i just realized Heather Prism Mint is the colour isnt it

#

doh

vernal cypress Jun 16, 2020, 11:57 PM

#

yes heather prism mint is the color

paper niche Jun 16, 2020, 11:58 PM

#

yep one sec lemme shift over to my laptop

vernal cypress Jun 16, 2020, 11:58 PM

#

but just the fact that you pointed out that they're all separated by a hyphen and the sizes are separated by a slash already helps a bunch lol

paper niche Jun 16, 2020, 11:59 PM

#

the csv i get has 72 columns, it's overwhelming lol. is there anyway i can drop all column except the one i need? i've been dropping them each by name
@vernal cypress you can just select the column you need instead of dropping the ones you don't.

df.loc[:, ['col_1', 'col_2']]

vernal cypress Jun 16, 2020, 11:59 PM

#

excellent, thank you

paper niche Jun 17, 2020, 12:03 AM

#

np, glad to help

#

you can also select a subset of columns to read in if you're using pd.read_csv() (I forgot the argument name, but you can find it on the docs I'm sure)
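
The `pd.read_csv()` argument in question is `usecols`. A minimal sketch, using an in-memory stand-in for the export; apart from "Lineitem name", the column names and rows are invented:

```python
import io

import pandas as pd

# Stand-in for the exported file; only "Lineitem name" comes from the chat.
csv_text = (
    "Name,Email,Lineitem name,Lineitem quantity\n"
    "#1001,a@example.com,Aesthetic shirt - Black / M,1\n"
    "#1002,b@example.com,Aesthetic shirt - White / L,2\n"
)

# Only the listed columns are parsed; the other columns are skipped entirely.
orders = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Lineitem name", "Lineitem quantity"],
)
print(orders.columns.tolist())
```

With a real file you would pass the path instead of the `io.StringIO` object.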

vernal cypress Jun 17, 2020, 12:07 AM

#

i'll probably ask a few more dumb newbie questions before i figure it out but now i think i have the right approach

rich silo Jun 17, 2020, 12:08 AM

#

Anyone here has experience in Dash?

paper niche Jun 17, 2020, 12:30 AM

#

just ask your question, don't ask to ask. If someone knows the solution, they will answer.

vernal cypress Jun 17, 2020, 12:36 AM

#

@paper niche
`def get_color(color):
    return color.split('-')[1]

df['Column'] = df['Lineitem name'].apply(lambda x: get_color(x))`

gives me
"IndexError: list index out of range"

#

apparently it has something to do with pandas not knowing what to do with empty columns

#

some of the Lineitem names don't contain any hyphens or anything

#

is that why?

paper niche Jun 17, 2020, 12:44 AM

#

yeah if some names have no hyphens then .split('-') will only return a list of 1 element (the original string), then [1] will give a indexerror
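
To see the failure in isolation, and one plain-Python way around it: `str.partition` always returns three parts, so there is nothing to index past. The helper name `split_item` and the sample strings are made up.

```python
def split_item(lineitem):
    """Split 'Name - Colour / Size' into (name, rest); rest is None without ' - '."""
    name, sep, rest = lineitem.partition(" - ")
    return name, (rest if sep else None)


# With no hyphen, split() hands back a single-element list, so [1] is out of range.
print("Sticker pack".split("-"))

print(split_item("Aesthetic shirt - Black / M"))
print(split_item("Sticker pack"))
```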

#

you need the colour? I thought you wanted the name

vernal cypress Jun 17, 2020, 12:45 AM

#

well my idea was to separate the color and size into separate columns and then use .groupby on the name

paper niche Jun 17, 2020, 12:45 AM

#

btw, no need for apply, there are inbuilt string methods you can access via .str on a dataframe/series

📎 Screenshot_2020-06-17_at_8.44.56_AM.png

#

this method also doesn't throw an error if your item name has no ' - ' delimiter, it just returns None in the second column (colour) after expanding

#

📎 Screenshot_2020-06-17_at_8.47.08_AM.png

vernal cypress Jun 17, 2020, 12:47 AM

#

oh sick!

#

lemme try

#

it works!

📎 unknown.png

#

now how do i move that data to a separate column, leaving just the name

paper niche Jun 17, 2020, 12:56 AM

#

assign it back to df, then drop the other columns you don't need

📎 Screenshot_2020-06-17_at_8.55.57_AM.png

#

or if you really only want the name, then

df['name'] = df['item'].str.split(' - ', expand=True)[0]

#

something like that

vernal cypress Jun 17, 2020, 12:58 AM

#

can i use df.loc or should i just df.drop?

paper niche Jun 17, 2020, 12:58 AM

#

assign it back to df, then drop the other columns you don't need
regarding this, you might also want to specify n = 1 in the argument for .str.split() to ensure you only get 2 columns out

#

either is fine, use whatever's more convenient

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html?highlight=split#pandas.Series.str.split
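
Since the screenshots aren't reproduced here, this is a sketch of the same idea rather than the exact code that was shown. It assumes the "Name - Colour / Size" layout discussed above; the rows and the `variant` column name are invented, while "Product name" follows the column mentioned later in the chat.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Lineitem name": [
            "Aesthetic shirt - Black / M",
            "Aesthetic shirt - Heather Prism Mint / L",
            "Sticker pack",
        ]
    }
)

# n=1 splits on the first ' - ' only, so at most two columns come back.
parts = df["Lineitem name"].str.split(" - ", n=1, expand=True)
df["Product name"] = parts[0]
df["variant"] = parts[1]  # missing where the name had no ' - '

# Count rows per product regardless of colour and size.
counts = df.groupby("Product name").size()
print(counts)
```

If the export also has a quantity column, summing that per product is probably closer to "units sold" than counting rows.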

vernal cypress Jun 17, 2020, 1:45 AM

#

alright sweet, i got it down to a somewhat tidy csv now, i can df.groupby('Product name').count() and i see exactly what i want

fringe violet Jun 17, 2020, 2:14 AM

#

well i was told to go here.....

#

anyone know anything about euler's method?

stark mulch Jun 17, 2020, 2:27 AM

#

!ask

arctic wedge BOT Jun 17, 2020, 2:27 AM

#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

fringe violet Jun 17, 2020, 2:36 AM

#

i didn't feel like explaining in here my issue if no one in here knows anything about basic numerical methods for solving differential equations

#

hence why i just asked if anyone knows about euler's method, the algorithm i'm using

#

anyway, here's my code for eulersmethod which just returns a set of data that matplotlib plots

#

```python
def eulersmethod(init_x, init_y, dx, range_val, derivative, val_is_range):
    points = [[init_x], [init_y]]
    # start with initial point first apply Euler's Method to the positive right
    x = init_x
    y = init_y
    derivative_at_last_pnt = derivative(x, y)
    # treat range_val like a range
    if val_is_range:
        while y <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while y >= - (range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)
    else:
        # treat range_val like a domain
        while x <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while x >= - (range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)

    return points
```

#

i've solved on paper a couple of well known functions for their derivatives in terms of x and y and plugged those functions into the code to test it. e^x is giving me weird functionality

#

sinx is giving me a straight line

#

y = x looks fine

stark mulch Jun 17, 2020, 2:39 AM

#

have you printed your points out and inspected them?

fringe violet Jun 17, 2020, 2:40 AM

#

that would be a lot to look at which is why the computer is doing it

stark mulch Jun 17, 2020, 2:41 AM

#

fair. there's a point where your x value goes backwards. That's why you're getting the weird graph you're getting.
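
The backwards jump most likely comes from the second loop starting at the right-hand edge instead of at the initial point: x and y are never reset after the forward pass. A minimal sketch of the two-pass idea with that reset, assuming the bounds are measured from the starting x (the posted code measures them from 0) and using a made-up name, `euler_both_ways`:

```python
import math


def euler_both_ways(x0, y0, dx, half_width, derivative):
    """Euler's method left and right of (x0, y0), covering x0 +/- half_width."""
    xs, ys = [x0], [y0]

    # Forward pass: step right from the initial point.
    x, y = x0, y0
    while x < x0 + half_width:
        y += dx * derivative(x, y)
        x += dx
        xs.append(x)
        ys.append(y)

    # Backward pass: restart from the initial point, not from the right edge.
    x, y = x0, y0
    while x > x0 - half_width:
        y -= dx * derivative(x, y)
        x -= dx
        xs.insert(0, x)
        ys.insert(0, y)

    return xs, ys


# dy/dx = y with y(0) = 1 should trace out e^x.
xs, ys = euler_both_ways(0.0, 1.0, 0.01, 2.0, lambda x, y: y)
print(ys[-1], math.exp(xs[-1]))
```

A quick sanity check is that the x values come out strictly increasing; if they don't, the plot will fold back on itself.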

fringe violet Jun 17, 2020, 2:41 AM

#

for domain -10 to 10 i'd have 2000 points to look at

stark mulch Jun 17, 2020, 2:42 AM

#

print them out one per line, round them to a common width, and scroll until you spot the pattern.

fringe violet Jun 17, 2020, 2:42 AM

#

i know what the issue is giving me weird functionality with it going backwards hold up....

neat harness Jun 17, 2020, 2:43 AM

#

Is there a reason why you don't start all the way on the left or right (x axis) rather than in the middle?

fringe violet Jun 17, 2020, 2:43 AM

#

ok i fixed that issue

#

i start wherever the initial point is

#

this is how it's supposed to look minus that weird little blip but i think i could figure that out

📎 unknown.png

#

so that is fixed let me see what sin x looks like now

#

ok so as you probably know...

#

sinx doesn't look like this

📎 unknown.png

stark mulch Jun 17, 2020, 2:48 AM

#

so it seems like every y value you calculate is 0, then

fringe violet Jun 17, 2020, 2:48 AM

#

ok so i know what the issue is. i'm giving it an initial value of (0, 0) which should give me a sin function but i'm multiplying dx * derivative_of_sinx which is -y^2/2 so it always returns 0

neat harness Jun 17, 2020, 2:49 AM

#

Correct me if I'm misunderstanding this but it looks like you first loop with y or x until it goes above the upper limit, then you try to iterate until it goes below the lower limit? Wouldn't you want only one loop that requires it's between both limits?

fringe violet Jun 17, 2020, 2:51 AM

#

ok i have to separate it into two because i'm starting at some point. i want to use the derivative function to approximate the curve for some step size dx up to and below some hardcoded bounds

#

so i first go up then down. i have no way of knowing what the value at the far left will be until i get there and same for the far right. that's what the code does. it graphically solves the differential equation
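The two-pass approach described above (step right from the initial point, then step left) can be sketched as follows. This is a minimal illustration, not the poster's actual `eulersmethod` function: the name `euler_both_ways` and its parameters are hypothetical. Reversing the left-hand pass before joining the lists keeps the x values increasing, which avoids the "x goes backwards" plotting artifact mentioned earlier.

```python
import math


def euler_both_ways(f, x0, y0, dx, x_min, x_max):
    """Approximate the solution of dy/dx = f(x, y) through (x0, y0).

    Steps right until x_max, then left until x_min, and returns the
    points sorted by x. All names here are illustrative.
    """
    right = [(x0, y0)]
    x, y = x0, y0
    while x + dx <= x_max:
        y += dx * f(x, y)   # forward Euler step to the right
        x += dx
        right.append((x, y))

    left = []
    x, y = x0, y0
    while x - dx >= x_min:
        y -= dx * f(x, y)   # same step, but with a negative dx
        x -= dx
        left.append((x, y))

    # left was built from x0 outward, so reverse it to keep x increasing
    return left[::-1] + right


# dy/dx = y with y(0) = 1 should approximate e**x
points = euler_both_ways(lambda x, y: y, 0.0, 1.0, 0.001, -1.0, 1.0)
print(points[0], points[-1], math.e)
```

Because the list is ordered by x, it can be passed straight to a line plot without the curve doubling back on itself.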

#

Traceback (most recent call last):
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 64, in <module>
    points = eulersmethod(x0, y0, 0.01, domain_size, sineofx, val_is_range=False)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 43, in eulersmethod
    derivative_at_last_pnt = derivative(x, y)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 53, in sineofx
    return -(y**2)/2
OverflowError: (34, 'Result too large')

#

i tried to start at initial value (pi/2, 1) didn't like that

#

this is just so broken if it isn't e^x or y=x... maybe euler's method just doesn't work on this kind of de?

neat harness Jun 17, 2020, 2:59 AM

#

Euler's method is fairly general. It may be worth taking a second look at your math outside of the function you shared

fringe violet Jun 17, 2020, 3:00 AM

#

well all the functions i'm using are fairly simple

neat harness Jun 17, 2020, 3:00 AM

#

And also what y value caused that error

fringe violet Jun 17, 2020, 3:00 AM

#

dy/dx = y is e^x

#

uh idk what value specifically but when the range is 2pi and i try to use sinofx() it gives that error

#

i'll post sinofx:

#

```python
def sineofx(x, y):
    return -(y**2)/2
```

neat harness Jun 17, 2020, 3:02 AM

#

Try running sinofx through a bunch of inputs and print each input before the ** line

#

Then you can see which causes an error

fringe violet Jun 17, 2020, 3:03 AM

#

i got this from the DE y'' + y = 0 which i know has the solution Acosx + Bsinx just from how many times this equation comes up in physics

#

so i solved it for dy/dx by integrating both sides to get y' = - (y^2/2)

#

alright i'll add that in there in a second. i'm checking to see if a polynomial will work

#

ok it's accurate for polynomials too

#

so far just the sinofx derivative function isn't working

#

how to set the width of the float when i print it?

neat harness Jun 17, 2020, 3:10 AM

#

There's a couple options, do you know about .format, str % value, or f-strings?

fringe violet Jun 17, 2020, 3:11 AM

#

oh nevermind i'll look into it later

#

-2.1399999999999983 for x returns......
-1.6618870538680888e+229

#

which doesn't make sense. it should oscillate between -1 and 1.

#

wait

#

i know what the issue is.... i rushed through the math and made a mistake. give me one sec

#

what i did just didn't make sense mathematically

#

y'' + y = 0
you can't just integrate y in terms of dx like that it doesn't make any sense

#

integ(y)dx =/= y^2/2 + c.
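The mistake above is that y is an unknown function of x, so it cannot be integrated with respect to x as if it were the variable. One common way around this (an assumption here, not something the poster did) is to introduce v = y' and treat y'' + y = 0 as the first-order system y' = v, v' = -y, which Euler's method can step directly. The sketch below uses the hypothetical name `euler_oscillator`; starting from y(0) = 0, v(0) = 1 it should track sin x.

```python
import math


def euler_oscillator(x_end, dx=0.001, y0=0.0, v0=1.0):
    """Euler steps for y'' + y = 0 rewritten as y' = v, v' = -y."""
    steps = round(x_end / dx)
    y, v = y0, v0
    for _ in range(steps):
        # tuple assignment makes both updates use the old (y, v)
        y, v = y + dx * v, v - dx * y
    return y


approx = euler_oscillator(math.pi / 2)
print(approx)  # should be close to sin(pi/2) = 1
```

Note that a second-order equation needs two initial values (position and slope), which is why an initial point alone such as (0, 0) is not enough to pick out sin x.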

neat harness Jun 17, 2020, 3:17 AM

#

To print a float (x here) with a specific number of digits (3 here) after the decimal point:

String.format function:
print("{:.03f}".format(x))
f-strings (my favorite, .format but convenient):
print(f"{x:.03f}")
String % operator:
print("%.03f" % x)

fringe violet Jun 17, 2020, 3:18 AM

#

thanks

#

well i think i know what the issue is now. i'm going to never touch this again and move on to the main point of my code haha

neat harness Jun 17, 2020, 3:19 AM

#

Good luck

fringe violet Jun 17, 2020, 3:19 AM

#

i don't need a sinx anyway. i just wanted to test euler's method so i could move on to a different method that was a bit more complicated

#

thanks

lapis sequoia Jun 17, 2020, 5:07 AM

#

so

#

I forgot to set seed

#

and my computations turned out really good.. I don't know what my seed is

#

how do I reproduce omg x.x

devout sail Jun 17, 2020, 5:15 AM

#

You don't? Not sure what you're doing, but you can run it several times and pick the best result

lapis sequoia Jun 17, 2020, 5:20 AM

#

I can't run it several times. It's running on tpu

#

I'm trying to figure out a way to find the current seed even though I didn't set it

#

im hoping np.random.get_state() would work

#

I'm using pytorch..

devout sail Jun 17, 2020, 5:37 AM

#

If you're using jupyter or something and you didn't stop the kernel at any point you might be able to do it
https://stackoverflow.com/questions/32172054/how-can-i-retrieve-the-current-seed-of-numpys-random-number-generator for a more detailed explanation of how it works

Stack Overflow

How can I retrieve the current seed of NumPy's random number genera...

The following imports NumPy and sets the seed.

import numpy as np
np.random.seed(42)
However, I'm not interested in setting the seed but more in reading it. random.get_state() does not seem to co...

#

Though I'd say your aim should be saving the results / model, rather than trying to recreate the computation
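For reference, a minimal sketch of what `np.random.get_state()` can and cannot do: it snapshots NumPy's global generator at the moment it is called, so it only helps if it is captured before the draws you want to replay. It does not touch PyTorch, which keeps its own generators (`torch.initial_seed()` and `torch.get_rng_state()` are the analogous calls; whether they cover a TPU backend is not checked here).

```python
import numpy as np

# Snapshot the global NumPy RNG state *before* drawing anything.
state = np.random.get_state()
first = np.random.normal(size=3)

# Restoring the snapshot replays exactly the same draws.
np.random.set_state(state)
replay = np.random.normal(size=3)

print(first, replay)
```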

hearty jewel Jun 17, 2020, 6:44 AM

#

anyone know why im getting a syntax error here?

📎 unknown.png

modern coral Jun 17, 2020, 6:53 AM

#

Uhhhh

#

Missing bracket? [

spark stag Jun 17, 2020, 6:53 AM

#

@hearty jewel the line above, you open a [

modern coral Jun 17, 2020, 6:53 AM

#

@hearty jewel

📎 unknown.png

hearty jewel Jun 17, 2020, 6:53 AM

#

fixed it

#

still not working

modern coral Jun 17, 2020, 6:54 AM

#

What's the new error?

hearty jewel Jun 17, 2020, 6:54 AM

#

📎 unknown.png

#

same thing

#

is this a bug

spark stag Jun 17, 2020, 6:54 AM

#

line above

#

you have pulls[pulls... but don't close that first [

modern coral Jun 17, 2020, 6:54 AM

#

Right. THAT line is raising the error.

hearty jewel Jun 17, 2020, 6:55 AM

#

u guys are awesome

#

lol

modern coral Jun 17, 2020, 6:55 AM

#

Because it has been searching for the closing ] but it just got a new line.
Well, we fixed two errors.

spark stag Jun 17, 2020, 6:57 AM

#

if there is a syntax error pointing to something like a variable at the start of a line then usually look back a line or 2

hearty jewel Jun 17, 2020, 6:58 AM

#

📎 unknown.png

#

now theres a new error lol

#

on the 'date'.year component

#

im trying to group by user by year

#

whats wrong with the code there?

polar acorn Jun 17, 2020, 7:04 AM

#

I think groupby needs a list of columns so it would have to be by_author.groupby(['user', by_author.....])

hearty jewel Jun 17, 2020, 7:04 AM

#

i tried that

#

it didnt work

#

like

#

by_author['date'].dt.year

#

📎 unknown.png

polar acorn Jun 17, 2020, 7:28 AM

#

Did you try counts = by_author.groupby(['user', by_author['date'].dt.year]).agg(?
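A small self-contained sketch of that suggestion, since the original screenshots are not available. The frame below is made-up data; only the column names `user` and `date` come from the conversation, and `pulls` is assumed from the earlier `pulls[pulls...` mention. Passing a list that mixes a column name with a Series of years groups by both.

```python
import pandas as pd

# Made-up data for illustration only.
by_author = pd.DataFrame({
    "user": ["ann", "ann", "bob", "ann"],
    "date": pd.to_datetime(
        ["2019-01-05", "2019-07-01", "2019-03-02", "2020-02-10"]
    ),
    "pulls": [1, 2, 3, 4],
})

# The year Series is aligned on the index, so it can sit next to a column name.
year = by_author["date"].dt.year.rename("year")
counts = by_author.groupby(["user", year]).size()

print(counts)
```

The `.dt` accessor only works when `date` is a datetime column, so a string column would need `pd.to_datetime` first.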

hearty jewel Jun 17, 2020, 7:29 AM

#

i fixed it kind of

#

im getting a new error now

#

📎 unknown.png

#

on the 'date' in counts_wide

solid mantle Jun 17, 2020, 7:42 AM

#

My jupyter notebook breaks after running a particular piece of code. 'Kernel connection to server cannot be established'. Although it runs just fine before running that code

#

any ideas?

hearty jewel Jun 17, 2020, 7:43 AM

#

figured it out

#

thanks anyways guys!

lapis sequoia Jun 17, 2020, 8:44 AM

#

I need some free tpus

#

preferably over 8 cores and 64 gb ram

solid mantle Jun 17, 2020, 8:59 AM

#

import numpy as np
import pymc3 as pm  # assuming PyMC3, given the `sd=` keyword below

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x

y = y_real + eps_real

data = np.stack((x, y)).T

with pm.Model() as pearson_model:
    μ = pm.Normal('μ', mu=data.mean(0), sd=10, shape=2)
    σ_1 = pm.HalfNormal('σ_1', 10)
    σ_2 = pm.HalfNormal('σ_2', 10)
    ρ = pm.Uniform('ρ', -1., 1.)
    r2 = pm.Deterministic('r2', ρ**2)
    # 2x2 covariance matrix built from the two scales and the correlation
    cov = pm.math.stack(([σ_1**2, σ_1*σ_2*ρ],
                         [σ_1*σ_2*ρ, σ_2**2]))

    y_pred = pm.MvNormal('y_pred', mu=μ, cov=cov, observed=data)

    trace_p = pm.sample(1000)

#

its this bit of code

#

shuts down my jupyter kernel

buoyant imp Jun 17, 2020, 9:36 AM

#

Hello everyone,

#

I have a question regarding Mathematics for ML/DS etc.
I am currently learning Linear Algebra and I am fairly understanding the topic.
Should I start solving exercises from these topics manually (pen/paper style) to understand the topic more or is there anything else I should try?
I have Mathematics background during my undergrad but its been 2-3 years since I last solved any problems.
Also same goes with Probability and Statistics?

eager heath Jun 17, 2020, 9:57 AM

#

Yeah it can be nice to get your pen and paper and test yourself to make sure you learnt those topics fully :D

earnest meteor Jun 17, 2020, 12:46 PM

#

Hi I am reviving some client's project, it's based on Tensorflow 1.1 (I updated the package to 1.5).

There is a line statement like this:

# The original code is tf.contrib.lite, I migrated it as new style
interpreter = tf.lite.Interpreter(model_path='data/model.tflite')

I have a limited experience with it, I just need to move the ML parts as a separate python package for portability and write unit tests, so is this model "model.tflite" something common I can download from somewhere?

Thanks

buoyant imp Jun 17, 2020, 12:56 PM

#

Yeah it can be nice to get your pen and paper and test yourself to make sure you learnt those topics fully :D
@eager heath

Hey, any good resources for problems?

eager heath Jun 17, 2020, 1:02 PM

#

I don't have any, no sorry :/

buoyant imp Jun 17, 2020, 1:46 PM

#

It's alright. Thank you.

zealous hinge Jun 17, 2020, 1:47 PM

#

project euler!

lapis sequoia Jun 17, 2020, 2:30 PM

#

hello, is there a faster way to deal with 20 categorical columns with 50+ levels than to go through each one individually, look at the value counts, and assign it as a binary as whether or not it is the max value count?

ripe forge Jun 17, 2020, 3:18 PM

#

Depends on how exactly you coded it up

#

Cause logic wise, you have to do all that

#

So then the only question is, did you vectorize the code properly or did you use loops and so on

lapis sequoia Jun 17, 2020, 4:05 PM

#

alright thank you

desert oar Jun 17, 2020, 5:04 PM

#

how fast is "fast"

#

looping over 20 columns shouldn't take that long

#

def is_most_common_category(s):
    counts = s.value_counts()
    return s == counts.idxmax()

data_binarized = data[list_of_categorical_column_names].apply(is_most_common_category)

silk axle Jun 17, 2020, 5:41 PM

#

How do I go about centralising values in my pandas dataframe when I print it?

#

E.g. this column (I want to do for all columns)

📎 unknown.png

#

Also centralise headers like here

📎 unknown.png

desert oar Jun 17, 2020, 5:48 PM

#

@silk axle you can call .str.center before printing

#

the column names i think you can control with a display option

#

hmm.. colheader_justify is only for left or right

#

no centered

silk axle Jun 17, 2020, 5:49 PM

#

AttributeError: 'DataFrame' object has no attribute 'str'?

#

.to_string().center?

lapis sequoia Jun 17, 2020, 5:50 PM

#

how do you know how many categorical levels are too much? for instance, i have a 6000 row dataset with a categorical variable having 50 levels. is that too much?

desert oar Jun 17, 2020, 5:50 PM

#

@silk axle on each column

#

print(data.apply(lambda x: x.str.center()))

#

@lapis sequoia too much for what

silk axle Jun 17, 2020, 5:51 PM

#

Is there not a better way? @desert oar

desert oar Jun 17, 2020, 5:51 PM

#

not that i know of

#

im looking through the display options

lapis sequoia Jun 17, 2020, 5:52 PM

#

@lapis sequoia too much for what
@desert oar for an ML model (i.e. random forest)

desert oar Jun 17, 2020, 5:52 PM

#

i dont see one

#

@lapis sequoia potentially yes, random forests tend to over-weight features with lots of categorical values

silk axle Jun 17, 2020, 5:52 PM

#

AttributeError: Can only use .str accessor with string values! since not all the values are string @desert oar

desert oar Jun 17, 2020, 5:52 PM

#

then convert to string first i guess. make a separate function to if/else based on the dtype

lapis sequoia Jun 17, 2020, 5:52 PM

#

any alternatives that i could use where high-level categorical features wont have an impact?

#

would i use like a regression?

desert oar Jun 17, 2020, 5:53 PM

#

something with regularization

silk axle Jun 17, 2020, 5:53 PM

#

then convert to string first i guess. make a separate function to if/else based on the dtype
@desert oar this doesn't really mean anything to me. I know literally nothing about pandas other than how to read in an excel file.

lapis sequoia Jun 17, 2020, 5:53 PM

#

@desert oar this doesn't really mean anything to me. I know literally nothing about pandas other than how to read in an excel file.
@silk axle bro u can google this

silk axle Jun 17, 2020, 5:54 PM

#

I can't google it if I don't know what they mean

desert oar Jun 17, 2020, 5:54 PM

#

@silk axle

def center_text(s):
    return s.map(str, na_action='ignore').str.center()

print(data.apply(center_text))

something like that

#

if you want to get more specific use if/else and check the .dtype of s

#

e.g. if it's a datetime series you can strftime it

silk axle Jun 17, 2020, 5:55 PM

#

Would s be each like record?

desert oar Jun 17, 2020, 5:56 PM

#

i think you need to brush up on pandas basics 😉

#

.apply by default applies a function to each column in the data

silk axle Jun 17, 2020, 5:56 PM

#

As I said, I know nothing about pandas

desert oar Jun 17, 2020, 5:56 PM

#

.map applies a function to each element of a series elementwise

#

a DataFrame is (logically) a collection of Serieses

silk axle Jun 17, 2020, 5:57 PM

#

TypeError: center() missing 1 required positional argument: 'width'

desert oar Jun 17, 2020, 5:58 PM

#

any .str method functions like the regular str methods

#

so .str.center is the same as str.center in regular python

silk axle Jun 17, 2020, 5:58 PM

#

didn't even know str.center was a thing

desert oar Jun 17, 2020, 5:59 PM

#

!d g str.center

arctic wedgeBOT Jun 17, 2020, 5:59 PM

#

`str.center`

```
str.center(width[, fillchar])
```
Return centered in a string of length *width*. Padding is done using the specified *fillchar* (default is an ASCII space). The original string is returned if *width* is less than or equal to `len(s)`.

silk axle Jun 17, 2020, 5:59 PM

#

I guess the width would be the length of the header?

desert oar Jun 17, 2020, 5:59 PM

#

yeah, or the length of the longest string maybe

silk axle Jun 17, 2020, 5:59 PM

#

true

desert oar Jun 17, 2020, 6:00 PM

#

def center_text(x):
    s = x.map(str, na_action='ignore')
    l = max(s.name, s.map(len, na_action='ignore').max())
    return s.str.center(l)

maybe

silk axle Jun 17, 2020, 6:01 PM

#

    l = max(s.name, s.map(len, na_action='ignore').max())
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'

#

Tbh I should probs stop copy-pasting and actually look at the docs to see what this stuff does lol

desert oar Jun 17, 2020, 6:01 PM

#

yes lol

#

also look at what i wrote

#

the point is that in general there isnt a good way to do this

#

without manually centering everything

#

how you do that is up to you
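For readers following along, here is one way the manual centering could look once the width problem is sorted out. It is a sketch under assumptions: the `max(s.name, ...)` call above compared a string with a number, so this version compares lengths instead, and it uses `astype(str)`, which turns missing values into the text `nan` rather than skipping them. The sample frame is made up.

```python
import pandas as pd


def center_text(col):
    """Center every value of a column within a common width."""
    s = col.astype(str)  # note: missing values become the string 'nan'
    # widest of: the header text, or the longest value in the column
    width = max(len(str(col.name)), int(s.str.len().max()))
    return s.str.center(width)


# Made-up data for illustration only.
df = pd.DataFrame({"Horse": ["Red Rum", "Arkle"], "Total Stake": [4, 10]})
centered = df.apply(center_text)  # apply runs once per column by default
print(centered)
```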

silk axle Jun 17, 2020, 6:07 PM

#

s = x.map(str, na_action='ignore') so this converts x to a str, right?

#

And if x is a missing value (e.g. NaN) then ignore it

lapis sequoia Jun 17, 2020, 6:11 PM

#

@desert oar so can i use a dataframe of 6000 rows that has categorical variables with 300 levels, 5 levels, and 2000 levels and put this through lasso regression?

desert oar Jun 17, 2020, 6:17 PM

#

yes @silk axle

#

@lapis sequoia regression has its own problems. each level of each variable is a separate parameter

lapis sequoia Jun 17, 2020, 6:18 PM

#

yeah thats what i thought too. im not sure how to proceed because there are so many levels

desert oar Jun 17, 2020, 6:18 PM

#

so yes lasso or ridge can help but you are depending on regularization to make it make sense

lapis sequoia Jun 17, 2020, 6:18 PM

#

what do you suggest

desert oar Jun 17, 2020, 6:18 PM

#

is it only these categorical features?

lapis sequoia Jun 17, 2020, 6:18 PM

#

no there are more

desert oar Jun 17, 2020, 6:18 PM

#

hm

lapis sequoia Jun 17, 2020, 6:18 PM

#

ill show you this

desert oar Jun 17, 2020, 6:19 PM

#

one option is feature hashing

#

for the 2000 level one

lapis sequoia Jun 17, 2020, 6:19 PM

#

📎 image0.jpg

desert oar Jun 17, 2020, 6:19 PM

#

which basically groups categories randomly together
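A bare-bones sketch of the hashing idea using only the standard library: each category is hashed to one of a fixed number of buckets, so thousands of levels collapse into a small, fixed set of columns at the cost of unrelated categories sometimes colliding. The function name `hash_bucket` and the bucket count are illustrative; `hashlib` is used because Python's built-in `hash()` for strings changes between processes. scikit-learn's `FeatureHasher` is a ready-made implementation of the same trick.

```python
import hashlib


def hash_bucket(value, n_buckets=32):
    """Map a category to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


keywords = ["Equity", "Investing", "Equity", "Bonds"]
buckets = [hash_bucket(k) for k in keywords]
print(buckets)  # identical keywords always share a bucket
```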

#

huh

lapis sequoia Jun 17, 2020, 6:19 PM

#

alright ill look that up

desert oar Jun 17, 2020, 6:19 PM

#

are you sure all these features are meaningful

#

i see _NAME

lapis sequoia Jun 17, 2020, 6:19 PM

#

some of them arent

#

yeah those and the dates arent meaningful

ripe forge Jun 17, 2020, 6:19 PM

#

Date also

desert oar Jun 17, 2020, 6:19 PM

#

ok good i was going to say

#

you had better not include a date as a categorical lol

lapis sequoia Jun 17, 2020, 6:20 PM

#

yeah the problem is, some of them just have so many levels

desert oar Jun 17, 2020, 6:20 PM

#

what's an example of one of these features

#

like what does it actually represent

lapis sequoia Jun 17, 2020, 6:20 PM

#

im trying to group it into more levels so that i can have 80% of the value counts within 10 levels (for example) and have the remaining 20% listed as 'Other'

#

yeah so one of them is like 'Keywords' which has 2360 levels

#

which represents like 'Equity' or 'Investing' or something

ripe forge Jun 17, 2020, 6:21 PM

#

Oh i have a suggestion for that kind of crap

lapis sequoia Jun 17, 2020, 6:21 PM

#

another is the group with 1395 levels and a list of different business units

#

Oh i have a suggestion for that kind of crap
listening

desert oar Jun 17, 2020, 6:21 PM

#

1400 business units wew

silk axle Jun 17, 2020, 6:21 PM

#

I'm not getting anywhere with this @desert oar, it's really confusing me :/

desert oar Jun 17, 2020, 6:21 PM

#

@silk axle you can also just loop over your data and print each row manually

ripe forge Jun 17, 2020, 6:21 PM

#

For business units try doing some meaningful aggregation

desert oar Jun 17, 2020, 6:21 PM

#

^

ripe forge Jun 17, 2020, 6:21 PM

#

Like group together the units

lapis sequoia Jun 17, 2020, 6:22 PM

#

that are similar?

#

i see

ripe forge Jun 17, 2020, 6:22 PM

#

For the words, perhaps use embedding.

lapis sequoia Jun 17, 2020, 6:22 PM

#

this is so much more work than i anticipated ha

ripe forge Jun 17, 2020, 6:22 PM

#

It ends up making more variables, but all continuous

desert oar Jun 17, 2020, 6:22 PM

#

definitely feature hashing for the keywords, or even an embedding like word2vec if a lot of your records have multiple keywords

#

yes, welcome to machine learning

ripe forge Jun 17, 2020, 6:22 PM

#

And they contain meaning in vector space

desert oar Jun 17, 2020, 6:22 PM

#

65% fucking with data, 15% training models, 20% sitting in meetings

ripe forge Jun 17, 2020, 6:23 PM

#

More like 50% 5% and 45% fml

desert oar Jun 17, 2020, 6:23 PM

#

lol true

#

also throw in 10% writing ad-hoc code for random other projects

lapis sequoia Jun 17, 2020, 6:23 PM

#

lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%

desert oar Jun 17, 2020, 6:24 PM

#

thats not a bad option

ripe forge Jun 17, 2020, 6:24 PM

#

Only if grouping them makes sense

lapis sequoia Jun 17, 2020, 6:24 PM

#

im trying to create a function right now to see how many variables are captured by 80% of the value counts
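One possible shape for that function, as a sketch: count each level, walk down from the most frequent, keep levels until the requested share of rows is covered, and relabel the rest. The name `collapse_rare_levels`, the `'Other'` label, and the sample data are all illustrative.

```python
import pandas as pd


def collapse_rare_levels(s, coverage=0.8, other="Other"):
    """Keep the most frequent levels covering `coverage` of rows; relabel the rest."""
    counts = s.value_counts()               # most frequent first
    seen_before = counts.cumsum() - counts  # rows covered by more frequent levels
    keep = counts.index[seen_before < coverage * counts.sum()]
    # values not in `keep` (including missing values) become `other`
    return s.where(s.isin(keep), other)


# Made-up data: 'a' and 'b' together cover 80% of the rows.
keywords = pd.Series(["a"] * 5 + ["b"] * 3 + ["c", "d"])
collapsed = collapse_rare_levels(keywords, coverage=0.8)
print(collapsed.value_counts())
```

`counts.index[seen_before < ...]` also tells you how many levels survive, which answers the "how many levels make up 80%" question directly via `len(keep)`.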

desert oar Jun 17, 2020, 6:24 PM

#

https://stats.stackexchange.com/q/146907/36229
https://stats.stackexchange.com/q/411767/36229

Cross Validated

Principled way of collapsing categorical variables with many levels?

What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model?
Consider a variable like college s...

Cross Validated

Encoding of categorical variables with high cardinality

For unsupervised anomaly detection / fraud analytics on credit card data (where I don't have labeled fraudulent cases), there are a lot of variables to consider. The data is of mixed type with cont...

ripe forge Jun 17, 2020, 6:24 PM

#

Like it doesn't make sense if you grouped ceos and trainees together for example

lapis sequoia Jun 17, 2020, 6:24 PM

#

yeah the problem is you lose a lot of information

Only if grouping them makes sense

ripe forge Jun 17, 2020, 6:24 PM

#

Cause that might logically be bad

lapis sequoia Jun 17, 2020, 6:24 PM

#

right

#

thanks salt, ill check those links out

ripe forge Jun 17, 2020, 6:25 PM

#

There is one other option that I personally haven't checked out

#

Called target encoding. Supposed to be some magical nonsense

desert oar Jun 17, 2020, 6:25 PM

#

yep target encoding works

#

ive done it

#

theres no implementation currently for multi-class though

#

only regression and binary classification

ripe forge Jun 17, 2020, 6:25 PM

#

But it's another cheeky way to get a representation that's much easier for models to handle

desert oar Jun 17, 2020, 6:25 PM

#

the multi-class version is more complicated too, i don't remember how it works off the top of my head
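For the regression / binary case mentioned above, a minimal sketch of smoothed mean target encoding: each category is replaced by its average target value, pulled toward the overall mean when the category has few rows. The function name, the `smoothing` parameter, and the data are illustrative, and this particular smoothing formula is one common choice rather than the method from the linked references. In practice the encoding should be fitted on training folds only, otherwise the target leaks into the feature.

```python
import pandas as pd


def target_encode(cat, target, smoothing=10.0):
    """Replace each category with a smoothed mean of the target."""
    prior = target.mean()
    stats = target.groupby(cat).agg(["mean", "count"])
    # small categories are shrunk toward the global mean (the prior)
    smooth = (stats["count"] * stats["mean"] + smoothing * prior) / (
        stats["count"] + smoothing
    )
    return cat.map(smooth)


# Made-up data for illustration only.
cat = pd.Series(["a", "a", "b"])
target = pd.Series([1, 1, 0])
encoded = target_encode(cat, target)
print(encoded)
```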

lapis sequoia Jun 17, 2020, 6:26 PM

#

ah so it wont work for my case?

desert oar Jun 17, 2020, 6:26 PM

#

youd have to implement it yourself

lapis sequoia Jun 17, 2020, 6:26 PM

#

alright. what about this scenario

#

so my dataframe together has 84 columns. what if i only looked at a dataframe with less than 15 different levels? that'd result in 37 columns

#

could i create a prediction off that

#

i know im losing a lot of information

desert oar Jun 17, 2020, 6:27 PM

#

https://dl.acm.org/doi/10.1145/507533.507538
https://arxiv.org/abs/1611.09477
references for target encoding ^

arXiv.org

vtreat: a data.frame Processor for Predictive Modeling

We look at common problems found in data that is used for predictive modeling
tasks, and describe how to address them with the vtreat R package. vtreat
prepares real-world data for predictive...

lapis sequoia Jun 17, 2020, 6:27 PM

#

thank you

desert oar Jun 17, 2020, 6:27 PM

#

i think you should also consider which features make sense for your business problem too

#

you can also try computing the mutual information between each feature and the target

#

and drop features with MI below a threshold
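A hand-rolled sketch of mutual information between two categorical columns, computed from their joint frequency table. The function name and data are illustrative; scikit-learn's `mutual_info_classif` and `mutual_info_score` are existing implementations that would normally be used instead. The result is in nats: 0 means the columns look independent, larger means stronger association.

```python
import numpy as np
import pandas as pd


def mutual_information(x, y):
    """Mutual information (in nats) between two categorical Series."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()  # joint probabilities
    px = joint.sum(axis=1, keepdims=True)                 # marginal of x
    py = joint.sum(axis=0, keepdims=True)                 # marginal of y
    nz = joint > 0                                        # skip empty cells (0 * log 0)
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())


feature = pd.Series(["a", "a", "b", "b"])
same = pd.Series([0, 0, 1, 1])       # fully determined by the feature
unrelated = pd.Series([0, 1, 0, 1])  # independent of the feature

mi_same = mutual_information(feature, same)
mi_unrelated = mutual_information(feature, unrelated)
print(mi_same, mi_unrelated)
```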

lapis sequoia Jun 17, 2020, 6:29 PM

#

so basically calculating variable importance?

desert oar Jun 17, 2020, 6:29 PM

#

more like bivariate association

lapis sequoia Jun 17, 2020, 6:30 PM

#

alright

#

i'll check this out

#

this is my first month working as a professional data scientist lol and i never implemented any of these methods in school

desert oar Jun 17, 2020, 6:30 PM

#

its just because correlation doesnt work on categorical features

#

otherwise youd just use correlation

#

yeah welcome

#

at least you have the sense to ask questions

#

i struggled for years just trying to DIY everything

lapis sequoia Jun 17, 2020, 6:31 PM

#

its just because correlation doesnt work on categorical features
yeah. if this data were numerical it'd make my life so much easier

desert oar Jun 17, 2020, 6:31 PM

#

yep. again welcome to data science

lapis sequoia Jun 17, 2020, 6:31 PM

#

i struggled for years just trying to DIY everything
lol i cant imagine how difficult it'd be if you didnt have someone to ask like i do right now

#

thank you though

desert oar Jun 17, 2020, 6:31 PM

#

hint: it sucked and i wasnt very good at it

lapis sequoia Jun 17, 2020, 6:31 PM

#

ha

silk axle Jun 17, 2020, 6:34 PM

#

@desert oar I've kinda made progress, but still not quite where I want

#

def center_text(x):
    if isinstance(x.dtype, str):
        l = max(x.map(len))  # get the highest length of string
        print(l)
        return x.str.center(l)  # center based on longest length
    else:
        s = x.map(str)
        return s.str.center(len(x.name))  # center based on length of title

print(df.apply(center_text))

desert oar Jun 17, 2020, 6:35 PM

#

isinstance won't work with dtype

#

dtypes are numpy dtype objects, not python types (each one has a single-character .kind code)

#

"string" columns are just "O" dtype which means "arbitrary python objects"

#

so sadly there's no dedicated string datatype in pandas by default (1.0 added an experimental "string" dtype, but you have to opt in)

silk axle Jun 17, 2020, 6:36 PM

#

So I can't do something different for strings and integers?

desert oar Jun 17, 2020, 6:36 PM

#

you can, but it really depends on the dtype

#

you can have "integers" of 3.0 and 4.0 in a "float" dtype column

#

or you can have integers in an "O" dtype column even though it should probably be "int"
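If branching on the column type is the goal, pandas ships helper predicates in `pandas.api.types` that avoid comparing dtype objects by hand. The sketch below is illustrative: `kind_of` is a hypothetical helper and the one-row frame is made up, loosely echoing the column names from this thread. Anything that is not datetime, bool, int, or float (typically text stored as objects) falls through to `"other"`.

```python
import pandas as pd
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_float_dtype,
    is_integer_dtype,
)


def kind_of(col):
    """Classify a column by dtype using pandas' own predicates."""
    if is_datetime64_any_dtype(col):
        return "datetime"
    if is_bool_dtype(col):
        return "bool"
    if is_integer_dtype(col):
        return "int"
    if is_float_dtype(col):
        return "float"
    return "other"  # object / string columns end up here


# Made-up row for illustration only.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-06-11 15:40:00"]),
    "Horse": ["Red Rum"],
    "Non Runner": [False],
    "Total Stake": [4],
    "Actual Winnings": [0.0],
})
kinds = {name: kind_of(df[name]) for name in df.columns}
print(kinds)
```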

silk axle Jun 17, 2020, 6:38 PM

#

```
Date               datetime64[ns]
Horse                      object
Track                      object
Time                       object
Non Runner                   bool
Odds                       object
Total Stake                 int64
Win or Each Way            object
Actual Winnings           float64
```
those are the dtypes

desert oar Jun 17, 2020, 6:38 PM

#

looks like the times are strings?

#

Time

silk axle Jun 17, 2020, 6:39 PM

#

How would I change that?

#

11/06/2020 15:40:00 is an example

#

That's the format

#

dd/mm/yyyy hh:mm:ss

desert oar Jun 17, 2020, 6:39 PM

#

why do you have both Date and Time then?

#

regardless, doesnt matter

silk axle Jun 17, 2020, 6:40 PM

#

~~I merged Time and Date and forgot to remove Time~~

#

lol

#

```
Date               datetime64[ns]
Horse                      object
Track                      object
Non Runner                   bool
Odds                       object
Total Stake                 int64
Win or Each Way            object
Actual Winnings           float64
```
so these are the dtypes

desert oar Jun 17, 2020, 6:40 PM

#

ah wait

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_string.html

#

you can assign formatters to each column this way

#

use that

#

see formatters and float_format

silk axle Jun 17, 2020, 6:41 PM

#

I saw that earlier but no clue how to use it lmao

desert oar Jun 17, 2020, 6:41 PM

#

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

#

formatters : list, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

silk axle Jun 17, 2020, 6:42 PM

#

formatters = center_text?

desert oar Jun 17, 2020, 6:42 PM

#

that, or

formatters={
    'Horse': lambda s: s.center(15),
    'Non Runner': lambda s: 'Non Runner' if s else 'Runner'
}

etc.
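Put together, a `to_string` call with per-column formatters might look like the sketch below. The data is made up and the widths (15, 12, 11) are arbitrary choices; `formatters`, `justify`, and `index` are documented `DataFrame.to_string` parameters. Each formatter receives one cell value and must return a string.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "Horse": ["Red Rum", "Arkle"],
    "Non Runner": [False, True],
    "Total Stake": [4, 10],
})

text = df.to_string(
    index=False,
    justify="center",  # applies to the column headers
    formatters={
        "Horse": lambda s: s.center(15),
        "Non Runner": lambda b: ("Non Runner" if b else "Runner").center(12),
        "Total Stake": lambda x: f"£{x:.2f}".center(11),
    },
)
print(text)
```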

silk axle Jun 17, 2020, 6:43 PM

#

Right yea

#

Thanks 👍

silk axle Jun 17, 2020, 7:01 PM

#

I'm almost there @desert oar, got another quick question

desert oar Jun 17, 2020, 7:01 PM

#

sure

silk axle Jun 17, 2020, 7:01 PM

#

How can I make it so that it sends 4.00 instead of just 4 (type is int64)

📎 unknown.png

#

If I can £4.00

#

    'Total Stake': lambda s: f"{str(s).center(11):.2f}",
ValueError: Unknown format code 'f' for object of type 'str'

#

That's what I tried

#

Wait I know why

#

Yep fixed

desert oar Jun 17, 2020, 7:09 PM

#

you can use the float_format parameter too

#

oh if it's int nvm

#

    'Total Stake': lambda x: format(float(x), "0.2f").center(11)

#

@silk axle ^

#

also i was just using s to mean either "series" or "string"

silk axle Jun 17, 2020, 7:11 PM

#

Rn I've got lambda s: f"£{s:.2f}".center(11)

desert oar Jun 17, 2020, 7:13 PM

#

!e ```python
x = 3
print(f"{x:0.2f}")
```

arctic wedgeBOT Jun 17, 2020, 7:13 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

3.00

desert oar Jun 17, 2020, 7:13 PM

#

cool, it casts it

#

i think with .format sometimes you get errors

#

cant remember

silk axle Jun 17, 2020, 7:13 PM

#

'Actual Winnings': lambda s: ("£" + f"{s:.2f}".zfill(5)).center(15) this works but I'm thinking there's a tidier way -- converts 0 to 0.00 then to 00.00 then to £00.00 and then centers

desert oar Jun 17, 2020, 7:14 PM

#

!e ```python
x = 1.5
print(f"£{x:2.2f}")
```

arctic wedgeBOT Jun 17, 2020, 7:14 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

£1.50

silk axle Jun 17, 2020, 7:14 PM

#

But

#

I think there was a reason I can't do that

#

Right yea

#

Because I want £01.50

desert oar Jun 17, 2020, 7:15 PM

#

oh sorry use 02

#

!e ```python
x = 1.5
print(f"£{x:02.2f}")
```

arctic wedgeBOT Jun 17, 2020, 7:15 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

£1.50

desert oar Jun 17, 2020, 7:15 PM

#

wait hm

#

!e ```python
print( format(43, "06.2f") )
```

arctic wedgeBOT Jun 17, 2020, 7:15 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

043.00

desert oar Jun 17, 2020, 7:15 PM

#

let me re-read i might have missed something in the syntax

silk axle Jun 17, 2020, 7:16 PM

#

Wait no

#

lambda s: f"£{s:05.2f}" this works

#

:02.2f will do for total len 2

#

You need total len as 5
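To spell out why `02` did nothing and `05` worked: the number before the dot in a format spec is the minimum width of the whole field, counting the digits, the decimal point, and the decimals. `1.50` is already four characters, so a width of 2 never pads, while a width of 5 adds one leading zero. A short sketch, with the centering width of 15 taken from the earlier message:

```python
x = 1.5

narrow = f"£{x:02.2f}"  # width 2 is already exceeded by "1.50", so no padding
padded = f"£{x:05.2f}"  # width 5 counts every character of "01.50"
centered = padded.center(15)

print(narrow, padded, repr(centered))
```

Since the `£` sits outside the braces, it is not counted toward the width.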

desert oar Jun 17, 2020, 7:16 PM

#

yeah right

#

more corners of python i dont use very often

silk axle Jun 17, 2020, 7:23 PM

#

I just need to align the headers with the columns @desert oar

📎 unknown.png

desert oar Jun 17, 2020, 7:23 PM

#

does .to_string have any kwarg to mess w/ the headers?

silk axle Jun 17, 2020, 7:23 PM

#

right yea, good point

#

Assuming I want this but not sure which setting

📎 unknown.png

#

center works

#

Spacing between the columns isn't really consistent which is annoying

📎 unknown.png

#

Ig set col_space?

#

Wait no that wouldn't be it

#

Don't think so

#

Eh kinda does

#

Not really sure for this one @desert oar

desert oar Jun 17, 2020, 7:30 PM

#

at this point your guess is as good as mine

#

ive never had to do this

#

or ive just manually centered by iterating over rows

silk axle Jun 17, 2020, 7:32 PM

#

Ig it'll do then

#

Thanks for all your help with this 👍

river wing Jun 17, 2020, 7:36 PM

#

I tried
import nltk
nltk.download('punkt')

It take very long time and then fail ultimately

desert oar Jun 17, 2020, 7:40 PM

#

show the error you get

lapis sequoia Jun 17, 2020, 7:54 PM

#

lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%
@desert oar so back to this question, if im running low on time, would you suggest i do this ?

lapis sequoia Jun 17, 2020, 8:56 PM

#

i am using Python and xlwings, but the intellisense isn't working for xlwings. e.g. when i type in ws1.cells., it doesn't show the methods that i can use e.g. ws1.cells.clear_contents() any idea why?

desert oar Jun 17, 2020, 9:10 PM

#

@lapis sequoia it could work. maybe a better option is to look at the cumulative % of the data represented by each category, and chop it off at some threshold

#

oh wait

#

thats what you just said

#

yes

#

perfectly valid

lapis sequoia Jun 17, 2020, 9:11 PM

#

alright thank you, im gonna try running a baseline with variables that have less than 10 levels, and then running another model of all levels but using that method i just described

desert oar Jun 17, 2020, 9:12 PM

#

feature selection and engineering is always a slow process

lapis sequoia Jun 17, 2020, 9:12 PM

#

lol yeah. this is my first professional DS/ML project, and ive never had to deal with data that was this messy and unorganized before

desert oar Jun 17, 2020, 9:13 PM

#

at least you have all the data in one place

#

ive spent all day writing various scripts just to get my data

lapis sequoia Jun 17, 2020, 9:13 PM

#

oh thats even worse lol

desert oar Jun 17, 2020, 9:13 PM

#

under a completely arbitrary deadline

#

which is the most annoying part

#

everyone wants to acknowledge that software dev takes time, data science takes a fuckload of time

#

then i keep getting asked "how far do you think we can get in 3 weeks"

#

"well, week 1 will be spent writing and debugging data access and cleaning scripts, week 2 will be spent just understanding the data and prototyping maybe 1 or 2 algorithms, week 3 will be spent slapping together literally anything that seems to work because this is not enough time to get anything done"

thin terrace Jun 17, 2020, 9:17 PM

#

Is there an optimal way to handle missing values?

desert oar Jun 17, 2020, 9:18 PM

#

no. depends on your application

#

sometimes it's as simple as "fill in with the mean" and sometimes it's as complicated as "build an imputation model"

#

depends entirely on the data that's missing, the reasons for the missingness, the kind of model you're using, etc.
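As a rough illustration of the simple end of that spectrum, here is a sketch that fills numeric columns with their median and also records where values were missing. The column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05, 53.10],
})

# Record where values were missing before filling, so a model can still
# use "was missing" as a signal.
indicators = df.isna().astype(int).add_suffix("_was_missing")

# Fill each column with its own median (computed while ignoring NaN).
filled = df.fillna(df.median())

result = pd.concat([filled, indicators], axis=1)
print(result)
```

Whether the median, the mean, or a fitted model is appropriate still depends on why the values are missing and on the model that consumes them.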

lapis sequoia Jun 17, 2020, 9:19 PM

#

Is there an optimal way to handle missing values?
@thin terrace lol im doing this right now

thin terrace Jun 17, 2020, 9:19 PM

#

can you give some example(s)?

lapis sequoia Jun 17, 2020, 9:19 PM

#

im using knn to impute categorical variables

desert oar Jun 17, 2020, 9:19 PM

#

^ thats one option

#

@thin terrace do you have a specific example in mind?

lapis sequoia Jun 17, 2020, 9:20 PM

#

everyone wants to acknowledge that software dev takes time, data science takes a fuckload of time
damn that sucks. are u a DS or ML engineer

desert oar Jun 17, 2020, 9:20 PM

#

data scientist

lapis sequoia Jun 17, 2020, 9:20 PM

#

you can replace with mean/median/mode/knn/regress

#

bunch of algorithms online

#

other options too

desert oar Jun 17, 2020, 9:20 PM

#

hell you can train an autoencoder on the non-missing records

lapis sequoia Jun 17, 2020, 9:20 PM

#

but i think those are the most frequent. is that correct?

#

yea that too

desert oar Jun 17, 2020, 9:21 PM

#

multiple imputation is another option with a lot of theoretical appeal but can be difficult to implement in practice

#

can do stuff w/ gaussian mixtures, sky is kind of the limit

#

you can even fit bayesian models where you set a prior for your missing values and train the model on that

thin terrace Jun 17, 2020, 9:22 PM

#

Well, not really. So I'm applying for a data scientist job (which i think I've landed). They gave me a classification task with a dataset full of missing values and I have never handled such things before (just graduated from uni). I didn't really know how to approach it. I removed some columns with very large amounts of missing values and then filled the remaining with 0.

#

But It yielded pretty shitty performance

lapis sequoia Jun 17, 2020, 9:22 PM

#

dont replace with 0. you can google algorithms to replace with mean/median/mode

#

unless it makes sense to replace with 0

#

for instance, my dataset rn has a few columns missing >60% of values -- i dropped those completely. now im running a knn on missing values

thin terrace Jun 17, 2020, 9:23 PM

#

When does it make sense to replace with 0? I tried with -999 first but it worked even worse

desert oar Jun 17, 2020, 9:23 PM

#

they didnt teach you anything about missing values in school?

lapis sequoia Jun 17, 2020, 9:23 PM

#

^ im surprised to hear that as well

desert oar Jun 17, 2020, 9:23 PM

#

what degree do you have

#

what field of study, rather

thin terrace Jun 17, 2020, 9:23 PM

#

M.Sc in engineering computer security

desert oar Jun 17, 2020, 9:23 PM

#

no wonder

thin terrace Jun 17, 2020, 9:24 PM

#

so im an outlier hehe

desert oar Jun 17, 2020, 9:24 PM

#

stop thinking like an engineer, start considering what the data actually is

#

is 0 a sensible default value for the data?

#

then you can impute with 0

#

is 0 a sensible default value for a stock price?

#

of course not

#

well maybe for some stocks

#

but not in general

lapis sequoia Jun 17, 2020, 9:25 PM

#

@thin terrace google how to impute missing values

thin terrace Jun 17, 2020, 9:25 PM

#

so basically its ok for features where 0 would represent "nothing" ?

desert oar Jun 17, 2020, 9:25 PM

#

no, youre still thinking about this wrong

#

in fact an engineer needs to think like this too

#

what does the data represent? how does my model actually work and use the numbers i give it?

thin terrace Jun 17, 2020, 9:27 PM

#

im in for a challenge at this job position, thats for sure lol

desert oar Jun 17, 2020, 9:28 PM

#

imo a good software engineer needs the same skillset

#

making sure that what you're doing actually makes sense for solving the real-world problem

#

for example i'm writing a client library for a huge web API right now

#

there are like 50 options

#

but my particular use case only needs like 10, 5 of which can be hard-coded

#

so i'm writing my code with that specifically in mind

#

if you're not thinking about the task from a real-world perspective you're just not going to develop good solutions. it's true for software engineering, even more true for "physical" engineering like civil/mechanical/electrical, and equally true for data science

thin terrace Jun 17, 2020, 9:33 PM

#

yeah i usually try to think about the bigger picture

#

data science is very new to me, a lot of new stuff

#

I was planning on becoming a dev but here I am

desert oar Jun 17, 2020, 9:38 PM

#

data science is more fun

thin terrace Jun 17, 2020, 9:41 PM

#

I hope I'll think so too

lapis sequoia Jun 17, 2020, 9:42 PM

#

data science is nerdy

#

whereas web dev is creative and fun 🙂

#

hehe

desert oar Jun 17, 2020, 9:42 PM

#

anyway with that in mind @thin terrace it definitely helps to consider: what kind of data is this, and what would happen to my model results if i did X

#

and that's why there will never be a catch-all "what do i do with missing data" answer

slim fox Jun 17, 2020, 9:43 PM

#

naaah webdev is boring xD

#

are you working as a data scientist, salt?

#

btw I just realized we have two salt helpers 😂

#

I thought we had one who changed name sometimes @desert oar

thin terrace Jun 17, 2020, 9:43 PM

#

Yeah, I get that. I just don't know the answers to these questions. How do I learn?

desert oar Jun 17, 2020, 9:44 PM

#

salt is just "salt" or "salt-die"

#

i was here first technically 😉

slim fox Jun 17, 2020, 9:44 PM

#

learn by doing: practice on different data sets, try to see the bigger picture
since data science is so hot now there are plenty of resources to learn things: if you look on the internet for things like "imputation", "missing values in data" etc. you will find quite a lot of guides and articles

desert oar Jun 17, 2020, 9:44 PM

#

@thin terrace partly a matter of looking to see what other people have done in other specific cases

#

e.g. there are a million approaches to missing data on the kaggle titanic dataset

#

and we just suggested a few, albeit complicated ones that i think are overkill for a job interview task

slim fox Jun 17, 2020, 9:45 PM

#

Towards data science part of medium.com often has good articles for example

desert oar Jun 17, 2020, 9:45 PM

#

impute with mean/median/mode, fit regression, time series if appropriate, KNN, gaussian mixture

#

some models don't even need missing data imputation, e.g. some random forest implementations

#

if "missing" is a valid category level you can just leave it missing

slim fox Jun 17, 2020, 9:46 PM

#

i was here first technically 😉
@desert oar oh really? didn't know 🙂 guess I just happen not to see you somehow lol. I used to see salt-die, then there was #ask-meta-for-math channel and then I started to see you but not salt-die

#

hence the wrong conclusion

desert oar Jun 17, 2020, 9:46 PM

#

i was gone for a while

#

i'm not likely to be here consistently. i've just been on a lot recently

slim fox Jun 17, 2020, 9:47 PM

#

I see. do you work a data science-related job?

desert oar Jun 17, 2020, 9:47 PM

#

yes

thin terrace Jun 17, 2020, 9:48 PM

#

I tried leaving them as missing, must have fucked something up

#

At the interview they said I could just have replaced them with -1 and I would've got pretty good performance

#

Is that such a huge difference from replacing with 0?

slim fox Jun 17, 2020, 9:49 PM

#

hm. is the missing data continuous or categorical?

thin terrace Jun 17, 2020, 9:50 PM

#

mostly continuous

slim fox Jun 17, 2020, 9:50 PM

#

if they are continuous numbers you can also check for the pearson correlation between those features and target variable

#

if they are not really correlated you might not need them at all
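A sketch of that correlation check with pandas, on synthetic data. The column names `useful`, `noise`, and `target` are invented, and the target is constructed to depend on `useful` only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "useful": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Hypothetical target that depends on `useful` only.
df["target"] = 2.0 * df["useful"] + rng.normal(scale=0.5, size=n)
# Knock out some values to mimic missing data.
df.loc[::7, "useful"] = np.nan

# corrwith skips rows where either value is NaN.
correlations = df.drop(columns="target").corrwith(df["target"])
print(correlations.abs().sort_values(ascending=False))
```

Pearson correlation only measures linear association, so a low value is a hint rather than proof that a feature is useless, especially for a classification target or a non-linear model.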

thin terrace Jun 17, 2020, 9:51 PM

#

well i basically ended up dropping most features and I guess that's where I went wrong

lapis sequoia Jun 17, 2020, 9:55 PM

#

if i had values like 5-10, 10-20, 20-30, >1000, i would have to encode these, correct?

#

before putting it into a ML/predictive model^
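One possible encoding for binned ranges like these, sketched with pandas. The bin order is assumed to run from smallest to largest, and the sample column is invented.

```python
import pandas as pd

# Assumed ordering of the bins, smallest to largest.
bin_order = ["5-10", "10-20", "20-30", ">1000"]
col = pd.Series(["10-20", ">1000", "5-10", "20-30", "10-20"])

# An ordered categorical keeps the ranking; .codes gives 0, 1, 2, ...
ordered = pd.Categorical(col, categories=bin_order, ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=col.index)
print(ordinal_codes.tolist())  # [1, 3, 0, 2, 1]

# One-hot encoding is the alternative when the order should not matter.
one_hot = pd.get_dummies(col)
print(one_hot.shape)  # (5, 4)
```

Ordinal codes preserve the ranking but treat the gaps between bins as equal, which may or may not suit a jump like 20-30 to >1000.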

thin terrace Jun 17, 2020, 9:57 PM

#

sounds like a good idea

unborn talon Jun 17, 2020, 9:57 PM

#

📎 unknown.png

#

why ???

#

📎 unknown.png

thin terrace Jun 17, 2020, 9:57 PM

#

missing ()

unborn talon Jun 17, 2020, 9:58 PM

#

where ?

thin terrace Jun 17, 2020, 9:58 PM

#

np.array( [1,2,3,4] )

unborn talon Jun 17, 2020, 9:58 PM

#

oh THANKS ALOT

thin terrace Jun 17, 2020, 9:58 PM

#

yw

solid mantle Jun 17, 2020, 10:01 PM

#

anyone?

#

Does anyone know what passing another array as an array index gives?

#

idx = np.repeat(range(7), 20)
idx = np.append(idx, 7)
np.random.seed(314)
alpha_real = np.random.normal(2.5,0.5,size=8)

y = alpha_real[idx]

#

so, what would y give?

#

I looked at its kdeplot, i cant make anything out of it

thin terrace Jun 17, 2020, 10:03 PM

#

you can give an array of indices to an array to retrieve a new array with the elements at the given indices
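Applied to the snippet in the question, this is roughly what happens. The checks at the end are added here only to make the behaviour visible.

```python
import numpy as np

idx = np.repeat(range(7), 20)   # 0 twenty times, then 1 twenty times, ... up to 6
idx = np.append(idx, 7)         # one extra entry pointing at position 7
np.random.seed(314)
alpha_real = np.random.normal(2.5, 0.5, size=8)

# Integer-array indexing: y[i] == alpha_real[idx[i]] for every i.
y = alpha_real[idx]

print(y.shape)                           # (141,)
print(np.all(y[:20] == alpha_real[0]))   # True
print(y[-1] == alpha_real[7])            # True
```

So `y` holds only eight distinct values, each of the first seven repeated twenty times, which would explain a KDE plot that is hard to interpret.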

solid mantle Jun 17, 2020, 10:05 PM

#

ohh thank you

desert oar Jun 17, 2020, 10:10 PM

#

@solid mantle time to read the numpy indexing docs 😉

#

https://numpy.org/doc/1.18/user/basics.indexing.html#basics-indexing

#

x = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30])
print(x[0])
print(x[[0, 3]])
print(x[1:4])
print(x[x < 20])

#

the last one is indexing with an array/list of bools

solid mantle Jun 17, 2020, 10:16 PM

#

@desert oar thank you

slim fox Jun 17, 2020, 10:17 PM

#

then there is some crazy stuff too

#

https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html

#

like stride tricks

#

haven't yet figured them out

thin terrace Jun 17, 2020, 10:22 PM

#

Warning This function has to be used with extreme care, see notes.

#

nice

slim fox Jun 17, 2020, 10:22 PM

#

for a good reason I believe 🙂

thin terrace Jun 17, 2020, 10:23 PM

#

it manipulates the internal data structure of ndarray and, if done incorrectly, the array elements can point to invalid memory and can corrupt results or crash your program

#

I don't get the usage of it

slim fox Jun 17, 2020, 10:29 PM

#

I saw few examples

#

also this https://ipython-books.github.io/46-using-stride-tricks-with-numpy/

IPython Cookbook - 4.6. Using stride tricks with NumPy


#

https://ipython-books.github.io/47-implementing-an-efficient-rolling-average-algorithm-with-stride-tricks/

IPython Cookbook - 4.7. Implementing an efficient rolling average a...


#

in short, it can be very performant at times
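A small sketch of the rolling-average idea from those links, using `as_strided` to build overlapping windows without copying data. The function name `rolling_mean` is chosen here, and the view is created read-only to reduce the risk the docs warn about.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_mean(x, window):
    """Mean over every run of `window` consecutive elements of a 1-D array."""
    x = np.ascontiguousarray(x, dtype=float)
    n_windows = x.shape[0] - window + 1
    if n_windows < 1:
        raise ValueError("window is longer than the array")
    step = x.strides[0]  # bytes between neighbouring elements
    # Row i of `windows` is x[i:i + window]; the rows overlap in memory.
    windows = as_strided(
        x,
        shape=(n_windows, window),
        strides=(step, step),
        writeable=False,  # guard against writing through overlapping views
    )
    return windows.mean(axis=1)

x = np.arange(1, 7)
print(rolling_mean(x, 3))  # [2. 3. 4. 5.]
```

The shape and strides must stay within the original buffer; getting them wrong is exactly how the invalid-memory problem quoted above arises.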

lapis sequoia Jun 17, 2020, 11:34 PM

#

how can i impute missing categorical data using knn?

desert oar Jun 18, 2020, 12:30 AM

#

one basic method is to compute distances between rows, comparing only the non-missing fields in those rows. then for each field you fill each missing value from the nearest rows that have non-missing values

#

there are a few packages that implement that logic

#

https://github.com/iskandr/fancyimpute which wraps https://github.com/iskandr/knnimpute
or this one ad-hoc implementation i found in a blog post https://gist.github.com/YohanObadia/b310793cd22a4427faaadd9c381a5850 which has some more intelligent handling of different data types

#

its actually a little surprising that nobody has come out with a "professional-grade" library for KNN imputation, but i guess people often just end up writing their own code
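A hand-rolled sketch of the basic method described above, for purely categorical columns. It assumes distance is the share of mismatching fields among those present in both rows and fills with the most common value among the `k` nearest donors; the function name and toy data are made up. (scikit-learn's `KNNImputer` covers the numeric case.)

```python
import pandas as pd

def knn_impute_categorical(df, k=3):
    """Fill missing categorical values from the k most similar rows."""
    values = df.to_numpy(dtype=object)
    missing = df.isna().to_numpy()
    filled = values.copy()
    n_rows, n_cols = values.shape

    for i in range(n_rows):
        for j in range(n_cols):
            if not missing[i, j]:
                continue
            candidates = []
            for other in range(n_rows):
                if other == i or missing[other, j]:
                    continue  # a donor must have the target field present
                # Compare only the fields that are present in both rows.
                shared = ~missing[i] & ~missing[other]
                if not shared.any():
                    continue
                mismatch = (values[i, shared] != values[other, shared]).mean()
                candidates.append((mismatch, other))
            if not candidates:
                continue  # nothing to borrow from; leave the gap
            candidates.sort()
            donors = [values[row, j] for _, row in candidates[:k]]
            # Most common value among the nearest donors.
            filled[i, j] = pd.Series(donors).mode().iloc[0]

    return pd.DataFrame(filled, index=df.index, columns=df.columns)

df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red"],
    "size": ["S", "S", "L", "L", "S"],
    "shape": ["round", "round", "square", "square", None],
})
imputed = knn_impute_categorical(df, k=2)
print(imputed)
```

The nested loops are quadratic in the number of rows, so this is only meant to show the logic on small data.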

marsh chasm Jun 18, 2020, 12:33 AM

#

Does anyone know a good way of searching for a term in a VERY LARGE CSV (5million rows total ~) without using a for loop

#

/is a for loop really that much slower ?

desert oar Jun 18, 2020, 12:33 AM

#

@marsh chasm in any column? or in a specific column

marsh chasm Jun 18, 2020, 12:33 AM

#

Specific column

desert oar Jun 18, 2020, 12:34 AM

#

plain python can loop over 5 million rows pretty fast. pandas can do it really fast

marsh chasm Jun 18, 2020, 12:34 AM

#

Hmm it’s been running for an hour it’s not done yet (regular for )

desert oar Jun 18, 2020, 12:34 AM

#

show your code?

marsh chasm Jun 18, 2020, 12:34 AM

#

yeah gimme a second

#

```python
temp = []
keyword = "Panda Express"

for i in filenames:
    with open(i, 'rt') as f:
        reader = csv.reader(f)
        for row in reader:
            if keyword == row[1]:
                temp.append(row[0])
                temp.append(row[11])
    dataframes.append(temp)
    temp.clear()
    with open('PandaVisits.csv', 'wt') as p:
        writer = csv.writer(p)
        for r in dataframes:
            writer.writerow(r)
```

#

filenames is a list of csv's that i want to analyze

desert oar Jun 18, 2020, 12:36 AM

#

im surprised thats taking an hour

#

wait

#

you forgot to un-indent the 2nd with open

#

so you're re-writing all your files every time you read 1 file

#

unless that's your goal...

#

even so that's a long time

#

regardless, pandas should be able to do this much faster

#

you dont have any column names? data starts on the first row?

marsh chasm Jun 18, 2020, 12:40 AM

#

sorry was helping my dad w something

#

don't i need to do that bc the original for loop is like every file in filenames

desert oar Jun 18, 2020, 12:41 AM

#

you're doing some weird business with the data

#

your logic is all twisted

#

oh nvm

#

you're clearing temp..

#

but still

marsh chasm Jun 18, 2020, 12:41 AM

#

ah probably, im new to this so it's probably not the best solution

desert oar Jun 18, 2020, 12:42 AM

#

you should move the writing to the end

#

so you aren't re-writing every time you read 1 file
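A sketch of the plain-`csv` version with the write moved out of the read loop, so the output file is opened once. The function name, its parameters, and the throwaway demo files are assumptions for illustration; the real files use columns 0, 1, and 11.

```python
import csv
import os
import tempfile

def extract_matches(filenames, output_path, keyword, match_col=1, keep_cols=(0, 11)):
    """Stream each input CSV once and write matching rows to a single output file."""
    needed = max(match_col, *keep_cols)  # highest column index we will touch
    n_written = 0
    # The output file is opened once, outside the loop over input files.
    with open(output_path, "w", newline="") as out:
        writer = csv.writer(out)
        for filename in filenames:
            with open(filename, newline="") as f:
                for row in csv.reader(f):
                    # Skip short rows instead of raising IndexError.
                    if len(row) > needed and row[match_col] == keyword:
                        writer.writerow([row[c] for c in keep_cols])
                        n_written += 1
    return n_written

# Tiny demo with throwaway files (3 columns instead of the real 12+).
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for i in range(2):
        path = os.path.join(tmp, f"part{i}.csv")
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows([
                ["2020-06-01", "Panda Express", str(10 + i)],
                ["2020-06-01", "Other Place", "99"],
            ])
        inputs.append(path)
    out_path = os.path.join(tmp, "PandaVisits.csv")
    n = extract_matches(inputs, out_path, "Panda Express", keep_cols=(0, 2))
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))

print(n, rows)
```

Because rows are written as they are found, nothing accumulates in memory between files.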

marsh chasm Jun 18, 2020, 12:42 AM

#

oh wait

#

yeah lmao

desert oar Jun 18, 2020, 12:44 AM

#

regardless pandas makes this a lot faster

import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"

for filename in filenames:
    data = pd.read_csv(filename, usecols=[0, 1, 11], header=None, names=['x0', 'x1', 'x2'])
    has_keyword = data['x1'].str.contains(keyword)
    temp = data.loc[has_keyword, ['x0', 'x2']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False, header=False)

#

again, assuming your files don't have column headers

marsh chasm Jun 18, 2020, 12:44 AM

#

they have column headers

#

how does panda make it faster

#

is it like... multithreading ?

desert oar Jun 18, 2020, 12:45 AM

#

its written in C instead of looping manually in python

#

there is a huge amount of overhead in python code execution

marsh chasm Jun 18, 2020, 12:45 AM

#

ohhhh gotcha

desert oar Jun 18, 2020, 12:45 AM

#

what are the column names?

#

for the columns you want, anyway

marsh chasm Jun 18, 2020, 12:46 AM

#

well the ones of interest: date_range_start and raw_visit_counts

desert oar Jun 18, 2020, 12:46 AM

#

and the one that you're searching in?

marsh chasm Jun 18, 2020, 12:46 AM

#

location_name

desert oar Jun 18, 2020, 12:46 AM

#

import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"

for filename in filenames:
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False)

try this

marsh chasm Jun 18, 2020, 12:47 AM

#

thanks! i'll crossreference it w the docs to learn it but thank u so much! being new to this its hard to figure out which tools are the best to use

#

so thank u

desert oar Jun 18, 2020, 12:47 AM

#

youre welcome, thats the best way to learn

#

the user guides in pandas aren't always that helpful but the API reference is usually clear

marsh chasm Jun 18, 2020, 12:48 AM

#

gotcha ty

#

ok so i ran that code and i got these errors

#

📎 Screen_Shot_2020-06-17_at_8.57.07_PM.png

desert oar Jun 18, 2020, 12:57 AM

#

the usual caveats about untested code apply

marsh chasm Jun 18, 2020, 12:57 AM

#

yeah probably

desert oar Jun 18, 2020, 12:57 AM

#

can you show more of the error?

#

it looks like you cut off the bottom

marsh chasm Jun 18, 2020, 12:57 AM

#

📎 Screen_Shot_2020-06-17_at_8.57.51_PM.png

#

sorry about that

desert oar Jun 18, 2020, 12:58 AM

#

oh...

#

you copied and pasted my code verbatim

marsh chasm Jun 18, 2020, 12:58 AM

#

it looked fine ?

desert oar Jun 18, 2020, 12:58 AM

#

i bet you can figure out the problem

marsh chasm Jun 18, 2020, 12:58 AM

#

ok

#

let me check

desert oar Jun 18, 2020, 12:58 AM

#

hint: look near the top

marsh chasm Jun 18, 2020, 12:58 AM

#

OMG I TOLD MYSELF TO DELETE THAT

#

I THOUGHT I DID

desert oar Jun 18, 2020, 12:58 AM

#

lol

#

for future reference, 3 dots "..." is called an "ellipsis"

#

which hopefully makes that error message make sense
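For context, `[ ... ]` is valid Python, which is why the placeholder runs instead of raising a syntax error. A quick check (the exact error in the screenshot is not reproduced here):

```python
placeholder = [ ... ]                # a list holding the built-in Ellipsis object

print(placeholder)                   # [Ellipsis]
print(placeholder[0] is Ellipsis)    # True
print(type(...).__name__)            # ellipsis
```

So the loop ends up handing the `Ellipsis` object to `pd.read_csv` as if it were a filename, and the failure only appears at that point.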

marsh chasm Jun 18, 2020, 12:59 AM

#

yeah thats why i was confused i was like tf where are the ellipsis i deleted the filename part

#

apparently

#

i did not

desert oar Jun 18, 2020, 1:01 AM

#

happens to the best of us

marsh chasm Jun 18, 2020, 1:02 AM

#

im looking through the docs... pandas is pretty useful xD

desert oar Jun 18, 2020, 1:03 AM

#

yep, for anything with tabular data it's pretty much indispensable

marsh chasm Jun 18, 2020, 1:03 AM

#

lmao to think i started this project using R

desert oar Jun 18, 2020, 1:03 AM

#

R isn't bad

#

pandas owes a lot to R for its design

#

the R code would look pretty similar, and if you use a 3rd party CSV reader instead of the built-in one it's just as fast if not faster

#

i've processed billions of rows in R doing more complicated operations than this

marsh chasm Jun 18, 2020, 1:05 AM

#

i used data.tables

#

purrr

desert oar Jun 18, 2020, 1:05 AM

#

hell yeah

marsh chasm Jun 18, 2020, 1:05 AM

#

but it was really

#

really

#

slow

desert oar Jun 18, 2020, 1:05 AM

#

data.table? shouldnt be

marsh chasm Jun 18, 2020, 1:05 AM

#

let me see if i can pull up the code

#

```r
library(purrr)
setwd("/Volumes/Seagate Backup Plus Drive/UMichStuff/v2/main-file/Relevant")

filenames = list.files(getwd())

rawVisits = 12
startDate = 10
CSVnames = filenames[filenames %like% "weekly-patterns"]
df = map_df(CSVnames, fread, header = TRUE)
df = df[which(df$location_name == "Panda Express"), c(startDate, rawVisits)]
fwrite(df, "PandaExpressVisitList.csv")
```

lapis sequoia Jun 18, 2020, 1:06 AM

#

my random forest model got a negative 17% R-2 value lol.

desert oar Jun 18, 2020, 1:10 AM

#

@marsh chasm

library("data.table")

filenames <- c( ... )
keyword <- "Panda Express"

dataframes <- lapply(filenames, function(filename) {
  dat <- fread(filename, select = c("date_range_start", "location_name", "raw_visit_counts"))
  dat[location_name == keyword, .(date_range_start, raw_visit_counts)]
})

combined_data <- rbindlist(dataframes)
fwrite(combined_data, "PandaVisits.csv")

pardon any mistakes, i haven't used R much recently

marsh chasm Jun 18, 2020, 1:10 AM

#

oh interesting

#

i started running the python code 2 minutes ago and its not done

#

is that normal

desert oar Jun 18, 2020, 1:11 AM

#

your data might be bigger than you realize

marsh chasm Jun 18, 2020, 1:11 AM

#

its 150 gigs

desert oar Jun 18, 2020, 1:11 AM

#

oh

#

you said 5 million rows

#

that's.... a lot more than 5 million

marsh chasm Jun 18, 2020, 1:11 AM

#

im pretty sure its 5 million rows

#

oh

#

is it

desert oar Jun 18, 2020, 1:11 AM

#

maybe 5 million rows with 100s of columns

#

that's a fuckton of data

marsh chasm Jun 18, 2020, 1:11 AM

#

: )

desert oar Jun 18, 2020, 1:11 AM

#

you have 150 gb of memory?

marsh chasm Jun 18, 2020, 1:12 AM

#

external hard drive

desert oar Jun 18, 2020, 1:12 AM

#

not storage

#

memory

#

RAM

marsh chasm Jun 18, 2020, 1:12 AM

#

150 gigs of ram hell no

#

im running from my mac laptop

desert oar Jun 18, 2020, 1:12 AM

#

uh

#

i'd ctrl+c this

marsh chasm Jun 18, 2020, 1:12 AM

#

uh oh

desert oar Jun 18, 2020, 1:12 AM

#

or at least keep your performance monitor open

#

do me a favor

#

how many files do you have

#

and how many lines are in each file

marsh chasm Jun 18, 2020, 1:12 AM

#

like how many csv's

desert oar Jun 18, 2020, 1:12 AM

#

yes

#

run this in bash:

wc -l my-data-directory/*.csv

#

obviously my-data-directory is the directory w/ your CSVs

lapis sequoia Jun 18, 2020, 1:13 AM

#

wait @desert oar how would i go about getting data from sql database with 15m rows into python for pandas editing?

#

csv limit is 1m

#

m = million

desert oar Jun 18, 2020, 1:13 AM

#

what do you mean csv limit is 1m

#

either write the query and pd.DataFrame it, or use pd.read_sql

#

(note that pd.read_sql requires sqlalchemy unless you're using sqlite)

lapis sequoia Jun 18, 2020, 1:14 AM

#

wait that wouldnt work for oracle though right

#

oracle database?

desert oar Jun 18, 2020, 1:15 AM

#

if it's supported by sqlalchemy it's supported by pandas

#

if not, like i said: write the query yourself and then convert to a dataframe after reading the data
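A sketch of the `pd.read_sql` route, using an in-memory SQLite database so it runs anywhere. The table and its contents are invented; for Oracle or another server, a SQLAlchemy engine for that database would take the place of `conn` (that part is an assumption and is not shown). `chunksize` keeps only a slice of the result in memory at a time.

```python
import sqlite3

import pandas as pd

# Toy database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (location_name TEXT, raw_visit_counts INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Panda Express", i) for i in range(10)],
)
conn.commit()

query = "SELECT location_name, raw_visit_counts FROM visits"

# With chunksize set, read_sql yields DataFrames of at most 4 rows each.
kept = []
for chunk in pd.read_sql(query, conn, chunksize=4):
    kept.append(chunk[chunk["raw_visit_counts"] % 2 == 0])

result = pd.concat(kept, ignore_index=True)
conn.close()
print(len(result))  # 5
```

Filtering in the SQL query itself, where possible, usually moves even less data than filtering chunk by chunk in pandas.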

marsh chasm Jun 18, 2020, 1:16 AM

#

i could have run the command wrong, but when i did i got this:

#

📎 Screen_Shot_2020-06-17_at_9.17.04_PM.png

desert oar Jun 18, 2020, 1:17 AM

#

put quotes around anything with a space in it

#

but not around the "*.csv" part

marsh chasm Jun 18, 2020, 1:17 AM

#

uh so like quotes around the whole filepath thing

desert oar Jun 18, 2020, 1:17 AM

#

wc -l "/Volumes/My University/Users/My Name"/*.csv

#

like that

marsh chasm Jun 18, 2020, 1:18 AM

#

ok cool

#

okie its running

#

oo its taking a long time to run

desert oar Jun 18, 2020, 1:19 AM

#

probably because you have 150 GB of data

marsh chasm Jun 18, 2020, 1:19 AM

#

yeah lol

#

it just counts the number of lines right

desert oar Jun 18, 2020, 1:20 AM

#

yeah, but you have a fuckton of data

marsh chasm Jun 18, 2020, 1:20 AM

#

a ha ha

#

so what will we do with the information when the computer finishes counting

desert oar Jun 18, 2020, 1:20 AM

#

we'll see the actual number of rows

#

or at least a good approximation thereof

marsh chasm Jun 18, 2020, 1:20 AM

#

okie

desert oar Jun 18, 2020, 1:21 AM

#

which will give us a better sense of how to approach this

marsh chasm Jun 18, 2020, 1:21 AM

#

oh it finished the first two... theyre at about 3,880,000 each

#

lets see how the other ones fare

desert oar Jun 18, 2020, 1:23 AM

#

how many files total?

marsh chasm Jun 18, 2020, 1:23 AM

#

32

#

yeah theyre all in the upper 3.8 millions

desert oar Jun 18, 2020, 1:23 AM

#

!e ```python
print(32 * 3.8 * 1e06)
```

arctic wedgeBOT Jun 18, 2020, 1:24 AM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

121600000.0

desert oar Jun 18, 2020, 1:24 AM

#

so you have like 120 million rows

#

this is approaching what you might call "big data"

marsh chasm Jun 18, 2020, 1:24 AM

#

woah coding buzzword

desert oar Jun 18, 2020, 1:24 AM

#

indeed

#

"big" = "too big for a hard drive"

#

so depending on the hard drive you're getting close

marsh chasm Jun 18, 2020, 1:24 AM

#

yeah im running this out of a 1TB external hard drive but thats storage

#

so not too bad

#

2 TB*

desert oar Jun 18, 2020, 1:25 AM

#

it's certainly "medium data" and too big for ram on most machines

#

i have a work machine with 256 GB of ram but even then i wouldn't load all this data at once, if only out of respect for my coworkers who also need the machine

marsh chasm Jun 18, 2020, 1:25 AM

#

oh boy

#

mmmmm i have 8gb of ram it looks like

#

according to about my mac

desert oar Jun 18, 2020, 1:27 AM

#

import pandas as pd
from tqdm import tqdm  # for a nice progress bar

filenames = [ ... ]
output_filename = 'PandaVisits.csv'
keyword = "Panda Express"

for fileno, filename in tqdm(enumerate(filenames)):
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    if fileno == 0:
        temp.to_csv(output_filename, index=False)
    else:
        temp.to_csv(output_filename, index=False, header=False, mode='a')

anyway this should read each file one at a time, then append it to the same pandavisits.csv file

#

and i added a pretty progress bar so you can see how long it will actually take

#

(pip install tqdm or conda install tqdm)
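If a single file is still too large for the available RAM even with `usecols`, one further option is to read it in pieces with `chunksize`. This is a sketch on synthetic in-memory data; the chunk size of 6 is only for the demo, and a real run would use something far larger.

```python
import io

import pandas as pd

# Synthetic stand-in for one of the large CSV files.
csv_text = "date_range_start,location_name,raw_visit_counts,extra\n" + "".join(
    f"2020-06-{day:02d},{name},{day},x\n"
    for day in range(1, 11)
    for name in ("Panda Express", "Other Place")
)

keyword = "Panda Express"
wanted = ["date_range_start", "location_name", "raw_visit_counts"]

pieces = []
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=wanted, chunksize=6):
    matches = chunk.loc[
        chunk["location_name"] == keyword,
        ["date_range_start", "raw_visit_counts"],
    ]
    pieces.append(matches)

result = pd.concat(pieces, ignore_index=True)
print(len(result))  # 10
```

Only the matching rows are kept between chunks, so peak memory is roughly one chunk plus the matches so far.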

marsh chasm Jun 18, 2020, 1:27 AM

#

oh cool

#

okay i'll try that

#

is this what the progress bar looks like?
0it [00:00, ?it/s]

#

thats what pops up when i run the script initially

#

oh i got it

#

nvm

#

it looks like it'll take about 40 minutes to run the whole thing (75seconds per iteration and 32 iterations)

desert oar Jun 18, 2020, 1:35 AM

#

seems reasonable

marsh chasm Jun 18, 2020, 1:37 AM

#

thats reasonable

#

i'll watch netflix for 40 mins and check later xD

#

ty

desert oar Jun 18, 2020, 1:37 AM

#

you're welcome

#

sounds like typical data science

marsh chasm Jun 18, 2020, 1:38 AM

#

this is my first time working w data

#

like to this scale

#

i was using R because my paradigms class had a small data science unit so it was fresh in my mind

marsh chasm Jun 18, 2020, 2:28 AM

#

the code worked @desert oar ty

desert oar Jun 18, 2020, 2:29 AM

#

@marsh chasm nice

#

you can do basically the same with data.table tho

marsh chasm Jun 18, 2020, 2:29 AM

#

maybe i didn't let it run long enough : /

#

i expected that it was just taking too long

desert oar Jun 18, 2020, 2:29 AM

#

instead of lapply and rbindlist you do basically the same thing as i wrote in pandas

#

its like a 1:1 port

marsh chasm Jun 18, 2020, 2:29 AM

#

ah gotcha

#

welp

#

maybe its best i brush up on my python anyway 😩

lapis sequoia Jun 18, 2020, 2:38 AM

#

when im using categorical variables in a random forest, do i need to one-hot encode them? ive seen conflicting responses

#

when i dont one hot encode, i get a 'cannot convert string to int' error

desert oar Jun 18, 2020, 2:42 AM

#

@lapis sequoia #0682 most models still require numerical inputs even if the numbers correspond to categories

lapis sequoia Jun 18, 2020, 2:45 AM

#

i thought that as well

#

but dont random forests handle categorical data?

#

📎 unknown.png

#

i just saw a youtube video with 900/920 likes that one hot encoded so im just gonna do that lol

#

but im still interested in knowing if you can use a random forest without having to (one hot) encode categorical variables

#

(if anyone knows the answer pls tag me bc sometimes i forget to check here after asking a question lol)

desert oar Jun 18, 2020, 2:58 AM

#

Yes they can handle categorical data, but that depends on the person who wrote the software letting you specify which columns are categorical

#

Just think about how a decision tree is constructed
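To make the encoding question concrete, here is a sketch of the two usual ways to turn a string column into numbers before handing it to a tree-based model. The `city` and `rooms` columns are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "SF"],
    "rooms": [1, 2, 3, 2],
})

# Option 1: one-hot. One indicator column per category, so a single split
# can isolate one category from all the others.
one_hot = pd.get_dummies(df, columns=["city"])
print(one_hot.columns.tolist())  # ['rooms', 'city_LA', 'city_NYC', 'city_SF']

# Option 2: integer codes. Compact, but a threshold split such as
# "city <= 0.5" relies on an arbitrary (here alphabetical) ordering.
codes = df.assign(city=df["city"].astype("category").cat.codes)
print(codes["city"].tolist())  # [1, 0, 1, 2]
```

Libraries with native categorical support let a split pick a subset of categories directly, which is what neither encoding fully reproduces.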

deft harbor Jun 18, 2020, 4:21 AM

#

Hey, salt rock is back. Nice.

lapis sequoia Jun 18, 2020, 4:52 AM

#

is there anything that xlwings can do that pandas CAN'T do?

hidden grail Jun 18, 2020, 7:19 AM

#

Is this chat for machine learning also?
Do you know how the computation time/complexity of a neural network will increase by implementing more classes?

river wing Jun 18, 2020, 7:52 AM

#

How to convert python into api

hidden grail Jun 18, 2020, 8:04 AM

#

You can choose a web-framework for Python, e.g. Django or Flask. I've used Flask for this in the past and it was really simple. You can easily define your API endpoints with the @app.route() decorator for different operations like GET or POST. Check out https://flask.palletsprojects.com/en/1.1.x/
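A minimal Flask sketch along those lines. The `add_numbers` function and the `/add` route are placeholders for whatever Python code is being exposed, and the built-in test client is used so the example runs without starting a server.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def add_numbers(a, b):
    """Stand-in for the existing Python function you want to expose."""
    return a + b

@app.route("/add", methods=["POST"])
def add_endpoint():
    # silent=True returns None instead of raising on a bad or missing body.
    payload = request.get_json(silent=True) or {}
    try:
        result = add_numbers(float(payload["a"]), float(payload["b"]))
    except (KeyError, TypeError, ValueError):
        return jsonify(error="expected a JSON body with numeric 'a' and 'b'"), 400
    return jsonify(result=result)

# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    response = client.post("/add", json={"a": 1, "b": 2})
    print(response.get_json())  # {'result': 3.0}
```

For local development the app can then be served with Flask's built-in server; a production deployment would normally sit behind a proper WSGI server.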

blazing bridge Jun 18, 2020, 8:49 AM

#

For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 — that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent).

#

What does variation in y mean

restive obsidian Jun 18, 2020, 9:44 AM

#

is there anyone already familiar with kaggle notebooks? i want to ask why i can't open the model folder for a model that i saved with model.save()

astral mantle Jun 18, 2020, 1:03 PM

#

hey

#

Does anyone here know anything about neural networks?

paper niche Jun 18, 2020, 1:19 PM

#

What does variation in y mean
@blazing bridge "Variation in y" is how much the actual y values spread around their mean (the sum of squared differences from the mean y value). An R² of 0.72 means the model's predictions account for 72% of that spread. I think of it as "how much better your model is at predicting the y value compared to a naive one that just guesses the mean value of y"
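A small numeric sketch of that definition, with made-up rent values. It also shows how R² can go negative, as in the random forest result mentioned earlier: that happens when the model does worse than always guessing the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # what the model gets wrong
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot

rent = np.array([1000.0, 1500.0, 2000.0, 2500.0])

print(r_squared(rent, rent))                      # 1.0  (perfect predictions)
print(r_squared(rent, np.full(4, rent.mean())))   # 0.0  (always guess the mean)
print(r_squared(rent, rent[::-1]))                # -3.0 (worse than the mean)
```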

#

Does anyone here know anything about neural networks?
@astral mantle Probably. I don't deal too much with NN myself, but just go ahead and ask your question. Someone who knows & has the time will answer.

astral mantle Jun 18, 2020, 1:21 PM

#

Oh well

#

I've been trying to understand backpropagation

#

I get how i'd find the adjustment to the first set of weights in respect to the output layer

#

but how would i adjust the weights of those in hidden layers and further?

#

do I carry on using the chain rule or is there something else

ripe forge Jun 18, 2020, 1:46 PM

#

Nope, chain rule. That's it
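To show the chain rule carrying the error back one more layer, here is a sketch of a two-layer network with a tanh hidden layer and a squared-error loss. The shapes, the activation, and the random data are all choices made for this example, and the result is checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))     # 5 samples, 3 inputs
y = rng.normal(size=(5, 1))     # 5 targets
W1 = rng.normal(size=(3, 4))    # input -> hidden weights
W2 = rng.normal(size=(4, 1))    # hidden -> output weights

def loss(W1, W2):
    h = np.tanh(x @ W1)
    y_hat = h @ W2
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

# Forward pass, keeping the intermediates the backward pass needs.
h = np.tanh(x @ W1)
y_hat = h @ W2

# Backward pass: each line is one application of the chain rule.
d_yhat = (y_hat - y) / len(x)     # dL/dy_hat
grad_W2 = h.T @ d_yhat            # dL/dW2 (output layer)
d_h = d_yhat @ W2.T               # error pushed back through W2
d_z1 = d_h * (1.0 - h ** 2)       # through tanh: tanh'(z) = 1 - tanh(z)^2
grad_W1 = x.T @ d_z1              # dL/dW1 (hidden layer)

# Finite-difference check on a single hidden-layer weight.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
print(np.isclose(numeric, grad_W1[0, 0], atol=1e-6))
```

Deeper networks repeat the `d_h` and `d_z1` steps once per additional hidden layer, reusing the error signal computed for the layer above.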

hidden grail Jun 18, 2020, 2:01 PM

#

Hey, I'm trying to create a program that can classify whether an image contains a building or not. I'm not sure where I should begin. I guess I could create a binary classifier CNN with Keras/TensorFlow/PyTorch. Or maybe I could use object-recognition in OpenCV, like Haar-Cascades. Do you have any idea what would be a good approach for this project?

restive obsidian Jun 18, 2020, 2:13 PM

#

@astral mantle if u need theory understanding maybe u can try enroll andrew ng deeplearning class

#

*Coursera or watch on deeplearning.ai on youtube.

astral mantle Jun 18, 2020, 2:22 PM

#

oh ok

#

thanks

sinful fog Jun 18, 2020, 2:24 PM

#

how can i scrape reddit images ?

#

(download)

#

with a bot i mean

uncut shadow Jun 18, 2020, 2:30 PM

#

well, you should use reddit's API

#

if you want to scrape, then I'm quite sure it's against their ToS

#

so if yes, we cannot help you

lapis sequoia Jun 18, 2020, 3:01 PM

#

how to round arrays

#

on random i get like bunch of digits

boreal portal Jun 18, 2020, 3:02 PM

#

Modulo