#data-science-and-ml


steel ravine
#

I will take into consideration NNs but don't want to jump to them if it would be an overkill

boreal portal
#

You could spend as much time learning NN business as you spend on non NN stuff to solve this issue

#

And you can do so much stuff with NN stuff, more and more every day!

quiet zinc
#

Hey folks, can someone help me with few questions in my university exam in Python related to NLTK mostly? (Nothing advanced, I believe)

boreal portal
#

Don't break the law

desert oar
#

@quiet zinc we can't and won't help with exams

#

we can provide limited homework help but that's it

#

!rules 5

arctic wedge BOT
#

5. Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious/inappropriate or be for graded coursework/exams.

quiet zinc
#

yea it's mostly task consisted of 4 questions but yea

blazing bridge
#

Would coursera courses count as breaking this rule. Like if I don’t understand a lecture can I ask for help

plain jungle
#

so I made a coupling algorithm, and im really excited about it cause it works. What are some fun stuff to test with it

#

at the moment i have just been feeding it random data such as

#

to then get data of

desert oar
#

@plain jungle try some classic datasets like iris, boston housing, and titanic

#

and that one wine tasting dataset

plain jungle
#

id need two dimensional floats / integer data. I'd love to do the iris' (the flower one right?) but last I remember that was 6 dimensional

#

I should work on making this scalable though, just not sure how to plot anything more than 3d in matplotlib

floral siren
#

wow i just did a decision tree with iris

desert oar
#

@plain jungle just pick 2 columns from it?

#

all the bivariate relationships in iris are usable
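For reference, iris has four numeric features (sepal length and width, petal length and width), so any two columns give two-dimensional float data. A minimal sketch, assuming scikit-learn is installed; the choice of columns 0 and 2 is arbitrary:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)        # the four feature names
print(iris.data.shape)           # (150, 4)

# Keep two columns, e.g. sepal length (0) and petal length (2),
# to get two-dimensional float data.
pair = iris.data[:, [0, 2]]
labels = iris.target             # species labels, handy for checking a grouping
```

Any of the six column pairs can be swapped in by changing the index list.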

plain jungle
#

heck... wow... yo! imma do that

desert oar
#

or get really cRaZy and hit it with multidimensional scaling first

plain jungle
#

thank you

sonic finch
#

Quite basic question...I'm trying to install the BERT Service Client. I've tried following directions here: https://github.com/hanxiao/bert-as-service and tried following the troubleshooting here: https://github.com/hanxiao/bert-as-service/issues/194 for how to handle when it won't start up in the command line. Appreciate any insight.

#

For reference, I have downloaded the files and do specify those in the command line calls that I'm making

spare karma
#

@sonic finch not sure this will solve your problem, as I'm very new to BERT (like 2 days) but I just completed this pytorch tutorial that used bert..worked flawlessly for me. Message me if you have any questions. I'll try and help the best I can.

#

and look at this one..

#

^one of the two is the most-updated one. I'll let you find which one is the most recent (one of them he messed up his cross-entropy). And then two vid-ja tutorials on the above:

oblique belfry
#

I have a Pandas question. I have data I need to do some time series stuff with. It is a set of events (event sourcing model) that fire at a certain time, for a certain uuid, and for a certain event. So timestamp, uuid, category is the data format. I want to get the average events per uuid per hour. So, I am thinking I need to convert the category data to one-hot encoding. Then, df.groupby("uuid").resample("1T").sum().mean() where the index is the timestamp. Is my reasoning flawed? I haven't done much with groupby.
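One way this could be approached, as a sketch rather than a definitive answer: a plain event count does not need one-hot encoding, and "1T" is a one-minute frequency, so an hourly bucket is needed for per-hour numbers. The sample data below is made up; only the (timestamp, uuid, category) layout comes from the question.

```python
import pandas as pd

# Hypothetical sample in the (timestamp, uuid, category) layout described above.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 10:05", "2020-01-01 10:20", "2020-01-01 11:30",
        "2020-01-01 10:10",
    ]),
    "uuid": ["a", "a", "a", "b"],
    "category": ["click", "view", "click", "view"],
})

# Count events per uuid per hour by flooring each timestamp to its hour.
hourly = (
    events.assign(hour=events["timestamp"].dt.floor("h"))
          .groupby(["uuid", "hour"])
          .size()
)

# Average the hourly counts per uuid (hours with zero events are not included).
avg_per_hour = hourly.groupby(level="uuid").mean()
print(avg_per_hour)
```

Whether empty hours should count as zeros is a modelling choice; this sketch leaves them out, which raises the average compared with a resample that fills the gaps.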

flat quest
#

@blazing bridge hey. Idk if someone has answered your question yet,

but the two lines of code you sent do the same thing. Accessing a column as an attribute and with the [] operator do the same thing for pandas objects.

As for the groupby function. We're grouping each element in the pandas object by their year. So each pandas row with the same year will be grouped together.

Then what you're doing here is getting the totalprod of each grouped year, and getting the average value.

blazing bridge
#

as in pandas object do you mean the dataframe

flat quest
#

yeah
dataframe or series

#

groupby works on a series too, but there are no other columns to use as the key, so you pass the keys yourself

blazing bridge
#

Ok [] and . both are used to access columns

flat quest
#

yes

blazing bridge
#

and they will be grouped by year

#

Thank you so much

flat quest
#

yeah np

blazing bridge
#

One more question

flat quest
#

mhm?

blazing bridge
#

how can we do more than one column

#

like totalprod and priceperlb

flat quest
#

you want to average over multiple columns?

blazing bridge
#

yeah grouped by year

flat quest
#

oh for that you need to use .agg

so something like this.

df.groupby('year').agg({'totalprod': 'mean', 'priceperlb': 'mean'})

Basically you're running the mean function on totalprod and priceperlb for each grouped year. The agg method takes a dict that maps each column name to an aggregation, and you can pass a list of function names as the value to run more than one on the same column.

blazing bridge
#

so we have to import numpy as well

#

in order to use it or would it consider it as a list

flat quest
#

pandas uses numpy

blazing bridge
#

I saw the agg function but I didnt know it accessed multiple columns as well

#

thank you so much

flat quest
#

yeah np
the pandas docs have a couple more options for the agg, and prob how to make custom functions to use on the groupby's so make sure to check those out.

blazing bridge
#

`df.groupby('year').agg({'mean', 'sum' : ['totalprod', 'priceperlib']}).

#

yeah I will

#

would this work

flat quest
#

well you're not passing anything to mean

#

so might as well just take it out

blazing bridge
#

oh i thought it would do the mean and the sum of the columns we specified

flat quest
#

for that you'll need each column as its own key, with a list of the functions you want as the value.

So like

{'totalprod': ['mean', 'sum'],
'priceperlb': ['mean', 'sum']}
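A minimal sketch of per-year aggregation over several columns, using made-up numbers. In pandas the dict passed to `agg` is keyed by column name, with the function name or names as values:

```python
import pandas as pd

# Hypothetical data with the column names used in the conversation.
df = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "totalprod": [100.0, 300.0, 200.0, 400.0],
    "priceperlb": [1.0, 3.0, 2.0, 4.0],
})

# Mean of several columns per year: select the columns, then aggregate.
means = df.groupby("year")[["totalprod", "priceperlb"]].mean()

# Several aggregations per column: the dict maps column name -> list of functions.
summary = df.groupby("year").agg({
    "totalprod": ["mean", "sum"],
    "priceperlb": ["mean", "sum"],
})
print(summary)
```

The second result has two-level column labels, so a single value is read with a tuple such as `summary.loc[1999, ("totalprod", "sum")]`.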
blazing bridge
#

Oh ok thank you so much for your help. I know I've been a pain

flat quest
#

nah its all good

blazing bridge
#

can some explain the difference between a validation set and a test set

#

they seem very similar but i dont understand what the difference is

desert oar
#

"Validation set" is usually used while tuning hyperparameters and choosing between candidate models

#

"Test set" is saved until the very end of your work, as a final estimate of out of sample performance

#

Personally, I think the names should be reversed, but the terminology has been established for a few years now and it's stuck
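To make the roles concrete, here is a small sketch of a three-way split using only the standard library. The function name, fractions, and seed are illustrative assumptions, not something from the discussion:

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
# Fit on `train`, tune on `val` as often as needed,
# and score on `test` once, at the very end.
```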

wicked flare
#

@desert oar This confuses me a little bit. What happens if you develop your parameters and manage to get good performance on your validation set, then you check with your test set and get awful performance. What do you do? Do you go out and get more data? I assume you don't want to reuse the test set, because if so, what's even the difference between the test and validation set?

desert oar
#

@wicked flare If that happens, it means that you did a poor job of constructing your data sets, and before you even touch your model again you need to spend some time carefully assessing the differences between your three data sets

#

Which yes, might mean that you either need to get new data entirely or, in a time sensitive situation and with extreme caution, reshuffle your three sets together if you just got very very unlucky

wicked flare
#

Ok, because there are some biases built into the validation set, so you overfitted the parameters to those biases or something?

desert oar
#

Yes, basically you are now trying to figure out "did I over fit my model, or did I sample my data sets incorrectly"

wicked flare
#

(or maybe the biases are in the test set)

slim fox
#

which means that your train data is not good, I would say

desert oar
#

Sometimes in things like kaggle they deliberately screw you with wonky scoring sets to punish overfitting, but you always need to make sure your data is in order before you try to adjust your training process

wicked flare
#

I guess any of the three sets could be problematic.

#

If the training data is bad, the model will be bad. If the validation data is bad, you will tune the parameters incorrectly, and if the test set is bad, you will get a bad result even with a good model.

desert oar
#

And yes, sometimes "get more data" is the best solution

slim fox
#

the question here is also whether you created train/val/test from one data set or they come from different sources

#

quite often validation set is obtained by the split from the entire train set

lapis sequoia
#

I'm just about to start a new job in machine learning in the UK public sector, previously I've only worked in the private sector. Anyone worked in both and can give me some heads up on what I should expect?

slim fox
#

here also cross-validation comes in play -> you do it to make sure that the way you build model does not favor overfitting for instance
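As an illustration of the idea, k-fold cross-validation rotates which slice of the training data plays the validation role. A plain-Python sketch (the helper name is hypothetical; libraries such as scikit-learn ship their own splitters):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

A model whose score swings widely between folds is a hint that the build process is sensitive to which rows it happened to see.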

foggy nebula
#

Hello Everyone,
I was wondering if any of you have experience in converting a dash plotly app into an .exe?
I have tried suggestions on the forums of dash but those don't seem to work for me and others as well on the forum

quaint basalt
#

Hello world! Anyone familiar with scikit-learn online at the moment and available to answer some questions? I think I want to do something strange (a branching Pipeline with a variable number of features in the intermediate steps), and I don't know how.

desert oar
#

@quaint basalt just describe your question in more detail, then somebody who knows the answer can see it and help

quaint basalt
#

Sure thing. Let me see if I can format a ascii graph thingy with the pipeline I have in mind

desert oar
#

As the saying goes, "don't ask to ask"

quaint basalt
#

My main input is a (networkx) graph, and I basically want to cluster the nodes of that graph using a soft clustering algorithm. To do this I need to predict 1) the number of clusters, and 2) for every edge an associated weight. The number of clusters is pretty straight forward since 1 graph results in 1 number. The edge weights are the problem, since 1 graph results in N weights with a different N for every graph. My idea is roughly the following:

      _ <featurize graph> - <normalize etc> - <SVR to predict number of clusters> _
     /                                                                             \
graph                                                                               <cluster> - <score>
     \_ <featurize all edges [1]> - <normalize> - <SVR to predict weights>        -/
#

[1] This results in many features/items/datapoints

desert oar
#

I have to take care of something at work, ping me this evening and I can take a look

#

In the meantime, look into FeatureUnion

#

In sklearn

quaint basalt
#

Thanks, will do. I also found sklearn-lego, which also seems to be an essential part in this

desert oar
#

yeah this is a bit more advanced than a sklearn pipeline is meant for

#

i didnt know about sklearn-lego

#

it looks more like just a collection of various custom transformers people have written

#

what is the last stage meant to represent?

quaint basalt
#

The score? Might be just a misrepresentation from my part. What I mean is I will compare the found clustering to a ground truth

#

The main problem I run into is that I have an unknown number of edges per graph

#

sklearn-lego seems to be needed to enable feeding the output of the intermediate predictors to the clustering

#

I can maybe turn the bottom half in a custom predictor of sorts which takes a graph and returns a weights matrix. But then I may still have the problem that its shape is not constant.

desert oar
#

honestly i would just write your own class for this

#

or function

#

or whatever

#

because the parameters from one of the models depends on the output from another model (# of clusters)

#

this is just way beyond what sklearn pipelines are able to handle

quaint basalt
#

I was afraid of that. Thanks for confirming it 🙂

lapis sequoia
#

Hey all, anyone got any links to some good tutorials/info pages for machine learning and data science using python and tensorflow? I just got hired for a data science internship and have absolutely no datascience or machine learning background and feeling a little over my head

solid aurora
#

Kind of an odd question, but given a jupyter password, can I get a token for that instance?

#

i.e.

#

I am given a jupyter URL like:

#
```
jupyter.verylarge.cluster.org:123456/login
```
#

and am given the password abcd1234

#

How can I get an auto-login link like jupyter.verylarge.cluster.org:123456/?token=396d6da2df034621a8836ab6c0689eae?

devout sail
#

That sounds like a terrible idea

floral siren
#

Data Science Projects with Python by Stephen Klosterman is the book that I have been using @lapis sequoia . Not an online tutorial, but it is the best book that I have used on the subject

lapis sequoia
#

@floral siren awesome thanks ill check it out!

flat quest
#

there's a couple good courses on udacity @lapis sequoia
the intro to tensorflow one is pretty good

solid aurora
#

> That sounds like a terrible idea
@devout sail are you referring to my question?

devout sail
#

Yeah. I don't have a definite answer, but finding the token based on password would essentially allow you to enter the notebook by guessing the right password wouldn't it?

solid aurora
#

I mean, you can do the same thing by bruteforcing the login page

#

My guess was the token would be stored in localstorage or something like that once the notebook was unlocked

#

@devout sail found this in the webpage source:
```js
function _remove_token_from_url() {
  if (window.location.search.length <= 1) {
    return;
  }
  var search_parameters = window.location.search.slice(1).split('&');
  for (var i = 0; i < search_parameters.length; i++) {
    if (search_parameters[i].split('=')[0] === 'token') {
      // remove token from search parameters
      search_parameters.splice(i, 1);
      var new_search = '';
      if (search_parameters.length) {
        new_search = '?' + search_parameters.join('&');
      }
      var new_url = window.location.origin +
                    window.location.pathname +
                    new_search +
                    window.location.hash;
      window.history.replaceState({}, "", new_url);
      return;
    }
  }
}
_remove_token_from_url();
```

#

so the token is in the url but removed

#

how can I get around that?

#

I obviously can't change the source code of a remote jupyter book

#

and I can't see anything in the network tab of dev tools

solid aurora
#

@devout sail what about from the cookie that is set?

#

I just noticed that jupyter sets a cookie when logging in via a password

devout sail
#

I'm not sure if I know or want to help with that, sorry. You should find a legit way to open the notebook

solid aurora
#

@devout sail bruh I have access to the notebook already

#

basically I am given web access with a url+password

#

I want to use it through vs code

#

which requires the token in order to connect to the jupyter daemon

#

there's nothing illegal/illegitimate going on here

devout sail
#

Then you should get it from whoever gave the url+password

solid aurora
#

I'll try that, but I don't think they'll be able to change the system so easily

#

it's an automated system that I request an instance from

#

and I'm given url+password to connect

oblique grove
#

sorry if this isn't the right place to ask, but I don't really know whether I should use google colab or just pycharm while I'm learning machine learning

solid aurora
#

colab is great because it has all the libraries set up for you

#

only issue is that while you can save code between runs, you can't save files

#

for most small ML projects that's perfect

oblique grove
#

oh ok nice

#

and will keep that in mind

modern canyon
#

Hello there folks, I recently finished an introductory data science course and was also shortlisted for an ML internship. I was given an assignment where I have to predict wine variety using various features. I have never worked on an ML project before and also don't have much knowledge on the theoretical side either. I just finished the assignment today and attained about 97% accuracy. I am sure it is dumb luck but just to be sure can you guys review my notebook (https://nbviewer.jupyter.org/github/shyam1998/Wine-Variety-Prediction/blob/master/main.ipynb) and see if I am doing something wrong to attain such accuracy?

#

note: They didn't give out the labels for the test data so I train_test_splitted the training data

south coyote
#

@modern canyon I'm pretty much a noob at this stuff even though I've been taking ml classes for semesters now, but I think it's good.

#

I assume there aren't multiple entries with very similar parameters in the dataset though

#

If such a thing happened, then you would have similar items in both the training and test set, I mean

flat quest
#

@solid aurora actually u can save files. U'll just need to use drive for that.

But yes for larger projects, its recommended to use a standard ide or editor along with a version control system like git

solid aurora
#

reminds me of the days when I used to "version control" by uploading code to google drive

steel ravine
#

Anyone with some knowledge about Neural Networks and Poker?

boreal portal
#

Going with the NN solution i see?

steel ravine
#

The goal is to make a bot that will learn to mirror the skill level of the player, forcing the player to play against himself and learn from his mistakes

boreal portal
#

In online poker?

steel ravine
#

This is for a custom poker video game, it can be played offline. Online features are only to serve ads because the game is free

boreal portal
#

That would take forever to gather the data to make you know

#

If you have a single player

#

Doing a single game with this ai it would take hundreds and hundreds of hands to make meaningful progress if you're doing it super efficient like.

steel ravine
#

It would not start from 0, a pretrained model will be distributed with the game and will correct itself to mirror the player better

boreal portal
#

Now i ain't a super poker player but you would want to gather the data that affects how a player would play

#

blackjack would be easier

steel ravine
#

The problem I'm trying to solve is the input layer of the NN. Should I use one neuron for each card and set it to 1 if you have that card or should I only use a single input neuron and interpolate it between 0 and 1 to show the cards value
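The two input encodings being compared could look like this. The card notation, deck order, and function names are assumptions made for the sketch, not part of the game described:

```python
RANKS = "23456789TJQKA"
SUITS = "cdhs"  # clubs, diamonds, hearts, spades
DECK = [r + s for r in RANKS for s in SUITS]  # 52 cards, e.g. "As", "Td"

def one_hot_hand(cards):
    """52-element vector with 1.0 at the position of each held card."""
    vec = [0.0] * len(DECK)
    for card in cards:
        vec[DECK.index(card)] = 1.0
    return vec

def scalar_card(card):
    """Single value in [0, 1]; implies an ordering across suits the game may not have."""
    return DECK.index(card) / (len(DECK) - 1)

hand = one_hot_hand(["As", "Kd"])
print(sum(hand), scalar_card("As"))
```

The one-hot form costs more inputs but lets the network treat each card independently; the scalar form forces it to untangle rank and suit from one number.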

boreal portal
#

I'd say more the better

#

On that one

#

If you want it to be good and use the least data possible right

steel ravine
#

Ok, thanks for the input

boreal portal
#

You gonna use some transfer learning?

#

And hell you can test many different models

#

You know you gotta experiment

#

Don't be afraid to go back and reexamine. Nothing about your project is set in stone.

vernal cypress
#

hi, im just starting with pandas. is this a good place to ask for help? or should i go to python help?

paper niche
#

@vernal cypress either’s fine

vernal cypress
#

A'ight

#

I have an online store

#

each month i pay artists royalties

#

I'm trying to build a simple tool that will count up all the product sold within a month

#

i can export CSVs through my webstore admin panel

#

I sell shirts among other things

#

One design can be 8 or more colors and 6-7 sizes

#

the CSV i export keeps all of that in one column, "Lineitem name"

#

What would be the best way of aggregating all rows that contain "Aesthetic shirt" to one and creating a new data frame from it?

#

is the .groupby even appropriate?

paper niche
#

Probably u just need to split out the Lineitem name into 3 different columns (name, colour, size) then group by name

#

I think something that complicates things is that you have that Heather Prism shirt as well..

#

are all ur item names always before a hyphen?

vernal cypress
#

oh damn you're right, they are

#

the csv i get has 72 columns, it's overwhelming lol. is there any way i can drop all columns except the one i need? i've been dropping them each by name

paper niche
#

oh i just realized Heather Prism Mint is the colour isnt it

#

doh

vernal cypress
#

yes heather prism mint is the color

paper niche
#

yep one sec lemme shift over to my laptop

vernal cypress
#

but just the fact that you pointed out that they're all separated by a hyphen and the sizes are separated by a slash already helps a bunch lol

paper niche
#

> the csv i get has 72 columns, it's overwhelming lol. is there any way i can drop all columns except the one i need? i've been dropping them each by name
@vernal cypress you can just select the column you need instead of dropping the ones you don't.

df.loc[:, ['col_1', 'col_2']]
vernal cypress
#

excellent, thank you

paper niche
#

np, glad to help

#

you can also select a subset of columns to read in if you're using pd.read_csv() (I forgot the argument name, but you can find it on the docs I'm sure)
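The `pd.read_csv()` argument for this is `usecols`. A short sketch with an in-memory stand-in for the export; every column name other than "Lineitem name" is made up:

```python
import io
import pandas as pd

# Stand-in for the exported file; only "Lineitem name" comes from the conversation.
csv_text = """Name,Lineitem name,Lineitem quantity,Email
1001,Aesthetic shirt - Black / M,1,a@example.com
1002,Aesthetic shirt - White / L,2,b@example.com
"""

# usecols limits parsing to the listed columns.
df = pd.read_csv(io.StringIO(csv_text), usecols=["Lineitem name", "Lineitem quantity"])
print(df.columns.tolist())
```

With a real export, the `io.StringIO(...)` argument would be replaced by the file path.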

vernal cypress
#

i'll probably ask a few more dumb newbie questions before i figure it out but now i think i have the right approach

rich silo
#

Anyone here has experience in Dash?

paper niche
#

just ask your question, don't ask to ask. If someone knows the solution, they will answer.

vernal cypress
#

@paper niche
`def get_color(color):
    return color.split('-')[1]

df['Column'] = df['Lineitem name'].apply(lambda x: get_color(x))`

gives me
"IndexError: list index out of range"

#

apparently it has something to do with pandas not knowing what to do with empty columns

#

some of the Lineitem names don't contain any hyphens or anything

#

is that why?

paper niche
#

yeah if some names have no hyphens then .split('-') will only return a list of 1 element (the original string), then [1] will give an IndexError
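A small plain-Python illustration of that failure and one way to guard against it. The item names are hypothetical, and the " - " delimiter is taken from the later messages:

```python
def get_color(item):
    """Return the part after the first ' - ', or None when there is no delimiter."""
    parts = item.split(" - ", 1)
    return parts[1] if len(parts) > 1 else None

print("Gift card".split("-"))                      # ['Gift card'], so [1] would raise IndexError
print(get_color("Aesthetic shirt - Black / M"))    # Black / M
print(get_color("Gift card"))                      # None
```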

#

you need the colour? I thought you wanted the name

vernal cypress
#

well my idea was to separate the color and size into separate columns and then use .groupby on the name

paper niche
#

this method also doesn't throw an error if your item name has no ' - ' delimiter, it just returns None in the second column (colour) after expanding

vernal cypress
#

oh sick!

#

lemme try

#

now how do i move that data to a separate column, leaving just the name

paper niche
#

or if you really only want the name, then

df['name'] = df['item'].str.split(' - ', expand=True)[0]
#

something like that

vernal cypress
#

can i use df.loc or should i just df.drop?

paper niche
#

assign it back to df, then drop the other columns you don't need
regarding this, you might also want to specify n = 1 in the argument for .str.split() to ensure you only get 2 columns out
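Put together, the split-then-group approach might look like the sketch below. The item strings are invented, and `variant` is a hypothetical column name for everything after the first " - ":

```python
import pandas as pd

df = pd.DataFrame({"item": [
    "Aesthetic shirt - Heather Prism Mint / M",
    "Aesthetic shirt - Black / L",
    "Sticker pack",
]})

# n=1 splits on the first " - " only, so at most two columns come back.
parts = df["item"].str.split(" - ", n=1, expand=True)
df["name"] = parts[0]
df["variant"] = parts[1]   # missing for rows without the delimiter

# Rows per product name, regardless of colour and size.
sold = df.groupby("name").size()
print(sold)
```

If the export has a quantity column, summing that column per name would be closer to units sold than counting rows.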

#

either is fine, use whatever's more convenient

vernal cypress
#

alright sweet, i got it down to a somewhat tidy csv now, i can df.groupby('Product name').count() and i see exactly what i want

fringe violet
#

well i was told to go here.....

#

anyone know anything about euler's method?

stark mulch
#

!ask

arctic wedge BOT
#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

fringe violet
#

i didn't feel like explaining in here my issue if no one in here knows anything about basic numerical methods for solving differential equations

#

hence why i just asked if anyone knows about euler's method, the algorithm i'm using

#

anyway, here's my code for eulersmethod which just returns a set of data that matplotlib plots

#
```py
def eulersmethod(init_x, init_y, dx, range_val, derivative, val_is_range ):
    points = [[init_x], [init_y]]
    # start with initial point first apply Euler's Method to the positive right
    x = init_x
    y = init_y
    derivative_at_last_pnt = derivative(x, y)
    # treat range_val like a range
    if val_is_range:
        while y <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while y >= - (range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)
    else:
        # treat range_val like a domain
        while x <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while x >= - (range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)

    return points
```
#

i've solved on paper a couple of well known functions for their derivatives in terms of x and y and plugged those functions into the code to test it. e^x is giving me weird functionality

#

sinx is giving me a straight line

#

y = x looks fine

stark mulch
#

have you printed your points out and inspected them?

fringe violet
#

that would be a lot to look at which is why the computer is doing it

stark mulch
#

fair. there's a point where your x value goes backwards. That's why you're getting the weird graph you're getting.
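Going by the code as posted, one likely cause is that the negative-direction loop continues from the rightmost point instead of the initial point, so x turns around and retraces the curve. A simplified sketch of a version that restarts from the initial point; the function name and the `half_width` parameter are assumptions, and the domain is centred on `x0` here rather than on zero:

```python
def euler_both_ways(x0, y0, dx, half_width, derivative):
    """Euler's method from (x0, y0) out to x0 +/- half_width; returns [xs, ys]."""
    xs, ys = [x0], [y0]

    # Forward pass: step to the right of the initial point.
    x, y = x0, y0
    while x < x0 + half_width:
        y += dx * derivative(x, y)
        x += dx
        xs.append(x)
        ys.append(y)

    # Backward pass: restart from the initial point, not from the right edge.
    x, y = x0, y0
    while x > x0 - half_width:
        y -= dx * derivative(x, y)
        x -= dx
        xs.insert(0, x)
        ys.insert(0, y)

    return [xs, ys]

# dy/dx = y with y(0) = 1 approximates e^x.
xs, ys = euler_both_ways(0.0, 1.0, 0.01, 1.0, lambda x, y: y)
print(ys[0], ys[-1])
```

With this layout the x values come out strictly increasing, which is what a line plot expects.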

fringe violet
#

for domain -10 to 10 i'd have 2000 points to look at

stark mulch
#

print them out one per line, round them to a common width, and scroll until you spot the pattern.

fringe violet
#

i know what the issue is giving me weird functionality with it going backwards hold up....

neat harness
#

Is there a reason why you don't start all the way on the left or right (x axis) rather than in the middle?

fringe violet
#

ok i fixed that issue

#

i start wherever the initial point is

#

this is how it's supposed to look minus that weird little blip but i think i could figure that out

#

so that is fixed let me see what sin x looks like now

#

ok so as you probably know...

stark mulch
#

so it seems like every y value you calculate is 0, then

fringe violet
#

ok so i know what the issue is. i'm giving it an initial value of (0, 0) which should give me a sin function but i'm multiplying dx * derivative_of_sinx which is -y^2/2 so it always returns 0

neat harness
#

Correct me if I'm misunderstanding this but it looks like you first loop with y or x until it goes above the upper limit, then you try to iterate until it goes below the lower limit? Wouldn't you want only one loop that requires it's between both limits?

fringe violet
#

ok i have to separate it into two because i'm starting at some point. i want to use the derivative function to approximate the curve for some step size dx up to and below some hardcoded bounds

#

so i first go up then down. i have no way of knowing what the value at the far left will be until i get there and same for the far right. that's what the code does. it graphically solves the differential equation

#

Traceback (most recent call last):
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 64, in <module>
    points = eulersmethod(x0, y0, 0.01, domain_size, sineofx, val_is_range=False)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 43, in eulersmethod
    derivative_at_last_pnt = derivative(x, y)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 53, in sineofx
    return -(y**2)/2
OverflowError: (34, 'Result too large')

#

i tried to start at initial value (pi/2, 1) didn't like that

#

this is just so broken if it isn't e^x or y=x... maybe euler's method just doesn't work on this kind of de?

neat harness
#

Euler's method is fairly general. It may be worth taking a second look at your math outside of the function you shared

fringe violet
#

well all the functions i'm using are fairly simple

neat harness
#

And also what y value caused that error

fringe violet
#

dy/dx = y is e^x

#

uh idk what value specifically but when the range is 2pi and i try to use sinofx() it gives that error

#

i'll post sinofx:

#
```py
def sineofx(x, y):
    return -(y**2)/2
```
neat harness
#

Try running sinofx through a bunch of inputs and print each input before the ** line

#

Then you can see which causes an error

fringe violet
#

i got this from the DE y'' + y = 0 which i know has the solution Acosx + Bsinx just from how many times this equation comes up in physics

#

so i solved it for dy/dx by integrating both sides to get y' = - (y^2/2)

#

alright i'll add that in there in a second. i'm checking to see if a polynomial will work

#

ok it's accurate for polynomials too

#

so far just the sinofx derivative function isn't working

#

how to set the width of the float when i print it?

neat harness
#

There's a couple options, do you know about .format, str % value, or f-strings?

fringe violet
#

oh nevermind i'll look into it later

#

-2.1399999999999983 for x returns......
-1.6618870538680888e+229

#

which doesn't make sense. it should oscillate between -1 and 1.

#

wait

#

i know what the issue is.... i rushed through the math and made a mistake. give me one sec

#

what i did just didn't make sense mathematically

#

y'' + y = 0
you can't just integrate y in terms of dx like that it doesn't make any sense

#

integ(y)dx =/= y^2/2 + c.
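The usual way to feed y'' + y = 0 to Euler's method is to rewrite it as two first-order equations, y' = v and v' = -y, and step both together. A minimal sketch of that idea; the function name is hypothetical and this is not the code from the conversation:

```python
import math

def euler_system(y0, v0, dx, steps):
    """Euler's method for y' = v, v' = -y (equivalent to y'' + y = 0)."""
    y, v = y0, v0
    ys = [y]
    for _ in range(steps):
        # Update both variables from the values at the previous step.
        y, v = y + dx * v, v - dx * y
        ys.append(y)
    return ys

# y(0) = 0, y'(0) = 1 selects the sin(x) solution.
ys = euler_system(0.0, 1.0, 0.001, 1000)
print(ys[-1], math.sin(1.0))
```

Plain Euler slowly inflates the amplitude of an oscillator, so over many periods the curve drifts outward; a smaller step or a different method reduces that.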

neat harness
#

To print a float (x here) with a specific number of digits (3 here) after the decimal point:

  1. String.format function:
    print("{:.03f}".format(x))
  2. f-strings (my favorite, .format but convenient):
    print(f"{x:.03f}")
  3. String % operator:
    print("%.03f" % x)
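For a fixed column width, which is what the earlier question asked about, the same format spec accepts a width before the precision. A small sketch; the width of 10 is arbitrary:

```python
x = 3.14159

# Width 10, 3 digits after the decimal point, right-aligned by default.
print(f"{x:10.3f}")

# Left-aligned in the same width, with a bar to show the padding.
print(f"{x:<10.3f}|")
```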
fringe violet
#

thanks

#

well i think i know what the issue is now. i'm going to never touch this again and move on to the main point of my code haha

neat harness
#

Good luck

fringe violet
#

i don't need a sinx anyway. i just wanted to test euler's method so i could move on to a different method that was a bit more complicated

#

thanks

lapis sequoia
#

so

#

I forgot to set seed

#

and my computations turned out really good.. I don't know what my seed is

#

how do I reproduce omg x.x

devout sail
#

You don't? Not sure what you're doing, but you can run it several times and pick the best result

lapis sequoia
#

I can't run it several times. It's running on tpu

#

I'm trying to figure out a way to find the current seed even though I didnt' set it

#

im hoping np.random.get_state() would work

#

I'm using pytorch..

devout sail
#

If you're using jupyter or something and you didn't stop the kernel at any point you might be able to do it
https://stackoverflow.com/questions/32172054/how-can-i-retrieve-the-current-seed-of-numpys-random-number-generator for a more detailed explanation of how it works
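A sketch of what saving and restoring NumPy's global generator state looks like. Note that `np.random.get_state()` returns the current state, not the original seed, and it only covers NumPy; PyTorch keeps its own generator (`torch.initial_seed()` and `torch.get_rng_state()` exist for the default CPU one), so whether any of this recovers a TPU run is uncertain:

```python
import numpy as np

# Capture the global NumPy generator state before drawing anything.
state = np.random.get_state()
first = np.random.rand(3)

# Restoring the state replays the same draws.
np.random.set_state(state)
again = np.random.rand(3)

print(first)
print(again)
```

This only helps if the state is captured before the draws of interest; once numbers have been consumed, the earlier state cannot be read back from the generator.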

#

Though I'd say your aim should be saving the results / model, than trying to recreate the computation

hearty jewel
modern coral
#

Uhhhh

#

Missing bracket? [

spark stag
#

@hearty jewel the line above, you open a [

modern coral
hearty jewel
#

fixed it

#

still not working

modern coral
#

What's the new error?

hearty jewel
#

same thing

#

is this a bug

spark stag
#

line above

#

you have pulls[pulls... but don't close that first [

modern coral
#

Right. THAT line is raising the error.

hearty jewel
#

u guys are awesome

#

lol

modern coral
#

Because it has been searching for the closing ] but it just got a new line.
Well, we fixed two errors.

spark stag
#

if there is a syntax error pointing to something like a variable at the start of a line then usually look back a line or 2

hearty jewel
#

now theres a new error lol

#

on the 'date'.year component

#

im trying to group by user by year

#

whats wrong with the code there?

polar acorn
#

I think groupby needs a list of columns so it would have to be by_author.groupby(['user', by_author.....])

hearty jewel
#

i tried that

#

it didnt work

#

like

#

by_author['date'].dt.year

polar acorn
#

Did you try counts = by_author.groupby(['user', by_author['date'].dt.year]).agg(?
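A small sketch of that suggestion on made-up data. The `by_author` frame here is a hypothetical stand-in with only `user` and `date` columns; the real frame from the conversation was never shown.

```python
import pandas as pd

# Hypothetical stand-in for the by_author frame from the conversation.
by_author = pd.DataFrame({
    "user": ["ann", "ann", "bob", "ann"],
    "date": pd.to_datetime(["2019-03-01", "2019-07-15", "2019-01-20", "2020-02-02"]),
})

# Group on a column name plus a derived Series (the year of each date).
year = by_author["date"].dt.year.rename("year")
counts = by_author.groupby(["user", year]).size()

print(counts)
```

Passing a Series next to a column name works because groupby aligns it with the frame by index; the `.dt` accessor only works once the column really is a datetime dtype.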

hearty jewel
#

i fixed it kind of

#

im getting a new error now

#

on the 'date' in counts_wide

solid mantle
#

My jupyter notebook breaks after running a particular piece of code. 'Kernel connection to server cannot be established'. Although it runs just fine before running that code

#

any ideas?

hearty jewel
#

figured it out

#

thanks anyways guys!

lapis sequoia
#

I need some free tpus

#

preferably over 8 cores and 64 gb ram

solid mantle
#

import numpy as np
import pymc3 as pm

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x

y = y_real + eps_real

data = np.stack((x, y)).T

with pm.Model() as pearson_model:
    μ = pm.Normal('μ', mu=data.mean(0), sd=10, shape=2)
    σ_1 = pm.HalfNormal('σ_1', 10)
    σ_2 = pm.HalfNormal('σ_2', 10)
    ρ = pm.Uniform('ρ', -1., 1.)
    r2 = pm.Deterministic('r2', ρ**2)
    cov = pm.math.stack(([σ_1**2, σ_1*σ_2*ρ],
                         [σ_1*σ_2*ρ, σ_2**2]))

    y_pred = pm.MvNormal('y_pred', mu=μ, cov=cov, observed=data)

    trace_p = pm.sample(1000)
#

its this bit of code

#

shuts down my jupyter kernel

buoyant imp
#

Hello everyone,

#

I have a question regarding Mathematics for ML/DS etc.
I am currently learning Linear Algebra and I am fairly understanding the topic.
Should I start solving exercises from these topics manually (pen/paper style) to understand the topic more or is there anything else I should try?
I have Mathematics background during my undergrad but its been 2-3 years since I last solved any problems.
Also same goes with Probability and Statistics?

eager heath
#

Yeah it can be nice to get your pen and paper and test yourself to make sure you learnt those topics fully :D

earnest meteor
#

Hi I am reviving some client's project, it's based on Tensorflow 1.1 (I updated the package to 1.5).

There is a line statement like this:

# The original code is tf.contrib.lite, I migrated it as new style
interpreter = tf.lite.Interpreter(model_path='data/model.tflite')

I have a limited experience with it, I just need to move the ML parts as a separate python package for portability and write unit tests, so is this model "model.tflite" something common I can download from somewhere?

Thanks

buoyant imp
#

Yeah it can be nice to get your pen and paper and test yourself to make sure you learnt those topics fully :D
@eager heath

Hey, any good resources for problems?

eager heath
#

I don't have any, no sorry :/

buoyant imp
#

It's alright. Thank you.

zealous hinge
#

project euler!

lapis sequoia
#

hello, is there a faster way to deal with 20 categorical columns with 50+ levels than to go through each one individually, look at the value counts, and assign it as a binary as whether or not it is the max value count?

ripe forge
#

Depends on how exactly you coded it up

#

Cause logic wise, you have to do all that

#

So then the only question is, did you vectorize the code properly or did you use loops and so on

lapis sequoia
#

alright thank you

desert oar
#

how fast is "fast"

#

looping over 20 columns shouldn't take that long

#
def is_most_common_category(s):
    counts = s.value_counts()
    return s == counts.idxmax()

data_binarized = data[list_of_categorical_column_names].apply(is_most_common_category)
silk axle
#

How do I go about centralising values in my pandas dataframe when I print it?

desert oar
#

@silk axle you can call .str.center before printing

#

the column names i think you can control with a display option

#

hmm.. colheader_justify is only for left or right

#

no centered

silk axle
#

AttributeError: 'DataFrame' object has no attribute 'str'?

#

.to_string().center?

lapis sequoia
#

how do you know how many categorical levels are too much? for instance, i have a 6000 row dataset with a categorical variable having 50 levels. is that too much?

desert oar
#

@silk axle on each column

#
print(data.apply(lambda x: x.str.center()))
#

@lapis sequoia too much for what

silk axle
#

Is there not a better way? @desert oar

desert oar
#

not that i know of

#

im looking through the display options

lapis sequoia
#

@lapis sequoia too much for what
@desert oar for an ML model (i.e. random forest)

desert oar
#

i dont see one

#

@lapis sequoia potentially yes, random forests tend to over-weight features with lots of categorical values

silk axle
#

AttributeError: Can only use .str accessor with string values! since not all the values are string @desert oar

desert oar
#

then convert to string first i guess. make a separate function to if/else based on the dtype

lapis sequoia
#

any alternatives that i could use where high-level categorical features wont have an impact?

#

would i use like a regression?

desert oar
#

something with regularization

silk axle
#

then convert to string first i guess. make a separate function to if/else based on the dtype
@desert oar this doesn't really mean anything to me. I know literally nothing about pandas other than how to read in an excel file.

lapis sequoia
#

@desert oar this doesn't really mean anything to me. I know literally nothing about pandas other than how to read in an excel file.
@silk axle bro u can google this

silk axle
#

I can't google it if I don't know what they mean

desert oar
#

@silk axle

def center_text(s):
    return s.map(str, na_action='ignore').str.center()

print(data.apply(center_text))

something like that

#

if you want to get more specific use if/else and check the .dtype of s

#

e.g. if it's a datetime series you can strftime it

silk axle
#

Would s be each like record?

desert oar
#

i think you need to brush up on pandas basics 😉

#

.apply by default applies a function to each column in the data

silk axle
#

As I said, I know nothing about pandas

desert oar
#

.map applies a function to each element of a series elementwise

#

a DataFrame is (logically) a collection of Serieses

silk axle
#

TypeError: center() missing 1 required positional argument: 'width'

desert oar
#

any .str method functions like the regular str methods

#

so .str.center is the same as str.center in regular python

silk axle
#

didn't even know str.center was a thing

desert oar
#

!d g str.center

arctic wedgeBOT
#
str.center(width[, fillchar])
Return centered in a string of length *width*. Padding is done using the specified *fillchar* (default is an ASCII space). The original string is returned if *width* is less than or equal to `len(s)`.
silk axle
#

I guess the width would be the length of the header?

desert oar
#

yeah, or the length of the longest string maybe

silk axle
#

true

desert oar
#
def center_text(x):
    s = x.map(str, na_action='ignore')
    l = max(s.name, s.map(len, na_action='ignore').max())
    return s.str.center(l)

maybe

silk axle
#
    l = max(s.name, s.map(len, na_action='ignore').max())
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'
#

Tbh I should probs stop copy-pasting and actually look at the docs to see what this stuff does lol

desert oar
#

yes lol

#

also look at what i wrote

#

the point is that in general there isnt a good way to do this

#

without manually centering everything

#

how you do that is up to you

silk axle
#

s = x.map(str, na_action='ignore') so this converts x to a str, right?

#

And if x is a missing value (e.g. NaN) then ignore it

lapis sequoia
#

@desert oar so can i use a dataframe of 6000 rows that has categorical variables with 300 levels, 5 levels, and 2000 levels and put this through lasso regression?

desert oar
#

yes @silk axle

#

@lapis sequoia regression has its own problems. each level of each variable is a separate parameter

lapis sequoia
#

yeah thats what i thought too. im not sure how to proceed because there are so many levels

desert oar
#

so yes lasso or ridge can help but you are depending on regularization to make it make sense

lapis sequoia
#

what do you suggest

desert oar
#

is it only these categorical features?

lapis sequoia
#

no there are more

desert oar
#

hm

lapis sequoia
#

ill show you this

desert oar
#

one option is feature hashing

#

for the 2000 level one

lapis sequoia
desert oar
#

which basically groups categories randomly together
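A rough illustration of the hashing trick using only the standard library. The bucket count of 8 and the helper names `hash_bucket` and `hash_encode` are made up for this sketch; scikit-learn's `FeatureHasher` is the usual ready-made version.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 8) -> int:
    """Map a category label to one of n_buckets columns, deterministically."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(categories, n_buckets: int = 8):
    """One-hot style rows with n_buckets columns instead of one per level."""
    rows = []
    for category in categories:
        row = [0] * n_buckets
        row[hash_bucket(category, n_buckets)] = 1
        rows.append(row)
    return rows

keywords = ["Equity", "Investing", "Bonds", "Equity"]
encoded = hash_encode(keywords)
print(encoded)
```

With 2000+ levels and far fewer buckets, unrelated levels will collide in the same column. That is the "random grouping", and it is the price paid for a fixed, small number of features.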

#

huh

lapis sequoia
#

alright ill look that up

desert oar
#

are you sure all these features are meaningful

#

i see _NAME

lapis sequoia
#

some of them arent

#

yeah those and the dates arent meaningful

ripe forge
#

Date also

desert oar
#

ok good i was going to say

#

you had better not include a date as a categorical lol

lapis sequoia
#

yeah the problem is, some of them just have so many levels

desert oar
#

what's an example of one of these features

#

like what does it actually represent

lapis sequoia
#

im trying to group it into more levels so that i can have 80% of the value counts within 10 levels (for example) and have the remaining 20% listed as 'Other'

#

yeah so one of them is like 'Keywords' which has 2360 levels

#

which represents like 'Equity' or 'Investing' or something

ripe forge
#

Oh i have a suggestion for that kind of crap

lapis sequoia
#

another is the group with 1395 levels and a list of different business units

#

Oh i have a suggestion for that kind of crap
listening

desert oar
#

1400 business units wew

silk axle
#

I'm not getting anywhere with this @desert oar, it's really confusing me :/

desert oar
#

@silk axle you can also just loop over your data and print each row manually

ripe forge
#

For business units try doing some meaningful aggregation

desert oar
#

^

ripe forge
#

Like group together the units

lapis sequoia
#

that are similar?

#

i see

ripe forge
#

For the words, perhaps use embedding.

lapis sequoia
#

this is so much more work than i anticipated ha

ripe forge
#

It ends up making more variables, but all continuous

desert oar
#

definitely feature hashing for the keywords, or even an embedding like word2vec if a lot of your records have multiple keywords

#

yes, welcome to machine learning

ripe forge
#

And they contain meaning in vector space

desert oar
#

65% fucking with data, 15% training models, 20% sitting in meetings

ripe forge
#

More like 50% 5% and 45% fml

desert oar
#

lol true

#

also throw in 10% writing ad-hoc code for random other projects

lapis sequoia
#

lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%

desert oar
#

thats not a bad option

ripe forge
#

Only if grouping them makes sense

lapis sequoia
#

im trying to create a function right now to see how many variables are captured by 80% of the value counts

desert oar
#
ripe forge
#

Like it doesn't make sense if you grouped ceos and trainees together for example

lapis sequoia
#

yeah the problem is you lose a lot of information

Only if grouping them makes sense

ripe forge
#

Cause that might logically be bad

lapis sequoia
#

right

#

thanks salt, ill check those links out

ripe forge
#

There is one other option that I personally haven't checked out

#

Called target encoding. Supposed to be some magical nonsense

desert oar
#

yep target encoding works

#

ive done it

#

theres no implementation currently for multi-class though

#

only regression and binary classification

ripe forge
#

But it's another cheeky way to get a representation that's much easier for models to handle
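For the binary or regression case, target encoding can be sketched in a few lines. This is one common smoothed variant, not the method of any particular library; `target_encode` and its `smoothing` parameter are names invented for the example.

```python
import pandas as pd

def target_encode(categories: pd.Series, target: pd.Series, smoothing: float = 10.0) -> pd.Series:
    """Replace each category with a smoothed mean of a 0/1 (or numeric) target."""
    global_mean = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    # Shrink rare categories toward the global mean.
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return categories.map(smoothed)

unit = pd.Series(["a", "a", "a", "b"])
converted = pd.Series([1, 1, 0, 0])
encoded = target_encode(unit, converted, smoothing=1.0)
print(encoded.tolist())
```

In practice the encoding should be fitted on training folds only, otherwise the target leaks into the feature.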

desert oar
#

the multi-class version is more complicated too, i don't remember how it works off the top of my head

lapis sequoia
#

ah so it wont work for my case?

desert oar
#

youd have to implement it yourself

lapis sequoia
#

alright. what about this scenario

#

so my dataframe together has 84 columns. what if i only looked at a dataframe with less than 15 different levels? that'd result in 37 columns

#

could i create a prediction off that

#

i know im losing a lot of information

desert oar
lapis sequoia
#

thank you

desert oar
#

i think you should at least consider which features make sense for your business problem too

#

you can also try computing the mutual information between each feature and the target

#

and drop features with MI below a threshold
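A standard-library sketch of mutual information between one categorical feature and a discrete target. The toy lists are invented; any cut-off threshold would have to be chosen for the real data.

```python
import math
from collections import Counter

def mutual_information(xs, ys) -> float:
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

target = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]   # perfectly predicts the target
noise = ["a", "b", "a", "b"]         # independent of the target

print(mutual_information(informative, target))
print(mutual_information(noise, target))
```

A feature whose score sits near zero tells you almost nothing about the target on its own, which is the bivariate check described above.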

lapis sequoia
#

so basically calculating variable importance?

desert oar
#

more like bivariate association

lapis sequoia
#

alright

#

i'll check this out

#

this is my first month working as a professional data scientist lol and i never implemented any of these methods in school

desert oar
#

its just because correlation doesnt work on categorical features

#

otherwise youd just use correlation

#

yeah welcome

#

at least you have the sense to ask questions

#

i struggled for years just trying to DIY everything

lapis sequoia
#

its just because correlation doesnt work on categorical features
yeah. if this data were numerical it'd make my life so much easier

desert oar
#

yep. again welcome to data science

lapis sequoia
#

i struggled for years just trying to DIY everything
lol i cant imagine how difficult it'd be if you didnt have someone to ask like i do right now

#

thank you though

desert oar
#

hint: it sucked and i wasnt very good at it

lapis sequoia
#

ha

silk axle
#

@desert oar I've kinda made progress, but still not quite where I want

#
def center_text(x):
    if isinstance(x.dtype, str):
        l = max(x.map(len))  # get the highest length of string
        print(l)
        return x.str.center(l)  # center based on longest length
    else:
        s = x.map(str)
        return s.str.center(len(x.name))  # center based on length of title

print(df.apply(center_text))
desert oar
#

isinstance won't work with dtype

#

dtypes are their own objects (numpy dtypes mostly), not python types, so that check is always False

#

"string" columns are just "O" dtype which means "arbitrary python objects"

#

so sadly strings don't get a dedicated datatype by default in pandas (the opt-in "string" dtype is still experimental)

silk axle
#

So I can't do something different for strings and integers?

desert oar
#

you can, but it really depends on the dtype

#

you can have "integers" of 3.0 and 4.0 in a "float" dtype column

#

or you can have integers in an "O" dtype column even though it should probably be "int"
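One way to branch on the column type is `pandas.api.types` rather than `isinstance`. This is only a sketch: `center_column` is a hypothetical helper, and the two-column frame is made up to mirror the conversation.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def center_column(col: pd.Series) -> pd.Series:
    """Turn a column into centered text, formatting numbers with 2 decimals."""
    if is_numeric_dtype(col.dtype):
        text = col.map(lambda v: f"{v:.2f}")
    else:
        text = col.astype(str)
    # Wide enough for both the header and the longest value.
    width = max(len(str(col.name)), int(text.str.len().max()))
    return text.str.center(width)

df = pd.DataFrame({"Horse": ["Red Rum", "Arkle"], "Total Stake": [4, 10]})
centered = df.apply(center_column)
print(centered)
```

Note that `is_numeric_dtype` also returns True for bool columns, so a real version would probably handle those before the numeric branch.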

silk axle
#
Date               datetime64[ns]
Horse                      object
Track                      object
Time                       object
Non Runner                   bool
Odds                       object
Total Stake                 int64
Win or Each Way            object
Actual Winnings           float64
those are the dtypes
desert oar
#

looks like the times are strings?

#

Time

silk axle
#

How would I change that?

#

11/06/2020 15:40:00 is an example

#

That's the format

#

dd/mm/yyyy hh:mm:ss
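For strings in that day-first layout, `pd.to_datetime` with an explicit format converts the column. The two sample values below are made up apart from the one quoted above.

```python
import pandas as pd

times = pd.Series(["11/06/2020 15:40:00", "12/06/2020 09:05:00"])

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(times, format="%d/%m/%Y %H:%M:%S")
print(parsed.dtype)
```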

desert oar
#

why do you have both Date and Time then?

#

regardless, doesnt matter

silk axle
#

I merged Time and Date and forgot to remove Time

#

lol

#
Date               datetime64[ns]
Horse                      object
Track                      object
Non Runner                   bool
Odds                       object
Total Stake                 int64
Win or Each Way            object
Actual Winnings           float64
so these are the dtypes
desert oar
#

ah wait

#

you can assign formatters to each column this way

#

use that

#

see formatters and float_format

silk axle
#

I saw that earlier but no clue how to use it lmao

desert oar
#

Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

#

formatters: list, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

silk axle
#

formatters = center_text?

desert oar
#

that, or

formatters={
    'Horse': lambda s: s.center(15),
    'Non Runner': lambda s: 'Non Runner' if s else 'Runner'
}

etc.
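Put together, a `formatters` dict is passed to `DataFrame.to_string`. The frame below is a made-up three-column slice of the data discussed here, and the widths 15 and 11 are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Horse": ["Red Rum", "Arkle"],
    "Non Runner": [False, True],
    "Total Stake": [4, 10],
})

# Each formatter receives one cell value and must return a string.
formatters = {
    "Horse": lambda s: s.center(15),
    "Non Runner": lambda flag: "Non Runner" if flag else "Runner",
    "Total Stake": lambda x: f"£{x:05.2f}".center(11),
}

text = df.to_string(formatters=formatters, index=False)
print(text)
```

Columns without an entry in the dict keep their default formatting, so it is fine to list only the columns that need special handling.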

silk axle
#

Right yea

#

Thanks 👍

silk axle
#

I'm almost there @desert oar, got another quick question

desert oar
#

sure

silk axle
#

How can I make it so that it sends 4.00 instead of just 4 (type is int64)

#

If I can £4.00

#
    'Total Stake': lambda s: f"{str(s).center(11):.2f}",
ValueError: Unknown format code 'f' for object of type 'str'
#

That's what I tried

#

Wait I know why

#

Yep fixed

desert oar
#

you can use the float_format parameter too

#

oh if it's int nvm

#
    'Total Stake': lambda x: format(float(x), "0.2f").center(11)
#

@silk axle ^

#

also i was just using s to mean either "series" or "string"

silk axle
#

Rn I've got lambda s: f"£{s:.2f}".center(11)

desert oar
#

!e ```python
x = 3
print(f"{x:0.2f}")
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

3.00
desert oar
#

cool, it casts it

#

i think with .format sometimes you get errors

#

cant remember

silk axle
#

'Actual Winnings': lambda s: ("£" + f"{s:.2f}".zfill(5)).center(15) this works but I'm thinking there's a tidier way -- converts 0 to 0.00 then to 00.00 then to £00.00 and then centers

desert oar
#

!e ```python
x = 1.5
print(f"£{x:2.2f}")
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

£1.50
silk axle
#

But

#

I think there was a reason I can't do that

#

Right yea

#

Because I want £01.50

desert oar
#

oh sorry use 02

#

!e ```python
x = 1.5
print(f"£{x:02.2f}")
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

£1.50
desert oar
#

wait hm

#

!e ```python
print( format(43, "06.2f") )
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

043.00
desert oar
#

let me re-read i might have missed something in the syntax

silk axle
#

Wait no

#

lambda s: f"£{s:05.2f}" this works

#

:02.2f will do for total len 2

#

You need total len as 5
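The point about total length can be checked directly. In a spec like `05.2f` the width counts every character of the result, including the decimal point and the decimals.

```python
stake = 1.5

# Width 5 covers "01.50" as a whole: two digits, the point, two decimals.
padded = f"£{stake:05.2f}"

# Width 2 is already exceeded by "1.50", so no zero is added.
unpadded = f"£{stake:02.2f}"

# Integers are accepted by the f format as well.
from_int = f"£{43:06.2f}"

print(padded, unpadded, from_int)
```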

desert oar
#

yeah right

#

more corners of python i dont use very often

silk axle
desert oar
#

does .to_string have any kwarg to mess w/ the headers?

silk axle
#

right yea, good point

#

center works

#

Spacing between the columns isn't really consistent which is annoying

#

Ig set col_space?

#

Wait no that wouldn't be it

#

Don't think so

#

Eh kinda does

#

Not really sure for this one @desert oar

desert oar
#

at this point your guess is as good as mine

#

ive never had to do this

#

or ive just manually centered by iterating over rows

silk axle
#

Ig it'll do then

#

Thanks for all your help with this 👍

river wing
#

I tried
import nltk
nltk.download('punkt')

It takes a very long time and then ultimately fails

desert oar
#

show the error you get

lapis sequoia
#

lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%
@desert oar so back to this question, if im running low on time, would you suggest i do this ?

lapis sequoia
#

i am using Python and xlwings, but the intellisense isn't working for xlwings. e.g. when i type in ws1.cells., it doesn't show the methods that i can use e.g. ws1.cells.clear_contents() any idea why?

desert oar
#

@lapis sequoia it could work. maybe a better option is to look at the cumulative % of the data represented by each category, and chop it off at some threshold

#

oh wait

#

thats what you just said

#

yes

#

perfectly valid
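The cumulative-share cut-off could look roughly like this. `lump_rare_levels` is a hypothetical helper and the keyword counts are invented; 0.8 is the coverage figure from the conversation.

```python
import pandas as pd

def lump_rare_levels(col: pd.Series, coverage: float = 0.8, other: str = "Other") -> pd.Series:
    """Keep the most frequent levels that together reach `coverage` of rows; lump the rest."""
    shares = col.value_counts(normalize=True)      # sorted, most frequent first
    cumulative = shares.cumsum()
    # Keep every level needed to reach the coverage threshold (inclusive).
    n_keep = int((cumulative < coverage).sum()) + 1
    keep = shares.index[:n_keep]
    return col.where(col.isin(keep), other)

keywords = pd.Series(["Equity"] * 5 + ["Investing"] * 4 + ["Bonds", "FX"])
lumped = lump_rare_levels(keywords, coverage=0.8)
print(lumped.value_counts())
```

Here two levels already cover about 82% of the rows, so the remaining two collapse into "Other". Whether the lumped levels belong together is still a judgment call about the data.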

lapis sequoia
#

alright thank you, im gonna try running a baseline with variables that have less than 10 levels, and then running another model of all levels but using that method i just described

desert oar
#

feature selection and engineering is always a slow process

lapis sequoia
#

lol yeah. this is my first professional DS/ML project, and ive never had to deal with data that was this messy and unorganized before

desert oar
#

at least you have all the data in one place

#

ive spent all day writing various scripts just to get my data

lapis sequoia
#

oh thats even worse lol

desert oar
#

under a completely arbitrary deadline

#

which is the most annoying part

#

everyone wants to acknowledge that software dev takes time, data science takes a fuckload of time

#

then i keep getting asked "how far do you think we can get in 3 weeks"

#

"well, week 1 will be spent writing and debugging data access and cleaning scripts, week 2 will be spent just understanding the data and prototyping maybe 1 or 2 algorithms, week 3 will be spent slapping together literally anything that seems to work because this is not enough time to get anything done"

thin terrace
#

Is there an optimal way to handle missing values?

desert oar
#

no. depends on your application

#

sometimes it's as simple as "fill in with the mean" and sometimes it's as complicated as "build an imputation model"

#

depends entirely on the data that's missing, the reasons for the missingness, the kind of model you're using, etc.
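As a concrete starting point for the simple end of that range, here is a sketch with pandas. The columns `age` and `cabin` are made up, and whether median fill or a "missing" level is appropriate depends on the data, as said above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": ["A", None, "A", "B"],
})

filled = df.copy()
# Numeric column: fill with the median, which is less sensitive to outliers than the mean.
filled["age"] = df["age"].fillna(df["age"].median())
# Categorical column: keep "missing" as its own level instead of guessing a value.
filled["cabin"] = df["cabin"].fillna("missing")
# An indicator column lets a model learn whether missingness itself matters.
filled["age_was_missing"] = df["age"].isna()

print(filled)
```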

lapis sequoia
#

Is there an optimal way to handle missing values?
@thin terrace lol im doing this right now

thin terrace
#

can you give some example(s)?

lapis sequoia
#

im using knn to impute categorical variables

desert oar
#

^ thats one option

#

@thin terrace do you have a specific example in mind?

lapis sequoia
#

everyone wants to acknowledge that software dev takes time, data science takes a fuckload of time
damn that sucks. are u a DS or ML engineer

desert oar
#

data scientist

lapis sequoia
#

you can replace with mean/median/mode/knn/regress

#

bunch of algorithms online

#

other options too

desert oar
#

hell you can train an autoencoder on the non-missing records

lapis sequoia
#

but i think those are the most frequent. is that correct?

#

yea that too

desert oar
#

multiple imputation is another option with a lot of theoretical appeal but can be difficult to implement in practice

#

can do stuff w/ gaussian mixtures, sky is kind of the limit

#

you can even fit bayesian models where you set a prior for your missing values and train the model on that

thin terrace
#

Well, not really. So I'm applying for a data scientist job (which i think I've landed). They gave me a classification task with a dataset full of missing values and I have never handled such things before (just graduated from uni). I didn't really know how to approach it. I removed some columns with very large amounts of missing values and then filled the remaining with 0.

#

But It yielded pretty shitty performance

lapis sequoia
#

dont replace with 0. you can google algorithms to replace with mean/median/mode

#

unless it makes sense to replace with 0

#

for instance, my dataset rn has a few columns missing >60% of values -- i dropped those completely. now im running a knn on missing values

thin terrace
#

When does it make sense to replace with 0? I tried with -999 first but it worked even worse

desert oar
#

they didnt teach you anything about missing values in school?

lapis sequoia
#

^ im surprised to hear that as well

desert oar
#

what degree do you have

#

what field of study, rather

thin terrace
#

M.Sc in engineering computer security

desert oar
#

no wonder

thin terrace
#

so im an outlier hehe

desert oar
#

stop thinking like an engineer, start considering what the data actually is

#

is 0 a sensible default value for the data?

#

then you can impute with 0

#

is 0 a sensible default value for a stock price?

#

of course not

#

well maybe for some stocks

#

but not in general

lapis sequoia
#

@thin terrace google how to impute missing values

thin terrace
#

so basically its ok for features where 0 would represent "nothing" ?

desert oar
#

no, youre still thinking about this wrong

#

in fact an engineer needs to think like this too

#

what does the data represent? how does my model actually work and use the numbers i give it?

thin terrace
#

im in for a challenge at this job position, thats for sure lol

desert oar
#

imo a good software engineer needs the same skillset

#

making sure that what you're doing actually makes sense for solving the real-world problem

#

for example i'm writing a client library for a huge web API right now

#

there are like 50 options

#

but my particular use case only needs like 10, 5 of which can be hard-coded

#

so i'm writing my code with that specifically in mind

#

if you're not thinking about the task from a real-world perspective you're just not going to develop good solutions. it's true for software engineering, even more true for "physical" engineering like civil/mechanical/electrical, and equally true for data science

thin terrace
#

yeah i usually try to think about the bigger picture

#

data science is very new to me, a lot of new stuff

#

I was planning on becoming a dev but here I am

desert oar
#

data science is more fun

thin terrace
#

I hope I'll think so too

lapis sequoia
#

data science is nerdy

#

whereas web dev is creative and fun 🙂

#

hehe

desert oar
#

anyway with that in mind @thin terrace it definitely helps to consider: what kind of data is this, and what would happen to my model results if i did X

#

and that's why there will never be a catch-all "what do i do with missing data" answer

slim fox
#

naaah webdev is boring xD

#

are you working as a data scientist, salt?

#

btw I just realized we have two salt helpers 😂

#

I thought we had one who changed name sometimes @desert oar

thin terrace
#

Yeah, I get that. I just don't know the answers to these questions. How do I learn?

desert oar
#

salt is just "salt" or "salt-die"

#

i was here first technically 😉

slim fox
#
  1. learn by doing: practice on different data sets, try to see the bigger picture
  2. since data science is so hot now there are plenty of resources to learn things: if you will look on internet for things like "imputation", "missing values in data" etc you will find quite some amount of guides and articles
desert oar
#

@thin terrace partly a matter of looking to see what other people have done in other specific cases

#

e.g. there are a million approaches to missing data on the kaggle titanic dataset

#

and we just suggested a few, albeit complicated ones that i think are overkill for a job interview task

slim fox
#

Towards data science part of medium.com often has good articles for example

desert oar
#

impute with mean/median/mode, fit regression, time series if appropriate, KNN, gaussian mixture

#

some models don't even need missing data imputation, e.g. some tree-based implementations handle missing values natively

#

if "missing" is a valid category level you can just leave it missing

slim fox
#

i was here first technically 😉
@desert oar oh really? didn't know 🙂 guess I just happened not to see you somehow lol. I used to see salt-die, then there was #ask-meta-for-math channel and then I started to see you but not salt-die

#

hence the wrong conclusion

desert oar
#

i was gone for a while

#

i'm not likely to be here consistently. i've just been on a lot recently

slim fox
#

I see. do you work a data science-related job?

desert oar
#

yes

thin terrace
#

I tried leaving them as missing, must have fucked something up

#

At the interview they said I could just have replaced them with -1 and I would've got pretty good performance

#

Is that such a huge difference from replacing with 0?

slim fox
#

hm. are those missing values continuous or categorical?

thin terrace
#

mostly continuous

slim fox
#

if they are continuous numbers you can also check for the pearson correlation between those features and target variable

#

if they are not really correlated you might not need them at all

thin terrace
#

well i basically ended up dropping most features and I guess that's where I went wrong

lapis sequoia
#

if i had values like 5-10, 10-20, 20-30, >1000, i would have to encode these, correct?

#

before putting it into a ML/predictive model^

thin terrace
#

sounds like a good idea
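Since those ranges have a natural order, one option is an ordered categorical turned into integer codes. The ordering list is an assumption about what the bins mean, and the sample values are invented.

```python
import pandas as pd

ranges = pd.Series(["10-20", "5-10", ">1000", "20-30", "5-10"])
order = ["5-10", "10-20", "20-30", ">1000"]   # assumed ordering, lowest to highest

as_ordered = pd.Categorical(ranges, categories=order, ordered=True)
codes = pd.Series(as_ordered.codes, index=ranges.index)

print(codes.tolist())
```

Any value not listed in `order` gets the code -1, which is worth checking for before handing the column to a model.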

unborn talon
#

why ???

thin terrace
#

missing ()

unborn talon
#

where ?

thin terrace
#

np.array( [1,2,3,4] )

unborn talon
#

oh THANKS ALOT

thin terrace
#

yw

solid mantle
#

anyone?

#

Does anyone know what passing another array as array index give?

#

idx = np.repeat(range(7), 20)
idx = np.append(idx, 7)
np.random.seed(314)
alpha_real = np.random.normal(2.5,0.5,size=8)

y = alpha_real[idx]

#

so, what would y give?

#

I looked at its kdeplot, i cant make anything out of it

thin terrace
#

you can give an array of indices to an array to retrieve a new array with the elements at the given indices
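A small check of that behaviour, reusing the `idx` construction from the question. The `alpha` values are simple stand-ins (0, 10, ..., 70) so the result is easy to read.

```python
import numpy as np

idx = np.append(np.repeat(range(7), 20), 7)   # 0 x20, 1 x20, ..., 6 x20, then one 7
alpha = np.arange(8) * 10.0                    # stand-in values: 0, 10, ..., 70

y = alpha[idx]        # one element of alpha per entry of idx

print(y.shape)
print(y[:3], y[20:23], y[-1])
```

So `y` repeats each of the first seven values 20 times and ends with a single copy of the eighth.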

solid mantle
#

ohh thank you

desert oar
#

@solid mantle time to read the numpy indexing docs 😉

#
x = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30])
print(x[0])
print(x[[0, 3]])
print(x[1:4])
print(x[x < 20])
#

the last one is indexing with an array/list of bools

solid mantle
#

@desert oar thank you

slim fox
#

then there is some crazy stuff too

#

like stride tricks

#

haven't yet figured them out

thin terrace
#

Warning This function has to be used with extreme care, see notes.

#

nice

slim fox
#

for a good reason I believe 🙂

thin terrace
#

it manipulates the internal data structure of ndarray and, if done incorrectly, the array elements can point to invalid memory and can corrupt results or crash your program

#

I don't get the usage of it
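One typical use is building overlapping sliding windows without copying data. This is only a sketch of the idea; the window size of 3 is arbitrary, and `writeable=False` is set because writing through overlapping views is exactly the kind of thing the warning is about.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

x = np.arange(6)
window = 3
step = x.strides[0]   # bytes between consecutive elements

# Each row starts one element later than the previous one; no data is copied.
windows = as_strided(
    x,
    shape=(len(x) - window + 1, window),
    strides=(step, step),
    writeable=False,   # guard against writing through overlapping views
)
moving_average = windows.mean(axis=1)

print(windows)
print(moving_average)
```

Getting `shape` or `strides` wrong here would read past the end of the buffer, which is why the docs are so cautious.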

lapis sequoia
#

how can i impute missing categorical data using knn?

desert oar
#

one basic method is to compute distances between rows, comparing only the non-missing fields in those rows. then for each field you fill each missing value from the nearest rows that have non-missing values

#

there are a few packages that implement that logic

#

its actually a little surprising that nobody has come out with a "professional-grade" library for KNN imputation, but i guess people often just end up writing their own code
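A bare-bones version of that logic for categorical rows, using only the standard library. The mismatch-share distance, the majority vote and the sample rows are all choices made for this sketch, not a reference implementation.

```python
from collections import Counter

def distance(a, b):
    """Share of mismatches over the fields that are present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(x != y for x, y in shared) / len(shared)

def knn_impute(rows, k=3):
    """Fill None cells with the most common value among the k nearest rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows that actually have a value in column j can donate one.
            donors = [r for n, r in enumerate(rows) if n != i and r[j] is not None]
            donors.sort(key=lambda r: distance(row, r))
            votes = Counter(r[j] for r in donors[:k])
            if votes:
                filled[i][j] = votes.most_common(1)[0][0]
    return filled

rows = [
    ["flat", "turf", "win"],
    ["flat", "turf", "win"],
    ["jump", "sand", "each way"],
    ["flat", "turf", None],
]
imputed = knn_impute(rows, k=2)
print(imputed[3])
```

This compares every row with every other row, so it is fine for a few thousand rows but would need a smarter neighbour search for large data.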

marsh chasm
#

Does anyone know a good way of searching for a term in a VERY LARGE CSV (5million rows total ~) without using a for loop

#

/is a for loop really that much slower ?

desert oar
#

@marsh chasm in any column? or in a specific column

marsh chasm
#

Specific column

desert oar
#

plain python can loop over 5 million rows pretty fast. pandas can do it really fast

marsh chasm
#

Hmm it’s been running for an hour it’s not done yet (regular for )

desert oar
#

show your code?

marsh chasm
#

yeah gimme a second

#
temp = []
keyword = "Panda Express"

for i in filenames:
    with open(i, 'rt') as f:
        reader = csv.reader(f)
        for row in reader: 
            if keyword == row[1]:
                temp.append(row[0])
                temp.append(row[11])
    dataframes.append(temp)
    temp.clear()       
    with open('PandaVisits.csv', 'wt') as p:
        writer = csv.writer(p)
        for r in dataframes:
            writer.writerow(r)
#

filenames is a list of csv's that i want to analyze

desert oar
#

im surprised thats taking an hour

#

wait

#

you forgot to un-indent the 2nd with open

#

so you're re-writing all your files every time you read 1 file

#

unless that's your goal...

#

even so that's a long time

#

regardless, pandas should be able to do this much faster

#

you dont have any column names? data starts on the first row?

marsh chasm
#

sorry was helping my dad w something

#

don't i need to do that bc the original for loop is like every file in filenames

desert oar
#

you're doing some weird business with the data

#

your logic is all twisted

#

oh nvm

#

you're clearing temp..

#

but still

marsh chasm
#

ah probably im new to this so it's not probably the best solution

desert oar
#

you should move the writing to the end

#

so you aren't re-writing every time you read 1 file
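
A minimal sketch of that restructuring, still using the csv module: collect matches while reading, then write once after the loop. The name `collect_visits` and its parameters are made up for illustration; the column positions 0, 1, and 11 come from the snippet above.

```python
import csv

def collect_visits(filenames, keyword, output_path):
    """Gather (column 0, column 11) from rows whose column 1 equals keyword."""
    matches = []
    for filename in filenames:
        with open(filename, "rt", newline="") as f:
            for row in csv.reader(f):
                # Skip short rows so row[11] cannot raise IndexError.
                if len(row) > 11 and row[1] == keyword:
                    matches.append((row[0], row[11]))
    # Write once, after every input file has been read.
    with open(output_path, "wt", newline="") as out:
        csv.writer(out).writerows(matches)
    return len(matches)
```

Each match becomes its own output row here, which may or may not be the layout the original code was aiming for.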

marsh chasm
#

oh wait

#

yeah lmao

desert oar
#

regardless pandas makes this a lot faster

```python
import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"

for filename in filenames:
    data = pd.read_csv(filename, usecols=[0, 1, 11], header=None, names=['x0', 'x1', 'x2'])
    has_keyword = data['x1'].str.contains(keyword)
    temp = data.loc[has_keyword, ['x0', 'x2']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False, header=False)
```
#

again, assuming your files don't have column headers

marsh chasm
#

they have column headers

#

how does panda make it faster

#

is it like... multithreading ?

desert oar
#

its written in C instead of looping manually in python

#

there is a huge amount of overhead in python code execution
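
A rough sketch of the two styles side by side, on a tiny invented DataFrame. Both give the same answer; the difference is where the per-row work happens, and how big the speed gap is on real data depends on size and dtype.

```python
import pandas as pd

data = pd.DataFrame({
    "location_name": ["Panda Express", "Other Place", "Panda Express"],
    "raw_visit_counts": [10, 20, 30],
})
keyword = "Panda Express"

# Python-level loop: the interpreter handles every row one at a time.
loop_counts = []
for name, count in zip(data["location_name"], data["raw_visit_counts"]):
    if name == keyword:
        loop_counts.append(count)

# Vectorized: the comparison is applied to the whole column in one call.
mask = data["location_name"] == keyword
vectorized_counts = data.loc[mask, "raw_visit_counts"].tolist()
```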

marsh chasm
#

ohhhh gotcha

desert oar
#

what are the column names?

#

for the columns you want, anyway

marsh chasm
#

well the ones of interest: date_range_start and raw_visit_counts

desert oar
#

and the one that you're searching in?

marsh chasm
#

location_name

desert oar
#
```python
import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"

for filename in filenames:
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False)
```

try this

marsh chasm
#

thanks! i'll crossreference it w the docs to learn it but thank u so much! being new to this its hard to figure out which tools are the best to use

#

so thank u

desert oar
#

youre welcome, thats the best way to learn

#

the user guides in pandas aren't always that helpful but the API reference is usually clear

marsh chasm
#

gotcha ty

#

ok so i ran that code and i got these errors

desert oar
#

the usual caveats about untested code apply

marsh chasm
#

yeah probably

desert oar
#

can you show more of the error?

#

it looks like you cut off the bottom

marsh chasm
#

sorry about that

desert oar
#

oh...

#

you copied and pasted my code verbatim

marsh chasm
#

it looked fine ?

desert oar
#

i bet you can figure out the problem

marsh chasm
#

ok

#

let me check

desert oar
#

hint: look near the top

marsh chasm
#

OMG I TOLD MYSELF TO DELETE THAT

#

I THOUGHT I DID

desert oar
#

lol

#

for future reference, 3 dots "..." is called an "ellipsis"

#

which hopefully makes that error message make sense
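
Concretely, `[ ... ]` is valid Python: it's a one-element list holding the built-in `Ellipsis` object, so the loop ends up handing that object to `pd.read_csv` instead of a file path. A quick check:

```python
# The placeholder list from the snippet, left unedited.
filenames = [...]

first = filenames[0]
print(len(filenames))         # the list has exactly one element
print(first is Ellipsis)      # the literal ... is the Ellipsis singleton
print(type(first).__name__)   # the type name that can show up in error messages
```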

marsh chasm
#

yeah thats why i was confused i was like tf where are the ellipsis i deleted the filename part

#

apparently

#

i did not

desert oar
#

happens to the best of us

marsh chasm
#

im looking through the docs... pandas is pretty useful xD

desert oar
#

yep, for anything with tabular data it's pretty much indispensable

marsh chasm
#

lmao to think i started this project using R

desert oar
#

R isn't bad

#

pandas owes a lot to R for its design

#

the R code would look pretty similar, and if you use a 3rd party CSV reader instead of the built-in one it's just as fast if not faster

#

i've processed billions of rows in R doing more complicated operations than this

marsh chasm
#

i used data.tables

#
+ purrr

desert oar
#

hell yeah

marsh chasm
#

but it was really

#

really

#

slow

desert oar
#

data.table? shouldnt be

marsh chasm
#

let me see if i can pull up the code

#
```r
library(purrr)
setwd("/Volumes/Seagate Backup Plus Drive/UMichStuff/v2/main-file/Relevant")

filenames = list.files(getwd())

rawVisits = 12
startDate = 10
CSVnames = filenames[filenames %like% "weekly-patterns"]
df = map_df(CSVnames, fread, header = TRUE)
df = df[which(df$location_name == "Panda Express"), c(startDate, rawVisits)]
fwrite(df, "PandaExpressVisitList.csv")```
lapis sequoia
#

my random forest model got a negative 17% R-2 value lol.

desert oar
#

@marsh chasm

```r
library("data.table")

filenames <- c( ... )
keyword <- "Panda Express"

dataframes <- lapply(filenames, function(filename) {
  dat <- fread(filename, select = c("date_range_start", "location_name", "raw_visit_counts"))
  dat[location_name == keyword, .(date_range_start, raw_visit_counts)]
})

combined_data <- rbindlist(dataframes)
fwrite(combined_data, "PandaVisits.csv")
```

pardon any mistakes, i haven't used R much recently

marsh chasm
#

oh interesting

#

i started running the python code 2 minutes ago and its not done

#

is that normal

desert oar
#

your data might be bigger than you realize

marsh chasm
#

its 150 gigs

desert oar
#

oh

#

you said 5 million rows

#

that's.... a lot more than 5 million

marsh chasm
#

im pretty sure its 5 million rows

#

oh

#

is it

desert oar
#

maybe 5 million rows with 100s of columns

#

that's a fuckton of data

marsh chasm
#

: )

desert oar
#

you have 150 gb of memory?

marsh chasm
#

external hard drive

desert oar
#

not storage

#

memory

#

RAM

marsh chasm
#

150 gigs of ram hell no

#

im running from my mac laptop

desert oar
#

uh

#

i'd ctrl+c this

marsh chasm
#

uh oh

desert oar
#

or at least keep your performance monitor open

#

do me a favor

#

how many files do you have

#

and how many lines are in each file

marsh chasm
#

like how many csv's

desert oar
#

yes

#

run this in bash:

```bash
wc -l my-data-directory/*.csv
```
#

obviously my-data-directory is the directory w/ your CSVs

lapis sequoia
#

wait @desert oar how would i go about getting data from sql database with 15m rows into python for pandas editing?

#

csv limit is 1m

#

m = million

desert oar
#

what do you mean csv limit is 1m

#

either write the query and pd.DataFrame it, or use pd.read_sql

#

(note that pd.read_sql requires sqlalchemy unless you're using sqlite)

lapis sequoia
#

wait that wouldnt work for oracle though right

#

oracle database?

desert oar
#

if it's supported by sqlalchemy it's supported by pandas

#

if not, like i said: write the query yourself and then convert to a dataframe after reading the data
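
A sketch of the `pd.read_sql` route with `chunksize`, so a big result set comes back in pieces instead of all at once. The in-memory SQLite table `visits` is invented purely so the example runs; for Oracle the connection would instead be a SQLAlchemy engine backed by an Oracle driver, which isn't shown here.

```python
import sqlite3
import pandas as pd

# Stand-in database: an in-memory SQLite table invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (location_name TEXT, raw_visit_counts INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Panda Express", 10), ("Other Place", 20), ("Panda Express", 30)],
)
conn.commit()

query = "SELECT location_name, raw_visit_counts FROM visits"

# chunksize makes read_sql return an iterator of DataFrames,
# so the full result never has to sit in memory at once.
total_rows = 0
total_visits = 0
for chunk in pd.read_sql(query, conn, chunksize=2):
    total_rows += len(chunk)
    total_visits += int(chunk["raw_visit_counts"].sum())

conn.close()
```

Pushing filters and aggregation into the SQL query itself usually cuts down what has to be transferred in the first place.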

marsh chasm
#

i could have run the command wrong, but when i did i got this:

desert oar
#

put quotes around anything with a space in it

#

but not around the "*.csv" part

marsh chasm
#

uh so like quotes around the whole filepath thing

desert oar
#
```bash
wc -l "/Volumes/My University/Users/My Name"/*.csv
```
#

like that

marsh chasm
#

ok cool

#

okie its running

#

oo its taking a long time to run

desert oar
#

probably because you have 150 GB of data

marsh chasm
#

yeah lol

#

it just counts the number of lines right

desert oar
#

yeah, but you have a fuckton of data

marsh chasm
#

a ha ha

#

so what will we do with the information when the computer finishes counting

desert oar
#

we'll see the actual number of rows

#

or at least a good approximation thereof
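
For reference, roughly the same count in Python. `count_lines` is a hypothetical helper; like `wc -l` it counts newline characters, so it includes header lines and over-counts any row whose quoted fields contain line breaks, which is why the result is only an approximation of the row count.

```python
def count_lines(path, block_size=1024 * 1024):
    """Count newline bytes in a file, reading fixed-size binary blocks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += block.count(b"\n")
    return total
```

Reading in binary blocks keeps memory use flat no matter how large the file is.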

marsh chasm
#

okie

desert oar
#

which will give us a better sense of how to approach this

marsh chasm
#

oh it finished the first two... theyre at about 3,880,000 each

#

lets see how the other ones fare

desert oar
#

how many files total?

marsh chasm
#

32

#

yeah theyre all in the upper 3.8 millions

desert oar
#

!e ```python
print( 32 * 3.8 * 1e06 )
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

121600000.0
desert oar
#

so you have like 120 million rows

#

this is approaching what you might call "big data"

marsh chasm
#

woah coding buzzword

desert oar
#

indeed

#

"big" = "too big for a hard drive"

#

so depending on the hard drive you're getting close

marsh chasm
#

yeah im running this out of a 1TB external hard drive but thats storage

#

so not too bad

#

2 TB*

desert oar
#

it's certainly "medium data" and too big for ram on most machines

#

i have a work machine with 256 GB of ram but even then i wouldn't load all this data at once, if only out of respect for my coworkers who also need the machine

marsh chasm
#

oh boy

#

mmmmm i have 8gb of ram it looks like

#

according to about my mac

desert oar
#
```python
import pandas as pd
from tqdm import tqdm  # for a nice progress bar

filenames = [ ... ]
output_filename = 'PandaVisits.csv'
keyword = "Panda Express"

for fileno, filename in tqdm(enumerate(filenames)):
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    if fileno == 0:
        temp.to_csv(output_filename, index=False)
    else:
        temp.to_csv(output_filename, index=False, header=False, mode='a')
```

anyway this should read each file one at a time, then append it to the same pandavisits.csv file

#

and i added a pretty progress bar so you can see how long it will actually take

#

(pip install tqdm or conda install tqdm)
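
If a single file is itself too big for RAM, `pd.read_csv` also accepts `chunksize`, which yields the file in pieces. A sketch under the same column-name assumptions; `filter_file`, `COLUMNS`, and the default chunk size are invented for illustration.

```python
import pandas as pd

COLUMNS = ["date_range_start", "location_name", "raw_visit_counts"]

def filter_file(filename, keyword, output_filename, write_header, chunksize=100_000):
    """Append rows matching keyword to output_filename, one chunk at a time."""
    kept = 0
    for chunk in pd.read_csv(filename, usecols=COLUMNS, chunksize=chunksize):
        rows = chunk.loc[chunk["location_name"] == keyword,
                         ["date_range_start", "raw_visit_counts"]]
        # Start the file (with a header) only for the first chunk written;
        # every later chunk is appended without a header.
        rows.to_csv(output_filename, index=False, header=write_header,
                    mode="w" if write_header else "a")
        write_header = False
        kept += len(rows)
    return kept
```

The idea would be to call it with `write_header=True` for the first file and `write_header=False` for the rest.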

marsh chasm
#

oh cool

#

okay i'll try that

#

is this what the progress bar looks like?
0it [00:00, ?it/s]

#

thats what pops up when i run the script initially

#

oh i got it

#

nvm

#

it looks like it'll take about 40 minutes to run the whole thing (75seconds per iteration and 32 iterations)

desert oar
#

seems reasonable

marsh chasm
#

thats reasonable

#

i'll watch netflix for 40 mins and check later xD

#

ty

desert oar
#

you're welcome

#

sounds like typical data science

marsh chasm
#

this is my first time working w data

#

like to this scale

#

i was using R because my paradigms class had a small data science unit so it was fresh in my mind

marsh chasm
#

the code worked @desert oar ty

desert oar
#

@marsh chasm nice

#

you can do basically the same with data.table tho

marsh chasm
#

maybe i didn't let it run long enough : /

#

i expected that it was just taking too long

desert oar
#

instead of lapply and rbindlist you do basically the same thing as i wrote in pandas

#

its like a 1:1 port

marsh chasm
#

ah gotcha

#

welp

#

maybe its best i brush up on my python anyway 😩

lapis sequoia
#

when im using categorical variables in a random forest, do i need to one-hot encode them? ive seen conflicting responses

#

when i dont one hot encode, i get a 'cannot convert string to int' error

desert oar
#

@lapis sequoia most models still require numerical inputs even if the numbers correspond to categories

lapis sequoia
#

i thought that as well

#

but dont random forests handle categorical data?

#

i just saw a youtube video with 900/920 likes that one hot encoded so im just gonna do that lol

#

but im still interested in knowing if you can use a random forest without having to (one hot) encode categorical variables

#

(if anyone knows the answer pls tag me bc sometimes i forget to check here after asking a question lol)

desert oar
#

Yes they can handle categorical data, but that depends on the person who wrote the software letting you specify which columns are categorical

#

Just think about how a decision tree is constructed
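
With libraries that only accept numeric input, the usual workaround is to encode the categorical columns first. A sketch using `pd.get_dummies`; the toy columns `color` and `weight` are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "weight": [1.5, 2.0, 1.7, 3.1],
})

# One 0/1 column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

The encoded frame contains only numbers, so it can be passed to an estimator that rejects strings.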

deft harbor
#

Hey, salt rock is back. Nice.

lapis sequoia
#

is there anything that xlwings can do that pandas CAN'T do?

hidden grail
#

Is this chat for machine learning also?
Do you know how the computation time/complexity of a neural network will increase by implementing more classes?

river wing
#

How to convert python into api

hidden grail
#

You can choose a web-framework for Python, e.g. Django or Flask. I've used Flask for this in the past and it was really simple. You can easily define your API endpoints with the @app.route() decorator for different operations like GET or POST. Check out https://flask.palletsprojects.com/en/1.1.x/
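
A minimal Flask sketch of that idea: one plain Python function exposed behind one POST endpoint. The `/add` route, the `add_numbers` function, and the JSON field names are all invented for illustration.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def add_numbers(a, b):
    """Plain Python function that the API exposes."""
    return a + b

@app.route("/add", methods=["POST"])
def add_endpoint():
    # Expect a JSON body such as {"a": 1, "b": 2}.
    payload = request.get_json()
    result = add_numbers(payload["a"], payload["b"])
    return jsonify({"result": result})
```

Saved as a module, this can be served locally with the `flask run` command; there is no input validation, so treat it as a starting point only.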

blazing bridge
#

For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 — that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent).

#

What does variation in y mean

restive obsidian
#

is anyone here already familiar with kaggle notebooks? i want to ask why i can't open the model folder for a model that i saved with model.save()

astral mantle
#

hey

#

Does anyone here know anything about neural networks?

paper niche
#

What does variation in y mean
@blazing bridge "Variation in y" is how much the y values spread around their mean, measured as the total squared deviation from the mean y. An R² of 0.72 means the model's predictions account for 72% of that spread. I think of it as "how much better your model is at predicting the y value compared to a naive one that just guesses the mean value of y"
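
A small NumPy sketch of that comparison against the "always guess the mean" baseline. The rent numbers are made up, and `r_squared` is a hand-rolled helper rather than a library function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left over after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

rent = np.array([1000.0, 1500.0, 2000.0, 2500.0])    # made-up rents
mean_guess = np.full_like(rent, rent.mean())

print(r_squared(rent, rent))         # perfect predictions
print(r_squared(rent, mean_guess))   # always guessing the mean
print(r_squared(rent, rent[::-1]))   # predictions worse than the mean
```

The last case comes out negative, which is how a model can end up with an R² below zero: its squared error is larger than that of the mean-only baseline.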

#

Does anyone here know anything about neural networks?
@astral mantle Probably. I don't deal too much with NN myself, but just go ahead and ask your question. Someone who knows & has the time will answer.

astral mantle
#

Oh well

#

I've been trying to understand backpropagation

#

I get how i'd find the adjustment to the first set of weights in respect to the output layer

#

but how would i adjust the weights of those in hidden layers and further?

#

do I carry on using the chain rule or is there something else

ripe forge
#

Nope, chain rule. That's it
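
To make that concrete, a sketch of one hidden layer in NumPy, where the gradient for the earlier weights is just the output error pushed back one more step. The data, layer sizes, tanh activation, and squared-error loss are all assumptions chosen to keep the example small; `numeric_grad` exists only to check the result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # 5 samples, 3 features (made-up data)
y = rng.normal(size=(5, 1))
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights

def loss(W1, W2):
    h = np.tanh(x @ W1)
    return 0.5 * np.mean((h @ W2 - y) ** 2)

# Forward pass, keeping intermediates for the backward pass.
z1 = x @ W1
h = np.tanh(z1)
out = h @ W2

# Backward pass: each line is one application of the chain rule.
d_out = (out - y) / len(x)      # dL/d(out)
grad_W2 = h.T @ d_out           # dL/dW2
d_h = d_out @ W2.T              # dL/dh, error pushed back through W2
d_z1 = d_h * (1.0 - h ** 2)     # dL/dz1, since tanh'(z) = 1 - tanh(z)^2
grad_W1 = x.T @ d_z1            # dL/dW1

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old
        g[idx] = (up - down) / (2 * eps)
    return g
```

With more hidden layers the pattern repeats: multiply by the next layer's weights, then by the local activation derivative, and so on back to the input.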

hidden grail
#

Hey, I'm trying to create a program that can classify whether an image contains a building or not. I'm not sure where I should begin. I guess I could create a binary classifier CNN with Keras/TensorFlow/PyTorch. Or maybe I could use object-recognition in OpenCV, like Haar-Cascades. Do you have any idea what would be a good approach for this project?

restive obsidian
#

@astral mantle if u need to understand the theory maybe u can try enrolling in andrew ng's deep learning class

astral mantle
#

oh ok

#

thanks

sinful fog
#

how can i scrape reddit images ?

#

(download)

#

with a bot i mean

uncut shadow
#

well, you should use reddit's API

#

if you want to scrape, then I'm quite sure it's against their ToS

#

so if yes, we cannot help you

lapis sequoia
#

how to round arrays

#

with random i get like a bunch of digits

boreal portal
#

Modulo
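
If the arrays in question are NumPy arrays, `np.round` is probably the more direct tool. A sketch assuming the digits come from something like a NumPy random generator:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(5)            # floats in [0, 1) with many digits

rounded = np.round(values, 2)     # new array rounded to 2 decimal places
np.set_printoptions(precision=2)  # alternative: only change how arrays print
print(rounded)
```

`np.round` changes the stored values, while `set_printoptions` only affects display and leaves the data at full precision.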