#data-science-and-ml
You could spend as much time learning NN business as you spend on non NN stuff to solve this issue
And you can do so much stuff with NN stuff, more and more every day!
Hey folks, can someone help me with a few questions in my university exam in Python, mostly related to NLTK? (Nothing advanced, I believe)
Don't break the law
@quiet zinc we can't and won't help with exams
we can provide limited homework help but that's it
!rules 5
5. Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious/inappropriate or be for graded coursework/exams.
yea it's mostly task consisted of 4 questions but yea
Would coursera courses count as breaking this rule. Like if I don’t understand a lecture can I ask for help
so I made a coupling algorithm, and im really excited about it cause it works. What are some fun stuff to test with it
at the moment i have just been feeding it random data such as
to then get data of
@plain jungle try some classic datasets like iris, boston housing, and titanic
and that one wine tasting dataset
id need two dimensional floats / integer data. I'd love to do the iris' (the flower one right?) but last I remember that was 6 dimensional
I should work on making this scalable though, just not sure how to plot anything more than 3d in matplotlib
wow i just did a decision tree with iris
@plain jungle just pick 2 columns from it?
all the bivariate relationships in iris are usable
heck... wow... yo! imma do that
or get really cRaZy and hit it with multidimensional scaling first
thank you
Quite basic question...I'm trying to install the BERT Service Client. I've tried following directions here: https://github.com/hanxiao/bert-as-service and tried following the troubleshooting here: https://github.com/hanxiao/bert-as-service/issues/194 for how to handle when it won't start up in the command line. Appreciate any insight.
For reference, I have downloaded the files and do specify those in the command line calls that I'm making
@sonic finch not sure this will solve your problem, as I'm very new to BERT (like 2 days) but I just completed this pytorch tutorial that used bert..worked flawlessly for me. Message me if you have any questions. I'll try and help the best I can.
and look at this one..
^one of the two is the most-updated one. I'll let you find which one is the most recent (one of them he messed up his cross-entropy). And then two vid-ja tutorials on the above:
🗓️ 1:1 Consultation Session With Me: https://calendly.com/venelin-valkov/consulting
📖 Get SH*T Done with PyTorch Book: https://bit.ly/gtd-with-pytorch
🔔 Subscribe: http://bit.ly/venelin-subscribe
📔 Complete tutorial + notebook: https://www.curiousily.com/posts/sentiment-analys...
I have a Pandas question. I have data I need to do some time series stuff with. It is a set of events (event sourcing model) that fire at a certain time, for a certain uuid, and for a certain event. So timestamp, uuid, category is the data format. I want to get the average events per uuid per hour. So, I am thinking I need to convert the category data to one-hot encoding. Then, pd.groupby("uuid").resample("1T").sum().mean() where the index is the timestamp. Is my reasoning flawed? I haven't done much with groupby.
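One way the per-uuid hourly average asked about above could be computed, as a rough sketch on made-up data: count events per uuid per hour, then average those counts per uuid. The column names follow the message; the sample rows are invented. One-hot encoding isn't needed just to count events. Note that flooring timestamps like this only counts hours in which a uuid had at least one event, so empty hours are not averaged in.

```python
import pandas as pd

# Made-up events in the (timestamp, uuid, category) layout described above.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 10:05", "2020-01-01 10:40", "2020-01-01 11:15",
        "2020-01-01 10:20", "2020-01-01 12:30",
    ]),
    "uuid": ["a", "a", "a", "b", "b"],
    "category": ["click", "view", "click", "view", "click"],
})

# Truncate each timestamp to the start of its hour.
hour = events["timestamp"].dt.floor("h").rename("hour")

# Events per uuid per hour (only hours with at least one event appear).
per_hour = events.groupby(["uuid", hour]).size()

# Average hourly count for each uuid.
avg_per_uuid = per_hour.groupby(level="uuid").mean()
print(avg_per_uuid)
```

If hours with zero events should pull the average down, the counts would need to be reindexed over the full hourly range first.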
@blazing bridge hey. Idk if someone has answered ur question yet,
but the two lines of code you sent do the same thing. Accessing by property and by the [] operator do the same thing for pandas objects.
As for the groupby function. We're grouping each element in the pandas object by their year. So each pandas row with the same year will be grouped together.
Then what you're doing here is getting the totalprod of each grouped year, and getting the average value.
as in pandas object do you mean the dataframe
Ok [] and . both are used to access columns
yes
yeah np
One more question
mhm?
you want to average over multiple columns?
yeah grouped by year
oh for that you need to use .agg
so something like this.
df.groupby('year').agg({'totalprod': 'mean', 'priceperlib': 'mean'})
Basically you're running the mean function on totalprod and priceperlib for each grouped year. The agg method takes a dict keyed by column name, so you can use different types of summing or averaging methods per column, and each value can be a single method or several methods passed in as an array.
so we have to import numpy as well
in order to use it or would it consider it as a list
pandas uses numpy
I saw the agg function but I didnt know it accessed multiple columns as well
thank you so much
yeah np
the pandas docs have a couple more options for the agg, and prob how to make custom functions to use on the groupby's so make sure to check those out.
`df.groupby('year').agg({'mean', 'sum' : ['totalprod', 'priceperlib']})`
yeah I will
would this work
oh i thought it would do the mean and the sum of the columns we specified
for that you'll need to list both mean and sum for each column in the object.
So like
{'totalprod': ['mean', 'sum'],
'priceperlib': ['mean', 'sum']}
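A small runnable sketch of the mean-and-sum-per-year aggregation being discussed, on invented numbers. It assumes the column names from the chat. In pandas, the dict given to `agg` is keyed by column name, with the function (or list of functions) as the value.

```python
import pandas as pd

# Invented sample data using the column names from the conversation.
df = pd.DataFrame({
    "year": [1998, 1998, 1999, 1999],
    "totalprod": [100.0, 300.0, 200.0, 400.0],
    "priceperlib": [0.5, 0.7, 0.6, 0.8],
})

# Keys are columns; values are the aggregations to run on each column.
summary = df.groupby("year").agg({
    "totalprod": ["mean", "sum"],
    "priceperlib": ["mean", "sum"],
})
print(summary)

# The result has two-level column labels: (column, function).
print(summary.loc[1998, ("totalprod", "mean")])
```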
Oh ok thank you so much for your help. I know I've been a pain
nah its all good
can some explain the difference between a validation set and a test set
they seem very similar but i dont understand what the difference is
"Validation set" is usually used in the process of optimizing model parameters
"Test set" is saved until the very end of your work, as a final estimate of out of sample performance
Personally, I think the names should be reversed, but the terminology has been established for a few years now and it's stuck
@desert oar This confuses me a little bit. What happens if you develop your parameters and manage to get good performance on your validation set, then you check with your test set and get awful performance. What do you do? Do you go out and get more data? I assume you don't want to reuse the test set, because if so, what's even the difference between the test and validation set?
@wicked flare If that happens, it means that you did a poor job of constructing your data sets, and before you even touch your model again you need to spend some time carefully assessing the differences between your three data sets
Which yes, might mean that you either need to get new data entirely or, in a time sensitive situation and with extreme caution, reshuffle your three sets together if you just got very very unlucky
Ok, because there are some biases built into the validation set, so you overfitted the parameters to those biases or something?
Yes, basically you are now trying to figure out "did I over fit my model, or did I sample my data sets incorrectly"
(or maybe the biases are in the test set)
which means that your train data is not good, I would say
Sometimes in things like kaggle they deliberately screw you with wonky scoring sets to punish overfitting, but you always need to make sure your data is in order before you try to adjust your training process
I guess any of the three sets could be problematic.
If the training data is bad, the model will be bad. If the validation data is bad, you will tune the parameters incorrectly, and if the test set is bad, you will get a bad result even with a good model.
And yes, sometimes "get more data" is the best solution
the question here is also whether you created train/val/test from one data set or they come from different sources
quite often validation set is obtained by the split from the entire train set
I'm just about to start a new job in machine learning in the UK public sector, previously I've only worked in the private sector. Anyone worked in both and can give me some heads up on what I should expect?
here also cross-validation comes in play -> you do it to make sure that the way you build model does not favor overfitting for instance
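A minimal sketch of the three-way split discussed above, assuming all three sets come from one shuffled dataset. Only row indices are split here, and the 60/20/20 proportions are an arbitrary choice for illustration. The idea is to fit on the training rows, tune on the validation rows, and score on the test rows once at the very end.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the split is repeatable
n_samples = 100
indices = rng.permutation(n_samples)

# 60% train, 20% validation, 20% test (arbitrary proportions).
n_train = n_samples * 60 // 100
n_val = n_samples * 20 // 100

train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```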
Hello Everyone,
I was wondering if any of you have experience in converting a dash plotly app into an .exe?
I have tried suggestions on the forums of dash but those don't seem to work for me and others as well on the forum
Hello world! Anyone familiar with scikit-learn online at the moment and available to answer some questions? I think I want to do something strange (a branching Pipeline with a variable number of features in the intermediate steps), and I don't know how.
@quaint basalt just describe your question in more detail, then somebody who knows the answer can see it and help
Sure thing. Let me see if I can format a ascii graph thingy with the pipeline I have in mind
As the saying goes, "don't ask to ask"
My main input is a (networkx) graph, and I basically want to cluster the nodes of that graph using a soft clustering algorithm. To do this I need to predict 1) the number of clusters, and 2) for every edge an associated weight. The number of clusters is pretty straight forward since 1 graph results in 1 number. The edge weights are the problem, since 1 graph results in N weights with a different N for every graph. My idea is roughly the following:
       _ <featurize graph> - <normalize etc> - <SVR to predict number of clusters> _
      /                                                                             \
graph                                                                                 <cluster> - <score>
      \_ <featurize all edges [1]> - <normalize> - <SVR to predict weights> -------/
[1] This results in many features/items/datapoints
I have to take care of something at work, ping me this evening and I can take a look
In the meantime, look into FeatureUnion
In sklearn
Thanks, will do. I also found sklearn-lego, which also seems to be an essential part in this
yeah this is a bit more advanced than a sklearn pipeline is meant for
i didnt know about sklearn-lego
it looks more like just a collection of various custom transformers people have written
what is the last stage meant to represent?
The score? Might be just a misrepresentation from my part. What I mean is I will compare the found clustering to a ground truth
The main problem I run into is that I have an unknown number of edges per graph
sklearn-lego seems to be needed to enable feeding the output of the intermediate predictors to the clustering
I can maybe turn the bottom half in a custom predictor of sorts which takes a graph and returns a weights matrix. But then I may still have the problem that its shape is not constant.
honestly i would just write your own class for this
or function
or whatever
because the parameters from one of the models depends on the output from another model (# of clusters)
this is just way beyond what sklearn pipelines are able to handle
I was afraid of that. Thanks for confirming it 🙂
Hey all, anyone got any links to some good tutorials/info pages for machine learning and data science using python and tensorflow? I just got hired for a data science internship and have absolutely no datascience or machine learning background and feeling a little over my head
Kind of an odd question, but given a jupyter password, can I get a token for that instance?
i.e.
I am given a jupyter URL like:
```
jupyter.verylarge.cluster.org:123456/login
```
and am given the password abcd1234
How can I get an auto-login link like jupyter.verylarge.cluster.org:123456/?token=396d6da2df034621a8836ab6c0689eae?
That sounds like a terrible idea
Data Science Projects with Python by Stephen Klosterman is the book that I have been using @lapis sequoia . Not an online tutorial, but it is the best book that I have used on the subject
@floral siren awesome thanks ill check it out!
there's a couple good courses on udacity @lapis sequoia
the intro to tensorflow one is pretty good
@devout sail are you referring to my question?
Yeah. I don't have a definite answer, but finding the token based on password would essentially allow you to enter the notebook by guessing the right password wouldn't it?
I mean, you can do the same thing by bruteforcing the login page
My guess was the token would be stored in localstorage or something like that once the notebook was unlocked
@devout sail found this in the webpage source:
```js
function _remove_token_from_url() {
  if (window.location.search.length <= 1) {
    return;
  }
  var search_parameters = window.location.search.slice(1).split('&');
  for (var i = 0; i < search_parameters.length; i++) {
    if (search_parameters[i].split('=')[0] === 'token') {
      // remote token from search parameters
      search_parameters.splice(i, 1);
      var new_search = '';
      if (search_parameters.length) {
        new_search = '?' + search_parameters.join('&');
      }
      var new_url = window.location.origin +
                    window.location.pathname +
                    new_search +
                    window.location.hash;
      window.history.replaceState({}, "", new_url);
      return;
    }
  }
}
_remove_token_from_url();
```
so the token is in the url but removed
how can I get around that?
I obviously can't change the source code of a remote jupyter book
and I can't see anything in the network tab of dev tools
@devout sail what about from the cookie that is set?
I just noticed that jupyter sets a cookie when logging in via a password
I'm not sure if I know or want to help with that, sorry. You should find a legit way to open the notebook
@devout sail bruh I have access to the notebook already
basically I am given web access with a url+password
I want to use it through vs code
which requires the token in order to connect to the jupyter daemon
there's nothing illegal/illegitimate going on here
Then you should get it from whoever gave the url+password
I'll try that, but I don't think they'll be able to change the system so easily
it's an automated system that I request an instance from
and I'm given url+password to connect
sorry if this isn't the right place to ask, but I don't really know whether I should use google colab or just pycharm while I'm learning machine learning
colab is great because it has all the libraries set up for you
only issue is that while you can save code between runs, you can't save files
for most small ML projects that's perfect
Hello there folks, I recently finished an introductory data science course and was also shortlisted for an ML internship. I was given an assignment where I have to predict wine variety using various features. I have never worked on an ML project before and also don't have much knowledge on the theoretical side either. I just finished the assignment today and attained about 97% accuracy. I am sure it is dumb luck but just to be sure can you guys review my notebook (https://nbviewer.jupyter.org/github/shyam1998/Wine-Variety-Prediction/blob/master/main.ipynb) and see if I am doing something wrong to attain such accuracy?
note: They didn't give out the labels for the test data so I train_test_splitted the training data
@modern canyon I am a noob on these stuff pretty much even i've been taking ml classes for semesters now, i think its good.
I assume there aren't multiple entries with very similar parameters in the dataset though
If such a thing happened, then you would have the similar items in both the training and test set, I mean
@solid aurora actually u can save files. U'll just need to use drive for that.
But yes for larger projects, its recommended to use a standard ide or editor along with a version control system like git
reminds me of the days when I used to "version control" by uploading code to google drive
Anyone with some knowledge about Neural Networks and Poker?
Going with the NN solution i see?
The goal is to make a bot that will learn to mirror the skill level of the player, forcing the player to play against himself and learn from his mistakes
In online poker?
This is for a custom poker video game, it can be played offline. Online features are only to serve ads because the game is free
That would take forever to gather the data to make you know
If you have a single player
Doing a single game with this ai it would take hundreds and hundreds of hands to make meaningful progress if you're doing it super efficient like.
It would not start from 0, a pretrained model will be distributed with the game and will correct itself to mirror the player better
Now i ain't a super poker player but you would want to gather the data that affects how a player would play
blackjack would be easier
The problem I'm trying to solve is the input layer of the NN. Should I use one neuron for each card and set it to 1 if you have that card or should I only use a single input neuron and interpolate it between 0 and 1 to show the cards value
I'd say more the better
On that one
If you want it to be good and use the least data possible right
Ok, thanks for the input
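For reference, the "one neuron per card" option from the question could be sketched as below. The two-character card notation (`"As"`, `"Td"`) and the suit-then-rank ordering are assumptions made for this example, not something from the chat. A single scalar input would impose an artificial ordering on cards that mixes rank and suit, which is one common argument for the 52-input version.

```python
RANKS = "23456789TJQKA"
SUITS = "cdhs"  # clubs, diamonds, hearts, spades

def card_index(card):
    """Map a card like 'As' or 'Td' to a position in 0..51."""
    rank, suit = card[0], card[1]
    return SUITS.index(suit) * len(RANKS) + RANKS.index(rank)

def encode_hand(cards):
    """Return a 52-element list with 1.0 for each card held, else 0.0."""
    vector = [0.0] * (len(RANKS) * len(SUITS))
    for card in cards:
        vector[card_index(card)] = 1.0
    return vector

hand = encode_hand(["As", "Kd"])
print(sum(hand))          # number of cards encoded
print(card_index("2c"))   # slot used by the two of clubs
```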
You gonna use some transfer learning?
And hell you can test many different models
You know you gotta experiment
Don't be afraid to go back and reexamine. Nothing about your project is set in stone.
hi, im just starting with pandas. is this a good place to ask for help? or should i go to python help?
@vernal cypress either’s fine
A'ight
I have an online store
each month i pay artists royalties
I'm trying to build a simple tool that will count up all the product sold within a month
i can export CSVs through my webstore admin panel
I sell shirts among other things
One design can be 8 or more colors and 6-7 sizes
the CSV i export keeps all of that in one column, "Lineitem name"
so when i use .groupby function i get this:
What would be the best way of aggregating all rows that contain "Aesthetic shirt" to one and creating a new data frame from it?
is the .groupby even appropriate?
Probably u just need to split out the Lineitem name into 3 different columns (name, colour, size) then group by name
I think something that complicates things is that you have that Heather Prism shirt as well..
are all ur item names always before a hyphen?
oh damn you're right, they are
the csv i get has 72 columns, it's overwhelming lol. is there anyway i can drop all column except the one i need? i've been dropping them each by name
yes heather prism mint is the color
yep one sec lemme shift over to my laptop
but just the fact that you pointed out that they're all separated by a hyphen and the sizes are separated by a slash already helps a bunch lol
@vernal cypress you can just select the column you need instead of dropping the ones you don't.
df.loc[:, ['col_1', 'col_2']]
excellent, thank you
np, glad to help
you can also select a subset of columns to read in if you're using pd.read_csv() (I forgot the argument name, but you can find it on the docs I'm sure)
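The `pd.read_csv()` argument mentioned above is `usecols`. A minimal sketch, using an in-memory string as a stand-in for the exported file; the `Lineitem quantity` column and the sample rows are made up for illustration.

```python
import io
import pandas as pd

# Stand-in for the exported file; a real export would be read from disk.
csv_text = (
    "Name,Email,Lineitem name,Lineitem quantity\n"
    "#1001,a@example.com,Aesthetic shirt - Black / M,1\n"
    "#1002,b@example.com,Aesthetic shirt - White / L,2\n"
)

# usecols keeps only the listed columns while parsing.
df = pd.read_csv(io.StringIO(csv_text), usecols=["Lineitem name", "Lineitem quantity"])
print(df.columns.tolist())
```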
i'll probably ask a few more dumb newbie questions before i figure it out but now i think i have the right approach
Anyone here has experience in Dash?
just ask your question, don't ask to ask. If someone knows the solution, they will answer.
@paper niche
`def get_color(color):
    return color.split('-')[1]
df['Column'] = df['Lineitem name'].apply(lambda x: get_color(x))`
gives me
"IndexError: list index out of range"
apparently it has something to do with pandas not knowing what to do with empty columns
some of the Lineitem names don't contain any hyphens or anything
is that why?
yeah if some names have no hyphens then .split('-') will only return a list of 1 element (the original string), then [1] will give a indexerror
you need the colour? I thought you wanted the name
well my idea was to separate the color and size into separate columns and then use .groupby on the name
btw, no need for apply, there are inbuilt string methods you can access via .str on a dataframe/series
this method also doesn't throw an error if your item name has no ' - ' delimiter, it just returns None in the second column (colour) after expanding
oh sick!
lemme try
it works!
now how do i move that data to a separate column, leaving just the name
assign it back to df, then drop the other columns you don't need
or if you really only want the name, then
df['name'] = df['item'].str.split(' - ', expand=True)[0]
something like that
can i use df.loc or should i just df.drop?
assign it back to df, then drop the other columns you don't need
regarding this, you might also want to specify n=1 in the arguments for .str.split() to ensure you only get 2 columns out
either is fine, use whatever's more convenient
alright sweet, i got it down to a somewhat tidy csv now, i can df.groupby('Product name').count() and i see exactly what i want
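Putting the pieces of this exchange together, the flow might look roughly like the sketch below. The rows are invented, and the `Variant` column name is made up here. It uses `.str.split` with `expand=True` and `n=1`, then counts rows per product.

```python
import pandas as pd

# Invented rows in the "name - colour / size" layout described in the chat.
df = pd.DataFrame({
    "Lineitem name": [
        "Aesthetic shirt - Black / M",
        "Aesthetic shirt - Heather Prism Mint / L",
        "Sticker pack",                      # no hyphen at all
        "Aesthetic shirt - White / S",
    ]
})

# n=1 splits on the first " - " only; expand=True returns one column per piece.
parts = df["Lineitem name"].str.split(" - ", n=1, expand=True)
df["Product name"] = parts[0]
df["Variant"] = parts[1]   # missing when there was no delimiter

counts = df.groupby("Product name").size()
print(counts)
```

Rows without the delimiter keep their full text as the product name instead of raising an IndexError.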
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.
You can find a much more detailed explanation on our website.
i didn't feel like explaining in here my issue if no one in here knows anything about basic numerical methods for solving differential equations
hence why i just asked if anyone knows about euler's method, the algorithm i'm using
anyway, here's my code for eulersmethod which just returns a set of data that matplotlib plots
```py
def eulersmethod(init_x, init_y, dx, range_val, derivative, val_is_range):
    points = [[init_x], [init_y]]
    # start with initial point first apply Euler's Method to the positive right
    x = init_x
    y = init_y
    derivative_at_last_pnt = derivative(x, y)
    # treat range_val like a range
    if val_is_range:
        while y <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while y >= -(range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)
    else:
        # treat range_val like a domain
        while x <= range_val / 2:
            x += dx
            y += dx * derivative_at_last_pnt
            points[0].append(x)
            points[1].append(y)
            derivative_at_last_pnt = derivative(x, y)
        # negative direction
        while x >= -(range_val / 2):
            x -= dx
            y -= dx * derivative_at_last_pnt
            points[0].insert(0, x)
            points[1].insert(0, y)
            derivative_at_last_pnt = derivative(x, y)
    return points
```
i've solved on paper a couple of well known functions for their derivatives in terms of x and y and plugged those functions into the code to test it. e^x is giving me weird functionality
sinx is giving me a straight line
y = x looks fine
have you printed your points out and inspected them?
that would be a lot to look at which is why the computer is doing it
fair. there's a point where your x value goes backwards. That's why you're getting the weird graph you're getting.
for domain -10 to 10 i'd have 2000 points to look at
print them out one per line, round them to a common width, and scroll until you spot the pattern.
i know what the issue is giving me weird functionality with it going backwards hold up....
Is there a reason why you don't start all the way on the left or right (x axis) rather than in the middle?
ok i fixed that issue
i start wherever the initial point is
this is how it's supposed to look minus that weird little blip but i think i could figure that out
so that is fixed let me see what sin x looks like now
ok so as you probably know...
sinx doesn't look like this
so it seems like every y value you calculate is 0, then
ok so i know what the issue is. i'm giving it an initial value of (0, 0) which should give me a sin function but i'm multiplying dx * derivative_of_sinx which is -y^2/2 so it always returns 0
Correct me if I'm misunderstanding this but it looks like you first loop with y or x until it goes above the upper limit, then you try to iterate until it goes below the lower limit? Wouldn't you want only one loop that requires it's between both limits?
ok i have to separate it into two because i'm starting at some point. i want to use the derivative function to approximate the curve for some step size dx up to and below some hardcoded bounds
so i first go up then down. i have no way of knowing what the value at the far left will be until i get there and same for the far right. that's what the code does. it graphically solves the differential equation
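A sketch of one way to do the two sweeps described above, assuming a fixed number of steps in each direction rather than the bounds checks in the posted code. `euler_both_ways` and `half_width` are names made up for this example. The key detail is resetting x and y to the initial point before stepping left, so the x values never double back.

```python
import math

def euler_both_ways(x0, y0, dx, half_width, derivative):
    """Approximate y(x) on [x0 - half_width, x0 + half_width] with Euler steps."""
    steps = int(round(half_width / dx))

    # Forward sweep, starting from the initial point.
    xs, ys = [x0], [y0]
    x, y = x0, y0
    for _ in range(steps):
        y += dx * derivative(x, y)
        x += dx
        xs.append(x)
        ys.append(y)

    # Backward sweep: restart from the initial point, not from the far right.
    x, y = x0, y0
    for _ in range(steps):
        y -= dx * derivative(x, y)
        x -= dx
        xs.insert(0, x)
        ys.insert(0, y)

    return xs, ys

# dy/dx = y with y(0) = 1 has the exact solution e^x.
xs, ys = euler_both_ways(0.0, 1.0, 0.001, 1.0, lambda x, y: y)
print(ys[-1], math.e)        # approximation vs exact value at x = 1
print(ys[0], math.exp(-1))   # approximation vs exact value at x = -1
```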
Traceback (most recent call last):
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 64, in <module>
    points = eulersmethod(x0, y0, 0.01, domain_size, sineofx, val_is_range=False)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 43, in eulersmethod
    derivative_at_last_pnt = derivative(x, y)
  File "C:/Users/Nicholas/PycharmProjects/eulersmethod2/main.py", line 53, in sineofx
    return -(y**2)/2
OverflowError: (34, 'Result too large')
i tried to start at initial value (pi/2, 1) didn't like that
this is just so broken if it isn't e^x or y=x... maybe euler's method just doesn't work on this kind of de?
Euler's method is fairly general. It may be worth taking a second look at your math outside of the function you shared
well all the functions i'm using are fairly simple
And also what y value caused that error
dy/dx = y is e^x
uh idk what value specifically but when the range is 2pi and i try to use sinofx() it gives that error
i'll post sinofx:
```py
def sineofx(x, y):
    return -(y**2)/2
```
Try running sinofx through a bunch of inputs and print each input before the ** line
Then you can see which causes an error
i got this from the DE y'' + y = 0 which i know has the solution Acosx + Bsinx just from how many times this equation comes up in physics
so i solved it for dy/dx by integrating both sides to get y' = - (y^2/2)
alright i'll add that in there in a second. i'm checking to see if a polynomial will work
ok it's accurate for polynomials too
so far just the sinofx derivative function isn't working
how to set the width of the float when i print it?
There's a couple options, do you know about .format, str % value, or f-strings?
oh nevermind i'll look into it later
-2.1399999999999983 for x returns......
-1.6618870538680888e+229
which doesn't make sense. it should oscillate between -1 and 1.
wait
i know what the issue is.... i rushed through the math and made a mistake. give me one sec
what i did just didn't make sense mathematically
y'' + y = 0
you can't just integrate y in terms of dx like that it doesn't make any sense
integ(y)dx =/= y^2/2 + c.
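For anyone who does want the sine curve out of Euler's method: the usual route for a second-order equation like y'' + y = 0 is to introduce v = y' and step the pair y' = v, v' = -y together. The sketch below is an illustration of that idea, not the code from the chat; `euler_oscillator` is a made-up name, and the step count is chosen to land near x = pi/2.

```python
import math

def euler_oscillator(y0, v0, dx, n_steps):
    """Euler steps for y' = v, v' = -y (equivalent to y'' + y = 0)."""
    y, v = y0, v0
    ys = [y]
    for _ in range(n_steps):
        # Update both from the old values so this is a plain Euler step.
        y, v = y + dx * v, v - dx * y
        ys.append(y)
    return ys

# y(0) = 0, y'(0) = 1 picks out sin(x) from A cos x + B sin x.
dx = 0.001
n_steps = 1571           # roughly pi/2 divided by dx
ys = euler_oscillator(0.0, 1.0, dx, n_steps)
print(ys[-1], math.sin(n_steps * dx))
```

Plain Euler slowly inflates the amplitude of an oscillator, so over many periods the curve drifts away from sin(x) unless dx is small.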
To print a float (x here) with a specific number of digits (3 here) after the decimal point:
- String.format function:
print("{:.03f}".format(x))
- f-strings (my favorite, .format but convenient):
print(f"{x:.03f}")
- String % operator:
print("%.03f" % x)
thanks
well i think i know what the issue is now. i'm going to never touch this again and move on to the main point of my code haha
Good luck
i don't need a sinx anyway. i just wanted to test euler's method so i could move on to a different method that was a bit more complicated
thanks
so
I forgot to set seed
and my computations turned out really good.. I don't know what my seed is
how do I reproduce omg x.x
You don't? Not sure what you're doing, but you can run it several times and pick the best result
I can't run it several times. It's running on tpu
I'm trying to figure out a way to find the current seed even though I didn't set it
im hoping np.random.get_state() would work
I'm using pytorch..
If you're using jupyter or something and you didn't stop the kernel at any point you might be able to do it
https://stackoverflow.com/questions/32172054/how-can-i-retrieve-the-current-seed-of-numpys-random-number-generator for a more detailed explanation of how it works
Though I'd say your aim should be saving the results / model, than trying to recreate the computation
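For future runs, one habit that avoids this situation is to capture the generator state before the random part starts, so it can be restored later. The sketch below covers only NumPy's legacy global generator. PyTorch keeps its own separate state (`torch.initial_seed()` and `torch.get_rng_state()` exist for that), which is not shown here, and whether either helps on a TPU setup is not something this example addresses.

```python
import numpy as np

# Capture the global NumPy RNG state before the random part of a run.
state = np.random.get_state()
first = np.random.normal(size=3)

# Restoring the saved state replays exactly the same draws.
np.random.set_state(state)
second = np.random.normal(size=3)

print(np.array_equal(first, second))
```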
anyone know why im getting a syntax error here?
@hearty jewel the line above, you open a [
@hearty jewel
What's the new error?
Right. THAT line is raising the error.
Because it has been searching for the closing ] but it just got a new line.
Well, we fixed two errors.
if there is a syntax error pointing to something like a variable at the start of a line then usually look back a line or 2
now theres a new error lol
on the 'date'.year component
im trying to group by user by year
whats wrong with the code there?
I think groupby needs a list of columns so it would have to be by_author.groupby(['user', by_author.....])
Did you try counts = by_author.groupby(['user', by_author['date'].dt.year]).agg(?
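A self-contained sketch of grouping by user and by year, along the lines of the suggestion above. The rows are invented; only the `user` and `date` column names and the `by_author` variable come from the chat. The `.dt` accessor needs the column to be a real datetime dtype.

```python
import pandas as pd

# Invented messages; only the column names come from the chat.
by_author = pd.DataFrame({
    "user": ["ann", "ann", "bob", "ann"],
    "date": pd.to_datetime(["2019-03-01", "2019-07-15", "2019-05-02", "2020-01-09"]),
    "text": ["hi", "hello", "hey", "yo"],
})

# Group by the user column and by the year extracted from the date column.
year = by_author["date"].dt.year.rename("year")
counts = by_author.groupby(["user", year]).size()
print(counts)
```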
My jupyter notebook breaks after running a particular piece of code. 'Kernel connection to server cannot be established'. Although it runs just fine before running that code
any ideas?
import numpy as np
import pymc3 as pm  # assuming pm is PyMC3 (the sd= keyword suggests it)

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real
data = np.stack((x, y)).T
with pm.Model() as pearson_model:
    μ = pm.Normal('μ', mu=data.mean(0), sd=10, shape=2)
    σ_1 = pm.HalfNormal('σ_1', 10)
    σ_2 = pm.HalfNormal('σ_2', 10)
    ρ = pm.Uniform('ρ', -1., 1.)
    r2 = pm.Deterministic('r2', ρ**2)
    # 2x2 covariance matrix built from the two scales and the correlation
    cov = pm.math.stack(([σ_1**2, σ_1*σ_2*ρ],
                         [σ_1*σ_2*ρ, σ_2**2]))
    y_pred = pm.MvNormal('y_pred', mu=μ, cov=cov, observed=data)
    trace_p = pm.sample(1000)
its this bit of code
shuts down my jupyter kernel
Hello everyone,
I have a question regarding Mathematics for ML/DS etc.
I am currently learning Linear Algebra and I understand the topic fairly well.
Should I start solving exercises from these topics manually (pen/paper style) to understand the topic more or is there anything else I should try?
I have Mathematics background during my undergrad but its been 2-3 years since I last solved any problems.
Also same goes with Probability and Statistics?
Yeah it can be nice to get your pen and paper and test yourself to make sure you learnt those topics fully :D
Hi I am reviving some client's project, it's based on Tensorflow 1.1 (I updated the package to 1.5).
There is a line statement like this:
# The original code is tf.contrib.lite, I migrated it as new style
interpreter = tf.lite.Interpreter(model_path='data/model.tflite')
I have a limited experience with it, I just need to move the ML parts as a separate python package for portability and write unit tests, so is this model "model.tflite" something common I can download from somewhere?
Thanks
@eager heath
Hey, any good resources for problems?
I don't have any, no sorry :/
It's alright. Thank you.
project euler!
hello, is there a faster way to deal with 20 categorical columns with 50+ levels than to go through each one individually, look at the value counts, and assign it as a binary as whether or not it is the max value count?
Depends on how exactly you coded it up
Cause logic wise, you have to do all that
So then the only question is, did you vectorize the code properly or did you use loops and so on
alright thank you
how fast is "fast"
looping over 20 columns shouldn't take that long
def is_most_common_category(s):
    counts = s.value_counts()
    return s == counts.idxmax()

data_binarized = data[list_of_categorical_column_names].apply(is_most_common_category)
How do I go about centralising values in my pandas dataframe when I print it?
E.g. this column (I want to do for all columns)
Also centralise headers like here
@silk axle you can call .str.center before printing
the column names i think you can control with a display option
hmm.. colheader_justify is only for left or right
no centered
how do you know how many categorical levels are too much? for instance, i have a 6000 row dataset with a categorical variable having 50 levels. is that too much?
@silk axle on each column
print(data.apply(lambda x: x.str.center()))
@lapis sequoia too much for what
Is there not a better way? @desert oar
@desert oar for an ML model (i.e. random forest)
i dont see one
@lapis sequoia potentially yes, random forests tend to over-weight features with lots of categorical values
AttributeError: Can only use .str accessor with string values! since not all the values are string @desert oar
then convert to string first i guess. make a separate function to if/else based on the dtype
any alternatives that i could use where high-level categorical features wont have an impact?
would i use like a regression?
something with regularization
@desert oar this doesn't really mean anything to me. I know literally nothing about pandas other than how to read in an excel file.
@silk axle bro u can google this
I can't google it if I don't know what they mean
@silk axle
def center_text(s):
    return s.map(str, na_action='ignore').str.center()

print(data.apply(center_text))
something like that
if you want to get more specific use if/else and check the .dtype of s
e.g. if it's a datetime series you can strftime it
Would s be each like record?
i think you need to brush up on pandas basics 😉
.apply by default applies a function to each column in the data
As I said, I know nothing about pandas
.map applies a function to each element of a series elementwise
a DataFrame is (logically) a collection of Serieses
TypeError: center() missing 1 required positional argument: 'width'
any .str method functions like the regular str methods
so .str.center is the same as str.center in regular python
didn't even know str.center was a thing
!d g str.center
str.center(width[, fillchar])
Return centered in a string of length *width*. Padding is done using the specified *fillchar* (default is an ASCII space). The original string is returned if *width* is less than or equal to `len(s)`.
I guess the width would be the length of the header?
yeah, or the length of the longest string maybe
true
def center_text(x):
    s = x.map(str, na_action='ignore')
    l = max(s.name, s.map(len, na_action='ignore').max())
    return s.str.center(l)
maybe
l = max(s.name, s.map(len, na_action='ignore').max())
TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'
Tbh I should probs stop copy-pasting and actually look at the docs to see what this stuff does lol
yes lol
also look at what i wrote
the point is that in general there isnt a good way to do this
without manually centering everything
how you do that is up to you
s = x.map(str, na_action='ignore') so this converts x to a str, right?
And if x is a missing value (e.g. NaN) then ignore it
@desert oar so can i use a dataframe of 6000 rows that has categorical variables with 300 levels, 5 levels, and 2000 levels and put this through lasso regression?
yes @silk axle
@lapis sequoia regression has its own problems. each level of each variable is a separate parameter
yeah thats what i thought too. im not sure how to proceed because there are so many levels
so yes lasso or ridge can help but you are depending on regularization to make it make sense
what do you suggest
is it only these categorical features?
no there are more
hm
ill show you this
alright ill look that up
Date also
yeah the problem is, some of them just have so many levels
im trying to group it into more levels so that i can have 80% of the value counts within 10 levels (for example) and have the remaining 20% listed as 'Other'
yeah so one of them is like 'Keywords' which has 2360 levels
which represents like 'Equity' or 'Investing' or something
Oh i have a suggestion for that kind of crap
another is the group with 1395 levels and a list of different business units
listening
1400 business units wew
I'm not getting anywhere with this @desert oar, it's really confusing me :/
@silk axle you can also just loop over your data and print each row manually
For business units try doing some meaningful aggregation
^
Like group together the units
For the words, perhaps use embedding.
this is so much more work than i anticipated ha
It ends up making more variables, but all continuous
definitely feature hashing for the keywords, or even an embedding like word2vec if a lot of your records have multiple keywords
yes, welcome to machine learning
And they contain meaning in vector space
65% fucking with data, 15% training models, 20% sitting in meetings
More like 50% 5% and 45% fml
lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%
thats not a bad option
Only if grouping them makes sense
im trying to create a function right now to see how many variables are captured by 80% of the value counts
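A minimal sketch of that function, assuming a pandas Series of category labels. The names `levels_for_coverage` and `lump_rare_levels` and the toy column are made up for illustration:

```python
import pandas as pd

def levels_for_coverage(s, coverage=0.8):
    """Return the most frequent levels that together cover `coverage` of the rows."""
    shares = s.value_counts(normalize=True)   # sorted from most to least frequent
    cumulative = shares.cumsum()
    # Keep every level up to and including the one that crosses the threshold.
    n_keep = int((cumulative < coverage).sum()) + 1
    return list(shares.index[:n_keep])

def lump_rare_levels(s, coverage=0.8, other="Other"):
    """Replace every level outside the top-coverage set with `other`.

    Missing values are not in the kept set, so they end up in `other` too.
    """
    keep = levels_for_coverage(s, coverage)
    return s.where(s.isin(keep), other)

# Toy column: 'a' is 45% of rows, 'b' 25%, 'c' 15%, the rest 5% each.
col = pd.Series(["a"] * 9 + ["b"] * 5 + ["c"] * 3 + ["d", "e", "f"])
kept = levels_for_coverage(col)
lumped = lump_rare_levels(col)
print(kept)
print(lumped.value_counts())
```

`len(levels_for_coverage(col))` answers the "how many levels cover 80%" question directly, and the lumped column is what you would then encode.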
What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model?
Consider a variable like college s...
Like it doesn't make sense if you grouped ceos and trainees together for example
yeah the problem is you lose a lot of information
Cause that might logically be bad
There is one other option that I personally haven't checked out
Called target encoding. Supposed to be some magical nonsense
yep target encoding works
ive done it
theres no implementation currently for multi-class though
only regression and binary classification
But it's another cheeky way to get a representation that's much easier for models to handle
the multi-class version is more complicated too, i don't remember how it works off the top of my head
ah so it wont work for my case?
youd have to implement it yourself
alright. what about this scenario
so my dataframe together has 84 columns. what if i only looked at a dataframe with less than 15 different levels? that'd result in 37 columns
could i create a prediction off that
i know im losing a lot of information
https://dl.acm.org/doi/10.1145/507533.507538
https://arxiv.org/abs/1611.09477
references for target encoding ^
We look at common problems found in data that is used for predictive modeling
tasks, and describe how to address them with the vtreat R package. vtreat
prepares real-world data for predictive...
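For the binary / regression case, a rough sketch of the idea: replace each level with the mean of the target for that level, shrunk toward the global mean so rare levels don't get extreme values. This is a common simplification, not the exact formula from the references above, and the column names and `smoothing` value are invented for the example:

```python
import pandas as pd

def target_encode(train, column, target, smoothing=10.0):
    """Map each level of `column` to a smoothed mean of `target`."""
    prior = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Shrink each level's mean toward the global mean; rare levels shrink the most.
    weight = stats["count"] / (stats["count"] + smoothing)
    encoding = weight * stats["mean"] + (1 - weight) * prior
    return encoding, prior

train = pd.DataFrame({
    "unit": ["x", "x", "x", "x", "y", "z"],
    "converted": [1, 1, 1, 0, 0, 1],
})
encoding, prior = target_encode(train, "unit", "converted", smoothing=2.0)
train["unit_te"] = train["unit"].map(encoding)

# Levels that were not seen when fitting fall back to the global mean.
new = pd.Series(["x", "unseen"]).map(encoding).fillna(prior)
print(encoding)
```

Fit the encoding on training rows only (ideally out-of-fold), otherwise the target leaks into the feature.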
thank you
i think you should at least consider which features make sense for your business problem too
you can also try computing the mutual information between each feature and the target
and drop features with MI below a threshold
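A small sketch of that filter for categorical features and a categorical target, computed straight from the contingency table. The toy columns and the `threshold` value are made up; scikit-learn also has `mutual_info_classif` if you prefer a library version:

```python
import numpy as np
import pandas as pd

def mutual_information(x, y):
    """Mutual information (in nats) between two categorical Series."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()   # p(x, y)
    px = joint.sum(axis=1, keepdims=True)                   # p(x)
    py = joint.sum(axis=0, keepdims=True)                   # p(y)
    nonzero = joint > 0                                     # skip 0 * log(0) terms
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

df = pd.DataFrame({
    "region":  ["n", "n", "s", "s", "n", "n", "s", "s"],
    "channel": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "target":  [1, 1, 0, 0, 1, 1, 0, 0],
})
scores = {c: mutual_information(df[c], df["target"]) for c in ["region", "channel"]}
threshold = 0.01   # hypothetical cutoff, tune it for your data
keep = [c for c, mi in scores.items() if mi >= threshold]
print(scores, keep)
```

Here `region` determines the target completely and `channel` is independent of it, so only `region` survives the cutoff.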
so basically calculating variable importance?
more like bivariate association
alright
i'll check this out
this is my first month working as a professional data scientist lol and i never implemented any of these methods in school
its just because correlation doesnt work on categorical features
otherwise youd just use correlation
yeah welcome
at least you have the sense to ask questions
i struggled for years just trying to DIY everything
yeah. if this data were numerical it'd make my life so much easier
yep. again welcome to data science
lol i cant imagine how difficult it'd be if you didnt have someone to ask like i do right now
thank you though
hint: it sucked and i wasnt very good at it
ha
@desert oar I've kinda made progress, but still not quite where I want
def center_text(x):
    if isinstance(x.dtype, str):
        l = max(x.map(len)) # get the highest length of string
        print(l)
        return x.str.center(l) # center based on longest length
    else:
        s = x.map(str)
        return s.str.center(len(x.name)) # center based on length of title

print(df.apply(center_text))
isinstance won't work with dtype
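dtypes are numpy dtype objects, not python types (each one has a single-character .kind code)
"string" columns are just "O" dtype which means "arbitrary python objects"
so sadly there's no dedicated string datatype in pandas by default (an opt-in "string" dtype only arrived in pandas 1.0)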
So I can't do something different for strings and integers?
you can, but it really depends on the dtype
you can have "integers" of 3.0 and 4.0 in a "float" dtype column
or you can have integers in an "O" dtype column even though it should probably be "int"
```
Date               datetime64[ns]
Horse              object
Track              object
Time               object
Non Runner         bool
Odds               object
Total Stake        int64
Win or Each Way    object
Actual Winnings    float64
```
those are the dtypes
How would I change that?
11/06/2020 15:40:00 is an example
That's the format
dd/mm/yyyy hh:mm:ss
I merged Time and Date and forgot to remove Time
lol
```
Date               datetime64[ns]
Horse              object
Track              object
Non Runner         bool
Odds               object
Total Stake        int64
Win or Each Way    object
Actual Winnings    float64
```
so these are the dtypes
ah wait
you can assign formatters to each column this way
use that
see formatters and float_format
I saw that earlier but no clue how to use it lmao
formatters : list, tuple or dict of one-param. functions, optional
Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
formatters = center_text?
that, or
formatters={
    'Horse': lambda s: s.center(15),
    'Non Runner': lambda s: 'Non Runner' if s else 'Runner'
}
etc.
I'm almost there @desert oar, got another quick question
sure
How can I make it so that it sends 4.00 instead of just 4 (type is int64)
If I can £4.00
'Total Stake': lambda s: f"{str(s).center(11):.2f}",
ValueError: Unknown format code 'f' for object of type 'str'
That's what I tried
Wait I know why
Yep fixed
you can use the float_format parameter too
oh if it's int nvm
'Total Stake': lambda x: format(float(x), "0.2f").center(11)
@silk axle ^
also i was just using s to mean either "series" or "string"
Rn I've got lambda s: f"£{s:.2f}".center(11)
!e ```python
x = 3
print(f"{x:0.2f}")
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
3.00
'Actual Winnings': lambda s: ("£" + f"{s:.2f}".zfill(5)).center(15) this works but I'm thinking there's a tidier way -- converts 0 to 0.00 then to 00.00 then to £00.00 and then centers
!e ```python
x = 1.5
print(f"£{x:2.2f}")
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
£1.50
@desert oar :white_check_mark: Your eval job has completed with return code 0.
£1.50
@desert oar :white_check_mark: Your eval job has completed with return code 0.
043.00
let me re-read i might have missed something in the syntax
Wait no
lambda s: f"£{s:05.2f}" this works
:02.2f will do for total len 2
You need total len as 5
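To spell out the width point: in a format spec like `05.2f` the number before the dot is the total width of the result (decimal point included), not the number of digits before the point. A quick sketch, where `money` is just a helper name made up for this example:

```python
# Format spec layout: [0 flag][total width].[precision][type]
plain = f"{4:.2f}"         # no width given, so no padding
too_narrow = f"{1.5:2.2f}" # width 2 is already exceeded, so nothing is padded
padded = f"{4:05.2f}"      # zero-padded up to 5 characters in total
wider = f"{43:06.2f}"      # zero-padded up to 6 characters in total

def money(value, width=11):
    """Format as pounds with two decimals, then centre in `width` characters."""
    return f"£{value:.2f}".center(width)

print(plain, too_narrow, padded, wider, repr(money(4)))
```

So `:02.2f` can never pad anything, because any number with two decimals is already at least four characters long.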
I just need to align the headers with the columns @desert oar
does .to_string have any kwarg to mess w/ the headers?
right yea, good point
Assuming I want this but not sure which setting
center works
Spacing between the columns isn't really consistent which is annoying
Ig set col_space?
Wait no that wouldn't be it
Don't think so
Eh kinda does
Not really sure for this one @desert oar
at this point your guess is as good as mine
ive never had to do this
or ive just manually centered by iterating over rows
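Putting the pieces from this thread together, a sketch of one way to get centred cells and centred headers: pad every cell with a per-column formatter, and pass `justify="center"` to `to_string` for the headers. The rows and the widths are invented for the example, and `centered` is a made-up helper:

```python
import pandas as pd

df = pd.DataFrame({
    "Horse": ["Red Rum", "Arkle"],            # made-up sample rows
    "Non Runner": [False, True],
    "Total Stake": [4, 10],
    "Actual Winnings": [0.0, 43.0],
})

def centered(fmt, width):
    """Build a per-cell formatter: format the value, then pad it to `width`."""
    return lambda value: fmt(value).center(width)

formatters = {
    "Horse": centered(str, 15),
    "Non Runner": centered(lambda v: "Non Runner" if v else "Runner", 12),
    "Total Stake": centered(lambda v: f"£{v:.2f}", 13),
    "Actual Winnings": centered(lambda v: f"£{v:.2f}", 17),
}

# justify controls the column headers; the formatters control the cells.
table = df.to_string(formatters=formatters, justify="center", index=False)
print(table)
```

Choosing each width to be at least as long as the column header keeps the header and the cells the same width, which is what makes the centring line up.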
I tried
import nltk
nltk.download('punkt')
It takes a very long time and then ultimately fails
show the error you get
lets say i have a column with 100 variables, and 50 of those variables capture 80% of the value counts. could i just create 51 variables and have 1 of them listed as 'other' for 20%
@desert oar so back to this question, if im running low on time, would you suggest i do this ?
i am using Python and xlwings, but the intellisense isn't working for xlwings. e.g. when i type in ws1.cells., it doesn't show the methods that i can use e.g. ws1.cells.clear_contents() any idea why?
@lapis sequoia it could work. maybe a better option is to look at the cumulative % of the data represented by each category, and chop it off at some threshold
oh wait
thats what you just said
yes
perfectly valid
alright thank you, im gonna try running a baseline with variables that have less than 10 levels, and then running another model of all levels but using that method i just described
feature selection and engineering is always a slow process
lol yeah. this is my first professional DS/ML project, and ive never had to deal with data that was this messy and unorganized before
at least you have all the data in one place
ive spent all day writing various scripts just to get my data
oh thats even worse lol
under a completely arbitrary deadline
which is the most annoying part
everyone wants to acknowledge that software dev takes time, data science takes a fuckload of time
then i keep getting asked "how far do you think we can get in 3 weeks"
"well, week 1 will be spent writing and debugging data access and cleaning scripts, week 2 will be spent just understanding the data and prototyping maybe 1 or 2 algorithms, week 3 will be spent slapping together literally anything that seems to work because this is not enough time to get anything done"
Is there an optimal way to handle missing values?
no. depends on your application
sometimes it's as simple as "fill in with the mean" and sometimes it's as complicated as "build an imputation model"
depends entirely on the data that's missing, the reasons for the missingness, the kind of model you're using, etc.
@thin terrace lol im doing this right now
can you give some example(s)?
im using knn to impute categorical variables
damn that sucks. are u a DS or ML engineer
data scientist
you can replace with mean/median/mode/knn/regress
bunch of algorithms online
other options too
hell you can train an autoencoder on the non-missing records
multiple imputation is another option with a lot of theoretical appeal but can be difficult to implement in practice
can do stuff w/ gaussian mixtures, sky is kind of the limit
you can even fit bayesian models where you set a prior for your missing values and train the model on that
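The simplest of those options as a sketch: median for numeric columns, mode for everything else, plus a flag column so the model can still see which values were originally missing. The toy frame and the `_was_missing` suffix are assumptions for the example, not a recommendation for any particular dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 29.0],
    "segment": ["a", "b", None, "b"],
})

imputed = df.copy()
for name in df.columns:
    col = df[name]
    if not col.isna().any():
        continue
    # Keep a flag so "was missing" stays available as information.
    imputed[f"{name}_was_missing"] = col.isna()
    if pd.api.types.is_numeric_dtype(col):
        imputed[name] = col.fillna(col.median())
    else:
        imputed[name] = col.fillna(col.mode().iloc[0])

print(imputed)
```

Whether median/mode is sensible still depends on what the column means and why it is missing, as discussed above.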
Well, not really. So I'm applying for a data scientist job (which i think I've landed). They gave me a classification task with a dataset full of missing values and I have never handled such things before (just graduated from uni). I didn't really know how to approach it. I removed some columns with very large amounts of missing values and then filled the remaining with 0.
But It yielded pretty shitty performance
dont replace with 0. you can google algorithms to replace with mean/median/mode
unless it makes sense to replace with 0
for instance, my dataset rn has a few columns missing >60% of values -- i dropped those completely. now im running a knn on missing values
When does it make sense to replace with 0? I tried with -999 first but it worked even worse
they didnt teach you anything about missing values in school?
^ im surprised to hear that as well
no wonder
so im an outlier hehe
stop thinking like an engineer, start considering what the data actually is
is 0 a sensible default value for the data?
then you can impute with 0
is 0 a sensible default value for a stock price?
of course not
well maybe for some stocks
but not in general
@thin terrace google how to impute missing values
so basically its ok for features where 0 would represent "nothing" ?
no, youre still thinking about this wrong
in fact an engineer needs to think like this too
what does the data represent? how does my model actually work and use the numbers i give it?
im in for a challenge at this job position, thats for sure lol
imo a good software engineer needs the same skillset
making sure that what you're doing actually makes sense for solving the real-world problem
for example i'm writing a client library for a huge web API right now
there are like 50 options
but my particular use case only needs like 10, 5 of which can be hard-coded
so i'm writing my code with that specifically in mind
if you're not thinking about the task from a real-world perspective you're just not going to develop good solutions. it's true for software engineering, even more true for "physical" engineering like civil/mechanical/electrical, and equally true for data science
yeah i usually try to think about the bigger picture
data science is very new to me, a lot of new stuff
I was planning on becoming a dev but here I am
data science is more fun
I hope I'll think so too
anyway with that in mind @thin terrace it definitely helps to consider: what kind of data is this, and what would happen to my model results if i did X
and that's why there will never be a catch-all "what do i do with missing data" answer
naaah webdev is boring xD
are you working as a data scientist, salt?
btw I just realized we have two salt helpers 😂
I thought we had one who changed name sometimes @desert oar
Yeah, I get that. I just don't know the answers to these questions. How do I learn?
- learn by doing: practice on different data sets, try to see the bigger picture
- since data science is so hot now there are plenty of resources to learn things: if you will look on internet for things like "imputation", "missing values in data" etc you will find quite some amount of guides and articles
@thin terrace partly a matter of looking to see what other people have done in other specific cases
e.g. there are a million approaches to missing data on the kaggle titanic dataset
and we just suggested a few, albeit complicated ones that i think are overkill for a job interview task
Towards data science part of medium.com often has good articles for example
impute with mean/median/mode, fit regression, time series if appropriate, KNN, gaussian mixture
some models don't even need missing data imputation, e.g. random forest
if "missing" is a valid category level you can just leave it missing
i was here first technically 😉
@desert oar oh really? didn't know 🙂 guess I just happen not to see you somehow lol. I used to see salt-die, then there was #ask-meta-for-math channel and then I started to see you but not salt-die
hence the wrong conclusion
i was gone for a while
i'm not likely to be here consistently. i've just been on a lot recently
I see. do you work a data science-related job?
yes
I tried leaving them as missing, must have fucked something up
At the interview they said I could just have replaced them with -1 and I would've got pretty good performance
Is that such a huge difference from replacing with 0?
hm. are those missing data continuous or categorical?
mostly continuous
if they are continuous numbers you can also check for the pearson correlation between those features and target variable
if they are not really correlated you might not need them at all
well i basically ended up dropping most features and I guess that's where I went wrong
if i had values like 5-10, 10-20, 20-30, >1000, i would have to encode these, correct?
before putting it into a ML/predictive model^
sounds like a good idea
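Since those bins have a natural order, one option is an ordered categorical and its integer codes. A sketch, assuming the bin labels are exactly the strings from the question:

```python
import pandas as pd

bins = ["5-10", "10-20", "20-30", ">1000"]   # listed from smallest to largest
s = pd.Series(["10-20", ">1000", "5-10", "20-30", "10-20"])

ordered = pd.Categorical(s, categories=bins, ordered=True)
# Codes follow the order of `bins`; labels not in `bins` (or missing) become -1.
codes = pd.Series(ordered.codes, index=s.index)
print(codes.tolist())
```

Integer codes treat the gap between neighbouring bins as equal, which is fine for tree models; for linear models one-hot encoding (`pd.get_dummies`) is the safer default.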
missing ()
where ?
np.array( [1,2,3,4] )
oh THANKS ALOT
yw
anyone?
Does anyone know what passing another array as array index give?
idx = np.repeat(range(7), 20)
idx = np.append(idx, 7)
np.random.seed(314)
alpha_real = np.random.normal(2.5,0.5,size=8)
y = alpha_real[idx]
so, what would y give?
I looked at its kdeplot, i cant make anything out of it
you can give an array of indices to an array to retrieve a new array with the elements at the given indices
ohh thank you
@solid mantle time to read the numpy indexing docs 😉
x = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30])
print(x[0])
print(x[[0, 3]])
print(x[1:4])
print(x[x < 20])
the last one is indexing with an array/list of bools
@desert oar thank you
for a good reason I believe 🙂
it manipulates the internal data structure of ndarray and, if done incorrectly, the array elements can point to invalid memory and can corrupt results or crash your program
I don't get the usage of it
I saw few examples
IPython Cookbook,
in short, it can be very performant at times
how can i impute missing categorical data using knn?
one basic method is to compute distances between rows, comparing only the non-missing fields in those rows. then for each field you fill each missing value from the nearest rows that have non-missing values
there are a few packages that implement that logic
https://github.com/iskandr/fancyimpute which wraps https://github.com/iskandr/knnimpute
or this one ad-hoc implementation i found in a blog post https://gist.github.com/YohanObadia/b310793cd22a4427faaadd9c381a5850 which has some more intelligent handling of different data types
its actually a little surprising that nobody has come out with a "professional-grade" library for KNN imputation, but i guess people often just end up writing their own code
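A rough sketch of that "basic method" for purely categorical columns: distance is the share of mismatching fields among the fields both rows have, and a missing cell takes the most common value among its k nearest rows that have that field. The function name and toy data are made up, and this compares every incomplete row with every other row, so it is only meant for small data:

```python
import numpy as np
import pandas as pd

def knn_impute_categorical(df, k=3):
    """Fill missing cells with the most common value among the k nearest rows."""
    values = df.to_numpy(dtype=object)
    missing = df.isna().to_numpy()
    filled = df.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Only compare fields that are observed in both rows.
        shared = ~missing[i] & ~missing
        mismatches = ((values != values[i]) & shared).sum(axis=1)
        n_shared = shared.sum(axis=1)
        dist = np.where(n_shared > 0, mismatches / np.maximum(n_shared, 1), np.inf)
        dist[i] = np.inf                       # a row is not its own neighbour
        order = np.argsort(dist, kind="stable")
        for j in np.flatnonzero(missing[i]):
            # Only rows that actually have a value in column j can donate one.
            donors = [r for r in order if np.isfinite(dist[r]) and not missing[r, j]][:k]
            if donors:
                filled.iat[int(i), int(j)] = pd.Series(values[donors, j]).mode().iloc[0]
    return filled

df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "red"],
    "shape":  ["round", "round", "square", "square", "round"],
    "size":   ["small", "small", "large", "large", None],
})
filled = knn_impute_categorical(df, k=2)
print(filled)
```

The last row matches the two red/round rows exactly, so it inherits their `size`.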
Does anyone know a good way of searching for a term in a VERY LARGE CSV (5million rows total ~) without using a for loop
/is a for loop really that much slower ?
@marsh chasm in any column? or in a specific column
Specific column
plain python can loop over 5 million rows pretty fast. pandas can do it really fast
Hmm it’s been running for an hour it’s not done yet (regular for )
show your code?
yeah gimme a second
```python
temp = []
keyword = "Panda Express"
for i in filenames:
    with open(i, 'rt') as f:
        reader = csv.reader(f)
        for row in reader:
            if keyword == row[1]:
                temp.append(row[0])
                temp.append(row[11])
                dataframes.append(temp)
                temp.clear()
    with open('PandaVisits.csv', 'wt') as p:
        writer = csv.writer(p)
        for r in dataframes:
            writer.writerow(r)
```
filenames is a list of csv's that i want to analyze
im surprised thats taking an hour
wait
you forgot to un-indent the 2nd with open
so you're re-writing all your files every time you read 1 file
unless that's your goal...
even so that's a long time
regardless, pandas should be able to do this much faster
you dont have any column names? data starts on the first row?
sorry was helping my dad w something
don't i need to do that bc the original for loop is like every file in filenames
you're doing some weird business with the data
your logic is all twisted
oh nvm
you're clearing temp..
but still
ah probably, im new to this so it's probably not the best solution
you should move the writing to the end
so you aren't re-writing every time you read 1 file
regardless pandas makes this a lot faster
import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"
for filename in filenames:
    data = pd.read_csv(filename, usecols=[0, 1, 11], header=None, names=['x0', 'x1', 'x2'])
    has_keyword = data['x1'].str.contains(keyword)
    temp = data.loc[has_keyword, ['x0', 'x2']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False, header=False)
again, assuming your files don't have column headers
they have column headers
how does panda make it faster
is it like... multithreading ?
its written in C instead of looping manually in python
there is a huge amount of overhead in python code execution
ohhhh gotcha
well the ones of interest: date_range_start and raw_visit_counts
and the one that you're searching in?
location_name
import pandas as pd

filenames = [ ... ]
dataframes = []
keyword = "Panda Express"
for filename in filenames:
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    dataframes.append(temp)

combined_data = pd.concat(dataframes)
combined_data.to_csv('PandaVisits.csv', index=False)
try this
thanks! i'll crossreference it w the docs to learn it but thank u so much! being new to this its hard to figure out which tools are the best to use
so thank u
youre welcome, thats the best way to learn
the user guides in pandas aren't always that helpful but the API reference is usually clear
the usual caveats about untested code apply
yeah probably
it looked fine ?
i bet you can figure out the problem
hint: look near the top
lol
for future reference, 3 dots "..." is called an "ellipsis"
which hopefully makes that error message make sense
yeah thats why i was confused i was like tf where are the ellipsis i deleted the filename part
apparently
i did not
happens to the best of us
im looking through the docs... pandas is pretty useful xD
yep, for anything with tabular data it's pretty much indispensable
lmao to think i started this project using R
R isn't bad
pandas owes a lot to R for its design
the R code would look pretty similar, and if you use a 3rd party CSV reader instead of the built-in one it's just as fast if not faster
i've processed billions of rows in R doing more complicated operations than this
hell yeah
data.table? shouldnt be
let me see if i can pull up the code
```r
library(purrr)
setwd("/Volumes/Seagate Backup Plus Drive/UMichStuff/v2/main-file/Relevant")
filenames = list.files(getwd())
rawVisits = 12
startDate = 10
CSVnames = filenames[filenames %like% "weekly-patterns"]
df = map_df(CSVnames, fread, header = TRUE)
df = df[which(df$location_name == "Panda Express"), c(startDate, rawVisits)]
fwrite(df, "PandaExpressVisitList.csv")
```
my random forest model got a negative 17% R-2 value lol.
@marsh chasm
library("data.table")

filenames <- c( ... )
keyword <- "Panda Express"
dataframes <- lapply(filenames, function(filename) {
  dat <- fread(filename, select = c("date_range_start", "location_name", "raw_visit_counts"))
  dat[location_name == keyword, .(date_range_start, raw_visit_counts)]
})
combined_data <- rbindlist(dataframes)
fwrite(combined_data, "PandaVisits.csv")
pardon any mistakes, i haven't used R much recently
oh interesting
i started running the python code 2 minutes ago and its not done
is that normal
your data might be bigger than you realize
its 150 gigs
: )
you have 150 gb of memory?
external hard drive
uh oh
or at least keep your performance monitor open
do me a favor
how many files do you have
and how many lines are in each file
like how many csv's
yes
run this in bash:
wc -l my-data-directory/*.csv
obviously my-data-directory is the directory w/ your CSVs
wait @desert oar how would i go about getting data from sql database with 15m rows into python for pandas editing?
csv limit is 1m
m = million
what do you mean csv limit is 1m
either write the query and pd.DataFrame it, or use pd.read_sql
(note that pd.read_sql requires sqlalchemy unless you're using sqlite)
if it's supported by sqlalchemy it's supported by pandas
if not, like i said: write the query yourself and then convert to a dataframe after reading the data
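For a table that size it can help to read in chunks instead of all at once. A sketch using an in-memory sqlite database as a stand-in; the `visits` table, its columns and the chunk size are all invented for the example, so swap in your own connection and query:

```python
import sqlite3
import pandas as pd

# Stand-in database so the sketch runs anywhere.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (location_name TEXT, raw_visit_counts INTEGER)")
con.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("Panda Express", n) if n % 2 else ("Other Place", n) for n in range(10)],
)

query = "SELECT location_name, raw_visit_counts FROM visits WHERE location_name = ?"

# With chunksize, read_sql yields DataFrames of at most that many rows,
# so the full result never has to sit in memory at once.
total = 0
for chunk in pd.read_sql(query, con, params=("Panda Express",), chunksize=3):
    total += int(chunk["raw_visit_counts"].sum())

con.close()
print(total)
```

Doing the filtering in the SQL `WHERE` clause, as here, also means far fewer rows ever reach pandas.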
uh so like quotes around the whole filepath thing
probably because you have 150 GB of data
yeah, but you have a fuckton of data
okie
which will give us a better sense of how to approach this
oh it finished the first two... theyre at about 3,880,000 each
lets see how the other ones fare
how many files total?
!e ```python
print( 32 * 3.8 * 1e06 )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
121600000.0
woah coding buzzword
indeed
"big" = "too big for a hard drive"
so depending on the hard drive you're getting close
yeah im running this out of a 1TB external hard drive but thats storage
so not too bad
2 TB*
it's certainly "medium data" and too big for ram on most machines
i have a work machine with 256 GB of ram but even then i wouldn't load all this data at once, if only out of respect for my coworkers who also need the machine
import pandas as pd
from tqdm import tqdm # for a nice progress bar

filenames = [ ... ]
output_filename = 'PandaVisits.csv'
keyword = "Panda Express"
for fileno, filename in tqdm(enumerate(filenames)):
    data = pd.read_csv(filename, usecols=['date_range_start', 'location_name', 'raw_visit_counts'])
    has_keyword = data['location_name'] == keyword
    temp = data.loc[has_keyword, ['date_range_start', 'raw_visit_counts']]
    if fileno == 0:
        temp.to_csv(output_filename, index=False)
    else:
        temp.to_csv(output_filename, index=False, header=False, mode='a')
anyway this should read each file one at a time, then append it to the same pandavisits.csv file
and i added a pretty progress bar so you can see how long it will actually take
(pip install tqdm or conda install tqdm)
oh cool
okay i'll try that
is this what the progress bar looks like?
0it [00:00, ?it/s]
thats what pops up when i run the script initially
oh i got it
nvm
it looks like it'll take about 40 minutes to run the whole thing (75seconds per iteration and 32 iterations)
seems reasonable
this is my first time working w data
like to this scale
i was using R because my paradigms class had a small data science unit so it was fresh in my mind
the code worked @desert oar ty
maybe i didn't let it run long enough : /
i expected that it was just taking too long
instead of lapply and rbindlist you do basically the same thing as i wrote in pandas
its like a 1:1 port
when im using categorical variables in a random forest, do i need to one-hot encode them? ive seen conflicting responses
when i dont one hot encode, i get a 'cannot convert string to int' error
@lapis sequoia most models still require numerical inputs even if the numbers correspond to categories
i thought that as well
but dont random forests handle categorical data?
i just saw a youtube video with 900/920 likes that one hot encoded so im just gonna do that lol
but im still interested in knowing if you can use a random forest without having to (one hot) encode categorical variables
(if anyone knows the answer pls tag me bc sometimes i forget to check here after asking a question lol)
Yes they can handle categorical data, but that depends on the person who wrote the software letting you specify which columns are categorical
Just think about how a decision tree is constructed
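If your library only takes numeric input (which is what a "cannot convert string" error suggests), the two usual encodings look like this in pandas. The column names and values are made up for the example:

```python
import pandas as pd

X = pd.DataFrame({
    "business_unit": ["retail", "wealth", "retail", "ops"],
    "headcount": [12, 40, 7, 25],
})

# One-hot: one indicator column per level; numeric columns pass through untouched.
X_onehot = pd.get_dummies(X, columns=["business_unit"])

# Integer codes: a single column, but it imposes an arbitrary order on the levels.
X_codes = X.assign(business_unit=X["business_unit"].astype("category").cat.codes)

print(X_onehot.columns.tolist())
print(X_codes)
```

A tree can split on either form; one-hot gets wide with many levels, while integer codes stay compact but make the tree split on an order that has no meaning.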
Hey, salt rock is back. Nice.
is there anything that xlwings can do that pandas CAN'T do?
Is this chat for machine learning also?
Do you know how the computation time/complexity of a neural network will increase by implementing more classes?
How to convert python into api
You can choose a web-framework for Python, e.g. Django or Flask. I've used Flask for this in the past and it was really simple. You can easily define your API endpoints with the @app.route() decorator for different operations like GET or POST. Check out https://flask.palletsprojects.com/en/1.1.x/
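A minimal sketch of what that looks like with Flask. The `predict` function is a placeholder for whatever Python code you want to expose, and the `/predict` route and JSON field names are assumptions for the example:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(size_sqft, bedrooms):
    """Placeholder for your real Python code (a made-up linear rule)."""
    return 500 + 2.0 * size_sqft + 300 * bedrooms

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Read the JSON body, call the plain Python function, return JSON.
    payload = request.get_json()
    rent = predict(payload["size_sqft"], payload["bedrooms"])
    return jsonify({"rent": rent})
```

Start it with Flask's development server (for example the `flask run` command) and POST JSON such as `{"size_sqft": 100, "bedrooms": 1}` to `/predict`.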
For example, say we are trying to predict rent based on the size_sqft and the bedrooms in the apartment and the R² for our model is 0.72 — that means that all the x variables (square feet and number of bedrooms) together explain 72% variation in y (rent).
What does variation in y mean
is there anyone who already works with kaggle notebooks? i want to ask why i can't open the model folder for a model that i saved with model.save()
@blazing bridge "Variation in y" is how much the y values spread around their mean (the sum of squared differences from the mean). The 72% is the share of that spread that the model's predictions account for. I think of it as "how much better your model is at predicting the y value compared to a naive one that just guesses the mean value of y"
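In code, R² compares the model's squared error with the squared error of always guessing the mean. A small sketch with made-up rent numbers:

```python
import numpy as np

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()          # error left over after the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # error of always guessing the mean
    return 1.0 - ss_res / ss_tot

rent = [1000, 1500, 2000, 2500]
perfect = r_squared(rent, rent)                       # predictions match exactly
naive = r_squared(rent, [1750] * 4)                   # always guess the mean
worse = r_squared(rent, [2500, 2000, 1500, 1000])     # worse than guessing the mean
print(perfect, naive, worse)
```

This is also why R² can come out negative: a model that does worse than the mean guess has more leftover error than there was variation to begin with.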
Does anyone here know anything about neural networks?
@astral mantle Probably. I don't deal too much with NN myself, but just go ahead and ask your question. Someone who knows & has the time will answer.
Oh well
I've been trying to understand backpropagation
I get how i'd find the adjustment to the first set of weights in respect to the output layer
but how would i adjust the weights of those in hidden layers and further?
do I carry on using the chain rule or is there something else
Nope, chain rule. That's it
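To make "just keep applying the chain rule" concrete, a tiny sketch of a two-layer network in NumPy. The layer sizes, tanh activation and squared-error loss are arbitrary choices for the example; the last lines check the hidden-layer gradient against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # 4 samples, 3 inputs
y = rng.normal(size=(4, 1))
W1 = rng.normal(size=(3, 5))       # input -> hidden
W2 = rng.normal(size=(5, 1))       # hidden -> output

def loss(W1, W2):
    h = np.tanh(x @ W1)
    return 0.5 * ((h @ W2 - y) ** 2).sum()

# Forward pass, keeping the intermediate values for the backward pass.
z1 = x @ W1
h = np.tanh(z1)
out = h @ W2

# Backward pass: each line is one more application of the chain rule.
d_out = out - y                    # dL/d(out)
dW2 = h.T @ d_out                  # dL/dW2 (output-layer weights)
d_h = d_out @ W2.T                 # error passed back to the hidden layer
d_z1 = d_h * (1 - h ** 2)          # through tanh, since tanh'(z) = 1 - tanh(z)^2
dW1 = x.T @ d_z1                   # dL/dW1 (hidden-layer weights)

# Numerical check of one hidden-layer weight with a central difference.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * eps)
print(dW1[0, 0], numeric)
```

For deeper networks the pattern repeats: multiply the incoming error by the layer's weights, then by the activation's derivative, and that gives the error for the layer before it.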
Hey, I'm trying to create a program that can classify whether an image contains a building or not. I'm not sure where I should begin. I guess I could create a binary classifier CNN with Keras/TensorFlow/PyTorch. Or maybe I could use object-recognition in OpenCV, like Haar-Cascades. Do you have any idea what would be a good approach for this project?
@astral mantle if u need to understand the theory maybe u can try enrolling in andrew ng's deep learning class
*on Coursera, or watch the deeplearning.ai videos on youtube.
well, you should use reddit's API
if you want to scrape, then I'm quite sure it's against their ToS
so if yes, we cannot help you
Modulo