#data-science-and-ml

hardy crag
#

also you would need some kind of evaluation to know how well your algorithm performed/how satisfied you are with the result

#

the optimizer you are using is then a class which represents the algorithm, so it needs the model parameters in the constructor, some training methods that take data and a run method, which just evaluates data and does not train.

#

(or you can just use a framework for the model that you can compare your solution to, and for creating the rest of the script)

#

does that make sense?
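
The structure described above could look roughly like this. It is a minimal sketch, assuming a one-parameter linear model trained by gradient descent; the class name `GradientDescentOptimizer` and the methods `train`, `run`, and `evaluate` are illustrative and not taken from any framework.

```python
class GradientDescentOptimizer:
    """Illustrative optimizer: fits y ≈ weight * x by gradient descent."""

    def __init__(self, weight=0.0, learning_rate=0.01):
        # Model parameters go into the constructor.
        self.weight = weight
        self.learning_rate = learning_rate

    def train(self, xs, ys, epochs=200):
        # Training method: takes data and updates the parameter.
        n = len(xs)
        for _ in range(epochs):
            # Gradient of the mean squared error with respect to weight.
            grad = sum(2 * (self.weight * x - y) * x for x, y in zip(xs, ys)) / n
            self.weight -= self.learning_rate * grad

    def run(self, xs):
        # Inference only: evaluates data and does not train.
        return [self.weight * x for x in xs]

    def evaluate(self, xs, ys):
        # Mean squared error, to judge how well the algorithm performed.
        preds = self.run(xs)
        return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Made-up data following y = 2x, for illustration only.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
model = GradientDescentOptimizer()
model.train(xs, ys)
print("prediction for 5:", model.run([5.0])[0])
print("mse:", model.evaluate(xs, ys))
```

Keeping `run` separate from `train` makes it easy to compare the hand-written version against a framework model on the same data.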

chrome lily
#

Thank you so much, it makes some sense
I'm just a bit clueless on how to begin writing the code, as that's my main issue. I've been coding for around 5 months now; first-year university student
Still learning the ropes

#

@hardy crag

hardy crag
#

just build it step by step. first, create the input and just print it back to the console, and so on

chrome lily
#

Okay thank you so much!

hardy crag
#

yw

brisk lynx
#

can someone help me here? I'm learning basic data science and did this for model accuracy

#
```python
    def get_model_accuracy(self):
        x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
        y = self.selected_columns[["CO2EMISSIONS"]]
        linear_reg = LinearRegression().fit(x, y)
        accuracy_value = -cross_val_score(linear_reg, x, y, cv=10, scoring="neg_mean_squared_error").mean()
        print("model accuracy was:", format(accuracy_value ** 0.5, ".2f"))
```
#

from what I learned the accuracy values need to be between [0, 1]

#

but the value im getting is 24

#

what I did wrong

#

?

placid snow
#

Accuracy would be calculated as number of correct predictions / total number of predictions -> ```
(TP + TN) / (TP + TN + FP + FN)``` So either you got the formula wrong, or some of your values don't add up

#

I haven't touched this in ages so can't really remember much more about it

brisk lynx
#

hum, I thought the module would do the formula for me

placid snow
#

tp being true positives, tn true negatives and so on
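
As a small illustration of the formula above (a sketch with made-up counts; the function name `accuracy` is just for this example). Note that accuracy in this sense is defined for classification, whereas the CO2 snippet earlier predicts a continuous value.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical counts, for illustration only: 85 correct out of 100.
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```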

#

Humm, I can look through my old course work for how I did it.

#

I believe i had a bit of accuracy calculated

#

¯_(ツ)_/¯

#

It's also pretty old, so not the best of python code in general.

brisk lynx
#

oh I just found out LinearRegression() has a score() method

#

but Ill look into your repo

placid snow
#

I forget if score is accuracy or something else, but that could be it

brisk lynx
#

yeah score did it

#

it returned 86.4%

#

hum, thats not good is it?

#
```python
    def get_model_accuracy(self):
        x = self.selected_columns[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_COMB"]]
        y = self.selected_columns[["CO2EMISSIONS"]]
        linear_reg = LinearRegression().fit(x, y)
        print("model accuracy was:", format(linear_reg.score(x, y) * 100, ".2f"), "%")
        print("The predicted co2 emission values were", linear_reg.predict(x))
```
placid snow
#

That's fairly good, from what i remember
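
For context: `LinearRegression.score()` returns the coefficient of determination R², not classification accuracy, and the square root of a mean squared error is expressed in the units of the target, so it is not bounded by 1. A minimal pure-Python sketch with made-up numbers (the helper names `rmse` and `r_squared` are illustrative):

```python
def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up emission values and predictions, for illustration only.
actual = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]
print("RMSE:", rmse(actual, predicted))        # every prediction is off by 10
print("R^2:", r_squared(actual, predicted))
```

The two numbers answer different questions: RMSE says how far off a typical prediction is, R² says how much of the variance the model explains.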

brisk lynx
#

well, thanks for the help, Ill check your repo now chibli, thanks

placid snow
#

Having too high of a score could mean overfitting

brisk lynx
#

that means the model is way too specific to the training set and not to a general situation, right?

placid snow
#

Something like that :P

#

It's a bit blurry for me
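
One way to make the overfitting idea concrete is to score a model on data it has not seen. The sketch below is an assumption-laden toy, not the regression model from the thread: it uses a nearest-neighbor "memorizer" on made-up 1-D data, which is perfect on its own training set and worse on held-out points.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Copy the label of the closest training point (memorizes the training set)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def accuracy_on(train_x, train_y, xs, ys):
    """Share of points in (xs, ys) that the memorizer labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(ys)


# Made-up data: the true rule is "label 1 when x > 5", with one noisy training label.
train_x = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 1, 0, 1, 1, 1, 1]   # the label at x=3.0 is noise
test_x = [2.9, 3.2, 5.5, 8.5]
test_y = [0, 0, 1, 1]

print("train accuracy:", accuracy_on(train_x, train_y, train_x, train_y))
print("test accuracy:", accuracy_on(train_x, train_y, test_x, test_y))
```

A large gap between the two scores is the usual warning sign; a score computed only on the training data cannot show it.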

brisk lynx
#

thanks

placid snow
thin terrace
#

Anyone can give some directions on how to select nr of epochs, batch size, hidden layer size etc. I made a dataset and now I'm trying to evaluate how well my model can fit it. I can't make any sense of the accuracy results. Sometimes the accuracy can be everything from 25-95% and sometimes it hits a reoccurring 28.42% - even if i run with the exact same hyperparameters..

#
```python
X = np.array(data.drop(['H', 'D', 'A'], axis=1))  # Features
y = np.array(data[['H', 'D', 'A']])  # Labels

# 10-fold Cross-validation
kf = KFold(n_splits=10)  # Shuffle=True?
for train_index, test_index in kf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

model = Sequential()
model.add(Dense(51, activation='relu'))
model.add(Dense(25, activation='relu'))  # Experiment with hidden layer size
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X, y, epochs=1000, batch_size=18)  # Experiment with epochs and batch_size

scores = model.evaluate(X, y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
```
hasty maple
#

that's the best the model can do I guess

thin terrace
#

It's not, cuz running it again it might just as well hit 95%

#

Or 62 or 54 or whatever

velvet anchor
#

There isn’t a magic way to check hyper parameters I don’t believe

#

It just kind of is trial and error

thin terrace
#

Well how can I trial and error if it always gives me different results even if I tweak the hyperparameters or not?

#

I can't really tell if it was the tweaking that made it better/worse or just random variation
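
One way to tell a real hyperparameter effect from run-to-run noise is to repeat each configuration over several fixed seeds and compare means and spreads. A minimal sketch follows; `train_and_score` is a hypothetical stand-in that only simulates a noisy accuracy, and in practice it would build, fit, and evaluate the actual model on held-out data.

```python
import random
import statistics


def train_and_score(hidden_units, seed):
    """Hypothetical stand-in for a full training run; returns a simulated accuracy."""
    rng = random.Random(seed)                # fixed seed -> repeatable run
    base = 0.50 + 0.001 * hidden_units       # made-up effect of the hyperparameter
    return base + rng.uniform(-0.05, 0.05)   # made-up run-to-run noise


def summarize(hidden_units, seeds):
    """Mean and sample standard deviation of the score over several seeds."""
    scores = [train_and_score(hidden_units, s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


seeds = range(10)
for hidden in (10, 25, 50):
    mean, spread = summarize(hidden, seeds)
    print(f"hidden={hidden}: mean={mean:.3f} stdev={spread:.3f}")
```

If the difference between two configurations is smaller than the spread across seeds, the tweak probably did not matter.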

hardy crag
#

well if the model is randomly initialized, the minimum it finds may be different from run to run

#

have you checked whether the 28.xx% is the result of the model always predicting the same output for any input?

#

also: how big are your training and testing set?

#

there is a lot of "intuition" and "experience" and reading tips and tricks for x architecture involved in finding hyperparameters

thin terrace
#

Dataset is 380 samples (representing one season of EPL football games) which can be increased by adding more seasons. I've tried splitting it with the validation_split parameter of model.fit() by 10%, 20% and 30%.

#

Can you point me to any good resources of these tips and tricks? @hardy crag

hardy crag
#

@thin terrace so the model always predicts Away win?

#

if you have an imbalanced dataset, do you correct for that?
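
A quick check for the question above: if the recurring accuracy equals the share of a single class, the model is probably predicting that class for every input. With 380 samples, 28.42% corresponds to 108 matches (108 / 380 ≈ 0.2842). A sketch, assuming one-hot rows in `H`, `D`, `A` order as in the snippet; the class counts are made up and the helper `class_shares` is illustrative.

```python
from collections import Counter


def class_shares(one_hot_rows, names=("H", "D", "A")):
    """Share of each class in a list of one-hot label rows."""
    counts = Counter(names[row.index(1)] for row in one_hot_rows)
    total = len(one_hot_rows)
    return {name: counts[name] / total for name in names}


# Made-up season: 172 home wins, 100 draws, 108 away wins (380 matches).
labels = [[1, 0, 0]] * 172 + [[0, 1, 0]] * 100 + [[0, 0, 1]] * 108
shares = class_shares(labels)
print({name: round(share, 4) for name, share in shares.items()})

# A constant "always away win" predictor would score exactly the away share.
print(round(shares["A"] * 100, 2))
```

Comparing the model's accuracy with the largest class share also gives a baseline that any useful model should beat.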

small pumice
#

Hi,
I am trying to use satellite data and a neural network to predict whether a wildfire will occur in a given area within a month. I am using Google Earth Engine to collect the data. I have a question about the neural network: I will have far fewer occurrences of fires than non-occurrences, so I will have to train on the same number of fire and non-fire samples. Then, however, the result that the neural network outputs for whether a fire will occur within a month will be skewed. How do I account for this?

hardy crag
#

you can correct in your loss function for imbalanced classes

#

like this stackoverflow
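
A minimal sketch of what correcting in the loss can look like, assuming binary labels and the common "balanced" heuristic of weighting each class by n_samples / (n_classes * class_count). The function names and the data are illustrative; frameworks usually expose the same idea directly (Keras `fit`, for instance, accepts a `class_weight` argument).

```python
import math


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes = sorted(set(labels))
    n = len(labels)
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def weighted_log_loss(labels, probs_of_fire, weights):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(labels, probs_of_fire):
        p_true = p if y == 1 else 1.0 - p   # probability assigned to the true class
        total += -weights[y] * math.log(p_true)
    return total / len(labels)


# Made-up data: 1 = fire, 0 = no fire; fires are rare.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(labels)
print(weights)  # the rare class gets the larger weight

probs = [0.1] * 8 + [0.6, 0.4]
print(weighted_log_loss(labels, probs, weights))
```

With this kind of weighting all samples can stay in the training set, so the rare class counts for more without discarding the common one.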

lapis sequoia
#

Hi. I am trying to work on document similarity to retrieve top n documents for a given query

#

at present, i am using tf idf vectorizer

#

Does someone know something better than tf idf that can capture semantic meaning as well?

bronze osprey
#

Can i convert .mlmodel (Apple MLKit Model) to .tflite (Tensorflow Lite Model) using python?

winter cliff
#

What is some example code of putting eye tracker data into a csv file?

supple ferry
#

Hey there! Does anyone have experience with the Logit model in Python? I'm now using the statsmodels version of it. When fitting my model, I'd like to specify a logarithmic transform so that it fits on log(X) rather than on the original X. Is there an easy way to do that other than manually creating columns in the df?

delicate nymph
#

hi

#

Can anyone explain to me something about BSpline?

deft harbor
#

Does anyone have good resources for dataviz or "storytelling"?

meager laurel
#

So i got redirected here and I needed advice

#

how i can use image analysis
to identify the differences between these two
and other images
Like these two are different stages of a microstructure
of a material
and I wanted to create an app that will be able to distinguish between these two

polar acorn
#

How many examples do you have of each stage?

chilly shuttle
#

You don't even need ml. Just run hough transform and observe the much more prominent lines in the left image

meager laurel
#

@chilly shuttle can you elaborate more please

#

Like you're saying to use Hough transform, and to find the image with more lines than dark shapes?

chilly shuttle
#

Run them through hough transform and look at the differences in output, it should be quite obvious after that

meager laurel
#

@chilly shuttle ohh ok, thank you!

chilly shuttle
#

if you want more help you need to give a lot more information about your problem, e.g. how many samples of each class you have as pptt asked

#

but the two you showed don't need ml to discern, you just need to look for straight lines

meager laurel
#

@chilly shuttle yea can I get back to you on that in a later day because

#

We just got the challenge today

#

From our materials uni course which is making a challenge for a hackathon

chilly shuttle
#

if it's a challenge why are you coming straight here for help

#

take it on as a challenge

meager laurel
#

And I'm planning to learn

#

Noo like

#

Ik what kinda thing to find out

#

It's just idk about the technology

#

Needed to find it out

#

Which I'm planning to learn

#

Before the hackathon

#

That's why they showed the challenge

#

Rn

#

The hackathon is in a few weeks

#

So we can kinda prepare for it

#

Like I just needed to know what kind of technology I needed to learn to find the differences in the two images

#

And I was planning to learn those stuff

#

So I can use it in the actual hackathon

#

That's why, sorry for the miscommunication

chilly shuttle
#

i don't mind, we'll answer whatever questions

meager laurel
#

Thanks! The only questions I plan on asking are related to the technology help itself

#

Hence why I came to the Python discord to start off

chilly shuttle
#

just saying if it's a challenge i'd personally feel like any success would be diminished if I got my hand held through it

meager laurel
#

Cuz I want to get involved in computer vision

#

And this challenge was something for me to start off with

#

Yea that's true, well like I just wanted to know where to start off from

#

Cuz I'm clueless right now

#

About machine learning

#

Like I got a course from Udemy, and I'm gonna use that to learn and from there try to solve the problem

#

Because I don't know yet, I still have to get clarification from them, whether they have more samples and whether the app they want is supposed to scan different materials that have diff microstructures, so the samples would vary between each material

#

I just got the idea of this a few hours ago, so I was just desperately trying to find a place to start

polar acorn
#

Best of luck! But remember struggling and/or failing a bit is a great way to learn.

vestal ravine
#

Hello everyone
i'm tryin to build a live forex chart
i don't have any experience with that yet
but i do know a little about python
is there anyone here, who can help me?

narrow ivy
#

Hey guys, is there any free stock price API which allows me to get data more frequent than 1 day ? I want to make live updating chart using dash, but I need some great api that will make my chart to change for example each minute. Thanks in advance 😃

#

@vestal ravine I guess we're looking for the same thing 😄 Try to learn really useful libraries like pandas and numpy and then matplotlib/plotly for creating charts. After that you will be in my position: trying to find a great API to get live data, as frequently as possible.

vestal ravine
#

ill take a look, Thnx.

meager laurel
#

@polar acorn Thankks!!!

silk hill
#

@mikey770 alpha vantage has an intraday api

small ore
#

@meager laurel Is that a schematic diagram from a textbook or an actual black-and-white picture of the microstructure? Or an actual picture filtered to make the image simple for the task? A real-life problem would be to recognize the process (heat treatment, natural process of formation, etc.) of any given material. In that case you will need a lot of sample pictures of each material structure

meager laurel
#

@small ore well our materials Prof posted that on the slide for the lecture and I cropped the image from there, so I'm assuming it's an actual picture.

#

This is the challenge part

#

And we just started our materials course, so I don't know much about how the microstructures change. But I remember hearing he said the temperature affects it. I'll have to find out more info about it.

small ore
#

Those may even be differently heat-treated steel. I too donno how a photograph of a microstructure is taken. I only remember seeing them in textbooks. Not even good ones

#

And my very little and outdated knowledge tells me that recognizing microstructures was a forte of highly skilled and trained manpower. So this challenge seems to me like they are trying to remove the discrepancies that come with human decisions and the high cost involved (although who knows, machines may prove to be inferior in certain cases)

void anvil
#

^

Steel microstructure is a massive fucking pain in the ass. Good luck getting anything beyond meh results if they give you a real world dataset.

#

Photographs come from SEMs

feral lodge
meager laurel
#

@small ore @void anvil @feral lodge ohh ok, I'll look more into that, and ask my prof what he's expecting and about this stuff too

void anvil
#

I have a friend who works in a steel mill. ML is useless in > 99% of the cases. If you get an actual breakthrough you'll get hundreds of millions of dollars every year.

small ore
#

Well, unless you are working for a research firm, you may spend a million trying to get decent sample photographs

meager laurel
#

Idk I think I recall the prof saying to find the difference by checking the amount of white in the picture or the size or something. Idk if that's fully true now

small ore
#

True. They might be seeking something simple. I can't imagine someone throwing their students something so complex

thin terrace
#

Any suggestions to which metrics to use in ML classification?

polar acorn
#

Depends on the case. With nicely balanced classes accuracy might do fine. Screening for cancer? Maybe keep an eye on the f1 score and the false negative rate etc. etc.
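
To illustrate why accuracy alone can mislead in a screening-style case, here is a sketch that computes the metrics mentioned above from confusion counts. The counts are made up and the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp, tn, fp, fn):
    """Common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_negative_rate": fn / (fn + tp),
    }


# Made-up screening result: 1000 patients, 10 with the condition, 8 of them missed.
report = binary_metrics(tp=2, tn=985, fp=5, fn=8)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy looks excellent while most positive cases are missed, which is exactly what the F1 score and false negative rate expose.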

thin terrace
#

@polar acorn balance is like 45/25/30%

void anvil
#

Depends strongly on what you're trying to predict

#

What input(s) do you have and what output(s) are you looking for?

thin terrace
#

Predicting football matches. Input is team ratings and some history data from previous games. output is home team win / draw / away team win

#

From previous work I've basically only seen accuracy being used

void anvil
#

You might be better off using a (-1,1) classifying system like they use in finance and putting draw in at 0

thin terrace
#

@void anvil why is that?

void anvil
#

experience

latent flicker
#

Is there an R discord?

heavy apex
#

Within data science, which tools are best in which situation? Currently doing a visualization class that is strictly Tableau, with only optional Python and R learning. I'm obviously going to be putting in the extra optional work, but which tool tends to hold the most weight in a working environment?

carmine lava
#

@feral lodge object tracking python codes or research paper please share if you know

quiet crest
#

Does anyone have jupyter notebook slowing down after some period of time and was able to fix it? I have been looking it up, but couldnt find much

orchid lintel
#

@thin terrace To expand on that answer, it's because there's an element of ordinality to those classes. ie, you're not predicting something truly Categorical like, say, "hair color". A win is "closer" to a draw than it is to a loss, and so it makes sense to put them on a scale.

#

@heavy apex Matplotlib itself is very powerful but honestly can be a pain. Seaborn's a much-easier wrapper around that, but it's limited in the types of visualizations it can do. Plotly is very good for both exploratory data analysis and presenting findings - it's also got a thing called Dash that can replace Tableau in a lot of instances. There's also Bokeh for interactive visualization. My favorite is probably Altair.

thin terrace
#

@void anvil @orchid lintel which activation function do you use on the output layer for such a label?

thin terrace
#

(and loss function)
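
A sketch of the ordinal encoding suggested above, mapping away win / draw / home win to -1 / 0 / 1 and turning a continuous prediction back into an outcome. The thread does not settle the activation and loss question; one possible pairing, stated here as an assumption, is a single `tanh` output (range -1 to 1) trained with mean squared error. All names below are illustrative.

```python
# Encode outcomes on one axis from the home team's point of view.
OUTCOME_TO_SCORE = {"A": -1, "D": 0, "H": 1}
SCORE_TO_OUTCOME = {score: outcome for outcome, score in OUTCOME_TO_SCORE.items()}


def encode(outcomes):
    """Turn a list of 'H'/'D'/'A' labels into ordinal targets."""
    return [OUTCOME_TO_SCORE[o] for o in outcomes]


def decode(prediction):
    """Map a continuous prediction in [-1, 1] to the nearest outcome."""
    nearest = min(SCORE_TO_OUTCOME, key=lambda score: abs(score - prediction))
    return SCORE_TO_OUTCOME[nearest]


print(encode(["H", "D", "A"]))
print(decode(0.7), decode(-0.2), decode(-0.9))
```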

vague jetty
void anvil
#

is also a great one as well

#

same with spacy

vague jetty
#

Prodigy is spaCy's version

#

Does anyone have experience with Doccano? I'm having trouble uploading a dataset.

vague jetty
#

Looks like Doccano isn't working for me. Any other suggestions for an easy interface for data annotation?

Specifically, I'm looking to annotate classification, not keywords in text.

vague jetty
#

Nvm, looks like Doccano started working for me.

chilly shuttle
#

that's pretty cool

#

i guess you couple something like that with something like mechanical turk to generate labelled datasets

meager laurel
#

Oh by the way, my professor got back to me and he said

#

"Sorry for the delay in getting back to you. Attached are two sample files similar to the ones that will be provided for the challenge. What we would like you to do is to measure the fractions of the light and dark phases. The fraction could simply be expressed as number of pixels of a given color over the total number of pixels. The challenge will be the lighting conditions. Sometimes you’ll get very good contrast between the phases (e.g. micro1.png) and sometimes there is less contrast (micro2_prec.png).

To do this, you can start by using Python libraries for reading an image and for interrogating the data."

#

That's what he wanted
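
A rough sketch of the pixel-fraction task, under stated assumptions: the image is already a 2-D list of gray levels from 0 to 255 (in practice it could be loaded with a library such as Pillow), and the light/dark cut-off is picked per image with Otsu's method so that low-contrast images are handled too. This is one possible approach, not necessarily what the professor has in mind; the function names and the sample image are made up.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        if weight_dark == 0:
            continue              # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break                 # every pixel is already in the dark class
        sum_dark += level * hist[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_level, best_var = level, between_var
    return best_level


def phase_fractions(image):
    """Return (dark_fraction, light_fraction) for a 2-D list of gray levels."""
    pixels = [p for row in image for p in row]
    threshold = otsu_threshold(pixels)
    dark = sum(1 for p in pixels if p <= threshold)
    return dark / len(pixels), 1 - dark / len(pixels)


# Made-up low-contrast 4x4 "micrograph": 6 darker and 10 lighter pixels.
image = [
    [100, 105, 140, 145],
    [102, 140, 142, 150],
    [104, 108, 148, 141],
    [106, 147, 149, 146],
]
print(phase_fractions(image))
```

Choosing the threshold from each image's own histogram, rather than hard-coding a value, is what lets the same code cope with both good and poor contrast.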

vague jetty
#

Woop woop, got a follow-up email from a company for an ML research internship. They want me to submit a research proposal, but I have no idea where to start...

rugged leaf
#

@vague jetty ay nice job

void anvil
#

Basically here's a problem, here's what it'll do for your company, here are some methods that I could try to apply, and here's why it'll be worth your money

small ore
#

Like: I want to learn deep ocean diving, and want a budget for sonar measurement of dolphin and whale sounds, so that I can help devise a way to distinguish between ships/subs and other creatures 😄 😛

vague jetty
#

So I'm really new to sentiment analysis and am playing around with a project doing sentiment on a hockey forum. I've scraped a bunch of posts and am working on annotating them to make the training and test sets. I imagine this is a vague, common question with no good answer, but how do I label a post like "He wasn't great at the u18's. He started off the year in Kingston pretty bad, but got a lot better later on near the new year. His consistency could use some work from what I've seen/heard. He can be dominant but he can also have some bad showings." It's pretty clearly both positive and negative. Right now I'm applying labels to the entire post. I imagine selecting the instances of positive text and negative text in the post would be better than blanket labeling the entire post, but it will take a lot more time to do that. Should I label it both positive and negative?

void anvil
#

You can count sentiment items per post (10x bad, 3x good) if you want. What's harder to do is pick up sarcasm (e.g. PK Subban is the worst role model in the league).

#

It's also a bit more difficult to pick out what's being said good / bad about the specific person in the post (e.g. player XYZ is doing badly so far this year but will pick up once he's off injury. He's still better than ABC who sucks.)

vague jetty
#

Yeah, that's a big issue I'm running into. Eventually I might try playing around with disambiguation, but I want to keep things simple right now. Do you recommend counting sentiment items over selecting the actual text?

void anvil
#

Counting sentiment items is better than just having a -1/1 for good/bad on the post, but it's not necessarily the best option.
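
A minimal sketch of counting sentiment items per post, assuming a tiny hand-made lexicon (the word lists and the function name `count_sentiment_items` are illustrative, not from any library). It also shows a limitation: "wasn't great" is counted as positive because negation is ignored.

```python
import re
from collections import Counter

# Tiny hypothetical lexicon, for illustration only.
POSITIVE = {"great", "better", "dominant", "good"}
NEGATIVE = {"bad", "badly", "worst", "sucks"}


def count_sentiment_items(post):
    """Return (positive_count, negative_count) for one post."""
    words = re.findall(r"[a-z']+", post.lower())
    counts = Counter()
    for word in words:
        if word in POSITIVE:
            counts["positive"] += 1
        elif word in NEGATIVE:
            counts["negative"] += 1
    return counts["positive"], counts["negative"]


post = (
    "He wasn't great at the u18's. He started off the year in Kingston "
    "pretty bad, but got a lot better later on. He can be dominant but "
    "he can also have some bad showings."
)
print(count_sentiment_items(post))
```

Keeping both counts per post, rather than a single label, preserves the mixed signal that the example post carries.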

obtuse kettle
#

Not sure if this is the place to ask. I'm a computer science student who's currently taking calc. I love Microsoft's ability to annotate equations (I think it uses LaTeX). My issue is that during work, if I get a chance to study, I'm not able to use downloadable apps. Do you guys have a good note-taking web app or web word processor that supports math equations? Google Docs is super limited :/

desert oar
#

that is an odd one indeed. you could use markdown in a jupyter notebook, i think there are free notebook servers out there

#

otherwise i think google docs is your best bet

terse pewter
#

Hey does anyone know if date is a discrete or continuous numerical type?

#

I would think that it would be discrete

small ore
#

@obtuse kettle If Jupyter Labs is okay for your purpose ( It can in addition to taking notes with equations also run various programs inline. Can also have graphs, tables, etc) then Microsoft Azure is one option. Requires a login

obtuse kettle
#

@small ore Why azure over digital ocean or other vps?

small ore
#

I didnt say over anything else. That was the only one I know which has support for jupyter notebooks

obtuse kettle
#

I see. Thank you for your input, you as well salt 😃

obtuse kettle
#

Jupyter is a no go =/

#

Can't access my server from work

vague jetty
#

There's OverLeaf, but it's probably OverKill

#

(☞゚ヮ゚)☞

feral lodge
#

@obtuse kettle I also recommend overleaf for notes, I always use it for math assignments etc. It's like google docs for latex

obtuse kettle
#

Does it have a web app, free?

feral lodge
#

@terse pewter Date should be a discrete variable. If it's date/time its probably continuous though

terse pewter
#

Thank you!

feral lodge
#

ctrl-enter to compile the pdf so you can see what you're doing

terse pewter
#

That makes sense since time can be hours minutes seconds ms,etc

feral lodge
#

Agreed!
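
A small illustration of the distinction, using only the standard library: plain dates advance in whole-day steps, while datetimes can differ by arbitrary fractions of a second. The specific dates are made up.

```python
from datetime import date, datetime

# Dates map to integers (days since a fixed origin), so steps are whole days.
d1, d2 = date(2019, 1, 1), date(2019, 1, 2)
day_step = d2.toordinal() - d1.toordinal()
print(day_step)

# Datetimes can be arbitrarily close together, down to microseconds.
t1 = datetime(2019, 1, 1, 12, 0, 0)
t2 = datetime(2019, 1, 1, 12, 0, 0, 500000)
gap_seconds = (t2 - t1).total_seconds()
print(gap_seconds)
```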

#

@carmine lava https://www.pyimagesearch.com/2015/09/14/ball-tracking-with-opencv/ This blog entry is a few years old, I hope it's not outdated. I think it's a nice first exposure to object tracking though! Regarding papers, it might be difficult to find something general to learn the basics from; you'll probably mostly find papers exploring specifics and trying to improve the state-of-the-art for difficult problems

carmine lava
#

@feral lodge thanks but i have seen it

#

Now what I am trying to do is give each object a unique ID and track it: while the object is in the frame it should get ID 1, and if it goes out of the frame and comes back, the ID should be the same. @feral lodge 😱 If you find something close to this, please let me know, I'd really appreciate that

obtuse kettle
#

Overleaf seems really cool so far! Thank you for the suggestion! But what's the catch... All this for free? I can store as many files as I want?

feral lodge
#

100 MB storage for starters, you can raise it to 1GB for free if you do their referral stuff. If there is some nefarious dark side to overleaf I haven't seen it yet 👌 @obtuse kettle

obtuse kettle
#

I see. Thank you again for this ^_^ time to learn this magic

feral lodge
#

@carmine lava I can't help you much, I have hardly touched object tracking 😖 I don't think I've ever seen work on recognizing and labeling certain individuals among some object class; only class labeling (i.e., labeling objects as a coffee cup rather than coffee cup #345 and coffee cup #346). I'll keep my eyes open though if I see something! What kinds of objects are you trying to label?

carmine lava
#

Person

#

I am training on persons, but the thing is, when a person goes out of frame and comes back, the ID changes. My requirement is that they should keep the same ID

#

@Slandön# what are you using for detection i mean which lib faster rcnn or SSD which one

#

@feral lodge any tutorial on using faster rcnn

feral lodge
#

Sorry my friend, I have no idea about the libraries! If there's a big difference you can probably find some benchmarks or discussions online

#

Couldn't check very carefully though, duty calls 👺

hardy crag
#

@carmine lava the problem with individual people is that you need a good dataset. If you have a good dataset you can use TensorFlow's object detection, which includes several faster_rcnn architectures

spring yoke
#

./m

crude flame
#

anyone tried the book "Datascience from Scratch" by Joel Grus? I'm thinking of getting it, since I want to get some datascience book and it seems to put more emphasis on actually understanding things rather than just learning to feed a black box

#

uhm also I might be missing something, but when I check out the recently pinned guide on r/learnmachinelearning, the PhD-version of it just includes no guide (?)

desert oar
#

any book teaching you to feed a black box isnt worth your time or money

#

unfortunately i dont know that book in particular, nor do i have any sane route to data science because my own personal path was so winding

#

but any good data science book will have both math/theory and applied content

#

my gripe with some "machine learning" books is they dont actually spend much time on applications

#

it should be a push-and-pull of: learn a concept, learn the math behind it, and finally learn to apply it

#

book exercises ideally would be both practice math/theory problems (e.g. derive some expression, prove a theorem), and practice mini-projects (simulate or download a dataset, then implement a model and make inferences on it)

void anvil
#

Honestly you don't need to know fuckall about the math behind machine learning because of the big, beautiful packages set up by people. 99% of the effort is in data manipulation and experimental verification.

ripe lava
#

@crude flame I can highly recommend An Introduction to Statistical Learning by James et al. All the main models used in supervised and unsupervised learning are described. I found it very intuitive and with just the right amount of math to get started (if you want to go really technical you can check Elements of Statistical Learning, the advanced book that they have written).

#

and it's free!

#

application exercises are in R though... But it should be easy to replicate most of the results using scikit-learn

wide oxide
#

Introduction to statistical learning is a tough book to start with

#

@ripe lava

ripe lava
#

I think anyone who once had a basic stats course should be fine

wide oxide
#

These days I am very much into behaviour analysis, and psychology.

#

But, I am completely clueless. I don't know which course to take or which thing to follow.

crude flame
#

I should have mentioned that I'm doing a PhD in mathematical physics, so I'm fine with technical stuff. Thanks for the suggestion!

orchid lintel
#

@crude flame If you really want the foundations of the algos, Elements of Statistical Learning is where you wanna be. There's a free online Stanford class on it too, with guest lectures by like the actual guy who came up with CART and stuff. https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about

#

@wide oxide If I were starting out now, I'd be doing DataQuest, I really like their material.

wraith crow
#

Is anyone familiar with k-means clustering? I'm having trouble with using a template in my exercise.

ripe lava
#

I might be repeating myself... but go check out the ISLR book, it's very well explained in there

wide oxide
#

@orchid lintel Let me check their material. I've completed like 2-3 courses on Datacamp for Data Scientist with Python.

#

Their material was more like learn python for data science

reef bone
#

@wraith crow Most people here will be familiar with k-means so go ahead and ask your question

#

Oh my god

#

Who did I ping

#

Oh you changed your name

#

All good

wraith crow
#

Hehe 😃

#

I can't get this function to work:

reef bone
#

Are you using sklearn's k-means implementation?

wraith crow
#

Hmm, the bot is not letting me upload the function

#

No, it's a function I have gotten from a template in the exercise

#

```python
class KMeans:
    """ Simple K-Means implementation. Note that you can access
    the cluster means and the cluster assignments once you have
    called the "fit" function. The cluster means are stored in the
    variable 'cluster_means' and the assignments to the cluster
    means in 'cluster_assignments'. You can also use the function
    'assign_to_clusters' to obtain such assignments for a new set
    X of points.
    """

    def __init__(self, n_clusters=2, max_iter=30, seed=0, verbose=0):
        """ Constructor for the model.

        Parameters
        ----------
        n_clusters : int
            The number of clusters that should be
            found via the K-Means clustering approach.
        max_iter : int
            The maximum number of iterations (stopping condition)
        seed : int
            Number that is used to initialize the random
            number generator.
        """

        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed

    def fit(self, X):
        """
        Fits the K-Means model. The final cluster assignments
        (i.e., the indices) and the cluster means are stored
        in the variables 'cluster_assignments' and 'cluster_means',
        respectively, see the end of this function.

        Parameters
        ----------
        X : Array of shape [n_samples, n_features]
        """
```
wide oxide
#

(Which course are you taking? @wraith crow )

wraith crow
#

It's modelling and analysis of data

#

got it as a 'side-subject'

reef bone
#

Oh, this looks like a Python problem. Let's make sure you instantiate KMeans first, before calling the method

wraith crow
#

Ah, I figured I might be calling it wrong?

reef bone
#

Try doing

```python
kmeans = KMeans()
kmeans.fit(data)
```
wraith crow
#

Oh my god, that was it. Thanks a lot!

wide oxide
#

Why there is no subject like that for us ;__;

#

I want to learn data science and I end up learning models but, no practical stuff, like implementing..

wraith crow
#

Oh one thing, can I also initialise it using the init before kmeans.fit(data)?

#

For example if I need some other parameters than the ones in the code, like KMeans(n_clusters=2, max_iter=30, seed=0).

#

Hmm seems to work

#

When I write it like
kmeans = KMeans()
kmeans.init(2,30,0)
kmeans.fit(data)

reef bone
#

This is a bit of a weird situation

#

Normally the method init would actually be called __init__ which makes it a constructor - a method that is automatically called when you instantiate a new object, but in this case it isn't

#

If it was named that, you could instantiate like this:

kmeans = KMeans(2, 30, 0)
#

And it would automatically call the init function with those values

#

But since it's not named properly, the method has to be called separately (just as you have done)

#

Oh wait

#

I'm dumb

wraith crow
#

well, they write I shouldn't change the original cell which says:
def init(self, n_clusters=2, max_iter=100, seed=0, verbose=0):
""" Constructor for the model.
So I need a way to change the parameters for the different exercises without changing that one in the original cell.

reef bone
#

It probably is called __init__, but Discord is making it look like that because the underscores translate to underline, right?

wraith crow
reef bone
#

Yeah I should have realized, sorry

#

It's 6am for me

wraith crow
#

no worries!

reef bone
#

So what you've done is fine

wraith crow
#

It's 07:05 here 😕

reef bone
#

But you can also do

means = KMeans(2, 30, 0)

And it will automatically call the __init__ function and pass those values

#

That's the intended purpose of the method

wraith crow
#

okay great, thank you! I've been stressing a lot over this, so you have been my saviour 😄

Seems like means = KMeans(2, 30, 0) doesn't work, because when I change it to means = KMeans(4, 30, 0) for example, it doesn't give me 4 cluster means, but still 2

#

Like this:

reef bone
#

The variable on the second line is called means

#

But then you're fitting to kmeans

#

So they are different

wraith crow
#

ah that did it

#

just needed that k there
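
To recap the two points in code: Python calls `__init__` automatically when the class is instantiated, so parameters can be passed right there, and the object that gets fitted must be the same variable that was configured. The class below is a trimmed-down, hypothetical stand-in for the exercise template (constructor only), not the template itself.

```python
class KMeans:
    """Trimmed-down stand-in for the exercise template (constructor only)."""

    def __init__(self, n_clusters=2, max_iter=30, seed=0, verbose=0):
        # Runs automatically on KMeans(...); no separate init call is needed.
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.seed = seed
        self.verbose = verbose


default_model = KMeans()                             # uses the defaults
kmeans = KMeans(n_clusters=4, max_iter=30, seed=0)   # overrides without editing the class

# Use the configured object afterwards; assigning it to a differently named
# variable would leave any older `kmeans` object (with 2 clusters) in use.
print(default_model.n_clusters, kmeans.n_clusters)
```

Passing keyword arguments keeps the original cell untouched while still changing the settings per exercise.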

reef bone
#

I need to catch some sleep now but feel free to ask questions here, very smart people lurk this chat

wraith crow
#

I have to go to sleep as well now, thanks a lot for your help! You made my day (or night)! 😃

#

Goodnight

wide oxide
#

@reef bone How did you learn data science?

reef bone
#

I'm still very much in the process of learning, but I picked data sciencey modules in uni and was blessed with incredible lecturers that made the field very accessible to me and eventually ended up doing a deep learning dissertation in my 3rd year

wide oxide
#

How should a beginner start?

#

I have tried studying from Datacamp, Introduction to Statistical Learning and many other courses like Andrew NG.

#

For datacamp, I completed 3-4 courses and ended up learning nothing practical.

#

ISL: Completed 4 chapters, understood 25% of it.

#

Andrew NG's course was good but, assignments were different.

reef bone
#

I've heard Introduction to Statistical Learning is a very good book. Apart from that however I don't think I'm the right person to answer this question, because I went the uni route so it was very different for me (learning from lectures, being driven by marked assignments, and the ability to ask for help directly in lab sessions). The book I learned the most from was Bishop's Pattern Recognition and Machine Learning - an incredible book, but a little math heavy and might be difficult to read for a beginner. I don't know much about those courses you have mentioned so I can't comment on them.

wide oxide
#

No problem, thank you!

#

This is what we are going to learn this semester in Mathematics:

reef bone
#

Looks like you'll get a thorough introduction to regression which is an important technique

#

Linear regression on its own is powerful

#

And then logistic regression introduces you to the sigmoid curve which forms the backbone of neural networks (although nowadays the ReLU activation seems to be more popular as it learns faster)
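#

For reference, a small sketch of the two functions in numpy. The comment about slopes is one common explanation for why ReLU tends to train faster, not the whole story:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1); used by logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged and clips negatives to 0
    return np.maximum(0.0, z)

z = np.array([-4.0, 0.0, 4.0])
print(sigmoid(z))
print(relu(z))

# The sigmoid's slope s * (1 - s) shrinks towards 0 for large |z|,
# while ReLU keeps a slope of 1 for every positive input
s = sigmoid(z)
sigmoid_slope = s * (1.0 - s)
```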

#

I think you'll learn a lot and you'll also get a good mathematical background that will pay off once you look into more complicated things

#

Out of interest, which level of education is this?

wide oxide
#

I am in 2nd semester of B.E. Computer Science Engineering

#

The thing is we are learning nothing more than formulas. The professor taught us how to calculate central tendency for individual, discrete, and continuous series but, didn't tell us where to use which one. No concept of outliers and nothing ;_;

#

Few commands in R and Python can calculate the same things

reef bone
#

That's annoying, but I'm a firm believer that having a good grasp on the mathematical background pays off

wide oxide
#

I love Mathematics so I try to find things by myself

reef bone
#

And you seem to be very driven on your own which is great

#

In university you shouldn't be afraid to go speak to your lecturers, maybe there is a reason why some things aren't covered yet, and I'm sure the lecturer wouldn't mind setting up a meeting with you and going over those things in more depth

#

I really need to get some sleep now, I wish you luck on your journey!

wide oxide
#

Thank you!

orchid lintel
wide oxide
#

@orchid lintel Thank you!

#

how do I use it?

wraith crow
#

@reef bone Can I ask you again?

reef bone
#

Sure, no guarantee I'll have the answer though

wraith crow
#

I just did the k-means clustering for a simple data set, and now I have to do it for an image.

#

getting an error with reshape

reef bone
#

Hmm

#

data = china / 255.0 so seeing as all values in china will be in the range 0 - 255 (that's a range you'll see quite often, since these are the numbers we can represent using 8 unsigned bits), dividing them by 255 will rescale them to 0 to 1 range, which in this case is only useful for the visualisation part (it basically becomes a coefficient we can multiply other values with easily)

#

data = data.reshape(427 * 640, 3) in here, we start with data being a 3D array with dimensions (427, 640, 3), so it's an image with 427 pixels in height, 640 pixels in width, and 3 colour channels (r, g, b)

#

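By calling reshape(427 * 640, 3) on it, we get a 2D array with one row per pixel and one column per colour channel. The rows of the image are laid end to end, kinda like this (each letter is one pixel):
x x x
y y y
z z z
Turns into:
x x x y y y z z z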
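#

A small sketch of the same thing on a made-up 4 x 5 image (the array here is a hypothetical stand-in for the real picture), including the way back to the image layout:

```python
import numpy as np

# Hypothetical stand-in for the image: 4 x 5 pixels, 3 colour channels, values 0..255
image = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)

data = image / 255.0                 # rescale to the 0..1 range
height, width, channels = data.shape

# One row per pixel, one column per colour channel: [n_samples, n_features]
pixels = data.reshape(height * width, channels)
print(pixels.shape)                  # (20, 3)

# The same pixels, back in image layout
restored = pixels.reshape(height, width, channels)
print(np.array_equal(restored, data))  # True
```

Row 7 of pixels is the pixel at image position (1, 2), because 7 = 1 * 5 + 2.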

wraith crow
#

Okay, and that's important for the "X : Array of shape [n_samples, n_features]" requirement right?

reef bone
#

that sounds right

#

i probably can't explain the error you're getting without seeing more of the code

#

mainly what the variables hold

#

it comes from trying to reshape an array in a way that's not possible

wraith crow
#

Looks like this one is the issue:

#

cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape)

#

Yeah, it's just the reshaping that's not working

#

How would you reverse the reshaping of:
data = data.reshape(166 * 250, 3)
after it was processed by the function?

reef bone
#

reshape(166, 250, 3)

#

but you're trying to reshape the cluster means

wraith crow
#

Yeah

reef bone
#

can you print the shape of cluster means and cluster assignments

wraith crow
#

The first is assignments:

reef bone
#

i'm trying to understand the entire process of how that's supposed to work

#

i don't think the means are particularly useful here

wraith crow
#

"TODO: Segment the image by searching for 5 clusters

in the RGB space (see slide 21 of L13); use

'max_iter=5' and 'seed=0' as parameters."

reef bone
#

when you do print(CA.shape), what do you get?

wraith crow
#

This is the slide they refer to:

#

I get (41500,)

#

Then #cph_image_tiny_recolored = new_colors.reshape(cph_image_tiny.shape) should be the correct idea right? Try to get the cluster means from the colours picture. But is assignments missing there?

reef bone
#

sorry i'm lost

#

i'm very confused about why they make you reshape the image data in the first place

wraith crow
#

if I don't reshape it before calling the function, the shape of cluster means becomes (5, 250, 3)

#

If it's reshaped

#

"X : Array of shape [n_samples, n_features]"

#

then data have n samples (pixels) and 3 features

#

but that reshape was from the website I linked, so I'm not sure it's needed

#

Hmm, it does say consider each pixel as a point in R^3..

#

@reef bone

reef bone
#

oh right

#

can you do

#

on the line where you have new_colors = CM do new_colors = CM[CA]

#

that should give you an array of shape (41500, 3)

#

and then you can reshape this back to (166, 250, 3)

wraith crow
#

That worked, thank you!! 😄

#

Look at that beautiful thing

reef bone
#

nice! try playing with the K number (n_clusters) and it should look better as you go higher

#

sorry I took so long I was a little confused about what they want you to do

#

basically with new_colors = CM[CA] we are using the assignments as indices for the means, so to each sample (pixel) we're assigning its mean in the 3D colour space
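#

Here is that indexing trick on tiny made-up arrays (the values of CM and CA are hypothetical), next to the loop it replaces:

```python
import numpy as np

# Hypothetical values: 3 cluster means in RGB space and assignments for 6 pixels
CM = np.array([[1.0, 0.0, 0.0],    # cluster 0: red
               [0.0, 1.0, 0.0],    # cluster 1: green
               [0.0, 0.0, 1.0]])   # cluster 2: blue
CA = np.array([0, 2, 2, 1, 0, 1])  # cluster index of each pixel

# Integer-array indexing: row i of the result is CM[CA[i]]
new_colors = CM[CA]
print(new_colors.shape)            # (6, 3)

# Same result with an explicit loop
looped = np.array([CM[k] for k in CA])
print(np.array_equal(new_colors, looped))  # True

# Reshape back to a 2 x 3 "image" with 3 channels
recolored = new_colors.reshape(2, 3, 3)
```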

wraith crow
#

Yeah, it's a great solution

reef bone
#

numpy can be a little esoteric sometimes (intuition would tell you to use a loop here) but once you learn the ins and outs it's an amazing tool

wraith crow
#

Should it be easy to select a random subset to fit the k-means model?

#

TODO: Segment the image 'copenhagen.jgp' in a similar

fashion (using n_clusters=16, max_iter=5, seed=0). Instead

of using all data points/pixels, consider a random subset of

size 5000 to fit the k-means model and to find suitable cluster

centers. You can use the numpy.random.choice function to select

a random subset of indices (without replacement).

#

Ah, otherwise it has to fit for 300k points 😛

reef bone
#

sure, that should be fairly easy

wraith crow
reef bone
#

ok so numpy.random.choice() only works for 1D arrays, and we want to draw from a 2D array

#

so we can grab the indices from the 0th axis (samples) like this

#

indices = np.random.choice(data.shape[0], 5000, replace=False)

#

and then slice into the data array with the indices like this

#

chosen_data = data[indices, :] that will select only the randomly chosen indices
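#

A rough sketch of the subset selection on a made-up array. Note that np.random.choice takes either a 1D array or an integer n (meaning 0 .. n-1). Passing the number of rows gives integer indices, whereas drawing from the pixel values themselves would give floats, and numpy refuses to index with those:

```python
import numpy as np

np.random.seed(0)                       # only so the sketch is repeatable
data = np.random.rand(20000, 3)         # hypothetical pixel array [n_samples, n_features]

# Draw 5000 distinct row numbers from 0 .. n_samples - 1
indices = np.random.choice(data.shape[0], size=5000, replace=False)

# Keep only the chosen rows, all columns
chosen_data = data[indices, :]
print(chosen_data.shape)                # (5000, 3)
```

replace=False is what makes it "without replacement", so no pixel is picked twice.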

#

sorry I have to run off now so I don't have time to go over this in more detail

#

but you should be able to figure out the rest

wraith crow
#

Okay thanks, I will try!

reef bone
#

if you run into any problems just ask here and someone will help I'm sure

#

numpy also has fairly good docs so feel free to refer to this if you're having trouble understanding it

wraith crow
#

Thank you again! 😃

#

get a "arrays used as indices must be of integer (or boolean) type"

#

I think I got it

wraith crow
#

Hmm, how can I reshape it back into the full image?

#

from the 5000 points

#

Hmm, I only have 5000 assignments, but I need 272640 to assign every pixel.

#

@reef bone Are you back?

reef bone
#

Does your KMeans class have a predict method? Or similar? You want to fit using the subset (5000 samples) and then predict the closest cluster for each data point (all samples)

wraith crow
#

Doesn't seem like it has a predict method

reef bone
#

I think they want you to use this method

#

This is a bit of a struggle I'm on my phone

#

You want to pass it all data (X) and the means given by your fitting

#

And it will return the full assignment indices which we have worked with before

wraith crow
#

Hmm, so I can use assign_to_clusters(data,means) where means is the one I got from using the full function?

#

on the 5000 points

reef bone
#

Yes

#

It will return the full assignments

#

So you want to store them in a variable

#

assignments = kmeans.assign_to_clusters(data, means)
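#

In case it helps to see what such a method does: a sketch of a nearest-mean assignment in plain numpy. The helper name assign_to_nearest is made up here, and the assignment's own assign_to_clusters may work differently:

```python
import numpy as np

def assign_to_nearest(X, means):
    """Return, for each row of X, the index of the closest row in means.

    Hypothetical helper: the assignment's own assign_to_clusters may differ.
    """
    # Pairwise squared distances, shape (n_samples, n_clusters)
    diffs = X[:, None, :] - means[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

# Means as they might come out of fitting on the 5000-point subset
means = np.array([[0.1, 0.1, 0.1],
                  [0.9, 0.9, 0.9]])

# "All pixels" of a tiny 2 x 2 image, flattened to [n_samples, n_features]
data = np.array([[0.0, 0.2, 0.1],
                 [1.0, 0.8, 0.9],
                 [0.2, 0.0, 0.0],
                 [0.7, 1.0, 0.8]])

assignments = assign_to_nearest(data, means)   # one cluster index per pixel
recolored = means[assignments].reshape(2, 2, 3)
print(assignments)
```

The expensive iterative part (fitting) only sees the subset; this single pass over all pixels is cheap by comparison.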

wraith crow
#

Trying to run it now

#

Seems like it uses a lot of computation though, which seems like I'm missing the point of using 5000 points

#

Hmm, still running

#

never resolved, maybe it ran into an infinite loop. I'll just write a comment and use the full data when plotting the image

reef bone
#

Thats odd, kmeans is generally quite fast

#

If you show your code I can take a look

wraith crow
#

Had uncommented the two yellow dots, and commented the yellow cross before

reef bone
#

I think you should do assign_to_clusters(data, CM)

#

Not CA

wraith crow
#

oh.. yeah that was a mistake. I'll try to run it again

#

it worked, and was much quicker

reef bone
#

The predictions are fast, its the fitting that takes time

wraith crow
#

Have you seen this equation before?

reef bone
#

Possibly, but I dont recognize it off the top of my head

#

What is it for?

wraith crow
#

it's non-linear regression

lapis sequoia
#

Hi guys, I wanted to start studying Data Science with Python or R through DataCamp. I know coding is about skill development and not where you study, but I wanted to ask if that website is good enough to get at least a good grasp on what Data Science is like.

wraith crow
#

Anyone that can help with this?

#

have to implement this approach

hardy crag
#

@lapis sequoia never tried it , but their podcast is fairly good and the host seems competent enough

#

@wraith crow hard to say what you should do without the additional code you've been provided

wraith crow
#

This is the 'dummy version'

#

this is most of the function

#

not sure where and how to construct that new equation

#

TODO: Implement the non-linear regression approach;

generate corresponding plots for sigma=0.1,

sigma=1.0, and sigma=10.0 by computing, for each

xbar in X_plot, the corresponding prediction

wraith crow
#

@reef bone Do you remember PCA well?

reef bone
#

I have some idea

wraith crow
#

I missed the earlier assignment with PCA, so I'm a bit lost there

reef bone
#

PCA is a fairly complex algorithm in comparison with kmeans, are they asking you to implement it yourself or are you using some library?

wraith crow
#

We have some templates available

#

Does this one seem fitting?

#

I just don't see training data in that example

reef bone
#

PCA is an unsupervised algorithm so training data will be similar to what you had with clustering

wraith crow
reef bone
#

Sorry I can't go over the entire thing with you today

#

The code you have shown looks good, most importantly it lets you extract the eigenvalues and eigenvectors

#

Because they are sorted by eigenvalues, the components that describe the data the most will be at the top

wraith crow
#

But they used "data" as input there, but I have 4 different 'data' as in trainset, testset, trainlabels and testlabels

reef bone
#

Looks like you're passing something called diatoms to the pca function

#

By definition PCA ignores labels, it's unsupervised

#

The labels might be useful later on, for example PCA can sometimes be used as preprocessing technique to reduce dimensionality before you implement some kind of a classifier

#

But PCA itself only decorrelates data and reduces its dimensionality
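#

A minimal sketch of both effects on random data, assuming the same layout as the template (features in rows, samples in columns). The data and the choice of k = 2 are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data in the template's layout: d features (rows) x N samples (columns)
data = rng.normal(size=(5, 200))
data[1] = 2.0 * data[0] + 0.1 * data[1]      # make two features strongly correlated

# Centre each feature, then eigendecompose the covariance matrix
data_cent = data - data.mean(axis=1, keepdims=True)
evals, evecs = np.linalg.eigh(np.cov(data_cent))

# eigh returns ascending order; flip so the largest eigenvalue comes first
evals = evals[::-1]
evecs = evecs[:, ::-1]

# Project onto the first k principal components
k = 2
projected = evecs[:, :k].T @ data_cent        # shape (k, N)

# Fraction of total variance kept by those k components
explained = evals[:k].sum() / evals.sum()
print(projected.shape, round(float(explained), 3))
```

No labels appear anywhere in this, and the projected components come out uncorrelated with each other.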

wraith crow
#

So if I should try to run

#

def pca(data):
    # Extract data dimensions
    d, N = data.shape
    # First, center the data
    center = np.mean(data, 1)
    centers = np.matlib.repmat(center, N, 1)
    data_cent = data - np.transpose(centers)

    # Compute covariance and its eigenvalues from centered data
    Sigma = np.cov(data_cent)
    evals, evecs = np.linalg.eigh(Sigma)

    # Return eigenvalues and eigenvectors and -- for the sake of the lecture -- also the centered data
    return np.flip(evals, 0), np.flip(evecs, 1), data_cent

PCevals, PCevecs, data_cent = pca(testset)

# PCevals is a vector of eigenvalues in decreasing order. To verify, uncomment:
# print(PCevals)

# PCevecs is a matrix whose columns are the eigenvectors listed in the order of decreasing eigenvalues

#

do you have an idea of what data should be?

reef bone
#

Take a look at what diatoms is in the template code

#

It will be defined in one of the previous cells

wraith crow
#

classes seems a bit like labels

native rivet
#

Hi guys

#

I need help

#

Im new to machine learning

#

But cant fig out from where to start

#

I want to learn ML

#

Please help me

#

I have 0% knowledge about machine learning

lapis sequoia
#

@native rivet I'm not certain on exactly where to start since I barely got into data science myself, but I was told DataCamp machine learning courses are solid

supple ferry
#

Hey there everyone. I have a Pandas related problem and would be glad if someone can help me out. I posted this question on stackoverflow:
https://stackoverflow.com/questions/54288604/applying-a-function-which-involves-multiple-boolean-operations-on-multiindexed-d

#

I keep trying to solve it for the past hours, but no success

wide oxide
#

@native rivet DataCamp's "Become data scientist with python" course is more like "Learn Python for data science"

#

I did 4 courses on it and left it

native rivet
#

@wide oxide do you know python than right

#

Sorry i mean machine learning

wide oxide
#

Nope

small ore
#

@native rivet Depending on how much math you already know and how much you can handle, there are different courses. For an understanding of the principles behind ML, Andrew Ng's course on Coursera and one by Columbia University on edX are nice. The latter is heavier on math than the former. There is a tonne of other material and courses too

native rivet
#

Bro i want to go from scratch like

#

All algebra , calculus etc

#

From zero

reef bone
#

This book covers a lot of the very basic math

native rivet
#

I need something like ml course step by step

#

From maths to intermediate

wide oxide
#

I am good with high school level mathematics

native rivet
#

Can you just teach me basics which req to ml

#

After that i can enrol udacity nano degree

wide oxide
#

Should I start with course by Columbia?

small ore
#

@native rivet Andrew Ng's course is manageable with very little math knowledge. He even has a couple of lessons on matrix multiplication and such and then covers the required math as he goes through. But learning math on your own from any source would be good for a better understanding

vapid mauve
#

Are numpy.append, numpy.stack manipulations slower than normal lists? Am I supposed to be using normal lists and append, and turning that into a numpy array using numpy.array?

native rivet
#

@small ore its not with python

wide oxide
#

There are codes available that implement the same on Python

small ore
#

@wide oxide Some statistics and probability would be good but if you are sharp enough you can manage by going through some YT videos

native rivet
#

So cant do any exercise properly

#

Do you guys recommend me udacity nanodegree?

small ore
#

@native rivet Well, you said from scratch. So given the kind of courses I have seen it is best to take a course that isnt tool specific and then take a small course/read a tutorial for python modules

wide oxide
native rivet
#

Udacity will not take me from scratch?

small ore
#

I have no idea abt the Udacity course

wide oxide
small ore
#

@wide oxide Those topics in your Math course will certainly help

#

Also there are old columbia courses you can audit if you dont want to take the live course. See 📌 for a link

wide oxide
#

Thank you very much!

#

Do you have something for data science? @small ore

#

or the same is good to start with it as well?

small ore
#

There is a tonne of material. Courses/Texts/online material/free datasets/Blogs/etc. But good to start with learning the math and then basics of ML. Data science as far as I understand has many elements to it. Data exploration and ML should start you out well. You can learn data exploration when you start to learn some python ways of doing it

wide oxide
#

I will start with the course then. Thank you very much!

#

I want to get into behavioural analysis and higher mathematics.

wide oxide
void anvil
#

It depends on what side of ML you want to be on

#

there's the data science side where you don't need to know the math and treat each algorithm as a grey box, knowing what inputs / outputs / assumptions you make feeding / getting results from each model

wide oxide
#

I am just starting

void anvil
#

then there's the development side where you're developing faster or better ML algorithms where it's super important

#

Obviously if you understand how the algorithm works you will know what's happening much better

wide oxide
#

From where I can learn this level Mathematics?

void anvil
#

Calculus 1-3, Differential Equations, Statistics, Stochastic Calculus

#

they're Freshman - Jr year math

#

stochastic calculus, depending on the course, can be a grad level class

#

you'll also probably want to take classes on econometrics

wide oxide
#

from where?

#

Statistics is already so vast

void anvil
#

MIT OCW is a good place to start

#

introduction to statistics

#

Look for statistics for engineers probably

wide oxide
#

Last one has no lecture videos

#

@void anvil what do you think about this?

#
#

I have completed 5 videos of this before

#

did assignments as well (in semester breaks) and then semester started so had to leave

void anvil
#

that's fine as well

#

it's a harder course

wide oxide
#

I was able to understand 60-70% of the lectures

#

and was able to solve 50-60% of the assignments

#

Do you still think that I should follow the same course?

sharp gorge
#

hey, could anyone recommend a reliable source for any sort of data? The actual contents of the data doesn't really matter as long as I can scrape it easily or get it through an API

reef bone
#

Kaggle has datasets and competitions too

sharp gorge
#

thank you! <3

heavy apex
#

Omg, thank you @reef bone, I've been looking for someplace to get some data to play around with more.

void anvil
acoustic bone
#

not "data science", but can someone direct me to some resources on how matplotlib does its thing? like things will only work if you assign them to a variable

seems like an odd design

simple crag
#

?

night fulcrum
#

hello, does somebody here have experience with openai gym?
i have a question about it, i cant run ale anymore because my ubuntu doesnt start anymore since i've installed my new gpu,
if somebody knows, does the space invaders environment give the current score as score or the delta between the last step?

lapis sequoia
#

is this related to python

night fulcrum
#

yes

#

openai gym, tensorflow and keras are all in python

#

@lapis sequoia so?

lapis sequoia
#

you dont have to ping me

ripe lava
#

Hi guys! I am trying to figure out how to select candidate features to feed into a gradient boosted tree model (I am using XGBoost). Blindly feeding all my candidate variables doesn't do the trick as it significantly increases the search space and seems to deteriorate the validation score compared to a carefully selected list of candidate features. My idea was to train the model on all potential features and then make a selection based on feature importance (i.e. discard all the variables that don't contribute too much). What do you guys think of this approach? (asking here because somehow I can't find anything online on that topic)

agile epoch
#

I'm not super sure what you're talking about but it kinda sounds like pca

ripe lava
#

PCA transforms the variables which I would like to avoid.

languid adder
#

I'm trying to plot a groupby on a pandas DF with seaborn but I notice I have to use the .head(n=1000) to get it to work. If I omit the .head() I get an error saying the groupBySeries isn't callable. Here's the full code:

names = ["Sepal L","Sepal W","Petal L","Petal W","Class"]
irises = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data",names=names)
df = irises.groupby("Class")
sns.countplot(x=df["Class"].head(1000),label="Count")
plt.show()
agile epoch
#

irises.groupby("Class") is a pandas group by object not a dataframe. You need to apply something to it to get a proper dataframe back. Apparently head works which I did not know. Try .head(len(df)) lol
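#

A quick sketch on a made-up miniature of the iris table, showing that the groupby result only becomes ordinary data once you aggregate it:

```python
import pandas as pd

# Hypothetical miniature of the iris table
irises = pd.DataFrame({
    "Sepal L": [5.1, 4.9, 7.0, 6.4, 6.3],
    "Class": ["setosa", "setosa", "versicolor", "versicolor", "virginica"],
})

grouped = irises.groupby("Class")       # a DataFrameGroupBy, not a DataFrame
print(type(grouped).__name__)

counts = grouped.size()                 # Series: number of rows per class
means = grouped["Sepal L"].mean()       # Series: mean sepal length per class
print(counts.to_dict())
```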

late garnet
#

@languid adder A count plot by nature will automatically do what you are trying to do with pandas; count the number of classes. You can simply do this.

sns.countplot(x='Class', data=irises)
agile epoch
#

That's so weird that .head() doesn't error that

#

Is that a bug?

languid adder
#

ah great @late garnet that's what I was after

#

.head is in the official documentation of the group by object so that's why I tried it. That's also why I was confused I couldn't work with it like a DF

languid adder
#

I'm trying to clean up some data and was wondering if this is a good way or if there is a better way
I want to create a new column in my DF that is based upon 2 other columns:

def splitIntoName(value):
    arr = value.split(", ")
    if len(arr) > 1:
        return arr[1]
    return None
df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])
df["Name"].fillna(df["OtherName"],inplace=True)
spark nimbus
#

with a numpy array of floats x, and a threshold float f, how would I make x the same array as it was before, but with any value below f turned to 0?

desert oar
#

@languid adder that seems like a reasonable way to go

#

err wait no

#
df["Name"] = pd.Series([splitIntoName(x) for x in df.description.dropna()])

this line is bad, dont do this

#
df["Name"] = df["description"].map(splitIntoName, na_action='ignore')

do this instead

#

also dont use inplace

#

so do this:

df["Name"] = df["description"] \
    .map(splitIntoName, na_action='ignore') \
    .fillna(df["OtherName"])
#

@spark nimbus you arent allowed to think when you write python code, otherwise you think too hard and ask silly questions ;)

x[x < f] = 0
spark nimbus
#

that actually works?

desert oar
#

why wouldnt it

#

have you read the numpy subsetting docs?

#

which are admittedly not easy to read

#
Most of the following examples show the use of indexing when referencing data in an array. The examples work just as well when assigning to an array. See the section at the end for specific examples and explanations on how assignments work.
#
b = y > 20
y[b]

etc
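#

Spelled out on a small made-up array, with np.where as an alternative if you would rather not modify x in place:

```python
import numpy as np

x = np.array([0.2, 1.5, 0.05, 3.0, 0.9])
f = 1.0                      # threshold

mask = x < f                 # boolean array: True where the value is below f
x[mask] = 0                  # assignment through the mask changes x in place
print(x)

# Non-mutating alternative that returns a new array
y = np.array([0.2, 1.5, 0.05, 3.0, 0.9])
z = np.where(y < f, 0.0, y)
```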

languid adder
#

@desert oar thanks. I'll dive deeper into the map function as I seem to do a lot of these types of mappings so it might be more useful to use .map instead of the construct with pd.Series([])

desert oar
#

@languid adder the bigger problem with your code is the .dropna() will make the whole thing misaligned

#

but yeah .map and .apply are there for a reason. use them

languid adder
#

ah yes the indexes are off.

#

didn't think about that...

desert oar
#

like you could do...

pd.Series([foo(x) if pd.notnull(x) else x for x in df['y']], index=df.index)

or

df['y'].map(foo, na_action='ignore')

i know which one i prefer 😉
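#

To make the misalignment concrete, a sketch on a made-up three-row frame (the snake_case helper name and the sample values are just for illustration):

```python
import pandas as pd

def split_into_name(value):
    # Return the part after the first ", ", or None when there is no such part
    parts = value.split(", ")
    return parts[1] if len(parts) > 1 else None

# Hypothetical frame with a missing description in the middle
df = pd.DataFrame({
    "description": ["a, Alice", None, "c, Carol"],
    "OtherName": ["x", "Bob", "z"],
})

# Misaligned: the list has 2 items, so the new Series gets index 0 and 1,
# and "Carol" lands on row 1 instead of row 2
bad = pd.Series([split_into_name(v) for v in df["description"].dropna()])

# Aligned: map keeps the original index and skips missing values
good = df["description"].map(split_into_name, na_action="ignore").fillna(df["OtherName"])
print(good.tolist())
```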

languid adder
#

yes the latter is more readable

worldly sigil
#

How do you deploy your models? I’ve been experimenting with Flask. Looking to see what others are doing.

placid snow
hexed juniper
#

so this is a bit of a "meta-data-science" question so i hope people wont jump at me 😃 I want to support a small open source data science community and look for the best team collaboration tool (or combination of tools - chat, file sharing, planning etc). do people have some recommendation? I guess discord is not sufficient for that.

south quest
#

We aren't a data science server but I can answer as a large Python community:
Chat: Discord, GitHub Discussions
Collaboration: GitHub mainly, for some internal things we do use Dropbox as well
Planning: We use GitHub Issues and Projects (Projects are very similar to trello but link very well to GitHub repos & issues)

hexed juniper
#

thanks @south quest thats quite reassuring that this combination works well (I'd like a small number of tools)

south quest
#

Yeah, we don't really have a huge number of tools and it has served us well since our move to GitHub. We now use Azure for CI, GitHub for basically all project management and Discord for development chat

#
  • discord has some nice github webhooks so it all integrates nicely
hexed juniper
#

I am new to discord but I quite like what I see so far, esp the python server looks super well organized so I'll be copying some best practices 😃

desert oar
#

@worldly sigil flask works well for my team but i have built up quite a bit of additional structure around it over time

#

If you are just receiving and serving JSON any half decent web framework will do. I like flask because it has been around for a while so it is fairly well tested in the field, but it's also generally simple to just get up and running

worldly sigil
#

@desert oar that's awesome to hear. Yeah, right now it's just receiving and serving JSON so I'll keep it simple til I need something more tailored to my scenario. thanks for the insight.

hexed juniper
#

@worldly sigil there is the lean way of flask and the fat way of django+drf. the good news is that once you adopt a rest/json architecture its really easy to switch

small ore
#

Noob question. Why do you guys use flask/json for data-science?

earnest prawn
#

It's about deploying models to the real world like when you built an auto complete model you want to provide it as a service somehow

languid adder
#

how can you improve resource utilization when optimizing or training a model.

#

For example i have following code:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
paramGrid = [
    {"n_estimators":[3,10,30],"max_features":[2,4,6,8]},
    {"bootstrap":[False],"n_estimators":[3,10],"max_features":[2,3,4]}
]
forestReg = RandomForestRegressor()
gridSearch = GridSearchCV(forestReg,paramGrid,cv=5,scoring="neg_mean_squared_error")
gridSearch.fit(housing_prep,housingLabels)
#

this runs a long time as it needs to do a lot of iterations however I notice my CPU never goes over 10%

#

so I was wondering how I can give it more resources so it's done quicker

languid adder
#

found out the n_jobs parameter does just what I wanted. Much quicker now 😃

languid adder
#

another question about performance...
I have a 6 core i7 processor. Is it best to use hyperthreading or not? At the moment it's disabled so I have 6 threads but was wondering if enabling hyperthreading would improve performance?

lapis sequoia
#

it's better to run it on the cloud :v

languid adder
#

$$$ 😛

violet bison
#

Hello chaps! I'd like a few opinions on something if you can: I have rather heavy CSVs (from 200MB to above 2GB) that come from a cloud provider and contain the billing information. How would you go about storing this information and extracting data from it? Put it into a postgres DB and run custom python code alongside? Send it to an ELK stack? Something else? The idea would be to find the biggest cost centers and see what I can do with that.

chilly shuttle
#

how many of those CSVs do you have that you need to coalesce?

#

if they total more than your memory, your options are spark, dask, or some out of memory dbms

#

as far as out of memory dbms go, I've been in love with Clickhouse recently so i'll shill that

worldly sigil
#

@languid adder have you considered using Dask added onto your Scikit-Learn code? You seem like you're digging for better performance, and Dask (or Pyspark) are a pretty fantastic blitz of performance.

languid adder
#

I haven't. Working my way through Hands on machine learning with scikit-learn and TF at the moment

#

will have a look at it afterwards, thanks for the tip!

lapis sequoia
#

@violet bison bigquery

#

why do people complicate things so much.. use something that doesn't have you running around trying to find more resources everytime.. focus on the objective

violet bison
#

thank you chaps, I'm in the EU and personal google cloud accounts cannot be done, also, I have 20 GB of ram, so it's cool that all the CSVs should fit into it, but thanks for the heads up

#

ooh snap GCP private accounts works now...

desert oar
#

nothing wrong with postgres

#

like why even bother loading 20 CSVs into memory if you just wanna do some queries and aggregates

#

spark? why

#

apt-get install postgresql, youre up and running in 10 minutes
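#

A sketch of the "load it into a database and run aggregates" idea. It uses Python's built-in sqlite3 purely as a stand-in for postgres so it runs without a server, and the column names and rows are invented:

```python
import csv
import io
import sqlite3

# Hypothetical billing export; real files would be opened with open(path, newline="")
csv_text = """service,cost_center,cost
compute,team-a,120.50
storage,team-a,30.00
compute,team-b,300.25
network,team-b,12.75
"""

conn = sqlite3.connect(":memory:")      # use a file path to keep the data on disk
conn.execute("CREATE TABLE billing (service TEXT, cost_center TEXT, cost REAL)")

reader = csv.DictReader(io.StringIO(csv_text))
rows = ((r["service"], r["cost_center"], float(r["cost"])) for r in reader)
conn.executemany("INSERT INTO billing VALUES (?, ?, ?)", rows)

# Biggest cost centers first
query = """
    SELECT cost_center, SUM(cost) AS total
    FROM billing
    GROUP BY cost_center
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

The rows are streamed in one at a time, so the whole CSV never has to sit in memory, and the same GROUP BY query would carry over to postgres.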

#

@violet bison

#

grab a 1bn row sample and analyze it in R using data.table

#

can do that on a laptop

#

source: i have done it on a laptop

violet bison
#

I don't know R that's the thing, but I just wanted a few graphs and a tool capable of ingesting csv

carmine lava
#

Any one using Intel movidius

languid adder
#

is R so much better than python for DS and ML?

desert oar
#

For machine learning specifically, python is better because it's more of a general purpose programming language

#

For data science in general, R is a great tool to have at your disposal

#

Also machine learning libraries tend to be written with python in mind nowadays

#

R is better for many kinds of statistical analysis

#

"Better" being my subjective opinion

#

The R data.table library is ridiculously efficient at large in memory data processing

#

I haven't tried a comparable task with pandas, but im not sure i trust it for that purpose

languid adder
#

ok so that makes sense 😃
so for someone starting out in the ML/DS landscape, it is worth having both tools at their disposal?

serene veldt
#

sorry to bump in

#

anyone has any idea how to make an efficient RMSE in torch?

#

there are some versions but they're for loss functions

#

im just trying to get the RMSE from two tensors

desert oar
#

@languid adder learn one, then see about picking up the other once youre comfortable with one

#

@serene veldt you wanna use RMSE as an input to a layer?

serene veldt
#

nop, its not for networks

#

i have 2 tensors

#

1 expected values 1 results

#

wants the rmse of that

#

there will be no propagation whatsoever of that value

#

@desert oar

lapis sequoia
#

@languid adder R is for people in finance.. it's better for small data.. and time series

#

it sucks for network analysis in particular

#

plus.. R studio costs a few hundred dollars for licensing

lyric canopy
#

I think R is used the most in academic statistics

#

It has many more packages available for that than Python has

#

More and more are being published in Python and there's a trend to also publish one for Python these days

chilly shuttle
#

R is for people who can't code

#

R is less and less in favour outside academia because it's a lot of work (sometimes rewrite in python) to productionise

languid adder
#

thx for the info guys. I ordered "An introduction to statistical learning" which uses R to show the examples. I wasn't planning on learning R and my intention was to code the examples in python instead so I'll stick with that plan

#

FYI I'm working for a major software company and our team needs to focus on DS/ML more so that's why I'm retraining myself...

lyric canopy
#

I disagree with the notion of @chilly shuttle I feel it's too much gatekeeping "programming"

#

I don't care for such exclusive sentiments

#

R is fine as a tool and a lot of people use it

#

It's just not the right tool for every job

chilly shuttle
#

eh, I stated the business reality

#

if you're in academia R is fine

#

as for commercial settings, I increasingly shy away from taking on data scientists that don't know anything except R

lyric canopy
#

R is widely used here in commercial settings as well, so I don't think it's a business reality

chilly shuttle
#

how do you productionise your R code?

lyric canopy
#

It's one of the main requirements on job advertisements for professional statistics here.

#

That's usually not the main goal of R

chilly shuttle
#

correct.

#

to productionise R code often means to re-implement it in python

lyric canopy
#

As I said, it's not the right tool for every job

chilly shuttle
#

why take on that burden in a commercial setting when you can take on data scientists who can do both in one shot?

lyric canopy
#

Maybe I don't have a software development application in mind, but that doesn't mean that "R is for people who can't program"

#

Because not every project is a software development project

#

I know it's easy to only see the big data and machine learning aspect of it, but there's a whole world of people whose job it is to research rather than develop

#

For that, R is a very nice tool and it offers much more tools than Python does

polar acorn
#

Also Rstudio has a free version that works fine

lyric canopy
#

A lot of cutting-edge statistical models have no implementation in Python at the moment

chilly shuttle
#

like what

lyric canopy
#

New developments in regression trees, several multidimensional scaling techniques, and most of the other models currently being developed in statistical science (as opposed to machine learning), as a lot of those researchers publish their packages in R. Some have started to switch to Python, but there's still a distinction between smaller datasets (as used in statistical learning) and larger datasets (more of a machine learning perspective)

chilly shuttle
#

so.. all academia?

#

we are pretty much in agreement then

lyric canopy
#

Now you purposefully misrepresent what I said.

#

That those models are developed in academia doesn't mean that they're exclusively used in it

#

My research department does a lot of consultancy work for external organizations that use those models

chilly shuttle
#

i'm seriously curious, can you provide one anecdote?

lyric canopy
#

Sure

chilly shuttle
#

we've had to do non-bread and butter stuff maybe twice last year for external clients, and we're one of the biggest consultancies that exist. But I can imagine more specialized consultancies picking up the more exotic stuff

lyric canopy
#

One of our students recently started an internship with the Dutch Ministry of Social Affairs and Employment to use those statistical learning models to predict which companies should be inspected for "employing" illegal immigrants/victims of human trafficking.

#

All of that is done in R

chilly shuttle
#

and they couldn't address it with ML because?

lyric canopy
#

It's basically the scale of the data you have; the datasets are smaller than what you'd normally use for ML

#

The distinction is a bit vague, though

chilly shuttle
#

i should rephrase that as, 'they couldn't use existing techniques because?'

lyric canopy
#

Because techniques evolve constantly. It's basically still in its infancy

chilly shuttle
#

but that sounds like academia

lyric canopy
#

The development of the techniques is, but the application of the techniques that stem from it is not

chilly shuttle
#

and I'd be willing to take a wager that this student is a stats/similar academic who isn't CS proficient at programming?

lyric canopy
#

The student isn't, but the people who currently run the project are (and who are currently already using R for it)

chilly shuttle
#

I see

lyric canopy
#

Now, honestly, I would really like to see the use of Python spread as that would eliminate the delay in the spread of new techniques

#

Because, as you said, R is academia focussed

#

But, it's not just for people who can't program

chilly shuttle
#

anyway, my observations from fairly high up at a big4 consultancy:
we hire less and less data scientists who can't write code (i.e. R only)

r&d type work that can be done by pure stats folk is tiny in volume compared to commodity data platform deployments and the like

data science is increasingly becoming commoditised where creating a complex model is now a couple of clicks in a GUI. What isn't point and click at this point is data engineering

lyric canopy
#

Now, another point is whether the use of R in academia is a good thing

#

I'd rather that we all switch to Python

chilly shuttle
#

that is happening at the university i'm affiliated with (as part of a broader theme to add software engineering concepts to pretty much every discipline). Not sure how widespread that movement is though

lyric canopy
#

The stuff we do in R isn't less complex than what you'd have to write in Python, but R is not a very nice language (IMO) and Python is a general-purpose programming language.

#

The biggest issue is that there's a lot that's only implemented in R at the moment

#

The newest techniques for Multiple Imputation have very nice implementations in R (e.g., MICE is one of those packages), but support in Python is incomplete

chilly shuttle
#

yeah I'm not disagreeing with that

#

there's a clear trend of <thing> gets invented, reference implementation in R, stable python implementation 1-2 years down the track

#

it's just not something with a huge job market

lyric canopy
#

But, as you say, I'm a lot more focussed on applying research than to software development

#

Yeah, probably

#

Still, we have about 20 master students a year and they all receive multiple job offers before they even graduate

#

So, I guess the supply part is also relatively small

chilly shuttle
#

and some of them will be picked up by employers like mine just to pad out the credentials on proposals

#

and never do a day of enjoyable work in their life

#

i die a little every time i see a PhD stats person being put on a project to do fucking.. excel mangling

lyric canopy
#

I guess here most students are gobbled up after they finish the master of science and before a PhD

#

I can probably dig up a list with the kind of projects I'm talking about if you like

#

But I need to do some work first

#

(and I have to check which projects are confidential first)

chilly shuttle
#

its ok they sound like the kind of r&d projects we do once or twice a year

#

last one was modelling road network utilisation after a transition to self-driving cars, back when people still thought self-driving cars are about to take over the world

lyric canopy
#

Most of the stuff we do is smaller scale and there's a lot of (semi-) public sector work (if that term makes sense in English)

#

It's not really related to what I do, though, so I have to look up which projects we have at the moment

lapis sequoia
#

R studio does have a free version.. but if you work at a company, you still have to fork over $ for a license..

#

R was nice.. manageable.. not for large data definitely.. I had to drop it because my python skills are very domain specific and being the dumbass that I am I didn't want to confuse syntax and lose progress v.v

chilly shuttle
#

you also have to fork over $ to monitor compliance for said licensing

lyric canopy
#

You can use the open-source RStudio (license AGPL v3) for commercial purposes without the commercial license.

chilly shuttle
#

soooort of

#

you can use it as an individual or a small company for free

#

as a large company, the risk associated with taking on foss for critical work far outweighs the cost of paying the commercial licensing

#

the main reason foss is shied away from in large enterprise is because when something goes wrong, you want to have a support contract with the vendor so you can blame/sue them

lapis sequoia
#

Oui

#

is why people use R in other environments

slate rock
#

R was my first and only programming language, so when I think about writing a script my mind goes to R. Is there any book, video, tutorial, whatever that I can use to grasp the Python's way?

lapis sequoia
#

I want this book.. but it's not released yet.. these guys seem to be taking forever

chilly shuttle
#

R is a programming language in only the loosest meaning of the term. Investing the effort to learn python or lua or c pays huge dividends because every subsequent programming language is quite familiar

languid adder
#

i found a machine learning course on coursera but it used python 2 and graphlab... is that still worth following? Considering the examples aren't really focussed on python 3 and scikit learn or tensorflow?

small ore
#

What are your objectives? Learn what is behind ML or to learn the packages?

languid adder
#

both actually... If a course is using outdated packages (no commits in over 2 years) then I assume the course might be outdated as well?

small ore
#

The one by Andrew Ng on Coursera uses Matlab/Octave. It is just about what is behind ML, but uses Matlab/Octave for the exercises and assignments. These are easier than Python in some ways, and enough to the point that you only need to know what goes in and what comes out and infer from it. A Columbia University course on edX takes a more mathematical approach. It does not use any tool. See the pins for a link. There is another edX course (I think from Microsoft) which teaches you to do ML using Python (it uses JupyterLab on Azure). It uses basic libs like pandas, matplotlib and seaborn

#

And scikit-learn

languid adder
#

yeah I noticed the course from Andrew Ng (co-founder of Coursera) but was looking into a python course so I could combine the theory + practical knowledge at the same time

small ore
#

I suggest taking two different courses for these. The ones that teach the packages don't cover what goes on behind the scenes or why you are doing what

#

Or at least does not do it well

languid adder
#

so which one do you suggest for the theory?

#

Andrew Ng covers more neural networks (deep learning) and doesn't seem to cover things like linear regression, classification and so on

small ore
#

Andrew Ng starts with linear regression and classification and then goes on to neural networks( Which is again a way of solving the regression/classification problem)

languid adder
#

ah then I must be looking at a different course

small ore
#

If you want a more involved mathematical approach to get the intuitions behind each method, the course from Columbia is good. I suggest going with that course if you can manage to do some reading on the side from a standard text.

languid adder
#

link to the course? Can't seem to find it

small ore
#

Which one?

languid adder
#

the one from columbia

small ore
#

It is in the pins 📌

#

Top right of your screen where you mute

small ore
#

afaik this aint the place to ask abt regex

willow siren
#

Which channel?

small ore
#

One of the help channels please

willow siren
#

Alright, thank you

lapis sequoia
#

Any A.I or ML server on discord??

languid adder
lapis sequoia
#

I dunno

#

always found slack annoying

languid adder
#

i don't find much differences between discord and slack... I mainly use discord for hobby specific things and slack for professional things 😛

storm monolith
#

Hello, I have this numpy array :np_array = np.array([1, 2, 3]) I was wondering what is the most performance efficient way of obtaining the int 123 from it ?

charred tinsel
#

np_array[0] i think. If you want the 1

#

Im not that experienced but thats what I learned so far

placid snow
#
`int("".join(map(str, np.array([1, 2, 3]))))` perhaps?
#

You could always try the performance difference between options with the timeit module
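As a rough sketch of that `timeit` comparison, the snippet below times the string-join approach against an arithmetic one. The helper names `via_string` and `via_powers` are made up for illustration, and the arithmetic version assumes every element is a single digit and the result fits in a 64-bit integer.

```python
import timeit

import numpy as np

digits = np.array([1, 2, 3])


def via_string(arr):
    # Convert each digit to text, concatenate, and parse the result.
    return int("".join(map(str, arr)))


def via_powers(arr):
    # Weight each digit by its power of ten: [100, 10, 1] for three digits.
    powers = 10 ** np.arange(len(arr) - 1, -1, -1)
    return int(arr @ powers)


for func in (via_string, via_powers):
    seconds = timeit.timeit(lambda: func(digits), number=10_000)
    print(f"{func.__name__}: {func(digits)} in {seconds:.4f}s for 10,000 calls")
```

Which one wins is likely to depend on the array length and the machine, so it is worth running this on the actual data rather than trusting either guess.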

storm monolith
#

I am trying to create an empty pandas df and later fill in the values, but the performance isnt great - upwards of 100 ms for 35 rows over 3 columns...

#

Can you help me set up the dtypes for it ? maybe thats the problem

#

pd.DataFrame(index=range(0, 35), columns=['MMID', 'Price', 'Size'], dtype={'MMID': object, 'Price': np.float64, 'Size': np.int64})

#

I keep getting TypeError: data type not understood

#

in the MMID column i want to store strings, floats in Price and ints inside Size

half basalt
#

I believe you can only specify one sort of dtype and not a dtype for every column

chilly shuttle
#

You can totally specify a dtype per column

languid adder
#

yeah plenty of DF have different dtype per column

undone spoke
#

any #api / requests question: I know how to set a ?per_page= , but I can't figure out how to ask an api what its maximum ?per_page - anybody? Thanks!

half basalt
#

"Parameters:

dtype : dtype, default None

Data type to force. Only a single dtype is allowed. If None, infer

copy : boolean, default False

Copy data from inputs. Only affects DataFrame / 2d ndarray input

"

half basalt
#

Nevermind...what I said before that was wrong
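Both views above can be reconciled: the `dtype` argument of the `pd.DataFrame` constructor takes a single dtype (as the quoted docs say), while per-column types can be set afterwards with `astype`, which accepts a dict. A minimal sketch, assuming made-up rows for the `MMID`, `Price` and `Size` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; in practice these would be collected before building the frame.
rows = [("ARCA", 10.5, 100), ("NSDQ", 10.6, 200)]

# Build the frame in one go, then apply one dtype per column via astype().
df = pd.DataFrame(rows, columns=["MMID", "Price", "Size"]).astype(
    {"MMID": object, "Price": np.float64, "Size": np.int64}
)

print(df.dtypes)
```

Collecting the rows in a plain list first and constructing the frame once is usually much cheaper than creating an empty frame and filling it cell by cell, which may be where the 100 ms goes.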

languid adder
#

i'm trying to apply a OneHotEncoder but fail to see how I add the encoded values to my dataset

#

at the moment I have this:

ms = df.MSZoning
encoded, cats = ms.factorize()
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder(categories ="auto",sparse=False)
msOneHot = ohc.fit_transform(encoded.reshape(-1,1))
#

msOneHot now contains the columns with the binary values so I just need to add it to the dataset

#

I don't understand why the fit_transform doesn't add these automatically to my dataset?

languid adder
#

lol... just realized the get_dummies does exactly that:

df = pd.concat([df,pd.get_dummies(df.MSZoning, prefix='zoning')],axis=1)
fervent solar
desert oar
#

Did you try reading the docs..?

fervent solar
#

yes i got inplace from there but still don't understand why used subset['price']

placid snow
fervent solar
#

what is the good way to read documentation ?

#

and when to read it ?

placid snow
#

Just comes with reading many, and experimenting with the meaning of them. You read them when you dont know what something does

languid adder
#

how do I make sure I split my train and test data so my test data contains at least one of each categorical data.
At the moment I believe this isn't the case because I use OneHotEncoder and I have less features on my test data than on my training data...
I use this to split it up at the moment:

from sklearn.model_selection import train_test_split
trainSet, testSet = train_test_split(housing,test_size=0.2,random_state=42)
#

or... how can I add the missing columns with all 0 values?
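One way to add the missing columns as zeros is to reindex the encoded test frame against the encoded training frame. A minimal sketch with `pd.get_dummies`, assuming a few made-up `MSZoning` rows (the small frames here are illustrative, not the real housing data):

```python
import pandas as pd

# Made-up sample rows: the test split lacks "RM"/"FV" and has an unseen "RH".
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})
test = pd.DataFrame({"MSZoning": ["RL", "RH"]})

train_d = pd.get_dummies(train, columns=["MSZoning"], prefix="zoning", dtype=int)
test_d = pd.get_dummies(test, columns=["MSZoning"], prefix="zoning", dtype=int)

# Keep exactly the training columns: missing ones are filled with 0,
# and columns the model never saw (zoning_RH) are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_aligned)
```

The same `reindex` call can be reused on any later unlabelled file, so the model always receives the column layout it was trained on.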

fervent solar
polar acorn
languid adder
#

cool thanks. Meanwhile I build my own code to fill up the gaps by comparing the columns.

#

Always good to see other solutions as well

polar acorn
#

But why don't you add dummies before you split the data?

languid adder
#

well.. because I split my training data into training and test.
I also have a different file that I need to make predictions for and I don't have the labels for them

#

if I add the dummies before splitting, I might have the same issue when I predict on my unknown dataset

#

also, my split up test set has some values that aren't in my training so it gives a more accurate representation of the unknown set

terse pewter
#

@fervent solar I’m not entirely sure, but I think you tried to calculate a mean from your data frame but there are some values such as the one you see in the error that cannot be converted to a numerical value

#

I suggest looking at your data and cleaning out all the bad rows

languid adder
#

my RMSLE of a model is 0.00054004 which is based upon an unseen dataset where i have labels for.
my RMSLE of a different dataset all of a sudden is 1.37941
Do I need to conclude my model is rubbish or that I Might have made an error in importing/transforming the second set?
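For reference, RMSLE compares predictions and labels on a log scale, so it measures relative rather than absolute error. A minimal sketch, assuming the common `log(1 + x)` form (the function name `rmsle` and the sample prices are made up):

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Hypothetical prices: predictions off by a constant factor of two.
actual = [100_000.0, 200_000.0, 300_000.0]
predicted = [50_000.0, 100_000.0, 150_000.0]

print(rmsle(actual, predicted))
```

Read this way, an RMSLE around 1.38 loosely corresponds to predictions being off by something like a factor of four, while 0.0005 would mean near-perfect predictions, which is why the gap between the two values is a warning sign.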

polar acorn
#

Whats your training error? It might provide some pointers about which of the RMSLE values are wrong

languid adder
#

I'm doing the housing predictions from Kaggle and my predictions for the submissions are all very low (41k mean with std of 300)
I haven't looked at the errors on my training or test set but because my test set performs rather well and the submission set gives these weird results, I might be inclined to think there's a problem with applying the submission set to my model

polar acorn
#

That hypothesis could be tested if a simpler model performs similarly on test and submission set I guess.

languid adder
#

I tried changing from RandomForest to DecisionTreeRegressor and I get similar low values.
It's weird because the predictions in that model seems almost identical...
array([52373.58333333, 52373.58333333, 52373.58333333, ...,
52373.58333333, 52373.58333333, 52373.58333333])

#

I also did a test with SGDRegressor but my initial RMSE on the training set is around 3e+20 while my predictions are around 2e+19 so again way lower

#

on the DecisionTreeRegressor I get a constant for all my predictions on the Kaggle submission set while my testing set gives a very nice RMSE of around 9k

#

so this probably shows I'm applying the submission set wrong

polar acorn
#

Probably, you should make sure you clean the data in the same way. And maybe also look at the submission set to see where it's different

languid adder
#

i have one method that I apply to both my test set and kaggle set so to make sure they are transformed in the same way

polar acorn
#

Because I don't think that challenge has two test sets. The sample submission set is a sample of what your submission should look like, it only contains ID and price.

languid adder
#

yes

#

but I split up the trainset so I have my own test set to validate my model against an unseen set

polar acorn
#

👍 of course, I was a bit perplexed about where you got your second test set from.

languid adder
#

from common practice 😉

#

so that's also why I am surprised. My own test set performs rather well. 9k RMSE while the submission set doesn't work at all...

polar acorn
#

Hmm and you have looked at the size of both test sets after transforming them to check if there are any obvious differences?

languid adder
#

if I have different columns, I would get an error in my model.
The shape looks good

#

when I do a .describe() of both my trainSet and my kaggleSet after transformation, they look similar

#

the means of different columns I looked at are in similar ranges, same for std

#

i'll restart the kernel... I've seen similar issues that some things are in memory from trying and I remember spending a lot of time chasing ghosts... will try that and see.

polar acorn
#

Hmm I see that a few other submissions get a RMSLE of around 0.1 with some basic models. If your model gets you 0.00054004 on the first unseen test set. That seems very low, there might be some information leakage somewhere.

languid adder
#

ah yes you are right... So maybe the model is flawed after all. Overfitting like hell 😛

#

ow... now my predictions look better

#

now the predictions of my kaggle set is between a normal range. plenty of houses in the 100-500k mark 😄

#

will upload and see the result

#

ow darn, spoke too early 😛 was plotting the training price 😦

#

yup same issue so will look into my model.
And yes... because my RMSLE is so small on my training set while the top guys at kaggle are at 0.1 and my training RMSLE has 0.0004 I indeed need to look into my model

#

but what I still don't understand is why it performs so well against my own test set

#

I'm sure I don't use my test data for training

languid adder
#

ah think I found something... I send a non standardized dataset to my Random Forest

#

when I standardize my set and calculate the RMSLE I now get 1.81040718 which is more in line with what I see on my submission 😛

#

when I apply that new model to the kaggle set, my housing prices are still off but at least they are all in the range of 181k which is better than in 40k so I bet if I upload that, I improved a lot 😛

languid adder
#

it keeps nagging me why my own test set does perform well but as soon as I apply it to the kaggle set it goes haywire

#

for example a DecisionTreeRegressor seem to work really well on the train data + test set but when I apply it to the kaggle set, I get the same prediction for each row

#

doesn't make any sense, right?

wind osprey
#

hello anyone able to assist with some data science/probability/linear algebra questions as it relates to the L2 Norm?

languid adder
#

alright, scored a 0.37 on kaggle now... looks a bit better

#

top 4000 😛

polar acorn
#

👏 Nice!

small ore
#

@wind osprey

#

!t ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving
• Keep your patience while we're helping you.

You can find a much more detailed explanation on our website.

fervent solar
#

df.dropna(axis=0,inplace=True,subset=['price']) this code is not cleaning empty values @terse pewter

desert oar
#

i really dont recommend using inplace

#

it might actually be deprecated now, i cant remember
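One possible reason `dropna` appears to do nothing is that the blanks are not real `NaN` values (for instance empty strings or placeholder text), so pandas does not treat them as missing. A minimal sketch, assuming a made-up `price` column that uses `"?"` and `""` as placeholders:

```python
import pandas as pd

# Hypothetical data: "?" and "" look empty but are ordinary strings to pandas.
df = pd.DataFrame(
    {"make": ["a", "b", "c", "d"], "price": ["13495", "?", "", "16500"]}
)

# Turn anything that is not a number into NaN ...
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# ... then drop those rows, assigning the result instead of using inplace.
df = df.dropna(subset=["price"])

print(df)
```

Checking `df["price"].isna().sum()` before and after the conversion is a quick way to confirm whether this is what is happening in the real data.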

potent phoenix
#

If I want to classify a CSGO match stream at any given frame as either showing the game or not showing the game (showing commentary, commercial, et cetera), should I just get a random collection of frames and label them as "game" and "non-game"?

#

Then use a decision tree classifier to create a model?

river plume
#

which website/ course do you guys recommend to someone who wants to learn numpy ?

lapis sequoia
#

@river plume datacamp

#

@potent phoenix what would be non-game?.. you need to define the problem better.. if there's very little difference in the encoded images, it won't make a difference.. and you'll get false classifications every time..

river plume
#

@lapis sequoia alright I'll check it out

#

Thanks

lapis sequoia
#

np

languid adder
#

can someone recommend a book that covers data analysis with regards to ML so you have a better understanding of how to interpret and manipulate data so it's optimized for your model?

small ore
#

The usually recommended ESL and PRML wont serve your purpose?

languid adder
#

isn't ESL a math heavy book? I was more looking at a text book

small ore
#

Both are quite math heavy.

languid adder
#

i already have Introduction to Statistical learning which is from the same authors as ESL

small ore
#

I would have read that if it were python instead of R. Now that you have pointed a book that looks promising from what it says on the back cover, I would like to take a look at its contents. I donno how to access that from amazon

#

By contents I mean the contents page. Not the entire text contents

languid adder
#

if you google the book, you find a pdf. Not sure how legal that is... 😦

#

ah 😛

#

you should get some more info on the book website.

small ore
#

Oh. I meant the other book. Think Stats

languid adder
#

ah ok.

#

here you'll find the ToC

#

the more I dabble into DS/ML the more I realize how little I know and how much I have still to learn 😃

small ore
#

That book seems to cover a lot more topics than ESL and PRML. And they are all surely highly mathematical topics (although it might be taught with a different approach, as the back cover claims). Overwhelming number of topics there

languid adder
#

i do notice from the learning i've done so far that it's a good approach to get some high level overview of topics so you know what's out there and if you need further information/clarification, the more detailed books will guide you

small ore
#

That ToC does not look like high level overview to me. But it might be different to you

languid adder
#

i'll let you know in a few days 😃 Book should arrive tomorrow

small ore
#

I would certainly be eager to know that 😃 . Thank you

languid adder
small ore
#

There seems to be another book by the same name by Allen B. Downey

languid adder
#

he has a couple of books with o'reilly

#

and think stats has a 2nd edition if that's what you mean

#

I ordered the second edition

small ore
#

Oh wait. I misread. Sorry. I tried to discern from the german amazon site. I thought Taschenbuch was the author. Whatever that means

languid adder
#

yeah I'm not German but i have to use the german amazon as they deliver to Belgium

#

i read a review about that book that says it's an intro to many of the statistical concepts and the code is to explain the concept using code vs math

terse pewter
#

So I'm actually doing an R program, but my question is for statistics related

#

I did a KNN application on a dataset and the min-max normalization is rendering better results than a z-score standardization

#

why is this the case?

#

My prof had said that generally z-score standardization is the better way to go, but based on some various K values tested on both normalization methods, the min-max rendered the best results
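For reference, the two rescalings being compared can be written out as below. This is only a sketch with made-up values (including one deliberate outlier); it does not settle which is better for a given dataset, since KNN is distance-based and the outcome depends on how each feature is distributed.

```python
import numpy as np

# Made-up feature values with one large outlier.
feature = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max normalization: squeezes everything into [0, 1].
min_max = (feature - feature.min()) / (feature.max() - feature.min())

# Z-score standardization: zero mean, unit standard deviation, unbounded range.
z_score = (feature - feature.mean()) / feature.std()

print("min-max:", np.round(min_max, 3))
print("z-score:", np.round(z_score, 3))
```

Printing both side by side for each real feature shows how differently they weight the columns in the distance calculation, which is often where the difference in KNN results comes from.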

languid adder
#

is it common practice to do data cleanup and feature engineering on a combined set of the train and test data even though you don't have the labels for the test set?

#

i've seen it in several examples where people combine both sets before changing NA values to something else or before they call get_dummies

lapis sequoia
#

yeah usually people drop NAs.. but it depends.. on the type of variables..

#

some will fill with the median or mode

languid adder
#

yeah that I know. My question more about the practice of combining both the training and (unlabelled) test set before you do these operations

polar acorn
#

Depends I think, if it's just for analysis of that dataset or I'm completely sure the test data contain no surprises then I might do that. But if it's for training a model that I plan to use many times on different data, then I'd make a proper data wrangling pipeline and use it on both the training and testing set to see that it works. I don't know what everybody else does though, but for me it depends.

storm monolith
#
df = pd.read_csv(r'‪C:\Users\damia\Desktop\500.csv')
print(df)```
CSV contains : ```
Symbol
NCI
BGH
VYP``` I get this error : `UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character`
late garnet
storm monolith
#

@late garnet Didnt help , i specified utf-8. Notepad++ says its UTF-8

polar acorn
#

You could try, utf_8_sig

desert cradle
#

where did you specify utf-8 and what error (if different) did you get when you specified it

fervent solar
#

its not dropping values

#

i printed the price before dropna and after dropna still not working

ivory galleon
#

Any advice on how to debug an assertion error, from statsmodels.formula.api.ols, assert pytype not in (tokenize.NL, tokenize.NEWLINE)?

#

I'm not entirely sure what sort of error I'm even dealing with. I've verified that my column names are right, but I'm not sure what else could be going on.

#

The problem is somewhere in patsy, but I've poked at Patsy a bit and I'm not seeing an obvious solution.

#

Same problem arises when I type in a manual formula ("Employees ~ Time") and when I use ModelDesc, and patsy worked on a slightly different dataframe earlier in my notebook.

small pumice
#

I'm building a neural network using Keras. I have two variables for my input. The training data is split into two Numpy arrays, one with all the data for the first variable, and one with all the data for the second variable. The output data is also a Numpy array. How should I format the data to give to the network?
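A common layout for this situation is one 2-D array with a row per sample and a column per input variable. A minimal NumPy sketch, assuming made-up arrays `var_a`, `var_b` and `targets`, and a network whose first layer expects 2 features:

```python
import numpy as np

# Hypothetical training data: one array per input variable plus the outputs.
var_a = np.array([0.1, 0.2, 0.3, 0.4])
var_b = np.array([10.0, 20.0, 30.0, 40.0])
targets = np.array([1.0, 2.0, 3.0, 4.0])

# One row per sample, one column per variable: shape (n_samples, 2).
X = np.column_stack([var_a, var_b])

# Outputs as a column vector: shape (n_samples, 1).
y = targets.reshape(-1, 1)

print(X.shape, y.shape)
```

`X` and `y` can then be passed together to the model's training call, with the input shape of the first layer set to match the two columns.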

storm monolith
#

@desert cradle df = pd.read_csv(r'‪C:\Users\damia\Desktop\500.csv',encoding='utf-8') Same error, want full paste ?

#

File "pandas\_libs\parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__ File "pandas\_libs\parsers.pyx", line 686, in pandas._libs.parsers.TextReader._setup_parser_source UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1: invalid character

desert cradle
#

ok, your actual problem

#

there is an invisible character in your filename

#

that's what i got when i pasted the string from the line you just typed into my prompt @storm monolith

#

delete and retype the string constant

#

thought it was weird that it was an encode error
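Invisible characters like this can be made visible by checking each character's Unicode category. A minimal sketch using only the standard library; the helper names are made up, and the sample path assumes the stray character is U+202A (a directional mark that sometimes comes along when a path is copied from a Windows properties dialog):

```python
import unicodedata

INVISIBLE_CATEGORIES = ("Cf", "Cc")  # format and control characters


def find_invisible(text):
    """Return (index, code point, name) for each invisible character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES
    ]


def strip_invisible(text):
    """Remove format and control characters from a string."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in INVISIBLE_CATEGORIES
    )


# Assumed example: the path from the question with U+202A in front.
path = "\u202aC:\\Users\\damia\\Desktop\\500.csv"

print(find_invisible(path))
print(strip_invisible(path))
```

Printing `repr(path)` is an even quicker check, since it shows such characters as escape sequences.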

storm monolith
#

lol, never would've noticed

#

thx

ivory galleon
#

I'm still stuck, trying to figure out what the tokenization is even doing that's making it go wrong.

#

I just wanted to run an ols on a dataframe!

languid adder
#

question regarding OneHotEncoder. If I have a column that can have 4 different values so with OneHotEncoder I transform that into 4 new columns.
If I calculate the correlation of those 4 new columns vs my label and notice only 2 of those columns have a decent correlation value, is it good to remove the other 2 or is that bad considering their origin?

#

the same question if I use a PolynomialFeatures transform? this can increase the number of features exponentially, so dropping the additional columns that don't have a good correlation should help the model, right?

chilly shuttle
#

if you're only looking for linear relationships, sure

#

but corr won't tell you much about nonlinear or stateful relationships, which can be learned by a range of ML techniques

languid adder
#

ah yes, if there is higher relation the correlation won't point it out and I might drop those
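A small illustration of that point: in the made-up data below the label is completely determined by the feature, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Made-up feature, symmetric around zero.
x = np.linspace(-1.0, 1.0, 201)

# The label depends on x exactly, but not linearly.
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.6f}")
```

Dropping `x` on the basis of this number would throw away a feature that a tree-based model or a neural network could use perfectly well.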

chilly shuttle
#

i observed there's a bit of a miss in current ML curriculum around feature selection

#

most of the feature selection techniques presented rely on linear relations and are worthless for stuff like decision trees or ann

languid adder
#

yeah most of the stuff you find talks about correlation and if you're lucky they mention it doesn't apply to nonlinear relationships

chilly shuttle
#

pca is a huge offender

#

it's still taught as one of or THE primary feature selection technique, but it's actually useless for popular ml techniques

ivory galleon
#

OK, I've confirmed that
'response_terms = [patsy.Term([patsy.LookupFactor("Employees")])]
model_terms = [patsy.Term([])]
model_terms += [patsy.Term([patsy.LookupFactor("Time")])]'
works and
'ols(formula="Employees ~ Time",data=longer_df).fit().summary()'
doesn't.

old axle
#

whats the command to install numpy and pandas?

ivory galleon
#

@old axle, do you have pip?

old axle
#

yep

#

and git

#

is the package called numpy itself

#

numpy/pandas

ivory galleon
#

pip install pandas and pip install numpy should do the trick.

old axle
#

i thought some had different names

#

okay thanks

ivory galleon
#

Some do, but it's always worth checking the defaults.

old axle
#

ok

ivory galleon
#

(My problem has been fixed! For the record, I hadn't updated a library that broke on the 3.6->3.7 upgrade.)

languid adder
#

Say you have a dataset of around 1500 records and one of the categorical features has 2 values: one value has 1400 occurrences and the other around 100.
I notice that those 100 records have a capped value for my label while if a record has the other value, the label can go much higher (over double the value of the other category)
Given the large difference in frequency between these categories, I'm not sure it would be wise to draw this conclusion and include it in my model

languid adder
#

if you have following distribution of categories:

#

do I need to disregard that feature? Knowing that one category takes around 90% of the dataset so when it comes to probability... it has more opportunity to generate more outliers or wider distributions

carmine lava
#

Best course to start deeplearning 🤔

#

Any one

polar acorn
#

@carmine lava check pinned messages.

carmine lava
#

@polar acorn link

polar acorn
#

Pinned messages in this channel, top right above the chat window.

lime lava
#

Hi, Pandas question: I have two tables, one with a yearly value for the last 10 years and another with several price entries by year. I want to create a new column that divides the second table's price by the first table's value if the years match
is there a simple way to do this?

solar oracle
#

So you have yearly data and let's say monthly data for the same thing?

lime lava
#

actually

#

I have a yearly index in one table

#

and a lot of transactions for each year in another

#

first table is like 10 rows

#

second one is like 300k
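One way to do this is to merge the small yearly table onto the large transaction table and divide the two columns. A minimal sketch, assuming hypothetical column names `year`, `index_value` and `price` and a few made-up rows:

```python
import pandas as pd

# Hypothetical small table: one index value per year.
index_df = pd.DataFrame({"year": [2017, 2018], "index_value": [100.0, 110.0]})

# Hypothetical large table: many transactions per year.
tx = pd.DataFrame(
    {"year": [2017, 2017, 2018, 2019], "price": [50.0, 200.0, 55.0, 10.0]}
)

# A left merge keeps every transaction; years without an index value get NaN.
merged = tx.merge(index_df, on="year", how="left")
merged["price_adjusted"] = merged["price"] / merged["index_value"]

print(merged)
```

With only about 10 rows in the yearly table, mapping the year through a Series (`tx["year"].map(index_df.set_index("year")["index_value"])`) is an equivalent alternative that avoids creating the extra column.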

river plume
#

i have a dataframe column with such values: "Eraser (5)"; where the round brackets contain the price of object in int

#

How can i add another df column after extracting the price and storing it in the new column

#

Like df.Object should be "Eraser" and df.Price should be 5

desert cradle
#
>>> df = pd.DataFrame({'TheColumn': ['Eraser (5)']})
>>> r = re.compile(r'^(.*?)\s*\((\d+)\)$')
>>> ms = df['TheColumn'].apply(r.match)
>>> df['Object'] = ms.apply(lambda x: x.group(1))
>>> df['Price'] = ms.apply(lambda x: int(x.group(2)))
>>> df
    TheColumn  Object  Price
0  Eraser (5)  Eraser      5``` @river plume
river plume
#

thanks @desert cradle

river plume
#

df['Price'] = df['TheColumn'].str.extract(r'.*\((.*)\).*', expand=True)

#

This worked too @desert cradle
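The two approaches above can be combined into one vectorized `str.extract` call with named groups, which yields both columns at once. A minimal sketch, assuming a made-up second row and reusing the `TheColumn`, `Object` and `Price` names from the example:

```python
import pandas as pd

df = pd.DataFrame({"TheColumn": ["Eraser (5)", "Pencil case (12)"]})

# Named groups become column names in the extracted frame.
parts = df["TheColumn"].str.extract(r"^(?P<Object>.*?)\s*\((?P<Price>\d+)\)$")

df["Object"] = parts["Object"]
df["Price"] = parts["Price"].astype(int)  # extract() returns strings

print(df)
```

Rows that do not match the pattern come back as `NaN`, in which case the `astype(int)` step would fail, so a real dataset may need a `dropna` or `pd.to_numeric(..., errors="coerce")` first.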